# Extracting Data from Companies House Electronic Records

Companies House receives 75% of its records in XBRL or iXBRL format: tagged XML (or, for iXBRL, HTML with embedded XBRL tags) that should allow straightforward automated extraction of statistics.

The software in this repo was developed after reading this (American) example:
https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python

The extraction functions live in the module xbrl_parser.py.

Both xbrl_parser.py and this script depend on a number of Python packages, so expect to install a few things first.
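A quick way to see what still needs installing is to ask Python which packages it can find. This is a minimal sketch: `missing_packages` is a hypothetical helper, and the package list is an assumption (numpy and pandas are imported in the setup cell, and bs4 is the import name of BeautifulSoup, which the parser is described as using further down).

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed dependency list; adjust it to match the imports in xbrl_parser.py
missing = missing_packages(["numpy", "pandas", "bs4"])
if missing:
    print("Install before continuing:", ", ".join(missing))
```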


## Returned dict schema for html/xml sourced data

A practical note: apart from explicitly elevated metadata (top-level fields such as doc_name and doc_balancesheetdate), all extracted values are stored in a list of "elements" within the returned dict. Each element is itself a dict containing the name and value of the discovered data, along with the metadata fields unit and date.
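To make that shape concrete, here is a hand-written dict in the same layout, using a few values copied from the first example document shown later. It is an illustration of the schema only, not output generated by the parser, and `example_doc` is a name chosen for this sketch.

```python
# Illustrative only: a cut-down dict in the shape described above
example_doc = {
    "doc_name": "Prod224_0042_00958610_20160930.xml",
    "doc_type": "xml",
    "doc_balancesheetdate": "2016-09-30",
    "elements": [
        {"name": "companynotdormant", "value": "true",
         "unit": "NA", "date": "2016-09-30"},
        {"name": "profitlossaccountreserve", "value": 83402.0,
         "unit": "GBP", "date": "2016-09-30"},
    ],
}

# Elevated metadata is read straight off the top level...
balance_sheet_date = example_doc["doc_balancesheetdate"]

# ...while everything else is found by walking the "elements" list
numeric = [e for e in example_doc["elements"] if isinstance(e["value"], float)]
```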

# Setup (import modules, set up a helper function for getting filepaths)

In [1]:
import xbrl_parser as xp
import os
import numpy as np
import pandas as pd
import importlib

def get_filepaths(directory):

    """ Helper function -
    Get the paths of all files in a directory whose
    names end in .htm, .html or .xml,
    under the assumption that all such files within
    the folder are financial records. """

    # Match on the file extension only, so names that merely
    # contain "htm" or "xml" elsewhere are not picked up
    files = [os.path.join(directory, filename)
                for filename in os.listdir(directory)
                    if filename.lower().endswith((".htm", ".html", ".xml"))]
    return files

# Extracting data from documents

With the module imported, we can process some files.


In [2]:
# Get all the filenames from the example folder
files = get_filepaths("./example_data_XBRL_iXBRL")

# There are 379 examples currently
files[0:7]

['./example_data_XBRL_iXBRL/Prod224_0042_00958610_20160930.xml',
 './example_data_XBRL_iXBRL/Prod223_2125_09749826_20170831.html',
 './example_data_XBRL_iXBRL/Prod223_2125_09170142_20170831.html',
 './example_data_XBRL_iXBRL/Prod224_0042_03237381_20160831.xml',
 './example_data_XBRL_iXBRL/Prod223_2125_09900460_20161231.html',
 './example_data_XBRL_iXBRL/Prod223_2125_09652609_20180331.html',
 './example_data_XBRL_iXBRL/Prod223_2125_09722743_20170831.html']
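The filenames themselves appear to carry the company number and the balance sheet date (compare them with the doc_companieshouseregisterednumber and doc_balancesheetdate fields in the parsed output). The sketch below pulls those two parts out of a path. The pattern is inferred from these example names only, and `parse_account_filename` is a hypothetical helper, not part of xbrl_parser.py.

```python
import os
import re
from datetime import date

# Assumed pattern, inferred from the example filenames only:
# <prefix>_<batch>_<8-character company number>_<YYYYMMDD>.<extension>
FILENAME_PATTERN = re.compile(
    r"^(?P<prefix>[^_]+)_(?P<batch>[^_]+)_(?P<number>[^_]{8})_(?P<ymd>\d{8})\.\w+$"
)

def parse_account_filename(path):
    """Split an accounts filename into company number and date, or return None."""
    match = FILENAME_PATTERN.match(os.path.basename(path))
    if match is None:
        return None
    ymd = match.group("ymd")
    return {
        # Kept as a string so leading zeros survive
        "company_number": match.group("number"),
        "balance_sheet_date": date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:])).isoformat(),
    }

parsed = parse_account_filename(
    "./example_data_XBRL_iXBRL/Prod223_2125_09749826_20170831.html"
)
```

This can be handy for a sanity check against the parsed document, or for skipping files before the slower parsing step.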

In [12]:
# Reload the xbrl_parser module (don't need this normally, it's just useful for me
# for iterative testing of changes)
importlib.reload(xp)

# try getting the first file (an XML, or XBRL, file)
doc = xp.process_account(files[0])

# View the content
doc

./example_data_XBRL_iXBRL/Prod224_0042_00958610_20160930.xml


{'doc_name': 'Prod224_0042_00958610_20160930.xml',
 'doc_type': 'xml',
 'doc_upload_date': '2018-11-14 12:02:17.304881',
 'arc_name': 'example_data_XBRL_iXBRL',
 'parsed': True,
 'doc_balancesheetdate': '2016-09-30',
 'doc_companieshouseregisterednumber': '00958610',
 'doc_standard_type': 'uk-gaap-ae',
 'doc_standard_date': '2009-06-21',
 'doc_standard_link': 'http://www.companieshouse.gov.uk/ef/xbrl/uk/fr/gaap/ae/2009-06-21/uk-gaap-ae-2009-06-21.xsd',
 'elements': [{'name': 'companynotdormant',
   'value': 'true',
   'unit': 'NA',
   'date': '2016-09-30'},
  {'name': 'entitycurrentlegalname',
   'value': 'S.L.M. (Model) Engineers Limited',
   'unit': 'NA',
   'date': '2016-09-30'},
  {'name': 'companieshouseregisterednumber',
   'value': '00958610',
   'unit': 'NA',
   'date': '2016-09-30'},
  {'name': 'balancesheetdate',
   'value': '2016-09-30',
   'unit': 'NA',
   'date': '2016-09-30'},
  {'name': 'profitlossaccountreserve',
   'value': 83402.0,
   'unit': 'GBP',
   'date': '2016-0

In [4]:
# try getting the second file (an HTML, or iXBRL, file)
doc2 = xp.process_account(files[1])

# View the content
doc2

./example_data_XBRL_iXBRL/Prod223_2125_09749826_20170831.html


{'doc_name': 'Prod223_2125_09749826_20170831.html',
 'doc_type': 'html',
 'doc_upload_date': '2018-11-14 11:59:32.265198',
 'arc_name': 'example_data_XBRL_iXBRL',
 'parsed': True,
 'doc_balancesheetdate': '2017-08-31',
 'doc_companieshouseregisterednumber': '09749826',
 'doc_standard_type': 'FRS-102',
 'doc_standard_date': '2014-09-01',
 'doc_standard_link': 'https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd',
 'elements': [{'name': 'nameproductionsoftware',
   'value': 'Caseware UK (AP4) 2016.0.181',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'versionproductionsoftware',
   'value': ' 2016.0.181',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'enddateforperiodcoveredbyreport',
   'value': '2017-08-31',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'balancesheetdate',
   'value': '2017-08-31',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'accountstypefullorabbreviated',
   'value': 'FullAccounts',
   'unit': 'NA',
   'date': '2017-08-

# Retrieve elements

In [5]:
# Loop through the document, retrieving any element with a matching name
for element in doc['elements']:
    if element['name'] == 'netassetsliabilitiesincludingpensionassetliability':
        print(element)

{'name': 'netassetsliabilitiesincludingpensionassetliability', 'value': 88402.0, 'unit': 'GBP', 'date': '2016-09-30'}
{'name': 'netassetsliabilitiesincludingpensionassetliability', 'value': 81151.0, 'unit': 'GBP', 'date': '2015-09-30'}
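The same lookup can be wrapped in a small function so that it returns the matches rather than printing them. This is a sketch on a toy document built from the values above; `find_elements` is a hypothetical helper, not a function in xbrl_parser.py.

```python
def find_elements(doc, name):
    """Return every element of doc whose name matches exactly."""
    return [element for element in doc.get("elements", [])
            if element["name"] == name]

# Toy document in the returned-dict shape
sample_doc = {"elements": [
    {"name": "netassetsliabilitiesincludingpensionassetliability",
     "value": 88402.0, "unit": "GBP", "date": "2016-09-30"},
    {"name": "netassetsliabilitiesincludingpensionassetliability",
     "value": 81151.0, "unit": "GBP", "date": "2015-09-30"},
    {"name": "calledupsharecapital",
     "value": 5000.0, "unit": "GBP", "date": "2016-09-30"},
]}

matches = find_elements(sample_doc, "netassetsliabilitiesincludingpensionassetliability")

# Index the matches by reporting date to compare years directly
by_date = {m["date"]: m["value"] for m in matches}
```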


In [18]:
# Extract all the data to long-thin table format for use with SQL
# Note: tables from different docs should be appendable to one another
# to create a single table of all data
xp.flatten_data(doc)

Unnamed: 0,date,name,unit,value,doc_name,doc_type,doc_upload_date,arc_name,parsed,doc_balancesheetdate,doc_companieshouseregisterednumber,doc_standard_type,doc_standard_date,doc_standard_link
0,2016-09-30,companynotdormant,,true,Prod224_0042_00958610_20160930.xml,xml,2018-11-14 12:03:12.081506,example_data_XBRL_iXBRL,True,2016-09-30,00958610,uk-gaap-ae,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...
1,2016-09-30,entitycurrentlegalname,,S.L.M. (Model) Engineers Limited,Prod224_0042_00958610_20160930.xml,xml,2018-11-14 12:03:12.081506,example_data_XBRL_iXBRL,True,2016-09-30,00958610,uk-gaap-ae,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...
2,2016-09-30,companieshouseregisterednumber,,00958610,Prod224_0042_00958610_20160930.xml,xml,2018-11-14 12:03:12.081506,example_data_XBRL_iXBRL,True,2016-09-30,00958610,uk-gaap-ae,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...
3,2016-09-30,balancesheetdate,,2016-09-30,Prod224_0042_00958610_20160930.xml,xml,2018-11-14 12:03:12.081506,example_data_XBRL_iXBRL,True,2016-09-30,00958610,uk-gaap-ae,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...
4,2016-09-30,profitlossaccountreserve,GBP,83402,Prod224_0042_00958610_20160930.xml,xml,2018-11-14 12:03:12.081506,example_data_XBRL_iXBRL,True,2016-09-30,00958610,uk-gaap-ae,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...
5,2015-09-30,profitlossaccountreserve,GBP,76151,Prod224_0042_00958610_20160930.xml,xml,2018-11-14 12:03:12.081506,example_data_XBRL_iXBRL,True,2016-09-30,00958610,uk-gaap-ae,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...
6,2016-09-30,shareholderfunds,GBP,88402,Prod224_0042_00958610_20160930.xml,xml,2018-11-14 12:03:12.081506,example_data_XBRL_iXBRL,True,2016-09-30,00958610,uk-gaap-ae,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...
7,2015-09-30,shareholderfunds,GBP,81151,Prod224_0042_00958610_20160930.xml,xml,2018-11-14 12:03:12.081506,example_data_XBRL_iXBRL,True,2016-09-30,00958610,uk-gaap-ae,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...
8,2016-09-30,calledupsharecapital,GBP,5000,Prod224_0042_00958610_20160930.xml,xml,2018-11-14 12:03:12.081506,example_data_XBRL_iXBRL,True,2016-09-30,00958610,uk-gaap-ae,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...
9,2015-09-30,calledupsharecapital,GBP,5000,Prod224_0042_00958610_20160930.xml,xml,2018-11-14 12:03:12.081506,example_data_XBRL_iXBRL,True,2016-09-30,00958610,uk-gaap-ae,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...
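The idea behind the long-thin layout is one row per element, with the document-level metadata repeated on every row. The sketch below shows that idea in plain Python. It is not the actual implementation of xp.flatten_data, and `flatten_to_rows` is a hypothetical name.

```python
def flatten_to_rows(doc):
    """One row per element, with document metadata repeated on each row."""
    metadata = {key: value for key, value in doc.items() if key != "elements"}
    return [{**element, **metadata} for element in doc.get("elements", [])]

# Cut-down toy document using values from the first example file
sample_doc = {
    "doc_name": "Prod224_0042_00958610_20160930.xml",
    "doc_balancesheetdate": "2016-09-30",
    "elements": [
        {"name": "shareholderfunds", "value": 88402.0,
         "unit": "GBP", "date": "2016-09-30"},
        {"name": "shareholderfunds", "value": 81151.0,
         "unit": "GBP", "date": "2015-09-30"},
    ],
}

rows = flatten_to_rows(sample_doc)
```

Passing `rows` to `pd.DataFrame` would give a table with the same kind of columns as the output above, and rows from several documents can simply be concatenated.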


In [19]:
# Finally, build a table of all variables from all example (digital) documents
# This can take a while

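# Collect one table per document
tables = []

# For every file
for file in files:
    
    # Read the file
    doc = xp.process_account(file)
    
    # tabulate the results and keep them for later
    tables.append(xp.flatten_data(doc))

# Combine into a single table. DataFrame.append was removed in pandas 2.0,
# so use pd.concat; sort=True keeps the columns in alphabetical order
results = pd.concat(tables, sort=True)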

./example_data_XBRL_iXBRL/Prod224_0042_00958610_20160930.xml
./example_data_XBRL_iXBRL/Prod223_2125_09749826_20170831.html
./example_data_XBRL_iXBRL/Prod223_2125_09170142_20170831.html
./example_data_XBRL_iXBRL/Prod224_0042_03237381_20160831.xml
./example_data_XBRL_iXBRL/Prod223_2125_09900460_20161231.html




./example_data_XBRL_iXBRL/Prod223_2125_09652609_20180331.html
./example_data_XBRL_iXBRL/Prod223_2125_09722743_20170831.html
./example_data_XBRL_iXBRL/Prod223_2125_09418436_20180228.html
./example_data_XBRL_iXBRL/Prod223_2125_09734389_20171231.html
./example_data_XBRL_iXBRL/Prod223_2125_09209882_20170930.html
./example_data_XBRL_iXBRL/Prod223_2125_10021277_20180228.html
./example_data_XBRL_iXBRL/Prod223_2125_09187008_20170831.html
./example_data_XBRL_iXBRL/Prod223_2125_09900330_20171231.html
./example_data_XBRL_iXBRL/Prod223_2125_09181696_20170831.html
./example_data_XBRL_iXBRL/Prod223_2125_09928600_20171231.html
./example_data_XBRL_iXBRL/Prod224_0042_00973132_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_02506535_20160930.xml
./example_data_XBRL_iXBRL/Prod223_2125_09390208_20180331.html
./example_data_XBRL_iXBRL/Prod224_0042_02824859_20160331.xml
./example_data_XBRL_iXBRL/Prod223_2125_09618448_20170630.html
./example_data_XBRL_iXBRL/Prod223_2125_09726815_20170831.html
./example_d

./example_data_XBRL_iXBRL/Prod223_2125_10088559_20180331.html
./example_data_XBRL_iXBRL/Prod223_2125_09287062_20171031.html
./example_data_XBRL_iXBRL/Prod223_2125_09898492_20171231.html
./example_data_XBRL_iXBRL/Prod223_2125_10053460_20180331.html
./example_data_XBRL_iXBRL/Prod223_2125_09680485_20171231.html
./example_data_XBRL_iXBRL/Prod223_2125_09886696_20180331.html
./example_data_XBRL_iXBRL/Prod223_2125_09593111_20170331.html
./example_data_XBRL_iXBRL/Prod223_2125_09668636_20170731.html
./example_data_XBRL_iXBRL/Prod223_2125_10052939_20180331.html
./example_data_XBRL_iXBRL/Prod224_0042_03090836_20160831.xml
./example_data_XBRL_iXBRL/Prod223_2125_09379430_20170630.html
./example_data_XBRL_iXBRL/Prod223_2125_09258374_20171031.html
./example_data_XBRL_iXBRL/Prod223_2125_09240869_20170930.html
./example_data_XBRL_iXBRL/Prod223_2125_09787769_20170930.html
./example_data_XBRL_iXBRL/Prod223_2125_09489941_20180331.html
./example_data_XBRL_iXBRL/Prod224_0042_01745847_20160831.xml
./example_

./example_data_XBRL_iXBRL/Prod224_0042_03421506_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_03357695_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_02640114_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_00631408_20160930.xml
./example_data_XBRL_iXBRL/Prod224_0042_02356050_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_01870321_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_02846831_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_02876370_20161130.xml
./example_data_XBRL_iXBRL/Prod224_0042_03056697_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_02893416_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_02546826_20161031.xml
./example_data_XBRL_iXBRL/Prod224_0042_03305320_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_02869875_20161231.xml
./example_data_XBRL_iXBRL/Prod224_0042_02116000_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_02729992_20160831.xml
./example_data_XBRL_iXBRL/Prod224_0042_01338588_20160930.xml
./example_data_XBRL_iXBR

In [20]:
results

Unnamed: 0,arc_name,date,doc_balancesheetdate,doc_companieshouseregisterednumber,doc_name,doc_standard_date,doc_standard_link,doc_standard_type,doc_type,doc_upload_date,name,parsed,sign,unit,value
0,example_data_XBRL_iXBRL,2016-09-30,2016-09-30,00958610,Prod224_0042_00958610_20160930.xml,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...,uk-gaap-ae,xml,2018-11-14 12:27:24.186386,companynotdormant,True,,,true
1,example_data_XBRL_iXBRL,2016-09-30,2016-09-30,00958610,Prod224_0042_00958610_20160930.xml,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...,uk-gaap-ae,xml,2018-11-14 12:27:24.186386,entitycurrentlegalname,True,,,S.L.M. (Model) Engineers Limited
2,example_data_XBRL_iXBRL,2016-09-30,2016-09-30,00958610,Prod224_0042_00958610_20160930.xml,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...,uk-gaap-ae,xml,2018-11-14 12:27:24.186386,companieshouseregisterednumber,True,,,00958610
3,example_data_XBRL_iXBRL,2016-09-30,2016-09-30,00958610,Prod224_0042_00958610_20160930.xml,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...,uk-gaap-ae,xml,2018-11-14 12:27:24.186386,balancesheetdate,True,,,2016-09-30
4,example_data_XBRL_iXBRL,2016-09-30,2016-09-30,00958610,Prod224_0042_00958610_20160930.xml,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...,uk-gaap-ae,xml,2018-11-14 12:27:24.186386,profitlossaccountreserve,True,,GBP,83402
5,example_data_XBRL_iXBRL,2015-09-30,2016-09-30,00958610,Prod224_0042_00958610_20160930.xml,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...,uk-gaap-ae,xml,2018-11-14 12:27:24.186386,profitlossaccountreserve,True,,GBP,76151
6,example_data_XBRL_iXBRL,2016-09-30,2016-09-30,00958610,Prod224_0042_00958610_20160930.xml,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...,uk-gaap-ae,xml,2018-11-14 12:27:24.186386,shareholderfunds,True,,GBP,88402
7,example_data_XBRL_iXBRL,2015-09-30,2016-09-30,00958610,Prod224_0042_00958610_20160930.xml,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...,uk-gaap-ae,xml,2018-11-14 12:27:24.186386,shareholderfunds,True,,GBP,81151
8,example_data_XBRL_iXBRL,2016-09-30,2016-09-30,00958610,Prod224_0042_00958610_20160930.xml,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...,uk-gaap-ae,xml,2018-11-14 12:27:24.186386,calledupsharecapital,True,,GBP,5000
9,example_data_XBRL_iXBRL,2015-09-30,2016-09-30,00958610,Prod224_0042_00958610_20160930.xml,2009-06-21,http://www.companieshouse.gov.uk/ef/xbrl/uk/fr...,uk-gaap-ae,xml,2018-11-14 12:27:24.186386,calledupsharecapital,True,,GBP,5000


That's ~380 files extracted to obtain ~22,000 variables, on average about 60 variables per record. As you've just seen, though, extraction can take a while! Searching through the documents using BeautifulSoup can take a long time, especially when chasing element links to get information on units. Hopefully this is the sort of thing that can be optimised in future, or it'll be rendered irrelevant by Moore's Law.
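One possible optimisation, assuming the slow step is a fresh search of the whole document every time a fact's unit has to be resolved, is to index the unit definitions once and then look each unitRef up in a dict. The sketch below shows the idea on a made-up XBRL fragment. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup to stay self-contained, and `build_unit_index` is a hypothetical helper, not how xbrl_parser.py currently works.

```python
import xml.etree.ElementTree as ET

XBRLI = "{http://www.xbrl.org/2003/instance}"

# Made-up fragment: one unit definition and two facts that refer to it
SAMPLE = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
                  xmlns:ex="http://example.com/ex">
  <unit id="u1"><measure>iso4217:GBP</measure></unit>
  <ex:ShareholderFunds unitRef="u1" contextRef="c1">88402</ex:ShareholderFunds>
  <ex:CalledUpShareCapital unitRef="u1" contextRef="c1">5000</ex:CalledUpShareCapital>
</xbrl>"""

def build_unit_index(root):
    """Map each unit id to its measure (e.g. 'GBP') in a single pass."""
    index = {}
    for unit in root.iter(XBRLI + "unit"):
        measure = unit.find(XBRLI + "measure")
        if measure is not None and measure.text:
            # Drop the namespace prefix: 'iso4217:GBP' -> 'GBP'
            index[unit.get("id")] = measure.text.split(":")[-1]
    return index

root = ET.fromstring(SAMPLE)
units = build_unit_index(root)

# Each fact now costs one dict lookup instead of a document search
facts = [
    (el.tag.split("}")[-1].lower(), float(el.text), units.get(el.get("unitRef"), "NA"))
    for el in root.iter()
    if el.get("unitRef")
]
```

The same approach would apply to context (date) lookups, which are referenced from each fact in the same way.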

In [21]:
results.to_csv("example_extracted_XBRL_data.csv", index=False)
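One thing to watch when loading a CSV like this back in: company numbers such as 00958610 have leading zeros, and pandas will infer an integer column by default. The sketch below shows the difference on a toy table written to an in-memory buffer. The toy table is a stand-in for `results`, with only three of its columns.

```python
import io
import pandas as pd

# Toy stand-in for the results table; the real one has many more columns
toy = pd.DataFrame({
    "doc_companieshouseregisterednumber": ["00958610", "09749826"],
    "name": ["shareholderfunds", "balancesheetdate"],
    "value": [88402.0, "2017-08-31"],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default type inference turns the company number into an integer
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reading the identifier column as text preserves the leading zeros
buffer.seek(0)
safe = pd.read_csv(buffer, dtype={"doc_companieshouseregisterednumber": str})
```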

# Get summary variables

These are implemented to work off the MongoDB/dict representation of the data that the scraping code returns. If you wish to work with the "flattened" SQL-compatible data instead, it's assumed you can develop your own queries :)

In [7]:
index = 3
doc = xp.process_account(files[index])

# This tries to add up every variable it can find in a list of variable names
test = xp.summarise_by_sum(doc, ["fixedassets",
                                 "currentassets",
                                 "intangibleassets",
                                 "tangiblefixedassets",
                                 "intangiblefixedassets",
                                 "investmentsfixedassets",
                                 "cashbankinhand",
                                 "cashbankonhand",
                                 "cashbank",
                                 "cashonhand",
                                 "cashinhand",
                                 "calledupsharecapitalnotpaidnotexpressedascurrentasset",
                                 "otherdebtors"])
test

./example_data_XBRL_iXBRL/Prod224_0042_03237381_20160831.xml


{'total_assets': 537155.0, 'unit': 'GBP'}
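For readers curious how a sum-based summary like this could work, here is a minimal sketch. It is not the actual code of xp.summarise_by_sum: `sum_named_elements` is a hypothetical name, the restriction to the balance sheet date is an assumption made to avoid adding prior-year comparatives to the total, and the figures in the toy document are made up.

```python
def sum_named_elements(doc, names, label="total"):
    """Add up every numeric element whose name is in names,
    counting only values reported at the balance sheet date."""
    target_date = doc.get("doc_balancesheetdate")
    wanted = set(names)
    total, unit = 0.0, "NA"
    for element in doc.get("elements", []):
        if (element["name"] in wanted
                and element["date"] == target_date
                and isinstance(element["value"], (int, float))):
            total += element["value"]
            unit = element["unit"]
    return {label: total, "unit": unit}

# Toy document with made-up figures
toy_doc = {
    "doc_balancesheetdate": "2016-09-30",
    "elements": [
        {"name": "fixedassets", "value": 1000.0, "unit": "GBP", "date": "2016-09-30"},
        {"name": "currentassets", "value": 500.0, "unit": "GBP", "date": "2016-09-30"},
        # Prior-year comparative: skipped because the date differs
        {"name": "currentassets", "value": 400.0, "unit": "GBP", "date": "2015-09-30"},
        {"name": "entitycurrentlegalname", "value": "Toy Ltd", "unit": "NA", "date": "2016-09-30"},
    ],
}

summary = sum_named_elements(toy_doc, ["fixedassets", "currentassets", "cashbank"],
                             label="total_assets")
```

Note that a list containing both a total and its components (for example fixedassets alongside tangiblefixedassets) would count the same money twice if a document reports both.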

In [8]:
# This returns the first variable it finds in a prioritised list
# Here I've gone looking for net assets/liabilities
test = xp.summarise_by_priority(doc, ["netassetsliabilitiesincludingpensionasset",
                                      "netassetsliabilityexcludingpensionasset",
                                      "netassetsliabilities",
                                      "totalassetslesscurrentliabilities",
                                      "netcurrentassetsliabilities"])
test

{'primary_assets': 247028.0, 'unit': 'GBP'}

In [9]:
# Here I've applied it to shareholder funds/equity
test = xp.summarise_by_priority(doc, ["shareholderfunds",
                                      "equity",
                                      "capitalandreserves"])
test

{'primary_assets': 247028.0, 'unit': 'GBP'}
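The priority-based lookup can be pictured as "walk the list of names in order and stop at the first one the document contains". The sketch below illustrates that idea only. `first_available` is a hypothetical name, filtering to the balance sheet date is an assumption, and the toy figures are made up.

```python
def first_available(doc, names, label="value"):
    """Return the value of the first name in names that the document reports
    at its balance sheet date, or None if none of them are present."""
    target_date = doc.get("doc_balancesheetdate")
    # If a name occurs more than once at that date, the last occurrence wins
    by_name = {e["name"]: e for e in doc.get("elements", [])
               if e["date"] == target_date}
    for name in names:
        if name in by_name:
            element = by_name[name]
            return {label: element["value"], "unit": element["unit"]}
    return None

# Toy document with made-up figures
toy_doc = {
    "doc_balancesheetdate": "2016-09-30",
    "elements": [
        {"name": "equity", "value": 300.0, "unit": "GBP", "date": "2016-09-30"},
        {"name": "capitalandreserves", "value": 250.0, "unit": "GBP", "date": "2016-09-30"},
    ],
}

# "shareholderfunds" is absent, so the search falls through to "equity"
funds = first_available(toy_doc, ["shareholderfunds", "equity", "capitalandreserves"],
                        label="shareholder_funds")
```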

In [10]:
# This one just tries to return all named variables
test = xp.summarise_set(doc, ["creditors",
                              "debtors",
                              'accountstypefullorabbreviated',
                              'descriptionprincipalactivities',
                              'accountingstandardsapplied',
                              'entitytradingstatus'])
test

{'debtors': 70073.0}
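As the output shows, names that are not found in the document are simply left out of the result. A minimal sketch of that behaviour follows. It is not the actual code of xp.summarise_set: `collect_named` is a hypothetical name, and the element date in the toy document is an assumption read off the example filename.

```python
def collect_named(doc, names):
    """Return {name: value} for each requested name that the document reports
    at its balance sheet date; names that are not found are omitted."""
    target_date = doc.get("doc_balancesheetdate")
    wanted = set(names)
    return {e["name"]: e["value"] for e in doc.get("elements", [])
            if e["name"] in wanted and e["date"] == target_date}

# Toy document: the debtors value is the one shown above, the date is assumed
toy_doc = {
    "doc_balancesheetdate": "2016-08-31",
    "elements": [
        {"name": "debtors", "value": 70073.0, "unit": "GBP", "date": "2016-08-31"},
    ],
}

found = collect_named(toy_doc, ["creditors", "debtors", "entitytradingstatus"])
```

A caller that needs every requested key can fill in the gaps afterwards, for example with `{name: found.get(name) for name in names}`.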