# Extracting Data from Companies House Electronic Records

Companies House receives 75% of its records in XBRL or iXBRL format, essentially tagged XML (XBRL) or tagged HTML (iXBRL) documents, which should allow for easy automated extraction of statistics.

The software in this repo was developed after reading this (American) example:
https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python

The functions for extracting the data are in the module xbrl_parser.py.

Both xbrl_parser.py and this script depend on a number of Python packages, so expect to have to install some things.


## Returned dict schema for html/xml sourced data

A practical note: apart from explicitly elevated metadata (the top-level fields such as doc_name and doc_balancesheetdate), all extracted values are stored in a list under the key "elements" in the returned dict. Each element is itself a dict containing the name and value of the discovered data, along with unit and date fields as metadata.
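
As a rough illustration of that layout, the sketch below builds a small dict shaped like the ones the parser returns and checks it against the structure just described. The field names and values are copied from the example output later in this document; the `looks_like_parsed_doc` helper is hypothetical and not part of xbrl_parser.py.

```python
# Hand-written example of the returned structure (only a few fields included).
example_doc = {
    "doc_name": "Prod224_0042_00958610_20160930.xml",
    "doc_type": "xml",
    "doc_companieshouseregisterednumber": "00958610",
    "doc_balancesheetdate": "2016-09-30",
    "elements": [
        {"name": "companynotdormant", "value": "true",
         "unit": "NA", "date": "2016-09-30"},
        {"name": "profitlossaccountreserve", "value": 83402.0,
         "unit": "GBP", "date": "2016-09-30"},
    ],
}

ELEMENT_KEYS = {"name", "value", "unit", "date"}


def looks_like_parsed_doc(doc):
    """Hypothetical check: True if doc has an 'elements' list whose
    entries are dicts carrying the four expected keys."""
    elements = doc.get("elements")
    if not isinstance(elements, list):
        return False
    return all(isinstance(el, dict) and ELEMENT_KEYS.issubset(el)
               for el in elements)


print(looks_like_parsed_doc(example_doc))
```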

# Setup (import modules, set up a helper function for getting filepaths)

In [2]:
import xbrl_parser as xp
import os
import numpy as np
import pandas as pd
import importlib

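def get_filepaths(directory):
    """ Helper function -
    Return the paths of all files in a directory whose
    extension starts with "htm" (.htm, .html) or is "xml",
    under the assumption that all such files within
    the folder are financial records. """

    files = []
    for filename in os.listdir(directory):
        # Test the extension only, so names that merely contain "htm" or "xml" are skipped
        extension = os.path.splitext(filename)[1].lower()
        if extension.startswith(".htm") or extension == ".xml":
            files.append(os.path.join(directory, filename))
    return files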

# Extracting data from documents

With the module imported, we can process some files.


In [16]:
# Get all the filenames from the example folder
files = get_filepaths("./example_data_XBRL_iXBRL")

# There are currently 379 examples
files[0:7]

['./example_data_XBRL_iXBRL/Prod224_0042_00958610_20160930.xml',
 './example_data_XBRL_iXBRL/Prod223_2125_09749826_20170831.html',
 './example_data_XBRL_iXBRL/Prod223_2125_09170142_20170831.html',
 './example_data_XBRL_iXBRL/Prod224_0042_03237381_20160831.xml',
 './example_data_XBRL_iXBRL/Prod223_2125_09900460_20161231.html',
 './example_data_XBRL_iXBRL/Prod223_2125_09652609_20180331.html',
 './example_data_XBRL_iXBRL/Prod223_2125_09722743_20170831.html']
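
The sample filenames appear to follow a `<prefix>_<code>_<company number>_<date>` pattern: for the first file, the last two fields match the registered number and balance sheet date reported in the parsed output below. Assuming that pattern holds (it is inferred from these examples, not from any Companies House specification), a hypothetical helper could pull those two fields out of a name before the file is parsed:

```python
import os
from datetime import datetime


def parse_account_filename(path):
    """Hypothetical helper: split a name like
    'Prod224_0042_00958610_20160930.xml' into its apparent parts.
    Returns None if the name does not fit the assumed pattern."""
    stem, ext = os.path.splitext(os.path.basename(path))
    parts = stem.split("_")
    if len(parts) != 4:
        return None
    try:
        date = datetime.strptime(parts[3], "%Y%m%d").date()
    except ValueError:
        return None
    return {
        "company_number": parts[2],  # kept as text to preserve leading zeros
        "balance_sheet_date": date.isoformat(),
        "extension": ext.lstrip(".").lower(),
    }


info = parse_account_filename(
    "./example_data_XBRL_iXBRL/Prod224_0042_00958610_20160930.xml")
print(info)
```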

In [17]:
# Reload the xbrl_parser module (not normally needed; it's just useful
# for iterative testing of changes)
importlib.reload(xp)

# Try parsing the first file (an XML, i.e. XBRL, file)
doc = xp.process_account(files[0])

# View the content
doc

./example_data_XBRL_iXBRL/Prod224_0042_00958610_20160930.xml


{'doc_name': 'Prod224_0042_00958610_20160930.xml',
 'doc_type': 'xml',
 'doc_upload_date': '2018-10-29 15:36:16.212768',
 'arc_name': 'example_data_XBRL_iXBRL',
 'parsed': True,
 'doc_balancesheetdate': '2016-09-30',
 'doc_companieshouseregisterednumber': '00958610',
 'doc_standard_type': 'uk-gaap-ae',
 'doc_standard_date': '2009-06-21',
 'doc_standard_link': 'http://www.companieshouse.gov.uk/ef/xbrl/uk/fr/gaap/ae/2009-06-21/uk-gaap-ae-2009-06-21.xsd',
 'elements': [{'name': 'companynotdormant',
   'value': 'true',
   'unit': 'NA',
   'date': '2016-09-30'},
  {'name': 'entitycurrentlegalname',
   'value': 'S.L.M. (Model) Engineers Limited',
   'unit': 'NA',
   'date': '2016-09-30'},
  {'name': 'companieshouseregisterednumber',
   'value': '00958610',
   'unit': 'NA',
   'date': '2016-09-30'},
  {'name': 'balancesheetdate',
   'value': '2016-09-30',
   'unit': 'NA',
   'date': '2016-09-30'},
  {'name': 'profitlossaccountreserve',
   'value': 83402.0,
   'unit': 'GBP',
   'date': '2016-0

In [18]:
# Try parsing the second file (an HTML, i.e. iXBRL, file)
doc2 = xp.process_account(files[1])

# View the content
doc2

./example_data_XBRL_iXBRL/Prod223_2125_09749826_20170831.html


{'doc_name': 'Prod223_2125_09749826_20170831.html',
 'doc_type': 'html',
 'doc_upload_date': '2018-10-29 15:36:20.197275',
 'arc_name': 'example_data_XBRL_iXBRL',
 'parsed': True,
 'doc_balancesheetdate': '2017-08-31',
 'doc_companieshouseregisterednumber': '09749826',
 'doc_standard_type': 'FRS-102',
 'doc_standard_date': '2014-09-01',
 'doc_standard_link': 'https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd',
 'elements': [{'name': 'nameproductionsoftware',
   'value': 'Caseware UK (AP4) 2016.0.181',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'versionproductionsoftware',
   'value': ' 2016.0.181',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'enddateforperiodcoveredbyreport',
   'value': '2017-08-31',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'balancesheetdate',
   'value': '2017-08-31',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'accountstypefullorabbreviated',
   'value': 'FullAccounts',
   'unit': 'NA',
   'date': '2017-08-
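
When working through the whole folder rather than one file at a time, it can help to carry on past any file that fails. The sketch below takes the parser as a callable (in this notebook that would be `xp.process_account`) and treats either an exception or a falsy 'parsed' flag as a failure. How xbrl_parser.py actually reports failures is not shown here, so the error handling is an assumption, and `process_many` and `fake_process` are hypothetical names.

```python
def process_many(paths, process):
    """Run `process` over each path, collecting successes and failures
    separately. `process` is any callable returning a dict per file."""
    docs, failures = [], []
    for path in paths:
        try:
            doc = process(path)
        except Exception as err:  # broad on purpose: one bad file should not stop the batch
            failures.append((path, repr(err)))
            continue
        # The sample output carries a 'parsed' flag; a falsy flag counts as a failure.
        if doc.get("parsed", True):
            docs.append(doc)
        else:
            failures.append((path, "parsed flag was False"))
    return docs, failures


# Stand-in parser for demonstration only; swap in xp.process_account.
def fake_process(path):
    if path.endswith(".bad"):
        raise ValueError("could not read file")
    return {"doc_name": path, "parsed": True, "elements": []}


docs, failures = process_many(["a.xml", "b.bad", "c.html"], fake_process)
print(len(docs), "parsed;", len(failures), "failed")
```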

# Retrieve elements

In [17]:
# Loop through the document, retrieving any element with a matching name
for element in doc['elements']:
    if element['name'] == 'netassetsliabilitiesincludingpensionassetliability':
        print(element)

{'name': 'netassetsliabilitiesincludingpensionassetliability', 'value': 88402.0, 'unit': 'GBP', 'date': '2016-09-30'}
{'name': 'netassetsliabilitiesincludingpensionassetliability', 'value': 81151.0, 'unit': 'GBP', 'date': '2015-09-30'}
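
The same lookup can be wrapped in a small function that returns the matches instead of printing them. This is a sketch: `find_elements` is a hypothetical helper, not something xbrl_parser.py is known to provide, and the sample dict stands in for a parsed document.

```python
def find_elements(doc, name):
    """Return every element in doc['elements'] whose name matches exactly."""
    return [el for el in doc.get("elements", []) if el.get("name") == name]


# Stand-in for a parsed document, using values from the output above.
sample_doc = {
    "elements": [
        {"name": "netassetsliabilitiesincludingpensionassetliability",
         "value": 88402.0, "unit": "GBP", "date": "2016-09-30"},
        {"name": "netassetsliabilitiesincludingpensionassetliability",
         "value": 81151.0, "unit": "GBP", "date": "2015-09-30"},
        {"name": "calledupsharecapital",
         "value": 5000.0, "unit": "GBP", "date": "2016-09-30"},
    ]
}

matches = find_elements(
    sample_doc, "netassetsliabilitiesincludingpensionassetliability")
# Index the matches by date to compare reporting periods.
by_date = {el["date"]: el["value"] for el in matches}
print(by_date)
```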


In [21]:
# Convert all of a document's elements to a pandas dataframe format for ease of searching
df = pd.DataFrame(doc['elements'])
df['company'] = doc['doc_companieshouseregisterednumber']
# Note: text fields come back as strings; the value column mixes strings and numbers
df

|   | date | name | unit | value | company |
|---|------|------|------|-------|---------|
| 0 | 2016-09-30 | companynotdormant | | true | 00958610 |
| 1 | 2016-09-30 | entitycurrentlegalname | | S.L.M. (Model) Engineers Limited | 00958610 |
| 2 | 2016-09-30 | companieshouseregisterednumber | | 00958610 | 00958610 |
| 3 | 2016-09-30 | balancesheetdate | | 2016-09-30 | 00958610 |
| 4 | 2016-09-30 | profitlossaccountreserve | GBP | 83402 | 00958610 |
| 5 | 2015-09-30 | profitlossaccountreserve | GBP | 76151 | 00958610 |
| 6 | 2016-09-30 | shareholderfunds | GBP | 88402 | 00958610 |
| 7 | 2015-09-30 | shareholderfunds | GBP | 81151 | 00958610 |
| 8 | 2016-09-30 | calledupsharecapital | GBP | 5000 | 00958610 |
| 9 | 2015-09-30 | calledupsharecapital | GBP | 5000 | 00958610 |
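
Because the `value` column holds both text and numbers, numeric work usually needs an explicit conversion first. A minimal sketch, assuming pandas is installed and using a few rows copied from the table above: `pd.to_numeric` with `errors="coerce"` turns non-numeric entries into NaN, after which the numeric rows can be pivoted to put the two reporting dates side by side.

```python
import pandas as pd

rows = [
    {"date": "2016-09-30", "name": "companynotdormant", "unit": "NA", "value": "true"},
    {"date": "2016-09-30", "name": "shareholderfunds", "unit": "GBP", "value": 88402.0},
    {"date": "2015-09-30", "name": "shareholderfunds", "unit": "GBP", "value": 81151.0},
    {"date": "2016-09-30", "name": "calledupsharecapital", "unit": "GBP", "value": 5000.0},
    {"date": "2015-09-30", "name": "calledupsharecapital", "unit": "GBP", "value": 5000.0},
]
df = pd.DataFrame(rows)

# Non-numeric values such as 'true' become NaN rather than raising an error.
df["value_num"] = pd.to_numeric(df["value"], errors="coerce")
df["date"] = pd.to_datetime(df["date"])

# One row per variable, one column per balance sheet date.
numeric = df.dropna(subset=["value_num"])
wide = numeric.pivot(index="name", columns="date", values="value_num")
print(wide)
```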


# Get summary variables

In [51]:
importlib.reload(xp)
# Ingest a document
index = 3
doc = xp.process_account(files[index])

# This tries to add up every variable it can find in a list of variable names
test = xp.summarise_by_sum(doc, ["fixedassets",
                                 "currentassets",
                                 "intangibleassets",
                                 "tangiblefixedassets",
                                 "intangiblefixedassets",
                                 "investmentsfixedassets",
                                 "cashbankinhand",
                                 "cashbankonhand",
                                 "cashbank",
                                 "cashonhand",
                                 "cashinhand",
                                 "calledupsharecapitalnotpaidnotexpressedascurrentasset",
                                 "otherdebtors"])
test

./example_data_XBRL_iXBRL/Prod224_0042_03237381_20160831.xml


{'total_assets': 691.0, 'unit': 'GBP'}
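
The internals of `summarise_by_sum` are not shown here. As a rough sketch of the idea only, and not the module's actual implementation, the hypothetical function below adds up the values of every listed variable it finds for a single date. It assumes exact name matching, skips non-numeric values, and defaults to the most recent date present; the real function may differ on all three points. The figures are made up.

```python
def sum_named_elements(doc, names, date=None):
    """Hypothetical sketch: total the numeric values of all elements whose
    name is in `names`, restricted to one date (latest by default)."""
    wanted = set(names)
    hits = [el for el in doc.get("elements", [])
            if el.get("name") in wanted
            and isinstance(el.get("value"), (int, float))
            and not isinstance(el.get("value"), bool)]
    if not hits:
        return None
    if date is None:
        # ISO 'YYYY-MM-DD' strings sort chronologically.
        date = max(el["date"] for el in hits)
    total = sum(el["value"] for el in hits if el["date"] == date)
    return {"total": total, "date": date}


# Made-up figures for illustration.
toy_doc = {"elements": [
    {"name": "tangiblefixedassets", "value": 400.0, "unit": "GBP", "date": "2016-08-31"},
    {"name": "cashbankinhand", "value": 250.0, "unit": "GBP", "date": "2016-08-31"},
    {"name": "cashbankinhand", "value": 100.0, "unit": "GBP", "date": "2015-08-31"},
    {"name": "entitycurrentlegalname", "value": "Example Ltd", "unit": "NA", "date": "2016-08-31"},
]}

print(sum_named_elements(toy_doc, ["tangiblefixedassets", "cashbankinhand", "otherdebtors"]))
```

One thing to watch with any sum of this kind: if a document reports both an aggregate (such as fixedassets) and its components (such as tangiblefixedassets), a list containing both names would count the same money twice.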

In [52]:
# This returns the first variable it finds in a prioritised list
# Here I've gone looking for net assets/liabilities
test = xp.summarise_by_priority(doc, ["netassetsliabilitiesincludingpensionasset",
                                      "netassetsliabilityexcludingpensionasset",
                                      "netassetsliabilities",
                                      "totalassetslesscurrentliabilities",
                                      "netcurrentassetsliabilities"])
test

{'primary_assets': 247028.0, 'unit': 'GBP'}
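
A prioritised lookup of this kind can be sketched as a loop that stops at the first name with a match. The function below is hypothetical and only illustrates the idea: it assumes exact name matching and returns the first matching element it encounters, which may not be how `summarise_by_priority` resolves names or dates. The figures are made up.

```python
def first_by_priority(doc, names):
    """Hypothetical sketch: return the first element found, trying each
    name in order of preference. Returns None if nothing matches."""
    elements = doc.get("elements", [])
    for name in names:  # earlier names take precedence
        for el in elements:
            if el.get("name") == name:
                return el
    return None


# Made-up figures for illustration.
toy_doc = {"elements": [
    {"name": "netcurrentassetsliabilities", "value": 120.0, "unit": "GBP", "date": "2016-08-31"},
    {"name": "netassetsliabilities", "value": 300.0, "unit": "GBP", "date": "2016-08-31"},
]}

hit = first_by_priority(toy_doc, ["shareholderfunds",
                                  "netassetsliabilities",
                                  "netcurrentassetsliabilities"])
print(hit["name"], hit["value"])
```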

In [53]:
# Here I've applied it to shareholder funds/equity
test = xp.summarise_by_priority(doc, ["shareholderfunds",
                                      "equity",
                                      "capitalandreserves"])
test

{'primary_assets': 247028.0, 'unit': 'GBP'}

In [54]:
# This one just tries to return all named variables
test = xp.summarise_set(doc, ["creditors",
                              "debtors",
                              'accountstypefullorabbreviated',
                              'descriptionprincipalactivities',
                              'accountingstandardsapplied',
                              'entitytradingstatus'])
test

{'debtors': 70073.0}
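
Judging by the output, names with no match are left out of the result rather than reported as missing. A hypothetical equivalent, assuming exact name matching and keeping the first value found for each name (neither detail is confirmed for `summarise_set`), might look like this, with a second step that lists what was not found:

```python
def pick_named(doc, names):
    """Hypothetical sketch: map each requested name to the first value
    found for it, omitting names that do not appear in the document."""
    wanted = set(names)
    found = {}
    for el in doc.get("elements", []):
        name = el.get("name")
        if name in wanted and name not in found:
            found[name] = el.get("value")
    return found


# Toy data for illustration only.
toy_doc = {"elements": [
    {"name": "debtors", "value": 70073.0, "unit": "GBP", "date": "2016-08-31"},
    {"name": "debtors", "value": 65000.0, "unit": "GBP", "date": "2015-08-31"},
    {"name": "shareholderfunds", "value": 247028.0, "unit": "GBP", "date": "2016-08-31"},
]}

requested = ["creditors", "debtors", "entitytradingstatus"]
picked = pick_named(toy_doc, requested)
missing = [name for name in requested if name not in picked]
print(picked, missing)
```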