# Extracting Data from Companies House Electronic Records

Companies House receives 75% of its records in XBRL or iXBRL format: tagged XML (or, for iXBRL, tagged HTML) documents that should allow easy automated extraction of statistics.

The software in this repo was developed after reading this (American) example:
https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python

The extraction functions live in the module xbrl_parser.py.

Both xbrl_parser.py and this notebook depend on a number of Python packages, so expect to have to install a few things first.
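
To illustrate why tagged documents lend themselves to automated extraction, here is a minimal sketch using only the standard library. The fragment is hand-written for illustration (it is not a real filing), the function name `extract_numeric_facts` is hypothetical, and this is not necessarily how xbrl_parser.py works internally. It assumes the input is well-formed XHTML using the Inline XBRL 1.0 namespace; real filings may need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written iXBRL fragment (not taken from a real filing).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">
  <body>
    <p>Shareholder funds:
      <ix:nonFraction name="uk-gaap:ShareholderFunds" contextRef="FY2016"
                      unitRef="GBP" decimals="0">88,402</ix:nonFraction>
    </p>
  </body>
</html>"""

# ElementTree spells namespaced tags as "{namespace-uri}localname".
IX = "{http://www.xbrl.org/2008/inlineXBRL}"


def extract_numeric_facts(xhtml_text):
    """Return one dict per ix:nonFraction tag found in the document."""
    root = ET.fromstring(xhtml_text)
    facts = []
    for node in root.iter(IX + "nonFraction"):
        # Drop the taxonomy prefix and lowercase the name.
        name = node.get("name", "").split(":")[-1].lower()
        # Remove thousands separators before converting to a number.
        text = "".join(node.itertext()).strip().replace(",", "")
        facts.append({"name": name,
                      "value": float(text),
                      "unit": node.get("unitRef", "NA")})
    return facts


facts = extract_numeric_facts(SAMPLE)
print(facts)
```

Because every figure carries its own name and unit as attributes, no positional or layout-based scraping is needed.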


## Returned dict schema for html/xml sourced data

A practical note: apart from explicitly elevated metadata, all extracted values are stored in a list under the key "elements" within the returned dict. Each element is itself a dict containing the name and value of the discovered data, along with unit and date fields as metadata.
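
The shape described above can be sketched as follows. The record is abridged from the sample output shown in this notebook; the helper names `split_doc` and `malformed_elements` are hypothetical and are not part of xbrl_parser.py.

```python
# Abridged example of a returned dict, with values copied from this notebook's sample output.
example_doc = {
    "doc_name": "Prod223_2125_09170142_20170831.html",
    "doc_type": "html",
    "parsed": True,
    "doc_balancesheetdate": "2017-08-31",
    "doc_companieshouseregisterednumber": "09170142",
    "elements": [
        {"name": "entitydormant", "value": "true",
         "unit": "NA", "date": "2017-08-31"},
        {"name": "balancesheetdate", "value": "2017-08-31",
         "unit": "NA", "date": "2017-08-31"},
    ],
}

# The four fields every element is expected to carry.
ELEMENT_KEYS = {"name", "value", "unit", "date"}


def split_doc(doc):
    """Separate the elevated metadata from the list of elements."""
    metadata = {key: value for key, value in doc.items() if key != "elements"}
    elements = doc.get("elements", [])
    return metadata, elements


def malformed_elements(elements):
    """Return any elements that do not have exactly the expected fields."""
    return [e for e in elements if set(e) != ELEMENT_KEYS]


metadata, elements = split_doc(example_doc)
print(sorted(metadata))
print(len(elements), "elements,", len(malformed_elements(elements)), "malformed")
```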

# Setup (import modules, set up a helper function for getting filepaths)

In [7]:
import xbrl_parser as xp
import os
import numpy as np
import pandas as pd
import importlib

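def get_filepaths(directory):
    """Helper function.

    Return the paths of all files in `directory` whose names end in
    .htm, .html or .xml (case-insensitive), under the assumption that
    all such files within the folder are financial records.
    """
    # Match on the file extension rather than anywhere in the name,
    # so that a file such as "xml_notes.txt" is not picked up by mistake.
    extensions = (".htm", ".html", ".xml")
    files = [os.path.join(directory, filename)
             for filename in os.listdir(directory)
             if filename.lower().endswith(extensions)]
    return files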

# Extracting data from documents

With the module imported, we can now process some files.


In [2]:
# Get all the filenames from the example folder
files = get_filepaths("./example_data_XBRL_iXBRL")

# There are currently 379 examples; show the first seven
files[0:7]

['./example_data_XBRL_iXBRL/Prod224_0042_00958610_20160930.xml',
 './example_data_XBRL_iXBRL/Prod223_2125_09749826_20170831.html',
 './example_data_XBRL_iXBRL/Prod223_2125_09170142_20170831.html',
 './example_data_XBRL_iXBRL/Prod224_0042_03237381_20160831.xml',
 './example_data_XBRL_iXBRL/Prod223_2125_09900460_20161231.html',
 './example_data_XBRL_iXBRL/Prod223_2125_09652609_20180331.html',
 './example_data_XBRL_iXBRL/Prod223_2125_09722743_20170831.html']
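
The filenames listed above appear to end in a company number and a balance sheet date (they match the `doc_companieshouseregisterednumber` and `doc_balancesheetdate` values in the parsed output). A rough sketch for pulling those out of a path, assuming that naming pattern holds for every file; the function name `parse_account_filename` is hypothetical and not part of xbrl_parser.py:

```python
import os
from datetime import datetime


def parse_account_filename(filepath):
    """Split a path like '.../Prod223_2125_09170142_20170831.html' into parts.

    Assumes the last two underscore-separated fields of the file stem are
    the company number and the balance sheet date (YYYYMMDD).
    """
    stem, extension = os.path.splitext(os.path.basename(filepath))
    parts = stem.split("_")
    company_number = parts[-2]
    balance_sheet_date = datetime.strptime(parts[-1], "%Y%m%d").date().isoformat()
    return {
        "company_number": company_number,          # kept as a string to preserve leading zeros
        "balance_sheet_date": balance_sheet_date,  # ISO format, e.g. '2017-08-31'
        "extension": extension.lstrip(".").lower(),
    }


info = parse_account_filename(
    "./example_data_XBRL_iXBRL/Prod223_2125_09170142_20170831.html")
print(info)
```

This can be handy for a quick sanity check of the parsed metadata against the filename, without opening the document itself.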

In [31]:
# Try processing the third file, files[2] (an HTML, i.e. iXBRL, file)
importlib.reload(xp)
doc = xp.process_account(files[2])

# View the content
doc

./example_data_XBRL_iXBRL/Prod223_2125_09170142_20170831.html


{'doc_name': 'Prod223_2125_09170142_20170831.html',
 'doc_type': 'html',
 'doc_upload_date': '2018-10-25 15:55:30.730675',
 'arc_name': 'example_data_XBRL_iXBRL',
 'parsed': True,
 'doc_balancesheetdate': '2017-08-31',
 'doc_companieshouseregisterednumber': '09170142',
 'doc_standard_type': 'uk-gaap-full',
 'doc_standard_date': '2009-09-01',
 'doc_standard_link': 'http://www.xbrl.org/uk/gaap/core/2009-09-01/uk-gaap-full-2009-09-01.xsd',
 'elements': [{'name': 'ukcompanieshouseregisterednumber',
   'value': '09170142',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'entitycurrentlegalorregisteredname',
   'value': 'O THONGTHAI LIMITED',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'balancesheetdate',
   'value': '2017-08-31',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'namedirectorsigningaccounts',
   'value': '',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'entitydormant',
   'value': 'true',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'st

In [10]:
# try getting the second file (an HTML, or iXBRL, file)
doc2 = xp.process_account(files[1])

# View the content
doc2

./example_data_XBRL_iXBRL/Prod223_2125_09749826_20170831.html


{'doc_name': 'Prod223_2125_09749826_20170831.html',
 'doc_type': 'html',
 'doc_upload_date': '2018-10-24 14:15:06.082682',
 'arc_name': 'example_data_XBRL_iXBRL',
 'parsed': True,
 'doc_balancesheetdate': '2017-08-31',
 'doc_companieshouseregisterednumber': '09749826',
 'doc_standard_type': 'FRS-102',
 'doc_standard_date': '2014-09-01',
 'doc_standard_link': 'https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd',
 'elements': [{'name': 'nameproductionsoftware',
   'value': 'Caseware UK (AP4) 2016.0.181',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'versionproductionsoftware',
   'value': ' 2016.0.181',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'enddateforperiodcoveredbyreport',
   'value': '2017-08-31',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'balancesheetdate',
   'value': '2017-08-31',
   'unit': 'NA',
   'date': '2017-08-31'},
  {'name': 'accountstypefullorabbreviated',
   'value': 'FullAccounts',
   'unit': 'NA',
   'date': '2017-08-

# Retrieve elements

In [17]:
# Loop through the document, retrieving any element with a matching name
# (the output below is from a run in which doc held files[0], company 00958610)
for element in doc['elements']:
    if element['name'] == 'netassetsliabilitiesincludingpensionassetliability':
        print(element)

{'name': 'netassetsliabilitiesincludingpensionassetliability', 'value': 88402.0, 'unit': 'GBP', 'date': '2016-09-30'}
{'name': 'netassetsliabilitiesincludingpensionassetliability', 'value': 81151.0, 'unit': 'GBP', 'date': '2015-09-30'}
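
The loop above can be wrapped in a small reusable helper. This is a minimal sketch: `find_elements` is a hypothetical name, not a function provided by xbrl_parser.py, and `sample_doc` is an abridged stand-in built from the values printed above.

```python
def find_elements(doc, name):
    """Return every element of `doc` whose name matches `name` (case-insensitive)."""
    target = name.lower()
    return [element for element in doc.get("elements", [])
            if element.get("name", "").lower() == target]


# Abridged stand-in for a parsed document, using the two values printed above.
sample_doc = {
    "elements": [
        {"name": "netassetsliabilitiesincludingpensionassetliability",
         "value": 88402.0, "unit": "GBP", "date": "2016-09-30"},
        {"name": "netassetsliabilitiesincludingpensionassetliability",
         "value": 81151.0, "unit": "GBP", "date": "2015-09-30"},
        {"name": "balancesheetdate",
         "value": "2016-09-30", "unit": "NA", "date": "2016-09-30"},
    ]
}

matches = find_elements(sample_doc, "NetAssetsLiabilitiesIncludingPensionAssetLiability")

# The same name can appear once per reporting date, so index the values by date.
by_date = {element["date"]: element["value"] for element in matches}
print(by_date)
```

Returning a list rather than a single value matters here because one document typically reports both the current and the prior year under the same name.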


In [21]:
# Convert all of a document's elements to a pandas dataframe format for ease of searching
df = pd.DataFrame(doc['elements'])
df['company'] = doc['doc_companieshouseregisterednumber']
# Note, all fields are currently string (character) format after extraction
df

|   | date | name | unit | value | company |
|---|------|------|------|-------|---------|
| 0 | 2016-09-30 | companynotdormant | | true | 00958610 |
| 1 | 2016-09-30 | entitycurrentlegalname | | S.L.M. (Model) Engineers Limited | 00958610 |
| 2 | 2016-09-30 | companieshouseregisterednumber | | 00958610 | 00958610 |
| 3 | 2016-09-30 | balancesheetdate | | 2016-09-30 | 00958610 |
| 4 | 2016-09-30 | profitlossaccountreserve | GBP | 83402 | 00958610 |
| 5 | 2015-09-30 | profitlossaccountreserve | GBP | 76151 | 00958610 |
| 6 | 2016-09-30 | shareholderfunds | GBP | 88402 | 00958610 |
| 7 | 2015-09-30 | shareholderfunds | GBP | 81151 | 00958610 |
| 8 | 2016-09-30 | calledupsharecapital | GBP | 5000 | 00958610 |
| 9 | 2015-09-30 | calledupsharecapital | GBP | 5000 | 00958610 |
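
Since values may come back as strings, a typical next step is to convert the monetary rows to numbers and compare years side by side. The following is a sketch under stated assumptions: the rows are retyped from the table above as strings, and a `unit` of `'GBP'` is assumed to mark monetary values. Filtering on the unit first avoids accidentally converting identifiers such as the company number, which would lose its leading zeros.

```python
import pandas as pd

# A few rows retyped from the table above, with every field held as a string.
elements = [
    {"date": "2016-09-30", "name": "companieshouseregisterednumber", "unit": "NA", "value": "00958610"},
    {"date": "2016-09-30", "name": "profitlossaccountreserve", "unit": "GBP", "value": "83402"},
    {"date": "2015-09-30", "name": "profitlossaccountreserve", "unit": "GBP", "value": "76151"},
    {"date": "2016-09-30", "name": "shareholderfunds", "unit": "GBP", "value": "88402"},
    {"date": "2015-09-30", "name": "shareholderfunds", "unit": "GBP", "value": "81151"},
    {"date": "2016-09-30", "name": "calledupsharecapital", "unit": "GBP", "value": "5000"},
    {"date": "2015-09-30", "name": "calledupsharecapital", "unit": "GBP", "value": "5000"},
]
df = pd.DataFrame(elements)

# Keep only the monetary rows, then convert their values to numbers.
money = df[df["unit"] == "GBP"].copy()
money["value"] = pd.to_numeric(money["value"], errors="coerce")

# One row per item, one column per reporting date.
wide = money.pivot(index="name", columns="date", values="value")
wide["change"] = wide["2016-09-30"] - wide["2015-09-30"]
print(wide)
```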
