# Automate the retrieval of PDF records for companies

For companies with no electronic record in our database, we may still be able to find and parse a PDF or electronic record through the Companies House API. The approach is to query the filing history for a specific company and then request a download of any document we want.

This can somewhat increase the number of companies on which we have data. As an aside, because our database includes only the digital records from a single year, a significant number of the companies it missed do have electronic records, filed either after or just before the window from which the database was created.

### Biggest discovery so far
I need better decision rules for identifying which pages represent a balance sheet, because many are titled "Statement of Financial Position" or something similar.
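One possible starting point, sketched here as an assumption rather than the rule set the parser actually uses: treat a page as a balance sheet candidate if its text contains any of a list of known titles. The function name `looks_like_balance_sheet` and the title list are hypothetical and would need extending as more variants turn up.

```python
import re

# Hypothetical list of titles seen on balance sheet pages; extend as needed
BALANCE_SHEET_TITLES = [
    r"balance\s+sheet",
    r"statement\s+of\s+financial\s+position",
]

# One case-insensitive pattern that matches any of the titles above
_TITLE_PATTERN = re.compile("|".join(BALANCE_SHEET_TITLES), flags=re.IGNORECASE)


def looks_like_balance_sheet(page_text):
    """Return True if the page text contains a known balance sheet title."""
    return bool(_TITLE_PATTERN.search(page_text))
```

Matching on `\s+` rather than a single space tolerates the irregular spacing and line breaks that OCR output tends to contain.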


### API URL information

Can get basic company information using: GET https://api.companieshouse.gov.uk/company/{company_number}

Can get their filing history with: GET https://api.companieshouse.gov.uk/company/{company_number}/filing-history

Can retrieve a specific document with: GET http://document-api.companieshouse.gov.uk/document/{id}/content
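As a minimal sketch, the three endpoints above can be wrapped in small URL builders. The function names are hypothetical; the URL templates are copied from the list above, and no request is made here.

```python
API_BASE = "https://api.companieshouse.gov.uk"
DOCUMENT_BASE = "http://document-api.companieshouse.gov.uk"


def company_url(company_number):
    """URL for basic company information."""
    return "{}/company/{}".format(API_BASE, company_number)


def filing_history_url(company_number):
    """URL for a company's filing history."""
    return "{}/company/{}/filing-history".format(API_BASE, company_number)


def document_content_url(document_id):
    """URL for the content of a specific filed document."""
    return "{}/document/{}/content".format(DOCUMENT_BASE, document_id)
```

Company numbers should be passed as strings (for example `"00002404"`) so that leading zeros are preserved.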

In [1]:
import requests
import json
import shutil
import pymongo
import random
import importlib
import os
import pickle

import time as tic
import numpy as np
import pandas as pd

import xbrl_image_parser as xip
import xbrl_parser as xp

## Putting together the code for querying the Companies House API

In [None]:
information_url = "https://api.companieshouse.gov.uk/company/{}/filing-history"    # format with CH number
document_url = "http://document-api.companieshouse.gov.uk/document/{}" # format with doc id

In [None]:
# Gets an api key I saved to a text file (this stuff to avoid sharing the API key on GitHub)
with open("CH_api_key.txt") as f:
    key = f.read().split(":")[-1].strip()

In [None]:
r = requests.get(information_url.format("00002404"), auth=(key, ""))

In [None]:
r.json()

In [None]:
# This collects the document metadata links for all annual accounts (type "AA") filings

doc_ids = []

for each in r.json()['items']:
    
    if each['type'] == "AA":
        doc_ids.append( each['links']['document_metadata'])

doc_ids

In [None]:
aa = requests.get(doc_ids[0] + "/content", auth=(key, ""))

In [None]:
with open("test.pdf", "wb") as f:
        f.write(aa.content)

## Create a list of companies for which we don't have an electronic record on file

In [None]:
# Connect to mongodb of digital records for purposes of cross-checking
# Believe I've previously generated an index on CH code so searching should be fast
import pymongo

cl = pymongo.MongoClient()
db = cl['CH_records']
col = db['digital_record_scrapes']

In [None]:
counter = 0
recorded = 0
output_path = "./output/CH_no_digital_records.csv"

# Load the very large CSV in chunks (to fit within RAM)
for chunk in pd.read_csv("~/data/BasicCompanyDataAsOneFile-2018-10-01.csv", chunksize=1000):
    print("Loaded chunk {}, containing {} records".format(counter, len(chunk)))

    # Iterate through the entries, checking if each exists in the database
    missing_rows = []
    for index, row in chunk.iterrows():
        doc_count = col.count_documents({'doc_companieshouseregisterednumber': row[' CompanyNumber']})

        # If it doesn't, record it
        if doc_count == 0:
            missing_rows.append(row)
            recorded += 1

    no_record = pd.DataFrame(missing_rows, columns=chunk.columns)

    # Create the output csv file (with header) on the first chunk
    if counter == 0:
        no_record.to_csv(output_path, index=False)
        print("Saved first chunk")

    # Append later chunks; an empty chunk must not overwrite the existing file
    elif len(no_record) > 0:
        no_record.to_csv(output_path, mode='a', header=False, index=False)
        print("Saved a chunk")

    counter += 1

In [None]:
col.create_index('doc_companieshouseregisterednumber')

## Determine whether, for each file with no digital record, a paper record was submitted

For each entry in the "doesn't have an electronic record" csv file, see if it has a paper record instead by querying the Companies House API.  Record the date of the entry.
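The filtering step can be separated from the HTTP request, which makes it easy to check in isolation. The sketch below assumes the filing-history response is a dict with an `items` list whose entries carry `type` and `date` fields, as used elsewhere in this notebook; the function name and the sample response are made up for illustration.

```python
# Filing types treated as accounts: annual accounts, modified annual accounts, balance sheet
ACCOUNT_TYPES = {"AA", "AAMD", "BS"}


def accounts_filing_dates(filing_history):
    """Return the dates of accounts-type filings in a filing-history response."""
    return [
        item["date"]
        for item in filing_history.get("items", [])
        if item.get("type") in ACCOUNT_TYPES and "date" in item
    ]


# Made-up response for illustration only
sample = {
    "items": [
        {"type": "AA", "date": "2018-03-21"},
        {"type": "CS01", "date": "2018-01-10"},
        {"type": "BS", "date": "2016-05-02"},
    ]
}
dates = accounts_filing_dates(sample)
```

Using `.get` with defaults means a response with no `items` key yields an empty list rather than raising.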

In [None]:
check = pd.read_csv("./output/CH_no_digital_records.csv", nrows=1000)

records = []

for chnum in check[' CompanyNumber']:

    try:
        tic.sleep(.2)
        r = requests.get(information_url.format(chnum), auth=(key, ""))
        for each in r.json()['items']:
            records.append({"chnum": chnum,
                            "type": each['type'],
                            "desc": each['description'],
                            "cat": each['category']})

    # Skip companies whose request fails or whose response lacks the expected fields
    except (requests.RequestException, ValueError, KeyError):
        continue

results = pd.DataFrame(records, columns=["chnum", "type", "desc", "cat"])

In [None]:
results[['desc', 'type', 'chnum']].groupby(['desc', 'type']).agg('count')

In [None]:
counter = 0
output_path = "./output/CH_no_digital_records_searched.csv"

for chunk in pd.read_csv("./output/CH_no_digital_records.csv", chunksize=100):

    searched_rows = []

    for index, row in chunk.iterrows():

        # Wait 0.2 seconds between requests to accommodate rate limiting by CH to 600 requests/minute
        tic.sleep(.2)

        try:
            r = requests.get(information_url.format(row[' CompanyNumber']), auth=(key, ""))
        except requests.RequestException:
            continue

        doc_dates = []

        try:
            for each in r.json()['items']:

                if each['type'] in ["AA", "AAMD", "BS"]:
                    doc_dates.append(each['date'])

            row['num_paper_records'] = len(doc_dates)
            row['paper_record_dates'] = ":".join(doc_dates)

        # The response was not JSON or lacked the expected fields
        except (ValueError, KeyError):
            row['num_paper_records'] = None
            row['paper_record_dates'] = None

        row['response_code'] = r.status_code
        searched_rows.append(row)

    results = pd.DataFrame(searched_rows)

    # Create the output csv file (with header) on the first chunk
    if counter == 0:
        results.to_csv(output_path, index=False)
        print("Saved first chunk")

    # Append the searched entries from later chunks to the output csv file
    elif len(results) > 0:
        results.to_csv(output_path, mode='a', header=False, index=False)
        print("Saved a chunk.  Reporting latest:", r.status_code, doc_dates)

    counter += 1

In [None]:
counter

## Calculate the percentage of companies with no digital record for which there IS a PDF record

In [None]:
pdf_df = pd.read_csv("./output/CH_no_digital_records_searched.csv")

In [None]:
pdf_df.head()

In [None]:
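# Little custom function for checking whether any date in the paper_record_dates field falls in or after the limit year
def check_recent(dates_str, limit=2017):
    try:
        years = [int(x[0:4]) >= limit for x in dates_str.split(":")]
    # Missing values (NaN) have no .split; empty strings can't be parsed as a year
    except (AttributeError, ValueError):
        return False

    return any(years)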

In [None]:
# Create a field reporting if there's a recent paper record
pdf_df['recent_record_exists'] = pdf_df['paper_record_dates'].apply(check_recent)

In [None]:
percent_paper = sum(pdf_df['recent_record_exists']) * 100.0 / len(pdf_df)
percent_paper

## Download a Random Sample of 1000 files

In [None]:
# pandas does not use Python's `random` module, so fix the seed via random_state
samples_df = pdf_df[pdf_df['recent_record_exists'] == True].sample(n=1000, random_state=7)

In [None]:
pdf_counter = 0

for index, row in samples_df.iterrows():
    
    try:
        # Get information about a company, includes a list of filed documents
        r = requests.get(information_url.format(row[' CompanyNumber']), auth=(key, ""))
        tic.sleep(0.1)
    
        # Extract a list of filed documents (tuple of links and dates)
        docs = []
        for each in r.json()['items']:
        
            # Get any document that has type "Annual Accounts", "Annual Accounts Modified", "Balance Sheet"
            if each['type'] in ["AA", "AAMD", "BS"]:
            
                # Detect if the document was filed as a "paper" account that was scanned
                try:
                    paper_filed = each['paper_filed']
                except:
                    paper_filed = False
            
                # record all this metadata
                docs.append( (each['links']['document_metadata'],
                            int(each['date'][0:4]),
                            each['date'],
                            paper_filed) )

        # select the most recent document (ISO-format dates sort correctly as strings)
        dates = [x[2] for x in docs]
        max_index = dates.index(max(dates))
    
        print(docs[max_index])
    
        try:
            aa = requests.get(docs[max_index][0] + "/content", auth=(key, ""))
    
        except:
            tic.sleep(0.2)
            try:
                aa = requests.get(docs[max_index][0] + "/content", auth=(key, ""))
            
            except:
                print("Failed on " + row[' CompanyNumber'])
                continue
        
        
        if "pdf" in aa.headers['Content-Type']:
            file_end = ".pdf"
            pdf_counter += 1
            
        else:
            file_end = ".html"
        
        with open("./api_requested_documents/" + row[' CompanyNumber'] + "_" + docs[max_index][2] + file_end, "wb") as f:
            f.write(aa.content)
        tic.sleep(0.1)
    
    except Exception as e:
        print("Failed on ", row[' CompanyNumber'])
        print(e)
        continue
        
print(pdf_counter)

## Parse every one of the downloaded files (that is a PDF)

In [None]:
importlib.reload(xip)

# Get a list of all of the pdf files in the directory "api_requested_documents"
files = ["./api_requested_documents/"+filename for filename in os.listdir("./api_requested_documents") if filename.endswith(".pdf")]

times = {}

for file in files:
    
    print("Processing: " + file)

    try:
        t0 = tic.time()
        temp_df =  xip.process_PDF(file)
        t1 = tic.time()
        
        print("processed file {} in {}".format(file, t1-t0))
        # str.strip removes characters, not a suffix, so swap the extension with splitext
        temp_df.to_csv(os.path.splitext(file)[0] + ".csv")
        times[file] = t1-t0
        
    except Exception as e:
        print("Failed on: ", file)
        print(e)

Processing: ./api_requested_documents/SC315245_2017-11-13.pdf
Converting PDF image to multiple png files
./api_requested_documents/SC315245_2017-11-13.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/SC315245_2017-11-13.pdf in 79.95127010345459
Processing: ./api_requested_documents/02031807_2018-03-21.pdf
Converting PDF image to multiple png files
./api_requested_documents/02031807_2018-03-21.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/02031807_2018-03-21.pdf in 70.65506839752197
Processing: ./api_requested_documents/07623258_2019-02-05.pdf
Converting PDF image to multiple png files
./api_requested_documents/07623258_2019-02-05.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/07623258_2019-02-05.pdf in 14.320111751556396
Processing: ./api_requested_documents/05157431_2018-06-06.pdf
Converting PDF image to multiple png files
./api_requested_documents/05157431_2018

Skipping line 24: Expected 12 fields in line 24, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.


processed file ./api_requested_documents/05157431_2018-06-06.pdf in 34.437753438949585
Processing: ./api_requested_documents/09727421_2018-05-07.pdf
Converting PDF image to multiple png files
./api_requested_documents/09727421_2018-05-07.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/09727421_2018-05-07.pdf in 9.775703430175781
Processing: ./api_requested_documents/07219146_2018-10-06.pdf
Converting PDF image to multiple png files
./api_requested_documents/07219146_2018-10-06.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/07219146_2018-10-06.pdf in 27.014737367630005
Processing: ./api_requested_documents/10146931_2018-05-04.pdf
Converting PDF image to multiple png files
./api_requested_documents/10146931_2018-05-04.pdf
Performing pre-processing on all png images
Failed on:  ./api_requested_documents/10146931_2018-05-04.pdf
'currYr'
Processing: ./api_requested_documents/03440203_2019-02-07.pdf
Converting

Skipping line 13: Expected 12 fields in line 13, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.


processed file ./api_requested_documents/03440203_2019-02-07.pdf in 69.41715574264526
Processing: ./api_requested_documents/04848475_2018-10-06.pdf
Converting PDF image to multiple png files
./api_requested_documents/04848475_2018-10-06.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/04848475_2018-10-06.pdf in 318.4305477142334
Processing: ./api_requested_documents/08056186_2017-04-27.pdf
Converting PDF image to multiple png files
./api_requested_documents/08056186_2017-04-27.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/08056186_2017-04-27.pdf in 50.633567333221436
Processing: ./api_requested_documents/07687603_2018-09-11.pdf
Converting PDF image to multiple png files
./api_requested_documents/07687603_2018-09-11.pdf
Performing pre-processing on all png images
Failed on:  ./api_requested_documents/07687603_2018-09-11.pdf
'currYr'
Processing: ./api_requested_documents/03569937_2018-12-24.pdf
Converting 

Performing pre-processing on all png images
Failed on:  ./api_requested_documents/09447254_2017-08-15.pdf
'currYr'
Processing: ./api_requested_documents/05407505_2018-10-02.pdf
Converting PDF image to multiple png files
./api_requested_documents/05407505_2018-10-02.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/05407505_2018-10-02.pdf in 101.97738289833069
Processing: ./api_requested_documents/08907676_2018-05-22.pdf
Converting PDF image to multiple png files
./api_requested_documents/08907676_2018-05-22.pdf
Performing pre-processing on all png images
Failed on:  ./api_requested_documents/08907676_2018-05-22.pdf
'currYr'
Processing: ./api_requested_documents/11060238_2019-02-07.pdf
Converting PDF image to multiple png files
./api_requested_documents/11060238_2019-02-07.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/11060238_2019-02-07.pdf in 15.878416538238525
Processing: ./api_requested_documents/10713

Skipping line 227: Expected 12 fields in line 227, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 752: Expected 12 fields in line 752, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.


Failed on:  ./api_requested_documents/10713517_2018-04-10.pdf
'currYr'
Processing: ./api_requested_documents/SC215880_2018-08-16.pdf
Converting PDF image to multiple png files
./api_requested_documents/SC215880_2018-08-16.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/SC215880_2018-08-16.pdf in 15.048614501953125
Processing: ./api_requested_documents/01291565_2018-10-17.pdf
Converting PDF image to multiple png files
./api_requested_documents/01291565_2018-10-17.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/01291565_2018-10-17.pdf in 26.6050865650177
Processing: ./api_requested_documents/03010158_2018-08-08.pdf
Converting PDF image to multiple png files
./api_requested_documents/03010158_2018-08-08.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/03010158_2018-08-08.pdf in 102.51154565811157
Processing: ./api_requested_documents/09038360_2019-02-05.pdf
Converting 

Skipping line 93: Expected 12 fields in line 93, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 110: Expected 12 fields in line 110, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 145: Expected 12 fields in line 145, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 521: Expected 12 fields in line 521, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 541: Expected 12 fields in line 541, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 226: Expected 12 fields in line 226, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 86: Expected 12 fields in line 86, saw 13. Error could possibly be due to quotes being ignored when 

processed file ./api_requested_documents/02338548_2018-03-23.pdf in 535.9478747844696
Processing: ./api_requested_documents/05290340_2018-01-09.pdf
Converting PDF image to multiple png files
./api_requested_documents/05290340_2018-01-09.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/05290340_2018-01-09.pdf in 62.61824870109558
Processing: ./api_requested_documents/03074921_2018-09-28.pdf
Converting PDF image to multiple png files
./api_requested_documents/03074921_2018-09-28.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/03074921_2018-09-28.pdf in 14.964815378189087
Processing: ./api_requested_documents/08185172_2018-05-16.pdf
Converting PDF image to multiple png files
./api_requested_documents/08185172_2018-05-16.pdf
Performing pre-processing on all png images
Failed on:  ./api_requested_documents/08185172_2018-05-16.pdf
'currYr'
Processing: ./api_requested_documents/09469412_2019-01-05.pdf
Converting 

Skipping line 223: Expected 12 fields in line 223, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 138: Expected 12 fields in line 138, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 487: Expected 12 fields in line 487, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.


Failed to process line: 2018  7 2017
processed file ./api_requested_documents/09469412_2019-01-05.pdf in 31.81312346458435
Processing: ./api_requested_documents/10616881_2019-02-04.pdf
Converting PDF image to multiple png files
./api_requested_documents/10616881_2019-02-04.pdf
Performing pre-processing on all png images
Failed on:  ./api_requested_documents/10616881_2019-02-04.pdf
'currYr'
Processing: ./api_requested_documents/08704115_2018-09-08.pdf
Converting PDF image to multiple png files
./api_requested_documents/08704115_2018-09-08.pdf
Performing pre-processing on all png images
Failed on:  ./api_requested_documents/08704115_2018-09-08.pdf
'the label [0] is not in the [index]'
Processing: ./api_requested_documents/04051648_2018-10-06.pdf
Converting PDF image to multiple png files
./api_requested_documents/04051648_2018-10-06.pdf
Performing pre-processing on all png images


Skipping line 532: Expected 12 fields in line 532, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 537: Expected 12 fields in line 537, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 552: Expected 12 fields in line 552, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 566: Expected 12 fields in line 566, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.


Failed on:  ./api_requested_documents/04051648_2018-10-06.pdf
'currYr'
Processing: ./api_requested_documents/04809823_2018-07-12.pdf
Converting PDF image to multiple png files
./api_requested_documents/04809823_2018-07-12.pdf
Performing pre-processing on all png images
Failed on:  ./api_requested_documents/04809823_2018-07-12.pdf
'currYr'
Processing: ./api_requested_documents/02222361_2018-10-22.pdf
Converting PDF image to multiple png files
./api_requested_documents/02222361_2018-10-22.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/02222361_2018-10-22.pdf in 14.694850444793701
Processing: ./api_requested_documents/10647114_2018-11-30.pdf
Converting PDF image to multiple png files
./api_requested_documents/10647114_2018-11-30.pdf
Performing pre-processing on all png images
Failed on:  ./api_requested_documents/10647114_2018-11-30.pdf
'currYr'
Processing: ./api_requested_documents/04317894_2018-06-11.pdf
Converting PDF image to multiple png file

Skipping line 46: Expected 12 fields in line 46, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.


processed file ./api_requested_documents/09154504_2017-06-16.pdf in 82.81708145141602
Processing: ./api_requested_documents/07564261_2018-12-18.pdf
Converting PDF image to multiple png files
./api_requested_documents/07564261_2018-12-18.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/07564261_2018-12-18.pdf in 18.460702180862427
Processing: ./api_requested_documents/01309169_2018-10-01.pdf
Converting PDF image to multiple png files
./api_requested_documents/01309169_2018-10-01.pdf
Performing pre-processing on all png images
Failed on:  ./api_requested_documents/01309169_2018-10-01.pdf
'currYr'
Processing: ./api_requested_documents/03209358_2018-09-28.pdf
Converting PDF image to multiple png files
./api_requested_documents/03209358_2018-09-28.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/03209358_2018-09-28.pdf in 24.733627319335938
Processing: ./api_requested_documents/02080819_2018-04-24.pdf
Converting

Skipping line 50: Expected 12 fields in line 50, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 135: Expected 12 fields in line 135, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.


Failed on:  ./api_requested_documents/00547325_2018-06-06.pdf
'currYr'
Processing: ./api_requested_documents/05218852_2019-02-08.pdf
Converting PDF image to multiple png files
./api_requested_documents/05218852_2019-02-08.pdf
Performing pre-processing on all png images


Skipping line 142: Expected 12 fields in line 142, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 246: Expected 12 fields in line 246, saw 13. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.


processed file ./api_requested_documents/05218852_2019-02-08.pdf in 57.36773490905762
Processing: ./api_requested_documents/03337861_2018-11-13.pdf
Converting PDF image to multiple png files
./api_requested_documents/03337861_2018-11-13.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/03337861_2018-11-13.pdf in 5.415456295013428
Processing: ./api_requested_documents/05242411_2018-03-21.pdf
Converting PDF image to multiple png files
./api_requested_documents/05242411_2018-03-21.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/05242411_2018-03-21.pdf in 32.899619579315186
Processing: ./api_requested_documents/03828538_2018-06-07.pdf
Converting PDF image to multiple png files
./api_requested_documents/03828538_2018-06-07.pdf
Performing pre-processing on all png images
processed file ./api_requested_documents/03828538_2018-06-07.pdf in 52.18483352661133
Processing: ./api_requested_documents/06318736_2018-05-03.