## Companies House API

**Documentation:**

[Overview](https://developer.companieshouse.gov.uk/api/docs/index/gettingStarted.html#overview)   
[Authentication](https://developer.companieshouse.gov.uk/api/docs/index/gettingStarted/apikey_authorisation.html)

- uses HTTP basic authentication, but requires only a single **API key** rather than the usual `username:password` pair of values

- we have provided a key for you to use below (but please register for your own if you are to continue using the API in future)
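
For reference, HTTP basic authentication encodes `username:password` in Base64 and sends the result in the `Authorization` header. Below is a minimal sketch of that encoding, assuming the API key takes the place of the username and the password is left empty (our reading of the authentication page linked above). `basic_auth_header` and the sample key are hypothetical, used only for illustration:

```python
import base64

def basic_auth_header(api_key):
    """Build a Basic auth header value with the key as username and an empty password."""
    # Basic auth encodes "username:password"; the password part is empty here
    credentials = f"{api_key}:".encode("ascii")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

# "my_api_key" is a placeholder, not a real key
header_value = basic_auth_header("my_api_key")
print(header_value)
```

With `requests`, passing `auth=(key, "")` to `requests.get` should build an equivalent header for us.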

In [None]:
import json
import requests

In [None]:
key = "GVhX2aQDL8l1C0t8J2QLOW8aX4JU7byCLc5oAS7D"

In [None]:
BASE_URL = "https://api.companieshouse.gov.uk"

In [None]:
resp = requests.get("https://api.companieshouse.gov.uk/search/companies",
            params={"q": "HSBC"},
            headers={"Authorization": key})

The search for HSBC returns hundreds of results:

In [None]:
hsbc_data = resp.json()
hsbc_data

If we know the company number of the specific company we are interested in, we can use that instead:

- each company has its own endpoint, with the final part of the URL being the company number

In [None]:
HSBC_NUMBER = "06388542"

In [None]:
hsbc_resp = requests.get(f"https://api.companieshouse.gov.uk/company/{HSBC_NUMBER}",
            headers={"Authorization": key})

We can then access the JSON data:

In [None]:
data = hsbc_resp.json()
data

... and access specific information of interest:

- notice that Python dictionaries also have a `.get()` method

In [None]:
sic_codes = data.get("sic_codes")
sic_codes
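
A small illustration of why `.get()` can be handy: indexing with square brackets raises a `KeyError` when a key is missing, while `.get()` returns `None` or a default of our choosing. The `record` dictionary below is made-up sample data, not a real API response:

```python
# Made-up record for illustration only
record = {"company_name": "EXAMPLE LTD", "company_status": "active"}

# .get() returns None when the key is absent...
missing = record.get("sic_codes")

# ...or a fallback value if we supply one
codes = record.get("sic_codes", [])

# Square brackets raise KeyError for a missing key
try:
    record["sic_codes"]
    raised = False
except KeyError:
    raised = True

print(missing, codes, raised)  # None [] True
```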

### Making larger numbers of API calls

In [None]:
resp = requests.get("https://api.companieshouse.gov.uk/search/companies",
                    params={"q": "Lloyds"},
                    headers={"Authorization": key})

In [None]:
total = resp.json()['total_results']
total

Our request for details on companies using `Lloyds` as the search term matches a lot of results!

Unfortunately, the response does not contain data for all of them:

In [None]:
lloyds_comps = resp.json()['items']
len(lloyds_comps)

The [documentation](https://developer.companieshouse.gov.uk/api/docs/search/search.html) provides information on some parameters which may be useful to access all of the records.

In [None]:
resp = requests.get("https://api.companieshouse.gov.uk/search/companies",
                    params={"q": "Lloyds",
                            "items_per_page": 989},
                    headers={"Authorization": key})

In [None]:
lloyds_comps = resp.json()['items']
len(lloyds_comps)

It seems that the maximum number of records per page is 100. We would therefore need to use the `start_index` parameter as well if we wanted to collect more records beyond the first 100.
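
One way to page through the results is to request successive pages, moving `start_index` forward by the page size each time. The sketch below only works out the index values. `page_starts` is a hypothetical helper, the figure 989 is just an example total, and the sketch assumes `start_index` counts from zero and that a page holds at most 100 records, as observed above:

```python
def page_starts(total_results, per_page=100):
    """Return the start_index value for each page needed to cover total_results."""
    return list(range(0, total_results, per_page))

# Example total; in practice this would come from 'total_results' in the response
starts = page_starts(989)
print(starts)
```

Each value could then be passed as `start_index` alongside `q` and `items_per_page` in `params`, one request per page.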

In [None]:
total

Let's stick with the first 100, and create a list of the `company_number` from each record:

In [None]:
numbers = [comp['company_number'] for comp in lloyds_comps]
print(numbers)

We could then use this list to access the company-specific endpoint for each one, to gather some specific information:

In [None]:
comp_statuses = []

for comp in numbers:
    comp_resp = requests.get(f"https://api.companieshouse.gov.uk/company/{comp}",
            headers={"Authorization": key})
    status = comp_resp.json()['company_status']
    comp_statuses.append(status)

In [None]:
print(comp_statuses)

### API usage limits

The owner of an API will typically set limits on the number of requests which can be made in a given time period, for reasons of: 

- performance
- cost
- security

Therefore when making higher numbers of requests we may need to reduce their frequency.

The Python `time` module (and its `.sleep()` function) can help:

In [None]:
import time

In [None]:
print('hello')
time.sleep(5)
print('hello again')

In [None]:
def get_type(number):
    # Look up a single company by its company number and return its type
    comp_resp = requests.get(f"https://api.companieshouse.gov.uk/company/{number}",
            headers={"Authorization": key})
    comp_type = comp_resp.json()['type']

    return comp_type

In [None]:
my_companies = ['04280591', '05017245', '04440298', '00212497']

A safe option could be to sleep after every request:

In [None]:

comp_types = []
for company in my_companies:
    comp_types.append(get_type(company))    
    time.sleep(5)

In [None]:
comp_types

If we know the limit, we could sleep after reaching it:

In [None]:
comp_types = []
LIMIT = 600
counter = 0

for company in my_companies:
    comp_types.append(get_type(company))
    counter += 1

    # Once the limit is reached, pause for five minutes and start counting again
    if counter >= LIMIT:
        time.sleep(60 * 5)
        counter = 0

In [None]:
comp_types
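
An alternative to sleeping once the limit is hit is to spread requests evenly so that the limit is never reached. Below is a rough sketch, assuming the limit of 600 applies to a five-minute window as in the cell above. `min_interval` and `throttled` are hypothetical helpers, and `fetch` stands in for a function such as `get_type`:

```python
import time

def min_interval(limit, window_seconds):
    """Smallest gap between requests that stays within `limit` per window."""
    return window_seconds / limit

def throttled(items, fetch, interval, sleep=time.sleep):
    """Call fetch on each item, pausing `interval` seconds between calls."""
    results = []
    for i, item in enumerate(items):
        if i > 0:
            sleep(interval)  # no need to wait before the first request
        results.append(fetch(item))
    return results

interval = min_interval(600, 60 * 5)  # 0.5 seconds between requests

# Stand-in fetch function and a no-op sleep so the example runs instantly
demo = throttled(["a", "b", "c"], fetch=str.upper, interval=interval,
                 sleep=lambda seconds: None)
print(interval, demo)  # 0.5 ['A', 'B', 'C']
```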

Retrying a failed request over and over, with no cap on the number of attempts, should be avoided:

In [None]:
# Bad example: unlimited retries like this can get your key banned

comp_types = []
index = 0

# Using while so we can retry the same index multiple times if it fails
while index < len(my_companies):
    company = my_companies[index]
    print(company)
    try:
        comp_types.append(get_type(company))
        index += 1
        
    except Exception:
        time.sleep(5)
        
        continue

In [None]:
comp_types
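
A gentler pattern is to cap the number of attempts and wait longer after each failure (often called exponential backoff). This is a sketch rather than a recommended setting for this API: `fetch_with_backoff`, `max_attempts` and `base_delay` are hypothetical names and values chosen for illustration, and `flaky` is a stand-in for a real request function:

```python
import time

def fetch_with_backoff(fetch, item, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Try fetch(item) up to max_attempts times, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch(item)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up rather than retrying forever
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...

# Demo with a stand-in function that fails twice before succeeding
calls = {"count": 0}

def flaky(item):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return item.upper()

# Record the waits in a list instead of actually sleeping
waits = []
result = fetch_with_backoff(flaky, "ltd", sleep=waits.append)
print(result, waits)  # LTD [1.0, 2.0]
```

Because the final failure is re-raised, the calling loop can decide whether to skip that company or stop altogether.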