# Extracting Data from Ordnance Survey

## API familiarisation

In this assignment we are going to be working with data from the Ordnance Survey ("OS"). It provides a large number of datasets which are summarised on its __[Data Hub](https://osdatahub.os.uk)__. We will be using the **Names API** - take a minute to read the __[overview](https://osdatahub.os.uk/docs/names/overview)__.

The main purpose of this assignment is to extract the names of all the hospitals in Cambridge and to save the result in a form that other users could easily access.

## Accessing the API

First, import the libraries that we will need during this assignment.

In [None]:
# Import libraries
import requests
import json
import pandas as pd

You will need to register with the OS to use the Names API. Go to the __[homepage](https://osdatahub.os.uk/)__ and click on the 'Sign up' button in the top right and then, on the next page, click on the 'Sign-up' button within the 'Free OS OpenData' option. Follow the instructions to create an account.

Now turn to the 'Getting Started' instructions __[here](https://osdatahub.os.uk/docs/names/gettingStarted)__.  You will immediately see that you need an API key to access the Names API and a bullet list of instructions is provided to guide you through the process. Follow these instructions. (Note: the 'Project' which you are asked to set up is used by the OS to monitor the rate of API calls and to impose a limit if too many calls are made - an extremely unlikely eventuality in this assignment.)

Now turn to the __[technical specification](https://osdatahub.os.uk/docs/names/technicalSpecification)__ and note that under 'Authentication' there are a number of options, including: _"You can choose to authenticate your API request using a HTTP header. The header name should be 'key', and the value should be the Project API Key."_

In this assignment we are using the Python requests library to access the API. Therefore we will use this header option and set up a dictionary to hold the API key information (following the example in your workshop).

**Part 1** Write code which does the following:
- Creates a dictionary, assigned to the variable `api_key`, to hold the authentication information, with the *key: value* pair as described in the technical specification (and repeated in italics above).
- Assigns the string `'https://api.os.uk/search/names/v1/find'` to the variable `path`.
- Assigns to the variable `param` a dictionary whose key is 'query' (see the table under 'Operation' in the technical specification) and whose value is 'Cambridge'.

In [None]:
# Add your code here
# path = ...
# param = ...
api_key = {'key': 'YOUR_API_KEY'}  # placeholder: replace with your own Project API Key
path = 'https://api.os.uk/search/names/v1/find'
param = {'query': 'Cambridge'}
param
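Hard-coding a key in a notebook makes it easy to share by accident. As a minimal sketch of one alternative, the key could be read from an environment variable instead. The variable name `OS_API_KEY` and the helper `build_auth_header` are hypothetical names chosen for this illustration, not part of the OS documentation.

```python
import os


def build_auth_header(env_var="OS_API_KEY"):
    """Return the header dictionary described in the technical specification.

    env_var is a hypothetical environment variable name used for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Project API Key")
    return {"key": key}


# Demonstration with a dummy value; a real key would be set outside the notebook
os.environ["OS_API_KEY"] = "dummy-key-for-illustration"
api_key = build_auth_header()
print(api_key)
```

The resulting dictionary has the same shape as the one built by hand above, so it can be passed to `requests.get()` through the `headers` argument in the same way.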


Next we will use the above variables together with the requests library, to make a request to the OS Names API and to check that the request has been successful.

**Part 2** Write code which uses the `get()` method of the requests library to make a call to the OS Names API, using all the variables set up in Part 1 above. Check the status code of the response to make sure the request succeeded, and assign that status code to the variable `status_code`.

In [None]:
# Add your code here
# status_code = ...
api_key = {'key': 'YOUR_API_KEY'}  # placeholder: replace with your own Project API Key
path = 'https://api.os.uk/search/names/v1/find'
param = {'query': 'Cambridge'}
res = requests.get(path, params=param, headers=api_key)
status_code = res.status_code
status_code
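A status code of 200 means the request succeeded; other values usually point to a problem with the key or the parameters. The sketch below uses only the standard library to turn a code into a readable message. The helper name `describe_status` is illustrative, and treating everything outside the 2xx range as a failure is a simplification made for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the standard HTTP status codes
        phrase = "Unknown status"
    # Simplification for this sketch: anything outside 2xx counts as a failure
    outcome = "success" if 200 <= code < 300 else "failure"
    return f"{code} {phrase} ({outcome})"


print(describe_status(200))
print(describe_status(401))
```

With requests, calling `res.raise_for_status()` on the response is a common alternative: it raises an exception for 4xx and 5xx responses instead of returning a message.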


We will use the contents of the response to find out how many results are returned for the search term 'Cambridge'.

**Part 3**  Write code which does the following:
- Uses the `.json()` method on the response from Part 2, and prints the result to the screen
- Extracts the value of `totalresults`, which can be seen as part of the dictionary associated with the key `header`. This extracted value should be assigned to the variable `total_results`.

In [None]:
# Add your code here
# total_results = ...
api_key = {'key': 'YOUR_API_KEY'}  # placeholder: replace with your own Project API Key
path = 'https://api.os.uk/search/names/v1/find'
param = {'query': 'Cambridge'}
res = requests.get(path, params=param, headers=api_key)
res_dict = res.json()

print(res_dict)

total_results = res_dict['header']['totalresults']
total_results
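Indexing with `res_dict['header']['totalresults']` raises a `KeyError` if either key is missing, for example when the API returns an error message instead of results. A more defensive lookup could be sketched as follows. The dictionary `sample_response` is a hand-written stand-in with made-up values, and `get_total_results` is a hypothetical helper name.

```python
# Hand-written stand-in for res.json(); the values are made up for illustration
sample_response = {
    "header": {"totalresults": 3},
    "results": [
        {"GAZETTEER_ENTRY": {}},
        {"GAZETTEER_ENTRY": {}},
        {"GAZETTEER_ENTRY": {}},
    ],
}


def get_total_results(response_dict):
    """Return header['totalresults'], or 0 if either key is missing."""
    return response_dict.get("header", {}).get("totalresults", 0)


print(get_total_results(sample_response))
print(get_total_results({}))
```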


## Exploring the API functionality
Now we will explore the functionality of the API and find out what it offers that might prove useful.

Look at the __[technical specification](https://osdatahub.os.uk/docs/names/technicalSpecification)__ and scroll down to the heading 'Operation'. You will notice that it describes two types of request: 'find' and 'nearest'.  We are using the former - see the `path` variable above.

The table under 'Operation' lists the parameters that are available for find requests: two are required (which we have used above) and five are optional. We will be using two of the optional parameters, namely `maxresults` and `fq`. First we will explore `maxresults` which is described as _'The maximum number of results to return. Default: 100.'_

**Part 1** Write code which sets the maximum number of results to 10. Print the contents of the response and examine the value of the 'header' key - is it what you would expect?

Assign the value of the 'header' key to the variable `res_header`.

In [None]:
# Add your code here
# res_header = ...
api_key = {'key': 'YOUR_API_KEY'}  # placeholder: replace with your own Project API Key
path = 'https://api.os.uk/search/names/v1/find'
param = {'query': 'Cambridge', 'maxresults': 10}
res = requests.get(path, params=param, headers=api_key)
res_dict = res.json()

print(res_dict)

res_header = res_dict['header']
res_header
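The header is worth a close look because the number of results returned in one response and the number of matches overall are different things. The sketch below works through the arithmetic on a made-up header. It assumes that `totalresults` counts every match while `maxresults` only caps a single response, and that later batches could be requested with a paging parameter such as `offset`; both assumptions should be checked against the technical specification. The helper names are illustrative.

```python
import math

# Made-up header values for illustration; a real header comes from res.json()['header']
sample_header = {"maxresults": 10, "totalresults": 235}


def requests_needed(total_results, max_results):
    """Number of requests needed to collect every match, max_results at a time."""
    if max_results <= 0:
        raise ValueError("max_results must be positive")
    return math.ceil(total_results / max_results)


def batch_offsets(total_results, max_results):
    """Starting position of each batch (assumes an 'offset'-style paging parameter)."""
    return list(range(0, total_results, max_results))


# Results in this single response: capped by maxresults
returned_now = min(sample_header["maxresults"], sample_header["totalresults"])
print(returned_now)
print(requests_needed(235, 100))
print(batch_offsets(235, 100))
```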


## Filtering the response

Our task is to find all the hospitals in Cambridge and, for this, we need to use the parameter `fq` from the API. The specification says _'Filters the results by bounding box or local_type.'_ We will use 'local_type' which is specified in more detail just a bit further down in the technical specification under 'Filtering'. Take a minute to look through the table which shows 'Type' and 'local_type' for a variety of geographical features.

You should be able to find a local_type for hospitals, which is what we need to use for the value of the parameter `fq`.

**Part 1** Write code which requests the hospitals in Cambridge. Set the value of `maxresults` back to its default value for this.  Convert the contents of the response to a Python dictionary and assign the dictionary to the variable `camb_hosp`.  (_Hint: the format for the value of `fq` is non-standard and not immediately obvious from the documentation.  Use 'LOCAL_TYPE:Hospital'_.)

In [None]:
# Add your code here
# camb_hosp = ...
api_key = {'key': 'YOUR_API_KEY'}  # placeholder: replace with your own Project API Key
path = 'https://api.os.uk/search/names/v1/find'
param = {'query': 'Cambridge', 'fq':'LOCAL_TYPE:Hospital'}
res = requests.get(path, params=param, headers=api_key)
camb_hosp = res.json()
camb_hosp
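The `fq` value is a single string in which the field name and the wanted value are joined by a colon. A small sketch of building the parameter dictionary for any local type is shown below. The helper `build_filter_params` is a hypothetical name, and `urlencode` from the standard library is used only to show roughly how such a dictionary looks once encoded into a query string; requests performs its own encoding when the dictionary is passed through `params`.

```python
from urllib.parse import urlencode


def build_filter_params(query, local_type):
    """Return request parameters that filter a search by local_type."""
    return {"query": query, "fq": f"LOCAL_TYPE:{local_type}"}


param = build_filter_params("Cambridge", "Hospital")
print(param)

# The colon is percent-encoded when the dictionary is turned into a query string
print(urlencode(param))
```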


**Part 2** Write code which uses the dictionary `camb_hosp` from above and puts the value of the 'results' key into a Python list. There is one hospital per item in the list. How many hospitals are there? Assign the answer to the variable `hosp_no`.

In [None]:
# Add your code here
# hosp_no = ...
api_key = {'key': 'YOUR_API_KEY'}  # placeholder: replace with your own Project API Key
path = 'https://api.os.uk/search/names/v1/find'
param = {'query': 'Cambridge', 'fq':'LOCAL_TYPE:Hospital'}
res = requests.get(path, params=param, headers=api_key)
camb_hosp = res.json()
hosp_no = len(camb_hosp['results'])
hosp_no


## Saving results to file

At this stage we have extracted the data on Cambridge hospitals and it could be useful to save the information as a JSON file so that it could be passed on to someone else.

**Part 1** Write code so that the contents of the response to the request for all hospitals in Cambridge (i.e. Part 1 of the previous section) are saved to a JSON file in the `data/` folder.  Check that it has been saved correctly by reading it back in to a Python dictionary called `test_ok`.

In [None]:
# Add your code here
# Make sure your last line of code is test_ok = .....
api_key = {'key': 'YOUR_API_KEY'}  # placeholder: replace with your own Project API Key
path = 'https://api.os.uk/search/names/v1/find'
param = {'query': 'Cambridge', 'fq':'LOCAL_TYPE:Hospital'}
res = requests.get(path, params=param, headers=api_key)

with open('data/res_file.json', 'w') as f:
    json.dump(res.json(), f) 

with open('data/res_file.json', 'r') as f:
    test_ok = json.load(f)
test_ok
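Writing to `data/res_file.json` fails if the `data/` folder does not exist yet, and a successful write does not by itself prove that the content survived. The sketch below creates the folder if needed, writes a dictionary, reads it back and compares the two. It uses a temporary directory and a hand-written `sample` dictionary so that it can run anywhere; `save_and_reload` is a hypothetical helper name.

```python
import json
import tempfile
from pathlib import Path


def save_and_reload(data, file_path):
    """Write data as JSON, creating the parent folder if needed, then read it back."""
    file_path = Path(file_path)
    file_path.parent.mkdir(parents=True, exist_ok=True)
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)


# Made-up data with the same overall shape as the API response
sample = {
    "header": {"totalresults": 1},
    "results": [{"GAZETTEER_ENTRY": {"COUNTY_UNITARY": "Cambridgeshire"}}],
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "res_file.json"
    reloaded = save_and_reload(sample, target)

# True when the round trip preserved the content
print(reloaded == sample)
```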



## Reading JSON from file

Now we will load a JSON file to continue with the assignment. We will _not_ use the file you have just saved; instead, we will use `data/OS_hosp_in_camb.json`. It has the same structure as the file you have just saved, so you can assume that all of its keys and nesting are the same.

**Part 1** Read the file `data/OS_hosp_in_camb.json` into a Python dictionary called `os_camb_hosp`.

In [None]:
# Add your code here
# os_camb_hosp = ...
with open('data/OS_hosp_in_camb.json', 'r') as f:
    os_camb_hosp = json.load(f)

os_camb_hosp


**Part 2** Write code to find the type of `os_camb_hosp` and assign it to the variable `os_camb_hosp_type`.

In [None]:
# Add your code here
# os_camb_hosp_type = ...
with open('data/OS_hosp_in_camb.json', 'r') as f:
    os_camb_hosp = json.load(f)
os_camb_hosp_type = type(os_camb_hosp)
os_camb_hosp_type
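The type returned by `json.load()` depends on the JSON being read: each JSON construct maps to a particular Python type. A short sketch using `json.loads()` on a made-up string illustrates the mapping; the field names `flag`, `score` and `note` are invented for this example.

```python
import json

# Made-up JSON text for illustration
text = '{"header": {"totalresults": 2}, "results": [{"flag": true, "score": 1.5, "note": null}]}'
parsed = json.loads(text)

first = parsed["results"][0]
print(type(parsed))                            # JSON object -> dict
print(type(parsed["results"]))                 # JSON array  -> list
print(type(parsed["header"]["totalresults"]))  # JSON integer -> int
print(type(first["score"]))                    # JSON real number -> float
print(first["flag"], first["note"])            # true -> True, null -> None
```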


## Extracting to a DataFrame

We have now loaded a JSON file as a Python dictionary. In the previous questions, we have examined the content using Python dictionary and list operations. Now we will move on to using pandas to look at the data in more detail.

**Part 1** Write code which uses the dictionary produced from the previous question, together with the findings from the earlier questions, to read the value of the 'results' key into a pandas DataFrame assigned to the variable `df`.

In [None]:
# Add your code here
# df = ...
with open('data/OS_hosp_in_camb.json', 'r') as f:
    os_camb_hosp = json.load(f)

df = pd.DataFrame(os_camb_hosp['results'])
df


If you take a look at `df`, it isn't particularly helpful, as the values for `GAZETTEER_ENTRY` have all been put into one column. Let's unpack that column.

**Part 2** Use `json_normalize()` to unpack the `GAZETTEER_ENTRY` column in `df` (you will need to import the function from `pandas` first) and assign the resulting DataFrame to the variable `df_unpack`.

In [None]:
# Add your code here
# Make sure your last line of code is df_unpack = .....
from pandas import json_normalize

with open('data/OS_hosp_in_camb.json', 'r') as f:
    os_camb_hosp = json.load(f)

df = pd.DataFrame(os_camb_hosp['results'])
df_unpack = json_normalize(df['GAZETTEER_ENTRY'])
df_unpack
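To see what the unpacking does, it can help to try it on a tiny hand-made example. In the sketch below, `sample_results` is made-up data and `EXAMPLE_NAME` is an invented field; only `GAZETTEER_ENTRY` and `COUNTY_UNITARY` are taken from the text above. The column is converted to a plain list of dictionaries before being passed to `pd.json_normalize()`, which is one of the input forms the function accepts.

```python
import pandas as pd

# Made-up records; EXAMPLE_NAME is an invented field for illustration
sample_results = [
    {"GAZETTEER_ENTRY": {"EXAMPLE_NAME": "Hospital A", "COUNTY_UNITARY": "Cambridgeshire"}},
    {"GAZETTEER_ENTRY": {"EXAMPLE_NAME": "Hospital B", "COUNTY_UNITARY": "Surrey"}},
]

# Before: a single column whose cells each hold a whole dictionary
df = pd.DataFrame(sample_results)
print(df.shape)

# After: one column per key of the nested dictionaries
df_unpack = pd.json_normalize(df["GAZETTEER_ENTRY"].tolist())
print(list(df_unpack.columns))
```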


If you look at the DataFrame, you might notice that there are some hospitals in the list that don't appear to be in Cambridge - two in Surrey and one in Scotland! It appears that the search has picked up some names which are similar to 'Cambridge'. If we were going to use this dataset we would need to do some cleaning, possibly using the bounding box filter to retain only the hospitals within a certain distance of Cambridge. For now, we will just remove any hospitals which are not in Cambridgeshire.

**Part 3** Amend the DataFrame `df_unpack` to only retain the rows where `COUNTY_UNITARY` equals 'Cambridgeshire'.  Assign the result to the variable `df_clean`.

In [None]:
# Add your code here
# Make sure your last line of code is df_clean = .....
from pandas import json_normalize
with open('data/OS_hosp_in_camb.json', 'r') as f:
    os_camb_hosp = json.load(f)

df = pd.DataFrame(os_camb_hosp['results'])
df_unpack = json_normalize(df['GAZETTEER_ENTRY'])

df_clean = df_unpack[df_unpack['COUNTY_UNITARY'] == 'Cambridgeshire']
df_clean
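Filtering with a boolean mask keeps the original row labels, so the index of the result can have gaps. The sketch below shows the mask on made-up data and resets the index afterwards, which is optional and only a matter of tidiness. `EXAMPLE_NAME` is an invented column for this illustration.

```python
import pandas as pd

# Made-up data; EXAMPLE_NAME is an invented column for illustration
df_unpack = pd.DataFrame(
    {
        "EXAMPLE_NAME": ["Hospital A", "Hospital B", "Hospital C"],
        "COUNTY_UNITARY": ["Cambridgeshire", "Surrey", "Cambridgeshire"],
    }
)

# One True/False value per row
mask = df_unpack["COUNTY_UNITARY"] == "Cambridgeshire"
print(mask.tolist())

# Keep the matching rows and renumber them from zero
df_clean = df_unpack[mask].reset_index(drop=True)
print(df_clean)
```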


## Saving data

Having done some initial tidying up, the DataFrame now needs to be saved to a file so that it can be shared with others.

**Part 1** Write `df_clean` to a JSON file in the `data/` directory. Finally, to ensure that this has been successful, load the file you have just saved into a DataFrame called `test_df`.

In [None]:
# Add your code here
# test_df = ...
from pandas import json_normalize
with open('data/OS_hosp_in_camb.json', 'r') as f:
    os_camb_hosp = json.load(f)

df = pd.DataFrame(os_camb_hosp['results'])
df_unpack = json_normalize(df['GAZETTEER_ENTRY'])

df_clean = df_unpack[df_unpack['COUNTY_UNITARY'] == 'Cambridgeshire']
df_clean.to_json('data/df_hosp_in_camb.json')

test_df = pd.read_json('data/df_hosp_in_camb.json')
test_df
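By default `to_json()` writes the data column by column, keyed by the row index. If the file is meant for people who may not use pandas, a list of records can be easier to read. The sketch below shows that variant as one possible choice, not a requirement of the assignment; it uses made-up data, an invented `EXAMPLE_NAME` column and a temporary directory so that it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data; EXAMPLE_NAME is an invented column for illustration
df_clean = pd.DataFrame(
    {
        "EXAMPLE_NAME": ["Hospital A", "Hospital C"],
        "COUNTY_UNITARY": ["Cambridgeshire", "Cambridgeshire"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "df_sample.json"
    # orient="records" writes a JSON array with one object per row
    df_clean.to_json(out_file, orient="records")
    raw_text = out_file.read_text(encoding="utf-8")
    test_df = pd.read_json(out_file, orient="records")

print(raw_text)
# Compare the content row by row
print(test_df.to_dict("records") == df_clean.to_dict("records"))
```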
