![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)

# Web-scraping for Social Science Research

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites, social media platforms, text data, and simulations (agent-based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
April 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Guide-to-using-this-resource" data-toc-modified-id="Guide-to-using-this-resource-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Guide to using this resource</a></span><ul class="toc-item"><li><span><a href="#Interaction" data-toc-modified-id="Interaction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Collecting-data-from-online-databases-using-an-API" data-toc-modified-id="Collecting-data-from-online-databases-using-an-API-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Collecting data from online databases using an API</a></span><ul class="toc-item"><li><span><a href="#What-is-an-API?" 
data-toc-modified-id="What-is-an-API?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>What is an API?</a></span></li><li><span><a href="#Reasons-to-interact-with-an-API" data-toc-modified-id="Reasons-to-interact-with-an-API-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Reasons to interact with an API</a></span></li><li><span><a href="#Logic-of-using-an-API" data-toc-modified-id="Logic-of-using-an-API-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Logic of using an API</a></span></li></ul></li><li><span><a href="#Example:-Capturing-Covid-19-data" data-toc-modified-id="Example:-Capturing-Covid-19-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Example: Capturing Covid-19 data</a></span><ul class="toc-item"><li><span><a href="#Locating-the-API" data-toc-modified-id="Locating-the-API-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Locating the API</a></span></li><li><span><a href="#API-terms-of-use" data-toc-modified-id="API-terms-of-use-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>API terms of use</a></span></li><li><span><a href="#Locating-data" data-toc-modified-id="Locating-data-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Locating data</a></span></li><li><span><a href="#Registering-use-of-API" data-toc-modified-id="Registering-use-of-API-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Registering use of API</a></span></li><li><span><a href="#Requesting-data" data-toc-modified-id="Requesting-data-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Requesting data</a></span></li><li><span><a href="#Saving-results" data-toc-modified-id="Saving-results-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Saving results</a></span></li><li><span><a href="#Refining-Covid-19-data-collection" data-toc-modified-id="Refining-Covid-19-data-collection-4.7"><span class="toc-item-num">4.7&nbsp;&nbsp;</span>Refining Covid-19 data collection</a></span></li><li><span><a href="#Concluding-remarks-on-Covid-19-data" 
data-toc-modified-id="Concluding-remarks-on-Covid-19-data-4.8"><span class="toc-item-num">4.8&nbsp;&nbsp;</span>Concluding remarks on Covid-19 data</a></span></li><li><span><a href="#A-social-research-example:-UK-police-data" data-toc-modified-id="A-social-research-example:-UK-police-data-4.9"><span class="toc-item-num">4.9&nbsp;&nbsp;</span>A social research example: UK police data</a></span></li></ul></li><li><span><a href="#Value,-limitations-and-ethics" data-toc-modified-id="Value,-limitations-and-ethics-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Value, limitations and ethics</a></span><ul class="toc-item"><li><span><a href="#Value" data-toc-modified-id="Value-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Value</a></span></li><li><span><a href="#Limitations" data-toc-modified-id="Limitations-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Limitations</a></span></li><li><span><a href="#Ethical-considerations" data-toc-modified-id="Ethical-considerations-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Ethical considerations</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Bibliography" data-toc-modified-id="Bibliography-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Bibliography</a></span></li><li><span><a href="#Further-reading-and-resources" data-toc-modified-id="Further-reading-and-resources-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Further reading and resources</a></span></li><li><span><a href="#Appendices" data-toc-modified-id="Appendices-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Appendices</a></span><ul class="toc-item"><li><span><a href="#Appendix-A---Requesting-URLs" data-toc-modified-id="Appendix-A---Requesting-URLs-9.1"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>Appendix A - Requesting URLs</a></span></li><li><span><a href="#Appendix-B---Capturing-UK-Police-data" 
data-toc-modified-id="Appendix-B---Capturing-UK-Police-data-9.2"><span class="toc-item-num">9.2&nbsp;&nbsp;</span>Appendix B - Capturing UK Police data</a></span></li></ul></li></ul></div>

## Introduction

In this training series we cover some of the essential skills needed to collect data from the web. In particular we focus on two different approaches:
1. Collecting data stored on web pages. <a href="./web-scraping-websites-code-2020-04-23.ipynb" target=_blank>[LINK]</a>
2. Downloading data from online databases using Application Programming Interfaces (APIs). [Focus of this notebook]
    
Do not be alarmed by the technical aspects: both approaches can be implemented using simple code, a standard desktop or laptop, and a decent internet connection.    

Given the Covid-19 public health crisis in which this programme of work occurred, we will examine ways in which computational methods can provide valuable data for studying this phenomenon. This is a fast moving, evolving public health emergency that, in addition to other impacts, will shape research agendas across the sciences for years to come. Therefore it is important to learn how we, as social scientists, can access or generate data that will provide a better understanding of this disease.

In this lesson we will search the <a href="https://www.theguardian.com" target=_blank>Guardian newspaper's</a> online database for articles referring to Covid-19. We will also examine two APIs that should be of wider interest to the social science community: 
* UK Police API, which provides data on street-level crime and outcomes, police stations and much more. [IN DEVELOPMENT]
* Companies House API, which provides data on the governance, finances and activities of UK registered companies. [IN DEVELOPMENT]

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Collecting data from online databases using an API*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name)) 

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## Collecting data from online databases using an API

### What is an API?

An Application Programming Interface (API) is
> "a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service" (Oxford English Dictionary).

In essence, an API acts as an intermediary between software applications. Think of an API's role as similar to that of a translator facilitating a conversation between two individuals who do not speak the same language. Neither individual needs to know the other's language, just how to formulate their response in a way the translator can understand. Similarly, an API simplifies how applications communicate with each other.

It performs this role by providing a set of protocols/standards for making *requests* and formulating *responses* between applications. For example, a smart phone application might need real-time traffic data from an online database. An API can validate the application's request for data, and handle the online database's response (i.e., the transfer of data to the application). In the absence of an API, the smart phone application would need to know a lot more technical information about the online database in order to communicate with it (e.g., what commands does the database understand?). But thanks to the API, the smart phone application only needs to know how to formulate a request that the API understands, which then communicates the request to the database.

(If you want to learn more about APIs, especially from a technical perspective, we highly recommend one of our previous webinars: <a href="https://www.youtube.com/watch?v=ffSQHPHO_c0" target=_blank>What are APIs?</a>)

### Reasons to interact with an API

Many public, private and charitable institutions collect and share data of value to social scientists. Often they deposit their data in a data portal (e.g., <a href="https://data.gov.uk/" target=_blank>UK Government Open Data</a>), allowing you to download the files as and when needed. However, another approach they can adopt is to allow access to the underlying information that is stored in their database through an API. Using this method, individuals can send a customised *request* for information to the database; if the request is valid, the database *responds* by providing you with the information you asked for. Think of the difference between downloading a raw data file, which then needs to be filtered to arrive at the information you need (the data portal method), and performing the filtering when you request the data, so only what you need is returned (the API method).

Before we delve into writing code to capture data through an API, let's clearly state the logic underpinning the technique.

### Logic of using an API

We begin by identifying an online database containing information of interest. Then we need to **know** the following:
1. The location of the API (i.e., web address) through which the database can be accessed. For example, the UK Police API can be accessed via <a href="https://data.police.uk/api" target=_blank>https://data.police.uk/api</a>.
2. The terms of use associated with the API. Many APIs restrict the number of requests you can make over a given time period, while others require registration in order to authenticate who is trying to access the data. For example, the UK Police API does not require you to provide authentication but restricts the number of requests for data you can make (15 per second) - the number of allowable requests is known as the *rate limit*.
3. The location of the data of interest on the API. For example, data on street-level crime from the UK Police API is available at: <a href="https://data.police.uk/api/crimes-street" target=_blank>https://data.police.uk/api/crimes-street</a>. The location of the data is known as its *endpoint*.

We can usually find all of the information we need by reading the API's documentation e.g., <a href="https://data.police.uk/docs/" target=_blank>https://data.police.uk/docs/</a>.

Then we need to **do** the following:
4. Register your use of the API (if required).
5. Request data from the endpoint of interest, supplying authentication if required. This process is known as *making a call* to the API.
6. Write this data to a file for future use.

## Example: Capturing Covid-19 data

<a href="https://open-platform.theguardian.com/documentation/" target=_blank>The Guardian API</a> provides access to some of the data and metadata associated with its content. For example, you can query its database to search for articles relating to certain topics (e.g., "environment", "covid-19"), or articles published over a certain date range.

### Locating the API

In some cases the organisation will allow you to explore the API without the need to write code or register for an API key. For instance, the Guardian API can be interacted with through the following user interface <a href="https://open-platform.theguardian.com/explore/" target=_blank>https://open-platform.theguardian.com/explore/</a>.

**TASK**: take some time to interact with the Guardian API interface using the link above.

(Note: it is possible to load websites into Python in order to view them; however, the Guardian API doesn't allow this. See the example code below for how it would work for a different website - just remove the quotation marks enclosing the code and run the cell).

In [None]:
"""
from IPython.display import IFrame

IFrame("https://ukdataservice.ac.uk/", width="600", height="650")
"""

However, interacting with an API through a user interface (i.e., text boxes, drop-down menus) is slow, opaque, labour intensive and often not possible. We will instead focus on writing code that performs the requests for us. Therefore, we need to use the following web address to access the API: https://content.guardianapis.com.

(Note that you cannot request this web address through your browser; this is because access to the API is only possible by providing authentication, i.e., an API key. Try for yourself by clicking on the link).

### API terms of use

The Guardian API is well documented (not always the case, unfortunately) and we can clearly identify what is required in order to interact with it. Firstly, the API requires authentication, in the form of an API key, to be provided when making requests. The API key is generated when you register your use of the API.

Secondly, the API provides multiple levels of access, each with its own set of restrictions and usage allowances. The free level of access allows you to make up to 12 calls per second, with a daily limit of 5,000; requests are also restricted to text content (no access to images, audio, video). If you need more than the free level of access provides, there is a commercial product with custom rate limits, access to other types of data (e.g., images, videos) etc.

See <a href="https://open-platform.theguardian.com/access/" target=_blank>https://open-platform.theguardian.com/access/</a> for full information on levels of access.
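
One common way to stay under a calls-per-second limit is to pause briefly between requests. The sketch below is only an illustration of that idea: the `Throttle` class and its `wait()` method are hypothetical names invented for this example (they are not part of the Guardian API or of any module used in this notebook), and the figure of 12 calls per second is the free-tier allowance quoted above.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum gap between successive calls."""

    def __init__(self, calls_per_second):
        self.min_interval = 1.0 / calls_per_second  # seconds between calls
        self.last_call = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = time.monotonic()
        if self.last_call is not None:
            elapsed = now - self.last_call
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()


throttle = Throttle(calls_per_second=12)  # free-tier allowance quoted above

start = time.monotonic()
for call_number in range(3):
    throttle.wait()  # in real use, the API request would follow this line
elapsed_total = time.monotonic() - start
print(round(elapsed_total, 2))
```

A pause like this only addresses the per-second limit; the daily limit would still need to be tracked separately.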

### Locating data

The Guardian API allows access to five endpoints:
* Content - https://content.guardianapis.com/search
* Tags - https://content.guardianapis.com/tags
* Sections - https://content.guardianapis.com/sections
* Editions - https://content.guardianapis.com/editions
* Single Item - https://content.guardianapis.com/

### Registering use of API

We need an API key in order to start requesting data. The API key acts as a user's unique id when accessing the API. For the purposes of this lesson we have <a href="https://bonobo.capi.gutools.co.uk/register/developer" target=_blank>registered our use</a> and been given an API key which is contained in a file called *guardian-api-key.txt*. 

Run the code below to check if the file exists:

In [None]:
import os

os.listdir("./auth")

Good, now we need to load in the API key from this file.

(Delete the `#` symbol if you want to see the value of the API key).

In [None]:
with open("./auth/guardian-api-key.txt", "r") as f: # open the file and read its contents
    api_key = f.read().strip() # remove any stray whitespace or line breaks
# api_key

You should generate your own API key, and keep it private and secure, when using the Guardian API for your own purposes.

### Requesting data

We're ready for the interesting bit: requesting data through the API. We'll focus on finding and saving data about articles relating to the Covid-19 public health crisis.

There is a preliminary step, which is setting up Python with the modules it needs to interact with the API.

In [None]:
# Import modules

import os # module for navigating your machine (e.g., file directories)
import requests # module for requesting urls
import json # module for working with JSON data structures
from datetime import datetime # module for working with dates and time
print("Successfully imported necessary modules")

Modules are additional techniques or functions that are not present when you launch Python. Some do not even come with Python when you download it and must be installed on your machine separately - think of using `ssc install <package>` in Stata, or `install.packages(<package>)` in R. For now just understand that many useful modules need to be imported every time you start a new Python session.
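
If you are unsure whether a module is installed on your machine, you can ask Python before trying to import it. The following is a minimal sketch using the standard library's `importlib`; the function name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named module."""
    return importlib.util.find_spec(module_name) is not None


# "os" and "json" ship with Python; "requests" usually has to be installed separately
for module_name in ["os", "json", "requests"]:
    status = "available" if is_installed(module_name) else "missing"
    print(module_name, "->", status)
```

A module reported as missing can typically be installed from the command line with `pip install <module>`.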

Next, let's search for articles mentioning the term "covid-19".

In [None]:
# Define web address and search terms

baseurl = "http://content.guardianapis.com/search?" # base web address
searchterm = "covid-19" # term we want to search for
auth = {"api-key": api_key} # authentication
webadd = baseurl + "q=" + searchterm # construct web address for requesting
print(webadd)

# Make call to API

response = requests.get(webadd, headers=auth) # request the web address
response.status_code # check if API was requested successfully

Let's unpack the above code, as there is a lot happening. First, we define a variable (also known as an 'object' in Python) called `baseurl` that contains the base web address of the *Content* endpoint. Then we define a variable containing the term we are interested in searching for (`searchterm`). Next, we define a variable to store the API key that is needed when making the request (`auth`). Finally we concatenate the base web address and the search term to form a valid web address that can be requested from the API (`webadd`); the API key is supplied separately when we make the request.

The next step is to use the `get()` method of the `requests` module to request the web address, and in the same line of code, we store the results of the request in a variable called `response`. Finally, we check whether the request was successful by calling on the `status_code` attribute of the `response` variable.
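
Concatenating strings works well for simple search terms, but terms containing spaces or special characters need to be encoded before they can form part of a web address. As a hedged illustration, the sketch below builds a query string with the standard library's `urlencode()` function; the search term "covid-19 lockdown" and the variable names are assumptions made for this example only.

```python
from urllib.parse import urlencode

base_url = "https://content.guardianapis.com/search"
params = {"q": "covid-19 lockdown"}  # illustrative search term containing a space

query_string = urlencode(params)  # the space is encoded as "+"
full_url = base_url + "?" + query_string
print(full_url)
```

The `requests` module can perform the same encoding for you if the dictionary is passed to the `params` argument of `get()`.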

Confused? Don't worry, the conventions of Python and using its modules take a bit of getting used to. At this point, just understand that you can store the results of commands in variables, and a variable can have different attributes that can be accessed when needed. Also note that you have a lot of freedom in how you name your variables (subject to certain restrictions - see <a href="https://www.python.org/dev/peps/pep-0008/" target=_blank>here for some guidance</a>).

For example, the following would also work (bonus points if you can name the brewery lending its name to the variables):

In [None]:
highlander = "http://content.guardianapis.com/search?" # base web address
jarl = "covid-19" # term we want to search for
avalanche = {"api-key": api_key} # authentication
hurricane_jack = highlander + "q=" + jarl # construct web address for requesting
print(hurricane_jack)

# Make call to API

beers = requests.get(hurricane_jack, headers=avalanche) # request the web address
beers.status_code # check if API was requested successfully

Back to the request:

Good, we get a status code of _200_ - this means we made a successful call to the API. <a href="https://www.textbook.ds100.org/ch/07/web_http.html" target=_blank>Lau, Gonzalez and Nolan</a> provide a succinct description of different types of status codes:

* **100s** - Informational: More input is expected from client or server (e.g. 100 Continue, 102 Processing)
* **200s** - Success: The client's request was successful (e.g. 200 OK, 202 Accepted)
* **300s** - Redirection: Requested URL is located elsewhere; May need user's further action (e.g. 300 Multiple Choices, 301 Moved Permanently)
* **400s** - Client Error: Client-side error (e.g. 400 Bad Request, 403 Forbidden, 404 Not Found)
* **500s** - Server Error: Server-side error or server is incapable of performing the request (e.g. 500 Internal Server Error, 503 Service Unavailable)

For clarity:
* **Client**: your machine
* **Server**: the machine you are requesting the resource from

(See Appendix A for more examples of how the `requests` module works and what information it returns.)
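
Because the class of a status code is determined by its first digit, it is easy to translate a code into one of the categories listed above. The function below is a small sketch of this idea; `describe_status` is a name invented for this example rather than something provided by the `requests` module.

```python
def describe_status(status_code):
    """Map an HTTP status code to its broad class, based on the first digit."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return classes.get(status_code // 100, "Unknown")


for code in [200, 301, 404, 503]:
    print(code, describe_status(code))
```

In practice you could pass `response.status_code` to a function like this before deciding whether to process the response.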

You may be wondering exactly what it is we requested. To see the content of our request, we can call the `json()` method on the `response` variable:

In [None]:
data = response.json() 
data # view the content of the response

Just by scanning the first few lines we can pick out useful metadata about the response: we can see our search has yielded over 15,000 results, and there are ten results per page spread out over 1,500 pages. We can also see where the actual content of the search results is contained, in a section helpfully called `results`.

It is important we familiarise ourselves with the hierarchical structure of the returned data. Note how we needed to call the `json()` method on the `response` variable. JSON (JavaScript Object Notation) is a hierarchical data structure based on key-value pairs (known as *items*), which are separated by commas (Brooker, 2020; Tagliaferri, n.d.). For example, the `pageSize` key stores the value `10`. The hierarchical structure is evident by observing how a key can contain a list of other key-value pairs (items). For example, the `results` key contains a list of items relating to each search result.

A JSON data structure (known in Python as a *dictionary*) can be difficult to understand at first, in no small part due to the unappealing presentation format. Visually, it is worth noting that this data structure begins and ends with curly braces (`{}`). 
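
To make the idea of nesting more concrete, here is a toy example that is much smaller than a real API response. The text stored in `raw_text` is invented for illustration and only imitates the general shape described above; it is not actual output from the Guardian API.

```python
import json

# Made-up text imitating the shape of an API response; the values are invented
raw_text = """
{"response": {"status": "ok",
              "pageSize": 2,
              "results": [{"type": "article"}, {"type": "liveblog"}]}}
"""

toy = json.loads(raw_text)  # convert the JSON text to a Python dictionary

print(type(toy))                        # the outer structure is a dictionary
print(toy["response"]["pageSize"])      # a value nested one level down
print(type(toy["response"]["results"])) # a key whose value is a list
```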

Let's examine some Python methods for navigating and processing this data structure.

#### Navigating a dictionary (JSON) variable

The first thing we should do is list the keys contained in the dictionary:

In [None]:
data.keys()

If you're wondering why only one key is listed, remember that we are dealing with a hierarchical data structure. The value of the `response` key is itself another dictionary, therefore we can access those keys as follows: 

In [None]:
data["response"].keys() # list keys contained in the "response" key

Note how we mentioned that keys can contain a list of other key-value pairs. A *list* is a Python data type used to store ordered sequences of elements (Tagliaferri, n.d.). Let's see how we can deal with lists by examining the `results` key:

In [None]:
search_results = data["response"]["results"] 
search_results # view the contents of the "results" key nested within the "response" key

Visually, a list begins and ends with square brackets (`[]`) - however we can just ask Python to confirm that the `search_results` variable is a list:

In [None]:
type(search_results)

We can check how many elements are in a list by calling on the `len()` function:

In [None]:
len(search_results)

We can view each element in the list of search results as follows:

In [None]:
# View the values of certain keys in each element in the list

for result in search_results:
    print(result["type"]) # view content type
    print(result["sectionName"]) # view newspaper section content appeared in
    print(result["webPublicationDate"]) # view date content was published online
    print("\r")
    print("-------------")
    print("\r")

It takes a bit of time to get used to unfamiliar data structures and types. If you are a quantitative social scientist, you may be used to working with data structured in a tabular (variable-by-case) format: every row is an observation, every column is a variable, and every cell is a value.

In section 4.7, we'll see how we can convert data from a JSON structure to a tabular format, but just keep in mind that JSON is a popular means of storing and sharing data found on the web and it is worth becoming proficient in managing and manipulating it.

### Saving results

The final task is to save the data to a file that we can use in the future. We'll write the data to a JSON file format, as this is the structure the data were returned in.

In [None]:
# Create a downloads folder

try:
    os.mkdir("./downloads")
except FileExistsError: # only catch the "folder already exists" error
    print("Unable to create folder: already exists")

The use of "./" tells the `os.mkdir()` command that the "downloads" folder should be created inside the current working directory, which is normally the directory where this notebook is located. So if this notebook was stored in a directory located at "C:/Users/joebloggs/notebooks", the `os.mkdir()` command would result in a new folder located at "C:/Users/joebloggs/notebooks/downloads".
   
(Technically the "./" is not needed and you could just write `os.mkdir("downloads")` but it's good practice to be explicit.)
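
An alternative worth knowing about is `os.makedirs()` with the `exist_ok=True` argument, which creates the folder if needed and does nothing if it already exists. The sketch below demonstrates this in a temporary directory (an assumption made so that the example does not touch your own folders).

```python
import os
import tempfile

base_dir = tempfile.mkdtemp()  # throwaway location used only for this sketch
target = os.path.join(base_dir, "downloads")

os.makedirs(target, exist_ok=True)  # creates the folder
os.makedirs(target, exist_ok=True)  # second call does nothing rather than raising an error
print(os.path.isdir(target))
```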

In [None]:
# Write the results to a JSON file

date = datetime.now().strftime("%Y-%m-%d") # get today's date in YYYY-MM-DD format
print(date)

outfile = "./downloads/guardian-api-covid-19-search-" + date + ".json"

with open(outfile, "w") as f:
    json.dump(data, f)

The code above defines a name and location for the file which will store the results of the API request. We then open the file in *write* mode and save (or "dump") the contents of the `data` variable to it.

How do we know this worked? The simplest way is to check whether a) the file was created, and b) the results were written to it.

In [None]:
# a) Check presence of file in "downloads" folder

os.listdir("./downloads")

In [None]:
# b) Open file and read (import) its contents

with open(outfile, "r") as f:
    data = json.load(f) # use the "load()" method of the "json" module
    
data

And voilà, we have successfully requested data using an API!

### Refining Covid-19 data collection

We will complete our work gathering data on articles relating to Covid-19 by refining our request and handling of the response. In particular, we will deal with the following outstanding issues:
1. Including additional search terms
2. Dealing with multiple pages of results
3. Requesting the contents (text) of some of the articles contained in our list of search results
4. Handling the rate limit

The following blocks of code contain fewer comments explaining what each command is doing, so feel free to add these back in if you like.

This section - and the notebook in general - is sequential, therefore not running the code cells in order, from top to bottom, will invariably lead to errors down the line. If this occurs, return to this point and run the code cells one-by-one.

Let's get some of the preliminaries out of the way before we begin:

In [None]:
import os
import requests
import json
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup as soup

date = datetime.now().strftime("%Y-%m-%d")

with open("./auth/guardian-api-key.txt", "r") as f: # open the file and read its contents
    api_key = f.read().strip() # remove any stray whitespace or line breaks

#### Including additional search terms

We can use logical operators - AND, OR and NOT - to include additional search terms in our request.

In [None]:
# Define web address and search terms

baseurl = "http://content.guardianapis.com/search?"
searchterms = "covid-19 OR coronavirus" # terms we want to search for
auth = {"api-key": api_key}

webadd = baseurl + "q=" + searchterms
print(webadd)

# Make call to API

response = requests.get(webadd, headers=auth)
response.status_code

In [None]:
response.json()

Note how the total number of results is higher than if we just searched for "covid-19" on its own.
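
If you find yourself combining many terms, it can help to build the search string programmatically. The helper below is a hypothetical convenience function written for this example (the name `combine_terms` is our own); it simply joins a list of terms with the chosen operator.

```python
def combine_terms(terms, operator="OR"):
    """Hypothetical helper: join search terms with a logical operator."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return (" " + operator + " ").join(terms)


print(combine_terms(["covid-19", "coronavirus"]))
print(combine_terms(["covid-19", "Scotland"], operator="AND"))
```

The resulting string can be used in place of the `searchterms` variable defined earlier.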

**EXERCISE**: Adapt the code above to include other search terms that you consider relevant to Covid-19 (e.g., covid-19 AND Scotland). Consult the <a href="https://open-platform.theguardian.com/documentation/search" target=_blank>API's documentation</a> if you need help constructing more complicated search queries (*Query term* section).

In [None]:
# INSERT EXERCISE CODE HERE

#### Dealing with multiple pages of results

Sticking with our previous example, we noted how the search results were restricted to ten per page. APIs often do not return all of the results for a given request due to the computing resource needed to process large volumes of data. There are a couple of ways of dealing with this issue:
* Increasing the number of results returned per page; AND
* Requesting each individual page of results

First, increase the number of results per page using the `page-size` filter:

In [None]:
baseurl = "http://content.guardianapis.com/search?"
searchterms = "covid-19 OR coronavirus"
auth = {"api-key": api_key}
numresults = "50" # number of results per page

webadd = baseurl + "q=" + searchterms + "&page-size=" + numresults
print(webadd)

# Make call to API

response = requests.get(webadd, headers=auth)
response.status_code

Confirm the `page-size` filter was applied:

In [None]:
data = response.json()
data["response"]["pageSize"]

Good, now let's request each page of results. We start by finding out how many pages we need to request:

In [None]:
total_pages = data["response"]["pages"]
total_pages = total_pages + 1
total_pages

Note how we needed to add one to the total number of pages. This is because of the way Python loops over a range of numbers. For example, the code below loops over a range of numbers beginning at one and **up to but not including** five:

In [None]:
for i in range(1, 5):
    print(i)
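
The number of pages follows directly from the total number of results and the page size: divide one by the other and round up. The sketch below works through this arithmetic with an invented total (the figure of 15,004 results is illustrative, not a value returned by the API) and shows why the upper bound of `range()` must be one more than the number of pages.

```python
import math

total_results = 15004  # illustrative figure, not a real API value
page_size = 50

pages = math.ceil(total_results / page_size)  # round up so the final partial page is counted
page_numbers = list(range(1, pages + 1))      # add one because range() excludes its upper bound

print(pages)
print(page_numbers[0], page_numbers[-1])
```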

Requesting the total number of pages may take a while, so let's just request the first 20:

In [None]:
for pagenum in range(1, 21):
    
    webadd = baseurl + "q=" + searchterms + "&page-size=" + numresults \
        + "&page=" + str(pagenum)
    print(webadd)

    response = requests.get(webadd, headers=auth)
    data = response.json() # extract this page's results so they are the ones saved below
    
    outfile = "./downloads/guardian-api-covid-19-search-page-" \
        + str(pagenum) + "-" + date + ".json"

    with open(outfile, "w") as f:
        json.dump(data, f)

In [None]:
os.listdir("./downloads")
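
Once each page is saved, you will probably want to pool the results into a single list. The sketch below assumes each file has the same structure as the API response (a `response` key containing a `results` list). To keep it self-contained, it writes two tiny made-up pages to a temporary folder rather than reading your real downloads; the file names and article ids are invented for illustration:

```python
import glob
import json
import os
import tempfile

# Create two small, made-up pages of results in a temporary folder
tmpdir = tempfile.mkdtemp()
fake_pages = {
    1: {"response": {"results": [{"id": "a"}, {"id": "b"}]}},
    2: {"response": {"results": [{"id": "c"}]}},
}
for pagenum, content in fake_pages.items():
    path = os.path.join(tmpdir, "search-page-" + str(pagenum) + ".json")
    with open(path, "w") as f:
        json.dump(content, f)

# Read every JSON file back in and pool the results
all_results = []
for path in sorted(glob.glob(os.path.join(tmpdir, "*.json"))):
    with open(path, "r") as f:
        page = json.load(f)
    all_results.extend(page["response"]["results"])

print(len(all_results)) # total number of articles across both pages
```

To use this on your own files, you would point `glob.glob()` at the `./downloads` folder instead of the temporary one.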

#### Requesting article contents

The list of search results contains data and metadata about individual articles. What if we wanted the contents of the article, i.e., the text? Thankfully the API has an endpoint (*Single Item*) that provides access to this data.

First, we need to extract an article from our list of search results. We do this by accessing its *positional value* (index) in the list: 

In [None]:
search_results = data["response"]["results"]
article = search_results[0] # extract first article in list of results
article

In Python, indexing begins at zero (in R indexing begins at 1). Therefore, the first item in a list is located at position zero (e.g., `search_results[0]`), the second item at position one (e.g., `search_results[1]`) etc.
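
A quick sketch of how indexing behaves, using a short made-up list in place of `search_results` (the `example_results` list is purely illustrative):

```python
# A made-up list standing in for search_results
example_results = ["first article", "second article", "third article"]

first = example_results[0]   # position zero is the first item
last = example_results[-1]   # negative indices count back from the end
highest_index = len(example_results) - 1 # the largest valid positive index

print(first, "|", last, "|", highest_index)
```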

**TASK**: extract a different article using another index value. Remember, there are 50 elements in the `search_results` list, so `[49]` is the highest index value you can refer to.

Next, we need to request its contents using the web address contained in the `apiUrl` key and specifying we want the body (text) of the article returned also:

In [None]:
baseurl = article["apiUrl"]
auth = {"api-key": api_key}
field = "body"

webadd = baseurl + "?show-fields=" + field

# Make call to API

response = requests.get(webadd, headers=auth)
response.status_code

In [None]:
data = response.json()
data

Note we now have a key called `fields` which contains a dictionary with one key (`body`):

In [None]:
data["response"]["content"]["fields"].keys()

In [None]:
text = data["response"]["content"]["fields"]["body"]
text

The article's content is returned as a long piece of unstructured text (known as a *string* data type in Python). However you may have noticed lots of strange symbols or tags scattered throughout (e.g., `<p>`, `</ul>`) - their presence confirms that the content is actually a piece of HTML, the markup language that web pages are written in. Thankfully, Python has a module for working with HTML data.

`BeautifulSoup` is a Python module that provides a systematic way of navigating the elements of a web page and extracting its contents. Let's see how it works in practice:

In [None]:
from bs4 import BeautifulSoup as soup # module for parsing web pages

soup_text = soup(text, "html.parser") # convert text to HTML
type(soup_text)

In the above code, we import the `BeautifulSoup` class from the `bs4` module under the shorter name `soup`. We then apply `soup()` to the `text` variable, and store the results in a new variable called `soup_text`.

Now that the raw text has been converted to a different data type, we are able to navigate and extract the information of interest. For example, let's extract all of the links in the article:

In [None]:
links = soup_text.find_all("a") # find all <a> tags
links

We used the `find_all()` method to search for all `<a>` tags in the article. And because there is more than one tag with this name, we get a list of results. We can check how many tags there are by calling on the `len()` function:

In [None]:
len(links)

In our run the list had six elements, which means the article linked to six other resources; your count will depend on the article you selected.

We can view each element in the list of results as follows:

In [None]:
for link in links:
    print("--------")
    print(link.get("href")) # extract the URL from within the <a> tag
    print("--------")
    print("\r") # print some blank space for better formatting
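
Beyond links, you will often want the article's text without any tags. With `BeautifulSoup` this is available through `soup_text.get_text()`. The sketch below illustrates the same idea using only Python's standard library, so it runs without extra installs; the `TextExtractor` class and `sample_html` string are our own hypothetical examples, not part of the Guardian API or `BeautifulSoup`:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text found between HTML tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs
        if data.strip():
            self.chunks.append(data.strip())

# A made-up snippet of HTML, similar in style to an article body
sample_html = '<p>Cases are <a href="https://example.com/report">rising</a> again.</p><ul><li>Wash hands</li></ul>'

parser = TextExtractor()
parser.feed(sample_html)
parser.close()

plain_text = " ".join(parser.chunks)
print(plain_text)
```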

Parsing HTML data is a topic in itself and we cover it in much more detail in our <a href="https://github.com/UKDataServiceOpen/web-scraping/blob/master/code/web-scraping-websites-code-2020-04-23.ipynb" target=_blank>Websites as a Source of Data</a> lesson.

Finally, let's conclude by saving the article's data and metadata to a JSON file:

In [None]:
# Use the article's unique id to name the file

article_id = data["response"]["content"]["id"].replace("/", "-")
#
# We need to remove forward slashes ("/") as our computer interprets
# these as folder separators.
#

outfile = "./downloads/" + article_id + ".json"

with open(outfile, "w") as f:
    json.dump(data, f)
    
outfile # view the file name  

#### Handling the rate limit

The Guardian API's free level of access allows up to 12 calls (requests) per second, up to a daily limit of 5,000. You may consider this more than adequate for your purposes, but for argument's sake let's say you're worried about breaching these limits. How can you keep track of the number of calls you make?

Thankfully, many APIs (including the Guardian's) include some metadata with the response that tells you the rate limit and how many calls you have remaining. First, let's see how we can access this metadata:

In [None]:
baseurl = "http://content.guardianapis.com/search?"
searchterms = "covid-19 OR coronavirus"
auth = {"api-key": api_key}

webadd = baseurl + "q=" + searchterms
print(webadd)

response = requests.get(webadd, headers=auth)
response.headers # view response metadata

While most of this metadata won't be of interest (to a researcher, though software/web developers may disagree), some fields provide important information for tracking our use of the API:
* `X-RateLimit-Limit-day` - the number of calls you can make per day
* `X-RateLimit-Limit-minute` - the number of calls you can make per minute
* `X-RateLimit-Remaining-day` - the number of calls remaining per day
* `X-RateLimit-Remaining-minute` - the number of calls remaining per minute

In addition, the `Content-Type` field is interesting as it tells us what format the content of the response is returned as (e.g., JSON, XML, CSV).

Let's use some of this metadata to define some variables that track our use of the API:

In [None]:
rate_limit_daily = response.headers["X-RateLimit-Limit-day"]
remaining_daily = response.headers["X-RateLimit-Remaining-day"]
total_calls = int(rate_limit_daily) - int(remaining_daily)

print("We have made {} calls to the API today.".format(total_calls))

(Note how we needed to convert the `rate_limit_daily` and `remaining_daily` variables to integers before we could perform mathematical operations on them.)
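
The same calculation can be wrapped in small reusable functions. This is a minimal sketch rather than an official pattern: `calls_made()` and `can_afford()` are hypothetical helper names of our own, and `example_headers` contains made-up values standing in for `response.headers`:

```python
# Made-up header values; real responses return these as strings
example_headers = {
    "X-RateLimit-Limit-day": "5000",
    "X-RateLimit-Remaining-day": "4988",
}

def calls_made(headers):
    """Return the number of calls used today, based on the rate limit headers."""
    limit = int(headers["X-RateLimit-Limit-day"])
    remaining = int(headers["X-RateLimit-Remaining-day"])
    return limit - remaining

def can_afford(headers, planned_calls):
    """Check whether the planned number of calls fits in today's remaining allowance."""
    return planned_calls <= int(headers["X-RateLimit-Remaining-day"])

print(calls_made(example_headers))
print(can_afford(example_headers, 20))
```

Checking `can_afford()` before starting a long loop of requests is one way to avoid running out of calls halfway through a data collection.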

Great, now we have a way of keeping track of the number of calls we make to the API. Let's test this approach by making more requests:

In [None]:
# Initialise the variables counting the number of calls

if "total_calls" in globals():
    print("You have made {} calls this session".format(total_calls))
    counter = total_calls
else:
    counter = 0

# Request data from Guardian API

for pagenum in range(1, 6):
    
    baseurl = "http://content.guardianapis.com/search?"
    searchterms = "covid-19 OR coronavirus"
    auth = {"api-key": api_key}
    numresults = "50"
    
    webadd = baseurl + "q=" + searchterms + "&page-size=" + \
        numresults + "&page=" + str(pagenum)

    # Make call to API

    response = requests.get(webadd, headers=auth)
    response.status_code
    
    # Get metadata
    
    rate_limit_daily = response.headers["X-RateLimit-Limit-day"]
    remaining_daily = response.headers["X-RateLimit-Remaining-day"]
    total_calls = int(rate_limit_daily) - int(remaining_daily)
    
    # Update counter
    
    counter += 1
    print("Call number {} to the API".format(counter))

# Update overall number of calls

if "total_calls" not in globals():
    total_calls = counter
elif "total_calls" in globals():
    total_calls = total_calls + (counter - total_calls)
else:
    pass

print("You have made {} total calls to the API in this session".format(total_calls))    

**TASK**: Re-run the code above another two or three times to see whether the `total_calls` variable is correctly keeping track of the number of requests.

There are quite a few new techniques introduced in the code above, so let's unpack each element one at a time:

The first block (under the comment `# Initialise the variables counting the number of calls`) checks whether the `total_calls` variable exists: if it does, the variable keeping track of individual calls (`counter`) is initialised to the value of `total_calls`; if it does not, `counter` is set to 0 (i.e., we have not made any calls to the API yet).

The second block (the `for` loop) performs the usual request to the API, but at the end it increases the `counter` variable by 1 to record the fact we made a call.

Finally, the third block (under the comment `# Update overall number of calls`) updates the `total_calls` variable, conditional on whether it already exists or not.

Note that there usually isn't a penalty for exceeding the call limit; what happens is you are restricted from making further calls until a sufficient period of time has passed. We say "usually", because it depends on the API in question, whether you have paid for a commercial/custom level of access etc.
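
If you are worried about making too many calls in a short window, the simplest defence is to pause between requests. The sketch below is one possible approach rather than anything prescribed by the Guardian: `make_throttle()` is a hypothetical helper of our own that ensures successive calls are at least a minimum interval apart.

```python
import time

def make_throttle(max_calls_per_second, clock=time.monotonic, sleep=time.sleep):
    """Return a function that pauses so calls stay under the given rate."""
    min_interval = 1.0 / max_calls_per_second
    state = {"last_call": None}

    def wait():
        if state["last_call"] is not None:
            elapsed = clock() - state["last_call"]
            if elapsed < min_interval:
                sleep(min_interval - elapsed) # pause for the rest of the interval
        state["last_call"] = clock()

    return wait

wait = make_throttle(10) # at most ten calls per second (below the limit of 12)

for pagenum in range(1, 4):
    wait()
    print("Now safe to request page {}".format(pagenum)) # requests.get() would go here
```

The `clock` and `sleep` arguments exist only so the function can be tested without actually waiting; in normal use you can ignore them.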

### Concluding remarks on Covid-19 data

The Covid-19 pandemic is a seismic public health crisis that will dominate our lives for the foreseeable future. The example code above is not a craven attempt to provide some topicality to these materials, nor is it simply a particularly good example for learning how to interact with an API. There are real opportunities for social scientists to capture and analyse data on this phenomenon, such as The Guardian's reporting of the crisis.

You may also be interested in the publicly available data repository provided by Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE): <a href="https://github.com/CSSEGISandData/COVID-19" target=_blank>https://github.com/CSSEGISandData/COVID-19</a>. Updated daily, this resource provides CSV (Comma Separated Values) files of global Covid-19 statistics (e.g., country-level time series), as well as PDF copies of the World Health Organisation's situation reports.

At a UK level, the NHS releases data about Covid-19 symptoms reported through its NHS Pathways and 111 online platforms: <a href="https://digital.nhs.uk/data-and-information/publications/statistical/mi-potential-covid-19-symptoms-reported-through-nhs-pathways-and-111-online/latest" target=_blank>NHS Open Data</a>. Data on reported cases is also provided by Public Health England (PHE): <a href="https://www.gov.uk/government/publications/covid-19-track-coronavirus-cases" target=_blank>COVID-19: track coronavirus cases</a>. Many of these datasets are openly available as CSV files.

There is a <a href="https://docs.google.com/document/d/1eH0d3sldy_bocJpgvnCIsRXaguumDpiEzgDd2TjfJFI/edit" target=_blank>collaborative Google document</a> capturing sources of social data relating to COVID-19, curated by <a href="http://www.benbgeiger.co.uk/" target=_blank>Dr Ben Baumberg Geiger</a>.

Finally, the Office for National Statistics (ONS) provides data and experimental indicators of social life in the UK under Covid-19: <a href="https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases" target=_blank>Coronavirus (COVID-19)</a>.

### A social research example: UK police data

The example contained in Appendix B introduces another API with relevant data for social science research. It reinforces much of what you've learned so far but also introduces some new approaches (using if statements to track API usage, employing API clients etc). If you feel comfortable with what you've learned so far then we highly recommend completing this lesson; if not, take some more time to digest the Covid-19 example and return to it at a later date.

## Value, limitations and ethics

Computational methods for collecting data from the web are an increasingly important component of a social scientist's toolkit. They enable individuals to collect and reshape data - qualitative and quantitative - that otherwise would be inaccessible for research purposes. Thus far, this notebook has focused on the logic and practice of using APIs; however, it is crucial that we reflect critically on their value, limitations and ethical implications for social science purposes.

### Value

* The process of interacting with an API is a common and mature computational method, with lots of established packages (e.g., `requests` in Python), examples and help available. As a result the learning curve is not as steep as with other methods, and it is possible for a beginner to access data from an API in under an hour (see the UK Police API example in Appendix B, if you do not believe us).
* APIs provide access to data that is intended to be shared; thus, the data are already in a more user-friendly format (e.g., as JSON or CSV files). Compare this to data scraped from a web page, which often needs extensive cleaning and parsing in order to be analysed and shared. 
* The richness of some of the information and data made available through APIs is a point worth repeating. Many public, private and charitable institutions use their web sites to release and regularly update information of value to social scientists. Getting a handle on the volume, variety and velocity of such information is extremely challenging (if not impossible) without the use of computational methods.
* APIs provide flexible access to data: you do not need to bulk download data and then filter through the results to arrive at what you need. APIs allow you to format your request in a way that returns exactly and only what you require.
* Finally, the data you need might only be available through an API. This might be the case if you require organisational and financial information on UK companies, thereby necessitating the use of the <a href="https://developer.companieshouse.gov.uk/api/docs/" target=_blank>Companies House API</a>.

### Limitations

* APIs restrict the number of requests for data you can make. For example, the Guardian API allows up to 5,000 requests per day, the Companies House API up to 600 per five-minute period. These limits ensure the API can offer a reliable service to a wide user base (in contrast to a single user constantly requesting data and preventing others from doing so). Therefore you need to plan and structure your requests in such a way that you comply with the limit **and** retrieve the data you need for your analysis.
* The quality of an API's official documentation can vary wildly. Some APIs provide accurate, detailed guides on how to make requests and handle responses; others are sparsely written and it can be a pain figuring out what data are available at each endpoint, how to correctly specify the web address, what the rate limit is etc. And if guidance is available, it might not be for the programming language you are using.
* Data protection laws, such as the EU's General Data Protection Regulation (GDPR), impinge on the data you collect through APIs. GDPR means that you are responsible for processing, securing, storing, using and deleting an individual's personal data, even if it’s publicly available through an API. This is a critical and detailed area of data-driven activities, and we encourage you to consult relevant guidance (see *Further reading and resources* section).
* APIs can be updated, resulting in changes to the rate limit, authentication requirements, endpoints providing access to the data, cost of using the service etc. It can be a lot of work maintaining your code, especially if you make it available for use by others.
* An API is a product and you must comply with the Terms of Service/Use associated with it; else there can be legal implications resulting from non-compliance - for example, there might be restrictions around sharing data collected from an API.
* Interacting with APIs, and engaging in computational social science generally, is dependent on your computing setup. For example, you may not possess administrative rights on your machine, preventing you from scheduling your script to run on a regular basis (e.g., your computer automatically goes to sleep after a set period of time and you cannot change this setting). There are ways around this and you do not need a high performance computing setup to collect data from an API, but it is worth keeping in mind nonetheless.
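
Planning around a rate limit (the first limitation above) often comes down to simple arithmetic. As a rough sketch, suppose you need a hypothetical 120,000 articles from the Guardian API; the collection size is an invented figure, while the page size and daily limit are those quoted in this notebook:

```python
import math

articles_needed = 120000 # hypothetical size of the collection
results_per_page = 50    # page-size used earlier in this notebook
daily_limit = 5000       # Guardian API daily limit quoted above

# Each call returns one page, so round up to cover a partly filled final page
calls_needed = math.ceil(articles_needed / results_per_page)
days_needed = math.ceil(calls_needed / daily_limit)

print("Calls needed: {}".format(calls_needed))
print("Days needed: {}".format(days_needed))
```

Running this kind of calculation before you start tells you whether the collection fits within a single day's allowance or needs to be spread over several.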

### Ethical considerations

For the purposes of this discussion, we will assume you have sought and received ethical approval for a piece of research through the usual institutional processes: you've already considered consent, harm to researcher and participant, data security and curation etc. One of these aspects, *informed consent*, needs major consideration when it comes to using some APIs (Lomborg & Bechmann 2014). Let's take Twitter user data, which is available via an <a href="https://developer.twitter.com/en/docs" target=_blank>API</a> (we'll work with this in a future <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>training series</a>). Users of Twitter will have signed up to the Terms of Service/Use when registering on the platform, where they will have agreed various clauses regarding the use and sharing of their information. However, can they reasonably be said to have given consent to participating in research using this data? What if you capture a user's personal data through the API, which at a later date the user deletes from their own profile: should you use this information in your research? There are no easy answers and we encourage you to consider these issues prior to beginning your data collection (see further reading suggestions).

## Conclusion

Interacting with an API is a simple yet powerful computational method for collecting data of value for social science research. It provides a relatively gentle introduction to using programming languages, also. However, "with great power comes great responsibility" (sorry). APIs take you into the realm of data protection, Terms of Service/Use, and many murky ethical issues. Wielded sensibly and sensitively, collecting data from APIs is a valuable and exciting social science research method.

Good luck on your data-driven travels!

## Bibliography

Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>.

Brooker, P. (2020). *Programming with Python for Social Scientists*. London: SAGE Publications Ltd.

Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. <a href="https://www.textbook.ds100.org" target=_blank>https://www.textbook.ds100.org</a>.

Lomborg, S., and Bechmann, A. (2014). Using APIs for Data Collection on Social Media. *The Information Society: An International Journal*, 30(4): 256-265.

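Tagliaferri, L. (n.d.). *How to Code in Python 3*. <a href="https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf" target=_blank>https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf</a>.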

## Further reading and resources

We publish a list of useful books, papers, websites and other resources on our web-scraping GitHub repository: <a href="https://github.com/UKDataServiceOpen/web-scraping/tree/master/reading-list/" target=_blank>[Reading list]</a>

The help documentation for the `requests` module is refreshingly readable and useful:
* <a href="https://requests.readthedocs.io/en/master/" target=_blank>`requests`</a>

You may also be interested in the following articles specifically relating to APIs:
* <a href="https://www.digitalocean.com/community/tutorials/how-to-use-web-apis-in-python-3" target=_blank>How To Use Web APIs in Python 3</a>
* <a href="https://ico.org.uk/for-organisations/guide-to-data-protection" target=_blank>Guide to Data Protection</a>
* <a href="https://www.gla.ac.uk/media/Media_487729_smxx.pdf" target=_blank>Social Media Research: A Guide to Ethics</a>

## Appendices

### Appendix A - Requesting URLs

We refer to a website's location on the internet as its web address or Uniform Resource Locator (URL).

In Python we've made use of the excellent `requests` module. By calling the `requests.get()` method, we mimic the manual process of launching a web browser and visiting a website. The `requests` module achieves this by placing a _request_ to the server hosting the website (e.g., show me the contents of the website), and handling the _response_ that is returned (e.g., the contents of the website and some metadata about the request). This _request-response_ protocol is known as HTTP (HyperText Transfer Protocol); HTTP allows computers to communicate with each other over the internet - you can learn more about it at <a href="https://www.w3schools.com/whatis/whatis_http.asp" target=_blank>W3 Schools</a>.

Run the code below to learn more about the data and metadata returned by `requests.get()`.

In [None]:
import requests

url = "https://httpbin.org/html"
response = requests.get(url)

print("1. {}".format(response)) # returns the object type (i.e. a response) and status code
print("\r")

print("2. {}".format(response.headers)) # returns a dictionary of response headers
print("\r")

print("3. {}".format(response.headers["Date"])) # return a particular header
print("\r")

print("4. {}".format(response.request)) # returns the request object that requested this response
print("\r")

print("5. {}".format(response.url)) # returns the URL of the response
print("\r")

#print(response.text) # returns the body of the response decoded as a string (i.e. the HTML of the web page)
#print(response.content) # returns the body of the response as raw bytes

# Visit https://www.w3schools.com/python/ref_requests_response.asp for a full list of what is returned by the server
# in response to a request.
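
The status code returned with each response is a standard part of HTTP, and Python's standard library can translate the numbers into short descriptions. The sketch below uses a helper of our own (`describe_status()` is a hypothetical name) and a handful of codes you are likely to meet when working with APIs:

```python
from http import HTTPStatus

def describe_status(code):
    """Return the standard phrase for an HTTP status code."""
    return HTTPStatus(code).phrase

# 200 = success, 401 = authentication problem, 404 = missing resource,
# 429 = too many requests (often seen when a rate limit is exceeded)
for code in [200, 401, 404, 429]:
    print(code, describe_status(code))
```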

### Appendix B - Capturing UK Police data

* API: https://data.police.uk
* Documentation: https://data.police.uk/docs/

[IN DEVELOPMENT]

-- END OF FILE --