![SGSSS Logo](../../img/SGSSS_Stacked.png)

# Practical Computational Methods for Social Scientists

## Introduction

Computational methods are transforming research practice across the disciplines. For social scientists these methods offer a number of valuable opportunities, including creating new datasets from digital sources; unearthing new insights and avenues for research from existing data sources; and improving the accuracy and efficiency of fundamental research activities.

Application Programming Interfaces (APIs) have become one of the most important ways to access and transfer data online (Bail, 2021). There are a number of advantages to using APIs instead of web-scraping, in particular avoiding many of the legal or ethical issues associated with the latter. However, APIs can use data formats and structures that are unfamiliar to researchers in the arts, humanities, and social sciences, and can impose access barriers of their own (e.g., subscriptions and fees for using the API, rate limits).

In this lesson we access a public API containing data of criminological and sociological relevance.

### Aims

This lesson has two aims:
1. Demonstrate how to use R to download data from the web through an Application Programming Interface (API).
2. Cultivate your computational thinking skills through coding examples, in particular how to define and solve a data collection problem using a computational method.

### Lesson details

* **Level**: Introductory
* **Time**: 30-60 minutes
* **Pre-requisites**: None
* **Audience**: Researchers and analysts from any disciplinary background. The materials are slightly tailored for social scientists through the use of social data.
* **Learning outcomes**:
    1. Understand what an Application Programming Interface (API) is.
    2. Understand the key steps and requirements for collecting data from the web through an API.
    3. Be able to use R for requesting, processing and saving data from an API.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*What is an Application Programming Interface (API)*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
name <- readline(prompt="Enter name: ")
print(paste("Hi,", name, "enjoy learning more about R and APIs!"))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## What is an Application Programming Interface (API)?

An Application Programming Interface (API) is
> "a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service" (Oxford English Dictionary).

In essence: an API acts as an intermediary between software applications. Think of an API's role as similar to that of a translator facilitating a conversation between two individuals who do not speak the same language. Neither individual needs to know the other's language, just how to formulate their response in a way the translator can understand. Similarly, an API **simplifies** how applications communicate with each other. (Another analogy you may prefer: APIs are like sockets or ports that allow devices access to data.)

It performs this role by providing a set of protocols/standards for making *requests* and formulating *responses* between applications. For example, a smart phone application might need real-time traffic data from an online database. An API can validate the application's request for data, and handle the online database's response (i.e., the transfer of data to the application). In the absence of an API, the smart phone application would need to know a lot more technical information about the online database in order to communicate with it (e.g., what commands does the database understand?). But thanks to the API, the smart phone application only needs to know how to formulate a request that the API understands, which then communicates the request to the database and handles the response.

### Why would you want to use an API?

While APIs have much broader uses, researchers are primarily interested in using them to access online databases. You may have heard of or be familiar with some of these APIs: YouTube, Facebook, Google Maps, Reddit. Organisations are opening up their internal databases using APIs; a useful list of these is available here: https://github.com/public-apis/public-apis?tab=readme-ov-file#index

Many public, private and charitable institutions collect and share data of value to researchers in the arts, humanities and social sciences. Often they deposit their data in a data portal (e.g., <a href="https://data.gov.uk/" target=_blank>UK Government Open Data</a>), allowing you to download the files as and when needed. However, another approach they can adopt is to allow access to the underlying information that is stored in their database through an API. Using this method, individuals can send a customised *request* for information to the database; if the request is valid, the database *responds* by providing you with the information you asked for. Think of using an API as the difference between downloading a raw data file which then needs to be filtered to arrive at the information you need, and performing the filtering when you request the data, so only what you need is returned.

### What is the general approach for accessing data through an API?

We begin by identifying an API containing data of interest. Then we need to **know** the following:
* The location of the API (i.e., web address). For example, the UK Police API can be accessed via <a href="https://data.police.uk/api" target=_blank>https://data.police.uk/api</a>.
* The terms of use associated with the API. Many APIs restrict the number of requests you can make over a given time period, while others require registration in order to authenticate who is trying to access the data. For example, the UK Police API does not require you to provide authentication but restricts the number of requests for data you can make (15 per second) - the number of allowable requests in a given time period is known as the *rate limit*.
* The location of the data of interest on the API. For example, data on street-level crime from the UK Police API is available at: <a href="https://data.police.uk/api/crimes-street" target=_blank>https://data.police.uk/api/crimes-street</a>. The location of the data is known as its *endpoint*.

We can usually find all of the information we need by reading the API's documentation e.g., <a href="https://data.police.uk/docs/" target=_blank>https://data.police.uk/docs/</a>.
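The base address and endpoint described above are simply pieces of text that we join to form the address we request. A minimal sketch of that step, using only base R; `build_url()` is a hypothetical helper of our own, not part of any package:

```r
# Base address and endpoints taken from the text above
baseurl <- "https://data.police.uk/api/"
endpoints <- c(forces = "forces", street_crime = "crimes-street")

# build_url() is a hypothetical helper: it joins a base address and an
# endpoint with exactly one "/" between them
build_url <- function(base, endpoint) {
  paste0(sub("/+$", "", base), "/", sub("^/+", "", endpoint))
}

# Build one web address per endpoint
urls <- vapply(endpoints, build_url, character(1), base = baseurl)
print(urls)
```

Keeping the base address and the endpoints in separate variables means you only need to change one value when you want data from a different endpoint.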

Then we need to **do** the following:
* Register your use of the API (if required).
* Request data from the endpoint of interest, supplying authentication if required. This process is known as *making a call* to the API.
* Write this data to a file for future use.

For any programming task, it is useful to write out the steps needed to solve the problem: we call this *pseudo-code*, as it captures the main tasks and the order in which they need to be executed.
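For the general approach described above, the pseudo-code might look something like the following. It is written as R comments, so nothing is executed; the exact wording and number of steps are our own suggestion rather than a fixed recipe:

```r
# Pseudo-code for collecting data through an API (comments only)
# 1. Define the base web address of the API and the endpoint of interest
# 2. Combine them into the web address to request
# 3. Request the web address, supplying authentication if required
# 4. Check that the request was successful
# 5. If successful, extract the data from the response
# 6. Write the data to a file for future use
```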

## A social science example

Let's work through the steps in our general approach using a real API, one that provides data on policing activities and street-level crime in England, Wales and Northern Ireland: **UK Police API**.

### Locating the API

The UK Police API can be accessed via the following web address or link: <a href="https://data.police.uk/api" target=_blank>https://data.police.uk/api</a>.

Note that you cannot request this web address through your browser; this is because this link acts as the *base* web address from which you can access the different data sets. For example, we can access a list of all the police forces whose data is available via the API using the following web address: <a href="https://data.police.uk/api/forces" target=_blank>https://data.police.uk/api/forces</a>.

**TASK**: Try it yourself: click on the above link to see what happens when you request data on police forces.

Before delving further into requesting data, let's understand the terms of use/restrictions associated with the UK Police API.

### API terms of use

The UK Police API is reasonably well documented (not always the case, unfortunately) and we can clearly identify what is required in order to interact with it. Firstly, the API does not require authentication: you do not need to register your use of the API, nor provide a password (known as an API key) whenever you request data.

Secondly, the API allows you to make up to 15 calls (requests) per second on average, though you can make up to 30 in a single second. If you are using the API for research purposes, it is highly unlikely you'll exceed this limit (but who knows what data requirements you have). There is no limit on the total number of calls you can make to the UK Police API as long as you comply with the aforementioned rate limits. 

See <a href="https://data.police.uk/docs/api-call-limits/" target=_blank>https://data.police.uk/docs/api-call-limits/</a> for full information on the API's call limits.
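If you ever need to make many requests in a row, one simple way of respecting a rate limit is to pause between calls. The sketch below assumes the limit of 15 requests per second mentioned above; `throttled_map()` and `fake_request()` are hypothetical names of our own, and the "request" is a stand-in so the example runs without a network connection:

```r
# Minimum gap between requests implied by a limit of 15 requests per second
max_per_second <- 15
min_gap <- 1 / max_per_second

# throttled_map() is a hypothetical helper: it applies `fun` to each input
# and sleeps between calls so the average rate stays at or under the limit
throttled_map <- function(inputs, fun, gap = min_gap) {
  results <- vector("list", length(inputs))
  for (i in seq_along(inputs)) {
    started <- Sys.time()
    results[[i]] <- fun(inputs[[i]])
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    # Only wait if there is another call to make and we finished early
    if (i < length(inputs) && elapsed < gap) {
      Sys.sleep(gap - elapsed)
    }
  }
  results
}

# Stand-in for an API call (an assumption for illustration only)
fake_request <- function(x) toupper(x)

out <- throttled_map(c("forces", "crimes-street"), fake_request)
print(out)
```

In practice you would replace `fake_request()` with a function that calls the API.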

To learn more about rate limiting as a concept, see [*Rate Limiting*](https://sicss.io/2020/materials/day2-digital-trace-data/apis/rmarkdown/Application_Programming_interfaces.html) (Bail, 2021).

### Locating data

The UK Police API allows access to over twenty endpoints (data sets), grouped under the following headings:
* *Forces* e.g., senior officers
* *Crime* e.g., crime categories
* *Neighbourhoods* e.g., boundaries, events
* *Stop and search* e.g., by area or force

See <a href="https://data.police.uk/docs/" target=_blank>https://data.police.uk/docs/</a> for a complete list of endpoints accessible through this API.

### Registering use of API

We can skip this step as the UK Police API does not require us to register or provide any form of authentication (a good example of *open data*).

### Requesting data

We're ready for the interesting bit: requesting data through the API. To focus our activities, we'll attempt to do the following:
1. Download a list of police forces in the UK.
2. For each force, download its stop-and-search data.
3. Save all of the downloaded data to a file for future use.

Before we download data, we need to ensure R has the functionality it needs to interact with the API.

In [None]:
# Load necessary libraries

library(httr)    # For HTTP requests
library(jsonlite) # For JSON handling
library(dplyr)   # For data manipulation
library(lubridate) # For date handling
library(fs)      # For file system operations

cat("Successfully loaded necessary libraries\n")

Packages (loaded with the `library()` command) provide additional techniques or functions that are not present when you launch R. Some do not even come with R when you download it and must be installed on your machine separately using `install.packages("<package>")` - think of using `ssc install <package>` in Stata, or `pip install <package>` in Python. For now just understand that many useful packages need to be loaded every time you start a new R session.

In [None]:
# Define web address and search terms

baseurl <- "https://data.police.uk/api/" # base web address
forces <- "forces" # endpoint where forces data is located

webadd <- paste0(baseurl, forces) # construct web address to request
cat(webadd)

# Make call to API

response <- GET(webadd) # request the web address
status_code(response) # check if API was requested successfully

Good, we get a status code of *200*, which means the request was successful. A status code in the *400s* or *500s* represents an unsuccessful attempt at requesting a web page (see <a href="https://www.w3schools.com/tags/ref_httpmessages.asp" target=_blank>here</a> for a comprehensive description of different types of response status codes).
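Status codes are grouped into ranges, and it can help to translate a code into its broad meaning before deciding what to do next. The following is a small sketch in base R; `describe_status()` is a hypothetical helper of our own and not an `httr` function (`httr` itself offers `http_error()`, which returns `TRUE` for codes of 400 and above):

```r
# describe_status() is a hypothetical helper, not part of httr
describe_status <- function(code) {
  if (code >= 200 && code < 300) {
    "success"
  } else if (code >= 300 && code < 400) {
    "redirection"
  } else if (code >= 400 && code < 500) {
    "client error"
  } else if (code >= 500 && code < 600) {
    "server error"
  } else {
    "other"
  }
}

# 429 is the standard HTTP code for "Too Many Requests"
codes <- c(200, 404, 429, 503)
labels <- vapply(codes, describe_status, character(1))
print(labels)
```

A client error usually means something is wrong with the request we sent (e.g., a mistyped endpoint), while a server error points to a problem on the API's side.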

Let's unpack the above code. First, we define a variable (also known as an 'object' in R) called `baseurl` that contains the base web address of the UK Police API. Then we define a variable containing the endpoint we want to access data from (`forces`). Finally we concatenate these separate elements to form a valid web address that can be requested from the API (`webadd`).

The next step is to use the `GET()` function from the `httr` package to request the web address, and in the same line of code, we store the results of the request in a variable called `response`. Finally, we check whether the request was successful by calling the `status_code()` function on the `response` variable.

You may be wondering exactly what it is we requested. To see the content of our request i.e., the data, we can use the `content()` function on the `response` variable:

In [None]:
forces_data <- content(response, as = "parsed")
forces_data

In [None]:
response

Note how we called the `content()` function on the `response` variable. This is because the data is returned to us in a structure known as *JSON*. JSON (JavaScript Object Notation) is a hierarchical data structure based on key-value pairs (known as *items*), which are separated by commas (Brooker, 2020; Tagliaferri, n.d.). For example, the `name` key stores a value referring to the name of the police force.

A JSON data structure can be difficult to understand at first, in no small part due to the visually unappealing presentation format. It is worth noting that this particular response begins and ends with square brackets (`[]`), which mark a list of records, and that each police force within it is enclosed in curly braces (`{}`).

Thankfully R automatically parses JSON and converts it into a list format (i.e., a list of observations). You can view the underlying JSON structure here: https://data.police.uk/api/forces
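To see what this parsing step does, we can parse a small piece of JSON text ourselves with the `jsonlite` package. The sketch below uses a shortened, illustrative sample in the same shape as the forces response; as far as we are aware, `content(response, as = "parsed")` relies on `jsonlite` in a similar way to the first call, but treat that as an assumption:

```r
library(jsonlite)

# Illustrative JSON text in the same shape as the forces response
json_text <- '[
  {"id": "avon-and-somerset", "name": "Avon and Somerset Constabulary"},
  {"id": "bedfordshire", "name": "Bedfordshire Police"}
]'

# simplifyVector = FALSE keeps the result as a list of lists
as_list <- fromJSON(json_text, simplifyVector = FALSE)
str(as_list)

# The default (simplifyVector = TRUE) collapses the records into a data frame
as_table <- fromJSON(json_text)
print(as_table)
```

The list form mirrors the JSON exactly, while the data frame form is often more convenient for analysis.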

And voilà, we have a list of police forces in the UK (excluding Scotland and the British Transport Police).

We hope you agree that requesting data from an API is a relatively simple task. The real challenge lies with the way the data are *structured* in response to your request. While sometimes you may be able to request data in a tabular format (e.g., a CSV or Excel file), most of the time it arrives looking a bit different than you may be familiar with. For instance, we currently have a list of police forces, and for each one there are two fields: `id` and `name`. 

Therefore we need to figure out how to navigate these results and extract information of interest. Thankfully R provides some intuitive methods for performing this task. For example, to access a particular element (observation) in a list:

In [None]:
forces_data[9]

R begins counting at one, hence why the value "9" refers to position "9" in the list (unlike Python which begins counting at zero).
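One detail worth knowing: single square brackets return a list of length one that *contains* the element, whereas double square brackets return the element itself. A minimal sketch, using a made-up two-element list in the same shape as `forces_data` (the name `forces_example` is ours):

```r
# Illustrative list in the same shape as forces_data
forces_example <- list(
  list(id = "avon-and-somerset", name = "Avon and Somerset Constabulary"),
  list(id = "bedfordshire", name = "Bedfordshire Police")
)

sub_list <- forces_example[2]    # a list of length 1 containing the record
record   <- forces_example[[2]]  # the record itself

print(length(sub_list))  # 1
print(names(record))     # "id" "name"
print(record$name)       # "Bedfordshire Police"
```

Use double brackets when you want to work with the fields of a single record, for example `forces_example[[2]]$name`.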

**TASK**: extract a different police force from the list using another index value.

In [None]:
forces_data[INSERT_INDEX_VALUE]

OK, now that we're familiar with lists we can extract the `id` for each force and store them in a separate list like so:

In [None]:
force_ids <- sapply(forces_data, function(x) x$id)
print(force_ids)

To construct our list of force ids, we've made use of one of R's *apply* functions: `sapply()`. It applies a function to every element of a list and simplifies the result, in this case into a character vector.

We create a new variable called `force_ids`, and we populate this variable with the values from the `id` field for each element (`x`) in the list (`forces_data`).
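A stricter alternative to `sapply()` is `vapply()`, where we state the type of value we expect back from each element. The sketch below uses a made-up list in the same shape as `forces_data`; the names `forces_example` and `broken_example` are ours:

```r
# Illustrative list in the same shape as forces_data
forces_example <- list(
  list(id = "avon-and-somerset", name = "Avon and Somerset Constabulary"),
  list(id = "bedfordshire", name = "Bedfordshire Police")
)

# character(1) declares that every element must yield exactly one string
ids <- vapply(forces_example, function(x) x$id, character(1))
print(ids)

# If a record is missing its id, vapply() stops with an error
# instead of quietly returning a result of a different shape
broken_example <- list(list(id = "bedfordshire"), list(name = "No id here"))
outcome <- tryCatch(
  vapply(broken_example, function(x) x$id, character(1)),
  error = function(e) "vapply() raised an error"
)
print(outcome)
```

For a well-behaved API the two functions give the same answer; `vapply()` simply fails earlier and more loudly when the data are not as expected.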

#### Stop-and-search data

Now that we have a list of force ids we can request their respective stop-and-search data. For now, let's simplify our task by requesting data for the *City of London* police force.

In [None]:
baseurl <- "https://data.police.uk/api/" # base web address
sas <- "stops-force" # stop-and-search endpoint
force <- "city-of-london" # particular force to download data for

webadd <- paste0(baseurl, sas, "?force=", force)
cat("Requesting stop-and-search data from:", webadd, "\n")
    
response <- GET(webadd) # request the web address
status_code(response) # check if API was requested successfully
    
sas_data <- content(response, as = "parsed") # store the data in a variable "sas_data"

sas_data <- lapply(sas_data, function(record) {
    record$force <- force
    return(record)
  })

The above code utilises many of the techniques we've seen before. The only new addition is the call to `lapply()` at the end of the cell.

This call loops over every stop-and-search record and adds a new field (`force`) to each one, as the original data doesn't make it clear which police force it refers to.
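If `lapply()` feels unfamiliar, it may help to see it next to an explicit `for` loop that does the same job. The records below are made up for illustration (real records contain many more fields):

```r
# Made-up records standing in for the stop-and-search data
records <- list(
  list(type = "Person search", outcome = "Arrest"),
  list(type = "Vehicle search", outcome = "A no further action disposal")
)
force <- "city-of-london"

# lapply() version, as used in the cell above
with_lapply <- lapply(records, function(record) {
  record$force <- force
  record
})

# Equivalent explicit loop
with_loop <- records
for (i in seq_along(with_loop)) {
  with_loop[[i]]$force <- force
}

same_result <- identical(with_lapply, with_loop)
print(same_result)
```

Both approaches produce the same list; `lapply()` is simply the more compact way of writing it in R.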

Let's look at the data itself:

In [None]:
sas_data

And how many records are returned:

(Note that by default only records relating to the most recent month are returned. You can search for other dates: see the <a href="https://data.police.uk/docs/method/stops-force/" target=_blank>documentation</a> for how to specify this option.)

In [None]:
length(sas_data)
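To request a different month, we can add a second search term to the web address. The sketch below assumes the `date` parameter takes a year and month in the form `YYYY-MM`, as described in the documentation linked above; the month chosen is arbitrary and data may not be available for it:

```r
baseurl <- "https://data.police.uk/api/" # base web address
sas <- "stops-force"                     # stop-and-search endpoint
force <- "city-of-london"                # particular force

# Format a date as year-month, e.g. "2024-01"
month <- format(as.Date("2024-01-15"), "%Y-%m")

# Search terms after the first are joined with "&"
webadd <- paste0(baseurl, sas, "?force=", force, "&date=", month)
cat(webadd, "\n")
```

The resulting address can be passed to `GET()` in exactly the same way as before.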

### Saving results

The final task is to save the data to a file that we can use in the future. We'll write the data to a JSON file format, as this is the structure the data were returned in.

In [None]:
# Create a downloads folder

if (!dir.exists("./downloads")) {
  dir.create("./downloads")
}

The use of "./" tells the `dir.create()` command that the "downloads" folder should be created inside the current working directory, which is normally the directory where this notebook is located. So if this notebook was stored in a directory located at "C:/Users/joebloggs/notebooks", the `dir.create()` command would result in a new folder located at "C:/Users/joebloggs/notebooks/downloads".
   
(Technically the "./" is not needed and you could just write `dir.create("downloads")` but it's good practice to be explicit.)
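An alternative way of building folder and file locations is `file.path()`, which inserts the separators for you. A minimal sketch, working inside a temporary directory so that it leaves nothing behind in your notebook folder (the file name is made up for illustration):

```r
# Work inside a temporary directory so the example leaves no files behind
base_dir <- tempdir()
downloads <- file.path(base_dir, "downloads")

# recursive = TRUE also creates any missing parent folders;
# showWarnings = FALSE stays quiet if the folder already exists
dir.create(downloads, recursive = TRUE, showWarnings = FALSE)

# Build a dated file name inside the new folder
outfile <- file.path(downloads, paste0("example-", Sys.Date(), ".json"))

print(dir.exists(downloads))
print(basename(outfile))
```

Because `showWarnings = FALSE` suppresses the warning about an existing folder, this version can be re-run safely without first checking `dir.exists()`.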

In [None]:
# Write the results to a JSON file

ddate <- Sys.Date()
forces_outfile <- paste0("./downloads/uk-police-forces-", ddate, ".json")
col_outfile <- paste0("./downloads/city-of-london-sas-", ddate, ".json")

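# auto_unbox = TRUE writes single values as plain values rather than one-element arrays
write_json(forces_data, forces_outfile, pretty = TRUE, auto_unbox = TRUE)
write_json(sas_data, col_outfile, pretty = TRUE, auto_unbox = TRUE)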

How do we know this worked? The simplest way is to check whether a) the file was created, and b) the data were written to it.

In [None]:
# Check presence of file in current folder

list.files("./downloads")

In [None]:
# Open each file and read (import) its contents

data <- fromJSON(forces_outfile)
data  

In [None]:
data <- readLines(col_outfile)
data 

This is difficult to read so instead we can convert the original JSON into a more familiar data structure (tabular or matrix).

In [None]:
data <- fromJSON(col_outfile)
print(head(data))
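It is also possible to build a table directly from a list of records that is already in memory, without reading a file back in. The sketch below uses base R only and made-up records with three simple fields; real stop-and-search records contain nested and missing values, which this simple approach does not handle:

```r
# Made-up records with simple, complete fields
records <- list(
  list(type = "Person search", gender = "Male", force = "city-of-london"),
  list(type = "Vehicle search", gender = "Female", force = "city-of-london")
)

# Turn each record into a one-row data frame, then stack the rows
rows <- lapply(records, function(r) as.data.frame(r, stringsAsFactors = FALSE))
sas_table <- do.call(rbind, rows)
print(sas_table)
```

For messier data, reading the JSON with `fromJSON()` as above is usually the easier route.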

That concludes our whistlestop tour of the UK Police API. See Appendix A for an example of how to request stop-and-search data for **every** police force.

## What have we learned?

Let's recap what key skills and techniques we've learned:
* **How to load packages**. You will usually need to load packages into R to support your work. R does come with some methods and functions that are ready to use straight away, but for computational social science tasks you'll almost certainly need to load some additional packages.
* **How to make requests (calls) for data to an API**. You can use R to request data from an API.
* **How to handle and save the data that is returned by the API**. APIs tend to return data in JSON format, which requires different data manipulation techniques than you may be used to. You can process this data and save it to a file for future use.
* **How to do all of the above in an efficient, clear and effective manner**.

## Conclusion

Interacting with an API is a simple yet powerful computational method for collecting data of value for social science research. It also provides a relatively gentle introduction to using programming languages. However, "with great power comes great responsibility" (sorry). APIs take you into the realm of data protection, Terms of Service/Use, and many murky ethical issues. Wielded sensibly and sensitively, collecting data from APIs is a valuable and exciting social science research method.

Good luck on your data-driven travels!

## Exercise

Produce a list of all senior police officers for each force listed in the Police API data.

In [None]:
# INSERT CODE HERE

In [None]:
# INSERT CODE HERE

The solution is provided at the end of this notebook.

## Bibliography

Bail, C. (2021). *Application Programming Interfaces*. <a href="https://sicss.io/overview/application-programming-interfaces" target=_blank>https://sicss.io/overview/application-programming-interfaces</a>

Barba, L. A., et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>

Brooker, P. (2020). *Programming with Python for Social Scientists*. London: SAGE Publications Ltd.

Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. https://www.textbook.ds100.org

Tagliaferri, L. (n.d.). *How to Code in Python 3*. https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf

## Appendices

### Appendix A - Stop-and-search data for every police force

It would be inefficient to request data for every police force one at a time. We can make use of lists and loops to speed up the downloading of data.

First, let's get our list of police forces:

In [None]:
# Load necessary libraries
library(httr)    # For making API requests
library(jsonlite) # For JSON handling
library(lubridate) # For date handling

cat("Successfully imported necessary libraries\n")

# Define base URL and search terms
baseurl <- "https://data.police.uk/api/"
forces <- "forces"
webadd <- paste0(baseurl, forces)  # Construct the full URL

# Make the API call
response <- GET(webadd)

# Check status code
status_code(response)

# Store the data in a variable
forces_data <- content(response, as = "parsed")

# Print the fetched data (first few entries)
print(head(forces_data))

Next, we extract a list of force ids:

In [None]:
force_ids <- sapply(forces_data, function(x) x$id)

Then, for each of these ids we request stop-and-search data and store the results in a list:

In [None]:
# Load necessary libraries

library(httr)    # For HTTP requests
library(jsonlite) # For JSON handling

# Define base URL and endpoint
baseurl <- "https://data.police.uk/api/"
sas <- "stops-force"

# Create an empty list to store results
forces_sas_data <- list()

# Loop over each force ID to fetch stop-and-search data
for (force in force_ids) {
  
  # Construct API request URL
  webadd <- paste0(baseurl, sas, "?force=", force)
  
  # Make the API call
  response <- GET(webadd)
  
  # Check if request was successful
  if (status_code(response) == 200) {
    sas_data <- content(response, as = "parsed")
    
    # Add additional metadata to each record
    sas_data <- lapply(sas_data, function(el) {
      el$force <- force
      el$code <- status_code(response)
      el$note <- "Downloaded data"
      return(el)
    })
  } else {
    # If request fails, store failure message
    sas_data <- list(list(force = force, note = "Could not download", code = status_code(response)))
  }
  
  # Append results to the list
  forces_sas_data <- append(forces_sas_data, sas_data)
}

# Print a sample of the collected data
print(head(forces_sas_data, 3))

You'll see we added a conditional statement (`if, else`) to check whether we made a successful request for data: if yes, then store the data in the `sas_data` variable; if no, then define the `sas_data` variable as a list containing some notes about the unsuccessful attempt.
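The conditional statement deals with requests that come back with an unsuccessful status code. A request can also fail before any status code is returned (e.g., if the network connection drops), in which case R raises an error and the loop stops. One way of guarding against this is `tryCatch()`. In the sketch below, `safe_fetch()` and `flaky_fetch()` are hypothetical functions of our own, and the "request" is simulated so the example runs offline:

```r
# safe_fetch() is a hypothetical wrapper: it runs `fetch` and, if an error
# is raised, returns a placeholder record instead of stopping
safe_fetch <- function(fetch, force) {
  tryCatch(
    fetch(force),
    error = function(e) {
      list(list(force = force, note = conditionMessage(e), code = NA_integer_))
    }
  )
}

# Simulated request that fails for one made-up force id
flaky_fetch <- function(force) {
  if (force == "example-force") stop("connection failed")
  list(list(force = force, note = "Downloaded data", code = 200L))
}

ok <- safe_fetch(flaky_fetch, "city-of-london")
bad <- safe_fetch(flaky_fetch, "example-force")

print(ok[[1]]$note)
print(bad[[1]]$note)
```

The placeholder has the same fields as a normal record, so it can be appended to the results list in the same way.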

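Let's check the results. Because each force's records were appended to a single flat list, the length of `forces_sas_data` is the total number of records returned (plus one placeholder entry for each failed request), rather than the number of police forces (44 in our list):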

In [None]:
length(forces_sas_data)

We'll leave it to you to examine the contents of the `forces_sas_data` variable.
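As a starting point, you could count how many records each force contributed and identify any failed requests. The sketch below uses a small made-up list in the same shape as `forces_sas_data` (the name `forces_sas_example` and the force id `example-force` are ours):

```r
# Made-up results in the same shape as forces_sas_data
forces_sas_example <- list(
  list(force = "city-of-london", note = "Downloaded data", code = 200),
  list(force = "city-of-london", note = "Downloaded data", code = 200),
  list(force = "bedfordshire", note = "Downloaded data", code = 200),
  list(force = "example-force", note = "Could not download", code = 404)
)

# Pull out the force and note fields from every record
forces_seen <- vapply(forces_sas_example, function(el) el$force, character(1))
notes <- vapply(forces_sas_example, function(el) el$note, character(1))

# Number of records per force
print(table(forces_seen))

# Number of distinct forces, and which ones failed to download
n_forces <- length(unique(forces_seen))
failed <- unique(forces_seen[notes == "Could not download"])
print(n_forces)
print(failed)
```

Remember that a force with a failed request still appears once in the list, through its placeholder entry.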

### Exercise Solution

In [None]:
# Load necessary libraries

library(httr)     # For HTTP requests
library(jsonlite) # For JSON handling
library(lubridate) # For date handling
library(dplyr)    # For data manipulation

cat("Successfully imported necessary libraries\n")

# Define base URL and endpoints
baseurl <- "https://data.police.uk/api/"
forces_endpoint <- "forces"
people_endpoint <- "people"

# Construct API request URL for police forces
webadd <- paste0(baseurl, forces_endpoint)

# Get force information
response <- GET(webadd)

# Check API response status
if (status_code(response) == 200) {
  forces_data <- content(response, as = "parsed")
} else {
  stop("Failed to fetch police forces data. Status code: ", status_code(response))
}

# Extract force IDs
force_ids <- sapply(forces_data, function(el) el$id)

# Initialize empty list for storing police chief information
chief_list <- list()

# Loop over each police force ID to get chief officer details
for (force_id in force_ids) {
  people_webadd <- paste0(baseurl, forces_endpoint, "/", force_id, "/", people_endpoint)
  
  # Make API request
  response <- GET(people_webadd)
  
  # Check if request was successful
  if (status_code(response) == 200) {
    chief_info <- content(response, as = "parsed")
  } else {
    chief_info <- list(list(force = force_id, note = "Could not retrieve data", code = status_code(response)))
  }
  
  # Append to chief_list
  chief_list <- append(chief_list, list(chief_info))
  
  # Pause for 1 second to comply with API rate limits
  Sys.sleep(1)
}

# Display first few entries
print(head(chief_list, 5))

In [None]:
chief_list
