![SGSSS Logo](../img/SGSSS_Stacked.png)

# Collecting Digital Data for Social Scientists

## Introduction

Application Programming Interfaces (APIs) are one of the most important and reliable ways to access data online. Unlike web scraping, which involves parsing the visual structure of web pages and is vulnerable to layout changes, APIs provide a structured, programmatic interface that data providers explicitly design for machine-to-machine communication. This makes them more stable, more efficient, and often more ethical for data collection.

In this practical, we will work with the [UK Police API](https://data.police.uk/docs/), a freely available, open-data API that provides detailed information on policing in England, Wales, and Northern Ireland. The data available through this API -- including stop-and-search records, crime reports, and police force information -- are of direct relevance to criminological and sociological research. By the end of this session, you will be comfortable making API requests, processing JSON responses, and converting the results into formats suitable for analysis.

## Aims

This practical has two main aims:

1. **Demonstrate how to use R for API data collection** -- you will learn the key steps involved in requesting data from a web API, handling the response, and saving it in a structured format.
2. **Cultivate computational thinking** -- beyond the specific code, this session encourages you to think systematically about how digital data are structured and how to approach data collection tasks programmatically.

## Lesson Details

| Detail | Information |
|---|---|
| **Level** | Introductory |
| **Time** | ~60 minutes |
| **Pre-requisites** | None |

### Learning Outcomes

By the end of this practical, you will be able to:

- Understand what an API is and why it is useful for social science research.
- Identify the key steps involved in collecting data from a web API.
- Use R to request data from an API, process the JSON response, and save the results in a structured format suitable for analysis.

## Guide to Using This Resource

This practical is designed to be run in [Google Colab](https://colab.research.google.com/) or any Jupyter environment with an R kernel installed. If you are using Colab, note that R runtime support is available but may require selecting **Runtime > Change runtime type > R** from the menu.

To run a code cell, click on it and press **Shift + Enter**. The output will appear directly below the cell. Work through the notebook from top to bottom, running each cell in order.

If you are new to Jupyter notebooks or want a more thorough introduction, Dani Arribas-Bel provides an excellent guide in his [Geographic Data Science course materials](https://darribas.org/gds_course/content/bB/lab_B.html).

In [None]:
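# Install any packages that are not already available, then load them
pkgs <- c("httr", "jsonlite", "dplyr", "ggplot2")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)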
library(httr)
library(jsonlite)
library(dplyr)
library(ggplot2)
cat("Packages loaded\n")

In [None]:
# Run this cell to check everything is working.
# You should see a message printed below.
cat("Welcome to Practical 2: UK Police API Deep-Dive!\n")
cat("Your R environment is ready.\n")

## What is an API?

We covered APIs in detail in Lecture 2, so here we provide only a brief recap. An API (Application Programming Interface) acts as a **translator** between your code and a remote server. You send a request in a format the server understands, and it sends back the data you asked for in a structured format -- typically JSON.

Think of it like ordering food at a restaurant: you do not go into the kitchen yourself. Instead, you tell the waiter (the API) what you want, and the waiter brings back your meal (the data). The key advantage is that the data provider controls what is available and how it is delivered, which means the process is predictable, well-documented, and designed for automated access.

## First API Call (~10 min)

Let us make our first request to the UK Police API. We will start by retrieving a list of all police forces.

In [None]:
# Define the base URL and endpoint
baseurl <- "https://data.police.uk/api/"
forces_url <- paste0(baseurl, "forces")
cat("Requesting:", forces_url, "\n")

# Make the request
response <- GET(forces_url)

# Check the status code
status_code(response)

### Understanding Status Codes

The **status code** tells us whether the request was successful:

| Code | Meaning |
|---|---|
| **200** | OK -- the request was successful |
| **301** | Moved Permanently -- the resource has moved to a new URL |
| **404** | Not Found -- the resource does not exist |
| **429** | Too Many Requests -- you are being rate-limited |
| **500** | Internal Server Error -- something went wrong on the server side |

A status code of `200` means everything worked as expected.
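
To make the table concrete, here is a minimal sketch of a lookup helper. `describe_status()` and `is_success()` are hypothetical names introduced for illustration; they are not part of `httr`.

```r
# Hypothetical helper: map a status code to the meaning given in the table above.
describe_status <- function(code) {
  switch(
    as.character(code),
    "200" = "OK: the request was successful",
    "301" = "Moved Permanently: the resource has moved to a new URL",
    "404" = "Not Found: the resource does not exist",
    "429" = "Too Many Requests: you are being rate-limited",
    "500" = "Internal Server Error: something went wrong on the server side",
    "Other: not covered by the table above"  # default branch
  )
}

# Hypothetical helper: codes in the 200 range indicate success.
is_success <- function(code) code >= 200 && code < 300

describe_status(404)
is_success(200)
```

With a live response object, `httr::http_error(response)` returns `TRUE` for status codes of 400 and above, which offers a ready-made alternative to a hand-written check.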

In [None]:
# Parse the JSON response into an R data frame
forces_data <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
head(forces_data)

### Understanding JSON

The data returned by the API are in **JSON** (JavaScript Object Notation) format, the most common format for API responses. JSON uses:

- **Curly braces `{}`** to define objects (similar to R named lists), which contain **key-value pairs** separated by colons.
- **Square brackets `[]`** to define arrays (similar to R vectors or lists).

For example, each police force is represented as an object like `{"id": "avon-and-somerset", "name": "Avon and Somerset Constabulary"}`. The full response is an array of these objects.

In R, `jsonlite::fromJSON()` automatically simplifies a JSON array of flat objects into a **data frame**, with one row per object and one column per key, so the result is ready for analysis straight away. Nested objects are kept as nested columns; passing `flatten = TRUE` to `fromJSON()` spreads them into ordinary columns.
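
The following minimal sketch parses JSON text directly, without contacting the API, to show what `fromJSON()` does. The first string mirrors the shape of the forces response; the second is a simplified, hypothetical record with a nested object, used only to illustrate the `flatten` argument.

```r
library(jsonlite)

# An array of flat objects becomes a data frame
json_text <- '[
  {"id": "avon-and-somerset", "name": "Avon and Somerset Constabulary"},
  {"id": "bedfordshire", "name": "Bedfordshire Police"}
]'
forces_example <- fromJSON(json_text)
class(forces_example)
forces_example$name[2]

# A nested object becomes a nested column unless flatten = TRUE is set
nested_text <- '[
  {"type": "Person search",
   "location": {"latitude": "51.5", "street": {"id": 1, "name": "Example Street"}}}
]'
nested_example <- fromJSON(nested_text)
flat_example <- fromJSON(nested_text, flatten = TRUE)
names(flat_example)
```

In the flattened version, nested keys are joined with a dot, so the street name ends up in a column called `location.street.name`.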

## Working with JSON (~15 min)

Now that we have our data, let us explore its structure using standard R operations.

In [None]:
# What type of R object is forces_data?
class(forces_data)

Because `fromJSON()` automatically simplifies the JSON array of objects, `forces_data` is already a `data.frame` in R. Each row corresponds to a police force, and each column corresponds to a JSON key (e.g., `id` and `name`). This is one of the key differences from Python, where the JSON response is initially a list of dictionaries that you must explicitly convert to a DataFrame.

In [None]:
# How many police forces are there?
nrow(forces_data)

In [None]:
# Access individual rows by index
cat("First force:\n")
forces_data[1, ]

cat("\nTenth force:\n")
forces_data[10, ]

### A Note on Indexing

Unlike Python, R uses **one-based indexing**, meaning the first element is at position `1`, the second at position `2`, and so on. This is why `forces_data[1, ]` gives us the first police force and `forces_data[10, ]` gives us the tenth.

### TASK

Try extracting a different police force from the data frame by using a different row number. For example, what force is at row `26`? Or `41`? Experiment in the cell below.

In [None]:
# What happens if we try a row number that is too large?
forces_data[200, ]

Unlike Python, which raises an `IndexError`, R returns a row of `NA` values when you request a row beyond the data frame's length. This is a subtle but important difference -- R will not stop you with an error, so it is good practice to always check the dimensions of your data with `nrow()` or `dim()` before accessing specific rows.
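
If you would rather have R stop with an error, one option is a small bounds-checking wrapper. The sketch below uses a hypothetical helper, `get_row()`, and a made-up toy data frame; neither comes from the API.

```r
# Hypothetical helper: return row i, or stop if i is outside the data frame.
get_row <- function(df, i) {
  if (i < 1 || i > nrow(df)) {
    stop("Row ", i, " is out of range: the data frame has ", nrow(df), " rows.")
  }
  df[i, , drop = FALSE]
}

toy <- data.frame(
  id = c("a", "b", "c"),
  name = c("Force A", "Force B", "Force C"),
  stringsAsFactors = FALSE
)

get_row(toy, 2)

# Capture the error message instead of halting the notebook
result <- tryCatch(get_row(toy, 200), error = function(e) conditionMessage(e))
result
```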

In [None]:
# Extract all force IDs using the $ operator
force_ids <- forces_data$id
head(force_ids)

### The `$` Operator

In R, the `$` operator extracts a single column from a data frame by name. The expression `forces_data$id` returns a character vector containing every value in the `id` column. The Python equivalent is a list comprehension such as `[force["id"] for force in forces_data]`; the R version is more concise because a data frame stores each column as a vector.
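
As a brief sketch on made-up toy data, the double-bracket operator `[[` returns the same column as `$` and has the advantage that the column name can be held in a variable.

```r
toy_forces <- data.frame(
  id = c("force-a", "force-b"),
  name = c("Force A", "Force B"),
  stringsAsFactors = FALSE
)

# Two equivalent ways to extract the id column
ids_dollar <- toy_forces$id
ids_brackets <- toy_forces[["id"]]

# [[ also works when the column name is stored in a variable
column_name <- "name"
names_by_variable <- toy_forces[[column_name]]
```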

## Parameterised Requests and Looping (~20 min)

So far, we have made a simple request to a single endpoint. But most APIs allow you to customise your requests using **query parameters** -- additional information appended to the URL that filters or specifies the data you want.

We will now use the **stop-and-search** endpoint of the UK Police API. This endpoint requires a `force` parameter, which specifies which police force's data we want to retrieve. The URL takes the form:

```
https://data.police.uk/api/stops-force?force=<force-id>
```

The `?force=` part is the query parameter. Let us start with a single force.
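
As a minimal sketch of how such URLs are put together, the hypothetical helper `build_url()` below (not part of `httr` or the API) joins a base URL, an endpoint, and a named list of query parameters. The `date` parameter in the second call is an assumption based on the API documentation, which describes an optional month in `YYYY-MM` form; check the documentation before relying on it.

```r
# Hypothetical helper: build a request URL from its parts.
build_url <- function(base, endpoint, params = list()) {
  url <- paste0(base, endpoint)
  if (length(params) == 0) {
    return(url)
  }
  # Percent-encode each value so that spaces and symbols are safe in a URL
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  # Join key=value pairs with "&" and attach them after "?"
  paste0(url, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

single_url <- build_url(
  "https://data.police.uk/api/", "stops-force",
  list(force = "city-of-london")
)

# Assumed optional parameter: restrict the results to one month
dated_url <- build_url(
  "https://data.police.uk/api/", "stops-force",
  list(force = "city-of-london", date = "2023-01")
)
```

`httr::GET()` also accepts a `query` argument (a named list) and builds the query string for you, so a helper like this is mainly useful for seeing what happens under the hood.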

In [None]:
force <- "city-of-london"
sas_url <- paste0(baseurl, "stops-force?force=", force)
cat("Requesting:", sas_url, "\n")

response <- GET(sas_url)
status_code(response)

In [None]:
sas_data <- fromJSON(content(response, as = "text", encoding = "UTF-8"))

# Add the force name to the data so we know where it came from
sas_data$force <- force

nrow(sas_data)

Notice that we added a `force` column to the data frame. The API does not include this information by default, so we inject it ourselves. This is essential when we later combine data from multiple forces -- without it, we would not know which force each record belongs to.

In [None]:
# Inspect the first few records
head(sas_data, 2)

### Looping Over All Forces

We have data for one force, but we want data for **all** forces. To do this, we will loop over our `force_ids` vector and make a separate API request for each one.

There are two important considerations when doing this:

1. **Error handling**: Not every request will succeed. Some forces may not have stop-and-search data available. We check the status code and use `tryCatch()` to handle failures gracefully.
2. **Rate limiting**: Making too many requests too quickly can overload the server or get you temporarily blocked. We use `Sys.sleep(1)` to pause for one second between each request, which is good practice and respectful of the API provider.

In [None]:
all_sas <- list()

for (force in force_ids) {
  url <- paste0(baseurl, "stops-force?force=", force)
  response <- GET(url)
  
  if (status_code(response) == 200) {
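    # Parse the body; fall back to NULL if it is not valid JSON
    parsed <- tryCatch(
      fromJSON(content(response, as = "text", encoding = "UTF-8")),
      error = function(e) NULL
    )
    # An empty JSON array parses to an empty list, not a data frame,
    # so check the type before calling nrow()
    if (is.data.frame(parsed) && nrow(parsed) > 0) {
      parsed$force <- force
      all_sas[[force]] <- parsed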
    }
  } else {
    cat(paste("Failed for", force, "\n"))
  }
  
  Sys.sleep(1)
}

sas_df <- bind_rows(all_sas)
cat(paste("Total records:", nrow(sas_df), "\n"))

### Understanding the Code

Let us break down the key parts of the loop above:

- **`Sys.sleep(1)`**: Pauses execution for 1 second between requests. This is a simple form of rate limiting that prevents us from overwhelming the API server.
- **`if (status_code(response) == 200)`**: Checks whether the request was successful before trying to process the data. If it was not, we print an informative error message rather than crashing.
- **`tryCatch()`**: Wraps the JSON parsing in error handling. Some responses may return valid HTTP 200 status codes but contain unexpected content (e.g., empty responses). `tryCatch()` prevents the script from stopping if parsing fails.
- **`bind_rows()`**: This `dplyr` function combines a list of data frames into a single data frame by stacking them row-wise. It is the R equivalent of repeatedly using `extend()` in Python, but operates on entire data frames at once. It also handles cases where data frames have slightly different columns by filling missing values with `NA`.
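
The loop above guards the parsing step, but a network failure inside `GET()` would still stop it. As a sketch of one way to handle that, the hypothetical helper `with_retry()` below (a name introduced here, not an `httr` function) calls any function up to `max_tries` times and returns `NULL` if every attempt fails. The retry count and pause length are illustrative choices.

```r
# Hypothetical helper: call fn() until it succeeds or max_tries is reached.
with_retry <- function(fn, max_tries = 3, pause = 1) {
  for (attempt in seq_len(max_tries)) {
    # Turn an error into a value so the loop can continue
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) {
      Sys.sleep(pause)  # wait before trying again
    }
  }
  NULL  # every attempt failed
}

# Possible use inside the loop (not run here):
# response <- with_retry(function() GET(url))
# if (!is.null(response) && status_code(response) == 200) { ... }
```

Returning `NULL` instead of stopping lets the loop skip a force that cannot be reached and carry on with the rest.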

## From JSON to Analysis (~15 min)

We now have a large collection of stop-and-search records in a single data frame. Let us begin exploring the data.

In [None]:
# First few records
head(sas_df)
# A random sample of five records
slice_sample(sas_df, n = 5)

In [None]:
# Cross-tabulation: outcome by age range
table(sas_df$outcome, sas_df$age_range)

The cross-tabulation above shows the number of stops with each outcome within each age group. Because the age groups differ in how many stops they contain, compare the share of each outcome within a column rather than the raw counts when judging whether younger individuals are more or less likely to receive a particular outcome than older ones. Look for patterns: are certain outcomes disproportionately associated with certain age ranges?
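
One way to put the age groups on a comparable footing is to convert the counts into column proportions with base R's `prop.table()`. The sketch below uses a small made-up data frame, not real API output, so that the arithmetic is easy to follow.

```r
# Toy data: six invented stop-and-search records
toy_sas <- data.frame(
  outcome = c("Arrest", "No further action", "No further action",
              "Arrest", "No further action", "No further action"),
  age_range = c("18-24", "18-24", "18-24", "25-34", "25-34", "over 34"),
  stringsAsFactors = FALSE
)

counts <- table(toy_sas$outcome, toy_sas$age_range)

# margin = 2 divides each cell by its column total,
# giving the share of each outcome within an age group
col_props <- prop.table(counts, margin = 2)
round(col_props, 2)
```

Applied to the real data, the same two lines would be `prop.table(table(sas_df$outcome, sas_df$age_range), margin = 2)`.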

In [None]:
ggplot(sas_df, aes(x = outcome)) +
  geom_bar(fill = "#A3217A") +
  coord_flip() +
  labs(x = "Outcome", y = "Count", title = "Stop and Search Outcomes") +
  theme_minimal()

The bar chart gives an overview of how stop-and-search encounters are resolved across the forces that returned data, with the most common outcomes immediately visible. Consider what this distribution tells us about policing practice in England and Wales -- for example, what proportion of stops result in no further action?

## Exercise

**Task:** Produce a list of all senior police officers for each force.

The UK Police API provides a "people" endpoint for each force at:

```
https://data.police.uk/api/forces/<force-id>/people
```

Using the `force_ids` vector we created earlier, write a loop that requests the senior officers for every force, collects the results, and stores them in a list. Remember to include rate limiting and error handling.

Use the skeleton code below as a starting point. Fill in the gaps marked `# INSERT CODE HERE`.

In [None]:
# Skeleton code for the exercise

baseurl <- "https://data.police.uk/api/"

# Step 1: Get force IDs (we already have these, but included for completeness)
response <- GET(paste0(baseurl, "forces"))
forces_data <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
force_ids <- forces_data$id

# Step 2: Loop over all forces and collect senior officer data
chief_list <- list()

# INSERT CODE HERE
# Hint: loop over force_ids, construct the URL for each force's people endpoint,
# make a request, check the status code, and append the results to chief_list.
# Do not forget rate limiting!

cat(paste("Collected data for", length(chief_list), "forces.\n"))

**Hint:** The solution follows the same pattern as our stop-and-search loop. You need to construct a URL for each force, make a request, check the status code, and store the result. A full solution is provided in the Appendix at the end of this notebook.

## Appendix: Exercise Solution

In [None]:
chief_list <- list()

for (fid in force_ids) {
  url <- paste0(baseurl, "forces/", fid, "/people")
  response <- GET(url)
  
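  if (status_code(response) == 200) {
    # Store the parsed result under the force ID so we know where it came from
    chief_list[[fid]] <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
  } else {
    cat("Failed for", fid, "\n")
  }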
  
  Sys.sleep(1)
}

head(chief_list, 2)
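
`chief_list` is a list with one element per force. A possible next step, sketched here on made-up toy data rather than real API output, is to stack those elements into one data frame with `dplyr::bind_rows()`, using its `.id` argument to record which force each row came from. The officer names below are placeholders, and the empty element stands in for a force whose response contained no records.

```r
library(dplyr)

# Toy stand-in for chief_list: names are force IDs, values are parsed responses
toy_people <- list(
  "force-a" = data.frame(
    name = c("A. Example", "B. Example"),
    rank = c("Chief Constable", "Deputy Chief Constable"),
    stringsAsFactors = FALSE
  ),
  "force-b" = list(),  # an empty JSON array parses to an empty list
  "force-c" = data.frame(
    name = "C. Example",
    rank = "Chief Constable",
    bio = "Placeholder biography",
    stringsAsFactors = FALSE
  )
)

# Keep only the elements that are data frames, then stack them;
# .id adds a column holding the list names (the force IDs)
with_rows <- Filter(is.data.frame, toy_people)
people_df <- bind_rows(with_rows, .id = "force")
people_df
```

Columns that are missing for a force, such as `bio` here, are filled with `NA`.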

---

**END OF FILE**