![SGSSS Logo](../img/SGSSS_Stacked.png)

# Collecting Digital Data for Social Scientists

## Introduction

Application Programming Interfaces (APIs) are one of the most important and reliable ways to access data online. Unlike web scraping, which involves parsing the visual structure of web pages and is vulnerable to layout changes, APIs provide a structured, programmatic interface that data providers explicitly design for machine-to-machine communication. This makes them more stable, more efficient, and often more ethical for data collection.

In this practical, we will work with the [UK Police API](https://data.police.uk/docs/), a freely available, open-data API that provides detailed information on policing in England, Wales, and Northern Ireland. The data available through this API -- including stop-and-search records, crime reports, and police force information -- are of direct relevance to criminological and sociological research. By the end of this session, you will be comfortable making API requests, processing JSON responses, and converting the results into formats suitable for analysis.

## Aims

This practical has two main aims:

1. **Demonstrate how to use Python for API data collection** -- you will learn the key steps involved in requesting data from a web API, handling the response, and saving it in a structured format.
2. **Cultivate computational thinking** -- beyond the specific code, this session encourages you to think systematically about how digital data are structured and how to approach data collection tasks programmatically.

## Lesson Details

| Detail | Information |
|---|---|
| **Level** | Introductory |
| **Time** | ~60 minutes |
| **Pre-requisites** | None |

### Learning Outcomes

By the end of this practical, you will be able to:

- Understand what an API is and why it is useful for social science research.
- Identify the key steps involved in collecting data from a web API.
- Use Python to request data from an API, process the JSON response, and save the results in a structured format suitable for analysis.

## Guide to Using This Resource

This practical is designed to be run in [Google Colab](https://colab.research.google.com/), a free, cloud-based environment for running Python code in Jupyter notebooks. You do not need to install anything on your own computer.

To run a code cell, click on it and press **Shift + Enter**. The output will appear directly below the cell. Work through the notebook from top to bottom, running each cell in order.

If you are new to Jupyter notebooks or want a more thorough introduction, Dani Arribas-Bel provides an excellent guide in his [Geographic Data Science course materials](https://darribas.org/gds_course/content/bB/lab_B.html).

In [None]:
# Run this cell to check everything is working.
# You should see a message printed below.
print("Welcome to Practical 2: UK Police API Deep-Dive!")
print("Your Python environment is ready.")

## What is an API?

We covered APIs in detail in Lecture 2, so here we provide only a brief recap. An API (Application Programming Interface) acts as a **translator** between your code and a remote server. You send a request in a format the server understands, and it sends back the data you asked for in a structured format -- typically JSON.

Think of it like ordering food at a restaurant: you do not go into the kitchen yourself. Instead, you tell the waiter (the API) what you want, and the waiter brings back your meal (the data). The key advantage is that the data provider controls what is available and how it is delivered, which means the process is predictable, well-documented, and designed for automated access.

## First API Call (~10 min)

Let us make our first request to the UK Police API. We will start by retrieving a list of all police forces.

In [None]:
import requests
import json
import pandas as pd
from datetime import datetime
import time
import matplotlib.pyplot as plt

In [None]:
# Define the base URL and endpoint
baseurl = "https://data.police.uk/api/"
forces_endpoint = "forces"

# Construct the full URL
webadd = baseurl + forces_endpoint
print("Requesting:", webadd)

# Make the request
response = requests.get(webadd)

# Check the status code
print("Status code:", response.status_code)

### Understanding Status Codes

The **status code** tells us whether the request was successful:

| Code | Meaning |
|---|---|
| **200** | OK -- the request was successful |
| **301** | Moved Permanently -- the resource has moved to a new URL |
| **404** | Not Found -- the resource does not exist |
| **429** | Too Many Requests -- you are being rate-limited |
| **500** | Internal Server Error -- something went wrong on the server side |

A status code of `200` means everything worked as expected.
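As a small illustration of the table above, Python's standard library already stores the standard reason phrase for each code. The helper name `describe_status` is made up for this sketch; it is not part of `requests` or of the Police API.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one of the standard HTTP status codes
        return f"{code} Unknown status code"
    return f"{code} {status.phrase}"


for code in (200, 301, 404, 429, 500):
    print(describe_status(code))
```

In day-to-day use you rarely need this lookup: a `requests` response also offers `response.ok` and `response.raise_for_status()` for checking whether a call succeeded.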

In [None]:
forces_data = response.json()
forces_data

### Understanding JSON

The data returned by the API is in **JSON** (JavaScript Object Notation) format. JSON is the most common format for API responses. It uses:

- **Curly braces `{}`** to define objects (similar to Python dictionaries), which contain **key-value pairs** separated by colons.
- **Square brackets `[]`** to define arrays (similar to Python lists).

For example, each police force is represented as an object like `{"id": "avon-and-somerset", "name": "Avon and Somerset Constabulary"}`. The full response is a list (array) of these objects.
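To make the mapping between JSON and Python concrete, the sketch below parses a short JSON string by hand. The string is a made-up sample in the same shape as the forces response, not real output from the API.

```python
import json

# A made-up JSON string shaped like the forces response (illustration only)
raw_text = (
    '[{"id": "avon-and-somerset", "name": "Avon and Somerset Constabulary"},'
    ' {"id": "city-of-london", "name": "City of London Police"}]'
)

parsed = json.loads(raw_text)  # JSON array -> Python list of dictionaries

print(type(parsed))        # the outer square brackets become a list
print(type(parsed[0]))     # each pair of curly braces becomes a dict
print(parsed[0]["name"])   # values are looked up by key
```

Calling `response.json()` performs this same parsing step on the body of the response.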

In [None]:
# Inspect the response headers for additional information
response.headers

## Working with JSON (~15 min)

Now that we have our data, let us explore its structure using standard Python operations.

In [None]:
# What type of Python object is forces_data?
type(forces_data)

In [None]:
# How many police forces are there?
len(forces_data)

In [None]:
# Loop through and print each force
for force in forces_data:
    print(force)
    print()

In [None]:
# Access individual elements by index
print("First force:")
print(forces_data[0])
print()
print("Tenth force:")
print(forces_data[9])

### A Note on Zero-Indexing

Python uses **zero-indexing**, meaning the first element in a list is at position `0`, the second is at position `1`, and so on. This is why `forces_data[0]` gives us the first police force and `forces_data[9]` gives us the tenth.

### TASK

Try extracting a different police force from the list by using a different index number. For example, what force is at position `[25]`? Or `[40]`? Experiment in the cell below.

In [None]:
# What happens if we try an index that is too large?
forces_data[200]

You should see an `IndexError`. This is Python telling you that the index you requested is out of range: index `200` would be the 201st element, and the list of police forces is shorter than that. This is an important reminder to always check the length of your data before trying to access specific elements.
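One way to act on that reminder is to compare the index with the length of the list before using it. This is only a sketch: `get_item_or_none` is a hypothetical helper and `sample_forces` is a made-up stand-in for `forces_data`.

```python
def get_item_or_none(items, index):
    """Return items[index] if the index is in range, otherwise None."""
    if -len(items) <= index < len(items):
        return items[index]
    return None


# Made-up stand-in for forces_data
sample_forces = [
    {"id": "avon-and-somerset"},
    {"id": "bedfordshire"},
    {"id": "cambridgeshire"},
]

print(get_item_or_none(sample_forces, 0))
print(get_item_or_none(sample_forces, 200))
```

Returning `None` rather than raising an error is a design choice that suits exploratory work; in a longer script you may prefer to let the `IndexError` surface so that mistakes are not silently ignored.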

In [None]:
# Use a list comprehension to extract all force IDs
force_ids = [force["id"] for force in forces_data]
force_ids

### List Comprehensions

The code above uses a **list comprehension**, which is a concise way to create a new list by applying an expression to each element of an existing list. The syntax is:

```python
new_list = [expression for item in old_list]
```

In our case, `[force["id"] for force in forces_data]` means: "for each `force` dictionary in `forces_data`, extract the value associated with the key `"id"`, and collect all those values into a new list." This is equivalent to writing a `for` loop, but in a single, readable line.
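To see the equivalence for yourself, the sketch below builds the same list twice, once with an explicit `for` loop and once with a list comprehension. `sample_forces` is a made-up stand-in for `forces_data`.

```python
# Made-up stand-in for forces_data
sample_forces = [
    {"id": "avon-and-somerset", "name": "Avon and Somerset Constabulary"},
    {"id": "city-of-london", "name": "City of London Police"},
]

# Version 1: explicit for loop
ids_from_loop = []
for force in sample_forces:
    ids_from_loop.append(force["id"])

# Version 2: list comprehension
ids_from_comprehension = [force["id"] for force in sample_forces]

# Both versions produce the same list
print(ids_from_loop == ids_from_comprehension)
```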

## Parameterised Requests and Looping (~20 min)

So far, we have made a simple request to a single endpoint. But most APIs allow you to customise your requests using **query parameters** -- additional information appended to the URL that filters or specifies the data you want.

We will now use the **stop-and-search** endpoint of the UK Police API. This endpoint requires a `force` parameter, which specifies which police force's data we want to retrieve. The URL takes the form:

```
https://data.police.uk/api/stops-force?force=<force-id>
```

The `?force=` part is the query parameter. Let us start with a single force.
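As an aside, query strings do not have to be glued together by hand. The sketch below uses the standard library to build the same URL from a dictionary of parameters; `build_url` is a hypothetical helper written for this illustration.

```python
from urllib.parse import urlencode


def build_url(base, endpoint, params):
    """Join a base URL, an endpoint, and a dict of query parameters."""
    return f"{base}{endpoint}?{urlencode(params)}"


url = build_url(
    "https://data.police.uk/api/",
    "stops-force",
    {"force": "city-of-london"},
)
print(url)
```

The `requests` library can do this encoding for you as well: `requests.get(url, params={"force": "city-of-london"})` appends the query string automatically. Plain string concatenation, as used in this practical, works just as well for simple values like force IDs.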

In [None]:
baseurl = "https://data.police.uk/api/"
sas = "stops-force"
force = "city-of-london"

webadd = baseurl + sas + "?force=" + force
print(webadd)

response = requests.get(webadd)
response.status_code

In [None]:
sas_data = response.json()

# Add the force ID to each record so we know where it came from
for el in sas_data:
    el["force"] = force

len(sas_data)

Notice that we added a `"force"` field to each record in the response. The API does not include this information by default, so we inject it ourselves. This is essential when we later combine data from multiple forces -- without it, we would not know which force each record belongs to.
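The cell above changes each record in place. If you would rather leave the original records untouched, one alternative is to build tagged copies instead. This is a sketch only: `tag_records` is a hypothetical helper and the sample records are made up.

```python
def tag_records(records, force_id):
    """Return copies of the records with a "force" key added to each."""
    return [{**record, "force": force_id} for record in records]


# Made-up records (illustration only)
sample_records = [
    {"type": "Person search", "outcome": "Arrest"},
    {"type": "Vehicle search", "outcome": "A no further action disposal"},
]

tagged = tag_records(sample_records, "city-of-london")

print(tagged[0])
print("force" in sample_records[0])  # the originals are left unchanged
```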

In [None]:
# Inspect the first two records
sas_data[:2]

### Looping Over All Forces

We have data for one force, but we want data for **all** forces. To do this, we will loop over our `force_ids` list and make a separate API request for each one.

There are two important considerations when doing this:

1. **Error handling**: Not every request will succeed. Some forces may not have stop-and-search data available. We use an `if/else` statement to check the status code and handle failures gracefully.
2. **Rate limiting**: Making too many requests too quickly can overload the server or get you temporarily blocked. We use `time.sleep(1)` to pause for one second between each request, which is good practice and respectful of the API provider.
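The two considerations above can also be wrapped in a small reusable function. The sketch below is one possible design, not the approach used in the rest of this notebook: `fetch_json` is a hypothetical helper, and its retry policy (try again only on status `429`, waiting a little longer each time) is an assumption made for illustration.

```python
import time


def fetch_json(get, url, max_attempts=3, pause=1.0):
    """Request `url` with `get`; retry on status 429, return None on failure.

    `get` is any callable that behaves like requests.get, i.e. it returns an
    object with a .status_code attribute and a .json() method.
    """
    for attempt in range(1, max_attempts + 1):
        response = get(url)
        if response.status_code == 200:
            return response.json()
        if response.status_code == 429 and attempt < max_attempts:
            # Rate-limited: wait a little longer on each attempt, then retry
            time.sleep(pause * attempt)
            continue
        # Any other failure, or no attempts left: report it and give up
        print(f"Giving up on {url} (status: {response.status_code})")
        return None
    return None
```

In the notebook this could be called as `fetch_json(requests.get, webadd)`. Passing the request function in as an argument keeps the helper easy to test without touching the network.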

In [None]:
import time

forces_sas_data = []

for force in force_ids:
    webadd = baseurl + sas + "?force=" + force
    response = requests.get(webadd)

    if response.status_code == 200:
        data = response.json()
        for el in data:
            el["force"] = force
        forces_sas_data.extend(data)
    else:
        print(f"Could not download data for {force} (status: {response.status_code})")

    time.sleep(1)  # Rate limiting: wait 1 second between requests

print(f"Total records collected: {len(forces_sas_data)}")

### Understanding the Code

Let us break down the key parts of the loop above:

- **`time.sleep(1)`**: Pauses execution for 1 second between requests. This is a simple form of rate limiting that prevents us from overwhelming the API server.
- **`if response.status_code == 200`**: Checks whether the request was successful before trying to process the data. If it was not, we print an informative error message rather than crashing.
- **`.extend()` vs `.append()`**: We use `extend()` rather than `append()` because each API response is itself a list of records. Using `extend()` adds each individual record to our main list, whereas `append()` would add the entire sub-list as a single element, creating a nested structure we do not want.
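The difference between the two methods is easiest to see on a tiny example. The two batches below are made-up stand-ins for two API responses.

```python
# Made-up stand-ins for two API responses
batch_one = [{"force": "a"}, {"force": "a"}]
batch_two = [{"force": "b"}]

# append() adds each batch as a single element: a list of lists
appended = []
appended.append(batch_one)
appended.append(batch_two)

# extend() adds the records inside each batch: one flat list
extended = []
extended.extend(batch_one)
extended.extend(batch_two)

print(len(appended))  # 2 (one element per batch)
print(len(extended))  # 3 (one element per record)
```

A flat list of records is what `pd.DataFrame()` expects, which is why the loop uses `extend()`.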

## From JSON to Analysis (~15 min)

We now have a large collection of stop-and-search records. Let us convert them into a pandas DataFrame and begin exploring the data.

In [None]:
import pandas as pd

df = pd.DataFrame(forces_sas_data)
df.sample(5)

In [None]:
# Cross-tabulation: outcome by age range (column percentages)
pd.crosstab(
    index=df['outcome'],
    columns=df['age_range'],
    normalize='columns'
).mul(100).round(1)

The cross-tabulation above shows the percentage distribution of stop-and-search outcomes within each age group. This allows us to see, for example, whether younger individuals are more or less likely to receive a particular outcome compared to older age groups. Look for patterns: are certain outcomes disproportionately associated with certain age ranges?
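To make explicit what column percentages mean, the sketch below carries out the same arithmetic by hand with the standard library: each cell count is divided by the total for its age range. The records are made up purely for illustration.

```python
from collections import Counter

# Made-up records (illustration only)
records = [
    {"age_range": "18-24", "outcome": "Arrest"},
    {"age_range": "18-24", "outcome": "No further action"},
    {"age_range": "18-24", "outcome": "No further action"},
    {"age_range": "over 34", "outcome": "Arrest"},
    {"age_range": "over 34", "outcome": "No further action"},
]

# Count each (outcome, age range) combination and each age range overall
pair_counts = Counter((r["outcome"], r["age_range"]) for r in records)
column_totals = Counter(r["age_range"] for r in records)

# Column percentage = cell count / total for that age range * 100
column_percentages = {
    (outcome, age): round(100 * n / column_totals[age], 1)
    for (outcome, age), n in pair_counts.items()
}

for key, pct in sorted(column_percentages.items()):
    print(key, pct)
```

Within each age range the percentages sum to 100 (up to rounding), which is why comparisons are made across columns rather than down them.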

In [None]:
# Cross-tabulation: object of search by age range
pd.crosstab(
    index=df['object_of_search'],
    columns=df['age_range'],
    normalize='columns'
).mul(100).round(1)

In [None]:
import matplotlib.pyplot as plt

outcome_counts = df['outcome'].value_counts()

outcome_counts.plot(kind='barh', color='#A3217A')
plt.xlabel('Count')
plt.ylabel('Outcome')
plt.title('Stop and Search Outcomes')
plt.tight_layout()
plt.show()

The bar chart provides a clear overview of how stop-and-search encounters are resolved across all forces for which data were returned. The most common outcomes are immediately visible. Consider what this distribution tells us about policing practice in the areas covered by the API -- for example, what proportion of stops result in no further action?
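The aims of this practical mention saving the results in a structured format, a step the cells above do not show. A minimal sketch follows, using made-up records and a hypothetical file name in the system's temporary folder; JSON is used here because it preserves any nested fields in the records.

```python
import json
import tempfile
from pathlib import Path

# Made-up records (illustration only)
records = [
    {"force": "city-of-london", "outcome": "Arrest"},
    {"force": "city-of-london", "outcome": None},
]

# Hypothetical output location: a file in the system's temporary folder
out_path = Path(tempfile.gettempdir()) / "sas_records_example.json"

# Write the records to disk as JSON
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

# Read the file back to confirm nothing was lost
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(reloaded == records)
```

If you prefer a flat table, `df.to_csv("stop_and_search.csv", index=False)` writes the DataFrame to a CSV file instead; the file name is only an example.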

## Exercise

**Task:** Produce a list of all senior police officers for each force.

The UK Police API provides a "people" endpoint for each force at:

```
https://data.police.uk/api/forces/<force-id>/people
```

Using the `force_ids` list we created earlier, write a loop that requests the senior officers for every force, collects the results, and stores them in a list. Remember to include rate limiting and error handling.

Use the skeleton code below as a starting point. Fill in the gaps marked `# INSERT CODE HERE`.

In [None]:
# Skeleton code for the exercise

baseurl = "https://data.police.uk/api/"
forces_endpoint = "forces"
people_endpoint = "/people"

# Step 1: Get force IDs (we already have these, but included for completeness)
response = requests.get(baseurl + forces_endpoint)
forces_data = response.json()
force_ids = [force["id"] for force in forces_data]

# Step 2: Loop over all forces and collect senior officer data
chief_list = []

# INSERT CODE HERE
# Hint: loop over force_ids, construct the URL for each force's people endpoint,
# make a request, check the status code, and append the results to chief_list.
# Do not forget rate limiting!

print(f"Collected data for {len(chief_list)} forces.")

**Hint:** The solution follows the same pattern as our stop-and-search loop. You need to construct a URL for each force, make a request, check the status code, and store the result. A full solution is provided in the Appendix at the end of this notebook.

## Bibliography

- Bail, C.A. (2021). *Breaking the Social Media Prism: How to Make Our Platforms Less Polarizing*. Princeton University Press. [https://www.chrisbail.net/](https://www.chrisbail.net/)
- Barba, L.A., Barker, L.J., Blank, D.S., et al. (2019). Teaching and Learning with Jupyter. [https://jupyter4edu.github.io/jupyter-edu-book/](https://jupyter4edu.github.io/jupyter-edu-book/)
- Brooker, P. (2020). *Programming with Python for Social Scientists*. SAGE Publications. [https://uk.sagepub.com/en-gb/eur/programming-with-python-for-social-scientists/book259583](https://uk.sagepub.com/en-gb/eur/programming-with-python-for-social-scientists/book259583)

## Appendix: Exercise Solution

In [None]:
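from time import sleep

chief_list = []

for force_id in force_ids:
    # Build the URL for this force's "people" endpoint
    webadd = baseurl + "forces/" + force_id + "/people"
    response = requests.get(webadd)

    if response.status_code == 200:
        # Keep the force ID alongside its list of senior officers
        chief_list.append({"force": force_id, "officers": response.json()})
    else:
        print(f"Could not download data for {force_id} (status: {response.status_code})")

    sleep(1)  # Rate limiting: wait 1 second between requests

chief_list[:2]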
