![SGSSS Logo](../img/SGSSS_Stacked.png)

# Collecting Digital Data for Social Scientists

## Practical 3: API Challenge

In this practical you'll choose one of three APIs and collect data independently. Each option provides scaffolded code with gaps for you to fill in. The goal is to apply what you've learned about working with APIs to a new data source, navigating documentation and response structures on your own.

**Time:** ~60 minutes

By the end of this practical you should be able to:
- Read API documentation and understand endpoint structures
- Make requests, handle pagination, and parse nested JSON responses
- Extract relevant fields and save structured data to CSV

### Guide to this resource

This notebook is designed to run in [Google Colab](https://colab.research.google.com/). To get started:

1. Click **File > Save a copy in Drive** to create your own editable version.
2. **Change the runtime to R:** Go to **Runtime > Change runtime type** and select **R** from the dropdown.
3. Work through the cells in order, filling in the gaps where indicated.
4. Run each cell with **Shift + Enter** or by clicking the play button.
5. If you get stuck, check the **Appendix: Solutions** at the bottom of this notebook.

In [None]:
install.packages(c("httr", "jsonlite", "dplyr", "ggplot2"))
library(httr)
library(jsonlite)
library(dplyr)
library(ggplot2)
cat("Packages loaded\n")

### Instructions

**Choose one** of the three options below and work through the scaffolded code:

| Option | API | Authentication | Difficulty |
|--------|-----|---------------|------------|
| **A** | UK Parliament API | None required | Medium |
| **B** | World Bank Indicators API | None required | Medium |
| **C** | ONS API | None required | Medium-Hard |

Each option has:
- Some **provided code** to get you started
- **Tasks 1–3** where you fill in the gaps (marked with `# INSERT CODE HERE`)
- A **Task 4 (stretch)** for those who finish early

When you're done, save your collected data as a CSV file. Full solutions are available in the **Appendix** at the end of this notebook.
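
Whichever option you pick, the last step has the same shape: turn a list of extracted records into a data.frame and write it to disk. The sketch below illustrates that pattern with invented records (the field names `name`, `group` and `score` are placeholders, not fields from any of the three APIs) and writes to a temporary file so it does not clutter your working directory.

```r
# Invented records standing in for fields pulled out of an API response
records <- list(
  list(name = "A", group = "X", score = 1.5),
  list(name = "B", group = "Y", score = 2.0)
)

# Build one single-row data.frame per record, then stack them
rows <- lapply(records, function(r) {
  data.frame(name = r$name, group = r$group, score = r$score,
             stringsAsFactors = FALSE)
})
df <- do.call(rbind, rows)

# Write to a temporary file; row.names = FALSE avoids an extra index column
out_file <- file.path(tempdir(), "example_records.csv")
write.csv(df, out_file, row.names = FALSE)

# Read the file back to confirm it holds what we expect
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(check)
```

In the tasks you can swap `do.call(rbind, rows)` for `bind_rows(rows)` from dplyr, which is what the hints suggest.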

## Option A: UK Parliament API

The UK Parliament API provides data on Members of Parliament, Bills, Divisions (votes), and more. You can query information about current and historical MPs, their party affiliations, constituencies, and voting records.

**Documentation:**
- Members API: [https://members-api.parliament.uk/](https://members-api.parliament.uk/)
- Developer hub: [https://developer.parliament.uk/](https://developer.parliament.uk/)

**No authentication required** — you can start making requests straight away.

In [None]:
library(httr)
library(jsonlite)
library(dplyr)

base_url <- "https://members-api.parliament.uk/api/Members/Search"
# House = 1 keeps Commons members only (otherwise current Lords are returned too)
params <- list(House = 1, IsCurrentMember = "true", skip = 0, take = 20)

In [None]:
response <- GET(base_url, query = params)
cat("Status:", status_code(response), "\n")
data <- fromJSON(content(response, as = "text", encoding = "UTF-8"), simplifyVector = FALSE)
cat("Total results:", data$totalResults, "\n")
str(data$items[1:2], max.level = 3)

### Understanding the response

The Parliament API uses **skip/take pagination**:
- `skip`: how many records to skip (start at 0)
- `take`: how many records to return per page (max 20)
- `totalResults`: the total number of matching records
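
The loop logic behind skip/take pagination can be tried out without touching the network. The sketch below is only an illustration: `fetch_page()` is a hypothetical stand-in that serves 45 invented records in the same `items` / `totalResults` shape, so you can see how `skip` advances and when the loop stops.

```r
# Hypothetical stand-in for the API: 45 fake records served in pages
fake_records <- paste0("member_", 1:45)

fetch_page <- function(skip, take) {
  # Mimics the response shape: a list with items and totalResults
  total <- length(fake_records)
  if (skip >= total) {
    items <- character(0)
  } else {
    items <- fake_records[(skip + 1):min(skip + take, total)]
  }
  list(items = as.list(items), totalResults = total)
}

collected <- list()
skip <- 0
take <- 20

repeat {
  page <- fetch_page(skip, take)
  collected <- c(collected, page$items)
  skip <- skip + take
  # Stop once every record has been requested or a page comes back empty
  if (skip >= page$totalResults || length(page$items) == 0) break
}

cat("Collected", length(collected), "of", page$totalResults, "records\n")
```

With the real API you would replace `fetch_page()` with a `GET()` call and add a short `Sys.sleep()` between requests.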

Each item in `data$items` contains a `value` list with fields like `nameDisplayAs`, `latestParty`, and `latestHouseMembership`. Explore the structure of the first item to understand the nesting before attempting the tasks.

**Tip:** Use `str()` or `names()` to explore nested list structures in R.
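
As a rough illustration of that tip, the sketch below builds a made-up item with the nesting described above (all values are invented, and a real item contains many more fields) and walks through it one level at a time.

```r
# Hypothetical record imitating the nesting described above (values invented)
item <- list(
  value = list(
    nameDisplayAs = "Example Member",
    latestParty = list(name = "Example Party"),
    latestHouseMembership = list(membershipFrom = "Example Constituency")
  )
)

names(item$value)          # field names inside value
str(item, max.level = 2)   # overview without printing every leaf

# Drill down one level at a time with $
party <- item$value$latestParty$name
constituency <- item$value$latestHouseMembership$membershipFrom
cat(party, "/", constituency, "\n")

# A missing field returns NULL rather than an error, so check before using it
is.null(item$value$notAField)
```

Because a missing field silently yields `NULL`, it is worth checking with `is.null()` before passing values to `data.frame()`, which fails on zero-length inputs.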

In [None]:
# TASK 1: Modify the request to get all MPs
# Hint: Use a while loop with skip and take parameters
# The API returns totalResults telling you how many MPs there are
# Increment skip by take each iteration

all_mps <- list()

# INSERT CODE HERE

cat("Collected", length(all_mps), "MPs\n")

In [None]:
# TASK 2: Extract name, party, and constituency for each MP
# Hint: Each item has nested structure - explore with str() or names()
# Look for value -> nameDisplayAs, value -> latestParty -> name,
# value -> latestHouseMembership -> membershipFrom
# Use lapply() or a for loop to iterate over all_mps

mp_records <- list()

# INSERT CODE HERE

cat("Extracted", length(mp_records), "records\n")

In [None]:
# TASK 3: Convert to data.frame and save as CSV
# Hint: Use bind_rows() from dplyr to combine list of records
# Use write.csv() to save

# INSERT CODE HERE

In [None]:
# TASK 4 (stretch): Get voting records for a specific division
# Hint: https://commonsvotes-api.parliament.uk/data/division/{divisionId}.json
# Try division 1234 as an example

# INSERT CODE HERE

## Option B: World Bank Indicators API

The World Bank Indicators API provides access to development indicators for 200+ countries, covering topics like GDP, population, life expectancy, education, and more. Data spans several decades for most indicators.

**Documentation:** [https://datahelpdesk.worldbank.org/knowledgebase/topics/125589](https://datahelpdesk.worldbank.org/knowledgebase/topics/125589)

**No authentication required** — you can start making requests straight away.
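
Responses from this API arrive as a two-element JSON array: paging metadata first, then the records. The sketch below parses a tiny invented response in that layout (the numbers are made up and a real response carries more fields) to show how `fromJSON()` turns it into a list plus a data.frame with a nested `country` column.

```r
library(jsonlite)

# Invented miniature response in the [metadata, records] layout
mock_json <- '[
  {"page": 1, "pages": 1, "per_page": 50, "total": 2},
  [
    {"country": {"id": "GB", "value": "United Kingdom"}, "date": "2021", "value": 100.5},
    {"country": {"id": "GB", "value": "United Kingdom"}, "date": "2020", "value": null}
  ]
]'

parsed <- fromJSON(mock_json)
metadata <- parsed[[1]]   # paging information
records <- parsed[[2]]    # simplified by jsonlite into a data.frame

tidy <- data.frame(
  country = records$country$value,  # nested object becomes a data.frame column
  year = as.integer(records$date),  # dates arrive as strings
  value = records$value,            # JSON null becomes NA
  stringsAsFactors = FALSE
)
print(tidy)
```

Missing observations show up as `NA`, so remember `na.rm = TRUE` when summarising and consider dropping those rows before plotting.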

In [None]:
library(httr)
library(jsonlite)
library(dplyr)

url <- "https://api.worldbank.org/v2/country/GB/indicator/NY.GDP.MKTP.CD?format=json"
response <- GET(url)
cat("Status:", status_code(response), "\n")
data <- fromJSON(content(response, as = "text", encoding = "UTF-8"))

In [None]:
# World Bank returns a list of two elements: [[1]] metadata, [[2]] records
metadata <- data[[1]]
records <- data[[2]]
cat("Page:", metadata$page, "of", metadata$pages, "\n")
cat("Records on this page:", nrow(records), "\n")
str(records[1, ])

In [None]:
# TASK 1: Request GDP data for all G7 countries
# Hint: Use country codes separated by semicolons: USA;GBR;FRA;DEU;JPN;ITA;CAN
# Add per_page=500 to get all results in one request

# INSERT CODE HERE

In [None]:
# TASK 2: Extract country, year, and GDP value into a data.frame
# Hint: records is a data.frame from fromJSON()
# Access nested columns with records$country$value, records$date, records$value

gdp_records <- data.frame()

# INSERT CODE HERE

cat("Extracted", nrow(gdp_records), "records\n")

In [None]:
# TASK 3: Convert to data.frame and save as CSV

# INSERT CODE HERE

In [None]:
# TASK 4 (stretch): Request life expectancy (SP.DYN.LE00.IN) and plot a time series
# Hint: Use ggplot2 with geom_line(), colouring by country

# INSERT CODE HERE

## Option C: ONS API

The ONS (Office for National Statistics) API provides access to UK official statistics, including data on the economy, population, labour market, and more. You can browse available datasets and retrieve specific editions and versions.

**Documentation:** [https://developer.ons.gov.uk/](https://developer.ons.gov.uk/)

**No authentication required** — you can start making requests straight away.
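
Datasets, editions and versions form a hierarchy, and the URLs follow it. The sketch below only assembles the URLs so you can see the pattern; the identifiers `example-dataset` and `example-edition` are placeholders rather than real ONS datasets, and the final per-version URL is an assumption based on the same pattern, so check it against the documentation.

```r
base <- "https://api.beta.ons.gov.uk/v1/datasets"

# Hypothetical identifiers used only to show the URL pattern
dataset_id <- "example-dataset"
edition <- "example-edition"
version <- 1

# Each level of the hierarchy extends the URL of the level above
editions_url <- paste0(base, "/", dataset_id, "/editions")
versions_url <- paste0(editions_url, "/", edition, "/versions")
version_url <- paste0(versions_url, "/", version)  # assumed pattern

cat(editions_url, versions_url, version_url, sep = "\n")
```

In the tasks, take the real `dataset_id` and edition name from the API responses rather than typing them by hand.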

In [None]:
library(httr)
library(jsonlite)
library(dplyr)

url <- "https://api.beta.ons.gov.uk/v1/datasets"
response <- GET(url)
cat("Status:", status_code(response), "\n")
data <- fromJSON(content(response, as = "text", encoding = "UTF-8"))

In [None]:
# TASK 1: List all available datasets and their descriptions
# Hint: Explore data$items - each has 'title' and 'description'
# Use a for loop or apply function to print them

# INSERT CODE HERE

In [None]:
# TASK 2: Choose a dataset and request its latest version
# Hint: Use the 'links' field to find the URL for editions/versions
# Build the editions URL like:
# paste0("https://api.beta.ons.gov.uk/v1/datasets/", dataset_id, "/editions")

# INSERT CODE HERE

In [None]:
# TASK 3: Extract and save the data
# Hint: Check for CSV download links in the version's 'downloads' field
# The ONS download URL redirects, so use GET() to download the CSV first,
# then parse with read.csv(text = content(response, as = "text"))
# Use write.csv() to save

# INSERT CODE HERE

## Appendix: Solutions

### Option A Solution

In [None]:
# =============================================================
# OPTION A: FULL SOLUTION - UK Parliament API
# =============================================================

library(httr)
library(jsonlite)
library(dplyr)

# --- TASK 1: Get ALL current MPs using pagination ---

base_url <- "https://members-api.parliament.uk/api/Members/Search"
all_mps <- list()
skip <- 0
take <- 20
total_results <- Inf  # Unknown until the first response arrives

while (skip < total_results) {
  # House = 1 restricts the search to the Commons; without it the
  # results also include current members of the House of Lords
  params <- list(House = 1, IsCurrentMember = "true", skip = skip, take = take)
  response <- GET(base_url, query = params)
  stop_for_status(response)  # Stop with an error if the request failed
  raw_text <- content(response, as = "text", encoding = "UTF-8")
  data <- fromJSON(raw_text, simplifyVector = FALSE)
  if (length(data$items) == 0) break  # Guard against an unexpected empty page
  total_results <- data$totalResults
  all_mps <- c(all_mps, data$items)
  cat("Collected", length(all_mps), "/", total_results, "\n")
  skip <- skip + take
  Sys.sleep(0.5)  # Be polite to the API
}

cat("\nCollected", length(all_mps), "MPs in total\n")

In [None]:
# --- TASK 2: Extract name, party, and constituency ---

mp_records <- lapply(all_mps, function(item) {
  mp <- item$value
  data.frame(
    name = mp$nameDisplayAs,
    party = mp$latestParty$name,
    constituency = mp$latestHouseMembership$membershipFrom,
    stringsAsFactors = FALSE
  )
})

cat("Extracted", length(mp_records), "records\n")
head(mp_records, 3)

In [None]:
# --- TASK 3: Convert to data.frame and save as CSV ---

df_mps <- bind_rows(mp_records)
head(df_mps)
cat("\nParty counts:\n")
print(table(df_mps$party))
write.csv(df_mps, "uk_mps.csv", row.names = FALSE)
cat("\nSaved to uk_mps.csv\n")

In [None]:
# --- TASK 4 (stretch): Get voting records for a division ---

division_url <- "https://commonsvotes-api.parliament.uk/data/division/1234.json"
response <- GET(division_url)
cat("Division request status:", status_code(response), "\n")

if (status_code(response) == 200) {
  division <- fromJSON(content(response, as = "text", encoding = "UTF-8"),
                       simplifyVector = FALSE)
  cat("Division title:", division$Title, "\n")
  cat("Date:", division$Date, "\n")
  cat("Ayes:", length(division$Ayes), "\n")
  cat("Noes:", length(division$Noes), "\n")

  # Extract voting records
  aye_records <- lapply(division$Ayes, function(mp) {
    data.frame(name = mp$Name, party = mp$Party, vote = "Aye",
               stringsAsFactors = FALSE)
  })
  no_records <- lapply(division$Noes, function(mp) {
    data.frame(name = mp$Name, party = mp$Party, vote = "No",
               stringsAsFactors = FALSE)
  })

  df_votes <- bind_rows(c(aye_records, no_records))
  cat("\nVotes by party:\n")
  print(table(df_votes$party, df_votes$vote))
} else {
  cat("Division not found. Try a different division ID.\n")
}

### Option B Solution

In [None]:
# =============================================================
# OPTION B: FULL SOLUTION - World Bank Indicators API
# =============================================================

library(httr)
library(jsonlite)
library(dplyr)

# --- TASK 1: Request GDP data for all G7 countries ---

g7_url <- "https://api.worldbank.org/v2/country/USA;GBR;FRA;DEU;JPN;ITA;CAN/indicator/NY.GDP.MKTP.CD?format=json&per_page=500"
response <- GET(g7_url)
cat("Status:", status_code(response), "\n")
data <- fromJSON(content(response, as = "text", encoding = "UTF-8"))

metadata <- data[[1]]
records <- data[[2]]
cat("Total records:", metadata$total, "\n")
cat("Records retrieved:", nrow(records), "\n")

In [None]:
# --- TASK 2: Extract country, year, and GDP value ---

gdp_records <- data.frame(
  country = records$country$value,
  country_code = records$countryiso3code,
  year = as.integer(records$date),
  gdp = records$value,
  stringsAsFactors = FALSE
)

cat("Extracted", nrow(gdp_records), "records\n")
head(gdp_records, 3)

In [None]:
# --- TASK 3: Save as CSV ---

head(gdp_records, 10)
cat("\nCountries:", paste(unique(gdp_records$country), collapse = ", "), "\n")
cat("Year range:", min(gdp_records$year, na.rm = TRUE), "-",
    max(gdp_records$year, na.rm = TRUE), "\n")
write.csv(gdp_records, "g7_gdp.csv", row.names = FALSE)
cat("\nSaved to g7_gdp.csv\n")

In [None]:
# --- TASK 4 (stretch): Life expectancy time series plot ---

library(ggplot2)

le_url <- "https://api.worldbank.org/v2/country/USA;GBR;FRA;DEU;JPN;ITA;CAN/indicator/SP.DYN.LE00.IN?format=json&per_page=500"
response <- GET(le_url)
le_data <- fromJSON(content(response, as = "text", encoding = "UTF-8"))

le_records <- data.frame(
  country = le_data[[2]]$country$value,
  year = as.integer(le_data[[2]]$date),
  life_expectancy = le_data[[2]]$value,
  stringsAsFactors = FALSE
)

# Remove rows with missing values
le_records <- le_records[!is.na(le_records$life_expectancy), ]

ggplot(le_records, aes(x = year, y = life_expectancy, colour = country)) +
  geom_line() +
  labs(
    title = "Life Expectancy at Birth - G7 Countries",
    x = "Year",
    y = "Life Expectancy (years)",
    colour = "Country"
  ) +
  theme_minimal()

### Option C Solution

In [None]:
# =============================================================
# OPTION C: FULL SOLUTION - ONS API
# =============================================================

library(httr)
library(jsonlite)
library(dplyr)

# --- TASK 1: List all available datasets and their descriptions ---

url <- "https://api.beta.ons.gov.uk/v1/datasets"
response <- GET(url)
data <- fromJSON(content(response, as = "text", encoding = "UTF-8"),
                 simplifyVector = FALSE)

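# data$items holds only the datasets in this response, which may be
# one page of a longer catalogue
cat("Datasets in this response:", length(data$items), "\n\n")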

for (i in seq_along(data$items)) {
  dataset <- data$items[[i]]
  title <- ifelse(is.null(dataset$title), "No title", dataset$title)
  description <- ifelse(is.null(dataset$description), "No description",
                        dataset$description)
  dataset_id <- ifelse(is.null(dataset$id), "No ID", dataset$id)
  # Truncate long descriptions for readability
  if (nchar(description) > 100) {
    description <- paste0(substr(description, 1, 100), "...")
  }
  cat(i, ". [", dataset_id, "] ", title, "\n", sep = "")
  cat("   ", description, "\n\n")
}

In [None]:
# --- TASK 2: Choose a dataset and request its latest version ---

# Pick the first dataset as an example (or choose one that interests you)
chosen_dataset <- data$items[[1]]
dataset_id <- chosen_dataset$id
cat("Chosen dataset:", chosen_dataset$title, "\n")
cat("Dataset ID:", dataset_id, "\n")

# Get the editions for this dataset
editions_url <- paste0("https://api.beta.ons.gov.uk/v1/datasets/",
                       dataset_id, "/editions")
response <- GET(editions_url)
cat("\nEditions request status:", status_code(response), "\n")

if (status_code(response) == 200) {
  editions_data <- fromJSON(content(response, as = "text", encoding = "UTF-8"),
                            simplifyVector = FALSE)
  editions <- editions_data$items
  cat("Number of editions:", length(editions), "\n")

  # Take the first edition listed (assumed here to be the most recent)
  latest_edition <- editions[[1]]$edition
  cat("Latest edition:", latest_edition, "\n")

  # Get versions for this edition
  versions_url <- paste0("https://api.beta.ons.gov.uk/v1/datasets/",
                         dataset_id, "/editions/", latest_edition, "/versions")
  response <- GET(versions_url)
  versions_data <- fromJSON(content(response, as = "text", encoding = "UTF-8"),
                            simplifyVector = FALSE)
  latest_version <- versions_data$items[[1]]
  cat("Latest version:", latest_version$version, "\n")
}

In [None]:
# --- TASK 3: Extract and save the data ---

# Check if there is a CSV download link
if (!is.null(latest_version$downloads) &&
    !is.null(latest_version$downloads$csv)) {
  csv_url <- latest_version$downloads$csv$href
  cat("CSV download URL:", csv_url, "\n")
  # Download the CSV using GET() (the ONS URL redirects,
  # which read.csv cannot always follow)
  csv_response <- GET(csv_url)
  df <- read.csv(text = content(csv_response, as = "text", encoding = "UTF-8"))
  cat("Shape:", nrow(df), "rows x", ncol(df), "columns\n")
  print(head(df))  # head() alone displays nothing inside an if { } block
  write.csv(df, paste0("ons_", dataset_id, ".csv"), row.names = FALSE)
  cat("\nSaved to", paste0("ons_", dataset_id, ".csv"), "\n")
} else {
  # If no CSV download, save the metadata
  cat("\nNo direct CSV download available.\n")
  cat("Saving dataset metadata instead.\n")
  dataset_info <- list(
    id = dataset_id,
    title = chosen_dataset$title,
    description = ifelse(is.null(chosen_dataset$description), "",
                         chosen_dataset$description),
    edition = latest_edition,
    version = latest_version$version
  )
  cat(toJSON(dataset_info, pretty = TRUE, auto_unbox = TRUE), "\n")
}
