![SGSSS Logo](../img/SGSSS_Stacked.png)

# Collecting Digital Data for Social Scientists

## Practical 3: API Challenge

In this practical you'll choose one of three APIs and collect data independently. Each option provides scaffolded code with gaps for you to fill in. The goal is to apply what you've learned about working with APIs to a new data source, navigating documentation and response structures on your own.

**Time:** ~60 minutes

By the end of this practical you should be able to:
- Read API documentation and understand endpoint structures
- Make requests, handle pagination, and parse nested JSON responses
- Extract relevant fields and save structured data to CSV

### Guide to this resource

This notebook is designed to run in [Google Colab](https://colab.research.google.com/). To get started:

1. Click **File > Save a copy in Drive** to create your own editable version.
2. Work through the cells in order, filling in the gaps where indicated.
3. Run each cell with **Shift + Enter** or by clicking the play button.
4. If you get stuck, check the **Appendix: Solutions** at the bottom of this notebook.

### Instructions

**Choose one** of the three options below and work through the scaffolded code:

| Option | API | Authentication | Difficulty |
|--------|-----|---------------|------------|
| **A** | UK Parliament API | None required | Medium |
| **B** | World Bank Indicators API | None required | Medium |
| **C** | ONS API | None required | Medium-Hard |

Each option has:
- Some **provided code** to get you started
- **Tasks 1–3** where you fill in the gaps (marked with `# INSERT CODE HERE`)
- A **Task 4 (stretch)** for those who finish early (Options A and B only)

When you're done, save your collected data as a CSV file. Full solutions are available in the **Appendix** at the end of this notebook.

## Option A: UK Parliament API

The UK Parliament API provides data on Members of Parliament, Bills, Divisions (votes), and more. You can query information about current and historical MPs, their party affiliations, constituencies, and voting records.

**Documentation:**
- Members API: [https://members-api.parliament.uk/](https://members-api.parliament.uk/)
- Developer hub: [https://developer.parliament.uk/](https://developer.parliament.uk/)

**No authentication required** — you can start making requests straight away.

In [None]:
import requests
import json
import pandas as pd
import time

base_url = "https://members-api.parliament.uk/api/Members/Search"
# House=1 restricts the search to the House of Commons, so only MPs are returned
params = {"IsCurrentMember": "true", "House": 1, "skip": 0, "take": 20}

In [None]:
response = requests.get(base_url, params=params)
print(f"Status: {response.status_code}")
data = response.json()
print(f"Total results: {data['totalResults']}")
data['items'][:2]

### Understanding the response

The Parliament API uses **skip/take pagination**:
- `skip`: how many records to skip (start at 0)
- `take`: how many records to return per page (max 20)
- `totalResults`: the total number of matching records

Each item in `data['items']` contains a `value` dictionary with fields like `nameDisplayAs`, `latestParty`, and `latestHouseMembership`. Explore the structure of `data['items'][0]` to understand the nesting before attempting the tasks.
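
As a rough illustration of the skip/take pattern, the sketch below pages through a local list rather than the live API. `fetch_page` and `fake_records` are hypothetical stand-ins and not part of the Parliament API; only the names `skip`, `take`, `items`, and `totalResults` come from the response described above.

```python
def fetch_page(records, skip, take):
    """Stand-in for one API call: returns a dict shaped like the Search response."""
    return {"items": records[skip:skip + take], "totalResults": len(records)}

# Invented data standing in for the records held by the API
fake_records = [f"member_{i}" for i in range(45)]

take = 20
skip = 0
collected = []
offsets_requested = []

while True:
    offsets_requested.append(skip)
    page = fetch_page(fake_records, skip, take)
    collected.extend(page["items"])
    skip += take  # move the window forward by one page
    if skip >= page["totalResults"]:
        break  # every record has now been requested

print(f"Offsets requested: {offsets_requested}")
print(f"Collected {len(collected)} records")
```

The last page is shorter than `take` (5 records rather than 20), which is why the stopping condition compares `skip` with `totalResults` instead of checking for a full page.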

In [None]:
# TASK 1: Modify the request to get all MPs
# Hint: Use a loop with skip and take parameters
# The API returns totalResults telling you how many MPs there are
# Increment skip by take each iteration

all_mps = []

# INSERT CODE HERE

print(f"Collected {len(all_mps)} MPs")

In [None]:
# TASK 2: Extract name, party, and constituency for each MP
# Hint: Each item has nested structure - explore data['items'][0] to find the fields
# Inside item['value'], look for 'nameDisplayAs', 'latestParty' -> 'name', and 'latestHouseMembership' -> 'membershipFrom'

mp_records = []

# INSERT CODE HERE

print(f"Extracted {len(mp_records)} records")

In [None]:
# TASK 3: Convert to DataFrame and save as CSV

# INSERT CODE HERE

In [None]:
# TASK 4 (stretch): Get voting records for a specific division
# Hint: https://commonsvotes-api.parliament.uk/data/division/{divisionId}.json
# Try division 1234 as an example

# INSERT CODE HERE

## Option B: World Bank Indicators API

The World Bank Indicators API provides access to development indicators for 200+ countries, covering topics like GDP, population, life expectancy, education, and more. Data spans several decades for most indicators.

**Documentation:** [https://datahelpdesk.worldbank.org/knowledgebase/topics/125589](https://datahelpdesk.worldbank.org/knowledgebase/topics/125589)

**No authentication required** — you can start making requests straight away.

In [None]:
import requests
import json
import pandas as pd

url = "https://api.worldbank.org/v2/country/GB/indicator/NY.GDP.MKTP.CD?format=json"
response = requests.get(url)
print(f"Status: {response.status_code}")
data = response.json()

In [None]:
# The World Bank API returns a two-element list: [metadata, records]
metadata = data[0]
records = data[1]
print(f"Page: {metadata['page']} of {metadata['pages']}")
print(f"Records on this page: {len(records)}")
records[0]
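
The cell above unpacks the two-element response. As a minimal sketch of what the metadata can tell you, the example below uses a hand-written payload shaped like a World Bank response. The payload is hypothetical and its numbers are placeholders, not real GDP figures.

```python
# Hypothetical payload shaped like a World Bank response (values are placeholders)
sample = [
    {"page": 1, "pages": 3, "per_page": 2, "total": 5},
    [
        {"country": {"id": "GB", "value": "United Kingdom"}, "date": "2022", "value": 1.0},
        {"country": {"id": "GB", "value": "United Kingdom"}, "date": "2021", "value": None},
    ],
]

metadata, records = sample  # first element: paging info; second: the data rows

# If the current page is not the last one, a single request has not returned everything
more_pages_remaining = metadata["page"] < metadata["pages"]

# Years with no reported figure come back with value set to None
missing_years = [r["date"] for r in records if r["value"] is None]

print(f"Fetched {len(records)} of {metadata['total']} records")
print(f"More pages to fetch: {more_pages_remaining}")
print(f"Years with missing values: {missing_years}")
```

Checking `pages` and `total` before extracting anything is a cheap way to confirm that the request actually returned the full dataset.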

In [None]:
# TASK 1: Request GDP data for all G7 countries
# Hint: Use country codes separated by semicolons: USA;GBR;FRA;DEU;JPN;ITA;CAN
# Add &per_page=500 to get all results in one request

# INSERT CODE HERE

In [None]:
# TASK 2: Extract country, year, and GDP value into a list of dictionaries
# Hint: Each record has a nested 'country' dictionary (its 'value' key holds the country name), plus top-level 'date' and 'value' keys

gdp_records = []

# INSERT CODE HERE

print(f"Extracted {len(gdp_records)} records")

In [None]:
# TASK 3: Convert to DataFrame and save as CSV

# INSERT CODE HERE

In [None]:
# TASK 4 (stretch): Request life expectancy (SP.DYN.LE00.IN) and plot a time series
# Hint: Use matplotlib or pandas .plot()

# INSERT CODE HERE

## Option C: ONS API

The ONS (Office for National Statistics) API provides access to UK official statistics, including data on the economy, population, labour market, and more. You can browse available datasets and retrieve specific editions and versions.

**Documentation:** [https://developer.ons.gov.uk/](https://developer.ons.gov.uk/)

**No authentication required** — you can start making requests straight away.
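
The dataset, edition, and version levels map onto nested URL paths. The helper below is a hypothetical convenience (`dataset_url` is not part of the ONS API) that sketches how those paths nest, assuming the `/datasets/{id}/editions/{edition}/versions/{version}` layout. The values `example-id` and `2021` are placeholders, not real identifiers.

```python
BASE = "https://api.beta.ons.gov.uk/v1"

def dataset_url(dataset_id, edition=None, version=None):
    """Build the URL for a dataset, one of its editions, or one version of an edition."""
    url = f"{BASE}/datasets/{dataset_id}"
    if edition is not None:
        url += f"/editions/{edition}"
        if version is not None:
            # A version only makes sense within a specific edition
            url += f"/versions/{version}"
    return url

print(dataset_url("example-id"))
print(dataset_url("example-id", edition="2021"))
print(dataset_url("example-id", edition="2021", version=1))
```

Each level narrows the previous one: a dataset has one or more editions, and each edition has one or more versions.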

In [None]:
import requests
import json
import pandas as pd

url = "https://api.beta.ons.gov.uk/v1/datasets"
response = requests.get(url)
print(f"Status: {response.status_code}")
data = response.json()

In [None]:
# TASK 1: List the available datasets and their descriptions
# Hint: Explore data['items'] - each has 'title' and 'description'
# Note: the response is paginated, so data['items'] may hold only the first page of datasets

# INSERT CODE HERE

In [None]:
# TASK 2: Choose a dataset and request its latest version
# Hint: Use the 'links' field to find the URL for editions/versions

# INSERT CODE HERE

In [None]:
# TASK 3: Extract and save the data

# INSERT CODE HERE

## Appendix: Solutions

### Option A Solution

In [None]:
# =============================================================
# OPTION A: FULL SOLUTION - UK Parliament API
# =============================================================

import requests
import json
import pandas as pd
import time

# --- TASK 1: Get ALL current MPs using pagination ---

base_url = "https://members-api.parliament.uk/api/Members/Search"
all_mps = []
skip = 0
take = 20

# Make the first request to find out total results
# House=1 restricts the search to the House of Commons, so only MPs are returned
params = {"IsCurrentMember": "true", "House": 1, "skip": skip, "take": take}
response = requests.get(base_url, params=params)
response.raise_for_status()  # Stop early if the request failed
data = response.json()
total_results = data['totalResults']
print(f"Total MPs to collect: {total_results}")

# Add the first batch
all_mps.extend(data['items'])
skip += take

# Loop through remaining pages, reusing the same params with an updated skip
while skip < total_results:
    params["skip"] = skip
    response = requests.get(base_url, params=params)
    response.raise_for_status()
    data = response.json()
    all_mps.extend(data['items'])
    print(f"Collected {len(all_mps)} / {total_results}")
    skip += take
    time.sleep(0.5)  # Be polite to the API

print(f"\nCollected {len(all_mps)} MPs in total")

# --- TASK 2: Extract name, party, and constituency ---

mp_records = []
for item in all_mps:
    mp = item['value']
    record = {
        'name': mp['nameDisplayAs'],
        'party': mp['latestParty']['name'],
        'constituency': mp['latestHouseMembership']['membershipFrom']
    }
    mp_records.append(record)

print(f"Extracted {len(mp_records)} records")
mp_records[:3]

# --- TASK 3: Convert to DataFrame and save as CSV ---

df_mps = pd.DataFrame(mp_records)
print(df_mps.head())
print(f"\nParty counts:\n{df_mps['party'].value_counts()}")
df_mps.to_csv('uk_mps.csv', index=False)
print("\nSaved to uk_mps.csv")

# --- TASK 4 (stretch): Get voting records for a division ---

division_url = "https://commonsvotes-api.parliament.uk/data/division/1234.json"
response = requests.get(division_url)
print(f"\nDivision request status: {response.status_code}")

if response.status_code == 200:
    division = response.json()
    print(f"Division title: {division['Title']}")
    print(f"Date: {division['Date']}")
    print(f"Ayes: {len(division['Ayes'])}")
    print(f"Noes: {len(division['Noes'])}")

    # Extract voting records
    votes = []
    for mp in division['Ayes']:
        votes.append({'name': mp['Name'], 'party': mp['Party'], 'vote': 'Aye'})
    for mp in division['Noes']:
        votes.append({'name': mp['Name'], 'party': mp['Party'], 'vote': 'No'})

    df_votes = pd.DataFrame(votes)
    print(f"\nVotes by party:\n{df_votes.groupby(['party', 'vote']).size().unstack(fill_value=0)}")
else:
    print("Division not found. Try a different division ID.")
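
The Task 2 extraction above indexes straight into nested dictionaries, which raises an error if a field is missing or `None`. A more defensive variant might look like the sketch below. `safe_get` is a hypothetical helper, and the sample item is invented for illustration.

```python
def safe_get(d, *keys, default=None):
    """Walk nested dictionaries, returning default if any level is missing or None."""
    current = d
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
    return default if current is None else current

# Invented item shaped like one entry of data['items'], with no party recorded
item = {
    "value": {
        "nameDisplayAs": "Example Member",
        "latestParty": None,
        "latestHouseMembership": {"membershipFrom": "Example Constituency"},
    }
}

record = {
    "name": safe_get(item, "value", "nameDisplayAs"),
    "party": safe_get(item, "value", "latestParty", "name", default="Unknown"),
    "constituency": safe_get(item, "value", "latestHouseMembership", "membershipFrom"),
}
print(record)
```

Whether a placeholder such as `"Unknown"` or a missing value is the better choice depends on how you plan to analyse the data afterwards.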

### Option B Solution

In [None]:
# =============================================================
# OPTION B: FULL SOLUTION - World Bank Indicators API
# =============================================================

import requests
import json
import pandas as pd

# --- TASK 1: Request GDP data for all G7 countries ---

g7_url = "https://api.worldbank.org/v2/country/USA;GBR;FRA;DEU;JPN;ITA;CAN/indicator/NY.GDP.MKTP.CD?format=json&per_page=500"
response = requests.get(g7_url)
print(f"Status: {response.status_code}")
data = response.json()

metadata = data[0]
records = data[1]
print(f"Total records: {metadata['total']}")
print(f"Records retrieved: {len(records)}")

# --- TASK 2: Extract country, year, and GDP value ---

gdp_records = []
for record in records:
    gdp_records.append({
        'country': record['country']['value'],
        'country_code': record['countryiso3code'],
        'year': int(record['date']),
        'gdp': record['value']
    })

print(f"Extracted {len(gdp_records)} records")
gdp_records[:3]

# --- TASK 3: Convert to DataFrame and save as CSV ---

df_gdp = pd.DataFrame(gdp_records)
print(df_gdp.head(10))
print(f"\nCountries: {df_gdp['country'].unique()}")
print(f"Year range: {df_gdp['year'].min()} - {df_gdp['year'].max()}")
df_gdp.to_csv('g7_gdp.csv', index=False)
print("\nSaved to g7_gdp.csv")

# --- TASK 4 (stretch): Life expectancy time series plot ---

import matplotlib.pyplot as plt

le_url = "https://api.worldbank.org/v2/country/USA;GBR;FRA;DEU;JPN;ITA;CAN/indicator/SP.DYN.LE00.IN?format=json&per_page=500"
response = requests.get(le_url)
le_data = response.json()

le_records = []
for record in le_data[1]:
    if record['value'] is not None:
        le_records.append({
            'country': record['country']['value'],
            'year': int(record['date']),
            'life_expectancy': record['value']
        })

df_le = pd.DataFrame(le_records)

fig, ax = plt.subplots(figsize=(12, 6))
for country in df_le['country'].unique():
    country_data = df_le[df_le['country'] == country].sort_values('year')
    ax.plot(country_data['year'], country_data['life_expectancy'], label=country)

ax.set_xlabel('Year')
ax.set_ylabel('Life Expectancy (years)')
ax.set_title('Life Expectancy at Birth - G7 Countries')
ax.legend()
plt.tight_layout()
plt.show()
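
Both requests above rely on `per_page=500` being large enough to return everything at once. If `metadata['pages']` were greater than 1, a loop over the `page` query parameter would be needed. The sketch below shows one way to write it, with `fake_fetch` as a hypothetical stand-in for the real request so the logic can be checked offline.

```python
def collect_all_pages(fetch, per_page=50):
    """Call fetch(page, per_page) until the metadata says the last page is reached."""
    all_records = []
    page = 1
    while True:
        metadata, records = fetch(page, per_page)
        all_records.extend(records or [])  # tolerate an empty or missing page
        if page >= metadata["pages"]:
            break
        page += 1
    return all_records

def fake_fetch(page, per_page):
    """Hypothetical stand-in returning [metadata, records] like the World Bank API."""
    data = list(range(120))
    pages = -(-len(data) // per_page)  # ceiling division
    start = (page - 1) * per_page
    metadata = {"page": page, "pages": pages, "per_page": per_page, "total": len(data)}
    return metadata, data[start:start + per_page]

# A real fetch might wrap something like:
#   requests.get(url, params={"format": "json", "per_page": per_page, "page": page}).json()
results = collect_all_pages(fake_fetch, per_page=50)
print(f"Collected {len(results)} records")
```

Passing the fetch function in as an argument keeps the paging logic separate from the network call, which makes it easier to test.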

### Option C Solution

In [None]:
# =============================================================
# OPTION C: FULL SOLUTION - ONS API
# =============================================================

import requests
import json
import pandas as pd
import io

# --- TASK 1: List all available datasets and their descriptions ---

url = "https://api.beta.ons.gov.uk/v1/datasets"
response = requests.get(url)
data = response.json()

print(f"Datasets returned in this response: {len(data['items'])}\n")

for i, dataset in enumerate(data['items']):
    title = dataset.get('title', 'No title')
    description = dataset.get('description', 'No description')
    dataset_id = dataset.get('id', 'No ID')
    # Truncate long descriptions for readability
    if len(description) > 100:
        description = description[:100] + '...'
    print(f"{i+1}. [{dataset_id}] {title}")
    print(f"   {description}\n")

# --- TASK 2: Choose a dataset and request its latest version ---

# Pick the first dataset as an example (or choose one that interests you)
chosen_dataset = data['items'][0]
dataset_id = chosen_dataset['id']
print(f"Chosen dataset: {chosen_dataset['title']}")
print(f"Dataset ID: {dataset_id}")

# Get the editions for this dataset
editions_url = f"https://api.beta.ons.gov.uk/v1/datasets/{dataset_id}/editions"
response = requests.get(editions_url)
print(f"\nEditions request status: {response.status_code}")

if response.status_code == 200:
    editions_data = response.json()
    editions = editions_data['items']
    print(f"Number of editions: {len(editions)}")

    # Take the first edition listed (assumed here to be the latest)
    latest_edition = editions[0]['edition']
    print(f"Latest edition: {latest_edition}")

    # Get versions for this edition
    versions_url = f"https://api.beta.ons.gov.uk/v1/datasets/{dataset_id}/editions/{latest_edition}/versions"
    response = requests.get(versions_url)
    versions_data = response.json()
    latest_version = versions_data['items'][0]
    print(f"Latest version: {latest_version['version']}")

    # --- TASK 3: Extract and save the data ---

    # Check if there is a CSV download link
    if 'downloads' in latest_version and 'csv' in latest_version['downloads']:
        csv_url = latest_version['downloads']['csv']['href']
        print(f"\nCSV download URL: {csv_url}")
        # Download the CSV with requests first, then parse the text with pandas
        # (passing this URL straight to pd.read_csv may fail)
        csv_response = requests.get(csv_url)
        csv_response.raise_for_status()
        df = pd.read_csv(io.StringIO(csv_response.text))
        print(f"Shape: {df.shape}")
        print(df.head())
        df.to_csv(f'ons_{dataset_id}.csv', index=False)
        print(f"\nSaved to ons_{dataset_id}.csv")
    else:
        # If no CSV download, save the metadata
        print("\nNo direct CSV download available.")
        print("Saving dataset metadata instead.")
        dataset_info = {
            'id': dataset_id,
            'title': chosen_dataset['title'],
            'description': chosen_dataset.get('description', ''),
            'edition': latest_edition,
            'version': latest_version['version']
        }
        print(json.dumps(dataset_info, indent=2))
else:
    print("Could not retrieve editions. Try a different dataset.")
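
The Task 2 hint mentions the `links` field, which the solution above does not use. Assuming each dataset item carries a `links` dictionary with a `latest_version` entry holding an `href` (check this against a real item before relying on it), the latest version could be reached in one step. `latest_version_url` and the sample item below are hypothetical.

```python
def latest_version_url(dataset):
    """Return the latest-version link from a dataset item, or None if absent."""
    links = dataset.get("links") or {}
    latest = links.get("latest_version") or {}
    return latest.get("href")

# Invented item shaped like one entry of data['items']
sample_dataset = {
    "id": "example-id",
    "title": "Example dataset",
    "links": {
        "latest_version": {
            "href": "https://api.beta.ons.gov.uk/v1/datasets/example-id/editions/2021/versions/1"
        }
    },
}

print(latest_version_url(sample_dataset))
print(latest_version_url({"id": "no-links"}))
```

With a real item, requesting the returned URL should give the version document, which can then be checked for the `downloads` field used in Task 3.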
