## 🔗 Open This Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DavidLangworthy/ds4s/blob/master/days/day03/notebook/day03_starter.ipynb)

# 💨 Day 3 – Pollution and Public Health
### Linking PM2.5 exposure with economic capacity

We will combine air-pollution exposure data with GDP per capita to explore how a country's economic resources relate to the air quality its residents breathe.

#### Data card: World Bank – PM2.5 exposure & GDP per capita
* **Sources:** [World Bank Indicators](https://data.worldbank.org/) (EN.ATM.PM25.MC.M3 and NY.GDP.PCAP.CD).
* **Temporal coverage:** 1990–2023 for most countries.
* **Units:** PM2.5 in micrograms/m³; GDP per capita in current US$.
* **Refresh cadence:** Updated annually; downloaded September 2024.
* **Caveats:** Some countries have missing years; regional aggregates are included and should be filtered out.

In [None]:
# Core imports and shared helpers
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display

import utils

utils.baseline_style()


## Step 1: Load the PM2.5 and GDP tables
The helper will download the CSVs if they are missing locally. Each file is structured with metadata columns followed by one column per year.

In [None]:
# Example: reshape a toy table from wide to long format
toy = pd.DataFrame({'Country Code': ['AAA'], 'Country Name': ['Example'], '2020': [1.2], '2021': [1.3]})
toy_long = toy.melt(id_vars=['Country Code', 'Country Name'], var_name='Year', value_name='Value')
utils.diagnostics(toy_long, 'Toy long format', expected_columns=['Country Code', 'Country Name', 'Year', 'Value'], expected_row_range=(2, 2))


In [None]:
pm25_raw = utils.load_data('pm25_exposure.csv')
gdp_raw = utils.load_data('gdp_per_country.csv')
utils.diagnostics(pm25_raw, 'PM2.5 exposure (raw)', expected_columns=['Country Name', 'Country Code', 'Indicator Name'], expected_row_range=(250, 300))
utils.diagnostics(gdp_raw, 'GDP per capita (raw)', expected_columns=['Country Name', 'Country Code', 'Indicator Name'], expected_row_range=(250, 300))
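
`utils` is a course-local module, so its internals are not shown in this notebook. For readers running these cells without it, the block below is a minimal stand-in. It assumes, from the call sites alone, that `utils.diagnostics` checks for expected columns and a plausible row count. The name `check_frame` and its behaviour are hypothetical, not the course helper itself.

```python
import pandas as pd


def check_frame(df, label, expected_columns=(), expected_row_range=None):
    """Hypothetical stand-in for a diagnostics helper: report basic sanity checks."""
    # Columns the caller expects but the frame lacks
    missing = [col for col in expected_columns if col not in df.columns]
    # Row count is acceptable when no range is given or it falls inside the range
    rows_ok = expected_row_range is None or (
        expected_row_range[0] <= len(df) <= expected_row_range[1]
    )
    print(f"{label}: {len(df)} rows, {df.shape[1]} columns")
    if missing:
        print(f"  missing columns: {missing}")
    if not rows_ok:
        print(f"  row count outside expected range {expected_row_range}")
    return {"missing_columns": missing, "rows_ok": rows_ok}


# Toy table that deliberately lacks 'Country Name'
demo = pd.DataFrame({"Country Code": ["AAA", "BBB"], "2020": [1.2, 3.4]})
report = check_frame(
    demo,
    "Demo table",
    expected_columns=["Country Code", "Country Name"],
    expected_row_range=(1, 5),
)
```

Returning a small report dictionary, rather than only printing, makes the check easy to assert on in later cells.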


## Step 2: Reshape the wide tables into tidy format
Melt both datasets so that each row represents a single country–year observation, then convert the numeric fields.

In [None]:
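# Identifier columns carried through the reshape
id_vars = ['Country Name', 'Country Code']

# Keep the id columns plus the year columns (all-digit names), melt to one row
# per country-year, then coerce Year and pm25 to numeric (bad values become NaN)
pm25_long = (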
    pm25_raw[id_vars + [col for col in pm25_raw.columns if col.isdigit()]]
    .melt(id_vars=id_vars, var_name='Year', value_name='pm25')
    .assign(
        Year=lambda df: pd.to_numeric(df['Year'], errors='coerce'),
        pm25=lambda df: pd.to_numeric(df['pm25'], errors='coerce'),
    )
)
# Same reshape for GDP per capita
gdp_long = (
    gdp_raw[id_vars + [col for col in gdp_raw.columns if col.isdigit()]]
    .melt(id_vars=id_vars, var_name='Year', value_name='gdp_per_capita')
    .assign(
        Year=lambda df: pd.to_numeric(df['Year'], errors='coerce'),
        gdp_per_capita=lambda df: pd.to_numeric(df['gdp_per_capita'], errors='coerce'),
    )
)

utils.diagnostics(pm25_long, 'PM2.5 tidy', expected_columns=['Country Name', 'Country Code', 'Year', 'pm25'], expected_row_range=(15000, 20000))
utils.diagnostics(gdp_long, 'GDP tidy', expected_columns=['Country Name', 'Country Code', 'Year', 'gdp_per_capita'], expected_row_range=(15000, 20000))
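
The two reshapes above differ only in the value column name, so the pattern could be factored into one function. The sketch below is one possible way to do that on a toy table. `tidy_world_bank` is a hypothetical name, and the toy rows are placeholders rather than real measurements.

```python
import pandas as pd


def tidy_world_bank(raw, value_name, id_vars=("Country Name", "Country Code")):
    """Hypothetical helper: melt a wide World Bank-style table to long format."""
    id_vars = list(id_vars)
    # Year columns are the ones whose names consist only of digits
    year_cols = [col for col in raw.columns if str(col).isdigit()]
    long = raw[id_vars + year_cols].melt(
        id_vars=id_vars, var_name="Year", value_name=value_name
    )
    # Coerce both fields to numeric; unparseable entries become NaN
    long["Year"] = pd.to_numeric(long["Year"], errors="coerce")
    long[value_name] = pd.to_numeric(long[value_name], errors="coerce")
    return long


# Toy wide table with one unparseable entry ("n/a")
toy_raw = pd.DataFrame(
    {
        "Country Name": ["Example", "Sample"],
        "Country Code": ["AAA", "BBB"],
        "Indicator Name": ["PM2.5", "PM2.5"],
        "2020": [1.2, "n/a"],
        "2021": [1.3, 2.5],
    }
)
toy_tidy = tidy_world_bank(toy_raw, "pm25")
print(toy_tidy)
```

With the notebook's tables, the calls would look like `tidy_world_bank(pm25_raw, 'pm25')` and `tidy_world_bank(gdp_raw, 'gdp_per_capita')`.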


## Step 3: Join the datasets and filter to a single recent year
We focus on 2021 to minimise pandemic-era volatility while keeping the latest complete data.
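
Whether a given year counts as "complete" can be checked rather than assumed. The sketch below counts, per year, how many countries have both indicators present and picks the latest year with the best coverage. It runs on a toy table with placeholder values; the column names match the tidy tables from Step 2.

```python
import numpy as np
import pandas as pd

# Toy merged table: two indicators per country-year, with some gaps
toy = pd.DataFrame(
    {
        "Country Code": ["AAA", "BBB", "CCC"] * 3,
        "Year": [2020] * 3 + [2021] * 3 + [2022] * 3,
        "pm25": [10.0, 20.0, 30.0, 11.0, 21.0, 31.0, np.nan, np.nan, 32.0],
        "gdp_per_capita": [1e3, 2e3, np.nan, 1.1e3, 2.1e3, 3.1e3, 1.2e3, 2.2e3, 3.2e3],
    }
)

# Count countries per year that have both values present
coverage = (
    toy.dropna(subset=["pm25", "gdp_per_capita"])
    .groupby("Year")["Country Code"]
    .nunique()
)
print(coverage)

# Latest year whose coverage equals the best coverage seen
best_year = coverage[coverage == coverage.max()].index.max()
print(best_year)
```

Running the same count on the real merged table before filtering would show how 2021 compares with its neighbouring years.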

In [None]:
merged = (
    pm25_long.merge(
        gdp_long[['Country Code', 'Year', 'gdp_per_capita']],
        on=['Country Code', 'Year'],
        how='inner',
    )
    .dropna(subset=['pm25', 'gdp_per_capita'])
)
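# Keep 2021 rows with strictly positive values (log10 needs gdp_per_capita > 0)
merged = merged[(merged['Year'] == 2021) & (merged['pm25'] > 0) & (merged['gdp_per_capita'] > 0)]
# Keep three-letter codes. Caution: World Bank aggregates (e.g. WLD for World) also
# use three-letter codes, so this check alone does not remove them.
merged = merged[merged['Country Code'].str.len() == 3]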
utils.diagnostics(merged, 'PM2.5 vs GDP (2021)', expected_columns=['Country Name', 'Country Code', 'Year', 'pm25', 'gdp_per_capita'], expected_row_range=(120, 220))
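
The data card asks for regional aggregates to be filtered out, but World Bank aggregate rows such as World (`WLD`) or High income (`HIC`) also carry three-letter codes, so a length check keeps them. One option is to exclude them explicitly. The sketch below assumes a hand-maintained set of aggregate codes. `AGGREGATE_CODES` is a hypothetical name, and the set is illustrative and incomplete; a full list would need to come from the World Bank's country metadata.

```python
import pandas as pd

# Illustrative, incomplete set of World Bank aggregate codes (assumption: kept by hand)
AGGREGATE_CODES = {"WLD", "HIC", "LIC", "EUU", "ARB"}

# Toy table mixing aggregates and countries (no measurement values needed here)
toy = pd.DataFrame(
    {
        "Country Name": ["World", "High income", "Norway", "India"],
        "Country Code": ["WLD", "HIC", "NOR", "IND"],
    }
)

# The length check keeps every row, aggregates included
by_length = toy[toy["Country Code"].str.len() == 3]

# Explicit exclusion drops the aggregate rows
countries_only = toy[~toy["Country Code"].isin(AGGREGATE_CODES)]
print(countries_only)
```

Applied to the notebook's table, the same `isin` mask would replace or follow the length check on `merged`.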


## Step 4: Add helper columns for the regression-friendly plot
Take log₁₀ of GDP per capita so that incomes spanning several orders of magnitude spread evenly along the x-axis, and flag a few notable countries to annotate.

In [None]:
notable_countries = ['USA', 'CHN', 'IND', 'NOR', 'NGA']
plot_data = merged.assign(
    log_gdp=lambda df: np.log10(df['gdp_per_capita']),
    country_label=lambda df: np.where(df['Country Code'].isin(notable_countries), df['Country Name'], ''),
)
utils.diagnostics(plot_data, 'Plot dataset', expected_columns=['pm25', 'gdp_per_capita', 'log_gdp'], expected_row_range=(120, 220))


## Step 5: Build the annotated scatter plot
Follow the claim → evidence → visual → takeaway scaffold and ensure the story metadata is complete before plotting.

In [None]:
TITLE = 'High PM2.5 exposure clusters in countries with lower economic resources'
SUBTITLE = 'PM2.5 exposure vs. GDP per capita, 2021 (log scale)'
ANNOTATION = 'China and India face elevated pollution even as incomes rise; Norway enjoys clean air at high income.'
SOURCE = 'World Bank Indicators (EN.ATM.PM25.MC.M3, NY.GDP.PCAP.CD)'
UNITS = 'PM2.5 exposure (µg/m³)'

metadata = {
    'title': TITLE,
    'subtitle': SUBTITLE,
    'annotation': ANNOTATION,
    'source': SOURCE,
    'units': UNITS,
}
utils.validate_story_elements(metadata)

# Scatter of countries with a fitted linear trend line
fig, ax = plt.subplots(figsize=(10, 6))
sns.regplot(
    data=plot_data,
    x='log_gdp',
    y='pm25',
    scatter_kws={'alpha': 0.6, 's': 70, 'color': '#1f77b4'},
    line_kws={'color': '#d62728', 'linewidth': 2},
    ax=ax,
)

# Label the flagged countries just to the right of their points
for _, row in plot_data[plot_data['country_label'] != ''].iterrows():
    ax.text(
        row['log_gdp'] + 0.02,
        row['pm25'],
        row['country_label'],
        fontsize=10,
        ha='left',
        va='center',
    )

utils.apply_story_template(ax, title=TITLE, subtitle=SUBTITLE, source=SOURCE, units=UNITS)
ax.set_xlabel('GDP per capita (log₁₀ US$)')
ax.set_ylabel(UNITS)

# Anchor the annotation arrow at the median log GDP and 75th-percentile PM2.5
ax.annotate(
    ANNOTATION,
    xy=(plot_data['log_gdp'].median(), plot_data['pm25'].quantile(0.75)),
    xycoords='data',
    xytext=(20, -80),
    textcoords='offset points',
    arrowprops=dict(arrowstyle='->', color='#444444'),
    fontsize=11,
    ha='left',
    va='top',
    bbox=dict(boxstyle='round,pad=0.3', fc='white', ec='#555555', alpha=0.85),
)

plt.tight_layout()
utils.save_last_fig('day03_solution_plot.png')


In [None]:
display(
    Markdown(
        utils.summarize_claim(
            claim='Cleaner air is strongly correlated with higher economic capacity.',
            evidence='The regression line slopes downward: wealthier countries generally report lower PM2.5 exposure.',
            takeaway='Communities with fewer resources face compounded risks—less capacity to mitigate pollution and greater exposure.',
        )
    )
)
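
The claim and evidence above are stated qualitatively. A slope and a correlation coefficient would put numbers behind them. The sketch below fits an ordinary least-squares line with `np.polyfit`, which should match the straight line `sns.regplot` draws by default, and computes the Pearson correlation. The toy values are placeholders with a built-in downward trend, not real measurements.

```python
import numpy as np
import pandas as pd

# Toy data with a built-in downward trend (placeholder values)
toy = pd.DataFrame(
    {
        "log_gdp": [3.0, 3.5, 4.0, 4.5, 5.0],
        "pm25": [50.0, 42.0, 30.0, 22.0, 10.0],
    }
)

# Ordinary least-squares line: pm25 ~ slope * log_gdp + intercept
slope, intercept = np.polyfit(toy["log_gdp"], toy["pm25"], deg=1)

# Pearson correlation between the two columns
r = toy["log_gdp"].corr(toy["pm25"])

# Because the x-axis is log10, the slope reads as change per tenfold GDP increase
print(f"slope: {slope:.1f} µg/m³ per tenfold increase in GDP per capita")
print(f"Pearson r: {r:.2f}")
```

Substituting `plot_data` for `toy` would show whether "strongly correlated" is a fair description of the 2021 data.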
