## 🔗 Open This Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DavidLangworthy/ds4s/blob/main/days/day03/notebook/day03_starter.ipynb)

# 🌫️ Day 3 – Pollution, Prosperity, and Public Health

Today’s goal is to understand who breathes the dirtiest air and why. You will merge two World Bank indicators, stress-test the combined dataset, and build an interactive chart that invites exploration.

### Data Card — World Bank Indicators (2019 focus year)

| Dataset | File | Units | Coverage | Caveats |
| --- | --- | --- | --- | --- |
| PM2.5 exposure | `data/pm25_exposure.csv` | µg/m³ (annual mean) | Countries & aggregates, 1990–2023 | Some small-island states have missing years; values are rounded. |
| GDP per capita | `data/gdp_per_country.csv` | Current USD | Countries & aggregates, 1960–2023 | Currency in current dollars; economic shocks can skew year-to-year comparisons. |

- Selected year: 2019 (global coverage). Most countries report pre-pandemic values; 2020–2021 shifts are excluded.
- Shared caveat: Regional aggregates remain in both files; filter to country codes to avoid double counting.


### Step 1 · Imports and helpers

Plotly drives the interactive visual; pandas handles wrangling.

In [None]:
import pandas as pd
import plotly.express as px

from utils import (
    expect_rows_between,
    load_csv,
    quick_check,
    save_plotly_fig,
    validate_columns,
    validate_story_elements,
)


### Step 2 · Load the two indicator tables

Grab the wide-format CSVs; we will pick out the 2019 columns next.

In [None]:
pm25_raw = load_csv("data/pm25_exposure.csv")
gdp_raw = load_csv("data/gdp_per_country.csv")
quick_check(pm25_raw.head(), name="PM2.5 preview")
quick_check(gdp_raw.head(), name="GDP preview")


<details>
<summary>Need a nudge?</summary>

- Look for the `2019` column; each dataset stores one column per year, and Step 3 selects it with the string `"2019"`.
- Some column names contain spaces and parentheses; keep them exactly as written when selecting.

</details>
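
To make the wide layout concrete, here is a minimal sketch on a made-up two-country table. The country names, codes, and values are hypothetical and only mirror the shape described above (identifier columns plus one column per year). It selects the 2019 snapshot and, optionally, reshapes to long format with `melt`.

```python
import pandas as pd

# Hypothetical miniature of a wide export; the values are invented.
wide = pd.DataFrame(
    {
        "Country Name": ["Country A", "Country B"],
        "Country Code": ["AAA", "BBB"],
        "2018": [12.1, 30.4],
        "2019": [11.8, 29.9],
    }
)

# Year headers are text here, so they are matched as strings.
year_columns = [col for col in wide.columns if col.isdigit()]

# One-year snapshot: identifiers plus the chosen year column.
snapshot = wide[["Country Name", "Country Code", "2019"]]

# Optional reshape: one row per country-year instead of one column per year.
long = wide.melt(
    id_vars=["Country Name", "Country Code"],
    value_vars=year_columns,
    var_name="Year",
    value_name="Value",
)
print(long)
```

The long form is handy for time-series plots; for a single-year scatter the snapshot is enough.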

### Step 3 · Isolate 2019 values and clean up column names

Convert the 2019 column to numeric. With `errors="coerce"`, placeholder strings like `..` or blanks become `NaN`; Step 4 then drops those rows.

In [None]:
target_year = "2019"
pm25_2019 = (
    pm25_raw[["Country Name", "Country Code", target_year]]
    .rename(columns={target_year: "PM25"})
    .assign(PM25=lambda df: pd.to_numeric(df["PM25"], errors="coerce"))
)

gdp_2019 = (
    gdp_raw[["Country Name", "Country Code", target_year]]
    .rename(columns={target_year: "GDP_per_capita"})
    .assign(GDP_per_capita=lambda df: pd.to_numeric(df["GDP_per_capita"], errors="coerce"))
)

validate_columns(pm25_2019, ["Country Name", "Country Code", "PM25"])
validate_columns(gdp_2019, ["Country Name", "Country Code", "GDP_per_capita"])
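
The coercion step can be checked in isolation. The sketch below runs `pd.to_numeric` on a small invented series (the values are not from the dataset) and counts how many entries turn into `NaN`, which is a useful number to record before any rows are dropped.

```python
import pandas as pd

# Invented raw values: two real numbers, a ".." placeholder, a blank, and a missing entry.
raw = pd.Series(["11.8", "..", "", "30.4", None], name="2019")

# errors="coerce" converts anything unparseable to NaN instead of raising.
clean = pd.to_numeric(raw, errors="coerce")

# Count how many values were lost to coercion or were already missing.
n_missing = int(clean.isna().sum())
print(clean)
print(f"{n_missing} of {len(clean)} values are missing after coercion")
```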


### Step 4 · Merge the indicators and remove aggregates

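World Bank files include regional and income-group roll-ups (e.g., “High income”). The cell below keeps rows with three-letter uppercase codes and then checks the row count. World Bank aggregate codes such as `WLD` and `HIC` also have three letters, so treat the regex as a format check and rely on the row-count guard, or an explicit exclusion list, to confirm the roll-ups are gone.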

In [None]:
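# Inner join keeps only rows present in both tables.
combined = pm25_2019.merge(gdp_2019, on=["Country Name", "Country Code"], how="inner")
# Drop rows where either indicator is missing for 2019.
combined = combined.dropna(subset=["PM25", "GDP_per_capita"])
# Keep three-letter uppercase codes; na=False treats missing codes as non-matches,
# and .copy() makes the column assignment in Step 5 safe.
combined = combined[combined["Country Code"].str.fullmatch(r"[A-Z]{3}", na=False)].copy()
expect_rows_between(combined, minimum=150, maximum=220)
quick_check(combined.tail(), name="Combined dataset")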
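
An inner join silently discards rows that appear in only one table. One way to stress-test the merge is an outer join with `indicator=True`, which labels each row by where it came from. The sketch below uses invented codes and values, and merges on the code alone for brevity; the notebook merges on name and code, so a name spelled differently in the two files would also show up as unmatched.

```python
import pandas as pd

# Invented tables: "CCC" exists only on the left, "DDD" only on the right.
pm25 = pd.DataFrame(
    {"Country Code": ["AAA", "BBB", "CCC"], "PM25": [10.0, 25.0, 40.0]}
)
gdp = pd.DataFrame(
    {"Country Code": ["AAA", "BBB", "DDD"], "GDP_per_capita": [50000.0, 3000.0, 900.0]}
)

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
audit = pm25.merge(gdp, on="Country Code", how="outer", indicator=True)

# Rows an inner join would have dropped.
unmatched = audit[audit["_merge"] != "both"]
match_counts = audit["_merge"].value_counts()
print(unmatched)
print(match_counts)
```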


<details>
<summary>Still cleaning?</summary>

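- The regex filter keeps three-letter uppercase codes. That matches the ISO Alpha-3 format, but World Bank aggregates such as `WLD` (World) and `HIC` (High income) pass it too; if the row count exceeds the expected range, exclude those codes explicitly.
- Aggregates left in the data appear as extra bubbles among the countries and double count their members.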

</details>
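
Because aggregate codes share the three-letter format, an explicit exclusion list is a more dependable filter than the regex alone. The sketch below is an assumption-laden illustration: `AGGREGATE_CODES` is a deliberately partial list of World Bank aggregate codes and should be extended from the dataset's metadata before real use.

```python
import pandas as pd

# Partial, illustrative list of World Bank aggregate codes (an assumption;
# the full set should come from the indicator metadata).
AGGREGATE_CODES = {"WLD", "HIC", "LIC", "LMC", "UMC", "EAS", "ECS", "SSF"}

rows = pd.DataFrame(
    {
        "Country Name": ["World", "High income", "India", "Norway"],
        "Country Code": ["WLD", "HIC", "IND", "NOR"],
    }
)

# Every row passes the format check, including the two aggregates.
passes_regex = rows["Country Code"].str.fullmatch(r"[A-Z]{3}", na=False)

# Combining the format check with the exclusion list removes the roll-ups.
is_aggregate = rows["Country Code"].isin(AGGREGATE_CODES)
countries_only = rows[passes_regex & ~is_aggregate].copy()
print(countries_only)
```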

### Step 5 · Classify countries into income tiers

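This provides a colour channel for the bubble chart so we can compare peers. The cut-offs follow the World Bank's income-classification thresholds, which are defined on GNI per capita; applied to GDP per capita they give approximate tiers, not official classifications.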

In [None]:
income_bins = [0, 1085, 4255, 13205, float("inf")]
income_labels = ["Low income", "Lower-middle", "Upper-middle", "High income"]
combined["IncomeGroup"] = pd.cut(
    combined["GDP_per_capita"],
    bins=income_bins,
    labels=income_labels,
    include_lowest=True,
)
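# observed=False keeps empty tiers in the count and avoids a pandas FutureWarning.
income_counts = combined.groupby("IncomeGroup", observed=False).size().reset_index(name="Countries")
quick_check(income_counts, name="Income group counts")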
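
It helps to know which tier a value sitting exactly on a cut-off lands in. By default `pd.cut` builds right-closed intervals, so a boundary value belongs to the lower tier, and `include_lowest=True` lets the first interval include zero. A small sketch with hand-picked sample values, reusing the same bins and labels:

```python
import pandas as pd

bins = [0, 1085, 4255, 13205, float("inf")]
labels = ["Low income", "Lower-middle", "Upper-middle", "High income"]

# Hand-picked values on and just above each cut-off.
samples = pd.Series([0, 1085, 1085.01, 4255, 13205, 13205.01])

# Intervals are right-closed: (0, 1085], (1085, 4255], and so on.
tiers = pd.cut(samples, bins=bins, labels=labels, include_lowest=True)
result = tiers.astype(str).tolist()
print(dict(zip(samples.tolist(), result)))
```

Values outside the bins, such as negatives or `NaN`, come back as missing, which is another reason to drop incomplete rows before classifying.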


### Step 6 · Build the storytelling checklist

As before, make sure the narrative scaffolding is in place before plotting.

In [None]:
story = {
    "title": "Wealthier Countries Breathe Cleaner Air",
    "subtitle": "PM2.5 exposure vs. GDP per capita (2019) with income-tier colouring",
    "annotation": "Upper-middle income countries shoulder the highest particulate loads.",
    "source": "World Bank World Development Indicators (accessed 2024-12)",
    "units": "µg/m³ and current USD",
}
validate_story_elements(story)
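
The internals of `validate_story_elements` are not shown in this notebook. As a rough idea of what such a check might do, the sketch below uses a hypothetical helper, `missing_story_elements`, that reports required keys that are absent or blank. It is an illustration, not the implementation in `utils`.

```python
# Keys the checklist above fills in; treating all five as required is an assumption.
REQUIRED_KEYS = ("title", "subtitle", "annotation", "source", "units")


def missing_story_elements(story: dict) -> list:
    """Return the required keys that are absent, None, or only whitespace."""
    return [key for key in REQUIRED_KEYS if not str(story.get(key) or "").strip()]


# A deliberately incomplete draft: the source is blank and three keys are missing.
draft = {"title": "Wealthier Countries Breathe Cleaner Air", "source": "  "}
problems = missing_story_elements(draft)
print(problems)
```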


### Step 7 · Configure the interactive scatter

Use log scale on GDP to spread out low-income countries. Bubble size follows PM2.5 to highlight especially polluted places.

In [None]:
fig = px.scatter(
    combined,
    x="GDP_per_capita",
    y="PM25",
    color="IncomeGroup",
    size="PM25",
    hover_name="Country Name",
    log_x=True,
    size_max=40,
    title=story["title"],
    labels={
        "GDP_per_capita": "GDP per capita (USD, log scale)",
        "PM25": "PM2.5 exposure (µg/m³)",
        "IncomeGroup": "World Bank income tier",
    },
)
fig.update_layout(
    legend_title_text="Income tier",
    annotations=[
        dict(
            text=story["annotation"],
            x=0.02,
            y=0.98,
            xref="paper",
            yref="paper",
            showarrow=False,
            bgcolor="rgba(255,255,255,0.7)",
        )
    ],
)
fig.update_traces(marker=dict(opacity=0.75, line=dict(width=0.5, color="#333")))
fig.show()
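
The reason for the log axis can be shown with three round numbers. On a linear axis the gap between 500 and 5,000 USD is a tenth of the gap between 5,000 and 50,000, so poorer countries bunch together on the left. On a log axis each tenfold increase takes the same distance. The values below are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative GDP per capita values, each ten times the previous one.
gdp = pd.Series([500, 5_000, 50_000], index=["low", "middle", "high"])

# Linear spacing: the second gap is ten times the first.
linear_gaps = gdp.diff().dropna()

# Log spacing: both gaps are one unit, i.e. one order of magnitude.
log_gaps = np.log10(gdp).diff().dropna()
print(linear_gaps.tolist())
print(log_gaps.round(6).tolist())
```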


### Step 8 · Optional static check

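If Plotly is unavailable, a quick static scatter through pandas' Matplotlib backend gives a rough check of the trend. Setting `logx=True` keeps the x-axis comparable with the interactive chart.

In [None]:
_ = combined.plot.scatter(x="GDP_per_capita", y="PM25", alpha=0.4, grid=True, logx=True)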
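
A number can back up the visual check. Spearman's rank correlation measures whether PM2.5 falls as GDP rises without assuming a straight line, which suits a log-scaled relationship. The sketch below computes it as the Pearson correlation of the ranks, so it needs only pandas; the five rows are invented to show a perfectly decreasing pattern.

```python
import pandas as pd

# Invented rows where pollution falls steadily as income rises.
toy = pd.DataFrame(
    {
        "GDP_per_capita": [800, 2_500, 9_000, 40_000, 65_000],
        "PM25": [55.0, 48.0, 30.0, 12.0, 7.0],
    }
)

# Spearman correlation = Pearson correlation computed on the ranks.
spearman = toy["GDP_per_capita"].rank().corr(toy["PM25"].rank())
print(f"Spearman rank correlation: {spearman:.2f}")
```

On real data, expect a value between -1 and 0 if the title's claim holds; a value near zero would suggest the story needs rethinking.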


### Step 9 · Save the interactive output for archiving

This writes the chart to an HTML file that GitHub Actions can upload, so it does not need to be embedded in the notebook.

In [None]:
save_plotly_fig(fig, "plots/day03_solution_bubble.html")


### Step 10 · Reflection prompts

- Which filters or diagnostics caught issues before plotting?
- Identify at least one outlier to discuss (e.g., high pollution at high income).
- Note one limitation (e.g., missing population scaling) to acknowledge in your write-up.