## 🔗 Open This Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DavidLangworthy/ds4s/blob/main/days/day03/notebook/day03_starter.ipynb)

# 🌫️ Day 3 – Pollution, Prosperity, and Public Health
You will connect particulate exposure with income to surface where air quality progress is lagging.

## 🧾 Data Card – PM₂.₅ Exposure & GDP per Capita
- **Sources:** [World Health Organization Global Health Observatory](https://ghoapi.azureedge.net/api/PM25) and [World Bank World Development Indicators](https://data.worldbank.org/indicator/NY.GDP.PCAP.CD).
- **Temporal coverage:** 1990–2021 (annual).
- **Units:** PM₂.₅ exposure in micrograms per cubic meter; GDP per capita in current USD.
- **Update cadence:** Roughly annual, as modeled exposure estimates are revised and national accounts are reported.
- **Method notes:** Exposure estimates model outdoor concentrations weighted by population.
- **Caveats:** Some small states have missing PM₂.₅ values; GDP is in nominal dollars and not adjusted for purchasing power.
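
The method note above says exposure is weighted by population. As a rough illustration of what that means, the sketch below compares a plain average with a population-weighted one. The three districts and all numbers are made up for the example and do not come from the WHO data.

```python
# Illustrative numbers only: three hypothetical districts in one country.
concentrations = [10.0, 20.0, 60.0]            # annual mean PM2.5 in µg/m³
populations = [4_000_000, 1_000_000, 5_000_000]  # residents per district

# Plain average: every district counts equally, however few people live there.
simple_mean = sum(concentrations) / len(concentrations)

# Population-weighted average: each district counts in proportion to its residents.
weighted_mean = (
    sum(c * p for c, p in zip(concentrations, populations)) / sum(populations)
)

print(f"Unweighted mean:          {simple_mean:.1f} µg/m³")
print(f"Population-weighted mean: {weighted_mean:.1f} µg/m³")
```

Because half of the people in this toy country live in the most polluted district, the weighted figure sits above the plain average.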

## 🧭 Story Scaffold
- **Claim:** Where is economic growth decoupling from dirty air?
- **Evidence:** Which countries sit above or below the trend?
- **Visual:** Scatter plot with log-scale GDP, color-coded income tiers, and labeled standouts.
- **Takeaway:** Highlight uncertainty (modeled exposure) and equity concerns.

In [None]:
from __future__ import annotations

from pathlib import Path
import sys

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

for candidate in [Path.cwd(), Path.cwd().parent, Path.cwd().parent.parent]:
    utils_path = candidate / "utils.py"
    if utils_path.exists():
        if str(candidate) not in sys.path:
            sys.path.insert(0, str(candidate))
        break
else:
    raise FileNotFoundError("Unable to locate utils.py. Did you download the full project?")

from utils import (
    baseline_style,
    diagnose_dataframe,
    expect_rows_between,
    load_data,
    save_last_fig,
    validate_columns,
    validate_story_elements,
)

baseline_style()
sns.set_palette("colorblind")


In [None]:
# Example: log-scaling a scatter plot
example_data = pd.DataFrame(
    {
        "GDP_per_capita": [500, 2000, 10000, 50000],
        "PM25": [60, 45, 18, 12],
        "Label": ["Low", "Lower-middle", "Upper-middle", "High"],
    }
)
fig, ax = plt.subplots(figsize=(5, 3))
ax.scatter(example_data["GDP_per_capita"], example_data["PM25"], color="#1f77b4")
ax.set_xscale("log")
ax.set_xlabel("GDP per capita (USD, log scale)")
ax.set_ylabel("PM₂.₅ exposure (µg/m³)")
plt.close(fig)
fig


In [None]:
# Step 1 – Load pollution and income datasets
df_pm25 = load_data("pm25_exposure.csv")
df_gdp = load_data("gdp_per_country.csv")

# TODO: Load both CSV files into dataframes using load_data


<details>
<summary>Need a nudge on loading?</summary>
<ul>
<li>Call <code>load_data</code> for <code>pm25_exposure.csv</code> and <code>gdp_per_country.csv</code>.</li>
<li>Store them in clearly named dataframes (e.g., <code>df_pm25</code>, <code>df_gdp</code>).</li>
</ul>
</details>
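
`load_data` comes from the course's `utils.py`, and its internals are not shown in this notebook. As a generic point of comparison, the sketch below reads a small inline CSV with `pandas.read_csv` to show the wide layout (one column per year) that Step 2 relies on. The countries and values are placeholders, not real data.

```python
import io

import pandas as pd

# Inline stand-in for a wide CSV; a real file would be read from disk.
csv_text = """Country Name,Country Code,2018,2019
Examplia,EXA,21.0,20.0
Samplestan,SMP,35.5,
"""
df_demo = pd.read_csv(io.StringIO(csv_text))

# Year headers arrive as strings, so select them with "2019", not 2019.
print(df_demo.columns.tolist())

# The empty trailing field for Samplestan is read as a missing value (NaN).
print(df_demo["2019"].isna().sum())
```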

In [None]:
# Step 2 – Focus on the most recent common year (2019)
pollution_2019 = (
    df_pm25[["Country Name", "Country Code", "2019"]]
    .rename(columns={"2019": "PM25"})
)
income_2019 = (
    df_gdp[["Country Name", "Country Code", "2019"]]
    .rename(columns={"2019": "GDP_per_capita"})
)

# TODO: Select the Country columns plus 2019, and rename to PM25 / GDP_per_capita


<details>
<summary>Column selection hint</summary>
<ul>
<li>Use double brackets to keep only the columns you name, in the order you list them.</li>
<li><code>rename</code> lets you swap out the 2019 column name.</li>
</ul>
</details>
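
The hint above can be tried on a toy frame before touching the real data. This is a minimal sketch with made-up countries and values; only the column names mirror the ones used in Step 2.

```python
import pandas as pd

# Tiny stand-in for a wide table with one column per year (values are made up).
wide = pd.DataFrame(
    {
        "Country Name": ["Examplia", "Samplestan"],
        "Country Code": ["EXA", "SMP"],
        "2018": [21.0, 35.5],
        "2019": [20.0, 34.0],
    }
)

# Double brackets pass a list of labels, so the result is a DataFrame
# with exactly those columns, in the order listed.
snapshot = wide[["Country Name", "Country Code", "2019"]].rename(
    columns={"2019": "PM25"}
)

print(snapshot.columns.tolist())
```

`rename` returns a new frame here, so `wide` keeps its original `2019` column.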

In [None]:
# Step 3 – Combine and clean the datasets
pollution_income = (
    pollution_2019.merge(income_2019, on=["Country Name", "Country Code"], how="inner")
    .dropna(subset=["PM25", "GDP_per_capita"])
)

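# Quartiles of this sample's GDP per capita; the labels borrow World Bank
# tier names but are not the official income classification.
pollution_income["IncomeGroup"] = pd.qcut(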
    pollution_income["GDP_per_capita"],
    q=4,
    labels=["Low income", "Lower-middle", "Upper-middle", "High income"],
)

# TODO: Merge on both keys, drop missing values, and bucket incomes into quartiles


<details>
<summary>Merge hint</summary>
<ul>
<li>Join on both country name and code so rows pair up only when both identifiers agree.</li>
<li>Use <code>dropna</code> so the scatter does not include incomplete rows.</li>
<li><code>pd.qcut</code> splits countries into quartiles for the <code>IncomeGroup</code> column.</li>
</ul>
</details>
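
Before trusting an inner join, it can help to see which rows it would drop. The sketch below is one way to audit that with the `indicator` option of `merge`, followed by a small `pd.qcut` call. The frames are made up, and for brevity they join on the country code alone rather than on both keys.

```python
import pandas as pd

# Made-up values for hypothetical countries; "CCC" has no PM25 reading.
pm = pd.DataFrame(
    {
        "Country Code": ["AAA", "BBB", "CCC", "DDD", "EEE"],
        "PM25": [55.0, 40.0, None, 12.0, 9.0],
    }
)
gdp = pd.DataFrame(
    {
        "Country Code": ["AAA", "BBB", "CCC", "DDD", "FFF"],
        "GDP_per_capita": [800.0, 3000.0, 9000.0, 45000.0, 60000.0],
    }
)

# An outer merge with indicator=True adds a "_merge" column that records
# whether each key was found in the left frame, the right frame, or both.
audit = pm.merge(gdp, on="Country Code", how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", "Country Code"].tolist()
print("Dropped by an inner join:", unmatched)

# The inner join plus dropna keeps only complete rows.
merged = pm.merge(gdp, on="Country Code", how="inner").dropna(
    subset=["PM25", "GDP_per_capita"]
)

# qcut picks bin edges from the data so each bin holds roughly the same
# number of rows; q=2 here because the toy frame has only three rows left.
merged["IncomeHalf"] = pd.qcut(
    merged["GDP_per_capita"], q=2, labels=["Lower half", "Upper half"]
)
print(merged)
```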

In [None]:
# Step 4 – Diagnostics and expectations
diagnose_dataframe(pollution_income, name="PM2.5 vs GDP (2019)")
validate_columns(pollution_income, ["Country Name", "Country Code", "PM25", "GDP_per_capita"], name="pollution_income")
expect_rows_between(pollution_income, 150, 220, name="pollution_income")


In [None]:
# Step 5 – Identify standout countries for annotation
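top_polluted = pollution_income.nlargest(3, "PM25")
# A single-year snapshot cannot show improvement over time, so these are
# the three lowest-exposure countries, not "improvers".
cleanest_air = pollution_income.nsmallest(3, "PM25")
standout_labels = pd.concat([top_polluted, cleanest_air])["Country Name"].unique().tolist()
standout_labels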
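
The story scaffold asks which countries sit above or below the trend, which is a slightly different question from which have the highest or lowest raw PM₂.₅. One possible way to answer it, sketched here on made-up data, is to fit a straight line of PM₂.₅ against log GDP and rank countries by their residuals. The frame `toy`, the `residual` column, and the choice of a simple linear fit are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values for six hypothetical countries.
toy = pd.DataFrame(
    {
        "Country Name": ["A", "B", "C", "D", "E", "F"],
        "GDP_per_capita": [500.0, 2000.0, 10000.0, 50000.0, 60000.0, 1500.0],
        "PM25": [60.0, 45.0, 18.0, 12.0, 55.0, 15.0],
    }
)

# Fit PM25 as a straight line in log10(GDP), matching the log-scaled x-axis.
log_gdp = np.log10(toy["GDP_per_capita"])
slope, intercept = np.polyfit(log_gdp, toy["PM25"], deg=1)

# Positive residual: dirtier air than the fitted line predicts at that income.
# Negative residual: cleaner air than the fitted line predicts.
toy["residual"] = toy["PM25"] - (slope * log_gdp + intercept)

above_trend = toy.nlargest(2, "residual")["Country Name"].tolist()
below_trend = toy.nsmallest(2, "residual")["Country Name"].tolist()
print("Above trend:", above_trend)
print("Below trend:", below_trend)
```

In this toy set, country E is rich but polluted and F is poor but clean, so they surface as the clearest departures from the fitted line even though neither has the most extreme raw reading.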


In [None]:
# Step 6 – Story metadata strings
TITLE = "Economic Growth Alone Doesn't Clear the Air"
SUBTITLE = "Population-weighted PM₂.₅ exposure vs. GDP per capita, 2019"
ANNOTATION = "High-income Gulf states still face elevated PM₂.₅ despite wealth."
SOURCE = "WHO GHO & World Bank (downloaded 2024-04-15)"
UNITS = "PM₂.₅ (µg/m³) and GDP per capita (USD, log scale)"

validate_story_elements(
    {
        "TITLE": TITLE,
        "SUBTITLE": SUBTITLE,
        "ANNOTATION": ANNOTATION,
        "SOURCE": SOURCE,
        "UNITS": UNITS,
    }
)

# TODO: Make sure each storytelling string is filled in


In [None]:
# Step 7 – Build the annotated scatter plot
fig, ax = plt.subplots(figsize=(9, 6))
scatter = sns.scatterplot(
    data=pollution_income,
    x="GDP_per_capita",
    y="PM25",
    hue="IncomeGroup",
    palette="Set2",
    alpha=0.8,
    edgecolor="white",
    linewidth=0.5,
    ax=ax,
)
ax.set_xscale("log")
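# Show the subtitle under the title so the x-axis label carries only the axis description.
ax.set_title(f"{TITLE}\n{SUBTITLE}", loc="left")
ax.set_xlabel("GDP per capita (current USD, log scale)")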
ax.set_ylabel(UNITS.split(" and ")[0])
ax.text(0.01, -0.2, f"Source: {SOURCE}", transform=ax.transAxes)
ax.legend(title="Income tier", loc="upper right")

for _, row in pollution_income[pollution_income["Country Name"].isin(standout_labels)].iterrows():
    ax.annotate(
        row["Country Name"],
        xy=(row["GDP_per_capita"], row["PM25"]),
        xytext=(5, 5),
        textcoords="offset points",
        fontsize=9,
        bbox=dict(boxstyle="round,pad=0.2", fc="white", ec="none", alpha=0.8),
    )

fig.tight_layout()
pollution_fig = fig
fig

# TODO: Create the scatter with log-scaled GDP and annotations for standouts
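
The `ANNOTATION` string from Step 6 is validated but not drawn anywhere in the Step 7 cell. If you want it on the chart, one option is a text note pinned in axes coordinates. This is a minimal sketch on placeholder points; the position and font size are arbitrary choices.

```python
import matplotlib.pyplot as plt

ANNOTATION = "High-income Gulf states still face elevated PM₂.₅ despite wealth."

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter([500, 2000, 10000, 50000], [60, 45, 18, 12])  # made-up points
ax.set_xscale("log")

# Axes-fraction coordinates (0 to 1) keep the note pinned to the upper-left
# corner regardless of the data range or the log scale.
note = ax.text(
    0.02,
    0.96,
    ANNOTATION,
    transform=ax.transAxes,
    ha="left",
    va="top",
    fontsize=9,
    wrap=True,
)
fig.tight_layout()
plt.close(fig)
```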


In [None]:
# Step 8 – Final validation and save option
validate_story_elements(
    {
        "TITLE": TITLE,
        "SUBTITLE": SUBTITLE,
        "ANNOTATION": ANNOTATION,
        "SOURCE": SOURCE,
        "UNITS": UNITS,
    }
)
save_last_fig("day03_pollution_income.png", fig=pollution_fig)
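
`save_last_fig` is a course helper whose behavior is not shown here. If it is unavailable, matplotlib's own `Figure.savefig` is a generic fallback. The sketch below writes to a temporary folder, which is a stand-in location chosen only so the example leaves no files behind.

```python
from pathlib import Path
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3], [3, 2, 1])  # placeholder data
ax.text(0.01, -0.2, "Source: placeholder", transform=ax.transAxes)

out_dir = Path(tempfile.mkdtemp())  # hypothetical output folder
out_path = out_dir / "day03_pollution_income.png"

# bbox_inches="tight" expands the saved area to include text drawn
# outside the axes, such as a source line below the x-axis.
fig.savefig(out_path, dpi=200, bbox_inches="tight")
plt.close(fig)

print(out_path.exists())
```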
