## 🔗 Open This Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DavidLangworthy/ds4s/blob/master/days/day03/solution/day03_solution.ipynb)

# 🌫️ Day 3 – Pollution, Prosperity, and Health
### Guided loops: tidy → merge → explore → storytell

Today’s loops build on everything so far: you will tidy two datasets, confirm they align, create an interpretable scatter plot, and narrate what the relationship says—and doesn’t say—about global health equity.

## 📇 Data Card — World Bank Indicators (PM₂.₅ & GDP per Capita)
- **Sources**: World Bank World Development Indicators (downloaded January 2024).
- **Temporal coverage**: 1960–2023 annual values; we focus on 2019 to limit pandemic-era noise.
- **Units**: PM₂.₅ exposure in µg/m³; GDP per capita in current USD.
- **Processing notes**: Select 2019 column, drop aggregates (non-ISO codes), keep countries with both metrics.
- **Last updated**: December 2023 WDI refresh.
- **Caveats**: GDP in nominal USD exaggerates currency swings; PM₂.₅ exposure uses modeled estimates. Aggregated regions (e.g., “Africa Eastern and Southern”) are removed before plotting.

> 🔎 **What this analysis cannot tell us**: Causality between income and pollution, within-country inequality, or rural/urban exposure differences.

## 🗺️ Workflow Map
1. **Setup & shared helpers**.
2. **Load and inspect** both raw indicator tables.
3. **Slice 2019 & clean** column names.
4. **Merge & enrich** with a simple income-tier grouping.
5. **Story scaffold** before plotting.
6. **Visualise** with Plotly, run accessibility checks, and reflect on uncertainty.

## Step 0 · Imports, style, and quick diagnostics

In [None]:

from pathlib import Path
from textwrap import dedent

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image, display

# Global styling is applied by baseline_style(), defined below and called at the end of this cell.


def baseline_style():
    """Reset the Matplotlib/Seaborn style so every figure starts consistent."""
    sns.set_theme(style="whitegrid")
    plt.rcParams.update({
        "axes.titlesize": 18,
        "axes.labelsize": 13,
        "axes.titleweight": "bold",
        "figure.titlesize": 20,
        "xtick.labelsize": 11,
        "ytick.labelsize": 11,
        "legend.title_fontsize": 12,
        "legend.fontsize": 11,
    })
    return plt


def quick_peek(df, expected_columns=None, sample=3, label="DataFrame"):
    """Print a friendly snapshot so students can self-diagnose issues quickly."""
    print(f"\n🔍 {label} preview")
    print(df.head(sample))
    print(f"Rows: {len(df):,} | Columns: {list(df.columns)}")
    if expected_columns:
        missing = [col for col in expected_columns if col not in df.columns]
        if missing:
            print(f"⚠️ Missing column(s): {missing}")
        else:
            print("✅ Columns match the expectation.")
    return df


def expect_rows_between(df, low, high, label="row count"):
    """Report whether the row count falls inside the inclusive range [low, high]."""
    rows = len(df)
    if low <= rows <= high:
        print(f"✅ {label.title()} looks right: {rows:,}.")
    else:
        print(f"⚠️ {label.title()} looks off: {rows:,}. Expected between {low:,} and {high:,}.")
    return rows


def validate_story_elements(elements):
    """Flag any storytelling field that is empty or only whitespace."""
    missing = [key for key, value in elements.items() if not value or not str(value).strip()]
    if missing:
        print(f"⚠️ Please fill in these storytelling fields: {', '.join(missing)}")
    else:
        print("✅ Story scaffold is ready — every element is filled in.")
    return elements


def save_last_fig(filename, fig=None, dpi=300):
    """Save the latest Matplotlib figure with consistent export settings."""
    output_path = Path.cwd() / filename
    output_path.parent.mkdir(parents=True, exist_ok=True)
    if fig is None:
        fig = plt.gcf()
    if fig and getattr(fig, "axes", None):
        fig.savefig(output_path, dpi=dpi, bbox_inches="tight")
        print(f"💾 Saved figure to {output_path}")
    else:
        print("⚠️ No figure detected to save.")
    return output_path

baseline_style()


## Step 1 · Load and preview both indicators
**Micro-task**: bring in PM₂.₅ and GDP datasets, then inspect the first few rows to understand the wide format.

In [None]:

data_dir = Path.cwd() / "data"
pm_df = pd.read_csv(data_dir / "pm25_exposure.csv")
gdp_df = pd.read_csv(data_dir / "gdp_per_country.csv")

quick_peek(pm_df, expected_columns=["Country Name", "Country Code", "Indicator Name"], label="PM₂.₅ raw table")
quick_peek(gdp_df, expected_columns=["Country Name", "Country Code", "Indicator Name"], label="GDP raw table")
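To make the "wide format" concrete, here is a minimal sketch on a toy table. The country names, codes, and values are invented for illustration and are not taken from the World Bank files. It shows how `pd.melt` would reshape one-column-per-year data into one row per country-year. Today's loops only need a single year, so the notebook slices a column rather than melting.

```python
import pandas as pd

# Toy stand-in for a wide indicator table: one row per country, one column per year.
# All names and numbers are invented for illustration.
wide = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "Country Code": ["AAA", "BBB"],
    "2018": [10.0, 30.0],
    "2019": [9.5, 31.0],
})

# Long (tidy) layout: one row per country-year pair.
long = wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="Value",
)
print(long)
```

Selecting `wide[["Country Name", "Country Code", "2019"]]` gives the same 2019 values as filtering `long` to `Year == "2019"`, which is why slicing is enough for a single-year scatter.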


### Self-diagnostic: indicator sanity checks
Students verify they’re pulling the correct metrics before slicing years.

In [None]:

print("PM₂.₅ indicator unique values:", pm_df["Indicator Name"].unique()[:1])
print("GDP indicator unique values:", gdp_df["Indicator Name"].unique()[:1])
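Printing the first unique value confirms what the first row says, but not that every row shares it. A stricter variant might look like the sketch below. `assert_single_indicator` is a hypothetical helper name and is not part of the notebook's shared helpers. The toy table uses an invented indicator label.

```python
import pandas as pd


def assert_single_indicator(df, column="Indicator Name"):
    """Return the only indicator name in `column`, or raise if there are several."""
    names = df[column].dropna().unique()
    if len(names) != 1:
        raise ValueError(
            f"Expected exactly one indicator, found {len(names)}: {list(names)}"
        )
    return names[0]


# Toy table with an invented indicator label.
toy = pd.DataFrame({"Indicator Name": ["Example indicator"] * 3})
print(assert_single_indicator(toy))
```

With the real tables, the call would be `assert_single_indicator(pm_df)` and `assert_single_indicator(gdp_df)`.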


## Step 2 · Slice the 2019 columns and tidy names
This loop reinforces selecting columns, renaming, and dropping aggregates.

In [None]:

YEAR = "2019"
pm_2019 = pm_df[["Country Name", "Country Code", YEAR]].rename(columns={YEAR: "PM25"})
gdp_2019 = gdp_df[["Country Name", "Country Code", YEAR]].rename(columns={YEAR: "GDP_per_capita"})

# Coerce to numeric first: non-numeric strings become NaN, and the dropna that follows removes them.
pm_2019 = (
    pm_2019
    .assign(PM25=lambda d: pd.to_numeric(d["PM25"], errors="coerce"))
    .dropna(subset=["PM25"])
)
gdp_2019 = (
    gdp_2019
    .assign(GDP_per_capita=lambda d: pd.to_numeric(d["GDP_per_capita"], errors="coerce"))
    .dropna(subset=["GDP_per_capita"])
)

quick_peek(pm_2019, expected_columns=["Country Name", "Country Code", "PM25"], label="PM₂.₅ (2019)")
quick_peek(gdp_2019, expected_columns=["Country Name", "Country Code", "GDP_per_capita"], label="GDP per capita (2019)")
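The order of `dropna` and `pd.to_numeric(..., errors="coerce")` matters whenever a column contains non-numeric placeholders. The sketch below uses an invented three-row table, with the string `".."` standing in for such a placeholder (an assumption for illustration), to show the difference.

```python
import pandas as pd

# Invented rows: one valid number, one placeholder string, one true missing value.
raw = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "2019": ["12.5", "..", None],
})

# Drop first, then coerce: the placeholder survives dropna and becomes a new NaN.
dropped_first = raw.dropna(subset=["2019"]).assign(
    value=lambda d: pd.to_numeric(d["2019"], errors="coerce")
)
print("NaN left after drop-then-coerce:", dropped_first["value"].isna().sum())

# Coerce first, then drop: every non-numeric entry is removed in one pass.
coerced_first = raw.assign(
    value=lambda d: pd.to_numeric(d["2019"], errors="coerce")
).dropna(subset=["value"])
print("Rows kept after coerce-then-drop:", len(coerced_first))
```

A later `dropna` on the merged table also catches leftover NaN values, so either order can work as long as one drop happens after the coercion.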


### Aggregates filter
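Many rows are regional or income-group aggregates rather than countries. The check below keeps three-letter alphabetic codes only. This is a format check: World Bank aggregate codes such as `WLD` (World) are also three letters, so compare the printed row counts with the number of countries you expect and exclude any remaining aggregates explicitly.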

In [None]:

def is_iso(code: str) -> bool:
    return isinstance(code, str) and len(code) == 3 and code.isalpha()

pm_countries = pm_2019[pm_2019["Country Code"].apply(is_iso)]
gdp_countries = gdp_2019[gdp_2019["Country Code"].apply(is_iso)]

print(f"PM₂.₅ country rows: {len(pm_countries):,}")
print(f"GDP country rows: {len(gdp_countries):,}")
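If the row counts look too high, one option is to pair the format check with an explicit exclusion set. The sketch below assumes a hand-written, deliberately incomplete set of World Bank aggregate codes. `AGGREGATE_CODES` and `is_country` are hypothetical names, and the set should be verified against the World Bank country metadata before real use.

```python
# Illustrative and incomplete: a few World Bank aggregate codes (regions and income groups).
# Verify and extend this set against the official country metadata before relying on it.
AGGREGATE_CODES = {
    "WLD", "AFE", "AFW", "ARB", "EAS", "ECS", "EUU",
    "HIC", "LIC", "LMC", "MIC", "UMC", "SSF", "OED",
}


def is_country(code) -> bool:
    """True for three-letter alphabetic codes that are not in the aggregate set."""
    return (
        isinstance(code, str)
        and len(code) == 3
        and code.isalpha()
        and code not in AGGREGATE_CODES
    )


codes = ["FRA", "WLD", "KEN", "HIC", None, "X1"]
kept = [c for c in codes if is_country(c)]
print(kept)  # → ['FRA', 'KEN']
```

Applied to the notebook's tables, this would read `pm_2019[pm_2019["Country Code"].apply(is_country)]`.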


## Step 3 · Merge, enrich, and add quick diagnostics
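Join on country name and ISO code, drop missing values, and classify countries into GDP-per-capita quartiles as a storytelling aid. The tier labels describe quartiles of this sample, not the World Bank's official income groups.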

In [None]:

merged = pm_countries.merge(
    gdp_countries,
    on=["Country Name", "Country Code"],
    how="inner",
)
merged = merged.dropna(subset=["PM25", "GDP_per_capita"]).copy()

merged["IncomeTier"] = pd.qcut(merged["GDP_per_capita"], 4, labels=["Low", "Lower-middle", "Upper-middle", "High"], duplicates="drop")
expect_rows_between(merged, 150, 200, label="country matches")
quick_peek(merged, expected_columns=["Country Name", "Country Code", "PM25", "GDP_per_capita", "IncomeTier"], label="Merged tidy table")
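An inner join silently discards rows that appear in only one table. To see which countries would be lost, a quick audit with an outer join and pandas' `indicator=True` option can help. The sketch below uses invented codes and values and joins on the code column only.

```python
import pandas as pd

# Invented toy tables: "CCC" has no GDP row and "DDD" has no PM2.5 row.
pm = pd.DataFrame({"Country Code": ["AAA", "BBB", "CCC"], "PM25": [10.0, 25.0, 40.0]})
gdp = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "DDD"],
    "GDP_per_capita": [50000.0, 8000.0, 1200.0],
})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
audit = pm.merge(gdp, on="Country Code", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["Country Code", "_merge"]])
```

Rows marked `left_only` or `right_only` are the ones an inner join would drop, which makes them a useful list to review before trusting the merged row count.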


### Progress anchor
The reference scatter reminds students what success looks like.

In [None]:
# width belongs to Image, not display, for the size to take effect
display(Image(filename=str(Path.cwd() / 'plots' / 'day03_solution_plot.png'), width=420))

## Step 4 · Story-first chart checklist

In [None]:

TITLE = "Wealthier Countries Breathe Cleaner Air — With Exceptions"
SUBTITLE = "PM₂.₅ exposure vs. GDP per capita (log scale), countries with available 2019 data"
ANNOTATION = "Middle-income countries often face the highest pollution as industrialization accelerates."
SOURCE = "World Bank WDI (PM₂.₅ exposure & GDP per capita, 2019 values; downloaded January 2024)"
UNITS = "PM₂.₅ (µg/m³) vs. GDP per capita (current USD)"
ACCESSIBILITY_NOTES = "Log x-axis labeled every power of ten; colorblind-safe palette for income tiers; hover text supplies country names."

validate_story_elements({
    "TITLE": TITLE,
    "SUBTITLE": SUBTITLE,
    "ANNOTATION": ANNOTATION,
    "SOURCE": SOURCE,
    "UNITS": UNITS,
    "ACCESSIBILITY_NOTES": ACCESSIBILITY_NOTES,
})


## Step 5 · Build the interactive scatter with Plotly Express
Plotly gives hover labels and pan/zoom, while helper text and annotations cue interpretation.

In [None]:

import plotly.express as px

fig = px.scatter(
    merged,
    x="GDP_per_capita",
    y="PM25",
    color="IncomeTier",
    hover_name="Country Name",
    log_x=True,
    labels={"GDP_per_capita": "GDP per capita (USD, log scale)", "PM25": "PM₂.₅ exposure (µg/m³)"},
    color_discrete_sequence=px.colors.qualitative.Safe,
    title=f"{TITLE}<br><sup>{SUBTITLE}</sup>",
)
fig.update_traces(marker=dict(size=12, opacity=0.7, line=dict(width=0.5, color="#333")))
fig.update_layout(
    legend_title_text="Income tier",
    margin=dict(l=40, r=40, t=90, b=40),
    template="plotly_white",
)
# Add both notes with add_annotation: passing annotations=[...] to update_layout
# would replace any annotation added earlier.
fig.add_annotation(
    x=3.6,  # log10(4000) ≈ 3.6; on a log axis, annotation x is given as log10 of the data value
    y=60,
    text=ANNOTATION,
    showarrow=True,
    arrowcolor="#333",
    bgcolor="rgba(255,255,255,0.9)",
)
fig.add_annotation(
    text=f"Source: {SOURCE} | Notes: {ACCESSIBILITY_NOTES}",
    xref="paper",
    yref="paper",
    x=0,
    y=-0.18,
    showarrow=False,
    font=dict(size=11, color="#4f4f4f"),
    align="left",
)
fig.show()
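Plotly positions annotations on a log-scaled axis in log10 units, so a note meant for a GDP of 4,000 USD needs an x coordinate of about 3.6 rather than 4000. A minimal sketch of that conversion follows. `log_axis_x` is a hypothetical helper name, and the annotation text is a placeholder.

```python
import math


def log_axis_x(value: float) -> float:
    """Convert a data value to the coordinate used on a log10-scaled axis."""
    if value <= 0:
        raise ValueError("Log axes can only display positive values.")
    return math.log10(value)


# Keyword arguments that could be passed on as fig.add_annotation(**annotation).
annotation = dict(x=log_axis_x(4000), y=60, text="Example note", showarrow=True)
print(round(annotation["x"], 3))  # → 3.602
```

The y coordinate stays in data units because only the x-axis is logarithmic in this chart.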


### Export checkpoint

In [None]:

plots_dir = Path.cwd() / "plots"
plots_dir.mkdir(parents=True, exist_ok=True)
try:
    fig.write_image(str(plots_dir / "day03_solution_plot.png"))
    print("💾 Saved Plotly figure to", plots_dir / "day03_solution_plot.png")
except Exception as exc:
    print("⚠️ Plot export skipped:", exc)


## Step 6 · Reflect on drivers, uncertainty, and ethics
- **Claim → Evidence → Visual → Takeaway**:
  - **Claim**: Cleaner air generally correlates with higher income, yet some wealthy countries remain high-pollution outliers.
  - **Evidence**: Log-scale scatter slopes downward, with middle-income clusters around 40–70 µg/m³; annotations flag the industrialization pinch point.
  - **Visual**: Interactive scatter plot with income tiers, annotation, and accessible styling.
  - **Takeaway**: Economic growth can finance cleaner air, but policy choices matter—there’s nothing automatic about the transition.
- **Limitations**: GDP per capita misses inequality; PM₂.₅ averages hide urban hot spots.
- **Potential misreads**: Log axes can confuse; remind viewers that equal spacing means orders of magnitude.
- **Next questions**: Which policies helped high-income countries decouple growth from pollution? How do population-weighted exposures differ from national averages?
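
To put a number on "generally correlates", a rank correlation is one reasonable option. It is unaffected by the log transform on the x-axis and is less sensitive to outliers than Pearson's r. The sketch below computes Spearman's rho as the Pearson correlation of ranks on an invented five-row table. The values are illustrative only and say nothing about the real data.

```python
import pandas as pd

# Invented values for illustration; replace `toy` with the merged table in practice.
toy = pd.DataFrame({
    "GDP_per_capita": [800.0, 2500.0, 9000.0, 30000.0, 60000.0],
    "PM25": [45.0, 60.0, 35.0, 12.0, 8.0],
})

# Spearman's rho equals the Pearson correlation of the ranked values.
rho = toy["GDP_per_capita"].rank().corr(toy["PM25"].rank())
print(round(rho, 2))  # → -0.9
```

A strongly negative rho would support the claim of a downward relationship, but like the scatter itself it describes association only, not causation.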

## Process quality checklist
✅ Loaded and inspected both datasets • ✅ Filtered 2019 country rows with diagnostics • ✅ Built story scaffold • ✅ Created an accessible interactive scatter • ✅ Documented limitations, uncertainty, and ethics.