## 🔗 Open This Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DavidLangworthy/ds4s/blob/master/Day%203_%20Air%20Quality%20%26%20Health.ipynb)

# 😷 Day 3 – Air Quality, Health, and Wealth

Today we connect economic development to air pollution exposure. Students learn to **join datasets**, work in tidy long form, and build an interactive scatterplot that foregrounds uncertainty and ethical interpretation.

> **Teacher sidebar — pacing & differentiation**  
> • Timing: ~50 minutes.  
> • Suggested checkpoints: after the join (confirm the merged row count is what you expect) and after the log-axis plot (confirm no zero or negative GDP values reached the log scale).  
> • Stretch ideas: compute regional medians, add WHO guideline reference lines, or capture country annotations for storytelling practice.

## 🗺️ Roadmap for Today

Loop | Focus | What success looks like
--- | --- | ---
0 | Story scaffold | Claim + annotation drafted
1 | Load slices | 2019 PM₂.₅ + GDP columns isolated
2 | Join & clean | Nulls handled, units clarified
3 | Diagnostics | Shape, ranges, correlations checked
4 | Visualize & interpret | Interactive scatter with log x-axis + ethics discussion

## 🗂️ Data Cards — World Bank Air Quality & GDP

- **PM₂.₅ exposure**: World Bank World Development Indicators (indicator EN.ATM.PM25.MC.M3), mean population-weighted annual exposure.  
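- **GDP per capita**: World Bank WDI, described here as constant 2015 USD. Note that the indicator code cited in this notebook, NY.GDP.PCAP.CD, is the current-USD series; the constant 2015 USD series is NY.GDP.PCAP.KD. Check the `Indicator Code` column of the file to confirm which one you have.  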
- **Temporal coverage**: 1990 – 2023 (annual).  
- **Units**: PM₂.₅ in micrograms per cubic meter (µg/m³); GDP per capita in USD.  
- **Method notes**: PM₂.₅ estimates blend satellite retrievals with ground stations; GDP values are inflation-adjusted.  
- **Caveats**: Missing data for small states; pollution exposure within countries is uneven.  
- **Update cadence**: Annual (usually September).

> 🎯 **Integrity cue**: Avoid implying causation. Highlight that income correlates with cleaner air, but policy, geography, and industrial mix mediate the relationship.

In [None]:
# Shared utilities for the DS4S course notebooks
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import plotly.express as px

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({
    'figure.dpi': 120,
    'axes.titlesize': 16,
    'axes.labelsize': 13,
    'axes.titlepad': 12,
    'figure.figsize': (10, 5),
})


def load_csv(path: Path, **read_kwargs) -> pd.DataFrame:
    '''Load a CSV and report the basic shape.'''
    df = pd.read_csv(path, **read_kwargs)
    print(f"✅ Loaded {path.name} with {df.shape[0]:,} rows and {df.shape[1]} columns")
    return df


def validate_columns(df: pd.DataFrame, required):
    '''Raise if any required column is absent from the DataFrame.'''
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    print(f"✅ Columns present: {', '.join(required)}")


def expect_rows_between(df: pd.DataFrame, low: int, high: int):
    '''Raise if the row count falls outside the inclusive range low-high.'''
    rows = df.shape[0]
    if not (low <= rows <= high):
        raise ValueError(f"Row count {rows} outside expected range {low}-{high}")
    print(f"✅ Row count {rows} within expected {low}-{high}")


def quick_snapshot(df: pd.DataFrame, name: str, n: int = 3):
    '''Print shape, columns, and null counts, then show the first n rows.'''
    print(f"\n{name} snapshot → shape={df.shape}")
    print("Columns:", list(df.columns))
    print("Nulls:\n", df.isna().sum())
    display(df.head(n))


def ensure_story_elements(title: str, subtitle: str, annotation: str, source: str, units: str):
    '''Check that every storytelling field is non-empty and return them as a dict.'''
    fields = {
        'TITLE': title,
        'SUBTITLE': subtitle,
        'ANNOTATION': annotation,
        'SOURCE': source,
        'UNITS': units,
    }
    missing = [key for key, value in fields.items() if not str(value).strip()]
    if missing:
        raise ValueError(f"Please complete these storytelling fields: {', '.join(missing)}")
    print("✅ Story scaffold complete →", ", ".join(f"{k}: {v}" for k, v in fields.items()))
    return fields


def save_last_fig(filename: str):
    '''Save the current matplotlib figure into ./plots at 300 dpi.'''
    plots_dir = Path.cwd() / "plots"
    plots_dir.mkdir(parents=True, exist_ok=True)
    fig = plt.gcf()
    if not fig.axes:
        raise RuntimeError("Run the plotting cell before saving.")
    output_path = plots_dir / filename
    fig.savefig(output_path, dpi=300, bbox_inches='tight')
    print(f"📁 Saved figure to {output_path}")


def save_plotly_fig(fig, filename: str):
    '''Save a Plotly figure into ./plots as a standalone HTML file.'''
    plots_dir = Path.cwd() / "plots"
    plots_dir.mkdir(parents=True, exist_ok=True)
    output_path = plots_dir / filename
    fig.write_html(str(output_path))
    print(f"📁 Saved interactive figure to {output_path}")

## 🔁 Loop 0 — Story scaffold (3 min)

In [None]:
TITLE = "Wealthier Countries Breathe Cleaner Air"
SUBTITLE = "PM₂.₅ exposure vs GDP per capita (2019)"
ANNOTATION = "WHO 2005 guideline: keep annual PM₂.₅ under 10 µg/m³ (tightened to 5 µg/m³ in 2021); only a few countries achieve it."
SOURCE = "World Bank WDI (EN.ATM.PM25.MC.M3 & NY.GDP.PCAP.CD, downloaded 2024)"
UNITS = "PM₂.₅ (µg/m³) vs GDP per capita (2015 USD)"

story_fields = ensure_story_elements(TITLE, SUBTITLE, ANNOTATION, SOURCE, UNITS)

## 🔁 Loop 1 — Load the 2019 slices (8 min)

Focus on a single, recent year to keep the story tight. Extract the columns we need and validate shapes.

In [None]:
data_dir = Path.cwd() / "data"
pm25 = load_csv(data_dir / "pm25_exposure.csv")
gdp = load_csv(data_dir / "gdp_per_country.csv")

required_cols = ["Country Name", "Country Code", "2019"]
validate_columns(pm25, required_cols + ["Indicator Name", "Indicator Code"])
validate_columns(gdp, required_cols + ["Indicator Name", "Indicator Code"])

pm2019 = pm25[required_cols].rename(columns={"2019": "PM25"})
gdp2019 = gdp[required_cols].rename(columns={"2019": "GDP_per_capita"})
expect_rows_between(pm2019.dropna().reset_index(drop=True), 200, 250)

In [None]:
# Pass the full frames so the reported shape and null counts cover every row, not just the first five.
quick_snapshot(pm2019, name="PM₂.₅ 2019", n=5)
quick_snapshot(gdp2019, name="GDP per capita 2019", n=5)
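
The World Bank files arrive in wide form, with one column per year, while the lesson goal mentions tidy long form. The sketch below shows one way to reshape such a table with `pandas.melt`. The three-row table, its country labels, and its values are invented for illustration and are not taken from the course data.

```python
import pandas as pd

# Hypothetical wide-format table shaped like a World Bank export (values invented).
wide = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "Country Code": ["AAA", "BBB", "CCC"],
    "2018": [12.0, 35.5, 8.1],
    "2019": [11.4, 36.2, None],
})

# melt() turns each year column into rows: one row per country-year observation.
long = wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="PM25",
)
long["Year"] = long["Year"].astype(int)

# In long form, picking a single year is an ordinary row filter.
pm2019_long = long.loc[long["Year"] == 2019].dropna(subset=["PM25"])
print(long.shape)
print(pm2019_long)
```

Selecting the `"2019"` column directly, as the cell above does, is the shortcut for a single year; the long form pays off once several years are compared.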

## 🔁 Loop 2 — Join and clean (8 min)

Merge on country name and country code, drop nulls, and compute helper columns for storytelling (e.g., income groups via quantiles).

In [None]:
merged = (
    pm2019.merge(gdp2019, on=["Country Name", "Country Code"], how="inner")
    .dropna(subset=["PM25", "GDP_per_capita"])
)
merged["GDP_per_capita"] = pd.to_numeric(merged["GDP_per_capita"], errors="coerce")
merged["PM25"] = pd.to_numeric(merged["PM25"], errors="coerce")
merged = merged.dropna(subset=["PM25", "GDP_per_capita"])  # second pass after coercion
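# .copy() makes the filtered frame independent, so adding columns below does not raise SettingWithCopyWarning.
merged = merged.loc[(merged["PM25"] > 0) & (merged["GDP_per_capita"] > 0)].copy()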
merged["IncomeGroup"] = pd.qcut(merged["GDP_per_capita"], q=4, labels=["Q1 Lowest", "Q2", "Q3", "Q4 Highest"])
quick_snapshot(merged.sample(5, random_state=42), name="Merged sample")
print("Countries retained:", merged.shape[0])
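
An inner join silently drops any country that appears in only one file, so it can help to audit the join before trusting the row count. The sketch below is one possible check using the `indicator` and `validate` parameters of `DataFrame.merge`; the two small frames and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical slices (values invented): "CCC" has no GDP row, "DDD" has no PM2.5 row.
pm_demo = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "PM25": [10.2, 41.0, 7.5],
})
gdp_demo = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "DDD"],
    "GDP_per_capita": [52000.0, 1800.0, 9400.0],
})

# how="outer" keeps every code from both sides; indicator=True adds a "_merge"
# column saying where each row came from; validate="one_to_one" raises a
# MergeError if a code is duplicated in either frame.
audit = pm_demo.merge(
    gdp_demo,
    on="Country Code",
    how="outer",
    indicator=True,
    validate="one_to_one",
)

print(audit["_merge"].value_counts())
unmatched = audit.loc[audit["_merge"] != "both", "Country Code"].tolist()
print("Codes an inner join would drop:", unmatched)
```

Running the same audit on `pm2019` and `gdp2019` would list which countries are lost in the merge, which is worth a sentence in any caveat about missing data.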

## 🔁 Loop 3 — Diagnostics (6 min)

Check distributions and correlations to anticipate the story your scatterplot will show.

In [None]:
summary = merged[["PM25", "GDP_per_capita"]].describe(percentiles=[0.25, 0.5, 0.75])
print(summary)
correlation = merged["PM25"].corr(merged["GDP_per_capita"], method="spearman")
print("Spearman correlation (PM₂.₅ vs GDP): {:.2f}".format(correlation))
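
Spearman correlation works on ranks, which is why it suits a relationship that will later be plotted on a log axis: taking the log of GDP does not change the ranking, so the coefficient stays the same, whereas Pearson's does not. The sketch below illustrates this with six invented data points; none of the numbers come from the World Bank files.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen only to illustrate the behaviour of the two coefficients.
demo = pd.DataFrame({
    "GDP_per_capita": [500.0, 1500.0, 4000.0, 12000.0, 45000.0, 80000.0],
    "PM25": [55.0, 48.0, 30.0, 22.0, 9.0, 11.0],
})
demo["log_GDP"] = np.log10(demo["GDP_per_capita"])

# Rank-based: identical whether GDP is raw or log-transformed.
spearman_raw = demo["PM25"].corr(demo["GDP_per_capita"], method="spearman")
spearman_log = demo["PM25"].corr(demo["log_GDP"], method="spearman")

# Pearson measures linear association, so the transform changes it.
pearson_raw = demo["PM25"].corr(demo["GDP_per_capita"])
pearson_log = demo["PM25"].corr(demo["log_GDP"])

print(f"Spearman raw={spearman_raw:.3f}, log={spearman_log:.3f}")
print(f"Pearson  raw={pearson_raw:.3f}, log={pearson_log:.3f}")
```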

## 🔁 Loop 4 — Visualize & interpret (12 min)

Build an interactive scatterplot with log-scaled GDP, WHO guideline reference, and annotations for contrasting cases.

In [None]:
fig = px.scatter(
    merged,
    x="GDP_per_capita",
    y="PM25",
    color="IncomeGroup",
    hover_name="Country Name",
    color_discrete_sequence=px.colors.qualitative.Safe,
    labels={"GDP_per_capita": "GDP per Capita (2015 USD, log scale)", "PM25": "PM₂.₅ Exposure (µg/m³)"},
)
fig.update_traces(marker=dict(size=9, opacity=0.75, line=dict(width=0.5, color="white")))
fig.update_layout(
    title=dict(text=f"{TITLE}<br><sup>{SUBTITLE}</sup>", x=0.05),
    legend_title_text="Income quartile (2019)",
    template="plotly_white",
)
fig.update_xaxes(type="log", ticks="outside", showgrid=True)
fig.add_hline(y=10, line_dash="dash", line_color="#d62728", annotation_text="WHO 2005 guideline (10 µg/m³)")
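# Plotly places annotations on a log axis in log10 units, so convert GDP before positioning them.
clean_case = merged.nsmallest(1, "PM25").iloc[0]
fig.add_annotation(
    x=np.log10(clean_case["GDP_per_capita"]),
    y=clean_case["PM25"],
    text=f"{clean_case['Country Name']}: {clean_case['PM25']:.1f} µg/m³",
    showarrow=True,
    arrowhead=2,
    ax=40,
    ay=-40,
)
# Label the highest-exposure country by name; a single year of data cannot show growth.
dirty_case = merged.nlargest(1, "PM25").iloc[0]
fig.add_annotation(
    x=np.log10(dirty_case["GDP_per_capita"]),
    y=dirty_case["PM25"],
    text=f"{dirty_case['Country Name']}: {dirty_case['PM25']:.1f} µg/m³",
    showarrow=True,
    arrowhead=2,
    ax=-60,
    ay=40,
)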
fig.update_layout(margin=dict(l=40, r=40, t=90, b=50))
fig
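
On a log axis, equal distances correspond to equal ratios rather than equal differences, which is what lets countries at 1,000 USD and 100,000 USD share one readable chart. The sketch below shows the arithmetic with NumPy only; the GDP values are round numbers picked for illustration.

```python
import numpy as np

# Three hypothetical GDP levels, each ten times the previous one.
gdp_values = np.array([1_000.0, 10_000.0, 100_000.0])

# A log10 axis places each value at its base-10 logarithm.
positions = np.log10(gdp_values)

# Each tenfold jump moves the same distance along the axis.
steps = np.diff(positions)
print("Axis positions:", positions)
print("Distance between neighbours:", steps)

# On a linear axis the same three values would be 9,000 and 90,000 apart.
linear_steps = np.diff(gdp_values)
print("Linear distances:", linear_steps)
```

The same conversion explains why Plotly's annotation reference asks for the log of the value when positioning an annotation on a log axis: the annotation coordinate is the axis position, not the raw GDP figure.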

In [None]:
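# color="IncomeGroup" makes Plotly Express draw one trace per quartile, so count points across all traces.
points_rendered = sum(len(trace.x) for trace in fig.data)
print("Scatter points rendered:", points_rendered)
assert points_rendered == merged.shape[0]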

In [None]:
from IPython.display import Markdown

claim = "Higher national income often coincides with cleaner air, but outliers reveal policy gaps."
evidence = (
    "GDP per capita and PM₂.₅ exposure show a negative correlation (Spearman ≈ {:.2f}). Many low-income countries exceed the WHO guideline.".format(correlation)
)
visual = "Plotly scatter with log-scaled GDP, WHO guideline line, and annotated outliers."
takeaway = "Use visuals to discuss equity: who can afford pollution controls, and whose lungs pay the price?"
Markdown(
    f"""
| Claim | Evidence | Visual | Takeaway |
| --- | --- | --- | --- |
| {claim} | {evidence} | {visual} | {takeaway} |
"""
)
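
The evidence statement says many low-income countries exceed the guideline; that claim can be backed with a number rather than an impression. The sketch below computes the share of countries above the threshold within each income quartile. It uses eight invented rows, and `WHO_LIMIT` is a name introduced here for the 10 µg/m³ value the notebook already uses.

```python
import pandas as pd

WHO_LIMIT = 10  # µg/m³, the guideline value used elsewhere in this notebook

# Hypothetical countries (values invented), two per quartile after binning.
demo = pd.DataFrame({
    "GDP_per_capita": [600.0, 900.0, 2500.0, 4000.0, 9000.0, 15000.0, 42000.0, 65000.0],
    "PM25": [48.0, 39.0, 31.0, 27.0, 18.0, 9.5, 7.0, 12.0],
})
demo["IncomeGroup"] = pd.qcut(
    demo["GDP_per_capita"], q=4, labels=["Q1 Lowest", "Q2", "Q3", "Q4 Highest"]
)

# A boolean column averages to the fraction of True values within each group.
demo["above_guideline"] = demo["PM25"] > WHO_LIMIT
share = demo.groupby("IncomeGroup", observed=True)["above_guideline"].mean()
print(share)
```

Applying the same two lines to `merged` would give a per-quartile share to quote in the Evidence column.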

## 💾 Save the figure for the teacher dashboard

In [None]:
save_plotly_fig(fig, "day03_solution_plot.html")

## ✅ Exit Ticket

- Which country surprised you on the chart and why?  
- How would you explain the log scale to a classmate?  
- What additional data would help you argue for cleaner air policies?