## 🔗 Open This Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DavidLangworthy/ds4s/blob/master/Day%203%20%E2%80%93%20Air%20Quality%20%26%20Income.ipynb)

# 🌬️ Day 3 – Air Quality and Economic Development
### Connect health impacts to income levels with an interactive scatter

We keep the learn → do → check cadence, now using **Plotly Express** for an interactive view of fine particulate pollution (PM₂.₅) versus GDP per capita.

---

## 🧠 Learning Rhythm
- 🔁 Six loops: confirm setup, load pollution data, load economic data, merge and diagnose, visualize, interpret.
- 🧪 Column checks catch spelling/capitalization issues that often break merges.
- 🎯 Micro-rubric rewards: clean joins, log-scale reasoning, and thoughtful annotation.
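
The column checks mentioned in the list above can also suggest a near-match when a header differs only by case or stray whitespace. The sketch below is illustrative: `suggest_columns` is a hypothetical helper, not one of this notebook's utilities, and the toy frame uses made-up values.

```python
from __future__ import annotations

import pandas as pd


def suggest_columns(df: pd.DataFrame, required: list[str]) -> dict[str, str | None]:
    """Map each required name to an actual column, ignoring case and outer whitespace.

    Hypothetical helper for illustration; maps to None when nothing matches.
    """
    normalized = {str(col).strip().lower(): col for col in df.columns}
    return {name: normalized.get(name.strip().lower()) for name in required}


# Toy frame with the kind of header drift that breaks merges (values are made up).
toy = pd.DataFrame({"country name ": ["Alpha"], "COUNTRY CODE": ["AAA"], "2019": [12.0]})
matches = suggest_columns(toy, ["Country Name", "Country Code", "2019", "Region"])
print(matches)
```

A `None` entry means the column is truly absent. A non-identical match points at a header that needs renaming before the merge.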

> **Teacher Sidecar**: Plan ~55 minutes. Students who skip the diagnostic often end up with mismatched country codes; redirect them to rerun Loop 4 (merge and diagnose).

## 📇 Data Card — World Bank PM₂.₅ Exposure & GDP
- **Source**: World Bank open data (downloaded 2023).
- **Temporal coverage**: 1990–2021 (we focus on 2019 cross-section).
- **Metrics**: PM₂.₅ annual exposure (µg/m³) and GDP per capita (current US$).
- **Last updated**: October 2023 pull.
- **Caveats**: Some small states lack GDP data; pollution values above WHO guideline (5 µg/m³) indicate health risk.
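
The guideline caveat can be turned into a quick flag column. This is a minimal sketch using made-up exposure values and placeholder country codes, not the World Bank files. The column name `PM25` mirrors the one created later in the notebook, and `above_guideline` and `times_guideline` are illustrative names.

```python
import pandas as pd

WHO_ANNUAL_GUIDELINE = 5.0  # µg/m³, the guideline value quoted in the data card

# Made-up exposure values for illustration only.
toy = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "PM25": [4.2, 18.5, 61.0],
})

# Flag rows above the guideline and express exposure as a multiple of it.
toy["above_guideline"] = toy["PM25"] > WHO_ANNUAL_GUIDELINE
toy["times_guideline"] = (toy["PM25"] / WHO_ANNUAL_GUIDELINE).round(1)
share_above = toy["above_guideline"].mean()

print(toy)
print(f"Share above guideline: {share_above:.0%}")
```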

## 🧵 Story Scaffold (Claim → Evidence → Visual → Takeaway)
- **Claim**: Wealthier countries typically experience lower PM₂.₅ exposure, but notable exceptions remain.
- **Evidence to gather**: Joined dataset of GDP per capita and PM₂.₅ for 2019 with sufficient country coverage.
- **Visual plan**: Log-scale scatter with an annotation that calls out the outliers.
- **Takeaway**: Economic growth correlates with cleaner air, yet policy choices create exceptions worth noting.


In [None]:

from __future__ import annotations

from pathlib import Path
from typing import Any, Mapping, Sequence

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display

DATA_DIR = Path.cwd() / "data"

sns.set_theme(style="whitegrid", font_scale=1.1)
plt.rcParams.update({
    "axes.titlesize": 16,
    "axes.labelsize": 13,
    "axes.grid": True,
    "figure.figsize": (11, 6),
    "figure.dpi": 120,
})

def ping_environment(packages: Mapping[str, object]) -> None:
    """Print library versions so teachers can confirm the runtime."""
    for label, module in packages.items():
        version = getattr(module, "__version__", "built-in")
        print(f"{label}: {version}")
    print("Environment check complete ✅")

def load_data(file_name: str, /, **kwargs) -> pd.DataFrame:
    """Load a CSV from the shared data folder with a friendly status message."""
    path = DATA_DIR / file_name
    if not path.exists():
        raise FileNotFoundError(f"Expected data file at {path}")
    df = pd.read_csv(path, **kwargs)
    print(f"Loaded {file_name} → shape {df.shape}")
    return df

def validate_columns(df: pd.DataFrame, required: Sequence[str]) -> pd.DataFrame:
    """Raise ValueError if any required column is absent; return df unchanged."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    print(f"Columns validated ✅ {list(required)}")
    return df

def expect_rows_between(df: pd.DataFrame, lower: int, upper: int, label: str = "rows") -> pd.DataFrame:
    """Raise ValueError unless len(df) falls within [lower, upper]; return df unchanged."""
    n_rows = len(df)
    if not (lower <= n_rows <= upper):
        raise ValueError(
            f"Unexpected {label}: {n_rows} (expected between {lower} and {upper})"
        )
    print(f"{label.capitalize()} check ✅ {n_rows} (expected {lower}-{upper})")
    return df

def quick_peek(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Display a head preview and NA counts for formative assessment."""
    display(df.head(n))
    print("Null values per column:")
    print(df.isna().sum())
    return df

def ensure_metadata(**metadata: str) -> None:
    """Raise ValueError if any story metadata field is empty or whitespace-only."""
    blanks = [key for key, value in metadata.items() if not str(value).strip()]
    if blanks:
        raise ValueError(f"Please fill in metadata fields: {blanks}")
    print("Story metadata looks great ✅")

def annotate_source(ax: plt.Axes, *, source: str, units: str) -> plt.Axes:
    ax.text(
        0.0,
        -0.22,
        f"Source: {source}\nUnits: {units}",
        transform=ax.transAxes,
        ha="left",
        fontsize=10,
    )
    return ax

def _resolve_fig(fig: Any | None) -> Any:
    if fig is not None:
        return fig
    if plt.get_fignums():
        return plt.gcf()
    return None

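def save_last_fig(fig: Any | None, filename: str) -> Path:
    """Save a Matplotlib or Plotly figure under ./plots and return the written path.

    Plotly figures fall back to an HTML export if static image export fails.
    """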
    plots_dir = Path.cwd() / "plots"
    plots_dir.mkdir(parents=True, exist_ok=True)
    resolved = _resolve_fig(fig)
    if resolved is None:
        raise ValueError("No recent figure detected.")

    output_path = plots_dir / filename

    if hasattr(resolved, "savefig"):
        resolved.savefig(output_path, dpi=300, bbox_inches="tight")
        print(f"Saved figure to {output_path}")
        return output_path

    if hasattr(resolved, "write_image"):
        try:
            resolved.write_image(str(output_path))
            print(f"Saved figure to {output_path}")
            return output_path
        except Exception as exc:
            html_path = output_path.with_suffix(".html")
            resolved.write_html(str(html_path))
            print(f"Saved interactive figure to {html_path} (fallback: {exc})")
            return html_path

    raise ValueError("Don't know how to export this figure type.")


## 🔁 Loop 1 · Confirm the setup
*Goal: Ensure pandas and Plotly are ready before data work begins.*

In [None]:
import plotly
import plotly.express as px

# Pass the top-level plotly package so the real version is printed.
ping_environment({"pandas": pd, "plotly": plotly})
assert DATA_DIR.exists(), f"Data directory missing: {DATA_DIR}"
print(f"Data files available: {len(list(DATA_DIR.glob('*')))} items")

## 🔁 Loop 2 · Load PM₂.₅ exposure data
*Goal: Pull the air quality dataset and confirm expected structure.*

In [None]:
pm_raw = load_data("pm25_exposure.csv")
validate_columns(pm_raw, ["Country Name", "Country Code", "2019"])
pm_2019 = pm_raw[["Country Name", "Country Code", "2019"]].rename(columns={"2019": "PM25"})
expect_rows_between(pm_2019, 180, 220, label="countries in PM dataset")
quick_peek(pm_2019.sample(5, random_state=0))


## 🔁 Loop 3 · Load GDP per capita data
*Goal: Bring in the economic indicator with parallel checks.*

In [None]:
gdp_raw = load_data("gdp_per_country.csv")
validate_columns(gdp_raw, ["Country Name", "Country Code", "2019"])
gdp_2019 = gdp_raw[["Country Name", "Country Code", "2019"]].rename(columns={"2019": "GDP_per_capita"})
expect_rows_between(gdp_2019, 180, 220, label="countries in GDP dataset")
quick_peek(gdp_2019.sample(5, random_state=1))


## 🔁 Loop 4 · Merge and diagnose the combined table
*Goal: Join on shared identifiers, drop incomplete rows, and sanity-check ranges.*

In [None]:
combined = pm_2019.merge(gdp_2019, on=["Country Name", "Country Code"], how="inner")
pre_drop = len(combined)
combined = combined.dropna()
print(f"Dropped {pre_drop - len(combined)} rows with missing values.")
expect_rows_between(combined, 150, 220, label="countries with both metrics")
print("PM₂.₅ range:", combined["PM25"].min(), "→", combined["PM25"].max())
print("GDP per capita range:", combined["GDP_per_capita"].min(), "→", combined["GDP_per_capita"].max())
assert combined["GDP_per_capita"].min() > 0, "GDP per capita must be positive for log scale."
combined_sorted = combined.sort_values("GDP_per_capita")
combined_sorted.head()
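
When the inner join returns fewer rows than expected, an outer join with pandas' `indicator=True` option shows which countries failed to match. This sketch uses made-up rows and joins on `Country Code` alone. That is an assumption for illustration; the cell above joins on both name and code.

```python
import pandas as pd

# Made-up rows; column names follow the notebook's pm_2019 / gdp_2019 frames.
pm_toy = pd.DataFrame({"Country Code": ["AAA", "BBB", "CCC"], "PM25": [10.0, 20.0, 30.0]})
gdp_toy = pd.DataFrame({"Country Code": ["AAA", "BBB", "DDD"], "GDP_per_capita": [1000.0, 5000.0, 9000.0]})

# how="outer" keeps every row; indicator=True adds a _merge column naming the source table.
audit = pm_toy.merge(gdp_toy, on="Country Code", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["Country Code", "_merge"]])
```

Rows tagged `left_only` appear only in the pollution table, and rows tagged `right_only` appear only in the GDP table.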


## 🔁 Loop 5 · Visualize with story-first metadata
*Goal: Craft the Plotly scatter with subtitles, annotation, and accessible styling.*

In [None]:
TITLE = "Wealthier Nations Tend to Breathe Cleaner Air"
SUBTITLE = "2019 PM₂.₅ exposure vs. GDP per capita (World Bank)"
ANNOTATION = "Several oil-producing economies buck the trend—wealth does not guarantee clean air."
SOURCE = "World Bank Open Data (PM₂.₅ exposure & GDP per capita)"
UNITS = "PM₂.₅ in µg/m³; GDP per capita in current US$"

ensure_metadata(TITLE=TITLE, SUBTITLE=SUBTITLE, ANNOTATION=ANNOTATION, SOURCE=SOURCE, UNITS=UNITS)

fig = px.scatter(
    combined,
    x="GDP_per_capita",
    y="PM25",
    hover_name="Country Name",
    log_x=True,
    color_discrete_sequence=["#2ca02c"],
    opacity=0.75,
)
fig.update_traces(marker=dict(size=9, line=dict(color="#ffffff", width=0.5)))
fig.update_layout(
    title={"text": f"{TITLE}<br><sup>{SUBTITLE}</sup>", "x": 0.02, "xanchor": "left"},
    xaxis_title="GDP per capita (log scale, US$)",
    yaxis_title="PM₂.₅ exposure (µg/m³)",
    template="plotly_white",
    hovermode="closest",
)
fig.add_annotation(
    text=ANNOTATION,
    xref="paper",
    yref="paper",
    x=0.02,
    y=0.98,
    showarrow=False,
    align="left",
    bgcolor="rgba(255,255,255,0.85)",
    bordercolor="#333333",
    borderwidth=1,
    font=dict(size=12),
)
fig.add_annotation(
    text=f"Source: {SOURCE}<br>Units: {UNITS}",
    xref="paper",
    yref="paper",
    x=0.0,
    y=-0.18,
    showarrow=False,
    align="left",
    font=dict(size=11, color="#444444"),
)
fig.show()
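
The `log_x=True` setting matters because GDP per capita typically spans several orders of magnitude, so a linear axis crowds the lower-income countries into one corner. A small sketch with made-up values shows how a log axis spaces equal ratios evenly:

```python
import numpy as np

# Made-up GDP values, each ten times the previous one.
gdp = np.array([500.0, 5_000.0, 50_000.0])

linear_gaps = np.diff(gdp)         # distances on a linear axis
log_gaps = np.diff(np.log10(gdp))  # distances on a log10 axis

print("Linear gaps:", linear_gaps)
print("Log10 gaps:", log_gaps)
```

On the linear axis the second gap is ten times the first. On the log axis both gaps are about one unit, which means each tenfold increase in income takes the same horizontal distance.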


## 🔁 Loop 6 · Interpret and self-check
*Goal: Quantify a comparison and surface an exception for discussion.*

In [None]:
low_income = combined[combined["GDP_per_capita"] < 4000]["PM25"].mean()
high_income = combined[combined["GDP_per_capita"] > 40000]["PM25"].mean()
ratio = low_income / high_income
print(f"Mean PM₂.₅ for < $4k GDP per capita: {low_income:.1f} µg/m³")
print(f"Mean PM₂.₅ for > $40k GDP per capita: {high_income:.1f} µg/m³")
print(f"Exposure ratio (low/high income): {ratio:.1f}×")
assert ratio > 2, "Lower-income exposure should be at least twice as high on average."
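
The loop goal also asks for an exception to discuss. One possible approach, sketched here with made-up rows and placeholder names, is to rank the wealthy group by exposure. The $40k cut-off mirrors the threshold used above, and `GDP_FLOOR` and `exceptions` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the notebook's combined frame.
toy = pd.DataFrame({
    "Country Name": ["Alpha", "Beta", "Gamma", "Delta"],
    "GDP_per_capita": [1500.0, 45000.0, 60000.0, 52000.0],
    "PM25": [48.0, 9.0, 55.0, 7.5],
})

GDP_FLOOR = 40_000  # assumed cut-off for the high-income group

wealthy = toy[toy["GDP_per_capita"] > GDP_FLOOR]
# Among wealthy countries, the highest exposure is a candidate for a labelled annotation.
exceptions = wealthy.nlargest(1, "PM25")
print(exceptions[["Country Name", "GDP_per_capita", "PM25"]])
```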


### 🧾 Claim → Evidence → Visual → Takeaway (filled)
- **Claim**: Economic development generally correlates with lower PM₂.₅ exposure.
- **Evidence**: Average exposure among low-income countries is more than double that of high-income peers (see diagnostics above).
- **Visual**: Log-scale scatter with annotation for exceptions and consistent metadata.
- **Takeaway**: Policy decisions and fuel choices matter—wealth is helpful but not sufficient for clean air.

> **Limitation prompt**: PM₂.₅ exposure is annual; seasonal spikes and within-country inequalities need separate analysis.

---

### 💾 Save your work
Run the next cell to export the figure. It saves a static PNG when Plotly's image export is available and falls back to an interactive HTML file otherwise.


In [None]:
save_last_fig(fig, "day03_solution_plot.png")
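
To anticipate which file type the export will produce, it can help to check whether the `kaleido` package is installed, since Plotly's static image export usually depends on it. This is a rough sketch. The presence of the package does not guarantee that `write_image` will succeed.

```python
import importlib.util

# find_spec returns None when a package is not installed in the current environment.
has_kaleido = importlib.util.find_spec("kaleido") is not None

if has_kaleido:
    print("kaleido found: PNG export will be attempted first.")
else:
    print("kaleido not found: expect the HTML fallback.")
```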