## 🔗 Open This Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DavidLangworthy/ds4s/blob/master/days/day03/notebook/day03_starter.ipynb)

# 🫁 Day 3 – Pollution and Public Health
### Linking particulate exposure with national income levels

Today we blend two World Bank indicators to explore how fine particulate pollution (PM₂.₅) relates to a country's
economic resources. You'll practice joining datasets, running quick diagnostics, and designing a scatter plot that
communicates both correlation and uncertainty.

### 🗂️ Data card — World Bank PM₂.₅ exposure and GDP per capita
- **Sources:** World Bank Sustainable Development Indicators (downloaded 2024-12)
- **Temporal coverage:** 1990–2021 (annual) — today we focus on 2019 to avoid pandemic disruptions
- **Geography:** Countries and territories with consistent World Bank reporting
- **Units:**
  - PM₂.₅ exposure: population-weighted annual mean concentration (µg/m³)
  - GDP per capita: constant 2015 US dollars
- **Collection notes:** PM₂.₅ estimates come from satellite retrievals and ground monitors blended via statistical models
- **Caveats:** Some small states lack GDP data; pollution values may have large confidence intervals; values are annual averages, not peaks
- **Mindful design:** Use log scaling carefully and label axis units prominently to avoid misinterpretation.

### 1. Set up the environment

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from IPython.display import display

pd.options.display.float_format = "{:.2f}".format

In [None]:
# Shared helper utilities used throughout the week.
from __future__ import annotations

import warnings
from pathlib import Path
from typing import Iterable, Mapping

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def resolve_data_dir(max_up: int = 5) -> Path:
    """Locate the project-level ``data`` directory regardless of execution location."""
    here = Path.cwd()
    for _ in range(max_up + 1):
        candidate = here / "data"
        if candidate.exists():
            return candidate
        here = here.parent
    raise FileNotFoundError(
        "Could not find a 'data' directory relative to this notebook.\n"
        "If you are running in Colab, mount your drive or upload the data folder first."
    )


DATA_DIR = resolve_data_dir()
PROJECT_ROOT = DATA_DIR.parent
PLOTS_DIR = PROJECT_ROOT / "plots"
PLOTS_DIR.mkdir(parents=True, exist_ok=True)


def baseline_style() -> None:
    """Apply a consistent, high-contrast visual style that is colorblind-friendly."""
    sns.set_theme(style="whitegrid", context="talk", font_scale=0.9)
    plt.rcParams.update(
        {
            "figure.dpi": 120,
            "axes.titlesize": 16,
            "axes.labelsize": 13,
            "legend.fontsize": 11,
            "axes.titleweight": "semibold",
        }
    )


def load_data(filename: str | Path, **kwargs) -> pd.DataFrame:
    """Read a CSV file from the shared data directory and report its shape."""
    path = Path(filename)
    if not path.exists():
        path = DATA_DIR / filename
    df = pd.read_csv(path, **kwargs)
    print(f"Loaded {path.name} with shape {df.shape}.")
    return df


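def validate_columns(df: pd.DataFrame, required: Iterable[str], *, context: str = "") -> None:
    """Warn if any required column is missing; otherwise confirm the check passed."""
    required = list(required)  # materialize once so a generator is not exhausted
    missing = [col for col in required if col not in df.columns]
    if missing:
        warnings.warn(
            f"Missing expected columns {missing} in {context or 'dataframe'}.\n"
            "Double-check your renaming and loading steps before moving on."
        )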
    else:
        print(f"✅ Columns look good: {list(required)}")


def expect_rows_between(df: pd.DataFrame, low: int, high: int, *, label: str = "rows") -> None:
    """Warn if the row count falls outside the inclusive range [low, high]."""
    count = len(df)
    if not (low <= count <= high):
        warnings.warn(
            f"{label} check: expected between {low:,} and {high:,} but found {count:,}."
        )
    else:
        print(f"✅ {label} check: {count:,} rows is within the expected range.")


def quick_diagnose(df: pd.DataFrame, *, sample: int = 3) -> None:
    """Show the first ``sample`` rows and the null count for each column."""
    print("\nPreview of the current dataframe:")
    display(df.head(sample))
    print("\nNull values by column:")
    print(df.isna().sum())


def validate_story_fields(fields: Mapping[str, str]) -> None:
    """Warn about any narrative field that is empty or whitespace-only."""
    missing = [name for name, value in fields.items() if not str(value).strip()]
    if missing:
        warnings.warn(
            "The following story fields are blank: " + ", ".join(missing) +
            "\nFill them in so your chart has a clear narrative frame."
        )
    else:
        print("✅ Narrative checklist complete.")


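def save_last_fig(fig: plt.Figure | None, filename: str) -> Path | None:
    """Save ``fig`` (or the current figure if None) to PLOTS_DIR; return the path, or None if nothing was drawn."""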
    if fig is None:
        fig = plt.gcf()
    if fig and getattr(fig, "axes", None):
        output_path = PLOTS_DIR / filename
        fig.savefig(output_path, dpi=300, bbox_inches="tight")
        print(f"Saved figure to {output_path.relative_to(PROJECT_ROOT)}")
        return output_path
    warnings.warn("No matplotlib figure available to save yet.")
    return None
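
To see how the upward search in `resolve_data_dir` behaves, here is a minimal sketch that runs the same loop against a temporary folder tree. The name `find_data_dir` and its `start` parameter are hypothetical, introduced only so the search can begin somewhere other than `Path.cwd()`.

```python
import tempfile
from pathlib import Path


def find_data_dir(start: Path, max_up: int = 5) -> Path:
    """Hypothetical variant of resolve_data_dir that searches upward from `start`."""
    here = start
    for _ in range(max_up + 1):
        candidate = here / "data"
        if candidate.exists():
            return candidate
        here = here.parent
    raise FileNotFoundError(f"No 'data' directory within {max_up} levels of {start}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "data").mkdir()

    # Mimic a notebook that lives three folders below the project root.
    nested = root / "days" / "day03" / "notebook"
    nested.mkdir(parents=True)

    found = find_data_dir(nested)
    found_relative = found.relative_to(root).as_posix()
    print(found_relative)
```

Because the loop checks the starting folder first and then each parent in turn, the notebook can sit several levels below the project root and still find the shared `data` folder.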


### 2. Load and preview the indicators
We read both wide-form datasets and confirm the key identifier columns exist before subsetting to 2019.

In [None]:
pm25 = load_data("pm25_exposure.csv")
gdp = load_data("gdp_per_country.csv")

validate_columns(pm25, ["Country Name", "Country Code"])
validate_columns(gdp, ["Country Name", "Country Code"])
quick_diagnose(pm25.iloc[:, :6])
quick_diagnose(gdp.iloc[:, :6])
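
For readers new to the wide layout, the sketch below builds a tiny made-up table (the countries and values are illustrative, not real data) with one row per country and one column per year, then reshapes it to long form with `melt`. The output column names `year` and `value` are assumptions chosen for this example; today's analysis only needs the `"2019"` column, so the reshape is optional.

```python
import pandas as pd

# Made-up values in the wide layout: one row per country, one column per year.
wide = pd.DataFrame(
    {
        "Country Name": ["Country A", "Country B"],
        "Country Code": ["AAA", "BBB"],
        "2018": [30.0, 12.5],
        "2019": [28.0, None],
    }
)

# Year headers read from a CSV are strings, which is why "2019" is quoted when selecting.
year_cols = [col for col in wide.columns if col.isdigit()]

long = wide.melt(
    id_vars=["Country Name", "Country Code"],
    value_vars=year_cols,
    var_name="year",
    value_name="value",
)
long["year"] = long["year"].astype(int)
print(long)
```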

### 3. Slice the 2019 data and tidy the columns
We select the 2019 values, rename the columns, merge on country name and code, and drop rows that are missing either indicator.

In [None]:
pm25_2019 = (
    pm25[["Country Name", "Country Code", "2019"]]
    .rename(columns={"2019": "pm25_ug_m3"})
)
gdp_2019 = (
    gdp[["Country Name", "Country Code", "2019"]]
    .rename(columns={"2019": "gdp_per_capita_usd"})
)

merged = (
    pm25_2019
    .merge(gdp_2019, on=["Country Name", "Country Code"], how="inner")
    .dropna()
)
expect_rows_between(merged, 170, 190, label="countries with both metrics")
quick_diagnose(merged.head())
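
An inner merge followed by `dropna()` removes countries silently. One way to see what was lost is an outer merge with `indicator=True`, sketched here on made-up toy frames (the country codes and values are illustrative only). The `validate="one_to_one"` argument additionally raises an error if a country code appears more than once on either side.

```python
import pandas as pd

# Toy inputs: CCC has no GDP row, DDD has no PM2.5 row, BBB has a null GDP value.
pm25_toy = pd.DataFrame(
    {"Country Code": ["AAA", "BBB", "CCC"], "pm25_ug_m3": [30.0, 12.5, 45.0]}
)
gdp_toy = pd.DataFrame(
    {"Country Code": ["AAA", "BBB", "DDD"], "gdp_per_capita_usd": [1500.0, None, 40000.0]}
)

audit = pm25_toy.merge(
    gdp_toy,
    on="Country Code",
    how="outer",
    indicator=True,          # adds a _merge column: both / left_only / right_only
    validate="one_to_one",   # fail loudly on duplicate country codes
)
print(audit["_merge"].value_counts())

# Rows present in both tables, then rows that also have non-null values.
both = audit[audit["_merge"] == "both"].drop(columns="_merge")
complete = both.dropna()
print(f"{len(both)} matched, {len(complete)} with both indicators present")
```

Comparing `len(both)` with `len(complete)` separates countries lost to non-matching keys from those lost to missing values.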

### 4. Engineer helper fields for analysis
Compute log-scale values, quartile groups, and quick summary statistics to understand the distribution.

In [None]:
merged["log_gdp"] = np.log10(merged["gdp_per_capita_usd"])
merged["gdp_group"] = pd.qcut(
    merged["gdp_per_capita_usd"],
    q=4,
    labels=["Lowest income", "Lower-middle", "Upper-middle", "Highest income"],
)

summary_stats = merged[["pm25_ug_m3", "gdp_per_capita_usd"]].describe()
display(summary_stats)
print("Correlation (Pearson): {:.2f}".format(merged["pm25_ug_m3"].corr(merged["gdp_per_capita_usd"])))
quick_diagnose(merged.sample(5))
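
The cell above reports Pearson correlation against raw GDP, while the plot later uses a log-scaled axis. As a minimal sketch on made-up numbers (not the World Bank data), the example below shows how the coefficient can differ between raw and log-transformed GDP, and how `pd.qcut` splits values into equally sized groups.

```python
import numpy as np
import pandas as pd

# Made-up values: GDP doubles at each step while PM2.5 falls roughly steadily.
toy = pd.DataFrame(
    {
        "gdp_per_capita_usd": [500, 1000, 2000, 4000, 8000, 16000, 32000, 64000],
        "pm25_ug_m3": [52, 45, 41, 33, 30, 22, 19, 12],
    }
)
toy["log_gdp"] = np.log10(toy["gdp_per_capita_usd"])

r_raw = toy["pm25_ug_m3"].corr(toy["gdp_per_capita_usd"])
r_log = toy["pm25_ug_m3"].corr(toy["log_gdp"])
print(f"Pearson vs raw GDP:   {r_raw:.2f}")
print(f"Pearson vs log10 GDP: {r_log:.2f}")

# qcut assigns rows to quantile-based bins, so each group holds a similar number of rows.
toy["gdp_group"] = pd.qcut(
    toy["gdp_per_capita_usd"],
    q=4,
    labels=["Lowest income", "Lower-middle", "Upper-middle", "Highest income"],
)
group_sizes = toy["gdp_group"].value_counts(sort=False)
print(group_sizes)
```

On the real data it may be worth printing both coefficients, since a relationship that is roughly linear in log GDP will look weaker when measured against raw GDP.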

### 5. Define the storytelling frame
Set the narrative scaffold before plotting so our visual decisions support the claim.

In [None]:
TITLE = "Wealthier countries breathe cleaner air"
SUBTITLE = "Population-weighted PM₂.₅ exposure vs. GDP per capita (2019)"
ANNOTATION = "Each marker is a country; trendline highlights the negative relationship"
SOURCE = "Source: World Bank SDI (downloaded Dec 2024)"
UNITS = "PM₂.₅ concentration (µg/m³) vs. GDP per capita (2015 USD)"

validate_story_fields({
    "TITLE": TITLE,
    "SUBTITLE": SUBTITLE,
    "ANNOTATION": ANNOTATION,
    "SOURCE": SOURCE,
    "UNITS": UNITS,
})

### 6. Build the interactive scatter plot
Use Plotly Express for hoverable context, log-scale the x-axis, and add an OLS trendline for reference.

In [None]:
fig = px.scatter(
    merged,
    x="gdp_per_capita_usd",
    y="pm25_ug_m3",
    color="gdp_group",
    hover_name="Country Name",
    hover_data={
        "gdp_per_capita_usd": ":.0f",
        "pm25_ug_m3": ":.1f",
    },
    log_x=True,
    labels={
        "gdp_per_capita_usd": "GDP per capita (2015 USD, log scale)",
        "pm25_ug_m3": "PM₂.₅ exposure (µg/m³)",
        "gdp_group": "Income quartile",
    },
    title=f"{TITLE}<br><sup>{SUBTITLE}</sup>",
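    trendline="ols",
    trendline_options=dict(log_x=True),  # fit against log10(GDP) to match the log-scaled axis
    trendline_scope="overall",  # one trendline across all countries, not one per quartile
    trendline_color_override="#333333",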
    color_discrete_sequence=px.colors.qualitative.Safe,
)
fig.update_layout(
    legend_title_text="Income quartile",
    margin=dict(l=60, r=40, t=90, b=60),
    template="plotly_white",
    annotations=[
        dict(
            text=f"{ANNOTATION}<br>{SOURCE}",
            x=0,
            y=-0.22,
            xref="paper",
            yref="paper",
            showarrow=False,
            align="left",
        )
    ],
)
fig.show()

html_path = PLOTS_DIR / "day03_solution_plot.html"
fig.write_html(html_path)
print(f"Saved interactive figure to {html_path.relative_to(PROJECT_ROOT)}")
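
To make the trendline less of a black box, here is a rough sketch of an ordinary least squares fit of PM₂.₅ on log₁₀ GDP using `numpy.polyfit`. It runs on made-up numbers and is not the routine Plotly uses internally; it only illustrates how to read the slope.

```python
import numpy as np

# Made-up values for illustration only.
gdp = np.array([500, 1000, 2000, 4000, 8000, 16000, 32000, 64000], dtype=float)
pm25 = np.array([52, 45, 41, 33, 30, 22, 19, 12], dtype=float)

# Fit pm25 = slope * log10(gdp) + intercept by ordinary least squares.
slope, intercept = np.polyfit(np.log10(gdp), pm25, deg=1)

# The slope is the expected change in PM2.5 for a tenfold increase in GDP per capita.
print(f"Change in PM2.5 per 10x GDP: {slope:.1f} µg/m³")

predicted_at_10k = slope * np.log10(10_000) + intercept
print(f"Fitted PM2.5 at $10,000 per capita: {predicted_at_10k:.1f} µg/m³")
```

Because the predictor is log₁₀ GDP, the slope describes the change per tenfold increase in income rather than per dollar.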

### 7. Interpret responsibly
- **Key takeaway:** Higher-income countries generally record lower PM₂.₅ exposure, though several wealthier nations still sit above the WHO annual guideline of 5 µg/m³.
- **Uncertainty & caveats:** The trendline assumes PM₂.₅ varies linearly with log GDP per capita and ignores policy lags; satellite-derived pollution estimates carry uncertainties, especially in regions with sparse monitors.
- **What this plot cannot tell us:** It omits population size, pollution sources, and within-country inequality; pair it with time-series or policy data to understand causal drivers.
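
The claim about countries above the WHO guideline can be checked with a short group-by. The sketch below uses made-up rows, not the real dataset; with the notebook's data you would apply the same pattern to `merged`. The constant name `WHO_ANNUAL_GUIDELINE` is an assumption introduced for this example.

```python
import pandas as pd

WHO_ANNUAL_GUIDELINE = 5.0  # µg/m³, the guideline value cited in the takeaway

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "gdp_group": [
            "Lowest income",
            "Lowest income",
            "Highest income",
            "Highest income",
            "Highest income",
        ],
        "pm25_ug_m3": [48.0, 35.0, 4.5, 9.0, 12.0],
    }
)

# The mean of a boolean column is the share of rows where the condition holds.
toy["above_guideline"] = toy["pm25_ug_m3"] > WHO_ANNUAL_GUIDELINE
share_above = toy.groupby("gdp_group")["above_guideline"].mean()
print(share_above)
```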

### 8. Process micro-rubric
| Step | Evidence of completion |
| --- | --- |
| Data loaded & validated | Column checks completed for PM₂.₅ and GDP tables |
| Cleaning documented | 2019 slice merged with nulls removed and diagnostics logged |
| Story frame filled | Title, subtitle, annotation, source, units finalized before plotting |
| Visualization reviewed | Log scaling, color palette, legend, and annotation verified |
| Reflection written | Takeaway, uncertainty, and limitations clearly stated |