## 🔗 Open This Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DavidLangworthy/ds4s/blob/master/days/day03/notebook/day03_starter.ipynb)

# 🌫️ Day 3 – Pollution, Prosperity, and Public Health
We will continue the learn–do cadence: inspect, transform, visualise, reflect. This time the goal is an interactive bubble chart that helps explain why clean air remains unequal.

## 🗂️ Data Card: PM₂.₅ Exposure + GDP per Capita
- **Source:** World Bank World Development Indicators (downloaded Nov 2024).
- **Temporal coverage:** Annual indicators, 1990–2023; we focus on 2019 for a pre-pandemic snapshot.
- **Units:** PM₂.₅ in µg/m³ (population-weighted exposure); GDP per capita in current USD.
- **Last updated:** October 2024 release.
- **Method notes:** Country-level statistics harmonised with ISO-3 codes; Gapminder metadata supplies continent groupings and approximate population for bubble sizing.
- **Caveats:** Aggregated regions and territories appear in the raw files; filter to true ISO-3 countries. GDP per capita in current USD is not PPP-adjusted.
- **Integrity prompt:** What story could be lost if you plot on a linear x-axis? Why might log-scaling be more responsible here?
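
One way to reason about the integrity prompt is to measure how much of the axis the lower-income countries would occupy under each scale. The sketch below uses invented GDP values (assumptions for illustration, not World Bank figures) and a hypothetical helper, `axis_positions`, to compare normalised positions on a linear and a log10 axis.

```python
import math

# Hypothetical GDP-per-capita values in current USD (illustrative only, not WDI data).
gdp_values = [500, 1_000, 2_000, 5_000, 10_000, 50_000, 100_000]


def axis_positions(values, transform=lambda v: v):
    """Map values to 0..1 positions along an axis after applying `transform`."""
    scaled = [transform(v) for v in values]
    low, high = min(scaled), max(scaled)
    return [(s - low) / (high - low) for s in scaled]


linear = axis_positions(gdp_values)              # plain linear axis
logged = axis_positions(gdp_values, math.log10)  # log10 axis

for value, lin, log in zip(gdp_values, linear, logged):
    print(f"{value:>7,} USD  linear={lin:.3f}  log={log:.3f}")
```

With these made-up numbers, the four values at or below 5,000 USD are squeezed into the first 5% of the linear axis but spread across roughly 43% of the log axis, which is where most of the variation in exposure is likely to sit.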

## Story Scaffold Reminder
- **Claim:** Which relationship between wealth and air quality are you highlighting?
- **Evidence:** Which clusters or outliers support that claim?
- **Visual:** How do scale choices, hover text, and colour reinforce clarity?
- **Takeaway:** Draft the public-health message you want readers to remember.

## Step 0 · Imports

In [None]:
from pathlib import Path

import pandas as pd
import plotly.express as px

from days.utils import (
    check_story_metadata,
    load_data,
    plots_directory,
    quick_diagnostics,
)

## Step 1 · Load 2019 indicators
Slice both datasets to a single analysis year so each row represents a comparable snapshot.

In [None]:
YEAR = "2019"
pm = load_data("data/pm25_exposure.csv")
gdp = load_data("data/gdp_per_country.csv")

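# Keep the identifier columns plus the analysis year, renamed to a descriptive label.
pm_year = pm[["Country Name", "Country Code", YEAR]].rename(columns={YEAR: "PM25"})
gdp_year = gdp[["Country Name", "Country Code", YEAR]].rename(columns={YEAR: "GDP_per_capita"})

# Inner merge keeps entities present in both tables; then drop rows missing either indicator.
merged = pd.merge(pm_year, gdp_year, on=["Country Name", "Country Code"], how="inner")
merged = merged.dropna(subset=["PM25", "GDP_per_capita"])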

In [None]:
quick_diagnostics(
    merged,
    expected_columns=["Country Name", "Country Code", "PM25", "GDP_per_capita"],
    rows_between=(160, 210),
)
print("Expected: ~180 countries. Values should be positive; PM₂.₅ units are µg/m³.")
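
Before trusting the row count, it can help to see what the inner merge and `dropna` discard. The following is a minimal sketch on invented toy tables (the country names, codes, and values are assumptions, not the course data): an outer merge with `indicator=True` labels each row by where it came from, and `validate="one_to_one"` raises an error if a key is duplicated on either side.

```python
import pandas as pd

# Toy wide-format tables mimicking the World Bank layout (all values invented).
pm = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "World"],
    "Country Code": ["AAA", "BBB", "WLD"],
    "2019": [12.5, None, 40.1],
})
gdp = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "Country Code": ["AAA", "BBB", "CCC"],
    "2019": [42000.0, 3100.0, 8800.0],
})

YEAR = "2019"
keys = ["Country Name", "Country Code"]
pm_year = pm[keys + [YEAR]].rename(columns={YEAR: "PM25"})
gdp_year = gdp[keys + [YEAR]].rename(columns={YEAR: "GDP_per_capita"})

# Outer merge so unmatched rows stay visible; `_merge` records their origin.
audit = pm_year.merge(gdp_year, on=keys, how="outer", indicator=True, validate="one_to_one")
print(audit["_merge"].value_counts())

# Equivalent to the inner merge followed by dropna on both indicators.
merged = (
    audit[audit["_merge"] == "both"]
    .dropna(subset=["PM25", "GDP_per_capita"])
    .drop(columns="_merge")
)
print(merged)
```

In this toy case one row appears only in the PM table, one only in the GDP table, and one matched row is dropped for a missing PM value, so a single country survives.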

## Step 2 · Add continent groupings and population proxies
Use Plotly's Gapminder metadata to attach continent names and approximate population for bubble sizing. Filter out rows without ISO-3 matches to avoid aggregates.

In [None]:
# Gapminder is long-format (one row per country and year), so keep only the latest
# year (2007) to get exactly one metadata row per country before merging.
gap = px.data.gapminder()
gap = gap.loc[gap["year"] == gap["year"].max(), ["country", "iso_alpha", "continent", "pop"]]
merged = merged.merge(gap, left_on="Country Code", right_on="iso_alpha", how="left")
clean_countries = merged.dropna(subset=["continent"])
clean_countries = clean_countries.rename(columns={"country": "Gapminder Name", "pop": "Population", "continent": "Region"})
clean_countries = clean_countries.drop(columns=["iso_alpha"])

In [None]:
quick_diagnostics(
    clean_countries[["Country Name", "Country Code", "Region", "Population", "PM25", "GDP_per_capita"]],
    expected_columns=["Country Name", "Country Code", "Region", "Population", "PM25", "GDP_per_capita"],
    rows_between=(130, 200),
    head_rows=4,
)
print("Population uses Gapminder's latest available year (approx. 2007); note this caveat in your write-up.")
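
The join logic in this step can be sketched on a small stand-in table. Gapminder's data is in long format, with one row per country and year, so the lookup needs to be reduced to one row per ISO-3 code before the left merge; rows that find no match (aggregates such as "WLD") then end up with a missing continent and can be dropped. All names and values below are invented for illustration.

```python
import pandas as pd

# Stand-in for the Gapminder lookup: one row per country-year (invented values).
lookup = pd.DataFrame({
    "iso_alpha": ["AAA", "AAA", "BBB", "BBB"],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "year": [2002, 2007, 2002, 2007],
    "pop": [90, 100, 48, 50],
})
# Stand-in for the merged indicators; "WLD" plays the role of an aggregate.
indicators = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "WLD"],
    "PM25": [55.0, 10.0, 40.0],
})

# Keep the most recent row for each ISO-3 code.
latest = lookup.sort_values("year").drop_duplicates("iso_alpha", keep="last")

# Left merge keeps every indicator row; unmatched codes get NaN metadata.
joined = indicators.merge(latest, left_on="Country Code", right_on="iso_alpha", how="left")
countries = joined.dropna(subset=["continent"])

# Each country should now appear exactly once.
assert countries["Country Code"].is_unique
print(countries[["Country Code", "continent", "pop", "PM25"]])
```

Checking `is_unique` after the merge is a cheap guard: if the lookup still held several years per country, each indicator row would be repeated once per year.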

## Step 3 · Interim static check
Before going interactive, create a quick scatter to make sure the log scaling and axis labels communicate cleanly.

![Interim preview – downward-sloping scatter cloud.](../../plots/day03_solution_plot.png)

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(clean_countries["GDP_per_capita"], clean_countries["PM25"], alpha=0.4, color="#5f6caf")
ax.set_xscale("log")
ax.set_xlabel("GDP per capita (current USD, log scale)")
ax.set_ylabel("PM₂.₅ exposure (µg/m³)")
ax.set_title("Draft: wealthier countries tend to have cleaner air")
plt.show()

## Step 4 · Story metadata

In [None]:
TITLE = "Middle-income countries still breathe the dirtiest air"
SUBTITLE = "PM₂.₅ exposure vs. GDP per capita, 2019"
ANNOTATION = "South & East Asia dominate the high-PM₂.₅, middle-income cluster."
SOURCE = "Source: World Bank WDI (PM₂.₅ exposure & GDP per capita, 2019)"
UNITS = "Units: PM₂.₅ (µg/m³), GDP per capita (current USD)"

check_story_metadata(
    TITLE=TITLE,
    SUBTITLE=SUBTITLE,
    ANNOTATION=ANNOTATION,
    SOURCE=SOURCE,
    UNITS=UNITS,
)

## Step 5 · Build the interactive bubble chart
Encode GDP on a log x-axis, PM₂.₅ on the y-axis, colour by region, and bubble size by population. Add hover text with country names and values so viewers can investigate the story.

In [None]:
fig = px.scatter(
    clean_countries,
    x="GDP_per_capita",
    y="PM25",
    color="Region",
    size="Population",
    hover_name="Country Name",
    hover_data={"GDP_per_capita": ":,.0f", "PM25": ":.1f", "Population": ":,.0f"},
    log_x=True,
    size_max=60,
    title=f"{TITLE}<br><sup>{SUBTITLE}</sup>",
    labels={"GDP_per_capita": "GDP per capita (current USD, log scale)", "PM25": "PM₂.₅ exposure (µg/m³)"},
    template="plotly_white",
)
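import math

# Annotation coordinates on a log axis are given in log10 units, so convert the GDP position.
fig.add_annotation(
    x=math.log10(6000),
    y=45,
    text=ANNOTATION,
    showarrow=True,
    arrowhead=2,
    ax=-120,
    ay=-40,
    bgcolor="rgba(255,255,255,0.8)",
)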
fig.update_layout(
    legend_title="Region",
    margin=dict(l=40, r=40, t=80, b=90),  # extra bottom margin so the source note below the axis is not clipped
    annotations=list(fig.layout.annotations) + [
        dict(
            text=f"{SOURCE} · {UNITS}",
            xref="paper",
            yref="paper",
            x=0.5,
            y=-0.18,
            showarrow=False,
            font=dict(size=11, color="#555555"),
        )
    ],
)
fig.show()  # let Plotly choose the renderer (Colab, Jupyter, ...) rather than hard-coding "notebook"
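
When reading bubble sizes, it helps to remember how population maps to what the eye sees. The sketch below assumes bubble *area* is proportional to population, which is my understanding of how Plotly Express treats `size` by default (worth confirming against the Plotly documentation). The helper `bubble_diameters` and the population figures are hypothetical.

```python
import math

SIZE_MAX = 60  # pixel diameter of the largest bubble, matching size_max above


def bubble_diameters(populations, size_max=SIZE_MAX):
    """Diameters when marker area scales linearly with population (assumed)."""
    largest = max(populations)
    # Area ~ population, so diameter ~ sqrt(population).
    return [size_max * math.sqrt(p / largest) for p in populations]


# Hypothetical populations: very large, large, and small countries.
populations = [1_400_000_000, 160_000_000, 5_000_000]
diameters = bubble_diameters(populations)

for pop, d in zip(populations, diameters):
    print(f"population={pop:>13,}  diameter={d:5.1f}px")
```

Under this assumption a country with about 11% of the largest population still gets a bubble roughly a third as wide, so small countries stay visible while large ones do not swamp the chart.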

## Step 6 · Interpret with the scaffold
- **Claim:** The worst air pollution burdens fall on densely populated middle-income countries.
- **Evidence:** Countries like India, Bangladesh, and Pakistan cluster at high PM₂.₅ levels despite rising incomes, while high-income countries sit in the low-exposure corner.
- **Visual:** Log-scaling the x-axis reveals the inverted-U pattern; bubble sizes and colour-coded regions contextualise scale and geography.
- **Takeaway:** “Economic growth alone does not deliver clean air; targeted pollution controls are essential in rapidly developing nations.”

### Limitations to note
- Population estimates come from 2007 Gapminder data; cite this approximation when sharing.
- PM₂.₅ exposure is averaged nationally and hides within-country inequalities.
- GDP per capita in current USD can swing with exchange rates; try PPP-adjusted or median income for deeper analysis.

## Step 7 · Export the interactive figure

In [None]:
export_path = plots_directory() / "day03_solution_plot.html"
fig.write_html(str(export_path))
print(f"💾 Saved interactive figure to {export_path}")