## 🔗 Open This Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DavidLangworthy/ds4s/blob/master/Day%203_%20Pollution%20and%20Public%20Health.ipynb)

# 🌫️ Day 3 – Visualizing Pollution and Public Health  
### How Air Quality and Economic Development Intersect

Today we’re diving into the relationship between **air pollution** and **economic development** — and how these two factors can reveal powerful patterns about global health and sustainability.

You’ll be working with **real data on PM2.5 exposure** (fine particulate air pollution) and **GDP per capita** (as a proxy for wealth) for countries around the world. Using these, you’ll build an interactive chart that shows **where air pollution is worst** and helps you ask why it’s that way, and who is most affected.

---

## 🧭 What You'll Learn

Today continues the **“walk”** phase — and we start adding a few **“run”** elements by the end. You’ll:

- Load and combine data from **multiple sources**
- Create **scatter plots** to explore correlations between two variables
- Use **Plotly Express** to build a beautiful, interactive **bubble chart**
- Add extra dimensions to a chart using **color and size**
- Practice **visual storytelling** by identifying patterns, trends, and outliers

---

## 🔧 Tools & Setup

You’ll use:
- `pandas` to load and merge datasets
- `Plotly Express` to build interactive visuals
- (Optional) `seaborn` or `matplotlib` for a quick static version

All tools run in **Google Colab**. Colab runtimes typically ship with `plotly` already installed; if `import plotly` fails, run `!pip install plotly` in a cell.

---

## 📊 Datasets

You’ll use two cleaned, pre-aligned datasets (for the same year, e.g. 2019):

- **PM2.5 Exposure by Country** – average annual exposure to fine particulate matter (in micrograms per cubic meter), from the **Global Burden of Disease** dataset via the World Bank  
- **GDP per Capita by Country** – in current USD, from the **World Bank World Development Indicators**

Each row is a country. You’ll join the two datasets by country name to get:
- `Country`, `PM25`, `GDP_per_capita`, `Population`, `Region`

This enables a rich, multidimensional analysis — with just one chart.

---

## 🛠️ Lab: Air Pollution vs. Development

Open the notebook called **“Air Pollution vs. Development”**. It will guide you through:

### 1. Merging the Data
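You’ll load the two CSV files and use `pd.merge()` to combine them.

> For example: `merged_df = pd.merge(df_pm, df_gdp, on='Country')`  
> The `on` key must match the column names in your files; the code cell at the end of this page, for instance, merges on `Country Name` and `Country Code`.  
> A hint in the notebook helps with this if you get stuck.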

You’ll end up with one DataFrame with everything you need to plot.
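
As a minimal sketch of the merge step, the example below joins two tiny made-up tables (the country labels and values are placeholders, not real measurements). It also uses `indicator=True`, a standard `pd.merge` option, to list rows that appear in only one table, which can help catch country names spelled differently across files.

```python
import pandas as pd

# Toy data for illustration only; the values are placeholders, not real measurements.
df_pm = pd.DataFrame({"Country": ["A", "B", "C"], "PM25": [10.0, 45.0, 80.0]})
df_gdp = pd.DataFrame({"Country": ["A", "B", "D"], "GDP_per_capita": [50000.0, 9000.0, 1200.0]})

# An outer merge with indicator=True keeps every row and records which table it came from.
check = pd.merge(df_pm, df_gdp, on="Country", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Country", "_merge"]])

# The default (inner) merge keeps only countries present in both tables.
merged_df = pd.merge(df_pm, df_gdp, on="Country")
print(merged_df)
```

If `unmatched` is not empty, those countries will silently drop out of an inner merge, so it is worth a quick look before plotting.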

---

### 2. Quick Static Plot (Optional)
You’ll first make a **static scatter plot** using Seaborn or Matplotlib to explore the basic relationship:
- X-axis: GDP per capita
- Y-axis: PM2.5 levels

> You may see an **inverted U-shaped pattern**:  
> Low-income countries have moderate PM2.5, middle-income countries often have the highest pollution, and wealthier countries tend to have cleaner air.

This shape is sometimes called the **environmental Kuznets curve**. It is a hypothesis, not a law: pollution can rise during industrialization and fall again with better regulation, but not every dataset follows it, so check what your own plot shows.
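
One possible way to draw the static version with Matplotlib is sketched below. The `toy` DataFrame and its values are placeholders for illustration; in the lab you would pass your merged DataFrame instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only (not real measurements).
toy = pd.DataFrame({
    "GDP_per_capita": [800, 3000, 9000, 25000, 60000],
    "PM25": [35, 55, 45, 15, 8],
})

fig_static, ax = plt.subplots(figsize=(6, 4))
ax.scatter(toy["GDP_per_capita"], toy["PM25"])
ax.set_xscale("log")  # GDP spans orders of magnitude, so a log axis spreads the points out
ax.set_xlabel("GDP per capita (USD, log scale)")
ax.set_ylabel("PM2.5 (µg/m³)")
ax.set_title("PM2.5 vs GDP per capita (toy data)")
plt.show()
```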

---

### 3. Interactive Bubble Chart with Plotly
Now comes the fun part: building a **fully interactive bubble chart** using `plotly.express`.

You’ll create a chart where:
- X = GDP per capita (log scale)
- Y = PM2.5 exposure
- Size = population
- Color = region
- Hover = country name

> Here's the structure you’ll use (the notebook will guide you):

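```python
import plotly.express as px

# merged_df is the DataFrame produced by the merge step
fig = px.scatter(merged_df,
                 x='GDP_per_capita', y='PM25',
                 size='Population', color='Region',  # bubble size and color add two more dimensions
                 hover_name='Country',
                 log_x=True,                         # log scale spreads out low-GDP countries
                 title='PM2.5 vs GDP per Capita (2019)')
fig.show()
```

With this, you can: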
- Hover over each country to see exact values
- Identify patterns and outliers
- Explore disparities interactively

This is your most modern visualization so far — and one you’ll want to share.
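
Because bubble size is read from the `Population` column, rows with missing values are worth filtering out before plotting; Plotly may reject missing size values, and a log-scaled axis needs strictly positive numbers. The sketch below shows one way to do that on a placeholder `merged_df` (the rows and the `plot_df` name are made up for illustration).

```python
import pandas as pd

# Placeholder rows for illustration; None marks a missing value.
merged_df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "GDP_per_capita": [50000.0, 9000.0, 1200.0],
    "PM25": [10.0, 45.0, None],
    "Population": [5e6, None, 2e7],
    "Region": ["X", "Y", "Z"],
})

# Keep only rows that have every value the chart needs.
plot_cols = ["GDP_per_capita", "PM25", "Population"]
plot_df = merged_df.dropna(subset=plot_cols)

# A log-scaled x-axis also needs strictly positive GDP values.
plot_df = plot_df[plot_df["GDP_per_capita"] > 0]
print(f"Kept {len(plot_df)} of {len(merged_df)} rows")
```

You would then pass `plot_df` to `px.scatter` in place of `merged_df`.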

---

### 4. Highlighting Key Points (Optional)

The notebook may include a step to **highlight or annotate** key countries:
- For example, **Bangladesh** (very high PM2.5), **Qatar** (high GDP and pollution), or the **U.S.** (high GDP, lower pollution)

This helps tell a more pointed story: some countries have grown economically but haven’t yet controlled air pollution; others have cleaned up their air through policy and innovation.

---

### 5. Telling the Story

At the end of the notebook, you’ll answer a few reflective questions in Markdown:
- What kind of relationship do you see between GDP and pollution?
- Why might middle-income countries have the worst air quality?
- What stands out in your chart?

> 💡 Hint: Industrial growth (factories, cars, fossil fuels) tends to increase pollution before environmental protections kick in.

This connects back to what you saw in Day 2: **reliance on fossil fuels drives both CO₂ and air pollution**. The path to sustainability isn’t just about energy — it’s also about health.
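
To back up a written answer with numbers, one option is to group countries into income brackets and compare average PM2.5 across them. The sketch below does this on placeholder data; the bracket cut-offs are hypothetical values chosen for the example, not official World Bank thresholds.

```python
import pandas as pd

# Placeholder values for illustration only (not real measurements).
toy = pd.DataFrame({
    "GDP_per_capita": [700, 900, 3000, 5000, 9000, 30000, 60000],
    "PM25": [30, 40, 60, 50, 55, 12, 8],
})

# Hypothetical income brackets for this sketch (not official thresholds).
bins = [0, 1000, 10000, float("inf")]
labels = ["low", "middle", "high"]
toy["income_group"] = pd.cut(toy["GDP_per_capita"], bins=bins, labels=labels)

# Count and average PM2.5 within each bracket.
summary = toy.groupby("income_group", observed=True)["PM25"].agg(["count", "mean"])
print(summary)
```

If the middle bracket has the highest mean, that is consistent with an inverted U; if not, say so in your answer.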

---

## 🧪 Starter vs. Solution

- The **starter notebook** walks you through the merge, the static plot, and scaffolds most of the Plotly code (with a few `# TODO` prompts).
- The **solution notebook** shows a polished, colorful, fully interactive bubble chart, with example analysis and optional annotations.

---

## ✅ By the End of Today

You’ll be able to:
- Join two datasets in `pandas`
- Create rich scatter plots with multiple visual dimensions
- Build an interactive chart using `plotly.express`
- Analyze and narrate **correlations and inequalities** in public health data

This is your first fully interactive, multivariate visualization. From here, you’ll start to recognize visual patterns — and tell more human stories using data.

Let’s get started.


In [None]:
from pathlib import Path
import pandas as pd
import plotly.express as px

# TODO: Bring in the libraries you plan to use

# Load datasets from the local data folder
data_dir = Path.cwd() / "data"
df_pm = pd.read_csv(data_dir / "pm25_exposure.csv")
df_gdp = pd.read_csv(data_dir / "gdp_per_country.csv")

# TODO: Load the PM₂.₅ exposure and GDP datasets

# Extract columns for a single year (2019)
df_pm2019 = df_pm[["Country Name", "Country Code", "2019"]].rename(columns={"2019": "PM25"})
df_gdp2019 = df_gdp[["Country Name", "Country Code", "2019"]].rename(columns={"2019": "GDP_per_capita"})

# TODO: Tidy up the 2019 slices and rename the columns

# Merge on both Country Name and Country Code so rows must match on name and code
df_merge = pd.merge(df_pm2019, df_gdp2019, on=["Country Name", "Country Code"], how="inner")
df_merge.dropna(inplace=True)  # remove rows without complete data

# TODO: Combine the datasets and drop incomplete rows

# Basic plot
fig = px.scatter(
    df_merge,
    x="GDP_per_capita",
    y="PM25",
    hover_name="Country Name",
    log_x=True,
    title="PM2.5 Exposure vs GDP per Capita (2019)",
    labels={"GDP_per_capita": "GDP per Capita (USD)", "PM25": "PM2.5 (µg/m³)"},
)

fig.show()

# TODO: Build the scatter plot that compares income and air quality


In [None]:
from pathlib import Path

plots_dir = Path.cwd() / "plots"
plots_dir.mkdir(parents=True, exist_ok=True)

if "fig" in globals():
    output_path = plots_dir / "day03_solution_plot.png"
    try:
        # Static image export relies on the optional `kaleido` package
        fig.write_image(str(output_path))
        print(f"Saved Plotly figure to {output_path}")
    except Exception as exc:
        print(f"Plot export skipped: {exc}")
else:
    print("Plotly figure `fig` not found.")
