# Wrangle and Visualize Global COVID-19 Deaths
**Course:** BAN405 – Python Programming  
**File created:** 2025-10-14

This notebook follows the assignment instructions to wrangle and visualize COVID-19 deaths using the Johns Hopkins CSSE time series dataset and a country–continent lookup.

> **How to run:**  
> 1) Place the CSV files in a local folder named `data/` (next to this notebook):  
> &nbsp;&nbsp;• `data/time_series_covid19_deaths_global.csv` (JHU CSSE)  
> &nbsp;&nbsp;• `data/Countries Continents.csv` (country→continent mapping from OWID)  
> 2) Then run the notebook top-to-bottom (Kernel → Restart & Run All).
>
> If the files are missing locally, the notebook **attempts** to fetch them from public GitHub URLs as a fallback. If your environment has no internet, place the files manually in `data/`.

> **Statement on use of generative AI**  
> I used ChatGPT (GPT‑5 Thinking) to help draft and organize this notebook, including code structure, comments, and markdown explanations. I reviewed, edited, and verified the content, and I am responsible for the final submission.

In [1]:
# === Imports ===
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display  # explicit import so display() also works outside a notebook session

# Plot style (keep default; do not set specific colors per course rules)
plt.rcParams["figure.dpi"] = 120

DATA_DIR = Path("data")
PLOTS_DIR = Path("plots")
PLOTS_DIR.mkdir(exist_ok=True)

# URLs (fallback if local files are missing)
JHU_DEATHS_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_covid19_deaths_global.csv"
)

# A commonly used OWID continents mapping (structure: country, continent)
# Your local file may be named 'Countries Continents.csv' with similar columns.
OWID_CONTINENTS_URL = "https://raw.githubusercontent.com/owid/covid-19-data/master/scripts/input/continents/continents.csv"

## Task 1 – Data wrangling
### 1. Load and explore the COVID data set
We load the JHU deaths time series. The dataset contains cumulative deaths by Province/State and Country/Region with date columns in `m/d/yy` format.

In [None]:
# Utility: load CSV from local 'data/' if present; otherwise try GitHub (internet required)
def load_csv(local_name: str, fallback_url: str) -> pd.DataFrame:
    local_path = DATA_DIR / local_name
    if local_path.exists():
        df = pd.read_csv(local_path)
        print(f"Loaded local file: {local_path}")
        return df
    else:
        try:
            df = pd.read_csv(fallback_url)
            print(f"Loaded from web fallback: {fallback_url}")
            # Save a local copy so subsequent runs work offline
            DATA_DIR.mkdir(exist_ok=True)
            df.to_csv(local_path, index=False)
            print(f"Saved a local copy to: {local_path}")
            return df
        except Exception as e:
            # Chain the original error so the traceback shows why the download failed
            raise FileNotFoundError(
                f"Could not find {local_path} and failed to download from the web.\n"
                f"Error: {e}\nPlease place the file in the 'data/' folder and rerun."
            ) from e

# Load datasets
deaths_raw = load_csv("time_series_covid19_deaths_global.csv", JHU_DEATHS_URL)

print("Shape:", deaths_raw.shape)
display(deaths_raw.head())
display(deaths_raw.tail())
print("\nColumns:", list(deaths_raw.columns)[:10], "...")
deaths_raw.info()  # info() prints its summary itself and returns None

In [None]:
# Basic exploration
countries_unique = deaths_raw["Country/Region"].nunique()
provinces_present = deaths_raw["Province/State"].notna().sum()
print(f"Unique countries/regions: {countries_unique}")
print(f"Rows with a province/state value: {provinces_present}")

# Identify date columns programmatically (JHU file: first 4 are meta columns)
meta_cols = ["Province/State", "Country/Region", "Lat", "Long"]
date_cols = [c for c in deaths_raw.columns if c not in meta_cols]
print(f"Detected {len(date_cols)} date columns. First & last:", date_cols[0], "…", date_cols[-1])

# Check missing values in key columns
print("\nMissing values per meta column:")
print(deaths_raw[meta_cols].isna().sum())
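
To check the `m/d/yy` assumption in isolation, here is a minimal sketch on a made-up one-row frame (the `toy` names and the numbers are illustrative, not taken from the JHU file). It splits meta columns from date columns the same way as above and parses the date labels:

```python
import pandas as pd

# Hypothetical miniature of the JHU layout: four meta columns, then date columns.
# The values are invented for illustration only.
toy = pd.DataFrame(
    {
        "Province/State": [None],
        "Country/Region": ["Norway"],
        "Lat": [60.47],
        "Long": [8.47],
        "1/22/20": [0],
        "12/31/20": [436],
    }
)

toy_meta = ["Province/State", "Country/Region", "Lat", "Long"]
toy_dates = [c for c in toy.columns if c not in toy_meta]

# %m and %d also accept non-padded values such as "1/22/20"
parsed = pd.to_datetime(toy_dates, format="%m/%d/%y")
print(parsed)
```

If a column label does not match the format, `pd.to_datetime` raises an error, which is a useful early warning that a non-date column slipped into `date_cols`.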

### 2. Reshape from wide to long (tidy)
We use `pandas.melt()` to pivot date columns into a single `date` column for easier time-series handling.

In [None]:
deaths_long = deaths_raw.melt(
    id_vars=meta_cols,
    value_vars=date_cols,
    var_name="date_str",
    value_name="total_deaths"
)

# Convert date strings to datetime
deaths_long["date"] = pd.to_datetime(deaths_long["date_str"], format="%m/%d/%y")

# Keep relevant columns and ensure numeric
deaths_long["total_deaths"] = pd.to_numeric(deaths_long["total_deaths"], errors="coerce").fillna(0).astype("Int64")
deaths_long = deaths_long.drop(columns=["date_str"])

display(deaths_long.head())
deaths_long.info()  # info() prints its summary itself and returns None
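
To see what the reshape does, the following sketch applies the same `melt()` call to a tiny made-up table (two regions, two dates). The frame and its values are assumptions for illustration only:

```python
import pandas as pd

# Made-up wide table: one row per region, one column per date
wide = pd.DataFrame(
    {
        "Country/Region": ["A", "B"],
        "1/22/20": [0, 1],
        "1/23/20": [2, 3],
    }
)

# All columns not listed in id_vars are stacked into (date_str, total_deaths) pairs
tidy = wide.melt(
    id_vars=["Country/Region"],
    var_name="date_str",
    value_name="total_deaths",
)
tidy["date"] = pd.to_datetime(tidy["date_str"], format="%m/%d/%y")
tidy = tidy.drop(columns=["date_str"])

# 2 regions x 2 dates -> 4 rows, one per (region, date) pair
print(tidy)
```

Each date column becomes a block of rows, so the long table has `rows × date columns` entries.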

### 3–4. Convert dates to timestamps and aggregate to the country level
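The `date` column was already parsed to timestamps in the previous step. Here we sum across provinces/states within each country for each day, so that every country has exactly one row per date.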

In [None]:
country_daily = (
    deaths_long
    .groupby(["Country/Region", "date"], as_index=False)["total_deaths"]
    .sum()
    .sort_values(["Country/Region", "date"])
    .reset_index(drop=True)
)

print("Aggregated shape:", country_daily.shape)
display(country_daily.head())
display(country_daily.tail())
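
As a small worked example of the aggregation, the sketch below uses invented numbers: country `X` reports two provinces and country `Y` reports only a national row (missing `Province/State`). Grouping by country and date collapses the provinces into one row:

```python
import pandas as pd

# Invented long-format rows for a single date
long_rows = pd.DataFrame(
    {
        "Province/State": ["P1", "P2", None],
        "Country/Region": ["X", "X", "Y"],
        "date": pd.to_datetime(["2020-03-01"] * 3),
        "total_deaths": [5, 7, 2],
    }
)

by_country = (
    long_rows
    .groupby(["Country/Region", "date"], as_index=False)["total_deaths"]
    .sum()
)
print(by_country)
```

The missing province for `Y` does not matter here because `Province/State` is not a grouping key.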

### 5. Compute daily new deaths
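Daily new deaths are the difference in cumulative totals between two adjacent days. We use `groupby().diff()` so that the difference is taken within each country and never across the boundary between two countries. The first date of each country has no previous day, so its value is set to 0.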

In [None]:
country_daily["new_deaths"] = (
    country_daily
    .groupby("Country/Region")["total_deaths"]
    .diff()
    .fillna(0)
    .astype(int)
)

# Some countries can have occasional data corrections (negative diffs). Clip to zero for plotting clarity.
country_daily["new_deaths"] = country_daily["new_deaths"].clip(lower=0)

display(country_daily.head(10))
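
The effect of `diff()`, `fillna(0)` and `clip()` is easiest to see on a short made-up series. In this sketch country `X` has a downward correction on its third day; all numbers are invented:

```python
import pandas as pd

# Invented cumulative totals; X drops from 5 to 4 (a data correction)
cum = pd.DataFrame(
    {
        "Country/Region": ["X"] * 4 + ["Y"] * 2,
        "total_deaths": [3, 5, 4, 9, 10, 12],
    }
)

# diff() restarts for every country, so the first row of each group is NaN
raw_diff = cum.groupby("Country/Region")["total_deaths"].diff()

# Same treatment as above: NaN -> 0, then negative corrections -> 0
cum["new_deaths"] = raw_diff.fillna(0).astype(int).clip(lower=0)
print(cum)
```

Two consequences are worth keeping in mind. Deaths already recorded on a country's first date are not counted as new deaths, and after clipping the daily values for `X` sum to 7 although the cumulative total only rose by 6 (from 3 to 9).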

## Task 2 – Data visualization
### 1a) Total deaths over time for the **three countries with the highest totals** (as of the last date)
We compute the top 3 countries by total deaths at the latest date in the data and plot their cumulative trajectories on one chart. The plot is saved as `plots/total_deaths.png`.

In [None]:
latest_date = country_daily["date"].max()
latest_totals = (
    country_daily[country_daily["date"] == latest_date]
    .sort_values("total_deaths", ascending=False)
    .head(3)
)

top3_countries = latest_totals["Country/Region"].tolist()
print("Latest date:", latest_date.date())
print("Top 3 countries:", top3_countries)

fig = plt.figure(figsize=(8, 5))
for country in top3_countries:
    subset = country_daily[country_daily["Country/Region"] == country]
    plt.plot(subset["date"], subset["total_deaths"], label=country)

plt.title("Total COVID-19 Deaths Over Time (Top 3 Countries)")
plt.xlabel("Date")
plt.ylabel("Total deaths (cumulative)")
plt.legend()
plt.tight_layout()

out_path = PLOTS_DIR / "total_deaths.png"
plt.savefig(out_path)
plt.show()
print(f"Saved: {out_path.resolve()}")
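
As an alternative to `sort_values(...).head(3)`, pandas also offers `DataFrame.nlargest`. The sketch below shows it on invented totals; it should pick the same countries as the code above whenever there are no ties at the cut-off:

```python
import pandas as pd

# Invented latest-date totals for four placeholder countries
latest = pd.DataFrame(
    {
        "Country/Region": ["A", "B", "C", "D"],
        "total_deaths": [10, 40, 30, 20],
    }
)

# Rows with the three largest totals, in descending order
top3 = latest.nlargest(3, "total_deaths")["Country/Region"].tolist()
print(top3)
```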

### 1b) Daily new deaths for **Norway, Denmark, and Sweden** as three subplots
Saved as `plots/new_deaths.png`.

In [None]:
nordics = ["Norway", "Denmark", "Sweden"]
fig, axes = plt.subplots(3, 1, figsize=(9, 8), sharex=True)

for ax, country in zip(axes, nordics):
    subset = country_daily[country_daily["Country/Region"] == country]
    ax.plot(subset["date"], subset["new_deaths"])
    ax.set_title(country)
    ax.set_ylabel("New deaths")

axes[-1].set_xlabel("Date")
fig.suptitle("Daily New COVID-19 Deaths – Nordics", y=0.95)
fig.tight_layout()

out_path = PLOTS_DIR / "new_deaths.png"
fig.savefig(out_path)
plt.show()
print(f"Saved: {out_path.resolve()}")

### OPTIONAL: Reusable plotting function
`plot_total_deaths(countries, data)` plots cumulative deaths for any selection of countries on the same axes with basic input validation.

In [None]:
def plot_total_deaths(countries, data):
    """
    Plot cumulative deaths over time for one or more countries.

    Parameters
    ----------
    countries : list[str]
        Country names as they appear in `data['Country/Region']`.
    data : pd.DataFrame
        DataFrame with columns ['Country/Region', 'date', 'total_deaths'].
    """
    # Validate inputs
    if not isinstance(countries, (list, tuple)):
        raise TypeError("`countries` must be a list or tuple of country names.")
    if len(countries) == 0:
        print("No countries provided. Nothing to plot.")
        return None
    if not set(["Country/Region", "date", "total_deaths"]).issubset(data.columns):
        raise ValueError("`data` must contain 'Country/Region', 'date', and 'total_deaths'.")

    known = set(data["Country/Region"].unique())
    requested = set(countries)
    unknown = sorted(list(requested - known))
    valid = [c for c in countries if c in known]

    if unknown:
        print(f"Warning: Unknown countries ignored: {unknown}")
    if not valid:
        print("No valid countries to plot. Nothing to do.")
        return None

    plt.figure(figsize=(8, 5))
    for c in valid:
        subset = data[data["Country/Region"] == c]
        plt.plot(subset["date"], subset["total_deaths"], label=c)
    plt.title("Total COVID-19 Deaths")
    plt.xlabel("Date")
    plt.ylabel("Total deaths (cumulative)")
    plt.legend()
    plt.tight_layout()
    plt.show()

# Quick tests
tests = [
    ['Norway', 'Denmark', 'Sweden'],
    ['Norway', 'Denmark', 'Atlantis'],
    ['The moon', 'Mars', 'Atlantis'],
    []
]
for t in tests:
    print("\nTest:", t)
    plot_total_deaths(t, country_daily)
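
The validation logic can also be tested without drawing anything. The sketch below isolates it in a small helper; `split_countries` is a hypothetical name introduced here and is not part of the function above:

```python
def split_countries(requested, known):
    """Split requested names into (valid, unknown).

    `valid` keeps the order of the request; `unknown` is sorted alphabetically.
    Hypothetical helper for illustration only.
    """
    known = set(known)
    valid = [c for c in requested if c in known]
    unknown = sorted(set(requested) - known)
    return valid, unknown


available = ["Norway", "Denmark", "Sweden"]
print(split_countries(["Norway", "Denmark", "Atlantis"], available))
print(split_countries(["The moon", "Mars", "Atlantis"], available))
print(split_countries([], available))
```

Separating the check from the plotting makes the four test cases above easy to assert on, since the helper returns plain lists.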

## Task 3 – Data merging with continents
We merge a country→continent mapping onto our country-day data. Minor name harmonization improves the match (e.g., `US` vs `United States`, `Korea, South` vs `South Korea`). After merging we (1) check coverage, (2) compute **total deaths per continent**, and (3) make bar charts.

In [None]:
# Load continents mapping
def load_continents():
    # The assignment mentions a local file named 'Countries Continents.csv'.
    # We'll try that first, then fall back to an OWID mapping hosted on GitHub.
    try:
        cc = pd.read_csv(DATA_DIR / "Countries Continents.csv")
        print("Loaded local 'Countries Continents.csv'")
    except FileNotFoundError:
        # Only fall back when the local file is absent; other read errors should surface
        cc = pd.read_csv(OWID_CONTINENTS_URL)
        print("Loaded continents from OWID fallback URL")

    # Map lower-cased column names back to their original spelling
    lower_cols = {c.lower(): c for c in cc.columns}
    # Heuristics for column naming (matching is case-insensitive)
    country_col = next(
        (lower_cols[c] for c in ["country", "location", "entity", "name"] if c in lower_cols),
        None,
    )
    continent_col = lower_cols.get("continent")
    if country_col is None or continent_col is None:
        raise ValueError("Could not infer country/continent columns from the continents file.")

    cc = cc.rename(columns={country_col: "country", continent_col: "continent"})
    # Drop incomplete rows before casting: astype(str) would turn NaN into the string "nan"
    cc = cc.dropna(subset=["country", "continent"]).astype({"country": str, "continent": str})
    return cc[["country", "continent"]]

continents = load_continents()
display(continents.head())
print("Unique continents:", sorted(continents["continent"].dropna().unique().tolist()))

In [None]:
# Name harmonization dictionary: JHU name -> name used in the continents file.
# The right-hand spellings are assumptions about the mapping file; compare them with
# the unmatched names printed below and adjust entries if needed.
# Names that are already identical in both sources need no entry.
name_map = {
    "US": "United States",
    "Korea, South": "South Korea",
    "Taiwan*": "Taiwan",
    "Burma": "Myanmar",
    "Congo (Kinshasa)": "Democratic Republic of Congo",
    "Congo (Brazzaville)": "Republic of the Congo",
    "Cote d'Ivoire": "Côte d'Ivoire",
    "Holy See": "Vatican",
    "Cape Verde": "Cabo Verde",
    "West Bank and Gaza": "Palestine",
    "Bahamas": "The Bahamas",
    "Gambia": "The Gambia",
    "Timor-Leste": "East Timor",
}

country_daily_merge = country_daily.copy()
country_daily_merge["country_std"] = country_daily_merge["Country/Region"].replace(name_map)

# Some rows in JHU are not countries (e.g., cruise ships, overseas territories).
# These may not have a continent — that's expected.
merge_df = country_daily_merge.merge(
    continents.rename(columns={"country": "country_std"}),
    on="country_std",
    how="left"
)

coverage = merge_df[["Country/Region", "country_std", "continent"]].drop_duplicates()
missing = coverage["continent"].isna().sum()
total = coverage.shape[0]
print(f"Merged country→continent coverage: {total - missing}/{total} matched, {missing} missing")

# Show a few missing examples
missing_names = coverage[coverage["continent"].isna()].head(20)
display(missing_names)
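
Another way to audit a left merge is the `indicator` option of `DataFrame.merge`, which adds a `_merge` column telling where each row came from. The sketch below uses invented frames; `validate="many_to_one"` additionally raises an error if a country appears twice in the lookup:

```python
import pandas as pd

# Invented inputs: one entity (a cruise ship) has no continent in the lookup
deaths = pd.DataFrame(
    {
        "country_std": ["United States", "Norway", "Diamond Princess"],
        "total_deaths": [100, 5, 1],
    }
)
lookup = pd.DataFrame(
    {
        "country_std": ["United States", "Norway"],
        "continent": ["North America", "Europe"],
    }
)

merged = deaths.merge(
    lookup,
    on="country_std",
    how="left",
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",  # fails if the lookup has duplicate country names
)

unmatched = merged.loc[merged["_merge"] == "left_only", "country_std"].tolist()
print(unmatched)
```

Rows marked `left_only` are exactly those whose `continent` is missing after the merge, so this gives the same list as the `isna()` check above.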

### Total deaths per continent (latest date)
We compute totals using the latest date available and make a bar plot. Saved as `plots/continent_bar_plot.png`.

In [None]:
latest_date = merge_df["date"].max()
continent_latest = (
    merge_df[merge_df["date"] == latest_date]
    .dropna(subset=["continent"])
    .groupby("continent", as_index=False)["total_deaths"]
    .sum()
    .sort_values("total_deaths", ascending=False)
)

display(continent_latest)

plt.figure(figsize=(7, 5))
plt.bar(continent_latest["continent"], continent_latest["total_deaths"])
plt.title(f"Total COVID-19 Deaths by Continent (as of {latest_date.date()})")
plt.xlabel("Continent")
plt.ylabel("Total deaths (cumulative)")
plt.tight_layout()

out_path = PLOTS_DIR / "continent_bar_plot.png"
plt.savefig(out_path)
plt.show()
print(f"Saved: {out_path.resolve()}")

### OPTIONAL: Stacked bar – total deaths per year by continent
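For each year, we take the **last available date in that year** and compute the cumulative totals per continent. Because the series is cumulative, each bar shows the running total up to the end of that year, not only the deaths that occurred within the year. We then plot a stacked bar chart and save it as `plots/continent_stacked_plot.png`.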

In [None]:
merge_df["year"] = merge_df["date"].dt.year

# For each year, get last date present in that year
last_dates_per_year = merge_df.groupby("year")["date"].max().rename("last_date").reset_index()

year_end = merge_df.merge(last_dates_per_year, on="year")
year_end = year_end[year_end["date"] == year_end["last_date"]]

by_continent_year = (
    year_end.dropna(subset=["continent"])
    .groupby(["year", "continent"], as_index=False)["total_deaths"]
    .sum()
)

pivot = by_continent_year.pivot(index="year", columns="continent", values="total_deaths").fillna(0).sort_index()

ax = pivot.plot(kind="bar", stacked=True, figsize=(9, 6), legend=True)
ax.set_title("Cumulative COVID-19 Deaths at Year End by Continent (stacked)")
ax.set_xlabel("Year")
ax.set_ylabel("Total deaths (cumulative)")
plt.tight_layout()

out_path = PLOTS_DIR / "continent_stacked_plot.png"
plt.savefig(out_path)
plt.show()
print(f"Saved: {out_path.resolve()}")
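
If deaths that occurred *within* each year are wanted instead of year-end cumulative totals, the year-end table can be differenced along the year axis. This is a sketch on invented numbers, not part of the assignment output; `year_end_totals` stands in for a table shaped like `pivot`:

```python
import pandas as pd

# Invented year-end cumulative totals (rows: year, columns: continent)
year_end_totals = pd.DataFrame(
    {"Europe": [100, 250, 300], "Asia": [50, 200, 260]},
    index=pd.Index([2020, 2021, 2022], name="year"),
)

# Difference between consecutive year ends; the first year has no predecessor,
# so its within-year value equals its cumulative value.
within_year = year_end_totals.diff().fillna(year_end_totals).astype(int)
print(within_year)
```

Summing `within_year` down each column recovers the final cumulative total, which is a quick consistency check.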

## Notes / Limitations
- The JHU dataset contains a few non-country entities (e.g., *Diamond Princess*). Those typically do not map to continents and are left unmatched.
- Some countries were renamed or have alternative spellings across sources; a small harmonization dictionary is included. You may expand it if needed to maximize merge coverage.
- Daily new deaths can show spikes due to backfills or corrections. We clip negative values to zero to simplify plotting, so the clipped daily values may no longer sum exactly to the cumulative total.
- The notebook attempts a web fallback if the local `data/` files are missing. If your environment has no internet access, make sure to place both CSV files in `data/`.

## Generative AI Acknowledgment
Parts of this notebook (structure, comments, and helper functions) were created with assistance from ChatGPT (GPT‑5 Thinking). I validated the steps, ran the code, and ensured it aligns with course requirements.