
# COVID-19 Global Data Tracker
A ready-to-run notebook that loads, cleans, analyzes, and visualizes global COVID-19 trends (cases, deaths, and vaccinations).

**Data Source (recommended):**
- Our World in Data (OWID) COVID-19 dataset — CSV: <https://covid.ourworldindata.org/data/owid-covid-data.csv> · Docs: <https://ourworldindata.org/covid-deaths>

**Alternative source (advanced):**
- Johns Hopkins University CSSE COVID-19 GitHub — <https://github.com/CSSEGISandData/COVID-19>

---

## What this notebook covers
1. Data collection (via direct CSV URL from OWID)
2. Data loading & exploration
3. Data cleaning
4. Exploratory data analysis (EDA)
5. Vaccination progress analysis
6. (Optional) Choropleth map with Plotly Express
7. Insights & reporting (write-up cells provided)

> Tip: Run the cells **top-to-bottom**. If you need a PDF, use `File → Save and Export As → PDF` (or print to PDF).



## 0) Setup & Parameters
- Edit the **COUNTRIES** list to focus on your countries of interest.
- Optionally adjust the **START_DATE** and **END_DATE** windows.
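
Before running the rest of the notebook, it can help to sanity-check the date window. The sketch below is one possible way to do that; `resolve_window` is a hypothetical helper that is not part of the notebook, and it assumes the dates are given as ISO strings or `None`.

```python
import pandas as pd

def resolve_window(start=None, end=None):
    """Hypothetical helper: parse optional date strings and check start <= end."""
    start_ts = pd.to_datetime(start) if start else None
    end_ts = pd.to_datetime(end) if end else None
    if start_ts is not None and end_ts is not None and start_ts > end_ts:
        raise ValueError(f"START_DATE {start} is after END_DATE {end}")
    return start_ts, end_ts

# Example values only; None on either side means "no bound".
window = resolve_window("2020-03-01", "2023-12-31")
print(window)
```

Failing early on a reversed window avoids an empty DataFrame that would otherwise surface later as blank plots.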


In [3]:

# === User parameters ===
COUNTRIES = ["Nigeria", "Kenya", "United States", "India", "Brazil"]
START_DATE = None  # e.g., "2020-03-01"
END_DATE = None    # e.g., "2023-12-31"

# === Data source ===
OWID_CSV_URL = "https://covid.ourworldindata.org/data/owid-covid-data.csv"

# === Toggle features ===
ENABLE_PLOTLY_CHOROPLETH = True  # set False if plotly isn't installed

# Library imports
import os
import sys
import math
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Optional libs
try:
    import seaborn as sns
    _HAS_SEABORN = True
except ImportError:
    _HAS_SEABORN = False

try:
    import plotly.express as px
    _HAS_PLOTLY = True
except ImportError:
    _HAS_PLOTLY = False

print("Seaborn available:", _HAS_SEABORN)
print("Plotly available:", _HAS_PLOTLY)


Seaborn available: True
Plotly available: True



## 1) Data Collection
We use the **Our World in Data** (OWID) COVID-19 CSV. It's a single, cleaned file that includes cases, deaths, tests, and vaccinations for all countries and regions, by date.


In [None]:
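import os
import requests

# OWID_CSV_URL is defined in the setup cell above.
DATA_PATH = "owid-covid-data.csv"  # local cache of the OWID CSV

# Download the latest CSV; fall back to a cached copy if the request fails.
try:
    r = requests.get(OWID_CSV_URL, timeout=30)
    r.raise_for_status()
    with open(DATA_PATH, "wb") as f:
        f.write(r.content)
    print("Download successful!")
except (requests.RequestException, OSError) as e:
    print("Download failed:", e)
    if not os.path.exists(DATA_PATH):
        raise FileNotFoundError(f"No cached copy found at {DATA_PATH}") from e
    print("Using cached copy:", DATA_PATH)

df_raw = pd.read_csv(DATA_PATH)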
print(df_raw.shape)
df_raw.head()
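
The full OWID file is wide, so loading only the columns an analysis needs can reduce memory use. The following is a minimal sketch, assuming a small in-memory CSV as a stand-in for the real file (the values are made up, and the `extra` column is hypothetical); `usecols` and `parse_dates` are standard `pandas.read_csv` parameters.

```python
import io
import pandas as pd

# Tiny stand-in for the OWID file so the sketch runs offline (made-up values).
sample_csv = io.StringIO(
    "iso_code,continent,location,date,total_cases,total_deaths,extra\n"
    "NGA,Africa,Nigeria,2020-03-01,1,0,x\n"
    "NGA,Africa,Nigeria,2020-03-02,2,0,y\n"
)

# Columns to keep; anything else in the file is skipped at parse time.
wanted = ["iso_code", "continent", "location", "date", "total_cases", "total_deaths"]
df_small = pd.read_csv(sample_csv, usecols=wanted, parse_dates=["date"])
print(df_small.dtypes)
```

With the real data, `sample_csv` would be replaced by the local file path used in the cell above.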



## 2) Data Loading & Exploration
Quickly inspect columns, sample rows, and missingness.


In [None]:

print("Columns ({}):".format(len(df_raw.columns)))
print(df_raw.columns.tolist()[:50])  # show first 50 to keep it readable

display(df_raw.head(10))

# Missingness summary
missing_summary = df_raw.isnull().sum().sort_values(ascending=False)
display(missing_summary.head(25))
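
Raw null counts are hard to compare across columns, so a percentage view is often easier to read. This is a small sketch on a toy frame (made-up values), not the notebook's own summary; it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values, standing in for df_raw.
toy = pd.DataFrame({
    "total_cases": [1.0, 2.0, np.nan, 4.0],
    "total_deaths": [np.nan, np.nan, np.nan, 1.0],
    "location": ["A", "A", "B", "B"],
})

# Share of missing values per column, as a percentage, worst first.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```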



## 3) Data Cleaning
Steps:
- Keep **countries** only, using the `continent` column (OWID aggregates such as continents have no continent value).
- Filter to `COUNTRIES` if specified.
- Convert `date` to datetime and apply optional date filters.
- Forward-fill selected numeric columns within each country for smoother visuals (optional).


In [None]:

# Keep rows where location is a country (continent not null)
df = df_raw[df_raw["continent"].notna()].copy()

# Focus on selected countries (if present in data)
if COUNTRIES:
    df = df[df["location"].isin(COUNTRIES)].copy()

# Parse dates
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df = df.sort_values(["location", "date"])

# Optional date window filter
if START_DATE:
    df = df[df["date"] >= pd.to_datetime(START_DATE)]
if END_DATE:
    df = df[df["date"] <= pd.to_datetime(END_DATE)]

# Choose a subset of numeric columns to clean (commonly used)
numeric_cols = [
    "total_cases","new_cases","total_deaths","new_deaths",
    "total_vaccinations","people_vaccinated","people_fully_vaccinated",
    "total_boosters","new_vaccinations","population"
]

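# Note: forward-filling daily columns (new_*) repeats the previous day's value;
# remove them from numeric_cols if that matters for your analysis.
for col in numeric_cols:
    if col in df.columns:
        # Forward fill within each country; GroupBy.ffill keeps the original index
        df[col] = df.groupby("location")[col].ffill()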

print("Cleaned shape:", df.shape)
df.head()
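
To see what a per-country forward fill does, here is a minimal sketch on a toy frame with made-up values. It assumes rows are sorted by country and date first, so a gap is filled only from that country's earlier rows and never from another country.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; country B starts with a gap.
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01", "2021-01-02"]
    ),
    "total_cases": [10.0, np.nan, 12.0, np.nan, 5.0],
})

toy = toy.sort_values(["location", "date"])
# Fill each gap with the last known value from the same country.
toy["total_cases_filled"] = toy.groupby("location")["total_cases"].ffill()
print(toy)
```

The leading gap for country B stays empty because there is no earlier value within that group to carry forward.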



## 4) Exploratory Data Analysis (EDA)

We'll look at:
- **Total cases over time** for selected countries
- **Total deaths over time**
- **Daily new cases** comparison
- **Case fatality ratio (CFR)** = `total_deaths / total_cases`
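
As a worked example of the CFR formula, the sketch below uses made-up numbers. One assumption worth labeling: rows with zero cumulative cases are treated as missing, so the ratio comes out as `NaN` instead of a division-by-zero artifact.

```python
import numpy as np
import pandas as pd

# Made-up cumulative counts for three dates.
toy = pd.DataFrame({
    "total_cases": [0.0, 100.0, 200.0],
    "total_deaths": [0.0, 2.0, 5.0],
})

# Treat zero cases as missing so the ratio is NaN rather than 0/0.
toy["cfr"] = toy["total_deaths"] / toy["total_cases"].replace(0, np.nan)
print(toy)
```

Here 2 deaths out of 100 cases gives a CFR of 0.02 (2%), and 5 out of 200 gives 0.025 (2.5%).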


In [None]:

# Helper to line-plot a metric over time for selected countries
def plot_over_time(metric, title):
    plt.figure(figsize=(10, 5))
    for country in sorted(df["location"].unique()):
        sub = df[df["location"] == country]
        if metric in sub.columns and sub[metric].notna().any():
            plt.plot(sub["date"], sub[metric], label=country)
    plt.title(title)
    plt.xlabel("Date")
    plt.ylabel(metric.replace("_", " ").title())
    plt.legend(loc="best")
    plt.grid(True)
    plt.show()

# Total cases & deaths
if "total_cases" in df.columns:
    plot_over_time("total_cases", "Total COVID-19 Cases Over Time")

if "total_deaths" in df.columns:
    plot_over_time("total_deaths", "Total COVID-19 Deaths Over Time")

# Daily new cases
if "new_cases" in df.columns:
    plot_over_time("new_cases", "Daily New COVID-19 Cases")

# Case fatality ratio (CFR); zero case counts become NaN to avoid division by zero
if {"total_deaths","total_cases"}.issubset(df.columns):
    df["cfr"] = df["total_deaths"] / df["total_cases"].replace(0, np.nan)
    plot_over_time("cfr", "Case Fatality Ratio (Total Deaths / Total Cases)")
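
Daily case counts tend to be noisy, so a rolling mean can make country comparisons easier to read. The sketch below is an illustration on toy data, not part of the notebook: the column name `new_cases_avg3` is hypothetical, and a 3-day window is used only to keep the example small (a 7-day window is a common choice for daily reporting data).

```python
import pandas as pd

# Toy daily counts for two countries (made-up values).
toy = pd.DataFrame({
    "location": ["A"] * 4 + ["B"] * 4,
    "date": list(pd.date_range("2021-01-01", periods=4)) * 2,
    "new_cases": [10, 20, 30, 40, 1, 1, 1, 1],
})

toy = toy.sort_values(["location", "date"])
# Rolling mean computed separately per country; min_periods=1 keeps early rows.
toy["new_cases_avg3"] = (
    toy.groupby("location")["new_cases"]
    .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)
print(toy)
```

Grouping before rolling keeps one country's values from leaking into the start of the next country's window.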


In [None]:

# Top countries by total cases (latest date per country)
if "total_cases" in df.columns:
    latest = df.sort_values("date").groupby("location").tail(1)
    top_cases = latest[["location","total_cases"]].dropna().sort_values("total_cases", ascending=False).head(15)

    plt.figure(figsize=(10, 5))
    plt.bar(top_cases["location"], top_cases["total_cases"])
    plt.xticks(rotation=45, ha="right")
    plt.title("Top Countries by Total COVID-19 Cases (latest)")
    plt.ylabel("Total Cases")
    plt.tight_layout()
    plt.show()

    display(top_cases.reset_index(drop=True))
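
One caveat with taking the last row per country is that the most recent row may have no value for the metric. The sketch below contrasts that approach with `GroupBy.last()`, which returns the last non-null value in each group; the toy values are made up.

```python
import numpy as np
import pandas as pd

# Country A has no value on its most recent date.
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"]),
    "total_cases": [100.0, np.nan, 50.0, 60.0],
})

ordered = toy.sort_values("date")
latest_row = ordered.groupby("location").tail(1)                # may contain NaN
latest_value = ordered.groupby("location")["total_cases"].last()  # skips NaN
print(latest_row)
print(latest_value)
```

Which behavior is preferable depends on the question: the last row reflects the latest report, while the last non-null value reflects the latest known figure.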



## 5) Visualizing Vaccination Progress
We'll examine cumulative vaccinations and vaccination coverage (% of population) for selected countries.


In [None]:

# % vaccinated metrics if available
cols_needed = {"people_vaccinated","people_fully_vaccinated","population"}
if cols_needed.issubset(df.columns):
    for col in ["people_vaccinated","people_fully_vaccinated"]:
        share_col = col + "_share_of_pop"
        df[share_col] = df[col] / df["population"]

    # Plot shares over time
    def plot_share(metric, title):
        plt.figure(figsize=(10, 5))
        for country in sorted(df["location"].unique()):
            sub = df[df["location"] == country]
            if metric in sub.columns and sub[metric].notna().any():
                plt.plot(sub["date"], sub[metric]*100, label=country)
        plt.title(title)
        plt.xlabel("Date")
        plt.ylabel("% of Population")
        plt.legend(loc="best")
        plt.grid(True)
        plt.show()

    plot_share("people_vaccinated_share_of_pop", "People Vaccinated (% of population)")
    plot_share("people_fully_vaccinated_share_of_pop", "People Fully Vaccinated (% of population)")

# Cumulative vaccinations
if "total_vaccinations" in df.columns:
    plot_over_time("total_vaccinations", "Total Vaccinations Over Time")
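
The coverage calculation is a simple ratio, shown here as a worked example with made-up numbers. The column name `pct_vaccinated` is hypothetical and used only for this sketch.

```python
import pandas as pd

# Made-up counts for two countries.
toy = pd.DataFrame({
    "location": ["A", "B"],
    "people_vaccinated": [500_000, 1_200_000],
    "population": [1_000_000, 4_000_000],
})

# Coverage as a percentage of population.
toy["pct_vaccinated"] = 100 * toy["people_vaccinated"] / toy["population"]
print(toy)
```

Country B has more vaccinated people in absolute terms but lower coverage, which is why the notebook plots shares instead of raw counts when comparing countries.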



## 6) (Optional) Choropleth Map — Cases or Vaccination Rates by Country
This section uses **Plotly Express**. If it's not installed, either set `ENABLE_PLOTLY_CHOROPLETH=False` at the top, or install it with:

```bash
pip install plotly
```


In [None]:

if ENABLE_PLOTLY_CHOROPLETH and _HAS_PLOTLY:
    # Choose metric here: "total_cases_per_million" if available else "total_cases"
    metric = "total_cases_per_million" if "total_cases_per_million" in df_raw.columns else "total_cases"

    # Latest row per country that actually has a value for the metric
    latest_world = (
        df_raw[df_raw["continent"].notna()]
        .dropna(subset=[metric])
        .sort_values("date")
        .groupby("iso_code", as_index=False)
        .tail(1)
    )

    fig = px.choropleth(
        latest_world,
        locations="iso_code",
        color=metric,
        hover_name="location",
        projection="natural earth",
        title=f"Global {metric.replace('_',' ').title()} (latest)",
    )
    fig.show()
else:
    print("Plotly unavailable or disabled. Skipping choropleth.")
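
Plotly Express matches `locations` against ISO-3 country codes by default, so the map input should hold one row per real country code. The following sketch prepares such a table with pandas only. It assumes that aggregate rows carry `iso_code` values prefixed with `OWID_` (for example `OWID_WRL`), which is an assumption about the file and an alternative to the `continent` filter used above; the toy values are made up.

```python
import pandas as pd

# Toy rows: two countries plus one aggregate (made-up values).
toy = pd.DataFrame({
    "iso_code": ["NGA", "NGA", "KEN", "OWID_WRL"],
    "location": ["Nigeria", "Nigeria", "Kenya", "World"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-02", "2021-01-02"]),
    "total_cases_per_million": [10.0, None, 7.0, 9.0],
})

metric = "total_cases_per_million"
map_df = (
    toy[~toy["iso_code"].str.startswith("OWID_")]  # assumed aggregate prefix
    .dropna(subset=[metric])                       # keep rows with a value
    .sort_values("date")
    .groupby("iso_code", as_index=False)
    .tail(1)                                       # latest valued row per code
)
print(map_df)
```

The resulting frame could then be passed to `px.choropleth` in place of `latest_world`.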



## 7) Insights & Reporting (Write Your Narrative Here)
Use this section to summarize 3–5 key findings. For example:
- Country X had the fastest vaccine rollout in 2021, reaching Y% fully vaccinated by DATE.
- Country A and Country B saw similar case waves in mid-2020, but diverged in mortality rates in 2021.
- Anomalies: Sudden spikes in Country C's daily cases were due to reporting changes on DATE.

**Template:**
- **Trend 1:** …  
- **Trend 2:** …  
- **Trend 3:** …  
- **Open questions/limitations:** Data lags, reporting changes, and definitions differ by country.



## Appendix A) (Optional) Correlation Heatmap
This examines relationships among numeric variables for a single selected country.


In [None]:

if _HAS_SEABORN:
    example_country = COUNTRIES[0] if COUNTRIES else "Nigeria"
    sub = df[df["location"] == example_country]
    numeric = sub.select_dtypes(include=[np.number]).dropna(axis=1, how="all").fillna(0)
    corr = numeric.corr()

    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=False)
    plt.title(f"Correlation Heatmap — {example_country}")
    plt.show()
else:
    print("Seaborn not available; skipping heatmap.")
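
Filling gaps with zeros before computing correlations can distort the result, because the zeros are treated as real observations. `DataFrame.corr()` already excludes missing values pairwise, so the sketch below compares both approaches on made-up data where one column is exactly twice the other wherever both are present.

```python
import numpy as np
import pandas as pd

# new_deaths is exactly 2 * new_cases, with one missing value.
toy = pd.DataFrame({
    "new_cases": [1.0, 2.0, 3.0, 4.0, 5.0],
    "new_deaths": [2.0, 4.0, np.nan, 8.0, 10.0],
})

corr_pairwise = toy.corr().loc["new_cases", "new_deaths"]
corr_zero_filled = toy.fillna(0).corr().loc["new_cases", "new_deaths"]
print("pairwise:", round(corr_pairwise, 3))
print("zero-filled:", round(corr_zero_filled, 3))
```

The pairwise result recovers the perfect linear relationship, while the zero-filled version reports a weaker one. Whether zero is a sensible fill value depends on what a gap means in each column.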



## Appendix B) Export Figures & Data (Optional)
Uncomment and run to save outputs to a local `outputs/` folder.


In [None]:

# from pathlib import Path
# out_dir = Path("outputs")
# out_dir.mkdir(exist_ok=True)

# # Example: export latest snapshot by country
# latest_snapshot = df.sort_values("date").groupby("location").tail(1)
# latest_snapshot.to_csv(out_dir / "latest_snapshot.csv", index=False)

# # Example: save a static plot
# plt.figure(figsize=(10, 5))
# for country in sorted(df["location"].unique()):
#     sub = df[df["location"] == country]
#     if "total_cases" in sub.columns and sub["total_cases"].notna().any():
#         plt.plot(sub["date"], sub["total_cases"], label=country)
# plt.title("Total COVID-19 Cases Over Time")
# plt.xlabel("Date"); plt.ylabel("Total Cases"); plt.legend(); plt.grid(True)
# plt.tight_layout()
# plt.savefig(out_dir / "total_cases_over_time.png", dpi=200)
# plt.close()
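
To check that an export round-trips cleanly, a snapshot can be written and read back. This is a minimal sketch on toy data (made-up values) that writes to a temporary directory, an assumption made here so the example leaves no files behind; the notebook itself targets a local `outputs/` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data standing in for the cleaned frame.
toy = pd.DataFrame({
    "location": ["A", "A", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-02"]),
    "total_cases": [1, 2, 3],
})

# Temporary folder used only for this sketch.
out_dir = Path(tempfile.mkdtemp()) / "outputs"
out_dir.mkdir(parents=True, exist_ok=True)

latest_snapshot = toy.sort_values("date").groupby("location").tail(1)
csv_path = out_dir / "latest_snapshot.csv"
latest_snapshot.to_csv(csv_path, index=False)

# Read the file back and confirm the dates parse again.
round_trip = pd.read_csv(csv_path, parse_dates=["date"])
print(round_trip)
```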
