
# COVID-19 Global Data Tracker

**Author:** Mercy Wafula  
**Date:** 2025-08-21

This notebook tracks global COVID-19 trends using the **Our World in Data** dataset.  
You will import, clean, analyze, and visualize data on cases, deaths, and vaccinations, then communicate the insights.



## Project Objectives
- Import and clean COVID-19 global data
- Analyze time trends (cases, deaths, vaccinations)
- Compare metrics across countries/regions
- Visualize with charts (and an optional choropleth map)
- Communicate findings with a clear narrative



## 1) Setup
Install and import the libraries. You may need to uncomment and run the `%pip install` line once.


In [None]:

# Run this once if needed (uncomment to install)
# %pip install pandas matplotlib seaborn plotly geopandas pycountry

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

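# Plotly is optional (only needed for the choropleth map in section 6)
try:
    import plotly.express as px
except ImportError:
    px = None  # section 6 will not run without Plotly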

# For display options
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)



## 2) Data Collection & Loading

We will use the Our World in Data (OWID) COVID-19 dataset, a pre-cleaned CSV with one row per location and date.  
- Download: https://covid.ourworldindata.org/data/owid-covid-data.csv  

Save the file in the same folder as this notebook, or set `DATA_PATH` to the CSV URL to read it directly.
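
The choice between a local file and the URL can be automated. The sketch below is one possible approach, assuming a hypothetical helper named `resolve_data_path` (it is not part of this notebook or of pandas): it returns the local path when the file exists and falls back to the OWID URL otherwise.

```python
from pathlib import Path

OWID_URL = "https://covid.ourworldindata.org/data/owid-covid-data.csv"

def resolve_data_path(local_name="owid-covid-data.csv", url=OWID_URL):
    """Return the local CSV path if it exists, otherwise the download URL.

    Hypothetical helper for illustration; pandas.read_csv accepts either value.
    """
    local = Path(local_name)
    return str(local) if local.is_file() else url

DATA_PATH = resolve_data_path()
print(DATA_PATH)
```

Because `pd.read_csv` accepts both file paths and URLs, the loading cell does not need to change.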


In [None]:

# Path to local file (preferred after first download) or use the URL directly
DATA_PATH = "owid-covid-data.csv"  # or "https://covid.ourworldindata.org/data/owid-covid-data.csv"

# Load data
df = pd.read_csv(DATA_PATH, low_memory=False)
print(df.shape)
df.head()



### Explore Structure


In [None]:

# Columns overview
df.columns.tolist()[:40]


In [None]:

# Preview
df.head(10)


In [None]:

# Missing values snapshot (top 30 columns for brevity)
df.isnull().sum().sort_values(ascending=False).head(30)
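
Raw counts are hard to judge without the row total, so a percentage column can help. The following is a minimal sketch on made-up data; `missing_summary` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, most-missing first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Toy data standing in for the OWID frame (values are made up).
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "total_cases": [1.0, np.nan, 5.0, 6.0],
    "total_boosters": [np.nan, np.nan, np.nan, 2.0],
})
print(missing_summary(toy))
```

On the real data, `missing_summary(df).head(30)` would give the same top-30 view with percentages attached.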



## 3) Data Cleaning

- Keep relevant columns
- Convert `date` to datetime
- Filter to countries (exclude aggregates such as "World" and the continents)
- Optionally focus on a subset of countries (e.g., Kenya, USA, India)
- Handle missing numeric values
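
The steps above can be sketched on a small made-up frame before applying them to the full dataset. The rows and values below are invented for illustration; only the column names and the `OWID_` prefix convention come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows: two countries and one OWID aggregate.
toy = pd.DataFrame({
    "iso_code": ["KEN", "KEN", "KEN", "IND", "IND", "OWID_WRL"],
    "location": ["Kenya", "Kenya", "Kenya", "India", "India", "World"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-03",
             "2021-01-01", "2021-01-02", "2021-01-01"],
    "total_cases": [10.0, np.nan, 12.0, np.nan, 7.0, 100.0],
})

# 1) Parse dates; unparseable strings become NaT and are dropped.
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")
toy = toy.dropna(subset=["date"])

# 2) Drop aggregates; na=False treats a missing iso_code as "not an aggregate".
countries = toy[~toy["iso_code"].str.startswith("OWID_", na=False)].copy()

# 3) Forward fill inside each country so values never leak between countries.
countries = countries.sort_values(["location", "date"])
countries["total_cases"] = countries.groupby("location")["total_cases"].ffill()

print(countries)
```

India's first row stays missing because there is no earlier Indian value to carry forward, which is the intended behavior of a per-group forward fill.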


In [None]:

# Relevant columns
cols = [
    "iso_code","continent","location","date",
    "total_cases","new_cases","total_deaths","new_deaths",
    "total_vaccinations","people_vaccinated","people_fully_vaccinated",
    "total_boosters","new_vaccinations","population","gdp_per_capita"
]
df = df[cols].copy()

# Convert date
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df = df.dropna(subset=["date"])

# Keep countries (exclude aggregates: iso_code starting with OWID_)
# na=False keeps the mask boolean when iso_code is missing
df_countries = df[~df["iso_code"].str.startswith("OWID_", na=False)].copy()

# Choose focus countries (edit as needed)
focus_countries = ["Kenya", "United States", "India"]
df_focus = df_countries[df_countries["location"].isin(focus_countries)].copy()

# Sort
df_focus = df_focus.sort_values(["location","date"])

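# Handle missing numeric values within each country.
# Cumulative counts: forward fill only, so dates before the first report stay NaN.
cumulative_cols = [
    "total_cases","total_deaths","total_vaccinations",
    "people_vaccinated","people_fully_vaccinated","total_boosters"
]
# Static attributes: fill in both directions.
static_cols = ["population","gdp_per_capita"]
# Daily counts (new_cases, new_deaths, new_vaccinations) are left as reported.
grouped = df_focus.groupby("location")
df_focus[cumulative_cols] = grouped[cumulative_cols].ffill()
df_focus[static_cols] = grouped[static_cols].transform(lambda s: s.ffill().bfill())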

df_focus.head()



## 4) Exploratory Data Analysis (EDA)

### 4.1 Total Cases Over Time (Selected Countries)


In [None]:

plt.figure()
for country in focus_countries:
    tmp = df_focus[df_focus["location"]==country]
    plt.plot(tmp["date"], tmp["total_cases"], label=country)
plt.title("Total COVID-19 Cases Over Time")
plt.xlabel("Date")
plt.ylabel("Total Cases")
plt.legend()
plt.tight_layout()
plt.show()



### 4.2 Total Deaths Over Time (Selected Countries)


In [None]:

plt.figure()
for country in focus_countries:
    tmp = df_focus[df_focus["location"]==country]
    plt.plot(tmp["date"], tmp["total_deaths"], label=country)
plt.title("Total COVID-19 Deaths Over Time")
plt.xlabel("Date")
plt.ylabel("Total Deaths")
plt.legend()
plt.tight_layout()
plt.show()



### 4.3 Daily New Cases Comparison


In [None]:

plt.figure()
for country in focus_countries:
    tmp = df_focus[df_focus["location"]==country]
    plt.plot(tmp["date"], tmp["new_cases"].fillna(0), label=country)
plt.title("Daily New COVID-19 Cases")
plt.xlabel("Date")
plt.ylabel("New Cases")
plt.legend()
plt.tight_layout()
plt.show()
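
Daily counts tend to be noisy, so a rolling mean is a common way to make the trend easier to compare. This is an optional sketch, not part of the original analysis; `add_rolling_mean` is a hypothetical helper, and the toy data uses a 2-day window only to keep the numbers easy to check (7 days is the default).

```python
import pandas as pd

def add_rolling_mean(frame, column="new_cases", window=7):
    """Add a per-country rolling mean of `column` as `<column>_<window>d_avg`."""
    out = frame.sort_values(["location", "date"]).copy()
    out[f"{column}_{window}d_avg"] = (
        out.groupby("location")[column]
           .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )
    return out

# Made-up daily counts for two countries.
toy = pd.DataFrame({
    "location": ["Kenya"] * 4 + ["India"] * 4,
    "date": list(pd.date_range("2021-01-01", periods=4)) * 2,
    "new_cases": [0, 70, 0, 0, 10, 10, 10, 10],
})
smoothed = add_rolling_mean(toy, window=2)
print(smoothed)
```

With the real data, `add_rolling_mean(df_focus)` would add a `new_cases_7d_avg` column that can be plotted in place of `new_cases`.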



### 4.4 Case Fatality Ratio (CFR) — Death Rate


In [None]:

df_focus["death_rate"] = np.where(
    df_focus["total_cases"]>0,
    (df_focus["total_deaths"] / df_focus["total_cases"]) * 100.0,
    np.nan
)

plt.figure()
for country in focus_countries:
    tmp = df_focus[df_focus["location"]==country]
    plt.plot(tmp["date"], tmp["death_rate"], label=country)
plt.title("Case Fatality Ratio (Total Deaths / Total Cases) %")
plt.xlabel("Date")
plt.ylabel("Death Rate (%)")
plt.legend()
plt.tight_layout()
plt.show()
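
The ratio plotted above is cumulative deaths divided by cumulative cases, expressed as a percentage. A small worked example with made-up numbers shows the arithmetic and the zero-case guard; `case_fatality_ratio` is a hypothetical helper written for this illustration.

```python
import numpy as np
import pandas as pd

def case_fatality_ratio(total_deaths, total_cases):
    """CFR in percent; NaN where cases are zero or missing."""
    deaths = pd.Series(total_deaths, dtype="float64")
    cases = pd.Series(total_cases, dtype="float64")
    # Mask non-positive case counts so the division yields NaN instead of inf.
    return (deaths / cases.where(cases > 0)) * 100.0

# 0 deaths / 0 cases, 5 / 100, and 50 / 1000 (made-up values).
cfr = case_fatality_ratio([0, 5, 50], [0, 100, 1000])
print(cfr.tolist())
```

Because reported cases depend on how much testing a country does, this ratio is best read as a rough comparison rather than a true infection fatality rate.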



## 5) Visualizing Vaccination Progress


In [None]:

plt.figure()
for country in focus_countries:
    tmp = df_focus[df_focus["location"]==country]
    plt.plot(tmp["date"], tmp["people_vaccinated"], label=country)
plt.title("People with at least one dose")
plt.xlabel("Date")
plt.ylabel("People Vaccinated")
plt.legend()
plt.tight_layout()
plt.show()


In [None]:

# Percent vaccinated (at least one dose)
df_focus["pct_vaccinated"] = np.where(
    df_focus["population"]>0,
    (df_focus["people_vaccinated"] / df_focus["population"]) * 100.0,
    np.nan
)

plt.figure()
for country in focus_countries:
    tmp = df_focus[df_focus["location"]==country]
    plt.plot(tmp["date"], tmp["pct_vaccinated"], label=country)
plt.title("% Population Vaccinated (≥1 dose)")
plt.xlabel("Date")
plt.ylabel("Percent (%)")
plt.legend()
plt.tight_layout()
plt.show()



### 5.1 Top Countries by Total Cases (Latest Date)


In [None]:

# Latest date per country
idx = df_countries.groupby("location")["date"].transform("max") == df_countries["date"]
latest = df_countries[idx].copy()

top_cases = latest.nlargest(10, "total_cases")[["location","total_cases","total_deaths","population"]]

plt.figure()
plt.bar(top_cases["location"], top_cases["total_cases"])
plt.title("Top 10 Countries by Total Cases (Latest Available)")
plt.xlabel("Country")
plt.ylabel("Total Cases")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

top_cases
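
If the most recent row for a country has a missing total, taking the row with the maximum date drops that country from the ranking. One alternative, sketched below on made-up data, takes the last non-null value per column with `GroupBy.last()` and then normalizes by population. The helper name `latest_valid_snapshot` and the `cases_per_million` column are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd

def latest_valid_snapshot(frame, value_cols):
    """Last non-null value of each column per country (rows sorted by date)."""
    ordered = frame.sort_values(["location", "date"])
    return ordered.groupby("location")[value_cols].last().reset_index()

# Made-up data: country A has no total on its latest date.
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-02"] * 2),
    "total_cases": [100.0, np.nan, 40.0, 50.0],
    "population": [1_000_000.0, 1_000_000.0, 100_000.0, 100_000.0],
})
snap = latest_valid_snapshot(toy, ["total_cases", "population"])
snap["cases_per_million"] = snap["total_cases"] / snap["population"] * 1e6
print(snap.sort_values("cases_per_million", ascending=False))
```

Ranking by a per-capita figure changes the picture for small countries, so it is worth showing alongside the absolute totals rather than instead of them.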



## 6) Optional: Choropleth Map (Cases or Vaccination Rate)


In [None]:

# Prepare latest snapshot with iso_code and metrics
latest_map = latest[["iso_code","location","total_cases","people_vaccinated","population"]].copy()
latest_map["pct_vaccinated"] = np.where(
    latest_map["population"]>0,
    (latest_map["people_vaccinated"]/latest_map["population"])*100.0,
    np.nan
)

# Plot total cases as a choropleth (interactive; needs a Plotly-capable renderer in Jupyter)
fig = px.choropleth(
    latest_map.dropna(subset=["iso_code"]),
    locations="iso_code",
    color="total_cases",
    hover_name="location",
    color_continuous_scale="Viridis",
    title="Total COVID-19 Cases by Country (Latest Available)",
    projection="natural earth"
)
fig.show()
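
The same map can show vaccination rate instead of total cases. The sketch below separates data preparation from plotting so the preparation can be checked without Plotly installed. The helper names are hypothetical, the rows are made up, and capping the percentage at 100 is an assumption made here to keep the color scale readable when reported doses exceed the population estimate.

```python
import numpy as np
import pandas as pd

def build_map_frame(snapshot):
    """Keep rows with an ISO code and compute percent vaccinated, capped at 100."""
    frame = snapshot.dropna(subset=["iso_code"]).copy()
    population = frame["population"].where(frame["population"] > 0)
    pct = frame["people_vaccinated"] / population * 100.0
    frame["pct_vaccinated"] = pct.clip(upper=100.0)  # cap is an assumption
    return frame

def plot_vaccination_map(frame):
    """Draw the choropleth; Plotly is imported lazily because it is optional."""
    import plotly.express as px
    return px.choropleth(
        frame,
        locations="iso_code",
        color="pct_vaccinated",
        hover_name="location",
        color_continuous_scale="Viridis",
        range_color=(0, 100),
        title="% Population Vaccinated (≥1 dose, Latest Available)",
        projection="natural earth",
    )

# Made-up snapshot rows.
toy = pd.DataFrame({
    "iso_code": ["KEN", "IND", None, "USA"],
    "location": ["Kenya", "India", "Unknown", "United States"],
    "people_vaccinated": [500.0, 1200.0, 10.0, 50.0],
    "population": [1000.0, 1000.0, 100.0, 0.0],
})
map_frame = build_map_frame(toy)
print(map_frame)
# plot_vaccination_map(map_frame).show()  # uncomment in Jupyter with Plotly installed
```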



## 7) Insights & Reporting (Write Your Narrative)

- **Insight 1:** _e.g., Country X had the fastest vaccine rollout in 2021._
- **Insight 2:** _e.g., Country Y shows a steady decline in new cases after vaccination reached Z%._
- **Insight 3:** _e.g., Death rate stabilized after ..._

Use this space to explain patterns, anomalies, and key comparisons.



## 8) Reproducibility
- Ensure the notebook runs top-to-bottom without errors.
- Lock library versions (see `requirements.txt`).
- Save figures if needed and export to PDF (optional).
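
To lock versions, the installed version of each library can be read with the standard-library `importlib.metadata` module and printed as pip-style pins. This is a minimal sketch; `installed_versions` is a hypothetical helper, and the package list mirrors the libraries imported in the Setup section.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(["pandas", "numpy", "matplotlib", "seaborn", "plotly"])
for name, ver in versions.items():
    # Emit pip-style pins for installed packages only.
    if ver is not None:
        print(f"{name}=={ver}")
```

The printed lines can be pasted into `requirements.txt` so that others can recreate the same environment.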


In [None]:

# Optional: Save a figure example
# plt.figure()
# ... your plot ...
# plt.savefig("figure_example.png", dpi=200, bbox_inches="tight")
