# COVID-19 Global Data Tracker
**Date generated:** 2025-08-24

This notebook guides you through loading, cleaning, analyzing, and visualizing global COVID-19 data in clear, reproducible steps, with a short narrative block before each stage.

**Data Source (recommended):** Our World in Data — `owid-covid-data.csv`  
If you don't have the file locally, this notebook can attempt to download it for you.

---

## Objectives
- Import and clean global COVID-19 data
- Analyze time trends (cases, deaths, vaccinations)
- Compare metrics across countries/regions
- Visualize trends with charts and a world choropleth (optional)
- Summarize findings with clear, concise insights

## 0) Project Setup

**Instructions**
- If your environment blocks internet, manually download `owid-covid-data.csv`
  from Our World in Data and place it in the same folder as this notebook.
- Otherwise, the next cell can fetch it automatically.

In [None]:
DATA_URL = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
DATA_FILE = "owid-covid-data.csv"

SELECTED_COUNTRIES = ["Kenya", "United States", "India", "United Kingdom", "Brazil"]

ROLLING_DAYS = 7

print("Configured countries:", SELECTED_COUNTRIES)

## 1) Data Collection & Loading

In [None]:
import os
import pandas as pd

if not os.path.exists(DATA_FILE):
    try:
        import urllib.request
        print("Downloading data from Our World in Data...")
        urllib.request.urlretrieve(DATA_URL, DATA_FILE)
        print("Download complete:", DATA_FILE)
    except Exception as e:
        print("Could not download automatically.")
        print("Error:", e)
        print("➡️ Please manually place 'owid-covid-data.csv' next to this notebook and re-run.")

df = pd.read_csv(DATA_FILE, parse_dates=["date"])
print("Rows:", len(df), "| Columns:", len(df.columns))
df.head(3)

### Quick Schema & Missing Values

In [None]:
df.info()

In [None]:
miss = df.isna().sum().sort_values(ascending=False)
miss.head(25)
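
The raw counts above are easier to judge as a share of rows. The sketch below shows one way to report both on a small made-up frame; `missing_report` is a hypothetical helper and the toy values are not real data.

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    report = pd.DataFrame({
        "missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
    })
    return report.sort_values("missing", ascending=False)

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "total_cases": [1.0, None, 3.0, 4.0],
    "total_boosters": [None, None, None, 5.0],
})
report = missing_report(toy)
print(report)
```

On the real data, `missing_report(df).head(25)` would give the same ranking as the cell above, with a percentage column added.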

## 2) Data Cleaning

**Goals**
- Filter to countries of interest
- Keep columns relevant to the analysis
- Ensure date and numeric fields are in correct formats
- Fill or interpolate missing values where suitable

In [None]:
import numpy as np

keep_cols = [
    "iso_code","continent","location","date",
    "total_cases","new_cases","total_deaths","new_deaths",
    "total_vaccinations","people_vaccinated","people_fully_vaccinated",
    "total_boosters","new_vaccinations","population","population_density",
    "median_age","aged_65_older","aged_70_older",
    "gdp_per_capita","hospital_beds_per_thousand",
    "life_expectancy",
    "people_vaccinated_per_hundred","people_fully_vaccinated_per_hundred",
    "total_boosters_per_hundred"
]

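df_clean = df[keep_cols].copy()

# Keep only the countries of interest, ordered by country and date
df_clean = df_clean[df_clean["location"].isin(SELECTED_COUNTRIES)]
df_clean = df_clean.sort_values(["location", "date"]).reset_index(drop=True)

# Forward-fill within each country. Daily counts (new_*) are left untouched,
# because repeating the previous day's value would invent cases or deaths.
fill_cols = [c for c in keep_cols if not c.startswith("new_") and c not in ("location", "date")]
df_clean[fill_cols] = df_clean.groupby("location")[fill_cols].ffill()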


num_cols = [
    "total_cases","new_cases","total_deaths","new_deaths",
    "total_vaccinations","people_vaccinated","people_fully_vaccinated",
    "total_boosters","new_vaccinations","population","population_density",
    "median_age","aged_65_older","aged_70_older","gdp_per_capita",
    "hospital_beds_per_thousand","life_expectancy",
    "people_vaccinated_per_hundred","people_fully_vaccinated_per_hundred",
    "total_boosters_per_hundred"
]
for c in num_cols:
    if c in df_clean:
        df_clean[c] = pd.to_numeric(df_clean[c], errors="coerce")

# Derive metrics (CFR stays NaN where no cases have been reported yet)
df_clean["case_fatality_rate"] = df_clean["total_deaths"] / df_clean["total_cases"].replace(0, np.nan)
df_clean["new_cases_smoothed"] = df_clean.groupby("location")["new_cases"].transform(
    lambda s: s.rolling(ROLLING_DAYS, min_periods=1).mean()
)
df_clean["new_deaths_smoothed"] = df_clean.groupby("location")["new_deaths"].transform(
    lambda s: s.rolling(ROLLING_DAYS, min_periods=1).mean()
)

df_clean.head(3)
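
To make the per-country fill and smoothing steps concrete, here is a minimal sketch on a made-up two-country frame. It uses a 2-day window instead of `ROLLING_DAYS` to keep the numbers small, and forward-fills only the cumulative column; all values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "total_cases": [10.0, None, 30.0, None, 5.0, None],
    "new_cases": [10.0, 8.0, 12.0, 0.0, 5.0, 1.0],
}).sort_values(["location", "date"])

# Forward-fill inside each country, so A's last value never leaks into B
toy["total_cases"] = toy.groupby("location")["total_cases"].ffill()

# Rolling mean per country; min_periods=1 yields a value from the first day on
toy["new_cases_smoothed"] = toy.groupby("location")["new_cases"].transform(
    lambda s: s.rolling(2, min_periods=1).mean()
)
print(toy)
```

Country B's first `total_cases` stays missing because there is no earlier value to carry forward within that country.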

## 3) Exploratory Data Analysis (EDA)

We'll examine total cases & deaths over time, daily new cases (smoothed), and compute case fatality rates.

In [None]:
import matplotlib.pyplot as plt

def plot_timeseries(df_in, y, title, ylabel):
    plt.figure(figsize=(10,5))
    for country in SELECTED_COUNTRIES:
        g = df_in[df_in["location"]==country]
        plt.plot(g["date"], g[y], label=country)
    plt.title(title)
    plt.xlabel("Date")
    plt.ylabel(ylabel)
    plt.legend()
    plt.grid(True)
    plt.show()

plot_timeseries(df_clean, "total_cases", "Total Cases Over Time", "Total cases")

In [None]:
plot_timeseries(df_clean, "total_deaths", "Total Deaths Over Time", "Total deaths")

In [None]:
plot_timeseries(df_clean, "new_cases_smoothed", f"{ROLLING_DAYS}-Day Smoothed New Cases", "New cases (smoothed)")

### Case Fatality Rate Over Time

In [None]:
plot_timeseries(df_clean, "case_fatality_rate", "Case Fatality Rate Over Time", "CFR = total_deaths / total_cases")
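
As a worked example of the formula on the y-axis, the sketch below computes CFR on made-up numbers and guards against division by zero. `case_fatality_rate` here is a hypothetical helper, not part of the dataset or of pandas.

```python
import numpy as np
import pandas as pd

def case_fatality_rate(total_deaths: pd.Series, total_cases: pd.Series) -> pd.Series:
    """CFR = total_deaths / total_cases, NaN where no cases are reported."""
    return total_deaths / total_cases.replace(0, np.nan)

# Illustrative values only
deaths = pd.Series([0.0, 2.0, 5.0])
cases = pd.Series([0.0, 100.0, 200.0])

cfr = case_fatality_rate(deaths, cases)
print((cfr * 100).round(2).tolist())  # percentages; the first entry is NaN
```

Note that CFR depends on how many infections are detected, so differences between countries may reflect testing volume as much as disease severity.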

### Top Countries by Total Cases (Latest Date)

In [None]:
latest_date = df["date"].max()
latest = df[df["date"]==latest_date]
top = latest[["location","total_cases","continent"]].dropna(subset=["total_cases"])
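# Drop aggregate rows (World, continents, income groups); this assumes they have an empty `continent`, as in the OWID file
top = top[top["continent"].notna()]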
top10 = top.sort_values("total_cases", ascending=False).head(10)
top10

In [None]:
plt.figure(figsize=(10,5))
plt.bar(top10["location"], top10["total_cases"])
plt.title(f"Top 10 Countries by Total Cases on {latest_date.date()}")
plt.ylabel("Total cases")
plt.xticks(rotation=45, ha="right")
plt.grid(True, axis="y")
plt.tight_layout()
plt.show()
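
A ranking like this is only meaningful if aggregate rows are excluded. The sketch below shows the idea on a made-up frame, assuming (as in the OWID file) that aggregates such as "World" or "Asia" have an empty `continent`; the case counts are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "iso_code": ["OWID_WRL", "USA", "IND", "OWID_ASI", "KEN"],
    "location": ["World", "United States", "India", "Asia", "Kenya"],
    "continent": [None, "North America", "Asia", None, "Africa"],
    "total_cases": [1000.0, 300.0, 250.0, 400.0, 10.0],  # illustrative values
})

# Keep rows that belong to a continent, i.e. actual countries
countries_only = toy[toy["continent"].notna()]
top2 = countries_only.nlargest(2, "total_cases")["location"].tolist()
print(top2)
```

Without the filter, "World" and "Asia" would take the top two places.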

## 4) Vaccination Progress

We visualize cumulative vaccinations and vaccination coverage where data are available.

In [None]:
plot_timeseries(df_clean, "total_vaccinations", "Total Vaccinations Over Time", "Total vaccinations")

In [None]:
plot_timeseries(df_clean, "people_fully_vaccinated_per_hundred", "Fully Vaccinated (% of population)", "% fully vaccinated")
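
The coverage metric is a per-100-people rate. Below is a minimal sketch of that arithmetic, assuming coverage is the number of fully vaccinated people divided by total population; the figures are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["A", "B"],
    "people_fully_vaccinated": [500_000.0, 1_200_000.0],  # illustrative values
    "population": [1_000_000.0, 4_000_000.0],
})

# Fully vaccinated people per 100 inhabitants
toy["fully_vaccinated_per_hundred"] = (
    100 * toy["people_fully_vaccinated"] / toy["population"]
).round(2)
print(toy[["location", "fully_vaccinated_per_hundred"]])
```

This is why a country with more vaccinated people in absolute terms (B) can still show lower coverage than a smaller one (A).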

## 5) World Choropleth (Cases or Vaccination Rates)

> Requires `plotly` (`pip install plotly`).  


In [None]:
try:
    import plotly.express as px

    latest_date = df["date"].max()
    latest = df[df["date"]==latest_date].copy()

    metric = "people_fully_vaccinated_per_hundred" 
    title = f"World Map — {metric.replace('_',' ').title()} on {latest_date.date()}"

    fig = px.choropleth(
        latest,
        locations="iso_code",
        color=metric,
        hover_name="location",
        color_continuous_scale="Viridis",
        title=title
    )
    fig.show()
except Exception as e:
    print("Plotly not available or another issue occurred:", e)
    print("Install with: pip install plotly")
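
Vaccination figures are not reported every day, so a single-date snapshot can leave much of the map blank. One possible workaround, sketched here on made-up data, is to take each country's most recent non-missing value; `latest_available` is a hypothetical helper.

```python
import pandas as pd

def latest_available(frame: pd.DataFrame, metric: str) -> pd.DataFrame:
    """Return, per location, the most recent row where `metric` is present."""
    valid = frame.dropna(subset=[metric]).sort_values("date")
    return valid.groupby("location").tail(1)

toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"] * 2),
    "people_fully_vaccinated_per_hundred": [10.0, None, 5.0, 7.0],  # illustrative
})

snapshot = latest_available(toy, "people_fully_vaccinated_per_hundred")
print(snapshot)
```

Passing `latest_available(df, metric)` to `px.choropleth` instead of `latest` would then mix dates across countries, so the map title should say "latest available" rather than name a single date.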

## 6) Quick Insights

The cell below computes a few ready-made summary bullets.

In [None]:
import pandas as pd
import numpy as np

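latest_date = df["date"].max()
# Keep country rows only; this assumes aggregates such as "World" have an empty `continent`, as in the OWID file
latest = df[(df["date"]==latest_date) & df["continent"].notna()].copy()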

def top_n(series, n=3):
    """Format the n largest non-null values as numbered bullet strings."""
    s = series.dropna().sort_values(ascending=False).head(n)
    return [f"{i+1}. {idx}: {val:,.0f}" for i, (idx, val) in enumerate(s.items())]

insights = []

total_cases_by_country = latest.set_index("location")["total_cases"]
insights.append(f"**Top countries by total cases on {latest_date.date()}:**")
insights.extend(top_n(total_cases_by_country, 3))

vax_pct = latest.set_index("location")["people_fully_vaccinated_per_hundred"]
insights.append("")
insights.append("**Top countries by % fully vaccinated (of those with data):**")
s = vax_pct.dropna().sort_values(ascending=False).head(3)
for i, (loc, val) in enumerate(s.items(), 1):
    insights.append(f"{i}. {loc}: {val:.1f}%")

try:
    # 30-day window measured back from each country's own latest date
    cutoff = df_clean.groupby("location")["date"].transform("max") - pd.Timedelta(days=30)
    last30 = df_clean[df_clean["date"] >= cutoff]

    recent_avg = last30.groupby("location")["new_cases_smoothed"].mean().sort_values(ascending=False)
    insights.append("")
    insights.append("**Selected countries by average new cases (smoothed) over last 30 days:**")
    for i, (loc, val) in enumerate(recent_avg.items(), 1):
        insights.append(f"{i}. {loc}: {val:,.0f} per day")
except Exception as e:
    insights.append(f"Could not compute the last-30-day average: {e}")

print("\n".join(insights))