# COVID-19 Global Trends — Analysis Notebook

This notebook analyzes global COVID-19 trends (cases, deaths, vaccinations). By default it reads a **Kaggle / Johns Hopkins (JHU CSSE)** CSV from the notebook folder. If no such file is found locally, it falls back to downloading the **Our World in Data (OWID)** CSV.

**How to use**
- Preferred: place the Kaggle CSV (e.g., `covid_19_data.csv` or the JHU CSV you downloaded from Kaggle) in the same folder as this notebook.
- Alternative: the notebook will attempt to download OWID data automatically if the local file is missing.


## 1) Setup & Data Collection

This section loads the dataset. The notebook checks the working directory for common Kaggle/JHU filenames (e.g., `covid_19_data.csv`, `time_series_covid19_confirmed_global.csv`) or a local `owid-covid-data.csv` and loads the first one it finds; if none exists, it downloads the OWID CSV instead.


In [None]:
# Imports and settings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

plt.rcParams["figure.figsize"] = (10,6)
pd.set_option("display.max_columns", 60)

# File detection: common Kaggle/JHU filenames to check
possible_files = [
    "covid_19_data.csv",  # common Kaggle name
    "time_series_covid19_confirmed_global.csv",
    "time_series_covid19_deaths_global.csv",
    "time_series_covid19_recovered_global.csv",
    "owid-covid-data.csv"
]

local_file = None
for fname in possible_files:
    p = Path(fname)
    if p.exists():
        local_file = p
        break

OWID_URL = "https://covid.ourworldindata.org/data/owid-covid-data.csv"

def load_data():
    # Load the first local file found by the detection loop above, if any
    if local_file:
        print(f"Loading local file: {local_file}")
        try:
            df = pd.read_csv(local_file)
            return df
        except Exception as e:
            print("Error reading local file:", e)
    # Fallback to OWID
    print("Local Kaggle file not found or unreadable — downloading OWID dataset...")
    try:
        df = pd.read_csv(OWID_URL)
        return df
    except Exception as e:
        raise RuntimeError("Failed to load any dataset. Please place a Kaggle CSV in the notebook folder or ensure internet access.") from e

df = load_data()
print("Loaded dataset with shape:", df.shape)
df.head(3)
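
The JHU time-series files listed in `possible_files` store one column per date, so they have no single date column for the next section to detect. The sketch below shows one way to reshape such a file into long format. It assumes the standard JHU layout (`Province/State`, `Country/Region`, `Lat`, `Long`, then one column per date in `m/d/yy` form); the toy values and the name `jhu_long` are illustrative only and not part of the notebook.

```python
import io
import pandas as pd

# Toy stand-in for time_series_covid19_confirmed_global.csv (values are made up).
jhu_csv = io.StringIO(
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20\n"
    ",Kenya,-0.02,37.9,0,1\n"
    "Hubei,China,30.97,112.27,444,444\n"
    "Beijing,China,40.18,116.41,14,22\n"
)
wide = pd.read_csv(jhu_csv)

# Turn the per-date columns into rows: one row per (region, date).
id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
long_df = wide.melt(id_vars=id_cols, var_name="date", value_name="total_cases")
long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")

# Sum provinces so there is one row per country and date.
jhu_long = (
    long_df.groupby(["Country/Region", "date"], as_index=False)["total_cases"]
    .sum()
    .rename(columns={"Country/Region": "location"})
)
print(jhu_long)
```

The resulting `location`, `date`, and `total_cases` columns follow the OWID-style names that the later cells expect.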

## 2) Data Exploration & Cleaning

This section inspects columns, converts dates, and prepares a working subset of columns used for analysis.


In [None]:
# Preview columns and types
print("Columns (sample):", list(df.columns)[:50])
# Strip surrounding whitespace from column names
df.columns = [c.strip() for c in df.columns]

# Attempt to find a date column and convert to datetime
date_col = None
for candidate in ["date","Date","observation_date","Date_reported"]:
    if candidate in df.columns:
        date_col = candidate
        break

if date_col is None:
    # Fall back to the first column whose name contains "date"
    for c in df.columns:
        if "date" in c.lower():
            date_col = c
            break

if date_col is None:
    raise RuntimeError("No date column detected. Ensure your CSV has a date column named like 'date' or 'observation_date'.")

df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
df = df.rename(columns={date_col: "date"})

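# Keep a practical subset. These are OWID column names; a raw Kaggle/JHU export
# uses different names and must be renamed to match before this step.
essential = ["iso_code","continent","location","date",
             "total_cases","new_cases","total_deaths","new_deaths",
             "total_vaccinations","people_vaccinated","people_fully_vaccinated",
             "population","new_cases_smoothed","new_deaths_smoothed",
             "total_cases_per_million","total_deaths_per_million",
             "people_fully_vaccinated_per_hundred","people_vaccinated_per_hundred"]

available = [c for c in essential if c in df.columns]
if "location" not in available:
    # The later cells group and plot by "location", so fail early with a clear message
    raise RuntimeError("No 'location' column found. The analysis below expects OWID-style column names.")
print("Using columns:", available)
df = df[available].copy()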

print("Missing values (top 10):")
print(df.isnull().sum().sort_values(ascending=False).head(10))

df.info()
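
Because the date column is parsed with `errors="coerce"`, any value that cannot be parsed becomes `NaT` instead of raising an error. A minimal sketch of counting and dropping such rows follows; the toy frame and the names `raw`, `n_bad`, and `clean` are assumptions for illustration.

```python
import pandas as pd

# Toy data with one unparseable date (values are made up).
raw = pd.DataFrame({
    "date": ["2021-03-01", "not a date", "2021-03-03"],
    "new_cases": [5, 7, 9],
})

# Unparseable strings become NaT rather than raising.
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")

# Count the rows that failed to parse, then drop them.
n_bad = int(raw["date"].isna().sum())
print(f"Unparseable dates: {n_bad}")
clean = raw.dropna(subset=["date"]).reset_index(drop=True)
print(clean)
```

Whether to drop or repair such rows depends on the dataset; dropping is only the simplest option.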

## 3) Cleaning & Preprocessing

- Exclude aggregates (World or continents) where appropriate.
- Forward-fill cumulative metrics by country.
- Fill missing daily metrics with zeros where sensible.
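
The forward-fill and zero-fill steps listed above can be illustrated on a toy frame (made-up values, not taken from the dataset). Grouping by `location` before calling `ffill()` keeps one country's totals from leaking into the next country's leading gaps.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "total_cases": [10, np.nan, 15, np.nan, 5, np.nan],
    "new_cases": [10, np.nan, 5, np.nan, 5, 0],
})

toy = toy.sort_values(["location", "date"])

# Cumulative metric: carry the last known value forward within each country.
toy["total_cases"] = toy.groupby("location")["total_cases"].ffill()

# Daily metric: treat a missing report as zero new cases.
toy["new_cases"] = toy["new_cases"].fillna(0)

print(toy)
```

Country B's first row stays `NaN` because it has no earlier value of its own to carry forward.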


In [None]:
# Exclude non-country aggregates if iso_code exists
if "iso_code" in df.columns:
    # Country iso_codes are three-letter strings; OWID aggregates use longer codes such as "OWID_WRL"
    df = df[df["iso_code"].map(lambda x: isinstance(x, str) and len(x) == 3)]

# Sort and forward-fill cumulative fields grouped by location
df = df.sort_values(["location","date"])
cum_cols = [c for c in ["total_cases","total_deaths","total_vaccinations",
                        "people_vaccinated","people_fully_vaccinated"] if c in df.columns]
if cum_cols:  # skip when none of the cumulative columns are present
    df[cum_cols] = df.groupby("location")[cum_cols].ffill()

# Fill new_* with 0 if missing (daily counts)
for col in ["new_cases","new_deaths","new_cases_smoothed","new_deaths_smoothed"]:
    if col in df.columns:
        df[col] = df[col].fillna(0)

print("After cleaning, sample:")
display(df.head(3))

## 4) Exploratory Data Analysis (EDA)

We'll create:
- Time series for total cases & deaths for selected countries
- Comparison of daily new cases
- Top countries by total cases (latest date)


In [None]:
# Select focus countries (customize as needed)
focus = ["Kenya","United States","India","United Kingdom","South Africa"]
focus = [c for c in focus if c in df["location"].unique()]
print("Focus countries:", focus)

# Latest snapshot per country
latest = df.sort_values("date").groupby("location").tail(1)

# Top countries by total cases (if column exists)
if "total_cases" in latest.columns:
    top = latest.nlargest(15, "total_cases")[["location","total_cases"]].set_index("location")
    top.plot(kind="bar", ylabel="Total cases", title="Top 15 countries by total cases (latest)")
    plt.tight_layout()
    plt.show()

# Time series function
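def plot_metric(metric, countries, title, ylabel):
    """Plot one metric over time, with one line per country."""
    if metric not in df.columns:
        print(f"Metric {metric} not found in data.")
        return
    subset = df[df["location"].isin(countries)]
    if subset.empty:
        print("None of the requested countries are in the data.")
        return
    # Reshape to one column per country so each is drawn as its own line
    pivot = subset.pivot(index="date", columns="location", values=metric)
    pivot.plot(title=title, ylabel=ylabel)
    plt.show()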

plot_metric("total_cases", focus, "Total Cases Over Time", "Total cases")
plot_metric("total_deaths", focus, "Total Deaths Over Time", "Total deaths")

# Daily new cases (smoothed preferred)
metric_new = "new_cases_smoothed" if "new_cases_smoothed" in df.columns else "new_cases"
plot_metric(metric_new, focus, "Daily New Cases (smoothed if available)", "New cases")
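
When `new_cases_smoothed` is missing, the cell above plots raw `new_cases`, which can be noisy. One possible alternative is a 7-day rolling mean computed per country, sketched below on toy data. The column name `new_cases_7d` is an assumption, and the result is not guaranteed to match OWID's own smoothed series.

```python
import pandas as pd

# Toy data: 10 consecutive days for two countries (values are made up).
dates = pd.date_range("2021-01-01", periods=10, freq="D")
toy = pd.DataFrame({
    "location": ["A"] * 10 + ["B"] * 10,
    "date": list(dates) * 2,
    "new_cases": list(range(10)) + [7] * 10,
})

toy = toy.sort_values(["location", "date"])

# Trailing 7-row mean within each country; min_periods=1 means the first
# six rows are averaged over however many days are available so far.
toy["new_cases_7d"] = (
    toy.groupby("location")["new_cases"]
    .transform(lambda s: s.rolling(window=7, min_periods=1).mean())
)

print(toy[toy["location"] == "A"])
```

The window counts rows, not calendar days, so this assumes one row per country per day with no gaps.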

## 5) Vaccination Analysis

Compare cumulative vaccinations over time and the percentage of the population fully vaccinated as of each country's latest record.


In [None]:
# Vaccination over time for focus countries
if "total_vaccinations" in df.columns:
    plot_metric("total_vaccinations", focus, "Cumulative Vaccinations Over Time", "Total vaccinations")

# Percent fully vaccinated (latest)
if "people_fully_vaccinated_per_hundred" in latest.columns:
    pct = latest.dropna(subset=["people_fully_vaccinated_per_hundred"]).sort_values("people_fully_vaccinated_per_hundred", ascending=False)
    pct_top = pct.head(20).set_index("location")["people_fully_vaccinated_per_hundred"]
    pct_top.plot(kind="bar", title="% Fully Vaccinated (Top 20)", ylabel="% of population")
    plt.tight_layout()
    plt.show()
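
The `latest` snapshot takes the final row per country, and the per-hundred columns are not forward-filled in section 3, so a country whose most recent row lacks a vaccination figure is dropped by `dropna`. A hedged alternative is to take the last non-null value per country with `GroupBy.last()`. The sketch below uses toy data, and `last_row` and `last_valid` are illustrative names.

```python
import numpy as np
import pandas as pd

col = "people_fully_vaccinated_per_hundred"
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-01", "2022-01-02"]
    ),
    col: [60.0, 61.5, np.nan, np.nan, np.nan],
})

# Final row per country: A's most recent row has no vaccination value.
last_row = toy.sort_values("date").groupby("location").tail(1)

# Last non-null value per country: A keeps its most recent reported figure.
last_valid = toy.sort_values("date").groupby("location")[col].last()

print(last_row)
print(last_valid)
```

Country B never reported a value, so it stays `NaN` under both approaches.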

## 6) Optional Choropleth Map (Plotly)

If Plotly is installed, a choropleth map can be rendered using `iso_code` and `total_cases_per_million`.


In [None]:
try:
    import plotly.express as px
    PLOTLY = True
except ImportError:
    PLOTLY = False

if PLOTLY and "iso_code" in latest.columns and "total_cases_per_million" in latest.columns:
    fig = px.choropleth(latest, locations="iso_code", color="total_cases_per_million",
                        hover_name="location", title="Total cases per million (latest)")
    fig.show()
else:
    print("Plotly not available or required columns missing. Install plotly with `pip install plotly` to enable map.")

## 7) Insights & Next Steps

- Write 3–5 key insights here after reviewing the charts (e.g., which countries had the fastest vaccine rollouts, where death rates were highest, etc.).
- Optional next steps: deeper statistical modeling, dashboards with Streamlit, or including hospitalization data.
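
As a starting point for the death-rate insight mentioned above, one option is the case fatality ratio: confirmed deaths divided by confirmed cases. Treating this as the intended "death rate" is an assumption, and the `snapshot` frame and `cfr_pct` column below are toy illustrations, not notebook output.

```python
import numpy as np
import pandas as pd

# Toy latest-snapshot data (values are made up).
snapshot = pd.DataFrame({
    "location": ["A", "B", "C"],
    "total_cases": [1000, 200, 0],
    "total_deaths": [20, 1, 0],
})

# Replace zero case counts with NaN so the ratio is undefined, not a division by zero.
cases = snapshot["total_cases"].replace(0, np.nan)
snapshot["cfr_pct"] = snapshot["total_deaths"] / cases * 100

# Highest ratio first; rows with an undefined ratio sort last.
ranked = snapshot.sort_values("cfr_pct", ascending=False)
print(ranked)
```

This ratio depends heavily on how much testing each country did, so comparisons across countries should be read with caution.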


## 8) Reproducibility & Submission

- Ensure you have the Kaggle CSV in the notebook folder or internet access for the OWID fallback.
- To share: upload both `covid19_analysis.ipynb` and `README.md` to your GitHub repository.
