# COVID-19 Global Data Tracker

This notebook analyzes global COVID-19 trends, including cases, deaths, and vaccinations across countries and time. We'll clean and process the data, perform exploratory data analysis, generate insights, and visualize trends using Python data tools.

## 1. Data Loading & Exploration

First, let's import the necessary libraries and load the COVID-19 dataset from Our World in Data.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from datetime import datetime

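# Set plot styles
# The 'seaborn-whitegrid' Matplotlib style name was deprecated in Matplotlib 3.6
# and later removed, so set the grid style through seaborn instead
sns.set_style('whitegrid')
sns.set_palette('viridis')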
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 8)

In [None]:
# Load the dataset
file_path = '../data/owid-covid-data.csv'
df = pd.read_csv(file_path)

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"\nColumns in the dataset:\n{df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")
print("\nFirst few rows:")
df.head()

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:")
missing_values[missing_values > 0].sort_values(ascending=False)
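
Raw counts are hard to judge without knowing how many rows there are, so it can help to report the share of missing values as well. The following is a minimal sketch on a tiny made-up frame; `missing_summary` is a hypothetical helper name, not part of pandas or the dataset.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the OWID data (values are made up)
sample = pd.DataFrame({
    "location": ["Kenya", "Kenya", "India", "India"],
    "total_cases": [10.0, np.nan, 5.0, 8.0],
    "total_vaccinations": [np.nan, np.nan, np.nan, 3.0],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(sample)
print(summary)
```

On the real data, `missing_summary(df)` would be the equivalent call.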

### Key Columns

The dataset contains numerous columns. Here are the most important ones for our analysis:

- **Identifiers**:
  - `iso_code`: ISO 3166-1 alpha-3 country code
  - `continent`: Continent of the geographical location
  - `location`: Geographical location (typically a country)
  - `date`: Date of observation

- **Case Data**:
  - `total_cases`: Total confirmed cases of COVID-19
  - `new_cases`: New confirmed cases of COVID-19
  - `new_cases_smoothed`: New confirmed cases (7-day smoothed)

- **Death Data**:
  - `total_deaths`: Total deaths attributed to COVID-19
  - `new_deaths`: New deaths attributed to COVID-19
  - `new_deaths_smoothed`: New deaths (7-day smoothed)

- **Vaccination Data**:
  - `total_vaccinations`: Total vaccination doses administered
  - `people_vaccinated`: People who received at least one dose
  - `people_fully_vaccinated`: People who received all doses prescribed
  - `people_fully_vaccinated_per_hundred`: Percentage of population fully vaccinated

- **Demographic Data**:
  - `population`: Population in 2020
  - `population_density`: Number of people per square kilometer
  - `median_age`: Median age of the population
  - `gdp_per_capita`: Gross domestic product per capita at purchasing power parity

We'll also calculate a derived metric:
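- `death_rate`: Calculated as `total_deaths / total_cases`. This is a crude case fatality ratio based on confirmed cases, so it depends on how much testing each country does.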
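
The ratio is undefined when a country has not yet reported any cases. A minimal sketch of a division that returns `NaN` in that situation, assuming made-up sample values and a hypothetical helper named `case_fatality_rate`:

```python
import numpy as np
import pandas as pd

def case_fatality_rate(deaths: pd.Series, cases: pd.Series) -> pd.Series:
    """Divide deaths by cases, returning NaN where cases are zero or missing."""
    safe_cases = cases.where(cases > 0)  # zero or NaN case counts become NaN
    return deaths / safe_cases

# Made-up values covering the edge cases: zero cases and missing cases
sample = pd.DataFrame({
    "total_cases": [0.0, 100.0, 200.0, np.nan],
    "total_deaths": [0.0, 2.0, 5.0, 1.0],
})
sample["death_rate"] = case_fatality_rate(sample["total_deaths"], sample["total_cases"])
print(sample)
```

Leaving these rows as `NaN` rather than 0 keeps them out of averages and bar charts instead of presenting them as a real rate of zero.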

In [None]:
# Get unique countries/locations in the dataset
print(f"Number of unique countries/locations: {df['location'].nunique()}")
print(f"\nList of continents: {df['continent'].unique().tolist()}")

## 2. Data Cleaning

Let's clean the data by converting the date column to datetime, handling missing values, and filtering for countries of interest.

In [None]:
# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Create a list of countries of interest
countries_of_interest = ['Kenya', 'United States', 'India', 'South Africa', 'United Kingdom', 'Brazil', 'China']

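# Filter the dataset for these countries
# .copy() gives an independent DataFrame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning
df_countries = df[df['location'].isin(countries_of_interest)].copy()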

# Display the first few rows of the filtered dataset
df_countries.head()

In [None]:
# Calculate death rate (total_deaths / total_cases)
df_countries['death_rate'] = df_countries['total_deaths'] / df_countries['total_cases']

# Handle missing values in key columns
# For numeric columns, we'll fill NaN with 0 for simplicity
numeric_cols = ['total_cases', 'new_cases', 'total_deaths', 'new_deaths', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated']
df_countries[numeric_cols] = df_countries[numeric_cols].fillna(0)

# Check the cleaned data
df_countries.head()
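
Filling every gap with 0 is simple, but for running totals such as `total_cases` it makes the cumulative line drop to zero on days without a report. One possible alternative, sketched here on made-up data, is to carry the last known value forward within each country and fill only the daily columns with 0. The split into `cumulative_cols` and `daily_cols` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps in both a cumulative and a daily column
sample = pd.DataFrame({
    "location": ["Kenya"] * 4 + ["India"] * 3,
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04",
         "2021-01-01", "2021-01-02", "2021-01-03"]
    ),
    "total_cases": [np.nan, 10.0, np.nan, 15.0, 5.0, np.nan, 9.0],
    "new_cases": [np.nan, 10.0, np.nan, 5.0, 5.0, np.nan, 4.0],
})

cumulative_cols = ["total_cases"]  # running totals: carry the last known value forward
daily_cols = ["new_cases"]         # daily counts: treat a missing report as zero

cleaned = sample.sort_values(["location", "date"]).copy()
# Forward-fill within each country so values never leak across countries
cleaned[cumulative_cols] = cleaned.groupby("location")[cumulative_cols].ffill()
# Anything still missing lies before the first report; treat it as zero
cleaned[cumulative_cols] = cleaned[cumulative_cols].fillna(0)
cleaned[daily_cols] = cleaned[daily_cols].fillna(0)
print(cleaned)
```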

## 3. Exploratory Data Analysis (EDA)

Now, let's analyze the COVID-19 trends over time for our selected countries.

In [None]:
# Get the latest date in the dataset
latest_date = df['date'].max()
print(f"Latest date in the dataset: {latest_date}")

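# Get the most recent row for each country. Countries stop reporting on different
# dates, so filtering on the global latest date can silently drop some of them.
latest_data = (
    df_countries.sort_values('date')
    .groupby('location')
    .tail(1)
    .sort_values('total_cases', ascending=False)
)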

# Display the latest statistics
latest_data[['location', 'total_cases', 'total_deaths', 'death_rate']].reset_index(drop=True)
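
Individual columns can also be missing on a country's most recent row. A hedged sketch of one way around this is `GroupBy.last()`, which returns the last non-null entry of each column within a group. The sample values below are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample: India's final row has no case or death figures yet
sample = pd.DataFrame({
    "location": ["Kenya", "Kenya", "India", "India", "India"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02", "2021-01-03"]
    ),
    "total_cases": [10.0, 12.0, 5.0, 8.0, np.nan],
    "total_deaths": [1.0, np.nan, 0.0, 1.0, np.nan],
})

latest_by_country = (
    sample.sort_values("date")
    .groupby("location")
    .last()          # last non-null entry of every column within each group
    .reset_index()
)
print(latest_by_country)
```

One caveat with this approach: each column is resolved independently, so the `date` shown for a country may be later than the date its case or death figure actually comes from.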

In [None]:
# Plot total cases over time for selected countries
plt.figure(figsize=(14, 8))

for country in countries_of_interest:
    country_data = df_countries[df_countries['location'] == country]
    plt.plot(country_data['date'], country_data['total_cases'], label=country)

plt.title('Total COVID-19 Cases Over Time', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Total Cases', fontsize=12)
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Plot total deaths over time for selected countries
plt.figure(figsize=(14, 8))

for country in countries_of_interest:
    country_data = df_countries[df_countries['location'] == country]
    plt.plot(country_data['date'], country_data['total_deaths'], label=country)

plt.title('Total COVID-19 Deaths Over Time', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Total Deaths', fontsize=12)
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
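
The line plots in this section all follow the same pattern: one line per country, then a title, axis labels, legend, and grid. A minimal sketch of a reusable helper is shown below; `plot_metric` is a hypothetical name and the sample data is made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_metric(frame, metric, title, ylabel, countries):
    """Draw one line per country for `metric` and return the Axes."""
    fig, ax = plt.subplots(figsize=(14, 8))
    for country in countries:
        country_data = frame[frame["location"] == country]
        ax.plot(country_data["date"], country_data[metric], label=country)
    ax.set_title(title, fontsize=16)
    ax.set_xlabel("Date", fontsize=12)
    ax.set_ylabel(ylabel, fontsize=12)
    ax.legend()
    ax.grid(True)
    ax.tick_params(axis="x", labelrotation=45)
    fig.tight_layout()
    return ax

# Made-up sample data for two countries
sample = pd.DataFrame({
    "location": ["Kenya", "Kenya", "India", "India"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"] * 2),
    "total_deaths": [1.0, 2.0, 3.0, 5.0],
})
ax = plot_metric(
    sample, "total_deaths", "Total COVID-19 Deaths Over Time",
    "Total Deaths", ["Kenya", "India"],
)
```

With the notebook's data, the same call would take `df_countries` and `countries_of_interest`, followed by `plt.show()`.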

In [None]:
# Plot daily new cases for selected countries
plt.figure(figsize=(14, 8))

for country in countries_of_interest:
    country_data = df_countries[df_countries['location'] == country]
    # Use 7-day moving average for smoother visualization
    plt.plot(country_data['date'], country_data['new_cases_smoothed'], label=country)

plt.title('Daily New COVID-19 Cases (7-day Moving Average)', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('New Cases', fontsize=12)
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
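
The plot above relies on the dataset's precomputed `new_cases_smoothed` column. If that column were unavailable, a 7-day average could be approximated from `new_cases` with a per-country rolling mean. The sketch below uses made-up data and a hypothetical column name `new_cases_7d`; the dataset's own smoothing may treat gaps differently.

```python
import pandas as pd

# Made-up sample: eight consecutive days for one country
sample = pd.DataFrame({
    "location": ["Kenya"] * 8,
    "date": pd.date_range("2021-01-01", periods=8, freq="D"),
    "new_cases": [7.0, 14.0, 21.0, 28.0, 35.0, 42.0, 49.0, 56.0],
})

sample = sample.sort_values(["location", "date"])
# Roll within each country so one country's window never includes another's rows
sample["new_cases_7d"] = (
    sample.groupby("location")["new_cases"]
    .transform(lambda s: s.rolling(window=7, min_periods=7).mean())
)
print(sample)
```

With `min_periods=7` the first six days of each country stay empty, which avoids averaging over a partial window.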

In [None]:
# Compare death rates across countries
latest_death_rates = latest_data[['location', 'death_rate']].sort_values('death_rate', ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x='location', y='death_rate', data=latest_death_rates)
plt.title('COVID-19 Death Rate by Country (Latest Data)', fontsize=16)
plt.xlabel('Country', fontsize=12)
plt.ylabel('Death Rate (Deaths/Cases)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 4. Vaccination Analysis

Let's analyze the vaccination progress across countries.

In [None]:
# Plot vaccination progress over time
plt.figure(figsize=(14, 8))

for country in countries_of_interest:
    country_data = df_countries[df_countries['location'] == country]
    plt.plot(country_data['date'], country_data['people_fully_vaccinated_per_hundred'], label=country)

plt.title('Percentage of Population Fully Vaccinated Over Time', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Percentage Fully Vaccinated', fontsize=12)
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
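# Compare vaccination rates across countries (latest data)
# Vaccination figures are reported irregularly, so take each country's most
# recent non-missing value rather than the value on the global latest date
latest_vax_data = (
    df_countries.dropna(subset=['people_fully_vaccinated_per_hundred'])
    .sort_values('date')
    .groupby('location')
    .tail(1)[['location', 'people_fully_vaccinated_per_hundred']]
    .sort_values('people_fully_vaccinated_per_hundred', ascending=False)
)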

plt.figure(figsize=(12, 6))
sns.barplot(x='location', y='people_fully_vaccinated_per_hundred', data=latest_vax_data)
plt.title('Percentage of Population Fully Vaccinated by Country (Latest Data)', fontsize=16)
plt.xlabel('Country', fontsize=12)
plt.ylabel('Percentage Fully Vaccinated', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 5. Choropleth Map Visualization

Let's create a world map showing COVID-19 cases and vaccination rates by country.

In [None]:
# Get the latest data for all countries
latest_global_data = df[df['date'] == latest_date].copy()

# Create a choropleth map of total cases per million
fig = px.choropleth(
    latest_global_data,
    locations="iso_code",
    color="total_cases_per_million",
    hover_name="location",
    hover_data=["total_cases", "total_deaths", "total_cases_per_million"],
    title="COVID-19 Total Cases per Million by Country",
    color_continuous_scale=px.colors.sequential.Plasma,
    projection="natural earth"
)

fig.update_layout(height=600, margin={"r":0,"t":30,"l":0,"b":0})
fig.show()

In [None]:
# Create a choropleth map of vaccination rates
fig = px.choropleth(
    latest_global_data,
    locations="iso_code",
    color="people_fully_vaccinated_per_hundred",
    hover_name="location",
    hover_data=["people_fully_vaccinated_per_hundred", "total_vaccinations_per_hundred"],
    title="COVID-19 Vaccination Rates by Country (% Fully Vaccinated)",
    color_continuous_scale=px.colors.sequential.Viridis,
    projection="natural earth"
)

fig.update_layout(height=600, margin={"r":0,"t":30,"l":0,"b":0})
fig.show()
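
The dataset also contains aggregate rows such as "World" and the continents, whose `iso_code` starts with `OWID_` and whose `continent` is empty. They do not correspond to a country on the map, and they can distort summary statistics computed from the same frame. A minimal sketch of filtering them out, using made-up values:

```python
import numpy as np
import pandas as pd

# Made-up sample mixing countries with aggregate rows
sample = pd.DataFrame({
    "iso_code": ["KEN", "IND", "OWID_WRL", "OWID_AFR"],
    "location": ["Kenya", "India", "World", "Africa"],
    "continent": ["Africa", "Asia", np.nan, np.nan],
    "total_cases_per_million": [6000.0, 31000.0, 85000.0, 9000.0],
})

# Treat a row as an aggregate when it has an OWID_ code and no continent
is_aggregate = (
    sample["iso_code"].str.startswith("OWID_", na=False)
    & sample["continent"].isna()
)
countries_only = sample[~is_aggregate].copy()
print(countries_only)
```

Applied to the notebook's data, the same mask could be built from `latest_global_data` before passing it to `px.choropleth`.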

## 6. Insights & Reporting

Based on our analysis, here are some key insights about the global COVID-19 trends:

### Key Insights:

1. **Case Distribution**: The United States, India, and Brazil have consistently shown the highest total case counts among our selected countries. Raw totals partly reflect population size and testing capacity as well as the effectiveness of containment measures, so they should be read alongside per-capita figures.

2. **Death Rates**: Death rates vary significantly across countries, which may be attributed to differences in healthcare systems, population demographics, testing capacity, and reporting methodologies.

3. **Vaccination Progress**: There are substantial disparities in vaccination rates globally. Developed nations generally show higher vaccination rates than developing countries, highlighting issues of vaccine equity and distribution.

4. **Waves of Infection**: The data shows distinct waves of infection across different countries, often occurring at different times, demonstrating the global yet asynchronous nature of the pandemic.

5. **Regional Patterns**: The choropleth maps reveal regional patterns in both case rates and vaccination coverage, with notable variations between continents and economic regions.

### Conclusion:

The COVID-19 pandemic has affected countries around the world in different ways, with variations in case numbers, death rates, and vaccination progress. These differences reflect a complex interplay of factors including healthcare infrastructure, government policies, population demographics, and economic resources.

Our analysis highlights the importance of global cooperation in pandemic response, as well as the need for equitable access to vaccines and healthcare resources. The data also underscores the value of timely and accurate reporting in understanding and responding to global health crises.