# COVID-19 Global Data Tracker
### By @theoddysey

This notebook analyzes global COVID-19 trends in cases, deaths, and vaccinations across countries and over time. We'll clean and process real-world data, perform exploratory data analysis (EDA), and visualize the trends using Python data tools.

## 1. Data Collection and Setup

First, let's import the necessary libraries and load the COVID-19 dataset from Our World in Data.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime

# Set plot styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

In [None]:
# Download the dataset if not already available
import os
import requests

# Create data directory if it doesn't exist
os.makedirs('data', exist_ok=True)

# URL for the Our World in Data COVID-19 dataset
url = 'https://covid.ourworldindata.org/data/owid-covid-data.csv'
file_path = 'data/owid-covid-data.csv'

# Download the dataset if it doesn't exist
if not os.path.exists(file_path):
    print(f"Downloading COVID-19 dataset from {url}...")
    response = requests.get(url, timeout=60)
    # Fail here on an HTTP error instead of saving an error page as the CSV
    response.raise_for_status()
    with open(file_path, 'wb') as f:
        f.write(response.content)
    print("Download complete!")
else:
    print(f"Dataset already exists at {file_path}")

## 2. Data Loading & Exploration

Let's load the dataset and explore its structure.

In [None]:
# Load the dataset
df = pd.read_csv(file_path)

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"Number of countries/regions: {df['location'].nunique()}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")

# Preview the first few rows
df.head()

In [None]:
# Check the columns in the dataset
print("Columns in the dataset:")
for col in df.columns:
    print(f"- {col}")

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

# Create a DataFrame to display missing values
missing_df = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage (%)': missing_percentage
})

# Sort by missing values in descending order
missing_df = missing_df.sort_values('Missing Values', ascending=False)

# Display columns with missing values
missing_df[missing_df['Missing Values'] > 0]

## 3. Data Cleaning

Let's clean the data by converting the date column to datetime, handling missing values, and filtering countries of interest.

In [None]:
# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Select countries of interest (you can modify this list)
countries_of_interest = ['United States', 'India', 'Brazil', 'United Kingdom', 'Russia', 'France', 'Germany', 'South Africa', 'Kenya', 'China']

# Filter the dataset for countries of interest
filtered_df = df[df['location'].isin(countries_of_interest)].copy()

# Check the filtered dataset
print(f"Filtered dataset shape: {filtered_df.shape}")
print(f"Countries in filtered dataset: {filtered_df['location'].unique()}")

# Preview the filtered dataset
filtered_df.head()

In [None]:
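# Handle missing values for key metrics
cumulative_metrics = ['total_cases', 'total_deaths']
daily_metrics = ['new_cases', 'new_deaths']

# Sort chronologically within each country so forward-filling follows time order
filtered_df = filtered_df.sort_values(['location', 'date'])

# Cumulative totals: carry the last reported value forward; use 0 before the first report
filtered_df[cumulative_metrics] = (
    filtered_df.groupby('location')[cumulative_metrics].ffill().fillna(0)
)

# Daily counts: treat missing reports as 0
filtered_df[daily_metrics] = filtered_df[daily_metrics].fillna(0)

# Calculate death rate (deaths per 100 cases)
filtered_df['death_rate'] = (filtered_df['total_deaths'] / filtered_df['total_cases'] * 100).round(2)

# Handle division by zero or missing values in death rate
filtered_df['death_rate'] = filtered_df['death_rate'].replace([np.inf, -np.inf], np.nan).fillna(0)

# Preview the cleaned dataset
filtered_df[['location', 'date', 'total_cases', 'total_deaths', 'death_rate']].head()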
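
The cell above reports a death rate of 0 wherever no cases have been recorded. As a minimal sketch of an alternative, the example below leaves the rate undefined (`NaN`) in that situation, so that "no cases yet" is not confused with "no deaths among known cases". The `toy` DataFrame and its numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not real OWID figures)
toy = pd.DataFrame({
    "location": ["A", "A", "B"],
    "total_cases": [0, 200, 50],
    "total_deaths": [0, 4, 1],
})

# Replace zero case counts with NaN so the division yields NaN
# instead of inf (x/0) or an arbitrary 0
cases = toy["total_cases"].replace(0, np.nan)
toy["death_rate"] = (toy["total_deaths"] / cases * 100).round(2)

print(toy)
```

Whether 0 or `NaN` is the better placeholder depends on how the column is used later: bar charts silently skip `NaN`, while averages would be pulled down by artificial zeros.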

## 4. Exploratory Data Analysis (EDA)

Let's analyze the COVID-19 trends across different countries.

In [None]:
# Get the latest data for each country
latest_data = filtered_df.sort_values('date').groupby('location').tail(1).sort_values('total_cases', ascending=False)

# Display the latest statistics for each country
latest_data[['location', 'date', 'total_cases', 'total_deaths', 'death_rate']]
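
Taking the last row per country keeps whatever that row contains, including missing values. The sketch below contrasts `tail(1)` with `GroupBy.last()`, which returns the last non-null value of each column within a group. The `toy` DataFrame is invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01", "2021-01-02"]
    ),
    "total_cases": [10.0, 20.0, np.nan, 5.0, 7.0],
    "people_vaccinated_per_hundred": [1.0, np.nan, np.nan, np.nan, 2.5],
})

# tail(1) keeps the final row of each group as-is, NaN included
last_row = toy.sort_values("date").groupby("location").tail(1)

# last() takes the last non-null value of every column within each group
last_valid = toy.sort_values("date").groupby("location", as_index=False).last()

print(last_row)
print(last_valid)
```

One caveat with `last()`: the values in a single output row can come from different dates, so the `date` column no longer describes every figure in that row.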

In [None]:
# Plot total cases over time for selected countries
plt.figure(figsize=(14, 8))

for country in countries_of_interest:
    country_data = filtered_df[filtered_df['location'] == country]
    plt.plot(country_data['date'], country_data['total_cases'], label=country)

plt.title('Total COVID-19 Cases Over Time', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Total Cases', fontsize=14)
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Plot total deaths over time for selected countries
plt.figure(figsize=(14, 8))

for country in countries_of_interest:
    country_data = filtered_df[filtered_df['location'] == country]
    plt.plot(country_data['date'], country_data['total_deaths'], label=country)

plt.title('Total COVID-19 Deaths Over Time', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Total Deaths', fontsize=14)
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
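
Cumulative curves make it hard to see individual waves. One common way to expose them is a 7-day rolling mean of daily counts, computed separately for each country. The sketch below shows the calculation on made-up data; the column name `new_cases_7d` is a hypothetical name chosen for this example.

```python
import pandas as pd

# Toy data, made up for illustration: two locations, eight days each
toy = pd.DataFrame({
    "location": ["A"] * 8 + ["B"] * 8,
    "date": list(pd.date_range("2021-01-01", periods=8)) * 2,
    "new_cases": [0, 0, 70, 0, 0, 0, 0, 70] + [7] * 8,
})

# Rolling windows must follow time order within each location
toy = toy.sort_values(["location", "date"])

# 7-day rolling mean per location; the first six days of each group stay NaN
toy["new_cases_7d"] = (
    toy.groupby("location")["new_cases"]
    .transform(lambda s: s.rolling(window=7, min_periods=7).mean())
)

print(toy)
```

The same pattern should apply to `filtered_df` with `new_cases` or `new_deaths`, and the smoothed column can then be plotted in place of the cumulative totals.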

In [None]:
# Create a bar chart for total cases by country (latest data)
plt.figure(figsize=(14, 8))
sns.barplot(x='location', y='total_cases', data=latest_data)
plt.title('Total COVID-19 Cases by Country (Latest Data)', fontsize=16)
plt.xlabel('Country', fontsize=14)
plt.ylabel('Total Cases', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Create a bar chart for death rates by country (latest data)
plt.figure(figsize=(14, 8))
sns.barplot(x='location', y='death_rate', data=latest_data)
plt.title('COVID-19 Death Rate by Country (Latest Data)', fontsize=16)
plt.xlabel('Country', fontsize=14)
plt.ylabel('Death Rate (%)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 5. Visualizing Vaccination Progress

Let's analyze the vaccination rollout across different countries.

In [None]:
# Check vaccination-related columns
vax_columns = [col for col in df.columns if 'vaccin' in col.lower()]
print("Vaccination-related columns:")
for col in vax_columns:
    print(f"- {col}")
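
Vaccination figures are not necessarily reported for every country on every date, so it helps to check how complete a column is before plotting it. A minimal sketch on made-up data, measuring the share of non-missing values per country:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "location": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "total_vaccinations": [np.nan, 100.0, np.nan, 300.0, np.nan, np.nan, np.nan, 50.0],
})

# Share of rows per location that actually have a reported value
coverage = toy["total_vaccinations"].notna().groupby(toy["location"]).mean()

print(coverage)
```

A low share for a country suggests its line plot will be built from only a few points and should be read with that in mind.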

In [None]:
# Plot total vaccinations over time for selected countries
plt.figure(figsize=(14, 8))

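for country in countries_of_interest:
    # Drop dates without a reported value so the line is not broken by NaN gaps
    country_data = filtered_df[filtered_df['location'] == country].dropna(subset=['total_vaccinations'])
    plt.plot(country_data['date'], country_data['total_vaccinations'], label=country)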

plt.title('Total COVID-19 Vaccinations Over Time', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Total Vaccinations', fontsize=14)
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Plot percentage of population vaccinated
plt.figure(figsize=(14, 8))

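for country in countries_of_interest:
    # Drop dates without a reported value so the line is not broken by NaN gaps
    country_data = filtered_df[filtered_df['location'] == country].dropna(subset=['people_vaccinated_per_hundred'])
    plt.plot(country_data['date'], country_data['people_vaccinated_per_hundred'], label=country)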

plt.title('Percentage of Population Vaccinated Over Time', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('People Vaccinated (%)', fontsize=14)
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
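# Create a bar chart for vaccination rates by country (latest reported value)
# The final row per country may have no vaccination figure, so use the last non-null one
latest_vax = (
    filtered_df.dropna(subset=['people_vaccinated_per_hundred'])
    .sort_values('date')
    .groupby('location')
    .tail(1)
)
plt.figure(figsize=(14, 8))
sns.barplot(x='location', y='people_vaccinated_per_hundred', data=latest_vax)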
plt.title('COVID-19 Vaccination Rate by Country (Latest Data)', fontsize=16)
plt.xlabel('Country', fontsize=14)
plt.ylabel('People Vaccinated (%)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 6. Building a Choropleth Map

Let's create a world map visualization for COVID-19 cases and vaccination rates.

In [None]:
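# Get the latest reported (non-null) value of each column for all countries
latest_global_data = df.sort_values('date').groupby('location', as_index=False).last()

# death_rate was only added to filtered_df, so compute it here for the hover text
latest_global_data['death_rate'] = (
    latest_global_data['total_deaths'] / latest_global_data['total_cases'] * 100
).round(2).replace([np.inf, -np.inf], np.nan)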

# Create a choropleth map for total cases
fig = px.choropleth(
    latest_global_data,
    locations="iso_code",
    color="total_cases",
    hover_name="location",
    hover_data=["total_cases", "total_deaths", "death_rate"],
    color_continuous_scale="Viridis",
    title="Global COVID-19 Cases"
)

fig.update_layout(height=600, width=1000)
fig.show()

In [None]:
# Create a choropleth map for vaccination rates
fig = px.choropleth(
    latest_global_data,
    locations="iso_code",
    color="people_vaccinated_per_hundred",
    hover_name="location",
    hover_data=["people_vaccinated_per_hundred", "people_fully_vaccinated_per_hundred"],
    color_continuous_scale="Viridis",
    title="Global COVID-19 Vaccination Rates (%)"
)

fig.update_layout(height=600, width=1000)
fig.show()
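
The dataset mixes individual countries with aggregate rows such as "World" or continents. As a sketch, and assuming these aggregates are the rows whose `iso_code` starts with `OWID_` (worth checking against `df['iso_code'].unique()`, since a few individual territories may carry the same prefix), they can be separated like this. The numbers below are made up.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "iso_code": ["USA", "KEN", "OWID_WRL", "OWID_EUR"],
    "location": ["United States", "Kenya", "World", "Europe"],
    "total_cases": [100, 10, 1000, 400],
})

# Assumption: aggregate rows are marked by an "OWID_" prefix in iso_code
is_aggregate = toy["iso_code"].str.startswith("OWID_")
countries_only = toy[~is_aggregate]

print(countries_only)
```

Filtering like this matters most for rankings and summary tables, where a "World" row would otherwise dominate every comparison.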

## 7. Insights & Reporting

Let's summarize our findings and insights from the COVID-19 data analysis.

### Key Insights from the COVID-19 Data Analysis

1. **Global Case Distribution**: The United States, India, and Brazil reported some of the highest cumulative case counts for much of the pandemic. The current ranking among the selected countries can be read from the `latest_data` table and the total cases bar chart.

2. **Death Rate Variations**: High case numbers do not imply high death rates. The death rate computed here (total deaths divided by total cases) is a crude case fatality ratio that varies across countries, likely reflecting differences in healthcare capacity, testing coverage, and population demographics.

3. **Vaccination Progress**: Countries like the United Kingdom and the United States achieved higher vaccination rates than several of the other selected countries, demonstrating the disparity in vaccine distribution and administration globally.

4. **Waves of Infection**: The data reveals multiple waves of infection across different countries, with varying timing and intensity, highlighting the dynamic nature of the pandemic and the influence of public health measures.

5. **Correlation Between Measures**: The plots are consistent with an earlier vaccination rollout being followed by lower death rates in later waves. This notebook does not test that relationship directly, so it should be treated as a hypothesis rather than a finding.
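
As a starting point for checking insight 5, the sketch below computes a Pearson correlation between vaccination rate and death rate across countries. The data is synthetic and deliberately perfectly linear, so the result says nothing about the real dataset; it only shows the mechanics.

```python
import pandas as pd

# Synthetic data, invented for illustration only
toy = pd.DataFrame({
    "location": ["A", "B", "C", "D", "E"],
    "people_vaccinated_per_hundred": [10.0, 30.0, 50.0, 70.0, 90.0],
    "death_rate": [3.0, 2.5, 2.0, 1.5, 1.0],
})

# Series.corr uses the Pearson coefficient by default
r = toy["people_vaccinated_per_hundred"].corr(toy["death_rate"])

print(f"Pearson r = {r:.2f}")
```

On real data, a cross-country correlation like this would still be confounded by factors such as age structure and testing coverage, so it would support the hypothesis at most, not establish causation.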

### Conclusion

This analysis provides valuable insights into the global COVID-19 pandemic, highlighting the disparities in case numbers, death rates, and vaccination progress across different countries. The visualizations help in understanding the temporal trends and geographical distribution of the pandemic.

The data suggests that while the pandemic has affected countries worldwide, the impact has been uneven, with some countries experiencing more severe outbreaks than others. Vaccination coverage also differs widely between countries; whether higher coverage translates into better outcomes is not established by this analysis and would need to be tested directly.

Future analysis could focus on the relationship between public health measures, vaccination rates, and pandemic outcomes, as well as the long-term economic and social impacts of the pandemic.