# COVID-19 Global Data Tracker

## Project Description
This project analyzes global COVID-19 trends in cases, deaths, and vaccinations. Using real-world data, we clean and process the dataset, perform exploratory data analysis (EDA), and generate insights. The project includes visualizations to communicate findings effectively.

### Objectives
- Import and clean COVID-19 global data.
- Analyze time trends (cases, deaths, vaccinations).
- Compare metrics across countries/regions.
- Visualize trends with charts and maps.
- Communicate findings in a Jupyter Notebook report.


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline


In [65]:
# Define file name and load data
file_name = "owid-covid-data.csv"
df = pd.read_csv(file_name)

# Display columns and preview
print("Columns:", df.columns.tolist())
display(df.head())

# Check for missing values
print("\nMissing values:\n", df.isnull().sum())

Columns: ['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases', 'new_cases_smoothed', 'total_deaths', 'new_deaths', 'new_deaths_smoothed', 'total_cases_per_million', 'new_cases_per_million', 'new_cases_smoothed_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients', 'hosp_patients_per_million', 'weekly_icu_admissions', 'weekly_icu_admissions_per_million', 'weekly_hosp_admissions', 'weekly_hosp_admissions_per_million', 'total_tests', 'new_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'total_boosters', 'new_vaccinations', 'new_vaccinations_smoothed', 'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred'

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-01-05,0.0,0.0,,0.0,0.0,,...,,37.746,0.5,64.83,0.511,41128772,,,,
1,AFG,Asia,Afghanistan,2020-01-06,0.0,0.0,,0.0,0.0,,...,,37.746,0.5,64.83,0.511,41128772,,,,
2,AFG,Asia,Afghanistan,2020-01-07,0.0,0.0,,0.0,0.0,,...,,37.746,0.5,64.83,0.511,41128772,,,,
3,AFG,Asia,Afghanistan,2020-01-08,0.0,0.0,,0.0,0.0,,...,,37.746,0.5,64.83,0.511,41128772,,,,
4,AFG,Asia,Afghanistan,2020-01-09,0.0,0.0,,0.0,0.0,,...,,37.746,0.5,64.83,0.511,41128772,,,,



Missing values:
 iso_code                                        0
continent                                   26525
location                                        0
date                                            0
total_cases                                 17631
                                            ...  
population                                      0
excess_mortality_cumulative_absolute       416024
excess_mortality_cumulative                416024
excess_mortality                           416024
excess_mortality_cumulative_per_million    416024
Length: 67, dtype: int64
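
The raw counts above are easier to judge as a share of all rows. A minimal sketch, using a tiny made-up frame in place of the real data (the `sample` values and the `missing_report` helper are illustrative assumptions, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up values) standing in for the loaded data.
sample = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "total_cases": [1.0, np.nan, 3.0, 4.0],
    "excess_mortality": [np.nan, np.nan, np.nan, 0.5],
})

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the missing count and percentage per column, worst first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return report.sort_values("missing", ascending=False)

report = missing_report(sample)
print(report)
```

Columns that are mostly empty, as the excess-mortality columns appear to be here, are usually better dropped or analyzed separately than filled in.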


# Filter for selected countries
countries = ['Kenya', 'United States', 'India']  # names must match the 'location' column exactly
df = df[df['location'].isin(countries)]

# Drop rows with missing critical values
df = df.dropna(subset=['date', 'total_cases', 'total_deaths'])

# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Fill missing numeric values with 0
df = df.fillna(0)

# Preview cleaned data
df.head()
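
Because `isin` silently ignores names that do not occur in the `location` column, a misspelled or abbreviated country name simply yields fewer rows. One possible guard is sketched below; the `filter_locations` helper and the sample rows are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Illustrative stand-in for the loaded data (made-up rows).
sample = pd.DataFrame({
    "location": ["Kenya", "India", "United States", "Kenya"],
    "total_cases": [1.0, 2.0, 3.0, 4.0],
})

def filter_locations(frame: pd.DataFrame, wanted: list) -> pd.DataFrame:
    """Keep rows for the wanted locations; fail loudly on unknown names."""
    known = set(frame["location"].unique())
    unknown = sorted(set(wanted) - known)
    if unknown:
        raise ValueError(f"Locations not found in data: {unknown}")
    # .copy() so later column assignments do not touch a view of the original
    return frame[frame["location"].isin(wanted)].copy()

subset = filter_locations(sample, ["Kenya", "India"])
print(subset)
```

Calling `filter_locations(sample, ["USA"])` raises a `ValueError` instead of returning an empty selection.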

In [85]:
# Ensure 'continent' column exists in df_filtered
if 'continent' not in df_filtered.columns:
    df_filtered = df_filtered.merge(df[['iso_code', 'continent']], on='iso_code', how='left')

# Replace infinite values in 'case_fatality_rate' with NaN
if 'case_fatality_rate' in df_filtered.columns:
    df_filtered['case_fatality_rate'] = df_filtered['case_fatality_rate'].replace([float('inf'), -float('inf')], float('nan'))

# Drop rows with NaN values in 'case_fatality_rate'
if 'case_fatality_rate' in df_filtered.columns:
    df_filtered.dropna(subset=['case_fatality_rate'], inplace=True)

# Display the cleaned 'df_filtered' DataFrame
print("Cleaned 'df_filtered':")
display(df_filtered.head())

Cleaned 'df_filtered':


Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million,death_rate,percent_vaccinated
56,AFG,Asia,Afghanistan,2020-03-01,1.0,1.0,0.143,0.0,0.0,0.0,...,0.5,64.83,0.511,41128772,,,,,0.0,
57,AFG,Asia,Afghanistan,2020-03-02,1.0,0.0,0.143,0.0,0.0,0.0,...,0.5,64.83,0.511,41128772,,,,,0.0,
58,AFG,Asia,Afghanistan,2020-03-03,1.0,0.0,0.143,0.0,0.0,0.0,...,0.5,64.83,0.511,41128772,,,,,0.0,
59,AFG,Asia,Afghanistan,2020-03-04,1.0,0.0,0.143,0.0,0.0,0.0,...,0.5,64.83,0.511,41128772,,,,,0.0,
60,AFG,Asia,Afghanistan,2020-03-05,1.0,0.0,0.143,0.0,0.0,0.0,...,0.5,64.83,0.511,41128772,,,,,0.0,


plt.figure(figsize=(10,6))
for country in countries:
    country_data = df[df['location'] == country]
    plt.plot(country_data['date'], country_data['total_deaths'], label=country)
plt.title('Total COVID-19 Deaths Over Time')
plt.xlabel('Date')
plt.ylabel('Total Deaths')
plt.legend()
plt.show()


In [74]:
# Define df_filtered based on the cleaned and filtered data
df_filtered = df.copy()

# Ensure 'continent' column exists in df_filtered
if 'continent' not in df_filtered.columns:
    df_filtered = df_filtered.merge(df[['iso_code', 'continent']], on='iso_code', how='left')

# Replace infinite values in 'case_fatality_rate' with NaN
if 'case_fatality_rate' in df_filtered.columns:
    df_filtered['case_fatality_rate'] = df_filtered['case_fatality_rate'].replace([float('inf'), -float('inf')], float('nan'))

# Drop rows with NaN values in 'case_fatality_rate'
if 'case_fatality_rate' in df_filtered.columns:
    df_filtered.dropna(subset=['case_fatality_rate'], inplace=True)

# Display the cleaned 'df_filtered' DataFrame
print("Cleaned 'df_filtered':")
display(df_filtered.head())

Cleaned 'df_filtered':


Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million,death_rate
0,AFG,Asia,Afghanistan,2020-01-05,0.0,0.0,,0.0,0.0,,...,37.746,0.5,64.83,0.511,41128772,,,,,
1,AFG,Asia,Afghanistan,2020-01-06,0.0,0.0,,0.0,0.0,,...,37.746,0.5,64.83,0.511,41128772,,,,,
2,AFG,Asia,Afghanistan,2020-01-07,0.0,0.0,,0.0,0.0,,...,37.746,0.5,64.83,0.511,41128772,,,,,
3,AFG,Asia,Afghanistan,2020-01-08,0.0,0.0,,0.0,0.0,,...,37.746,0.5,64.83,0.511,41128772,,,,,
4,AFG,Asia,Afghanistan,2020-01-09,0.0,0.0,,0.0,0.0,,...,37.746,0.5,64.83,0.511,41128772,,,,,


In [75]:
df['death_rate'] = df['total_deaths'] / df['total_cases']
latest = df.sort_values('date').groupby('location').tail(1)
display(latest[['location', 'total_cases', 'total_deaths', 'death_rate']])

Unnamed: 0,location,total_cases,total_deaths,death_rate
422728,Western Sahara,,,
282897,Northern Cyprus,,,
225269,Macao,,,
421053,Wales,,,
375651,Taiwan,,,
...,...,...,...,...
217093,Lithuania,,,
230301,Malaysia,,,
21775,Asia,,,
424412,World,,,
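
The table above is all `NaN` because the most recent row for each location has no reported totals. A minimal sketch of one way around this, assuming the goal is the last row per location with a defined rate (the sample values are made up):

```python
import numpy as np
import pandas as pd

# Made-up rows: location A has an empty final row, as in the output above.
sample = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01", "2021-01-02"]
    ),
    "total_cases": [0.0, 100.0, np.nan, 50.0, 200.0],
    "total_deaths": [0.0, 2.0, np.nan, 1.0, 10.0],
})

# Treat zero cases as undefined so the division never produces inf.
cases = sample["total_cases"].where(sample["total_cases"] > 0)
sample["death_rate"] = sample["total_deaths"] / cases

# Drop rows without a rate first, then take the latest row per location.
latest = (
    sample.dropna(subset=["death_rate"])
    .sort_values("date")
    .groupby("location")
    .tail(1)
    .set_index("location")
)
print(latest[["date", "total_cases", "total_deaths", "death_rate"]])
```

Masking zero case counts before dividing also removes the need to replace infinite values afterwards.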


In [79]:
# Ensure 'continent' column exists in df_filtered
if 'continent' not in df_filtered.columns:
    df_filtered = df_filtered.merge(df[['iso_code', 'continent']], on='iso_code', how='left')

# Replace infinite values in 'death_rate' with NaN
if 'death_rate' in df_filtered.columns:
    df_filtered['death_rate'] = df_filtered['death_rate'].replace([float('inf'), -float('inf')], float('nan'))

# Drop rows with NaN values in 'death_rate'
if 'death_rate' in df_filtered.columns:
    df_filtered.dropna(subset=['death_rate'], inplace=True)

# Display the cleaned 'df_filtered' DataFrame
print("Cleaned 'df_filtered':")
display(df_filtered.head())

Cleaned 'df_filtered':


Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million,death_rate
56,AFG,Asia,Afghanistan,2020-03-01,1.0,1.0,0.143,0.0,0.0,0.0,...,37.746,0.5,64.83,0.511,41128772,,,,,0.0
57,AFG,Asia,Afghanistan,2020-03-02,1.0,0.0,0.143,0.0,0.0,0.0,...,37.746,0.5,64.83,0.511,41128772,,,,,0.0
58,AFG,Asia,Afghanistan,2020-03-03,1.0,0.0,0.143,0.0,0.0,0.0,...,37.746,0.5,64.83,0.511,41128772,,,,,0.0
59,AFG,Asia,Afghanistan,2020-03-04,1.0,0.0,0.143,0.0,0.0,0.0,...,37.746,0.5,64.83,0.511,41128772,,,,,0.0
60,AFG,Asia,Afghanistan,2020-03-05,1.0,0.0,0.143,0.0,0.0,0.0,...,37.746,0.5,64.83,0.511,41128772,,,,,0.0


## Key Insights

- The USA had the highest total cases and deaths among the selected countries.
- India experienced a sharp rise in cases during mid-2021.
- Kenya's vaccination rollout lagged behind the USA and India.
- Death rates were highest in the early stages of the pandemic for all countries.
- Vaccination coverage rose faster in the USA than in Kenya or India.
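
The vaccination comparisons rely on a `percent_vaccinated` column that shows up in the cell outputs, but the code that creates it is not shown. A plausible sketch, assuming it is `people_vaccinated` divided by `population` (sample values are made up):

```python
import numpy as np
import pandas as pd

# Made-up rows; vaccination counts are often reported irregularly.
sample = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2021-06-01", "2021-06-02", "2021-06-01", "2021-06-02"]
    ),
    "people_vaccinated": [100.0, np.nan, 10.0, 30.0],
    "population": [1000, 1000, 200, 200],
})

# Carry the last reported count forward within each location.
sample = sample.sort_values(["location", "date"])
sample["people_vaccinated"] = (
    sample.groupby("location")["people_vaccinated"].ffill()
)

# Assumed definition: share of the population with at least one dose.
sample["percent_vaccinated"] = (
    sample["people_vaccinated"] / sample["population"] * 100
)

latest_pct = sample.groupby("location")["percent_vaccinated"].last()
print(latest_pct)
```

Forward-filling within each location avoids treating days without a report as zero coverage.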

In [None]:
# Replace infinite values in 'average_case_fatality_rate' with NaN in 'average_cfr_by_continent'
average_cfr_by_continent.loc[:, 'average_case_fatality_rate'] = average_cfr_by_continent['average_case_fatality_rate'].replace([float('inf'), -float('inf')], float('nan'))

# Drop rows with NaN values in 'average_case_fatality_rate'
average_cfr_by_continent.dropna(subset=['average_case_fatality_rate'], inplace=True)

# Display the cleaned 'average_cfr_by_continent' DataFrame
print("Cleaned 'average_cfr_by_continent':")
print(average_cfr_by_continent)

Cleaned 'average_cfr_by_continent':
       continent  average_case_fatality_rate
1           Asia                    1.695381
2         Europe                   43.019821
3  North America                    1.618914
4        Oceania                    0.562020
5  South America                    2.451340
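
The cell that builds `average_cfr_by_continent` is not included in this notebook. One way such a table could be produced is sketched below, assuming the rate is deaths divided by cases, expressed in percent and averaged over rows (sample values are made up):

```python
import numpy as np
import pandas as pd

# Made-up rows; the last row mimics an aggregate entry without a continent.
sample = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", None],
    "total_cases": [100.0, 200.0, 0.0, 50.0, 500.0],
    "total_deaths": [2.0, 2.0, 1.0, 1.0, 5.0],
})

# Assumed definition: case fatality rate in percent.
sample["case_fatality_rate"] = (
    sample["total_deaths"] / sample["total_cases"] * 100
)
# Division by zero cases yields inf; mark those rows as missing.
sample["case_fatality_rate"] = sample["case_fatality_rate"].replace(
    [np.inf, -np.inf], np.nan
)

average_cfr_by_continent = (
    sample.dropna(subset=["continent", "case_fatality_rate"])
    .groupby("continent", as_index=False)["case_fatality_rate"]
    .mean()
    .rename(columns={"case_fatality_rate": "average_case_fatality_rate"})
)
print(average_cfr_by_continent)
```

A mean over daily rows gives early days with very few cases the same weight as later ones, which can inflate the average. Dividing summed deaths by summed cases per continent is a common alternative.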


## Final Deliverable
This notebook provides a comprehensive analysis of global COVID-19 data, including:
- Data cleaning and preparation.
- Exploratory data analysis (EDA).
- Visualizations of key trends.
- Narrative insights and findings.

It is designed to be easy to read, well-commented, and reproducible.

In [None]:
# Select countries of interest
countries_of_interest = ['United States', 'India', 'Kenya']

# Filter the dataset for the selected countries (copy to avoid writing to a view of df)
filtered_df = df[df['location'].isin(countries_of_interest)].copy()

# Convert the 'date' column to datetime format
filtered_df['date'] = pd.to_datetime(filtered_df['date'])

# Plot total cases over time for selected countries
plt.figure(figsize=(12, 6))
for country in countries_of_interest:
    country_data = filtered_df[filtered_df['location'] == country]
    plt.plot(country_data['date'], country_data['total_cases'], label=country)

plt.title('Total COVID-19 Cases Over Time')
plt.xlabel('Date')
plt.ylabel('Total Cases')
plt.legend()
plt.grid(True)
plt.show()

# Plot total deaths over time for selected countries
plt.figure(figsize=(12, 6))
for country in countries_of_interest:
    country_data = filtered_df[filtered_df['location'] == country]
    plt.plot(country_data['date'], country_data['total_deaths'], label=country)

plt.title('Total COVID-19 Deaths Over Time')
plt.xlabel('Date')
plt.ylabel('Total Deaths')
plt.legend()
plt.grid(True)
plt.show()

## Conclusion

This notebook demonstrates how to load, clean, analyze, and visualize real-world COVID-19 data. The analysis highlights differences in pandemic trends and vaccination progress across Kenya, the United States, and India. The approach can be extended to other countries or metrics as needed.