# COVID-19 Global Data

The COVID-19 Global Data Tracker is a powerful real-world project that builds strong skills in data cleaning, EDA, and data visualization. Let's do it step by step, and I’ll walk you through everything.

Step1: data loading and exloration

In [27]:
import pandas as pd

# 1Load Data
try:
    df = pd.read_csv('owid-covid-data.csv')
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print("Error: owid-covid-data.csv not found. Please download and place it in your working directory.")
    raise

#preview the dataset
print("Dataset preview")
print(df.head())



Dataset loaded successfully!
Dataset preview
  iso_code continent     location        date  total_cases  new_cases  \
0      AFG      Asia  Afghanistan  2020-01-05          0.0        0.0   
1      AFG      Asia  Afghanistan  2020-01-06          0.0        0.0   
2      AFG      Asia  Afghanistan  2020-01-07          0.0        0.0   
3      AFG      Asia  Afghanistan  2020-01-08          0.0        0.0   
4      AFG      Asia  Afghanistan  2020-01-09          0.0        0.0   

   new_cases_smoothed  total_deaths  new_deaths  new_deaths_smoothed  ...  \
0                 NaN           0.0         0.0                  NaN  ...   
1                 NaN           0.0         0.0                  NaN  ...   
2                 NaN           0.0         0.0                  NaN  ...   
3                 NaN           0.0         0.0                  NaN  ...   
4                 NaN           0.0         0.0                  NaN  ...   

   male_smokers  handwashing_facilities  hospital_bed

In [28]:
# Step 2.3: Check column names
print("\nColumn names:")
print(df.columns)



Column names:
Index(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients',
       'icu_patients_per_million', 'hosp_patients',
       'hosp_patients_per_million', 'weekly_icu_admissions',
       'weekly_icu_admissions_per_million', 'weekly_hosp_admissions',
       'weekly_hosp_admissions_per_million', 'total_tests', 'new_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
       'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
       'new_vaccinations', 'new_vaccinations_smoot

In [29]:
df.shape

(429435, 67)

In [30]:
print("\nData types")
print(df.dtypes)

print("\nMissing values per column")
print(df.isnull().sum().sort_values(ascending=False))


Data types
iso_code                                    object
continent                                   object
location                                    object
date                                        object
total_cases                                float64
                                            ...   
population                                   int64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 67, dtype: object

Missing values per column
weekly_icu_admissions                   418442
weekly_icu_admissions_per_million       418442
excess_mortality                        416024
excess_mortality_cumulative_absolute    416024
excess_mortality_cumulative             416024
                                         ...  
total_cases_per_million                  17631
location                                     

# Step 3: Data cleaning

In [31]:
# Convert date to datetime
df['date'] = pd.to_datetime(df['date'])


In [32]:
# Drop rows with missing 'location' or 'date' (critical for analysis)
df.dropna(subset=['location', 'date'], inplace=True)

In [26]:
# Fill missing numeric columns with forward fill (grouped by location)
import numpy as np
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
df[num_cols] = df.groupby('location')[num_cols].apply(lambda group: group.ffill())


KeyboardInterrupt



In [33]:
# Drop any remaining rows with critical missing data for total_cases and total_deaths
df = df.dropna(subset=['total_cases', 'total_deaths'])

print("\nAfter cleaning, missing values in key columns:")
print(df[['total_cases', 'total_deaths', 'total_vaccinations']].isnull().sum())


After cleaning, missing values in key columns:
total_cases                0
total_deaths               0
total_vaccinations    338262
dtype: int64


In [19]:
# Check missing values in ey columns

df[['total_cases', 'total_deaths', 'total_vaccinations']].isnull().sum()



total_cases            17631
total_deaths           17631
total_vaccinations    344018
dtype: int64

In [20]:
df[['total_cases', 'total_deaths', 'total_vaccinations']] = df[['total_cases', 'total_deaths', 'total_vaccinations']].fillna(0)


In [21]:
df[['total_cases', 'total_deaths', 'total_vaccinations']].isnull().sum()

total_cases           0
total_deaths          0
total_vaccinations    0
dtype: int64

In [22]:
# Remove aggregate regions
aggregates = ['World', 'Africa', 'Asia', 'Europe', 'European Union',
              'International', 'North America', 'Oceania', 'South America']

df = df[~df['location'].isin(aggregates)]


In [23]:
df = df.sort_values(['location', 'date']).reset_index(drop=True)
