## COVID-19 Mortality Data (2020–2021)

To explore the potential long-term health impacts of air pollution, we use IHME estimates of COVID-19 mortality for 2020 and 2021.

---

### 📦 Import Required Pandas Library

In [1]:
import pandas as pd

### 📁 Loading the Dataset

Load the raw CSV file, make a copy to hold the cleaned dataset,
and preview the first few rows.

In [2]:
raw_data = pd.read_csv("../1_datasets/raw_datasets/covid_deaths.csv")
cleaned_death_data = raw_data.copy()
cleaned_death_data.head()

| | measure | location | sex | age | cause | metric | year | val | upper | lower |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Deaths | Spain | Both | All ages | COVID-19 | Percent | 2020 | 0.125091 | 0.130679 | 0.118814 |
| 1 | Deaths | Spain | Both | All ages | COVID-19 | Rate | 2020 | 132.948639 | 139.131075 | 126.395299 |
| 2 | Deaths | Spain | Both | 15-49 years | COVID-19 | Percent | 2020 | 0.148088 | 0.156547 | 0.139678 |
| 3 | Deaths | Spain | Both | 15-49 years | COVID-19 | Rate | 2020 | 10.472981 | 11.087217 | 9.876074 |
| 4 | Deaths | Fiji | Both | All ages | COVID-19 | Percent | 2021 | 0.12138 | 0.234511 | 0.063697 |
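
Why work on a copy at all? The short sketch below may help. It uses a hypothetical two-row frame (`raw_demo`, built from values in the preview above) rather than the real file, and only illustrates that edits to a `.copy()` leave the original untouched.

```python
import pandas as pd

# Hypothetical stand-in for the raw CSV, using two rows from the preview.
raw_demo = pd.DataFrame(
    {
        "location": ["Spain", "Fiji"],
        "year": [2020, 2021],
        "val": [0.125091, 0.12138],
    }
)

# .copy() returns an independent DataFrame (a deep copy by default).
cleaned_demo = raw_demo.copy()

# Overwrite a value in the copy only.
cleaned_demo.loc[0, "val"] = 0.0

print(raw_demo.loc[0, "val"])      # the raw value is unchanged
print(cleaned_demo.loc[0, "val"])  # the copy holds the edited value
```

Keeping `raw_data` unmodified makes it possible to re-run or compare cleaning steps without re-reading the file.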


### 🧾 Initial Column Inspection
We inspect the full list of column names to:
- Identify metadata columns
- Identify unnecessary columns

In [3]:
cleaned_death_data.columns

Index(['measure', 'location', 'sex', 'age', 'cause', 'metric', 'year', 'val',
       'upper', 'lower'],
      dtype='object')

In [3]:
# Check the initial number of rows and columns
cleaned_death_data.shape

(200, 10)
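
One possible way to spot unnecessary columns is to count the distinct values in each column: a column with a single value adds nothing to the analysis. The sketch below uses a hypothetical three-row frame (`demo`) taken from the preview rows. Whether `measure`, `sex`, and `cause` are constant across all 200 rows of the real file is an assumption that should be checked on `cleaned_death_data` itself.

```python
import pandas as pd

# Hypothetical stand-in built from the first rows of the preview.
demo = pd.DataFrame(
    {
        "measure": ["Deaths", "Deaths", "Deaths"],
        "location": ["Spain", "Spain", "Fiji"],
        "sex": ["Both", "Both", "Both"],
        "cause": ["COVID-19", "COVID-19", "COVID-19"],
        "metric": ["Percent", "Rate", "Percent"],
    }
)

# Count distinct values per column.
unique_counts = demo.nunique()

# Columns with exactly one distinct value are candidates for dropping.
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)
```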

### 🔍 Summary Statistics & Missing Data

- Check `.info()` for data types and nulls
- Use `.isnull().sum()` to count missing values

In [4]:
cleaned_death_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   measure   200 non-null    object 
 1   location  200 non-null    object 
 2   sex       200 non-null    object 
 3   age       200 non-null    object 
 4   cause     200 non-null    object 
 5   metric    200 non-null    object 
 6   year      200 non-null    int64  
 7   val       200 non-null    float64
 8   upper     200 non-null    float64
 9   lower     200 non-null    float64
dtypes: float64(3), int64(1), object(6)
memory usage: 15.8+ KB


In [5]:
cleaned_death_data.isnull().sum()

measure     0
location    0
sex         0
age         0
cause       0
metric      0
year        0
val         0
upper       0
lower       0
dtype: int64
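
Beyond null counts, a further sanity check could confirm that each estimate lies between its bounds. The sketch below assumes that `lower` and `upper` bound an uncertainty interval around `val`, which the notebook does not state explicitly. The frame `demo` is a hypothetical stand-in using two rows from the preview.

```python
import pandas as pd

# Hypothetical stand-in: two rows copied from the preview.
demo = pd.DataFrame(
    {
        "val": [10.472981, 38.868283],
        "upper": [11.087217, 77.220136],
        "lower": [9.876074, 23.558042],
    }
)

# True where lower <= val <= upper (bounds are inclusive by default).
in_bounds = demo["val"].between(demo["lower"], demo["upper"])

# Number of rows that violate the assumed ordering.
n_inconsistent = int((~in_bounds).sum())
print(n_inconsistent)
```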

### 🧹 Column Renaming and Dropping
Drop columns:
- Remove `sex`, `measure`, and `cause`, which are not needed for the analysis

Rename columns:
- Rename `val` to `COVID Deaths Val` and `location` to `Country`


In [6]:
cleaned_death_data = cleaned_death_data.drop(
    ["sex", "measure", "cause"], axis=1, errors="ignore"
)
cleaned_death_data = cleaned_death_data.rename(
    columns={"val": "COVID Deaths Val", "location": "Country"}
)
cleaned_death_data.head()

| | Country | age | metric | year | COVID Deaths Val | upper | lower |
|---|---|---|---|---|---|---|---|
| 0 | Spain | All ages | Percent | 2020 | 0.125091 | 0.130679 | 0.118814 |
| 1 | Spain | All ages | Rate | 2020 | 132.948639 | 139.131075 | 126.395299 |
| 2 | Spain | 15-49 years | Percent | 2020 | 0.148088 | 0.156547 | 0.139678 |
| 3 | Spain | 15-49 years | Rate | 2020 | 10.472981 | 11.087217 | 9.876074 |
| 4 | Fiji | All ages | Percent | 2021 | 0.12138 | 0.234511 | 0.063697 |


### 🎯 Filtering by Country

To focus the analysis on the relevant entries, we keep only the rows whose `Country` appears in a list of 25 countries, using the `isin()` method.

In [7]:
countries_to_keep = [
    "Egypt",
    "India",
    "United States of America",
    "Chile",
    "Nigeria",
    "Italy",
    "China",
    "Japan",
    "Afghanistan",
    "Germany",
    "Russian Federation",
    "Spain",
    "Romania",
    "Indonesia",
    "Saudi Arabia",
    "Bangladesh",
    "Republic of Korea",
    "South Africa",
    "Ethiopia",
    "Kenya",
    "Brazil",
    "Mexico",
    "Canada",
    "Australia",
    "Fiji",
]

cleaned_death_data = cleaned_death_data[
    cleaned_death_data["Country"].isin(countries_to_keep)
]
cleaned_death_data.head(26)

| | Country | age | metric | year | COVID Deaths Val | upper | lower |
|---|---|---|---|---|---|---|---|
| 0 | Spain | All ages | Percent | 2020 | 0.125091 | 0.130679 | 0.118814 |
| 1 | Spain | All ages | Rate | 2020 | 132.948639 | 139.131075 | 126.395299 |
| 2 | Spain | 15-49 years | Percent | 2020 | 0.148088 | 0.156547 | 0.139678 |
| 3 | Spain | 15-49 years | Rate | 2020 | 10.472981 | 11.087217 | 9.876074 |
| 4 | Fiji | All ages | Percent | 2021 | 0.12138 | 0.234511 | 0.063697 |
| 5 | Fiji | All ages | Rate | 2021 | 124.301052 | 246.763477 | 75.31109 |
| 6 | Fiji | 15-49 years | Percent | 2021 | 0.123064 | 0.236029 | 0.064148 |
| 7 | Fiji | 15-49 years | Rate | 2021 | 38.868283 | 77.220136 | 23.558042 |
| 8 | Indonesia | All ages | Percent | 2020 | 0.046378 | 0.091085 | 0.015115 |
| 9 | Indonesia | All ages | Rate | 2020 | 32.667317 | 67.054381 | 10.621372 |
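
Because `isin()` silently drops any name that does not match exactly, it can be worth checking which requested countries are absent from the data. The sketch below is illustrative only: `demo` and the shortened `wanted` list are hypothetical, and the mismatched spelling is invented to show the failure mode, not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the Country column.
demo = pd.DataFrame(
    {"Country": ["Spain", "Fiji", "United States of America", "Spain"]}
)

# Shortened, hypothetical list with one deliberately mismatched name.
wanted = ["Spain", "Fiji", "United States"]

# Names requested but not present in the data (exact string match).
present = set(demo["Country"].unique())
missing = sorted(set(wanted) - present)
print(missing)

# The filter keeps only exact matches, so the mismatched name yields no rows.
filtered = demo[demo["Country"].isin(wanted)]
print(len(filtered))
```

An empty `missing` list on the real data would suggest that every name in `countries_to_keep` matches the spelling used in the dataset.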


### Filter data for Rate metric, year 2020, and age 15–49 years

We keep only rows where:
- `metric` = **"Rate"**
- `year` = **2020**
- `age` = **"15-49 years"**

This focuses the analysis on younger adults, to explore whether factors such as PM2.5 exposure help explain higher COVID-19 mortality beyond age-related vulnerability.


In [11]:
df_cleaned_death_data = cleaned_death_data[
    (cleaned_death_data["metric"] == "Rate")
    & (cleaned_death_data["year"] == 2020)
    & (cleaned_death_data["age"] == "15-49 years")
]
print(df_cleaned_death_data)

                      Country          age metric  year  COVID Deaths Val  \
3                       Spain  15-49 years   Rate  2020         10.472981   
11                  Indonesia  15-49 years   Rate  2020         10.159491   
27                     Mexico  15-49 years   Rate  2020         57.019696   
31         Russian Federation  15-49 years   Rate  2020         18.195048   
35                      Chile  15-49 years   Rate  2020         17.427140   
39                Afghanistan  15-49 years   Rate  2020         67.643147   
43               South Africa  15-49 years   Rate  2020         54.933363   
47                 Bangladesh  15-49 years   Rate  2020         18.638962   
59                     Canada  15-49 years   Rate  2020         11.392234   
63                      India  15-49 years   Rate  2020         18.618172   
71          Republic of Korea  15-49 years   Rate  2020          0.239668   
79                   Ethiopia  15-49 years   Rate  2020         24.108446   

In [12]:
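df_cleaned_death_data = df_cleaned_death_data.reset_index(drop=True)
df_cleaned_death_data  # reassigned, since reset_index returns a new DataFrame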

| | Country | age | metric | year | COVID Deaths Val | upper | lower |
|---|---|---|---|---|---|---|---|
| 0 | Spain | 15-49 years | Rate | 2020 | 10.472981 | 11.087217 | 9.876074 |
| 1 | Indonesia | 15-49 years | Rate | 2020 | 10.159491 | 20.825411 | 3.318329 |
| 2 | Mexico | 15-49 years | Rate | 2020 | 57.019696 | 70.946973 | 44.181729 |
| 3 | Russian Federation | 15-49 years | Rate | 2020 | 18.195048 | 22.069367 | 14.809403 |
| 4 | Chile | 15-49 years | Rate | 2020 | 17.42714 | 18.2755 | 16.582775 |
| 5 | Afghanistan | 15-49 years | Rate | 2020 | 67.643147 | 90.454679 | 51.068891 |
| 6 | South Africa | 15-49 years | Rate | 2020 | 54.933363 | 54.955219 | 54.898359 |
| 7 | Bangladesh | 15-49 years | Rate | 2020 | 18.638962 | 25.12202 | 15.987977 |
| 8 | Canada | 15-49 years | Rate | 2020 | 11.392234 | 12.087162 | 10.701911 |
| 9 | India | 15-49 years | Rate | 2020 | 18.618172 | 19.671096 | 17.520649 |
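
After combining the three conditions, each country should appear at most once. A minimal sketch of that check follows, using a hypothetical four-row frame (`demo`) with values copied from the previews above. It is not the notebook's own validation step.

```python
import pandas as pd

# Hypothetical stand-in: rows copied from the previews above.
demo = pd.DataFrame(
    {
        "Country": ["Spain", "Spain", "Spain", "Fiji"],
        "age": ["All ages", "15-49 years", "15-49 years", "15-49 years"],
        "metric": ["Rate", "Percent", "Rate", "Rate"],
        "year": [2020, 2020, 2020, 2021],
        "COVID Deaths Val": [132.948639, 0.148088, 10.472981, 38.868283],
    }
)

# Combine the three conditions with & (each wrapped in parentheses).
mask = (
    (demo["metric"] == "Rate")
    & (demo["year"] == 2020)
    & (demo["age"] == "15-49 years")
)
subset = demo[mask].reset_index(drop=True)

# One row per country is expected once metric, year, and age are fixed.
one_row_per_country = subset["Country"].is_unique
print(subset)
print(one_row_per_country)
```

Only the Spain row with the 2020 rate for ages 15-49 passes all three conditions in this demo; the Fiji row is excluded because its year is 2021.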


### 💾 Export Cleaned Data

Save the cleaned dataset to a `.csv` file for future use:

In [13]:
df_cleaned_death_data.to_csv("cleaned_death_data_covid.csv", index=False)
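
To confirm that the export round-trips cleanly, the file can be read back and compared with the frame that was written. The sketch below writes a hypothetical two-row frame (`demo`) to a temporary directory instead of the project folder. It also shows why `index=False` is used: without it, the index is written as an extra unnamed column that reappears as `Unnamed: 0` on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned data (values from the output above).
demo = pd.DataFrame(
    {
        "Country": ["Spain", "Indonesia"],
        "year": [2020, 2020],
        "COVID Deaths Val": [10.472981, 10.159491],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "cleaned_death_data_covid.csv")

    # index=False keeps the row index out of the file.
    demo.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
print(reloaded.shape)
```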