## COVID-19 Mortality Data (2020)

To explore the potential long-term health impacts of air pollution, we use Institute for Health Metrics and Evaluation (IHME) estimates of COVID-19 mortality for 2020.

---

### 📦 Import Required Pandas Library

In [146]:
import pandas as pd

### 📁 Loading the Dataset

Load the raw CSV file, make a working copy to clean, and preview the first few rows.

In [147]:
raw_data = pd.read_csv("../1_datasets/raw_datasets/covid_deaths.csv")
cleaned_death_data = raw_data.copy()
cleaned_death_data.head()

Unnamed: 0,measure,location,sex,age,cause,metric,year,val,upper,lower
0,Deaths,Spain,Both,All ages,COVID-19,Percent,2020,0.125091,0.130679,0.118814
1,Deaths,Spain,Both,All ages,COVID-19,Rate,2020,132.948639,139.131075,126.395299
2,Deaths,Spain,Both,15-49 years,COVID-19,Percent,2020,0.148088,0.156547,0.139678
3,Deaths,Spain,Both,15-49 years,COVID-19,Rate,2020,10.472981,11.087217,9.876074
4,Deaths,Fiji,Both,All ages,COVID-19,Percent,2021,0.12138,0.234511,0.063697


### 🧾 Initial Column Inspection
We inspect the columns to:
- List all column names
- Identify metadata columns
- Identify unnecessary columns

In [148]:
cleaned_death_data.columns

Index(['measure', 'location', 'sex', 'age', 'cause', 'metric', 'year', 'val',
       'upper', 'lower'],
      dtype='object')

In [149]:
# Check the initial number of rows and columns
cleaned_death_data.shape

(200, 10)

### 🔍 Summary Statistics & Missing Data

- Check `.info()` for data types and nulls
- Use `.isnull().sum()` to count missing values

In [150]:
cleaned_death_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   measure   200 non-null    object 
 1   location  200 non-null    object 
 2   sex       200 non-null    object 
 3   age       200 non-null    object 
 4   cause     200 non-null    object 
 5   metric    200 non-null    object 
 6   year      200 non-null    int64  
 7   val       200 non-null    float64
 8   upper     200 non-null    float64
 9   lower     200 non-null    float64
dtypes: float64(3), int64(1), object(6)
memory usage: 15.8+ KB


In [151]:
cleaned_death_data.isnull().sum()

measure     0
location    0
sex         0
age         0
cause       0
metric      0
year        0
val         0
upper       0
lower       0
dtype: int64

### 🧹 Column Renaming and Dropping
- Drop columns not needed for the analysis (`sex`, `measure`, `cause`)
- Rename the remaining columns to standardized names (`val` → `COVID Deaths Val`, `location` → `Country`)


In [152]:
cleaned_death_data = cleaned_death_data.drop(
    ["sex", "measure", "cause"], axis=1, errors="ignore"
)
cleaned_death_data = cleaned_death_data.rename(
    columns={"val": "COVID Deaths Val", "location": "Country"}
)
cleaned_death_data.head()

Unnamed: 0,Country,age,metric,year,COVID Deaths Val,upper,lower
0,Spain,All ages,Percent,2020,0.125091,0.130679,0.118814
1,Spain,All ages,Rate,2020,132.948639,139.131075,126.395299
2,Spain,15-49 years,Percent,2020,0.148088,0.156547,0.139678
3,Spain,15-49 years,Rate,2020,10.472981,11.087217,9.876074
4,Fiji,All ages,Percent,2021,0.12138,0.234511,0.063697


### 🎯 Filtering by Country

To keep the data consistent and focus the analysis on the relevant countries, we filter rows with the `isin()` method.

In [153]:
countries_to_keep = [
    "Egypt",
    "India",
    "United States of America",
    "Chile",
    "Nigeria",
    "Italy",
    "China",
    "Japan",
    "Afghanistan",
    "Germany",
    "Russian Federation",
    "Spain",
    "Romania",
    "Indonesia",
    "Saudi Arabia",
    "Bangladesh",
    "Republic of Korea",
    "South Africa",
    "Ethiopia",
    "Kenya",
    "Brazil",
    "Mexico",
    "Canada",
    "Australia",
    "Fiji",
]

cleaned_death_data = cleaned_death_data[
    cleaned_death_data["Country"].isin(countries_to_keep)
]
cleaned_death_data.head(26)

Unnamed: 0,Country,age,metric,year,COVID Deaths Val,upper,lower
0,Spain,All ages,Percent,2020,0.125091,0.130679,0.118814
1,Spain,All ages,Rate,2020,132.948639,139.131075,126.395299
2,Spain,15-49 years,Percent,2020,0.148088,0.156547,0.139678
3,Spain,15-49 years,Rate,2020,10.472981,11.087217,9.876074
4,Fiji,All ages,Percent,2021,0.12138,0.234511,0.063697
5,Fiji,All ages,Rate,2021,124.301052,246.763477,75.31109
6,Fiji,15-49 years,Percent,2021,0.123064,0.236029,0.064148
7,Fiji,15-49 years,Rate,2021,38.868283,77.220136,23.558042
8,Indonesia,All ages,Percent,2020,0.046378,0.091085,0.015115
9,Indonesia,All ages,Rate,2020,32.667317,67.054381,10.621372
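
Because `isin()` silently drops any name it does not match exactly, it can help to confirm that every requested country is present under the same spelling. The following is a minimal sketch on a small made-up table; `toy_deaths`, `wanted`, `missing`, and `filtered` are illustrative names and values, not part of the notebook or the IHME data.

```python
import pandas as pd

# Toy stand-in for the deaths table (values are made up for illustration).
toy_deaths = pd.DataFrame(
    {
        "Country": ["Spain", "Fiji", "United States", "Peru"],
        "COVID Deaths Val": [1.0, 2.0, 3.0, 4.0],
    }
)

# Hypothetical list of countries to keep; one spelling differs from the toy data.
wanted = ["Spain", "Fiji", "United States of America"]

# Names that were requested but do not appear in the data under that spelling.
missing = sorted(set(wanted) - set(toy_deaths["Country"]))
print(missing)  # ['United States of America']

# The filter itself keeps only exact matches.
filtered = toy_deaths[toy_deaths["Country"].isin(wanted)]
print(filtered["Country"].tolist())  # ['Spain', 'Fiji']
```

An empty `missing` list would suggest that every country in the list was matched.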


### Filter data for Rate metric, year 2020, and age 15–49 years

We keep only rows where:
- `metric` = **"Rate"**
- `year` = **2020**
- `age` = **"15-49 years"**

This focuses the analysis on younger adults, to explore whether factors such as PM2.5 exposure explain higher COVID-19 mortality beyond age-related vulnerability.


In [154]:
df_cleaned_death_data = cleaned_death_data[
    (cleaned_death_data["metric"] == "Rate")
    & (cleaned_death_data["year"] == 2020)
    & (cleaned_death_data["age"] == "15-49 years")
]
print(df_cleaned_death_data)

                      Country          age metric  year  COVID Deaths Val  \
3                       Spain  15-49 years   Rate  2020         10.472981   
11                  Indonesia  15-49 years   Rate  2020         10.159491   
27                     Mexico  15-49 years   Rate  2020         57.019696   
31         Russian Federation  15-49 years   Rate  2020         18.195048   
35                      Chile  15-49 years   Rate  2020         17.427140   
39                Afghanistan  15-49 years   Rate  2020         67.643147   
43               South Africa  15-49 years   Rate  2020         54.933363   
47                 Bangladesh  15-49 years   Rate  2020         18.638962   
59                     Canada  15-49 years   Rate  2020         11.392234   
63                      India  15-49 years   Rate  2020         18.618172   
71          Republic of Korea  15-49 years   Rate  2020          0.239668   
79                   Ethiopia  15-49 years   Rate  2020         24.108446   

Rename columns to standardize names (`year` → `Year`, `COVID Deaths Val` → `COVID Deaths Val (Rate)`).

In [155]:
df_cleaned_death_data = df_cleaned_death_data.rename(
    columns={"year": "Year", "COVID Deaths Val": "COVID Deaths Val (Rate)"},
)

In [156]:
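# reset_index returns a new DataFrame, so assign the result to keep it
df_cleaned_death_data = df_cleaned_death_data.reset_index(drop=True)
df_cleaned_death_data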

Unnamed: 0,Country,age,metric,Year,COVID Deaths Val (Rate),upper,lower
0,Spain,15-49 years,Rate,2020,10.472981,11.087217,9.876074
1,Indonesia,15-49 years,Rate,2020,10.159491,20.825411,3.318329
2,Mexico,15-49 years,Rate,2020,57.019696,70.946973,44.181729
3,Russian Federation,15-49 years,Rate,2020,18.195048,22.069367,14.809403
4,Chile,15-49 years,Rate,2020,17.42714,18.2755,16.582775
5,Afghanistan,15-49 years,Rate,2020,67.643147,90.454679,51.068891
6,South Africa,15-49 years,Rate,2020,54.933363,54.955219,54.898359
7,Bangladesh,15-49 years,Rate,2020,18.638962,25.12202,15.987977
8,Canada,15-49 years,Rate,2020,11.392234,12.087162,10.701911
9,India,15-49 years,Rate,2020,18.618172,19.671096,17.520649
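
Note that `reset_index` returns a new DataFrame rather than modifying the original, so the new index persists only if the result is assigned. A minimal sketch with a made-up frame (`toy` and its values are illustrative only):

```python
import pandas as pd

# Toy frame with a non-sequential index, as left behind by row filtering.
toy = pd.DataFrame({"rate": [1.0, 2.0, 3.0]}, index=[3, 11, 27])

# Calling reset_index without assignment leaves the original untouched.
toy.reset_index(drop=True)
index_after_discard = toy.index.tolist()
print(index_after_discard)  # [3, 11, 27]

# Assigning the result makes the sequential index stick.
toy = toy.reset_index(drop=True)
index_after_assign = toy.index.tolist()
print(index_after_assign)  # [0, 1, 2]
```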


### 📁 Loading the SDI Dataset

Load the raw Socio-demographic Index (SDI) CSV file, make a working copy to clean, and preview the first few rows.

In [157]:
sdi_data = pd.read_csv("../1_datasets/raw_datasets/sdi_by_country_year_2010_2019.csv")

In [158]:
sdi_2020 = sdi_data.copy()
sdi_2020.head()

Unnamed: 0,covariate_name_short,location_id,location_name,year_id,age_group_id,age_group_name,sex_id,sex,mean_value,lower_value,upper_value
0,sdi,1,Global,1950,22,All Ages,3,Both,0.369235,0.369235,0.369235
1,sdi,4,"Southeast Asia, East Asia, and Oceania",1950,22,All Ages,3,Both,0.205729,0.205729,0.205729
2,sdi,5,East Asia,1950,22,All Ages,3,Both,0.193972,0.193972,0.193972
3,sdi,6,China,1950,22,All Ages,3,Both,0.184144,0.184144,0.184144
4,sdi,7,Democratic People's Republic of Korea,1950,22,All Ages,3,Both,0.323069,0.323069,0.323069


### 🎯 Filtering by Country and Year

To keep only the entries needed for the merge, we filter by country and year with the `isin()` method.

In [159]:
year_to_keep = [2020]
sdi_2020 = sdi_2020[
    sdi_2020["location_name"].isin(countries_to_keep)
    & sdi_2020["year_id"].isin(year_to_keep)
]

### 🧹 Column Dropping and Renaming
- Drop columns not needed for analysis
- Rename the remaining columns to match the COVID deaths dataset

In [160]:
sdi_2020 = sdi_2020.drop(
    [
        "covariate_name_short",
        "location_id",
        "age_group_id",
        "age_group_name",
        "sex_id",
        "sex",
        "lower_value",
        "upper_value",
    ],
    axis=1,
    errors="ignore",
)
sdi_2020 = sdi_2020.rename(
    columns={
        "location_name": "Country",
        "year_id": "Year",
        "mean_value": "SDI mean value",
    }
)
sdi_2020.head()

Unnamed: 0,Country,Year,SDI mean value
51523,China,2020,0.713365
51528,Indonesia,2020,0.651927
51539,Fiji,2020,0.671431
51569,Romania,2020,0.764276
51579,Russian Federation,2020,0.806011


### ✅ Index Reset & Checking Missing Data

- Reset the DataFrame index so that it is clean and sequential.
- Count the missing values.

In [161]:
# reset_index returns a new DataFrame, so assign the result to keep it
sdi_2020 = sdi_2020.reset_index(drop=True)
sdi_2020.shape

(25, 3)

In [162]:
sdi_2020.isnull().sum()

Country           0
Year              0
SDI mean value    0
dtype: int64

### 🔗 COVID Deaths and SDI data merge
Perform an **inner merge** of the COVID deaths dataset with the SDI dataset using `"Country"` and `"Year"` as keys.
- This enables analysis of the death rate in the context of each country's socio-demographic development.

In [163]:
merged_covid_sdi = pd.merge(
    df_cleaned_death_data, sdi_2020, on=["Country", "Year"], how="inner"
)
merged_covid_sdi.head()

Unnamed: 0,Country,age,metric,Year,COVID Deaths Val (Rate),upper,lower,SDI mean value
0,Spain,15-49 years,Rate,2020,10.472981,11.087217,9.876074,0.766506
1,Indonesia,15-49 years,Rate,2020,10.159491,20.825411,3.318329,0.651927
2,Mexico,15-49 years,Rate,2020,57.019696,70.946973,44.181729,0.660119
3,Russian Federation,15-49 years,Rate,2020,18.195048,22.069367,14.809403,0.806011
4,Chile,15-49 years,Rate,2020,17.42714,18.2755,16.582775,0.769214
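
An inner merge drops rows that lack a partner on the other side without any warning. One way to see what would be lost is a temporary outer merge with `indicator=True`. This is a sketch under assumptions: the toy frames and their values are made up, and only the key columns mirror the notebook.

```python
import pandas as pd

# Toy frames with made-up values; only the key names follow the notebook.
toy_deaths = pd.DataFrame(
    {"Country": ["Spain", "Chile", "Mexico"], "Year": [2020, 2020, 2020], "rate": [1.0, 2.0, 3.0]}
)
toy_sdi = pd.DataFrame(
    {"Country": ["Spain", "Chile", "Japan"], "Year": [2020, 2020, 2020], "sdi": [0.7, 0.8, 0.9]}
)

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
check = pd.merge(toy_deaths, toy_sdi, on=["Country", "Year"], how="outer", indicator=True)

# Rows that an inner merge would have dropped.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Country", "_merge"]])
```

In this toy case, Mexico appears only on the left and Japan only on the right, so neither would survive an inner merge.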


Drop the columns that are no longer needed from the intermediate merged dataset.

In [164]:
merged_covid_sdi = merged_covid_sdi.drop(
    ["age", "Year", "metric"],
    axis=1,
    errors="ignore",
)
merged_covid_sdi.head()

Unnamed: 0,Country,COVID Deaths Val (Rate),upper,lower,SDI mean value
0,Spain,10.472981,11.087217,9.876074,0.766506
1,Indonesia,10.159491,20.825411,3.318329,0.651927
2,Mexico,57.019696,70.946973,44.181729,0.660119
3,Russian Federation,18.195048,22.069367,14.809403,0.806011
4,Chile,17.42714,18.2755,16.582775,0.769214


### 🔗 Reading the PM2.5 Dataset for the Merge
Read the cleaned PM2.5 dataset and prepare it for an **inner merge** with the COVID/SDI dataset.

In [165]:
pm25_concentration = pd.read_csv("../1_datasets/cleaned_datasets/cleaned_pm25_data.csv")
df_pm_average = pm25_concentration.copy()
df_pm_average

Unnamed: 0,Country,Year,PM25 concentration,PM25 lower bound,PM25 upper bound
0,Germany,2019,10.73,10.56,10.93
1,Japan,2019,10.84,10.02,11.55
2,Brazil,2019,10.94,9.37,13.01
3,Kenya,2019,12.52,7.80,17.78
4,Romania,2019,13.30,12.55,14.10
...,...,...,...,...,...
245,Afghanistan,2010,68.97,49.48,96.51
246,Fiji,2010,7.19,3.79,12.39
247,Canada,2010,7.76,7.55,7.96
248,Australia,2010,9.03,8.14,9.93


### 📌 Averaging PM₂.₅ Concentration (2010–2019)

In this step, we calculate the long-term average ambient PM₂.₅ concentration for each country between 2010 and 2019. This captures chronic exposure to air pollution over a decade, which can then be compared with COVID-19 outcomes in 2020. We then rename the resulting column to `Avg_PM25_2010_2019` for consistency.

In [166]:
avg_pm25 = df_pm_average.groupby("Country")["PM25 concentration"].mean().reset_index()

In [167]:
avg_pm25.rename(columns={"PM25 concentration": "Avg_PM25_2010_2019"}, inplace=True)
avg_pm25.head()

Unnamed: 0,Country,Avg_PM25_2010_2019
0,Afghanistan,67.013
1,Australia,7.479
2,Bangladesh,49.53
3,Brazil,13.107
4,Canada,7.188
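
A decade average is only comparable across countries if each country contributes the same number of years. The sketch below computes the mean together with the number of yearly observations behind it, using named aggregation on a made-up table; `toy_pm25`, `avg_pm25`, and `n_years` are illustrative names.

```python
import pandas as pd

# Toy PM2.5 table with made-up values; country "B" is missing one year.
toy_pm25 = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B"],
        "Year": [2017, 2018, 2019, 2018, 2019],
        "PM25 concentration": [10.0, 12.0, 14.0, 30.0, 50.0],
    }
)

# Mean concentration per country plus the count of years it is based on.
summary = (
    toy_pm25.groupby("Country")["PM25 concentration"]
    .agg(avg_pm25="mean", n_years="count")
    .reset_index()
)
print(summary)
# Country A averages 12.0 over 3 years; country B averages 40.0 over 2 years.
```

A country whose `n_years` is lower than the others may deserve a closer look before its average is used.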


### 🔗 Merging COVID-19, SDI, and PM₂.₅ Datasets

In this step, we perform an inner merge to combine all cleaned data sources into a single DataFrame, using the common column `Country`. This includes:

- COVID-19 death rates (2020) for the 15–49 age group
- Socio-demographic Index (SDI) for the year 2020
- Average PM₂.₅ concentration from 2010 to 2019

The resulting dataset is ready for exploratory data analysis (EDA) and modeling to investigate the relationship between long-term air pollution, development level, and COVID-19 outcomes.

In [168]:
covid_pm25_sdi_final = pd.merge(merged_covid_sdi, avg_pm25, on=["Country"], how="inner")
covid_pm25_sdi_final.head()

Unnamed: 0,Country,COVID Deaths Val (Rate),upper,lower,SDI mean value,Avg_PM25_2010_2019
0,Spain,10.472981,11.087217,9.876074,0.766506,10.995
1,Indonesia,10.159491,20.825411,3.318329,0.651927,19.61
2,Mexico,57.019696,70.946973,44.181729,0.660119,21.268
3,Russian Federation,18.195048,22.069367,14.809403,0.806011,9.857
4,Chile,17.42714,18.2755,16.582775,0.769214,22.628
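
Since the final table is expected to hold one row per country, the merge can be asked to verify that with the `validate` argument of `pd.merge`. This is a minimal sketch on made-up frames (all names and values are illustrative), showing that a duplicated key raises `pandas.errors.MergeError` instead of silently multiplying rows.

```python
import pandas as pd

toy_left = pd.DataFrame({"Country": ["Spain", "Chile"], "rate": [1.0, 2.0]})

# Deliberately duplicated key ("Spain") to show what validate catches.
toy_right_dup = pd.DataFrame(
    {"Country": ["Spain", "Spain", "Chile"], "pm25": [10.0, 11.0, 20.0]}
)

try:
    pd.merge(toy_left, toy_right_dup, on="Country", how="inner", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)  # True

# With unique keys on both sides, the same call succeeds.
toy_right = toy_right_dup.drop_duplicates(subset="Country")
merged = pd.merge(toy_left, toy_right, on="Country", how="inner", validate="one_to_one")
print(len(merged))  # 2
```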


### 💾 Export Cleaned Data

Save the cleaned datasets to `.csv` files for future use:

In [169]:
df_cleaned_death_data.to_csv("cleaned_death_data_covid.csv", index=False)

In [170]:
# index=False keeps the row index out of the file, matching the export above
covid_pm25_sdi_final.to_csv("covid_pm25_sdi_final.csv", index=False)
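
Whether the index is written affects what `read_csv` sees when the file is loaded again. The sketch below round-trips a made-up frame through an in-memory buffer (`io.StringIO` stands in for a file on disk; `toy` and its values are illustrative) to compare the two options.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Country": ["Spain", "Chile"], "rate": [1.0, 2.0]})

# Default: the unnamed index is written as an extra leading column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'Country', 'rate']

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['Country', 'rate']
```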