## Cleaning GBD Heath Data & 🔗 Merging Data for Analysis

In this notebook, we prepare and combine multiple datasets to create a single, clean dataset for analysis.  
The workflow includes:

- Cleaning the **health outcomes** data to remove inconsistencies and ensure uniform formatting.
- Reading the cleaned **SDI (Socio-Demographic Index)** and **PM2.5** datasets.
- Merging all three datasets on `Country` and `Year` to retain only the countries and years present in all sources.

The final dataset covers the period **2010 to 2019**, enabling us to explore the relationship between **long-term PM2.5 exposure**, socio-demographic factors, and health outcomes at the country level.


### 📦 Import Required Pandas Library

In [None]:
import pandas as pd

### 📁 loading the dataset

from the CSV raw file, and making a copy for the cleaned to be dataset
and preview the first few rows.

In [None]:
raw_data = pd.read_csv("../1_datasets/raw_datasets/IHME-GBD_DALY.csv")
clean_gbd_health_outcomes = raw_data.copy()
clean_gbd_health_outcomes.head()

### 🔍 Initial Column Inspection
We inspect:
- Full list of column names
- Identify metadata columns 
- Identify unnecessary columns

In [None]:
clean_gbd_health_outcomes.columns

In [None]:
# checking initial rows and columns number
clean_gbd_health_outcomes.shape

### 🔍 Summary Statistics & Missing Data

- Check `.info()` for data types and nulls
- Use `.isnull().sum()` to count missing values

In [None]:
clean_gbd_health_outcomes.info()

In [None]:
clean_gbd_health_outcomes.isnull().sum()

### 🖊️ Column Renaming and Dropping
Drop columns:
- Drop columns not needed for analysis

Rename columns:
- Standardize names

In [None]:
clean_gbd_health_outcomes = clean_gbd_health_outcomes.drop(
    ["sex", "age", "metric", "measure"], axis=1, errors="ignore"
)
clean_gbd_health_outcomes = clean_gbd_health_outcomes.rename(
    columns={
        "val": "DALY Val",
        "location": "Country",
        "rei": "Risk-Exposure-Impact",
        "year": "Year",
    }
)
clean_gbd_health_outcomes.head()

### ✅ Check: Ensure dataset contains the unique countries

This step verifies that the dataset includes exactly the unique countries before proceeding.

In [62]:
country_count = clean_gbd_health_outcomes["Country"].nunique()
print(country_count)

732


### 🔍 Inspect country names

Use `value_counts()` to check the frequency of each country and review the names.

In [None]:
value_counts = clean_gbd_health_outcomes["Country"].value_counts()
print(value_counts)

In [None]:
clean_gbd_health_outcomes.reset_index(drop=True)

### 📥 Read PM2.5 and SDI datasets

We first load the **PM2.5** dataset and the **SDI** (Socio-Demographic Index) dataset into dataframes.  
These datasets will later be merged to create a combined dataset containing PM2.5 and socio-demographic data for each country and year.


In [None]:
pm25_df = pd.read_csv("../1_datasets/cleaned_datasets/pm25_all_countries_cleaned.csv")
sdi_df = pd.read_csv("../1_datasets/cleaned_datasets/sdi_data_2010_2019_cleaned.csv")

### ✅ Check: Ensure dataset contains the unique countries

This step verifies that the dataset includes exactly the unique countries before proceeding.

In [None]:
unique_countries = sdi_df["Country"].nunique()
print(unique_countries)

### 🤝 Clean and merge SDI and PM2.5 datasets

Before merging, we remove any leading or trailing spaces from the **Country** column to ensure accurate matching.  
Then, we perform an **inner merge** on `Country` and `Year` to keep only the common records between the SDI and PM2.5 datasets.
This results in a combined dataframe containing countries and years that are present in both datasets.


In [54]:
clean_gbd_health_outcomes["Country"] = sdi_df["Country"].str.strip()
pm25_df["Country"] = pm25_df["Country"].str.strip()

sdi_pm25 = pd.merge(sdi_df, pm25_df, on=["Country", "Year"], how="inner")
print(sdi_pm25.shape)
sdi_pm25.head()

(1920, 7)


Unnamed: 0,Country,Year,SDI_mean_value,Population Category,PM25 concentration (µg/m³),PM25 lower bound,PM25 upper bound
0,China,2010,0.641521,Total,47.18,44.59,49.38
1,Democratic People's Republic of Korea,2010,0.537193,Total,50.41,37.98,65.52
2,Cambodia,2010,0.410211,Total,18.85,15.72,22.03
3,Indonesia,2010,0.585982,Total,20.59,16.5,24.63
4,Lao People's Democratic Republic,2010,0.414866,Total,22.86,17.56,29.64


### 🔗 Merge health outcomes with SDI and PM2.5 data

We merge the cleaned health outcomes dataset with the previously combined SDI and PM2.5 dataset.  
An **inner merge** on `Country` and `Year` ensures we keep only the countries and years that appear in **all three datasets**: health, SDI, and PM2.5.
This creates a final dataset ready for analysis.


In [56]:
sdi_pm25_gbd = pd.merge(
    clean_gbd_health_outcomes, sdi_pm25, on=["Country", "Year"], how="inner"
)
print(sdi_pm25_gbd.shape)
sdi_pm25_gbd.head()

(1940, 12)


Unnamed: 0,Country,cause,Risk-Exposure-Impact,Year,DALY Val,upper,lower,SDI_mean_value,Population Category,PM25 concentration (µg/m³),PM25 lower bound,PM25 upper bound
0,China,Stroke,Ambient particulate matter pollution,2013,927.72159,1208.913876,564.490209,0.663103,Total,64.93,61.43,67.88
1,Democratic People's Republic of Korea,Stroke,Ambient particulate matter pollution,2014,960.571587,1227.910165,589.905594,0.553604,Total,60.09,45.31,77.31
2,Cambodia,Stroke,Ambient particulate matter pollution,2017,945.157096,1206.731114,601.287022,0.451222,Total,16.84,14.88,19.48
3,Indonesia,Stroke,Ambient particulate matter pollution,2018,909.881906,1175.832907,570.912093,0.640949,Total,18.47,15.58,22.9
4,Lao People's Democratic Republic,Stroke,Ambient particulate matter pollution,2019,884.162698,1132.470448,582.014559,0.478811,Total,21.15,16.86,25.13


### 📊 Check unique countries in the merged dataset

We check how many unique countries are included in the final merged dataset,  
and list their names to verify that only the mutual countries across all datasets are kept.


In [57]:
country_count = sdi_pm25_gbd["Country"].nunique()
print(country_count)

191


In [59]:
unique_countries = sdi_pm25_gbd["Country"].unique()
print(unique_countries)

['China' "Democratic People's Republic of Korea" 'Cambodia' 'Indonesia'
 "Lao People's Democratic Republic" 'Malaysia' 'Maldives' 'Myanmar'
 'Philippines' 'Sri Lanka' 'Thailand' 'Timor-Leste' 'Viet Nam' 'Fiji'
 'Kiribati' 'Marshall Islands' 'Micronesia (Federated States of)'
 'Papua New Guinea' 'Samoa' 'Solomon Islands' 'Tonga' 'Vanuatu' 'Armenia'
 'Azerbaijan' 'Georgia' 'Kazakhstan' 'Kyrgyzstan' 'Mongolia' 'Tajikistan'
 'Turkmenistan' 'Uzbekistan' 'Albania' 'Bosnia and Herzegovina' 'Bulgaria'
 'Croatia' 'Czechia' 'Hungary' 'North Macedonia' 'Montenegro' 'Poland'
 'Romania' 'Serbia' 'Slovakia' 'Slovenia' 'Belarus' 'Estonia' 'Latvia'
 'Lithuania' 'Republic of Moldova' 'Russian Federation' 'Ukraine'
 'Brunei Darussalam' 'Japan' 'Republic of Korea' 'Singapore' 'Australia'
 'New Zealand' 'Andorra' 'Austria' 'Belgium' 'Cyprus' 'Denmark' 'Finland'
 'France' 'Germany' 'Greece' 'Iceland' 'Ireland' 'Israel' 'Italy'
 'Luxembourg' 'Malta' 'Norway' 'Portugal' 'Spain' 'Sweden' 'Switzerland'
 'Argen

In [60]:
sdi_pm25_gbd.reset_index(drop=True)

Unnamed: 0,Country,cause,Risk-Exposure-Impact,Year,DALY Val,upper,lower,SDI_mean_value,Population Category,PM25 concentration (µg/m³),PM25 lower bound,PM25 upper bound
0,China,Stroke,Ambient particulate matter pollution,2013,927.721590,1208.913876,564.490209,0.663103,Total,64.93,61.43,67.88
1,Democratic People's Republic of Korea,Stroke,Ambient particulate matter pollution,2014,960.571587,1227.910165,589.905594,0.553604,Total,60.09,45.31,77.31
2,Cambodia,Stroke,Ambient particulate matter pollution,2017,945.157096,1206.731114,601.287022,0.451222,Total,16.84,14.88,19.48
3,Indonesia,Stroke,Ambient particulate matter pollution,2018,909.881906,1175.832907,570.912093,0.640949,Total,18.47,15.58,22.90
4,Lao People's Democratic Republic,Stroke,Ambient particulate matter pollution,2019,884.162698,1132.470448,582.014559,0.478811,Total,21.15,16.86,25.13
...,...,...,...,...,...,...,...,...,...,...,...,...
1935,Tuvalu,Chronic respiratory diseases,Ambient particulate matter pollution,2013,43.190450,67.336873,24.057056,0.535166,Total,6.81,2.97,12.43
1936,South Sudan,Chronic respiratory diseases,Ambient particulate matter pollution,2017,43.072817,68.204042,23.629823,0.274388,Total,21.64,10.98,39.87
1937,Sudan,Respiratory infections and tuberculosis,Ambient particulate matter pollution,2017,596.198435,1080.889328,132.800240,0.506884,Total,22.73,11.22,41.89
1938,Georgia,Stroke,Ambient particulate matter pollution,2018,76.594403,117.210024,48.271688,0.717994,Total,20.66,16.28,26.59


### 💾 Export Cleaned Data

Save the cleaned dataset to a `.csv` file for future use:

In [61]:
sdi_pm25_gbd.to_csv("final_sdi_pm25_gbd.csv", index=False)