## 🧾 Socio-demographic Index (SDI) Data Preparation

We include the Socio-demographic Index (SDI) to contextualize the burden of respiratory diseases and pollution exposure across countries.

SDI is a composite metric reflecting:
- Income per capita
- Educational attainment
- Total fertility rate
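
As a rough illustration of how a composite of this kind behaves: to our understanding, IHME computes SDI as the geometric mean of three component indices, each scaled from 0 to 1. The sketch below uses made-up index values (not taken from the dataset) and hypothetical variable names, and it assumes the fertility index has already been rescaled so that lower fertility scores higher.

```python
# Hypothetical 0-to-1 component indices for a single location-year.
# These numbers are illustrative only and do not come from the SDI file.
income_index = 0.70
education_index = 0.60
fertility_index = 0.80  # assumed already rescaled: lower fertility -> higher index

# Geometric mean of the three indices: one low component pulls the
# composite down more than it would in an arithmetic mean.
sdi_sketch = (income_index * education_index * fertility_index) ** (1 / 3)
arithmetic_mean = (income_index + education_index + fertility_index) / 3

print(round(sdi_sketch, 4), round(arithmetic_mean, 4))
```

The dataset used here already contains the final SDI value per location and year, so no such computation is needed in this notebook.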

---

### 📦 Import Required Pandas Library

In [31]:
import pandas as pd

### 📁 Loading the Dataset

Load the raw CSV file, make a working copy for cleaning,
and preview the first few rows.

In [32]:
raw_sdi_data = pd.read_csv(
    "../1_datasets/raw_datasets/sdi_by_country_year_2010_2019.csv"
)
cleaned_sdi_data = raw_sdi_data.copy()
cleaned_sdi_data.head()

Unnamed: 0,covariate_name_short,location_id,location_name,year_id,age_group_id,age_group_name,sex_id,sex,mean_value,lower_value,upper_value
0,sdi,1,Global,1950,22,All Ages,3,Both,0.369235,0.369235,0.369235
1,sdi,4,"Southeast Asia, East Asia, and Oceania",1950,22,All Ages,3,Both,0.205729,0.205729,0.205729
2,sdi,5,East Asia,1950,22,All Ages,3,Both,0.193972,0.193972,0.193972
3,sdi,6,China,1950,22,All Ages,3,Both,0.184144,0.184144,0.184144
4,sdi,7,Democratic People's Republic of Korea,1950,22,All Ages,3,Both,0.323069,0.323069,0.323069


In [33]:
# check the initial number of rows and columns
cleaned_sdi_data.shape

(52992, 11)

### 🧾 Initial Column Inspection
We inspect the full list of column names to identify:
- Metadata columns
- Columns not needed for the analysis


In [34]:
cleaned_sdi_data.columns

Index(['covariate_name_short', 'location_id', 'location_name', 'year_id',
       'age_group_id', 'age_group_name', 'sex_id', 'sex', 'mean_value',
       'lower_value', 'upper_value'],
      dtype='object')

### 🧹 Column Renaming and Dropping
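Drop columns:
- Remove identifier, age, sex, and bound columns that are not needed for analysis

Rename columns:
- Standardize the remaining names to `Country`, `Year`, and `SDI mean value`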

In [35]:
cleaned_sdi_data = cleaned_sdi_data.drop(
    [
        "covariate_name_short",
        "location_id",
        "age_group_id",
        "age_group_name",
        "sex_id",
        "sex",
        "lower_value",
        "upper_value",
    ],
    axis=1,
    errors="ignore",
)
cleaned_sdi_data.head()

Unnamed: 0,location_name,year_id,mean_value
0,Global,1950,0.369235
1,"Southeast Asia, East Asia, and Oceania",1950,0.205729
2,East Asia,1950,0.193972
3,China,1950,0.184144
4,Democratic People's Republic of Korea,1950,0.323069


In [None]:
cleaned_sdi_data = cleaned_sdi_data.rename(
    columns={
        "location_name": "Country",
        "year_id": "Year",
        "mean_value": "SDI mean value",
    }
)
cleaned_sdi_data.head()

Unnamed: 0,Country,Year,SDI mean value
0,Global,1950,0.369235
1,"Southeast Asia, East Asia, and Oceania",1950,0.205729
2,East Asia,1950,0.193972
3,China,1950,0.184144
4,Democratic People's Republic of Korea,1950,0.323069


### 🎯 Filtering by Country and Year

To focus the analysis on the selected countries and the years 2010 to 2019, we filter the rows with the `isin()` method on both `Country` and `Year`.

In [37]:
countries_to_keep = [
    "Egypt",
    "India",
    "United States of America",
    "Chile",
    "Nigeria",
    "Italy",
    "China",
    "Japan",
    "Afghanistan",
    "Germany",
    "Russian Federation",
    "Spain",
    "Romania",
    "Indonesia",
    "Saudi Arabia",
    "Bangladesh",
    "Republic of Korea",
    "South Africa",
    "Ethiopia",
    "Kenya",
    "Brazil",
    "Mexico",
    "Canada",
    "Australia",
    "Fiji",
]
years_to_keep = list(range(2010, 2020))  # 2010 through 2019 inclusive

cleaned_sdi_data = cleaned_sdi_data[
    cleaned_sdi_data["Country"].isin(countries_to_keep)
    & cleaned_sdi_data["Year"].isin(years_to_keep)
]
cleaned_sdi_data.head()

Unnamed: 0,Country,Year,SDI mean value
44163,China,2010,0.641521
44168,Indonesia,2010,0.585982
44179,Fiji,2010,0.623309
44209,Romania,2010,0.723891
44219,Russian Federation,2010,0.76332
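
Because `isin()` silently drops any name that does not match exactly, it can help to confirm that every requested country was actually found. The following is a minimal sketch on a toy table; the names `toy_sdi`, `wanted_countries`, and `missing_countries` are hypothetical, the values are illustrative, and "South Korea" is deliberately spelled differently from the source's "Republic of Korea" to show a miss.

```python
import pandas as pd

# Toy stand-in for the renamed SDI table (illustrative values only).
toy_sdi = pd.DataFrame(
    {
        "Country": ["China", "China", "Fiji", "Global"],
        "Year": [2009, 2010, 2010, 2010],
        "SDI mean value": [0.60, 0.64, 0.62, 0.63],
    }
)
wanted_countries = ["China", "Fiji", "South Korea"]  # last name is not in the table
wanted_years = list(range(2010, 2020))

filtered = toy_sdi[
    toy_sdi["Country"].isin(wanted_countries)
    & toy_sdi["Year"].isin(wanted_years)
]

# Requested names that never matched: usually a spelling difference.
missing_countries = sorted(set(wanted_countries) - set(filtered["Country"]))
print(missing_countries)
print(len(filtered))
```

For this notebook, 25 countries over 10 years should give 250 rows, which agrees with the 250 entries reported by `.info()`.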


### 🔍 Summary Statistics & Missing Data

- Check `.info()` for data types and nulls
- Use `.isnull().sum()` to count missing values

In [38]:
cleaned_sdi_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 250 entries, 44163 to 50994
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country         250 non-null    object 
 1   Year            250 non-null    int64  
 2   SDI mean value  250 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 7.8+ KB


In [39]:
cleaned_sdi_data.isnull().sum()

Country           0
Year              0
SDI mean value    0
dtype: int64
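
Null counts do not reveal repeated country-year pairs, which would matter later when these two columns are used as merge keys. A possible extra check, sketched on a toy frame with a hypothetical name (`toy`) and illustrative values:

```python
import pandas as pd

# Toy frame with one repeated (Country, Year) pair; values are illustrative.
toy = pd.DataFrame(
    {
        "Country": ["China", "China", "Fiji"],
        "Year": [2010, 2010, 2010],
        "SDI mean value": [0.64, 0.64, 0.62],
    }
)

# duplicated() flags every repeat after the first occurrence of a key pair.
duplicate_keys = int(toy.duplicated(subset=["Country", "Year"]).sum())
print(duplicate_keys)
```

A count of zero on the real table would indicate that each country-year pair appears only once.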

### ✅ Index Reset

Reset the DataFrame index so that it is clean and sequential.
This avoids retaining the original row numbers from the source dataset, which are non-sequential after filtering and can be misleading.

In [40]:
cleaned_sdi_data = cleaned_sdi_data.reset_index(drop=True)
cleaned_sdi_data

Unnamed: 0,Country,Year,SDI mean value
0,China,2010,0.641521
1,Indonesia,2010,0.585982
2,Fiji,2010,0.623309
3,Romania,2010,0.723891
4,Russian Federation,2010,0.763320
...,...,...,...
245,India,2019,0.560810
246,Ethiopia,2019,0.346423
247,Kenya,2019,0.508004
248,South Africa,2019,0.674041
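
Note that `reset_index()` returns a new DataFrame by default and leaves the original untouched unless the result is assigned back. A small sketch of the difference, using a hypothetical toy frame that reuses two of the original row labels:

```python
import pandas as pd

# Toy frame keeping two row labels left over from filtering.
toy = pd.DataFrame({"Country": ["China", "Fiji"]}, index=[44163, 44179])

toy.reset_index(drop=True)  # result is discarded, toy keeps its old index
index_after_bare_call = list(toy.index)

toy = toy.reset_index(drop=True)  # assigning the result keeps the new index
index_after_assignment = list(toy.index)

print(index_after_bare_call, index_after_assignment)
```

Since the export uses `index=False`, the index does not reach the CSV file either way; the reset mainly affects how the DataFrame displays and behaves in later cells.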


### 💾 Export Cleaned Data

Save the cleaned dataset to a `.csv` file for future use:

In [41]:
cleaned_sdi_data.to_csv("../1_datasets/cleaned_datasets/cleaned_sdi_data.csv", index=False)

### 🔗 Reading the Datasets and Merging
We perform an **inner merge** of the PM2.5 dataset with the SDI dataset using `"Country"` and `"Year"` as keys, so only country-year pairs present in both datasets are kept.
- This enables analysis of PM2.5 exposure in the context of each country's
socio-demographic development

In [42]:
cleaned_pm25_data = pd.read_csv("../1_datasets/cleaned_datasets/cleaned_pm25_data.csv")
cleaned_sdi_data = pd.read_csv("../1_datasets/cleaned_datasets/cleaned_sdi_data.csv")

In [43]:
merged_sdi_pm25 = pd.merge(
    cleaned_pm25_data, cleaned_sdi_data, on=["Country", "Year"], how="inner"
)
merged_sdi_pm25.head(30)

Unnamed: 0,Country,Year,PM25 concentration,PM25 lower bound,PM25 upper bound,SDI mean value
0,Germany,2019,10.73,10.56,10.93,0.899703
1,Japan,2019,10.84,10.02,11.55,0.867148
2,Brazil,2019,10.94,9.37,13.01,0.645298
3,Kenya,2019,12.52,7.8,17.78,0.508004
4,Romania,2019,13.3,12.55,14.1,0.760284
5,Italy,2019,14.22,13.99,14.44,0.80153
6,Mexico,2019,17.83,16.33,19.65,0.655095
7,Indonesia,2019,19.34,16.76,23.72,0.646865
8,South Africa,2019,19.75,18.26,21.8,0.674041
9,Chile,2019,20.49,19.31,21.89,0.76512
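
An inner merge drops unmatched rows without warning. One way to see what would be lost is an outer merge with `indicator=True`, sketched below on toy frames. The frame names (`toy_pm25`, `toy_sdi`, `audit`) are hypothetical, and the Peru and Fiji rows are invented purely to demonstrate one unmatched row on each side.

```python
import pandas as pd

# Toy PM2.5 and SDI tables; Peru and Fiji rows are invented for illustration.
toy_pm25 = pd.DataFrame(
    {
        "Country": ["Chile", "Kenya", "Peru"],
        "Year": [2019, 2019, 2019],
        "PM25 concentration": [20.49, 12.52, 30.0],
    }
)
toy_sdi = pd.DataFrame(
    {
        "Country": ["Chile", "Kenya", "Fiji"],
        "Year": [2019, 2019, 2019],
        "SDI mean value": [0.76512, 0.508004, 0.65],
    }
)

# Outer merge keeps every row; "_merge" records which side each row came from.
# validate="one_to_one" raises an error if a key pair repeats on either side.
audit = pd.merge(
    toy_pm25,
    toy_sdi,
    on=["Country", "Year"],
    how="outer",
    indicator=True,
    validate="one_to_one",
)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["Country", "Year", "_merge"]])
```

Rows marked `left_only` or `right_only` are the ones an inner merge would discard; an empty `unmatched` frame would mean both datasets cover exactly the same country-year pairs.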


### 💾 Export Cleaned Merged Data

Save the merged dataset to a `.csv` file for future use:

In [44]:
merged_sdi_pm25.to_csv("merged_sdi_pm25_data.csv", index=False)