## 📊 IHME GBD Data Analysis (2010–2019)

This notebook explores data from the **IHME Global Burden of Disease (GBD)** study covering the years **2010 to 2019** for **25 countries**.  
The analysis focuses on understanding **health outcomes** related to **cardiovascular diseases** and **respiratory diseases** potentially linked to exposure to **ambient PM2.5 air pollution**.


### 📦 Import Required Pandas Library

In [1]:
import pandas as pd

### 📁 Load the Dataset

Load the raw CSV file, make a working copy that will become the cleaned dataset, and preview the first few rows.

In [2]:
raw_data = pd.read_csv("../1_datasets/raw_datasets/IHME-GBD_2021_DATA-f0557795-1.csv")
clean_gbd_health_outcomes = raw_data.copy()
clean_gbd_health_outcomes.head()

| | measure | location | sex | age | cause | rei | metric | year | val | upper | lower |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DALYs (Disability-Adjusted Life Years) | China | Both | All ages | Stroke | Ambient particulate matter pollution | Percent | 2010 | 0.221882 | 0.296283 | 0.128691 |
| 1 | DALYs (Disability-Adjusted Life Years) | China | Both | All ages | Stroke | Ambient particulate matter pollution | Rate | 2010 | 793.957477 | 1068.976845 | 455.632535 |
| 2 | DALYs (Disability-Adjusted Life Years) | China | Both | All ages | Stroke | Ambient particulate matter pollution | Percent | 2011 | 0.234283 | 0.30752 | 0.138086 |
| 3 | DALYs (Disability-Adjusted Life Years) | China | Both | All ages | Stroke | Ambient particulate matter pollution | Rate | 2011 | 835.920051 | 1110.785446 | 492.084815 |
| 4 | DALYs (Disability-Adjusted Life Years) | China | Both | All ages | Stroke | Ambient particulate matter pollution | Percent | 2012 | 0.249782 | 0.321437 | 0.151426 |
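The reason for working on a copy rather than on `raw_data` directly can be shown with a small sketch. It reads a tiny in-memory CSV instead of the real file; the names `csv_text`, `raw_sample`, and `clean_sample` and the values are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the raw GBD export.
csv_text = "location,year,val\nChina,2010,0.221882\nChina,2011,0.234283\n"

raw_sample = pd.read_csv(io.StringIO(csv_text))

# .copy() creates an independent frame, so later cleaning steps
# do not alter the raw data.
clean_sample = raw_sample.copy()
clean_sample.loc[0, "val"] = 0.0

print(raw_sample.loc[0, "val"])    # raw value is unchanged
print(clean_sample.loc[0, "val"])  # only the copy was modified
```

Keeping the raw frame untouched makes it possible to compare the cleaned data against the original at any later step.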


### 🧾 Initial Column Inspection
We inspect the columns to:
- List all column names
- Identify metadata columns
- Identify unnecessary columns

In [3]:
clean_gbd_health_outcomes.columns

Index(['measure', 'location', 'sex', 'age', 'cause', 'rei', 'metric', 'year',
       'val', 'upper', 'lower'],
      dtype='object')

In [4]:
# Check the initial number of rows and columns
clean_gbd_health_outcomes.shape

(1500, 11)
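One possible way to spot unnecessary columns is to flag those that hold a single value in every row. The sketch below uses a toy frame that only mimics the layout above; the helper name `find_constant_columns` and the sample values are assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking the column layout above (illustrative values only).
sample = pd.DataFrame(
    {
        "location": ["China", "China", "India"],
        "sex": ["Both", "Both", "Both"],
        "age": ["All ages", "All ages", "All ages"],
        "year": [2010, 2011, 2010],
    }
)


def find_constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that contain exactly one unique value."""
    unique_counts = df.nunique(dropna=False)
    return unique_counts[unique_counts == 1].index.tolist()


constant_columns = find_constant_columns(sample)
print(constant_columns)
```

A column with one unique value adds nothing to a comparison across countries or years, so it is a reasonable candidate for dropping. Whether that holds for the real dataset has to be checked on the full frame.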

### 🔍 Summary Statistics & Missing Data

- Check `.info()` for data types and nulls
- Use `.isnull().sum()` to count missing values

In [5]:
clean_gbd_health_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   measure   1500 non-null   object 
 1   location  1500 non-null   object 
 2   sex       1500 non-null   object 
 3   age       1500 non-null   object 
 4   cause     1500 non-null   object 
 5   rei       1500 non-null   object 
 6   metric    1500 non-null   object 
 7   year      1500 non-null   int64  
 8   val       1500 non-null   float64
 9   upper     1500 non-null   float64
 10  lower     1500 non-null   float64
dtypes: float64(3), int64(1), object(7)
memory usage: 129.0+ KB


In [6]:
clean_gbd_health_outcomes.isnull().sum()

measure     0
location    0
sex         0
age         0
cause       0
rei         0
metric      0
year        0
val         0
upper       0
lower       0
dtype: int64
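The visual checks above could also be expressed as explicit tests that fail loudly. The following is a minimal sketch on a hand-built frame; the sample values are illustrative, and the extra check that `val` lies between `lower` and `upper` is an assumption about how the uncertainty bounds should behave, not a step taken in this notebook.

```python
import pandas as pd

# Toy frame with the three numeric columns (illustrative values only).
sample = pd.DataFrame(
    {
        "val": [0.22, 793.96],
        "upper": [0.30, 1068.98],
        "lower": [0.13, 455.63],
    }
)

# Total number of missing cells across the whole frame.
total_missing = int(sample.isnull().sum().sum())

# True when every estimate sits inside its [lower, upper] interval.
interval_ok = bool(
    ((sample["lower"] <= sample["val"]) & (sample["val"] <= sample["upper"])).all()
)

print(total_missing, interval_ok)
```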

### 🧹 Column Renaming and Dropping
- Drop the `sex` and `age` columns, which are not needed for the analysis.
- Rename `val`, `location`, `rei`, and `year` to more descriptive names.

In [14]:
clean_gbd_health_outcomes = clean_gbd_health_outcomes.drop(
    ["sex", "age"], axis=1, errors="ignore"
)
clean_gbd_health_outcomes = clean_gbd_health_outcomes.rename(
    columns={
        "val": "measure Val",
        "location": "Country",
        "rei": "Risk-Exposure-Impact",
        "year": "Year",
    }
)
clean_gbd_health_outcomes.head()

| | measure | Country | cause | Risk-Exposure-Impact | metric | Year | measure Val | upper | lower |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DALYs (Disability-Adjusted Life Years) | China | Stroke | Ambient particulate matter pollution | Percent | 2010 | 0.221882 | 0.296283 | 0.128691 |
| 1 | DALYs (Disability-Adjusted Life Years) | China | Stroke | Ambient particulate matter pollution | Rate | 2010 | 793.957477 | 1068.976845 | 455.632535 |
| 2 | DALYs (Disability-Adjusted Life Years) | China | Stroke | Ambient particulate matter pollution | Percent | 2011 | 0.234283 | 0.30752 | 0.138086 |
| 3 | DALYs (Disability-Adjusted Life Years) | China | Stroke | Ambient particulate matter pollution | Rate | 2011 | 835.920051 | 1110.785446 | 492.084815 |
| 4 | DALYs (Disability-Adjusted Life Years) | China | Stroke | Ambient particulate matter pollution | Percent | 2012 | 0.249782 | 0.321437 | 0.151426 |


### ✅ Check: Ensure dataset contains the 25 unique countries

This step verifies that the dataset includes exactly 25 unique countries before proceeding.

In [8]:
country_count = clean_gbd_health_outcomes["Country"].nunique()
print(country_count)

25
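Printing the count relies on someone reading the output. A sketch of a stricter variant that stops execution when the count is off is shown below; the helper `check_country_count` and the toy frame are assumptions for illustration.

```python
import pandas as pd

EXPECTED_COUNTRY_COUNT = 25  # the scope stated at the top of this notebook


def check_country_count(df: pd.DataFrame, expected: int = EXPECTED_COUNTRY_COUNT) -> int:
    """Return the number of unique countries, raising ValueError on a mismatch."""
    found = df["Country"].nunique()
    if found != expected:
        raise ValueError(f"Expected {expected} countries, found {found}")
    return found


# Toy frame with two distinct countries (illustrative only).
sample = pd.DataFrame({"Country": ["China", "China", "India"]})
country_count = check_country_count(sample, expected=2)
print(country_count)
```

On the real frame the call would be `check_country_count(clean_gbd_health_outcomes)`, which uses the default of 25.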


### 🔍 Inspect and standardize country names

Use `value_counts()` to check the frequency of each country and review the names.  
We will rename countries as needed to ensure consistent naming across datasets before merging.

In [10]:
value_counts = clean_gbd_health_outcomes["Country"].value_counts()
print(value_counts)

Country
China                       60
Indonesia                   60
Fiji                        60
Romania                     60
Japan                       60
Russian Federation          60
Republic of Korea           60
Australia                   60
Italy                       60
Germany                     60
Spain                       60
Canada                      60
United States of America    60
Chile                       60
Mexico                      60
Saudi Arabia                60
Brazil                      60
Egypt                       60
India                       60
Afghanistan                 60
Ethiopia                    60
Bangladesh                  60
Kenya                       60
South Africa                60
Nigeria                     60
Name: count, dtype: int64
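If some names need to change to match the other datasets, a dictionary passed to `Series.replace` is one straightforward option. The mapping below is purely hypothetical: the target names depend on the conventions of the datasets being merged, which this notebook does not specify.

```python
import pandas as pd

# Hypothetical mapping; adjust the target names to the other datasets.
COUNTRY_NAME_MAP = {
    "Russian Federation": "Russia",
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
}

# Toy frame with a few of the names listed above.
sample = pd.DataFrame(
    {
        "Country": [
            "China",
            "Russian Federation",
            "Republic of Korea",
            "United States of America",
        ]
    }
)

# Names missing from the mapping (here "China") are left unchanged.
sample["Country"] = sample["Country"].replace(COUNTRY_NAME_MAP)
print(sample["Country"].tolist())
```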


In [12]:
clean_gbd_health_outcomes = clean_gbd_health_outcomes.reset_index(drop=True)
clean_gbd_health_outcomes  # display the re-indexed frame

| | measure | Country | cause | Risk-Exposure-Impact | metric | Year | measure Val | upper | lower |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DALYs (Disability-Adjusted Life Years) | China | Stroke | Ambient particulate matter pollution | Percent | 2010 | 0.221882 | 0.296283 | 0.128691 |
| 1 | DALYs (Disability-Adjusted Life Years) | China | Stroke | Ambient particulate matter pollution | Rate | 2010 | 793.957477 | 1068.976845 | 455.632535 |
| 2 | DALYs (Disability-Adjusted Life Years) | China | Stroke | Ambient particulate matter pollution | Percent | 2011 | 0.234283 | 0.307520 | 0.138086 |
| 3 | DALYs (Disability-Adjusted Life Years) | China | Stroke | Ambient particulate matter pollution | Rate | 2011 | 835.920051 | 1110.785446 | 492.084815 |
| 4 | DALYs (Disability-Adjusted Life Years) | China | Stroke | Ambient particulate matter pollution | Percent | 2012 | 0.249782 | 0.321437 | 0.151426 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1495 | DALYs (Disability-Adjusted Life Years) | Nigeria | Stroke | Ambient particulate matter pollution | Rate | 2017 | 156.375259 | 243.271230 | 79.416332 |
| 1496 | DALYs (Disability-Adjusted Life Years) | Nigeria | Stroke | Ambient particulate matter pollution | Percent | 2018 | 0.156861 | 0.234241 | 0.081685 |
| 1497 | DALYs (Disability-Adjusted Life Years) | Nigeria | Stroke | Ambient particulate matter pollution | Rate | 2018 | 143.493001 | 228.796021 | 73.322380 |
| 1498 | DALYs (Disability-Adjusted Life Years) | Nigeria | Stroke | Ambient particulate matter pollution | Percent | 2019 | 0.151527 | 0.233375 | 0.076314 |


### 💾 Export Cleaned Data

Save the cleaned dataset to a `.csv` file for future use:

In [15]:
clean_gbd_health_outcomes.to_csv("clean_gbd_health.csv", index=False)
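A quick way to confirm the export worked is to read the file back and compare its shape and columns with the frame that was written. The sketch below writes a toy frame to a temporary directory so it does not touch the real output file; the file name `clean_sample.csv` and the sample values are assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame using a few of the cleaned column names (illustrative values).
sample = pd.DataFrame(
    {
        "Country": ["China", "India"],
        "Year": [2010, 2010],
        "measure Val": [0.221882, 0.1],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "clean_sample.csv"
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    sample.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
print(reloaded.columns.tolist())
```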