In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


For Regression Task

Load the dataset into a DataFrame object using the Pandas library.

In [3]:
df = pd.read_csv(r"C:\Users\LOQ\Machine Learning Assignment\Final Assesment\Life Expectancy Data.csv")
df.columns = df.columns.str.strip()
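
The str.strip() call above matters because CSV headers can carry stray leading or trailing spaces, which makes lookups such as df["Life expectancy"] fail. A minimal sketch on an in-memory CSV (the sample text is made up for illustration and is not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical CSV text with padded header names (illustrative only).
raw_csv = "Country,Life expectancy , BMI \nA,65.0,19.1\nB,59.9,18.6\n"

sample = pd.read_csv(io.StringIO(raw_csv))
padded_names = list(sample.columns)   # headers exactly as read, spaces included

sample.columns = sample.columns.str.strip()
clean_names = list(sample.columns)    # headers after stripping whitespace

print(padded_names)
print(clean_names)
```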

• Perform an initial analysis to gather a detailed description of the dataset. For example:

(a) When and by whom was the dataset created?

-> The dataset was created by the World Health Organization (WHO) in collaboration with the United Nations (for economic data). It aggregates health and economic indicators from 2000 to 2015 across 193 countries. The project acknowledges contributions from Deeksha Russell and Duan Wang, who compiled and merged the data from WHO’s Global Health Observatory (GHO) and UN sources.

(b) How did you access the dataset?

-> I accessed the dataset through Kaggle, where it is published under the title "Life Expectancy (WHO)".

(c) How does it align with the chosen UNSDG?

-> This dataset aligns closely with UNSDG 3: Good Health and Well-being, which aims to "ensure healthy lives and promote well-being for all at all ages."

(d) List all the attributes (columns) present in the dataset.

-> The dataset contains 22 columns:
1. Country

2. Year (2000–2015)

3. Status (Developed/Developing)

4. Life Expectancy

5. Adult Mortality (deaths per 1000, ages 15–60)

6. Infant Deaths (per 1000)

7. Alcohol Consumption (liters per capita)

8. Percentage Expenditure (healthcare)

9. Hepatitis B Immunization (%)

10. Measles Cases (per 1000)

11. BMI (average)

12. Under-Five Deaths (per 1000)

13. Polio Immunization (%)

14. Total Expenditure (healthcare)

15. Diphtheria Immunization (%)

16. HIV/AIDS Deaths (per 1000)

17. GDP

18. Population

19. Thinness 1–19 Years (%)

20. Thinness 5–9 Years (%)

21. Income Composition of Resources (index 0–1)

22. Schooling (years)



• Identify potential questions that the dataset could help answer.

-> The potential questions that the dataset could help answer are:
1. How do immunization rates (e.g., polio, hepatitis B) correlate with life expectancy?

2. Does healthcare expenditure improve life expectancy in low-income countries?

3. What is the relationship between adult mortality and socioeconomic factors (e.g., GDP, schooling)?

4. How significant are lifestyle factors (BMI, alcohol consumption) in predicting lifespan?

5. Are densely populated countries more likely to have lower life expectancy?
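
Questions like the first one can be explored with pairwise correlations. The sketch below uses a toy frame whose values are invented for illustration; only the column names mirror the dataset. On the real data the same calls would run on df, and a high correlation would not by itself show that immunization causes longer lives.

```python
import pandas as pd

# Toy values for illustration only; they are not taken from the WHO data.
toy = pd.DataFrame({
    "Life expectancy": [55.0, 60.0, 65.0, 70.0, 75.0],
    "Polio": [40.0, 55.0, 70.0, 85.0, 95.0],
    "Hepatitis B": [35.0, 50.0, 72.0, 80.0, 96.0],
})

# Pearson correlation of each immunization column with life expectancy.
corr_with_target = (
    toy.corr()["Life expectancy"]
    .drop("Life expectancy")
    .sort_values(ascending=False)
)
print(corr_with_target)
```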

• Assess the dataset’s suitability for analysis (e.g., data completeness, relevance, and quality).

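-> The dataset has substantial missing data in a few columns: Population (652 of 2,938 rows, about 22%), Hepatitis B (553 rows, about 19%), and GDP (448 rows, about 15%), especially for smaller nations (e.g., Vanuatu). Its relevance is high, as it covers critical health, economic, and social factors from 2000 to 2015 and aligns with SDG 3 priorities. The dataset is reliable, as its primary sources are WHO and the UN, and it is suitable for analysis once the missing values are handled.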
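
Completeness is easiest to judge as a percentage of rows missing per column. A minimal sketch, assuming a helper named missing_share (a hypothetical name introduced here) and a toy frame with invented values:

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (frame.isnull().mean() * 100).sort_values(ascending=False)

# Toy frame for illustration only (not the WHO data).
toy = pd.DataFrame({
    "GDP": [1.0, np.nan, 3.0, np.nan],
    "Population": [np.nan, np.nan, np.nan, 4.0],
    "Year": [2000, 2001, 2002, 2003],
})

shares = missing_share(toy)
print(shares.round(1))
```

On the loaded data, the equivalent call would be missing_share(df).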

2. Conducting Exploratory Data Analysis (EDA):

Understanding the characteristics of the data beforehand is crucial for building a model with
acceptable performance. Before proceeding to build, train, and test the model, write code to
inspect, preview, summarize, explore, and visualize your data.

Preview Data

In [5]:
print("First 5 rows:")
print(df.head())

print("Last 5 rows:")
print(df.tail())
print("\nDataset shape:", df.shape)

First 5 rows:
       Country  Year      Status  Life expectancy  Adult Mortality  \
0  Afghanistan  2015  Developing             65.0            263.0   
1  Afghanistan  2014  Developing             59.9            271.0   
2  Afghanistan  2013  Developing             59.9            268.0   
3  Afghanistan  2012  Developing             59.5            272.0   
4  Afghanistan  2011  Developing             59.2            275.0   

   infant deaths  Alcohol  Percentage expenditure  Hepatitis B  Measles  ...  \
0             62     0.01               71.279624         65.0     1154  ...   
1             64     0.01               73.523582         62.0      492  ...   
2             66     0.01               73.219243         64.0      430  ...   
3             69     0.01               78.184215         67.0     2787  ...   
4             71     0.01                7.097109         68.0     3013  ...   

   Polio  Total expenditure  Diphtheria  HIV/AIDS         GDP  Populatio

In [6]:
print("Dataset Info:\n")
df.info()

Dataset Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   Percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10  BMI                              2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio

(a) Perform data cleaning and compute summary statistics for the dataset.

Handle missing values

In [7]:
print("Missing Values Before Handling:")
print(df.isnull().sum().sort_values(ascending=False))

Missing Values Before Handling:
Population                         652
Hepatitis B                        553
GDP                                448
Total expenditure                  226
Alcohol                            194
Income composition of resources    167
Schooling                          163
thinness 1-19 years                 34
thinness 5-9 years                  34
BMI                                 34
Diphtheria                          19
Polio                               19
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Status                               0
Country                              0
Year                                 0
under-five deaths                    0
Measles                              0
Percentage expenditure               0
HIV/AIDS                             0
dtype: int64
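
One way to fill gaps like these while respecting differences between countries is group-wise median imputation with a global fallback. The sketch below shows the idea on a toy frame with invented values; fill_by_group_median is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

def fill_by_group_median(frame, group_col, value_col):
    """Fill NaNs with the group's median, then with the global median."""
    group_median = frame.groupby(group_col)[value_col].transform("median")
    filled = frame[value_col].fillna(group_median)
    # Groups with no observed values still hold NaN; use the global median.
    return filled.fillna(frame[value_col].median())

# Toy frame for illustration only (not the WHO data).
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "BMI": [20.0, np.nan, 24.0, 30.0, 32.0, np.nan],
})
toy["BMI_filled"] = fill_by_group_median(toy, "Country", "BMI")
print(toy)
```

The group-wise step only has an effect if it runs before any global fill; once a column has been filled with a single global value, no NaNs remain for the group medians to replace.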


In [8]:
# Step 1: Identify numeric columns
# No blanket fill here: a global-median fill at this point would leave no NaNs
# for the group-wise steps below. Every column with missing values is covered
# by Steps 3-6, each of which ends with a global fallback.
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Step 2: Handle missing values in categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

# Replace missing values with the mode of each column
for col in categorical_cols:
    mode_value = df[col].mode()[0] if not df[col].mode().empty else "Unknown"
    df[col] = df[col].fillna(mode_value)

# Step 3: Country-specific imputation for health-related metrics
health_cols = ['Life expectancy', 'Adult Mortality', 'BMI', 
               'thinness 1-19 years', 'thinness 5-9 years']

for col in health_cols:
    # Fill gaps with the median of the same country
    country_median = df.groupby('Country')[col].transform('median')
    df[col] = df[col].fillna(country_median)
    # Fill any remaining NaNs with global median
    global_median = df[col].median()
    df[col] = df[col].fillna(global_median)

# Step 4: Status-based imputation for vaccination rates
vaccine_cols = ['Polio', 'Diphtheria', 'Hepatitis B']

for col in vaccine_cols:
    # Calculate status-specific modes (fall back to the median if no mode exists)
    status_modes = df.groupby('Status')[col].agg(lambda x: x.mode()[0] if not x.mode().empty else x.median())
    df[col] = df[col].fillna(df['Status'].map(status_modes))
    # Fill any remaining NaNs with global mode
    global_mode = df[col].mode()[0] if not df[col].mode().empty else df[col].median()
    df[col] = df[col].fillna(global_mode)

# Step 5: Temporal filling for economic indicators
economic_cols = ['GDP', 'Population', 'Income composition of resources']

# Sort by Country and Year for forward/backward filling
df = df.sort_values(['Country', 'Year'])

for col in economic_cols:
    # Forward fill within each country
    df[col] = df.groupby('Country')[col].ffill()
    # Backward fill within each country
    df[col] = df.groupby('Country')[col].bfill()
    # Fill any remaining NaNs with global median
    global_median = df[col].median()
    df[col] = df[col].fillna(global_median)

# Step 6: Remaining columns (global median imputation)
remaining_cols = ['Alcohol', 'Schooling', 'Total expenditure']

for col in remaining_cols:
    global_median = df[col].median()
    df[col] = df[col].fillna(global_median)

# Final Validation
print("\nMissing Values After Enhanced Handling:")
print(df.isnull().sum().sort_values(ascending=False))

# Save the cleaned dataset
df.to_csv("cleaned_life_expectancy.csv", index=False)


Missing Values After Enhanced Handling:
Country                            0
Year                               0
Status                             0
Life expectancy                    0
Adult Mortality                    0
infant deaths                      0
Alcohol                            0
Percentage expenditure             0
Hepatitis B                        0
Measles                            0
BMI                                0
under-five deaths                  0
Polio                              0
Total expenditure                  0
Diphtheria                         0
HIV/AIDS                           0
GDP                                0
Population                         0
thinness 1-19 years                0
thinness 5-9 years                 0
Income composition of resources    0
Schooling                          0
dtype: int64
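
With the gaps filled, summary statistics can be computed for the cleaned data. A minimal sketch on a toy frame with invented values; on the real data the same calls would run on df:

```python
import pandas as pd

# Toy frame for illustration only (not the WHO data).
toy = pd.DataFrame({
    "Status": ["Developing", "Developing", "Developed", "Developed"],
    "Life expectancy": [60.0, 64.0, 80.0, 82.0],
    "Schooling": [9.0, 10.0, 16.0, 17.0],
})

# Count, mean, std, min, quartiles, and max for each numeric column.
overall = toy.describe().T

# The same target summarized separately for each development status.
by_status = toy.groupby("Status")["Life expectancy"].agg(["mean", "median", "min", "max"])

print(overall)
print(by_status)
```

Splitting the summary by Status is one possible design choice; it makes differences between developed and developing countries visible before any modelling.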
