## Dataset Overview

* Number of observations (n): 2938

* Number of predictors (p): 21

n ≥ p + 5 ⇒ 2938 ≥ 21 + 5 = 26,
so the dataset has enough observations for multivariable regression.
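
The rule of thumb above can be checked directly. This is a minimal sketch; `min_required` and `is_suitable` are illustrative names that do not appear elsewhere in the notebook.

```python
# Sample-size rule of thumb: n >= p + 5
n = 2938  # number of observations
p = 21    # number of predictors

min_required = p + 5
is_suitable = n >= min_required

print(f"n = {n}, minimum required = {min_required}, suitable: {is_suitable}")
```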

## Feature Explanation
### Demographic & Identifiers

*   Country
*   Year: Year of observation (2000–2015).
*   Status: Developing or Developed country.

### Health & Mortality Indicators

*   Adult Mortality: Probability of dying between ages 15 and 60 per 1000 population.
*   infant deaths: Number of infant deaths per 1000 live births.
*   under-five deaths: Deaths of children under 5 per 1000 population.
*   Hepatitis B: Immunization coverage against Hepatitis B.
*   Polio: Polio immunization rate.
*   Diphtheria: DPT vaccination coverage.
*   Measles: Number of reported cases per 1000 population.
*   HIV/AIDS: Deaths per 1000 population.

### Lifestyle & Physical Health

*   Alcohol: Average annual alcohol consumption per person.
*   BMI: Average Body Mass Index.
*   thinness 1–19 years: Malnutrition indicator for youth.
*   thinness 5–9 years: Malnutrition indicator for children.

### Economic Indicators

*   GDP: Average income generated per person in a country.
*   percentage expenditure: Health expenditure as a percentage of GDP.
*   Total expenditure: Government health spending (%).
*   Income composition of resources: Composite income index; strong predictor of development.

### Education & Social Development

*   Schooling: Average years of schooling.
*   Population: Total population of a country.



In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset

In [None]:
data = pd.read_csv('Dataset/Life-Expectancy-Data-Updated.csv')
data.head()

# Check the dimension of the dataset

In [None]:
print(data.shape)

2938 rows → number of observations (n)

22 columns → number of variables (features + target)



# Get information about the data

In [None]:
# Display column names and data types
data.info()


# Classification of Variables in the Dataset


In [None]:
# Categorical variables (Country and Status are the two non-numeric columns discussed in this notebook)
categorical_variables = ["Country", "Status"]

# Numerical variables
numerical_variables = data.columns.drop(categorical_variables)

numerical_variables

In [None]:
data["Country"].unique()

In [None]:
data["Country"].unique().size

Country is a nominal categorical variable with 193 unique categories. One-hot encoding it would add nearly 200 dummy variables, so we drop this column instead and assume the country label itself adds little beyond what the other predictors capture.

Status is a binary categorical variable → label encoding:

```
Developed → 1, Developing → 0
```
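
A minimal sketch of these two steps on a toy frame, assuming the column names `Country` and `Status` and the mapping shown above. The names `toy` and `status_map` are hypothetical, and the toy rows are made up for illustration.

```python
import pandas as pd

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Status": ["Developed", "Developing", "Developing"],
})

# Drop the high-cardinality Country column
toy = toy.drop(columns=["Country"])

# Label-encode the binary Status column
status_map = {"Developed": 1, "Developing": 0}
toy["Status"] = toy["Status"].map(status_map)

print(toy)
```

Any `Status` value missing from `status_map` would become `NaN` under `map`, so it is worth checking `data["Status"].unique()` before applying the same idea to the full dataset.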






# Count missing values for each column


In [None]:
missing_count = data.isnull().sum()
missing_count

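# Drop rows with missing values in the target variable ['Life expectancy ']

In [None]:
# .copy() gives an independent frame, so later column assignments cannot trigger SettingWithCopyWarning
data = data.dropna(subset=['Life expectancy ']).copy()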

# Immunization columns (Hepatitis B, Polio, Diphtheria)
Missing values in the immunization variables are imputed using the group-wise median
by country development status, which preserves realistic differences
between developed and developing countries.


In [None]:
# Group-wise median imputation for immunization variables
immunization_cols = ["Hepatitis B", "Polio", "Diphtheria "]

for col in immunization_cols:
    data[col] = data.groupby("Status")[col].transform(
        lambda x: x.fillna(x.median())
    )
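
To see what the group-wise median does, here is a small sketch on made-up values. The `toy` frame is hypothetical; only the `groupby(...).transform(...)` pattern is taken from the cell above.

```python
import numpy as np
import pandas as pd

# Toy data: one missing Polio value in each Status group
toy = pd.DataFrame({
    "Status": ["Developed", "Developed", "Developed",
               "Developing", "Developing", "Developing"],
    "Polio": [95.0, np.nan, 99.0, 60.0, 80.0, np.nan],
})

# Each missing value is filled with the median of its own group
toy["Polio"] = toy.groupby("Status")["Polio"].transform(
    lambda x: x.fillna(x.median())
)

print(toy)
```

The missing value in the Developed group becomes 97.0 (median of 95 and 99), and the one in the Developing group becomes 70.0 (median of 60 and 80).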

# Social variables (Schooling, Income composition)
Social variables are imputed using the group-wise median by development status
because education and income composition are strongly associated with a
country's level of development.

In [None]:
social_cols = ["Schooling", "Income composition of resources"]

for col in social_cols:
    data[col] = data.groupby("Status")[col].transform(
        lambda x: x.fillna(x.median())
    )


# Economic variables
GDP and Population are imputed using the group-wise median because of their strong
association with development status.

Total health expenditure is imputed using the global median.

In [None]:
# GDP and Population are imputed using group-wise median
for col in ["GDP", "Population"]:
    data[col] = data.groupby("Status")[col].transform(
        lambda x: x.fillna(x.median())
    )

# Total health expenditure is imputed using the global median.
data["Total expenditure"] = data["Total expenditure"].fillna(
    data["Total expenditure"].median()
)


## Lifestyle-related variables are imputed using the median of each column.

In [None]:
lifestyle_cols = ["Alcohol", " BMI ", " thinness  1-19 years", " thinness 5-9 years"]

# Impute lifestyle variables using median
for col in lifestyle_cols:
    data.loc[:, col] = data[col].fillna(data[col].median())

# Check that no missing values remain in the dataset

In [None]:
data.isnull().sum()

## Outlier Detection using IQR

In [None]:
outlier_summary = {}  # Dictionary to store the number of outliers for each column

for col in numerical_variables:
    Q1 = data[col].quantile(0.25)  # 25th percentile
    Q3 = data[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1                  # Interquartile range (IQR)

    lower_bound = Q1 - 1.5 * IQR   # Lower bound
    upper_bound = Q3 + 1.5 * IQR   # Upper bound

    # Select rows with value is below lower_bound or above upper_bound
    outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]

    outlier_summary[col] = outliers.shape[0]  # Store the count of outliers

# Convert the dictionary to a DataFrame
pd.DataFrame.from_dict(outlier_summary, orient="index", columns=["Outlier Count"])
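
As a worked example of the IQR rule, the sketch below applies the same bounds to a made-up series. The series `s` and its values are hypothetical.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])

q1 = s.quantile(0.25)      # 2.0
q3 = s.quantile(0.75)      # 4.0
iqr = q3 - q1              # 2.0

lower = q1 - 1.5 * iqr     # -1.0
upper = q3 + 1.5 * iqr     # 7.0

# Values outside [lower, upper] are flagged as outliers
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())
```

Only the value 100 falls outside the fences, so it is the single flagged outlier.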


#### Not all features have the same nature, so we can't apply the same outlier treatment to all of them.

#### Features where the difference between two outliers is important (e.g., GDP) will be treated using a log transformation.

#### Features where the difference doesn't matter will be capped.

#### Rows where the outliers are probably due to errors will be deleted.

In [None]:
# 1. Define lists for different treatments
# Column names follow the raw naming (including stray spaces) used in the rest of this notebook;
# verify them against data.columns before running.
cols_to_log = ['infant deaths', 'under-five deaths ', 'GDP', 'Population']
cols_to_cap = ['Alcohol', 'Hepatitis B', 'Measles ', ' BMI ', 'Polio', 'Diphtheria ', ' HIV/AIDS', ' thinness  1-19 years', ' thinness 5-9 years', 'Schooling']
cols_to_drop_rows = ['Life expectancy ', 'Adult Mortality']

# STRATEGY 1: Log Transformation (To consider the differences between outliers)
for col in cols_to_log:
    data[col] = np.log1p(data[col])  # log1p(x) = log(1 + x), so zeros map to 0 instead of -inf

# STRATEGY 2: Capping (Handles "Noisy" Outliers)
for col in cols_to_cap:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    
    # Apply Capping
    data[col] = np.where(data[col] < lower, lower, data[col])
    data[col] = np.where(data[col] > upper, upper, data[col])

# STRATEGY 3: Dropping Rows (Handles Data Errors)
for col in cols_to_drop_rows:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    # Delete rows outside the IQR fences (bounds are inclusive, matching the detection rule)
    data = data[(data[col] >= lower) & (data[col] <= upper)]
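
The capping and log strategies can be illustrated on a small made-up series. This sketch assumes nothing beyond pandas and NumPy; `Series.clip` is shown as an equivalent to the pair of `np.where` calls, not as the method used in the notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # -1.0 and 7.0

# Capping with np.where, as in the cell above
capped_where = np.where(s < lower, lower, s)
capped_where = np.where(capped_where > upper, upper, capped_where)

# Equivalent one-liner with Series.clip
capped_clip = s.clip(lower=lower, upper=upper)

# Log transformation keeps the ordering but compresses large gaps
logged = np.log1p(s)

print(capped_where.tolist(), capped_clip.tolist())
print(logged.round(3).tolist())
```

Capping replaces 100 with the upper fence (7.0), which discards how far out the value was. The log transform keeps 100 as the largest value but shrinks its distance from the rest, which is why it suits features where the size of the difference still matters.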

# Check for remaining outliers after treatment


In [None]:
outlier_summary_after = {}

for col in numerical_variables:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
    outlier_summary_after[col] = outliers.shape[0]

pd.DataFrame.from_dict(outlier_summary_after, orient="index", columns=["Outlier Count"])

# Descriptive Statistics

In [None]:
def compute_mean(x):
    return sum(x) / len(x)



In [None]:
def compute_median(x):
    x_sorted = sorted(x)
    n = len(x_sorted)
    mid = n // 2
    if n % 2 == 0:
        return (x_sorted[mid - 1] + x_sorted[mid]) / 2
    else:
        return x_sorted[mid]


In [None]:
def compute_variance(x):
    mean = compute_mean(x)
    return sum((xi - mean)**2 for xi in x) / (len(x) - 1)


In [None]:
def compute_std(x):
    return compute_variance(x) ** 0.5
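
`compute_variance` divides by `n - 1`, so it returns the sample variance. The sketch below, on a made-up list, shows how that compares with the standard library and NumPy. Note that `np.var` defaults to the population formula (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`).

```python
import statistics
import numpy as np

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(sample) / len(sample)
squared_dev = sum((v - mean) ** 2 for v in sample)

sample_var = squared_dev / (len(sample) - 1)   # n - 1 denominator, as in compute_variance
population_var = squared_dev / len(sample)     # n denominator

print(sample_var, statistics.variance(sample), np.var(sample, ddof=1))
print(population_var, np.var(sample))          # NumPy's default is ddof=0
```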


In [None]:
def compute_mode(x):
    freq = {}
    for value in x:
        freq[value] = freq.get(value, 0) + 1
    max_freq = max(freq.values())
    modes = [k for k, v in freq.items() if v == max_freq]
    return modes[0]  # return first mode


In [None]:
def min_value(x):
    m = x[0]
    for value in x:
        if value < m:
            m = value
    return m

def max_value(x):
    m = x[0]
    for value in x:
        if value > m:
            m = value
    return m


In [None]:
pd.options.display.float_format = '{:.6f}'.format


## Measures of Dispersion

In [None]:
dispersion = {}

for col in numerical_variables:
    values = data[col].dropna().tolist()
    dispersion[col] = {
        "Variance": compute_variance(values),
        "Standard Deviation": compute_std(values),
        "Min Value": min_value(values),
        "Max Value": max_value(values)
    }
dispersion_df = pd.DataFrame(dispersion).T
dispersion_df


## Measures of Central Tendency


In [None]:
central_tendency = {}

for col in numerical_variables:
    values = data[col].dropna().tolist()
    central_tendency[col] = {
        "Mean": compute_mean(values),
        "Median": compute_median(values),
        "Mode": compute_mode(values)
    }

central_df = pd.DataFrame(central_tendency).T
central_df


In [None]:
data.describe()

# Split data into target and features for the regression model

In [None]:
# Define target variable
y = data['Life expectancy ']

# Define feature matrix (drop target)
X = data.drop(columns=['Life expectancy '])


# Feature Standardization (Z-score)

In [None]:
X_standardized = X.copy()


In [None]:
# Standardize numeric columns only; non-numeric columns (e.g. Country, Status) are left unchanged
for col in X_standardized.select_dtypes(include="number").columns:
    values = X_standardized[col].values
    mean = compute_mean(values)
    std = compute_std(values)
    if std != 0:  # avoid dividing by zero
        X_standardized[col] = [(x - mean) / std for x in values]
    else:
        # If the feature has zero variance, set standardized values to zero
        X_standardized[col] = 0.0

In [None]:
X_standardized
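
For comparison, the same z-score can be written in vectorized pandas. This is a sketch on a made-up frame, not the notebook's method; `toy` is hypothetical. It relies on pandas' `std()` using `ddof=1`, which matches `compute_std`.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "const": [5.0, 5.0, 5.0],   # zero-variance column
})

std = toy.std()                                  # sample std (ddof=1)
z = (toy - toy.mean()) / std.where(std != 0)     # zero std becomes a NaN denominator
z = z.fillna(0.0)                                # zero-variance columns are set to 0

print(z)
```

Here `fillna(0.0)` is safe only because the toy frame has no missing values; on real data, missing entries should be imputed first so they are not silently turned into zeros.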

# Save cleaned data

In [None]:
import os

# Create the directory if it doesn't exist
os.makedirs('Dataset', exist_ok=True)

data.to_csv('Dataset/Life Expectancy Data Cleaned.csv', index=False)