# Life Expectancy Data Science Project

---------------------------------------------

### Introduction
This analysis explores a life expectancy dataset, aiming to uncover factors affecting life expectancy across countries over time. We'll handle missing values, engineer features, perform exploratory analysis, visualize patterns, and build a regression model to predict life expectancy.

### Objectives
- Understand the structure and quality of the dataset
- Identify key features affecting life expectancy
- Handle missing data appropriately
- Engineer new features to improve prediction
- Visualize relationships and trends
- Build a regression model to predict life expectancy
- Evaluate model performance using cross-validation
- Derive actionable insights

### Task 1: Explore Dataset and Missing Values

In [44]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from scipy.stats.mstats import winsorize
import re

In [None]:
df = pd.read_csv('Life_Expectancy_Data.csv')
df.shape 

In [None]:
df.dtypes

In [None]:
df.dtypes.value_counts()

In [None]:
df.columns

In [45]:

# Strip leading/trailing spaces and collapse runs of whitespace in every column name
df.columns = [re.sub(r'\s+', ' ', column.strip()) for column in df.columns]

df.columns

Index(['Country', 'Year', 'Status', 'Life expectancy', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles', 'BMI', 'under-five deaths', 'Polio', 'Total expenditure',
       'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness 1-19 years',
       'thinness 5-9 years', 'Income composition of resources', 'Schooling',
       'Health Spending Ratio', 'Deaths per Infant'],
      dtype='object')

In [None]:
df.head(20)

In [None]:
print(df.duplicated())

In [None]:
# Count the fully duplicated rows
print(df.duplicated().sum())

In [None]:
# Get the number of unique countries from the 'Country' column
number_of_countries = df['Country'].nunique()

# Print the number of unique countries
print(f"The total number of unique countries in the dataset is: {number_of_countries}")

### Task 2: Handle Missing Data and Justify Method

In [None]:
null_values = df.isnull().sum()

In [None]:
#Checks if any column has NaN
df.isnull().any()

In [None]:
#Checks if any row has NaN
df.isnull().any(axis=1)

In [None]:
#Checks if all values in a column are NaN
df.isnull().all()

In [None]:
#Checks if all values in a row are NaN
df.isnull().all(axis=1)

In [None]:
null_percentage = (df.isnull().sum() / len(df))*100
print(null_percentage)

In [None]:
missing_df = pd.DataFrame({'Missing Values': null_values, 'Percent Missing': null_percentage})
missing_df[missing_df['Missing Values'] > 0]

Advanced mechanisms to handle missing data: per-country outlier detection

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import zscore

# Load your dataset
df = pd.read_csv("Life_Expectancy_Data.csv")
df.columns = df.columns.str.strip()  # Clean column names

# Get numerical features only
num_features = df.select_dtypes(include='number').columns

# Initialize DataFrame to store outlier flags
outliers_df = pd.DataFrame(False, index=df.index, columns=num_features)

import warnings

# Detect outliers by country (improved)
for country, group in df.groupby("Country"):
    group_index = group.index
    group_numeric = group[num_features]

    # Filter columns with non-zero std to avoid division by 0
    valid_cols = group_numeric.loc[:, group_numeric.std() > 0]

    if valid_cols.empty:
        continue  # Skip if all stds are zero

    # Suppress only this specific warning
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message=".*Precision loss occurred.*")
        z_scores = valid_cols.apply(zscore, nan_policy='omit')

    # Mark outliers for these valid columns
    outliers = np.abs(z_scores) > 3
    outliers_df.loc[group_index, valid_cols.columns] = outliers


# Count outliers per feature
outlier_counts = outliers_df.sum().sort_values(ascending=False)

# Show results
print("✅ Country-wise Outlier Counts per Feature:")
print(outlier_counts)


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# === Select features to plot ===
features = ['Polio', 'Diphtheria', 'Income composition of resources',
            'HIV/AIDS', 'thinness  1-19 years', 'thinness 5-9 years']

# === Melt into long format ===
df_melted = df[['Country'] + features].melt(id_vars='Country', var_name='Feature', value_name='Value')

# === Clip each Country/Feature series to its IQR fences ===
def clip_to_iqr_fences(values):
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1 if q3 > q1 else 1  # use a unit-width IQR when the IQR is zero
    return values.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# transform keeps the melted frame's rows and columns intact across pandas versions
df_scaled = df_melted.copy()
df_scaled['Scaled'] = df_scaled.groupby(['Country', 'Feature'])['Value'].transform(clip_to_iqr_fences)

# === Plot all features together ===
plt.figure(figsize=(14, 6))
sns.boxplot(data=df_scaled, x='Feature', y='Scaled')
plt.title("Boxplot of Selected Features (Clipped to Per-Country IQR Fences)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# === Select features to plot ===
features = ['Measles', 'GDP','percentage expenditure', 'Adult Mortality', 'under-five deaths']

# === Melt into long format ===
df_melted = df[['Country'] + features].melt(id_vars='Country', var_name='Feature', value_name='Value')

# === Clip each Country/Feature series to its IQR fences ===
def clip_to_iqr_fences(values):
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1 if q3 > q1 else 1  # use a unit-width IQR when the IQR is zero
    return values.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# transform keeps the melted frame's rows and columns intact across pandas versions
df_scaled = df_melted.copy()
df_scaled['Scaled'] = df_scaled.groupby(['Country', 'Feature'])['Value'].transform(clip_to_iqr_fences)

# === Plot all features together ===
plt.figure(figsize=(14, 6))
sns.boxplot(data=df_scaled, x='Feature', y='Scaled')
plt.title("Boxplot of Selected Features (Clipped to Per-Country IQR Fences)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# === First feature group ===
features_1 = ['Polio', 'Diphtheria', 'Income composition of resources',
              'HIV/AIDS', 'thinness  1-19 years', 'thinness 5-9 years']

# === Second feature group ===
features_2 = ['Measles', 'GDP', 'percentage expenditure', 'Adult Mortality', 'under-five deaths', 'Total expenditure']

features_3 = ['Population', 'Alcohol', 'Schooling', 'BMI', 'Life expectancy', 'Hepatitis B']

# === Prepare melted data for each feature group ===
def melt_and_tag_outliers(df, features):
    melted = df[['Country'] + features].melt(id_vars='Country', var_name='Feature', value_name='Value')

    # Flag values outside the 1.5 * IQR fences of their own Country/Feature series
    def is_iqr_outlier(values):
        q1 = values.quantile(0.25)
        q3 = values.quantile(0.75)
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        return (values < lower) | (values > upper)

    # transform returns one flag per row, so no grouping columns are lost
    melted['Outlier'] = melted.groupby(['Country', 'Feature'])['Value'].transform(is_iqr_outlier).astype(bool)
    return melted

df_tagged_1 = melt_and_tag_outliers(df, features_1)
df_tagged_2 = melt_and_tag_outliers(df, features_2)
df_tagged_3 = melt_and_tag_outliers(df, features_3)

# === Create subplots ===
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize=(14, 30))  # three stacked panels, one per feature group


# --- Plot 1 ---
sns.boxplot(data=df_tagged_1, x='Feature', y='Value', showfliers=False, ax=ax1)
sns.stripplot(
    data=df_tagged_1[df_tagged_1['Outlier']],
    x='Feature',
    y='Value',
    hue='Country',
    dodge=True,
    jitter=True,
    marker='o',
    alpha=0.6,
    linewidth=0.5,
    edgecolor='gray',
    palette='tab20',
    ax=ax1
)
ax1.set_title("Group 1: Country-Based Outliers (Polio, Diphtheria, etc.)")
ax1.tick_params(axis='x', rotation=45)
ax1.legend_.remove()


# --- Plot 2 ---
sns.boxplot(data=df_tagged_2, x='Feature', y='Value', showfliers=False, ax=ax2)
sns.stripplot(
    data=df_tagged_2[df_tagged_2['Outlier']],
    x='Feature',
    y='Value',
    hue='Country',
    dodge=True,
    jitter=True,
    marker='o',
    alpha=0.6,
    linewidth=0.5,
    edgecolor='gray',
    palette='tab20',
    ax=ax2
)
ax2.set_title("Group 2: Country-Based Outliers (Measles, GDP, etc.)")
ax2.tick_params(axis='x', rotation=45)

ax2.legend_.remove()

# --- Plot 3 ---
sns.boxplot(data=df_tagged_3, x='Feature', y='Value', showfliers=False, ax=ax3)
sns.stripplot(
    data=df_tagged_3[df_tagged_3['Outlier']],
    x='Feature',
    y='Value',
    hue='Country',
    dodge=True,
    jitter=True,
    marker='o',
    alpha=0.6,
    linewidth=0.5,
    edgecolor='gray',
    palette='tab20',
    ax=ax3
)
ax3.set_title("Group 3: Country-Based Outliers (Population, Alcohol, etc.)")
ax3.tick_params(axis='x', rotation=45)
ax3.legend_.remove()


# === Add legend outside the full figure ===
handles, labels = ax2.get_legend_handles_labels()
fig.legend(handles, labels, title='Country', bbox_to_anchor=(1.02, 0.5), loc='center left')
plt.show()


In [None]:
import pandas as pd
from scipy.stats import median_abs_deviation

# === Initialize tracker ===
outlier_indices_per_column = {col: set() for col in df.select_dtypes(include='number').columns}

# === Modified Z-Score function ===
def detect_outliers_modified_z(series, threshold=3.5):
    median = series.median()
    # Use the raw (unscaled) MAD: the 0.6745 factor below already makes it comparable to std,
    # so scale='normal' would apply the correction twice. nan_policy='omit' ignores missing values.
    mad = median_abs_deviation(series, nan_policy='omit')
    if pd.isna(mad) or mad == 0:
        return pd.Series(False, index=series.index)
    z_scores = 0.6745 * (series - median) / mad
    return z_scores.abs() > threshold

# === Process per country ===
for country, group in df.groupby('Country'):
    for col in df.select_dtypes(include='number').columns:
        is_outlier = detect_outliers_modified_z(group[col])
        outlier_indices_per_column[col].update(group[is_outlier].index)

# === Build summary table ===
summary = []
for col, indices in outlier_indices_per_column.items():
    count = len(indices)
    summary.append({
        'Column': col,
        'Outlier Count': count,
        'Percentage': round((count / len(df)) * 100, 2)
    })

# === Create DataFrame and show results ===
outlier_summary = pd.DataFrame(summary).sort_values(by='Outlier Count', ascending=False)

print("📊 Outlier Detection Summary (Modified Z-Score per Country, Unique Rows Only)")
print("=" * 70)
print(outlier_summary.to_string(index=False))
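
The modified z-score replaces the mean and standard deviation with the median and the median absolute deviation (MAD): `M = 0.6745 * (x - median) / MAD`, flagged when `|M| > 3.5`. The sketch below is illustrative only; the toy series and both helper functions are assumptions, not part of the dataset or the cells above. It shows why this matters for short per-country series: two extreme values inflate the standard deviation enough to hide each other from the classic z-score, while the median-based score still flags them.

```python
import numpy as np

def classic_z_scores(values):
    """Classic z-score: distance from the mean in units of standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - np.nanmean(values)) / np.nanstd(values)

def modified_z_scores(values):
    """Modified z-score: distance from the median in units of scaled MAD."""
    values = np.asarray(values, dtype=float)
    median = np.nanmedian(values)
    mad = np.nanmedian(np.abs(values - median))
    if mad == 0:
        # No spread around the median: nothing can be scored, so return zeros
        return np.zeros_like(values)
    return 0.6745 * (values - median) / mad

# Hypothetical 16-year series for one country with two extreme readings
toy_series = np.array([10, 11, 12, 11, 10, 12, 11, 10,
                       11, 12, 11, 10, 12, 11, 100, 100])

classic_flags = np.abs(classic_z_scores(toy_series)) > 3
modified_flags = np.abs(modified_z_scores(toy_series)) > 3.5

print("Flagged by classic z-score: ", int(classic_flags.sum()))
print("Flagged by modified z-score:", int(modified_flags.sum()))
```

With only 16 observations per country, this masking effect is one reason to prefer the median-based rule over a plain z-score threshold.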


### Implementation of Data Handling

In [None]:
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize

# === Load dataset ===
try:
    df = pd.read_csv("Life_Expectancy_Data.csv")
    df.columns = df.columns.str.strip()
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'Life_Expectancy_Data.csv' not found.")
    raise  # stop this cell without shutting down the notebook kernel

# === Define feature categories ===
# These will use percentile-based capping due to skewness
skewed_features = ['Adult Mortality']
# These will use IQR capping
iqr_features = ['Polio', 'Diphtheria', 'Hepatitis B',
                'Total expenditure', 'percentage expenditure']
# These will use log transformation
log_transform_features = ['Measles', 'Population', 'Alcohol']


# All features to process
all_features = skewed_features + iqr_features + log_transform_features 

# === Impute missing values using country-wise median ===
print("🔧 Imputing missing values with country-wise median...")
for col in all_features:
    if col in df.columns:
        df[col] = df.groupby('Country')[col].transform(lambda x: x.fillna(x.median()))
        # Countries with no observed values at all fall back to the global median
        if df[col].isnull().any():
            df[col] = df[col].fillna(df[col].median())
print("Imputation complete.")

# === Helper Functions ===

def percentile_cap_grouped(df, col, lower=0.01, upper=0.99):
    """Apply percentile capping per country."""
    def cap(x):
        return x.clip(lower=x.quantile(lower), upper=x.quantile(upper))
    return df.groupby("Country")[col].transform(cap)

def iqr_cap_grouped(df, col):
    """Apply IQR capping per country."""
    def cap(x):
        q1 = x.quantile(0.25)
        q3 = x.quantile(0.75)
        iqr = q3 - q1
        if iqr == 0 or x.isnull().all():
            return x
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        return x.clip(lower, upper)
    return df.groupby("Country")[col].transform(cap)

def apply_log_transform_safely(x):
    """Log transform safely, avoiding log(0)."""
    return np.log1p(x.clip(lower=0))

# === Apply transformations ===
print("\nApplying transformations and tracking changes...")
df_cleaned = df.copy()
changes_summary = {}

# 1. Percentile Capping → skewed features
for col in skewed_features:
    if col in df_cleaned.columns:
        before = df_cleaned[col].copy()
        df_cleaned[col] = percentile_cap_grouped(df_cleaned, col)
        changes_summary[col] = (df_cleaned[col] != before).sum()
        print(f"✔ Percentile Capped → {col}: {changes_summary[col]} values modified")

# 2. IQR Capping → other features
for col in iqr_features:
    if col in df_cleaned.columns:
        before = df_cleaned[col].copy()
        df_cleaned[col] = iqr_cap_grouped(df_cleaned, col)
        changes_summary[col] = (df_cleaned[col] != before).sum()
        print(f"✔ IQR Capped → {col}: {changes_summary[col]} values modified")

# 3. Log Transform → log_transform_features
for col in log_transform_features:
    if col in df_cleaned.columns:
        before = df_cleaned[col].copy()
        df_cleaned[col] = apply_log_transform_safely(df_cleaned[col])
        changes_summary[col] = (df_cleaned[col] != before).sum()
        print(f"✔ Log Transformed → {col}: {changes_summary[col]} values modified")

# === Save cleaned dataset ===
final_cols = ['Country', 'Year', 'Life expectancy'] + all_features
final_cols = [col for col in final_cols if col in df_cleaned.columns]
df_final = df_cleaned[final_cols]
df_final.to_csv("Cleaned_Life_Expectancy_Countrywise_Final.csv", index=False)

# === Summary ===
print("\n✅ Final Summary of Outlier Handling:")
for col, count in changes_summary.items():
    if col in skewed_features:
        method = "Percentile Capped"
    elif col in iqr_features:
        method = "IQR Capped"
    else:
        method = "Log Transformed"
    print(f"✔ {method} → {col}: {count} values modified")

print("\n📁 Cleaned dataset saved as: Cleaned_Life_Expectancy_Countrywise_Final.csv")


In [None]:
# Handling missing values for BMI separately

# --- Step 1: Identify invalid BMI values ---
invalid_bmi_mask = (df['BMI'] < 15) | (df['BMI'] > 40)
num_bmi_replaced = invalid_bmi_mask.sum()

# --- Step 2: Replace invalid values with NaN ---
df['BMI'] = df['BMI'].mask(invalid_bmi_mask, np.nan)

# --- Step 3: Impute missing BMI values using MICE, with Life expectancy as the predictor ---
imputer = IterativeImputer(random_state=42)
# Fit on both columns together; any missing Life expectancy values are imputed here as well
imputed_values = imputer.fit_transform(df[['Life expectancy', 'BMI']])
df[['Life expectancy', 'BMI']] = imputed_values

# --- Step 4: Extract cleaned BMI data ---
cleaned_bmi_df = df[['Country', 'Year', 'BMI']].copy()

# --- Output Summary ---
print(f"{num_bmi_replaced} invalid BMI values were replaced with NaN and imputed using MICE.")
print("\nCleaned BMI Dataset (Sample):")
print(cleaned_bmi_df.head(10))

# Optional: Save to CSV
cleaned_bmi_df.to_csv("cleaned_bmi_data.csv", index=False)

In [None]:
# --- Step 0: Initial State ---
print("\n--- Initial State of GDP Column ---")
initial_missing_gdp = df['GDP'].isnull().sum()
print(f"Number of missing GDP values initially: {initial_missing_gdp}")

# --- Step 1: Detect Outliers on a Per-Country Basis ---
print("\n--- Step 1: Detecting Outliers for Each Country Individually ---")

def get_country_upper_bound(series):
    q3 = series.quantile(0.75)
    iqr = q3 - series.quantile(0.25)
    return q3 + 1.5 * iqr

def get_country_lower_bound(series):
    q1 = series.quantile(0.25)
    iqr = series.quantile(0.75) - q1
    iqr_lower_bound = q1 - 1.5 * iqr
    domain_lower_bound = 100.0
    return max(iqr_lower_bound, domain_lower_bound)

# Apply outlier bounds per country
country_upper_bounds = df.groupby('Country')['GDP'].transform(get_country_upper_bound)
country_lower_bounds = df.groupby('Country')['GDP'].transform(get_country_lower_bound)

outlier_mask = (df['GDP'] < country_lower_bounds) | (df['GDP'] > country_upper_bounds)
outliers = df[outlier_mask]

print(f"Number of GDP outliers detected across all countries: {len(outliers)}")
if not outliers.empty:
    print("Sample of detected outliers:")
    print(outliers[['Country', 'Year', 'GDP']].head())

# --- Step 2: Mark Outliers as NaN ---
print("\n--- Step 2: Marking Outliers as NaN ---")
df.loc[outlier_mask, 'GDP'] = np.nan
total_missing_after_marking = df['GDP'].isnull().sum()
print(f"Total GDP values now missing (NaN): {total_missing_after_marking}")

# --- Step 3: Impute with Country-Specific Mean ---
country_gdp_mean = df.groupby('Country')['GDP'].transform('mean')
# Assign back instead of using inplace=True on a column selection, which may not update df
df['GDP'] = df['GDP'].fillna(country_gdp_mean)

# Fallback to global mean if any still missing
global_gdp_mean = df['GDP'].mean()
df['GDP'] = df['GDP'].fillna(global_gdp_mean)

# --- Step 4: Extract Cleaned GDP Data ---
print("\n--- Step 4: Extracting Cleaned GDP Data ---")
cleaned_gdp_df = df[['Country', 'Year', 'GDP']].copy()
print("✅ GDP cleaned successfully. Sample:")
print(cleaned_gdp_df.head(10))

# Optional: Save to CSV
cleaned_gdp_df.to_csv("cleaned_gdp_data.csv", index=False)


### Task 3: Apply Chosen Method and Evaluate

In [None]:
# Fill any remaining numeric gaps with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)
df.isnull().sum()

In [None]:
nonNumericColumns = df.select_dtypes(include = 'object')
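for column in nonNumericColumns.columns:
    # Assign the result back: fillna returns a new Series and leaves df unchanged otherwise
    df[column] = df[column].fillna(df[column].mode()[0])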
    
df.isnull().sum()
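
One way to evaluate an imputation method is to hide values that are actually known, impute them, and measure how far the imputed values land from the truth. The sketch below does this on synthetic data only: the 20 hypothetical countries, the `value` column, and the 15% masking rate are all assumptions made for illustration. It compares a global-mean fill with a country-wise median fill.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 20 hypothetical countries x 16 years,
# each country fluctuating around its own level
countries = [f"C{i:02d}" for i in range(20)]
levels = rng.uniform(20, 90, size=len(countries))
toy = pd.DataFrame({
    "Country": np.repeat(countries, 16),
    "value": np.repeat(levels, 16) + rng.normal(0, 2, size=20 * 16),
})

# Hide roughly 15% of the known values so the imputations can be scored
mask = rng.random(len(toy)) < 0.15
observed = toy["value"].mask(mask)

# Candidate 1: fill with the global mean
global_mean_fill = observed.fillna(observed.mean())

# Candidate 2: fill with the median of the same country
country_median_fill = observed.fillna(
    observed.groupby(toy["Country"]).transform("median")
)

def masked_mae(filled):
    """Mean absolute error on the deliberately hidden entries only."""
    return (filled[mask] - toy.loc[mask, "value"]).abs().mean()

mae_global = masked_mae(global_mean_fill)
mae_country = masked_mae(country_median_fill)
print(f"MAE, global mean fill:    {mae_global:.2f}")
print(f"MAE, country median fill: {mae_country:.2f}")
```

The same masking procedure could be applied to columns of the real dataset to compare the mean fill used in this task with the country-wise and MICE imputations from Task 2.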

### Task 4: Identify Potential Features

In [None]:
# Display all columns in the DataFrame
pd.set_option('display.max_columns', None)
# Summary statistics for every column, transposed so each feature is a row
df.describe(include='all').T

### Task 5: Feature Engineering

In [None]:
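# Total expenditure divided by GDP
df['Health Spending Ratio'] = df['Total expenditure'] / df['GDP']
# Infant deaths divided by total population (a per-capita rate, despite the column name)
df['Deaths per Infant'] = df['infant deaths'] / df['Population']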
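
Ratio features like these become `inf` wherever the denominator is zero, which can distort later statistics and break model fitting. A minimal sketch of a guard follows; `safe_ratio` is a hypothetical helper and the three toy rows are made-up values, not records from the dataset.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Divide two Series, returning NaN where the denominator is zero or missing."""
    return numerator / denominator.replace(0, np.nan)

# Made-up rows; the middle one has a zero denominator
toy = pd.DataFrame({
    "Total expenditure": [8.2, 5.1, 6.0],
    "GDP": [584.3, 0.0, 1200.0],
})

toy["Health Spending Ratio"] = safe_ratio(toy["Total expenditure"], toy["GDP"])
print(toy)
```

The resulting NaN values can then go through the same imputation path as any other missing value instead of propagating as infinities.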

### Task 6: Impact of New Features

In [None]:
df[['Health Spending Ratio', 'Deaths per Infant']].describe()

### Task 7: Select Key Variables for Visualization

In [None]:
df[['Life expectancy', 'GDP', 'Schooling', 'Alcohol', 'BMI', 'HIV/AIDS']].corr()

### Task 8: Visualizations

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='coolwarm', annot=True)
plt.title('Correlation Heatmap')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x='Status', y='Life expectancy', data=df)
plt.title('Life Expectancy by Development Status')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='Alcohol', y='Life expectancy', hue='Status')
plt.title('Life Expectancy vs Alcohol')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='Hepatitis B', y='Life expectancy', hue='Status')
plt.title('Life Expectancy vs Hepatitis B')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='HIV/AIDS', y='Life expectancy', hue='Status')
plt.title('Life Expectancy vs HIV/AIDS')
plt.show()

In [None]:
# 3D Plot
fig = px.scatter_3d(df, x='GDP', y='Schooling', z='Life expectancy',
                     color='Status', size='Population')
fig.show()

### Task 9: Interpretation
- Higher GDP and schooling are associated with higher life expectancy.
- Developing countries tend to have more outliers and lower average life expectancy.
- HIV/AIDS has a strong negative correlation with life expectancy.
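
Correlation statements like these can be checked by ranking every numeric feature by its Pearson correlation with the target. The sketch below runs on synthetic data: `rank_correlations` is a hypothetical helper, and the toy columns and coefficients are assumptions chosen only to produce one positive, one negative, and one unrelated feature. On the real data it would be called as `rank_correlations(df, 'Life expectancy')`.

```python
import numpy as np
import pandas as pd

def rank_correlations(frame, target):
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic data: one positive driver, one negative driver, one unrelated column
rng = np.random.default_rng(1)
n = 200
schooling = rng.normal(12, 3, n)
hiv = rng.exponential(1.0, n)
toy = pd.DataFrame({
    "Schooling": schooling,
    "HIV/AIDS": hiv,
    "Unrelated": rng.normal(0, 1, n),
    "Life expectancy": 50 + 1.5 * schooling - 4.0 * hiv + rng.normal(0, 2, n),
})

ranking = rank_correlations(toy, "Life expectancy")
print(ranking)
```

Sorting by absolute value keeps strong negative predictors near the top rather than at the bottom of the list.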

### Task 10: Data Splitting and Model Training

In [None]:
features = ['GDP', 'Schooling', 'Alcohol', 'BMI', 'HIV/AIDS']
X = df[features]
y = df['Life expectancy']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

### Task 11: Cross Validation and Model Evaluation

In [None]:
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
# cross_val_score uses the estimator's default scorer, which is R² for LinearRegression
cross_val = cross_val_score(model, X, y, cv=5).mean()
mae, r2, cross_val
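
Because each country contributes one row per year, a random split places rows from the same country in both the training and test folds, which can make scores look better than they would for an unseen country. The sketch below shows one possible alternative, assuming scikit-learn's `cross_validate` with `GroupKFold`, and reports MAE alongside R². The data is synthetic: the 30 hypothetical countries, three features, and coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_validate

rng = np.random.default_rng(42)

# Synthetic panel: 30 hypothetical countries x 16 years, 3 numeric features
n_countries, n_years = 30, 16
groups = np.repeat(np.arange(n_countries), n_years)
X_toy = rng.normal(size=(n_countries * n_years, 3))
y_toy = 70 + X_toy @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 1.0, size=len(X_toy))

# GroupKFold keeps every row of a country inside a single fold
scores = cross_validate(
    LinearRegression(),
    X_toy,
    y_toy,
    groups=groups,
    cv=GroupKFold(n_splits=5),
    scoring={"mae": "neg_mean_absolute_error", "r2": "r2"},
)

# scikit-learn negates error metrics so that higher is always better
cv_mae = -scores["test_mae"].mean()
cv_r2 = scores["test_r2"].mean()
print(f"Grouped 5-fold CV: MAE = {cv_mae:.2f}, R² = {cv_r2:.3f}")
```

On the real data, `groups` would be `df['Country']`, with `X` and `y` as defined in Task 10.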

### Task 12: Conclusion and Recommendations
- **Key Findings**: Life expectancy is positively associated with GDP, schooling, and healthcare access. HIV/AIDS is a major negative predictor.
- **Model Performance**: The linear model gives reasonable accuracy, as measured by the hold-out MAE and R² and by the mean 5-fold cross-validated R² from Task 11.
- **Recommendation**: Focus on improving education, economic stability, and healthcare to raise life expectancy.