# Life Expectancy Data Science Project

---------------------------------------------

### Introduction
This analysis explores a life expectancy dataset, aiming to uncover factors affecting life expectancy across countries over time. We'll handle missing values, engineer features, perform exploratory analysis, visualize patterns, and build a regression model to predict life expectancy.

### Objectives
- Understand the structure and quality of the dataset
- Identify key features affecting life expectancy
- Handle missing data appropriately
- Engineer new features to improve prediction
- Visualize relationships and trends
- Build a regression model to predict life expectancy
- Evaluate model performance using cross-validation
- Derive actionable insights

### Task 1: Explore Dataset and Missing Values

In [None]:
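import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy.stats.mstats import winsorize
# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score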


In [None]:
df = pd.read_csv('Life_Expectancy_Data.csv')
df.shape 

In [None]:
df.dtypes

In [None]:
df.dtypes.value_counts()

In [None]:
df.columns

In [None]:
# Strip leading/trailing whitespace from all column names in one step
df.columns = df.columns.str.strip()

df.columns

In [None]:
df.head(20)

In [None]:
print(df.duplicated())

In [None]:
#Find total of duplicated values
print(df.duplicated().sum())

In [None]:
# Get the number of unique countries from the 'Country' column
number_of_countries = df['Country'].nunique()

# Print the number of unique countries
print(f"The total number of unique countries in the dataset is: {number_of_countries}")

### Task 2: Handle Missing Data and Justify Method

In [None]:
null_values = df.isnull().sum()

In [None]:
#Checks if any column has NaN
df.isnull().any()

In [None]:
#Checks if any row has NaN
df.isnull().any(axis=1)

In [None]:
#Checks if all values in a column are NaN
df.isnull().all()

In [None]:
#Checks if all values in a row are NaN
df.isnull().all(axis=1)

In [None]:
null_percentage = (df.isnull().sum() / len(df))*100
print(null_percentage)

In [None]:
missing_df = pd.DataFrame({'Missing Values': null_values, 'Percent Missing': null_percentage})
missing_df[missing_df['Missing Values'] > 0]

Outlier detection (Z-score, boxplots, IQR) as groundwork for handling missing and extreme values

In [None]:
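# Using Z-scores to detect outliers
from scipy.stats import zscore

# Calculate Z-scores for numerical features.
# nan_policy='omit' ignores NaNs; with the default ('propagate'), any column
# containing a missing value would come back as all-NaN and show zero outliers.
numeric_df = df.select_dtypes(include='number')
z_scores = np.asarray(zscore(numeric_df, nan_policy='omit'))
outliers = np.abs(z_scores) > 3

# Count outliers per feature
outliers_df = pd.DataFrame(outliers, columns=numeric_df.columns)
outliers_df.sum().sort_values(ascending=False)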


In [None]:
# Visualizing outliers using boxplots (immunization, income, HIV/AIDS, and thinness features)

plt.figure(figsize=(16, 6))
sns.boxplot(data=df[['Polio', 'Diphtheria', 'Income composition of resources', 'HIV/AIDS', 'thinness  1-19 years', 'thinness 5-9 years']])
plt.xticks(rotation=45)
plt.title("Boxplots to Detect Outliers")
plt.show()


In [None]:
# Visualizing outliers using boxplots (larger-scale features: counts, GDP, expenditure, mortality)

plt.figure(figsize=(16, 6))
sns.boxplot(data=df[['Measles', 'Population', 'GDP','percentage expenditure', 'Adult Mortality', 'under-five deaths']])
plt.xticks(rotation=45)
plt.title("Boxplots to Detect Outliers")
plt.show()

In [None]:
# IQR method for outlier detection (Interquartile Range)
outlier_data = []
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    return outliers

for col in df.select_dtypes(include='number').columns:
    outliers = detect_outliers_iqr(df[col])
    outlier_data.append({
        'Column': col,
        'Outlier Count': len(outliers),
        'Percentage': round((len(outliers) / len(df)) * 100, 2)
    })

# Create DataFrame and sort by outlier count (descending)
outlier_summary = pd.DataFrame(outlier_data)
outlier_summary = outlier_summary.sort_values('Outlier Count', ascending=False)

# Display as a formatted table
print("Outlier Detection Summary")
print("=" * 40)
print(outlier_summary.to_string(index=False))
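
To make the IQR rule used above concrete, here is a minimal worked sketch on a toy series. The values are invented for illustration and are not taken from the dataset; the variable names (`toy`, `lower_fence`, `upper_fence`) are likewise hypothetical.

```python
import pandas as pd

# Toy values, invented for illustration only
toy = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 95])

# Quartiles and interquartile range
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
toy_outliers = toy[(toy < lower_fence) | (toy > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(toy_outliers)
```

Here Q1 is 12 and Q3 is 16, so the fences sit at 6 and 22 and only the value 95 is flagged.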

### Implementation of Data Handling

In [None]:
# Define feature categories
health_indicators = [
    'Adult Mortality', 'infant deaths', 'under-five deaths', 'thinness  1-19 years', 'thinness 5-9 years'
]
socioeconomic = ['Income composition of resources']
demographic = ['Population']
healthcare = ['Polio', 'Diphtheria', 'Hepatitis B', 'Total expenditure']
disease_prevalence = ['Measles', 'HIV/AIDS']

# --- Helper Functions ---

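def winsorize_series(series, limits=(0.01, 0.01)):
    # Winsorize only the observed values: under its default nan_policy,
    # scipy's winsorize may overwrite or propagate NaNs.
    result = series.astype(float)
    observed = series.notna()
    result[observed] = np.asarray(winsorize(series[observed].to_numpy(), limits=limits))
    return result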

def percentile_cap(series, lower=0.01, upper=0.99):
    return series.clip(series.quantile(lower), series.quantile(upper))

def iqr_cap(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return series.clip(lower, upper)

def log_transform(series):
    return np.log1p(series.clip(lower=0))  # clip negatives to 0 so log1p is always defined

# --- Apply Handling Strategy per Category ---

# 1. Health Indicators → Winsorization
for col in health_indicators:
    if col in df.columns:
        df[col] = winsorize_series(df[col])

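# 2. Socioeconomic → Log transformation + Imputation (MICE)
for col in socioeconomic:
    if col in df.columns:
        df[col] = log_transform(df[col])

# MICE imputation for socioeconomic columns.
# Note: with a single column there are no other features to regress on, so
# IterativeImputer can only apply its initial fill (the column mean by default).
imputer = IterativeImputer(random_state=0)
df[socioeconomic] = imputer.fit_transform(df[socioeconomic])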

# 3. Demographic → Percentile-based capping
for col in demographic:
    if col in df.columns:
        df[col] = percentile_cap(df[col])

# 4. Healthcare Factors → IQR-based capping
for col in healthcare:
    if col in df.columns:
        df[col] = iqr_cap(df[col])

# 5. Disease Prevalence → Percentile-based capping
for col in disease_prevalence:
    if col in df.columns:
        df[col] = percentile_cap(df[col])

# ✅ Cleaned dataset is now ready
print(df.head())
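
The capping helpers above can behave quite differently on skewed data. The following sketch compares percentile-based and IQR-based capping side by side on a toy series; the data is invented for illustration, and the two helpers are re-declared here only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Toy right-skewed data with one missing value (invented for illustration)
s = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, np.nan, 250.0])

def percentile_cap(series, lower=0.01, upper=0.99):
    # Clip to the 1st and 99th percentiles of the observed values
    return series.clip(series.quantile(lower), series.quantile(upper))

def iqr_cap(series):
    # Clip to the fences at 1.5 * IQR beyond the quartiles
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

comparison = pd.DataFrame({
    "original": s,
    "percentile_cap": percentile_cap(s),
    "iqr_cap": iqr_cap(s),
})

print(comparison)
print(comparison.isna().sum())  # clip leaves missing values untouched
```

On a sample this small, the 99th percentile is interpolated close to the extreme value itself, so percentile capping barely moves it, while the IQR fence pulls it in sharply. On the full dataset the gap between the two methods depends on how heavy each column's tail is.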


In [None]:
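# Handling invalid BMI values separately

# Introduce invalid BMI values.
# NOTE: demonstration only. These two lines overwrite real observations;
# remove them when running the actual analysis.
df.loc[100:110, 'BMI'] = 5   # Too low
df.loc[300:305, 'BMI'] = 70  # Too high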

# --- Step 1: Identify invalid BMI values ---
invalid_bmi_mask = (df['BMI'] < 10) | (df['BMI'] > 60)
num_bmi_replaced = invalid_bmi_mask.sum()

# --- Step 2: Replace invalid values with NaN ---
df['BMI'] = df['BMI'].mask(invalid_bmi_mask, np.nan)

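# --- Step 3: Impute using MICE (BMI modeled from Life expectancy) ---
# Note: fit_transform fills missing values in both columns, so any missing
# Life expectancy (the later prediction target) is imputed here as well.
imputer = IterativeImputer(random_state=42)
df[['Life expectancy', 'BMI']] = imputer.fit_transform(df[['Life expectancy', 'BMI']])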

# --- Output ---
print(f"✅ {num_bmi_replaced} invalid BMI values replaced and imputed using MICE.")
print(df[['Life expectancy', 'BMI']].head())


In [None]:
# --- Step 0: Initial State ---
print("\n--- Initial State of GDP Column ---")
initial_missing_gdp = df['GDP'].isnull().sum()
print(f"Number of missing GDP values initially: {initial_missing_gdp}")

# --- Step 1: Detect Outliers on a Per-Country Basis ---
print("\n--- Step 1: Detecting Outliers for Each Country Individually ---")

def get_country_upper_bound(series):
    q3 = series.quantile(0.75)
    iqr = q3 - series.quantile(0.25)
    return q3 + 1.5 * iqr

def get_country_lower_bound(series):
    q1 = series.quantile(0.25)
    iqr = series.quantile(0.75) - q1
    iqr_lower_bound = q1 - 1.5 * iqr
    domain_lower_bound = 100.0
    return max(iqr_lower_bound, domain_lower_bound)

# Apply outlier bounds per country
country_upper_bounds = df.groupby('Country')['GDP'].transform(get_country_upper_bound)
country_lower_bounds = df.groupby('Country')['GDP'].transform(get_country_lower_bound)

outlier_mask = (df['GDP'] < country_lower_bounds) | (df['GDP'] > country_upper_bounds)
outliers = df[outlier_mask]

print(f"Number of GDP outliers detected across all countries: {len(outliers)}")
if not outliers.empty:
    print("Sample of detected outliers:")
    print(outliers[['Country', 'Year', 'GDP']].head())

# --- Step 2: Mark Outliers as NaN ---
print("\n--- Step 2: Marking Outliers as NaN ---")
df.loc[outlier_mask, 'GDP'] = np.nan
total_missing_after_marking = df['GDP'].isnull().sum()
print(f"Total GDP values now missing (NaN): {total_missing_after_marking}")

# --- Step 3: Impute with Country-Specific Mean ---
country_gdp_mean = df.groupby('Country')['GDP'].transform('mean')
# Assign the result back: df['GDP'].fillna(..., inplace=True) is chained
# assignment and may not update df in recent pandas versions.
df['GDP'] = df['GDP'].fillna(country_gdp_mean)

# Fallback to global mean if any still missing
global_gdp_mean = df['GDP'].mean()
df['GDP'] = df['GDP'].fillna(global_gdp_mean)

# --- Step 4: Extract Cleaned GDP Data ---
print("\n--- Step 4: Extracting Cleaned GDP Data ---")
cleaned_gdp_df = df[['Country', 'Year', 'GDP']].copy()
print("✅ GDP cleaned successfully. Sample:")
print(cleaned_gdp_df.head(10))

# Optional: Save to CSV
cleaned_gdp_df.to_csv("cleaned_gdp_data.csv", index=False)
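
The group-wise imputation with a global fallback can be checked on a tiny example. This is a sketch on an invented toy panel, not the real data. One assumption differs from the cell above: the global mean here is computed from the observed values before any filling, whereas the cell above computes it after the country-level fill.

```python
import numpy as np
import pandas as pd

# Toy panel, invented for illustration; country "C" has no observed GDP at all
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "GDP": [100.0, np.nan, 300.0, 50.0, np.nan, np.nan],
})

# Country-level mean, broadcast back to every row of that country
country_mean = toy.groupby("Country")["GDP"].transform("mean")

# Global mean of the observed values, used only where a country has no data
global_mean = toy["GDP"].mean()

# Fill with the country mean first, then fall back to the global mean
toy["GDP_filled"] = toy["GDP"].fillna(country_mean).fillna(global_mean)

print(toy)
```

Countries A and B are filled with their own means (200 and 50), while country C, which has no observed values, receives the global mean of 150.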


### Task 3: Apply Chosen Method and Evaluate

In [None]:
# Fill any remaining numeric gaps with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)
df.isnull().sum()

In [None]:
nonNumericColumns = df.select_dtypes(include = 'object')
for column in nonNumericColumns.columns:
    # fillna returns a new Series, so the result must be assigned back
    df[column] = df[column].fillna(df[column].mode()[0])

df.isnull().sum()

### Task 4: Identify Potential Features

In [None]:
# Display all columns in the DataFrame
pd.set_option('display.max_columns', None)
# Description of the dataset, transposed so each feature is a row
df.describe(include='all').T

###  Task 5: Feature Engineering

In [None]:
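# Health expenditure relative to GDP
df['Health Spending Ratio'] = df['Total expenditure'] / df['GDP']
# Note: despite its name, this is infant deaths divided by total population
df['Deaths per Infant'] = df['infant deaths'] / df['Population']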

### Task 6: Impact of New Features

In [None]:
df[['Health Spending Ratio', 'Deaths per Infant']].describe()
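
Summary statistics alone do not show whether an engineered feature helps prediction. One possible way to measure the impact is to compare cross-validated R² with and without the new feature. The sketch below uses synthetic stand-in data, with hypothetical columns `spend`, `gdp`, and `ratio` and a target deliberately constructed to depend on the ratio, so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-ins, not the real columns
demo = pd.DataFrame({
    "spend": rng.uniform(1, 10, n),
    "gdp": rng.uniform(100, 1000, n),
})
demo["ratio"] = demo["spend"] / demo["gdp"]

# Target built to depend on the ratio, so there is an effect to detect
demo["target"] = 60 + 400 * demo["ratio"] + rng.normal(0, 1, n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

def mean_cv_r2(columns):
    # Mean R² across folds for a linear model on the given columns
    scores = cross_val_score(
        LinearRegression(), demo[columns], demo["target"], cv=cv, scoring="r2"
    )
    return scores.mean()

r2_base = mean_cv_r2(["spend", "gdp"])
r2_extended = mean_cv_r2(["spend", "gdp", "ratio"])

print(f"Baseline R²: {r2_base:.3f}")
print(f"With engineered ratio: {r2_extended:.3f}")
```

The same comparison could be run on the notebook's `df` by swapping in the real feature lists, once the engineered columns contain no missing or infinite values.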

### Task 7: Select Key Variables for Visualization

In [None]:
df[['Life expectancy', 'GDP', 'Schooling', 'Alcohol', 'BMI', 'HIV/AIDS']].corr()

### Task 8: Visualizations

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='coolwarm', annot=True, fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x='Status', y='Life expectancy', data=df)
plt.title('Life Expectancy by Development Status')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='Alcohol', y='Life expectancy', hue='Status')
plt.title('Life Expectancy vs Alcohol')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='Hepatitis B', y='Life expectancy', hue='Status')
plt.title('Life Expectancy vs Hepatitis B')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='HIV/AIDS', y='Life expectancy', hue='Status')
plt.title('Life Expectancy vs HIV/AIDS')
plt.show()

In [None]:
# 3D Plot
fig = px.scatter_3d(df, x='GDP', y='Schooling', z='Life expectancy',
                     color='Status', size='Population')
fig.show()

### Task 9: Interpretation
- Higher GDP and schooling are associated with higher life expectancy.
- Developing countries tend to have more outliers and lower average life expectancy.
- HIV/AIDS has a strong negative correlation with life expectancy.
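
The statements above can be backed with numbers by ranking correlations against the target and summarizing the target by development status. The sketch below does this on synthetic data whose relationships are built in by construction, so it only illustrates the mechanics. Spearman correlation is used here as an assumption, since it is less sensitive to skewed variables such as HIV/AIDS rates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic data that reuses the notebook's column names for readability
demo = pd.DataFrame({
    "Schooling": rng.uniform(4, 18, n),
    "HIV/AIDS": rng.exponential(2.0, n),
})
demo["Life expectancy"] = (
    50 + 1.5 * demo["Schooling"] - 2.0 * demo["HIV/AIDS"] + rng.normal(0, 2, n)
)
# Hypothetical status label, derived from schooling for illustration only
demo["Status"] = np.where(demo["Schooling"] > 12, "Developed", "Developing")

# Rank correlation of each numeric feature with the target, most negative first
numeric = demo.select_dtypes(include="number")
spearman = (
    numeric.corr(method="spearman")["Life expectancy"]
    .drop("Life expectancy")
    .sort_values()
)

# Target summary per development status
group_summary = demo.groupby("Status")["Life expectancy"].agg(["mean", "median", "count"])

print(spearman)
print(group_summary)
```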

### Task 10: Data Splitting and Model Training

In [None]:
features = ['GDP', 'Schooling', 'Alcohol', 'BMI', 'HIV/AIDS']
X = df[features]
y = df['Life expectancy']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

### Task 11: Cross Validation and Model Evaluation

In [None]:
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
# For a regressor, cross_val_score defaults to the estimator's own score, i.e. R²
cross_val = cross_val_score(model, X, y, cv=5).mean()
mae, r2, cross_val
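
Because the dataset contains several yearly rows per country, plain k-fold splits can place the same country in both the training and validation folds. One possible alternative, sketched here on synthetic data, is to group folds by country and report MAE alongside R². The names `countries`, `X_demo`, and `y_demo` are hypothetical and the data is invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_validate

rng = np.random.default_rng(42)

# 20 synthetic "countries" with 10 yearly rows each
countries = np.repeat([f"country_{i}" for i in range(20)], 10)
X_demo = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
})
y_demo = 70 + 3 * X_demo["x1"] - 2 * X_demo["x2"] + rng.normal(0, 1, 200)

# GroupKFold keeps all rows of a country inside a single fold
scores = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    groups=countries,
    cv=GroupKFold(n_splits=5),
    scoring={"r2": "r2", "mae": "neg_mean_absolute_error"},
)

mean_r2 = scores["test_r2"].mean()
mean_mae = -scores["test_mae"].mean()  # flip the sign back to a positive error

print(f"Grouped CV R²: {mean_r2:.3f}")
print(f"Grouped CV MAE: {mean_mae:.3f}")
```

Applied to the notebook's data, `groups` would be `df['Country']`. Grouped scores are often lower than ungrouped ones, because the model is evaluated on countries it has not seen.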

### Task 12: Conclusion and Recommendations
- **Key Findings**: Life expectancy is positively associated with GDP, schooling, and healthcare access. HIV/AIDS is a major negative predictor.
- **Model Performance**: The linear model gives reasonable accuracy, judged by MAE and R² on the 20% hold-out set and by the mean 5-fold cross-validated R².
- **Recommendation**: Focus on improving education, economic stability, and healthcare to raise life expectancy.