## Dataset Information 
https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who/data


### Columns:
• Country: Country name

• Year: Year of data collection

• Status: Developed/Developing status

• Life Expectancy: Life expectancy in years

• Adult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)

• Infant Deaths: Number of Infant Deaths per 1000 population

• Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)

• Percentage Expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita (%)

• Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)

• Measles: Measles - number of reported cases per 1000 population

• BMI: Average Body Mass Index of entire population

• Under-five Deaths: Number of under-five deaths per 1000 population

• Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)

• Total Expenditure: General government expenditure on health as a percentage of total government expenditure (%)

• Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)

• HIV/AIDS: Deaths per 1000 live births due to HIV/AIDS (0-4 years)

• GDP: Gross Domestic Product per capita (in USD)

• Population: Population of the country

• Thinness 1-19 years: Prevalence of thinness among children and adolescents aged 10 to 19 (%)

• Thinness 5-9 years: Prevalence of thinness among children aged 5 to 9 (%)

• Income Composition of Resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)

• Schooling: Number of years of schooling

## Processing the Dataset
You are free to edit this to process the data differently, or to perform exploratory data analysis (EDA) before constructing your models.
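
As one example of processing the data differently: the cell below fills missing values using means computed over the whole dataset, before the train/test split. A common alternative is to split first and learn the fill values from the training rows only, so the test rows do not influence them. The following is a minimal sketch of that idea on a toy frame; the column names and values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data (hypothetical columns and values).
toy = pd.DataFrame({
    "GDP": [1000.0, np.nan, 3000.0, 4000.0, np.nan, 6000.0, 7000.0, 8000.0],
    "Schooling": [8.0, 9.0, np.nan, 11.0, 12.0, 13.0, 14.0, 15.0],
    "target": [55.0, 58.0, 61.0, 64.0, 67.0, 70.0, 73.0, 76.0],
})

X_toy = toy.drop(columns=["target"])
y_toy = toy["target"]

# Split first, so the test rows never influence the imputation statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=123
)

# Learn the column means on the training rows only, then apply them to both splits.
imputer = SimpleImputer(strategy="mean")
X_tr_filled = pd.DataFrame(
    imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_filled = pd.DataFrame(
    imputer.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(X_tr_filled.isnull().sum().sum(), X_te_filled.isnull().sum().sum())
```

Whether this changes the scores noticeably on the real data is something to check experimentally rather than assume.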

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

df = pd.read_csv('Life Expectancy Data.csv')
print(df.info())
print(df.describe())

# # Convert 'Year' to datetime (optional) for better time-series handling
# df['Year'] = pd.to_datetime(df['Year'], format='%Y', errors='coerce')

# Drop columns with more than 30% missing values
threshold = 0.3
missing_percentage = df.isnull().mean()
columns_to_drop = missing_percentage[missing_percentage > threshold].index
df = df.drop(columns=columns_to_drop)

# Convert 'Status' to a binary representation
df['Status'] = df['Status'].map({'Developed': 1, 'Developing': 0})

# Fill missing values for numerical columns with the column mean
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Fill missing values for categorical columns with mode
cat_cols = df.select_dtypes(include=['object']).columns
for col in cat_cols:
    # Assign back instead of calling fillna(inplace=True) on a column selection
    df[col] = df[col].fillna(df[col].mode()[0])

# Drop irrelevant columns if any (e.g., 'Country', if you want to focus on global patterns)
df = df.drop(columns=['Country'])  # Comment this line out to keep the 'Country' column

# Drop duplicates
df = df.drop_duplicates()

# Reset index after cleaning
df.reset_index(drop=True, inplace=True)

# # Ensure 'Year' is removed from the features as it is datetime
# df = df.drop(columns=['Year'])

# Split data into features and target
# Note: the target column name used here ends with a trailing space
X = df.drop(columns=['Life expectancy '])
y = df['Life expectancy ']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

## Implementing the Regression Models
Take a look at the sample linear regression function, and implement the other functions in the same manner.
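
The pattern shared by all of these functions (fit on the training split, predict on the test split, report MSE and R^2) can be sketched generically. The helper name `fit_and_score` and the synthetic data below are assumptions made for illustration; they are not part of the assignment template.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the training split and return (MSE, R^2) on the test split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred)


# Synthetic regression data, used only to keep the sketch self-contained.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

demo_scores = {
    "linear": fit_and_score(LinearRegression(), Xd_train, yd_train, Xd_test, yd_test),
    "ridge": fit_and_score(Ridge(alpha=1.0), Xd_train, yd_train, Xd_test, yd_test),
}
for name, (mse, r2) in demo_scores.items():
    print(f"{name}: MSE={mse:.2f}, R^2={r2:.2f}")

# Refitting on the same fixed split is expected to reproduce the same scores.
repeat_scores = fit_and_score(LinearRegression(), Xd_train, yd_train, Xd_test, yd_test)
```

With a fixed split and default settings these estimators should behave deterministically, so averaging over `n_iterations` identical fits is expected to return the same value as a single fit. Resampling the split inside the loop would be one way to make the average meaningful.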

#### Sample Linear Regression

In [None]:
def sample_linear_regression(X_train, y_train, X_test, y_test, n_iterations=100):
    """Fit LinearRegression n_iterations times; return the average test MSE and R^2."""
    mse_scores = []
    r2_scores = []
    
    for _ in range(n_iterations):
        model = LinearRegression()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        mse_scores.append(mse)
        r2_scores.append(r2)
    
    # Average the scores over all iterations
    avg_mse = np.mean(mse_scores)
    avg_r2 = np.mean(r2_scores)
    
    return avg_mse, avg_r2

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def linear_regression(X_train, y_train, X_test, y_test, n_iterations=100):
    # YOUR CODE HERE
    
    return avg_mse, avg_r2

In [None]:
# Lasso Regression
def lasso_regression(X_train, y_train, X_test, y_test, alpha=1.0, n_iterations=100):
    # YOUR CODE HERE
    
    return avg_mse, avg_r2

In [None]:
# Ridge Regression
def ridge_regression(X_train, y_train, X_test, y_test, alpha=1.0, n_iterations=100):
    # YOUR CODE HERE
    
    return avg_mse, avg_r2

In [None]:
# Elastic Net Regression
def elastic_net_regression(X_train, y_train, X_test, y_test, alpha=1.0, l1_ratio=0.5, n_iterations=100):
    # YOUR CODE HERE
    
    return avg_mse, avg_r2

## Evaluation

In [None]:
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module='sklearn')

# Train models and collect results
results = {}
results['Linear Regression'] = linear_regression(X_train, y_train, X_test, y_test, n_iterations=100)
results['Lasso Regression'] = lasso_regression(X_train, y_train, X_test, y_test, alpha=1.0, n_iterations=100)
results['Ridge Regression'] = ridge_regression(X_train, y_train, X_test, y_test, alpha=1.0, n_iterations=100)
results['Elastic Net Regression'] = elastic_net_regression(X_train, y_train, X_test, y_test, alpha=1.0, l1_ratio=0.5, n_iterations=100)

# Print results
for k, v in results.items():
    print(f"{k} - Mean Squared Error: {v[0]:.2f}, R^2 Score: {v[1]:.2f}")

## Inferences
To answer these questions, you are expected to continue experimenting. You may plot graphs or run further analyses to reach your conclusions. Every answer must be backed up with proper reasoning.

(Answer the questions in the same markdown cells)

1. How does the performance of the Linear Regression model compare with other regression models (Lasso, Ridge, Elastic Net) based on Mean Squared Error (MSE) and R^2 Score?

2. Which features are selected by Lasso Regression, and how does this compare with the features used in Linear Regression?

3. How does Ridge Regression with different alpha values impact the model's performance and feature coefficients?

4. How does Elastic Net's combination of Lasso and Ridge Regression affect feature selection and model performance?
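
As a possible starting point for the experiments behind questions 2 and 3, the sketch below inspects which coefficients Lasso leaves non-zero and how the size of the Ridge coefficient vector changes with `alpha`. It runs on synthetic data with hypothetical feature names (`f0`, `f1`, ...), so the numbers it prints say nothing about the life expectancy dataset. Features are standardized first so that the penalty treats them on a comparable scale.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, only 3 of which actually drive the target.
X_syn, y_syn = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)
feature_names = [f"f{i}" for i in range(X_syn.shape[1])]  # hypothetical names
X_std = StandardScaler().fit_transform(X_syn)

# Lasso: features whose coefficient is not exactly zero count as "selected".
lasso = Lasso(alpha=1.0).fit(X_std, y_syn)
lasso_coefs = pd.Series(lasso.coef_, index=feature_names)
selected = lasso_coefs[lasso_coefs != 0].index.tolist()
print("Features kept by Lasso:", selected)

# Ridge: track the L2 norm of the coefficient vector as alpha grows.
ridge_norms = {}
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    ridge = Ridge(alpha=alpha).fit(X_std, y_syn)
    ridge_norms[alpha] = float(np.linalg.norm(ridge.coef_))
    print(f"alpha={alpha}: coefficient norm = {ridge_norms[alpha]:.3f}")
```

On the real data, the same loop could be run with `X_train` and `y_train`, and the per-alpha test MSE recorded alongside the coefficient norms before plotting.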