# **Introduction**

In this assignment, you will work with a dataset that includes information about the cost of treatment of different patients. The cost of treatment depends on many factors: diagnosis, type of clinic, city of residence, age and so on. We have no data on the diagnosis of patients. 

Columns

age: age of primary beneficiary

sex: gender of the insurance contractor (female or male)

bmi: body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg / m^2); ideally 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes

region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance

First, you will start by fitting a basic regression model using scikit-learn (sklearn) to establish a baseline for comparison. This basic regression model will serve as a reference point for evaluating the performance of more sophisticated models incorporating regularization techniques.

Furthermore, you will apply L1 (Lasso) and L2 (Ridge) regularization techniques to refine your predictions and evaluate the impact of these methods on the accuracy of your results. Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function, encouraging simpler models with smaller coefficients. L1 regularization (Lasso) encourages sparsity by penalizing the absolute values of the coefficients, while L2 regularization (Ridge) penalizes the squares of the coefficients. By incorporating these regularization techniques, you aim to improve the generalization performance of your regression models and obtain more robust predictions of medical charges.
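
For reference, the three objectives can be written side by side. This is a sketch that assumes the default parameterization documented for scikit-learn's `LinearRegression`, `Lasso` and `Ridge`, with $n$ samples, feature matrix $X$, targets $y$, weights $w$ and penalty strength $\alpha$; the intercept is omitted for brevity.

```latex
\begin{aligned}
\text{Linear regression:} &\quad \min_{w}\ \lVert y - Xw \rVert_2^2 \\
\text{Lasso (L1):}        &\quad \min_{w}\ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Ridge (L2):}        &\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\end{aligned}
```

Because the data term is scaled by $1/(2n)$ in the Lasso but not in Ridge, the same numeric value of $\alpha$ does not correspond to the same amount of shrinkage in the two models.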

# Personal Data

In [None]:
# Set your student number
student_number = ''
Name = ''
Last_Name = ''

## Imports

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
from joblib import dump, load
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures

# Load and Explore

Load the dataset as a DataFrame using pandas and display the first 5 rows. Then check for missing values and impute any missing values with the column mean.

In [None]:
file_path = "./InsuranceData.csv"
df = pd.read_csv(file_path)
df.head(5)
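
The missing-value check and mean imputation described above could look roughly like the following. This is a minimal sketch on a made-up frame: `toy` and its values are hypothetical and only stand in for the insurance data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the insurance data
toy = pd.DataFrame({
    "age": [19, 33, np.nan, 45],
    "bmi": [27.9, np.nan, 33.0, 25.7],
    "smoker": ["yes", "no", "no", "yes"],
})

# Count the missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Impute only the numeric columns, each with its own column mean
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())
print(toy)
```

Restricting the imputation to numeric columns avoids asking pandas for the mean of a text column such as `smoker`.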

Get a brief description and do some EDA to get familiar with the dataset.

In [None]:
# TODO: you can use .info() and .describe()
df.info()

In [None]:
df.describe()

In [None]:
df['region'].value_counts()

# Preprocessing

In [None]:
# TODO: apply any preprocessing method you think is necessary
# Options: Normalization, Standardization, Outlier Detection, Imputation, etc.

# Encode the two binary categorical columns as 0/1, then store them as booleans
df['sex'] = LabelEncoder().fit_transform(df['sex']).astype(bool)
df['smoker'] = LabelEncoder().fit_transform(df['smoker']).astype(bool)

In [None]:
region = pd.get_dummies(df['region'])
df = pd.concat([df, region], axis = 1)
df.drop('region', axis = 1, inplace = True)

In [None]:
df.fillna(df.mean(), inplace=True)

In [None]:
plt.figure(figsize=(15, 10))
plt.title("Boxplot of the Columns (Features)")

X = df[['bmi', 'age', 'children']]

X.boxplot()

plt.show()


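# Count the rows that have at least one value outside the 1.5 * IQR whiskers
Q1 = X.quantile(0.25)
Q3 = X.quantile(0.75)
IQR = Q3 - Q1
outliers = ((X < (Q1 - 1.5 * IQR)) | (X > (Q3 + 1.5 * IQR))).any(axis=1).sum()

if outliers > 0:
    # Preview only: X_scaled is not reused; the model inputs are standardized after the train/test split
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    print(f"Total of {outliers} outliers detected. Scaled preview of the first rows:")
    print(X_scaled[:5])
else:
    print("No outliers detected.")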

In [None]:
# TODO: Split the dataset into two parts such that the training set contains 80% of the samples.

In [None]:
X = df.drop('charges', axis = 1)
y = df['charges']
# Fixed random_state so that the split is reproducible between runs
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [None]:
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)

# Training

Fit a linear regressor to the data. (Use sklearn)

In [None]:
# TODO: Use sklearn
lr = LinearRegression()
lr.fit(x_train, y_train)

Get the coefficients of the variables and visualize them.

In [None]:
# TODO: print and plot the fitted coefficients
def visualize_coef(model, label, color):
    print(f"Coefficients of the {label} model:", model.coef_)

    plt.figure(figsize=(15, 10))
    plt.bar(range(len(model.coef_)), model.coef_, label=label, color=color, alpha=0.5)
    plt.xlabel("Coefficient Index")
    plt.ylabel("Coefficient Value")
    plt.title("Comparison of Coefficients")
    plt.legend()
    
    plt.show()
    
visualize_coef(lr, "lr", "r")

Get the score value of the sklearn regressor on the training dataset.<br>
If you are not familiar with the R-squared concept, see the link below:
[R-squared](https://statisticsbyjim.com/regression/interpret-r-squared-regression/)
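
As a quick illustration of what the score measures, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The numbers below are made up for the example; the result is compared against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

# Residual sum of squares: the error left after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: the error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A value close to 1 means the model explains most of the variance of the target; a value of 0 means it does no better than predicting the mean.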



In [None]:
# TODO: Calculate R² score and MSE on the training dataset
def calc_scores(model, X, y):
    y_pred = model.predict(X)
    r2 = r2_score(y, y_pred)
    mse = mean_squared_error(y, y_pred)
    return r2, mse

r2, mse = calc_scores(lr, x_train, y_train)

print("R² score on the training dataset: ", r2)
print("MSE on the training dataset: ", mse)

# Regularization

L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator) regularization, is a technique used in regression models that encourages simplicity and sparsity in the model coefficients. It does so by adding to the loss function a penalty proportional to the sum of the absolute values of the coefficients.<br>
Train a regression model using L1 regularization.
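
To see the sparsity effect in isolation, here is a small sketch on synthetic data in which only the first three of ten features influence the target. The data, the names `X_demo`, `y_demo`, `true_w` and the choice `alpha=0.5` are assumptions made for this illustration, not part of the assignment data.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: 10 features, only the first 3 have a non-zero true weight
X_demo = rng.normal(size=(200, 10))
true_w = np.array([5.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Count the coefficients that are exactly zero in each model
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))

print("Exactly-zero coefficients, plain regression:", n_zero_ols)
print("Exactly-zero coefficients, Lasso:", n_zero_lasso)
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
```

The plain regression assigns a small non-zero weight to every noise feature, while the Lasso is expected to set most of them exactly to zero and to shrink the three informative weights.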

In [None]:
# TODO: Use Lasso from sklearn library

lasso_lr = Lasso(alpha=0.5)
lasso_lr.fit(x_train, y_train)


lasso_lr_low_alpha = Lasso(alpha=0.1)
lasso_lr_low_alpha.fit(x_train, y_train)


lasso_lr_high_alpha = Lasso(alpha=2)
lasso_lr_high_alpha.fit(x_train, y_train)

Get the coefficients of the variables and visualize them.

In [None]:
visualize_coef(lasso_lr, "lasso, alpha=0.5", "b")
visualize_coef(lasso_lr_low_alpha, "lasso, alpha=0.1", "b")
visualize_coef(lasso_lr_high_alpha, "lasso, alpha=2", "b")

Train a regression model using L2 regularization.

In [None]:
# TODO: Use Ridge from sklearn library
Ridge_lr = Ridge(alpha=0.5)
Ridge_lr.fit(x_train, y_train)


Ridge_lr_low_alpha = Ridge(alpha=0.1)
Ridge_lr_low_alpha.fit(x_train, y_train)


Ridge_lr_high_alpha = Ridge(alpha=2)
Ridge_lr_high_alpha.fit(x_train, y_train)

In [None]:
visualize_coef(Ridge_lr, "Ridge, alpha=0.5", "b")
visualize_coef(Ridge_lr_low_alpha, "Ridge, alpha=0.1", "b")
visualize_coef(Ridge_lr_high_alpha, "Ridge, alpha=2", "b")

Test different regularization parameters (alpha) using cross validation. Use MAPE for evaluation.

In [None]:
# TODO: Use folding methods from sklearn library
kf = KFold(n_splits=5, shuffle=True, random_state=42)

alphas = [0.1, 1, 10, 100]
results = {}

for alpha in alphas:
    model = Ridge(alpha=alpha)
    scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_absolute_percentage_error')
    results[alpha] = np.mean(np.abs(scores))

print("Cross-validated MAPE for different alphas:", results)

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

alphas = [0.1, 1, 10, 100]
results = {}

for alpha in alphas:
    model = Lasso(alpha=alpha)
    scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_absolute_percentage_error')
    results[alpha] = np.mean(np.abs(scores))

print("Cross-validated MAPE for different alphas:", results)
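
The two loops above pass the unscaled feature matrix to `cross_val_score`. One possible variant, sketched here on synthetic data, wraps `StandardScaler` and the model in a pipeline so that the scaler is refit on the training part of every fold. The arrays `X_cv` and `y_cv` and the weights used to generate them are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic regression data with a strictly positive target, so MAPE is well defined
X_cv = rng.normal(loc=50.0, scale=10.0, size=(150, 4))
y_cv = 1000.0 + X_cv @ np.array([20.0, -5.0, 0.0, 8.0]) + rng.normal(scale=25.0, size=150)

kf_cv = KFold(n_splits=5, shuffle=True, random_state=42)
mape_by_alpha = {}

for alpha in [0.1, 1, 10, 100]:
    # Scaling happens inside the pipeline, so each fold is standardized on its own training part
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(pipe, X_cv, y_cv, cv=kf_cv,
                             scoring="neg_mean_absolute_percentage_error")
    # The scorer returns negated MAPE, so flip the sign back
    mape_by_alpha[alpha] = -scores.mean()

best_alpha = min(mape_by_alpha, key=mape_by_alpha.get)
print("Cross-validated MAPE per alpha:", mape_by_alpha)
print("Alpha with the lowest MAPE:", best_alpha)
```

Keeping the scaler inside the pipeline prevents statistics of the validation fold from leaking into the training step.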

Add extra variables to the dataset to make the model more complex, then compare the results.

In [None]:
# TODO: Increase No. of dimensions using PolynomialFeatures from sklearn 
poly_features = PolynomialFeatures(degree=2)
X_poly = poly_features.fit_transform(X)

x_train, x_test, y_train, y_test = train_test_split(X_poly, y, train_size=0.8)
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)

lr = LinearRegression()
lr.fit(x_train, y_train)
mean_absolute_percentage_error(y_test, lr.predict(x_test))

In [None]:
# TODO:     
visualize_coef(lr, "lr", "r")

In [None]:
lasso = Lasso(alpha=10)
lasso.fit(x_train, y_train)
mean_absolute_percentage_error(y_test, lasso.predict(x_test))

In [None]:
visualize_coef(lasso, "lasso", "b")

In [None]:
ridge = Ridge(alpha=10)  # lowercase name so that the Ridge class is not shadowed
ridge.fit(x_train, y_train)
mean_absolute_percentage_error(y_test, ridge.predict(x_test))

In [None]:
visualize_coef(ridge, "Ridge", "b")

Report your best model with its evaluated results.

In [None]:
# TODO:

## Questions

1. Compare the weight distributions obtained with L1 and L2 regularization. How do they differ in sparsity?

2. How does the regularization parameter (alpha) affect each feature? Does it help the model's explainability?

3. How does regularization affect the model when the feature dimension is expanded?

## Sample Answers

1. The defining property of L1 regularization is sparsity: it pushes the coefficients of the linear regression toward exactly zero. This is visible when comparing the coefficient plots of the plain linear regression and the Lasso. Many of the Lasso coefficients are zero or close to zero, because the penalty on the absolute values of the coefficients favors smaller coefficients and sets many of them exactly to zero.

    The difference in sparsity is apparent at first glance. With the Lasso, the coefficients are sparser and most of them are zero or very close to zero; with Ridge, the coefficients are not forced to be exactly zero, only pushed to be small. This also shows that the Lasso performs feature selection: it tends to ignore the less important features by giving them zero coefficients, whereas Ridge keeps a coefficient for every feature and only ensures that the L2 norm stays small.

    In short, the key differences are the sparsity of the Lasso compared to Ridge (the Lasso pushes coefficients to be small and often exactly zero, while Ridge only pushes them to be small) and the feature selection that the Lasso performs and Ridge does not.



2. As alpha increases, the coefficients of features are penalized more heavily. In Ridge regression (L2 regularization), all coefficients are shrunk but remain non-zero, while in Lasso regression (L1 regularization), some coefficients can be completely reduced to zero, effectively performing feature selection. This means that different features can be affected differently based on their relevance to the target variable and their correlation with other features.

    By shrinking coefficients, especially in Lasso regression, regularization can enhance model interpretability. A model with fewer non-zero coefficients is easier to explain and understand since it highlights only the most relevant features. This is particularly beneficial in high-dimensional datasets where many features may contribute little to the predictive power.

    Models with appropriate regularization tend to produce more stable coefficient estimates across different datasets or samples. This stability aids in understanding which features consistently contribute to predictions, enhancing overall explainability.

3. Regularization helps control overfitting by penalizing large coefficients in the model. In high-dimensional spaces, models can easily become overly complex and fit noise rather than the underlying data structure. By introducing a penalty term (controlled by alpha), regularization restricts the complexity of the model, limiting its effective degrees of freedom even when the feature space has been expanded.

    Ridge regression (L2 regularization) does not eliminate coefficients but shrinks them towards zero. This shrinkage stabilizes coefficient estimates, particularly in high-dimensional settings where multicollinearity may be present. Stable coefficients mean that small changes in the data do not lead to large fluctuations in the model, which is crucial when dealing with expanded dimensions.

    Regularization introduces bias into the model but reduces variance by limiting how much coefficients can change based on fluctuations in the training data. This tradeoff is essential when working with high-dimensional data, as it helps maintain predictive performance while controlling for overfitting.
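
The contrast described in these answers can be checked on a small synthetic problem. The sketch below counts the non-zero coefficients of Lasso and Ridge as alpha grows; the data, the names `X_syn`, `y_syn`, `w_true` and the alpha grid are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)

# Synthetic data: 8 features, the last 4 have no influence on the target
X_syn = rng.normal(size=(300, 8))
w_true = np.array([4.0, -2.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
y_syn = X_syn @ w_true + rng.normal(scale=1.0, size=300)

alphas = [0.01, 0.1, 1.0, 3.0]
nonzero_lasso = []
nonzero_ridge = []

for alpha in alphas:
    lasso_coef = Lasso(alpha=alpha).fit(X_syn, y_syn).coef_
    ridge_coef = Ridge(alpha=alpha).fit(X_syn, y_syn).coef_
    # Count how many coefficients each model keeps away from exactly zero
    nonzero_lasso.append(int(np.count_nonzero(lasso_coef)))
    nonzero_ridge.append(int(np.count_nonzero(ridge_coef)))

for alpha, n_l, n_r in zip(alphas, nonzero_lasso, nonzero_ridge):
    print(f"alpha={alpha}: Lasso keeps {n_l} features, Ridge keeps {n_r}")
```

Ridge is expected to keep all eight coefficients non-zero at every alpha, while the Lasso drops more features as alpha increases, which is the feature-selection behavior discussed in answers 1 and 2.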