# Supervised learning: Handling Overfitting for Linear Regression

## Import necessary libraries and modules

In [2]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_california_housing

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV

from sklearn.metrics import mean_squared_error

## 1. Load California housing dataset

In [3]:
data = fetch_california_housing(as_frame=True)
df = data.frame
df.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


## 2. Data preprocessing

With the data loaded, we define the features (X) and the target (y), standardize the features so that each has zero mean and unit variance, and then split the data into a training set and a test set.

### Define features and target

In [4]:
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

### Feature Scaling

In [5]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

### Split the dataset into training set and test set

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
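
In the cells above, the scaler is fit on the full dataset before the split, so the test rows contribute to the means and standard deviations used for scaling. A minimal sketch of the alternative order, assuming synthetic data in place of the housing frame (`X_demo`, `y_demo`, `demo_scaler` and the `_raw`/`_std` names are illustrative and not part of the notebook): split first, then fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target (assumption, not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=200)

# Split first, so the test rows stay unseen while the scaler is fit
X_train_raw, X_test_raw, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse its statistics on the test rows
demo_scaler = StandardScaler()
X_train_std = demo_scaler.fit_transform(X_train_raw)
X_test_std = demo_scaler.transform(X_test_raw)

print(X_train_std.mean(axis=0).round(6))  # approximately zero by construction
print(X_test_std.mean(axis=0).round(3))   # near zero, but generally not exactly zero
```

On a dataset of this size the effect on the reported MSE is probably small, but fitting preprocessing steps on the training data only keeps the test set a fair estimate of performance on unseen data.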

# Linear Regression

In [7]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

lin_train_pred = lin_reg.predict(X_train)
lin_test_pred = lin_reg.predict(X_test)

mse_lin_reg_train = mean_squared_error(y_train, lin_train_pred)
mse_lin_reg_test = mean_squared_error(y_test, lin_test_pred)

print('Linear Regression')
print('Training MSE:', mse_lin_reg_train)
print('Test MSE:', mse_lin_reg_test)

Linear Regression
Training MSE: 0.5179331255246699
Test MSE: 0.5558915986952442


### Evaluating Linear Regression using Cross Validation

In [8]:
# Perform cross-validation
scores_lin_reg_CV = cross_val_score(lin_reg, X_scaled, y, cv=5, scoring='neg_mean_squared_error')  # 5-fold cross-validation on the full scaled dataset

# Calculate the average MSE across all folds
mse_lin_reg_CV = -scores_lin_reg_CV.mean()

# Print the average MSE for linear regression
print("Average cross-validation score for Linear Regression: ", mse_lin_reg_CV) 

Average cross-validation score for Linear Regression:  0.5582901717686556
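
With `scoring='neg_mean_squared_error'`, `cross_val_score` returns one negated MSE per fold, which is why the mean is negated above. The sketch below shows roughly what happens in each fold, using NumPy least squares on synthetic data. The helper `kfold_mse` and the `_demo` arrays are hypothetical names for illustration, and contiguous, unshuffled folds are an assumption of this sketch.

```python
import numpy as np

def kfold_mse(X, y, n_folds=5):
    """Per-fold test MSE of least-squares regression over contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    fold_mse = []
    for test_idx in folds:
        # Every row outside the current fold is used for training
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        # Add an intercept column and fit on the training rows only
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        # Score on the held-out fold
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        residuals = y[test_idx] - A_test @ coef
        fold_mse.append(np.mean(residuals ** 2))
    return np.array(fold_mse)

# Synthetic data (assumption): linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -1.0, 0.0]) + rng.normal(scale=0.5, size=100)

fold_scores = kfold_mse(X_demo, y_demo)
print(fold_scores)
print("Average MSE:", fold_scores.mean())
```

Looking at the individual fold scores, not only their mean, gives a sense of how much the estimate varies with the choice of held-out data.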


# Regularisation techniques: Ridge & Lasso

In this step we'll fit models using regularization techniques.

Ridge and Lasso regression add a penalty to the loss function that shrinks the feature coefficients, which limits model complexity and can reduce overfitting.

**Ridge** uses L2 regularization (the sum of squared coefficients as the penalty term). It shrinks all coefficients towards zero, typically without setting any of them exactly to zero.

**Loss function for Ridge = Mean Squared Error + α × (sum of squares of all feature coefficients)**

**Lasso** uses L1 regularization (the sum of absolute values of the coefficients as the penalty term) and can set some feature coefficients exactly to zero, effectively performing feature selection.

**Loss function for Lasso = Mean Squared Error + α × (sum of absolute values of all feature coefficients)**
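
As a small worked example, the two loss functions can be evaluated by hand for a given set of predictions and coefficients. This is a sketch of the formulas exactly as written above. The function names `ridge_loss` and `lasso_loss` and all the numbers are made up for illustration.

```python
import numpy as np

def ridge_loss(y_true, y_pred, coef, alpha):
    """MSE plus alpha times the sum of squared coefficients (L2 penalty)."""
    return np.mean((y_true - y_pred) ** 2) + alpha * np.sum(coef ** 2)

def lasso_loss(y_true, y_pred, coef, alpha):
    """MSE plus alpha times the sum of absolute coefficients (L1 penalty)."""
    return np.mean((y_true - y_pred) ** 2) + alpha * np.sum(np.abs(coef))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])   # squared errors 0, 0, 4 -> MSE = 4/3
coef = np.array([3.0, -4.0])         # sum of squares = 25, sum of absolute values = 7

ridge_value = ridge_loss(y_true, y_pred, coef, alpha=0.1)  # 4/3 + 0.1 * 25
lasso_value = lasso_loss(y_true, y_pred, coef, alpha=0.1)  # 4/3 + 0.1 * 7
print(ridge_value, lasso_value)
```

Note that scikit-learn scales the data-fit term differently in its two estimators: `Ridge` minimizes the sum of squared residuals plus the penalty, while `Lasso` minimizes the sum of squared residuals divided by twice the number of samples plus the penalty. The same numeric alpha therefore does not represent the same penalty strength in both models.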


**Note**:

- The alpha parameter in Ridge and Lasso controls the strength of the regularization. You can experiment with different alpha values to see how they affect the model's performance.

- Ridge and Lasso regression are sensitive to the scale of the input features, which is why it's important to perform feature scaling before applying these methods. The StandardScaler we used earlier ensures this.

- Mean Squared Error (MSE) measures prediction error, so lower values are better. When comparing models on the same data, the model with the lower test MSE is generally the better one.
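
To illustrate the first point, here is a minimal sketch of how the size of the Ridge coefficients changes with alpha. It uses the closed-form Ridge solution on centred synthetic data with no intercept. The helper `ridge_closed_form` and the `_demo` names are assumptions for illustration and not part of the notebook.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for centred X and y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic, centred data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
y_demo -= y_demo.mean()

# Larger alpha means a stronger penalty and a smaller coefficient vector
demo_alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
coef_norms = [np.linalg.norm(ridge_closed_form(X_demo, y_demo, a)) for a in demo_alphas]
for a, norm in zip(demo_alphas, coef_norms):
    print(f"alpha={a:>7}: ||w|| = {norm:.4f}")
```

With alpha = 0 the solution is ordinary least squares, and as alpha grows the coefficients are pulled towards zero.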

## 1. Regularisation techniques without cross-validation

### Ridge regression

For Ridge regression, we choose alpha = 1.

In [10]:
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train, y_train)

ridge_train_pred = ridge_reg.predict(X_train)
ridge_test_pred = ridge_reg.predict(X_test)

mse_ridge_train = mean_squared_error(y_train, ridge_train_pred)
mse_ridge_test = mean_squared_error(y_test, ridge_test_pred)

print('Ridge Regression Without cross-validation')
print('Training MSE:', mse_ridge_train)
print('Test MSE:', mse_ridge_test)

Ridge Regression Without cross-validation
Training MSE: 0.5179332209751273
Test MSE: 0.5558512007367511


### Lasso Regression

For Lasso regression, we choose alpha = 0.1.

In [11]:
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train, y_train)

lasso_train_pred = lasso_reg.predict(X_train)
lasso_test_pred = lasso_reg.predict(X_test)

mse_lasso_train = mean_squared_error(y_train, lasso_train_pred)
mse_lasso_test = mean_squared_error(y_test, lasso_test_pred)

print('\nLasso Regression Without cross-validation')
print('Training MSE:', mse_lasso_train)
print('Test MSE:', mse_lasso_test)


Lasso Regression Without cross-validation
Training MSE: 0.6717535785617642
Test MSE: 0.6795515190149223
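
The higher MSE for Lasso at alpha = 0.1 is consistent with a penalty strong enough to shrink every coefficient and push the smaller ones to zero. In the idealised case of uncorrelated, standardized features, the Lasso solution reduces to soft-thresholding the least-squares coefficients, with a threshold proportional to alpha. The sketch below illustrates that operation only. The coefficient values are hypothetical and are not taken from the housing model.

```python
import numpy as np

def soft_threshold(coef, threshold):
    """Shrink each coefficient towards zero by `threshold`, clipping at zero."""
    return np.sign(coef) * np.maximum(np.abs(coef) - threshold, 0.0)

# Hypothetical least-squares coefficients (illustrative values only)
ols_coef = np.array([0.85, -0.40, 0.05, -0.02])

shrunk_coef = soft_threshold(ols_coef, 0.1)
print(shrunk_coef)
print("Non-zero coefficients:", np.count_nonzero(shrunk_coef))
```

In the notebook itself, inspecting `lasso_reg.coef_` after fitting shows which features were dropped at alpha = 0.1.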


## 2. Regularisation techniques with cross validation

In this step, we first define a list of alpha values to test.

We then fit RidgeCV and LassoCV models on the training data. Each uses cross-validation to select the best alpha from the list provided; the cv parameter sets the number of cross-validation folds (5 in this case).

Finally, we predict the target for the training and test data and calculate the Mean Squared Error (MSE) for both the Ridge and Lasso models.

### Define alpha values to test

In [12]:
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
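
The list above is spaced evenly on a logarithmic scale, which is a common way to search over regularization strengths. As a sketch, the same grid can be generated with `np.logspace`, which makes it easy to widen or refine the search later (`alpha_grid` is an illustrative name, not one used in the notebook).

```python
import numpy as np

# Six values from 10^-3 to 10^2, evenly spaced in log10
alpha_grid = np.logspace(-3, 2, num=6)
print(alpha_grid)
```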

### Ridge Regression with Cross-Validation

In [13]:
ridge_reg_cv = RidgeCV(alphas=alphas, cv=5)
ridge_reg_cv.fit(X_train, y_train)

ridge_cv_train_pred = ridge_reg_cv.predict(X_train)
ridge_cv_test_pred = ridge_reg_cv.predict(X_test)

mse_ridge_cv_train = mean_squared_error(y_train, ridge_cv_train_pred)
mse_ridge_cv_test = mean_squared_error(y_test, ridge_cv_test_pred)

print('Ridge Regression with CV')
print('Best alpha:', ridge_reg_cv.alpha_)
print('Training MSE:', mse_ridge_cv_train)
print('Test MSE:', mse_ridge_cv_test)

Ridge Regression with CV
Best alpha: 0.1
Training MSE: 0.5179331264806069
Test MSE: 0.5558875470324994


### Lasso Regression with Cross-Validation

In [14]:
lasso_reg_cv = LassoCV(alphas=alphas, cv=5)
lasso_reg_cv.fit(X_train, y_train)

lasso_cv_train_pred = lasso_reg_cv.predict(X_train)
lasso_cv_test_pred = lasso_reg_cv.predict(X_test)

mse_lasso_cv_train = mean_squared_error(y_train, lasso_cv_train_pred)
mse_lasso_cv_test = mean_squared_error(y_test, lasso_cv_test_pred)

print('\nLasso Regression with CV')
print('Best alpha:', lasso_reg_cv.alpha_)
print('Training MSE:', mse_lasso_cv_train)
print('Test MSE:', mse_lasso_cv_test)



Lasso Regression with CV
Best alpha: 0.001
Training MSE: 0.5179922809505753
Test MSE: 0.5544062174455687


# Summary & Conclusion

In [15]:
mse_train = [mse_lin_reg_train, mse_ridge_train, mse_lasso_train, mse_ridge_cv_train, mse_lasso_cv_train]
mse_test = [mse_lin_reg_test, mse_ridge_test, mse_lasso_test, mse_ridge_cv_test, mse_lasso_cv_test]
index = ['Training_set', 'Test_set']
# Column labels must follow the same order as the MSE lists above
columns = ['LinearReg', 'RidgeWithoutCV', 'LassoWithoutCV', 'RidgeCV', 'LassoCV']

In [16]:
results_reg = pd.DataFrame([mse_train, mse_test], index=index, columns=columns)
results_reg

,LinearReg,RidgeWithoutCV,LassoWithoutCV,RidgeCV,LassoCV
Training_set,0.517933,0.517933,0.671754,0.517933,0.517992
Test_set,0.555892,0.555851,0.679552,0.555888,0.554406


In [17]:
results_reg.loc['Test_set'].sort_values()

LassoCV           0.554406
RidgeWithoutCV    0.555851
RidgeCV           0.555888
LinearReg         0.555892
LassoWithoutCV    0.679552
Name: Test_set, dtype: float64

In [18]:
# Print the average MSE for linear regression
print("Average cross-validation score for Linear Regression: ", mse_lin_reg_CV)

Average cross-validation score for Linear Regression:  0.5582901717686556


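LassoCV has the lowest test MSE (0.5544), making it the best model in this comparison, although its margin over plain linear regression (0.5559) and the Ridge models is small. Training and test MSE are also fairly close for each model, which suggests there is little overfitting for regularization to correct on this dataset.

Thus, the best model here is a Lasso regression with alpha = 0.001. Since 0.001 is the smallest value in the list of alphas tested, a wider grid might select an even smaller alpha.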
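
To put the MSE in the units of the target, it helps to take the square root. In this dataset `MedHouseVal` is expressed in units of 100,000 USD, so a rough conversion of the LassoCV test MSE reported above looks like this (the variable names are illustrative):

```python
import math

test_mse_lasso_cv = 0.5544062174455687  # test MSE reported for LassoCV above

# RMSE is in the same units as the target (100,000 USD)
rmse = math.sqrt(test_mse_lasso_cv)
print(f"RMSE: {rmse:.4f} (in units of 100,000 USD)")
print(f"Approximate typical error: {rmse * 100_000:,.0f} USD")
```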