# Experiment 3 - Lasso Regression




In Experiment 3, a Lasso regression model is trained to predict salaries from the given features. Lasso, short for Least Absolute Shrinkage and Selection Operator, is another variant of linear regression that includes a regularization term.
It performs L1 regularization, i.e., it adds a penalty proportional to the sum of the absolute values of the coefficients [1].
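
For reference, the L1 penalty can be written out explicitly. The form below follows the objective documented for scikit-learn's `Lasso`; the symbols are notation chosen here rather than taken from the notebook: $n$ is the number of samples, $X$ the feature matrix, $y$ the target, $w$ the coefficient vector, $b$ the intercept, and $\alpha$ the regularization strength.

$$
\min_{w,\,b}\; \frac{1}{2n}\,\lVert y - Xw - b \rVert_2^2 \;+\; \alpha\,\lVert w \rVert_1
$$

With $\alpha = 0$ this reduces to ordinary least squares. Larger values of $\alpha$ shrink the coefficients toward zero and can set some of them to exactly zero, which is the "selection" part of the name.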

### 0. Import Packages

In [None]:
! pip install -r https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/requirements.txt

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import mean_squared_error
from scipy.stats import f_oneway

### 1. Load Dataset

In [2]:
train_num = pd.read_csv('/Users/ronikj/Downloads/train_num.csv')

### 2. Explore Dataset

In [3]:
train_num.head()

| | quant | logical | 10percentage | english | collegetier | 12percentage | computerprogramming | domain | age | salary | specialization_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.432643 | 1.892646 | 1.016186 | 1.420185 | 0.0 | 0.868785 | 0.413642 | 0.421884 | 0.249636 | 1.094253 | 1.49727 |
| 1 | -2.500658 | -0.75406 | -2.076659 | -0.58101 | 0.0 | -0.885233 | -0.9865498 | -0.568227 | 0.249636 | -1.328135 | -0.810604 |
| 2 | -0.075118 | -0.293763 | -0.035181 | -0.152183 | 0.0 | 0.974026 | -6.63264e-16 | 0.787905 | 0.827781 | -0.279638 | -0.041313 |
| 3 | 0.993763 | 1.374812 | 0.664726 | 1.658423 | 0.0 | 1.048685 | 2.397247 | 1.252765 | -0.328509 | 0.913478 | -0.810604 |
| 4 | -1.226222 | -0.063615 | 0.433767 | 0.705472 | 0.0 | 0.059238 | -1.569963 | -0.747762 | 0.249636 | -0.677344 | -1.689794 |


In [4]:
train_num.shape

(2998, 11)

In [5]:
train_num.dtypes

quant                     float64
logical                   float64
10percentage              float64
english                   float64
collegetier               float64
12percentage              float64
computerprogramming       float64
domain                    float64
age                       float64
salary                    float64
specialization_encoded    float64
dtype: object

### 3. Explore Target variable

In [6]:
target = train_num['salary']
target.head()

0    1.094253
1   -1.328135
2   -0.279638
3    0.913478
4   -0.677344
Name: salary, dtype: float64

In [21]:
target.shape

(2998,)

### 4. Explore Variables of Interest

This step is skipped because data exploration was performed in Experiment 1. The same cleaned and explored dataset is used for Experiment 3.

### 5. Prepare Data

This step is skipped because the data was already cleaned and prepared in Experiment 1. The same dataset is used for Experiment 3.

### 6. Split Dataset

In [9]:
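# Separate the features from the target column.
X = train_num.drop(columns=['salary'])
y = train_num['salary']

# Hold out 20% of the rows, then split that hold-out in half:
# 80% train, 10% test, 10% validation.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)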

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("X_val shape:", X_val.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
print("y_val shape:", y_val.shape)

X_train shape: (2398, 10)
X_test shape: (300, 10)
X_val shape: (300, 10)
y_train shape: (2398,)
y_test shape: (300,)
y_val shape: (300,)


### 7. Baseline Model

In [10]:
lasso_model = Lasso(alpha=1)
lasso_model.fit(X_train, y_train)

Lasso(alpha=1)

In [11]:
y_train_pred = lasso_model.predict(X_train)
train_rmse = mean_squared_error(y_train, y_train_pred, squared=False)

In [12]:
y_pred = lasso_model.predict(X_val)
val_rmse = mean_squared_error(y_val, y_pred, squared=False)

In [13]:
model_train_perf = train_rmse.round(3)
model_valid_perf = val_rmse.round(3)

print("Root Mean Squared Error (RMSE) on Training Set:", model_train_perf)
print("Root Mean Squared Error (RMSE) on Validation Set:", model_valid_perf)

Root Mean Squared Error (RMSE) on Training Set: 1.004
Root Mean Squared Error (RMSE) on Validation Set: 1.001


The baseline is a Lasso model with an arbitrarily chosen alpha of 1. Its RMSE on both the training and validation sets is higher than the scores obtained with Ridge regression and linear regression.

This suggests that alpha needs to be tuned, so cross-validation is used in the next step to select a better value.
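
One plausible reading of a baseline RMSE close to 1.0 is that `alpha=1` shrinks every coefficient to zero, so the model predicts only the mean of the target. The sketch below illustrates this on synthetic data, under the assumption that features and target are standardized (the values shown by `train_num.head()` look like z-scores, but the notebook does not confirm it). The names `X_demo`, `y_demo`, and `baseline` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the real data (assumption: standardized inputs).
rng = np.random.default_rng(0)
n_samples, n_features = 500, 5
X_raw = rng.normal(size=(n_samples, n_features))
y_raw = X_raw @ np.array([0.5, -0.3, 0.0, 0.2, 0.0]) + rng.normal(size=n_samples)

# Standardize features and target to zero mean and unit variance.
X_demo = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_demo = (y_raw - y_raw.mean()) / y_raw.std()

# Same setting as the baseline cell: Lasso with alpha=1.
baseline = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Count surviving coefficients and compare against a mean-only predictor.
n_nonzero = int(np.count_nonzero(baseline.coef_))
baseline_rmse = float(np.sqrt(mean_squared_error(y_demo, baseline.predict(X_demo))))
mean_only_rmse = float(np.sqrt(np.mean((y_demo - y_demo.mean()) ** 2)))

print("Non-zero coefficients:", n_nonzero)
print("Lasso(alpha=1) RMSE:", round(baseline_rmse, 3))
print("Mean-only RMSE:", round(mean_only_rmse, 3))
```

Running `np.count_nonzero(lasso_model.coef_)` on the fitted baseline model would show whether the same thing happens on the actual dataset.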

### 8. Train Machine Learning Model

In [22]:
# 5-fold cross-validation over a grid of candidate alpha values.
folds = KFold(n_splits=5, shuffle=True, random_state=42)
params = {'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1.0, 5.0, 10.0]}

model = Lasso()

# Candidates are ranked by their mean R^2 across the folds (scoring='r2').
model_cv = GridSearchCV(estimator=model, param_grid=params,
                        scoring='r2', cv=folds,
                        return_train_score=True, verbose=1)
model_cv.fit(X_train, y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


GridSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=True),
             estimator=Lasso(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1.0, 5.0,
                                   10.0]},
             return_train_score=True, scoring='r2', verbose=1)

In [23]:
model_cv.best_params_

{'alpha': 0.001}

In [24]:
# Refit on the training set with the alpha selected by the grid search.
lasso_model = Lasso(alpha=0.001)
lasso_model.fit(X_train, y_train)

Lasso(alpha=0.001)

In [25]:
y_train_pred = lasso_model.predict(X_train)
train_rmse = mean_squared_error(y_train, y_train_pred, squared=False)

In [26]:
y_pred = lasso_model.predict(X_val)
val_rmse = mean_squared_error(y_val, y_pred, squared=False)

In [27]:
y_pred_test = lasso_model.predict(X_test)
test_rmse = mean_squared_error(y_test, y_pred_test, squared=False)

### 9. Assess Model Performance

In [28]:
model_train_perf = train_rmse.round(3)
model_valid_perf = val_rmse.round(3)
model_test_perf = test_rmse.round(3)

print("Root Mean Squared Error (RMSE) on Training Set:", model_train_perf)
print("Root Mean Squared Error (RMSE) on Validation Set:", model_valid_perf)
print("Root Mean Squared Error (RMSE) for Testing Set:", model_test_perf)

Root Mean Squared Error (RMSE) on Training Set: 0.872
Root Mean Squared Error (RMSE) on Validation Set: 0.868
Root Mean Squared Error (RMSE) for Testing Set: 0.843


The observations made are the following:

1. Once the alpha hyperparameter is tuned, Lasso achieves RMSE scores similar to those of Ridge regression and linear regression. The selected alpha of 0.001 is the smallest value in the grid, so very little regularization is applied.
2. This indicates that K-fold cross-validation was effective in finding an alpha that yields a good RMSE for the Lasso model.
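
The grid search above ranks candidates by R², while the results are reported as RMSE. As a minimal sketch of an alternative, the search can score candidates directly by RMSE and flag a winner that sits at the edge of the grid, which may be a hint that the grid should be extended. The data here is synthetic, and `X_demo`, `y_demo`, `search`, and `cv_table` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the training data (assumption, not the real dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([0.6, -0.4, 0.0, 0.2]) + rng.normal(scale=0.5, size=300)

# Same alpha grid and fold setup as the notebook, but scored by RMSE.
alphas = [0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1.0, 5.0, 10.0]
search = GridSearchCV(
    estimator=Lasso(),
    param_grid={'alpha': alphas},
    scoring='neg_root_mean_squared_error',  # scikit-learn maximizes, hence the sign
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

# One row per candidate: alpha and its mean cross-validated RMSE.
cv_table = pd.DataFrame({
    'alpha': [p['alpha'] for p in search.cv_results_['params']],
    'cv_rmse': -search.cv_results_['mean_test_score'],
})

# A best value on the boundary of the grid suggests trying values beyond it.
best_alpha = search.best_params_['alpha']
at_grid_edge = best_alpha in (min(alphas), max(alphas))

print(cv_table.to_string(index=False))
print("Best alpha:", best_alpha, "| at grid edge:", at_grid_edge)
```

In the notebook, the same table could be built from `model_cv.cv_results_` to see how quickly the score degrades as alpha grows.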

### 10. Reference

[1]: Aarshay Jain, "Ridge and Lasso Regression in Python | Complete Tutorial (Updated 2024)", Analytics Vidhya, 11 Jan 2024. https://www.analyticsvidhya.com/blog/2016/01/ridge-lasso-regression-python-complete-tutorial/