<a href="https://colab.research.google.com/github/Jasmine-Syed/Overfitting/blob/main/Lazzo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will import the pandas and numpy modules to handle the dataset, and the train_test_split function to create the training and test datasets.

The r2_score and mean_squared_error functions, together with a square-root function, are imported to calculate the evaluation metrics. The Lasso class from scikit-learn will be used to build our lasso regression model.

In [2]:
## Load required packages
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

## Load and analyze the dataset given in the problem statement
Let us load the dataset and analyze the basics like shape and summary statistics of the dataset.

In [3]:
## load dataset
link = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = pd.read_csv(link, header=None)
# summarize shape
print(dataframe.shape)
# get information about the dataset
print(dataframe.describe())


(506, 14)
               0           1           2   ...          11          12          13
count  506.000000  506.000000  506.000000  ...  506.000000  506.000000  506.000000
mean     3.613524   11.363636   11.136779  ...  356.674032   12.653063   22.532806
std      8.601545   23.322453    6.860353  ...   91.294864    7.141062    9.197104
min      0.006320    0.000000    0.460000  ...    0.320000    1.730000    5.000000
25%      0.082045    0.000000    5.190000  ...  375.377500    6.950000   17.025000
50%      0.256510    0.000000    9.690000  ...  391.440000   11.360000   21.200000
75%      3.677082   12.500000   18.100000  ...  396.225000   16.955000   25.000000
max     88.976200  100.000000   27.740000  ...  396.900000   37.970000   50.000000

[8 rows x 14 columns]
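Because the file is read with header=None, the summary above labels the columns 0 to 13. A minimal sketch of attaching readable names is shown here. The names are the ones conventionally used for this Boston housing data and are an assumption, since the CSV itself carries no header. The single row is only an illustrative sample in the same 14-column layout so that the sketch runs without a download.

```python
import io
import pandas as pd

# Conventional Boston housing column names (assumed; the CSV has no header row).
COLUMN_NAMES = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV",
]

# One illustrative row in the same 14-column layout, so the sketch runs offline.
sample_csv = "0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0\n"

# names= assigns the labels; header=None says the first line is data, not a header.
named = pd.read_csv(io.StringIO(sample_csv), header=None, names=COLUMN_NAMES)
print(named.shape)
print(named[["RM", "LSTAT", "MEDV"]])
```

With the real file, the same names argument could be passed to the pd.read_csv(link, header=None) call, after which the last column (the target) can be selected as MEDV.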


## Create training and test dataset
We are going to split the dataset into a training set and test set. We will build our lasso model on the training set and evaluate it using our test set. 

Specify the input columns as X and the target column as y, and use the test_size argument of the train_test_split function to split the dataset. Here we split the dataset into 70% training data and 30% test data.

In [4]:
## Train and test dataset creation
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.30,
                                                random_state=40)
print(Xtrain.shape)
print(Xtest.shape)

(354, 13)
(152, 13)
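The shapes printed above follow from simple arithmetic on the 506 rows. As a rough sketch, the snippet below reproduces the row counts on placeholder arrays of the same shape (zeros, not the housing data). The names X_demo and y_demo are hypothetical.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_features = 506, 13
X_demo = np.zeros((n_samples, n_features))  # placeholder features, not real data
y_demo = np.zeros(n_samples)                # placeholder target

# For a fractional test_size, the test count is rounded up
# and the training set receives the remaining rows.
expected_test = math.ceil(0.30 * n_samples)
expected_train = n_samples - expected_test

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.30,
                                      random_state=40)
print(Xtr.shape, Xte.shape)
print(expected_train, expected_test)
```

The random_state argument only fixes which rows land in each set; the sizes depend on test_size alone.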


## Build the model and find predictions for the test dataset
Let us instantiate the lasso model and fit the model to the training set. We will use this fitted model to predict the housing prices for the training set and test set. 

In [5]:
## Build the lasso model with alpha=1

model_lasso = Lasso(alpha=1)
model_lasso.fit(Xtrain, ytrain)
pred_train_lasso = model_lasso.predict(Xtrain)
pred_test_lasso = model_lasso.predict(Xtest)

## Evaluate the lasso model
Evaluate the model by computing the RMSE and R-squared for both the training and test predictions.

In [6]:
## Evaluate the lasso model
# training set: RMSE, then R-squared
print(np.sqrt(mean_squared_error(ytrain, pred_train_lasso)))
print(r2_score(ytrain, pred_train_lasso))
# test set: RMSE, then R-squared
print(np.sqrt(mean_squared_error(ytest, pred_test_lasso)))
print(r2_score(ytest, pred_test_lasso))

4.887113841773082
0.6657249068677625
6.379797782769904
0.6439373929767929
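The four numbers above are, in order, the training RMSE, training R-squared, test RMSE and test R-squared. To make the two metrics concrete, here is a small sketch that computes them by hand with numpy on made-up values. The helper names rmse and r_squared are hypothetical and are not part of scikit-learn.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of target variance the predictions explain."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values: residuals are 1, 0 and -2.
y_true_demo = [3.0, 5.0, 7.0]
y_pred_demo = [2.0, 5.0, 9.0]
print(rmse(y_true_demo, y_pred_demo))       # sqrt(5/3)
print(r_squared(y_true_demo, y_pred_demo))  # 1 - 5/8
```

A lower RMSE and a higher R-squared both indicate a better fit, and a gap between the training and test values hints at overfitting.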


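As you can see, we have set the lasso hyperparameter alpha to 1, which is also scikit-learn's default. Alpha controls the strength of the L1 penalty: larger values shrink the coefficients more, and alpha=0 would mean no penalty at all. This alpha value gives us a decent RMSE for now, but a different alpha value might give better results.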

Let us tune our model to check this. 
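Before tuning, it may help to see what alpha does to a fitted model. The sketch below is an illustration on synthetic data (not the housing dataset) in which only the first two of five features drive the target. The names X_demo, y_demo and l1_norms are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first two features influence the target; the rest are noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=200)

alphas_to_try = (0.01, 0.1, 0.5, 1.0)
l1_norms = []
for alpha in alphas_to_try:
    coef = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_
    l1_norms.append(float(np.abs(coef).sum()))
    print(f"alpha={alpha:<4}  L1 norm={l1_norms[-1]:.3f}  "
          f"nonzero={np.count_nonzero(coef)}")
```

The total size of the coefficients shrinks as alpha grows, which is why the choice of alpha can change both the training and the test scores.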

The scikit-learn library has a built-in estimator called LassoCV that does the tuning for us. It selects the best alpha value by cross-validation during training and then fits the final model with that value. Predictions can then be made using the fitted model.

By default, LassoCV tunes over 100 alpha values that it chooses automatically. We can control this by passing a grid of alpha values in the alphas argument. In the code below, the grid runs from 0.1 up to (but not including) 1 in steps of 0.02.
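As a quick check of what the tuning cell searches over, the sketch below builds the same arange grid and the same RepeatedKFold splitter that the cell uses, then reports the grid bounds and the number of train/validation splits evaluated per alpha. The variable names alpha_grid, cv and n_fits_per_alpha are hypothetical.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Same grid as the tuning cell: the stop value (1) is excluded by arange.
alpha_grid = np.arange(0.1, 1, 0.02)
print(len(alpha_grid), alpha_grid.min(), alpha_grid.max())

# 10 folds repeated 3 times gives 30 train/validation splits per alpha.
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_fits_per_alpha = cv.get_n_splits()
print(n_fits_per_alpha)
```

Note that the smallest value searched is 0.1, so the tuning cannot select any alpha below that.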

In [12]:
## Tuning the lasso regression model

from numpy import arange
from sklearn.model_selection import RepeatedKFold
import pandas as pd
from sklearn.linear_model import LassoCV

## load the dataset
link = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataset = pd.read_csv(link, header=None)
dataframe = dataset.values
X, y = dataframe[:, :-1], dataframe[:, -1]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.30,
                                                random_state=40)

## define model evaluation method
cross_validation = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

## define model

lasso_model = LassoCV(alphas=arange(0.1, 1, 0.02), cv=cross_validation, n_jobs=-1)

## fit model
lasso_model.fit(Xtrain, ytrain)
## summarize chosen configuration
print('alpha: %f' % lasso_model.alpha_)

pred_train_lasso = lasso_model.predict(Xtrain)
pred_test_lasso = lasso_model.predict(Xtest)
print(np.sqrt(mean_squared_error(ytrain,pred_train_lasso)))
print(r2_score(ytrain, pred_train_lasso))
print(np.sqrt(mean_squared_error(ytest,pred_test_lasso)))
print(r2_score(ytest, pred_test_lasso))


alpha: 0.100000
4.409123726980954
0.7279155769109467
5.7263120023687915
0.7131449135744071


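LassoCV has chosen 0.100 as the best alpha value. This is the smallest value in our grid and means a much lighter penalty than alpha=1, though not a zero penalty (that would be alpha=0). Because the chosen value sits on the edge of the grid, an even smaller alpha might do better still. The RMSE and R-squared scores have both improved with the selected alpha: the test RMSE drops from about 6.38 to 5.73, and the test R-squared rises from about 0.64 to 0.71.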
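When the selected alpha is the smallest or largest value in the grid, a common next step is to widen the search. The following is a rough sketch of that idea, not a result for the housing data. The helper on_grid_edge is hypothetical, the logspace bounds are arbitrary assumptions, and the data comes from make_regression so that the snippet runs offline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

def on_grid_edge(best_alpha, grid):
    """True if the selected alpha is the smallest or largest value searched."""
    grid = np.asarray(grid, dtype=float)
    return bool(np.isclose(best_alpha, grid.min())
                or np.isclose(best_alpha, grid.max()))

# The situation reported above: 0.1 selected from arange(0.1, 1, 0.02).
print(on_grid_edge(0.1, np.arange(0.1, 1, 0.02)))

# A wider, log-spaced grid (assumed bounds) tried on synthetic stand-in data.
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=0)
wide_grid = np.logspace(-4, 0, 30)
wide_model = LassoCV(alphas=wide_grid, cv=5).fit(X_demo, y_demo)
print(wide_model.alpha_, on_grid_edge(wide_model.alpha_, wide_grid))
```

On the housing data, the same pattern would mean passing a grid that extends below 0.1 to LassoCV and then checking again whether the selected alpha still lands on an edge.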