# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice applying Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, make a selection of the data by removing the features with `dtype = object`, so that our first model only contains **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing inputs by the median per feature.

Store the target in `y`.
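
The selection and imputation steps described above can be sketched on a tiny made-up frame. The `toy` data and the names `X_toy` and `y_toy` are illustrative assumptions, not part of the lab's dataset, and `select_dtypes(include="number")` is used here as one possible way to keep only numeric columns.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up for illustration)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 350.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep numeric columns only, then separate the predictors from the target
numeric = toy.select_dtypes(include="number")
X_toy = numeric.drop(columns=["SalePrice"])
y_toy = numeric["SalePrice"]

# DataFrame.median() is computed per column; fillna aligns on column names,
# so each missing value is replaced by the median of its own feature
X_toy = X_toy.fillna(X_toy.median())
print(X_toy)
```

Filling every column in one call avoids repeating the same two lines per feature, at the cost of treating all features the same way.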

In [4]:
# Load necessary packages
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso, Ridge, LinearRegression

In [10]:
# Remove "object"-type features and SalePrice from `X`
# (drop returns a new DataFrame, which avoids modifying a slice of df in place)
X = df.select_dtypes(exclude=['object']).drop(columns=['SalePrice'])
X.head()

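| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | ... | 548 | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 |
| 1 | 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | ... | 460 | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 |
| 2 | 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | ... | 608 | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 |
| 3 | 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | ... | 642 | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 |
| 4 | 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | ... | 836 | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 |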


In [25]:
# Impute null values
# X.isna().sum() - LotFrontage 259, MasVnrArea 8, GarageYrBlt 81

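# LotFrontage - filled with median
# (assign the result back; inplace fillna on an attribute-accessed column may not update X)
fill_value = X['LotFrontage'].median()
X['LotFrontage'] = X['LotFrontage'].fillna(fill_value)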

In [26]:
# MasVnrArea - filled with median
fill_value = X['MasVnrArea'].median()
X['MasVnrArea'] = X['MasVnrArea'].fillna(fill_value)

In [27]:
# GarageYrBlt - filled with YearBuilt
X['GarageYrBlt'] = X['GarageYrBlt'].fillna(X['YearBuilt'])

In [3]:
# Create y
y = df[['SalePrice']]
y.head()

| | SalePrice |
| --- | --- |
| 0 | 208500 |
| 1 | 181500 |
| 2 | 223500 |
| 3 | 140000 |
| 4 | 250000 |


Look at the information of `X` again

In [29]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute R-squared and the MSE for both the training and the test set.
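
Since the same train and test metrics are computed for every model in this lab, a small helper can cut down on repetition. The sketch below is one possible way to do it; the function name `report_metrics` and the synthetic data are assumptions made for illustration and are not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def report_metrics(model, X_train, X_test, y_train, y_test):
    """Return {'train': (mse, r2), 'test': (mse, r2)} for an already fitted model."""
    results = {}
    for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X_part)
        results[name] = (mean_squared_error(y_part, pred), r2_score(y_part, pred))
    return results


# Synthetic data (hypothetical): three predictors with a known linear relationship
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# A fixed random_state makes the split, and therefore the metrics, reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
metrics = report_metrics(model, X_tr, X_te, y_tr, y_te)
for name, (mse, r2) in metrics.items():
    print(f"{name.upper()}: MSE = {mse:.4f}, R-squared = {r2:.4f}")
```

The same helper would work unchanged with `Lasso` or `Ridge`, since it only relies on the model's `predict` method.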

In [71]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score
from sklearn.model_selection import train_test_split

linreg = LinearRegression()

In [72]:
# Split in train and test
X_train,X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [73]:
X_train.shape

(1168, 37)

In [74]:
# Fit the model and print R2 and MSE for train and test

#TRAIN
linreg.fit(X_train,y_train)
MSE = mean_squared_error(y_train,linreg.predict(X_train))
r2 = r2_score(y_train,linreg.predict(X_train))
print("TRAIN:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)
print("\n")

#TEST
MSE = mean_squared_error(y_test,linreg.predict(X_test))
r2 = r2_score(y_test,linreg.predict(X_test))
print("TEST:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)

TRAIN:
MS Error     =  1220316777.5885172
R-Squared    =  0.8080458497901669


TEST:
MS Error     =  1139323108.6159625
R-Squared    =  0.8128621599979603


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!
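
To make the scaling step concrete, the sketch below checks on made-up data that `scale` gives every column zero mean and unit standard deviation. It also shows a common alternative, `StandardScaler` fitted on the training rows only, so that no test-set statistics enter the preprocessing. The data and variable names are assumptions for illustration; the lab itself scales the full `X` before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, scale

# Made-up predictors on very different scales (hypothetical values)
rng = np.random.default_rng(1)
X_raw = rng.normal(loc=[1000.0, 5.0], scale=[300.0, 2.0], size=(100, 2))

# What the lab does: standardize each column using statistics of the full data
X_all_scaled = scale(X_raw)
print(X_all_scaled.mean(axis=0).round(6), X_all_scaled.std(axis=0).round(6))

# Alternative: learn the mean and standard deviation on the training rows only,
# then apply the same transformation to the test rows
X_tr, X_te = train_test_split(X_raw, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

With the second approach the test columns will generally not have exactly zero mean, which is expected: they are transformed with the training statistics.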

In [75]:
from sklearn import preprocessing

In [76]:
# Scale the data
X_scaled = pd.DataFrame(preprocessing.scale(X), columns=X.columns)
X_scaled.head()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -1.730865 | 0.073375 | -0.220875 | -0.207142 | 0.651479 | -0.5172 | 1.050994 | 0.878668 | 0.514104 | 0.575425 | ... | 0.351 | -0.752176 | 0.216503 | -0.359325 | -0.116339 | -0.270208 | -0.068692 | -0.087688 | -1.599111 | 0.138777 |
| 1 | -1.728492 | -0.872563 | 0.46032 | -0.091886 | -0.071836 | 2.179628 | 0.156734 | -0.429577 | -0.57075 | 1.171992 | ... | -0.060731 | 1.626195 | -0.704483 | -0.359325 | -0.116339 | -0.270208 | -0.068692 | -0.087688 | -0.48911 | -0.614439 |
| 2 | -1.72612 | 0.073375 | -0.084636 | 0.07348 | 0.651479 | -0.5172 | 0.984752 | 0.830215 | 0.325915 | 0.092907 | ... | 0.631726 | -0.752176 | -0.070361 | -0.359325 | -0.116339 | -0.270208 | -0.068692 | -0.087688 | 0.990891 | 0.138777 |
| 3 | -1.723747 | 0.309859 | -0.44794 | -0.096897 | 0.651479 | -0.5172 | -1.863632 | -0.720298 | -0.57075 | -0.499274 | ... | 0.790804 | -0.752176 | -0.176048 | 4.092524 | -0.116339 | -0.270208 | -0.068692 | -0.087688 | -1.599111 | -1.367655 |
| 4 | -1.721374 | 0.073375 | 0.641972 | 0.375148 | 1.374795 | -0.5172 | 0.951632 | 0.733308 | 1.366489 | 0.463568 | ... | 1.698485 | 0.780197 | 0.56376 | -0.359325 | -0.116339 | -0.270208 | -0.068692 | -0.087688 | 2.100892 | 0.138777 |


In [77]:
# Perform train test split
X_train,X_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.2)

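Perform the same linear regression on this data and print out R-squared and MSE. Because the predictions of ordinary least squares do not change when the predictors are rescaled, any difference from the first model's metrics comes from the new random train-test split.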

In [78]:
# Your code here
#TRAIN
linreg.fit(X_train,y_train)
MSE = mean_squared_error(y_train,linreg.predict(X_train))
r2 = r2_score(y_train,linreg.predict(X_train))
print("TRAIN:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)
print("\n")

#TEST
MSE = mean_squared_error(y_test,linreg.predict(X_test))
r2 = r2_score(y_test,linreg.predict(X_test))
print("TEST:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)

TRAIN:
MS Error     =  1224894222.1083097
R-Squared    =  0.8005777395393985


TEST:
MS Error     =  1087134422.0244868
R-Squared    =  0.8435072247240962


## Include dummy variables

Your model hasn't included dummy variables so far. Let's use the "object" variables again and create dummies.

In [79]:
# Create X_cat which contains only the categorical variables
X_cat = df.select_dtypes(include=['object'])

In [80]:
# Make dummies
X_cat = pd.get_dummies(X_cat)

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.
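
As a minimal illustration of the dummy-encoding and merge steps, the sketch below runs them on a three-row toy example. The frames `toy_num` and `toy_cat` and their values are made up for illustration only.

```python
import pandas as pd

# Toy stand-ins for the scaled numeric block and the categorical block
toy_num = pd.DataFrame({"LotArea": [-0.2, 0.1, 0.4]})
toy_cat = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})

# get_dummies creates one indicator column per category level
toy_dummies = pd.get_dummies(toy_cat)

# axis=1 places the two blocks side by side, aligned on the shared row index
toy_merged = pd.concat([toy_num, toy_dummies], axis=1)
print(toy_merged.columns.tolist())
# ['LotArea', 'Street_Grvl', 'Street_Pave']
```

Note that keeping every level makes the indicator columns of one feature always sum to 1, which is redundant with the intercept. `pd.get_dummies` has a `drop_first=True` option that drops one level per feature; the lab does not use it, so it is mentioned here only as a possible variation.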

In [81]:
# Your code here
X_merged = pd.concat([X_scaled,X_cat],axis=1)

Perform the same linear regression on this data and print out R-squared and MSE.

In [82]:
# Your code here
# Perform train test split
X_train, X_test, y_train, y_test = train_test_split(X_merged,y)

#TRAIN
linreg.fit(X_train,y_train)
MSE = mean_squared_error(y_train,linreg.predict(X_train))
r2 = r2_score(y_train,linreg.predict(X_train))
print("TRAIN:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)
print("\n")

#TEST
MSE = mean_squared_error(y_test,linreg.predict(X_test))
r2 = r2_score(y_test,linreg.predict(X_test))
print("TEST:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)

TRAIN:
MS Error     =  498297575.50045663
R-Squared    =  0.91346051954118


TEST:
MS Error     =  2.317420353954219e+31
R-Squared    =  -2.9381300059568524e+21


Notice the severe overfitting above: the training R-squared is quite high, but the testing R-squared is hugely negative, so the test predictions are far off. Likewise, the testing MSE is more than twenty orders of magnitude larger than the training MSE.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) to fit both a Lasso and a Ridge regression model. Each time, look at R-squared and MSE.
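
Before fitting on the housing data, it can help to see what the `alpha` parameter does in isolation. The sketch below uses synthetic data (an assumption for illustration, with made-up coefficients) and records how the size of the coefficient vector shrinks as `alpha` grows: the L1 norm for Lasso and the L2 norm for Ridge, matching the penalty each model applies.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (hypothetical): only the first three of ten features matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=100)

lasso_l1 = {}  # alpha -> sum of absolute Lasso coefficients
ridge_l2 = {}  # alpha -> Euclidean norm of Ridge coefficients
for alpha in (0.1, 1.0):
    lasso_demo = Lasso(alpha=alpha).fit(X_demo, y_demo)
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    lasso_l1[alpha] = np.abs(lasso_demo.coef_).sum()
    ridge_l2[alpha] = np.linalg.norm(ridge_demo.coef_)
    print(f"alpha={alpha}: Lasso L1 norm = {lasso_l1[alpha]:.3f}, "
          f"Ridge L2 norm = {ridge_l2[alpha]:.3f}")
```

A larger `alpha` means a stronger penalty, so both norms decrease. Keep in mind that scikit-learn scales the two objectives differently, so the same `alpha` value does not imply the same amount of shrinkage for Lasso and Ridge.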

## Lasso

With default parameter (alpha = 1)

In [83]:
from sklearn.linear_model import Lasso, Ridge

In [84]:
# Your code here
lasso = Lasso()
lasso.fit(X_train,y_train)

#TRAIN
MSE = mean_squared_error(y_train,lasso.predict(X_train))
r2 = r2_score(y_train,lasso.predict(X_train))
print("TRAIN:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)
print("\n")

#TEST
MSE = mean_squared_error(y_test,lasso.predict(X_test))
r2 = r2_score(y_test,lasso.predict(X_test))
print("TEST:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)

TRAIN:
MS Error     =  364850082.19211596
R-Squared    =  0.9366363833366984


TEST:
MS Error     =  2063729192.960726
R-Squared    =  0.7383511085651397


With a higher regularization parameter (alpha = 10)

In [85]:
# Your code here
lasso = Lasso(alpha=10)
lasso.fit(X_train,y_train)

#TRAIN
MSE = mean_squared_error(y_train,lasso.predict(X_train))
r2 = r2_score(y_train,lasso.predict(X_train))
print("TRAIN:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)
print("\n")

#TEST
MSE = mean_squared_error(y_test,lasso.predict(X_test))
r2 = r2_score(y_test,lasso.predict(X_test))
print("TEST:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)

TRAIN:
MS Error     =  372926651.4961665
R-Squared    =  0.9352337232680432


TEST:
MS Error     =  1946364400.4451706
R-Squared    =  0.7532311461010346


## Ridge

With default parameter (alpha = 1)

In [87]:
# Your code here
ridge = Ridge()
ridge.fit(X_train,y_train)

#TRAIN
MSE = mean_squared_error(y_train,ridge.predict(X_train))
r2 = r2_score(y_train,ridge.predict(X_train))
print("TRAIN:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)
print("\n")

#TEST
MSE = mean_squared_error(y_test,ridge.predict(X_test))
r2 = r2_score(y_test,ridge.predict(X_test))
print("TEST:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)

TRAIN:
MS Error     =  382574769.837574
R-Squared    =  0.933558131834888


TEST:
MS Error     =  1920139724.5348632
R-Squared    =  0.7565560287472535


With a higher regularization parameter (alpha = 10)

In [88]:
# Your code here
ridge = Ridge(alpha=10)
ridge.fit(X_train,y_train)

#TRAIN
MSE = mean_squared_error(y_train,ridge.predict(X_train))
r2 = r2_score(y_train,ridge.predict(X_train))
print("TRAIN:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)
print("\n")

#TEST
MSE = mean_squared_error(y_test,ridge.predict(X_test))
r2 = r2_score(y_test,ridge.predict(X_test))
print("TEST:")
print("MS Error     = ",MSE)
print("R-Squared    = ",r2)

TRAIN:
MS Error     =  453576837.2557231
R-Squared    =  0.9212271827635472


TEST:
MS Error     =  1795402351.750887
R-Squared    =  0.7723707952489576


## Look at the metrics, what are your main conclusions?   

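Compared with the unregularized linear regression on the same merged predictors, whose test R-squared was hugely negative, Ridge and Lasso both performed much better on the test set, with test R-squared between roughly 0.74 and 0.77. A gap to the training R-squared (0.92 to 0.94) remains, so some overfitting persists; raising alpha from 1 to 10 narrowed it slightly for both models.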

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total length of the parameter space and draw conclusions!
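
The comparison can be rehearsed on synthetic data first. The sketch below is an illustration under assumed data and a hypothetical helper name, `count_near_zero`; it flattens `coef_` with `np.ravel` so that the same function works whether the model stores its coefficients as a 1D or a 2D array.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (hypothetical): 3 informative features out of 20
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [5.0, -4.0, 3.0]
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)


def count_near_zero(model, tol=1e-10):
    """Count coefficients whose absolute value is below tol."""
    return int(np.sum(np.abs(np.ravel(model.coef_)) < tol))


lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

n_lasso = count_near_zero(lasso_demo)
n_ridge = count_near_zero(ridge_demo)
print(f"Lasso: {n_lasso} of {lasso_demo.coef_.size} coefficients are (near) zero")
print(f"Ridge: {n_ridge} of {ridge_demo.coef_.size} coefficients are (near) zero")
```

The L1 penalty can set coefficients exactly to zero, while the L2 penalty only shrinks them toward zero, which is why Lasso doubles as a variable-selection method.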

In [113]:
# number of Ridge params almost zero
# (ridge.coef_ is 2D here because y is a one-column DataFrame, hence the nested sum)
sum(sum(abs(ridge.coef_) < 10**(-10)))

4

In [111]:
# number of Lasso params almost zero
sum(abs(lasso.coef_) < 10**(-10))

69

Lasso effectively performed variable selection: it set 69 coefficients to (nearly) zero, removing about 24% of the variables from your model, whereas Ridge did so for only 4.

In [118]:
# your code here
print("Fraction of coefficients removed by Lasso:")
sum(abs(lasso.coef_) < 10**(-10))/len(lasso.coef_)

Fraction of coefficients removed by Lasso:


0.23875432525951557

## Summary

Great! You now know how to perform Lasso and Ridge regression.