# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice what you have learned about Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing all columns with `dtype = object`, so that our first model contains only **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [12]:
# Load necessary packages
import numpy as np

# remove "object"-type features and SalePrice from `X`
X = df.select_dtypes(exclude='object').drop('SalePrice', axis = 1)

# Impute null values
X = X.fillna(X.median())

# Create y
y = df['SalePrice']
y.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

Look at the information of `X` again

In [13]:
print(X.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute R-squared and MSE for both the training and test sets.

In [20]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


# Fit the model and print R2 and MSE for train and test
reg = LinearRegression()
reg.fit(X = X_train, y = y_train)

r2_train = reg.score(X_train, y_train)
y_hat_train = reg.predict(X_train)
mse_train = mean_squared_error(y_train, y_hat_train)

print(r2_train, mse_train)

r2_test = reg.score(X_test, y_test)
y_hat_test = reg.predict(X_test)
mse_test = mean_squared_error(y_test, y_hat_test)

print(r2_test, mse_test)

0.8118554543955108 1090644660.2119048
0.7849696889294515 1578621667.9588947
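The same fit-and-score pattern is repeated for every model in this lab, so it may help to wrap it in a small helper. The sketch below is one possible way to do that. The function name `report_metrics` and the synthetic data are assumptions made for illustration; they are not part of the lab or of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def report_metrics(model, X_train, X_test, y_train, y_test):
    """Return {'train': (R2, MSE), 'test': (R2, MSE)} for an already fitted model."""
    results = {}
    for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        y_hat = model.predict(X_part)
        results[name] = (model.score(X_part, y_part), mean_squared_error(y_part, y_hat))
    return results


# Synthetic stand-in data (an assumption, not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
demo_reg = LinearRegression().fit(Xtr, ytr)

metrics = report_metrics(demo_reg, Xtr, Xte, ytr, yte)
for split, (r2, mse) in metrics.items():
    print(f"{split}: R2={r2:.3f}, MSE={mse:.3f}")
```

With the lab's variables, the call would look like `report_metrics(reg, X_train, X_test, y_train, y_test)`.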


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors! The target is scaled as well, so the MSE values below are on a different scale than before.

In [25]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scaled = preprocessing.scale(X)
y_scaled = preprocessing.scale(y)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.33, random_state=42)
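The cell above scales the full data set before splitting it, so the test rows contribute to the means and standard deviations. A common alternative is to fit the scaler on the training rows only and reuse it on the test rows. The sketch below shows that variant with scikit-learn's `StandardScaler` on synthetic data. The data and the `_demo` variable names are assumptions, and this is not the approach the lab itself takes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(300, 4))
y_demo = rng.normal(size=300)

# Split first, then scale
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

scaler = StandardScaler()
Xtr_scaled = scaler.fit_transform(Xtr)  # learn mean and std from the training rows only
Xte_scaled = scaler.transform(Xte)      # apply the training statistics to the test rows

print(Xtr_scaled.mean(axis=0).round(3), Xtr_scaled.std(axis=0).round(3))
print(Xte_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately centered because they were scaled with the training statistics.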

Perform the same linear regression on this data and print out R-squared and MSE.

In [26]:

# Fit the model and print R2 and MSE for train and test
reg = LinearRegression()
reg.fit(X = X_train, y = y_train)

r2_train = reg.score(X_train, y_train)
y_hat_train = reg.predict(X_train)
mse_train = mean_squared_error(y_train, y_hat_train)

print(r2_train, mse_train)

r2_test = reg.score(X_test, y_test)
y_hat_test = reg.predict(X_test)
mse_test = mean_squared_error(y_test, y_hat_test)

print(r2_test, mse_test)

0.8118599286139094 0.17292774430356497
0.7847972639958313 0.25050586134233366


## Include dummy variables

We haven't included dummy variables so far. Let's use our "object" variables again and create dummies.

In [29]:
# Create X_cat which contains only the categorical variables
X_cat = df.select_dtypes(include='object')

In [30]:
# Make dummies
X_cat = pd.get_dummies(X_cat)
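To see what `pd.get_dummies` does, here is a minimal sketch on a toy frame. The column names are borrowed from the housing data, but the four rows are made up for illustration.

```python
import pandas as pd

# Toy frame with two categorical columns (values are illustrative only)
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "LotShape": ["Reg", "IR1", "Reg", "IR2"],
})

# Each category level becomes its own 0/1 indicator column
toy_dummies = pd.get_dummies(toy)

print(toy_dummies.columns.tolist())
print(toy_dummies.shape)
```

Two columns with 2 and 3 distinct levels turn into 5 indicator columns. This is why the 43 object columns of the housing data expand to 252 dummy columns.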

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [39]:
# Your code here
print(X_cat.shape, X_scaled.shape)
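# Keep the numeric column names so that every column label is a string
num_cols = df.select_dtypes(exclude='object').columns.drop('SalePrice')
X = X_cat.merge(pd.DataFrame(X_scaled, columns=num_cols), left_index = True, right_index = True)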
X_train, X_test, y_train, y_test = train_test_split(X, y_scaled, test_size=0.33, random_state=42)

(1460, 252) (1460, 37)


Perform the same linear regression on this data and print out R-squared and MSE.

In [40]:
# Your code here
reg = LinearRegression()
reg.fit(X = X_train, y = y_train)

r2_train = reg.score(X_train, y_train)
y_hat_train = reg.predict(X_train)
mse_train = mean_squared_error(y_train, y_hat_train)

print(r2_train, mse_train)

r2_test = reg.score(X_test, y_test)
y_hat_test = reg.predict(X_test)
mse_test = mean_squared_error(y_test, y_hat_test)

print(r2_test, mse_test)

0.9383758957065231 0.056641401651910124
-7.103533726510368e+17 8.268839271166732e+17


Notice the severe overfitting above: our training R-squared is quite high, but the test R-squared is hugely negative! Our predictions are far off. Similarly, the test MSE is orders of magnitude higher than the training MSE.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression. Each time, look at R-squared and MSE.
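For reference, the objectives that scikit-learn's `Ridge` and `Lasso` estimators minimize can be written as follows. Here $w$ is the coefficient vector, $n$ is the number of training samples, and $\alpha$ is the regularization parameter; the intercept is omitted to keep the notation short.

```latex
\begin{aligned}
\text{Ridge:}\quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{aligned}
```

The two loss terms are scaled differently, so the same value of $\alpha$ does not mean the same amount of shrinkage in both models.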

## Lasso

With the default parameter (alpha = 1)

In [44]:
# Your code here
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=1)
lasso.fit(X_train, y_train)
print(lasso.coef_)

r2_train = lasso.score(X_train, y_train)
y_hat_train = lasso.predict(X_train)
mse_train = mean_squared_error(y_train, y_hat_train)

print(r2_train, mse_train)

r2_test = lasso.score(X_test, y_test)
y_hat_test = lasso.predict(X_test)
mse_test = mean_squared_error(y_test, y_hat_test)

print(r2_test, mse_test)

[-0.  0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0. -0. -0.  0.  0. -0.  0.
 -0. -0.  0. -0.  0. -0. -0.  0.  0.  0. -0. -0. -0.  0.  0.  0. -0.  0.
 -0. -0. -0. -0. -0.  0.  0.  0. -0. -0. -0. -0.  0.  0.  0.  0. -0. -0.
  0.  0.  0. -0.  0.  0.  0. -0. -0.  0.  0.  0.  0. -0.  0.  0. -0. -0.
 -0.  0. -0. -0. -0.  0. -0.  0. -0. -0.  0. -0. -0.  0. -0.  0. -0. -0.
  0. -0. -0.  0.  0.  0. -0. -0. -0.  0. -0.  0. -0.  0. -0. -0.  0.  0.
  0. -0. -0. -0. -0. -0.  0. -0.  0. -0. -0. -0.  0. -0. -0. -0.  0. -0.
 -0. -0.  0. -0.  0.  0. -0.  0. -0. -0. -0. -0. -0.  0. -0. -0.  0. -0.
 -0.  0.  0. -0.  0. -0. -0.  0. -0.  0.  0.  0.  0. -0. -0. -0.  0. -0.
 -0. -0.  0. -0.  0. -0. -0.  0. -0.  0. -0. -0. -0. -0.  0. -0. -0. -0.
 -0. -0.  0. -0. -0. -0.  0.  0.  0. -0.  0. -0. -0. -0. -0. -0.  0. -0.
  0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0. -0.  0.  0. -0.  0. -0.  0.
 -0.  0. -0. -0.  0. -0.  0. -0. -0.  0.  0.  0.  0.  0. -0. -0. -0.  0.
 -0. -0.  0. -0.  0.  0. -0. -0. -0.  0. -0. -0. -0

With a higher regularization parameter (alpha = 10)

In [45]:
# Your code here

lasso = Lasso(alpha=10)
lasso.fit(X_train, y_train)
print(lasso.coef_)

r2_train = lasso.score(X_train, y_train)
y_hat_train = lasso.predict(X_train)
mse_train = mean_squared_error(y_train, y_hat_train)

print(r2_train, mse_train)

r2_test = lasso.score(X_test, y_test)
y_hat_test = lasso.predict(X_test)
mse_test = mean_squared_error(y_test, y_hat_test)

print(r2_test, mse_test)

[-0.  0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0. -0. -0.  0.  0. -0.  0.
 -0. -0.  0. -0.  0. -0. -0.  0.  0.  0. -0. -0. -0.  0.  0.  0. -0.  0.
 -0. -0. -0. -0. -0.  0.  0.  0. -0. -0. -0. -0.  0.  0.  0.  0. -0. -0.
  0.  0.  0. -0.  0.  0.  0. -0. -0.  0.  0.  0.  0. -0.  0.  0. -0. -0.
 -0.  0. -0. -0. -0.  0. -0.  0. -0. -0.  0. -0. -0.  0. -0.  0. -0. -0.
  0. -0. -0.  0.  0.  0. -0. -0. -0.  0. -0.  0. -0.  0. -0. -0.  0.  0.
  0. -0. -0. -0. -0. -0.  0. -0.  0. -0. -0. -0.  0. -0. -0. -0.  0. -0.
 -0. -0.  0. -0.  0.  0. -0.  0. -0. -0. -0. -0. -0.  0. -0. -0.  0. -0.
 -0.  0.  0. -0.  0. -0. -0.  0. -0.  0.  0.  0.  0. -0. -0. -0.  0. -0.
 -0. -0.  0. -0.  0. -0. -0.  0. -0.  0. -0. -0. -0. -0.  0. -0. -0. -0.
 -0. -0.  0. -0. -0. -0.  0.  0.  0. -0.  0. -0. -0. -0. -0. -0.  0. -0.
  0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0. -0.  0.  0. -0.  0. -0.  0.
 -0.  0. -0. -0.  0. -0.  0. -0. -0.  0.  0.  0.  0.  0. -0. -0. -0.  0.
 -0. -0.  0. -0.  0.  0. -0. -0. -0.  0. -0. -0. -0
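Every Lasso coefficient printed above is zero for both values of alpha. A likely explanation is the scale of the data. scikit-learn's Lasso minimizes `(1 / (2 * n)) * ||y - Xw||^2 + alpha * ||w||_1`, and for centered data every coefficient is zero once alpha reaches `max(|X.T @ y|) / n`. With standardized predictors and a standardized target, that threshold is a correlation and cannot exceed 1. The sketch below illustrates this on synthetic data. The data and the smaller value `alpha=0.01` are assumptions chosen for illustration, not values from the lab.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import Lasso

# Synthetic stand-in data (an assumption): two informative features out of ten
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(300, 10))
y_raw = 4.0 * X_raw[:, 0] - 3.0 * X_raw[:, 1] + rng.normal(size=300)

# Standardize predictors and target, as the lab does
X_std = preprocessing.scale(X_raw)
y_std = preprocessing.scale(y_raw)

# Smallest alpha at which Lasso sets every coefficient to zero
alpha_max = np.max(np.abs(X_std.T @ y_std)) / len(y_std)
print("alpha_max:", round(float(alpha_max), 3))

# Count the non-zero coefficients for a large and a small alpha
nonzero_counts = {}
for alpha in (1.0, 0.01):
    model = Lasso(alpha=alpha).fit(X_std, y_std)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
print(nonzero_counts)
```

On this synthetic data, `alpha=1.0` removes every feature while the smaller value keeps the informative ones. Trying smaller values of alpha on the housing data may therefore give a more useful Lasso model.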

## Ridge

With the default parameter (alpha = 1)

In [46]:
# Your code here
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)
print(ridge.coef_)

r2_train = ridge.score(X_train, y_train)
y_hat_train = ridge.predict(X_train)
mse_train = mean_squared_error(y_train, y_hat_train)

print(r2_train, mse_train)

r2_test = ridge.score(X_test, y_test)
y_hat_test = ridge.predict(X_test)
mse_test = mean_squared_error(y_test, y_hat_test)

print(r2_test, mse_test)

[-1.66802197e-01  1.44125204e-01 -3.65031383e-02  4.02001295e-02
  1.89800015e-02 -2.54293710e-01  2.54293710e-01  5.35188722e-03
  1.48943295e-01  9.42359673e-04  5.71656841e-02 -7.70401010e-02
  1.89320572e-02 -1.79693688e-01  2.06970054e-01 -8.44099393e-02
  5.71335725e-02  1.66732974e-01 -1.66732974e-01  2.76889476e-02
  1.45849931e-01 -7.49161962e-02 -1.12705079e-01  1.40823965e-02
  2.71296494e-02  1.05902975e-01 -1.33032624e-01  1.10106741e-02
 -2.75259307e-02  9.35931062e-02 -2.57432127e-03 -7.29401656e-02
 -5.97742289e-02  1.63501113e-01 -2.36024953e-01 -8.42117903e-02
 -1.95273306e-01 -6.66403614e-02 -2.02327967e-01 -1.64038215e-01
  1.03970219e-01 -1.45914929e-01  3.73758134e-01  3.30815730e-01
 -1.84508041e-01 -5.47357183e-02 -7.33283347e-02 -4.02303413e-02
  1.06615639e-02  5.30620174e-01 -6.99995108e-02  6.21173998e-02
 -3.35565239e-02 -6.71973816e-03  1.06880933e-01 -3.54747970e-02
  1.46787343e-02 -1.94579174e-01  4.95844527e-02  3.73312125e-02
  6.18549003e-02  3.05445

With a higher regularization parameter (alpha = 10)

In [47]:
# Your code here
ridge = Ridge(alpha=10)
ridge.fit(X_train, y_train)
print(ridge.coef_)

r2_train = ridge.score(X_train, y_train)
y_hat_train = ridge.predict(X_train)
mse_train = mean_squared_error(y_train, y_hat_train)

print(r2_train, mse_train)

r2_test = ridge.score(X_test, y_test)
y_hat_test = ridge.predict(X_test)
mse_test = mean_squared_error(y_test, y_hat_test)

print(r2_test, mse_test)

[-5.50853084e-02  7.47869426e-02 -1.13343480e-02  3.56939807e-02
 -4.40612670e-02 -7.05235489e-02  7.05235489e-02 -1.49607784e-02
  7.84735680e-02  5.26076029e-03  6.42563322e-02 -8.38554693e-02
  1.43383768e-02 -1.81743723e-01  1.69917495e-01 -2.11392168e-02
  3.29654450e-02  4.39194581e-02 -4.39194581e-02 -3.67442727e-03
  1.10110702e-01 -7.17716818e-02 -2.45101345e-02 -1.01544586e-02
 -2.61938618e-02  5.90015724e-02 -3.28077106e-02 -7.23975418e-02
 -8.16940539e-03  3.66950511e-02  6.03459336e-02 -2.98373930e-02
 -7.06748602e-02  1.49688773e-01 -1.74921677e-01 -9.45568011e-02
 -7.71745701e-02 -1.87217052e-02 -1.32841825e-01 -1.11097517e-01
  5.49170839e-03 -8.71983997e-02  2.64948803e-01  2.33593609e-01
 -7.53147379e-02 -3.52395620e-02 -3.80096325e-02 -4.01581083e-02
  3.21103107e-02  3.09097734e-01 -7.51506104e-02  4.94924239e-02
  2.53916616e-03 -1.46773565e-02  1.23967748e-01 -7.00037143e-03
 -5.44934727e-02 -8.76064993e-02  3.47323343e-02  8.57125070e-03
 -6.03279954e-03  4.79431

## Look at the metrics, what are your main conclusions?

Conclusions here

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [None]:
# number of Ridge params almost zero


In [None]:
# number of Lasso params almost zero

Compare with the total number of parameters and draw conclusions!

In [None]:
# your code here
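One way to approach the comparison is sketched below on synthetic data. The data set, the tolerance `tol`, and the `_demo` variable names are assumptions made for illustration. In the lab you would apply the same counting to the `coef_` arrays of your own fitted `ridge` and `lasso` models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data (an assumption): 50 features, only 5 of them informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=50, n_informative=5, noise=10.0, random_state=0
)

ridge_demo = Ridge(alpha=10).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=10).fit(X_demo, y_demo)

# Treat coefficients with an absolute value below tol as "almost zero"
tol = 1e-10
n_ridge_zero = int(np.sum(np.abs(ridge_demo.coef_) < tol))
n_lasso_zero = int(np.sum(np.abs(lasso_demo.coef_) < tol))
n_params = len(lasso_demo.coef_)

print(f"Ridge: {n_ridge_zero} of {n_params} coefficients almost zero")
print(f"Lasso: {n_lasso_zero} of {n_params} coefficients almost zero")
```

Ridge shrinks coefficients toward zero but rarely makes them exactly zero, whereas Lasso can set many of them to zero and so acts as a form of feature selection.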

## Summary

Great! You now know how to perform Lasso and Ridge regression.