# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice applying Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard linear regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing the columns with `dtype = object`, so that our first model contains only **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [3]:
df.columns


Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [4]:
# Load necessary packages
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt


In [5]:
y = df['SalePrice']

In [6]:
# Remove "object"-type features and SalePrice from `X`
features = [col for col in df.columns if df[col].dtype in [np.float64, np.int64] and col != 'SalePrice']
X = df[features].copy()  # copy so the imputation does not write into a view of df

# Impute null values with the median of each column
X = X.fillna(X.median())
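
The selection and imputation steps can also be sketched on a tiny frame, which makes it easy to check the result by eye. The toy values below are made up for illustration, and `select_dtypes` is used here as an alternative to the dtype list comprehension; it is not necessarily how the lab intends you to do it.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are invented for illustration)
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "LotArea": [8000, 9600, 11250, 9550],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [200000, 180000, 220000, 140000],
})

# Keep numeric columns only, then drop the target from the predictors
X_toy = toy.select_dtypes(include="number").drop(columns="SalePrice")

# Passing a Series of medians to fillna imputes each column with its own median
X_toy = X_toy.fillna(X_toy.median())

print(X_toy)
```

The missing `LotFrontage` value is replaced by 70.0, the median of the three observed values, and the `Street` column is dropped because it is not numeric.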

In [7]:
y.shape

(1460,)

Look at the information of `X` again

In [8]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute the R-squared and the MSE for both the training and the test set.

In [9]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X,y)
# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)

print('Training R**2', linreg.score(X_train, y_train))
print('Training MSE', mean_squared_error(y_train,linreg.predict(X_train)))
print('Test R**2', linreg.score(X_test, y_test))
print('Test MSE', mean_squared_error(y_test,linreg.predict(X_test)))

Training R**2 0.8638706226538876
Training MSE 876120248.6608495
Test R**2 0.5014456093367476
Test MSE 2950970861.3121195
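
To make the two metrics concrete, here is a minimal sketch that computes MSE and R-squared by hand and checks them against scikit-learn. The four values in `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, just to illustrate the formulas
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: the mean of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R-squared: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

Because R-squared compares the model against always predicting the mean of `y_true`, it becomes negative whenever the predictions are worse than that baseline.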


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale`, which standardizes each column to zero mean and unit variance, to scale our predictors!

In [10]:
from sklearn import preprocessing

# Scale the data and perform train test split

X_scaled = preprocessing.scale(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled,y)
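
Note that scaling the full `X` before splitting lets the test rows influence the means and standard deviations used for scaling. A common alternative, sketched below on synthetic data, is to fit a `StandardScaler` on the training rows only and reuse it for the test rows. The data, the `_demo` names, and the `random_state` values are assumptions for illustration, not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors and target (illustrative only, not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# Split first, with a fixed seed so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Learn the scaling parameters from the training rows only...
scaler = StandardScaler().fit(X_tr)

# ...and apply the same transformation to both sets
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the test columns will be close to that, but not exactly, which is expected.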

Perform the same linear regression on this data and print out R-squared and MSE.

In [11]:
# Your code here
linreg = LinearRegression()
linreg.fit(X_train, y_train)

print('Training R**2', linreg.score(X_train, y_train))
print('Training MSE', mean_squared_error(y_train,linreg.predict(X_train)))
print('Test R**2', linreg.score(X_test, y_test))
print('Test MSE', mean_squared_error(y_test,linreg.predict(X_test)))

Training R**2 0.8237232863821317
Training MSE 1082277358.3873494
Test R**2 0.7757922848959131
Test MSE 1524764895.4726155


## Include dummy variables

We haven't included dummy variables so far. Let's use our "object" variables again and create dummies.

In [12]:
# Create X_cat which contains only the categorical variables
# Compare against the built-in `object`; the `np.object` alias was removed in NumPy 1.24
features2 = [col for col in df.columns if df[col].dtype == object]

X_cat = df[features2]


np.shape(X_cat)

(1460, 43)

In [13]:
# Make dummies
X_cat = pd.get_dummies(X_cat)

np.shape(X_cat)

(1460, 252)
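
The jump from 43 to 252 columns happens because `pd.get_dummies` creates one indicator column per category level. A small sketch with two made-up categorical columns shows the pattern:

```python
import pandas as pd

# Two toy categorical columns (values invented for illustration)
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "CentralAir": ["Y", "Y", "N"],
})

# Each column expands into one indicator column per distinct level
toy_dummies = pd.get_dummies(toy_cat)

print(toy_dummies.columns.tolist())
print(toy_dummies.shape)
```

Two columns with two levels each become four indicator columns. Passing `drop_first=True` to `pd.get_dummies` would drop one level per variable, which is one way to avoid perfectly collinear dummy columns.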

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [18]:
# Your code here
# Keep the original feature names so that all column labels are strings
X_scale = pd.DataFrame(X_scaled, columns=X.columns)
exes = [X_scale, X_cat]
X_all = pd.concat(exes, axis =1)

In [19]:
X_all.shape

(1460, 289)

Perform the same linear regression on this data and print out R-squared and MSE.

In [20]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X_all,y)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

print('Training R**2', linreg.score(X_train, y_train))
print('Training MSE', mean_squared_error(y_train,linreg.predict(X_train)))
print('Test R**2', linreg.score(X_test, y_test))
print('Test MSE', mean_squared_error(y_test,linreg.predict(X_test)))

Training R**2 0.9333751449520572
Training MSE 401068200.0575342
Test R**2 -1.274512932945132e+21
Test MSE 9.135362920415102e+30


Notice the severe overfitting above: our training R-squared is quite high, but the testing R-squared is hugely negative! Our predictions are far off. Similarly, the testing MSE is many orders of magnitude higher than the training MSE.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression! Each time, look at R-squared and MSE.
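
For reference, these are the objectives that scikit-learn's `Lasso` and `Ridge` minimize according to its documentation, where $n$ is the number of training samples, $w$ is the coefficient vector, and $\alpha$ is the `alpha` parameter:

$$
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
$$

$$
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
$$

Since the Lasso loss is divided by $2n$ and the Ridge loss is not, the same numeric `alpha` does not mean the same amount of regularization in the two models.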

## Lasso

With default parameter (alpha = 1)

In [26]:
# Your code here
lasso = Lasso(alpha=1)
lasso.fit(X_train, y_train)

print('Training r^2:', lasso.score(X_train, y_train))
print('Testing r^2:', lasso.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso.predict(X_test)))

Training r^2: 0.9356570609460443
Testing r^2: 0.7445949234882079
Training MSE: 387331525.6627855
Testing MSE: 1830674295.5209012


With a higher regularization parameter (alpha = 10)

In [27]:
# Your code here
lasso = Lasso(alpha=10)
lasso.fit(X_train, y_train)

print('Training r^2:', lasso.score(X_train, y_train))
print('Testing r^2:', lasso.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso.predict(X_test)))

Training r^2: 0.9342980429522303
Testing r^2: 0.7405267273471775
Training MSE: 395512540.09400016
Testing MSE: 1859834021.734009


## Ridge

With a lower regularization parameter (alpha = 0.5)

In [28]:
# Your code here
ridge = Ridge(alpha=0.5)
ridge.fit(X_train, y_train)

print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(X_test)))

Training r^2: 0.9337761069462184
Testing r^2: 0.737635985996894
Training MSE: 398654489.66110563
Testing MSE: 1880554071.4575279


With a higher regularization parameter (alpha = 10)

In [29]:
# Your code here
ridge = Ridge(alpha=10)
ridge.fit(X_train, y_train)

print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(X_test)))

Training r^2: 0.9165584642088656
Testing r^2: 0.7610801673386524
Training MSE: 502301228.9588449
Testing MSE: 1712512540.145576
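
Rather than refitting by hand for each `alpha`, you could loop over a grid of values and collect the scores. The sketch below does this for Ridge on synthetic data from `make_regression`; the data set, the grid of alphas, and the `random_state` values are assumptions for illustration, so the numbers will not match the housing results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only, not the housing data)
X_syn, y_syn = make_regression(n_samples=200, n_features=50, n_informative=10,
                               noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Fit one Ridge model per alpha and record train and test R-squared
alphas = [0.01, 0.1, 1, 10, 100]
results = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    results.append((alpha, model.score(X_tr, y_tr), model.score(X_te, y_te)))

for alpha, train_r2, test_r2 in results:
    print(f"alpha={alpha:>6}: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```

The training R-squared can only go down as `alpha` grows, because the penalty pulls the fit away from the least-squares solution. The test R-squared is the one to watch when choosing `alpha`.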


## Look at the metrics, what are your main conclusions?

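Both Ridge and Lasso rescue the model that uses all features: the test R-squared goes from hugely negative for plain linear regression to roughly 0.74 to 0.76. A gap between the training R-squared (about 0.92 to 0.94) and the test R-squared remains, so some overfitting persists. For Lasso, moving from alpha = 1 to alpha = 10 barely changes the metrics. For Ridge, alpha = 10 lowers the training fit but gives the best test R-squared (about 0.76) of the four regularized models.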

## Compare the number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [30]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))

7


In [31]:
# number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))

63


Compare with the total length of the parameter space and draw conclusions!

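In [None]:
# your code here
len(lasso.coef_)

Lasso was very effective at performing variable selection: it set 63 of the 289 coefficients to (essentially) zero, removing about 22% of the variables from your model! Ridge, by contrast, left only 7 coefficients that close to zero.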
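
The same contrast can be reproduced on synthetic data where only a few features carry signal. In this sketch, the data from `make_regression`, the choice `alpha=5.0`, and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 30 features, of which only 5 actually influence the target
X_syn, y_syn = make_regression(n_samples=200, n_features=30, n_informative=5,
                               noise=10.0, random_state=0)

lasso_demo = Lasso(alpha=5.0).fit(X_syn, y_syn)
ridge_demo = Ridge(alpha=5.0).fit(X_syn, y_syn)

# Count coefficients that are (numerically) zero, using the lab's threshold
n_zero_lasso = int(np.sum(np.abs(lasso_demo.coef_) < 1e-10))
n_zero_ridge = int(np.sum(np.abs(ridge_demo.coef_) < 1e-10))

print("Lasso zero coefficients:", n_zero_lasso, "of", len(lasso_demo.coef_))
print("Ridge zero coefficients:", n_zero_ridge, "of", len(ridge_demo.coef_))
```

The L1 penalty in Lasso can set coefficients exactly to zero, while the L2 penalty in Ridge shrinks them toward zero without typically reaching it. That is why Lasso doubles as a variable selection method.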

## Summary

Great! You now know how to perform Lasso and Ridge regression.