# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [249]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [250]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, select a subset of the data by removing the columns with `dtype = object`, so that our first model only contains **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.
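The steps above can be sketched on a small stand-in dataframe. This is a minimal illustration, not the lab's reference solution: the `toy` dataframe and its values are made up, and `select_dtypes(include="number")` is one of several ways to keep only the non-object columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are made up for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Target: the sale price column.
y = toy["SalePrice"]

# Predictors: drop the target, then keep only the numeric columns.
X = toy.drop(columns="SalePrice").select_dtypes(include="number")

# Replace missing values with each column's median.
X = X.fillna(X.median())

print(X)
```

Keeping `X` and `y` in separate objects from the start makes it harder to accidentally transform or train on the target later.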

In [251]:
# Load necessary packages
import numpy as np

# remove "object"-type features and SalePrice from `X`
[df[i].dtype for i in df.columns]

# Impute null values


# Create y


In [252]:
# Pair each column name with a flag that is True when the column is not of dtype object
data = [(i, df[i].dtype != 'object') for i in df.columns]

In [253]:
data[1][1]

True

In [254]:
# Drop every column flagged as dtype object
for name, is_numeric in data:
    if not is_numeric:
        df.drop(name, axis=1, inplace=True)
# Replace missing values with the column median
for col in df:
    df[col] = df[col].fillna(df[col].median())
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 38 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute the R squared and the MSE for both the train and test sets.

In [273]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split in train and test
x_train, x_test, y_train, y_test = train_test_split(df.drop('SalePrice', axis=1), df.SalePrice)
# Fit the model and print R2 and MSE for train and test
model = LinearRegression().fit(x_train, y_train)
print(model.score(x_train, y_train))
print(mean_squared_error(y_train, model.predict(x_train)))
print(model.score(x_test, y_test))
print(mean_squared_error(y_test, model.predict(x_test)))

0.8240685713378213
1120576506.9991765
0.7621303735994436
1454894152.9949467


## Normalize your data

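We haven't scaled our data yet, so let's create a new model on rescaled predictors! Note that the cell below uses `preprocessing.normalize` rather than `preprocessing.scale`: `normalize` rescales each row to unit norm, while `scale` standardizes each column. It is also applied to the whole of `df`, so `SalePrice` is rescaled along with the predictors.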

In [274]:
from sklearn.preprocessing import normalize

# Scale the data and perform train test split
df_norm = pd.DataFrame(normalize(df), columns=df.columns)
df_norm.head()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5e-06 | 0.000287 | 0.000311 | 0.040484 | 3.4e-05 | 2.4e-05 | 0.009596 | 0.009596 | 0.000939 | 0.003382 | ... | 0.0 | 0.000292 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1e-05 | 0.00962 | 0.998927 |
| 1 | 1.1e-05 | 0.00011 | 0.00044 | 0.052801 | 3.3e-05 | 4.4e-05 | 0.010868 | 0.010868 | 0.0 | 0.005379 | ... | 0.001639 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.8e-05 | 0.011039 | 0.998274 |
| 2 | 1.3e-05 | 0.000268 | 0.000304 | 0.050261 | 3.1e-05 | 2.2e-05 | 0.00894 | 0.008944 | 0.000724 | 0.002171 | ... | 0.0 | 0.000188 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4e-05 | 0.008971 | 0.998511 |
| 3 | 2.8e-05 | 0.000499 | 0.000427 | 0.068019 | 5e-05 | 3.6e-05 | 0.013639 | 0.014031 | 0.0 | 0.001538 | ... | 0.0 | 0.000249 | 0.001937 | 0.0 | 0.0 | 0.0 | 0.0 | 1.4e-05 | 0.014288 | 0.997139 |
| 4 | 2e-05 | 0.00024 | 0.000335 | 0.056936 | 3.2e-05 | 2e-05 | 0.007985 | 0.007985 | 0.001397 | 0.002615 | ... | 0.000767 | 0.000335 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.8e-05 | 0.008017 | 0.998169 |
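The `SalePrice` values close to 1 in the output above come from how `normalize` works: it divides each row by that row's norm, so the largest entry in a row dominates. The sketch below contrasts it with `scale`, which works column by column. The array `A` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import normalize, scale

# Two made-up features on very different scales (rows are samples).
A = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

row_normalized = normalize(A)    # each ROW rescaled to unit L2 norm
col_standardized = scale(A)      # each COLUMN centered to mean 0 and scaled to std 1

row_norms = np.linalg.norm(row_normalized, axis=1)
col_means = col_standardized.mean(axis=0)
col_stds = col_standardized.std(axis=0)

print("Row norms after normalize:", row_norms)
print("Column means after scale:", col_means)
print("Column stds after scale:", col_stds)
```

After `normalize`, every row has length 1 but the columns are still on very different scales; after `scale`, every column has mean 0 and standard deviation 1.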


In [284]:
x_train, x_test, y_train, y_test = train_test_split(df_norm.drop('SalePrice', axis=1), df_norm.SalePrice)
# Fit the model and print R2 and MSE for train and test
model = LinearRegression().fit(x_train, y_train)
print(model.score(x_train, y_train))
print(mean_squared_error(y_train, model.predict(x_train)))
print(model.score(x_test, y_test))
print(mean_squared_error(y_test, model.predict(x_test)))

0.8301849266590235
1.3519789891212516e-05
0.48967679940514086
1.1450550321493281e-05
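An alternative worth trying is to standardize only the predictors, learning the scaling from the training rows and leaving the target untouched. The following is a minimal sketch on synthetic data, not the lab's reference solution; the data, coefficients, and `random_state` values are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and target (made-up data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y = X @ np.array([5.0, 0.3, 0.002]) + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn the column means and standard deviations from the training rows only,
# then apply the same transformation to the test rows.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression().fit(X_train_scaled, y_train)
train_r2 = model.score(X_train_scaled, y_train)
test_r2 = model.score(X_test_scaled, y_test)
print("Train R^2:", train_r2)
print("Test R^2:", test_r2)
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step.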


## Include dummy variables

Your model hasn't included dummy variables so far. Let's use the "object" variables again and create dummies.

In [285]:
x_cat = pd.read_csv('Housing_Prices/train.csv')

In [286]:
# Create x_cat which contains only the categorical variables
for name, is_numeric in data:
    if is_numeric:
        x_cat.drop(name, axis=1, inplace=True)
x_cat.shape

(1460, 43)

Create dummy variables from `x_cat`, then merge them with the normalized data in `df_norm` so you have one big dataframe.

In [287]:
x_cat = pd.get_dummies(x_cat)

In [288]:
# Your code here
df_merged = pd.concat([df_norm, x_cat], axis=1)
len(df)

1460
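To see what the dummy encoding and the merge do, here is a small sketch on made-up rows. The dataframes `numeric` and `categorical` are hypothetical stand-ins for `df_norm` and `x_cat`.

```python
import pandas as pd

# Made-up rows standing in for the housing data.
numeric = pd.DataFrame({"LotArea": [8450, 9600, 11250]})
categorical = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                            "LotShape": ["Reg", "IR1", "Reg"]})

# One indicator column per category level.
dummies = pd.get_dummies(categorical)

# Column-wise concatenation; rows are aligned on the shared index.
merged = pd.concat([numeric, dummies], axis=1)
print(merged.columns.tolist())
print(merged.shape)
```

Passing `drop_first=True` to `pd.get_dummies` drops one level per variable, which avoids perfectly collinear indicator columns when the model also fits an intercept.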

Perform the same linear regression on this data and print out R-squared and MSE.

In [295]:
# Your code here
x_train, x_test, y_train, y_test = train_test_split(df_merged.drop('SalePrice', axis=1), df_merged.SalePrice)
# Fit the model and print R2 and MSE for train and test
model = LinearRegression().fit(x_train, y_train)
print(model.score(x_train, y_train))
print(mean_squared_error(y_train, model.predict(x_train)))
print(model.score(x_test, y_test))
print(mean_squared_error(y_test, model.predict(x_test)))

0.907081810937922
7.3214282636860675e-06
-1.0758616091167453e+22
2.678426391732962e+17


Notice the severe overfitting above: our training R squared is quite high, but the testing R squared is negative! Our predictions are far off. Similarly, the testing MSE is orders of magnitude higher than the training MSE.
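A negative R squared means the model predicts worse than simply guessing the mean of the target for every observation. The toy numbers below are made up to illustrate this with `sklearn.metrics.r2_score`.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]

# Predicting the mean of y_true for every sample gives an R squared of 0.
mean_prediction = [2.5, 2.5, 2.5, 2.5]

# Predictions far from the truth do worse than the mean, so R squared is negative.
bad_prediction = [10.0, -10.0, 10.0, -10.0]

r2_mean = r2_score(y_true, mean_prediction)
r2_bad = r2_score(y_true, bad_prediction)
print("R^2 when predicting the mean:", r2_mean)
print("R^2 for far-off predictions:", r2_bad)
```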

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression! Each time, look at R-squared and MSE.
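For reference, the two penalized objectives can be written as follows. This is a sketch in generic notation: the symbols $\beta_0$, $\beta$, $\alpha$, $n$ and $p$ are assumptions introduced here rather than taken from the lab, and scikit-learn scales the squared-error term slightly differently in its two estimators.

$$
\text{Ridge:}\quad \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2
$$

$$
\text{Lasso:}\quad \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
$$

In both cases a larger $\alpha$ means stronger shrinkage of the coefficients; the only difference is whether the penalty uses squared (L2) or absolute (L1) coefficient values.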

## Lasso

With default parameter (alpha = 1)

In [296]:
# Your code here
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=1)  # Lasso applies an L1 penalty
lasso.fit(x_train, y_train)
print('Training r^2:', lasso.score(x_train, y_train))
print('Testing r^2:', lasso.score(x_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso.predict(x_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso.predict(x_test)))


With a higher regularization parameter (alpha = 10)

In [297]:
# Your code here
lasso = Lasso(alpha=10)  # Lasso applies an L1 penalty
lasso.fit(x_train, y_train)
print('Training r^2:', lasso.score(x_train, y_train))
print('Testing r^2:', lasso.score(x_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso.predict(x_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso.predict(x_test)))


## Ridge

With default parameter (alpha = 1)

In [298]:
# Your code here
from sklearn.linear_model import Lasso, Ridge

ridge = Ridge(alpha = 1) # Ridge applies an L2 penalty
ridge.fit(x_train, y_train)
print('Training r^2:', ridge.score(x_train, y_train))
print('Testing r^2:', ridge.score(x_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(x_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(x_test)))

Training r^2: 0.8082957153918604
Testing r^2: 0.29145579907625163
Training MSE: 1.5105214401693189e-05
Testing MSE: 1.763966175000465e-05


With a higher regularization parameter (alpha = 10)

In [299]:
# Your code here
ridge = Ridge(alpha = 10) # Ridge applies an L2 penalty
ridge.fit(x_train, y_train)
print('Training r^2:', ridge.score(x_train, y_train))
print('Testing r^2:', ridge.score(x_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(x_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(x_test)))

Training r^2: 0.596655306725709
Testing r^2: 0.0914352658579195
Training MSE: 3.178128273005045e-05
Testing MSE: 2.2619301050456213e-05


## Look at the metrics. What are your main conclusions?

Conclusions here

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total length of the parameter space and draw conclusions!

In [301]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))

6


In [300]:
# number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))


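Lasso's L1 penalty can shrink coefficients all the way to zero, so it essentially performs variable selection, whereas Ridge's L2 penalty only shrinks coefficients toward zero. Compare each count with the total number of parameters, `len(lasso.coef_)`, to see what share of the variables each model removes.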
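The difference between the two penalties is easy to see on synthetic data where most features are known to be irrelevant. This is a minimal sketch, not the lab's data: the dataset comes from `make_regression`, and the sample size, noise level and `alpha` values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, only 5 of which influence the target.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count the coefficients that are (very close to) zero in each model.
n_zero_lasso = int(np.sum(np.abs(lasso.coef_) < 1e-10))
n_zero_ridge = int(np.sum(np.abs(ridge.coef_) < 1e-10))

print("Lasso coefficients at zero:", n_zero_lasso, "of", lasso.coef_.size)
print("Ridge coefficients at zero:", n_zero_ridge, "of", ridge.coef_.size)
```

Dividing each count by `coef_.size` gives the share of variables the model effectively removed.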

## Summary

Great! You now know how to perform Lasso and Ridge regression.