# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard linear regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [50]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing the columns with `dtype = object`. This way, our first model contains only **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [68]:
# Load necessary packages
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_squared_log_error
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split, cross_val_score, KFold
import numpy as np

In [53]:
# Remove "object"-type features and SalePrice from `X`
X = df.select_dtypes(exclude=['object']).drop(columns=['SalePrice'])

In [54]:
# Impute null values (note: this fills with 0, not the per-feature median requested above)
X.fillna(value=0, inplace=True)
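
The instructions above ask for the per-feature median, while the cell fills missing values with 0. A minimal sketch of median imputation follows, using a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the numeric predictors
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "MasVnrArea": [0.0, 196.0, np.nan, 162.0],
})

# DataFrame.median() skips NaN by default and returns one value per column;
# fillna then aligns that Series on the column names
toy_imputed = toy.fillna(toy.median())
print(toy_imputed)
```

Applied to this lab, the equivalent would be `X = X.fillna(X.median())`. Doing so would change the numbers printed in the rest of the lab, which were produced with the 0-fill.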

In [55]:
# Create y
y = df['SalePrice']

Look at the information for `X` again.

In [56]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute the R-squared and the MSE for both the training and test sets.

In [78]:
# Instantiate the model and split the data into train and test sets
linreg = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=25565)

# Fit the model and print R2 and MSE for train and test
linreg.fit(X_train, y_train)

print('Training r^2:', linreg.score(X_train, y_train))
print('Testing r^2:', linreg.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, linreg.predict(X_test)))

Training r^2: 0.8014835784695901
Testing r^2: 0.8477511184120321
Training MSE: 1160212135.1697154
Testing MSE: 1233888283.3125625
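
To make the two metrics concrete, here is a small sketch that computes MSE and R-squared by hand and checks them against scikit-learn. The data is synthetic (an assumption for illustration, not the housing data), and `X_demo`, `y_demo`, and `model` are names introduced only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: three features with known coefficients plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# MSE is the mean squared residual; R^2 is 1 - (residual SS / total SS)
residual_ss = np.sum((y_demo - pred) ** 2)
total_ss = np.sum((y_demo - y_demo.mean()) ** 2)
mse_manual = residual_ss / len(y_demo)
r2_manual = 1 - residual_ss / total_ss

print(np.isclose(mse_manual, mean_squared_error(y_demo, pred)))
print(np.isclose(r2_manual, r2_score(y_demo, pred)))
```

Because MSE is in squared units of the target, taking the square root of the values printed above gives typical errors of about 34,000 (train) and 35,000 (test) in the units of `SalePrice`.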


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [79]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scaled = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.20, random_state=25565)

Perform the same linear regression on this data and print out R-squared and MSE.

In [80]:
# Fit the model on the scaled data and print R2 and MSE for train and test
linreg.fit(X_train, y_train)
print('Training r^2:', linreg.score(X_train, y_train))
print('Testing r^2:', linreg.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, linreg.predict(X_test)))

Training r^2: 0.8014795607525241
Testing r^2: 0.8477583349561978
Training MSE: 1160235616.3712184
Testing MSE: 1233829797.4360218
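
The scaling cell above standardizes all rows before the train-test split, so the test rows contribute to the means and standard deviations. A common alternative, sketched here on synthetic data (an assumption; `X_demo` and the other names are illustrative), is to fit a `StandardScaler` on the training rows only and reuse it for the test rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors with a non-zero mean and non-unit variance
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Learn means and standard deviations from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
# The test rows are transformed with the training statistics
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.std(axis=0).round(6))
```

With this approach the training columns have mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is expected.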


## Include dummy variables

We haven't included dummy variables so far. Let's use our "object" variables again and create dummies.

In [84]:
object  # np.object was a deprecated alias for the builtin `object` and was removed in NumPy 1.24

object

In [88]:
# Create X_cat which contains only the categorical variables
features_cat = [col for col in df.columns if df[col].dtype in ['object']]
X_cat = df[features_cat]
X_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 43 columns):
MSZoning         1460 non-null object
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422

In [90]:
# Make dummies
X_cat_dummies = pd.get_dummies(X_cat)
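
As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a made-up two-column frame (`toy_cat` and its values are assumptions for the example). It also shows the `drop_first=True` option, which the lab does not use but which drops one level per variable.

```python
import pandas as pd

# Hypothetical categorical frame; None plays the role of a missing value
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "Alley": ["Grvl", None, "Pave"],
})

# One indicator column per (variable, level) pair
full = pd.get_dummies(toy_cat)
# Same encoding, but the first level of each variable is dropped
reduced = pd.get_dummies(toy_cat, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
# A missing value gets no indicator of its own by default,
# so that row is all zeros (or False) for the variable's columns
print(full.loc[1, ["Alley_Grvl", "Alley_Pave"]].tolist())
```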

Merge `X_cat_dummies` together with our scaled predictors `X_scaled` so you have one big predictor dataframe.

In [106]:
type(X_scaled), type(X_cat_dummies)

(pandas.core.frame.DataFrame, pandas.core.frame.DataFrame)

In [107]:
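# preprocessing.scale returns a NumPy array; wrap it in a DataFrame (columns are labeled 0, 1, 2, ...)
X_scaled = pd.DataFrame(X_scaled)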

In [108]:
X_fin = pd.concat([X_scaled, X_cat_dummies], axis=1)

In [109]:
X_fin.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -1.730865 | 0.073375 | 0.212877 | -0.207142 | 0.651479 | -0.5172 | 1.050994 | 0.878668 | 0.514104 | 0.575425 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | -1.728492 | -0.872563 | 0.645747 | -0.091886 | -0.071836 | 2.179628 | 0.156734 | -0.429577 | -0.57075 | 1.171992 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | -1.72612 | 0.073375 | 0.299451 | 0.07348 | 0.651479 | -0.5172 | 0.984752 | 0.830215 | 0.325915 | 0.092907 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | -1.723747 | 0.309859 | 0.068587 | -0.096897 | 0.651479 | -0.5172 | -1.863632 | -0.720298 | -0.57075 | -0.499274 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | -1.721374 | 0.073375 | 0.761179 | 0.375148 | 1.374795 | -0.5172 | 0.951632 | 0.733308 | 1.366489 | 0.463568 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
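
The scaled columns in the preview above are labeled 0, 1, 2, ... because `preprocessing.scale` returns a plain array and the original column names are lost when it is wrapped in a DataFrame. A minimal sketch of keeping the names, on a made-up frame (`toy_num` and its values are illustrative assumptions):

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical numeric frame standing in for `X`
toy_num = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0],
    "YearBuilt": [2003.0, 1976.0, 2001.0],
})

# Pass the original column labels and index back in when rebuilding the frame
scaled_named = pd.DataFrame(
    preprocessing.scale(toy_num),
    columns=toy_num.columns,
    index=toy_num.index,
)
print(scaled_named.columns.tolist())
```

Named columns make it easier to interpret coefficients later, for example when checking which features Lasso sets to zero.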


Perform the same linear regression on this data and print out R-squared and MSE.

In [110]:
linreg = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X_fin, y, test_size=0.20, random_state=25565)
linreg.fit(X_train, y_train)
print('Training R^2:', linreg.score(X_train, y_train))
print('Testing R^2:', linreg.score(X_test, y_test))
print('Training MSE:', mean_squared_error(linreg.predict(X_train), y_train))
print('Testing MSE:', mean_squared_error(linreg.predict(X_test), y_test))

Training R^2: 0.8649919938038831
Testing R^2: -1.217110698429972e+21
Training MSE: 789042669.2474315
Testing MSE: 9.863971509172625e+30


Notice the severe overfitting above: our training R-squared is quite high, but the testing R-squared is hugely negative! Our test predictions are far off. Similarly, the testing MSE is many orders of magnitude higher than the training MSE.
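
One plausible contributor to such unstable test predictions (an assumption here; the lab itself does not diagnose the cause) is that a full set of dummy columns is perfectly collinear with the intercept, so ordinary least squares has no unique solution. The sketch below checks this on a made-up frame by comparing the rank of the design matrix with its number of columns; `toy_cat` and `design` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical frame with two variables of two levels each
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR1", "Reg", "Reg"],
})

# Full dummy encoding (no level dropped), as in the lab
dummies = pd.get_dummies(toy_cat).to_numpy(dtype=float)

# Add the intercept column that LinearRegression fits by default
design = np.column_stack([np.ones(len(dummies)), dummies])

# Each variable's dummies sum to the intercept column, so the rank is deficient
rank = np.linalg.matrix_rank(design)
print("columns:", design.shape[1], "rank:", rank)
```

A penalty on the coefficients, as added by Ridge or Lasso, is one way to obtain stable estimates even when the columns are collinear like this.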

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and fit both a Lasso and a Ridge regression model. Each time, look at R-squared and MSE.
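
For reference, these are the objectives that scikit-learn's `Ridge` and `Lasso` estimators minimize, as stated in the scikit-learn documentation. Here $w$ is the coefficient vector, $n$ is the number of training samples, and $\alpha$ is the `alpha` argument; the intercept is not penalized.

$$
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the two loss terms are scaled differently, the same numeric value of `alpha` does not correspond to the same amount of regularization in the two models.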

In [112]:
from sklearn.linear_model import Lasso, Ridge

In [115]:
def test(model):
    # Print R^2 and MSE of a fitted model on the current train and test splits
    print('Training R^2:', model.score(X_train, y_train))
    print('Testing R^2:', model.score(X_test, y_test))
    print('Training MSE:', mean_squared_error(model.predict(X_train), y_train))
    print('Testing MSE:', mean_squared_error(model.predict(X_test), y_test))

## Lasso

With default parameter (alpha = 1)

In [116]:
lasso = Lasso(alpha=1.0)
lasso = lasso.fit(X_train, y_train)
test(lasso)

Training R^2: 0.9321117358410881
Testing R^2: 0.9109449023050062
Training MSE: 396767115.3124788
Testing MSE: 721739565.3026091


With a higher regularization parameter (alpha = 10)

In [117]:
lasso = Lasso(alpha=10.0)
lasso = lasso.fit(X_train, y_train)
test(lasso)

Training R^2: 0.9297205021158049
Testing R^2: 0.9205312122720829
Training MSE: 410742474.9563439
Testing MSE: 644048120.707373


## Ridge

With default parameter (alpha = 1)

In [118]:
ridge = Ridge(alpha=1.0)
ridge = ridge.fit(X_train, y_train)
test(ridge)

Training R^2: 0.9191818591674787
Testing R^2: 0.899573598220608
Training MSE: 472334666.3861887
Testing MSE: 813897344.9156983


With a higher regularization parameter (alpha = 10)

In [119]:
ridge = Ridge(alpha=10.0)
ridge = ridge.fit(X_train, y_train)
test(ridge)

Training R^2: 0.8942704417328069
Testing R^2: 0.9014498034903808
Training MSE: 617927300.9358563
Testing MSE: 798691796.7677182
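
The values alpha = 1 and alpha = 10 tried above are only two points on a continuum. As a hedged sketch of how one might choose alpha by cross-validation instead, the example below uses scikit-learn's `LassoCV` and `RidgeCV` on synthetic data; the data set, the candidate grid `alphas`, and the variable names are assumptions for illustration, not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic regression problem with only a few informative features
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=0
)

# Hypothetical candidate grid of regularization strengths
alphas = [0.1, 1.0, 10.0, 100.0]

# Each estimator picks the alpha with the best 5-fold cross-validated score
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_demo, y_demo)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_demo, y_demo)

print("Lasso alpha:", lasso_cv.alpha_)
print("Ridge alpha:", ridge_cv.alpha_)
```

In a real workflow, the cross-validation would be run on the training split only, keeping the test split for a final evaluation.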


## Look at the metrics, what are your main conclusions?

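Lasso and Ridge regression produce somewhat better training R^2 than the unregularized model with dummy variables, but MUCH better testing R^2. They do their job: mitigating overfitting. In this particular instance, Lasso regression with alpha = 10 also had the lowest testing MSE (about 6.4e8, versus 7.2e8 to 8.1e8 for the other three regularized models).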

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [132]:
# Number of Ridge params almost zero (`ridge` is the most recently fitted model, alpha = 10)
ridge_params_near_zero = sum(abs(ridge.coef_) < 10**(-10))
print(ridge_params_near_zero)

4


In [133]:
# Number of Lasso params almost zero (`lasso` is the most recently fitted model, alpha = 10)
lasso_params_near_zero = sum(abs(lasso.coef_) < 10**(-10))
print(lasso_params_near_zero)

66


Compare with the total number of parameters and draw conclusions!

In [135]:
n_lasso_params = len(lasso.coef_)
n_lasso_params

289

In [136]:
lasso_params_near_zero / n_lasso_params

0.22837370242214533

Lasso reduced nearly a quarter of our parameters (66 of 289) to near zero, compared with only 4 for Ridge. That's some heavy feature selection.
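
The same contrast can be reproduced on synthetic data, where the number of truly informative features is known. This is a sketch under assumed settings (`make_regression` data, alpha = 5 for both models, illustrative variable names), using the same near-zero threshold as the cells above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, of which only 5 actually influence the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=50, n_informative=5, noise=10.0, random_state=0
)

lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=5.0).fit(X_demo, y_demo)

# Count coefficients that are (very close to) zero
lasso_zeros = int(np.sum(np.abs(lasso_demo.coef_) < 1e-10))
ridge_zeros = int(np.sum(np.abs(ridge_demo.coef_) < 1e-10))

print("Lasso near-zero coefficients:", lasso_zeros)
print("Ridge near-zero coefficients:", ridge_zeros)
```

The L1 penalty can set coefficients exactly to zero, whereas the L2 penalty only shrinks them toward zero, which is why Lasso doubles as a feature selector.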

## Summary

Great! You now know how to perform Lasso and Ridge regression.