# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice what you've learned about Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [4]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [5]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing the columns with `dtype = object`, so that our first model contains only **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [50]:
# Keep only the numeric columns (drop every column with dtype object)
df = df.select_dtypes(exclude='object')
# Replace missing values with the median of each column
df = df.fillna(df.median())
    
    
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 38 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

Then split `df` into the target `y` and the predictors `X`.

In [51]:

y = df['SalePrice']
X = df.drop(columns= 'SalePrice')

## Let's use this data to perform a first naive linear regression model

Compute the R-squared and the MSE for both the training and test sets.

In [52]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fit the model and print MSE and R2 (in that order) for train and test

reg = LinearRegression().fit(X_train, y_train)

print(mean_squared_error(y_train,reg.predict(X_train)), r2_score(y_train,reg.predict(X_train)))
print(mean_squared_error(y_test,reg.predict(X_test)), r2_score(y_test,reg.predict(X_test)))

974663643.2969127 0.8467083612249391
2580625964.19573 0.5760379839338268
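Because `train_test_split` shuffles the rows at random, the numbers above change from run to run. The sketch below shows one way to make such a comparison repeatable: it fixes `random_state` and wraps the metric calls in a small helper. The helper name `report_metrics`, the `*_demo` variables, and the synthetic data from `make_regression` are assumptions made for illustration and are not part of the lab.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def report_metrics(model, X_train, X_test, y_train, y_test):
    """Return MSE and R-squared on the train and test sets (hypothetical helper)."""
    results = {}
    for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        predictions = model.predict(X_part)
        results[name] = {
            "mse": mean_squared_error(y_part, predictions),
            "r2": r2_score(y_part, predictions),
        }
    return results


# Synthetic stand-in data so the sketch runs without the housing CSV
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# random_state fixes the shuffle, so repeated runs give the same split
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_reg = LinearRegression().fit(Xd_train, yd_train)
metrics = report_metrics(demo_reg, Xd_train, Xd_test, yd_train, yd_test)
for split_name, values in metrics.items():
    print(split_name, values["mse"], values["r2"])
```

Passing the same `random_state` to every split in the lab would make the models easier to compare, since they would all be evaluated on the same test rows.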


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [60]:
from sklearn import preprocessing

# Scale the data and perform train test split

X_new = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2)
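`preprocessing.scale(X)` standardizes every column using the mean and standard deviation of all rows, including the rows that later land in the test set. A common variation, which is not what this lab does, is to split first and learn the scaling statistics from the training rows only. The sketch below illustrates that idea with `StandardScaler` on synthetic data; the `*_demo` names and the generated data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric predictors (not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(Xd_train)
Xd_train_scaled = scaler.transform(Xd_train)
# Reuse the training statistics on the test rows
Xd_test_scaled = scaler.transform(Xd_test)

print(Xd_train_scaled.mean(axis=0))  # close to 0 for every column
print(Xd_train_scaled.std(axis=0))   # close to 1 for every column
```

The test columns will generally not have a mean of exactly 0 after this transformation, which is expected: they are scaled with statistics the model could legitimately know at training time.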

Perform the same linear regression on this data and print out R-squared and MSE.

In [61]:
# Your code here
reg = LinearRegression().fit(X_train, y_train)

print(mean_squared_error(y_train,reg.predict(X_train)), r2_score(y_train,reg.predict(X_train)))
print(mean_squared_error(y_test,reg.predict(X_test)), r2_score(y_test,reg.predict(X_test)))

1296016143.3832564 0.800732075597373
738658724.8817906 0.8654411636085992


## Include dummy variables

We haven't included dummy variables so far. Let's bring back our `object` columns and create dummy variables from them.

In [67]:
# Create X_cat which contains only the categorical variables
X_cat = pd.read_csv('Housing_Prices/train.csv').select_dtypes(include='object')
X_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 43 columns):
MSZoning         1460 non-null object
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422

In [70]:
# Make dummies
X_cat = pd.get_dummies(X_cat)

Merge `X_cat` with our scaled `X` so you have one big predictor DataFrame.

In [75]:
# Your code here
X_all = pd.concat([pd.DataFrame(X_new), X_cat], axis = 1)

Perform the same linear regression on this data and print out R-squared and MSE.

In [76]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.2)
reg = LinearRegression().fit(X_train, y_train)

print(mean_squared_error(y_train,reg.predict(X_train)), r2_score(y_train,reg.predict(X_train)))
print(mean_squared_error(y_test,reg.predict(X_test)), r2_score(y_test,reg.predict(X_test)))

389820661.12585616 0.9367952544806524
1.8675718182925666e+29 -2.735278612043331e+19


Notice the severe overfitting above: our training R-squared is quite high, but the testing R-squared is hugely negative, so our predictions on unseen data are far off. Similarly, the testing MSE is many orders of magnitude higher than the training MSE.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression. Each time, look at R-squared and MSE.
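For reference, the objectives that scikit-learn documents for its `Lasso` and `Ridge` estimators can be written as below. The symbols are notation chosen here, not taken from the lab: $w$ is the coefficient vector, $X$ the predictor matrix, $y$ the target, $n$ the number of training samples, and $\alpha$ the `alpha` parameter. The intercept is not penalized.

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
$$

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
$$

The L1 penalty in Lasso can push coefficients to exactly zero, while the squared L2 penalty in Ridge shrinks them toward zero without usually reaching it. Note that the two objectives scale the squared-error term differently, so the same value of `alpha` does not mean the same amount of regularization in both models.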

## Lasso

With default parameter (alpha = 1)

In [78]:
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso() 
lasso.fit(X_train, y_train)

print(lasso.score(X_train, y_train), mean_squared_error(y_train, lasso.predict(X_train)))
print(lasso.score(X_test, y_test), mean_squared_error(y_test, lasso.predict(X_test)))

0.9386568065718848 378339379.76064146
0.7140117698602677 1952647736.214857


With a higher regularization parameter (alpha = 10)

In [79]:
lasso = Lasso(alpha=10) 
lasso.fit(X_train, y_train)

print(lasso.score(X_train, y_train), mean_squared_error(y_train, lasso.predict(X_train)))
print(lasso.score(X_test, y_test), mean_squared_error(y_test, lasso.predict(X_test)))

0.936568726491439 391217791.8285134
0.6741193628434439 2225021946.2146997


## Ridge

With default parameter (alpha = 1)

In [81]:
ridge = Ridge()
ridge.fit(X_train, y_train)
print(mean_squared_error(y_train, ridge.predict(X_train)), ridge.score(X_train, y_train))
print(mean_squared_error(y_test, ridge.predict(X_test)), ridge.score(X_test, y_test))

401835368.15735364 0.9348472138144083
2168920497.0110693 0.6823360800057119


With a higher regularization parameter (alpha = 10)

In [82]:
ridge = Ridge(alpha=10)
ridge.fit(X_train, y_train)
print(mean_squared_error(y_train, ridge.predict(X_train)), ridge.score(X_train, y_train))
print(mean_squared_error(y_test, ridge.predict(X_test)), ridge.score(X_test, y_test))

477205445.1138901 0.9226268596647509
2076257605.6120121 0.6959076504530354
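Rather than refitting by hand for each value of `alpha`, the four fits above can be written as a loop. The following is a minimal sketch on synthetic data, assuming a hypothetical grid of `alpha` values; the `*_demo` names, the grid, and the data from `make_regression` are illustrative and not part of the lab.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; only 5 of the 30 features carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=20.0, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical grid of regularization strengths
alphas = [0.1, 1, 10, 100]
test_scores = {}
for model_class in (Lasso, Ridge):
    for alpha in alphas:
        model = model_class(alpha=alpha).fit(Xd_train, yd_train)
        train_r2 = model.score(Xd_train, yd_train)
        test_r2 = model.score(Xd_test, yd_test)
        test_scores[(model_class.__name__, alpha)] = test_r2
        # Print model name, alpha, train R-squared, test R-squared
        print(model_class.__name__, alpha, train_r2, test_r2)
```

Because every model in the loop is fit and scored on the same split, differences in the test R-squared reflect the model and `alpha`, not a different random partition of the data.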


## Look at the metrics, what are your main conclusions?

In [83]:
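# Lasso with an alpha of 1 gave us the best R^2 on the test set (about 0.71). Changing alpha from 1 to 10 did not change the values drastically for either model, and every regularized model generalized far better than the unregularized regression fit on the same split.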


## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [85]:
# number of Ridge params almost zero
sum(abs(ridge.coef_) < 0.0000001)

7

In [90]:
# number of Lasso params almost zero
sum(abs(lasso.coef_) < 0.0000001)

72

Compare with the total length of the parameter space and draw conclusions!

In [94]:
print(sum(abs(lasso.coef_) < 0.0000001)/len(lasso.coef_))
# With alpha = 10, Lasso set about 25% of the coefficients (72 of 289) to (almost) zero, so it relies on only about 75% of the variables. Ridge did so for just 7 coefficients.

0.2491349480968858
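The same count can be written with NumPy, which also gives the share of near-zero coefficients in one step. This is a minimal sketch on synthetic data: the helper name `zero_fraction`, the `*_demo` variables, and the data from `make_regression` are assumptions for illustration, while the tolerance mirrors the `0.0000001` threshold used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data; only 5 of the 30 features carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=20.0, random_state=0
)

lasso_demo = Lasso(alpha=10).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=10).fit(X_demo, y_demo)


def zero_fraction(coefficients, tol=1e-7):
    """Return the count and the share of coefficients with absolute value below tol."""
    near_zero = np.abs(coefficients) < tol
    return int(near_zero.sum()), float(near_zero.mean())


lasso_zero_count, lasso_zero_share = zero_fraction(lasso_demo.coef_)
ridge_zero_count, ridge_zero_share = zero_fraction(ridge_demo.coef_)

print("Lasso:", lasso_zero_count, "of", len(lasso_demo.coef_), "share", lasso_zero_share)
print("Ridge:", ridge_zero_count, "of", len(ridge_demo.coef_), "share", ridge_zero_share)
```

Taking the mean of a boolean array gives the fraction of `True` entries, so `near_zero.mean()` is equivalent to dividing the count by `len(coef_)` as done in the cell above.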


## Summary

Great! You now know how to perform Lasso and Ridge regression.