# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice applying Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
#import warnings
#warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing the features with `dtype = object`. This way, our first model only contains **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [3]:
# Load necessary packages

In [4]:
# Remove "object"-type features and SalePrice from `X`
X = df.select_dtypes(exclude='object').copy()
X.drop(['SalePrice'], axis=1, inplace=True)

# Impute null values:
# for each column, compute the median and use it to fill missing values
for col in X.columns:
    med = X[col].median()
    X[col] = X[col].fillna(med)

# Create y
y = df['SalePrice'].copy()

In [5]:
y.describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

Look at the information of `X` again

In [6]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute the R-squared and the MSE for both the train and test sets.

In [7]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

import numpy as np

In [8]:
np.random.seed(8675309)

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [9]:
# Fit the model and print R2 and MSE for train and test
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [10]:
train_r2 = lr.score(X_train, y_train)
test_r2 = lr.score(X_test, y_test)
train_mse = mean_squared_error(y_train, lr.predict(X_train))
test_mse = mean_squared_error(y_test, lr.predict(X_test))

In [11]:
print("Train\nR2 : {:.3f}\nMSE :${:,.0f}".format(train_r2, train_mse)) #${:,.2f}
print("Test\nR2 : {:.3f}\nMSE :${:,.0f}".format(test_r2, test_mse))

Train
R2 : 0.823
MSE :$1,056,138,397
Test
R2 : 0.776
MSE :$1,640,168,131


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.StandardScaler` to scale our predictors!

In [12]:
from sklearn import preprocessing

In [13]:
# Scale the data, fitting the scaler on the training split only
sc = preprocessing.StandardScaler()
sc.fit(X_train)

X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)

Perform the same linear regression on this data and print out R-squared and MSE.

In [14]:
# Fit the model and print R2 and MSE for train and test
lr_sc = LinearRegression()
lr_sc.fit(X_train_scaled, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [15]:
train_sc_r2 = lr_sc.score(X_train_scaled, y_train)
test_sc_r2 = lr_sc.score(X_test_scaled, y_test)
train_sc_mse = mean_squared_error(y_train, lr_sc.predict(X_train_scaled))
test_sc_mse = mean_squared_error(y_test, lr_sc.predict(X_test_scaled))

In [16]:
print("Train\nR2 : {:.3f}\nMSE :${:,.0f}".format(train_sc_r2, train_sc_mse)) #${:,.2f}
print("Test\nR2 : {:.3f}\nMSE :${:,.0f}".format(test_sc_r2, test_sc_mse))

Train
R2 : 0.823
MSE :$1,056,304,708
Test
R2 : 0.776
MSE :$1,639,533,936


## Include dummy variables

We haven't included dummy variables so far. Let's use our "object" variables again and create dummies.

To make the downstream scaling and dummy encoding easier, I'm going to redo the train-test split on the whole original dataframe.

In [127]:
# Break up into continuous and categorical columns
X_cont = df.select_dtypes(exclude='object').copy()
X_cat = df.select_dtypes(include='object').copy()

# Remove target variable
X_cont.drop(['SalePrice'], axis=1, inplace=True)

Binarize all the categorical variables together, before splitting. If we encoded train and test separately, one set might contain category values the other lacks, and the resulting dummy columns would not match.

In [128]:
# dummify the categorical variables
X_cat_dum = pd.get_dummies(X_cat, drop_first=True)
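
To see why the encoding is done on the full dataframe, here is a minimal sketch on a made-up toy column (the `toy` frame and its values are assumptions for illustration, not part of the housing data). Encoding each half separately produces different column sets, while encoding everything at once does not.

```python
import pandas as pd

# Toy categorical column; the values are invented for illustration only.
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave"]})

# Encode each half separately: the second half never sees "Grvl",
# so it ends up with fewer dummy columns than the first half.
first_half = pd.get_dummies(toy.iloc[:2])
second_half = pd.get_dummies(toy.iloc[2:])

# Encode the whole column at once: every row gets the same columns.
together = pd.get_dummies(toy)

print("first half :", list(first_half.columns))
print("second half:", list(second_half.columns))
print("together   :", list(together.columns))
```

A model trained on one column set cannot predict on a different one, which is why the dummies are created before the split.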

Now, go back to the continuous dataset and train/test split it, along with a created index.
We'll use the index to make sure we can track the splits until we join continuous and categorical data back together.

In [129]:
# Split the continuous datasets
np.random.seed(1111)
X_cont_train, X_cont_test, y_train, y_test, index_train, index_test = train_test_split(
    X_cont,
    df['SalePrice'],
    range(0, len(df['SalePrice'])),
)

Fit scaler on TRAINING data only, but then run transforms on both train and test based on that fit.

In [132]:
# Run scaling for continuous features
# Create object
scaler = preprocessing.StandardScaler()

# Fit to training data
scaler.fit(X_cont_train)

# Scale train and test data
X_sc_train = scaler.transform(X_cont_train)
X_sc_test = scaler.transform(X_cont_test)


Convert back to a dataframe so we can easily concatenate it with the categorical data.

In [133]:
X_sc_train = pd.DataFrame(X_sc_train, columns=X_cont.columns, index=index_train)
X_sc_test = pd.DataFrame(X_sc_test, columns=X_cont.columns, index=index_test)

Merge `X_cat_dum` together with our scaled `X` so you have one big predictor dataframe.

Use an inner join so that each split keeps only the rows of the dummified data that share its index.

In [134]:
X_train = pd.concat([X_sc_train, X_cat_dum], axis=1, join='inner')
X_test = pd.concat([X_sc_test, X_cat_dum], axis=1, join='inner')


Perform the same linear regression on this data and print out R-squared and MSE.

In [None]:
# Your code here
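
One possible way to organize this step is a small helper that fits a model and returns R-squared and MSE for both splits. The sketch below is not the lab's solution: `fit_and_score` is a hypothetical helper name, and the data is synthetic (more features than training rows) so the block runs on its own. In the lab you would pass `X_train`, `y_train`, `X_test`, and `y_test` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_and_score(model, X_tr, y_tr, X_te, y_te):
    """Fit `model` on the training split; return R2 and MSE for both splits."""
    model.fit(X_tr, y_tr)
    return {
        "train_r2": model.score(X_tr, y_tr),
        "test_r2": model.score(X_te, y_te),
        "train_mse": mean_squared_error(y_tr, model.predict(X_tr)),
        "test_mse": mean_squared_error(y_te, model.predict(X_te)),
    }


# Synthetic stand-in data (NOT the housing data): 70 features but only
# 60 training rows, so plain least squares can fit the training set exactly.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 70))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scores = fit_and_score(LinearRegression(), X_tr, y_tr, X_te, y_te)
for name, value in scores.items():
    print(f"{name}: {value:,.3f}")
```

On this toy data the training fit is essentially perfect while the test fit is worse, which is the same train/test gap to look for on the housing data.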

Notice the severe overfitting when you fit the housing data: the training R-squared is quite high, but the testing R-squared is negative! Our predictions are far off. Similarly, the testing MSE is orders of magnitude higher than the training MSE.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression! Each time, look at R-squared and MSE.

## Lasso

With default parameter (alpha = 1)

In [None]:
# Your code here

With a higher regularization parameter (alpha = 10)

In [None]:
# Your code here
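
As a rough illustration of the two Lasso fits, the sketch below uses synthetic data in which only the first three of 20 features matter (an assumption made so the block is self-contained; it is not the housing data). In the lab, `X_tr`, `y_tr`, `X_te`, and `y_te` would be replaced by the combined `X_train`, `y_train`, `X_test`, and `y_test`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: only features 0, 1 and 2 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 20))
y_demo = (50 * X_demo[:, 0] + 20 * X_demo[:, 1] + 5 * X_demo[:, 2]
          + rng.normal(size=400))
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

lasso_results = {}
for alpha in (1.0, 10.0):
    lasso = Lasso(alpha=alpha).fit(X_tr, y_tr)
    lasso_results[alpha] = {
        "train_r2": lasso.score(X_tr, y_tr),
        "test_r2": lasso.score(X_te, y_te),
        "test_mse": mean_squared_error(y_te, lasso.predict(X_te)),
        # Lasso sets some coefficients exactly to zero.
        "n_zero": int(np.sum(lasso.coef_ == 0)),
    }
    print(alpha, lasso_results[alpha])
```

A larger `alpha` means a stronger L1 penalty, so the training fit gets looser and coefficients are pushed toward zero.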

## Ridge

With default parameter (alpha = 1)

In [None]:
# Your code here

With a higher regularization parameter (alpha = 10)

In [None]:
# Your code here
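
The Ridge fits follow the same pattern. The sketch below again uses synthetic stand-in data (an assumption, not the housing data) and additionally tracks the size of the coefficient vector, since Ridge shrinks coefficients rather than removing them.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: only features 0, 1 and 2 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 20))
y_demo = (50 * X_demo[:, 0] + 20 * X_demo[:, 1] + 5 * X_demo[:, 2]
          + rng.normal(size=400))
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

ridge_results = {}
for alpha in (1.0, 10.0):
    ridge = Ridge(alpha=alpha).fit(X_tr, y_tr)
    ridge_results[alpha] = {
        "train_r2": ridge.score(X_tr, y_tr),
        "test_r2": ridge.score(X_te, y_te),
        "test_mse": mean_squared_error(y_te, ridge.predict(X_te)),
        # L2 norm of the coefficients: shrinks as alpha grows.
        "coef_norm": float(np.linalg.norm(ridge.coef_)),
    }
    print(alpha, ridge_results[alpha])
```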

## Look at the metrics: what are your main conclusions?

Conclusions here

## Compare the number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [None]:
# number of Ridge params almost zero

In [None]:
# number of Lasso params almost zero

Compare with the total number of parameters and draw conclusions!

In [None]:
# your code here
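
One way to do this comparison is to count the coefficients whose absolute value falls below a small tolerance. In the sketch below, the cutoff `TOL = 1e-10`, the helper name `count_near_zero`, and the synthetic data are all assumptions for illustration; on the housing data you would use your fitted Ridge and Lasso models instead.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data: only features 0, 1 and 2 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 20))
y_demo = (50 * X_demo[:, 0] + 20 * X_demo[:, 1] + 5 * X_demo[:, 2]
          + rng.normal(size=300))

TOL = 1e-10  # assumed cutoff for "very close to 0"


def count_near_zero(model, tol=TOL):
    """Count fitted coefficients whose absolute value is below `tol`."""
    return int(np.sum(np.abs(model.coef_) < tol))


ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=10.0).fit(X_demo, y_demo)

n_params = len(lasso.coef_)
n_ridge_zero = count_near_zero(ridge)
n_lasso_zero = count_near_zero(lasso)

print(f"Ridge: {n_ridge_zero} of {n_params} coefficients near zero")
print(f"Lasso: {n_lasso_zero} of {n_params} coefficients near zero")
```

Ridge typically leaves every coefficient small but nonzero, while Lasso drops many of them entirely, so Lasso can double as a feature selection step.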

## Summary

Great! You now know how to perform Lasso and Ridge regression.