In [1]:
%reload_ext nb_black

## Day 28 Lecture 2 Assignment

In this assignment, we will learn about overfitting and regularization. We will use the King County housing dataset loaded below and analyze regression models fit to it.

In [2]:
import numpy as np
import pandas as pd

from sklearn.linear_model import Lasso, Ridge, ElasticNet, LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
df = pd.read_csv(
    "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/kc_house_data.csv"
)

In [4]:
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


Perform the same transformations as in the previous assignment to meet model assumptions:
1. Remove all columns except: price, bedrooms, bathrooms, sqft_living, floors, waterfront
1. Remove outliers
1. Split the data into train and test subsets. 20% of the data should be in the test subset

In [5]:
keep_cols = ["price", "bedrooms", "bathrooms", "sqft_living", "floors", "waterfront"]
drop_rows = [15870, 12777]

In [6]:
df = df[keep_cols]
df = df.drop(drop_rows)

In [7]:
X = df.drop(columns="price")
y = df["price"]

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [9]:
X_train.columns

Index(['bedrooms', 'bathrooms', 'sqft_living', 'floors', 'waterfront'], dtype='object')

In [10]:
num_cols = ["bedrooms", "bathrooms", "sqft_living", "floors"]
bin_cols = ["waterfront"]
ct = ColumnTransformer([("scale", StandardScaler(), num_cols)], remainder="passthrough")

ct.fit(X_train)

X_train = ct.transform(X_train)
X_test = ct.transform(X_test)
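
The scaling step above can be illustrated on a tiny made-up frame. This is only a sketch: the values and the names `toy`, `toy_ct`, and `toy_scaled` are hypothetical and are not part of the assignment data. It shows that the columns handed to `StandardScaler` come out first, with the passthrough column appended at the end.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not the King County dataset.
# The binary column is deliberately placed first to show the reordering.
toy = pd.DataFrame(
    {
        "waterfront": [0, 0, 1, 0],
        "sqft_living": [800.0, 1200.0, 2000.0, 3000.0],
        "bedrooms": [1.0, 2.0, 3.0, 4.0],
    }
)

toy_ct = ColumnTransformer(
    [("scale", StandardScaler(), ["sqft_living", "bedrooms"])],
    remainder="passthrough",
)
toy_scaled = toy_ct.fit_transform(toy)

# Output column order: scaled sqft_living, scaled bedrooms, then waterfront.
print(toy_scaled.round(2))
```

In the notebook, the transformer is fit on `X_train` only and then applied to `X_test`, so the test rows do not influence the scaling statistics.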

Apply a ridge regression model with lambda=50 (the `alpha` parameter in scikit-learn) to the data and evaluate it by comparing R squared on the train and test sets.

In [11]:
model = Ridge(alpha=50)
model.fit(X_train, y_train)

print(f"train score: {model.score(X_train, y_train)}")
print(f"test score: {model.score(X_test, y_test)}")

train score: 0.5375237713218302
test score: 0.546853409637465
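
To see what the penalty strength does, the following sketch fits ridge models with several `alpha` values on synthetic data and prints the norm of the coefficient vector. The data, `true_coef`, and the chosen alphas are assumptions made for illustration; only the `Ridge` estimator itself comes from the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=200)

# A larger alpha means a stronger L2 penalty, which shrinks the coefficients.
norms = {}
for alpha in [0.01, 1, 50, 1000]:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    norms[alpha] = np.linalg.norm(ridge.coef_)
    print(f"alpha={alpha:>7}: ||coef|| = {norms[alpha]:.3f}")
```

The printed norms should decrease as `alpha` grows, which is the shrinkage that regularization relies on to limit overfitting.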


In [12]:
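# Prices are right-skewed, so a log transform of the target is worth trying.
# Here it gives no benefit. Note that this R^2 is computed on the log scale,
# so it is not directly comparable to the R^2 on raw prices above.
model = Ridge(alpha=50)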
model.fit(X_train, np.log(y_train))

print(f"train score: {model.score(X_train, np.log(y_train))}")
print(f"test score: {model.score(X_test, np.log(y_test))}")

train score: 0.5021343369569861
test score: 0.5133870774899247
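
One way to score a log-target model in the original units is scikit-learn's `TransformedTargetRegressor`, which applies `func` before fitting and `inverse_func` to the predictions. The sketch below uses synthetic, strictly positive data; the data and the names `log_ridge`, `r2_original_scale`, and `r2_plain` are assumptions for illustration, and no claim is made about which model would do better on the housing data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic, right-skewed, strictly positive target (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
log_y = 12 + X_demo @ np.array([0.4, 0.2, -0.1]) + rng.normal(scale=0.3, size=300)
y_demo = np.exp(log_y)

# Fit on log(y); predictions are mapped back with exp automatically.
log_ridge = TransformedTargetRegressor(
    regressor=Ridge(alpha=50), func=np.log, inverse_func=np.exp
)
log_ridge.fit(X_demo, y_demo)

# Both scores are R^2 on the original scale, so they can be compared.
r2_original_scale = log_ridge.score(X_demo, y_demo)
r2_plain = Ridge(alpha=50).fit(X_demo, y_demo).score(X_demo, y_demo)
print(f"log-target ridge R^2 (original scale): {r2_original_scale:.3f}")
print(f"plain ridge R^2:                       {r2_plain:.3f}")
```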


Perform a grid search over the following values of alpha: 0.001, 0.01, 0.1, 1, 10, 100, 1000 to find the best ridge regression model. Experiment with different scoring metrics in the grid search (R^2 is the default, but you can use root mean squared error or many others).
https://scikit-learn.org/stable/modules/model_evaluation.html

In [13]:
grid = {"alpha": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

model = GridSearchCV(Ridge(), grid)
model.fit(X_train, y_train)

print(f"train score: {model.score(X_train, y_train)}")
print(f"test score: {model.score(X_test, y_test)}")

train score: 0.5401652161501249
test score: 0.5514833360299165
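
As a sketch of the "different scoring metrics" suggestion, the grid search can be scored by root mean squared error using scikit-learn's `"neg_root_mean_squared_error"` scorer. The synthetic data and the names `search` and `alphas` are assumptions for illustration. Scorers follow a "higher is better" convention, which is why the RMSE is reported negated.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(size=200)

alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
search = GridSearchCV(
    Ridge(),
    {"alpha": alphas},
    scoring="neg_root_mean_squared_error",  # negated so that larger is better
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
# Flip the sign to report the cross-validated RMSE of the best alpha.
print(f"cross-validated RMSE: {-search.best_score_:.3f}")
```

With a custom `scoring`, `search.score(...)` also returns the negated RMSE rather than R^2, so train and test scores from different scorers should not be compared directly.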


Consider an elastic net to compare lasso and ridge penalties: l1_ratio=1.0 is pure lasso, and smaller values mix in more of the ridge penalty.

In [14]:
grid = {
    "alpha": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "l1_ratio": [0.25, 0.5, 0.75, 1.0],
}

model = GridSearchCV(ElasticNet(), grid)
model.fit(X_train, y_train)

print(f"train score: {model.score(X_train, y_train)}")
print(f"test score: {model.score(X_test, y_test)}")

train score: 0.540167131062818
test score: 0.5515329406880036


In [15]:
# From the docs:
# The parameter l1_ratio corresponds to alpha in the glmnet R package while
# alpha corresponds to the lambda parameter in glmnet. Specifically, l1_ratio = 1
# is the lasso penalty.

# In other words, the grid search selected a pure lasso penalty (l1_ratio = 1.0)
model.best_params_

{'alpha': 10, 'l1_ratio': 1.0}
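
The statement that `l1_ratio = 1` is the lasso penalty can be checked directly. The sketch below, on synthetic data with assumed coefficients and an arbitrary `alpha=0.5`, fits `ElasticNet(l1_ratio=1.0)` and `Lasso` with the same `alpha` and compares the coefficients.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression problem with two irrelevant features (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=200)

# Same alpha, with the elastic net mixing parameter set to pure L1.
enet = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

print("elastic net:", enet.coef_.round(3))
print("lasso:      ", lasso.coef_.round(3))
```

The two coefficient vectors should match. Unlike ridge, the L1 penalty can set some coefficients exactly to zero, so a lasso result also acts as a form of feature selection.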
