### Multiple Regression, Part 3: Regularized Linear Models

In this notebook, we'll be working with data on sales prices for homes in King County, Washington. This dataset was obtained from https://www.kaggle.com/harlfoxem/housesalesprediction.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedKFold, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from sklearn.compose import TransformedTargetRegressor, ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, PowerTransformer
from sklearn.pipeline import Pipeline

In [2]:
kc = pd.read_csv('../data/kc_house_data.csv')

First, let's take a look at the dataset.

In [3]:
kc.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [4]:
kc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

The columns are as follows:

* id - Unique ID for each home sold
* date - Date of the home sale
* price - Price of each home sold
* bedrooms - Number of bedrooms
* bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
* sqft_living - Square footage of the apartment's interior living space
* sqft_lot - Square footage of the land space
* floors - Number of floors
* waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not
* view - An index from 0 to 4 of how good the view of the property was
* condition - An index from 1 to 5 on the condition of the apartment
* grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
* sqft_above - The square footage of the interior housing space that is above ground level
* sqft_basement - The square footage of the interior housing space that is below ground level
* yr_built - The year the house was initially built
* yr_renovated - The year of the house’s last renovation
* zipcode - What zipcode area the house is in
* lat - Latitude
* long - Longitude
* sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
* sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

Our goal in this notebook is to improve our best model by engineering some additional features.

We'll first add a couple of columns that we haven't used yet - date and yr_renovated.

In [5]:
X = kc[['date', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15']]
y = kc['price']

We can break out the date into the year and month pieces in case time of year is relevant for sales price.

In [6]:
# Work on an explicit copy so the column assignments below
# don't trigger pandas' SettingWithCopyWarning.
X = X.copy()
X['date'] = pd.to_datetime(X['date'])

In [7]:
X['sales_year'] = X['date'].dt.year
X['sales_month'] = X['date'].dt.month


Also, the yr_renovated column contains a 0 for any house that was never renovated. We can make use of this column by converting it and the yr_built column into an age at sale and a number of years since renovation (or since construction, for homes that were never renovated).

In [8]:
X['age_at_sale'] = X['sales_year'] - X['yr_built']
X['years_since_renovation'] = X['sales_year'] - np.max(X[['yr_built', 'yr_renovated']], axis = 1)
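
To make the renovation logic concrete, here is a minimal sketch on two made-up rows (the `toy` DataFrame and the explicit `format` string are assumptions for illustration, not part of the notebook's data pipeline). Because a never-renovated home has `yr_renovated == 0`, the row-wise maximum falls back to `yr_built`, so its years since renovation equals its age at sale.

```python
import pandas as pd

# Toy rows (hypothetical): one never-renovated home, one renovated in 1991.
toy = pd.DataFrame({
    'date': ['20141013T000000', '20150225T000000'],
    'yr_built': [1955, 1951],
    'yr_renovated': [0, 1991],
})

# Parse the sale date; the format string matches the layout seen in kc.head().
toy['date'] = pd.to_datetime(toy['date'], format='%Y%m%dT%H%M%S')
toy['sales_year'] = toy['date'].dt.year
toy['sales_month'] = toy['date'].dt.month

toy['age_at_sale'] = toy['sales_year'] - toy['yr_built']

# Row-wise max: picks yr_renovated when it is a real year, yr_built when it is 0.
last_update = toy[['yr_built', 'yr_renovated']].max(axis=1)
toy['years_since_renovation'] = toy['sales_year'] - last_update

print(toy[['sales_year', 'sales_month', 'age_at_sale', 'years_since_renovation']])
```

The first row ends up with 59 for both derived features, while the second has an age of 64 but only 24 years since renovation.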

Now, we can drop the original features that we used to do these calculations.

In [9]:
X = X.drop(columns = ['date', 'yr_built', 'yr_renovated'])

Finally, we'll add a boolean feature that indicates whether or not a home has a basement by checking whether the sqft_basement feature is nonzero.

In [10]:
X['has_basement'] = X['sqft_basement'] > 0

Note that when we use `get_dummies` we'll need to also include the sales_year and sales_month features that we just created.

In [11]:
X = pd.get_dummies(X, columns = ['zipcode', 'sales_year', 'sales_month'], drop_first = True)
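
As a small illustration of what this call produces, the sketch below runs `get_dummies` on a made-up four-row frame (the `toy` data is an assumption for illustration). With `drop_first = True`, the lowest category of each encoded column is dropped and acts as the baseline.

```python
import pandas as pd

# Toy frame (hypothetical values) with two categorical-like integer columns.
toy = pd.DataFrame({
    'sqft_living': [1180, 2570, 770, 1960],
    'zipcode': [98178, 98125, 98028, 98178],
    'sales_year': [2014, 2014, 2015, 2014],
})

# Encode the listed columns; untouched columns are kept as-is.
encoded = pd.get_dummies(toy, columns=['zipcode', 'sales_year'], drop_first=True)

print(encoded.columns.tolist())
# ['sqft_living', 'zipcode_98125', 'zipcode_98178', 'sales_year_2015']
```

Here zipcode 98028 and sales year 2014 are the dropped baselines, so a row with all dummy columns equal to zero represents a 2014 sale in 98028.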

In [12]:
cv = RepeatedKFold(n_splits = 5, n_repeats = 10, random_state = 321)

ct = ColumnTransformer(transformers = [
    ('yeo-johnson', PowerTransformer(), ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement']),
    ('minmax', MinMaxScaler(), ['lat', 'long']),
    ('poly', PolynomialFeatures(degree = 2, include_bias = False), 
     ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'bedrooms', 'bathrooms', 'floors', 'lat', 'long'])
], remainder = 'passthrough')

pipe = Pipeline(steps = [
    ('transform', ct),
    ('linear', LinearRegression())
])

ttr = TransformedTargetRegressor(regressor = pipe, func = np.log, inverse_func = np.exp)

cvs = cross_val_score(estimator = ttr, X = X, y = y, 
                      scoring = 'neg_mean_absolute_error',
                     cv = cv, n_jobs = -1)

In [13]:
-cvs.mean()

69903.37705873967
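
The leading minus sign above is needed because scikit-learn scorers follow a "higher is better" convention, so `'neg_mean_absolute_error'` returns the negated MAE. A minimal sketch on synthetic data (the `make_regression` dataset and the `_demo` names are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression problem standing in for the housing data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

cv_demo = RepeatedKFold(n_splits=5, n_repeats=2, random_state=321)
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         scoring='neg_mean_absolute_error', cv=cv_demo)

print(scores.shape)    # one score per fold per repeat: (10,)
print(scores.max())    # every score is <= 0 under this scorer

# Flip the sign to report the usual (positive) mean absolute error.
mae = -scores.mean()
print(mae)
```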

In [14]:
cv = RepeatedKFold(n_splits = 5, n_repeats = 10, random_state = 321)

ct = ColumnTransformer(transformers = [
    ('yeo-johnson', PowerTransformer(), ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement']),
    ('minmax', MinMaxScaler(), ['lat', 'long']),
    ('poly', PolynomialFeatures(degree = 2, include_bias = False), 
     ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'bedrooms', 'bathrooms', 'floors', 'lat', 'long'])
], remainder = 'passthrough')

pipe = Pipeline(steps = [
    ('transform', ct),
    ('linear', LinearRegression())
])

ttr = TransformedTargetRegressor(regressor = pipe, func = np.log, inverse_func = np.exp)

cvs = cross_val_score(estimator = ttr, X = X, y = y, 
                      scoring = 'r2',
                     cv = cv, n_jobs = -1)

In [15]:
cvs.mean()

0.8895589121150461
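
Both cross-validation cells wrap the pipeline in a `TransformedTargetRegressor`, which fits the model on `log(price)` and maps predictions back with `exp`, so the scores are computed on the original dollar scale. The sketch below, on synthetic data with hypothetical `_demo` names, checks that this matches doing the transformation by hand.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive target that is linear on the log scale.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 2, size=(100, 1))
y_demo = np.exp(1.0 + 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=100))

# Wrapped version: log is applied before fitting, exp after predicting.
ttr_demo = TransformedTargetRegressor(regressor=LinearRegression(),
                                      func=np.log, inverse_func=np.exp)
ttr_demo.fit(X_demo, y_demo)

# Manual version of the same two steps.
manual = LinearRegression().fit(X_demo, np.log(y_demo))
manual_pred = np.exp(manual.predict(X_demo))

same = np.allclose(ttr_demo.predict(X_demo), manual_pred)
print(same)
```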

Let's see if we can't do better by applying some regularization to our model.

We'll be using scikit-learn's RidgeCV, LassoCV, and ElasticNetCV estimators.

These classes fit Ridge, Lasso, and ElasticNet models and use cross-validation to tune the regularization strength automatically.

In [65]:
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
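
As a rough sketch of what "automatic tuning" means here, `RidgeCV` tries each candidate in `alphas`, scores it by cross-validation, and refits with the winner, which is then available as `alpha_`. The synthetic dataset and the `_demo` names below are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

# Synthetic data standing in for the housing features.
X_demo, y_demo = make_regression(n_samples=200, n_features=20,
                                 noise=25.0, random_state=0)

alphas = [0.01, 0.1, 0.5, 1, 10, 25, 100, 1000]
ridge = RidgeCV(alphas=alphas,
                cv=KFold(n_splits=5, shuffle=True, random_state=321))
ridge.fit(X_demo, y_demo)

# The regularization strength that scored best across the folds.
print(ridge.alpha_)
```

If the selected value sits at either end of the grid, that is usually a hint to widen the range of candidates.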

Since we'll be searching for the best values for our hyperparameters, we ought to split off a test set for final model evaluation.

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)
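
Since no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. A quick check on a toy array (the `_demo` arrays are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows with 2 features each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=321)

print(len(X_tr), len(X_te))  # 75 25
```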

In [56]:
cv = KFold(n_splits = 5, shuffle = True, random_state = 321)

ct = ColumnTransformer(transformers = [
    ('yeo-johnson', PowerTransformer(), ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement']),
    ('minmax', MinMaxScaler(), ['lat', 'long']),
    ('poly', PolynomialFeatures(degree = 2, include_bias = False), 
     ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'bedrooms', 'bathrooms', 'floors', 'lat', 'long'])
], remainder = 'passthrough')

pipe = Pipeline(steps = [
    ('transform', ct),
    ('scale', StandardScaler()),
    ('linear', RidgeCV(alphas = [0.01, 0.1, 0.5, 1, 10, 25, 100, 1000],
                       cv = cv))
])

ttr = TransformedTargetRegressor(regressor = pipe, func = np.log, inverse_func = np.exp)

In [57]:
ttr.fit(X_train, y_train)

TransformedTargetRegressor(func=<ufunc 'log'>, inverse_func=<ufunc 'exp'>,
                           regressor=Pipeline(steps=[('transform',
                                                      ColumnTransformer(remainder='passthrough',
                                                                        transformers=[('yeo-johnson',
                                                                                       PowerTransformer(),
                                                                                       ['sqft_living',
                                                                                        'sqft_lot',
                                                                                        'sqft_above',
                                                                                        'sqft_basement']),
                                                                                      ('minmax',
                                           

In [58]:
mean_absolute_error(y_test, ttr.predict(X_test))

69015.7906397141
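
To see which regularization strength was chosen, we can reach through the fitted wrapper: `regressor_` holds the fitted pipeline and `named_steps` gives access to the final estimator. The sketch below shows the pattern on synthetic data with a smaller pipeline (the data, the alpha grid, and the `_demo` names are assumptions for illustration).

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features; exponentiate so the target is strictly positive.
X_demo, z = make_regression(n_samples=200, n_features=10,
                            noise=5.0, random_state=0)
y_demo = np.exp(z / z.std())

demo_alphas = [0.01, 0.1, 1, 10, 100]
pipe_demo = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('linear', RidgeCV(alphas=demo_alphas, cv=5)),
])
ttr_demo = TransformedTargetRegressor(regressor=pipe_demo,
                                      func=np.log, inverse_func=np.exp)
ttr_demo.fit(X_demo, y_demo)

# regressor_ is the fitted clone of the pipeline; pull out its last step.
fitted_ridge = ttr_demo.regressor_.named_steps['linear']
print(fitted_ridge.alpha_)
print(fitted_ridge.coef_.shape)
```

Applied to the notebook's model, the same lookup would be `ttr.regressor_.named_steps['linear'].alpha_`.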

In [62]:
cv = KFold(n_splits = 5, shuffle = True, random_state = 321)

ct = ColumnTransformer(transformers = [
    ('yeo-johnson', PowerTransformer(), ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement']),
    ('minmax', MinMaxScaler(), ['lat', 'long']),
    ('poly', PolynomialFeatures(degree = 2, include_bias = False), 
     ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'bedrooms', 'bathrooms', 'floors', 'lat', 'long'])
], remainder = 'passthrough')

pipe = Pipeline(steps = [
    ('transform', ct),
    ('scaler', StandardScaler()),
    ('linear', LassoCV(n_jobs = -1, max_iter = 5000, random_state = 321,
                       cv = cv))
])

ttr = TransformedTargetRegressor(regressor = pipe, func = np.log, inverse_func = np.exp)

In [63]:
ttr.fit(X_train, y_train)

TransformedTargetRegressor(func=<ufunc 'log'>, inverse_func=<ufunc 'exp'>,
                           regressor=Pipeline(steps=[('transform',
                                                      ColumnTransformer(remainder='passthrough',
                                                                        transformers=[('yeo-johnson',
                                                                                       PowerTransformer(),
                                                                                       ['sqft_living',
                                                                                        'sqft_lot',
                                                                                        'sqft_above',
                                                                                        'sqft_basement']),
                                                                                      ('minmax',
                                           

In [64]:
mean_absolute_error(y_test, ttr.predict(X_test))

69624.2777682329
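
Unlike the RidgeCV cell, no `alphas` list was passed to `LassoCV`; by default it builds its own grid of 100 candidates from the training data. The lasso penalty can also set some coefficients exactly to zero. A minimal sketch on synthetic data (the dataset and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# 30 features, only 5 of which actually drive the target.
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

lasso = LassoCV(cv=KFold(n_splits=5, shuffle=True, random_state=321),
                max_iter=5000, random_state=321)
lasso.fit(X_scaled, y_demo)

print(lasso.alpha_)                      # selected regularization strength
print(len(lasso.alphas_))                # size of the automatically built grid
print(int(np.sum(lasso.coef_ == 0)))     # coefficients shrunk exactly to zero
```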

In [66]:
cv = KFold(n_splits = 5, shuffle = True, random_state = 321)

ct = ColumnTransformer(transformers = [
    ('yeo-johnson', PowerTransformer(), ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement']),
    ('minmax', MinMaxScaler(), ['lat', 'long']),
    ('poly', PolynomialFeatures(degree = 2, include_bias = False), 
     ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'bedrooms', 'bathrooms', 'floors', 'lat', 'long'])
], remainder = 'passthrough')

pipe = Pipeline(steps = [
    ('transform', ct),
    ('scaler', StandardScaler()),
    ('enet', ElasticNetCV(n_jobs = -1, max_iter = 5000, random_state = 321,
                          cv = cv))
])

ttr = TransformedTargetRegressor(regressor = pipe, func = np.log, inverse_func = np.exp)

In [67]:
ttr.fit(X_train, y_train)

TransformedTargetRegressor(func=<ufunc 'log'>, inverse_func=<ufunc 'exp'>,
                           regressor=Pipeline(steps=[('transform',
                                                      ColumnTransformer(remainder='passthrough',
                                                                        transformers=[('yeo-johnson',
                                                                                       PowerTransformer(),
                                                                                       ['sqft_living',
                                                                                        'sqft_lot',
                                                                                        'sqft_above',
                                                                                        'sqft_basement']),
                                                                                      ('minmax',
                                           

In [68]:
mean_absolute_error(y_test, ttr.predict(X_test))

69643.50157748783
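
Note that the cell above leaves `l1_ratio` at its default of 0.5, so only the overall strength `alpha` is tuned. `ElasticNetCV` can also search over the L1/L2 mix if given a list of ratios. The sketch below shows that option on synthetic data; the ratio grid and dataset are assumptions for illustration, not settings used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold

# Synthetic data with a handful of informative features.
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)

# Candidate mixes: closer to 0 is more ridge-like, 1.0 is pure lasso.
l1_ratios = [0.1, 0.5, 0.9, 1.0]
enet = ElasticNetCV(l1_ratio=l1_ratios,
                    cv=KFold(n_splits=5, shuffle=True, random_state=321),
                    max_iter=5000, random_state=321)
enet.fit(X_demo, y_demo)

# Best mix and best strength found by cross-validation.
print(enet.l1_ratio_, enet.alpha_)
```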