 ## 3. Pre-Processing, Training  & Modeling for Home Prices

## 3.1 Contents<a id='3.1_Contents'></a>
* [3 Pre-Processing and Training & Modeling for Home Prices](#3_Pre-Processing_and_Training_&_Modeling_for_Home_Prices)
  * [3.1 Contents](#3.1_Contents)
  * [3.2 Introduction](#3.2_Introduction)
  * [3.3 Imports](#3.3_Imports)
  * [3.4 Load Data](#3.4_Load_Data)
  * [3.5 Pre-Processing the Data](#3.5_Pre-Processing_the_Data)
      * [3.5.1 Renovation Age Column](#3.5.1_Renovation_Age_Column)
      * [3.5.2 bd/ba % NaNs](#3.5.2_bd/ba_%_NaNs)
  * [3.6 Train/Test Split](#3.6_Train/Test_Split)
  * [3.7 Linear Regression](#3.7_Linear_Regression)
       * [3.7.2 Cross Validation of Linear Regression](#3.7.2_Cross_Validation_of_Linear_Regression) 
       * [3.7.3 Grid Search CV for LR](#3.7.3_Grid_Search_CV_for_LR) 
  * [3.8 Ridge Regression](#3.8_Ridge_Regression)
       * [3.8.1 RR Metrics](34.8.1_RR_Metrics)  
       * [3.8.2 Cross Validation of Ridge Regression](#3.8.2_Cross_Validation_of_Ridge_Regression)
       * [3.8.3 Grid Search CV for RR](#3.8.3_Grid_Search_CV_for_RR)   
  * [3.9 Random Forest Regression](#3.9_Random_Forest_Regression)
       * [3.9.1 RF Metrics](#3.9.1_RF_Metrics)
       * [3.9.2 Cross Validation of RF](#3.9.2_Cross_Validation_of_RF)
       * [3.9.3 Grid Search CV for Random Forest](#3.9.3_Grid_Search_CV_for_Random_Forest)
  * [3.10 XGB Regression](#3.10_XGB_Regression)
       * [3.10.1 XGB Metrics](#3.10.1_XGB_Metrics)
       * [3.10.2 Cross Validation of XGB Regression](#3.10.2_Cross_Validation_of_XGB_Regression)
       * [3.10.3 Grid Search CV for XGB Regression](#3.10.3_Grid_Search_CV_for_XGB_Regression)
  * [3.11 Model Metrics](#3.11_Model_Metrics)
  * [3.12 Summary](#3.12_Summary)

## 3.2 Introduction<a id='3.2_Introduction'></a>

In the last step I explored and cleaned the data. I created a new column of 'Age' so we know how old the home is. Found which columns were positively correlated such as 'Square Feet' of living and 'Square Feet Above' at + .88.

Now I will prepare the data to do a train/test split. This will help to get the data ready for training a model to predict the price of a home. I will standardize the data to keep everything on an equal level so it does not overfit. I will try three different models to find which one is most accurate at predicting the Y variable of price. I will compare the models using RMSE, MAE and R2 score.

## 3.3 Imports<a id='3.3_Imports'></a>

In [404]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge, RidgeCV, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVC, SVR
from math import sqrt

In [405]:
pd.options.display.float_format = '{:.2f}'.format

## 3.4 Load Data<a id='3.4_Load_Data'></a>

In [406]:
df= pd.read_csv('house_data_clean.csv')

In [407]:
df.head()

Unnamed: 0.1,Unnamed: 0,date,price,bedrooms,bathrooms,bd/ba %,age,yr_built,sqft_above,sqft_basement,...,floors,view,waterfront,condition,yr_renovated,street,city,state,zip code,country
0,0,2014-05-02,313000.0,3.0,1.5,2.0,59,1955,1340,0,...,1.5,0,0,3,2005,18810 Densmore Ave N,Shoreline,WA,98133,USA
1,1,2014-05-02,2384000.0,5.0,2.5,2.0,93,1921,3370,280,...,2.0,4,0,5,0,709 W Blaine St,Seattle,WA,98119,USA
2,2,2014-05-02,342000.0,3.0,2.0,1.5,48,1966,1930,0,...,1.0,0,0,4,0,26206-26214 143rd Ave SE,Kent,WA,98042,USA
3,3,2014-05-02,420000.0,3.0,2.25,1.33,51,1963,1000,1000,...,1.0,0,0,4,0,857 170th Pl NE,Bellevue,WA,98008,USA
4,4,2014-05-02,550000.0,4.0,2.5,1.6,38,1976,1140,800,...,1.0,0,0,4,1992,9105 170th Ave NE,Redmond,WA,98052,USA


In [408]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     4600 non-null   int64  
 1   date           4600 non-null   object 
 2   price          4600 non-null   float64
 3   bedrooms       4600 non-null   float64
 4   bathrooms      4600 non-null   float64
 5   bd/ba %        4598 non-null   float64
 6   age            4600 non-null   int64  
 7   yr_built       4600 non-null   int64  
 8   sqft_above     4600 non-null   int64  
 9   sqft_basement  4600 non-null   int64  
 10  sqft_living    4600 non-null   int64  
 11  sqft_lot       4600 non-null   int64  
 12  floors         4600 non-null   float64
 13  view           4600 non-null   int64  
 14  waterfront     4600 non-null   int64  
 15  condition      4600 non-null   int64  
 16  yr_renovated   4600 non-null   int64  
 17  street         4600 non-null   object 
 18  city    

In [409]:
df.shape

(4600, 22)

I need to drop the unnamed column that is not useful to our model.

In [410]:
df=df.drop(df.columns[0:1], axis=1)

In [411]:
df.head()

Unnamed: 0,date,price,bedrooms,bathrooms,bd/ba %,age,yr_built,sqft_above,sqft_basement,sqft_living,...,floors,view,waterfront,condition,yr_renovated,street,city,state,zip code,country
0,2014-05-02,313000.0,3.0,1.5,2.0,59,1955,1340,0,1340,...,1.5,0,0,3,2005,18810 Densmore Ave N,Shoreline,WA,98133,USA
1,2014-05-02,2384000.0,5.0,2.5,2.0,93,1921,3370,280,3650,...,2.0,4,0,5,0,709 W Blaine St,Seattle,WA,98119,USA
2,2014-05-02,342000.0,3.0,2.0,1.5,48,1966,1930,0,1930,...,1.0,0,0,4,0,26206-26214 143rd Ave SE,Kent,WA,98042,USA
3,2014-05-02,420000.0,3.0,2.25,1.33,51,1963,1000,1000,2000,...,1.0,0,0,4,0,857 170th Pl NE,Bellevue,WA,98008,USA
4,2014-05-02,550000.0,4.0,2.5,1.6,38,1976,1140,800,1940,...,1.0,0,0,4,1992,9105 170th Ave NE,Redmond,WA,98052,USA


## 3.5 Pre-Processing the Data<a id='3.5_Pre-Processing_the_Data'></a>

In [412]:
df['city'].unique()

array(['Shoreline', 'Seattle', 'Kent', 'Bellevue', 'Redmond',
       'Maple Valley', 'North Bend', 'Lake Forest Park', 'Sammamish',
       'Auburn', 'Des Moines', 'Bothell', 'Federal Way', 'Kirkland',
       'Issaquah', 'Woodinville', 'Normandy Park', 'Fall City', 'Renton',
       'Carnation', 'Snoqualmie', 'Duvall', 'Burien', 'Covington',
       'Inglewood-Finn Hill', 'Kenmore', 'Newcastle', 'Mercer Island',
       'Black Diamond', 'Ravensdale', 'Clyde Hill', 'Algona', 'Skykomish',
       'Tukwila', 'Vashon', 'Yarrow Point', 'SeaTac', 'Medina',
       'Enumclaw', 'Snoqualmie Pass', 'Pacific', 'Beaux Arts Village',
       'Preston', 'Milton'], dtype=object)

Here we have too many different cities & street addresses to do a dummy variable for Linear and Ridge Regression. For the Random Forest model I will give numeric values to each city.

Lets now separate the month, day and week for our model from the date column. The year column is all the same, all the homes sold are from data in 2014.

In [414]:
df['date'] = pd.to_datetime(df['date'], 
 format = '%Y-%m-%dT', errors = 'coerce')

In [415]:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['week'] = df['date'].dt.isocalendar().week

### 3.5.1 Renovation Age Column<a id='3.5.1_Renovation_Age_Column'></a>

Now I will create a new column called renovation age. This is when the home was renovated last or never renovated.

In [416]:
df['renovation_age'] = df['year'] - df['yr_renovated']

In [417]:
df['renovation_age'] = df['renovation_age'].replace(2014, 0)

In [418]:
df.head(3)

Unnamed: 0,date,price,bedrooms,bathrooms,bd/ba %,age,yr_built,sqft_above,sqft_basement,sqft_living,...,street,city,state,zip code,country,year,month,day,week,renovation_age
0,2014-05-02,313000.0,3.0,1.5,2.0,59,1955,1340,0,1340,...,18810 Densmore Ave N,Shoreline,WA,98133,USA,2014,5,2,18,9
1,2014-05-02,2384000.0,5.0,2.5,2.0,93,1921,3370,280,3650,...,709 W Blaine St,Seattle,WA,98119,USA,2014,5,2,18,0
2,2014-05-02,342000.0,3.0,2.0,1.5,48,1966,1930,0,1930,...,26206-26214 143rd Ave SE,Kent,WA,98042,USA,2014,5,2,18,0


### 3.5.2 bd/ba % NaNs<a id='3.5.2_bd/ba_%_NaNs'></a>

I need to fix up the Bed/Bathrooms % column. Two of the entries have NaN values.

In [419]:
df[df['bd/ba %'].isna()]

Unnamed: 0,date,price,bedrooms,bathrooms,bd/ba %,age,yr_built,sqft_above,sqft_basement,sqft_living,...,street,city,state,zip code,country,year,month,day,week,renovation_age
2365,2014-06-12,1095000.0,0.0,0.0,,24,1990,3064,0,3064,...,814 E Howe St,Seattle,WA,98102,USA,2014,6,12,24,5
3209,2014-06-24,1295648.0,0.0,0.0,,24,1990,4810,0,4810,...,20418 NE 64th Pl,Redmond,WA,98053,USA,2014,6,24,26,5


In [420]:
df['bd/ba %'] = df['bd/ba %'].replace(np.nan, 0)

## 3.6 Train/Test Split<a id='3.6_Train/Test_Split'></a>

Now I will do a 70% Train/ 30% Test on the housing data. We will set aside data, which is the actual test to evaluate our model performance. A train/test split is helpful to check in on future performance that we predict. Lets see what the size of the train/test split will be.

In [421]:
len(df) * .7, len(df) * .3

(3220.0, 1380.0)

I need to get dummies for the View and Condition colmuns for Linear Regression, Ridge Regression and the XGB Regression models.

In [558]:
encode = pd.get_dummies(df, columns=['view', 'condition'])

In [559]:
encode.head()

Unnamed: 0,date,price,bedrooms,bathrooms,bd/ba %,age,yr_built,sqft_above,sqft_basement,sqft_living,...,view_0,view_1,view_2,view_3,view_4,condition_1,condition_2,condition_3,condition_4,condition_5
0,2014-05-02,313000.0,3.0,1.5,2.0,59,1955,1340,0,1340,...,1,0,0,0,0,0,0,1,0,0
1,2014-05-02,2384000.0,5.0,2.5,2.0,93,1921,3370,280,3650,...,0,0,0,0,1,0,0,0,0,1
2,2014-05-02,342000.0,3.0,2.0,1.5,48,1966,1930,0,1930,...,1,0,0,0,0,0,0,0,1,0
3,2014-05-02,420000.0,3.0,2.25,1.33,51,1963,1000,1000,2000,...,1,0,0,0,0,0,0,0,1,0
4,2014-05-02,550000.0,4.0,2.5,1.6,38,1976,1140,800,1940,...,1,0,0,0,0,0,0,0,1,0


First I need to create the X/independent variables for our model to predict the Y/dependent variable.

In [562]:
features= encode.drop(['date','price','street','yr_built','zip code','country','city',
                       'state','yr_renovated'],axis=1)

In [563]:
X_train, X_test, y_train, y_test = train_test_split(features, df['price'],test_size=0.3, 
                                                    random_state=47)

Lets look at the shape of our X & Y train and test splits.

In [565]:
X_train.shape, X_test.shape

((3220, 25), (1380, 25))

In [566]:
y_train.shape, y_test.shape

((3220,), (1380,))

In [567]:
X_train.dtypes

bedrooms          float64
bathrooms         float64
bd/ba %           float64
age                 int64
sqft_above          int64
sqft_basement       int64
sqft_living         int64
sqft_lot            int64
floors            float64
waterfront          int64
year                int64
month               int64
day                 int64
week               UInt32
renovation_age      int64
view_0              uint8
view_1              uint8
view_2              uint8
view_3              uint8
view_4              uint8
condition_1         uint8
condition_2         uint8
condition_3         uint8
condition_4         uint8
condition_5         uint8
dtype: object

All of our varibles are in a numeric value!

## 3.7 Linear Regression<a id='3.7_Linear_Regression'></a>

Lets start with a Linear Regression pipeline model.

In [568]:
lr_pipeline=make_pipeline(StandardScaler(),
    LinearRegression())

In [569]:
lr_pipeline.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

In [570]:
y_te_pred = lr_pipeline.predict(X_test)
print(y_te_pred)

[628503.91805971 501527.91805971 507671.91805971 ... 524055.91805971
 644887.91805971 323351.91805971]


In [571]:
y_tr_pred = lr_pipeline.predict(X_train)
print(y_tr_pred)

[2418455.91805971 1891607.91805971  364311.91805971 ...  466711.91805971
  474903.91805971  454423.91805971]


### 3.7.1 LR Metrics<a id='3.7.1_LR_Metrics'></a>

In [572]:
r2_test_lr = r2_score(y_test,y_te_pred)
mse_test_lr = mean_squared_error(y_test, y_te_pred)
rmse_test_lr = sqrt(mse_test_lr)
mae_test_lr = mean_absolute_error(y_test, y_te_pred)

In [573]:
print('R2 Score: ', r2_test_lr)
print('MSE: ', mse_test_lr)
print('RMSE: ', rmse_test_lr)
print('MAE: ', mae_test_lr)

R2 Score:  0.46965718257726086
MSE:  61734606846.020485
RMSE:  248464.49816024117
MAE:  164817.44813336252


This score is low for a prediction model

In [574]:
r2_train_lr = r2_score(y_train,y_tr_pred)
mse_train_lr = mean_squared_error(y_train, y_tr_pred)
rmse_train_lr = sqrt(mse_train_lr)
mae_train_lr = mean_absolute_error(y_train, y_tr_pred)

In [575]:
print('R2 Score: ', r2_train_lr)
print('MSE: ', mse_train_lr)
print('RMSE: ', rmse_train_lr)
print('MAE: ', mae_train_lr)

R2 Score:  0.1855539551564973
MSE:  329085656128.6325
RMSE:  573659.878437243
MAE:  173296.70994937117


The training set R2 score is lower than the test. This means that the model is underfitting the test data.

In [587]:
lr_table=pd.DataFrame({'Test':[r2_test_lr, mse_test_lr, rmse_test_lr, mae_test_lr],
                    'Training': [r2_train_lr, mse_train_lr, rmse_train_lr, mae_train_lr]},
                     index=['R2', 'MSE', 'RMSE', 'MAE'])

In [588]:
lr_table

Unnamed: 0,Test,Training
R2,0.47,0.19
MSE,61734606846.02,329085656128.63
RMSE,248464.5,573659.88
MAE,164817.45,173296.71


These statistics are not very favorable for predicting the price of a home.

### 3.7.2 Cross Validation of Linear Regression<a id='3.7.2_Cross_Validation_of_Linear_Regression'></a>

In [576]:
cv_results_lr = cross_validate(lr_pipeline, X_train, y_train, cv=10)
cv_results_lr

{'fit_time': array([0.01699162, 0.01397467, 0.01399183, 0.01599169, 0.01499295,
        0.01399493, 0.00999427, 0.01399326, 0.01099467, 0.01698923]),
 'score_time': array([0.00601101, 0.00299764, 0.00499868, 0.00399709, 0.00499487,
        0.00199628, 0.00499582, 0.00299764, 0.00399637, 0.00399709]),
 'test_score': array([ 0.5920096 ,  0.57684533,  0.13968156,  0.40049369,  0.38165976,
         0.34183103,  0.55054558,  0.6245825 , -0.00588385,  0.5294712 ])}

In [577]:
cv_scores_lr = cv_results_lr['test_score']
cv_scores_lr

array([ 0.5920096 ,  0.57684533,  0.13968156,  0.40049369,  0.38165976,
        0.34183103,  0.55054558,  0.6245825 , -0.00588385,  0.5294712 ])

In [578]:
np.mean(cv_scores_lr), np.std(cv_scores_lr)

(0.4131236399322528, 0.1981189922324615)

The average cross validation score is lower .413 than our original R2 score for Linear Regression.

### 3.7.3 Grid Search CV for LR<a id='3.7.3_Grid_Search_CV_for_LR'></a>

In [579]:
param_grid = {'C': [1000, 10000, 500000], 'max_iter':[20000, 30000, 50000]}

In [580]:
lr_grid_cv = GridSearchCV(SVR(), param_grid, refit = True, verbose = 3,n_jobs=-1) 

In [581]:
lr_grid_cv.fit(X_train, y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


GridSearchCV(estimator=SVR(), n_jobs=-1,
             param_grid={'C': [1000, 10000, 500000],
                         'max_iter': [20000, 30000, 50000]},
             verbose=3)

In [582]:
lr_grid_cv.best_params_

{'C': 500000, 'max_iter': 20000}

In [583]:
lr_best_cv_results = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, cv=10)
lr_best_scores = lr_best_cv_results['test_score']
lr_best_scores

array([ 0.36074056,  0.43969383,  0.09503574,  0.35492343,  0.40632209,
        0.31531083,  0.43707087,  0.52086752, -0.0024624 ,  0.39496169])

In [584]:
np.mean(lr_best_scores), np.std(lr_best_scores)

(0.33224641549715306, 0.15409195797555822)

 So this indicates that the model is not a flexible one. Some of the linear relationships between the features and the dependent variable may have been lost. This is also quite clear from the high MSE values and not just for the test set but also for the training set. The grid search mean R2 score was even lower than cross validation R2 score at .332.

## 3.8 Ridge Regression<a id='3.8_Ridge_Regression'></a>

In [447]:
r_pipeline=make_pipeline(
    StandardScaler(), 
    Ridge(alpha=10))

In [448]:
r_pipeline.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('ridge', Ridge(alpha=10))])

In [449]:
y_te_pred_r = r_pipeline.predict(X_test)
print(y_te_pred_r)

[652469.00685731 469446.68222105 513795.87995955 ... 527792.76632784
 626678.12625959 290267.00976078]


In [450]:
y_tr_pred_r = r_pipeline.predict(X_train)
print(y_tr_pred_r)

[2485362.70408175 2013035.08146989  345825.05218298 ...  466858.97695807
  462067.55011513  447551.80418501]


### 3.8.1 RR Metrics<a id='3.8.1_RR_Metrics'></a>

In [451]:
r2_test = r2_score(y_test,y_te_pred_r)
mse_test = mean_squared_error(y_test, y_te_pred_r)
rmse_test = np.sqrt(mse_test)
mae_test = mean_absolute_error(y_test, y_te_pred_r)

In [452]:
r2_train_tr = r2_score(y_train,y_tr_pred_r)
mse_train_tr = mean_squared_error(y_train, y_tr_pred_r)
rmse_train_tr = np.sqrt(mse_test)
mae_train_tr = mean_absolute_error(y_train, y_tr_pred_r)

In [455]:
print('R2 Score: ', r2_test)
print('MSE: ', mse_test)
print('RMSE: ', rmse_test)
print('MAE: ', mae_test)

R2 Score:  0.4675921933834928
MSE:  61974982112.415276
RMSE:  248947.749763711
MAE:  165587.70739882597


In [456]:
print('R2 Score: ', r2_train_tr)
print('MSE: ', mse_train_tr)
print('RMSE: ', rmse_train_tr)
print('MAE: ', mae_train_tr)

R2 Score:  0.18731302780600578
MSE:  328374883965.50824
RMSE:  248947.749763711
MAE:  173932.0818063621


Once again we get a lower R2 score for both the training and test datasets. This is very similar to the Linear Regression scores.

In [None]:
table=pd.DataFrame({'Test':[r2_test, mse_test, rmse_test, mae_test],
                    'Training': [r2_train_tr, mse_train_tr, rmse_train_tr, mae_train_tr]},
                     index=['R2', 'MSE', 'RMSE', 'MAE'])

In [622]:
table

Unnamed: 0,Test,Training
R2,0.47,0.19
MSE,61974982112.42,328374883965.51
RMSE,248947.75,248947.75
MAE,165587.71,173932.08


Here we can see the low R2 score and the high MSE, RMSE, and MAE for both the training and test sets.

### 3.8.2 Cross Validation of Ridge Regression<a id='3.8.2_Cross_Validation_of_Ridge_Regression'></a>

In [457]:
cv_results_rr = cross_validate(r_pipeline, X_train, y_train, cv=10)
cv_results_rr

{'fit_time': array([0.01699138, 0.00999594, 0.01099586, 0.01698756, 0.01399183,
        0.00999475, 0.0079968 , 0.00799632, 0.00799561, 0.0099957 ]),
 'score_time': array([0.00299716, 0.00199676, 0.00499701, 0.00299835, 0.00399685,
        0.00199819, 0.00299811, 0.0039959 , 0.00299811, 0.00199819]),
 'test_score': array([ 0.59138638,  0.57728452,  0.14008883,  0.39392291,  0.38549099,
         0.34312905,  0.54959919,  0.62643211, -0.00559231,  0.53265065])}

In [458]:
cv_scores_rr = cv_results_rr['test_score']
cv_scores_rr

array([ 0.59138638,  0.57728452,  0.14008883,  0.39392291,  0.38549099,
        0.34312905,  0.54959919,  0.62643211, -0.00559231,  0.53265065])

In [459]:
np.mean(cv_scores_rr), np.std(cv_scores_rr)

(0.41343923266752886, 0.1982525443067596)

Very similar to the cross validation standard deviation and the R2 score from Linear Regression.

### 3.8.3 Grid Search CV for RR<a id='3.8.3_Grid_Search_CV_for_RR'></a>

In [460]:
param_grid = {'alpha': [0.1, 50, 100, 1000],
              'max_iter': [5000, 10000, 50000]}  

In [461]:
rr_grid_cv = GridSearchCV(Ridge(), param_grid, refit = True, verbose = 3,n_jobs=-1) 

In [462]:
rr_grid_cv.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


GridSearchCV(estimator=Ridge(), n_jobs=-1,
             param_grid={'alpha': [0.1, 50, 100, 1000],
                         'max_iter': [5000, 10000, 50000]},
             verbose=3)

In [463]:
rr_grid_cv.best_params_

{'alpha': 50, 'max_iter': 5000}

In [464]:
rr_best_cv_results = cross_validate(rr_grid_cv.best_estimator_, X_train, y_train, cv=10)
rr_best_scores = rr_best_cv_results['test_score']
rr_best_scores

array([ 0.57466382,  0.56872382,  0.13872922,  0.40090761,  0.42611772,
        0.37850081,  0.5515727 ,  0.62897515, -0.00605918,  0.53570095])

In [465]:
np.mean(rr_best_scores), np.std(rr_best_scores)

(0.41978326306830543, 0.19574772196275597)

Here we have a lower grid search cross validation for Ridge Regression. This suggests that the model is not accurate at predicting the price of a home for our dataset.

## 3.9 Random Forest Regression<a id='3.9_Random_Forest_Regression'></a>

First we need to give numeric values for the city, view and condition columns for our Random Forest model to see if it will increase the R2 score.

In [466]:
df['city']= df['city'].apply({'Shoreline':0,'Seattle':1,'Kent':2,'Bellevue':3,'Redmond':4,'Maple Valley':5,'North Bend':6,'Lake Forest Park':7,
                                 'Sammamish':8,'Auburn':9,'Des Moines':10,'Bothell':11,'Federal Way':12,'Kirkland':13,'Issaquah':14,
                                 'Woodinville':15,'Normandy Park':16,'Fall City':17,'Renton':18,'Carnation':19,'Snoqualmie':20,
                                 'Duvall':21,'Burien':22,'Covington':23,'Inglewood-Finn Hill':24,'Kenmore':25,'Newcastle':26,'Mercer Island':27,
                                 'Black Diamond':28,'Ravensdale':29,'Clyde Hill':30,'Algona':31,'Skykomish':32,'Tukwila':33,'Vashon':34,
                                 'Yarrow Point':35,'SeaTac':36,'Medina':37,'Enumclaw':38,'Snoqualmie Pass':39,'Pacific':40,'Beaux Arts Village':41,
                                'Preston':42,'Milton':43}.get)

In [467]:
df['view']= df['view'].apply({0:0, 1:1, 2:2, 3:3, 4:4}.get)
df['condition']= df['condition'].apply({1:1, 2:2, 3:3, 4:4, 5:5}.get)

In [468]:
features1= df.drop(['date','price','street','yr_built','zip code','country',
                       'state','yr_renovated'],axis=1)

In [627]:
features1.head(3)

Unnamed: 0,bedrooms,bathrooms,bd/ba %,age,sqft_above,sqft_basement,sqft_living,sqft_lot,floors,view,waterfront,condition,city,year,month,day,week,renovation_age
0,3.0,1.5,2.0,59,1340,0,1340,7912,1.5,0,0,3,0,2014,5,2,18,9
1,5.0,2.5,2.0,93,3370,280,3650,9050,2.0,4,0,5,1,2014,5,2,18,0
2,3.0,2.0,1.5,48,1930,0,1930,11947,1.0,0,0,4,2,2014,5,2,18,0


In [470]:
X_train, X_test, y_train, y_test = train_test_split(features1, df['price'],test_size=0.3, 
                                                    random_state=47)

In [471]:
rf_pipeline=make_pipeline(
    StandardScaler(), 
    RandomForestRegressor(n_estimators = 10000,
                           random_state = 42,
                           min_samples_split = 10,
                           bootstrap = True))

In [472]:
rf_pipeline.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestregressor',
                 RandomForestRegressor(min_samples_split=10, n_estimators=10000,
                                       random_state=42))])

In [473]:
y_te_pred_rf = rf_pipeline.predict(X_test)

In [474]:
y_te_pred_rf

array([836569.22005153, 308434.64192059, 387037.6563454 , ...,
       527241.23387044, 952117.63913168, 422250.79643928])

In [475]:
y_tr_pred_rf = rf_pipeline.predict(X_train)

In [476]:
y_tr_pred_rf

array([3512448.99640136, 3411238.30771917,  528018.75791143, ...,
        341091.36232179,  487416.69875059,  590182.92122278])

### 3.9.1 RF Metrics<a id='3.9.1_RF_Metrics'></a>

In [477]:
r2_test_rf = r2_score(y_test,y_te_pred_rf)
mse_test_rf = mean_squared_error(y_test, y_te_pred_rf)
rmse_test_rf = sqrt(mse_test)
mae_test_rf = mean_absolute_error(y_test, y_te_pred_rf)

In [478]:
r2_train_rf = r2_score(y_train,y_tr_pred_rf)
mse_train_rf = mean_squared_error(y_train, y_tr_pred_rf)
rmse_train_rf = sqrt(mse_test)
mae_train_rf = mean_absolute_error(y_train, y_tr_pred_rf)

In [479]:
print('R2 Score: ', r2_test_rf)
print('MSE: ', mse_test_rf)
print('RMSE: ', rmse_test_rf)
print('MAE: ', mae_test_rf)

R2 Score:  0.3956232210568841
MSE:  70352537281.89313
RMSE:  248947.749763711
MAE:  140215.58163085923


In [480]:
print('R2 Score: ', r2_train_rf)
print('MSE: ', mse_train_rf)
print('RMSE: ', rmse_train_rf)
print('MAE: ', mae_train_rf)

R2 Score:  0.65300007638006
MSE:  140209039339.11996
RMSE:  248947.749763711
MAE:  89118.92067473156


This model has a lower R2 score than the first two models. The MSE, RMSE and MAE are higher as well than the firrst two models.

In [623]:
table_rf=pd.DataFrame({'Test':[r2_test_rf, mse_test_rf, rmse_test_rf, mae_test_rf],
                    'Training': [r2_train_rf, mse_train_rf, rmse_train_rf, mae_train_rf]},
                     index=['R2', 'MSE', 'RMSE', 'MAE'])

In [624]:
table_rf

Unnamed: 0,Test,Training
R2,0.4,0.65
MSE,70352537281.89,140209039339.12
RMSE,248947.75,248947.75
MAE,140215.58,89118.92


The higher training R2 score means that the model is overfitting the data.

### 3.9.2 Cross Validation of RF<a id='3.9.2_Cross_Validation_of_RF'></a>

In [482]:
cv_results_rf = cross_validate(rf_pipeline, X_train, y_train, cv=10)
cv_results_rf

{'fit_time': array([118.13048077, 122.07361126, 120.76079416, 121.88335609,
        120.54279017, 121.50665975, 122.78863382, 122.16399574,
        121.93012857, 125.03638673]),
 'score_time': array([1.1243608 , 1.11736178, 1.12035799, 1.20830441, 1.18532181,
        1.12635279, 1.12635541, 1.12535405, 1.12635159, 1.12035584]),
 'test_score': array([0.53936695, 0.36899571, 0.07518875, 0.36522929, 0.44604904,
        0.27717615, 0.60435922, 0.67969309, 0.00267993, 0.22417229])}

In [483]:
cv_scores_rf = cv_results_rf['test_score']
cv_scores_rf

array([0.53936695, 0.36899571, 0.07518875, 0.36522929, 0.44604904,
       0.27717615, 0.60435922, 0.67969309, 0.00267993, 0.22417229])

In [484]:
np.mean(cv_scores_rf), np.std(cv_scores_rf)

(0.3582910431354306, 0.20872832568212693)

### 3.9.3 Grid Search CV for Random Forest<a id='3.9.3_Grid_Search_CV_for_Random_Forest'></a> 

In [485]:
param_grid = {'n_estimators':[100, 200, 500, 1000], 'max_depth':[10, 50, 100]}

In [486]:
rf_grid_cv = GridSearchCV(RandomForestRegressor(), param_grid, refit = True, verbose = 3,n_jobs=-1) 

In [487]:
rf_grid_cv.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


GridSearchCV(estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'max_depth': [10, 50, 100],
                         'n_estimators': [100, 200, 500, 1000]},
             verbose=3)

In [488]:
rf_grid_cv.best_params_

{'max_depth': 100, 'n_estimators': 200}

In [489]:
rf_best_cv_results = cross_validate(rf_grid_cv.best_estimator_, X_train, y_train, cv=10)
rf_best_scores = rf_best_cv_results['test_score']
rf_best_scores

array([ 4.11262046e-01,  4.42329651e-01,  1.01395094e-01,  4.37197525e-01,
        4.05809038e-01, -4.80115828e-01,  6.18407785e-01,  6.38142837e-01,
        2.42037715e-04,  1.95319903e-01])

In [490]:
np.mean(rf_best_scores), np.std(rf_best_scores)

(0.27699900889897217, 0.3196758824381164)

Once again the grid search CV R2 score is lower for this model. 

## 3.10 XGB Regression<a id='3.10_XGB_Regression'></a>

In [513]:
conda install -c conda-forge xgboost

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Retrieving notices: ...working... done

Note: you may need to restart the kernel to use updated packages.


In [514]:
from xgboost import XGBRegressor
import xgboost as xgb

In [516]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [517]:
xgbr = XGBRegressor(objective = 'reg:squarederror',
        colsample_bytree = 0.5,
        learning_rate = 0.05,
        max_depth = 10,
        min_child_weight = 1,
        n_estimators = 1000,
        subsample = 0.7).fit(X_train_scaled, y_train)

In [518]:
y_te_pred_xgb = xgbr.predict(X_test_scaled)

In [519]:
y_te_pred_xgb

array([815268.8 , 325512.28, 374205.28, ..., 575320.7 , 999218.56,
       386445.3 ], dtype=float32)

In [520]:
y_tr_pred_xgb = xgbr.predict(X_train_scaled)

In [521]:
y_tr_pred_xgb

array([2888082.  , 3800037.  ,  604712.9 , ...,  253129.47,  505093.16,
        586611.25], dtype=float32)

### 3.10.1 XGB Metrics<a id='3.10.1_XGB_Metrics'></a>

In [522]:
r2_test_xgb = r2_score(y_test,y_te_pred_xgb)
mse_test_xgb = mean_squared_error(y_test, y_te_pred_xgb)
rmse_test_xgb = np.sqrt(mse_test)
mae_test_xgb = mean_absolute_error(y_test, y_te_pred_xgb)

In [523]:
print('R2 Score: ', r2_test_xgb)
print('MSE: ', mse_test_xgb)
print('RMSE: ', rmse_test_xgb)
print('MAE: ', mae_test_xgb)

R2 Score:  0.3030709607666885
MSE:  81126092073.28644
RMSE:  248947.749763711
MAE:  145701.3168604714


In [524]:
r2_train_xgb = r2_score(y_train,y_tr_pred_xgb)
mse_train_xgb = mean_squared_error(y_train, y_tr_pred_xgb)
rmse_train_xgb = np.sqrt(mse_train_xgb)
mae_train_xgb = mean_absolute_error(y_train, y_tr_pred_xgb)

In [525]:
print('R2 Score: ', r2_train_xgb)
print('MSE: ', mse_train_xgb)
print('RMSE: ', rmse_train_xgb)
print('MAE: ', mae_train_xgb)

R2 Score:  0.9999996096053836
MSE:  157743.13018959042
RMSE:  397.16889378398
MAE:  262.5218084015361


This is the lowest R2 score out of the four models. The very high R2 score on the training data means that the model is overfitting the data and simply repeating the information 

In [620]:
xgb_table=pd.DataFrame({'Test':[r2_test_xgb, mse_test_xgb, rmse_test_xgb, mae_test_xgb], 
                        'Training': [r2_train_xgb, mse_train_xgb, rmse_train_xgb, mae_train_xgb]},
              index=['R2', 'MSE', 'RMSE', 'MAE'])

In [621]:
xgb_table

Unnamed: 0,Test,Training
R2,0.3,1.0
MSE,81126092073.29,157743.13
RMSE,248947.75,397.17
MAE,145701.32,262.52


This was the least accurate model of all four. It had the highest MSE, RMSE and MAE values and the lowest R2 score.

### 3.10.2 Cross Validation of XGB Regression<a id='3.10.2_Cross_Validation_of_XGB_Regression'></a>

In [504]:
cv_results_xgb = cross_validate(xgbr, X_train_scaled, y_train, cv=10)
cv_results_xgb

{'fit_time': array([5.68773222, 5.69675303, 5.66476917, 5.73173499, 5.82068396,
        5.71274567, 5.69275475, 5.66376901, 5.68474436, 5.78769946]),
 'score_time': array([0.01299071, 0.01297593, 0.01297688, 0.01297331, 0.01297259,
        0.01397419, 0.01197648, 0.01197958, 0.01299143, 0.01197743]),
 'test_score': array([ 4.83216294e-01,  2.64464951e-01,  9.22841872e-02,  4.13101956e-01,
         3.83412949e-01,  1.04206583e-01,  1.84211180e-01,  4.12306531e-01,
        -1.37541803e-04,  5.22579761e-01])}

In [505]:
cv_scores_xgb = cv_results_xgb['test_score']
cv_scores_xgb

array([ 4.83216294e-01,  2.64464951e-01,  9.22841872e-02,  4.13101956e-01,
        3.83412949e-01,  1.04206583e-01,  1.84211180e-01,  4.12306531e-01,
       -1.37541803e-04,  5.22579761e-01])

In [506]:
np.mean(cv_scores_xgb), np.std(cv_scores_xgb)

(0.28596468501348093, 0.1731288088295915)

Another lower cross validation score of .285. This proves that this model is more than likely inaccurate as well.

### 3.10.3 Grid Search CV for XGB Regression<a id='3.10.3_Grid_Search_CV_for_XGB_Regression'></a> 

In [507]:
parameters = {"learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ],
 "max_depth"        : [1,2, 3, 4, 5, 6, 8, 10, 12, 15],
 'n_estimators' : [5,10,50,100,500,1000]}

In [508]:
xgb_grid = GridSearchCV(xgbr,
                        parameters,
                        cv = 10,
                        n_jobs = 5,
                        verbose=True)

In [509]:
xgb_grid.fit(X_train_scaled, y_train)

Fitting 10 folds for each of 360 candidates, totalling 3600 fits


GridSearchCV(cv=10,
             estimator=XGBRegressor(base_score=0.5, booster='gbtree',
                                    colsample_bylevel=1, colsample_bynode=1,
                                    colsample_bytree=0.5,
                                    enable_categorical=False, gamma=0,
                                    gpu_id=-1, importance_type=None,
                                    interaction_constraints='',
                                    learning_rate=0.05, max_delta_step=0,
                                    max_depth=10, min_child_weight=1,
                                    missing=nan, monotone_constraints='()',
                                    n_estimators=1000, n_jobs=8,
                                    num_parallel_tree=1, predictor='auto',
                                    random_state=0, reg_alpha=0, reg_lambda=1,
                                    scale_pos_weight=1, subsample=0.7,
                                    tree_method='exact', vali

In [510]:
xgb_grid.best_params_

{'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 100}

In [511]:
xgb_grid_cv_results = cross_validate(rf_grid_cv.best_estimator_, X_train_scaled, y_train, cv=10)
xgb_grid_best_scores = xgb_grid_cv_results['test_score']
xgb_grid_best_scores

array([ 0.49396733,  0.36840425,  0.10544041,  0.36324592,  0.40737846,
       -0.828053  ,  0.62519659,  0.68219228,  0.00337   ,  0.14866684])

In [512]:
np.mean(xgb_grid_best_scores), np.std(xgb_grid_best_scores)

(0.23698090985441186, 0.4112408589476557)

The grid search CV score is even lower then the first two R2 scores for the XGB Regression model.

## 3.11 Model Metrics<a id='3.11_Model_Metrics'></a> 

In [612]:
data = {
        'R2 Score': [r2_test_lr,r2_test,r2_test_rf, r2_test_xgb],
        'MSE' : [mse_test_lr,mse_test, mse_test_rf, mse_test_xgb],
        'MAE': [mae_test_lr, mae_test, mae_test_rf, mae_test_xgb],
        'RMSE': [rmse_test_lr, rmse_test, rmse_test_rf, rmse_test_xgb],
        'CV Mean Score': [np.mean(cv_scores_lr), np.mean(cv_scores_rr), np.mean(cv_scores_rf), np.mean(cv_scores_xgb)],
        'Grid Search CV Best Score': [np.mean(lr_best_scores), np.mean(rr_best_scores), np.mean(rf_best_scores), np.mean(xgb_grid_best_scores)]}

In [613]:
df_metrics = pd.DataFrame(data, index=['Linear Regression',
                                       'Ridge Regression',
                                       'Random Forest Regression Model',
                                       'XGB Regession Model'])

In [614]:
df_metrics

Unnamed: 0,R2 Score,MSE,MAE,RMSE,CV Mean Score,Grid Search CV Best Score
Linear Regression,0.47,61734606846.02,164817.45,248464.5,0.41,0.33
Ridge Regression,0.47,61974982112.42,165587.71,248947.75,0.41,0.42
Random Forest Regression Model,0.4,70352537281.89,140215.58,248947.75,0.36,0.28
XGB Regession Model,0.3,81126092073.29,145701.32,248947.75,0.29,0.24


## 3.12 Summary<a id='3.12_Summary'></a>


I tried four different models with similar results. For this particular data it is hard to predict the price of a home for this time period. Sometimes in data science this happens.  

The most accurate R2 score on the test data was Linear Regression at .4696. It would be hard to predict the price of a home if you get less than half of them correct. The MSE for Linear Regression was 61,734,606,846.02. The RMSE was 248,464.50 and 164,817.45 MAE for the test data from Linear Regression. All these statistics suggest that this model could be very inaccurate at predicting the price. The cross validation mean scores were slightly lower for the R2 score at .413 and a standard deviation average of .198. The grid search cross validation average was .332 for the R2 score and a standard deviation of .154 for Linear Regression.

Ridge Regression was similar to Linear Regression. The R2 test score was slightly lower at .467. The MSE was 61,974,982,112.41, RMSE was 248947.74 and the MAE was 165,587.70. The mean cross validation R2 score was .413 and the standard deviation average was also .198. The grid search mean was .419 for the R2 score and the standard deviation was .195.

The Random Forest Regressor model had an R2 test score lower than the first three models at .395. The MSE was also very high at  70,352,537,281.89, RMSE was 248,947.74 and the MAE was 140,215.58. These numbers suggest that it is inaccurate at predicting the price. The cross validation mean R2 score was .358 and the standard deviation was .208. For the grid search the R2 mean score was .276 and a standard deviation of .319. The cross validations confirm that this model is not very good at predicting the price.

The XGB Regression Model had the lowest R2 test score of .303. The MSE was 81,126,092,073.28, RMSE was  248,947.74 and the MAE was 145,701.31. The cross validation mean R2 score was .285 and the standard deviation was .173. The grid search R2 score average was .236 and the standard deviation average was .411. This was the most inaccurate model out of all four.

Going forward I would recommend that we need more information to predict the price of the home. I would also try a different model as well, but typically linear regression is the best at predicting the price of a home. Houses come in all different shapes and sizes which can make it tricky to predict the price. Sometimes a very small home is extremely expensive due to the location and vice versa. 