# House Prices - Advanced Regression Techniques

## Salvatore Porcheddu
## 2022/01/29

## Introduction

"House Prices - Advanced Regression Techniques" is one of the best-known Kaggle competitions for beginner data scientists. As the name suggests, the goal is to predict the final sale price of each house from its characteristics using regression algorithms.

The dataset used in the competition is the **Ames Housing dataset**, compiled by **Dean De Cock**, which boasts 79 different variables, each corresponding to a different feature of a house. A short description of what each variable means can be found [here](https://support.minitab.com/en-us/datasets/predictive-analytics-data-sets/ames-housing-data/), while the complete documentation along with the data can be found on the Kaggle competition page [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

The data is provided already split into train and test datasets: each contains a number of houses and their features, but only the train dataset includes the sale price of its houses.
The train data is used to fit the various algorithms; the test data is then fed to the fitted algorithms, which return a sale price prediction for each house in it.
The predictions for the test set are then scored by Kaggle according to the **Root-Mean-Squared-Error** (RMSE) between the logarithm of the predicted price and the logarithm of the real price: the lower the RMSE, the better the predictions.
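As a minimal sketch of the metric described above, the function below computes the RMSE between log prices. The name `rmse_log` and the sample prices are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between the natural logs of the true and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical prices, for illustration only
actual = [200000, 150000, 300000]

# Identical prices give an error of zero
perfect = rmse_log(actual, actual)

# Every prediction 10% too high gives log(1.1), whatever the price level
ten_pct_high = rmse_log(actual, [p * 1.1 for p in actual])
```

Because the logarithm is taken first, the metric measures relative rather than absolute errors, so a 10% miss counts the same on a cheap house as on an expensive one.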

In [42]:
# importing relevant libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import ensemble, model_selection, metrics, tree, feature_selection
import xgboost as xgb
import warnings
warnings.filterwarnings("ignore")

## Data preparation

In [43]:
# importing the data

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [44]:
test.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [45]:
# getting preliminary information about the data

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [46]:
# getting basic descriptive statistics about the features

train.describe()

,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [47]:
# now getting info about the test data

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

In [48]:
# removing columns that would leak information about the sale price that we are not supposed to know and thus invalidate the predictions
# we also remove columns where more than 90% of values are missing because they do not convey any useful information

leak_cols = ["YrSold", "MoSold", "SaleType", "SaleCondition"]
useless_cols = ["Alley", "PoolQC", "MiscFeature"]

train = train.drop(leak_cols + useless_cols, axis=1)
test = test.drop(leak_cols + useless_cols, axis=1)
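The list of mostly-empty columns above is written out by hand. As a hedged sketch of how such a list could be derived instead, the snippet below computes the share of missing values per column on a small made-up frame (`toy`, with invented values) and drops the columns above a 90% threshold.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; only the column names echo the real dataset
toy = pd.DataFrame({
    "Alley": [np.nan] * 19 + ["Pave"],        # 95% missing
    "Fence": [np.nan] * 16 + ["MnPrv"] * 4,   # 80% missing
    "LotArea": np.arange(20) * 100 + 5000,    # no missing values
})

# isna() gives a boolean frame; its column means are the missing shares
missing_share = toy.isna().mean()

# Keep the names of the columns that exceed the 90% threshold
mostly_missing = missing_share[missing_share > 0.9].index.tolist()

toy_reduced = toy.drop(columns=mostly_missing)
```

On this toy frame only `Alley` crosses the threshold, while `Fence` at 80% is kept.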

In [49]:
# as per the data documentation, missing values sometimes have a precise meaning: for certain columns with object datatype, they mean that a certain characteristic is not present;
# for certain numeric columns, they mean zero. So let's fill the null values accordingly, starting with the object columns.

object_cols = train.select_dtypes("object").columns

to_fill = ["FireplaceQu", "Fence", "MasVnrType"]
basement_cols = [col for col in object_cols if col.startswith("Bsmt")]
garage_cols = [col for col in object_cols if col.startswith("Garage")]

to_fill = to_fill + basement_cols + garage_cols

to_fill

['FireplaceQu',
 'Fence',
 'MasVnrType',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond']

In [50]:
train[to_fill] = train[to_fill].fillna("None")
test[to_fill] = test[to_fill].fillna("None")

train[to_fill].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   FireplaceQu   1460 non-null   object
 1   Fence         1460 non-null   object
 2   MasVnrType    1460 non-null   object
 3   BsmtQual      1460 non-null   object
 4   BsmtCond      1460 non-null   object
 5   BsmtExposure  1460 non-null   object
 6   BsmtFinType1  1460 non-null   object
 7   BsmtFinType2  1460 non-null   object
 8   GarageType    1460 non-null   object
 9   GarageFinish  1460 non-null   object
 10  GarageQual    1460 non-null   object
 11  GarageCond    1460 non-null   object
dtypes: object(12)
memory usage: 137.0+ KB


In [51]:
# Let's now head to the numeric columns

num_cols = train.select_dtypes("number").columns

num_to_fill = ["LotFrontage", "MasVnrArea", "GarageYrBlt", "GarageCars", "GarageArea"]

train[num_to_fill] = train[num_to_fill].fillna(0)
test[num_to_fill] = test[num_to_fill].fillna(0)

In [52]:
# Updated missing values situation for the train dataset

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 74 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1460 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [53]:
# filling NAs in the 'Electrical' column with the mode

train["Electrical"] = train["Electrical"].fillna(train.Electrical.mode()[0])

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 74 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1460 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [54]:
# updated situation for the test dataset

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 73 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1459 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   LotShape       1459 non-null   object 
 7   LandContour    1459 non-null   object 
 8   Utilities      1457 non-null   object 
 9   LotConfig      1459 non-null   object 
 10  LandSlope      1459 non-null   object 
 11  Neighborhood   1459 non-null   object 
 12  Condition1     1459 non-null   object 
 13  Condition2     1459 non-null   object 
 14  BldgType       1459 non-null   object 
 15  HouseStyle     1459 non-null   object 
 16  OverallQual    1459 non-null   int64  
 17  OverallCond    1459 non-null   int64  
 18  YearBuil

In [55]:
# several columns still have a few missing values: we will fill those with the mean or mode for numeric and object columns, respectively
# we will also fill the missing values in the Exterior1st and Exterior2nd columns with 'Other' 
# and those in the 'Bsmt' numeric columns with 0 as those houses do not have a basement

test[["Exterior1st", "Exterior2nd"]] = test[["Exterior1st", "Exterior2nd"]].fillna("Other")

basement_num_cols = [col for col in test.columns if col.startswith("Bsmt") and test[col].dtype != "object"]
test[basement_num_cols] = test[basement_num_cols].fillna(0)

for col in test.select_dtypes("object").columns:
    test[col] = test[col].fillna(test[col].mode()[0])

for col in test.select_dtypes("number").columns:
    test[col] = test[col].fillna(test[col].mean())
    
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 73 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1459 non-null   object 
 3   LotFrontage    1459 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   LotShape       1459 non-null   object 
 7   LandContour    1459 non-null   object 
 8   Utilities      1459 non-null   object 
 9   LotConfig      1459 non-null   object 
 10  LandSlope      1459 non-null   object 
 11  Neighborhood   1459 non-null   object 
 12  Condition1     1459 non-null   object 
 13  Condition2     1459 non-null   object 
 14  BldgType       1459 non-null   object 
 15  HouseStyle     1459 non-null   object 
 16  OverallQual    1459 non-null   int64  
 17  OverallCond    1459 non-null   int64  
 18  YearBuil

The next step is to transform all string variables into numeric ones with one-hot encoding. This will greatly increase the dimensionality of our data, so we will then employ **Recursive Feature Elimination** (RFE), a technique that repeatedly fits a model and discards the least important features until only the desired number remains.
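To make the idea of RFE concrete, here is a small sketch on synthetic data, assuming a target that depends on only two of six columns. The data, the tree depth and the number of features to keep are illustrative choices, not the settings used on the housing data.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

# Only columns 0 and 1 drive the target; the other four are pure noise
y = 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# RFE refits the tree, drops the least important feature, and repeats
selector = RFE(DecisionTreeRegressor(max_depth=4, random_state=0), n_features_to_select=2)
selector.fit(X, y)

# support_ is a boolean mask of the kept columns; ranking_ is 1 for kept features
selected = np.flatnonzero(selector.support_)
```

The `support_` mask can be used directly to subset a DataFrame, for example with `df.loc[:, selector.support_]`.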

In [56]:
# creating a function to perform one-hot encoding

def encoder(data):
    object_cols = data.select_dtypes("object").columns
    dummies = pd.get_dummies(data[object_cols], prefix=object_cols)
    data = pd.concat([data, dummies], axis=1)
    data.drop(object_cols, axis=1, inplace=True)
    return data
    

In [57]:
# performing encoding

enc_train = encoder(train)
enc_test = encoder(test)

print(f"We now have {enc_train.shape[1]-1} feature columns in our training data set and {enc_test.shape[1]} in our test set.")
enc_train.head()

We now have 274 feature columns in our training data set and 260 in our test set.


,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,GarageCond_Po,GarageCond_TA,PavedDrive_N,PavedDrive_P,PavedDrive_Y,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_None
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,0,1,0,0,1,0,0,0,0,1
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,0,1,0,0,1,0,0,0,0,1
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,0,1,0,0,1,0,0,0,0,1
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,0,1,0,0,1,0,0,0,0,1
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,0,1,0,0,1,0,0,0,0,1


Unfortunately, the test set has fewer features than the training set: some of the encoded values only appear in the training data. Before the modeling phase begins, we will need to make sure that both datasets have the same feature columns.
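The mismatch is easy to reproduce on a toy example. The sketch below uses two tiny made-up frames (`train_df`, `test_df`) and shows two common ways to reconcile the encoded columns; which one is preferable depends on the model and is a judgment call.

```python
import pandas as pd

# Made-up frames: "Floor" appears only in train, "GasW" only in test
train_df = pd.DataFrame({"Heating": ["GasA", "Floor", "GasA"], "LotArea": [8450, 9600, 11250]})
test_df = pd.DataFrame({"Heating": ["GasA", "GasW"], "LotArea": [11622, 14267]})

enc_a = pd.get_dummies(train_df)
enc_b = pd.get_dummies(test_df)

# Option 1: keep only the columns that exist in both frames
common = enc_a.columns.intersection(enc_b.columns)

# Option 2: give the test frame exactly the training columns;
# categories missing from test become all-zero columns,
# categories seen only in test are dropped
enc_b_aligned = enc_b.reindex(columns=enc_a.columns, fill_value=0)
```

Option 1 discards information from the training data, while option 2 keeps every training column at the cost of some constant columns in the test frame.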

In [58]:
# splitting training data into X and y

X_train = enc_train.drop("SalePrice", axis=1)
y_train = enc_train["SalePrice"]

In [59]:
# performing RFE using a simple Decision Tree (keeping 75 features)

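# n_features_to_select is passed by keyword: recent scikit-learn versions no longer accept it positionally
rfe = feature_selection.RFE(tree.DecisionTreeRegressor(max_depth=6, random_state=27), n_features_to_select=75)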
%time rfe.fit(X_train, y_train)

print("\n", "In the following rankings, features with rank 1 have been selected by the algorithm:\n", rfe.ranking_)

CPU times: user 1.98 s, sys: 3.09 ms, total: 1.98 s
Wall time: 1.98 s

 In the following rankings, features with rank 1 have been selected by the algorithm:
 [200 199   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   3   5   1
   9  11  13   1  17  20  21  24  25  27  29  31  33  35  37  40  39   1
  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  78  76
  82  84  80  87  90  88  94  92  96  98 102 100 104 109 106 108  97  86
   1   1   1 110 112 131 114 116 139 118 120 122 124 126 128 132 134 136
 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174
 176 178 180 182 184 186 188 190 192 194 196 198 197 195 193 187 185 183
 181 179 177 175 173 171 169 167 165 163 161 159 157   1 107 105 103 101
  99   1  95  93  89   1  85  83  81  79  77  75  73  71  69  67  65  63
  61  59  57  55  53  51  49  47  45  43  23  19   1   1   1   1   1   1
   1   1   1   1   1   1   1   1   1   

In [60]:
# getting final dataframes

X_train = X_train.loc[:, rfe.support_].copy()

X_train.head()

,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,FireplaceQu_Gd,FireplaceQu_None,FireplaceQu_Po,FireplaceQu_TA,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd
0,65.0,8450,7,5,2003,2003,196.0,706,0,150,...,0,1,0,0,0,1,0,0,0,0
1,80.0,9600,6,8,1976,1976,0.0,978,0,284,...,0,0,0,1,0,1,0,0,0,0
2,68.0,11250,7,5,2001,2002,162.0,486,0,434,...,0,0,0,1,0,1,0,0,0,0
3,60.0,9550,7,5,1915,1970,0.0,216,0,540,...,1,0,0,0,0,0,0,0,0,1
4,84.0,14260,8,5,2000,2000,350.0,655,0,490,...,0,0,0,1,0,1,0,0,0,0


In [61]:
# checking if all the selected features appear in the test set

missing_cols = [col for col in X_train.columns if col not in enc_test.columns]

missing_cols

['Heating_Floor', 'Heating_OthW']

In [62]:
# two columns do not exist in the test set: we will remove them from the training set

X_train = X_train.drop(missing_cols, axis=1)

X_test = enc_test.loc[:, X_train.columns].copy()

X_test.head()

,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,FireplaceQu_Gd,FireplaceQu_None,FireplaceQu_Po,FireplaceQu_TA,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd
0,80.0,11622,5,6,1961,1961,0.0,468.0,144.0,270.0,...,0,1,0,0,0,1,0,0,0,0
1,81.0,14267,6,6,1958,1958,108.0,923.0,0.0,406.0,...,0,1,0,0,0,1,0,0,0,0
2,74.0,13830,5,5,1997,1998,0.0,791.0,0.0,137.0,...,0,0,0,1,0,1,0,0,0,0
3,78.0,9978,6,6,1998,1998,20.0,602.0,0.0,324.0,...,1,0,0,0,0,1,0,0,0,0
4,43.0,5005,8,5,1992,1992,0.0,263.0,0.0,1017.0,...,0,1,0,0,0,1,0,0,0,0


## Modeling phase

We can finally start modeling on our final datasets. Given that we have a relatively small number of observations, we will use algorithms that can cope with outliers without requiring us to remove them, so that we do not lose any valuable data. Tree-based algorithms meet this requirement: their splits depend only on the ordering of feature values, so they need no feature scaling and are comparatively insensitive to extreme feature values.
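The claim about unscaled data can be checked with a short experiment. The sketch below, on synthetic data with arbitrarily chosen scale factors, fits the same decision tree on raw and on rescaled features and compares the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Integer-valued synthetic features, stored as floats
X = rng.integers(0, 100, size=(300, 3)).astype(float)
y = 2 * X[:, 0] + 10 * np.sin(X[:, 1] / 10) + rng.normal(scale=0.1, size=300)

# Rescale each column by a different factor (powers of two keep the arithmetic exact)
X_rescaled = X * np.array([1024.0, 1.0, 0.5])

tree_raw = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
tree_rescaled = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_rescaled, y)

# Rescaling preserves the ordering of values, so the same splits are found
same_predictions = np.allclose(tree_raw.predict(X), tree_rescaled.predict(X_rescaled))
```

The same experiment with a linear model using regularization, or with a distance-based model, would generally give different predictions after rescaling.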

This is how we will proceed:
1) we will train four different algorithms based on decision trees: a bagging ensemble, random forest, extra-trees and XGBoost gradient boosting (gbtree); 
2) we will then submit to Kaggle both the predictions made by the best-performing model and the average of the predictions made by all the models.

For each model we will perform **grid-search hyperparameter tuning** and **five-fold cross-validation**.
Grid-Search is a technique that repeatedly fits the algorithm using different hyperparameters to determine which ones yield the best performance.
Cross-Validation also repeatedly fits an algorithm, each time using a different subset of the data while keeping the rest to evaluate performance, thus allowing us to detect if the algorithm is *overfitting* the train data.
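The combination of the two techniques can be sketched on synthetic data as follows. The grid, the data and the model here are small illustrative choices so that the example runs in seconds; they are not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(scale=0.1, size=120)

# 2 x 2 = 4 hyperparameter combinations
param_grid = {"n_estimators": [10, 30], "max_depth": [2, 6]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=5,  # each combination is fitted 5 times, once per held-out fold
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])

# scikit-learn maximizes scores, so RMSE is reported negated; flip the sign to read it
best_rmse = -search.best_score_
```

This is why the best scores printed by the grid searches are negative numbers: a score of -29000 corresponds to a cross-validated RMSE of 29000 in the units of the target.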

In [63]:
y_train

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

In [64]:
# creating, tuning and fitting bagging regressor

grid1 = {"n_estimators":[100, 500, 1000], "max_samples":[.5, .75, 1.0], "oob_score":[False, True]}

bg_reg = ensemble.BaggingRegressor(random_state=27)

gs1 = model_selection.GridSearchCV(bg_reg, param_grid=grid1, cv=5, scoring="neg_root_mean_squared_error", n_jobs=-1)
%time gs1.fit(X_train, y_train)

print(f"Best score achieved with the Bagging Regressor is {gs1.best_score_} with the following parameters: {gs1.best_params_}")
preds_bg = gs1.best_estimator_.predict(X_test)

preds_bg[:5]

CPU times: user 5.02 s, sys: 67.4 ms, total: 5.08 s
Wall time: 2min 42s
Best score achieved with the Bagging Regressor is -29474.91283081622 with the following parameters: {'max_samples': 0.75, 'n_estimators': 500, 'oob_score': False}


array([127967.266, 155715.25 , 177718.602, 182819.998, 196586.746])

In [65]:
# creating, tuning and fitting random forest regressor

grid2 = {"n_estimators":[500, 1000], "max_depth":[5, 10, 15, 20], "max_features":["log2", "sqrt"], "oob_score":[False, True]}

rf_reg = ensemble.RandomForestRegressor(random_state=27)

gs2 = model_selection.GridSearchCV(rf_reg, param_grid=grid2, cv=5, scoring="neg_root_mean_squared_error", n_jobs=-1)
%time gs2.fit(X_train, y_train)

print(f"Best score achieved with the Random Forest is {gs2.best_score_} with the following parameters: {gs2.best_params_}")
preds_rf = gs2.best_estimator_.predict(X_test)

preds_rf[:5]

CPU times: user 2.87 s, sys: 99.1 ms, total: 2.97 s
Wall time: 1min 38s
Best score achieved with the Random Forest is -29066.821315768542 with the following parameters: {'max_depth': 20, 'max_features': 'sqrt', 'n_estimators': 1000, 'oob_score': False}


array([128017.06708936, 150456.74616587, 184373.98081472, 189055.34777459,
       186388.8629706 ])

In [66]:
# creating, tuning and fitting extra trees regressor

grid3 = {"n_estimators":[500, 1000], "max_depth":[5, 10, 15, 20], "max_features":["log2", "sqrt"], "max_samples":[.5, .75, .9]}

ex_reg = ensemble.ExtraTreesRegressor(random_state=27)

gs3 = model_selection.GridSearchCV(ex_reg, param_grid=grid3, cv=5, scoring="neg_root_mean_squared_error", n_jobs=-1)
%time gs3.fit(X_train, y_train)

print(f"Best score achieved with Extra-Trees is {gs3.best_score_} with the following parameters: {gs3.best_params_}")
preds_ex = gs3.best_estimator_.predict(X_test)

preds_ex[:5]

CPU times: user 2.84 s, sys: 106 ms, total: 2.95 s
Wall time: 1min 35s
Best score achieved with Extra-Trees is -29363.284656071417 with the following parameters: {'max_depth': 20, 'max_features': 'sqrt', 'max_samples': 0.5, 'n_estimators': 1000}


array([125736.84616585, 147969.34766783, 183525.75036439, 191701.2918818 ,
       183747.70187472])

In [67]:
# creating, tuning and fitting xgboost model

grid4 = {"n_estimators":[200, 500], "max_depth":[10, 15, 20], "learning_rate":[.01, .1, .2]}

xgb_reg = xgb.XGBRegressor(random_state=27, n_jobs=-1)

gs4 = model_selection.GridSearchCV(xgb_reg, param_grid=grid4, cv=5, scoring="neg_root_mean_squared_error")
%time gs4.fit(X_train, y_train)

print(f"Best score achieved with XGBoost is {gs4.best_score_} with the following parameters: {gs4.best_params_}")
preds_xgb = gs4.best_estimator_.predict(X_test)

preds_xgb[:5]

CPU times: user 27min 40s, sys: 7.33 s, total: 27min 47s
Wall time: 4min
Best score achieved with XGBoost is -28953.92935010364 with the following parameters: {'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 500}


array([123799.016, 148408.27 , 190341.6  , 187995.45 , 189405.22 ],
      dtype=float32)

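XGBoost achieved the lowest cross-validated RMSE (about 28,954, against 29,067 for the random forest, 29,363 for extra-trees and 29,475 for bagging), although it took the longest to train (4 minutes of wall time and almost 28 minutes of CPU time). We will submit the predictions made by this model, as well as the average prediction from all the models.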

In [68]:
# creating file with XGBoost predictions

Id = [*range(1461, 1461+len(X_test))]

submission1 = pd.DataFrame({"Id":Id, "SalePrice":preds_xgb})

submission1.head()

,Id,SalePrice
0,1461,123799.015625
1,1462,148408.265625
2,1463,190341.59375
3,1464,187995.453125
4,1465,189405.21875


In [69]:
submission1.to_csv("submission_1.csv", index=False)

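Predictions from the XGBoost model scored an RMSE of 0.145 on Kaggle. This score is computed on log prices, so it is not directly comparable with the cross-validation scores above, which are expressed in dollars.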

In [70]:
# creating second submission file with average prediction

pred = np.empty((1459, 4))
pred[:, 0] = preds_bg
pred[:, 1] = preds_rf
pred[:, 2] = preds_ex
pred[:, 3] = preds_xgb

pred[:5]

array([[127967.266     , 128017.06708936, 125736.84616585,
        123799.015625  ],
       [155715.25      , 150456.74616587, 147969.34766783,
        148408.265625  ],
       [177718.602     , 184373.98081472, 183525.75036439,
        190341.59375   ],
       [182819.998     , 189055.34777459, 191701.2918818 ,
        187995.453125  ],
       [196586.746     , 186388.8629706 , 183747.70187472,
        189405.21875   ]])

In [71]:
avg_pred = np.mean(pred, axis=1)

avg_pred[:5]

array([126380.04872005, 150637.40236468, 183989.98173228, 187893.02269535,
       189032.13239883])

In [72]:
submission2 = pd.DataFrame({"Id":Id, "SalePrice":avg_pred})
submission2.to_csv("submission_2.csv", index=False)

submission2.head()

,Id,SalePrice
0,1461,126380.04872
1,1462,150637.402365
2,1463,183989.981732
3,1464,187893.022695
4,1465,189032.132399


The averaged predictions scored better than the XGBoost ones, with an RMSE of 0.14296.
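A simplified simulation suggests why averaging can help. The sketch below assumes four hypothetical models whose errors are independent, which is an idealization: the errors of real models trained on the same data are correlated, so the gain is usually smaller than here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true prices, for illustration only
truth = rng.uniform(100_000, 300_000, size=1000)

# Four hypothetical models: the truth plus independent noise for each
preds = np.column_stack(
    [truth + rng.normal(scale=20_000, size=1000) for _ in range(4)]
)

def rmse(a, b):
    """Root-mean-squared error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

individual = [rmse(truth, preds[:, j]) for j in range(4)]

# Row-wise mean of the four prediction columns
averaged = rmse(truth, preds.mean(axis=1))
```

In this idealized setting the averaged predictions have a lower RMSE than any single model, because independent errors partly cancel out.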