# Predicting House Prices

We have Iowa house price data, which we will use to predict house prices with machine learning models. Let's import the data and the dependencies required to work with it.

In [22]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
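# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer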

# Read the training and testing data and store data in DataFrame
# Raw strings keep the backslashes literal; otherwise '\t' in '\train.csv' is read as a tab
Iowa_train = pd.read_csv(r'Input\Data\Iowa_House_Prices\train.csv')
Iowa_test = pd.read_csv(r'Input\Data\Iowa_House_Prices\test.csv')

Let's look at the data and get descriptive statistics about it as well.

In [2]:
Iowa_train.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


Check the column names so they are easy to reference later.

In [3]:
Iowa_train.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

Since we are predicting the price, let's assign the prediction column `SalePrice` to a variable. This column is usually called the **Prediction Target**.

In [4]:
# Target Variable
y = Iowa_train.SalePrice
y.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

Let's choose some predictors based on our intuition and assign them to the predictor variable `X`.

In [5]:
# Our chosen Predictors
Iowa_Predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Predictor Variable
X = Iowa_train[Iowa_Predictors]
X.head()

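| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 |
| 2 | 11250 | 2001 | 920 | 866 | 2 | 3 | 6 |
| 3 | 9550 | 1915 | 961 | 756 | 1 | 3 | 7 |
| 4 | 14260 | 2000 | 1145 | 1053 | 2 | 4 | 9 |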


Now model the data by fitting a `DecisionTreeRegressor` to `X` and `y`.

In [6]:
# Define the model
Iowa_DT_model = DecisionTreeRegressor()

# Fit the model to our data
Iowa_DT_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

Let's predict the prices of the houses in the test data to see how the model fares in its default state.

In [7]:
# Predictor columns of the test data and the model's predictions for them (all rows, not only the first 5)
print('The predictor columns of our test data are: \n', Iowa_test[Iowa_Predictors])
print('The predictions of the selected data are: \n', Iowa_DT_model.predict(Iowa_test[Iowa_Predictors]))

The predictor columns of our test data are: 
       LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0       11622       1961       896         0         1             2   
1       14267       1958      1329         0         1             3   
2       13830       1997       928       701         2             3   
3        9978       1998       926       678         2             3   
4        5005       1992      1280         0         2             2   
5       10000       1993       763       892         2             3   
6        7980       1992      1187         0         2             3   
7        8402       1998       789       676         2             3   
8       10176       1990      1341         0         1             2   
9        8400       1970       882         0         1             2   
10       5858       1999      1337         0         2             2   
11       1680       1971       483       504         1             2   
12       1680       197

Let's repeat the prediction, this time using `train_test_split()` to hold out part of the data for validation.

In [10]:
train_X, val_X, train_Y, val_Y = train_test_split(X, y, random_state=200)

# Fit the data
Iowa_DT_model.fit(train_X, train_Y)

# Get predictions on the validation data
val_predictions = Iowa_DT_model.predict(val_X)
print(mean_absolute_error(val_Y, val_predictions))

33196.890410958906
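
The number printed above is the mean absolute error: the average of |actual - predicted| over the validation rows. A minimal sketch of that arithmetic, using made-up prices rather than rows from the Iowa data (the helper name `manual_mae` is hypothetical):

```python
def manual_mae(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative numbers only; these are not rows from the Iowa dataset.
actual_prices = [200000, 150000, 300000]
predicted_prices = [210000, 140000, 330000]

# The absolute errors are 10000, 10000 and 30000, so the mean is 50000 / 3.
print(manual_mae(actual_prices, predicted_prices))
```

Read this way, an MAE of about 33,000 means the tree's predictions are off by roughly that many dollars on average.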


## Controlling Underfit and Overfit in a model

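The `max_leaf_nodes` parameter controls how far the tree is allowed to split, which lets us steer the model between underfitting (too few leaves) and overfitting (too many). By looping over candidate values and comparing the validation MAE, we can find the number of leaves that brings the predictions closest to the actual prices.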

In [11]:
def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae


In [12]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_Y, val_Y)
    print('Max Leaf Nodes: %d \t\t MAE: %d' % (max_leaf_nodes, my_mae))

Max Leaf Nodes: 5 		 MAE: 39206
Max Leaf Nodes: 50 		 MAE: 29954
Max Leaf Nodes: 500 		 MAE: 30853
Max Leaf Nodes: 5000 		 MAE: 30936
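
Picking the winner from a loop like this can be automated, and comparing training and validation error side by side makes the overfitting visible. A minimal sketch, assuming synthetic data from `make_regression` in place of the Iowa features so that it runs on its own (the names `scores` and `best_leaves` are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; not the Iowa dataset.
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=20.0, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

scores = {}
for leaves in [5, 50, 500]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    scores[leaves] = {
        "train": mean_absolute_error(tr_y, tree.predict(tr_X)),
        "val": mean_absolute_error(va_y, tree.predict(va_X)),
    }

# Choose the setting with the lowest validation error, not the lowest training error.
best_leaves = min(scores, key=lambda k: scores[k]["val"])

for leaves, s in scores.items():
    print(f"leaves={leaves:<4} train MAE={s['train']:.1f}  val MAE={s['val']:.1f}")
print("best max_leaf_nodes by validation MAE:", best_leaves)
```

With enough leaves the tree can memorize the training rows, so its training error drops toward zero while the validation error does not follow; that gap is the signature of overfitting.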


## Random Forests
Let's see if random forests give better predictions.

In [14]:
forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_Y)
pred = forest_model.predict(val_X)
print(mean_absolute_error(val_Y, pred))

26487.975159817353


The random forest did give a better `MAE` than the decision tree, as expected. Other models can predict better still, often with less overhead, such as gradient boosting algorithms. These are prone to overfitting, but that can be managed with the right parameters.

## XGBoost
XGBoost is a very popular and successful model for standard tabular data. It is a type of `Gradient Boosted Decision Tree` algorithm: it calculates the errors for each observation in the data, builds a model to predict those errors, adds that model's predictions to the ensemble of models, and repeats the cycle.
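
The cycle can be sketched by hand. What follows is a simplified illustration of gradient boosting for squared error, using scikit-learn trees on synthetic data. It is not XGBoost's actual implementation, and the names `ensemble`, `shrinkage` and `n_rounds` are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature data; not the Iowa dataset.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) * 10 + rng.normal(0, 1, size=200)

shrinkage = 0.1   # small multiplier applied to each tree's contribution
n_rounds = 50

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y_demo, y_demo.mean())
ensemble = []
mae_history = [np.mean(np.abs(y_demo - prediction))]

for _ in range(n_rounds):
    residuals = y_demo - prediction                    # 1. errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)                        # 2. build a model to predict those errors
    prediction += shrinkage * tree.predict(X_demo)     # 3. add its predictions to the ensemble
    ensemble.append(tree)
    mae_history.append(np.mean(np.abs(y_demo - prediction)))

print(f"training MAE before boosting: {mae_history[0]:.2f}")
print(f"training MAE after {n_rounds} rounds: {mae_history[-1]:.2f}")
```

Each round only has to correct what the previous rounds got wrong, which is why many shallow trees can add up to an accurate model.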

Let's fit XGBoost on the whole dataset, keeping every numeric feature for now.

In [31]:
data = Iowa_train.dropna(axis=0, subset=['SalePrice'])
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
train_X, test_X, train_Y, test_Y = train_test_split(X, y, test_size=0.25)

my_imputer = Imputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)
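
To make the imputation step concrete: with the default mean strategy, the imputer learns each column's mean from the training rows and reuses those same means to fill gaps in any later data. A small sketch on made-up numbers, using `SimpleImputer` from `sklearn.impute` (the class that replaces the older `Imputer` in current scikit-learn):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up values; np.nan marks a missing entry.
train_demo = np.array([[1.0, 10.0],
                       [3.0, np.nan],
                       [np.nan, 30.0]])
test_demo = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_demo)  # learns the column means 2.0 and 20.0
test_filled = imputer.transform(test_demo)        # reuses the training means

print(train_filled)
print(test_filled)
```

Calling `fit_transform` on the training split and only `transform` on the held-out split, as the cell above does, keeps information from the held-out rows out of the fill values.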

In [32]:
from xgboost import XGBRegressor

model_xgb = XGBRegressor()
model_xgb.fit(train_X, train_Y, verbose=False)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [33]:
# Predictions
preds = model_xgb.predict(test_X)
print("Mean Absolute Error of XGB model is: " + str(mean_absolute_error(test_Y, preds)))

Mean Absolute Error of XGB model is: 17348.051305650686


This is a very good MAE, better even than the random forest's. We can improve it further by tuning the model, since some parameters, if tuned well, can dramatically improve performance.

## Model Tuning
### n_estimators, early_stopping_rounds and learning_rate
`n_estimators` specifies how many times to go through the modeling cycle described above, that is, the number of boosted trees to fit. Typical values range from 100 to 1000, depending in part on the learning rate. Too large a value overfits and too small a value underfits, so we need to find the right number.

`early_stopping_rounds` offers a way to find that value automatically. Early stopping causes the model to stop iterating when the validation score stops improving, even if it has not reached the hard stop set by `n_estimators`. It's smart to set a high value for `n_estimators` and then use `early_stopping_rounds` to find the optimal time to stop iterating. Usually 5 rounds is a safe bet.

`learning_rate`: rather than simply adding the predictions from each model to the ensemble, we can multiply them by a small number first. This reduces the model's tendency to overfit.

`n_jobs`: larger datasets take longer to fit, and this parameter enables parallel processing. It is usually set to the number of cores in the system.

In conclusion, a high `n_estimators` combined with a low `early_stopping_rounds` and a small `learning_rate` generally yields more accurate XGBoost models, but training can take longer because the model runs more iterations of the cycle.
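
The early stopping rule described above can be illustrated without any library. The sketch below applies a hypothetical helper, `best_round_with_early_stopping`, to made-up validation errors; XGBoost's own bookkeeping may differ in detail.

```python
def best_round_with_early_stopping(validation_errors, patience):
    """Return (best_round, rounds_run) for a sequence of per-round validation errors."""
    best_error = float("inf")
    best_round = -1
    rounds_run = 0
    for round_index, error in enumerate(validation_errors):
        rounds_run = round_index + 1
        if error < best_error:
            best_error = error
            best_round = round_index
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_round, rounds_run


# Made-up validation errors: they improve for a while, then drift upward.
errors = [30.0, 25.0, 22.0, 21.0, 21.5, 21.4, 21.6, 21.8, 22.0, 22.5, 23.0]
print(best_round_with_early_stopping(errors, patience=5))  # prints (3, 9)
```

Here the best score occurs at round index 3, and training stops after five further rounds without improvement instead of running through all eleven.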

In [34]:
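# Note: recent XGBoost releases take early_stopping_rounds in the XGBRegressor constructor rather than in fit()
model_xgb_tuned = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=2)
model_xgb_tuned.fit(train_X, train_Y, early_stopping_rounds=5, eval_set=[(test_X, test_Y)], verbose=False)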

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.05, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=2, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [35]:
# Predictions
preds = model_xgb_tuned.predict(test_X)
print("Mean Absolute Error of XGB model is: " + str(mean_absolute_error(test_Y, preds)))

Mean Absolute Error of XGB model is: 17245.806752996574
