***Author: Rishika Ravindran***

***Implementation of a Decision Tree Model to predict house prices using a publicly available Kaggle dataset***

***The goal of this notebook is to understand the implementation of Decision Trees, and how Post-Pruning techniques like Cost Complexity Pruning can help improve the accuracy of the decision tree model.***

In [1]:
from google.colab import files
uploaded = files.upload()

Saving house_prices.csv to house_prices.csv


In [2]:
import pandas as pd
df = pd.read_csv('house_prices.csv')
df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


In [3]:
df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [4]:
#Features used to predict the home price
#.copy() makes df_sub an independent DataFrame, so adding a column does not raise SettingWithCopyWarning
df_sub = df[["LotArea", "YearBuilt", "YrSold", "TotRmsAbvGrd", "BedroomAbvGr"]].copy()
#Age of the house (in years) at the time of sale
df_sub["Age"] = abs(df_sub["YrSold"] - df_sub["YearBuilt"])
x = df_sub


In [5]:
y = df[["SalePrice"]]
y

Unnamed: 0,SalePrice
0,208500
1,181500
2,223500
3,140000
4,250000
...,...
1455,175000
1456,210000
1457,266500
1458,142125


In [7]:
#Use sklearn to build models
#Let's use the Decision Tree model
from sklearn.tree import DecisionTreeRegressor

#Define the model. Set random_state so the results are the same on every run
house_model = DecisionTreeRegressor(random_state=1)

In [8]:
#Fit Model
house_model.fit(x,y)

DecisionTreeRegressor(random_state=1)

In [9]:
#We have now fitted a model, i.e. captured patterns from the provided data

#In practice, you'll want to make predictions for new houses coming on the market rather than for houses we already have prices for.
#But we'll make predictions for the first few rows of the training data to see how the predict function works.

print("Making predictions for the following 5 houses:")
print(x.head())

Making predictions for the following 5 houses:
   LotArea  YearBuilt  YrSold  TotRmsAbvGrd  BedroomAbvGr  Age
0     8450       2003    2008             8             3    5
1     9600       1976    2007             6             3   31
2    11250       2001    2008             6             3    7
3     9550       1915    2006             7             3   91
4    14260       2000    2008             9             4    8


In [10]:
#Let's make prediction for these five houses with the specified features
print(house_model.predict(x.head()))

[208500. 181500. 223500. 140000. 250000.]


In [11]:
#Just for one house - the features must be in the same order as the training columns
print(house_model.predict([[8450, 2003, 2008, 8, 3, 5]]))

[208500.]
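Because `predict` matches features by position when it receives a plain list, one way to guard against ordering mistakes is to pass a one-row DataFrame and reorder it by the training column names. The following is a minimal sketch on a toy dataset built from the first three rows shown above; `toy_model`, `new_house`, and `toy_prediction` are hypothetical names used only for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

feature_names = ["LotArea", "YearBuilt", "YrSold", "TotRmsAbvGrd", "BedroomAbvGr", "Age"]

# Toy training data: the first three houses printed earlier in the notebook
toy_x = pd.DataFrame(
    [[8450, 2003, 2008, 8, 3, 5],
     [9600, 1976, 2007, 6, 3, 31],
     [11250, 2001, 2008, 6, 3, 7]],
    columns=feature_names,
)
toy_y = pd.Series([208500, 181500, 223500], name="SalePrice")

toy_model = DecisionTreeRegressor(random_state=1).fit(toy_x, toy_y)

# The dict keys are deliberately out of order; indexing with feature_names
# restores the column order the model was trained on.
new_house = pd.DataFrame(
    [{"Age": 5, "BedroomAbvGr": 3, "LotArea": 8450,
      "TotRmsAbvGrd": 8, "YrSold": 2008, "YearBuilt": 2003}]
)[feature_names]

toy_prediction = toy_model.predict(new_house)
print(toy_prediction)
```

Since the unpruned tree memorises each of the three training rows, the prediction for this house equals its recorded sale price.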




In [12]:
#To test the model quality, let's use the Mean Absolute Error (MAE): the average absolute difference between actual and predicted prices.
from sklearn.metrics import mean_absolute_error

predicted_house_model = house_model.predict(x)
mean_absolute_error(y, predicted_house_model)

48.35890410958904
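To make the metric concrete, the sketch below computes the MAE by hand with NumPy. The three actual prices are taken from the first rows of the dataset; the predicted prices are made-up values chosen only to illustrate the arithmetic.

```python
import numpy as np

actual_prices = np.array([208500.0, 181500.0, 223500.0])
# Hypothetical predictions, not produced by any model in this notebook
predicted_prices = np.array([200000.0, 190000.0, 223500.0])

# MAE = mean of |actual - predicted|
absolute_errors = np.abs(actual_prices - predicted_prices)  # [8500, 8500, 0]
manual_mae = absolute_errors.mean()                          # 17000 / 3
print(manual_mae)
```

An MAE of about 5,667 here means the predictions are off by roughly $5,667 per house on average.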

In [13]:
#The measure calculated above is an "in-sample" score, i.e., we used the same data both to build the model and to evaluate it.
#Since the pattern ("fit") was derived from the training data, the model will appear accurate on the training data.
#Hence, split the data into training and test sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y,random_state=0)
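When no `test_size` is given, `train_test_split` holds out 25% of the rows for testing. A minimal sketch on a made-up array of 100 rows (the names `demo_features` and `demo_target` are assumptions for illustration) shows the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with 2 features each, plus a matching target
demo_features = np.arange(200).reshape(100, 2)
demo_target = np.arange(100)

f_train, f_test, t_train, t_test = train_test_split(
    demo_features, demo_target, random_state=0
)

# Default split: 75 rows for training, 25 rows for testing
print(len(f_train), len(f_test))
```

Fixing `random_state` makes the same rows land in the same split on every run, which keeps the reported errors reproducible.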

In [14]:
#Now fit the model using training data, predict for test data, and compare the metrics
from sklearn.metrics import mean_absolute_error
#ccp_alpha is the pruning strength; the value used here is the best alpha found by the search in the later cells
model = DecisionTreeRegressor(random_state=1, ccp_alpha=14102236.08239536)

model.fit(x_train,y_train) # -- fit
predictions = model.predict(x_test) #--predict
prediction_train = model.predict(x_train)

#mean_absolute_error expects the true values first, then the predictions
train_mae = mean_absolute_error(y_train, prediction_train)
test_mae = mean_absolute_error(y_test, predictions)

print("Training MAE is: ", train_mae)
print("Testing MAE is: ", test_mae)

Training MAE is:  27422.073171612457
Testing MAE is:  33195.481856892984


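***Overfitting! Without pruning, the mean absolute error on the training data is 40.23, while on the test data it is 42,376.2. The output above shows different values because the cell was re-run with the tuned ccp_alpha found later in this notebook.***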
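The gap between training and testing error is the signal of overfitting. The sketch below reproduces that pattern on synthetic data (not the house price dataset): a noisy linear relationship fitted with an unpruned tree. All names and the data-generating formula are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: y = 3x plus Gaussian noise
rng = np.random.default_rng(0)
synth_x = rng.uniform(0, 10, size=(300, 1))
synth_y = 3.0 * synth_x[:, 0] + rng.normal(0, 2.0, size=300)

sx_train, sx_test, sy_train, sy_test = train_test_split(
    synth_x, synth_y, random_state=0
)

# An unpruned tree keeps splitting until it has memorised the training noise
unpruned_tree = DecisionTreeRegressor(random_state=1).fit(sx_train, sy_train)

synth_train_mae = mean_absolute_error(sy_train, unpruned_tree.predict(sx_train))
synth_test_mae = mean_absolute_error(sy_test, unpruned_tree.predict(sx_test))

print("Training MAE:", synth_train_mae)
print("Testing MAE:", synth_test_mae)
```

The training error is close to zero because the tree memorises the noise, while the testing error stays near the noise level, mirroring the house price result.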

***Let's apply post-pruning with the Cost Complexity technique to improve the model's accuracy on unseen data***

In [15]:
## Post Pruning - Cost Complexity Technique
path = model.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
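`cost_complexity_pruning_path` returns two aligned arrays: the candidate alphas at which a subtree gets pruned away, and the total leaf impurity of the tree that remains at each alpha. A minimal sketch on synthetic data (names and data are assumptions for illustration, not the house price dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: y = 3x plus Gaussian noise
rng = np.random.default_rng(0)
path_x = rng.uniform(0, 10, size=(200, 1))
path_y = 3.0 * path_x[:, 0] + rng.normal(0, 2.0, size=200)

demo_path = DecisionTreeRegressor(random_state=1).cost_complexity_pruning_path(
    path_x, path_y
)

# Each alpha is paired with the impurity of the tree pruned at that alpha
for alpha, impurity in zip(demo_path.ccp_alphas[:5], demo_path.impurities[:5]):
    print("alpha = {:.6f}, total leaf impurity = {:.6f}".format(alpha, impurity))

print("number of candidate alphas:", len(demo_path.ccp_alphas))
```

Small alphas remove only the weakest splits, so the impurity barely moves; the largest alpha collapses the tree toward the root, where the impurity is highest.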

In [16]:
ccp_alphas

array([0.00000000e+00, 1.14155251e+00, 1.14155251e+00, 4.56621005e+00,
       4.56621005e+00, 1.56278539e+01, 1.88590563e+01, 2.85388128e+01,
       2.85388128e+01, 2.85388128e+01, 7.30593607e+01, 7.30593607e+01,
       1.14155251e+02, 1.14155251e+02, 1.14155251e+02, 1.14155251e+02,
       1.14155251e+02, 1.14155251e+02, 1.14155251e+02, 1.14155251e+02,
       1.14155251e+02, 1.14155251e+02, 1.14155251e+02, 1.14155251e+02,
       1.64383562e+02, 2.23744292e+02, 3.04414003e+02, 3.37899543e+02,
       3.49600457e+02, 3.69863014e+02, 4.49344292e+02, 4.56621005e+02,
       4.56621005e+02, 4.56621005e+02, 4.56621005e+02, 4.56621005e+02,
       4.56621005e+02, 4.56621005e+02, 4.56621005e+02, 4.66133942e+02,
       5.83059361e+02, 6.08828006e+02, 6.08828006e+02, 6.08828006e+02,
       6.08828006e+02, 6.30422374e+02, 6.57534247e+02, 7.13470320e+02,
       7.71689498e+02, 8.94977169e+02, 9.89345510e+02, 1.02739726e+03,
       1.02739726e+03, 1.02739726e+03, 1.02739726e+03, 1.02739726e+03,
      

In [17]:
clfs = []
for ccp_alpha in ccp_alphas:
  clf = DecisionTreeRegressor(random_state=1, ccp_alpha=ccp_alpha)
  clf.fit(x_train, y_train)
  clfs.append(clf)

print("The number of nodes in the last tree is: {} with ccp_alpha of {}".format(clfs[-1].tree_.node_count, ccp_alphas[-1]))
print("The number of nodes in the first tree is: {} with ccp_alpha of {}".format(clfs[0].tree_.node_count, ccp_alphas[0]))

The number of nodes in the last tree is: 1 with ccp_alpha of 2069806170.8008327
The number of nodes in the first tree is: 2153 with ccp_alpha of 0.0


***The number of nodes decreases as ccp_alpha increases***
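The same relationship can be checked across the whole range of alphas rather than just the first and last tree. The following sketch, on synthetic data with hypothetical names, refits one tree per candidate alpha and records its node count. Clipping the alphas at zero and removing duplicates is a precaution added here, not part of the notebook above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: y = 3x plus Gaussian noise
rng = np.random.default_rng(0)
count_x = rng.uniform(0, 10, size=(200, 1))
count_y = 3.0 * count_x[:, 0] + rng.normal(0, 2.0, size=200)

count_path = DecisionTreeRegressor(random_state=1).cost_complexity_pruning_path(
    count_x, count_y
)
# Sorted, de-duplicated, non-negative candidate alphas
count_alphas = np.unique(np.clip(count_path.ccp_alphas, 0.0, None))

node_counts = []
for alpha in count_alphas:
    tree = DecisionTreeRegressor(random_state=1, ccp_alpha=alpha)
    tree.fit(count_x, count_y)
    node_counts.append(tree.tree_.node_count)

print("nodes at smallest alpha:", node_counts[0])
print("nodes at largest alpha:", node_counts[-1])
```

Each larger alpha prunes a subtree of the previous tree, so the node count never goes up as alpha grows.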


In [18]:
train_scores = [clf.score(x_train, y_train) for clf in clfs]
test_scores = [clf.score(x_test, y_test) for clf in clfs]

#For a regressor, score() returns the R-squared value: a goodness-of-fit measure where 1.0 is a perfect fit (it is not a classification accuracy)
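For reference, R-squared compares the model's squared errors with those of always predicting the mean. The sketch below works through the formula by hand with NumPy; the prices are made-up values chosen to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical actual and predicted prices (each prediction is off by 10,000)
r2_actual = np.array([200000.0, 150000.0, 250000.0, 300000.0])
r2_predicted = np.array([210000.0, 140000.0, 240000.0, 310000.0])

# Residual sum of squares: the model's squared errors
ss_res = np.sum((r2_actual - r2_predicted) ** 2)
# Total sum of squares: squared errors of always predicting the mean price
ss_tot = np.sum((r2_actual - r2_actual.mean()) ** 2)

r_squared = 1.0 - ss_res / ss_tot
print(r_squared)
```

Here the residual sum of squares is 4e8 and the total sum of squares is 1.25e10, giving an R-squared of 0.968.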

In [19]:
scores = []
for ccp_alpha in ccp_alphas:
  clf = DecisionTreeRegressor(random_state=1, ccp_alpha=ccp_alpha)
  clf.fit(x_train, y_train)
  new_pred = clf.predict(x_test)
  #MAE on the test set for this alpha (lower is better)
  new_mae = mean_absolute_error(y_test, new_pred)
  scores.append(new_mae)

In [20]:
min_error = min(scores)
index = scores.index(min_error)
best_alpha = ccp_alphas[index]

print("The best alpha value with the lowest mean absolute error for the model is {}, with a mean absolute error of {}".format(best_alpha, min_error))

The best alpha value with the lowest mean absolute error for the model is 14102236.08239536, with a mean absolute error of 33195.481856892984


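***Among the candidate values, an alpha (ccp_alpha) of 14102236.08239536 gives the decision tree its lowest test mean absolute error (33,195.48, down from 42,376.2 without pruning), and therefore the most accurate house price predictions.***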
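One caveat: the alpha above was selected using the test set, so the reported test error may be somewhat optimistic. A common alternative, sketched below on synthetic data, is to pick alpha on a separate validation split and keep the test set for a final check only. This is an assumed variation, not the procedure used in this notebook, and all names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: y = 3x plus Gaussian noise
rng = np.random.default_rng(0)
sel_x = rng.uniform(0, 10, size=(400, 1))
sel_y = 3.0 * sel_x[:, 0] + rng.normal(0, 2.0, size=400)

# Hold out a final test set, then split the rest into fit and validation parts
x_rest, x_hold, y_rest, y_hold = train_test_split(sel_x, sel_y, random_state=0)
x_fit, x_val, y_fit, y_val = train_test_split(x_rest, y_rest, random_state=0)

fit_path = DecisionTreeRegressor(random_state=1).cost_complexity_pruning_path(
    x_fit, y_fit
)
candidate_alphas = np.unique(np.clip(fit_path.ccp_alphas, 0.0, None))

# Score every candidate alpha on the validation split only
val_errors = []
for alpha in candidate_alphas:
    tree = DecisionTreeRegressor(random_state=1, ccp_alpha=alpha).fit(x_fit, y_fit)
    val_errors.append(mean_absolute_error(y_val, tree.predict(x_val)))

chosen_alpha = candidate_alphas[int(np.argmin(val_errors))]

# Refit with the chosen alpha and evaluate once on the untouched test set
final_tree = DecisionTreeRegressor(random_state=1, ccp_alpha=chosen_alpha)
final_tree.fit(x_rest, y_rest)
holdout_mae = mean_absolute_error(y_hold, final_tree.predict(x_hold))

print("chosen alpha:", chosen_alpha)
print("held-out MAE:", holdout_mae)
```

Because the held-out rows play no part in choosing alpha, their MAE is a fairer estimate of how the pruned tree would perform on new houses.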