This notebook trains a random forest to predict house sale prices. Although random forests can struggle with one-hot encoded data, the algorithm was chosen as a first modeling approach because it usually performs reasonably well with default hyperparameters, which makes it a suitable first approximation for a regression problem.\
The resulting prediction error will serve as a baseline for further modeling.

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, LeaveOneOut
import pickle

# 1. Load data

In [2]:
dataset = pd.read_csv('encoded_train_dataset.csv', sep=',')

In [3]:
dataset

| Unnamed: 0 | Id | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | HeatingQC | CentralAir | KitchenQual | FireplaceQu | GarageFinish | GarageQual | GarageCond | PavedDrive | PoolQC | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 65.0 | 8450.0 | 7.0 | 5.0 | 2003.0 | 2003.0 | 196.0 | 706.0 | 0.0 | ... | 5.0 | 1.0 | 4.0 | 0.0 | 2.0 | 3.0 | 3.0 | 2.0 | 0.0 | 208500.0 |
| 1 | 2 | 80.0 | 9600.0 | 6.0 | 8.0 | 1976.0 | 1976.0 | 0.0 | 978.0 | 0.0 | ... | 5.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 3.0 | 2.0 | 0.0 | 181500.0 |
| 2 | 3 | 68.0 | 11250.0 | 7.0 | 5.0 | 2001.0 | 2002.0 | 162.0 | 486.0 | 0.0 | ... | 5.0 | 1.0 | 4.0 | 3.0 | 2.0 | 3.0 | 3.0 | 2.0 | 0.0 | 223500.0 |
| 3 | 4 | 60.0 | 9550.0 | 7.0 | 5.0 | 1915.0 | 1970.0 | 0.0 | 216.0 | 0.0 | ... | 4.0 | 1.0 | 4.0 | 4.0 | 1.0 | 3.0 | 3.0 | 2.0 | 0.0 | 140000.0 |
| 4 | 5 | 84.0 | 14260.0 | 8.0 | 5.0 | 2000.0 | 2000.0 | 350.0 | 655.0 | 0.0 | ... | 5.0 | 1.0 | 4.0 | 3.0 | 2.0 | 3.0 | 3.0 | 2.0 | 0.0 | 250000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 62.0 | 7917.0 | 6.0 | 5.0 | 1999.0 | 2000.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 3.0 | 2.0 | 0.0 | 175000.0 |
| 1456 | 1457 | 85.0 | 13175.0 | 6.0 | 6.0 | 1978.0 | 1988.0 | 119.0 | 790.0 | 163.0 | ... | 3.0 | 1.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 2.0 | 0.0 | 210000.0 |
| 1457 | 1458 | 66.0 | 9042.0 | 7.0 | 9.0 | 1941.0 | 2006.0 | 0.0 | 275.0 | 0.0 | ... | 5.0 | 1.0 | 4.0 | 4.0 | 2.0 | 3.0 | 3.0 | 2.0 | 0.0 | 266500.0 |
| 1458 | 1459 | 68.0 | 9717.0 | 5.0 | 6.0 | 1950.0 | 1996.0 | 0.0 | 49.0 | 1029.0 | ... | 4.0 | 1.0 | 4.0 | 0.0 | 1.0 | 3.0 | 3.0 | 2.0 | 0.0 | 142125.0 |


# 2. Define the model and estimate its error

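The model is scikit-learn's `RandomForestRegressor` with default hyperparameters, scored with the negative root mean squared error (`neg_root_mean_squared_error`).\
Validation uses leave-one-out cross-validation. Because each fold holds out a single sample, the per-fold RMSE is simply the absolute error of that one prediction, so the mean score reported below is equivalent to the mean absolute error.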

In [4]:
rfregressor = RandomForestRegressor()

In [8]:
# Features: every column except the identifier and the target
X = dataset.drop(columns=['Id', 'SalePrice'])
# Target: sale price as a NumPy array
y = dataset['SalePrice'].values
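
The table above lists an `Unnamed: 0` column, which typically appears when a DataFrame index is written to CSV and read back. If that is what happened here (an assumption, not something the notebook confirms), the column is just a row counter and could be dropped along with `Id`. The sketch below shows the idea on a tiny hypothetical frame; `demo`, `non_features`, `X_demo` and `y_demo` are illustrative names, and the error reported in this notebook was presumably computed with the column still included.

```python
import pandas as pd

# Tiny stand-in frame (assumption) mimicking the layout of the encoded dataset.
demo = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],          # leftover index column, if that is its origin
    'Id': [1, 2, 3],
    'LotArea': [8450.0, 9600.0, 11250.0],
    'SalePrice': [208500.0, 181500.0, 223500.0],
})

# Columns that carry no predictive information, plus the target itself.
non_features = ['Unnamed: 0', 'Id', 'SalePrice']

X_demo = demo.drop(columns=non_features)
y_demo = demo['SalePrice'].to_numpy()

print(list(X_demo.columns))
```

Dropping the column changes the feature set, so the cross-validated error would need to be recomputed before comparing against the baseline.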

In [11]:
cv = LeaveOneOut()

In [12]:
# Leave-one-out refits the forest once per row, so this cell is slow
scores = cross_val_score(rfregressor, X, y=y, scoring='neg_root_mean_squared_error', cv=cv)

In [13]:
-scores.mean()

17314.8897260274
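
With leave-one-out, every fold scores a single prediction, so averaging per-fold RMSE values gives the mean absolute error rather than an RMSE over the whole dataset. The following is a minimal sketch on synthetic data (an assumption; it does not use the house price dataset) that checks this equivalence and, as an alternative that the notebook does not use, computes a pooled RMSE from out-of-fold predictions with `cross_val_predict`. The names `X_demo`, `y_demo`, `mean_fold_rmse`, `mae` and `pooled_rmse` are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict, cross_val_score

# Synthetic stand-in data (assumption): 40 samples, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=40)

# Small forest with a fixed seed so both calls below fit identical models.
model = RandomForestRegressor(n_estimators=20, random_state=0)
cv = LeaveOneOut()

# Per-fold RMSE, as in the notebook; each fold contains exactly one sample.
fold_scores = cross_val_score(
    model, X_demo, y_demo, scoring='neg_root_mean_squared_error', cv=cv
)

# Out-of-fold predictions from the same leave-one-out splits.
preds = cross_val_predict(model, X_demo, y_demo, cv=cv)
errors = y_demo - preds

mean_fold_rmse = -fold_scores.mean()          # what the notebook reports
mae = np.abs(errors).mean()                   # mean absolute error
pooled_rmse = np.sqrt(np.mean(errors ** 2))   # RMSE over all held-out predictions

print(np.isclose(mean_fold_rmse, mae))
print(pooled_rmse >= mae)
```

The pooled RMSE is never smaller than the mean absolute error, so a baseline reported as an RMSE over all held-out predictions would be at least as large as the figure obtained from averaging per-fold scores.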

# 3. Fit the model

In [14]:
# Refit on the full dataset: cross_val_score fits clones and leaves rfregressor itself unfitted
rfregressor.fit(X, y)

RandomForestRegressor()

# 4. Save the model

The fitted model is saved to a pickle file.

In [15]:
with open('random_forest.pkl','wb') as f:
    pickle.dump(rfregressor,f)
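
For completeness, here is a minimal sketch of the round trip: saving a fitted forest and loading it back for prediction. It uses synthetic data and a temporary directory (both assumptions) so that it runs on its own; `X_demo`, `y_demo`, `model_path` and `restored` are illustrative names rather than part of the original notebook.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 50 samples, 4 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'random_forest.pkl'

    # Save the fitted model, as in the cell above.
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

    # Load it back; only unpickle files from a trusted source.
    with open(model_path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

A pickled scikit-learn model is generally only safe to load with the same scikit-learn version that produced it, so it is worth recording that version next to the file.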