# Decision Trees and Random Forests

![](https://i.imgur.com/3sw1fY9.jpg)

We'll follow a step-by-step process:

1. Download and prepare the dataset for training
2. Train, evaluate and interpret a decision tree
3. Train, evaluate and interpret a random forest
4. Tune hyperparameters to improve the model
5. Make predictions and save the model

Let's begin by installing the required libraries.

In [1]:
!pip install opendatasets scikit-learn plotly folium --upgrade --quiet


In [2]:
!pip install pandas numpy matplotlib seaborn --quiet

## Download and prepare the dataset for training

In [3]:
import os
from zipfile import ZipFile
from urllib.request import urlretrieve

dataset_url = 'https://github.com/JovianML/opendatasets/raw/master/data/house-prices-advanced-regression-techniques.zip'
urlretrieve(dataset_url, 'house-prices.zip')
with ZipFile('house-prices.zip') as f:
    f.extractall(path='house-prices')
    
os.listdir('house-prices')

['train.csv', 'sample_submission.csv', 'test.csv', 'data_description.txt']

In [4]:
import pandas as pd
pd.options.display.max_columns = 200
pd.options.display.max_rows = 200

prices_df = pd.read_csv('house-prices/train.csv')
prices_df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,6,5,1999,2000,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0,Unf,0,953,953,GasA,Ex,Y,SBrkr,953,694,0,1647,0,0,2,1,3,1,TA,7,Typ,1,TA,Attchd,1999.0,RFn,2,460,TA,TA,Y,0,40,0,0,0,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NWAmes,Norm,Norm,1Fam,1Story,6,6,1978,1988,Gable,CompShg,Plywood,Plywood,Stone,119.0,TA,TA,CBlock,Gd,TA,No,ALQ,790,Rec,163,589,1542,GasA,TA,Y,SBrkr,2073,0,0,2073,1,0,2,0,3,1,TA,7,Min1,2,TA,Attchd,1978.0,Unf,2,500,TA,TA,Y,349,0,0,0,0,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,9,1941,2006,Gable,CompShg,CemntBd,CmentBd,,0.0,Ex,Gd,Stone,TA,Gd,No,GLQ,275,Unf,0,877,1152,GasA,Ex,Y,SBrkr,1188,1152,0,2340,0,0,2,0,4,1,Gd,9,Typ,2,Gd,Attchd,1941.0,RFn,1,252,TA,TA,Y,0,60,0,0,0,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,6,1950,1996,Hip,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,TA,TA,Mn,GLQ,49,Rec,1029,0,1078,GasA,Gd,Y,FuseA,1078,0,0,1078,1,0,1,0,2,1,Gd,5,Typ,0,,Attchd,1950.0,Unf,1,240,TA,TA,Y,366,0,112,0,0,0,,,,0,4,2010,WD,Normal,142125


### Columns to Use for the Model

In [5]:
columnas = ['2ndFlrSF', 'BsmtFinSF1', 'GarageYrBlt', '1stFlrSF','GarageArea', 'TotalBsmtSF', 'YearBuilt', 'GarageCars', 'GrLivArea','OverallQual','SalePrice']

prices_df = prices_df[columnas]

### What Each Column Means

Source: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

*   2ndFlrSF -> Second floor square feet
*   BsmtFinSF1 -> Basement finished area (square feet)
*   GarageYrBlt -> Year garage was built
*   1stFlrSF -> First floor square feet
*   GarageArea -> Size of garage in square feet
*   TotalBsmtSF -> Total square feet of basement area
*   YearBuilt -> Original construction date
*   GarageCars -> Size of garage in car capacity
*   GrLivArea -> Above grade (ground) living area square feet
*   OverallQual -> Rates the overall material and finish of the house:
    *   10: Very Excellent
    *   9: Excellent
    *   8: Very Good
    *   7: Good
    *   6: Above Average
    *   5: Average
    *   4: Below Average
    *   3: Fair
    *   2: Poor
    *   1: Very Poor



In [6]:
prices_df

Unnamed: 0,2ndFlrSF,BsmtFinSF1,GarageYrBlt,1stFlrSF,GarageArea,TotalBsmtSF,YearBuilt,GarageCars,GrLivArea,OverallQual,SalePrice
0,854,706,2003.0,856,548,856,2003,2,1710,7,208500
1,0,978,1976.0,1262,460,1262,1976,2,1262,6,181500
2,866,486,2001.0,920,608,920,2001,2,1786,7,223500
3,756,216,1998.0,961,642,756,1915,3,1717,7,140000
4,1053,655,2000.0,1145,836,1145,2000,3,2198,8,250000
...,...,...,...,...,...,...,...,...,...,...,...
1455,694,0,1999.0,953,460,953,1999,2,1647,6,175000
1456,0,790,1978.0,2073,500,1542,1978,2,2073,6,210000
1457,1152,275,1941.0,1188,252,1152,1941,1,2340,7,266500
1458,0,49,1950.0,1078,240,1078,1950,1,1078,5,142125


## Start Modeling

In [7]:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

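# Identify input and target columns
# Note: the slice [1:-1] skips the first column ('2ndFlrSF') as well as the target,
# so '2ndFlrSF' is not used as an input; use columns[:-1] to include it
input_cols, target_col = prices_df.columns[1:-1], prices_df.columns[-1]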
inputs_df, targets = prices_df[input_cols].copy(), prices_df[target_col].copy()

# Identify numeric columns
numeric_cols = prices_df[input_cols].select_dtypes(include=np.number).columns.tolist()

# Impute and scale numeric columns
imputer = SimpleImputer().fit(inputs_df[numeric_cols])
inputs_df[numeric_cols] = imputer.transform(inputs_df[numeric_cols])
scaler = MinMaxScaler().fit(inputs_df[numeric_cols])
inputs_df[numeric_cols] = scaler.transform(inputs_df[numeric_cols])
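
To make the two preprocessing steps concrete, here is a minimal sketch on a toy column. The values are made up for illustration and do not come from the dataset. It relies on the defaults used in the cell above: `SimpleImputer()` fills missing values with the column mean, and `MinMaxScaler()` rescales each column so that its minimum maps to 0 and its maximum to 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy column with one missing value (illustrative data, not from the dataset)
toy = np.array([[1.0], [np.nan], [3.0], [5.0]])

# Default strategy is 'mean': the NaN becomes mean(1, 3, 5) = 3
imputed = SimpleImputer().fit_transform(toy)

# Min-max scaling: (x - min) / (max - min), here (x - 1) / 4
scaled = MinMaxScaler().fit_transform(imputed)

print(imputed.ravel())  # 1, 3, 3, 5
print(scaled.ravel())   # 0, 0.5, 0.5, 1
```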


In [8]:
# Create training and validation sets
train_inputs, val_inputs, train_targets, val_targets = train_test_split(
    inputs_df, targets, test_size=0.25, random_state=42)
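
The cells above fit the imputer and scaler on the full dataset and split afterwards, so the validation rows influence the preprocessing statistics. A common alternative is to split first and fit the preprocessing on the training rows only. The sketch below shows that ordering on synthetic data; the arrays `X` and `y` and the names `train_imputer` and `train_scaler` are hypothetical stand-ins, not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # synthetic stand-in for inputs_df
X[rng.random(X.shape) < 0.1] = np.nan  # blank out roughly 10% of the values
y = rng.normal(size=100)               # synthetic stand-in for targets

# Split first ...
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

# ... then learn the imputation means and scaling ranges from the training rows only
train_imputer = SimpleImputer().fit(X_train)
train_scaler = MinMaxScaler().fit(train_imputer.transform(X_train))

# Apply the same fitted transforms to both sets
X_train_prep = train_scaler.transform(train_imputer.transform(X_train))
X_val_prep = train_scaler.transform(train_imputer.transform(X_val))

print(X_train_prep.shape, X_val_prep.shape)  # (75, 3) (25, 3)
```

With this ordering, validation values can fall slightly outside the 0 to 1 range, which is expected because the scaler never saw them.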

## Decision Tree


In [9]:
from sklearn.tree import DecisionTreeRegressor

In [10]:
# Create the model
tree = DecisionTreeRegressor()

In [11]:
# Fit the model to the training data
tree.fit(train_inputs,train_targets)

DecisionTreeRegressor()

In [12]:
from sklearn.metrics import mean_squared_error
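
`mean_squared_error` returns the mean of the squared errors (MSE), which is in squared target units. The root mean squared error (RMSE) is its square root and is in the same units as the target, dollars in this case. A minimal sketch with two made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only
y_true = np.array([3.0, 5.0])
y_pred = np.array([1.0, 5.0])

mse = mean_squared_error(y_true, y_pred)  # ((3 - 1)**2 + (5 - 5)**2) / 2 = 2.0
rmse = np.sqrt(mse)                       # sqrt(2.0), about 1.414

print(mse, rmse)
```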

In [13]:
tree_train_preds = tree.predict(train_inputs)

In [14]:
# mean_squared_error returns the MSE; take the square root to get the RMSE
tree_train_rmse = np.sqrt(mean_squared_error(train_targets, tree_train_preds))

In [15]:
tree_val_preds = tree.predict(val_inputs)

In [16]:
tree_val_rmse = np.sqrt(mean_squared_error(val_targets, tree_val_preds))

In [17]:
print('Train RMSE: {:.2f}, Validation RMSE: {:.2f}'.format(tree_train_rmse, tree_val_rmse))

Train RMSE: 532.95, Validation RMSE: 42323.33
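
A training error far below the validation error, as in the output above, is a common sign that an unconstrained tree has memorized the training set. One way to both limit and inspect a tree is to cap its depth and print its rules. The sketch below does this on synthetic data; the feature names `feat_a` and `feat_b` are hypothetical, and `max_depth=2` is chosen only to keep the printout short.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
# Synthetic target: only the first feature carries signal
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=200)

# A shallow tree is easier to read and less prone to overfitting
small_tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# Text rendering of the learned split rules
print(export_text(small_tree, feature_names=['feat_a', 'feat_b']))

# Relative contribution of each feature to the splits (sums to 1)
importances = small_tree.feature_importances_
print(importances)
```

The same two calls should work on the notebook's `tree` with `feature_names=list(train_inputs.columns)`, although an unconstrained tree produces a very long printout.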


## Random Forests


In [18]:
from sklearn.ensemble import RandomForestRegressor

In [19]:
# Create the model
rf1 = RandomForestRegressor(random_state=42)

In [20]:
# Fit the model
rf1.fit(train_inputs,train_targets)

RandomForestRegressor(random_state=42)

In [21]:
rf1_train_preds = rf1.predict(train_inputs)

In [22]:
# mean_squared_error returns the MSE; take the square root to get the RMSE
rf1_train_rmse = np.sqrt(mean_squared_error(train_targets, rf1_train_preds))

In [23]:
rf1_val_preds = rf1.predict(val_inputs)

In [24]:
rf1_val_rmse = np.sqrt(mean_squared_error(val_targets, rf1_val_preds))

In [25]:
print('Train RMSE: {:.2f}, Validation RMSE: {:.2f}'.format(rf1_train_rmse, rf1_val_rmse))

Train RMSE: 12586.85, Validation RMSE: 28688.19
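
A random forest regressor predicts by averaging the predictions of its individual trees, each trained on a different random view of the data. The sketch below checks this on synthetic data by averaging the trees by hand and comparing with `predict`; the data and the choice of 20 trees are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

# One row of predictions per fitted tree
per_tree = np.stack([t.predict(X) for t in forest.estimators_])

# Averaging across trees reproduces the forest's own prediction
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(X)))

# Feature importances averaged over the trees (sums to 1)
print(forest.feature_importances_)
```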


## Hyperparameter Tuning

Let us now tune the hyperparameters of our model. You can find the hyperparameters for `RandomForestRegressor` here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

<img src="https://i.imgur.com/EJCrSZw.png" width="480">

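Hyperparameters are settings chosen before training (for example, the number of trees or the maximum depth) that control how the model learns; they are not learned from the data.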

In [26]:
# Best parameters found by a randomized search
{'bootstrap': False,
 'max_depth': 90,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 5,
 'n_estimators': 800}

{'bootstrap': False,
 'max_depth': 90,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 5,
 'n_estimators': 800}

In [27]:
rf2 = RandomForestRegressor(bootstrap=False,
                            max_depth=90,
                            max_features='sqrt',
                            min_samples_leaf=1,
                            min_samples_split=5,
                            n_estimators=800)

In [28]:
# Train the model
rf2.fit(train_inputs,train_targets)

RandomForestRegressor(bootstrap=False, max_depth=90, max_features='sqrt',
                      min_samples_split=5, n_estimators=800)

In [29]:
rf2_train_preds = rf2.predict(train_inputs)
rf2_train_rmse = np.sqrt(mean_squared_error(rf2_train_preds,train_targets))
rf2_val_preds = rf2.predict(val_inputs)
rf2_val_rmse = np.sqrt(mean_squared_error(rf2_val_preds,val_targets))
print('Train RMSE: {}, Validation RMSE: {}'.format(rf2_train_rmse, rf2_val_rmse))

Train RMSE: 6168.245435277977, Validation RMSE: 28014.52935189268


In [30]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# (1.0 means all features; the older 'auto' option was removed in scikit-learn 1.3)
max_features = [1.0, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [31]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor(random_state = 42)
# Random search of parameters, using 3-fold cross-validation,
# search across 10 different combinations (n_iter), and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                              n_iter = 10, scoring='neg_mean_absolute_error', 
                              cv = 3, verbose=2, random_state=42, n_jobs=-1,
                              return_train_score=True)

# Fit the random search model
rf_random.fit(train_inputs, train_targets);

Fitting 3 folds for each of 10 candidates, totalling 30 fits
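
The log above reports 10 candidates. To see how small a sample that is, the sketch below counts the combinations in a grid with the same list lengths as `random_grid`. The dictionary name `grid_sketch` is hypothetical, and only the number of options per parameter matters for the count.

```python
import math
import numpy as np

# Same number of options per parameter as random_grid above
grid_sketch = {
    'n_estimators': [int(x) for x in np.linspace(200, 2000, num=10)],       # 10 options
    'max_features': [1.0, 'sqrt'],                                           # 2 options
    'max_depth': [int(x) for x in np.linspace(10, 110, num=11)] + [None],   # 12 options
    'min_samples_split': [2, 5, 10],                                         # 3 options
    'min_samples_leaf': [1, 2, 4],                                           # 3 options
    'bootstrap': [True, False],                                              # 2 options
}

total_combinations = math.prod(len(options) for options in grid_sketch.values())
print(total_combinations)       # 10 * 2 * 12 * 3 * 3 * 2 = 4320
print(10 / total_combinations)  # fraction of the grid sampled with n_iter=10
```

With `n_iter=10`, the search evaluates well under 1% of the grid, so the best parameters can change noticeably from run to run.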


In [32]:
rf_random.best_params_

{'bootstrap': True,
 'max_depth': 30,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 5,
 'n_estimators': 1400}

In [33]:
# rf_random.predict uses the best estimator, refit on the full training set
rs_train_preds = rf_random.predict(train_inputs)
rs_train_rmse = np.sqrt(mean_squared_error(train_targets, rs_train_preds))
rs_val_preds = rf_random.predict(val_inputs)
rs_val_rmse = np.sqrt(mean_squared_error(val_targets, rs_val_preds))
print('Train RMSE: {}, Validation RMSE: {}'.format(rs_train_rmse, rs_val_rmse))

Train RMSE: 15503.569462918796, Validation RMSE: 28623.355938922843


Which model is better: the one with the initial hyperparameters, or the one found by the randomized search?


*   Try more iterations and different values of `cv`
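
One way to act on the suggestion above is to loop over a few `(n_iter, cv)` pairs and compare the best cross-validated score of each search. The sketch below does this on synthetic data with a deliberately small grid so that it runs quickly. The data, the grid, and the chosen pairs are all illustrative assumptions, and it scores with `'neg_root_mean_squared_error'` so that the search optimizes the same metric used for evaluation in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=120)

# Small illustrative grid (18 combinations in total)
param_distributions = {
    'n_estimators': [20, 50],
    'max_depth': [3, 5, None],
    'min_samples_leaf': [1, 2, 4],
}

results = {}
for n_iter, cv in [(3, 3), (6, 5)]:
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=42),
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=cv,
        scoring='neg_root_mean_squared_error',
        random_state=42,
    ).fit(X, y)
    # best_score_ is negated RMSE, so flip the sign for readability
    results[(n_iter, cv)] = (-search.best_score_, search.best_params_)

for (n_iter, cv), (cv_rmse, params) in results.items():
    print('n_iter={}, cv={}: CV RMSE {:.3f} with {}'.format(n_iter, cv, cv_rmse, params))
```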



### Making Predictions on Single Inputs

In [34]:
def predict_input(model, single_input):
    input_df = pd.DataFrame([single_input])
    input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    return model.predict(input_df[numeric_cols])[0]

In [35]:
inputs_df.columns  # these are the values the frontend must supply

Index(['BsmtFinSF1', 'GarageYrBlt', '1stFlrSF', 'GarageArea', 'TotalBsmtSF',
       'YearBuilt', 'GarageCars', 'GrLivArea', 'OverallQual'],
      dtype='object')

In [36]:
sample_input = {'BsmtFinSF1':100, 'GarageYrBlt':200, '1stFlrSF':200, 'GarageArea':500, 'TotalBsmtSF':400,'YearBuilt':87, 'GarageCars':8, 'GrLivArea':78, 'OverallQual':10}

predicted_price = round(predict_input(rf2, sample_input),2)

print('The predicted sale price of the house is ${}'.format(predicted_price))

The predicted sale price of the house is $246057.19
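
Some values in the sample input, such as a `YearBuilt` of 87, are probably far outside anything the model saw during training, so the prediction should be read with caution. A fitted `MinMaxScaler` stores the per-column minimum and maximum in `data_min_` and `data_max_`, which makes a simple range check possible. The helper `out_of_range_fields` and the tiny training frame below are hypothetical and only illustrate the idea.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up training values, for illustration only
train = pd.DataFrame({'YearBuilt': [1900, 1950, 2010],
                      'GrLivArea': [500, 1500, 4000]})
range_scaler = MinMaxScaler().fit(train)

def out_of_range_fields(scaler, columns, single_input):
    """Return the names of fields whose values fall outside the fitted range."""
    flagged = []
    for name, low, high in zip(columns, scaler.data_min_, scaler.data_max_):
        if not (low <= single_input[name] <= high):
            flagged.append(name)
    return flagged

flagged = out_of_range_fields(range_scaler, train.columns,
                              {'YearBuilt': 87, 'GrLivArea': 1200})
print(flagged)  # ['YearBuilt']
```

A frontend could use such a check to warn the user before calling `predict_input`.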


### Saving the Model

In [37]:
import pickle

In [38]:
# Let's dump our random forest model to disk
with open('rf_model.pkl', 'wb') as f:
    pickle.dump(rf2, f)

### Loading the Model

In [39]:
with open('rf_model.pkl', 'rb') as f:
    rf_load = pickle.load(f)

In [41]:
sample_input = {'BsmtFinSF1':100, 'GarageYrBlt':200, '1stFlrSF':200, 'GarageArea':500, 'TotalBsmtSF':400,'YearBuilt':87, 'GarageCars':8, 'GrLivArea':78, 'OverallQual':10}


predicted_price = round(predict_input(rf_load, sample_input),2)

print('The predicted sale price of the house is ${}'.format(predicted_price))

The predicted sale price of the house is $246057.19
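
Note that `predict_input` also depends on the fitted `imputer` and `scaler`, which `rf_model.pkl` does not contain. One possible approach is to pickle the model together with its preprocessing objects in a single dictionary, so a fresh process can reproduce the full prediction path. The sketch below demonstrates this on synthetic data; the names `demo_model`, `demo_imputer`, `demo_scaler`, and `model_bundle.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=100)

# Fit preprocessing and model on synthetic data
demo_imputer = SimpleImputer().fit(X)
demo_scaler = MinMaxScaler().fit(demo_imputer.transform(X))
demo_model = RandomForestRegressor(n_estimators=10, random_state=42)
demo_model.fit(demo_scaler.transform(demo_imputer.transform(X)), y)

# Keep everything needed for inference in one object
bundle = {'model': demo_model, 'imputer': demo_imputer, 'scaler': demo_scaler,
          'input_cols': ['feat_a', 'feat_b']}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_bundle.pkl')
    with open(path, 'wb') as f:
        pickle.dump(bundle, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored bundle gives the same prediction as the original objects
new_row = np.array([[4.0, np.nan]])
prepared = restored['scaler'].transform(restored['imputer'].transform(new_row))
restored_pred = restored['model'].predict(prepared)
original_pred = demo_model.predict(
    demo_scaler.transform(demo_imputer.transform(new_row)))
print(np.allclose(restored_pred, original_pred))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.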
