# Machine Learning with Decision Tree and Random Forest

Now I will train a regression model using a decision tree and then a random forest.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [2]:
# Bring in the data
%store -r X
%store -r y

I need to split the data into training and test sets again because this time I will not scale the features. Tree-based models split on feature thresholds, so they do not depend on feature scale.

In [3]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)
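
As a quick illustration of why scaling can be skipped for tree models, the sketch below fits the same decision tree on raw and standardized features and compares the predictions. The data is synthetic and the names `X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the notebook's X and y).
rng = np.random.default_rng(1)
X_demo = np.column_stack([
    rng.permutation(200) * 1000.0,  # feature on a large scale
    rng.permutation(200) * 0.01,    # feature on a small scale
])
y_demo = 3.0 * X_demo[:, 0] / 1000.0 + 50.0 * X_demo[:, 1] + rng.normal(size=200)

# Standardize the features the way a linear model would need them.
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit identical trees on the raw and the standardized features.
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_demo, y_demo)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_demo_scaled, y_demo)

# Compare predictions on the training points.
pred_raw = tree_raw.predict(X_demo)
pred_scaled = tree_scaled.predict(X_demo_scaled)
same_predictions = np.allclose(pred_raw, pred_scaled)
print("Predictions match:", same_predictions)
```

Standardization is a monotonic transformation of each feature, so it preserves the ordering of the samples and therefore the partitions a tree can choose.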

### Decision Tree Regressor

In [4]:
# Define a decision tree function that takes lists of criteria and max depths and returns a table of R squared and RMSE.
def dt_R2_RMSE(criterion, max_depth):
    data = []
    for crit in criterion:
        for depth in max_depth:
            model = DecisionTreeRegressor(criterion=crit, max_depth=depth, random_state=1)
            model.fit(X_train, y_train)
    
            model_r2_train = model.score(X_train, y_train)
            model_r2_test = model.score(X_test, y_test)
    
            model_y_pred_train = model.predict(X_train)
            model_y_pred_test = model.predict(X_test)
            model_RMSE_train = np.sqrt(mean_squared_error(y_train, model_y_pred_train))
            model_RMSE_test = np.sqrt(mean_squared_error(y_test, model_y_pred_test))
    
            data.append([crit, depth, model_r2_train, model_r2_test, model_RMSE_train, model_RMSE_test])
            
    table = pd.DataFrame(data, columns = ['Criterion','Max_depth', 'Training R2', 'Test R2', 'Training RMSE', 'Test RMSE'])
    
    return table
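
For reference, the two metrics reported by the function above can be reproduced by hand. The sketch below uses a small made-up pair of arrays (`y_true` and `y_pred` are hypothetical values, not taken from the dataset) and checks the manual formulas against `r2_score` and `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# R squared: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2_manual, r2_score(y_true, y_pred))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
```

For a regressor, `model.score` returns the same R squared that `r2_score` computes, which is why the function can use `score` directly.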

In [6]:
criterion = ['squared_error']
max_depth = [None, 9, 6, 3, 2, 1]

dt_results = dt_R2_RMSE(criterion, max_depth)
dt_results

| | Criterion | Max_depth | Training R2 | Test R2 | Training RMSE | Test RMSE |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | squared_error | None | 1.0 | 0.648793 | 0.0 | 6873.33244 |
| 1 | squared_error | 9.0 | 0.962179 | 0.719763 | 2385.618018 | 6139.727399 |
| 2 | squared_error | 6.0 | 0.90029 | 0.790269 | 3873.491437 | 5311.497672 |
| 3 | squared_error | 3.0 | 0.860714 | 0.836862 | 4578.104656 | 4684.505601 |
| 4 | squared_error | 2.0 | 0.832431 | 0.805508 | 5021.460218 | 5114.896537 |
| 5 | squared_error | 1.0 | 0.635737 | 0.56499 | 7403.550165 | 7649.550036 |


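A max depth of 3 appears to be optimal: it gives the lowest test RMSE (4684.51) and the highest test R squared (0.837), and the training and test scores are close, so the tree is not yet overfitting. Deeper trees fit the training data better but generalize worse, and an unlimited tree reaches a training R squared of 1.0 with a test R squared of only 0.649.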
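
The same reading of the table can be made explicit in code. The sketch below re-enters the results reported above into a DataFrame (the name `depth_results` and the `R2 gap` column are hypothetical additions), computes the gap between training and test R squared as a rough overfitting indicator, and picks the depth with the lowest test RMSE.

```python
import numpy as np
import pandas as pd

# Values copied from the results table above; NaN stands for max_depth=None.
depth_results = pd.DataFrame({
    "Max_depth": [np.nan, 9, 6, 3, 2, 1],
    "Training R2": [1.0, 0.962179, 0.90029, 0.860714, 0.832431, 0.635737],
    "Test R2": [0.648793, 0.719763, 0.790269, 0.836862, 0.805508, 0.56499],
    "Test RMSE": [6873.33244, 6139.727399, 5311.497672,
                  4684.505601, 5114.896537, 7649.550036],
})

# A large positive gap means the tree fits the training set much better
# than the test set, which is a sign of overfitting.
depth_results["R2 gap"] = depth_results["Training R2"] - depth_results["Test R2"]

# Row with the lowest test RMSE.
best_row = depth_results.loc[depth_results["Test RMSE"].idxmin()]
print(depth_results)
print("Best max depth by test RMSE:", best_row["Max_depth"])
```

Choosing a hyperparameter on the test set can make the reported test score optimistic; a separate validation split or cross-validation is one common way to avoid that.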

In [7]:
criterion = ['squared_error','friedman_mse','absolute_error','poisson']
max_depth = [3]

dt_results1 = dt_R2_RMSE(criterion, max_depth)
dt_results1

| | Criterion | Max_depth | Training R2 | Test R2 | Training RMSE | Test RMSE |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | squared_error | 3 | 0.860714 | 0.836862 | 4578.104656 | 4684.505601 |
| 1 | friedman_mse | 3 | 0.860714 | 0.836862 | 4578.104656 | 4684.505601 |
| 2 | absolute_error | 3 | 0.84736 | 0.831217 | 4792.54465 | 4764.86546 |
| 3 | poisson | 3 | 0.663235 | 0.586406 | 7118.622571 | 7458.872129 |


Squared error and Friedman MSE produce identical results here, and both give the best fit.

This model performs better than the previous best model, an elastic net regression with alpha equal to 0.01 and L1 ratio equal to 0.75. Its test R2 value is 0.84 and its test RMSE is 4684.51.

### Random Forest Regressor

In [8]:
# Define a random forest function that takes a list of n_estimators values and returns a table of R squared and RMSE.
def rf_R2_RMSE(n_estimators):
    data = []
    for n in n_estimators:
        model = RandomForestRegressor(n_estimators=n, criterion='squared_error', max_depth=3, random_state=1)
        model.fit(X_train, y_train)
    
        model_r2_train = model.score(X_train, y_train)
        model_r2_test = model.score(X_test, y_test)
    
        model_y_pred_train = model.predict(X_train)
        model_y_pred_test = model.predict(X_test)
        model_RMSE_train = np.sqrt(mean_squared_error(y_train, model_y_pred_train))
        model_RMSE_test = np.sqrt(mean_squared_error(y_test, model_y_pred_test))
    
        data.append([n, model_r2_train, model_r2_test, model_RMSE_train, model_RMSE_test])
            
    table = pd.DataFrame(data, columns = ['Estimators', 'Training R2', 'Test R2', 'Training RMSE', 'Test RMSE'])
    
    return table

In [10]:
n_estimators = [500, 400, 300, 200, 150, 100, 75, 50, 25]

rf_results = rf_R2_RMSE(n_estimators)
rf_results

| | Estimators | Training R2 | Test R2 | Training RMSE | Test RMSE |
| --- | --- | --- | --- | --- | --- |
| 0 | 500 | 0.865829 | 0.845451 | 4493.268535 | 4559.522879 |
| 1 | 400 | 0.86588 | 0.845482 | 4492.417071 | 4559.057922 |
| 2 | 300 | 0.865862 | 0.845445 | 4492.712024 | 4559.612126 |
| 3 | 200 | 0.865887 | 0.845213 | 4492.297084 | 4563.033507 |
| 4 | 150 | 0.86586 | 0.844992 | 4492.736316 | 4566.285567 |
| 5 | 100 | 0.865805 | 0.844913 | 4493.660845 | 4567.44359 |
| 6 | 75 | 0.865781 | 0.845111 | 4494.062035 | 4564.527144 |
| 7 | 50 | 0.865727 | 0.844993 | 4494.970282 | 4566.272178 |
| 8 | 25 | 0.865401 | 0.844018 | 4500.430107 | 4580.610101 |


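A random forest with 400 estimators has the lowest test RMSE (4559.06), although the differences are small: test RMSE varies by less than 22 across all the estimator counts tried.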

This model performs better than the previous best model, the decision tree with criterion equal to squared error and max depth of 3. The gap between training and test R2 is also slightly smaller (0.020 versus 0.024), so the model shows marginally less overfitting. The test R2 value is 0.845 and the test RMSE is 4559.06.
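
One way to see where the improvement over a single tree comes from is that a random forest regressor predicts the average of its individual trees, each fitted on a bootstrap sample. The sketch below checks this on synthetic data; `X_demo`, `y_demo` and the choice of 25 trees are illustrative assumptions, not the notebook's data or final model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption; not the notebook's X and y).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=25, max_depth=3, random_state=1)
forest.fit(X_demo, y_demo)

# Predictions of every individual tree, shape (n_trees, n_samples).
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# The forest prediction is the mean over the trees.
manual_average = per_tree.mean(axis=0)
matches_forest = np.allclose(manual_average, forest.predict(X_demo))

# Average disagreement between trees; averaging smooths this out.
tree_spread = per_tree.std(axis=0).mean()
print("Average equals forest prediction:", matches_forest)
print("Mean spread across trees:", tree_spread)
```

Because the individual trees disagree slightly, their average tends to be more stable than any single tree, which is consistent with the smaller train-test gap seen above.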