# 03 Random Forest Models

## Imports

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
import pickle

## Read in the training and test data

In [2]:
with open('../../02_Data/02_Processed_Data/X_train.pkl', 'rb') as f:
    X_train = pickle.load(f)
    
with open('../../02_Data/02_Processed_Data/y_train.pkl', 'rb') as f:
    y_train = pickle.load(f)    

with open('../../02_Data/02_Processed_Data/X_test.pkl', 'rb') as f:
    X_test = pickle.load(f)
    
with open('../../02_Data/02_Processed_Data/y_test.pkl', 'rb') as f:
    y_test = pickle.load(f)    

## Read in Select 100 Best and PCA datasets for modeling

In [3]:
with open('../../02_Data/02_Processed_Data/X_train_100b.pkl', 'rb') as f:
    X_train_100b = pickle.load(f)
    
with open('../../02_Data/02_Processed_Data/X_test_100b.pkl', 'rb') as f:
    X_test_100b = pickle.load(f)
    
with open('../../02_Data/02_Processed_Data/X_train_100b_pca.pkl', 'rb') as f:
    X_train_100b_pca = pickle.load(f)
    
with open('../../02_Data/02_Processed_Data/X_test_100b_pca.pkl', 'rb') as f:
    X_test_100b_pca = pickle.load(f)   

## Naive Random Forest

In [4]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print('Train:', rf.score(X_train,y_train))
print('Test:',rf.score(X_test,y_test))

Train: 0.9818786982248521
Test: 0.5764705882352941


The untuned random forest classifier at least beats the baseline (50/50), but it is badly overfit: 0.98 accuracy on train versus only 0.58 on test. This is not surprising. With no max depth set, each tree keeps growing until it classifies its training sample almost perfectly, so the forest ends up fitting noise in the training data.
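
One way to check the depth explanation is to look at how deep the individual trees actually grow. The sketch below uses synthetic data from `make_classification` (an assumption; it is not the fight data loaded above) and compares an unconstrained forest with one capped at `max_depth=3`. The names `X_demo`, `deep_rf`, and `shallow_rf` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): noisy binary labels, 300 rows, 20 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=42
)

# Same forest twice: once with no depth limit, once capped at depth 3.
deep_rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)
shallow_rf = RandomForestClassifier(
    n_estimators=50, max_depth=3, random_state=42
).fit(X_demo, y_demo)

# Each fitted tree in the forest exposes its depth through get_depth().
deep_depths = [tree.get_depth() for tree in deep_rf.estimators_]
shallow_depths = [tree.get_depth() for tree in shallow_rf.estimators_]

print('Unconstrained depth range:', min(deep_depths), '-', max(deep_depths))
print('Capped depth range:', min(shallow_depths), '-', max(shallow_depths))
print('Unconstrained train accuracy:', deep_rf.score(X_demo, y_demo))
print('Capped train accuracy:', shallow_rf.score(X_demo, y_demo))
```

Running the same list comprehension on the fitted `rf` from the cell above would show how deep the trees grew on the real training data.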

## Random Forest with Initial Tuning

In [5]:
rf = RandomForestClassifier(n_estimators=1000, max_depth=3, min_samples_split=2, random_state=42)
rf.fit(X_train, y_train)
print('Train:', rf.score(X_train,y_train))
print('Test:',rf.score(X_test,y_test))

Train: 0.6834319526627219
Test: 0.596078431372549


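Capping the max depth at 3 and increasing the estimators to 1000 cut the train score by about 0.30 (0.98 to 0.68) and raised the test score slightly (0.58 to 0.60). The train/test gap shrank from about 0.41 to about 0.09, but the model is still overfit. Note that `min_samples_split=2` is the scikit-learn default, so setting it explicitly has no effect here.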
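
The size of the overfitting can be made explicit as the gap between train and test accuracy. The short calculation below only re-uses the scores printed by the two cells above; the variable names are illustrative.

```python
# Scores copied from the naive and initially tuned random forest cells above.
naive_train, naive_test = 0.9818786982248521, 0.5764705882352941
tuned_train, tuned_test = 0.6834319526627219, 0.596078431372549

# Train/test gap for each model: a larger gap means more overfitting.
naive_gap = naive_train - naive_test
tuned_gap = tuned_train - tuned_test

print(f'Naive gap: {naive_gap:.3f}')
print(f'Tuned gap: {tuned_gap:.3f}')
print(f'Gap reduction: {naive_gap - tuned_gap:.3f}')
print(f'Test gain: {tuned_test - naive_test:.3f}')
```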

## Random Forest using Select K Best 100 features

In [7]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_100b, y_train)
print('Train:', rf.score(X_train_100b,y_train))
print('Test:',rf.score(X_test_100b,y_test))

Train: 0.9837278106508875
Test: 0.5764705882352941


Feature selection did not help reduce overfitting and made no difference in the test score (0.576 both times). This is unsurprising since the random forest model was already picking the best features to split on when given the entire feature space.
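
The `X_train_100b` and `X_test_100b` sets were built in an earlier notebook that is not shown here. As a rough sketch of how such a selection can be combined with a random forest, the example below fits `SelectKBest` on the training split only and reuses it on the test split. The `f_classif` score function and the synthetic data are assumptions, not necessarily what the earlier notebook used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 200 features, only 10 of them informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=200, n_informative=10, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Inside a pipeline the selector is fit on the training data only,
# then the same 100 columns are kept when scoring the test data.
pipe = make_pipeline(
    SelectKBest(score_func=f_classif, k=100),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
pipe.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name.
n_kept = pipe.named_steps['selectkbest'].get_support().sum()
print('Features kept:', n_kept)
print('Train:', pipe.score(X_tr, y_tr))
print('Test:', pipe.score(X_te, y_te))
```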

In [46]:
rf = RandomForestClassifier(n_estimators=1000, max_depth=3, min_samples_split=2, random_state=42)
rf.fit(X_train_100b, y_train)
print('Train:', rf.score(X_train_100b,y_train))
print('Test:',rf.score(X_test_100b,y_test))

Train: 0.6464497041420119
Test: 0.6078431372549019


With some light tuning and feature selection, the model is much less overfit (0.65 train vs 0.61 test), and the test score is up to 0.61.

## Look at the feature importances for the random forest

In [52]:
pd.DataFrame(rf.feature_importances_, index=X_train_100b.columns).sort_values(0, ascending=False)[:10]

feature,importance
f2_head_significant_strikes_landed_diff_avg,0.03172
f2_head_significant_strikes_percent_avg_diff,0.030295
f1_head_significant_strikes_landed_avg_diff,0.029473
f1_significant_strikes_landed_diff_avg,0.0282
f1_f2_clinch_head_strikes_percent_avg,0.024601
f1_significant_strikes_attempts_diff_avg,0.023133
f1_head_significant_strikes_landed_diff_avg,0.023019
f2_significant_strikes_landed_diff_avg,0.022822
f1_distance_head_strikes_landed_avg_diff,0.021123
f2_head_significant_strikes_attempts_diff_avg,0.020115
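
Impurity-based importances sum to 1 across all features, so with 100 selected features an average feature would score 0.01; the top features here are roughly two to three times that. A slightly more readable way to build the same ranking is a named `pd.Series` with `nlargest`. The sketch below runs on synthetic data with made-up column names (both are assumptions), but the same two lines would work on the fitted `rf` and `X_train_100b.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names (assumption).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(X_arr.shape[1])])

demo_rf = RandomForestClassifier(
    n_estimators=200, max_depth=3, random_state=42
).fit(X_demo, y_demo)

# A named Series keeps the feature names as the index and labels the values.
importances = pd.Series(
    demo_rf.feature_importances_, index=X_demo.columns, name='importance'
)
top10 = importances.nlargest(10)  # ten largest values, sorted descending

print(top10)
print('Share of total importance in top 10:', round(top10.sum(), 3))
```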


In [None]:
# TODO: rename the columns
#   - diff -> differential
#   - try to take the denominator out

In [13]:
%%time
rf = RandomForestClassifier(random_state=42)
rf_params = {
    'n_estimators': [1000],
    'max_depth': [3, 5, 7, 9, 11],
    'min_samples_split': [2]
}
gs = GridSearchCV(rf, param_grid=rf_params)
gs.fit(X_train, y_train)
print('Best Score:', gs.best_score_)
print('Best Parameters:', gs.best_params_)
print('Test:',gs.score(X_test,y_test))

Best Score: 0.5673076923076923
Best Parameters: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 1000}
Test: 0.6078431372549019
CPU times: user 3min 33s, sys: 375 ms, total: 3min 34s
Wall time: 3min 34s


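The grid search performed about as well as the initial tuning. It chose a max depth of 5, with a mean cross-validation score of 0.57 and a test score of 0.61, compared with a test score of 0.60 for the hand-tuned model on the same features.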
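
`best_score_` is the mean cross-validated score of the best parameter setting, not a test score, which is why it can sit below the held-out test result. To see how close the candidate depths were to each other, the full results can be pulled from `cv_results_`. The sketch below uses synthetic data and a smaller grid so that it runs quickly (both are assumptions); `cv=5` is passed explicitly because the default number of folds has changed between scikit-learn versions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=42
)

demo_params = {'n_estimators': [50], 'max_depth': [3, 5, 7]}
demo_gs = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=demo_params,
    cv=5,                     # number of folds, stated explicitly
    return_train_score=True,  # also record scores on the training folds
)
demo_gs.fit(X_demo, y_demo)

# One row per parameter combination, with mean and spread across folds.
results = pd.DataFrame(demo_gs.cv_results_)[
    ['param_max_depth', 'mean_train_score', 'mean_test_score', 'std_test_score']
]
print(results.sort_values('mean_test_score', ascending=False))
```

If the `mean_test_score` values differ by less than `std_test_score`, the choice between depths is within the noise of the folds.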

## Random Forest using Select K Best 100 features and PCA

In [14]:
rf = RandomForestClassifier(n_estimators=1000, max_depth=3, min_samples_split=2, random_state=42)
rf.fit(X_train_100b_pca, y_train)
print('Train:', rf.score(X_train_100b_pca,y_train))
print('Test:',rf.score(X_test_100b_pca,y_test))

Train: 0.6275887573964497
Test: 0.5490196078431373


In [16]:
%%time
rf = RandomForestClassifier(random_state=42)
rf_params = {
    'n_estimators': [1000],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2]
}
gs = GridSearchCV(rf, param_grid=rf_params)
gs.fit(X_train_100b_pca, y_train)
print('Best Score:', gs.best_score_)
print('Best Parameters:', gs.best_params_)
print('Train:',gs.score(X_train_100b_pca,y_train))
print('Test:',gs.score(X_test_100b_pca,y_test))

Best Score: 0.5558431952662722
Best Parameters: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 1000}
Train: 0.6275887573964497
Test: 0.5490196078431373
CPU times: user 28.3 s, sys: 40.5 ms, total: 28.4 s
Wall time: 28.4 s


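It looks like random forest does not perform as well on data that has been through PCA (0.55 test vs 0.61 on the Select K Best features with the same settings). I believe this is probably because each principal component is a linear combination of many of the original features, so the splits being made no longer isolate specific metrics.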
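
The point about components mixing features can be checked directly from a fitted `PCA` object. The sketch below uses synthetic data and 5 components (assumptions; the PCA settings used to build `X_train_100b_pca` are not shown in this notebook). It shows that each component puts weight on many original features and that the transformed columns are just the centred data projected onto those weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption).
X_demo, _ = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

pca = PCA(n_components=5, random_state=42)
X_pca = pca.fit_transform(X_demo)

# Each row of components_ holds one component's weights over the original features.
loadings = pca.components_
print('Loadings shape:', loadings.shape)

# Count the original features with a non-negligible weight in each component
# (the 0.1 cutoff is arbitrary and only for illustration).
print('Features with |weight| > 0.1 per component:',
      (np.abs(loadings) > 0.1).sum(axis=1))

# The transformed columns are the centred data projected onto those weights.
manual = (X_demo - pca.mean_) @ loadings.T
print('Matches fit_transform:', np.allclose(manual, X_pca))
```

A tree split on one of these columns is therefore a threshold on a weighted sum of many metrics rather than on a single one.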

## Trying gradient boosting models

In [None]:
%%time
gb = GradientBoostingClassifier(random_state=42)
gb_params = {
    'n_estimators': [500, 1000],
    'learning_rate': [0.05, 0.1, 0.3],
    'max_depth': [3,5]
}
gb_gs = GridSearchCV(gb, param_grid=gb_params, verbose=2, n_jobs=3)
gb_gs.fit(X_train, y_train)

In [40]:
%%time
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.001, max_depth=3, min_samples_leaf=4,
                                min_samples_split=4, random_state=42)
gb.fit(X_train_100b_pca, y_train)
print('Train:', gb.score(X_train_100b_pca,y_train))
print('Test:',gb.score(X_test_100b_pca,y_test))

Train: 0.5906065088757396
Test: 0.5666666666666667
CPU times: user 1.52 s, sys: 3.55 ms, total: 1.52 s
Wall time: 1.52 s


In [42]:
%%time
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01, max_depth=3, min_samples_leaf=4,
                                min_samples_split=4, random_state=42)
gb.fit(X_train_100b, y_train)
print('Train:', gb.score(X_train_100b,y_train))
print('Test:',gb.score(X_test_100b,y_test))

Train: 0.7514792899408284
Test: 0.5568627450980392
CPU times: user 8.03 s, sys: 6.93 ms, total: 8.04 s
Wall time: 8.04 s
