## Extra RandomForest Notebook
by Anitha 

In this Notebook, we will look at using Extra Trees ,which can often achieve as-good or better performance than the random forest algorithm, also Extremely Randomized Trees are surely faster than Random Forest due to the random nature of picking up thresholds.

In [190]:
#Import the needed modules
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    log_loss )
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from scipy.ndimage import gaussian_filter
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
import hyperopt.pyll.stochastic

load the base data from the CSV files

In [192]:
OUTPUT_DIR = '../data'

df_train = pd.read_csv(f'{OUTPUT_DIR}/Train_Dataset.csv')
df_test = pd.read_csv(f'{OUTPUT_DIR}/Test_Dataset.csv')

Set a random seed

In [193]:

RSEED = 42
np.random.seed(RSEED)

In [194]:

# get X for the train and validation data
X_train = df_train.drop(columns=["label", "field_id"])
X_val = df_test.drop(columns=["label", "field_id"])

# get y for the train and validation data
y_train = df_train["label"]
y_train = y_train.astype(int)
y_val = df_test["label"]
y_val = y_val.astype(int)

# set the class labels from 0 to 8 
y_train = y_train-1
y_val = y_val-1

Then we can use the Extra Trees model and make predictions for classification.

First, the Extra Trees ensemble is fit on all available data(Train), then the predict() function can be called to make predictions on new data.

In [198]:
model = ExtraTreesClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

In [199]:
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_val)
y_proba_train = model.predict_proba(X_train)
y_proba_test = model.predict_proba(X_val)


In [200]:
labels = y_train.unique()

In [201]:

print(f'Accuracy on train data: {accuracy_score(y_train, y_pred_train)}')
print(f'Accuracy on test data: {accuracy_score(y_val, y_pred_test)}')
print('---'*10)
print(f'F1-score on train data: {f1_score(y_train, y_pred_train, average="macro")}')
print(f'F1-score on test data: {f1_score(y_val, y_pred_test, average="macro")}')
print('---'*10)
print(f'Cross-entropy on train data: {log_loss(y_train, y_proba_train, labels=labels)}')
print(f'Cross-entropy on test data: {log_loss(y_val, y_proba_test, labels=labels)}')
print('---'*10)

Accuracy on train data: 1.0
Accuracy on test data: 0.6854385891943601
------------------------------
F1-score on train data: 1.0
F1-score on test data: 0.6386690201773105
------------------------------
Cross-entropy on train data: 6.994405055138511e-15
Cross-entropy on test data: 1.0089041966703398
------------------------------


The results of our model show us overfitting ,so we are going to do some generalization ,starting by doing a hyperparameter tuning

### Hyperparameter tuning

In [202]:
etr = ExtraTreesClassifier(n_estimators=100, random_state=42,
                                                    min_samples_split = 20,
                                                    min_samples_leaf = 10,
                                                    max_features = 35,
                                                    max_depth = 85,
                                                    )
etr.fit(X_train, y_train)

In [203]:
y_pred_train = etr.predict(X_train)
y_pred_test = etr.predict(X_val)
y_proba_train = etr.predict_proba(X_train)
y_proba_test = etr.predict_proba(X_val)

In [204]:

print(f'Accuracy on train data: {accuracy_score(y_train, y_pred_train)}')
print(f'Accuracy on test data: {accuracy_score(y_val, y_pred_test)}')
print('---'*10)
print(f'F1-score on train data: {f1_score(y_train, y_pred_train, average="macro")}')
print(f'F1-score on test data: {f1_score(y_val, y_pred_test, average="macro")}')
print('---'*10)
print(f'Cross-entropy on train data: {log_loss(y_train, y_proba_train, labels=labels)}')
print(f'Cross-entropy on test data: {log_loss(y_val, y_proba_test, labels=labels)}')
print('---'*10)

Accuracy on train data: 0.8151006033960656
Accuracy on test data: 0.6487127230378904
------------------------------
F1-score on train data: 0.8140396877609182
F1-score on test data: 0.5946335631692139
------------------------------
Cross-entropy on train data: 0.8026581949276058
Cross-entropy on test data: 1.0632284277918556
------------------------------


### Hyperparameter tuning via Bayesian optimization

1st Round

In [205]:
# hp.uniform for float number
# hp.quniform for int that are a multiple from the last number, 
# for example 3, 15, 1 means any whole number between 3 and 15


space={
    'criterion': hp.choice('criterion',('gini', 'entropy')),
    'n_estimators': hp.quniform('n_estimators', 200, 500,100),
    'random_state': RSEED,
    'bootstrap':hp.choice('bootstrap',('True', 'False')),
    'max_features':hp.choice('max_features',('auto', 'sqrt')),
    'min_samples_leaf': hp.quniform('min_samples_leaf',20, 60,10),
    'max_depth': hp.quniform('max_depth', 10, 40,5),
    'min_samples_split':hp.quniform('min_samples_split',20, 80,10),
    }
print(hyperopt.pyll.stochastic.sample(space))

{'bootstrap': 'False', 'criterion': 'entropy', 'max_depth': 20.0, 'max_features': 'auto', 'min_samples_leaf': 20.0, 'min_samples_split': 20.0, 'n_estimators': 300.0, 'random_state': 42}


In [206]:
def objective(space):
    rf = ExtraTreesClassifier(
        criterion =space['criterion'],
        n_estimators=int(space['n_estimators']),
        random_state=space['random_state'],
        bootstrap =space['bootstrap'],
        max_features =space['max_features'],
         min_samples_leaf=int(space['min_samples_leaf']),
        max_depth=int(space['max_depth']),
        min_samples_split=int(space['min_samples_split']) 
        )
    
    evaluation = [
                  ( X_train, y_train), 
                  ( X_val, y_val)
    ]
                
    rf.fit(
        X_train, y_train
        )
    
    y_pred_val = rf.predict(X_val)
    f1 = f1_score(y_val, y_pred_val, average="macro")
    print ("SCORE:", f1)
    return {'loss': -f1, 'status': STATUS_OK }

In [207]:
trials = Trials()

best_hyperparams = fmin(fn = objective,
                        space = space,
                        algo = tpe.suggest,
                        max_evals = 20,
                        trials = trials)

SCORE:                                                
0.5418150141261759                                    
SCORE:                                                                           
0.5471437470615752                                                               
SCORE:                                                                           
0.5230138834678024                                                               
SCORE:                                                                           
0.5285809186856789                                                               
SCORE:                                                                           
0.5333134769282599                                                               
 25%|██▌       | 5/20 [00:37<01:58,  7.92s/trial, best loss: -0.5471437470615752]

  warn(



SCORE:                                                                           
0.5391740048433141                                                               
 30%|███       | 6/20 [00:45<01:53,  8.08s/trial, best loss: -0.5471437470615752]

  warn(



SCORE:                                                                           
0.5346918543725039                                                               
SCORE:                                                                           
0.5343537623393608                                                               
 40%|████      | 8/20 [01:06<01:54,  9.58s/trial, best loss: -0.5471437470615752]

  warn(



SCORE:                                                                           
0.5286138960222326                                                               
 45%|████▌     | 9/20 [01:16<01:47,  9.79s/trial, best loss: -0.5471437470615752]

  warn(



SCORE:                                                                           
0.5331657488578245                                                               
SCORE:                                                                            
0.5285675297401744                                                                
 55%|█████▌    | 11/20 [01:36<01:29,  9.94s/trial, best loss: -0.5471437470615752]

  warn(



SCORE:                                                                            
0.5370217488590061                                                                
SCORE:                                                                            
0.5209718793314074                                                                
SCORE:                                                                            
0.5331657488578245                                                                
 70%|███████   | 14/20 [02:02<00:54,  9.04s/trial, best loss: -0.5471437470615752]

  warn(



SCORE:                                                                            
0.5223080522114836                                                                
 75%|███████▌  | 15/20 [02:09<00:42,  8.48s/trial, best loss: -0.5471437470615752]

  warn(



SCORE:                                                                            
0.5419532230338948                                                                
SCORE:                                                                            
0.5355484991463287                                                                
SCORE:                                                                            
0.5336351869596447                                                                
SCORE:                                                                            
0.5292241612443811                                                                
SCORE:                                                                            
0.5278654218168892                                                                
100%|██████████| 20/20 [02:58<00:00,  8.95s/trial, best loss: -0.5471437470615752]


In [238]:
print("The best hyperparameters are : ","\n")
print(best_hyperparams)

The best hyperparameters are :  

{'bootstrap': 0, 'criterion': 0, 'max_depth': 75.0, 'min_samples_leaf': 10.0, 'min_samples_split': 10.0, 'n_estimators': 400.0}


In [241]:
best= ExtraTreesClassifier(n_estimators=400,
                                                    min_samples_split = 20,
                                                    min_samples_leaf = 10,
                                                    max_depth = 75,
                                                    bootstrap =  'True',
                                                    criterion = 'gini'

                                                    )

In [242]:
best.fit(X_train, y_train)

In [243]:
y_predu_train = best.predict(X_train)
y_predu_test = best.predict(X_val)
y_probab_train = best.predict_proba(X_train)
y_probab_test = best.predict_proba(X_val)

In [244]:
print(f'Accuracy on train data: {accuracy_score(y_train, y_predu_train)}')
print(f'Accuracy on test data: {accuracy_score(y_val, y_predu_test)}')
print('---'*10)
print(f'F1-score on train data: {f1_score(y_train, y_predu_train, average="macro")}')
print(f'F1-score on test data: {f1_score(y_val, y_predu_test, average="macro")}')
print('---'*10)
print(f'Cross-entropy on train data: {log_loss(y_train, y_probab_train, labels=labels)}')
print(f'Cross-entropy on test data: {log_loss(y_val, y_probab_test, labels=labels)}')
print('---'*10)

Accuracy on train data: 0.6979879320786883
Accuracy on test data: 0.6291644137586824
------------------------------
F1-score on train data: 0.6935997401339595
F1-score on test data: 0.5701559112161112
------------------------------
Cross-entropy on train data: 1.0374034977682796
Cross-entropy on test data: 1.145153561843625
------------------------------


2nd Round

In [245]:
space={
    'criterion': hp.choice('criterion',('gini', 'entropy')),
    'n_estimators': hp.quniform('n_estimators', 200, 600,100),
    'random_state': RSEED,
    'bootstrap':hp.choice('bootstrap',('True', 'False')),
    'min_samples_leaf': hp.quniform('min_samples_leaf',10, 90,5),
    'max_depth': hp.quniform('max_depth', 10, 90,5),
    'min_samples_split':hp.quniform('min_samples_split',10, 90,5),
    }
print(hyperopt.pyll.stochastic.sample(space))

{'bootstrap': 'False', 'criterion': 'gini', 'max_depth': 30.0, 'min_samples_leaf': 65.0, 'min_samples_split': 40.0, 'n_estimators': 300.0, 'random_state': 42}


In [246]:
def objective(space):
    rf = ExtraTreesClassifier(
        criterion =space['criterion'],
        n_estimators=int(space['n_estimators']),
        random_state=space['random_state'],
        bootstrap =space['bootstrap'],
        min_samples_leaf=int(space['min_samples_leaf']),
        max_depth=int(space['max_depth']),
        min_samples_split=int(space['min_samples_split']) 
        )
    
    evaluation = [
                  ( X_train, y_train), 
                  ( X_val, y_val)
    ]
                
    rf.fit(
        X_train, y_train
        )
    
    y_pred_val = rf.predict(X_val)
    f1 = f1_score(y_val, y_pred_val, average="macro")
    print ("SCORE:", f1)
    return {'loss': -f1, 'status': STATUS_OK }

In [247]:
trials = Trials()

best_hyperparams = fmin(fn = objective,
                        space = space,
                        algo = tpe.suggest,
                        max_evals = 50,
                        trials = trials)

SCORE:                                                
0.531450948658242                                     
SCORE:                                                                          
0.5535439541665841                                                              
SCORE:                                                                           
0.5111395901436563                                                               
SCORE:                                                                           
0.5496295586505724                                                               
SCORE:                                                                           
0.5431156903616343                                                               
SCORE:                                                                           
0.5171930171400058                                                               
SCORE:                                                                  

In [248]:
print("The best hyperparameters are : ","\n")
print(best_hyperparams)

The best hyperparameters are :  

{'bootstrap': 0, 'criterion': 1, 'max_depth': 85.0, 'min_samples_leaf': 10.0, 'min_samples_split': 25.0, 'n_estimators': 200.0}


In [249]:
best= ExtraTreesClassifier(n_estimators=200,
                                                    min_samples_split = 25,
                                                    min_samples_leaf = 10,
                                                    max_depth = 85,
                                                    bootstrap = 'True',
                                                    criterion = 'entropy'
                                                    )

In [250]:
best.fit(X_train, y_train)

In [252]:
y_predu_train = best.predict(X_train)
y_predu_test = best.predict(X_val)
y_probab_train = best.predict_proba(X_train)
y_probab_test = best.predict_proba(X_val)

In [253]:
print(f'Accuracy on train data: {accuracy_score(y_train, y_predu_train)}')
print(f'Accuracy on test data: {accuracy_score(y_val, y_predu_test)}')
print('---'*10)
print(f'F1-score on train data: {f1_score(y_train, y_predu_train, average="macro")}')
print(f'F1-score on test data: {f1_score(y_val, y_predu_test, average="macro")}')
print('---'*10)
print(f'Cross-entropy on train data: {log_loss(y_train, y_probab_train, labels=labels)}')
print(f'Cross-entropy on test data: {log_loss(y_val, y_probab_test, labels=labels)}')
print('---'*10)

Accuracy on train data: 0.6853143715932797
Accuracy on test data: 0.6264609241775153
------------------------------
F1-score on train data: 0.6800193154931296
F1-score on test data: 0.5661273782582389
------------------------------
Cross-entropy on train data: 1.053958711608736
Cross-entropy on test data: 1.1492306743445926
------------------------------


3rd Round

In [259]:
space={
    'criterion': hp.choice('criterion',('gini', 'entropy')),
    'n_estimators': hp.quniform('n_estimators', 100, 1000,50),
    'random_state': RSEED,
    'bootstrap':hp.choice('bootstrap',('True', 'False')),
    'min_samples_leaf': hp.quniform('min_samples_leaf',10, 60,5),
    'max_depth': hp.quniform('max_depth', 10, 80,5),
    'min_samples_split':hp.quniform('min_samples_split',20, 80,5)
    }
print(hyperopt.pyll.stochastic.sample(space))

{'bootstrap': 'False', 'criterion': 'entropy', 'max_depth': 75.0, 'min_samples_leaf': 45.0, 'min_samples_split': 55.0, 'n_estimators': 950.0, 'random_state': 42}


In [260]:
def objective(space):
    rf = ExtraTreesClassifier(
        criterion =space['criterion'],
        n_estimators=int(space['n_estimators']),
        random_state=space['random_state'],
        bootstrap =space['bootstrap'],
        min_samples_leaf=int(space['min_samples_leaf']),
        max_depth=int(space['max_depth']),
        min_samples_split=int(space['min_samples_split']) 
        )
    
    evaluation = [
                  ( X_train, y_train), 
                  ( X_val, y_val)
    ]
                
    rf.fit(
        X_train, y_train
        )
    
    y_pred_val = rf.predict(X_val)
    f1 = f1_score(y_val, y_pred_val, average="macro")
    print ("SCORE:", f1)
    return {'loss': -f1, 'status': STATUS_OK }

In [261]:
trials = Trials()

best_hyperparams = fmin(fn = objective,
                        space = space,
                        algo = tpe.suggest,
                        max_evals = 50,
                        trials = trials)

SCORE:                                                
0.5435479557453539                                    
SCORE:                                                                           
0.5241569583951038                                                               
SCORE:                                                                           
0.5260525385558248                                                               
SCORE:                                                                           
0.5318870486533139                                                               
SCORE:                                                                           
0.5383877491197894                                                               
SCORE:                                                                           
0.5246247240005694                                                               
SCORE:                                                                

In [262]:
print("The best hyperparameters are : ","\n")
print(best_hyperparams)

The best hyperparameters are :  

{'bootstrap': 0, 'criterion': 0, 'max_depth': 25.0, 'min_samples_leaf': 10.0, 'min_samples_split': 20.0, 'n_estimators': 800.0}


In [263]:
best= ExtraTreesClassifier(n_estimators=800,
                                                    min_samples_split = 20,
                                                    min_samples_leaf = 10,
                                                    max_depth = 25,
                                                    bootstrap = 'True',
                                                    criterion = 'gini'
                                                    )

In [265]:
best.fit(X_train, y_train)

In [172]:
y_predu_train = best.predict(X_train)
y_predu_test = best.predict(X_val)
y_probab_train = best.predict_proba(X_train)
y_probab_test = best.predict_proba(X_val)

In [266]:
print(f'Accuracy on train data: {accuracy_score(y_train, y_predu_train)}')
print(f'Accuracy on test data: {accuracy_score(y_val, y_predu_test)}')
print('---'*10)
print(f'F1-score on train data: {f1_score(y_train, y_predu_train, average="macro")}')
print(f'F1-score on test data: {f1_score(y_val, y_predu_test, average="macro")}')
print('---'*10)
print(f'Cross-entropy on train data: {log_loss(y_train, y_probab_train, labels=labels)}')
print(f'Cross-entropy on test data: {log_loss(y_val, y_probab_test, labels=labels)}')
print('---'*10)

Accuracy on train data: 0.6853143715932797
Accuracy on test data: 0.6264609241775153
------------------------------
F1-score on train data: 0.6800193154931296
F1-score on test data: 0.5661273782582389
------------------------------
Cross-entropy on train data: 1.053958711608736
Cross-entropy on test data: 1.1492306743445926
------------------------------
