# 5 Modeling - selection of the best tree-based models

<b> Purpose of the action </b> - check prediction accuracy on the test set for different types of tree-based models:
- RandomForestClassifier
- AdaBoostClassifier
- XGBClassifier
- CatBoostClassifier

<b> Action plan </b>:
- Test 20 different models of each type
- Use ParameterSampler to generate the models with random hyperparameters
- Fit each model on the training set and evaluate it on the validation set
- Select the best 5 models of each type and combine them into one averaging model (VotingClassifier)
- Create one large averaging model from the single best model of each type
- Create another large averaging model from the previously created VotingClassifiers (each contains the top 5 models of the same type)
- Save these models for future use
- Compare prediction accuracy and other metrics on the test set and save the results for future use
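
The plan above can be sketched roughly as follows. `make_voting_classifier` comes from the local `modeling` module, which is not shown in this notebook, so `sketch_voting_classifier` below is a hypothetical stand-in written for illustration and not the author's implementation. It assumes candidates are ranked by validation accuracy, and it uses synthetic data with a small grid so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterSampler, train_test_split


def sketch_voting_classifier(estimator, params_grid, n_iter, n_best_models,
                             X_train, y_train, X_val, y_val, random_state=42):
    """Hypothetical stand-in for modeling.make_voting_classifier (assumption)."""
    scored = []
    # draw n_iter random hyperparameter combinations from the grid
    for params in ParameterSampler(params_grid, n_iter=n_iter, random_state=random_state):
        model = estimator(**params).fit(X_train, y_train)
        val_acc = accuracy_score(y_val, model.predict(X_val))
        scored.append((val_acc, params))
    # best validation accuracy first, then keep the top n_best_models
    scored.sort(key=lambda item: item[0], reverse=True)
    top = scored[:n_best_models]
    estimators = [(f"{estimator.__name__}_{i}", estimator(**params))
                  for i, (_, params) in enumerate(top)]
    return VotingClassifier(estimators=estimators, voting="soft"), top


# synthetic stand-in data; the notebook loads its own CSV files instead
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

grid = {"n_estimators": [10, 20, 30], "max_depth": [2, 3, 4], "random_state": [0, 1, 2]}
voting, top = sketch_voting_classifier(RandomForestClassifier, grid, n_iter=6,
                                       n_best_models=3, X_train=X_tr, y_train=y_tr,
                                       X_val=X_va, y_val=y_va)
voting.fit(X_tr, y_tr)
print([round(score, 4) for score, _ in top])
```

Because the list is sorted before it is cut, the first estimator of the returned ensemble is the best single candidate under this sketch's assumptions.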

## 5.1 Import necessary libraries and modules

In [12]:
import numpy as np
import pandas as pd
from sklearn import metrics
from catboost import CatBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.model_selection import ParameterSampler
import pickle
from modeling import make_voting_classifier

## 5.2 Create empty lists for future results

In [2]:
accuracy_score = []
precision_score = []
recall_score = []
f1_score = []
roc_auc_score = []
models_name = []
single_models = []
voting_models = []

## 5.3 Perform ensembling for RandomForestClassifier

### 5.3.1 Import data dedicated for this model

In [3]:
train_set = pd.read_csv("./preprocessed_data/processed_train_set_randomforest.csv", index_col=0)
validation_set = pd.read_csv("./preprocessed_data/processed_validation_set_randomforest.csv", index_col=0)
test_set = pd.read_csv("./preprocessed_data/processed_test_set_randomforest.csv", index_col=0)

### 5.3.2 Split datasets into feature and label sets

In [4]:
X_train, y_train = np.array(train_set.drop(columns='FTR')), np.array(train_set['FTR'])
X_val, y_val = np.array(validation_set.drop(columns='FTR')), np.array(validation_set['FTR'])
X_test, y_test = np.array(test_set.drop(columns='FTR')), np.array(test_set['FTR'])

### 5.3.3 Perform averaging ensembling for this model type

Use <b> ParameterSampler </b> to draw 20 random hyperparameter combinations, keep the 5 models with the best validation accuracy, and combine them into one model (VotingClassifier).
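
As a small illustration of the sampling step on its own, the sketch below draws a few combinations from a toy grid (`toy_grid` is a made-up, reduced version of the grid used in the next cell). When every grid entry is a list, `ParameterSampler` samples without replacement, and a fixed `random_state` makes the draw reproducible.

```python
from sklearn.model_selection import ParameterSampler

# toy grid for illustration only (assumption, smaller than the real one)
toy_grid = {"n_estimators": [400, 600, 800], "max_depth": [7, 9, 11]}

# draw 4 of the 9 possible combinations
candidates = list(ParameterSampler(toy_grid, n_iter=4, random_state=42))
for params in candidates:
    print(params)

# the same seed yields the same candidates again
repeat = list(ParameterSampler(toy_grid, n_iter=4, random_state=42))
print(repeat == candidates)
```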

In [5]:
# define params for random grid search
params_grid={
   'n_estimators': [400, 600, 800, 1000, 1200],
   'max_depth': [7, 9, 11, 13, 15, 17, 19, 21],
   'random_state': [0, 1, 2, 3, 4, 5, 6, 7, 8 ,9, 10] 
}

# build a voting classifier from the n best models
voting_clf = make_voting_classifier(estimator=RandomForestClassifier, 
                                    params_grid=params_grid,
                                    n_iter=20, 
                                    random_state=42,
                                    X_train=X_train, 
                                    y_train=y_train, 
                                    X_val=X_val, 
                                    y_val=y_val, 
                                    verbose=1,
                                    n_best_models=5, 
                                    voting='soft')

voting_clf.fit(X_train, y_train)
# list the names of the voting classifier's estimators
# (estimators holds (name, estimator) pairs, so unpack each pair)
print([name for name, _ in voting_clf.estimators])

RandomForestClassifier{'random_state': 1, 'n_estimators': 1200, 'max_depth': 15}
Accuracy score on training set: 0.9988 | Accuracy score on validation set: 0.7091
--------------------------------------------------------------------------------
RandomForestClassifier{'random_state': 1, 'n_estimators': 800, 'max_depth': 9}
Accuracy score on training set: 0.8478 | Accuracy score on validation set: 0.6909
--------------------------------------------------------------------------------
RandomForestClassifier{'random_state': 6, 'n_estimators': 600, 'max_depth': 19}
Accuracy score on training set: 1.0 | Accuracy score on validation set: 0.703
--------------------------------------------------------------------------------
RandomForestClassifier{'random_state': 2, 'n_estimators': 1000, 'max_depth': 15}
Accuracy score on training set: 0.9988 | Accuracy score on validation set: 0.703
--------------------------------------------------------------------------------
RandomForestClassifier{'random_s

### 5.3.4 Extract the best single model from the voting classifier

In [6]:
# extract single classifier
clf = voting_clf.estimators_[0]
clf

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=15, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1200,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)
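
The difference between `estimators` and `estimators_` matters for this step, so here is a minimal sketch on synthetic data (the member names `shallow` and `deep` are made up). `estimators` is the list of `(name, estimator)` pairs passed in, while `estimators_` holds the fitted clones in the same order. Taking `estimators_[0]` therefore returns the best model only if the list was built best-first, which is assumed here about `make_voting_classifier`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

members = [("shallow", DecisionTreeClassifier(max_depth=2, random_state=0)),
           ("deep", DecisionTreeClassifier(max_depth=6, random_state=0))]
ensemble = VotingClassifier(estimators=members, voting="soft").fit(X, y)

# names come from the (name, estimator) pairs supplied at construction
names = [name for name, _ in ensemble.estimators]

# fitted clones keep the construction order
first_fitted = ensemble.estimators_[0]
is_same_object = first_fitted is members[0][1]

# fitted members can also be looked up by name
by_name = ensemble.named_estimators_["deep"]
print(names, first_fitted.max_depth, by_name.max_depth, is_same_object)
```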

### 5.3.5 Fit averaging model on the training set

In [7]:
# train averaging model on training data
voting_clf.fit(X_train, y_train)

# give models a name
clf_name = f'{clf.__class__.__name__}'
voting_clf_name = f'Voting_{voting_clf.estimators_[0].__class__.__name__}'
print(clf_name, voting_clf_name)

RandomForestClassifier Voting_RandomForestClassifier


### 5.3.6 Calculate metrics of prediction and add results to the lists

In [8]:
# append metrics for single classifier to the list 
accuracy_score.append(metrics.accuracy_score(y_test , clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , clf.predict_proba(X_test)[:,1]))

# append metrics for voting classifier to the list  
accuracy_score.append(metrics.accuracy_score(y_test , voting_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , voting_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , voting_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , voting_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , voting_clf.predict_proba(X_test)[:,1]))

# add classifier names to the list (needed to build the results table)
models_name.append(clf_name)
models_name.append(voting_clf_name)

# add classifiers to the lists (used later to build the larger averaging classifiers)
single_models.append( (clf_name, clf) )
voting_models.append( (voting_clf_name, voting_clf) )
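
The same five metrics are computed for every model, so one possible simplification is a small helper that takes predictions computed once. `summarize_binary_predictions` is a hypothetical name, and the sketch assumes binary 0/1 labels, as the metric calls above imply; the toy arrays are made up.

```python
import numpy as np
from sklearn import metrics


def summarize_binary_predictions(y_true, y_pred, y_score):
    """Return the five metrics used in this notebook as a dict (hypothetical helper)."""
    return {
        "accuracy_score": metrics.accuracy_score(y_true, y_pred),
        "precision_score": metrics.precision_score(y_true, y_pred),
        "recall_score": metrics.recall_score(y_true, y_pred),
        "f1_score": metrics.f1_score(y_true, y_pred),
        # ROC AUC needs scores or probabilities for the positive class
        "roc_auc_score": metrics.roc_auc_score(y_true, y_score),
    }


# toy labels, hard predictions and positive-class probabilities
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.1])

summary = summarize_binary_predictions(y_true, y_pred, y_score)
print(summary)
```

With a fitted model this would be called as `summarize_binary_predictions(y_test, clf.predict(X_test), clf.predict_proba(X_test)[:, 1])`, so each model predicts only twice instead of five times.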

## 5.4 Perform ensembling for AdaBoostClassifier

### 5.4.1 Import data dedicated for this model

In [9]:
train_set = pd.read_csv("./preprocessed_data/processed_train_set_adaboost.csv", index_col=0)
validation_set = pd.read_csv("./preprocessed_data/processed_validation_set_adaboost.csv", index_col=0)
test_set = pd.read_csv("./preprocessed_data/processed_test_set_adaboost.csv", index_col=0)

### 5.4.2 Split datasets into feature and label sets

In [10]:
X_train, y_train = np.array(train_set.drop(columns='FTR')), np.array(train_set['FTR'])
X_val, y_val = np.array(validation_set.drop(columns='FTR')), np.array(validation_set['FTR'])
X_test, y_test = np.array(test_set.drop(columns='FTR')), np.array(test_set['FTR'])

### 5.4.3 Perform averaging ensembling for this model type

Use <b> ParameterSampler </b> to draw 20 random hyperparameter combinations, keep the 5 models with the best validation accuracy, and combine them into one model (VotingClassifier).

In [13]:
# define params for random grid search
params_grid={
   'base_estimator': [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2), 
                      DecisionTreeClassifier(max_depth=3), DecisionTreeClassifier(max_depth=4),
                      DecisionTreeClassifier(max_depth=5)], 
   'n_estimators': [20, 30, 40, 50, 70, 80, 90, 100],
   'learning_rate': [0.6, 0.8, 1.0, 1.2, 1.4],
   'random_state': [0, 1, 2, 3, 4, 5, 6, 7, 8 ,9, 10] 
}

# build a voting classifier from the n best models
voting_clf = make_voting_classifier(estimator=AdaBoostClassifier, 
                                    params_grid=params_grid,
                                    n_iter=20, 
                                    random_state=42,
                                    X_train=X_train, 
                                    y_train=y_train, 
                                    X_val=X_val, 
                                    y_val=y_val, 
                                    verbose=1,
                                    n_best_models=5, 
                                    voting='soft')

voting_clf.fit(X_train, y_train)
# list the names of the voting classifier's estimators
# (estimators holds (name, estimator) pairs, so unpack each pair)
print([name for name, _ in voting_clf.estimators])

AdaBoostClassifier{'random_state': 2, 'n_estimators': 90, 'learning_rate': 1.4, 'base_estimator': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=2, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')}
Accuracy score on training set: 0.7441 | Accuracy score on validation set: 0.6848
--------------------------------------------------------------------------------
AdaBoostClassifier{'random_state': 7, 'n_estimators': 80, 'learning_rate': 1.4, 'base_estimator': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=3, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_sp

AdaBoostClassifier{'random_state': 2, 'n_estimators': 30, 'learning_rate': 1.4, 'base_estimator': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=4, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')}
Accuracy score on training set: 0.7877 | Accuracy score on validation set: 0.6515
--------------------------------------------------------------------------------
AdaBoostClassifier{'random_state': 10, 'n_estimators': 80, 'learning_rate': 1.2, 'base_estimator': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=2, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_s

### 5.4.4 Extract the best single model from the voting classifier

In [14]:
# extract single classifier
clf = voting_clf.estimators_[0]
clf

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=2,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecated',
                          

### 5.4.5 Fit averaging model on the training set

In [15]:
# train averaging model on training data
voting_clf.fit(X_train, y_train)

# give models a name
clf_name = f'{clf.__class__.__name__}'
voting_clf_name = f'Voting_{voting_clf.estimators_[0].__class__.__name__}'
print(clf_name, voting_clf_name)

AdaBoostClassifier Voting_AdaBoostClassifier


### 5.4.6 Calculate metrics of prediction and add results to the lists

In [16]:
# append metrics for single classifier to the list 
accuracy_score.append(metrics.accuracy_score(y_test , clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , clf.predict_proba(X_test)[:,1]))

# append metrics for voting classifier to the list  
accuracy_score.append(metrics.accuracy_score(y_test , voting_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , voting_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , voting_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , voting_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , voting_clf.predict_proba(X_test)[:,1]))

# add classifier names to the list (needed to build the results table)
models_name.append(clf_name)
models_name.append(voting_clf_name)

# add classifiers to the lists (used later to build the larger averaging classifiers)
single_models.append( (clf_name, clf) )
voting_models.append( (voting_clf_name, voting_clf) )

## 5.5 Perform ensembling for XGBClassifier

### 5.5.1 Import data dedicated for this model

In [17]:
train_set = pd.read_csv("./preprocessed_data/processed_train_set_xgbboost.csv", index_col=0)
validation_set = pd.read_csv("./preprocessed_data/processed_validation_set_xgbboost.csv", index_col=0)
test_set = pd.read_csv("./preprocessed_data/processed_test_set_xgbboost.csv", index_col=0)

### 5.5.2 Split datasets into feature and label sets

In [18]:
X_train, y_train = np.array(train_set.drop(columns='FTR')), np.array(train_set['FTR'])
X_val, y_val = np.array(validation_set.drop(columns='FTR')), np.array(validation_set['FTR'])
X_test, y_test = np.array(test_set.drop(columns='FTR')), np.array(test_set['FTR'])

### 5.5.3 Perform averaging ensembling for this model type

Use <b> ParameterSampler </b> to draw 20 random hyperparameter combinations, keep the 5 models with the best validation accuracy, and combine them into one model (VotingClassifier).

In [19]:
# define params for random grid search
params_grid = {
                  'random_state':[0, 1, 2 ,3 ,4, 5, 6, 7, 8, 9, 10],
                  'n_estimators': [300, 400, 500, 600, 700], 
                  'learning_rate' : [0.005, 0.01, 0.02],
                  'max_depth' : [3, 4, 5, 6, 7, 8],
                  'min_child_weight': [2, 3, 4],
                  'gamma':[0.2, 0.4, 0.6],
                  'subsample' : [0.7, 0.8, 0.9],
                  'colsample_bytree' : [0.7, 0.8, 0.9],
                  'scale_pos_weight' : [0.8, 1, 1.2],
                  'reg_alpha':[1e-4, 1e-5, 1e-6]
              }

# build a voting classifier from the n best models
voting_clf = make_voting_classifier(estimator=XGBClassifier, 
                                    params_grid=params_grid,
                                    n_iter=20, 
                                    random_state=42,
                                    X_train=X_train, 
                                    y_train=y_train, 
                                    X_val=X_val, 
                                    y_val=y_val, 
                                    verbose=1,
                                    n_best_models=5, 
                                    voting='soft')

voting_clf.fit(X_train, y_train)
# list the names of the voting classifier's estimators
# (estimators holds (name, estimator) pairs, so unpack each pair)
print([name for name, _ in voting_clf.estimators])

XGBClassifier{'subsample': 0.9, 'scale_pos_weight': 1.2, 'reg_alpha': 1e-06, 'random_state': 6, 'n_estimators': 300, 'min_child_weight': 3, 'max_depth': 6, 'learning_rate': 0.01, 'gamma': 0.4, 'colsample_bytree': 0.7}
Accuracy score on training set: 0.8239 | Accuracy score on validation set: 0.7091
--------------------------------------------------------------------------------
XGBClassifier{'subsample': 0.8, 'scale_pos_weight': 1.2, 'reg_alpha': 1e-05, 'random_state': 8, 'n_estimators': 700, 'min_child_weight': 3, 'max_depth': 3, 'learning_rate': 0.01, 'gamma': 0.6, 'colsample_bytree': 0.9}
Accuracy score on training set: 0.7303 | Accuracy score on validation set: 0.7061
--------------------------------------------------------------------------------
XGBClassifier{'subsample': 0.8, 'scale_pos_weight': 0.8, 'reg_alpha': 1e-05, 'random_state': 2, 'n_estimators': 700, 'min_child_weight': 3, 'max_depth': 8, 'learning_rate': 0.01, 'gamma': 0.4, 'colsample_bytree': 0.7}
Accuracy score on tr

### 5.5.4 Extract the best single model from the voting classifier

In [20]:
# extract single classifier
clf = voting_clf.estimators_[0]
clf

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.9, gamma=0.6,
              learning_rate=0.02, max_delta_step=0, max_depth=8,
              min_child_weight=2, missing=None, n_estimators=600, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=6,
              reg_alpha=0.0001, reg_lambda=1, scale_pos_weight=1.2, seed=None,
              silent=None, subsample=0.8, verbosity=1)

### 5.5.5 Fit averaging model on the training set

In [21]:
# train averaging model on training data
voting_clf.fit(X_train, y_train)

# give models a name
clf_name = f'{clf.__class__.__name__}'
voting_clf_name = f'Voting_{voting_clf.estimators_[0].__class__.__name__}'
print(clf_name, voting_clf_name)

XGBClassifier Voting_XGBClassifier


### 5.5.6 Calculate metrics of prediction and add results to the lists

In [22]:
# append metrics for single classifier to the list 
accuracy_score.append(metrics.accuracy_score(y_test , clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , clf.predict_proba(X_test)[:,1]))

# append metrics for voting classifier to the list  
accuracy_score.append(metrics.accuracy_score(y_test , voting_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , voting_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , voting_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , voting_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , voting_clf.predict_proba(X_test)[:,1]))

# add classifier names to the list (needed to build the results table)
models_name.append(clf_name)
models_name.append(voting_clf_name)

# add classifiers to the lists (used later to build the larger averaging classifiers)
single_models.append( (clf_name, clf) )
voting_models.append( (voting_clf_name, voting_clf) )

## 5.6 Perform ensembling for CatBoostClassifier

### 5.6.1 Import data dedicated for this model

In [23]:
train_set = pd.read_csv("./preprocessed_data/processed_train_set_catboost.csv", index_col=0)
validation_set = pd.read_csv("./preprocessed_data/processed_validation_set_catboost.csv", index_col=0)
test_set = pd.read_csv("./preprocessed_data/processed_test_set_catboost.csv", index_col=0)

### 5.6.2 Split datasets into feature and label sets

In [24]:
X_train, y_train = np.array(train_set.drop(columns='FTR')), np.array(train_set['FTR'])
X_val, y_val = np.array(validation_set.drop(columns='FTR')), np.array(validation_set['FTR'])
X_test, y_test = np.array(test_set.drop(columns='FTR')), np.array(test_set['FTR'])

### 5.6.3 Perform averaging ensembling for this model type

Use <b> ParameterSampler </b> to draw 20 random hyperparameter combinations, keep the 5 models with the best validation accuracy, and combine them into one model (VotingClassifier).

In [25]:
# define params for random grid search
params_grid = {
                  'random_state':[0, 1, 2 ,3 ,4, 5, 6, 7, 8, 9, 10],
                  'n_estimators': [None, 300, 400, 500, 600, 700], 
                  'max_depth' : [None, 4, 5, 6, 7, 8, 9, 10],
                  'subsample' : [None, 0.6,0.7, 0.8, 0.9],
                  'verbose': [0],
              }

# build a voting classifier from the n best models
voting_clf = make_voting_classifier(estimator=CatBoostClassifier, 
                                    params_grid=params_grid,
                                    n_iter=20, 
                                    random_state=42,
                                    X_train=X_train, 
                                    y_train=y_train, 
                                    X_val=X_val, 
                                    y_val=y_val, 
                                    verbose=1,
                                    n_best_models=5, 
                                    voting='soft')

voting_clf.fit(X_train, y_train)
# list the names of the voting classifier's estimators
# (estimators holds (name, estimator) pairs, so unpack each pair)
print([name for name, _ in voting_clf.estimators])

CatBoostClassifier{'verbose': 0, 'subsample': None, 'random_state': 7, 'n_estimators': 500, 'max_depth': 5}
Accuracy score on training set: 0.8286 | Accuracy score on validation set: 0.7182
--------------------------------------------------------------------------------
CatBoostClassifier{'verbose': 0, 'subsample': 0.9, 'random_state': 5, 'n_estimators': 700, 'max_depth': 6}
Accuracy score on training set: 0.8968 | Accuracy score on validation set: 0.7121
--------------------------------------------------------------------------------
CatBoostClassifier{'verbose': 0, 'subsample': None, 'random_state': 6, 'n_estimators': 400, 'max_depth': 6}
Accuracy score on training set: 0.8806 | Accuracy score on validation set: 0.7242
--------------------------------------------------------------------------------
CatBoostClassifier{'verbose': 0, 'subsample': None, 'random_state': 10, 'n_estimators': 300, 'max_depth': 6}
Accuracy score on training set: 0.8754 | Accuracy score on validation set: 0.71

### 5.6.4 Extract the best single model from the voting classifier

In [26]:
# extract single classifier
clf = voting_clf.estimators_[0]
clf

<catboost.core.CatBoostClassifier at 0x121c4727288>

### 5.6.5 Fit averaging model on the training set

In [27]:
# train averaging model on training data
voting_clf.fit(X_train, y_train)

# give models a name
clf_name = f'{clf.__class__.__name__}'
voting_clf_name = f'Voting_{voting_clf.estimators_[0].__class__.__name__}'
print(clf_name, voting_clf_name)

CatBoostClassifier Voting_CatBoostClassifier


### 5.6.6 Calculate metrics of prediction and add results to the lists

In [28]:
# append metrics for single classifier to the list 
accuracy_score.append(metrics.accuracy_score(y_test , clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , clf.predict_proba(X_test)[:,1]))

# append metrics for voting classifier to the list  
accuracy_score.append(metrics.accuracy_score(y_test , voting_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , voting_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , voting_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , voting_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , voting_clf.predict_proba(X_test)[:,1]))

# add classifier names to the list (needed to build the results table)
models_name.append(clf_name)
models_name.append(voting_clf_name)

# add classifiers to the lists (used later to build the larger averaging classifiers)
single_models.append( (clf_name, clf) )
voting_models.append( (voting_clf_name, voting_clf) )

## 5.7 Merge single and voting classifiers into larger models

### 5.7.1 Import data dedicated for these models

In [29]:
train_set = pd.read_csv("./preprocessed_data/processed_base_train_set.csv", index_col=0)
validation_set = pd.read_csv("./preprocessed_data/processed_base_validation_set.csv", index_col=0)
test_set = pd.read_csv("./preprocessed_data/processed_base_test_set.csv", index_col=0)

### 5.7.2 Split datasets into feature and label sets

In [30]:
X_train, y_train = np.array(train_set.drop(columns='FTR')), np.array(train_set['FTR'])
X_val, y_val = np.array(validation_set.drop(columns='FTR')), np.array(validation_set['FTR'])
X_test, y_test = np.array(test_set.drop(columns='FTR')), np.array(test_set['FTR'])

### 5.7.3 Create the new larger voting models and fit them on the training data

In [31]:
# create models

# ensemble built from the best single classifier of each type
voting_tree_clf = VotingClassifier(estimators=single_models, voting='soft')
# ensemble built from the per-type voting classifiers
average_voting_tree_clf = VotingClassifier(estimators=voting_models, voting='soft')

# train both models on the training data
voting_tree_clf.fit(X_train, y_train)
average_voting_tree_clf.fit(X_train, y_train)

# give models a name
voting_tree_clf_name = 'TreeModelsVotingClassifier'
average_voting_tree_clf_name = 'TreeModelsAveragingVotingClassifier'
print(voting_tree_clf_name, average_voting_tree_clf_name)

TreeModelsVotingClassifier TreeModelsAveragingVotingClassifier
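
To make the effect of `voting='soft'` concrete, the sketch below checks on synthetic data that an unweighted soft-voting ensemble predicts the mean of its members' class probabilities and then picks the class with the highest average. The two member models and their names are placeholders, not the models trained in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
                ("tree", DecisionTreeClassifier(max_depth=3, random_state=0))],
    voting="soft",
).fit(X, y)

# average the class probabilities of the fitted members by hand
mean_proba = np.mean([m.predict_proba(X) for m in ensemble.estimators_], axis=0)

# soft voting returns exactly this average (no weights were given)
proba_matches = np.allclose(ensemble.predict_proba(X), mean_proba)

# the predicted label is the class with the highest averaged probability
manual_labels = ensemble.classes_[np.argmax(mean_proba, axis=1)]
labels_match = bool((ensemble.predict(X) == manual_labels).all())
print(proba_matches, labels_match)
```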


### 5.7.4 Calculate metrics of prediction and add results to the lists

In [32]:
# append metrics for single voting classifier to the lists
accuracy_score.append(metrics.accuracy_score(y_test , voting_tree_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , voting_tree_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , voting_tree_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , voting_tree_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , voting_tree_clf.predict_proba(X_test)[:,1]))

# append metrics for averaging voting classifier to the lists 
accuracy_score.append(metrics.accuracy_score(y_test , average_voting_tree_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , average_voting_tree_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , average_voting_tree_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , average_voting_tree_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , average_voting_tree_clf.predict_proba(X_test)[:,1]))


# add classifier names to the list (needed to build the results table)
models_name.append(voting_tree_clf_name)
models_name.append(average_voting_tree_clf_name)

### 5.7.5 Save models for future use

In [33]:
# save the single-model voting classifier using the pickle library
with open(f'./models_new/{voting_tree_clf_name}.pickle', 'wb') as f:
    # pickle the model using the highest protocol available
    pickle.dump(voting_tree_clf, f, pickle.HIGHEST_PROTOCOL)

# save the averaging voting classifier using the pickle library
with open(f'./models_new/{average_voting_tree_clf_name}.pickle', 'wb') as f:
    # pickle the model using the highest protocol available
    pickle.dump(average_voting_tree_clf, f, pickle.HIGHEST_PROTOCOL)
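
For completeness, here is a minimal sketch of the matching load step, using a small stand-in model and a temporary directory instead of `./models_new/` (the file name `example_model.pickle` is made up). Pickled models should only be loaded from trusted sources and generally with the same library versions that produced them.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.pickle")
    # write the fitted model to disk
    with open(path, "wb") as f:
        pickle.dump(model, f, pickle.HIGHEST_PROTOCOL)
    # read it back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

# the restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```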

## 5.8 Show all results in one table and save them for future use

In [34]:
# create dictionary of results 
results_dict = {'precision_score': precision_score, 
               'recall_score': recall_score, 
               'f1_score': f1_score,
               'roc_auc_score' : roc_auc_score,
               'accuracy_score' : accuracy_score}

results_df = pd.DataFrame(data=results_dict)
results_df.insert(loc=0, column='Model', value=models_name)
results_df

| | Model | precision_score | recall_score | f1_score | roc_auc_score | accuracy_score |
|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier | 0.632911 | 0.595238 | 0.613497 | 0.735175 | 0.668421 |
| 1 | Voting_RandomForestClassifier | 0.626667 | 0.559524 | 0.591195 | 0.736972 | 0.657895 |
| 2 | AdaBoostClassifier | 0.68 | 0.607143 | 0.641509 | 0.776112 | 0.7 |
| 3 | Voting_AdaBoostClassifier | 0.690141 | 0.583333 | 0.632258 | 0.757412 | 0.7 |
| 4 | XGBClassifier | 0.630952 | 0.630952 | 0.630952 | 0.738208 | 0.673684 |
| 5 | Voting_XGBClassifier | 0.62963 | 0.607143 | 0.618182 | 0.742026 | 0.668421 |
| 6 | CatBoostClassifier | 0.641975 | 0.619048 | 0.630303 | 0.737758 | 0.678947 |
| 7 | Voting_CatBoostClassifier | 0.65 | 0.619048 | 0.634146 | 0.745283 | 0.684211 |
| 8 | TreeModelsVotingClassifier | 0.614458 | 0.607143 | 0.610778 | 0.731357 | 0.657895 |
| 9 | TreeModelsAveragingVotingClassifier | 0.641026 | 0.595238 | 0.617284 | 0.733827 | 0.673684 |


In [35]:
results_df.to_csv("./results_new/tree_models_results.csv")
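
When the saved file is read back later, the index written by `to_csv` becomes an extra column that pandas names `Unnamed: 0` unless `index_col=0` is passed. The sketch below shows both reads on an in-memory excerpt (three rows and one metric taken from the results above) and then ranks the models by test accuracy.

```python
import io

import pandas as pd

# excerpt in the layout that DataFrame.to_csv writes (index column has no header)
csv_text = """,Model,accuracy_score
0,RandomForestClassifier,0.668421
2,AdaBoostClassifier,0.7
6,CatBoostClassifier,0.678947
"""

# default read: the saved index shows up as a regular column
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores the saved index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

# rank the models by test accuracy, best first
ranked = indexed_read.sort_values("accuracy_score", ascending=False)
print(default_read.columns.tolist())
print(ranked)
```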