# 8 Modeling - Advanced ensembling

<b> Purpose of the action </b> - create 4 advanced machine learning models:
- AveragingClassifier consisting of the best single model of each type from previous parts
- LargeAveragingClassifier consisting of the averaging models of each type from previous parts
- StackClassifier consisting of:
    - base models - the best single model of each type from previous parts
    - meta model - chosen by evaluating several candidate models
- LargeStackClassifier consisting of:
    - base models - the individual models inside the averaging models of each type from previous parts
    - meta model - chosen by evaluating several candidate models

![title](Stack_Classifier.png)
<i> Schematic of a stacking classifier framework. Here, three classifiers are used in the stack and are individually trained. Then, their predictions get stacked and are used to train the meta-classifier. </i> 

<b>Source</b> https://towardsdatascience.com/stacking-classifiers-for-higher-predictive-performance-566f963e4840
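
The scheme in the figure can be sketched with scikit-learn's built-in `StackingClassifier`. This is only an illustration of the idea, not the project's own `StackClassifier` from `classifiers.py`: the synthetic data, the three base learners and the names `sketch_stack` and `stack_test_accuracy` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data stands in for the match data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Three base classifiers, as in the schematic.
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# The meta-classifier is trained on cross-validated base-model probabilities.
sketch_stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=5,
)
sketch_stack.fit(X_tr, y_tr)

stack_test_accuracy = sketch_stack.score(X_te, y_te)
print(f"Held-out accuracy of the sketch: {stack_test_accuracy:.3f}")
```
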
 
<b> Action plan </b>:
- Load AveragingClassifiers and LargeAveragingClassifiers from previous parts
- Extract single models from the AveragingClassifiers for later use as base learners in the StackClassifiers
- Create an AveragingClassifier consisting of the best single model of each type (all base models are already fitted)
- Create a LargeAveragingClassifier consisting of the averaging models of each type (all base models are already fitted)
- Create a test StackClassifier from the best single model of each type and try several meta models to choose the best one (evaluated on the validation set)
- Train the StackClassifier with the best meta model on all data (training and validation sets)
- Create a test LargeStackClassifier from the five best models of each type and try several meta models to choose the best one (evaluated on the validation set)
- Train the LargeStackClassifier with the best meta model on all data (training and validation sets)
- Compare prediction accuracy and other metrics on the test set and save the results for later use

## 8.1 Import necessary libraries and modules

In [1]:
import numpy as np
import pandas as pd
import pickle
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from modeling import Metrics
from preprocessing_pipelines import basic_preprocess_pipeline, categorical_preprocess_pipeline, ImportantFeaturesSelector
from classifiers import AveragingClassifier, LargeAveragingClassifier, StackClassifier, FastStackClassifier

## 8.2 Import raw data 

In [2]:
# data for training final models and prediction
X_train = pd.read_csv('./preprocessed_data/train_set_stage2.csv', index_col=0)
X_test = pd.read_csv('./preprocessed_data/test_set_stage2.csv', index_col=0)
y_train = np.array(X_train['FTR'])
y_test = np.array(X_test['FTR'])

# data for evaluating the meta models of the StackClassifiers
# split the train set into a training set and a validation set
break_point = X_train.shape[0]//19 # 19 - total number of seasons in train set
X_train_eval = X_train.iloc[:-break_point]
X_val_eval = X_train.iloc[-break_point:] # last season from training set
y_train_eval = np.array(X_train_eval['FTR'])
y_val_eval = np.array(X_val_eval['FTR'])
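
The split above holds out the last season only if the rows are in chronological order and every season contributes roughly the same number of rows. A toy check of that logic, where the `season` column, `rows_per_season` and the `toy_*` names are hypothetical and exist only for this illustration:

```python
import numpy as np
import pandas as pd

n_seasons = 19        # same constant as in the cell above
rows_per_season = 10  # hypothetical; chosen small for the illustration

# Toy frame sorted by season, with equally sized seasons (assumptions).
toy = pd.DataFrame({
    "season": np.repeat(np.arange(n_seasons), rows_per_season),
    "FTR": np.tile([0, 1], n_seasons * rows_per_season // 2),
})

# Same arithmetic as the notebook cell: one season's worth of rows.
break_point = toy.shape[0] // n_seasons
toy_train = toy.iloc[:-break_point]   # the first 18 seasons
toy_val = toy.iloc[-break_point:]     # the last season only

print(toy_train["season"].max(), toy_val["season"].unique())
```

If the seasons differ in size, a split on an explicit season column would be the safer choice.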

## 8.3 Import all previously prepared AveragingClassifiers and LargeAveragingClassifiers

In [3]:
with open('./models/LinearModelsAveragingClassifier.pickle', 'rb') as f:
    linear_averaging_clf = pickle.load(f)
    
with open('./models/LargeLinearModelsAveragingClassifier.pickle', 'rb') as f:
    large_linear_averaging_clf = pickle.load(f) 
    
with open('./models/TreeModelsAveragingClassifier.pickle', 'rb') as f:
    tree_averaging_clf = pickle.load(f) 

with open('./models/LargeTreeModelsAveragingClassifier.pickle', 'rb') as f:
    large_tree_averaging_clf = pickle.load(f) 

## 8.4 Extract single and averaging models from AveragingClassifiers

Note that all models are extracted together with their pipelines (scaling, encoding and feature selection).

In [4]:
# extract best tree-based estimators
clf_rf, clf_ada, clf_xgb, clf_cat = tree_averaging_clf.base_estimators

# extract best linear estimators
clf_lr, clf_svc, clf_rbf, clf_knn = linear_averaging_clf.base_estimators

# extract average tree-based estimators
avg_clf_rf, avg_clf_ada, avg_clf_xgb, avg_clf_cat = large_tree_averaging_clf.base_estimators

# extract average linear estimators
avg_clf_lr, avg_clf_svc, avg_clf_rbf, avg_clf_knn = large_linear_averaging_clf.base_estimators

## 8.5 Create placeholder to hold prediction results

In [5]:
# placeholder to hold prediction results
prediction_metrics = Metrics()
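
The `Metrics` class comes from the project's `modeling` module and is not shown in this notebook. A minimal sketch of what such a collector might look like, assuming a binary target and models that expose `predict` and `predict_proba`; the class name `SketchMetrics` and its internals are hypothetical, not the project's implementation:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


class SketchMetrics:
    """Hypothetical stand-in that accumulates one score list per metric."""

    ORDER = ("precision", "recall", "f1", "roc_auc", "accuracy")

    def __init__(self):
        self.names = []
        self.scores = {key: [] for key in self.ORDER}

    def add_metrics(self, model, name, X, y):
        y_pred = model.predict(X)
        y_proba = model.predict_proba(X)[:, 1]  # positive-class probability
        self.names.append(name)
        self.scores["precision"].append(precision_score(y, y_pred, zero_division=0))
        self.scores["recall"].append(recall_score(y, y_pred, zero_division=0))
        self.scores["f1"].append(f1_score(y, y_pred, zero_division=0))
        self.scores["roc_auc"].append(roc_auc_score(y, y_proba))
        self.scores["accuracy"].append(accuracy_score(y, y_pred))

    def get_metrics(self):
        # Returned in the order the notebook unpacks them in section 8.11.
        return tuple(self.scores[key] for key in self.ORDER)

    def get_names(self):
        return list(self.names)


# Usage with a trivial baseline on toy data.
X_demo = np.zeros((6, 2))
y_demo = np.array([0, 1, 1, 0, 1, 1])
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

demo_metrics = SketchMetrics()
demo_metrics.add_metrics(baseline, "MostFrequentBaseline", X_demo, y_demo)
```
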

## 8.6 Create AveragingAllModelsClassifier

Base models are the best single classifier of each type

### 8.6.1 Initialize model

In [6]:
# all base models are already trained
average_all_models_clf = AveragingClassifier(base_estimators=[*linear_averaging_clf.base_estimators,
                                                              *tree_averaging_clf.base_estimators], 
                                             voting='soft')
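
Soft voting over already-fitted models amounts to averaging their class probabilities and taking the most probable class. A minimal sketch of that idea, assuming every model reports classes in the same order; the helpers `soft_vote_proba` and `soft_vote_predict` are hypothetical and are not part of the project's `AveragingClassifier`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and two pre-fitted models (assumptions for the example).
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
fitted = [
    LogisticRegression(max_iter=1000).fit(X, y),
    DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y),
]


def soft_vote_proba(models, X):
    """Average predict_proba over models; no refitting is needed."""
    return np.mean([m.predict_proba(X) for m in models], axis=0)


def soft_vote_predict(models, X):
    """Pick the class with the highest averaged probability."""
    proba = soft_vote_proba(models, X)
    # Assumes all models share the class ordering of the first model.
    return models[0].classes_[np.argmax(proba, axis=1)]


avg_proba = soft_vote_proba(fitted, X)
avg_pred = soft_vote_predict(fitted, X)
```
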

### 8.6.2 Calculate metrics of prediction and add results to the lists

In [7]:
# give model a name
average_all_models_clf_name = 'AveragingAllModelsClassifier'


# add prediction metrics for averaging classifier to placeholder
prediction_metrics.add_metrics(average_all_models_clf, average_all_models_clf_name, X_test, y_test)

## 8.7 Create LargeAveragingAllModelsClassifier

Base models are AveragingClassifiers of each type

### 8.7.1 Initialize model

In [8]:
# all base models are already trained
large_average_all_models_clf = LargeAveragingClassifier(base_estimators=[*large_linear_averaging_clf.base_estimators,
                                                                         *large_tree_averaging_clf.base_estimators], 
                                                        voting='soft')

### 8.7.2 Calculate metrics of prediction and add results to the lists

In [9]:
# give model a name
large_average_all_models_clf_name = 'LargeAveragingAllModelsClassifier'


# add prediction metrics for large averaging classifier to placeholder
prediction_metrics.add_metrics(large_average_all_models_clf, large_average_all_models_clf_name, X_test, y_test)

## 8.8 Create list of meta models for StackingClassifiers to check

In [10]:
meta_models_to_check = [
    LogisticRegression(penalty='l2', C=0.01, solver='lbfgs', max_iter=1000),
    LogisticRegression(penalty='l2', C=0.1, solver='lbfgs', max_iter=1000),
    LogisticRegression(penalty='l2', C=1, solver='lbfgs', max_iter=1000),
    LogisticRegression(penalty='l2', C=10, solver='lbfgs', max_iter=1000),
    LogisticRegression(penalty='l2', C=100, solver='lbfgs', max_iter=1000),
    LogisticRegression(penalty='l1', C=0.01, solver='liblinear', max_iter=1000),
    LogisticRegression(penalty='l1', C=0.1, solver='liblinear', max_iter=1000),
    LogisticRegression(penalty='l1', C=1, solver='liblinear', max_iter=1000),
    LogisticRegression(penalty='l1', C=10, solver='liblinear', max_iter=1000),
    LogisticRegression(penalty='l1', C=100, solver='liblinear', max_iter=1000),
    SVC(kernel='linear', C=0.001, probability=True),
    SVC(kernel='linear', C=0.01, probability=True),
    SVC(kernel='linear', C=0.1, probability=True),
    SVC(kernel='linear', C=1, probability=True),
    SVC(kernel='linear', C=10, probability=True),
    SVC(kernel='linear', C=100, probability=True),
    SVC(kernel='linear', C=1000, probability=True),
    RandomForestClassifier(criterion='gini', n_estimators=10),
    RandomForestClassifier(criterion='gini', n_estimators=100),
    RandomForestClassifier(criterion='gini', n_estimators=500),
    RandomForestClassifier(criterion='gini', n_estimators=1000),
    RandomForestClassifier(criterion='entropy', n_estimators=10),
    RandomForestClassifier(criterion='entropy', n_estimators=100),
    RandomForestClassifier(criterion='entropy', n_estimators=500),
    RandomForestClassifier(criterion='entropy', n_estimators=1000),
    ExtraTreesClassifier(criterion='gini', n_estimators=10),
    ExtraTreesClassifier(criterion='gini', n_estimators=100),
    ExtraTreesClassifier(criterion='gini', n_estimators=500),
    ExtraTreesClassifier(criterion='gini', n_estimators=1000),
    ExtraTreesClassifier(criterion='entropy', n_estimators=10),
    ExtraTreesClassifier(criterion='entropy', n_estimators=100),
    ExtraTreesClassifier(criterion='entropy', n_estimators=500),
    ExtraTreesClassifier(criterion='entropy', n_estimators=1000),
]
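
One way a candidate list like this could be scored is to fit a fresh copy of each candidate on the meta-features and keep the one with the highest validation accuracy. The sketch below illustrates that loop on synthetic meta-features; the function `pick_best_meta_model` and all data in it are hypothetical and may differ from what `evaluate_meta_models` does internally:

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def pick_best_meta_model(candidates, meta_X_train, y_train, meta_X_val, y_val):
    """Return the fitted candidate with the highest validation accuracy."""
    best_model, best_score = None, -1.0
    for candidate in candidates:
        # clone() gives an unfitted copy, so the candidate list stays untouched
        model = clone(candidate).fit(meta_X_train, y_train)
        score = accuracy_score(y_val, model.predict(meta_X_val))
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score


# Synthetic "base-model probabilities" that correlate with the label.
rng = np.random.default_rng(0)
y_all = rng.integers(0, 2, size=300)
meta_X = np.clip(y_all[:, None] * 0.3 + rng.normal(0.35, 0.2, size=(300, 4)), 0, 1)

demo_candidates = [
    LogisticRegression(C=0.1, max_iter=1000),
    LogisticRegression(C=10, max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
]
best_demo_model, best_demo_score = pick_best_meta_model(
    demo_candidates, meta_X[:200], y_all[:200], meta_X[200:], y_all[200:]
)
```

Choosing among many candidates on a single validation season can overfit to that season, so the selected score is an optimistic estimate.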

## 8.9 StackingClassifier

<h4><center>StackingClassifier flow chart</center></h4>

![title](StackingClassifier.png)

<i> Base models are the best single classifier of each type (along with their pipes with scaling, encoding and feature selection) </i>

### 8.9.1 Create Test StackingClassifier for selecting the best meta model

Each base model is trained five times, each time on a different combination of four folds, and predicts the remaining fold. These out-of-fold predictions form the feature set for the meta model.
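
The out-of-fold idea can be illustrated with scikit-learn's `cross_val_predict`. This is a sketch under assumptions, not the code of `FastStackClassifier`: the synthetic data, the two base models and the `sketch_*` names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=250, n_features=10, random_state=42)
sketch_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
sketch_base_models = [LogisticRegression(max_iter=1000), KNeighborsClassifier()]

# For each base model: fit on four folds, predict the held-out fold, repeat
# five times. Every row gets a prediction from a model that never saw it.
oof_columns = [
    cross_val_predict(model, X, y, cv=sketch_kfold, method="predict_proba")[:, 1]
    for model in sketch_base_models
]

# One column of positive-class probabilities per base model.
meta_features = np.column_stack(oof_columns)

# The meta model is trained on these out-of-fold predictions.
sketch_meta_model = LogisticRegression(max_iter=1000).fit(meta_features, y)
```
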

#### 8.9.1.1 Initialize StackingClassifier and train base models

In [11]:
# split training data to 5 folds
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# fast implementation of StackClassifier, used only for selecting the best meta model
stack_clf_eval = FastStackClassifier(base_models=[
                                                  clf_svc, 
                                                  clf_lr, 
                                                  clf_knn, 
                                                  clf_rbf,
                                                  clf_rf, 
                                                  clf_xgb, 
                                                  clf_ada, 
                                                  clf_cat
                                                 ],
                                     kfold=kfold)

# to safely run multiprocessing on Windows
if __name__ == '__main__':
    stack_clf_eval.fit_base_models(X_train_eval, y_train_eval, X_val_eval)

#### 8.9.1.2 Select best meta model

In [12]:
# select the best meta model
meta_model = stack_clf_eval.evaluate_meta_models(meta_models=[*meta_models_to_check], 
                                                 y_train=y_train_eval, 
                                                 y_test=y_val_eval)

Model 0 : LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
Accuracy score: 0.6848
-----------------------------------------------------------------------------------------------------------------------------
Model 1 : LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
Accuracy score: 0.7091
-----------------------------------------------------------------------------------------------------------------------------
Model 2 : LogisticRegressio

Model 18 : RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
Accuracy score: 0.6727
-----------------------------------------------------------------------------------------------------------------------------
Model 19 : RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=

Model 30 : ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)
Accuracy score: 0.6939
-----------------------------------------------------------------------------------------------------------------------------
Model 31 : ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
          

### 8.9.2 Create final StackClassifier to make test set prediction

#### 8.9.2.1 Initialize and train model

In [13]:
# split training data to 5 folds
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

stack_clf = StackClassifier(base_models=[
                                         clf_svc, 
                                         clf_lr, 
                                         clf_knn, 
                                         clf_rbf,
                                         clf_rf, 
                                         clf_xgb, 
                                         clf_ada, 
                                         clf_cat
                                        ],
                            meta_model=meta_model, 
                            kfold=kfold)

# to safely run multiprocessing on Windows
if __name__ == '__main__':
    stack_clf.fit(X_train, y_train)

#### 8.9.2.2 Calculate metrics of prediction and add results to the lists

In [14]:
# give model a name
stack_clf_name = f'{stack_clf.__class__.__name__}'


# add prediction metrics for stacking classifier to placeholder
prediction_metrics.add_metrics(stack_clf, stack_clf_name, X_test, y_test)

## 8.10 LargeStackingClassifier

<h4><center>LargeStackingClassifier flow chart</center></h4>

![title](LargeStackingClassifier.png)

<i> Base models are best five classifiers of each type (along with their pipes with scaling, encoding and feature selection) </i>

### 8.10.1 Create Test LargeStackingClassifier for selecting the best meta model

Each base model is trained five times, each time on a different combination of four folds, and predicts the remaining fold. These out-of-fold predictions form the feature set for the meta model.

#### 8.10.1.1 Initialize LargeStackingClassifier and train base models

In [17]:
# split training data to 5 folds
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# fast implementation of StackClassifier, used only for selecting the best meta model
large_stack_clf_eval = FastStackClassifier(base_models=[
                                                        *avg_clf_svc.base_estimators, 
                                                        *avg_clf_lr.base_estimators, 
                                                        *avg_clf_knn.base_estimators, 
                                                        *avg_clf_rbf.base_estimators,
                                                        *avg_clf_rf.base_estimators, 
                                                        *avg_clf_xgb.base_estimators, 
                                                        *avg_clf_ada.base_estimators, 
                                                        *avg_clf_cat.base_estimators
                                                       ],
                                         kfold=kfold)

# to safely run multiprocessing on Windows
if __name__ == '__main__':
    large_stack_clf_eval.fit_base_models(X_train_eval, y_train_eval, X_val_eval)

#### 8.10.1.2 Select best meta model

In [18]:
# select the best meta model
meta_model = large_stack_clf_eval.evaluate_meta_models(meta_models=[*meta_models_to_check], 
                                                       y_train=y_train_eval, 
                                                       y_test=y_val_eval)

Model 0 : LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
Accuracy score: 0.7212
-----------------------------------------------------------------------------------------------------------------------------
Model 1 : LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
Accuracy score: 0.7333
-----------------------------------------------------------------------------------------------------------------------------
Model 2 : LogisticRegressio

Model 18 : RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
Accuracy score: 0.7242
-----------------------------------------------------------------------------------------------------------------------------
Model 19 : RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=

Model 30 : ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)
Accuracy score: 0.6848
-----------------------------------------------------------------------------------------------------------------------------
Model 31 : ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
          

### 8.10.2 Create final LargeStackClassifier to make test set prediction

#### 8.10.2.1 Initialize and train model

In [19]:
# split training data to 5 folds
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

large_stack_clf = StackClassifier(base_models=[
                                               *avg_clf_svc.base_estimators, 
                                               *avg_clf_lr.base_estimators, 
                                               *avg_clf_knn.base_estimators, 
                                               *avg_clf_rbf.base_estimators,
                                               *avg_clf_rf.base_estimators, 
                                               *avg_clf_xgb.base_estimators, 
                                               *avg_clf_ada.base_estimators, 
                                               *avg_clf_cat.base_estimators
                                        ],
                                    meta_model=meta_model, 
                                    kfold=kfold)

# to safely run multiprocessing on Windows
if __name__ == '__main__':
    large_stack_clf.fit(X_train, y_train)

#### 8.10.2.2 Calculate metrics of prediction and add results to the lists

In [20]:
# give model a name
large_stack_clf_name = f'Large{large_stack_clf.__class__.__name__}'


# add prediction metrics for large stacking classifier to placeholder
prediction_metrics.add_metrics(large_stack_clf, large_stack_clf_name, X_test, y_test)

## 8.11 Show all results in one table and save it for later use

In [33]:
# get prediction metric result lists from placeholder
# (plural names avoid shadowing accuracy_score imported from sklearn.metrics in 8.1)
precision_scores, recall_scores, f1_scores, roc_auc_scores, accuracy_scores = prediction_metrics.get_metrics()

# get model names list from placeholder
models_name = prediction_metrics.get_names()

# create dictionary of results 
results_dict = {'precision_score': precision_scores, 
               'recall_score': recall_scores, 
               'f1_score': f1_scores,
               'roc_auc_score' : roc_auc_scores,
               'accuracy_score' : accuracy_scores}

results_df = pd.DataFrame(data=results_dict)
results_df.insert(loc=0, column='Model', value=models_name)
results_df

| | Model | precision_score | recall_score | f1_score | roc_auc_score | accuracy_score |
|---|---|---|---|---|---|---|
| 0 | AveragingAllModelsClassifier | 0.630952 | 0.630952 | 0.630952 | 0.740566 | 0.673684 |
| 1 | LargeAveragingAllModelsClassifier | 0.633333 | 0.678571 | 0.655172 | 0.739218 | 0.684211 |
| 2 | StackClassifier | 0.69863 | 0.607143 | 0.649682 | 0.759659 | 0.710526 |
| 3 | LargeStackClassifier | 0.650602 | 0.642857 | 0.646707 | 0.749102 | 0.689474 |


In [34]:
results_df.to_csv("./results/advance_ensembling_models_results.csv")