# Tutorial

In this tutorial, we fairly compare a number of ensemble methods using EI's built-in nested cross-validation implementation and show how to make predictions with the selected final model. We then show how to interpret the model by calculating feature rankings.

### Performance analysis and selection of ensemble methods

First, let's import some `sklearn` models, `EnsembleIntegration`, and some additional ensemble methods:

In [19]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from eipy.ei import EnsembleIntegration
from eipy.additional_ensembles import MeanAggregation, CES

Next, make some dummy "multi-modal" data by splitting the features of the breast cancer dataset into two groups:

In [60]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
feature_names = data.feature_names
X = data.data
y = np.abs(data.target - 1)  # make "malignancy" the positive class rather than "benign"

X_1 = X[:, 0:10]
X_2 = X[:, 10:]

# Split both modalities and the labels in a single call so the rows stay aligned
X_1_train, X_1_test, X_2_train, X_2_test, y_train, y_test = train_test_split(
    X_1, X_2, y, test_size=0.2, random_state=3, stratify=y
)
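
Every modality has to share the same row partition, so that sample *i* refers to the same patient in each modality and in the labels. The following is a minimal sketch of that requirement on synthetic data; the `toy_` names are made up for illustration, and the split here is a plain (non-stratified) shuffle rather than the stratified split used above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for a feature matrix: 20 samples, 6 features.
toy_X = rng.normal(size=(20, 6))
toy_y = rng.integers(0, 2, size=20)

# Two "modalities" formed by slicing columns, as in the tutorial.
toy_X1, toy_X2 = toy_X[:, :2], toy_X[:, 2:]

# One shared permutation of row indices defines the train/test partition.
indices = rng.permutation(len(toy_y))
test_idx, train_idx = indices[:4], indices[4:]

toy_X1_train, toy_X2_train = toy_X1[train_idx], toy_X2[train_idx]
toy_y_train = toy_y[train_idx]

# Stitching the modalities back together recovers the original rows,
# which confirms that the modalities are still row-aligned.
rows_aligned = np.array_equal(
    np.hstack([toy_X1_train, toy_X2_train]), toy_X[train_idx]
)
print(rows_aligned)
```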

Create dictionaries containing data modalities:

In [61]:
data_train = {
    "Modality_1": X_1_train,
    "Modality_2": X_2_train,
}

data_test = {
    "Modality_1": X_1_test,
    "Modality_2": X_2_test,
}

Define base predictors:

In [62]:
base_predictors = {
    'ADAB': AdaBoostClassifier(),
    'XGB': XGBClassifier(),
    'DT': DecisionTreeClassifier(),
    'RF': RandomForestClassifier(),
    'GB': GradientBoostingClassifier(),
    'KNN': KNeighborsClassifier(),
    'LR': LogisticRegression(),
    'NB': GaussianNB(),
    'MLP': MLPClassifier(),
    'SVM': SVC(probability=True),
}

Initialise Ensemble Integration:

In [63]:
EI = EnsembleIntegration(
    base_predictors=base_predictors,
    k_outer=5,
    k_inner=5,
    n_samples=1,
    sampling_strategy="undersampling",
    sampling_aggregation="mean",
    n_jobs=-1,
    random_state=38,
    project_name="breast_cancer",
    model_building=True,
)
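
The `k_outer` and `k_inner` arguments set the number of folds in the nested cross-validation. As a rough illustration of the general idea (not eipy's actual implementation), the sketch below generates nested fold indices with plain NumPy. The helper `k_fold_indices` is hypothetical, the folds are not stratified, and the comments describe a typical stacking setup rather than anything confirmed about the library's internals.

```python
import numpy as np

def k_fold_indices(indices, k, rng):
    """Shuffle `indices` and split them into k roughly equal folds."""
    shuffled = rng.permutation(indices)
    return np.array_split(shuffled, k)

rng = np.random.default_rng(38)
n_toy_samples, k_outer, k_inner = 50, 5, 5
all_idx = np.arange(n_toy_samples)

splits = []
for outer_test in k_fold_indices(all_idx, k_outer, rng):
    # Everything outside the outer test fold is available for training.
    outer_train = np.setdiff1d(all_idx, outer_test)
    for inner_val in k_fold_indices(outer_train, k_inner, rng):
        inner_train = np.setdiff1d(outer_train, inner_val)
        # In a typical stacking setup, base predictors are fit on inner_train
        # and their predictions on inner_val are used to train meta models.
        splits.append((outer_test, inner_train, inner_val))

# The outer test fold never overlaps with any inner split.
no_leakage = all(
    np.intersect1d(outer_test, np.concatenate([inner_train, inner_val])).size == 0
    for outer_test, inner_train, inner_val in splits
)
print(len(splits), no_leakage)  # 25 True
```

Keeping the outer test fold out of every inner split is what allows the outer folds to give a fair performance estimate for each ensemble method.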

Train the base predictors on each modality, passing each modality's unique name via the `modality` argument:

In [64]:
for name, modality in data_train.items():
    EI.train_base(modality, y_train, modality=name)

Training base predictors on Modality_1...

... for ensemble performance analysis...
Generating meta training data: |██████████|100%
Generating meta test data: |██████████|100%

... for final ensemble...
Generating meta training data: |██████████|100%
Training final base predictors: |██████████|100%

Training base predictors on Modality_2...

... for ensemble performance analysis...
Generating meta training data: |██████████|100%
Generating meta test data: |██████████|100%

... for final ensemble...
Generating meta training data: |██████████|100%
Training final base predictors: |██████████|100%

We can check the cross-validated performance of each base predictor on each modality with `base_summary`:

In [65]:
EI.base_summary['metrics']

Modality_1:

| base predictor | ADAB | DT | GB | KNN | LR | MLP | NB | RF | SVM | XGB |
|---|---|---|---|---|---|---|---|---|---|---|
| fmax (minority) | 0.918129 | 0.890173 | 0.919075 | 0.836013 | 0.882022 | 0.825397 | 0.896359 | 0.912181 | 0.844156 | 0.928994 |
| f (majority) | 0.950704 | 0.932624 | 0.950355 | 0.914858 | 0.924188 | 0.907563 | 0.933092 | 0.944345 | 0.920266 | 0.958042 |
| AUC | 0.977802 | 0.914345 | 0.977564 | 0.917214 | 0.97032 | 0.924272 | 0.972621 | 0.974592 | 0.935459 | 0.978803 |
| max MCC | 0.856092 | 0.823152 | 0.872648 | 0.759582 | 0.811426 | 0.74558 | 0.830594 | 0.858614 | 0.77523 | 0.887078 |

Modality_2:

| base predictor | ADAB | DT | GB | KNN | LR | MLP | NB | RF | SVM | XGB |
|---|---|---|---|---|---|---|---|---|---|---|
| fmax (minority) | 0.952663 | 0.902857 | 0.935385 | 0.896755 | 0.936047 | 0.895522 | 0.935294 | 0.941896 | 0.88254 | 0.94362 |
| f (majority) | 0.972028 | 0.939286 | 0.964103 | 0.938704 | 0.961131 | 0.93913 | 0.961404 | 0.96741 | 0.937815 | 0.966841 |
| AUC | 0.984128 | 0.926109 | 0.984004 | 0.958524 | 0.988524 | 0.963715 | 0.987255 | 0.987513 | 0.96613 | 0.984479 |
| max MCC | 0.919982 | 0.843132 | 0.897333 | 0.835469 | 0.897343 | 0.84135 | 0.892473 | 0.911979 | 0.822439 | 0.910852 |
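
The `fmax` rows report a maximum F-measure. Assuming this follows the usual definition (the highest F-measure obtained over all decision thresholds on the predicted scores), it can be sketched as below. The function `fmax_score` and the toy arrays are illustrative only and are not part of eipy.

```python
import numpy as np

def fmax_score(y_true, y_score):
    """Return (best F-measure, threshold) over all candidate thresholds."""
    best_f, best_threshold = 0.0, None
    for threshold in np.unique(y_score):
        y_pred = y_score >= threshold
        tp = np.sum(y_pred & (y_true == 1))
        fp = np.sum(y_pred & (y_true == 0))
        fn = np.sum(~y_pred & (y_true == 1))
        if tp == 0:
            continue  # F-measure is undefined or zero without true positives
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_f, best_threshold = f, threshold
    return best_f, best_threshold

# Toy labels (1 = positive class) and predicted scores.
toy_y = np.array([0, 0, 1, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7])

toy_fmax, toy_threshold = fmax_score(toy_y, toy_scores)
print(toy_fmax, toy_threshold)
```

The threshold returned alongside the score is the kind of value that can later be applied to turn predicted probabilities into class labels.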


Now let's define some meta models for stacked generalization. We add an "S." prefix to the keys of stacking algorithms.

In [66]:
meta_predictors = {
    'Mean': MeanAggregation(),
    'CES': CES(),
    'S.ADAB': AdaBoostClassifier(),
    'S.XGB': XGBClassifier(),
    'S.DT': DecisionTreeClassifier(),
    'S.RF': RandomForestClassifier(),
    'S.GB': GradientBoostingClassifier(),
    'S.KNN': KNeighborsClassifier(),
    'S.LR': LogisticRegression(),
    'S.NB': GaussianNB(),
    'S.MLP': MLPClassifier(),
    'S.SVM': SVC(probability=True),
}
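
To give a feel for the difference between mean aggregation and stacking, here is a small sketch on made-up base predictor outputs. The array `toy_meta_features` and the weights are hypothetical; a real stacking model would learn its own, possibly non-linear, combination from the meta training data.

```python
import numpy as np

# Hypothetical base predictor outputs: rows are samples, columns are
# (modality, base predictor) pairs, values are predicted probabilities.
toy_meta_features = np.array([
    [0.9, 0.8, 0.7, 0.6],
    [0.2, 0.1, 0.4, 0.3],
    [0.6, 0.4, 0.5, 0.5],
])

# Mean aggregation: every base predictor gets an equal vote.
mean_prediction = toy_meta_features.mean(axis=1)

# Stacking instead learns how to combine the columns. Fixed, made-up
# weights stand in here for what a trained linear meta model might learn.
toy_weights = np.array([0.4, 0.3, 0.2, 0.1])
weighted_prediction = toy_meta_features @ toy_weights

print(mean_prediction)
print(weighted_prediction)
```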

Train meta models:

In [67]:
EI.train_meta(meta_predictors=meta_predictors)

Analyzing ensembles: |██████████|100%
Training final meta models: |██████████|100%


<eipy.ei.EnsembleIntegration at 0x7f5d544167d0>

Check the meta summary with `meta_summary`:

In [68]:
EI.meta_summary['metrics']

| | Mean | CES | S.ADAB | S.XGB | S.DT | S.RF | S.GB | S.KNN | S.LR | S.NB | S.MLP | S.SVM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fmax (minority) | 0.93913 | 0.947368 | 0.932153 | 0.948328 | 0.923077 | 0.95092 | 0.950437 | 0.94362 | 0.953488 | 0.94186 | 0.956012 | 0.949853 |
| f (majority) | 0.962832 | 0.96831 | 0.95972 | 0.97074 | 0.954545 | 0.972603 | 0.970018 | 0.966841 | 0.971731 | 0.964664 | 0.973638 | 0.970228 |
| AUC | 0.98871 | 0.990134 | 0.977141 | 0.986151 | 0.937771 | 0.985738 | 0.980361 | 0.976966 | 0.98838 | 0.975418 | 0.983385 | 0.979711 |
| max MCC | 0.902221 | 0.91572 | 0.891884 | 0.920696 | 0.877664 | 0.926137 | 0.9211 | 0.910557 | 0.925388 | 0.902221 | 0.92966 | 0.920091 |


The MLP stacking algorithm (`S.MLP`) has the best $\text{F}_\text{max}$ score (0.956), the preferred metric for imbalanced datasets, so let's select it as our final model.
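
The selection can also be done programmatically. The sketch below copies the `fmax (minority)` row from the table above into a plain dictionary (the name `fmax_by_model` is made up for illustration) and picks the best entry.

```python
# F-max values copied from the meta summary table above.
fmax_by_model = {
    "Mean": 0.93913, "CES": 0.947368, "S.ADAB": 0.932153, "S.XGB": 0.948328,
    "S.DT": 0.923077, "S.RF": 0.95092, "S.GB": 0.950437, "S.KNN": 0.94362,
    "S.LR": 0.953488, "S.NB": 0.94186, "S.MLP": 0.956012, "S.SVM": 0.949853,
}

# Key with the highest F-max value.
best_model = max(fmax_by_model, key=fmax_by_model.get)

# Full ranking, best first, to see how close the runners-up are.
ranked = sorted(fmax_by_model.items(), key=lambda item: item[1], reverse=True)

print(best_model)  # S.MLP
print(ranked[:3])
```

The top three methods differ by roughly 0.005 in $\text{F}_\text{max}$, so the ranking among them may not be stable across random seeds.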

### Predictions on unseen data

Since we ran EI with `model_building=True`, we can make predictions. Let's predict the test set and apply the $\text{F}_\text{max}$ threshold calculated during training:

In [70]:
y_pred = EI.predict(X_dict=data_test, meta_model_key='S.MLP')

threshold = EI.meta_summary['thresholds']['S.MLP']['fmax (minority)']

# Binarize in one step: scores at or above the threshold become 1, the rest 0
y_pred = np.where(y_pred >= threshold, 1.0, 0.0)

print(y_pred)

[0. 1. 0. 1. 0. 1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0.
 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0.
 1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 0. 0. 0.]
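
Once predictions have been thresholded, they can be compared with held-out labels. The sketch below does this with confusion counts on small made-up arrays; `toy_y_test` and `toy_y_pred` are placeholders, not the tutorial's actual `y_test` and `y_pred`.

```python
import numpy as np

# Made-up labels and thresholded predictions (1 = positive class).
toy_y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_y_pred = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0])

# Confusion counts.
tp = int(np.sum((toy_y_pred == 1) & (toy_y_test == 1)))
fp = int(np.sum((toy_y_pred == 1) & (toy_y_test == 0)))
fn = int(np.sum((toy_y_pred == 0) & (toy_y_test == 1)))
tn = int(np.sum((toy_y_pred == 0) & (toy_y_test == 0)))

# F-measure of the positive class written directly in terms of the counts.
f_score = 2 * tp / (2 * tp + fp + fn)

print(tp, fp, fn, tn, f_score)  # 3 1 1 3 0.75
```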



### Interpreting the final model

We now use `PermutationInterpreter` to interpret the final MLP stacked generalization model. Let's first import `PermutationInterpreter` and our chosen metric, and initialise the interpreter:

In [71]:
from eipy.interpretation import PermutationInterpreter
from eipy.utils import f_minority_score

interpreter = PermutationInterpreter(EI=EI,
                                     metric=f_minority_score,
                                     meta_predictor_keys=['S.MLP'])
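
Permutation-based interpretation generally works by shuffling one feature at a time and measuring how much a performance metric drops. The following is a minimal sketch of that general idea, not of `PermutationInterpreter` itself: `toy_model` is a fixed rule standing in for a trained model, and plain accuracy is used instead of `f_minority_score` to keep the example short.

```python
import numpy as np

def toy_model(X):
    """Fixed rule standing in for a trained model: uses column 0 only."""
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
toy_y = (toy_X[:, 0] > 0).astype(int)

baseline = accuracy(toy_y, toy_model(toy_X))

importances = []
for column in range(toy_X.shape[1]):
    X_permuted = toy_X.copy()
    # Shuffling one column breaks its link with the labels.
    X_permuted[:, column] = rng.permutation(X_permuted[:, column])
    importances.append(baseline - accuracy(toy_y, toy_model(X_permuted)))

print(importances)
```

Only the first column carries signal here, so it is the only one whose permutation lowers the score; the other two columns have an importance of exactly zero.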

Calculate feature importance scores:

In [73]:
interpreter.rank_product_score(X_dict=data_train, y=y_train)

Interpreting ensembles...

Calculating local feature ranks: |██████████|100%
Calculating local model ranks: |██████████|100%

Calculating combined rank product score...
... complete!

<eipy.interpretation.PermutationInterpreter at 0x7f5df4bd8700>

We can now inspect the most important features for model prediction:

In [78]:
ranking_dataframe = interpreter.ensemble_feature_ranking['S.MLP']
reordered_feature_names = feature_names[ranking_dataframe.index]
ranking_dataframe['feature'] = reordered_feature_names

ranking_dataframe

| | modality | feature | RPS | feature rank | ensemble method |
|---|---|---|---|---|---|
| 3 | Modality_1 | mean area | 0.0565 | 1.0 | S.MLP |
| 23 | Modality_2 | worst area | 0.10875 | 2.0 | S.MLP |
| 1 | Modality_1 | mean texture | 0.14725 | 3.0 | S.MLP |
| 7 | Modality_1 | mean concave points | 0.1725 | 4.0 | S.MLP |
| 13 | Modality_2 | area error | 0.17775 | 5.0 | S.MLP |
| 21 | Modality_2 | worst texture | 0.19775 | 6.0 | S.MLP |
| 27 | Modality_2 | worst concave points | 0.207125 | 7.0 | S.MLP |
| 6 | Modality_1 | mean concavity | 0.2175 | 8.0 | S.MLP |
| 2 | Modality_1 | mean perimeter | 0.221 | 9.0 | S.MLP |
| 22 | Modality_2 | worst perimeter | 0.24175 | 10.0 | S.MLP |
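
In the table, a lower RPS corresponds to a better (smaller) feature rank. The exact formula eipy uses is not shown in this tutorial, so the sketch below only illustrates the generic rank-product idea: rank features within each model, then combine the ranks with a geometric mean. The scores in `toy_importances` and the scaling by the number of features are assumptions made for illustration.

```python
import numpy as np

# Hypothetical importance scores: rows are models, columns are features.
toy_importances = np.array([
    [0.30, 0.10, 0.05],
    [0.25, 0.20, 0.01],
    [0.40, 0.05, 0.10],
])
n_models, n_features = toy_importances.shape

# Rank features within each model (1 = most important).
order = np.argsort(-toy_importances, axis=1)
ranks = np.empty_like(order)
rows = np.arange(n_models)[:, None]
ranks[rows, order] = np.arange(1, n_features + 1)

# Rank product: geometric mean of each feature's ranks across models,
# scaled by the number of features so that values lie in (0, 1].
rank_product = np.exp(np.log(ranks).mean(axis=0)) / n_features

# Final ranking: the smallest rank product gets rank 1.
final_rank = np.argsort(np.argsort(rank_product)) + 1

print(rank_product)
print(final_rank)
```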
