# Tutorial

In this tutorial we compare a number of ensemble methods on an equal footing using EI's built-in nested cross-validation implementation, and show how to make predictions with the selected final model.

First, let's import some `sklearn` models and `EnsembleIntegration`:

In [1]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from eipy.ei import EnsembleIntegration

Next, make some dummy "multi-modal" data by splitting one synthetic feature matrix into two halves:

In [14]:
X, y = make_classification(
    n_samples=200,
    n_features=50,
    weights=[0.3, 0.7],
    random_state=1,
)

X_1 = X[:, 0:25]
X_2 = X[:, 25:]
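
As a quick sanity check, the sketch below regenerates the same data and confirms that the two slices partition the 50 features between them. It also reports the class counts; with `weights=[0.3, 0.7]` class 0 should be the minority, although the exact counts are not asserted here.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200,
    n_features=50,
    weights=[0.3, 0.7],
    random_state=1,
)

X_1 = X[:, 0:25]  # first 25 columns
X_2 = X[:, 25:]   # remaining 25 columns

# Stacking the two slices side by side should reproduce the full matrix.
assert np.array_equal(np.hstack([X_1, X_2]), X)

class_counts = np.bincount(y)  # index 0 -> class 0, index 1 -> class 1
print(X_1.shape, X_2.shape, class_counts)
```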

Create a dictionary containing data modalities:

In [15]:
data = {
    "Modality_1": X_1,
    "Modality_2": X_2,
}
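
Because a single label vector `y` is later passed alongside every modality, each modality should contain the same samples in the same row order. The sketch below shows one way to check this before training. `check_modalities` is a hypothetical helper written for this tutorial, not part of `eipy`, and the toy arrays are made up for illustration.

```python
import numpy as np

def check_modalities(data, y):
    """Hypothetical helper: verify every modality has one row per label."""
    n_labels = len(y)
    for name, X_mod in data.items():
        if X_mod.shape[0] != n_labels:
            raise ValueError(
                f"{name} has {X_mod.shape[0]} rows, expected {n_labels}"
            )
    # Return the shapes so they can be inspected or logged.
    return {name: X_mod.shape for name, X_mod in data.items()}

rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=8)
toy_data = {
    "Modality_1": rng.normal(size=(8, 3)),
    "Modality_2": rng.normal(size=(8, 2)),
}
shapes = check_modalities(toy_data, toy_y)
print(shapes)
```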

Define base predictors:

In [4]:
base_predictors = {
    'ADAB': AdaBoostClassifier(),
    'XGB': XGBClassifier(),
    'DT': DecisionTreeClassifier(),
    'RF': RandomForestClassifier(),
    'GB': GradientBoostingClassifier(),
    'KNN': KNeighborsClassifier(),
    'LR': LogisticRegression(),
    'NB': GaussianNB(),
    'MLP': MLPClassifier(),
    'SVM': SVC(probability=True),
}

Initialise Ensemble Integration:

In [5]:
EI = EnsembleIntegration(
    base_predictors=base_predictors,
    k_outer=5,
    k_inner=5,
    n_samples=1,
    sampling_strategy="undersampling",
    sampling_aggregation="mean",
    n_jobs=-1,
    random_state=38,
    parallel_backend="loky",
    project_name="cell-division",
    model_building=True,
)
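
To make the roles of `k_outer` and `k_inner` concrete, here is a minimal sketch of a generic nested cross-validation loop built with `sklearn`'s `StratifiedKFold`. It illustrates the general technique only and is not `eipy`'s internal implementation; the toy data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(38)
y_toy = np.array([0] * 30 + [1] * 70)  # imbalanced toy labels
X_toy = rng.normal(size=(100, 4))      # toy features

k_outer, k_inner = 5, 5
outer = StratifiedKFold(n_splits=k_outer, shuffle=True, random_state=38)
inner = StratifiedKFold(n_splits=k_inner, shuffle=True, random_state=38)

fold_sizes = []
for outer_train, outer_test in outer.split(X_toy, y_toy):
    X_tr, y_tr = X_toy[outer_train], y_toy[outer_train]
    # Inner folds are drawn only from the outer training portion,
    # so the outer test fold stays untouched until evaluation.
    for inner_train, inner_val in inner.split(X_tr, y_tr):
        fold_sizes.append((len(inner_train), len(inner_val), len(outer_test)))

print(len(fold_sizes), fold_sizes[0])
```

The outer test folds are what make a comparison between ensemble methods fair: no model, base or meta, is fitted on the samples it is finally scored on.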

Train the base predictors on each modality, passing each modality's unique name through the `modality` argument:

In [7]:
for name, modality in data.items():
    EI.train_base(modality, y, modality=name)

Training base predictors on Modality_1...

... for analysis...

Generating meta training data: |██████████|100%
Generating meta test data: |██████████|100%

... for final ensemble...

Generating meta training data: |██████████|100%
Training final base predictors: |██████████|100%

Training base predictors on Modality_2...

... for analysis...

Generating meta training data: |██████████|100%
Generating meta test data: |██████████|100%

... for final ensemble...

Generating meta training data: |██████████|100%
Training final base predictors: |██████████|100%
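
The log above mentions "meta training data". In stacking, this is commonly a matrix of out-of-fold base predictions, with one column per (modality, base predictor) pair. The sketch below builds such a matrix with `sklearn`'s `cross_val_predict`. It is an illustration of the idea under that assumption, not `eipy`'s own code, and it uses only two base predictors to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(
    n_samples=200, n_features=50, weights=[0.3, 0.7], random_state=1
)
modalities = {"Modality_1": X[:, :25], "Modality_2": X[:, 25:]}
predictors = {"LR": LogisticRegression(max_iter=1000), "NB": GaussianNB()}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=38)
columns, blocks = [], []
for mod_name, X_mod in modalities.items():
    for pred_name, model in predictors.items():
        # Out-of-fold probability of class 1: every sample is scored
        # by a model that was not trained on it.
        oof = cross_val_predict(
            model, X_mod, y, cv=cv, method="predict_proba"
        )[:, 1]
        columns.append((mod_name, pred_name))
        blocks.append(oof)

meta_X = np.column_stack(blocks)  # shape: (n_samples, n_modalities * n_predictors)
print(columns, meta_X.shape)
```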








We can check the performance of each base predictor on each modality with `base_summary`:

In [9]:
EI.base_summary['metrics']

Modality_1:

| base predictor | ADAB | DT | GB | KNN | LR | MLP | NB | RF | SVM | XGB |
|---|---|---|---|---|---|---|---|---|---|---|
| fmax (minority) | 0.470588 | 0.467433 | 0.543046 | 0.502326 | 0.507463 | 0.49505 | 0.520548 | 0.539683 | 0.539877 | 0.546875 |
| f (majority) | 0.068966 | 0.0 | 0.722892 | 0.421622 | 0.75188 | 0.484848 | 0.721569 | 0.587678 | 0.683544 | 0.786765 |
| AUC | 0.566576 | 0.575658 | 0.672013 | 0.645418 | 0.620002 | 0.58698 | 0.660101 | 0.66535 | 0.655974 | 0.684751 |
| max MCC | 0.19168 | 0.139396 | 0.295787 | 0.255424 | 0.264714 | 0.166575 | 0.288023 | 0.275688 | 0.281959 | 0.335124 |

Modality_2:

| base predictor | ADAB | DT | GB | KNN | LR | MLP | NB | RF | SVM | XGB |
|---|---|---|---|---|---|---|---|---|---|---|
| fmax (minority) | 0.727273 | 0.740157 | 0.830508 | 0.636364 | 0.760331 | 0.677966 | 0.761062 | 0.827586 | 0.769231 | 0.817391 |
| f (majority) | 0.88172 | 0.879121 | 0.929078 | 0.862069 | 0.896057 | 0.865248 | 0.902778 | 0.929577 | 0.901408 | 0.926316 |
| AUC | 0.815544 | 0.816901 | 0.901227 | 0.781224 | 0.870622 | 0.806935 | 0.88041 | 0.911428 | 0.900342 | 0.869796 |
| max MCC | 0.620729 | 0.620585 | 0.760581 | 0.5064 | 0.657904 | 0.580915 | 0.671947 | 0.759442 | 0.683154 | 0.746812 |
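
The summary reports an "fmax" score. A common definition of F-max is the largest F1 score obtained over all decision thresholds, and the sketch below computes it that way with `sklearn`'s `precision_recall_curve`. `fmax_score` is a hypothetical helper, and `eipy`'s exact definition (including how it chooses the positive class for the "minority" variant) may differ.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def fmax_score(y_true, y_score, pos_label=1):
    """Hypothetical helper: return (best F1, threshold achieving it)."""
    precision, recall, thresholds = precision_recall_curve(
        y_true, y_score, pos_label=pos_label
    )
    # The final precision/recall pair has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(
        2 * precision * recall, denom,
        out=np.zeros_like(denom), where=denom > 0,
    )
    best = int(np.argmax(f1))
    return float(f1[best]), float(thresholds[best])

# Toy labels and scores, made up for illustration.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])
best_f1, best_threshold = fmax_score(y_true, y_score)
print(best_f1, best_threshold)
```

On this toy input the best trade-off is at a threshold of 0.35, where all five positives are recovered at the cost of one false positive.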


Now let's define some meta models for stacked generalization:

In [10]:
meta_models = {
    'ADAB': AdaBoostClassifier(),
    'XGB': XGBClassifier(),
    'DT': DecisionTreeClassifier(),
    'RF': RandomForestClassifier(),
    'GB': GradientBoostingClassifier(),
    'KNN': KNeighborsClassifier(),
    'LR': LogisticRegression(),
    'NB': GaussianNB(),
    'MLP': MLPClassifier(),
    'SVM': SVC(probability=True),
}
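
Stacked generalization trains a second-level model on the outputs of the base predictors, in contrast to simply averaging them. The sketch below contrasts the two on synthetic base-predictor scores. The scores, noise levels and train/test split are all assumptions made for illustration; this is not how `eipy` evaluates its ensembles.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)

# Hypothetical base-predictor scores: the label plus noise of varying
# strength, clipped to [0, 1]. One column per base predictor.
noise_levels = [0.3, 0.5, 0.8]
base_scores = np.column_stack(
    [np.clip(y + rng.normal(0, s, size=y.size), 0, 1) for s in noise_levels]
)

train, test = slice(0, 200), slice(200, 400)

# Mean aggregation: average the base scores for each sample.
mean_scores = base_scores[test].mean(axis=1)

# Stacking: fit a meta model on base scores, then score held-out samples.
stacker = LogisticRegression().fit(base_scores[train], y[train])
stacked_scores = stacker.predict_proba(base_scores[test])[:, 1]

auc_mean = roc_auc_score(y[test], mean_scores)
auc_stacked = roc_auc_score(y[test], stacked_scores)
print(auc_mean, auc_stacked)
```

A stacker can learn to weight reliable base predictors more heavily, which a plain mean cannot do.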

Train meta models:

In [11]:
EI.train_meta(meta_models=meta_models)

Analyzing ensembles: |██████████|100%
Training final meta models: |██████████|100%


<eipy.ei.EnsembleIntegration at 0x7fa9a20b0460>

Check the meta summary with `meta_summary`:

In [12]:
EI.meta_summary['metrics']

| metric | Mean | CES | S.ADAB | S.XGB | S.DT | S.RF | S.GB | S.KNN | S.LR | S.NB | S.MLP | S.SVM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fmax (minority) | 0.752137 | 0.786325 | 0.788462 | 0.810811 | 0.690265 | 0.810345 | 0.774194 | 0.817391 | 0.844037 | 0.831858 | 0.844037 | 0.844037 |
| f (majority) | 0.897527 | 0.911661 | 0.925676 | 0.927336 | 0.878049 | 0.922535 | 0.898551 | 0.926316 | 0.938356 | 0.930556 | 0.938356 | 0.941581 |
| AUC | 0.873334 | 0.885954 | 0.844439 | 0.869442 | 0.77291 | 0.878877 | 0.845029 | 0.881944 | 0.89362 | 0.891968 | 0.890317 | 0.853167 |
| max MCC | 0.644859 | 0.699489 | 0.724959 | 0.746129 | 0.572913 | 0.751622 | 0.688133 | 0.746812 | 0.785427 | 0.746812 | 0.797428 | 0.797428 |
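
Several ensembles can share the top score on one metric, so it helps to list ties explicitly before choosing. The sketch below does this with `pandas`, using the fmax and AUC rows copied from the summary above and breaking ties by AUC. It assumes the summary is laid out as a DataFrame with metrics as rows; the tie-breaking rule is a choice made for this example.

```python
import pandas as pd

columns = ["Mean", "CES", "S.ADAB", "S.XGB", "S.DT", "S.RF",
           "S.GB", "S.KNN", "S.LR", "S.NB", "S.MLP", "S.SVM"]
fmax_row = [0.752137, 0.786325, 0.788462, 0.810811, 0.690265, 0.810345,
            0.774194, 0.817391, 0.844037, 0.831858, 0.844037, 0.844037]
auc_row = [0.873334, 0.885954, 0.844439, 0.869442, 0.77291, 0.878877,
           0.845029, 0.881944, 0.89362, 0.891968, 0.890317, 0.853167]

metrics = pd.DataFrame(
    [fmax_row, auc_row],
    index=["fmax (minority)", "AUC"],
    columns=columns,
)

# All ensembles that reach the maximum fmax, in column order.
fmax = metrics.loc["fmax (minority)"]
top = fmax[fmax == fmax.max()].index.tolist()

# Break the tie using AUC among the tied ensembles only.
best_by_auc = metrics.loc["AUC", top].idxmax()
print(top, best_by_auc)
```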


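The logistic regression stacking algorithm (`S.LR`) ties with `S.MLP` and `S.SVM` for the best $\text{F}_\text{max}$ (0.844) and has the highest AUC of all the ensembles (0.894), so let's select it as our final model. Since we ran EI with `model_building=True`, we can now predict. For simplicity, let's predict on the training data and apply the $\text{F}_\text{max}$ threshold found during training. Scores on training data are optimistic, so use held-out data for any real evaluation.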

In [13]:
# Stacked generalisation algorithms have the prefix 'S.'
y_pred = EI.predict(X_dictionary=data, meta_model_key='S.LR')

# Threshold associated with the F-max score of S.LR
threshold = EI.meta_summary['thresholds']['S.LR']['fmax (minority)']

# Binarise the predicted scores in place
y_pred[y_pred >= threshold] = 1
y_pred[y_pred < threshold] = 0

print(y_pred)

[1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1.
 1. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0.
 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1.
 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0.
 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1.
 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0.
 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0.
 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 0. 1. 0. 1. 1. 0. 1.]
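
The cell above overwrites `y_pred` in place, so the continuous scores are lost once the labels are assigned. If both are needed, a non-mutating alternative is sketched below. `apply_threshold` is a hypothetical helper and the scores are made-up values, not output from `EI.predict`.

```python
import numpy as np

def apply_threshold(scores, threshold):
    """Hypothetical helper: return 0/1 labels, leaving the scores untouched."""
    return (np.asarray(scores) >= threshold).astype(int)

scores = np.array([0.92, 0.15, 0.50, 0.49, 0.73])  # made-up scores
labels = apply_threshold(scores, threshold=0.5)

print(labels)  # [1 0 1 0 1]
print(scores)  # original scores are still available
```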
