# Step_4_Ensemble models

- **Bagging**: use under- or over-sampling to create balanced datasets during bootstrap:
    - Undersampling: underbagging
    - Oversampling: overbagging

- **Boosting** (see the diagrams):
    - Undersampling
    - Oversampling

- **Hybrid**: boosting and bagging with different sampling techniques
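
To make the underbagging idea concrete, here is a minimal, dependency-free sketch of how one balanced bag could be drawn: each class is sampled with replacement to the size of the smallest class. The helper name `balanced_bag_indices` and the toy 90/10 labels are illustrative assumptions; library implementations such as `BalancedBaggingClassifier` may order the bootstrap and the undersampling steps differently.

```python
import random
from collections import Counter

def balanced_bag_indices(y, rng):
    """Return row indices for one balanced bag (hypothetical helper).

    Every class is sampled with replacement to the size of the smallest
    class, so the bag holds the same number of rows per class.
    """
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_per_class = min(len(idx) for idx in by_class.values())
    bag = []
    for idx in by_class.values():
        bag.extend(rng.choices(idx, k=n_per_class))  # sampling with replacement
    return bag

# toy labels (assumed): 90 negatives, 10 positives
y = [0] * 90 + [1] * 10
rng = random.Random(0)
bags = [balanced_bag_indices(y, rng) for _ in range(5)]  # one bag per base estimator
bag_counts = [Counter(y[i] for i in bag) for bag in bags]
print(bag_counts[0])
```

Each base estimator would then be trained on its own bag, and the ensemble would average their predictions.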


# **CONTENT**

    1. Load Libraries
    2. Load data
    3. Ensemble methods
    

# 1. Load Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#show cells with width as long as screen
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
#Hide warnings
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, RobustScaler # for preprocessing the data

from imblearn.combine import SMOTEENN, SMOTETomek

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, plot_roc_curve, precision_recall_curve

from imblearn.over_sampling import (
    RandomOverSampler, SMOTE, SMOTENC, ADASYN, BorderlineSMOTE, SVMSMOTE,
)

from imblearn.under_sampling import (
    RandomUnderSampler, CondensedNearestNeighbour, TomekLinks, OneSidedSelection,
    EditedNearestNeighbours, RepeatedEditedNearestNeighbours, AllKNN,
    NeighbourhoodCleaningRule, NearMiss, InstanceHardnessThreshold,
)

from sklearn.svm import SVC

In [2]:
# only names not already imported in the previous cell
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler

from imblearn.datasets import fetch_datasets

from imblearn.ensemble import (
    BalancedBaggingClassifier,
    BalancedRandomForestClassifier,
    RUSBoostClassifier,
    EasyEnsembleClassifier,
)

# metrics
from sklearn.metrics import accuracy_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

# 2. Load data

These data were generated in the notebook `step_2_Preprocessing_data`.

In [3]:
X_train = pd.read_excel('X_train_step_2.xlsx', engine='openpyxl')
X_test =  pd.read_excel('X_test_step_2.xlsx', engine='openpyxl')
y_train = pd.read_excel('y_train_step_2.xlsx', engine='openpyxl')
y_test = pd.read_excel('y_test_step_2.xlsx', engine='openpyxl')

# 3. Ensemble methods

With `class_weight='balanced_subsample'`, the weighting is the same as with “balanced”, except that the weights are computed from the bootstrap sample of every tree grown.
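
As a worked illustration of the “balanced” heuristic, scikit-learn documents the weight of each class as `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula to toy labels (the 90/10 split is an assumption, not this notebook's data). In the `balanced_subsample` mode the same computation would be repeated on each tree's bootstrap sample.

```python
from collections import Counter

def balanced_class_weights(y):
    """Class weights following n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

y_toy = [0] * 90 + [1] * 10   # assumed toy labels
weights = balanced_class_weights(y_toy)
print(weights)
```

The rarer class receives the larger weight, so its misclassifications count more during tree growing.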

A class-weight hyperparameter makes it possible to strike a balance between false negatives and false positives. The optimal setting of this hyperparameter depends not only on the cost of a false negative and the cost of a false positive, but also on the class imbalance. As when estimating the expected cost associated with a given set of class weights, it makes sense to bootstrap the data to estimate the variance around this expectation.
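
The cost reasoning above can be sketched with the standard library alone. The example computes an average misclassification cost and then bootstraps the evaluated rows to gauge its spread. The costs (10 per false negative, 1 per false positive), the toy predictions and both helper names are assumptions made for illustration. The bootstrap here resamples fixed predictions rather than retraining the model.

```python
import random
import statistics

def expected_cost(y_true, y_pred, cost_fn, cost_fp):
    """Average misclassification cost per sample."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            total += cost_fn      # missed positive
        elif t == 0 and p == 1:
            total += cost_fp      # false alarm
    return total / len(y_true)

def bootstrap_cost(y_true, y_pred, cost_fn, cost_fp, n_boot=200, seed=0):
    """Mean and standard deviation of the cost over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    costs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # rows drawn with replacement
        costs.append(expected_cost([y_true[i] for i in idx],
                                   [y_pred[i] for i in idx],
                                   cost_fn, cost_fp))
    return statistics.mean(costs), statistics.stdev(costs)

# toy predictions: 68 negatives (1 false alarm), 13 positives (11 missed)
y_true = [0] * 68 + [1] * 13
y_pred = [0] * 67 + [1] * 1 + [0] * 11 + [1] * 2
point = expected_cost(y_true, y_pred, cost_fn=10.0, cost_fp=1.0)
mean_cost, sd_cost = bootstrap_cost(y_true, y_pred, cost_fn=10.0, cost_fp=1.0)
print(point, mean_cost, sd_cost)
```

Comparing `point` and `sd_cost` across candidate class weights would show whether an apparent improvement is larger than the sampling noise.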

In [4]:
model = BalancedRandomForestClassifier(
    n_estimators=20,
    criterion='gini',
    max_depth=3,
    sampling_strategy='auto',
    n_jobs=4,
    random_state=42,
)

In [5]:
def confusion(classifier, X_test, y_test):
    y_pred  = classifier.predict(X_test)
    return confusion_matrix(y_test, y_pred).ravel()

In [6]:
def show(tn,fp,fn,tp):
    print("TN:" + str(tn) + " FP:" + str(fp) + " FN:" + str(fn) + " TP:" + str(tp) + 
          " FNR=" + str(fn/(fn+tp)) + " FPR=" + str(fp/(fp+tn)))

In [7]:
w_neg = 10**-4
w_pos_range = np.exp(np.arange(np.log(1), np.log(10**9)))
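
The `np.exp(np.arange(...))` expression builds a geometric grid: the exponents run over 0, 1, 2, ... up to, but not including, ln(10**9), so each positive-class weight is e times the previous one. A standard-library equivalent, written only to make the grid explicit (the name `w_pos_grid` is an assumption), could look like this:

```python
import math

# exponents 0, 1, ..., stopping before ln(10**9), which is about 20.72
n_steps = math.ceil(math.log(10**9))
w_pos_grid = [math.exp(k) for k in range(n_steps)]

print(len(w_pos_grid), w_pos_grid[0], w_pos_grid[-1])
```

The grid therefore contains 21 weights, from 1 up to e**20.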

In [8]:
for w_pos in w_pos_range:
    print("w_pos: " + str(w_pos))
    # refit a small forest with the current positive-class weight
    clf = RandomForestClassifier(
        random_state=0, n_jobs=-1, n_estimators=10,
        class_weight={0: w_neg, 1: w_pos},
    ).fit(X_train, y_train)
    show(*confusion(clf, X_test, y_test))

w_pos: 1.0
TN:67 FP:1 FN:11 TP:2 FNR=0.8461538461538461 FPR=0.014705882352941176
w_pos: 2.718281828459045
TN:65 FP:3 FN:11 TP:2 FNR=0.8461538461538461 FPR=0.04411764705882353
w_pos: 7.38905609893065
TN:66 FP:2 FN:11 TP:2 FNR=0.8461538461538461 FPR=0.029411764705882353
w_pos: 20.085536923187668
TN:66 FP:2 FN:11 TP:2 FNR=0.8461538461538461 FPR=0.029411764705882353
w_pos: 54.598150033144236
TN:64 FP:4 FN:11 TP:2 FNR=0.8461538461538461 FPR=0.058823529411764705
w_pos: 148.4131591025766
TN:67 FP:1 FN:10 TP:3 FNR=0.7692307692307693 FPR=0.014705882352941176
w_pos: 403.4287934927351
TN:66 FP:2 FN:11 TP:2 FNR=0.8461538461538461 FPR=0.029411764705882353
w_pos: 1096.6331584284585
TN:65 FP:3 FN:11 TP:2 FNR=0.8461538461538461 FPR=0.04411764705882353
w_pos: 2980.9579870417283
TN:64 FP:4 FN:10 TP:3 FNR=0.7692307692307693 FPR=0.058823529411764705
w_pos: 8103.083927575384
TN:67 FP:1 FN:11 TP:2 FNR=0.8461538461538461 FPR=0.014705882352941176
w_pos: 22026.465794806718
TN:34 FP:34 FN:6 TP:7 FNR=0.461538461
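
To show how each output line is derived, the sketch below recomputes the two rates from the counts of the first recorded line (TN:67 FP:1 FN:11 TP:2). The helper `error_rates` is a hypothetical variant of `show` that returns the rates and guards against a zero denominator, which the original function does not do.

```python
def error_rates(tn, fp, fn, tp):
    """Return (FNR, FPR); a rate is None when its denominator is zero."""
    fnr = fn / (fn + tp) if (fn + tp) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    return fnr, fpr

fnr, fpr = error_rates(tn=67, fp=1, fn=11, tp=2)
print(f"FNR={fnr:.4f} FPR={fpr:.4f}")   # FNR=0.8462 FPR=0.0147
```

FNR is the share of actual positives that were missed (11 of 13), and FPR is the share of actual negatives that were flagged (1 of 68).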

In [9]:
# re-sampling methods 

resampling_dict = {
    
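    # NOTE: the keys 'random' and 'smote' are defined again further down in this
    # dict. Python keeps only the last value for a repeated key, so this
    # RandomUnderSampler and the first SMOTE entry are never used.
    'random': RandomUnderSampler(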
        sampling_strategy='auto',
        random_state=0,
        replacement=False,
    ),

    'smote': SMOTE(
        sampling_strategy='auto',
        random_state=0,
        k_neighbors=5,
        n_jobs=4,
    ),
      

    'smtomek': SMOTETomek(
        sampling_strategy='auto',
        random_state=42,
        smote=SMOTE(sampling_strategy='auto', random_state=0, k_neighbors=5),
        tomek=TomekLinks(sampling_strategy='all'),
        n_jobs=4),
    
    'cnn' :  CondensedNearestNeighbour(
        sampling_strategy='auto', 
        random_state=42,
        n_neighbors=1,
        n_jobs=-1),
    
    'tomek' : TomekLinks(
        sampling_strategy='auto', 
        n_jobs=-1),
    
    'oss' : OneSidedSelection(
        sampling_strategy='auto',
        random_state= 42,
        n_neighbors=1,
        n_jobs=-1),
    
    'enn' : EditedNearestNeighbours(
        sampling_strategy='auto', 
        n_neighbors=3,
        kind_sel='all', 
        n_jobs=-1),
    
    'renn': RepeatedEditedNearestNeighbours(
        sampling_strategy='auto', 
        n_neighbors=3,
        kind_sel='all', 
        n_jobs=-1,
        max_iter=100),
    
    'allknn' : AllKNN(
        sampling_strategy='auto', 
        n_neighbors=3,
        kind_sel='all', 
        n_jobs=-1),
    
    'ncr' :  NeighbourhoodCleaningRule(
        sampling_strategy='auto',
        n_neighbors=3,
        kind_sel='all', 
        n_jobs=-1,
        threshold_cleaning=0.5),
    
    'nm1' : NearMiss(
        sampling_strategy='auto', 
        version=1,
        n_neighbors=3, 
        n_jobs=-1),
    
    'nm2':  NearMiss(
        sampling_strategy='auto', 
        version=2,
        n_neighbors=3, 
        n_jobs=-1),
    
    'nm3' :  NearMiss(
        sampling_strategy='auto', 
        version=3,
        n_neighbors=3, 
        n_jobs=-1),
    
    'iht' : InstanceHardnessThreshold(
        estimator=LogisticRegression(random_state=42),
        sampling_strategy='auto',
        random_state=42,
        n_jobs=-1,
        cv = 3),
    
    'random' : RandomOverSampler(
        sampling_strategy='auto', 
        random_state=0),
    
    'smote' : SMOTE(
        sampling_strategy='all', 
        random_state=0,
        k_neighbors=5,
        n_jobs = 4),
    
    'adasyn' : ADASYN(
        sampling_strategy='auto',
        random_state=0,
        n_neighbors=5,
        n_jobs=4),
    
    'border1' : BorderlineSMOTE(
        sampling_strategy='auto',
        random_state=0,
        k_neighbors=5,
        m_neighbors=10,
        kind='borderline-1',
        n_jobs=4),
    
    'border2' : BorderlineSMOTE(
        sampling_strategy='auto',
        random_state=0,
        k_neighbors=5,
        m_neighbors=10,
        kind='borderline-2',
        n_jobs=4),
    
    'svm' : SVMSMOTE(
        sampling_strategy='auto',
        random_state=0,
        k_neighbors=5,
        m_neighbors=10,
        n_jobs=4,
        svm_estimator=SVC(kernel='linear'))
}
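
One detail of the dictionary above is worth illustrating: it uses the keys `'random'` and `'smote'` twice, and a Python dict literal silently keeps only the last value for a repeated key. The sketch below reproduces the effect with placeholder strings instead of sampler objects. The replacement key names `random_under` and `random_over` are hypothetical suggestions, not names used elsewhere in this notebook.

```python
# placeholder strings stand in for the sampler objects; only key handling matters here
samplers = {
    'random': 'RandomUnderSampler',
    'smote': 'SMOTE(sampling_strategy="auto")',
    'tomek': 'TomekLinks',
    'random': 'RandomOverSampler',             # same key again: replaces the value above
    'smote': 'SMOTE(sampling_strategy="all")',
}
print(samplers)

# distinct keys (hypothetical names) keep every configuration
samplers_fixed = {
    'random_under': 'RandomUnderSampler',
    'random_over': 'RandomOverSampler',
}
print(len(samplers), len(samplers_fixed))
```

With the dictionary as written, the entry labelled `'random'` in the results therefore refers to random oversampling.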

In [10]:
# ensemble methods (with or without resampling)

ensemble_dict = {

    # balanced random forests (bagging)
    'balancedRF': BalancedRandomForestClassifier(
        n_estimators=20,
        criterion='gini',
        max_depth=3,
        sampling_strategy='auto',
        n_jobs=4,
        random_state=42,
    ),

    # bagging of Logistic regression, no resampling
    'bagging': BaggingClassifier(
        base_estimator=LogisticRegression(random_state=42),
        n_estimators=20,
        n_jobs=4,
        random_state=42,
    ),

    # bagging of Logistic regression, with resampling
    'balancedbagging': BalancedBaggingClassifier(
        base_estimator=LogisticRegression(random_state=42),
        n_estimators=20,
        max_samples=1.0,  # The number of samples to draw from X to train each base estimator
        max_features=1.0,  # The number of features to draw from X to train each base estimator
        bootstrap=True,
        bootstrap_features=False,
        sampling_strategy='auto',
        n_jobs=4,
        random_state=2909,
    ),

    # boosting + undersampling
    'rusboost': RUSBoostClassifier(
        base_estimator=None,
        n_estimators=20,
        learning_rate=1.0,
        sampling_strategy='auto',
        random_state=42,
    ),

    # bagging + boosting + under-sampling
    'easyEnsemble': EasyEnsembleClassifier(
        n_estimators=20,
        sampling_strategy='auto',
        n_jobs=4,
        random_state=42,
    ),
}

In [11]:
# function to train random forests and evaluate the performance

def run_randomForests(X_train, X_test, y_train, y_test):

    rf = RandomForestClassifier(
        n_estimators=20, random_state=42, max_depth=2, n_jobs=4)
    rf.fit(X_train, y_train)

    print('Train set')
    pred = rf.predict_proba(X_train)
    print(
        'Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

    print('Test set')
    pred = rf.predict_proba(X_test)
    print(
        'Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

    # confusion matrix
    cf_matrix = confusion_matrix(y_test, rf.predict(X_test), labels=[0, 1])

    # False Positive Rate (FPR) vs False Negative Rate (FNR)
    tn, fp, fn, tp = cf_matrix.ravel()

    FPR = fp / (tn + fp)

    FNR = fn / (tp + fn)
    
    return roc_auc_score(y_test, pred[:, 1]), cf_matrix, FPR, FNR

In [12]:
# function to train AdaBoost and evaluate the performance

def run_adaboost(X_train, X_test, y_train, y_test):

    ada = AdaBoostClassifier(n_estimators=20, random_state=42)
    
    ada.fit(X_train, y_train)

    print('Train set')
    pred = ada.predict_proba(X_train)
    print(
        'AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

    print('Test set')
    pred = ada.predict_proba(X_test)
    print(
        'AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

    # confusion matrix
    cf_matrix = confusion_matrix(y_test, ada.predict(X_test), labels=[0, 1])

    # False Positive Rate (FPR) vs False Negative Rate (FNR)
    tn, fp, fn, tp = cf_matrix.ravel()

    FPR = fp / (tn + fp)

    FNR = fn / (tp + fn)
    
    return roc_auc_score(y_test, pred[:, 1]), cf_matrix, FPR, FNR

In [13]:
# function to train an ensemble model and evaluate the performance

def run_ensemble(ensemble, X_train, X_test, y_train, y_test):
    
    ensemble.fit(X_train, y_train)

    print('Train set')
    pred = ensemble.predict_proba(X_train)
    print(
        'Ensemble roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

    print('Test set')
    pred = ensemble.predict_proba(X_test)
    print(
        'Ensemble roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

    # confusion matrix
    cf_matrix = confusion_matrix(y_test, ensemble.predict(X_test), labels=[0, 1])

    # False Positive Rate (FPR) vs False Negative Rate (FNR)
    tn, fp, fn, tp = cf_matrix.ravel()

    FPR = fp / (tn + fp)

    FNR = fn / (tp + fn)
    
    return roc_auc_score(y_test, pred[:, 1]), cf_matrix, FPR, FNR
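
The helpers above report two kinds of numbers: ROC-AUC, computed from predicted probabilities, and FPR/FNR, computed from hard predictions. The dependency-free sketch below illustrates why they can disagree. ROC-AUC equals the share of (positive, negative) pairs that the scores rank correctly, so it does not depend on a cut-off, whereas FNR changes with the threshold. Both function names and the four toy scores are assumptions made for illustration.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

def fnr_at_threshold(y_true, scores, threshold=0.5):
    """False negative rate when scores >= threshold are labelled positive."""
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    return fn / (fn + tp)

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
print(auc, fnr_at_threshold(y_true, scores), fnr_at_threshold(y_true, scores, 0.3))
```

A model can thus rank positives reasonably well and still miss all of them at the default cut-off, which is one possible reading of a moderate ROC-AUC paired with an FNR of 1.0.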

In [14]:

    
results_dict = {}
shapes_dict = {}
conf_matrix = {}
FNR_dict = {}
FPR_dict = {}

# baseline: random forest on the full data (keep the test ROC-AUC only)
roc = run_randomForests(X_train, X_test, y_train, y_test)
results_dict['full_data'] = roc[0]
print()

# baseline: AdaBoost on the full data
roc = run_adaboost(X_train, X_test, y_train, y_test)
results_dict['full_data_adaboost'] = roc[0]
print()

for sampler in resampling_dict.keys():

    print(sampler)

    # resample
    X_resampled, y_resampled = resampling_dict[sampler].fit_resample(X_train, y_train)

    # train model
    roc = run_randomForests(X_resampled, X_test, y_resampled, y_test)

    # store results (the ROC-AUC is no longer overwritten by the full tuple)
    results_dict[sampler] = roc[0]
    conf_matrix[sampler] = roc[1]
    FPR_dict[sampler] = roc[2]
    FNR_dict[sampler] = roc[3]
    shapes_dict[sampler] = len(X_resampled)
    print()

for ensemble in ensemble_dict.keys():

    print(ensemble)

    # train model and store the test ROC-AUC
    roc = run_ensemble(ensemble_dict[ensemble], X_train, X_test, y_train, y_test)
    results_dict[ensemble] = roc[0]
    print()

Train set
Random Forests roc-auc: 0.862662613122172
Test set
Random Forests roc-auc: 0.6691176470588235

Train set
AdaBoost roc-auc: 0.9063560520361992
Test set
AdaBoost roc-auc: 0.5967194570135747

random
Train set
Random Forests roc-auc: 0.8762908196366783
Test set
Random Forests roc-auc: 0.5916289592760181


smote
Train set
Random Forests roc-auc: 0.9024789143598615
Test set
Random Forests roc-auc: 0.6713800904977375


smtomek
Train set
Random Forests roc-auc: 0.8942661179698217
Test set
Random Forests roc-auc: 0.697398190045249


cnn
Train set
Random Forests roc-auc: 0.874102564102564
Test set
Random Forests roc-auc: 0.47737556561085975


tomek
Train set
Random Forests roc-auc: 0.8536121673003803
Test set
Random Forests roc-auc: 0.6029411764705883


oss
Train set
Random Forests roc-auc: 0.8466346153846154
Test set
Random Forests roc-auc: 0.6628959276018099


enn
Train set
Random Forests roc-auc: 0.8104260935143287
Test set
Random Forests roc-auc: 0.623868778280543


renn
Train set


In [15]:
FNR_dict

{'random': 0.5384615384615384,
 'smote': 0.46153846153846156,
 'smtomek': 0.46153846153846156,
 'cnn': 0.9230769230769231,
 'tomek': 1.0,
 'oss': 1.0,
 'enn': 1.0,
 'renn': 1.0,
 'allknn': 1.0,
 'ncr': 1.0,
 'nm1': 0.38461538461538464,
 'nm2': 0.3076923076923077,
 'nm3': 0.38461538461538464,
 'iht': 0.15384615384615385,
 'adasyn': 0.3076923076923077,
 'border1': 0.5384615384615384,
 'border2': 0.46153846153846156,
 'svm': 0.9230769230769231}
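
The dictionary above is easier to read once it is ranked. The sketch below copies the recorded values, sorts the samplers by FNR, and lists those with an FNR of 1.0, meaning that every actual positive in the test set was predicted as negative. The variable names are illustrative. FNR alone does not settle which sampler is preferable, since a low FNR may come with a high FPR.

```python
# values copied from the recorded FNR_dict output
fnr_by_sampler = {
    'random': 0.5384615384615384, 'smote': 0.46153846153846156,
    'smtomek': 0.46153846153846156, 'cnn': 0.9230769230769231,
    'tomek': 1.0, 'oss': 1.0, 'enn': 1.0, 'renn': 1.0, 'allknn': 1.0, 'ncr': 1.0,
    'nm1': 0.38461538461538464, 'nm2': 0.3076923076923077,
    'nm3': 0.38461538461538464, 'iht': 0.15384615384615385,
    'adasyn': 0.3076923076923077, 'border1': 0.5384615384615384,
    'border2': 0.46153846153846156, 'svm': 0.9230769230769231,
}

# lowest FNR first
ranked = sorted(fnr_by_sampler.items(), key=lambda kv: kv[1])
best_name, best_fnr = ranked[0]

# samplers after which the forest missed every positive test case
all_positives_missed = [name for name, fnr in fnr_by_sampler.items() if fnr == 1.0]

print(best_name, round(best_fnr, 4))
print(all_positives_missed)
```

The same ranking applied to `FPR_dict` would show what each reduction in FNR costs in false alarms.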

A more sophisticated way of combining predictions is to use a meta-classifier, which takes the predictions of other classifiers as input and makes the final prediction. From the mlxtend library documentation:

Stacking is an ensemble learning technique to combine multiple classification models via a meta-classifier. The individual classification models are trained based on the complete training set; then, the meta-classifier is fitted based on the outputs -- meta-features -- of the individual classification models in the ensemble. The meta-classifier can either be trained on the predicted class labels or probabilities from the ensemble.

In [17]:
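from mlxtend.classifier import StackingClassifier  # requires the mlxtend package
from xgboost import XGBClassifier  # missing import; requires the xgboost package

# base classifiers pass their predicted probabilities to a logistic-regression meta-classifier
m = StackingClassifier(
    classifiers=[
        LogisticRegression(),
        XGBClassifier(max_depth=2)
    ],
    use_probas=True,
    meta_classifier=LogisticRegression()
)

# `train`, `test`, `labels`, `target` and `preds` are not defined in this notebook;
# the data loaded in section 2 is used instead (an assumption about the intent)
m.fit(X_train, y_train.values.ravel())
preds = pd.DataFrame({'stack_pred': m.predict_proba(X_test)[:, 1]})
preds.head()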

ModuleNotFoundError: No module named 'mlxtend'
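
When mlxtend is not installed, a similar stack can be sketched with scikit-learn's own `StackingClassifier`. This is a swapped-in alternative, not the mlxtend API: it trains the meta-classifier on cross-validated predictions of the base models rather than on predictions from models fitted on the complete training set. The synthetic data, the random forest used in place of `XGBClassifier`, and all variable names are assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic imbalanced data standing in for the notebook's X_train / y_train
X, y = make_classification(n_samples=400, n_features=8, weights=[0.85, 0.15],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ('logreg', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=20, max_depth=2, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    stack_method='predict_proba',   # feed probabilities to the meta-classifier
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_pred = stack.predict_proba(X_te)[:, 1]   # positive-class probability
print(stack_pred[:5])
```

The resulting `stack_pred` plays the same role as the `stack_pred` column in the mlxtend cell and could be scored with `roc_auc_score` in the same way as the other models.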