# Undersampling

In this notebook we'll explain how to use the `UnderSamplingClassifier` that is referenced in the documentation.

## Imports

In [1]:
import sys
sys.path.append('..')
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, make_scorer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from skopt.space import Real, Integer, Categorical
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from nestedcvtraining.api import find_best_model
from nestedcvtraining.switch_case import SwitchCase
from nestedcvtraining.under_sampling_classifier import UnderSamplingClassifier
from skopt import gbrt_minimize
from collections import Counter

## Dataset

In [2]:
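# Haberman dataset: three numeric features followed by the class label
dataset = pd.read_csv(
    "https://raw.githubusercontent.com/JaimeArboleda/nestedcvtraining/master/datasets/haberman.csv", header=None
)
values = dataset.values
# The label column holds the values 1 and 2; shift it to 0 and 1
X, y = values[:, :-1], (values[:, -1] - 1).astype(int)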


In [3]:
X.shape

(306, 3)

In [4]:
Counter(y)

Counter({0: 225, 1: 81})

In this case we have a binary classification problem with moderate imbalance (225 examples against 81). It's a good candidate for the `UnderSamplingClassifier`, which is especially suited to binary problems, or to multiclass problems where only one class is underrepresented and the rest are roughly balanced. The reason is that the `UnderSamplingClassifier` trains an ensemble of models, taking all the examples from the least frequent class and a stratified sample from the remaining classes.
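The sampling idea described above can be sketched with the standard library alone. This is only an illustration for the binary case, not the library's implementation: the function name `undersampled_index_sets`, the number of ensemble members, and the choice to draw exactly as many majority examples as there are minority examples are all assumptions made for the example.

```python
import random
from collections import Counter


def undersampled_index_sets(y, n_models=3, seed=0):
    """Return one list of training indices per ensemble member (binary labels).

    Hypothetical helper: every subset keeps all minority examples and adds
    a random sample of the majority class of the same size.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    index_sets = []
    for _ in range(n_models):
        # Sample without replacement from the majority class
        sampled_majority = rng.sample(majority_idx, len(minority_idx))
        index_sets.append(sorted(minority_idx + sampled_majority))
    return index_sets


# Same class counts as the labels above: 225 zeros and 81 ones
y_demo = [0] * 225 + [1] * 81
index_sets = undersampled_index_sets(y_demo, n_models=3)
subset_counts = [Counter(y_demo[i] for i in idx) for idx in index_sets]
print(subset_counts)
```

Each subset is balanced, and because the majority sample is redrawn for every ensemble member, the members see different majority examples while sharing all the minority ones.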

## Model

Let's compare just two pipelines: one that oversamples with `SMOTE`, and another that does no oversampling but uses `UnderSamplingClassifier`, and see which works better.

In [5]:
clf = SwitchCase(
    cases=[
        (
            "option_1",
            ImbPipeline(
                [
                    ("resampler", SMOTE(k_neighbors=3)), 
                    ("model", RandomForestClassifier())
                ]
            )
        ),
        (
            "option_2",
            UnderSamplingClassifier(RandomForestClassifier())
        )
    ],
    switch="option_1"
)

search_space = [
    Categorical(["option_1", "option_2"], name="switch"),
    Categorical(["minority", "all"], name="option_1__resampler__sampling_strategy"),
    Integer(5, 15, name="option_1__model__max_depth"),
    Integer(5, 15, name="option_2__estimator__max_depth"),
]

In [6]:
best_model, best_params, report = find_best_model(
    X=X,
    y=y,
    model=clf,
    search_space=search_space,
    verbose=False,
    k_inner=8,
    k_outer=8,
    n_initial_points=5,
    n_calls=5,
    calibrate="only_best",
    calibrate_params={"method": "sigmoid"},
    optimizing_metric=make_scorer(roc_auc_score, average='weighted', multi_class='ovr', needs_proba=True),
    other_metrics={"acc": "accuracy"},
    skopt_func=gbrt_minimize
)

Looping over outer folds
Looping over outer folds
Looping over outer folds
Looping over outer folds
Looping over outer folds
Looping over outer folds
Looping over outer folds
Looping over outer folds


In [7]:
report.get_best_params()

{'switch': ['option_2',
  'option_1',
  'option_2',
  'option_1',
  'option_2',
  'option_2',
  'option_2',
  'option_2'],
 'option_1__resampler__sampling_strategy': ['all',
  'minority',
  'all',
  'all',
  'minority',
  'all',
  'all',
  'minority'],
 'option_1__model__max_depth': [5, 14, 7, 6, 7, 15, 14, 5],
 'option_2__estimator__max_depth': [13, 13, 12, 11, 6, 7, 12, 5]}
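The dictionary above holds one list per parameter, with one entry per outer fold, so reading it fold by fold takes some effort. The sketch below transposes it and keeps only the parameters that carry the prefix of the option selected by `switch`. The helper name `per_fold_active_params` is hypothetical, and the sketch assumes that the parameters of the option that was not selected have no effect on the fitted model.

```python
def per_fold_active_params(best_params):
    """Turn a dict of per-fold lists into one dict per outer fold,
    keeping only the parameters prefixed with the selected option."""
    n_folds = len(best_params["switch"])
    folds = []
    for i in range(n_folds):
        chosen = best_params["switch"][i]
        active = {"switch": chosen}
        for name, values in best_params.items():
            if name.startswith(chosen + "__"):
                active[name] = values[i]
        folds.append(active)
    return folds


# Values copied from the output of report.get_best_params() above
best_params_per_fold = {
    "switch": ["option_2", "option_1", "option_2", "option_1",
               "option_2", "option_2", "option_2", "option_2"],
    "option_1__resampler__sampling_strategy": ["all", "minority", "all", "all",
                                               "minority", "all", "all", "minority"],
    "option_1__model__max_depth": [5, 14, 7, 6, 7, 15, 14, 5],
    "option_2__estimator__max_depth": [13, 13, 12, 11, 6, 7, 12, 5],
}

active_params = per_fold_active_params(best_params_per_fold)
for fold, params in enumerate(active_params):
    print(fold, params)
```

Read this way, `option_2` (the `UnderSamplingClassifier`) is selected in six of the eight outer folds.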

In [8]:
df = report.to_dataframe()
df[df["best"]==True]

Unnamed: 0,best,outer_kfold,model,outer_test_indexes,param__option_1__model__max_depth,param__option_1__resampler__sampling_strategy,param__option_2__estimator__max_depth,param__switch,inner_validation_metrics__acc,inner_validation_metrics__optimizing_metric,outer_test_metrics__acc,outer_test_metrics__optimizing_metric
2,True,0,CalibratedClassifierCV(base_estimator=SwitchCa...,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",5,all,13,option_2,0.333556,0.695799,0.74359,0.605172
6,True,1,CalibratedClassifierCV(base_estimator=SwitchCa...,"[32, 33, 35, 36, 37, 38, 39, 40, 41, 42, 46, 4...",14,minority,13,option_1,0.636698,0.6736,0.717949,0.311688
12,True,2,CalibratedClassifierCV(base_estimator=SwitchCa...,"[70, 71, 72, 76, 77, 78, 79, 83, 84, 85, 86, 8...",7,all,12,option_2,0.357843,0.682975,0.710526,0.592857
19,True,3,CalibratedClassifierCV(base_estimator=SwitchCa...,"[116, 117, 118, 119, 120, 121, 122, 123, 124, ...",6,all,11,option_1,0.657531,0.654306,0.763158,0.623214
22,True,4,CalibratedClassifierCV(base_estimator=SwitchCa...,"[152, 153, 154, 155, 158, 159, 160, 161, 162, ...",7,minority,6,option_2,0.28721,0.626539,0.763158,0.832143
25,True,5,CalibratedClassifierCV(base_estimator=SwitchCa...,"[190, 191, 194, 195, 196, 197, 198, 199, 200, ...",15,all,7,option_2,0.272504,0.643322,0.736842,0.746429
30,True,6,CalibratedClassifierCV(base_estimator=SwitchCa...,"[228, 231, 232, 233, 234, 235, 236, 237, 238, ...",14,all,12,option_2,0.320967,0.656192,0.736842,0.532143
35,True,7,CalibratedClassifierCV(base_estimator=SwitchCa...,"[267, 269, 270, 271, 272, 273, 274, 275, 276, ...",5,minority,5,option_2,0.264929,0.645428,0.736842,0.741071


Both methods seem to perform about equally well here. However, the results vary a lot across folds, probably because we don't have many samples: with 8 outer folds, each outer test set holds only about 38 elements. We should use fewer folds to get larger test sets. Let's do that.
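How the number of folds affects the size of the evaluation sets can be checked with simple arithmetic. The sketch below assumes the data is split as evenly as possible (fold sizes differing by at most one sample, as in scikit-learn's `KFold`); the splitter actually used by `find_best_model` may distribute samples slightly differently. The helper `fold_sizes` is hypothetical.

```python
def fold_sizes(n_samples, k):
    """Sizes of k folds that differ by at most one sample."""
    base, extra = divmod(n_samples, k)
    return [base + 1 if i < extra else base for i in range(k)]


n_samples = 306  # number of rows in X
for k in (8, 3):
    outer = fold_sizes(n_samples, k)
    # The inner loop only sees the training part of one outer split
    inner = fold_sizes(n_samples - outer[0], k)
    print(f"k={k}: outer test folds {outer}, inner validation folds {inner}")
```

Going from 8 to 3 folds raises the outer test sets from roughly 38 to 102 samples, at the cost of training each model on less data.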

In [15]:
best_model, best_params, report = find_best_model(
    X=X,
    y=y,
    model=clf,
    search_space=search_space,
    verbose=False,
    k_inner=3,
    k_outer=3,
    n_initial_points=5,
    n_calls=5,
    calibrate="only_best",
    calibrate_params={"method": "sigmoid"},
    optimizing_metric=make_scorer(roc_auc_score, average='weighted', multi_class='ovr', needs_proba=True),
    other_metrics={"acc": "accuracy"},
    skopt_func=gbrt_minimize
)

Looping over 0 outer fold
Looping over 1 outer fold
Looping over 2 outer fold


In [16]:
report.get_best_params()

{'switch': ['option_2', 'option_1', 'option_2'],
 'option_1__resampler__sampling_strategy': ['all', 'all', 'all'],
 'option_1__model__max_depth': [13, 9, 5],
 'option_2__estimator__max_depth': [12, 6, 11]}

In [17]:
df = report.to_dataframe()
df[df["best"]==True]

Unnamed: 0,best,outer_kfold,model,outer_test_indexes,param__option_1__model__max_depth,param__option_1__resampler__sampling_strategy,param__option_2__estimator__max_depth,param__switch,inner_validation_metrics__optimizing_metric,inner_validation_metrics__acc,outer_test_metrics__optimizing_metric,outer_test_metrics__acc
4,True,0,CalibratedClassifierCV(base_estimator=SwitchCa...,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",13,all,12,option_2,0.66537,0.421569,0.576049,0.705882
6,True,1,CalibratedClassifierCV(base_estimator=SwitchCa...,"[101, 102, 103, 104, 105, 106, 108, 109, 110, ...",9,all,6,option_1,0.548333,0.573529,0.579012,0.72549
10,True,2,CalibratedClassifierCV(base_estimator=SwitchCa...,"[198, 199, 206, 207, 208, 209, 210, 211, 212, ...",5,all,11,option_2,0.634815,0.343137,0.662716,0.735294
