# Nested CV Training

This notebook shows, with examples, how to use this library, which is designed for:
* Performing nested cross validation for model and hyperparameter selection and performance estimation.
* Optimizing not only model hyperparameters, but whole preprocessing pipelines. 
* Dealing with imbalanced multiclass classification problems. 
* Probability calibration of predictions. 
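
As a rough picture of what "nested" means here, the sketch below splits sample indices into outer folds and then splits each outer training set again into inner folds. It is not the library's code: `k_fold_indices` and `nested_folds` are names made up for this example, and a real implementation would typically also shuffle and stratify.

```python
def k_fold_indices(indices, k):
    """Split a list of indices into k contiguous (train, test) pairs."""
    indices = list(indices)
    n = len(indices)
    # The first n % k folds get one extra sample so that all samples are used.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


def nested_folds(n_samples, k_outer, k_inner):
    """Yield (outer_train, outer_test, inner_folds) for each outer fold."""
    for outer_train, outer_test in k_fold_indices(range(n_samples), k_outer):
        # Inner folds are built from the outer training indices only,
        # so the outer test set never takes part in hyperparameter tuning.
        yield outer_train, outer_test, k_fold_indices(outer_train, k_inner)


for outer_train, outer_test, inner in nested_folds(n_samples=12, k_outer=3, k_inner=2):
    print(len(outer_train), len(outer_test), [len(val) for _, val in inner])
```

The inner folds are used to pick hyperparameters and the outer test fold is used only to estimate performance, which is why the two levels must not share samples.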

## Imports

In [None]:
import sys
sys.path.append('..')

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, make_scorer, fbeta_score, balanced_accuracy_score
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from skopt.space import Real, Integer, Categorical
from xgboost import XGBClassifier
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from nestedcvtraining.utils.pipes_and_transformers import IdentityTransformer
from nestedcvtraining.api import find_best_model, OptionedTransformer
from skopt import gbrt_minimize, dummy_minimize

## Dataset
Let's show how to use the library with a synthetic multiclass imbalanced dataset:

In [4]:
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=2000,
    n_features=20, 
    n_redundant=5, 
    n_informative=5, 
    n_classes=3, 
    n_clusters_per_class=3,
    flip_y=0.05,
    class_sep=0.7,
    weights=[0.8, 0.15, 0.05]
)

In [7]:
from collections import Counter
Counter(y)

Counter({1: 325, 0: 1548, 2: 127})

In [8]:
X.shape

(2000, 20)

We have 2000 samples and 20 features. The classes are clearly imbalanced: the smallest class (label 2) has only 127 samples, against 1548 for the majority class (label 0).
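
To put numbers on the imbalance, the class shares can be computed directly from the counts printed above. This is just arithmetic on that output, with the counts hard-coded so the snippet runs on its own.

```python
from collections import Counter

# Class counts copied from the Counter output above.
class_counts = Counter({0: 1548, 1: 325, 2: 127})
total = sum(class_counts.values())

shares = {label: count / total for label, count in sorted(class_counts.items())}
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```

The shares differ somewhat from the requested `weights=[0.8, 0.15, 0.05]`, partly because `flip_y=0.05` reassigns the labels of a fraction of the samples at random.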

## Model definition

An important feature of the library is that it lets you define potentially very complex nested pipelines, each with its own hyperparameters to optimize. Let's see step by step how this is done.

Those structures are defined using a **model dictionary**: a nested dictionary with the following structure (shown in YAML notation for simplicity):

```yaml
model_name_1:
  model:
  pipeline:
  search_space:
model_name_2:
  model:
  pipeline:
  search_space:
...
```

That is, the top-level keys are just *names* that make it easy to identify what we are optimizing. There can be a single key (one model structure to optimize) or several. If there are several, the hyperparameter tuning algorithm handles them sequentially.

Each key (model name) has a value that is also a dictionary, with three keys:
* **model**, which is a `sklearn`-compatible model object (implementing `fit`, `predict`, `predict_proba` and so on).
* **pipeline**, which is an `imblearn` `Pipeline` that defines how to preprocess the features.
* **search_space**, which defines the whole set of hyperparameters (their possible values and restrictions) for the hyperparameter tuning algorithm. The hyperparameters (except two special ones that we will explain later) control the options of the model and the steps of the pipeline.
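
To make the expected shape concrete, here is a small checker for such a dictionary. It is only a sketch: `check_model_dict` is a hypothetical helper written for this notebook, not part of the library, and the placeholder strings stand in for real model, pipeline and search-space objects.

```python
REQUIRED_KEYS = {"model", "pipeline", "search_space"}


def check_model_dict(model_dict):
    """Return {model_name: sorted list of missing keys}; empty when all is well."""
    problems = {}
    for name, spec in model_dict.items():
        missing = REQUIRED_KEYS - set(spec)
        if missing:
            problems[name] = sorted(missing)
    return problems


example = {
    "model_name_1": {"model": "<estimator>", "pipeline": "<pipeline>", "search_space": []},
    "model_name_2": {"model": "<estimator>", "pipeline": "<pipeline>"},  # no search_space
}
print(check_model_dict(example))
```

A check like this can catch a misspelled key before a long tuning run starts.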

## Pipeline

In `pipeline` we can include several steps with options to be tested and tuned. 

It's even possible to specify *branching points* (disjoint options) for tuning. For example, in a very high-dimensional problem we may be interested in performing feature reduction without knowing a priori whether `PCA` will perform better or worse than, say, `SelectKBest`. In this case we can use the library's `OptionedTransformer` utility, which is defined from a dictionary of options. We will see an example later.
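
The branching idea can be pictured with a toy class that holds several named transformers and delegates to the selected one. This is an illustration only, not how `OptionedTransformer` is implemented: `OptionSwitch` and `KeepFirstColumns` are invented for this sketch and work on plain lists so the snippet has no dependencies.

```python
class OptionSwitch:
    """Toy stand-in for the branching idea: delegate to one named option."""

    def __init__(self, options, option=None):
        self.options = options
        # Default to the first option when none is chosen.
        self.option = option if option is not None else next(iter(options))

    def set_params(self, option):
        self.option = option
        return self

    def fit(self, X, y=None):
        self.options[self.option].fit(X, y)
        return self

    def transform(self, X):
        return self.options[self.option].transform(X)


class KeepFirstColumns:
    """Dummy transformer: keep the first k columns of each row."""

    def __init__(self, k):
        self.k = k

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [row[: self.k] for row in X]


X_demo = [[1, 2, 3], [4, 5, 6]]
switch = OptionSwitch({"keep_1": KeepFirstColumns(1), "keep_2": KeepFirstColumns(2)})
print(switch.fit(X_demo).transform(X_demo))
print(switch.set_params(option="keep_2").fit(X_demo).transform(X_demo))
```

Because the active option is itself a parameter, a tuner can treat the choice between branches as just another categorical hyperparameter.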

## Search_space

The search space is built from the dimension classes of the `skopt` library for hyperparameter optimization (`Real`, `Integer` and `Categorical`). [Here](https://scikit-optimize.github.io/stable/modules/generated/skopt.Space.html) is a link for further information.
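
Each dimension in the search space carries a name, and nested parameters are addressed with scikit-learn's double-underscore convention (`step__parameter`), the same one used by `get_params` and `set_params`. The sketch below groups a flat set of values by their first prefix to show how such names can be read. `group_by_prefix` and the sampled values are invented for illustration; how the library itself routes parameters is not shown here.

```python
def group_by_prefix(params):
    """Group 'prefix__rest' names by prefix; names without '__' go under None."""
    grouped = {}
    for name, value in params.items():
        prefix, sep, rest = name.partition("__")
        key, inner = (prefix, rest) if sep else (None, name)
        grouped.setdefault(key, {})[inner] = value
    return grouped


# Hypothetical values, as if one point had been sampled from a search space.
sampled = {
    "model__max_depth": 9,
    "model__learning_rate": 0.1,
    "pre_process__option": "option_2",
    "pre_process__option_2__reduce_dims__k": 7,
    "undersampling_majority_class": False,
}
for target, values in group_by_prefix(sampled).items():
    print(target, values)
```

Only the first `__` is split here, so a deeper name such as `option_2__reduce_dims__k` is passed on intact to whatever handles the `pre_process` step.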

In [11]:
preprocess_options = {
    "option_1": Pipeline(
        [("scale", StandardScaler()), ("reduce_dims", PCA(n_components=5))]
    ),
    "option_2": Pipeline(
        [
            ("scale", StandardScaler()),
            ("reduce_dims", SelectKBest(mutual_info_classif, k=5)),
        ]
    ),
    "option_3": Pipeline([("identity", IdentityTransformer())])
}
op = OptionedTransformer(preprocess_options)

In [17]:
import pprint as pp
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
# Alias scikit-learn's Pipeline so it does not shadow the imblearn Pipeline
# imported at the top, which the later cells use together with SMOTE.
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_selector as selector

# Numeric columns: impute missing values with the median, then standardize.
numeric_features = ["age", "fare"]
numeric_transformer = SkPipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

# Categorical columns: one-hot encode, ignoring categories unseen during fit.
categorical_features = ["embarked", "sex", "pclass"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

# First version: select the columns by name.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Second version: select the columns by dtype. This replaces the definition above.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, selector(dtype_exclude="category")),
        ("cat", categorical_transformer, selector(dtype_include="category")),
    ]
)
clf = SkPipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

# List every tunable parameter name, using the step__parameter convention.
pp.pprint(sorted(clf.get_params().keys()))

['classifier',
 'classifier__C',
 'classifier__class_weight',
 'classifier__dual',
 'classifier__fit_intercept',
 'classifier__intercept_scaling',
 'classifier__l1_ratio',
 'classifier__max_iter',
 'classifier__multi_class',
 'classifier__n_jobs',
 'classifier__penalty',
 'classifier__random_state',
 'classifier__solver',
 'classifier__tol',
 'classifier__verbose',
 'classifier__warm_start',
 'memory',
 'preprocessor',
 'preprocessor__cat',
 'preprocessor__cat__categories',
 'preprocessor__cat__drop',
 'preprocessor__cat__dtype',
 'preprocessor__cat__handle_unknown',
 'preprocessor__cat__max_categories',
 'preprocessor__cat__min_frequency',
 'preprocessor__cat__sparse',
 'preprocessor__n_jobs',
 'preprocessor__num',
 'preprocessor__num__imputer',
 'preprocessor__num__imputer__add_indicator',
 'preprocessor__num__imputer__copy',
 'preprocessor__num__imputer__fill_value',
 'preprocessor__num__imputer__missing_values',
 'preprocessor__num__imputer__strategy',
 'preprocessor__num__imputer_

In [None]:

dict_models = {
    "xgboost": {
        "model": XGBClassifier(),
        "pipeline": Pipeline(
            [
                (
                    "pre_process",
                    OptionedTransformer(preprocess_options),
                ),
                ("resample", SMOTE(k_neighbors=3))
            ]
        ),
        "search_space": [
            Categorical([True, False], name="undersampling_majority_class"),
            Integer(5, 6, name="max_k_undersampling"),
            Categorical(["minority", "all"], name="resample__sampling_strategy"),
            Categorical(
                ["option_1", "option_2", "option_3"], name="pre_process__option"
            ),
            Integer(5, 15, name="model__max_depth"),
            Real(0.05, 0.31, prior="log-uniform", name="model__learning_rate"),
            Integer(1, 10, name="model__min_child_weight"),
            Real(0.8, 1, prior="log-uniform", name="model__subsample"),
            Real(0.13, 0.8, prior="log-uniform", name="model__colsample_bytree"),
            Categorical(["binary:logistic"], name="model__objective"),
            Integer(5, 10, name="pre_process__option_2__reduce_dims__k")
        ],
    }
}

best_model, loop_info = find_best_model(
    X=X,
    y=y,
    model_search_spaces=dict_models,
    verbose=False,
    k_inner=10,
    k_outer=9,
    skip_inner_folds=[0, 2, 4, 6, 8, 9],
    skip_outer_folds=[0, 2, 3, 4, 6, 8],
    n_initial_points=5,
    n_calls=5,
    optimizing_metric=make_scorer(roc_auc_score, multi_class='ovr', needs_proba=True),
    other_metrics=[],
    skopt_func=gbrt_minimize,
    calibrate="only_best",
)
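
The two skip lists in this call reduce the amount of work. Assuming they hold zero-based indices of the folds to leave out (an assumption based on the argument names, not on the library's documentation), the sketch below lists which folds remain. `active_folds` is a hypothetical helper, not a library function.

```python
def active_folds(k, skip):
    """Return the fold indices in range(k) that are not skipped."""
    skipped = set(skip)
    return [fold for fold in range(k) if fold not in skipped]


# Values taken from the find_best_model call above.
outer = active_folds(9, [0, 2, 3, 4, 6, 8])
inner = active_folds(10, [0, 2, 4, 6, 8, 9])

print("outer folds evaluated:", outer)
print("inner folds evaluated:", inner)
```

Under that reading, the data is still split into 9 outer and 10 inner folds, but only a subset of them is actually trained and scored, which shortens the run at the cost of a noisier estimate.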

In [None]:
loop_info

In [2]:
df = loop_info.to_dataframe()

In [7]:
loop_info.outer_test_indexes

[]

In [4]:
df.columns

Index(['name', 'best', 'outer_kfold', 'model', 'outer_test_indexes',
       'param__model__colsample_bytree', 'param__model__learning_rate',
       'param__model__max_depth', 'param__model__min_child_weight',
       'param__model__objective', 'param__model__subsample',
       'param__pre_process__option', 'param__resample__sampling_strategy',
       'inner_validation_metrics__optimizing_metric',
       'outer_test_metrics__optimizing_metric'],
      dtype='object')

In [3]:
df

| | name | best | outer_kfold | model | outer_test_indexes | param__max_k_undersampling | param__model__colsample_bytree | param__model__learning_rate | param__model__max_depth | param__model__min_child_weight | param__model__objective | param__model__subsample | param__pre_process__option | param__resample__sampling_strategy | param__undersampling_majority_class | inner_validation_metrics__optimizing_metric | outer_test_metrics__optimizing_metric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xgboost | True | 1 | `CalibratedClassifierCV(base_estimator=Pipeline...` | [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2... | 6 | 0.316229 | 0.302373 | 9 | 1 | binary:logistic | 0.912833 | option_3 | minority | False | 0.997073 | 1.0 |
| 1 | xgboost | False | 1 | | [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2... | 5 | 0.181328 | 0.132725 | 9 | 7 | binary:logistic | 0.875158 | option_1 | minority | False | 1.0 | |
| 2 | xgboost | False | 1 | | [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2... | 6 | 0.415824 | 0.066304 | 11 | 5 | binary:logistic | 0.993736 | option_3 | all | True | 0.998932 | |
| 3 | xgboost | False | 1 | | [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2... | 6 | 0.630219 | 0.065294 | 11 | 3 | binary:logistic | 0.811494 | option_3 | minority | False | 0.997196 | |
| 4 | xgboost | False | 1 | | [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2... | 6 | 0.346505 | 0.08766 | 8 | 10 | binary:logistic | 0.866375 | option_1 | all | False | 1.0 | |
| 5 | xgboost | False | 5 | | [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 9... | 6 | 0.374347 | 0.20744 | 11 | 4 | binary:logistic | 0.919624 | option_1 | minority | True | 0.997863 | |
| 6 | xgboost | False | 5 | | [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 9... | 6 | 0.188462 | 0.059691 | 8 | 8 | binary:logistic | 0.934988 | option_2 | minority | False | 1.0 | |
| 7 | xgboost | False | 5 | | [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 9... | 5 | 0.750325 | 0.301141 | 5 | 7 | binary:logistic | 0.905782 | option_2 | minority | True | 0.997863 | |
| 8 | xgboost | False | 5 | | [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 9... | 6 | 0.199586 | 0.12822 | 7 | 7 | binary:logistic | 0.99454 | option_3 | all | True | 0.998611 | |
| 9 | xgboost | True | 5 | `<nestedcvtraining.utils.pipes_and_transformers...` | [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 9... | 6 | 0.187036 | 0.224114 | 8 | 7 | binary:logistic | 0.802147 | option_2 | minority | True | 0.991097 | 1.0 |


In [5]:
loop_info.model[0].predict(X[loop_info.outer_test_indexes[0]])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       2, 2])

In [6]:
y[loop_info.outer_test_indexes[0]]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2,
       2, 2])
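
Comparing the two arrays above by eye shows a single mistake: one sample of class 2 is predicted as class 0. The sketch below quantifies this using the values copied from those outputs. It computes plain accuracy and balanced accuracy (the mean of per-class recall) by hand, so that the effect of class imbalance is visible without any library call.

```python
from collections import Counter

# Values copied from the two outputs above (24 outer-test samples).
predicted = [0] * 17 + [1] * 4 + [0, 2, 2]
actual = [0] * 17 + [1] * 4 + [2, 2, 2]

accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Balanced accuracy: the mean of per-class recall.
support = Counter(actual)
hits = Counter(a for p, a in zip(predicted, actual) if p == a)
recalls = {label: hits[label] / support[label] for label in sorted(support)}
balanced_accuracy = sum(recalls.values()) / len(recalls)

print(f"accuracy: {accuracy:.3f}")
print(f"recall per class: {recalls}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
```

One error out of 24 barely moves plain accuracy, but it costs a third of the recall on the smallest class, which is why a class-aware metric is the more informative choice on imbalanced data.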

In [8]:
dict_models = {
    "xgboost": {
        "model": XGBClassifier(),
        "pipeline": Pipeline(
            [
                (
                    "pre_process",
                    OptionedTransformer(preprocess_options),
                ),
                ("resample", SMOTE(k_neighbors=2))
            ]
        ),
        "search_space": [
            Categorical([True, False], name="undersampling_majority_class"),
            Integer(5, 6, name="max_k_undersampling"),
            Categorical(["minority", "all"], name="resample__sampling_strategy"),
            Categorical(
                ["option_1", "option_2", "option_3"], name="pre_process__option"
            ),
            Integer(5, 15, name="model__max_depth"),
            Real(0.05, 0.31, prior="log-uniform", name="model__learning_rate"),
            Integer(1, 10, name="model__min_child_weight"),
            Real(0.8, 1, prior="log-uniform", name="model__subsample"),
            Real(0.13, 0.8, prior="log-uniform", name="model__colsample_bytree"),
            Categorical(["binary:logistic"], name="model__objective"),
        ],
    }
}

best_model, loop_info = find_best_model(
    X=X,
    y=y,
    model_search_spaces=dict_models,
    verbose=False,
    k_inner=10,
    k_outer=9,
    skip_inner_folds=[0, 2, 4, 6, 8, 9],
    skip_outer_folds=[0, 2, 3, 4, 6, 8],
    n_initial_points=5,
    n_calls=5,
    optimizing_metric=make_scorer(roc_auc_score, multi_class='ovr', needs_proba=True),
    other_metrics=[],
    skopt_func=gbrt_minimize,
    calibrate="all",
)

Looping over 0 outer fold
Looping over 1 outer fold
Looping over 2 outer fold
Looping over 3 outer fold
Looping over 4 outer fold
Looping over 5 outer fold
Looping over 6 outer fold
Looping over 7 outer fold
Looping over 8 outer fold


In [10]:
loop_info.to_dataframe()

| | name | best | outer_kfold | model | outer_test_indexes | param__max_k_undersampling | param__model__colsample_bytree | param__model__learning_rate | param__model__max_depth | param__model__min_child_weight | param__model__objective | param__model__subsample | param__pre_process__option | param__resample__sampling_strategy | param__undersampling_majority_class | inner_validation_metrics__optimizing_metric | outer_test_metrics__optimizing_metric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xgboost | True | 1 | `CalibratedClassifierCV(base_estimator=Pipeline...` | [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2... | 5 | 0.450705 | 0.233304 | 10 | 9 | binary:logistic | 0.94398 | option_1 | minority | False | 1.0 | 1.0 |
| 1 | xgboost | False | 1 | | [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2... | 6 | 0.150003 | 0.268583 | 12 | 5 | binary:logistic | 0.871615 | option_2 | minority | False | 1.0 | |
| 2 | xgboost | False | 1 | | [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2... | 6 | 0.313034 | 0.112805 | 14 | 2 | binary:logistic | 0.963499 | option_3 | minority | False | 1.0 | |
| 3 | xgboost | False | 1 | | [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2... | 6 | 0.350818 | 0.082831 | 7 | 7 | binary:logistic | 0.931276 | option_3 | minority | False | 1.0 | |
| 4 | xgboost | False | 1 | | [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2... | 5 | 0.176513 | 0.128081 | 14 | 4 | binary:logistic | 0.88204 | option_2 | all | False | 1.0 | |
| 5 | xgboost | False | 5 | | [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 9... | 5 | 0.382755 | 0.05632 | 12 | 5 | binary:logistic | 0.960859 | option_3 | minority | False | 0.997543 | |
| 6 | xgboost | False | 5 | | [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 9... | 5 | 0.376941 | 0.141022 | 13 | 8 | binary:logistic | 0.902715 | option_3 | all | False | 0.997543 | |
| 7 | xgboost | False | 5 | | [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 9... | 6 | 0.222033 | 0.056644 | 13 | 2 | binary:logistic | 0.855059 | option_1 | minority | True | 1.0 | |
| 8 | xgboost | False | 5 | | [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 9... | 5 | 0.433423 | 0.167481 | 7 | 10 | binary:logistic | 0.890572 | option_1 | minority | False | 0.992207 | |
| 9 | xgboost | True | 5 | `<nestedcvtraining.utils.pipes_and_transformers...` | [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 9... | 5 | 0.1699 | 0.075548 | 8 | 10 | binary:logistic | 0.862665 | option_2 | all | True | 0.991411 | 0.993032 |
