<a href="https://colab.research.google.com/github/JOTOR/Examples_Python/blob/master/BayesSearchCV_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BayesSearchCV Demo <br>
Created on: August 2022 <br>
By: Jhonnatan Torres (jhonnatan.torres.suarez@gmail.com)
___

The main goal of this notebook is to demonstrate how to use **BayesSearchCV** to optimize the hyperparameters of an ML model and to compare its results against **RandomizedSearchCV**, which is already available in scikit-learn.

Importing required libraries

In [1]:
import numpy as np
import pandas as pd
import time
from collections import Counter
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import precision_score, classification_report
from xgboost import XGBClassifier

Installing **scikit-optimize** and importing **BayesSearchCV**

In [2]:
!pip install scikit-optimize -q

In [3]:
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

In [4]:
# for reproducibility purposes
np.random.seed(1234)

Getting the data from OpenML

In [5]:
X, y = fetch_openml("credit-g", version=1, as_frame=True, return_X_y=True)

In [6]:
X.head(3)

| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | none | 4.0 | real estate | 67.0 | none | own | 2.0 | skilled | 1.0 | yes | yes |
| 1 | 0<=X<200 | 48.0 | existing paid | radio/tv | 5951.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | none | 2.0 | real estate | 22.0 | none | own | 1.0 | skilled | 1.0 | none | yes |
| 2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | none | 3.0 | real estate | 49.0 | none | own | 1.0 | unskilled resident | 2.0 | none | yes |


In [7]:
y.head(3)

0    good
1     bad
2    good
Name: class, dtype: category
Categories (2, object): ['good', 'bad']

In [8]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   checking_status         1000 non-null   category
 1   duration                1000 non-null   float64 
 2   credit_history          1000 non-null   category
 3   purpose                 1000 non-null   category
 4   credit_amount           1000 non-null   float64 
 5   savings_status          1000 non-null   category
 6   employment              1000 non-null   category
 7   installment_commitment  1000 non-null   float64 
 8   personal_status         1000 non-null   category
 9   other_parties           1000 non-null   category
 10  residence_since         1000 non-null   float64 
 11  property_magnitude      1000 non-null   category
 12  age                     1000 non-null   float64 
 13  other_payment_plans     1000 non-null   category
 14  housing                 1

This is a "nice" dataset: all the *object* fields have already been converted to *category*, and there are no null records, so the only required preprocessing step is to encode the labels.

In [9]:
class_map = {"good":1, "bad":0}
y = pd.Series(y).map(class_map)
Counter(y)

Counter({0: 300, 1: 700})

Train and Test split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

In [11]:
SCALE_POS = pd.Series(y_train).value_counts()[0]/pd.Series(y_train).value_counts()[1]
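The ratio above follows the common heuristic for XGBoost's `scale_pos_weight`: the number of negative samples divided by the number of positive samples. Below is a minimal sketch of that calculation in plain Python. It uses the full-dataset class counts printed earlier rather than the training split, and the helper name `neg_pos_ratio` is hypothetical.

```python
from collections import Counter


def neg_pos_ratio(labels):
    """Return count(label == 0) / count(label == 1)."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive samples in labels")
    return counts[0] / counts[1]


# Class counts for the full dataset: 300 "bad" (0) and 700 "good" (1)
labels = [0] * 300 + [1] * 700
ratio = neg_pos_ratio(labels)
print(f"{ratio:.3f}")  # 0.429
```

Because the positive class ("good") is the majority here, the ratio is below 1, which reduces the weight of positive samples instead of increasing it.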

In [12]:
CAT_FEATURES = list(X_train.select_dtypes("category").columns)
NUM_FEATURES = list(X_train.select_dtypes("float64").columns)
print(CAT_FEATURES, '\n', NUM_FEATURES)

['checking_status', 'credit_history', 'purpose', 'savings_status', 'employment', 'personal_status', 'other_parties', 'property_magnitude', 'other_payment_plans', 'housing', 'job', 'own_telephone', 'foreign_worker'] 
 ['duration', 'credit_amount', 'installment_commitment', 'residence_since', 'age', 'existing_credits', 'num_dependents']


Preprocessing pipeline

In [13]:
numeric_transformer = KNNImputer()

categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, NUM_FEATURES),
        ("cat", categorical_transformer, CAT_FEATURES),
    ]
)
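As a rough illustration of what `handle_unknown="ignore"` means for the categorical branch, the sketch below one-hot encodes a value in plain Python and maps any category not seen during fitting to an all-zero vector. The helper names and the sample values are hypothetical, and this is not how scikit-learn implements the encoder internally.

```python
def fit_categories(values):
    """Collect the sorted set of categories seen during fitting."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value; a category unseen at fit time yields all zeros."""
    return [1 if value == category else 0 for category in categories]


# Illustrative training values for a single categorical column
train_values = ["own", "rent", "for free", "own"]
categories = fit_categories(train_values)  # ['for free', 'own', 'rent']

print(one_hot("rent", categories))     # [0, 0, 1]
print(one_hot("unknown", categories))  # [0, 0, 0]
```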

In [14]:
X_train_p = preprocessor.fit_transform(X_train)
X_test_p = preprocessor.transform(X_test)

For this demo, the ML model to be optimized is an **XGBClassifier**

In [15]:
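# eval_set is an argument of fit(), not of the constructor, and early stopping needs such a
# validation set; using the test set for it would also leak test data into training,
# so both options are left out here
xgb = XGBClassifier(scale_pos_weight=SCALE_POS, random_state=1234)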

### RandomizedSearchCV

Defining the parameters

In [16]:
params = {"subsample":[0.3, 0.5, 0.75, 1],
          "n_estimators": [50, 100, 250, 750, 1000],
          "learning_rate": [0.1, 0.2, 0.25, 0.3, 0.4, 0.5],
          "max_depth": [3, 5, 7]}

In [17]:
cv = StratifiedKFold(3)
grids = RandomizedSearchCV(estimator=xgb, param_distributions=params, scoring="precision", cv=cv, 
                           verbose=1, random_state=1234, n_iter=30, n_jobs=4)
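To put `n_iter=30` in perspective, the sketch below enumerates every combination in the discrete grid defined above and draws 30 of them at random without replacement. It uses only the standard library and is meant as an illustration of the idea, not as a reproduction of scikit-learn's sampling code, so the sampled candidates will not match the ones `RandomizedSearchCV` evaluates.

```python
import itertools
import random

params = {
    "subsample": [0.3, 0.5, 0.75, 1],
    "n_estimators": [50, 100, 250, 750, 1000],
    "learning_rate": [0.1, 0.2, 0.25, 0.3, 0.4, 0.5],
    "max_depth": [3, 5, 7],
}

# Every combination an exhaustive grid search would have to evaluate
names = list(params)
grid = [dict(zip(names, values)) for values in itertools.product(*params.values())]

# Random search evaluates only a sample of them (30 here, mirroring n_iter=30)
rng = random.Random(1234)
candidates = rng.sample(grid, k=30)

print(len(grid), len(candidates))                      # 360 30
print(f"coverage: {len(candidates) / len(grid):.1%}")  # coverage: 8.3%
```

With 360 possible combinations, 30 iterations cover roughly 8% of the grid, which is why the search finishes much faster than an exhaustive one.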

In [18]:
t0 = time.time()
grids.fit(X_train_p, y_train)
t1 = time.time()
time_diff = (t1-t0)/60
print(f"Time in minutes: {time_diff:.2f}")

Fitting 3 folds for each of 30 candidates, totalling 90 fits
Time in minutes: 0.91


In [19]:
grids.best_params_

{'learning_rate': 0.4, 'max_depth': 3, 'n_estimators': 750, 'subsample': 0.75}

In [20]:
grids_pred = grids.best_estimator_.predict(X_test_p)
print(f"Precision:{precision_score(y_test, grids_pred):.2f}")

Precision:0.77


In [21]:
grid_cv_score = cross_val_score(grids.best_estimator_, X=X_test_p, y=y_test, scoring="precision").mean()
print(f"RandomizedSearch Mean CV Score: {grid_cv_score:.2f}")

RandomizedSearch Mean CV Score: 0.78


The search took around 1 minute. The original dataset contains 1000 observations and 20 features, and the model was trained on 70% of these observations.

### BayesSearchCV

One of the main advantages of **BayesSearchCV** is the wider search space: instead of a list of discrete values, you specify a lower bound, an upper bound and a prior distribution for each hyperparameter, which can be "uniform" or "log-uniform".
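To make the difference between the two priors concrete, here is a small standard-library sketch that samples from each one. The range 0.001 to 1.0 and the helper names are assumptions chosen to make the effect visible; they are not the ranges used in this notebook, and the code does not call scikit-optimize.

```python
import math
import random


def sample_uniform(rng, low, high):
    """Every value between low and high is equally likely."""
    return rng.uniform(low, high)


def sample_log_uniform(rng, low, high):
    """Draw uniformly in log space, then map back to the original scale."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(1234)
low, high = 0.001, 1.0  # hypothetical range spanning three orders of magnitude
uniform_draws = [sample_uniform(rng, low, high) for _ in range(10_000)]
log_draws = [sample_log_uniform(rng, low, high) for _ in range(10_000)]

# Share of draws that fall below 0.1 under each prior
share_uniform = sum(x < 0.1 for x in uniform_draws) / len(uniform_draws)
share_log = sum(x < 0.1 for x in log_draws) / len(log_draws)
print(f"uniform: {share_uniform:.2f}, log-uniform: {share_log:.2f}")
```

Under the uniform prior only about a tenth of the draws land below 0.1, while the log-uniform prior places about two thirds of them there. That is why "log-uniform" is usually preferred for parameters that span several orders of magnitude, whereas "uniform" is a reasonable choice for narrow ranges such as the ones used in this demo.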

In [22]:
bayes_params = {"subsample":Real(0.3, 1, prior="uniform"),
                "n_estimators": Integer(50, 1000, prior="uniform"),
                "learning_rate": Real(0.1, 0.5, prior="uniform"),
                "max_depth": Integer(3, 7, prior="uniform")}

In [23]:
bayes = BayesSearchCV(estimator=xgb, search_spaces=bayes_params, scoring="precision", cv=cv, 
                      verbose=1, random_state=1234, n_iter=30, n_jobs=4)

In [24]:
t0 = time.time()
bayes.fit(X_train_p, y_train)
t1 = time.time()
time_diff = (t1-t0)/60
print(f"Time in minutes: {time_diff:.2f}")

Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fi

In [25]:
bayes.best_params_

OrderedDict([('learning_rate', 0.2564215253698481),
             ('max_depth', 5),
             ('n_estimators', 454),
             ('subsample', 0.3552344572320146)])

In [26]:
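bayes_pred = bayes.best_estimator_.predict(X_test_p)
# Score the BayesSearchCV predictions. The recorded output below came from a run that
# scored grids_pred instead, so it repeats the RandomizedSearchCV value.
print(f"Precision:{precision_score(y_test, bayes_pred):.2f}")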

Precision:0.77


In [27]:
bayes_cv_score = cross_val_score(bayes.best_estimator_, X=X_test_p, y=y_test, scoring="precision").mean()
print(f"BayesSearch Mean CV Score: {bayes_cv_score:.2f}")

BayesSearch Mean CV Score: 0.80


The search took around 1.5 minutes with a wider search space, the same hyperparameters and the same number of iterations. The performance metrics are very similar, with a small difference (0.02) in the mean CV precision score (0.80 versus 0.78). For this demo, "precision" was chosen as the key metric because, according to the dataset documentation, False Positives carry a higher cost.
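As a reminder of what the metric measures, precision is TP / (TP + FP), so every False Positive (here, a "bad" credit predicted as "good") lowers it directly. The sketch below computes it by hand on a small made-up example; the labels are illustrative and are not taken from this dataset.

```python
def precision(y_true, y_pred, positive=1):
    """Compute TP / (TP + FP) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    if tp + fp == 0:
        return 0.0  # no positive predictions at all
    return tp / (tp + fp)


# Made-up labels: 1 = "good" credit, 0 = "bad" credit
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

# 4 true positives and 1 false positive
print(precision(y_true, y_pred))  # 0.8
```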

### Main References:


*   https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html
*   https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
*   https://www.openml.org/search?type=data&sort=runs&status=active&id=31
