# Introduction

During my Python learning journey, one thing really delights me is how easy it is to use Object Oriented Programming to modulize code, make it easy to test, extend, reuse and maintain. Below is a toy example where I created one classifier class to train and tune three classification models (svm, RandomForest & XGBoost).   

In [1]:
import numpy as np
import pandas as pd
from hyperopt import hp, fmin, space_eval, tpe, Trials
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

## Define the class for multiple models

In [2]:
# define the class
class Classifier:
    def __init__(self, X, y, model_type, space, n_splits, seed):
        self.X = X
        self.y = y
        self.n_splits = n_splits
        self.seed = seed
        self.space = space
        self.model_type = model_type

    def objective(self, space):
        if self.model_type == 'svm':
             model = SVC(**space)
        elif self.model_type == 'randomforest':
            model = RandomForestClassifier(**space)
        elif self.model_type == 'xgboost':
            model = XGBClassifier(**space)
        else:
            print('model not available yet')        

        cv_scores = cross_val_score(model, self.X, self.y, cv=self.n_splits, scoring='roc_auc', n_jobs=-1)
        return 1 - cv_scores.mean()
    
    def optimize(self, max_evals=10):
        best = fmin(fn=self.objective, space=self.space, algo=tpe.suggest, trials = trials, 
                    max_evals=max_evals, rstate=np.random.default_rng(self.seed))
        return best    


## Define the hyperparameter search space for each model

In [3]:
# set up space for hyperparameter tuning
xgb_params = {
    'max_depth':         hp.choice('max_depth',          np.arange(1, 4, 1, dtype=int)),
    'min_child_weight':  hp.quniform('min_child_weight', 1, 7, 1),
    'gamma':             hp.uniform('gamma', 0, 0.4),
    'subsample':         hp.choice('subsample', [0.6, 0.8]),
    'colsample_bytree':  hp.choice('colsample_bytree', [0.6, 0.8]),
    'learning_rate':     hp.uniform('learning_rate', 0.01, 0.2),
    'n_estimators':      hp.choice('n_estimators',       np.arange(50, 200, 50, dtype =int)),
    'scale_pos_weight':  hp.quniform('scale_pos_weight', 1, 16, 1),
    'reg_alpha':         hp.uniform('reg_alpha', 0, 0.4),
    'reg_lambda':        hp.uniform('reg_lambda', 0, 0.4),
}

rf_params = {
    "n_estimators":      hp.choice('n_estimators',        np.arange(50, 200, 50, dtype =int)),
    'max_depth':         hp.choice('max_depth',           np.arange(3, 6, 1, dtype=int)),
    'min_samples_split': hp.choice('min_samples_split',   np.arange(2, 20, 1, dtype=int)),
    'min_samples_leaf':  hp.choice('min_samples_leaf',    np.arange(2, 20, 1, dtype=int)),
    'max_features':      hp.choice('max_features', ['auto', 'sqrt']),
    'criterion':         hp.choice('criterion', ["gini", "entropy"])
}

svm_params = {
    'kernel':            hp.choice('kernel', ['linear', 'rbf']),
    'C':                 hp.choice('C', [0.1, 1, 10])
}


## Prepare the data

In [4]:
# prepare the data (minimal process)
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=1)

In [5]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Train the model

In [6]:
# baseline model
baseline_model = RandomForestClassifier()
baseline_model.fit(X_train, y_train)
y_pred = baseline_model.predict(X_test)
baseline_auc = roc_auc_score(y_pred, y_test)
print('baseline loss:', 1 - baseline_auc)

baseline loss: 0.046052631578947456


We can now play with the Classifier defined above. Train and tune any of the three classification algorithms and see which one performs the best. 

In [7]:
%%time 
model = Classifier(X_train, y_train, model_type = 'randomforest', space = rf_params, n_splits = 5, seed=123)
trials = Trials()
best = model.optimize(max_evals=50)

100%|██████████████████████████████████████████████| 50/50 [00:30<00:00,  1.65trial/s, best loss: 0.008462332301341524]
CPU times: total: 2.41 s
Wall time: 30.3 s


## Discussions

The tuned RandomForest model has much lower loss than the baseline one. It is super easy to define another model to check out is performance as well. Please see below, the tuned svm classifier performed even better.

In [8]:
%%time 
model = Classifier(X_train, y_train, model_type = 'svm', space = svm_params, n_splits = 5, seed=123)
trials = Trials()
best = model.optimize(max_evals=50)

100%|█████████████████████████████████████████████| 50/50 [00:01<00:00, 25.83trial/s, best loss: 0.0066047471620226395]
CPU times: total: 1.41 s
Wall time: 1.95 s


### Some final notes
If none of the hyperparamters across models have the same name (say if we only have svm and randomforest to test), we could create a params dictionary and just pass the whole dictionary when defining the model later (demo code below). 

In [None]:
"""
space = dict()
space['svm_params'] = svm_params
space['rf_params']  = rf_params

# in objective()
model = SVC(**space['svm_params'])
model = RandomForestClassifier(**space['rf_params'])

# when defining the model
model = Classifier(X_train, y_train, model_type= 'randomforest', space=params, n_splits=5, seed=123)

"""