<font  size="5" color="green"><strong>Using Optuna to tune hyperparameters of a sklearn pipeline of transforms with a final estimator</strong></font>

A skill-building tutorial by Marcus Mariano, based on 'Using Optuna to tune hyperparameters of a sklearn pipeline of transforms with a final estimator' by Sandra Prusaka.

**For more information about Marcus Mariano: [Web site](https://marcusmariano.github.io/mmariano/)**  

**Using Optuna to tune hyperparameters of a sklearn pipeline of transforms with a final estimator: Sandra Prusaka [Kaggle](https://www.kaggle.com/general/271613)**


## Import Packages

In [2]:
import pandas as pd
import numpy as np

from tqdm.notebook import tqdm

from matplotlib import pyplot as plt
import seaborn as sns

sns.set(style="darkgrid", color_codes=True)
%matplotlib inline

In [3]:
import warnings

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

## Set parameters

In [None]:
N_THREADS = 16 # threads cnt for lgbm and linear models
N_FOLDS = 5 # folds cnt for AutoML
N_SPLITS = 3 # splits KFolds
N_TRIALS = 30 # number of Optuna trials
N_JOBS = -1 # -1 means using all processors
SEED = 0 # fixed random state for various reasons
TEST_SIZE = 0.3 # Test size for metric check
VERBOSE = 1
EPOCHS = 50
TIMEOUT = 600 # Time in seconds for automl run, 600 seconds = 10 minutes
RAM = 16 #  Number of RAM limit
CPU_LIMIT = 8 # Number of CPU limit
TARGET_NAME = 'Class'

## Load data

In [22]:
from sklearn.datasets import load_breast_cancer

# -- Get the dataset
X, y = load_breast_cancer(return_X_y=True)

X.shape, y.shape

((569, 30), (569,))

In [23]:
import optuna

from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.decomposition import PCA

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

from sklearn.neighbors import KNeighborsClassifier


# -- Define the objective function
def objective(trial):
    
    N_SPLITS = 5
  
    kf = KFold(n_splits=N_SPLITS, 
                shuffle=True, 
                random_state=SEED) 
    
    # -- Instantiate scaler
    # (a) List scalers to choose from
    scalers = trial.suggest_categorical("scalers", ['minmax', 'standard', 'robust'])

    # (b) Define your scalers
    if scalers == "minmax":
        scaler = MinMaxScaler()
    elif scalers == "standard":
        scaler = StandardScaler()
    else:
        scaler = RobustScaler()

    # -- Instantiate dimensionality reduction
    # (a) List all dimensionality reduction options
    dim_red = trial.suggest_categorical("dim_red", ["PCA", None])

    # (b) Define the PCA algorithm and its hyperparameters
    if dim_red == "PCA":
        pca_n_components = trial.suggest_int("pca_n_components", 2, 30) # suggest an integer from 2 to 30
        dimen_red_algorithm = PCA(n_components = pca_n_components)
    # (c) No dimensionality reduction option
    else:
        dimen_red_algorithm = 'passthrough'

    # -- Instantiate estimator model
    params_knn = {
                "n_neighbors" : trial.suggest_int("knn_n_neighbors", 1, 19, step = 2), # suggest an odd integer from 1 to 19 (step 2)
                "metric" : trial.suggest_categorical("knn_metric", ['euclidean', 'manhattan', 'minkowski']),
                "weights": trial.suggest_categorical("knn_weights", ['uniform', 'distance'])
    }

    estimator = KNeighborsClassifier(**params_knn)

    # -- Make a pipeline
    pipeline = make_pipeline(scaler, 
                             dimen_red_algorithm,
                             estimator)

    # -- Evaluate the score by cross-validation       
    cross_val = cross_val_score(pipeline, 
                                X, 
                                y, 
                                cv=kf, 
                                n_jobs=N_JOBS,
                                scoring = 'f1', 
                                verbose = VERBOSE).mean()    
    
    return cross_val



from timeit import default_timer as timer
from datetime import timedelta, datetime


# Start measuring elapsed time
start = timer()
print(f"Time Started: {datetime.now().strftime('%H:%M:%S')}")


import os

db_path = "db/"
os.makedirs(db_path, exist_ok=True)  # SQLite cannot create the file if the folder is missing

# direction is either 'maximize' or 'minimize'
study = optuna.create_study(direction='maximize',
                            study_name = "scaler_pca_knn",
                            storage = f"sqlite:///{db_path}scaler_pca_knn.db",
                            load_if_exists = True)

# Tip: to limit RAM use, parallelize either the model/cross-validation or the study, not both
study.optimize(objective, 
               n_trials = N_TRIALS, 
               timeout = None,
               n_jobs=N_JOBS)


# Stop measuring elapsed time
end = timer()
print('\n')
print(f"Time Elapsed: {timedelta(seconds = end - start)}")


# print the best performing pipeline
print(f"\nBest score: {study.best_trial.value:0.4f}" )
print("Best parameter:")
print(study.best_trial.params)

[I 2022-07-30 14:28:18,243] Using an existing study with name 'scaler_pca_knn' instead of creating a new one.


Time Started: 14:28:18


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.

[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.0s finished
[I 2022-07-30 14:28:27,348] Trial 32 finished with value: 0.9697613914895882 and parameters: {'scalers': 'standard', 'dim_red': None, 'knn_n_neighbors': 7, 'knn_metric': 'minkowski', 'knn_weights': 'distance'}. Best is trial 35 with value: 0.9780030323616973.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.0s finished
[I 2022-07-30 14:28:27,536] Trial 34 finished with value: 0.9767613972734122 and parameters: {'scalers': 'minmax', 'dim_red': None, 'knn_n_neighbors': 7, 'knn_metric': 'minkowski', 'knn_weights': 'uniform'}. Best is trial 35 with value: 0.9780030323616973.
[I 2022-07-30 14:28:27,552] Trial 38 finished with value: 0.9767613972734122 and 



Time Elapsed: 0:00:09.670482

Best score: 0.9780
Best parameter:
{'dim_red': None, 'knn_metric': 'minkowski', 'knn_n_neighbors': 7, 'knn_weights': 'distance', 'scalers': 'minmax'}
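
The best parameters printed above are only a dictionary; to use them, the pipeline has to be rebuilt and fitted again. The sketch below shows one possible way to do that. The helper `build_knn_pipeline` and the `SCALERS` lookup are hypothetical names introduced for illustration, and the dictionary is copied from the output above rather than read from the study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical lookup that mirrors the "scalers" choices in the objective
SCALERS = {"minmax": MinMaxScaler, "standard": StandardScaler, "robust": RobustScaler}


def build_knn_pipeline(params):
    """Hypothetical helper: rebuild the scaler/PCA/KNN pipeline from tuned params."""
    scaler = SCALERS[params["scalers"]]()

    # "pca_n_components" only exists when the trial picked PCA
    if params["dim_red"] == "PCA":
        reducer = PCA(n_components=params["pca_n_components"])
    else:
        reducer = "passthrough"

    knn = KNeighborsClassifier(n_neighbors=params["knn_n_neighbors"],
                               metric=params["knn_metric"],
                               weights=params["knn_weights"])
    return make_pipeline(scaler, reducer, knn)


# Copied from the "Best parameter" output above (normally: study.best_trial.params)
best_params = {"dim_red": None, "knn_metric": "minkowski", "knn_n_neighbors": 7,
               "knn_weights": "distance", "scalers": "minmax"}

X, y = load_breast_cancer(return_X_y=True)

best_pipeline = build_knn_pipeline(best_params)
best_pipeline.fit(X, y)
print(best_pipeline.predict(X[:5]))
```

Keeping the rebuild logic in one function means the same code path can be reused for any trial, not just the best one.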


# Optuna Pipeline with CountVectorizer and TF-IDF

In [6]:
Path = ""

data = pd.read_csv(Path+"data/train_data.csv", header=None, delimiter="\t", quoting=3)
test = pd.read_csv(Path+"data/test_data.csv", header=None, delimiter="\t", quoting=3)

data.columns = ["Sentiment","Text"]
test.columns = ["Text"]

print(data.shape, test.shape)
data.head()

(7086, 2) (33052, 1)


| | Sentiment | Text |
|---|---|---|
| 0 | 1 | The Da Vinci Code book is just awesome. |
| 1 | 1 | this was the first clive cussler i've ever rea... |
| 2 | 1 | i liked the Da Vinci Code a lot. |
| 3 | 1 | i liked the Da Vinci Code a lot. |
| 4 | 1 | I liked the Da Vinci Code but it ultimatly did... |


## Convert to list

In [7]:
X_list = data["Text"].tolist()
X_test_list = test["Text"].tolist()
y = data["Sentiment"]

print(type(X_list), type(y))
print(len(X_list), len(y))

X_list[0]

<class 'list'> <class 'pandas.core.series.Series'>
7086 7086


'The Da Vinci Code book is just awesome.'

In [19]:
import optuna

from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.decomposition import PCA

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from lightgbm import LGBMClassifier


# -- Define the objective function
def objective(trial):
    
    # KFold
    N_SPLITS = 5  
    kf = KFold(n_splits=N_SPLITS, 
                shuffle=True, 
                random_state=SEED)     

    # -- CountVectorizer
    n_gram = trial.suggest_categorical("ngram_range", [(1, 1), (1, 2), (1, 3)])
    
    count_vect = CountVectorizer(tokenizer = None,
                                 lowercase = False,
                                 ngram_range = n_gram,
                                 stop_words = None)
    
    # -- TfidfTransformer
    tf_idf = trial.suggest_categorical("tfidf", ["TFIDF", None])
    
    if tf_idf == "TFIDF":    
        l_norm = trial.suggest_categorical("norm", ['l1', 'l2'])  
        
        tfidf_algorithm = TfidfTransformer(norm = l_norm)
    else:
        tfidf_algorithm = 'passthrough'
        
        
    # -- Estimator   
    params = {'boosting_type': 'gbdt', 'num_leaves': 133, 'max_depth': 220, 
              'n_estimators': 399, 'colsample_bytree': 0.5392245991058503, 
              'min_child_samples': 76}
    
    # Sample the learning rate on a log scale
    lr = trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True)

    estimator = LGBMClassifier(**params,
                               learning_rate = lr,
                                n_jobs=None,
                                random_state=SEED)
    
    # -- Make a pipeline
    pipeline = make_pipeline(count_vect, 
                             tfidf_algorithm,
                             estimator)
    
    # -- Evaluate the score by cross-validation       
    cross_val = cross_val_score(pipeline, 
                                X_list, 
                                y, 
                                cv=kf, 
                                n_jobs=None,
                                scoring = 'f1', 
                                verbose = VERBOSE).mean()    
    
    return cross_val



from timeit import default_timer as timer
from datetime import timedelta, datetime


# Start measuring elapsed time
start = timer()
print(f"Time Started: {datetime.now().strftime('%H:%M:%S')}")


import os

db_path = "db/"
os.makedirs(db_path, exist_ok=True)  # SQLite cannot create the file if the folder is missing

# direction is either 'maximize' or 'minimize'
study = optuna.create_study(direction='maximize',
                            study_name = "lgbmc_parans_cv_tfidf",
                            storage = f"sqlite:///{db_path}lgbmc_parans_cv_tfidf",
                            load_if_exists = True)

# Tip: to limit RAM use, parallelize either the model/cross-validation or the study, not both
study.optimize(objective, 
               n_trials = N_TRIALS, 
               timeout = None,
               n_jobs = N_JOBS)


# Stop measuring elapsed time
end = timer()
print('\n')
print(f"Time Elapsed: {timedelta(seconds = end - start)}")


# print the best performing pipeline
print(f"\nBest score: {study.best_trial.value:0.4f}" )
print("Best parameter:")
print(study.best_trial.params)

[I 2022-07-30 18:19:42,250] A new study created in RDB with name: lgbmc_parans_cv_tfidf


Time Started: 18:19:42


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[W 2022-07-30 18:19:43,796] Trial 2 failed, because the objective function returned nan.
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBacke

[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   43.3s finished
[I 2022-07-30 18:20:38,341] Trial 20 finished with value: 0.9610874656917161 and parameters: {'ngram_range': (1, 2), 'tfidf': 'TFIDF', 'norm': 'l2', 'learning_rate': 0.0011652405329919664}. Best is trial 10 with value: 0.9826785206405054.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   41.7s finished
[I 2022-07-30 18:20:38,875] Trial 26 finished with value: 0.9819832729405633 and parameters: {'ngram_range': (1, 1), 'tfidf': 'TFIDF', 'norm': 'l1', 'learning_rate': 0.022260885672001112}. Best is trial 10 with value: 0.9826785206405054.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   40.0s finished
[I 2022-07-30 18:20:38,940] Trial 28 finished with value: 0.9652414850248991 and parameters: {'ngram_range': (1, 3), 'tfidf': 'TFIDF', 'norm': 'l1', 'learning_rate': 0.0038715361762780435}. Best is trial 10 with value: 0.9826785206405054.
[Parallel(n_jobs=1)]: Done   5 out 



Time Elapsed: 0:00:57.308550

Best score: 0.9829
Best parameter:
{'learning_rate': 0.023599980747524948, 'ngram_range': [1, 2], 'norm': 'l1', 'tfidf': 'TFIDF'}



    Time Elapsed: 0:01:00.954454

    Best score: -0.0487
    Best parameter:
    {'learning_rate': 0.026624489179423422, 'ngram_range': [1, 3], 'norm': 'l1', 'tfidf': 'TFIDF'}
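
Note that the stored best parameters show `ngram_range` as a list (`[1, 2]`), while the trial logs show a tuple (`(1, 2)`); the value appears to come back from the database as a list. The sketch below, under that assumption, converts it back to a tuple before rebuilding the text-feature steps. The helper `build_text_features` is a hypothetical name, the estimator is left out to keep the example small, and the two sentences come from the `data.head()` output above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline


def build_text_features(params):
    """Hypothetical helper: rebuild only the CountVectorizer/TF-IDF steps."""
    # Convert the stored list back to the tuple the trials were run with
    ngram_range = tuple(params["ngram_range"])
    count_vect = CountVectorizer(lowercase=False, ngram_range=ngram_range)

    # "norm" only exists when the trial picked TF-IDF weighting
    if params.get("tfidf") == "TFIDF":
        weighting = TfidfTransformer(norm=params["norm"])
    else:
        weighting = "passthrough"

    return make_pipeline(count_vect, weighting)


# Copied from the "Best parameter" output above (normally: study.best_trial.params)
best_params = {"learning_rate": 0.023599980747524948, "ngram_range": [1, 2],
               "norm": "l1", "tfidf": "TFIDF"}

# Two sentences taken from the training data preview
corpus = ["The Da Vinci Code book is just awesome.",
          "i liked the Da Vinci Code a lot."]

features = build_text_features(best_params)
matrix = features.fit_transform(corpus)
vocabulary = features.named_steps["countvectorizer"].vocabulary_

print(matrix.shape)
print(sorted(vocabulary))
```

With `norm='l1'`, each row of the resulting matrix sums to 1, and with `ngram_range=(1, 2)` the vocabulary holds both single words and word pairs such as `Da Vinci`.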