# Hyperparameter optimisation

Machine learning models have two different types of parameters:
Hyperparameters: parameters that the user sets before training starts (e.g. the number of estimators in a Random Forest).
Model parameters: parameters that are learned during training (e.g. the weights of a neural network or a linear regression).
Model parameters define how input data is turned into the desired output and are learned at training time. Hyperparameters, by contrast, determine how the model is structured in the first place.
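The distinction can be illustrated with a minimal sketch, assuming a toy one-dimensional linear model fitted by gradient descent. The function name `train_linear_model` and the data are hypothetical and not part of the notebook: `learning_rate` and `n_epochs` are chosen by the user before training, while `w` and `b` are learned from the data.

```python
def train_linear_model(xs, ys, learning_rate=0.05, n_epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error.

    learning_rate and n_epochs are hyperparameters (set before training);
    w and b are model parameters (learned during training).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (an assumption for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w, b = train_linear_model(xs, ys, learning_rate=0.05, n_epochs=2000)
print(round(w, 3), round(b, 3))  # learned parameters, close to 2 and 1
```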
Tuning a machine learning model is a type of optimization problem. We have a set of hyperparameters and aim to find the combination of values that minimises (e.g. loss) or maximises (e.g. accuracy) a function (Figure 1).
This is particularly important when comparing how different machine learning models perform on a dataset. It would be unfair, for example, to compare an SVM model with the best hyperparameters against a Random Forest model that has not been optimized.
In this post, the following approaches to Hyperparameter optimization will be explained:
Manual Search
Random Search
Grid Search
Automated Hyperparameter Tuning (Bayesian Optimization, Genetic Algorithms)
Artificial Neural Networks (ANNs) Tuning

Figure 1: ML Optimization Workflow [1]
To demonstrate how to perform hyperparameter optimization in Python, I decided to perform a complete data analysis of the Credit Card Fraud Detection Kaggle dataset. Our objective in this article is to correctly classify credit card transactions as fraudulent or genuine (binary classification). This dataset was anonymized before being distributed; therefore, the meaning of most of the features has not been disclosed.
In this case, I decided to use just a subset of the dataset, in order to speed up training and to achieve a perfect balance between the two classes. Additionally, only a limited number of features has been used, to make the optimization tasks more challenging. The final dataset is shown in the figure below (Figure 2).


In statistics, a hyperparameter is a parameter of a prior distribution; it captures the prior belief before data is observed.
In machine learning, hyperparameters need to be set before training a model.
Model parameters vs Hyperparameters
Model parameters are the properties that the classifier or other ML model learns on its own from the training data during training. For example:
Weights and Biases
Split points in Decision Tree

Figure 2: Hyperparameters vs model parameters → Source
Model hyperparameters are the properties that govern the entire training process. The variables below are usually configured before training a model.
Learning Rate
Number of Epochs
Hidden Layers
Hidden Units
Activation Functions
Why are Hyperparameters essential?
Hyperparameters are important because they directly control the behaviour of the training algorithm and have a significant impact on the performance of the model being trained.
“A good choice of hyperparameters can really make an algorithm shine”.
Choosing appropriate hyperparameters plays a crucial role in the success of a neural network architecture, since it has a large impact on the learned model. For example, if the learning rate is too low, training converges very slowly and the model may miss important patterns in the data. If it is too high, the updates can overshoot the minimum and training may diverge.
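To make the learning-rate example concrete, here is a small sketch that minimises the toy function f(x) = x² by plain gradient descent. The function, starting point and step counts are assumptions chosen for illustration, not values from the notebook.

```python
def gradient_descent(learning_rate, n_steps=50, x0=5.0):
    """Minimise f(x) = x**2 (gradient 2x) and return the final x."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * 2 * x
    return x

for lr in (0.001, 0.1, 1.1):
    # distance from the optimum (x = 0) after 50 steps
    print(lr, abs(gradient_descent(lr)))
```

With the smallest rate the iterate has barely moved from its starting point, the middle rate ends very close to the optimum, and the largest rate moves further away at every step.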
A good hyperparameter tuning strategy gives two benefits:
The space of possible hyperparameters is searched efficiently.
A large set of tuning experiments is easy to manage.
Hyperparameters Optimisation Techniques
The process of finding the best hyperparameters for a machine learning model is called hyperparameter optimisation.
Common algorithms include:
Grid Search
Random Search
Bayesian Optimisation
Grid Search
Grid search is a traditional technique for tuning hyperparameters. It brute-forces all combinations. As an example, take a grid over two sets of hyperparameter values:
Learning Rate
Number of Layers
Grid search trains the algorithm on every combination of the two sets of hyperparameter values (learning rate and number of layers) and measures performance using cross-validation. Cross-validation gives some assurance that the trained model has captured the patterns in the dataset rather than fitting one particular split. A common choice is k-fold cross-validation, in which every sample is used for both training and validation.
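The procedure can be sketched in plain Python under some simplifying assumptions: a toy linear model stands in for the network, and the grid is over learning rate and number of epochs rather than number of layers. All function names here (`k_fold_indices`, `fit_line`, `cv_mse`, `grid_search`) are hypothetical helpers, not library calls.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def fit_line(xs, ys, learning_rate, n_epochs):
    """Toy model: fit y = w * x + b by gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(n_epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        w -= learning_rate * 2 * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * 2 * sum(errors) / n
    return w, b

def cv_mse(xs, ys, k, **hyperparams):
    """Mean validation MSE over k folds for one hyperparameter combination."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        w, b = fit_line([xs[i] for i in train_idx],
                        [ys[i] for i in train_idx], **hyperparams)
        scores.append(sum((w * xs[i] + b - ys[i]) ** 2 for i in val_idx) / len(val_idx))
    return sum(scores) / len(scores)

def grid_search(xs, ys, param_grid, k=3):
    """Evaluate every combination in the grid; return the best and all results."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        hyperparams = dict(zip(names, values))
        results.append((cv_mse(xs, ys, k, **hyperparams), hyperparams))
    return min(results, key=lambda r: r[0]), results

xs = [i / 10 for i in range(30)]
ys = [2 * x + 1 for x in xs]
param_grid = {"learning_rate": [0.001, 0.01, 0.1], "n_epochs": [10, 100]}
(best_score, best_params), all_results = grid_search(xs, ys, param_grid)
print(best_params, best_score)
```

Every one of the 3 × 2 = 6 combinations is trained k times, once per fold, which is where the cost of grid search comes from.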

Figure 3: Grid Search → Source
Grid search is simple to use, but it scales poorly when the hyperparameter space is high-dimensional, a problem known as the curse of dimensionality.
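A quick way to see the growth is to count the combinations. The sketch below assumes five candidate values per hyperparameter for the first loop; the last line counts the grid used later in this notebook (three batch sizes, three dropout rates, two kernel sizes, two activations). `grid_size` is a hypothetical helper.

```python
from math import prod

def grid_size(values_per_hyperparameter):
    """Number of models a full grid search has to train (before cross-validation)."""
    return prod(values_per_hyperparameter)

for n_hyperparameters in (2, 4, 8):
    print(n_hyperparameters, grid_size([5] * n_hyperparameters))

# Grid from this notebook: batch_size x dropout x kernel_size x activation
print(grid_size([3, 3, 2, 2]))
```

The count grows exponentially with the number of hyperparameters, and each combination is multiplied again by the number of cross-validation folds.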





<div class="row">
  <div class="column">
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.    Maecenas quis nunc pulvinar urna faucibus tincidunt ut vestibulum ligula. Sed placerat sollicitudin erat, quis dapibus nibh tempor non. 
      <br/>
    
Id | Syntax      | Description 
--|:---------:|:-----------:
1|Header      | Something  here
2|More here   | Text
    
  </div>
    
  <div class="column">
    
| Label |    Cloth    |
|-------|-------------|
|   0   | T-shirt/top |
|   1   |  Trouser    |
|   2   |  Pullover   |
|   3   |   Dress     |
|   4   |    Coat     |
|   5   |   Sandal    |
|   6   |    Shirt    |
|   7   |   Sneaker   |
|   8   |     Bag     |
|   9   | Ankle boot  |
    
  </div>
</div>

<div class="row">
  <div class="column">
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.    Maecenas quis nunc pulvinar urna faucibus tincidunt ut vestibulum ligula. Sed placerat sollicitudin erat, quis dapibus nibh tempor non. 
      <br/>
    
Id | Syntax      | Description 
--|:---------:|:-----------:
1|Header      | Something  here
2|More here   | Text
    
  </div>
    
  <div class="column">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas quis nunc pulvinar urna faucibus tincidunt ut vestibulum ligula. Sed placerat sollicitudin erat, quis dapibus nibh tempor non. 
  <br/>
    
  $$
  \begin{align}
  {x} & = \sigma(y-x) \tag{3-1}\\
  {y} & = \rho x - y - xz \tag{3-2}\\
  {x+y+z} & = -\beta z + xy \tag{3-3}
  \end{align}
  $$
    
  </div>
</div>

%%html
<style>
    @media print { 
        * {
             box-sizing: border-box;
          }
        .row {
             display: flex;
         }
       /* Create two equal columns that sits next to each other */
       .column {
          flex: 50%;
          padding: 10px;
  
        }
        
        div.input {
          display: none;
          padding: 0;
        }
        div.output_prompt {
          display: none;
          padding: 0;
        }
        div.text_cell_render {
          padding: 1pt;
        }
        div#notebook p,
        div#notebook,
        div#notebook li,
        p {
          font-size: 10pt;
          line-height: 115%;
          margin: 0;
        }
        .rendered_html h1,
        .rendered_html h1:first-child {
          font-size: 10pt;
          margin: 3pt 0;
        }
       .rendered_html h2,
       .rendered_html h2:first-child {
          font-size: 10pt;
          margin: 3pt 0;
       }
       .rendered_html h3,
       .rendered_html h3:first-child {
         font-size: 10pt;
         margin: 3pt 0;
       }
       div.output_subarea {
         padding: 0;
       }
       div.input_prompt{
         display: none;
         padding: 0;
      }
}
</style>

In [1]:
from keras.datasets import fashion_mnist
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Dropout, convolutional, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.utils import to_categorical
from keras.optimizers import Adam, Nadam, RMSprop
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer, accuracy_score
from sklearn.model_selection import RandomizedSearchCV, KFold
from keras.wrappers.scikit_learn import KerasClassifier
import pandas as pd
import seaborn as sns
import talos as ta
import warnings
warnings.filterwarnings("default", "", DeprecationWarning, "", 0)

# load training data and do basic data normalization
(X_train, y_train), (X_val, y_val) = fashion_mnist.load_data()

# Reshape the images to (n_samples, 28, 28, 1) and scale pixel values to [0, 1]
X_train = X_train.reshape(-1,28,28,1) / 255
X_val = X_val.reshape(-1,28,28,1) / 255

# One-hot encode the labels so that each class corresponds to one unit of the output vector.
y_train = to_categorical(y_train)
y_val = to_categorical(y_val)

X = np.concatenate((X_train, X_val), axis=0)
y = np.concatenate((y_train, y_val), axis=0)

Using TensorFlow backend.


In [2]:
# model = Sequential()

# model.add(convolutional.Conv2D(32, (3, 3), input_shape=(28, 28, 1), activation='relu'))
# model.add(convolutional.MaxPooling2D())
# model.add(BatchNormalization())
# model.add(Dropout(0.25))

# model.add(convolutional.Conv2D(32, (3, 3), input_shape=(28, 28, 1), activation='relu'))
# model.add(convolutional.MaxPooling2D())
# model.add(BatchNormalization())
# model.add(Dropout(0.25))

# model.add(Flatten())
# model.add(Dense(y_train.shape[1], activation='softmax'))

# model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [3]:
# history = model.fit(X_train, y_train, epochs=15, batch_size=10, verbose=1, validation_data=(X_val, y_val))

In [4]:
# # Train/val accuracy/loss
# _, train_acc = model.evaluate(X_train, y_train, verbose=0)
# _, test_acc = model.evaluate(X_val, y_val, verbose=0)
# print('Train: %.3f, Validation: %.3f' % (train_acc, test_acc))

In [5]:
# loss = history.history['loss']
# val_loss = history.history['val_loss']
# acc = history.history['accuracy']
# val_acc = history.history['val_accuracy']

# plt.figure(figsize=(20, 7))
# plt.subplot(1,2,1)
# train_loss_plot, = plt.plot(range(1, len(loss)+1), loss, label='Train Loss')
# val_loss_plot, = plt.plot(range(1, len(val_loss)+1), val_loss, label='Validation Loss')
# _ = plt.legend(handles=[train_loss_plot, val_loss_plot])

# plt.subplot(1,2,2)
# train_acc_plot, = plt.plot(range(1, len(acc)+1), acc, label='Training accuracy')
# val_acc_plot, = plt.plot(range(1, len(val_acc)+1), val_acc, label='Validation accuracy')
# _ = plt.legend(handles=[train_acc_plot, val_acc_plot])

In [6]:
# label = {'0':'T-shirt/top', '1':'Trouser', '2':'Pullover',
#          '3':'Dress', '4':'Coat', '5':'Sandal',
#          '6':'Shirt', '7':'Sneaker', '8':'Bag', '9':'Ankle boot'}

# gt = [np.argmax(k, axis=None, out=None) for k in y_val]
# pred = [np.argmax(k, axis=None, out=None) for k in model.predict(X_val, verbose=0)]
# res = confusion_matrix(gt, pred)

# confusion = pd.DataFrame(res, index = label.values(), columns = label.values())
# sns.heatmap(confusion, annot=True)

In [7]:
# wrong = np.where(np.subtract(gt,pred)!=0)[0]

# plt.figure(figsize=(20,20))
# for idx, f in enumerate(wrong[:100]):
#     img = np.squeeze(X_val[f]*255)
#     plt.subplot(10,10,idx+1)
#     plt.subplots_adjust(hspace=0.5)
#     plt.imshow(img, cmap='gray', aspect='equal')
#     plt.title('%s - %s'%(label.get(str(gt[f])),label.get(str(pred[f]))))
#     plt.xticks([]), plt.yticks([])

In [8]:
# params = {'batch_size': [10, 50, 100],
#           'dropout': [0, 0.25, 0.5],
#           'kernel_size':[(3,3),(5,5)],
#           'activation':['relu', 'elu']}

In [9]:
# def model_grid_search(X_train, y_train, X_val, y_val, params):
    
#     model = Sequential()
    
#     model.add(convolutional.Conv2D(32, params['kernel_size'], input_shape=(28, 28, 1), activation=params['activation']))
#     model.add(convolutional.MaxPooling2D())
#     model.add(BatchNormalization())
#     model.add(Dropout(params['dropout']))
    
#     model.add(convolutional.Conv2D(32, params['kernel_size'], activation=params['activation']))
#     model.add(convolutional.MaxPooling2D())
#     model.add(BatchNormalization())
#     model.add(Dropout(params['dropout']))

#     model.add(Flatten())
#     model.add(Dense(y_train.shape[1], activation='softmax'))
    
#     model.compile(loss = 'categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
#     out = model.fit(X_train, y_train, epochs=15, batch_size = params['batch_size'],
#                         verbose = 0, validation_data = (X_val, y_val))
#     # talos expects the training history and the model, in that order
#     return out, model

In [10]:
# exp = ta.Scan(X_train, y_train, model = model_grid_search, params = params, experiment_name = 'grid_search')

In [11]:
# exp.data.sort_values('val_accuracy',ascending=False).head(5)

# Random search

In [12]:
# params = {'batch_size': [10, 50, 100],
#           'dropout': [0, 0.25, 0.5],
#           'kernel_size':[(3,3),(5,5)],
#           'activation':['relu', 'elu'],
#           'n_classes':[10]}

# def model_random_search(kernel_size, activation, n_classes, dropout):
#     model = Sequential()
    
#     model.add(convolutional.Conv2D(32, kernel_size, input_shape=(28, 28, 1), activation=activation))
#     model.add(convolutional.MaxPooling2D())
#     model.add(BatchNormalization())
#     model.add(Dropout(dropout))
    
#     model.add(convolutional.Conv2D(32, kernel_size, activation=activation))
#     model.add(convolutional.MaxPooling2D())
#     model.add(BatchNormalization())
#     model.add(Dropout(dropout))

#     model.add(Flatten())
#     model.add(Dense(n_classes, activation='softmax'))
    
#     model.compile(loss = 'categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
#     return model

# model_keras = KerasClassifier(build_fn = model_random_search, epochs=1, verbose=1)
# random_search = RandomizedSearchCV(estimator=model_keras,
#                                    param_distributions=params, 
#                                    verbose=20,
#                                    n_iter=10,
# #                                    scoring = 'accuracy',
#                                    n_jobs=1).fit(X_train, y_train)

# print("Best: %f using %s" % (random_search.best_score_, random_search.best_params_))
# means = random_search.cv_results_['mean_test_score']
# stds = random_search.cv_results_['std_test_score']
# params = random_search.cv_results_['params']
# for mean, stdev, param in zip(means, stds, params):
#     print("%f (%f) with: %r" % (mean, stdev, param))

# Bayesian optimization

Compared to simpler hyperparameter search methods such as grid search and random search, Bayesian optimization builds on Bayesian inference and Gaussian processes to find the maximum of an unknown function in as few iterations as possible. It is particularly suited to optimizing expensive functions, such as hyperparameter search for a deep learning model, and to other situations where the balance between exploration and exploitation matters.

The Bayesian optimization package we are going to use is BayesianOptimization, which can be installed with the following command:

pip install bayesian-optimization

First, we specify the function to be optimized. For hyperparameter search, the function takes a set of hyperparameter values as input and returns the evaluation accuracy to the Bayesian optimizer. Inside the function, a new model is constructed with the specified hyperparameters, trained for a number of epochs and evaluated against a set of metrics. Every newly evaluated accuracy becomes a new observation for the Bayesian optimizer and informs the next hyperparameter values to try.
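One practical detail: the optimizer proposes continuous values (the log further down shows suggestions such as a batch size of 2.732), so the objective function has to turn them into valid settings before building the model. A minimal sketch of such a conversion follows; `to_model_config` is a hypothetical helper, and the choice of rounding (rather than truncating with `int`) and the clipping bounds are assumptions.

```python
def to_model_config(kernel_size, batch_size, dropout):
    """Hypothetical helper: map continuous suggestions to valid model settings."""
    return {
        "kernel_size": int(round(kernel_size)),         # kernel sizes must be integers
        "batch_size": max(1, int(round(batch_size))),   # at least one sample per batch
        "dropout": min(max(float(dropout), 0.0), 0.5),  # keep the rate inside [0, 0.5]
    }

# Values taken from the first row of the optimizer log in this notebook
config = to_model_config(kernel_size=5.479, batch_size=2.732, dropout=0.0998)
print(config)
```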

Let's create a helper function first which builds the model with various parameters.

Bayesian optimization builds a probabilistic model that maps hyperparameters to a probability of a score on the objective function. Unlike random search and Hyperband, it keeps track of its past evaluation results and uses them to build that probability model.




How to Implement Bayesian Optimization from Scratch in Python
by Jason Brownlee on October 9, 2019 in Probability
Last Updated on January 10, 2020

In this tutorial, you will discover how to implement the Bayesian Optimization algorithm for complex optimization problems.

Global optimization is a challenging problem of finding an input that results in the minimum or maximum cost of a given objective function.

Typically, the form of the objective function is complex and intractable to analyze and is often non-convex, nonlinear, high dimension, noisy, and computationally expensive to evaluate.

Bayesian Optimization provides a principled technique based on Bayes Theorem to direct a search of a global optimization problem that is efficient and effective. It works by building a probabilistic model of the objective function, called the surrogate function, that is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function.
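The surrogate-plus-acquisition loop can be sketched from scratch with NumPy. This is an illustration under stated assumptions, not the implementation used by any particular package: the surrogate is a zero-mean Gaussian process with an RBF kernel, the acquisition function is an upper confidence bound, the search space is a one-dimensional grid, and `objective` is a cheap toy function standing in for an expensive model evaluation.

```python
import numpy as np

def objective(x):
    """Toy stand-in for an expensive score to maximise; true optimum at x = 0.7."""
    return -(x - 0.7) ** 2

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential covariance between two 1-D arrays of points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_posterior(x_obs, y_obs, x_query, jitter=1e-4):
    """Posterior mean and standard deviation of a zero-mean, unit-variance GP."""
    k_oo = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    k_qo = rbf_kernel(x_query, x_obs)
    mean = k_qo @ np.linalg.solve(k_oo, y_obs)
    var = 1.0 - np.sum(k_qo * np.linalg.solve(k_oo, k_qo.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def bayesian_optimisation(objective, n_iter=10, kappa=2.0):
    candidates = np.linspace(0.0, 1.0, 201)   # discretised search space
    x_obs = np.array([0.0, 0.5, 1.0])         # initial samples
    y_obs = np.array([objective(x) for x in x_obs])
    for _ in range(n_iter):
        # Surrogate: cheap probabilistic model of the objective
        mean, std = gp_posterior(x_obs, y_obs, candidates)
        # Acquisition: favour high predicted score and high uncertainty
        ucb = mean + kappa * std
        x_next = candidates[int(np.argmax(ucb))]
        # Only the chosen candidate is evaluated on the real objective
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    best = int(np.argmax(y_obs))
    return float(x_obs[best]), float(y_obs[best])

best_x, best_y = bayesian_optimisation(objective)
print(best_x, best_y)
```

The parameter `kappa` controls the trade-off: larger values push the search toward unexplored regions, smaller values toward the current best estimate.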

Bayesian Optimization is often used in applied machine learning to tune the hyperparameters of a given well-performing model on a validation dataset.

After completing this tutorial, you will know:

Global optimization is a challenging problem that involves black box and often non-convex, non-linear, noisy, and computationally expensive objective functions.
Bayesian Optimization provides a probabilistically principled method for global optimization.
How to implement Bayesian Optimization from scratch and how to use open-source implementations.

Let’s get started.

Update Jan/2020: Updated for changes in scikit-learn v0.22 API.

Tutorial Overview
This tutorial is divided into four parts; they are:

Challenge of Function Optimization
What Is Bayesian Optimization
How to Perform Bayesian Optimization
Hyperparameter Tuning With Bayesian Optimization
Challenge of Function Optimization
Global function optimization, or function optimization for short, involves finding the minimum or maximum of an objective function.

Samples are drawn from the domain and evaluated by the objective function to give a score or cost.

Let’s define some common terms:

Sample. One example from the domain, represented as a vector.
Search Space. Extent of the domain from which samples can be drawn.
Objective Function. Function that takes a sample and returns a cost.
Cost. Numeric score for a sample calculated via the objective function.
Samples comprise one or more variables and are generally easy to devise or create. One sample is often defined as a vector of variables with a predefined range in an n-dimensional space. This space must be sampled and explored in order to find the specific combination of variable values that results in the best cost.

The cost often has units that are specific to a given domain. Optimization is often described in terms of minimizing cost, as a maximization problem can easily be transformed into a minimization problem by negating the calculated cost. The minimum or maximum of a function is referred to as an extremum of the function (plural: extrema).
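A short sketch of that transformation, assuming a hypothetical score function `accuracy` that peaks at a dropout of 0.25: a minimiser applied to the negated score finds the maximiser of the score.

```python
def minimise_on_grid(cost, candidates):
    """Return the candidate with the lowest cost."""
    return min(candidates, key=cost)

def accuracy(dropout):
    """Hypothetical score to maximise; peaks at dropout = 0.25."""
    return 0.9 - (dropout - 0.25) ** 2

candidates = [0.0, 0.125, 0.25, 0.375, 0.5]

# Maximising accuracy is the same as minimising its negation
best = minimise_on_grid(lambda d: -accuracy(d), candidates)
print(best)  # 0.25
```

The same idea appears later in this notebook in the other direction: the optimizer there maximises, so the validation loss is turned into a score that increases as the loss decreases.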

The objective function is often easy to specify but can be computationally challenging to calculate or result in a noisy calculation of cost over time. The form of the objective function is unknown and is often highly nonlinear and highly multi-dimensional, as defined by the number of input variables. The function is also probably non-convex. This means that local extrema may or may not be the global extrema (e.g. could be misleading and result in premature convergence), hence the name of the task as global rather than local optimization.

Little is known about the objective function (although it is known whether the minimum or the maximum cost is sought), and as such it is often referred to as a black box function and the search process as black box optimization. Further, the objective function is sometimes called an oracle, given that it can only give answers.

Function optimization is a fundamental part of machine learning. Most machine learning algorithms involve the optimization of parameters (weights, coefficients, etc.) in response to training data. Optimization also refers to the process of finding the best set of hyperparameters that configure the training of a machine learning algorithm. Taking one step higher again, the selection of training data, data preparation, and machine learning algorithms themselves is also a problem of function optimization.

Summary of optimization in machine learning:

Algorithm Training. Optimization of model parameters.
Algorithm Tuning. Optimization of model hyperparameters.
Predictive Modeling. Optimization of data, data preparation, and algorithm selection.
Many methods exist for function optimization, such as randomly sampling the variable search space, called random search, or systematically evaluating samples in a grid across the search space, called grid search.
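Random search can be sketched in a few lines. The search space below mirrors two of the hyperparameters tuned in this notebook, but `toy_score` is a made-up objective standing in for training and evaluating a model, and `random_search` is a hypothetical helper rather than a library function.

```python
import random

def random_search(objective, search_space, n_iter, seed=0):
    """Sample n_iter random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = {name: sampler(rng) for name, sampler in search_space.items()}
        score = objective(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Each entry draws one value: continuous ranges and discrete choices can be mixed
search_space = {
    "dropout": lambda rng: rng.uniform(0.0, 0.5),
    "batch_size": lambda rng: rng.choice([10, 50, 100]),
}

def toy_score(dropout, batch_size):
    """Made-up score to maximise; stands in for validation accuracy."""
    return 1.0 - (dropout - 0.25) ** 2 - 0.001 * abs(batch_size - 50)

best_score, best_params = random_search(toy_score, search_space, n_iter=20)
print(best_score, best_params)
```

Unlike grid search, the budget `n_iter` is fixed in advance and does not grow with the number of hyperparameters, and continuous ranges do not have to be discretised.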

More principled methods are able to learn from sampling the space so that future samples are directed toward the parts of the search space that are most likely to contain the extrema.

A directed approach to global optimization that uses probability is called Bayesian Optimization.

What Is Bayesian Optimization
Bayesian Optimization is an approach that uses Bayes Theorem to direct the search in order to find the minimum or maximum of an objective function.
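
For reference, Bayes' theorem in its general form, followed by the way it is typically read in this setting. The symbols $f$ (the objective function) and $D$ (the samples evaluated so far) are notation assumed here for illustration.

$$
\begin{align}
P(A \mid B) & = \frac{P(B \mid A)\, P(A)}{P(B)} \\
P(f \mid D) & \propto P(D \mid f)\, P(f)
\end{align}
$$

The posterior $P(f \mid D)$ is the updated belief about the objective function after seeing the evaluated samples; the normalising term is dropped because only relative values matter when choosing the next sample.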

It is an approach that is most useful for objective functions that are complex, noisy, and/or expensive to evaluate.

Bayesian optimization is a powerful strategy for finding the extrema of objective functions that are expensive to evaluate. […] It is particularly useful when these evaluations are costly, when one does not have access to derivatives, or when the problem at hand is non-convex.








In [13]:
# def bayesian_optimization_model(X_train, y_train, X_val, y_val, params):
#     model = Sequential()
    
#     model.add(convolutional.Conv2D(32, params['kernel_size'], input_shape=(28, 28, 1), activation=params['activation']))
#     model.add(convolutional.MaxPooling2D())
#     model.add(BatchNormalization())
#     model.add(Dropout(params['dropout']))
    
#     model.add(convolutional.Conv2D(32, params['kernel_size'], activation=params['activation']))
#     model.add(convolutional.MaxPooling2D())
#     model.add(BatchNormalization())
#     model.add(Dropout(params['dropout']))

#     model.add(Flatten())
#     model.add(Dense(y_train.shape[1], activation='softmax'))
    
#     model.compile(loss = 'categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
#     model.fit(X_train, y_train, epochs=15, batch_size = params['batch_size'],
#                         verbose = 0, validation_data = (X_val, y_val))
#     score = model.evaluate(X_val, y_val, verbose=0)
#     return score[1]

In [14]:
# def bayesian_optimization_model(params, n_class):
#     model = Sequential()
    
#     model.add(convolutional.Conv2D(32, (3,3), input_shape=(28, 28, 1), activation=params[0]))
#     model.add(convolutional.MaxPooling2D())
#     model.add(BatchNormalization())
#     model.add(Dropout(params[1]))
    
#     model.add(convolutional.Conv2D(32, (3,3), activation=params[0]))
#     model.add(convolutional.MaxPooling2D())
#     model.add(BatchNormalization())
#     model.add(Dropout(params[1]))

#     model.add(Flatten())
#     model.add(Dense(n_class, activation='softmax'))
    
#     model.compile(loss = 'categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
#     return model

In [15]:
# import skopt
# from skopt import gbrt_minimize, gp_minimize
# from skopt.utils import use_named_args
# from skopt.space import Real, Categorical, Integer  
# import tensorflow
# from tensorflow.python.keras import backend as K

In [16]:
# # dim_learning_rate = Real(low=1e-4, high=1e-2, prior='log-uniform', name='learning_rate')
# # dim_num_dense_layers = Integer(low=1, high=5, name='num_dense_layers')
# # dim_num_input_nodes = Integer(low=1, high=512, name='num_input_nodes')
# # dim_num_dense_nodes = Integer(low=1, high=28, name='num_dense_nodes')

# dim_activation = Categorical(categories=['relu','sigmoid','elu'], name='activation')
# dim_dropout = Real(low=0, high=0.5, name='dropout_rate')
# dim_batch_size = Integer(low=10, high=100, name='batch_size')
# dim_kernel_size = Integer(low=3, high=9, name='kernel_size')

# dimensions = [dim_activation, dim_dropout, dim_batch_size]
# default_parameters = ['relu',0.25, 50]

In [17]:
# # gp_minimize calls the objective with a single list of parameter values
# def fitness(params):

#     model = bayesian_optimization_model(params, 10)
#     blackbox = model.fit(x=X_train, y=y_train, epochs=15, batch_size=params[2], validation_data=(X_val, y_val))
#     accuracy = blackbox.history['val_accuracy'][-1]
#     print("Accuracy: {0:.2%}".format(accuracy))

#     # Delete the Keras model with these hyper-parameters from memory.
#     del model
#     K.clear_session()
#     tensorflow.reset_default_graph()
    
#     # the optimizer aims for the lowest score, so we return our negative accuracy
#     return -accuracy

In [18]:
# K.clear_session()
# tensorflow.reset_default_graph()

In [19]:
# gp_result = gp_minimize(func=fitness, dimensions=dimensions, n_calls=12, noise=0.01, n_jobs=-1, kappa=5, x0=default_parameters)

In [20]:
# model = bayesian_optimization_model(gp_result.x, 10)
# model.fit(X_train, y_train, epochs=3, batch_size=gp_result.x[2])
# model.evaluate(X_val, y_val)

In [21]:
import os
import numpy as np
import pandas as pd
from bayes_opt import BayesianOptimization
from keras.layers import Dense, Conv2D, BatchNormalization
from keras.layers import MaxPooling2D
from keras.layers import Input, Flatten, Dropout
from keras.layers import Activation
from keras.optimizers import Adam
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from keras.models import Model, Sequential, load_model
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
# import confusion_matrix_plot

early_stop_epochs = 10
learning_rate_epochs = 5

# parameters that change for each iteration that must be saved
list_early_stop_epochs = []
list_validation_loss = []
list_saved_model_name = []

def bayesian_optimization(kernel_size, batch_size, dropout):
    ker = int(kernel_size)
    batch = int(batch_size)
    model = Sequential()
    
    model.add(Conv2D(32, (ker,ker), input_shape=(28, 28, 1), activation='relu'))
    model.add(MaxPooling2D())
    model.add(BatchNormalization())
    model.add(Dropout(dropout))
    
    model.add(Conv2D(32, (ker,ker), activation='relu'))
    model.add(MaxPooling2D())
    model.add(BatchNormalization())
    model.add(Dropout(dropout))

    model.add(Flatten())
    model.add(Dense(10, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.00025), metrics=['accuracy'])
    callbacks_list = [EarlyStopping(monitor='val_loss', patience=early_stop_epochs),                     
                      ReduceLROnPlateau(monitor='val_loss', factor=0.1, 
                                        patience=learning_rate_epochs, 
                                        verbose=0, mode='auto', min_lr=1.0e-6),
                      ModelCheckpoint(filepath='model_ker_%s_batch_%s_drop_%s.h5'%(kernel_size,batch_size,dropout),
                                      monitor='val_loss', save_best_only=True)]
    
    history = model.fit(X,y,batch_size=batch,
                        epochs=15,verbose=0,validation_split=0.25, 
                        shuffle=False,callbacks=callbacks_list)

    # record the epoch with the lowest validation loss (1-based) and that loss;
    # both are added to the results DataFrame below
    list_early_stop_epochs.append(int(np.argmin(history.history['val_loss'])) + 1)

    validation_loss = np.min(history.history['val_loss'])
    list_validation_loss.append(validation_loss)
    list_saved_model_name.append('model_ker_%s_batch_%s_drop_%s.h5'%(kernel_size,batch_size,dropout))

    # bayes opt is a maximization algorithm, to minimize validation_loss, return 1-this
    bayes_score = 1.0 - validation_loss
    return bayes_score

params_opt = {'kernel_size':(3, 9),
              'batch_size':(1, 50),
              'dropout': (0, 0.5)}

optimizer = BayesianOptimization(f=bayesian_optimization, pbounds=params_opt, verbose=2)
optimizer.maximize(init_points=12, n_iter=100)

print('\nbest result:', optimizer.max)

list_dfs = []
counter = 0
for result in optimizer.res:
    df_temp = pd.DataFrame.from_dict(data=result['params'], orient='index', columns=['trial' + str(counter)]).T
    df_temp['bayes opt error'] = result['target']
    df_temp['epochs'] = list_early_stop_epochs[counter]
    df_temp['validation_loss'] = list_validation_loss[counter]
    df_temp['model_name'] = list_saved_model_name[counter]
    list_dfs.append(df_temp)
    counter = counter + 1

df_results = pd.concat(list_dfs, axis=0)

|   iter    |  target   | batch_... |  dropout  | kernel... |
-------------------------------------------------------------
|  1        |  0.7252   |  2.732    |  0.0998   |  5.479    |
|  2        |  0.7119   |  12.11    |  0.27     |  7.432    |
|  3        |  0.7362   |  17.4     |  0.08257  |  3.946    |
|  4        |  0.7265   |  23.69    |  0.1232   |  4.645    |
|  5        |  0.7213   |  35.05    |  0.3421   |  3.502    |
|  6        |  0.7      |  15.46    |  0.1925   |  8.685    |
|  7        |  0.7119   |  30.51    |  0.3762   |  7.572    |
|  8        |  0.7134   |  14.36    |  0.3353   |  7.667    |
|  9        |  0.7165   |  36.41

|  77       |  0.689    |  22.47    |  0.0      |  8.201    |
|  78       |  0.7149   |  25.36    |  0.01761  |  5.05     |
|  79       |  0.6664   |  19.98    |  0.0      |  7.656    |
|  80       |  0.7169   |  38.26    |  0.0      |  5.491    |
|  81       |  0.6986   |  42.18    |  0.06302  |  8.998    |
|  82       |  0.6772   |  50.0     |  0.5      |  7.145    |
|  83       |  0.7197   |  45.19    |  0.0      |  5.041    |
|  84       |  0.7046   |  43.41    |  0.0      |  6.714    |
|  85       |  0.6939   |  17.47    |  0.5      |  7.681    |
|  86       |  0.6829   |  7.887    |  0.5      |