# Hyperparameter Tuning

Optimizing hyperparameters is a crucial step in building a robust neural network model. In this blog post, we will use scikit-learn's GridSearchCV to search for good values of several neural network hyperparameters: the number of neurons in the hidden layer, the batch size, the number of epochs, and more. Tuning these values with GridSearchCV can improve the model's performance on our dataset.

![1.png](attachment:1.png)

# Content
 - I. Data Preparation
 - II. Number of Neurons
 - III. Batch_size and Epoch
 - IV. Other parameters to tune
    - Dropout Regularization
    - Neuron Activation Function
    - Network Weight Initialization
    - Learning Rate and Momentum
    - Training Optimization Algorithm
 - V. Conclusion

---

# I. Data Preparation for GridSearchCV

In [1]:
import pandas as pd
import numpy as np

# load dataset
df = pd.read_csv('heart_failure.csv')

# define X and y
y = df['death_event']
X = df.loc[:,'age':'time']

# data split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3, 
                                                    random_state = 7)

# ColumnTransformer                                                   
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer([("numeric", 
                                     StandardScaler(),
                                    ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets',
                                     'serum_creatinine', 'serum_sodium', 'time',]), 
                                   
                                   ("categorical",
                                    OneHotEncoder(),
                                    ['anaemia','diabetes', 'high_blood_pressure', 'sex', 'smoking' ])

                                   ])

# fit the preprocessor to the training data and transform both train and test data
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

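# LabelEncoder: fit on the training labels only, then reuse the same mapping for the test labels
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train.astype(str))
y_test = le.transform(y_test.astype(str))

# One-hot encode the labels into 2 columns (needed for the 2-unit softmax output)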
from tensorflow.keras.utils import to_categorical
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# II. Number of Neurons in Hidden Layer

The number of neurons in a layer is an important parameter to consider while tuning a neural network, because it determines the network's representational capacity at that point in the topology. In theory, a sufficiently wide network with a single hidden layer can approximate the same functions as a deeper network. In practice, however, it is often more efficient to approximate complex functions with several layers than with one very wide layer.
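To make the link between layer width and capacity concrete, the sketch below counts the trainable parameters of a network with one hidden layer. The input width of 17 features is an assumption about the preprocessed data, and `count_dense_params` is a hypothetical helper written for this illustration, not a Keras function.

```python
def count_dense_params(n_inputs, n_hidden, n_outputs):
    """Trainable parameters of input -> Dense(n_hidden) -> Dense(n_outputs)."""
    hidden = (n_inputs + 1) * n_hidden      # weights plus one bias per hidden neuron
    output = (n_hidden + 1) * n_outputs     # weights plus one bias per output neuron
    return hidden + output

# Assumed input width of 17 features and 2 output classes.
param_counts = {h: count_dense_params(17, h, 2) for h in (17, 34, 51, 68)}
for h, total in param_counts.items():
    print(f"{h:>3} hidden neurons -> {total} trainable parameters")
```

The count grows linearly with the width of the hidden layer, which is one reason wider layers are both more expressive and more prone to overfitting on a small dataset.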

#### Importing necessary library and packages

In [2]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer
from scikeras.wrappers import KerasClassifier
import numpy as np

#### Function for tuning the number of neurons in hidden layer 1

In [3]:
# set random seed for reproducibility
seed = 7
np.random.seed(seed)
tf.random.set_seed(seed)

# define the function
def create_model_neurons(neurons):
    model = Sequential()
    # input layer
    model.add(InputLayer(input_shape=(X_train.shape[1],)))
    # hidden layer 1
    model.add(Dense(neurons, activation='relu', kernel_initializer='glorot_uniform'))
    # output layer
    model.add(Dense(2, activation='softmax', kernel_initializer='glorot_uniform'))
    # compile
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=[tf.keras.metrics.Recall()])
    return model

This code defines a function called create_model_neurons which creates a neural network model with a specified number of neurons in the hidden layer.

First, the random seed is set for reproducibility.

The model is built using the Sequential class from Keras, which allows you to define a linear stack of layers.

The model has an input layer defined using InputLayer, which specifies the shape of the input data to the model. The shape of the input is determined by the number of features in the training data, as specified by X_train.shape[1].

The hidden layer of the model is defined using Dense, which creates a fully connected layer of neurons. The number of neurons in the hidden layer is specified by the neurons argument passed to the function. The activation function used for the hidden layer is ReLU (activation='relu'). Additionally, kernel_initializer='glorot_uniform' is used to initialize the weights of the layer. This scheme scales the initial weights according to the number of inputs and outputs of the layer, with the aim of keeping the variance of activations and gradients roughly constant from layer to layer at the start of training, which can help the network train more reliably.

The output layer of the model is defined using another Dense layer with two output neurons, since this is a binary classification problem. The activation function used for the output layer is softmax (activation='softmax'), which is commonly used for multi-class classification problems.

Finally, the model is compiled using the compile method. The loss function is categorical cross-entropy (loss='categorical_crossentropy'), which matches the one-hot encoded labels and the softmax output. The optimizer is Adam (optimizer='adam'), a widely used optimization algorithm for neural networks. During training and evaluation, Keras also reports the recall metric (metrics=[tf.keras.metrics.Recall()]), which measures the model's ability to correctly identify positive samples. Note that GridSearchCV does not rank candidates by this Keras metric: unless a scoring argument is passed, it calls the wrapped estimator's score method, which for KerasClassifier should report accuracy by default.
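The softmax output and the categorical cross-entropy loss described above can be worked through by hand for a single sample. The following is a minimal plain-Python sketch, not the Keras implementation; the raw scores are hypothetical values chosen for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]   # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_prob):
    """Loss for one sample given a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t > 0)

probs = softmax([2.0, 0.0])                              # hypothetical raw scores for the two classes
loss_correct = categorical_crossentropy([1, 0], probs)   # true class is the first one
loss_wrong = categorical_crossentropy([0, 1], probs)     # true class is the second one
print(probs, loss_correct, loss_wrong)
```

The loss is small when the probability assigned to the true class is high and grows as that probability shrinks, which is the signal the optimizer uses to adjust the weights.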

#### Create the model

In [5]:
model = KerasClassifier(model = create_model_neurons,
                          epochs = 100, # initial epoch
                          batch_size = 17, # initial batch_size
                          verbose=0) # set to 1 if we want a print out preview of the result

The code defines a KerasClassifier using the function create_model_neurons as the model. The KerasClassifier is used to enable the model to be used with scikit-learn's GridSearchCV. The model is set to train for 100 epochs and to use a batch size of 17. The verbose parameter is set to 0, which means no output will be displayed during training. If verbose is set to 1, then the output of each epoch will be printed.

#### GridSearchCV

In [6]:
from sklearn.model_selection import GridSearchCV

# define the grid search parameters
neurons = [17, 34, 51, 68]
param_grid = dict(model__neurons = neurons)

grid = GridSearchCV(estimator = model, 
                    param_grid = param_grid,
                    n_jobs=-1,
                    verbose=1, 
                    cv=5) 

grid_result = grid.fit(X_train, y_train, shuffle=False)  # shuffle=False keeps the sample order fixed

Fitting 5 folds for each of 4 candidates, totalling 20 fits


The objective of this code is to find the best combination of hyperparameters for a neural network model using hyperparameter tuning via GridSearchCV method from scikit-learn.

The hyperparameter being tuned in this code is the number of neurons in the hidden layer of the model. The list of values being tested for this hyperparameter is [17, 34, 51, 68].

The param_grid dictionary in the code specifies the hyperparameters to tune and the values to test. In this case, model__neurons is the hyperparameter being tuned, and its values are specified in the neurons list.

The estimator argument of GridSearchCV is set to model, the KerasClassifier created from the create_model_neurons function. The epochs and batch_size hyperparameters are not tuned here; they keep the values passed to the KerasClassifier constructor (100 and 17).

The n_jobs argument is set to -1 to use all available processors, while the verbose argument is set to 1 to display progress messages.

Generally, the default value of 5-fold cross-validation is a good starting point. However, for smaller datasets, a higher number of folds (e.g., 10) may be more appropriate to reduce the variance of the performance estimate. On the other hand, for larger datasets, a lower number of folds (e.g., 3) may be sufficient to obtain reliable estimates of the model performance while reducing computational costs.

Calling fit on the grid object runs the search and returns the fitted GridSearchCV object, which is stored in grid_result. The shuffle parameter is set to False so that the order of the training samples is not changed between epochs.
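The "20 fits" reported in the output follows directly from the grid and the number of folds. The sketch below reproduces that arithmetic in plain Python; `expand_grid` is a hypothetical helper that mimics how a dictionary of lists expands into candidates, not scikit-learn's own code.

```python
from itertools import product

def expand_grid(param_grid):
    """List every combination of values in a dict of lists."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in product(*(param_grid[n] for n in names))]

n_folds = 5
neuron_candidates = expand_grid({"model__neurons": [17, 34, 51, 68]})
total_fits = len(neuron_candidates) * n_folds
print(len(neuron_candidates), "candidates x", n_folds, "folds =", total_fits, "fits")
```

Each candidate is trained once per fold, so the cost of a search is the number of candidates multiplied by the number of folds.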

#### GridSearchCV result (Training Dataset)

In [7]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.789663 using {'model__neurons': 34}
0.755981 (0.050787) with: {'model__neurons': 17}
0.789663 (0.030507) with: {'model__neurons': 34}
0.751684 (0.067284) with: {'model__neurons': 51}
0.775377 (0.048097) with: {'model__neurons': 68}


#### `Note:`

1. Upon re-running the grid search several times, I observed that the results vary despite setting a random seed. This suggests that there are additional factors impacting the reproducibility of the results beyond just setting a seed.

2. When the verbose parameter in GridSearchCV is set to 1, it prints only a summary line with the number of folds, candidates, and total fits, as in the output above. Higher values print progressively more detail for each fit, such as the computation time and the score of each fold.

3. The grid_result.best_score_ attribute returns the mean score over all cross-validation folds for the best set of hyperparameters found during the search.

A single fold's score, whether printed at higher verbosity or read from the per-split columns of cv_results_, can be higher than grid_result.best_score_, because the former comes from one fold while the latter is the mean across all folds. It is therefore better to rely on grid_result.best_score_ when comparing hyperparameter settings.
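The difference between a single fold's score and the mean score can be illustrated with a few numbers. The fold scores below are hypothetical and chosen only for illustration; they are not taken from the search above.

```python
from statistics import mean, pstdev

# Hypothetical per-fold scores for one candidate, for illustration only.
fold_scores = [0.76, 0.81, 0.74, 0.83, 0.79]

mean_score = mean(fold_scores)        # the kind of value mean_test_score and best_score_ summarise
std_score = pstdev(fold_scores)       # population standard deviation across the folds
best_single_fold = max(fold_scores)   # the most optimistic fold

print(f"mean={mean_score:.3f} std={std_score:.3f} best fold={best_single_fold:.3f}")
```

The best fold is always at least as high as the mean, so quoting it would overstate how well the candidate is likely to generalize.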

#### `Reproducibility is a Problem` 
Reproducibility is a crucial aspect of machine learning, as it enables researchers to validate and compare their models' performance. However, even when seeds are set for numpy and tensorflow, the results are not always 100% reproducible. This can be particularly problematic when grid searching wrapped Keras models, because factors beyond the seeds come into play. For example, with n_jobs=-1 the fits run in separate worker processes, which may not pick up the seeds set in the notebook, and some TensorFlow operations are not deterministic. It is therefore important to be aware of these factors and to take what measures are available, such as setting the shuffle parameter in the fit method to False and initializing weights consistently.
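One way to keep the seeding in a single place is a small helper that seeds every random number generator it can find. This is a sketch under assumptions: `set_global_seeds` is a hypothetical helper, and calling it at the top of the model-building function (so that each parallel worker seeds itself) is a suggestion to try, not a guaranteed fix.

```python
import os
import random

def set_global_seeds(seed):
    """Seed every random number generator that is available in this environment."""
    os.environ["PYTHONHASHSEED"] = str(seed)   # only affects interpreters started afterwards
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow not installed; nothing to seed

set_global_seeds(7)
first_draw = random.random()
set_global_seeds(7)
second_draw = random.random()
print(first_draw == second_draw)
```

Seeding makes each generator repeat its sequence, but it does not remove non-determinism that comes from parallel execution order or from hardware-specific kernels.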

#### Model evaluation (Test dataset)

In [8]:
# placing our newfound number of neurons
n = grid_result.best_params_['model__neurons']

seed = 7
np.random.seed(seed)
tf.random.set_seed(seed)

model = Sequential()
model.add(InputLayer(input_shape=(X_train.shape[1],)))
model.add(Dense(n, activation='relu',  kernel_initializer='glorot_uniform'))
model.add(Dense(2, activation='softmax',  kernel_initializer='glorot_uniform'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=[tf.keras.metrics.Recall()])

# fit the training dataset
model.fit(X_train, y_train, epochs=100, batch_size=17, verbose=0, shuffle=False)

<keras.callbacks.History at 0x1fbff2a94c0>

In [9]:
# evaluate training dataset
print(model.evaluate(X_train, y_train))
# evaluate testing dataset
print(model.evaluate(X_test, y_test))

[0.23052039742469788, 0.9282296895980835]
[0.3269091248512268, 0.855555534362793]


# III. `Batch_size and Epoch`

The `batch size` refers to the number of training examples used in one forward/backward pass of the neural network during the training process. It determines how many patterns are loaded into memory at a time and can affect the speed of training and the memory requirements of the network. Choosing an appropriate batch size matters for performance: a batch size that is too small gives noisy gradient estimates and slow epochs, while a batch size that is too large can exhaust memory and gives fewer weight updates per epoch, which may slow convergence. In hyperparameter tuning, different batch sizes can be tried to find the one that works best for the given dataset and model architecture.

The `number of epochs` refers to the number of times the entire training dataset is shown to the network during training. Each epoch consists of one full pass through the training dataset. The number of epochs is another important hyperparameter to tune, as it determines how long the network will be trained for. Too few epochs can result in underfitting, where the model is not able to capture the underlying patterns in the data. Conversely, too many epochs can result in overfitting, where the model becomes too specialized to the training data and is unable to generalize well to new data. By trying different numbers of epochs, we can find the optimal number that leads to the best performance on the validation set.
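Batch size and epochs together determine how many times the weights are updated. The sketch below computes that number; the training-set size of 209 is a hypothetical value for illustration, and `weight_updates` is a helper defined here, not a Keras function.

```python
import math

def weight_updates(n_samples, batch_size, epochs):
    """Number of gradient steps: batches per epoch times epochs."""
    batches_per_epoch = math.ceil(n_samples / batch_size)   # the last batch may be smaller
    return batches_per_epoch * epochs

n_samples = 209   # hypothetical training-set size, for illustration only
updates = {bs: weight_updates(n_samples, bs, epochs=50) for bs in (17, 34, 51)}
for bs, steps in updates.items():
    print(f"batch_size={bs:>2}: {steps} weight updates in 50 epochs")
```

For a fixed number of epochs, a larger batch size means fewer updates, which is why the two hyperparameters are usually tuned together.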

#### Function to create a model for tuning batch size and epoch

In [10]:
def create_model_batch_epoch():
    model = Sequential()
    model.add(InputLayer(input_shape=(X_train.shape[1],)))
    # placing our newfound neuron (n)
    model.add(Dense(n, activation='relu', kernel_initializer='glorot_uniform'))
    model.add(Dense(2, activation='softmax', kernel_initializer='glorot_uniform'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=[tf.keras.metrics.Recall()])
    return model

In [11]:
# create model
model = KerasClassifier(model=create_model_batch_epoch, verbose=1)

#### GridSearchCV

In [12]:
from sklearn.model_selection import GridSearchCV

# define gridsearch parameter
n_column = X_train.shape[1]
batch_size = [n_column, n_column*2, n_column*3]
epochs = [50, 80, 100]

param_grid = dict(batch_size = batch_size,
                  epochs = epochs)

grid = GridSearchCV(estimator = model, 
                    param_grid = param_grid,
                    n_jobs=-1,
                    verbose=1, 
                    cv=5)

In [13]:
# fit the training dataset
grid_result = grid.fit(X_train, y_train, verbose=0, shuffle=False)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


#### GridSearchCV result

In [14]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.784785 using {'batch_size': 34, 'epochs': 50}


#### Model evaluation (Test dataset)

In [15]:
# placing newfound batch_size and epochs along with the newfound number of neurons
bs = grid_result.best_params_['batch_size']
ep = grid_result.best_params_['epochs']

seed = 7
np.random.seed(seed)
tf.random.set_seed(seed)

model = Sequential()
model.add(InputLayer(input_shape=(X_train.shape[1],)))
# newfound neurons
model.add(Dense(n, activation='relu',  kernel_initializer='glorot_uniform'))
model.add(Dense(2, activation='softmax',  kernel_initializer='glorot_uniform'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=[tf.keras.metrics.Recall()])

# placing our newfound batch_size and epoch
model.fit(X_train, y_train, batch_size= bs,  epochs=ep, verbose=0, shuffle=False)

<keras.callbacks.History at 0x1fbff407310>

In [19]:
# evaluate training dataset
print(model.evaluate(X_train, y_train))
# evaluate testing dataset
print(model.evaluate(X_test, y_test))

[0.33950579166412354, 0.8516746163368225]
[0.363415390253067, 0.855555534362793]


At this point in our hyperparameter experiment, the training and test scores are similar (roughly 0.85 each), which suggests that the model is not overfitting. We will therefore carry the parameters found so far into our model. There are still other hyperparameters that could improve performance further. Ideally, we would tune all of them together in a single GridSearchCV run, but the number of fits grows multiplicatively with every added parameter, and our hardware may not be able to handle such a comprehensive search. As a workaround, we can consider using cloud services to perform the hyperparameter tuning.
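To get a feel for the cost of a joint search, the sketch below compares the number of fits for one combined grid against the two separate searches run so far, and shows how a randomized search caps the cost with a fixed budget. The batch sizes assume 17 input columns, and the budget of 10 combinations is an arbitrary illustrative choice.

```python
import random
from itertools import product

neurons = [17, 34, 51, 68]
batch_sizes = [17, 34, 51]       # assumes n_column = 17
epochs = [50, 80, 100]
n_folds = 5

# Every combination a single joint grid search would have to evaluate.
joint_grid = list(product(neurons, batch_sizes, epochs))
joint_fits = len(joint_grid) * n_folds

# The two separate searches run in this post.
sequential_fits = len(neurons) * n_folds + len(batch_sizes) * len(epochs) * n_folds

# A randomized search evaluates only a fixed budget of combinations.
rng = random.Random(7)
sampled = rng.sample(joint_grid, k=10)

print(joint_fits, sequential_fits, len(sampled) * n_folds)
```

Tuning parameters one group at a time is cheaper but can miss interactions between them, so the sequential results should be read as a reasonable starting point rather than a joint optimum.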

----

# IV. Other parameters to tune

### Dropout Regularization
In neural networks, dropout regularization is a technique used to reduce overfitting, which occurs when a model is too complex and performs well on the training data but poorly on new, unseen data. Dropout regularization randomly drops out (sets to zero) some of the neurons in a layer during training. This means that these neurons will not contribute to the forward or backward pass, effectively making the model simpler and reducing the chances of overfitting.
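The mechanism can be sketched in a few lines of plain Python. This is an illustration of inverted dropout as applied during training, not the Keras implementation; the activations are hypothetical values.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout as applied during training: zero some units, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)   # rescaling keeps the expected activation unchanged
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(7)
hidden = [0.5, 1.2, 0.0, 2.0, 0.7, 1.5]   # hypothetical hidden-layer activations
dropped = dropout(hidden, rate=0.2, rng=rng)
unchanged = dropout(hidden, rate=0.0, rng=rng)
print(dropped)
```

With a rate of 0.0 nothing is dropped, which is why including 0.0 in the grid gives a baseline without regularization.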

Dropout tends to work best when combined with a weight constraint such as the max-norm constraint.
We can create lists of values for `dropout_rate` and `weight_constraint` and place them in a param_grid dictionary for GridSearchCV, for example:
 - weight_constraint = [1.0, 2.0]
 - dropout_rate = [0.0, 0.1, 0.2]

Code implementation 

In [20]:
# from tensorflow.keras.layers import Dropout
# from tensorflow.keras.constraints import MaxNorm

# weight_constraint = [1.0, 2.0]
# dropout_rate = [0.0, 0.1, 0.2]

# param_grid = dict(
#                   model__dropout_rate = dropout_rate,
#                   model__weight_constraint = weight_constraint
#                  )

# # Function for tuning the dropout rate and weight constraint of hidden layer 1
# def create_model_dropout(dropout_rate, weight_constraint):
#     model = Sequential()
#     # input layer
#     model.add(InputLayer(input_shape=(X_train.shape[1],)))
#     # hidden layer 1, with the max-norm constraint on its weights
#     model.add(Dense(n, activation ='relu', kernel_constraint=MaxNorm(weight_constraint)))
#     model.add(Dropout(dropout_rate))
#     # output layer
#     model.add(Dense(2, activation='sigmoid'))
#     # compile
#     model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
#     return model
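The max-norm constraint mentioned above limits the size of each neuron's incoming weight vector. The sketch below shows the idea in plain Python; `apply_max_norm` is a hypothetical helper for illustration and only approximates what a Keras constraint does after each update.

```python
import math

def apply_max_norm(weights, max_value):
    """Rescale a neuron's incoming weight vector if its L2 norm exceeds max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)              # already inside the constraint
    scale = max_value / norm
    return [w * scale for w in weights]

incoming = [3.0, 4.0]                      # hypothetical weights with L2 norm 5.0
constrained = apply_max_norm(incoming, max_value=2.0)
print(constrained)
```

The direction of the weight vector is preserved and only its length is capped, which keeps individual neurons from growing very large weights while dropout is active.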

### Neuron Activation Function
Hyperparameter tuning of neuron activation function involves selecting the most appropriate activation function for a neural network model. Activation functions are mathematical functions applied to the outputs of individual neurons in a neural network to introduce non-linearity and to help the network learn complex patterns in the data.

There are various activation functions that can be used in a neural network such as `sigmoid`, `ReLU` (Rectified Linear Unit), `tanh` (Hyperbolic Tangent) and `softmax`. Each of these functions has its own strengths and weaknesses and the choice of activation function can have a significant impact on the performance of the neural network.
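The three element-wise functions named above are short enough to write out directly. This is a plain-Python sketch for comparison, not the Keras implementations.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # squashes to (0, 1)

def relu(x):
    return max(0.0, x)                  # zero for negatives, identity for positives

def tanh(x):
    return math.tanh(x)                 # squashes to (-1, 1), centred at 0

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  relu={relu(x):.3f}  tanh={tanh(x):+.3f}")
```

Softmax differs from these in that it acts on a whole vector of scores at once, which is why it is normally reserved for the output layer.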

### Network Weight Initialization
Hyperparameter tuning for network weight initialization involves selecting the most suitable method for initializing the weights in a neural network to improve its performance. The available methods include `uniform`, `lecun_uniform`, `normal`, `zero`, `glorot_normal`, `glorot_uniform`, `he_normal`, and `he_uniform` initialization. These methods differ in the way they distribute weights in the network, and some may work better than others for a particular dataset or problem. Therefore, it's essential to experiment with different weight initialization methods and select the one that yields the best results.
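As an illustration of how two of these schemes differ, the sketch below computes the sampling range used by Glorot uniform and He uniform initialization and draws one weight matrix. The layer shape of 17 inputs and 34 hidden neurons is an assumption, and the helper functions are written for this example rather than taken from Keras.

```python
import math
import random

def glorot_uniform_limit(fan_in, fan_out):
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_uniform_limit(fan_in):
    return math.sqrt(6.0 / fan_in)

def sample_uniform(limit, fan_in, fan_out, rng):
    """Draw a fan_in x fan_out weight matrix from U(-limit, limit)."""
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]

fan_in, fan_out = 17, 34                 # assumed layer shape: 17 inputs, 34 hidden neurons
rng = random.Random(7)
glorot_limit = glorot_uniform_limit(fan_in, fan_out)
he_limit = he_uniform_limit(fan_in)
weights = sample_uniform(glorot_limit, fan_in, fan_out, rng)
print(f"glorot limit={glorot_limit:.4f}  he limit={he_limit:.4f}")
```

He initialization uses a wider range because it accounts only for the number of inputs, a choice that is usually motivated by ReLU activations.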

### Learning Rate and Momentum
Hyperparameter tuning for learning rate and momentum involves finding the optimal values for these two parameters to improve the performance of a neural network. Learning rate is the step size used to update the network's weights during backpropagation, while momentum is a term that helps to accelerate the training process by adding a fraction of the previous weight update to the current update.

Finding the best values for learning rate and momentum requires experimenting with different values and observing their effects on the network's performance metrics, such as accuracy or loss. Typically, a range of values is chosen for each parameter, and the network is trained with different combinations of learning rates and momentums.

Common values for learning rate include 0.1, 0.01, 0.001, and 0.0001, while momentum is often set between 0.9 and 0.99. However, these values may vary depending on the specific problem and architecture.

In general, a learning rate that is too high may cause the network to diverge, while a learning rate that is too low may result in slow convergence or getting stuck in a suboptimal solution. Momentum can help to mitigate these issues by smoothing the weight updates and avoiding oscillations.

To perform hyperparameter tuning for learning rate and momentum, techniques such as GridSearchCV or randomized search can be used to explore different combinations of values and find the best ones that maximize the network's performance.
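The effect of the learning rate and momentum described above can be seen on a toy problem. The sketch below minimises f(w) = w² with a plain-Python version of SGD with momentum; it is an illustration of the update rule, not the Keras optimizer, and the step counts and starting point are arbitrary.

```python
def sgd_momentum(lr, momentum, steps=200, start=5.0):
    """Minimise f(w) = w**2 with SGD plus momentum and return the final w."""
    w, velocity = start, 0.0
    for _ in range(steps):
        grad = 2.0 * w                               # derivative of w**2
        velocity = momentum * velocity - lr * grad   # blend previous update with the new step
        w = w + velocity
    return w

w_plain = sgd_momentum(lr=0.1, momentum=0.0)
w_momentum = sgd_momentum(lr=0.1, momentum=0.9)
w_diverged = sgd_momentum(lr=1.1, momentum=0.0, steps=50)
print(w_plain, w_momentum, w_diverged)
```

With a learning rate of 0.1 both runs approach the minimum at 0, while a learning rate of 1.1 overshoots on every step and moves further away, which is the divergence described above.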

### Training Optimization Algorithm
Hyperparameter tuning for training optimization algorithms involves selecting the algorithm that will optimize the weights and biases of the neural network during training. It is important to choose an optimization algorithm that converges quickly to a good minimum of the loss and reaches the desired accuracy; for neural networks, reaching the global minimum is generally not guaranteed.

Some popular optimization algorithms include:

- Stochastic Gradient Descent (SGD)
- Adam
- Adagrad
- Adadelta
- RMSprop

In hyperparameter tuning, the different algorithms are tested, and their performance is compared to determine the most suitable algorithm for a specific neural network and dataset. Factors such as the dataset size, complexity, and training time can influence the choice of optimization algorithm.
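One reason these algorithms behave differently is how they scale each update. The sketch below compares the first update of plain SGD with the first update of Adam for a single weight. It follows the standard Adam update rule with commonly quoted default values, which are assumptions here and may differ from the defaults of a given library.

```python
import math

def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the very first Adam update for a single weight with gradient grad."""
    m = (1 - beta1) * grad                 # first-moment estimate after one step
    v = (1 - beta2) * grad * grad          # second-moment estimate after one step
    m_hat = m / (1 - beta1)                # bias correction for step t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def sgd_step(grad, lr=0.001):
    return lr * grad

for g in (0.01, 1.0, 100.0):
    print(f"grad={g:>6}: sgd step={sgd_step(g):.6f}  adam step={adam_first_step(g):.6f}")
```

The SGD step grows with the gradient, whereas the Adam step stays close to the learning rate regardless of the gradient's scale, which makes Adam less sensitive to badly scaled gradients.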

# V. Conclusion

In conclusion, building an efficient neural network model requires tuning several hyperparameters. Beyond data preparation, the number of neurons, the batch size, and the number of epochs, it is also worth considering dropout regularization, the neuron activation function, network weight initialization, learning rate and momentum, and the training optimization algorithm. A comprehensive search over all of these can be time-consuming and computationally expensive, but it contributes to a more robust model. Techniques such as GridSearchCV or Bayesian optimization can make the search more systematic, and cloud services can speed up the computation.

---