# Hyperparameter Tuning for Neural Networks

## Learning Objectives
* <a href="#p1">Part 1</a>: Describe the major hyperparameters to tune
* <a href="#p2">Part 2</a>: Implement an experiment tracking framework
* <a href="#p3">Part 3</a>: Search the hyperparameter space using RandomSearch (Optional)

# 1. Hyperparameter Options (Learn)
<a id="p1"></a>

## Overview

Hyperparameter tuning matters much more for neural networks than for any of the other models we have considered up to this point. Other supervised learning models might have a couple of hyperparameters, but neural networks can have dozens. These can substantially affect the accuracy of our models, and although tuning them can be time consuming, it is a necessary step when working with neural networks.

Hyperparameter tuning comes with a challenge. How can we compare models specified with different hyperparameters if our model's final error metric can vary somewhat erratically? How do we avoid just getting unlucky and selecting the wrong hyperparameter? This is a problem that to a certain degree we just have to live with as we test and test again. However, we can minimize it somewhat by pairing our experiments with Cross Validation to reduce the variance of our final accuracy values.
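
To make the cross validation idea concrete, here is a minimal sketch of how a dataset can be split into folds so that each configuration is scored several times instead of once. The helper `k_fold_indices` is hypothetical and written only for illustration; in practice scikit-learn's `KFold` does this job.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled folds.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# With 10 samples and 5 folds, every split trains on 8 rows and validates on 2
splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Averaging the validation score over the folds, and looking at its spread, gives a steadier basis for comparing two hyperparameter settings than a single train/validation run.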

In [1]:
from keras.activations import relu, sigmoid, softmax, tanh, selu, elu
import tensorflow as tf
import numpy as np

### Load MNIST Dataset

In [2]:
from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

###BEGIN SOLUTION
# rescale our pixel values between 0 and 1
max_pixel_value = 255
X_train = X_train / max_pixel_value
X_test = X_test / max_pixel_value

# flatten images into row vectors
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
###END SOLUTION



### Normalizing Input Data

Recall from our lesson on Gradient Descent that we need to normalize our input data so that the weights are updated in comparable proportions.

**Hint:** if your dataset's values range across multiple orders of magnitude (e.g. $10^1,~~10^2,~~10^3,~~10^4$), then gradient descent will update the weights in grossly uneven proportions.  
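
As a small illustration of what normalizing looks like in code, the sketch below rescales two made-up features that live on very different scales. The array `X` is hypothetical data; min-max scaling is what we applied to the MNIST pixels above, and standardization is a common alternative.

```python
import numpy as np

# Two hypothetical features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 5000.0],
              [3.0, 9000.0]])

# Min-max scaling: each column is mapped onto [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: each column gets mean 0 and standard deviation 1
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_standard)
```

After either transformation both columns span a comparable range, so no single feature dominates the gradient.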


![](https://quicktomaster.com/wp-content/uploads/2020/08/contour_plot.png)

## 1.1 Hyperparameter Tuning Approaches
For an excellent brief article introducing this subject, see [Comparison of Hyperparameter Tuning algorithms: Grid search, Random search, Bayesian optimization](https://medium.com/analytics-vidhya/comparison-of-hyperparameter-tuning-algorithms-grid-search-random-search-bayesian-optimization-5326aaef1bd1)

### 1.1.1 Babysitting AKA "Student Descent".

If you fiddled with any hyperparameters yesterday, this is basically what you did. This approach is 100% manual and is pretty common among researchers, where finding that one exact specification of hyperparameter values that jumps your model to a level of accuracy never seen before is the difference between publishing and not publishing a paper. Of course the professors don't do this themselves, that's grunt work. This is also known as the "fiddle with hyperparameters until you run out of time" method.

### 1.1.2 Grid Search

Grid Search is the Grad Student galaxy brain realization of: why don't I just specify all the experiments I want to run and let the computer try every possible combination of them while I go and grab lunch. This has a specific downside in that if I specify 5 hyperparameters with 5 options each then I've just created $5^5 = 3125$ combinations of hyperparameters to check -- which means that I have to train $3125$ different versions of my model. Then if I use $5\text{-fold}$ Cross Validation, I need five times as many runs, or $15,625$ runs. This is the brute-force method of hyperparameter tuning, but it can be very profitable if done wisely.
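
The arithmetic above can be checked by enumerating a grid directly. The hyperparameter names and candidate values below are made up for illustration; only the counts matter.

```python
from itertools import product

# Hypothetical grid: 5 hyperparameters with 5 candidate values each
grid = {
    "learning_rate": [0.001, 0.003, 0.01, 0.03, 0.1],
    "batch_size": [16, 32, 64, 128, 256],
    "units": [32, 64, 128, 256, 512],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.5],
    "epochs": [5, 10, 20, 40, 80],
}

# Every combination of one value per hyperparameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

n_folds = 5
print(len(combinations))            # 3125 models to train
print(len(combinations) * n_folds)  # 15625 training runs with 5-fold CV
```

Adding a sixth hyperparameter with five options would multiply both numbers by five again, which is why grids have to be kept small.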

### 1.1.3 Random Search

Do Grid Search for a couple of hours and you'll say to yourself - "There's got to be a better way." <br>
Enter Random Search. For Random Search you specify an interval of values to search for each hyperparameter, and the search algorithm randomly samples hyperparameters from the specified intervals and returns the best results it found.

The downside of Random Search is that it is not guaranteed to find the absolute best hyperparameters, but it is much less costly to perform than Grid Search.
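
Here is a rough sketch of the sampling step, using only the standard library. The intervals, the helper `sample_hyperparameters`, and especially `toy_score` are assumptions made for illustration: `toy_score` is a stand-in for "train a model and return its validation score", not a real objective.

```python
import math
import random

def sample_hyperparameters(rng):
    """Draw one random configuration from hypothetical search intervals."""
    return {
        # learning rate sampled uniformly on a log scale between 1e-4 and 1e-1
        "learning_rate": 10 ** rng.uniform(-4, -1),
        # number of units: a multiple of 32 between 32 and 512
        "units": rng.randrange(32, 513, 32),
        "dropout": rng.uniform(0.0, 0.5),
    }

def toy_score(config):
    """Stand-in for a validation score; a real search would train a model here."""
    return -abs(math.log10(config["learning_rate"]) + 2.0)

rng = random.Random(42)
candidates = [sample_hyperparameters(rng) for _ in range(10)]
best = max(candidates, key=toy_score)
print(best)
```

The budget is set by the number of samples (10 here), not by the size of the search space, which is what makes random search so much cheaper than a full grid.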

### 1.1.4 Bayesian Search

One thing that can make manual search methods like babysitting and gridsearch effective is that as the experimenter sees results they can then make updates to their future searches taking into account the previous results. If only we could hyperparameter tune our hyperparameter tuning! <br><br>
Well, we can if we use Bayesian Optimization. Tuning Neural Network hyperparameters is like an optimization problem within an optimization problem, and Bayesian Optimization is a search strategy that takes into account the results of past searches in order to improve future ones. <br><br>
Bayesian Optimization figures out the most promising regions of hyperparameter space to focus on, so it wastes less time searching through hyperparameter values that are unlikely to lead to improvement.<br><br>
Check out the library `keras-tuner` for easy implementations of Bayesian methods.

If the Bayesian hyperparameter search strategy piques your interest, here's a nice reference article: <br>
[A Conceptual Explanation of Bayesian Hyperparameter Optimization for Machine Learning](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f)



## 1.2 What Hyperparameters are there to tune?

- learning rate
- batch_size
- number of training epochs
- dropout regularization
- number of hidden layers
- number of neurons in each hidden layer
- optimization algorithms
- activation functions
- loss functions

There are more, but these are the most important.

### 1.2.1 Optimizer

Optimizers can also be considered as hyperparameters! <br>
There are a variety of [**optimizers**](https://keras.io/optimizers/) in Keras, and you may want to include several choices in your hyperparameter tuning process.

At some point, take some time to read this article [**An overview of gradient descent optimization algorithms**](https://ruder.io/optimizing-gradient-descent/) by Sebastian Ruder.

The **adam** optimizer usually gives great results, so it's your go-to optimizer. Different optimizers have different hyperparameters (such as learning rate, momentum, etc.), so depending on the optimizer you choose, you might also have to tune its learning rate and momentum.

### 1.2.2 Learning Rate

The Learning Rate is a hyperparameter of your gradient-descent-based optimizer. A learning rate that is too high will cause divergent behavior, but a learning rate that is too low will converge very slowly or stall before reaching a good minimum; you're looking for the sweet spot. Start out tuning learning rates over a coarse grid that spans several orders of magnitude: $[0.001, 0.01, 0.1, 0.2, 0.3, 0.5]$ etc.

Once you have narrowed down your search, make the window even smaller and try again. If after running the above specification your search reports that $0.1$ is the best Learning Rate, then you can bracket $0.1$ by a few other values on either side, say $[0.05, 0.08, 0.1, 0.12, 0.15]$, to try and narrow it down further. If, on the other hand, you find that $0.001$ (or $0.5$) is the best Learning Rate, you definitely need to try more values to the left of $0.001$ (or to the right of $0.5$) in order to explore whether lower (or higher) learning rates could perform even better.

It can also be good to tune the number of epochs in combination with the learning rate since lower learning rates may need more epochs to converge to the minimum.
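
The coarse-to-fine procedure for the learning rate can be sketched with NumPy. The choice of `np.logspace` for the coarse pass and the bracketing multipliers are assumptions for illustration, and the "winner" of the coarse pass is simply hard-coded.

```python
import numpy as np

# Coarse pass: one candidate per order of magnitude
coarse = np.logspace(-4, 0, num=5)  # 1e-4, 1e-3, 1e-2, 1e-1, 1e0

# Suppose (hypothetically) the coarse pass found 0.1 to be the best value
best = 0.1

# Fine pass: bracket the winner with a few nearby values on either side
fine = np.round(best * np.array([0.5, 0.8, 1.0, 1.2, 1.5]), 4)

print(coarse)
print(fine)  # values 0.05, 0.08, 0.1, 0.12, 0.15
```

If the best value of the coarse pass had been the smallest or largest candidate, the next step would be to extend the coarse grid in that direction rather than to zoom in.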

### 1.2.3 Momentum

**Momentum** is an extension of Stochastic Gradient Descent (SGD). SGD is a common optimizer because it's what people understand and know, but it won't always get the best results. You can try adding the momentum option and tuning its hyperparameters to see if you can beat the performance from **Adam**. <br>
In SGD with Momentum, each parameter update depends partly on the previous updates rather than on the current gradient alone. Imagine a marble rolling down one side of a bowl: it builds up speed along directions where the slope stays consistent. Because each update is a running blend of past gradients, momentum smooths the parameter updates. If our marble encounters a rapidly varying region of the loss function, the updates do not change abruptly in response, which damps the oscillations that can make plain SGD bounce back and forth across the minimum.
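
To see the update rule itself, here is a minimal NumPy sketch of SGD with (classical) momentum on a toy loss. The function names and the elongated-bowl loss are made up for illustration; setting `momentum=0.0` recovers plain gradient descent for comparison.

```python
import numpy as np

def sgd_with_momentum(grad_fn, w0, lr=0.02, momentum=0.9, steps=100):
    """Run gradient descent with a velocity term that remembers past gradients."""
    w = np.asarray(w0, dtype=float)
    velocity = np.zeros_like(w)
    for _ in range(steps):
        # the new velocity blends the previous velocity with the current gradient
        velocity = momentum * velocity - lr * grad_fn(w)
        w = w + velocity
    return w

def bowl_gradient(w):
    """Gradient of the toy loss L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return np.array([1.0, 25.0]) * w

start = [1.0, 1.0]
w_plain = sgd_with_momentum(bowl_gradient, start, momentum=0.0)
w_momentum = sgd_with_momentum(bowl_gradient, start, momentum=0.9)

# distance from the minimum at the origin after 100 steps
print(np.linalg.norm(w_plain), np.linalg.norm(w_momentum))
```

On this particular toy problem the momentum run ends up much closer to the minimum than the plain run after the same number of steps; on other problems the best momentum value has to be tuned like any other hyperparameter.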

### 1.2.4  Activation Functions

Typically you would choose the **ReLU** activation function for hidden layers. For the output layers of binary and multi-class classification models, you would choose the **Sigmoid** or **Softmax** activation, respectively. <br>

Be aware that there are [other activation functions available in Keras](https://keras.io/api/layers/activations/) that can potentially improve your results!<br>   
For a brief introduction to some alternate activation functions, read [7 popular activation functions you should know in Deep Learning and how to use them with Keras and TensorFlow 2](https://towardsdatascience.com/7-popular-activation-functions-you-should-know-in-deep-learning-and-how-to-use-them-with-keras-and-27b4d838dfe6). Some of these activation functions (such as PReLU -- Parametric Leaky ReLU) have learnable parameters.<br><br>
The choice of activation function is a hyperparameter whose exploration can potentially pay off in the form of better results.

### **Sigmoid**

![](https://i.stack.imgur.com/inMoa.png)

### **ReLU and variants**


![](https://miro.medium.com/max/2050/1*ypsvQH7kvtI2BhzR2eT_Sw.png)

### Can you code up the ReLU and leaky ReLU functions?

In [3]:
# ReLU: YOUR CODE HERE
def relu(x):
    if x >= 0:
        y = x
    else:
        y = 0
    return y

In [4]:
# Leaky ReLU: YOUR CODE HERE
def leaky_relu(x):
    if x >= 0:
        y = x
    else:
        y = 0.01 * x
    return y
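
The scalar versions above handle one number at a time. In practice activations are applied to whole arrays, so here is a vectorized sketch with NumPy. The names `relu_vec` and `leaky_relu_vec` and the default slope `alpha=0.01` are choices made for this example.

```python
import numpy as np

def relu_vec(x):
    """Element-wise ReLU: negative entries become 0."""
    return np.maximum(0, x)

def leaky_relu_vec(x, alpha=0.01):
    """Element-wise leaky ReLU: negative entries are scaled by alpha."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu_vec(x))        # values 0, 0, 0, 1.5
print(leaky_relu_vec(x))  # values -0.02, -0.005, 0, 1.5
```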

### **Softmax**
The main use of softmax activation in neural nets is to map scores to probabilities.
<br><br><br><br>
![](https://miro.medium.com/max/1906/1*ReYpdIZ3ZSAPb2W8cJpkBg.jpeg)

The output layer is a vector of numbers ${z_i}$

In [5]:
# Suppose we have a set of K = 5 outputs z_i
outputs = np.array([1.3, 5.1, 2.2, 0.7, 1.1])

Transform the outputs to ${\exp{z_i}}$

In [6]:
np.exp(outputs).round(3)

array([  3.669, 164.022,   9.025,   2.014,   3.004])

The ratio $\frac{{\exp{z_i}}}{\sum_{j=1}^{5}\exp{z_j}}$ is the probability of the $ith$ class

In [7]:
(np.exp(outputs) / np.exp(outputs).sum()).round(2)

array([0.02, 0.9 , 0.05, 0.01, 0.02])

In [8]:
# implement softmax
def softmax(scores):
    return (np.exp(scores) / np.exp(scores).sum())

softmax(np.array([1.3, 5.1, 2.2, 0.7, 1.1])).round(2)

array([0.02, 0.9 , 0.05, 0.01, 0.02])
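
The `softmax` above exponentiates the raw scores directly, which can overflow when scores are large. A common remedy, sketched here under the name `stable_softmax` (a name chosen for this example), is to subtract the maximum score first. This leaves the result unchanged because the shift cancels in the ratio.

```python
import numpy as np

def stable_softmax(scores):
    """Softmax that subtracts the max score before exponentiating."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()  # largest entry becomes 0, so exp() cannot overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

# Same probabilities as before for the example scores
probs = stable_softmax([1.3, 5.1, 2.2, 0.7, 1.1])
print(probs.round(2))

# Large scores no longer overflow
big = stable_softmax([1000.0, 1001.0, 1002.0])
print(big.sum())
```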

#### Softmax transforms a list of numbers into probabilities.
Softmax applies the standard exponential function to each element $z_{i}$ of the input vector $\mathbf {z}$ <br> and normalizes these values by dividing by the sum of the exponentials. <br><br>


Exponentiation maps the real line onto the positive half line.<br>
Normalization maps the positive half-line onto the unit interval $[0,1]$ <br>
Normalization also ensures that the sum of the components of the output vector ${\text{softmax} (\mathbf {z} )}$ is $1$.<br><br>

Because the transformed output values satisfy these properties, each one can be interpreted as a probability:<br>

$$ \text{Probability} = \frac{\text{part}}{\text{whole}}$$<br><br>
The $ith$ component of the Softmax mapping is the predicted probability for the $ith$ class.
${\displaystyle \text{Softmax} (\mathbf {z} )_{i}={\frac {e^{z_{i}}}{\sum _{j=1}^{K}e^{z_{j}}}}{\text{ for }}i=1,\dotsc ,K{\text{ and }}\mathbf {z} =(z_{1},\dotsc ,z_{K})\in \mathbb {R} ^{K}}$


**Take Away:** Softmax is a multi-dimensional generalization of the Sigmoid. In a binary classification problem, the Sigmoid calculates the probability for a class. The probability can be converted to a class prediction by application of a threshold value, or by choosing the class with the highest probability, as we have seen previously. <br>

The Softmax calculates a probability for each of a set of classes. The predicted class is the class with the highest softmax probability.

### 1.2.5 Cross Entropy Loss functions

You also have the freedom to select various types of loss functions. [**Keras has a library of loss functions that you can select from**](https://keras.io/api/losses/probabilistic_losses/#categorical_crossentropy-function). <br><br>
A common class of loss functions to use in multi-class classification tasks are the [Cross Entropy Loss functions](https://ztlevi.gitbook.io/ml-101/loss/cross_entropy_loss). <br>
They come in several different formulations. Let's take a look at the basic idea that underpins them all.

Cross-Entropy loss for a data point is defined as the *negative log of the probability that is predicted for the correct class*. <br><br>
The overall Cross-Entropy loss for a batch (or mini-batch) of data points is the sum of the Cross-Entropy losses of the individual data points in the batch (or mini-batch); in practice, frameworks such as Keras typically report the mean over the batch instead, which differs only by a constant factor.<br><br>
For a binary classification problem, the probability of the positive class is predicted using a Sigmoid activation function, while in a multi-class classification problem, class probabilities are predicted using a Softmax activation function.

Cross-Entropy Loss is a measure of how close the predicted target values are to the true target values. Its minimum possible value is zero, indicating that the predictions agree perfectly with the true targets.<br><br>

In the usual case of Multi-Class classification, the labels are One-Hot encoded.

If the labels are One-Hot encoded, the target vector has only one nonzero element -- the one that corresponds to the correct class.<br><br>
#### Example: <br>
For a 5-class classification problem, an example that is in class 1 (0-based indexing) has the following One-Hot encoded target vector:

$y = [0, 1, 0, 0, 0]$ <br><br>

#### Example:  <br>
Suppose we have a $K$-class classification problem, and one of the data points is in the $jth$ class. <br>

The cross-entropy loss (CE) for this data point is defined as:<br>

$ \text{CE}~=~-\sum_{i=0}^{K-1}y_i\log{p_i} = -\log{p_j}$ <br><br>

In the summation, only the term corresponding to the $jth$ (correct) class survives, because $y_j=1$, and all the other $y_i$ are zeros.<br><br>

Notice that the closer $p_j$ gets to $1$, the better our model has done in classifying this example and the closer the cross-entropy loss gets to zero.<br><br>

Binary Cross-Entropy Loss (BCE) is a special case of Cross-Entropy Loss for the case of $K=2$ (i.e. binary classification): <br>

$\text{BCE} = -\mathbb{1}[y \in \text{class 0}]\log p_0 - \mathbb{1}[y \in \text{class 1}]\log p_1$, where $\mathbb{1}[\cdot]$ equals $1$ if the condition holds and $0$ otherwise.<br><br>

Note that an example must be in one of the two classes, so the two predicted probabilities sum to one:<br><br>
$p_0 = 1-p_1$, and <br>


$\text{BCE} =  { \begin{cases}-\log p_0 = -\log (1-p_1)& {\text{if }}\ y \in {\text{class 0}} \\  - \log p_1& {\text{if }}\ y \in {\text{class 1}} \\ \end{cases}}$<br><br>
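
The formulas above can be checked by hand with a few lines of NumPy. This is a sketch for illustration, not the Keras implementation: the function names and the small clipping constant `eps` (used to avoid `log(0)`) are assumptions.

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    """CE = -sum_i y_i * log(p_i) for a one-hot target vector y_true."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)
    return -np.sum(y_true * np.log(p_pred))

def binary_cross_entropy(y, p1, eps=1e-12):
    """BCE for a label y in {0, 1} and predicted probability p1 of class 1."""
    p1 = np.clip(p1, eps, 1.0 - eps)
    return -np.log(p1) if y == 1 else -np.log(1.0 - p1)

# 5-class example: the correct class is index 1 and the model assigns it 0.9
y = [0, 1, 0, 0, 0]
p = [0.02, 0.90, 0.05, 0.01, 0.02]
ce = cross_entropy(y, p)
print(round(ce, 3))  # -log(0.9), about 0.105

# Binary example: confident and right vs. confident and wrong
print(round(binary_cross_entropy(1, 0.9), 3))  # about 0.105
print(round(binary_cross_entropy(0, 0.9), 3))  # -log(0.1), about 2.303
```

Note how a confident wrong prediction is penalized far more heavily than a confident correct one is rewarded.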

#### Computing the cross-entropy loss function with Keras

In [9]:
# Example: 5 class problem [red, green, blue, orange, yellow]
# these are one-hot encoded labels
y_true = [[0, 1, 0, 0, 0], # one-hot encoded label (this example is green)
          [1, 0, 0, 0, 0]] # one-hot encoded label (this example is red)

# these are probabilities calculated by Softmax
y_pred = [[0., 1., 0., 0., 0.],
          [0.9, 0.02 , 0.05, 0.01, 0.02]]

# if labels are one-hot encoded, you use categorical_crossentropy
# if the labels are encoded as single digits, use sparse_categorical_crossentropy
#   recall that is how we have been doing things
loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
CE_loss = loss.numpy().round(3)
CE_loss

array([0.   , 0.105], dtype=float32)

In [10]:
y_true = [[0, 1, 0 , 0, 0]]

# input into softmax
z = tf.constant([[1.3, 5.1, 2.2, 0.7, 1.1]])

# same output as in the image above
y = softmax(z)
y_pred = y.round(decimals=2)
y_pred

array([[0.02, 0.9 , 0.05, 0.01, 0.02]], dtype=float32)

In [11]:
# pass one-hot encoded labels (y_true) and the softmax probabilities (y_pred) into CE
loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
loss.numpy().round(3)

array([0.105], dtype=float32)

In [12]:
p1 = 0.9
-1 * np.log(p1).round(3)

0.105

### 1.2.6 Network Weight Initialization

![](https://i.pinimg.com/originals/89/f9/bd/89f9bddacf547661dfc209d4b31c2c12.png)

**Recall** from our **Gradient Descent** lecture that how we initialize our model weights can determine whether Gradient Descent converges towards a local minimum or a global minimum!

[Keras has documentation on initializer options](https://keras.io/api/layers/initializers/)

`init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']`

If you wish to dive into a deep analysis of varying weight initializers and their effect on model performance, read this article [Hyper-parameters in Action! Part II — Weight Initializers](https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404) by Daniel Godoy.

**Take Away:** Include a few different weight initializers in your gridsearch.
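
To give a feel for what two of these initializers do, here is a NumPy sketch of the sampling rules behind `glorot_uniform` and `he_normal` for a layer with 784 inputs and 32 units. It is an illustration of the formulas only: the Keras `he_normal` initializer draws from a truncated normal, whereas this sketch uses an ordinary normal for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 32  # inputs and units of the first Dense layer

# Glorot (Xavier) uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))
limit = np.sqrt(6.0 / (fan_in + fan_out))
w_glorot = rng.uniform(-limit, limit, size=(fan_in, fan_out))

# He normal: zero-mean normal with std = sqrt(2 / fan_in)
# (plain normal here; an assumption made to keep the sketch short)
std = np.sqrt(2.0 / fan_in)
w_he = rng.normal(0.0, std, size=(fan_in, fan_out))

print(limit, np.abs(w_glorot).max())
print(std, w_he.std())
```

Both rules shrink the initial weights as the layer gets wider, which keeps the scale of the activations roughly stable from layer to layer at the start of training.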

### 1.2.7 Regularization
In general, Regularization methods make your model more robust against overfitting.

There are various types of regularization that you can gridsearch for your model as well. <br>
This includes **Dropout** and **Weight Constraints** (a close relative of the weight penalties you might know as **L2 regularization**). <br>

We will do a deep dive into both **Dropout** and **Weight Constraints** in the next Module (4), so we'll hold off discussing them until then.

### 1.2.8 Number of Hidden Layer Neurons

Remember that when we only had a single perceptron, our model was only able to fit linearly separable data, but by adding multiple layers and nodes we can build a network that is a powerhouse at fitting nonlinearity in data. Generally, the larger the network and the more nodes it has, the stronger its capacity to fit nonlinear patterns in data. However, the more nodes and layers a network has, the longer it takes to train and the higher the risk of overfitting. The larger your network gets, the more you'll need dropout or other regularization techniques to keep it in check.

Typically, depth (more layers) is more important than width (more nodes) for neural networks. This is part of why Deep Learning is so highly touted. Certain deep learning architectures have truly been huge breakthroughs for certain machine learning tasks.
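
One way to get a feel for the cost of width versus depth is to count trainable parameters. The helper below is a hypothetical convenience function for fully connected networks: each Dense layer has one weight per input-output pair plus one bias per output.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the input size followed by the size of every Dense layer.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# One wide hidden layer vs. three narrower hidden layers on MNIST-sized input
wide = dense_param_count([784, 512, 10])
deep = dense_param_count([784, 128, 128, 128, 10])

print(wide)  # 407050
print(deep)  # 134794
```

In this example the deeper network has roughly a third of the parameters of the wide one, so adding depth does not necessarily mean a larger model.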

-----
## 1.3 Brute Force Hyperparameter Gridsearch with sklearn's `GridSearchCV` (Learn)

In [13]:
import numpy as np
import pandas as pd

from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense

from scikeras.wrappers import KerasClassifier


In [None]:
# Function to create model, required for KerasClassifier
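def create_model(units=32, optimizer='adam', activation='sigmoid'):
    """
    Returns a compiled keras model

    Parameters
    ----------
    units: int
        number of neurons/nodes/units to use in the hidden layer
    optimizer: str
        name of the optimizer used to compile the model
    activation: str
        activation function of the hidden layer

    Returns
    -------
    model: keras object
    """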

    model = Sequential()
    model.add(Dense(units, input_dim=784, activation=activation))
    model.add(Dense(10, activation='softmax'))

    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])

    return model

In [None]:
# Create a Scikit-Learn wrapper around our Keras Model
# so that they play nicely together
# KerasClassifier needs a model creation function that returns a compiled model
model = KerasClassifier(build_fn=create_model)

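Models wrapped with `KerasClassifier` expose a different set of properties and methods than a plain Keras model,<br>
although some are the same. For example, `model.summary()` throws an error on the wrapper. Inspect the wrapped model to see what it offers: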

In [None]:
model

In [None]:
dir(model)

### 1.3.1 Create a dictionary with a grid of hyperparameter values to be searched

In [None]:
# define the search grid of hyperparameters
# note "units" is the number of neurons
param_grid = {'batch_size': [32],
              'epochs': [5],
              'units': [64, 128, 512],
              'optimizer': ['adam'],
              'activation': ['sigmoid', 'relu']}

### 1.3.2 Use `GridSearchCV` to perform the hyperparameter grid search
as we have done previously in Sprint 1<br>
Takes ~$2$ or $3$ min on Colab with GPU

In [None]:
%%time
# Create Grid Search
model = KerasClassifier(build_fn=create_model)

grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-2, verbose=2, cv=3)
grid_result = grid.fit(X_train, y_train)

# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Means: {mean}, Stdev: {stdev} with: {param}")

## Challenge
You will be expected to tune several hyperparameters in today's module project.

# 2. Experiment Tracking Frameworks (Learn)
<a id="p2"></a>

## Overview

You will notice that managing the results of all the experiments you are running becomes challenging. Which set of parameters did the best? Which code did I use? Are my results today different than my results yesterday? Although we work with Jupyter Notebooks, this format is not well suited to logging experimental results. <br><br>
Enter experiment tracking frameworks like [Comet.ml](https://comet.ml) and [Weights and Biases](https://wandb.ai/), and [TensorBoard's Hyperparameter Dashboard](https://www.tensorflow.org/tensorboard/hyperparameter_tuning_with_hparams)!

Those tools will help you track your experiments, store the results, and the code associated with those experiments. Experimental results can also be readily visualized to see changes in performance across any metric you care about. Data is sent to the tool as each epoch is completed, so you can also see if your model is converging in real time. Let's check out TensorBoard today.

## 2.1 TensorBoard Hyperparameter Dashboard (Follow Along)
To understand the code in this section, <br>
read [Hyperparameter Tuning with the HParams Dashboard](https://www.tensorflow.org/tensorboard/hyperparameter_tuning_with_hparams)

In [None]:
%load_ext tensorboard

In [None]:
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

import os
import datetime

In [None]:
# we are mainly using hp to initialize the ranges of possible hyper-parameter values
# use dir(hp) to check its methods, classes and attributes
#      you could also type `hp.` in a code cell and hover the cursor over the period
dir(hp)

### 2.1.1 Create Experiment Configuration
We are going to experiment with:
* Number of units (neurons) in the first dense layer
* Learning Rates
* Optimizers

In [None]:
# let's use Hparams Dashboard
HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16,32]))
HP_LEARNING_RATE = hp.HParam('learning_rate', hp.RealInterval(0.001,.01))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))

In [None]:
METRIC_ACCURACY = 'accuracy'
GRID_SEARCH_RESULTS_DIR = 'logs/hparam_tuning'

# creating a dir to save/log our gridsearch results for use with tensorboard
# this is a "context manager" in python
with tf.summary.create_file_writer(GRID_SEARCH_RESULTS_DIR).as_default():

    hp.hparams_config(
        # store h-params and their values  in a list
        hparams=[HP_NUM_UNITS, HP_LEARNING_RATE, HP_OPTIMIZER],

        # store metrics to score the model
        metrics=[hp.Metric(METRIC_ACCURACY, display_name='Accuracy')]
    )

### 2.1.2 Adapt the model to accept hyperparameter values from a dictionary `hparams`
This function
- builds a model with a set of hyperparameter values specified in the `hparams` dictionary
- fits the model
- evaluates the fitted model on a test set

In [None]:
def train_test_model(hparams):
  # Hyperparameters
  # Sequential() Model
  # hparams is a standard python dictionary of hyperparameters with keys and values

    model = tf.keras.Sequential([

    # 1st layer in model
    tf.keras.layers.Dense(hparams[HP_NUM_UNITS],
                          activation='relu'),
    # output layer
    tf.keras.layers.Dense(10,
                          activation='softmax')
    ])

    # get optimizer from param dict
    opt_name = hparams[HP_OPTIMIZER]

    # get learning_rate for optimizer
    lr = hparams[HP_LEARNING_RATE]

    # There is a better way to perform the actions that are being done in this block of code
    # see https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/deserialize
    if opt_name == 'adam':
        # instantiate the Adam optimizer and set its learning rate
        opt = tf.keras.optimizers.Adam(learning_rate=lr)
    elif opt_name == 'sgd':
        # instantiate the SGD optimizer and set its learning rate
        opt = tf.keras.optimizers.SGD(learning_rate=lr)
    else:
        raise ValueError(f"unexpected optimizer name: {opt_name}")

    model.compile(
          optimizer=opt,
          loss='sparse_categorical_crossentropy',
          metrics=['accuracy']
    )

    model.fit(X_train, y_train, epochs=2)

    _, accuracy = model.evaluate(X_test, y_test)

    return accuracy

### 2.1.3 For each run, log an `hparams` summary with the hyperparameter values and final accuracy.

In [None]:
def run(run_dir, hparams):

    with tf.summary.create_file_writer(run_dir).as_default():
        # record the values used in this trial
        hp.hparams(hparams)

        # call train_test_model to build, train, and score model on parameter values
        accuracy = train_test_model(hparams)

        # store trained accuracy to file
        tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)

This is the main loop, so start reading the code from here.

In [None]:
%%time

session_num = 0
for num_units in HP_NUM_UNITS.domain.values:
    for learning_rate in (HP_LEARNING_RATE.domain.min_value, HP_LEARNING_RATE.domain.max_value):
        for optimizer in HP_OPTIMIZER.domain.values:

            # as we loop through all the hyper-param values
            #     store each unique combination in the dictionary hparams
            hparams = {
               HP_NUM_UNITS: num_units,
               HP_LEARNING_RATE: learning_rate,
               HP_OPTIMIZER: optimizer
            }

            run_name = f"run-{session_num}"
            print(f"--- Starting trial: {run_name}")
            print({h.name: hparams[h] for h in hparams})

            # execute the run function, which runs the training of the models
            run('logs/hparam_tuning/' + run_name, hparams)
            session_num += 1

### 2.1.4 Visualize the Results

In [None]:
# run tensorboard in the Colab notebook
%tensorboard --logdir=logs/hparam_tuning/ --host localhost --port 8088

### 2.1.5 Your Turn

Pick a few hyperparameters that we *have not* tuned. Using the above code as a template, try changing a few parameters you're interested in.

In [None]:
##YOUR CODE HERE

## Challenge

In today's module assignment, you will be expected to use TensorFlow's HParams API along with TensorBoard to implement and visualize the results of hyperparameter tuning scenarios for neural network models.

# 3. Hyperparameter Tuning with RandomSearch (Learn)
<a id="p3"></a>

## Overview

`GridSearchCV` can take a long time to systematically explore a hyperparameter search space. You'll want to adopt a more sophisticated strategy such as Random Search.

Let's see how to do this with [`keras-tuner`](https://keras.io/keras_tuner/).

## 3.1 Install `keras-tuner`

In [None]:
!pip install keras-tuner

## 3.2 Perform Random Search with `keras-tuner` (Follow Along)

### 3.2.1 Set up the hyperparameter search space

In [None]:
from tensorflow import keras
from keras import layers
from keras_tuner import RandomSearch

"""
This model Tunes:
- Number of Neurons in the Hidden Layer
- Learning Rate in Adam

"""

def build_model(hp):

    model = keras.Sequential()
    model.add(layers.Dense(units=hp.Int('units',
                                        min_value=32,
                                        max_value=512,
                                        step=32),
                           activation='relu'))
    model.add(layers.Dense(10, activation='softmax'))
    model.compile(
      optimizer=keras.optimizers.Adam(
          hp.Choice('learning_rate',
                    values=[1e-1, 1e-2, 1e-3])),
      loss='sparse_categorical_crossentropy',
      metrics=['accuracy']
    )

    return model
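
It can help to count how large the search space defined in `build_model` is before deciding on a trial budget. The sketch below just redoes that count in plain Python; the budget of 5 trials is an example value.

```python
# Candidate values implied by hp.Int('units', min_value=32, max_value=512, step=32)
units_options = list(range(32, 512 + 1, 32))

# Candidate values given to hp.Choice('learning_rate', ...)
learning_rates = [1e-1, 1e-2, 1e-3]

n_combinations = len(units_options) * len(learning_rates)
print(len(units_options))  # 16
print(n_combinations)      # 48

# Fraction of the space that a budget of 5 random trials would cover
max_trials = 5
coverage = max_trials / n_combinations
print(round(coverage, 3))  # 0.104
```

A random search with a handful of trials therefore looks at only about a tenth of this space, which is the trade-off it makes for speed.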

### 3.2.2 Set up the `RandomSearch()` tuner
to try $5$ randomly chosen hyperparameter combinations (`max_trials=5`), training each combination $3$ times (`executions_per_trial=3`) to average out run-to-run variance.

In [None]:
tuner = RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    executions_per_trial=3,
    directory='./keras-tuner-trial',
    project_name='mnist')

#### Check that the search space corresponds with what you ordered

In [None]:
tuner.search_space_summary()

### 3.2.3 Run the Random Search
use $5$ epochs for each run<br>
This takes ~$9$ min on Colab with GPU

In [None]:
%%time
tuner.search(X_train, y_train,
             epochs=5,
             validation_data=(X_test, y_test))

### 3.2.4 Report the search results

In [None]:
tuner.results_summary()

## Challenge

In your module project today, you will apply `RandomSearch` and `BayesianOptimization` using `keras-tuner`!

# Review
* <a href="#p1">Part 1</a>: Describe the major hyperparameters to tune
    - Activation Functions
    - Optimizer
    - Number of Layers
    - Number of Neurons
    - Batch Size
    - Dropout Regularization
    - Learning Rate
    - Number of Epochs
    - and many more
* <a href="#p2">Part 2</a>: Implement an experiment tracking framework
    - By Hand: GridSearchCV
    - TensorBoard with Hparams Dashboard
    - Stretch topic: other experiment tracking frameworks
    > [Weights & Biases](https://wandb.ai/site)<br>
    > [Comet.ml](https://www.comet.ml/site/)<br>
    > [neptune.ai](https://neptune.ai/)<br>

* <a href="#p3">Part 3</a>: Search the hyperparameter space using Keras-Tuner
    - Random Search
    - Bayesian Search
    - Stretch Topic: [Advanced Hyperparameter Optimization Techniques](https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms)
    > Hyperband<br>
    > BOHB (Bayesian Optimization + Hyperband)<br>

# Sources

## Additional Reading
- [How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras](https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/)
- [Practical Guide to Hyperparameters Optimization for Deep Learning Models](https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/)
- [ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It](https://neptune.ai/blog/ml-experiment-tracking)
- [Dropout Regularization in Deep Learning Models With Keras](https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/)
- [A Gentle Introduction to Weight Constraints in Deep Learning](https://machinelearningmastery.com/introduction-to-weight-constraints-to-reduce-generalization-error-in-deep-learning/)
- [How to Configure the Number of Layers and Nodes in a Neural Network](https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/)
