<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_08_4_bayesian_hyperparameter_opt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Architecting Network: Hyperparameters

The number of layers, neuron counts per layer, layer types, and activation functions are all choices you must make to optimize your neural network. The hyperparameters you can choose from fall into the following categories:

* Number of Hidden Layers and Neuron Counts
* Activation Functions
* Advanced Activation Functions
* Regularization: L1, L2, Dropout
* Batch Normalization
* Training Parameters

## Number of Hidden Layers and Neuron Counts

* **Activation** - You can also add activation functions as layers.  Using the activation layer is the same as specifying the activation function as part of a Dense (or other) layer type.
* **ActivityRegularization** - Used to add L1/L2 regularization outside of a layer. You can also specify L1 and L2 as part of a Dense (or other) layer type.
* **Dense** - The original neural network layer type. In this layer type, every neuron connects to every neuron in the next layer. The input vector is one-dimensional, and placing particular inputs next to each other has no effect.
* **Dropout** - Dropout consists of randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting. Dropout only occurs during training.
* **Flatten** - Flattens the input to 1D and does not affect the batch size.
* **Input** - Instantiates a Keras tensor: a tensor object from the underlying back end (Theano, TensorFlow, or CNTK) that Keras augments with the attributes needed to build a model from just its inputs and outputs.
* **Lambda** - Wraps arbitrary expression as a Layer object.
* **Masking** - Masks a sequence using a mask value to skip timesteps.
* **Permute** - Permutes the input dimensions according to a given pattern. Useful for tasks such as connecting RNNs and convolutional networks.
* **RepeatVector** - Repeats the input n times.
* **Reshape** - Similar to Numpy reshapes.
* **SpatialDropout1D** - This version performs the same function as Dropout; however, it drops entire 1D feature maps instead of individual elements. 
* **SpatialDropout2D** - This version performs the same function as Dropout; however, it drops entire 2D feature maps instead of individual elements.
* **SpatialDropout3D** - This version performs the same function as Dropout; however, it drops entire 3D feature maps instead of individual elements. 

Choosing a good number of neurons and hidden layers always involves trial and error. Generally, the number of neurons on each layer will be larger closer to the input layer and smaller towards the output layer. This configuration gives the neural network a somewhat triangular or trapezoidal appearance.

## Activation Functions

Activation functions are a choice that you must make for each layer. Generally, you can follow this guideline:
* Hidden Layers - ReLU
* Output Layer - Softmax for classification, linear for regression.

Some of the common activation functions in Keras are listed here:

* **softmax** - Used for multi-class classification.  Ensures all output neurons behave as probabilities and sum to 1.0.
* **elu** - Exponential Linear Unit (ELU). Tends to converge the cost toward zero faster and can produce more accurate results. Can produce negative outputs.
* **selu** - Scaled Exponential Linear Unit (SELU), essentially **elu** multiplied by a scaling constant.
* **softplus** - Softplus activation function. $\log(\exp(x) + 1)$  [Introduced](https://papers.nips.cc/paper/1920-incorporating-second-order-functional-knowledge-for-better-option-pricing.pdf) in 2001.
* **softsign** - Softsign activation function. $x / (|x| + 1)$ Similar to tanh, but not widely used.
* **relu** - Very popular neural network activation function.  Used for hidden layers, cannot output negative values. No trainable parameters.
* **tanh** - Classic neural network activation function, though often replaced by the relu family in modern networks.
* **sigmoid** - Classic neural network activation.  Often used on output layer of a binary classifier.
* **hard_sigmoid** - Less computationally expensive variant of sigmoid.
* **exponential** - Exponential (base e) activation function.
* **linear** - Pass-through activation function. Usually used on the output layer of a regression neural network.
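
To make the formulas in the list above concrete, here is a minimal pure-Python sketch of several of these activation functions. It is not how Keras implements them (Keras operates on tensors); the function names simply mirror the Keras activation names, and the `alpha=1.0` default for `elu` is an assumption chosen for illustration.

```python
import math

def relu(x):
    # Negative inputs are clipped to zero
    return max(0.0, x)

def elu(x, alpha=1.0):
    # Behaves like relu for x > 0, but saturates toward -alpha for x < 0
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def softplus(x):
    # log(exp(x) + 1), a smooth approximation of relu
    return math.log(math.exp(x) + 1.0)

def softsign(x):
    # x / (|x| + 1), bounded in (-1, 1) like tanh
    return x / (abs(x) + 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    # Subtract the maximum before exponentiating for numerical stability
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), elu(x), softplus(x), softsign(x), sigmoid(x))

probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))  # the softmax outputs sum to 1.0
```

Note that `relu` never returns a negative value, while `elu` and `softsign` do for negative inputs.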

For more information about Keras activation functions refer to the following:

* [Keras Activation Functions](https://keras.io/activations/)
* [Activation Function Cheat Sheets](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html)


### Advanced Activation Functions

Hyperparameters are not changed when the neural network trains. You, the network designer, must define the hyperparameters. The neural network learns regular parameters during training. Neural network weights are the most common type of regular parameter. The "[advanced activation functions](https://keras.io/layers/advanced-activations/)," as Keras calls them, also contain parameters that the network will learn during training. These activation functions may give you better performance than ReLU.

* **LeakyReLU** - Leaky version of a Rectified Linear Unit. It allows a small gradient when the unit is not active, controlled by the alpha hyperparameter.
* **PReLU** - Parametric Rectified Linear Unit. Works like LeakyReLU, but the network learns alpha as a parameter during training instead of you fixing it as a hyperparameter.
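
The difference between the two can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it works on single scalars, the `alpha` values are arbitrary, and the helper names `prelu_grad_alpha` and `upstream_grad` are hypothetical and not part of Keras.

```python
def leaky_relu(x, alpha=0.3):
    # alpha is fixed by the designer (a hyperparameter)
    return x if x >= 0 else alpha * x

def prelu(x, alpha):
    # Same shape as leaky_relu, but alpha is a trainable parameter
    return x if x >= 0 else alpha * x

def prelu_grad_alpha(x):
    # Derivative of prelu(x, alpha) with respect to alpha
    return 0.0 if x >= 0 else x

# One illustrative gradient-descent step on alpha for a single input.
alpha = 0.25           # initial value, chosen arbitrarily
learning_rate = 0.01
x = -2.0
upstream_grad = 1.0    # hypothetical gradient of the loss w.r.t. the output
alpha = alpha - learning_rate * upstream_grad * prelu_grad_alpha(x)

print(leaky_relu(x), prelu(x, alpha), alpha)
```

Because the gradient with respect to alpha is zero for non-negative inputs, only inactive units contribute to the update of alpha.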

## Regularization: L1, L2, Dropout


* [Keras Regularization](https://keras.io/regularizers/)
* [Keras Dropout](https://keras.io/layers/core/)
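
As a rough illustration of the three techniques named in this section's heading, the following pure-Python sketch computes L1 and L2 weight penalties and applies dropout to a list of activations. It assumes the common "inverted dropout" convention, in which the surviving units are scaled by `1 / (1 - rate)` during training; the function names and the sample weights are made up for this example.

```python
import random

def l1_penalty(weights, factor):
    # L1 adds factor * sum(|w|) to the loss
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor):
    # L2 adds factor * sum(w^2) to the loss
    return factor * sum(w * w for w in weights)

def dropout(inputs, rate, training=True, rng=random):
    # Dropout is only active during training
    if not training or rate == 0.0:
        return list(inputs)
    keep = 1.0 - rate
    # Zero each unit with probability `rate`; scale survivors by 1/keep
    return [x / keep if rng.random() >= rate else 0.0 for x in inputs]

weights = [1.0, -2.0]
print(l1_penalty(weights, 0.01), l2_penalty(weights, 0.01))

rng = random.Random(0)
print(dropout([1.0] * 8, rate=0.5, rng=rng))       # training: zeros and 2.0s
print(dropout([1.0] * 8, rate=0.5, training=False))  # inference: unchanged
```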

## Batch Normalization

* [Keras Batch Normalization](https://keras.io/layers/normalization/)

Batch normalization normalizes the activations of the previous layer at each batch; that is, it applies a transformation that keeps the mean activation close to 0 and the activation standard deviation close to 1. This can allow the learning rate to be larger.
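
The core transformation can be sketched for a single feature across one batch as follows. This is a simplified illustration, not the Keras layer: a real batch normalization layer also learns the scale (`gamma`) and shift (`beta`) values and tracks moving averages for use at inference time. The `epsilon` value here is an assumption chosen for the example.

```python
import math

def batch_normalize(batch, gamma=1.0, beta=0.0, epsilon=1e-3):
    # Mean and (population) variance of this batch
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    # Shift to zero mean, scale to roughly unit standard deviation
    return [gamma * (v - mean) / math.sqrt(var + epsilon) + beta
            for v in batch]

normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])
new_mean = sum(normalized) / len(normalized)
new_std = math.sqrt(sum((v - new_mean) ** 2 for v in normalized)
                    / len(normalized))
print(normalized)
print(new_mean, new_std)  # mean near 0, standard deviation near 1
```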


## Training Parameters

* [Keras Optimizers](https://keras.io/optimizers/)

* **Batch Size** - Usually small, such as 32 or so.
* **Learning Rate**  - Usually small, 1e-3 or so.


In [1]:
# Startup Google CoLab
try:
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

Note: using Google CoLab


# Bayesian Hyperparameter Optimization for Keras

Bayesian Hyperparameter Optimization is a method of finding hyperparameters more efficiently than a grid search. Because each candidate set of hyperparameters requires a retraining of the neural network, it is best to keep the number of candidate sets to a minimum. Bayesian Hyperparameter Optimization achieves this by training a model to predict good candidate sets of hyperparameters. [[Cite:snoek2012practical]](https://arxiv.org/pdf/1206.2944.pdf)

* [bayesian-optimization](https://github.com/fmfn/BayesianOptimization)
* [hyperopt](https://github.com/hyperopt/hyperopt)
* [spearmint](https://github.com/JasperSnoek/spearmint)
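
To get a feel for why keeping the number of candidate sets small matters, the short sketch below counts the training runs a grid search would need. The figure of ten values per hyperparameter is an assumption made up for this comparison; the four hyperparameters and the budget of 30 evaluations (10 random plus 20 guided) match the optimization run later in this section.

```python
from itertools import product

values_per_hyperparameter = 10   # assumed grid resolution
num_hyperparameters = 4          # dropout, learning rate, and two structure values

# Every grid point is one full retraining of the network
grid = list(product(range(values_per_hyperparameter),
                    repeat=num_hyperparameters))
grid_trainings = len(grid)

# Budget used by the Bayesian optimization run in this section
bayes_trainings = 10 + 20

print(f"Grid search trainings: {grid_trainings}")
print(f"Bayesian optimization trainings: {bayes_trainings}")
```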

In [2]:
# Ignore useless W0819 warnings generated by TensorFlow 2.0.  
# Hopefully can remove this ignore in the future.
# See https://github.com/tensorflow/tensorflow/issues/31308
import logging, os
logging.disable(logging.WARNING)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import pandas as pd
from scipy.stats import zscore

# Read the data set
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

# Generate dummies for job
df = pd.concat([df,pd.get_dummies(df['job'],prefix="job")],axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df,pd.get_dummies(df['area'],prefix="area")],axis=1)
df.drop('area', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product']) # Classification
products = dummies.columns
y = dummies.values

Now that we've preprocessed the data, we can begin the hyperparameter optimization.  We start by creating a function that generates the model based on just three parameters.  Bayesian optimization works on a vector of numbers, not on a problematic notion like how many layers and neurons are on each layer.  To represent this complex structure as a vector, we describe it with the following three numbers:

* **dropout** - The dropout percent for each layer.
* **neuronPct** - What percent of our fixed maximum of 5,000 neurons do we wish to use?  In the code below, this parameter sets the neuron count of the first hidden layer.
* **neuronShrink** - Neural networks usually start with more neurons on the first hidden layer and then decrease this count for additional layers.  This percent specifies how much to shrink each subsequent layer relative to the previous layer.  We stop adding layers once the neuron count falls to 25 or fewer, or once the network has ten hidden layers.

These three numbers define the structure of the neural network.  The comments in the code below show exactly how the program constructs the network.

In [3]:
import pandas as pd
import os
import numpy as np
import time
import tensorflow.keras.initializers
import statistics
import tensorflow.keras
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, InputLayer
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import ShuffleSplit
from tensorflow.keras.layers import LeakyReLU,PReLU
from tensorflow.keras.optimizers import Adam

def generate_model(dropout, neuronPct, neuronShrink):
    # We start with some percent of 5000 starting neurons on 
    # the first hidden layer.
    neuronCount = int(neuronPct * 5000)
    
    # Construct neural network
    model = Sequential()

    # So long as the next layer would have more than 25 neurons and 
    # the network has fewer than 10 layers, create a new layer.
    layer = 0
    while neuronCount>25 and layer<10:
        # The first (0th) layer needs input_dim set to the number of inputs
        if layer==0:
            model.add(Dense(neuronCount, 
                input_dim=x.shape[1], 
                activation=PReLU()))
        else:
            model.add(Dense(neuronCount, activation=PReLU())) 
        layer += 1

        # Add dropout after each hidden layer
        model.add(Dropout(dropout))

        # Shrink neuron count for each layer
        neuronCount = neuronCount * neuronShrink

    model.add(Dense(y.shape[1],activation='softmax')) # Output
    return model

We can test this code to see how it creates a neural network based on three such parameters.

In [4]:
# Generate a model and see what the resulting structure looks like.
model = generate_model(dropout=0.2, neuronPct=0.1, neuronShrink=0.25)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 500)               24500     
                                                                 
 dropout (Dropout)           (None, 500)               0         
                                                                 
 dense_1 (Dense)             (None, 125)               62750     
                                                                 
 dropout_1 (Dropout)         (None, 125)               0         
                                                                 
 dense_2 (Dense)             (None, 31)                3937      
                                                                 
 dropout_2 (Dropout)         (None, 31)                0         
                                                                 
 dense_3 (Dense)             (None, 7)                 2
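
The hidden layer widths and parameter counts in the summary above can be reproduced without building a Keras model. The sketch below mirrors the loop in `generate_model`; the helper names `layer_sizes` and `dense_prelu_params` are hypothetical, and the value of 47 input features is an assumption inferred from the first layer's parameter count rather than something stated in the text.

```python
def layer_sizes(neuronPct, neuronShrink,
                max_neurons=5000, min_neurons=25, max_layers=10):
    # Mirror the while loop in generate_model
    sizes = []
    count = int(neuronPct * max_neurons)
    while count > min_neurons and len(sizes) < max_layers:
        sizes.append(int(count))     # Dense truncates a float width
        count = count * neuronShrink
    return sizes

def dense_prelu_params(n_in, n_out):
    # weights + biases + one PReLU alpha per neuron
    return n_in * n_out + n_out + n_out

sizes = layer_sizes(neuronPct=0.1, neuronShrink=0.25)
print(sizes)  # [500, 125, 31]

n_in = 47  # assumed number of input features
for n_out in sizes:
    print(n_out, dense_prelu_params(n_in, n_out))
    n_in = n_out
```

The fourth candidate width would be 7.8125, which is below the 25-neuron cutoff, so the loop stops after three hidden layers.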

We will now create a function to evaluate the neural network using these three parameters plus a learning rate.  We use bootstrapping because a single training run might have "bad luck" with its randomly assigned initial weights.  This function trains the network on each bootstrap split, evaluates it, and returns the negative of the mean log loss.

In [5]:
SPLITS = 2
EPOCHS = 500
PATIENCE = 10

def evaluate_network(dropout,learning_rate,neuronPct,neuronShrink):
    # Bootstrap

    # for Classification
    boot = StratifiedShuffleSplit(n_splits=SPLITS, test_size=0.1)
    # for Regression
    # boot = ShuffleSplit(n_splits=SPLITS, test_size=0.1)

    # Track progress
    mean_benchmark = []
    epochs_needed = []
    num = 0
    
    # Loop through samples
    for train, test in boot.split(x,df['product']):
        start_time = time.time()
        num+=1

        # Split train and test
        x_train = x[train]
        y_train = y[train]
        x_test = x[test]
        y_test = y[test]

        model = generate_model(dropout, neuronPct, neuronShrink)
        model.compile(loss='categorical_crossentropy', 
                      optimizer=Adam(learning_rate=learning_rate))
        monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, 
        patience=PATIENCE, verbose=0, mode='auto', 
                                restore_best_weights=True)

        # Train on the bootstrap sample
        model.fit(x_train,y_train,validation_data=(x_test,y_test),
                  callbacks=[monitor],verbose=0,epochs=EPOCHS)
        epochs = monitor.stopped_epoch
        epochs_needed.append(epochs)

        # Predict on the out of boot (validation)
        pred = model.predict(x_test)

        # Measure this bootstrap's log loss
        y_compare = np.argmax(y_test,axis=1) # For log loss calculation
        score = metrics.log_loss(y_compare, pred)
        mean_benchmark.append(score)
        m1 = statistics.mean(mean_benchmark)
        m2 = statistics.mean(epochs_needed)
        mdev = statistics.pstdev(mean_benchmark)

        # Record this iteration
        time_took = time.time() - start_time
        
    tensorflow.keras.backend.clear_session()
    # Negate the mean log loss because the optimizer maximizes its target
    return (-m1)




In [6]:
print(evaluate_network(
    dropout=0.2,
    learning_rate=1e-3,
    neuronPct=0.2,
    neuronShrink=0.2))

-0.7455484813312068
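
The result is negative because `evaluate_network` returns the negated mean log loss, so that a larger value means a better network. The following pure-Python sketch shows the idea on made-up predictions; it is a simplified stand-in for `metrics.log_loss` (which, among other things, clips probabilities), and the sample probabilities are invented for illustration.

```python
import math

def log_loss(true_classes, predicted_probs):
    # Average negative log of the probability assigned to the true class
    total = 0.0
    for cls, probs in zip(true_classes, predicted_probs):
        total += -math.log(probs[cls])
    return total / len(true_classes)

true_classes = [0, 1]
good = log_loss(true_classes, [[0.9, 0.1], [0.2, 0.8]])  # confident, correct
bad = log_loss(true_classes, [[0.4, 0.6], [0.7, 0.3]])   # mostly wrong

print(good, bad)
# Lower log loss is better, so the negated score is higher for the better model
print(-good > -bad)
```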


First, we must install the Bayesian optimization package if we are in Colab.

In [7]:
# HIDE OUTPUT
!pip install bayesian-optimization

Collecting bayesian-optimization
  Downloading bayesian-optimization-1.2.0.tar.gz (14 kB)
Building wheels for collected packages: bayesian-optimization
  Building wheel for bayesian-optimization (setup.py) ... done
  Created wheel for bayesian-optimization: filename=bayesian_optimization-1.2.0-py3-none-any.whl size=11685 sha256=2312074622448467db0cdf8476cd55dcfa1e47b285165ad78199a0bbabae81bb
  Stored in directory: /root/.cache/pip/wheels/fd/9b/71/f127d694e02eb40bcf18c7ae9613b88a6be4470f57a8528c5b
Successfully built bayesian-optimization
Installing collected packages: bayesian-optimization
Successfully installed bayesian-optimization-1.2.0


We will now automate this process. We define the bounds for each of these four hyperparameters and begin the Bayesian optimization. Once the program finishes, the best combination of hyperparameters found is displayed. The **maximize** function accepts two parameters that will significantly impact how long the process takes to complete: 

* **n_iter** - How many steps of Bayesian optimization you want to perform. The more steps, the more likely you are to find a good maximum.
* **init_points** - How many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.

In [8]:
from bayes_opt import BayesianOptimization
import time

# Suppress NaN warnings
import warnings
warnings.filterwarnings("ignore",category =RuntimeWarning)

# Bounded region of parameter space
pbounds = {'dropout': (0.0, 0.499),
           'learning_rate': (0.0, 0.1),
           'neuronPct': (0.01, 1),
           'neuronShrink': (0.01, 1)
          }

optimizer = BayesianOptimization(
    f=evaluate_network,
    pbounds=pbounds,
    verbose=2,  # verbose = 1 prints only when a maximum 
    # is observed, verbose = 0 is silent
    random_state=1,
)

start_time = time.time()
optimizer.maximize(init_points=10, n_iter=20,)
time_took = time.time() - start_time

print(f"Total runtime: {hms_string(time_took)}")
print(optimizer.max)

|   iter    |  target   |  dropout  | learni... | neuronPct | neuron... |
-------------------------------------------------------------------------
|  1        | -0.7891   |  0.2081   |  0.07203  |  0.01011  |  0.3093   |
|  2        | -0.7768   |  0.07323  |  0.009234 |  0.1944   |  0.3521   |
|  3        | -21.59    |  0.198    |  0.05388  |  0.425    |  0.6884   |
|  4        | -0.7414   |  0.102    |  0.08781  |  0.03711  |  0.6738   |
|  5        | -0.9557   |  0.2082   |  0.05587  |  0.149    |  0.2061   |
|  6        | -19.86    |  0.3996   |  0.09683  |  0.3203   |  0.6954   |
|  7        | -7.02     |  0.4373   |  0.08946  |  0.09419  |  0

As you can see, the algorithm performed 30 total iterations. This total iteration count includes ten random and 20 optimization iterations. 
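
The `optimizer.max` value printed above is a dictionary with a `target` entry and a `params` entry. A minimal sketch of how you might unpack it follows. The dictionary here is a hand-written stand-in using the values from iteration 4 of the log, chosen only for illustration; it is not necessarily the final best result of the run.

```python
# Stand-in shaped like optimizer.max (values copied from iteration 4)
best = {
    "target": -0.7414,
    "params": {
        "dropout": 0.102,
        "learning_rate": 0.08781,
        "neuronPct": 0.03711,
        "neuronShrink": 0.6738,
    },
}

# The target is the negated log loss, so flip the sign to recover it
best_log_loss = -best["target"]

# generate_model takes only the structural parameters and dropout;
# the learning rate goes to the optimizer instead
model_args = dict(best["params"])
learning_rate = model_args.pop("learning_rate")

print(f"Best log loss: {best_log_loss}")
print(f"Model arguments: {model_args}")
print(f"Learning rate: {learning_rate}")
```

With the real notebook state, `model_args` could then be passed as `generate_model(**model_args)` to rebuild the best network found.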