# Regularisation, Parameter Initialisation, Batchnorm, Optimisers

Create and compare different models (as described below).

Inspect the results by using tensorboard.


In [53]:
import tensorflow as tf
import datetime
import matplotlib.pyplot as plt
import os

In [32]:
mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

In [34]:
x_train = x_train.reshape(-1, 28**2)
x_test = x_test.reshape(-1, 28**2)

### Parameters


In [3]:
layersizes = [50,50,50,10]
batchsize = 32 
epochs = 20
learning_rate = 0.1

tensorboard_folder = "tb_logs_keras"
outdir = os.path.join(os.getcwd(), tensorboard_folder)

### Baseline Model

* No regularisation
* No Batch Norm
* Default parameter initialisation of Keras: What is the default?
   - glorot_uniform
   - src:https://keras.io/api/layers/core_layers/dense/
* Sigmoid activation (last layer always softmax)
* SGD with given batchsize and learning rate, no accelerators (no momentum nor RMS prop).

Now, create the baseline model. 

Possibly, add convenient naming to the layers so that you can more easily read the outputs in tensorboard. 

In [63]:
def baseline_model(layersizes, activation):
    """
    Provides an MLP model (built with the Keras functional API) with given layersizes. The last layer is a softmax layer.
    The hidden layers use the activation function passed in `activation`.
        
    Arguments:
    layersizes -- list of integers with the number of hidden units per layer. The last element is for MNIST 10.
    activation -- string specifying the activation function for the hidden layers to be used.
    
    """
    ### START YOUR CODE HERE ###


    inputs = tf.keras.Input(shape=(28**2,), name="input_layer")
    x = inputs
    
    hidden_layer_nr = 1
    for size in layersizes[:-1]:
        x = tf.keras.layers.Dense(units=size, activation=activation, name=f"hidden_layer_{hidden_layer_nr}")(x)
        hidden_layer_nr += 1
    
    x = tf.keras.layers.Dense(units=layersizes[-1], activation="softmax", name="final_layer")(x)
    
    ### STOP YOUR CODE HERE ###

    return tf.keras.models.Model(inputs=inputs, outputs=x)

#### Run model

Use cross entropy as loss function.
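
As a reference point for the loss values reported further down, the following minimal sketch computes the per-sample cross entropy by hand. It is plain Python rather than the Keras implementation, and the helper names `softmax` and `sparse_categorical_crossentropy` are chosen here for illustration only. A model that predicts equal probability for all 10 classes has a loss of $\ln(10) \approx 2.30$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: negative log-probability of the true class."""
    return -math.log(probs[label])

# Equal logits give equal probabilities for all 10 classes.
uniform = softmax([0.0] * 10)
loss_uniform = sparse_categorical_crossentropy(uniform, 3)

# A larger logit for the true class (index 3) lowers the loss.
confident = softmax([0.0] * 3 + [5.0] + [0.0] * 6)
loss_confident = sparse_categorical_crossentropy(confident, 3)

print(f"uniform prediction:   {loss_uniform:.4f}")
print(f"confident prediction: {loss_confident:.4f}")
```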

In [74]:
run_name = "baseline"
rundir = os.path.join(outdir, run_name)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=rundir, histogram_freq=1, profile_batch=0)
# start tensorboard on the command line with: tensorboard --logdir <path to outdir>


### START YOUR CODE HERE ###


model = baseline_model(layersizes, "sigmoid")
model.summary()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(x_train, y_train,
                    batch_size=batchsize, epochs=epochs,
                    validation_split=0.2,
                    callbacks=[tensorboard_callback])


### STOP YOUR CODE HERE ###

Model: "functional_46"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_layer (InputLayer)     [(None, 784)]             0         
_________________________________________________________________
hidden_layer_1 (Dense)       (None, 50)                39250     
_________________________________________________________________
hidden_layer_2 (Dense)       (None, 50)                2550      
_________________________________________________________________
hidden_layer_3 (Dense)       (None, 50)                2550      
_________________________________________________________________
final_layer (Dense)          (None, 10)                510       
Total params: 44,860
Trainable params: 44,860
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoc

<img src="acc1.png" style="float:left">
<img src="loss1.png">

We observe that the validation loss of the network is converging and the model already achieves a validation accuracy of 96.45%. With more epochs the model might gain another percent of accuracy.

### Parameter Initialisation

* No regularisation
* No Batch Norm
* __Parameter Initialisation: Compare GlorotNormal, Random Normal (mean 0, stdev 1), Zero, HeNormal__
* __Activation (last layer always softmax): Compare Sigmoid, ReLu__
* SGD with given batchsize and learning rate, no accelerators (no momentum nor RMS prop).

Hence, for each of the 4 initializers, train and test one model with sigmoid and one with relu.

In [69]:
def model_param_init(layersizes, initializer, activation):
    """
    Provides an MLP model (built with the Keras functional API) with given layersizes. The last layer is a softmax layer.
    The hidden layers use the given initializer and activation function.
        
    Arguments:
    layersizes -- list of integers with the number of hidden units per layer. The last element is for MNIST 10.
    initializer -- weight initializer
    activation -- string specifying the activation function to be used.
    
    """
    ### START YOUR CODE HERE ###

    inputs = tf.keras.Input(shape=(28**2,), name="input_layer")
    x = inputs
    
    hidden_layer_nr = 1
    for size in layersizes[:-1]:
        x = tf.keras.layers.Dense(units=size, activation=activation,
                                  name=f"hidden_layer_{hidden_layer_nr}",
                                  kernel_initializer=initializer,
                                  bias_initializer=initializer)(x)
        hidden_layer_nr += 1
    
    x = tf.keras.layers.Dense(units=layersizes[-1], activation="softmax", name="final_layer")(x)
    
    ### STOP YOUR CODE HERE ###

    return tf.keras.models.Model(inputs=inputs, outputs=x)

#### Run model

Run with the different settings.
Don't forget to configure the proper tensorboard callback.

In [75]:
for act in ["sigmoid", "relu"]:
    print(40*"-")
    print(f"Activation {act}")
    print(40*"-")
    for init in ["glorot_normal", "random_normal", "zero", "he_normal"]:
        run_name = f"act-{act}-init-{init}"
        rundir = os.path.join(outdir, run_name)

        tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=rundir, histogram_freq=1, profile_batch=0)

        model = model_param_init(layersizes, init, act)
        model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

        print(f"working on initializer {init}")
        history = model.fit(x_train, y_train,
                            batch_size=batchsize, epochs=epochs,
                            validation_split=0.2,
                            callbacks=[tensorboard_callback], verbose=False)
        
        print(f"latest val_loss: {history.history['val_loss'][-1]}")
        print(f"latest val_accuracy: {history.history['val_accuracy'][-1]}")
        print(40*"-")
        

----------------------------------------
Activation sigmoid
----------------------------------------
working on initializer glorot_normal
latest val_loss: 0.11276379227638245
latest val_accuracy: 0.9644166827201843
----------------------------------------
working on initializer random_normal
latest val_loss: 0.15437713265419006
latest val_accuracy: 0.9574999809265137
----------------------------------------
working on initializer zero
latest val_loss: 1.4241622686386108
latest val_accuracy: 0.45133334398269653
----------------------------------------
working on initializer he_normal
latest val_loss: 0.10860568284988403
latest val_accuracy: 0.968916654586792
----------------------------------------
----------------------------------------
Activation relu
----------------------------------------
working on initializer glorot_normal
latest val_loss: 0.15023940801620483
latest val_accuracy: 0.9699166417121887
----------------------------------------
working on initializer random_normal
lat

##  Sigmoid
<img src="acc-sigmoid.png" style="float:left">
<img src="loss-sigmoid.png">
<img src="legend-sigmoid.png">

We observe that for `glorot_normal`, `he_normal` and `random_normal` the model converges relatively fast. Of these three, `random_normal` shows the slowest convergence, which could be due to the fact that the sigmoid activation function has two saturating regions and the `random_normal` initialization does not account for the number of inputs. Hence it is more likely that the logits become very large (or very small), resulting in a vanishing gradient.
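
The saturation argument can be sketched numerically. The snippet below is an illustration in plain Python, not part of the training runs. It assumes a fixed standard deviation of 1 as stated in the task description (the default of the Keras string alias may differ) and uses a random vector as a stand-in for a normalised image. The helper `mean_sigmoid_slope` is hypothetical. It compares the average slope of the sigmoid at the first hidden layer for a fixed standard deviation and for a Glorot-scaled one.

```python
import math
import random

def sigmoid(z):
    """Sigmoid that avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mean_sigmoid_slope(stddev, n_in=784, n_units=200, seed=0):
    """Average sigmoid'(z) over n_units neurons with N(0, stddev^2) weights."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n_in)]  # stand-in for pixel values in [0, 1]
    slopes = []
    for _ in range(n_units):
        z = sum(rng.gauss(0.0, stddev) * xi for xi in x)
        s = sigmoid(z)
        slopes.append(s * (1.0 - s))  # derivative of the sigmoid at z
    return sum(slopes) / n_units

fixed = mean_sigmoid_slope(stddev=1.0)
glorot = mean_sigmoid_slope(stddev=math.sqrt(2.0 / (784 + 50)))

print(f"mean slope, stddev=1:      {fixed:.4f}")
print(f"mean slope, Glorot stddev: {glorot:.4f}")
```

The maximum possible slope is 0.25 (at $z=0$). The closer the mean slope is to zero, the more units sit in a saturating region.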

The slowest convergence overall is observed with the `zero` weight initializer. From the chain rule we know that the gradient with respect to the activations of the previous layer is calculated as

$$
\frac{\partial L}{\partial \mathbf{A}^{[l-1]}}=\frac{\partial L}{\partial \mathbf{Z}^{[l]}} \cdot \frac{\partial \mathbf{Z}^{[l]}}{\partial \mathbf{A}^{[l-1]}}=\left(\mathbf{W}^{[l]}\right)^{T} \cdot \frac{\partial L}{\partial \mathbf{Z}^{[l]}}
$$

and is thus backpropagated through multiplication with the weight matrix. With zero weights, the gradient therefore vanishes already at the last hidden layer in the first training iteration and can only be propagated back properly once the weights of that layer have been pushed away from zero. This takes some time. During this "pushing away from 0" phase the network can do little more than predict (nearly) equal probability for all classes, as the features at the last hidden layer are constant ($\sigma(0)=0.5$) and carry no information about the input.
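
A minimal sketch of this argument for a single zero-initialised layer, in plain Python. The layer sizes and the upstream gradient `dZ` are hypothetical values chosen for illustration. The gradient passed to the previous layer is exactly zero, while the layer's own weight gradient is not, because the incoming activations are $\sigma(0)=0.5$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

n_prev, n_units = 4, 3
W = [[0.0] * n_prev for _ in range(n_units)]  # zero-initialised kernel of layer l
a_prev = [sigmoid(0.0)] * n_prev              # A^{[l-1]}: every unit outputs 0.5
dZ = [0.2, -0.1, 0.05]                        # hypothetical upstream gradient dL/dZ^{[l]}

# dL/dA^{[l-1]} = (W^{[l]})^T dL/dZ^{[l]}: zero, nothing flows further back
dA_prev = [sum(W[j][i] * dZ[j] for j in range(n_units)) for i in range(n_prev)]

# dL/dW^{[l]} = dL/dZ^{[l]} (A^{[l-1]})^T: non-zero, so this layer can start to move
dW = [[dZ[j] * a_prev[i] for i in range(n_prev)] for j in range(n_units)]

print("dA_prev:", dA_prev)
print("dW[0]:  ", dW[0])
```

Note that every entry in a row of `dW` is identical, so the weights of one unit also stay equal to each other after the update.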

## Relu
<img src="acc-relu.png" style="float:left">
<img src="loss-relu.png">
<img src="legend-relu.png">

We observe that for `glorot_normal`, `he_normal` and `random_normal` the model converges relatively fast. Unlike with the sigmoid activation function, `random_normal` does not differ from the other two. This could be due to the fact that the relu activation function has only a single saturating region.

The `zero` weight initializer does not work at all with the relu activation function. This is due to the fact that $\text{relu}(0)=0$. Hence

$$
\frac{\partial L}{\partial \mathbf{W}^{[l]}}=\frac{\partial L}{\partial \mathbf{Z}^{[l]}} \cdot \frac{\partial \mathbf{Z}^{[l]}}{\partial \mathbf{W}^{[l]}}=\frac{\partial L}{\partial \mathbf{Z}^{[l]}}\left(\mathbf{A}^{[l-1]}\right)^{T}
$$

is always zero, so the weights never move away from zero. The only thing that the model can learn is to predict equal probability for each class. Thus the maximal accuracy that it can reach is ~10%.
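
The same reasoning as a small plain-Python sketch. The input sample and the upstream gradients are hypothetical, and the slope of relu at exactly $z=0$ is assumed to be 0. With zero weights the hidden activations are zero, so the weight gradient of the next layer vanishes, and the zero slope at $z=0$ also blocks the gradient of the layer itself.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Assumed convention: the slope at exactly z == 0 is taken to be 0.
    return 1.0 if z > 0.0 else 0.0

x = [0.3, 0.9, 0.0, 0.5]                  # hypothetical input sample
W1 = [[0.0] * len(x) for _ in range(3)]   # zero-initialised kernel of hidden layer 1
b1 = [0.0] * 3                            # zero-initialised bias

# Forward pass: Z^{[1]} = W^{[1]} x + b^{[1]}, A^{[1]} = relu(Z^{[1]})
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
a1 = [relu(z) for z in z1]

# Backward pass through layer 1 with a hypothetical gradient from above
dA1 = [0.2, -0.1, 0.05]
dZ1 = [g * relu_grad(z) for g, z in zip(dA1, z1)]
dW1 = [[dz * xi for xi in x] for dz in dZ1]

# Weight gradient of the next layer: dL/dW^{[2]} = dL/dZ^{[2]} (A^{[1]})^T
dZ2 = [0.4, -0.3]                         # hypothetical
dW2 = [[dz * a for a in a1] for dz in dZ2]

print("a1: ", a1)
print("dW1:", dW1)
print("dW2:", dW2)
```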

<img src="acc-relu-zero.png" style="float:left">
<img src="loss-relu-zero.png">
<img src="legend-relu-zero.png">


In order to see more differences between `glorot_normal` and `he_normal` we could try to create a network with very "thin" layers (small number of neurons), such that the constant factor of the two techniques would gain more weight, as it is the only difference between the two when consecutive layers have the same width ($n_{\text{inputs}} = n_{\text{outputs}}$).

$$
\begin{aligned}
\sigma_{\text{Glorot}} &= \sqrt{\frac{2}{n_{\text{inputs}}+n_{\text{outputs}}}} \\
\sigma_{\text{He}} &= \sqrt{\frac{2}{n_{\text{inputs}}}} = \sqrt{2}\,\sqrt{\frac{2}{n_{\text{inputs}}+n_{\text{outputs}}}} \quad \text{for } n_{\text{inputs}} = n_{\text{outputs}}
\end{aligned}
$$

The expectation would be that `glorot_normal` works better for sigmoid and `he_normal` better for relu activations.
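
To get a feeling for the numbers, the sketch below evaluates both standard deviations for the layer shapes of this network. It assumes the usual definitions (Glorot uses the sum of inputs and outputs, He uses the inputs only) and ignores implementation details such as truncation of the normal distribution. The function names are chosen here for illustration.

```python
import math

def glorot_normal_stddev(fan_in, fan_out):
    """Glorot/Xavier scaling: depends on inputs and outputs of the layer."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_normal_stddev(fan_in):
    """He scaling: depends on the number of inputs only."""
    return math.sqrt(2.0 / fan_in)

# Layer shapes (inputs, outputs) of the MLP with layersizes [50, 50, 50, 10]
shapes = [(784, 50), (50, 50), (50, 50), (50, 10)]
ratios = []
for fan_in, fan_out in shapes:
    g = glorot_normal_stddev(fan_in, fan_out)
    h = he_normal_stddev(fan_in)
    ratios.append(h / g)
    print(f"{fan_in:4d} -> {fan_out:3d}: glorot={g:.4f}  he={h:.4f}  ratio={h / g:.3f}")
```

For the two hidden layers with 50 inputs and 50 outputs the ratio is exactly $\sqrt{2}$. For the first layer (784 inputs, 50 outputs) the two values are much closer to each other.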

### Batch Normalisation

* No regularisation
* __Batch Norm__: with / without 
* __Parameter Initialisation: Random Normal (0,1), GlorotNormal__
* __Activation: Compare Sigmoid, ReLu__
* SGD with given batchsize and learning rate, no accelerators (no momentum nor RMS prop).

Run with/without batchnorm in combination with sigmoid or relu (with GlorotNormal).<br>
Run with/without batchnorm in combination with GlorotNormal or RandomNormal (with sigmoid).<br>
Hence run 8 different models.

In [78]:
def model_batchnorm(layersizes, initializer, activation):
    """
    Provides an MLP model (built with the Keras functional API) with given layersizes. The last layer is a softmax layer.
    Each hidden layer applies batch normalisation before the given activation function.
        
    Arguments:
    layersizes -- list of integers with the number of hidden units per layer. The last element is for MNIST 10.
    initializer -- weight initializer
    activation -- string specifying the activation function to be used.
    """
    ### START YOUR CODE HERE ###

    inputs = tf.keras.Input(shape=(28**2,), name="input_layer")
    x = inputs
    
    hidden_layer_nr = 1
    for size in layersizes[:-1]:
        x = tf.keras.layers.Dense(units=size, activation=None,
                                  name=f"hidden_layer_{hidden_layer_nr}",
                                  kernel_initializer=initializer,
                                  bias_initializer=initializer)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation(activation)(x)
        hidden_layer_nr += 1
    
    x = tf.keras.layers.Dense(units=layersizes[-1], activation="softmax", name="final_layer")(x)  
    
    
    ### STOP YOUR CODE HERE ###

    return tf.keras.models.Model(inputs=inputs, outputs=x)

#### Run model

Run the different variants.

In [81]:
### START YOUR CODE HERE ###
import itertools

for act in ["sigmoid", "relu"]:
    print(40*"-")
    print(f"Activation {act}")
    print(40*"-")
    for batch_norm, init in itertools.product([True, False], ["glorot_normal", "random_normal"]):
        run_name = f"batchnorm-{batch_norm}-act-{act}-init-{init}"
        rundir = os.path.join(outdir, run_name)

        tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=rundir, histogram_freq=1, profile_batch=0)

        if batch_norm:
            model = model_batchnorm(layersizes, init, act)
        else:
            model = model_param_init(layersizes, init, act)
        
        model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

        print(f"working on initializer {init} with batch_norm {batch_norm}")
        history = model.fit(x_train, y_train,
                            batch_size=batchsize, epochs=epochs,
                            validation_split=0.2,
                            callbacks=[tensorboard_callback], verbose=False)
        
        print(f"latest val_loss: {history.history['val_loss'][-1]}")
        print(f"latest val_accuracy: {history.history['val_accuracy'][-1]}")
        print(40*"-")

### STOP YOUR CODE HERE ###

----------------------------------------
Activation sigmoid
----------------------------------------
working on initializer glorot_normal with batch_norm True
latest val_loss: 0.09597315639257431
latest val_accuracy: 0.9724166393280029
----------------------------------------
working on initializer random_normal with batch_norm True
latest val_loss: 0.08772739768028259
latest val_accuracy: 0.9749166369438171
----------------------------------------
working on initializer glorot_normal with batch_norm False
latest val_loss: 0.11894385516643524
latest val_accuracy: 0.965666651725769
----------------------------------------
working on initializer random_normal with batch_norm False
latest val_loss: 0.1706104576587677
latest val_accuracy: 0.953083336353302
----------------------------------------
----------------------------------------
Activation relu
----------------------------------------
working on initializer glorot_normal with batch_norm True
latest val_loss: 0.08182129263877869
lat

## Sigmoid

<img src="acc-sigmoid-batchnorm.png" style="float:left">
<img src="loss-sigmoid-batchnorm.png">
<img src="legend-sigmoid-batchnorm.png">


We observe for both initializers that the model using batchnorm converges *much* faster and also to a solution with a higher accuracy.
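
One possible explanation is that batch normalisation moves the pre-activations back into the non-saturating region of the sigmoid. The following sketch illustrates this for a single feature with hypothetical pre-activation values. It uses training-mode batch statistics only (no moving averages), and `gamma`, `beta` and `eps` are illustrative defaults, not values read from the trained models.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalise one feature over a mini-batch using the batch statistics."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

def mean_slope(values):
    """Average derivative of the sigmoid over the given pre-activations."""
    return sum(sigmoid(v) * (1.0 - sigmoid(v)) for v in values) / len(values)

# Hypothetical pre-activations of one unit, far inside the saturating regions
pre_activations = [-12.0, -9.0, 15.0, 20.0, -18.0, 11.0]
normalised = batch_norm(pre_activations)

bn_mean = sum(normalised) / len(normalised)
bn_var = sum((v - bn_mean) ** 2 for v in normalised) / len(normalised)
slope_raw = mean_slope(pre_activations)
slope_bn = mean_slope(normalised)

print(f"after batch norm: mean={bn_mean:.4f}, var={bn_var:.4f}")
print(f"mean sigmoid slope without batch norm: {slope_raw:.6f}")
print(f"mean sigmoid slope with batch norm:    {slope_bn:.6f}")
```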

## Relu

<img src="acc-relu-batchnorm.png" style="float:left">
<img src="loss-relu-batchnorm.png">
<img src="legend-relu-batchnorm.png">

We again observe for both initializers that the model using batchnorm converges *a little* faster and also to a solution with a higher accuracy. In general, however, convergence with the relu activation function is again much faster than with the sigmoid activation function.

## Weights
Thanks to the tensorboard log, we can analyze the distribution of the kernel weights and compare them for different models.
<img src="batchnorm-activation-hists.png">

The upper row contains the histograms of the kernel of the second hidden layer for different models **not** using batchnorm, and the lower row contains the histograms of the same layer of the same models but with batchnorm.

We can see that using batch norm results in a much more consistent weight distribution across the different models (note the x-axis ranges). Another interesting observation is that with the `random_normal` initialization, the initial standard deviation of the weights is much smaller than with the `glorot_normal` initialization. This is because the Keras string alias `random_normal` uses a default standard deviation of 0.05 rather than 1.

We can also analyse a plot where the time is on the x-axis and the weights distribution on the y-axis.

<img src="batchnorm-activation-timedist.png">

Here we observe the weight distribution of the kernel of the third hidden layer without (left) and with (right) batchnorm. The main differences are the scale and the rate of change (especially in the beginning) of the distribution. Again it is nicely visible that the batchnorm layers help to make the weights more consistent and thus enable a smoother and more stable learning process.

In order to make the effects of batch norm more clear we could make the model deeper and observe the weights histograms in the deep layers. There the effect should be better visible.

### Optimizers

* No regularisation
* No BatchNorm 
* Parameter Initialisation: GlorotNormal
* Activation: ReLu
* Optimizers: Compare 
    * SGD with given batchsize and learning rate, no accelerators (no momentum nor RMS prop)
    * RmsProp
    * Momentum

Create a corresponding model and train it with the different optimizers.

In [85]:
### START YOUR CODE HERE ###
import itertools

learning_rate = 0.001

for opt in ["SGD", "RMS prop", "Momentum"]:
    print(40*"-")
    print(f"Optimizer {opt}")
    print(40*"-")
    run_name = f"optimizer-{opt}"
    rundir = os.path.join(outdir, run_name)

    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=rundir, histogram_freq=1, profile_batch=0)

    model = model_param_init(layersizes, "glorot_normal", "relu")

    if opt == "SGD":
        optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
    elif opt == "RMS prop":
        optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)
    elif opt == "Momentum":
        optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.9, nesterov=True)

    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    print(f"working on optimizer {opt}")
    history = model.fit(x_train, y_train,
                        batch_size=batchsize, epochs=epochs,
                        validation_split=0.2,
                        callbacks=[tensorboard_callback], verbose=False)

    print(f"latest val_loss: {history.history['val_loss'][-1]}")
    print(f"latest val_accuracy: {history.history['val_accuracy'][-1]}")
    print(40*"-")

### STOP YOUR CODE HERE ###

----------------------------------------
Optimizer SGD
----------------------------------------
working on optimizer SGD
latest val_loss: 0.2698284387588501
latest val_accuracy: 0.9228333234786987
----------------------------------------
----------------------------------------
Optimizer RMS prop
----------------------------------------
working on optimizer RMS prop
latest val_loss: 0.24551866948604584
latest val_accuracy: 0.9683333039283752
----------------------------------------
----------------------------------------
Optimizer Momentum
----------------------------------------
working on optimizer Momentum
latest val_loss: 0.11718112230300903
latest val_accuracy: 0.968416690826416
----------------------------------------


## Results

<img src="acc-opt.png" style="float:left">
<img src="loss-opt.png">
<img src="legend-opt.png">

`RMS prop` shows the fastest convergence and `SGD` with momentum the most stable training and the best end result. For those two to converge in the first place, a much smaller global learning rate was required. The reason that `RMS prop` overfits more could be that it gets stuck in a local minimum more quickly, whereas `SGD` with momentum can escape thanks to the momentum term. In theory, the `Adam` optimizer should resolve this discrepancy.
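
To make the three update rules concrete, here is a minimal plain-Python sketch on a toy quadratic loss. It is not the Keras implementation: the momentum variant is the plain (non-Nesterov) form, the RMSprop constants `rho` and `eps` are illustrative, and the loss function and starting point are assumptions chosen only so that the difference between the rules is visible.

```python
import math

def loss(w):
    """Toy loss with different curvature per direction."""
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    return [w[0], 25.0 * w[1]]

def run_sgd(w, lr, steps):
    w = list(w)
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def run_momentum(w, lr, steps, momentum=0.9):
    w = list(w)
    v = [0.0] * len(w)  # velocity: decaying sum of past gradient steps
    for _ in range(steps):
        g = grad(w)
        v = [momentum * vi - lr * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

def run_rmsprop(w, lr, steps, rho=0.9, eps=1e-7):
    w = list(w)
    avg_sq = [0.0] * len(w)  # running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        avg_sq = [rho * a + (1.0 - rho) * gi ** 2 for a, gi in zip(avg_sq, g)]
        w = [wi - lr * gi / (math.sqrt(a) + eps) for wi, gi, a in zip(w, g, avg_sq)]
    return w

start = [0.5, 0.5]
optimisers = {"SGD": run_sgd, "Momentum": run_momentum, "RMS prop": run_rmsprop}
results = {name: loss(fn(start, 0.001, 1000)) for name, fn in optimisers.items()}

print(f"initial loss: {loss(start):.4f}")
for name, final_loss in results.items():
    print(f"{name:9s} final loss: {final_loss:.3e}")
```

RMSprop divides each gradient component by a running estimate of its magnitude, so both directions are updated with a similar step size regardless of curvature. Momentum instead accumulates past gradients, which increases the effective step size in directions where the gradient keeps its sign.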

From my point of view, the differences between the optimizers are already nicely shown in this example. To make them more pronounced, we could increase the number of trainable parameters of the model (more neurons per layer and/or more layers).