# Deep Learning Fundamentals 7 - Into the World of Tensorflow/Keras 1

In this seventh notebook, I will show how to apply the techniques we covered in the previous three notebooks using TensorFlow and Keras. Let's get started.

In [1]:
import sys
import sklearn
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

# Exploring Initializers and Activation Functions 

Previously, we implemented initializers and activation functions from scratch. Now we will use them with Keras for practice. Let's get started by loading our beloved dataset, Fashion MNIST.

In [2]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0  # scale pixel values to [0, 1]
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

Which initializers are implemented in Keras?

In [3]:
[initializers for initializers in dir(keras.initializers) if not initializers.startswith("_")]

['Constant',
 'GlorotNormal',
 'GlorotUniform',
 'HeNormal',
 'HeUniform',
 'Identity',
 'Initializer',
 'LecunNormal',
 'LecunUniform',
 'Ones',
 'Orthogonal',
 'RandomNormal',
 'RandomUniform',
 'TruncatedNormal',
 'VarianceScaling',
 'Zeros',
 'constant',
 'deserialize',
 'get',
 'glorot_normal',
 'glorot_uniform',
 'he_normal',
 'he_uniform',
 'identity',
 'lecun_normal',
 'lecun_uniform',
 'ones',
 'orthogonal',
 'random_normal',
 'random_uniform',
 'serialize',
 'truncated_normal',
 'variance_scaling',
 'zeros']

By default, Keras uses Glorot initialization with a uniform distribution. Let's train a network using ReLU and the default initialization, and then talk about some methods that we used previously but didn't discuss much.

In [4]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300),
    keras.layers.ReLU(),
    keras.layers.Dense(100),
    keras.layers.ReLU(),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In the third notebook, I used Keras to train some basic networks but didn't say much about layers or about the main methods we use, such as `compile()` and `fit()`. I think we have now covered enough building blocks of deep learning to talk about these things.

In the code above, we first defined the layers. This network is built from dense (fully connected) layers, which are appropriate for rank-2 tensors of shape (samples, features). Tensors of other ranks call for other layer types: sequence data, stored in rank-3 tensors of shape (samples, timesteps, features), is typically processed by LSTM layers, other RNN layers, or 1D convolution layers, while image data, stored in rank-4 tensors, is generally processed by 2D convolution layers (`Conv2D`). Secondly, we call the `compile()` method and pass the loss, optimizer, and metrics. I passed the loss and the metric as strings and the optimizer as an object (because I wanted to set its `learning_rate` argument); either way, Keras converts all of them to Python objects in the end. For instance, passing `loss="sparse_categorical_crossentropy"` is equivalent to passing `loss=tf.keras.losses.SparseCategoricalCrossentropy()`. Lastly, I call the `fit()` method, which implements the training loop itself.


Additionally, there are two important methods that we frequently use:
1. `evaluate()`: This method is used when we want to compute the loss and metrics after training. If the model doesn't have a metric, only the loss is returned.
2. `predict()`: This method is used for inference. It iterates over the data we pass in batches and returns a NumPy array of predictions.
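To make the numbers reported by `evaluate()` less opaque, here is a minimal NumPy sketch of what the loss and metric used in this notebook compute. The helper names and the toy probabilities are my own illustration, not Keras internals; in practice the probabilities would come from `model.predict()`.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_proba, eps=1e-7):
    """Mean negative log-probability that the model assigns to the true class."""
    y_proba = np.clip(y_proba, eps, 1.0)               # avoid log(0)
    picked = y_proba[np.arange(len(y_true)), y_true]   # probability of the true class
    return float(-np.log(picked).mean())

def accuracy_metric(y_true, y_proba):
    """Fraction of instances whose most probable class is the true class."""
    return float((y_proba.argmax(axis=1) == y_true).mean())

# Toy stand-in for model.predict(X): 3 instances, 3 classes (made-up values).
toy_proba = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.4, 0.3]])
toy_labels = np.array([0, 1, 2])

toy_loss = sparse_categorical_crossentropy(toy_labels, toy_proba)
toy_acc = accuracy_metric(toy_labels, toy_proba)
print(toy_loss, toy_acc)
```

The third instance is misclassified (class 1 is the most probable, the label is 2), so it lowers the accuracy and contributes the largest term to the loss.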

Now let's go on talking about initializations and see if He initialization will improve accuracy.

In [5]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="HeUniform"),
    keras.layers.ReLU(),
    keras.layers.Dense(100, kernel_initializer="HeUniform"),
    keras.layers.ReLU(),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Looks like both initialization techniques work almost equally well in this case. How would the performance change if I used another activation function? Let's use LeakyReLU with HeUniform initialization.

In [6]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="HeUniform"),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(100, kernel_initializer="HeUniform"),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


What about using PReLU?

In [7]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="HeUniform"),
    keras.layers.PReLU(),
    keras.layers.Dense(100, kernel_initializer="HeUniform"),
    keras.layers.PReLU(),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Let's also use HeNormal initialization.

In [8]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="HeNormal"),
    keras.layers.PReLU(),
    keras.layers.Dense(100, kernel_initializer="HeNormal"),
    keras.layers.PReLU(),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Looks like PReLU and HeNormal provided the best result so far. 


We can also tune our initializer by changing the scaling factor. We previously mentioned that He initialization uses $fan_{in}$; we can change this and use $fan_{avg}$ instead with `keras.initializers.VarianceScaling()`.

In [9]:
init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='normal')


model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer=init),
    keras.layers.PReLU(),
    keras.layers.Dense(100, kernel_initializer=init),
    keras.layers.PReLU(),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
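As a rough illustration of what the `scale` and `mode` arguments of `VarianceScaling` change, the sketch below computes the target weight standard deviation, $\sqrt{scale / n}$, for a layer with 784 inputs and 300 units. The helper name is hypothetical, and the sketch only computes the target value: as far as I know, Keras additionally samples from a truncated normal when `distribution='normal'`, which is not reproduced here.

```python
import numpy as np

def variance_scaling_stddev(fan_in, fan_out, scale=1.0, mode="fan_in"):
    """Target stddev sqrt(scale / n), where n depends on the chosen mode."""
    n = {"fan_in": fan_in,
         "fan_out": fan_out,
         "fan_avg": (fan_in + fan_out) / 2}[mode]
    return float(np.sqrt(scale / n))

fan_in, fan_out = 784, 300  # first Dense layer of the models above

he_std = variance_scaling_stddev(fan_in, fan_out, scale=2.0, mode="fan_in")       # He
avg_std = variance_scaling_stddev(fan_in, fan_out, scale=2.0, mode="fan_avg")     # cell above
glorot_std = variance_scaling_stddev(fan_in, fan_out, scale=1.0, mode="fan_avg")  # Glorot
lecun_std = variance_scaling_stddev(fan_in, fan_out, scale=1.0, mode="fan_in")    # LeCun

# For a uniform distribution on [-limit, limit], the variance is limit**2 / 3,
# so the matching limit is sqrt(3) times the target stddev.
glorot_uniform_limit = float(np.sqrt(3.0) * glorot_std)

print(he_std, avg_std, glorot_std, lecun_std, glorot_uniform_limit)
```

Because this layer has fewer outputs than inputs, $fan_{avg}$ is smaller than $fan_{in}$, so switching the mode to `fan_avg` gives slightly larger initial weights than plain He initialization.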


In all the examples above, we got similar results and didn't run into any vanishing/exploding gradients problem. However, we can see a different result when we use a very deep neural network. Let's see how it happens.

## Let's Vanish/Explode these Gradients

Here, I will do something very basic. I will train two neural networks. In the first one I will add lots of layers by using a for loop. In the second one, I will train almost the same model with the exception that I will use a different activation function. You will see that changing just the activation function may improve our model extraordinarily in some cases. 

In [10]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, kernel_initializer='lecun_normal',activation='relu'))
for layer in range(50):
    model.add(keras.layers.Dense(100, kernel_initializer='lecun_normal',activation='relu')) # adding 50 hidden layers.
model.add(keras.layers.Dense(10, activation="softmax"))

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Previously we talked about a paper proposing that using the SELU activation function together with LeCun initialization makes a neural network self-normalize (each layer's output keeps the same mean and variance during training), which solves the vanishing/exploding gradients problem. Let's use the proposed setup and see if it improves our neural network.

In [11]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, kernel_initializer='lecun_normal',activation='selu'))
for layer in range(50):
    model.add(keras.layers.Dense(100, kernel_initializer='lecun_normal',activation='selu')) # adding 50 hidden layers.
model.add(keras.layers.Dense(10, activation="softmax"))

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
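The difference between the two deep models can be sketched without any training. The NumPy snippet below pushes standard-normal inputs through 50 randomly initialized layers of width 100 (LeCun normal weights, no biases) and compares the activation statistics for ReLU and SELU. It is a simplified forward-pass experiment under my own assumptions (random inputs instead of Fashion MNIST, the SELU constants hard-coded), not a reproduction of the Keras runs above.

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    # scale * z for z > 0, scale * alpha * (exp(z) - 1) otherwise
    return SELU_SCALE * np.where(z > 0, z, SELU_ALPHA * (np.exp(np.minimum(z, 0.0)) - 1))

def relu(z):
    return np.maximum(z, 0.0)

def forward_stats(activation, n_layers=50, width=100, n_samples=1000, seed=42):
    """Mean and std of the activations after n_layers random dense layers."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_samples, width))
    for _ in range(n_layers):
        # LeCun normal: zero mean, variance 1 / fan_in
        W = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
        a = activation(a @ W)
    return float(a.mean()), float(a.std())

selu_mean, selu_std = forward_stats(selu)
relu_mean, relu_std = forward_stats(relu)
print("SELU:", selu_mean, selu_std)
print("ReLU:", relu_mean, relu_std)
```

With SELU the signal should stay roughly at mean 0 and standard deviation 1 all the way through, whereas with ReLU and the same initialization it shrinks layer after layer until almost nothing reaches the output.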


**It worked!** In the first model, we suffered from the vanishing/exploding gradients problem, but this wasn't the case in the second one. However, there are still some points that we need to be careful about, as stated in [Géron, A. (2019)](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/):
* The self-normalizing property of the SELU activation function is easily broken: you cannot use ℓ<sub>1</sub> or ℓ<sub>2</sub> regularization, regular dropout, max-norm, skip connections or other non-sequential topologies (so recurrent neural networks won't self-normalize). However, in practice it works quite well with sequential CNNs. If you break self-normalization, SELU will not necessarily outperform other activation functions.

However, if we don't use any technique that breaks the self-normalizing property, SELU preserves it even for very deep neural networks. Moreover, by default, the hyperparameters of SELU are tuned so that the mean output of each neuron remains close to 0 and the standard deviation remains close to 1.

The below suggestion is also taken from [the book](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/):


**Which activation function should you use for the hidden layers of your deep neural networks?** 
* Although your mileage will vary, in general SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If the network’s architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at z = 0). If you care a lot about runtime latency, then you may prefer leaky ReLU. If you don’t want to tweak yet another hyperparameter, you may use the default α values used by Keras (e.g., 0.3 for leaky ReLU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, such as RReLU if your network is overfitting or PReLU if you have a huge training set. That said, because ReLU is the most used activation function (by far), many libraries and hardware accelerators provide ReLU-specific optimizations; therefore, if speed is your priority, ReLU might still be the best choice - [Géron, A. (2019)](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/).

# Batch Normalization

We previously mentioned that if we add a BN layer as the very first layer of our neural network, we generally don't need to standardize the training set. Moreover, applying BN before the activation function sometimes works better than applying it after (try both). Also, a layer placed directly before a BN layer doesn't need bias terms, so it is appropriate to set `use_bias=False` to avoid wasting parameters. This is because the bias adds the same constant to a neuron's output for every instance in the batch, so the mean-subtraction step of batch normalization cancels it out.
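The claim about the bias can be checked numerically. The sketch below uses a bare-bones normalization step (no learned scale and shift, an `eps` value I picked for illustration) on random data, and shows that normalizing `X @ W + b` gives the same result as normalizing `X @ W`.

```python
import numpy as np

def batch_norm(z, eps=1e-3):
    """Normalize each feature (column) with the statistics of the current batch."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5))   # batch of 64 instances, 5 features
W = rng.standard_normal((5, 3))    # dense layer with 3 units
b = rng.standard_normal(3)         # one bias per unit

with_bias = batch_norm(X @ W + b)
without_bias = batch_norm(X @ W)

print(np.allclose(with_bias, without_bias))
```

The bias shifts each column by a constant, which changes the column mean by the same constant and leaves the variance untouched, so it disappears after mean subtraction.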

In [12]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="HeUniform"),
    keras.layers.PReLU(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, kernel_initializer="HeUniform"),
    keras.layers.PReLU(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid),batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Let's see the model summary.

In [13]:
model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_8 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense_122 (Dense)            (None, 300)               235500    
_________________________________________________________________
p_re_lu_6 (PReLU)            (None, 300)               300       
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_123 (Dense)            (None, 100)               30100     
_________________________________________________________________
p_re_lu_7 (PReLU)            (None, 100)              

Let's also look at the parameters BN layer added to our model.

In [14]:
[(var.name, var.trainable) for var in model.layers[4].variables]

[('batch_normalization_1/gamma:0', True),
 ('batch_normalization_1/beta:0', True),
 ('batch_normalization_1/moving_mean:0', False),
 ('batch_normalization_1/moving_variance:0', False)]

Moving mean and moving variance are non-trainable.

Moreover, the `BatchNormalization` class has some hyperparameters we can tune. One is `momentum`, which is used to update the exponential moving averages given a new value v (i.e., a new vector of input means or standard deviations computed over the current batch). A good momentum value is generally close to 1. The other is `axis`, which determines the axis that should be normalized; it is -1 by default, which means that the last axis will be normalized.
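A small sketch of the moving-average update may help here. It follows the update rule given in the Keras documentation, `moving = moving * momentum + batch_value * (1 - momentum)`; the function name and the synthetic batches are assumptions made for this illustration.

```python
import numpy as np

def update_moving_average(moving, batch_value, momentum=0.99):
    """Exponential moving average: keep most of the old value, blend in a bit of the new one."""
    return moving * momentum + batch_value * (1.0 - momentum)

rng = np.random.default_rng(1)
true_mean = np.array([1.0, -2.0, 0.5])   # made-up feature means
moving_mean = np.zeros(3)                # moving means start at zero

for _ in range(1000):
    batch = true_mean + rng.standard_normal((64, 3))   # noisy batch of 64 instances
    moving_mean = update_moving_average(moving_mean, batch.mean(axis=0))

print(moving_mean)
```

With a momentum close to 1, each batch moves the estimate only slightly, so the final values are a smooth average over many batches rather than a copy of the last one.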


The below paragraph is taken from [the book](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
* When the input batch is 2D (i.e., the batch shape is [batch size, features]), this means that each input feature will be normalized based on the mean and standard deviation computed across all the instances in the batch. For example, the first BN layer in the previous code example will independently normalize (and rescale and shift) each of the 784 input features. If we move the first BN layer before the Flatten layer, then the input batches will be 3D, with shape [batch size, height, width]; therefore, the BN layer will compute 28 means and 28 standard deviations (1 per column of pixels, computed across all instances in the batch and across all rows in the column), and it will normalize all pixels in a given column using the same mean and standard deviation. There will also be just 28 scale parameters and 28 shift parameters. If instead you still want to treat each of the 784 pixels independently, then you should set axis=[1, 2]. Notice that the BN layer does not perform the same computation during training and after training: it uses batch statistics during training and the “final” statistics after training (i.e., the final values of the moving averages).
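The counts in the quoted paragraph (28 versus 784 statistics) can be reproduced with plain NumPy by averaging over every axis except the one(s) passed as `axis`. The random array below is only a stand-in for a batch of images.

```python
import numpy as np

rng = np.random.default_rng(2)
images = rng.random((32, 28, 28))   # stand-in batch: [batch size, height, width]

# axis=-1 on a 3D batch: keep the last axis, reduce over batch and rows.
per_column_mean = images.mean(axis=(0, 1))

# axis=[1, 2] on a 3D batch: keep height and width, reduce over the batch only.
per_pixel_mean = images.mean(axis=0)

# axis=-1 after Flatten: the batch is [batch size, 784], reduce over the batch.
flat_mean = images.reshape(32, -1).mean(axis=0)

print(per_column_mean.shape, per_pixel_mean.shape, flat_mean.shape)
```

Normalizing with `axis=[1, 2]` before `Flatten` and with `axis=-1` after it use the same 784 per-pixel statistics, just arranged in a different shape.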

A good article for PReLU [link](https://github.com/christianversloot/machine-learning-articles/blob/main/how-to-use-prelu-with-keras.md)

Important paper : [Fixup Initialization: Residual Learning Without Normalization](https://arxiv.org/abs/1901.09321)

# Optimization

Let's use the model that gave us the best scores with different optimizers to see whether we can improve it further.

In [15]:
def Optimizer_tryout(selected):
    # Rebuild the same BN + PReLU network so each optimizer starts from fresh weights.
    optimizer = selected
    model = keras.models.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(300, kernel_initializer="HeUniform"),
        keras.layers.PReLU(),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(100, kernel_initializer="HeUniform"),
        keras.layers.PReLU(),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(10, activation="softmax")
    ])

    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=optimizer,
                  metrics=["accuracy"])

    history = model.fit(X_train, y_train, epochs=10,
                        validation_data=(X_valid, y_valid))

## Momentum optimization

In [16]:
Optimizer_tryout(keras.optimizers.SGD(learning_rate=0.001, momentum=0.9))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Nesterov Accelerated Gradient

In [17]:
Optimizer_tryout(keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## AdaGrad

In [18]:
Optimizer_tryout(keras.optimizers.Adagrad(learning_rate=0.001))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## RMSProp

In [19]:
Optimizer_tryout(keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Adam Optimization

In [20]:
Optimizer_tryout(keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# Adam Types

1. Adamax: This variant is proposed in the same paper as Adam. Adam scales down the parameter updates by the $\ell_2$ norm of the time-decayed gradients, whereas Adamax uses the $\ell_\infty$ norm, i.e., it scales the parameter updates by the max of the time-decayed gradients. In theory this adjustment makes Adamax more stable, but in practice it depends on the dataset.


2. Nadam: The Nadam algorithm combines Adam with the Nesterov trick, which often makes it converge slightly faster than Adam. In the [paper](https://cs229.stanford.edu/proj2015/054_report.pdf) that proposes the algorithm, it is also reported that Nadam generally outperforms Adam but is sometimes outperformed by RMSProp.
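To make the difference between Adam and Adamax concrete, here is a minimal NumPy sketch of a single update step for each, written from the update rules in the Adam paper. The function names, the toy quadratic objective, and the learning rate of 0.05 are my own choices for illustration; this is not the Keras implementation.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    m = beta_1 * m + (1 - beta_1) * grad          # decayed mean of gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2     # decayed mean of squared gradients
    m_hat = m / (1 - beta_1 ** t)                 # bias correction
    v_hat = v / (1 - beta_2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamax_step(theta, grad, m, u, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    m = beta_1 * m + (1 - beta_1) * grad
    u = np.maximum(beta_2 * u, np.abs(grad))      # decayed max instead of squared mean
    return theta - (lr / (1 - beta_1 ** t)) * m / (u + eps), m, u

theta0 = np.array([1.0, -2.0])

def run(step_fn, n_steps, lr=0.05):
    """Minimize 0.5 * ||theta||^2, whose gradient is simply theta."""
    theta, m, s = theta0.copy(), np.zeros(2), np.zeros(2)
    for t in range(1, n_steps + 1):
        theta, m, s = step_fn(theta, theta, m, s, t, lr=lr)
    return theta

adam_first, adamax_first = run(adam_step, 1), run(adamax_step, 1)
adam_final, adamax_final = run(adam_step, 200), run(adamax_step, 200)
print(adam_first, adamax_first)
print(adam_final, adamax_final)
```

On the very first step both rules move every parameter by about `lr` in the direction opposite to its gradient; they only start to differ once the history of gradients enters through `v` (a decayed average of squares) or `u` (a decayed maximum).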

## Adamax Optimization

In [21]:
Optimizer_tryout(keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Nadam Optimization

In [22]:
Optimizer_tryout(keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Also see these related adaptive optimizers:
1. [Adagrad](https://keras.io/api/optimizers/adagrad/)
2. [Adadelta](https://keras.io/api/optimizers/adadelta/)

**Notes from [the book](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/):**

* Adaptive optimization methods (including RMSProp, Adam, and Nadam optimization) are often great, converging fast to a good solution. However, a 2017 [paper](https://arxiv.org/abs/1705.08292) by Ashia C. Wilson et al. showed that they can lead to solutions that generalize poorly on some datasets. So when you are disappointed by your model’s performance, try using plain Nesterov Accelerated Gradient instead: your dataset may just be allergic to adaptive gradients. Also check out the latest research, because it’s moving fast.

# Regularization

Let's add L2 regularization to our neural network and see how it affects the performance.

In [23]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, use_bias=False,kernel_regularizer='l2'),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(100, use_bias=False,kernel_regularizer='l2'),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(10, activation="softmax",kernel_regularizer='l2')
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam",
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Now I will use L1 regularization, but I will pass the argument as a Python object because I want to tune the penalty factor.

In [24]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu",
                       kernel_initializer="he_normal",
                       kernel_regularizer=keras.regularizers.l1(0.01)),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu",
                       kernel_initializer="he_normal",
                       kernel_regularizer=keras.regularizers.l1(0.01)),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax",
                       kernel_regularizer=keras.regularizers.l1(0.01))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
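For reference, the penalties that these regularizers add to the loss are simple to write down. The sketch below computes them in NumPy for a small made-up weight matrix; the helper names and the data-loss value are hypothetical, and the factor 0.01 matches the value passed to `keras.regularizers.l1` above.

```python
import numpy as np

def l1_penalty(weights, factor=0.01):
    """L1 regularization: factor times the sum of absolute weights."""
    return float(factor * np.abs(weights).sum())

def l2_penalty(weights, factor=0.01):
    """L2 regularization: factor times the sum of squared weights."""
    return float(factor * np.square(weights).sum())

toy_weights = np.array([[0.5, -1.0],
                        [2.0,  0.0]])   # made-up kernel of a tiny layer
data_loss = 0.30                        # hypothetical cross-entropy value

l1_value = l1_penalty(toy_weights)
l2_value = l2_penalty(toy_weights)
total_loss_l1 = data_loss + l1_value    # what the optimizer minimizes with L1
print(l1_value, l2_value, total_loss_l1)
```

Since the penalty is part of the training loss, the loss reported during training includes it, which is worth keeping in mind when comparing runs with and without regularization.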


Instead of repeating the activation function, initializer, and regularizer in each layer, we can use `functools.partial()` to create a wrapper with those arguments pre-filled and then call it with only the arguments that change. I will create the wrapper with an elastic net (L1 + L2) term.

In [25]:
from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation="selu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l1_l2(l1=0.01,l2=0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    RegularizedDense(300),
    keras.layers.BatchNormalization(),
    RegularizedDense(100),
    keras.layers.BatchNormalization(),
    RegularizedDense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Dropout

Now let's implement Dropout regularization.

In [26]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.25),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.25),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.25),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


I placed BatchNorm after Dropout because placing it before may cause information leakage.

**Notes from [the book](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/):**

* Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So make sure to evaluate the training loss without dropout (e.g., after training).
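The reason training and validation losses are not directly comparable is visible in a few lines of NumPy. The sketch below is a minimal "inverted dropout" function of my own (not the Keras layer): during training it zeroes a random fraction of the inputs and rescales the rest, and at inference it returns the inputs unchanged.

```python
import numpy as np

def dropout(inputs, rate, training, rng):
    """Inverted dropout: drop with probability `rate`, rescale survivors by 1 / (1 - rate)."""
    if not training:
        return inputs                          # inactive at inference time
    keep_mask = rng.random(inputs.shape) >= rate
    return inputs * keep_mask / (1.0 - rate)

rng = np.random.default_rng(3)
activations = np.ones((1000, 100))

train_out = dropout(activations, rate=0.25, training=True, rng=rng)
test_out = dropout(activations, rate=0.25, training=False, rng=rng)

dropped_fraction = float((train_out == 0).mean())
print(dropped_fraction, train_out.mean(), test_out.mean())
```

The rescaling keeps the expected activation the same in both modes, but the training-mode outputs are noisy, so the training loss is computed on a handicapped network while the validation loss is not.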

## Alpha Dropout

One problem with regular dropout is that it breaks the self-normalizing property of the SELU + LeCun normal initialization combination. To cope with that, we can use alpha dropout, which preserves the mean and standard deviation of its inputs.

In [27]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Monte Carlo Dropout - A free lunch?

An interesting technique called Monte Carlo Dropout, published in a [paper](https://arxiv.org/abs/1506.02142) by Yarin Gal and Zoubin Ghahramani, builds on a connection between dropout networks and deep Gaussian processes (an approximate Bayesian inference method). Each time dropout is applied, a different random subset of neurons is dropped, so we effectively get a slightly different neural network. When making predictions on test data, we can keep dropout active, run several of these different networks, and average their predictions, which tends to make the predictions slightly better. You can read more about this algorithm in this [article](https://towardsdatascience.com/monte-carlo-dropout-7fd52f8b6571). The algorithm resembles a voting classifier.

Let's make 100 predictions over the test set (we need to set `training=True` to ensure that the Dropout layers are active) and then stack the predictions. Thanks to dropout, the predictions are made by different networks, so they differ from one another. When we average over the first dimension (axis=0), we get an array with the same shape as a single prediction. In the end, averaging over multiple predictions with dropout on gives us a Monte Carlo estimate that is generally more reliable than the result of a single prediction with dropout off.

In [28]:
y_probas = np.stack([model(X_test, training=True)
                     for sample in range(100)])
y_proba = y_probas.mean(axis=0)
y_std = y_probas.std(axis=0)
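The same stack-then-average pattern can be tried on a tiny hand-made network, which makes the shapes easy to follow. Everything in this sketch (the random weights, the toy inputs, the dropout rate of 0.5, the helper names) is an assumption for illustration and has nothing to do with the trained model above.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predict_with_dropout(X, W1, W2, rate, rng):
    """One stochastic forward pass: dropout stays active on the hidden layer."""
    hidden = np.maximum(X @ W1, 0.0)
    keep_mask = rng.random(hidden.shape) >= rate
    hidden = hidden * keep_mask / (1.0 - rate)
    return softmax(hidden @ W2)

rng = np.random.default_rng(4)
X_toy = rng.standard_normal((5, 8))    # 5 instances, 8 features
W1 = rng.standard_normal((8, 16))      # untrained, random weights
W2 = rng.standard_normal((16, 3))      # 3 classes

# 100 stochastic passes -> array of shape [samples, instances, classes]
toy_probas = np.stack([predict_with_dropout(X_toy, W1, W2, 0.5, rng)
                       for _ in range(100)])
toy_proba = toy_probas.mean(axis=0)    # Monte Carlo estimate per instance
toy_std = toy_probas.std(axis=0)       # spread across the passes

print(toy_probas.shape, toy_proba.shape)
print(np.round(toy_proba[:1], 2), np.round(toy_std[:1], 2))
```

The averaged rows still sum to 1, so they can be read as class probabilities, and the standard deviation gives a rough sense of how much the individual passes disagree.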

Let's look at the prediction for the first instance in the test data.

In [29]:
np.round(model.predict(X_test[:1]), 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.99]],
      dtype=float32)

The model says that this instance belongs to class 9 with 99% probability. Let's see how this probability will change when we use Monte Carlo Dropout.

In [None]:
np.round(y_probas[:, :1], 2)

Overall, the model still says that this instance belongs to class 9. However, it is now less certain about it. Let's average over the first dimension to get the classification result for the first instance.

In [31]:
np.round(y_proba[:1], 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.3 , 0.  , 0.2 , 0.04, 0.45]],
      dtype=float32)

We can also have a look at the standard deviation of the probability estimation.

In [32]:
y_std = y_probas.std(axis=0)
np.round(y_std[:1], 2)

array([[0.02, 0.01, 0.  , 0.02, 0.  , 0.21, 0.02, 0.14, 0.08, 0.25]],
      dtype=float32)

Looks like there is a big variance in the probability estimation. It is not that important for this model but it may be very important especially while trying to build a model for medical prediction. Let's also see whether we improved the accuracy.

In [33]:
y_pred = np.argmax(y_proba, axis=1)

In [34]:
accuracy = np.sum(y_pred == y_test) / len(y_test)
accuracy

0.7792

Looks like we improved the accuracy.

*  The number of Monte Carlo samples you use (100 in this example) is a hyperparameter you can tweak. The higher it is, the more accurate the predictions and their uncertainty estimates will be. However, if you double it, inference time will also be doubled. Moreover, above a certain number of samples, you will notice little improvement. So your job is to find the right trade-off between latency and accuracy, depending on your application - [Géron, A. (2019)](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)

It is not a good idea to force training mode for the whole model if it contains other special layers, such as BatchNormalization layers. In cases like that, we can use the subclassing API and override the `call()` method of the Dropout layer to force its `training` argument to `True`. We can do the same thing for alpha dropout and get an MC alpha dropout layer.

In [35]:
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

class MCAlphaDropout(keras.layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

Let's use MCAlphaDropout

In [36]:
mc_model = keras.models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer
    for layer in model.layers
])

In [37]:
mc_model.summary()

Model: "sequential_21"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_20 (Flatten)         (None, 784)               0         
_________________________________________________________________
mc_alpha_dropout (MCAlphaDro (None, 784)               0         
_________________________________________________________________
dense_158 (Dense)            (None, 300)               235500    
_________________________________________________________________
mc_alpha_dropout_1 (MCAlphaD (None, 300)               0         
_________________________________________________________________
dense_159 (Dense)            (None, 100)               30100     
_________________________________________________________________
mc_alpha_dropout_2 (MCAlphaD (None, 100)               0         
_________________________________________________________________
dense_160 (Dense)            (None, 10)              

In [38]:
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
mc_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

In [39]:
mc_model.set_weights(model.get_weights())

In [40]:
np.round(np.mean([mc_model.predict(X_test[:1]) for sample in range(100)], axis=0), 3)

array([[0.002, 0.   , 0.001, 0.001, 0.002, 0.339, 0.003, 0.207, 0.025,
        0.42 ]], dtype=float32)

In [41]:
y_probas = np.stack([mc_model(X_test) for sample in range(100)])
y_proba = y_probas.mean(axis=0)

In [42]:
y_pred = np.argmax(y_proba, axis=1)

In [43]:
accuracy = np.sum(y_pred == y_test) / len(y_test)
accuracy

0.7803

Another way to calculate accuracy

In [44]:
y_pred = np.argmax(np.round(np.mean([mc_model.predict(X_test) for sample in range(100)], axis=0), 3), axis=1)

In [45]:
accuracy = np.sum(y_pred == y_test) / len(y_test)
accuracy

0.7806

This MCDropout class will work with all Keras APIs, including the Sequential API. If you only care about the
Functional API or the Subclassing API, you do not have to create an MCDropout class; you can create a regular
Dropout layer and call it with `training=True`.
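As a rough sketch of that last point, the Functional API lets us pass `training=True` when calling a regular Dropout layer. The tiny model, the input size of 20 and the names `mc_functional_model` and `X_demo` are assumptions made up for this example; they are not the Fashion MNIST model used above.

```python
import numpy as np
from tensorflow import keras

# Small illustrative model (hypothetical sizes, untrained weights).
inputs = keras.layers.Input(shape=(20,))
hidden = keras.layers.Dense(32, activation="relu")(inputs)
# A regular Dropout layer, called with training=True so that it stays
# active at inference time as well.
hidden = keras.layers.Dropout(rate=0.5)(hidden, training=True)
outputs = keras.layers.Dense(10, activation="softmax")(hidden)
mc_functional_model = keras.Model(inputs=inputs, outputs=outputs)

X_demo = np.random.default_rng(0).normal(size=(4, 20)).astype("float32")

# Every call draws a new dropout mask, so the outputs differ between passes.
y_probas_demo = np.stack([np.asarray(mc_functional_model(X_demo))
                          for _ in range(20)])
y_proba_demo = y_probas_demo.mean(axis=0)
```

Only the Dropout layer is forced into training mode here, so any other layers in the model keep their normal inference behavior.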

## Max Norm Regularization

Max-norm regularization (also called a max-norm constraint) is a popular regularization technique that constrains the weights of the incoming connections of each neuron. It relies on the L2 norm, but it does not add a penalty term to the overall loss function. Instead, after each training step the L2 norm of each neuron's weight vector **w** is checked: if it is less than or equal to the hyperparameter **r**, the vector is left unchanged; otherwise it is rescaled to have norm exactly r, that is, it is replaced by the unit vector in its direction scaled by r.
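The rescaling step can be sketched in plain NumPy. This is only an illustration of the rule, not the Keras implementation; the function name `max_norm_clip` and the small `eps` guard against division by zero are assumptions of this sketch.

```python
import numpy as np

def max_norm_clip(W, r=1.0, axis=0, eps=1e-7):
    """Rescale each weight vector along `axis` so its L2 norm is at most r.

    Vectors whose norm is already <= r are returned unchanged.
    """
    norms = np.linalg.norm(W, axis=axis, keepdims=True)
    # Scale factor is 1 when the norm is within the limit, r / norm otherwise.
    scale = np.minimum(1.0, r / np.maximum(norms, eps))
    return W * scale

# Two neurons (columns): the first has norm 5, the second has norm 0.5.
W_demo = np.array([[3.0, 0.3],
                   [4.0, 0.4]])
W_clipped = max_norm_clip(W_demo, r=1.0)
print(W_clipped)
print(np.linalg.norm(W_clipped, axis=0))
```

The first column is shrunk onto the ball of radius r, while the second column, which was already inside it, is left as it is.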

You can find more about Max-Norm Regularization in the following articles.

[This one is about Max-Norm Regularization](https://machinelearningjourney.com/index.php/2021/01/15/max-norm-regularization/)  

[This one contains information about all the regularization methods](https://cs231n.github.io/neural-networks-2/#reg)

In [46]:
layer = keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal",
                           kernel_constraint=keras.constraints.max_norm(1.))

In [47]:
from functools import partial

MaxNormDense = partial(keras.layers.Dense,
                       activation="selu", kernel_initializer="lecun_normal",
                       kernel_constraint=keras.constraints.max_norm(1.))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    MaxNormDense(300),
    MaxNormDense(100),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


After each training step, the fit() method calls the object returned by max_norm(), passes it the layer's weights, and replaces them with the rescaled weights it gets back. Besides rescaling the kernel weights we can apply other constraints; for instance, we can constrain the bias terms by setting the bias_constraint argument.

* A Dense layer usually has weights of shape [number of inputs, number of neurons], so using axis=0 means that the max-norm constraint will apply independently to each neuron’s weight vector. If you want to use max-norm with convolutional layers, make sure to set the max_norm() constraint’s axis argument appropriately (usually axis=[0, 1, 2]). - [Géron, A. (2019)](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
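The effect of the axis argument can be checked with a small NumPy sketch that only computes the norms the constraint would look at. The kernel shapes are illustrative assumptions: [inputs, neurons] for a Dense layer and [height, width, input channels, filters] for a 2D convolutional layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative kernels (random values, hypothetical sizes).
dense_kernel = rng.normal(size=(784, 300))     # [inputs, neurons]
conv_kernel = rng.normal(size=(3, 3, 64, 128)) # [height, width, in_channels, filters]

# axis=0 reduces over the inputs: one L2 norm per neuron.
dense_norms = np.sqrt(np.sum(np.square(dense_kernel), axis=0))

# axis=(0, 1, 2) reduces over height, width and input channels:
# one L2 norm per output filter.
conv_norms = np.sqrt(np.sum(np.square(conv_kernel), axis=(0, 1, 2)))

print(dense_norms.shape)  # one entry per neuron
print(conv_norms.shape)   # one entry per filter
```

With axis=0 on the convolutional kernel, the norms would instead be taken over the kernel height only, which is rarely what is wanted.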

**Important documentation:**

https://www.tensorflow.org/api_docs/python/tf/keras/losses


https://www.tensorflow.org/api_docs/python/tf/keras/optimizers


https://www.tensorflow.org/api_docs/python/tf/keras/metrics

