# Training Deep Neural Networks

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np



## The Vanishing/Exploding Gradient Problems

### Glorot and He Initialization

By default, Keras uses Glorot initialization with a uniform distribution.
When creating a layer, we can change this to He initialization by setting
`kernel_initializer="he_uniform"` or
`kernel_initializer="he_normal"` like this:

In [4]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

<keras.src.layers.core.dense.Dense at 0x7f64524bb090>

If you want He initialization with a uniform distribution but based on
$fan_{avg}$ rather than $fan_{in}$, you can use the `VarianceScaling` initializer like
this:

In [5]:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

<keras.src.layers.core.dense.Dense at 0x7f63dc1c6c50>
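To see what this initializer does numerically, here is a minimal NumPy sketch. With a uniform distribution, variance scaling samples weights from $[-limit, limit]$ where $limit = \sqrt{3 \cdot scale / n}$, and here $n = fan_{avg}$. The helper name `variance_scaling_uniform` and the layer size (784 inputs, 10 outputs) are assumptions made for illustration; this is not the Keras implementation itself.

```python
import numpy as np

def variance_scaling_uniform(fan_in, fan_out, scale=2.0, seed=42):
    """Illustrative helper: sample weights uniformly with variance scale / fan_avg."""
    fan_avg = (fan_in + fan_out) / 2
    limit = np.sqrt(3.0 * scale / fan_avg)  # weights are drawn from [-limit, limit]
    rng = np.random.default_rng(seed)
    weights = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    return weights, limit

# Hypothetical layer size: 784 inputs, 10 outputs
weights, limit = variance_scaling_uniform(fan_in=784, fan_out=10)
target_variance = 2.0 / ((784 + 10) / 2)  # scale / fan_avg

print("limit:", limit)
print("sample variance:", weights.var())
print("target variance:", target_variance)
```

The sample variance should land close to `scale / fan_avg`, because a uniform distribution on $[-limit, limit]$ has variance $limit^2 / 3$.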

### Nonsaturating Activation Functions

To use the leaky ReLU activation function, create a `LeakyReLU` layer and
add it to our model just after the layer we want to apply it to:

In [6]:
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),
])

For SELU activation, set `activation="selu"` and
`kernel_initializer="lecun_normal"` when creating a layer:

In [7]:
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")

### Batch Normalization

In [8]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

In [9]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 784)               0         
                                                                 
 batch_normalization (Batch  (None, 784)               3136      
 Normalization)                                                  
                                                                 
 dense_4 (Dense)             (None, 300)               235500    
                                                                 
 batch_normalization_1 (Bat  (None, 300)               1200      
 chNormalization)                                                
                                                                 
 dense_5 (Dense)             (None, 100)               30100     
                                                                 
 batch_normalization_2 (Bat  (None, 100)              

As we can see, each BN layer adds four parameters per input: $\gamma$, $\beta$, $\mu$, and
$\sigma$ (for example, the first BN layer adds 3,136 parameters, which is 4 × 784).
The last two parameters, $\mu$ and $\sigma$, are the moving averages; they are
not affected by backpropagation, so Keras calls them “non-trainable”
(if we count the total number of BN parameters, 3,136 + 1,200 + 400, and
divide by 2, we get 2,368, which is the total number of non-trainable
parameters in this model).
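The arithmetic above can be checked in a few lines of plain Python. The list `bn_inputs` simply restates the input sizes of the three BN layers in the model above.

```python
# Number of inputs to each of the three BN layers in the model above
bn_inputs = [784, 300, 100]

# Each BN layer holds gamma, beta, a moving mean and a moving variance per input
bn_params = [4 * n for n in bn_inputs]

# Only the moving statistics (half of the BN parameters) are non-trainable
non_trainable = sum(bn_params) // 2

print(bn_params)      # [3136, 1200, 400]
print(non_trainable)  # 2368
```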

**NOTE:**

$\mu$ and $\sigma$ are estimated during training, based on the training data, so arguably they
are trainable. In Keras, “non-trainable” really means “untouched by backpropagation.”

Let’s look at the parameters of the first BN layer. Two are trainable (by
backpropagation), and two are not:

In [10]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

The authors of the BN paper argued in favor of adding the BN layers
before the activation functions, rather than after (as we just did). Which placement works better seems to depend on the task, so it is worth trying both. To add the BN layers before the activation functions, we must
remove the activation functions from the hidden layers and add them as
separate layers after the BN layers. Moreover, since a Batch
Normalization layer includes one offset parameter per input, we can
remove the bias term from the previous layer (just pass `use_bias=False`
when creating it):

In [11]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal",use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal",use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])

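The `BatchNormalization` layer's `axis` argument determines which axis is normalized. It defaults to -1, meaning that the last axis is normalized using the means and standard deviations computed across the other axes. When the input batch is 2D (i.e., the batch shape is `[batch size, features]`), this means that each input feature will be normalized based on the mean and standard deviation computed across all
the instances in the batch.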

For example, the first BN layer in the previous code example will independently normalize (and rescale and shift) each of the 784 input features. If we move the first BN layer before the `Flatten` layer, then the input batches will be 3D, with shape `[batch size, height, width]`; therefore the BN layer will compute 28 means and 28 standard deviations (1 per column of pixels, computed across all instances in the batch and across all rows in the column), and it will normalize all pixels in a given column using the same mean and standard deviation. There will also be just 28 scale parameters and 28 shift parameters. If instead we still want to treat each of the 784 pixels independently, then we should set `axis=[1,2]`.
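The difference between the two settings comes down to which axes the statistics are pooled over. The NumPy sketch below mimics only that pooling on a synthetic batch (random values, not Fashion MNIST); it is not how Keras computes the statistics internally.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.random((32, 28, 28))  # synthetic batch: [batch size, height, width]

# axis=-1 (the default): one mean per column, pooled over instances and rows
per_column_mean = batch.mean(axis=(0, 1))

# axis=[1, 2]: one mean per pixel, pooled over instances only
per_pixel_mean = batch.mean(axis=0)

print(per_column_mean.shape)  # (28,)
print(per_pixel_mean.shape)   # (28, 28)
```

The first case yields 28 statistics and the second 784, matching the parameter counts discussed above.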

### Gradient Clipping

In Keras, implementing Gradient Clipping is just a matter of setting the
`clipvalue` or `clipnorm` argument when creating an optimizer, like this:

In [13]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)

The optimizer will clip every component of the gradient vector to a value between -1.0 and 1.0. This means that all the partial derivatives of the loss (with respect to each and every trainable parameter) will be clipped to the range from -1.0 to 1.0. This threshold is a hyperparameter we can tune.

Note that clipping by value may change the orientation of the gradient vector. For instance, if the original gradient vector is [0.9, 100.0], it points mostly in the direction of the second axis; but once we clip it by value, we get [0.9, 1.0], which points roughly along the diagonal between the two axes. In practice, this approach works well.

If we want to ensure that Gradient Clipping does not change the direction of the gradient vector, we should clip by norm by setting `clipnorm` instead of `clipvalue`. This will clip the whole gradient if its $l_2$ norm is greater than the threshold we picked. For example, if we set `clipnorm=1.0`, then the vector [0.9, 100.0] will be clipped to [0.00899964, 0.9999595], preserving its orientation but almost eliminating its first component.
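The two clipping modes can be reproduced on the example vector with NumPy. This is a sketch on a single gradient vector; the helper `clip_by_norm` is written here for illustration and is not a Keras function.

```python
import numpy as np

grad = np.array([0.9, 100.0])

# Clipping by value: each component is clipped independently
clipped_by_value = np.clip(grad, -1.0, 1.0)

def clip_by_norm(g, max_norm):
    """Illustrative helper: rescale g if its l2 norm exceeds max_norm."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

clipped_by_norm = clip_by_norm(grad, 1.0)

print(clipped_by_value)  # components 0.9 and 1.0: the direction has changed
print(clipped_by_norm)   # roughly 0.009 and 0.99996: same direction, norm 1
```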

## Reusing Pretrained Layers

### Transfer Learning with Keras

Suppose the Fashion MNIST dataset only contained eight classes: for example, all the classes except for sandal and shirt. Someone built and trained a Keras model on that set and got reasonably good performance (>90% accuracy). Let's call this model A. Now we want to tackle a different task: we have images of sandals and shirts, and we want to train a binary classifier (positive = shirt, negative = sandal). Our dataset is quite small: we only have 200 labeled images. When we train a new model for this task (let's call it model B) with the same architecture as model A, it performs reasonably well (97.2% accuracy). But since it's a much easier task (there are just two classes), we were hoping for more. While drinking our morning coffee, we realize that our task is quite similar to task A, so perhaps transfer learning can help. Let's find out.

But first of all we will train model A, so that we can reuse it.

##### Training Model A

Let's split the Fashion MNIST dataset in two:

- `X_train_A`: all the images of all items except for sandals and shirts (classes 5 and 6)
- `X_train_B`: a smaller training set of just 200 images of sandals and shirts

The validation and test set are also split this way, but without restricting the number of images. 

In [41]:
def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32) #binary classification task: is it a shirt (class 6)?
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))
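To check the label remapping, the same function can be run on a tiny hand-made label array. The arrays `X_toy` and `y_toy` below are made up for illustration; the function body repeats the cell above so the snippet runs on its own.

```python
import numpy as np

def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6)  # sandals or shirts
    y_A = y[~y_5_or_6]              # boolean indexing returns a copy
    y_A[y_A > 6] -= 2               # class indices 7, 8, 9 move to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32)  # 1.0 for shirts, 0.0 for sandals
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))

# Toy data: X_toy just holds the original positions of the labels
y_toy = np.array([0, 5, 6, 7, 9, 4], dtype=np.uint8)
X_toy = np.arange(len(y_toy))

(X_A, y_A), (X_B, y_B) = split_dataset(X_toy, y_toy)
print(y_A)  # labels for task A, remapped to the range 0..7
print(y_B)  # binary labels for task B
```

Because boolean indexing copies the data, the in-place subtraction does not modify the original label array.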

In [42]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [43]:
(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

In [7]:
X_train_A.shape

(43986, 28, 28)

In [8]:
X_train_B.shape

(200, 28, 28)

In [9]:
y_train_A[:30]

array([4, 0, 5, 7, 7, 7, 4, 4, 3, 4, 0, 1, 6, 3, 4, 3, 2, 6, 5, 3, 4, 5,
       1, 3, 4, 2, 0, 6, 7, 1], dtype=uint8)

In [10]:
y_train_B[:30]

array([1., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0., 0.,
       0., 0., 1., 1., 0., 0., 1., 1., 0., 1., 1., 1., 1.], dtype=float32)

In [11]:
tf.random.set_seed(42)
np.random.seed(42)

In [12]:
keras.backend.clear_session()

In [13]:
model_A = keras.models.Sequential()
model_A.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(keras.layers.Dense(8, activation="softmax"))

In [14]:
model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

In [15]:
history = model_A.fit(X_train_A, y_train_A, epochs=20, validation_data=(X_valid_A, y_valid_A))

Epoch 1/20




Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [16]:
model_A.save("my_model_A.h5")



Now we will train model B (a binary classifier) from scratch, then train it again using transfer learning, and check whether transfer learning provides any benefit:

##### Training Model B from scratch

In [17]:
model_B = keras.models.Sequential()
model_B.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_B.add(keras.layers.Dense(n_hidden, activation="selu"))
model_B.add(keras.layers.Dense(1, activation="sigmoid"))

In [18]:
model_B.compile(loss="binary_crossentropy",
               optimizer=keras.optimizers.SGD(learning_rate=1e-3),
               metrics=["accuracy"])

In [19]:
history = model_B.fit(X_train_B, y_train_B, epochs=20, 
                     validation_data=(X_valid_B, y_valid_B))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [20]:
model_B.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_1 (Flatten)         (None, 784)               0         
                                                                 
 dense_6 (Dense)             (None, 300)               235500    
                                                                 
 dense_7 (Dense)             (None, 100)               30100     
                                                                 
 dense_8 (Dense)             (None, 50)                5050      
                                                                 
 dense_9 (Dense)             (None, 50)                2550      
                                                                 
 dense_10 (Dense)            (None, 50)                2550      
                                                                 
 dense_11 (Dense)            (None, 1)                

Now let's use transfer learning.

First we need to load model A and create a new model based on that model's layers. Let's reuse all the layers except for the output layer:

In [21]:
model_A = keras.models.load_model("my_model_A.h5")

In [22]:
model_A.layers

[<keras.src.layers.reshaping.flatten.Flatten at 0x7f53701f3dd0>,
 <keras.src.layers.core.dense.Dense at 0x7f53706541d0>,
 <keras.src.layers.core.dense.Dense at 0x7f53d6335f50>,
 <keras.src.layers.core.dense.Dense at 0x7f53701dd610>,
 <keras.src.layers.core.dense.Dense at 0x7f5350711910>,
 <keras.src.layers.core.dense.Dense at 0x7f53701db590>,
 <keras.src.layers.core.dense.Dense at 0x7f5350708690>]

In [23]:
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

Note that `model_A` and `model_B_on_A` now share some layers. When we train `model_B_on_A`, it will also affect `model_A`. If we want to avoid that, we need to clone `model_A` before we reuse its layers. To do this, we clone model A's architecture with `clone_model()`, then copy its weights (since `clone_model()` does not clone the weights):

In [24]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

In [25]:
model_B_on_A = keras.models.Sequential(model_A_clone.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

Now we could train `model_B_on_A` for task B, but since the new output layer was initialized randomly, it will make large errors (at least during the first few epochs), so there will be large error gradients that may wreck the reused weights. To avoid this, one approach is to freeze the reused layers during the first few epochs, giving the new layer some time to learn reasonable weights.

To do this, set every reused layer's `trainable` attribute to `False` and compile the model:

In [26]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

In [27]:
model_B_on_A.compile(loss="binary_crossentropy", 
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3), 
                     metrics=["accuracy"])

**NOTE:**

We must always compile our model after we freeze or unfreeze layers.

Now we can train the model for a few epochs, then unfreeze the reused layers (which requires compiling the model again) and continue training to fine-tune the reused layers for task B. After unfreezing the reused layers, it is usually a good idea to reduce the learning rate, once again to avoid damaging the reused weights.

In [28]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4, validation_data=(X_valid_B, y_valid_B))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [29]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

In [30]:
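# Note: this keeps the same rate as the frozen phase; pass a smaller value (e.g., 1e-4) to follow the advice above
optimizer = keras.optimizers.SGD(learning_rate=1e-3)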
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])

In [31]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16, validation_data=(X_valid_B, y_valid_B))

Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


In [32]:
model_B_on_A.evaluate(X_test_B, y_test_B)



[0.0902191549539566, 0.987500011920929]

In [33]:
model_B.evaluate(X_test_B, y_test_B)



[0.09650509804487228, 0.9815000295639038]

Transfer learning improved the test accuracy slightly, from 98.15% to 98.75%, which means the error rate dropped from 1.85% to 1.25%.

It turns out that transfer learning does not work very well with small dense networks, presumably because small networks learn few patterns, and dense networks learn very specific patterns, which are unlikely to be useful in other tasks. Transfer learning works best with deep CNNs, which tend to learn feature detectors that are much more general (especially in the lower layers).

### Unsupervised Pretraining

### Pretraining on Auxiliary Task

## Faster Optimizers

### Momentum Optimization

Implementing momentum optimization in Keras is a no-brainer: just use
the SGD optimizer and set its `momentum` hyperparameter, then lie back and
profit!

In [2]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

### Nesterov Accelerated Gradient (NAG)

NAG is generally faster than regular momentum optimization. To use it, simply set `nesterov=True` when creating the SGD optimizer:

In [3]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

### AdaGrad

In [4]:
optimizer = keras.optimizers.Adagrad(learning_rate=0.001)

### RMSProp

In [6]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

### Adam and Nadam Optimization

The momentum decay hyperparameter $\beta_1$ is typically initialized to 0.9,
while the scaling decay hyperparameter $\beta_2$ is often initialized to 0.999. As
earlier, the smoothing term $\epsilon$ is usually initialized to a tiny number such as
$10^{-7}$.

These are the default values for the `Adam` class (to be precise, in some Keras versions `epsilon` defaults to `None`, which tells Keras to use `keras.backend.epsilon()`, which defaults to $10^{-7}$; we can change it
using `keras.backend.set_epsilon()`).

In [7]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

Since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter $\eta$.
We can often use the default value $\eta$ = 0.001, making Adam even easier to use than Gradient Descent.

#### Adamax and Nadam

In [8]:
optimizer = keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

In [9]:
optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

### Learning Rate Scheduling

#### Power Scheduling

Implementing power scheduling in Keras is the easiest option: just set the
`decay` hyperparameter when creating an optimizer:

In [13]:
optimizer = keras.optimizers.legacy.SGD(learning_rate=0.01, decay=1e-4)

Power scheduling sets the learning rate at step $t$ to $\eta(t) = \eta_0 / (1 + t/s)^c$. The `decay` argument is the inverse of `s` (the number of steps it takes to divide the learning rate by one more unit), and Keras assumes that `c` is equal to 1.
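As a rough illustration, the schedule can be written as a plain Python function. The name `power_schedule` is made up for this sketch, and the default values match the optimizer created above (`learning_rate=0.01`, `decay=1e-4`, so `s` is 10,000 steps).

```python
def power_schedule(step, lr0=0.01, decay=1e-4, c=1):
    """Illustrative helper: learning rate after `step` training steps."""
    s = 1 / decay  # number of steps needed to add one unit to the divisor
    return lr0 / (1 + step / s) ** c

for step in (0, 10_000, 20_000, 30_000):
    print(step, power_schedule(step))
```

After `s` steps the learning rate is divided by 2, after `2s` steps by 3, after `3s` steps by 4, and so on, so the decay is fast at first and then slows down.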

#### Exponential Scheduling

Exponential scheduling and piecewise scheduling are quite simple too.
We first need to define a function that takes the current epoch and returns
the learning rate. For example, let’s implement exponential scheduling:

In [14]:
def exponential_decay_fn(epoch):
    return 0.01*0.1**(epoch/20)

If we don't want to hardcode $\eta_0$ and $s$, we can create a function that returns a configured function:

In [16]:
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

In [None]:
exponential_decay_fn = exponential_decay(lr0=0.01, s=20)
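A quick check of what this schedule returns, using the same `lr0` and `s` values; the function is repeated so the snippet runs on its own.

```python
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

schedule = exponential_decay(lr0=0.01, s=20)

# The learning rate drops by a factor of 10 every s = 20 epochs
rates = [schedule(epoch) for epoch in (0, 20, 40)]
print(rates)  # roughly 0.01, 0.001, 0.0001
```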

Next, create a `LearningRateScheduler` callback, giving it the schedule function, and pass this callback to the `fit()` method:

In [18]:
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

The `LearningRateScheduler` will update the optimizer's `learning_rate` attribute at the beginning of each epoch. Updating the learning rate once per epoch is usually enough, but if we want to update more often, for example at every step, we can always write our own callback. Updating the learning rate at every step makes sense if there are many steps per epoch. Let's try it out:

In [20]:
K = keras.backend

class ExponentialDecay(keras.callbacks.Callback):
    def __init__(self, s=40000):
        super().__init__()
        self.s = s
    
    def on_batch_begin(self, batch, logs=None):
        #Note: the `batch` argument resets at each epoch
        lr = K.get_value(self.model.optimizer.learning_rate)
        K.set_value(self.model.optimizer.learning_rate, lr*0.1**(1/(self.s)))
        
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        logs["lr"] = K.get_value(self.model.optimizer.learning_rate)

In [21]:
exp_decay = ExponentialDecay()

Alternatively, we can use the `keras.optimizers.schedules` approach.

The schedule function can optionally take the current learning rate as a second argument. For example, the following schedule function multiplies the previous learning rate by $0.1^{1/20}$, which results in the same exponential decay (except that the decay now starts at the beginning of epoch 0 instead of 1):

In [22]:
def exponential_decay_fn(epoch, lr):
    return lr * 0.1 ** (1 / 20)
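The loop below imitates, in plain Python, a callback calling this function once at the start of each epoch and feeding the result back in. The starting rate of 0.01 is an assumption for illustration.

```python
def exponential_decay_fn(epoch, lr):
    return lr * 0.1 ** (1 / 20)

lr = 0.01  # assumed initial learning rate
history = []
for epoch in range(20):
    lr = exponential_decay_fn(epoch, lr)  # called at the beginning of each epoch
    history.append(lr)

print(history[0])   # already below 0.01: the decay applies from epoch 0
print(history[-1])  # close to 0.001 after 20 calls
```

Since the function never reads `epoch`, the schedule depends only on the learning rate it is handed at each call.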

When we save the model, the optimizer and its learning rate get saved along with it. This means that with this new schedule function, we could just load a trained model and continue training where it left off, no problem.

Things are not so simple if our schedule function uses the `epoch` argument, however: the epoch does not get saved, and it gets reset to 0 every time we call the `fit()` method. If we were to continue training a model where it left off, this could lead to a very large learning rate, which would likely damage our model's weights. One solution is to manually set the `fit()` method's `initial_epoch` argument so that `epoch` starts at the right value.

#### Piecewise Constant Scheduling

For piecewise constant scheduling, we can use a schedule function like the following one, then create a `LearningRateScheduler` callback with this function and pass it to the `fit()` method, just like we did for exponential scheduling.

In [23]:
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

More general Piecewise Constant Scheduling function:

In [32]:
def piecewise_constant(boundaries, values):
    boundaries = np.array([0] + boundaries)  # prepend 0 so the first interval starts at epoch 0
    values = np.array(values)  # one learning rate per interval
    def piecewise_constant_fn(epoch):
        # np.argmax returns the index of the first boundary greater than epoch;
        # subtracting 1 gives the index of the interval that contains epoch.
        # If no boundary is greater, argmax returns 0 and index -1 selects the last value.
        return values[np.argmax(boundaries > epoch) - 1]
    return piecewise_constant_fn

In [33]:
piecewise_constant_fn = piecewise_constant([5,15], [0.01, 0.005, 0.001])
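To confirm that the general version matches the hard-coded one, we can evaluate it around the boundaries. The function is repeated here so the snippet runs on its own.

```python
import numpy as np

def piecewise_constant(boundaries, values):
    boundaries = np.array([0] + boundaries)
    values = np.array(values)
    def piecewise_constant_fn(epoch):
        # Index of the first boundary greater than epoch, minus 1;
        # when none is greater, index -1 wraps around to the last value
        return values[np.argmax(boundaries > epoch) - 1]
    return piecewise_constant_fn

schedule = piecewise_constant([5, 15], [0.01, 0.005, 0.001])

epochs = (0, 4, 5, 14, 15, 30)
rates = [float(schedule(epoch)) for epoch in epochs]
for epoch, rate in zip(epochs, rates):
    print(epoch, rate)
```

Epochs 0 to 4 get 0.01, epochs 5 to 14 get 0.005, and every later epoch gets 0.001, thanks to the negative index wrapping around to the last entry.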

#### Performance Scheduling

For performance scheduling, use the `ReduceLROnPlateau` callback. For example, if we pass the following callback to the `fit()` method, it will multiply the learning rate by 0.5 whenever the best validation loss does not improve for five consecutive epochs:

In [34]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

Lastly, tf.keras offers an alternative way to implement learning rate scheduling: define the learning rate using one of the schedules available in `keras.optimizers.schedules`, then pass this learning rate to any optimizer. This approach updates the learning rate at each step rather than at each epoch.

For example, here's how to implement the same exponential schedule as the `exponential_decay_fn()` function we defined earlier:

In [35]:
s = 40000
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)

This is nice and simple, plus when we save the model, the learning rate and its schedule (including its state) get saved as well. This approach, however, is not part of the Keras API; it is specific to tf.keras.
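For reference, with its default settings (no staircase) this schedule computes `initial_learning_rate * decay_rate ** (step / decay_steps)`. The plain Python function below, named `exponential_decay_schedule` for this sketch, reproduces that formula with the same three values passed above.

```python
def exponential_decay_schedule(step, initial_learning_rate=0.01,
                               decay_steps=40_000, decay_rate=0.1):
    """Illustrative helper: learning rate at a given training step."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)

for step in (0, 20_000, 40_000):
    print(step, exponential_decay_schedule(step))
```

The rate falls by a factor of 10 every 40,000 steps, and it changes a little at every step rather than once per epoch.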

#### 1Cycle Scheduling

As for the 1cycle approach, the implementation poses no particular difficulty: just create a custom callback that modifies the learning rate at each iteration (we can update the optimizer's learning rate by changing `self.model.optimizer.learning_rate`):

In [37]:
K = keras.backend

class OneCycleScheduler(keras.callbacks.Callback):
    def __init__(self, iterations, max_rate, start_rate=None, 
                last_iterations=None, last_rate=None):
        super().__init__()
        self.iterations = iterations
        self.max_rate = max_rate
        self.start_rate = start_rate or max_rate / 10
        self.last_iterations = last_iterations or iterations // 10 + 1
        self.half_iteration = (iterations - self.last_iterations) // 2
        self.last_rate = last_rate or self.start_rate / 1000
        self.iteration = 0
    
    # The leading underscore marks this as an internal helper, not part of the public interface.
    def _interpolate(self, iter1, iter2, rate1, rate2):
        return ((rate2 -  rate1) * (self.iteration - iter1) / (iter2 - iter1) + rate1)
    
    def on_batch_begin(self, batch, logs):
        if self.iteration < self.half_iteration:
            rate = self._interpolate(0, self.half_iteration, self.start_rate, self.max_rate)
        elif self.iteration < 2 * self.half_iteration:
            rate = self._interpolate(self.half_iteration, 2 * self.half_iteration, self.max_rate,
                                     self.start_rate)
        else:
            rate = self._interpolate(2 * self.half_iteration, self.iterations, self.start_rate,
                                    self.last_rate)
        self.iteration += 1
        K.set_value(self.model.optimizer.learning_rate, rate)
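To inspect the shape of the schedule without training a model, the same interpolation logic can be pulled out into a standalone function. The name `one_cycle_rate` and the values `iterations=1000` and `max_rate=0.05` are assumptions chosen for illustration.

```python
def one_cycle_rate(iteration, iterations, max_rate, start_rate=None,
                   last_iterations=None, last_rate=None):
    """Illustrative helper: learning rate the callback would set at `iteration`."""
    start_rate = start_rate or max_rate / 10
    last_iterations = last_iterations or iterations // 10 + 1
    half_iteration = (iterations - last_iterations) // 2
    last_rate = last_rate or start_rate / 1000

    def interpolate(iter1, iter2, rate1, rate2):
        # Linear interpolation between (iter1, rate1) and (iter2, rate2)
        return (rate2 - rate1) * (iteration - iter1) / (iter2 - iter1) + rate1

    if iteration < half_iteration:        # first half: ramp up
        return interpolate(0, half_iteration, start_rate, max_rate)
    if iteration < 2 * half_iteration:    # second half: ramp back down
        return interpolate(half_iteration, 2 * half_iteration, max_rate, start_rate)
    # final iterations: drop toward last_rate
    return interpolate(2 * half_iteration, iterations, start_rate, last_rate)

rates = [one_cycle_rate(i, iterations=1000, max_rate=0.05) for i in range(1000)]
print(rates[0], max(rates), rates[-1])
```

The rate climbs linearly from `start_rate` to `max_rate`, descends back to `start_rate`, then falls toward `last_rate` over the final iterations.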

## Avoiding Overfitting Through Regularization

### $l_1$ and $l_2$ regularization

Here's how to apply $l_2$ regularization to a Keras layer's connection weights, using a regularization factor of 0.01:

In [38]:
layer = keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
                          kernel_regularizer=keras.regularizers.l2(0.01))

The `l2()` function returns a regularizer that will be called at each step during training to compute the regularization loss. This is then added to the final loss.

Similarly, we can use `keras.regularizers.l1()` for $l_1$ regularization and `keras.regularizers.l1_l2()` if we want both $l_1$ and $l_2$ regularization, specifying factors for both.
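As a sketch of what these regularizers compute, the $l_2$ penalty is the factor times the sum of squared weights, and the $l_1$ penalty is the factor times the sum of absolute weights. The helper names and the small weight matrix below are made up for illustration.

```python
import numpy as np

def l2_penalty(weights, factor=0.01):
    """Illustrative helper: factor * sum of squared weights."""
    return factor * np.sum(np.square(weights))

def l1_penalty(weights, factor=0.01):
    """Illustrative helper: factor * sum of absolute weights."""
    return factor * np.sum(np.abs(weights))

weights = np.array([[1.0, -2.0],
                    [0.5,  0.0]])

print(l2_penalty(weights))  # about 0.01 * (1 + 4 + 0.25) = 0.0525
print(l1_penalty(weights))  # about 0.01 * (1 + 2 + 0.5) = 0.035
```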

Since we typically want to apply the same regularizer to all layers in the network, as well as using the same activation function and the same initialization strategy in all hidden layers, we may find ourselves repeating the same arguments. To avoid this, we can try refactoring our code to use loops. Another option is to use Python's `functools.partial()` function, which lets us create a thin wrapper for any callable, with some default argument values:

In [39]:
from functools import partial

RegularizedDense = partial(keras.layers.Dense, activation="elu", kernel_initializer="he_normal",
                          kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax", kernel_initializer="glorot_uniform")
])
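The way `partial()` handles defaults and overrides is easier to see on a plain function. In this sketch, `make_layer` is a hypothetical stand-in that just returns its arguments as a dictionary.

```python
from functools import partial

def make_layer(units, activation="relu", initializer="glorot_uniform"):
    """Hypothetical stand-in for a layer constructor."""
    return {"units": units, "activation": activation, "initializer": initializer}

# Pre-fill two keyword arguments; positional arguments are still passed at call time
ConfiguredLayer = partial(make_layer, activation="elu", initializer="he_normal")

hidden = ConfiguredLayer(300)
# Keyword arguments given at call time override the pre-filled ones
output = ConfiguredLayer(10, activation="softmax", initializer="glorot_uniform")

print(hidden)
print(output)
```

This is why the output layer above can reuse `RegularizedDense` while swapping in its own activation and initializer.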

### Dropout

To implement dropout using Keras, we can use the `keras.layers.Dropout` layer. During training it randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep probability. After training, it does nothing at all; it just passes the inputs to the next layer. The following code applies the dropout regularization before every `Dense` layer, using a dropout rate of 0.2:

In [44]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_valid_scaled = (X_valid - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

In [45]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
n_epochs=2
h = model.fit(X_train_scaled, y_train, epochs=n_epochs, 
             validation_data=(X_valid_scaled, y_valid))

Epoch 1/2




Epoch 2/2


**Warning:**

Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So make sure to evaluate the training loss without dropout (e.g., after training).

If we observe that model is overfitting, we can increase the dropout rate and vice-versa. It can also help to increase the dropout rate for large layers, and reduce it for small ones.

Dropout tends to significantly slow down the convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra effort and time. 
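The training-time behavior described earlier (drop some inputs, divide the rest by the keep probability) can be sketched in NumPy as follows. This is a simplified illustration with a made-up `dropout` helper, not the Keras layer's implementation.

```python
import numpy as np

def dropout(inputs, rate, training, rng):
    """Illustrative helper: zero out inputs at random and rescale the rest."""
    if not training:
        return inputs  # after training, inputs pass through unchanged
    keep_prob = 1.0 - rate
    mask = rng.random(inputs.shape) < keep_prob  # True for the inputs we keep
    return inputs * mask / keep_prob

rng = np.random.default_rng(42)
x = np.ones((1000, 100))

train_out = dropout(x, rate=0.2, training=True, rng=rng)
test_out = dropout(x, rate=0.2, training=False, rng=rng)

print(np.unique(train_out))  # only two values: 0 and 1 / 0.8
print(train_out.mean())      # close to 1.0, the mean of the inputs
```

Dividing by the keep probability keeps the expected value of each input unchanged, so the next layer sees inputs on the same scale during and after training.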

**TIP:**

If we want to regularize a self-normalizing network based on the SELU activation function, we should use *alpha dropout*: this is a variant of dropout that preserves the mean and the standard deviation of its inputs (it was introduced in the same paper as SELU, since regular dropout would break self-normalization).

### Monte Carlo (MC) Dropout

Look at the following code. It is the full implementation of MC Dropout, boosting the dropout model we trained earlier without retraining it:

In [46]:
y_probas = np.stack([model(X_test_scaled, training=True) for sample in range(100)]) # 100 stochastic forward passes with dropout active
y_proba = y_probas.mean(axis=0)



We just make 100 predictions over the test set, setting `training=True` to ensure that the `Dropout` layer is active, and stack the predictions. Since dropout is active, all the predictions will be different. Recall that the model, like `predict()`, returns a matrix with one row per instance and one column per class. Because there are 10,000 instances in the test set and 10 classes, this is a matrix of shape [10000, 10]. We stack 100 such matrices, so `y_probas` is an array of shape [100, 10000, 10]. Once we average over the first dimension (`axis=0`), we get an array of shape [10000, 10], like we would get with a single prediction. That's all!

Averaging over multiple predictions with dropout on gives us a Monte Carlo estimate that is generally more reliable than the result of a single prediction with dropout off.
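The shape bookkeeping described above can be reproduced without TensorFlow. The following is a minimal NumPy sketch, assuming a hypothetical toy network (random weights, 4 inputs, 8 hidden units, 3 classes) in place of the trained model; only the stack-then-average pattern matches the notebook code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 4 inputs -> 8 hidden units -> 3 classes.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_proba(X, rate=0.5, training=False):
    h = np.maximum(X @ W1, 0.0)             # ReLU hidden layer
    if training:                            # inverted dropout, only when requested
        mask = rng.random(h.shape) >= rate
        h = h * mask / (1.0 - rate)
    return softmax(h @ W2)

X = rng.normal(size=(5, 4))                 # 5 instances

# Stack 100 stochastic passes, then average over the sample axis.
y_probas = np.stack([predict_proba(X, training=True) for _ in range(100)])
y_proba = y_probas.mean(axis=0)             # Monte Carlo estimate
y_std = y_probas.std(axis=0)                # spread across the stochastic passes

print(y_probas.shape)  # (100, 5, 3)
print(y_proba.shape)   # (5, 3)
```

Each averaged row is still a valid probability distribution, since a mean of vectors that each sum to 1 also sums to 1.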

For example, let's look at the model's prediction for the first instance in the test set, with dropout off:

In [48]:
np.round(model.predict(X_test_scaled[:1]), 2)



array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.02, 0.  , 0.09, 0.  , 0.89]],
      dtype=float32)

The model seems almost certain that this image belongs to class 9 (ankle boot). Should we trust it? Is there really so little room for doubt? Let's compare it with the predictions made when dropout is activated:

In [49]:
np.round(y_probas[:, :1], 2)

array([[[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.15, 0.  , 0.84]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.03, 0.  , 0.18, 0.  , 0.79]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.2 , 0.  , 0.02, 0.  , 0.77]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.02, 0.  , 0.97]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.11, 0.  , 0.06, 0.  , 0.83]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.07, 0.  , 0.93]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.15, 0.  , 0.01, 0.  , 0.84]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.07, 0.  , 0.93]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.99]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.05, 0.  , 0.95]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.7 , 0.  , 0.04, 0.  , 0.26]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.04, 0.  , 0.04, 0.  , 0.92]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.1 , 0.  , 0.47, 0.  , 0.43]],

       [[0.  , 0.  , 0.  , 0.  , 0.  , 0.04, 0.  , 0.32, 0.  , 0

This tells a different story: apparently, when we activate dropout, the model is not so sure anymore. It still seems to prefer class 9, but sometimes it hesitates with classes 5 (sandal) and 7 (sneaker), which makes sense given they are all footwear.

Once we average over the first dimension, we get the following MC Dropout predictions:

In [50]:
np.round(y_proba[:1], 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.06, 0.  , 0.17, 0.  , 0.77]],
      dtype=float32)

The model still thinks this image belongs to class 9, but with only 77% confidence, which seems much more reasonable than 89%. It is also useful to know exactly which other classes it considers likely. We can also look at the standard deviation of the probability estimates:

In [51]:
y_std = y_probas.std(axis=0)
np.round(y_std[:1], 2)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.09, 0.  , 0.14, 0.  , 0.17]],
      dtype=float32)

Apparently, there is quite a lot of variance in the probability estimates: if we were building a risk-sensitive system (e.g., a medical or financial system), we should probably treat such an uncertain prediction with extreme caution. We can also check the accuracy of the MC Dropout predictions on the test set:

In [53]:
y_pred = np.argmax(y_proba, axis=1)
accuracy = np.sum(y_pred == y_test) / len(y_test)
accuracy

0.8589
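Returning to the uncertainty estimates, one possible way to act on them is to flag a prediction for review whenever the standard deviation of its top class exceeds a cutoff. The sketch below reuses the 13 complete stochastic predictions printed earlier for the first test instance (only classes 5, 7 and 9 are nonzero there). The threshold value and the names `STD_THRESHOLD` and `needs_review` are hypothetical and would have to be tuned for a real application.

```python
import numpy as np

# (class 5, class 7, class 9) probabilities from the 13 complete rows of the
# dropout-on output shown earlier; all other classes were 0 in those rows.
triples = [
    (0.00, 0.15, 0.84), (0.03, 0.18, 0.79), (0.20, 0.02, 0.77),
    (0.00, 0.02, 0.97), (0.11, 0.06, 0.83), (0.00, 0.07, 0.93),
    (0.15, 0.01, 0.84), (0.00, 0.07, 0.93), (0.00, 0.00, 0.99),
    (0.01, 0.05, 0.95), (0.70, 0.04, 0.26), (0.04, 0.04, 0.92),
    (0.10, 0.47, 0.43),
]
samples = np.zeros((len(triples), 10))      # shape: [samples, classes]
samples[:, [5, 7, 9]] = triples

mean = samples.mean(axis=0)                 # MC Dropout estimate for this instance
std = samples.std(axis=0)                   # uncertainty per class
top = int(mean.argmax())                    # predicted class

STD_THRESHOLD = 0.1                         # hypothetical cutoff, tune per application
needs_review = bool(std[top] > STD_THRESHOLD)

print("predicted class:", top)
print("flag for review:", needs_review)
```

With only 13 samples the numbers differ from the 100-sample estimates above, but the pattern is the same: the predicted class is still 9, and its spread is large enough to be flagged under this cutoff.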

**NOTE:**

The number of Monte Carlo samples we use (100 in the code above) is a hyperparameter we can tweak. The higher it is, the more accurate the predictions and their uncertainty estimates will be. However, if we double it, inference time will also double. Moreover, above a certain number of samples, we will notice little improvement. So our job is to find the right trade-off between latency and accuracy, depending on our application.
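A rough way to see why the returns diminish is to look at the standard error of the averaged estimate. This assumes the $T$ stochastic forward passes are independent, and writes $\sigma$ for the standard deviation of a single pass's predicted probability (both the symbols and the independence assumption are introduced here for illustration):

$$
\mathrm{SE}(\hat{p}_T) = \frac{\sigma}{\sqrt{T}}
$$

Under this assumption, going from $T = 100$ to $T = 200$ doubles the inference time but shrinks the error by only a factor of $\sqrt{2} \approx 1.41$.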

If our model contains other layers that behave in a special way during training (such as `BatchNormalization` layers), then we should not force training mode like we just did. Instead, we should replace the `Dropout` layers with the following `MCDropout` class, which works with all Keras APIs, including the Sequential API (if we only care about the Functional API or the Subclassing API, we do not have to create an `MCDropout` class; we can create a regular `Dropout` layer and call it with `training=True`):

In [54]:
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

Here, we just subclass the `Dropout` layer and override the `call()` method to force its `training` argument to `True`. 

Since MC Dropout is just regular dropout during training, it also acts like a regularizer.
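For the Functional API route, a minimal sketch might look like the following. The layer sizes, the dropout rate and the random input are placeholders chosen for illustration; the only point is that a regular `Dropout` layer called with `training=True` stays active at inference time, while the other layers keep their normal inference behavior.

```python
import numpy as np
from tensorflow import keras

# Hypothetical small model: 8 features -> 16 hidden units -> 3 classes.
inputs = keras.Input(shape=(8,))
x = keras.layers.Dense(16, activation="relu")(inputs)
x = keras.layers.Dropout(0.5)(x, training=True)   # dropout stays on, even in predict()
outputs = keras.layers.Dense(3, activation="softmax")(x)
mc_model = keras.Model(inputs, outputs)

# Placeholder input data (random, for illustration only).
X = np.random.default_rng(0).normal(size=(4, 8)).astype("float32")

# Two calls on the same inputs give different outputs because dropout is active.
p1 = mc_model.predict(X, verbose=0)
p2 = mc_model.predict(X, verbose=0)
print(np.abs(p1 - p2).max())
```

Because only the dropout layer is forced into training mode, this approach should remain safe in a model that also contains layers such as `BatchNormalization`.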

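### Max-Norm Regularization

Max-norm regularization constrains the weight vector $\mathbf{w}$ of each neuron's incoming connections so that $\lVert \mathbf{w} \rVert_2 \le r$, where $r$ is the max-norm hyperparameter. To implement it in Keras, set the `kernel_constraint` argument of each hidden layer to a `max_norm()` constraint with the appropriate max value, like this: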

In [55]:
keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal", 
                  kernel_constraint=keras.constraints.max_norm(1.))

<keras.src.layers.core.dense.Dense at 0x7f611c0ca710>

After each training iteration, the model's `fit()` method will call the object returned by `max_norm()`, passing it the layer's weights and getting rescaled weights in return, which then replace the layer's weights. We can also constrain the bias terms by setting the `bias_constraint` argument.

The `max_norm()` function has an `axis` argument that defaults to 0. A `Dense` layer usually has weights of shape [*number of inputs*, *number of neurons*], so using `axis=0` means that the max-norm constraint will apply independently to each neuron's weight vector. 
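To illustrate what rescaling along `axis=0` means, here is a minimal NumPy sketch. It reflects a common way to implement such a constraint and is not taken from the Keras source; the function name `max_norm_rescale`, the `eps` term and the example weights are assumptions made for illustration.

```python
import numpy as np

def max_norm_rescale(w, max_value=1.0, axis=0, eps=1e-7):
    """Rescale w so that its L2 norm along `axis` is at most max_value."""
    norms = np.sqrt(np.sum(np.square(w), axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)    # target norm for each column
    return w * desired / (eps + norms)          # eps avoids division by zero

# Weights of a hypothetical Dense layer: 3 inputs (rows), 2 neurons (columns).
W = np.array([[3.0, 0.1],
              [4.0, 0.2],
              [0.0, 0.2]])

W_clipped = max_norm_rescale(W, max_value=1.0, axis=0)
col_norms = np.linalg.norm(W_clipped, axis=0)   # one norm per neuron
print(col_norms)
```

The first neuron's weight vector has norm 5, so it is scaled down to norm 1. The second has norm 0.3, which is already below the limit, so it is left essentially unchanged. Each column is handled independently, which is what `axis=0` achieves for weights of shape [*number of inputs*, *number of neurons*].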

Below are a few configurations that generally work well in most cases.

![image.png](attachment:a3114163-3f4c-413d-81b0-010c1d399eca.png)

![image.png](attachment:841ceb29-5232-497a-901f-bd9a9b0fd7a2.png)