# ADVANCED TOPICS

* Build a training loop with a custom training step
  * Useful to build custom models!
* Explore DL tricks and how to implement them with Keras
  * Useful to improve performance

## CUSTOM TRAINING STEP

In [None]:
from tensorflow import keras as K
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
import tensorflow as tf

### LET'S BUILD OUR USUAL DATASET

In [None]:
N_CLASSES = 3
N_PATTERNS_PER_CLASS = 5000
BATCH_SIZE = 64

N_PATTERNS = N_CLASSES * N_PATTERNS_PER_CLASS
X, y = make_classification(n_samples=N_PATTERNS, n_classes=N_CLASSES, n_informative=5)
test_size = int(0.25 * y.shape[0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=True, stratify=y)

train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(buffer_size=1024).batch(BATCH_SIZE)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(BATCH_SIZE)


### LET'S BUILD OUR USUAL MODEL

In [None]:
inputs = K.Input(shape=(20,)) # input layer
h = K.layers.Dense(units=64, activation='relu')(inputs) # hidden layer
outputs = K.layers.Dense(units=N_CLASSES, activation='softmax')(h) # output layer
model = K.Model(inputs=inputs, outputs=outputs)
model.summary()

### KERAS TRAINING

In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
print("Training")
model.fit(train_dataset, epochs=10) # batch size is already specified by the dataset
print("Evaluating")
metrics = model.evaluate(test_dataset)

So far, so good. Let's dig deeper!

### IMPLEMENT CUSTOM TRAINING METHOD

To do that, we have to build a custom model, i.e. a subclass of `K.Model`!  
We also have to specify which metrics we want to monitor.

The `train_step` method trains your model on a single minibatch of data. Do not override `fit`: you would then have to reimplement the Keras machinery built around epochs (callbacks, metric resets, logging) yourself.

In [None]:
# this must be a loss object, not a metric: it returns a tensor
# that the gradient tape can differentiate
loss_metric = K.losses.SparseCategoricalCrossentropy()

acc_metric = K.metrics.Accuracy(name="acc")
mean_loss = K.metrics.Mean(name="loss") 

In [None]:
class MyModel(K.Model):
  """
  Model overriding `train_step` method
  """

  def train_step(self, data):
    """
    :param data: depending on what you pass to `fit` method this can be
      yielded by a tensorflow dataset or sampled from (X, y) tuple
    """

    # `fit` loops over the dataset;
    # this method handles 1 iteration (one minibatch)
    x, y = data
    with tf.GradientTape() as tape:
      # this calls the forward pass
      y_pred = self(x, training=True) 
      loss_value = loss_metric(y, y_pred)
      
    # backward pass (backpropagation)
    gradients = tape.gradient(loss_value, self.trainable_variables) 

    # gradient step update
    self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))

    # predicted class = index of the highest output probability
    winner = tf.argmax(y_pred, axis=-1)
    acc_metric.update_state(y, winner)
    mean_loss.update_state(loss_value)    

    return {"loss": mean_loss.result(), "acc": acc_metric.result()}


  def test_step(self, data):
    """
    Custom evaluation step
    """
    x, y = data
    y_pred = self(x, training=False)
    loss_value = loss_metric(y, y_pred)
    mean_loss.update_state(loss_value)
    winner = tf.argmax(y_pred, axis=-1)
    acc_metric.update_state(y, winner)
    # runs only while the step is being traced (graph mode), not on every batch
    print("Inner print: ", x.shape)
    return {"loss": mean_loss.result(), "acc": acc_metric.result()}

  @property
  def metrics(self):
    """
    Add metrics here. In this way, metrics are automatically
    reset at the end of each epoch.
    """
    return [mean_loss, acc_metric]

In [None]:
inputs = K.Input(shape=(20,)) # input layer
h = K.layers.Dense(units=32, activation='relu')(inputs) # hidden layer
outputs = K.layers.Dense(units=N_CLASSES, activation='softmax')(h) # output layer
model = MyModel(inputs=inputs, outputs=outputs)
model.summary()

In [None]:
model.compile(optimizer="adam") # metrics are already specified!

In [None]:
print("Training")
model.fit(train_dataset, epochs=10)
print("Evaluating")
metrics = model.evaluate(test_dataset)

**Exercise**: Implement an entire training loop (i.e. multiple epochs) outside the model.  
Use any model you want and monitor any metrics you like.

## DL TRICKS

### L2/L1 REGULARIZATION

This is useful to keep the weight values small. Large weights are usually associated with overfitting. Adding a penalty term to the loss function pushes the weights toward small values.  
Depending on the penalty type you get L1 (sum of absolute values) or L2 (sum of squares) regularization.  
The same penalty can be applied to a layer's output values (activity regularization).
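The two penalty terms can be computed by hand. This is a minimal NumPy sketch, not the Keras implementation; the weight matrix `W` and the coefficient `lam` are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical weight matrix and regularization strength (illustration only)
W = np.array([[0.5, -1.0],
              [2.0, 0.0]])
lam = 1e-2

# L1 penalty: coefficient times the sum of absolute values
l1_penalty = lam * np.sum(np.abs(W))
# L2 penalty: coefficient times the sum of squared values
l2_penalty = lam * np.sum(W ** 2)

print("L1 penalty:", l1_penalty)
print("L2 penalty:", l2_penalty)
```

During training, this penalty is added to the data loss, so large weights make the total loss worse.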

In [None]:
l2_reg = K.regularizers.l2(l2=1e-2)
l1_reg = K.regularizers.l1(l1=1e-2)

In [None]:
layer = K.layers.Dense(32, kernel_regularizer=l2_reg, activity_regularizer=l1_reg)
out = layer(tf.random.normal(shape=(10,7)))
layer.losses

You can easily create your own regularizer: it is just a callable (a function or a class) that maps a weight tensor to a scalar penalty.
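As a sketch of the function-based variant, the hypothetical `scaled_l2` below reproduces an L2 penalty with a fixed coefficient. The name and the coefficient are assumptions chosen for illustration; any callable that returns a scalar tensor should work the same way.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras as K

def scaled_l2(weights):
    """Hypothetical custom regularizer: 0.01 * sum of squared weights."""
    return 1e-2 * tf.reduce_sum(tf.square(weights))

layer = K.layers.Dense(4, kernel_regularizer=scaled_l2)
_ = layer(tf.ones(shape=(2, 3)))  # call the layer once so the kernel is created

# the penalty collected by Keras ...
print(layer.losses)
# ... should match the value computed by hand from the kernel
expected = 1e-2 * np.sum(layer.kernel.numpy() ** 2)
print(expected)
```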

### EARLY STOPPING

In [None]:
callback = K.callbacks.EarlyStopping(monitor='acc', patience=3, 
                                     restore_best_weights=True)

Simply pass this in the `callbacks` list of `fit`.
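A minimal sketch of how this could look on toy data. The data, the tiny model and the choice of monitoring `val_loss` on a validation split are assumptions made for this example; the callback above monitors the training metric `acc` instead.

```python
import numpy as np
from tensorflow import keras as K

# Toy data (illustration only): the label depends on the sign of the first feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4)).astype("float32")
y_toy = (X_toy[:, 0] > 0).astype("int32")

toy_model = K.Sequential([
    K.layers.Input(shape=(4,)),
    K.layers.Dense(8, activation="relu"),
    K.layers.Dense(2, activation="softmax"),
])
toy_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# stop when the validation loss has not improved for 3 consecutive epochs
early_stop = K.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                       restore_best_weights=True)

history = toy_model.fit(X_toy, y_toy, validation_split=0.2, epochs=50,
                        callbacks=[early_stop], verbose=0)
print("Epochs actually run:", len(history.history["loss"]))
```

Monitoring a validation quantity is usually preferable, since a training metric can keep improving while the model overfits.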

### DROPOUT

Randomly switch off units during training.  
Use all units during evaluation.  
Since each training step sees a different sub-network, the model relies less on any single unit and becomes more robust against overfitting.

It is usually placed after the activation (but this is not a must).

In [None]:
model = K.Sequential()
# use tanh to clearly separate the zeros produced by dropout from the ones
# produced by the activation function (relu may cause confusion)
model.add(K.layers.Input(shape=(2,)))
model.add(K.layers.Dense(20, activation="tanh"))
model.add(K.layers.Dropout(rate=0.5))
input_t = tf.constant([1,2,3,4,5,6,7,8,9,10], shape=(5,2), dtype=tf.float32)
print(input_t)
print()
out = model(input_t, training=True)
print(out)
print()
out2 = model(input_t, training=False)
print(out2)

# change the rate [0, 1) and see what happens
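The surviving units in the training output are larger than in the evaluation output because they are rescaled by `1 / (1 - rate)`, which keeps the expected activation unchanged. A minimal NumPy sketch of this "inverted dropout" scheme follows; the array values and the seed are arbitrary, and this is not the Keras code itself.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.5
activations = np.ones((4, 5), dtype=np.float32)  # arbitrary example activations

# training: drop each unit with probability `rate`, rescale the survivors
keep_mask = rng.random(activations.shape) >= rate
train_out = activations * keep_mask / (1.0 - rate)

# evaluation: dropout is the identity, all units are used
eval_out = activations

print(train_out)
print(eval_out)
```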

### BATCH NORMALIZATION

Normalize the output of each layer to mean 0 and std 1, then rescale and shift it with two **learned** parameters (scale and offset). The layer also keeps a moving average of the mean and variance seen during training.  
During training: use the mean and std of the current batch.  
During inference: use the moving averages of mean and std.  
Improves stability of predictions and generalization.  
Place it between layer and activation.

There are many hypotheses on why BN improves performance. None has been conclusively proven.

In [None]:
model = K.Sequential()
# batch normalization sits between the Dense layer and its activation (ReLU)
model.add(K.layers.Input(shape=(2,)))
model.add(K.layers.Dense(20))
model.add(K.layers.BatchNormalization())
model.add(K.layers.ReLU())
input_t = tf.constant([1,2,3,4,5,6,7,8,9,10], shape=(5,2), dtype=tf.float32)
out = model(input_t, training=True)
print(out)
out2 = model(input_t, training=False)
print()
print(out2)
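The difference between the two outputs comes from which statistics are used. This is a hand-written NumPy sketch of the two modes for a single batch. The input values are arbitrary; `momentum=0.99`, `eps=1e-3` and the initial moving statistics are chosen to mirror the Keras defaults, and the scale `gamma` and offset `beta` are kept at their initial values instead of being learned.

```python
import numpy as np

x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])            # arbitrary batch: 3 samples, 2 features

gamma, beta = 1.0, 0.0                # scale and offset (learned in a real layer)
eps, momentum = 1e-3, 0.99            # numerical stabilizer, moving-average factor
moving_mean, moving_var = np.zeros(2), np.ones(2)

# training: normalize with the statistics of the current batch ...
batch_mean, batch_var = x.mean(axis=0), x.var(axis=0)
train_out = gamma * (x - batch_mean) / np.sqrt(batch_var + eps) + beta
# ... and update the moving averages
moving_mean = momentum * moving_mean + (1 - momentum) * batch_mean
moving_var = momentum * moving_var + (1 - momentum) * batch_var

# inference: normalize with the moving averages instead
infer_out = gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta

print(train_out)
print(infer_out)
```

After a single update the moving averages are still close to their initial values, which is why the inference output stays close to the raw input.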

**Exercise**: Try to empirically validate the effect of these DL tricks on a real dataset (use a Keras dataset to get started).