Here are some common problems you can run into when building deep neural nets:
- Vanishing and exploding gradients
- There might not be enough training data for such a large network, or the data might be too costly to label
- Training may be extremely slow
- A model with millions of parameters severely risks overfitting the training set, especially if there are not enough training instances or if they are too noisy

**The Vanishing/Exploding Gradients Problem:**
- Gradients often get smaller and smaller as backpropagation progresses down to the lower layers: by the chain rule, the gradient at a lower layer is a product of many per-layer factors, and when those factors are smaller than 1 the product shrinks toward zero. As a result, the gradient descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem. The reverse can also happen: the gradients grow bigger and bigger until the updates become huge and training diverges (the exploding gradients problem).
- More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.
- A few suspects for this problem were identified early on: the combination of the sigmoid (logistic) activation function and the weight initialization technique that was most popular at the time (i.e. a normal distribution with a mean of 0 and a standard deviation of 1). With this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs.
  - Glorot and He Initialization:
    - Glorot and Bengio proposed a way to significantly alleviate the unstable gradients problem. They point out that we need the signal to flow properly in both directions: the forward direction when making predictions and the reverse direction when backpropagating gradients. We don't want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction. It's actually not possible to guarantee both unless the layer has an equal number of inputs and outputs, but Glorot and Bengio proposed a good compromise that has proven to work well in practice: the connection weights of each layer must be initialized randomly, either from a normal distribution with mean 0 and variance σ² = 1/fan_avg, or from a uniform distribution between -r and +r with r = sqrt(3/fan_avg), where fan_avg = (fan_in + fan_out)/2.
    - The lesson from this line of research is that different initialization strategies are preferred for different activation functions:
      - Glorot initialization: preferred for tanh, sigmoid, and softmax
      - He initialization: preferred for ReLU, Leaky ReLU, ELU, GELU, Swish, Mish
      - LeCun initialization: preferred for SELU
    - By default, Keras uses Glorot initialization with a uniform distribution.
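
To make the three schemes concrete, here is a minimal plain-Python sketch of the standard deviation each one targets for a zero-mean initializer. The Glorot variance (1/fan_avg) follows the description above; the He (2/fan_in) and LeCun (1/fan_in) variances are the commonly quoted values and are not stated in these notes, so treat them as assumptions. The function names and the layer sizes are made up for illustration, and this is not how Keras implements its initializers internally.

```python
import math

def init_stddev(fan_in, fan_out, scheme):
    """Target standard deviation of a zero-mean initializer for one layer."""
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1 / fan_avg,  # tanh, sigmoid, softmax
        "he": 2 / fan_in,       # ReLU and its variants (assumed standard value)
        "lecun": 1 / fan_in,    # SELU (assumed standard value)
    }
    return math.sqrt(variances[scheme])

def uniform_limit(stddev):
    """Limit r such that a uniform distribution on [-r, r] has this stddev."""
    # A uniform distribution on [-r, r] has variance r**2 / 3.
    return math.sqrt(3) * stddev

# Hypothetical layer with 300 inputs and 100 outputs
for scheme in ("glorot", "he", "lecun"):
    s = init_stddev(300, 100, scheme)
    print(f"{scheme}: stddev={s:.4f}, uniform limit={uniform_limit(s):.4f}")
```

Note how He initialization gives a larger spread than LeCun for the same fan-in; the factor of 2 compensates for ReLU zeroing out roughly half of its inputs.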

In [1]:
import tensorflow as tf

In [2]:
dense = tf.keras.layers.Dense(50, activation="relu", kernel_initializer="he_normal")

Alternatively - you can obtain any of the initializations l

**Better Activation Functions:**
- One of the insights in the 2010 paper by Glorot and Bengio was that the problems with unstable gradients were in part due to a poor choice of activation function.
- It turns out that other activation functions perform much better than sigmoid in deep neural networks, in particular the ReLU activation function, mostly because it does not saturate for positive values and also because it is very fast to compute.
- Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0. A neuron dies when its weights get tweaked in such a way that the input of the ReLU function (i.e. the weighted sum of the neuron's inputs plus its bias term) is negative for all instances in the training set. When this happens, it just keeps outputting zeros, and gradient descent does not affect it anymore because the gradient of the ReLU function is zero when its input is negative.
- To solve this problem, you may want to use a variant of the ReLU function, such as leaky ReLU.
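
A rough numerical illustration of the saturation argument, in plain Python. It multiplies the activation derivative across 10 layers while ignoring the weights entirely, which is a simplifying assumption made only to isolate the effect of the activation function; the helper names are made up for this sketch.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)  # peaks at 0.25 when z == 0, shrinks as |z| grows

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chained_grad(grad_fn, z, n_layers):
    """Product of n_layers identical activation derivatives (weights ignored)."""
    return grad_fn(z) ** n_layers

for z in (0.5, 2.0, 5.0):
    print(z, chained_grad(sigmoid_grad, z, 10), chained_grad(relu_grad, z, 10))

# For negative inputs the ReLU derivative is exactly zero: the dying ReLU case.
print(relu_grad(-1.0))
```

Even in the best case (z = 0), the sigmoid contributes a factor of at most 0.25 per layer, so ten layers scale the gradient by less than one millionth, while ReLU passes it through unchanged for positive inputs.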

**Leaky ReLU**
- The leaky ReLU activation function is defined as LeakyReLU_α(z) = max(αz, z). The hyperparameter α (alpha) defines how much the function "leaks": it is the slope of the function for z < 0.
- Setting alpha = 0.2 (a huge leak) seemed to result in better performance than alpha = 0.01 (a small leak).
- There's also the parametric leaky ReLU (PReLU), where alpha is allowed to be learned during training: instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter. PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.
- Keras includes the classes LeakyReLU and PReLU in the tf.keras.layers package. Just like for other ReLU variants, use He initialization with these.
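
As a from-scratch sketch (not the Keras implementation), leaky ReLU and its derivative can be written in a few lines of plain Python. It assumes the usual definition max(alpha * z, z) with 0 <= alpha <= 1; the function names are made up for illustration.

```python
def leaky_relu_fn(z, alpha=0.2):
    """max(alpha * z, z): identity for z >= 0, slope alpha for z < 0."""
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.2):
    """Derivative: 1 for positive inputs, alpha (never zero) for negative ones."""
    return 1.0 if z > 0 else alpha

for z in (-5.0, -0.5, 0.0, 3.0):
    print(z, leaky_relu_fn(z), leaky_relu_grad(z))
```

Because the derivative for negative inputs is alpha rather than 0, a neuron whose input is negative for every instance still receives a gradient and can recover.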

In [3]:
leaky_relu = tf.keras.layers.LeakyReLU(alpha=0.2) # defaults to alpha=0.3
dense = tf.keras.layers.Dense(50, activation=leaky_relu, kernel_initializer="he_normal")



In [4]:
leaky_relu

<LeakyReLU name=leaky_re_lu, built=True>

If you prefer, you can also use LeakyReLU as a separate layer in your model; it makes no difference for training and predictions:

In [None]:
model = tf.keras.Sequential([
    # ... more layers
    tf.keras.layers.Dense(50, kernel_initializer="he_normal"),  # no activation
    tf.keras.layers.LeakyReLU(alpha=0.2),  # activation as a separate layer
    # ... more layers
])

For PReLU, replace LeakyReLU with PReLU. Currently, there's no official implementation of RReLU (randomized leaky ReLU) in Keras, but you can fairly easily implement your own.
- ReLU, leaky ReLU, and PReLU all suffer from the fact that they are not smooth functions: their derivatives change abruptly at z = 0.

**ELU and SELU**
- The exponential linear unit (ELU) is a newer activation function that was reported to outperform all the ReLU variants.
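
For reference, here is a plain-Python sketch of ELU and SELU. The notes above do not give the formulas, so the definitions below are the commonly cited ones and should be read as assumptions: ELU is z for z >= 0 and alpha * (exp(z) - 1) otherwise, and SELU is a scaled ELU with two fixed constants (approximately 1.0507 and 1.6733).

```python
import math

def elu(z, alpha=1.0):
    """Smoothly approaches -alpha for very negative z; identity for z >= 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1)

# Commonly cited SELU constants (assumed, not taken from these notes)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """Scaled ELU: SELU_SCALE * ELU(z, SELU_ALPHA)."""
    return SELU_SCALE * elu(z, SELU_ALPHA)

for z in (-3.0, -1.0, 0.0, 1.0):
    print(z, round(elu(z), 4), round(selu(z), 4))
```

Unlike ReLU and leaky ReLU, ELU with alpha = 1 has a derivative that is continuous at z = 0.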

**Batch Normalization:**
- Batch normalization (BN) standardizes each of a layer's inputs and learns the optimal scale and offset to apply to the standardized values.
- During training, BN standardizes its inputs using the mean and variance of the current mini-batch, then it rescales and offsets them.
- In the original research:
  - The authors found that batch normalization considerably improved all the deep neural networks they experimented with, leading to a huge improvement in the ImageNet classification task. The vanishing gradients problem was reduced so strongly that they could even use saturating activation functions, and the networks were much less sensitive to weight initialization.
  - Batch norm also acts like a regularizer, reducing the need for other regularization techniques.
  - Batch norm however adds some complexity to the model - moreover there is a 

In [6]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=[28,28]),
    tf.keras.layers.Flatten(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
])

In the model above, batch norm is also used right after the input to standardize the data, which eliminates the need for a separate normalization step such as a StandardScaler.

In [7]:
model.summary()

Let's look at the parameters of the first BN layer. Two are trainable by backpropagation and two are not:

In [8]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('gamma', True),
 ('beta', True),
 ('moving_mean', False),
 ('moving_variance', False)]
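
The four variables listed above map onto the BN computation roughly as in the following plain-Python sketch for a single feature. It is a simplified illustration, not the Keras implementation; the function names are made up, and the momentum (0.99) and epsilon (0.001) values are assumed to match the Keras defaults.

```python
import math

def batch_norm_train(batch, gamma, beta, moving_mean, moving_var,
                     momentum=0.99, eps=1e-3):
    """One training-time BN step for a single feature.

    Returns the outputs plus the updated moving statistics.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased batch variance
    # gamma (scale) and beta (offset) are the two trainable parameters
    out = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
    # The moving statistics are updated by averaging, not by backprop,
    # which is why they show up as non-trainable.
    moving_mean = momentum * moving_mean + (1 - momentum) * mean
    moving_var = momentum * moving_var + (1 - momentum) * var
    return out, moving_mean, moving_var

def batch_norm_infer(x, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """At inference time, BN uses the moving statistics instead of batch ones."""
    return gamma * (x - moving_mean) / math.sqrt(moving_var + eps) + beta

out, mm, mv = batch_norm_train([1.0, 2.0, 3.0], gamma=1.0, beta=0.0,
                               moving_mean=0.0, moving_var=1.0)
print(out, mm, mv)
```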

You can experiment with adding batch norm before or after the activation function to see which performs best for a given dataset.

**Reusing Pretrained Layers:**
- Reusing the layers of an existing network trained on a similar task is called transfer learning.
- It not only speeds up training considerably, it also requires significantly less training data.

**Transfer learning in keras:**

In [None]:
# first you need to load the original model and create a new model based on its layers
# here say, you decide to reuse all the layers except for the output layer:

model_A = tf.keras.models.load_model("my_model_A")

model_B_on_A = tf.keras.Sequential(model_A.layers[:-1])

model_B_on_A.add(tf.keras.layers.Dense(1, activation="sigmoid"))

Note that model_A and model_B_on_A now share some layers, so training model_B_on_A will also affect model_A. To avoid that, clone model_A before you reuse its layers. clone_model() only copies the architecture, so you also need to copy the weights:

model_A_clone = tf.keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

In [None]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

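Freezing the reused layers first matters because the new output layer starts with random weights and will make large errors at the start, which could wreck the reused weights. Now you can train the model for a few epochs, then unfreeze the reused layers (which requires compiling again) and continue training to fine-tune the reused layers for task B. After unfreezing, it is good practice to reduce the learning rate.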

Transfer learning tends to work best with deep neural nets; it usually does not work well with small, shallow networks.
If there isn't a suitable model you can use for transfer learning, other effective techniques are:
- Unsupervised pretraining: train an autoencoder or a GAN, reuse its lower layers, add the output layers for your task on top, and fine-tune the model.
- Pretraining on an auxiliary task for which it's easy to obtain labels, then fine-tuning on your final task.

**Learning Rate Scheduling:**
- Based on experiments, the preferred learning rate scheduling algorithms are power scheduling, exponential scheduling and 1cycle scheduling:

In [9]:
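# Power scheduling (with c=1): lr = lr0 / (1 + step * decay).
# Older Keras optimizers accepted a `decay` argument for this; recent versions
# no longer do, so an InverseTimeDecay schedule is used here for the same effect.
lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.01, decay_steps=1, decay_rate=1.0e-4)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)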
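
To see what these schedules do numerically, here is a small plain-Python sketch of power and exponential scheduling. The formulas are the commonly used forms and are assumptions, since the notes above do not spell them out: power scheduling as lr0 / (1 + step/s)^c and exponential scheduling as lr0 * 0.1^(step/s). The value s = 10,000 is chosen to correspond to a decay of 1e-4.

```python
def power_schedule(step, lr0=0.01, s=10_000, c=1):
    """lr0 / (1 + step / s) ** c: fast drop at first, then slower and slower."""
    return lr0 / (1 + step / s) ** c

def exponential_schedule(step, lr0=0.01, s=10_000):
    """lr0 * 0.1 ** (step / s): divides the rate by 10 every s steps."""
    return lr0 * 0.1 ** (step / s)

for step in (0, 10_000, 20_000, 30_000):
    print(step, power_schedule(step), exponential_schedule(step))
```

With power scheduling the rate drops to lr0/2 after s steps, lr0/3 after 2s steps, and so on, whereas exponential scheduling keeps dividing it by 10 every s steps.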



**Avoiding Overfitting Through Regularization:**
- We already implemented one of the best regularization techniques: early stopping. Moreover, even though batch normalization was designed to solve the unstable gradients problems, it also acts like a good regularizer. Other popular regularization techniques for neural nets are L1 and L2 regularization, dropout, and max-norm regularization:

**L1 and L2 Regularization:**
- Use L2 regularization to constrain a neural net's connection weights, and/or L1 regularization if you want a sparse model (with many weights equal to 0). Here's how to apply L2 regularization to a Keras layer's connection weights, using a regularization factor of 0.01:

In [10]:
# Dense layer with an L2 penalty (factor 0.01) on its connection weights
layer = tf.keras.layers.Dense(100, activation="relu",
                              kernel_initializer="he_normal",
                              kernel_regularizer=tf.keras.regularizers.l2(0.01))

If you want L1 regularization, use tf.keras.regularizers.l1(); if you want both, use tf.keras.regularizers.l1_l2().

L2 regularization is fine when using SGD, momentum optimization, and Nesterov momentum optimization, but not with Adam and its variants. If you want to use Adam with weight decay, then do not use L2 regularization: use AdamW instead.
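
The penalty terms themselves are simple. The following plain-Python sketch shows the quantity each regularizer adds to the loss for one layer, assuming the usual definitions (factor times the sum of absolute values for L1, factor times the sum of squares for L2); the weights are made-up numbers and the function names are hypothetical.

```python
def l1_penalty(weights, factor):
    """factor * sum(|w|): pushes many weights to exactly 0 (sparse model)."""
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor):
    """factor * sum(w^2): penalizes large weights more heavily."""
    return factor * sum(w * w for w in weights)

weights = [0.5, -1.0, 0.0, 2.0]  # made-up connection weights
print(l1_penalty(weights, 0.01))
print(l2_penalty(weights, 0.01))
```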

**Dropout:**
- Dropout is one of the most popular regularization techniques for deep neural networks.
- It is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but excluding the output neurons) has a probability p of being temporarily dropped out, meaning it will be entirely ignored during this training step but may be active during the next one. The hyperparameter p is called the dropout rate and is typically set between 10% and 50%: closer to 20-30% in recurrent neural nets and closer to 40-50% in convolutional neural nets.
- Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So make sure to evaluate the training loss without dropout (e.g. after training).

In [12]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=[28,28]),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax")
])
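
To show what a dropout layer does to its inputs, here is a from-scratch sketch in plain Python. It uses the "inverted dropout" convention, where the surviving activations are scaled by 1 / (1 - rate) during training so that their expected sum stays the same; that scaling detail is not mentioned in the notes above and is stated here as an assumption about how such layers usually behave. The function name and signature are made up for illustration.

```python
import random

def dropout(activations, rate, training, rng=random):
    """Inverted dropout on a list of activations.

    During training, each unit is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate). At inference it is a no-op.
    """
    if not training or rate == 0:
        return list(activations)
    keep = 1 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(42)  # seeded only to make the demo repeatable
print(dropout([1.0] * 10, rate=0.2, training=True, rng=rng))
print(dropout([1.0] * 10, rate=0.2, training=False))
```

Because the layer behaves differently in the two modes, the training loss reported during fitting is computed with units dropped, which is why it is not directly comparable to the validation loss.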

**Monte Carlo (MC) Dropout:**
- 