# Unstable Gradient Problem

We start this session with the problem of unstable gradients and how to prevent it.

* Xavier / Glorot method:

Here we have 2 conditions to satisfy:

1) for each layer, the variance of the outputs should equal the variance of the inputs

2) the gradients should keep the same variance before and after flowing through a layer in the backward direction

These 2 conditions cannot both hold unless fan_in and fan_out are equal, so we settle for a compromise based on fan_avg = (fan_in + fan_out) / 2

and initialize the weights with either:

A) a normal distribution with mean 0 and variance σ² = 1 / fan_avg

B) a uniform distribution between -r and r, with r = sqrt(3 * σ²) = sqrt(3 / fan_avg)

If we replace fan_avg with fan_in, we get the LeCun initialization, which is identical to Glorot when fan_in = fan_out
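To make the numbers concrete, here is a minimal sketch of the two Glorot quantities for a single layer. The helper name `glorot_params` and the 784-to-300 layer size are only illustrative assumptions, not part of Keras.

```python
import math

def glorot_params(fan_in, fan_out):
    """Return (variance, r) for Glorot initialization of one layer (illustrative helper)."""
    fan_avg = (fan_in + fan_out) / 2
    variance = 1.0 / fan_avg        # variance of the normal variant
    r = math.sqrt(3.0 * variance)   # bound of the uniform variant on [-r, r]
    return variance, r

# Example: a layer mapping 784 inputs to 300 outputs
variance, r = glorot_params(784, 300)
print(f"variance = {variance:.6f}, r = {r:.4f}")
```

A uniform distribution on [-r, r] has variance r² / 3, which is why both variants end up with the same weight variance.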

Keras uses Glorot initialization with a uniform distribution by default, but we can switch to He initialization:

In [1]:
import keras


keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")





<keras.src.layers.core.dense.Dense at 0x232566f4210>

For further customization:

In [2]:
he_avg_init = keras.initializers.VarianceScaling(scale=2, mode="fan_avg", distribution="uniform")

keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

<keras.src.layers.core.dense.Dense at 0x232564a9f50>

Another place to look for the cause of this problem is the activation function.

Because biological neurons appear to use a roughly sigmoid activation function, it was long believed to be the best choice.

But the ReLU activation function does not saturate for positive inputs (by definition, mind you), so it did better than sigmoid in most cases.

# Non-Saturated Activation Function

As a sneak peek:

SELU > ELU > PReLU > ReLU > tanh > logistic

Let's implement a simple leaky ReLU:

In [3]:
model = keras.models.Sequential([
    # some layers
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    # the Dense layer above has no activation: LeakyReLU is applied as its own layer
    keras.layers.LeakyReLU(alpha=0.2)
])

In [4]:
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")

Keep in mind that the LeCun initializer is necessary for SELU to self-normalize.

# Batch Normalization in Neural Networks

Batch Normalization (BN) is a technique used in neural networks to normalize the inputs of each layer. It operates on a mini-batch of data during training and normalizes the activations to have zero mean and unit variance. This is done by subtracting the batch mean and dividing by the batch standard deviation.

## Steps of Batch Normalization

1. **Normalization:** It standardizes the inputs of a layer, ensuring that they have roughly zero mean and unit variance. This helps in overcoming the internal covariate shift, making training more stable.

2. **Scaling and Shifting:** After normalization, the values are scaled and shifted using learnable parameters (gamma and beta). This allows the model to adapt and learn the optimal scale and shift for each feature.
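The two steps above can be sketched in NumPy as follows. This is only a rough training-time illustration, not the Keras implementation: the function name `batch_norm_train` and the small constant `eps` (added to avoid dividing by zero) are assumptions made for the example.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize a mini-batch per feature, then scale and shift (illustrative sketch)."""
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # step 1: normalization
    return gamma * x_hat + beta               # step 2: scaling and shifting

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # 32 samples, 4 features
out = batch_norm_train(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))
```

With gamma set to 1 and beta set to 0, each output feature ends up with a mean close to 0 and a standard deviation close to 1; during training the network is free to learn other values for gamma and beta.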

## Preventing the Unstable Gradient Problem

Batch Normalization helps prevent the unstable gradient problem through:

1. **Mitigating Internal Covariate Shift:** By normalizing the inputs, BN reduces the internal covariate shift, which is the change in the distribution of network activations due to parameter updates during training. This helps in stabilizing the training process.

2. **Stabilizing Gradients:** BN reduces the dependency of the gradient on the scale of the parameters. During backpropagation, gradients are less likely to vanish or explode as the inputs are within a certain range (near zero mean and unit variance). This mitigates the unstable gradient problem and allows for more effective learning.

3. **Enabling Higher Learning Rates:** BN often allows for the use of higher learning rates during training. With normalized inputs, the optimization process is less sensitive to the choice of learning rate, leading to faster convergence.

In summary, Batch Normalization is effective in preventing the unstable gradient problem by normalizing inputs, reducing internal covariate shift, and stabilizing the gradients during backpropagation. This, in turn, facilitates more stable and efficient training of neural networks.


Enough talk, let's code:

In [5]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]), 
    keras.layers.BatchNormalization(), 
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"), 
    keras.layers.BatchNormalization(), 
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"), 
    keras.layers.BatchNormalization(), 
    keras.layers.Dense(10, activation="softmax")
])

Keep in mind that the layer sizes follow a funnel shape: each hidden layer has fewer neurons than the previous one.

Also note that we apply Batch Normalization after every layer except the output layer.

In [6]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 784)               0         
                                                                 
 batch_normalization (Batch  (None, 784)               3136      
 Normalization)                                                  
                                                                 
 dense_4 (Dense)             (None, 300)               235500    
                                                                 
 batch_normalization_1 (Bat  (None, 300)               1200      
 chNormalization)                                                
                                                                 
 dense_5 (Dense)             (None, 100)               30100     
                                                                 
 batch_normalization_2 (Bat  (None, 100)               4

In a network this shallow, we don't expect much improvement, though.

The non-trainable params are the moving mean and moving variance of each BN layer. They are updated from the batch statistics during training rather than BY BACKPROPAGATION, which is why Keras counts them as non-trainable.

In [7]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

Whether we place the batch normalization layer before or after the activation function depends on the problem itself and is determined by trial and error.

Note that to place it before the activation, unlike what we did above, we have to separate the activation from the Dense layer. The input layer has no activation, so that part does not change.

The placement of the BN layer is relative to the activation, whether the activation is a layer by itself or part of a dense layer.

In [8]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]), 
    keras.layers.BatchNormalization(), 
    keras.layers.Dense(300, kernel_initializer="he_normal"), 
    keras.layers.BatchNormalization(), 
    keras.layers.Activation("elu"), 
    keras.layers.Dense(100, kernel_initializer="he_normal"), 
    keras.layers.BatchNormalization(), 
    keras.layers.Activation("elu"), 
    keras.layers.Dense(10, activation="softmax")
])

The BN layer has few hyperparameters we can alter, and it is usually best to leave them as they are.

At worst, we tweak the momentum hyperparameter, which controls how the moving averages are updated.

The bigger the dataset and the smaller each batch, the closer this parameter should be to 1 (e.g. 0.9, 0.99, 0.999).
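As a rough illustration of what the momentum hyperparameter does, the sketch below applies an exponential moving average update, where the new value is the old one times momentum plus the batch statistic times (1 - momentum). The helper names and the constant batch statistic of 1.0 are assumptions chosen to keep the example small.

```python
def update_moving_stat(moving, batch_stat, momentum):
    """One exponential-moving-average update of a BN statistic (illustrative helper)."""
    return moving * momentum + batch_stat * (1.0 - momentum)

def run(momentum, steps=100):
    """Start at 0 and feed the same batch statistic (1.0) for a number of steps."""
    moving = 0.0
    for _ in range(steps):
        moving = update_moving_stat(moving, batch_stat=1.0, momentum=momentum)
    return moving

fast = run(momentum=0.9)    # forgets the past quickly
slow = run(momentum=0.99)   # averages over many more batches
print(f"momentum=0.9  -> {fast:.4f}")
print(f"momentum=0.99 -> {slow:.4f}")
```

A momentum closer to 1 averages over more batches, which smooths out the noise of small batches but reacts more slowly.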

# Gradient Clipping in Neural Networks

Gradient clipping is a technique used to address the exploding gradient problem during the training of neural networks. This problem occurs when gradients become extremely large, leading to unstable training and potential divergence. Gradient clipping helps control the magnitude of gradients to prevent this issue.

## Gradient Clipping Process

1. **Compute Gradients:** During backpropagation, gradients are calculated for each parameter in the neural network.

2. **Calculate Gradient Norm:** Calculate the Euclidean norm (L2 norm) of the entire gradient vector. This norm represents the overall magnitude of the gradients.

3. **Clip Gradients:** If the calculated norm exceeds a predefined threshold (clip_value), then scale down the entire gradient vector to ensure its norm is within an acceptable range.

4. **Update Parameters:** Finally, use the clipped gradients to update the model parameters.
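The four steps can be sketched in NumPy like this. It is a simplified illustration with made-up gradients, not what a Keras optimizer does internally; `clip_by_global_norm` is a hypothetical helper name.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays so their joint L2 norm is at most threshold."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= threshold:
        return grads, global_norm
    scale = threshold / global_norm
    return [g * scale for g in grads], global_norm

# Steps 1-2: pretend these gradients came out of backpropagation
params = [np.array([1.0, 1.0]), np.array([[1.0]])]
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]

# Step 3: clip (the joint norm here is well above the threshold of 1)
clipped, norm_before = clip_by_global_norm(grads, threshold=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

# Step 4: plain gradient descent update using the clipped gradients
learning_rate = 0.1
params = [p - learning_rate * g for p, g in zip(params, clipped)]
print(norm_before, norm_after)
```

Because every array is multiplied by the same scale, the direction of the overall gradient is unchanged; only its length shrinks.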

## Preventing Exploding Gradients

Gradient clipping helps prevent the exploding gradient problem, especially in recurrent neural networks (RNNs) and deep networks. By controlling the magnitude of gradients, it stabilizes the training process and allows the model to learn more effectively.


In [9]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)

In [10]:
model.compile(loss="mse", optimizer=optimizer)

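What this does is clip every component of the gradient vector to a value between -1 and 1.

But this can change the direction of the vector: for example, the gradient [0.9, 100.0] becomes [0.9, 1.0], so a vector that pointed mostly along the second axis now points roughly along the diagonal (which could be problematic).

The clipnorm argument, on the other hand, rescales the whole vector whenever its L2 norm exceeds the threshold, which preserves its direction.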
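The difference between the two kinds of clipping is easy to check numerically. The sketch below uses a made-up two-component gradient and plain NumPy rather than the optimizer itself, so treat it as an illustration of the idea only.

```python
import numpy as np

grad = np.array([0.9, 100.0])

# Clipping by value: each component is limited to [-1, 1] independently
by_value = np.clip(grad, -1.0, 1.0)

# Clipping by norm: the whole vector is rescaled if its L2 norm exceeds 1
norm = np.linalg.norm(grad)
by_norm = grad * min(1.0, 1.0 / norm)

def direction(v):
    """Unit vector pointing the same way as v."""
    return v / np.linalg.norm(v)

print("original direction:", direction(grad))
print("after clipvalue   :", direction(by_value))
print("after clipnorm    :", direction(by_norm))
```

Clipping by value flattens the large component and so rotates the vector, while clipping by norm keeps the direction and only shortens the vector.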

# Transfer Learning in Neural Networks

Transfer learning is a powerful technique in neural network training that leverages pre-trained models to improve the performance of a model on a new, related task. Instead of training a neural network from scratch, transfer learning involves using a pre-trained model's knowledge and adapting it to a different but related problem.

## Steps of Transfer Learning

1. **Select a Pre-trained Model:** Choose a pre-trained model that has been trained on a large dataset for a similar task. Popular models include VGG, ResNet, Inception, and BERT, depending on the type of task (image classification, object detection, natural language processing, etc.).

2. **Remove Last Layers (Optional):** Depending on the similarity of the new task, you might need to remove the last layers of the pre-trained model. For example, in image classification, you may remove the output layer and add a new one with the appropriate number of classes for your task.

3. **Freeze Pre-trained Layers (Optional):** Optionally, freeze the weights of the pre-trained layers to prevent them from being updated during the initial training on the new task. This is useful when the lower layers capture generic features that are likely to be beneficial for the new task.

4. **Add New Layers:** Add new layers or modify the existing ones to adapt the pre-trained model to the specifics of your task. These new layers are typically randomly initialized and then trained on the new dataset.

5. **Training on the New Task:** Train the modified model on your target dataset. Since the pre-trained layers already contain valuable features, training is often faster, and the model can achieve good performance with less data.

## Benefits of Transfer Learning

- **Faster Training:** Utilizing pre-trained weights speeds up the training process, especially when dealing with large and complex models.

- **Improved Generalization:** Transfer learning allows models to generalize well to new tasks, even with limited data for the specific task.

- **Effective Feature Extraction:** Pre-trained models serve as effective feature extractors, capturing useful hierarchical features that can be adapted to different tasks.

When using transfer learning, we should check that the input shapes are the same, as a mismatch would disrupt the training process.

Also, the higher we go in the original network, the more likely it is that changing or replacing a layer will be beneficial,

with replacing the output layer being almost mandatory.

We start by freezing all the reused layers of the base network and training the model. Then we unfreeze the top one or two hidden layers,

__while potentially lowering the learning rate__,

and keep training, unfreezing more layers as long as this helps.

If this does not work, drop the top hidden layers, freeze the remaining ones again, and repeat the process.

Consider the following problem:


We have a model for the Fashion MNIST problem (with 8 classes though),
and what we want is a binary classifier for the two remaining classes.

In [11]:
# the initial model for the mnist fashion problem :

from tensorflow import keras


model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dense(100, activation="relu"))
model.add(keras.layers.Dense(8, activation="softmax"))  # 8 output classes, as stated in the problem above

In [12]:
model_A = model

In [13]:
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

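Keep in mind that model_B_on_A shares its layers with model_A, so from here on, training model_B_on_A also changes model_A.


To prevent this, clone model_A with clone_model and copy its weights (clone_model copies only the architecture), then build the new model from the clone's layers.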

In [14]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

Let's start by freezing the reused layers and training the rest of the model.

Keep in mind that we must compile the model again each time we freeze / unfreeze layers.

In [15]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False


model_B_on_A.compile(
    loss="binary_crossentropy", 
    optimizer="sgd", 
    metrics=["accuracy"]
)




Let's train the rest of the model now :

In [17]:
# Placeholders only: replace these with the real task-B training and validation data
X_train_B = y_train_B = X_val_B = y_val_B = []

history = model_B_on_A.fit(
    X_train_B, 
    y_train_B, 
    epochs=4, 
    validation_data=(X_val_B, y_val_B)
)

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

In [18]:
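# Lower learning rate so the unfrozen, reused weights are not wrecked
optimizer = keras.optimizers.SGD(learning_rate=1e-4)
model_B_on_A.compile(
    loss="binary_crossentropy", 
    optimizer=optimizer,  # pass the optimizer object, not the default "sgd" string
    metrics=["accuracy"]
)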

history = model_B_on_A.fit(
    X_train_B, 
    y_train_B, 
    epochs=16, 
    validation_data=(X_val_B, y_val_B)
)

This works well, but keep in mind:

transfer learning does not work well with small dense networks.

# Unsupervised Pretraining in Neural Networks

Unsupervised pretraining is a technique in neural network training where a model is pretrained on a task that doesn't require labeled data. This pretrained model can then be fine-tuned on a specific task that does have labeled data, providing a good initialization for the network's weights.

## Steps of Unsupervised Pretraining

1. **Select an Unsupervised Task:** Choose an unsupervised task that doesn't require labeled data. Examples include autoencoders, denoising autoencoders, or generative models like variational autoencoders (VAEs) or Generative Adversarial Networks (GANs).

2. **Pretrain the Model:** Train the neural network on the selected unsupervised task using a large dataset. The goal is for the model to learn useful representations or features from the data without relying on labeled information.

3. **Save Pretrained Model Weights:** After unsupervised pretraining, save the weights of the pretrained model. These weights will serve as the starting point for the subsequent fine-tuning on a supervised task.

4. **Fine-Tune on a Supervised Task:** Initialize a new neural network with the pretrained weights and fine-tune it on a task that requires labeled data. This can include tasks like image classification, object detection, or sentiment analysis.

5. **Training on the Supervised Task:** Train the fine-tuned model on the labeled dataset. The pretrained features help the model converge faster and often lead to better performance compared to training from scratch.

## Benefits of Unsupervised Pretraining

- **Feature Learning:** Unsupervised pretraining allows the model to learn meaningful features or representations from the data, even when labeled information is not available.

- **Improved Generalization:** The learned features can be transferable to various downstream tasks, enhancing the model's ability to generalize.

- **Addressing Data Scarcity:** Unsupervised pretraining is particularly useful when labeled data is limited or expensive to obtain.

# Momentum Optimizer in Neural Networks

The Momentum optimizer is an extension of the standard Gradient Descent (GD) optimization algorithm that helps accelerate training and navigate through areas with noisy or sparse gradients.

## Key Concepts

### Standard Gradient Descent (GD)

In standard GD, the model parameters are updated in the opposite direction of the gradient of the loss with respect to those parameters. The update rule for each parameter (θ) at each iteration is given by:

\[ \theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t) \]

where:
- \( \alpha \) is the learning rate.
- \( \nabla L(\theta_t) \) is the gradient of the loss function with respect to the parameters.

### Momentum Optimizer

The Momentum optimizer introduces the concept of a "momentum" term to the parameter updates. The update rule for each parameter (θ) at each iteration is given by:

\[ v_{t+1} = \beta v_t + (1 - \beta) \nabla L(\theta_t) \]
\[ \theta_{t+1} = \theta_t - \alpha v_{t+1} \]

where:
- \( \alpha \) is the learning rate.
- \( \nabla L(\theta_t) \) is the gradient of the loss function with respect to the parameters.
- \( v_t \) is the momentum term at time step \( t \).
- \( \beta \) is the momentum coefficient, typically close to 1 (e.g., 0.9).
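The two update rules above can be written out directly. The sketch below is only an illustration under simple assumptions: it uses a made-up one-dimensional loss, L(θ) = 0.5 θ², whose gradient is θ itself, and the function names are hypothetical.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * theta**2."""
    return theta

def gd_step(theta, alpha):
    """Standard gradient descent update."""
    return theta - alpha * grad(theta)

def momentum_step(theta, v, alpha, beta):
    """Momentum update, following the two equations above."""
    v = beta * v + (1.0 - beta) * grad(theta)
    theta = theta - alpha * v
    return theta, v

theta_gd = 1.0
theta_mo, v = 1.0, 0.0
for _ in range(50):
    theta_gd = gd_step(theta_gd, alpha=0.1)
    theta_mo, v = momentum_step(theta_mo, v, alpha=0.1, beta=0.9)
print(theta_gd, theta_mo, v)
```

Note that in this form the momentum term starts at zero, so the first few steps are smaller than in plain GD until the velocity builds up.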

## Differences Between Momentum and GD

1. **Acceleration:** Momentum helps accelerate training, especially in the presence of oscillations or noisy gradients. The momentum term allows the optimizer to accumulate velocity in directions with consistent gradients, enabling faster convergence.

2. **Inertia:** The momentum term introduces an "inertia" effect, allowing the optimizer to continue moving in the previous direction, even if the gradient changes direction or magnitude. This helps the optimizer navigate through flat regions or saddle points more efficiently.

3. **Damping Effect:** The momentum term acts as a damping factor for oscillations. It reduces the impact of oscillations or high-frequency noise in the gradient, leading to smoother and more stable updates.

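The momentum vector v is a moving average of past gradients, and the beta hyperparameter controls how much of that history is kept.

What determines a step:

1) GD: the local gradient alone (the steeper the slope, the bigger the step)

2) MO: the accumulated velocity, so the gradient acts as an acceleration rather than as a speed

MO tends to behave like a damped oscillation, overshooting and fluctuating around the answer before settling, but it is much faster.

The implementation is easy: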

In [19]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

# Learning Rate Schedules in Neural Networks

## Learning Rate Schedule Overview

In the context of neural network training, a learning rate schedule refers to the dynamic adjustment of the learning rate during training. The learning rate is a hyperparameter that determines the size of the steps taken during optimization. Different schedules can impact the convergence speed, stability, and generalization of the model.

## 1. Power Scheduling

Power scheduling adjusts the learning rate based on a power of the iteration number. The learning rate at each iteration (\(t\)) is calculated as:

\[ \alpha_t = \alpha_0 \cdot \frac{1}{(1 + t \cdot k)^p} \]

This schedule drops the learning rate quickly at first, then more and more slowly. Here \( \alpha_0 \) is the initial learning rate, \( k \) controls how fast it decays, and \( p \) is the power (typically 1).
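As a small sketch of the formula, the function below computes the power-scheduled learning rate for a given iteration. The function name and the default values are illustrative assumptions only.

```python
def power_schedule(step, lr0=0.01, k=1e-4, p=1.0):
    """Power scheduling: lr0 / (1 + step * k) ** p."""
    return lr0 / (1.0 + step * k) ** p

for step in (0, 10_000, 20_000, 30_000):
    print(step, power_schedule(step))
```

With p = 1 the learning rate falls to lr0 / 2 after 1 / k steps, then to lr0 / 3 after another 1 / k steps, and so on, so the drops get smaller over time.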

## 2. Exponential Scheduling

Exponential scheduling reduces the learning rate exponentially over iterations. The learning rate at each iteration (\(t\)) is calculated as:

\[ \alpha_t = \alpha_0 \cdot e^{-kt} \]

This schedule rapidly reduces the learning rate, promoting faster convergence initially but may become very small in later iterations.

## 3. Fixed Step Scheduling

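Fixed step (piecewise constant) scheduling keeps the learning rate constant for a fixed number of epochs, then switches to a smaller constant value for the next stretch of epochs, and so on. With boundaries \(t_1 < t_2 < \dots\), the learning rate at each iteration (\(t\)) is:

\[ \alpha_t = \alpha_i \quad \text{for } t_i \le t < t_{i+1} \]

While simple, this schedule often requires careful tuning of both the learning rate values and the epochs at which they change.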

## 4. Performance Scheduling

Performance scheduling adjusts the learning rate based on the model's performance. Whenever the validation error stops improving for a given number of epochs, the learning rate is multiplied by a factor smaller than 1. After \(n\) such reductions, the learning rate is:

\[ \alpha_t = \alpha_0 \cdot \text{factor}^{n} \]

This schedule adapts the learning rate based on the model's performance on the validation set.

## 5. One-Cycle Scheduling

One-Cycle scheduling involves a cyclical learning rate, where the learning rate starts low, gradually increases to a maximum, and then decreases again. This is done within a single cycle of training. The learning rate at each iteration (\(t\)) is calculated using a piecewise linear or cosine annealing function.

This schedule is designed to achieve both fast convergence and fine-tuning by exploring a broad range of learning rates.
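A piecewise linear version of this schedule could look like the sketch below. It is a simplified assumption, not a reference implementation: the function name, the default rates, and the choice of a symmetric ramp (half the steps up, half the steps down) are all made up for illustration.

```python
def one_cycle_lr(step, total_steps, lr_start=0.001, lr_max=0.01):
    """Piecewise linear one-cycle schedule: ramp up for the first half, down for the second."""
    half = total_steps / 2
    if step <= half:
        frac = step / half                       # 0 -> 1 during the first half
        return lr_start + frac * (lr_max - lr_start)
    frac = (step - half) / half                  # 0 -> 1 during the second half
    return lr_max - frac * (lr_max - lr_start)

for step in (0, 25, 50, 75, 100):
    print(step, one_cycle_lr(step, total_steps=100))
```

In Keras, a function like this would typically be driven by a custom callback that sets the optimizer's learning rate at each batch.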



## Now let's look at some implementations:

Power Scheduling :

In [21]:
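# Note: `decay` is an argument of the older (legacy) Keras optimizers; newer releases may
# reject it, in which case a schedule such as keras.optimizers.schedules.InverseTimeDecay
# is one possible alternative
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)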

Exponential Scheduling :

In [22]:
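def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)

# or, more generally, with a configurable initial rate lr0 and decay period s:

def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        # divide the learning rate by 10 every s epochs
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)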

In [23]:
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

In [25]:
# Placeholders only: replace these with the real scaled training data and labels
X_train_scaled = y_train = []

history = model.fit(
    X_train_scaled, 
    y_train, 
    # [...]
    callbacks=[lr_scheduler]
)

Or quite simply:

In [26]:
def exponential_decay_fn(epoch, lr):
    # lr is the current learning rate, so multiply it by a constant factor each epoch
    return lr * 0.1 ** (1 / 20)

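When saving a model, the optimizer and its learning rate are saved with it.

As a result, with this version of the schedule function we can load a model and continue an unfinished training where it left off.

This is not the case when the schedule depends on the "epoch" argument, as the epoch is not saved and restarts from 0 on every call to fit().

So either don't interrupt the training process, or pass the initial_epoch argument to fit() when resuming.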

Fixed Step Scheduling :

In [27]:
def piecwise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

In [28]:
lr_scheduler = keras.callbacks.LearningRateScheduler(piecwise_constant_fn)

Performance Scheduling :

In [29]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

and so on....

# Overfitting Prevention

## L1 Regularization (Lasso Regularization)

L1 regularization, also known as Lasso regularization, is a technique used to regularize neural network models. It adds a penalty term to the loss function proportional to the absolute values of the model weights.


## L2 Regularization (Ridge Regularization)

L2 regularization, also known as Ridge regularization, is another technique used to regularize neural network models. It adds a penalty term to the loss function proportional to the squared values of the model weights.
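To see what the two penalty terms look like numerically, here is a minimal NumPy sketch. The helper names, the weight matrix, and the factor of 0.01 are illustrative assumptions.

```python
import numpy as np

def l1_penalty(weights, factor=0.01):
    """L1 penalty: factor times the sum of absolute weight values."""
    return factor * np.sum(np.abs(weights))

def l2_penalty(weights, factor=0.01):
    """L2 penalty: factor times the sum of squared weight values."""
    return factor * np.sum(np.square(weights))

w = np.array([[0.5, -2.0],
              [0.0,  1.0]])
print("L1 penalty:", l1_penalty(w))
print("L2 penalty:", l2_penalty(w))
```

The penalty is added to the data loss, so the optimizer has to trade off fitting the data against keeping the weights small. The L2 term grows much faster for large weights, while the L1 term pushes small weights all the way to zero.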

In [32]:
layer = keras.layers.Dense(
    100, 
    activation='elu', 
    kernel_initializer="he_normal", 
    kernel_regularizer=keras.regularizers.l2(0.01)
)

The l2 regularizer basically shrinks the weights toward zero, penalizing large weights the most; it is the l1 regularizer that makes the weights sparse.

We can also use a combination of both with l1_l2().

And since the same arguments are repeated for every layer, we can use functools.partial:

In [34]:
from functools import partial


RegularizedDense = partial(
    keras.layers.Dense,
    activation="elu", 
    kernel_initializer="he_normal", 
    kernel_regularizer=keras.regularizers.l2(0.01)
)

# Dropout Regularization

Dropout is a regularization technique used in neural networks to prevent overfitting. It involves randomly setting a fraction of input units to zero during training, which helps prevent complex co-adaptations on training data.

## How Dropout Works

At each training step, Dropout randomly "drops out" (sets to zero) a fraction of the input units, chosen at random. This helps to prevent overfitting by ensuring that no single neuron becomes too specialized, and the network becomes more robust.

## Mathematical Formulation

The mathematical formulation of Dropout involves applying a binary mask to the output of the layer during training. The mask is generated independently for each input unit at each update of the training phase. The output \(y\) is given by:

\[ y = \frac{x \cdot \text{mask}}{1 - \text{dropout\_rate}} \]

where:
- \(x\) is the input to the layer.
- \(\text{mask}\) is the binary mask.
- \(\text{dropout\_rate}\) is the fraction of units to drop.

During testing or inference, no units are dropped. Because the training outputs were already divided by \(1 - \text{dropout\_rate}\), no extra scaling is needed at inference to maintain the expected output magnitude.
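The formula can be sketched in NumPy as below. This is an illustrative version, not the Keras layer itself; the function name `dropout` and the all-ones input are assumptions. Since the division by \(1 - \text{dropout\_rate}\) happens at training time, the inference branch simply returns its input.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Dropout with training-time rescaling, following the formula above."""
    if not training:
        return x                               # inference: nothing dropped, nothing rescaled
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob     # binary mask, True with probability keep_prob
    return x * mask / keep_prob                # rescale so the expected value is unchanged

rng = np.random.default_rng(42)
x = np.ones((1000, 100))
train_out = dropout(x, rate=0.2, training=True, rng=rng)
test_out = dropout(x, rate=0.2, training=False, rng=rng)
print(train_out.mean(), test_out.mean())
```

The surviving units are scaled up during training, so the average activation stays roughly the same in both modes.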

This method has the advantage of very often making the model generalize better,

although it is not guaranteed to help: it slows convergence, and too high a rate leads to underfitting.

In [36]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]), 
    keras.layers.Dropout(rate=0.2), 
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"), 
    keras.layers.Dropout(rate=0.2), 
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"), 
    keras.layers.Dropout(rate=0.2), 
    keras.layers.Dense(10, activation="softmax")
])

If the model is overfitting, increase the rate, and vice versa.

# Monte Carlo Dropout

Monte Carlo Dropout is an extension of the Dropout regularization technique that involves making multiple predictions with dropout-enabled models at test time. It provides a form of uncertainty estimation and can be useful for tasks where understanding model uncertainty is crucial.

## How Monte Carlo Dropout Works

In traditional Dropout, during training, a fraction of input units is randomly set to zero. In Monte Carlo Dropout, this dropout behavior is retained during testing or inference. Instead of making a single deterministic prediction, the model is sampled multiple times with dropout applied, and predictions are averaged.

## Mathematical Formulation

Mathematically, let \(f_i(x)\) represent the prediction of the model for input \(x\) under the \(i\)-th random dropout mask. In Monte Carlo Dropout, the prediction is obtained by averaging over multiple dropout samples:

\[ \text{Prediction} = \frac{1}{N} \sum_{i=1}^{N} f_i(x) \]

where:
- \(N\) is the number of dropout samples.

This process provides an estimate of the model's uncertainty, and the variance in predictions reflects the uncertainty in the model's predictions.

In [38]:
import numpy as np

# Placeholder only: replace with the real scaled test set
X_test_scaled = []

y_probas = np.stack([model(X_test_scaled, training=True)
                     for sample in range(100)])
y_proba = y_probas.mean(axis=0)

Simple as that.

Monte Carlo dropout does not always give more accurate predictions,

but it does give more reliable uncertainty estimates.

# Max-Norm Regularization

Max-Norm regularization is a technique used to prevent the weights in a neural network from becoming too large during training. This regularization method imposes a constraint on the maximum L2 norm of the weight vectors in a layer.

## How Max-Norm Regularization Works

1. **Computing the Norm:**
   - For each neuron in a layer, the L2 norm (Euclidean norm) of its incoming weight vector is calculated after each training step.

2. **Applying Constraint:**
   - The L2 norm is then compared to a predefined threshold, denoted as \(c\) (the max-norm constraint).
   - If the L2 norm exceeds \(c\), the weight vector is scaled down to ensure that the constraint is satisfied:

     \[ \|W_i\|_2 \leq c \]

   where:
   - \(\|W_i\|_2\) is the L2 norm of the weight vector for neuron \(i\).
   - \(c\) is the max-norm constraint.
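The constraint can be sketched in NumPy as follows. The helper name `apply_max_norm`, the weight layout (one column per neuron), and the small constant `eps` that guards against division by zero are assumptions for this illustration.

```python
import numpy as np

def apply_max_norm(W, c=1.0, eps=1e-7):
    """Rescale each neuron's incoming weight vector so its L2 norm is at most c.

    W has shape (n_inputs, n_neurons): one column per neuron.
    """
    norms = np.sqrt(np.sum(W ** 2, axis=0, keepdims=True))  # one norm per column
    desired = np.clip(norms, 0.0, c)                         # norms above c are capped at c
    return W * desired / (norms + eps)

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])       # column norms: 5.0 and 0.5
W_constrained = apply_max_norm(W, c=1.0)
print(np.linalg.norm(W_constrained, axis=0))
```

Only the first column is rescaled, since its norm exceeds c; the second column is already within the constraint and is left essentially unchanged.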

## Why Max-Norm Regularization?

### 1. Improved Generalization:
   - Large weights in a neural network can lead to overfitting, where the model becomes too specialized to the training data and performs poorly on new, unseen data.
   - Max-Norm regularization prevents the weights from growing excessively, promoting better generalization to new data.

### 2. Stability During Training:
   - Prevents exploding gradients: Large weights can cause the gradients during backpropagation to become very large, leading to instability in training. Max-Norm regularization helps mitigate this issue.


Let's see a brief implementation:

In [39]:
keras.layers.Dense(
    100, 
    activation="elu", 
    kernel_initializer="he_normal", 
    kernel_constraint=keras.constraints.max_norm(1.0)
)

<keras.src.layers.core.dense.Dense at 0x2325b4bbcd0>