# Xavier Initialization

Xavier initialization, also known as Glorot initialization (named after Xavier Glorot, who proposed it), is a weight initialization technique that helps alleviate vanishing and exploding gradients in deep neural networks, particularly those with sigmoid and tanh activation functions. It scales the initial weights so that each layer's output variance stays close to its input variance, which keeps activation and gradient magnitudes reasonable throughout the depth of the network.

## Principle of Xavier Initialization

The main idea behind Xavier initialization is to keep the scale of the activations and gradients roughly the same in all layers. If the weights are too small, the signal shrinks as it passes through each layer until it is too tiny to be useful. If the weights are too large, the signal grows with each layer until it causes numerical instability.
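The effect of weight scale can be illustrated with a small NumPy experiment. This is only a sketch under simplifying assumptions: the layers are purely linear (no activation), all layers share the same width, and the helper name `final_signal_std` and the chosen sizes are made up for this illustration.

```python
import numpy as np

def final_signal_std(weight_std, width=256, depth=20, seed=0):
    """Pass random inputs through `depth` linear layers and return the output std."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((128, width))  # batch of unit-variance inputs
    for _ in range(depth):
        w = rng.normal(0.0, weight_std, size=(width, width))
        x = x @ w  # activation omitted on purpose to isolate the effect of weight scale
    return float(x.std())

width = 256
balanced = 1.0 / np.sqrt(width)  # keeps the variance roughly constant for a linear layer

too_small = final_signal_std(0.5 * balanced)
about_right = final_signal_std(balanced)
too_large = final_signal_std(2.0 * balanced)

print(f"weights too small: output std = {too_small:.2e}")
print(f"weights balanced:  output std = {about_right:.2e}")
print(f"weights too large: output std = {too_large:.2e}")
```

Halving or doubling the weight scale changes the signal by a factor of roughly 0.5 or 2 per layer, so after 20 layers the output is either vanishingly small or very large.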

Xavier initialization specifically addresses these issues by considering the number of input and output neurons associated with a specific layer and initializing the weights to maintain a variance that allows for an appropriate flow of gradients.

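Xavier initialization draws a layer's weights $W$ randomly from a distribution with zero mean and variance $\mathrm{Var}(W) = \frac{2}{n_{in} + n_{out}}$, where $n_{in}$ and $n_{out}$ are the numbers of input and output neurons of the layer. In the uniform variant, the weights are sampled from $U(-a, a)$ with $a = \sqrt{6 / (n_{in} + n_{out})}$, which has the same variance.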
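As a rough check of the target variance $2 / (n_{in} + n_{out})$, the sketch below samples weights with NumPy and compares the empirical variance against that value. The function names `glorot_uniform` and `glorot_normal` are illustrative helpers, not library calls, and the normal variant here uses a plain Gaussian (Keras's `GlorotNormal` uses a truncated one).

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))  # variance of U(-a, a) is a**2 / 3
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def glorot_normal(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from N(0, 2 / (fan_in + fan_out))."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256  # example layer sizes, chosen arbitrarily
target_var = 2.0 / (fan_in + fan_out)

w_uniform = glorot_uniform(fan_in, fan_out, rng)
w_normal = glorot_normal(fan_in, fan_out, rng)

print(f"target variance:  {target_var:.6f}")
print(f"uniform variance: {w_uniform.var():.6f}")
print(f"normal variance:  {w_normal.var():.6f}")
```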

## When to Use Xavier Initialization

* **Activation Functions:** It is generally recommended for networks using the tanh or sigmoid activation functions. These activations saturate for large-magnitude inputs, where their derivatives approach zero, so poorly scaled weights quickly lead to vanishing gradients.
* **Not Ideal for ReLU:** For networks using ReLU activations, **He** initialization (a similar approach that accounts for the rectifier's characteristics) is generally preferred. Because ReLU zeroes out roughly half of its inputs, Xavier-initialized weights tend to be too small for ReLU layers, so the signal can shrink as it passes through a deep network.
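The difference between the two schemes under ReLU can be sketched in NumPy. The setup is an assumption made for illustration: a stack of equal-width layers, Gaussian weights with variance $2 / (n_{in} + n_{out})$ for Xavier and $2 / n_{in}$ for He, and a hypothetical helper `relu_output_rms` that reports the root-mean-square activation after the last layer.

```python
import numpy as np

def relu_output_rms(init, width=512, depth=20, seed=0):
    """Return the RMS activation after `depth` ReLU layers for a given init scheme."""
    rng = np.random.default_rng(seed)
    if init == "xavier":
        std = np.sqrt(2.0 / (width + width))  # variance 2 / (fan_in + fan_out)
    elif init == "he":
        std = np.sqrt(2.0 / width)            # variance 2 / fan_in
    else:
        raise ValueError(f"unknown init scheme: {init}")

    x = rng.standard_normal((128, width))  # batch of unit-variance inputs
    for _ in range(depth):
        w = rng.normal(0.0, std, size=(width, width))
        x = np.maximum(x @ w, 0.0)  # ReLU
    return float(np.sqrt(np.mean(x ** 2)))

xavier_rms = relu_output_rms("xavier")
he_rms = relu_output_rms("he")

print(f"Xavier + ReLU: output RMS = {xavier_rms:.2e}")
print(f"He + ReLU:     output RMS = {he_rms:.2e}")
```

In this setup the Xavier-initialized stack loses about half of its signal power at every ReLU layer, while the He-initialized stack keeps it roughly constant.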

In [None]:
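import tensorflow as tf

input_dim = 784  # Assumed number of input features; replace with your data's feature count

# A dense layer with sigmoid activation and Xavier (Glorot) uniform initialization
model = tf.keras.Sequential([
    tf.keras.Input(shape=(input_dim,)),
    tf.keras.layers.Dense(256, activation='sigmoid',
                          kernel_initializer=tf.keras.initializers.GlorotUniform())  # Xavier Uniform
])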


# Momentum in SGD

SGD with momentum considers the past gradients to determine the direction of the new update. Essentially, it adds a fraction of the update vector from the previous step to the current step's update, creating a smoother and more stable convergence to the minimum of the loss function.

Momentum is designed to accelerate convergence by combining:

* A fraction of the previous update (scaled by the momentum parameter).
* The current gradient update.

The formula for velocity is:

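$v_t = \text{momentum} \cdot v_{t-1} - \eta \cdot g_t$

Where:
* $v_t$: Velocity at time $t$
* $v_{t-1}$: Velocity from the previous step
* $\text{momentum}$: The momentum parameter (typically $0 \le \text{momentum} < 1$)
* $\eta$: Learning Rate
* $g_t$: Gradient at time $t$

The parameters are then updated by adding the velocity: $w_{t+1} = w_t + v_t$.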
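A minimal sketch of one momentum update in plain Python follows. It assumes the velocity is computed from the previous step's velocity and then added to the parameter, and it uses the toy loss $f(w) = w^2$ (gradient $2w$). The function name `momentum_step` and the hyperparameter values are chosen for illustration only.

```python
def momentum_step(w, v, grad, learning_rate=0.1, momentum=0.9):
    """Apply one SGD-with-momentum update and return the new (w, v)."""
    v_new = momentum * v - learning_rate * grad  # decay old velocity, add new gradient step
    w_new = w + v_new                            # move the parameter by the velocity
    return w_new, v_new

# Toy problem: minimize f(w) = w**2, whose gradient is 2 * w
w, v = 1.0, 0.0
for step in range(1, 4):
    grad = 2.0 * w
    w, v = momentum_step(w, v, grad)
    print(f"step {step}: velocity = {v:.4f}, w = {w:.4f}")
```

The velocity grows over the first steps because each new gradient points the same way as the accumulated one.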


## Benefits of Using Momentum
* **Faster Convergence:** Momentum can speed up convergence by accelerating gradient descent along directions where successive gradients agree, while damping oscillations along directions where they keep changing sign.
* **Smoothing Effect:** It helps to smooth out the steps of SGD, which is beneficial when dealing with noisy data or gradients.
* **Escape from Plateaus:** In scenarios where the algorithm encounters flat areas (plateaus) or shallow local minima, the accumulated velocity can help it escape and continue learning.
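The convergence benefit can be sketched on a toy problem. The example below assumes an ill-conditioned quadratic loss with curvatures 1 and 100 along its two axes, and compares plain gradient descent (momentum 0) with momentum 0.9 at the same learning rate. The loss, the helper `run`, and all numeric settings are assumptions chosen for illustration.

```python
import numpy as np

curvature = np.array([1.0, 100.0])  # one shallow and one steep direction

def loss(w):
    return 0.5 * float(np.sum(curvature * w ** 2))

def grad(w):
    return curvature * w

def run(momentum, learning_rate=0.01, steps=100):
    """Run gradient descent with the given momentum and return the final loss."""
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = momentum * v - learning_rate * grad(w)
        w = w + v
    return loss(w)

loss_plain = run(momentum=0.0)
loss_momentum = run(momentum=0.9)

print(f"final loss without momentum: {loss_plain:.6f}")
print(f"final loss with momentum:    {loss_momentum:.6f}")
```

Without momentum, progress along the shallow direction is slow because the learning rate is limited by the steep direction. The velocity term lets the optimizer build up speed along the shallow direction.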

### When Is Momentum Not Recommended?

* Highly noisy gradients.
* Extremely flat loss landscapes.
* Dynamically changing objectives.
* Small datasets.
* Poorly tuned learning rates.
* Resource-constrained environments.
* Shallow or simple models.


In [None]:
import tensorflow as tf

# Define an SGD optimizer with momentum
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Configure the optimizer in a Keras model
model.compile(optimizer=optimizer, loss='mean_squared_error')


# ADAM Optimizer

The ADAM optimizer (short for "Adaptive Moment Estimation") is a widely used method for training neural networks. It adapts the learning rate for each parameter individually and adds little computational or memory overhead. ADAM combines ideas from two other extensions of stochastic gradient descent: the Adaptive Gradient Algorithm (AdaGrad), which works well with sparse gradients, and Root Mean Square Propagation (RMSProp), which works well on noisy, non-stationary problems.

## How ADAM Works
ADAM maintains two separate estimates for each parameter:

* **First Moment (the mean)** - Essentially an exponentially decaying average of past gradients.
* **Second Moment (the uncentered variance)** - An exponentially decaying average of past squared gradients.
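The two moment estimates can be turned into a parameter update as in the NumPy sketch below. It follows the standard ADAM update, including the bias correction that compensates for both estimates starting at zero. The function name `adam_step` is illustrative, the default hyperparameters are the commonly cited ones (0.001, 0.9, 0.999, 1e-8), and the toy loss $f(w) = \sum w^2$ is an assumption made for the example.

```python
import numpy as np

def adam_step(w, grad, m, v, t, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """Apply one ADAM update at step t (1-based) and return the new (w, m, v)."""
    m = beta_1 * m + (1.0 - beta_1) * grad       # first moment: decaying mean of gradients
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2  # second moment: decaying mean of squared gradients
    m_hat = m / (1.0 - beta_1 ** t)              # bias-corrected first moment
    v_hat = v / (1.0 - beta_2 ** t)              # bias-corrected second moment
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return w, m, v

# Toy problem: minimize f(w) = sum(w**2), whose gradient is 2 * w
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 4):
    grad = 2.0 * w
    w, m, v = adam_step(w, grad, m, v, t)
    print(f"step {t}: w = {w}")
```

Because the update divides by the square root of the second moment, both parameters move by about the same amount per step even though their gradients differ in size.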

## Benefits of Using ADAM
* **Adaptive Learning Rate:** Each parameter gets its own effective learning rate, derived from the first and second moment estimates of its gradients, which makes training less sensitive to the initial learning rate choice.
* **Efficiency:** It is computationally efficient and needs little extra memory: two additional values per parameter.
* **Well Suited to Large or Noisy Problems:** It performs well on problems with large datasets, many parameters, or a very noisy objective function.
* **Bias Correction:** It corrects the first and second moment estimates for their bias toward zero, which arises because both are initialized at zero. This keeps the step sizes sensible at the beginning of training.

In [None]:
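import numpy as np
import tensorflow as tf

# Placeholder training data (assumed for illustration): 1,000 samples with 100 features each
x_train = np.random.rand(1000, 100).astype("float32")
y_train = np.random.rand(1000, 1).astype("float32")

# Create a model
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Compile the model with ADAM optimizer (Keras default learning rate of 0.001)
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
model.fit(x_train, y_train, epochs=10)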
