1. **Vanishing Gradients:**
    - **Definition:** In a deep neural network, the gradients (the partial derivatives of the loss with respect to each parameter, used to update the parameters during training) can become extremely small as they are propagated backward through the layers.
    - **Key Aspect:** Backpropagation multiplies one factor per layer, so when those factors are smaller than 1 the product shrinks roughly exponentially with depth. The early layers then learn very slowly or not at all, because their parameter updates are too small to have any effect.

In [4]:
import numpy as np

# Define a simple neural network with small weights
def vanishing_gradients_example():
    np.random.seed(42)
    input_data = np.random.rand(1, 3)  # Input data with three features
    weights = np.random.rand(3, 3) * 0.1  # Small random weights

    for _ in range(10):
        # Forward pass
        hidden_layer = np.dot(input_data, weights)
        output = np.dot(hidden_layer, weights)

        # Backward pass (Gradient computation)
        output_gradient = np.random.rand(1, 3)  # Gradient of the output layer (random for illustration)
        hidden_layer_gradient = np.dot(output_gradient, weights.T)  # Gradient at the hidden layer

        # Update weights (Gradient Descent)
        learning_rate = 0.1
        weights -= learning_rate * hidden_layer.T.dot(output_gradient)
    print("Updated weights after training (vanishing gradients):")
    print(weights)

vanishing_gradients_example()


Updated weights after training (vanishing gradients):
[[ 0.03920641 -0.00065686  0.00055813]
 [-0.02343697  0.06295061  0.03688675]
 [ 0.01939752 -0.03821543  0.05544233]]
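
To see the effect from the definition directly, the sketch below backpropagates through a stack of sigmoid layers and records the gradient norm arriving at each layer. The depth, width, weight scale, and the helper name `sigmoid_gradient_norms` are illustrative assumptions, not part of the original notebook, and a gradient of ones stands in for a real loss gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient_norms(depth=20, width=16, seed=0):
    """Return the gradient norm at each layer's input, ordered from the first layer to the last."""
    rng = np.random.default_rng(seed)
    weight_std = 1.0 / np.sqrt(width)  # assumed scale, chosen only for illustration
    weights = [rng.normal(0.0, weight_std, size=(width, width)) for _ in range(depth)]

    # Forward pass: keep every activation, since sigmoid'(z) = a * (1 - a)
    activation = rng.normal(size=(1, width))
    activations = []
    for w in weights:
        activation = sigmoid(activation @ w)
        activations.append(activation)

    # Backward pass: start from a gradient of ones at the output
    grad = np.ones((1, width))
    norms = []
    for w, a in zip(reversed(weights), reversed(activations)):
        grad = (grad * a * (1.0 - a)) @ w.T  # chain rule through sigmoid, then the linear map
        norms.append(np.linalg.norm(grad))
    return norms[::-1]

norms = sigmoid_gradient_norms()
print(f"gradient norm reaching layer 1:  {norms[0]:.3e}")
print(f"gradient norm reaching layer 20: {norms[-1]:.3e}")
```

Because the sigmoid derivative never exceeds 0.25, each backward step tends to shrink the gradient, so the norm reaching the first layer should come out many orders of magnitude smaller than the norm at the last layer.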


2. **Exploding Gradients:**
    - **Definition:** Conversely, exploding gradients occur when the gradients grow exceptionally large during backpropagation, typically because the per-layer factors are larger than 1.
    - **Key Aspect:** This leads to unstable training: the model's parameters change by large amounts at each step, so the model can diverge or fail to converge to a good solution.

In [5]:
# Define a simple neural network with large weights
def exploding_gradients_example():
    np.random.seed(42)
    input_data = np.random.rand(1, 3)  # Input data with three features
    weights = np.random.rand(3, 3) * 10  # Large random weights

    for _ in range(10):
        # Forward pass
        hidden_layer = np.dot(input_data, weights)
        output = np.dot(hidden_layer, weights)

        # Backward pass (Gradient computation)
        output_gradient = np.random.rand(1, 3)  # Gradient of the output layer (random for illustration)
        hidden_layer_gradient = np.dot(output_gradient, weights.T)  # Gradient at the hidden layer

        # Update weights (Gradient Descent)
        learning_rate = 0.1
        weights -= learning_rate * hidden_layer.T.dot(output_gradient)

    print("Updated weights after training (exploding gradients):")
    print(weights)

exploding_gradients_example()


Updated weights after training (exploding gradients):
[[ 3.92064111 -0.06568602  0.0558127 ]
 [-2.34369726  6.29506135  3.68867495]
 [ 1.9397522  -3.82154291  5.5442327 ]]
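
One thing worth noticing about the two printouts above: the update rule is linear in `weights` and both runs use the same seed, so the second result is simply the first scaled by 100, and neither run shows growth or decay across depth. As a complementary sketch, the code below backpropagates a unit-norm gradient through a stack of linear layers and tracks its norm. The function name `linear_backprop_norms`, the depth, the width, and the two weight scales are assumptions chosen for illustration.

```python
import numpy as np

def linear_backprop_norms(weight_std, depth=20, width=16, seed=0):
    """Backpropagate a unit-norm gradient through `depth` linear layers, recording its norm after each."""
    rng = np.random.default_rng(seed)
    grad = np.ones((1, width)) / np.sqrt(width)  # unit-norm gradient at the output
    norms = []
    for _ in range(depth):
        w = rng.normal(0.0, weight_std, size=(width, width))
        grad = grad @ w.T  # a linear layer h @ W backpropagates as grad @ W.T
        norms.append(np.linalg.norm(grad))
    return norms

width = 16
large = linear_backprop_norms(weight_std=2.0 / np.sqrt(width))
balanced = linear_backprop_norms(weight_std=1.0 / np.sqrt(width))
print(f"std = 2/sqrt(width): norm after 20 layers = {large[-1]:.3e}")
print(f"std = 1/sqrt(width): norm after 20 layers = {balanced[-1]:.3e}")
```

Doubling the weight scale doubles the gradient at every layer, so after 20 layers the two runs differ by a factor of about 2**20, which is the exponential growth the definition describes.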


3. **Consequences:**
    - **Vanishing Gradients Consequence:** The early layers (those closest to the input) receive almost no gradient signal and may fail to learn meaningful representations, which limits the overall performance of the model.
    - **Exploding Gradients Consequence:** Unstable training can hinder convergence, making it difficult for the model to learn useful patterns from the data.

4. **Mitigation Strategies:**
    - **Weight Initialization:** Careful initialization of the weights can help alleviate both vanishing and exploding gradients.
    - **Activation Functions:** Choosing appropriate activation functions, such as ReLU (whose derivative is 1 for positive inputs), can mitigate vanishing gradient problems.
    - **Batch Normalization:** Normalizing intermediate layer outputs keeps activations in a stable range, which helps stabilize training. The technique was originally motivated as a way to reduce internal covariate shift.
    - **Gradient Clipping:** Capping the gradient values, or the overall gradient norm, at a threshold prevents any single update from becoming too large during training.
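
As a minimal sketch of the gradient clipping strategy, the code below rescales a set of gradients so that their combined L2 norm stays under a threshold. Clipping by global norm is one common variant; element-wise clipping with `np.clip` is another. The helper name `clip_by_global_norm` and the example gradients are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale down only when the norm exceeds the threshold; small gradients pass through unchanged
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Two made-up gradient arrays with global norm sqrt(9 + 16 + 144) = 13
grads = [np.array([[3.0, 4.0]]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print("norm before clipping:", norm_before)
print("norm after clipping: ", norm_after)
```

Scaling every array by the same factor preserves the direction of the update and only limits its length.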

## Xavier Initialization (Glorot Initialization):

Xavier initialization aims to keep the variance of activations and gradients roughly the same across layers.
It's well-suited for tanh and sigmoid activation functions.
The weights are drawn from a zero-mean distribution with variance 2 / (fan_in + fan_out), where fan_in and fan_out are the numbers of input and output units in a layer.

In [6]:
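# Xavier (Glorot) initialization with a uniform distribution
def Xavier_init(fan_in, fan_out):
    # Note: sqrt(2 / (fan_in + fan_out)) is the standard deviation of the Glorot
    # normal variant. The usual Glorot uniform bound is sqrt(6 / (fan_in + fan_out)),
    # so this bound samples weights with one third of the Glorot variance.
    limit = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))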

def Xavier_init_example():
    np.random.seed(42)
    input_data = np.random.rand(1, 3)  # Input data with three features
    weights = Xavier_init(3, 3) # Xavier

    for _ in range(10):
        # Forward pass
        hidden_layer = np.dot(input_data, weights) # 1x3 * 3x3 = 1x3
        output = np.dot(hidden_layer, weights) # 1x3 * 3x3 = 1x3

        # Backward pass (Gradient computation)
        output_gradient = np.random.rand(1, 3)  # Gradient of the output layer (random for illustration) 
        hidden_layer_gradient = np.dot(output_gradient, weights.T)  # Gradient at the hidden layer

        # Update weights (Gradient Descent)
        learning_rate = 0.1
        weights -= learning_rate * hidden_layer.T.dot(output_gradient)

    print("Updated weights after training (Xavier initialization):")
    print(weights)

Xavier_init_example()


Updated weights after training (Xavier initialization):
[[ 0.24504301 -0.29374122 -0.2895481 ]
 [-0.43516865  0.48203275  0.17840939]
 [ 0.06683412 -0.69038429  0.40046285]]
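
The `Xavier_init` cell above uses sqrt(2 / (fan_in + fan_out)) as the bound of a uniform distribution. For comparison, here is a sketch of the two usual Glorot variants with an empirical check of their variance. The function names `glorot_uniform` and `glorot_normal` and the layer sizes are assumptions made for this example.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # Uniform(-a, a) has variance a**2 / 3, so a = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def glorot_normal(fan_in, fan_out, rng):
    # Normal variant: the standard deviation is set directly
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in, fan_out = 400, 600  # large layer so the sample variance is a stable estimate
target_var = 2.0 / (fan_in + fan_out)
w_uniform = glorot_uniform(fan_in, fan_out, rng)
w_normal = glorot_normal(fan_in, fan_out, rng)

print("target variance: ", target_var)
print("uniform variance:", w_uniform.var())
print("normal variance: ", w_normal.var())
```

Both sample variances should land close to 2 / (fan_in + fan_out), which is the quantity the description of Xavier initialization refers to.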


## He Initialization:

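It's a variant of Xavier initialization designed for ReLU activation functions.
ReLU zeroes out roughly half of its inputs, which halves the variance of the signal at each layer. He initialization compensates with a larger scaling factor: weights are drawn with variance 2 / fan_in, where fan_in is the number of input units.
A related but separate issue is the "dying ReLU" problem, where neurons get stuck at zero activation.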

<b>The "Dying ReLU" Problem:</b>

Vanilla ReLU outputs 0 for negative inputs and passes positive inputs through unchanged, so its gradient is 0 whenever the input is negative. This can lead to the "dying ReLU" problem: a neuron whose pre-activation is negative for essentially every input outputs 0 everywhere, receives no gradient, and therefore stays stuck in a permanently inactive state.
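
A small sketch can make this concrete. It compares one ReLU neuron with a zero bias against the same neuron with a bias pushed far into the negative range. The weights, the bias values, and the helper name `relu_neuron_gradient` are hypothetical choices for illustration.

```python
import numpy as np

def relu_neuron_gradient(x, w, b):
    """Return (fraction of inputs that activate the neuron, gradient of the mean output w.r.t. w)."""
    z = x @ w + b
    active = (z > 0).astype(float)  # derivative of ReLU: 1 where z > 0, else 0
    grad_w = (active[:, None] * x).mean(axis=0)
    return active.mean(), grad_w

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))  # 1000 inputs with three features
w = np.array([0.5, -0.3, 0.2])

alive_fraction, alive_grad = relu_neuron_gradient(x, w, b=0.0)
dead_fraction, dead_grad = relu_neuron_gradient(x, w, b=-10.0)  # bias far below any x @ w here

print("alive neuron: active fraction", alive_fraction, "gradient", alive_grad)
print("dead neuron:  active fraction", dead_fraction, "gradient", dead_grad)
```

With the large negative bias no input activates the neuron, so its weight gradient is zero and gradient descent cannot move it back into the active range.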

In [51]:
# He initialization: zero-mean normal weights with standard deviation sqrt(2 / fan_in)
def he_init(fan_in, fan_out):
    std = np.sqrt(2.0 / fan_in)
    return np.random.randn(fan_in, fan_out) * std

def he_init_example():
    np.random.seed(42)
    input_data = np.random.rand(1, 3)  # Input data with three features
    weights = he_init(3, 3) # He initialization

    for _ in range(10):
        # Forward pass
        hidden_layer = np.dot(input_data, weights) # 1x3 * 3x3 = 1x3
        output = np.dot(hidden_layer, weights) # 1x3 * 3x3 = 1x3

        # Backward pass (Gradient computation)
        output_gradient = np.random.rand(1, 3)  # Gradient of the output layer (random for illustration) 
        hidden_layer_gradient = np.dot(output_gradient, weights.T)  # Gradient at the hidden layer

        # Update weights (Gradient Descent)
        learning_rate = 0.1
        weights -= learning_rate * hidden_layer.T.dot(output_gradient)

    print("Updated weights after training (He initialization):")
    print(weights)

he_init_example()


Updated weights after training (He initialization):
[[-1.11077427  0.06486551  0.00779027]
 [ 1.0969696  -0.23179696 -0.15457116]
 [ 0.22889471 -0.14776024 -1.42075096]]
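
To illustrate why the scaling factor matters for ReLU, the sketch below pushes random inputs through a deep stack of ReLU layers and compares the mean squared activation at the end under He scaling and under Glorot scaling. The depth, width, batch size, and function names are assumptions made for this example, and the network has no biases or training step.

```python
import numpy as np

def he_std(fan_in, fan_out):
    return np.sqrt(2.0 / fan_in)

def glorot_std(fan_in, fan_out):
    return np.sqrt(2.0 / (fan_in + fan_out))

def relu_forward_second_moment(std_fn, depth=30, width=256, batch=256, seed=0):
    """Mean squared activation after `depth` ReLU layers with normal weights of scale std_fn."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    for _ in range(depth):
        w = rng.normal(0.0, std_fn(width, width), size=(width, width))
        h = np.maximum(0.0, h @ w)  # linear layer followed by ReLU
    return np.mean(h ** 2)

he_moment = relu_forward_second_moment(he_std)
glorot_moment = relu_forward_second_moment(glorot_std)
print(f"He scaling:     mean squared activation = {he_moment:.3e}")
print(f"Glorot scaling: mean squared activation = {glorot_moment:.3e}")
```

Under He scaling the signal should stay on roughly the same order of magnitude as the input, while under Glorot scaling with equal fan-in and fan-out it is roughly halved at every ReLU layer and ends up many orders of magnitude smaller.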
