# **Q5:** Vanishing Gradient Issue

Run the provided network code and observe the gradient norms for each layer.

Describe the pattern you see in the gradient values and identify where the vanishing gradient problem is evident.

-  Briefly explain why sigmoid activation functions can lead to vanishing gradients, particularly in deep networks.

-  Modify the network to use ReLU or Leaky ReLU activations instead of sigmoid. What changes do you see in the gradient norms across layers?

-  Experiment with different weight initialization strategies, such as Xavier or He initialization.


In [1]:
import torch
import torch.nn as nn
from collections import OrderedDict
import plotly.graph_objects as go

### Gradient Norms Calculation

The following code snippet calculates the gradient norms for the parameters (weights and biases) of a neural network model after backpropagation.

These norms help us understand the behavior of the learning process, especially when diagnosing problems such as vanishing gradients.

#### Gradient Norm

The gradient norm is the L2 norm (Euclidean norm) of the gradient of the loss with respect to a parameter of the neural network. For the weight matrix $\mathbf{W}^{(l)}$ of layer $l$, it is defined as:

$$
\left\|\frac{\partial L}{\partial \mathbf{W}^{(l)}}\right\|_2 = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} \left(\frac{\partial L}{\partial w^{(l)}_{i,j}}\right)^2}
$$

Here $m$ and $n$ are the numbers of rows and columns of $\mathbf{W}^{(l)}$; they are unrelated to the variables `m` and `n` defined in the code. The gradient norm $\left\|\frac{\partial L}{\partial \mathbf{W}^{(l)}}\right\|_2$ represents the magnitude of the gradient matrix of the loss function $L$ with respect to the weights $\mathbf{W}^{(l)}$.

A lower gradient norm means smaller gradient values and therefore smaller weight updates. The norm is zero only if every partial derivative with respect to the weights of that layer is zero. Gradient norms are therefore a convenient way to track the vanishing gradient issue.
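As a quick check of the definition, the sketch below computes the norm of a small gradient matrix by hand in plain Python. The matrix values and the helper name `frobenius_norm` are illustrative assumptions and do not come from the notebook.

```python
import math

def frobenius_norm(grad):
    """L2 norm of a matrix given as a list of rows: sqrt of the sum of squares."""
    return math.sqrt(sum(g * g for row in grad for g in row))

# Hypothetical 2x2 gradient matrix, chosen so the result is easy to verify by hand.
grad_W = [[3.0, 0.0],
          [0.0, 4.0]]

print(frobenius_norm(grad_W))  # 5.0, since sqrt(3^2 + 4^2) = 5
```

This is the same quantity that `param.grad.norm()` returns for a weight tensor in the code further down.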

In [2]:
# Number of Hidden Layers
L = 5
# Number of perceptrons in each hidden layer (assumed equal across hidden layers for simplicity)
k = 10
# Number of input features
m = 10
# Number of output features
z = 1
# Number of data samples
n = 100



In [3]:
# Random input and target
# n data samples, each with m features and z target values
X = torch.randn(n, m)
Y = torch.randn(n, z)


In [4]:
# Define the network with Sigmoid activations
layer_dict = OrderedDict()
for i in range(L):
    # first hidden layer
    if i == 0:
        layer_dict[f'lin{i+1}'] = nn.Linear(m, k)
    # remaining hidden layers
    elif i < (L-1):
        layer_dict[f'lin{i+1}'] = nn.Linear(k, k)
    else:
        # output layer
        layer_dict[f'lin{i+1}'] = nn.Linear(k, z)  # z output units (a single perceptron here)
    layer_dict[f'act{i+1}'] = nn.Sigmoid()

# Create the deep neural network model
deep_neural_net = nn.Sequential(layer_dict)




- `model.named_parameters()` is a PyTorch method that returns an iterator over the model's parameters, along with their names, allowing inspection or modification of these parameters.
   - `name` is the qualified name of the parameter (for this model, names such as 'lin1.weight' and 'lin1.bias') and `param` is the parameter tensor itself.


- `if param.grad is not None:` checks whether gradients have been computed for the parameter.
   - During backpropagation, gradients may not be computed for some parameters if they are not involved in the computation for the current batch, or if they have been detached from the graph.

- `param.grad.norm().item()` computes the norm of the gradient tensor associated with the parameter.
   - `param.grad.norm()` calculates the L2 norm (also known as Euclidean norm) of the gradient tensor, which is the square root of the sum of the squares of its elements.
   - `.item()` extracts the scalar value from the tensor, making it suitable for appending to a Python list.
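The calls described above can be tried on a tiny model. This is a minimal sketch: the model `tiny`, its layer sizes, and the squared-output loss are arbitrary assumptions for illustration, not the notebook's network.

```python
from collections import OrderedDict
import torch
import torch.nn as nn

# Tiny illustrative model (sizes are arbitrary, not the notebook's).
tiny = nn.Sequential(OrderedDict([
    ('lin1', nn.Linear(3, 2)),
    ('act1', nn.Sigmoid()),
    ('lin2', nn.Linear(2, 1)),
]))

# Each Linear layer contributes a weight and a bias; activations have no parameters.
names = [name for name, _ in tiny.named_parameters()]
print(names)  # ['lin1.weight', 'lin1.bias', 'lin2.weight', 'lin2.bias']

# Before any backward pass, param.grad is None for every parameter.
grads_before = [p.grad is None for p in tiny.parameters()]

# Run one backward pass on a simple loss so that gradients are populated.
loss = tiny(torch.randn(4, 3)).pow(2).mean()
loss.backward()

# One L2 norm per parameter tensor, as a plain Python float.
norms = {name: p.grad.norm().item() for name, p in tiny.named_parameters()}
print(norms)
```

Because weights and biases are listed separately, a network with 5 linear layers yields 10 gradient norms.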

In [5]:
# Function to calculate gradient norms
def get_gradient_norms(model, X, Y):
    model.zero_grad()
    Y_hat = model(X)
    loss_func = nn.MSELoss()   #L2 Loss (MSE)
    loss = loss_func(Y_hat, Y)
    loss.backward()
    gradient_norms = []
    for name, param in model.named_parameters():
        if param.grad is not None:
            gradient_norms.append(param.grad.norm().item())
    return gradient_norms

In [6]:
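# Calculate gradient norms for the given neural network.
# The list holds one value per parameter tensor, in the order
# lin1.weight, lin1.bias, lin2.weight, ..., so the L = 5 layers produce 10 values.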
gradient_norms_sigmoid = get_gradient_norms(deep_neural_net, X, Y)

# Output the gradient norms
print(gradient_norms_sigmoid)

# Plotting the gradient norms for the vanilla network
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=list(range(1, len(gradient_norms_sigmoid) + 1)),
    y=gradient_norms_sigmoid,
    mode='lines+markers',
    name='Vanilla Sigmoid Network',
    marker=dict(size=8),
    line=dict(width=2)
))

fig.update_layout(
    title='Gradient Norms Across Layers for Vanilla Sigmoid Network',
    xaxis_title='Parameter Index (weight, then bias, per layer)',
    yaxis_title='Gradient Norm',
    template='plotly_dark',
    hovermode='closest'
)

fig.show()

[1.8628388716024347e-05, 3.066662611672655e-05, 0.0005138158448971808, 0.00033092324156314135, 0.003808162175118923, 0.002379715209826827, 0.04350128769874573, 0.02624613605439663, 0.3425194323062897, 0.20319567620754242]
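One way to read these numbers is through the derivative of the sigmoid, which never exceeds 0.25. The sketch below is a simplification that ignores the weight matrices and only looks at the activation factor; the helper names are illustrative assumptions. In this network the gradient for the first layer's weights passes through four more sigmoid derivatives than the gradient for the last layer's weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0.
peak = sigmoid_prime(0.0)
print(peak)       # 0.25

# Upper bound on the activation factor picked up across four extra sigmoid layers.
print(peak ** 4)  # 0.00390625

# Away from zero the sigmoid saturates and the derivative is smaller still.
print(sigmoid_prime(4.0) < 0.02)  # True
```

Each additional sigmoid layer can therefore shrink the backpropagated signal by a factor of four or more, which is consistent with the steady drop in the norms toward the earliest layers.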


In [7]:
# Define the network with Leaky ReLU activations
layer_dict_other = OrderedDict()
for i in range(L):
    # first hidden layer
    if i == 0:
        layer_dict_other[f'lin{i+1}'] = nn.Linear(m, k)
    # remaining hidden layers
    elif i < (L-1):
        layer_dict_other[f'lin{i+1}'] = nn.Linear(k, k)
    else:
        # output layer
        layer_dict_other[f'lin{i+1}'] = nn.Linear(k, z)  # z output units (a single perceptron here)
    layer_dict_other[f'act{i+1}'] = nn.LeakyReLU()

# Create the deep neural network model
deep_neural_net_other = nn.Sequential(layer_dict_other)

In [8]:
# Calculate gradient norms for the given neural network at each layer
gradient_norms_other = get_gradient_norms(deep_neural_net_other, X, Y)

# Output the gradient norms
print(gradient_norms_other)

# Plotting the gradient norms for the Leaky ReLU network
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=list(range(1, len(gradient_norms_other) + 1)),
    y=gradient_norms_other,
    mode='lines+markers',
    marker=dict(size=8),
    line=dict(width=2)
))

fig.update_layout(
    title='Gradient Norms Across Layers for Other Activations',
    xaxis_title='Parameter Index (weight, then bias, per layer)',
    yaxis_title='Gradient Norm',
    template='plotly_dark',
    hovermode='closest'
)

fig.show()

[0.012357436120510101, 0.007089552469551563, 0.013694400899112225, 0.010775835253298283, 0.029977064579725266, 0.05505665764212608, 0.06813295185565948, 0.16980555653572083, 0.1350313276052475, 0.4393619894981384]
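To compare the two runs numerically, the sketch below takes the two printed lists, keeps only the weight entries (every second value, since weights and biases alternate), and divides the last layer's weight-gradient norm by the first layer's. The variable and function names are illustrative assumptions. These numbers come from a single random initialization, so they will differ between runs.

```python
# Values copied from the two printed outputs above.
printed_sigmoid = [1.8628388716024347e-05, 3.066662611672655e-05, 0.0005138158448971808,
                   0.00033092324156314135, 0.003808162175118923, 0.002379715209826827,
                   0.04350128769874573, 0.02624613605439663, 0.3425194323062897,
                   0.20319567620754242]
printed_leaky = [0.012357436120510101, 0.007089552469551563, 0.013694400899112225,
                 0.010775835253298283, 0.029977064579725266, 0.05505665764212608,
                 0.06813295185565948, 0.16980555653572083, 0.1350313276052475,
                 0.4393619894981384]

def last_to_first_weight_ratio(norms):
    """Ratio of the last layer's weight-gradient norm to the first layer's."""
    weight_norms = norms[::2]  # even positions are weights, odd positions are biases
    return weight_norms[-1] / weight_norms[0]

ratio_sigmoid = last_to_first_weight_ratio(printed_sigmoid)
ratio_leaky = last_to_first_weight_ratio(printed_leaky)
print(ratio_sigmoid, ratio_leaky)
```

For these particular runs the ratio is roughly 18,000 for the sigmoid network and roughly 11 for the Leaky ReLU network, so the gradient shrinks far less on its way back to the first layer when Leaky ReLU is used.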


# Weight Initialization Strategies in Neural Networks

## Xavier/Glorot Initialization

For a layer with $n_{in}$ inputs and $n_{out}$ outputs:
$$W \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{in} + n_{out}}}$$

- Aims to maintain variance of activations and gradients across layers
- Assumes linear activation functions or tanh/sigmoid
- Prevents vanishing/exploding signals in both forward and backward passes

```python
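import numpy as np
import torch.nn as nn

n_in, n_out = 10, 10  # example layer sizes
layer = nn.Linear(n_in, n_out)

# PyTorch Xavier/Glorot Initialization
nn.init.xavier_normal_(layer.weight, gain=1.0)

# Manual Implementation
std = np.sqrt(2.0 / (n_in + n_out))
weights = np.random.normal(0, std, size=(n_out, n_in))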
```

## He Initialization

For a layer with $n_{in}$ inputs:
$$W \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{in}}}$$

- Designed specifically for ReLU activation functions
- Accounts for ReLU dropping negative values
- Maintains variance for positive activations
- Suitable for deep architectures with many layers


```python
# PyTorch He Initialization
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
```

## Comparison

### Xavier/Glorot
- Best for linear, tanh, sigmoid activations
- Considers both input and output dimensions
- Variance of weights: $Var(W) = \frac{2}{n_{in} + n_{out}}$

### He
- Optimized for ReLU and variants
- Considers only input dimension
- Variance of weights: $Var(W) = \frac{2}{n_{in}}$
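To make the two variance formulas concrete, the sketch below evaluates the corresponding standard deviations for a 10-to-10 layer, the hidden layer size used in this notebook. The helper names `xavier_std` and `he_std` are illustrative assumptions.

```python
import math

def xavier_std(n_in, n_out):
    """Standard deviation for Xavier/Glorot normal initialization."""
    return math.sqrt(2.0 / (n_in + n_out))

def he_std(n_in):
    """Standard deviation for He normal initialization."""
    return math.sqrt(2.0 / n_in)

n_in, n_out = 10, 10  # hidden layer size used in this notebook (k = 10)
print(round(xavier_std(n_in, n_out), 4))  # 0.3162
print(round(he_std(n_in), 4))             # 0.4472
```

When $n_{in} = n_{out}$, the He standard deviation is larger than the Xavier one by a factor of $\sqrt{2}$, which compensates for ReLU zeroing out roughly half of the pre-activations.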

## Mathematical Intuition

### Forward Pass Variance
For a layer with input $x$ and weights $W$:

1. **Xavier/Glorot**:
   $$Var(Wx) \approx Var(x)$$

2. **He**:
   $$Var(ReLU(Wx)) \approx Var(x)$$
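The forward-pass claims can be checked with a small simulation in plain Python. This is a rough sketch under stated assumptions: inputs are zero-mean with unit variance, $n_{in} = n_{out} = 64$, and the quantity measured is the mean square of the layer outputs, which is what He initialization preserves under ReLU. The function names and sizes are illustrative.

```python
import math
import random

random.seed(0)
n_in = n_out = 64
n_samples = 200

def mean_square(values):
    return sum(v * v for v in values) / len(values)

def simulate(std, activation):
    """Mean square of activation(Wx) for Gaussian W with the given std and unit-variance x."""
    W = [[random.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    outputs = []
    for _ in range(n_samples):
        x = [random.gauss(0.0, 1.0) for _ in range(n_in)]
        for row in W:
            pre = sum(w * xi for w, xi in zip(row, x))
            outputs.append(activation(pre))
    return mean_square(outputs)

# Xavier with a linear (identity) activation.
xavier = simulate(math.sqrt(2.0 / (n_in + n_out)), lambda v: v)
# He with a ReLU activation.
he = simulate(math.sqrt(2.0 / n_in), lambda v: max(v, 0.0))

print(xavier, he)  # both should land close to 1.0, the mean square of the inputs
```

Both values should come out near 1, meaning the signal scale is roughly preserved from input to output in each setting.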

### Backward Pass Variance
For gradient $\delta$ flowing backward:

1. **Xavier/Glorot**:
   $$Var(W^T\delta) \approx Var(\delta)$$

2. **He**:
   $$Var(W^T(ReLU'(Wx)\delta)) \approx Var(\delta)$$

In [9]:
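# Create the deep neural network model.
# Note: nn.Sequential(layer_dict) reuses the same layer objects as deep_neural_net,
# so re-initializing them here also changes deep_neural_net.
deep_neural_net_sigmoid_kaimeng = nn.Sequential(layer_dict)

# Apply He (Kaiming) initialization using PyTorch's built-in function.
# With nonlinearity='sigmoid' PyTorch uses a gain of 1, so std = 1 / sqrt(fan_in).
for name, module in deep_neural_net_sigmoid_kaimeng.named_modules():
    if isinstance(module, nn.Linear):
        # He initialization for weights
        nn.init.kaiming_normal_(module.weight, mode='fan_in', nonlinearity='sigmoid')
        # Zero initialization for biases
        nn.init.zeros_(module.bias)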

# Calculate gradient norms for the given neural network at each layer
gradient_norms_sigmoid_Kaimeng = get_gradient_norms(deep_neural_net_sigmoid_kaimeng, X, Y)

# Output the gradient norms
print(gradient_norms_sigmoid_Kaimeng)

# Plotting the gradient norms for the Sigmoid network with He initialization
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=list(range(1, len(gradient_norms_sigmoid_Kaimeng) + 1)),
    y=gradient_norms_sigmoid_Kaimeng,
    mode='lines+markers',
    name='He Initialization',
    marker=dict(size=8),
    line=dict(width=2)
))

fig.update_layout(
    title='Gradient Norms Across Layers with He initialization',
    xaxis_title='Parameter Index (weight, then bias, per layer)',
    yaxis_title='Gradient Norm',
    template='plotly_dark',
    hovermode='closest'
)

fig.show()





[0.0005468870513141155, 0.0010482256766408682, 0.007102056406438351, 0.004478410352021456, 0.024837154895067215, 0.014628857374191284, 0.08468004316091537, 0.056439392268657684, 0.3818344175815582, 0.24892641603946686]


In [10]:
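# Alternative: Xavier/Glorot initialization.
# Note: layer_dict_other holds the Leaky ReLU layers, so this run pairs Xavier
# with Leaky ReLU; pass layer_dict instead to test Xavier with Sigmoid.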
deep_neural_net_xavier = nn.Sequential(layer_dict_other)
for name, module in deep_neural_net_xavier.named_modules():
    if isinstance(module, nn.Linear):
        # Xavier initialization for weights
        nn.init.xavier_normal_(module.weight, gain=1.0)
        # Zero initialization for biases
        nn.init.zeros_(module.bias)

# Calculate gradient norms for the given neural network at each layer
gradient_norms_xavier = get_gradient_norms(deep_neural_net_xavier, X, Y)

# Output the gradient norms
print(gradient_norms_xavier)

# Plotting the gradient norms for the network with Xavier initialization
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=list(range(1, len(gradient_norms_xavier) + 1)),
    y=gradient_norms_xavier,
    mode='lines+markers',
    marker=dict(size=8),
    line=dict(width=2)
))

fig.update_layout(
    title='Gradient Norms Across Layers with Xavier Initialization',
    xaxis_title='Parameter Index (weight, then bias, per layer)',
    yaxis_title='Gradient Norm',
    template='plotly_dark',
    hovermode='closest'
)

fig.show()


[0.01369679719209671, 0.004151427187025547, 0.020848531275987625, 0.008705598302185535, 0.022342799231410027, 0.015539039857685566, 0.039693936705589294, 0.08935532718896866, 0.04022403433918953, 0.1322801560163498]
