# **Q5:** Vanishing Gradient Issue

Run the provided network code and observe the gradient norms for each layer. 

Describe the pattern you see in the gradient values and identify where the vanishing gradient problem is evident.

-  Briefly explain why sigmoid activation functions can lead to vanishing gradients, particularly in deep networks.

-  Modify the network to use ReLU or Leaky ReLU activations instead of sigmoid. What changes do you see in the gradient norms across layers?

-  Experiment with different weight initialization strategies, such as Xavier or He initialization, and compare the resulting gradient norms across layers.
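
As a hint for the first bullet, the short sketch below checks numerically that the slope of the sigmoid never exceeds 0.25. The variable names (`pre_act`, `max_slope`) and the input range are illustrative assumptions, not part of the provided network code. By the chain rule, backpropagating through 50 sigmoid layers multiplies in 50 such slopes, so the activation derivatives alone contribute a factor of at most $0.25^{50} = 2^{-100}$.

```python
import torch

# Illustrative grid of pre-activation values (an assumption for this sketch).
pre_act = torch.linspace(-10.0, 10.0, steps=2001, requires_grad=True)

# Summing the outputs lets one backward() call return the elementwise slope
# d(sigmoid)/d(pre_act) in pre_act.grad.
torch.sigmoid(pre_act).sum().backward()

max_slope = pre_act.grad.max().item()  # the peak is at pre_act = 0
print("largest sigmoid slope:", max_slope)
print("upper bound on the product of 50 slopes:", max_slope ** 50)
```

The weights also enter the product, so this bound is only part of the story, but it shows why stacking many sigmoid layers tends to shrink the gradient that reaches the early layers.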


In [35]:
import torch
import torch.nn as nn
from collections import OrderedDict
import plotly.graph_objects as go

### Gradient Norms Calculation

The following code snippet calculates the gradient norms for the parameters (weights) of a neural network model after backpropagation.

Inspecting these norms helps us understand how learning behaves and diagnose problems such as vanishing gradients.

#### Gradient Norm

The gradient norm is the L2 (Euclidean) norm of the gradient of the loss with respect to a parameter of the neural network. For an $m \times n$ weight matrix $\mathbf{W}^{(l)}$ of layer $l$, it is defined as:

$
\left\|\frac{\partial L}{\partial \mathbf{W}^{(l)}}\right\|_2 = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} \left(\frac{\partial L}{\partial w^{(l)}_{i,j}}\right)^2}
$

The gradient norm $\left\|\frac{\partial L}{\partial \mathbf{W}^{(l)}}\right\|_2$ represents the magnitude of the gradient matrix of the loss function $L$ with respect to the weights $\mathbf{W}^{(l)}$.

A lower gradient norm means smaller gradient values and therefore smaller weight updates. The gradient norm is zero only if the gradient with respect to every weight in the layer is zero. We can therefore use per-layer gradient norms to detect the vanishing gradient problem.
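
To make the definition concrete, here is a minimal sketch that evaluates the formula by hand and compares it with PyTorch's built-in norm. The toy weight matrix `W`, the input `x`, and the quadratic loss are assumptions chosen only for illustration; they are not part of the network used in this question.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy layer: a 3x2 weight matrix and a simple quadratic loss.
W = torch.randn(3, 2, requires_grad=True)
x = torch.randn(2)

loss = (W @ x).pow(2).sum()
loss.backward()  # fills W.grad with dL/dW, same shape as W

# Square root of the sum of squared entries, as in the formula above.
manual_norm = torch.sqrt((W.grad ** 2).sum())

# Tensor.norm() with default arguments computes the same quantity.
builtin_norm = W.grad.norm()

print(manual_norm.item(), builtin_norm.item())
```

The two printed values should agree up to floating-point rounding.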

In [36]:
# Number of linear layers (L - 1 hidden layers plus one output layer)
L = 50
# Number of perceptrons per hidden layer (the same for every hidden layer, for simplicity)
k = 1
# Number of input features
m = 10
# Number of output features
z = 1
# Number of data samples
n = 100



In [37]:
# Random input and target
# n data samples, each with m features and z target values
X = torch.randn(n, m)  
Y = torch.randn(n, z)   


In [38]:
# Define the network with Sigmoid activations
layer_dict = OrderedDict()
for i in range(L):
    if i == 0:
        # first hidden layer: m input features -> k units
        layer_dict[f'lin{i+1}'] = nn.Linear(m, k)
    elif i < (L-1):
        # remaining hidden layers: k -> k units
        layer_dict[f'lin{i+1}'] = nn.Linear(k, k)
    else:
        # output layer: k units -> z outputs
        layer_dict[f'lin{i+1}'] = nn.Linear(k, z)
    # every linear layer, including the output layer, is followed by a sigmoid
    layer_dict[f'act{i+1}'] = nn.Sigmoid()

# Create the deep neural network model
deep_neural_net = nn.Sequential(layer_dict)




- `model.named_parameters()` is a PyTorch method that returns an iterator over the model's parameters together with their names, allowing these parameters to be inspected or modified.
   - `name` is the qualified name of the parameter (for this network, names such as 'lin1.weight' and 'lin1.bias') and `param` is the parameter tensor itself.


- `if param.grad is not None:` checks whether gradients have been computed for the parameter.
   - During backpropagation, gradients may not be computed for some parameters if they are not involved in the computation for the current batch, or if they have been detached from the graph.

- `param.grad.norm().item()` computes the norm of the gradient tensor associated with the parameter.
   - `param.grad.norm()` calculates the L2 norm (also known as Euclidean norm) of the gradient tensor, which is the square root of the sum of the squares of its elements.
   - `.item()` extracts the scalar value from the tensor, making it suitable for appending to a Python list.
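
The following sketch shows these three pieces working together on a small model. The model `tiny_net`, its layer sizes, and the mean-of-outputs loss are hypothetical and chosen only to keep the printout short; the layers are named in the same `lin{i}` / `act{i}` pattern as the network above.

```python
from collections import OrderedDict

import torch
import torch.nn as nn

# Hypothetical two-layer model, named the same way as the network above.
tiny_net = nn.Sequential(OrderedDict([
    ('lin1', nn.Linear(4, 3)),
    ('act1', nn.Sigmoid()),
    ('lin2', nn.Linear(3, 1)),
    ('act2', nn.Sigmoid()),
]))

# Before any backward pass, param.grad is None for every parameter.
print(all(p.grad is None for p in tiny_net.parameters()))

# One forward and backward pass on random data populates param.grad.
loss = tiny_net(torch.randn(8, 4)).mean()
loss.backward()

# Each entry pairs a qualified name with its tensor; activations have no parameters.
for name, param in tiny_net.named_parameters():
    print(name, tuple(param.shape), param.grad.norm().item())
```

Each linear layer contributes two entries, a weight and a bias, so a loop over `named_parameters()` yields twice as many items as there are linear layers.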

In [39]:
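# Function to calculate the gradient norm of each layer's weight matrix
def get_gradient_norms(model, X, Y):
    model.zero_grad()            # clear gradients left over from earlier calls
    Y_hat = model(X)             # forward pass
    loss_func = nn.MSELoss()     # L2 loss (MSE)
    loss = loss_func(Y_hat, Y)
    loss.backward()              # backpropagate to populate param.grad
    gradient_norms = []
    for name, param in model.named_parameters():
        # keep weights only, so there is one entry per layer (biases are skipped)
        if name.endswith('weight') and param.grad is not None:
            gradient_norms.append(param.grad.norm().item())
    return gradient_norms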

In [None]:
# Calculate gradient norms for the given neural network at each layer
gradient_norms_sigmoid = get_gradient_norms(deep_neural_net, X, Y)

# Output the gradient norms
print(gradient_norms_sigmoid)

# Plotting the gradient norms for the vanilla network
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=list(range(1, len(gradient_norms_sigmoid) + 1)),
    y=gradient_norms_sigmoid,
    mode='lines+markers',
    name='Vanilla Sigmoid Network',
    marker=dict(size=8),
    line=dict(width=2)
))

fig.update_layout(
    title='Gradient Norms Across Layers for Vanilla Sigmoid Network',
    xaxis_title='Layer Index',
    yaxis_title='Gradient Norm',
    template='plotly_dark',
    hovermode='closest'
)

fig.show()
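
As a possible starting point for the ReLU and initialization experiments, the sketch below builds two networks with the same layout and compares their per-layer weight-gradient norms. Everything named here is an assumption made for illustration: the helpers `build_net`, `weight_grad_norms`, and `he_init`, the depth of 20, and the width of 32 (a width of 1 with ReLU is prone to dead units, so a wider layer is used). It is not the reference solution.

```python
from collections import OrderedDict

import torch
import torch.nn as nn


def build_net(num_layers, in_features, width, activation, init_fn=None):
    """Hypothetical helper: a stack of Linear + activation blocks."""
    layers = OrderedDict()
    for i in range(num_layers):
        fan_in = in_features if i == 0 else width
        fan_out = 1 if i == num_layers - 1 else width
        lin = nn.Linear(fan_in, fan_out)
        if init_fn is not None:
            init_fn(lin.weight)       # custom weight initialization
            nn.init.zeros_(lin.bias)  # start biases at zero
        layers[f'lin{i+1}'] = lin
        layers[f'act{i+1}'] = activation()
    return nn.Sequential(layers)


def weight_grad_norms(model, X, Y):
    """L2 norm of the loss gradient for each layer's weight matrix."""
    model.zero_grad()
    nn.MSELoss()(model(X), Y).backward()
    return [p.grad.norm().item()
            for name, p in model.named_parameters()
            if name.endswith('weight')]


def he_init(weight):
    # He (Kaiming) normal initialization, intended for ReLU-style activations.
    nn.init.kaiming_normal_(weight, nonlinearity='relu')


torch.manual_seed(0)
X_demo = torch.randn(100, 10)
Y_demo = torch.randn(100, 1)

sigmoid_net = build_net(20, 10, 32, nn.Sigmoid)             # PyTorch default init
relu_he_net = build_net(20, 10, 32, nn.ReLU, init_fn=he_init)

sigmoid_norms = weight_grad_norms(sigmoid_net, X_demo, Y_demo)
relu_he_norms = weight_grad_norms(relu_he_net, X_demo, Y_demo)

print("sigmoid, first vs last layer:", sigmoid_norms[0], sigmoid_norms[-1])
print("ReLU + He, first vs last layer:", relu_he_norms[0], relu_he_norms[-1])
```

Swapping `nn.ReLU` for `nn.LeakyReLU`, or `he_init` for a function that calls `nn.init.xavier_uniform_`, gives the other variants mentioned in the question. Plotting the norm lists on a logarithmic y-axis makes differences of several orders of magnitude easier to see.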