# Gradients and Initialization
## Backpropagation
- Derivatives of a loss tell us how the loss changes when we make a small change to the parameters
- Backpropagation algorithm
  - Computes the partial derivative of the loss with respect to each parameter
  - Consists of
    - Forward pass: Compute and store a series of intermediate values and the network output
    - Backward pass: Compute the derivatives with respect to each parameter, starting at the end of the network and reusing the stored intermediate values
  - Setting: Deep neural network $f[x_i,\phi]$ with $K$ hidden layers, ReLU activations, and individual loss term $l_{i} = l[f[x_{i},\phi],y_{i}]$
  - Goal: Compute the partial derivatives $\frac{\partial l_{i}}{\partial \beta_{k}}$ and $\frac{\partial l_{i}}{\partial \Omega_{k}}$ with respect to the biases $\beta_{k}$ and weights $\Omega_{k}$
  - Forward pass:
    - Compute and store the pre-activations $f_k$ and activations $h_k$ <br> $f_{0} = \beta_{0} + \Omega_{0}x_{i}$ <br> $h_{k} = a[f_{k-1}], k \in \{1,2,\ldots,K\}$ <br> $f_{k} = \beta_{k} + \Omega_{k}h_{k}, k \in \{1,2,\ldots,K\}$
  - Backward pass:
    - Start with the derivative $\frac{\partial l_i}{\partial f_K}$ of the loss $l_{i}$ with respect to the network output $f_{K}$ and work backward through the network <br>
    $$\frac{\partial l_i}{\partial \beta_k} = \frac{\partial l_i}{\partial f_k}, k \in \{K,K-1,\ldots,1\}$$
    $$\frac{\partial l_i}{\partial \Omega_k} = \frac{\partial l_i}{\partial f_k}h_k^T, k \in \{K,K-1,\ldots,1\}$$
    $$\frac{\partial l_i}{\partial f_{k-1}} = \mathbb{I}[f_{k-1} > 0] \odot \left(\Omega_{k}^T \frac{\partial l_i}{\partial f_k}\right), k \in \{K,K-1,\ldots,1\}$$
    - For the first layer, the input $x_i$ takes the place of $h_k$ <br>
    $$\frac{\partial l_i}{\partial \beta_0} = \frac{\partial l_i}{\partial f_0}$$
    $$\frac{\partial l_i}{\partial \Omega_0} = \frac{\partial l_i}{\partial f_0}x_i^T$$
  - Compute these derivatives for every training example in the batch and sum them to obtain the batch gradient
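
The forward and backward pass equations can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation. It assumes a least squares loss $l_i = \sum (f_K - y_i)^2$, column vectors of shape `(D, 1)`, and a tiny network with two hidden layers. The names `forward`, `backward`, and `loss` are made up for this example. The analytic derivative of one weight is then compared with a central finite difference.

```python
import numpy as np

def forward(x, betas, omegas):
    """Forward pass: store pre-activations f_0..f_K and activations h_1..h_K."""
    f = [betas[0] + omegas[0] @ x]
    h = [None]  # h[0] is unused so that indices match the equations
    for k in range(1, len(omegas)):
        h.append(np.maximum(f[k - 1], 0.0))    # h_k = ReLU(f_{k-1})
        f.append(betas[k] + omegas[k] @ h[k])  # f_k = beta_k + Omega_k h_k
    return f, h

def backward(x, y, f, h, omegas):
    """Backward pass: derivatives of the loss w.r.t. every bias and weight."""
    K = len(omegas) - 1
    dl_df = 2.0 * (f[K] - y)  # assumed loss: sum((f_K - y)^2)
    dl_dbeta = [None] * (K + 1)
    dl_domega = [None] * (K + 1)
    for k in range(K, 0, -1):
        dl_dbeta[k] = dl_df
        dl_domega[k] = dl_df @ h[k].T
        # step back through Omega_k and the ReLU that produced h_k
        dl_df = (f[k - 1] > 0) * (omegas[k].T @ dl_df)
    dl_dbeta[0] = dl_df
    dl_domega[0] = dl_df @ x.T
    return dl_dbeta, dl_domega

def loss(x, y, betas, omegas):
    f, _ = forward(x, betas, omegas)
    return float(np.sum((f[-1] - y) ** 2))

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]  # input size, two hidden layers, output size
omegas = [rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))
          for d_in, d_out in zip(sizes[:-1], sizes[1:])]
betas = [np.zeros((d_out, 1)) for d_out in sizes[1:]]
x = rng.standard_normal((sizes[0], 1))
y = rng.standard_normal((sizes[-1], 1))

f, h = forward(x, betas, omegas)
dl_dbeta, dl_domega = backward(x, y, f, h, omegas)
analytic = dl_domega[1][0, 0]

# central finite difference for the same weight entry
eps = 1e-6
omegas[1][0, 0] += eps
loss_plus = loss(x, y, betas, omegas)
omegas[1][0, 0] -= 2 * eps
loss_minus = loss(x, y, betas, omegas)
omegas[1][0, 0] += eps  # restore the original weight
numeric = (loss_plus - loss_minus) / (2 * eps)

print(f"analytic {analytic:.6f}, numeric {numeric:.6f}")
```

The two printed values should agree to several decimal places. For a batch, the same two functions would be run per example and the resulting derivatives summed.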

## Parameter initialization
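- Suppose we initialize the weights from a normal distribution $\mathcal{N}(0,\sigma_{\Omega}^2)$ and set the biases to zero
  - If the variance is too small, the magnitudes of the pre-activations shrink from layer to layer. This can lead to **vanishing gradients**
  - If the variance is too large, the magnitudes of the pre-activations grow from layer to layer. This can lead to **exploding gradients**
- He initialization keeps the variance of the pre-activations stable in the forward pass for ReLU networks, where $D_h$ is the number of hidden units the weight matrix is applied to $$\sigma_{\Omega}^2 = \frac{2}{D_h}$$
- The backward pass is stable for $\sigma_{\Omega}^2 = 2/D_h'$, where $D_h'$ is the number of hidden units the weight matrix feeds into. If the weight matrix is not square ($D_h \neq D_h'$), we can use the mean $(D_h + D_h')/2$ as a proxy, which yields $$\sigma_{\Omega}^2 = \frac{4}{D_h + D_h'}$$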
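
The effect of the weight variance can be checked numerically. The sketch below is a rough illustration under stated assumptions: square weight matrices of width 100, zero biases, 50 ReLU layers, and a standard normal input. The helper name `preactivation_variance` is hypothetical. The code pushes the input through the layers for three choices of $\sigma_{\Omega}^2$ and reports the variance of the final pre-activations.

```python
import numpy as np

def preactivation_variance(sigma_sq, depth=50, width=100, seed=0):
    """Variance of the pre-activations after `depth` ReLU layers whose
    weights are drawn from N(0, sigma_sq); biases are assumed to be zero."""
    rng = np.random.default_rng(seed)
    f = rng.standard_normal((width, 1))
    for _ in range(depth):
        omega = rng.normal(0.0, np.sqrt(sigma_sq), size=(width, width))
        f = omega @ np.maximum(f, 0.0)  # f_k = Omega_k ReLU(f_{k-1})
    return float(np.var(f))

D_h = 100
too_small = preactivation_variance(1.0 / D_h)  # half the He variance
he = preactivation_variance(2.0 / D_h)         # He initialization
too_large = preactivation_variance(4.0 / D_h)  # twice the He variance

print(f"sigma^2 = 1/D_h: {too_small:.3e}")
print(f"sigma^2 = 2/D_h: {he:.3e}")
print(f"sigma^2 = 4/D_h: {too_large:.3e}")
```

With half the He variance, the pre-activation variance roughly halves at every layer and collapses toward zero. With twice the He variance it roughly doubles at every layer. The He setting should stay within a few orders of magnitude of the input variance, although a single random draw at this width still fluctuates noticeably.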

## PyTorch code

import torch, torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from torch.optim.lr_scheduler import StepLR
# define input size, hidden layer size, output size
D_i, D_k, D_o = 10, 40, 5
# create model with two hidden layers
model = nn.Sequential(
    nn.Linear(D_i, D_k),
    nn.ReLU(),
    nn.Linear(D_k, D_k),
    nn.ReLU(),
    nn.Linear(D_k, D_o))
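# He (Kaiming) initialization of weights, uniform variant with variance 2 / fan_in
def weights_init(layer_in):
    if isinstance(layer_in, nn.Linear):
        nn.init.kaiming_uniform_(layer_in.weight, nonlinearity='relu')
        # start with all biases at zero
        nn.init.zeros_(layer_in.bias)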
model.apply(weights_init)
# choose least squares loss function
criterion = nn.MSELoss()
# construct SGD optimizer and initialize learning rate and momentum
optimizer = torch.optim.SGD(model.parameters(), lr = 0.1, momentum=0.9)
# object that decreases learning rate by half every 10 epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)
# create 100 random data points and store in data loader class
x = torch.randn(100, D_i)
y = torch.randn(100, D_o)
data_loader = DataLoader(TensorDataset(x,y), batch_size=10, shuffle=True)
# loop over the dataset 100 times
for epoch in range(100):
    epoch_loss = 0.0
    # loop over batches, unpacking inputs and labels for each batch
    for x_batch, y_batch in data_loader:
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward pass
        pred = model(x_batch)
        loss = criterion(pred, y_batch)
        # backward pass
        loss.backward()
        # SGD update
        optimizer.step()
        # update statistics
        epoch_loss += loss.item()
    # print the batch losses summed over this epoch
    print(f'Epoch {epoch:5d}, loss {epoch_loss:.3f}')
    # tell scheduler to consider updating learning rate
    scheduler.step()

Epoch     0, loss 20.080
Epoch     1, loss 9.577
Epoch     2, loss 9.208
Epoch     3, loss 8.933
Epoch     4, loss 8.582
Epoch     5, loss 8.016
Epoch     6, loss 7.561
Epoch     7, loss 7.298
Epoch     8, loss 7.198
Epoch     9, loss 6.812
Epoch    10, loss 6.666
Epoch    11, loss 6.227
Epoch    12, loss 5.782
Epoch    13, loss 5.658
Epoch    14, loss 5.520
Epoch    15, loss 5.296
Epoch    16, loss 5.148
Epoch    17, loss 4.929
Epoch    18, loss 4.813
Epoch    19, loss 4.626
Epoch    20, loss 4.521
Epoch    21, loss 4.416
Epoch    22, loss 4.354
Epoch    23, loss 4.317
Epoch    24, loss 4.293
Epoch    25, loss 4.261
Epoch    26, loss 4.207
Epoch    27, loss 4.194
Epoch    28, loss 4.137
Epoch    29, loss 4.106
Epoch    30, loss 4.048
Epoch    31, loss 4.021
Epoch    32, loss 4.005
Epoch    33, loss 3.966
Epoch    34, loss 3.940
Epoch    35, loss 3.931
Epoch    36, loss 3.911
Epoch    37, loss 3.904
Epoch    38, loss 3.866
Epoch    39, loss 3.860
Epoch    40, loss 3.838
Epoch    41, lo