# Goal
Understand how Autograd enables us to train neural networks

# Imports

In [None]:
import torch as th
from torchviz import make_dot
import matplotlib.pyplot as plt

### Autograd

<b> In a nutshell: </b>
* feature of PyTorch that allows computation of partial derivatives (gradients)
* central to backpropagation <br>

<b> Training a NN: </b>
* View a NN as a collection of nested functions <br>
<b> Forward pass </b>
* NN makes its best guess about the outputs
* Data flows through each of these functions to compute the output <br>
<b> Backward pass </b>
* NN adjusts its parameters in the direction that minimizes the loss
* A gradient = vector (direction, magnitude) that points in the direction of steepest increase in the loss function, so the parameters are moved in the opposite direction
* Gradients are collected starting from the end (the loss) and moving back toward the inputs <br>
![gradient_descent.png](../Presentations/assets_autograd/gradient_descent.png)
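
As a rough illustration of the forward/backward loop described above, here is a minimal sketch on a hypothetical one-parameter problem (the loss $(w-3)^2$ and the learning rate are assumptions chosen for the example, not part of the ResNet setup used in this notebook):

```python
import torch as th

# Hypothetical one-parameter "model": minimize loss(w) = (w - 3)^2
w = th.tensor([0.0], requires_grad=True)
lr = 0.1  # assumed learning rate

for step in range(50):
    loss = ((w - 3.0) ** 2).sum()   # forward pass
    loss.backward()                 # backward pass: fills w.grad with 2 * (w - 3)
    with th.no_grad():
        w -= lr * w.grad            # step against the gradient
    w.grad.zero_()                  # clear the accumulated gradient

print(w.item())  # close to 3.0
```

The update is done by hand here only to make the rule visible; in practice an optimizer such as th.optim.SGD performs this step.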

Q: How does PyTorch adjust the weights?

In [None]:
# General setup
from torchvision.models import resnet18, ResNet18_Weights

BATCH_SIZE, CHANNELS, IMG_HEIGHT, IMG_WIDTH, NUM_CLASSES = 32, 3, 64, 64, 1000
data = th.rand(BATCH_SIZE, CHANNELS, IMG_HEIGHT, IMG_WIDTH)
targ = th.rand(BATCH_SIZE, NUM_CLASSES)
weights = ResNet18_Weights.IMAGENET1K_V1

model = resnet18(weights=weights)
optim = th.optim.SGD(model.parameters(), lr=1e-2)
device = th.device('cuda' if th.cuda.is_available() else 'cpu')
objective = th.nn.CrossEntropyLoss()

model.to(device)
data = data.to(device)
targ = targ.to(device)

pred = model(data)  # forward pass
loss = objective(pred, targ)  # evaluate guess
initial_weight = model.fc.weight[0][0].item()
loss.backward()   # backward pass
grad_value = model.fc.weight.grad[0][0].item()
optim.step()  # adjust weights
updated_weight = model.fc.weight[0][0].item()
print(initial_weight, grad_value, updated_weight, initial_weight - 1e-2*grad_value)

### Computation graphs
![computation_graph.png](../Presentations/assets_autograd/computation_graph.png)

* During forward propagation the (conceptual) graph is augmented ==> the real graph is created
* Each node has a context (saves inputs and outputs of the function)
* Specifically, the graph (a DAG) is constructed from torch.autograd.Function(s) and tensors; tensors created directly with requires_grad=True are its leaves

In [None]:
x1 = th.tensor([1.0], requires_grad=True)  # leaf node because they are not the result of any computation
x2 = th.tensor([2.0], requires_grad=True)
a = x1 * x2
y1 = th.log(a)
y2 = th.sin(x2)
w = y1 * y2
z = w

print(x1.grad_fn) # can't backpropagate beyond leaf nodes
print(z.grad_fn)
print(isinstance(z.grad_fn, th.autograd.graph.Node))
print(dir(z.grad_fn))
print(y1, y2, z.grad_fn._saved_other, z.grad_fn._saved_self)

make_dot(z, params={"x1":x1, "x2":x2})


### Differentiation
![tracking.png](../Presentations/assets_autograd/tracking.png)

Q: How does Autograd collect gradients?

Autograd traces computation dynamically at runtime, so the model can contain ifs and for loops whose length is not known in advance. <br>
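
A small sketch of what "dynamic" means in practice, assuming a hypothetical helper `repeated_double` whose loop length and branch are only decided at runtime:

```python
import torch as th

def repeated_double(x, n_steps):
    # Hypothetical function: the loop length is only known at call time
    out = x
    for _ in range(n_steps):
        if out.sum() < 100:      # data-dependent branch
            out = out * 2
    return out

x = th.tensor([1.0, 2.0], requires_grad=True)
y = repeated_double(x, n_steps=3)   # the graph records exactly the ops that ran
y.sum().backward()
print(x.grad)  # tensor([8., 8.]) because x was doubled three times
```

Only the operations that actually executed end up in the graph, so a different input or `n_steps` would produce a different graph.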

* Every computed tensor tracks a history of its inputs and the function used to create it
* Every such function has a built-in implementation to compute its own derivatives (Add --> AddBackward)
* Setting requires_grad=True on a tensor makes Autograd track every computation that follows, and the flag propagates to the output tensors, i.e. <b>requires_grad=True is contagious</b>; if y is computed from x, then y tracks its computation history
* A grad_fn such as SinBackward tells us that during the backprop step we'll need to compute the derivative of sin, i.e. the gradient of the function's output w.r.t. its inputs
* Gradients of a tensor are accumulated on every bw pass --> reset them with optimizer.zero_grad()
* Calling backward frees the tensors saved in the graph => calling it a second time on the same graph will result in an error unless retain_graph=True was passed
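
The last two points can be checked with a short sketch (the tensor values are arbitrary, and the manual `zero_()` call stands in for what an optimizer's zero_grad() is used for in a training loop):

```python
import torch as th

x = th.tensor([1.0, 2.0], requires_grad=True)

# Gradients accumulate across backward passes
(x ** 2).sum().backward()
first = x.grad.clone()          # 2 * x
(x ** 2).sum().backward()
second = x.grad.clone()         # 2 * x added a second time
x.grad.zero_()                  # manual reset

# A second backward through the same graph fails unless retain_graph=True
y = (x ** 2).sum()
y.backward()
try:
    y.backward()
    second_call_failed = False
except RuntimeError:
    second_call_failed = True

print(first, second, second_call_failed)
```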

In [None]:
x = th.tensor([1., 2.])
print(x)
x = th.tensor([1., 2.], requires_grad=True)
print(x)
y = th.sin(x)
print(y)

Each grad_fn stored within a tensor allows us to track the computation all the way back to its inputs using the <b>next_functions</b> property

In [None]:
x = th.tensor([1., 2.])
print(x)
x = th.tensor([1., 2.], requires_grad=True)
y = x**2 + 1
z = 3*y
print(z.grad_fn)
print(z.grad_fn.next_functions)
print(z.grad_fn.next_functions[0][0].next_functions)

Q: How efficient is Autograd?

When training a neural network we need to compute gradients of a loss function w.r.t. every weight and bias.
If we were to expand the expression $\nabla_WL$ using the chain rule, it would result in many partial derivatives over every weight, activation function and other mathematical transformation. Each such partial derivative is the sum, over every possible path through the computation graph that leads to the variable whose gradient we are trying to measure, of the product of the local gradients along that path => the number of terms tends to grow exponentially with the depth of the network.

![paths.png](../Presentations/assets_autograd/paths.png)
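
To make the "sum over paths" statement concrete, here is a minimal sketch on a hypothetical two-path graph, $z = a \cdot b$ with $a = x^2$ and $b = \sin(x)$ (the function and the value of x are assumptions for illustration):

```python
import torch as th

x = th.tensor(1.5, requires_grad=True)
a = x ** 2
b = th.sin(x)
z = a * b
z.backward()

with th.no_grad():
    path_a = b * 2 * x          # path z -> a -> x: dz/da * da/dx
    path_b = a * th.cos(x)      # path z -> b -> x: dz/db * db/dx

print(x.grad, path_a + path_b)  # the two values agree
```

With only two paths the sum is cheap to write out by hand; the point of the techniques below is to avoid enumerating paths when their number explodes.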

For efficient computation of the gradients, Autograd combines two powerful concepts: 
1) Jacobian-vector products to propagate gradients  
2) Each tensor tracks history of its inputs and the function used to create it (each such function has a built-in implementation for its own derivatives).  

Once we call <b>.backward()</b> on L, Autograd starts from $dL/dL=1$ and invokes the backward function of L's grad_fn, which computes the gradient of the Function object's output w.r.t. its inputs, i.e. $dL/dy$. This local gradient is multiplied by the accumulated gradient (for the current node $acc=dL/dL=1$) and then sent to the input node. What we have done is apply the chain rule on the path L -> y.

![jaccobian.png](../Presentations/assets_autograd/jaccobian.png) <br>
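
The propagation step can be sketched on a small hypothetical function $y = x^2$ (elementwise). The vector `v` below stands in for the incoming gradient $dL/dy$; the backward pass multiplies it with the Jacobian without ever building the Jacobian, a product PyTorch's documentation usually calls a vector-Jacobian product:

```python
import torch as th

x = th.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                        # vector output, Jacobian is diag(2 * x)
v = th.tensor([0.1, 1.0, 10.0])   # assumed incoming gradient dL/dy

# Product of v with the Jacobian, computed by the backward pass
(vjp,) = th.autograd.grad(y, x, grad_outputs=v)

# Reference: build the full Jacobian and multiply explicitly
J = th.autograd.functional.jacobian(lambda t: t ** 2, x)
print(vjp, v @ J)                 # both give the same vector
```

Building `J` costs memory quadratic in the tensor size, which is why the explicit form is only used here as a check.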

Note: By default the gradient value (<b>.grad</b>) is only kept for leaf nodes => to read it on an intermediate tensor, call <b>.retain_grad()</b> on that tensor right after creating it (before calling backward)

In [None]:
W = th.tensor([[2., 4.], [1., 0.], [2., 0.5]], requires_grad=True)
b = th.tensor([1.0, 1.0, 1.0], requires_grad=True)

x = th.randn(2)

f = lambda W, b: x @ W.t() + b

y = f(W, b)  # function f
y.retain_grad()

y_t = th.tensor([0.3, 0.4, 0.3])

l = lambda y: ((y-y_t)**2).mean() # l=mse

e = l(y) 
e.retain_grad()

e.backward()

acc_grad = e.grad

print(y.grad.equal(th.autograd.functional.jacobian(l, y) * acc_grad))

acc_grad = th.autograd.functional.jacobian(l, y) * acc_grad

W_j, b_j = th.autograd.functional.jacobian(f, (W, b))
print(W.grad.equal(W_j.permute(0, 2, 1) @ acc_grad))
print(b.grad.equal(b_j.t() @ acc_grad))

# or
acc_grad = e.grad
print(acc_grad)
grad_L_y = th.autograd.functional.jacobian(l, y)
acc_grad = grad_L_y * acc_grad
print(acc_grad)
grad_y_w, grad_y_b = th.autograd.functional.jacobian(f, (W, b))
acc_grad = grad_y_w.permute(0, 2, 1) @ acc_grad
print(acc_grad.equal(W.grad))

### Custom Autograd Functions 
Fine grained control over gradients and computations <br>
$ y = 3x^2 + 2x + 1$ <br>
$ \frac{dy}{dx} = 6x + 2 $

In [None]:
class QuadraticFunction(th.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return 3*input**2 + 2*input + 1

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        # chain rule: upstream gradient times the local derivative dy/dx = 6x + 2
        return grad_output * (6*input + 2)
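
A custom Function is called through its `.apply` method, and its backward can be compared against numerical differentiation with `th.autograd.gradcheck`. The sketch below uses a hypothetical stand-alone copy of the quadratic (named `Quadratic` here so it does not clash with the class above) and double-precision inputs, which gradcheck expects:

```python
import torch as th

class Quadratic(th.autograd.Function):
    # Hypothetical stand-alone copy of y = 3x^2 + 2x + 1 for this check
    @staticmethod
    def forward(ctx, inp):
        ctx.save_for_backward(inp)
        return 3 * inp ** 2 + 2 * inp + 1

    @staticmethod
    def backward(ctx, grad_output):
        (inp,) = ctx.saved_tensors
        return grad_output * (6 * inp + 2)

x = th.tensor([0.5, -1.0, 2.0], dtype=th.double, requires_grad=True)
y = Quadratic.apply(x)
y.sum().backward()
print(x.grad)   # 6x + 2 evaluated elementwise

# Compare the analytic backward against finite differences
ok = th.autograd.gradcheck(Quadratic.apply, (x,))
print(ok)
```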

![relu.png](../Presentations/assets_autograd/relu.png)

In [None]:
class CustomReLUFunction(th.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        # compute the result of forward pass and save input for backward pass
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output = gradient of the loss (accumulated gradient) w.r.t. output
        # we need to compute the gradient of the loss w.r.t. input
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        # zero the gradient where the input was negative;
        # elsewhere the upstream gradient passes through unchanged
        grad_input[input < 0] = 0
        return grad_input

### Turning on/off
Multiple options:
* Without copy: $$ x.requires\_grad=True$$
* With copy:  $$ x.detach() $$ (returns a new tensor that is cut off from the graph; it still shares its data with x)
* Turning it off(on) temporarily using context manager: $$ with\ torch.no\_grad() \ \\or\ with\ torch.enable\_grad() $$
* For a group of operations using decorator: $$ @torch.no\_grad() \ \\ or\ @torch.enable\_grad() $$

Use case: Freezing some layers: <br>
![freeze.png](../Presentations/assets_autograd/freeze.png)
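
A minimal sketch of the freezing use case, assuming a hypothetical two-layer model in which the first layer plays the role of the frozen part:

```python
import torch as th
import torch.nn as nn

# Hypothetical two-layer model; the first layer is the part we freeze
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
for p in model[0].parameters():
    p.requires_grad = False

# Hand only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = th.optim.SGD(trainable, lr=1e-2)

loss = model(th.randn(16, 4)).pow(2).mean()
loss.backward()
optimizer.step()

print(model[0].weight.grad)               # None: no gradient was computed
print(model[2].weight.grad is not None)   # True: the last layer still trains
```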


Without copy

In [None]:
w1 = th.ones(10, requires_grad=True)
w2 = th.ones(10, requires_grad=True)

y = w1*2 + w2*3

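# requires_grad must be switched off BEFORE the computation;
# here it is set afterwards, so y above still carries a grad_fn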
w1.requires_grad = False
w2.requires_grad = False

print(y)

Temporarily

In [None]:
w1 = th.ones(10, requires_grad=True)
w2 = th.ones(10, requires_grad=True)

with th.no_grad():
    y2 = w1*2 + w2*3
    print(y2)

y3 = w1*2 + w2*3
print(y3)

Decorator

In [None]:
w1 = th.ones(10, requires_grad=True)
w2 = th.ones(10, requires_grad=True)
             
@th.no_grad()
def linear_func(w1, w2):
    return w1*2 + w2*3
    
y4 = linear_func(w1, w2)
print(y4)

With copy <br>
Use case: perform intermediate computations without affecting the gradients

In [None]:
w1 = th.ones(10, requires_grad=True)
w2 = th.ones(10, requires_grad=True)

y5 = w1.detach()*2 + w2*3
make_dot(y5)
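
What detach() does to the gradients can also be checked numerically. The sketch below (tensor sizes are arbitrary) shows that no gradient reaches the detached operand and that the detached tensor shares its data with the original:

```python
import torch as th

w1 = th.ones(3, requires_grad=True)
w2 = th.ones(3, requires_grad=True)

y = (w1.detach() * 2 + w2 * 3).sum()
y.backward()

print(w1.grad)   # None: the detached branch is not part of the graph
print(w2.grad)   # tensor([3., 3., 3.])

# detach() returns a new tensor object that points at the same memory
same_storage = w1.detach().data_ptr() == w1.data_ptr()
print(same_storage)
```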

### Gradient reset

Reset gradients as they get accumulated:

In [None]:
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])


trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = th.utils.data.DataLoader(trainset, batch_size=32,
                                          shuffle=True, num_workers=2)

# make it smaller
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc = nn.Linear(588, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv(x)))
        x = th.flatten(x, 1) 
        x = self.fc(x)
        return x

In [None]:
def train(zero_gradients=True):
    # fall back to the CPU when no GPU is available
    device = th.device('cuda' if th.cuda.is_available() else 'cpu')
    net = Net()
    net.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    loss_history = []
    for epoch in range(5):

        for i, data in enumerate(trainloader):
            inputs, labels = data
            inputs = inputs.to(device)
            labels = labels.to(device)

            # without this reset, gradients from earlier batches keep adding up
            if zero_gradients:
                optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            loss_history.append(loss.item())
    return loss_history

In [None]:
zg = train(zero_gradients=True)
nzg = train(zero_gradients=False)

plt.plot(zg)
plt.plot(nzg)
plt.legend(['Zero gradients', 'No zero gradients'])

### Profiler

Track CPU+GPU times for each function (executed in C++) and its corresponding derivative function

In [None]:
device = th.device('cpu')
run_on_gpu = False
if th.cuda.is_available():
    device = th.device('cuda')
    run_on_gpu = True
    
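B, Cin, Ns = 32, 16, 100
# create the data and the layer on the selected device so the GPU is actually profiled
x = th.randn(B, Cin, Ns, device=device)
conv1d = th.nn.Conv1d(Cin, 32, kernel_size=3).to(device)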

with th.autograd.profiler.profile(use_cuda=run_on_gpu) as prf:
    for _ in range(1000):
        y = conv1d(x)
        
print(prf.key_averages().table(sort_by='self_cpu_time_total'))

Resources:
* https://towardsdatascience.com/getting-started-with-pytorch-part-1-understanding-how-automatic-differentiation-works-5008282073ec
* https://pytorch.org/tutorials/beginner/introyt/autogradyt_tutorial.html#what-do-we-need-autograd-for
* https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
* https://pytorch.org/docs/stable/notes/extending.html
* https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/