# A Gentle Introduction to `torch.autograd`

https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

In [3]:
import torch
from torchvision.models import resnet18, ResNet18_Weights

In [18]:
model = resnet18(weights=ResNet18_Weights.DEFAULT)
# random data tensor to represent a single image with 3 channels
data = torch.rand(1, 3, 64, 64)
# its corresponding label initialized to some random value
labels = torch.rand(1, 1000)

labels.is_leaf

True

Next, we run the input data through each layer of the model to make a prediction. This is the **forward pass**.

In [19]:
prediction = model(data) # forward pass

print(type(prediction))
print(prediction.dtype)
print(prediction.size())
print('---')
print(labels.size())
print(prediction.is_leaf)

<class 'torch.Tensor'>
torch.float32
torch.Size([1, 1000])
---
torch.Size([1, 1000])
False
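The `is_leaf` checks above can be reproduced on a much smaller scale. The following is a minimal sketch, assuming hypothetical stand-ins `w` (playing the role of a model parameter) and `x` (playing the role of the input data); neither name comes from the notebook.

```python
import torch

# Hypothetical stand-ins: `w` acts like a model parameter, `x` like input data.
w = torch.ones(3, requires_grad=True)   # created directly, tracked by autograd
x = torch.rand(3)                       # created directly, not tracked

# `y` is the result of an operation on a tracked tensor,
# so autograd records how it was produced.
y = w * x

print(w.is_leaf, x.is_leaf, y.is_leaf)
print(y.grad_fn is not None)
```

Tensors created directly are leaves, while `y` carries a `grad_fn` and is not a leaf. This mirrors why `labels` is a leaf and `prediction` is not.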


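We use the model's prediction and the corresponding label to calculate the error (`loss`). The next step is to backpropagate this error through the network. Backward propagation starts when we call `.backward()` on the error tensor; autograd then computes the gradient for each model parameter and stores it in the parameter's `.grad` attribute.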

In [20]:
loss = (prediction - labels).sum()
print(loss)
print(loss.is_leaf)
loss.backward() # backward pass

tensor(-500.3402, grad_fn=<SumBackward0>)
False
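To see what `.backward()` leaves behind, here is a small sketch that uses the same loss shape as above, `(prediction - labels).sum()`, on a toy "model" that multiplies the input by a weight. The tensors `w`, `x`, and `target` are hypothetical and chosen only so the gradient is easy to verify by hand.

```python
import torch

# Hypothetical toy setup: the "model" is just an elementwise product w * x.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # stand-in parameter
x = torch.tensor([0.5, -1.0, 2.0])                      # stand-in input
target = torch.zeros(3)                                 # stand-in label

loss = (w * x - target).sum()

grad_before = w.grad        # nothing has been backpropagated yet
loss.backward()             # backward pass fills in w.grad

print(grad_before)
print(w.grad)               # d(loss)/dw equals x for this loss
```

Because the loss is a plain sum of `w * x - target`, its derivative with respect to each element of `w` is the matching element of `x`.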


Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and *momentum* of 0.9. We register all the parameters of the model in the optimizer.

In [21]:
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

Finally, we call `.step()` to initiate gradient descent. The optimizer adjusts each parameter using the gradient stored in its `.grad` attribute.

In [22]:
optim.step() #gradient descent
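The effect of `lr` and `momentum` is easier to follow on a single scalar parameter. The sketch below is an illustration only, assuming a hypothetical parameter `w` and a loss whose gradient is always 2. It also calls `zero_grad()` before each backward pass, since gradients otherwise accumulate in `.grad` across calls.

```python
import torch

# Hypothetical one-element parameter so the arithmetic can be checked by hand.
w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=1e-2, momentum=0.9)

history = []
for _ in range(2):
    opt.zero_grad()            # clear gradients from the previous iteration
    loss = (2.0 * w).sum()     # gradient of this loss w.r.t. w is always 2
    loss.backward()
    opt.step()                 # update w using w.grad and the momentum buffer
    history.append(w.item())

print(history)
```

On the first step the momentum buffer equals the gradient, so `w` moves by `0.01 * 2` to roughly 0.98. On the second step the buffer becomes `0.9 * 2 + 2 = 3.8`, so `w` moves by `0.01 * 3.8` to roughly 0.942.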

## Exclusion from the DAG

In a neural network, parameters that don’t compute gradients are usually called **frozen parameters**. It is useful to “freeze” part of your model if you know in advance that you won’t need the gradients of those parameters; skipping them reduces autograd computation and so offers some performance benefit.

In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels. Let’s walk through a small example to demonstrate this. As before, we load a pretrained resnet18 model, and freeze all the parameters.

In [23]:
from torch import nn, optim

model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False

Let’s say we want to finetune the model on a new dataset with 10 labels. In resnet, the classifier is the last linear layer `model.fc`. We can simply replace it with a new linear layer (unfrozen by default) that acts as our classifier.

In [24]:
model.fc = nn.Linear(512, 10)

Now all parameters in the model, except the parameters of `model.fc`, are frozen. The only parameters that compute gradients are the weights and bias of `model.fc`.
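One way to confirm which parameters remain trainable is to list those with `requires_grad=True`. The sketch below uses a tiny hypothetical `nn.Sequential` as a stand-in for the pretrained network so that it runs without downloading weights; the layer sizes are arbitrary.

```python
from torch import nn

# Hypothetical stand-in for a pretrained backbone followed by a classifier.
model = nn.Sequential(
    nn.Linear(8, 4),   # plays the role of the backbone
    nn.ReLU(),
    nn.Linear(4, 3),   # plays the role of the original classifier
)

# Freeze everything, as in the cell above.
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier; a freshly created layer is unfrozen by default.
model[2] = nn.Linear(4, 10)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Only the weight and bias of the replaced layer appear in `trainable`. The same list comprehension should work on the resnet18 model, where the trainable names would be expected to belong to `fc`.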

In [25]:
# All parameters are registered, but only the classifier (model.fc) is updated
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

Notice that although we register all the parameters in the optimizer, the only parameters that compute gradients (and hence are updated in gradient descent) are the weights and bias of the classifier.
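The claim that frozen parameters stay untouched even when registered in the optimizer can be checked directly. This is a minimal sketch, assuming a hypothetical two-layer model (`backbone` and `head`) in place of resnet18.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical stand-ins: a frozen "backbone" and a trainable "head".
backbone = nn.Linear(8, 4)
for p in backbone.parameters():
    p.requires_grad = False
head = nn.Linear(4, 2)
model = nn.Sequential(backbone, head)

# Register every parameter, frozen or not.
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

backbone_before = backbone.weight.detach().clone()
head_before = head.weight.detach().clone()

loss = model(torch.rand(5, 8)).sum()
loss.backward()
optimizer.step()

print(backbone.weight.grad)                           # frozen: no gradient stored
print(torch.equal(backbone.weight, backbone_before))  # frozen weights unchanged
print(torch.equal(head.weight, head_before))          # head weights were updated
```

The frozen layer never receives a `.grad`, so the optimizer has nothing to apply to it, while the head is updated as usual.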

The same exclusionary functionality is available as a context manager in `torch.no_grad()`.
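As a brief illustration of the context manager, the sketch below computes the same expression inside and outside `torch.no_grad()`. The tensor names are hypothetical.

```python
import torch

x = torch.ones(2, requires_grad=True)

with torch.no_grad():
    y = x * 2      # computed without recording the operation for autograd

z = x * 2          # computed normally, so the operation is recorded

print(y.requires_grad, y.grad_fn)
print(z.requires_grad)
```

Inside the block the result does not require gradients and has no `grad_fn`, even though `x` itself still has `requires_grad=True`.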