
In [1]:
# For tips on running notebooks in Google Colab, see
# https://docs.pytorch.org/tutorials/beginner/colab
%matplotlib inline

A Gentle Introduction to `torch.autograd`
=========================================

`torch.autograd` is PyTorch's automatic differentiation engine that
powers neural network training. In this section, you will get a
conceptual understanding of how autograd helps a neural network train.

Background
----------

Neural networks (NNs) are a collection of nested functions that are
executed on some input data. These functions are defined by *parameters*
(consisting of weights and biases), which in PyTorch are stored in
tensors.

Training a NN happens in two steps:

**Forward Propagation**: In forward prop, the NN makes its best guess
about the correct output. It runs the input data through each of its
functions to make this guess.

**Backward Propagation**: In backprop, the NN adjusts its parameters
proportionate to the error in its guess. It does this by traversing
backwards from the output, collecting the derivatives of the error with
respect to the parameters of the functions (*gradients*), and optimizing
the parameters using gradient descent. For a more detailed walkthrough
of backprop, check out this [video from
3Blue1Brown](https://www.youtube.com/watch?v=tIeHLnjs5U8).
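
To make the two steps concrete, here is a minimal sketch of one forward pass, one backward pass, and a hand-written gradient descent update on a toy one-parameter model. The names `w`, `x`, `target`, and `lr` and their values are assumptions chosen for illustration; they are not part of the resnet18 example that follows.

```python
import torch

# Toy "network": a single weight w, with prediction = w * x
w = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(2.0)
target = torch.tensor(6.0)
lr = 0.1  # learning rate (assumed value for illustration)

# Forward propagation: make a guess and measure the error
prediction = w * x
loss = (prediction - target) ** 2
initial_loss = loss.item()

# Backward propagation: compute d(loss)/dw and store it in w.grad
loss.backward()
grad_w = w.grad.item()  # analytically 2 * (w*x - target) * x = -16

# Gradient descent: move the parameter against the gradient.
# no_grad() keeps the update itself out of the autograd graph.
with torch.no_grad():
    w -= lr * w.grad

new_loss = ((w * x - target) ** 2).item()
print(grad_w, w.item(), initial_loss, new_loss)
```

After the update, the loss on the same input is smaller than before, which is the whole point of a gradient descent step.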

Usage in PyTorch
----------------

Let's take a look at a single training step. For this example, we load
a pretrained resnet18 model from `torchvision`. We create a random data
tensor to represent a single image with 3 channels and a height and width
of 64, and a corresponding `labels` tensor initialized to random values.
The pretrained model outputs one score per ImageNet class, so `labels`
has shape (1, 1000).

<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>

<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<p>This tutorial works only on the CPU and will not work on GPU devices (even if tensors are moved to CUDA).</p>

</div>



In [2]:
import torch
from torchvision.models import resnet18, ResNet18_Weights
model = resnet18(weights=ResNet18_Weights.DEFAULT)
data = torch.rand(1, 3, 64, 64)
labels = torch.rand(1, 1000)

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth


100%|██████████| 44.7M/44.7M [00:00<00:00, 125MB/s]


In [4]:
print(model)



ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

In [5]:
print(data)


tensor([[[[0.6064, 0.7566, 0.7164,  ..., 0.2291, 0.2338, 0.9033],
          [0.3737, 0.9530, 0.4079,  ..., 0.2948, 0.0152, 0.8009],
          [0.5868, 0.9241, 0.6795,  ..., 0.3566, 0.8000, 0.6773],
          ...,
          [0.6289, 0.9196, 0.9750,  ..., 0.3969, 0.3064, 0.9416],
          [0.7567, 0.6302, 0.6577,  ..., 0.7207, 0.3188, 0.6220],
          [0.3605, 0.0696, 0.9460,  ..., 0.9647, 0.8787, 0.7182]],

         [[0.7879, 0.5128, 0.6273,  ..., 0.0077, 0.4611, 0.8532],
          [0.2390, 0.1883, 0.9053,  ..., 0.9471, 0.2342, 0.1629],
          [0.1641, 0.3597, 0.6280,  ..., 0.7551, 0.2479, 0.7240],
          ...,
          [0.1258, 0.9277, 0.6533,  ..., 0.7723, 0.2672, 0.8815],
          [0.3650, 0.7117, 0.4573,  ..., 0.1747, 0.3890, 0.7111],
          [0.9609, 0.0476, 0.1487,  ..., 0.3277, 0.1957, 0.9579]],

         [[0.1383, 0.2249, 0.9418,  ..., 0.4493, 0.3834, 0.2009],
          [0.2985, 0.3353, 0.1312,  ..., 0.9082, 0.1380, 0.6555],
          [0.7994, 0.8242, 0.2147,  ..., 0

In [6]:
print(labels)

tensor([[5.4359e-01, 5.2782e-01, 4.3972e-01, 3.7217e-01, 9.5360e-01, 7.3617e-01,
         5.9428e-01, 1.8654e-01, 3.3698e-01, 6.9070e-01, 5.5069e-01, 6.0084e-01,
         7.7408e-01, 7.9019e-01, 4.8164e-01, 4.3611e-01, 6.7691e-01, 1.3822e-01,
         5.3003e-02, 3.8807e-01, 5.4047e-02, 6.2502e-01, 6.1881e-01, 1.1707e-01,
         4.5454e-01, 8.3824e-01, 4.4360e-01, 2.1732e-03, 7.9705e-01, 7.2369e-01,
         4.5937e-01, 5.8701e-01, 2.6559e-01, 1.3350e-02, 8.7201e-01, 2.0968e-01,
         8.6884e-01, 8.1964e-01, 4.1572e-01, 5.9808e-01, 2.2318e-01, 3.3777e-01,
         7.1876e-01, 7.1240e-01, 1.2244e-01, 1.1079e-01, 4.3274e-01, 8.1832e-01,
         3.7218e-01, 5.1647e-01, 4.9659e-01, 3.9252e-01, 9.2870e-01, 1.3825e-01,
         6.5420e-01, 9.5362e-01, 3.1945e-01, 4.1297e-01, 6.9258e-01, 9.5552e-02,
         8.7889e-01, 4.9736e-01, 8.5048e-01, 5.5419e-01, 5.9285e-01, 5.4527e-01,
         4.0390e-01, 9.7109e-01, 5.7211e-01, 7.1151e-02, 5.6616e-01, 7.5649e-01,
         1.2708e-01, 8.5312e

Next, we run the input data through each of the model's layers to make
a prediction. This is the **forward pass**.


In [7]:
prediction = model(data) # forward pass

In [8]:
print(prediction)

tensor([[-7.1721e-01, -8.0317e-01, -4.6254e-01, -1.3953e+00, -6.5891e-01,
         -3.8245e-02, -3.0617e-01,  2.9008e-01,  1.9753e-01, -7.7796e-01,
         -9.2397e-01, -8.4715e-01, -1.1115e-01, -5.9057e-01, -1.1899e+00,
         -3.2618e-01, -6.3384e-01, -1.2378e-01, -4.2011e-01, -1.0706e-01,
         -1.4644e+00, -5.8291e-01, -1.2522e+00,  1.4706e-01, -6.2728e-01,
         -1.3277e+00, -9.5223e-01, -1.1591e+00, -1.0244e+00, -3.7779e-01,
         -1.1520e+00, -8.0115e-01, -6.4117e-01, -6.1042e-01, -3.7982e-01,
         -3.6364e-01,  5.6635e-01, -8.1175e-01, -4.4226e-01,  2.2278e-01,
         -4.6674e-01, -6.3391e-01, -1.0514e+00, -2.3385e-01, -4.1905e-01,
         -7.0135e-01, -6.8939e-01, -2.8507e-01, -1.2978e+00, -1.2454e+00,
         -6.3175e-01,  5.8784e-01, -2.6600e-01, -6.7488e-01, -2.3692e-01,
         -9.4356e-01, -4.2503e-01, -1.5013e+00, -9.5428e-01, -2.2992e-01,
          6.6360e-01,  1.0867e-01, -2.8330e-01,  2.7000e-02, -5.8279e-01,
         -5.0296e-01, -4.3025e-01, -4.

In [11]:
print(prediction.dtype)

torch.float32


In [12]:
print(prediction.shape)

torch.Size([1, 1000])


We use the model's prediction and the corresponding label to calculate
the error (`loss`). The next step is to backpropagate this error through
the network. Backward propagation is kicked off when we call
`.backward()` on the error tensor. Autograd then calculates and stores
the gradients for each model parameter in the parameter's `.grad`
attribute.


In [9]:
loss = (prediction - labels).sum()
loss.backward() # backward pass

In [10]:
print(loss)

tensor(-485.6462, grad_fn=<SumBackward0>)


In [13]:
print(loss.dtype)
print(loss.shape)

torch.float32
torch.Size([])


Next, we load an optimizer, in this case SGD with a learning rate of
0.01 and
[momentum](https://medium.com/data-science/stochastic-gradient-descent-with-momentum-a84097641a5d)
of 0.9. We register all the parameters of the model in the optimizer.


In [14]:
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

In [15]:
print(optim)

SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    fused: None
    lr: 0.01
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)


Finally, we call `.step()` to initiate gradient descent. The optimizer
adjusts each parameter by its gradient stored in `.grad`.


In [16]:
optim.step() #gradient descent

At this point, you have everything you need to train your neural
network. The sections below detail the workings of autograd; feel free
to skip them.
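
The single training step above extends naturally to a loop. The sketch below is not the resnet18 example: it assumes a small stand-in `nn.Linear` model, random data, and an MSE loss so that it runs quickly. It also adds `optimizer.zero_grad()`, which matters once steps are repeated because `.backward()` adds to whatever is already stored in `.grad`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in model and data (assumptions for illustration only)
model = nn.Linear(4, 1)
data = torch.rand(16, 4)
labels = data.sum(dim=1, keepdim=True)  # a target a linear model can fit

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
loss_fn = nn.MSELoss()

losses = []
for step in range(200):
    optimizer.zero_grad()               # clear gradients from the previous step
    prediction = model(data)            # forward pass
    loss = loss_fn(prediction, labels)  # scalar error
    loss.backward()                     # backward pass fills each parameter's .grad
    optimizer.step()                    # gradient descent update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```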


------------------------------------------------------------------------


Differentiation in Autograd
===========================

Let's take a look at how `autograd` collects gradients. We create two
tensors `a` and `b` with `requires_grad=True`. This signals to
`autograd` that every operation on them should be tracked.


In [17]:
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

In [18]:
print(a)

tensor([2., 3.], requires_grad=True)


In [19]:
print(b)

tensor([6., 4.], requires_grad=True)


We create another tensor `Q` from `a` and `b`.

$$Q = 3a^3 - b^2$$


In [22]:
Q = 3*a**3 - b**2

In [21]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [23]:
Q

tensor([-12.,  65.], grad_fn=<SubBackward0>)

Let's assume `a` and `b` to be parameters of an NN, and `Q` to be the
error. In NN training, we want gradients of the error w.r.t. parameters,
i.e.

$$\frac{\partial Q}{\partial a} = 9a^2$$

$$\frac{\partial Q}{\partial b} = -2b$$

When we call `.backward()` on `Q`, autograd calculates these gradients
and stores them in the respective tensors' `.grad` attribute.

We need to explicitly pass a `gradient` argument in `Q.backward()`
because it is a vector. `gradient` is a tensor of the same shape as `Q`,
and it represents the gradient of Q w.r.t. itself, i.e.

$$\frac{dQ}{dQ} = 1$$

Equivalently, we can also aggregate Q into a scalar and call backward
implicitly, like `Q.sum().backward()`.


In [24]:
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

Gradients are now deposited in `a.grad` and `b.grad`.


In [25]:
# check if collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([True, True])
tensor([True, True])


In [26]:
print(a.grad)

tensor([36., 81.])
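
The equivalence between passing an explicit `gradient` of ones and calling `Q.sum().backward()` can be checked directly. The sketch below rebuilds `a`, `b`, and `Q` inside two hypothetical helper functions (their names are illustrative) so that each approach starts from fresh, empty `.grad` attributes.

```python
import torch

def grads_with_explicit_vector():
    a = torch.tensor([2., 3.], requires_grad=True)
    b = torch.tensor([6., 4.], requires_grad=True)
    Q = 3 * a**3 - b**2
    Q.backward(gradient=torch.ones_like(Q))  # dQ/dQ = 1 for every element
    return a.grad, b.grad

def grads_with_sum():
    a = torch.tensor([2., 3.], requires_grad=True)
    b = torch.tensor([6., 4.], requires_grad=True)
    Q = 3 * a**3 - b**2
    Q.sum().backward()  # scalar output, so no gradient argument is needed
    return a.grad, b.grad

ga1, gb1 = grads_with_explicit_vector()
ga2, gb2 = grads_with_sum()
print(ga1, gb1)
print(torch.equal(ga1, ga2), torch.equal(gb1, gb2))
```

Both routes give $9a^2$ for `a` and $-2b$ for `b`, because summing `Q` weights every element of `Q` by 1, exactly as the vector of ones does.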


In [27]:

import torch
from torchvision.models import resnet18, ResNet18_Weights


def _shape(x):
    """Turn a forward output into a readable shape (handles Tensor / tuple / list / dict / None)."""
    if torch.is_tensor(x):
        return tuple(x.shape)
    if isinstance(x, (list, tuple)):
        # Show only the first element's shape to keep the output short
        if len(x) == 0:
            return "[]"
        return f"{type(x).__name__}[0]:" + str(_shape(x[0]))
    if isinstance(x, dict):
        if len(x) == 0:
            return "{}"
        k = next(iter(x.keys()))
        return f"dict['{k}']:" + str(_shape(x[k]))
    return str(type(x).__name__)


def _describe_module(m: torch.nn.Module) -> str:
    """Summarize the key hyperparameters of common layers in one line."""
    import torch.nn as nn

    if isinstance(m, nn.Conv2d):
        return (
            f"Conv2d({m.in_channels}→{m.out_channels}, "
            f"k={tuple(m.kernel_size)}, s={tuple(m.stride)}, p={tuple(m.padding)}, "
            f"bias={m.bias is not None})"
        )
    if isinstance(m, nn.BatchNorm2d):
        return (
            f"BatchNorm2d({m.num_features}, eps={m.eps}, momentum={m.momentum}, "
            f"affine={m.affine}, track_running_stats={m.track_running_stats})"
        )
    if isinstance(m, nn.ReLU):
        return f"ReLU(inplace={m.inplace})"
    if isinstance(m, nn.MaxPool2d):
        return (
            f"MaxPool2d(k={m.kernel_size}, s={m.stride}, p={m.padding}, "
            f"dilation={m.dilation}, ceil_mode={m.ceil_mode})"
        )
    if isinstance(m, nn.AdaptiveAvgPool2d):
        return f"AdaptiveAvgPool2d(output_size={m.output_size})"
    if isinstance(m, nn.Linear):
        return f"Linear({m.in_features}→{m.out_features}, bias={m.bias is not None})"
    if isinstance(m, nn.Sequential):
        return f"Sequential(n={len(m)})"

    # ResNet-18's residual block type (torchvision.models.resnet.BasicBlock)
    if m.__class__.__name__ == "BasicBlock":
        # Common BasicBlock fields: stride / downsample (some versions also have dilation, etc.)
        stride = getattr(m, "stride", None)
        downsample = getattr(m, "downsample", None)
        has_downsample = downsample is not None
        return f"BasicBlock(stride={stride}, downsample={has_downsample})"

    return m.__class__.__name__


def _annotation(path: str) -> str:
    """Return a short note for a well-known layer path (extend as needed).

    Top-level modules and modules inside a BasicBlock share names such as
    conv1 / bn1 / relu, so the two cases are looked up separately.
    """
    top_level = {
        "conv1": "Stem: first 7×7 convolution (downsamples, widens channels)",
        "bn1": "Stem: BatchNorm (stabilizes training/inference)",
        "relu": "Stem: ReLU non-linearity",
        "maxpool": "Stem: max pooling (downsamples again)",
        "layer1": "Stage 1: stacked residual blocks (no downsampling, 64 channels)",
        "layer2": "Stage 2: stacked residual blocks (first block downsamples, 128 channels)",
        "layer3": "Stage 3: stacked residual blocks (first block downsamples, 256 channels)",
        "layer4": "Stage 4: stacked residual blocks (first block downsamples, 512 channels)",
        "avgpool": "Head: adaptive global average pooling to 1×1",
        "fc": "Head: fully connected classifier (1000 ImageNet classes by default)",
    }
    in_block = {
        "conv1": "Residual main branch: first 3×3 convolution",
        "bn1": "Residual main branch: first BatchNorm",
        "relu": "Residual main branch: ReLU non-linearity",
        "conv2": "Residual main branch: second 3×3 convolution",
        "bn2": "Residual main branch: second BatchNorm",
        "downsample": "Residual shortcut: 1×1 conv/BN to match shapes before the addition",
    }
    if "." not in path:
        return top_level.get(path, "")
    return in_block.get(path.split(".")[-1], "")


def register_shape_hooks(model: torch.nn.Module):
    """
    Register forward hooks on key modules to record each one's output shape:
    - recorded: stem (each layer) / layer1-4 (as a whole) / every BasicBlock / head
    """
    shapes = {}

    def _hook(name):
        def fn(_m, _inp, out):
            shapes[name] = _shape(out)
        return fn

    # The structural nodes we care about
    target_names = [
        "conv1", "bn1", "relu", "maxpool",
        "layer1", "layer2", "layer3", "layer4",
        "avgpool", "fc",
        # The two blocks of each stage (ResNet-18 always has 2)
        "layer1.0", "layer1.1",
        "layer2.0", "layer2.1",
        "layer3.0", "layer3.1",
        "layer4.0", "layer4.1",
    ]

    # Attach a hook to each target submodule
    for n in target_names:
        m = model.get_submodule(n)
        m.register_forward_hook(_hook(n))

    return shapes


def print_tree(model: torch.nn.Module, shapes: dict, input_shape):
    """
    Structured printout:
    - top-level information
    - Stem
    - 4 stages (with block-level details)
    - Head
    """
    print("=" * 88)
    print("Model:", model.__class__.__name__, "(torchvision ResNet-18)")
    print("Input:", input_shape, "  # e.g. (1, 3, 64, 64)")
    print("=" * 88)

    def line(path, module, indent=0):
        desc = _describe_module(module)
        ann = _annotation(path)
        shp = shapes.get(path, None)
        shp_s = f" | out={shp}" if shp is not None else ""
        ann_s = f"  # {ann}" if ann else ""
        print("  " * indent + f"- {path}: {desc}{shp_s}{ann_s}")

    # --- Stem ---
    print("\n[Stem]  # Feature-extraction entry (fast downsampling + channel expansion)")
    for n in ["conv1", "bn1", "relu", "maxpool"]:
        line(n, model.get_submodule(n), indent=0)

    # --- Stages ---
    for stage in ["layer1", "layer2", "layer3", "layer4"]:
        print(f"\n[{stage}]  # { _annotation(stage) }")
        line(stage, model.get_submodule(stage), indent=0)

        # Blocks of each stage (ResNet-18: 2 per stage)
        for i in [0, 1]:
            bname = f"{stage}.{i}"
            block = model.get_submodule(bname)
            line(bname, block, indent=1)

            # Block internals (fixed fields: conv1/bn1/relu/conv2/bn2 + optional downsample)
            for sub in ["conv1", "bn1", "relu", "conv2", "bn2"]:
                if hasattr(block, sub):
                    line(f"{bname}.{sub}", getattr(block, sub), indent=2)
            if getattr(block, "downsample", None) is not None:
                # downsample is a Sequential, usually a 1×1 conv followed by BN
                ds = block.downsample
                line(f"{bname}.downsample", ds, indent=2)
                for j, child in enumerate(ds):
                    line(f"{bname}.downsample.{j}", child, indent=3)

    # --- Head ---
    print("\n[Head]  # Pooling + classifier")
    for n in ["avgpool", "fc"]:
        line(n, model.get_submodule(n), indent=0)

    print("\nNote: out=... is the output shape recorded during one forward pass (it depends on the input size).")
    print("=" * 88)


def main():
    # 1) Build the model (ImageNet pretrained weights)
    model = resnet18(weights=ResNet18_Weights.DEFAULT)

    # 2) Build an input (example: 1×3×64×64)
    data = torch.rand(1, 3, 64, 64)

    # 3) Register hooks to record each layer's output shape
    shapes = register_shape_hooks(model)

    # 4) Run one forward pass (eval + no_grad: no BN statistics update, no autograd graph)
    model.eval()
    with torch.no_grad():
        _ = model(data)

    # 5) Structured printout (with annotations and output shapes)
    print_tree(model, shapes, input_shape=tuple(data.shape))


if __name__ == "__main__":
    main()


Model: ResNet (torchvision ResNet-18)
Input: (1, 3, 64, 64)   # e.g. (1, 3, 64, 64)

[Stem]  # Feature-extraction entry (fast downsampling + channel expansion)
- conv1: Conv2d(3→64, k=(7, 7), s=(2, 2), p=(3, 3), bias=False) | out=(1, 64, 32, 32)  # Stem: first 7×7 convolution (downsamples, widens channels)
- bn1: BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) | out=(1, 64, 32, 32)  # Stem: BatchNorm (stabilizes training/inference)
- relu: ReLU(inplace=True) | out=(1, 64, 32, 32)  # Stem: ReLU non-linearity
- maxpool: MaxPool2d(k=3, s=2, p=1, dilation=1, ceil_mode=False) | out=(1, 64, 16, 16)  # Stem: max pooling (downsamples again)

[layer1]  # Stage 1: stacked residual blocks (no downsampling, 64 channels)
- layer1: Sequential(n=2) | out=(1, 64, 16, 16)  # Stage 1: stacked residual blocks (no downsampling, 64 channels)
  - layer1.0: BasicBlock(stride=1, downsample=False) | out=(1, 64, 16, 16)
    - layer1.0.conv1: Conv2d(64→64, k=(3, 3), s=(1, 1), p=(1, 1), bias=False)  # Residual main branch: first 3×3 convolution
    - layer1.0.bn1: BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)  # Residual main branch: first BatchNorm
    - layer1.0.relu: ReLU(inplace=True)  

Optional Reading - Vector Calculus using `autograd`
===================================================

Mathematically, if you have a vector valued function
$\vec{y}=f(\vec{x})$, then the gradient of $\vec{y}$ with respect to
$\vec{x}$ is a Jacobian matrix $J$:

$$\begin{aligned}
J
=
 \left(\begin{array}{ccc}
 \frac{\partial \mathbf{y}}{\partial x_{1}} &
 \cdots &
 \frac{\partial \mathbf{y}}{\partial x_{n}}
 \end{array}\right)
=
\left(\begin{array}{ccc}
 \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
 \vdots & \ddots & \vdots\\
 \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
 \end{array}\right)
\end{aligned}$$

Generally speaking, `torch.autograd` is an engine for computing
vector-Jacobian products. That is, given any vector $\vec{v}$, it
computes the product $J^{T}\cdot \vec{v}$.

If $\vec{v}$ happens to be the gradient of a scalar function
$l=g\left(\vec{y}\right)$:

$$\vec{v}
 =
 \left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}$$

then by the chain rule, the vector-Jacobian product would be the
gradient of $l$ with respect to $\vec{x}$:

$$\begin{aligned}
J^{T}\cdot \vec{v} = \left(\begin{array}{ccc}
 \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
 \vdots & \ddots & \vdots\\
 \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
 \end{array}\right)\left(\begin{array}{c}
 \frac{\partial l}{\partial y_{1}}\\
 \vdots\\
 \frac{\partial l}{\partial y_{m}}
 \end{array}\right) = \left(\begin{array}{c}
 \frac{\partial l}{\partial x_{1}}\\
 \vdots\\
 \frac{\partial l}{\partial x_{n}}
 \end{array}\right)
\end{aligned}$$

This property of the vector-Jacobian product is what we use in the
example above; `external_grad` represents $\vec{v}$.
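
The relationship between the Jacobian and what `.backward()` stores can be checked numerically. The sketch below assumes a small hand-picked function `f` from $\mathbb{R}^3$ to $\mathbb{R}^2$ and an arbitrary vector `v` (both are illustrative choices). It builds the full Jacobian with `torch.autograd.functional.jacobian` and compares $J^{T}\cdot \vec{v}$ with the gradient produced by `y.backward(gradient=v)`.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # A vector-valued function R^3 -> R^2 (chosen only for illustration)
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1., 2., 3.], requires_grad=True)
v = torch.tensor([0.5, -1.0])  # plays the role of dl/dy

# Full Jacobian J with shape (m, n) = (2, 3)
J = jacobian(f, x.detach())

# Vector-Jacobian product via backward(): J^T v ends up in x.grad
y = f(x)
y.backward(gradient=v)

print(J)
print(x.grad)
print(torch.allclose(J.T @ v, x.grad))
```

Note that `.backward()` never materializes `J`; it only needs the product, which is why it scales to functions with many inputs and outputs.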


Computational Graph
===================

Conceptually, autograd keeps a record of data (tensors) & all executed
operations (along with the resulting new tensors) in a directed acyclic
graph (DAG) consisting of
[Function](https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)
objects. In this DAG, leaves are the input tensors, roots are the output
tensors. By tracing this graph from roots to leaves, you can
automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

-   run the requested operation to compute a resulting tensor, and
-   maintain the operation's *gradient function* in the DAG.

The backward pass kicks off when `.backward()` is called on the DAG
root. `autograd` then:

-   computes the gradients from each `.grad_fn`,
-   accumulates them in the respective tensor's `.grad` attribute, and
-   using the chain rule, propagates all the way to the leaf tensors.
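
The word "accumulates" in the list above is literal: a second backward pass adds to the existing `.grad` rather than overwriting it. A minimal sketch, reusing the values of `a` from the earlier example:

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)

(3 * a**3).sum().backward()
first = a.grad.clone()        # 9 * a**2

(3 * a**3).sum().backward()   # a second backward pass adds to a.grad
second = a.grad.clone()       # twice the first result

a.grad = None                 # reset by hand; optimizers offer zero_grad() for parameters
print(first, second)
```

This is why training loops clear gradients before each new backward pass.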

Below is a visual representation of the DAG in our example. In the
graph, the arrows are in the direction of the forward pass. The nodes
represent the backward functions of each operation in the forward pass.
The leaf nodes in blue represent our leaf tensors `a` and `b`.

![](https://pytorch.org/tutorials/_static/img/dag_autograd.png)
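
The same graph can be inspected from Python by following each node's `next_functions` attribute, starting at `Q.grad_fn`. This is a rough sketch: the helper `walk` is hypothetical, and the exact node class names (for example `PowBackward0`) may differ between PyTorch versions.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 3 * a**3 - b**2

def walk(fn, depth=0, out=None):
    """Collect backward-node class names, depth-first from the root."""
    if out is None:
        out = []
    if fn is None:  # inputs that need no gradient (e.g. the constant 3)
        return out
    out.append("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, out)
    return out

nodes = walk(Q.grad_fn)
print("\n".join(nodes))
print(a.is_leaf, Q.is_leaf)
```

The `AccumulateGrad` nodes at the bottom of the printout correspond to the leaf tensors `a` and `b`; they are where gradients get written into `.grad`.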

<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>

<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<p>An important thing to note is that the graph is recreated from scratch; after each <code>.backward()</code> call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.</p>

</div>
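
As a small illustration of the note above, the sketch below uses an ordinary Python loop whose length changes between calls, so a different graph is recorded each time. The function `forward` and its `depth` argument are assumptions made up for this example.

```python
import torch

def forward(w, depth):
    y = w
    for _ in range(depth):  # the number of recorded multiplications varies per call
        y = y * w
    return y.sum()

w = torch.tensor([2., 3.], requires_grad=True)
grads = []
for depth in (1, 2):
    w.grad = None              # reset accumulated gradients
    loss = forward(w, depth)   # a fresh graph is built during this forward pass
    loss.backward()
    grads.append(w.grad.clone())

print(grads)
```

With `depth=1` the function is $w^2$ and the gradient is $2w$; with `depth=2` it is $w^3$ and the gradient is $3w^2$. No graph definition had to be changed in between.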

Exclusion from the DAG
----------------------

`torch.autograd` tracks operations on all tensors which have their
`requires_grad` flag set to `True`. Setting this flag to `False` on
tensors that don't require gradients excludes them from the gradient
computation DAG.

The output tensor of an operation will require gradients even if only a
single input tensor has `requires_grad=True`.


In [28]:
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does `a` require gradients?: {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")

Does `a` require gradients?: False
Does `b` require gradients?: True


In a NN, parameters that don't compute gradients are usually called
**frozen parameters**. It is useful to "freeze" part of your model if
you know in advance that you won't need the gradients of those
parameters (this offers some performance benefits by reducing autograd
computations).

In finetuning, we freeze most of the model and typically only modify the
classifier layers to make predictions on new labels. Let's walk through
a small example to demonstrate this. As before, we load a pretrained
resnet18 model, and freeze all the parameters.


In [29]:
from torch import nn, optim

model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False

Let's say we want to finetune the model on a new dataset with 10
labels. In resnet, the classifier is the last linear layer `model.fc`.
We can simply replace it with a new linear layer (unfrozen by default)
that acts as our classifier.


In [30]:
model.fc = nn.Linear(512, 10)

Now all parameters in the model, except the parameters of `model.fc`,
are frozen. The only parameters that compute gradients are the weights
and bias of `model.fc`.


In [31]:
# Optimize only the classifier
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

Notice that although we register all the parameters in the optimizer,
the only parameters that are computing gradients (and hence updated in
gradient descent) are the weights and bias of the classifier.
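
This behavior can be verified on a small scale. To avoid downloading weights, the sketch below assumes a tiny `nn.Sequential` as a stand-in for the pretrained network, with its last layer playing the role of `model.fc`. It applies the same freezing pattern and then checks which parameters actually change after one optimizer step.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Stand-in for a pretrained backbone plus classifier (assumed for illustration)
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

# Freeze everything, then swap in a fresh (unfrozen) final layer
for param in model.parameters():
    param.requires_grad = False
model[2] = nn.Linear(16, 10)

frozen_before = model[0].weight.clone()
head_before = model[2].weight.clone()

# All parameters are registered, as in the text above
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
loss = model(torch.rand(5, 8)).sum()
loss.backward()
optimizer.step()

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
frozen_unchanged = torch.equal(model[0].weight, frozen_before)
head_changed = not torch.equal(model[2].weight, head_before)

print(trainable)
print(model[0].weight.grad)  # None: no gradient was computed for frozen weights
print(frozen_unchanged, head_changed)
```

Frozen parameters keep `.grad` as `None`, so the optimizer has nothing to apply to them and leaves them untouched.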

The same exclusionary functionality is available as a context manager in
[torch.no_grad()](https://pytorch.org/docs/stable/generated/torch.no_grad.html).
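
A minimal sketch of the context manager in use follows. It also shows `Tensor.detach()`, which is not discussed in the text above but is a related way to obtain a tensor that is cut off from the graph.

```python
import torch

z = torch.rand((5, 5), requires_grad=True)

tracked = z * 2              # recorded in the graph
with torch.no_grad():
    untracked = z * 2        # same values, but nothing is recorded

detached = tracked.detach()  # shares data with `tracked`, excluded from the graph

print(tracked.requires_grad, untracked.requires_grad, detached.requires_grad)
```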


------------------------------------------------------------------------


Further readings:
=================

-   [In-place operations & Multithreaded
    Autograd](https://pytorch.org/docs/stable/notes/autograd.html)
-   [Example implementation of reverse-mode
    autodiff](https://colab.research.google.com/drive/1VpeE6UvEPRz9HmsHh1KS0XxXjYu533EC)
-   [Video: PyTorch Autograd Explained - In-depth
    Tutorial](https://www.youtube.com/watch?v=MswxJw-8PvE)
