<a href="https://colab.research.google.com/github/Nishijujuba/python-cookbook-2023-3rd/blob/master/_downloads/c029676472d90691aa145c6fb97a61c3/neural_networks_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# For tips on running notebooks in Google Colab, see
# https://docs.pytorch.org/tutorials/beginner/colab
%matplotlib inline

Neural Networks
===============

Neural networks can be constructed using the `torch.nn` package.

Now that you have had a glimpse of `autograd`, note that `nn` depends on
`autograd` to define models and differentiate them. An `nn.Module` contains
layers and a method `forward(input)` that returns the `output`.

For example, look at this network that classifies digit images:

![convnet](https://pytorch.org/tutorials/_static/img/mnist.png)

It is a simple feed-forward network. It takes the input, feeds it
through several layers one after the other, and then finally gives the
output.

A typical training procedure for a neural network is as follows:

-   Define the neural network that has some learnable parameters (or
    weights)
-   Iterate over a dataset of inputs
-   Process input through the network
-   Compute the loss (how far is the output from being correct)
-   Propagate gradients back into the network's parameters
-   Update the weights of the network, typically using a simple update
    rule: `weight = weight - learning_rate * gradient`
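The update rule in the last step can be seen in isolation on a one-parameter toy problem. The sketch below is plain Python with no PyTorch; the quadratic loss, the starting value, and the learning rate are all made up for illustration.

```python
# Toy problem (hypothetical): minimize loss(w) = (w - 3)**2 with gradient descent.
def loss_fn(w):
    return (w - 3.0) ** 2


def grad_fn(w):
    # Analytic derivative of (w - 3)**2 with respect to w.
    return 2.0 * (w - 3.0)


weight = 0.0          # arbitrary starting point
learning_rate = 0.1   # arbitrary step size

for step in range(50):
    gradient = grad_fn(weight)
    weight = weight - learning_rate * gradient   # the update rule from the list above

print(weight, loss_fn(weight))
```

Each step shrinks the distance to the minimum at 3 by a constant factor, so the weight ends up very close to 3. In a real network, autograd supplies the gradient instead of a hand-written derivative.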

Define the network
------------------

Let's define this network:


In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F


# Subclass nn.Module so that PyTorch can register the parameters and
# provide methods such as .to(device), .state_dict(), and .parameters().
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # Call the parent initializer so nn.Module can set up its internal registration and bookkeeping

        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        # Conv layer 1: 1 input channel (grayscale), 6 output channels, 5x5 kernel
        self.conv2 = nn.Conv2d(6, 16, 5)
        # Conv layer 2: 6 input channels, 16 output channels, 5x5 kernel

        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 5*5 from image dimension
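        # Fully connected layer 1: 16*5*5 = 400 input features, 120 outputs;
        # the 5*5 is the spatial size left after the conv and pooling layers (see forward)
        self.fc2 = nn.Linear(120, 84)
        # Fully connected layer 2: 120 -> 84
        self.fc3 = nn.Linear(84, 10)
        # Output layer: 84 -> 10 (ten digit classes, 0-9)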

    def forward(self, input):
        # Forward pass: describes how the output is computed from the input


        # Convolution layer C1: 1 input image channel, 6 output channels,
        # 5x5 square convolution, it uses RELU activation function, and
        # outputs a Tensor with size (N, 6, 28, 28), where N is the size of the batch
        c1 = F.relu(self.conv1(input))
        # self.conv1: (N, 1, 32, 32) -> (N, 6, 28, 28)
        # because 32 - 5 + 1 = 28 (padding=0, stride=1)
        # F.relu is applied elementwise and does not change the shape

        # Subsampling layer S2: 2x2 grid, purely functional,
        # this layer does not have any parameter, and outputs a (N, 6, 14, 14) Tensor
        s2 = F.max_pool2d(c1, (2, 2))
        # 2x2 max pooling: (N, 6, 28, 28) -> (N, 6, 14, 14)
        # stride defaults to kernel_size (2), so each spatial dimension is halved
        # the pooling layer itself has no learnable parameters

        # Convolution layer C3: 6 input channels, 16 output channels,
        # 5x5 square convolution, it uses RELU activation function, and
        # outputs a (N, 16, 10, 10) Tensor
        c3 = F.relu(self.conv2(s2))
        # self.conv2: (N, 6, 14, 14) -> (N, 16, 10, 10)
        # because 14 - 5 + 1 = 10
        # ReLU does not change the shape


        # Subsampling layer S4: 2x2 grid, purely functional,
        # this layer does not have any parameter, and outputs a (N, 16, 5, 5) Tensor
        s4 = F.max_pool2d(c3, 2)
        # pooling window of 2: (N, 16, 10, 10) -> (N, 16, 5, 5)


        # Flatten operation: purely functional, outputs a (N, 400) Tensor
        s4 = torch.flatten(s4, 1)
        # Flatten from dimension 1 onward, keeping the batch dimension N
        # (N, 16, 5, 5) -> (N, 16*5*5) = (N, 400)


        # Fully connected layer F5: (N, 400) Tensor input,
        # and outputs a (N, 120) Tensor, it uses RELU activation function
        f5 = F.relu(self.fc1(s4))
        # fc1: (N, 400) -> (N, 120), followed by ReLU

        # Fully connected layer F6: (N, 120) Tensor input,
        # and outputs a (N, 84) Tensor, it uses RELU activation function
        f6 = F.relu(self.fc2(f5))
        # fc2: (N, 120) -> (N, 84), followed by ReLU

        # Fully connected layer OUTPUT: (N, 84) Tensor input, and
        # outputs a (N, 10) Tensor
        output = self.fc3(f6)
        # fc3: (N, 84) -> (N, 10)
        # these values are usually called logits (no softmax has been applied)

        return output


net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


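In the printout, `Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))` is the configuration of `conv1` (5x5 kernel, stride 1).

`Linear(in_features=400, out_features=120, bias=True)` says that `fc1` takes 400 input features and produces 120 outputs, with a bias term.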
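The `16 * 5 * 5` input size of `fc1` follows from the layer shapes. As a rough check, the arithmetic can be written out with the standard output-size formula for an unpadded, undilated convolution or pooling window, `floor((size - kernel) / stride) + 1`. The helper name `out_size` is made up for this sketch.

```python
def out_size(size, kernel, stride=1):
    # Output length along one spatial dimension with no padding and no dilation.
    return (size - kernel) // stride + 1


h = 32                                # expected input height/width
h = out_size(h, kernel=5)             # conv1: 32 -> 28
h = out_size(h, kernel=2, stride=2)   # 2x2 max pool: 28 -> 14
h = out_size(h, kernel=5)             # conv2: 14 -> 10
h = out_size(h, kernel=2, stride=2)   # 2x2 max pool: 10 -> 5

fc1_in_features = 16 * h * h          # 16 channels come out of conv2
print(h, fc1_in_features)             # 5 400
```

The same arithmetic explains why the network expects 32x32 inputs: a different input size would leave a different spatial size before `fc1`, and the hard-coded 400 would no longer match.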


You just have to define the `forward` function, and the `backward`
function (where gradients are computed) is automatically defined for you
using `autograd`. You can use any of the Tensor operations in the
`forward` function.

The learnable parameters of a model are returned by `net.parameters()`:


In [3]:
params = list(net.parameters())
# net.parameters() returns an iterator over all learnable parameters (weights and biases)
# list(...) is only there so we can call len() and index into the result
print(len(params))
# Number of parameter tensors: 10 here, since each of the 5 layers
# (conv1, conv2, fc1, fc2, fc3) has a weight and a bias

print(params[0].size())  # conv1's .weight
# params[0] is conv1.weight, with shape [out_channels, in_channels, kH, kW]:
# 6 output channels, 1 input channel, 5x5 kernel


10
torch.Size([6, 1, 5, 5])
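Indexing into `list(net.parameters())` works, but it relies on knowing the order of the parameters. `nn.Module` also provides `named_parameters()`, which yields each parameter with its name. The sketch below uses a small stand-in model (an assumption, chosen to keep the example short); the same calls should work on `net`.

```python
import torch.nn as nn

# Small stand-in model (hypothetical); not the network defined above.
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

# Each entry is (name, parameter); ReLU has no parameters, so index 1 is skipped.
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)

# Total number of learnable scalars: (4*3 + 3) + (3*2 + 2) = 23
total = sum(p.numel() for p in model.parameters())
print(total)
```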


In [4]:
print(params)

[Parameter containing:
tensor([[[[ 1.8127e-01, -1.3200e-01,  1.9884e-01, -7.7197e-02,  4.1498e-02],
          [-5.2222e-02, -1.2963e-01, -1.1134e-01, -1.0487e-01, -1.0360e-01],
          [-1.7946e-01, -1.8686e-01, -1.9981e-01, -1.2209e-01,  1.4376e-01],
          [-1.0527e-01, -1.5708e-01,  3.4261e-02, -3.5365e-02, -1.5402e-01],
          [-9.5600e-02,  1.1518e-01,  3.4566e-02, -4.9534e-02,  6.7677e-02]]],


        [[[ 3.9205e-02,  1.2131e-01, -1.7216e-01, -1.3520e-01, -4.9865e-02],
          [ 8.7491e-02, -1.3624e-01,  1.3771e-01, -1.3387e-01,  1.6757e-01],
          [ 1.7481e-01, -1.3455e-01,  1.4780e-01,  1.1507e-01, -7.1423e-02],
          [-5.3798e-02, -1.5582e-01,  9.0688e-02, -9.6122e-02,  1.0218e-01],
          [-1.6819e-01, -1.6567e-01, -1.3055e-01, -1.7472e-01,  1.7996e-01]]],


        [[[ 1.7901e-01,  2.1007e-02,  1.6716e-01, -4.2629e-02,  1.3756e-01],
          [-4.0931e-02, -7.3204e-02, -1.7830e-01,  9.7799e-02, -1.4004e-01],
          [ 9.2205e-02, -8.2522e-02, -1.8889e

Let's try a random 32x32 input. Note: the expected input size of this net
(LeNet) is 32x32. To use this net on the MNIST dataset, please resize
the images from the dataset to 32x32.
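MNIST images are 28x28, so some preprocessing is needed before they fit this network. The sketch below shows two possible ways to get to 32x32 on a random stand-in batch: interpolation and zero-padding. Which one to use is a judgment call that the text does not settle; the tensor names are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Stand-in for a batch of MNIST images: N=4, 1 channel, 28x28 (random values here).
batch_28 = torch.rand(4, 1, 28, 28)

# Option 1: bilinear resize to 32x32.
resized = F.interpolate(batch_28, size=(32, 32), mode="bilinear", align_corners=False)

# Option 2: zero-pad 2 pixels on each side (left, right, top, bottom).
padded = F.pad(batch_28, (2, 2, 2, 2))

print(resized.shape, padded.shape)
```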


In [5]:
input = torch.randn(1, 1, 32, 32)
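# A random batch: N=1, 1 channel, 32x32
# randn samples from the standard normal distribution

out = net(input)
# Forward pass: produces a (1, 10) output, one score per class

print(out)
# grad_fn=<AddmmBackward0> in the printout shows that the output is part of the
# autograd graph and can be backpropagated through. The final linear layer is an
# add plus a matrix multiply (addmm), which is where the node's name comes from.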

tensor([[ 0.1554, -0.0728, -0.0223, -0.0758, -0.0987, -0.0030, -0.0877, -0.0198,
         -0.0635, -0.1065]], grad_fn=<AddmmBackward0>)


Zero the gradient buffers of all parameters and backprop with random
gradients:


In [6]:
net.zero_grad()
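# Clear the .grad of every parameter.
# This matters because PyTorch accumulates gradients by default; without clearing, they add up.
out.backward(torch.randn(1, 10))
# Calling backward on a non-scalar tensor such as out requires an "upstream gradient" of the same shape.
# Here a random (1, 10) tensor stands in for dL/dout, just to show that backpropagation runs.
# Afterwards, the .grad of each parameter is populated.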
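Passing a tensor to `backward()` can look arbitrary, so here is a small sketch of what it does. The argument acts as a weight on each output element: calling `y.backward(v)` should give the same gradient as forming the scalar `(y * v).sum()` and calling `backward()` on that. The tensors `x` and `v` are invented for this example.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.1, 1.0, 10.0])   # plays the role of the upstream gradient dL/dy

y = x ** 2
y.backward(v)                        # vector-Jacobian product: v * dy/dx = v * 2x
grad_from_vector = x.grad.clone()

x.grad = None                        # reset before the second computation
y = x ** 2
(y * v).sum().backward()             # the same quantity expressed through a scalar
grad_from_scalar = x.grad.clone()

print(grad_from_vector, grad_from_scalar)
```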

<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>

<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<p><code>torch.nn</code> only supports mini-batches. The entire <code>torch.nn</code> package only supports inputs that are a mini-batch of samples, and not a single sample. For example, <code>nn.Conv2d</code> will take in a 4D Tensor of <code>nSamples x nChannels x Height x Width</code>. If you have a single sample, just use <code>input.unsqueeze(0)</code> to add a fake batch dimension.</p>

</div>
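As a minimal sketch of the `unsqueeze(0)` advice in the note, the snippet below adds a batch dimension to a single random image before passing it to a convolution. The variable names are made up for illustration.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 6, 5)

single = torch.randn(1, 32, 32)     # one sample: channels x height x width
batched = single.unsqueeze(0)       # add a batch dimension of size 1 at the front
print(batched.shape)                # torch.Size([1, 1, 32, 32])

out = conv(batched)
print(out.shape)                    # torch.Size([1, 6, 28, 28])
```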

Before proceeding further, let's recap all the classes you've seen so
far.

**Recap:**

:   -   `torch.Tensor` - A *multi-dimensional array* with support for
        autograd operations like `backward()`. Also *holds the gradient*
        w.r.t. the tensor.
    -   `nn.Module` - Neural network module. *Convenient way of
        encapsulating parameters*, with helpers for moving them to GPU,
        exporting, loading, etc.
    -   `nn.Parameter` - A kind of Tensor, that is *automatically
        registered as a parameter when assigned as an attribute to a*
        `Module`.
    -   `autograd.Function` - Implements *forward and backward
        definitions of an autograd operation*. Every `Tensor` operation
        creates at least a single `Function` node that connects to
        functions that created a `Tensor` and *encodes its history*.

**At this point, we covered:**

:   -   Defining a neural network
    -   Processing inputs and calling backward

**Still Left:**

:   -   Computing the loss
    -   Updating the weights of the network

Loss Function
=============

A loss function takes the (output, target) pair of inputs, and computes
a value that estimates how far away the output is from the target.

There are several different [loss
functions](https://pytorch.org/docs/nn.html#loss-functions) under the nn
package. A simple loss is `nn.MSELoss`, which computes the mean-squared
error between the output and the target.

For example:


In [7]:
output = net(input)
# Run the forward pass again to get a (1, 10) output
target = torch.randn(10)  # a dummy target, for example
# a random stand-in target with shape (10,)
target = target.view(1, -1)  # make it the same shape as output
# reshaped to (1, 10) to line up with output; -1 lets view infer that dimension

criterion = nn.MSELoss()
# mean-squared-error loss: mean((output - target)^2)

loss = criterion(output, target)
# the loss is a scalar Tensor

print(loss)

tensor(3.2424, grad_fn=<MseLossBackward0>)
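To make the definition concrete, the mean-squared error can also be computed by hand and compared with `nn.MSELoss` under its default mean reduction. The tensors `pred` and `tgt` below are random stand-ins, not the network output above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pred = torch.randn(1, 10)   # stand-in for a network output
tgt = torch.randn(1, 10)    # stand-in for a target

builtin = nn.MSELoss()(pred, tgt)
by_hand = ((pred - tgt) ** 2).mean()   # average of the squared differences

print(builtin.item(), by_hand.item())
```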


Now, if you follow `loss` in the backward direction, using its
`.grad_fn` attribute, you will see a graph of computations that looks
like this:

``` {.sh}
input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
      -> flatten -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss
```

So, when we call `loss.backward()`, the whole graph is differentiated
w.r.t. the neural net parameters, and all Tensors in the graph that have
`requires_grad=True` will have their `.grad` Tensor accumulated with the
gradient.

For illustration, let us follow a few steps backward:


In [8]:
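print(loss.grad_fn)  # MSELoss: the loss's backward node, MseLossBackward0
# next_functions lists the nodes that produced a node's inputs
print(loss.grad_fn.next_functions[0][0])  # Linear: AddmmBackward0 of the last linear layer
# one step further back; in the output below this is an AccumulateGrad node (a leaf parameter), not ReLU
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # AccumulateGrad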

<MseLossBackward0 object at 0x79421cc155d0>
<AddmmBackward0 object at 0x79421eb269e0>
<AccumulateGrad object at 0x79421eb269e0>
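Following `next_functions` by hand only shows one branch at a time. A small recursive helper can print the whole backward graph instead. This is a sketch on a tiny stand-in model; the helper name `walk` is made up, and the exact node names may vary between PyTorch versions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))  # stand-in model
loss = nn.MSELoss()(model(torch.randn(1, 4)), torch.randn(1, 2))


def walk(fn, depth=0):
    """Print the backward graph below `fn`, indented by depth; return the node type names."""
    if fn is None:          # inputs that do not require grad have no node
        return []
    name = type(fn).__name__
    print("  " * depth + name)
    names = [name]
    for next_fn, _ in fn.next_functions:
        names.extend(walk(next_fn, depth + 1))
    return names


node_names = walk(loss.grad_fn)
```

Leaf parameters show up as `AccumulateGrad` nodes, which is where `.grad` gets filled in during `backward()`.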


Backprop
========

To backpropagate the error, all we have to do is call `loss.backward()`.
You need to clear the existing gradients first, though; otherwise the new
gradients will be accumulated into the existing ones.

Now we shall call `loss.backward()` and have a look at conv1's bias
gradients before and after the backward pass.


In [9]:
net.zero_grad()     # zeroes the gradient buffers of all parameters, so earlier gradients do not mix in

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)  # None here: zero_grad() has just cleared it and nothing has been computed since

loss.backward()  # backpropagate: compute d(loss)/d(param) and accumulate it into each param.grad

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)  # now a vector of length 6, one bias gradient per output channel

conv1.bias.grad before backward
None
conv1.bias.grad after backward
tensor([-0.0015,  0.0056, -0.0485,  0.0238, -0.0206,  0.0230])
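The accumulation behavior mentioned above is easy to see on a single small tensor. In this sketch (the tensor `w` is made up), calling `backward()` twice without clearing doubles the stored gradient, and clearing restores the single-pass value.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()          # 2 * w = [2., 4.]

(w ** 2).sum().backward()       # no clearing in between: the new gradient is added
accumulated = w.grad.clone()    # [4., 8.]

w.grad.zero_()                  # clear the buffer in place
(w ** 2).sum().backward()
after_reset = w.grad.clone()    # back to [2., 4.]

print(first, accumulated, after_reset)
```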


Now, we have seen how to use loss functions.

**Read Later:**

> The neural network package contains various modules and loss functions
> that form the building blocks of deep neural networks. A full list
> with documentation is [here](https://pytorch.org/docs/nn).

**The only thing left to learn is:**

> -   Updating the weights of the network

Update the weights
==================

The simplest update rule used in practice is the Stochastic Gradient
Descent (SGD):

``` {.python}
weight = weight - learning_rate * gradient
```

We can implement this using simple Python code:

``` {.python}
learning_rate = 0.01
with torch.no_grad():  # keep the update itself out of the autograd graph
    for p in net.parameters():
        p.sub_(p.grad * learning_rate)
```

However, as you use neural networks, you want to use various different
update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. To enable
this, we built a small package: `torch.optim` that implements all these
methods. Using it is very simple:

``` {.python}
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update
```
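As a sanity check that the hand-written rule and `optim.SGD` agree, the sketch below applies one step of each to two copies of the same small layer and compares the results. It assumes plain SGD with no momentum or weight decay (the defaults); the model and data are random stand-ins.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model_a = nn.Linear(3, 1)
model_b = copy.deepcopy(model_a)     # identical starting weights
x, y = torch.randn(8, 3), torch.randn(8, 1)
learning_rate = 0.01

# Manual update on model_a: weight = weight - learning_rate * gradient
nn.functional.mse_loss(model_a(x), y).backward()
with torch.no_grad():                # keep the update out of the autograd graph
    for p in model_a.parameters():
        p.sub_(learning_rate * p.grad)

# The same step through torch.optim on model_b.
optimizer = optim.SGD(model_b.parameters(), lr=learning_rate)
optimizer.zero_grad()
nn.functional.mse_loss(model_b(x), y).backward()
optimizer.step()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```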


<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>

<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<p>Observe how gradient buffers had to be manually set to zero using <code>optimizer.zero_grad()</code>. This is because gradients are accumulated, as explained in the <a href="">Backprop</a> section.</p>

</div>
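Putting the pieces together, a full loop repeats the same five calls many times. The sketch below runs 20 steps on a fixed random batch, so it only shows the mechanics, not real training. The model mirrors the layer sizes of the network above but is written with `nn.Sequential` for brevity; the batch size, step count, and data are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Same layer sizes as the network above, written with nn.Sequential for brevity.
model = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)

inputs = torch.randn(4, 1, 32, 32)   # dummy batch
targets = torch.randn(4, 10)         # dummy targets
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

losses = []
for step in range(20):
    optimizer.zero_grad()                       # clear gradients from the previous step
    loss = criterion(model(inputs), targets)    # forward pass and loss
    loss.backward()                             # compute gradients
    optimizer.step()                            # apply the update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

With a real dataset, `inputs` and `targets` would come from iterating over batches rather than being fixed.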

