# Neural Networks
Neural networks can be constructed using the *torch.nn* package.

Now that you have had a glimpse of *autograd*, note that *nn* depends on *autograd* to define models and differentiate them. An *nn.Module* contains layers, and a method *forward(input)* that returns the *output*.

For example, look at this network that classifies digit images:
![convnet](https://pytorch.org/tutorials/_images/mnist.png)
convnet (convolutional network)

It is a simple feed-forward network. It takes the input, feeds it through several layers one after the other, and then finally gives the output.

A typical training procedure for a neural network is as follows:

1. Define the neural network that has some learnable parameters (or weights)
2. Iterate over a dataset of inputs
3. Process input through the network
4. Compute the loss (how far the output is from being correct)
5. Propagate gradients back into the network's parameters
6. Update the weights of the network, typically using a simple update rule: `weight = weight - learning_rate * gradient`

## Define the network

Let's define this network:

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels
        # 5x5 square convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        # 5*5 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square, you can specify with a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


You just have to define the `forward` function, and the `backward` function (where gradients are computed) is automatically defined for you using `autograd`. You can use any of the Tensor operations in the `forward` function.

The learnable parameters of a model are returned by `net.parameters()`:

In [2]:
params = list(net.parameters())
print(len(params))
print(params[0].size())


10
torch.Size([6, 1, 5, 5])
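
The count of 10 can be checked by listing the parameters by name: each of the five layers contributes one weight tensor and one bias tensor. The following is a minimal sketch that assumes a compact stand-in (`layers`, an `nn.ModuleDict` that is not part of the tutorial's `Net`) holding the same five layers.

```python
import torch.nn as nn

# Hypothetical stand-in holding the same five layers as Net above.
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 5),
    "conv2": nn.Conv2d(6, 16, 5),
    "fc1": nn.Linear(16 * 5 * 5, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
})

# Map each parameter name to its shape: one weight and one bias per layer.
shapes = {name: tuple(p.shape) for name, p in layers.named_parameters()}
for name, shape in shapes.items():
    print(name, shape)

# Total number of learnable scalars across all parameter tensors.
total = sum(p.numel() for p in layers.parameters())
print(len(shapes), total)
```

The first entry, `conv1.weight`, has shape `(6, 1, 5, 5)`: 6 output channels, 1 input channel, and a 5x5 kernel, which matches `params[0].size()` above.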


Let's try a random 32x32 input. Note: the expected input size of this net (LeNet) is 32x32. To use this net on the MNIST dataset, please resize the images from the dataset to 32x32.

In [3]:
input = torch.randn(1,1,32,32)
out = net(input)
print(out)

tensor([[ 0.0464, -0.1199, -0.1110, -0.0548,  0.0170,  0.1179, -0.0509,  0.0688,
         -0.0746, -0.0187]], grad_fn=<AddmmBackward>)
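
To see why the input must be 32x32, it can help to trace the tensor shape through the convolution and pooling stages. The sketch below is illustrative only: `trace_shapes` is a hypothetical helper, and the two convolution layers are re-created with random weights rather than taken from `net`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same convolution layers as in Net above; the weights are random and irrelevant here.
conv1 = nn.Conv2d(1, 6, 5)
conv2 = nn.Conv2d(6, 16, 5)


def trace_shapes(side):
    """Return the tensor shape after each stage for a side x side input."""
    x = torch.randn(1, 1, side, side)
    shapes = [tuple(x.shape)]
    for conv in (conv1, conv2):
        x = F.relu(conv(x))        # 5x5 kernel, no padding: each side shrinks by 4
        shapes.append(tuple(x.shape))
        x = F.max_pool2d(x, 2)     # 2x2 pooling halves each side
        shapes.append(tuple(x.shape))
    shapes.append(tuple(torch.flatten(x, 1).shape))
    return shapes


shapes_32 = trace_shapes(32)
shapes_28 = trace_shapes(28)
for shape in shapes_32:
    print(shape)
print(shapes_28[-1])
```

With a 32x32 input the sides go 32, 28, 14, 10, 5, so the flattened vector has 16 * 5 * 5 = 400 features, which is what `fc1` expects. A 28x28 input (the native MNIST size) ends at 16 * 4 * 4 = 256 features and would not fit `fc1`.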



Zero the gradient buffers of all parameters and backprop with random gradients:

In [4]:
net.zero_grad()
out.backward(torch.randn(1,10))
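
Because `out` is not a scalar, `backward` needs a tensor of the same shape as `out`; autograd then computes a vector-Jacobian product with it. A minimal sketch on a toy function (the names `x`, `y`, and `v` are illustrative and not part of the network above):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = 2 * x                                # non-scalar output; the Jacobian dy/dx is 2 * I

v = torch.tensor([1.0, 0.1, 0.01])       # plays the role of torch.randn(1, 10) above
y.backward(v)                            # computes J^T v and stores it in x.grad

print(x.grad)                            # equals 2 * v
```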

> * Note

`torch.nn` only supports mini-batches. The entire `torch.nn` package only supports inputs that are a mini-batch of samples, and not a single sample.

For example, `nn.Conv2d` will take in a 4D Tensor of *nSamples x nChannels x Height x Width*.

If you have a single sample, just use `input.unsqueeze(0)` to add a fake batch dimension.
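
As a short illustration of the fake batch dimension, the sketch below assumes a single one-channel 32x32 image stored as a 3D tensor (`sample` is a made-up name) and passes it through a standalone `nn.Conv2d`:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 6, 5)

sample = torch.randn(1, 32, 32)   # one sample: nChannels x Height x Width
batch = sample.unsqueeze(0)       # add a batch dimension of size 1 at position 0

out = conv(batch)
print(batch.shape)                # nSamples x nChannels x Height x Width
print(out.shape)
```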

Before proceeding further, let's recap all the classes you've seen so far.

### Recap:

1. ``torch.Tensor`` - A multi-dimensional array with support for autograd operations like ``backward()``. Also *holds the gradient* w.r.t. the tensor.
2. ``nn.Module`` - Neural network module. *Convenient way of encapsulating parameters*, with helpers for moving them to GPU, exporting, loading, etc.
3. ``nn.Parameter`` - A kind of Tensor that *is automatically registered as a parameter when assigned as an attribute to* a `Module`.
4. `autograd.Function` - Implements *forward and backward definitions of an autograd operation*. Every `Tensor` operation creates at least a single `Function` node that connects to the functions that created a `Tensor` and *encodes its history*.
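
The automatic registration of `nn.Parameter` can be seen in a small sketch. The `Scale` module below is hypothetical and exists only to contrast a parameter attribute with a plain tensor attribute:

```python
import torch
import torch.nn as nn


class Scale(nn.Module):
    """Hypothetical module used only to illustrate parameter registration."""

    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))   # registered as a parameter
        self.offset = torch.zeros(3)                # plain tensor: not registered

    def forward(self, x):
        return x * self.weight + self.offset


module = Scale()
names = [name for name, _ in module.named_parameters()]
print(names)
```

Only `weight` is listed, so only `weight` would be seen by an optimizer built from `module.parameters()`.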

### At this point, we covered:
>* Defining a neural network
>* Processing inputs and calling backward
### Still Left:
>* Computing the loss
>* Updating the weights of the network

## Loss Function

A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.

There are several different **loss functions** under the `nn` package. A simple loss is `nn.MSELoss`, which computes the mean-squared error between the output and the target.

***
For example:

In [5]:
output = net(input)
target = torch.randn(10) # a dummy target, for example
target = target.view(1, -1) # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(0.5511, grad_fn=<MseLossBackward>)
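
With its default settings, `nn.MSELoss` averages the squared element-wise differences. The sketch below checks this against a hand-written formula, using dummy tensors in place of the network output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(1, 10)   # stands in for net(input)
target = torch.randn(1, 10)   # dummy target of the same shape

loss = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()   # mean of squared differences

print(torch.allclose(loss, manual))
```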


Now, if you follow `loss` in the backward direction, using its `.grad_fn` attribute, you will see a graph of computations that looks like this:

***
    input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
          -> flatten -> linear -> relu -> linear -> relu -> linear
          -> MSELoss
          -> loss

So, when we call `loss.backward()`, the whole graph is differentiated w.r.t. the neural net parameters, and all Tensors in the graph that have `requires_grad=True` will have their `.grad` Tensor accumulated with the gradient.

***
For illustration, let us follow a few steps backward:

In [6]:
print(loss.grad_fn) # MSELoss
print(loss.grad_fn.next_functions[0][0]) # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0]) # first input of Linear (AccumulateGrad in the output below, not ReLU)

<MseLossBackward object at 0x0000017C8FE67B80>
<AddmmBackward object at 0x0000017C8FE67C40>
<AccumulateGrad object at 0x0000017C8FE67B80>
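
Instead of chaining `next_functions` by hand, the graph can be walked recursively. This is a rough sketch on a much smaller stand-in model (one linear layer followed by ReLU); `walk` is a hypothetical helper, and the exact node names (for example a trailing `0`) depend on the PyTorch version.

```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 2)
loss = nn.MSELoss()(torch.relu(lin(torch.randn(1, 4))), torch.zeros(1, 2))


def walk(fn, depth=0, nodes=None):
    """Collect (depth, node type name) pairs by following next_functions."""
    nodes = [] if nodes is None else nodes
    if fn is None:                 # inputs that do not require grad have no node
        return nodes
    nodes.append((depth, type(fn).__name__))
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, nodes)
    return nodes


nodes = walk(loss.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)
```

The `AccumulateGrad` leaves correspond to the parameters (here the weight and bias of `lin`); they are the nodes that write into `.grad` during `backward()`.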


## Backprop
To backpropagate the error, all we have to do is call `loss.backward()`. You need to clear the existing gradients first, though, or the new gradients will be accumulated into the existing ones.

Now we shall call `loss.backward()`, and have a look at conv1's bias gradients before and after the backward.

In [7]:
net.zero_grad()
# zeroes the gradient buffers of all parameters
print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([ 0.0019,  0.0025,  0.0088,  0.0016, -0.0122,  0.0074])
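
The accumulation behaviour is easy to reproduce on a single tensor. In this minimal sketch (the tensor `w` and the loss `sum(w * w)` are made up for illustration), calling `backward()` twice without clearing doubles the stored gradient:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * w).sum().backward()
first = w.grad.clone()        # gradient of sum(w^2) is 2 * w

(w * w).sum().backward()      # no zeroing in between
second = w.grad.clone()       # the new gradient was added to the old one

w.grad.zero_()                # clear the buffer, as net.zero_grad() does for a model
(w * w).sum().backward()
third = w.grad.clone()

print(first, second, third)
```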


Now, we have seen how to use loss functions.

### Read later:
The neural network package contains various modules and loss functions that form the building blocks of deep neural networks. A full list with documentation is [here](https://pytorch.org/docs/stable/nn.html).

### The only thing left to learn is:
>* Updating the weights of the network

## Update the weights
The simplest update rule used in practice is Stochastic Gradient Descent (SGD):

    weight = weight - learning_rate * gradient

We can implement this using simple Python code:

In [8]:
learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)
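
The same update can also be written without `.data` by wrapping the in-place step in `torch.no_grad()`, so that the update itself is not recorded by autograd. The sketch below applies one step to a single made-up tensor `w` rather than to `net`:

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()                      # w.grad is 2 * w = [2, -4]

learning_rate = 0.1
with torch.no_grad():                # do not track the update in the graph
    w -= learning_rate * w.grad      # weight = weight - learning_rate * gradient

print(w)
```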

However, as you use neural networks, you want to use various different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. To enable this, we built a small package: `torch.optim` that implements all these methods. Using it is very simple:

In [9]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()  # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step() # Does the update

>* Note

>Observe how gradient buffers had to be manually set to zero using `optimizer.zero_grad()`. This is because gradients are accumulated as explained in the **Backprop** section.
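
Putting the pieces together, a training loop repeats the same four calls on every iteration. The following is a minimal sketch under simplifying assumptions: a single `nn.Linear` layer (`model`) stands in for `net`, and one fixed dummy input/target pair stands in for a dataset.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical stand-in for `net`: a single linear layer keeps the sketch short.
model = nn.Linear(4, 2)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(1, 4)   # dummy mini-batch containing one sample
target = torch.randn(1, 2)   # dummy target

losses = []
for step in range(20):
    optimizer.zero_grad()              # clear gradients from the previous step
    output = model(inputs)             # forward pass
    loss = criterion(output, target)   # compute the loss
    loss.backward()                    # backpropagate
    optimizer.step()                   # apply the SGD update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

On this toy problem the recorded loss shrinks from the first step to the last; with a real dataset the loop body would run once per mini-batch.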