# Lesson 1


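## What is PyTorch?

model = Architecture + Parameters

PyTorch is a Python-based scientific computing library with the following features:

- It is similar to NumPy, but it can use GPUs
- It can be used to define deep learning models, and to train and use them flexibly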

## Tensors

A Tensor is similar to NumPy's ndarray; the main difference is that a Tensor can also run on a GPU to accelerate computation.

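# How to become a PyTorch expert

- Learn the fundamentals of deep learning
- Work through the official PyTorch tutorials
- Study tutorials on GitHub and various blogs (curated lists made by others)
- Read the documentation and use the forum at https://discuss.pytorch.org/
- Run and study open-source PyTorch projects
- Read deep learning papers and study other people's model implementations
- Implement models yourself from papers
- Create your own models (and perhaps write a paper)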

In [1]:
import torch

Construct an uninitialized 5x3 matrix:

In [2]:
x = torch.empty(5,3)
x

tensor([[9.1837e-39, 4.6837e-39, 9.2755e-39],
        [1.0837e-38, 8.4490e-39, 1.1112e-38],
        [1.0194e-38, 9.0919e-39, 8.4490e-39],
        [9.6429e-39, 8.4490e-39, 9.6429e-39],
        [9.2755e-39, 1.0286e-38, 9.0919e-39]])

Construct a randomly initialized matrix:

In [3]:
x = torch.rand(5,3)
x

tensor([[0.2449, 0.8412, 0.6487],
        [0.1779, 0.5634, 0.4302],
        [0.5804, 0.2735, 0.9928],
        [0.8712, 0.2792, 0.2858],
        [0.8749, 0.4141, 0.8015]])

Construct a matrix filled with zeros and of dtype long:

In [4]:
x = torch.zeros(5,3,dtype=torch.long)
x

tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]])

In [5]:
x = torch.zeros(5,3).long()
x.dtype

torch.int64

Construct a tensor directly from data:

In [6]:
x = torch.tensor([5.5,3])
x

tensor([5.5000, 3.0000])

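You can also create a tensor from an existing tensor. These methods reuse the properties of the original tensor, such as its dtype, unless new values are provided.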

In [7]:
x = x.new_ones(5,3, dtype=torch.double)
x

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], dtype=torch.float64)

In [8]:
x = torch.randn_like(x, dtype=torch.float)
x

tensor([[ 7.5098e-01, -2.1833e+00,  1.7353e+00],
        [-1.9543e+00,  1.3971e-03,  9.1297e-02],
        [ 2.4668e+00,  1.3432e+00,  1.4745e-01],
        [-1.1110e+00, -8.8386e-01,  1.5700e+00],
        [ 5.7871e-01,  6.8120e-01, -1.5284e-01]])

Get the shape of a tensor:

In [9]:
x.shape

torch.Size([5, 3])

<div class="alert alert-info"><h4>Note</h4><p>``torch.Size`` is in fact a tuple, so it supports tuple operations.</p></div>
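As a small illustration of the note above, the sketch below treats a shape like any other tuple. The variable names (`rows`, `cols`, `n_elements`) are chosen here for illustration only.

```python
import torch

x = torch.zeros(5, 3)

# torch.Size subclasses tuple, so unpacking and comparison work as usual
rows, cols = x.shape
is_tuple = isinstance(x.shape, tuple)
matches = x.shape == (5, 3)

# The product of the dimensions equals the number of elements
n_elements = rows * cols
print(is_tuple, matches, n_elements, x.numel())
```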

Operations


There are many kinds of tensor operations. We start with addition.



In [10]:
y = torch.rand(5,3)
y

tensor([[0.2833, 0.6793, 0.4296],
        [0.6732, 0.1925, 0.0823],
        [0.1578, 0.9667, 0.8093],
        [0.4643, 0.9432, 0.9148],
        [0.6818, 0.3293, 0.4672]])

In [11]:
x + y

tensor([[ 1.0343, -1.5040,  2.1649],
        [-1.2811,  0.1939,  0.1736],
        [ 2.6247,  2.3099,  0.9568],
        [-0.6467,  0.0593,  2.4848],
        [ 1.2605,  1.0105,  0.3143]])

Another syntax for addition:


In [12]:
torch.add(x, y)

tensor([[ 1.0343, -1.5040,  2.1649],
        [-1.2811,  0.1939,  0.1736],
        [ 2.6247,  2.3099,  0.9568],
        [-0.6467,  0.0593,  2.4848],
        [ 1.2605,  1.0105,  0.3143]])

Addition: passing an output tensor as an argument

In [13]:
result = torch.empty(5,3)
torch.add(x, y, out=result)
# result = x + y
result

tensor([[ 1.0343, -1.5040,  2.1649],
        [-1.2811,  0.1939,  0.1736],
        [ 2.6247,  2.3099,  0.9568],
        [-0.6467,  0.0593,  2.4848],
        [ 1.2605,  1.0105,  0.3143]])

In-place addition

In [14]:
y.add_(x)
y

tensor([[ 1.0343, -1.5040,  2.1649],
        [-1.2811,  0.1939,  0.1736],
        [ 2.6247,  2.3099,  0.9568],
        [-0.6467,  0.0593,  2.4848],
        [ 1.2605,  1.0105,  0.3143]])

<div class="alert alert-info"><h4>Note</h4><p>Any operation that mutates a tensor in place is suffixed with ``_``.
    For example, ``x.copy_(y)`` and ``x.t_()`` will change ``x``.</p></div>
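The difference between the two forms can be seen in a minimal sketch, assuming small example tensors `x` and `y` created just for this purpose: `add` returns a new tensor, while `add_` and `t_` modify `x` itself.

```python
import torch

x = torch.ones(2, 3)
y = torch.zeros(2, 3)

# Out-of-place: a new tensor is allocated and x is left untouched
out = x.add(y)
out_shares_storage = out.data_ptr() == x.data_ptr()

# In-place: x itself is modified, and the returned tensor uses the same storage
ret = x.add_(1)
ret_shares_storage = ret.data_ptr() == x.data_ptr()

# In-place transpose changes the shape of x from (2, 3) to (3, 2)
x.t_()
print(out_shares_storage, ret_shares_storage, x.shape)
```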

All the usual NumPy-style indexing works on PyTorch tensors.


In [15]:
x[1:, 1:]

tensor([[ 1.3971e-03,  9.1297e-02],
        [ 1.3432e+00,  1.4745e-01],
        [-8.8386e-01,  1.5700e+00],
        [ 6.8120e-01, -1.5284e-01]])

Resizing: if you want to resize/reshape a tensor, you can use the ``view`` method of the tensor:

In [16]:
x = torch.randn(4,4)
y = x.view(16)
z = x.view(-1,8)
z

tensor([[-2.3771,  0.1132, -0.3402, -1.3023,  0.8325,  0.1971, -0.6270,  0.5392],
        [-1.9797, -1.3136, -1.2431,  1.5669,  0.7865,  0.9015, -0.6884, -1.0703]])
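Two properties of ``view`` are easy to miss: the result shares storage with the original tensor, and it requires a compatible memory layout. The sketch below illustrates both, using example tensors chosen here (`flat`, `xt`, `r`); ``reshape`` is shown as the fallback that copies when a view is not possible.

```python
import torch

x = torch.arange(16.).view(4, 4)

# A view shares storage with the original: writing through it changes x
flat = x.view(16)
flat[0] = 100.0
shares = x[0, 0].item() == 100.0

# A transposed tensor is not contiguous, so view(16) fails on it
xt = x.t()
try:
    xt.view(16)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape returns a view when possible and copies otherwise
r = xt.reshape(16)
print(shares, view_failed, r[:4])
```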

If you have a tensor with a single element, the ``.item()`` method returns its value as a Python number.

In [17]:
x = torch.randn(1)
x

tensor([2.6816])

In [18]:
x.item()

2.6816351413726807

In [19]:
z.transpose(1,0)

tensor([[-2.3771, -1.9797],
        [ 0.1132, -1.3136],
        [-0.3402, -1.2431],
        [-1.3023,  1.5669],
        [ 0.8325,  0.7865],
        [ 0.1971,  0.9015],
        [-0.6270, -0.6884],
        [ 0.5392, -1.0703]])

**Further reading**


  Many more Tensor operations, including transposing, indexing, slicing,
  mathematical operations, linear algebra, and random numbers, are described at
  `<https://pytorch.org/docs/torch>`.

Converting between NumPy and Tensor
------------

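Converting between a Torch Tensor and a NumPy array is easy.

A CPU Torch Tensor and the corresponding NumPy array share the same underlying memory, so changing one also changes the other.

Converting a Torch Tensor to a NumPy array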


In [20]:
a = torch.ones(5)
a

tensor([1., 1., 1., 1., 1.])

In [21]:
b = a.numpy()
b

array([1., 1., 1., 1., 1.], dtype=float32)

Change a value in the NumPy array.

In [22]:
b[1] = 2
b

array([1., 2., 1., 1., 1.], dtype=float32)

In [23]:
a

tensor([1., 2., 1., 1., 1.])

Converting a NumPy ndarray to a Torch Tensor

In [24]:
import numpy as np

In [25]:
a = np.ones(5)
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a)

[2. 2. 2. 2. 2.]


In [26]:
b

tensor([2., 2., 2., 2., 2.], dtype=torch.float64)

All Tensors on the CPU support converting to NumPy and back.
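Memory sharing applies to ``torch.from_numpy`` and ``.numpy()``; constructing a tensor with ``torch.tensor`` copies the data instead. A minimal sketch of the difference, with the names `shared` and `copied` chosen here for illustration:

```python
import numpy as np
import torch

a = np.ones(3)
shared = torch.from_numpy(a)  # shares memory with a
copied = torch.tensor(a)      # copies the data

# Modify the NumPy array after both tensors were created
a[0] = 5.0
print(shared)  # reflects the change
print(copied)  # keeps the original values
```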

CUDA Tensors
------------

Tensors can be moved to another device with the ``.to`` method.



In [27]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    y = torch.ones_like(x, device=device)
    x = x.to(device)
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))
    

tensor([3.6816], device='cuda:0')
tensor([3.6816], dtype=torch.float64)
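A common way to write code that runs with or without a GPU is to pick the device once and create tensors on it. This is only a sketch of that pattern; the fallback to `"cpu"` is a choice made here, not part of the cell above.

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(3, device=device)
y = torch.ones_like(x)  # inherits dtype and device from x
z = (x + y).to("cpu", torch.double)  # move back to the CPU and change dtype

print(x.device, y.device, z.device, z.dtype)
```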


In [28]:
y.to("cpu").data.numpy()
y.cpu().data.numpy()

array([1.], dtype=float32)

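In [29]:
# A model is moved to the GPU the same way. This line only works once
# `model` has been defined (see the nn sections below), so it is commented out here.
# model = model.cuda()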


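Warm-up: a two-layer neural network in NumPy
--------------

A fully connected ReLU network with one hidden layer and no bias, trained to predict y from x with an L2 loss.
- $h = XW_1$
- $a = \max(0, h)$
- $\hat{y} = aW_2$

This implementation uses only NumPy to compute the forward pass, the loss, and the backward pass.
- forward pass
- loss
- backward pass

A NumPy ndarray is a plain n-dimensional array. It knows nothing about deep learning, gradients, or computation graphs; it is just a data structure for numerical computation.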
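The gradients used in the backward pass of the code below follow from the chain rule. As a sketch, assuming the row-major convention of the code (`h = x.dot(w1)`, so $h = XW_1$), and writing $L$ for the summed squared error and $Y$ for the targets (symbols introduced here for illustration):

$$
\begin{aligned}
L &= \sum_{i,j} (\hat{y}_{ij} - Y_{ij})^2, \\
\frac{\partial L}{\partial \hat{y}} &= 2(\hat{y} - Y), \\
\frac{\partial L}{\partial W_2} &= a^\top \frac{\partial L}{\partial \hat{y}}, \\
\frac{\partial L}{\partial a} &= \frac{\partial L}{\partial \hat{y}} W_2^\top, \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial a} \odot \mathbb{1}[h \ge 0], \\
\frac{\partial L}{\partial W_1} &= X^\top \frac{\partial L}{\partial h}.
\end{aligned}
$$

Each line corresponds to one statement of the backward pass: `grad_y_pred`, `grad_w2`, `grad_h_relu`, `grad_h` (zeroed where `h < 0`), and `grad_w1`.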



In [None]:
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for it in range(500):
    # Forward pass
    h = x.dot(w1) # N * H
    h_relu = np.maximum(h, 0) # N * H
    y_pred = h_relu.dot(w2) # N * D_out
    
    # compute loss
    loss = np.square(y_pred - y).sum()
    print(it, loss)
    
    # Backward pass
    # compute the gradient
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h<0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # update weights of w1 and w2
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
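One way to gain confidence in hand-derived gradients like the ones above is a finite-difference check. The sketch below is not part of the original notebook: it uses deliberately small layer sizes, and the helpers `loss_fn` and `numeric_grad` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # small sizes keep the check fast

x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn():
    """Sum-of-squares loss of the two-layer ReLU network for the current w1, w2."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradients, using the same formulas as the training loop
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

def numeric_grad(w, eps=1e-6):
    """Central-difference estimate of d loss / d w, perturbing w in place."""
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        up = loss_fn()
        w[idx] = old - eps
        down = loss_fn()
        w[idx] = old  # restore the original value
        g[idx] = (up - down) / (2 * eps)
    return g

max_err_w1 = np.abs(numeric_grad(w1) - grad_w1).max()
max_err_w2 = np.abs(numeric_grad(w2) - grad_w2).max()
print(max_err_w1, max_err_w2)
```

Both errors should be tiny compared with the gradient magnitudes; a large discrepancy would point to a mistake in the backward pass.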


PyTorch: Tensors
----------------

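This time we use PyTorch tensors to build the forward pass, compute the loss, and run the backward pass.

A PyTorch Tensor is very much like a NumPy ndarray. The biggest difference is that a PyTorch Tensor can run on either the CPU or the GPU. To run on the GPU, the Tensor has to be moved to a CUDA device.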


In [None]:
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H)
w2 = torch.randn(H, D_out)

learning_rate = 1e-6
for it in range(500):
    # Forward pass
    h = x.mm(w1) # N * H
    h_relu = h.clamp(min=0) # N * H
    y_pred = h_relu.mm(w2) # N * D_out
    
    # compute loss
    loss = (y_pred - y).pow(2).sum().item()
    print(it, loss)
    
    # Backward pass
    # compute the gradient
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h<0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # update weights of w1 and w2
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

A simple autograd example

In [None]:
x = torch.tensor(1., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

y = w*x + b # y = 2*1+3

y.backward()

# dy/dw = x = 1, dy/dx = w = 2, dy/db = 1
print(w.grad)  # tensor(1.)
print(x.grad)  # tensor(2.)
print(b.grad)  # tensor(1.)



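PyTorch: Tensors and autograd
-------------------------------

One of PyTorch's key features is autograd: once the forward pass is defined and the loss is computed, PyTorch can differentiate automatically and compute the gradient of the loss with respect to every model parameter.

A PyTorch Tensor represents a node in the computation graph. If ``x`` is a Tensor with ``x.requires_grad=True``, then after a backward pass ``x.grad`` is another Tensor holding the gradient of some scalar (usually the loss) with respect to ``x``.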


In [None]:
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
for it in range(500):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # compute loss
    loss = (y_pred - y).pow(2).sum() # computation graph
    print(it, loss.item())
    
    # Backward pass
    loss.backward()
    
    # update weights of w1 and w2
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()
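The calls to ``zero_()`` at the end of each iteration are needed because ``backward()`` adds to ``.grad`` rather than overwriting it. A minimal sketch with a single scalar weight (the values 2 and 3 are arbitrary choices made here):

```python
import torch

w = torch.tensor(2., requires_grad=True)
x = torch.tensor(3.)

(w * x).backward()
first = w.grad.item()   # d(w*x)/dw = x = 3

(w * x).backward()      # the new gradient is added to the stored one
second = w.grad.item()  # 3 + 3 = 6

w.grad.zero_()          # reset before the next backward pass
(w * x).backward()
third = w.grad.item()   # back to 3

print(first, second, third)
```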


PyTorch: nn
-----------


This time we use PyTorch's nn package to build the network.
We still rely on autograd to build the computation graph,
and PyTorch computes the gradients for us automatically.




In [30]:
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=False), # x @ w_1^T, no bias term
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out, bias=False),
)

torch.nn.init.normal_(model[0].weight)
torch.nn.init.normal_(model[2].weight)

# model = model.cuda()

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6
for it in range(500):
    # Forward pass
    y_pred = model(x) # model.forward() 
    
    # compute loss
    loss = loss_fn(y_pred, y) # computation graph
    print(it, loss.item())
    
    # Backward pass
    loss.backward()
    
    # update weights of w1 and w2
    with torch.no_grad():
        for param in model.parameters(): # param (tensor, grad)
            param -= learning_rate * param.grad
            
    model.zero_grad()

0 30796758.0
1 31658336.0
2 35340732.0
3 35842768.0
4 29652120.0
5 19048664.0
6 9998813.0
7 4842135.0
8 2503577.75
9 1502023.125
10 1045942.25
11 805809.4375
12 656833.125
13 551465.1875
14 470619.78125
15 405676.625
16 352232.21875
17 307525.8125
18 269770.96875
19 237652.015625
20 210176.09375
21 186538.578125
22 166096.75
23 148363.03125
24 132908.125
25 119409.46875
26 107558.53125
27 97133.3671875
28 87936.59375
29 79787.796875
30 72550.4375
31 66099.953125
32 60338.90234375
33 55180.42578125
34 50549.44140625
35 46382.1953125
36 42627.51171875
37 39237.1875
38 36168.2890625
39 33386.96875
40 30861.333984375
41 28562.126953125
42 26468.365234375
43 24556.771484375
44 22809.75
45 21210.68359375
46 19745.548828125
47 18402.154296875
48 17167.94921875
49 16032.509765625
50 14986.251953125
51 14020.9482421875
52 13129.1328125
53 12303.9296875
54 11539.8408203125
55 10831.37890625
56 10174.25390625
57 9564.1943359375
58 8996.5107421875
59 8468.296875
60 7976.27783203125
61 7517.3325195

In [31]:
model[0].weight

Parameter containing:
tensor([[-2.4369, -1.2259,  1.3928,  ..., -0.6311,  1.0741, -0.0186],
        [-0.4728, -0.8807, -0.3484,  ...,  0.4950, -0.4096,  0.7196],
        [-0.0301, -1.0491, -1.3055,  ...,  0.3130,  0.5399,  0.6213],
        ...,
        [-0.6895,  0.0346, -0.2390,  ..., -0.6945,  0.6638, -1.1315],
        [ 0.8071,  1.0557,  0.1674,  ..., -0.9313,  0.0808, -1.0955],
        [ 0.5790, -0.4493,  0.8469,  ...,  0.2945, -1.5833, -0.7447]],
       requires_grad=True)
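Note that ``nn.Linear`` stores its weight with shape ``(out_features, in_features)``, which is the transpose of the ``w1`` and ``w2`` used in the earlier hand-written versions. The sketch below checks this on a freshly built model and reproduces the forward pass by hand; the batch of 4 random inputs is an arbitrary choice made here.

```python
import torch

D_in, H, D_out = 1000, 100, 10
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out, bias=False),
)

# One weight matrix per Linear layer, stored as (out_features, in_features)
shapes = [tuple(p.shape) for p in model.parameters()]
print(shapes)

# The same forward pass written by hand, transposing each weight
x = torch.randn(4, D_in)
manual = torch.relu(x @ model[0].weight.t()) @ model[2].weight.t()
same = torch.allclose(model(x), manual, atol=1e-4)
print(same)
```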


PyTorch: optim
--------------

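This time we no longer update the model weights by hand; instead, we use the optim package to update the parameters for us.
The optim package provides a variety of optimization algorithms, including SGD with momentum, RMSProp, Adam, and others.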
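For plain SGD without momentum, ``optimizer.step()`` performs the same update as the manual ``param -= learning_rate * param.grad`` used earlier. The sketch below compares the two on a toy quadratic loss; the loss, the learning rate of 0.1, and the tensor names are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
w_manual = torch.randn(3, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)

lr = 0.1
optimizer = torch.optim.SGD([w_optim], lr=lr)

# Manual update, as in the autograd example
loss = (w_manual ** 2).sum()
loss.backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad
w_manual.grad.zero_()

# The same step through the optimizer
optimizer.zero_grad()
loss = (w_optim ** 2).sum()
loss.backward()
optimizer.step()

print(w_manual)
print(w_optim)
```

Other optimizers such as Adam keep extra per-parameter state, so their updates do not reduce to this single line.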


In [32]:
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=False), # x @ w_1^T, no bias term
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out, bias=False),
)

torch.nn.init.normal_(model[0].weight)
torch.nn.init.normal_(model[2].weight)

# model = model.cuda()

loss_fn = nn.MSELoss(reduction='sum')
# learning_rate = 1e-4
# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

learning_rate = 1e-6
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for it in range(500):
    # Forward pass
    y_pred = model(x) # model.forward() 
    
    # compute loss
    loss = loss_fn(y_pred, y) # computation graph
    print(it, loss.item())

    optimizer.zero_grad()
    # Backward pass
    loss.backward()
    
    # update model parameters
    optimizer.step()


0 31203018.0
1 24459852.0
2 19565672.0
3 14788742.0
4 10435055.0
5 6971237.5
6 4576112.0
7 3042284.0
8 2102442.25
9 1523112.0
10 1156355.25
11 913097.125
12 743221.625
13 618317.125
14 522565.875
15 446830.75
16 385507.25
17 335017.375
18 292761.3125
19 257028.1875
20 226631.328125
21 200571.5
22 178084.078125
23 158592.921875
24 141631.09375
25 126793.453125
26 113775.9140625
27 102356.125
28 92280.3203125
29 83363.0390625
30 75442.578125
31 68389.53125
32 62099.8203125
33 56478.3203125
34 51445.4921875
35 46925.22265625
36 42857.80078125
37 39196.75
38 35891.359375
39 32906.9296875
40 30201.1953125
41 27745.16015625
42 25513.787109375
43 23481.83203125
44 21630.525390625
45 19940.89453125
46 18402.841796875
47 16997.359375
48 15710.19921875
49 14530.1640625
50 13448.8984375
51 12456.611328125
52 11545.50390625
53 10707.75
54 9936.71484375
55 9226.4912109375
56 8572.30078125
57 7968.474609375
58 7410.970703125
59 6895.8779296875
60 6420.083984375
61 5980.220703125
62 5573.0625
63 5195

399 0.0008407257264479995
400 0.000815213134046644
401 0.0007915767491795123
402 0.0007707122713327408
403 0.0007489375420846045
404 0.000727669452317059
405 0.0007082887459546328
406 0.0006876115803606808
407 0.0006683834944851696
408 0.0006496653077192605
409 0.0006329094176180661
410 0.0006157293100841343
411 0.0005990762729197741
412 0.0005814794567413628
413 0.000567686278373003
414 0.0005518735270015895
415 0.0005379949579946697
416 0.0005239408928900957
417 0.0005109445773996413
418 0.0004979264922440052
419 0.00048516166862100363
420 0.00047292609815485775
421 0.00046157435281202197
422 0.00044996861834079027
423 0.0004384240019135177
424 0.00042779595241881907
425 0.00041692424565553665
426 0.00040680382517166436
427 0.0003973639686591923
428 0.00038905185647308826
429 0.0003787998575717211
430 0.0003707826544996351
431 0.00036169192753732204
432 0.000353094597812742
433 0.00034433742985129356
434 0.0003374666557647288
435 0.00032953821937553585
436 0.00032214936800301075
437 


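PyTorch: Custom nn Modules
--------------------------

We can define a model as a class that inherits from nn.Module. Whenever a model is more complex than what Sequential can express, it needs to be defined as an nn.Module subclass.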
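As an illustration of something a plain Sequential cannot express, the sketch below defines a hypothetical `SkipNet` whose forward pass adds a skip connection around a hidden layer. The class name, layer names, and sizes are assumptions made here, not part of the lesson's model.

```python
import torch

class SkipNet(torch.nn.Module):
    """Hypothetical example: one hidden block with a skip connection."""

    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.input_layer = torch.nn.Linear(D_in, H)
        self.hidden = torch.nn.Linear(H, H)
        self.output_layer = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h = torch.relu(self.input_layer(x))
        # The skip connection adds the block's input to its output
        h = h + torch.relu(self.hidden(h))
        return self.output_layer(h)

model = SkipNet(1000, 100, 10)
y_pred = model(torch.randn(64, 1000))

# Layers assigned as attributes are registered automatically as parameters
n_params = sum(p.numel() for p in model.parameters())
print(y_pred.shape, n_params)
```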



In [34]:
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly create some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # define the model architecture
        # D_in, H, D_out are the dimensions of the weight matrices W; __init__ does not involve the inputs X or the targets Y
        self.linear1 = torch.nn.Linear(D_in, H, bias=False)
        self.linear2 = torch.nn.Linear(H, D_out, bias=False)
    
    def forward(self, x):
        y_pred = self.linear2(self.linear1(x).clamp(min=0))
        return y_pred

model = TwoLayerNet(D_in, H, D_out)
loss_fn = nn.MSELoss(reduction='sum')
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

'''
During training, the three calls go in this order:
1. optimizer.zero_grad()
2. loss.backward()
3. optimizer.step()
'''
for it in range(500):
    # Forward pass
    y_pred = model(x) # model.forward() 
    
    # compute loss
    loss = loss_fn(y_pred, y) # computation graph
    print(it, loss.item())

    optimizer.zero_grad()
    # Backward pass
    loss.backward()
    
    # update model parameters
    optimizer.step()


0 627.3970336914062
1 611.1285400390625
2 595.31689453125
3 579.910400390625
4 564.9271850585938
5 550.3590087890625
6 536.1961669921875
7 522.4404296875
8 509.03594970703125
9 495.9280700683594
10 483.1991271972656
11 470.7916259765625
12 458.69512939453125
13 446.97943115234375
14 435.7530822753906
15 424.87615966796875
16 414.29534912109375
17 403.9889221191406
18 393.9405822753906
19 384.1108093261719
20 374.5871887207031
21 365.31689453125
22 356.3175964355469
23 347.5909118652344
24 339.0564880371094
25 330.7277526855469
26 322.60174560546875
27 314.66668701171875
28 306.9889831542969
29 299.5657043457031
30 292.3930969238281
31 285.4082336425781
32 278.5841064453125
33 271.9277648925781
34 265.4345703125
35 259.0935363769531
36 252.884033203125
37 246.8143310546875
38 240.864013671875
39 235.03262329101562
40 229.34420776367188
41 223.77505493164062
42 218.32476806640625
43 212.99063110351562
44 207.78125
45 202.6981201171875
46 197.71563720703125
47 192.83370971679688
48 188.05

375 1.0628429663483985e-05
376 1.003461511572823e-05
377 9.473998034081887e-06
378 8.944818546297029e-06
379 8.445858838967979e-06
380 7.97367738414323e-06
381 7.5275370363669936e-06
382 7.106205430318369e-06
383 6.70767440169584e-06
384 6.332363682304276e-06
385 5.976620741421357e-06
386 5.641676125378581e-06
387 5.324543508322677e-06
388 5.025209702580469e-06
389 4.7420826376765035e-06
390 4.474781690078089e-06
391 4.222859388391953e-06
392 3.984295290138107e-06
393 3.7593920296785655e-06
394 3.5475932236295193e-06
395 3.3458875350333983e-06
396 3.155816784783383e-06
397 2.9769507818855345e-06
398 2.808133103826549e-06
399 2.6484437967155827e-06
400 2.497099103493383e-06
401 2.3547511318611214e-06
402 2.220245960415923e-06
403 2.093348030030029e-06
404 1.9735789464903064e-06
405 1.8605572904561996e-06
406 1.7538062593303039e-06
407 1.6528582591490704e-06
408 1.557692485221196e-06
409 1.4678712432214525e-06
410 1.3830973557560355e-06
411 1.3032359902354074e-06
412 1.2278317171876552e-

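## Summary

- A model definition generally has two parts: initialization and the forward pass. The overall deep learning workflow is roughly: define the input and output data, define the model, define the loss function and the optimizer, train the model, and tune hyperparameters.
- When the model performs poorly, try adjusting the learning rate, switching the optimizer, or changing the parameter initialization.
- In the examples above, Adam works with a learning rate of about 1e-3 to 1e-4, while SGD on the sum-reduced loss needs about 1e-6 to 1e-7; suitable values depend on the problem and the scale of the loss.
- In the model definition, initialization does not involve the inputs X or the targets Y; it only uses the parameter dimensions.
- During training, the three calls go in this order:
    1. optimizer.zero_grad()
    2. loss.backward()
    3. optimizer.step()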