# Lesson 1: Two-Layer Neural Networks

Instructor
- Zewei Chu (褚则伟)
- [homepage](http://people.cs.uchicago.edu/~zeweichu/)
- [email](zeweichu@gmail.com)

Contents
- What is PyTorch?
- Tensor basics
- Building a two-layer neural network with numpy
- Building a two-layer neural network with PyTorch


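## 1. What is PyTorch?

PyTorch is a Python-based scientific computing library with the following features:

- It is similar to NumPy, but it can also run on GPUs.
- It can be used to define deep learning models, and to train and use them flexibly.

## 2. Tensor basics

A Tensor is similar to a NumPy ndarray. The main differences are that a Tensor can run on a GPU to accelerate computation and can track gradients through autograd.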

In [2]:
from __future__ import print_function                    # for Python 2 compatibility
import torch                                             # install torch first: https://pytorch.org/get-started/locally/

### 1) Creating Tensors

Construct an uninitialized 5x3 matrix:

In [3]:
x = torch.empty(5, 3)                                    # the values are whatever was already in memory: a placeholder, not "empty"
print(x)

tensor([[8.4490e-39, 1.0194e-38, 9.0919e-39],
        [8.4490e-39, 1.0745e-38, 1.0102e-38],
        [9.6429e-39, 8.9082e-39, 9.6429e-39],
        [1.0102e-38, 1.0194e-38, 1.0561e-38],
        [1.0469e-38, 9.2756e-39, 4.2246e-39]])


Construct a randomly initialized matrix:

In [3]:
x = torch.rand(5, 3)                                     # elements are drawn uniformly from [0, 1)
print(x)

tensor([[0.4821, 0.3854, 0.8517],
        [0.7962, 0.0632, 0.5409],
        [0.8891, 0.6112, 0.7829],
        [0.0715, 0.8069, 0.2608],
        [0.3292, 0.0119, 0.2759]])


Construct a matrix filled with zeros, of dtype long:

In [4]:
x = torch.zeros(5, 3, dtype=torch.long)                  # higher-rank shapes work too, e.g. torch.zeros(2, 3, 3, dtype=torch.long)
print(x)

tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]])


Construct a tensor directly from data:

In [5]:
x = torch.tensor([5.5, 3])
print(x)

tensor([5.5000, 3.0000])


You can also create a tensor from an existing tensor. These methods reuse properties of the source tensor, such as its dtype and device, unless you override them with new values.

In [6]:
x = x.new_ones(5, 3, dtype=torch.double)      # new_* methods take sizes; dtype and device default to those of x
print(x)

x = torch.randn_like(x, dtype=torch.float)    # override dtype!
print(x)                                      # result has the same size
                                              # dtype is optional here; randn samples from N(0, 1), rand from U[0, 1)

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], dtype=torch.float64)
tensor([[ 1.4793, -2.4772,  0.9738],
        [ 2.0328,  1.3981,  1.7509],
        [-0.7931, -0.0291, -0.6803],
        [-1.2944, -0.7352, -0.9346],
        [ 0.5917, -0.5149, -1.8149]])


Print the shape of a tensor:

In [7]:
print(x.size())                              # ``torch.Size`` is a subclass of tuple

torch.Size([5, 3])


### 2) Tensor operations


There are many kinds of tensor operations. We start with addition.

In [6]:
x = x.new_ones(5, 3, dtype=torch.double) 
y = torch.rand(5, 3)
print(x + y)

tensor([[1.0281, 1.8720, 1.9095],
        [1.3242, 1.6702, 1.5501],
        [1.5275, 1.8434, 1.7097],
        [1.8846, 1.1496, 1.8425],
        [1.3632, 1.7262, 1.0422]], dtype=torch.float64)


Another way to write addition:


In [9]:
print(torch.add(x, y))

tensor([[ 1.7113, -1.5490,  1.4009],
        [ 2.4590,  1.6504,  2.6889],
        [-0.3609,  0.4950, -0.3357],
        [-0.5029, -0.3086, -0.1498],
        [ 1.2850, -0.3189, -0.8868]])


Addition with an explicit output tensor:

In [10]:
result = torch.empty(5, 3)
torch.add(x, y, out=result)
print(result)

tensor([[ 1.7113, -1.5490,  1.4009],
        [ 2.4590,  1.6504,  2.6889],
        [-0.3609,  0.4950, -0.3357],
        [-0.5029, -0.3086, -0.1498],
        [ 1.2850, -0.3189, -0.8868]])


In-place addition

In [11]:
y.add_(x)                              # every in-place operation ends with ``_``; e.g. ``x.copy_(y)`` and ``x.t_()`` both modify ``x``
print(y)

tensor([[ 1.7113, -1.5490,  1.4009],
        [ 2.4590,  1.6504,  2.6889],
        [-0.3609,  0.4950, -0.3357],
        [-0.5029, -0.3086, -0.1498],
        [ 1.2850, -0.3189, -0.8868]])


NumPy-style indexing works on PyTorch tensors.

In [12]:
print(x[:, 1])

tensor([-2.4772,  1.3981, -0.0291, -0.7352, -0.5149])


Resizing: to resize or reshape a tensor, use ``Tensor.view``:

In [13]:
x = torch.randn(4, 4)
y = x.view(16)
z = x.view(-1, 8)                                  # the size -1 is inferred from other dimensions
print(x.size(), y.size(), z.size())

torch.Size([4, 4]) torch.Size([16]) torch.Size([2, 8])
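
A small sketch of what ``view`` does under the hood may help here. It assumes nothing beyond ``torch`` itself; the variable names (`flat`, `xt`, `flat_t`, `view_failed`) are illustrative only. It shows that a view shares memory with the original tensor, and that ``reshape`` can be used when the memory layout does not allow a view.

```python
import torch

x = torch.arange(6).view(2, 3)   # 2x3 tensor holding 0..5
flat = x.view(-1)                # a view: no data is copied
flat[0] = 100                    # writing through the view changes x as well

xt = x.t()                       # the transpose is not contiguous in memory
try:
    xt.view(-1)                  # view needs a compatible memory layout
    view_failed = False
except RuntimeError:
    view_failed = True

flat_t = xt.reshape(-1)          # reshape falls back to a copy when a view is impossible
print(x[0, 0].item(), view_failed, flat_t.tolist())
```

If a tensor may be non-contiguous (for example after a transpose), ``reshape`` is usually the safer choice; ``view`` is useful when you want a guarantee that no copy is made.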


If a tensor holds exactly one element, ``.item()`` returns its value as a Python number.

In [14]:
x = torch.randn(1)
print(x)
print(x.item())

tensor([0.4726])
0.4726296067237854


**Further reading**


  Many more Tensor operations, including transposing, indexing, slicing,
  mathematical operations, linear algebra, and random number generation, are described at
  <https://pytorch.org/docs/torch>.

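### 3) Converting between NumPy arrays and Tensors


Converting between a Torch Tensor and a NumPy array is easy.

A CPU Torch Tensor and the NumPy array converted from it share the same underlying memory, so changing one also changes the other.

Converting a Torch Tensor to a NumPy array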


In [7]:
a = torch.ones(5)
print(a)

tensor([1., 1., 1., 1., 1.])


In [16]:
b = a.numpy()
print(b)

[1. 1. 1. 1. 1.]


Modify the tensor in place and watch the NumPy array change with it.

In [17]:
a.add_(1)
print(a)
print(b)                                                    # b changed too, because it shares memory with a

tensor([2., 2., 2., 2., 2.])
[2. 2. 2. 2. 2.]


Converting a NumPy ndarray to a Torch Tensor

In [18]:
import numpy as np
a = np.ones(5)                                              # NumPy's ones; PyTorch has its own torch.ones
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a)
print(b)

[2. 2. 2. 2. 2.]
tensor([2., 2., 2., 2., 2.], dtype=torch.float64)


All Tensors on the CPU can be converted to NumPy arrays and back.
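
One practical caveat, sketched below under the assumption of a recent PyTorch version: a tensor that requires gradients cannot be converted with ``.numpy()`` directly, and a tensor on a GPU has to be moved to the CPU first. A common pattern is ``detach().cpu().numpy()``; the names `direct_ok` and `arr` are just for illustration.

```python
import torch

t = torch.ones(3, requires_grad=True)

try:
    t.numpy()                      # tensors that require grad cannot be converted directly
    direct_ok = True
except RuntimeError:
    direct_ok = False

# Detach from the computation graph, make sure the data is on the CPU, then convert.
arr = t.detach().cpu().numpy()
print(direct_ok, arr)
```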

### 4) CUDA Tensors

Tensors can be moved to another device with the ``.to`` method.

In [19]:
# let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
    x = x.to(device)                       # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))       # ``.to`` can also change dtype together!

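## 3. Warm-up: a two-layer neural network in numpy


A fully connected ReLU network with one hidden layer and no bias, trained to predict y from x with an L2 loss:
- $h = xW_1$
- $h\_ReLU = \max\{0, h\}$
- $\hat{y} = h\_ReLU \, W_2$

This implementation uses only numpy to compute the forward pass, the loss, and the backward pass.

A numpy ndarray is a plain n-dimensional array. It knows nothing about deep learning, gradients, or computation graphs; it is just a data structure for numerical computation.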
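
Since the backward pass has to be written by hand in this warm-up, it may help to see where the gradient formulas come from. The following is a sketch of the derivation, assuming the row-vector convention that the numpy code uses ($h = xW_1$, $\hat{y} = h\_ReLU \, W_2$) and a summed squared-error loss; $\odot$ denotes the element-wise product and $\mathbb{1}[\cdot]$ an indicator.

$$
\begin{aligned}
L &= \sum_{i,j} (\hat{y}_{ij} - y_{ij})^2 \\
\frac{\partial L}{\partial \hat{y}} &= 2\,(\hat{y} - y) \\
\frac{\partial L}{\partial W_2} &= (h\_ReLU)^{\top} \, \frac{\partial L}{\partial \hat{y}} \\
\frac{\partial L}{\partial h\_ReLU} &= \frac{\partial L}{\partial \hat{y}} \, W_2^{\top} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial h\_ReLU} \odot \mathbb{1}[h \ge 0] \\
\frac{\partial L}{\partial W_1} &= x^{\top} \, \frac{\partial L}{\partial h}
\end{aligned}
$$

Each line corresponds to one statement in the backward pass of the code: the factor 2 comes from differentiating the square, and the indicator zeroes the gradient wherever the ReLU was inactive.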



In [13]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)                                   # 64 * 1000
y = np.random.randn(N, D_out)                                  # 64 * 10

# Randomly initialize weights                   
w1 = np.random.randn(D_in, H)                                  # 1000 * 100
w2 = np.random.randn(H, D_out)                                 # 100 * 10

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)                                              # 64 * 100
    h_relu = np.maximum(h, 0)                                  # 64 * 100
    y_pred = h_relu.dot(w2)                                    # 64 * 10

    # Compute and print loss (a sum rather than a mean, which keeps the gradient simple)
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    # Reference: https://www.cnblogs.com/pinard/p/10750718.html
    loss_grad_y_pred = 2.0 * (y_pred - y)                                             
    loss_grad_w2 = h_relu.T.dot(loss_grad_y_pred)               
    loss_grad_h_relu = loss_grad_y_pred.dot(w2.T)
    loss_grad_h = loss_grad_h_relu.copy()
    loss_grad_h[h < 0] = 0
    loss_grad_w1 = x.T.dot(loss_grad_h)

    # Update weights
    w1 -= learning_rate * loss_grad_w1
    w2 -= learning_rate * loss_grad_w2
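
Hand-written backpropagation is easy to get wrong, so a finite-difference gradient check is a common sanity test. The sketch below is not part of the original lesson: it rebuilds the same network at a tiny, arbitrarily chosen size and compares the analytic gradient for `w1` with a numerical estimate. The helper `loss_fn` and the sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2           # tiny sizes keep the check fast
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1, w2):
    """Summed squared error of the two-layer ReLU network."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradient, same steps as the training loop above
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

# Numerical gradient by central differences, one weight at a time
eps = 1e-6
num_grad_w1 = np.zeros_like(w1)
for i in range(D_in):
    for j in range(H):
        w_plus, w_minus = w1.copy(), w1.copy()
        w_plus[i, j] += eps
        w_minus[i, j] -= eps
        num_grad_w1[i, j] = (loss_fn(w_plus, w2) - loss_fn(w_minus, w2)) / (2 * eps)

max_abs_diff = np.abs(grad_w1 - num_grad_w1).max()
print(max_abs_diff)                       # should be very close to zero
```

The same loop over `w2` would check `loss_grad_w2`. The check is only practical for small networks, because it needs two forward passes per weight.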

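## 4. A feed-forward neural network in PyTorch

### 1) PyTorch tensors

This time we use PyTorch tensors to build the feed-forward network, compute the loss, and run backpropagation.

A PyTorch Tensor is much like a numpy ndarray. The biggest difference is that a PyTorch Tensor can run on either the CPU or the GPU. To compute on the GPU, create the Tensor on a CUDA device or move it there.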


In [21]:
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    # print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 31704728.0
1 25331164.0
2 22378086.0
3 19262238.0
4 15348289.0
5 11017595.0
6 7356282.0
7 4705923.5
8 3027346.5
9 2012536.375
10 1409662.25
11 1041771.75
12 807321.0625
13 649262.0
14 536533.1875
15 451980.875
16 385983.53125
17 332925.53125
18 289368.1875
19 253030.78125
20 222354.703125
21 196214.3125
22 173766.515625
23 154378.140625
24 137539.375
25 122867.1015625
26 110037.3515625
27 98769.4921875
28 88842.109375
29 80063.15625
30 72279.015625
31 65361.66796875
32 59195.42578125
33 53687.4453125
34 48757.57421875
35 44338.4453125
36 40370.34765625
37 36803.1484375
38 33587.4453125
39 30684.1640625
40 28059.435546875
41 25683.255859375
42 23528.814453125
43 21570.8515625
44 19792.4296875
45 18175.244140625
46 16704.6640625
47 15364.2578125
48 14141.7509765625
49 13026.609375
50 12007.3115234375
51 11075.3896484375
52 10221.8857421875
53 9439.876953125
54 8722.13671875
55 8063.46826171875
56 7458.20703125
57 6901.8876953125
58 6390.34375
59 5919.4794921875
60 5485.79345703125
61 5

375 0.0002844816190190613
376 0.00027625024085864425
377 0.0002687727683223784
378 0.0002608516369946301
379 0.00025311342324130237
380 0.0002469048195052892
381 0.00024049097555689514
382 0.0002342124644201249
383 0.00022811403323430568
384 0.00022231723414734006
385 0.0002166029589716345
386 0.00021077181736472994
387 0.00020510501053649932
388 0.00020020001102238894
389 0.0001948442222783342
390 0.00018990584067068994
391 0.00018529882072471082
392 0.00018070911755785346
393 0.00017650797963142395
394 0.00017214834224432707
395 0.0001683011942077428
396 0.00016451899136882275
397 0.00016050187696237117
398 0.00015686434926465154
399 0.00015321985119953752
400 0.0001501761726103723
401 0.00014639270375482738
402 0.00014274154091253877
403 0.0001396275474689901
404 0.0001364489580737427
405 0.00013346801279112697
406 0.00013024920190218836
407 0.00012755846546497196
408 0.00012532222899608314
409 0.0001224723382620141
410 0.00011974618973908946
411 0.00011740042100427672
412 0.0001144


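### 2) PyTorch: Tensors and autograd

One of PyTorch's key features is autograd: once you have defined the forward pass and computed the loss, PyTorch can automatically compute the gradients of the loss with respect to all model parameters.

A PyTorch Tensor represents a node in the computation graph. If ``x`` is a Tensor with ``x.requires_grad=True``, then after a backward pass ``x.grad`` is another Tensor holding the gradient of some scalar (usually the loss) with respect to ``x``.


A simple autograd example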

In [22]:
# Create tensors.
x = torch.tensor(1., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

# Build a computational graph.
y = w * x + b    # y = 2 * x + 3

# Compute gradients.
y.backward()

# Print out the gradients.
print(x.grad)    # x.grad = 2 
print(w.grad)    # w.grad = 1 
print(b.grad)    # b.grad = 1 

tensor(2.)
tensor(1.)
tensor(1.)
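
One behaviour worth seeing in isolation is that ``backward()`` adds to ``.grad`` rather than overwriting it. The following minimal sketch (the variable names `first`, `second`, `third` are illustrative) calls ``backward()`` twice on the same expression and then resets the gradient.

```python
import torch

w = torch.tensor(2., requires_grad=True)

(3 * w).backward()       # d(3w)/dw = 3
first = w.grad.item()    # 3.0

(3 * w).backward()       # the new gradient is added to the stored one
second = w.grad.item()   # 6.0

w.grad.zero_()           # reset before the next backward pass
third = w.grad.item()    # 0.0

print(first, second, third)
```

This is why training loops zero the gradients once per iteration.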


In [11]:
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
# requires_grad defaults to False, so no gradients are computed for them during backprop.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for the weights.
# requires_grad=True means we want gradients for these Tensors during backprop.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors. This is the same
    # forward pass as before, but we do not need to keep references to intermediate
    # values because we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute the loss from the forward pass.
    # loss is a 0-dimensional (scalar) Tensor, i.e. its shape is ().
    # loss.item() returns its value as a Python number.
    loss = (y_pred - y).pow(2).sum()
    # print(t, loss.item())

    # Use autograd to compute the backward pass. backward() computes the gradient of
    # loss with respect to every Tensor with requires_grad=True. After this call,
    # w1.grad and w2.grad hold the gradients of the loss with respect to w1 and w2.
    loss.backward()

    # Update the weights manually with gradient descent (an automatic way comes later).
    #
    # Wrap the update in torch.no_grad(): w1 and w2 have requires_grad=True, but the
    # update itself (w1 -= learning_rate * w1.grad, w2 -= learning_rate * w2.grad)
    # should not be tracked by autograd.
    #
    # An alternative is to operate on weight.data and weight.grad.data.
    # tensor.data gives a tensor that shares storage with the original
    # but does not record history in the computation graph.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights;
        # otherwise each backward() call adds to the previous gradients.
        w1.grad.zero_()
        w2.grad.zero_()

0 30642318.0
1 26655638.0
2 23372702.0
3 18951928.0
4 13784028.0
5 9170854.0
6 5822788.0
7 3723261.5
8 2486424.25
9 1766218.75
10 1331391.0
11 1053403.375
12 862462.375
13 722628.8125
14 614955.5
15 528794.6875
16 458177.625
17 399186.0625
18 349321.09375
19 306805.28125
20 270350.8125
21 238968.0625
22 211810.15625
23 188173.53125
24 167540.3125
25 149474.046875
26 133631.046875
27 119707.4921875
28 107422.0625
29 96554.7890625
30 86918.8359375
31 78362.078125
32 70763.1171875
33 63986.265625
34 57931.71484375
35 52509.37890625
36 47647.5078125
37 43286.98828125
38 39369.26953125
39 35842.5859375
40 32663.876953125
41 29794.2578125
42 27200.099609375
43 24853.697265625
44 22731.751953125
45 20807.77734375
46 19061.01171875
47 17473.966796875
48 16031.42578125
49 14717.21875
50 13519.9296875
51 12427.6630859375
52 11430.892578125
53 10521.0419921875
54 9690.3408203125
55 8930.34765625
56 8234.3701171875
57 7596.94775390625
58 7011.93896484375
59 6475.44970703125
60 5983.0576171875
61 5


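### 3) PyTorch: nn


This time we use PyTorch's nn package to build the network.
The layers in nn hold their own weights, autograd still builds the computation graph,
and PyTorch computes the gradients for us automatically.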




In [24]:
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. With
# reduction='sum' it returns the sum of squared errors rather than the mean.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    # print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 616.8349609375
1 570.4186401367188
2 530.6421508789062
3 495.7164001464844
4 464.91497802734375
5 437.1092834472656
6 411.87066650390625
7 388.5781555175781
8 367.105224609375
9 347.06768798828125
10 328.3486328125
11 310.6429748535156
12 294.08880615234375
13 278.54046630859375
14 263.8558044433594
15 249.83802795410156
16 236.52313232421875
17 223.8170166015625
18 211.7015380859375
19 200.13755798339844
20 189.1465301513672
21 178.6802520751953
22 168.74122619628906
23 159.31674194335938
24 150.35125732421875
25 141.79025268554688
26 133.63401794433594
27 125.89380645751953
28 118.53340148925781
29 111.54275512695312
30 104.91582489013672
31 98.65790557861328
32 92.7421646118164
33 87.18020629882812
34 81.94192504882812
35 77.01036834716797
36 72.3639144897461
37 67.99095916748047
38 63.88977813720703
39 60.036468505859375
40 56.426231384277344
41 53.05012512207031
42 49.88925552368164
43 46.92338943481445
44 44.14652633666992
45 41.54481887817383
46 39.10710144042969
47 36.8310813

364 0.00028583104722201824
365 0.000278460793197155
366 0.00027128090732730925
367 0.00026430474827066064
368 0.0002575131948105991
369 0.0002509095938876271
370 0.00024448230396956205
371 0.00023822381626814604
372 0.0002321432693861425
373 0.00022622810502070934
374 0.0002204657648690045
375 0.0002148632047465071
376 0.00020940121612511575
377 0.00020409838180057704
378 0.0001989272132050246
379 0.00019389843509998173
380 0.00018900231225416064
381 0.0001842392230173573
382 0.00017960301192943007
383 0.00017508988094050437
384 0.00017069902969524264
385 0.00016641429101582617
386 0.00016225305444095284
387 0.000158198265125975
388 0.00015424926823470742
389 0.000150404914165847
390 0.0001466635148972273
391 0.00014301779447123408
392 0.00013946628314442933
393 0.00013601550017483532
394 0.00013265143206808716
395 0.00012937198334839195
396 0.00012617645552381873
397 0.0001230676716659218
398 0.0001200390252051875
399 0.00011708753299899399
400 0.00011421682575019076
401 0.00011141804
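
To see what ``nn.Sequential`` actually stores, the parameters can be listed by name. The sketch below rebuilds the same model and prints each parameter's shape. Note two differences from the hand-written versions earlier: ``nn.Linear`` includes a bias by default, and it stores its weight with shape ``(out_features, in_features)``. The name `num_params` is illustrative.

```python
import torch

D_in, H, D_out = 1000, 100, 10
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Parameter names are prefixed with the layer's index inside the Sequential container.
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)

# Total number of learnable values: 1000*100 + 100 + 100*10 + 10
num_params = sum(p.numel() for p in model.parameters())
print(num_params)
```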


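### 4) PyTorch: optim


This time we no longer update the model's weights by hand; instead the optim package updates the parameters for us.
The optim package provides many optimization algorithms, including SGD with momentum, RMSProp, and Adam.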


In [12]:
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    # print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its parameters
    optimizer.step()

0 595.8619384765625
1 579.458740234375
2 563.612060546875
3 548.3158569335938
4 533.6061401367188
5 519.296875
6 505.4200134277344
7 491.98504638671875
8 478.9765319824219
9 466.3656921386719
10 454.0737609863281
11 442.1894836425781
12 430.68804931640625
13 419.524169921875
14 408.7554626464844
15 398.3657531738281
16 388.31939697265625
17 378.4873352050781
18 368.9092102050781
19 359.59173583984375
20 350.52880859375
21 341.7240295410156
22 333.14678955078125
23 324.7843322753906
24 316.6291809082031
25 308.6772155761719
26 300.9322509765625
27 293.3556823730469
28 285.9695129394531
29 278.813232421875
30 271.8526611328125
31 265.1206970214844
32 258.5401916503906
33 252.11834716796875
34 245.84507751464844
35 239.69578552246094
36 233.69839477539062
37 227.81935119628906
38 222.08218383789062
39 216.4769744873047
40 211.0048370361328
41 205.65638732910156
42 200.43299865722656
43 195.33209228515625
44 190.35610961914062
45 185.51107788085938
46 180.78871154785156
47 176.170669555664
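
Because every optimizer in ``torch.optim`` exposes the same ``zero_grad()`` / ``step()`` interface, switching algorithms only changes the line that constructs the optimizer. The sketch below illustrates this with a hypothetical helper `train` that takes an optimizer factory; the step count, seed, and momentum value are arbitrary choices for the example, not settings from the lesson.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

def train(make_optimizer, steps=50):
    """Train a fresh two-layer model and return the loss at every step."""
    model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
    )
    loss_fn = torch.nn.MSELoss(reduction='sum')
    optimizer = make_optimizer(model.parameters())
    losses = []
    for _ in range(steps):
        loss = loss_fn(model(x), y)
        losses.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return losses

# Only the factory changes between the two runs.
adam_losses = train(lambda params: torch.optim.Adam(params, lr=1e-4))
sgd_losses = train(lambda params: torch.optim.SGD(params, lr=1e-4, momentum=0.9))

print(adam_losses[0], adam_losses[-1])
print(sgd_losses[0], sgd_losses[-1])
```

Suitable learning rates differ between optimizers, so a value that works for one is not guaranteed to work for another.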


### 5) PyTorch: custom nn Modules


We can define a model as a class that inherits from nn.Module. When a model is more complex than a plain sequence of layers, subclassing nn.Module is the way to define it.



In [26]:
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    # print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 656.6958618164062
1 608.1090087890625
2 566.172607421875
3 529.2335815429688
4 496.7382507324219
5 467.453125
6 440.5755310058594
7 416.12872314453125
8 393.6068420410156
9 372.708251953125
10 353.00006103515625
11 334.477783203125
12 316.97283935546875
13 300.36737060546875
14 284.6544189453125
15 269.65936279296875
16 255.33456420898438
17 241.66688537597656
18 228.60800170898438
19 216.09536743164062
20 204.13780212402344
21 192.75645446777344
22 181.89234924316406
23 171.58370971679688
24 161.7939453125
25 152.4780731201172
26 143.59371948242188
27 135.14727783203125
28 127.13992309570312
29 119.55585479736328
30 112.37797546386719
31 105.62073516845703
32 99.24383544921875
33 93.24134826660156
34 87.58341979980469
35 82.25212860107422
36 77.24210357666016
37 72.55087280273438
38 68.1427230834961
39 64.00277709960938
40 60.1308479309082
41 56.49887466430664
42 53.0952033996582
43 49.906524658203125
44 46.91959762573242
45 44.11970520019531
46 41.50297164916992
47 39.0628700256347

375 0.0006505012279376388
376 0.0006343786371871829
377 0.0006186614627949893
378 0.0006033276440575719
379 0.0005883832345716655
380 0.0005738206673413515
381 0.0005596213741227984
382 0.0005457888473756611
383 0.0005322962533682585
384 0.0005191444652155042
385 0.000506328884512186
386 0.0004938290221616626
387 0.00048163760220631957
388 0.0004697689728345722
389 0.0004582055553328246
390 0.00044691533548757434
391 0.00043590739369392395
392 0.0004251690406817943
393 0.0004147063591517508
394 0.00040450665983371437
395 0.0003945553908124566
396 0.0003848606429528445
397 0.00037539892946369946
398 0.0003661849768832326
399 0.00035720854066312313
400 0.000348439411027357
401 0.0003398970584385097
402 0.00033156739664264023
403 0.0003234421892557293
404 0.0003155224258080125
405 0.00030779733788222075
406 0.0003002593875862658
407 0.00029291390092112124
408 0.00028574312455020845
409 0.0002787590492516756
410 0.000271946337306872
411 0.00026530082686804235
412 0.00025882109184749424
413
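
As an illustration of a model that a plain ``Sequential`` cannot express directly, here is a sketch of a network with a skip connection. The class `ResidualNet` and its layer names are hypothetical and not part of the lesson; the point is only that ``forward`` can contain arbitrary Python and Tensor operations.

```python
import torch

class ResidualNet(torch.nn.Module):
    """Hypothetical example: one hidden block with a skip connection."""

    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.input_layer = torch.nn.Linear(D_in, H)
        self.hidden_layer = torch.nn.Linear(H, H)
        self.output_layer = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h = self.input_layer(x).clamp(min=0)
        # Skip connection: add the block's input to the block's output.
        h = h + self.hidden_layer(h).clamp(min=0)
        return self.output_layer(h)

model = ResidualNet(1000, 100, 10)
y_pred = model(torch.randn(64, 1000))
print(y_pred.shape)
```

Any ``nn.Module`` assigned as an attribute in ``__init__`` is registered automatically, so ``model.parameters()`` returns the weights and biases of all three layers and the training loop above works unchanged.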

# FizzBuzz

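FizzBuzz is a simple counting game. Count up from 1: say "fizz" for a multiple of 3, "buzz" for a multiple of 5, "fizzbuzz" for a multiple of 15, and otherwise just say the number.

We can first write a small program that decides whether to return the number itself or fizz, buzz, or fizzbuzz.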

In [32]:
# Encode the desired output as a class index: 0 = number, 1 = "fizz", 2 = "buzz", 3 = "fizzbuzz"
def fizz_buzz_encode(i):
    if   i % 15 == 0: return 3
    elif i % 5  == 0: return 2
    elif i % 3  == 0: return 1
    else:             return 0
    
def fizz_buzz_decode(i, prediction):
    return [str(i), "fizz", "buzz", "fizzbuzz"][prediction]

print(fizz_buzz_decode(1, fizz_buzz_encode(1)))
print(fizz_buzz_decode(2, fizz_buzz_encode(2)))
print(fizz_buzz_decode(5, fizz_buzz_encode(5)))
print(fizz_buzz_decode(12, fizz_buzz_encode(12)))
print(fizz_buzz_decode(15, fizz_buzz_encode(15)))

1
2
buzz
fizz
fizzbuzz


First we define the model's inputs and outputs (the training data).

In [33]:
import numpy as np
import torch

NUM_DIGITS = 10

# Represent each input by an array of its binary digits.
def binary_encode(i, num_digits):
    return np.array([i >> d & 1 for d in range(num_digits)])   # 2 becomes 0100000000: least significant bit first, which is fine for the model

trX = torch.Tensor([binary_encode(i, NUM_DIGITS) for i in range(101, 2 ** NUM_DIGITS)])   # 10 binary digits per number
trY = torch.LongTensor([fizz_buzz_encode(i) for i in range(101, 2 ** NUM_DIGITS)])        # class index 0, 1, 2 or 3; note that this is a LongTensor

In [34]:
print (trY.shape)

torch.Size([923])
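
To make the input encoding concrete, the sketch below repeats `binary_encode` and adds a hypothetical inverse, `binary_decode`, which is not part of the lesson and is used here only to confirm that the encoding loses no information. It also shows where the 923 training examples come from.

```python
import numpy as np

def binary_encode(i, num_digits):
    # Bit d of i, for d = 0 .. num_digits - 1 (least significant bit first)
    return np.array([i >> d & 1 for d in range(num_digits)])

def binary_decode(bits):
    # Hypothetical inverse of binary_encode, for illustration only
    return int(sum(int(b) << d for d, b in enumerate(bits)))

print(binary_encode(2, 10))                      # [0 1 0 0 0 0 0 0 0 0]
print(binary_encode(5, 10))                      # [1 0 1 0 0 0 0 0 0 0]
print(binary_decode(binary_encode(777, 10)))     # 777
print(len(range(101, 2 ** 10)))                  # 923 training examples: 101 .. 1023
```

Training on 101 to 1023 and testing on 1 to 100 keeps the test numbers out of the training set.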


Next we define the model in PyTorch.

In [35]:
# Define the model
NUM_HIDDEN = 100
model = torch.nn.Sequential(
    torch.nn.Linear(NUM_DIGITS, NUM_HIDDEN),
    torch.nn.ReLU(),
    torch.nn.Linear(NUM_HIDDEN, 4)
)

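- To teach the model to play FizzBuzz, we need a loss function and an optimization algorithm.
- The optimizer repeatedly adjusts the parameters to reduce the loss, so that the model reaches as low a loss as possible on this task.
- A low loss usually means the model is doing well; a high loss means it is doing poorly.
- FizzBuzz is essentially a classification problem, so we use the cross-entropy loss.
- For the optimizer we use stochastic gradient descent (SGD).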
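
The input format of ``CrossEntropyLoss`` is a frequent source of confusion, so here is a minimal sketch with made-up numbers. The loss takes raw scores (logits) of shape ``(batch, classes)`` and integer class indices of shape ``(batch,)``; it applies log-softmax internally, so the model should not end with a softmax layer. The values of `logits` and `targets` are invented for the example.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1, -1.0],
                       [0.0, 0.0, 3.0, 0.0]])   # shape (2, 4): raw scores for 4 classes
targets = torch.tensor([0, 2])                    # shape (2,): class indices, dtype int64

loss = torch.nn.CrossEntropyLoss()(logits, targets)

# The same quantity by hand: mean negative log-probability of the true class
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), targets].mean()

print(loss.item(), manual.item())
```

This is why the training targets are stored as a ``LongTensor`` of class indices rather than as one-hot vectors.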

In [36]:
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr = 0.05)

Here is the training code.

In [37]:
# Start training it
BATCH_SIZE = 128
for epoch in range(10000):
    for start in range(0, len(trX), BATCH_SIZE):
        end = start + BATCH_SIZE  
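        batchX = trX[start:end]                           # slicing past the end is clipped, so the last batch is simply smaller
        batchY = trY[start:end]

        y_pred = model(batchX)
        loss = loss_fn(y_pred, batchY)                    # note the shapes: y_pred is (batch, 4) logits, batchY is (batch,) class indices; see https://pytorch.org/docs/stable/nn.html?highlight=torch%20nn%20crossentropyloss#torch.nn.CrossEntropyLoss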

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Find loss on training data
    loss = loss_fn(model(trX), trY).item()
    if (epoch%1000==0):
        print('Epoch:', epoch, 'Loss:', loss)

Epoch: 0 Loss: 1.217276930809021
Epoch: 1000 Loss: 0.5164196491241455
Epoch: 2000 Loss: 0.12073612958192825
Epoch: 3000 Loss: 0.05368366837501526
Epoch: 4000 Loss: 0.03037494607269764
Epoch: 5000 Loss: 0.01970599591732025
Epoch: 6000 Loss: 0.014038625173270702
Epoch: 7000 Loss: 0.010669323615729809
Epoch: 8000 Loss: 0.008475510403513908
Epoch: 9000 Loss: 0.0069842347875237465


Finally, we use the trained model to play FizzBuzz on the numbers 1 to 100.

In [38]:
# Output now
testX = torch.Tensor([binary_encode(i, NUM_DIGITS) for i in range(1, 101)])
with torch.no_grad():
    testY = model(testX)
predictions = zip(range(1, 101), list(testY.max(1)[1].data.tolist()))

print([fizz_buzz_decode(i, x) for (i, x) in predictions])

['1', '2', 'fizz', '4', 'buzz', 'fizz', '7', '8', 'fizz', 'buzz', '11', 'fizz', '13', '14', 'fizzbuzz', '16', '17', 'fizz', '19', 'buzz', 'fizz', '22', 'fizz', 'fizz', 'buzz', '26', 'fizz', '28', '29', 'fizzbuzz', '31', '32', 'fizz', '34', 'buzz', 'fizz', '37', '38', 'fizz', 'buzz', '41', 'fizz', '43', '44', 'fizzbuzz', '46', '47', 'fizz', '49', 'buzz', 'fizz', '52', '53', 'fizz', 'buzz', '56', 'fizz', '58', '59', 'fizzbuzz', '61', '62', 'fizz', '64', 'buzz', 'fizz', '67', 'buzz', 'fizz', 'buzz', '71', 'fizz', '73', '74', 'fizzbuzz', '76', '77', 'fizz', '79', 'buzz', 'fizz', '82', '83', 'fizz', 'buzz', '86', 'fizz', '88', '89', 'fizzbuzz', '91', '92', 'fizz', '94', 'buzz', 'buzz', '97', '98', 'fizz', 'buzz']


In [41]:
testY.max(1)                 # note that max returns two parts: values and indices

torch.return_types.max(
values=tensor([ 6.3046,  9.3019, 10.6227,  5.2429,  7.0908,  3.6957,  7.6848,  6.7376,
         8.8358,  7.0454,  8.9054,  6.5258, 11.1373,  8.1076,  6.8863,  8.8270,
         3.7558, 10.3897,  4.4458,  7.7029,  8.3061,  8.3061,  3.9069,  8.4617,
         5.7278,  8.7762, 10.1094, 12.1703,  7.7449,  5.7961,  6.3213,  5.9292,
         5.7895,  5.8037,  4.9718,  6.1873,  8.1670,  5.9121,  5.9668,  6.4944,
         6.8159,  6.3422,  9.4932,  6.0991,  5.6619,  6.2464,  4.7406,  9.2211,
         6.8329,  6.7497,  5.2758,  8.7305,  5.2762,  4.1747,  7.3745,  8.2549,
         9.0619,  8.5187,  8.2312,  5.6800,  8.1132,  8.1878,  7.8953,  6.5253,
        10.1245,  4.0647,  6.1315,  8.3083,  8.7441,  8.8598,  7.2043,  7.6296,
        10.1160,  5.1331,  4.9104,  7.5242,  7.2350,  8.3422,  6.0486, 10.9966,
         4.0352,  8.2559,  5.0356,  7.3801, 10.1361,  8.6810,  6.2584, 11.9410,
         9.7580,  6.9164,  5.4353,  7.1868,  8.0638,  6.0965,  2.9596,  2.9086,
        1

In [42]:
print(np.sum(testY.max(1)[1].numpy() == np.array([fizz_buzz_encode(i) for i in range(1,101)])))
testY.max(1)[1].numpy() == np.array([fizz_buzz_encode(i) for i in range(1,101)])

97


array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True])

[Reference](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html)