# Learning PyTorch With Examples

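This tutorial introduces the fundamental concepts of PyTorch through self-contained examples.  
At its core, PyTorch provides two main features:  
1. An n-dimensional Tensor, similar to a NumPy array, but able to run on GPUs.  
2. Automatic differentiation for building and training neural networks.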
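
As a minimal sketch of the first feature, the snippet below builds a Tensor from a NumPy array and checks that a couple of Tensor operations agree with their NumPy counterparts. The array contents and variable names are made up for illustration only.

```python
import numpy as np
import torch

# A small NumPy array (illustrative values) and a Tensor that wraps it.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)

# Many NumPy-style operations have a direct Tensor counterpart.
row_sums = t.sum(dim=1)   # comparable to a.sum(axis=1)
product = t @ t.T         # comparable to a @ a.T

# Convert back to NumPy and compare the results.
same_sums = np.allclose(row_sums.numpy(), a.sum(axis=1))
same_product = np.allclose(product.numpy(), a @ a.T)
print(same_sums, same_product)
```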

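We will use a fully connected ReLU network as our running example. The network has a single hidden layer and is trained with gradient descent to fit random data by minimizing the Euclidean distance between the predicted output and the true output.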

### Tensors
### Warm-up: NumPy

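Before introducing PyTorch, we first implement the network using NumPy.  
NumPy provides an n-dimensional array object and many functions for manipulating these arrays. NumPy is a generic framework for scientific computing; it knows nothing about computation graphs, deep learning, or gradients. However, we can easily use NumPy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using NumPy operations.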


In [1]:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2


0 26103590.064815007
1 23805273.064857863
2 26517348.65556434
3 30887937.04698555
4 32984005.869947895
5 29270028.258647274
6 20642149.12989001
7 11743797.427902006
8 5961543.015416406
9 3050507.243194424
10 1751967.546260971
11 1168103.0738413478
12 880534.3068067222
13 716554.0658200237
14 607530.1942493559
15 526274.7633681342
16 461391.4232553778
17 407535.013027256
18 361886.5653740581
19 322707.6775668518
20 288824.1733967946
21 259326.32431755072
22 233523.5503189772
23 210896.7822708857
24 190954.83593800035
25 173295.1519419413
26 157611.4589557179
27 143643.22587169393
28 131170.9246054735
29 120004.69315965727
30 109981.65415377416
31 100963.14500437843
32 92840.66693073738
33 85502.83187092925
34 78859.37269018675
35 72835.77237246887
36 67368.51478877381
37 62391.14551244213
38 57854.19039917317
39 53708.27493851545
40 49916.57115363956
41 46445.70428442887
42 43260.18987011299
43 40335.84842312551
44 37649.880808069036
45 35176.17400418781
46 32896.535240222016
47 30791.5

379 0.3694502256396369
380 0.3593188038532833
381 0.34946951626711875
382 0.33989368355495736
383 0.3305850648239625
384 0.3215349616897558
385 0.3127363406679033
386 0.30418280146290455
387 0.2958657421046711
388 0.28778427654711136
389 0.27992283625499464
390 0.27227833201782664
391 0.2648464408137769
392 0.25761993156333307
393 0.25059349136091724
394 0.24376287055898227
395 0.23711969255571666
396 0.23066024107796648
397 0.22437960395598378
398 0.21827197328939368
399 0.21233360427180106
400 0.20655881779313107
401 0.2009437041935925
402 0.19548612568320894
403 0.19017592526447935
404 0.18501204657115408
405 0.17999021725107178
406 0.17510651425858864
407 0.17035770042667553
408 0.1657388200992652
409 0.16124700065975722
410 0.15687896640609078
411 0.15263032664653958
412 0.14849883690664822
413 0.1444803336017757
414 0.1405721545074632
415 0.13677294273474214
416 0.13307620613174576
417 0.12948074658425504
418 0.12598350979502085
419 0.12258179247096797
420 0.1192734571038664
421 
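
The hand-written backward pass in the NumPy example can be sanity-checked numerically. The sketch below is not part of the original tutorial: it recomputes the gradient with respect to w1 using the same steps as the training loop and compares it with central finite differences. The tiny layer sizes, the seed, and the helper name `loss_fn` are assumptions chosen only to keep the check fast.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small, hypothetical sizes so the finite-difference loop stays cheap.
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    """Sum-of-squares loss of the two-layer ReLU network."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


# Analytic gradient of the loss w.r.t. w1, same steps as the training loop.
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

# Central finite differences, perturbing one entry of w1 at a time.
eps = 1e-6
num_grad_w1 = np.zeros_like(w1)
for i in range(D_in):
    for j in range(H):
        w_plus, w_minus = w1.copy(), w1.copy()
        w_plus[i, j] += eps
        w_minus[i, j] -= eps
        num_grad_w1[i, j] = (loss_fn(w_plus, w2) - loss_fn(w_minus, w2)) / (2 * eps)

max_abs_err = np.abs(grad_w1 - num_grad_w1).max()
print(max_abs_err)
```

A small maximum error suggests the manual derivation is consistent with the loss; the check can misbehave if a pre-activation happens to sit almost exactly at the ReLU kink.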

### PyTorch: Tensors

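NumPy is a great framework, but it cannot use GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so NumPy alone is not enough for modern deep learning. Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually similar to a NumPy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they are also useful as a generic tool for scientific computing.

Unlike NumPy, PyTorch Tensors can use GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you simply need to place it on the right device, for example by passing a `device` argument when creating it.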
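
A minimal sketch of device placement, assuming nothing about the machine it runs on: it picks the first GPU only when PyTorch reports one and otherwise stays on the CPU. The tensor shapes are arbitrary.

```python
import torch

# Use the first GPU when PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create a Tensor directly on the chosen device...
x = torch.randn(3, 4, device=device)

# ...or move an existing CPU Tensor there with .to().
w = torch.randn(4, 2).to(device)

# Both operands of an operation must live on the same device.
out = x.mm(w)
print(out.device, out.shape)
```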

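Here we use PyTorch Tensors to fit a two-layer network to random data. As in the NumPy example above, we need to manually implement the forward and backward passes through the network.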

In [2]:
# -*- coding: utf-8 -*-

import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

99 734.4877319335938
199 5.479307651519775
299 0.06713082641363144
399 0.001289833802729845
499 0.0001208614485221915


### Autograd  
### PyTorch: Tensors and autograd  

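In the examples above, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but it quickly gets very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network defines a computational graph; nodes in the graph are Tensors, and edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then lets you easily compute gradients.

This sounds complicated, but it is quite simple to use in practice. Each Tensor represents a node in a computational graph. If x is a Tensor with x.requires_grad=True, then after a backward pass x.grad is another Tensor holding the gradient of some scalar value with respect to x.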
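
A tiny illustration of `requires_grad` and `.grad`, using made-up values: for the scalar sum of x squared, the gradient with respect to x is 2x.

```python
import torch

# A leaf Tensor whose gradient we want autograd to track.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Any scalar built from x defines a small computational graph.
loss = (x ** 2).sum()

# Backpropagate through the graph; this fills in x.grad.
loss.backward()

print(x.grad)  # gradient of sum(x^2) with respect to x, i.e. 2 * x
```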

Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network.


In [3]:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional Tensor (shape ())
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

99 522.6221313476562
199 2.693244457244873
299 0.02032787725329399
399 0.00037950018304400146
499 5.442695692181587e-05


### PyTorch: Defining new autograd functions

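Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of some scalar value with respect to the output Tensors, and computes the gradient of that same scalar value with respect to the input Tensors.

In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions.

We can then use our new autograd operator by calling its apply method like a function, passing in Tensors containing the input data.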
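
As a hedged sketch of this forward/backward pairing, the hypothetical `Cube` operator below (not part of the original tutorial) implements y = x^3 by hand and uses `torch.autograd.gradcheck` to compare the hand-written backward against numerical gradients. Double precision is used because gradcheck is sensitive to rounding error.

```python
import torch


class Cube(torch.autograd.Function):
    """Hypothetical custom operator computing y = x ** 3."""

    @staticmethod
    def forward(ctx, input):
        # Keep the input around; backward needs it to form the derivative.
        ctx.save_for_backward(input)
        return input ** 3

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        # Chain rule: dL/dx = dL/dy * dy/dx, with dy/dx = 3 * x^2.
        return grad_output * 3 * input ** 2


x = torch.randn(5, dtype=torch.double, requires_grad=True)

# gradcheck returns True when analytic and numerical gradients agree
# within tolerance, and raises an error otherwise.
ok = torch.autograd.gradcheck(Cube.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```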

In the following example we define our own custom autograd function that implements the ReLU nonlinearity, and use it in our two-layer network.

In [4]:
# -*- coding: utf-8 -*-
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. Tensors needed in the
        backward pass should be cached with the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

99 325.7989807128906
199 0.6053511500358582
299 0.0021314057521522045
399 8.822407107800245e-05
499 2.3329428586293943e-05


### TensorFlow: Static Graphs

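PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs (in the TensorFlow 1.x graph API used below) are static, while PyTorch's are dynamic.

In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.

Static graphs are nice because you can optimize the graph up front; for example, a framework might decide to fuse some graph operations for efficiency, or come up with a strategy for distributing the graph across many GPUs or many machines. If you reuse the same graph over and over, this potentially costly up-front optimization can be amortized as the graph is rerun many times.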

One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform a different computation for each data point; for example, a recurrent network might be unrolled for a different number of time steps for each data point, and this unrolling can be implemented as a loop. With a static graph the loop construct needs to be part of the graph; for this reason TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build graphs on the fly for each example, we can use normal imperative flow control to perform computation that differs for each input.
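
A small PyTorch sketch of this point, with a hypothetical helper `repeated_double`: an ordinary Python loop decides how many operations go into the graph, and autograd differentiates whatever graph was actually built on that call.

```python
import torch


def repeated_double(x, n_steps):
    """Multiply x by 2 n_steps times using an ordinary Python loop."""
    out = x
    for _ in range(n_steps):
        out = out * 2
    return out


grads = []
for n_steps in (1, 3):
    x = torch.tensor(1.0, requires_grad=True)
    y = repeated_double(x, n_steps)  # graph depth depends on n_steps
    y.backward()                     # d(2^n * x)/dx = 2^n
    grads.append(x.grad.item())

print(grads)  # [2.0, 8.0]
```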

To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple two-layer network:

In [6]:
# -*- coding: utf-8 -*-
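import numpy as np
import tensorflow.compat.v1 as tf

# This example uses the TensorFlow 1.x graph API. Under TensorFlow 2.x that API
# lives in tf.compat.v1; a plain `import tensorflow as tf` there fails with
# "module 'tensorflow' has no attribute 'placeholder'".
tf.disable_v2_behavior()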

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets
    # y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for t in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        if t % 100 == 99:
            print(t, loss_value)

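With TensorFlow 2.x and a plain `import tensorflow as tf`, this code fails with `AttributeError: module 'tensorflow' has no attribute 'placeholder'`, because the 1.x graph API (`placeholder`, `Session`, and so on) is only exposed under `tf.compat.v1`.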

### nn module
### PyTorch: nn

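Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters that will be optimized during learning.

In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the nn package serves this same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.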
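
To make the idea of a Module holding learnable parameters concrete, the sketch below inspects a single `torch.nn.Linear` layer and evaluates a summed squared-error loss. The layer sizes and the all-zeros target are arbitrary choices for illustration.

```python
import torch

# A Module that maps 4 input features to 3 output features.
layer = torch.nn.Linear(4, 3)

# Its internal state is a weight matrix and a bias vector.
shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(shapes)  # {'weight': (3, 4), 'bias': (3,)}

# Parameters are Tensors that autograd tracks by default.
all_trainable = all(p.requires_grad for p in layer.parameters())

# Calling the Module like a function runs its forward computation.
out = layer(torch.randn(2, 4))

# A loss function from nn; reduction='sum' adds up the squared errors.
loss_fn = torch.nn.MSELoss(reduction='sum')
target = torch.zeros(2, 3)
loss = loss_fn(out, target)
manual = (out - target).pow(2).sum()
print(torch.allclose(loss, manual))
```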

In this example we use the nn package to implement our two-layer network:

In [7]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. With
# reduction='sum' the squared errors are summed rather than averaged.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

99 3.708242893218994
199 0.0746285617351532
299 0.003060894785448909
399 0.00018285629630554467
499 1.3238582141639199e-05


### PyTorch: optim

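Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (with torch.no_grad() or .data to avoid tracking history in autograd). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.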
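
As a minimal sketch of what an optimizer does, the snippet below runs one plain SGD step on a single made-up parameter and shows that it matches the manual rule p <- p - lr * p.grad. The parameter value and learning rate are chosen only for easy arithmetic.

```python
import torch

# One learnable parameter and a plain SGD optimizer over it.
p = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([p], lr=0.1)

loss = (p ** 2).sum()

optimizer.zero_grad()  # clear any previously accumulated gradient
loss.backward()        # p.grad = 2 * p = 2.0
optimizer.step()       # p <- p - lr * p.grad = 1.0 - 0.1 * 2.0

print(p.item())
```

Swapping `torch.optim.SGD` for another optimizer class changes only the update rule applied inside `step()`; the zero_grad, backward, step pattern stays the same.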

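In this example we will use the nn package to define our model as before, but we will optimize the model using the Adam algorithm provided by the optim package: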

In [8]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

99 54.60757064819336
199 0.8584431409835815
299 0.009332788176834583
399 9.142134513240308e-05
499 3.3641069308032456e-07


### PyTorch: Custom nn Modules

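Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing nn.Module and defining a forward that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.

In this example we implement our two-layer network as a custom Module subclass: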


In [9]:
# -*- coding: utf-8 -*-
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

99 2.4334070682525635
199 0.05058251693844795
299 0.00185418373439461
399 8.191089727915823e-05
499 4.00275939682615e-06


### PyTorch: Control Flow + Weight Sharing

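As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully connected ReLU network that on each forward pass chooses a random number between 0 and 3 and applies a middle layer that many times, reusing the same weights each time to compute the innermost hidden layers. Counting the input layer, this gives between 1 and 4 hidden layers.

For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.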
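
A scalar-sized sketch of weight sharing, with made-up values: the same `torch.nn.Linear` is applied twice, so there is still only one weight, and its gradient collects the contributions from both uses.

```python
import torch

# A single 1x1 linear layer without bias, i.e. y = w * x, with w set to 3.
shared = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(3.0)

x = torch.tensor([[2.0]])

# Reuse the same Module twice: y = w * (w * x) = w^2 * x.
y = shared(shared(x))
y.sum().backward()

# Still one parameter, and its gradient is d(w^2 * x)/dw = 2 * w * x,
# the sum of the contributions from the two applications.
n_params = sum(p.numel() for p in shared.parameters())
grad = shared.weight.grad.item()
print(y.item(), n_params, grad)
```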

We can easily implement this model as a Module subclass:

In [10]:
# -*- coding: utf-8 -*-
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        # The super() call above runs the constructor of the parent class
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

99 12.992911338806152
199 0.784458339214325
299 1.714671015739441
399 0.3597424626350403
499 0.3711857795715332
