# PyTorch - Example Learning

**At its core, PyTorch provides two main features:**

- An n-dimensional Tensor, similar to a numpy array but able to run on GPUs
- Automatic differentiation for building and training neural networks
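
The two features above can be illustrated with a few lines. This is a minimal sketch with made-up values, not part of the original examples; the variable names `a`, `b`, `x`, and `out` are chosen only for illustration.

```python
import torch

# Feature 1: an n-dimensional Tensor (here a 2x3 array of floats).
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = a * 2 + 1  # element-wise arithmetic, much like a numpy array

# Feature 2: automatic differentiation.
# requires_grad=True asks autograd to record operations on x.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = (x ** 2).sum()  # out = x1^2 + x2^2 + x3^2
out.backward()        # fills x.grad with d(out)/dx = 2 * x

print(b.shape)  # shape of the 2x3 result
print(x.grad)   # gradient values 2, 4, 6
```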

##### Implementation Of Neural Network using Numpy

`Numpy` provides an n-dimensional array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:

In [1]:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(400):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 32874972.990253024
1 29934432.17806741
2 34440356.50382781
3 39906516.17699916
4 39442399.39236803
5 28919613.088503607
6 15710318.38694118
7 6914055.842159819
8 3152076.4110127212
9 1744844.7646032874
10 1191654.4807730294
11 922740.5578422977
12 757644.0277801667
13 638306.8213872982
14 544762.9488608971
15 468612.2119580572
16 405495.29098721454
17 352594.7132332511
18 307939.60510171193
19 269999.8182352121
20 237605.30127352514
21 209794.08024500107
22 185803.22181706264
23 165067.9142122714
24 147075.11462385923
25 131377.71942033648
26 117647.2431445288
27 105588.9241878282
28 94968.16549862134
29 85586.70180556727
30 77272.83950344141
31 69885.84786622631
32 63316.28737723447
33 57452.465112359096
34 52209.36813944816
35 47510.75584111583
36 43295.79504924922
37 39504.43228258435
38 36089.73210405444
39 33009.327597794116
40 30224.812480467408
41 27706.336002596734
42 25423.394998028794
43 23350.45046443734
44 21465.65079193267
45 19749.111787181795
46 18185.88906357286
47 16

378 0.0005415326284864458
379 0.0005188237395168117
380 0.000497074203490697
381 0.0004762653123912234
382 0.0004563155037644362
383 0.00043720757114518367
384 0.0004189039556465639
385 0.00040137311832945437
386 0.0003845826645665519
387 0.0003685004125747976
388 0.0003530959222285999
389 0.000338338923443009
390 0.00032420280856772124
391 0.0003106612688148385
392 0.00029770361940939267
393 0.00028527910143857417
394 0.00027337605209758566
395 0.0002619730462733623
396 0.0002510477808559969
397 0.00024058163539086136
398 0.00023055430325106196
399 0.00022094955107892044
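
One way to gain confidence in a hand-written backward pass like the one above is a finite-difference check. The sketch below is not part of the original example: it reuses the same gradient formulas on deliberately tiny sizes and compares them with numerical derivatives. The helper names `loss_fn` and `numeric_grad`, the sizes, and the seed are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # tiny sizes chosen only for this check
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    """Same forward pass and loss as the training loop."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


# Analytic gradients, using the same formulas as the training loop.
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)


def numeric_grad(f, w, eps=1e-6):
    """Central-difference gradient of f() with respect to array w (perturbed in place)."""
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old  # restore the original entry
        g[idx] = (plus - minus) / (2 * eps)
    return g


num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)

# Both differences should be very small if the backward pass is right.
print(np.max(np.abs(num_w1 - grad_w1)), np.max(np.abs(num_w2 - grad_w2)))
```

The check can fail spuriously if a hidden activation sits almost exactly at zero, where ReLU is not differentiable; with random data this is unlikely but worth keeping in mind.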


##### Implementation Of Neural Network using Pytorch Tensor

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won’t be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Like numpy arrays, PyTorch Tensors do not know anything about deep learning or computational graphs or gradients; they are a generic tool for scientific computing.

However, unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you simply need to specify the correct device.
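
As a rough sketch of how numpy arrays and Tensors relate, the snippet below converts an array to a Tensor, places it on a GPU when one is available (falling back to the CPU otherwise), and converts the result back. The array contents and variable names are illustrative assumptions.

```python
import numpy as np
import torch

# Use the first CUDA device if present; otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a_np = np.arange(6, dtype=np.float32).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
a = torch.from_numpy(a_np).to(device)  # numpy array -> Tensor on the chosen device
b = a.mm(a.t())                        # (2, 3) x (3, 2) -> (2, 2)
b_np = b.cpu().numpy()                 # move back to the CPU, then to numpy

print(b_np)  # [[ 5. 14.] [14. 50.]]
```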

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above we need to manually implement the forward and backward passes through the network:

In [2]:
# -*- coding: utf-8 -*-

import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(400):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 31996902.0
1 28662340.0
2 28561978.0
3 27235454.0
4 22849896.0
5 16232502.0
6 10060740.0
7 5778260.0
8 3337789.75
9 2056903.5
10 1389842.875
11 1022670.0625
12 802043.3125
13 656111.0
14 551150.5625
15 470531.15625
16 406076.6875
17 353120.15625
18 308818.71875
19 271375.46875
20 239401.28125
21 211914.578125
22 188207.796875
23 167621.859375
24 149689.203125
25 134009.15625
26 120249.84375
27 108142.6796875
28 97452.7578125
29 87998.3828125
30 79613.6640625
31 72159.921875
32 65509.41796875
33 59556.6875
34 54234.0
35 49462.28125
36 45180.4453125
37 41324.16015625
38 37846.40234375
39 34703.06640625
40 31860.484375
41 29285.71875
42 26950.76171875
43 24829.58203125
44 22899.248046875
45 21142.0234375
46 19540.572265625
47 18077.376953125
48 16739.169921875
49 15513.3828125
50 14389.2138671875
51 13357.7470703125
52 12409.9287109375
53 11538.7158203125
54 10737.0576171875
55 9998.0400390625
56 9316.794921875
57 8687.724609375
58 8106.67041015625
59 7569.591796875
60 7072.52490234375


378 0.006924356333911419
379 0.006689439993351698
380 0.006452304311096668
381 0.00622687628492713
382 0.006016206461936235
383 0.005808341316878796
384 0.005613172892481089
385 0.0054227616637945175
386 0.00523465545848012
387 0.005057323724031448
388 0.00488363578915596
389 0.004715883173048496
390 0.004558338783681393
391 0.004407003056257963
392 0.004259021952748299
393 0.0041180942207574844
394 0.00398328946903348
395 0.0038526516873389482
396 0.0037203796673566103
397 0.003600731259211898
398 0.0034776998218148947
399 0.003366249380633235


##### PyTorch: Tensors and autograd

A fully-connected ReLU network with one hidden layer and no biases, trained to predict y from x by minimizing squared Euclidean distance.

This implementation computes the forward pass using operations on PyTorch Tensors, and uses PyTorch autograd to compute gradients.

A PyTorch Tensor represents a node in a computational graph. If `x` is a Tensor that has `x.requires_grad=True`, then `x.grad` is another Tensor holding the gradient of some scalar value with respect to `x`.
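
A small sketch of `requires_grad` and `.grad` on made-up values may help before the full example. It also shows that repeated `backward()` calls accumulate into `.grad`, which is why training loops zero the gradients. The names `w`, `c`, `first`, and `second` are illustrative.

```python
import torch

w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
c = torch.tensor([0.5, 0.5, 0.5])  # requires_grad defaults to False

loss = (w * c).pow(2).sum()  # scalar: sum of (0.5 * w_i)^2
loss.backward()
first = w.grad.clone()       # d(loss)/dw = 2 * c^2 * w = 0.5 * w

# A second backward pass on a freshly built graph ADDS to w.grad.
loss = (w * c).pow(2).sum()
loss.backward()
second = w.grad.clone()      # now 2 * (0.5 * w) = w

w.grad.zero_()               # reset the accumulated gradient in place

print(first)   # values 0.5, -1.0, 1.5
print(second)  # values 1.0, -2.0, 3.0
print(c.grad)  # None, since c does not require gradients
```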

In [3]:
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(400):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a zero-dimensional Tensor (shape ()) holding a single value;
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 27880666.0
1 22674056.0
2 20353266.0
3 18265012.0
4 15405713.0
5 11993109.0
6 8647373.0
7 5918367.0
8 3953217.5
9 2654134.25
10 1827326.0
11 1306730.0
12 973480.125
13 753978.0625
14 603627.9375
15 496203.90625
16 416174.78125
17 354510.0
18 305483.65625
19 265620.9375
20 232552.96875
21 204729.859375
22 181049.859375
23 160739.046875
24 143193.28125
25 127957.0078125
26 114684.2890625
27 103045.5078125
28 92800.8046875
29 83763.1171875
30 75764.5
31 68657.90625
32 62329.23046875
33 56683.18359375
34 51632.54296875
35 47102.75390625
36 43033.01953125
37 39369.63671875
38 36066.1875
39 33084.40625
40 30386.751953125
41 27942.357421875
42 25723.0859375
43 23706.740234375
44 21872.541015625
45 20200.828125
46 18675.2890625
47 17281.123046875
48 16005.7646484375
49 14837.474609375
50 13766.49609375
51 12782.3681640625
52 11878.2158203125
53 11046.7919921875
54 10281.96875
55 9577.0068359375
56 8926.72265625
57 8326.3154296875
58 7771.57568359375
59 7259.27099609375
60 6785.34326171875
61

##### PyTorch: Defining new autograd functions

A fully-connected ReLU network with one hidden layer and no biases, trained to predict y from x by minimizing squared Euclidean distance.

This implementation computes the forward pass using operations on PyTorch Tensors, and uses PyTorch autograd to compute gradients.

In this implementation we implement our own custom autograd function to perform the ReLU function.

In [4]:
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache Tensors
        for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 32943748.0
1 31069718.0
2 35646584.0
3 40289768.0
4 37624872.0
5 26625168.0
6 13991263.0
7 6233779.0
8 2844255.75
9 1556081.25
10 1036383.4375
11 784857.6875
12 633895.5625
13 527715.125
14 446157.1875
15 380730.03125
16 327194.96875
17 282781.28125
18 245560.765625
19 214158.140625
20 187486.515625
21 164749.859375
22 145230.046875
23 128415.46875
24 113863.71875
25 101232.828125
26 90226.1171875
27 80607.1171875
28 72163.6953125
29 64734.96484375
30 58176.59375
31 52370.31640625
32 47218.39453125
33 42634.75390625
34 38551.375
35 34909.0234375
36 31649.009765625
37 28730.390625
38 26112.443359375
39 23757.646484375
40 21636.552734375
41 19723.3671875
42 17996.15234375
43 16434.41015625
44 15020.701171875
45 13739.5986328125
46 12577.4619140625
47 11521.7490234375
48 10562.7275390625
49 9692.0693359375
50 8901.599609375
51 8180.76220703125
52 7523.2763671875
53 6922.5927734375
54 6373.60546875
55 5871.50439453125
56 5411.85888671875
57 4990.732421875
58 4604.54248046875
59 4250.5493

442 6.172330176923424e-05
443 6.0682796174660325e-05
444 5.96632671658881e-05
445 5.860774763277732e-05
446 5.7910463510779664e-05
447 5.6837263400666416e-05
448 5.614368637907319e-05
449 5.532559225684963e-05
450 5.447082003229298e-05
451 5.378431160352193e-05
452 5.320545096765272e-05
453 5.223469634074718e-05
454 5.136742038303055e-05
455 5.0756672862917185e-05
456 4.9971607950283214e-05
457 4.9284801207249984e-05
458 4.85940690850839e-05
459 4.786878707818687e-05
460 4.714932947535999e-05
461 4.647169407689944e-05
462 4.5815126213710755e-05
463 4.535708649200387e-05
464 4.4620730477618054e-05
465 4.40411786257755e-05
466 4.3578140321187675e-05
467 4.280352732166648e-05
468 4.227770841680467e-05
469 4.1574134229449555e-05
470 4.1057948692468926e-05
471 4.0545430238125846e-05
472 4.021195854875259e-05
473 3.965546056861058e-05
474 3.9125581679400057e-05
475 3.858896889141761e-05
476 3.801032653427683e-05
477 3.751556869246997e-05
478 3.715454658959061e-05
479 3.666347765829414e-05
48
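
A custom autograd Function such as `MyReLU` can be checked numerically with `torch.autograd.gradcheck`, which compares the hand-written `backward` against finite differences. The sketch below repeats the class so that it runs on its own; the input values are an assumption, chosen to stay away from zero, where ReLU is not differentiable.

```python
import torch


class MyReLU(torch.autograd.Function):
    """Same custom ReLU as above, repeated so this snippet is self-contained."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


# gradcheck expects double-precision inputs that require gradients.
inp = torch.tensor(
    [[-2.0, -0.5, 0.7], [1.5, -1.0, 3.0]],
    dtype=torch.double,
    requires_grad=True,
)

ok = torch.autograd.gradcheck(MyReLU.apply, (inp,), eps=1e-6, atol=1e-4)
same_as_builtin = torch.equal(MyReLU.apply(inp), torch.relu(inp))
print(ok, same_as_builtin)
```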

# nn Module
##### nn

Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however for large neural networks raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters which will be optimized during learning.

In TensorFlow, packages like `Keras`, `TensorFlow-Slim`, and `TFLearn` provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In `PyTorch`, the `nn` package serves this same purpose. The `nn` package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The `nn` package also defines a set of useful loss functions that are commonly used when training neural networks.
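
To make "a Module holds learnable parameters" concrete, the sketch below inspects a single `torch.nn.Linear` layer and reproduces its output by hand. The layer sizes and the all-zero target are arbitrary choices for illustration.

```python
import torch

layer = torch.nn.Linear(4, 3)  # maps 4 input features to 3 outputs
x = torch.randn(2, 4)          # a batch of 2 inputs
out = layer(x)                 # shape (2, 3)

# The Module's internal state: a weight matrix and a bias vector,
# both created with requires_grad=True.
for name, p in layer.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)

# The same computation written with plain Tensor operations.
manual = x.mm(layer.weight.t()) + layer.bias

# A loss function from the nn package; reduction="sum" adds up the
# squared errors instead of averaging them.
loss_fn = torch.nn.MSELoss(reduction="sum")
loss = loss_fn(out, torch.zeros(2, 3))
print(loss.item())
```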

In this example we use the nn package to implement our two-layer network:

In [5]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

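# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
# reduction='sum' sums the squared errors rather than averaging them; it
# replaces the deprecated size_average=False argument.
loss_fn = torch.nn.MSELoss(reduction='sum')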

learning_rate = 1e-4
for t in range(400):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 637.2669677734375
1 590.0436401367188
2 548.80908203125
3 512.7425537109375
4 481.0378112792969
5 453.1461181640625
6 427.9104309082031
7 404.9233703613281
8 383.7756652832031
9 364.1868591308594
10 345.8730773925781
11 328.53778076171875
12 312.1357116699219
13 296.65521240234375
14 281.9488220214844
15 267.90557861328125
16 254.52845764160156
17 241.6984405517578
18 229.458740234375
19 217.8424835205078
20 206.71865844726562
21 196.05313110351562
22 185.82418823242188
23 176.02914428710938
24 166.65594482421875
25 157.70953369140625
26 149.16635131835938
27 141.0107879638672
28 133.2330780029297
29 125.83676147460938
30 118.80291748046875
31 112.10528564453125
32 105.71319580078125
33 99.65190887451172
34 93.898193359375
35 88.4475326538086
36 83.26880645751953
37 78.36988830566406
38 73.7337875366211
39 69.35096740722656
40 65.20748901367188
41 61.29609298706055
42 57.618892669677734
43 54.154850006103516
44 50.89242935180664
45 47.8221549987793
46 44.93335723876953
47 42.22098159

##### Optim - Optimizer

Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (with torch.no_grad() or .data to avoid tracking history in autograd). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.
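
To connect the optimizer abstraction to the manual updates used so far, the sketch below checks that one `torch.optim.SGD` step (without momentum) produces the same weights as `param -= learning_rate * param.grad`. The tiny model, data, and learning rate are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
lr = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

x = torch.randn(5, 3)
y = torch.randn(5, 1)

loss = (model(x) - y).pow(2).sum()
optimizer.zero_grad()
loss.backward()

# What the manual update rule would produce, computed before stepping.
expected = [p.detach() - lr * p.grad for p in model.parameters()]

optimizer.step()  # the optimizer applies the update in place

matches = [torch.allclose(p.detach(), e) for p, e in zip(model.parameters(), expected)]
print(matches)
```

Other optimizers such as Adam keep extra per-parameter state, so their steps do not reduce to this simple formula, but the `zero_grad()`, `backward()`, `step()` pattern stays the same.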

In this example we will use the nn package to define our model as before, but we will optimize the model using the Adam algorithm provided by the optim package:

In [6]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(400):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

0 733.1707153320312
1 715.3285522460938
2 698.0132446289062
3 681.1547241210938
4 664.71240234375
5 648.7755126953125
6 633.2326049804688
7 618.1083374023438
8 603.3302612304688
9 588.949951171875
10 575.0482788085938
11 561.5953979492188
12 548.5570068359375
13 535.9805297851562
14 523.7952880859375
15 511.9844970703125
16 500.5764465332031
17 489.4682922363281
18 478.68792724609375
19 468.1868591308594
20 457.9261169433594
21 447.9590759277344
22 438.2227478027344
23 428.7125549316406
24 419.42291259765625
25 410.33392333984375
26 401.4577331542969
27 392.7876892089844
28 384.33587646484375
29 376.1002502441406
30 368.0096435546875
31 360.0599365234375
32 352.3090515136719
33 344.7269287109375
34 337.312255859375
35 330.0553894042969
36 322.91796875
37 315.933837890625
38 309.1038818359375
39 302.385986328125
40 295.83074951171875
41 289.4211730957031
42 283.12957763671875
43 276.95843505859375
44 270.90631103515625
45 264.9493103027344
46 259.11175537109375
47 253.41818237304688
48 

##### PyTorch: Control Flow + Weight Sharing

To showcase the power of PyTorch dynamic graphs, we will implement a very strange model: a fully-connected ReLU network that on each forward pass randomly chooses a number between 1 and 4 and has that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.
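
Before the full model, here is a stripped-down sketch of the same idea: one `Linear` layer is applied a random number of times per step, yet the set of parameters never grows, and autograd sums the gradient contributions from every use. The layer size, depth range, and seed are illustrative assumptions.

```python
import random
import torch

random.seed(0)
layer = torch.nn.Linear(3, 3)  # a single layer whose weights are reused
x = torch.randn(2, 3)

for step in range(3):
    depth = random.randint(1, 3)  # ordinary Python control flow picks the depth
    h = x
    for _ in range(depth):
        h = layer(h).clamp(min=0)  # the same weights at every depth

    layer.zero_grad()
    h.sum().backward()  # gradients from all uses are accumulated together
    print(step, depth, tuple(layer.weight.grad.shape))

# Still just one weight matrix and one bias vector, whatever depth was used.
n_tensors = len(list(layer.parameters()))
print(n_tensors)
```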

In [7]:
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(400):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 685.2882080078125
1 658.4806518554688
2 649.9451293945312
3 660.2328491210938
4 621.1951293945312
5 509.055419921875
6 656.7465209960938
7 423.87274169921875
8 573.077392578125
9 558.9674072265625
10 538.13427734375
11 653.3952026367188
12 643.6406860351562
13 637.207275390625
14 627.306640625
15 216.6813507080078
16 193.6690216064453
17 638.5858154296875
18 134.88211059570312
19 357.65399169921875
20 621.0591430664062
21 531.1492309570312
22 79.67281341552734
23 269.6471252441406
24 243.99330139160156
25 540.9002685546875
26 81.2044448852539
27 174.1448211669922
28 451.51904296875
29 144.88453674316406
30 371.6123962402344
31 326.8035888671875
32 109.8161849975586
33 102.23863220214844
34 143.2442626953125
35 245.27720642089844
36 190.06394958496094
37 165.6575927734375
38 205.73199462890625
39 81.78438568115234
40 384.2493896484375
41 76.95626068115234
42 441.6557312011719
43 171.9198760986328
44 106.73949432373047
45 90.2186050415039
46 105.39947509765625
47 349.04229736328125
48 