PyTorch: A First Taste
----------------------

PyTorch has two core features:
- n-dimensional tensors, similar to NumPy arrays
- automatic differentiation for building and training networks
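
To make the NumPy comparison concrete, here is a minimal sketch of how a tensor mirrors a NumPy array. The array values and variable names are illustrative only, and the example assumes a CPU tensor.

```python
import numpy as np
import torch

# A small NumPy array and a tensor view of the same data
a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)            # CPU tensor that shares memory with `a`

# Shape and reductions look much the same as in NumPy
print(tuple(t.shape), t.sum().item())

# Arithmetic is elementwise, and the result converts back to NumPy
b = (t * 2).numpy()
print(np.array_equal(b, a * 2))
```

The conversion functions used here, torch.from_numpy and Tensor.numpy, only apply to CPU tensors.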

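We use a fully connected ReLU network as the running example.
The network has a single hidden layer and is trained with gradient descent to fit random data by minimizing the squared error.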
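
The manual backward pass in the code that follows can be read off from the chain rule. As a sketch, writing $W_1$, $W_2$ for w1, w2 and $\hat{y}$ for y_pred (notation introduced here, not in the original), the forward pass and its gradients are:

```latex
\begin{aligned}
h &= x W_1, \qquad h_{\mathrm{relu}} = \max(h, 0), \qquad \hat{y} = h_{\mathrm{relu}} W_2, \qquad L = \sum_{i,j} (\hat{y}_{ij} - y_{ij})^2 \\
\frac{\partial L}{\partial \hat{y}} &= 2(\hat{y} - y) \\
\frac{\partial L}{\partial W_2} &= h_{\mathrm{relu}}^{\top} \, \frac{\partial L}{\partial \hat{y}} \\
\frac{\partial L}{\partial h_{\mathrm{relu}}} &= \frac{\partial L}{\partial \hat{y}} \, W_2^{\top} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial h_{\mathrm{relu}}} \odot \mathbb{1}[h \ge 0] \\
\frac{\partial L}{\partial W_1} &= x^{\top} \, \frac{\partial L}{\partial h}
\end{aligned}
```

The indicator uses $h \ge 0$ because the code zeroes the gradient only where h < 0; at exactly $h = 0$ any value in $[0, 1]$ is a valid subgradient of ReLU.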

In [2]:
import numpy as np

# N is the batch size, D_in the input dimension,
# H the hidden-layer dimension, and D_out the output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize the weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6

for t in range(500):
    # Forward pass: compute the predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print the loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backward pass: compute the gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update the weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    

0 36660962.93668987
1 36337290.19777673
2 41071589.3892885
3 42474866.5373313
4 35048570.22446701
5 21434442.982513946
6 10335502.061418
7 4629405.5940360725
8 2345438.8948948495
9 1451552.7169117318
10 1053544.079012527
11 834550.7492171037
12 689514.656706353
13 581734.2142949791
14 496596.8025013825
15 427381.4414060715
16 370174.3194873447
17 322354.95278048597
18 282081.63119673834
19 247903.1929527521
20 218807.62001077036
21 193902.14434060344
22 172402.59311982236
23 153771.07336156347
24 137580.70035977318
25 123429.44576114768
26 111038.74025310524
27 100149.45515010285
28 90539.96765640104
29 82029.86729477596
30 74462.04706692579
31 67718.42208271759
32 61706.19678844789
33 56324.07042949852
34 51493.0991221282
35 47144.78604792847
36 43225.84441740978
37 39684.91579813081
38 36487.58992455915
39 33589.90409291239
40 30957.99959275828
41 28564.15806805177
42 26382.076323826135
43 24392.097254203356
44 22571.84344948931
45 20906.263132064516
46 19384.144974941057
47 17993.88

PyTorch: Tensors
----------------

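NumPy is a great framework, but it cannot use GPUs to accelerate its numerical computations.

Tensor: the PyTorch counterpart of a NumPy array. It is essentially an n-dimensional array that comes with many functions for operating on it.
To run a Tensor on a GPU, pass the device argument when creating it.

Here we again train a two-layer network on random data, this time with tensors, implementing the forward and backward passes by hand.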
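
A minimal sketch of device selection follows. The fallback to the CPU when no GPU is visible is an addition for portability, and the tensor shapes are arbitrary.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.ones(2, 3, device=device)   # created directly on `device`
b = torch.ones(3, 2).to(device)       # created on the CPU, then moved
c = a.mm(b)                           # both operands must live on the same device

# Move the result back to the CPU before converting it to Python values
print(c.device.type, c.cpu().tolist())
```

Operations between tensors on different devices raise an error, so it helps to route every tensor through the same device variable.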

In [4]:
import torch

dtype = torch.float
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Dimensions: batch size, input, hidden layer, output
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize the weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6

for t in range(500):
    # Forward pass
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print the loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop: compute the gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update the weights with gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 27458792.0
1 20580132.0
2 16598710.0
3 13378539.0
4 10414564.0
5 7755834.0
6 5585409.5
7 3954524.25
8 2802383.75
9 2016023.875
10 1486802.25
11 1128786.125
12 882182.4375
13 708210.25
14 581488.6875
15 486262.09375
16 412680.90625
17 354297.6875
18 306833.625
19 267632.15625
20 234795.671875
21 206957.546875
22 183179.8125
23 162761.5
24 145129.5625
25 129804.6875
26 116425.4296875
27 104689.140625
28 94359.640625
29 85243.9375
30 77183.4140625
31 70023.2265625
32 63649.109375
33 57961.88671875
34 52873.1171875
35 48307.33984375
36 44208.2890625
37 40515.83203125
38 37182.83203125
39 34168.3203125
40 31436.837890625
41 28956.59765625
42 26701.59765625
43 24648.81640625
44 22777.23046875
45 21068.52734375
46 19507.87890625
47 18079.80859375
48 16770.43359375
49 15568.482421875
50 14464.2158203125
51 13448.0390625
52 12512.650390625
53 11650.2255859375
54 10853.828125
55 10118.3857421875
56 9439.31640625
57 8812.4453125
58 8231.779296875
59 7693.5693359375
60 7194.4921875
61 6730.94482

Autograd
--------

#### 3.1 Tensors and autograd

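Deriving gradients by hand is manageable for a simple network, but it quickly becomes tedious once the network grows complex.
The autograd package in PyTorch provides automatic differentiation, that is, it computes the backward pass of the network automatically.
When autograd is in use, the forward pass of the network defines a computational graph:
nodes in the graph are tensors, and edges are functions
that produce output tensors from input tensors.

Using it is simple: to get the gradient with respect to a tensor, create that tensor with
requires_grad = True

Any PyTorch operation on that tensor then builds a computational graph, which lets us backpropagate through it.
If a tensor x has requires_grad = True,
then after backpropagation x.grad is another tensor holding the gradient of some scalar value with respect to x.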
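
A small sketch of this behaviour, using made-up values: it also shows that gradients accumulate across backward calls, which is why training loops zero them.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = (x ** 2).sum()        # scalar built from x
y.backward()              # fills x.grad with dy/dx = 2 * x
g1 = x.grad.clone()

z = (3 * x).sum()
z.backward()              # adds dz/dx = 3 on top of the existing x.grad
g2 = x.grad.clone()

x.grad.zero_()            # reset before the next backward pass
print(g1.tolist(), g2.tolist(), x.grad.tolist())
```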

In [1]:
import torch

dtype = torch.float
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    loss = (y_pred - y).pow(2).sum()
    # loss.item() returns the Python number held by this one-element tensor
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    # After this call, w1.grad and w2.grad hold the gradients of the loss
    # with respect to w1 and w2
    loss.backward()

    # We only want to update w1 and w2 in place here, without building
    # a computational graph for the update step, so we wrap the update
    # in the torch.no_grad() context manager
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating the weights
        w1.grad.zero_()
        w2.grad.zero_()

0 32239596.0
1 31852400.0
2 42016280.0
3 56670312.0
4 60943816.0
5 43483232.0
6 18576254.0
7 5878778.0
8 2272979.5
9 1362837.75
10 1039465.5
11 856212.625
12 722320.625
13 615592.4375
14 528261.625
15 456111.9375
16 395833.0
17 345054.15625
18 302048.21875
19 265403.125
20 234015.859375
21 207038.71875
22 183750.390625
23 163558.703125
24 145982.5625
25 130629.21875
26 117163.53125
27 105338.84375
28 94911.484375
29 85686.9609375
30 77502.75
31 70220.828125
32 63730.953125
33 57934.859375
34 52745.7109375
35 48087.3125
36 43902.40234375
37 40134.94140625
38 36738.8203125
39 33673.96875
40 30900.232421875
41 28388.67578125
42 26110.94140625
43 24040.375
44 22155.7578125
45 20438.666015625
46 18871.484375
47 17440.15234375
48 16131.23046875
49 14933.0498046875
50 13834.796875
51 12827.4921875
52 11902.5712890625
53 11052.736328125
54 10270.609375
55 9550.30078125
56 8886.7978515625
57 8274.69921875
58 7709.52880859375
59 7187.4697265625
60 6704.81396484375
61 6258.193359375
62 5844.48486

380 0.0014773942530155182
381 0.0014265887439250946
382 0.001378708751872182
383 0.0013332550879567862
384 0.0012899485882371664
385 0.0012492737732827663
386 0.0012070141965523362
387 0.0011691692052409053
388 0.0011314433068037033
389 0.0010962334927171469
390 0.001061523798853159
391 0.0010286474134773016
392 0.0009965031640604138
393 0.0009661632357165217
394 0.0009362330893054605
395 0.0009079946321435273
396 0.0008803055388852954
397 0.0008545242017135024
398 0.0008291141130030155
399 0.000803518109023571
400 0.000780366943217814
401 0.0007585117709822953
402 0.0007359273731708527
403 0.0007141496753320098
404 0.0006947307265363634
405 0.0006749884341843426
406 0.0006559916655533016
407 0.0006376258097589016
408 0.0006202346412464976
409 0.0006027322961017489
410 0.0005865985294803977
411 0.0005703392089344561
412 0.000556049810256809
413 0.0005409027216956019
414 0.0005266102380119264
415 0.0005132405203767121
416 0.0005002834368497133
417 0.00048737312317825854
418 0.0004741282

#### 3.2 PyTorch: Defining new autograd functions

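We can define our own autograd operator by subclassing torch.autograd.Function and implementing its forward and backward functions.
Once defined, the new operator is used through its apply method: we call it like a function, passing in tensors that hold the input data.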
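
One way to gain confidence in a hand-written backward is to compare it against numerical gradients. The sketch below uses a hypothetical Cube operator (not part of the original notes) and torch.autograd.gradcheck, which expects double-precision inputs.

```python
import torch

class Cube(torch.autograd.Function):
    """Hypothetical example op: y = x ** 3 with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)       # cache the input for the backward pass
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 3 * x ** 2    # chain rule: dL/dx = dL/dy * 3x^2

# Direct check on known values
x = torch.tensor([1.0, -2.0], dtype=torch.double, requires_grad=True)
Cube.apply(x).sum().backward()
print(x.grad.tolist())

# Numerical check on random double-precision input
x_rand = torch.randn(5, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Cube.apply, (x_rand,), eps=1e-6, atol=1e-4)
print(ok)
```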

In [4]:
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        """
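        In the forward pass we receive a context object and a tensor
        containing the input, and return a tensor containing the output.
        The context object is used to cache values for the backward pass.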
        """
        
        ctx.save_for_backward(x)
        return x.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
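        In the backward pass we receive a context object and a tensor
        containing the gradient of the loss with respect to the output
        produced during the forward pass. We retrieve the cached data from
        the context object, then compute and return the gradient of the
        loss with respect to the input of the forward pass.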
        """
        
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x
    
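# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")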

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)
learning_rate = 1e-6

for t in range(500):
    y_pred = MyReLU.apply(x.mm(w1)).mm(w2)
    
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    
    loss.backward()
    
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        w1.grad.zero_()
        w2.grad.zero_()

0 28356400.0
1 22186024.0
2 18713228.0
3 15664120.0
4 12470141.0
5 9326672.0
6 6631358.0
7 4587540.0
8 3165522.75
9 2224003.0
10 1610071.5
11 1207665.875
12 937629.5625
13 750356.5
14 615496.25
15 514778.625
16 437179.125
17 375612.125
18 325688.375
19 284421.8125
20 249859.40625
21 220606.78125
22 195623.8125
23 174043.671875
24 155346.984375
25 139068.5625
26 124841.625
27 112353.59375
28 101345.484375
29 91612.09375
30 82986.71875
31 75312.984375
32 68468.9609375
33 62351.421875
34 56870.265625
35 51944.375
36 47507.70703125
37 43503.296875
38 39882.83984375
39 36604.1875
40 33630.5078125
41 30930.138671875
42 28473.435546875
43 26236.259765625
44 24196.9609375
45 22334.435546875
46 20631.076171875
47 19071.79296875
48 17643.37890625
49 16333.203125
50 15130.69140625
51 14025.337890625
52 13008.4375
53 12071.998046875
54 11209.091796875
55 10413.3779296875
56 9679.4365234375
57 9002.208984375
58 8376.412109375
59 7797.96484375
60 7262.580078125
61 6766.87060546875
62 6307.5546875
63

393 0.00048416160279884934
394 0.0004704036400653422
395 0.0004582086985465139
396 0.00044554518535733223
397 0.0004333991673775017
398 0.00042127500637434423
399 0.00041048729326575994
400 0.00039956156979314983
401 0.0003887217026203871
402 0.0003775054356083274
403 0.000368162349332124
404 0.0003583707148209214
405 0.0003484755288809538
406 0.00033971265656873584
407 0.000331562157953158
408 0.00032255155383609235
409 0.0003148511459585279
410 0.0003069220983888954
411 0.00029890163568779826
412 0.0002919799298979342
413 0.0002842520480044186
414 0.0002771755389403552
415 0.00027098285499960184
416 0.0002646437205839902
417 0.0002580988220870495
418 0.0002518746769055724
419 0.0002464808931108564
420 0.00024135290004778653
421 0.00023582097492180765
422 0.00022974380408413708
423 0.0002242656919406727
424 0.00021958642173558474
425 0.00021453580120578408
426 0.0002094596711685881
427 0.0002052272902801633
428 0.0002003836998483166
429 0.00019617857469711453
430 0.0001921741495607420

#### 3.3 Static graphs

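PyTorch autograd looks a lot like TensorFlow, but the biggest difference is that TensorFlow's computational graphs (in its classic graph mode) are static, while PyTorch's are dynamic.

In TensorFlow, the graph is defined once and then the same graph is executed over and over, possibly with different input data.
In PyTorch, each forward pass defines a new computational graph.

The advantage of a static graph is that it can be optimized up front.

Another difference between static and dynamic graphs is control flow.

For some models we want to perform a different computation for each data point:
for example, a recurrent neural network might be unrolled for a different number of time steps for each data point, and this unrolling can be implemented as a loop.
With a static graph, the loop construct has to be part of the graph,
which is why TensorFlow provides operators for embedding loops into the graph.

With dynamic graphs the situation is simpler:
since we build the graph on the fly for each example, we can use ordinary imperative control flow to perform a different computation for each input.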
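
As a sketch of what this looks like in practice, the toy function below (names and values are illustrative) uses a plain Python loop whose length varies per call. Autograd records whatever operations actually ran, so the gradient matches the unrolled computation.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

def forward(x, n_steps):
    # Ordinary Python loop: the recorded graph contains n_steps multiplications
    out = x
    for _ in range(n_steps):
        out = out * w
    return out

results = []
for n_steps in (1, 3):
    w.grad = None                                # discard the previous gradient
    out = forward(torch.tensor(1.0), n_steps)    # out = w ** n_steps
    out.backward()                               # d(out)/dw = n_steps * w ** (n_steps - 1)
    results.append((n_steps, out.item(), w.grad.item()))

print(results)
```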

4. The nn module
----------------

#### 4.1 PyTorch: nn

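Computational graphs and autograd are very powerful tools.
For large networks, however, raw autograd is too low-level.
When building a network, we usually think of arranging the computation into layers, some of which have learnable parameters that are optimized during training.

The nn package serves this purpose.
It defines a set of modules, which are roughly equivalent to neural network layers.
A module receives input tensors and computes output tensors,
and may also hold internal state such as tensors containing learnable parameters.
The package also defines a set of loss functions that are commonly used when training networks.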
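
To illustrate the internal state a module holds, this sketch inspects a single nn.Linear layer and applies a loss function to its output. The layer sizes and the all-zero target are arbitrary choices for the example.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(3, 2)          # holds a weight of shape (2, 3) and a bias of shape (2,)

# Parameters are tensors that already have requires_grad=True
shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(shapes)

x = torch.randn(4, 3)
out = layer(x)                         # the module maps an input tensor to an output tensor

loss_fn = torch.nn.MSELoss(reduction='sum')
loss = loss_fn(out, torch.zeros(4, 2)) # with a zero target this equals (out ** 2).sum()
print(tuple(out.shape), loss.item())
```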

In [5]:
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
)

loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4

for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    
    # Zero the gradients before the backward pass
    model.zero_grad()
    loss.backward()
    
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 676.18359375
1 627.087890625
2 584.7290649414062
3 547.7319946289062
4 514.9356689453125
5 485.19622802734375
6 458.273193359375
7 433.77984619140625
8 411.1972961425781
9 390.1839294433594
10 370.5324401855469
11 352.1104431152344
12 334.7972106933594
13 318.439697265625
14 302.828369140625
15 287.93524169921875
16 273.6541748046875
17 259.9878234863281
18 246.9178466796875
19 234.46185302734375
20 222.5476837158203
21 211.11441040039062
22 200.16835021972656
23 189.72044372558594
24 179.74453735351562
25 170.21951293945312
26 161.10195922851562
27 152.4192352294922
28 144.15829467773438
29 136.2587890625
30 128.7401580810547
31 121.5772705078125
32 114.75633239746094
33 108.27776336669922
34 102.14570617675781
35 96.31876373291016
36 90.80496215820312
37 85.58736419677734
38 80.6474609375
39 75.97702026367188
40 71.56298828125
41 67.39440155029297
42 63.45671844482422
43 59.74479293823242
44 56.2451286315918
45 52.94639587402344
46 49.845638275146484
47 46.92377853393555
48 44.1748

#### 4.2 PyTorch: optim

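In practice we often train neural networks with more sophisticated optimizers such as AdaGrad, RMSProp, and Adam. The torch.optim package provides implementations of these algorithms.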
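
To show that optimizers share one interface while applying different update rules, the sketch below takes a single step with SGD and with Adam on a toy loss. The helper one_step and the starting values are hypothetical, chosen so the results are easy to check by hand.

```python
import torch

def one_step(optimizer_cls, lr):
    # Toy problem: loss = sum(p ** 2), so the gradient is 2 * p = [2, -4]
    p = torch.tensor([1.0, -2.0], requires_grad=True)
    optimizer = optimizer_cls([p], lr=lr)
    loss = (p ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return p.detach()

sgd = one_step(torch.optim.SGD, lr=0.1)    # plain update: p - lr * grad
adam = one_step(torch.optim.Adam, lr=0.1)  # first Adam step is about lr * sign(grad)
print(sgd.tolist(), adam.tolist())
```

SGD scales the step by the gradient magnitude, while Adam's first step has roughly the same size lr in every coordinate, which is one reason the two need different learning rates.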

In [6]:
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
)

loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4

# The first argument to the Adam constructor tells the optimizer which tensors to update
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    
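    # Before the backward pass, use the optimizer to zero the gradients
    # of all the tensors it will update
    optimizer.zero_grad()
    # Backward pass: compute the gradient of the loss with respect to the model parameters
    loss.backward()
    # Calling step on the optimizer updates all of its parameters
    optimizer.step()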

0 662.3717651367188
1 645.34375
2 628.7959594726562
3 612.72900390625
4 597.0604248046875
5 581.8961791992188
6 567.2136840820312
7 552.9628295898438
8 539.181396484375
9 525.7044067382812
10 512.5979614257812
11 499.86102294921875
12 487.56085205078125
13 475.66265869140625
14 464.05731201171875
15 452.7283935546875
16 441.70599365234375
17 431.0838928222656
18 420.8306579589844
19 410.82000732421875
20 401.05035400390625
21 391.5291442871094
22 382.25970458984375
23 373.2821960449219
24 364.4888000488281
25 355.9678649902344
26 347.6789245605469
27 339.56048583984375
28 331.6104431152344
29 323.8968811035156
30 316.4245300292969
31 309.128173828125
32 301.9924011230469
33 295.0163269042969
34 288.1916198730469
35 281.50152587890625
36 274.9500732421875
37 268.5572509765625
38 262.3010559082031
39 256.1800537109375
40 250.17568969726562
41 244.287353515625
42 238.54063415527344
43 232.92599487304688
44 227.41212463378906
45 222.00656127929688
46 216.705810546875
47 211.49002075195312


396 2.1611631382256746e-05
397 2.043540371232666e-05
398 1.932487793965265e-05
399 1.827476626203861e-05
400 1.727845119603444e-05
401 1.63386848726077e-05
402 1.5448666090378538e-05
403 1.4606800505134743e-05
404 1.3812375073030125e-05
405 1.3060429409961216e-05
406 1.234857518284116e-05
407 1.167534264823189e-05
408 1.1038831871701404e-05
409 1.0435934200359043e-05
410 9.868552297120914e-06
411 9.329440217697993e-06
412 8.819830327411182e-06
413 8.338020052178763e-06
414 7.88244778959779e-06
415 7.451875262631802e-06
416 7.044156973279314e-06
417 6.658640813839156e-06
418 6.295349976426223e-06
419 5.950830200163182e-06
420 5.625941412290558e-06
421 5.3169296734267846e-06
422 5.0256230679224245e-06
423 4.750128482555738e-06
424 4.4903522393724415e-06
425 4.243087005306734e-06
426 4.011025339423213e-06
427 3.7909451293671736e-06
428 3.582455292416853e-06
429 3.38582731274073e-06
430 3.199896582373185e-06
431 3.023695171577856e-06
432 2.8576396289281547e-06
433 2.700558070500847e-06
434

#### 4.3 PyTorch: Custom nn modules

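Sometimes we need to specify models that are more complex than a sequence of existing modules.
In those cases we can define our own module by subclassing nn.Module and defining a forward function.

This forward function receives input tensors and produces output tensors, using other modules or other autograd operations on tensors.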
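
A detail worth seeing once: submodules assigned as attributes in the constructor are registered automatically, which is what lets model.parameters() hand every weight to an optimizer. The TinyNet class below is a hypothetical model used only to inspect this.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical two-layer model used only to inspect parameter registration."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 3)     # 4 * 3 weights + 3 biases
        self.fc2 = nn.Linear(3, 1)     # 3 * 1 weights + 1 bias

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

net = TinyNet()
names = [name for name, _ in net.named_parameters()]
n_params = sum(p.numel() for p in net.parameters())
out = net(torch.randn(5, 4))           # calling the module runs forward()
print(names, n_params, tuple(out.shape))
```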

In [8]:
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules
        and assign them as member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.linear2 = nn.Linear(H, D_out)
        
    def forward(self, x):
        """
        The forward function accepts a tensor of input data
        and must return a tensor of output data.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred
    
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = TwoLayerNet(D_in, H, D_out)

loss_fn = nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 694.797607421875
1 644.2505493164062
2 600.235107421875
3 561.2379760742188
4 526.4671630859375
5 494.9014892578125
6 465.9754943847656
7 439.43115234375
8 414.98876953125
9 392.41326904296875
10 371.3191223144531
11 351.3940734863281
12 332.65728759765625
13 314.92108154296875
14 298.073974609375
15 281.99676513671875
16 266.6170654296875
17 251.95858764648438
18 237.9494171142578
19 224.58616638183594
20 211.77003479003906
21 199.54458618164062
22 187.88589477539062
23 176.80535888671875
24 166.30972290039062
25 156.3731231689453
26 146.95114135742188
27 138.03846740722656
28 129.61170959472656
29 121.64468383789062
30 114.107666015625
31 107.01520538330078
32 100.33747863769531
33 94.06478881835938
34 88.18358612060547
35 82.67179107666016
36 77.51144409179688
37 72.67861938476562
38 68.15833282470703
39 63.93291091918945
40 59.987239837646484
41 56.30097198486328
42 52.86064910888672
43 49.64619827270508
44 46.63488006591797
45 43.82167434692383
46 41.193206787109375
47 38.737987

492 2.0622749161702814e-06
493 2.0000682070531184e-06
494 1.939944468176691e-06
495 1.881766365841031e-06
496 1.8250043467560317e-06
497 1.7704204537949408e-06
498 1.7170680166600505e-06
499 1.6652604699629592e-06


#### 4.4 PyTorch: Control flow and weight sharing

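As an example of dynamic graphs and weight sharing, we implement a rather strange model:
a fully connected ReLU network.

On each forward pass it picks a random number between 1 and 4 and uses that many hidden layers, reusing the same weights several times to compute the innermost hidden layers.

We can implement the loop with ordinary Python control flow, and we get weight sharing among the innermost layers simply by reusing the same module several times when defining the forward pass.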
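
The effect of reusing a module can be checked directly. In this sketch (layer size and input are arbitrary), one nn.Linear is applied twice, and its gradient is compared with the summed gradients of two independent copies that start from the same weights.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
shared = nn.Linear(3, 3)
a, b = copy.deepcopy(shared), copy.deepcopy(shared)   # two separate copies, same initial weights
x = torch.randn(2, 3)

# Shared version: the same module, and so the same weight tensor, is used twice
shared(shared(x).clamp(min=0)).sum().backward()

# Unshared version: identical computation, but each layer has its own weights
b(a(x).clamp(min=0)).sum().backward()

# The shared module still exposes one weight and one bias
n_tensors = len(list(shared.parameters()))
same = torch.allclose(shared.weight.grad, a.weight.grad + b.weight.grad)
print(n_tensors, same)
```

Autograd accumulates the contribution of every use into the single .grad of the shared weight, so no special handling is needed in the training loop.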

In [4]:
import random
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    """Builds three nn.Linear instances that are used in the forward pass."""
    def __init__(self, D_in, H, D_out):
        super(DynamicNet, self).__init__()
        self.input_linear = nn.Linear(D_in, H)
        self.middle_linear = nn.Linear(H, H)
        self.output_linear = nn.Linear(H, D_out)
        
    def forward(self, x):
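        h_relu = self.input_linear(x).clamp(min=0)
        # Reuse middle_linear 0 to 3 times, so the network has 1 to 4 hidden
        # layers in total and all the middle layers share the same weights
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)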
        y_pred = self.output_linear(h_relu)
        return y_pred
    
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = DynamicNet(D_in, H, D_out)

criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

for t in range(500):
    y_pred = model(x)
    loss = criterion(y_pred, y)
    print(t, loss.item())
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 604.9212646484375
1 613.4895629882812
2 611.4428100585938
3 625.357421875
4 579.3432006835938
5 573.7955322265625
6 606.141357421875
7 556.5703125
8 602.6561279296875
9 600.7974853515625
10 335.53509521484375
11 597.2626953125
12 282.4223937988281
13 510.00213623046875
14 497.99737548828125
15 200.90550231933594
16 172.0865478515625
17 586.4696044921875
18 562.5220336914062
19 554.0592041015625
20 541.3773193359375
21 86.55435943603516
22 556.837646484375
23 76.38275146484375
24 335.790771484375
25 64.3868637084961
26 301.82403564453125
27 278.8071594238281
28 416.09771728515625
29 221.38380432128906
30 75.32075500488281
31 439.89306640625
32 76.66362762451172
33 397.833251953125
34 373.7976379394531
35 346.15484619140625
36 241.85116577148438
37 66.17269897460938
38 118.0133285522461
39 263.0484313964844
40 53.342506408691406
41 46.539371490478516
42 330.2564697265625
43 38.680171966552734
44 532.4647216796875
45 29.19445037841797
46 56.37847137451172
47 159.4569854736328
48 259.227

430 0.13315171003341675
431 0.15454748272895813
432 0.10607624053955078
433 0.09740836173295975
434 0.14093731343746185
435 1.1533571481704712
436 1.0623832941055298
437 0.9217357039451599
438 0.38777339458465576
439 0.2751402258872986
440 0.32615458965301514
441 0.10087944567203522
442 0.09777894616127014
443 0.3326129615306854
444 0.6364453434944153
445 0.5256108045578003
446 0.28159299492836
447 0.5395261645317078
448 0.5346019864082336
449 0.4686483144760132
450 0.4056470990180969
451 0.3084808588027954
452 0.3682557940483093
453 0.33072614669799805
454 0.7263197302818298
455 0.11598410457372665
456 0.2926886975765228
457 0.30939722061157227
458 0.24447859823703766
459 0.08216099441051483
460 0.30216899514198303
461 0.07814404368400574
462 0.0734647661447525
463 0.26630425453186035
464 0.3054557144641876
465 0.1317940056324005
466 0.12328866869211197
467 0.14474931359291077
468 0.8717991709709167
469 0.8446596264839172
470 0.7673892974853516
471 0.659972071647644
472 0.298113763332