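### Assignment goal: differentiation and backpropagation with PyTorch
In this assignment we implement differentiation and backpropagation by hand, and then use PyTorch's Autograd.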

### Implementing differentiation and backpropagation with PyTorch

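Here we implement a simple two-layer neural network for a regression problem. The loss function is the L2 loss, summed over all elements of the prediction:

$$
L2\_loss = \sum (y_{pred}-y)^2
$$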

The two-layer network is defined as
$$
y_{pred} = \mathrm{ReLU}(XW_1)W_2
$$
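
The gradients needed for the manual backward pass follow from the chain rule. As a sketch, write $h = XW_1$ and $h_{relu} = \mathrm{ReLU}(h)$ (intermediate names introduced here for illustration), let $L$ be the L2 loss summed over all elements, $\odot$ element-wise multiplication, and $\mathbb{1}[h > 0]$ a 0/1 mask:

$$
\begin{aligned}
\frac{\partial L}{\partial y_{pred}} &= 2\,(y_{pred} - y) \\
\frac{\partial L}{\partial W_2} &= h_{relu}^{T}\,\frac{\partial L}{\partial y_{pred}} \\
\frac{\partial L}{\partial h_{relu}} &= \frac{\partial L}{\partial y_{pred}}\,W_2^{T} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial h_{relu}} \odot \mathbb{1}[h > 0] \\
\frac{\partial L}{\partial W_1} &= X^{T}\,\frac{\partial L}{\partial h}
\end{aligned}
$$

Each gradient has the same shape as the quantity it differentiates with respect to, which is a quick way to check where the transposes belong.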

In [1]:
import torch
device = torch.device('cpu')

In [23]:
# N: batch size
# D_in: input dimension
# H: hidden dimension
# D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate x, y
x = torch.randn((N, D_in), device=device)
y = torch.randn((N, D_out), device=device)

# Initialize the weights W1, W2
W1 = torch.randn((D_in, H), device=device)
W2 = torch.randn((H, D_out), device=device)

# Set the learning rate
learning_rate = 1e-6

# Train for 500 epochs
for t in range(500):
  # Forward pass: compute y_pred
  ###<your code>###
  h = torch.matmul(x, W1)
  h_relu = torch.relu(h)
  y_pred = torch.matmul(h_relu, W2)

  # Compute the loss (sum of squared errors)
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

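  # Backward pass: compute the gradients of the loss with respect to W1 and W2
  y_pred_grad = 2.0 * (y_pred - y)  # 2.0 keeps the result floating point
  W2_gradient = h_relu.T.matmul(y_pred_grad)  # .T is the transpose
  h_gradient = y_pred_grad.mm(W2.T) * (h > 0.)  # ReLU passes gradient only where h > 0
  W1_gradient = x.T.mm(h_gradient)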

  # answer key:
  # grad_y_pred = 2.0 * (y_pred - y)
  # grad_w2 = h_relu.t().mm(grad_y_pred)
  # grad_h_relu = grad_y_pred.mm(w2.t())
  # grad_h = grad_h_relu.clone()
  # grad_h[h < 0] = 0
  # grad_w1 = x.t().mm(grad_h)


  # Update the parameters
  ###<your code>###
  W1 -= learning_rate * W1_gradient
  W2 -= learning_rate * W2_gradient

0 35273792.0
1 38135880.0
2 49189880.0
3 57262536.0
4 49265544.0
5 26915596.0
6 10102228.0
7 3615298.75
8 1807695.125
9 1238120.5
10 975353.625
11 805649.625
12 677719.5
13 575838.5625
14 492890.0
15 424526.625
16 367693.21875
17 320136.71875
18 280066.8125
19 246064.765625
20 217072.375
21 192292.109375
22 170945.53125
23 152485.6875
24 136459.359375
25 122482.6640625
26 110242.6875
27 99476.9609375
28 90018.96875
29 81649.75
30 74218.875
31 67602.421875
32 61693.9765625
33 56403.0703125
34 51657.2109375
35 47389.4453125
36 43543.19140625
37 40073.7734375
38 36935.0625
39 34086.921875
40 31498.357421875
41 29144.169921875
42 26996.53125
43 25035.09765625
44 23240.673828125
45 21596.890625
46 20088.4765625
47 18703.318359375
48 17429.30078125
49 16255.5234375
50 15172.357421875
51 14172.4755859375
52 13248.189453125
53 12392.4384765625
54 11600.13671875
55 10864.9541015625
56 10182.4140625
57 9548.1240234375
58 8958.4462890625
59 8409.76171875
60 7898.96728515625
61 7422.5166015625
62 
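
One way to gain confidence in hand-written gradients is to compare them against Autograd on a small problem. The following is a minimal sketch, not part of the original assignment: the tiny sizes, the fixed seed, and the use of `float64` are assumptions made only to keep the check fast and numerically tight.

```python
import torch

torch.manual_seed(0)  # fixed seed so the check is repeatable

# Small, hypothetical sizes chosen only to keep the check fast
N, D_in, H, D_out = 4, 6, 5, 3
x = torch.randn(N, D_in, dtype=torch.float64)
y = torch.randn(N, D_out, dtype=torch.float64)
W1 = torch.randn(D_in, H, dtype=torch.float64, requires_grad=True)
W2 = torch.randn(H, D_out, dtype=torch.float64, requires_grad=True)

# Forward pass, same structure as the training loop
h = x.mm(W1)
h_relu = torch.relu(h)
y_pred = h_relu.mm(W2)
loss = (y_pred - y).pow(2).sum()

# Reference gradients computed by autograd
loss.backward()

# Hand-derived gradients, computed without tracking
with torch.no_grad():
    y_pred_grad = 2.0 * (y_pred - y)
    W2_gradient = h_relu.T.mm(y_pred_grad)
    h_gradient = y_pred_grad.mm(W2.T) * (h > 0)
    W1_gradient = x.T.mm(h_gradient)

manual_matches_autograd = (
    torch.allclose(W1_gradient, W1.grad)
    and torch.allclose(W2_gradient, W2.grad)
)
print(manual_matches_autograd)
```

If the comparison fails, the usual suspects are a missing transpose or a ReLU mask applied to the wrong tensor.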

### Using PyTorch's Autograd

In [None]:
import torch
device = torch.device('cpu')
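
Before applying Autograd to the network, it may help to see it on a one-line function. This is a minimal sketch; the names `a` and `f` are hypothetical and used only for illustration. For $f(a) = \sum a^2$ the gradient is $2a$.

```python
import torch

# A leaf tensor that autograd should track
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Scalar function f(a) = sum(a^2), whose gradient is 2a
f = (a ** 2).sum()

# backward() fills a.grad with df/da
f.backward()

print(a.grad)  # tensor([2., 4., 6.])
```

Calling `backward()` on a scalar populates the `.grad` attribute of every leaf tensor created with `requires_grad=True`.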

In [22]:
# N: batch size
# D_in: input dimension
# H: hidden dimension
# D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate x, y
x = torch.randn((N, D_in), device=device)
y = torch.randn((N, D_out), device=device)

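# Initialize the weights W1, W2.
# Create them directly on the device: calling .to(device) on a tensor with
# requires_grad=True returns a non-leaf copy when the device differs (e.g. GPU),
# and .grad would then not be populated.
W1 = torch.randn((D_in, H), device=device, requires_grad=True)
W2 = torch.randn((H, D_out), device=device, requires_grad=True)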

# Set the learning rate
learning_rate = 1e-6

# Train for 500 epochs
for t in range(500):
  # Forward pass: compute y_pred
  h = torch.matmul(x, W1)
  h_relu = torch.relu(h)
  y_pred = torch.matmul(h_relu, W2)
  
  # Compute the loss
  ###<your code>###
  loss = torch.square(y_pred - y).sum()
  print(t, loss.item())

  # Backward pass: compute the gradients of the loss with respect to W1 and W2
  loss.backward()

  # Update the parameters: the update itself should not be recorded by autograd,
  # so it is done inside torch.no_grad()
  with torch.no_grad():
    # Update W1 and W2 in place
    W1 -= learning_rate * W1.grad
    W2 -= learning_rate * W2.grad

    # Clear the stored gradients (the parameters have already been updated)
    W1.grad.zero_()
    W2.grad.zero_()

0 27591426.0
1 24856636.0
2 25583904.0
3 26096426.0
4 24027528.0
5 18839464.0
6 12601480.0
7 7455209.5
8 4207797.0
9 2424422.25
10 1504133.875
11 1023126.0625
12 757310.875
13 596656.4375
14 490023.3125
15 412825.4375
16 353373.5
17 305648.9375
18 266388.9375
19 233442.515625
20 205469.1875
21 181534.53125
22 160947.625
23 143109.65625
24 127603.609375
25 114084.8671875
26 102242.9453125
27 91832.53125
28 82659.9453125
29 74541.40625
30 67345.0859375
31 60943.65234375
32 55241.4453125
33 50148.63671875
34 45594.16015625
35 41509.60546875
36 37839.89453125
37 34537.45703125
38 31559.880859375
39 28875.1640625
40 26447.876953125
41 24249.666015625
42 22258.61328125
43 20450.087890625
44 18805.078125
45 17307.234375
46 15942.4140625
47 14697.548828125
48 13560.8916015625
49 12522.2900390625
50 11571.642578125
51 10700.431640625
52 9901.5498046875
53 9168.044921875
54 8494.6103515625
55 7875.337890625
56 7305.5185546875
57 6780.60693359375
58 6297.18798828125
59 5852.24462890625
60 5441.41
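
The reason for calling `zero_()` on the gradients is that `backward()` adds to `.grad` rather than overwriting it. The following is a minimal sketch of that behavior; the toy parameter `w` and the function `3 * w` are hypothetical and chosen only because the gradient is easy to verify by hand.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)  # hypothetical toy parameter

# First backward pass: d/dw sum(3w) = 3 for each element
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one
(3 * w).sum().backward()
accumulated = w.grad.clone()

# After zero_(), the next backward pass starts from zero again
w.grad.zero_()
(3 * w).sum().backward()
after_reset = w.grad.clone()

print(first)        # tensor([3., 3.])
print(accumulated)  # tensor([6., 6.])
print(after_reset)  # tensor([3., 3.])
```

Without the reset, each parameter update in the training loop would use the sum of all gradients seen so far instead of the gradient of the current step.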