Implementing a Two-Layer Fully Connected Neural Network
--------------

A fully connected ReLU network with one hidden layer and no bias, trained to predict y from x with an L2 loss.
- $h = X W_1$
- $h_{relu} = \max(0, h)$
- $y_{pred} = h_{relu} W_2$
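
For reference, the gradients that the hand-written backward passes below compute can be sketched as follows. This assumes the row convention used by the code (each row of $X$ is one sample, so $X$ is $N \times D_{in}$ and the hidden layer is $X W_1$) and the summed squared error as the loss; $\odot$ denotes the element-wise product.

$$
\begin{aligned}
L &= \sum_{i,j} \left(y_{pred} - y\right)_{ij}^2 \\
\frac{\partial L}{\partial y_{pred}} &= 2\,(y_{pred} - y) \\
\frac{\partial L}{\partial W_2} &= h_{relu}^{\top}\,\frac{\partial L}{\partial y_{pred}} \\
\frac{\partial L}{\partial h_{relu}} &= \frac{\partial L}{\partial y_{pred}}\,W_2^{\top} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial h_{relu}} \odot \mathbb{1}[h > 0] \\
\frac{\partial L}{\partial W_1} &= X^{\top}\,\frac{\partial L}{\partial h}
\end{aligned}
$$

The value of the ReLU derivative at exactly $h = 0$ is a convention and does not matter in practice.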

### Approach 1:

## A two-layer neural network in numpy

This implementation uses only numpy to compute the forward pass, the loss, and the backward pass:
- forward pass
- loss
- backward pass

A numpy ndarray is a plain n-dimensional array. It knows nothing about deep learning, gradients, or computation graphs; it is simply a data structure for numerical computation.
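
Because numpy knows nothing about gradients, the backward pass has to be derived and coded by hand, and a hand-derived gradient is easy to get wrong. One common way to gain confidence is a finite-difference check. The sketch below is only an illustration: the helper names `loss_and_grads` and `numerical_grad`, the small shapes, and the step size `eps` are assumptions, not part of the original notebook.

```python
import numpy as np

def loss_and_grads(x, y, w1, w2):
    """Forward pass plus hand-derived backward pass of the bias-free two-layer ReLU net."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()

    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0          # ReLU passes no gradient where its input was negative
    grad_w1 = x.T.dot(grad_h)
    return loss, grad_w1, grad_w2

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences of the scalar f() with respect to every entry of w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old           # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Tiny problem so the element-by-element check stays fast.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
y = rng.standard_normal((4, 2))
w1 = rng.standard_normal((5, 3))
w2 = rng.standard_normal((3, 2))

_, grad_w1, grad_w2 = loss_and_grads(x, y, w1, w2)
num_w1 = numerical_grad(lambda: loss_and_grads(x, y, w1, w2)[0], w1)
num_w2 = numerical_grad(lambda: loss_and_grads(x, y, w1, w2)[0], w2)

max_err_w1 = np.abs(grad_w1 - num_w1).max()
max_err_w2 = np.abs(grad_w2 - num_w2).max()
print(max_err_w1, max_err_w2)  # both should be tiny if the backward pass is right
```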


In [2]:
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for it in range(501):
    # Forward pass
    h = x.dot(w1) # N * H
    h_relu = np.maximum(h, 0) # N * H
    y_pred = h_relu.dot(w2) # N * D_out
    
    # compute loss
    loss = np.square(y_pred - y).sum()
    
    if it % 50 == 0:
        print(it, loss)
    
    # Backward pass
    # compute the gradient
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h<0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # update weights of w1 and w2
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 33580346.856427416
50 16878.00562656999
100 1175.5761415406482
150 144.92942737452177
200 21.349479971363497
250 3.464356232454947
300 0.588885572480889
350 0.10300981682523466
400 0.018371116368468575
450 0.0033215922591358625
500 0.0006066239639763492


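### Approach 2:

## PyTorch: a two-layer neural network with Tensors

This version uses PyTorch tensors for the forward pass, the loss, and the backward pass.

A PyTorch Tensor is much like a numpy ndarray. The biggest difference is that a PyTorch Tensor can run on either the CPU or the GPU. To compute on the GPU, the Tensor has to be moved to a CUDA device.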
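
As a minimal sketch of what moving to the GPU looks like, the snippet below selects a device and places the tensors on it. It falls back to the CPU when no CUDA device is available, so it should run on either kind of machine; the variable name `device` is just a convention.

```python
import torch

# Use the GPU when PyTorch can see one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors can be created directly on the chosen device...
x = torch.randn(64, 1000, device=device)
w1 = torch.randn(1000, 100, device=device)

# ...or an existing CPU tensor can be moved with .to(device).
w2 = torch.randn(100, 10).to(device)

# All operands of an operation must live on the same device.
y_pred = x.mm(w1).clamp(min=0).mm(w2)
print(y_pred.shape, y_pred.device)
```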

In [3]:
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H)
w2 = torch.randn(H, D_out)

learning_rate = 1e-6
for it in range(501):
    # Forward pass
    h = x.mm(w1) # N * H
    h_relu = h.clamp(min=0) # N * H
    y_pred = h_relu.mm(w2) # N * D_out
    
    # compute loss
    loss = (y_pred - y).pow(2).sum().item()
    if it % 50 == 0:
        print(it, loss)
    
    # Backward pass
    # compute the gradient
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h<0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # update weights of w1 and w2
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 26006564.0
50 11465.49609375
100 415.876220703125
150 26.682729721069336
200 2.1816067695617676
250 0.20476067066192627
300 0.021072817966341972
350 0.0025582911912351847
400 0.0004926429246552289
450 0.0001552953472128138
500 6.920313171576709e-05


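### Approach 3:

## PyTorch: a two-layer neural network with Tensors and Autograd


One of PyTorch's key features is autograd: once the forward pass is defined and the loss is computed, PyTorch can differentiate automatically and compute the gradients of all model parameters.

A PyTorch Tensor represents a node in the computation graph. If ``x`` is a Tensor with ``x.requires_grad=True``, then after the backward pass ``x.grad`` is another Tensor holding the gradient of some scalar (usually the loss) with respect to ``x``.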
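
A tiny illustration of ``requires_grad`` and ``.grad``, using a hypothetical three-element tensor `w` and the toy loss $\sum w_i^2$ (whose gradient is $2w$). It also shows why the training loop has to zero the gradients after each update: repeated calls to `backward()` accumulate into `.grad`.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# First backward pass: d(sum(w^2))/dw = 2w
loss = (w ** 2).sum()
loss.backward()
grad_first = w.grad.clone()

# A second backward pass without zeroing ADDS to the stored gradient.
loss = (w ** 2).sum()
loss.backward()
grad_accumulated = w.grad.clone()

# Reset the gradient in place before the next iteration.
w.grad.zero_()

print(grad_first)        # 2w
print(grad_accumulated)  # 4w, because the two passes were summed
print(w.grad)            # all zeros
```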

In [4]:
import torch
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
for it in range(501):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # compute loss
    loss = (y_pred - y).pow(2).sum() # computation graph
    if it % 50 == 0:
        print(it, loss.item())
    
    # Backward pass
    loss.backward()
    
    # update weights of w1 and w2
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

0 32493268.0
50 13109.89453125
100 448.79290771484375
150 25.290950775146484
200 1.723570704460144
250 0.12972937524318695
300 0.010481716133654118
350 0.0011320256162434816
400 0.00023621311993338168
450 8.190183143597096e-05
500 3.955412466893904e-05


### Approach 4:

## PyTorch: a two-layer neural network with Tensors and optim
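
For plain SGD without momentum or weight decay, `optimizer.step()` followed by `optimizer.zero_grad()` should do the same job as the manual `torch.no_grad()` update block. The sketch below checks this on a hypothetical toy problem; the tensor names, shapes, and learning rate are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)
lr = 0.1

# Two copies of the same initial weights.
w_manual = torch.randn(3, 2, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)
optimizer = torch.optim.SGD([w_optim], lr=lr)

# Manual update, as in the autograd version.
loss = x.mm(w_manual).pow(2).sum()
loss.backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad
    w_manual.grad.zero_()

# The same step delegated to the optimizer.
loss = x.mm(w_optim).pow(2).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()

same_result = torch.allclose(w_manual, w_optim)
print(same_result)
```

The benefit of the optimizer is that switching to another update rule, such as Adam, only changes the line that constructs it.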

In [17]:
import torch
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
optimizer = torch.optim.SGD([w1, w2], lr=learning_rate)

for it in range(501):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # compute loss
    loss = (y_pred - y).pow(2).sum() # computation graph
    if it % 50 == 0:
        print(it, loss.item())
    
    # Backward pass
    loss.backward()
    
    # update weights of w1 and w2
#     with torch.no_grad():
#         w1 -= learning_rate * w1.grad
#         w2 -= learning_rate * w2.grad
#         w1.grad.zero_()
#         w2.grad.zero_()
    optimizer.step()
    optimizer.zero_grad()

0 29260268.0
50 12504.0068359375
100 487.453125
150 31.144981384277344
200 2.379106283187866
250 0.2002507746219635
300 0.01808660849928856
350 0.0019787990022450686
400 0.000380392128136009
450 0.00012120555038563907
500 5.5032327509252355e-05


### Approach 5:

## PyTorch: a two-layer neural network with Tensors and nn.MSELoss
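
`nn.MSELoss(reduction='sum')` is meant to reproduce the hand-written `(y_pred - y).pow(2).sum()`, while the default `reduction='mean'` divides that sum by the number of elements. A quick check on random data (float64 is used here only to keep the comparison free of rounding noise):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
y_pred = torch.randn(64, 10, dtype=torch.float64)
y = torch.randn(64, 10, dtype=torch.float64)

manual_sum = (y_pred - y).pow(2).sum()
loss_sum = nn.MSELoss(reduction='sum')(y_pred, y)
loss_mean = nn.MSELoss()(y_pred, y)  # default is reduction='mean'

sum_matches = torch.allclose(manual_sum, loss_sum)
mean_matches = torch.allclose(loss_mean, loss_sum / y.numel())
print(sum_matches, mean_matches)
```

Because the mean is smaller than the sum by a factor of `N * D_out`, the gradients shrink by the same factor, so a learning rate tuned for one reduction generally does not carry over to the other.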

In [29]:
import torch
import torch.nn as nn
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
optimizer = torch.optim.SGD([w1, w2], lr=learning_rate)
loss_fn = nn.MSELoss(reduction='sum')

for it in range(501):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # compute loss
    # loss = (y_pred - y).pow(2).sum() 
    loss = loss_fn(y_pred, y)
    if it % 50 == 0:
        print(it, loss.item())
    
    # Backward pass
    loss.backward()
    
    # update weights of w1 and w2
    optimizer.step()
    optimizer.zero_grad()

0 28800416.0
50 9263.63671875
100 274.2176208496094
150 13.594708442687988
200 0.8313031196594238
250 0.05708610638976097
300 0.004427996929734945
350 0.0005627150530926883
400 0.00014180631842464209
450 5.7230809034081176e-05
500 3.0386532671400346e-05


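### Approach 6:

## PyTorch: a two-layer neural network with nn

This version builds the network with PyTorch's nn package.
Autograd still builds the computation graph and computes the gradients automatically;
nn supplies the layers and the loss function.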
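
To see what an `nn.Linear` layer holds, the sketch below inspects one with small, made-up sizes. The weight is stored with shape `(out_features, in_features)`, so the layer computes `x @ W.T + b`; this is the transpose of the `w1` layout used in the earlier versions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D_in, H = 5, 3
linear = nn.Linear(D_in, H, bias=True)

print(linear.weight.shape)  # (out_features, in_features)
print(linear.bias.shape)    # (out_features,)

# The layer's output matches the explicit matrix product.
x = torch.randn(4, D_in)
manual = x.mm(linear.weight.t()) + linear.bias
matches = torch.allclose(linear(x), manual, atol=1e-6)
print(matches)

# Sequential collects the parameters of all its sub-modules.
model = nn.Sequential(nn.Linear(D_in, H), nn.ReLU(), nn.Linear(H, 2))
n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```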


In [28]:
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True), # x @ W1^T + b1
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out, bias=True),
)

torch.nn.init.normal_(model[0].weight)
torch.nn.init.normal_(model[2].weight)

# model = model.cuda()

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6

for it in range(501):
    # Forward pass
    y_pred = model(x) # model.forward() 
    
    # compute loss
    loss = loss_fn(y_pred, y) # computation graph
    
    if it % 50 == 0:
        print(it, loss.item())
    
    # Backward pass
    loss.backward()
    
    # update every model parameter (weights and biases)
    with torch.no_grad():
        for param in model.parameters(): # param (tensor, grad)
            param -= learning_rate * param.grad
#             param.grad.zero_()
            
    model.zero_grad()

0 33754652.0
50 15310.271484375
100 593.506103515625
150 35.19988250732422
200 2.520732879638672
250 0.19948479533195496
300 0.016804296523332596
350 0.0017208210192620754
400 0.000318250065902248
450 0.0001016105234157294
500 4.570413148030639e-05


### Approach 7:

## PyTorch: a two-layer neural network with nn and optim

In [27]:
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=False), # x @ W1^T (no bias term)
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out, bias=False),
)

torch.nn.init.normal_(model[0].weight)
torch.nn.init.normal_(model[2].weight)

# model = model.cuda()

loss_fn = nn.MSELoss(reduction='sum')
# learning_rate = 1e-4
# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

learning_rate = 1e-6
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for it in range(501):
    # Forward pass
    y_pred = model(x) # model.forward() 
    
    # compute loss
    loss = loss_fn(y_pred, y) # computation graph
    if it % 50 == 0:
        print(it, loss.item())

    # Backward pass
    loss.backward()
    
    # update model parameters
    optimizer.step()
    optimizer.zero_grad()


0 33509748.0
50 11850.4736328125
100 348.3865661621094
150 16.511716842651367
200 0.9526950716972351
250 0.061549607664346695
300 0.004501718562096357
350 0.000535203143954277
400 0.0001302134623983875
450 5.2256727940402925e-05
500 2.8227381335454993e-05


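### Approach 8:

## PyTorch: a two-layer neural network with custom nn Modules (explicit parameters)

We can define a model as a class that inherits from nn.Module. Whenever a model is more complex than what Sequential can express, it needs to be defined as an nn.Module subclass.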
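
As one illustration of a model that a plain `Sequential` chain cannot express, here is a hypothetical network with a skip connection. The class name `ResidualNet` and its layer names are made up for this sketch and are not part of the original notebook.

```python
import torch
import torch.nn as nn

class ResidualNet(nn.Module):
    """Hypothetical example: a hidden block whose input is added back to its output."""

    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.input_layer = nn.Linear(D_in, H)
        self.hidden_layer = nn.Linear(H, H)
        self.output_layer = nn.Linear(H, D_out)

    def forward(self, x):
        h = torch.relu(self.input_layer(x))
        # Skip connection: this branching data flow needs a custom forward().
        h = h + torch.relu(self.hidden_layer(h))
        return self.output_layer(h)

model = ResidualNet(D_in=1000, H=100, D_out=10)
y_pred = model(torch.randn(64, 1000))
print(y_pred.shape)
```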

In [25]:
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # define the model architecture
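        # allocate uninitialized tensors, fill them with Xavier-normal values, and register them as parameters
        self.W1 = nn.Parameter(nn.init.xavier_normal_(torch.empty(D_in, H)))
        self.W2 = nn.Parameter(nn.init.xavier_normal_(torch.empty(H, D_out)))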
    
    def forward(self, x):
        y_pred = x.mm(self.W1).clamp(min=0).mm(self.W2)
        return y_pred

model = TwoLayerNet(D_in, H, D_out)
# loss_fn = nn.MSELoss(reduction='sum')
loss_fn = nn.MSELoss()
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for it in range(500):
    # Forward pass
    y_pred = model(x) # model.forward() 
    
    # compute loss
    loss = loss_fn(y_pred, y) # computation graph
    if it % 50 == 0:
        print(it, loss.item())

    # Backward pass
    loss.backward()
    
    # update model parameters
    optimizer.step()
    
    optimizer.zero_grad()

0 2.4040756225585938
50 0.20220546424388885
100 0.016015464439988136
150 0.0016130581498146057
200 0.00014704203931614757
250 1.135985348810209e-05
300 7.175092946454242e-07
350 3.4551110417169184e-08
400 1.1245625541889126e-09
450 2.5372844450477494e-11


### Approach 9:

## PyTorch: a two-layer neural network with custom nn Modules (implicit parameters)
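
"Implicit" here means that the weights live inside `nn.Linear` sub-modules instead of being declared with `nn.Parameter`. Assigning a sub-module or an `nn.Parameter` as an attribute registers it with the parent module, so `model.parameters()` finds it; a plain tensor attribute is not registered. The small class below is a made-up example to show the difference.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical module mixing implicit and explicit parameters."""

    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(4, 3, bias=False)  # implicit: weight owned by the layer
        self.linear2 = nn.Linear(3, 2, bias=False)
        self.scale = nn.Parameter(torch.ones(1))    # explicit: registered directly
        self.not_a_param = torch.zeros(3)           # plain tensor: NOT registered

    def forward(self, x):
        return self.scale * self.linear2(self.linear1(x).clamp(min=0))

model = TinyNet()
names = [name for name, _ in model.named_parameters()]
print(names)

# An optimizer built from model.parameters() will only update the registered ones.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
```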

In [7]:
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # define the model architecture
        self.linear1 = torch.nn.Linear(D_in, H, bias=False)
        self.linear2 = torch.nn.Linear(H, D_out, bias=False)
    
    
    def forward(self, x):
        y_pred = self.linear2(self.linear1(x).clamp(min=0)) 
        return y_pred

model = TwoLayerNet(D_in, H, D_out)
loss_fn = nn.MSELoss(reduction='sum')
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for it in range(500):
    # Forward pass
    y_pred = model(x) # model.forward() 
    
    # compute loss
    loss = loss_fn(y_pred, y) # computation graph
    if it % 50 == 0:
        print(it, loss.item())

    # Backward pass
    loss.backward()
    
    # update model parameters
    optimizer.step()
    
    optimizer.zero_grad()

0 678.007568359375
50 208.61647033691406
100 53.54425811767578
150 9.193710327148438
200 1.215930700302124
250 0.1409573256969452
300 0.013429676182568073
350 0.0009565682266838849
400 4.8972426156979054e-05
450 1.740896436785988e-06
