# Tensors

## Warm-up: numpy

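We will use the problem of fitting `y=sin(x)` with a third-order polynomial as our running example. The network has four parameters and is trained with **gradient descent** to fit the data by minimizing the squared **Euclidean distance** between the network output and the true output.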

In [1]:
# -*- coding: utf-8 -*-
import numpy as np
import math

# Create input and output data
x = np.linspace(-math.pi, math.pi, 2000)
y = np.sin(x)

# Randomly initialize weights
a = np.random.randn()
b = np.random.randn()
c = np.random.randn()
d = np.random.randn()

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    # y = a + b x + c x^2 + d x^3
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    if t % 100 == 99:
        print(t, loss)
        
    # Backprop to compute gradients of loss with respect to a, b, c, d
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()
    
    # Update weights
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d
    
print(f'Result: y = {a} + {b} x + {c} x^2 + {d} x^3')    

99 1578.912027805689
199 1057.588584615739
299 709.7802981630119
399 477.6102641123492
499 322.5438723964877
599 218.91363767876294
699 149.61497665149994
799 103.24410303834762
899 72.19418235180072
999 51.38843314241455
1099 37.43674239225784
1199 28.07398077541992
1299 21.785760382097592
1399 17.558965607681667
1499 14.715375407231367
1599 12.80064376831942
1699 11.510178810447417
1799 10.639628446170537
1899 10.051783858914014
1999 9.654443094126577
Result: y = 0.018890412844026594 + 0.8345999454509323 x + -0.0032589105534211278 x^2 + -0.09018103915838095 x^3
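
As a sanity check on the hand-derived gradients above, the following sketch compares them against central finite differences. The helper names (`loss_fn`, `analytic_grads`, `numeric_grads`), the fixed seed, and the step size `eps` are illustrative assumptions, not part of the original example.

```python
import math
import numpy as np

x = np.linspace(-math.pi, math.pi, 2000)
y = np.sin(x)

def loss_fn(w):
    """Sum of squared errors for coefficients w = (a, b, c, d)."""
    a, b, c, d = w
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    return np.square(y_pred - y).sum()

def analytic_grads(w):
    """Gradients derived by hand, same formulas as in the training loop."""
    a, b, c, d = w
    grad_y_pred = 2.0 * ((a + b * x + c * x ** 2 + d * x ** 3) - y)
    # The gradient for the k-th coefficient multiplies by x^k before summing.
    return np.array([(grad_y_pred * x ** k).sum() for k in range(4)])

def numeric_grads(w, eps=1e-6):
    """Central finite differences, one coefficient at a time."""
    grads = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grads[i] = (loss_fn(w + step) - loss_fn(w - step)) / (2 * eps)
    return grads

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
w = rng.standard_normal(4)
print(np.allclose(analytic_grads(w), numeric_grads(w), rtol=1e-4, atol=1e-3))
```

Because the loss is quadratic in the coefficients, the two estimates should agree up to floating-point round-off.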


## PyTorch: Tensors

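Tensors can keep track of the computational graph and gradients, but they are also useful as a generic tool for scientific computing.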

PyTorch tensors can use GPUs to accelerate numeric computation. To run a PyTorch tensor on a GPU, you only need to specify the correct device.
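
A minimal sketch of device selection, assuming a machine that may or may not have a CUDA-capable GPU. The CPU fallback and the variable names are illustrative choices rather than part of the original example.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the chosen device.
x = torch.linspace(-1.0, 1.0, 5, device=device, dtype=torch.float)

# Tensors can also be moved between devices after creation.
x_cpu = x.to("cpu")

print(x.device, x_cpu.device)
```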

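Here we use PyTorch tensors to fit a third-order polynomial to the sine function. As in the numpy example above, we need to implement the forward and backward passes through the network by hand: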

In [3]:
import torch 
import math

dtype = torch.float
device = torch.device('cpu')
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# Create input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize weights
a = torch.randn((),device=device,dtype=dtype)
b = torch.randn((),device=device,dtype=dtype)
c = torch.randn((),device=device,dtype=dtype)
d = torch.randn((),device=device,dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()  # pow(2) squares each element
    if t % 100 == 99:
        print(t, loss)
     
    # Backprop to compute gradients of loss with respect to a, b, c, d
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()
    
    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d
    
print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')  

99 217.10098266601562
199 156.15792846679688
299 113.04902648925781
399 82.55450439453125
499 60.982635498046875
599 45.72233200073242
699 34.92668914794922
799 27.289316177368164
899 21.88618278503418
999 18.063613891601562
1099 15.359193801879883
1199 13.445828437805176
1299 12.092103958129883
1399 11.134328842163086
1499 10.456680297851562
1599 9.977215766906738
1699 9.637975692749023
1799 9.39794635772705
1899 9.228111267089844
1999 9.10794448852539
Result: y = -0.018036028370261192 + 0.8573822975158691 x + 0.0031115144956856966 x^2 + -0.09342162311077118 x^3


# Autograd

## PyTorch: Tensors and autograd

If `x` is a tensor with `x.requires_grad=True`, then after a backward pass `x.grad` is another tensor holding the gradient of some scalar value with respect to `x`.
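
A minimal sketch of this behavior on a single scalar. The function `z = x^2 + 2x` and the value 3.0 are arbitrary choices for illustration.

```python
import torch

# A scalar tensor that autograd should track.
x = torch.tensor(3.0, requires_grad=True)

# Build a scalar value from x: z = x^2 + 2x.
z = x ** 2 + 2 * x

# x.grad is None until backward() has been called.
print(x.grad)

# Backpropagate; dz/dx = 2x + 2, which is 8 at x = 3.
z.backward()
print(x.grad)
```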

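Here we use PyTorch tensors and autograd to implement our example of fitting a sine wave with a third-order polynomial; now we no longer need to implement the backward pass through the network by hand: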

In [4]:
import torch
import math

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# Create Tensors to hold input and outputs.
# By default, requires_grad=False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Create random Tensors for weights. For a third order polynomial, we need
# 4 weights: y = a + b x + c x^2 + d x^3
# Setting requires_grad=True indicates that we want to compute gradients 
# with respect to these Tensors during the backward pass.
a = torch.randn((),device=device,dtype=dtype,requires_grad=True)
b = torch.randn((),device=device,dtype=dtype,requires_grad=True)
c = torch.randn((),device=device,dtype=dtype,requires_grad=True)
d = torch.randn((),device=device,dtype=dtype,requires_grad=True)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y using operations on Tensors.
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss using operations on Tensors.
    # Now loss is a zero-dimensional Tensor (shape ()).
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())
        
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call a.grad, b.grad, c.grad and d.grad will be Tensors holding
    # the gradient of the loss with respect to a, b, c, d respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd. 
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad
        
        # Manually zero the gradients after updating weights
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None       
        
print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')        

99 257.07525634765625
199 178.2841033935547
299 124.63977813720703
399 88.06747436523438
499 63.10560989379883
599 46.04880905151367
699 34.380149841308594
799 26.388404846191406
899 20.908693313598633
999 17.147083282470703
1099 14.561959266662598
1199 12.783341407775879
1299 11.5582914352417
1399 10.713586807250977
1499 10.13050651550293
1599 9.727605819702148
1699 9.448915481567383
1799 9.255941390991211
1899 9.122188568115234
1999 9.02939224243164
Result: y = 0.013413380831480026 + 0.8497552871704102 x + -0.00231402856297791 x^2 + -0.0923367515206337 x^3


## PyTorch: Defining new autograd functions

Under the hood, each primitive `autograd` operator is really two functions that operate on tensors.

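The `forward` function computes output tensors from input tensors. The `backward` function receives the gradient of some scalar value with respect to the output tensors and computes the gradient of that same scalar value with respect to the input tensors.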

We can easily define our own `autograd` operator by defining a subclass of `torch.autograd.Function` and implementing the `forward` and `backward` functions.
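
A minimal sketch of such a `forward`/`backward` pair, using a hypothetical `Cube` operator that is not part of the original example. It also shows one way to check a hand-written `backward` against finite differences with `torch.autograd.gradcheck`, which expects double-precision inputs.

```python
import torch

class Cube(torch.autograd.Function):
    """Hypothetical custom operator computing input ** 3."""

    @staticmethod
    def forward(ctx, input):
        # Save the input tensor; backward needs it to evaluate the derivative.
        ctx.save_for_backward(input)
        return input ** 3

    @staticmethod
    def backward(ctx, grad_output):
        # Chain rule: dL/dinput = dL/doutput * d(input^3)/dinput.
        input, = ctx.saved_tensors
        return grad_output * 3 * input ** 2

# gradcheck compares backward against finite differences; it needs
# double-precision inputs with requires_grad=True.
t = torch.randn(5, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Cube.apply, (t,), eps=1e-6, atol=1e-4)
print(ok)
```

`gradcheck` raises an error when the analytical and numerical gradients disagree, so a mistake in `backward` surfaces immediately.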

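We define our model as $y=a+bP_3(c+dx)$, where $P_3(x)=\frac{1}{2}(5x^3-3x)$ is the Legendre polynomial of degree three. We write a custom `autograd` function that computes $P_3$ in the forward and backward passes, and use it to implement our model: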

In [8]:
import torch
import math

class LegendrePolynomial3(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """
    
    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return a Tensor containing the output. 
        ctx is a context object that can be used to stash information for backward computation. 
        You can cache tensors for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)
    
    @staticmethod
    def backward(ctx,grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss with respect to the output, 
        and we need to compute the gradient of the loss with respect to the input.
        """
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)
    
dtype = torch.float
device = torch.device('cpu')
# device = torch.device("cuda:0")  # Uncomment this to run on GPU
# Create Tensors to hold input and outputs.
# By default, requires_grad=False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Create random Tensors for weights. For this example, we need 4 weights: 
# y = a + b * P3(c + d * x), these weights need to be initialized not 
# too far from the correct result to ensure convergence.
# Setting requires_grad=True indicates that we want to compute gradients 
# with respect to these Tensors during the backward pass.
a = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
b = torch.full((), -1.0, device=device, dtype=dtype, requires_grad=True)
c = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
d = torch.full((), 0.3, device=device, dtype=dtype, requires_grad=True)

learning_rate = 5e-6
for t in range(2000):
    # To apply our Function, we use Function.apply method. We alias this as 'P3'.
    P3 = LegendrePolynomial3.apply
    
    # Forward pass: compute predicted y using operations; we compute
    # P3 using our custom autograd operation.
    y_pred = a + b * P3(c + d * x)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())
        
    # Use autograd to compute the backward pass.
    loss.backward()
    
    # Update weights using gradient descent
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad

        # Manually zero the gradients after updating weights
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} * P3({c.item()} + {d.item()} x)')

99 209.9583282470703
199 144.6602020263672
299 100.7025146484375
399 71.03520965576172
499 50.97850799560547
599 37.40315246582031
699 28.20688247680664
799 21.97319221496582
899 17.7457275390625
999 14.877889633178711
1099 12.93176555633545
1199 11.610918998718262
1299 10.71424674987793
1399 10.105476379394531
1499 9.69210433959961
1599 9.411375045776367
1699 9.220744132995605
1799 9.091285705566406
1899 9.003360748291016
1999 8.943641662597656
Result: y = 3.5881797533221516e-09 + -2.208526849746704 * P3(-1.6777875755380478e-09 + 0.2554861009120941 x)


# nn.Module

## PyTorch: nn

The `nn` package defines a set of `Module`s, which are roughly equivalent to neural network layers.

The `nn` package also defines a set of useful loss functions that are commonly used when training neural networks.
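
A minimal sketch of both ideas, assuming a small random batch of 4 samples (the batch size and variable names are illustrative): a `Linear` module owns its learnable tensors, and `MSELoss` with `reduction='sum'` matches the hand-written sum of squared errors used in the earlier examples.

```python
import torch

# A Linear module mapping 3 input features to 1 output feature.
layer = torch.nn.Linear(3, 1)

# The module owns its learnable tensors: weight has shape (1, 3), bias (1,).
for name, param in layer.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)

# A batch of 4 samples with 3 features each produces 4 outputs.
inputs = torch.randn(4, 3)
outputs = layer(inputs)
print(tuple(outputs.shape))

# MSELoss with reduction='sum' adds up the squared errors, matching the
# hand-written (y_pred - y).pow(2).sum() used in the earlier examples.
targets = torch.randn(4, 1)
loss_fn = torch.nn.MSELoss(reduction='sum')
manual = (outputs - targets).pow(2).sum()
print(torch.allclose(loss_fn(outputs, targets), manual))
```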

In [40]:
import torch
import math


# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# For this example, the output y is a linear function of (x, x^2, x^3), so
# we can consider it as a linear layer neural network. Let's prepare the tensor (x, x^2, x^3).
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p) # add a dimension to the last dimension

# In the above code, x.unsqueeze(-1) has shape (2000, 1), and p has shape
# (3,), for this case, broadcasting semantics will apply to obtain a tensor
# of shape (2000, 3) 

# Use the nn package to define our model as a sequence of layers. 
# nn.Sequential is a Module which contains other Modules, and applies them in sequence to produce its output. 
# The Linear Module computes output from input using a linear function, and holds internal Tensors for its weight and bias.
# The Flatten layer flattens the output of the linear layer to a 1D tensor, to match the shape of `y`
model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)

# The nn package also contains definitions of popular loss functions; 
# in this case we will use Mean Squared Error (MSE) as our loss function.
# With reduction='sum' the squared errors are summed rather than averaged.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y by passing x to the model. 
    # Module objects override the __call__ operator so you can call them like functions. 
    # When doing so you pass a Tensor of input data to the Module and it produces a Tensor of output data.
    y_pred = model(xx)
    
    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
        
    # Zero the gradients before running the backward pass.
    model.zero_grad()
    
    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()
    
    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
            
# You can access the first layer of `model` like accessing the first item of a list
linear_layer = model[0]

# For linear layer, its parameters are stored as `weight` and `bias`.
print(f'Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3')

99 357.5382385253906
199 250.67713928222656
299 176.72108459472656
399 125.4865493774414
499 89.95793151855469
599 65.29731750488281
699 48.164424896240234
799 36.250606536865234
899 27.958938598632812
999 22.183298110961914
1099 18.156980514526367
1199 15.347993850708008
1299 13.386817932128906
1399 12.016562461853027
1499 11.058523178100586
1599 10.388257026672363
1699 9.919024467468262
1799 9.590336799621582
1899 9.359963417053223
1999 9.198409080505371
Result: y = -0.019527047872543335 + 0.8505232334136963 x + 0.003368740202859044 x^2 + -0.0924459844827652 x^3
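
The broadcasting step that builds `xx` in the cell above can be seen on a tiny input. This is a sketch with three hand-picked values instead of the 2000 samples used in the example.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
p = torch.tensor([1, 2, 3])

# x.unsqueeze(-1) has shape (3, 1); p has shape (3,). Broadcasting pairs every
# element of x with every exponent in p, giving shape (3, 3).
xx = x.unsqueeze(-1).pow(p)

# Row i holds (x_i, x_i^2, x_i^3).
print(tuple(xx.shape))
print(xx)
```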


## PyTorch: optim

We will optimize the model using the `RMSprop` algorithm provided by the `optim` package:

In [41]:
import torch
import math


# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# Prepare the input tensor (x, x^2, x^3).
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use RMSprop; the optim package contains many other
# optimization algorithms. The first argument to the RMSprop constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-3
optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)

for t in range(2000):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(xx)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()


linear_layer = model[0]
print(f'Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3')

99 4114.9404296875
199 2218.18994140625
299 1289.7822265625
399 858.2783203125
499 650.33935546875
599 508.1154479980469
699 387.8284912109375
799 285.99310302734375
899 202.34718322753906
999 136.00958251953125
1099 85.62023162841797
1199 49.863826751708984
1299 27.090984344482422
1399 14.921346664428711
1499 10.114617347717285
1599 9.002344131469727
1699 8.901604652404785
1799 8.904823303222656
1899 8.907588005065918
1999 8.910552024841309
Result: y = 0.00026798946782946587 + 0.8572112917900085 x + 0.00026816074387170374 x^2 + -0.0928269550204277 x^3
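
The gradient accumulation mentioned in the comments above can be observed directly. This is a minimal sketch on a single scalar weight, with the expression `3 * w` chosen arbitrarily for illustration.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3.
(3 * w).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added to the old.
(3 * w).backward()
accumulated = w.grad.item()

# Resetting the gradient, which is what optimizer.zero_grad() does for the
# parameters it manages, makes the next backward pass start fresh.
w.grad = None
(3 * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)
```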


## PyTorch: Custom nn.Modules

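You can define your own modules by subclassing `nn.Module` and defining a `forward` function that receives input tensors and produces output tensors using other modules or other autograd operations on tensors.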
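
A minimal sketch of how a custom module registers its learnable tensors, using a hypothetical `Affine` module that is not part of the original example. Only attributes wrapped in `nn.Parameter` are reported by `named_parameters()` and therefore seen by an optimizer.

```python
import torch

class Affine(torch.nn.Module):
    """Hypothetical module computing scale * x + shift."""

    def __init__(self):
        super().__init__()
        # Wrapping tensors in nn.Parameter registers them with the module.
        self.scale = torch.nn.Parameter(torch.ones(()))
        self.shift = torch.nn.Parameter(torch.zeros(()))
        # A plain tensor attribute is not registered as a parameter.
        self.constant = torch.tensor(2.0)

    def forward(self, x):
        return self.scale * x + self.shift

model = Affine()
names = [name for name, _ in model.named_parameters()]
print(names)
print(model(torch.tensor([1.0, 2.0])))
```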

In [42]:
import torch
import math

class Polynomial3(torch.nn.Module):
    def __init__(self):
        """
        In the constructor we instantiate four parameters and assign them as member parameters.
        """
        super().__init__()
        self.a = torch.nn.Parameter(torch.randn(()))
        self.b = torch.nn.Parameter(torch.randn(()))
        self.c = torch.nn.Parameter(torch.randn(()))
        self.d = torch.nn.Parameter(torch.randn(()))
        
    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return a Tensor of output data. 
        We can use Modules defined in the constructor as well as arbitrary operators on Tensors.
        """
        return self.a + self.b * x + self.c * x ** 2 + self.d * x ** 3
    
    def string(self):
        """
        Just like any class in Python, you can also define custom methods on PyTorch modules
        """
        return f'y = {self.a.item()} + {self.b.item()} x + {self.c.item()} x^2 + {self.d.item()} x^3'
    
# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# Construct our model by instantiating the class defined above
model = Polynomial3()

# Construct our loss function and an Optimizer. 
# The call to model.parameters() in the SGD constructor will contain
# the learnable parameters (a, b, c, d) that are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)

for t in range(2000):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)
    
    # Compute and print loss
    loss = criterion(y_pred, y)
    
    if t % 100 == 99:
        print(t, loss.item())
        
    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
print(f'Result: {model.string()}')   

99 4870.92333984375
199 3300.24609375
299 2239.31103515625
399 1521.9705810546875
499 1036.4495849609375
599 707.4884033203125
699 484.36468505859375
799 332.86248779296875
899 229.8781280517578
999 159.79600524902344
1099 112.05021667480469
1199 79.48486328125
1299 57.2479362487793
1399 42.046260833740234
1499 31.642108917236328
1599 24.513242721557617
1699 19.622968673706055
1799 16.264511108398438
1899 13.955439567565918
1999 12.36606216430664
Result: y = 0.05083204060792923 + 0.8224733471870422 x + -0.008769375272095203 x^2 + -0.08845613896846771 x^3


## PyTorch: Control Flow + Weight Sharing

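As an example of dynamic graphs and weight sharing, we implement a very strange model: a third-to-fifth order polynomial that, on each forward pass, chooses a random number between 3 and 5 and uses that many orders, reusing the same weight multiple times to compute the fourth and fifth orders.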

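For this model we can use ordinary Python flow control to implement the loop, and we can implement weight sharing by reusing the same parameter multiple times when defining the forward pass.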
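
A minimal sketch of what weight sharing means for gradients, assuming a single shared scalar `e` and a fixed input of 2.0 (both illustrative values): every use of the parameter inside the loop contributes to its gradient.

```python
import torch

e = torch.tensor(0.5, requires_grad=True)  # shared weight
x = torch.tensor(2.0)

# Ordinary Python loop: the same parameter e is used once per iteration.
y = torch.zeros(())
for exp in range(4, 6):  # exponents 4 and 5
    y = y + e * x ** exp

y.backward()

# Each use contributes to the gradient: dy/de = x^4 + x^5 = 16 + 32 = 48.
print(e.grad)
```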

In [43]:
# -*- coding: utf-8 -*-
import random
import torch
import math


class DynamicNet(torch.nn.Module):
    def __init__(self):
        """
        In the constructor we instantiate five parameters and assign them as members.
        """
        super().__init__()
        self.a = torch.nn.Parameter(torch.randn(()))
        self.b = torch.nn.Parameter(torch.randn(()))
        self.c = torch.nn.Parameter(torch.randn(()))
        self.d = torch.nn.Parameter(torch.randn(()))
        self.e = torch.nn.Parameter(torch.randn(()))

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose a polynomial order of 3, 4, or 5 and reuse the e parameter to compute the contribution of the fourth and fifth orders.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same parameter many
        times when defining a computational graph.
        """
        y = self.a + self.b * x + self.c * x ** 2 + self.d * x ** 3
        for exp in range(4, random.randint(4, 6)):
            y = y + self.e * x ** exp
        return y

    def string(self):
        """
        Just like any class in Python, you can also define custom methods on PyTorch modules
        """
        return f'y = {self.a.item()} + {self.b.item()} x + {self.c.item()} x^2 + {self.d.item()} x^3 + {self.e.item()} x^4 ? + {self.e.item()} x^5 ?'


# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# Construct our model by instantiating the class defined above
model = DynamicNet()

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-8, momentum=0.9)
for t in range(30000):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 2000 == 1999:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f'Result: {model.string()}')

1999 908.598388671875
3999 405.48858642578125
5999 176.23211669921875
7999 83.85343170166016
9999 45.16403579711914
11999 25.30931854248047
13999 18.274822235107422
15999 12.359508514404297
17999 10.427130699157715
19999 9.594767570495605
21999 9.221528053283691
23999 9.02722454071045
25999 8.944988250732422
27999 8.892292976379395
29999 8.866132736206055
Result: y = -0.0022345606703311205 + 0.8603494763374329 x + -0.0001398671738570556 x^2 + -0.09414059668779373 x^3 + 8.711427653906867e-05 x^4 ? + 8.711427653906867e-05 x^5 ?
