# Linear Regression Example - Concise Implementation

## Generate the dataset

In [1]:
import numpy as np
import torch
from torch.utils import data
from d2l import torch as d2l

In [2]:
# Generate synthetic training data

def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise."""
    # X has num_examples rows and len(w) columns; each entry is drawn from
    # a normal distribution with mean 0 and standard deviation 1.
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b  # y = Xw + b
    # Add Gaussian noise (standard deviation 0.01) with the same shape as y.
    y += torch.normal(0, 0.01, y.shape)

    # Return y as a column vector.
    return X, y.reshape((-1, 1))


In [3]:
true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = d2l.synthetic_data(true_w, true_b, 1000)

## Read the dataset

We can call the framework's existing API to read data. We pass `features` and `labels` as arguments to the API and specify `batch_size` when constructing the data iterator. In addition, the boolean `is_train` indicates whether the data iterator should shuffle the data in each epoch.

In [4]:
def load_array(data_arrays, batch_size, is_train=True):  #@save
    """Construct a PyTorch data iterator."""
    # Wrap the feature and label tensors so that indexing the dataset
    # returns a (feature, label) pair.
    dataset = data.TensorDataset(*data_arrays)
    # DataLoader yields batch_size examples at a time, in random order
    # when shuffle=True.
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

In [5]:
batch_size = 10
data_iter = load_array((features, labels), batch_size)

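We use `data_iter` in the same way as we used the `data_iter` function in the from-scratch implementation. To verify that it works, let us read and print the first minibatch of examples.

Unlike in the from-scratch implementation, here we use `iter` to construct a Python iterator and `next` to fetch the first item from it.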

In [6]:
next(iter(data_iter))

[tensor([[-0.1407,  0.1276],
         [ 0.8594, -0.0988],
         [-1.3422,  0.0502],
         [ 2.6259,  1.1320],
         [ 1.8280, -0.2564],
         [-0.9355,  0.8142],
         [ 0.9108, -0.0376],
         [-0.9746,  0.3416],
         [ 0.2967,  1.0541],
         [ 0.0256,  0.5819]]),
 tensor([[ 3.4846],
         [ 6.2583],
         [ 1.3561],
         [ 5.6188],
         [ 8.7156],
         [-0.4448],
         [ 6.1490],
         [ 1.1163],
         [ 1.2060],
         [ 2.2765]])]
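
To see what the batch size and the shuffle flag do, here is a minimal sketch on a tiny hand-made dataset. The names `toy_features`, `toy_labels`, and `toy_loader` are illustrative assumptions and do not appear in the notebook. Shuffling is turned off so that the order of the examples is predictable.

```python
import torch
from torch.utils import data

# Hypothetical toy dataset: 6 examples with 2 features each.
toy_features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
toy_labels = torch.arange(6, dtype=torch.float32).reshape(6, 1)

toy_dataset = data.TensorDataset(toy_features, toy_labels)
# shuffle=False keeps the examples in their original order.
toy_loader = data.DataLoader(toy_dataset, batch_size=4, shuffle=False)

# 6 examples with batch_size=4 give one full batch and one partial batch.
batch_sizes = [len(X_batch) for X_batch, y_batch in toy_loader]

# next(iter(...)) fetches the first minibatch as a [features, labels] pair.
first_X, first_y = next(iter(toy_loader))
print(batch_sizes, first_y.flatten())
```

With `shuffle=True` the batch sizes stay the same, but the examples in each batch are drawn in a new random order every epoch.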

## Define the model

When we implemented linear regression from scratch, we explicitly defined the model parameters and wrote the code that computes the output using basic linear algebra operations. However, as models become more complex, and when we need to implement models almost daily, it is natural to want to simplify this process.

For standard deep learning models, we can use the framework's predefined layers. This lets us focus on which layers to use to construct the model rather than on how each layer is implemented.

Linear regression can be viewed as a neural network with a single layer.

We first define a model variable `net`, which is an instance of the `Sequential` class. The `Sequential` class chains multiple layers together; in effect, it is an ordered list of layers. Given input data, a `Sequential` instance passes the data to the first layer, then passes the output of the first layer as the input to the second layer, and so on. In the example below, our model contains only one layer, so we do not actually need `Sequential`. However, since almost all of our later models have multiple layers, using `Sequential` here will familiarize you with the "standard pipeline".

Recall the single-layer network architecture in single_neuron. This single layer is called a *fully-connected layer*, because each of its inputs is connected to each of its outputs by means of a matrix-vector multiplication.
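
The chaining behavior of `Sequential` described above can be checked with a small sketch. The two-layer stack and the names `first`, `second`, `chain`, and `X_demo` are assumptions made for illustration only; the model in this notebook has a single layer.

```python
import torch
from torch import nn

# Hypothetical two-layer stack, used only to illustrate chaining.
first = nn.Linear(2, 3)
second = nn.Linear(3, 1)
chain = nn.Sequential(first, second)

X_demo = torch.randn(4, 2)  # 4 made-up examples with 2 features each

# Sequential feeds the output of each layer into the next one,
# so calling the layers by hand gives the same result.
same_output = torch.allclose(chain(X_demo), second(first(X_demo)))

# Like a list, a Sequential can be indexed and measured.
num_layers = len(chain)
first_is_indexed = chain[0] is first
print(same_output, num_layers, first_is_indexed)
```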

In PyTorch, the fully-connected layer is defined in the `Linear` class. Note that we pass two arguments to `nn.Linear`. The first specifies the number of input features, which is 2. The second specifies the number of output features; since the output is a single scalar, it is 1.
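
As a minimal sketch of what the two arguments mean, the snippet below builds a separate `nn.Linear(2, 1)` layer and inspects the shapes of its parameters and output. The batch of 10 random inputs and the names `layer` and `X_demo` are illustrative assumptions.

```python
import torch
from torch import nn

layer = nn.Linear(2, 1)        # 2 input features, 1 output feature
X_demo = torch.randn(10, 2)    # hypothetical minibatch of 10 examples

weight_shape = tuple(layer.weight.shape)   # (out_features, in_features)
bias_shape = tuple(layer.bias.shape)       # (out_features,)
output_shape = tuple(layer(X_demo).shape)  # (batch_size, out_features)

# The layer computes X @ W^T + b, the same linear model as y = Xw + b.
manual = X_demo @ layer.weight.T + layer.bias
matches_manual = torch.allclose(layer(X_demo), manual, atol=1e-6)

print(weight_shape, bias_shape, output_shape, matches_manual)
```

The weight is stored with shape `(out_features, in_features)`, which is why the training section later reshapes it before comparing it with `true_w`.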

In [7]:
# nn is an abbreviation for neural networks
from torch import nn

net = nn.Sequential(nn.Linear(2, 1))

## Initialize the model parameters

Before using `net`, we need to initialize the model parameters, such as the weight and bias of the linear regression model. Deep learning frameworks usually provide predefined ways to initialize parameters. Here, we specify that each weight parameter is randomly sampled from a normal distribution with mean 0 and standard deviation 0.01, and that the bias parameter is initialized to zero.

Because we specified the input and output dimensions when constructing `nn.Linear`, we can now access the parameters directly to set their initial values. We select the first layer of the network with `net[0]` and then access the parameters through the `weight.data` and `bias.data` attributes. We then use the in-place methods `normal_` and `fill_` to overwrite the parameter values.
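
As a hedged alternative, PyTorch also provides helpers in `torch.nn.init` that modify a parameter in place without going through `.data`. The sketch below shows how the same initialization might be written with them. It is not the method used in this notebook, and the layer with 1000 inputs (`demo_layer`) is a made-up example chosen so that the sample standard deviation is easy to check.

```python
import torch
from torch import nn

torch.manual_seed(0)
demo_layer = nn.Linear(1000, 1)  # hypothetical layer, for illustration only

# Draw the weights from N(0, 0.01^2) and set the bias to zero, in place.
nn.init.normal_(demo_layer.weight, mean=0.0, std=0.01)
nn.init.zeros_(demo_layer.bias)

weight_std = demo_layer.weight.std().item()
bias_is_zero = bool((demo_layer.bias == 0).all())
print(f'sample std of weights: {weight_std:.4f}, bias all zero: {bias_is_zero}')
```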


In [8]:
net[0].weight.data.normal_(0, 0.01)  # equivalent to initializing w by hand in the from-scratch version
net[0].bias.data.fill_(0)  # equivalent to initializing b by hand in the from-scratch version

tensor([0.])

## Define the loss function

The `MSELoss` class computes the mean squared error, also known as the squared $L_2$ norm. By default, it returns the average of the losses over all examples.
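
The default averaging can be verified on a few made-up numbers. In this sketch, `y_pred` and `y_true` are hypothetical values chosen for easy arithmetic, and `reduction='sum'` is shown only for contrast with the default `reduction='mean'`.

```python
import torch
from torch import nn

y_pred = torch.tensor([[2.5], [0.0], [2.0]])    # hypothetical predictions
y_true = torch.tensor([[3.0], [-0.5], [2.0]])   # hypothetical labels

mean_loss = nn.MSELoss()(y_pred, y_true)                 # default: reduction='mean'
sum_loss = nn.MSELoss(reduction='sum')(y_pred, y_true)   # total instead of average

# The same average, written out by hand.
manual_mean = ((y_pred - y_true) ** 2).mean()

print(mean_loss.item(), sum_loss.item(), manual_mean.item())
```

The squared errors are 0.25, 0.25, and 0, so the summed loss is 0.5 and the mean loss is 0.5 / 3.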

In [9]:
loss = nn.MSELoss()

## Define the optimization algorithm

Minibatch stochastic gradient descent is a standard tool for optimizing neural networks, and PyTorch implements many variants of this algorithm in the `optim` module.

When we instantiate an `SGD` instance, we specify the parameters to optimize (obtainable from our model via `net.parameters()`) and the hyperparameters required by the optimization algorithm, passed as keyword arguments. Minibatch stochastic gradient descent only requires the learning rate `lr`, which we set to 0.03 here.
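
To make the role of `lr` concrete, the following sketch applies one plain SGD step to a toy parameter and compares it with the update `param - lr * grad` computed by hand. The parameter values and the quadratic toy loss are assumptions chosen for easy arithmetic; they are unrelated to the model in this notebook.

```python
import torch

param = torch.tensor([1.0, -2.0], requires_grad=True)  # hypothetical parameter
lr = 0.03
optimizer = torch.optim.SGD([param], lr=lr)

toy_loss = (param ** 2).sum()   # toy loss whose gradient is 2 * param
optimizer.zero_grad()
toy_loss.backward()

# Plain SGD update computed by hand, before the optimizer changes param.
expected = param.detach() - lr * param.grad
optimizer.step()

step_matches = torch.allclose(param.detach(), expected)
print(param.detach(), step_matches)
```

The gradient is `[2.0, -4.0]`, so one step moves the parameter from `[1.0, -2.0]` to `[0.94, -1.88]`.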


In [10]:
trainer = torch.optim.SGD(net.parameters(), lr=0.03)

## Training

Implementing our model with the high-level APIs of a deep learning framework requires relatively little code. We do not need to allocate parameters individually, define our own loss function, or implement minibatch stochastic gradient descent by hand. The advantages of the high-level APIs grow considerably once we need more complex models.

Once we have all the basic components, the training code is very similar to what we wrote when implementing everything from scratch.

To recap: in each epoch, we make one complete pass over the dataset (`train_data`), repeatedly fetching one minibatch of inputs and the corresponding labels.

For each minibatch, we perform the following steps:

* Generate predictions by calling `net(X)` and compute the loss `l` (forward propagation).
* Compute the gradients by running backpropagation.
* Update the model parameters by calling the optimizer.

To monitor training, we compute the loss on the full dataset after each epoch and print it.

In [11]:
num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)  # net holds its own parameters, so w and b are not passed in; compare predictions with the true y
        trainer.zero_grad()  # clear the gradients from the previous step
        l.backward()  # compute the gradients; the loss is already a scalar (batch mean), so no explicit sum is needed
        trainer.step()  # update the model parameters
    l = loss(net(features), labels)  # after one pass over the data, compute the loss on the full dataset
    print(f'epoch {epoch + 1}, loss {l:f}')

epoch 1, loss 0.000271
epoch 2, loss 0.000097
epoch 3, loss 0.000097


Next, we compare the true parameters used to generate the dataset with the model parameters learned by training on finite data.

To access the parameters, we first select the required layer from `net` and then read that layer's weight and bias.

As in the from-scratch implementation, the estimated parameters are very close to the true parameters that generated the data.

In [12]:
w = net[0].weight.data
print('Estimation error of w:', true_w - w.reshape(true_w.shape))
b = net[0].bias.data
print('Estimation error of b:', true_b - b)

Estimation error of w: tensor([-0.0005,  0.0003])
Estimation error of b: tensor([-4.8637e-05])
