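# 5.5 Regression with PyTorch in Practice

Regression is one of the main tasks in machine learning; it models the relationship between input variables (independent variables) and an output variable (the dependent variable) in order to predict the latter. Below we use the classic Auto MPG (Miles Per Gallon) dataset to show how to build a deep network model with PyTorch that predicts a car's fuel efficiency.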

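## 5.5.1 Data Processing

The Auto MPG dataset contains 398 records, each with 9 columns: fuel efficiency (MPG), number of cylinders, displacement, horsepower, weight, acceleration, model year, and origin, which are 8 numeric attributes, plus the car name. MPG is the value to be predicted; the code below reads only the eight numeric columns.

Once the meaning of each column is clear, we can load the dataset with Pandas: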

In [7]:
import pandas as pd
dataset_path = r'C:\Users\86188\Desktop\人工智能导论\jupyter notebook\第五章\auto-mpg.data'
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
# "?" marks a missing value; everything after a tab is skipped as a comment
raw_dataset = pd.read_csv(dataset_path, names=column_names, na_values="?",
                          comment='\t', sep=" ", skipinitialspace=True)
dataset = raw_dataset.copy()
dataset  # display the dataset

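| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |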


Some fields in the dataset are empty. To simplify processing, we drop the rows that contain unknown values:

In [8]:
dataset = dataset.dropna()
dataset  # display the dataset again

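| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |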
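
Before dropping rows it can help to see where the missing values actually are. The sketch below is not part of the original notebook: it uses a tiny hypothetical in-memory sample (`sample_text`, `sample_columns`) in place of `auto-mpg.data`, parses `?` as missing in the same way, and counts the missing values per column before calling `dropna()`.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for auto-mpg.data;
# the "?" in the second row marks an unknown horsepower value.
sample_text = """18.0 8 307.0 130.0
25.0 4 98.00 ?
15.0 8 350.0 165.0
"""
sample_columns = ['MPG', 'Cylinders', 'Displacement', 'Horsepower']

sample = pd.read_csv(io.StringIO(sample_text), names=sample_columns,
                     na_values="?", sep=" ", skipinitialspace=True)

missing_per_column = sample.isna().sum()   # number of NaN entries in each column
cleaned = sample.dropna()                  # keep only rows without any NaN

print(missing_per_column)
print(len(sample), '->', len(cleaned), 'rows')
```

Running the same `isna().sum()` call on the real `dataset` before `dropna()` would show which columns are affected and how many rows will be lost.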


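The "Origin" feature is actually categorical: it takes the values 1 to 3, which stand for USA, Europe, and Japan respectively. It can be used as is, or it can be converted to a one-hot encoding, which turns it into three features: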

In [9]:
dataset = dataset.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
origin = dataset.pop('Origin')
dataset['USA'] = (origin == 1) * 1.0
dataset['Europe'] = (origin == 2) * 1.0
dataset['Japan'] = (origin == 3) * 1.0
dataset  # display the dataset again


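| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | USA | Europe | Japan |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1.0 | 0.0 | 0.0 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1.0 | 0.0 | 0.0 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1.0 | 0.0 | 0.0 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1.0 | 0.0 | 0.0 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1.0 | 0.0 | 0.0 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 0.0 | 1.0 | 0.0 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1.0 | 0.0 | 0.0 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1.0 | 0.0 | 0.0 |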


The dataset now no longer has the Origin feature and instead has three new features: USA, Europe, and Japan.
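
The same encoding can also be produced with `pd.get_dummies`, which may be convenient when a categorical feature has many values. This is an alternative sketch, not the approach used in this section; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature frame; only the Origin codes matter here.
toy = pd.DataFrame({'MPG': [18.0, 44.0, 32.0, 24.0],
                    'Origin': [1, 2, 3, 1]})

# Map the integer codes to readable names, then expand into indicator columns.
origin_names = toy.pop('Origin').map({1: 'USA', 2: 'Europe', 3: 'Japan'})
one_hot = pd.get_dummies(origin_names, dtype=float)

# get_dummies orders columns alphabetically, so select them in the order used above.
toy = pd.concat([toy, one_hot[['USA', 'Europe', 'Japan']]], axis=1)

print(toy)
```

Each row ends up with exactly one of the three indicator columns set to 1.0, which matches the result of the manual comparison approach.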

To evaluate the model at the end, we split the dataset into a training set and a test set:

In [10]:
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
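
A quick way to convince yourself that this `sample`/`drop` pattern yields two disjoint parts is to check the indices. The sketch below does so on a hypothetical 10-row frame (`toy`), not on the real data.

```python
import pandas as pd

# Hypothetical 10-row frame standing in for the cleaned Auto MPG data.
toy = pd.DataFrame({'MPG': [float(i) for i in range(10)]})

train_part = toy.sample(frac=0.8, random_state=0)  # 80% of the rows, reproducible
test_part = toy.drop(train_part.index)             # the remaining 20%

# The two index sets should not overlap and together should cover every row.
overlap = train_part.index.intersection(test_part.index)
print(len(train_part), len(test_part), len(overlap))
```

Fixing `random_state` makes the split reproducible, so repeated runs train and test on the same rows.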

Compute summary statistics for each feature of the training set:

In [11]:
train_stats = train_dataset.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()

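Separate the labels from the features in both the training set and the test set. The label is the value the model has to predict, namely the fuel efficiency: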

In [12]:
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

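To help the model converge faster, we normalize each feature; here this means standardizing it with the mean and standard deviation of the training set. The model may still converge without normalization, but training can be slower and the resulting model would depend on the units chosen for the features.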

In [13]:
def norm(x):
    return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)

The normalized data will be used to train the model from here on.
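
As a sanity check on this kind of normalization, the sketch below applies the same `norm` pattern to hypothetical synthetic features (`train_x`, `test_x`, generated with NumPy, not the real data) and confirms that the training features end up with mean close to 0 and standard deviation close to 1. The test features are scaled with the training statistics, so their mean is only approximately 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical features on very different scales (not the real data).
train_x = pd.DataFrame({'Weight': rng.normal(3000, 800, size=200),
                        'Acceleration': rng.normal(15, 3, size=200)})
test_x = pd.DataFrame({'Weight': rng.normal(3000, 800, size=50),
                       'Acceleration': rng.normal(15, 3, size=50)})

stats = train_x.describe().transpose()   # one row per feature, like train_stats

def norm(x):
    # Always use the training-set mean and std, including for test data.
    return (x - stats['mean']) / stats['std']

normed_train = norm(train_x)
normed_test = norm(test_x)

print(normed_train.mean().round(6))   # approximately 0 for every feature
print(normed_train.std().round(6))    # approximately 1 for every feature
print(normed_test.mean().round(2))    # close to 0, but not exactly
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.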

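## 5.5.2 Building the Model

Here we build a simple neural network regression model with two fully connected hidden layers, each followed by a ReLU activation, and a linear output layer. The model is defined in a Model class, which is then instantiated: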

In [14]:
import torch
from torch import nn
class Model(nn.Module):
    def __init__(self, input_dim, middle_dim):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(input_dim, middle_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(middle_dim, middle_dim)
        self.out = nn.Linear(middle_dim, 1)
    
    def forward(self, x):
        return self.out(self.relu(self.fc2(self.relu(self.fc1(x))))).view(-1)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Model(len(train_dataset.keys()), 64).to(device)
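
Before training, it is often worth checking the output shape and the number of parameters. The sketch below repeats the class definition so that it runs on its own, and feeds in a hypothetical random batch (`dummy_batch`); the input size 9 corresponds to the six remaining numeric features plus the three one-hot origin columns.

```python
import torch
from torch import nn

class Model(nn.Module):
    # Same architecture as above, repeated so this snippet runs on its own.
    def __init__(self, input_dim, middle_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, middle_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(middle_dim, middle_dim)
        self.out = nn.Linear(middle_dim, 1)

    def forward(self, x):
        # view(-1) flattens the (batch, 1) output to shape (batch,)
        return self.out(self.relu(self.fc2(self.relu(self.fc1(x))))).view(-1)

check_model = Model(9, 64)               # 9 input features, 64 hidden units
dummy_batch = torch.randn(5, 9)          # hypothetical batch of 5 samples
output = check_model(dummy_batch)
n_params = sum(p.numel() for p in check_model.parameters())

print(output.shape)   # torch.Size([5])
print(n_params)       # 640 + 4160 + 65 = 4865
```

The flattened output shape matters because `MSELoss` compares it with a label tensor of shape `(batch,)`; a `(batch, 1)` output would broadcast against the labels and give a misleading loss.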

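## 5.5.3 Training the Model

Next we create the DataLoaders used for training and testing. The loss function is the mean squared error loss MSELoss and the optimizer is SGD. Training runs for 20 epochs, and the loss is printed every 4 epochs. Note that the printed value is the loss of the last mini-batch of that epoch rather than an average over the whole training set, so it fluctuates from one printout to the next.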

In [22]:
from torch.utils.data import TensorDataset, DataLoader
from torch import optim

# Wrap the normalized features and the labels as float32 tensors.
# New names (train_ds, test_ds) keep the pandas DataFrames above intact.
train_ds = TensorDataset(torch.from_numpy(normed_train_data.values).float(),
                         torch.from_numpy(train_labels.values).float())
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)
test_ds = TensorDataset(torch.from_numpy(normed_test_data.values).float(),
                        torch.from_numpy(test_labels.values).float())
test_loader = DataLoader(test_ds, batch_size=32, shuffle=False, num_workers=4)

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

epochs = 20
for epoch in range(epochs):
    for feats, labels in train_loader:
        optimizer.zero_grad()
        feats, labels = feats.to(device), labels.to(device)
        y = model(feats)
        loss = criterion(y, labels)
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 4 == 0:
        # loss is the loss of the last mini-batch in this epoch
        print('Epoch {}/{}: loss {}'.format(epoch + 1, epochs, loss.item()))


Epoch 4/20: loss 4.9366607666015625
Epoch 8/20: loss 2.0439815521240234
Epoch 12/20: loss 5.02448034286499
Epoch 16/20: loss 3.717895746231079
Epoch 20/20: loss 3.914015769958496
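
The values above do not decrease monotonically because each one comes from a single mini-batch. A smoother signal is the sample-weighted average loss over the whole epoch. The sketch below shows one way to accumulate it; to stay self-contained it uses hypothetical synthetic data and a single linear layer (`net`) rather than the model and data from this section.

```python
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
# Hypothetical synthetic regression data: y = 3*x0 - 2*x1 + small noise.
features = torch.randn(256, 2)
targets = 3 * features[:, 0] - 2 * features[:, 1] + 0.1 * torch.randn(256)
loader = DataLoader(TensorDataset(features, targets), batch_size=32, shuffle=True)

net = nn.Linear(2, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.05)

epoch_losses = []
for epoch in range(10):
    running, seen = 0.0, 0
    for feats, labels in loader:
        optimizer.zero_grad()
        loss = criterion(net(feats).view(-1), labels)
        loss.backward()
        optimizer.step()
        # criterion returns the batch mean, so weight it by the batch size.
        running += loss.item() * feats.size(0)
        seen += feats.size(0)
    epoch_losses.append(running / seen)

print(['{:.3f}'.format(v) for v in epoch_losses])
```

The same two accumulators (`running` and `seen`) could be added to the training loop of this section to report a per-epoch average instead of the last batch loss.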


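## 5.5.4 Testing the Model

We use the test set to check how well the model generalizes. Because the test set was not used during training, the result gives an estimate of how the model is likely to perform in practice. The code is as follows: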

In [23]:
model.eval()  # evaluation mode (no effect on this model, but good practice)
preds = torch.empty((0,), device=device)
gts = torch.empty((0,), device=device)
with torch.no_grad():
    for feats, labels in test_loader:
        feats, labels = feats.to(device), labels.to(device)
        y = model(feats)
        preds = torch.cat((preds, y), dim=0)
        gts = torch.cat((gts, labels), dim=0)
    err = criterion(preds, gts)
    # MSE is expressed in squared label units, i.e. MPG^2
    print('Testing set MSE:  {:5.2f} MPG^2'.format(err.item()))

Testing set MSE:   5.38 MPG^2
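
Because the MSE is measured in squared units, it is often easier to interpret its square root (RMSE) or the mean absolute error (MAE), both of which are in MPG. A test MSE of 5.38, for instance, corresponds to an RMSE of about 2.32 MPG. The sketch below computes all three metrics on a few hypothetical values (`example_preds`, `example_gts`); in the notebook, the tensors `preds` and `gts` would take their place.

```python
import torch

# Hypothetical predictions and ground-truth values in MPG.
example_preds = torch.tensor([20.0, 30.0, 25.0])
example_gts = torch.tensor([22.0, 27.0, 25.0])

mse = torch.mean((example_preds - example_gts) ** 2)       # units: MPG squared
rmse = torch.sqrt(mse)                                      # units: MPG
mae = torch.mean(torch.abs(example_preds - example_gts))    # units: MPG

print('MSE {:.2f}  RMSE {:.2f}  MAE {:.2f}'.format(mse.item(), rmse.item(), mae.item()))
# prints: MSE 4.33  RMSE 2.08  MAE 1.67
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a rough idea of whether a few bad predictions dominate the error.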
