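# Data Processing (Missing-Value Handling, Feature Selection) and K-Fold Cross-Validation
Using the Boston house-price project as an example (the data comes from the Kaggle House Prices competition linked below, whose houses are located in Ames, Iowa), this notebook demonstrates common data-processing techniques. The focus is on implementing K-fold cross-validation when the dataset is small.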

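## Downloading the data
On the competition web page we can read about the house-price prediction competition and see participants' scores. We can also download the dataset and submit our own predictions. The competition page is at

https://www.kaggle.com/c/house-prices-advanced-regression-techniques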

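## Reading the dataset
The competition data is split into a training set and a test set. Both contain the features of each house, such as street type, year built, roof type, and basement condition. Feature values may be continuous numbers, discrete labels, or even missing ("na"). Only the training set includes each house's price, which is the label. We can visit the competition page, click the "Data" tab shown in Figure 3.8, and download the datasets.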


In [18]:
import tensorflow as tf
from tensorflow.keras import layers, models ,losses
from tensorflow import initializers as init
import numpy as np
import pandas as pd 
import plotly as py 
import plotly.graph_objects as go
from plotly.subplots import make_subplots

print('TensorFlow version:', tf.__version__)
print('Numpy version:', np.__version__)
print('Pandas version:', pd.__version__)
print('Plotly version:', py.__version__)

TensorFlow version: 2.2.0
Numpy version: 1.16.5
Pandas version: 1.0.4
Plotly version: 4.7.0


The training set contains 1,460 samples, 80 features (counting `Id`), and 1 label.

In [2]:
train_data = pd.read_csv('./Data/bostonhouse/train.csv')
test_data  = pd.read_csv('./Data/bostonhouse/test.csv')
print(train_data.head())
print('train_data shape:', train_data.shape)
print('test_data shape:', test_data.shape)

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   

Look at the first 4 features, the last 2 features, and the label (SalePrice) of the first 4 samples:

In [3]:
train_data.iloc[0:4, [0,1,2,3,-3,-2,-1]]

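| | Id | MSSubClass | MSZoning | LotFrontage | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | WD | Abnorml | 140000 |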


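We can see that the first feature is Id. It would help the model memorize each training sample but does not generalize to test samples, so we do not use it for training. We concatenate the 79 features of all training and test samples row-wise, one row per sample.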

In [5]:
all_features = pd.concat([train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]])
all_features.head()

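| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal |
| 3 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml |
| 4 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal |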


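## Data preprocessing
We standardize the continuous numeric features. Let $\mu$ be a feature's mean over the whole dataset and $\sigma$ its standard deviation. Each value of the feature is standardized by subtracting $\mu$ and then dividing by $\sigma$. Missing feature values are replaced with the feature's mean.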

In [8]:
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index 
all_features[numeric_features] = all_features[numeric_features].apply(lambda x: (x - x.mean()) / (x.std()))
# After standardization each feature has mean 0, so missing values can simply be replaced with 0
all_features = all_features.fillna(0)
all_features.head()

| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.06732 | RL | -0.202033 | -0.217841 | Pave | 0 | Reg | Lvl | AllPub | Inside | ... | -0.285886 | -0.063139 | 0 | 0 | 0 | -0.089577 | -1.551918 | 0.157619 | WD | Normal |
| 1 | -0.873466 | RL | 0.501785 | -0.072032 | Pave | 0 | Reg | Lvl | AllPub | FR2 | ... | -0.285886 | -0.063139 | 0 | 0 | 0 | -0.089577 | -0.446848 | -0.602858 | WD | Normal |
| 2 | 0.06732 | RL | -0.061269 | 0.137173 | Pave | 0 | IR1 | Lvl | AllPub | Inside | ... | -0.285886 | -0.063139 | 0 | 0 | 0 | -0.089577 | 1.026577 | 0.157619 | WD | Normal |
| 3 | 0.302516 | RL | -0.436639 | -0.078371 | Pave | 0 | IR1 | Lvl | AllPub | Corner | ... | -0.285886 | -0.063139 | 0 | 0 | 0 | -0.089577 | -1.551918 | -1.363335 | WD | Abnorml |
| 4 | 0.06732 | RL | 0.689469 | 0.518814 | Pave | 0 | IR1 | Lvl | AllPub | FR2 | ... | -0.285886 | -0.063139 | 0 | 0 | 0 | -0.089577 | 2.131647 | 0.157619 | WD | Normal |
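
To see why filling with 0 after standardization amounts to filling with the mean, here is a minimal dependency-free sketch. The column values and the helper name `standardize_and_fill` are illustrative assumptions, not part of the notebook. It mirrors the pandas behavior used above, where `mean()` and `std()` skip missing values and `std()` is the sample standard deviation.

```python
from statistics import mean, stdev

def standardize_and_fill(values):
    """Standardize a numeric column, then replace missing entries (None) with 0."""
    observed = [v for v in values if v is not None]
    # Statistics are computed on observed values only; stdev is the sample
    # standard deviation (ddof=1), like pandas' Series.std()
    mu, sigma = mean(observed), stdev(observed)
    return [0.0 if v is None else (v - mu) / sigma for v in values]

# Hypothetical LotFrontage-like column with one missing value
column = [65.0, 80.0, None, 60.0, 84.0]
scaled = standardize_and_fill(column)
print(scaled)
```

A missing entry that had been set to $\mu$ before scaling would map to $(\mu - \mu)/\sigma = 0$, which is exactly the value written by `fillna(0)`.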


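Next we convert the discrete values into indicator features. For example, suppose the feature MSZoning takes two distinct discrete values, RL and RM. This step removes the MSZoning feature and adds two new features, MSZoning_RL and MSZoning_RM, each with value 0 or 1. If a sample's original MSZoning value is RL, then MSZoning_RL=1 and MSZoning_RM=0.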

In [10]:
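# dummy_na=True treats a missing value as a valid feature value and creates an indicator feature for it
# (fillna(0) has already run above, so missing categorical entries appear here as the value 0)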
all_features = pd.get_dummies(all_features, dummy_na=True)
all_features.shape # (2919, 354)

(2919, 354)
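
The MSZoning example described above can be reproduced by hand. The following is a small illustrative sketch in plain Python, not how `pd.get_dummies` is implemented. The helper name `one_hot` and the three sample values are assumptions made for the example.

```python
def one_hot(values, prefix):
    """Return {column_name: [0/1, ...]} with one indicator column per distinct value."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{c}": [1 if v == c else 0 for v in values]
        for c in categories
    }

# Three hypothetical samples of the MSZoning feature
zoning = ["RL", "RM", "RL"]
encoded = one_hot(zoning, "MSZoning")
print(encoded)
# {'MSZoning_RL': [1, 0, 1], 'MSZoning_RM': [0, 1, 0]}
```

Each sample has exactly one indicator set to 1, so the original category can always be recovered from the new columns.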

In [28]:
n_train = train_data.shape[0]
train_features = np.array(all_features[:n_train].values,dtype=np.float32)
test_features = np.array(all_features[n_train:].values,dtype=np.float32)
train_labels = np.array(train_data.SalePrice.values.reshape(-1, 1),dtype=np.float32)
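
One likely reason for preprocessing the concatenated data and only then slicing it at `n_train` is that the training and test rows end up with identical columns. The toy sketch below illustrates this with made-up category values. The variable names are hypothetical and the encoding is done by hand rather than with pandas.

```python
# Toy categorical column: "FV" appears only in the (hypothetical) test rows
train_zoning = ["RL", "RM", "RL"]
test_zoning = ["RL", "FV"]

# Encode train and test together so they share one category vocabulary
all_zoning = train_zoning + test_zoning
categories = sorted(set(all_zoning))  # ['FV', 'RL', 'RM']
encoded = [[1 if v == c else 0 for c in categories] for v in all_zoning]

# Split back by position, as the notebook does with n_train
n_train = len(train_zoning)
train_encoded, test_encoded = encoded[:n_train], encoded[n_train:]
print(train_encoded)
print(test_encoded)
```

Had the two parts been encoded separately, the training rows would have had no "FV" column and the feature matrices would not line up.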

## Building the model

In [29]:
def get_net():
    net = models.Sequential()
    net.add(layers.Dense(1))
    return net
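
A single `Dense(1)` layer with no activation is a linear regression model: for each sample it returns a weighted sum of the features plus a bias. As a rough sketch of that computation for one sample (the function name `dense_1` and the numbers are illustrative assumptions, and no training is involved):

```python
def dense_1(x, weights, bias):
    """Output of a one-unit linear layer for one sample: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

# Hypothetical sample with two features and hand-picked parameters
y_hat = dense_1([1.0, 2.0], weights=[0.5, -1.0], bias=3.0)
print(y_hat)  # 1.5
```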

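Next we define the root-mean-squared logarithmic error that the competition uses to evaluate models. Given predictions $\hat y_1, \ldots, \hat y_n$ and the corresponding true labels $y_1,\ldots, y_n$, it is defined as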

$$\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log(y_i)-\log(\hat y_i)\right)^2}.$$

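Keras already ships a closely related loss, `mean_squared_logarithmic_error`, so we call it directly. Note that this loss is the mean of $\left(\log(1+y_i)-\log(1+\hat y_i)\right)^2$ and takes no square root, so its values are on a squared scale compared with the formula above.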

log_rmse=tf.keras.losses.mean_squared_logarithmic_error
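
To make the relationship between the competition metric and the Keras loss concrete, the sketch below implements both in plain Python. The function names are assumptions for illustration, `msle_log1p` follows my understanding of how Keras defines `mean_squared_logarithmic_error`, and the predicted prices are made up.

```python
import math

def rmsle(y_true, y_pred):
    """Root-mean-squared logarithmic error, as in the formula above."""
    squared = [(math.log(t) - math.log(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def msle_log1p(y_true, y_pred):
    """Mean squared logarithmic error with a +1 offset and no square root."""
    squared = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared) / len(squared)

y_true = [208500.0, 181500.0]  # first two SalePrice values from the table above
y_pred = [200000.0, 190000.0]  # hypothetical predictions
print(rmsle(y_true, y_pred))
print(math.sqrt(msle_log1p(y_true, y_pred)))
```

For prices of this magnitude the +1 offset is negligible, so the square root of the MSLE is very close to the competition metric.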

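## K-fold cross-validation
Next we define a K-fold cross-validation helper that returns the training and validation data needed for the i-th fold.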

In [30]:
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = tf.concat([X_train, X_part], axis=0)
            y_train = tf.concat([y_train, y_part], axis=0)
    return X_train, y_train, X_valid, y_valid
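
The slicing logic of `get_k_fold_data` is easier to follow on plain indices. The sketch below uses the same `fold_size = n // k` rule but returns row indices instead of data. The function name `k_fold_indices` and the toy size `n=11` are assumptions chosen to show that leftover rows are never used.

```python
def k_fold_indices(k, i, n):
    """Return (train_idx, valid_idx) for fold i, using the same slicing rule."""
    assert k > 1
    fold_size = n // k
    train_idx, valid_idx = [], []
    for j in range(k):
        part = list(range(j * fold_size, (j + 1) * fold_size))
        if j == i:
            valid_idx = part          # fold i is held out for validation
        else:
            train_idx.extend(part)    # all other folds form the training set
    return train_idx, valid_idx

train_idx, valid_idx = k_fold_indices(k=5, i=1, n=11)
print(valid_idx)  # [2, 3]
print(train_idx)  # [0, 1, 4, 5, 6, 7, 8, 9]
```

With `n=11` and `k=5` the last row (index 10) belongs to no fold. With the 1,460 training samples and `k=5` used in this notebook, each fold has 292 rows and nothing is dropped.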


In $K$-fold cross-validation we train $K$ times and return the average training and validation errors.

In [31]:
def k_fold(k, X_train, y_train, num_epochs,
           learning_rate, weight_decay, batch_size):
    # weight_decay is kept in the signature but is not used by this implementation
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        # Hold out fold i for validation and create a fresh model
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        # Compile model
        net.compile(loss=tf.keras.losses.mean_squared_logarithmic_error,
                    optimizer=tf.keras.optimizers.Adam(learning_rate))
        # Fit the model
        history = net.fit(data[0], data[1], validation_data=(data[2], data[3]),
                          epochs=num_epochs, batch_size=batch_size,
                          validation_freq=1, verbose=0)
        loss = history.history['loss']
        val_loss = history.history['val_loss']
        # Accumulate the final-epoch losses so their averages can be returned
        train_l_sum += loss[-1]
        valid_l_sum += val_loss[-1]
        # The printed values are Keras MSLE losses (labelled 'rmse' in the output)
        print('fold %d, train rmse %f, valid rmse %f'
              % (i, loss[-1], val_loss[-1]))
    # Plot the loss curves of the last fold only
    x = np.arange(num_epochs)
    fig = make_subplots(rows=1, cols=2, subplot_titles=['Train loss', 'Valid loss'])
    fig.add_trace(go.Scatter(x=x, y=loss, name='train', mode='lines',
                             line=dict(width=2)), row=1, col=1)
    fig.add_trace(go.Scatter(x=x, y=val_loss, name='valid', mode='lines',
                             line=dict(width=2)), row=1, col=2)
    fig.update_xaxes(title_text='epochs', row=1, col=1)
    fig.update_xaxes(title_text='epochs', row=1, col=2)
    fig.update_yaxes(title_text='loss', row=1, col=1)
    fig.update_yaxes(title_text='loss', row=1, col=2)
    fig.update_layout(title=dict(text='K-Fold Cross Validation Results', font=dict(size=20)))
    fig.show()
    return train_l_sum / k, valid_l_sum / k


k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr, weight_decay, batch_size)


fold 0, train rmse 8.328447, valid rmse 8.515982
fold 1, train rmse 11.967943, valid rmse 11.908439
fold 2, train rmse 8.636151, valid rmse 8.698682
fold 3, train rmse 9.301886, valid rmse 9.248205
fold 4, train rmse 7.849432, valid rmse 7.740965
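
The per-fold values printed above can be combined into the average errors that K-fold cross-validation is meant to report. The short sketch below does that arithmetic using the numbers copied from the printout. Taking the square root at the end is an added step, on the assumption that the printed values are Keras MSLE losses rather than root errors.

```python
import math

# Final-epoch losses copied from the fold printout above
train_losses = [8.328447, 11.967943, 8.636151, 9.301886, 7.849432]
valid_losses = [8.515982, 11.908439, 8.698682, 9.248205, 7.740965]

avg_train = sum(train_losses) / len(train_losses)
avg_valid = sum(valid_losses) / len(valid_losses)
print('avg train loss %f, avg valid loss %f' % (avg_train, avg_valid))

# Square root of the average MSLE, on the scale of a root-mean-squared log error
print('sqrt of avg valid loss %f' % math.sqrt(avg_valid))
```

The averages come out at roughly 9.217 (training) and 9.222 (validation). The two are nearly equal, so the gap between training and validation error is small for this linear model.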


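## Predicting and saving the results to CSV
Below we generate the predictions. Before predicting, we retrain the model on the full training set, then save the predictions in the format required for submission.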

In [32]:
# Convert the NumPy arrays to float32 tensors
x_train = tf.convert_to_tensor(train_features, dtype=tf.float32)
y_train = tf.convert_to_tensor(train_labels, dtype=tf.float32)
x_test = tf.convert_to_tensor(test_features, dtype=tf.float32)
# Same single-layer linear model as get_net()
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1)
])
adam = tf.keras.optimizers.Adam(0.5)
model.compile(optimizer=adam,
              loss=tf.keras.losses.mean_squared_logarithmic_error)
# Retrain on the full training set (no validation split)
model.fit(x_train, y_train, epochs=200, batch_size=32, verbose=0)
# Flatten the (n, 1) predictions into a 1-D column
preds = np.array(model.predict(x_test))
test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
# Keep only the Id and SalePrice columns for the submission file
submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
submission.to_csv('./Data/submission.csv', index=False)
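
The file written above has two columns, `Id` and `SalePrice`, and no index column. As a quick illustration of that layout, the sketch below builds the same kind of CSV text in memory with the standard library. The helper name `format_submission`, the ids, and the prices are hypothetical values chosen for the example.

```python
import csv
import io

def format_submission(ids, prices):
    """Return CSV text with an Id,SalePrice header and one row per house."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Id", "SalePrice"])
    for house_id, price in zip(ids, prices):
        writer.writerow([house_id, price])
    return buffer.getvalue()

# Hypothetical ids and predicted prices
text = format_submission([1461, 1462], [120000.5, 155000.0])
print(text)
```

Reading the generated file back and checking the header and row count is a cheap sanity check before uploading.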
