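## Preface
XGBoost ships with built-in cross-validation, so using it only takes a direct function call. We reuse part of the code from before. You can skip straight to the cross-validation section.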

## Preliminary code
### Import libraries and define the helper function

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
import pandas as pd

In [2]:
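def GetNewDataByPandas():
    # Load the wine quality data (local path from the original notebook)
    wine = pd.read_csv("/home/fonttian/Data/UCI/wine/wine.csv")
    # Add two engineered features
    wine['alcohol**2'] = pow(wine["alcohol"], 2)
    wine['volatileAcidity*alcohol'] = wine["alcohol"] * wine['volatile acidity']
    # The target is the quality score; the features are all other columns
    y = np.array(wine.quality)
    X = np.array(wine.drop("quality", axis=1))
    # X = np.array(wine)

    # Column names, still including 'quality' plus the two new features
    columns = np.array(wine.columns)

    return X, y, columns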

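## Load the data
### Read and split the data

Because we use cross-validation here, we no longer need to split the dataset into three parts. We only need to hold out 10% of the data for prediction. Pay attention to the random seed.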

In [3]:
# Read wine quality data from file
X, y, wineNames = GetNewDataByPandas()
# X, y, wineNames = GetDataByPandas()
# hold out 10% of the rows for prediction; the other 90% goes to cross-validation
x_train, x_predict, y_train, y_predict = train_test_split(X, y, test_size=0.10, random_state=100)

### Inspect the data

In [4]:
wineNames

array(['fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'quality', 'alcohol**2', 'volatileAcidity*alcohol'], dtype=object)

In [5]:
print(len(x_train),len(y_train))
print(len(x_predict))

1439 1439
160
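
The two sizes printed above add up to 1599 rows, of which 10% are held out. As a quick sanity check, the arithmetic can be sketched as follows. The helper name `holdout_sizes` is hypothetical, and the rounding rule (test share rounded up, remainder used for training) is an assumption that happens to match the output shown here.

```python
import math

def holdout_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional hold-out split.

    Assumption: the test share is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_test = holdout_sizes(1599, 0.10)
print(n_train, n_test)
```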


### Load into DMatrix
 1. `missing` is the value in the input data that XGBoost treats as missing; it is optional.
 2. Weights can also be set if necessary.

In [7]:
dtrain = xgb.DMatrix(data=x_train,label=y_train,missing=-999.0)
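
To make the role of the sentinel concrete, here is a plain-Python sketch of what marking `-999.0` as missing means conceptually. It is not how XGBoost implements it internally. The helper `mask_missing` and the sample values are made up for illustration.

```python
import math

MISSING = -999.0  # same sentinel as in the DMatrix call above

def mask_missing(rows, sentinel=MISSING):
    """Replace every cell equal to the sentinel with NaN (illustration only)."""
    return [[math.nan if v == sentinel else v for v in row] for row in rows]

rows = [[7.4, -999.0, 0.0], [-999.0, 0.88, 2.6]]  # made-up values
masked = mask_missing(rows)
n_missing = sum(math.isnan(v) for row in masked for v in row)
print(n_missing)
```

Per-row weights, mentioned in item 2, are passed through the `weight` argument of `xgb.DMatrix`.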

## Set the parameters
### Booster parameters

The eval_metric parameter can also be set later.

In [8]:
param = {'max_depth': 7, 'eta': 1, 'silent': 1, 'objective': 'reg:linear', 'seed':100}
param['nthread'] = 4
param['seed'] = 619
param['eval_metric'] = ['rmse']
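
The dictionary above targets the XGBoost version in use when this notebook was written. As a hedged sketch for more recent releases, where `reg:squarederror` is the replacement name for `reg:linear` and `verbosity` takes the place of `silent`, the same settings might look like this. The name `param_modern` is only for illustration, so check the parameter documentation of your installed version before relying on it.

```python
# Same settings as above, written with the newer parameter names.
param_modern = {
    'max_depth': 7,
    'eta': 1,
    'verbosity': 0,                   # stands in for 'silent': 1
    'objective': 'reg:squarederror',  # newer name for 'reg:linear'
    'nthread': 4,
    'seed': 619,                      # the value that ends up in param above
    'eval_metric': ['rmse'],
}
print(sorted(param_modern))
```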

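## Training with cross-validation
### Train the model

In the earlier code, I split the data 6:3:1 into training data, data for monitoring performance, and data for the final prediction. That ratio was for illustration only and is not representative.

This section focuses on the cross-validation approach.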
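
Before calling the built-in routine, it may help to see the idea in miniature: the rows are dealt into k disjoint folds, each fold serves once as the test set, and the per-fold metric is summarized as a mean and a standard deviation. The sketch below uses only the standard library. The helper names, the per-fold RMSE values, and the choice of the population standard deviation are assumptions for illustration, not a description of XGBoost's internals.

```python
import random
import statistics

def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle row indices and deal them into n_folds disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def summarize(fold_scores):
    """Mean and population standard deviation across folds."""
    return statistics.fmean(fold_scores), statistics.pstdev(fold_scores)

folds = kfold_indices(1439, 5)
print([len(f) for f in folds])

# Hypothetical per-fold RMSE values for one boosting round.
mean, std = summarize([0.70, 0.72, 0.74, 0.71, 0.73])
print(round(mean, 4), round(std, 4))
```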

In [10]:
num_round = 10
print('running cross validation')
# do cross validation, this will print result out as
# [iteration]  metric_name:mean_value+std_value
# std_value is standard deviation of the metric
result_table = xgb.cv(param, dtrain, num_round, nfold=5,
       callbacks=[xgb.callback.print_evaluation(show_stdv=True)])

print()
print(result_table)

running cross validation
[0]	train-rmse:0.685807+0.013638	test-rmse:0.731462+0.0235918
[1]	train-rmse:0.498956+0.0250755	test-rmse:0.711987+0.0161207
[2]	train-rmse:0.430987+0.0273151	test-rmse:0.717652+0.0249653
[3]	train-rmse:0.383161+0.0238077	test-rmse:0.720656+0.0246441
[4]	train-rmse:0.308969+0.0226345	test-rmse:0.724174+0.0263164
[5]	train-rmse:0.265569+0.0248556	test-rmse:0.7251+0.0281966
[6]	train-rmse:0.23472+0.0236373	test-rmse:0.730105+0.0281437
[7]	train-rmse:0.208329+0.0279188	test-rmse:0.737677+0.0317415
[8]	train-rmse:0.17889+0.028945	test-rmse:0.743078+0.0301012
[9]	train-rmse:0.15444+0.0194831	test-rmse:0.740954+0.0327966

   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0         0.685807        0.013638        0.731462       0.023592
1         0.498956        0.025075        0.711987       0.016121
2         0.430987        0.027315        0.717652       0.024965
3         0.383161        0.023808        0.720656       0.024644
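4         0.308969        0.022635        0.724174       0.026316
5         0.265569        0.024856        0.725100       0.028197
6         0.234720        0.023637        0.730105       0.028144
7         0.208329        0.027919        0.737677       0.031741
8         0.178890        0.028945        0.743078       0.030101
9         0.154440        0.019483        0.740954       0.032797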

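The results show that the training error keeps shrinking while the test error stays high, and that the test error first falls and then rises again. Both are clear signs of overfitting. The dataset used here is the UCI red wine quality dataset, which is indeed prone to overfitting, so early stopping becomes all the more necessary.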
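
The turning point can be read directly from the numbers. As a small sketch, the following picks the round with the lowest mean test RMSE, using the values from the cross-validation log above.

```python
# Mean test RMSE per boosting round, copied from the cross-validation log above.
test_rmse_mean = [0.731462, 0.711987, 0.717652, 0.720656, 0.724174,
                  0.7251, 0.730105, 0.737677, 0.743078, 0.740954]

# Index of the smallest value, i.e. the round where the test error bottoms out.
best_round = min(range(len(test_rmse_mean)), key=test_rmse_mean.__getitem__)
print(best_round, test_rmse_mean[best_round])
```

The test error bottoms out at round 1, while the training error continues to fall through round 9.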

### Early stopping in cross-validation

In [11]:
print('running cross validation, disable standard deviation display')
res = xgb.cv(param, dtrain, num_boost_round=10, nfold=5,
             callbacks=[xgb.callback.print_evaluation(show_stdv=False),
                        xgb.callback.early_stop(3)])
print(res)

running cross validation, disable standard deviation display
[0]	train-rmse:0.685807	test-rmse:0.731462
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 3 rounds.
[1]	train-rmse:0.498956	test-rmse:0.711987
[2]	train-rmse:0.430987	test-rmse:0.717652
[3]	train-rmse:0.383161	test-rmse:0.720656
[4]	train-rmse:0.308969	test-rmse:0.724174
Stopping. Best iteration:
[1]	train-rmse:0.498956+0.0250755	test-rmse:0.711987+0.0161207

   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0         0.685807        0.013638        0.731462       0.023592
1         0.498956        0.025075        0.711987       0.016121
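
The stopping rule in the log above can be reproduced by hand. The sketch below is a plain-Python illustration of patience-based early stopping, not XGBoost's own code, and the function name `early_stop_round` is hypothetical. It is applied to the mean test RMSE values from the earlier run.

```python
def early_stop_round(scores, patience):
    """Return (stop_round, best_round) for a metric that should decrease.

    Training stops once `patience` rounds pass without a new best score.
    """
    best_round = 0
    for i, score in enumerate(scores):
        if score < scores[best_round]:
            best_round = i
        elif i - best_round >= patience:
            return i, best_round
    return len(scores) - 1, best_round

# Mean test RMSE per round, copied from the first cross-validation run.
test_rmse_mean = [0.731462, 0.711987, 0.717652, 0.720656, 0.724174,
                  0.7251, 0.730105, 0.737677, 0.743078, 0.740954]
stop_round, best_round = early_stop_round(test_rmse_mean, patience=3)
print(stop_round, best_round)
```

With a patience of 3, training halts at round 4 and round 1 is kept as the best iteration, which matches the log. Note that `xgb.cv` also accepts an `early_stopping_rounds` argument, which may be the more convenient route in releases where the callback helpers used above are no longer available.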


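### Define a preprocessing function

It returns the preprocessed training data, test data, and parameters.

We can use it to reset weights, among other things.

For example, we try setting scale_pos_weight here. Since the goal is mainly to demonstrate usage, we set the weight to 1.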

In [12]:
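def fpreproc(dtrain, dtest, param):
    label = dtrain.get_label()
    # Example from the official XGBoost guide; it assumes binary 0/1 labels
    ratio = float(np.sum(label == 0)) / np.sum(label == 1)
    # My example: override the ratio with 1, since this only demonstrates usage
    ratio = 1
    param['scale_pos_weight'] = ratio
    return (dtrain, dtest, param)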
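
The ratio from the official example only makes sense for binary 0/1 labels. Assuming the wine quality labels are integer scores rather than 0/1 (which is how the UCI dataset is usually described), the count of positives would be zero and the division would not yield a usable number. A guarded variant might look like this. The helper `pos_weight_ratio` is hypothetical and the label lists are made up.

```python
def pos_weight_ratio(labels, default=1.0):
    """negatives / positives for 0/1 labels, or `default` if there are no positives."""
    n_neg = sum(1 for y in labels if y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    return n_neg / n_pos if n_pos else default

print(pos_weight_ratio([0, 0, 0, 1]))     # binary labels
print(pos_weight_ratio([5, 6, 7, 5, 6]))  # score-style labels, falls back to default
```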

### Use the preprocessing function in cross-validation

In [15]:
res = xgb.cv(param, dtrain, num_round, nfold=5, fpreproc=fpreproc)
print(res)

   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0         0.685807        0.013638        0.731462       0.023592
1         0.498956        0.025075        0.711987       0.016121
2         0.430987        0.027315        0.717652       0.024965
3         0.383161        0.023808        0.720656       0.024644
4         0.308969        0.022635        0.724174       0.026316
5         0.265569        0.024856        0.725100       0.028197
6         0.234720        0.023637        0.730105       0.028144
7         0.208329        0.027919        0.737677       0.031741
8         0.178890        0.028945        0.743078       0.030101
9         0.154440        0.019483        0.740954       0.032797


