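## Preface
XGBoost ships with built-in cross-validation, so using it only takes a direct call. We reuse part of the code from the earlier posts; you can skip straight to the cross-validation section.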

## Preliminary code
### Import libraries and define the helper functions

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
import pandas as pd

In [3]:
def GetNewDataByPandas():
    wine = pd.read_csv("/home/fonttian/Data/UCI/wine/wine.csv")
    wine['alcohol**2'] = pow(wine["alcohol"], 2)
    wine['volatileAcidity*alcohol'] = wine["alcohol"] * wine['volatile acidity']
    y = np.array(wine.quality)
    X = np.array(wine.drop("quality", axis=1))
    # X = np.array(wine)

    columns = np.array(wine.columns)

    return X, y, columns

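## Loading the data
### Read and split the data

Because we use cross-validation here, we no longer need to split the dataset into three parts; holding out 10% of the data for the final prediction is enough. Mind the random seed (`random_state`) so that the split is reproducible.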

In [4]:
# Read wine quality data from file
X, y, wineNames = GetNewDataByPandas()
# X, y, wineNames = GetDataByPandas()
# hold out 10% of the data for the final prediction (90/10 split)
x_train, x_predict, y_train, y_predict = train_test_split(X, y, test_size=0.10, random_state=100)

### Inspect the data

In [5]:
wineNames

array(['fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'quality', 'alcohol**2', 'volatileAcidity*alcohol'], dtype=object)

In [6]:
print(len(x_train),len(y_train))
print(len(x_predict))

1439 1439
160
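
The two sizes printed above follow from the 10% hold-out. A minimal sketch of the arithmetic, assuming scikit-learn rounds a fractional `test_size` up to a whole number of samples; the helper name `holdout_sizes` is made up for illustration.

```python
import math

def holdout_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional hold-out split."""
    n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
    return n_samples - n_test, n_test

# 1439 + 160 = 1599 rows in total, as printed above
print(holdout_sizes(1599, 0.10))  # (1439, 160)
```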


### Load into a DMatrix
 1. `missing` specifies which value in the input is treated as missing; it is optional
 2. Weights can also be set if needed

In [7]:
dtrain = xgb.DMatrix(data=x_train,label=y_train,missing=-999.0)

## Setting the parameters
### Booster parameters

The eval_metric parameter can also be set later.

In [8]:
param = {'max_depth': 7, 'eta': 1, 'silent': 1, 'objective': 'reg:linear', 'seed':100}
param['nthread'] = 4
param['seed'] = 619
param['eval_metric'] = ['rmse']

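## Training with cross-validation
### Train the model

In the earlier code I split the data 6:3:1 into training data, data for monitoring performance, and data for the final prediction. That ratio was for illustration only and is not meant to be representative.

This section focuses on the cross-validation method.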
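
What `nfold=5` does can be sketched in plain Python: the rows are shuffled once, cut into five folds, and each fold serves once as the test set; the per-fold metric is then aggregated. This is an illustration only, not XGBoost's implementation. `kfold_indices` and `mean_and_std` are hypothetical helpers, and using the population standard deviation for the `+std` value is an assumption.

```python
import random
import statistics

def kfold_indices(n_samples, nfold, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::nfold] for i in range(nfold)]
    for k in range(nfold):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

def mean_and_std(fold_scores):
    """Aggregate per-fold scores into a mean and a spread."""
    return statistics.fmean(fold_scores), statistics.pstdev(fold_scores)

splits = list(kfold_indices(1439, 5))
print([len(test) for _, test in splits])  # fold sizes
print(mean_and_std([0.70, 0.72, 0.74]))   # made-up fold scores
```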

In [9]:
num_round = 10
print('running cross validation')
# do cross validation, this will print result out as
# [iteration]  metric_name:mean_value+std_value
# std_value is standard deviation of the metric
result_table = xgb.cv(param, dtrain, num_round, nfold=5,
                      callbacks=[xgb.callback.print_evaluation(show_stdv=True)])

print()
print(result_table)

running cross validation
[0]	train-rmse:0.685807+0.013638	test-rmse:0.731462+0.0235918
[1]	train-rmse:0.498956+0.0250755	test-rmse:0.711987+0.0161207
[2]	train-rmse:0.430987+0.0273151	test-rmse:0.717652+0.0249653
[3]	train-rmse:0.383161+0.0238077	test-rmse:0.720656+0.0246441
[4]	train-rmse:0.308969+0.0226345	test-rmse:0.724174+0.0263164
[5]	train-rmse:0.265569+0.0248556	test-rmse:0.7251+0.0281966
[6]	train-rmse:0.23472+0.0236373	test-rmse:0.730105+0.0281437
[7]	train-rmse:0.208329+0.0279188	test-rmse:0.737677+0.0317415
[8]	train-rmse:0.17889+0.028945	test-rmse:0.743078+0.0301012
[9]	train-rmse:0.15444+0.0194831	test-rmse:0.740954+0.0327966

   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0         0.685807        0.013638        0.731462       0.023592
1         0.498956        0.025075        0.711987       0.016121
2         0.430987        0.027315        0.717652       0.024965
3         0.383161        0.023808        0.720656       0.024644
4         0.308969   

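Judging from the results, the training error keeps shrinking while the test error stays much larger, and the test error first drops and then rises again. Both are clear signs of overfitting. The dataset used here is the UCI red wine quality dataset, which does overfit easily, so early stopping becomes all the more necessary.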

### Early stopping in cross-validation

In [10]:
print('running cross validation, disable standard deviation display')
res = xgb.cv(param, dtrain, num_boost_round=10, nfold=5,
             callbacks=[xgb.callback.print_evaluation(show_stdv=False),
                        xgb.callback.early_stop(3)])
print(res)

running cross validation, disable standard deviation display
[0]	train-rmse:0.685807	test-rmse:0.731462
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 3 rounds.
[1]	train-rmse:0.498956	test-rmse:0.711987
[2]	train-rmse:0.430987	test-rmse:0.717652
[3]	train-rmse:0.383161	test-rmse:0.720656
[4]	train-rmse:0.308969	test-rmse:0.724174
Stopping. Best iteration:
[1]	train-rmse:0.498956+0.0250755	test-rmse:0.711987+0.0161207

   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0         0.685807        0.013638        0.731462       0.023592
1         0.498956        0.025075        0.711987       0.016121
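
Why the run stops after round 4 and reports round 1 as the best can be replayed by hand. A minimal sketch of patience-based early stopping, fed with the `test-rmse` means from the first cross-validation log; `early_stop_round` is a hypothetical helper, not part of XGBoost.

```python
def early_stop_round(scores, patience):
    """Return (best_round, stop_round) for a metric that should decrease."""
    best_round, best_score = 0, float("inf")
    for i, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = i, score
        elif i - best_round >= patience:
            return best_round, i  # no improvement for `patience` rounds
    return best_round, len(scores) - 1

test_rmse = [0.731462, 0.711987, 0.717652, 0.720656, 0.724174,
             0.7251, 0.730105, 0.737677, 0.743078, 0.740954]
best_round, stop_round = early_stop_round(test_rmse, patience=3)
print(best_round, stop_round)  # 1 4
```

The returned table is cut at the best round, which is why it holds rows 0 and 1 only.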


In [11]:
print(res['test-rmse-mean'])

0    0.731462
1    0.711987
Name: test-rmse-mean, dtype: float64


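### Define a preprocessing function

It returns the preprocessed training data, test data, and parameters.

We can use it to reset weights, and so on.

For example, we try setting scale_pos_weight; since the point here is only to show the usage, we set the weight to 1.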

In [12]:
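def fpreproc(dtrain, dtest, param):
    label = dtrain.get_label()
    # Example from the official xgboost guide (meant for binary 0/1 labels).
    # The wine quality labels contain neither 0 nor 1, so this evaluates 0/0
    # and NumPy may emit a RuntimeWarning.
    ratio = float(np.sum(label == 0)) / np.sum(label == 1)
    ratio = 1  # my example: use a weight of 1
    param['scale_pos_weight'] = ratio
    return (dtrain, dtest, param)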
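
The ratio from the official example is meant for binary classification, where it balances negative against positive samples. A small sketch with made-up labels, including a guard for the case where a class is absent (as with the wine quality scores here); `pos_weight_ratio` is a hypothetical helper.

```python
def pos_weight_ratio(labels, default=1.0):
    """Return count(label == 0) / count(label == 1), or `default` if a class is absent."""
    n_neg = sum(1 for v in labels if v == 0)
    n_pos = sum(1 for v in labels if v == 1)
    if n_neg == 0 or n_pos == 0:
        return default  # ratio would be undefined or degenerate
    return n_neg / n_pos

print(pos_weight_ratio([0, 0, 0, 1]))     # 3.0
print(pos_weight_ratio([5, 6, 7, 5, 6]))  # 1.0 (no 0/1 labels)
```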

### Using the preprocessing function in cross-validation

In [13]:
res = xgb.cv(param, dtrain, num_round, nfold=5, fpreproc=fpreproc)
print(res)


   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0         0.685807        0.013638        0.731462       0.023592
1         0.498956        0.025075        0.711987       0.016121
2         0.430987        0.027315        0.717652       0.024965
3         0.383161        0.023808        0.720656       0.024644
4         0.308969        0.022635        0.724174       0.026316
5         0.265569        0.024856        0.725100       0.028197
6         0.234720        0.023637        0.730105       0.028144
7         0.208329        0.027919        0.737677       0.031741
8         0.178890        0.028945        0.743078       0.030101
9         0.154440        0.019483        0.740954       0.032797


## Combining cross-validation with Hyperopt

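There are two ways to combine XGBoost cross-validation with Hyperopt. The first uses XGBoost's own CV method. It has one catch: `cv` cannot take the hyperparameters as separate arguments and only accepts a single `params` argument. We therefore build a model first and fetch its parameters with `get_params()`, which costs a few extra lines of code. Compared with going through sklearn, however, the running time improves considerably.

The second way is to use one of the cross-validation schemes in `sklearn.model_selection` and pass XGBoost in as a single model. I personally do not recommend it: the interaction with sklearn costs some performance, and the gap in running time can be large. My suggestion is to use a similar code structure only when necessary, for instance with the traditional `train_test_split` approach, and still to avoid sklearn-based cross-validation. How to do that will be covered in other examples.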
    
    
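### Cross-validation with the CV method

Using XGBoost's own `cv` method is probably the best option. The only problem is how to pass the single `params` argument. For some reason, in the nearly one year since I finished translating the hyperopt documentation into Chinese, I have not seen anyone write about this. Perhaps nobody has worked out how to solve it.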

In [14]:
param

{'max_depth': 7,
 'eta': 1,
 'silent': 1,
 'objective': 'reg:linear',
 'seed': 619,
 'nthread': 4,
 'eval_metric': ['rmse']}

In [15]:
import hyperopt

def hyperopt_objective(params):
    
    model = xgb.XGBRegressor(
        max_depth=int(params['max_depth'])+5,
        learning_rate=params['learning_rate'],
        silent=1,
        objective='reg:linear',
        eval_metric='rmse',
        seed=619,
        nthread=-1,
    )
     
    res = xgb.cv(model.get_params(), dtrain, num_boost_round=10, nfold=5,
             callbacks=[xgb.callback.print_evaluation(show_stdv=False),
                        xgb.callback.early_stop(3)])
    
    return np.min(res['test-rmse-mean']) # as hyperopt minimises
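
An alternative to going through `XGBRegressor(...).get_params()` is to assemble the single `params` dictionary by hand from the sampled values. A sketch, assuming the native API accepts `eta` as the name of the learning rate; `build_cv_params` is a hypothetical helper, and the fixed entries simply mirror the `param` dictionary used earlier.

```python
def build_cv_params(sample, depth_offset=5):
    """Map one hyperopt sample onto a single params dict for xgb.cv."""
    return {
        "max_depth": int(sample["max_depth"]) + depth_offset,
        "eta": float(sample["learning_rate"]),
        "objective": "reg:linear",  # as in the original `param` dict
        "eval_metric": "rmse",
        "seed": 619,
    }

cv_params = build_cv_params({"max_depth": 2, "learning_rate": 0.1})
print(cv_params)
```

The resulting dictionary could then be passed as the first argument of `xgb.cv` in place of `model.get_params()`.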

In [17]:
from numpy.random import RandomState

params_space = {
    'max_depth': hyperopt.hp.randint('max_depth', 6),
    'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
}

trials = hyperopt.Trials()

best = hyperopt.fmin(
    hyperopt_objective,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=50,
    trials=trials,
    rstate=RandomState(123)
)

print("\nBest result found by hyperopt; note that we transformed hyperopt's original value range once")
print(best)

[0]	train-rmse:3.26275	test-rmse:3.2673
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 3 rounds.
[1]	train-rmse:2.08066	test-rmse:2.09467
[2]	train-rmse:1.3656	test-rmse:1.40461
[3]	train-rmse:0.931272	test-rmse:1.02085
[4]	train-rmse:0.668251	test-rmse:0.815193
[5]	train-rmse:0.503126	test-rmse:0.720206
[6]	train-rmse:0.398523	test-rmse:0.674504
[7]	train-rmse:0.32354	test-rmse:0.654644
[8]	train-rmse:0.272398	test-rmse:0.646695
[9]	train-rmse:0.242619	test-rmse:0.643557
[0]	train-rmse:3.11004	test-rmse:3.11516
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 3 rounds.
[1]	train-rmse:1.90079	test-rmse:1.91514
[2]	train-rmse:1.21239	test-rmse:1.24757
[3]	train-rmse:0.837937	test-rmse:0.899628
[4]	train-rmse:0.64615	test-rmse:0.739417
[5]	train-rmse:0.54979	test-rmse:0.671479
[6]	train-rmse:0.503193	test-rmse:0.644203
[7]

[6]	train-rmse:0.607008	test-rmse:0.733551
[7]	train-rmse:0.523004	test-rmse:0.677904
[8]	train-rmse:0.470362	test-rmse:0.64919
[9]	train-rmse:0.432426	test-rmse:0.635024
[0]	train-rmse:2.89835	test-rmse:2.90436
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 3 rounds.
[1]	train-rmse:1.66846	test-rmse:1.68351
[2]	train-rmse:1.03021	test-rmse:1.06949
[3]	train-rmse:0.716919	test-rmse:0.794803
[4]	train-rmse:0.578381	test-rmse:0.697065
[5]	train-rmse:0.515998	test-rmse:0.662358
[6]	train-rmse:0.480762	test-rmse:0.649818
[7]	train-rmse:0.461886	test-rmse:0.649972
[8]	train-rmse:0.444416	test-rmse:0.649307
[9]	train-rmse:0.431887	test-rmse:0.649179
[0]	train-rmse:4.74307	test-rmse:4.74366
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 3 rounds.
[1]	train-rmse:4.32937	test-rmse:4.33143
[2]	train-rmse:3.95325	test-rmse:3.9562

[1]	train-rmse:2.32766	test-rmse:2.33788
[2]	train-rmse:1.59577	test-rmse:1.62308
[3]	train-rmse:1.12384	test-rmse:1.18198
[4]	train-rmse:0.820035	test-rmse:0.929889
[5]	train-rmse:0.616898	test-rmse:0.786757
[6]	train-rmse:0.482986	test-rmse:0.706541
[7]	train-rmse:0.390492	test-rmse:0.666407
[8]	train-rmse:0.330194	test-rmse:0.648272
[9]	train-rmse:0.289325	test-rmse:0.639296
[0]	train-rmse:2.90654	test-rmse:2.91251
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 3 rounds.
[1]	train-rmse:1.67607	test-rmse:1.69114
[2]	train-rmse:1.02172	test-rmse:1.0791
[3]	train-rmse:0.668905	test-rmse:0.809289
[4]	train-rmse:0.477573	test-rmse:0.70237
[5]	train-rmse:0.372438	test-rmse:0.664591
[6]	train-rmse:0.307832	test-rmse:0.6521
[7]	train-rmse:0.257258	test-rmse:0.647787
[8]	train-rmse:0.220306	test-rmse:0.646447
[9]	train-rmse:0.19488	test-rmse:0.646466
[0]	train-rmse:3.85385	test-rmse:3.85654
Multiple eval met

[8]	train-rmse:4.90516	test-rmse:4.90558
[9]	train-rmse:4.8737	test-rmse:4.87418
[0]	train-rmse:3.55544	test-rmse:3.559
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 3 rounds.
[1]	train-rmse:2.45357	test-rmse:2.46259
[2]	train-rmse:1.71911	test-rmse:1.74023
[3]	train-rmse:1.23187	test-rmse:1.27387
[4]	train-rmse:0.910121	test-rmse:0.986465
[5]	train-rmse:0.692926	test-rmse:0.82202
[6]	train-rmse:0.550984	test-rmse:0.726877
[7]	train-rmse:0.454897	test-rmse:0.678742
[8]	train-rmse:0.386118	test-rmse:0.65006
[9]	train-rmse:0.339619	test-rmse:0.636732
[0]	train-rmse:4.62743	test-rmse:4.62825
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 3 rounds.
[1]	train-rmse:4.12173	test-rmse:4.12443
[2]	train-rmse:3.67366	test-rmse:3.67758
[3]	train-rmse:3.2766	test-rmse:3.28032
[4]	train-rmse:2.92494	test-rmse:2.93085
[5]	train-rms

### Cross-validation with sklearn

Note that there is also a gap in running speed here, and it can be large.

In [27]:
def XGBRegressor_CV(params):
    from sklearn.model_selection import cross_val_score

    model = xgb.XGBRegressor(
        max_depth=int(params['max_depth'])+5,
        learning_rate=params['learning_rate'],
        silent=1,
        objective='reg:linear',
        eval_metric='rmse',
        seed=619,
        nthread=-1,
        early_stopping_rounds=3
    )

    # Available splits: x_train, x_predict, y_train, y_predict
    # The scores are negated MSE values, one per fold.
    metric = cross_val_score(model, x_train, y_train, cv=5, scoring="neg_mean_squared_error")

    # Note: this returns the smallest per-fold MSE, not the mean across folds.
    return min(-metric)
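
`cross_val_score` with `scoring="neg_mean_squared_error"` returns negated MSE values, one per fold, whereas the `xgb.cv` runs report RMSE. To put the two on the same scale, the scores can be converted. A sketch with made-up scores; `mean_rmse` is a hypothetical helper.

```python
import math

def mean_rmse(neg_mse_scores):
    """Convert negated per-fold MSE scores into the mean RMSE."""
    rmses = [math.sqrt(-s) for s in neg_mse_scores]
    return sum(rmses) / len(rmses)

fold_scores = [-0.36, -0.49, -0.64]  # made-up values
print(round(mean_rmse(fold_scores), 6))  # 0.7
```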

In [28]:
trials_2 = hyperopt.Trials()

best_2 = hyperopt.fmin(
    XGBRegressor_CV,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=50,
    trials=trials_2,
    rstate=RandomState(123)
)

print("\nBest result found by hyperopt; note that we transformed hyperopt's original value range once")
print(best_2)


Best result found by hyperopt; note that we transformed hyperopt's original value range once
{'learning_rate': 0.042355077979416025, 'max_depth': 5}
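
As the printed message says, the values in the result are expressed in the search space's own units. A sketch of mapping them back to actual hyperparameters, assuming the `+5` offset applied to `max_depth` in the objective functions; `decode_best` is a hypothetical helper.

```python
def decode_best(best, depth_offset=5):
    """Translate hyperopt's raw result into actual model hyperparameters."""
    return {
        "max_depth": int(best["max_depth"]) + depth_offset,
        "learning_rate": float(best["learning_rate"]),
    }

raw = {"learning_rate": 0.042355077979416025, "max_depth": 5}
print(decode_best(raw))  # max_depth becomes 10
```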
