TO DO：
* 图文并茂
* 文字通俗易懂
* 需要举例子解释 什么是样本分布变化？
* 将notebook上传到github
* 除了微信公众号，还需要在哪些渠道放文章：微博；Medium
* 可以投稿给 机器之心 这样的公众号吗？
* 图片要加简短注释
* 简介前面加一个图片，会好看一点
* 微博关注的果壳编辑好像分享过写作经验，看看

标题：还在用交叉验证？试试Kaggle大牛们经常用的方法 -- 对抗验证

这个时候，交叉验证（Cross Validation）通常会失效，所以构建一个可靠的验证集（Validation），就非常重要。

如果构造的验证集（Validation）不可靠，我们就无法准确地评估模型的效果，进而无法评估改进模型的方法是否有效。

首先，我们如何确定样本的分布确实发生了变化呢？

Kaggle上通常用到的一个方法就是：对抗验证（Adversarial Validation）。

# 简介

交叉验证（Cross Validation）是建模师常用的一种用来评估模型效果的方法。

在本篇文章中，我们将通过一个实例，了解到：当样本分布发生变化时，使用交叉验证，会存在问题。

我们将使用「对抗验证（Adversarial Validation）」这一方法，来辨别样本的分布是否发生了变化。

以及，如果样本分布发生了变化，除了交叉验证，我们有哪些更好的方法。

## 什么是训练集、验证集、和测试集？

在本文中训练集、验证集、和测试集的概念如下：

* 训练集：用于训练模型的样本。

* 验证集：用于评估「模型效果」，以及评估「提升模型效果的方法」的样本。提升模型效果的方法通常是：调整超参数、变量工程，变量筛选等。具体来说，如果某个变量提升了模型在验证集上的效果，那么这个变量将被纳入模型。建模过程中，凡是可以提升验证集效果的方法，都可以采纳。也正因为如此，模型“看见”过验证集，所以模型在验证集上的效果是存在偏差的。

* 测试集：用于评估模型效果的样本。模型没有“看见”过该样本。

有必要解释吗？

## 什么是交叉验证？

有必要解释吗？

## 为什么「样本分布变化」会影响模型？

在真实的业务场景中，我们经常会遇到「样本分布变化」的问题。

当出现这样的问题时，模型在训练样本上，表现还不错。但是应用了一段时间后，效果通常会大打折扣。

比如，在化妆品或者医美市场，男性的比例越来越多。基于过去的数据构建的模型，渐渐不适用于现在。

又比如，在信贷场景下，因为建模样本都是通过了信用审批的客户，所以「建模样本（Sample）」和「整体（Population）」的分布存在偏差。

![样本分布变化](./images/Change_in_Distribution.png)

（对上图的解释，应该修改一下格式）

如上图，当我们要做一个模型，来预测人们在超市的消费习惯。

如果我们的训练样本是18岁-25岁的年轻人，而测试样本是70岁以上的老人，那么我们的模型在测试样本上的表现，通常会大打折扣。

## 什么是对抗验证（Adversarial Validation）？

[对抗验证（Adversarial Validation）](http://fastml.com/adversarial-validation-part-one/)，并不是一种评估模型效果的方法，而是一种用来确认训练集（Train）和测试集（Test）的分布是否变化的方法。

它的本质是，构造一个分类模型，来预测样本是训练集或测试集的概率。

如果这个模型的效果不错，通常来说AUC在0.7以上，那么可以说明我们的训练集和测试集存在较大的差异。

![Adversarial_Validation](./images/Adversarial_Validation.png)

如上图，仍然以「预测人们在超市的消费习惯」为例。因为训练集是18岁-25岁的年轻人，测试集是70岁以上的老人，那么通过「年龄」，我们就能够区分出训练集和测试集。

（可以修改这个图片吗？我需要找到画这种图的工具）

# 实例

这里，我将用一个实例来论证怎么使用对抗验证，以及在分布发生变化时做交叉验证所存在的问题。

## 数据

![Microsoft Malware Prediction](./images/Microsoft_Malware_Competition.png)


这里用到的数据来自Kaggle上的[微软恶意软件比赛](https://www.kaggle.com/c/microsoft-malware-prediction/overview)比赛来说明。

这次比赛的目标是：预测电脑受到恶意软件攻击的概率。

因为这次比赛的 Train 和 Test 是根据时间划分的，所以Train 和 Test 的分布非常不同，很具有代表性。

如果需要完整的数据，可以从[Kaggle](https://www.kaggle.com/c/microsoft-malware-prediction/data)下载。

## Import

In [1]:
import pandas as pd
from tqdm import tqdm
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Memory management
import gc
gc.enable()

# Plot
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# dtypes = {
#     'MachineIdentifier':                                    'category',
#     'ProductName':                                          'category',
#     'EngineVersion':                                        'category',
#     'AppVersion':                                           'category',
#     'AvSigVersion':                                         'category',
#     'IsBeta':                                               'int8',
#     'RtpStateBitfield':                                     'float16',
#     'IsSxsPassiveMode':                                     'int8',
#     'DefaultBrowsersIdentifier':                            'float32',
#     'AVProductStatesIdentifier':                            'float32',
#     'AVProductsInstalled':                                  'float16',
#     'AVProductsEnabled':                                    'float16',
#     'HasTpm':                                               'int8',
#     'CountryIdentifier':                                    'int16',
#     'CityIdentifier':                                       'float32',
#     'OrganizationIdentifier':                               'float16',
#     'GeoNameIdentifier':                                    'float16',
#     'LocaleEnglishNameIdentifier':                          'int16',
#     'Platform':                                             'category',
#     'Processor':                                            'category',
#     'OsVer':                                                'category',
#     'OsBuild':                                              'int16',
#     'OsSuite':                                              'int16',
#     'OsPlatformSubRelease':                                 'category',
#     'OsBuildLab':                                           'category',
#     'SkuEdition':                                           'category',
#     'IsProtected':                                          'float16',
#     'AutoSampleOptIn':                                      'int8',
#     'PuaMode':                                              'category',
#     'SMode':                                                'float16',
#     'IeVerIdentifier':                                      'float16',
#     'SmartScreen':                                          'category',
#     'Firewall':                                             'float16',
#     'UacLuaenable':                                         'float64',  # was 'float32'
#     'Census_MDC2FormFactor':                                'category',
#     'Census_DeviceFamily':                                  'category',
#     'Census_OEMNameIdentifier':                             'float32',  # was 'float16'
#     'Census_OEMModelIdentifier':                            'float32',
#     'Census_ProcessorCoreCount':                            'float16',
#     'Census_ProcessorManufacturerIdentifier':               'float16',
#     'Census_ProcessorModelIdentifier':                      'float32',  # was 'float16'
#     'Census_ProcessorClass':                                'category',
#     'Census_PrimaryDiskTotalCapacity':                      'float64',  # was 'float32'
#     'Census_PrimaryDiskTypeName':                           'category',
#     'Census_SystemVolumeTotalCapacity':                     'float64',  # was 'float32'
#     'Census_HasOpticalDiskDrive':                           'int8',
#     'Census_TotalPhysicalRAM':                              'float32',
#     'Census_ChassisTypeName':                               'category',
#     'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float32',  # was 'float16'
#     'Census_InternalPrimaryDisplayResolutionHorizontal':    'float32',  # was 'float16'
#     'Census_InternalPrimaryDisplayResolutionVertical':      'float32',  # was 'float16'
#     'Census_PowerPlatformRoleName':                         'category',
#     'Census_InternalBatteryType':                           'category',
#     'Census_InternalBatteryNumberOfCharges':                'float64',  # was 'float32'
#     'Census_OSVersion':                                     'category',
#     'Census_OSArchitecture':                                'category',
#     'Census_OSBranch':                                      'category',
#     'Census_OSBuildNumber':                                 'int16',
#     'Census_OSBuildRevision':                               'int32',
#     'Census_OSEdition':                                     'category',
#     'Census_OSSkuName':                                     'category',
#     'Census_OSInstallTypeName':                             'category',
#     'Census_OSInstallLanguageIdentifier':                   'float16',
#     'Census_OSUILocaleIdentifier':                          'int16',
#     'Census_OSWUAutoUpdateOptionsName':                     'category',
#     'Census_IsPortableOperatingSystem':                     'int8',
#     'Census_GenuineStateName':                              'category',
#     'Census_ActivationChannel':                             'category',
#     'Census_IsFlightingInternal':                           'float16',
#     'Census_IsFlightsDisabled':                             'float16',
#     'Census_FlightRing':                                    'category',
#     'Census_ThresholdOptIn':                                'float16',
#     'Census_FirmwareManufacturerIdentifier':                'float16',
#     'Census_FirmwareVersionIdentifier':                     'float32',
#     'Census_IsSecureBootEnabled':                           'int8',
#     'Census_IsWIMBootEnabled':                              'float16',
#     'Census_IsVirtualDevice':                               'float16',
#     'Census_IsTouchEnabled':                                'int8',
#     'Census_IsPenCapable':                                  'int8',
#     'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
#     'Wdft_IsGamer':                                         'float16',
#     'Wdft_RegionIdentifier':                                'float16',
#     'HasDetections':                                        'int8'
# }

# df_all = pd.read_csv('./input/train.csv.zip', dtype=dtypes) 
# df_all = df_all.sample(frac=0.02, random_state=123)
# df_all.to_csv('./input/train_sample.csv', index=False)

因为这次比赛的数据集有800万的样本，这里我仅随机抽取训练集中2%的样本（17万条），来说明问题。将不会用到任何测试集（700万）中的数据。

In [3]:
df_all = pd.read_csv('./input/train_sample.csv') 
df_all.head()

Unnamed: 0,MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,...,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
0,586d40804b950d0376575fdf10ee89ae,win8defender,1.1.15100.1,4.18.1806.18062,1.273.520.0,0,7.0,0,,53447.0,...,27767.0,1,,0.0,0,0,0.0,1.0,15.0,1.0
1,65fb3fae2d37f90e6b3174592f2490a8,win8defender,1.1.15200.1,4.18.1807.18075,1.275.453.0,0,7.0,0,,7945.0,...,14353.0,0,,0.0,0,0,0.0,0.0,10.0,0.0
2,c23aa37fb69e00afe2668ed150dee1ea,win8defender,1.1.15100.1,4.18.1807.18075,1.273.689.0,0,7.0,0,,53447.0,...,8941.0,1,,0.0,0,0,0.0,1.0,1.0,1.0
3,cba75d6c4d9b6533591e94b9cb8a5df5,win8defender,1.1.15200.1,4.12.16299.15,1.275.483.0,0,7.0,0,,68585.0,...,46589.0,1,,0.0,0,0,0.0,1.0,7.0,1.0
4,149746364c6b763662d03e1f263029fd,win8defender,1.1.15200.1,4.18.1807.18075,1.275.215.0,0,7.0,0,,53447.0,...,52530.0,0,,0.0,0,0,0.0,,,0.0


本次比赛的数据中提供了电脑的 Windows Defender（Windows系统自带的杀毒软件） 的版本号，所以我们可以通过该本版号发布的时间，粗略的推测采集该样本的时间。

这里AvSigVersionTimestamps就是各个版本对应的发布时间。

通过和该数据匹配，我们生成了一个新的字段 -- Date。

In [4]:
# IMPORT TIMESTAMP DICTIONARY
datedict = np.load('./input/AvSigVersionTimestamps.npy')
datedict = datedict[()]

df_all['Date'] = df_all['AvSigVersion'].map(datedict)

# MachineIdentifier是每台电脑的唯一识别号，对于模型的预测没有任何帮助，所以剔除。
df_all.drop(['MachineIdentifier'], axis=1, inplace=True) 

## 数据清理

### 去掉无意义变量

这里无意义变量的定义是：变量中某个值（可以是空值）的占比大于99%。

In [5]:
bad_cols = []
for col in df_all.columns:
    rate_train = df_all[col].value_counts(normalize=True, dropna=False).values[0]
    if rate_train > 0.99:
        bad_cols.append(col)

df_all = df_all.drop(bad_cols, axis=1)

print('Data Shape: ', df_all.shape)
print(bad_cols)

Data Shape:  (178430, 75)
['IsBeta', 'AutoSampleOptIn', 'PuaMode', 'UacLuaenable', 'Census_DeviceFamily', 'Census_ProcessorClass', 'Census_IsPortableOperatingSystem', 'Census_IsVirtualDevice']


### 定义数据类型

这里是通过EDA(Exploratory Data Analysis)的方式，人工判断的变量类型。

总共将变量分为
* 数值变量（true_numerical_columns）
* 一般的分类变量（categorical_columns）
* 类别非常多的分类变量（categorical_columns_high_car）

如果你对这次比赛的细节感兴趣，可以再深入研究为什么这样判断。这里就不详细阐述原因了。

In [6]:
true_numerical_columns = [
    'Census_PrimaryDiskTotalCapacity', 'Census_SystemVolumeTotalCapacity',
    'Census_TotalPhysicalRAM', 'Census_InternalBatteryNumberOfCharges'
]

categorical_columns_high_car = [
    'Census_FirmwareVersionIdentifier', 'Census_OEMModelIdentifier',
    'AVProductStatesIdentifier', 'Census_FirmwareManufacturerIdentifier',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches',
    'Census_InternalPrimaryDisplayResolutionHorizontal',
    'Census_InternalPrimaryDisplayResolutionVertical',
    'Census_OEMNameIdentifier', 'Census_ProcessorModelIdentifier',
    'CityIdentifier', 'DefaultBrowsersIdentifier', 'OsBuildLab'
]

categorical_columns = [
    c for c in df_all.columns
    if c not in (['HasDetections', 'Date'] + true_numerical_columns +
                 categorical_columns_high_car)
]
print(categorical_columns)

['ProductName', 'EngineVersion', 'AppVersion', 'AvSigVersion', 'RtpStateBitfield', 'IsSxsPassiveMode', 'AVProductsInstalled', 'AVProductsEnabled', 'HasTpm', 'CountryIdentifier', 'OrganizationIdentifier', 'GeoNameIdentifier', 'LocaleEnglishNameIdentifier', 'Platform', 'Processor', 'OsVer', 'OsBuild', 'OsSuite', 'OsPlatformSubRelease', 'SkuEdition', 'IsProtected', 'SMode', 'IeVerIdentifier', 'SmartScreen', 'Firewall', 'Census_MDC2FormFactor', 'Census_ProcessorCoreCount', 'Census_ProcessorManufacturerIdentifier', 'Census_PrimaryDiskTypeName', 'Census_HasOpticalDiskDrive', 'Census_ChassisTypeName', 'Census_PowerPlatformRoleName', 'Census_InternalBatteryType', 'Census_OSVersion', 'Census_OSArchitecture', 'Census_OSBranch', 'Census_OSBuildNumber', 'Census_OSBuildRevision', 'Census_OSEdition', 'Census_OSSkuName', 'Census_OSInstallTypeName', 'Census_OSInstallLanguageIdentifier', 'Census_OSUILocaleIdentifier', 'Census_OSWUAutoUpdateOptionsName', 'Census_GenuineStateName', 'Census_ActivationChan

### 编码 -- Lebel Encoding

因为将使用的模型是[LightGBM](https://lightgbm.readthedocs.io/en/latest/)，所以我们需要对分类变量做编码。这里用的方法是label encoding。

（为什么将NAN设置为0，而不是最大值，会对模型的效果有什么样的影响呢？）

In [7]:
def factor_data(df, col):
    df_labeled, _ = df[col].factorize(sort=True)
    # MAKE SMALLEST LABEL 1, RESERVE 0
    df_labeled += 1
    # MAKE NAN LARGEST LABEL
    df_labeled = np.where(df_labeled==0, df_labeled.max()+1, df_labeled)
    df[col] = df_labeled

In [8]:
for col in tqdm(categorical_columns + categorical_columns_high_car):
    factor_data(df_all, col) 

100%|██████████████████████████████████████████████████████████████████████████████████| 69/69 [00:01<00:00, 38.24it/s]


## 拆分测试集

因为这里只用到了训练集2%的样本，而没有用到任何测试集的数据，所以我们需要从训练集中拆分出一个测试集。

因为训练集和测试集是根据时间划分的，所以我们从训练集拆分的测试集，同样也根据时间划分。

In [25]:
df_all = df_all.sort_values('Date')  
df_all.drop(['Date'], axis=1, inplace=True)

KeyError: 'Date'

In [29]:
df_test = df_all.iloc[int(0.8*len(df_all)):, ]
df_train = df_all.iloc[:int(0.8*len(df_all)), ]

## 对抗验证（Adversarial Validatiion）

步骤如下：

* 定义新的Y（因变量）：样本是train还是test。训练集中的样本统一标记为0，测试集则标记为1。
* 将 Train 和 Test 合成一个数据集
* 构造一个模型，拟合新定义的Y。
* 观察模型效果：如果模型的AUC超过0.7，说明了 Train 和 Test 的分布存在较大的差异。

In [30]:
# 定义新的Y
df_train['Is_Test'] = 0
df_test['Is_Test'] = 1

# 将 Train 和 Test 合成一个数据集。HasDetections是数据本来的Y，所以剔除。
df_adv = pd.concat([df_train, df_test])

adv_data = lgb.Dataset(
    data=df_adv.drop('Is_Test', axis=1), label=df_adv.loc[:, 'Is_Test'])

# 定义模型参数
params = {
    'boosting_type': 'gbdt',
    'colsample_bytree': 1,
    'learning_rate': 0.1,
    'max_depth': 5,
    'min_child_samples': 100,
    'min_child_weight': 1,
    'min_split_gain': 0.0,
    'num_leaves': 20,
    'objective': 'binary',
    'random_state': 50,
    'subsample': 1.0,
    'subsample_freq': 0,
    'metric': 'auc',
    'num_threads': 8
}

# 交叉验证
adv_cv_results = lgb.cv(
    params,
    adv_data,
    num_boost_round=10000,
    nfold=5,
    categorical_feature=categorical_columns,
    early_stopping_rounds=200,
    verbose_eval=True,
    seed=42)

print('交叉验证中最优的AUC为 {:.5f}，对应的标准差为{:.5f}.'.format(
    adv_cv_results['auc-mean'][-1], adv_cv_results['auc-stdv'][-1]))

print('模型最优的迭代次数为{}.'.format(len(adv_cv_results['auc-mean'])))

[1]	cv_agg's auc: 0.98165 + 0.000908944
[2]	cv_agg's auc: 0.992039 + 0.00125035
[3]	cv_agg's auc: 0.997865 + 0.000240807
[4]	cv_agg's auc: 0.99826 + 0.000358246
[5]	cv_agg's auc: 0.999219 + 0.000112415
[6]	cv_agg's auc: 0.999383 + 4.48e-05
[7]	cv_agg's auc: 0.999477 + 3.4368e-05
[8]	cv_agg's auc: 0.999485 + 5.05065e-05
[9]	cv_agg's auc: 0.999507 + 4.14044e-05
[10]	cv_agg's auc: 0.999573 + 5.05698e-05
[11]	cv_agg's auc: 0.999629 + 3.29159e-05
[12]	cv_agg's auc: 0.999668 + 3.21742e-05
[13]	cv_agg's auc: 0.999689 + 2.63383e-05
[14]	cv_agg's auc: 0.999719 + 1.95824e-05
[15]	cv_agg's auc: 0.999736 + 1.23727e-05
[16]	cv_agg's auc: 0.999749 + 1.97004e-05
[17]	cv_agg's auc: 0.999769 + 1.24006e-05
[18]	cv_agg's auc: 0.999782 + 6.12502e-06
[19]	cv_agg's auc: 0.999817 + 2.00364e-05
[20]	cv_agg's auc: 0.999843 + 1.72728e-05
[21]	cv_agg's auc: 0.999855 + 2.0813e-05
[22]	cv_agg's auc: 0.999865 + 2.36204e-05
[23]	cv_agg's auc: 0.99987 + 2.35125e-05
[24]	cv_agg's auc: 0.999868 + 1.75984e-05
[25]	cv_ag

[194]	cv_agg's auc: 0.999981 + 7.02834e-06
[195]	cv_agg's auc: 0.999981 + 6.87633e-06
[196]	cv_agg's auc: 0.999981 + 6.87354e-06
[197]	cv_agg's auc: 0.999982 + 6.75526e-06
[198]	cv_agg's auc: 0.999982 + 6.7333e-06
[199]	cv_agg's auc: 0.999981 + 7.08061e-06
[200]	cv_agg's auc: 0.999981 + 7.1276e-06
[201]	cv_agg's auc: 0.999981 + 6.98198e-06
[202]	cv_agg's auc: 0.999981 + 6.94695e-06
[203]	cv_agg's auc: 0.999981 + 7.22943e-06
[204]	cv_agg's auc: 0.99998 + 7.76649e-06
[205]	cv_agg's auc: 0.99998 + 7.69322e-06
[206]	cv_agg's auc: 0.99998 + 7.57607e-06
[207]	cv_agg's auc: 0.999981 + 7.53449e-06
[208]	cv_agg's auc: 0.999981 + 7.52494e-06
[209]	cv_agg's auc: 0.999981 + 7.43808e-06
[210]	cv_agg's auc: 0.999981 + 7.44301e-06
[211]	cv_agg's auc: 0.99998 + 7.38099e-06
[212]	cv_agg's auc: 0.99998 + 7.42687e-06
[213]	cv_agg's auc: 0.99998 + 7.96908e-06
[214]	cv_agg's auc: 0.99998 + 8.21543e-06
[215]	cv_agg's auc: 0.99998 + 8.11715e-06
[216]	cv_agg's auc: 0.99998 + 8.04836e-06
[217]	cv_agg's auc: 0.

通过对抗验证，我们发现模型的AUC达到了0.99。说明本次比赛的训练集和测试集的样本分布存在较大的差异。

In [31]:
params['n_estimators'] = len(adv_cv_results['auc-mean'])

model_adv = lgb.LGBMClassifier(**params)
model_adv.fit(df_adv.drop('Is_Test', axis=1), df_adv.loc[:, 'Is_Test'])

# AUC
preds_adv = model_adv.predict_proba(df_adv.drop('Is_Test', axis=1))[:, 1]
auc_test_cv = roc_auc_score(df_adv.loc[:, 'Is_Test'], preds_adv)
print('The baseline model scores {:.5f} ROC AUC on the oot set.'.format(
    auc_test_cv))

The baseline model scores 0.99820 ROC AUC on the oot set.


## 交叉验证（Cross Validation）

现在我们知道了训练集和测试集的分布存在很大的差异。那么接下来，我们采用交叉验证的方法，来评估模型的效果。

In [11]:
train_set = lgb.Dataset(
    df_train.drop('HasDetections', axis=1),
    label=df_train.loc[:, 'HasDetections'])

# Perform cross validation with early stopping
N_FOLDS = 5
cv_results = lgb.cv(
    params,
    train_set,
    num_boost_round=10000,
    nfold=N_FOLDS,
    categorical_feature=categorical_columns,
    early_stopping_rounds=200,
    verbose_eval=True,
    seed=42)

print('交叉验证中最优的AUC为 {:.5f}，对应的标准差为{:.5f}.'.format(
    cv_results['auc-mean'][-1], cv_results['auc-stdv'][-1]))

print('模型最优的迭代次数为{}.'.format(len(cv_results['auc-mean'])))

[1]	cv_agg's auc: 0.673633 + 0.00264353
[2]	cv_agg's auc: 0.678882 + 0.0044727
[3]	cv_agg's auc: 0.6815 + 0.00346469
[4]	cv_agg's auc: 0.682948 + 0.00386172
[5]	cv_agg's auc: 0.684683 + 0.00415183
[6]	cv_agg's auc: 0.686288 + 0.00342958
[7]	cv_agg's auc: 0.687548 + 0.00316655
[8]	cv_agg's auc: 0.688257 + 0.0034225
[9]	cv_agg's auc: 0.689065 + 0.00333237
[10]	cv_agg's auc: 0.690081 + 0.00332501
[11]	cv_agg's auc: 0.690898 + 0.00355812
[12]	cv_agg's auc: 0.691641 + 0.00343357
[13]	cv_agg's auc: 0.692108 + 0.00359949
[14]	cv_agg's auc: 0.692707 + 0.00384956
[15]	cv_agg's auc: 0.693462 + 0.00350975
[16]	cv_agg's auc: 0.693576 + 0.00338896
[17]	cv_agg's auc: 0.693801 + 0.00324963
[18]	cv_agg's auc: 0.694156 + 0.00326456
[19]	cv_agg's auc: 0.694611 + 0.00338037
[20]	cv_agg's auc: 0.694983 + 0.00315375
[21]	cv_agg's auc: 0.695084 + 0.00318677
[22]	cv_agg's auc: 0.695499 + 0.00336445
[23]	cv_agg's auc: 0.695954 + 0.00323137
[24]	cv_agg's auc: 0.696359 + 0.00325667
[25]	cv_agg's auc: 0.69658 + 

[199]	cv_agg's auc: 0.699885 + 0.00327966
[200]	cv_agg's auc: 0.699857 + 0.00331391
[201]	cv_agg's auc: 0.699831 + 0.00324778
[202]	cv_agg's auc: 0.699808 + 0.00326731
[203]	cv_agg's auc: 0.699804 + 0.00324742
[204]	cv_agg's auc: 0.6998 + 0.00319918
[205]	cv_agg's auc: 0.699855 + 0.00312026
[206]	cv_agg's auc: 0.699798 + 0.00315149
[207]	cv_agg's auc: 0.699728 + 0.00318658
[208]	cv_agg's auc: 0.699686 + 0.00320801
[209]	cv_agg's auc: 0.699676 + 0.0031589
[210]	cv_agg's auc: 0.699594 + 0.00315818
[211]	cv_agg's auc: 0.699614 + 0.00317993
[212]	cv_agg's auc: 0.699625 + 0.00321704
[213]	cv_agg's auc: 0.699587 + 0.00327078
[214]	cv_agg's auc: 0.699605 + 0.00327879
[215]	cv_agg's auc: 0.699609 + 0.00328098
[216]	cv_agg's auc: 0.699577 + 0.00326227
[217]	cv_agg's auc: 0.699575 + 0.00330303
[218]	cv_agg's auc: 0.699573 + 0.00333707
[219]	cv_agg's auc: 0.699552 + 0.00333701
[220]	cv_agg's auc: 0.699537 + 0.00333112
[221]	cv_agg's auc: 0.69952 + 0.00331969
[222]	cv_agg's auc: 0.699492 + 0.00331

使用交叉验证的方式来评估模型效果，模型的AUC为0.70125。

In [13]:
params['n_estimators'] = len(cv_results['auc-mean'])

model_cv = lgb.LGBMClassifier(**params)
model_cv.fit(df_train.drop('HasDetections', axis=1), df_train.loc[:, 'HasDetections'])

# AUC
preds_test_cv = model_cv.predict_proba(df_test.drop('HasDetections', axis=1))[:, 1]
auc_test_cv = roc_auc_score(df_test.loc[:, 'HasDetections'], preds_test_cv)
print('The baseline model scores {:.5f} ROC AUC on the oot set.'.format(
    auc_test))

The baseline model scores 0.66980 ROC AUC on the oot set.


模型在测试集上的AUC为0.66980，和模型交叉验证的AUC相差0.03。

我们来试一下其他方法，和CV对比看看。

## 在变量分布变化的情况下，除了交叉验证，还有哪些更优的方法？

这里我们将使用除了交叉验证以外的方法，来评估模型：

* 人工划分验证集
* XX
* 有权重的CV

### 人工划分验证集

根据对数据的了解，我们可以选择人工划分的方式。

因为我知道这个数据是根据时间划分的，所以我的验证集同样根据时间划分。

In [17]:
df_validation_man = df_train.iloc[int(0.8 * len(df_train)):, ]
df_train_man = df_train.iloc[:int(0.8 * len(df_train)), ]

In [18]:
dtrain_man = lgb.Dataset(
    data=df_train_man.drop('HasDetections', axis=1),
    label=df_train_man.loc[:, 'HasDetections'],
    free_raw_data=False,
    silent=True)

dvalid_man = lgb.Dataset(
    data=df_validation_man.drop('HasDetections', axis=1),
    label=df_validation_man.loc[:, 'HasDetections'],
    free_raw_data=False,
    silent=True)

In [20]:
params.pop('n_estimators', None)

clf = lgb.train(
    params=params,
    train_set=dtrain_man,
    num_boost_round=10000,
    valid_sets=[dtrain_man, dvalid_man],
    early_stopping_rounds=200,
    verbose_eval=True,
    categorical_feature=categorical_columns)

[1]	training's auc: 0.688663	valid_1's auc: 0.642369
Training until validation scores don't improve for 200 rounds.
[2]	training's auc: 0.694535	valid_1's auc: 0.649962
[3]	training's auc: 0.696651	valid_1's auc: 0.651748
[4]	training's auc: 0.698669	valid_1's auc: 0.653697
[5]	training's auc: 0.699434	valid_1's auc: 0.654084
[6]	training's auc: 0.701252	valid_1's auc: 0.654651
[7]	training's auc: 0.704317	valid_1's auc: 0.655619
[8]	training's auc: 0.706225	valid_1's auc: 0.65734
[9]	training's auc: 0.707367	valid_1's auc: 0.657409
[10]	training's auc: 0.708339	valid_1's auc: 0.658228
[11]	training's auc: 0.710424	valid_1's auc: 0.65975
[12]	training's auc: 0.711272	valid_1's auc: 0.659604
[13]	training's auc: 0.713813	valid_1's auc: 0.661753
[14]	training's auc: 0.714003	valid_1's auc: 0.66147
[15]	training's auc: 0.715192	valid_1's auc: 0.661742
[16]	training's auc: 0.7158	valid_1's auc: 0.661808
[17]	training's auc: 0.717848	valid_1's auc: 0.662201
[18]	training's auc: 0.719035	val

[151]	training's auc: 0.793186	valid_1's auc: 0.674549
[152]	training's auc: 0.793633	valid_1's auc: 0.674592
[153]	training's auc: 0.793932	valid_1's auc: 0.674592
[154]	training's auc: 0.794392	valid_1's auc: 0.674548
[155]	training's auc: 0.794677	valid_1's auc: 0.674623
[156]	training's auc: 0.794963	valid_1's auc: 0.674759
[157]	training's auc: 0.795398	valid_1's auc: 0.674788
[158]	training's auc: 0.795798	valid_1's auc: 0.674843
[159]	training's auc: 0.796327	valid_1's auc: 0.674707
[160]	training's auc: 0.796553	valid_1's auc: 0.674756
[161]	training's auc: 0.796911	valid_1's auc: 0.674663
[162]	training's auc: 0.797276	valid_1's auc: 0.674895
[163]	training's auc: 0.797627	valid_1's auc: 0.67494
[164]	training's auc: 0.797913	valid_1's auc: 0.675035
[165]	training's auc: 0.798092	valid_1's auc: 0.67496
[166]	training's auc: 0.798412	valid_1's auc: 0.674962
[167]	training's auc: 0.798743	valid_1's auc: 0.674933
[168]	training's auc: 0.798952	valid_1's auc: 0.674969
[169]	traini

[300]	training's auc: 0.836427	valid_1's auc: 0.676322
[301]	training's auc: 0.836627	valid_1's auc: 0.676282
[302]	training's auc: 0.836838	valid_1's auc: 0.676244
[303]	training's auc: 0.837102	valid_1's auc: 0.676212
[304]	training's auc: 0.837276	valid_1's auc: 0.676166
[305]	training's auc: 0.837458	valid_1's auc: 0.676173
[306]	training's auc: 0.837517	valid_1's auc: 0.676152
[307]	training's auc: 0.837965	valid_1's auc: 0.676155
[308]	training's auc: 0.838165	valid_1's auc: 0.67619
[309]	training's auc: 0.838496	valid_1's auc: 0.676137
[310]	training's auc: 0.838724	valid_1's auc: 0.67617
[311]	training's auc: 0.838999	valid_1's auc: 0.676216
[312]	training's auc: 0.83928	valid_1's auc: 0.676236
[313]	training's auc: 0.839509	valid_1's auc: 0.676162
[314]	training's auc: 0.839693	valid_1's auc: 0.676181
[315]	training's auc: 0.839837	valid_1's auc: 0.676239
[316]	training's auc: 0.840141	valid_1's auc: 0.676317
[317]	training's auc: 0.840477	valid_1's auc: 0.676399
[318]	trainin

[449]	training's auc: 0.866102	valid_1's auc: 0.675878
[450]	training's auc: 0.866258	valid_1's auc: 0.675907
[451]	training's auc: 0.866414	valid_1's auc: 0.675901
[452]	training's auc: 0.866592	valid_1's auc: 0.675812
[453]	training's auc: 0.866637	valid_1's auc: 0.675805
[454]	training's auc: 0.866966	valid_1's auc: 0.675809
[455]	training's auc: 0.867181	valid_1's auc: 0.675694
[456]	training's auc: 0.867438	valid_1's auc: 0.675687
[457]	training's auc: 0.867533	valid_1's auc: 0.67569
[458]	training's auc: 0.867633	valid_1's auc: 0.675662
[459]	training's auc: 0.867778	valid_1's auc: 0.675638
[460]	training's auc: 0.867888	valid_1's auc: 0.675644
[461]	training's auc: 0.868079	valid_1's auc: 0.675677
[462]	training's auc: 0.86834	valid_1's auc: 0.675634
[463]	training's auc: 0.868478	valid_1's auc: 0.675628
[464]	training's auc: 0.868591	valid_1's auc: 0.67561
[465]	training's auc: 0.868712	valid_1's auc: 0.675591
[466]	training's auc: 0.868824	valid_1's auc: 0.675585
[467]	trainin

In [23]:
params['n_estimators'] = clf.num_trees()

model_man = lgb.LGBMClassifier(**params)
model_man.fit(
    df_train_man.drop('HasDetections', axis=1),
    df_train_man.loc[:, 'HasDetections'])

# AUC
preds_test_man = model_man.predict_proba(df_test.drop('HasDetections', axis=1))[:, 1]
auc_test_man = roc_auc_score(df_test.loc[:, 'HasDetections'], preds_test_man)
print('The baseline model scores {:.5f} ROC AUC on the oot set.'.format(
    roc_auc_score(df_test.loc[:, 'HasDetections'], preds_test)))

The baseline model scores 0.67229 ROC AUC on the oot set.


In [24]:
0.676596 - 0.67229

0.004305999999999921

使用人工划分验证集的方式：AUC差值为0.004。

### 用概率高的作为验证集

In [35]:
df_train_weights = preds_adv[:len(df_train)]

### 有权重的CV 

In [None]:
df_train_weights = preds_adv[:len(df_train)]

# 结论

CV不能够准确评估模型在测试集上的效果，需要用其他的Validation的方式。下一篇文章，我们将使用更多的方法来解决样本分布变化的问题。

# 实验：categorical features with high cardinality

## 重新拆分测试集

In [None]:
df_test = df_all.iloc[int(0.8*len(df_all)):, ]
df_train = df_all.iloc[:int(0.8*len(df_all)), ]

In [None]:
categorical_columns = [
    c for c in df_all.columns
    if c not in (['HasDetections'] + true_numerical_columns)
]

print(categorical_columns)

## 交叉验证 对比组

In [None]:
params = {
    'boosting_type': 'gbdt',
    'colsample_bytree': 1,
    'learning_rate': 0.1,
    'max_depth': 5,
    'min_child_samples': 100,
    'min_child_weight': 1,
    'min_split_gain': 0.0,
    'num_leaves': 20,
    'objective': 'binary',
    'random_state': 50,
    'subsample': 1.0,
    'subsample_freq': 0,
    'metric': 'auc',
    'num_threads': 8
}

In [None]:
train_set = lgb.Dataset(
    df_train.drop('HasDetections', axis=1),
    label=df_train.loc[:, 'HasDetections'])

# Perform cross validation with early stopping
N_FOLDS = 5
cv_results = lgb.cv(
    params,
    train_set,
    num_boost_round=10000,
    nfold=N_FOLDS,
    categorical_feature=categorical_columns,
    early_stopping_rounds=200,
    verbose_eval=True,
    seed=42)

# Highest score
best = cv_results['auc-mean'][-1]

# Standard deviation of best score
best_std = cv_results['auc-stdv'][-1]

print(
    'The maximium ROC AUC in cross validation was {:.5f} with std of {:.5f}.'.
    format(best, best_std))

print('The ideal number of iterations was {}.'.format(
    len(cv_results['auc-mean'])))

In [None]:
params['n_estimators'] = len(cv_results['auc-mean'])

In [None]:
model = lgb.LGBMClassifier(**params)
model.fit(df_train.drop('HasDetections', axis=1), df_train.loc[:, 'HasDetections'])

# AUC
preds_test = model.predict_proba(df_test.drop('HasDetections', axis=1))[:, 1]
auc_test = roc_auc_score(df_test.loc[:, 'HasDetections'], preds_test)
print('The baseline model scores {:.5f} ROC AUC on the oot set.'.format(
    auc_test))

* 迭代次数减少：83到41
* Validation的AUC降低：0.70125到0.70060
* Test的AUC降低：0.66980到0.66051

## 人工划分验证集 对比组

In [None]:
df_validation = df_train.iloc[int(0.8*len(df_train)):, ]
df_train = df_train.iloc[:int(0.8*len(df_train)), ]

In [None]:
dtrain = lgb.Dataset(data=df_train.drop('HasDetections', axis=1), 
                             label=df_train.loc[:, 'HasDetections'], 
                             free_raw_data=False, silent=True)

dvalid = lgb.Dataset(data=df_validation.drop('HasDetections', axis=1), 
                     label=df_validation.loc[:, 'HasDetections'], 
                     free_raw_data=False, silent=True)

In [None]:
params.pop('n_estimators', None)

In [None]:
clf = lgb.train(
            params=params,
            train_set=dtrain,
            num_boost_round=10000,
            valid_sets=[dtrain, dvalid],
            early_stopping_rounds=200,
            verbose_eval=True,
            categorical_feature = categorical_columns
        )

In [None]:
params['n_estimators'] = 46

In [None]:
model = lgb.LGBMClassifier(**params)
model.fit(df_train.drop('HasDetections', axis=1), df_train.loc[:, 'HasDetections'])

# AUC
preds_test = model.predict_proba(df_test.drop('HasDetections', axis=1))[:, 1]
auc_test = roc_auc_score(df_test.loc[:, 'HasDetections'], preds_test)
print('The baseline model scores {:.5f} ROC AUC on the oot set.'.format(
    auc_test))

* 迭代次数减少：363到46
* Validation的AUC降低：0.676596到0.668609
* Test的AUC降低：0.67229到0.66146