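# About this document

This document supplements the material on model optimization, that is, searching for the best hyperparameters.

For model ensembling, Datawhale has already published a fairly detailed [notebook](https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.15.1cd8593aUpmrCG&postId=95535); refer to it directly.

Model ensembling builds on model tuning. It can be thought of as a panel of judges who each score the same sample (stacking is somewhat more involved).

In fact, ensemble models such as xgboost and lightgbm are themselves a form of model ensembling. Compared with the final blend, the earlier feature engineering and model tuning matter more.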

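# Model tuning and model ensembling

In machine learning, optimizing a model means finding the best set of hyperparameters for a particular problem.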

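## Hyperparameters

Model hyperparameters [are in contrast to model parameters](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/):

* **Model hyperparameters** are best thought of as settings of a machine learning algorithm that the data scientist chooses before training. Examples are the number of trees in a random forest, or the number of neighbors used in K-Nearest Neighbors regression.

* **Model parameters** are what the model learns during training, such as the weights in a linear regression.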
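The distinction can be made concrete with a small sketch. The data below are synthetic, and `knn_predict` is a hypothetical helper written only for this illustration: the slope and intercept are learned from the data, while `n_neighbors` is a value we pick before any fitting happens.

```python
import numpy as np

# Synthetic data: y = 3x + 1 plus a little noise (illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 1.0 + rng.normal(scale=0.05, size=x.size)

# Model parameters: learned from the data by the fitting procedure
slope, intercept = np.polyfit(x, y, deg=1)

def knn_predict(x_train, y_train, x_query, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x_query."""
    nearest = np.argsort(np.abs(x_train - x_query))[:n_neighbors]
    return float(y_train[nearest].mean())

# Hyperparameter: n_neighbors is chosen by us, not learned
pred_k1 = knn_predict(x, y, 0.5, n_neighbors=1)
pred_k5 = knn_predict(x, y, 0.5, n_neighbors=5)

print(f"learned slope={slope:.2f}, intercept={intercept:.2f}")
print(f"k=1 prediction={pred_k1:.2f}, k=5 prediction={pred_k5:.2f}")
```

Changing `n_neighbors` changes the predictions without any retraining step, which is why such values have to be selected by a search procedure rather than by the fit itself.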

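As data scientists we control a model by **choosing its hyperparameters**, and these choices can have a significant effect on final performance (although usually less than getting more data or engineering better features). [Tuning the model hyperparameters](http://scikit-learn.org/stable/modules/grid_search.html) controls the balance between underfitting and overfitting.

* We can try to correct underfitting by building a **more complex model**, for example using more trees in a random forest or more layers in a deep neural network. A model **underfits and has high bias** when it does not have enough capacity (degrees of freedom) to learn the relationship between the features and the target.
* We can try to correct overfitting by **limiting the model's complexity and applying regularization**. This might mean lowering the degree of a polynomial regression or adding dropout layers to a deep neural network. **An overfit model has high variance** and has in effect memorized the training set.

**Both underfitting and overfitting lead to poor generalization performance on the test set**.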
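As a rough illustration of the trade-off, the sketch below fits polynomials of increasing degree to synthetic data. The data-generating function, the noise level and the chosen degrees are all assumptions made for this example. Mean squared error is used here (rather than MAE) because `np.polyfit` minimizes squared error, so the training error can only go down as the degree grows.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Synthetic samples of y = sin(3x) + noise on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_err = mse(y_train, np.polyval(coefs, x_train))
    test_err = mse(y_test, np.polyval(coefs, x_test))
    results[degree] = (train_err, test_err)
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```

The straight line (degree 1) cannot follow the curve and has a high error on both sets, which is the underfitting case. A widening gap between training and test error at higher degrees is the usual sign of overfitting.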

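The problem with choosing hyperparameters is that no set of values works best across all problems. For each new dataset we therefore have to find the best settings. This can be time-consuming, but fortunately Scikit-Learn offers several ways to carry it out. Better still, newer libraries such as [TPOT](https://epistasislab.github.io/tpot/) from the Epistasis Lab aim to automate the process. For now we will stick to doing this (somewhat) manually in Scikit-Learn.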

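__[Common ways of setting hyperparameters](https://www.jianshu.com/p/1b23afa34a47)__:

1. **Guess and check**: pick values based on experience or intuition, and keep iterating.
2. **Grid search**: let the computer try a set of values spread evenly over a range.
3. **Random search**: let the computer try values picked at random.
4. **Bayesian optimization**: tune the hyperparameters with Bayesian optimization. The difficulty is that the Bayesian optimization algorithm itself has many parameters to set.
5. **Local optimization from a good initial guess**: this is the approach taken by MITIE, which uses the BOBYQA algorithm with a carefully chosen starting point. Because BOBYQA only finds the nearest local optimum, success depends heavily on having a good starting point. In MITIE's case a good starting point is known, but this is not a general solution because usually you will not know where a good starting point is. On the plus side, this approach is very well suited to finding local optima.
6. The recently proposed **LIPO global optimization method**. It has no parameters of its own and has been shown empirically to outperform random search.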
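To make the difference between grid search and random search (items 2 and 3) concrete, here is a minimal sketch using only the standard library. The search space is hypothetical and is not the one tuned later in this document.

```python
import itertools
import random

# Hypothetical search space, for illustration only
space = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "subsample": [0.6, 0.8, 1.0],
}
names = list(space)

# Grid search: every combination of the listed values
grid_candidates = [
    dict(zip(names, values))
    for values in itertools.product(*(space[name] for name in names))
]

# Random search: a fixed budget of combinations drawn at random
rng = random.Random(0)
random_candidates = [
    {name: rng.choice(space[name]) for name in names} for _ in range(10)
]

print(len(grid_candidates), "grid candidates")
print(len(random_candidates), "random candidates")
```

The grid grows multiplicatively with every added hyperparameter, while the random budget stays whatever we set it to. That is the practical reason random search is often used first to narrow down the ranges.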


## Import

In [1]:
# Import the packages we need
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# Suppress the warning about setting values on a copy of a slice
pd.options.mode.chained_assignment = None

# Display at most 60 columns of a DataFrame
pd.set_option('display.max_columns', 60)

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Set the default font size
plt.rcParams['font.size'] = 24

# Helper for setting the figure size
from IPython.core.pylabtools import figsize

# Seaborn for visualization
import seaborn as sns
sns.set(font_scale = 2)

# Splitting data into training and test sets

from sklearn.model_selection import train_test_split

import math

# Import xgb, lgb and catboost
import xgboost as xgb
import lightgbm as lgb
import catboost

## Hyperparameter search and evaluation
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error


## Loading the data

In [2]:
## Read the data with pandas
Train_data = pd.read_csv('used_car_train_20200313.csv', sep=' ')
TestA_data = pd.read_csv('used_car_testA_20200313.csv', sep=' ')

## Print the shape of each dataset
print('Train data shape:',Train_data.shape)
print('TestA data shape:',TestA_data.shape)

Train data shape: (150000, 31)
TestA data shape: (50000, 30)


Building on the earlier exploratory data analysis, we define the preprocessing functions directly.

In [3]:
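# Define a function that removes highly collinear features
def remove_collinear_features(x, threshold):
    '''
    Objective:
        Remove collinear features whose absolute correlation coefficient reaches
        the threshold. Removing collinear features can help the model generalize
        and makes it easier to interpret.

    Inputs:
        x: the input DataFrame
        threshold: drop one feature of any pair whose absolute correlation is at least this value

    Output:
        A DataFrame that contains only the features that are not highly collinear
    '''

    # Work on a copy of the data
    x = x.copy()

    # Only the v_ series of features is examined
    x_numeric = x[['v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7',
                    'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']]


    # Compute the correlation matrix
    corr_matrix = x_numeric.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterate over the correlation matrix and compare correlations.
    # Note: j runs over 0..i-1 while the column index is i+1, so neighbouring
    # columns (i, i+1) are never compared as written. range(i + 1) would include
    # them, but could change which columns are dropped and the shapes shown below.
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = abs(item.values)

            # If the correlation reaches the threshold
            if val >= threshold:
                # Print the correlated features and the correlation value
                # print(col.values[0], "|", row.values[0], "|", round(val[0][0], 3))
                drop_cols.append(col.values[0])

    # Drop one column of each correlated pair
    drops = set(drop_cols)
    x = x.drop(columns = list(drops))

    return x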
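For comparison, the same idea can be written without explicit loops by looking at the strict upper triangle of the correlation matrix. This is a sketch under stated assumptions, not the function used in this notebook: `drop_collinear` is a hypothetical name, it checks every pair of columns (including neighbouring ones), and it always drops the later column of a pair. The toy data are synthetic.

```python
import numpy as np
import pandas as pd

def drop_collinear(df, threshold):
    """Drop the later column of every pair whose |correlation| >= threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so that each pair is considered once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop)

# Toy data: b is almost a copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

reduced = drop_collinear(toy, threshold=0.9)
print(list(reduced.columns))
```

Because it compares more pairs than the loop above, applying it to the competition data would not necessarily reproduce the column set shown in this document.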

In [4]:
def data_processing(df, threshold):
    '''
    Objective:
        Preprocess the input DataFrame. The steps are based on the earlier EDA.
        
    Inputs:
        df: the raw input DataFrame
        threshold: used to remove highly collinear features; a feature is dropped when its correlation reaches this value
    
    Outputs:
        The preprocessed DataFrame
    '''
    Output = df.copy()
    
    # Categorical features
    category_cols = ['brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
    
    
    # Some missing values are encoded as '-'; replace them with np.nan
    Output.replace('-', np.nan, inplace=True)
    
    # Replace "Not Available" entries with np.nan, which can be interpreted as a float
    Output.replace({'Not Available': np.nan}, inplace=True)

    # Add a column that counts the missing values of each sample
    Output['miss_info'] = Output.isnull().sum(axis=1)

    # Convert the data type of certain columns
    for col in list(Output.columns):
        # Select the columns that need to become numeric
        # Any column whose name contains the string below is converted
        if ('notRepairedDamage' in col):
            # Convert the data type to float
            Output[col] = Output[col].astype(float)
    
    # Add the used_time feature: days between registration and listing
    Output['used_time'] = (pd.to_datetime(Output['creatDate'], format='%Y%m%d', errors='coerce') - 
                            pd.to_datetime(Output['regDate'], format='%Y%m%d', errors='coerce')).dt.days
    
    
    # Remove highly collinear features
    Output = remove_collinear_features(Output, threshold=threshold)
    
    # Handle the categorical features (one-hot encoding, currently disabled)
    # data_notRepairedDamage = pd.get_dummies(Output['notRepairedDamage'], prefix='notRepairedDamage', dummy_na=True)
    # data_fuelType = pd.get_dummies(Output['fuelType'], prefix='fuelType', dummy_na=True)
    # data_gearbox = pd.get_dummies(Output['gearbox'], prefix='gearbox', dummy_na=True)
    # data_bodyType = pd.get_dummies(Output['bodyType'], prefix='bodyType', dummy_na=True)
    
    # Output = pd.concat([Output, data_notRepairedDamage, data_fuelType, data_gearbox, data_bodyType], axis=1)
    
    # Drop the features that are no longer needed
    del Output["seller"]
    del Output["offerType"]
    del Output['name']
    del Output['model']
    del Output['regionCode']
    del Output['regDate']
    #del Output['brand']
    #del Output['bodyType']
    #del Output['fuelType']
    #del Output['gearbox']
    #del Output['notRepairedDamage']
    del Output['creatDate']
    
    return Output
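The `used_time` feature relies on `errors='coerce'`, which turns dates that cannot be parsed into `NaT` instead of raising. The sketch below shows the effect on toy values (not rows taken from the competition files). The explicit cast to `str` is a choice made for this example so that the format string is applied to text.

```python
import pandas as pd

# Toy values in YYYYMMDD form; 20070009 has month "00" and is not a valid date
toy = pd.DataFrame({
    "regDate": [20040402, 20070009, 20100115],
    "creatDate": [20160404, 20160309, 20160401],
})

reg = pd.to_datetime(toy["regDate"].astype(str), format="%Y%m%d", errors="coerce")
created = pd.to_datetime(toy["creatDate"].astype(str), format="%Y%m%d", errors="coerce")

# Rows with an unparseable date end up with a missing used_time
toy["used_time"] = (created - reg).dt.days
print(toy)
```

Rows whose date fails to parse are exactly the ones that later land in the "without `used_time`" subset.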

In [5]:
Train_data = data_processing(Train_data, threshold=0.9)
TestA_data = data_processing(TestA_data, threshold=0.9)

## Print the shape of each dataset
print('Train data shape:',Train_data.shape)
print('TestA data shape:',TestA_data.shape)

Train data shape: (150000, 19)
TestA data shape: (50000, 18)


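## Building features and labels

The design is as follows:
 * Split the training samples into two groups: one that includes the categorical features and one that contains numeric features only.
 * Further split the numeric-feature samples into those that have `used_time` and those that do not.
 * Use catboost for regression on the data with categorical features; use xgb and lgb for regression on the numeric features.
 * Finally, blend the models.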

In [6]:
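# Define a helper that builds the datasets, for convenient reuse

def split_data(df, flag=0):
    '''
    Objective:
        Split the dataset into three parts:
            1. all rows with every feature except `used_time`, categorical features included
               (whether or not `used_time` is missing);
            2. the numeric features of the rows that have `used_time`;
            3. the numeric features of the rows that lack `used_time`.
    
    Input:
        df: the DataFrame to split
        flag: 0 to split the training set, 1 to split the test set (default 0)
        
    Output:
        Three datasets: O_1, O_2, O_3
    '''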
#     category_features = ['SaleID', 'notRepairedDamage_0.0', 'notRepairedDamage_1.0', 'notRepairedDamage_nan',
#                         'fuelType_0.0', 'fuelType_1.0', 'fuelType_2.0', 'fuelType_3.0', 'fuelType_4.0', 'fuelType_5.0',
#                         'fuelType_6.0', 'fuelType_nan', 'gearbox_0.0', 'gearbox_1.0', 'gearbox_nan', 'bodyType_0.0',
#                         'bodyType_1.0', 'bodyType_2.0', 'bodyType_3.0', 'bodyType_4.0', 'bodyType_5.0', 'bodyType_6.0',
#                         'bodyType_7.0', 'bodyType_nan', 'price']
    
    category_features = ['SaleID', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
    
    value_features = ['SaleID', 'power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_11',
                     'v_12', 'v_14', 'used_time', 'miss_info']
    
    O_1 = df.copy()
    O_1.replace(np.nan, 'missing', inplace=True)
    O_1['brand'] = O_1['brand'].astype('str')
    O_1 = O_1.drop('used_time', axis=1)
    
    # .copy() so that the column changes below act on independent frames
    has_time = df.loc[df['used_time'].notna()]    
    O_2 = has_time[value_features].copy()
    
    no_time = df.loc[df['used_time'].isna()]    
    O_3 = no_time[value_features].copy()
    
    # Intent: O_3 is known to lack `used_time`, so `miss_info` should be reduced by 1.
    # Note: the result of apply() is not assigned, so this line leaves O_3 unchanged.
    O_3['miss_info'].apply(lambda x: x -1 if x >0 else x)
    
    # Drop the `used_time` column from O_3
    O_3.drop('used_time', axis=1, inplace=True)
    
    if flag == 0:
        O_1['price'] = df['price']
        O_2['price'] = has_time['price']
        O_3['price'] = no_time['price']
    
    return O_1, O_2, O_3
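One detail in `split_data` is worth spelling out: `Series.apply` returns a new Series and does not modify the DataFrame, so calling it without assigning the result has no effect. A minimal sketch on toy values:

```python
import pandas as pd

toy = pd.DataFrame({"miss_info": [3, 0, 1]})

# apply() returns a new Series; the DataFrame stays as it was
returned = toy["miss_info"].apply(lambda x: x - 1 if x > 0 else x)
unchanged = toy["miss_info"].tolist()

# Assigning the result back is what actually changes the column
toy["miss_info"] = returned
changed = toy["miss_info"].tolist()

print(unchanged, changed)
```

Whether the subtraction is wanted at all depends on whether `miss_info` counted the missing `used_time` in the first place. In `data_processing`, `miss_info` is computed before `used_time` is added.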

In [7]:
Train_data_1, Train_data_2, Train_data_3 = split_data(Train_data, flag=0)

Test_data_1, Test_data_2, Test_data_3 = split_data(TestA_data, flag=1)

In [8]:
Test_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
SaleID               50000 non-null int64
brand                50000 non-null object
bodyType             50000 non-null object
fuelType             50000 non-null object
gearbox              50000 non-null object
power                50000 non-null int64
kilometer            50000 non-null float64
notRepairedDamage    50000 non-null object
v_0                  50000 non-null float64
v_1                  50000 non-null float64
v_2                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
v_14                 50000 non-null float64
miss_info            50000 non-null int64
dtypes: float64(9), int64(3), object(5)
memory usage: 6.5+ MB


Inspect the resulting datasets to confirm that the split is correct.

In [9]:
# Part 1: the dataset that includes all categorical features
Train_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 18 columns):
SaleID               150000 non-null int64
brand                150000 non-null object
bodyType             150000 non-null object
fuelType             150000 non-null object
gearbox              150000 non-null object
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_14                 150000 non-null float64
miss_info            150000 non-null int64
dtypes: float64(9), int64(4), object(5)
memory usage: 20.6+ MB


In [10]:
# Part 2: numeric features, rows with `used_time`
Train_data_2.head()

Unnamed: 0,SaleID,power,kilometer,v_0,v_1,v_2,v_3,v_4,v_11,v_12,v_14,used_time,miss_info,price
0,0,60,12.5,43.357796,3.966344,0.050257,2.159744,1.143786,2.804097,-2.420821,0.914762,4385.0,0,1850
1,1,0,15.0,45.305273,5.236112,0.137925,1.380657,-1.422165,2.096338,-1.030483,0.245522,4757.0,1,3600
2,2,163,12.5,45.978359,4.823792,1.319524,-0.998467,-0.996911,1.803559,1.56533,-0.229963,4382.0,0,6222
3,3,193,15.0,45.687478,4.492574,-0.050616,0.8836,-2.228079,1.28594,-0.501868,-0.478699,7125.0,0,2400
4,4,68,5.0,44.383511,2.031433,0.572169,-1.571239,2.246088,0.910783,0.93111,1.923482,1531.0,0,5200


In [11]:
# Part 3: numeric features, rows without `used_time`
Train_data_3.head()

Unnamed: 0,SaleID,power,kilometer,v_0,v_1,v_2,v_3,v_4,v_11,v_12,v_14,miss_info,price
14,14,0,15.0,37.477726,4.439788,17.089898,2.969183,0.301389,18.192443,5.145351,0.645098,3,6900
20,20,54,15.0,41.906646,2.069275,-0.678881,1.825039,-0.039355,2.081788,-2.930227,0.826872,0,990
22,22,75,15.0,41.346476,-3.214313,-1.102793,3.62993,0.736763,0.346678,-3.941949,-0.157513,1,350
42,42,90,15.0,44.693776,2.017017,-1.479142,2.039647,-0.720858,0.107458,-2.330682,-0.022019,1,1600
51,51,109,15.0,41.996992,-3.092623,-2.780618,2.706164,0.980752,-1.360968,-4.125563,-0.208094,0,350


## Model training and evaluation

### Five-fold cross-validation with xgb to check the effect of the model parameters

In [38]:
## xgb-Model
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'

scores_train = []
scores = []


X_data = Train_data_2.drop(['SaleID', 'price'], axis=1)
Y_data = Train_data_2['price']

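## 5-fold cross-validation
## Note: StratifiedKFold is designed for classification targets; for a continuous
## target such as price, KFold is the more usual choice.
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
for train_ind,val_ind in sk.split(X_data,Y_data):
    
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]
    
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)

# Average over all folds. The output recorded below was produced with
# np.mean(score_train), i.e. the training MAE of the last fold only.
print('Train mae:',np.mean(scores_train))
print('Val mae',np.mean(scores))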

Train mae: 582.517264794879
Val mae 674.6033826996667
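The same bookkeeping can be sketched on synthetic data with `KFold`, which is the splitter normally used for regression targets. The linear model and the data below are assumptions made only to keep the example small and fast. The point is that both the training and the validation MAE are collected per fold and then averaged over the lists.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=500)

train_maes, val_maes = [], []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    train_maes.append(mean_absolute_error(y[train_idx], model.predict(X[train_idx])))
    val_maes.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

# Average the per-fold lists, not the scalar left over from the last fold
mean_train_mae = float(np.mean(train_maes))
mean_val_mae = float(np.mean(val_maes))
print("Train MAE:", mean_train_mae)
print("Val MAE:", mean_val_mae)
```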


In [32]:
Train_data

Unnamed: 0,SaleID,brand,bodyType,fuelType,gearbox,power,kilometer,notRepairedDamage,price,v_0,v_1,v_2,v_3,v_4,v_11,v_12,v_14,miss_info,used_time
0,0,6,1.0,0.0,0.0,60,12.5,0.0,1850,43.357796,3.966344,0.050257,2.159744,1.143786,2.804097,-2.420821,0.914762,0,4385.0
1,1,1,2.0,0.0,0.0,0,15.0,,3600,45.305273,5.236112,0.137925,1.380657,-1.422165,2.096338,-1.030483,0.245522,1,4757.0
2,2,15,1.0,0.0,0.0,163,12.5,0.0,6222,45.978359,4.823792,1.319524,-0.998467,-0.996911,1.803559,1.565330,-0.229963,0,4382.0
3,3,10,0.0,0.0,1.0,193,15.0,0.0,2400,45.687478,4.492574,-0.050616,0.883600,-2.228079,1.285940,-0.501868,-0.478699,0,7125.0
4,4,5,1.0,0.0,0.0,68,5.0,0.0,5200,44.383511,2.031433,0.572169,-1.571239,2.246088,0.910783,0.931110,1.923482,0,1531.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149995,149995,10,4.0,0.0,1.0,163,15.0,0.0,5900,45.316543,-3.139095,-1.269707,-0.736609,-1.505820,-2.983973,0.589167,-0.302592,0,5772.0
149996,149996,11,0.0,0.0,0.0,125,10.0,0.0,9500,45.972058,-3.143764,-0.023523,-2.366699,0.698012,-2.774615,2.553994,-0.272160,0,2322.0
149997,149997,11,1.0,1.0,0.0,90,6.0,0.0,7500,44.733481,-3.105721,0.595454,-2.279091,1.423661,-1.630677,2.290197,0.414931,0,2003.0
149998,149998,10,3.0,1.0,0.0,156,15.0,0.0,4999,45.658634,-3.204785,-0.441680,-1.179812,0.620680,-2.633719,1.414937,-1.659014,0,3673.0


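Hyperparameters are tuned with **grid search** and **cross-validation**. (Other methods, such as random search, can of course be tried too.)

* We define a set of options and then evaluate every combination we specified.
* This contrasts with random search, which picks the combinations to try at random.

In general, random search is the better choice when we know little about the best hyperparameters; **we can use random search to narrow the range of options and then run a grid search over the narrower range**.

**Cross-validation** is the method used to assess a set of hyperparameters. Rather than splitting the training set into separate training and validation sets, which would reduce the amount of data available for training, we use K-fold cross-validation.

* This means dividing the training data into K folds and then iterating: we train on K-1 folds and evaluate on the remaining fold.
   * We repeat this K times, so in the end every example in the training data has been used for evaluation once. The key point is that in each iteration we evaluate on data the model was not trained on.

* At the end of K-fold cross-validation we take the average error over the K iterations as the final performance measure, and then train the model on all the training data.

* The recorded performance is used to compare different hyperparameter combinations.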
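The splitting step described above can be sketched in a few lines of NumPy. `k_fold_indices` is a hypothetical helper written for this illustration; in practice `sklearn.model_selection.KFold` does the same job.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=23, n_splits=5))
for train_idx, val_idx in splits:
    print(len(train_idx), "training samples,", len(val_idx), "validation samples")
```

Each validation fold is disjoint from its training folds, and the validation folds together cover the whole dataset.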

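An illustration of K-fold cross-validation with K = 10 is shown below:

![image.png](https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1586013602923&di=49deb291884b635a027d0d87bac948d7&imgtype=0&src=http%3A%2F%2F5b0988e595225.cdn.sohucs.com%2Fimages%2F20181128%2F63366292825f43f08663c2d9650db413.jpeg)

Here we use **grid search with cross-validation** to choose the best hyperparameters for the gradient boosting regressors. We first define a grid and then carry out the following procedure:

* evaluate every hyperparameter combination in the grid with 10-fold cross-validation,
* then select the combination with the best performance.

Of course, we do not run this loop ourselves; we let Scikit-Learn's `GridSearchCV` do it for us.

We pick a few different hyperparameters to tune for each gradient boosting regressor. They all affect the model in different ways that are hard to predict in advance, and the only way to find the best combination for a specific problem is to test them. To learn about the hyperparameters, see the [Scikit-Learn](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor) documentation.

In the code below we create the grid search object, which accepts the following arguments:

* `estimator`: the model
* `param_grid`: the grid of parameter values to try
* `cv`: the number of folds (or a splitter object) for K-fold cross-validation
* `scoring`: the metric used to evaluate the candidates
* `n_jobs`: the number of cores to use (-1 uses all of them)
* `verbose`: how much information to print
* `return_train_score`: whether to also record the training score of each cross-validation fold

(`param_distributions` and `n_iter`, the number of parameter combinations to sample, belong to `RandomizedSearchCV` rather than `GridSearchCV`.)

The grid search object is trained in the same way as any other scikit-learn model. After training we can compare all the hyperparameter combinations and find the one that performs best.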
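A minimal, fast-running sketch of these arguments follows. It uses `Ridge` on synthetic data instead of the boosting models tuned in this document, and the grid of `alpha` values is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

param_grid = {"alpha": [0.01, 1.0, 100.0]}  # hypothetical grid

search = GridSearchCV(
    estimator=Ridge(),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    n_jobs=1,
    return_train_score=True,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_mae = -search.best_score_  # flip the sign to get back an MAE
print("best alpha:", best_alpha, "cross-validated MAE:", best_mae)
```

With `return_train_score=True`, `search.cv_results_` also contains `mean_train_score`, which makes it easy to compare training and validation error for each candidate.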

In [34]:
Train_data_cat = Train_data.copy()
Train_data_cat.replace(np.nan, 'missing', inplace=True)
X_data = Train_data_cat.drop(['SaleID', 'used_time', 'price'], axis=1)
Y_data = Train_data_cat['price']

categorical_features = np.where(X_data.dtypes == 'object')[0]
for i in categorical_features:
    X_data.iloc[:, i] = X_data.iloc[:, i].astype('str')

# Tune hyperparameters with grid search
estimator = catboost.CatBoostRegressor(learning_rate=0.1, n_estimators=350, l2_leaf_reg=0,subsample=0.8, loss_function='MAE')
param_grid = {
    'depth': [7, 10, 12],
}

kfold = StratifiedKFold(n_splits=10, shuffle = True, random_state=42)


# scoring sets the evaluation metric, n_jobs=-1 uses all CPU cores, cv sets the cross-validation splitter
grid_search = GridSearchCV(estimator, param_grid, scoring = 'neg_mean_absolute_error',n_jobs = -1,cv = kfold)


# Run the grid search. use_best_model=True needs an eval_set; without one
# CatBoost switches it off and warns, as in the output below.
grid_result = grid_search.fit(X_data, Y_data, cat_features=categorical_features, use_best_model=True)
print("Best: %f using %s" % (grid_result.best_score_, grid_search.best_params_))

You should provide test set for use best model. use_best_model parameter has been switched to false value.


0:	learn: 4111.6455396	total: 375ms	remaining: 2m 10s
1:	learn: 3780.7131169	total: 759ms	remaining: 2m 12s
2:	learn: 3482.5188121	total: 1.11s	remaining: 2m 8s
3:	learn: 3226.3144854	total: 1.45s	remaining: 2m 5s
4:	learn: 3004.8143971	total: 1.81s	remaining: 2m 4s
...
318:	learn: 516.3484693	total: 1m 44s	remaining: 10.1s
319:	learn: 515.9331889	total: 1m 44s	remaining: 9.8s
320:	learn: 515.7256809	total: 1m 44s	remaining: 9.47s
...

In [13]:
X_data = Train_data_2.drop(['SaleID', 'price'], axis=1)
Y_data = Train_data_2['price']

# Tune hyperparameters with grid search
estimator = xgb.XGBRegressor(learning_rate=0.2,gamma=0, subsample=0.8, colsample_bytree=0.9, max_depth=10)
param_grid = {
    'subsample': [0.8],
    'n_estimators': [300, 350]
}

kfold = StratifiedKFold(n_splits=10, shuffle = True, random_state=42)


# scoring sets the evaluation metric, n_jobs=-1 uses all CPU cores, cv sets the cross-validation splitter
grid_search = GridSearchCV(estimator, param_grid, scoring = 'neg_mean_absolute_error',n_jobs = -1,cv = kfold)


grid_result = grid_search.fit(X_data, Y_data) # run the grid search
print("Best: %f using %s" % (grid_result.best_score_, grid_search.best_params_))



Best: -611.329735 using {'n_estimators': 350, 'subsample': 0.8}


In [36]:
X_data = Train_data_3.drop(['SaleID', 'price'], axis=1)
Y_data = Train_data_3['price']

# Tune hyperparameters with grid search
estimator = xgb.XGBRegressor(learning_rate=0.1, gamma=0,subsample=0.8, colsample_bytree=0.9, max_depth=12)
param_grid = {
    'n_estimators': [400, 450],
}

kfold = StratifiedKFold(n_splits=10, shuffle = True, random_state=42)


# scoring sets the evaluation metric, n_jobs=-1 uses all CPU cores, cv sets the cross-validation splitter
grid_search = GridSearchCV(estimator, param_grid, scoring = 'neg_mean_absolute_error',n_jobs = -1,cv = kfold)


grid_result = grid_search.fit(X_data, Y_data) # run the grid search
print("Best: %f using %s" % (grid_result.best_score_, grid_search.best_params_))



Best: -328.781713 using {'max_depth': 12, 'n_estimators': 400}


In [44]:
X_data = Train_data_2.drop(['SaleID', 'price'], axis=1)
Y_data = Train_data_2['price']

# Tune hyperparameters with grid search
# (`gamma` is an XGBoost parameter name, not a LightGBM one, so it is omitted here)
estimator = lgb.LGBMRegressor(learning_rate=0.1, subsample=0.8, colsample_bytree=0.9, max_depth=20)
param_grid = {
    'n_estimators': [450, 500, 550]
}

kfold = StratifiedKFold(n_splits=10, shuffle = True, random_state=42)


# scoring sets the evaluation metric, n_jobs=-1 uses all CPU cores, cv sets the cross-validation splitter
grid_search = GridSearchCV(estimator, param_grid, scoring = 'neg_mean_absolute_error',n_jobs = -1,cv = kfold)


grid_result = grid_search.fit(X_data, Y_data) # run the grid search
print("Best: %f using %s" % (grid_result.best_score_, grid_search.best_params_))



Best: -628.116327 using {'n_estimators': 550}


In [43]:
X_data = Train_data_3.drop(['SaleID', 'price'], axis=1)
Y_data = Train_data_3['price']

# Tune hyperparameters with grid search
# (`gamma` is an XGBoost parameter name, not a LightGBM one, so it is omitted here)
estimator = lgb.LGBMRegressor(learning_rate=0.1, subsample=0.8, colsample_bytree=0.9)
param_grid = {
    'max_depth': [17, 20, 22],
    'n_estimators': [600, 650]
}

kfold = StratifiedKFold(n_splits=10, shuffle = True, random_state=42)


# scoring sets the evaluation metric, n_jobs=-1 uses all CPU cores, cv sets the cross-validation splitter
grid_search = GridSearchCV(estimator, param_grid, scoring = 'neg_mean_absolute_error',n_jobs = -1,cv = kfold)


grid_result = grid_search.fit(X_data, Y_data) # run the grid search
print("Best: %f using %s" % (grid_result.best_score_, grid_search.best_params_))



Best: -348.606713 using {'max_depth': 22, 'n_estimators': 650}


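Scikit-learn scores the candidates with the negative mean absolute error because it always maximizes the scoring metric. A better score is therefore one that is closer to 0.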
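Flipping the sign turns the reported scores back into ordinary MAE values. The sketch below does this for the four `best_score_` values printed by the xgb and lgb searches above. The dictionary layout and the subset labels are conventions of this example only.

```python
# best_score_ values printed above, keyed by (model, data subset)
best_scores = {
    ("xgb", "with used_time"): -611.329735,
    ("xgb", "without used_time"): -328.781713,
    ("lgb", "with used_time"): -628.116327,
    ("lgb", "without used_time"): -348.606713,
}

# neg_mean_absolute_error is -MAE, so negate to recover the MAE
best_mae = {key: -score for key, score in best_scores.items()}

# For each subset, pick the model with the lowest cross-validated MAE
best_model = {}
for (model, subset), mae in best_mae.items():
    current = best_model.get(subset)
    if current is None or mae < best_mae[(current, subset)]:
        best_model[subset] = model

for subset, model in best_model.items():
    print(subset, "->", model, best_mae[(model, subset)])
```

Note that the two data subsets contain different rows, so MAE values are only comparable between models on the same subset.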

We can load the grid search results into a DataFrame and sort them by performance.

In [None]:
# Collect all CV results of the most recent search and sort by validation performance
grid_results = pd.DataFrame(grid_result.cv_results_).sort_values('mean_test_score', ascending = False)

grid_results.head(10)

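There is always a gap between the training error and the test error (the training error is always lower). If the gap is large, however, we want to reduce overfitting by **getting more training data, or by lowering the model's complexity through hyperparameter tuning or regularization**. For a gradient boosting regressor, some options are:

* reducing the number of trees,
* reducing the maximum depth of each tree,
* and increasing the minimum number of samples in a leaf node.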
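The effect of these three knobs can be sketched with scikit-learn's `GradientBoostingRegressor` on synthetic data. The two parameter settings below are arbitrary choices for illustration, not recommendations for the competition data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

settings = {
    # many deep trees, tiny leaves: free to memorize the training set
    "flexible": dict(n_estimators=300, max_depth=6, min_samples_leaf=1),
    # fewer, shallower trees with larger leaves: more constrained
    "constrained": dict(n_estimators=60, max_depth=2, min_samples_leaf=20),
}

errors = {}
for name, params in settings.items():
    model = GradientBoostingRegressor(random_state=0, **params).fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, model.predict(X_tr))
    test_mae = mean_absolute_error(y_te, model.predict(X_te))
    errors[name] = (train_mae, test_mae)
    print(f"{name}: train MAE={train_mae:.3f}, test MAE={test_mae:.3f}")
```

The quantity to watch is the difference between the two numbers for each setting: a very low training MAE combined with a much higher test MAE is the overfitting pattern described above.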

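For anyone who wants to dig deeper into gradient boosting regressors, [this is a great article](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/). For now, we will use the best-performing model and accept that it may be overfitting the training set.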


## Training and prediction

### Defining the catboost, xgboost and lightgbm models

In [12]:
def build_model_cat(x_train, y_train, cat_features):
    model_cat = catboost.CatBoostRegressor(learning_rate=0.1, n_estimators=350, 
                                       l2_leaf_reg=0,subsample=0.8, loss_function='MAE', depth=12)
    model_cat.fit(x_train, y_train, cat_features=cat_features, use_best_model=True)
    return model_cat


def build_model_xgb_1(x_train,y_train):
    xgb_1 = xgb.XGBRegressor(n_estimators=350,learning_rate=0.2,gamma=0, subsample=0.8, 
                             colsample_bytree=0.9, max_depth=10) #, objective ='reg:squarederror'
    xgb_1.fit(x_train, y_train)
    return xgb_1

def build_model_xgb_2(x_train,y_train):
    xgb_2 = xgb.XGBRegressor(n_estimators=400,learning_rate=0.1,gamma=0, subsample=0.8, 
                             colsample_bytree=0.9, max_depth=12) #, objective ='reg:squarederror'
    xgb_2.fit(x_train, y_train)
    return xgb_2

# Note: `gamma` is an XGBoost parameter name, not a LightGBM one, so it is omitted below.
# In LightGBM, `subsample` only takes effect when `subsample_freq` is greater than 0.
def build_model_lgb_1(x_train,y_train):
    lgb_1 = lgb.LGBMRegressor(n_estimators=550, learning_rate=0.1, subsample=0.8, 
                                  colsample_bytree=0.9, max_depth=20)
    
    lgb_1.fit(x_train, y_train)
    return lgb_1

def build_model_lgb_2(x_train,y_train):
    lgb_2 = lgb.LGBMRegressor(n_estimators=650, learning_rate=0.1, subsample=0.8, 
                                  colsample_bytree=0.9, max_depth=22)
    
    lgb_2.fit(x_train, y_train)
    return lgb_2

### Using the Catboost model

In [27]:
# Train and predict with catboost on dataset 1

X_train = Train_data_1.drop(['SaleID', 'price'], axis=1)
Y_train = Train_data_1['price']

X_test = Test_data_1.drop('SaleID', axis=1)
res_cat = pd.DataFrame()
res_cat['SaleID'] = Test_data_1['SaleID']

categorical_features_train = np.where(X_train.dtypes == 'object')[0]
for i in categorical_features_train:
    X_train.iloc[:, i] = X_train.iloc[:, i].astype('str')

categorical_features_test = np.where(X_test.dtypes == 'object')[0]
for i in categorical_features_test:
    X_test.iloc[:, i] = X_test.iloc[:, i].astype('str')

test_pool = catboost.Pool(X_test, cat_features=categorical_features_test)

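print('Train catboost ...')
# Pass the categorical column indices computed from X_train in this cell
model_cat = build_model_cat(X_train, Y_train, cat_features=categorical_features_train)
# Note: this MAE is measured on the training data itself
val_cat = model_cat.predict(X_train)
MAE_cat = mean_absolute_error(Y_train, val_cat)
print('MAE with catboost:' ,MAE_cat)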

print('Predict catboost ...')
res_cat['price'] = model_cat.predict(test_pool)
print('Catboost Done!')



Train catboost ...


You should provide test set for use best model. use_best_model parameter has been switched to false value.


0:	learn: 4119.8529224	total: 328ms	remaining: 1m 54s
1:	learn: 3791.2374326	total: 698ms	remaining: 2m 1s
2:	learn: 3485.4772356	total: 1.07s	remaining: 2m 3s
3:	learn: 3256.2890088	total: 1.39s	remaining: 2m
4:	learn: 3033.8341876	total: 1.72s	remaining: 1m 58s
...
319:	learn: 510.7780096	total: 1m 48s	remaining: 10.1s
320:	learn: 510.6075675	total: 1m 48s	remaining: 9.8s
321:	learn: 510.1363654	total: 1m 48s	remaining: 9.47s
...

### Using the xgboost model

In [13]:
# Train and predict with xgb on dataset 2


X_train = Train_data_2.drop(['SaleID', 'price'], axis=1)
Y_train = Train_data_2['price']

X_test = Test_data_2.drop('SaleID', axis=1)
res_xgb_1 = pd.DataFrame()
res_xgb_1['SaleID'] = Test_data_2['SaleID']


print('Train xgboost ...')
model_xgb_1 = build_model_xgb_1(X_train, Y_train)
val_xgb_1 = model_xgb_1.predict(X_train)
MAE_xgb_1 = mean_absolute_error(Y_train, val_xgb_1)
print('MAE with xgboost:' ,MAE_xgb_1)

print('Predict xgboost ...')
res_xgb_1['price'] = model_xgb_1.predict(X_test)
print('Xgboost Done!')



Train xgboost ...




MAE with xgboost: 220.7082378428163
Predict xgboost ...
Xgboost Done!


In [14]:
# Train and predict with xgb on dataset 3

X_train = Train_data_3.drop(['SaleID', 'price'], axis=1)
Y_train = Train_data_3['price']

X_test = Test_data_3.drop('SaleID', axis=1)
res_xgb_2 = pd.DataFrame()
res_xgb_2['SaleID'] = Test_data_3['SaleID']


print('Train xgboost ...')
model_xgb_2 = build_model_xgb_2(X_train, Y_train)
val_xgb_2 = model_xgb_2.predict(X_train)
MAE_xgb_2 = mean_absolute_error(Y_train, val_xgb_2)
print('MAE with xgboost:' ,MAE_xgb_2)

print('Predict xgboost ...')
res_xgb_2['price'] = model_xgb_2.predict(X_test)
print('Xgboost Done!')

Train xgboost ...




MAE with xgboost: 12.582586533195219
Predict xgboost ...
Xgboost Done!


### Using the lightgbm model

In [48]:
# Train and predict with lgb on dataset 2

X_train = Train_data_2.drop(['SaleID', 'price'], axis=1)
Y_train = Train_data_2['price']

X_test = Test_data_2.drop('SaleID', axis=1)
res_lgb_1 = pd.DataFrame()
res_lgb_1['SaleID'] = Test_data_2['SaleID']


print('Train lgb ...')
model_lgb_1 = build_model_lgb_1(X_train, Y_train)
val_lgb_1 = model_lgb_1.predict(X_train)
MAE_lgb_1 = mean_absolute_error(Y_train, val_lgb_1)
print('MAE with lgb:' ,MAE_lgb_1)

print('Predict lgb ...')
res_lgb_1['price'] = model_lgb_1.predict(X_test)
print('Lgb Done!')

Train lgb ...
MAE with lgb: 528.182895551685
Predict lgb ...
Lgb Done!


In [49]:
# Train and predict with lgb on dataset 3

X_train = Train_data_3.drop(['SaleID', 'price'], axis=1)
Y_train = Train_data_3['price']

X_test = Test_data_3.drop('SaleID', axis=1)
res_lgb_2 = pd.DataFrame()
res_lgb_2['SaleID'] = Test_data_3['SaleID']


print('Train lgb ...')
model_lgb_2 = build_model_lgb_2(X_train, Y_train)
val_lgb_2 = model_lgb_2.predict(X_train)
MAE_lgb_2 = mean_absolute_error(Y_train, val_lgb_2)
print('MAE with lgb:' ,MAE_lgb_2)

print('Predict lgb ...')
res_lgb_2['price'] = model_lgb_2.predict(X_test)
print('Lgb Done!')

Train lgb ...
MAE with lgb: 137.66646124469548
Predict lgb ...
Lgb Done!


### Combining the results

The models are blended with a simple weighted average; stacking is not used.

In [50]:
res_cat

Unnamed: 0,SaleID,price
0,150000,34401.554304
1,150001,313.837032
2,150002,6961.440489
3,150003,12732.420952
4,150004,598.709913
...,...,...
49995,199995,2963.090160
49996,199996,1194.045108
49997,199997,7796.021608
49998,199998,9057.747874


In [21]:
res_xgb = pd.concat([res_xgb_1, res_xgb_2], axis=0)
res_xgb.sort_values('SaleID', inplace=True)
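The concatenation works for the later blend because each part keeps the row index it had in the test set, and pandas aligns Series on the index when adding them. A toy sketch (all values hypothetical):

```python
import pandas as pd

test = pd.DataFrame({
    "SaleID": [150000, 150001, 150002, 150003],
    "used_time": [4385.0, None, 4382.0, None],
})

# Split by availability of used_time; each part keeps its original index
part_a = test.loc[test["used_time"].notna(), ["SaleID"]].copy()
part_b = test.loc[test["used_time"].isna(), ["SaleID"]].copy()
part_a["price"] = [100.0, 300.0]  # hypothetical predictions
part_b["price"] = [200.0, 400.0]

# Stitch the parts back together and restore the SaleID order
merged = pd.concat([part_a, part_b], axis=0).sort_values("SaleID")

# Another model's predictions on the default 0..3 index
other = pd.Series([1.0, 2.0, 3.0, 4.0])

# Addition aligns on the index, so rows are matched correctly
total = merged["price"] + other
print(total.tolist())
```

If the index had been reset on only one of the frames, the same arithmetic would silently pair up the wrong rows, so it is worth checking the index before blending.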

In [22]:
res_xgb

Unnamed: 0,SaleID,price
0,150000,38791.167969
1,150001,290.022827
2,150002,6402.432617
3,150003,11093.931641
4,150004,646.546143
...,...,...
49995,199995,1890.720703
49996,199996,1531.341919
49997,199997,8753.550781
49998,199998,8672.522461


In [23]:
res_xgb['price'] = res_xgb['price'].apply(lambda x: 10 if x<10 else x)

In [25]:
res_xgb.to_csv('res_xgb.csv', index=False)

In [24]:
res_xgb['price'].describe()

count    50000.000000
mean      5926.376987
std       7399.875033
min         10.000000
25%       1365.219940
50%       3263.214111
75%       7693.387451
max      94094.398438
Name: price, dtype: float64

In [18]:
Train_data['price'].describe()

count    150000.000000
mean       5923.327333
std        7501.998477
min          11.000000
25%        1300.000000
50%        3250.000000
75%        7700.000000
max       99999.000000
Name: price, dtype: float64

In [59]:
res_lgb = pd.concat([res_lgb_1, res_lgb_2], axis=0)
res_lgb.sort_values('SaleID', inplace=True)



In [60]:
res_lgb

Unnamed: 0,SaleID,price,price_1,price_2
0,150000,38640.667360,11387.921718,
1,150001,422.646556,124.559595,
2,150002,7859.604011,2316.330470,
3,150003,11711.598416,3451.564765,
4,150004,612.125190,180.401484,
...,...,...,...,...
49995,199995,1975.274935,582.139954,
49996,199996,1499.053153,441.791023,
49997,199997,7937.583677,2339.312122,
49998,199998,8014.409505,,671.165659


In [82]:
res = pd.DataFrame()
res['SaleID'] = TestA_data['SaleID']
# Mean MAE of each model family; xgb and lgb each average their two sub-models
mae_xgb = (MAE_xgb_1+MAE_xgb_2)/2
mae_lgb = (MAE_lgb_1+MAE_lgb_2)/2
mae_sum = MAE_cat + mae_lgb + mae_xgb

# weight = 2/3 - (share of the summed MAE): the three weights add up to 1,
# and a model with a lower MAE receives a larger weight
weight_cat = (2/3) - MAE_cat/mae_sum
weight_xgb = (2/3) - mae_xgb/mae_sum
weight_lgb = (2/3) - mae_lgb/mae_sum

res['price'] = weight_cat * res_cat['price'] + weight_lgb * res_lgb['price'] + weight_xgb * res_xgb['price']
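The weighting rule is easier to check in isolation. Each weight is 2/3 minus the model's share of the summed MAE. Since the three shares add up to 1, the weights add up to 3 × 2/3 − 1 = 1. The sketch below uses hypothetical MAE values and predictions, and `mae_weights` is a helper name introduced only for this example.

```python
def mae_weights(mae_by_model):
    """weight_i = 2/3 - MAE_i / sum(MAE); only valid for exactly three models."""
    if len(mae_by_model) != 3:
        raise ValueError("the 2/3 constant makes the weights sum to 1 only for three models")
    total = sum(mae_by_model.values())
    return {name: 2 / 3 - mae / total for name, mae in mae_by_model.items()}

# Hypothetical MAE values and single-row predictions, for illustration only
weights = mae_weights({"cat": 500.0, "xgb": 300.0, "lgb": 200.0})
preds = {"cat": 1000.0, "xgb": 1100.0, "lgb": 1200.0}

blended = sum(weights[name] * preds[name] for name in preds)
print(weights)
print(blended)
```

Two caveats follow from the formula. A model whose MAE is more than two thirds of the total would get a negative weight. Also, the MAE values used in this notebook are measured on the training data, so the rule favors whichever model fits the training set most closely; MAE values from cross-validation would be a more cautious basis.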


In [83]:
res

Unnamed: 0,SaleID,price
0,150000,38144.238975
1,150001,335.549350
2,150002,6943.217646
3,150003,11514.396295
4,150004,629.046462
...,...,...
49995,199995,2063.996634
49996,199996,1475.025688
49997,199997,8362.787903
49998,199998,8515.295491


In [84]:
res['price'].describe()

count    50000.000000
mean      5918.478760
std       7374.641035
min       -260.398814
25%       1374.223368
50%       3256.042068
75%       7691.492944
max      93072.824468
Name: price, dtype: float64

In [85]:
res['price'] = res['price'].apply(lambda x: 10 if x<10 else x)
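The floor applied here can also be written with `Series.clip`, which is vectorized. A small sketch on toy values showing that the two forms agree:

```python
import pandas as pd

# Toy predictions, including a negative one like the minimum seen above
prices = pd.Series([-260.4, 5.0, 10.0, 629.0])

floored_apply = prices.apply(lambda x: 10 if x < 10 else x)
floored_clip = prices.clip(lower=10)

print(floored_clip.tolist())
```

The floor of 10 is presumably chosen because the smallest training price is 11; any value just below the observed minimum would serve the same purpose.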

In [86]:
res['price'].describe()

count    50000.000000
mean      5918.505537
std       7374.619304
min         10.000000
25%       1374.223368
50%       3256.042068
75%       7691.492944
max      93072.824468
Name: price, dtype: float64

In [87]:
res.to_csv('sub_10.csv',index=False)