## GBDT+LR模型
### 基本思想
* 利用GBDT自动进行特征筛选和组合，进而生成新的离散特征向量，再把该特征向量当做LR模型输入，预估CTR的模型结构。
    * 其中，需要注意的是，用GBDT构建特征工程，利用LR预估CTR这两步是独立训练的。
### GBDT
* 基本结构：GBDT的基本结构是决策树组成的树林，学习的方式是梯度提升。
    * 具体而言，GBDT作为集成模型，预测的方式是将所有子树的结果加起来。
    * GBDT通过逐一生成决策子树的方式生成整个树林，生成新子树的过程是利用样本标签值与当前树林预测值之间的残差，构建新的子树
    * 理论上，如果可以无限生成决策树，那么GBDT就可以无限逼近由所有训练集样本组成的目标拟合函数，从而达到减小预测误差的目的。
    * 决策树的深度决定了特征交叉的阶数。如果决策树的深度为4，则通过3次节点分裂，最终的叶节点实际上是进行三阶特征组合后的结果。
* 类型
    * 回归树
    * 分类树
    * 二叉树
    * 多叉树
* 变体
    * XGBoost（Extreme Gradient Boosting）
    * LightGBM（Light Gradient Boosting Machine）
    * CatBoost（Categorical Boosting）

### 1. 加载数据

In [1]:
import pandas as pd

# 读取训练和测试数据
df_train=pd.read_csv('/Users/linjiaxi/Desktop/RecommendationSystem/CTR_predictions_GBDT_LR/data/train.csv')
df_test=pd.read_csv('/Users/linjiaxi/Desktop/RecommendationSystem/CTR_predictions_GBDT_LR/data/test.csv')

df_train.drop(['Id'],axis=1,inplace=True)
df_test.drop(['Id'],axis=1,inplace=True)
df_test['Label']=-1
data=pd.concat([df_train,df_test])
data=data.fillna(-1)
data.to_csv('./data/data.csv')

### 2. 类别特征one-hot编码

In [2]:
category_feature=['C1','C2','C3','C4','C5','C6','C7','C8','C9','C10','C11','C12','C13','C14','C15','C16','C17','C18','C19','C20','C21','C22','C23','C24','C25','C26']

In [3]:
for col in category_feature:
    onehot_feats=pd.get_dummies(data[col],prefix=col)
    data.drop([col],axis=1,inplace=True)
    data=pd.concat([data,onehot_feats],axis=1)

In [27]:
data

Unnamed: 0,Label,I1,I2,I3,I4,I5,I6,I7,I8,I9,...,C26_fb7edec8,C26_fbe10aa8,C26_fcd456fa,C26_fcd5a3f4,C26_fd6ccd1e,C26_fdd86175,C26_fe7d4d4a,C26_ff2cdc2b,C26_ff86d5e0,C26_ffc123e9
0,1,1.0,0,1.0,-1.0,227.0,1.0,173.0,18.0,50.0,...,False,False,False,False,False,False,False,False,False,False
1,1,4.0,1,1.0,2.0,27.0,2.0,4.0,2.0,2.0,...,False,False,False,False,False,False,False,False,False,False
2,1,0.0,806,-1.0,-1.0,1752.0,142.0,2.0,0.0,50.0,...,False,False,False,False,False,False,False,False,False,False
3,0,2.0,-1,42.0,14.0,302.0,38.0,25.0,38.0,90.0,...,False,False,False,False,False,False,False,False,False,False
4,1,0.0,57,2.0,1.0,2891.0,2.0,35.0,1.0,137.0,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,-1,1.0,0,1.0,-1.0,149.0,5.0,1.0,0.0,0.0,...,False,False,False,False,False,False,False,False,False,False
396,-1,-1.0,-1,-1.0,-1.0,-1.0,-1.0,0.0,0.0,6.0,...,False,False,False,False,False,False,False,False,False,False
397,-1,0.0,300,4.0,-1.0,4622.0,25.0,20.0,6.0,55.0,...,False,False,False,False,False,False,False,False,False,False
398,-1,1.0,1,2.0,1.0,5.0,1.0,1.0,1.0,1.0,...,False,False,False,False,False,False,False,False,False,False


### 3. 划分GBDT的训练集与验证集

In [4]:
train=data[data['Label']!=1]
test=data[data['Label']==1]

In [7]:
target=train.pop('Label')
print(target)

3      0
5      0
7      0
9      0
10     0
      ..
395   -1
396   -1
397   -1
398   -1
399   -1
Name: Label, Length: 1665, dtype: int64


In [5]:
test.drop(['Label'],axis=1,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test.drop(['Label'],axis=1,inplace=True)


In [8]:
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(train, target, test_size = 0.2, random_state = 2020)

In [36]:
x_train.head()

Unnamed: 0,I1,I2,I3,I4,I5,I6,I7,I8,I9,I10,...,C26_fb7edec8,C26_fbe10aa8,C26_fcd456fa,C26_fcd5a3f4,C26_fd6ccd1e,C26_fdd86175,C26_fe7d4d4a,C26_ff2cdc2b,C26_ff86d5e0,C26_ffc123e9
311,2.0,48,14.0,4.0,131.0,7.0,16.0,7.0,8.0,1.0,...,False,False,False,False,False,False,False,False,False,False
159,2.0,24,20.0,10.0,1.0,2.0,6.0,16.0,213.0,1.0,...,False,False,False,False,False,False,False,False,False,False
326,3.0,13,2.0,2.0,43.0,2.0,9.0,10.0,39.0,1.0,...,False,False,False,False,False,False,False,False,False,False
1266,8.0,27,7.0,5.0,18.0,6.0,8.0,7.0,8.0,1.0,...,False,False,False,False,False,False,False,False,False,False
1476,1.0,8,15.0,14.0,108.0,20.0,8.0,4.0,25.0,1.0,...,False,False,False,False,False,False,False,False,False,False


In [44]:
y_train.head()

311     0
159    -1
326     0
1266    0
1476    0
Name: Label, dtype: int64

In [37]:
x_val.head()

Unnamed: 0,I1,I2,I3,I4,I5,I6,I7,I8,I9,I10,...,C26_fb7edec8,C26_fbe10aa8,C26_fcd456fa,C26_fcd5a3f4,C26_fd6ccd1e,C26_fdd86175,C26_fe7d4d4a,C26_ff2cdc2b,C26_ff86d5e0,C26_ffc123e9
961,-1.0,68,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,...,False,False,False,False,False,False,False,False,False,False
949,1.0,0,109.0,4.0,10.0,0.0,32.0,26.0,562.0,1.0,...,False,False,False,False,False,False,False,False,False,False
1094,0.0,241,3.0,10.0,1504.0,192.0,41.0,47.0,1552.0,0.0,...,False,False,False,False,False,False,False,False,False,False
1111,6.0,1,4.0,21.0,707.0,32.0,6.0,32.0,34.0,2.0,...,False,False,False,False,False,False,False,False,False,False
942,6.0,0,3.0,39.0,72.0,48.0,6.0,1.0,48.0,1.0,...,False,False,False,False,False,False,False,False,False,False


In [45]:
y_val.head()

961     0
949     0
1094    0
1111    0
942     0
Name: Label, dtype: int64

### 4. 训练GBDT

In [32]:
import lightgbm as lgb

n_estimators = 32
num_leaves = 64
# 开始训练gbdt，使用100课树，每课树64个叶节点
model = lgb.LGBMRegressor(objective='binary',
                            subsample= 0.8,
                            min_child_weight= 0.2,
                            colsample_bytree= 0.7,
                            num_leaves=num_leaves,
                            learning_rate=0.05,
                            n_estimators=n_estimators,
                            random_state = 2020)

# 注意：新版本的gbdt需要把-1变成1，否则会把标签-1与识别为负样本！！！！

In [33]:
print(y_train.value_counts())


Label
 0    1006
-1     326
Name: count, dtype: int64


In [54]:
y_train = y_train.replace(-1, 1)


In [53]:
print(y_val.value_counts())


Label
 0    259
-1     74
Name: count, dtype: int64


In [55]:
y_val = y_val.replace(-1, 1)


In [56]:
model.fit(x_train, y_train,
            eval_set = [(x_train, y_train), (x_val, y_val)],
            eval_names = ['train', 'val'],
            eval_metric = 'binary_logloss')

[LightGBM] [Info] Number of positive: 326, number of negative: 1006
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001837 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1662
[LightGBM] [Info] Number of data points in the train set: 1332, number of used features: 162
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.244745 -> initscore=-1.126840
[LightGBM] [Info] Start training from score -1.126840


In [57]:
# 得到每一条训练数据落在了每棵树的哪个叶子结点上
# pred_leaf = True 表示返回每棵树的叶节点序号
gbdt_feats_train = model.predict(train, pred_leaf = True)
    

In [58]:
# 打印结果的 shape：
print(gbdt_feats_train.shape)
# gbdt_feats_train相当于将每一列（每一个样本）变成了一个标签
# 打印前5个数据：
print(gbdt_feats_train[:5])
# 打印train的shape
print(train.shape)
# 打印train的前面几行
print(train.head(5))
print(train.columns)

(1665, 32)
[[ 8 50  6 14 12 12 19 40 47 28 40 45  6 19 21 40 38 46 18 32 19 23 10 12
  30 43  9 19 10 27  7 18]
 [39 49 11 13 44 32 40 14 13 26 49 44 34 52 37 42 43 42 41 47 39 36 39 27
  30 12 37 43 28 17 44 13]
 [20 50 46 44 42 11  7 44 13 47 23 46  9 21 51 32 48 10 33 26 10 52  6  9
   9 11 51 50 40 12 47 42]
 [ 0 27 32  6 39 19 45 22 43 37 42 52 39 46 50 36 49 32 40 28 17 47 20 42
  32 30 37 44 14 14 27 41]
 [45 28 30 29 16 30 11 26 52 12 46 24 35 26 32 39 37 18 42 19 27 31 28 20
  17 15 31  9 30 45 17 37]]
(1665, 13104)
     I1  I2    I3    I4      I5     I6    I7    I8     I9  I10  ...  \
3   2.0  -1  42.0  14.0   302.0   38.0  25.0  38.0   90.0  1.0  ...   
5   0.0  67   3.0  12.0  1470.0   52.0  14.0   6.0   72.0  0.0  ...   
7   0.0   0   3.0   4.0  4520.0  158.0  28.0  23.0  639.0  0.0  ...   
9  -1.0  51   1.0   1.0  2278.0   -1.0   0.0  16.0   41.0 -1.0  ...   
10  1.0   1   4.0  17.0   108.0   22.0   1.0  24.0   22.0  1.0  ...   

    C26_fb7edec8  C26_fbe10aa8  C26_fcd456

In [59]:
# 获取测试集的叶子节点索引
gbdt_feats_test=model.predict(test,pred_leaf=True)

In [60]:
# 将 32 课树的叶节点序号构造成 DataFrame，方便后续进行 one-hot
# 树模型在训练时会将数据划分到不同的节点，每个叶子节点可以视为一个特征。
# 通过将叶节点的编号转化为 one-hot 编码，可以将这种划分信息作为新的特征加入到模型中
gbdt_feats_name = ['gbdt_leaf_' + str(i) for i in range(n_estimators)]
print(len(gbdt_feats_name))
print(gbdt_feats_name)
print(gbdt_feats_train.shape)

32
['gbdt_leaf_0', 'gbdt_leaf_1', 'gbdt_leaf_2', 'gbdt_leaf_3', 'gbdt_leaf_4', 'gbdt_leaf_5', 'gbdt_leaf_6', 'gbdt_leaf_7', 'gbdt_leaf_8', 'gbdt_leaf_9', 'gbdt_leaf_10', 'gbdt_leaf_11', 'gbdt_leaf_12', 'gbdt_leaf_13', 'gbdt_leaf_14', 'gbdt_leaf_15', 'gbdt_leaf_16', 'gbdt_leaf_17', 'gbdt_leaf_18', 'gbdt_leaf_19', 'gbdt_leaf_20', 'gbdt_leaf_21', 'gbdt_leaf_22', 'gbdt_leaf_23', 'gbdt_leaf_24', 'gbdt_leaf_25', 'gbdt_leaf_26', 'gbdt_leaf_27', 'gbdt_leaf_28', 'gbdt_leaf_29', 'gbdt_leaf_30', 'gbdt_leaf_31']
(1665, 32)


In [61]:
df_train_gbdt_feats = pd.DataFrame(gbdt_feats_train, columns = gbdt_feats_name) 
df_test_gbdt_feats = pd.DataFrame(gbdt_feats_test, columns = gbdt_feats_name)
train_len = df_train_gbdt_feats.shape[0]
data = pd.concat([df_train_gbdt_feats, df_test_gbdt_feats])

In [62]:
data.head()

Unnamed: 0,gbdt_leaf_0,gbdt_leaf_1,gbdt_leaf_2,gbdt_leaf_3,gbdt_leaf_4,gbdt_leaf_5,gbdt_leaf_6,gbdt_leaf_7,gbdt_leaf_8,gbdt_leaf_9,...,gbdt_leaf_22,gbdt_leaf_23,gbdt_leaf_24,gbdt_leaf_25,gbdt_leaf_26,gbdt_leaf_27,gbdt_leaf_28,gbdt_leaf_29,gbdt_leaf_30,gbdt_leaf_31
0,8,50,6,14,12,12,19,40,47,28,...,10,12,30,43,9,19,10,27,7,18
1,39,49,11,13,44,32,40,14,13,26,...,39,27,30,12,37,43,28,17,44,13
2,20,50,46,44,42,11,7,44,13,47,...,6,9,9,11,51,50,40,12,47,42
3,0,27,32,6,39,19,45,22,43,37,...,20,42,32,30,37,44,14,14,27,41
4,45,28,30,29,16,30,11,26,52,12,...,28,20,17,15,31,9,30,45,17,37


In [63]:
# 对每棵树的叶子结点序号进行one-hot
for col in gbdt_feats_name:
    onehot_feats=pd.get_dummies(data[col],prefix=col)
    data.drop([col],axis=1,inplace=True)
    data=pd.concat([data,onehot_feats],axis=1)

In [65]:
# 从data里分割出来train和test
train=data[:train_len]
test=data[train_len:]

### 5. 划分LR训练集、验证集

In [66]:
# 划分 LR 训练集、验证集
x_train, x_val, y_train, y_val = train_test_split(train, target, test_size = 0.3, random_state = 2018)
    

### 6. 训练LR

In [68]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

lr=LogisticRegression()
lr.fit(x_train,y_train)
lr_logloss = log_loss(y_train, lr.predict_proba(x_train)[:, 1])
print('lr-logloss: ', lr_logloss)
val_logloss = log_loss(y_val, lr.predict_proba(x_val)[:, 1])
print('val-logloss: ', val_logloss)
# 对测试集预测
y_pred = lr.predict_proba(test)[:, 1]

lr-logloss:  0.05880971194512575
val-logloss:  0.4048955281051941
