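# LightGBM Model Training
1. Train with the hyperparameters that achieved the highest AUC on the validation set.
2. Train on the rows at the 70%-100% positions of the training set to obtain model A.
3. Train on the rows at the 50%-70% positions of the training set to obtain model B.
4. Train on the rows at the 30%-50% positions of the training set to obtain model C.
5. Train on the rows at the 10%-30% positions of the training set to obtain model D.
6. Blend models A, B, C, and D in two ways:

   (a) Simple average of the four predictions, giving model E.

   (b) Weighted average with weights 0.6 : 0.2 : 0.1 : 0.1 (A : B : C : D), giving model F.

7. Predict the test set with each of the following models and submit to Kaggle. The public leaderboard scores are:

   Model A: 0.70386 (rank 37)

   Model E: 0.70290 (rank 42)

   Model F: 0.70987 (rank 22)

These results suggest that giving more weight to the last 30% of the training set raises the AUC, while the earlier training data still adds some complementary signal.
Model F is therefore chosen for blending with the neural network model.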
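
As a rough illustration of how the four training slices map to row ranges, the sketch below applies the same `math.ceil` rounding that the later cells use. The helper name `slice_bounds` and the `SLICES` table are illustrative and not part of the original notebook; 7377418 is the row count of `train_all` reported in section 1.

```python
import math

# Fractional row ranges for models A-D, as listed in the steps above.
SLICES = {"A": (0.7, 1.0), "B": (0.5, 0.7), "C": (0.3, 0.5), "D": (0.1, 0.3)}

def slice_bounds(n_rows, slices=SLICES):
    """Return {model: (start_row, end_row)} using ceil rounding on each boundary."""
    return {name: (math.ceil(n_rows * lo), math.ceil(n_rows * hi))
            for name, (lo, hi) in slices.items()}

bounds = slice_bounds(7377418)  # number of rows in train_all
for name, (start, end) in bounds.items():
    print(name, start, end, end - start)
```

For model B this gives 1475484 rows, which matches the training shape printed in section 3.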
   

## 1. Load the training and test sets

In [None]:
import sys
sys.path.append('/home/aistudio/package')

import gc
import math
import numpy as np
import pandas as pd
import lightgbm as lgbm
from sklearn.metrics import roc_auc_score

train_all = pd.read_csv('./music_test/train_nn.csv')
test_all = pd.read_csv('./music_test/test_nn.csv')
members = pd.read_csv('./music_test/members_nn.csv')
songs = pd.read_csv('./music_test/songs_nn.csv')

In [None]:
train_all.shape + test_all.shape + members.shape + songs.shape

(7377418, 48, 2556790, 48, 34403, 143, 419839, 102)

## 2. Train the first model on the last 30% of the training set

In [None]:
train = train_all[math.ceil(train_all.shape[0]*0.7):]
test = test_all[:]

train_y = train['target']
train.drop(['target'], inplace=True, axis=1)

test_id = test['id']
test.drop(['id'], inplace=True, axis=1)

train = train.merge(members, on='msno', how='left')
test = test.merge(members, on='msno', how='left')
train = train.merge(songs, on='song_id', how='left')
test = test.merge(songs, on='song_id', how='left')

# Add a few time-based features
train['time_spent'] = train['timestamp'] - train['registration_init_time']
test['time_spent'] = test['timestamp'] - test['registration_init_time']

train['time_left'] = train['expiration_date'] - train['timestamp']
test['time_left'] = test['expiration_date'] - test['timestamp']

train['msno_upper_time'] = train['msno_timestamp_mean'] + train['msno_timestamp_std']
test['msno_upper_time'] = test['msno_timestamp_mean'] + test['msno_timestamp_std']

train['msno_lower_time'] = train['msno_timestamp_mean'] - train['msno_timestamp_std']
test['msno_lower_time'] = test['msno_timestamp_mean'] - test['msno_timestamp_std']

train['song_upper_time'] = train['song_timestamp_mean'] + train['song_timestamp_std']
test['song_upper_time'] = test['song_timestamp_mean'] + test['song_timestamp_std']

train['song_lower_time'] = train['song_timestamp_mean'] - train['song_timestamp_std']
test['song_lower_time'] = test['song_timestamp_mean'] - test['song_timestamp_std']
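
The same six features are computed for `train` and `test` in every section. A minimal sketch of a helper that would remove the duplication is shown below; `add_time_features` is a hypothetical name, and the demo frame contains made-up values with only the columns the features depend on.

```python
import pandas as pd

def add_time_features(df):
    """Return a copy of df with the six derived time columns added."""
    out = df.copy()
    out["time_spent"] = out["timestamp"] - out["registration_init_time"]
    out["time_left"] = out["expiration_date"] - out["timestamp"]
    for key in ("msno", "song"):
        mean = out[f"{key}_timestamp_mean"]
        std = out[f"{key}_timestamp_std"]
        out[f"{key}_upper_time"] = mean + std   # mean plus one standard deviation
        out[f"{key}_lower_time"] = mean - std   # mean minus one standard deviation
    return out

# Tiny made-up frame, for illustration only.
demo = pd.DataFrame({
    "timestamp": [100.0, 200.0],
    "registration_init_time": [10.0, 50.0],
    "expiration_date": [300.0, 250.0],
    "msno_timestamp_mean": [120.0, 180.0],
    "msno_timestamp_std": [5.0, 10.0],
    "song_timestamp_mean": [90.0, 210.0],
    "song_timestamp_std": [2.0, 4.0],
})
demo = add_time_features(demo)
print(demo[["time_spent", "time_left", "msno_upper_time", "song_lower_time"]])
```

With such a helper, each section would reduce to `train = add_time_features(train)` and `test = add_time_features(test)` after the merges.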

## 3. Train the second model on the 50%-70% slice of the training set

In [None]:
train = train_all[math.ceil(train_all.shape[0]*0.5):math.ceil(train_all.shape[0]*0.7)]
test = test_all[:]

train_y = train['target']
train.drop(['target'], inplace=True, axis=1)

test_id = test['id']
test.drop(['id'], inplace=True, axis=1)

train = train.merge(members, on='msno', how='left')
test = test.merge(members, on='msno', how='left')
train = train.merge(songs, on='song_id', how='left')
test = test.merge(songs, on='song_id', how='left')

# Add a few time-based features
train['time_spent'] = train['timestamp'] - train['registration_init_time']
test['time_spent'] = test['timestamp'] - test['registration_init_time']

train['time_left'] = train['expiration_date'] - train['timestamp']
test['time_left'] = test['expiration_date'] - test['timestamp']

train['msno_upper_time'] = train['msno_timestamp_mean'] + train['msno_timestamp_std']
test['msno_upper_time'] = test['msno_timestamp_mean'] + test['msno_timestamp_std']

train['msno_lower_time'] = train['msno_timestamp_mean'] - train['msno_timestamp_std']
test['msno_lower_time'] = test['msno_timestamp_mean'] - test['msno_timestamp_std']

train['song_upper_time'] = train['song_timestamp_mean'] + train['song_timestamp_std']
test['song_upper_time'] = test['song_timestamp_mean'] + test['song_timestamp_std']

train['song_lower_time'] = train['song_timestamp_mean'] - train['song_timestamp_std']
test['song_lower_time'] = test['song_timestamp_mean'] - test['song_timestamp_std']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)
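
The warning above most likely comes from calling `drop(..., inplace=True)` on a frame obtained by slicing `train_all`. One common way to avoid it, sketched here on a made-up stand-in for `train_all`, is to take an explicit `.copy()` of the slice and use the non-inplace form of `drop`.

```python
import math
import pandas as pd

# Made-up stand-in for train_all, for illustration only.
train_all = pd.DataFrame({"msno": list("abcdefghij"), "target": [0, 1] * 5})

n = train_all.shape[0]
start, end = math.ceil(n * 0.5), math.ceil(n * 0.7)

# An explicit copy makes the slice independent of train_all.
train = train_all.iloc[start:end].copy()
train_y = train["target"]
train = train.drop(columns=["target"])  # returns a new frame; train_all is untouched

print(train.shape, train_all.shape)
```

The cost is one extra copy of the slice in memory, which may matter at the size of the real training set.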


In [None]:
train.shape + test.shape

(1475484, 296, 2556790, 296)

## 4. Train the third model on the 30%-50% slice of the training set

In [None]:
train = train_all[math.ceil(train_all.shape[0]*0.3):math.ceil(train_all.shape[0]*0.5)]
test = test_all[:]

train_y = train['target']
train.drop(['target'], inplace=True, axis=1)

test_id = test['id']
test.drop(['id'], inplace=True, axis=1)

train = train.merge(members, on='msno', how='left')
test = test.merge(members, on='msno', how='left')
train = train.merge(songs, on='song_id', how='left')
test = test.merge(songs, on='song_id', how='left')

# Add a few time-based features
train['time_spent'] = train['timestamp'] - train['registration_init_time']
test['time_spent'] = test['timestamp'] - test['registration_init_time']

train['time_left'] = train['expiration_date'] - train['timestamp']
test['time_left'] = test['expiration_date'] - test['timestamp']

train['msno_upper_time'] = train['msno_timestamp_mean'] + train['msno_timestamp_std']
test['msno_upper_time'] = test['msno_timestamp_mean'] + test['msno_timestamp_std']

train['msno_lower_time'] = train['msno_timestamp_mean'] - train['msno_timestamp_std']
test['msno_lower_time'] = test['msno_timestamp_mean'] - test['msno_timestamp_std']

train['song_upper_time'] = train['song_timestamp_mean'] + train['song_timestamp_std']
test['song_upper_time'] = test['song_timestamp_mean'] + test['song_timestamp_std']

train['song_lower_time'] = train['song_timestamp_mean'] - train['song_timestamp_std']
test['song_lower_time'] = test['song_timestamp_mean'] - test['song_timestamp_std']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


## 5. Train the fourth model on the 10%-30% slice of the training set

In [None]:
train = train_all[math.ceil(train_all.shape[0]*0.1):math.ceil(train_all.shape[0]*0.3)]
test = test_all[:]

train_y = train['target']
train.drop(['target'], inplace=True, axis=1)

test_id = test['id']
test.drop(['id'], inplace=True, axis=1)

train = train.merge(members, on='msno', how='left')
test = test.merge(members, on='msno', how='left')
train = train.merge(songs, on='song_id', how='left')
test = test.merge(songs, on='song_id', how='left')

# Add a few time-based features
train['time_spent'] = train['timestamp'] - train['registration_init_time']
test['time_spent'] = test['timestamp'] - test['registration_init_time']

train['time_left'] = train['expiration_date'] - train['timestamp']
test['time_left'] = test['expiration_date'] - test['timestamp']

train['msno_upper_time'] = train['msno_timestamp_mean'] + train['msno_timestamp_std']
test['msno_upper_time'] = test['msno_timestamp_mean'] + test['msno_timestamp_std']

train['msno_lower_time'] = train['msno_timestamp_mean'] - train['msno_timestamp_std']
test['msno_lower_time'] = test['msno_timestamp_mean'] - test['msno_timestamp_std']

train['song_upper_time'] = train['song_timestamp_mean'] + train['song_timestamp_std']
test['song_upper_time'] = test['song_timestamp_mean'] + test['song_timestamp_std']

train['song_lower_time'] = train['song_timestamp_mean'] - train['song_timestamp_std']
test['song_lower_time'] = test['song_timestamp_mean'] - test['song_timestamp_std']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


In [None]:
# ## feature selection
# feat_importance = pd.read_csv('./music_val/feat_importance_0316_01.csv')
# feature_name = feat_importance['name'].values
# feature_importance = feat_importance['importance'].values

# col_to_drop_by_importance = feature_name[feature_importance<85]
# train.drop(col_to_drop_by_importance, axis=1, inplace=True)
# test.drop(col_to_drop_by_importance, axis=1, inplace=True)

In [None]:
## model training
train_data = lgbm.Dataset(train, label=train_y, free_raw_data=True)

## 6. Predict the test set with LightGBM and generate the submission
The highest AUC on the validation set is 0.76442.
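
For readers less familiar with the metric: AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it directly from that definition on made-up labels and scores. `pairwise_auc` is an illustrative name; because it builds a full positive-by-negative comparison matrix, it is only suitable for small arrays, and `roc_auc_score` (already imported in section 1) is the practical choice on real data.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Made-up example: three of the four positive/negative pairs are ordered correctly.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```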

### 1. max_depth=10, num_leaves=220, min_data_in_leaf=500, l1=6, l2=2000, 5000 rounds

In [None]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],  

    'learning_rate': 0.1,       
    'num_leaves': 220,
    'max_depth': 10,
    'min_data_in_leaf': 500,     
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,  
    'lambda_l1': 6,
    'lambda_l2': 2000,
    'min_gain_to_split': 0,
    'min_sum_hessian_in_leaf': 0.1,

    'num_threads': 16,
    'verbose': 0,
    'is_training_metric': 'True'
}

print('Hyper-parameters:')
print(params)

MAX_ROUNDS = 5000
evals_result = {}

gbm = lgbm.train(params, train_data, num_boost_round=MAX_ROUNDS, evals_result=evals_result, valid_sets=train_data, verbose_eval=100)

# Save the feature importances
feature_importance = pd.DataFrame({'name':gbm.feature_name(), 'importance':gbm.feature_importance()}).sort_values(by='importance', ascending=False)
feature_importance.to_csv('./music_submission/four/feat_importance_for_test_0318_04.csv', index=False)


Hyper-parameters:
{'boosting_type': 'gbdt', 'objective': 'binary', 'metric': ['binary_logloss', 'auc'], 'learning_rate': 0.1, 'num_leaves': 220, 'max_depth': 10, 'min_data_in_leaf': 500, 'feature_fraction': 0.8, 'bagging_fraction': 0.9, 'bagging_freq': 1, 'lambda_l1': 6, 'lambda_l2': 2000, 'min_gain_to_split': 0, 'min_sum_hessian_in_leaf': 0.1, 'num_threads': 16, 'verbose': 0, 'is_training_metric': 'True'}
[100]	training's binary_logloss: 0.489591	training's auc: 0.834941
[200]	training's binary_logloss: 0.46442	training's auc: 0.854836
[300]	training's binary_logloss: 0.447034	training's auc: 0.867681
[400]	training's binary_logloss: 0.434268	training's auc: 0.876593
[500]	training's binary_logloss: 0.42397	training's auc: 0.883552
[600]	training's binary_logloss: 0.414546	training's auc: 0.889725
[700]	training's binary_logloss: 0.406696	training's auc: 0.894732
[800]	training's binary_logloss: 0.399315	training's auc: 0.899355
[900]	training's binary_logloss: 0.39273	training's auc:

### 2. Save and load the model

In [None]:
# Save the trained model
import pickle
with open("./music_submission/third/Music_LightGBM_0318_03.pkl", 'wb') as f:
    pickle.dump(gbm, f)

In [None]:
# Load the trained model
import pickle
with open("./music_submission/third/Music_LightGBM_0318_03.pkl", 'rb') as f:
    gbm = pickle.load(f)

### 3. Predict the 0%-33% slice of the test set

In [None]:
# del train_all
# gc.collect()
# Build the submission file from the predictions
# val_auc is only used to tag the output file names
val_auc = 0.971121
# val_auc = 0.9753
size_1 = math.ceil(test.shape[0] * 0.33)
test_1 = test[0:size_1]
test_id_1 = test_id[0:size_1]

test_pred_1 = gbm.predict(test_1)
test_sub_1 = pd.DataFrame({'id': test_id_1, 'target': test_pred_1})
test_sub_1.to_csv('./music_submission/third/submission_lgb_%.5f_3_01.csv.gz'%(val_auc), index=False, compression='gzip')


### 4. Predict the 33%-66% slice of the test set

In [None]:
del test_1
del test_id_1
del test_pred_1
gc.collect()

size_2 = math.ceil(test.shape[0] * 0.66)
test_2 = test[size_1:size_2]
test_id_2 = test_id[size_1:size_2]

test_pred_2 = gbm.predict(test_2)
test_sub_2 = pd.DataFrame({'id': test_id_2, 'target': test_pred_2})
test_sub_2.to_csv('./music_submission/third/submission_lgb_%.5f_3_02.csv.gz'%(val_auc), index=False, compression='gzip')

### 5. Predict the 66%-100% slice of the test set

In [None]:
del test_2
del test_id_2
del test_pred_2
gc.collect()

test_3 = test[size_2:]
test_id_3 = test_id[size_2:]

test_pred_3 = gbm.predict(test_3)
test_sub_3 = pd.DataFrame({'id': test_id_3, 'target': test_pred_3})
test_sub_3.to_csv('./music_submission/third/submission_lgb_%.5f_3_03.csv.gz'%(val_auc), index=False, compression='gzip')

### 6. Combine the three prediction files and save the final submission

In [None]:
test_sub = pd.concat([test_sub_1, test_sub_2, test_sub_3], axis=0)
test_sub.to_csv('./music_submission/third/submission_lgb_%.5f_3_123.csv'%(val_auc), index=False)
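
Steps 3 to 6 predict the test set in three consecutive chunks and then concatenate the results. The sketch below shows the same pattern as one loop. `predict_in_chunks` is a hypothetical helper, `fake_predict` is a stand-in for `gbm.predict`, and the data are made up so that the example runs without a trained model.

```python
import math
import numpy as np
import pandas as pd

def predict_in_chunks(predict_fn, features, ids, fractions=(0.33, 0.66)):
    """Predict consecutive row ranges separately and stack the results in order."""
    n = features.shape[0]
    cuts = [0] + [math.ceil(n * f) for f in fractions] + [n]
    parts = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        pred = predict_fn(features.iloc[start:end])
        parts.append(pd.DataFrame({"id": ids.iloc[start:end].to_numpy(), "target": pred}))
    return pd.concat(parts, axis=0, ignore_index=True)

# Made-up data and a stand-in for gbm.predict, for illustration only.
features = pd.DataFrame({"x": np.arange(10, dtype=float)})
ids = pd.Series(np.arange(10))
fake_predict = lambda frame: frame["x"].to_numpy() / 10.0

submission = predict_in_chunks(fake_predict, features, ids)
print(submission.shape)
```

Only one chunk of predictions is computed at a time, which is the point of the manual `del` and `gc.collect()` calls in the cells above.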

### 7. Load the predictions of the four trained models

In [2]:
import sys
sys.path.append('/home/aistudio/package')
import gc
import math
import numpy as np
import pandas as pd

a1 = pd.read_csv('./music_submission/first/submission_lgb_0.95696_1231.csv')
a2 = pd.read_csv('./music_submission/second/submission_lgb_0.97015_2_123.csv')
a3 = pd.read_csv('./music_submission/third/submission_lgb_0.97112_3_123.csv')
a4 = pd.read_csv('./music_submission/four/submission_lgb_0.97530_4_123.csv')

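### 8. Weight the test-set predictions

1. Train on only the last 30% of the training set, predict the test set, and submit to Kaggle: public score 0.70386.
2. Train four models on the 10%-30%, 30%-50%, 50%-70%, and 70%-100% slices of the training set, predict the test set with each,
   average the four predictions, and submit to Kaggle: public score 0.70290.
3. Take a weighted average of the four models' predictions and submit to Kaggle: public score 0.70987.

   10%-30%: weight 0.1

   30%-50%: weight 0.1

   50%-70%: weight 0.2

   70%-100%: weight 0.6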
   

In [3]:
# Weighted blend; assumes the four files list the test ids in the same row order
a1['target'] = a1['target'] * 0.6 + a2['target'] * 0.2 + a3['target'] * 0.1 + a4['target'] * 0.1
a1.to_csv('./music_submission/submission_lgb_1234_weight_01.csv', index=False)
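
The blend above relies on the four prediction files having identical row order. A slightly more defensive sketch, which checks `id` alignment and handles both the simple average (model E) and the weighted average (model F), could look like this. `blend` is a hypothetical helper and the prediction values are made up.

```python
import numpy as np
import pandas as pd

def blend(frames, weights):
    """Weighted average of the 'target' columns after checking that ids line up."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights should sum to 1")
    base_ids = frames[0]["id"].to_numpy()
    for frame in frames[1:]:
        if not np.array_equal(frame["id"].to_numpy(), base_ids):
            raise ValueError("prediction files are not aligned on id")
    targets = np.column_stack([f["target"].to_numpy() for f in frames])
    return pd.DataFrame({"id": base_ids, "target": targets @ weights})

# Made-up predictions standing in for a1..a4.
ids = [0, 1, 2]
preds = [pd.DataFrame({"id": ids, "target": t}) for t in
         ([0.9, 0.2, 0.5], [0.7, 0.4, 0.5], [0.5, 0.6, 0.5], [0.1, 0.8, 0.5])]

model_f = blend(preds, [0.6, 0.2, 0.1, 0.1])      # weighted average
model_e = blend(preds, [0.25, 0.25, 0.25, 0.25])  # simple average
print(model_f)
```

Unlike the cell above, this version leaves the input frames unmodified, so the same loaded predictions can be reused to try other weightings.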