# LightGBM Hyperparameter Tuning

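1. With 227 features, the LightGBM model reached a best validation AUC of 0.75893. Submitting the test-set predictions to Kaggle placed around 88th on the public leaderboard.
2. After adding features (296 dimensions), the best validation AUC rose to 0.76644. The model was then trained with this parameter set and used to predict the test set; the public leaderboard rank on Kaggle improved considerably.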
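
Every comparison in this notebook is made on AUC. As a reminder of what that number measures, here is a minimal pure-Python sketch of the pairwise definition: the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. The function name `pairwise_auc` and the toy data are illustrative only; the notebook itself relies on LightGBM's built-in `auc` metric.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): 3 of the 4 positive/negative pairs are ordered correctly
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

The double loop is quadratic, so this is only meant for intuition on small samples, not for the millions of rows used here.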

## Load the training and validation sets

In [1]:
import sys
sys.path.append('/home/aistudio/package')

import gc
import math
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import roc_auc_score
import lightgbm as lgbm

train_all = pd.read_csv('./music_val/train_val_nn.csv')
test_all = pd.read_csv('./music_val/test_val_nn.csv')
members = pd.read_csv('./music_val/members_val_nn.csv')
songs = pd.read_csv('./music_val/songs_val_nn.csv')

In [2]:
train_all.shape + test_all.shape + members.shape + songs.shape

(5901935, 48, 1475483, 48, 30755, 142, 359966, 102)

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', None)

In [None]:
columns = ['msno_song_length_std', 'msno_artist_song_cnt_std', 'msno_artist_rec_cnt_std',
           'msno_song_rec_cnt_std', 'msno_year_std']
for col in columns:
    # Fill missing standard deviations with the column's smallest observed value
    members[col] = members[col].fillna(members[col].min())

members.to_csv('./music_val/members_val_nn.csv', index=False)

## 1. Take 50% of the training data (about 2.95 million rows) plus the validation set (about 1.48 million rows)

In [3]:
# Use the second half of the training data
train = train_all[math.ceil(train_all.shape[0] * 0.5):]
# The upper bound is computed from train_all and exceeds len(test_all),
# so this slice keeps every validation row
test = test_all[0: math.ceil(train_all.shape[0] * 0.5)]

train_y = train['target']
# A non-inplace drop returns a new frame and avoids SettingWithCopyWarning
train = train.drop(columns=['target'])

test_y = test['target']
test = test.drop(columns=['target'])

train = train.merge(members, on='msno', how='left')
test = test.merge(members, on='msno', how='left')
train = train.merge(songs, on='song_id', how='left')
test = test.merge(songs, on='song_id', how='left')


In [4]:
# How long the user has already been a member
train['time_spent'] = train['timestamp'] - train['registration_init_time']
test['time_spent'] = test['timestamp'] - test['registration_init_time']

# How much membership time the user has left
train['time_left'] = train['expiration_date'] - train['timestamp']
test['time_left'] = test['expiration_date'] - test['timestamp']

# Mean listening time of the user plus/minus one standard deviation: the high-probability window
train['msno_upper_time'] = train['msno_timestamp_mean'] + train['msno_timestamp_std']
test['msno_upper_time'] = test['msno_timestamp_mean'] + test['msno_timestamp_std']

train['msno_lower_time'] = train['msno_timestamp_mean'] - train['msno_timestamp_std']
test['msno_lower_time'] = test['msno_timestamp_mean'] - test['msno_timestamp_std']

# Mean play time of the song plus/minus one standard deviation: the high-probability window
train['song_upper_time'] = train['song_timestamp_mean'] + train['song_timestamp_std']
test['song_upper_time'] = test['song_timestamp_mean'] + test['song_timestamp_std']

train['song_lower_time'] = train['song_timestamp_mean'] - train['song_timestamp_std']
test['song_lower_time'] = test['song_timestamp_mean'] - test['song_timestamp_std']

del members
del songs
gc.collect()
print('All data loaded.')

train_data = lgbm.Dataset(train, label=train_y, free_raw_data=True)
test_data = lgbm.Dataset(test, label=test_y, free_raw_data=True)

All data loaded.


In [5]:
train.shape + test.shape

(2950967, 295, 1475483, 295)

## 2. LightGBM


### 1) max_depth=10, num_leaves=220, lambda_l2=500, 5000 rounds (50% of the data)

In [None]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],  

    'learning_rate': 0.1,       
    'num_leaves': 220,
    'max_depth': 10,
    'min_data_in_leaf': 100,     
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,  
    'lambda_l1': 5,
    'lambda_l2': 500,
    'min_gain_to_split': 0,
    'min_sum_hessian_in_leaf': 0.1,

    'num_threads': 16,
    'verbose': 0,
    'is_training_metric': 'True'
}
print('Hyper-parameters:')
print(params)

MAX_ROUNDS = 5000
evals_result = {}

gbm = lgbm.train(params, train_data, num_boost_round=MAX_ROUNDS, valid_sets=[train_data, test_data], valid_names = ['train', 'valid'],
                 evals_result=evals_result, early_stopping_rounds=1000, verbose_eval=100)

bst_round = np.argmax(evals_result['valid']['auc'])
trn_auc = evals_result['train']['auc'][bst_round]
trn_loss = evals_result['train']['binary_logloss'][bst_round]
val_auc = evals_result['valid']['auc'][bst_round]
val_loss = evals_result['valid']['binary_logloss'][bst_round]

print('Best Round: %d'%bst_round)
print('Training loss: %.5f, Validation loss: %.5f'%(trn_loss, val_loss))
print('Training AUC : %.5f, Validation AUC : %.5f'%(trn_auc, val_auc))


Hyper-parameters:
{'boosting_type': 'gbdt', 'objective': 'binary', 'metric': ['binary_logloss', 'auc'], 'learning_rate': 0.1, 'num_leaves': 220, 'max_depth': 10, 'min_data_in_leaf': 100, 'feature_fraction': 0.8, 'bagging_fraction': 0.9, 'bagging_freq': 1, 'lambda_l1': 5, 'lambda_l2': 500, 'min_gain_to_split': 0, 'min_sum_hessian_in_leaf': 0.1, 'num_threads': 16, 'verbose': 0, 'is_training_metric': 'True'}
Training until validation scores don't improve for 1000 rounds
[100]	train's binary_logloss: 0.529797	train's auc: 0.807887	valid's binary_logloss: 0.56694	valid's auc: 0.75286
[200]	train's binary_logloss: 0.50722	train's auc: 0.828457	valid's binary_logloss: 0.561513	valid's auc: 0.75919
[300]	train's binary_logloss: 0.49235	train's auc: 0.841179	valid's binary_logloss: 0.559564	valid's auc: 0.761839
[400]	train's binary_logloss: 0.480129	train's auc: 0.851143	valid's binary_logloss: 0.558569	valid's auc: 0.763357
[500]	train's binary_logloss: 0.469881	train's auc: 0.859204	valid's 

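With this parameter set, the best validation AUC is 0.76575. From about round 1000 onward, however, the validation AUC starts to fall while the training AUC keeps rising. This points to overfitting, so the parameters need to be adjusted.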
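
The overfitting check described here can be read directly off the `evals_result` dictionary that `lgbm.train` fills in. The sketch below is one way to do it, assuming the same `'train'` / `'valid'` names used in this notebook; the helper name `summarize_overfitting` and the toy curves are illustrative and are not the numbers from the run above.

```python
def summarize_overfitting(evals_result, metric="auc"):
    """Locate the best validation round and measure the train/valid gap there."""
    train_curve = evals_result["train"][metric]
    valid_curve = evals_result["valid"][metric]
    # First index of the maximum, matching np.argmax on ties
    best_idx = max(range(len(valid_curve)), key=valid_curve.__getitem__)
    return {
        "best_round": best_idx + 1,  # 1-based boosting round
        "best_valid": valid_curve[best_idx],
        "gap_at_best": train_curve[best_idx] - valid_curve[best_idx],
        "rounds_after_best": len(valid_curve) - 1 - best_idx,
    }


# Toy curves (made up): training AUC keeps rising, validation AUC peaks at round 3
toy_result = {
    "train": {"auc": [0.80, 0.83, 0.85, 0.87]},
    "valid": {"auc": [0.75, 0.76, 0.765, 0.764]},
}
summary = summarize_overfitting(toy_result)
print(summary)
```

A large `gap_at_best` together with many `rounds_after_best` is the pattern described above: the model keeps fitting the training data after the validation score has stopped improving.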

### 2) Increase the regularization coefficient: 'lambda_l2'=1000 (with 'lambda_l1'=6)

In [None]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],  

    'learning_rate': 0.1,       
    'num_leaves': 220,
    'max_depth': 10,
    'min_data_in_leaf': 100,     
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,  
    'lambda_l1': 6,
    'lambda_l2': 1000,
    'min_gain_to_split': 0,
    'min_sum_hessian_in_leaf': 0.1,

    'num_threads': 16,
    'verbose': 0,
    'is_training_metric': 'True'
}
print('Hyper-parameters:')
print(params)

MAX_ROUNDS = 5000
evals_result = {}

gbm = lgbm.train(params, train_data, num_boost_round=MAX_ROUNDS, valid_sets=[train_data, test_data], valid_names = ['train', 'valid'],
                 evals_result=evals_result, early_stopping_rounds=1000, verbose_eval=100)

bst_round = np.argmax(evals_result['valid']['auc'])
trn_auc = evals_result['train']['auc'][bst_round]
trn_loss = evals_result['train']['binary_logloss'][bst_round]
val_auc = evals_result['valid']['auc'][bst_round]
val_loss = evals_result['valid']['binary_logloss'][bst_round]

print('Best Round: %d'%bst_round)
print('Training loss: %.5f, Validation loss: %.5f'%(trn_loss, val_loss))
print('Training AUC : %.5f, Validation AUC : %.5f'%(trn_auc, val_auc))


Hyper-parameters:
{'boosting_type': 'gbdt', 'objective': 'binary', 'metric': ['binary_logloss', 'auc'], 'learning_rate': 0.1, 'num_leaves': 220, 'max_depth': 10, 'min_data_in_leaf': 100, 'feature_fraction': 0.8, 'bagging_fraction': 0.9, 'bagging_freq': 1, 'lambda_l1': 6, 'lambda_l2': 1000, 'min_gain_to_split': 0, 'min_sum_hessian_in_leaf': 0.1, 'num_threads': 16, 'verbose': 0, 'is_training_metric': 'True'}
Training until validation scores don't improve for 1000 rounds
[100]	train's binary_logloss: 0.535213	train's auc: 0.802409	valid's binary_logloss: 0.568209	valid's auc: 0.751323
[200]	train's binary_logloss: 0.513969	train's auc: 0.822157	valid's binary_logloss: 0.562619	valid's auc: 0.757863
[300]	train's binary_logloss: 0.499304	train's auc: 0.83511	valid's binary_logloss: 0.560137	valid's auc: 0.76088
[400]	train's binary_logloss: 0.488041	train's auc: 0.844527	valid's binary_logloss: 0.558925	valid's auc: 0.76257
[500]	train's binary_logloss: 0.478256	train's auc: 0.852441	valid

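With this parameter set, the best validation AUC is 0.76603. The log shows the validation AUC declining by round 1500, so the model is still overfitting; continue adjusting the parameters.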
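
To see why raising `lambda_l2` slows down overfitting, it helps to look at the regularized leaf value used in second-order gradient boosting: the gradient sum is soft-thresholded by the L1 term and divided by the Hessian sum plus the L2 term. The sketch below assumes LightGBM follows this textbook formula when options such as `max_delta_step` are left at their defaults; the function name `leaf_value` and the leaf statistics are made up for illustration.

```python
def leaf_value(grad_sum, hess_sum, lambda_l1=0.0, lambda_l2=0.0):
    """Regularized leaf output: soft-threshold the gradient sum (L1), then damp it (L2)."""
    magnitude = max(abs(grad_sum) - lambda_l1, 0.0)
    thresholded = magnitude if grad_sum >= 0 else -magnitude
    return -thresholded / (hess_sum + lambda_l2)


# Toy leaf statistics (made up): gradient sum G and Hessian sum H of one leaf
G, H = -50.0, 20.0
values = {lam: leaf_value(G, H, lambda_l1=6, lambda_l2=lam) for lam in (500, 1000, 2000)}
for lam, w in values.items():
    print(f"lambda_l2={lam}: leaf value {w:.4f}")
```

When `lambda_l2` is much larger than the Hessian sum of a leaf, it dominates the denominator, so every tree contributes a smaller step and more rounds are needed before the training set is fitted closely.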

### 3) Increase the regularization coefficient: 'lambda_l2'=2000

In [None]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],  

    'learning_rate': 0.1,       
    'num_leaves': 220,
    'max_depth': 10,
    'min_data_in_leaf': 100,     
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,  
    'lambda_l1': 6,
    'lambda_l2': 2000,
    'min_gain_to_split': 0,
    'min_sum_hessian_in_leaf': 0.1,

    'num_threads': 16,
    'verbose': 0,
    'is_training_metric': 'True'
}
print('Hyper-parameters:')
print(params)

MAX_ROUNDS = 5000
evals_result = {}

gbm = lgbm.train(params, train_data, num_boost_round=MAX_ROUNDS, valid_sets=[train_data, test_data], valid_names = ['train', 'valid'],
                 evals_result=evals_result, early_stopping_rounds=1000, verbose_eval=100)

bst_round = np.argmax(evals_result['valid']['auc'])
trn_auc = evals_result['train']['auc'][bst_round]
trn_loss = evals_result['train']['binary_logloss'][bst_round]
val_auc = evals_result['valid']['auc'][bst_round]
val_loss = evals_result['valid']['binary_logloss'][bst_round]

print('Best Round: %d'%bst_round)
print('Training loss: %.5f, Validation loss: %.5f'%(trn_loss, val_loss))
print('Training AUC : %.5f, Validation AUC : %.5f'%(trn_auc, val_auc))


Hyper-parameters:
{'boosting_type': 'gbdt', 'objective': 'binary', 'metric': ['binary_logloss', 'auc'], 'learning_rate': 0.1, 'num_leaves': 220, 'max_depth': 10, 'min_data_in_leaf': 100, 'feature_fraction': 0.8, 'bagging_fraction': 0.9, 'bagging_freq': 1, 'lambda_l1': 6, 'lambda_l2': 2000, 'min_gain_to_split': 0, 'min_sum_hessian_in_leaf': 0.1, 'num_threads': 16, 'verbose': 0, 'is_training_metric': 'True'}
Training until validation scores don't improve for 1000 rounds
[100]	train's binary_logloss: 0.540276	train's auc: 0.797417	valid's binary_logloss: 0.570143	valid's auc: 0.749337
[200]	train's binary_logloss: 0.520839	train's auc: 0.81565	valid's binary_logloss: 0.563911	valid's auc: 0.756469
[300]	train's binary_logloss: 0.507233	train's auc: 0.827944	valid's binary_logloss: 0.561039	valid's auc: 0.759849
[400]	train's binary_logloss: 0.497044	train's auc: 0.836734	valid's binary_logloss: 0.559362	valid's auc: 0.761883
[500]	train's binary_logloss: 0.488199	train's auc: 0.84411	vali

With this parameter set, the best validation AUC is 0.76625; the model is still overfitting.
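
The three runs so far differ only in `lambda_l1` and `lambda_l2`, yet each cell repeats the whole dictionary. One possible way to make the differences explicit is to keep a single base dictionary and derive each variant from it. This is only a sketch: `BASE_PARAMS` copies the values from run 1 above, and the helper `with_overrides` is a hypothetical name, not part of LightGBM.

```python
BASE_PARAMS = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],
    'learning_rate': 0.1,
    'num_leaves': 220,
    'max_depth': 10,
    'min_data_in_leaf': 100,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,
    'lambda_l1': 5,
    'lambda_l2': 500,
    'min_gain_to_split': 0,
    'min_sum_hessian_in_leaf': 0.1,
    'num_threads': 16,
    'verbose': 0,
}


def with_overrides(base, **overrides):
    """Return a copy of `base` with selected keys replaced; reject misspelled keys."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    params = dict(base)
    params.update(overrides)
    return params


# The (lambda_l1, lambda_l2) pairs tried in runs 1 to 3
grid = [with_overrides(BASE_PARAMS, lambda_l1=l1, lambda_l2=l2)
        for l1, l2 in [(5, 500), (6, 1000), (6, 2000)]]
for params in grid:
    print(params['lambda_l1'], params['lambda_l2'])
```

Each element of `grid` can then be passed to `lgbm.train` exactly as the hand-written dictionaries are, and the base dictionary is never modified.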

### 4) 'min_data_in_leaf': 500

In [None]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],  

    'learning_rate': 0.1,       
    'num_leaves': 220,
    'max_depth': 10,
    'min_data_in_leaf': 500,     
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,  
    'lambda_l1': 6,
    'lambda_l2': 2000,
    'min_gain_to_split': 0,
    'min_sum_hessian_in_leaf': 0.1,

    'num_threads': 16,
    'verbose': 0,
    'is_training_metric': 'True'
}
print('Hyper-parameters:')
print(params)

MAX_ROUNDS = 5000
evals_result = {}

gbm = lgbm.train(params, train_data, num_boost_round=MAX_ROUNDS, valid_sets=[train_data, test_data], valid_names = ['train', 'valid'],
                 evals_result=evals_result, early_stopping_rounds=1000, verbose_eval=100)

bst_round = np.argmax(evals_result['valid']['auc'])
trn_auc = evals_result['train']['auc'][bst_round]
trn_loss = evals_result['train']['binary_logloss'][bst_round]
val_auc = evals_result['valid']['auc'][bst_round]
val_loss = evals_result['valid']['binary_logloss'][bst_round]

print('Best Round: %d'%bst_round)
print('Training loss: %.5f, Validation loss: %.5f'%(trn_loss, val_loss))
print('Training AUC : %.5f, Validation AUC : %.5f'%(trn_auc, val_auc))


Hyper-parameters:
{'boosting_type': 'gbdt', 'objective': 'binary', 'metric': ['binary_logloss', 'auc'], 'learning_rate': 0.1, 'num_leaves': 220, 'max_depth': 10, 'min_data_in_leaf': 500, 'feature_fraction': 0.8, 'bagging_fraction': 0.9, 'bagging_freq': 1, 'lambda_l1': 6, 'lambda_l2': 2000, 'min_gain_to_split': 0, 'min_sum_hessian_in_leaf': 0.1, 'num_threads': 16, 'verbose': 0, 'is_training_metric': 'True'}
Training until validation scores don't improve for 1000 rounds
[100]	train's binary_logloss: 0.53991	train's auc: 0.797843	valid's binary_logloss: 0.569653	valid's auc: 0.749973
[200]	train's binary_logloss: 0.520785	train's auc: 0.815733	valid's binary_logloss: 0.563846	valid's auc: 0.756434
[300]	train's binary_logloss: 0.507621	train's auc: 0.827626	valid's binary_logloss: 0.56109	valid's auc: 0.759722
[400]	train's binary_logloss: 0.496996	train's auc: 0.836853	valid's binary_logloss: 0.559517	valid's auc: 0.761693
[500]	train's binary_logloss: 0.488717	train's auc: 0.843755	vali

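With this parameter set, the best validation AUC is 0.76644. Around round 1500 the training AUC is still rising while the validation AUC starts to fall, so the model is still overfitting slightly and the parameters need further adjustment.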
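
All runs use `early_stopping_rounds=1000`, i.e. training stops once the validation score has not improved for 1000 consecutive rounds. The rule itself is simple and can be simulated on a list of scores. The sketch below tracks a single higher-is-better metric; the notebook monitors both `binary_logloss` and `auc`, and how LightGBM combines several metrics is not modelled here. The function name and the toy scores are illustrative.

```python
def early_stopping_round(valid_scores, patience):
    """Return (best_round, stop_round), both 1-based, for a higher-is-better metric."""
    best_idx = 0
    for idx, score in enumerate(valid_scores):
        if score > valid_scores[best_idx]:
            best_idx = idx
        elif idx - best_idx >= patience:
            # No improvement for `patience` rounds: stop here
            return best_idx + 1, idx + 1
    # Patience never exhausted: training runs to the last round
    return best_idx + 1, len(valid_scores)


# Toy validation AUC curve (made up), with a patience of 2 rounds
toy_scores = [0.75, 0.76, 0.765, 0.764, 0.763, 0.762]
best_round, stop_round = early_stopping_round(toy_scores, patience=2)
print(best_round, stop_round)
```

With a patience as large as 1000 and a learning rate of 0.1, training continues well past the best round, which is why the logs show the validation AUC declining before the run ends.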

In [None]:
import datetime
feat_cnt = train.shape[1]
feature_importance = pd.DataFrame({'name':gbm.feature_name(), 'importance':gbm.feature_importance()}).sort_values(by='importance', ascending=False)
feature_importance.to_csv('./music_val/feat_importance_0316_01.csv', index=False)

res = '%s,%s,%d,%s,%.4f,%d,%d,%d,%.4f,%.4f,%d,%.4e,%.4e,%.4e,%.4e,%.4e,%d,%.5f,%.5f,%.5f,%.5f\n'%(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'), \
        'LightGBM_baseline_song_context_prob', feat_cnt, params['boosting_type'], params['learning_rate'], params['num_leaves'], params['max_depth'], \
        params['min_data_in_leaf'], params['feature_fraction'], params['bagging_fraction'], \
        params['bagging_freq'], params['lambda_l1'], params['lambda_l2'], params['min_gain_to_split'], \
        params['min_sum_hessian_in_leaf'], 0.0, bst_round+1, trn_loss, trn_auc, val_loss, val_auc)
# Append the record; the context manager closes the file even if the write fails
with open('./music_val/lgb_record_0316_01.csv', 'a') as f:
    f.write(res)
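
The record above is assembled with a long positional format string, which is easy to get out of order when a parameter is added. A possible alternative is `csv.DictWriter` from the standard library, which writes fields by name. This is a sketch, not the notebook's actual logging code: the field list is a shortened, hypothetical selection, and the example writes to an in-memory buffer instead of the real CSV file. Fields that are not known here are simply left empty.

```python
import csv
import datetime
import io

# Hypothetical, shortened column list for the experiment log
RECORD_FIELDS = ['timestamp', 'model', 'feat_cnt', 'num_leaves', 'min_data_in_leaf',
                 'lambda_l1', 'lambda_l2', 'best_round', 'trn_auc', 'val_auc']


def append_record(stream, record, write_header=False):
    """Write one experiment record as a CSV row, matching values to columns by name."""
    writer = csv.DictWriter(stream, fieldnames=RECORD_FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow(record)  # missing keys become empty cells


buffer = io.StringIO()  # stands in for open(path, 'a', newline='')
append_record(buffer, {
    'timestamp': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'model': 'LightGBM_baseline_song_context_prob',
    'feat_cnt': 295,
    'num_leaves': 220,
    'min_data_in_leaf': 500,
    'lambda_l1': 6,
    'lambda_l2': 2000,
    'val_auc': 0.76644,
}, write_header=True)
log_lines = buffer.getvalue().splitlines()
print(log_lines[0])
```

Because `DictWriter` raises a `ValueError` for keys that are not in the field list, a misspelled column name is caught immediately rather than silently shifting the columns.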

### 5) 'min_data_in_leaf': 1000

In [None]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],  

    'learning_rate': 0.1,       
    'num_leaves': 220,
    'max_depth': 10,
    'min_data_in_leaf': 1000,     
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,  
    'lambda_l1': 6,
    'lambda_l2': 2000,
    'min_gain_to_split': 0,
    'min_sum_hessian_in_leaf': 0.1,

    'num_threads': 16,
    'verbose': 0,
    'is_training_metric': 'True'
}
print('Hyper-parameters:')
print(params)

MAX_ROUNDS = 5000
evals_result = {}

gbm = lgbm.train(params, train_data, num_boost_round=MAX_ROUNDS, valid_sets=[train_data, test_data], valid_names = ['train', 'valid'],
                 evals_result=evals_result, early_stopping_rounds=1000, verbose_eval=100)

bst_round = np.argmax(evals_result['valid']['auc'])
trn_auc = evals_result['train']['auc'][bst_round]
trn_loss = evals_result['train']['binary_logloss'][bst_round]
val_auc = evals_result['valid']['auc'][bst_round]
val_loss = evals_result['valid']['binary_logloss'][bst_round]

print('Best Round: %d'%bst_round)
print('Training loss: %.5f, Validation loss: %.5f'%(trn_loss, val_loss))
print('Training AUC : %.5f, Validation AUC : %.5f'%(trn_auc, val_auc))


Hyper-parameters:
{'boosting_type': 'gbdt', 'objective': 'binary', 'metric': ['binary_logloss', 'auc'], 'learning_rate': 0.1, 'num_leaves': 220, 'max_depth': 10, 'min_data_in_leaf': 1000, 'feature_fraction': 0.8, 'bagging_fraction': 0.9, 'bagging_freq': 1, 'lambda_l1': 6, 'lambda_l2': 2000, 'min_gain_to_split': 0, 'min_sum_hessian_in_leaf': 0.1, 'num_threads': 16, 'verbose': 0, 'is_training_metric': 'True'}
Training until validation scores don't improve for 1000 rounds
[100]	train's binary_logloss: 0.54105	train's auc: 0.796604	valid's binary_logloss: 0.570169	valid's auc: 0.749288
[200]	train's binary_logloss: 0.522157	train's auc: 0.814367	valid's binary_logloss: 0.564154	valid's auc: 0.756073
[300]	train's binary_logloss: 0.509014	train's auc: 0.826338	valid's binary_logloss: 0.561565	valid's auc: 0.759224
[400]	train's binary_logloss: 0.498841	train's auc: 0.835191	valid's binary_logloss: 0.56004	valid's auc: 0.761141
[500]	train's binary_logloss: 0.490223	train's auc: 0.842442	val

With this parameter set, the best AUC is 0.76573, lower than with min_data_in_leaf set to 500, so the previous parameter set is kept.
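
The five runs reported so far can be put side by side to make the choice explicit. The snippet below is a small sketch that uses only the settings and best validation AUC values quoted in this notebook; the variable names are illustrative.

```python
# (changed parameters, best validation AUC) for runs 1 to 5, as reported above
runs = [
    ({'lambda_l1': 5, 'lambda_l2': 500, 'min_data_in_leaf': 100}, 0.76575),
    ({'lambda_l1': 6, 'lambda_l2': 1000, 'min_data_in_leaf': 100}, 0.76603),
    ({'lambda_l1': 6, 'lambda_l2': 2000, 'min_data_in_leaf': 100}, 0.76625),
    ({'lambda_l1': 6, 'lambda_l2': 2000, 'min_data_in_leaf': 500}, 0.76644),
    ({'lambda_l1': 6, 'lambda_l2': 2000, 'min_data_in_leaf': 1000}, 0.76573),
]

best_params, best_auc = max(runs, key=lambda run: run[1])
spread = best_auc - min(auc for _, auc in runs)

print('Best run:', best_params, 'validation AUC', best_auc)
print('Spread between best and worst run: %.5f' % spread)
```

The spread across all five runs is under 0.001 AUC, so the ranking should be read with some caution: differences of this size could plausibly also come from the randomness of bagging and feature sampling.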

### 6) 'num_leaves': 99, 'min_data_in_leaf': 1306, 'feature_fraction': 0.6866, 'bagging_fraction': 0.9054, 'lambda_l1': 6.37, 'lambda_l2': 65200

In [6]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],  

    'learning_rate': 0.1,       
    'num_leaves': 99,
    'max_depth': 10,
    'min_data_in_leaf': 1306,     
    'feature_fraction': 0.6866,
    'bagging_fraction': 0.9054,
    'bagging_freq': 1,  
    'lambda_l1': 6.37,
    'lambda_l2': 65200,
    'min_gain_to_split': 0,
    'min_sum_hessian_in_leaf': 0.1,

    'num_threads': 16,
    'verbose': 0,
    'is_training_metric': 'True'
}
print('Hyper-parameters:')
print(params)

MAX_ROUNDS = 8000
evals_result = {}

gbm = lgbm.train(params, train_data, num_boost_round=MAX_ROUNDS, valid_sets=[train_data, test_data], valid_names = ['train', 'valid'],
                 evals_result=evals_result, early_stopping_rounds=1000, verbose_eval=100)

bst_round = np.argmax(evals_result['valid']['auc'])
trn_auc = evals_result['train']['auc'][bst_round]
trn_loss = evals_result['train']['binary_logloss'][bst_round]
val_auc = evals_result['valid']['auc'][bst_round]
val_loss = evals_result['valid']['binary_logloss'][bst_round]

print('Best Round: %d'%bst_round)
print('Training loss: %.5f, Validation loss: %.5f'%(trn_loss, val_loss))
print('Training AUC : %.5f, Validation AUC : %.5f'%(trn_auc, val_auc))


Hyper-parameters:
{'boosting_type': 'gbdt', 'objective': 'binary', 'metric': ['binary_logloss', 'auc'], 'learning_rate': 0.1, 'num_leaves': 99, 'max_depth': 10, 'min_data_in_leaf': 1306, 'feature_fraction': 0.6866, 'bagging_fraction': 0.9054, 'bagging_freq': 1, 'lambda_l1': 6.37, 'lambda_l2': 65200, 'min_gain_to_split': 0, 'min_sum_hessian_in_leaf': 0.1, 'num_threads': 16, 'verbose': 0, 'is_training_metric': 'True'}
Training until validation scores don't improve for 1000 rounds
[100]	train's binary_logloss: 0.577551	train's auc: 0.764904	valid's binary_logloss: 0.59098	valid's auc: 0.732371
[200]	train's binary_logloss: 0.562095	train's auc: 0.777491	valid's binary_logloss: 0.580431	valid's auc: 0.741427
[300]	train's binary_logloss: 0.553707	train's auc: 0.784602	valid's binary_logloss: 0.575452	valid's auc: 0.745751
[400]	train's binary_logloss: 0.547639	train's auc: 0.790051	valid's binary_logloss: 0.572351	valid's auc: 0.748634
[500]	train's binary_logloss: 0.542836	train's auc: 0.

### 7) 'num_leaves': 120, 'max_depth': 10, 'min_data_in_leaf': 1000, 'feature_fraction': 0.8, 'bagging_fraction': 0.9, 'lambda_l1': 5, 'lambda_l2': 50000

In [7]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],  

    'learning_rate': 0.1,       
    'num_leaves': 120,
    'max_depth': 10,
    'min_data_in_leaf': 1000,     
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,  
    'lambda_l1': 5,
    'lambda_l2': 50000,
    'min_gain_to_split': 0,
    'min_sum_hessian_in_leaf': 0.1,

    'num_threads': 16,
    'verbose': 0,
    'is_training_metric': 'True'
}
print('Hyper-parameters:')
print(params)

MAX_ROUNDS = 8000
evals_result = {}

gbm = lgbm.train(params, train_data, num_boost_round=MAX_ROUNDS, valid_sets=[train_data, test_data], valid_names = ['train', 'valid'],
                 evals_result=evals_result, early_stopping_rounds=1000, verbose_eval=100)

bst_round = np.argmax(evals_result['valid']['auc'])
trn_auc = evals_result['train']['auc'][bst_round]
trn_loss = evals_result['train']['binary_logloss'][bst_round]
val_auc = evals_result['valid']['auc'][bst_round]
val_loss = evals_result['valid']['binary_logloss'][bst_round]

print('Best Round: %d'%bst_round)
print('Training loss: %.5f, Validation loss: %.5f'%(trn_loss, val_loss))
print('Training AUC : %.5f, Validation AUC : %.5f'%(trn_auc, val_auc))


Hyper-parameters:
{'boosting_type': 'gbdt', 'objective': 'binary', 'metric': ['binary_logloss', 'auc'], 'learning_rate': 0.1, 'num_leaves': 120, 'max_depth': 10, 'min_data_in_leaf': 1000, 'feature_fraction': 0.8, 'bagging_fraction': 0.9, 'bagging_freq': 1, 'lambda_l1': 5, 'lambda_l2': 50000, 'min_gain_to_split': 0, 'min_sum_hessian_in_leaf': 0.1, 'num_threads': 16, 'verbose': 0, 'is_training_metric': 'True'}
Training until validation scores don't improve for 1000 rounds
[100]	train's binary_logloss: 0.573242	train's auc: 0.76826	valid's binary_logloss: 0.587723	valid's auc: 0.734886
[200]	train's binary_logloss: 0.557789	train's auc: 0.780997	valid's binary_logloss: 0.577804	valid's auc: 0.743344
[300]	train's binary_logloss: 0.549218	train's auc: 0.788471	valid's binary_logloss: 0.573079	valid's auc: 0.747668
[400]	train's binary_logloss: 0.542966	train's auc: 0.794256	valid's binary_logloss: 0.570172	valid's auc: 0.750482
[500]	train's binary_logloss: 0.537934	train's auc: 0.799005	v

### Best hyperparameter set from the tuning runs above (validation AUC 0.76644)
 {'boosting_type': 'gbdt', 'objective': 'binary', 'metric': ['binary_logloss', 'auc'], 'learning_rate': 0.1, 'num_leaves': 220, 'max_depth': 10, 'min_data_in_leaf': 500, 'feature_fraction': 0.8, 'bagging_fraction': 0.9, 'bagging_freq': 1, 'lambda_l1': 6, 'lambda_l2': 2000, 'min_gain_to_split': 0, 'min_sum_hessian_in_leaf': 0.1, 'num_threads': 16, 'verbose': 0, 'is_training_metric': 'True'}
