## LightGBM

A fast, distributed, high-performance gradient boosting (GBDT, GBRT, GBM, or MART) framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks.

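GBDT (Gradient Boosting Decision Tree) is a long-standing model in machine learning. Its core idea is to train weak learners (decision trees) iteratively to arrive at a strong final model. It generally trains well and is relatively resistant to over-fitting. GBDT is widely used in industry, commonly for tasks such as click-through rate prediction and search ranking. It is also a staple of data mining competitions: reportedly, more than half of the winning solutions on Kaggle are based on GBDT.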

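LightGBM has the following advantages:

- Faster training speed
- Lower memory usage
- Better accuracy
- Distributed training support
- Able to process large-scale data quickly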

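Parameter tuning:

1. Use num_leaves.

   LightGBM grows trees leaf-wise, so tree complexity is controlled through num_leaves rather than max_depth.

   Rough conversion: num_leaves = 2^(max_depth)

2. For imbalanced datasets, set param['is_unbalance'] = 'true'.

3. Bagging parameters: bagging_fraction and bagging_freq (both must be set for bagging to take effect), plus feature_fraction.

4. Leaf constraints: min_data_in_leaf and min_sum_hessian_in_leaf, which help against over-fitting.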
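
The four notes above can be collected into a single parameter dictionary, as in the minimal sketch below. The parameter names come from the notes; every numeric value and the `check_bagging` helper are illustrative assumptions, not recommendations from LightGBM.

```python
# Illustrative starting point; all numeric values are assumptions to be tuned.
params = {
    "objective": "binary",
    "num_leaves": 31,                 # note 1: main control for tree complexity
    "is_unbalance": True,             # note 2: imbalanced data
    "bagging_fraction": 0.8,          # note 3: row subsampling ...
    "bagging_freq": 5,                # ... only active when bagging_freq is non-zero
    "feature_fraction": 0.8,          # note 3: column subsampling
    "min_data_in_leaf": 20,           # note 4
    "min_sum_hessian_in_leaf": 1e-3,  # note 4
}


def check_bagging(p):
    """Hypothetical helper: True only if both bagging settings are in effect."""
    fraction = p.get("bagging_fraction", 1.0)
    freq = p.get("bagging_freq", 0)
    return 0.0 < fraction < 1.0 and freq > 0


bagging_on = check_bagging(params)
print("bagging enabled:", bagging_on)
```

A check like this is only a convenience for catching the common mistake of setting bagging_fraction while leaving bagging_freq at 0.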





# Official documentation

Source: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.md

## Core Parameters

- config, default="", type=string, alias=config_file
    - path of config file
    
- task, default=train, type=enum, options=train,prediction
    - train for training
    - prediction for prediction.
    
- application, default=regression, type=enum, options=regression,regression_l1,huber,fair,poisson,binary,lambdarank,multiclass, alias=objective,app
    - regression, regression application
        - regression_l2, L2 loss, alias=mean_squared_error,mse
        - regression_l1, L1 loss, alias=mean_absolute_error,mae
        - huber, Huber loss
        - fair, Fair loss
        - poisson, Poisson regression
    - binary, binary classification application
    - lambdarank, lambdarank application
    - multiclass, multi-class classification application, should set num_class as well
    
- boosting, default=gbdt, type=enum, options=gbdt,dart,goss, alias=boost,boosting_type
    - gbdt, traditional Gradient Boosting Decision Tree
    - dart, Dropouts meet Multiple Additive Regression Trees
    - goss, Gradient-based One-Side Sampling
    
- data, default="", type=string, alias=train,train_data
    - training data, LightGBM will train from this data
    
- valid, default="", type=multi-string, alias=test,valid_data,test_data
    - validation/test data, LightGBM will output metrics for these data
    - supports multiple validation data files, separated by ,
    
- num_iterations, default=100, type=int, alias=num_iteration,num_tree,num_trees,num_round,num_rounds
    - number of boosting iterations
    - note: num_tree here is equal to num_iterations. For multi-class, it actually learns num_class * num_iterations trees.
    - note: in the Python/R packages, this parameter cannot be used to control the number of iterations.
    
- learning_rate, default=0.1, type=double, alias=shrinkage_rate
    - shrinkage rate
    - in dart, it also affects normalization weights of dropped trees
    
- num_leaves, default=31, type=int, alias=num_leaf
    - number of leaves in one tree
    
- tree_learner, default=serial, type=enum, options=serial,feature,data
    - serial, single machine tree learner
    - feature, feature parallel tree learner
    - data, data parallel tree learner
    - Refer to the Parallel Learning Guide for more details.
    
- num_threads, default=OpenMP_default, type=int, alias=num_thread,nthread
    - Number of threads for LightGBM.
    - For the best speed, set this to the number of real CPU cores, not the number of threads (most CPUs use hyper-threading to provide 2 threads per core).
    - Do not set it too large if your dataset is small (for instance, do not use 64 threads for a dataset with 10,000 rows).
    - Be aware that a task manager or similar CPU monitoring tool might report cores as not fully utilized. This is normal.
    - For parallel learning, do not use all CPU cores, since this will cause poor network performance.

- device, default=cpu, options=cpu,gpu
    - Choose the device for tree learning; gpu can achieve faster learning.
    - Note: a smaller max_bin (e.g. 63) is recommended for a better speed-up.
    - Note: for faster speed, the GPU uses 32-bit floating point for summation by default, which may affect accuracy for some tasks. Set gpu_use_dp=true to enable 64-bit floating point, at the cost of slower training.
    - Note: refer to the Installation Guide to build with GPU support.
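
For the command-line version, the core parameters above are typically collected in a config file and passed through config. The sketch below renders a dictionary into `key = value` lines, the layout used by the example configs in the LightGBM repository. The file names `train.txt` and `valid.txt` and the helper `render_config` are hypothetical.

```python
def render_config(params):
    """Render parameters as 'key = value' lines, one per parameter."""
    lines = []
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"      # booleans in lowercase
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)   # multi-value fields use commas
        lines.append(f"{key} = {value}")
    return "\n".join(lines)


core_params = {
    "task": "train",
    "application": "binary",
    "boosting": "gbdt",
    "data": "train.txt",       # hypothetical path
    "valid": ["valid.txt"],    # hypothetical path; several files are allowed
    "num_iterations": 100,
    "learning_rate": 0.1,
    "num_leaves": 31,
    "tree_learner": "serial",
}

config_text = render_config(core_params)
print(config_text)
```

Written to a file, this text could then be supplied as the config parameter when launching training.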


## Learning control parameters

- max_depth, default=-1, type=int
    - Limits the max depth of the tree model. This is used to deal with over-fitting when the amount of data is small. Trees still grow leaf-wise.
    - < 0 means no limit
    
- min_data_in_leaf, default=20, type=int, alias=min_data_per_leaf,min_data
    - Minimal number of data points in one leaf. Can be used to deal with over-fitting.
    
- min_sum_hessian_in_leaf, default=1e-3, type=double, alias=min_sum_hessian_per_leaf,min_sum_hessian,min_hessian
    - Minimal sum of hessians in one leaf. Like min_data_in_leaf, can be used to deal with over-fitting.

- feature_fraction, default=1.0, type=double, 0.0 < feature_fraction <= 1.0, alias=sub_feature
    - LightGBM will randomly select a subset of features on each iteration if feature_fraction is smaller than 1.0. For example, if set to 0.8, it will select 80% of the features before training each tree.
    - Can be used to speed up training
    - Can be used to deal with over-fitting

- feature_fraction_seed, default=2, type=int
    - Random seed for feature fraction.

- bagging_fraction, default=1.0, type=double, 0.0 < bagging_fraction <= 1.0, alias=sub_row
    - Like feature_fraction, but this randomly selects a subset of the data
    - Can be used to speed up training
    - Can be used to deal with over-fitting
    - Note: To enable bagging, bagging_freq should be set to a non-zero value as well

- bagging_freq, default=0, type=int
    - Frequency for bagging. 0 disables bagging; k means bagging is performed at every k-th iteration.
    - Note: To enable bagging, bagging_fraction should be set to a value smaller than 1.0 as well

- bagging_seed, default=3, type=int
    - Random seed for bagging.

- early_stopping_round, default=0, type=int, alias=early_stopping_rounds,early_stopping
    - Training stops if one metric on one validation dataset does not improve in the last early_stopping_round rounds.

- lambda_l1, default=0, type=double
    - L1 regularization

- lambda_l2, default=0, type=double
    - L2 regularization

- min_gain_to_split, default=0, type=double
    - The minimal gain required to perform a split

- drop_rate, default=0.1, type=double
    - only used in dart

- skip_drop, default=0.5, type=double
    - only used in dart, probability of skipping drop

- max_drop, default=50, type=int
    - only used in dart, max number of dropped trees in one iteration. <=0 means no limit.

- uniform_drop, default=false, type=bool
    - only used in dart, set to true to use uniform drop

- xgboost_dart_mode, default=false, type=bool
    - only used in dart, set to true to use xgboost dart mode

- drop_seed, default=4, type=int
    - only used in dart, random seed for choosing which models to drop.

- top_rate, default=0.2, type=double
    - only used in goss, the retention ratio of large-gradient data

- other_rate, default=0.1, type=double
    - only used in goss, the retention ratio of small-gradient data
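
The early_stopping_round rule above can be illustrated with a small pure-Python sketch. It is not LightGBM's implementation: the function name `early_stop_iteration`, the `higher_is_better` flag, and the sample loss values are assumptions made for illustration, and a value of 0 is treated here as "disabled".

```python
def early_stop_iteration(scores, early_stopping_round, higher_is_better=False):
    """Return the 0-based iteration at which training would stop, or None.

    scores holds one validation metric value per boosting iteration.
    """
    if early_stopping_round <= 0:
        return None  # treated as disabled in this sketch
    best_score = None
    best_iter = -1
    for i, score in enumerate(scores):
        if best_score is None:
            improved = True
        elif higher_is_better:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_score, best_iter = score, i
        elif i - best_iter >= early_stopping_round:
            return i  # no improvement in the last early_stopping_round rounds
    return None


# Hypothetical validation losses (lower is better).
losses = [0.70, 0.65, 0.61, 0.62, 0.63, 0.64, 0.60]
stop_at = early_stop_iteration(losses, early_stopping_round=3)
print("stop at iteration:", stop_at)
```

In this example the best loss occurs at iteration 2, and the three following iterations fail to beat it, so the sketch stops at iteration 5 and never sees the later improvement. This is the usual trade-off when choosing a small early_stopping_round.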


# Parameters Tuning

## Convert parameters from XGBoost

LightGBM uses a leaf-wise tree growth algorithm, whereas other popular tools, e.g. XGBoost, use depth-wise tree growth. LightGBM therefore uses num_leaves to control the complexity of the tree model, while other tools usually use max_depth. The correspondence between leaves and depths is num_leaves = 2^(max_depth).
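
The conversion can be tabulated with a few lines of Python. This is a minimal sketch; the helper name `max_leaves_for_depth` is hypothetical. Since a binary tree of depth d has at most 2^d leaves, the result is best read as an upper bound when porting a max_depth setting, not as an exact equivalence.

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on the number of leaves in a binary tree of this depth."""
    if max_depth < 0:
        raise ValueError("max_depth must be non-negative for this conversion")
    return 2 ** max_depth


# Print the correspondence for a few common depths.
for depth in (1, 2, 3, 7, 10):
    print(f"max_depth={depth:>2} -> num_leaves={max_leaves_for_depth(depth)}")
```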

## For faster speed

- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use a small max_bin
- Use save_binary to speed up data loading in future runs
- Use parallel learning; refer to the Parallel Learning Guide.

## For better accuracy



- Use a large max_bin (may be slower)
- Use a small learning_rate with a large num_iterations
- Use a large num_leaves (may cause over-fitting)
- Use more training data
- Try dart

## Deal with over-fitting



- Use a small max_bin
- Use a small num_leaves
- Use min_data_in_leaf and min_sum_hessian_in_leaf
- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use more training data
- Try lambda_l1, lambda_l2, and min_gain_to_split for regularization
- Try max_depth to avoid growing deep trees
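
One way to apply the list above is to keep a baseline parameter set and layer the over-fitting controls on top of it. The sketch below does this with plain dictionaries. The parameter names come from this document, while every numeric value and the helper `apply_overrides` are illustrative assumptions that would need tuning on real data.

```python
base_params = {
    "application": "binary",
    "num_leaves": 31,
    "learning_rate": 0.1,
}

# One entry per bullet in the list above; values are illustrative only.
overfit_guards = {
    "max_bin": 63,
    "num_leaves": 15,
    "min_data_in_leaf": 50,
    "min_sum_hessian_in_leaf": 1e-2,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    "feature_fraction": 0.8,
    "lambda_l1": 0.1,
    "lambda_l2": 0.1,
    "min_gain_to_split": 0.01,
    "max_depth": 6,
}


def apply_overrides(base, overrides):
    """Return a new dict; the base parameters are left untouched."""
    merged = dict(base)
    merged.update(overrides)
    return merged


params = apply_overrides(base_params, overfit_guards)
print(params)
```

Keeping the baseline untouched makes it easy to compare validation metrics with and without each control, instead of enabling all of them at once.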