# Glossary of terms
### 1. basic


- `num_iterations`: number of boosting iterations (aliases: `n_estimators`, `num_boost_round`)
- `learning_rate`: shrinkage rate (alias: `eta`)
- `nthread`: number of threads (alias: `n_jobs`)
- `feature_fraction`: a value of 0.8 means LightGBM randomly selects 80% of the features in each iteration for building trees.
- `device_type`: `cpu` or `gpu`
- `early_stopping_round`: training stops if one metric on one validation set has not improved in the last `early_stopping_round` rounds. This cuts down on excessive iterations.
- `max_cat_group`: when a categorical feature has many categories, finding the split point on it over-fits easily, so LightGBM merges the categories into at most `max_cat_group` groups.
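
The `early_stopping_round` rule can be sketched in plain Python. This is a simplified illustration rather than LightGBM's own implementation: the function name `stopping_iteration` and the list of scores are hypothetical, and it assumes a single metric where lower is better.

```python
def stopping_iteration(scores, early_stopping_round):
    """Return (stop_index, best_index) for a lower-is-better metric.

    Simplified sketch: stop once the validation score has gone
    `early_stopping_round` rounds without improving on the best so far.
    """
    best_score = float("inf")
    best_index = -1
    for i, score in enumerate(scores):
        if score < best_score:
            best_score, best_index = score, i
        elif i - best_index >= early_stopping_round:
            return i, best_index  # patience exhausted, stop here
    return len(scores) - 1, best_index  # never triggered, ran every round


# Hypothetical validation log-loss per boosting round
val_loss = [0.69, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
stop_at, best_at = stopping_iteration(val_loss, early_stopping_round=3)
print(stop_at, best_at)
```

Note that the last score in this made-up log (0.50) is never reached: training halts three rounds after the best score at index 2, which is the trade-off of a short patience window.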


### 2. core


- `num_leaves`: controls the complexity of the tree model. The value should be less than or equal to $2^{\text{max\_depth}}$.
- `min_data_in_leaf`: set it to a large value to avoid growing too deep a tree, although this may cause under-fitting. Setting it to hundreds or thousands is enough.
- `max_depth`: limits the depth of each tree; a value of `-1` means no limit.
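
The bound on `num_leaves` is plain arithmetic: a binary tree limited to `max_depth` levels has at most $2^{\text{max\_depth}}$ leaves. A minimal sketch, where the helper name `max_leaves_for_depth` is hypothetical and not part of LightGBM:

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on the leaf count of a binary tree with `max_depth` levels."""
    if max_depth <= 0:
        raise ValueError("expects a positive depth limit")
    return 2 ** max_depth


# Print the bound for a few depth limits
for depth in (3, 6, 8):
    print(depth, max_leaves_for_depth(depth))
```

With a depth limit of 6, for example, any `num_leaves` up to 64 stays within the bound, while a depth limit of 3 allows at most 8 leaves.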


### 3. Faster speed


- Use bagging by setting `bagging_fraction` and `bagging_freq`
- Use feature sub-sampling by setting `feature_fraction`
- Use small `max_bin`
- Use `save_binary` to speed up data loading in future learning
- Use parallel learning


### 4. Better accuracy


- Use large `max_bin`
- Use a small `learning_rate` with a large `num_iterations`
- Use a large `num_leaves` (risks over-fitting)
- Try `dart`
- Use categorical features directly


### 5. Deal with over-fitting

- Use small `max_bin`
- Use small `num_leaves`
- Use a large `min_data_in_leaf` and `min_sum_hessian_in_leaf`
- Use bagging by setting `bagging_fraction` and `bagging_freq`
- Try `lambda_l1`, `lambda_l2` and `min_gain_to_split` for regularization

(https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc)
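
As a rough illustration, the three tip lists above can be written as parameter dictionaries derived from one baseline. Every numeric value below is an assumption chosen only to show the direction of each change (smaller or larger than the baseline). They are not tuned recommendations and do not come from the source.

```python
# Shared baseline; all numeric values here are illustrative assumptions.
base = {
    "objective": "binary",
    "learning_rate": 0.1,
    "num_iterations": 100,
    "num_leaves": 63,
    "max_bin": 255,
}

# Faster speed: row and feature sub-sampling, fewer histogram bins.
faster = {**base, "bagging_fraction": 0.8, "bagging_freq": 5,
          "feature_fraction": 0.8, "max_bin": 63}

# Better accuracy: more bins and leaves, smaller steps over more rounds.
more_accurate = {**base, "max_bin": 511, "num_leaves": 127,
                 "learning_rate": 0.01, "num_iterations": 1000}

# Deal with over-fitting: simpler trees, larger leaves, regularization.
less_overfit = {**base, "max_bin": 63, "num_leaves": 15,
                "min_data_in_leaf": 200, "min_sum_hessian_in_leaf": 10.0,
                "bagging_fraction": 0.8, "bagging_freq": 5,
                "lambda_l1": 0.1, "lambda_l2": 0.1, "min_gain_to_split": 0.01}

# Show only the settings each preset changes relative to the baseline
for name, preset in [("faster", faster), ("more_accurate", more_accurate),
                     ("less_overfit", less_overfit)]:
    changed = {k: v for k, v in preset.items() if base.get(k) != v}
    print(name, changed)
```

Several knobs pull in opposite directions (`max_bin` and `num_leaves` in particular), so a preset tuned for one goal usually costs something on another.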

# code snippet

In [None]:
import pandas as pd
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import matplotlib.pyplot as plt

# sklearn tools for model training and assessment
from sklearn.model_selection import train_test_split
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.metrics import (roc_curve, auc, accuracy_score)

df_train = pd.read_csv("binary.train", header=None, sep='\t')
df_test = pd.read_csv("binary.test", header=None, sep='\t')

y_train = df_train[0].values
y_test = df_test[0].values
X_train = df_train.drop(0, axis=1).values
X_test = df_test.drop(0, axis=1).values


# set Dataset
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)


# specify your configurations as a dict
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss', 'auc'},
    'metric_freq': 1,
    'is_training_metric': True,
    'max_bin': 255,
    'learning_rate': 0.1,
    'num_leaves': 63,
    'tree_learner': 'serial',
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_data_in_leaf': 50,
    'min_sum_hessian_in_leaf': 5,
    'is_enable_sparse': True,
    'use_two_round_loading': False,
    'is_save_binary_file': False,
    'output_model': 'LightGBM_model.txt',
    'num_machines': 1,
    'local_listen_port': 12400,
    'machine_list_file': 'mlist.txt',
    'verbose': 0,
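    # sklearn-style alias names for bin_construct_sample_cnt, min_gain_to_split,
    # lambda_l1 and lambda_l2; aliases of min_data_in_leaf,
    # min_sum_hessian_in_leaf and feature_fraction are not repeated here,
    # since those parameters are already set above
    'subsample_for_bin': 200000,
    'min_split_gain': 0.0,
    'reg_alpha': 0.0,
    'reg_lambda': 0.0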
}

# train
# valid_sets takes a list of validation Datasets
gbm = lgb.train(params,
                lgb_train,
                valid_sets=[lgb_eval])

mdl = lgb.LGBMClassifier(
    task = params['task'],
    metric = params['metric'],
    metric_freq = params['metric_freq'],
    is_training_metric = params['is_training_metric'],
    max_bin = params['max_bin'],
    tree_learner = params['tree_learner'],
    feature_fraction = params['feature_fraction'],
    bagging_fraction = params['bagging_fraction'],
    bagging_freq = params['bagging_freq'],
    min_data_in_leaf = params['min_data_in_leaf'],
    min_sum_hessian_in_leaf = params['min_sum_hessian_in_leaf'],
    is_enable_sparse = params['is_enable_sparse'],
    use_two_round_loading = params['use_two_round_loading'],
    is_save_binary_file = params['is_save_binary_file'],
    n_jobs = -1
)

scoring = {'AUC': 'roc_auc'}

In [None]:
# grid_param
gridParams = {
    'learning_rate': [0.01, 0.05, 0.1, 0.5],
    'max_depth': [3, 6],
    'num_leaves': [8, 32],
    'boosting_type' : ['gbdt', 'dart'],
}

# Create the grid
grid = GridSearchCV(mdl, gridParams, verbose=2, cv=5, scoring=scoring, n_jobs=-1, refit='AUC')
# Run the grid
grid.fit(X_train, y_train)

print('Best parameters found by grid search are:', grid.best_params_)
print('Best score found by grid search is:', grid.best_score_)
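
Before launching a grid search it helps to know how many models it will train. The sketch below counts the fits for a grid of the same shape as the one above, using only the standard library. The variable names are illustrative, and the count leaves out the single extra refit on the full training set.

```python
from itertools import product

# Same shape as the search grid: 4 x 2 x 2 x 2 candidate values
grid_params = {
    "learning_rate": [0.01, 0.05, 0.1, 0.5],
    "max_depth": [3, 6],
    "num_leaves": [8, 32],
    "boosting_type": ["gbdt", "dart"],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate model
combinations = list(product(*grid_params.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds  # cross-validation fits, excluding the refit
print(n_candidates, n_fits)
```

Each value added to any list multiplies the total, so the cost grows quickly as the grid widens.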

# LightGBM usages

In [None]:
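import lightgbm

# Assumes X_train/X_test hold the features and y_train/y_test hold
# integer class labels in the range [0, num_class)
dtrain = lightgbm.Dataset(X_train, label=y_train)
# reference=dtrain makes the validation set reuse the training bin mapping
dtest = lightgbm.Dataset(X_test, label=y_test, reference=dtrain)

num_boost_round = 1000
learning_rate = 0.02

params = {'objective': 'multiclass',
          'boosting_type': 'gbdt',
          'max_depth': -1,  # no depth limit
          'nthread': 4,
          'metric': 'multi_logloss',
          'num_class': 38,
          'learning_rate': learning_rate,
          }

# Early stopping is passed as a callback; recent LightGBM releases no longer
# accept the early_stopping_rounds keyword in train()
lightgbm_model = lightgbm.train(params=params,
                                train_set=dtrain,
                                valid_sets=[dtrain, dtest],
                                num_boost_round=num_boost_round,
                                callbacks=[lightgbm.early_stopping(stopping_rounds=10)])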