# XGBoost and LightGBM

## XGBoost

XGBoost is a decision-tree-based ensemble machine learning algorithm built on a gradient boosting framework.

- Can be used for regression, classification, ranking, and user-defined prediction problems
- Uses both L1 and L2 regularization to prevent overfitting
- Can handle sparse data (missing values, or the sparsity caused by one-hot encoding)
- Supports parallel learning, which shortens computing time
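The L1 and L2 regularization listed above enters through the training objective. As a sketch in the notation commonly used in the XGBoost documentation (the symbols $T$, $w_j$, $\gamma$, $\lambda$, and $\alpha$ are introduced here and are not defined elsewhere in these notes), each tree $f_k$ with $T$ leaves and leaf weights $w_j$ adds a penalty to the loss $l$:

$$
\mathrm{Obj} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
$$

Here $\lambda$ weights the L2 term, $\alpha$ weights the L1 term, and $\gamma$ charges a fixed cost per leaf.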

## XGBoost application

### Read Data

In [5]:
import xgboost as xgb

# XGBoost reads its input as a DMatrix. A CSV file can be loaded into one with a URI such as:

# dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
# dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')

# Note: both paths below point to the same file (agaricus.txt.test),
# so the train and eval metrics reported during training will be identical.
dtrain = xgb.DMatrix('/Users/Melodie/Downloads/2021Spring/Study/DataWhale/April_Ensembled_Learning/Notes_Ensemble_Learning/Data/agaricus.txt.test')
dtest = xgb.DMatrix('/Users/Melodie/Downloads/2021Spring/Study/DataWhale/April_Ensembled_Learning/Notes_Ensemble_Learning/Data/agaricus.txt.test')


### Set Model Parameters

In [25]:
# max_depth: maximum depth of each tree
# objective: the learning task to solve (other options include reg:squarederror, count:poisson, multi:softmax)
# eta: learning rate
# eval_metric: metric reported on the evaluation sets

param = {'max_depth': 2, 'objective': 'binary:logistic', 'eta': 0.1, 'eval_metric': 'logloss'}
num_round = 2  # number of boosting rounds

### Train the model

In [26]:
# Datasets to evaluate after each boosting round, with the names used in the log
evallist = [(dtest, 'eval'), (dtrain, 'train')]
bst = xgb.train(param, dtrain, num_round, evals=evallist)

[0]	eval-logloss:0.61417	train-logloss:0.61417
[1]	eval-logloss:0.54939	train-logloss:0.54939
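To read the log above, it helps to see what `logloss` measures. The following is a minimal standard-library sketch of mean binary log loss, not XGBoost's own implementation; the function name `log_loss` and the sample values are assumptions made for illustration.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy of predicted probabilities against 0/1 labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# A model that always predicts 0.5 scores ln(2), about 0.693 (illustrative labels)
baseline = log_loss([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(round(baseline, 5))
```

Against that uninformed baseline of roughly 0.693, the logged values of 0.61417 and 0.54939 show the loss falling with each boosting round.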


### Prediction

In [29]:
preds = bst.predict(dtest)  # with binary:logistic, these are predicted probabilities
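With the `binary:logistic` objective, the predictions are probabilities rather than class labels. A minimal sketch of turning them into labels and an error rate follows; the helper names `to_labels` and `error_rate`, the 0.5 threshold, and the sample numbers are assumptions for illustration, not values produced by the model above.

```python
def to_labels(probs, threshold=0.5):
    """Map probabilities to 0/1 labels; values above the threshold count as positive."""
    return [1 if p > threshold else 0 for p in probs]

def error_rate(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    predicted = to_labels(probs, threshold)
    wrong = sum(1 for y, y_hat in zip(labels, predicted) if y != y_hat)
    return wrong / len(labels)

example_probs = [0.12, 0.87, 0.55, 0.40]  # made-up probabilities
example_labels = [0, 1, 0, 0]             # made-up ground truth
print(to_labels(example_probs), error_rate(example_labels, example_probs))
```

On real data, the same functions could be applied to `preds` and the labels of the test set.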

## XGBoost Parameters

### General Parameters

- booster: which weak learner (booster) to train with (default gbtree)
- nthread: number of parallel threads used to run XGBoost (defaults to the maximum available)
- verbosity: verbosity of printed messages (0 silent, 1 warning, 2 info, 3 debug)

### Tree Booster

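- eta: learning rate, the step size shrinkage applied to each new tree (default 0.3)
- gamma (alias min_split_loss): minimum loss reduction required to make a further split; the larger gamma is, the more conservative the algorithm and the less likely it is to overfit (default 0)
- max_depth: maximum depth of a tree; increasing it makes the model more complex and more likely to overfit (default 6)
- lambda: L2 regularization term on leaf weights (default 1)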
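How `gamma`, `lambda`, and `eta` interact can be sketched with the split-gain and leaf-weight formulas from the standard XGBoost derivation. This is a simplified illustration, not the library's code: the function names and the gradient and Hessian sums are made up for the example.

```python
def leaf_weight(grad_sum, hess_sum, lam=1.0):
    """Optimal leaf weight for summed gradients G and Hessians H: -G / (H + lambda)."""
    return -grad_sum / (hess_sum + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction from a split; the split is only worthwhile if the result is positive."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Hypothetical gradient/Hessian sums for the two children of a candidate split
gain_plain = split_gain(-4.0, 2.0, 3.0, 2.0, lam=1.0, gamma=0.0)
gain_penalized = split_gain(-4.0, 2.0, 3.0, 2.0, lam=1.0, gamma=5.0)

eta = 0.1
update = eta * leaf_weight(-4.0, 2.0, lam=1.0)  # shrunken contribution of the left leaf
print(gain_plain, gain_penalized, update)
```

With `gamma=0` the split has positive gain, while `gamma=5` makes the gain negative, so the split would be rejected. A larger `lambda` shrinks both the gain and the leaf weights, and `eta` scales down how much each tree changes the prediction.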



### Learning Task Parameters

- objective: reg:squarederror, reg:logistic, binary:logistic, count:poisson, survival:cox, among others
- eval_metric: rmse, rmsle, mae, mphe, logloss, error, merror, mlogloss, auc, ndcg, among others
- seed: random number seed
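An objective such as `binary:logistic` is used by the booster through its per-sample gradient and Hessian with respect to the raw score, and a user-defined objective has to supply the same two quantities. The sketch below computes them for the logistic loss with the standard library only; the function names are assumptions, and wiring the result into XGBoost is left out.

```python
import math

def sigmoid(x):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_grad_hess(raw_scores, labels):
    """Gradient (p - y) and Hessian p * (1 - p) of the log loss for each sample."""
    grads, hessians = [], []
    for score, y in zip(raw_scores, labels):
        p = sigmoid(score)
        grads.append(p - y)
        hessians.append(p * (1.0 - p))
    return grads, hessians

# A raw score of 0 means p = 0.5; for a positive label the gradient is -0.5
grads, hessians = logistic_grad_hess([0.0, 2.0], [1, 0])
print(grads, hessians)
```

A negative gradient pushes the raw score up, which is what a positive sample predicted at 0.5 needs.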

## LightGBM

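LightGBM is a gradient boosting framework built on tree-based learners. Its documentation lists these advantages:

- Faster training and higher efficiency
- Lower memory usage
- Better accuracy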


## LightGBM application

### Load library and read data

In [1]:
import lightgbm as lgb
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd
# Load the breast cancer dataset and hold out 20% of it for testing
cancer_data = load_breast_cancer()
X = cancer_data.data
y = cancer_data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

### Transform the data for lgb model

In [2]:
# free_raw_data=False keeps the raw data in memory after the Dataset is constructed
lgb_train = lgb.Dataset(X_train, y_train, params={'verbose': -1}, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, params={'verbose': -1}, free_raw_data=False)

### Select parameters

In [3]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'nthread': 4,
    'learning_rate': 0.1
}
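The `'metric': 'auc'` entry means models are compared by area under the ROC curve. As a rough standard-library sketch (not LightGBM's implementation), AUC can be computed as the fraction of positive/negative pairs in which the positive sample receives the higher score; the function name `auc` and the sample data are assumptions for illustration.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

This pairwise form is quadratic in the number of samples, so it is only meant to show what the metric measures.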

### Cross-Validation Parameter Tuning
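The grid search in this section varies `num_leaves` and `max_depth` together. A binary tree of depth `d` has at most `2 ** d` leaves, so under the assumption that `max_depth` is enforced as a hard limit, a `num_leaves` value above that cap cannot yield a larger tree. The sketch below counts how many grid points are effectively distinct; the helper `effective_leaves` is hypothetical.

```python
from itertools import product

num_leaves_grid = range(5, 100, 5)  # same ranges as the search in this section
max_depth_grid = range(3, 8)

def effective_leaves(num_leaves, max_depth):
    """Leaf count after applying the 2 ** max_depth cap implied by the depth limit."""
    return min(num_leaves, 2 ** max_depth)

all_pairs = list(product(num_leaves_grid, max_depth_grid))
distinct_pairs = {(effective_leaves(n, d), d) for n, d in all_pairs}
print(len(all_pairs), len(distinct_pairs))
```

Under this assumption, fewer than half of the grid points describe different tree shapes, which suggests the grid could be pruned before running cross-validation.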

In [4]:
max_auc = 0.0
best_params = {}

# Stage 1: tune num_leaves and max_depth, which control model capacity
for num_leaves in range(5,100,5):
    for max_depth in range(3,8,1):
        params['num_leaves'] = num_leaves
        params['max_depth'] = max_depth
        
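        # Note: early_stopping_rounds and verbose_eval are accepted by LightGBM 3.x;
        # LightGBM 4.x moved them to callbacks (lgb.early_stopping, lgb.log_evaluation).
        cv_results = lgb.cv(
            params,
            lgb_train,
            seed=1,
            nfold=5,
            metrics=['auc'],
            early_stopping_rounds=10,
            verbose_eval=False
        )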
        
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
        
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['num_leaves'] = num_leaves
            best_params['max_depth'] = max_depth
            
if 'num_leaves' in best_params and 'max_depth' in best_params:
    params['num_leaves'] = best_params['num_leaves']
    params['max_depth'] = best_params['max_depth']
    
    
# Stage 2: tune max_bin and min_data_in_leaf, which help control overfitting
for max_bin in range(5,256,10):
    for min_data_in_leaf in range(1,102,10):
        params['max_bin'] = max_bin
        params['min_data_in_leaf'] = min_data_in_leaf
    
        cv_results = lgb.cv(
            params,
            lgb_train,
            seed=1,
            nfold=5,
            metrics=['auc'],
            early_stopping_rounds=10,
            verbose_eval=False
        )

        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['max_bin']= max_bin
            best_params['min_data_in_leaf'] = min_data_in_leaf

if 'max_bin' in best_params and 'min_data_in_leaf' in best_params:
    params['min_data_in_leaf'] = best_params['min_data_in_leaf']
    params['max_bin'] = best_params['max_bin']

[LightGBM] [Info] Number of positive: 232, number of negative: 132
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4548
[LightGBM] [Info] Number of data points in the train set: 364, number of used features: 30
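The figure of 364 data points in the log follows from the split sizes. The sketch below reproduces the arithmetic, assuming the 569-sample breast cancer dataset, a test portion that is rounded up, and folds that are as equal as possible; the helper names are hypothetical.

```python
import math

def holdout_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

def cv_training_sizes(n_train, nfold):
    """Training-set size seen in each fold when folds are as equal as possible."""
    base, extra = divmod(n_train, nfold)
    fold_sizes = [base + 1 if i < extra else base for i in range(nfold)]
    return [n_train - size for size in fold_sizes]

n_train, n_test = holdout_sizes(569, 0.2)       # assumed dataset size of 569 samples
per_fold = cv_training_sizes(n_train, nfold=5)  # nfold=5 as in the lgb.cv calls
print(n_train, n_test, per_fold)
```

The 455 training samples are split into five folds of 91, so each cross-validation model trains on 364 samples, which matches the 232 positives plus 132 negatives reported in the log.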


References:

- XGBoost Algorithm: Long May She Reign!: https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d
- XGBoost Python Package: https://xgboost.readthedocs.io/en/latest/python/python_intro.html