# XGBoost and LightGBM

## XGBoost

XGBoost is a decision-tree-based ensemble machine learning algorithm built on a gradient boosting framework.

- Supports regression, classification, ranking, and user-defined prediction problems
- Uses both L1 and L2 regularization to reduce overfitting
- Handles sparse data (missing values, or the sparsity introduced by one-hot encoding)
- Parallelizes learning for faster training
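To make the "gradient boosting framework" concrete, here is a minimal pure-Python sketch of boosting for regression with squared error, using one-split decision stumps as weak learners. It is an illustration of the general idea only, not XGBoost's actual algorithm (which adds second-order gradients and regularization); the helper names `fit_stump`, `predict_stump`, `boost`, and `mse` and the toy data are assumptions made for this example.

```python
# Minimal gradient boosting for regression with squared error.
# Weak learner: a one-split decision stump on a single feature.

def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def boost(x, y, num_round=20, eta=0.3):
    """Fit num_round stumps, each to the residuals of the current ensemble."""
    base = sum(y) / len(y)              # start from the mean prediction
    preds = [base] * len(y)
    stumps = []
    for _ in range(num_round):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # eta (the learning rate) shrinks each new stump's contribution.
        preds = [pi + eta * predict_stump(stump, xi) for xi, pi in zip(x, preds)]
    return base, stumps, preds


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 6.1]
base, stumps, preds = boost(x, y)
print(mse(y, [base] * len(y)), mse(y, preds))
```

The second printed value (error after boosting) should be well below the first (error of the constant mean prediction), since each round fits the part of the target that the previous rounds missed.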

## XGBoost Application

### Read Data

In [5]:
import xgboost as xgb

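# XGBoost reads data as a DMatrix. A CSV file can be loaded directly by
# giving the format and the label column in the URI:

# dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
# dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')

# Note: both paths below point to agaricus.txt.test, so the train and eval
# metrics in the training log are identical. dtrain would normally load the
# training split instead. Recent XGBoost versions may also require the format
# in the URI (for example, a '?format=libsvm' suffix on these paths).
dtrain = xgb.DMatrix('/Users/Melodie/Downloads/2021Spring/Study/DataWhale/April_Ensembled_Learning/Notes_Ensemble_Learning/Data/agaricus.txt.test')
dtest = xgb.DMatrix('/Users/Melodie/Downloads/2021Spring/Study/DataWhale/April_Ensembled_Learning/Notes_Ensemble_Learning/Data/agaricus.txt.test')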


### Set Model Parameters

In [25]:
# max_depth: maximum depth of each tree
# objective: the prediction task and its loss (others include reg:squarederror, count:poisson, multi:softmax)
# eta: learning rate
# eval_metric: metric reported on the evaluation sets

param = {'max_depth': 2, 'objective': 'binary:logistic', 'eta': 0.1, 'eval_metric': 'logloss'}
num_round = 2  # number of boosting rounds

### Train the Model

In [26]:
# Datasets to evaluate after every boosting round, each with a display name
evallist = [(dtest, 'eval'), (dtrain, 'train')]
bst = xgb.train(param, dtrain, num_round, evals=evallist)

[0]	eval-logloss:0.61417	train-logloss:0.61417
[1]	eval-logloss:0.54939	train-logloss:0.54939


### Prediction

In [29]:
preds = bst.predict(dtest)
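With the `binary:logistic` objective, `bst.predict` returns probabilities rather than class labels. A minimal sketch of turning them into labels and an error rate follows; the `probs` and `labels` values are made up for illustration and stand in for `preds` and the true labels (which, for a real DMatrix, would come from `dtest.get_label()`).

```python
# Hypothetical values standing in for bst.predict(dtest) and the true labels.
probs = [0.91, 0.12, 0.48, 0.77, 0.05]
labels = [1, 0, 1, 1, 0]

# Treat probabilities above 0.5 as the positive class.
predicted = [1 if p > 0.5 else 0 for p in probs]

# Fraction of examples whose predicted class differs from the true label.
error_rate = sum(p != t for p, t in zip(predicted, labels)) / len(labels)
print(predicted, error_rate)
```

The 0.5 cutoff mirrors the one used by XGBoost's `error` metric; a different threshold may suit problems with unbalanced classes.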

## XGBoost Parameters

### General Parameters

booster: which weak learner to use (gbtree, gblinear, or dart; default gbtree)

nthread: number of parallel threads used to run XGBoost (default: the maximum number available)

verbosity: how much to print, from 0 (silent) to 3 (debug); default 1 (warnings)

### Tree Booster

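eta: learning rate; shrinks the contribution of each new tree (default 0.3)

gamma: (alias min_split_loss) minimum loss reduction required to split a leaf further; the larger gamma is, the more conservative the algorithm and the less prone it is to overfitting (default 0)

max_depth: maximum depth of a tree; increasing it makes the model more complex and more likely to overfit (default 6)

lambda: L2 regularization term on the leaf weights; larger values make the model more conservative (default 1)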
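The effect of gamma and lambda can be seen in the split-gain formula from the XGBoost paper, where a candidate split is kept only if its gain is positive. The sketch below is a hand-written illustration of that formula, not a call into the library; the function name `split_gain` and the gradient statistics in `stats` are assumptions chosen for the example.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split from the gradient (g) and Hessian (h) sums of each side."""
    def score(g, h):
        # Larger reg_lambda shrinks every score toward zero.
        return g * g / (h + reg_lambda)

    improvement = (score(g_left, h_left) + score(g_right, h_right)
                   - score(g_left + g_right, h_left + h_right))
    # gamma is a fixed penalty that every split must overcome.
    return 0.5 * improvement - gamma


# Hypothetical gradient statistics for one candidate split.
stats = dict(g_left=-4.0, h_left=3.0, g_right=5.0, h_right=4.0)
for gamma in (0.0, 1.0, 5.0):
    gain = split_gain(**stats, gamma=gamma)
    print(gamma, round(gain, 4), "split" if gain > 0 else "no split")
```

As gamma grows, the same split stops being worthwhile, which is why larger values give smaller, more conservative trees; raising `reg_lambda` lowers the gain in a similar way.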



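### Learning Task Parameters

objective: the learning task and its loss function, for example reg:squarederror, reg:logistic, binary:logistic, count:poisson, or survival:cox (default reg:squarederror)

eval_metric: evaluation metric for validation data, for example rmse, rmsle, mae, mphe, logloss, error, merror, mlogloss, auc, or ndcg (the default depends on the objective)

seed: random number seed, for reproducible results (default 0)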
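As a reference point for the logloss metric used in the training example, here is a small hand-written version of binary logloss. The function and the sample values are illustrative assumptions, not XGBoost's internal implementation.

```python
import math


def logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 1, 0]
uninformative = logloss(labels, [0.5, 0.5, 0.5, 0.5])  # always predict 0.5
confident = logloss(labels, [0.9, 0.1, 0.8, 0.2])      # confident and correct
print(uninformative, confident)
```

Always predicting 0.5 gives a logloss of ln 2, roughly 0.693, so the first-round value of 0.61417 in the training log already improves on an uninformative model.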

## LightGBM

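LightGBM is a gradient boosting framework that also uses tree-based learners. Its documentation lists these advantages:

- Faster training and higher efficiency
- Lower memory usage
- Better accuracy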


References:

- XGBoost Algorithm: Long May She Reign!: https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d
- XGBoost Python Package: https://xgboost.readthedocs.io/en/latest/python/python_intro.html