### XGBoost

An advanced implementation of gradient boosting machines (GBM).
https://xgboost.readthedocs.io/en/latest/tutorials/model.html

Objective Function: 

1. Mean Squared Error (Regression)
2. Logistic Loss (Classification)
3. Cross-entropy (Multi-class classification)
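
The three losses above can be written out per example as a minimal sketch. The function names are made up for illustration, and XGBoost's full objective also adds a regularization term that is left out here.

```python
import math

def squared_error(y_true, y_pred):
    """Squared error for one regression example."""
    return (y_true - y_pred) ** 2

def logistic_loss(y_true, p):
    """Binary log loss for one example; y_true is 0 or 1, p is P(y = 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def cross_entropy(true_class, probs):
    """Multi-class cross-entropy for one example; probs holds one probability per class."""
    return -math.log(probs[true_class])

print(squared_error(3.0, 2.5))            # 0.25
print(logistic_loss(1, 0.9))              # small loss: confident and correct
print(cross_entropy(2, [0.1, 0.2, 0.7]))  # loss depends only on the true class's probability
```

With two classes, cross-entropy reduces to the logistic loss, which is why the binary and multi-class objectives behave alike.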

**XGBoost is basically boosted trees, with the following additions:**
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
1. Regularization
    - Standard GBM implementations have no regularization, so this helps XGBoost reduce overfitting
2. Takes in different loss functions and evaluation criteria
3. Parallel tree building
4. Built-in handling of missing data
5. Tree Pruning     
    - A GBM stops splitting a node as soon as a split gives a negative gain (no loss reduction), which makes it a **greedy algorithm**.
    - XGBoost instead makes **splits up to the specified `max_depth`**, then **prunes** the tree backwards, removing splits that yield no positive gain.
    - This matters because a split with a negative gain, say -2, may be followed by a split with a positive gain of +10. GBM stops at the -2 split, while XGBoost goes deeper, sees the combined gain of +8, and keeps both.
6. Built-in CV
    - XGBoost can run cross-validation at each iteration of the boosting process, so it is easy to get the optimum number of boosting iterations in a single run
    - With GBM, we have to run a grid search, and only a limited number of values can be tested
7. Continue on Existing Model
    - Both GBM and XGBoost have this
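
The pruning behaviour in item 5 can be sketched on a toy tree. This is a simplified illustration, not XGBoost's actual code: `Node`, `prune` and `count_splits` are hypothetical names, and the gains are the -2 / +10 values from the example above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """Toy tree node; `gain` is the loss reduction of the split made at this node."""
    gain: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def prune(node: Node, min_gain: float = 0.0) -> Node:
    """Bottom-up pruning: collapse a split only if both children are leaves
    and its gain does not exceed min_gain."""
    if node.is_leaf():
        return node
    node.left = prune(node.left, min_gain)
    node.right = prune(node.right, min_gain)
    if node.left.is_leaf() and node.right.is_leaf() and node.gain <= min_gain:
        return Node()  # replace the split with a plain leaf
    return node

def count_splits(node: Node) -> int:
    """Number of internal (split) nodes in the tree."""
    if node.is_leaf():
        return 0
    return 1 + count_splits(node.left) + count_splits(node.right)

# A split with gain -2 whose left child splits again with gain +10.
tree = Node(gain=-2.0, left=Node(gain=10.0, left=Node(), right=Node()), right=Node())
print(count_splits(prune(tree)))   # 2: the +10 split protects the -2 split above it

# The same -2 split with nothing useful below it is pruned away.
stump = Node(gain=-2.0, left=Node(), right=Node())
print(count_splits(prune(stump)))  # 0
```

A greedy learner that stops at the first negative gain would never have grown the +10 split in the first tree at all.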

#### XGBoost parameters

**1. General Parameters**


   - `booster`
       - Type of model to run at each iteration
       - _gbtree_: tree-based models
       - _gblinear_: linear models
       
       
   - `silent`
       - _0_: running messages will be printed
       - _1_: no running messages
       - Deprecated in newer XGBoost versions in favor of `verbosity` (0 = silent, up to 3 = debug)


   - `nthread`
       - Number of threads used for parallel processing
       - Default: the maximum number of threads available
       - Set it to the number of cores you want to use
       

**2. Boosting Parameters**

- `eta`
    - Learning rate
    - Typical: 0.01~0.2
    
    
- `min_child_weight` _tree_
    - Controls overfitting: higher values prevent the model from learning relations that are highly specific to the particular sample selected for a tree
    - Similar to GBM's minimum number of observations per leaf, but here it is the minimum sum of instance weights required in a child
    
    
- `max_depth` _tree_
    - Controls overfitting: a higher depth lets the model learn relations that are very specific to a particular sample
    - Typical: 3-10
    - Tune with CV
    
    
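- `max_leaf_nodes`
    - Maximum number of terminal nodes (leaves) in a tree; an alternative to `max_depth` for limiting tree size
    - Note: this is the GBM-style name; the XGBoost parameter is called `max_leaves`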
    
    
- `gamma` _tree_
    - A node is split only when the resulting split gives a positive reduction in the loss function
    - Gamma specifies the minimum loss reduction required to make a split
    - Default = 0, higher values = more conservative 
    
    
- `max_delta_step`
    - Maximum delta step allowed for each tree's weight estimation
    - Default = 0 (no constraint); higher values make the update step more conservative
    - Rarely needed, but might help in logistic regression with extremely imbalanced classes
    
    
- `subsample` _tree_
    - Same as `subsample` in GBM: the fraction of observations randomly sampled for each tree
    - Lower values = more conservative, which helps prevent overfitting; too small = underfitting
    - Typical: 0.5-1
    
    
- `colsample_bytree` _tree_
    - Similar to max_features in GBM
    - Fraction of columns to be randomly sampled for each tree
    - Typical: 0.5-1


- `colsample_bylevel`
    - Subsample ratio of columns for each level of a tree
    - Rarely tuned, since `subsample` and `colsample_bytree` usually do the job
    

- `reg_alpha` _reg_
    - L1 regularization term on weights (analogous to Lasso regression)
    - Drives some weights to exactly 0
    - Useful with very high dimensionality, so that the algorithm runs faster
    
    
- `reg_lambda` _reg_
    - L2 regularization term on weights (analogous to Ridge regression)
    - Shrinks weights smoothly instead of setting them to 0
    - Worth exploring to reduce overfitting


- `scale_pos_weight`
    - Controls the weight of the positive class (default 1); adjusting it can help faster convergence when classes are highly imbalanced
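
To see how `gamma`, `reg_lambda` and `min_child_weight` interact, here is a rough sketch of the split gain from the XGBoost tutorial linked at the top of these notes: Gain = 1/2 [G_L²/(H_L+λ) + G_R²/(H_R+λ) - (G_L+G_R)²/(H_L+H_R+λ)] - γ, where G and H are the sums of first- and second-order gradients in each child. The function names are hypothetical, `reg_alpha` is left out for simplicity, and the numbers are made up.

```python
def leaf_score(G, H, reg_lambda):
    """Structure score of a leaf with gradient sum G and hessian sum H."""
    return G * G / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty for adding a leaf."""
    children = leaf_score(G_left, H_left, reg_lambda) + leaf_score(G_right, H_right, reg_lambda)
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    return 0.5 * (children - parent) - gamma

def split_allowed(G_left, H_left, G_right, H_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """A split is kept only if both children are heavy enough and the gain is positive."""
    if H_left < min_child_weight or H_right < min_child_weight:
        return False
    return split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma) > 0

# Made-up gradient statistics for one candidate split.
print(split_gain(-4.0, 2.0, 4.0, 2.0))                           # about 5.33
print(split_allowed(-4.0, 2.0, 4.0, 2.0, gamma=6.0))             # False: gain below gamma
print(split_allowed(-4.0, 2.0, 4.0, 2.0, min_child_weight=3.0))  # False: children too light
```

For squared-error loss each observation has a hessian of 1, so `min_child_weight` then behaves like a minimum number of observations per child.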

**3. Learning Parameters**

- `objective`
    - Default = `reg:squarederror` (called `reg:linear` in older XGBoost versions)
    - `binary:logistic`: returns predicted probability
    - `multi:softmax`: returns predicted class (also requires `num_class`)
    - `multi:softprob`: returns predicted probability of each point belonging to each class
    
    
- `eval_metric`
    - Default depends on the objective
    - Regression: `rmse`; classification: `error` (`logloss` in newer XGBoost versions)
    - Other options: `logloss`, `mae`, `merror` (multiclass classification error rate), `mlogloss` (multiclass logloss), `auc`
    
    
- `seed`
    - Random seed; fix it to get reproducible results, so that parameter changes can be compared across runs
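
The difference between `multi:softprob` and `multi:softmax` is only in what is returned. A small sketch with made-up raw class scores (the helper names are hypothetical, and this is not XGBoost's internal code):

```python
import math

def softprob(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_class(scores):
    """Return only the index of the most probable class."""
    probs = softprob(scores)
    return max(range(len(probs)), key=probs.__getitem__)

scores = [0.5, 2.0, -1.0]     # made-up raw scores for 3 classes
print(softprob(scores))       # one probability per class
print(softmax_class(scores))  # 1
```

Use the probability form when you need thresholds, calibration, or `mlogloss`; use the class form when only the label matters.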

**xgb: the native xgboost library; its `cv` function is used below for cross-validation.**

In [None]:
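# Assumes `train` (DataFrame), `target` and `IDcol` (column names) are already defined,
# and that the cell defining modelfit() below has been run first.
from xgboost import XGBClassifier

# Choose all predictors except target & ID columns
predictors = [x for x in train.columns if x not in [target, IDcol]]

xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    n_jobs=4,           # called `nthread` in older xgboost versions
    scale_pos_weight=1,
    random_state=27)    # called `seed` in older xgboost versions

modelfit(xgb1, train, predictors)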

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn import metrics

def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    # `target` is the label column name defined at notebook level ('Disbursed' in this dataset)
    if useTrainCV:
        # Use xgb.cv with early stopping to pick the number of boosting rounds
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc',
                          early_stopping_rounds=early_stopping_rounds,
                          verbose_eval=False)  # replaces the removed `show_progress` argument
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    # (eval_metric is dropped from fit(): it had no effect without an eval_set
    # and is no longer accepted there by recent xgboost versions)
    alg.fit(dtrain[predictors], dtrain[target])

    # Predict training set
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Print model report
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain[target], dtrain_predprob))

    # Plot feature importances; get_booster() replaces the older booster() accessor
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')

**XGBClassifier: the sklearn wrapper for XGBoost, which allows us to use sklearn's Grid Search with parallel processing in the same way we did for GBM**

In [None]:
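from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}

# The `iid` argument was removed from scikit-learn, so it is no longer passed
gsearch1 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
                            min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', n_jobs=4, scale_pos_weight=1,
                            random_state=27),
    param_grid=param_test1, scoring='roc_auc', n_jobs=4, cv=5)

gsearch1.fit(train[predictors], train[target])

# `grid_scores_` was replaced by `cv_results_` in newer scikit-learn versions
gsearch1.cv_results_['mean_test_score'], gsearch1.best_params_, gsearch1.best_score_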

**General Approach for Parameter Tuning**

1. Choose a relatively high learning rate (0.05-0.3) and determine the optimum number of trees for it
2. Tune tree-specific parameters
3. Tune regularization parameters
4. Lower the learning rate and increase the number of trees accordingly
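
Step 1 relies on early stopping over a cross-validated score curve to find the number of trees. The idea can be sketched in plain Python; `best_num_rounds` is a hypothetical helper and the AUC values are made up for illustration.

```python
def best_num_rounds(scores, early_stopping_rounds, maximize=True):
    """Scan per-round validation scores and stop once the best score has not
    improved for `early_stopping_rounds` rounds.
    Returns (number of rounds to keep, best score)."""
    best_score = None
    best_round = -1
    for i, score in enumerate(scores):
        if best_score is None:
            improved = True
        elif maximize:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_score, best_round = score, i
        elif i - best_round >= early_stopping_rounds:
            break  # patience exhausted; later rounds are never evaluated
    return best_round + 1, best_score

# Made-up mean CV AUC after each boosting round.
cv_auc = [0.70, 0.74, 0.77, 0.79, 0.80, 0.80, 0.799, 0.798, 0.81]
print(best_num_rounds(cv_auc, early_stopping_rounds=3))  # (5, 0.8)
```

Note that the late 0.81 is never reached: a patience that is too small can stop before a later improvement, which is why a fairly generous value such as 50 is used in `modelfit`.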

**Rule of thumb for initial values**

**1. Fix learning rate and number of estimators**

**2. Tree-specific parameters (tuned first as they have the highest impact)**

- `max_depth` = 5
    - Typical Values: 3-10
    
    
- `min_child_weight` = 1
    - A smaller value is chosen because the classes are highly imbalanced, so leaf nodes may contain only small groups
    
    
- `gamma` = 0 
    - Typical Values = 0.1-0.2
    - To be tuned later
    
    
- `subsample` = 0.8
    - Typical Values: 0.5-0.9
    
    
- `colsample_bytree` = 0.8
    - Typical Values: 0.5-0.9
    
    
- `scale_pos_weight` = 1
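    - 1 applies no reweighting and is only the starting point; because of the high class imbalance, a value around (number of negatives) / (number of positives) is worth trying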
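
For `scale_pos_weight`, a commonly suggested value under class imbalance is the ratio of negative to positive examples. A minimal sketch, with a hypothetical helper name and made-up labels:

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) examples in a binary label list."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos

labels = [0] * 95 + [1] * 5                # made-up 95:5 imbalance
print(suggested_scale_pos_weight(labels))  # 19.0
```

Treat the ratio as a starting point to validate with CV, not as a fixed rule.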


**3. Regularization parameters**

Many people rarely use these parameters, since `gamma` already provides a substantial way of controlling complexity.



**- It is difficult to get a very big leap in performance via parameter tuning alone**  
**- Bigger gains usually come from feature engineering, ensemble models, and stacking**

### LightGBM

LightGBM grows trees leaf-wise, whereas XGBoost grows them level-wise by default.

**Pros:**

1. **Faster training speed & efficiency**
    - Histogram-based algorithm: buckets continuous feature values into discrete bins, which reduces training time


2. **Lower memory usage**
    - The same histogram binning reduces memory usage, since discrete bin indices are stored instead of continuous values


3. **Leaf-wise instead of XGBoost's Level-wise splitting**
    - Allows for more complex models, which can lead to higher accuracy
    - But it can also contribute to overfitting, which can be limited by tuning `max_depth`


4. **Good for Large Datasets**


5. **Supports parallel learning**
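
The histogram idea from point 1 can be sketched as follows. This uses equal-width bins purely for simplicity; LightGBM's real binning is more involved, and the function names and feature values here are made up.

```python
from bisect import bisect_right

def make_bin_edges(values, n_bins):
    """Interior edges of n_bins equal-width bins spanning the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def to_bin(value, edges):
    """Index of the bin that a value falls into."""
    return bisect_right(edges, value)

feature = [0.0, 1.0, 2.5, 3.9, 4.0, 6.2, 7.9, 8.0]  # made-up continuous feature
edges = make_bin_edges(feature, n_bins=4)
binned = [to_bin(v, edges) for v in feature]

print(edges)   # [2.0, 4.0, 6.0]
print(binned)  # [0, 0, 1, 1, 2, 3, 3, 3]

# Split candidates: one per bin edge instead of one per gap between distinct values.
print(len(edges), "candidate thresholds instead of", len(set(feature)) - 1)
```

Fewer candidate thresholds means fewer gain evaluations per feature, and small integer bin indices are cheaper to store than floats, which is where the speed and memory savings come from.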