# XGBoost in Python

**XGBoost (eXtreme Gradient Boosting)** is an advanced implementation of the gradient boosting algorithm. Building a model with XGBoost is easy; improving it is harder. The algorithm exposes many parameters, and tuning them is essential to getting a better model.

* The XGBoost Advantage
* XGBoost Parameters
* XGBoost’s CV
* Parameter Tuning

## 1. The XGBoost Advantage

* Regularization:

Standard GBM implementations have no regularization, whereas XGBoost does, which helps to reduce overfitting.
In fact, XGBoost is also known as a 'regularized boosting' technique.

* Parallel Processing:

XGBoost implements parallel processing and is considerably faster than GBM. XGBoost doesn't build multiple trees in parallel, since each tree depends on the previous one; rather, it parallelizes within a single tree to create branches independently. It also supports implementation on Hadoop.

* High Flexibility:

XGBoost allows users to define custom optimization objectives and evaluation criteria. This adds a whole new dimension to the model: it can be trained and scored against problem-specific losses and metrics rather than only the built-in ones.

* Handling Missing Values:

XGBoost has a built-in routine to handle missing values. The user supplies a value that marks missing entries, one that differs from all real observations, and passes it as a parameter. At each node XGBoost tries both branches for the missing values it encounters and learns which path to take for missing values in future.

* Tree Pruning:

A GBM stops splitting a node when it encounters a negative loss reduction in the split, so it is more of a greedy algorithm.
XGBoost, on the other hand, makes splits up to the specified max_depth and then prunes the tree backwards, removing splits beyond which there is no positive gain.

* Built-in Cross-Validation:

XGBoost allows the user to run cross-validation at each iteration of the boosting process, so it is easy to get the optimum number of boosting iterations in a single run.

* DMatrices Data Structure:

DMatrix is an internal data structure used by XGBoost that is optimized for both memory efficiency and training speed. You can construct a DMatrix from NumPy arrays.
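
The difference between stopping at the first unhelpful split and pruning afterwards, described under Tree Pruning above, can be illustrated with a small sketch. The nested-dict tree, its `gain` field, and the `prune` helper are hypothetical constructs with made-up gains; they are not XGBoost's internal representation.

```python
# Illustrative only: a toy tree where each split node stores the loss
# reduction ("gain") it achieved. Leaves are represented by None.

def prune(node, min_gain=0.0):
    """Bottom-up pruning: collapse a split into a leaf when both of its
    children are leaves and its gain does not exceed min_gain."""
    if node is None:              # already a leaf
        return None
    node["left"] = prune(node["left"], min_gain)
    node["right"] = prune(node["right"], min_gain)
    both_leaves = node["left"] is None and node["right"] is None
    if both_leaves and node["gain"] <= min_gain:
        return None               # remove this split
    return node

def count_splits(node):
    """Number of split nodes remaining in the tree."""
    if node is None:
        return 0
    return 1 + count_splits(node["left"]) + count_splits(node["right"])

# A split with gain -2 that is followed by a split with gain +10.
# A strategy that stops at the first negative gain would never reach +10.
tree = {
    "gain": -2.0,
    "left": {"gain": 10.0, "left": None, "right": None},
    "right": {"gain": -1.0, "left": None, "right": None},
}

pruned = prune(tree)
print(count_splits(pruned))
```

In this toy tree the root survives despite its negative gain because a useful split sits beneath it, while the unhelpful split on the right is removed.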

## 2. XGBoost Parameters

The XGBoost authors divide the parameters into three categories:

* General Parameters: Guide the overall functioning
* Booster Parameters: Guide the individual booster (tree/regression) at each step
* Learning Task Parameters: Guide the optimization performed

### 2.1. General Parameters

These define the overall functionality of XGBoost.

1. booster [default=gbtree]

Select the type of model to run at each iteration: gbtree (tree-based models), gblinear (linear models)

2. silent [default=0]

Silent mode is activated if this is set to 1, i.e. no running messages will be printed. It’s generally good to keep it at 0, as the messages might help in understanding the model. Newer XGBoost releases replace this parameter with verbosity.

3. nthread [default to maximum number of threads available if not set]

The number of threads used for parallel processing; enter the number of cores in the system that you want to use.
If you wish to run on all cores, leave the value unset and the algorithm will detect the number automatically.

### 2.2. Booster Parameters

1. eta [default=0.3]

Analogous to the learning rate in GBM. It makes the model more robust by shrinking the weights at each step (typical values: 0.01-0.2).

2. min_child_weight [default=1]

Defines the minimum sum of weights of all observations required in a child. This is similar to min_samples_leaf in scikit-learn's GBM, but not identical: it refers to the minimum "sum of weights" of observations, while GBM uses a minimum "number of observations". It is used to control over-fitting; higher values prevent the model from learning relations specific to a particular sample, but values that are too high can lead to under-fitting.

3. max_depth [default=6]

The maximum depth of a tree. Deeper trees can model more complex relationships by adding more nodes, but they also allow the model to learn relations very specific to a particular sample, so this parameter is used to control over-fitting. (Typical values: 3-10)

4. gamma [default=0]

A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split; larger values make the algorithm more conservative.

5. max_delta_step [default=0]

The maximum delta step we allow each tree’s weight estimation to take. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced.

6. subsample [default=1]

Denotes the fraction of observations to be randomly sampled for each tree. Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting. (Typical values: 0.5-1)

7. colsample_bytree [default=1]

Denotes the fraction of features to be randomly sampled for each tree. (Typical values: 0.5-1)

8. lambda [default=1]

L2 regularization term on weights (analogous to Ridge regression). It handles the regularization part of XGBoost and should be explored to reduce overfitting.

9. alpha [default=0]

L1 regularization term on weights (analogous to Lasso regression). It can be used in case of very high dimensionality so that the algorithm runs faster.

10. scale_pos_weight [default=1]

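Controls the balance of positive and negative weights. A value greater than 0 should be used in case of high class imbalance, as it helps in faster convergence. A typical value to consider is the number of negative observations divided by the number of positive ones.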
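
How `lambda`, `gamma`, `min_child_weight` and `eta` interact can be sketched with the split-gain and leaf-weight expressions from the XGBoost paper. This is a minimal illustration under stated assumptions, not the library's implementation: the function names are hypothetical, `G` and `H` denote the sums of first- and second-order gradients of the loss over the observations in a node, the statistics are made up, and `alpha` (L1) is left out for brevity.

```python
def leaf_weight(G, H, lam=1.0, eta=0.3):
    """Optimal leaf value -G / (H + lambda), shrunk by the learning rate eta."""
    return eta * (-G / (H + lam))

def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Loss reduction of a candidate split, net of the gamma penalty."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

def allowed(GL, HL, GR, HR, lam=1.0, gamma=0.0, min_child_weight=1.0):
    """A split is kept only if both children carry enough hessian weight
    and the penalized gain is positive."""
    if HL < min_child_weight or HR < min_child_weight:
        return False
    return split_gain(GL, HL, GR, HR, lam, gamma) > 0

# Made-up gradient statistics for the two children of a candidate split.
GL, HL, GR, HR = -4.0, 2.0, 3.0, 3.0

print(split_gain(GL, HL, GR, HR))                      # gain with default lambda, no gamma
print(allowed(GL, HL, GR, HR, gamma=4.0))              # a large gamma rejects the split
print(allowed(GL, HL, GR, HR, min_child_weight=2.5))   # left child is too light
print(leaf_weight(GL, HL, eta=0.1))                    # shrunken leaf value
```

For squared-error loss every observation contributes a hessian of 1, so `H` equals the number of observations; for logistic loss it is the sum of p(1-p), which is why min_child_weight is only similar to a minimum-observations rule.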

### 2.3. Learning Task Parameters

These parameters define the optimization objective and the metric to be calculated at each step.

1. objective [default=reg:linear]

This defines the loss function to be minimized. The default, reg:linear, is named reg:squarederror in newer XGBoost releases. The most commonly used values are:

* binary:logistic: logistic regression for binary classification, returns predicted probability (not class)
* multi:softmax: multiclass classification using the softmax objective, returns predicted class (not probabilities). You also need to set an additional num_class parameter defining the number of unique classes
* multi:softprob: same as softmax, but returns predicted probability of each data point belonging to each class.

2. eval_metric [default according to objective]

The metric to be used for validation data. The default values are rmse for regression and error for classification. Typical values are: rmse, mae, logloss, error, merror, mlogloss, auc
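
The difference between what the three objectives listed above return can be sketched in plain Python. The helpers below are hypothetical and operate on made-up raw scores (margins); they only illustrate the shape of each output, not how XGBoost computes it internally.

```python
import math

def binary_logistic(margin):
    """binary:logistic returns a probability: the sigmoid of the raw score."""
    return 1.0 / (1.0 + math.exp(-margin))

def multi_softprob(margins):
    """multi:softprob returns one probability per class (softmax of raw scores)."""
    m = max(margins)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in margins]
    total = sum(exps)
    return [e / total for e in exps]

def multi_softmax(margins):
    """multi:softmax returns only the index of the most probable class."""
    probs = multi_softprob(margins)
    return probs.index(max(probs))

raw_scores = [0.2, 1.5, -0.3]                 # made-up scores for num_class = 3
print(binary_logistic(0.0))
print(multi_softprob(raw_scores))
print(multi_softmax(raw_scores))
```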


## 3. XGBoost’s CV

To tune the other hyperparameters, use the cv function from XGBoost. It runs cross-validation on the training dataset and returns, for each boosting round, the mean (and standard deviation) of the metric score across folds.

* params: dictionary of parameters.
* dtrain: the training DMatrix.
* num_boost_round: maximum number of boosting rounds.
* early_stopping_rounds: number of rounds without improvement after which training stops.
* seed: random seed.
* nfold: the number of folds to use for cross-validation.
* metrics: the metrics used to evaluate the model; here we use MAE.
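
How early_stopping_rounds selects the number of boosting rounds can be sketched on a list of per-round scores. The function name and the scores below are hypothetical; the sketch assumes a metric where lower is better, such as MAE, and is not XGBoost's own implementation.

```python
def early_stop(scores, early_stopping_rounds):
    """Return (best_round, last_round_run), both as 0-based positions.

    Training stops once the score has failed to improve for
    early_stopping_rounds consecutive rounds."""
    best_idx = 0
    best = float("inf")
    for i, score in enumerate(scores):
        if score < best:
            best, best_idx = score, i
        elif i - best_idx >= early_stopping_rounds:
            return best_idx, i
    return best_idx, len(scores) - 1

# Made-up mean validation MAE per boosting round.
scores = [5.0, 4.0, 3.5, 3.6, 3.7, 3.55, 3.4]

best_round, last_round = early_stop(scores, early_stopping_rounds=3)
print(best_round, last_round)
```

Note that a later improvement (the final score in this list) is never seen once the stopping rule fires, which is the trade-off of choosing a small early_stopping_rounds.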

## 4. Parameter Tuning

1. Choose a relatively high learning rate. A learning rate of 0.1 generally works, but values between 0.05 and 0.3 can suit different problems.
2. Tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the chosen learning rate and number of trees.
3. Tune the regularization parameters (lambda, alpha), which can help reduce model complexity and enhance performance.
4. Lower the learning rate and decide the optimal parameters.

### 4.1. Native Python API

In [None]:
import xgboost as xgb

# construct DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

In [None]:
# set params grid
# parameters are listed in tuning-priority order
gridsearch_params = [(max_depth, min_child_weight, gamma, subsample, colsample, Lambda, alpha)
                     for max_depth in range(9,12)
                     for min_child_weight in range(5,8)
                     for gamma in [i/10.0 for i in range(0,5)]
                     for subsample in [i/10. for i in range(7,11)]
                     for colsample in [i/10. for i in range(7,11)]
                     for Lambda in [0, 0.001, 0.005, 0.01, 0.05]
                     for alpha in [0, 0.001, 0.005, 0.01, 0.05]]

# base parameters shared by every run (the learning rate stays fixed)
params = {'eta': 0.1}

min_mae = float("Inf")
best_params = None

# We start with the largest values and go down to the smallest
for max_depth, min_child_weight, gamma, subsample, colsample, Lambda, alpha in reversed(gridsearch_params):

    # We update our parameters, assigning each value to its own key
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    params['gamma'] = gamma
    params['subsample'] = subsample
    params['colsample_bytree'] = colsample
    params['lambda'] = Lambda  # the XGBoost key is lowercase 'lambda'
    params['alpha'] = alpha

    # Run CV
    cv_results = xgb.cv(params, dtrain, num_boost_round=800, seed=42, nfold=5, metrics={'mae'}, early_stopping_rounds=10)

    # Update best score
    mean_mae = cv_results['test-mae-mean'].min()
    # argmin is a 0-based position, so add 1 to get the number of rounds
    boost_rounds = cv_results['test-mae-mean'].argmin() + 1
    print('MAE: {} for {} rounds'.format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (max_depth, min_child_weight, gamma, subsample, colsample, Lambda, alpha)

print('Best params: {} with {} mae'.format(best_params, min_mae))
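
Before running an exhaustive search like the one above, it may help to count how many cross-validation runs it implies. The sketch below rebuilds the same value lists with `itertools.product` and counts the combinations; the `grid` dictionary and variable names are for illustration only.

```python
from itertools import product

grid = {
    "max_depth": list(range(9, 12)),
    "min_child_weight": list(range(5, 8)),
    "gamma": [i / 10.0 for i in range(0, 5)],
    "subsample": [i / 10.0 for i in range(7, 11)],
    "colsample_bytree": [i / 10.0 for i in range(7, 11)],
    "lambda": [0, 0.001, 0.005, 0.01, 0.05],
    "alpha": [0, 0.001, 0.005, 0.01, 0.05],
}

# Every combination of values corresponds to one xgb.cv call.
combinations = list(product(*grid.values()))
n_folds = 5

print(len(combinations))             # number of parameter combinations
print(len(combinations) * n_folds)   # boosters trained across all CV runs
```

Searching a few parameters at a time, as in the staged approach listed at the start of this section, keeps this number manageable.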

### 4.2. Scikit-Learn API

Step 1: Fix learning rate and number of estimators for tuning tree-based parameters

In [None]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {'n_estimators': range(100, 800, 50)}

model = XGBClassifier(learning_rate=0.1, max_depth=5, min_child_weight=1, objective='binary:logistic')
gsearch = GridSearchCV(estimator=model, param_grid=param_grid, scoring='roc_auc', n_jobs=-1, cv=5)
gsearch.fit(X_train, y_train)
# cv_results_ replaces grid_scores_, which was removed from scikit-learn
gsearch.cv_results_, gsearch.best_params_, gsearch.best_score_

Step 2: Set optimal n_estimators from above, and tune max_depth, min_child_weight, gamma, subsample, colsample_bytree

In [None]:
# the column-sampling parameter is named colsample_bytree
param_grid = {'max_depth': range(3,12,2),
              'min_child_weight': range(1,30,3),
              'gamma': [i/10.0 for i in range(0,5)],
              'subsample': [i/10. for i in range(7,11)],
              'colsample_bytree': [i/10. for i in range(7,11)]}

model = XGBClassifier(learning_rate=0.1, n_estimators=200, objective='binary:logistic')
gsearch = GridSearchCV(estimator=model, param_grid=param_grid, scoring='roc_auc', n_jobs=-1, cv=5)
gsearch.fit(X_train, y_train)
# cv_results_ replaces grid_scores_, which was removed from scikit-learn
gsearch.cv_results_, gsearch.best_params_, gsearch.best_score_

Step 3: Tuning Regularization Parameters

In [None]:
param_grid = {'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100], 'reg_lambda':[1e-5, 1e-2, 0.1, 1, 100]}

# tree parameters fixed from the previous step; colsample_bytree is the valid parameter name
model = XGBClassifier(learning_rate=0.1, n_estimators=200, max_depth=5, min_child_weight=20, gamma=2,
                      subsample=0.8, colsample_bytree=0.6, objective='binary:logistic')

gsearch = GridSearchCV(estimator=model, param_grid=param_grid, scoring='roc_auc', n_jobs=-1, cv=5)
gsearch.fit(X_train, y_train)
# cv_results_ replaces grid_scores_, which was removed from scikit-learn
gsearch.cv_results_, gsearch.best_params_, gsearch.best_score_

Step 4: Tuning learning rate

In [None]:
param_grid = {'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.2, 0.3]}

# tree and regularization parameters fixed from the previous steps
model = XGBClassifier(n_estimators=200, max_depth=5, min_child_weight=20, gamma=2,
                      subsample=0.8, colsample_bytree=0.6, reg_alpha=10, reg_lambda=5, objective='binary:logistic')

gsearch = GridSearchCV(estimator=model, param_grid=param_grid, scoring='roc_auc', n_jobs=-1, cv=5)
gsearch.fit(X_train, y_train)
# cv_results_ replaces grid_scores_, which was removed from scikit-learn
gsearch.cv_results_, gsearch.best_params_, gsearch.best_score_

Step 5: Evaluate metrics in test data

In [None]:
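# metrics to track on each evaluation set
eval_metric = ["auc", "error"]

# newer XGBoost releases expect eval_metric in the constructor rather than in fit()
model = XGBClassifier(learning_rate=0.05, n_estimators=200, max_depth=5, min_child_weight=20, gamma=2,
                      subsample=0.8, colsample_bytree=0.6, reg_alpha=10, reg_lambda=5, objective='binary:logistic',
                      eval_metric=eval_metric)

# evaluate on both the training and the test data after every boosting round
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_set=eval_set, verbose=True)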
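
What the `error` and `auc` metrics measure can be sketched on a handful of made-up labels and predicted probabilities. The helpers below are hypothetical and written for clarity rather than speed: `error_rate` assumes a 0.5 decision threshold, and `auc` is computed as the fraction of positive/negative pairs ranked correctly, counting ties as half.

```python
def error_rate(y_true, y_prob, threshold=0.5):
    """Fraction of observations whose thresholded prediction is wrong."""
    wrong = sum((p > threshold) != bool(y) for y, p in zip(y_true, y_prob))
    return wrong / len(y_true)

def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.9]

print(error_rate(y_true, y_prob))
print(auc(y_true, y_prob))
```

A model can rank observations well (high AUC) and still have a high error rate if the 0.5 threshold is a poor fit, which is one reason to track both metrics.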