### XGBoost (Extreme Gradient Boosting)
- uses regularization (adds a penalty term to the objective to reduce variance and prevent overfitting)

In [None]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV, KFold
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

X, y = datasets.load_diabetes(return_X_y=True)

# The diabetes target is continuous, so use plain KFold; StratifiedKFold
# balances class proportions per fold and is meant for classification
kfold = KFold(n_splits=5, shuffle=True, random_state=2)

Use the same folds to obtain comparable scores when fine-tuning hyperparameters with GridSearchCV and RandomizedSearchCV.
GridSearchCV searches all possible combinations in a hyperparameter grid to find the best results.
RandomizedSearchCV samples 10 random hyperparameter combinations by default (n_iter=10).
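
To make the difference concrete, here is a minimal sketch of how the two strategies pick candidates. It uses only the standard library and a made-up search space (the `params` values are assumptions for illustration); it mimics the idea, not scikit-learn's actual implementation.

```python
import itertools
import random

# Hypothetical search space, chosen only for illustration
params = {
    "max_depth": [2, 3, 5],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 200],
}

names = list(params)

# Exhaustive search: every combination (3 * 3 * 2 = 18 candidates)
grid_candidates = [dict(zip(names, values))
                   for values in itertools.product(*params.values())]

# Randomized search: a fixed-size random subset of those combinations
rng = random.Random(2)
random_candidates = rng.sample(grid_candidates, k=10)

print(len(grid_candidates), len(random_candidates))  # 18 10
```

The grid grows multiplicatively with each added hyperparameter, while the randomized search keeps the number of fits fixed.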

In [None]:
def grid_search(params, random=False):
    # Initialize XGB Regressor with objective='reg:squarederror' (MSE)
    xgb = XGBRegressor(booster='gbtree', objective='reg:squarederror',
                       random_state=2)
    if random:
        # Sample 20 random combinations from params
        grid = RandomizedSearchCV(xgb, params, cv=kfold, n_iter=20, n_jobs=-1)
    else:
        # Try every combination in params
        grid = GridSearchCV(xgb, params, cv=kfold, n_jobs=-1)
    # Fit on the full dataset; cross-validation is handled by cv=kfold
    grid.fit(X, y)
    best_params = grid.best_params_
    print("Best params:", best_params)
    # best_score_ is the mean cross-validated score (R^2 by default for regressors)
    best_score = grid.best_score_
    print("Best CV score: {:.3f}".format(best_score))


Important XGBoost hyperparameters:
* n_estimators: default 100, range [1, inf). Number of trees in the ensemble
    - Increasing may improve scores with large data
* learning_rate: default 0.3, range [0, 1]. Shrinks the tree weights in each round of boosting
    - Decreasing helps prevent overfitting
* max_depth: default 6, range [0, inf), where 0 means no limit. Maximum depth of each tree
    - Decreasing helps prevent overfitting

n_estimators sets the number of trees in the ensemble.
Initialize a grid search of n_estimators starting from the default of 100,
then double the number of trees up to 800:

In [1]:
grid_search(params={'n_estimators': [100, 200, 400, 800]})
# Since we have a small dataset, increasing n_estimators did not produce
# better results

learning_rate shrinks the weights of the trees in each round of boosting.
With a lower learning rate, more trees are required to reach the same score.
This helps prevent overfitting because each tree contributes a smaller
correction to the running prediction.
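
The trade-off between learning rate and number of rounds can be seen in a toy gradient booster. This is a minimal sketch using NumPy and one-split stumps on synthetic data (the helper names `fit_stump` and `boost` and the data are assumptions for illustration); it is not how XGBoost builds its trees.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold on x that best splits the residuals (squared error)."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the largest value so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value

def boost(x, y, learning_rate, n_rounds):
    """Return the training MSE after each boosting round."""
    pred = np.full_like(y, y.mean(), dtype=float)
    history = []
    for _ in range(n_rounds):
        t, left_value, right_value = fit_stump(x, y - pred)
        # Shrink the new tree's contribution by the learning rate
        pred += learning_rate * np.where(x <= t, left_value, right_value)
        history.append(float(np.mean((y - pred) ** 2)))
    return history

# Synthetic data, made up for this illustration
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=50)

mse_fast = boost(x, y, learning_rate=0.3, n_rounds=20)
mse_slow = boost(x, y, learning_rate=0.05, n_rounds=20)
print("lr=0.3 :", round(mse_fast[-1], 4))
print("lr=0.05:", round(mse_slow[-1], 4))
```

After the same 20 rounds, the lower learning rate leaves a larger training error, which is why it is usually paired with a larger n_estimators.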

In [None]:
grid_search(params={'learning_rate':[0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]})

max_depth sets the maximum depth of each tree, equivalent to the number
of rounds of splitting from root to leaf. Limiting max_depth prevents overfitting because
the individual trees can only grow as far as max_depth allows.
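
As a rough guide to how quickly capacity grows, a binary tree of depth d has at most 2^d leaves. A small sketch (the helper `max_leaves` is a made-up name, and it assumes max_depth >= 1, since 0 means no limit in XGBoost):

```python
def max_leaves(max_depth):
    """Upper bound on leaves in a binary tree with the given depth (depth >= 1)."""
    return 2 ** max_depth

# Same depths as the grid searched in the next cell
for depth in [2, 3, 5, 6, 8]:
    print(depth, max_leaves(depth))
```

Going from depth 2 to depth 8 raises the bound from 4 to 256 leaves per tree, which is a lot for a dataset with a few hundred rows.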

In [None]:
grid_search(params={'max_depth':[2, 3, 5, 6, 8]})

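gamma (alias min_split_loss, sometimes described as a Lagrange multiplier) provides a threshold:
the loss reduction from a candidate split must exceed gamma before the node splits further.
Larger values make the algorithm more conservative.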
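
The role of gamma is easiest to see in the split gain used for boosted trees, where G and H are the sums of gradients and Hessians in each child and lambda is the L2 penalty. The sketch below follows the gain formula from the XGBoost tutorial; the function name `split_gain` and the numbers are assumptions for illustration.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split; the split is only worth keeping if this is positive."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Made-up node statistics: two rows on each side with opposite gradients
for gamma in [0, 0.1, 0.5, 1, 2, 5, 10]:
    gain = split_gain(-4.0, 2.0, 4.0, 2.0, gamma=gamma)
    print(gamma, round(gain, 3), "split" if gain > 0 else "no split")
```

With these numbers the unpenalized gain is 16/3, so every gamma in the grid below still allows the split and only gamma=10 blocks it.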

In [None]:
grid_search(params={'gamma':[0, 0.1, 0.5, 1, 2, 5]})

min_child_weight is the minimum sum of instance weights (Hessians) required in each child
for a node to split. Increasing it reduces overfitting.
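
For squared-error regression each row has a Hessian of 1, so the sum of Hessians in a child is simply its row count. A minimal sketch of the check under that assumption (`split_allowed` is a hypothetical helper, not an XGBoost function):

```python
def split_allowed(hess_left, hess_right, min_child_weight=1.0):
    """A split is allowed only if both children carry enough total Hessian."""
    return (sum(hess_left) >= min_child_weight
            and sum(hess_right) >= min_child_weight)

# Made-up split: 2 rows go left, 8 rows go right, Hessian 1.0 per row
left, right = [1.0] * 2, [1.0] * 8
for w in [1, 2, 3, 4, 5]:
    print(w, split_allowed(left, right, min_child_weight=w))
```

Here the split survives for min_child_weight of 1 or 2 and is rejected from 3 upward, because the smaller child only holds two rows.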

In [None]:
grid_search(params={'min_child_weight':[1, 2, 3, 4, 5]})

### Applying Early Stopping
Early stopping limits the number of training rounds in iterative machine learning algorithms. It stops training when n consecutive rounds pass without producing gains.

eval_set and eval_metric may be used as parameters for .fit to generate test scores for each training round. eval_metric provides the scoring method, such as 'rmse' for regression. eval_set provides the data to be evaluated, commonly X_test and y_test.
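
The stopping rule itself is simple. Below is a minimal sketch of the logic on a list of per-round validation errors; the function `early_stopping_round` and the RMSE values are made up for illustration and do not come from a real run.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round) for per-round errors, lower is better."""
    best_score = float("inf")
    best_round = -1
    for i, score in enumerate(scores):
        if score < best_score:
            best_score, best_round = score, i
        elif i - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here
            return best_round, i
    return best_round, len(scores) - 1

# Hypothetical validation RMSE per boosting round
rmse = [60.0, 57.0, 55.5, 55.8, 55.6, 55.9, 56.2, 56.5]
best, stopped = early_stopping_round(rmse, patience=3)
print(best, stopped)  # 2 5
```

Training stops at round 5, three rounds after the best score at round 2, and the model from the best round is the one to keep.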

In [None]:
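# eval_set = [(X_test, y_test)]
# eval_metric = 'rmse'
# xgb.fit(X_train, y_train, eval_metric=eval_metric, eval_set=eval_set)
# Note: since XGBoost 1.6, passing eval_metric to .fit is deprecated;
# pass it to the constructor instead, e.g. XGBRegressor(eval_metric='rmse')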