# Extreme Gradient Boosting with XGBoost
### Definition
* Boosting converts a collection of weak learners into a strong learner. 
    * **weak learner**: an ML algorithm that performs only slightly better than chance (e.g. just over 50% accuracy on a balanced binary problem)
* Boosting rounds: number of weak learners used by the meta-model. 

### How it works
1. Iteratively learn a set of weak models on subsets of the data
2. Weight each weak model's prediction according to that learner's performance
3. Combine the weighted predictions to obtain a single weighted prediction
4. ... that is much better than the individual predictions themselves!
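
The steps above can be sketched in plain Python. This is a minimal AdaBoost-style illustration, not XGBoost's own gradient-boosting algorithm: the toy dataset, the decision-stump weak learner, and all function names are assumptions made for the example, and the sketch reweights samples between rounds rather than drawing subsets.

```python
import math

# Hypothetical 1-D toy data: no single threshold separates the two classes.
X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
THRESHOLDS = [i + 0.5 for i in range(-1, 10)]


def stump_predict(x, threshold, polarity):
    """Weak learner: predict `polarity` left of the threshold, `-polarity` otherwise."""
    return polarity if x < threshold else -polarity


def best_stump(X, y, weights):
    """Return (weighted_error, threshold, polarity) of the lowest-error stump."""
    best = None
    for threshold in THRESHOLDS:
        for polarity in (1, -1):
            error = sum(w for x, label, w in zip(X, y, weights)
                        if stump_predict(x, threshold, polarity) != label)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def boost(X, y, rounds):
    """Fit `rounds` stumps, giving more weight to samples that were misclassified."""
    weights = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(X, y, weights)
        error = max(error, 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1.0 - error) / error)  # better learner, larger vote
        ensemble.append((alpha, threshold, polarity))
        weights = [w * math.exp(-alpha * label * stump_predict(x, threshold, polarity))
                   for x, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Combine the weighted votes of all stumps into one prediction."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble, X, y):
    return sum(predict(ensemble, x) == label for x, label in zip(X, y)) / len(X)


print("1 round :", accuracy(boost(X, y, rounds=1), X, y))
print("3 rounds:", accuracy(boost(X, y, rounds=3), X, y))
```

On this toy data a single stump classifies 7 of the 10 points correctly, while the weighted vote of three stumps classifies all 10.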

### Advantages
* Speed and performance
* Core algorithm is parallelizable (good for big data)
* **Frequently outperforms single-algorithm methods on structured (tabular) data**
* State-of-the-art performance in many ML tasks

### Cross-validation
* Is a robust method for estimating the performance of a model on unseen data
* Generates many train/test splits of the training data, with non-overlapping test folds
* Reports the average test set performance across all data splits
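
A minimal sketch of the splitting and averaging described above, in plain Python. The helper names, the toy targets, and the "predict the training mean" model are all hypothetical, and the folds are taken in order without shuffling to keep the example short.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs; the test folds never overlap."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(values, n_folds, fit, score):
    """Average the held-out score over all folds."""
    fold_scores = []
    for train, test in k_fold_indices(len(values), n_folds):
        model = fit([values[i] for i in train])
        fold_scores.append(score(model, [values[i] for i in test]))
    return sum(fold_scores) / len(fold_scores)


def fit_mean(train_values):
    """Toy 'model': remember the mean of the training targets."""
    return sum(train_values) / len(train_values)


def mean_abs_error(model, test_values):
    """Score the toy model on a held-out fold."""
    return sum(abs(v - model) for v in test_values) / len(test_values)


targets = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0]
folds = list(k_fold_indices(len(targets), 3))
cv_score = cross_validate(targets, 3, fit_mean, mean_abs_error)
print("mean absolute error across folds:", cv_score)
```

Every sample lands in exactly one test fold, so each observation is used for evaluation once and for training in all the other folds.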

### Common loss functions in XGBoost
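* Loss function (objective) names in xgboost:
    * reg:squarederror - use for regression problems (called reg:linear in older releases, a name that is now deprecated)
    * reg:logistic - logistic regression; its output is also a score between 0 and 1, so threshold it yourself when you want just a decision
    * binary:logistic - use for binary classification when you want the predicted probability rather than just a decision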
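
To make the probability-versus-decision distinction concrete, here is a small sketch of what a logistic objective does with a model's raw score. The raw scores and the 0.5 cut-off are assumptions chosen for illustration, not values produced by XGBoost.

```python
import math


def sigmoid(raw_score):
    """Map a raw (log-odds) score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-raw_score))


def log_loss(label, probability):
    """Logistic loss for one example with label 0 or 1."""
    return -(label * math.log(probability) + (1 - label) * math.log(1.0 - probability))


raw_scores = [-2.0, 0.0, 3.0]                         # hypothetical model outputs
probabilities = [sigmoid(s) for s in raw_scores]      # "I want a probability"
decisions = [int(p > 0.5) for p in probabilities]     # "I want just a decision"

print(probabilities)
print(decisions)
```

A confident correct prediction costs little under the logistic loss, while a confident wrong one is penalized heavily, which is what pushes the model toward well-separated scores.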

### Base learners
* XGBoost involves creating a meta-model that is composed of many individual models that combine to give a final prediction
    * Individual models = base learners
    * Want base learners that when combined create final prediction that is non-linear
    * Each base learner should be good at distinguishing or predicting different parts of the dataset
    * Two kinds of base learners:
        * tree
        * linear


### When to use XGBoost
* You have a large number of training samples
    * Greater than 1000 training samples and fewer than 100 features
    * The number of features < number of training samples
* You have a mixture of categorical and numeric features
    * Or just numeric features

### When NOT to use XGBoost
* These problems are usually better tackled by deep learning:
    * Image recognition
    * Computer vision
    * NLP
* Small datasets


### Regularization in XGBoost
* Regularization is a control on model complexity
* Want models that are both accurate and as simple as possible
* Regularization parameters in XGBoost: 
    * gamma - minimum loss reduction required for a split to occur
    * alpha - L1 regularization on leaf weights; larger values mean more regularization (the larger it is, the more leaf weights go to zero)
    * lambda - L2 regularization on leaf weights (a smoother penalty than L1)
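
The effect of the three parameters can be sketched with the leaf-weight and split-gain formulas from the XGBoost paper, where G and H are the sums of first and second derivatives of the loss over the samples in a leaf. The function names are hypothetical, and handling alpha by soft-thresholding G is an assumption about how the L1 term is applied, not code taken from the library.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf weight: alpha shrinks G toward zero, lambda inflates the denominator."""
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0  # L1 penalty large enough to zero the leaf out
    return -shrunk / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split; the split is only worthwhile if this is positive."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# Hypothetical leaf statistics: G = -4, H = 3
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=0.0))  # no L1 penalty
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=2.0))  # L1 shrinks the weight
print(leaf_weight(-4.0, 3.0, reg_lambda=3.0, reg_alpha=0.0))  # L2 shrinks it smoothly
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0, gamma=10.0))  # gamma rejects the split
```

Raising alpha eventually sets the weight to exactly zero, raising lambda only scales it down, and gamma acts as a fixed price each split has to pay.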


### XGBoost's Hyperparameters
* learning_rate (eta): 
    - step-size shrinkage applied to each new tree's contribution; lower values need more boosting rounds
* gamma:
    - min loss reduction to create a new tree split 
* lambda: 
    - L2 reg on leaf weights
* alpha:
    - L1 reg on leaf weights
* max_depth: 
    - max depth per tree
* subsample: 
    - fraction of training samples used per tree (between 0 and 1)
* colsample_bytree: 
    - fraction of features used per tree (between 0 and 1; also acts as a kind of regularization)
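
As a quick reference, the hyperparameters above can be collected into one parameter dictionary of the kind passed to `xgb.train` or `xgb.cv`. The values are illustrative assumptions rather than tuned recommendations, and `out_of_range` is a hypothetical helper written for this sketch. In the scikit-learn wrappers, `lambda` and `alpha` are spelled `reg_lambda` and `reg_alpha`.

```python
# Illustrative values only (assumptions, not tuned recommendations).
params = {
    "learning_rate": 0.1,      # alias of eta
    "gamma": 0.0,              # min loss reduction needed to split
    "lambda": 1.0,             # L2 penalty on leaf weights
    "alpha": 0.0,              # L1 penalty on leaf weights
    "max_depth": 4,            # max depth per tree
    "subsample": 0.8,          # fraction of rows sampled per tree
    "colsample_bytree": 0.8,   # fraction of features sampled per tree
}


def out_of_range(params):
    """Hypothetical helper: list the settings that fall outside their valid range."""
    problems = []
    if not 0.0 <= params["learning_rate"] <= 1.0:
        problems.append("learning_rate")
    for name in ("gamma", "lambda", "alpha", "max_depth"):
        if params[name] < 0:
            problems.append(name)
    for name in ("subsample", "colsample_bytree"):
        if not 0.0 < params[name] <= 1.0:
            problems.append(name)
    return problems


print(out_of_range(params))
```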


### Grid search and random search
* Grid search: 
    - Search exhaustively over a given set of hyperparameters.
    - Number of models = the product of the number of candidate values for each hyperparameter. 
    - Pick final model hyperparameter values that give best cross-validated evaluation metric value. 

* Random search: 
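    - Set the number of iterations you would like the random search to run. 
    - In each iteration, draw a random value for each hyperparameter from its candidate range, then train and evaluate a model with those values. 
    - After the last iteration, pick the hyperparameter values that gave the best cross-validated evaluation metric value. 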
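
The difference between the two strategies can be sketched without any ML library: grid search enumerates every combination, while random search draws a fixed number of candidates. The grid values below are assumptions for illustration, and this simple sampler may draw the same combination twice.

```python
import itertools
import random

param_grid = {
    "learning_rate": [0.01, 0.1, 0.5, 0.9],
    "n_estimators": [200],
    "subsample": [0.3, 0.5, 0.9],
}

# Grid search: every combination, so 4 * 1 * 3 candidates.
names = list(param_grid)
grid_candidates = [dict(zip(names, values))
                   for values in itertools.product(*(param_grid[n] for n in names))]

# Random search: a fixed budget of randomly drawn candidates.
n_iter = 5
rng = random.Random(123)
random_candidates = [{name: rng.choice(values) for name, values in param_grid.items()}
                     for _ in range(n_iter)]

print(len(grid_candidates), "grid candidates")
print(len(random_candidates), "random candidates")
```

With 4-fold cross-validation each candidate is fitted four times, so the grid costs 48 model fits while the random search costs 20 regardless of how large the search space is.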
    

### Pipelines in 


In [2]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Classification: Simple workflow

In [None]:
# Import xgboost
import xgboost as xgb

# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
# (random_state is the scikit-learn style name for the seed)
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, random_state=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

## Classification: Workflow with crossvalidation
- XGBoost gets its lauded performance and efficiency gains by using its own optimized data structure for datasets, called a DMatrix.
- In the previous example the input data was converted into a DMatrix on the fly, but to use XGBoost's built-in cross-validation (`xgb.cv`) you must first explicitly convert your data into a DMatrix.

In [None]:
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the DMatrix from X and y: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="error", # auc
                  as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print(((1-cv_results["test-error-mean"]).iloc[-1]))

`cv_results` stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. From `cv_results`, the final-round `test-error-mean` is extracted and converted into an accuracy, where accuracy is 1 - error.

### Regression: Common workflow

In [None]:
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBRegressor: xg_reg
# ('reg:squarederror' is the current name of the deprecated 'reg:linear' objective)
xg_reg = xgb.XGBRegressor(random_state=123, objective='reg:squarederror', n_estimators=10)

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_reg.predict(X_test)

# Compute the rmse: rmse
from sklearn.metrics import mean_squared_error  # not imported in the earlier cells
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

### L1 regularization in XGBoost example

In [None]:
import xgboost as xgb
import pandas as pd

boston_data = pd.read_csv('boston_data.csv')
X, y = boston_data.iloc[:, :-1], boston_data.iloc[:, -1]

boston_dmatrix = xgb.DMatrix(data=X, label=y)
# "reg:squarederror" is the current name of the deprecated "reg:linear" objective
params = {"objective": "reg:squarederror", "max_depth": 4}

l1_params = [1, 10, 100]
rmses_l1 = []

for reg in l1_params:
    params['alpha'] = reg
    cv_results = xgb.cv(dtrain=boston_dmatrix, params=params, nfold=4, 
                        num_boost_round=10, metrics='rmse', as_pandas=True, 
                        seed=123)
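    rmses_l1.append(cv_results['test-rmse-mean'].tail(1).values[0])

# Show the final-round test RMSE for each L1 value
print(pd.DataFrame(list(zip(l1_params, rmses_l1)), columns=["l1", "rmse"]))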



### Visualizing individual trees in the XGBoost model

In [None]:
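# plt is used below, so import matplotlib here
import matplotlib.pyplot as plt

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
# ("reg:squarederror" is the current name of the deprecated "reg:linear" objective)
params = {"objective":"reg:squarederror", "max_depth":2}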

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the first tree
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()

# Plot the fifth tree
xgb.plot_tree(xg_reg, num_trees=4)
plt.show()

# Plot the last tree sideways
xgb.plot_tree(xg_reg, num_trees=9, rankdir="LR")
plt.show()

### Feature importance in XGBoost

In [None]:
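# plt is used below, so import matplotlib here
import matplotlib.pyplot as plt

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
# ("reg:squarederror" is the current name of the deprecated "reg:linear" objective)
params = {"objective": "reg:squarederror", "max_depth": 4}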

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()

### Tune model example

In [None]:
import pandas as pd
import xgboost as xgb
import numpy as np

housing_data = pd.read_csv('blah.csv')
X, y = housing_data.iloc[:, :-1], housing_data.iloc[:, -1]
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# "reg:squarederror" is the current name of the deprecated "reg:linear" objective
tuned_params = {
    "objective": "reg:squarederror", 
    "colsample_bytree": 0.3,
    "learning_rate": 0.1, 
    "max_depth": 5
}

tuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=tuned_params, nfold=4, num_boost_round=200, metrics='rmse',
                               as_pandas=True, seed=123)

# Report the final-round test RMSE as a single number rather than a Series
print(f"Tuned rmse {tuned_cv_results_rmse['test-rmse-mean'].iloc[-1]}")

### Automate boosting round selection
Instead of hand-picking the number of boosting rounds, use early stopping: `early_stopping_rounds=10` stops training once the held-out metric has not improved for 10 consecutive rounds.

In [None]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params
# ("reg:squarederror" is the current name of the deprecated "reg:linear" objective)
params = {"objective":"reg:squarederror", "max_depth":4}

# Perform cross-validation with early stopping: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, metrics='rmse', seed=123, as_pandas=True, early_stopping_rounds=10, num_boost_round=50)

# Print cv_results
print(cv_results)
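
The stopping rule itself is simple enough to sketch in plain Python. This is an illustration of the idea, not XGBoost's implementation: the function name, the `patience` argument, and the per-round RMSE values are all hypothetical.

```python
def early_stop(scores, patience):
    """Return (best_round, stop_round) for a metric where lower is better.

    Training stops at the first round that is `patience` rounds past the best
    score seen so far without having improved on it.
    """
    best_round = 0
    for current_round, score in enumerate(scores):
        if score < scores[best_round]:
            best_round = current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    return best_round, len(scores) - 1  # never triggered: ran all rounds


# Hypothetical per-round test RMSE values
rmse_per_round = [3.0, 2.5, 2.2, 2.3, 2.25, 2.21, 2.4]
best_round, stop_round = early_stop(rmse_per_round, patience=3)
print("best round:", best_round, "stopped at round:", stop_round)
```

Here the best score occurs at round 2 (counting from 0) and training stops at round 5, after three rounds without improvement.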

### Grid Search example

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

housing_data = pd.read_csv("data.csv")
X, y = housing_data.iloc[:, :-1], housing_data.iloc[:, -1]
housing_dmatrix = xgb.DMatrix(data=X, label=y)

gbm_param_grid = {
    "learning_rate": [0.01, 0.1, 0.5, 0.9],
    "n_estimators": [200],
    "subsample": [0.3, 0.5, 0.9]
}

gbm = xgb.XGBRegressor()
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid, 
                        scoring='neg_mean_squared_error', cv=4, verbose=1)
grid_mse.fit(X, y)

print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

### Random search example

In [None]:
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

housing_data = pd.read_csv('data.csv')

X, y = housing_data.iloc[:, :-1], housing_data.iloc[:, -1]
housing_dmatrix = xgb.DMatrix(data=X, label=y)

gbm_param_grid = {
    "learning_rate": np.arange(0.05, 1.05, 0.05),
    "n_estimators": [200],
    "subsample": np.arange(0.05, 1.05, .05)
}

gbm = xgb.XGBRegressor()
randomized_mse = RandomizedSearchCV(estimator=gbm, param_distributions=gbm_param_grid,
                                    n_iter=25, scoring='neg_mean_squared_error',
                                    cv=4, verbose=1)
randomized_mse.fit(X, y)

print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

### Pipeline

In [None]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

names = ['crime', 'zone', 'industry', 'charles', 'no', 'rooms', 'age', 'distance', 
         'radial', 'tax', 'pupil', 'aam', 'lower', 'med_price']

data = pd.read_csv('boston_housing.csv', names=names)

X, y = data.iloc[:, :-1], data.iloc[:, -1]

# Pipeline takes a list of (name, step) tuples as its argument
rf_pipeline = Pipeline([
    ('st_scaler', StandardScaler()),
    ('rf_model', RandomForestRegressor())])

scores = cross_val_score(rf_pipeline, X, y, 
                         scoring='neg_mean_squared_error', cv=10)


final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))

print("Final RMSE: ", final_avg_rmse)