# Extreme Gradient Boosting with XGBoost
### Definition
* Boosting converts a collection of weak learners into a strong learner. 
    * **weak learner**: an ML algorithm that performs only slightly better than chance (just over 50% accuracy on a balanced binary problem)

### How it works
1. Iteratively learn a set of weak models on subsets of the data
2. Weight each weak learner's prediction according to that learner's performance
3. Combine the weighted predictions to obtain a single weighted prediction
4. ... that is much better than the individual predictions themselves!
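
The steps above can be sketched in plain Python. This is a minimal illustration that uses AdaBoost-style reweighting of samples (not XGBoost's gradient boosting, and not literal data subsets) with one-dimensional decision stumps as the weak learners. All function names and the toy data are hypothetical.

```python
import math

def stump_predict(threshold, polarity, x):
    """Weak learner: predict `polarity` left of the threshold, `-polarity` otherwise."""
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)) + [max(xs) + 1]:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(threshold, polarity, x) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    """Fit n_rounds stumps; each gets a vote (alpha) based on its weighted error."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # better learner -> larger vote
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, then renormalize
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Combine the weighted votes into a single +1 / -1 prediction."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

def accuracy(predict, xs, ys):
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)

# Toy data that no single stump can separate
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

one_stump = adaboost(xs, ys, 1)
three_stumps = adaboost(xs, ys, 3)
print(accuracy(lambda x: ensemble_predict(one_stump, x), xs, ys))
print(accuracy(lambda x: ensemble_predict(three_stumps, x), xs, ys))
```

On this toy data a single stump gets 7 of the 10 points right, while the weighted vote of three stumps classifies all 10 correctly.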

### Advantages
* Speed and performance
* Core algorithm is parallelizable (good for big data)
* **Frequently outperforms single-algorithm methods**
* State-of-the-art performance in many ML tasks

### Cross-validation
* Is a robust method for estimating the performance of a model on unseen data
* Generates many non-overlapping train/test splits on the training data
* Reports the average test set performance across all data splits
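
A minimal sketch of that procedure in plain Python, assuming unshuffled contiguous folds and a trivial majority-class "model". The helper names (`k_fold_indices`, `cross_val_score`, `fit_majority`, `score_accuracy`) are hypothetical and exist only for illustration.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs whose test folds never overlap."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_val_score(fit, score, X, y, n_folds):
    """Fit on each training split, score on the held-out fold, return the mean."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), n_folds):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model,
                            [X[i] for i in test_idx],
                            [y[i] for i in test_idx]))
    return sum(scores) / len(scores)

def fit_majority(X_train, y_train):
    """'Model' that always predicts the most common training label."""
    return max(set(y_train), key=y_train.count)

def score_accuracy(model, X_test, y_test):
    return sum(label == model for label in y_test) / len(y_test)

X = list(range(9))
y = [0, 0, 1, 0, 0, 0, 0, 1, 0]
mean_accuracy = cross_val_score(fit_majority, score_accuracy, X, y, n_folds=3)
print(mean_accuracy)
```

Real workflows usually shuffle (and often stratify) the data before splitting; that is left out here to keep the sketch short.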

### Common loss functions in XGBoost
* Loss function names in xgboost:
    * reg:linear - use for regression problems (named reg:squarederror in newer XGBoost releases)
    * reg:logistic - use for classification problems when you want just the decision, not the probability
    * binary:logistic - use when you want the probability rather than just the decision
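
To make the probability-versus-decision distinction concrete, here is a small sketch of the textbook squared-error and logistic losses, and of how a probability becomes a decision. It is not XGBoost's internal implementation; the function names, the example score, and the 0.5 threshold are assumptions.

```python
import math

def squared_error(y_true, y_pred):
    """Per-sample loss for regression."""
    return (y_true - y_pred) ** 2

def sigmoid(margin):
    """Map a raw model score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

def log_loss(y_true, prob):
    """Per-sample logistic loss for a 0/1 label and a predicted probability."""
    return -(y_true * math.log(prob) + (1 - y_true) * math.log(1 - prob))

margin = 0.8                 # hypothetical raw score from a model
prob = sigmoid(margin)       # the probability
decision = int(prob > 0.5)   # the decision, thresholded at 0.5

print(prob, decision, log_loss(1, prob))
```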

### Base learners
* XGBoost involves creating a meta-model that is composed of many individual models that combine to give a final prediction
    * Individual models = base learners
    * Want base learners that, when combined, create a final prediction that is non-linear
    * Each base learner should be good at distinguishing or predicting different parts of the dataset
    * Two kinds of base learners:
        * tree
        * linear
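
A small plain-Python sketch of why the choice of base learner matters: a sum of linear learners is still linear, while a sum of tree stumps is a step function. The helper names are hypothetical. In XGBoost itself the choice is made with the `booster` parameter (`gbtree` or `gblinear`), which this sketch does not use.

```python
def linear_learner(w, b):
    """Base learner of the form w * x + b."""
    return lambda x: w * x + b

def stump_learner(threshold, left, right):
    """Depth-1 tree: one value left of the threshold, another value otherwise."""
    return lambda x: left if x < threshold else right

def combine(learners):
    """Meta-model: the final prediction is the sum of the base learners."""
    return lambda x: sum(f(x) for f in learners)

linear_sum = combine([linear_learner(1.0, 0.0), linear_learner(-0.5, 2.0)])
tree_sum = combine([stump_learner(1.0, 0.0, 1.0), stump_learner(2.0, 0.0, 1.0)])

for x in (0.0, 0.5, 1.0, 2.0):
    print(x, linear_sum(x), tree_sum(x))
```

The combined linear model changes by the same amount for every equal step in `x`, whereas the combined stumps jump at the thresholds, which is what lets tree ensembles fit non-linear patterns.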


### When to use XGBoost
* You have a large number of training samples
    * Greater than 1000 training samples and fewer than 100 features
    * The number of features < number of training samples
* You have a mixture of categorical and numeric features
    * Or just numeric features

### When to NOT use XGBoost
* These problems are usually better tackled by deep learning:
    * Image recognition
    * Computer vision
    * NLP
* Small datasets




In [2]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Classification: Simple workflow

In [None]:
# Import xgboost
import xgboost as xgb

# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, random_state=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

## Classification: Workflow with cross-validation
- XGBoost gets its lauded performance and efficiency gains by using its own optimized data structure for datasets, called a DMatrix.
- In the previous exercise the input datasets were converted into DMatrix data on the fly, but when you use XGBoost's built-in cross-validation (xgb.cv) you must first explicitly convert your data into a DMatrix.

In [None]:
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the DMatrix from X and y: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="error", # alternative metric: "auc"
                  as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print(((1-cv_results["test-error-mean"]).iloc[-1]))

cv_results stores the mean and standard deviation of the training and test error for each boosting round (each tree built) as a DataFrame. The final round's 'test-error-mean' is extracted from cv_results and converted into an accuracy, where accuracy = 1 - error.

## Regression: Common workflow

In [None]:
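# mean_squared_error is needed for the RMSE computed below
from sklearn.metrics import mean_squared_error

# Create the training and test sets
# (X and y are assumed to hold the regression features and target here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBRegressor: xg_reg
# ('reg:squarederror' is the current name of the objective formerly called 'reg:linear')
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10, random_state=123)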

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_reg.predict(X_test)

# Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
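
For reference, the RMSE computed above is the square root of the mean squared difference between targets and predictions. A worked example in plain Python, with made-up numbers and a hypothetical helper name:

```python
import math

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

y_true = [3.0, 5.0, 2.0, 7.0]   # made-up targets
y_pred = [2.0, 5.0, 4.0, 8.0]   # made-up predictions

# Squared errors are 1, 0, 4, 1, so the mean is 1.5 and the RMSE is sqrt(1.5)
print("RMSE: %f" % root_mean_squared_error(y_true, y_pred))
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones.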