# Gradient Boosting: XGBoost

This is a sample tutorial on how to use the XGBoost library.

XGBoost can be installed as:

```bash
sudo pip install xgboost
```
XGBoost can be upgraded as:
```bash
sudo pip install --upgrade xgboost
```

The code in this notebook is based on the following book:

### XGBoost With Python by Jason Brownlee

https://machinelearningmastery.com/xgboost-with-python/

In [2]:
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


We are going to use the Pima Indians dataset (from Lecture 6):

https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

This dataset comprises 8 input variables that describe medical details of patients and one output variable that indicates whether the patient will have an onset of diabetes within 5 years.

In [1]:
filename = "../datasets/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# df stands for "Data Frame"
df = pd.read_csv(filename, names=names)

# Split the data into X and Y
array = df.values
X = array[:,0:8]
Y = array[:,8]


Split the data into a training and test set:

In [None]:
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

Train the XGBoost Model:

In [None]:
# fit model to training data
model = XGBClassifier()
model.fit(X_train, y_train)

We can see the parameters used in a trained model by printing the model, for example:

In [None]:
print(model)

We can make predictions with the model:

In [None]:
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

In [None]:
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

In [None]:
print(y_pred)

In [None]:
print(predictions)
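
As a rough illustration of what the evaluation step computes, the sketch below thresholds probabilities at 0.5 and measures the fraction of predictions that match the true labels, which is what `accuracy_score` reports. The labels, the probabilities, and the helper names `threshold` and `accuracy` are hypothetical; they are not part of scikit-learn or XGBoost and do not come from the Pima dataset.

```python
# Hypothetical values for illustration only (not taken from the Pima dataset).
y_true = [1, 0, 1, 1, 0]
y_prob = [0.91, 0.20, 0.45, 0.73, 0.08]  # assumed P(class == 1) per sample


def threshold(probs, cutoff=0.5):
    """Map probabilities to hard 0/1 labels."""
    return [1 if p >= cutoff else 0 for p in probs]


def accuracy(y_true, y_pred):
    """Fraction of positions where the label and the prediction agree."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


labels = threshold(y_prob)
print(labels)
print("Accuracy: %.2f%%" % (accuracy(y_true, labels) * 100.0))
```

Since `XGBClassifier.predict` already returns class labels, the `round` call in the cell above leaves the values unchanged.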

## Data Preparation for Gradient Boosting

Internally, XGBoost models represent all problems as a regression predictive modeling problem
that only takes numerical values as input. If your data is in a different form, it must be prepared
into the expected format.

We are going to use the IRIS dataset: http://archive.ics.uci.edu/ml/datasets/Iris
This dataset stores the class as a categorical (string) value. Since XGBoost expects numeric values, the dataset cannot be used as-is: its class must first be converted into numbers.

In [None]:
filename = "../datasets/iris.data"
names = ['sepal length in cm', 'sepal width in cm', 'petal length in cm', 'petal width in cm', 'class']
# df stands for "Data Frame"
df = pd.read_csv(filename, header=None, names=names)

# Split the data into X and Y
array = df.values
X = array[:,0:4]
Y = array[:,4]

# from sklearn import datasets
# iris = datasets.load_iris()
# X = iris.data[:, :2]  # we only take the first two features.
# Y = iris.target

In [None]:
print(Y)

Load the encoder and use it to convert the string class values into integers:

In [None]:
from sklearn.preprocessing import LabelEncoder
# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

In [None]:
print(label_encoded_y)
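
To make the encoding step less opaque, here is a minimal pure-Python sketch of the idea behind `LabelEncoder`: the distinct labels are sorted, and each label is replaced by its position in that sorted list. The function names `fit_label_encoder`, `transform`, and `inverse_transform` below are hypothetical helpers written for this illustration, not the scikit-learn API, and the four sample labels are chosen by hand.

```python
def fit_label_encoder(labels):
    """Return the sorted distinct labels; a label's index is its integer code."""
    return sorted(set(labels))


def transform(classes, labels):
    """Replace each label with its integer code."""
    index = {label: code for code, label in enumerate(classes)}
    return [index[label] for label in labels]


def inverse_transform(classes, codes):
    """Map integer codes back to the original labels."""
    return [classes[code] for code in codes]


# A few hand-picked Iris class names, in arbitrary order.
y = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]
classes = fit_label_encoder(y)
encoded = transform(classes, y)
print(classes)
print(encoded)
print(inverse_transform(classes, encoded))
```

The inverse mapping is useful after prediction, when integer class codes need to be reported as the original class names.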

Train the XGBoost Model:

In [None]:
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, label_encoded_y,
test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
print(model)

Make predictions and print the accuracy:

In [None]:
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

### Notice
Notice how the XGBoost model is configured to automatically model the multiclass
classification problem using the `multi:softprob` objective, a variation on the softmax loss
function to model class probabilities. This suggests that, internally, the output class is
automatically converted into a one-hot type encoding.
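
The softmax idea mentioned above can be sketched in a few lines: one raw score per class is turned into a probability per class, and the predicted class is the one with the highest probability. This is only an illustration of the softmax function itself, not XGBoost's internal code; the raw scores are hypothetical.

```python
import math


def softmax(scores):
    """Convert one raw score per class into probabilities that sum to 1."""
    top = max(scores)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [2.0, 0.5, -1.0]  # hypothetical scores for three classes
probs = softmax(raw_scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print(probs)
print(predicted_class)
```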

In [None]:
print(model)

## Plot a Single XGBoost Decision Tree

In [None]:
from xgboost import XGBClassifier
from xgboost import plot_tree
from matplotlib import pyplot as plt
%matplotlib notebook

In [None]:
filename = "../datasets/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# df stands for "Data Frame"
df = pd.read_csv(filename, names=names)

# Split the data into X and Y
array = df.values
X = array[:,0:8]
Y = array[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, Y)
# plot single tree
plot_tree(model)
plt.show()

Plot the $4^{th}$ tree (`num_trees` is a zero-based index, so the value is 3):

In [None]:
plot_tree(model, num_trees=3, rankdir='LR')
plt.show()

### Feature importance

In [None]:
print(model.feature_importances_)

In [None]:
# plot
from xgboost import plot_importance
plot_importance(model)
plt.show()

### Notice

It is interesting to check the scikit-learn class `SelectFromModel`, which selects features based on their importance scores.

In [None]:
from sklearn.feature_selection import SelectFromModel
from numpy import sort

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
  # select features using threshold
  selection = SelectFromModel(model, threshold=thresh, prefit=True)
  select_X_train = selection.transform(X_train)
  # train model
  selection_model = XGBClassifier()
  selection_model.fit(select_X_train, y_train)
  # eval model
  select_X_test = selection.transform(X_test)
  y_pred = selection_model.predict(select_X_test)
  predictions = [round(value) for value in y_pred]
  accuracy = accuracy_score(y_test, predictions)
  print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1],accuracy*100.0))
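
The loop above relies on the threshold rule of `SelectFromModel`: a feature is kept when its importance is greater than or equal to the threshold. The sketch below imitates that rule in plain Python so the effect of each threshold is easy to see. The importances, the two data rows, and the helper `select_features` are hypothetical and exist only for this illustration.

```python
# Hypothetical importance scores for five features.
importances = [0.10, 0.25, 0.05, 0.40, 0.20]

# Two hypothetical data rows with five columns each.
rows = [[1, 2, 3, 4, 5],
        [6, 7, 8, 9, 10]]


def select_features(rows, importances, threshold):
    """Keep the columns whose importance is >= threshold."""
    keep = [i for i, imp in enumerate(importances) if imp >= threshold]
    selected = [[row[i] for i in keep] for row in rows]
    return selected, keep


# Using each importance as a threshold drops one more feature per step.
for thresh in sorted(importances):
    selected, keep = select_features(rows, importances, thresh)
    print("Thresh=%.3f, n=%d, columns=%s" % (thresh, len(keep), keep))
```

Sorting the importances and using each one as a threshold therefore evaluates every feature subset from "all features" down to "only the most important feature".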

## Monitoring Training Performance

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_metric="error", eval_set=eval_set, verbose=True)

XGBoost supports a suite of evaluation metrics, including but not limited to:
 * `rmse` for root mean squared error.
 * `mae` for mean absolute error.
 * `logloss` for binary logarithmic loss and `mlogloss` for multiclass log loss (cross entropy).
 * `error` for classification error.
 * `auc` for area under ROC curve.
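
To make these names concrete, the following sketch computes four of the listed metrics by hand using their textbook definitions (`auc` is omitted for brevity). It is a plain-Python illustration, not XGBoost's implementation; the labels and probabilities are hypothetical, and the `eps` clipping in `logloss` is an assumption added to avoid taking the logarithm of zero.

```python
import math

# Hypothetical binary labels and predicted probabilities of class 1.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def logloss(y_true, y_prob, eps=1e-15):
    """Binary logarithmic loss (cross entropy)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y_true)


def error(y_true, y_prob, cutoff=0.5):
    """Classification error: share of samples on the wrong side of the cutoff."""
    wrong = sum(1 for t, p in zip(y_true, y_prob) if (1 if p > cutoff else 0) != t)
    return wrong / len(y_true)


for name, metric in [("rmse", rmse), ("mae", mae), ("logloss", logloss), ("error", error)]:
    print("%s: %.4f" % (name, metric(y_true, y_prob)))
```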

### Learning curves

In [None]:
results = model.evals_result()
print(results)

In [None]:
model = XGBClassifier()
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=False)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# retrieve performance metrics
results = model.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)
# plot log loss
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
plt.ylabel('Log Loss')
plt.title('XGBoost Log Loss')
plt.show()
# plot classification error
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
plt.ylabel('Classification Error')
plt.title('XGBoost Classification Error')
plt.show()
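
The dictionary returned by `evals_result()` is nested: the first key names the evaluation set (`validation_0`, `validation_1`, ...) and the second key names the metric, with one value per boosting round. As a small sketch of how such curves can be read without a plot, the code below finds the round with the lowest test log loss and the train/test gap per round. The numbers are hypothetical and only mimic the shape of the real dictionary.

```python
# Hypothetical values that mimic the shape of model.evals_result().
results = {
    "validation_0": {"logloss": [0.60, 0.50, 0.42, 0.36, 0.31]},  # train
    "validation_1": {"logloss": [0.62, 0.55, 0.52, 0.53, 0.56]},  # test
}

train_curve = results["validation_0"]["logloss"]
test_curve = results["validation_1"]["logloss"]

# Round (zero-based) at which the test log loss is lowest.
best_round = min(range(len(test_curve)), key=test_curve.__getitem__)

# A gap that keeps growing while the train loss falls is a sign of overfitting.
gap = [test - train for train, test in zip(train_curve, test_curve)]

print("Best round:", best_round)
print("Train/test gap per round:", gap)
```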

### Early stopping

In [None]:
model = XGBClassifier()
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_metric=["logloss"], eval_set=eval_set, verbose=True, early_stopping_rounds=10)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# retrieve performance metrics
results = model.evals_result()
epochs = len(results['validation_0']['logloss'])
x_axis = range(0, epochs)
# plot log loss
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
plt.ylabel('Log Loss')
plt.title('XGBoost Log Loss')
plt.show()
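
The stopping rule behind `early_stopping_rounds` can be sketched in plain Python: training stops once the monitored metric has not improved for the given number of consecutive rounds. When `eval_set` has several entries, XGBoost monitors the last one (here the test set). The helper `early_stopping_round`, its `patience` parameter, and the scores are hypothetical; this is a simplified illustration, not XGBoost's implementation.

```python
def early_stopping_round(scores, patience):
    """Return (stop_round, best_round) for a metric that should be minimised.

    Rounds are zero-based. Iteration stops at the first round where the
    best score is `patience` rounds old; otherwise it runs to the end.
    """
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(scores):
        if score < best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            return round_idx, best_round
    return len(scores) - 1, best_round


# Hypothetical test log loss per boosting round.
test_logloss = [0.62, 0.55, 0.52, 0.53, 0.56, 0.54, 0.58]

stop_round, best_round = early_stopping_round(test_logloss, patience=3)
print("Stopped at round %d, best round was %d" % (stop_round, best_round))
```

A larger patience tolerates more noise in the validation curve at the cost of extra training rounds.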