A Random Forest is an ensemble technique that combines a large number of decision trees.

Random Forests work on the principle of bagging, wherein each decision tree is built on a sample of the training data set drawn with replacement. The results from these multiple decision trees are then aggregated to come up with the final prediction. For classification, the mode of all the predictions is used, and for regression the mean of all predictions is taken as the final output.
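
The bagging procedure described above can be sketched by hand. This is a minimal illustration, not how any particular library implements it: the synthetic data from `make_classification`, the number of trees, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data stands in for a real training set (illustrative assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Every tree votes on every row; votes has shape (n_trees, n_samples).
votes = np.stack([t.predict(X) for t in trees]).astype(int)

# Classification: the mode (majority vote) of the trees' predictions.
bagged_pred = np.array([np.bincount(col, minlength=2).argmax() for col in votes.T])
```

For regression, the same loop would use `DecisionTreeRegressor` and replace the majority vote with `votes.mean(axis=0)`.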

* Random Forests work well with both categorical and numerical data. No scaling or transformation of variables is usually necessary.
* Random Forests implicitly perform feature selection and generate decorrelated decision trees. They do this by considering a random subset of features when building each tree (in most implementations, at each split). This also makes them a good choice when the data has a large number of features.
* Random Forests are fairly robust to outliers, because tree splits compare values against thresholds, which effectively bins the variables.
* Random Forests can handle both linear and non-linear relationships well.
* Random Forests generally provide high accuracy and balance the bias-variance trade-off well. Because the model averages the results of the many decision trees it builds, the variance of the individual trees is averaged out as well.
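
The points above about random feature subsets and implicit feature selection can be explored with Scikit-learn's `RandomForestClassifier`. The sketch below uses synthetic data and arbitrary settings (200 trees, `max_features="sqrt"`), all of which are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False, the 3 informative features are columns 0-2; the rest are noise.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # random subset of features considered at each split
    oob_score=True,       # score each tree on the rows left out of its bootstrap sample
    random_state=0,
)
forest.fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]  # feature indices, most important first

print("Out-of-bag accuracy:", forest.oob_score_)
print("Features ranked by importance:", ranking)
```

The out-of-bag score is a by-product of bagging: each tree never sees roughly a third of the rows, so those rows act as a built-in validation set.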

While Random Forests are great in a number of applications, there are certain places where they may not be an ideal choice. For instance:
* Random Forests aren't easily interpretable. Although they provide feature importances, they have no coefficients to inspect the way a linear regression does.
* Random Forests can be computationally intensive for large datasets.

XGBoost improves upon the capabilities of Random Forest by using the gradient boosting framework, in which trees are added sequentially and each new tree corrects the errors of the ensemble built so far. It parallelizes the construction of each tree and makes efficient use of hardware as it does so. XGBoost has the built-in capability to penalize complex models through regularization. It also comes with built-in cross-validation that can be used to determine the number of boosting iterations required in a run.
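
To make the contrast with bagging concrete, here is a bare-bones sketch of gradient boosting for squared error, built from Scikit-learn regression trees. It is not XGBoost's implementation (there is no regularization term, second-order information, or parallel split finding); the data, learning rate, and tree depth are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy sine wave as a toy regression problem (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
trees, train_mse = [], []

for _ in range(50):
    # For squared error, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Each new tree nudges the ensemble toward the remaining error.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    train_mse.append(np.mean((y - prediction) ** 2))

print(f"Training MSE after 1 round: {train_mse[0]:.4f}, after 50 rounds: {train_mse[-1]:.4f}")
```

Unlike a Random Forest, where trees are independent and can be averaged in any order, each tree here depends on all the trees before it.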

# XGBoost

While the Scikit-learn API might be more comfortable at first, you'll later find that the native API of XGBoost contains some excellent features that the former doesn't support.

We will use the Diamonds dataset throughout the tutorial. It is built into the Seaborn library. It has a nice combination of numeric and categorical features and over 50k observations, so we can comfortably showcase all the advantages of XGBoost.

In [4]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import xgboost as xgb

diamonds = sns.load_dataset("diamonds")
diamonds.shape
diamonds.head()
diamonds.describe()
diamonds.describe(exclude=np.number)

The two most popular classification objectives are:

* binary:logistic - binary classification (the target contains only two classes, e.g., cat or dog)
* multi:softprob - multi-class classification (more than two classes in the target, e.g., apple/orange/banana)

Performing binary and multi-class classification in XGBoost is almost identical, so we will go with the latter.
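
With multi:softprob, the model produces one probability per class for every row, and the predicted class is the one with the highest probability. The NumPy sketch below shows that last step in isolation; the raw scores are made-up numbers and `softmax` is a helper written for this example, not an XGBoost function.

```python
import numpy as np

def softmax(scores):
    """Turn per-class raw scores into probabilities that sum to 1 per row."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw scores for 2 rows and 3 classes (e.g. apple/orange/banana).
raw_scores = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

probs = softmax(raw_scores)    # shape (n_rows, n_classes)
labels = probs.argmax(axis=1)  # index of the most likely class per row
print(labels)                  # [0 2]
```

A binary:logistic model, by contrast, returns a single probability per row, which is usually thresholded at 0.5.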

We want to predict the cut quality of diamonds given their price and their physical measurements. 

In [None]:
from sklearn.preprocessing import OrdinalEncoder

X, y = diamonds.drop("cut", axis=1), diamonds[['cut']]

# Encode y to numeric
y_encoded = OrdinalEncoder().fit_transform(y)

# Extract text features
cats = X.select_dtypes(exclude=np.number).columns.tolist()

# Convert to pd.Categorical
for col in cats:
   X[col] = X[col].astype('category')

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, random_state=1, stratify=y_encoded)
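
It helps to know which integer each class was mapped to, since predictions come back as integers. The toy example below, using a handful of made-up `cut` values as plain strings, shows how a fitted `OrdinalEncoder` exposes the mapping and how to reverse it.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# A few hypothetical labels; the real notebook encodes the full `cut` column.
toy = pd.DataFrame({"cut": ["Ideal", "Premium", "Good", "Very Good", "Fair", "Ideal"]})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy)  # 2D float array, one column per input column

# For string columns the categories are sorted alphabetically,
# so the codes do not follow the quality order of the cuts.
print(encoder.categories_[0])

# inverse_transform maps integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
```

Keeping a reference to the fitted encoder (instead of calling `OrdinalEncoder().fit_transform(y)` inline) makes it possible to decode predictions later.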


# XGBoost DMatrix


In [None]:
# Create classification matrices
dtrain_clf = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_clf = xgb.DMatrix(X_test, y_test, enable_categorical=True)

# XGBoost Classification

During cross-validation, we are asking XGBoost to watch three classification metrics which report model performance from three different angles.

In [None]:
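# "gpu_hist" needs a CUDA-capable GPU; XGBoost 2.0+ deprecates it in favor of "tree_method": "hist" with "device": "cuda"
params = {"objective": "multi:softprob", "tree_method": "gpu_hist", "num_class": 5}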
n = 1000

results = xgb.cv(
   params, dtrain_clf,
   num_boost_round=n,
   nfold=5,
   metrics=["mlogloss", "auc", "merror"],
)

In [None]:
results.keys()

# To get the best AUC score, take the maximum of the test-auc-mean column:
results['test-auc-mean'].max()
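
The frame returned by `xgb.cv` has one row per boosting round, so it can also tell you at which round a metric was best. The sketch below uses a small hand-written frame with invented numbers in place of real cross-validation output; only the column naming follows the pattern used above.

```python
import pandas as pd

# Hypothetical values shaped like xgb.cv output (one row per boosting round).
results = pd.DataFrame({
    "test-mlogloss-mean": [0.95, 0.70, 0.62, 0.64, 0.69],
    "test-auc-mean": [0.88, 0.91, 0.93, 0.93, 0.92],
})

# Log loss is minimized, AUC is maximized.
best_round = int(results["test-mlogloss-mean"].idxmin()) + 1  # rows are 0-indexed
best_auc = results["test-auc-mean"].max()

print(best_round)  # 3
print(best_auc)    # 0.93
```

When the test loss starts rising again, as in the last two rows here, additional rounds are overfitting; passing `early_stopping_rounds` to `xgb.cv` stops the run once that happens.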

# Evaluation

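During the boosting rounds, the model object has learned all the patterns of the training set it possibly can. Now, we must measure its performance by testing it on unseen data. That's where the `dtest_reg` DMatrix comes into play. Note that this example evaluates a regression model: `model` and `dtest_reg` are not defined in the classification cells above and would come from a regression run.

Generating predictions is called inference, and comparing them against the true values is model evaluation. Once you generate predictions with `predict`, you pass them to the `mean_squared_error` function of Scikit-learn to compare against `y_test`: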

In [None]:
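import numpy as np
from sklearn.metrics import mean_squared_error

# `model` (a trained Booster) and `dtest_reg` (a test DMatrix) are assumed to
# come from a regression run; `y_test` must hold that run's numeric targets.
preds = model.predict(dtest_reg)

# Take the square root of the MSE ourselves; the `squared=False` argument
# is deprecated in recent Scikit-learn releases.
rmse = np.sqrt(mean_squared_error(y_test, preds))

print(f"RMSE of the base model: {rmse:.3f}")
# RMSE of the base model: 543.203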