# MNIST handwritten digits classification with ensembles of decision trees 

In this notebook, we'll classify MNIST digits on a GPU using two different ensembles of decision trees: [random forest](https://rapidsai.github.io/projects/cuml/en/0.11.0/api.html#cuml.ensemble.RandomForestClassifier) and [gradient boosted trees](https://xgboost.readthedocs.io/en/latest/). We use the [RAPIDS](https://rapids.ai/) libraries (cuDF, cuML) and [XGBoost](https://xgboost.readthedocs.io/en/latest/).

First, the needed imports. 

In [None]:
%matplotlib inline

from pml_utils import get_mnist, show_failures

import cudf
import numpy as np
import pandas as pd

from cuml.ensemble import RandomForestClassifier
import xgboost as xgb

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Then we load the MNIST data. The first time, the data needs to be downloaded, which can take a while. The data is stored as NumPy arrays in host (CPU) memory.

We also convert `y_train` and `y_test` from strings to integers. 

In [None]:
X_train, y_train, X_test, y_test = get_mnist('MNIST')

y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)

print()
print('MNIST data loaded: train:',len(X_train),'test:',len(X_test))
print('X_train:', type(X_train), 'shape:', X_train.shape)
print('y_train:', type(y_train), 'shape:', y_train.shape)
print('X_test:', type(X_test), 'shape:', X_test.shape)
print('y_test:', type(y_test), 'shape:', y_test.shape)

## Random forest

Random forest is an ensemble (or a group; hence the name *forest*) of decision trees, obtained by introducing randomness into the tree generation. The prediction of the random forest is obtained by combining the predictions of the individual trees: *averaging* for regression, and voting (or averaging class probabilities) for classification.
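
A minimal sketch of how per-tree class predictions can be combined by majority vote. It is plain Python with made-up tree outputs, and the helper `majority_vote` is hypothetical; it is not how cuML implements the step internally.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine per-tree class predictions for one sample by majority vote.

    Ties are broken in favour of the class that was seen first.
    """
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]

# Made-up predictions of five trees for three test digits.
per_tree = [
    [7, 2, 1],   # tree 0
    [7, 2, 7],   # tree 1
    [9, 2, 1],   # tree 2
    [7, 3, 1],   # tree 3
    [7, 2, 1],   # tree 4
]

# zip(*per_tree) regroups the predictions sample by sample.
forest_prediction = [majority_vote(votes) for votes in zip(*per_tree)]
print(forest_prediction)
```

Individual trees disagree on every sample here, but the vote smooths out the isolated mistakes.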

### Data

Let's convert our training data to cuDF DataFrames in device (GPU) memory. 

We do not need to convert the test data, as we will evaluate it using the CPU.

In [None]:
cu_X_train = cudf.DataFrame.from_pandas(pd.DataFrame(X_train.astype(np.float32)))
cu_y_train = cudf.Series(y_train.astype(np.int32))

print('cu_X_train:', type(cu_X_train), 'shape:', cu_X_train.shape)
print('cu_y_train:', type(cu_y_train), 'shape:', cu_y_train.shape)

### Learning

Random forest classifiers are quick to train, quite robust to hyperparameter values, and often work relatively well.

We limit the parameters `n_estimators` and `max_depth` to speed up inference since we will need to run it using a CPU.
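
To see why these two parameters bound the inference cost, here is a rough back-of-the-envelope sketch. It assumes binary trees and that `max_depth` counts the splits on the longest root-to-leaf path; the helper names are hypothetical.

```python
def max_node_visits(n_estimators, max_depth):
    """Upper bound on the split tests needed to classify one sample.

    Each tree is traversed once from the root, following at most
    max_depth splits before reaching a leaf.
    """
    return n_estimators * max_depth

def max_leaves(n_estimators, max_depth):
    """Upper bound on the total number of leaves in the forest."""
    return n_estimators * 2 ** max_depth

# Settings used in this notebook.
print(max_node_visits(10, 12))
print(max_leaves(10, 12))
```

Per-sample work grows linearly in both parameters, while the model size can grow exponentially with `max_depth`.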

In [None]:
%%time

n_estimators = 10
max_depth = 12
clf_rf = RandomForestClassifier(n_estimators=n_estimators,
                                max_depth=max_depth)
clf_rf.fit(cu_X_train, cu_y_train)

### Inference

In the cuML version used here, multi-class inference does not yet seem to work on GPUs, so we predict the classes for the test data using the CPU.

In [None]:
%%time

pred_rf = clf_rf.predict(X_test, predict_model='CPU')
print('Predicted', len(pred_rf), 'digits with accuracy:',
      accuracy_score(y_test, pred_rf))

#### Failure analysis

The random forest classifier worked quite well, so let's take a closer look.

Here are the first 10 test digits that the random forest model misclassified:

In [None]:
show_failures(pred_rf, y_test, X_test)

We can use `show_failures()` to inspect failures in more detail. For example:

* show failures in which the true class was "5":

In [None]:
show_failures(pred_rf, y_test, X_test, trueclass=5)

* show failures in which the prediction was "0":

In [None]:
show_failures(pred_rf, y_test, X_test, predictedclass=0)

* show failures in which the true class was "0" and the prediction was "2":

In [None]:
show_failures(pred_rf, y_test, X_test, trueclass=0, predictedclass=2)
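
The filtering behind these calls can be sketched in plain Python. The function `failure_indices` below is a hypothetical stand-in written for illustration, not the actual `pml_utils.show_failures()` implementation, and the labels are made up.

```python
def failure_indices(predictions, truth, trueclass=None, predictedclass=None, limit=10):
    """Return indices of misclassified samples, optionally filtered by class."""
    selected = []
    for i, (pred, true) in enumerate(zip(predictions, truth)):
        if pred == true:
            continue                      # correctly classified, not a failure
        if trueclass is not None and true != trueclass:
            continue                      # wrong true class for this query
        if predictedclass is not None and pred != predictedclass:
            continue                      # wrong predicted class for this query
        selected.append(i)
        if len(selected) == limit:
            break                         # only the first `limit` failures
    return selected

# Made-up labels for seven samples.
truth = [5, 0, 4, 1, 9, 2, 0]
preds = [3, 0, 4, 7, 9, 2, 2]

print(failure_indices(preds, truth))
print(failure_indices(preds, truth, trueclass=5))
print(failure_indices(preds, truth, trueclass=0, predictedclass=2))
```

The returned indices could then be used to select the corresponding rows of `X_test` for plotting.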

#### Confusion matrix, accuracy, precision, and recall

We can also compute the confusion matrix to see which digits get mixed the most, and look at classification accuracies separately for each class:

In [None]:
labels = range(10)
print('Confusion matrix (rows: true classes; columns: predicted classes):')
print()
cm = confusion_matrix(y_test, pred_rf, labels=labels)
print(cm)
print()

print('Classification accuracy for each class:')
print()
for i, acc in enumerate(cm.diagonal() / cm.sum(axis=1)):
    print("%d: %.4f" % (i, acc))
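
As a small worked example of the same computation, the sketch below builds a confusion matrix by hand for three classes and divides each diagonal entry by its row sum. The labels are made up and the helper names are hypothetical. Note that this per-class accuracy is the same quantity as the recall of each class.

```python
def confusion_counts(truth, predictions, n_classes):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for true, pred in zip(truth, predictions):
        cm[true][pred] += 1
    return cm

def per_class_accuracy(cm):
    """Diagonal entry divided by the row sum, for each true class.

    Assumes every class occurs at least once among the true labels.
    """
    return [row[i] / sum(row) for i, row in enumerate(cm)]

# Made-up labels for nine samples and three classes.
truth = [0, 0, 0, 1, 1, 2, 2, 2, 2]
preds = [0, 0, 1, 1, 1, 2, 2, 0, 2]

cm = confusion_counts(truth, preds, n_classes=3)
for row in cm:
    print(row)
print(per_class_accuracy(cm))
```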

Precision and recall for each class:

In [None]:
print(classification_report(y_test, pred_rf, labels=labels))
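
For reference, precision and recall can be read directly off a confusion matrix: precision divides the diagonal entry by its column sum, recall by its row sum. A minimal sketch with a made-up 3-class matrix and a hypothetical helper name:

```python
def precision_recall(cm):
    """Return a (precision, recall) pair for each class of a confusion matrix.

    cm[i][j] counts samples of true class i predicted as class j.
    A class that is never predicted (or never occurs) gets 0.0 here.
    """
    n = len(cm)
    result = []
    for k in range(n):
        true_positives = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))   # column sum
        actual_k = sum(cm[k])                           # row sum
        precision = true_positives / predicted_k if predicted_k else 0.0
        recall = true_positives / actual_k if actual_k else 0.0
        result.append((precision, recall))
    return result

# Made-up confusion matrix (rows: true classes; columns: predicted classes).
cm_example = [[2, 1, 0],
              [0, 2, 0],
              [1, 0, 3]]

for k, (p, r) in enumerate(precision_recall(cm_example)):
    print("class %d: precision %.4f, recall %.4f" % (k, p, r))
```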

## Gradient boosted trees (XGBoost)

Gradient boosted trees (or extreme gradient boosted trees) are another way of constructing ensembles of decision trees, using the *boosting* framework. Here we use the GPU-accelerated [XGBoost](http://xgboost.readthedocs.io/en/latest/) library to train gradient boosted trees to classify MNIST digits.
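
The core idea of boosting is that trees are added one at a time, each trained to correct the errors of the ensemble built so far. The following is a deliberately simplified sketch of that idea: 1-D regression with squared loss and depth-1 trees (stumps), in plain Python. It is not XGBoost's algorithm, which among other things adds regularization and uses second-order gradient information; all names and data are made up.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    # Skip the largest value so that the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Fit stumps sequentially, each one to the current residuals."""
    base = sum(ys) / len(ys)              # start from the mean prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared loss, the residuals are the negative gradient.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up training data: y = x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]

for n in (1, 10, 50):
    print(n, "rounds, training MSE:", mse(boost(xs, ys, n_rounds=n), xs, ys))
```

The training error shrinks as rounds are added, which is also why the number of boosting rounds is an important hyperparameter to tune.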

### Data

We begin by converting our training and test data to XGBoost's internal `DMatrix` data structures.

In [None]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

### Learning

XGBoost has been used to obtain record-breaking results in many machine learning competitions, but it has quite a lot of hyperparameters that need to be carefully tuned to get the best performance.

For more information, see the documentation for [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html).

In [None]:
# instantiate params
params = {}

# general params
general_params = {'verbosity': 2}
params.update(general_params)

# booster params
booster_params = {'tree_method': 'gpu_hist'}
params.update(booster_params)

# learning task params
learning_task_params = {'objective': 'multi:softmax', 'num_class': 10}
params.update(learning_task_params)

print(params)
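
With the `multi:softmax` objective, the model produces one raw score per class and the prediction is the class with the highest score; the related `multi:softprob` objective returns the per-class probabilities instead. A minimal sketch of that last step, in plain Python with made-up scores and hypothetical helper names:

```python
import math

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(scores):
    """Index of the most probable class."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)

# Made-up raw scores for the ten digit classes of one sample.
scores = [0.1, -1.2, 0.3, 2.5, 0.0, 0.4, -0.7, 0.2, 1.1, -0.3]

print(["%.3f" % p for p in softmax(scores)])
print("predicted digit:", predict_class(scores))
```

Because softmax preserves the ordering of the scores, the predicted class is the same whether it is taken from the raw scores or from the probabilities.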

We specify the number of boosted trees and are then ready to train our gradient boosted trees model.

In [None]:
%%time

num_round = 100
clf_xgb = xgb.train(params, dtrain, num_round)

### Inference

Inference is also run on the GPU, so it should be rather fast.

In [None]:
%%time

pred_xgb = clf_xgb.predict(dtest)
print('Predicted', len(pred_xgb), 'digits with accuracy:', accuracy_score(y_test, pred_xgb))

You can also use `show_failures()` to inspect the failures, and calculate the confusion matrix and other metrics as was done with the random forest above.

## Model tuning

Study the documentation of the different decision tree models used in this notebook ([cuML random forest](https://rapidsai.github.io/projects/cuml/en/0.11.0/api.html#cuml.ensemble.RandomForestClassifier) and [XGBoost gradient boosted trees](https://xgboost.readthedocs.io/en/latest/)), and experiment with different hyperparameter values.

Report the highest classification accuracy you manage to obtain for each model type. Also note the parameters you used, so that others can try to reproduce your results.
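
One simple way to organize such experiments is to enumerate a grid of candidate settings and train one model per combination. The sketch below only generates the combinations; the helper `param_grid` is hypothetical, and the candidate values for the XGBoost parameters `max_depth` and `eta` are arbitrary examples rather than recommendations.

```python
from itertools import product

def param_grid(**choices):
    """Yield one dict per combination of the given candidate values."""
    names = sorted(choices)
    for values in product(*(choices[name] for name in names)):
        yield dict(zip(names, values))

grid = list(param_grid(max_depth=[4, 6, 8], eta=[0.1, 0.3]))
for candidate in grid:
    # Each candidate could be merged into `params` (params.update(candidate))
    # before calling xgb.train() as in the Learning cell above.
    print(candidate)
```

Recording each candidate together with its test accuracy makes it easy to report the best setting afterwards.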