# MNIST digit recognizer trained with xgboost-gpu

NOTE: this notebook needs an environment with py-xgboost-gpu installed, rather than vanilla xgboost.

In [1]:
import os
import pandas as pd
from pathlib import Path
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
print(os.listdir('../Datasets/digit-recognizer'))

['train.csv', 'test.csv', 'sample_submission.csv']


In [3]:
out_path = Path('./digit-recognizer-output')
in_path = Path('../Datasets/digit-recognizer')

In [4]:
df = pd.read_csv(in_path/'train.csv')
df.head(n=2)

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The first column, 'label', holds the digit shown in the image (an integer between 0 and 9). The remaining 784 columns, pixel0 to pixel783, hold the greyscale brightness (value 0-255) of each pixel in a 28x28 image, one column per pixel.
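To make the row layout concrete, here is a minimal sketch that folds one flat row of 784 pixel values back into a 28x28 grid. It assumes the pixels are stored row by row (pixel0 to pixel27 form the top row), and `row_to_image` is an illustrative helper, not part of the notebook. It uses plain Python lists; with NumPy the equivalent would be `np.asarray(pixels).reshape(28, 28)`.

```python
IMAGE_SIDE = 28  # 28 x 28 = 784 pixels per image


def row_to_image(pixels):
    """Fold a flat sequence of 784 pixel values into 28 rows of 28 values."""
    pixels = list(pixels)
    expected = IMAGE_SIDE * IMAGE_SIDE
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} pixels, got {len(pixels)}")
    # Assumption: row-major order, so pixel k sits at row k // 28, column k % 28
    return [pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE] for r in range(IMAGE_SIDE)]


# Synthetic example: pixel k holds the value k, so image[i][j] == i * 28 + j
image = row_to_image(range(784))
```

For a real row, something like `row_to_image(df.drop('label', axis=1).iloc[0])` should give the first training image as a nested list.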

In [5]:
Y = df['label']

In [6]:
X = df.drop('label', axis=1)

In [7]:
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

## Training using the GPU
Provided py-xgboost-gpu is installed, setting the tree_method parameter to gpu_hist should ensure that the GPU is used for fitting the model.
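As a sketch of how the GPU and CPU parameter sets in the next cell could be switched between, the hypothetical helper below builds the keyword arguments for either case. `build_params` and its `use_gpu` flag are made up for illustration; the parameter values are the ones this notebook uses. Note that XGBoost 2.0 and later deprecate `gpu_hist` and `gpu_predictor` in favour of `tree_method="hist"` with `device="cuda"`, so this applies to the older releases the notebook targets.

```python
def build_params(use_gpu, n_estimators=1000):
    """Return XGBClassifier keyword arguments for GPU or CPU training."""
    params = {"n_estimators": n_estimators}
    if use_gpu:
        # GPU histogram tree builder and GPU prediction, as used in this notebook
        params["tree_method"] = "gpu_hist"
        params["predictor"] = "gpu_predictor"
    # On the CPU path the tree method is left at the library default
    return params


gpu_params = build_params(use_gpu=True)
cpu_params = build_params(use_gpu=False, n_estimators=100)
```

Either dictionary can then be passed on as `XGBClassifier(**gpu_params)`.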

In [19]:
params = { "n_estimators": 1000, 'tree_method': 'gpu_hist', 'predictor': 'gpu_predictor'}
#params = { "n_estimators": 100}

In [20]:
model = XGBClassifier(**params)

In [None]:
%%time
model.fit(X_train, y_train)

#### Timings:

| Device | Training time | Estimators |
|---|---|---|
| CPU (i5-2500K) | ~6 mins | 100 |
| CPU (i7-6700K) | ~4 mins | 100 |
| GPU (GTX970) | ~3 mins | 1000 |
| GPU (RTX2080) | ~1 min | 1000 |
| GPU (RTX2080) | ~7 secs | 100 |
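Because the CPU and GPU runs used different numbers of estimators, a per-estimator figure makes them easier to compare. The sketch below does that arithmetic on the approximate timings listed above. It assumes training time scales roughly linearly with the number of estimators, which this notebook does not verify.

```python
# (device, approximate training time in seconds, number of estimators)
timings = [
    ("CPU i5-2500K", 6 * 60, 100),
    ("CPU i7-6700K", 4 * 60, 100),
    ("GPU GTX970", 3 * 60, 1000),
    ("GPU RTX2080", 1 * 60, 1000),
    ("GPU RTX2080", 7, 100),
]


def seconds_per_estimator(seconds, n_estimators):
    """Average training cost of one estimator, in seconds."""
    return seconds / n_estimators


per_estimator = [(device, seconds_per_estimator(s, n)) for device, s, n in timings]
baseline = per_estimator[0][1]  # the i5-2500K run, used as the reference

for device, cost in per_estimator:
    print(f"{device:<14} {cost:5.2f} s/estimator  {baseline / cost:5.1f}x vs i5-2500K")
```

On these rough numbers the RTX2080 works out to about 0.06 to 0.07 seconds per estimator, against 2.4 to 3.6 seconds on the two CPUs.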

In [12]:
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

In [13]:
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 97.29%
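A single accuracy figure hides which digits are being misclassified. As an optional sketch, the helper below breaks accuracy down by true label. `per_class_accuracy` is an illustrative name, not a scikit-learn function, and the labels in the example are made up.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of samples with that true label predicted correctly}."""
    totals = Counter()
    correct = Counter()
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / totals[label] for label in sorted(totals)}


# Made-up labels for illustration only
example_true = [0, 0, 1, 1, 1, 9]
example_pred = [0, 0, 1, 7, 1, 4]
breakdown = per_class_accuracy(example_true, example_pred)
```

In this notebook the call would be `per_class_accuracy(y_test, predictions)`.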


In [13]:
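# predict on test.csv (not the held-out split above) and write the submission file
test_df = pd.read_csv(in_path/'test.csv')
test_predictions = model.predict(test_df)
out_path.mkdir(parents=True, exist_ok=True)
submission_df = pd.DataFrame({'ImageId': list(range(1, len(test_predictions)+1)), 'Label': test_predictions})
submission_df.to_csv(out_path/'submission.csv', index=False)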

## Final notes:
According to one article I read, XGBoost on the CPU delivers repeatable results, while on the GPU there is some variance in the output accuracy. I've run this twice and got the same result both times.
https://medium.com/data-design/xgboost-gpu-performance-on-low-end-gpu-vs-high-end-cpu-a7bc5fcd425b
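One way to check repeatability more systematically is to train several times and compare the predictions from each run. The sketch below assumes a hypothetical `train_and_predict` callable that fits a fresh model and returns its predictions. A stand-in function is used here so the snippet runs without a GPU.

```python
def runs_are_identical(train_and_predict, n_runs=3):
    """Call train_and_predict n_runs times; True if every run gave the same predictions."""
    first = list(train_and_predict())
    return all(list(train_and_predict()) == first for _ in range(n_runs - 1))


# Stand-in for illustration: a deterministic "model" that always predicts the same labels
def fake_train_and_predict():
    return [7, 2, 1, 0, 4]


repeatable = runs_are_identical(fake_train_and_predict)
```

With the objects from this notebook, the callable could be something like `lambda: XGBClassifier(**params).fit(X_train, y_train).predict(X_test)`, at the cost of one full training run per repeat.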

The accuracy of 96.66% is lower than what I achieved using the fastai library and a resnet50 deep-learning network (99.257%), although I didn't do any parameter tuning, just used defaults.