# notMNIST letters classification with nearest neighbors 

In this notebook, we'll use [nearest-neighbor classifiers](https://docs.rapids.ai/api/cuml/stable/api.html#id18) to classify notMNIST letters using a GPU and the [RAPIDS](https://rapids.ai/) libraries (cudf, cuml).

**Note that a GPU is required to run this notebook.**

This version of the notebook has been tested with RAPIDS version 0.15.

First, the needed imports. 

In [None]:
%matplotlib inline

from pml_utils import show_failures

import cudf
import numpy as np
import pandas as pd

import os
import urllib.request

from cuml.neighbors import KNeighborsClassifier
from cuml import __version__ as cuml_version

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn import __version__ as sklearn_version

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

print('Using cudf version:', cudf.__version__)
print('Using cuml version:', cuml_version)
print('Using sklearn version:', sklearn_version)

Then we load the notMNIST data. The first time the notebook is run, the data has to be downloaded, which can take a while. The data is stored as NumPy arrays in host (CPU) memory.

In [None]:
def load_not_mnist(directory, filename):
    filepath = os.path.join(directory, filename)
    if os.path.isfile(filepath):
        print('Not downloading, file already exists:', filepath)
    else:
        # makedirs also creates missing parent directories
        os.makedirs(directory, exist_ok=True)
        url_base = 'https://a3s.fi/mldata/'
        url = url_base + filename
        print('Downloading {} to {}'.format(url, filepath))
        urllib.request.urlretrieve(url, filepath)
    return np.load(filepath)

In [None]:
# Load notMNIST
DATA_DIR = os.path.expanduser('~/data/notMNIST/')
if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)
    
X_train = load_not_mnist(DATA_DIR, 'notMNIST_large_images.npy').reshape(-1, 28*28)
y_train = load_not_mnist(DATA_DIR, 'notMNIST_large_labels.npy')
X_test = load_not_mnist(DATA_DIR, 'notMNIST_small_images.npy').reshape(-1, 28*28)
y_test = load_not_mnist(DATA_DIR, 'notMNIST_small_labels.npy')

In [None]:
print()
print('notMNIST data loaded: train:',len(X_train),'test:',len(X_test))
print('X_train:', type(X_train), 'shape:', X_train.shape)
print('y_train:', type(y_train), 'shape:', y_train.shape, y_train.dtype)
print('X_test:', type(X_test), 'shape:', X_test.shape)
print('y_test:', type(y_test), 'shape:', y_test.shape)

The training data (`X_train`) is a matrix of size (529114, 784), i.e. it consists of 529114 letters, each represented as a 784-dimensional vector (a 28x28 image flattened to 1D). `y_train` is a 529114-dimensional vector containing the correct class (one of "A", "B", ..., "J") for each training letter.
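
The flattening described above can be illustrated with a minimal sketch. It uses a small synthetic array (`toy_images`, a hypothetical stand-in, not the real notMNIST data) and shows that reshaping to 784 columns loses no information:

```python
import numpy as np

# Synthetic stand-in for a batch of five 28x28 grayscale images.
rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each image into one 784-element row, as reshape(-1, 28*28) does.
toy_flat = toy_images.reshape(-1, 28 * 28)

# Reshaping a row back to (28, 28) recovers the original image.
toy_restored = toy_flat[0].reshape(28, 28)
print(toy_flat.shape)
```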

Let's take a closer look. Here are the first 10 training letters:

In [None]:
pltsize=1
plt.figure(figsize=(10*pltsize, pltsize))

for i in range(10):
    plt.subplot(1,10,i+1)
    plt.axis('off')
    plt.imshow(X_train[i,:].reshape(28, 28), cmap="gray")
    plt.title('Class: '+y_train[i])

Let's then convert our training and test data to cuDF DataFrames in device (GPU) memory. We will also convert the classes in `y_train` to integers in the range $0, \ldots, 9$.

The predictions will later be evaluated against `y_test` as NumPy arrays in host memory, so `y_test` does not need to be converted.

In [None]:
cu_X_train = cudf.DataFrame.from_pandas(pd.DataFrame(X_train))
cu_y_train = cudf.DataFrame.from_pandas(pd.DataFrame(y_train.view(np.int32)-ord('A')))
cu_X_test  = cudf.DataFrame.from_pandas(pd.DataFrame(X_test))

print('cu_X_train:', type(cu_X_train), 'shape:', cu_X_train.shape)
print('cu_y_train:', type(cu_y_train), 'shape:', cu_y_train.shape)
print('cu_X_test:', type(cu_X_test), 'shape:', cu_X_test.shape)
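
The label conversion `y_train.view(np.int32)-ord('A')` may look cryptic. The sketch below shows how it behaves on a toy array, assuming (as the expression implies) that the labels are stored as one-character Unicode strings, which NumPy represents with 4 bytes per element. The names `toy_letters` and `toy_codes` are illustrative only.

```python
import numpy as np

# Toy labels standing in for y_train; one-character strings get dtype '<U1'.
toy_letters = np.array(['A', 'C', 'J', 'B'])
print(toy_letters.dtype, toy_letters.dtype.itemsize)

# Reinterpret the 4-byte code points as 32-bit integers,
# then shift so that 'A' maps to 0 and 'J' maps to 9.
toy_codes = toy_letters.view(np.int32) - ord('A')
print(toy_codes)
```

If the labels had a different dtype (for example byte strings), the `view` call would not give code points, so it is worth checking `y_train.dtype` first.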

## 1-NN classifier

### Initialization

Let's first create a 1-NN classifier.  Note that nearest-neighbor classifiers have no internal (parameterized) model, so no learning is required.  Instead, calling the `fit()` function simply stores the samples of the training data in a suitable data structure.

In [None]:
%%time

n_neighbors = 1
cu_clf = KNeighborsClassifier(n_neighbors=n_neighbors)
cu_clf.fit(cu_X_train, cu_y_train)
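
To make the "no learning" point concrete, here is a minimal brute-force 1-NN sketch in plain NumPy. It is not how cuML implements the classifier; the function name `nearest_neighbor_predict` and the toy data are assumptions made for illustration.

```python
import numpy as np

def nearest_neighbor_predict(X_ref, y_ref, X_query):
    """Brute-force 1-NN using squared Euclidean distance."""
    # ||q - r||^2 = ||q||^2 - 2 q.r + ||r||^2, evaluated for all pairs at once.
    d2 = (
        (X_query ** 2).sum(axis=1)[:, None]
        - 2.0 * X_query @ X_ref.T
        + (X_ref ** 2).sum(axis=1)[None, :]
    )
    # Each query takes the label of its closest stored sample.
    return y_ref[np.argmin(d2, axis=1)]

# "Fitting" amounts to keeping the reference samples and labels around.
toy_X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_y = np.array([0, 0, 1])
toy_queries = np.array([[0.2, 0.1], [4.0, 4.5]])
toy_nn_pred = nearest_neighbor_predict(toy_X, toy_y, toy_queries)
print(toy_nn_pred)
```

All the work happens at prediction time, which is why inference cost grows with the size of the training set.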

### Inference

We can now classify the test samples with our classifier. With a GPU, classifying all samples should be rather fast. 

We also convert the predictions to a Numpy array in host memory with `values_host`. To match `y_test`, we also convert the predicted integer classes back to letters.

In [None]:
%%time

predictions = cu_clf.predict(cu_X_test).values_host.flatten()
predictions = [chr(x) for x in predictions+ord('A')]
predictions = np.array(predictions)
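
As an alternative to the list comprehension above, the integer classes can be mapped back to letters with a lookup table. This is only a sketch on toy values; `toy_classes` and `toy_int_preds` are hypothetical names.

```python
import numpy as np

# Lookup table: position i holds the letter for integer class i.
toy_classes = np.array(list('ABCDEFGHIJ'))

# Toy integer predictions standing in for the classifier output.
toy_int_preds = np.array([0, 9, 2, 2])

# Integer-array indexing decodes all predictions in one step.
toy_letter_preds = toy_classes[toy_int_preds]
print(toy_letter_preds)
```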

The accuracy of the classifier:

In [None]:
print('Predicted', len(predictions), 'letters with accuracy:', 
      accuracy_score(y_test, predictions))

#### Confusion matrix

We can compute the confusion matrix to see which letters get mixed up the most:

In [None]:
labels=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
print('Confusion matrix (rows: true classes; columns: predicted classes):'); print()
cm=confusion_matrix(y_test, predictions, labels=labels)
print(cm); print()
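
For intuition, the counting behind a confusion matrix can be sketched by hand. The helper `confusion_counts` below is a hypothetical illustration on toy labels, not the scikit-learn implementation; it follows the same convention of rows for true classes and columns for predicted classes.

```python
import numpy as np

def confusion_counts(y_true, y_pred, class_labels):
    """Count (true class, predicted class) pairs into a square matrix."""
    index = {label: i for i, label in enumerate(class_labels)}
    counts = np.zeros((len(class_labels), len(class_labels)), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[index[t], index[p]] += 1
    return counts

toy_true = np.array(['A', 'A', 'B', 'C', 'C'])
toy_guess = np.array(['A', 'B', 'B', 'C', 'A'])
toy_cm = confusion_counts(toy_true, toy_guess, ['A', 'B', 'C'])
print(toy_cm)
```

Off-diagonal entries are the mistakes: here one "A" was predicted as "B" and one "C" as "A".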

Plotted as an image:

In [None]:
plt.matshow(cm, cmap=plt.cm.gray)
plt.xticks(range(10), labels)
plt.yticks(range(10), labels)
plt.grid(None)
plt.show()

#### Accuracy, precision and recall

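Classification accuracy for each class, i.e. the fraction of test samples of each true class that were classified correctly (this equals the per-class recall):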

In [None]:
for label, acc in zip(labels, cm.diagonal() / cm.sum(axis=1)):
    print("%s: %.4f" % (label, acc))

Precision and recall for each class:

In [None]:
print(classification_report(y_test, predictions, labels=labels))
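
The per-class precision and recall in the report can be derived directly from a confusion matrix. The sketch below uses a made-up 3x3 matrix (`toy_counts`, not results from this notebook) with rows as true classes and columns as predicted classes.

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predictions.
toy_counts = np.array([[8, 2, 0],
                       [1, 7, 2],
                       [0, 1, 9]])

true_pos = np.diag(toy_counts)

# Recall: correct predictions divided by all samples of the true class (row sums).
toy_recall = true_pos / toy_counts.sum(axis=1)

# Precision: correct predictions divided by all predictions of that class (column sums).
toy_precision = true_pos / toy_counts.sum(axis=0)

# F1 score: harmonic mean of precision and recall.
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)
print(toy_precision, toy_recall, toy_f1)
```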

#### Failure analysis

We can also inspect the results in more detail. Let's use the `show_failures()` helper function (defined in `pml_utils.py`) to show the wrongly classified test letters.

The helper function is defined as:

```
show_failures(predictions, y_test, X_test, trueclass=None, predictedclass=None, maxtoshow=10)
```

where:
- `predictions` is a vector with the predicted classes for each test set image
- `y_test` contains the _correct_ classes for the test set images
- `X_test` contains the test set images
- `trueclass` can be set to show only images for a given correct (true) class
- `predictedclass` can be set to show only images which were predicted as a given class
- `maxtoshow` specifies how many items to show


In [None]:
show_failures(predictions, y_test, X_test)

We can use `show_failures()` to inspect failures in more detail. For example:

* show failures in which the true class was "F":

In [None]:
show_failures(predictions, y_test, X_test, trueclass='F')

* show failures in which the prediction was "A":

In [None]:
show_failures(predictions, y_test, X_test, predictedclass='A')

* show failures in which the true class was "A" and the prediction was "C":

In [None]:
show_failures(predictions, y_test, X_test, trueclass='A', predictedclass='C')
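
The filtering that these calls perform can be approximated with boolean masks. The function `failure_indices` below is a hypothetical sketch of the selection logic only; the actual `show_failures()` in `pml_utils.py` may differ, and plotting is left out.

```python
import numpy as np

def failure_indices(predictions, y_true, trueclass=None,
                    predictedclass=None, maxtoshow=10):
    """Return indices of misclassified samples, optionally filtered by class."""
    mask = predictions != y_true
    if trueclass is not None:
        mask &= (y_true == trueclass)
    if predictedclass is not None:
        mask &= (predictions == predictedclass)
    return np.flatnonzero(mask)[:maxtoshow]

toy_y = np.array(['A', 'A', 'F', 'F', 'C'])
toy_p = np.array(['A', 'C', 'F', 'A', 'C'])
all_fail = failure_indices(toy_p, toy_y)
f_fail = failure_indices(toy_p, toy_y, trueclass='F')
a_as_c = failure_indices(toy_p, toy_y, trueclass='A', predictedclass='C')
print(all_fail, f_fail, a_as_c)
```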

## Model tuning

Try modifying the nearest-neighbor classifier. Things to try include using more than one neighbor and adding weights to the neighbors (if supported).  See the documentation for the cuML [KNeighborsClassifier](https://docs.rapids.ai/api/cuml/stable/api.html#id18).
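
To see what the two suggestions change, here is a small NumPy sketch of k-NN voting with optional inverse-distance weights. It is an illustration only, not the cuML API: the function `knn_predict`, its parameters, and the toy data are assumptions, and whether cuML accepts non-uniform weights should be checked in the documentation for the installed version.

```python
import numpy as np

def knn_predict(X_ref, y_ref, X_query, k=3, weighted=False, n_classes=10):
    """k-NN classification by (optionally distance-weighted) voting."""
    # Pairwise Euclidean distances, shape (n_query, n_ref).
    d = np.sqrt(((X_query[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2))
    # Indices of the k closest reference samples for each query.
    nn = np.argsort(d, axis=1)[:, :k]
    votes = np.zeros((len(X_query), n_classes))
    for row in range(len(X_query)):
        # Uniform voting gives each neighbor weight 1; otherwise use 1/distance.
        w = 1.0 / (d[row, nn[row]] + 1e-12) if weighted else np.ones(k)
        np.add.at(votes[row], y_ref[nn[row]], w)
    return votes.argmax(axis=1)

toy_X = np.array([[0.0], [1.0], [1.2], [5.0]])
toy_y = np.array([0, 1, 1, 0])
toy_q = np.array([[0.1]])
uniform_pred = knn_predict(toy_X, toy_y, toy_q, k=3, n_classes=2)
weighted_pred = knn_predict(toy_X, toy_y, toy_q, k=3, weighted=True, n_classes=2)
print(uniform_pred, weighted_pred)
```

In this toy case the two schemes disagree: with uniform votes the two farther class-1 neighbors outvote the single very close class-0 neighbor, while inverse-distance weights let the close neighbor dominate.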