# KNN Classifier Lab
This lab introduces the K-Nearest Neighbor (KNN) classifier. Let's start by importing our modules:

In [1]:
%matplotlib inline
import cv2 as cv
import numpy as np 
from matplotlib import pyplot as plt
from skimage.feature import (
    local_binary_pattern as lbp,
    hog
)
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn import metrics
from torchvision.datasets import MNIST
import random
from pathlib import Path

## The MNIST Dataset


The dataset that we will use for this lab is the MNIST<sup><a href="https://en.wikipedia.org/wiki/MNIST_database">1</a></sup> dataset. MNIST is a dataset of handwritten digits with 10 classes, one for each digit: 0, 1, 2, ..., 9.


We can load the dataset using the `torchvision` package. We will discuss this package in far more detail in later portions of this module:

In [2]:
dataset = MNIST(root=Path.home() / "data" / "MNIST", download=True, train=True)

The dataset is loaded as a set of image and label pairs. The images are loaded as Pillow (`PIL`) images, but we can convert them to the `np.ndarray` we are used to with the `np.asarray` function:

In [None]:
image, label = dataset[0]

img = np.asarray(image)

plt.imshow(img)

## Generating Features: Local Binary Patterns

We are now going to attempt to classify the MNIST dataset, but we first will need to extract _features_ from the imagery. To generate features, we will use [local binary patterns (LBP)](https://scikit-image.org/docs/0.24.x/auto_examples/features_detection/plot_local_binary_pattern.html).

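The LBP feature descriptor works by comparing each pixel to the neighbors sampled on a circle around it. Each neighbor is assigned a binary 0 where the center pixel's value is greater than the neighbor's, and a 1 otherwise. The resulting bits are combined into a single code for that pixel.

In the classic LBP descriptor, the image of codes is then divided into windows, a normalized histogram of the codes is computed for each window, and the histograms are concatenated to give a feature descriptor of the entire image. In this lab we take a simpler route and use the per-pixel LBP code image itself as the feature.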
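
To make the per-pixel comparison concrete, here is a minimal sketch of the basic 3x3 variant of LBP for a single pixel. It is an illustration only: the helper name `lbp_code_3x3` and the clockwise bit ordering are assumptions made for this example, and `scikit-image` samples neighbors on a circle of a given radius rather than on a fixed 3x3 grid.

```python
import numpy as np


def lbp_code_3x3(patch):
    """Return the basic 8-neighbor LBP code for the center of a 3x3 patch."""
    center = patch[1, 1]
    # neighbors visited clockwise from the top-left corner
    # (this ordering is a convention chosen for the sketch)
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        # bit is 1 when the neighbor is at least as bright as the center
        if patch[r, c] >= center:
            code |= 1 << bit
    return code


patch = np.array([[5, 9, 1],
                  [4, 6, 7],
                  [2, 8, 3]])
code = lbp_code_3x3(patch)
print(code)  # 42: only the neighbors 9, 7 and 8 are >= the center value 6
```

Repeating this for every pixel produces an image of codes like the one plotted for the sample digit.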

Let's generate the LBP output for our sample image using the `scikit-image` library:

In [None]:
radius = 2
n_points = 8 * radius
image_features = lbp(np.asarray(image), n_points, radius)

plt.imshow(image_features)

Now let's generate the LBP features for all the images in our training dataset. The classifier that we are going to use, the K-Nearest Neighbor (KNN) classifier, requires one-dimensional feature vectors, so we will have to flatten each 2-d LBP output to 1-d:

In [5]:
# initialize two numpy tensors
features = np.empty((len(dataset), img.size))
labels = np.empty(len(dataset), dtype=int)
# for each image and label pair
for i, (pil_image, label) in enumerate(dataset):
    # convert from PIL image to tensor
    img_tensor = np.asarray(pil_image)
    # generate the features using same parameters as our sample above
    # and flatten the 2-d output to 1-d
    features[i] = lbp(img_tensor, n_points, radius).flatten()
    # save the label to our labels array
    labels[i] = label
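
As a quick illustration of the flattening step, the sketch below uses a stand-in 28x28 array (the size of an MNIST image) in place of a real LBP output. The name `fake_lbp_output` and its values are placeholders for this example only.

```python
import numpy as np

# stand-in for a 28x28 LBP output; the values are placeholders, not real features
fake_lbp_output = np.arange(28 * 28).reshape(28, 28)

# flatten() returns a 1-d copy in row-major order
flat = fake_lbp_output.flatten()

print(fake_lbp_output.shape)  # (28, 28)
print(flat.shape)             # (784,)
```

Each image therefore contributes one row of 784 values to the feature array.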

## The KNN Classifier
Now that we have generated features using the local binary pattern, we need to introduce our classifier. The first classifier that we are going to introduce is the K-Nearest Neighbor (KNN) classifier. For each sample, this classifier looks at the $k$ closest samples in feature space and assigns the class most common among those $k$ neighbors.

Let's consider a 2-class problem in $\mathbb{R}^2$:
$$
X = \{ [-1, 1], [-1, 2], [3,4], [2,4] \}
$$
$$
Y =\{ 0, 0, 1, 1 \}
$$

Let's view this problem on a graph:

In [None]:
x = np.array([[-1, 1], [-1, 2], [3, 4], [2, 4]])
y = np.array([0, 0, 1, 1])

plt.scatter(x[:, 0], x[:, 1], c=["r", "r", "b", "b"])

Let's say a new point (1, 3) is added. What class does it belong to?

In [None]:
plt.scatter(x[:, 0], x[:, 1], c=["r", "r", "b", "b"])
plt.scatter(1, 3, c="green")

The K-Nearest Neighbor classifier would solve this problem by finding the $k$ nearest points. For our toy problem, let's use $k$ of 3. So what are the 3 closest points?

In [None]:
pt = np.array([1, 3])
distances = [np.linalg.norm(pt - _x) for _x in x]
distances

So the three closest points are $[-1, 2], [3,4], [2,4]$. Now that we have found the three closest neighbors, we classify the point by assigning the class that **most** of its neighbors belong to, which in this case is class 1.
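
The whole procedure for the toy problem can be sketched in a few lines of NumPy. This is a minimal from-scratch illustration, not how `sklearn` implements KNN internally; the variable names `nearest`, `votes`, and `predicted` are chosen for this example.

```python
from collections import Counter

import numpy as np

x = np.array([[-1, 1], [-1, 2], [3, 4], [2, 4]])
y = np.array([0, 0, 1, 1])
pt = np.array([1, 3])
k = 3

# Euclidean distance from the new point to every training point
distances = np.linalg.norm(x - pt, axis=1)

# indices of the k smallest distances (stable sort keeps ties in input order)
nearest = np.argsort(distances, kind="stable")[:k]

# majority vote over the labels of those k neighbors
votes = Counter(y[nearest].tolist())
predicted = votes.most_common(1)[0][0]

print(predicted)  # 1
```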

## Applying KNN to MNIST


With our features generated and the KNN classifier understood, we can initialize and train our classifier. We are going to initialize our classifier with default parameters, which, as described in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), means that we will use an $L_2$, or Euclidean, distance metric and a $k$ of 5.

The KNN classifier is trained with its `fit` method, which takes two parameters:
1. features - an N x M array with N samples and M features
2. labels - a 1-d vector of length N

In [None]:
# initialize a KNN classifier with default parameters
knn = KNN()
# train (fit) our classifier on our generated features and labels
knn.fit(features, labels)
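
The same `fit` and `predict` pattern can be tried on the toy problem from the previous section. This is a small sketch under stated assumptions: the names `toy_x`, `toy_y`, and `toy_knn` are made up for the example, and `n_neighbors=3` is chosen to match the worked example rather than the default of 5.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# N = 4 samples, M = 2 features each
toy_x = np.array([[-1, 1], [-1, 2], [3, 4], [2, 4]])
# one label per sample
toy_y = np.array([0, 0, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(toy_x, toy_y)

# predict expects a 2-d array, even for a single sample
toy_prediction = toy_knn.predict(np.array([[1, 3]]))
print(toy_prediction)  # [1]
```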

Our classifier has now been trained. Let's examine how our classifier performed by checking its training accuracy. The training accuracy is the accuracy of the model on the same set of data upon which the model is trained.

However, our training dataset is 60,000 images, so let's choose a random subset of 5000 images to get our training accuracy:

In [10]:
# get 5000 unique random indices between 0 and len(dataset) - 1
sample_idx = random.sample(range(len(dataset)), 5000)
# get the features at those indices
sample_features = [features[i] for i in sample_idx]
# get the labels at those indices
sample_labels = [labels[i] for i in sample_idx]

# use knn.predict to generate predicted class for our sample features
o = knn.predict(sample_features)
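
For reference, `random.sample` draws without replacement, so no index is repeated. The sketch below shows this on a small stand-in array and also uses NumPy integer-array indexing as one possible alternative to the list comprehensions. The array names and the seed are assumptions made only for this example.

```python
import random

import numpy as np

random.seed(0)  # seed chosen only to make the sketch repeatable

# stand-in data: 10 samples with 2 features each
toy_features = np.arange(20).reshape(10, 2)
toy_labels = np.arange(10)

# draw 4 distinct indices from 0..9
idx = random.sample(range(len(toy_features)), 4)

# indexing with a list of integers selects those rows in one step
sub_features = toy_features[idx]
sub_labels = toy_labels[idx]

print(sub_features.shape)  # (4, 2)
```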

With our predicted and target labels now generated, we can calculate our accuracy using the `sklearn` library:

In [None]:
metrics.accuracy_score(sample_labels, o)
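
Accuracy here is simply the fraction of predictions that match the target labels. A minimal NumPy sketch of the same calculation, using made-up labels for illustration:

```python
import numpy as np

# made-up labels for illustration only
y_true = np.array([0, 1, 2, 2])
y_pred = np.array([0, 1, 1, 2])

# fraction of positions where prediction and target agree
accuracy = float(np.mean(y_true == y_pred))
print(accuracy)  # 0.75
```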

## Testing our Classifier

So our classifier has achieved excellent training accuracy. Let's load the test portion of the dataset so we can test our model on data it has not yet seen:

In [12]:
test_dataset = MNIST(root=Path.home() / "data" / "MNIST", download=True, train=False)

We again need to generate the LBP features and labels from the test imagery, following the same set of steps as the train data:

In [13]:
test_features = np.empty((len(test_dataset), img.size))
test_labels = np.empty(len(test_dataset), dtype=int)
for i, (pil_image, label) in enumerate(test_dataset):
    test_features[i] = lbp(np.asarray(pil_image), n_points, radius).flatten()
    test_labels[i] = label

We again use the `predict` function with our feature vectors to generate predicted classes:

In [None]:
o = knn.predict(test_features)
metrics.accuracy_score(test_labels, o)

The confusion matrix shows how the predictions are distributed across the true and predicted labels. In `sklearn`, rows correspond to true labels and columns to predicted labels. Entries along the diagonal count samples where the true and predicted labels match (TP), and off-diagonal entries count samples where they disagree (FP and FN). We can generate this matrix easily with `sklearn`:

In [None]:
confusion_matrix = metrics.confusion_matrix(test_labels, o)
confusion_matrix
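
To see how such a matrix is assembled, here is a small hand-rolled sketch on made-up labels for a 3-class problem. It follows the same layout (rows are true labels, columns are predicted labels); the variable names are chosen for this example.

```python
import numpy as np

# made-up labels for illustration only
y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2])
n_classes = 3

# cm[i, j] counts samples with true label i that were predicted as j
cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [0 0 1]]

# the diagonal holds the correct predictions, so accuracy is trace / total
accuracy = cm.trace() / cm.sum()
print(accuracy)  # 0.8
```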

## Using Other Feature Extractors
Thus far we have only used the LBP feature extractor to generate features for our KNN classifier. However, feature descriptors for KNN image classification can be generated by many methods beyond just the local binary pattern.

Let's now use the [Histogram of Oriented Gradients (HOG)](https://scikit-image.org/docs/dev/auto_examples/features_detection/plot_hog.html) to generate feature descriptors for our classifier:

In [16]:
ORIENTATIONS = 8
CELLS_PER_BLOCK = 3

hog_features_train = np.empty((len(dataset), ORIENTATIONS * CELLS_PER_BLOCK**2))

for i, (pil_image, label) in enumerate(dataset):
    hog_features_train[i] = hog(
        np.asarray(pil_image),
        orientations=ORIENTATIONS,
        cells_per_block=(CELLS_PER_BLOCK, CELLS_PER_BLOCK),
    )

In [17]:
hog_features_test = np.empty((len(test_dataset), ORIENTATIONS * CELLS_PER_BLOCK**2))

for i, (pil_image, label) in enumerate(test_dataset):
    hog_features_test[i] = hog(
        np.asarray(pil_image),
        orientations=ORIENTATIONS,
        cells_per_block=(CELLS_PER_BLOCK, CELLS_PER_BLOCK),
    )
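
The feature length of `ORIENTATIONS * CELLS_PER_BLOCK**2` used above can be checked with a little arithmetic. The sketch assumes 28x28 MNIST images and `scikit-image`'s default `pixels_per_cell` of (8, 8), which the calls above do not override.

```python
image_size = 28        # MNIST images are 28x28
pixels_per_cell = 8    # assumed: scikit-image's default of (8, 8)
orientations = 8
cells_per_block = 3

# whole cells that fit along one side of the image
cells_per_side = image_size // pixels_per_cell

# blocks slide one cell at a time, so only one 3x3 block fits
blocks_per_side = cells_per_side - cells_per_block + 1

# each block holds cells_per_block**2 histograms of `orientations` bins
feature_length = blocks_per_side**2 * cells_per_block**2 * orientations
print(feature_length)  # 72
```

With other image sizes or cell sizes more than one block would fit, and the feature array would need to be sized accordingly.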

Let's train a new KNN classifier:

In [None]:
knn = KNN()
knn.fit(hog_features_train, labels)

Let's use our trained model to predict on the test data:

In [19]:
predictions = knn.predict(hog_features_test)

Let's view the accuracy of our model on the test dataset:

In [None]:
metrics.accuracy_score(test_labels, predictions)

So our HOG-based KNN classifier performed very well, but not quite as well as our LBP-based classifier!

<hr> 

## <span style="background:yellow;">**Your Turn**</span>

Train a new KNN classifier with the number of neighbors set to 9, and print its test accuracy.

In [21]:
# -------
# Your code here

# Save your Notebook, then `File > Close and Halt`