# Classification: K-Nearest Neighbour

Let's start simple with k-nearest neighbours (KNN).

KNN is one of the simplest classification algorithms, if not the simplest. Training just stores the labelled data. To classify a new point, KNN finds the $k$ nearest training points and assigns the most common class among them.
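
The procedure described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how scikit-learn implements it: the helper name `knn_predict`, the use of Euclidean distance, and the toy points are all assumptions made for this example.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = sorted(
        (dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points
    nearest_labels = [label for _, label in distances[:k]]
    # Return the most common label among them
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (made up for illustration): two small clusters
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # query near the first cluster
print(knn_predict(points, labels, (5.1, 5.0), k=3))  # query near the second cluster
```

Note that all the work happens at prediction time: every query is compared against every stored point, which is why libraries typically use smarter neighbour searches than this brute-force loop.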

References: <br>
Comparing Nearest Neighbors with and without Neighborhood Components Analysis: https://scikit-learn.org/stable/auto_examples/neighbors/plot_nca_classification.html <br>
Nearest Neighbors Classification: https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html <br>
Confusion matrix: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py

## Installation

In [None]:
%pip install numpy
%pip install matplotlib
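# pandas is needed because we load the datasets with as_frame=True
%pip install pandas
# The PyPI package is called scikit-learn (it is imported as sklearn)
%pip install scikit-learn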

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

## Classifying the Iris dataset

Once again, let's use the Iris dataset.

### Viewing the Iris dataset

Let's view the dataset one more time.

In [None]:
iris_x, iris_y = datasets.load_iris(return_X_y=True, as_frame=True)
iris = datasets.load_iris()

In [None]:
iris_x.head()

In [None]:
iris_y.head()

### Choosing our inputs

Once again, let's use the sepal length and width as our inputs, with a train/test split.

In [None]:
# Filter out the petal width and length, then convert to a numpy array

# Split the data into training and testing sets


<details><summary>Click to cheat</summary>

```python
# Filter out the petal width and length, then convert to a numpy array
iris_sepal = iris_x.filter(items=['sepal length (cm)', 'sepal width (cm)']).to_numpy()
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    iris_sepal, iris_y.to_numpy(), train_size=0.7
)
```
</details>

### Creating our KNN classifier

In [None]:
# Choose your k value
# k = 

# Create the models with/without neighbourhood component analysis
models = [
    # Create a KNN classifier without NCA
    
    # Create a KNN classifier with NCA
    
]

# Train the models and store in a list


<details><summary>Click to cheat</summary>

```python
# Choose your k value
k = 3

# Create the models with/without neighbourhood component analysis
models = [
    # Create a KNN classifier without NCA
    neighbors.KNeighborsClassifier(k),
    # Create a KNN classifier with NCA
    Pipeline(
        [
            ("nca", neighbors.NeighborhoodComponentsAnalysis()),
            ("knn", neighbors.KNeighborsClassifier(k))
        ]
    )
]

# Train the models and store in a list
models2 = [model.fit(X_train, y_train) for model in models]
```
</details>

### Plotting the Regions

In [None]:
titles = (
    f"KNN with k={k}",
    f"KNN with NCA and k={k}"
)

# Set up a 1 x len(models2) grid for plotting.
fig, sub = plt.subplots(nrows=1, ncols=len(models2), figsize=(5 * len(models2), 5),
        constrained_layout=True)
h = 0.05    # step size for mesh grid

cmap_light = ListedColormap(["#FFAAAA", "#AAFFAA", "#AAAAFF"])
cmap_bold = ListedColormap(["#FF0000", "#00FF00", "#0000FF"])

x_min, x_max = iris_sepal[:, 0].min() - 0.3, iris_sepal[:, 0].max() + 0.3
y_min, y_max = iris_sepal[:, 1].min() - 0.3, iris_sepal[:, 1].max() + 0.3
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

for title, model, ax in zip(titles, models2, sub.flatten()):
    # Get the accuracy as a number from 0-1
    score = model.score(X_test, y_test)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    ax.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=0.8)

    # Plot also the training and testing points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cmap_bold, edgecolor="k", s=20)
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cmap_bold, edgecolor="k", s=40, marker='*')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.title.set_text(f"{title}")
    ax.text(
        0.9, 0.1,
        f"{score:.2f}",
        size=15,
        ha="center",
        va="center",
        transform=ax.transAxes,
    )

# Column 0 (sepal length) is on the x-axis, column 1 (sepal width) on the y-axis
fig.supxlabel("sepal length (cm)")
fig.supylabel("sepal width (cm)")
plt.show()

### Confusion matrix of our models

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

iris = datasets.load_iris()

titles_options = [
    ("Confusion matrix, without normalization", None),
    ("Normalized confusion matrix", "true"),
]

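# Pick a model from our trained models.
# If you leave this commented out, `model` is whatever the plotting loop
# above left behind, i.e. the last entry of models2.
# model = models2[0]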

for title, normalize in titles_options:
    disp = ConfusionMatrixDisplay.from_estimator(
        model,
        X_test,
        y_test,
        display_labels=iris.target_names,
        cmap=plt.cm.Blues,
        normalize=normalize,
    )
    disp.ax_.set_title(title)

plt.show()
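
For reference, `normalize="true"` normalises over the true labels, i.e. each row of the matrix is divided by its row total, so a cell shows the fraction of a true class that received each prediction. Below is a minimal sketch of that idea in plain Python. The helper name `normalize_rows` and the toy counts are made up for illustration and are not taken from the Iris results.

```python
def normalize_rows(confusion):
    """Divide each row of a confusion matrix by its row total."""
    normalized = []
    for row in confusion:
        total = sum(row)
        # Guard against a class with no true samples in the test set
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Toy counts (made up): rows are true classes, columns are predicted classes
toy_counts = [
    [10, 0, 0],
    [0, 6, 4],
    [0, 3, 7],
]

for row in normalize_rows(toy_counts):
    print([round(value, 2) for value in row])
```

After normalisation the diagonal entries are the per-class recall, which makes classes of different sizes easier to compare.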


## Digits dataset

Similarly, we can try KNN on the digits dataset.

### Loading the data

First things first, we need to load the data. Let's also view the first few samples while we're at it.

In [None]:
# Load the digits as a bunch object
# We do this to get the target names and images for plotting

# Also load the digits X as a pandas DataFrame and the y as a Series


# Plot the first few examples
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Digit {label}")

<details><summary>Click to cheat</summary>

```python
# Load the digits as a bunch object
# We do this to get the target names and images for plotting
digits = datasets.load_digits()
# Also load the digits X as a pandas DataFrame and the y as a Series
digits_X, digits_y = datasets.load_digits(return_X_y=True, as_frame=True)

# Plot the first few examples
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Digit {label}")
```
</details>

Now let's split our labelled data into training and testing sets with a 70/30 ratio.

<details><summary>Click to cheat</summary>

```python
X_train, X_test, y_train, y_test = train_test_split(
    digits_X.to_numpy(), digits_y.to_numpy(), train_size=0.7
)
```
</details>

### Create the model

In [None]:
# Create the untrained model
# Choose whatever k you want

# Train the model

# get the predictions


<details><summary>Click to cheat</summary>

```python
# Create the untrained model
# Choose whatever k you want
k = 5

model = neighbors.KNeighborsClassifier(k)

# Train the model
model.fit(X_train, y_train)

# get the predictions
y_pred = model.predict(X_test)
```
</details>

### Test the model

Let's see a few examples of our predictions.

In [None]:
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, prediction in zip(axes, X_test, y_pred):
    ax.set_axis_off()
    image = image.reshape(8, 8)
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Prediction: {prediction}")

Let's also view our confusion matrix for good measure.

In [None]:
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
disp.figure_.suptitle("Confusion Matrix")

plt.show()

Finally, we'll look at our measures of performance.

In [None]:
from sklearn.metrics import classification_report

target_names = [str(name) for name in digits.target_names]

print(classification_report(y_test, y_pred, target_names=target_names))
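
The precision, recall and F1 columns in the report can all be derived from the confusion matrix. The sketch below shows the arithmetic in plain Python, assuming the same layout as the plots above (rows are true classes, columns are predicted classes). The helper name `per_class_metrics` and the toy matrix are hypothetical and only serve to illustrate the formulas.

```python
def per_class_metrics(confusion):
    """Return a (precision, recall, f1) tuple per class.

    confusion[i][j] counts samples whose true class is i and
    whose predicted class is j.
    """
    n = len(confusion)
    results = []
    for c in range(n):
        true_positive = confusion[c][c]
        # Column total: everything the model predicted as class c
        predicted_as_c = sum(confusion[row][c] for row in range(n))
        # Row total: everything that really is class c
        actually_c = sum(confusion[c])
        precision = true_positive / predicted_as_c if predicted_as_c else 0.0
        recall = true_positive / actually_c if actually_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Toy 3-class confusion matrix (made up for illustration)
toy_confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]

for label, (p, r, f1) in enumerate(per_class_metrics(toy_confusion)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Precision answers "of everything predicted as this class, how much was right?", while recall answers "of everything that really is this class, how much did we find?".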