# Classifying Data in Python using the $k$-Nearest Neighbors (KNN) Algorithm
*Curtis Miller*

In this notebook I will demonstrate training and using **$k$-nearest neighbors (KNN)** algorithms with **sklearn**.

We will be using the iris dataset, which I load below.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import classification_report

In [None]:
iris_obj = load_iris()

flower, species = iris_obj.data, iris_obj.target

In [None]:
flower_train, flower_test, species_train, species_test = train_test_split(flower, species, test_size = 0.1)
flower_train[:5]

In [None]:
species_train[:5]

## Creating a Classifier

The `KNeighborsClassifier` class allows for fitting and predicting using the KNN algorithm. Recall that with KNN, training a model means saving the training data, and predicting is done by picking the most common class among the $k$ nearest neighbors of a point.
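To make the idea concrete, here is a minimal from-scratch sketch of the prediction step. The function `knn_predict` and the toy data are made up for illustration and are not part of sklearn; the sketch assumes Euclidean distance and breaks ties by whichever label is counted first, which may differ from what `KNeighborsClassifier` does.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, point, k=1):
    """Predict the label of `point` by majority vote among its k nearest training points."""
    # "Training" is nothing more than keeping train_X and train_y around.
    distances = np.sqrt(((train_X - point) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]          # Indices of the k closest points
    votes = Counter(train_y[nearest].tolist())   # Count labels among the neighbors
    return votes.most_common(1)[0][0]            # Most common label wins


# Tiny made-up data set: two well-separated clusters with labels 0 and 1
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_X, toy_y, np.array([0.3, 0.3]), k=3))   # Point near the first cluster
print(knn_predict(toy_X, toy_y, np.array([4.8, 5.1]), k=3))   # Point near the second cluster
```

The first query sits next to the cluster labeled 0 and the second next to the cluster labeled 1, so the votes are unanimous in both cases.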

Besides the choice of variables, two hyperparameters must be picked to use KNN: the number of neighbors $k$ used for prediction and the metric used to define distance. Here I will use Euclidean distance, and I start by picking $k = 1$.
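As a rough sketch of how these two hyperparameters map onto `KNeighborsClassifier` arguments: `n_neighbors` sets $k$, and `metric` sets the distance. The default is the Minkowski metric with `p=2`, which is Euclidean distance, so leaving `metric` unset already gives the choice made here. The variable names are illustrative, and for brevity the sketch fits on the full iris data rather than on a training split.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Defaults: metric='minkowski' with p=2, i.e. Euclidean distance
knn_default = KNeighborsClassifier(n_neighbors=1).fit(X, y)
# The same choice, spelled out explicitly
knn_euclid = KNeighborsClassifier(n_neighbors=1, metric="euclidean").fit(X, y)
# A different metric: Manhattan (city-block) distance
knn_manhattan = KNeighborsClassifier(n_neighbors=1, metric="manhattan").fit(X, y)

query = [[7, 3, 5, 2]]    # Made-up flower measurements
print(knn_default.predict(query), knn_euclid.predict(query), knn_manhattan.predict(query))
```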

In [None]:
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

In [None]:
knn1 = KNeighborsClassifier(n_neighbors=1)    # Setting the k parameter
knn1.fit(flower_train, species_train)    # Fitting the model
knn1.predict(np.array([[7, 3, 5, 2]]))    # A test prediction

In [None]:
pred1 = knn1.predict(flower_train)
pred1

In [None]:
print(classification_report(species_train, pred1))

*Of course* the model does perfectly on the training data! (How can it not?)
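The reason is that with $k = 1$ every training point is stored in the model, so its nearest neighbor is itself (or an exact duplicate) at distance zero. A small sketch of that check using the `kneighbors` method; for brevity it fits on the full iris data instead of the training split used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Distance from each stored point to its nearest neighbor in the stored data
distances, indices = knn.kneighbors(X, n_neighbors=1)
print(distances.max())    # Largest nearest-neighbor distance over all stored points
```

For the same reason, accuracy on the training data says nothing about how a 1-nearest-neighbor model will do on new flowers.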

## Choosing $k$

Let's perform cross-validation to see what $k$ seems to lead to the best predictive accuracy, along with getting a sense of what level of accuracy in prediction we can hope to see.
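Before handing the work to sklearn, it may help to see what 10-fold cross-validation does for a single value of $k$. The sketch below is written under a few assumptions: it uses the full iris data rather than the training split, picks $k = 5$ arbitrarily, and uses `StratifiedKFold` so that each fold keeps roughly the same species proportions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=10)    # Each fold keeps the species proportions

fold_scores = []
for fit_idx, holdout_idx in folds.split(X, y):
    knn = KNeighborsClassifier(n_neighbors=5)     # k = 5 is an arbitrary choice here
    knn.fit(X[fit_idx], y[fit_idx])               # Fit on nine of the ten folds
    fold_scores.append(knn.score(X[holdout_idx], y[holdout_idx]))   # Accuracy on the held-out fold

print(np.mean(fold_scores))    # Average held-out accuracy across the ten folds
```

Repeating this loop for each candidate $k$ and comparing the averages is the procedure automated in the cells that follow.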

In [None]:
import pandas as pd
from pandas import DataFrame

In [None]:
k_candidate = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
res = dict()

for k in k_candidate:
    knn = KNeighborsClassifier(n_neighbors=k)
    res[k] = cross_validate(estimator=knn,      # The classifier to evaluate
                            X=flower_train,     # Features array
                            y=species_train,    # Target array
                            cv=10,              # Number of folds (cv also accepts a splitter object)
                            return_train_score=False,    # Don't return training scores
                            scoring='accuracy') # Metric to report (several metrics can also be requested)

In [None]:
resdf = DataFrame({(i, j): res[i][j]
                             for i in res.keys()
                             for j in res[i].keys()}).T
resdf

In [None]:
resdf.loc[(slice(None), 'test_score'), :]

In [None]:
resdf.loc[(slice(None), 'test_score'), :].mean(axis=1)

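In the run shown here, the best mean cross-validated accuracy is attained when $k = 8$. Because `train_test_split` was called without a fixed `random_state`, the best $k$ may differ when the notebook is rerun. Let's see how our classifier does on the test set.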

In [None]:
pred3 = KNeighborsClassifier(n_neighbors=8)
pred3.fit(flower_train, species_train)
species_test_predict = pred3.predict(flower_test)
print(classification_report(species_test, species_test_predict))

Our KNN classifier does well predicting the setosa species and performs worst on the virginica species.
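A confusion matrix is one way to see which species get mistaken for which. The sketch below is self-contained and makes its own split, with an assumed `random_state=0` and `stratify` so that it is repeatable and all three species appear in the test set; its counts will therefore not match the report above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.1,
    random_state=0,          # Assumed seed, only so the sketch is repeatable
    stratify=iris.target)    # Assumed: keep all three species in the test set

knn = KNeighborsClassifier(n_neighbors=8).fit(X_train, y_train)
cm = confusion_matrix(y_test, knn.predict(X_test), labels=[0, 1, 2])

# Rows are true species, columns are predicted species
print(iris.target_names)
print(cm)
```

Off-diagonal entries count misclassified flowers, so any confusion between versicolor and virginica shows up in the lower-right 2×2 block.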

In the plots below, training points are drawn in black, correctly predicted test points in blue, and incorrectly predicted test points in red (marker shape corresponds to species). They show that this result should be expected: setosa flowers are easily identified, while versicolor and virginica overlap and are harder to tell apart.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
marker_map = {0: 'o', 1: 's', 2: '^'}    # One marker shape per species
var1, var2 = 0, 1    # Sepal length and sepal width variables
# Plot training points in black (the loop variable is named label so it does not overwrite the species array)
for length, width, label in zip(flower_train[:, var1], flower_train[:, var2], species_train):
    plt.scatter(x=length, y=width, marker=marker_map[label], c="black")
# Plot correctly predicted test points in blue
correct = (species_test == species_test_predict)
for length, width, label in zip(flower_test[correct, var1], flower_test[correct, var2], species_test[correct]):
    plt.scatter(x=length, y=width, marker=marker_map[label], c="blue")
# Plot incorrectly predicted test points in red
wrong = np.logical_not(correct)
for length, width, label in zip(flower_test[wrong, var1], flower_test[wrong, var2], species_test[wrong]):
    plt.scatter(x=length, y=width, marker=marker_map[label], c="red")
plt.xlabel(iris_obj.feature_names[var1])
plt.ylabel(iris_obj.feature_names[var2])
plt.show()

In [None]:
marker_map = {0: 'o', 1: 's', 2: '^'}    # One marker shape per species
var1, var2 = 2, 3    # Petal length and petal width variables
# Plot training points in black (the loop variable is named label so it does not overwrite the species array)
for length, width, label in zip(flower_train[:, var1], flower_train[:, var2], species_train):
    plt.scatter(x=length, y=width, marker=marker_map[label], c="black")
# Plot correctly predicted test points in blue
correct = (species_test == species_test_predict)
for length, width, label in zip(flower_test[correct, var1], flower_test[correct, var2], species_test[correct]):
    plt.scatter(x=length, y=width, marker=marker_map[label], c="blue")
# Plot incorrectly predicted test points in red
wrong = np.logical_not(correct)
for length, width, label in zip(flower_test[wrong, var1], flower_test[wrong, var2], species_test[wrong]):
    plt.scatter(x=length, y=width, marker=marker_map[label], c="red")
plt.xlabel(iris_obj.feature_names[var1])
plt.ylabel(iris_obj.feature_names[var2])
plt.show()