## K-NN classifier

In [1]:
import numpy as np
from scipy.spatial import distance_matrix

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

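We use the simple Iris dataset, a small classification problem with three classes. For the sake of simplicity we will only use the first two feature columns (sepal length and sepal width). We skip feature scaling here because both features are measured in centimeters and have comparable ranges. In general, K-NN is sensitive to feature scale, since features with larger ranges dominate the distance.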

In [2]:
iris = datasets.load_iris()

# we only take the first two features.
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
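
As a side note on scaling, the toy sketch below (hypothetical data, not Iris; the helper `nearest` is made up for illustration) shows how the nearest neighbor of a query point can change once features with very different ranges are standardized. It is only meant to illustrate why scaling is usually worth considering for distance-based methods.

```python
import numpy as np

# Hypothetical toy data: feature 0 spans [0, 1], feature 1 spans [0, 100].
X_toy = np.array([[0.0, 0.0],
                  [1.0, 100.0]])
q = np.array([0.1, 60.0])


def nearest(X, query):
    """Index of the row of X closest to query (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(X - query, axis=1)))


# Standardize with the statistics of the toy training points.
mean, std = X_toy.mean(axis=0), X_toy.std(axis=0)

nn_raw = nearest(X_toy, q)                                   # feature 1 dominates
nn_scaled = nearest((X_toy - mean) / std, (q - mean) / std)  # both features count

print(nn_raw, nn_scaled)
```

On the raw data the large-range feature decides the outcome, while after standardization the query is closer to the other point.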


There are multiple ways to implement the search for the K nearest neighbors. Here we go with the simple approach that requires no fitting: it computes all pairwise distances on the fly and selects the K closest points. Alternatively, we could use a K-D tree to make the similarity search more efficient, but this would require building the tree before making predictions.

To select the predicted class, we use majority voting over the labels of the retrieved K nearest neighbors.
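
Majority voting can be sketched on a small, made-up array of neighbor labels. The array `y_nn_toy` and the value of `n_classes` are assumptions for illustration only. Note that `np.argmax` resolves ties in favor of the lowest class label.

```python
import numpy as np

# Hypothetical labels of the 3 nearest neighbors for 3 query points.
y_nn_toy = np.array([[0, 1, 1],
                     [2, 2, 0],
                     [1, 0, 2]])
n_classes = 3

# Count how often each class occurs among the neighbors of each query point.
counts = np.stack([np.bincount(row, minlength=n_classes) for row in y_nn_toy])

# The predicted class is the one with the most votes.
y_vote = counts.argmax(axis=1)
print(y_vote)
```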

In [3]:
class KNNClassifier:
    def __init__(self, k):
        self.k = k

    def predict(self, X_train, y_train, X_test):
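        # pairwise Euclidean distances, shape (n_test, n_train)
        X_dist = distance_matrix(X_train, X_test).T

        # after partitioning around index k, the first k columns hold the
        # k smallest distances in arbitrary order (requires k < n_train)
        idx = np.argpartition(X_dist, self.k, axis=1)
        y_nn = y_train[idx[:, :self.k]]    # labels of k nearest training samples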
        bin_counts = np.apply_along_axis(np.bincount,
                                         axis=1,
                                         arr=y_nn,
                                         minlength=np.max(y_nn) + 1)
        y_pred = np.argmax(bin_counts, axis=1)    # majority voting (bincount works for nonnegative int values)

        return y_pred
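
For comparison, the K-D tree alternative to the brute-force search could look roughly like the sketch below. It uses `scipy.spatial.cKDTree` on random, made-up data (`X_fit`, `X_query`, and `k_demo` are illustrative names, not part of the notebook) and checks the result against a brute-force search. This is one possible approach under these assumptions, not the implementation used here.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical random data standing in for training and query points.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(50, 2))
X_query = rng.normal(size=(5, 2))
k_demo = 3

# Build the tree once, then query the k nearest neighbors of every query point.
tree = cKDTree(X_fit)
dist_tree, idx_tree = tree.query(X_query, k=k_demo)  # both of shape (5, 3)

# Brute-force reference: all pairwise distances, sorted per query point.
d_all = np.linalg.norm(X_query[:, None, :] - X_fit[None, :, :], axis=2)
dist_brute = np.sort(d_all, axis=1)[:, :k_demo]

print(np.allclose(dist_tree, dist_brute))
```

The tree has to be built before any query, which is the fitting step the brute-force approach avoids. It tends to pay off when many queries are run against the same training set.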

In [4]:
knn_clf = KNNClassifier(k=5)
y_pred = knn_clf.predict(X_train, y_train, X_test)

With the predictions for the test data in hand, we compute the micro-averaged F1 score over the test set.

In [5]:
f1_score(y_test, y_pred, average="micro")

0.78
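
For single-label multiclass problems like this one, the micro-averaged F1 score coincides with accuracy, because every misclassified sample counts once as a false positive and once as a false negative. The sketch below verifies this by hand on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not the notebook's results).

```python
import numpy as np

# Hypothetical ground truth and predictions for a 3-class problem.
y_true_toy = np.array([0, 1, 2, 2, 1, 0])
y_pred_toy = np.array([0, 2, 2, 1, 1, 0])
classes = np.unique(y_true_toy)

# Sum true positives, false positives and false negatives over all classes.
tp = sum(np.sum((y_pred_toy == c) & (y_true_toy == c)) for c in classes)
fp = sum(np.sum((y_pred_toy == c) & (y_true_toy != c)) for c in classes)
fn = sum(np.sum((y_pred_toy != c) & (y_true_toy == c)) for c in classes)

micro_f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = np.mean(y_true_toy == y_pred_toy)

print(micro_f1, accuracy)
```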