Importing required modules and setting random seed

In [14]:
from sklearn import datasets
from sklearn import model_selection as ms
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

from knn.k_nearest_neighbors_classifier import KNearestNeighborsClassifier
from utils.scaler import StandardScaler

SEED = 42

Importing the data


In [21]:
X, y = datasets.load_iris(return_X_y=True)
X = X[:, :3]

X_train, X_test, y_train, y_test = ms.train_test_split(X, y, train_size=0.8, random_state=SEED)

Scaling the data

In [22]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
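
The `StandardScaler` imported from `utils.scaler` is not shown in this notebook, so the following is only a minimal sketch of what such a scaler is assumed to do, written with plain NumPy. The function names `fit_standardizer` and `apply_standardizer` and the toy arrays are hypothetical. The point it illustrates is that the mean and standard deviation are computed on the training data only and then reused for the test data.

```python
import numpy as np


def fit_standardizer(X_train):
    """Return the per-feature mean and standard deviation of the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Guard against constant features to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std


def apply_standardizer(X, mean, std):
    """Scale X using statistics that were computed on the training data."""
    return (X - mean) / std


# Toy data (hypothetical): the second feature has a much larger scale.
X_train_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test_toy = np.array([[2.0, 300.0]])

mean, std = fit_standardizer(X_train_toy)
X_train_toy_scaled = apply_standardizer(X_train_toy, mean, std)
X_test_toy_scaled = apply_standardizer(X_test_toy, mean, std)

print(X_train_toy_scaled.mean(axis=0))  # close to zero for every feature
print(X_train_toy_scaled.std(axis=0))   # close to one for every feature
```

Reusing the training statistics for the test set mirrors the `fit_transform` / `transform` split used in the cell above and keeps information about the test data out of the preprocessing step.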

Classifying the data with my implementation of kNN

In [28]:
knn = KNearestNeighborsClassifier(n_neighbors=5, metric='minkowski', p=4)

knn.fit(X_train_scaled, y_train)
pred = knn.predict(X_test_scaled)

f1 = f1_score(y_true=y_test, y_pred=pred, average='weighted').round(2)
print(f'F1 is equal to {f1}')

F1 is equal to 0.93


Trying out sklearn's implementation of kNN

In [29]:
knn_sklearn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=4)
knn_sklearn.fit(X_train_scaled, y_train)
pred_sklearn = knn_sklearn.predict(X_test_scaled)

f1_sklearn = f1_score(y_true=y_test, y_pred=pred_sklearn, average='weighted').round(2)
print(f'F1 is equal to {f1_sklearn}')

F1 is equal to 0.93


The idea of the kNN classifier is quite simple. A basic brute-force implementation requires no real training: fitting only stores the training data, and all the computation happens at inference time.
It can be described in a few steps:
1. Standardize the dataset. 
Scaled data should have zero mean and unit variance. kNN relies on distances, so without scaling the features with larger numeric ranges dominate the result.
2. For every example in the test dataset: <br>
2.1 Calculate the distance between the example and every example in the training dataset (for instance the Minkowski distance of order <code>p</code>, which gives the Manhattan distance for <code>p = 1</code> and the Euclidean distance for <code>p = 2</code>)<br>
2.2 Sort the distances in increasing order <br>
2.3 Take the <code>k</code> closest items (from the training dataset) and remember their labels <br>
2.4 Hold a majority vote over these labels to find the most common class <br>
2.5 Return the majority class as the prediction for the input example <br>
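
Steps 2.1 to 2.5 can be sketched in a few lines of NumPy. This is not the code of `KNearestNeighborsClassifier` from this repository, only an illustrative version under simple assumptions: the function names `minkowski_distances` and `knn_predict` and the toy data are hypothetical, and ties in the vote go to the label that appears first among the sorted neighbors.

```python
from collections import Counter

import numpy as np


def minkowski_distances(x, X_train, p=2):
    """Minkowski distance of order p between one example and every training row."""
    return (np.abs(X_train - x) ** p).sum(axis=1) ** (1.0 / p)


def knn_predict(X_train, y_train, X_test, k=5, p=2):
    """Predict a label for each row of X_test by majority vote of its k neighbors."""
    predictions = []
    for x in X_test:
        distances = minkowski_distances(x, X_train, p)      # step 2.1
        nearest = np.argsort(distances, kind="stable")[:k]  # steps 2.2 and 2.3
        votes = Counter(y_train[nearest].tolist())          # step 2.4
        predictions.append(votes.most_common(1)[0][0])      # step 2.5
    return np.array(predictions)


# Toy data (hypothetical): two well-separated clusters.
X_train_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                        [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train_toy = np.array([0, 0, 0, 1, 1, 1])
X_test_toy = np.array([[0.5, 0.5], [5.5, 5.5]])

print(knn_predict(X_train_toy, y_train_toy, X_test_toy, k=3, p=4))  # [0 1]
```

Sorting all distances costs O(n log n) per test example. For large training sets `np.argpartition` could select the `k` smallest distances without a full sort, at the price of slightly less readable code.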

As the results show, my brute-force implementation of the kNN classifier (with the same
parameters) matches the performance of sklearn's implementation on this small, simple dataset: both reach a weighted F1 score of 0.93 on the test set.
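
Two equal F1 scores, especially after rounding to two decimals, do not by themselves prove that both models predicted the same label for every example. A stricter check is to compare the predictions element by element. The helper below is a hypothetical sketch (the name `agreement_rate` and the sample arrays are made up for illustration and are not this notebook's outputs).

```python
import numpy as np


def agreement_rate(pred_a, pred_b):
    """Fraction of examples on which two prediction arrays agree."""
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    if pred_a.shape != pred_b.shape:
        raise ValueError("prediction arrays must have the same shape")
    return float(np.mean(pred_a == pred_b))


# Hypothetical predictions, not the notebook's actual results.
pred_mine = np.array([0, 1, 2, 2, 1, 0])
pred_reference = np.array([0, 1, 2, 1, 1, 0])

print(agreement_rate(pred_mine, pred_reference))
```

In this notebook the call would be `agreement_rate(pred, pred_sklearn)`; a value of 1.0 would mean the two implementations agree on every test example.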

TODO: <br>
1. Investigate Nearest Neighbor Algorithms (ball_tree, kd_tree)
2. Investigate Nearest Neighbors Regression
3. Investigate the effect of data standardization on kNN (and possibly on k-means and linear models too)
4. Plot the dependency of the F1 score on p (for the Minkowski metric) and n_neighbors for the iris data
5. Investigate ANN approaches (Annoy, N2, hnswlib, etc.)