# K Nearest Neighbor (KNN) Classifier

## Overview
KNN can be used for either classification or regression.
Given a dataset like the one below, with 2 classes (green and red), the goal is to determine the class of the blue test point. To do so, we look for the k nearest training points (k is chosen based on the problem). For classification, the blue point is assigned the majority class among those k neighbors; for regression, it is assigned the average of their values.

<div style="text-align: center;">
  <img src="media/knn.png" width="300">
</div>

Two common distance metrics between points $(x_1, y_1)$ and $(x_2, y_2)$ are:
1. Euclidean distance: $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
2. Manhattan distance: $d = |x_1 - x_2| + |y_1 - y_2|$

The choice of distance depends on the problem.
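
As a quick illustration of the two formulas, here is a minimal NumPy sketch. The function names `euclidean_distance` and `manhattan_distance` are made up for this example and are not part of any library; the code works for any number of features, not just two.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_distance(a, b):
    """City-block distance: sum of the absolute differences per feature."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

In scikit-learn, the same choice is exposed through the `metric` parameter of `KNeighborsClassifier` (for example `metric='euclidean'` or `metric='manhattan'`).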


Time complexity: $O(N)$ distance computations per query, since a brute-force search has to calculate the distance to every data point to find the k nearest ones. This becomes expensive for large datasets.
The query time can be reduced with the help of:
- KD Tree
- Ball Tree
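
To make the brute-force cost concrete, here is a minimal sketch of a KNN classifier that scans every training point for each query. The helper `knn_predict` and the toy data are hypothetical and only meant to illustrate the idea; scikit-learn's implementation is more elaborate.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the class of x_query by majority vote among its k nearest points."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # One distance per training point: this full scan is the O(N) part.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances (a full sort is used for simplicity).
    nearest = np.argsort(dists)[:k]
    votes = Counter(np.asarray(y_train)[nearest].tolist())
    return votes.most_common(1)[0][0]

X_demo = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_demo = ["green", "green", "green", "red", "red", "red"]

print(knn_predict(X_demo, y_demo, [0.5, 0.5], k=3))  # green
print(knn_predict(X_demo, y_demo, [5.5, 5.5], k=3))  # red
```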

## KD Tree & Ball Tree

The KD tree splits the feature space into regions that can be organized in a binary tree.
The KD tree algorithm goes as follows:
1. Split the space at the median of feature 1. This forms 2 regions (the first 2 branches in the binary tree).
2. For each of the branches, split at the median of feature 2 (the median of the points in that branch only).
3. Repeat, cycling through the features, until the entire space has been broken down. The tree can then be used to quickly locate the region around a query and retrieve its closest points.
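
The construction steps above can be sketched as a short recursive function. This is an illustrative toy, not how scikit-learn builds its trees: the dictionary layout and the names `build_kd_tree` and `count_leaf_points` are assumptions made for this example, and the nearest-neighbor search itself is left out.

```python
import numpy as np

def build_kd_tree(points, depth=0, leaf_size=1):
    """Recursively split points at the median, cycling through the features."""
    points = np.asarray(points, dtype=float)
    # Stop splitting once a region holds few enough points.
    if len(points) <= max(leaf_size, 1):
        return {"leaf": True, "points": points}
    axis = depth % points.shape[1]                 # feature used at this level
    points = points[np.argsort(points[:, axis])]   # sort along that feature
    mid = len(points) // 2                         # position of the median
    return {
        "leaf": False,
        "axis": axis,
        "threshold": points[mid, axis],
        "left": build_kd_tree(points[:mid], depth + 1, leaf_size),
        "right": build_kd_tree(points[mid:], depth + 1, leaf_size),
    }

def count_leaf_points(node):
    """Sanity check: every input point should end up in exactly one leaf."""
    if node["leaf"]:
        return len(node["points"])
    return count_leaf_points(node["left"]) + count_leaf_points(node["right"])

pts = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
tree = build_kd_tree(pts)
print(tree["axis"], tree["threshold"])  # 0 7.0
print(count_leaf_points(tree))          # 6
```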

The Ball tree follows a similar logic, but instead of splitting the space along axis-aligned hyperplanes, it uses hyperspheres (balls). It is also a binary tree: each ball contains two sub-balls, which in turn contain their own sub-balls, and so on.

KD trees are efficient in low-dimensional spaces, while Ball trees tend to be more efficient in higher dimensions.
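
scikit-learn exposes both structures directly as `sklearn.neighbors.KDTree` and `sklearn.neighbors.BallTree`. The sketch below, on random data invented for this example, queries both trees and compares the result with a brute-force search; all three should return the same neighbors.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(0)
points = rng.random((200, 3))   # 200 random points with 3 features
query = rng.random((1, 3))      # a single query point

# Both trees share the same query interface: distances and indices, sorted by distance.
kd_dist, kd_idx = KDTree(points, leaf_size=20).query(query, k=5)
ball_dist, ball_idx = BallTree(points, leaf_size=20).query(query, k=5)

# Brute-force reference: distance to every point, then keep the 5 smallest.
brute_idx = np.argsort(np.linalg.norm(points - query, axis=1))[:5]

print(kd_idx[0])
print(ball_idx[0])
print(brute_idx)
```

In `KNeighborsClassifier`, the `algorithm` parameter (`'kd_tree'`, `'ball_tree'`, `'brute'`, or `'auto'`) selects which of these strategies is used.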


## Advantages
- Simple
- Good when the decision boundary is very irregular

## Disadvantages
- Non-generalizing machine learning method (i.e. it doesn't create a generalized model, instead it just remembers all the training data)

In [2]:
# Let's now create a dataset to experiment with
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000,
    n_features=3,
    n_redundant=1,
    n_classes=2,
    random_state=999
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [3]:
# Create the k nearest neighbor classifier
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=5, algorithm='auto')
classifier.fit(X_train, y_train)

In [4]:
# Create predictions for the test set
y_pred = classifier.predict(X_test)

In [5]:
# Assess the accuracy of the model
# scikit-learn metrics expect the true labels first, then the predictions
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[158  11]
 [ 20 141]]
0.906060606060606
              precision    recall  f1-score   support

           0       0.89      0.93      0.91       169
           1       0.93      0.88      0.90       161

    accuracy                           0.91       330
   macro avg       0.91      0.91      0.91       330
weighted avg       0.91      0.91      0.91       330



The next step is to tune the hyperparameter k and see which value produces the best results.
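
One possible way to run that search is a cross-validated grid search with scikit-learn's `GridSearchCV`. The sketch below rebuilds the same synthetic dataset so that it runs on its own; the range of odd k values and the 5-fold setting are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same synthetic dataset and split as in the cells above
X, y = make_classification(
    n_samples=1000,
    n_features=3,
    n_redundant=1,
    n_classes=2,
    random_state=999
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Candidate k values (odd numbers avoid tied votes with 2 classes)
param_grid = {"n_neighbors": list(range(1, 32, 2))}

# 5-fold cross-validation on the training set only
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)            # best k found by cross-validation
print(search.best_score_)             # mean cross-validated accuracy for that k
print(search.score(X_test, y_test))   # accuracy of the refit model on the test set
```

Keeping the test set out of the search means the final score is still an estimate on data the tuning never saw.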

## Testing on the astar dataset
KNN is not really the right algorithm for this dataset, but it is worthwhile to check how it performs on it.

In [16]:
# Set the grid size here
n, m = 10, 10  # grid size for the problem
N = 100000      # Number of examples

# Probability of existence of obstacle
obstacle_probability = 0.2

from datasets.astar_dataset import make_astar_dataset
import time
# Create the data set
start = time.time()
X, y = make_astar_dataset(N, n, m, obstacle_probability)
print(f"Execution time: {time.time() - start:.4f} seconds")

Execution time: 18.9946 seconds


In [17]:
# Flatten the dataset
X_flat = X.reshape((X.shape[0], -1))


In [18]:
X_train, X_test, y_train, y_test = train_test_split(X_flat, y, test_size=0.33, random_state=42)
classifier = KNeighborsClassifier(n_neighbors=5, algorithm='auto')
classifier.fit(X_train, y_train)

In [19]:
# Create predictions for the test set
y_pred = classifier.predict(X_test)

In [21]:
# Assess the accuracy of the model (true labels first, then predictions)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.16796969696969696


In [24]:
import pandas as pd

# Wrap y in a pandas DataFrame so each label column can be summarized
y = pd.DataFrame(y)

y.describe()


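|       | 0        | 1        | 2        | 3        | 4        |
|-------|----------|----------|----------|----------|----------|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean  | 0.33432  | 0.14178  | 0.24046  | 0.2371   | 0.04634  |
| std   | 0.471755 | 0.348826 | 0.427365 | 0.425306 | 0.210221 |
| min   | 0.0      | 0.0      | 0.0      | 0.0      | 0.0      |
| 25%   | 0.0      | 0.0      | 0.0      | 0.0      | 0.0      |
| 50%   | 0.0      | 0.0      | 0.0      | 0.0      | 0.0      |
| 75%   | 1.0      | 0.0      | 0.0      | 0.0      | 0.0      |
| max   | 1.0      | 1.0      | 1.0      | 1.0      | 1.0      |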
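
The column means above sum to 1 and every value is 0 or 1, which is consistent with `y` being one-hot encoded over 5 classes (an inference from the summary, not something the dataset code confirms). If that holds, each column mean is a class frequency, and always predicting the most frequent class would score about 0.33, a useful reference point for the accuracy measured above. The sketch below shows the computation on a hypothetical stand-in label matrix, since the astar dataset itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the label matrix: 1000 one-hot rows over 5 classes
class_ids = rng.integers(0, 5, size=1000)
y_onehot = np.eye(5, dtype=int)[class_ids]

# Check the one-hot assumption: exactly one 1 per row
assert (y_onehot.sum(axis=1) == 1).all()

# Collapse one-hot rows into integer class labels
y_labels = y_onehot.argmax(axis=1)

# Class frequencies; for one-hot data these equal the column means
class_freq = np.bincount(y_labels, minlength=5) / len(y_labels)

# Accuracy obtained by always predicting the most frequent class
majority_baseline = class_freq.max()

print(class_freq)
print(majority_baseline)
```

Fitting the classifier on integer labels such as `y_labels` instead of the 2-D indicator matrix may also be worth trying, because scikit-learn treats a 2-D `y` as a multi-output target and scores a prediction as correct only when every column matches.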
