# k Nearest Neighbors

## Overview

The k-nearest neighbors algorithm is a simple and intuitive classification algorithm. It operates on the assumption that data points with similar features tend to belong to the same class. The algorithm classifies a new data point by finding the k nearest neighbors to that point in the feature space and assigning the class label that is most common among its neighbors.

Let's consider a simple example of classifying fruits based on their weight and color. We have a dataset of fruits, each labeled as either "apple" or "orange," along with their corresponding weight and color. To classify a new fruit, we calculate the distances between the new fruit and all fruits in the dataset. Suppose the new fruit is a small, red fruit. We choose a value for k, let's say k=3, and find the three nearest neighbors to the new fruit. If two of the nearest neighbors are apples and one is an orange, we classify the new fruit as an apple since apples are the majority among its nearest neighbors.
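
The fruit example can be sketched in a few lines of Python. The weights (in grams) and the 0-to-1 "redness" scores below are made-up illustrative values, and `scale` and `classify_fruit` are hypothetical helper names. Dividing the weight by 100 so that both features sit on a comparable scale is an assumption of this sketch, not something the example above prescribes.

```python
import math
from collections import Counter

# Hypothetical training data: (weight in grams, redness from 0 to 1, label)
fruits = [
    (150.0, 0.9, "apple"),
    (170.0, 0.8, "apple"),
    (250.0, 0.6, "apple"),
    (200.0, 0.2, "orange"),
    (220.0, 0.1, "orange"),
    (160.0, 0.3, "orange"),
]

def scale(weight, redness):
    """Put both features on a similar scale so weight does not dominate."""
    return (weight / 100.0, redness)

def classify_fruit(weight, redness, k=3):
    """Return the majority label among the k training fruits closest to the query."""
    query = scale(weight, redness)
    # Distance from the query to every training fruit, paired with its label
    distances = [
        (math.dist(query, scale(w, r)), label)
        for w, r, label in fruits
    ]
    # Keep the k closest and count their labels
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# A small, red fruit: its 3 nearest neighbors are two apples and one orange
print(classify_fruit(145.0, 0.85))  # prints "apple"
```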

Here's a simple ASCII representation of the k-NN algorithm in action:

```text
Training Dataset:
          Blue (B)        Red (R)
    ------------------------------
    1 |    B                R
    2 |       B          R
    3 |          B    R
    4 |    B        R
    5 |       B   R

New Data Point: (X)
    ------------------
    6 |      X

K = 3 (Nearest Neighbors)
```

In this example, we have a training dataset with blue and red data points. We want to classify a new data point (represented by "X") based on its k nearest neighbors. By measuring the distances from the new data point to all other data points, we can identify the k nearest neighbors and assign the class label based on the majority.

k-NN has no separate model to fit: training only stores the dataset, which takes O(n·d) space for n data points with d features. Prediction needs little extra space beyond that: the new data point and the k nearest neighbors, plus O(n) for the distances if they are all computed before selection.

Prediction is where k-NN gets expensive, especially on large datasets. For each new data point, a brute-force search computes the distance to every training point, which takes O(n·d) time for n data points with d features. Identifying the k nearest neighbors then takes O(n log n) with a full sort, or O(n) on average with a selection algorithm such as quickselect, depending on the implementation.
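
To make the choice of selection strategy concrete, here is a small sketch comparing a full sort with a bounded heap. The function names and sample points are illustrative. `heapq.nsmallest` is part of Python's standard library and tracks only k candidates while scanning, so it does roughly O(n log k) work rather than the O(n log n) of a full sort.

```python
import heapq
import math

def k_nearest_by_sort(points, query, k):
    """Sort all n points by distance, then take the first k: O(n log n)."""
    return sorted(points, key=lambda p: math.dist(p, query))[:k]

def k_nearest_by_heap(points, query, k):
    """Track only the k best candidates while scanning: about O(n log k)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

points = [(1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (0.0, 1.0), (9.0, 9.0)]
query = (0.0, 0.0)

# Both strategies select the same neighbors, closest first
assert k_nearest_by_sort(points, query, 2) == k_nearest_by_heap(points, query, 2)
print(k_nearest_by_heap(points, query, 2))  # prints [(0.0, 1.0), (1.0, 1.0)]
```

For small n the difference is negligible; the heap only starts to pay off when n is much larger than k.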

Whether the neighbor search is written iteratively or recursively makes no difference to these bounds: the cost comes from the distance calculations and the neighbor selection, not from the control flow.

## Implementation

Two things:
- The implementation is sensitive to the format of your data.
- kNN can solve regression as well, by averaging the target values of the k nearest neighbors. Here, I've only implemented classification, because laziness (and CPP matters more).
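
As a rough illustration of the regression variant mentioned above, the sketch below averages the target values of the k nearest neighbors. It is independent of the `KNN` class in this notebook; `knn_regress` and the sample data are hypothetical.

```python
import math

def knn_regress(train, query, k=3):
    """Predict a numeric target as the mean target of the k nearest training points.

    `train` is a list of (features, target) pairs.
    """
    # Sort training pairs by Euclidean distance from their features to the query
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    nearest_targets = [target for _, target in by_distance[:k]]
    return sum(nearest_targets) / len(nearest_targets)

# Made-up data: one feature, target equal to twice the feature
train = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0), ([10.0], 20.0)]
print(knn_regress(train, [2.0], k=3))  # mean of 4.0, 2.0 and 6.0, prints 4.0
```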

In [1]:
import csv
import math
import pathlib
from collections import Counter
from typing import Dict, List, Union, Tuple

In [23]:
class KNN:
    def __init__(self, k: int, data_path: pathlib.Path):
        self.k = k  # store the argument, not the `int` type
        self.data_path = data_path
        self.data: List[Dict[str, Union[float, str]]] = self.__load_data()
        self.stripped: List[List[float]] = self.__strip_data()
        
    def __load_data(self) -> List[Dict[str, Union[str, float]]]:
        """Load and attach CSV data to KNN instance."""
        with open(self.data_path, mode="r", newline="") as data:
            reader = csv.DictReader(data)
            self.data = [row for row in reader]
        return self.data
    
    def __strip_data(self) -> List[List[float]]:
        """Remove `label` from each entry, and strip keys to create pure list representation."""
        self.stripped = []
        for point in self.data:
            temp = []
            for key, value in point.items():
                if key == "label":
                    continue
                temp.append(float(value))
            self.stripped.append(temp)
        return self.stripped
        
    def distance(self, point1: List[float], point2: List[float]) -> float:
        """Compute Euclidean distance from `point1` to `point2`."""
        # Sum the squared coordinate differences, then take the square root
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(point1, point2)))
        
    def nearest(self) -> List[Dict[str, Union[str, float]]]:
        """Return the `k` rows of `self.data` with the smallest `distance` value."""
        # `sorted` returns a new list; each row must already carry a `distance` key (set in `classify`)
        return sorted(self.data, key=lambda row: row["distance"])[:self.k]

    def find_majority(self, nearest: List[Dict[str, Union[str, float]]]) -> Tuple[str, int]:
        """Return the most common label among `nearest`, along with its vote count."""
        votes = Counter(row["label"] for row in nearest)
        return votes.most_common(1)[0]

    def classify(self, point: List[float]) -> str:
        """Predict the label of `point` by majority vote among its `k` nearest neighbors."""
        # Attach the distance from `point` to each row in `self.data`
        for row, features in zip(self.data, self.stripped):
            row["distance"] = self.distance(point, features)
        # Return the majority label among the k nearest rows
        return self.find_majority(self.nearest())[0]

In [20]:
model = KNN(k=3, data_path=pathlib.Path("knn.csv"))
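
Since the class is sensitive to the data format, it may help to see the shape that `csv.DictReader` produces. The sketch below assumes a file with a header row, numeric feature columns, and a `label` column, which is the only column name the class treats specially. The `weight` and `redness` columns and their values are made up for illustration and are not the contents of `knn.csv`.

```python
import csv
import io

# Hypothetical file contents; an in-memory buffer stands in for a real file
sample = io.StringIO(
    "weight,redness,label\n"
    "150,0.9,apple\n"
    "200,0.2,orange\n"
)

rows = list(csv.DictReader(sample))
# DictReader yields every value as a string, so features need float() conversion
features = [
    [float(value) for key, value in row.items() if key != "label"]
    for row in rows
]
labels = [row["label"] for row in rows]

print(features)  # prints [[150.0, 0.9], [200.0, 0.2]]
print(labels)    # prints ['apple', 'orange']
```

Any non-numeric value in a feature column would make the `float()` conversion raise a `ValueError`, which is one way the format sensitivity shows up.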