# k-Nearest Neighbors (k-NN) Algorithm

k-Nearest Neighbors (k-NN) is a supervised learning algorithm that can be used for classification and regression tasks. It is a lazy learning algorithm, as it does not build an explicit model during the training phase, but instead stores the entire training dataset and makes predictions based on a similarity measure.

## History

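The k-NN algorithm dates back to 1951, when Fix and Hodges introduced it in a technical report on nonparametric discrimination. It drew wider attention after Cover and Hart (1967) analyzed the error rate of the nearest neighbor rule, and its simplicity and effectiveness made it popular for a variety of tasks through the 1960s and 1970s.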

## Mathematical Equations

k-NN has no model equation to fit. Its predictions rest on a distance (or similarity) measure between data points, most commonly the Euclidean distance:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where $x_i$ and $y_i$ are the $i$-th coordinates of the data points $x$ and $y$, respectively, and $n$ is the number of features.
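
As a quick numeric check of the formula, the sketch below computes the distance step by step for two made-up 2-D points (the values are illustrative, not taken from any dataset) and compares the result with NumPy's built-in norm.

```python
import numpy as np

# Two illustrative points, chosen so the result is a whole number
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

diff = x - y                       # coordinate-wise differences: [-3., -4.]
squared = diff ** 2                # squared differences: [9., 16.]
distance = np.sqrt(squared.sum())  # sqrt(9 + 16) = 5.0

print(distance)  # 5.0

# np.linalg.norm of the difference vector gives the same Euclidean distance
assert np.isclose(distance, np.linalg.norm(x - y))
```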

## Learning Algorithm

Because k-NN is a lazy learner, "training" amounts to storing the data; the real work happens at prediction time. The procedure consists of the following steps:

1. Determine the value of k (number of nearest neighbors) and the distance metric.
2. For each new data point:
    a. Calculate the distance between the new data point and all the training data points.
    b. Find the k training data points with the smallest distances.
    c. For classification, predict the class that has the majority vote among the k nearest neighbors.
       For regression, predict the average target value of the k nearest neighbors.
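
The steps above can be sketched as a single function that handles both the classification and the regression case. The function name `knn_predict`, its `task` parameter, and the toy data are illustrative assumptions, and Euclidean distance is assumed as the metric.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one point from its k nearest training points (Euclidean)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.asarray(x_new, dtype=float)
    # Step a: distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step b: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step c: majority vote (classification) or mean target (regression)
    if task == "classification":
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    return float(np.mean(y_train[nearest]))

# Toy 1-D data: three points near 2, two points near 10
X_toy = [[1.0], [2.0], [3.0], [10.0], [11.0]]
labels = ["a", "a", "a", "b", "b"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0]

print(knn_predict(X_toy, labels, [2.5], k=3))                      # a
print(knn_predict(X_toy, targets, [2.5], k=3, task="regression"))  # 2.0
```

Both calls use the same three neighbors (the points at 1, 2, and 3); only the final aggregation step differs.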

## Pros and Cons

**Pros:**
- Simple to understand and implement.
- Non-parametric: it makes no assumptions about the underlying data distribution, so it can adapt to irregular decision boundaries.
- Works well with small datasets.

**Cons:**
- Computationally expensive at prediction time, especially for large datasets: a brute-force query computes the distance from the new data point to every training point, and the entire training set must be kept in memory.
- Sensitive to the choice of k and the distance metric.
- Does not work well with high-dimensional data (curse of dimensionality).
- Requires preprocessing (e.g., feature scaling), because features with larger numeric ranges otherwise dominate the distance.
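
To illustrate the scaling issue, here is a minimal sketch with made-up data in which one feature (income in dollars) has a far larger range than the other (age in years). The helper names `standardize` and `nearest_index` and all values are hypothetical; z-score standardization is one common choice of scaling, not the only one.

```python
import numpy as np

def standardize(X_train, X_other):
    """Scale features to zero mean and unit variance using training statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (X_train - mean) / std, (X_other - mean) / std

def nearest_index(X_train, x):
    """Index of the training point closest to x (Euclidean distance)."""
    return int(np.argmin(np.sqrt(((X_train - x) ** 2).sum(axis=1))))

# Hypothetical data: column 0 is income in dollars, column 1 is age in years
X_demo = np.array([[30_000.0, 25.0],
                   [30_500.0, 60.0],
                   [90_000.0, 26.0]])
x_query = np.array([[31_000.0, 26.0]])

# Unscaled: the $500 income gap outweighs the 34-year age gap
print(nearest_index(X_demo, x_query[0]))  # 1

# Scaled: both features contribute comparably, so the 25-year-old is nearest
X_demo_s, x_query_s = standardize(X_demo, x_query)
print(nearest_index(X_demo_s, x_query_s[0]))  # 0
```

The scaling statistics come from the training data only and are then reused for the query point, so no information from new data leaks into the preprocessing.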

## Suitable Tasks and Datasets

k-NN and related nearest-neighbor methods can be applied to a variety of tasks, including:

- Handwritten digit recognition
- Image classification
- Recommender systems
- Anomaly detection

It works best with small, low-dimensional datasets. For large or high-dimensional data, plain k-NN is a poor fit: every prediction scans the whole training set, and distances become less informative as the number of dimensions grows (the curse of dimensionality).
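
One way to see the curse of dimensionality is to measure how much farther the farthest point is than the nearest one. The sketch below does this for uniformly random points, which is an assumption made purely for illustration; the function name `distance_contrast` is hypothetical. As the dimension grows, the ratio tends toward 1, meaning the "nearest" neighbor is barely nearer than any other point.

```python
import numpy as np

def distance_contrast(n_points, n_dims, rng):
    """Ratio of farthest to nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.sqrt(((points - query) ** 2).sum(axis=1))
    return distances.max() / distances.min()

rng = np.random.default_rng(0)
for n_dims in (2, 10, 100, 1000):
    # A large ratio means neighbors are meaningfully closer than other points
    print(n_dims, round(distance_contrast(1000, n_dims, rng), 2))
```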

## References

1. Fix, E., & Hodges Jr, J. L. (1951). Discriminatory analysis. Nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine.
2. Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27.


## Example Implementation

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from collections import Counter

# Function to calculate Euclidean distance between two points
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Minimal k-NN classifier: Euclidean distance with majority voting
class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Lazy learning: "fitting" only stores the training data
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)

    def predict(self, X):
        # Predict a label for each row of X
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

    def _predict(self, x):
        # Distance from x to every training point
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Indices of the k closest training points, nearest first
        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        # Majority vote; on a tie, the label encountered first (nearer neighbor) wins
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply the k-NN algorithm using all four features
knn = KNN(k=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)

# Visualize the results (using only the first two features).
# A separate model is fitted so the four-feature model above is not overwritten.
X_train_2d, X_test_2d = X_train[:, :2], X_test[:, :2]
knn_2d = KNN(k=3)
knn_2d.fit(X_train_2d, y_train)
y_pred_2d = knn_2d.predict(X_test_2d)

plt.scatter(X_test_2d[:, 0], X_test_2d[:, 1], c=y_pred_2d, cmap='viridis', marker='o', label='Predicted')
plt.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train, cmap='viridis', marker='x', label='Training')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.legend()
plt.title('k-NN Classification (Iris Dataset)')
plt.show()
```
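
The example above fixes k at 3. Since k-NN is sensitive to this choice, a common approach is to compare several values with cross-validation. The sketch below is one way to do that, not part of the original implementation: it swaps in scikit-learn's `KNeighborsClassifier` for the hand-written class so that it works with `cross_val_score`, adds feature standardization, and uses an arbitrary list of candidate k values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

scores = {}
for k in (1, 3, 5, 7, 9, 11):  # candidate values, chosen arbitrarily
    # Scaling sits inside the pipeline so it is refitted on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X_iris, y_iris, cv=5).mean()

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: mean accuracy {score:.3f}")
print("Best k:", best_k)
```

Odd values of k are often preferred for classification because they reduce the chance of tied votes.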
