# K-Nearest Neighbors (KNN) Classification

## Theoretical Foundation

K-Nearest Neighbors is a non-parametric, instance-based learning algorithm used for classification and regression. Unlike parametric methods that learn a fixed set of parameters, KNN stores the entire training dataset and makes predictions based on the local neighborhood of query points.

### Distance Metrics

The fundamental operation in KNN is computing distances between points. For two points $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ in $\mathbb{R}^n$:

**Euclidean Distance (L2 norm):**
$$d_E(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} = \|\mathbf{x} - \mathbf{y}\|_2$$

**Manhattan Distance (L1 norm):**
$$d_M(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i| = \|\mathbf{x} - \mathbf{y}\|_1$$

**Minkowski Distance (Lp norm):**
$$d_p(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$$
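
As a quick numerical check of the three definitions, the sketch below evaluates them on a single pair of points using only the standard library. The function name `minkowski_distance` is illustrative and is not part of this notebook's implementation; Manhattan and Euclidean distance are recovered as the special cases $p = 1$ and $p = 2$.

```python
def minkowski_distance(x, y, p):
    """Minkowski (Lp) distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x = (0.0, 0.0)
y = (3.0, 4.0)

d_manhattan = minkowski_distance(x, y, p=1)  # |3| + |4|
d_euclidean = minkowski_distance(x, y, p=2)  # sqrt(3^2 + 4^2)
d_cubic = minkowski_distance(x, y, p=3)      # (3^3 + 4^3)^(1/3)

print(d_manhattan, d_euclidean, round(d_cubic, 4))
```

For a fixed pair of points the Minkowski distance does not increase as $p$ grows, which is why the three values come out in descending order here.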

### Classification Algorithm

Given a training set $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\}$ where $y_i \in \{1, 2, \ldots, C\}$ represents class labels, the KNN classifier predicts the class of a query point $\mathbf{x}_q$ as follows:

1. Compute distances $d(\mathbf{x}_q, \mathbf{x}_i)$ for all training points
2. Identify the $k$ nearest neighbors: $\mathcal{N}_k(\mathbf{x}_q)$
3. Assign class by majority vote:

$$\hat{y} = \arg\max_{c \in \{1,\ldots,C\}} \sum_{i \in \mathcal{N}_k(\mathbf{x}_q)} \mathbb{1}(y_i = c)$$

where $\mathbb{1}(\cdot)$ is the indicator function.
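
The three steps can be traced on a toy one-dimensional training set. This is a minimal sketch with made-up data and a hypothetical helper name, `knn_predict`; it is separate from the class-based implementation in this notebook.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify a scalar query by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training point
    distances = [abs(query - x) for x in train_points]
    # Step 2: indices of the k smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Step 3: majority vote over the neighbors' labels
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_points = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
train_labels = [1, 1, 1, 2, 2, 2]

pred_low = knn_predict(train_points, train_labels, query=2.5, k=3)
pred_high = knn_predict(train_points, train_labels, query=6.5, k=3)
print(pred_low, pred_high)
```

The query at 2.5 has only class 1 points among its three nearest neighbors, and the query at 6.5 has only class 2 points, so the predictions are 1 and 2 respectively.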

### Weighted KNN

To give closer neighbors more influence, we can weight votes by inverse distance:

$$\hat{y} = \arg\max_{c} \sum_{i \in \mathcal{N}_k(\mathbf{x}_q)} w_i \cdot \mathbb{1}(y_i = c)$$

where $w_i = \frac{1}{d(\mathbf{x}_q, \mathbf{x}_i) + \epsilon}$ and $\epsilon$ prevents division by zero.
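
A small worked example shows how the weighted rule can overturn a plain majority. The neighbor distances and labels below are made up for illustration, and `EPSILON` plays the role of $\epsilon$.

```python
from collections import Counter, defaultdict

EPSILON = 1e-10  # small constant guarding against division by zero

# Three neighbors of a hypothetical query: one very close, two farther away
neighbor_distances = [0.1, 1.0, 1.2]
neighbor_labels = [1, 2, 2]

# Unweighted rule: every neighbor counts once
majority_label = Counter(neighbor_labels).most_common(1)[0][0]

# Weighted rule: each neighbor contributes 1 / (distance + epsilon)
weighted_votes = defaultdict(float)
for dist, label in zip(neighbor_distances, neighbor_labels):
    weighted_votes[label] += 1.0 / (dist + EPSILON)
weighted_label = max(weighted_votes, key=weighted_votes.get)

print(majority_label, weighted_label)
```

Class 2 wins the unweighted vote two to one, but the single class 1 neighbor carries a weight of about 10 against a combined weight of about 1.83 for class 2, so the weighted rule picks class 1.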

### Computational Complexity

- **Training:** $O(1)$ - simply store the data
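- **Prediction:** $O(Nd)$ per query for $N$ training samples and $d$ features (naive implementation), plus the cost of selecting the $k$ smallest distances
- With KD-trees or Ball trees: roughly $O(d \log N)$ per query on average in low dimensions, after an up-front tree construction cost; the advantage shrinks as $d$ grows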
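
Besides computing distances, a naive predictor must pick out the $k$ smallest of them. A full sort costs $O(N \log N)$, whereas a bounded heap needs only $O(N \log k)$. The sketch below uses the standard library's `heapq.nsmallest` on randomly generated stand-in distances; it illustrates the selection step only and does not change the $O(Nd)$ cost of computing the distances.

```python
import heapq
import random

random.seed(0)
distances = [random.random() for _ in range(1000)]  # stand-in for N computed distances
k = 5

# (distance, index) pairs for the k smallest distances, without sorting all N values
k_nearest = heapq.nsmallest(k, ((d, i) for i, d in enumerate(distances)))
k_indices = [i for _, i in k_nearest]

# Reference result from a full sort, for comparison
full_sort_indices = sorted(range(len(distances)), key=distances.__getitem__)[:k]
print(k_indices == full_sort_indices)
```

Both approaches return the same neighbor indices; the difference only matters when $N$ is large and $k$ is small.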

### Choosing K

The hyperparameter $k$ controls the bias-variance tradeoff:
- **Small $k$:** Low bias, high variance (sensitive to noise)
- **Large $k$:** High bias, low variance (over-smoothing)

Common practice: Use cross-validation to select optimal $k$, often choosing odd values to avoid ties in binary classification.
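
Odd $k$ rules out ties only when there are two classes. The sketch below uses made-up neighbor labels and a hypothetical helper, `top_classes`, to list every class that attains the maximum vote count.

```python
from collections import Counter

def top_classes(labels):
    """Return every class that attains the maximum vote count."""
    counts = Counter(labels)
    best = max(counts.values())
    return sorted(c for c, n in counts.items() if n == best)

binary_even_k = top_classes([0, 1, 1, 0])          # k = 4, two classes
binary_odd_k = top_classes([0, 1, 1])              # k = 3, two classes
three_class_odd_k = top_classes([0, 0, 1, 1, 2])   # k = 5, three classes

print(binary_even_k, binary_odd_k, three_class_odd_k)
```

With two classes the tie disappears once $k$ is odd, but with three classes a 2-2-1 split still ties at $k = 5$, so a tie-breaking rule is needed in the multiclass case.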

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# Set random seed for reproducibility
np.random.seed(42)

## Implementation from Scratch

We implement the KNN classifier without relying on scikit-learn, demonstrating the core algorithm.

In [None]:
class KNearestNeighbors:
    """
    K-Nearest Neighbors classifier implemented from scratch.
    
    Parameters
    ----------
    k : int
        Number of neighbors to consider
    distance_metric : str
        'euclidean' or 'manhattan'
    weighted : bool
        Whether to use distance-weighted voting
    """
    
    def __init__(self, k=3, distance_metric='euclidean', weighted=False):
        self.k = k
        self.distance_metric = distance_metric
        self.weighted = weighted
        self.X_train = None
        self.y_train = None
    
    def fit(self, X, y):
        """Store training data."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self
    
    def _compute_distance(self, x1, x2):
        """Compute distance between two points."""
        if self.distance_metric == 'euclidean':
            return np.sqrt(np.sum((x1 - x2) ** 2))
        elif self.distance_metric == 'manhattan':
            return np.sum(np.abs(x1 - x2))
        else:
            raise ValueError(f"Unknown metric: {self.distance_metric}")
    
    def _predict_single(self, x):
        """Predict class for a single sample."""
        # Compute distances to all training points
        distances = np.array([self._compute_distance(x, x_train) 
                              for x_train in self.X_train])
        
        # Get indices of k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]
        k_labels = self.y_train[k_indices]
        k_distances = distances[k_indices]
        
        if self.weighted:
            # Weighted voting by inverse distance
            weights = 1.0 / (k_distances + 1e-10)
            class_weights = {}
            for label, weight in zip(k_labels, weights):
                class_weights[label] = class_weights.get(label, 0) + weight
            return max(class_weights, key=class_weights.get)
        else:
            # Majority voting
            most_common = Counter(k_labels).most_common(1)
            return most_common[0][0]
    
    def predict(self, X):
        """Predict classes for multiple samples."""
        X = np.array(X)
        return np.array([self._predict_single(x) for x in X])
    
    def score(self, X, y):
        """Compute accuracy score."""
        predictions = self.predict(X)
        return np.mean(predictions == y)

## Generate Synthetic Dataset

We create a 2D classification problem with three distinct clusters to visualize the decision boundaries.

In [None]:
def generate_multiclass_data(n_samples_per_class=100):
    """
    Generate synthetic 2D data with three classes.
    Each class is a Gaussian cluster.
    """
    # Class 0: Cluster centered at (0, 0)
    X0 = np.random.randn(n_samples_per_class, 2) * 0.8 + np.array([0, 0])
    y0 = np.zeros(n_samples_per_class, dtype=int)
    
    # Class 1: Cluster centered at (3, 3)
    X1 = np.random.randn(n_samples_per_class, 2) * 0.8 + np.array([3, 3])
    y1 = np.ones(n_samples_per_class, dtype=int)
    
    # Class 2: Cluster centered at (3, 0)
    X2 = np.random.randn(n_samples_per_class, 2) * 0.8 + np.array([3, 0])
    y2 = np.full(n_samples_per_class, 2, dtype=int)
    
    # Combine all data
    X = np.vstack([X0, X1, X2])
    y = np.hstack([y0, y1, y2])
    
    # Shuffle the data
    shuffle_idx = np.random.permutation(len(y))
    return X[shuffle_idx], y[shuffle_idx]

# Generate data
X, y = generate_multiclass_data(n_samples_per_class=100)

# Split into train and test sets (80-20 split)
n_train = int(0.8 * len(y))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

print(f"Training samples: {len(y_train)}")
print(f"Test samples: {len(y_test)}")
print(f"Classes: {np.unique(y)}")

## Visualize Decision Boundaries

We create a mesh grid over the feature space and classify each point to visualize how KNN partitions the space.

In [None]:
def plot_decision_boundary(clf, X, y, title, ax):
    """
    Plot decision boundary for a 2D classifier.
    """
    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    # Predict on mesh grid
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot decision regions
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    ax.contour(xx, yy, Z, colors='k', linewidths=0.5)
    
    # Plot training points
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', 
                         edgecolors='black', s=50)
    
    ax.set_xlabel('$x_1$', fontsize=12)
    ax.set_ylabel('$x_2$', fontsize=12)
    ax.set_title(title, fontsize=14)
    
    return scatter

In [None]:
# Create figure with multiple subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('K-Nearest Neighbors: Effect of K on Decision Boundaries', 
             fontsize=16, fontweight='bold')

# Test different values of k
k_values = [1, 3, 5, 7, 15, 25]

for ax, k in zip(axes.flat, k_values):
    # Train KNN with this k
    knn = KNearestNeighbors(k=k, distance_metric='euclidean')
    knn.fit(X_train, y_train)
    
    # Compute accuracy
    train_acc = knn.score(X_train, y_train)
    test_acc = knn.score(X_test, y_test)
    
    # Plot decision boundary
    title = f'k = {k}\nTrain Acc: {train_acc:.2f}, Test Acc: {test_acc:.2f}'
    plot_decision_boundary(knn, X_train, y_train, title, ax)

plt.tight_layout()
plt.savefig('plot.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nDecision boundary visualization saved to 'plot.png'")

## Cross-Validation for Optimal K

We perform k-fold cross-validation to find the optimal number of neighbors.
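
The index bookkeeping behind k-fold cross-validation can be sketched in isolation. The helper `make_folds` and its `seed` parameter are illustrative assumptions; the sketch shuffles the sample indices once, cuts them into contiguous chunks, and lets the last fold absorb any remainder.

```python
import random

def make_folds(n_samples, n_folds, seed=0):
    """Shuffle sample indices and split them into n_folds contiguous chunks.

    The last fold absorbs the remainder when n_samples is not divisible by n_folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // n_folds
    folds = []
    for fold in range(n_folds):
        start = fold * fold_size
        end = (fold + 1) * fold_size if fold < n_folds - 1 else n_samples
        folds.append(indices[start:end])
    return folds

folds = make_folds(n_samples=23, n_folds=5)
for i, val_idx in enumerate(folds):
    # Every fold except the current one forms the training split
    train_idx = [j for f in folds if f is not val_idx for j in f]
    print(f"fold {i}: {len(train_idx)} train, {len(val_idx)} validation")
```

Each sample appears in exactly one validation fold, so every point is used for validation once and for training in the remaining folds.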

In [None]:
def k_fold_cross_validation(X, y, k_neighbors, n_folds=5):
    """
    Perform k-fold cross-validation for KNN.
    
    Returns mean and std of accuracy across folds.
    """
    n_samples = len(y)
    fold_size = n_samples // n_folds
    indices = np.arange(n_samples)
    # Folds are reshuffled on every call, so each k is scored on a different split
    np.random.shuffle(indices)
    
    accuracies = []
    
    for fold in range(n_folds):
        # Define validation set
        val_start = fold * fold_size
        val_end = (fold + 1) * fold_size if fold < n_folds - 1 else n_samples
        val_idx = indices[val_start:val_end]
        train_idx = np.concatenate([indices[:val_start], indices[val_end:]])
        
        # Split data
        X_tr, X_val = X[train_idx], X[val_idx]
        y_tr, y_val = y[train_idx], y[val_idx]
        
        # Train and evaluate
        knn = KNearestNeighbors(k=k_neighbors)
        knn.fit(X_tr, y_tr)
        acc = knn.score(X_val, y_val)
        accuracies.append(acc)
    
    return np.mean(accuracies), np.std(accuracies)

# Test range of k values
k_range = list(range(1, 31, 2))  # Odd values only; with three classes, ties are still possible
cv_results = []

print("Cross-validation results:")
print("-" * 40)

for k in k_range:
    mean_acc, std_acc = k_fold_cross_validation(X_train, y_train, k, n_folds=5)
    cv_results.append((k, mean_acc, std_acc))
    if k <= 11 or k >= 25:
        print(f"k = {k:2d}: Accuracy = {mean_acc:.3f} ± {std_acc:.3f}")

# Find optimal k
best_k = max(cv_results, key=lambda x: x[1])[0]
print(f"\nOptimal k = {best_k}")

In [None]:
# Plot cross-validation results
k_vals = [r[0] for r in cv_results]
means = [r[1] for r in cv_results]
stds = [r[2] for r in cv_results]

plt.figure(figsize=(10, 6))
plt.errorbar(k_vals, means, yerr=stds, marker='o', capsize=4, capthick=2)
plt.axvline(x=best_k, color='r', linestyle='--', label=f'Optimal k = {best_k}')
plt.xlabel('Number of Neighbors (k)', fontsize=12)
plt.ylabel('Cross-Validation Accuracy', fontsize=12)
plt.title('KNN: Cross-Validation Accuracy vs. K', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Comparison: Euclidean vs Manhattan Distance

We compare the two most common distance metrics to see how they affect classification.
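
The two metrics can disagree about which point is nearest. The sketch below uses made-up coordinates and standard-library helpers named `euclidean` and `manhattan` (illustrative names, independent of the classifier in this notebook) to show one such case.

```python
import math

def euclidean(a, b):
    """L2 distance between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    """L1 distance between two equal-length sequences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

query = (0.0, 0.0)
candidates = {"diagonal": (2.0, 2.0), "axis": (3.0, 0.0)}

nearest_l2 = min(candidates, key=lambda name: euclidean(query, candidates[name]))
nearest_l1 = min(candidates, key=lambda name: manhattan(query, candidates[name]))
print(nearest_l2, nearest_l1)
```

The diagonal point is closer under Euclidean distance (about 2.83 versus 3), while the axis-aligned point is closer under Manhattan distance (3 versus 4), so the neighbor sets, and therefore the decision boundaries, can differ between the two metrics.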

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Euclidean distance
knn_euclidean = KNearestNeighbors(k=5, distance_metric='euclidean')
knn_euclidean.fit(X_train, y_train)
acc_euclidean = knn_euclidean.score(X_test, y_test)
plot_decision_boundary(knn_euclidean, X_train, y_train, 
                       f'Euclidean Distance (L2)\nTest Accuracy: {acc_euclidean:.3f}', 
                       axes[0])

# Manhattan distance
knn_manhattan = KNearestNeighbors(k=5, distance_metric='manhattan')
knn_manhattan.fit(X_train, y_train)
acc_manhattan = knn_manhattan.score(X_test, y_test)
plot_decision_boundary(knn_manhattan, X_train, y_train, 
                       f'Manhattan Distance (L1)\nTest Accuracy: {acc_manhattan:.3f}', 
                       axes[1])

plt.tight_layout()
plt.show()

print(f"Euclidean Distance Test Accuracy: {acc_euclidean:.3f}")
print(f"Manhattan Distance Test Accuracy: {acc_manhattan:.3f}")

## Weighted vs Unweighted KNN

We compare standard majority voting with distance-weighted voting.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Unweighted KNN
knn_unweighted = KNearestNeighbors(k=7, weighted=False)
knn_unweighted.fit(X_train, y_train)
acc_unweighted = knn_unweighted.score(X_test, y_test)
plot_decision_boundary(knn_unweighted, X_train, y_train, 
                       f'Unweighted (Majority Vote)\nTest Accuracy: {acc_unweighted:.3f}', 
                       axes[0])

# Weighted KNN
knn_weighted = KNearestNeighbors(k=7, weighted=True)
knn_weighted.fit(X_train, y_train)
acc_weighted = knn_weighted.score(X_test, y_test)
plot_decision_boundary(knn_weighted, X_train, y_train, 
                       f'Weighted (Inverse Distance)\nTest Accuracy: {acc_weighted:.3f}', 
                       axes[1])

plt.tight_layout()
plt.show()

print(f"Unweighted KNN Test Accuracy: {acc_unweighted:.3f}")
print(f"Weighted KNN Test Accuracy: {acc_weighted:.3f}")

## Summary

### Key Takeaways

1. **K-Nearest Neighbors** is a simple yet powerful non-parametric classifier. It makes no parametric assumptions about the underlying data distribution, although it does rely on nearby points tending to share a label.

2. **Choice of K**: Smaller values of $k$ lead to more complex decision boundaries (low bias, high variance), while larger values produce smoother boundaries (high bias, low variance). Cross-validation is essential for selecting the optimal $k$.

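3. **Distance Metrics**: Euclidean distance is a sensible default for continuous features on comparable scales. Manhattan distance is less dominated by a large difference in a single coordinate and is sometimes preferred for high-dimensional data. Neither metric compensates for features measured on different scales, so feature scaling is needed in both cases.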

4. **Weighted Voting**: Distance-weighted KNN gives closer neighbors more influence, which can help when the $k$ neighbors of a query lie at very different distances, for example near a decision boundary.

### Limitations

- **Computational Cost**: $O(Nd)$ prediction time per query for $N$ training samples and $d$ features (naive implementation)
- **Curse of Dimensionality**: Performance degrades in high-dimensional spaces
- **Feature Scaling**: Requires normalization when features have different scales
- **Memory**: Must store entire training set

### Best Practices

- Normalize/standardize features before applying KNN
- Use cross-validation to tune $k$
- Consider dimensionality reduction (PCA) for high-dimensional data
- Use efficient data structures (KD-trees, Ball trees) for large datasets
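
The first recommendation can be illustrated with a small sketch of z-score standardization. The data are made up, the helper names `euclidean` and `standardize` are illustrative, and the scaling statistics are computed from the training points only and then reused for the query.

```python
import math
import statistics

def euclidean(a, b):
    """L2 distance between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def standardize(rows, means, stds):
    """Apply z-score scaling column by column using precomputed statistics."""
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

# Made-up training points: feature 0 spans about 2 units, feature 1 spans thousands
train = [(1.0, 1400.0), (3.0, 1100.0), (1.2, 5000.0)]
query = (1.1, 1050.0)

# Statistics come from the training data only and are reused for the query
columns = list(zip(*train))
means = [statistics.mean(col) for col in columns]
stds = [statistics.pstdev(col) for col in columns]

train_scaled = standardize(train, means, stds)
query_scaled = standardize([query], means, stds)[0]

nearest_raw = min(range(len(train)), key=lambda i: euclidean(query, train[i]))
nearest_scaled = min(range(len(train)), key=lambda i: euclidean(query_scaled, train_scaled[i]))
print(nearest_raw, nearest_scaled)
```

On the raw data the large-range feature dominates and the nearest neighbor is the point at index 1, even though it is far away in feature 0. After standardization both features contribute comparably and the nearest neighbor becomes the point at index 0.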