# K-Nearest Neighbors (KNN) - Complete Guide

## From Basics to Advanced with Visualizations

KNN is a **non-parametric**, **instance-based** learning algorithm used for classification and regression.

### What You'll Learn
1. KNN fundamentals and distance metrics
2. Implementation from scratch
3. Choosing optimal K
4. Decision boundaries
5. Distance weighting
6. The curse of dimensionality
7. KNN regression and hyperparameter tuning with scikit-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_iris, make_moons, make_circles
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from collections import Counter

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

## 1. How KNN Works

### Algorithm Steps:
1. Store all training data
2. For a new point:
   - Calculate distance to all training points
   - Find K nearest neighbors
   - **Classification**: Majority vote
   - **Regression**: Average of K neighbors
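
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the helper names (`k_nearest`, `knn_classify`, `knn_regress`), the regression targets, and the query point `(5, 3.5)` are all assumptions made for this example. The query was chosen so that no two distances tie.

```python
import math
from collections import Counter

# Toy training set: six 2-D points in two classes
train_points = [(1, 2), (2, 3), (3, 1), (6, 5), (7, 7), (8, 6)]
train_labels = [0, 0, 0, 1, 1, 1]
train_targets = [1.0, 1.5, 0.5, 4.0, 5.0, 4.5]  # hypothetical regression targets
query = (5, 3.5)  # hypothetical new point, chosen to avoid distance ties


def k_nearest(query, points, k):
    """Return the indices of the k training points closest to `query`."""
    distances = [math.dist(query, p) for p in points]
    return sorted(range(len(points)), key=lambda i: distances[i])[:k]


def knn_classify(query, points, labels, k):
    """Classification: majority vote among the k nearest labels."""
    neighbors = k_nearest(query, points, k)
    return Counter(labels[i] for i in neighbors).most_common(1)[0][0]


def knn_regress(query, points, targets, k):
    """Regression: mean of the k nearest target values."""
    neighbors = k_nearest(query, points, k)
    return sum(targets[i] for i in neighbors) / k


for k in (1, 3, 5):
    print(f"K={k}: class {knn_classify(query, train_points, train_labels, k)}")
print("Regression (K=3):", knn_regress(query, train_points, train_targets, 3))
```

With this data the predicted class flips from 1 (K=1) to 0 (K=3) and back to 1 (K=5), which shows how much the choice of K can matter on small datasets.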

In [None]:
# Create simple dataset
np.random.seed(42)
X_train = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[5, 4]])

# Visualize KNN concept
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, k in enumerate([1, 3, 5]):
    ax = axes[idx]
    
    # Plot training data
    scatter = ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, 
                        cmap='RdYlBu', s=200, edgecolors='black', linewidth=2)
    
    # Plot new point
    ax.scatter(X_new[:, 0], X_new[:, 1], c='green', s=300, 
              marker='*', edgecolors='black', linewidth=2, label='New Point')
    
    # Calculate distances
    distances = np.sqrt(np.sum((X_train - X_new)**2, axis=1))
    nearest_idx = np.argsort(distances)[:k]
    
    # Draw lines to, and circles around, the k nearest neighbors
    for i in nearest_idx:
        ax.plot([X_new[0, 0], X_train[i, 0]], 
               [X_new[0, 1], X_train[i, 1]], 
               'g--', alpha=0.5, linewidth=2)
        circle = plt.Circle(X_train[i], 0.3, color='green', fill=False, linewidth=2)
        ax.add_patch(circle)
    
    # Prediction
    prediction = Counter(y_train[nearest_idx]).most_common(1)[0][0]
    ax.set_title(f'K = {k}\nPrediction: Class {prediction}', fontsize=14, fontweight='bold')
    ax.set_xlabel('Feature 1', fontsize=12)
    ax.set_ylabel('Feature 2', fontsize=12)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 2. Distance Metrics

Common distance metrics:

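For two points $x$ and $y$ with $n$ features each:

1. **Euclidean**: $d(x,y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$
2. **Manhattan**: $d(x,y) = \sum_{i=1}^n |x_i - y_i|$
3. **Minkowski**: $d(x,y) = \left(\sum_{i=1}^n |x_i - y_i|^p\right)^{1/p}$, which reduces to Manhattan for $p=1$ and Euclidean for $p=2$
4. **Cosine**: $d(x,y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|}$, i.e. one minus the cosine similarity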
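
As a quick numerical check of these formulas, the sketch below evaluates each metric for the points `(2, 2)` and `(5, 6)` using only the standard library. The function names `minkowski` and `cosine_similarity` are illustrative helpers written for this example, not library calls.

```python
import math


def minkowski(u, v, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a, b = (2, 2), (5, 6)
print("Manhattan      :", minkowski(a, b, p=1))   # |3| + |4| = 7
print("Euclidean      :", minkowski(a, b, p=2))   # sqrt(9 + 16) = 5
print("Minkowski (p=3):", minkowski(a, b, p=3))   # (27 + 64) ** (1/3)
print("Cosine distance:", 1 - cosine_similarity(a, b))
```

The cosine distance here is very small because the two vectors point in almost the same direction, even though they are five units apart in Euclidean terms. Cosine ignores magnitude, which is why it is usually reserved for data such as text vectors.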

In [None]:
# Visualize different distance metrics
point_a = np.array([2, 2])
point_b = np.array([5, 6])

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Euclidean
euclidean = np.sqrt(np.sum((point_a - point_b)**2))
axes[0].plot([point_a[0], point_b[0]], [point_a[1], point_b[1]], 'r-', linewidth=3, label=f'Euclidean = {euclidean:.2f}')
axes[0].scatter(*point_a, s=200, c='blue', edgecolors='black', zorder=5)
axes[0].scatter(*point_b, s=200, c='red', edgecolors='black', zorder=5)
axes[0].set_title('Euclidean Distance\n(Straight line)', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Manhattan
manhattan = np.sum(np.abs(point_a - point_b))
axes[1].plot([point_a[0], point_b[0]], [point_a[1], point_a[1]], 'b-', linewidth=3)
axes[1].plot([point_b[0], point_b[0]], [point_a[1], point_b[1]], 'b-', linewidth=3, label=f'Manhattan = {manhattan:.2f}')
axes[1].scatter(*point_a, s=200, c='blue', edgecolors='black', zorder=5)
axes[1].scatter(*point_b, s=200, c='red', edgecolors='black', zorder=5)
axes[1].set_title('Manhattan Distance\n(Grid path)', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Minkowski (p=3)
minkowski = np.power(np.sum(np.abs(point_a - point_b)**3), 1/3)
axes[2].plot([point_a[0], point_b[0]], [point_a[1], point_b[1]], 'g-', linewidth=3, label=f'Minkowski (p=3) = {minkowski:.2f}')
axes[2].scatter(*point_a, s=200, c='blue', edgecolors='black', zorder=5)
axes[2].scatter(*point_b, s=200, c='red', edgecolors='black', zorder=5)
axes[2].set_title('Minkowski Distance (p=3)', fontsize=14)
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Implementation from Scratch

In [None]:
class KNNClassifierScratch:
    """K-Nearest Neighbors Classifier from scratch"""
    
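    def __init__(self, k=3, distance_metric='euclidean', p=3):
        self.k = k
        self.distance_metric = distance_metric
        self.p = p  # exponent, used only when distance_metric='minkowski'
        
    def fit(self, X, y):
        """Store training data (KNN has no explicit training step)"""
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)
        return self
    
    def _calculate_distance(self, x1, x2):
        """Calculate distances from each row of x1 to the single point x2"""
        if self.distance_metric == 'euclidean':
            return np.sqrt(np.sum((x1 - x2)**2, axis=1))
        elif self.distance_metric == 'manhattan':
            return np.sum(np.abs(x1 - x2), axis=1)
        elif self.distance_metric == 'minkowski':
            return np.power(np.sum(np.abs(x1 - x2)**self.p, axis=1), 1/self.p)
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unknown distance_metric: {self.distance_metric!r}")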
    
    def predict(self, X):
        """Predict class labels for samples in X"""
        predictions = [self._predict_single(x) for x in X]
        return np.array(predictions)
    
    def _predict_single(self, x):
        """Predict class label for a single sample"""
        # Calculate distances to all training samples
        distances = self._calculate_distance(self.X_train, x)
        
        # Get indices of k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]
        
        # Get labels of k nearest neighbors
        k_nearest_labels = self.y_train[k_indices]
        
        # Return most common class label
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]

# Test on Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train our KNN
knn_scratch = KNNClassifierScratch(k=5)
knn_scratch.fit(X_train_scaled, y_train)

# Predictions
y_pred = knn_scratch.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy (from scratch): {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

## 4. Choosing Optimal K

- **Small K**: More sensitive to noise (overfitting)
- **Large K**: Smoother boundaries (underfitting)
- **Rule of thumb**: K = √n (where n = number of samples)
- **Best practice**: Use cross-validation
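
The rule of thumb is simple arithmetic and can serve as a starting point for a cross-validation search. The helper below is a hypothetical sketch: the name `sqrt_rule_k` and the convention of nudging the result to an odd number (to reduce ties in two-class votes) are assumptions added for this example.

```python
import math


def sqrt_rule_k(n_samples):
    """Heuristic starting K: round sqrt(n), then bump even values to odd."""
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1


for n in (105, 150, 1000):
    print(f"n = {n:>4}  ->  K = {sqrt_rule_k(n)}")
```

Treat the result as the center of a search range rather than a final answer. Cross-validation on the actual data can land on a noticeably different K.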

In [None]:
# Test different K values
k_values = range(1, 31)
train_scores = []
test_scores = []
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    
    train_scores.append(knn.score(X_train_scaled, y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))
    cv_scores.append(cross_val_score(knn, X_train_scaled, y_train, cv=5).mean())

# Find optimal K
optimal_k = k_values[np.argmax(cv_scores)]

# Plot
plt.figure(figsize=(12, 6))
plt.plot(k_values, train_scores, 'b-o', label='Training Accuracy', linewidth=2)
plt.plot(k_values, test_scores, 'r-s', label='Test Accuracy', linewidth=2)
plt.plot(k_values, cv_scores, 'g-^', label='CV Accuracy', linewidth=2)
plt.axvline(x=optimal_k, color='purple', linestyle='--', linewidth=2, label=f'Optimal K = {optimal_k}')
plt.xlabel('K (Number of Neighbors)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Model Performance vs K Value', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Optimal K: {optimal_k}")
print(f"Best CV Accuracy: {max(cv_scores):.4f}")

## 5. Decision Boundaries Visualization

In [None]:
# Create non-linear dataset
X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
    X_moons, y_moons, test_size=0.3, random_state=42
)

# Train KNN with different K values
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

k_values_viz = [1, 3, 5, 10, 20, 50]

for idx, k in enumerate(k_values_viz):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_m, y_train_m)
    
    # Create decision boundary
    h = 0.02
    x_min, x_max = X_moons[:, 0].min() - 0.5, X_moons[:, 0].max() + 0.5
    y_min, y_max = X_moons[:, 1].min() - 0.5, X_moons[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    axes[idx].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
    axes[idx].scatter(X_train_m[:, 0], X_train_m[:, 1], c=y_train_m, 
                     cmap='RdYlBu', edgecolors='black', s=50)
    
    accuracy = knn.score(X_test_m, y_test_m)
    axes[idx].set_title(f'K = {k}\nAccuracy = {accuracy:.3f}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

## 6. Distance Weighting

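Give more weight to closer neighbors. A common choice, and the one scikit-learn applies with `weights='distance'`, is the inverse of the distance:

$$w_i = \frac{1}{d_i}$$

where $d_i$ is the distance to neighbor $i$. Inverse-square weighting, $w_i = 1/d_i^2$, is a steeper variant that can be passed to scikit-learn as a callable.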
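
To see how weighting can change a prediction, here is a minimal sketch of an inverse-distance vote. The function `weighted_vote`, its `power` parameter (1 for inverse distance, 2 for inverse squared distance), the `eps` guard against division by zero, and the three example neighbors are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def weighted_vote(distances, labels, power=1, eps=1e-12):
    """Sum 1 / d**power per class and return (winning label, class scores)."""
    scores = defaultdict(float)
    for d, label in zip(distances, labels):
        scores[label] += 1.0 / (d ** power + eps)  # eps avoids division by zero
    winner = max(scores, key=scores.get)
    return winner, dict(scores)


# Hypothetical neighbors: one very close point of class 1, two farther of class 0
distances = [0.5, 2.0, 2.5]
labels = [1, 0, 0]

uniform_winner = Counter(labels).most_common(1)[0][0]
weighted_winner, scores = weighted_vote(distances, labels)

print("Uniform vote :", uniform_winner)
print("Weighted vote:", weighted_winner, scores)
```

The uniform vote picks class 0 (two votes to one), while the weighted vote picks class 1 because the single close neighbor contributes a weight of 2.0 against a combined 0.9 for the two distant ones.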

In [None]:
# Compare uniform vs distance weighting
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for idx, weights in enumerate(['uniform', 'distance']):
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
    knn.fit(X_train_m, y_train_m)
    
    # Decision boundary
    h = 0.02
    x_min, x_max = X_moons[:, 0].min() - 0.5, X_moons[:, 0].max() + 0.5
    y_min, y_max = X_moons[:, 1].min() - 0.5, X_moons[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
    axes[idx].scatter(X_train_m[:, 0], X_train_m[:, 1], c=y_train_m, 
                     cmap='RdYlBu', edgecolors='black', s=50)
    
    accuracy = knn.score(X_test_m, y_test_m)
    axes[idx].set_title(f'Weights: {weights}\nAccuracy = {accuracy:.3f}', 
                       fontsize=14, fontweight='bold')
    axes[idx].set_xlabel('Feature 1', fontsize=12)
    axes[idx].set_ylabel('Feature 2', fontsize=12)

plt.tight_layout()
plt.show()

## 7. The Curse of Dimensionality

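KNN struggles with high-dimensional data because:
- Distances become less meaningful: the gap between the nearest and farthest neighbor shrinks relative to the distances themselves
- As a result, all points look almost equidistant from a query
- Computational cost grows with the number of features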
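
The loss of distance contrast can be checked numerically. The sketch below draws uniform random points in the unit hypercube and reports the relative contrast `(max - min) / min` of their distances to a random query. The function name `relative_contrast`, the sample size, and the seed are arbitrary choices for this illustration.

```python
import math
import random


def relative_contrast(n_dims, n_points=200, seed=42):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


for n_dims in (2, 10, 100, 1000):
    print(f"{n_dims:>4} dims: relative contrast = {relative_contrast(n_dims):.3f}")
```

The contrast should fall sharply as the number of dimensions grows: in two dimensions the farthest point is many times farther away than the nearest, while in a thousand dimensions the two are nearly the same distance from the query.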

In [None]:
# Demonstrate curse of dimensionality
import time

dimensions = [2, 5, 10, 20, 50, 100]
accuracies = []
training_times = []  # despite the name, this records fit + predict time

for n_features in dimensions:
    # Generate dataset (at most 10 informative features; the rest are noise)
    X_dim, y_dim = make_classification(n_samples=500, n_features=n_features, 
                                       n_informative=min(n_features, 10),
                                       n_redundant=0, random_state=42)
    
    X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
        X_dim, y_dim, test_size=0.3, random_state=42
    )
    
    scaler_d = StandardScaler()
    X_train_d_scaled = scaler_d.fit_transform(X_train_d)
    X_test_d_scaled = scaler_d.transform(X_test_d)
    
    knn = KNeighborsClassifier(n_neighbors=5)
    
    # Time fit and predict together: KNN does most of its work at prediction time
    start = time.perf_counter()
    knn.fit(X_train_d_scaled, y_train_d)
    knn.predict(X_test_d_scaled)
    end = time.perf_counter()
    
    accuracies.append(knn.score(X_test_d_scaled, y_test_d))
    training_times.append(end - start)

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(dimensions, accuracies, 'b-o', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Dimensions', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Accuracy vs Dimensionality', fontsize=14)
axes[0].grid(True, alpha=0.3)

axes[1].plot(dimensions, training_times, 'r-s', linewidth=2, markersize=8)
axes[1].set_xlabel('Number of Dimensions', fontsize=12)
axes[1].set_ylabel('Time (seconds)', fontsize=12)
axes[1].set_title('Computational Cost vs Dimensionality', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 8. KNN for Regression

In [None]:
# Generate regression data
np.random.seed(42)
X_reg = np.sort(5 * np.random.rand(100, 1), axis=0)
y_reg = np.sin(X_reg).ravel() + np.random.randn(100) * 0.1

# Train KNN regressors with different K
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

X_test_reg = np.linspace(0, 5, 500)[:, np.newaxis]

for idx, k in enumerate([1, 5, 20]):
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_reg, y_reg)
    y_pred_reg = knn_reg.predict(X_test_reg)
    
    axes[idx].scatter(X_reg, y_reg, color='blue', s=50, label='Training data')
    axes[idx].plot(X_test_reg, y_pred_reg, 'r-', linewidth=2, label=f'KNN (K={k})')
    axes[idx].plot(X_test_reg, np.sin(X_test_reg), 'g--', linewidth=2, label='True function')
    axes[idx].set_xlabel('X', fontsize=12)
    axes[idx].set_ylabel('y', fontsize=12)
    axes[idx].set_title(f'KNN Regression (K={k})', fontsize=14)
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 9. Hyperparameter Tuning with GridSearchCV

In [None]:
# Define parameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
    'weights': ['uniform', 'distance'],
    # Note: 'minkowski' with the default p=2 is equivalent to 'euclidean'
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

# Grid search
knn_grid = KNeighborsClassifier()
grid_search = GridSearchCV(knn_grid, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

print("Best Parameters:", grid_search.best_params_)
print(f"Best CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test_scaled, y_test):.4f}")

# Visualize grid search results (pivot_table averages each cell over the metrics)
results_df = pd.DataFrame(grid_search.cv_results_)
pivot_table = results_df.pivot_table(
    values='mean_test_score', 
    index='param_n_neighbors',
    columns='param_weights'
)

plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, annot=True, fmt='.3f', cmap='YlGnBu')
plt.title('Grid Search Results: Mean CV Accuracy', fontsize=14)
plt.xlabel('Weights', fontsize=12)
plt.ylabel('Number of Neighbors', fontsize=12)
plt.show()

## Summary

### Key Takeaways

1. **KNN is lazy learning**: Fitting only stores the data (plus an optional search index); nearly all computation happens at prediction time
2. **Distance matters**: Feature scaling is critical
3. **K selection**: Use cross-validation to find optimal K
4. **Distance weighting**: Can improve performance
5. **Curse of dimensionality**: Struggles with high-dimensional data
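
To make the scaling takeaway concrete, the sketch below shows the nearest neighbor changing once features are standardized. The data (ages in years, incomes in dollars), the query, and the helper names `standardize` and `nearest_index` are hypothetical and exist only for this illustration.

```python
import math
import statistics

# Hypothetical rows: (age in years, income in dollars)
train_rows = [(25, 50_000), (60, 51_000), (27, 58_000)]
query = (26, 51_500)


def standardize(rows, reference):
    """Z-score each column of `rows` using the mean and std of `reference`."""
    columns = list(zip(*reference))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # population std
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


def nearest_index(point, candidates):
    """Index of the candidate with the smallest Euclidean distance to `point`."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


raw_nearest = nearest_index(query, train_rows)

scaled_train = standardize(train_rows, train_rows)
scaled_query = standardize([query], train_rows)[0]
scaled_nearest = nearest_index(scaled_query, scaled_train)

print("Nearest without scaling:", train_rows[raw_nearest])
print("Nearest with scaling   :", train_rows[scaled_nearest])
```

Without scaling, income dominates the distance, so the 26-year-old query is matched to the 60-year-old row because their incomes differ by only 500. After standardization both features contribute comparably and the 25-year-old row becomes the nearest neighbor.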

### Pros and Cons

**Pros:**
- Simple and intuitive
- No explicit training phase
- Non-parametric (no assumptions about the underlying data distribution)
- Works for both classification and regression
- Naturally handles multi-class problems

**Cons:**
- Slow prediction time (brute-force search costs O(n·d) per query for n training points and d features)
- Requires feature scaling
- Suffers from curse of dimensionality
- Memory intensive (stores all training data)
- Sensitive to irrelevant features

### When to Use KNN

**Use when:**
- Small to medium datasets
- Low-dimensional feature space
- Non-linear decision boundaries
- Need interpretable results

**Avoid when:**
- Large datasets (slow prediction)
- High-dimensional data
- Real-time predictions needed
- Many irrelevant features

### Practice Problems

1. Implement weighted KNN in the scratch version
2. Compare KNN with other algorithms on imbalanced datasets
3. Use KNN for anomaly detection
4. Implement KNN with custom distance metrics