# k-Nearest Neighbors (k-NN) Regression From Scratch

k-NN regression predicts a continuous target by averaging the target values of the k training samples closest to the query point.

## Key Concepts:
- **Instance-Based Learning**: No explicit training phase
- **Distance Metrics**: Euclidean, Manhattan, etc.
- **Averaging**: Prediction based on mean of k nearest neighbors
- **Weighted Averaging**: Optional distance-weighted predictions

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## 1. Mathematical Foundation

### Distance Metrics:

**Euclidean Distance:**
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

**Manhattan Distance:**
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
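
As a quick numerical check of the two metrics named in the key concepts, here is a minimal sketch in NumPy. The function names `euclidean_distance` and `manhattan_distance` are illustrative choices, not part of any library.

```python
import numpy as np

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan_distance(x, y):
    """Sum of the absolute coordinate differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

a = [0.0, 0.0]
b = [3.0, 4.0]
print(euclidean_distance(a, b))  # 5.0 (the 3-4-5 triangle)
print(manhattan_distance(a, b))  # 7.0 (3 + 4)
```

The two metrics can rank neighbors differently, so the choice of metric may change which points count as "nearest".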

### Prediction Rule (Simple Average):
$$\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i$$

### Prediction Rule (Weighted Average):
$$\hat{y} = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i}$$

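where $w_i = \frac{1}{d_i + \epsilon}$ is the inverse-distance weight of neighbor $i$, $d_i$ is its distance to the query point, and $\epsilon$ is a small constant (the implementation below uses $10^{-10}$) that prevents division by zero when $d_i = 0$.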
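
To make the two prediction rules concrete, the sketch below applies them to three made-up neighbors. The target values and distances are illustrative numbers, not taken from the dataset used later.

```python
import numpy as np

# Hypothetical targets and distances of the k = 3 nearest neighbors
k_targets = np.array([10.0, 20.0, 30.0])
k_distances = np.array([1.0, 2.0, 4.0])
eps = 1e-10  # same role as epsilon in the formula above

# Simple average: every neighbor counts equally
uniform_pred = k_targets.mean()

# Weighted average: weights are 1/d, so roughly [1.0, 0.5, 0.25]
w = 1.0 / (k_distances + eps)
weighted_pred = np.sum(w * k_targets) / np.sum(w)

print(uniform_pred)             # 20.0
print(round(weighted_pred, 4))  # 15.7143, i.e. 27.5 / 1.75
```

The weighted prediction is pulled toward 10 because the closest neighbor carries the largest weight.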

## 2. Implementation

In [None]:
class KNNRegressor:
    def __init__(self, k=3, distance_metric='euclidean', weights='uniform'):
        """
        Initialize k-NN Regressor
        
        Parameters:
        -----------
        k : int
            Number of neighbors to consider (default=3)
        distance_metric : str
            Distance metric: 'euclidean' or 'manhattan'
        weights : str
            Weight function: 'uniform' or 'distance'
        """
        self.k = k
        self.distance_metric = distance_metric
        self.weights = weights
        self.X_train = None
        self.y_train = None
    
    def fit(self, X, y):
        """
        'Train' the model (just store the training data)
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
        y : array-like, shape (n_samples,)
        """
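        # Convert to NumPy arrays so the integer-array indexing in
        # _predict_single also works for lists or pandas objects
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y, dtype=float)
        if not 1 <= self.k <= len(self.X_train):
            raise ValueError(f"k must be between 1 and {len(self.X_train)}, got {self.k}")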
        return self
    
    def _calculate_distance(self, x1, x2):
        """
        Calculate distance between two points
        """
        if self.distance_metric == 'euclidean':
            return np.sqrt(np.sum((x1 - x2) ** 2))
        elif self.distance_metric == 'manhattan':
            return np.sum(np.abs(x1 - x2))
        else:
            raise ValueError(f"Unknown distance metric: {self.distance_metric}")
    
    def _predict_single(self, x):
        """
        Predict value for a single sample
        """
        # Calculate distances to all training samples
        distances = np.array([self._calculate_distance(x, x_train) for x_train in self.X_train])
        
        # Get indices of k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]
        
        # Get values of k nearest neighbors
        k_nearest_values = self.y_train[k_indices]
        
        # Calculate prediction based on weighting scheme
        if self.weights == 'uniform':
            # Simple average
            return np.mean(k_nearest_values)
        elif self.weights == 'distance':
            # Weighted average (inverse distance)
            k_distances = distances[k_indices]
            # Add small epsilon to avoid division by zero
            weights = 1 / (k_distances + 1e-10)
            return np.sum(weights * k_nearest_values) / np.sum(weights)
        else:
            raise ValueError(f"Unknown weights: {self.weights}")
    
    def predict(self, X):
        """
        Predict values for multiple samples
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
        
        Returns:
        --------
        predictions : array, shape (n_samples,)
        """
        # np.asarray ensures we iterate over rows (samples), even for DataFrames
        return np.array([self._predict_single(x) for x in np.asarray(X, dtype=float)])
    
    def score(self, X, y):
        """
        Calculate R² score
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
        y : array-like, shape (n_samples,)
        
        Returns:
        --------
        r2_score : float
        """
        y = np.asarray(y, dtype=float)
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)      # residual sum of squares
        ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
        return 1 - (ss_res / ss_tot)
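
The class above computes distances one pair at a time in a Python loop, which is easy to read but slow. As a hedged alternative, the sketch below computes all Euclidean distances at once with NumPy broadcasting and uniform weights only. The function name `knn_predict_vectorized` is hypothetical, and the approach assumes the full `(n_query, n_train, n_features)` difference array fits in memory.

```python
import numpy as np

def knn_predict_vectorized(X_train, y_train, X_query, k=3):
    """Uniform-weight k-NN regression with Euclidean distance, no Python loop."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)

    # Broadcasting yields pairwise differences of shape
    # (n_query, n_train, n_features)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    distances = np.sqrt(np.sum(diffs ** 2, axis=2))  # (n_query, n_train)

    # argpartition places the k smallest distances in the first k columns
    # (in arbitrary order), which is enough for an unweighted mean
    k_idx = np.argpartition(distances, k - 1, axis=1)[:, :k]
    return y_train[k_idx].mean(axis=1)

X_demo = np.array([[0.0], [1.0], [2.0], [10.0]])
y_demo = np.array([0.0, 1.0, 2.0, 10.0])
print(knn_predict_vectorized(X_demo, y_demo, np.array([[1.1]]), k=3))  # [1.]
```

When several training points are exactly equidistant from a query, `argpartition` and `argsort` may pick different ones, so results can differ slightly from the loop-based class in that case.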

## 3. Testing on Synthetic Data

In [None]:
# Generate synthetic data
np.random.seed(42)
X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
print(f"Features: {X_train.shape[1]}")

In [None]:
# Train k-NN Regressor
knn = KNNRegressor(k=5)
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred_train = knn.predict(X_train_scaled)
y_pred_test = knn.predict(X_test_scaled)

# Calculate scores
train_score = knn.score(X_train_scaled, y_train)
test_score = knn.score(X_test_scaled, y_test)

print(f"\nk-NN Regressor (k={knn.k})")
print(f"Train R² Score: {train_score:.4f}")
print(f"Test R² Score: {test_score:.4f}")

## 4. Comparison with Scikit-learn

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# Train sklearn k-NN
sklearn_knn = KNeighborsRegressor(n_neighbors=5)
sklearn_knn.fit(X_train_scaled, y_train)

# Compare scores
sklearn_train_score = sklearn_knn.score(X_train_scaled, y_train)
sklearn_test_score = sklearn_knn.score(X_test_scaled, y_test)

print("\nComparison:")
print(f"{'Method':<20} {'Train R²':<12} {'Test R²':<12}")
print("-" * 44)
print(f"{'Our k-NN':<20} {train_score:<12.4f} {test_score:<12.4f}")
print(f"{'Sklearn k-NN':<20} {sklearn_train_score:<12.4f} {sklearn_test_score:<12.4f}")

## 5. Effect of k (Number of Neighbors)

In [None]:
# Test different k values
k_values = range(1, 31)
train_scores = []
test_scores = []

for k in k_values:
    knn = KNNRegressor(k=k)
    knn.fit(X_train_scaled, y_train)
    
    train_scores.append(knn.score(X_train_scaled, y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(k_values, train_scores, 'o-', label='Train R²', linewidth=2)
plt.plot(k_values, test_scores, 's-', label='Test R²', linewidth=2)
plt.xlabel('k (Number of Neighbors)', fontsize=12)
plt.ylabel('R² Score', fontsize=12)
plt.title('k-NN Regression: Effect of k on Performance', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

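# Find best k
# Note: picking k by test-set score makes the reported test R² optimistic,
# because the test data influenced the choice. In practice, select k on a
# validation set or with cross-validation and keep the test set untouched.
best_k = k_values[np.argmax(test_scores)]
best_score = max(test_scores)
print(f"\nBest k: {best_k}")
print(f"Best Test R²: {best_score:.4f}")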
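
The loop above picks k by its score on the test set, which tends to overstate how well that k generalizes. One common alternative is k-fold cross-validation on the training data only. The sketch below is one way to do it, under two assumptions: it uses scikit-learn's `KNeighborsRegressor` rather than the `KNNRegressor` class from this notebook (which does not implement the estimator interface that `GridSearchCV` expects), and it regenerates the same synthetic data so the block runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data and split as earlier in the notebook
X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_grid = {"kneighborsregressor__n_neighbors": list(range(1, 31))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

cv_best_k = search.best_params_["kneighborsregressor__n_neighbors"]
print(f"Best k by 5-fold CV: {cv_best_k}")
print(f"Mean CV R²: {search.best_score_:.4f}")
# The test set is used once, after k has been chosen
print(f"Held-out test R²: {search.score(X_test, y_test):.4f}")
```

Putting the scaler in the pipeline avoids leaking statistics from the validation folds into the scaling step.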

## 6. Uniform vs Distance-Weighted Predictions

In [None]:
# Compare uniform vs distance weighting
knn_uniform = KNNRegressor(k=5, weights='uniform')
knn_distance = KNNRegressor(k=5, weights='distance')

knn_uniform.fit(X_train_scaled, y_train)
knn_distance.fit(X_train_scaled, y_train)

uniform_score = knn_uniform.score(X_test_scaled, y_test)
distance_score = knn_distance.score(X_test_scaled, y_test)

print("\nWeighting Scheme Comparison:")
print(f"Uniform Weights:  {uniform_score:.4f}")
print(f"Distance Weights: {distance_score:.4f}")

## 7. Visualization of Predictions

In [None]:
# Train models with different k values
knn_k1 = KNNRegressor(k=1)
knn_k5 = KNNRegressor(k=5)
knn_k15 = KNNRegressor(k=15)

knn_k1.fit(X_train_scaled, y_train)
knn_k5.fit(X_train_scaled, y_train)
knn_k15.fit(X_train_scaled, y_train)

# Create smooth line for predictions
X_line = np.linspace(X_train_scaled.min(), X_train_scaled.max(), 300).reshape(-1, 1)
y_pred_k1 = knn_k1.predict(X_line)
y_pred_k5 = knn_k5.predict(X_line)
y_pred_k15 = knn_k15.predict(X_line)

# Plot
plt.figure(figsize=(15, 5))

for i, (k, y_pred, title) in enumerate([
    (1, y_pred_k1, 'k=1 (Low Bias, High Variance)'),
    (5, y_pred_k5, 'k=5 (Intermediate)'),
    (15, y_pred_k15, 'k=15 (Higher Bias, Lower Variance)')
]):
    plt.subplot(1, 3, i+1)
    plt.scatter(X_train_scaled, y_train, alpha=0.5, s=30, label='Training Data')
    plt.plot(X_line, y_pred, 'r-', linewidth=2, label=f'k-NN (k={k})')
    plt.xlabel('Feature', fontsize=11)
    plt.ylabel('Target', fontsize=11)
    plt.title(title, fontsize=12)
    plt.legend(fontsize=10)
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Key Takeaways

### Advantages:
- ✅ Simple and intuitive
- ✅ No training phase (lazy learning)
- ✅ Naturally handles non-linear relationships
- ✅ Non-parametric: no assumed functional form, only that nearby points have similar targets
- ✅ Distance weighting can give closer neighbors more influence on the prediction

### Disadvantages:
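- ❌ Slow prediction: brute-force search computes the distance to every training sample for each query
- ❌ Sensitive to feature scaling: features with larger ranges dominate the distance
- ❌ Curse of dimensionality: distances become less informative as the number of features grows
- ❌ Sensitive to outliers, especially for small k
- ❌ Requires choosing the hyperparameter k
- ❌ Poor extrapolation beyond the range of the training data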
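
The extrapolation limitation is easy to see on a toy problem. The sketch below uses made-up data following y = 2x on x in [0, 10] and scikit-learn's `KNeighborsRegressor`. It is an illustration of the effect, not a benchmark.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy training data: x = 0, 1, ..., 10 with y = 2x exactly
X_toy = np.arange(0, 11, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel()

model = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy)

# Inside the training range: neighbors x = 4, 5, 6 give targets 8, 10, 12
inside = model.predict([[5.0]])[0]
# Outside the range: the neighbors are still x = 8, 9, 10 (targets 16, 18, 20)
outside = model.predict([[20.0]])[0]

print(inside)   # 10.0, matches the true value 2 * 5
print(outside)  # 18.0, far from the true value 2 * 20 = 40
```

Any query beyond the largest training x receives the same prediction, because its nearest neighbors never change.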

### When to Use:
- Small to medium-sized datasets
- Low-dimensional feature space
- Non-linear relationships
- An interpretable baseline model is needed
- Local patterns matter more than global trends

### Bias-Variance Trade-off:
- **Small k**: Low bias, high variance (overfitting)
- **Large k**: High bias, low variance (underfitting)
- **Optimal k**: Balance between bias and variance
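
The two extremes of the trade-off can be checked numerically. The sketch below fits scikit-learn's `KNeighborsRegressor` to a made-up noisy sine curve (the data and the choice of k values are assumptions for illustration) and reports the training R² at k = 1, k = 10, and k = n.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n = 100
X_bv = np.sort(rng.uniform(-3, 3, size=(n, 1)), axis=0)
y_bv = np.sin(X_bv.ravel()) + rng.normal(scale=0.3, size=n)

train_r2 = {}
for k in (1, 10, n):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_bv, y_bv)
    train_r2[k] = model.score(X_bv, y_bv)
    print(f"k={k:>3}  train R² = {train_r2[k]:.4f}")
```

With k = 1 each training point is its own nearest neighbor, so the model reproduces the training targets, noise included, and the training R² is 1. With k = n every prediction is the mean of all targets, so the training R² is approximately 0. Neither extreme says much about test performance, which is why k is tuned on held-out data.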