# K-Nearest Neighbors (KNN) - Scikit-Learn Implementation

Multi-class classification on the **Covertype (Forest Cover Type)** dataset.

**Dataset**: 581,012 samples, 54 features, 7 forest cover types  
**Task**: Predict forest cover type from cartographic variables  
**Key Concept**: KNN is a "lazy learner" - `fit` only stores the training data (and may build a search index), so the real cost is paid at prediction time

## What Makes KNN Different?
- **Non-parametric**: Learns no weights or coefficients, unlike Logistic Regression
- **Instance-based**: Stores entire training set, compares at prediction time
- **Distance-based**: Classification depends on "nearest" training examples
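- **No training**: Nearly all computation happens during prediction (O(n·d) per query for a brute-force search over n training samples with d features; tree-based indexes can reduce this)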
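
To make the "lazy learner" idea concrete, here is a minimal brute-force sketch in NumPy. It is not how scikit-learn implements `KNeighborsClassifier`; the function name `knn_predict` and the toy arrays are hypothetical and exist only for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Brute-force KNN: majority vote among the k closest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k smallest distances in each row (order within the k is arbitrary)
    nearest = np.argpartition(dists, k - 1, axis=1)[:, :k]
    neighbor_labels = y_train[nearest]
    # Majority vote per query; ties go to the smallest label
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])

# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.05, 0.05], [5.0, 5.1]])

toy_pred = knn_predict(X_toy, y_toy, queries, k=3)
print(toy_pred)  # [0 1]
```

There is no fitting step at all: every query is compared against every stored training point, which is why the cost grows with the size of the training set.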


In [None]:
# Standard libraries
import numpy as np
import sys

# Scikit-Learn KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Add utils to path
sys.path.append('../..')
from utils.data_loader import load_processed_data
from utils.metrics import accuracy, macro_f1_score, confusion_matrix_multiclass
from utils.visualization import (
    plot_confusion_matrix_multiclass,
    plot_validation_curve,
    plot_per_class_f1
)
from utils.performance import track_performance

print("Imports complete!")

Imports complete!


In [None]:
# Load preprocessed data
"""
Data was preprocessed in data-preperation/preprocess_knn.py
    - 581,012 total samples (80/20 split)
    - StandardScaler applied (fit on train only)
    - All 4 frameworks load identical data for fair comparison
"""

X_train, X_test, y_train, y_test, metadata = load_processed_data('knn')

# Extract metadata for reference
class_names = metadata['class_names']
n_classes = metadata['n_classes']
random_seed = metadata['random_seed']

print(f"Training set: {X_train.shape[0]:,} samples, {X_train.shape[1]} features")
print(f"Test set: {X_test.shape[0]:,} samples")
print(f"Classes ({n_classes}): {class_names}")
print(f"Random seed: {random_seed}")

Training set: 464,809 samples, 54 features
Test set: 116,203 samples
Classes (7): ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine', 'Cottonwood/Willow', 'Aspen', 'Douglas-fir', 'Krummholz']
Random seed: 113
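
Because KNN compares samples by distance, features on large numeric scales would otherwise dominate the neighbor search, which is why the preprocessing applies `StandardScaler` fit on the training split only. The sketch below shows that pattern on made-up data; it is not the contents of the preprocessing script, and all variable names and values are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales
X_raw = np.column_stack([
    rng.normal(3000, 300, 200),   # large-valued feature
    rng.normal(0.5, 0.1, 200),    # small-valued feature
])
X_tr, X_te = X_raw[:160], X_raw[160:]   # simple 80/20 split for the demo

scaler = StandardScaler().fit(X_tr)     # mean/std estimated from training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)    # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

Fitting the scaler on the training rows alone keeps test-set statistics out of the preprocessing, so the test split remains a fair stand-in for unseen data.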


## K-Value Tuning

The most important hyperparameter in KNN is **K** (number of neighbors).

**Bias-Variance Tradeoff:**
- **K=1**: High variance (overfitting) - each prediction depends on a single nearest neighbor, so noisy points flip labels
- **K=large**: High bias (underfitting) - each prediction is a majority vote over many neighbors, which smooths away local structure

We'll test K = 1, 3, 5, 7, 9, 11, 13, 15 and plot the validation curve.
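
An alternative to scoring each K directly on the test set is to choose K by cross-validation on the training data, which keeps the test set untouched for a final estimate. The sketch below uses `GridSearchCV` (already imported in this notebook) on a small synthetic dataset; the data, fold count, and variable names are assumptions for illustration, not results from Covertype.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the Covertype set)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Same K grid as this notebook, scored by 5-fold cross-validated accuracy
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best K:", search.best_params_["n_neighbors"])
print(f"Mean CV accuracy: {search.best_score_:.4f}")
```

On the full 464,809-sample training set this would be slow, since each fold repeats the neighbor search, so running it on a subsample may be more practical.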

In [None]:
# K-Value Tuning
"""
Test different K values to find optimal number of neighbors
Using accuracy on test set for each K value
"""

k_values = [1, 3, 5, 7, 9, 11, 13, 15]
train_scores = []       # Accuracy on training set (to check overfitting)
test_scores = []        # Accuracy on test set (what really matters)

print("K-Value Tuning Progress:")
print("-" * 50)

for k in k_values:
    # Create and "fit" model (KNN just stores data, no real training)
    model = KNeighborsClassifier(n_neighbors=k, algorithm='auto', n_jobs=-1)
    model.fit(X_train, y_train)

    # Evaluate on both sets
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    train_acc = accuracy(y_train, train_pred)
    test_acc = accuracy(y_test, test_pred)

    train_scores.append(train_acc)
    test_scores.append(test_acc)

    print(f"K={k:2d} | Train Acc: {train_acc:.4f} | Test Acc: {test_acc:.4f}")

# Find best k
best_idx = np.argmax(test_scores)
best_k = k_values[best_idx]
print("-" * 40)
print(f"Best K: {best_k} (Test Accuracy: {test_scores[best_idx]:.4f})")


K-Value Tuning Progress:
--------------------------------------------------
K= 1 | Train Acc: 1.0000 | Test Acc: 0.9353
K= 3 | Train Acc: 0.9691 | Test Acc: 0.9331
K= 5 | Train Acc: 0.9560 | Test Acc: 0.9275
K= 7 | Train Acc: 0.9468 | Test Acc: 0.9228
K= 9 | Train Acc: 0.9394 | Test Acc: 0.9186
K=11 | Train Acc: 0.9330 | Test Acc: 0.9141
K=13 | Train Acc: 0.9278 | Test Acc: 0.9100
K=15 | Train Acc: 0.9230 | Test Acc: 0.9070
----------------------------------------
Best K: 1 (Test Accuracy: 0.9353)
