# K-Nearest Neighbors (KNN) - Complete Guide

## 📚 Learning Objectives
- Understand KNN algorithm for classification and regression
- Learn how to choose optimal K value
- Implement proper feature scaling
- Handle distance metrics
- Evaluate model performance

## 🎯 What is KNN?
K-Nearest Neighbors is a **non-parametric**, **lazy learning** algorithm that:
- Makes predictions based on K closest training examples
- Uses distance metrics (usually Euclidean)
- Works for both classification and regression

### Key Concepts:
1. **K**: Number of neighbors to consider
2. **Distance Metric**: How to measure similarity (Euclidean, Manhattan, etc.)
3. **Weights**: Uniform or distance-based
4. **Feature Scaling**: CRITICAL for KNN!
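
To make these concepts concrete, here is a minimal from-scratch sketch of a KNN classifier for a single query point. The function name `knn_predict` and the toy data are hypothetical and chosen only for illustration; the sketch assumes Euclidean distance and uniform weights, and is not how scikit-learn implements its neighbor search.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Uniform weights: each neighbor casts exactly one vote
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated clusters in 2-D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))
```

Nothing is learned ahead of time here: all the work happens when a query arrives, which is what "lazy learning" refers to.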

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, r2_score, mean_absolute_error
)
from sklearn.datasets import load_iris, make_classification
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: KNN for Classification
### 1️⃣ Load and Explore Data

In [None]:
# Load Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')

print(f"Dataset shape: {X.shape}")
print(f"\nFeatures: {list(X.columns)}")
print(f"\nTarget classes: {iris.target_names}")
print(f"\nClass distribution:")
print(y.value_counts().sort_index())

# Display first few rows
df = pd.concat([X, y], axis=1)
df.head(10)

In [None]:
# Visualize feature distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(X.columns):
    for species in range(3):
        axes[idx].hist(X[y == species][col], alpha=0.6, label=iris.target_names[species], bins=20)
    axes[idx].set_xlabel(col, fontsize=12)
    axes[idx].set_ylabel('Frequency')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Feature Distributions by Species', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

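### 2️⃣ Train-Test Split and Scaling
**CRITICAL**: KNN is distance-based, so features with larger numeric ranges dominate the distance calculation. Scale features before fitting, and fit the scaler on the training set only so that no test-set information leaks into the model.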
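
The effect of scale on neighbor selection can be seen in a small sketch. The data below are hypothetical (an income column in dollars and an age column in years), and the helper `nearest_index` is introduced only for this illustration. Standardization is done by hand with NumPy, using the same mean and population standard deviation that `StandardScaler` would compute.

```python
import numpy as np

# Hypothetical toy data: columns are [annual_income, age]
X_toy = np.array([[50000.0, 25.0],
                  [50500.0, 60.0],
                  [52000.0, 30.0],
                  [48000.0, 55.0]])
query = np.array([50400.0, 27.0])

def nearest_index(X, q):
    """Return the index of the row of X closest to q in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

# Standardize using statistics of the training rows only
mean, std = X_toy.mean(axis=0), X_toy.std(axis=0)
X_scaled = (X_toy - mean) / std
query_scaled = (query - mean) / std

raw_nearest = nearest_index(X_toy, query)
scaled_nearest = nearest_index(X_scaled, query_scaled)
print(f"Nearest neighbor on raw features:    row {raw_nearest}")
print(f"Nearest neighbor on scaled features: row {scaled_nearest}")
```

On the raw features the income column dominates, so the 60-year-old with a similar income (row 1) is the nearest neighbor of a 27-year-old. After scaling, both features contribute comparably and the 25-year-old (row 0) is nearest.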

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Demonstrate importance of scaling
print("\n📊 Feature Ranges BEFORE Scaling:")
print(X_train.describe().loc[['min', 'max']])

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\n📊 Feature Ranges AFTER Scaling:")
print(pd.DataFrame(X_train_scaled, columns=X.columns).describe().loc[['min', 'max']])
print("\n✅ All features now on similar scale!")

### 3️⃣ Finding Optimal K Value

In [None]:
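# Test different K values
k_range = range(1, 31)
train_scores = []
cv_scores = []
test_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)

    train_scores.append(knn.score(X_train_scaled, y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))

    # 5-fold cross-validation on the training set only; the pipeline
    # refits the scaler inside each fold so no validation data leaks in
    cv_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k)),
    ])
    cv_scores.append(cross_val_score(cv_pipeline, X_train, y_train, cv=5).mean())

# Plot results
plt.figure(figsize=(12, 6))
plt.plot(k_range, train_scores, 'bo-', label='Training Accuracy', linewidth=2)
plt.plot(k_range, cv_scores, 'go-', label='Cross-Validation Accuracy', linewidth=2)
plt.plot(k_range, test_scores, 'ro-', label='Test Accuracy', linewidth=2)
plt.xlabel('K (Number of Neighbors)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('KNN: Finding Optimal K Value', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 31, 2))

# Highlight best K, chosen by cross-validation so the test set is not
# used for model selection (picking K on test accuracy inflates the result)
best_idx = int(np.argmax(cv_scores))
best_k = k_range[best_idx]
plt.axvline(x=best_k, color='k', linestyle='--', linewidth=2, label=f'Best K={best_k}')
plt.legend(fontsize=12)
plt.show()

print(f"\n🏆 Optimal K: {best_k}")
print(f"Cross-Validation Accuracy at K={best_k}: {cv_scores[best_idx]:.4f}")
print(f"Test Accuracy at K={best_k}: {test_scores[best_idx]:.4f}")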

### 4️⃣ Train Final Model with Optimal K

In [None]:
# Train with optimal K
knn_optimal = KNeighborsClassifier(n_neighbors=best_k)
knn_optimal.fit(X_train_scaled, y_train)

# Make predictions
y_pred = knn_optimal.predict(X_test_scaled)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"\n📊 Model Performance (K={best_k}):")
print(f"Accuracy: {accuracy:.4f}")
print(f"\n📋 Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

### 5️⃣ Confusion Matrix Visualization

In [None]:
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, 
            yticklabels=iris.target_names,
            cbar_kws={'label': 'Count'})
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.title(f'Confusion Matrix - KNN (K={best_k})', fontsize=14, fontweight='bold')
plt.show()

# Calculate per-class accuracy (this is the recall of each class:
# correct predictions for the class divided by its true count)
print("\n📊 Per-Class Accuracy:")
for i, species in enumerate(iris.target_names):
    class_acc = cm[i, i] / cm[i, :].sum()
    print(f"{species}: {class_acc:.4f}")

### 6️⃣ Comparing Distance Metrics and Weights

In [None]:
# Test different configurations
configs = [
    {'metric': 'euclidean', 'weights': 'uniform'},
    {'metric': 'euclidean', 'weights': 'distance'},
    {'metric': 'manhattan', 'weights': 'uniform'},
    {'metric': 'manhattan', 'weights': 'distance'},
]

results = []

for config in configs:
    knn = KNeighborsClassifier(n_neighbors=best_k, **config)
    knn.fit(X_train_scaled, y_train)
    score = knn.score(X_test_scaled, y_test)
    results.append({
        'Metric': config['metric'],
        'Weights': config['weights'],
        'Accuracy': score
    })

results_df = pd.DataFrame(results)
print("\n📊 Comparison of Distance Metrics and Weights:")
print(results_df.to_string(index=False))

# Visualize
plt.figure(figsize=(10, 6))
x = np.arange(len(results_df))
plt.bar(x, results_df['Accuracy'], color=['skyblue', 'lightcoral', 'lightgreen', 'gold'], 
        edgecolor='black', linewidth=1.5)
plt.xticks(x, [f"{row['Metric']}\n{row['Weights']}" for _, row in results_df.iterrows()])
plt.ylabel('Accuracy', fontsize=12)
plt.title('KNN Performance: Distance Metrics & Weights Comparison', fontsize=14, fontweight='bold')
plt.ylim(0.9, 1.0)
plt.grid(True, alpha=0.3, axis='y')
plt.show()

## Part 2: KNN for Regression
### 7️⃣ KNN Regression Example

In [None]:
# Load housing data for regression
df_housing = pd.read_csv('../Linear Regression/data/dataset.csv')

# Select features for regression
features = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 
            'total_bedrooms', 'population', 'households', 'median_income']
target = 'median_house_value'

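# Prepare data: drop rows with a missing value in any feature or the target
df_model = df_housing[features + [target]].dropna()

# Sample for computational efficiency; the seed makes the subset reproducible
# and min() avoids an error if the file has fewer than 5000 usable rows
sample_size = min(5000, len(df_model))
df_model = df_model.sample(n=sample_size, random_state=42)
X_reg = df_model[features]
y_reg = df_model[target]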

print(f"Regression dataset shape: {X_reg.shape}")
print(f"Target variable: {target}")

In [None]:
# Split and scale
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

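# Find optimal K for regression with 5-fold cross-validation on the
# training set, so the test set is kept for the final evaluation
k_range_reg = range(1, 21)
rmse_scores = []

for k in k_range_reg:
    # The pipeline refits the scaler inside each fold
    knn_reg = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsRegressor(n_neighbors=k)),
    ])
    neg_mse = cross_val_score(knn_reg, X_train_reg, y_train_reg,
                              cv=5, scoring='neg_mean_squared_error')
    rmse_scores.append(np.sqrt(-neg_mse.mean()))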

# Plot RMSE vs K
plt.figure(figsize=(12, 6))
plt.plot(k_range_reg, rmse_scores, 'bo-', linewidth=2, markersize=8)
plt.xlabel('K (Number of Neighbors)', fontsize=12)
plt.ylabel('RMSE ($)', fontsize=12)
plt.title('KNN Regression: Finding Optimal K', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

best_k_reg = k_range_reg[np.argmin(rmse_scores)]
plt.axvline(x=best_k_reg, color='r', linestyle='--', linewidth=2, label=f'Best K={best_k_reg}')
plt.legend()
plt.show()

print(f"\n🏆 Optimal K for Regression: {best_k_reg}")
print(f"Best RMSE: ${min(rmse_scores):,.2f}")

In [None]:
# Train final regression model
knn_reg_final = KNeighborsRegressor(n_neighbors=best_k_reg)
knn_reg_final.fit(X_train_reg_scaled, y_train_reg)

# Predictions
y_pred_reg_final = knn_reg_final.predict(X_test_reg_scaled)

# Metrics
rmse_final = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg_final))
mae_final = mean_absolute_error(y_test_reg, y_pred_reg_final)
r2_final = r2_score(y_test_reg, y_pred_reg_final)

print(f"\n📊 KNN Regression Performance (K={best_k_reg}):")
print(f"RMSE: ${rmse_final:,.2f}")
print(f"MAE:  ${mae_final:,.2f}")
print(f"R²:   {r2_final:.4f}")

# Visualize predictions
plt.figure(figsize=(12, 6))
plt.scatter(y_test_reg, y_pred_reg_final, alpha=0.5, edgecolors='k')
plt.plot([y_test_reg.min(), y_test_reg.max()], 
         [y_test_reg.min(), y_test_reg.max()], 
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual House Value ($)', fontsize=12)
plt.ylabel('Predicted House Value ($)', fontsize=12)
plt.title(f'KNN Regression: Actual vs Predicted (K={best_k_reg})', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### 8️⃣ Hyperparameter Tuning with GridSearchCV

In [None]:
# Define parameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11, 15],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Create KNN classifier
knn_grid = KNeighborsClassifier()

# Grid search
grid_search = GridSearchCV(
    knn_grid, 
    param_grid, 
    cv=5, 
    scoring='accuracy',
    n_jobs=-1
)

print("🔍 Performing Grid Search...")
grid_search.fit(X_train_scaled, y_train)

print(f"\n🏆 Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_:.4f}")
print(f"Test Set Score: {grid_search.score(X_test_scaled, y_test):.4f}")

# Show top 10 configurations (a separate name avoids overwriting the
# results_df built in the distance-metric comparison above)
cv_results_df = pd.DataFrame(grid_search.cv_results_)
top_results = cv_results_df.nsmallest(10, 'rank_test_score')[[
    'param_n_neighbors', 'param_weights', 'param_metric', 'mean_test_score'
]]
print("\n📊 Top 10 Configurations:")
print(top_results.to_string(index=False))

## 📊 Key Takeaways

### Algorithm Characteristics:
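1. **Non-parametric**: Makes no assumption about the functional form of the data distribution
2. **Lazy learning**: No explicit model is fit; `fit` stores the training data (scikit-learn may also build a search index such as a KD-tree), and most computation happens at prediction time
3. **Instance-based**: Keeps the entire training set and predicts directly from the stored examples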

### Critical Requirements:
- ✅ **Feature Scaling**: MANDATORY for KNN
- ✅ **Optimal K**: Use cross-validation to find best K
- ✅ **Distance Metric**: Choose based on data characteristics
- ✅ **Computational Cost**: Slow for large datasets

### Choosing K:
- **Small K** (1-3): More complex decision boundary, prone to overfitting
- **Large K**: Smoother decision boundary, may underfit
- **Rule of thumb**: K ≈ √n (where n = number of training samples) is a common starting point, not a guarantee
- **Best practice**: Use cross-validation
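
As a sketch of the rule of thumb, the helper below turns a training-set size into a starting K. The name `sqrt_rule_k` is hypothetical, and bumping even values to the next odd number is an added convention (it reduces tied votes in binary classification), not part of the rule itself. Treat the result as a first candidate to check with cross-validation.

```python
import math

def sqrt_rule_k(n_samples):
    """Heuristic starting K: floor(sqrt(n)), bumped to the next odd number."""
    k = max(1, math.isqrt(n_samples))
    return k if k % 2 == 1 else k + 1

# 105 training samples, as in the Iris split with test_size=0.3
print(sqrt_rule_k(105))
```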

### Distance Metrics:
- **Euclidean**: Most common, works well for continuous features
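- **Manhattan**: Differences are not squared, so a large gap in a single feature dominates less; often preferred over Euclidean for high-dimensional data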
- **Minkowski**: Generalization of both (p=1: Manhattan, p=2: Euclidean)
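
The relationship between the three metrics can be checked with a short sketch. The function `minkowski_distance` is written here for illustration; in scikit-learn the same choice is made through the `metric='minkowski'` and `p` parameters of `KNeighborsClassifier`.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

a, b = [0, 0], [3, 4]
print(minkowski_distance(a, b, p=1))  # Manhattan: |3| + |4|
print(minkowski_distance(a, b, p=2))  # Euclidean: sqrt(3^2 + 4^2)
```

For the points (0, 0) and (3, 4), p=1 gives 7 and p=2 gives 5.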

### Weights:
- **Uniform**: All neighbors contribute equally
- **Distance**: Closer neighbors have more influence
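
The two weighting schemes can give different answers for the same neighbors. The sketch below uses a hypothetical helper, `weighted_vote`, and made-up distances and labels. It assumes all distances are strictly positive and uses inverse-distance weights, which is what scikit-learn's `weights='distance'` option is based on.

```python
from collections import defaultdict

def weighted_vote(distances, labels, weights='uniform'):
    """Return the winning label among neighbors under the given weighting."""
    scores = defaultdict(float)
    for d, label in zip(distances, labels):
        # 'uniform': one vote each; 'distance': vote of 1/d (assumes d > 0)
        scores[label] += 1.0 if weights == 'uniform' else 1.0 / d
    return max(scores, key=scores.get)

# Three neighbors: one very close 'A', two farther 'B's (illustrative values)
distances = [0.5, 2.0, 2.5]
labels = ['A', 'B', 'B']

print(weighted_vote(distances, labels, weights='uniform'))
print(weighted_vote(distances, labels, weights='distance'))
```

With uniform weights 'B' wins two votes to one. With distance weights 'A' scores 1/0.5 = 2.0 against 1/2.0 + 1/2.5 = 0.9 for 'B', so the single close neighbor decides.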

### Pros:
- ✅ Simple to understand and implement
- ✅ No explicit training step (fitting only stores the data)
- ✅ Works for both classification and regression
- ✅ Naturally handles multi-class problems

### Cons:
- ❌ Slow prediction for large datasets
- ❌ Sensitive to irrelevant features
- ❌ Requires feature scaling
- ❌ Curse of dimensionality
- ❌ Memory intensive
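
One way to see the curse of dimensionality is to compare the nearest and farthest distances from a query point as the number of features grows. The sketch below uses uniformly random points, a choice made only for illustration, and a hypothetical helper `nearest_to_farthest_ratio`. As the ratio approaches 1, "nearest" neighbors are barely closer than any other point.

```python
import numpy as np

def nearest_to_farthest_ratio(n_dims, n_points=500, seed=0):
    """Ratio of the smallest to the largest distance from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

for d in (2, 10, 100, 1000):
    print(f"dims={d:4d}  nearest/farthest = {nearest_to_farthest_ratio(d):.3f}")
```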

### When to Use KNN:
- ✅ Small to medium datasets
- ✅ Low-dimensional data
- ✅ Non-linear decision boundaries
- ❌ Large datasets (use approximate methods)
- ❌ High-dimensional data (use dimensionality reduction first)