# k-Nearest Neighbours for Credit Classification

This notebook demonstrates k-Nearest Neighbours (k-NN) classification for credit default prediction using the UCI German Credit dataset.

## Model Overview

**k-Nearest Neighbours (k-NN)** classifies instances based on the majority vote of their k closest neighbours in feature space.
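
The majority-vote rule can be sketched in a few lines of plain Python. This is an illustration only, not the implementation behind `get_model('knn')`; the function name `knn_predict` and the toy data are made up for the example, and Euclidean distance is an assumption.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy, made-up data: two features per applicant, labels 0 = good, 1 = bad
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9)]
train_y = [0, 0, 0, 1, 1]

pred_near_good = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
pred_near_bad = knn_predict(train_X, train_y, (3.0, 3.0), k=3)
print(pred_near_good, pred_near_bad)
```

Note that nothing is learned ahead of time: all the work happens when a query arrives, which is why both the inference cost and the memory footprint grow with the training set.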

### Pros
- Simple and intuitive algorithm
- No explicit training phase (lazy learner): fitting only stores the training data
- Naturally handles multi-class problems
- Non-parametric (makes no assumptions about data distribution)
- Can capture complex decision boundaries

### Cons
- Slow at inference time (a brute-force search computes distances to all training points)
- Sensitive to irrelevant features and feature scaling
- High memory usage (stores all training data)
- Performance degrades in high dimensions (curse of dimensionality)
- Requires careful choice of k and distance metric
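
The sensitivity to feature scaling is easy to see with two features in very different units. The sketch below uses made-up applicants and made-up standard deviations (`STD_AMOUNT`, `STD_DURATION` are assumptions for illustration, not statistics of the German Credit data) and shows the nearest neighbour changing once the features are rescaled.

```python
import math

# Made-up per-feature standard deviations (assumptions for illustration only)
STD_AMOUNT = 2800.0   # credit amount, in DM
STD_DURATION = 1.0    # loan duration, in years

def scale(point):
    """Divide each feature by its assumed standard deviation.

    Subtracting the mean is omitted because it cancels out in distances.
    """
    amount, duration = point
    return (amount / STD_AMOUNT, duration / STD_DURATION)

query = (5000.0, 1.0)
a = (5100.0, 6.0)   # similar amount, much longer duration
b = (5600.0, 1.0)   # same duration, somewhat larger amount

# Raw distances: the amount column dominates purely because of its units
raw_a, raw_b = math.dist(query, a), math.dist(query, b)

# Scaled distances: both features now contribute on comparable terms
scaled_a = math.dist(scale(query), scale(a))
scaled_b = math.dist(scale(query), scale(b))

print(f"raw:    d(query, a)={raw_a:.1f}  d(query, b)={raw_b:.1f}")
print(f"scaled: d(query, a)={scaled_a:.3f}  d(query, b)={scaled_b:.3f}")
```

On the raw values `a` is the nearest neighbour; after scaling it is `b`, because the five-year difference in duration is no longer drowned out by the amount column.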

### When to Use
- When you need a simple baseline
- For small to medium datasets
- When interpretability at the instance level is important

## Setup

In [None]:
import sys
sys.path.insert(0, '../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from creditclass.preprocessing import prepare_data
from creditclass.training import get_model, train_model, save_model, tune_hyperparameters
from creditclass.evaluation import (
    evaluate_model,
    compute_shap_values,
    get_learning_curve_data,
)
from creditclass.plots import (
    set_plot_style,
    plot_confusion_matrix,
    plot_roc_curve,
    plot_precision_recall,
    plot_learning_curve,
    plot_calibration,
    plot_shap_summary,
)

set_plot_style()
RANDOM_STATE = 42

## Load Data

In [None]:
data = prepare_data(
    target_type='default',
    encoding_method='onehot',
    test_size=0.2,
    random_state=RANDOM_STATE,
    scale=True,  # k-NN is distance-based, so features must be on comparable scales
)

X_train = data['X_train_scaled']
X_test = data['X_test_scaled']
y_train = data['y_train']
y_test = data['y_test']
feature_names = data['feature_names']

print(f"Training set: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Test set: {X_test.shape[0]} samples")

## Training

In [None]:
model = get_model('knn')
model = train_model(model, X_train, y_train)

print("Model trained successfully!")
print(f"Number of neighbours (k): {model.n_neighbors}")
print(f"Weights: {model.weights}")
print(f"Distance metric: {model.metric}")

## Evaluation

In [None]:
metrics = evaluate_model(model, X_test, y_test)

print("Performance Metrics:")
print("-" * 30)
for name, value in metrics.items():
    if value is not None:
        print(f"{name.capitalize():12} {value:.4f}")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

plot_confusion_matrix(
    model, X_test, y_test,
    class_names=['Good Credit', 'Bad Credit'],
    ax=axes[0],
    title='k-NN - Confusion Matrix'
)

plot_roc_curve(model, X_test, y_test, ax=axes[1], label='k-NN')

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
plot_precision_recall(model, X_test, y_test, ax=ax, label='k-NN')
plt.tight_layout()
plt.show()

## Effect of k

Let's see how the choice of k affects performance on the held-out test set.
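
The loop in this section scores each k on the test split, which is fine for exploration but optimistic as a selection procedure. A minimal sketch of choosing k by cross-validation instead is shown here. It assumes scikit-learn is available and uses synthetic data from `make_classification` as a stand-in for the training split; the names `X_demo`, `y_demo`, `candidate_k` and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (not the German Credit data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

candidate_k = [1, 3, 5, 7, 9, 11, 15, 21]
cv_scores = {}
for k in candidate_k:
    # Scaling sits inside the pipeline so each fold is scaled using
    # statistics from its own training part only
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="f1").mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f"Best k by 5-fold CV F1: {best_k} ({cv_scores[best_k]:.3f})")
```

The test set is then used once, at the end, to report the performance of the chosen k.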

In [None]:
k_values = [1, 3, 5, 7, 9, 11, 15, 21]
k_results = []

for k in k_values:
    knn = get_model('knn', params={'n_neighbors': k})
    knn = train_model(knn, X_train, y_train)
    # Use a separate name so the baseline `metrics` from the Evaluation section is not overwritten
    k_metrics = evaluate_model(knn, X_test, y_test)
    k_results.append({'k': k, **k_metrics})

k_df = pd.DataFrame(k_results)
print(k_df.to_string(index=False))

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(k_df['k'], k_df['accuracy'], 'o-', label='Accuracy')
ax.plot(k_df['k'], k_df['f1'], 's-', label='F1')
ax.plot(k_df['k'], k_df['auc'], '^-', label='AUC')

ax.set_xlabel('k (Number of Neighbours)')
ax.set_ylabel('Score')
ax.set_title('k-NN Performance vs. k')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Interpretability

k-NN provides instance-level interpretability: each prediction can be explained by inspecting the k training examples that voted for it. For a global view of feature importance, we can also use SHAP.
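
Inspecting the neighbours behind a prediction can be sketched with scikit-learn's `KNeighborsClassifier.kneighbors`. The example assumes the object returned by `get_model('knn')` behaves like a scikit-learn `KNeighborsClassifier`, which this notebook does not confirm, so it builds its own classifier on a tiny made-up dataset (`X_toy`, `y_toy`).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up training set: two already-scaled features per applicant
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]])
y_toy = np.array([0, 0, 0, 1, 1])  # 0 = good credit, 1 = bad credit

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

query = np.array([[1.9, 2.0]])
# Distances and training-set indices of the 3 nearest neighbours, closest first
distances, indices = knn_toy.kneighbors(query)

prediction = knn_toy.predict(query)[0]
for dist, idx in zip(distances[0], indices[0]):
    print(f"neighbour #{idx}: label={y_toy[idx]}, distance={dist:.3f}")
print("predicted class:", prediction)
```

Listing the neighbours alongside their labels and distances gives a concrete, case-by-case explanation: the applicant was classified this way because these specific past applicants are the most similar.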

In [None]:
# SHAP values (using KernelExplainer)
print("Computing SHAP values (this may take a moment)...")
shap_data = compute_shap_values(model, X_test, feature_names=feature_names, max_samples=50)

fig, ax = plt.subplots(figsize=(10, 8))
plot_shap_summary(shap_data, plot_type='bar', max_display=15)
plt.title('k-NN - SHAP Feature Importance')
plt.tight_layout()
plt.show()

## Hyperparameter Tuning

In [None]:
tuning_results = tune_hyperparameters(
    'knn',
    X_train, y_train,
    method='grid',
    cv=5,
    scoring='f1'
)

print("Best Parameters:")
print(tuning_results['best_params'])
print(f"\nBest CV F1 Score: {tuning_results['best_score']:.4f}")

In [None]:
tuned_model = tuning_results['best_model']
tuned_metrics = evaluate_model(tuned_model, X_test, y_test)

print("\nTuned Model Performance:")
print("-" * 30)
for name, value in tuned_metrics.items():
    if value is not None:
        print(f"{name.capitalize():12} {value:.4f}")

## Learning Curve

In [None]:
lc_model = get_model('knn')
lc_data = get_learning_curve_data(lc_model, X_train, y_train, cv=5, scoring='f1')

fig, ax = plt.subplots(figsize=(8, 6))
plot_learning_curve(lc_data, ax=ax, title='k-NN - Learning Curve')
plt.tight_layout()
plt.show()

## Calibration

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
plot_calibration(model, X_test, y_test, ax=ax, label='k-NN')
plt.tight_layout()
plt.show()

## Save Model

In [None]:
save_path = save_model(model, 'knn')
print(f"Model saved to: {save_path}")

## Summary

### Key Takeaways

1. **Simplicity**: k-NN is one of the simplest ML algorithms to understand
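2. **k Selection**: The choice of k controls the bias-variance trade-off: a small k gives low bias but high variance, while a large k smooths the decision boundary at the cost of higher bias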
3. **Scaling**: Feature scaling is crucial for distance-based algorithms
4. **Inference Cost**: With a brute-force search, each prediction requires computing distances to all training points

### Recommendations

- Always scale features before using k-NN
- Use cross-validation to select k
- Consider distance weighting, so that closer neighbours count more in the vote
- For large datasets, consider approximate nearest neighbour methods
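
As an illustration of the distance-weighting recommendation, here is a minimal plain-Python sketch in which each neighbour's vote is weighted by the inverse of its distance. The function name `weighted_knn_predict`, the `eps` guard against division by zero, and the one-dimensional data are all assumptions made for the example; library implementations may handle zero distances differently.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, query, k=3, eps=1e-12):
    """Vote among the k nearest points, weighting each vote by 1 / distance."""
    nearest = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    scores = defaultdict(float)
    for dist, label in nearest:
        scores[label] += 1.0 / (dist + eps)  # eps avoids division by zero
    return max(scores, key=scores.get)

# Made-up 1-D example: one very close "bad" neighbour, two distant "good" ones
train_X = [(0.0,), (2.0,), (2.2,)]
train_y = [1, 0, 0]
query = (0.1,)

weighted_pred = weighted_knn_predict(train_X, train_y, query, k=3)
print("distance-weighted prediction:", weighted_pred)
```

With uniform weights the two distant "good" neighbours would outvote the single close "bad" one; with inverse-distance weights the close neighbour wins.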