# Supervised Learning → KNN (Classification)

This notebook is part of the **ML-Methods** project.

As with the other classification notebooks,
the first sections focus on data preparation
and are intentionally repeated.

This ensures consistency across models
and allows fair comparison of results.

1. Project setup and common pipeline
2. Dataset loading
3. Train-test split
4. Feature scaling (why we do it)

----------------------------------

5. What is this model? (Intuition)
6. Model training
7. Model behavior and key parameters
8. Predictions
9. Model evaluation
10. When to use it and when not to
11. Model persistence
12. Mathematical formulation (deep dive)
13. Final summary – Code only

-----------------------------------------------------

## How this notebook should be read

This notebook is designed to be read **top to bottom**.

Before every code cell, you will find a short explanation describing:
- what we are about to do
- why this step is necessary
- how it fits into the overall process

The goal is not just to run the code,
but to understand what is happening at each step
and be able to adapt it to your own data.

-----------------------------------------------------

## What is KNN Classification?

KNN Classification is a **distance-based classification model**.

Instead of learning a decision boundary during training,
the model makes predictions by comparing data points directly.

-----------------------------------------------------

## Why we start with intuition

KNN is one of the simplest classification models conceptually.

All it does is:
- measure distances between data points
- find the K closest neighbors
- assign the most common class among them

If this idea is clear,
the rest of the notebook becomes easy to follow.
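
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `knn_predict`, the toy arrays `X_toy` and `y_toy`, and the choice of Euclidean distance with `k=3` are all assumptions made for the example.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the class of one query point by majority vote among its k nearest neighbors."""
    # 1. Measure the Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Find the indices of the k closest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. Assign the most common class among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in 2D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)  # 0 1
```

Note that nothing is "trained" here: the function only needs the stored samples and a query point.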

-----------------------------------------------------

## What you should expect from the results

With KNN Classification, you should expect:
- locally adaptive decision boundaries
- sensitivity to feature scaling
- good performance when similar samples exist nearby

However:
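- prediction time grows with the size of the training set, because every prediction searches the stored training samples for neighbors
- memory use grows as well, because the whole training set must be kept at prediction time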


___________________________________

## 1. Project setup and common pipeline

In this section we set up the common pipeline
used across classification models in this project.

KNN relies heavily on distances,
so feature scaling is especially important.


In [1]:
# Common imports used across all classification models

import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    ConfusionMatrixDisplay
)

from pathlib import Path
import joblib
import matplotlib.pyplot as plt


___________________________________
## 2. Dataset loading

In this section we load the dataset
used for the KNN classification task.

We use the same dataset as in the Logistic Regression notebook
(the Breast Cancer Wisconsin dataset bundled with scikit-learn)
to allow a direct comparison between models.


In [2]:
# Load the dataset

data = load_breast_cancer(as_frame=True)

X = data.data
y = data.target


### Inputs and target

- `X` contains the input features
- `y` contains the target labels

This is a binary classification problem,
where the goal is to assign each sample
to one of two possible classes.
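
Before going further, it can help to look at the size of the data and how the two classes are distributed. The snippet below is a small inspection sketch that reloads the dataset so it can run on its own. The printed quantities are our choice and are not required by the rest of the notebook.

```python
from sklearn.datasets import load_breast_cancer

# Reload the dataset so this cell is self-contained
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

# Number of samples and number of input features
print(X.shape)

# How many samples belong to each class label
print(y.value_counts())

# Human-readable names for labels 0 and 1
print(list(data.target_names))
```

The classes are not perfectly balanced, which is worth keeping in mind when reading accuracy later on.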


### Why using the same dataset matters

Using the same dataset across classification models
allows us to:
- compare performance fairly
- isolate model-specific behavior
- better understand trade-offs between models


___________________________________
## 3. Train-test split

In this section we split the dataset
into training and test sets.

This step allows us to evaluate
how well the KNN classifier generalizes
to unseen data.


In [3]:
# Split data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)


### Why this step is important

KNN does not fit any parameters during training.
Instead, it stores the training data
and uses it directly to make predictions.

For this reason:
- the training set defines the model’s knowledge
- the test set must remain completely unseen

This separation is essential
to obtain a fair evaluation.


### Note on split proportions

The choice of train-test split
depends on the dataset and problem.

Common splits include:
- 80 / 20
- 70 / 30
- 90 / 10

Here we use 80 / 20
as a reasonable default for demonstration purposes.
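
To see what the split produces, we can check the size of each part and the share of class 1 in each. The sketch below also shows an optional variant using the `stratify` argument of `train_test_split`, which keeps class proportions similar in both parts. The stratified variant is an illustration only: the notebook itself uses the plain split, and the `_s` variable names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Same split as in the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))

# Fraction of samples labeled 1 in each part
print(y_train.mean(), y_test.mean())

# Optional variant (not used in this notebook): stratified split
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train_s.mean(), y_test_s.mean())
```

With stratification the two class fractions should be nearly identical, which can matter more on smaller or more imbalanced datasets.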


___________________________________
## 4. Feature scaling (why we do it)

In this section we apply feature scaling
to the input features.

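For KNN Classification, feature scaling is **essential in practice**.
Without scaling, the model still runs,
but features with large numerical ranges dominate the distance computation,
so the neighbors it finds can be misleading.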


### Why scaling is critical for KNN

KNN is a **distance-based model**.

This means that:
- predictions depend entirely on distances between samples
- features with larger numerical ranges dominate the distance
- unscaled data leads to biased neighbor selection
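
A small numeric example makes the effect concrete. The samples below are hypothetical (an age in years and an income in dollars), and the per-feature scales in `feature_scale` are assumed values standing in for standard deviations. None of this comes from the breast cancer dataset.

```python
import numpy as np

# Hypothetical samples: [age in years, income in dollars]
query = np.array([30.0, 50_000.0])
a = np.array([65.0, 50_500.0])  # very different age, similar income
b = np.array([31.0, 52_000.0])  # almost the same age, somewhat different income


def euclidean(u, v):
    """Euclidean distance between two 1D vectors."""
    return float(np.linalg.norm(u - v))


# Raw distances: the income column dominates because its numbers are larger
raw_a = euclidean(query, a)
raw_b = euclidean(query, b)
print(raw_a < raw_b)  # True: 'a' looks closer on raw data

# Assumed per-feature scales (stand-ins for standard deviations)
feature_scale = np.array([10.0, 20_000.0])

# Distances after dividing each feature by its scale
scaled_a = euclidean(query / feature_scale, a / feature_scale)
scaled_b = euclidean(query / feature_scale, b / feature_scale)
print(scaled_b < scaled_a)  # True: 'b' is closer once both features count comparably
```

The nearest neighbor changes from `a` to `b` purely because of the units, which is exactly the bias scaling is meant to remove.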


### Important rule: fit only on training data

As with all preprocessing steps:
- the scaler is fitted only on the training data
- the same scaler is applied to test data

This prevents data leakage
and ensures fair evaluation.


In [4]:
# Feature scaling (essential for KNN): fit on training data only, then transform both sets

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


### What we have after this step

- scaled training data
- scaled test data
- a valid input space for distance computation

At this point, the data is ready
to be used by the KNN classifier.
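
As a quick sanity check, we can confirm that the scaled training features have mean close to 0 and standard deviation close to 1, while the test features are only approximately standardized because they reuse the training statistics. The sketch below repeats the earlier steps so it runs on its own. The variable names `train_means`, `train_stds`, and `test_means` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on training data only, then reuse the same statistics for the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Per-feature statistics after scaling
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print(np.allclose(train_means, 0.0, atol=1e-8))  # training means are ~0
print(np.allclose(train_stds, 1.0))              # training stds are ~1
print(np.abs(test_means).max())                  # test means are near 0, not exactly 0
```

Small non-zero test means are expected and are not a sign of a problem. They simply reflect that the test set was never seen by the scaler.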
