# Supervised Learning -> KNN Regression (K-Nearest Neighbors)

Sections 1 to 4 are identical across all regression models.
Keeping this part unchanged allows easy model comparison
and prevents pipeline-related mistakes.
Everything after that depends on the specific model.

1. Project setup and common pipeline
2. Dataset loading
3. Train-test split
4. Feature scaling (why we do it)
____
5. What is this model? (Intuition)
6. Model training
7. Model behavior and key hyperparameters
8. Predictions
9. Model evaluation
10. When to use it and when not to
11. Model persistence
12. Mathematical formulation (deep dive)
13. Final summary – Code only
-----------------------------------------------------

## How this notebook should be read

This notebook is designed to be read **top to bottom**.

Before every code cell, you will find a short explanation describing:
- what we are about to do
- why this step is necessary
- how it fits into the overall process

The goal is not just to run the code,
but to understand what is happening at each step
and be able to adapt it to your own data.

-----------------------------------------------------
## What is KNN Regression? 

KNN Regression is a very different model compared to Linear Regression.

Instead of learning a global formula or a line,
KNN Regression makes predictions by looking at the data directly.

The idea is simple:
to predict the value for a new data point,
the model searches for the **K closest data points**
in the training set.

Once these neighbors are found:
- their target values are collected
- the prediction is computed as the **average** of those values

There is no explicit training phase where parameters are learned.
The model simply stores the training data
and uses it at prediction time.

For this reason, KNN is often described as a
**lazy, instance-based model**.

-----------------------------------------------------
## Why we start with intuition

Starting with intuition is especially important for KNN Regression.

All the model does is:
- measure distances between data points
- decide which points are close
- aggregate their target values

If this idea is clear,
the rest of the notebook becomes easy to follow.

Every step in the code will reflect this logic:
distance → neighbors → average → prediction.

-----------------------------------------------------

## What you should expect from the results

Before using KNN Regression, it is important to set expectations.

With KNN Regression, you should expect:
- predictions that adapt locally to the data
- good performance when similar data points exist nearby
- sensitivity to noise and outliers

This model often performs well as a flexible alternative
when linear assumptions are too restrictive,
but it may struggle on large datasets or high-dimensional data.


## 1. Project setup and common pipeline

In this section we prepare everything that is shared across all regression models.

The goal is to:
- set up a clean and reproducible environment
- import all common dependencies
- define a standard pipeline that will not change when switching models

This part of the notebook is intentionally kept simple and consistent.
If something changes here, it should change in all models.


### Why having a common pipeline matters

Using the same pipeline for all models allows us to:
- compare models fairly
- avoid data leakage
- reduce implementation errors
- focus on understanding the model instead of debugging the setup

From this point on, every model will start from the exact same data preparation steps.


In [2]:
#   Used for numerical operations and data manipulation.
import numpy as np 
import pandas as pd

#  Provides a real-world regression dataset for demonstration purposes.
from sklearn.datasets import fetch_california_housing

#  Ensures a clean separation between training and test data.
from sklearn.model_selection import train_test_split

#  Applies feature scaling to ensure numerical stability and pipeline consistency.
from sklearn.preprocessing import StandardScaler

#  Used later to assess model performance.
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Used to build file paths that work across operating systems.
from pathlib import Path

#  Used to save trained models and preprocessing objects.
import joblib

# ____________________________________

## 2. Dataset loading

In this section we load the dataset that will be used throughout the notebook.

The purpose of this step is to:
- obtain real data to work with
- clearly separate input features from the target variable
- create a structure that can be easily replaced with custom datasets

At this stage, no modeling is involved.
We are only defining what the model will learn from.


### About the dataset

We use the California Housing dataset, a classic regression dataset.

Each row represents a housing district in California.
Each column represents a numerical feature describing that district.
The target variable is a continuous value: the median house value of the district,
expressed in units of $100,000.


In [3]:
data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target

### What we obtained

- `X`  
  A table of input features used by the model to make predictions.

- `y`  
  The target variable we want the model to predict.

This separation is fundamental:
- the model learns patterns from `X`
- the model is evaluated by comparing predictions to `y`

Adapting this step to your own data:
- `X` should contain only feature columns
- `y` should contain the target variable
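
As a minimal sketch of that adaptation, assuming your data lives in a pandas DataFrame: the column names, the values, and the `target_column` variable below are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical dataset: column names and values are made up for illustration.
df = pd.DataFrame({
    "size_sqm": [50, 80, 120, 65],
    "rooms": [2, 3, 4, 2],
    "age_years": [30, 12, 5, 20],
    "price": [150_000, 240_000, 390_000, 185_000],
})

target_column = "price"  # assumed name of the target column

X_custom = df.drop(columns=[target_column])  # feature columns only
y_custom = df[target_column]                 # target variable

print(X_custom.shape, y_custom.shape)
```

The rest of the pipeline can then be run on `X_custom` and `y_custom` in place of `X` and `y`.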

# ____________________________________

## 3. Train-test split

In this section we split the dataset into two separate parts:
- a training set
- a test set

This step is fundamental in machine learning and applies to almost every model.


### Why we split the data

The goal of a machine learning model is not to memorize data,
but to make good predictions on **new, unseen data**.

If we trained and evaluated the model on the same data:
- we would get overly optimistic results
- we would not know how well the model generalizes

By splitting the data:
- the training set is used to learn patterns
- the test set is used only for evaluation


In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

### What these parameters mean

- `test_size=0.2`  
  20% of the data is reserved for testing,  
  80% is used for training.  
  This ratio can be changed depending on the dataset.

- `random_state=42`  
  Fixes the random shuffling so that the split is reproducible.

### What we have after this step

- `X_train`, `y_train`  
  Data used to train the model.

- `X_test`, `y_test`  
  Data kept aside and used only for evaluation.

From this point on:
- the model must never see the test data during training
- all learning happens exclusively on the training set


# ____________________________________

## 4. Feature scaling (why we do it)

In this section we apply feature scaling to the input data.

Feature scaling means transforming the numerical features
so that they all follow a similar scale.


### Why feature scaling is important

In many datasets, features can have very different ranges.

For example:
- one feature may range between 0 and 1
- another may range between 0 and 100,000

Without scaling:
- features with larger values can dominate the learning process
- numerical computations may become unstable

### Does KNN require scaling?

Strictly speaking:
- KNN **can run without feature scaling**

In practice, however, scaling is essential because:

1. Meaningful distances  
   KNN predictions depend on distances between data points.
   Without scaling, features with large ranges dominate those distances.
2. Pipeline consistency  
   Many other models (Ridge, Lasso, SGD, SVM) also work best with scaled features.
   Keeping scaling in the pipeline avoids changing code later.




### Important rule: fit only on training data

The scaler must be:
- fitted on the training data
- applied to both training and test data

In [5]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### What we have after this step

- `X_train_scaled`  
  Scaled features used to train the model.

- `X_test_scaled`  
  Scaled features used only for evaluation.

Now we are ready to introduce the model.


# ____________________________________

## 5. What is this model? (KNN Regression)

Before training the model, it is important to understand
what KNN Regression is trying to do conceptually.

KNN Regression is a **distance-based model**.
Unlike Linear Regression, it does not learn a global formula
that describes the entire dataset.

### The core idea

KNN Regression makes predictions by comparing data points.

To predict the value for a new input:
- the model looks at the training data
- finds the K closest data points (neighbors)
- computes the prediction as the average of their target values

The idea is that:
points that are close to each other
are likely to have similar target values.
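
The core idea fits in a few lines of NumPy. This is a from-scratch sketch for intuition only, not how scikit-learn implements it; the function name `knn_predict` and the toy arrays are made up for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict one value as the mean target of the k nearest training points."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Prediction = average of the neighbors' target values
    return y_train[nearest].mean()

# Toy data: one feature, target equal to the feature value
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0])

print(knn_predict(X_toy, y_toy, np.array([2.5]), k=3))   # 2.0
print(knn_predict(X_toy, y_toy, np.array([10.5]), k=2))  # 10.5
```

A query near the first cluster is predicted from that cluster only; the distant points have no influence at all.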


### Why distance matters

Distance is the key concept behind KNN.

The notion of "closeness" depends on:
- the feature values
- the distance metric
- proper feature scaling

If features are not scaled:
- features with large numeric ranges dominate the distance
- features with small ranges are effectively ignored

This is why feature scaling is, in practice, mandatory for KNN Regression.
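
A small numeric illustration of this effect, as a sketch under stated assumptions: the three points and the per-feature standard deviations in `feature_std` are invented, and dividing by the standard deviation stands in for standardization (the mean cancels out when two points are subtracted).

```python
import numpy as np

# Made-up points: [ratio feature in 0..1, income in dollars]
a = np.array([0.10, 50_000.0])
b = np.array([0.90, 50_500.0])  # very different ratio, similar income
c = np.array([0.11, 52_000.0])  # nearly identical ratio, somewhat different income

# Assumed per-feature standard deviations of a hypothetical training set
feature_std = np.array([0.3, 20_000.0])

# Raw Euclidean distances: the income column dominates
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Distances after dividing each feature by its standard deviation
scaled_ab = np.linalg.norm((a - b) / feature_std)
scaled_ac = np.linalg.norm((a - c) / feature_std)

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values `b` looks like the closer neighbor of `a`, purely because of the income column. After scaling, `c` becomes the closer one, since both features now contribute on comparable terms.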


### Key takeaway

KNN Regression does not try to explain the data with a formula.

Instead, it answers the question:
"How did similar data points behave in the past?"

# ____________________________________

## 6. Model training (KNN Regression)

In this section we train the KNN Regression model.

For KNN, training does not mean learning parameters
or fitting a mathematical function.

Instead, training simply consists of:
- storing the training data
- preparing it to be used for distance-based predictions

### What does "training" mean for KNN?

Unlike Linear Regression, KNN is a **lazy model**.

This means:
- no optimization is performed during training
- no coefficients are learned
- no global model is built

All the work happens at prediction time,
when distances between data points are computed.


### The role of K (number of neighbors)

The most important hyperparameter in KNN is **K**.

K represents:
- how many neighbors are considered when making a prediction

Choosing K involves a trade-off:
- small K → very local, sensitive to noise
- large K → smoother predictions, less flexible


In [6]:
from sklearn.neighbors import KNeighborsRegressor

# Initialize the KNN regressor
knn_model = KNeighborsRegressor(
    n_neighbors=5
)

# "Train" the model
knn_model.fit(X_train_scaled, y_train)


Parameters of the fitted model (estimator display output):

- `n_neighbors`: 5
- `weights`: 'uniform'
- `algorithm`: 'auto'
- `leaf_size`: 30
- `p`: 2
- `metric`: 'minkowski'
- `metric_params`: None
- `n_jobs`: None


### What we have after this step

After this step:
- the model has stored the training data
- the value of K has been fixed
- the model is ready to make predictions

The model will compute distances only when predictions are requested.

### Important implication

Because KNN defers computation to prediction time:
- training is fast
- prediction can be slow on large datasets
- memory usage is higher than parametric models

This behavior will influence
when and where KNN should be used.


# ____________________________________
## 7. Model behavior and key hyperparameters

In this section we analyze how the KNN Regression model behaves
and which hyperparameters control its predictions.

Unlike Linear Regression, KNN does not produce coefficients.
Its behavior is entirely determined by a small set of choices.


### The most important hyperparameter: K

The value of **K** determines how many neighbors
are used to compute each prediction.

This choice has a strong impact on model behavior:

- Small K (e.g. K = 1 or 3):
  - predictions are very local
  - the model is sensitive to noise
  - high variance, low bias

- Large K (e.g. K = 20 or more):
  - predictions are smoother
  - the model is less sensitive to noise
  - higher bias, lower variance
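
The two extremes of K can be checked directly. The sketch below uses synthetic data (a noisy sine curve, invented for this example) rather than the housing dataset, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1D data: noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=200)

# K = 1: each training point is its own nearest neighbor,
# so the model reproduces the training targets, noise included
knn_1 = KNeighborsRegressor(n_neighbors=1).fit(X_demo, y_demo)
pred_1 = knn_1.predict(X_demo)

# K = number of training samples: every prediction is the global mean
knn_all = KNeighborsRegressor(n_neighbors=len(X_demo)).fit(X_demo, y_demo)
pred_all = knn_all.predict(X_demo)

print(np.allclose(pred_1, y_demo))          # True
print(np.allclose(pred_all, y_demo.mean())) # True
```

A perfect fit on the training data with K = 1 is memorization, not evidence of good generalization. With K equal to the training set size the model ignores the input entirely. Useful values of K lie between these two extremes.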


### Distance metric

KNN relies on a distance metric to define what “close” means.

The choice of distance metric affects:
- neighbor selection
- prediction behavior
- model sensitivity to feature distributions
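
In scikit-learn the default metric is Minkowski with `p=2`, which is the Euclidean distance; `p=1` gives the Manhattan distance. The helper below is a hand-written sketch of that formula for illustration, not the library's implementation.

```python
import numpy as np

def minkowski_distance(u, v, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

print(minkowski_distance(u, v, p=1))  # 7.0 (Manhattan: 3 + 4)
print(minkowski_distance(u, v, p=2))  # 5.0 (Euclidean: sqrt(9 + 16))
```

The same two points are 7 units apart under one metric and 5 under the other, so the ranking of neighbors can change with the metric.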


### Uniform vs distance-based weighting

KNN can weight neighbors in different ways:

- Uniform weighting:
  - all neighbors contribute equally
  - prediction is the simple average

- Distance-based weighting:
  - closer neighbors have more influence
  - farther neighbors contribute less

This choice can significantly affect predictions,
especially when neighbors are unevenly distributed.
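
The difference between the two schemes can be sketched by hand. The neighbor targets and distances below are invented. Inverse-distance weights are used, which is what `weights='distance'` means in scikit-learn. This sketch does not handle a neighbor at distance zero.

```python
import numpy as np

# Three hypothetical neighbors of a query point
neighbor_targets = np.array([1.0, 2.0, 3.0])
neighbor_distances = np.array([0.5, 1.0, 2.0])

# Uniform weighting: plain average
uniform_pred = neighbor_targets.mean()

# Distance-based weighting: weight = 1 / distance
weights = 1.0 / neighbor_distances
distance_pred = np.sum(weights * neighbor_targets) / np.sum(weights)

print(uniform_pred)             # 2.0
print(round(distance_pred, 3))  # 1.571
```

The weighted prediction is pulled toward the target of the closest neighbor (1.0), while the uniform average treats all three equally.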


# ____________________________________
## 8. Predictions (KNN Regression)

In this section we use the trained KNN model
to generate predictions on unseen data.

This step shows how KNN behaves in practice
when making predictions based on nearby data points.


### What does a prediction mean for KNN?

For KNN Regression, making a prediction means:
- taking a new input sample
- computing its distance to the training samples
  (all of them with brute-force search; tree-based search can skip distant ones)
- selecting the K closest neighbors
- averaging their target values
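
These steps can be verified against scikit-learn itself. The sketch below uses small random data (invented for this example) and the `kneighbors` method to retrieve the neighbors that the model uses, then compares their average target with the model's prediction.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Small random dataset, for illustration only
rng = np.random.default_rng(42)
X_small = rng.normal(size=(50, 2))
y_small = rng.normal(size=50)

model = KNeighborsRegressor(n_neighbors=5).fit(X_small, y_small)

# A new input sample
x_new = np.array([[0.0, 0.0]])

# Distances to, and indices of, the 5 nearest training samples
distances, indices = model.kneighbors(x_new)

# Average the neighbors' targets by hand and compare with predict()
manual_pred = y_small[indices[0]].mean()
model_pred = model.predict(x_new)[0]

print(np.isclose(manual_pred, model_pred))  # True
```

With the default uniform weighting, the prediction is exactly the mean target of the returned neighbors.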

In [7]:
# Generate predictions on the test set

y_pred_knn = knn_model.predict(X_test_scaled)


### What we obtained

- `y_pred_knn` contains the predicted target values
- each prediction is influenced by nearby training samples
- predictions can vary significantly across the input space

This local behavior is a defining characteristic of KNN.


# ____________________________________
## 9. Model evaluation (KNN Regression)

In this section we evaluate the performance of the KNN Regression model
by comparing its predictions with the true target values.

Evaluation allows us to understand
how well the model generalizes to unseen data.


### Why evaluation is important for KNN

KNN can behave very differently depending on:
- the value of K
- the density of the data
- the presence of noise

Evaluation helps us determine
whether the chosen configuration is appropriate
for the given problem.


In [8]:
# Compute evaluation metrics for KNN Regression

mae = mean_absolute_error(y_test, y_pred_knn)
mse = mean_squared_error(y_test, y_pred_knn)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_knn)

mae, mse, rmse, r2


(0.4461535271317829,
 0.4324216146043236,
 np.float64(0.6575877238850522),
 0.6700101862970989)

### Metrics interpretation

- **MAE (Mean Absolute Error)**  
  Average absolute prediction error.  
  Easy to interpret and less sensitive to outliers.

- **MSE (Mean Squared Error)**  
  Penalizes large errors more heavily.

- **RMSE (Root Mean Squared Error)**  
  Expresses the error in the same units as the target variable.

- **R² score**  
  Indicates how much variance in the target variable
  is explained by the model.
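
To make the definitions concrete, the four metrics can be computed by hand with NumPy. The values in `y_true` and `y_hat` are made up for this sketch and are unrelated to the housing results.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

errors = y_true - y_hat

mae_manual = np.mean(np.abs(errors))  # mean absolute error
mse_manual = np.mean(errors ** 2)     # mean squared error
rmse_manual = np.sqrt(mse_manual)     # root mean squared error

# R² = 1 - (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae_manual:.4f} MSE={mse_manual:.4f} "
      f"RMSE={rmse_manual:.4f} R2={r2_manual:.4f}")
# MAE=0.6250 MSE=0.5625 RMSE=0.7500 R2=0.8475
```

Note that RMSE (0.75) is larger than MAE (0.625) here because squaring gives the two errors of size 1 more weight than the error of size 0.5.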


### Which metric should we focus on when comparing models?

When comparing regression models, there is no single metric
that is always the “best”.

However, in practice, **RMSE** is often the most informative metric
and the one most commonly used for model comparison.

Why RMSE?
- It is expressed in the same units as the target variable
- It penalizes large errors more than small ones
- It reflects how wrong predictions are in realistic scenarios

MAE is useful for understanding average error,
and R² is useful for understanding explained variance,
but RMSE usually provides the best overall picture
when comparing different models on the same dataset.

For this reason, RMSE is typically the primary metric
used to compare regression models in this project.


### Interpreting the RMSE value

In this case, the RMSE is approximately **0.66**.

This means that the model’s predictions typically differ
from the true values by about **0.66 units of the target variable**.

Since the target in this dataset represents
house values in hundreds of thousands of dollars,
this corresponds to a typical error of roughly **$66,000**.

It is important to remember that:
- RMSE is not a plain average: large errors weigh more heavily
- the plain average absolute error is the MAE (about 0.45, roughly $45,000)
- some predictions may be much more accurate
- others may have larger errors


# ____________________________________


## 10. When to use it and when not to (KNN Regression)

Knowing when to use KNN Regression
is essential to avoid misuse and misleading results.

KNN is a powerful model in the right context,
but it also has clear limitations.


### When KNN Regression is a good choice

KNN Regression works well when:

- The relationship between features and target is non-linear
- Similar data points tend to have similar target values
- The dataset is not extremely large
- Feature scaling is properly applied
- Local patterns are more important than global trends


### When KNN Regression is not a good choice

KNN Regression tends to struggle when:

- The dataset is very large (prediction becomes slow and memory usage grows)
- The data is high-dimensional
- The data contains a lot of noise or outliers


# ____________________________________
## 11. Model persistence (KNN Regression)

In this section we save the trained KNN Regression model
and the preprocessing steps used during training.

Model persistence allows us to reuse the model
without repeating the entire training process.


### Why saving the model is important

Even though KNN does not learn parameters in the traditional sense,
saving the model is still essential.

The saved model contains:
- the training data
- the chosen hyperparameters (such as K)
- the configuration needed to reproduce predictions

This allows the model to be reused consistently.


### Important rule: save the scaler together with the model

KNN relies on distance calculations.

For this reason:
- new input data must be scaled in the same way
- using a different scaler would lead to incorrect distances

The model and the scaler must always be saved and loaded together.


In [10]:
from pathlib import Path
import joblib

# Define model directory
model_dir = Path("models/supervised_learning/regression/knn_regression")

# Create directory if it does not exist
model_dir.mkdir(parents=True, exist_ok=True)

# Save model and scaler
joblib.dump(knn_model, model_dir / "knn_regression_model.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")



['models\\supervised_learning\\regression\\knn_regression\\scaler.joblib']

### What we have now

- A trained KNN Regression model
- A fitted feature scaler
- Both saved and ready to be reused

At this point, the model can be:
- loaded in another notebook
- used in an application
- evaluated on new data


# ____________________________________

### Loading the model later (conceptual example)

To use the model in the future:
- load the scaler
- scale new input data
- load the KNN model
- generate predictions

This ensures full consistency with the original training pipeline.
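
A self-contained sketch of these four steps is shown below. To keep it runnable on its own, it first fits and saves a tiny stand-in model on synthetic data in a temporary directory; the `demo_` names and the synthetic data are hypothetical. In the notebook you would load from `model_dir` instead.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Stand-in for the saved artifacts: a tiny model fitted on synthetic data
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=100.0, scale=20.0, size=(30, 3))
y_fit = X_fit.sum(axis=1)

demo_scaler = StandardScaler().fit(X_fit)
demo_model = KNeighborsRegressor(n_neighbors=5).fit(
    demo_scaler.transform(X_fit), y_fit
)

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)  # hypothetical location; the notebook uses model_dir
    joblib.dump(demo_model, demo_dir / "knn_regression_model.joblib")
    joblib.dump(demo_scaler, demo_dir / "scaler.joblib")

    # 1. Load the scaler
    loaded_scaler = joblib.load(demo_dir / "scaler.joblib")

    # 2. Scale new input data with the loaded scaler (transform only, no fit)
    X_new = rng.normal(loc=100.0, scale=20.0, size=(4, 3))
    X_new_scaled = loaded_scaler.transform(X_new)

    # 3. Load the KNN model
    loaded_model = joblib.load(demo_dir / "knn_regression_model.joblib")

    # 4. Generate predictions
    loaded_pred = loaded_model.predict(X_new_scaled)

# The loaded objects should reproduce the original model's predictions
original_pred = demo_model.predict(demo_scaler.transform(X_new))
print(np.allclose(loaded_pred, original_pred))
```

The key point is that new data goes through `transform` of the loaded scaler and is never used to fit a new one.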


# ____________________________________
## 12. Mathematical formulation (deep dive)

This section provides a deeper look at how KNN Regression works
from a mathematical and algorithmic perspective.



### Representation of the data

KNN Regression operates directly on the training data.

Each training sample can be represented as:
- a feature vector xᵢ
- a corresponding target value yᵢ

The full training set is stored and used during prediction.


### Distance computation

To make a prediction for a new input x,
the model computes the distance between x
and every training sample xᵢ.

By default, this distance is the Euclidean distance.
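
Written out, with $m$ denoting the number of features (a symbol introduced here for illustration) and $x_{i,j}$ the $j$-th feature of training sample $x_i$:

$$
d(x, x_i) = \sqrt{\sum_{j=1}^{m} \left( x_j - x_{i,j} \right)^2}
$$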

This step determines which samples are considered
the "nearest neighbors".


### Neighbor selection and prediction

Once distances are computed:
- the K closest samples are selected
- their target values are retrieved

In KNN Regression, the prediction is computed as:
- the average of the target values of the selected neighbors

Optionally:
- closer neighbors can be given more weight
- farther neighbors can contribute less
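
As a sketch in formulas, let $N_K(x)$ denote the set of indices of the $K$ nearest neighbors of $x$ (notation introduced here for illustration). With uniform weighting the prediction is:

$$
\hat{y}(x) = \frac{1}{K} \sum_{i \in N_K(x)} y_i
$$

With distance-based weighting, assuming inverse-distance weights and writing $d(x, x_i)$ for the distance between $x$ and $x_i$:

$$
\hat{y}(x) = \frac{\sum_{i \in N_K(x)} w_i \, y_i}{\sum_{i \in N_K(x)} w_i},
\qquad w_i = \frac{1}{d(x, x_i)}
$$

Setting all $w_i = 1$ in the second formula recovers the first.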


### Final takeaway

KNN Regression is conceptually simple but powerful.

It replaces a mathematical model
with direct comparison between data points.

Understanding this mechanism explains:
- why scaling is mandatory
- why K strongly affects performance
- why KNN behaves very differently from linear models


# ____________________________________
## Final summary – Code only

The following cell contains the complete pipeline
from data loading to model persistence.

No explanations are provided here on purpose.

This section is intended for:
- quick execution
- reference
- reuse in scripts or applications

If you want to understand what each step does and why,
read the notebook from top to bottom.


In [None]:
# ====================================
# Imports
# ====================================

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from pathlib import Path
import joblib


# ====================================
# Dataset loading
# ====================================

data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target


# ====================================
# Train-test split
# ====================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)


# ====================================
# Feature scaling (mandatory for KNN)
# ====================================

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# ====================================
# Model initialization
# ====================================

knn_model = KNeighborsRegressor(
    n_neighbors=5
)


# ====================================
# Model training
# ====================================

knn_model.fit(X_train_scaled, y_train)


# ====================================
# Predictions
# ====================================

y_pred_knn = knn_model.predict(X_test_scaled)


# ====================================
# Model evaluation
# ====================================

mae = mean_absolute_error(y_test, y_pred_knn)
mse = mean_squared_error(y_test, y_pred_knn)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_knn)

print(f"MAE: {mae:.4f}  MSE: {mse:.4f}  RMSE: {rmse:.4f}  R2: {r2:.4f}")


# ====================================
# Model persistence
# ====================================

model_dir = Path("models/supervised_learning/regression/knn_regression")
model_dir.mkdir(parents=True, exist_ok=True)

joblib.dump(knn_model, model_dir / "knn_regression_model.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")
