**Author:** Shahab Fatemi

**Email:** shahab.fatemi@umu.se   ;   shahab.fatemi@amitiscode.com

**Created:** 2025-06-05

**Last update:** 2025-09-24

**MIT License** — Shahab Fatemi (2025); For use in the *Machine Learning in Physics* course, Umeå University, Sweden; See the full license text in the parent folder.

<hr>

📢 <span style="color:red"><strong> Note for Students:</strong></span>

* Before working on the labs, review your lecture notes.

* Please read all sections, code blocks, and comments **carefully** to fully understand the material. Throughout the labs, my instructions are provided to you in written form, guiding you through the materials step-by-step.

* All concepts covered in this lab are part of the course and may be included in the final exam.

* I strongly encourage you to work in pairs and discuss your findings, observations, and reasoning with each other.

* If something is unclear, don't hesitate to ask.

* I have done my best to make the lab files as bug-free (and error-free) as possible, but remember: *there is no such thing as bug-free code.* If you observe any bugs, errors, typos, or other issues, I would greatly appreciate it if you reported them to me by email. Verbal notifications do not work, as I will likely forget 🙂

* Your answers for the "⚡ Mandatory" sections of each lab <span style="color:red"><strong>must be submitted before the start of the next lab session</strong></span>.

ENJOY WORKING ON THIS LAB.
***

# 🛠️ Purpose and Learning Outcomes:

In this lab, you will learn about the K-Nearest Neighbors (KNN) algorithm for regression and classification tasks. You will understand how KNN works, how to implement it from scratch, how to use it from scikit-learn, and how to evaluate its performance on a dataset.

***

In [None]:
import sys
import os
sys.path.append(os.path.abspath('../utils'))
from notebook_config import *

# k-Nearest Neighbors (k-NN) 

**Overview:** k-NN is a simple, non-parametric, and instance-based method used for classification and regression tasks. Instead of learning an explicit model, k-NN makes predictions by finding the k closest data points (neighbors) to a given query point, using a distance metric such as Euclidean distance. 
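
To make the neighbor search described above concrete, here is a minimal sketch on a hypothetical toy dataset (the points, the query, and `k = 3` are made up for illustration and are not part of the lab data). It only shows how the k closest points are found with the Euclidean distance.

```python
import numpy as np

# Hypothetical toy data: five 2D training points and one query point.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [5.0, 1.0]])
x_query = np.array([0.4, 0.5])
k_toy = 3

# Euclidean distance from the query point to every training point.
distances = np.linalg.norm(X_toy - x_query, axis=1)

# Indices of the k smallest distances, nearest first.
nearest = np.argsort(distances)[:k_toy]

print("Distances:", np.round(distances, 3))
print("Indices of the k nearest neighbors:", nearest)
```

What happens next depends on the task: the target values of these neighbors are averaged in regression and voted on in classification.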

This lab consists of two parts: Regression and Classification.

## k-NN for Regression

You are familiar with the functions below from linear regression labs. The functions generate data for the motion of a projectile.

In [None]:
# Compute the true y values for the motion of a projectile
# y = -(1/2)*g*t^2 + v0*t + y0
#
def projectile_true(t, g=9.8, v0=25, y0=0):
    # The projectile equation is y = -(1/2)*g*t^2 + v0*t + y0
    return -0.5*g*t**2 + v0*t + y0 # true trajectory

def projectile_motion(g, v0, y0, time_range=(0, 5), num_points=100, noise_level=5, seed=42):
    np.random.seed(seed)  # For reproducibility
    
    # time for the projectile motion
    t = np.linspace(time_range[0], time_range[1], num_points)
 
    y_true  = projectile_true(t, g, v0, y0) # true trajectory
    y_noisy = y_true + np.random.normal(0, noise_level, num_points) # noisy data
    return t, y_true, y_noisy

Let's create some data and plot.

In [None]:
def plot_projectile(t, y_true, y_noisy):
    plt.figure()
    plt.scatter(t, y_noisy, s=50, color=colors[7], edgecolors='k', alpha=0.4, label="Noisy data")
    plt.plot(t, y_true, color=colors[0], linestyle="--", linewidth=2, label="Actual trajectory")

    # Customize plot appearance
    plt.xlabel("Time (s)", fontsize=14)
    plt.ylabel("Displacement (m)", fontsize=14)
    plt.title("Projectile motion", fontsize=16)
    plt.grid(True)
    plt.legend()
    plt.show()
    
# Generate data and split
# Parameters for the projectile motion simulation
v0 = 20.0  # initial velocity (m/s)
g  = 9.8   # gravity acceleration (m/s^2)
y0 = 0.0   # initial position/height/displacement (m)
n  = 200   # number of points (#)
time_range  = (0, 5)  # time range from 0 to 5 seconds
noise_level = 3  # standard deviation of the noise
t, y_true, y_noisy = projectile_motion(g=g, v0=v0, y0=y0, 
                                       time_range=time_range, 
                                       num_points=n, 
                                       noise_level=noise_level)

plot_projectile(t, y_true, y_noisy)  

Previously, and in linear regression labs, you have seen how to fit a linear model to data. However, not all relationships in data are linear. In this section, we will explore a general technique called k-NN regression that can be applied to both linear and non-linear data. When using k-NN regression, we do not assume any specific form for the relationship between input features and the target variable. Instead, we rely on the proximity of data points to make predictions.

The code section below implements a custom k-NN regression model to predict the displacement of a projectile over time, based on noisy simulated data. The `KNNRegressor` class contains methods to `fit` the model by storing training data, and to make `predictions` for new input points by calculating the distance between each test point and all training points, identifying the k closest neighbors, and averaging their target values to generate the prediction. 

The script splits the data into training and testing sets. It trains the k-NN regressor on the training data, predicts the displacement over a dense range of time points to produce a smooth curve, and finally visualizes the true motion, noisy training data, and the model's predictions. Here, I've not used the test data, because I want to show you the smooth curve of predictions over the entire time range. Therefore, I've generated a dense set of time points for prediction. The test data, however, will be used later to evaluate the model's performance.

I suppose you are familiar with the general concepts of OOP (Object-Oriented Programming) in Python. However, if you still do not feel confident, move to the first line after the class definition, analyze the code line by line from there, and look up each called method in the class.

As stated in the lecture notes, k-NN is computationally intensive during the prediction phase. This you can already see by comparing the `fit` and `predict` methods. The `fit` method simply stores the training data, while the `predict` method involves calculating distances and sorting to find the nearest neighbors for each test point.
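
To see where the prediction cost comes from, the sketch below computes all query-to-training distances at once as a matrix of shape `(n_query, n_train)`. It is not the implementation used in this lab: `knn_predict_vectorized` is a hypothetical helper, the data are made up, and `np.argpartition` is swapped in for the full sort because only the k smallest distances are needed, not their order.

```python
import numpy as np

def knn_predict_vectorized(x_train, y_train, x_query, k):
    """Hypothetical helper: 1D k-NN regression without a Python loop."""
    x_train = np.asarray(x_train, dtype=float).ravel()
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float).ravel()

    # Distance matrix of shape (n_query, n_train): one row per query point.
    distances = np.abs(x_query[:, None] - x_train[None, :])

    # argpartition moves the k smallest distances of each row into the
    # first k columns (in no particular order), which is enough for a mean.
    k_indices = np.argpartition(distances, k - 1, axis=1)[:, :k]

    # Average the targets of the k nearest neighbors of each query point.
    return y_train[k_indices].mean(axis=1)

# Made-up data on a straight line, so the result is easy to check by hand.
x_line = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_line = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
y_line_pred = knn_predict_vectorized(x_line, y_line, [0.1, 3.9], k=2)
print(y_line_pred)
```

The memory needed for the distance matrix grows with the number of query points times the number of training points, which is the price paid for removing the loop.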

In [None]:
from sklearn.model_selection import train_test_split

# Own developed k-Nearest Neighbors Regressor
class KNNRegressor:
    def __init__(self, k):
        self.k = k  # number of nearest neighbors
        
    # Take a copy of X_train and y_train values.
    def fit(self, X_train, y_train):
        self.X_train = X_train.flatten() #  Convert a 2D array made by train_test_split (n_samples, 1) to a 1D array of shape (n_samples)
        self.y_train = y_train # y_train is already in 1D format; no need to be flattened.
    
    # generates predicted values for each input using the kNN method.
    def predict(self, X_test):
        predictions = []

        # Walk through all test samples. You see that this is a loop over all test samples, 
        # which we have replaced with dense X.
        for x in X_test.flatten():
            # Step 1/5: Compute the absolute distance (here, 1D Euclidean) between each 
            # training point and the test point x.
            distances = np.abs(self.X_train - x)  # 1D Euclidean distance. In higher dimensions, use np.linalg.norm(self.X_train - x, axis=1)
            
            # Step 2/5 and 3/5: Sorts the distances from smallest to largest, and 
            # return the indices of the k nearest neighbors
            k_indices = np.argsort(distances)[:self.k]
            
            # Step 4/5: Extract the target values (y_train) of the k nearest neighbors.
            target_values = self.y_train[k_indices]

            # Step 5/5: aggregate the neighbors. For regression, this is the average (mean) of 
            # the target values (y_train) of the k nearest neighbors (the counterpart of majority voting in classification).
            y_pred = np.mean(target_values)

            predictions.append(y_pred)
        return np.array(predictions)

    # Plotting function
    def plot(self, x_true, y_true, x_train, y_train, x_pred, y_pred, k):
        plt.figure()
        plt.scatter(x_train, y_train, s=50, color=colors[7], edgecolors='k', alpha=0.4, label="Training noisy data")
        plt.plot(x_true, y_true, color=colors[0], linestyle="--", linewidth=2, label="Actual trajectory")
        plt.plot(x_pred, y_pred, color=colors[1], label=f"kNN Prediction (k={k})")

        # Customize plot appearance
        plt.xlabel("Time (s)", fontsize=14)
        plt.ylabel("Displacement (m)", fontsize=14)
        plt.title(f"kNN Prediction (k={k})", fontsize=16)
        plt.grid(True)
        plt.legend()
        plt.show()

# ========== MAIN ==========
# Reshape t for compatibility with sklearn's train_test_split
x_train, x_test, y_train, y_test = train_test_split(t.reshape(-1, 1), y_noisy, test_size=0.3, random_state=42)
print(x_train.shape, y_train.shape)

k = 5  # number of nearest neighbors

# Create the model
knn_model = KNNRegressor(k=k)  # only the __init__ method from the class is called here.

# Fit the model
knn_model.fit(x_train, y_train)

# Create dense X values to show smooth prediction curve
x_dense = np.linspace(time_range[0], time_range[1], n).reshape(-1, 1)

# Predict y values for the dense X values (not the test data)
y_dense_pred = knn_model.predict(x_dense)

# Plot the results
knn_model.plot(t, y_true, x_train, y_train, x_dense, y_dense_pred, k)  # data prediction and visualization


***
### ✅ Check your understanding

- Carefully read and analyze the `predict` method of the KNNRegressor class and compare the steps with what you have learned in the lecture notes (see KNN algorithm slide and mainly see the "Steps" section).

- Change the value of `k` to 1, 2, 5, 9, 25, and 50. Observe how the predictions change with different values of `k`. What do you notice about the model's behavior as `k` increases or decreases? Any signs of overfitting or underfitting? Explain your observations.

***

We want to evaluate the model's performance on the test data. For this, we will use different metrics. Below, I've developed another class that computes and visualizes these metrics. I've limited the metrics to three common ones: Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared. You can, of course, add more metrics if you wish.
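
As a reminder of what the three metrics measure, the sketch below computes them by hand with NumPy on hypothetical toy values (the numbers are made up for illustration). The sklearn functions used in this lab should return the same quantities.

```python
import numpy as np

# Hypothetical toy values, only to make the formulas concrete.
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true_toy - y_pred_toy

# Mean Absolute Error: average magnitude of the residuals.
mae_toy = np.mean(np.abs(residuals))

# Mean Squared Error: average squared residual (penalizes large errors more).
mse_toy = np.mean(residuals**2)

# R-squared: 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean())**2)
r2_toy = 1.0 - ss_res / ss_tot

print(f"MAE = {mae_toy:.2f}, MSE = {mse_toy:.2f}, R2 = {r2_toy:.2f}")
```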

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Validator for any KNN model. I've made it generic to be used for any KNN model
class KNNValidator:
    # x_train, y_train: training data
    # x_test,  y_test : test data
    # model_class: the KNN model class to be evaluated, e.g., KNNRegressor
    def __init__(self, x_train, y_train, x_test, y_test, model_class):
        # Take an instance copy (with reference) of the input parameters.
        self.x_train     = x_train
        self.y_train     = y_train
        self.x_test      = x_test
        self.y_test      = y_test
        self.model_class = model_class

    # For a range of k-values, perform model evaluation and
    # compute the required metrics, like MSE, MAE, and R2
    def evaluate(self, k_values):
        self.results = []       # Combine all results
        
        # For different k values, predict values and measure model accuracy
        for k in k_values:
            # Instance of the model class (e.g., KNNRegressor)
            model = self.model_class(k=k)
            model.fit(self.x_train, self.y_train)
            
            # Predict values
            y_pred = model.predict(self.x_test)
            
            mse = mean_squared_error (self.y_test , y_pred)
            mae = mean_absolute_error(self.y_test , y_pred)
            r2  = r2_score(self.y_test, y_pred)

            self.results.append({
                'k'     : k,
                'mse'   : mse,
                'mae'   : mae,
                'r2'    : r2,
                'y_pred': y_pred
            })
                
        return self.results

    # Plot all metrics
    def plot_metrics(self):
        ks   = [r['k']   for r in self.results]
        mses = [r['mse'] for r in self.results]
        maes = [r['mae'] for r in self.results]
        r2s  = [r['r2']  for r in self.results]

        # Plot all metrics
        plt.figure(figsize=(10, 3), dpi=200)

        plt.subplot(1, 3, 1)
        plt.plot(ks, mses, marker='o')
        plt.xlabel("k", fontsize=14)
        plt.ylabel("MSE", fontsize=14)
        plt.title("Mean Squared Error", fontsize=16)
        plt.grid(True)

        plt.subplot(1, 3, 2)
        plt.plot(ks, maes, marker='o', color='orange')
        plt.xlabel("k", fontsize=14)
        plt.ylabel("MAE", fontsize=14)
        plt.title("Mean Absolute Error", fontsize=16)
        plt.grid(True)

        plt.subplot(1, 3, 3)
        plt.plot(ks, r2s, marker='o', color='green')
        plt.xlabel("k", fontsize=14)
        plt.ylabel("R2 Score", fontsize=14)
        plt.title("R2 Score", fontsize=16)
        plt.grid(True)

        plt.show()

# ========== MAIN ==========
# Validate across multiple k values
validator = KNNValidator(x_train, y_train, x_test, y_test, KNNRegressor)
k_values = list(range(1, 31, 2))  # A range of k values to be evaluated; only odd k-values are used for safety! 
results = validator.evaluate(k_values)

validator.plot_metrics()

***
### ✅ Check your understanding
- Analyze the plots of the error metrics. What do they tell you about the model's performance across different `k` values? How do these metrics help in understanding the model's accuracy and reliability?

### ⚡ Mandatory submission
- Find the best `k` value based on the error metrics. How did you find it? Explain your reasoning.
***

In the code section below, I've used sklearn to develop another class that, in practice, does the same job as the KNNRegressor class. The purpose is to show you how to use sklearn's KNeighborsRegressor class. You can compare the results of the two classes. They should be very similar, if not identical. The major differences include: 
- sklearn's class is highly optimized and should run faster than mine (you notice the difference for very large datasets), and
- in my implementation, I've not normalized the data, while in the sklearn-based class, the data is standardized with `StandardScaler` inside a `Pipeline` before it reaches `KNeighborsRegressor`.
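
The claim that the two approaches should agree can be checked on a small example. The sketch below is an illustration under stated assumptions: the data are synthetic and made up here, and the scaler is left out on purpose so that only the neighbor search and the averaging are compared.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1D data, made up for this comparison only.
rng = np.random.default_rng(0)
x_cmp = np.sort(rng.uniform(0, 5, 40))
y_cmp = np.sin(x_cmp) + rng.normal(0, 0.1, 40)
x_cmp_query = np.linspace(0.2, 4.8, 7)
k_cmp = 3

# Hand-rolled k-NN regression: sort the distances, average the k nearest targets.
manual_pred = np.array([
    y_cmp[np.argsort(np.abs(x_cmp - xq))[:k_cmp]].mean()
    for xq in x_cmp_query
])

# sklearn's regressor on the same data, without any scaling step.
sk_model = KNeighborsRegressor(n_neighbors=k_cmp)
sk_model.fit(x_cmp.reshape(-1, 1), y_cmp)
sk_pred = sk_model.predict(x_cmp_query.reshape(-1, 1))

print("Predictions agree:", np.allclose(manual_pred, sk_pred))
```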

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

class SklearnKNNRegressor:
    def __init__(self, k=3):
        self.k = k
        
        # Create the pipeline
        self.model = Pipeline([
            ('scaler', StandardScaler()),
            ('knn', KNeighborsRegressor(n_neighbors=self.k))
        ])

    def fit(self, X_train, y_train):
        self.model.fit(X_train, y_train)

    def predict(self, X_test):
        return self.model.predict(X_test)

    # Plotting function
    def plot(self, x_true, y_true, x_train, y_train, x_pred, y_pred, k):
        plt.figure()
        plt.scatter(x_train, y_train, s=50, color=colors[7], edgecolors='k', alpha=0.4, label="Training noisy data")
        plt.plot(x_true, y_true, color=colors[0], linestyle="--", linewidth=2, label="Actual trajectory")
        plt.plot(x_pred, y_pred, color=colors[1], label=f"kNN Prediction (k={k})")

        # Customize plot appearance
        plt.xlabel("Time (s)", fontsize=14)
        plt.ylabel("Displacement (m)", fontsize=14)
        plt.title(f"kNN Prediction (k={k})", fontsize=16)
        plt.grid(True)
        plt.legend()
        plt.show()

# ========== MAIN ==========   
k = 5
# Instantiate and use the sklearn-based kNN regressor
sklearn_knn_model = SklearnKNNRegressor(k)
sklearn_knn_model.fit(x_train, y_train)

# Predict on dense time points
y_dense_pred_sklearn = sklearn_knn_model.predict(x_dense)

# Plot predictions
sklearn_knn_model.plot(t, y_true, x_train, y_train, x_dense, y_dense_pred_sklearn, k)

In [None]:
# Validate across multiple k values
sklearn_validator = KNNValidator(x_train, y_train, x_test, y_test, SklearnKNNRegressor)
k_values = list(range(1, 31, 2))
results = sklearn_validator.evaluate(k_values)

sklearn_validator.plot_metrics()

***
### ✅ Check your understanding
- Compare the results of the custom k-NN regressor (KNNRegressor) with those from sklearn's implementation (SklearnKNNRegressor). What similarities and differences do you observe in the predictions and performance metrics? Discuss any discrepancies and potential reasons.

- You can also try to modify the SklearnKNNRegressor class to exclude scaling the data. How does scaling affect the model's performance? Is it something that you expected? Explain your findings and discuss them with your neighboring classmates.

***

## Cross-Validation Analysis

In the code section below, I've demonstrated how to use sklearn's `cross_val_score` and `cross_val_predict` functions to perform cross-validation on the k-NN regression model.

`cross_val_score` is used to evaluate a model's performance using cross-validation. It splits the dataset into multiple folds (the k in k-fold is unrelated to the k in k-NN), trains the model on all folds but one, and tests it on the remaining fold. This process is repeated so every fold gets used for testing once, and the function returns an array of scores (like accuracy or MSE), one for each fold. By averaging these scores, you can estimate how well the model is expected to generalize to new/unseen data.

`cross_val_predict` is used to generate cross-validated predictions for every data point in the dataset. Instead of returning evaluation scores, it returns the predicted labels or values, where each prediction is made on a fold that did not include that data point in training. This is useful when you want to analyze predictions directly e.g., in making a confusion matrix, plotting predicted vs. actual values, or creating residual plots in regression. Read more about these functions in the [sklearn documentation](https://scikit-learn.org/stable/modules/cross_validation.html) and [cross_val_predict documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html).
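
To make the fold mechanics described above concrete, here is a minimal sketch that repeats the split, train, and test loop by hand for one fixed number of neighbors and compares it with `cross_val_score`. The data are synthetic and made up for this illustration, and the same `KFold` object is passed to both versions so that they use identical splits.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic projectile-like data, made up for this illustration.
rng = np.random.default_rng(1)
X_cv = rng.uniform(0, 5, size=(60, 1))
y_cv = -4.9 * X_cv[:, 0]**2 + 20 * X_cv[:, 0] + rng.normal(0, 3, 60)

cv_model = KNeighborsRegressor(n_neighbors=5)
folds = KFold(n_splits=5)  # five consecutive folds, no shuffling

# Manual loop: train on four folds, test on the held-out fold.
manual_mse = []
for train_idx, val_idx in folds.split(X_cv):
    fold_model = clone(cv_model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    fold_pred = fold_model.predict(X_cv[val_idx])
    manual_mse.append(mean_squared_error(y_cv[val_idx], fold_pred))

# The same evaluation done by sklearn (scores are negated MSE values).
auto_mse = -cross_val_score(cv_model, X_cv, y_cv, cv=folds,
                            scoring='neg_mean_squared_error')

print("Manual:", np.round(manual_mse, 2))
print("sklearn:", np.round(auto_mse, 2))
```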

In [None]:
from sklearn.model_selection import cross_val_score, cross_val_predict

# Cross Validation
cv_scores = cross_val_score(sklearn_knn_model.model, x_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_scores *= -1  # Convert negative MSE to positive MSE
print("Cross validation results:")
print(f"    MSE scores: {[f'{score:.2f}' for score in cv_scores]}")
print(f"    Average MSE: {np.mean(cv_scores):.2f}")

# Cross validated predictions
y_cv_pred = cross_val_predict(sklearn_knn_model.model, x_train, y_train, cv=5)
mse_cv = mean_squared_error(y_train, y_cv_pred)
r2_cv = r2_score(y_train, y_cv_pred)
print(f"Cross-validated predictions:")
print(f"    MSE: {mse_cv:.2f}")
print(f"    R2:  {r2_cv :.2f}")

***
### ✅ Check your understanding

- Analyze the cross-validation results. For which `k` value are the cross-validation metrics obtained?

### ⚡ Mandatory submission
- Write code that performs cross-validation for different `k` values (e.g., 1, 3, 5, 7, 9) and computes the average MSE for each `k`. Plot the average MSE against `k` to visualize how the choice of `k` affects model performance. Identify the optimal `k` value based on this analysis. You should not submit your code to me, but you need to include your plot with the optimal `k` value in your submission. Justify your choice of the optimal `k` in a short paragraph.

***

## k-NN for Classification

Now we move to classification. We aim to apply k-NN to classify data points into different categories based on their features. 

Open your previous lab notebook file for Classification. I hope you remember what you've done there. If not, you can quickly browse through it again. 

In the code block below, I've simply copied the code from the classification lab that generates different datasets for classification tasks. In addition, I've also added a new function called `make_three_classes_modified` that generates three clusters of data. If you remember from that lab, we could not classify non-linearly separable data with a linear model, like Perceptron. Here, we will see how k-NN can handle such cases.

In [None]:
from sklearn.datasets import make_classification, make_circles, make_moons, make_blobs

# This function uses sklearn "make_classification" to generate a dataset with 
# two informative features, and one cluster per class.
# I've intentionally used a non-42 rand state here! Do not change it!
def make_regular_data(n_samples=500, rand_state=90):
    x, y = make_classification(n_samples=n_samples, n_features=2, n_redundant=0,
                               n_informative=2, n_clusters_per_class=1, random_state=rand_state)
    return x, y

# This function creates points uniformly distributed in a square and 
# labels them based on whether they lie above or below the line y = x, 
# producing a simple linear decision boundary.
def make_diagonal_data(n_samples=500, random_state=42):
    np.random.seed(random_state)

    # Generate uniform data in range [0, 5]
    x = np.random.uniform(0, 5, size=(n_samples, 2))

    # true labels based on y > x
    y = (x[:, 1] > x[:, 0]).astype(int)
    return x, y

# This function generates a dataset with a non-linear XOR pattern such that points are labeled 
# as class 1 if exactly one of their coordinates is greater than 2.5, and class 0 otherwise. 
# This creates a checkerboard-like separation.
def make_xor_data(n_samples=500, random_state=42):
    np.random.seed(random_state)

    # Generate uniform data in range [0, 5]
    x = np.random.uniform(0, 5, size=(n_samples, 2))

    # XOR condition: label is 1 if one of the coordinates is >2.5 and the other is <=2.5
    y = (((x[:, 0] >  2.5) & (x[:, 1] <= 2.5)) | 
         ((x[:, 0] <= 2.5) & (x[:, 1] >  2.5))).astype(int)
    return x, y

# This function produces a dataset consisting of two noisy concentric circles: 
# an inner and outer ring are labeled as different classes. 
# This is a classic example of a non-linearly separable dataset, useful for testing non-linear classifiers.
def make_concentric_circles(n_samples=500, factor=0.3, noise=0.1):
    x, y = make_circles(n_samples=n_samples, factor=factor, noise=noise, random_state=42)
    return x, y

# Similar to the circle function, this function makes two interleaving half circles.
def make_two_half_moons(n_samples=500, noise=0.2):
    x, y = make_moons(n_samples=n_samples, noise=noise, random_state=42)
    return x, y

# This function generates a dataset with three distinct clusters using Gaussian blobs.
def make_three_classes(n_samples=500, noise=0.7):
    x, y = make_blobs(n_samples=n_samples,
                    centers=[[0, 5], [2, 0], [5, 4]], #[[0, 5], [2, 0], [5, 4], [2, 6]],
                    cluster_std=noise,
                    random_state=42)
    return x, y


# Create four Gaussian blobs with make_blobs and merge the 4th blob into class 0, giving three classes
def make_three_classes_modified(samples=[50, 80, 80, 40]):
    centers = [[0, 5], [2, 0], [5, 4], [4, -1]]
    noise   = [2.0   , 2.0   , 2.0   , 1.0 ]
    x, y = make_blobs(n_samples=samples,
                    centers=centers, 
                    cluster_std=noise,
                    random_state=42)
    
    # Relabel the 4th blob (label == 3) to have the same label as the 1st blob (label == 0)
    y[y == 3] = 0
    return x, y


In the Classification lab, we did not use non-linearly separable data. 
In that lab, scroll down to "Multiclass Classifiers" section and in the first code-block in that section, instead of 
```python
x, y = make_three_classes( noise=1.0)
```
use

```python
x, y = make_xor_data()
```
and re-run that section and all code blocks below it. Do you remember what is being done? How did the linear model perform?

Now, in the code block below, we will use k-NN to classify the same data, but before that, we need to split the data into training and test sets, and visualize the training data.

In [None]:
from sklearn.model_selection import train_test_split

# ========== MAIN ==========
# Generate the XOR dataset (two classes)
X, y = make_xor_data()

# split data to training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Plot the training data
plt.figure( figsize=(5, 5), dpi=200)

# Different Marker for Scatter plot
markers = ['o', 's', '*', 'x', '^', 'v', '<', '>']  # Different markers for different classes

# Plot each class dynamically
for i, class_label in enumerate(np.unique(y_train)):
    plt.scatter(
        X_train[y_train == class_label][:, 0], 
        X_train[y_train == class_label][:, 1],
        color=colors[i % len(colors)],
        marker=markers[i % len(markers)],
        s=70, edgecolor='k', alpha=0.7,
        label=f'Class {class_label}'
    )

plt.xlabel('Feature x1')
plt.ylabel('Feature x2')
plt.title("Training Data")
plt.legend()
plt.show()

Now, we will use k-NN to classify the data. The code is simple and, since I've used sklearn, it is in principle similar to the SklearnKNNRegressor class. The major difference is that here we are using `KNeighborsClassifier` instead of `KNeighborsRegressor`. Study the code first, and then run it.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

k = 1

# Train a KNN classifier
knn = Pipeline([
            ('scaler', StandardScaler()),
            ('knn', KNeighborsClassifier(n_neighbors=k))
        ])

knn.fit(X_train, y_train)

# Create meshgrid for decision boundary
h = 0.1  # step size in the mesh
x_min, x_max = X[:, 0].min()-0.5, X[:, 0].max()+0.5
y_min, y_max = X[:, 1].min()-0.5, X[:, 1].max()+0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict on meshgrid
y_pred = knn.predict(np.c_[xx.ravel(), yy.ravel()])
y_pred = y_pred.reshape(xx.shape)

# Plot decision boundary
plt.figure()

# Plot each class dynamically
for i, class_label in enumerate(np.unique(y_train)):
    plt.scatter(
        X_train[y_train == class_label][:, 0], 
        X_train[y_train == class_label][:, 1],
        color=colors[i % len(colors)],
        marker=markers[i % len(markers)],
        s=70, edgecolor='k', alpha=0.7,
        label=f'Class {class_label}'
    )

plt.contourf(xx, yy, y_pred, alpha=0.4, cmap='Set1')

plt.xlabel('Feature x1')
plt.ylabel('Feature x2')
plt.title(f"KNN Decision Boundary (k={k})")
plt.legend()
plt.show()

In [None]:
from sklearn.metrics import accuracy_score

# Predict
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')

***
### 💡 Reflect and Run
- Analyze the decision boundary plot. How does the k-NN classifier separate the different classes in the feature space? Discuss how the choice of `k` influences the shape and complexity of the decision boundary. You can try different `k` values to see how the decision boundary changes.

- Find the best `k` value for the k-NN classifier. How did you find it? Explain in a short paragraph.

- For the best `k` value, make the confusion matrix for the test data. How well does the model perform on the test data? Explain your observations.

- Instead of the xor dataset, you can choose something else (e.g., `make_two_half_moons` and `make_concentric_circles`) and run the k-NN classification code again, and see how well k-NN can classify such data. Remember those data generating functions have some parameters that you can change to make the classification task easier or harder 🤓

⚠️ Practical note: The default distance metric in sklearn's `KNeighborsClassifier` and `KNeighborsRegressor` is `'minkowski'` with `p=2`, which is equivalent to the Euclidean distance. You can change it to other metrics, such as Manhattan or Chebyshev. For that, you need to set the `metric` parameter when initializing the k-NN model, e.g., `KNeighborsClassifier(n_neighbors=k, metric='manhattan')`.
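
The effect of the `metric` parameter mentioned in the note above can be inspected with a short sketch. It uses a made-up half-moons dataset (not the lab data) and compares the neighbor distances returned by the default settings with those of an explicit Euclidean metric, then prints how often a Manhattan-based model agrees with the default one.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

# Made-up dataset for this illustration.
X_met, y_met = make_moons(n_samples=200, noise=0.2, random_state=0)

default_knn   = KNeighborsClassifier(n_neighbors=5).fit(X_met, y_met)
euclidean_knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean').fit(X_met, y_met)
manhattan_knn = KNeighborsClassifier(n_neighbors=5, metric='manhattan').fit(X_met, y_met)

# Inspect the default settings of the estimator.
params = default_knn.get_params()
print("Default metric:", params['metric'], "with p =", params['p'])

# Distances to the 5 nearest neighbors of the first ten points.
dist_default, _   = default_knn.kneighbors(X_met[:10])
dist_euclidean, _ = euclidean_knn.kneighbors(X_met[:10])
print("Same distances:", np.allclose(dist_default, dist_euclidean))

# Fraction of points on which the Manhattan model agrees with the default one.
agreement = np.mean(default_knn.predict(X_met) == manhattan_knn.predict(X_met))
print(f"Agreement with Manhattan metric: {agreement:.2f}")
```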

### ⚡ Mandatory submission
- Instead of the xor function, use `make_three_classes_modified` function to generate data. Visualize the data and re-run the k-NN classification code. Find the best `k` value and make the confusion matrix for the test data. What is the best `k` value and how did you find it? There is no need to submit your code.

- What does the confusion matrix tell you about the model's performance? Calculate the accuracy and precision from the confusion matrix and interpret these metrics in the context of the classification task.

- For a classification task, explain what may happen if we choose an even value for `k` (e.g., 2, 4, 6, etc.) instead of an odd value. Discuss the potential implications on the model's predictions and performance. Is your explanation applicable to regression tasks as well? Write your answer in a short paragraph.

***

## Beauty of KDE
In the lecture notes, you have learned about Kernel Density Estimation (KDE) as a non-parametric way to estimate the probability density function of a random variable. During our previous lab session, I also emphasized the importance of KDE in classification tasks. KDE can be used for classification tasks by estimating the class-conditional densities and applying Bayes' theorem to classify new data points.
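
The classification rule described above can be sketched in a few lines: fit one KDE per class, add the log of the class prior to the log density, and pick the class with the largest sum. This is only an illustration under stated assumptions: the blob data are made up, `kde_classify` is a hypothetical helper, and the bandwidth of 0.5 is an arbitrary choice rather than a tuned value.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KernelDensity

# Made-up, well-separated blobs; labels 0, 1, 2 follow the order of the centers.
X_kde, y_kde = make_blobs(n_samples=300, centers=[[0, 5], [2, 0], [5, 4]],
                          cluster_std=0.7, random_state=0)
classes = np.unique(y_kde)

# One KDE per class estimates the class-conditional density p(x | c).
bandwidth = 0.5  # assumed value, not tuned
kdes = [KernelDensity(bandwidth=bandwidth, kernel='gaussian').fit(X_kde[y_kde == c])
        for c in classes]

# Class priors p(c), estimated from the class frequencies.
log_priors = np.log([np.mean(y_kde == c) for c in classes])

def kde_classify(X_new):
    """Hypothetical helper: pick the class maximizing log p(x | c) + log p(c)."""
    # score_samples returns log densities; stack them as (n_samples, n_classes).
    log_likelihoods = np.column_stack([kde.score_samples(X_new) for kde in kdes])
    log_posteriors = log_likelihoods + log_priors  # unnormalized log posterior
    return classes[np.argmax(log_posteriors, axis=1)]

# Classify the three blob centers.
centers_pred = kde_classify(np.array([[0.0, 5.0], [2.0, 0.0], [5.0, 4.0]]))
print(centers_pred)
```

The normalizing constant p(x) in Bayes' theorem is the same for every class, so it can be dropped when only the most probable class is needed.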

Let's make a quick implementation of this idea. However, moving towards the beauties of KDE deserves a dedicated lab session, which is beyond the scale and scope of this course. For now, we only limit ourselves to a simple implementation of the idea.

For that, let's first generate some data using the `make_three_classes_modified` function.

Then, for each class, we will fit a KDE model to estimate the density of that class. Finally, we will overlay the estimated density contours on the k-NN decision regions to see how the two relate. The code is simple and self-explanatory. Study it and run it. If you do not understand it, first discuss it with your neighboring classmates, and if you still do not understand it, ask me.

In [None]:
from sklearn.neighbors import KernelDensity

# Generate three classes using the modified function
X, y = make_three_classes_modified()

# split data to training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

k = 5 # number of nearest neighbors

# Train a KNN classifier with scaling
knn = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=k))
])

# Fit the model
knn.fit(X_train, y_train)

# Create meshgrid for decision boundary
h = 0.1  # step size in the mesh
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict on meshgrid
y_pred = knn.predict(np.c_[xx.ravel(), yy.ravel()])
y_pred = y_pred.reshape(xx.shape)

# Plot decision boundary
plt.figure()
plt.contourf(xx, yy, y_pred, alpha=0.4, cmap='Set1')

# For each class, fit a KDE and overlay the density contours
for i, class_label in enumerate(np.unique(y_train)):
    X_class = X_train[y_train == class_label]

    # Plot class points
    plt.scatter(
        X_class[:, 0], X_class[:, 1],
        color=colors[i % len(colors)],
        marker=markers[i % len(markers)],
        s=70, edgecolor='k', alpha=0.5,
        label=f'Class {class_label}'
    )

    # Fit KDE for this class
    # Do you remember the bandwidth parameter from the lecture notes?
    kde = KernelDensity(bandwidth=0.3, kernel='gaussian')
    kde.fit(X_class)

    # Evaluate density on meshgrid
    # kde.score_samples returns the log of the probability density
    # then we exponentiate it to get the actual density values
    log_dens = kde.score_samples(np.c_[xx.ravel(), yy.ravel()])
    dens = np.exp(log_dens).reshape(xx.shape)

    # Overlay density contours
    plt.contour(xx, yy, dens, colors=[colors[i % len(colors)]], 
                linewidths=1.0, alpha=0.9)

plt.xlabel('Feature x1')
plt.ylabel('Feature x2')
plt.title(f"KNN Decision Boundary with KDE (k={k})")
plt.legend()
plt.show()

***
END
***