# K-Nearest Neighbors (KNN) from Scratch

## Overview 🔍
In this project, I’m implementing a custom **K-Nearest Neighbors (KNN)** algorithm from scratch using Python and NumPy. KNN is a simple yet powerful algorithm that can be used for both classification and regression. It makes predictions by looking at the most similar data points (or "neighbors") in the training set and assigning the majority class (for classification) or averaging values (for regression).

### Key Concepts:
- **K-Nearest Neighbors (KNN)**: A classification and regression method that uses the closest K data points to make a prediction.
  
- **Distance Calculation**: The implementation supports **Euclidean distance** and **Manhattan distance**, selected through the `distance_metric` parameter, to measure similarity between data points.
  
- **Hyperparameter \( k \)**: The number of neighbors to consider for making a prediction. Choosing \( k \) affects the model’s bias and variance.
  
- **Decision Boundaries**: The boundaries between classes implied by the neighbor votes. Plotting them shows which class the model assigns to each region of the feature space.

---

## Objective 🎯
The goal of this project is to:
1. Implement a custom K-Nearest Neighbors class from scratch.
2. Allow the model to handle both classification and regression tasks.

---

## K-Nearest Neighbors Explanation 🧠

<img src="./figures/KNN.png" alt="KNN" width="600" height="200"/>


### How KNN Works
In the K-Nearest Neighbors algorithm, we predict the label of a new data point by finding the **K nearest points** in the training set. The steps to make a prediction are as follows:
1. **Calculate Distances**: Measure the distance from the new data point to every point in the training set using the chosen metric (Euclidean or Manhattan).
   
2. **Identify Neighbors**: Select the K closest data points based on the distance.
   
3. **Make Prediction**: 
   - For classification, the predicted label is the **majority class** among the neighbors.
   - For regression, the prediction is the **average** of the neighbor values.
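
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the class defined later: the toy data, `x_new`, and `k` are made up for the example, and the Euclidean metric is assumed.

```python
import numpy as np

# Toy data, made up for illustration: four training points with two features each.
X_train = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
x_new = np.array([2.0, 2.0])
k = 3

# Step 1: distance from x_new to every training point (Euclidean here).
distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))

# Step 2: indices of the k smallest distances (stable sort keeps ties in input order).
nearest = np.argsort(distances, kind="stable")[:k]

# Step 3: majority class among those neighbors.
labels, counts = np.unique(y_train[nearest], return_counts=True)
predicted_label = labels[np.argmax(counts)]

print(nearest, predicted_label)  # neighbors 1, 0, 2 -> two votes for class 0
```

For regression, step 3 would instead take `np.mean` of the neighbors' target values.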

### Distance Calculation
To determine the closeness of neighbors, we use the following distance formulas:

- **Euclidean Distance**:
  $$
  \text{distance} = \sqrt{\sum_{i=1}^{n} (x_i - x_{i,\text{neighbor}})^2}
  $$

- **Manhattan Distance**:
  $$
  \text{distance} = \sum_{i=1}^{n} |x_i - x_{i,\text{neighbor}}|
  $$

Where:
- \( x_i \): The \( i \)-th feature of the data point we want to predict.
- \( x_{i,\text{neighbor}} \): The corresponding feature of the neighbor.
- \( n \): Number of features.
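
As a quick numeric check of the two formulas, here is a small sketch with two made-up points that differ by 3 in the first feature and 4 in the second:

```python
import numpy as np

# Two hypothetical points with n = 2 features.
x = np.array([4.0, 5.0])
neighbor = np.array([1.0, 1.0])

euclidean = np.sqrt(np.sum((x - neighbor) ** 2))  # sqrt(3^2 + 4^2) = 5.0
manhattan = np.sum(np.abs(x - neighbor))          # |3| + |4| = 7.0

print(euclidean, manhattan)
```

Manhattan distance is never smaller than Euclidean distance for the same pair of points, so the two metrics can rank neighbors differently.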

### Choosing K (Hyperparameter Tuning)
The value of **K** affects the model's bias and variance:
- **Small K** (e.g., K=1) tends to have low bias and high variance, meaning it can capture more details but might overfit.
- **Larger K** values increase bias but reduce variance, leading to a smoother decision boundary and potentially better generalization.
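
One common way to pick K is to compare held-out accuracy across several candidates. The sketch below uses leave-one-out validation on synthetic data; the helper names `knn_predict_one` and `leave_one_out_accuracy`, the data, and the candidate values are all assumptions made for illustration, not part of this project's class.

```python
import numpy as np

def knn_predict_one(X_train, y_train, x, k):
    """Majority-vote label of x from its k nearest training points (Euclidean)."""
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(distances, kind="stable")[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def leave_one_out_accuracy(X, y, k):
    """Hold out each sample in turn, predict it from the rest, return the hit rate."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i  # every sample except the i-th
        hits += int(knn_predict_one(X[mask], y[mask], X[i], k) == y[i])
    return hits / len(X)

# Synthetic two-class data, made up for illustration: two overlapping Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
               rng.normal(2.0, 1.0, size=(30, 2))])
y = np.array([0] * 30 + [1] * 30)

# Score a handful of odd K values and keep the best one.
scores = {k: leave_one_out_accuracy(X, y, k) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Odd values of K are used here so that a two-class vote cannot tie.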

### Majority Voting
For classification, KNN uses **majority voting** to assign the class label. This means the class with the most occurrences among the K nearest neighbors is chosen as the prediction.
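
A minimal sketch of the vote itself, using `collections.Counter` from the standard library. The label lists are hypothetical and assumed to be ordered nearest-first; the second case shows that an even K can produce a tie, which then has to be broken by some rule.

```python
from collections import Counter

# Hypothetical labels of the k = 3 nearest neighbors, nearest first.
neighbor_labels = ["cat", "dog", "cat"]
winner, votes = Counter(neighbor_labels).most_common(1)[0]
print(winner, votes)  # "cat" wins with 2 of 3 votes

# With k = 2 the vote can tie. Counter orders equal counts by first appearance,
# so with nearest-first input the closest neighbor's label wins the tie.
tied_labels = ["cat", "dog"]
tied_winner = Counter(tied_labels).most_common(1)[0][0]
print(tied_winner)
```

Other tie-breaking rules (for example, weighting votes by inverse distance) are possible; this sketch only shows the simplest one.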

---

## Implementation 🛠️

The `KNN` class includes methods to:
1. **Fit the Model**: Store the training data for distance-based calculations.
   
2. **Predict**: Find the nearest neighbors for a given input and assign a label based on majority voting (`predict`, for classification) or the average value (`predict_regression`, for regression).

Evaluation is not part of the class: accuracy (for classification) or Mean Squared Error (for regression) is computed separately from the returned predictions.
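
Accuracy and Mean Squared Error can each be computed from the predictions with a line of NumPy. A minimal sketch: `accuracy` and `mean_squared_error` are illustrative helper names, not methods of the `KNN` class, and the inputs are made up.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))

# Hypothetical labels and predictions.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))        # 3 of 4 correct -> 0.75
print(mean_squared_error([1.0, 2.0], [1.5, 2.5]))  # (0.25 + 0.25) / 2 = 0.25
```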



```python
import numpy as np

# Define the K-Nearest Neighbors (KNN) class
class KNN:
    def __init__(self, k=3, distance_metric='euclidean'):
        """
        Initialize the KNN algorithm with the specified number of neighbors (k)
        and the distance metric (either 'euclidean' or 'manhattan').

        Parameters:
        k (int): The number of nearest neighbors to consider for classification or regression.
        distance_metric (str): The type of distance metric to use ('euclidean' or 'manhattan').
        """
        self.k = k  # Number of neighbors to use
        self.distance_metric = distance_metric  # 'euclidean' or 'manhattan'

    def fit(self, X, y):
        """
        Store the training data and targets for later use in predictions.

        Parameters:
        X (array-like): The feature matrix (training data).
        y (array-like): The targets (class labels or continuous values).
        """
        # Convert to arrays so plain Python lists also support broadcasting and indexing
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)

    def predict(self, X):
        """
        Predict the class labels for a set of test samples.

        Parameters:
        X (array-like): The feature matrix of test samples to classify.

        Returns:
        np.array: The predicted labels for each test sample.
        """
        predictions = [self._predict(x) for x in np.asarray(X)]  # Predict each sample in turn
        return np.array(predictions)

    def _predict(self, x):
        """
        Predict the label for a single test sample 'x'.

        Parameters:
        x (array-like): A single test sample whose label needs to be predicted.

        Returns:
        The most frequent label among the k nearest neighbors.
        """
        distances = self._compute_distances(x)  # Distances from 'x' to all training samples
        # Stable sort keeps equidistant neighbors in training order, so ties are deterministic
        k_indices = np.argsort(distances, kind='stable')[:self.k]
        k_nearest_labels = self.y_train[k_indices]  # Labels of the k nearest neighbors
        # np.unique counts labels of any type (strings, negative integers, ...)
        labels, counts = np.unique(k_nearest_labels, return_counts=True)
        return labels[np.argmax(counts)]  # Most frequent label; smallest label wins a tied vote

    def _compute_distances(self, x):
        """
        Compute the distance between the test sample 'x' and all training samples
        using the selected distance metric ('euclidean' or 'manhattan').

        Parameters:
        x (array-like): A test sample for which the distance to each training sample is computed.

        Returns:
        np.array: An array of distances from 'x' to all training samples.
        """
        if self.distance_metric == 'euclidean':
            # Euclidean distance: sqrt(sum((x_train_i - x_i)^2))
            distances = np.sqrt(np.sum((self.X_train - x) ** 2, axis=1))
        elif self.distance_metric == 'manhattan':
            # Manhattan distance: sum(|x_train_i - x_i|)
            distances = np.sum(np.abs(self.X_train - x), axis=1)
        else:
            # Reject any metric other than the two supported ones
            raise ValueError("Unsupported distance metric")
        return distances

    def predict_regression(self, X):
        """
        Predict continuous values (regression) for a set of test samples.

        Parameters:
        X (array-like): The feature matrix of test samples to predict.

        Returns:
        np.array: The predicted continuous values for each test sample.
        """
        predictions = [self._predict_regression(x) for x in np.asarray(X)]
        return np.array(predictions)

    def _predict_regression(self, x):
        """
        Predict the continuous value for a single test sample 'x' in a regression task.

        Parameters:
        x (array-like): A single test sample whose value needs to be predicted.

        Returns:
        float: The predicted continuous value for the test sample.
        """
        distances = self._compute_distances(x)  # Distances to all training samples
        k_indices = np.argsort(distances, kind='stable')[:self.k]  # Indices of the k closest samples
        k_nearest_values = self.y_train[k_indices]  # Target values of the k nearest neighbors
        return np.mean(k_nearest_values)  # Average of the k nearest values
```

```python
# Sample training data (2 features) and labels for classification
X_train = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8], [8, 9]])
y_train_classification = np.array([0, 0, 0, 1, 1, 1])  # Binary classification labels
y_train_regression = np.array([1.5, 2.5, 3.5, 6.5, 7.5, 8.5])  # Regression values

# Sample test data
X_test = np.array([[4, 5], [5, 6]])
```

```python
# 1. Classification Prediction using Euclidean distance
knn_classifier_euclidean = KNN(k=3, distance_metric='euclidean')
knn_classifier_euclidean.fit(X_train, y_train_classification)
predictions_classification_euclidean = knn_classifier_euclidean.predict(X_test)
print("Classification Predictions (Euclidean):", predictions_classification_euclidean)
```

```
Classification Predictions (Euclidean): [0 1]
```


```python
# 2. Regression Prediction using Euclidean distance
knn_regressor_euclidean = KNN(k=3, distance_metric='euclidean')
knn_regressor_euclidean.fit(X_train, y_train_regression)
predictions_regression_euclidean = knn_regressor_euclidean.predict_regression(X_test)
print("Regression Predictions (Euclidean):", predictions_regression_euclidean)
```

```
Regression Predictions (Euclidean): [4.16666667 5.83333333]
```


```python
# 3. Classification Prediction using Manhattan distance
knn_classifier_manhattan = KNN(k=3, distance_metric='manhattan')
knn_classifier_manhattan.fit(X_train, y_train_classification)
predictions_classification_manhattan = knn_classifier_manhattan.predict(X_test)
print("Classification Predictions (Manhattan):", predictions_classification_manhattan)
```

```
Classification Predictions (Manhattan): [0 1]
```


```python
# 4. Regression Prediction using Manhattan distance
knn_regressor_manhattan = KNN(k=3, distance_metric='manhattan')
knn_regressor_manhattan.fit(X_train, y_train_regression)
predictions_regression_manhattan = knn_regressor_manhattan.predict_regression(X_test)
print("Regression Predictions (Manhattan):", predictions_regression_manhattan)
```

```
Regression Predictions (Manhattan): [4.16666667 5.83333333]
```


## When to Use KNN 📈

K-Nearest Neighbors is a versatile algorithm that can be applied in various scenarios. Here are some situations where KNN is particularly effective:

- **Classification Problems**: KNN is primarily used for classification tasks, where the objective is to categorize data points based on their features. It works well in scenarios with clear class boundaries.

- **Regression Problems**: While less common, KNN can also be used for regression tasks, predicting continuous values by averaging the outcomes of the K nearest neighbors.

- **Small to Medium-Sized Datasets**: KNN performs best with smaller datasets because every prediction computes the distance to every training point. With larger datasets, this computation becomes expensive.

- **Low-Dimensional Data**: KNN is sensitive to the curse of dimensionality. It works well when the data has few features, because in high-dimensional spaces the distances between points become nearly uniform and carry less information.

- **Non-Linear Decision Boundaries**: KNN does not assume any specific distribution of data, making it suitable for problems where decision boundaries are non-linear.
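
To make the cost noted under small to medium-sized datasets concrete, here is a sketch that computes all test-to-train distances at once with NumPy broadcasting. The array sizes are hypothetical; the intermediate array has one entry per (test point, training point, feature) triple, which is where the time and memory go as the data grows.

```python
import numpy as np

# Hypothetical sizes: 1000 training points, 10 queries, 5 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
X_test = rng.normal(size=(10, 5))

# Broadcasting builds an array of shape (n_test, n_train, n_features).
diffs = X_test[:, None, :] - X_train[None, :, :]
distances = np.sqrt(np.sum(diffs ** 2, axis=2))  # shape (n_test, n_train)

# Indices of the k nearest training points for each query.
k = 3
nearest = np.argsort(distances, axis=1)[:, :k]   # shape (n_test, k)

print(distances.shape, nearest.shape)  # (10, 1000) (10, 3)
```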

## Pros of KNN ✅

- **Simple to Understand and Implement**: KNN is easy to comprehend and implement, making it a good choice for beginners in machine learning.

- **No Training Phase**: KNN is a lazy learner: fitting only stores the training data, and all computation happens at prediction time. This can be advantageous when the training data changes frequently.

- **Versatile**: KNN can be used for both classification and regression tasks.

- **Naturally Handles Multi-Class Problems**: KNN can easily classify data into multiple categories without requiring additional adjustments.

## Cons of KNN ❌

- **Computationally Intensive**: A brute-force implementation like this one calculates the distance between the query instance and every training sample, which makes prediction slow on large datasets.

- **Sensitive to Irrelevant Features**: The presence of irrelevant features can negatively impact the performance of KNN. Feature selection or dimensionality reduction may be needed.

- **Curse of Dimensionality**: As the number of dimensions increases, the distance between points becomes less meaningful. This can reduce the effectiveness of KNN in high-dimensional spaces.

- **Choosing the Right K**: The performance of KNN is highly dependent on the choice of K (the number of neighbors). A good value is usually found empirically, for example with cross-validation.

- **Memory Intensive**: KNN stores the entire training dataset, which can be a problem for memory-constrained environments.
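
The curse of dimensionality listed above can be observed directly. The sketch below draws uniform random points and measures how much farther the farthest point is than the nearest one, relative to the nearest distance; `relative_contrast` is a helper name introduced here for illustration, and the point counts and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_features, n_points=500):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    distances = np.sqrt(np.sum((points - query) ** 2, axis=1))
    return (distances.max() - distances.min()) / distances.min()

# The contrast shrinks as the number of features grows.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

When the contrast approaches zero, the "nearest" neighbors are barely closer than any other point, so their votes carry little information.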

## Conclusion 🎯

K-Nearest Neighbors is a powerful algorithm suitable for various applications, especially in scenarios with smaller, low-dimensional datasets. Understanding its strengths and weaknesses is crucial for effectively utilizing KNN in real-world problems.