## Introduction to KNN

**K-Nearest Neighbors (KNN)** is a **supervised learning** algorithm used for both **classification** and **regression**. It is simple yet powerful: to predict the output for a new point, it looks at the **K nearest points** in the training data.

### 🔹 How it works:
- When a new data point needs to be predicted, the algorithm **calculates the distance** between it and all points in the training dataset.
- It then selects the **K nearest neighbors**, i.e. the K training points with the smallest distances.
- Finally, it assigns the **most common label** among those neighbors (for classification) or **computes the average value** of their targets (for regression).
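The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming Euclidean distance and a plain majority vote; the function name `knn_predict_label` and the toy data are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict_label(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Step 1: distance from the query to every training point (Euclidean here)
    distances = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: most common label among those neighbors
    labels = [y_train[i] for i in nearest]
    return Counter(labels).most_common(1)[0][0]

# Toy data (made up for illustration): two small clusters
X_toy = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_toy = ["a", "a", "a", "b", "b", "b"]
print(knn_predict_label(X_toy, y_toy, [2, 2], k=3))  # a
```

For regression, the last step would average the neighbors' target values instead of counting labels.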

### ✅ Advantages of KNN:
- **Easy to implement**: The algorithm is relatively simple to understand and implement.
- **Effective for basic recognition tasks**: KNN performs well in **classification** and **pattern recognition** problems.
- **Flexible in choosing features/distance metrics**: Users can customize **features** and select a suitable **distance metric** for the data.
- **Handles multi-class classification well**: KNN can effectively manage **multi-class problems**.
- **Benefits from large training data**: Accuracy tends to improve when a **sufficiently large dataset** is available, although every prediction then has to compare the query against more training points.

### ⚠️ Disadvantages of KNN:
- **Difficulty in choosing the optimal K value**: Selecting the best K can be challenging.
- **Changing K can alter classification results**: The outcome of classification or regression may vary significantly based on K's value.
- **Performance degrades with high-dimensional data**: As the number of dimensions grows, the distances to the nearest and farthest neighbors become almost the same, so "nearest" carries less information and accuracy declines.
- **Sensitive to imbalanced data distribution**: The algorithm may be affected by **uneven class distributions**, leading to biased predictions.
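The high-dimensional effect mentioned above can be observed numerically. The sketch below is an illustration under simple assumptions (points drawn uniformly from the unit hypercube, Euclidean distance); `relative_contrast` is a hypothetical helper that measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# The contrast shrinks as the number of dimensions grows
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(n_dims), 3))
```

When the printed ratio gets close to zero, the nearest and farthest points are almost equally far away, which is why neighbor-based predictions lose reliability in that regime.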


In [39]:
# library
import numpy as np
from collections import Counter

## Distance Metrics

### 1. *Manhattan Distance*
$\text{Manhattan Distance} = \sum_{i=1}^n |x_i - y_i|$

### 2. *Minkowski Distance*
$\text{Minkowski Distance} = \left(\sum_{i=1}^n |x_i - y_i|^p\right)^{\frac{1}{p}}$

### 3. *Euclidean Distance*
$\text{Euclidean Distance} = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$

### 4. *Chebyshev Distance (Maximum Norm)*
$\text{Chebyshev Distance} = \max_i |x_i - y_i|$

### 5. *Cosine Similarity*
$\cos \theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$

# Guide to Choosing a Distance Metric for KNN

## 1. *Manhattan Distance*
**When to use:**
- When features do not have a linear relationship.
- When data is grid-structured or discrete.
- When the data contains many outliers (Manhattan is less sensitive to them than Euclidean).

**Examples:**
- Robot navigation along rows and columns.
- Traffic routing in a city with a grid layout.
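The difference in outlier sensitivity can be illustrated with three made-up points: one neighbor differs a little in every feature, the other differs a lot in a single feature. Because Euclidean distance squares each difference, the single large deviation weighs more heavily there. The helper names below are placeholders for this sketch.

```python
import numpy as np

query = np.array([0.0, 0.0, 0.0, 0.0])
spread = np.array([1.0, 1.0, 1.0, 1.0])  # small difference in every feature
spiky = np.array([0.0, 0.0, 0.0, 3.5])   # one large difference in a single feature

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Manhattan: the "spiky" point is the nearer one
print(manhattan(query, spread), manhattan(query, spiky))  # 4.0 3.5
# Euclidean: the ranking flips and the "spread" point is nearer
print(euclidean(query, spread), euclidean(query, spiky))  # 2.0 3.5
```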

---

## 2. *Minkowski Distance*
**When to use:**
- When you want to control the influence of distance with the parameter `p`.
- It is a generalization of Manhattan (`p = 1`) and Euclidean (`p = 2`).
- You can experiment with larger `p` values to put more weight on the dimensions with the largest differences; as `p` grows, Minkowski approaches the Chebyshev distance.
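A short sketch of how `p` changes the result, using two made-up vectors. With `p = 1` it matches Manhattan, with `p = 2` it matches Euclidean, and for large `p` the value moves toward the largest single coordinate difference (the Chebyshev distance). The function name `minkowski` is just for this example.

```python
import numpy as np

def minkowski(a, b, p):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = [0, 0, 0]
b = [1, 2, 4]  # coordinate differences: 1, 2, 4

for p in (1, 2, 4, 16, 64):
    print(p, round(minkowski(a, b, p), 4))

# Largest single difference, for comparison with the large-p values
print("Chebyshev:", np.max(np.abs(np.subtract(a, b))))
```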

---

## 3. *Euclidean Distance*
**When to use:**
- When features are continuous and already normalized or scaled evenly.
- When data lies in real coordinate space (e.g., physical distance on a map).

**Note:**  
Euclidean distance is very popular but sensitive to outliers and differences in units among features.
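To see why differences in units matter, the sketch below compares the nearest neighbor before and after z-score standardization. The data and the helper `nearest_index` are made up for illustration; standardization is one common choice of scaling, not the only one.

```python
import numpy as np

# Made-up rows of [age in years, income in dollars]
X = np.array([[25, 40000], [35, 60000], [45, 80000], [50, 30000]], dtype=float)
query = np.array([48, 41000], dtype=float)

def nearest_index(X, query):
    """Index of the row in X with the smallest Euclidean distance to query."""
    return int(np.argmin(np.linalg.norm(X - query, axis=1)))

# Raw features: the income column dominates the distance
print(nearest_index(X, query))  # 0

# Z-score standardization using the statistics of the training rows only
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std
query_scaled = (query - mean) / std

# After scaling, age contributes on a comparable scale
print(nearest_index(X_scaled, query_scaled))  # 3
```

On the raw data, the 25-year-old with a similar income is "nearest" even though the ages differ by 23 years; after scaling, the 50-year-old becomes the nearest neighbor.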

---

## 4. *Chebyshev Distance*
**When to use:**
- When you want the distance to be determined by the dimension with the greatest difference.
- When you're interested in the maximum deviation among features.

**Examples:**
- King's movement in chess: the king can move one step in any direction, so the number of moves between two squares equals their Chebyshev distance.
- Tasks involving evaluation of the largest deviation threshold between features.
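The chess example can be written out directly. In this small sketch, squares are given as hypothetical `(file, rank)` integer pairs and `king_moves` is an illustrative name.

```python
def king_moves(start, end):
    """Minimum number of king moves between two squares given as (file, rank) pairs."""
    # Chebyshev distance: the larger of the two coordinate differences
    return max(abs(start[0] - end[0]), abs(start[1] - end[1]))

# 3 files and 5 ranks apart: diagonal steps cover both at once, so 5 moves suffice
print(king_moves((0, 0), (3, 5)))  # 5
```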

---

## 5. *Cosine Similarity*
**When to use:**
- When you want to compare the **direction** of vectors rather than their magnitude.
- When working with high-dimensional sparse data, like text (TF-IDF, Bag-of-Words).

**Examples:**
- Document classification, finding similar texts.
- Sentiment analysis from customer reviews.
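The sketch below shows what "direction rather than magnitude" means in practice, using made-up word-count vectors. A document and a ten-times-longer document with the same word proportions get a similarity of 1, even though their Euclidean distance is large. The name `cosine_sim` is chosen for this example only.

```python
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

short_doc = [2, 1, 0, 0]   # hypothetical word counts
long_doc = [20, 10, 0, 0]  # same proportions, ten times longer
other_doc = [0, 0, 3, 1]   # shares no words with the first two

print(round(cosine_sim(short_doc, long_doc), 6))   # 1.0
print(round(cosine_sim(short_doc, other_doc), 6))  # 0.0

# Euclidean distance, by contrast, treats the short and long document as far apart
print(np.linalg.norm(np.subtract(short_doc, long_doc)))
```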

---

## Summary: Suggested Metric Selection

| *Distance Metric* | *Suitable Data*                     | *Quick Notes*                           |
|-------------------|--------------------------------------|-----------------------------------------|
| *Manhattan*        | Discrete data, with outliers         | Robust to noise                         |
| *Euclidean*        | Continuous, normalized data          | Popular, sensitive to scale/outliers    |
| *Minkowski*        | Flexible, tunable with `p`           | General case, try various `p` values    |
| *Chebyshev*        | Compare the largest difference       | Dominated by the largest error          |
| *Cosine*           | Text data, high-dimensional          | Compares direction, ignores magnitude   |


In [40]:
def Manhanttan_Distance(x, y):
    """
    Calculate the Manhattan distance between two vectors.

    Parameters:
    x (list or np.array): First vector
    y (list or np.array): Second vector

    Returns:
    number: Manhattan distance value
    """
    # Convert to arrays so that plain Python lists are supported too
    x, y = np.asarray(x), np.asarray(y)
    if len(x) != len(y):
        raise ValueError("Vectors must be of the same length")
    return np.sum(np.abs(x - y))

In [41]:
def Minkowski_Euclidean_distance(x, y, p):
    """
    Calculate the Minkowski distance between two vectors.

    Parameters:
    x (list or np.array): First vector
    y (list or np.array): Second vector
    p (int or float): Order of the norm, p >= 1 (p=1 gives Manhattan, p=2 gives Euclidean)

    Returns:
    float: distance between x and y
    """
    # Convert to arrays so that plain Python lists are supported too
    x, y = np.asarray(x), np.asarray(y)
    if len(x) != len(y):
        raise ValueError("Vectors must be of the same length")
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

In [42]:
def chebyshev_distance(x, y):
    """
    Calculate the Chebyshev Distance (Maximum Norm) between two vectors.
    
    Parameters:
    x (list or np.array): First vector
    y (list or np.array): Second vector
    
    Returns:
    float: Chebyshev distance value
    """
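    # Convert to arrays so that plain Python lists are supported too
    x, y = np.asarray(x), np.asarray(y)
    if len(x) != len(y):
        raise ValueError("Vectors must have the same length")
    return np.max(np.abs(x - y))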

In [43]:
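def dot_product(x, y):
    """Dot product of two vectors (lists or arrays)."""
    return np.dot(x, y)

def vector_norm(v):
    """Euclidean (L2) norm of a vector."""
    return np.sqrt(np.sum(np.asarray(v) ** 2))

def cosine_similarity(a, b):
    """
    Calculate the Cosine Similarity between two vectors.

    Parameters:
    a (list or np.array): First vector
    b (list or np.array): Second vector

    Returns:
    float: Cosine similarity value (ranges from -1 to 1)
    """
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    norm_product = vector_norm(a) * vector_norm(b)
    # The angle is undefined when either vector has zero length
    if norm_product == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot_product(a, b) / norm_product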

In [44]:

# Test case
x_test = np.array([2, 3, 4])
y_test = np.array([5, 6, 7])

# Compare all metrics on the same pair of vectors
print("Manhattan Distance:", Manhanttan_Distance(x_test, y_test))
print("Minkowski Distance (p=1):", Minkowski_Euclidean_distance(x_test, y_test, 1))  # same as Manhattan
print("Minkowski Distance (p=2):", Minkowski_Euclidean_distance(x_test, y_test, 2))  # same as Euclidean
print("Chebyshev Distance:", chebyshev_distance(x_test, y_test))
print("Cosine Similarity:", round(cosine_similarity(x_test, y_test), 4))

Manhattan Distance: 9
Minkowski Distance (p=1): 9.0
Minkowski Distance (p=2): 5.196152422706632
Chebyshev Distance: 3
Cosine Similarity: 0.9915


### When to Use Distance Weighting in KNN

### ✅ Recommended Use Cases:

### 1. Noisy or Imbalanced Data
- If data is **overlapping** or contains **noise**, weighted KNN helps **reduce misclassification** by giving more importance to closer points.
- **Example**: Classifying messy real-world **customer behavior data**.
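A small sketch of how distance weighting can change a classification. The three neighbors and their distances are made up: one very close neighbor of class `B` and two distant neighbors of class `A`. The `1 / distance` weighting with a small constant to avoid division by zero is one common choice, assumed here for illustration.

```python
import numpy as np
from collections import Counter

# Hypothetical 3 nearest neighbors of a query: labels and distances
labels = np.array(["B", "A", "A"])
distances = np.array([0.1, 2.0, 2.5])

# Uniform vote: each neighbor counts once
uniform_winner = Counter(labels.tolist()).most_common(1)[0][0]

# Distance-weighted vote: each neighbor contributes 1 / distance
weights = 1.0 / (distances + 1e-10)
scores = {}
for label, w in zip(labels.tolist(), weights):
    scores[label] = scores.get(label, 0.0) + w
weighted_winner = max(scores, key=scores.get)

print(uniform_winner, weighted_winner)  # A B
```

The uniform vote follows the two distant `A` points, while the weighted vote follows the single point that sits almost on top of the query.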

### 2. Classes or Regression Targets Vary by Region
- In some cases, **local patterns** might differ across data regions. Distance weighting ensures predictions focus on the **most relevant neighbors**.
- **Example**: Predicting **property prices** in different districts.

### 3. Regression Problems
- Distance weighting gives **smoother** predictions: the output changes gradually as the query moves between neighbors, instead of jumping abruptly when the set of K neighbors changes.
- **Example**: **Temperature forecasting** based on nearby sensors.
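For regression, the weighted prediction is a weighted average of the neighbors' values. The sensor readings and distances below are made up, and `1 / distance` weights are assumed.

```python
import numpy as np

# Hypothetical readings from the 3 nearest sensors and their distances to the query
temps = np.array([20.0, 24.0, 30.0])
dists = np.array([1.0, 2.0, 4.0])

# Unweighted KNN: plain average of the neighbors
uniform_pred = temps.mean()

# Weighted KNN: closer sensors count more (weights 1, 0.5, 0.25)
weights = 1.0 / dists
weighted_pred = np.dot(weights, temps) / weights.sum()

print(round(uniform_pred, 2), round(weighted_pred, 2))  # 24.67 22.57
```

The weighted estimate is pulled toward the closest sensor's reading of 20.0, since that sensor carries more than half of the total weight.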

### 4. Production-Grade Systems
- When deploying models in production, **distance-weighted KNN** combined with feature normalization can make predictions more **robust**.
- **Example**: **Fraud detection** where transaction patterns vary widely.

---

### ❌ When Unweighted KNN Might Be Enough:

### 1. Data is Clean, Small & Well-Separated
- If classes are **clearly distinct** and there's **no significant noise**, simple nearest-neighbor classification is **sufficient**.
- **Example**: Well-labeled **medical test results** with minimal overlap.

### 2. Low Complexity, Fast Execution Needed
- Weighted KNN requires **extra computation**, so if **speed is more important** than accuracy, unweighted KNN may be **preferred**.
- **Example**: Quick **approximate searches** in large databases.


# Build the Classifier and Regressor Classes

In [45]:
import numpy as np
from collections import Counter

class CustomKNNClassifier:
    def __init__(self, n_neighbors=3, distance_metric='euclidean', weights='uniform', p=None):
        allowed_metrics = ['manhattan', 'minkowski', 'euclidean', 'chebyshev', 'cosine']
        if distance_metric not in allowed_metrics:
            raise ValueError(f"Unsupported distance_metric. Choose from {allowed_metrics}")
        if weights not in ['uniform', 'distance']:
            raise ValueError("weights must be 'uniform' or 'distance'")
        if distance_metric != 'minkowski' and p is not None:
            print("Warning: 'p' is only used with Minkowski distance. It will be ignored.")
        
        self.k = n_neighbors
        self.metric = distance_metric
        self.weights = weights
        self.p = p if p is not None else 2 if distance_metric == 'minkowski' else None

    def fit(self, X, y):
        self.X_train = np.array(X)
        self.y_train = np.array(y)

    def _distance(self, x, y):
        x = np.array(x)
        y = np.array(y)

        if self.metric == 'manhattan':
            return np.sum(np.abs(x - y))
        elif self.metric == 'minkowski':
            return np.sum(np.abs(x - y) ** self.p) ** (1 / self.p)
        elif self.metric == 'euclidean':
            return np.sqrt(np.sum((x - y) ** 2))
        elif self.metric == 'chebyshev':
            return np.max(np.abs(x - y))
        elif self.metric == 'cosine':
            dot = np.dot(x, y)
            norm = np.linalg.norm(x) * np.linalg.norm(y)
            return 1 - dot / norm
        else:
            raise ValueError(f"Unsupported distance metric: {self.metric}")

    def predict(self, X_test):
        """Predict a class label for every row of X_test."""
        predictions = []
        for test_point in X_test:
            # Distance from the test point to every training point
            distances = np.array([self._distance(test_point, x_train) for x_train in self.X_train])
            # Indices, labels and distances of the k nearest training points
            k_indices = np.argsort(distances)[:self.k]
            k_labels = self.y_train[k_indices]
            k_distances = distances[k_indices]

            if self.weights == 'uniform':
                # Majority vote: every neighbor counts once
                most_common = Counter(k_labels).most_common(1)[0][0]
                predictions.append(most_common)
            elif self.weights == 'distance':
                # Weighted vote: closer neighbors count more (small constant avoids division by zero)
                weights = 1 / (k_distances + 1e-10)
                class_weights = {}
                for label, weight in zip(k_labels, weights):
                    class_weights[label] = class_weights.get(label, 0) + weight
                predicted_class = max(class_weights.items(), key=lambda x: x[1])[0]
                predictions.append(predicted_class)

        return np.array(predictions)


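class CustomKNNRegressor(CustomKNNClassifier):
    """KNN regressor; reuses the constructor, fit and _distance of the classifier."""

    def predict(self, X_test):
        """Predict a numeric value for every row of X_test."""
        predictions = []
        for test_point in X_test:
            # Distance from the test point to every training point
            distances = np.array([self._distance(test_point, x_train) for x_train in self.X_train])
            # Indices, target values and distances of the k nearest training points
            k_indices = np.argsort(distances)[:self.k]
            k_values = self.y_train[k_indices]
            k_distances = distances[k_indices]

            if self.weights == 'uniform':
                # Plain average of the neighbors' values
                prediction = np.mean(k_values)
            elif self.weights == 'distance':
                # Weighted average: closer neighbors count more
                weights = 1 / (k_distances + 1e-10)
                prediction = np.dot(weights, k_values) / np.sum(weights)

            predictions.append(prediction)
        return np.array(predictions)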


In [46]:
import numpy as np
import pandas as pd

# Example dataset: each row is [Age, Income, LoanAmount, Class, CreditScore]
data = [
    [25, 40000, 5000, 0, 650],
    [35, 60000, 12000, 2, 720],
    [45, 80000, 20000, 2, 740],
    [22, 25000, 3000, 2, 610],
    [33, 120000, 15000, 0, 770],
    [50, 30000, 10000, 0, 680],
    [28, 95000, 8000, 0, 690],
    [40, 62000, 11000, 1, 705],
    [60, 100000, 25000, 0, 750],
    [48, 220000, 50000, 2, 790]
]

df = pd.DataFrame(data, columns=["Age", "Income", "LoanAmount", "Class", "CreditScore"])



In [47]:
# Split dataset into features and labels
X = df[["Age", "Income", "LoanAmount"]].values
y_class = df["Class"].values
y_regress = df["CreditScore"].values

In [48]:
# Define the query point to predict: [Age, Income, LoanAmount]
X_test = [[30, 50000, 7000]]


In [49]:
# Initialize and fit the CustomKNNRegressor with weights='distance'
knn_reg = CustomKNNRegressor(n_neighbors=3, distance_metric='euclidean', weights='distance')
knn_reg.fit(X, y_regress)
pred_score = knn_reg.predict(X_test)
print("Predicted credit score:", pred_score[0])

Predicted credit score: 689.8004665476948


In [50]:
# Initialize and fit the CustomKNNClassifier with weights='uniform'
knn_clf = CustomKNNClassifier(n_neighbors=3, distance_metric='euclidean', weights='uniform')
knn_clf.fit(X, y_class)
pred_class = knn_clf.predict(X_test)
print("Predicted class:", pred_class[0])


Predicted class: 0
