# K-Nearest Neighbors (KNN) In Depth

![alt text](../Images/ml/knn.png)

- Supervised learning algorithm
- Used for both regression and classification problems
- Requires feature scaling, because predictions depend on distances between points
- Can be used for text classification
- Sensitive to outliers
- On an imbalanced dataset, KNN becomes biased towards the majority class in the training data
- KNN works well on small datasets, makes a good baseline, and is easy to explain
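
To see why scaling matters, here is a minimal sketch using NumPy. The two features (age in years, income in dollars) and all values are made up for illustration. On the raw data the income column dominates the Euclidean distance; after standardizing each feature, the nearest neighbor changes.

```python
import numpy as np

# Hypothetical training data: columns are [age in years, income in dollars].
X = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],
    [27.0, 58_000.0],
])
query = np.array([26.0, 51_500.0])


def nearest_index(points, q):
    """Return the index of the row in `points` closest to `q` (Euclidean)."""
    distances = np.linalg.norm(points - q, axis=1)
    return int(np.argmin(distances))


# Without scaling, income differences (thousands) swamp age differences (tens).
raw_nearest = nearest_index(X, query)

# Standardize each feature to zero mean and unit variance, using
# statistics computed from the training rows only.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std
query_scaled = (query - mean) / std

scaled_nearest = nearest_index(X_scaled, query_scaled)

print("nearest (raw):   ", raw_nearest)     # row 1: the 60-year-old
print("nearest (scaled):", scaled_nearest)  # row 0: the 25-year-old
```

The query is applied the same mean and standard deviation as the training rows, so both live in the same scaled space.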

## Steps In KNN

1. Choose the number of neighbors, K
2. Calculate the distance from the new point to every point in the training set
3. Take the K nearest neighbors and count how many of them fall in each category
4. Assign the new point to the category with the most neighbors among those K (classification)

`Note`: For a regression problem, the predicted value is the average of the target values of the K nearest neighbors.
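
The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are hypothetical, Euclidean distance is assumed, and ties in the vote are broken by whichever label was seen first.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_nearest(X_train, y_train, query, k):
    """Return the targets of the k training points closest to `query`."""
    # Step 2: distance from the query to every training point.
    pairs = [(euclidean(x, query), target) for x, target in zip(X_train, y_train)]
    # Step 3: keep only the k closest points.
    pairs.sort(key=lambda pair: pair[0])
    return [target for _, target in pairs[:k]]


def knn_classify(X_train, y_train, query, k=3):
    """Step 4: majority vote among the k nearest neighbors."""
    votes = Counter(k_nearest(X_train, y_train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(X_train, y_train, query, k=3):
    """Regression variant: average the targets of the k nearest neighbors."""
    neighbors = k_nearest(X_train, y_train, query, k)
    return sum(neighbors) / len(neighbors)


# Toy data (made up): two well-separated groups of 2D points.
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 5.0), (7.0, 6.5), (6.5, 6.0)]
y_class = ["red", "red", "red", "blue", "blue", "blue"]
y_value = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_class = knn_classify(X_train, y_class, (2.0, 2.0), k=3)
predicted_value = knn_regress(X_train, y_value, (2.0, 2.0), k=3)
print(predicted_class)  # red
print(predicted_value)  # 2.0
```

Every prediction scans the whole training set, which is why KNN gets slow as the dataset grows.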

![alt text](../Images/ml/knn1.jpg)

- The picture above shows two classes:
    - Male (Red)
    - Female (Blue)
- Green is the new point
- K=3 in this example
- After finding the nearest neighbors, we see that the three closest points are two red and one blue
- We assign the green point to the red class because most of its nearest neighbors are red

### **Key components:**

- **K**: Number of nearest neighbors to consider. It's a hyperparameter that needs to be chosen beforehand
- **Distance metric**: Commonly Euclidean distance, but other metrics like Manhattan distance, cosine similarity, etc., can also be used
- **Decision rule**: Majority voting for classification, averaging for regression
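
In scikit-learn's `KNeighborsClassifier`, these three components correspond to the constructor arguments `n_neighbors`, `metric`, and `weights`. The sketch below uses made-up 2D data and picks Manhattan distance only to show that the metric is configurable.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up): class 0 near the origin, class 1 further out.
X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

model = KNeighborsClassifier(
    n_neighbors=3,       # K: how many neighbors take part in the vote
    metric="manhattan",  # distance metric used to find the neighbors
    weights="uniform",   # decision rule: every neighbor gets one equal vote
)
model.fit(X, y)

prediction = model.predict([[1, 2]])
print(prediction)  # [0]
```

Setting `weights="distance"` instead makes closer neighbors count more than distant ones, which is a variation on plain majority voting.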

### Commonly Used Distance Metrics In KNN:

1. **Euclidean Distance:** This is the most common distance metric used in KNN. It calculates the straight-line distance between two points in Euclidean space. For two points *p* and *q* in an *n*-dimensional space, Euclidean distance is given by:

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

- d(p, q) represents the Euclidean distance between points p and q.
- Σ (sigma) represents the sum over all dimensions (i = 1 to n).
- p_i and q_i represent the corresponding values in the i-th dimension for points p and q, respectively.

2. **Manhattan Distance (City Block or Taxicab Distance)**: It calculates the distance between two points as the sum of the absolute differences of their coordinates, as if moving along a grid. For two points *p* and *q* in an *n*-dimensional space, Manhattan distance is given by:
    
    $$
    d(p, q) = \sum_{i=1}^{n} |p_i - q_i|
    $$

    For points p = (x₁, y₁) and q = (x₂, y₂) in 2D space, this becomes:

    **d(p, q) = |x₁ - x₂| + |y₁ - y₂|**

- d(p, q): represents the Manhattan distance between points p and q
- | |: represents the absolute value function, which keeps each term non-negative
- x₁ and x₂: represent the x-coordinates of points p and q, respectively
- y₁ and y₂: represent the y-coordinates of points p and q, respectively
    
3. **Minkowski Distance**: This is a generalized form of both Euclidean and Manhattan distances. It is defined as:

$$
d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^r \right)^{\frac{1}{r}}
$$

- d(p, q): represents the Minkowski distance between points p and q
- Σ (sigma): represents the sum over all dimensions (i = 1 to n)
- p_i and q_i: represent the corresponding values in the i-th dimension for points p and q, respectively
- | |: represents the absolute value function
- *r:* is a parameter (with *r* ≥ 1). When *r*=1, it reduces to Manhattan distance, and when *r*=2, it reduces to Euclidean distance


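4. **Cosine Similarity**: While not a distance metric in the strictest sense, cosine similarity is often used in KNN for text data or high-dimensional data. It measures the cosine of the angle between two vectors and ranges from -1 (pointing in opposite directions) to 1 (pointing in the same direction), regardless of the vectors' lengths. To use it as a distance, it is usually converted to cosine distance, 1 - similarity. It is given by: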

$$
\text{similarity}(p, q) = \frac{p \cdot q}{\|p\| \|q\|}
$$

- *p*⋅*q*: This represents the dot product of vectors *p* and *q*
- ∥*p*∥: This represents the Euclidean norm (or length) of vector *p*
- ∥*q*∥: This represents the Euclidean norm (or length) of vector *q*
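
The four formulas above translate almost directly into plain Python. The following is a minimal sketch with hypothetical function names and made-up points. It also checks that Minkowski distance reproduces Manhattan distance at r=1 and Euclidean distance at r=2.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def minkowski(p, q, r):
    """Generalized distance; r=1 gives Manhattan, r=2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


def cosine_similarity(p, q):
    """Cosine of the angle between p and q (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Made-up example points; the coordinate differences are 3 and 4.
p = (1.0, 2.0)
q = (4.0, 6.0)

print(euclidean(p, q))          # 5.0
print(manhattan(p, q))          # 7.0
print(minkowski(p, q, r=1))     # same value as manhattan(p, q)
print(minkowski(p, q, r=2))     # same value as euclidean(p, q)
print(cosine_similarity(p, q))  # close to 1: the vectors point in similar directions
```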

### **Pros:**

- Simple to understand and implement
- No explicit training phase (lazy learner): new training data can be added without refitting a model
- Can handle multi-class cases effectively
- Non-parametric, which means it makes no assumption about the underlying distribution of the data
- Reasonably robust to noisy training data when K is large enough, and effective for datasets with few dimensions

### **Cons:**

- Computationally expensive at prediction time, especially with large datasets, because each prediction requires computing the distance from the query point to every training point
- Sensitive to the choice of K and the distance metric
- Not suitable for high-dimensional data due to the curse of dimensionality (distances become less informative, so computational cost rises and accuracy drops)
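
Because results are sensitive to K, one common way to choose it is to compare several candidates with cross-validation. The sketch below is only an illustration: the Iris dataset, the candidate values of K, the 5-fold split, and the `StandardScaler` step are all assumptions made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example dataset bundled with scikit-learn (an assumption for this sketch).
X, y = load_iris(return_X_y=True)

scores_by_k = {}
for k in (1, 3, 5, 7, 9, 11):
    # Scaling lives inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of 5 folds
    scores_by_k[k] = scores.mean()

best_k = max(scores_by_k, key=scores_by_k.get)

for k, score in scores_by_k.items():
    print(f"k={k:2d}  mean accuracy={score:.3f}")
print("best k:", best_k)
```

Odd values of K are a common choice for two-class problems because they avoid tied votes.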

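```python
# KNN classification problem
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D training set: points 0 and 1 belong to class 0, points 2 and 3 to class 1
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)

print(neigh.predict([[1.1]]))        # predicted class for x = 1.1
print(neigh.predict_proba([[0.9]]))  # share of the 3 neighbors in each class for x = 0.9
```

Output:

```
[0]
[[0.66666667 0.33333333]]
```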


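```python
# KNN regression problem
from sklearn.neighbors import KNeighborsRegressor

# Same toy 1-D inputs, now with y treated as a numeric target
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

neigh = KNeighborsRegressor(n_neighbors=2)
neigh.fit(X, y)

print(neigh.predict([[1.5]]))  # mean target of the 2 nearest points (x = 1 and x = 2)
```

Output:

```
[0.5]
```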
