# K-Nearest Neighbor (KNN) Algorithm for Machine Learning

- **Supervised Learning**: KNN is a simple algorithm based on supervised learning, mainly used for classification (and regression).
- **Similarity-Based**: It classifies a new data point based on the similarity with stored data points in the training set.
- **Non-Parametric**: KNN doesn't make any assumptions about the underlying data distribution.
- **Lazy Learner**: KNN doesn't learn immediately from the training data; it stores the data and classifies new data during prediction.
- **Classification & Regression**: Primarily used for classification tasks, but can also be used for regression.
- **Training Phase**: The algorithm simply stores the dataset, and during prediction, it classifies based on the nearest neighbors.
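
The regression use mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the prediction is the plain average of the targets of the k nearest points; the function name `knn_regress` and the toy data are hypothetical and not part of the original material.

```python
import math

def knn_regress(train_X, train_y, query, k=3):
    """Predict a numeric target as the mean of the k nearest training targets."""
    # Pair each training point's Euclidean distance to the query with its target.
    distances = [(math.dist(x, query), target) for x, target in zip(train_X, train_y)]
    # Sort by distance only and keep the k closest entries.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical toy data: one feature and a numeric target.
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 10.0]

prediction = knn_regress(train_X, train_y, (2.5,), k=2)
print(prediction)  # mean of the targets at x=2.0 and x=3.0
```

Note that nothing is "trained" here: the function only reads the stored data at prediction time, which is what makes KNN a lazy learner.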


**Example:** Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. We can use the KNN algorithm for this identification, as it works on a similarity measure. Our KNN model compares the features of the new image with those of the stored cat and dog images and, based on the most similar features, puts it in either the cat or the dog category.

![k-nearest-neighbor-algorithm-for-machine-learning.png](attachment:k-nearest-neighbor-algorithm-for-machine-learning.png)

# Why Do We Need a K-NN Algorithm?

K-Nearest Neighbor (K-NN) is essential for classifying new data points based on their similarity to known categories. 

- **Example Scenario**: Imagine two categories, A and B, and a new data point, $ x_1 $. The task is to determine whether $ x_1 $ belongs to Category A or B.
- **Solution with K-NN**: K-NN identifies the nearest neighbors of $ x_1 $ within the dataset and assigns $ x_1 $ to the category that holds the majority among those neighbors.

K-NN is particularly useful for such classification tasks due to its simplicity and effectiveness in pattern recognition.

![k-nearest-neighbor-algorithm-for-machine-learning2.png](attachment:k-nearest-neighbor-algorithm-for-machine-learning2.png)

# How Does K-NN Work?

The K-NN algorithm follows a straightforward process to classify new data points:

### Step-by-Step Algorithm:
1. **Select the number $K$** of neighbors.
2. **Calculate the Euclidean distance** from the new data point to every point in the training set.
3. **Identify the K nearest neighbors** based on the calculated Euclidean distance.
4. **Count the data points** in each category among the K nearest neighbors.
5. **Assign the new data point** to the category that has the maximum count among the K neighbors.
6. **Model Ready**: The stored training data together with the chosen $K$ now acts as the classifier and can be applied to any new data point.

K-NN uses this method to predict the category for any new data point based on its closest neighbors.
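
The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation; the function name `knn_classify` and the two-cluster toy data are assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training point.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3: keep the k entries with the smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: count the labels among the neighbors and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: category A near (1, 1), category B near (6, 6).
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(train_X, train_y, (2, 2), k=3))  # A
print(knn_classify(train_X, train_y, (6, 5), k=3))  # B
```

Every call recomputes the distance to all training points, which is why prediction cost grows with the size of the stored dataset.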


![k-nearest-neighbor-algorithm-for-machine-learning3.png](attachment:k-nearest-neighbor-algorithm-for-machine-learning3.png)

# Distance Metrics Used in KNN Algorithm

The K-Nearest Neighbors (KNN) algorithm relies on distance metrics to identify the nearest points or groups for a query point. Here are some common distance metrics used in KNN:

### 1. **Euclidean Distance**
Euclidean distance is the straight-line distance between two points in Euclidean space, whether in a plane or in a higher-dimensional space. It is the length of the line segment connecting the two points being compared. This metric is commonly used to measure the displacement between two states.

**Formula**:
$$
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$
- **Application**: Useful when we care about the shortest path (straight-line distance) between points.
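
As a quick check of the formula, here is a small sketch that computes the Euclidean distance by hand. The function name `euclidean_distance` is hypothetical; Python's standard library offers the same computation as `math.dist`.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d = euclidean_distance((0, 0), (3, 4))
print(d)  # 5.0, the hypotenuse of a 3-4-5 right triangle
```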

### 2. **Manhattan Distance**
Manhattan distance, also known as "L1 distance," is used when the total path or travel distance matters, rather than direct displacement. It is calculated by summing the absolute differences between the coordinates of the two points across all n dimensions.

**Formula**:
$$
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
$$
- **Application**: Suitable for grid-like structures or scenarios where the movement is restricted to certain directions (e.g., city blocks).
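
The same pair of points used for a straight-line measurement gives a longer Manhattan distance, because movement is restricted to the axes. The function name `manhattan_distance` below is a hypothetical helper for illustration.

```python
def manhattan_distance(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

d = manhattan_distance((0, 0), (3, 4))
print(d)  # 7: three blocks along one axis plus four along the other
```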

### 3. **Minkowski Distance**
Minkowski distance is a generalized metric that includes both Euclidean and Manhattan distances as special cases.

**Formula**:
$$
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}
$$
- **Special Cases**:
  - When $ p = 2 $, it is equivalent to the Euclidean distance.
  - When $ p = 1 $, it is equivalent to the Manhattan distance.
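
The two special cases can be checked numerically. This is a small sketch with a hypothetical helper `minkowski_distance`; it assumes $p \geq 1$ and does no input validation.

```python
import math

def minkowski_distance(x, y, p):
    """Generalized distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

point_a, point_b = (0, 0), (3, 4)
manhattan = minkowski_distance(point_a, point_b, p=1)
euclidean = minkowski_distance(point_a, point_b, p=2)

print(manhattan, euclidean)
# The p=2 result agrees with the standard library's Euclidean distance.
print(math.isclose(euclidean, math.dist(point_a, point_b)))
```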

### 4. **Hamming Distance**
Hamming distance compares two equal-length vectors of Boolean or string values position by position. It counts the number of positions at which the corresponding elements differ.

- **Application**: Useful for categorical variables or binary strings.
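
A minimal sketch of this count, assuming both inputs have the same length (the hypothetical helper `hamming_distance` raises an error otherwise):

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("karolin", "kathrin"))        # 3
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```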

### Key Points
- **Euclidean Distance**: Measures straight-line displacement.
- **Manhattan Distance**: Measures path traveled.
- **Minkowski Distance**: Generalizes Euclidean and Manhattan distances.
- **Hamming Distance**: Suitable for categorical or binary data.

Choosing a distance metric that matches the type of data and the problem being solved is an essential part of applying KNN in a machine learning task.


## How to Choose the Value of $k$ in KNN

Choosing the best value of $k$ in the K-Nearest Neighbors (KNN) algorithm is crucial for optimal model performance. Below are various strategies to help determine an appropriate value of $k$:

### 1. **Cross-Validation**
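   - **Description**: Use **cross-validation** to compare candidate values of $k$. Split the dataset into $n$ subsets (folds), train on $n-1$ folds, and validate on the remaining fold, rotating until every fold has served once as the validation set. Repeat this for each candidate $k$ and choose the value with the best average validation accuracy. Note that the number of folds is a separate setting and is unrelated to the number of neighbors $k$, even though the technique is usually called "k-fold" cross-validation.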
   - **Benefit**: Helps avoid overfitting and ensures that the chosen $k$ generalizes well to unseen data.
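
The procedure can be sketched without any external library. Everything here is an illustrative assumption: the helper names, the synthetic two-cluster data, and the candidate list. The number of folds is called `n_folds` to keep it distinct from the number of neighbors.

```python
import math
import random
from collections import Counter

def knn_classify(train_X, train_y, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(X, y, k_neighbors, n_folds=5):
    """Mean validation accuracy of KNN across n_folds folds."""
    fold_scores = []
    for fold in range(n_folds):
        # Every n_folds-th sample, starting at `fold`, is held out for validation.
        val_idx = set(range(fold, len(X), n_folds))
        train_X = [x for i, x in enumerate(X) if i not in val_idx]
        train_y = [label for i, label in enumerate(y) if i not in val_idx]
        correct = sum(knn_classify(train_X, train_y, X[i], k_neighbors) == y[i] for i in val_idx)
        fold_scores.append(correct / len(val_idx))
    return sum(fold_scores) / n_folds

# Hypothetical synthetic data: two Gaussian clusters, 40 points per class.
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
X += [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(40)]
y = ["A"] * 40 + ["B"] * 40

candidates = [1, 3, 5, 7, 9]
scores = {k: cross_val_accuracy(X, y, k) for k in candidates}
best_k = max(candidates, key=scores.get)
print(scores)
print("best k:", best_k)
```

On well-separated data like this, several candidates may score equally well; `max` then returns the first of them, so ties resolve toward the smaller $k$.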

### 2. **Use an Odd Value of $k$**
   - **Description**: Prefer odd values for $k$ to reduce the chance of ties when dealing with binary classification problems (where there are two classes).
   - **Benefit**: Minimizes ambiguity in classification decisions.
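
A tie is easy to reproduce with an even $k$. The sketch below assumes the neighbor labels are already sorted from nearest to farthest; the label list and the helper `vote_with_tie_flag` are hypothetical.

```python
from collections import Counter

def vote_with_tie_flag(sorted_labels, k):
    """Return (winning label, whether the top count is shared) for the first k labels."""
    counts = Counter(sorted_labels[:k]).most_common()
    top_count = counts[0][1]
    tied = sum(1 for _, count in counts if count == top_count) > 1
    return counts[0][0], tied

# Hypothetical labels of the five nearest neighbors in a two-class problem.
neighbor_labels = ["A", "B", "B", "A", "A"]

print(vote_with_tie_flag(neighbor_labels, k=3))  # ('B', False)
print(vote_with_tie_flag(neighbor_labels, k=4))  # tied flag is True: two votes each
print(vote_with_tie_flag(neighbor_labels, k=5))  # ('A', False)
```

With two classes and an odd $k$ the vote can never split evenly, which is the reason for the recommendation.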

### 3. **Rule of Thumb**
   - **Description**: A common rule of thumb is to set $k$ as the square root of the total number of data points ($N$):
     $$
     k = \sqrt{N}
     $$
   - **Benefit**: This provides a quick starting point that balances the trade-off between sensitivity and generalization.
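
As a rough sketch, the rule of thumb can be combined with the odd-value preference. The function name `rule_of_thumb_k` and the choice to round up to the next odd number are assumptions for illustration, and the result is only a starting point for further tuning.

```python
import math

def rule_of_thumb_k(n_samples):
    """Starting value for k: the square root of N, nudged to an odd number."""
    k = max(1, round(math.sqrt(n_samples)))
    # Move even values up to the next odd number to reduce ties in binary problems.
    return k if k % 2 == 1 else k + 1

for n in (50, 100, 400):
    print(n, rule_of_thumb_k(n))
```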

### 4. **Grid Search**
   - **Description**: Use **Grid Search** to automate the search for the best $k$ value by evaluating a range of values for $k$. This can be done by specifying a range of $k$ values and selecting the one with the highest performance score (e.g., accuracy, precision).
   - **Benefit**: Provides a systematic and efficient way to find the optimal $k$ based on a predefined metric.

### 5. **Consider Data Characteristics**
   - **Description**: Assess the **nature of the dataset** when selecting $k$. If the dataset has noisy data, a larger $k$ can help smooth out the effect of noise. For datasets with clear, well-separated clusters, a smaller $k$ may work better.
   - **Benefit**: Tailors the choice of $k$ to the specific characteristics of the dataset, potentially improving accuracy.

### 6. **Bias-Variance Trade-Off**
   - **Description**: Keep in mind the **bias-variance trade-off**:
     - A **small $k$** (e.g., 1) can lead to **low bias** and **high variance**, making the model sensitive to noise (overfitting).
     - A **large $k$** can lead to **high bias** and **low variance**, which may cause the model to miss small patterns (underfitting).
   - **Benefit**: Balances the complexity and accuracy of the model by choosing $k$ that reduces errors.
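
The two extremes can be seen on a tiny example. The one-dimensional data below is hypothetical: class A occupies the low values except for a single mislabeled-looking B point at 3, and class B occupies the high values.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train_X = [(0,), (1,), (2,), (3,), (4,), (5,), (8,), (9,), (10,)]
train_y = ["A", "A", "A", "B", "A", "A", "B", "B", "B"]

near_noise = (3.2,)   # sits next to the isolated B point inside the A region
in_b_region = (9.5,)  # sits clearly among the B points

# k=1 follows the single noisy neighbor (high variance).
small_k = knn_classify(train_X, train_y, near_noise, k=1)
# k=3 outvotes the noisy point with its two A neighbors.
medium_k = knn_classify(train_X, train_y, near_noise, k=3)
# k=N ignores locality and always returns the overall majority class (high bias).
large_k = knn_classify(train_X, train_y, in_b_region, k=len(train_X))

print(small_k, medium_k, large_k)  # B A A
```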

### **Key Point**
Combining these strategies helps in selecting a $k$ that suits your dataset, giving good accuracy while avoiding both overfitting and underfitting.


### Example Dataset: 
Consider a dataset with 2 features (e.g., height and weight) to classify people as either "Short" or "Tall."

| Height (cm) | Weight (kg) | Label |
|-------------|-------------|-------|
| 150         | 50          | Short |
| 160         | 55          | Short |
| 170         | 70          | Tall  |
| 180         | 75          | Tall  |

### Step-by-Step KNN Classification:

1. **Select K value**: Let's choose K=3 (we'll consider the 3 nearest neighbors).
  

2. **New Data Point**: Suppose we want to classify a new person with a height of 165 cm and weight of 60 kg.


3. **Calculate Distances**: We calculate the Euclidean distance between the new data point and each of the points in the dataset.
   - Distance to (150, 50): $ \sqrt{(165 - 150)^2 + (60 - 50)^2} = \sqrt{325} \approx 18.03 $
   - Distance to (160, 55): $ \sqrt{(165 - 160)^2 + (60 - 55)^2} = \sqrt{50} \approx 7.07 $
   - Distance to (170, 70): $ \sqrt{(165 - 170)^2 + (60 - 70)^2} = \sqrt{125} \approx 11.18 $
   - Distance to (180, 75): $ \sqrt{(165 - 180)^2 + (60 - 75)^2} = \sqrt{450} \approx 21.21 $


4. **Find Nearest Neighbors**: The 3 nearest neighbors are:
   - (160, 55) with "Short"
   - (170, 70) with "Tall"
   - (150, 50) with "Short"


5. **Majority Vote**: Out of the 3 neighbors, 2 are "Short" and 1 is "Tall."


6. **Prediction**: Since the majority of neighbors are "Short," the new data point (165 cm, 60 kg) is classified as **Short**.


This is how KNN works: it finds the nearest neighbors based on a distance metric (Euclidean in this case) and assigns the class based on the majority vote from those neighbors.
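
The worked example can be reproduced with a short script, which is a convenient way to verify the arithmetic. It uses the four rows of the table and the query point from the steps above; the variable names are chosen for illustration only.

```python
import math
from collections import Counter

# The example dataset: ((height cm, weight kg), label).
people = [
    ((150, 50), "Short"),
    ((160, 55), "Short"),
    ((170, 70), "Tall"),
    ((180, 75), "Tall"),
]
query = (165, 60)
k = 3

# Euclidean distance from the query to each person, sorted nearest first.
distances = sorted((math.dist(point, query), point, label) for point, label in people)
for distance, point, label in distances:
    print(f"{point} {label}: {distance:.2f}")

# Majority vote among the k nearest neighbors.
nearest_labels = [label for _, _, label in distances[:k]]
prediction = Counter(nearest_labels).most_common(1)[0][0]
print("Prediction:", prediction)  # Short
```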


## Advantages of KNN Algorithm:
    
- It is simple to implement.
- It can be robust to noisy training data when K is large enough to average out individual noisy points.
- It can become more effective as the amount of training data grows.

## Disadvantages of KNN Algorithm:

- The value of K must always be chosen, which can sometimes be difficult.
- The computation cost at prediction time is high, because the distance from the new point to every training sample must be calculated.

## Applications of KNN Algorithm

1. **Text Classification**: KNN classifies text for tasks like spam detection and sentiment analysis by comparing text features.
2. **Image Recognition**: It identifies objects in images, such as facial recognition or handwriting recognition, by comparing pixel features.
3. **Recommendation Systems**: KNN is used to recommend items (like movies or products) based on similar user preferences.
4. **Healthcare**: It helps in diagnosing diseases and predicting health risks by classifying patient data.
5. **Fraud Detection**: KNN detects fraudulent activities, such as credit card fraud, by comparing current transactions to past patterns.
