# K-Nearest Neighbors (KNN)

## What is K-Nearest Neighbors?
K-Nearest Neighbors (KNN) is a **supervised machine learning algorithm** used for both **classification** and **regression** tasks. It is a **non-parametric** and **instance-based** learning method, meaning it does not make assumptions about the underlying data distribution and memorizes the training data to make predictions.

---

## Why is KNN Used?
- **Simplicity**: KNN is easy to implement and intuitive.
- **Versatility**: It can handle classification and regression problems.
- **Non-parametric**: Useful for datasets where no assumption about the data distribution can be made.
- **Adaptable**: Works well with multi-class classification problems.

---

## How Does KNN Work?

### 1. Data Representation
- Each data point is represented in an n-dimensional feature space.
- For classification, each point is labeled with a class.
- For regression, each point has a continuous target value.

### 2. Prediction
When a new data point needs to be classified or predicted, the algorithm:
1. Calculates the **distance** (e.g., Euclidean distance) from the new point to all training points.
2. Selects the **k nearest neighbors** (smallest distances).
3. Aggregates the neighbors:
   - **Classification**: Assigns the most common class (majority voting).
   - **Regression**: Calculates the average (or weighted average) of their target values.
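
The three prediction steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train_X, train_y, query, k):
    """Return the targets of the k training points closest to `query`."""
    # Step 1: distance from the query to every training point.
    distances = [(euclidean(x, query), y) for x, y in zip(train_X, train_y)]
    # Step 2: keep the k smallest distances.
    distances.sort(key=lambda pair: pair[0])
    return [y for _, y in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Step 3 (classification): majority vote among the k neighbors."""
    return Counter(k_nearest(train_X, train_y, query, k)).most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Step 3 (regression): unweighted mean of the k neighbors' targets."""
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(neighbors) / len(neighbors)

# Toy data (hypothetical): two well-separated clusters in 2-D.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(X, labels, (1.2, 1.4), k=3))  # a
print(knn_regress(X, values, (6.4, 6.9), k=3))   # 11.0
```

Note that all of the work happens inside the prediction functions; nothing is fitted in advance.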

### 3. Distance Metrics
Commonly used metrics for calculating distances:
- **Euclidean Distance**: sqrt(sum((x_i - y_i)^2))
- **Manhattan Distance**: sum(|x_i - y_i|)
- **Minkowski Distance**: (sum(|x_i - y_i|^p))^(1/p), a generalization that reduces to Manhattan distance at p = 1 and Euclidean distance at p = 2.
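
As a small sketch of how these metrics relate, the Minkowski distance of order `p` can be written once and specialized to the other two. The function names here are illustrative, and `p >= 1` is assumed.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p (assumed p >= 1) between equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def manhattan(a, b):
    """Minkowski distance with p = 1."""
    return minkowski(a, b, 1)

def euclidean(a, b):
    """Minkowski distance with p = 2."""
    return minkowski(a, b, 2)

a, b = (0.0, 0.0), (3.0, 4.0)
print(manhattan(a, b))  # 7.0
print(euclidean(a, b))  # 5.0
```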

---

## Key Technical Terms in KNN
- **k (Number of Neighbors)**:
  - A hyperparameter that determines how many neighbors to consider.
  - Small k: Sensitive to noise (overfitting).
  - Large k: Can underfit by ignoring local patterns.
- **Weighted Voting**:
  - Neighbors closer to the new point may have higher influence.
- **Instance-based Learning**:
  - Unlike parametric algorithms (e.g., Logistic Regression), KNN doesn’t build a model. It memorizes the training data.
- **Lazy Learning**:
  - The algorithm does essentially no computation during training (at most storing or indexing the data) and defers the real work to prediction time.
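
Weighted voting can be illustrated with a short sketch. The weighting scheme below (weight = 1 / distance) is one common choice and is an assumption of this example, as are the function name and the toy data; other schemes are possible.

```python
import math
from collections import defaultdict

def weighted_knn_classify(train_X, train_y, query, k=3, eps=1e-9):
    """Classify `query` by an inverse-distance-weighted vote of its k nearest neighbors."""
    scored = sorted(
        ((math.dist(x, query), y) for x, y in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = defaultdict(float)
    for distance, label in scored[:k]:
        # Closer neighbors get larger weights; eps guards against division by zero.
        votes[label] += 1.0 / (distance + eps)
    return max(votes, key=votes.get)

# Hypothetical data: one very close "near" point, two distant "far" points.
X = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0)]
y = ["near", "far", "far"]

print(weighted_knn_classify(X, y, (0.5, 0.0), k=3))  # near
```

With plain majority voting and k = 3, the same query would be labeled `far` (two votes to one); the distance weighting lets the single close neighbor win.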

---

## How is KNN Different from Logistic Regression?

| **Feature**                 | **KNN**                                     | **Logistic Regression**                     |
|-----------------------------|---------------------------------------------|---------------------------------------------|
| **Type**                    | Non-parametric                             | Parametric (assumes linear decision boundary). |
| **Learning Approach**       | Instance-based (lazy)                      | Model-based (eager).                        |
| **Data Assumptions**        | No assumptions about the data distribution.| Assumes a linear relationship between features and log-odds. |
| **Complexity**              | Simple to implement, computationally expensive during prediction.| Requires training, computationally faster during prediction. |
| **Performance on Large Data** | Struggles with very large datasets.       | Scales better for large datasets.           |
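| **Multicollinearity**       | No coefficients to destabilize, though correlated features are effectively over-weighted in the distance. | Coefficient estimates become unstable under multicollinearity. |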

---

## When to Use KNN?
- **Low-dimensional datasets**: KNN struggles with high-dimensional data (curse of dimensionality).
- **Balanced datasets**: Works well when classes are evenly distributed.
- **Non-linear decision boundaries**: Logistic regression may fail where KNN can adapt to complex patterns.
- **Small datasets**: KNN stores all training data, so memory use and prediction time grow with the size of the dataset.
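
The curse of dimensionality mentioned above can be made concrete with a small experiment. The sketch below assumes points drawn uniformly at random from the unit hypercube and uses a hypothetical helper, `relative_contrast`, that compares the farthest and nearest distances from a query point. As the dimension grows, the two tend to become almost the same, so "nearest" carries less information.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

# The contrast should shrink sharply as the dimension increases.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```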

---

## How to Evaluate KNN?

### Classification Metrics
- **Accuracy**: Proportion of correctly classified samples.
- **Precision, Recall, F1-Score**: More informative than accuracy when classes are imbalanced.
- **Confusion Matrix**: Provides a breakdown of TP, TN, FP, and FN.
- **ROC-AUC Score**: For evaluating the model's ability to distinguish between classes.
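
To show how these metrics are derived from the confusion matrix, here is a minimal sketch for the binary case. The function name and the example labels are hypothetical; in practice a library such as scikit-learn provides equivalent utilities.

```python
def binary_classification_report(y_true, y_pred, positive=1):
    """Compute confusion counts and derived metrics for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against empty denominators (e.g., no positive predictions).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision, "recall": recall, "f1": f1,
    }

# Imbalanced toy labels: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
report = binary_classification_report(y_true, y_pred)
print(report["accuracy"], report["precision"], report["recall"], report["f1"])
```

Here accuracy is 0.7 even though only one of the three positives is found (recall of about 0.33), which is why accuracy alone can mislead on imbalanced data.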

### Regression Metrics
- **Mean Absolute Error (MAE)**.
- **Mean Squared Error (MSE)**.
- **R² Score**: Measures how well the regression predictions approximate the true values.
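
The regression metrics above can be computed directly from their definitions. This is a small sketch with made-up values; the function name is illustrative.

```python
def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, and R² for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "r2": r2}

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(regression_metrics(y_true, y_pred))  # mae 0.5, mse 0.5, r2 0.9
```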

### Cross-validation
- Repeatedly splits the data into training and validation folds (e.g., k-fold) and averages the scores, giving a more reliable estimate than a single train/test split.

### Grid Search for k
- Use grid search with cross-validation to determine the optimal k value.
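
As a rough sketch of the idea, the search below tries a few candidate values of k and scores each with leave-one-out cross-validation (a special case of k-fold where every fold holds a single sample). The helper names and the 1-D toy data, which include one deliberately mislabeled point at 1.5, are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loocv_accuracy(data, k):
    """Leave-one-out CV: each point is predicted from all the others."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == label
    return hits / len(data)

# Hypothetical 1-D data: cluster "a" near 1, cluster "b" near 5, one noisy label at 1.5.
data = [((1.0,), "a"), ((1.2,), "a"), ((1.4,), "a"), ((1.5,), "b"),
        ((1.6,), "a"), ((1.8,), "a"), ((5.0,), "b"), ((5.2,), "b"),
        ((5.4,), "b"), ((5.6,), "b"), ((5.8,), "b")]

candidates = [1, 3, 5]
scores = {k: loocv_accuracy(data, k) for k in candidates}
best_k = max(scores, key=scores.get)  # ties go to the first (smallest) candidate
print(scores, best_k)
```

On this data k = 1 is hurt by the noisy point (its two closest neighbors inherit the wrong label), while k = 3 and k = 5 both outvote it, so the search settles on k = 3.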

---

## Strengths of KNN
- **Simple and Easy to Implement**.
- **No Explicit Training Phase**: Training amounts to storing the data, so it is computationally cheap.
- **Non-linear Boundaries**: Can adapt to complex decision boundaries.
- **Robust to Noise** (if k is chosen carefully): A sufficiently large k averages out noisy or mislabeled neighbors.

---

## Limitations of KNN
- **Computational Cost**: Prediction requires computing distances to all training samples, which can be slow for large datasets.
- **Curse of Dimensionality**: In high-dimensional spaces, distances become less meaningful.
- **Sensitive to Scaling**: Features need to be standardized or normalized for fair distance calculation.
- **Choice of k**: Selecting an optimal k value is crucial and can vary based on the dataset.
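
To illustrate the scaling sensitivity noted above, the sketch below finds the nearest neighbor of a query before and after z-score standardization. The features (age in years, income in dollars), the data, and the helper names are all hypothetical; the scaling statistics are computed from the training points only.

```python
import math
from statistics import mean, pstdev

def standardize(train, query):
    """Z-score each feature using the training mean and standard deviation."""
    columns = list(zip(*train))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]

    def scale(row):
        return tuple((v - m) / s for v, m, s in zip(row, mus, sigmas))

    return [scale(row) for row in train], scale(query)

def nearest_index(train, query):
    """Index of the training point with the smallest Euclidean distance to `query`."""
    return min(range(len(train)), key=lambda i: math.dist(train[i], query))

# (age, income): income spans a far larger numeric range than age.
train = [(25, 50000), (30, 90000), (60, 51000), (65, 95000)]
query = (27, 52000)

scaled_train, scaled_query = standardize(train, query)
print(nearest_index(train, query))                # 2: income dominates the distance
print(nearest_index(scaled_train, scaled_query))  # 0: age now counts as well
```

Without scaling, the 60-year-old with a similar income is "nearest"; after standardization, the 25-year-old with a similar age and income is.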

---

## Summary
- KNN is a simple, versatile algorithm best suited to small, low-dimensional datasets, including those with non-linear decision boundaries.
- It differs from Logistic Regression by being non-parametric and instance-based.
- The algorithm’s performance depends heavily on:
  - The choice of distance metric.
  - The number of neighbors (k).
  - Proper scaling of features.
- Evaluation uses classification or regression metrics, depending on the task.

