# **KNN Imputer**

### Univariate vs. Multivariate Imputation

**Univariate Imputation**
Univariate imputation involves imputing missing values in a single feature (column) using only the available information from that same feature. Common methods include:
- **Mean imputation**: Replacing missing values with the mean of the observed values in the column.
- **Median imputation**: Replacing missing values with the median of the observed values in the column.
- **Mode imputation**: Replacing missing values with the mode (most frequent value) of the observed values in the column.
- **Constant imputation**: Replacing missing values with a specified constant value.

*Pros/Cons*: Simple and fast to implement, but because it ignores relationships with other variables it can bias estimates, and filling many entries with a single value shrinks the variance of the imputed variable.
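As a minimal sketch of the four univariate strategies listed above, the snippet below uses scikit-learn's `SimpleImputer`. The toy array and the fill value `0.0` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data; np.nan marks a missing entry.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [2.0, 50.0],
])

# Each strategy fills a column using only that column's observed values.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)
const_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

print(mean_filled)
print(median_filled)
print(mode_filled)
print(const_filled)
```

Note that every missing entry in a column receives the same value, which is why these methods tend to shrink the column's variance.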

**Multivariate Imputation**
Multivariate imputation involves estimating missing values in a feature by taking into account the relationships with other features in the dataset. These methods leverage the correlation structure between variables to provide more accurate imputations. Common methods include:
- **K-Nearest Neighbors (KNN) imputation**: Each missing value is imputed from the values of that feature among the k nearest neighbors in the feature space, typically as their mean or a distance-weighted mean. The "distance" used to find neighbors is computed over the other available features.
- **Multiple Imputation by Chained Equations (MICE)**: An iterative process where each incomplete feature is imputed using a prediction model (e.g., linear regression) based on other features.
- **PCA-based imputation**: Using principal components to estimate missing values.

*Pros/Cons*: Generally more sophisticated and can provide more accurate imputations, preserving relationships between variables better than univariate methods. However, they are more computationally intensive.
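For comparison with the univariate methods, here is a small sketch of KNN imputation using scikit-learn's `KNNImputer`. The data and the choice `n_neighbors=2` are illustrative. By default `KNNImputer` measures distance with a NaN-aware Euclidean metric (`nan_euclidean`) that uses only the coordinates observed in both rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative toy data; np.nan marks a missing entry.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows in which the feature is observed.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

Unlike mean imputation, the two filled entries differ from their column means because each is drawn from rows that resemble the incomplete row on the other features.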

# K-Nearest Neighbors (KNN) Algorithm

K-Nearest Neighbors (KNN) is a simple, non-parametric, lazy learning algorithm primarily used for classification and regression tasks. It's considered "lazy" because it does not construct a model during the training phase; instead, it memorizes the entire training dataset. All computations are deferred until a prediction is requested.

## How KNN Works

The core principle of KNN is that data points that are similar tend to exist in close proximity within the feature space. When a new, unseen data point requires classification or a value prediction, KNN identifies its 'k' nearest neighbors from the training dataset.

### For Classification Tasks:

1.  **Choose the Value of K:** Select the number of nearest neighbors (`K`) to consider. `K` is typically a small integer (e.g., 3, 5, 7). In binary classification an odd `K` prevents ties in the majority vote; with more than two classes, ties remain possible even when `K` is odd.
2.  **Calculate Distances:** For the new data point, compute its distance to *every* data point in the training dataset. Common distance metrics include:
    *   **Euclidean Distance:** The straight-line distance between two points in Euclidean space. For two points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$, it's $\sqrt{(x_2-x_1)^2 + (y_2-y_1)^2}$. For $n$-dimensional space, it's $\sqrt{\sum_{i=1}^{n}(p_{1i} - p_{2i})^2}$. It's the most widely used metric.
    *   **Manhattan Distance (L1 Norm):** The sum of the absolute differences of their Cartesian coordinates. For two points, it's $|x_2-x_1| + |y_2-y_1|$. For $n$-dimensional space, it's $\sum_{i=1}^{n}|p_{1i} - p_{2i}|$. It represents distance travelled along axes at right angles.
    *   **Minkowski Distance:** A generalization of Euclidean and Manhattan distances. It is defined as $(\sum_{i=1}^{n}(|p_{1i} - p_{2i}|)^p)^{1/p}$. When $p=1$, it's Manhattan distance; when $p=2$, it's Euclidean distance.
3.  **Identify K-Nearest Neighbors:** Sort the calculated distances in ascending order and select the `K` data points with the smallest distances. These are the K-nearest neighbors.
4.  **Vote for Class Label:** Examine the class labels of these `K` neighbors. The new data point is assigned the class label that is most frequent among its `K` nearest neighbors (majority vote). Ties can occur whenever two or more classes receive the same number of votes (with an even `K`, or with more than two classes); various strategies can then be employed, such as taking the next nearest neighbor into account or assigning randomly.
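The four steps above can be sketched from scratch with NumPy. This is an illustrative brute-force version, not how any particular library implements it; the function names `minkowski_distance` and `knn_classify` and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))


def knn_classify(X_train, y_train, query, k=3, p=2):
    # Step 2: distance from the query to every training point.
    distances = [minkowski_distance(x, query, p) for x in X_train]
    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote. On a tie, Counter returns the label that was
    # counted first, i.e. the tied class with the closest neighbor.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

label = knn_classify(X_train, y_train, query=[2.0, 2.0], k=3)
print(label)  # a
```

Changing `p` switches the metric without touching the rest of the procedure, which is the practical appeal of the Minkowski form.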

### For Regression Tasks:

1.  **Choose the Value of K:** Similar to classification, select an appropriate `K`.
2.  **Calculate Distances:** Compute the distance from the new data point to all training data points using a chosen distance metric (e.g., Euclidean).
3.  **Identify K-Nearest Neighbors:** Select the `K` training data points closest to the new point.
4.  **Calculate Average/Weighted Average:** The predicted value for the new data point is typically the average (mean) of the target values of its `K` nearest neighbors. A weighted average can also be used, where closer neighbors contribute more to the average.
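A minimal sketch of the regression variant follows, with both the plain and the weighted average. The inverse-distance weighting and the small constant guarding against division by zero are one common choice assumed here for illustration; the function name `knn_regress` and the data are hypothetical.

```python
import numpy as np


def knn_regress(X_train, y_train, query, k=3, weighted=False):
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - np.asarray(query, dtype=float), axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    if not weighted:
        return float(y_train[nearest].mean())
    # Inverse-distance weights: closer neighbors contribute more.
    weights = 1.0 / (distances[nearest] + 1e-12)
    return float(np.average(y_train[nearest], weights=weights))


# Hypothetical one-feature data where the target equals the feature.
X_train = [[1.0], [2.0], [3.0], [10.0]]
y_train = [1.0, 2.0, 3.0, 10.0]

plain = knn_regress(X_train, y_train, [2.5], k=3)
weighted = knn_regress(X_train, y_train, [2.5], k=3, weighted=True)
print(plain, weighted)
```

Here the plain average is pulled toward the more distant neighbor at 1.0, while the weighted average stays closer to the two neighbors at 2.0 and 3.0 that sit nearest the query.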

## Key Considerations and Characteristics:

*   **Non-parametric:** KNN makes no assumptions about the underlying data distribution.
*   **Lazy Learning:** No explicit training phase. All calculations occur during prediction, making it computationally expensive for large datasets during inference.
*   **Feature Scaling:** KNN is sensitive to the scale of features because distance calculations are heavily influenced by features with larger ranges. It's crucial to scale (e.g., standardization or normalization) the features before applying KNN.
*   **Choice of K:**
    *   **Small K:** Can be noisy and sensitive to outliers, potentially leading to overfitting.
    *   **Large K:** Smooths out predictions, but may blur boundaries between classes or miss fine-grained patterns, potentially leading to underfitting.
    *   The optimal `K` is often found through cross-validation.
*   **Distance Metric:** The choice of distance metric depends on the nature of the data. Euclidean is common for continuous numerical data.
*   **Curse of Dimensionality:** In high-dimensional spaces, the concept of "nearest" becomes less meaningful: pairwise distances tend to concentrate, so the nearest and farthest points end up almost equidistant from a query. This can degrade KNN's performance.
*   **Computational Cost:**
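    *   **Training:** Essentially no computation for a brute-force implementation, which only stores the data (O(N * D) memory). Implementations that build an index such as a KD-tree or ball tree do additional work at fit time.
    *   **Prediction:** O(N * D) per query for a brute-force search, where N is the number of training samples and D is the number of features, because the distance to every training point must be computed. For very large datasets, this can be slow.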
*   **Handling Imbalanced Data:** If one class is dominant, its instances might frequently be among the nearest neighbors, leading to biased predictions. Techniques like weighted voting or over/under-sampling can help.
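Two of the considerations above, feature scaling and the choice of `K`, can be handled together. The sketch below is one possible setup, assuming the Iris dataset and an arbitrary candidate grid for `K`: the scaler sits inside a pipeline so that it is refit on each training fold, and `GridSearchCV` picks `K` by 5-fold cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling inside the pipeline avoids leaking validation-fold statistics.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate values of K (an arbitrary illustrative grid).
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(f"Best K: {best_k}, mean CV accuracy: {search.best_score_:.3f}")
```

The selected `K` depends on the dataset and the folds, so it should be treated as a starting point and not a universal setting.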

## Advantages:

*   Simple to understand and implement.
*   No training phase (lazy learner).
*   Can be used for both classification and regression.
*   Effective for non-linear decision boundaries.

## Disadvantages:

*   Computationally expensive during prediction for large datasets.
*   Sensitive to irrelevant or redundant features.
*   Sensitive to the scale of features.
*   Performance degrades with high-dimensional data (curse of dimensionality).
*   Requires sufficient memory to store the entire training dataset.

## Implementation Example (Conceptual Python using scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load sample data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features (important for KNN); fit the scaler on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize the KNN classifier and fit it (fitting stores/indexes the training data)
knn = KNeighborsClassifier(n_neighbors=5)  # K=5
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred = knn.predict(X_test_scaled)

# Evaluate (e.g., accuracy)
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN Accuracy: {accuracy:.2f}")
```
