## KNN Imputer for Missing Data | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2023%20KNN%20Imputer)

### Overview

**KNN Imputer** is an imputation technique that leverages the concept of k-nearest neighbors (KNN) to fill in missing values. Rather than simply replacing missing values with a fixed statistic (like the mean or median), KNN Imputer identifies the k most similar observations (neighbors) based on the other feature values and then imputes the missing value using an aggregate (typically the mean) of the neighbors’ values.

### How It Works

1. **Distance Calculation:**  
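   For each sample with missing values, the algorithm calculates the distance to all other samples using only the features that are observed in both samples. scikit-learn uses a NaN-aware Euclidean distance for this, rescaled to compensate for the coordinates that were skipped.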
   
2. **Neighbor Selection:**  
   It then selects the k nearest neighbors that have a non-missing value in the feature being imputed, so the set of neighbors can differ from one feature to the next.

3. **Aggregation:**  
   The missing value is imputed by aggregating the corresponding values from the k nearest neighbors, typically as a plain average or, optionally, as an average weighted by inverse distance.
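
The three steps above can be reproduced by hand on a tiny array and compared with `KNNImputer`. This is a minimal sketch rather than the library's internal implementation; the array `X` and the names `manual_value` and `sk_value` are made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Toy data (hypothetical): row 0 is missing its value in column 2
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, 2.5, 1.0],
    [3.0, 3.0, 2.0],
    [8.0, 9.0, 7.0],
])
k = 2

# Step 1: NaN-aware Euclidean distance from row 0 to every row
dist = nan_euclidean_distances(X[[0]], X)[0]

# Step 2: keep only rows with column 2 observed, then take the k closest
candidates = np.flatnonzero(~np.isnan(X[:, 2]))
nearest = candidates[np.argsort(dist[candidates])[:k]]

# Step 3: average the neighbors' values in column 2
manual_value = X[nearest, 2].mean()

# The same cell as filled in by scikit-learn
sk_value = KNNImputer(n_neighbors=k).fit_transform(X)[0, 2]
print(manual_value, sk_value)
```

Rows 1 and 2 are the closest donors here, so both approaches fill the cell with the mean of their column-2 values.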

### Benefits

- **Data-Driven:** Each imputed value is derived from the observed values of similar samples rather than from one global statistic applied to every row.
- **Preserves Relationships:** By considering neighbors, the method tends to preserve local data patterns and relationships.
- **Flexibility:** The number of neighbors (k) can be tuned based on the dataset’s characteristics.
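
One way to tune k is to hide some known values, impute them, and compare the error across candidate values of k. The sketch below assumes synthetic correlated data and a 10% random masking rate; both are illustrative choices, and the helper names (`X_true`, `mask`, `errors`, `best_k`) are hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic data (assumption): four noisy copies of one latent signal
base = rng.normal(size=(200, 1))
X_true = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

# Hide roughly 10% of the entries so the true values are known for scoring
mask = rng.random(X_true.shape) < 0.1
X_missing = X_true.copy()
X_missing[mask] = np.nan

# RMSE on the hidden entries for each candidate k
errors = {}
for k in (1, 3, 5, 10):
    X_filled = KNNImputer(n_neighbors=k).fit_transform(X_missing)
    errors[k] = np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2))

best_k = min(errors, key=errors.get)
print(errors, best_k)
```

On real data, the masking would be applied to entries that are actually observed, and the best k will depend on sample size and noise level.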

### Limitations

- **Computationally Intensive:** Every sample with missing values must be compared against all other samples, so the cost grows quickly with the number of rows and can become resource-intensive on large datasets.
- **Assumes Similarity:** The method assumes that similar samples (neighbors) provide a good estimate, which may not hold in all cases.
- **Sensitive to Feature Scaling:** It is important to standardize or normalize features before applying KNN Imputer to ensure fair distance comparisons.
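
The scaling sensitivity is easy to see by changing only the units of one column. In the sketch below the column meanings (age, income, score) and all values are invented for illustration; the same income is expressed first in thousands and then in single units.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns (hypothetical): age in years, income in thousands, score
X_thousands = np.array([
    [25.0, 50.0, np.nan],   # score to be imputed
    [26.0, 55.0, 1.0],      # similar age, slightly higher income
    [60.0, 51.0, 9.0],      # very different age, similar income
])

# Same data, but income expressed in single units instead of thousands
X_units = X_thousands.copy()
X_units[:, 1] *= 1000

imputer = KNNImputer(n_neighbors=1)
score_thousands = imputer.fit_transform(X_thousands)[0, 2]
score_units = imputer.fit_transform(X_units)[0, 2]
print(score_thousands, score_units)
```

With income in thousands, row 1 is the nearest neighbor and the imputed score is 1.0. With income in single units, the income column dominates the distance, row 2 becomes the nearest neighbor, and the imputed score is 9.0.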

### Python Code Example

Below is an example using scikit-learn's `KNNImputer`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Sample DataFrame with missing numerical values
data = {
    'Feature1': [1.0, 2.0, np.nan, 4.0, 5.0],
    'Feature2': [2.0, np.nan, 3.0, 4.0, 5.0],
    'Feature3': [np.nan, 1.0, 2.0, 3.0, 4.0]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Standardize features before imputation so each contributes comparably to the distance.
# StandardScaler ignores NaNs when fitting and leaves them in place when transforming.
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Create and apply KNN Imputer
# n_neighbors defines the number of neighbors to use for imputation (default is 5)
imputer = KNNImputer(n_neighbors=3)
df_imputed_scaled = imputer.fit_transform(df_scaled)

# Inverse transform to get back original scale
df_imputed = pd.DataFrame(scaler.inverse_transform(df_imputed_scaled), columns=df.columns)
print("\nDataFrame after KNN Imputation:")
print(df_imputed)
```

#### Explanation:
- **Data Creation:** We create a DataFrame with missing values in three features.
- **Scaling:** Standardization is applied using `StandardScaler` to ensure features contribute equally to the Euclidean distance calculation.
- **KNNImputer:** We initialize `KNNImputer` with `n_neighbors=3` (you can adjust k as needed) and apply it to the scaled DataFrame.
- **Inverse Transform:** The imputed scaled data is transformed back to the original scale for interpretation.
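
When new rows arrive later (for example a test set), the scaler and imputer fitted on the training data can be reused instead of being refitted. The following is a sketch of one way to do that with a scikit-learn `Pipeline`; the arrays `X_train` and `X_new` are made-up examples, and `n_neighbors=2` is an arbitrary choice.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data with a few missing entries
X_train = np.array([
    [1.0, 2.0, 1.5],
    [2.0, np.nan, 1.0],
    [3.0, 3.0, 2.0],
    [4.0, 4.0, np.nan],
    [5.0, 5.0, 4.0],
])
# Hypothetical new row, missing its second feature
X_new = np.array([[2.5, np.nan, 1.8]])

# Scale first, then impute; both steps are fitted on the training data only
pipe = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
pipe.fit(X_train)

# Impute the new row in scaled space, then map back to the original units
X_new_scaled = pipe.transform(X_new)
X_new_filled = pipe.named_steps["standardscaler"].inverse_transform(X_new_scaled)
print(X_new_filled)
```

The observed entries of the new row come back unchanged, and only the missing entry is filled from its nearest training neighbors.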

### Conclusion

KNN Imputer fills missing values using local information from similar samples instead of a single global statistic. It is especially useful when the data has local structure that global strategies (like mean or median imputation) would overlook. Scale your features before using KNN-based methods so that no single feature dominates the distance calculation.