Here are clean and easy **notes-style points** for **KNN Imputer**, including **advantages and disadvantages**.

---

# ⭐ **KNN Imputer — Notes**

### ✅ **What is KNN Imputer?**

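* A method that fills missing values using the **K-Nearest Neighbors** idea.
* For each row with a missing value, it finds the **k closest rows** (neighbors) based on the feature values that are present.
* The missing value is replaced with the **mean** of that feature across those neighbors. scikit-learn's `KNNImputer` supports **numeric features only**; a **majority vote** for categorical features is a variant you have to implement yourself or approximate by encoding categories as numbers first.
* Works like KNN classification/regression, but the value being "predicted" is the missing feature, not a target label.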
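
A minimal sketch of the idea with scikit-learn's `KNNImputer`. The toy array is made up for illustration; the one missing value is filled with the mean of that column over the two nearest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (invented for illustration): 4 rows, 2 features, one missing value.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],   # second feature is missing here
    [8.0, 80.0],
])

# n_neighbors=2 with the default uniform weights: plain mean of the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The nearest rows to [3.0, nan] are [2.0, 20.0] and [1.0, 10.0],
# so the gap is filled with (20 + 10) / 2 = 15.
print(X_filled[2, 1])
```

The far-away row `[8.0, 80.0]` does not take part, which is the difference from mean imputation (that would use all three observed values).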

---

# ⭐ **How KNN Imputer Works (Step-by-Step)**

1. Identify the row that has missing values.
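2. Calculate the distance from this row to the other rows. scikit-learn uses a NaN-aware **Euclidean distance** (`nan_euclidean`) that skips coordinates missing in either row, so a neighbor may itself have missing values as long as the feature being imputed is present in it.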
3. Find **K closest rows** (neighbors).
4. Take their values:

   * For numeric data → **average of neighbors**
   * For categorical data → **most frequent** value among neighbors (not built into scikit-learn's `KNNImputer`)
5. Replace the missing value with this computed result.
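
The steps above can be sketched by hand in NumPy. This is a simplified illustration, not scikit-learn's actual implementation; the helper names `nan_euclidean` and `impute_cell` are hypothetical, and the distance follows the NaN-aware Euclidean definition (squared differences over shared coordinates, rescaled by total/shared coordinates).

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both rows,
    rescaled by (total coordinates / shared coordinates)."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf  # nothing to compare on
    sq = np.sum((a[shared] - b[shared]) ** 2)
    return np.sqrt(len(a) / shared.sum() * sq)

def impute_cell(X, row, col, k):
    """Fill X[row, col] with the mean of that column over the k nearest donors."""
    # Donors: other rows that have an observed value in `col`.
    donors = [i for i in range(len(X)) if i != row and not np.isnan(X[i, col])]
    # Rank donors by distance to the row being imputed and keep the k closest.
    donors.sort(key=lambda i: nan_euclidean(X[row], X[i]))
    nearest = donors[:k]
    return float(np.mean(X[nearest, col]))

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

print(impute_cell(X, row=0, col=2, k=2))  # mean of 3.0 and 5.0
print(impute_cell(X, row=2, col=0, k=2))  # mean of 3.0 and 8.0
```

Note that row 2 serves as a donor for row 0 even though row 2 has a missing value of its own.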

---

# ⭐ **Important Parameters**

* **n_neighbors** → number of neighbors (default = 5)
* **weights** →

  * `"uniform"` = all neighbors equal
  * `"distance"` = closer neighbors get more weight
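* **metric** → distance function (default = `"nan_euclidean"`, a Euclidean distance that ignores missing coordinates; a custom callable is also accepted)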
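
A small sketch of how `weights` changes the result. The data is invented for illustration; the row being imputed sits twice as close to one neighbor as to the other, so `"distance"` weighting pulls the imputed value toward the closer one.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (invented): the second feature is missing in row 1.
X = np.array([
    [0.0, 0.0],
    [1.0, np.nan],
    [3.0, 30.0],
    [10.0, 100.0],
])

uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

# Neighbors of row 1 are rows 0 and 2.
# uniform : (0 + 30) / 2 = 15
# distance: weights are 1/distance, i.e. 2:1 in favor of row 0 -> (2*0 + 1*30) / 3 = 10
print(uniform[1, 1], weighted[1, 1])
```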

---

# ⭐ **When to Use KNN Imputer**

* When rows that look alike on the observed features also tend to be alike on the missing one.
* When simple imputers (mean, median) lose too much information.
* When the share of missing values is small to moderate.

---

# ⭐ **Advantages (Pros)**

### ✔ 1. Often More Accurate than Mean/Median

Because it uses information from **similar rows**, not just a single statistic. This is not guaranteed, so compare both on your own data.

### ✔ 2. Captures Data Relationships

Preserves relationships between features better than filling every gap with one constant, which flattens them.

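### ✔ 3. Can Be Extended to Categorical Data

The neighbor idea works for both numerical and categorical data, but scikit-learn's `KNNImputer` handles numeric columns only; categorical columns need encoding or a custom majority vote.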

### ✔ 4. No assumption about data distribution

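KNN is non-parametric, so it does not rely on the data following any particular distribution. A global mean is pulled around by skewed data, while KNN only looks at local neighbors.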

---

# ⭐ **Disadvantages (Cons)**

### ❌ 1. Very Slow on Large Datasets

Because it must calculate the distance from every row with a missing value to all rows seen during fitting.

### ❌ 2. Sensitive to Outliers

If an outlier ends up among the k neighbors, it pulls the average → wrong imputations, especially with small k.

### ❌ 3. Requires Scaled Data

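Features with large ranges dominate the distance, so neighbors end up being chosen by one or two columns
(so apply **StandardScaler or MinMaxScaler first**; both skip NaNs when fitting and leave them in place).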
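
A hedged sketch of why scaling matters. The numbers are invented for illustration (the columns can be read as age, income and a score with one missing value). With raw values the income column decides the neighbor; after standardizing, the age column counts as well and a different neighbor is chosen.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy data (invented): columns are age, income, score.
X = np.array([
    [30.0, 50_000.0, np.nan],  # row to impute
    [31.0, 58_000.0, 1.0],     # similar age, somewhat different income
    [60.0, 50_500.0, 9.0],     # very different age, almost the same income
    [45.0, 90_000.0, 5.0],
])

# Without scaling: the income gap (8000 vs 500) swamps the age gap (1 vs 30).
raw = KNNImputer(n_neighbors=1).fit_transform(X)

# With scaling: standardize, impute, then map back to the original units.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # NaN is ignored in fit and kept in transform
scaled = scaler.inverse_transform(KNNImputer(n_neighbors=1).fit_transform(X_scaled))

print(raw[0, 2])     # score copied from row 2 (nearest on raw income)
print(scaled[0, 2])  # score copied from row 1 (nearest once both features count)
```

Which answer is "right" depends on the data; the point is only that the choice of neighbor changes with the scale of the features.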

### ❌ 4. Memory Heavy

The fitted imputer keeps the whole training set in memory, because it needs it to find neighbors at transform time.

### ❌ 5. Can distort categories with many classes

If a categorical feature has many classes, the k neighbors may all disagree, so the "majority" can be a category with a single vote.
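
A tiny illustration of that problem in plain Python. The function `majority_vote` is a hypothetical helper, not part of scikit-learn, and the labels are invented.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the neighbors and its share of the votes."""
    label, votes = Counter(neighbor_labels).most_common(1)[0]
    return label, votes / len(neighbor_labels)

# Few classes: the winner has clear support.
few_label, few_share = majority_vote(["S", "S", "C"])

# Many classes: every neighbor says something different, so the "winner" has one vote.
many_label, many_share = majority_vote(["Paris", "Lyon", "Nice"])

print(few_label, few_share)
print(many_label, many_share)
```

A low vote share like the second case is a sign that the imputed category is close to arbitrary.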

---

# ⭐ Final Tip

KNN Imputer is powerful but should be used when:

* Dataset is not too large
* Data has meaningful similarity patterns
* Features are scaled

---

The notebook cells below apply KNN Imputer to a small dataset and compare it with **SimpleImputer**.


In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.impute import KNNImputer,SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score


In [3]:
df = pd.read_csv('train.csv')[['Age','Pclass','Fare','Survived']]
df.head()

    Age  Pclass     Fare  Survived
0  22.0       3     7.25         0
1  38.0       1  71.2833         1
2  26.0       3    7.925         1
3  35.0       1     53.1         1
4  35.0       3     8.05         0


In [4]:
df.isnull().mean() * 100

Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64

In [5]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [6]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

In [7]:
X_train.head()

      Age  Pclass     Fare
30   40.0       1  27.7208
10    4.0       3     16.7
873  47.0       3      9.0
182   9.0       3  31.3875
876  20.0       3   9.8458


In [8]:
knn = KNNImputer(n_neighbors=3, weights='distance')

# Fit on the training split only, then reuse the fitted imputer on the test split
X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)

In [9]:
lr = LogisticRegression()

lr.fit(X_train_trf,y_train)

y_pred = lr.predict(X_test_trf)

accuracy_score(y_test,y_pred)

0.7039106145251397

In [10]:
# Comparison with SimpleImputer (default strategy = mean)

si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [11]:
lr = LogisticRegression()

lr.fit(X_train_trf2,y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test,y_pred2)

0.6927374301675978