# 🧠 K-Nearest Neighbors (KNN) Imputer
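**KNN imputation fills a missing value by finding the K rows most similar to the incomplete row, measured on the features both rows have observed, and then using the neighbors' average (numerical features) or most frequent value (categorical features). scikit-learn's `KNNImputer`, used below, implements the averaging variant, so categorical features must be encoded as numbers first.
Because it uses relationships between features, it usually gives more informed imputations than univariate methods such as mean or median imputation.**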
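
To make the mechanism concrete, here is a minimal NumPy sketch of the idea, not scikit-learn's actual implementation. The helper names `nan_euclidean` and `knn_impute` are hypothetical. It assumes a distance that skips coordinates missing in either row and rescales by the fraction of coordinates used, and it assumes an unweighted mean over the K nearest rows that have the target feature observed.

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over coordinates observed in both rows,
    rescaled so rows with fewer shared coordinates stay comparable."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return np.inf  # no shared coordinates: treat as infinitely far
    squared = np.sum((a[both] - b[both]) ** 2)
    return np.sqrt(a.size / both.sum() * squared)

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that column over the k nearest donor rows."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Donors are rows where column j is observed (row i is excluded automatically)
        donors = [r for r in range(len(X)) if not np.isnan(X[r, j])]
        dists = np.array([nan_euclidean(X[i], X[r]) for r in donors])
        nearest = np.argsort(dists, kind="stable")[:k]
        out[i, j] = np.mean([X[donors[n], j] for n in nearest])
    return out

toy = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
filled = knn_impute(toy, k=2)
print(filled)
```

In this toy array the missing value in the first row is filled from the second and third rows, which are its two closest donors.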

## 📆 When to Use
- When missingness depends on other observed features (the data is not missing completely at random)

- When your dataset is not too large (computational cost grows with data size)

- When you want to preserve complex relationships between features

- When your data is mixed type (numerical + categorical, with proper encoding)

- When you have enough complete cases to find meaningful neighbors

## ✅ Advantages
- ✅ Accounts for feature interdependencies

- ✅ Produces more realistic imputations than mean/median

- ✅ Can handle mixed data types with encoding

- ✅ Good for datasets with structured missingness

## ❌ Disadvantages
- ❌ Computationally slow on large datasets

- ❌ Sensitive to feature scaling: large-range features dominate the distance, so normalize or standardize first

- ❌ Performance depends on choice of K and distance metric

- ❌ Can struggle if many features are missing or noisy
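
The scaling point can be handled by standardizing before imputing. The sketch below uses synthetic data (the column meanings are hypothetical) and assumes that `StandardScaler` ignores NaNs when fitting and passes them through when transforming, which is its documented behavior in recent scikit-learn versions. It is one reasonable setup, not the only one.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): columns with very different spreads
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(30, 10, size=50),                 # an age-like column
    rng.normal(50, 40, size=50),                 # a fare-like column with a wider spread
    rng.integers(1, 4, size=50).astype(float),   # a class-like column
])
X_demo[::5, 0] = np.nan  # remove every 5th value of the first column

# Scale first so no single column dominates the neighbor distances
scaled_knn = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=3))
X_scaled_imputed = scaled_knn.fit_transform(X_demo)

# Map the imputed matrix back to the original units
scaler = scaled_knn.named_steps["standardscaler"]
X_imputed = scaler.inverse_transform(X_scaled_imputed)
```

The inverse transform is optional; a downstream model can also consume the scaled matrix directly.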

In [1]:
import numpy as np
import pandas as pd

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [14]:
df = pd.read_csv(r"C:\Users\Asus\Downloads\train.csv")[['Age','Pclass','Fare','Survived']]

In [15]:
df.head()

|   | Age | Pclass | Fare | Survived |
|---|---|---|---|---|
| 0 | 22.0 | 3 | 7.25 | 0 |
| 1 | 38.0 | 1 | 71.2833 | 1 |
| 2 | 26.0 | 3 | 7.925 | 1 |
| 3 | 35.0 | 1 | 53.1 | 1 |
| 4 | 35.0 | 3 | 8.05 | 0 |


In [16]:
df.isnull().mean() * 100

Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64

In [17]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [19]:
X_train.head()

|   | Age | Pclass | Fare |
|---|---|---|---|
| 30 | 40.0 | 1 | 27.7208 |
| 10 | 4.0 | 3 | 16.7 |
| 873 | 47.0 | 3 | 9.0 |
| 182 | 9.0 | 3 | 31.3875 |
| 876 | 20.0 | 3 | 9.8458 |


In [20]:
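# Impute Age from the 3 nearest rows, giving closer neighbors more weight.
# Features are left unscaled here, so Fare (the widest-range column) dominates the distances.
knn = KNNImputer(n_neighbors=3, weights='distance')

# Fit on the training split only, then reuse the fitted imputer on the test split
X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)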
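
To see what the `weights` argument changes, the sketch below imputes the same small array (illustrative values, not taken from the Titanic data) with both settings. With `'uniform'` the neighbors are averaged equally. With `'distance'` each neighbor is weighted by the inverse of its distance, so the imputed value moves toward the closest neighbor.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative toy array, not the Titanic data
toy = np.array([[1, 2, np.nan],
                [3, 4, 3],
                [np.nan, 6, 5],
                [8, 8, 7]], dtype=float)

uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(toy)
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(toy)

# Row 0, column 2: the two nearest donors hold 3 (closer) and 5 (farther)
print("uniform :", uniform[0, 2])   # plain mean of 3 and 5
print("distance:", distance[0, 2])  # pulled toward 3, the closer neighbor
```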

In [21]:
# Fit a baseline classifier on the KNN-imputed features
lr = LogisticRegression()
lr.fit(X_train_trf, y_train)

# Score on the held-out split
y_pred = lr.predict(X_test_trf)
accuracy_score(y_test, y_pred)

0.7039106145251397

### Comparison with SimpleImputer (mean)


In [22]:
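# Baseline: replace each missing Age with the training-set mean
si = SimpleImputer(strategy='mean')  # 'mean' is also the default strategy

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)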

In [23]:
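# Same classifier, now trained on the mean-imputed features
lr = LogisticRegression()
lr.fit(X_train_trf2, y_train)

# Score on the same held-out split for a like-for-like comparison
y_pred2 = lr.predict(X_test_trf2)
accuracy_score(y_test, y_pred2)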

0.6927374301675978
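
The comparison above uses one train/test split and one setting of K, so a gap of about one percentage point may be within noise. A more robust way to choose K and the weighting is cross-validation over a pipeline. The sketch below runs on synthetic data, and the step names `scale`, `impute` and `model` are arbitrary labels chosen for this example. It is not a result for the Titanic data.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): the label depends on two features, one of which has gaps
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(200, 3))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)
X_syn[rng.random(200) < 0.2, 0] = np.nan  # roughly 20% missing in the first column

# Scaling and imputation are refit inside each fold, which avoids leakage
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("impute", KNNImputer()),
    ("model", LogisticRegression()),
])

param_grid = {
    "impute__n_neighbors": [1, 3, 5, 7, 10],
    "impute__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

To apply this to the notebook's data, `X_train` and `y_train` would replace the synthetic arrays, with the test split held back for a final check.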