### KNN Imputer  

#### Working  
- Uses a **NaN-aware Euclidean distance** to find the nearest neighbors.  
- The distance between two rows is computed only over the features that are **present (non-missing) in both rows**, then scaled up to compensate for the skipped coordinates.  

$$
\operatorname{dist}(x, y) = \sqrt{\text{weight} \times \text{(sum of squared differences over present coordinates)}}
$$

where  

$$
\text{weight} = \frac{\text{total \# of coordinates}}{\text{\# of present coordinates}}
$$
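
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than scikit-learn's own implementation (which is exposed as `sklearn.metrics.pairwise.nan_euclidean_distances`); the function name `nan_euclidean` and the example vectors are assumptions made for this sketch.

```python
import numpy as np


def nan_euclidean(x, y):
    """NaN-aware Euclidean distance between two 1-D feature vectors (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # A coordinate counts as "present" only if it is non-missing in both vectors.
    present = ~np.isnan(x) & ~np.isnan(y)
    n_present = present.sum()
    if n_present == 0:
        # No shared coordinates: the distance is undefined.
        return np.nan

    # Sum of squared differences over the present coordinates only.
    sq_dist = np.sum((x[present] - y[present]) ** 2)

    # Scale up to compensate for the coordinates that were skipped.
    weight = x.size / n_present
    return float(np.sqrt(weight * sq_dist))


x = [3.0, np.nan, np.nan, 6.0]
y = [1.0, np.nan, 4.0, 5.0]

# Coordinates 0 and 3 are present in both vectors, so weight = 4 / 2 = 2
# and the distance is sqrt(2 * ((3 - 1)**2 + (6 - 5)**2)) = sqrt(10).
d = nan_euclidean(x, y)
print(d)
```

Without the weight, rows that share only a few coordinates would look artificially close, because fewer squared terms enter the sum.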

- Each missing value is then imputed as the mean (`weights='uniform'`) or the inverse-distance-weighted mean (`weights='distance'`) of that feature over the **k nearest neighbors** that have a value for it.
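
To see the averaging step on a dataset small enough to check by hand, the sketch below runs `KNNImputer` on a 4×3 toy matrix with both weighting schemes. The matrix is made up for illustration and is not part of the Titanic data used later.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (an assumption for this sketch): two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Plain average of the 2 nearest neighbors.
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)

# Average weighted by 1 / distance, so closer neighbors count more.
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

# Row 0 is missing its third feature. Its two nearest donors are rows 1 and 2,
# whose values are 3 and 5, so the uniform estimate is (3 + 5) / 2 = 4.
# Row 1 is twice as close as row 2, so the distance-weighted estimate is
# (2 * 3 + 1 * 5) / 3, roughly 3.67.
print(uniform)
print(distance)
```

Only the missing cells change; observed values pass through untouched.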


### Advantages and Disadvantages of KNN Imputer

**Advantages:**  
- Can provide **more accurate imputations** than a single global statistic such as the mean, because each value is estimated from similar observations.  
- Works well when the features have **complex relationships** with one another.  

**Disadvantages:**  
- **Computationally expensive**, as it requires calculating distances between each row with missing values and the other observations.  
- Can be **slow on large datasets**, because the number of distance calculations grows with the number of rows.  
- In production, the **entire training dataset must be available** to impute new data, which can increase memory and storage requirements.
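
The storage cost can be made concrete by comparing the pickled size of a fitted `KNNImputer` with that of a fitted `SimpleImputer`. This is a rough sketch on synthetic data: the helper `fitted_size_bytes`, the row counts, and the 20% missing rate are assumptions, and pickled size is only a proxy for memory use.

```python
import pickle

import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)


def fitted_size_bytes(imputer, n_rows, n_features=3):
    """Fit `imputer` on synthetic data and return its pickled size in bytes."""
    X = rng.normal(size=(n_rows, n_features))
    # Blank out roughly 20% of the first column.
    X[rng.random(n_rows) < 0.2, 0] = np.nan
    imputer.fit(X)
    return len(pickle.dumps(imputer))


sizes = {}
for n_rows in (1_000, 10_000):
    sizes[n_rows] = {
        # KNNImputer keeps the training rows so it can search them later.
        "knn": fitted_size_bytes(KNNImputer(n_neighbors=3), n_rows),
        # SimpleImputer keeps only one statistic per column.
        "mean": fitted_size_bytes(SimpleImputer(strategy="mean"), n_rows),
    }
    print(n_rows, sizes[n_rows])
```

The KNN imputer's size should grow with the number of training rows, while the mean imputer's size should stay roughly constant.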


In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [2]:
# Load the Titanic training data and keep only the columns used in this example
df = pd.read_csv('E:\\Machine learning\\ML\\FeatureEngineering\\HandlingMissingData\\UnivariateHandle\\OtherTechnique\\train.csv')[['Age','Pclass', 'Fare', 'Survived']]

In [3]:
df.head()

| | Age | Pclass | Fare | Survived |
|---|---|---|---|---|
| 0 | 22.0 | 3 | 7.2500 | 0 |
| 1 | 38.0 | 1 | 71.2833 | 1 |
| 2 | 26.0 | 3 | 7.9250 | 1 |
| 3 | 35.0 | 1 | 53.1000 | 1 |
| 4 | 35.0 | 3 | 8.0500 | 0 |


In [4]:
# Percentage of missing values in each column
df.isnull().mean()*100

Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64

In [5]:
x = df.drop('Survived', axis=1)
y = df['Survived']

In [7]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)

In [11]:
x_train

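| | Age | Pclass | Fare |
|---|---|---|---|
| 451 | NaN | 3 | 19.9667 |
| 345 | 24.0 | 2 | 13.0000 |
| 687 | 19.0 | 3 | 10.1708 |
| 279 | 35.0 | 3 | 20.2500 |
| 742 | 21.0 | 1 | 262.3750 |
| ... | ... | ... | ... |
| 534 | 30.0 | 3 | 8.6625 |
| 584 | NaN | 3 | 8.7125 |
| 493 | 71.0 | 1 | 49.5042 |
| 527 | NaN | 1 | 221.7792 |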


In [22]:
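# Impute each missing value from its 3 nearest neighbors, weighting closer neighbors more heavily
knn = KNNImputer(n_neighbors=3, weights='distance')
# Fit on the training split only, then reuse the fitted imputer on the test split
x_train_trf = knn.fit_transform(x_train)
x_test_trf = knn.transform(x_test)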

In [23]:
pd.DataFrame(x_train_trf, columns = x_train.columns)

| | Age | Pclass | Fare |
|---|---|---|---|
| 0 | 25.415436 | 3.0 | 19.9667 |
| 1 | 24.000000 | 2.0 | 13.0000 |
| 2 | 19.000000 | 3.0 | 10.1708 |
| 3 | 35.000000 | 3.0 | 20.2500 |
| 4 | 21.000000 | 1.0 | 262.3750 |
| ... | ... | ... | ... |
| 618 | 30.000000 | 3.0 | 8.6625 |
| 619 | 26.959410 | 3.0 | 8.7125 |
| 620 | 71.000000 | 1.0 | 49.5042 |
| 621 | 32.666667 | 1.0 | 221.7792 |


In [24]:
# Train a classifier on the KNN-imputed features and score it on the test split
lr = LogisticRegression()
lr.fit(x_train_trf, y_train)
y_pred = lr.predict(x_test_trf)
accuracy_score(y_test, y_pred)

0.7350746268656716

In [25]:
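# Baseline for comparison: fill missing values with the training-set column mean
si = SimpleImputer(strategy='mean')
x_train_trf2 = si.fit_transform(x_train)
x_test_trf2 = si.transform(x_test)
# Same classifier, trained on the mean-imputed features
lr2 = LogisticRegression()
lr2.fit(x_train_trf2, y_train)
y_pred2 = lr2.predict(x_test_trf2)
accuracy_score(y_test, y_pred2)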

0.7238805970149254
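
The two accuracies above differ by about one percentage point on a single train/test split, which may be within split-to-split noise. One way to get a steadier comparison is to wrap each imputer and the classifier in a pipeline and cross-validate. The sketch below is a possible approach, not a rerun of this notebook: it uses synthetic data from `make_classification` as a stand-in because `train.csv` is not available here, and the candidate names, `max_iter=1000`, and the 20% missing rate are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the three Titanic features (assumption: not the real data).
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=2
)

# Blank out roughly 20% of the first column, similar to the missing rate of Age.
rng = np.random.default_rng(2)
X[rng.random(len(X)) < 0.2, 0] = np.nan

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "knn_k3_distance": KNNImputer(n_neighbors=3, weights="distance"),
    "knn_k5_uniform": KNNImputer(n_neighbors=5, weights="uniform"),
}

scores = {}
for name, imputer in candidates.items():
    # The pipeline refits the imputer inside each fold, so no test rows leak into it.
    model = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```

To apply this to the notebook's data, replace the synthetic `X` and `y` with `x` and `y` from the cells above.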