**Univariate** imputation fills a missing value using only the other values in the same column.
e.g. if a person's age is missing, it is imputed from the data in the Age column alone, for example with the mean or median age.

In **Multivariate** imputation, the missing age is filled using not only the Age column but also the values in the other columns of the rows involved.
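To make the univariate case concrete, here is a minimal sketch on a made-up four-row frame (the `toy` data and its values are assumptions for illustration, not the Titanic file used later). Only the Age column is consulted; the Fare column is ignored, which is exactly the information a multivariate method would use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file used in this notebook
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 50.0],
    "Fare": [7.0, 8.0, 80.0, 90.0],
})

# Univariate imputation: the fill value comes from the Age column alone
mean_filled = toy["Age"].fillna(toy["Age"].mean())
median_filled = toy["Age"].fillna(toy["Age"].median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

The row with the missing age has a Fare of 80, close to the 90 paid by the 50-year-old, but neither fill value takes that into account.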

# **KNN IMPUTER - Multivariate Imputation**

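KNN = k-Nearest Neighbours

k is the number of neighbours used (`n_neighbors` in scikit-learn).

missing value = (sum of that column's values in the k nearest rows) / k

The nearest rows are found with the nan-Euclidean distance, which skips any coordinate that is missing in either row and scales up what remains:

distance = sqrt(weight * sum of squared differences over the present coordinates)

weight = (total number of coordinates) / (number of present coordinates)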
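The two formulas above can be worked through by hand. The following is a rough NumPy sketch, not scikit-learn's own implementation; the helper name `nan_euclidean`, the `target` row, and the `donors` rows are all made up for illustration.

```python
import numpy as np

def nan_euclidean(a, b):
    """Distance between two rows, skipping coordinates missing in either row."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    present = ~(np.isnan(a) | np.isnan(b))
    if not present.any():
        return np.nan  # no shared coordinates, distance is undefined
    # weight = total number of coordinates / number of present coordinates
    weight = a.size / present.sum()
    return np.sqrt(weight * np.sum((a[present] - b[present]) ** 2))

# Hypothetical rows laid out as [Pclass, Age, Fare]; the target is missing Age
target = [3, np.nan, 8.0]
donors = np.array([
    [3, 22.0, 7.0],
    [3, 30.0, 9.0],
    [1, 50.0, 80.0],
])

dists = np.array([nan_euclidean(target, d) for d in donors])

k = 2
nearest = np.argsort(dists)[:k]          # indices of the k closest donor rows
imputed_age = donors[nearest, 1].mean()  # sum of k neighbours' ages / k

print(dists)
print(imputed_age)
```

Here the first two donors are equally close (only Pclass and Fare are compared, so the weight is 3/2), and the imputed age is the plain average of their ages.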

In [19]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.impute import KNNImputer,SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [20]:
df = pd.read_csv('train.csv', usecols=['Age','Pclass','Fare','Survived'])

In [21]:
df.head()

,Survived,Pclass,Age,Fare
0,0,3,22.0,7.25
1,1,1,38.0,71.2833
2,1,3,26.0,7.925
3,1,1,35.0,53.1
4,0,3,35.0,8.05


In [22]:
df.isnull().sum()

Survived,0
Pclass,0
Age,177
Fare,0


In [23]:
x = df.drop(columns=['Survived'])
y = df['Survived']

In [24]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=2)

In [25]:
x_train.head()

,Pclass,Age,Fare
30,1,40.0,27.7208
10,3,4.0,16.7
873,3,47.0,9.0
182,3,9.0,31.3875
876,3,20.0,9.8458


In [26]:
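# Creating a KNNImputer object
# Defaults are n_neighbors=5, weights='uniform'; options to try: n_neighbors=3, weights='distance'
knn = KNNImputer()
# Fit on the training set only, then reuse the fitted imputer on the test set
x_train_trf = knn.fit_transform(x_train)
x_test_trf = knn.transform(x_test)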
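To see what the `n_neighbors` and `weights` options change, here is a small sketch on a made-up 4x3 array (the array `X` is an assumption for illustration, not the Titanic data). With `weights='uniform'` the k neighbours are averaged equally; with `weights='distance'` closer neighbours count for more.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy array with two missing entries
X = np.array([
    [1, 2, np.nan],
    [3, 4, 3],
    [np.nan, 6, 5],
    [8, 8, 7],
], dtype=float)

# Plain average of the 2 nearest rows
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)

# Average weighted by inverse distance, so the closer row dominates
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

print(uniform)
print(distance)
```

In the uniform case the missing entry in the first row becomes the mean of 3 and 5, the values held by its two nearest rows; the distance-weighted result is pulled toward the closer of the two.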

In [27]:
x_train_trf

array([[  1.    ,  40.    ,  27.7208],
       [  3.    ,   4.    ,  16.7   ],
       [  3.    ,  47.    ,   9.    ],
       ...,
       [  1.    ,  71.    ,  49.5042],
       [  1.    ,  33.6   , 221.7792],
       [  1.    ,  42.8   ,  25.925 ]])

In [28]:
# x_train_trf is a numpy array --> converting it to a dataframe
pd.DataFrame(x_train_trf,columns=x_train.columns)

,Pclass,Age,Fare
0,1.0,40.0,27.7208
1,3.0,4.0,16.7000
2,3.0,47.0,9.0000
3,3.0,9.0,31.3875
4,3.0,20.0,9.8458
...,...,...,...
707,3.0,30.0,8.6625
708,3.0,24.4,8.7125
709,1.0,71.0,49.5042
710,1.0,33.6,221.7792
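Note that the rebuilt data frame gets a fresh 0, 1, 2, ... index, while `x_train` kept the original row labels (30, 10, 873, ...). If later steps need to line the imputed rows up with `y_train` by label, one option is to pass the original index as well. A minimal sketch, using a made-up three-row frame `x_small` as a stand-in for `x_train`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical mini version of x_train with shuffled row labels
x_small = pd.DataFrame(
    {"Pclass": [1, 3, 3], "Age": [40.0, np.nan, 47.0], "Fare": [27.7, 16.7, 9.0]},
    index=[30, 10, 873],
)

arr = KNNImputer(n_neighbors=2).fit_transform(x_small)

# Reattach both the column names and the original row labels
restored = pd.DataFrame(arr, columns=x_small.columns, index=x_small.index)
print(restored)
```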


In [39]:
# Creating a LogisticRegression object and training it on the KNN-imputed data
lr = LogisticRegression()
lr.fit(x_train_trf,y_train)
y_pred = lr.predict(x_test_trf)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test,y_pred)*100

70.39106145251397

In [40]:
# Comparing with SimpleImputer --> Mean (the default strategy)
si = SimpleImputer()
x_train_trf2 = si.fit_transform(x_train)
x_test_trf2 = si.transform(x_test)

In [41]:
lr.fit(x_train_trf2,y_train)
y_pred = lr.predict(x_test_trf2)
accuracy_score(y_test,y_pred)*100

69.27374301675978

In [42]:
# SimpleImputer --> Median
si = SimpleImputer(strategy='median')
x_train_trf3 = si.fit_transform(x_train)
x_test_trf3 = si.transform(x_test)

lr.fit(x_train_trf3,y_train)
y_pred = lr.predict(x_test_trf3)
accuracy_score(y_test,y_pred)*100

69.27374301675978

# **Conclusion**

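On this train/test split, KNNImputer outperforms SimpleImputer: 70.39% accuracy versus 69.27% for both the mean and the median strategy. The gap is about 1.1 percentage points on a single split, so it is a small gain rather than a decisive one.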
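Because the comparison rests on one split, a cross-validated check is a reasonable follow-up. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` as a stand-in (not `train.csv`), knocks out roughly 20% of one column, and wraps each imputer in a pipeline so that it is refit inside every fold. The scores it prints say nothing about the Titanic result.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption: replace with the real x and y)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)
X[rng.random(X.shape[0]) < 0.2, 1] = np.nan  # make ~20% of one column missing

imputers = {
    "knn": KNNImputer(),
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
}

scores = {}
for name, imputer in imputers.items():
    # The pipeline fits the imputer on each training fold only
    pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()

for name, score in scores.items():
    print(f"{name}: {score * 100:.2f}")
```

Averaging over five folds gives a steadier picture than a single `random_state`, and the same loop can be pointed at the real `x` and `y` from this notebook.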