# Multivariate Imputation

Multivariate imputation is a technique in which data from more than one column is used to fill in a missing value.

### KNN Imputation

KNN (K-Nearest Neighbors) Imputer is a technique used to handle missing data by replacing missing values with values based on the nearest neighbors in the dataset.


- For a missing value, the fill value is computed with the formula shown in the image below.


- In KNN imputation we can set the number of nearest neighbors to take into account. The corresponding hyperparameter is `n_neighbors`.
- `weights` can be `'uniform'` or `'distance'`. `weights='distance'` often gives better results than `'uniform'`, although this depends on the dataset.
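As a small illustration of these two hyperparameters, the sketch below runs `KNNImputer` on a hypothetical 4x3 toy array (not the Titanic data used later). With `n_neighbors=2` and uniform weights, each missing entry becomes the plain average of that column over the two nearest rows that have a value there.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: rows are samples, columns are features.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Use the 2 nearest rows and give both the same weight.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)

print(X_filled)
# Row 0, column 2 becomes 4.0 (mean of 3.0 and 5.0);
# row 2, column 0 becomes 5.5 (mean of 3.0 and 8.0).
```

Switching to `weights="distance"` keeps the same neighbors but lets the closer one count for more.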

### Formula for calculating the fill value for a missing entry

![alt text](image.png)
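The formula itself is only available as an image. As a hedged sketch, the code below assumes it is the NaN-aware Euclidean distance that scikit-learn's `KNNImputer` uses by default (`metric='nan_euclidean'`): coordinates missing in either row are skipped, and the squared distance over the remaining coordinates is scaled up by (total coordinates / present coordinates) before taking the square root. The function name `nan_euclidean` and the two example rows are illustrative, not part of the notebook.

```python
import math

def nan_euclidean(a, b):
    """Distance between two rows, ignoring coordinates missing in either row."""
    # Keep only coordinate pairs where both values are present.
    present = [(x, y) for x, y in zip(a, b)
               if not (math.isnan(x) or math.isnan(y))]
    if not present:
        # No shared coordinates: the distance is undefined.
        return math.nan
    squared = sum((x - y) ** 2 for x, y in present)
    # Scale up to compensate for the coordinates that were skipped.
    scale = len(a) / len(present)
    return math.sqrt(scale * squared)

row_a = [3.0, math.nan, math.nan, 6.0]
row_b = [1.0, math.nan, 4.0, 5.0]

# Only the first and last coordinates are present in both rows:
# sqrt(4/2 * ((3-1)^2 + (6-5)^2)) = sqrt(10)
print(nan_euclidean(row_a, row_b))
```

The rows with the smallest such distance to the incomplete row are its nearest neighbors, and their values in the missing column are averaged to produce the fill value.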

### If we choose weights = 'distance'

`'distance'`: points are weighted by the inverse of their distance. In this case, closer neighbors of a query point have a greater influence than neighbors that are farther away.
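To make the effect of inverse-distance weighting concrete, here is a minimal sketch in plain Python. The helper `distance_weighted_fill` and the two neighbor values are hypothetical; the handling of zero distances (exact matches take over completely) is a simplifying assumption of this sketch.

```python
def distance_weighted_fill(values, distances):
    """Average neighbor values, weighting each by 1 / distance."""
    # Assumption: if a neighbor is at distance 0, only such exact matches are used.
    exact = [v for v, d in zip(values, distances) if d == 0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Two hypothetical neighbors: one close (distance 1), one farther away (distance 3).
neighbor_ages = [20.0, 40.0]
neighbor_dists = [1.0, 3.0]

uniform_fill = sum(neighbor_ages) / len(neighbor_ages)
weighted_fill = distance_weighted_fill(neighbor_ages, neighbor_dists)

print(uniform_fill)   # plain average: 30.0
print(weighted_fill)  # about 25.0, pulled toward the closer neighbor
```

With uniform weights both neighbors count equally, while the distance-weighted result moves toward the value of the neighbor at distance 1.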

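### Advantages
- Often more accurate than univariate imputation such as mean or median filling, because it uses information from the other columns.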

### Disadvantages
- Computationally expensive: distances to the training rows must be calculated for every row that has a missing value.
- Sensitive to outliers, which can affect the imputation result.
- Requires careful selection of the value of k for optimal results.
- High memory requirement on the production server, because the training dataset has to be stored there.
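One common way to approach the choice of k is to tune it with cross-validation inside a pipeline, so the imputer is refit on each training fold. The sketch below is only an illustration under stated assumptions: it uses synthetic data (not the Titanic file), a hypothetical 10% missing rate, and an arbitrary search range of 1 to 10 neighbors.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data (assumption): 200 rows, 3 features, label depends on the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries at random.
X[rng.random(X.shape) < 0.1] = np.nan

# Imputer and model in one pipeline, so imputation is learned per training fold.
pipe = Pipeline([
    ("impute", KNNImputer(weights="distance")),
    ("model", LogisticRegression(max_iter=1000)),
])

# Search over k using 5-fold cross-validated accuracy.
search = GridSearchCV(
    pipe,
    param_grid={"impute__n_neighbors": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Selecting k this way keeps the test split untouched until the final evaluation.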

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [43]:
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [9]:
df = pd.read_csv("E:\\ml_revision\\Missing_values\\Datasets\\titanic_toy.csv")

In [12]:
df.columns

Index(['Age', 'Fare', 'Family', 'Survived'], dtype='object')

In [13]:
df.head()

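|   | Age | Fare | Family | Survived |
|---|-----|------|--------|----------|
| 0 | 22.0 | 7.2500 | 1 | 0 |
| 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 26.0 | 7.9250 | 0 | 1 |
| 3 | 35.0 | 53.1000 | 1 | 1 |
| 4 | 35.0 | 8.0500 | 0 | 0 |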


In [14]:
df.isnull().sum()

Age         177
Fare         45
Family        0
Survived      0
dtype: int64

In [16]:
x = df.drop(columns = ['Survived'])
y = df['Survived']

In [17]:
x , y

(      Age     Fare  Family
 0    22.0   7.2500       1
 1    38.0  71.2833       1
 2    26.0   7.9250       0
 3    35.0  53.1000       1
 4    35.0   8.0500       0
 ..    ...      ...     ...
 886  27.0  13.0000       0
 887  19.0  30.0000       0
 888   NaN  23.4500       3
 889  26.0      NaN       0
 890  32.0   7.7500       0
 
 [891 rows x 3 columns],
 0      0
 1      1
 2      1
 3      1
 4      0
       ..
 886    0
 887    1
 888    0
 889    1
 890    0
 Name: Survived, Length: 891, dtype: int64)

In [18]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state=42)

In [19]:
x_train

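|     | Age | Fare | Family |
|-----|-----|------|--------|
| 331 | 45.5 | 28.5000 | 0 |
| 733 | 23.0 | 13.0000 | 0 |
| 382 | 32.0 | 7.9250 | 0 |
| 704 | 26.0 | 7.8542 | 1 |
| 813 | 6.0 | 31.2750 | 6 |
| ... | ... | ... | ... |
| 106 | 21.0 | 7.6500 | 0 |
| 270 | NaN | 31.0000 | 0 |
| 860 | 41.0 | NaN | 2 |
| 435 | 14.0 | 120.0000 | 3 |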


In [49]:
# Try k = 1..19 and report the test accuracy for each setting
for i in range(1, 20):
    knn = KNNImputer(n_neighbors=i, weights='distance')
    # Fit the imputer on the training split only, then reuse it on the test split
    x_train_modified = knn.fit_transform(x_train)
    x_test_modified = knn.transform(x_test)
    lr = LogisticRegression()
    lr.fit(x_train_modified, y_train)
    pred = lr.predict(x_test_modified)

    print(i,"th iteration -> ", accuracy_score(y_test, pred))



1 th iteration ->  0.6536312849162011
2 th iteration ->  0.659217877094972
3 th iteration ->  0.6536312849162011
4 th iteration ->  0.6703910614525139
5 th iteration ->  0.664804469273743
6 th iteration ->  0.664804469273743
7 th iteration ->  0.664804469273743
8 th iteration ->  0.664804469273743
9 th iteration ->  0.664804469273743
10 th iteration ->  0.6703910614525139
11 th iteration ->  0.6536312849162011
12 th iteration ->  0.659217877094972
13 th iteration ->  0.6536312849162011
14 th iteration ->  0.6536312849162011
15 th iteration ->  0.6536312849162011
16 th iteration ->  0.6536312849162011
17 th iteration ->  0.6536312849162011
18 th iteration ->  0.6536312849162011
19 th iteration ->  0.6536312849162011


### Let's compare this result with mean imputation

In [40]:
si = SimpleImputer(strategy='mean')

In [41]:
x_train_modified2 = si.fit_transform(x_train)
x_test_modified2 = si.transform(x_test)


In [42]:
lr2 = LogisticRegression()
lr2.fit(x_train_modified2, y_train)
pred2 = lr2.predict(x_test_modified2)
accuracy_score(y_test, pred2)

0.6536312849162011

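#### In this case KNN imputation gives slightly better results than mean imputation, with the best accuracy at n_neighbors = 4 or 10

(4th iteration: 0.6703910614525139) > 0.6536312849162011 (mean imputation). The gap is about 1.7 percentage points, which corresponds to 3 of the 179 test rows, so it should be read as a small improvement rather than a decisive one.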