# Multivariate Imputation

Multivariate imputation is a technique where data from more than one column is used to fill a missing value, rather than data from the affected column alone.

### KNN Imputation


- Each row is treated as a vector. For a row with a missing value, the imputer finds the 'k' rows closest to it, measuring distance only over the features that both rows have observed.
- The missing value is replaced by the mean of that feature across those k nearest neighbors. scikit-learn's `KNNImputer` supports a uniform or a distance-weighted mean; it does not offer median or mode.
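The two bullets above can be checked by hand on a tiny array. The sketch below uses made-up numbers (not the Titanic data) and a hypothetical helper, `impute_cell`, that picks the k nearest donor rows and averages them; it is meant to illustrate the idea, not to mirror scikit-learn's internals.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Toy data (made up for illustration): 4 rows, 3 features, two missing cells.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Euclidean distance that skips missing coordinates and rescales by the
# share of coordinates that could actually be compared.
dist = nan_euclidean_distances(X, X)


def impute_cell(X, dist, row, col, k):
    """Hypothetical helper: mean of `col` over the k rows nearest to `row`
    that have `col` observed."""
    donors = [i for i in range(len(X)) if i != row and not np.isnan(X[i, col])]
    donors.sort(key=lambda i: dist[row, i])
    return float(np.mean([X[i, col] for i in donors[:k]]))


manual = impute_cell(X, dist, row=0, col=2, k=2)
imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# Both approaches fill row 0, column 2 with the mean of 3.0 and 5.0.
print(manual, imputed[0, 2])
```

Rows 1 and 2 are the two closest rows to row 0 that have the third feature observed, so the filled value is their average.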


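#### Advantages
- Often more accurate than single-column (mean or median) imputation, because it uses relationships between features

#### Disadvantages
- Computationally expensive: distances to the training rows must be computed for every row that has a missing value.
- Sensitive to outliers, which can distort the imputed values.
- Requires careful selection of k for good results.
- High memory requirement in production, because the training dataset has to be stored on the server.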
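One common way to handle the choice of k is to treat it as a hyperparameter and cross-validate it together with the downstream model. The sketch below runs on synthetic data, and the candidate values for k are arbitrary. The `StandardScaler` step is an addition of this sketch: because the neighbors are found by distance, a feature with a large range would otherwise dominate the search.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 3 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.15] = np.nan  # knock out roughly 15% of the cells

pipe = Pipeline([
    ("scaler", StandardScaler()),   # ignores NaN when fitting, keeps it when transforming
    ("imputer", KNNImputer()),
    ("model", LogisticRegression()),
])

# Try several neighbor counts and keep the one with the best CV accuracy.
search = GridSearchCV(pipe, {"imputer__n_neighbors": [1, 3, 5, 9, 15]}, cv=5)
search.fit(X, y)

best_k = search.best_params_["imputer__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

Putting the imputer inside the pipeline means it is refit on each training fold, so the validation rows never contribute donors.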

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer

In [9]:
df = pd.read_csv("E:\\ml_revision\\Missing_values\\Datasets\\titanic_toy.csv")

In [12]:
df.columns

Index(['Age', 'Fare', 'Family', 'Survived'], dtype='object')

In [13]:
df.head()

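|   | Age | Fare | Family | Survived |
|---|-----|------|--------|----------|
| 0 | 22.0 | 7.2500 | 1 | 0 |
| 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 26.0 | 7.9250 | 0 | 1 |
| 3 | 35.0 | 53.1000 | 1 | 1 |
| 4 | 35.0 | 8.0500 | 0 | 0 |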


In [14]:
df.isnull().sum()

Age         177
Fare         45
Family        0
Survived      0
dtype: int64
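Raw counts are easier to judge as a share of the rows. The sketch below rebuilds only the missing-value pattern reported above (891 rows, 177 missing in Age, 45 in Fare) on a placeholder frame; the non-missing values are dummies, and `toy` is a stand-in name, not the real `df`.

```python
import numpy as np
import pandas as pd

n_rows = 891  # row count reported for the dataset

# Placeholder frame: only the number of missing cells matches the real data.
toy = pd.DataFrame({
    "Age": np.full(n_rows, 30.0),
    "Fare": np.full(n_rows, 10.0),
    "Family": np.zeros(n_rows, dtype=int),
})
toy.loc[:176, "Age"] = np.nan   # .loc slices are inclusive: 177 rows
toy.loc[:44, "Fare"] = np.nan   # 45 rows

# isnull().mean() gives the fraction of missing cells per column.
missing_pct = (toy.isnull().mean() * 100).round(2)
print(missing_pct)
```

On the real frame the same expression would be `df.isnull().mean() * 100`; roughly a fifth of Age and a twentieth of Fare are missing.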

In [16]:
x = df.drop(columns = ['Survived'])
y = df['Survived']

In [17]:
x , y

(      Age     Fare  Family
 0    22.0   7.2500       1
 1    38.0  71.2833       1
 2    26.0   7.9250       0
 3    35.0  53.1000       1
 4    35.0   8.0500       0
 ..    ...      ...     ...
 886  27.0  13.0000       0
 887  19.0  30.0000       0
 888   NaN  23.4500       3
 889  26.0      NaN       0
 890  32.0   7.7500       0
 
 [891 rows x 3 columns],
 0      0
 1      1
 2      1
 3      1
 4      0
       ..
 886    0
 887    1
 888    0
 889    1
 890    0
 Name: Survived, Length: 891, dtype: int64)

In [18]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [19]:
x_train

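|     | Age | Fare | Family |
|-----|-----|------|--------|
| 331 | 45.5 | 28.5000 | 0 |
| 733 | 23.0 | 13.0000 | 0 |
| 382 | 32.0 | 7.9250 | 0 |
| 704 | 26.0 | 7.8542 | 1 |
| 813 | 6.0 | 31.2750 | 6 |
| ... | ... | ... | ... |
| 106 | 21.0 | 7.6500 | 0 |
| 270 | NaN | 31.0000 | 0 |
| 860 | 41.0 | NaN | 2 |
| 435 | 14.0 | 120.0000 | 3 |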


In [21]:
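# Default settings: n_neighbors=5, uniform weights
knn = KNNImputer()

# Fit on the training split only, then reuse the fitted imputer on the test split
x_train_modified = knn.fit_transform(x_train)
x_test_modified = knn.transform(x_test)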
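To make the fit/transform split concrete, the sketch below fits on a small made-up training array and then fills a missing value in an unseen row. It also contrasts the two built-in `weights` options of `KNNImputer`. All numbers are invented for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up training rows: the second feature is 10x the first.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 100.0]])
# Unseen row with the second feature missing.
test = np.array([[2.1, np.nan]])

uniform = KNNImputer(n_neighbors=2, weights="uniform").fit(train)
weighted = KNNImputer(n_neighbors=2, weights="distance").fit(train)

# Donors come from the training rows: the two nearest are [2, 20] and [3, 30].
uniform_value = uniform.transform(test)[0, 1]    # plain mean of 20 and 30
weighted_value = weighted.transform(test)[0, 1]  # pulled toward the closer row

print(uniform_value, weighted_value)
```

With uniform weights the result is 25.0; with distance weights the closer neighbor counts nine times as much, which gives a value of about 21.0. Either way the training rows have to be available at transform time, which is the memory cost listed under the disadvantages.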

In [22]:
x_train_modified

array([[ 45.5    ,  28.5    ,   0.     ],
       [ 23.     ,  13.     ,   0.     ],
       [ 32.     ,   7.925  ,   0.     ],
       ...,
       [ 41.     ,  19.79332,   2.     ],
       [ 14.     , 120.     ,   3.     ],
       [ 21.     ,  77.2875 ,   1.     ]])

In [23]:
pd.DataFrame(x_train_modified, columns=x_train.columns)

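|     | Age | Fare | Family |
|-----|-----|------|--------|
| 0 | 45.5 | 28.50000 | 0.0 |
| 1 | 23.0 | 13.00000 | 0.0 |
| 2 | 32.0 | 7.92500 | 0.0 |
| 3 | 26.0 | 7.85420 | 1.0 |
| 4 | 6.0 | 31.27500 | 6.0 |
| ... | ... | ... | ... |
| 707 | 21.0 | 7.65000 | 0.0 |
| 708 | 34.6 | 31.00000 | 0.0 |
| 709 | 41.0 | 19.79332 | 2.0 |
| 710 | 14.0 | 120.00000 | 3.0 |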
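Wrapping the array in a new DataFrame resets the row labels to 0, 1, 2, and so on, so the imputed rows no longer line up with `y_train` by index. Passing the original index back in avoids that. The sketch below shows the pattern on a four-row frame with made-up values; `frame` and `restored` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Small frame with non-sequential row labels, as produced by a shuffled split.
frame = pd.DataFrame(
    {"Age": [45.5, 23.0, np.nan, 26.0], "Fare": [28.5, 13.0, 31.0, np.nan]},
    index=[331, 733, 270, 860],
)

values = KNNImputer(n_neighbors=2).fit_transform(frame)

# Reattach both the column names and the original row labels.
restored = pd.DataFrame(values, columns=frame.columns, index=frame.index)
print(restored)
```

For the notebook above, the equivalent call would be `pd.DataFrame(x_train_modified, columns=x_train.columns, index=x_train.index)`.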


In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [28]:
lr = LogisticRegression()
lr.fit(x_train_modified, y_train)
pred = lr.predict(x_test_modified)

accuracy_score(y_test, pred)


0.664804469273743
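A single accuracy figure is hard to interpret without a baseline. One way to get a reference point is to run the same model with a simpler imputer, such as `SimpleImputer(strategy="mean")`. The sketch below does this on synthetic data, so its scores say nothing about the Titanic result above. The synthetic features are independent, so there is no reason to expect KNN to win here.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 rows, 3 features, ~20% missing.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

imputers = {
    "knn": KNNImputer(n_neighbors=5),
    "mean": SimpleImputer(strategy="mean"),
}

scores = {}
for name, imputer in imputers.items():
    # Fit each imputer on the training split only, then apply it to both splits.
    X_train_filled = imputer.fit_transform(X_train)
    X_test_filled = imputer.transform(X_test)
    model = LogisticRegression().fit(X_train_filled, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test_filled))

print(scores)
```

Swapping in `x_train`, `x_test`, `y_train`, and `y_test` from the notebook would give the comparison on the actual data.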