## Multivariate Imputation

Multivariate imputation is a technique for __handling missing values__ in a dataset. The missing values of a feature (column) are imputed using values derived from the other features of the dataset.

In this article we discuss the two scikit-learn estimators used for multivariate imputation:

- KNN Imputer
- Iterative Imputer
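
To make the difference from univariate imputation concrete, here is a minimal sketch on a hypothetical toy array (the values are illustrative and not taken from the dataset used below). `SimpleImputer` fills each gap from its own column only, while `KNNImputer` looks at the other features to decide which rows to borrow from.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical toy data: two cells are missing.
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Univariate: each gap is filled with the mean of its own column.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X_toy)

# Multivariate: each gap is filled with the average of that column over
# the 2 rows that are closest in the features both rows have observed.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)

print("column-mean imputation:\n", mean_filled)
print("KNN imputation:\n", knn_filled)
```

The two methods fill the same cells with different values, because the nearest rows are not representative of the whole column.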

In [121]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must run before importing IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# import warnings
# warnings.simplefilter("ignore")

In [122]:
# read csv to df
df = pd.read_csv('./Datasets/train.csv', usecols=['Age', 'Fare', 'Pclass', 'Survived'])

In [4]:
# percentage of missing values per column
df.isnull().mean() * 100

Survived     0.00000
Pclass       0.00000
Age         19.86532
Fare         0.00000
dtype: float64

In [123]:
# split
X = df.drop(columns=['Survived'])
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [18]:
X_train

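|     | Pclass | Age  | Fare     |
|-----|--------|------|----------|
| 331 | 1      | 45.5 | 28.5000  |
| 733 | 2      | 23.0 | 13.0000  |
| 382 | 3      | 32.0 | 7.9250   |
| 704 | 3      | 26.0 | 7.8542   |
| 813 | 3      | 6.0  | 31.2750  |
| ... | ...    | ...  | ...      |
| 106 | 3      | 21.0 | 7.6500   |
| 270 | 1      | NaN  | 31.0000  |
| 860 | 3      | 41.0 | 14.1083  |
| 435 | 1      | 14.0 | 120.0000 |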


## KNN Imputer

In [134]:
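# KNN imputer; n_neighbors and weights are hyperparameters, best tuned with
# cross-validation on the training data rather than against the test set
knn = KNNImputer(n_neighbors=2, weights='uniform')

# fit on the training set only, then apply the same imputer to the test set
X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)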

# model object
lr = LogisticRegression()

# fit
lr.fit(X_train_trf, y_train)

# predict
y_pred = lr.predict(X_test_trf)

# accuracy
accuracy_score(y_test, y_pred)

0.7430167597765364
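
Rather than changing `n_neighbors` and `weights` by hand, the search can be automated. The sketch below is one possible way to do it, not the approach used in this notebook: it wraps the imputer and the classifier in a `Pipeline` and lets `GridSearchCV` score each combination by cross-validation. It runs on synthetic data from `make_classification` as a stand-in for the CSV, and the grid values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Knock out roughly 20% of the first column, mimicking a single
# feature with missing values.
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape[0]) < 0.2, 0] = np.nan

# Imputation lives inside the pipeline, so it is re-fitted on each
# training fold and never sees the validation fold.
pipe = Pipeline([
    ("imputer", KNNImputer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "imputer__n_neighbors": [1, 3, 5, 7],
    "imputer__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the imputer inside the pipeline means the chosen settings are judged on held-out folds, so the test set stays untouched until the final evaluation.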

## Iterative Imputer

In [135]:
# imputer object
imp = IterativeImputer(max_iter=10)

# transform
X_train_imp = imp.fit_transform(X_train)
X_test_imp = imp.transform(X_test)

# model object
lr = LogisticRegression()

# model fit
lr.fit(X_train_imp, y_train)

# predict
y_predimp = lr.predict(X_test_imp)

# accuracy
accuracy_score(y_test, y_predimp)

0.7486033519553073

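We can observe a slight improvement with the iterative imputation method: 0.7486 versus 0.7430 for KNN imputation. The gap is about 0.6 percentage points on a single train/test split, so it should not be read as evidence that one method is better in general.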
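
To see what `IterativeImputer` does on a small scale, here is a minimal sketch on hypothetical toy data in which the second column is twice the first. The imputer models each feature that has missing values as a function of the other features (by default with a `BayesianRidge` regressor) and repeats this for up to `max_iter` rounds. The arrays and the `random_state` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: column 1 is about 2 x column 0.
X_fit = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
X_new = np.array([[np.nan, 2], [6, np.nan], [np.nan, 6]])

toy_imp = IterativeImputer(max_iter=10, random_state=0)
toy_imp.fit(X_fit)

# Missing cells are predicted from the observed cell in the same row;
# observed cells are passed through unchanged.
filled = toy_imp.transform(X_new)
print(np.round(filled, 2))
```

The filled values should land close to the 2x relationship between the columns, which a column-mean fill could not recover.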

## Summary

In this notebook we have learnt how to use scikit-learn's multivariate imputation algorithms, KNNImputer and IterativeImputer, to handle missing values.