## `Handling Multivariate Missing Data (KNN Imputer)`

- In *`Multivariate Imputation`*, the missing values of a column are imputed using the information in the other columns and rows of the dataset, rather than that column alone.
- There are two techniques to do this:
    - *`KNN Imputer`*
    - *`Iterative Imputer`*

### `KNN Imputer`

- This technique is based on the **`KNN`** (k-nearest neighbors) algorithm.
- Here we fill the *`missing values`* of a row using the values from the rows that are most similar to it.
- The similarity is decided by *`Euclidean Distance`*.
- It works in two steps:
    - 1st step: find the *`nearest neighbors`* by calculating the *`Euclidean Distance`*.
    - 2nd step: take the value of the missing feature from each neighbor and use their *`mean`* as the imputed value.
- The rows being compared may themselves have missing values, so the ordinary *`Euclidean Distance`* cannot be computed directly. To overcome this, a variant known as *`nan_Euclidean Distance`* is used: it ignores the coordinates that are missing in either row and scales up the result to account for the coordinates that were skipped.
- To know more about *`nan_Euclidean Distance`* go to:
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.nan_euclidean_distances.html
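
The two steps above can be sketched in plain Python. This is only an illustrative sketch, not scikit-learn's actual implementation: the helper names `nan_euclidean` and `knn_impute_value` are hypothetical, and the toy rows are made up for the example. The distance follows the formula on the linked page (squared differences over the coordinates present in both rows, multiplied by total coordinates / present coordinates).

```python
import math

def nan_euclidean(x, y):
    """Distance over the coordinates present in both rows, rescaled to full width."""
    pairs = [(a, b) for a, b in zip(x, y) if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        return math.nan  # no shared coordinates, so the distance is undefined
    weight = len(x) / len(pairs)  # total coordinates / present coordinates
    return math.sqrt(weight * sum((a - b) ** 2 for a, b in pairs))

def knn_impute_value(rows, row_idx, col_idx, k):
    """Mean of column col_idx over the k nearest rows that have a value there."""
    target = rows[row_idx]
    candidates = []
    for i, other in enumerate(rows):
        # A neighbor is only useful if it has a value in the column being filled
        if i == row_idx or math.isnan(other[col_idx]):
            continue
        d = nan_euclidean(target, other)
        if not math.isnan(d):
            candidates.append((d, other[col_idx]))
    candidates.sort(key=lambda t: t[0])           # step 1: nearest neighbors
    nearest = candidates[:k]
    return sum(v for _, v in nearest) / len(nearest)  # step 2: mean of their values

nan = math.nan
rows = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]  # toy data

filled_a = knn_impute_value(rows, row_idx=0, col_idx=2, k=2)
filled_b = knn_impute_value(rows, row_idx=2, col_idx=0, k=2)
print(filled_a, filled_b)
```

The sketch assumes at least one usable neighbor exists for the missing cell; a real implementation would need a fallback for that case.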

- **Advantages:**
    - This is usually more accurate than *`mean`* and *`median`* imputation, because it uses the other columns.
    - It works well with small and medium sized datasets.
- **Disadvantages:**
    - More calculations take place here than in simple imputation.
    - It is time consuming, as we need to calculate the distance between the given point and every other point and then calculate the *`mean`*.
    - When it is deployed on a server, the entire training dataset also needs to be uploaded, because whenever a user input has a missing value, the training dataset is needed again to calculate it. So the speed becomes slow and more memory is occupied.
    - So it is not preferable for large datasets.
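
The memory point can be checked roughly by serializing a fitted imputer. The sketch below is an assumption-laden illustration: it uses random data rather than the Titanic dataset, the helper `fitted_size_bytes` is hypothetical, and the pickled size is only a proxy for what a deployed imputer has to carry.

```python
import pickle

import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

def fitted_size_bytes(n_rows):
    """Fit a KNNImputer on random data and return the size of its pickle."""
    X = rng.normal(size=(n_rows, 3))
    X[rng.random(n_rows) < 0.2, 0] = np.nan  # about 20% missing in the first column
    imputer = KNNImputer(n_neighbors=3).fit(X)
    return len(pickle.dumps(imputer))

small = fitted_size_bytes(100)
large = fitted_size_bytes(10_000)
print(small, large)  # the fitted object grows with the number of training rows
```

A `SimpleImputer`, by comparison, only needs to keep one statistic per column, which is why it is cheaper to deploy.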

In [1]:
# Importing the libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

In [5]:
# Importing the dataset with the required columns, in the order we want

df = pd.read_csv('datasets/train.csv', usecols=['Age','Pclass','Fare','Survived'])[['Age','Pclass','Fare','Survived']]
df.head()

|  | Age | Pclass | Fare | Survived |
|---|---|---|---|---|
| 0 | 22.0 | 3 | 7.25 | 0 |
| 1 | 38.0 | 1 | 71.2833 | 1 |
| 2 | 26.0 | 3 | 7.925 | 1 |
| 3 | 35.0 | 1 | 53.1 | 1 |
| 4 | 35.0 | 3 | 8.05 | 0 |


In [6]:
# Checking percentage of missing values in each column

df.isnull().mean()*100

Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64

In [7]:
# Creating independent and dependent variables

X = df.drop(columns=['Survived'])  # `columns=` already names the axis, so `axis=1` is redundant
y = df['Survived']

X.head()

|  | Age | Pclass | Fare |
|---|---|---|---|
| 0 | 22.0 | 3 | 7.25 |
| 1 | 38.0 | 1 | 71.2833 |
| 2 | 26.0 | 3 | 7.925 |
| 3 | 35.0 | 1 | 53.1 |
| 4 | 35.0 | 3 | 8.05 |


In [8]:
# Doing train test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

X_train.shape, X_test.shape

((712, 3), (179, 3))

In [9]:
X_train.head()

|  | Age | Pclass | Fare |
|---|---|---|---|
| 30 | 40.0 | 1 | 27.7208 |
| 10 | 4.0 | 3 | 16.7 |
| 873 | 47.0 | 3 | 9.0 |
| 182 | 9.0 | 3 | 31.3875 |
| 876 | 20.0 | 3 | 9.8458 |


In [10]:
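# Now doing the imputation
# n_neighbors = number of neighbors used to impute each missing value
# weights = 'uniform' (default, plain mean of the neighbors) or 'distance' (closer neighbors count more)
# The imputer is fitted on the training set only and then reused on the test set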

knn = KNNImputer(n_neighbors=3, weights='distance')

X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)
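
To see what the `weights` parameter changes, here is a small sketch on a made-up 4x3 array (the names `X_toy`, `uniform`, and `distance` are illustrative and not part of the notebook). With `'uniform'` the imputed value is the plain mean of the neighbors; with `'distance'` the closer neighbor should pull the result toward its own value.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: one missing value in row 0 and one in row 2
X_toy = np.array([[1, 2, np.nan],
                  [3, 4, 3],
                  [np.nan, 6, 5],
                  [8, 8, 7]], dtype=float)

uniform = KNNImputer(n_neighbors=2, weights='uniform').fit_transform(X_toy)
distance = KNNImputer(n_neighbors=2, weights='distance').fit_transform(X_toy)

# Compare the value imputed for row 0, column 2 under both weightings
print(uniform[0, 2], distance[0, 2])
```

Which setting works better depends on the data, so it is reasonable to try both and compare the downstream accuracy.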

In [12]:
# Creating model and checking accuracy

lr = LogisticRegression()

lr.fit(X_train_trf, y_train)

y_pred = lr.predict(X_test_trf)

acc1 = accuracy_score(y_test, y_pred)
print(f"The accuracy of the model with KNN-Imputation is {(acc1)*100:.2f}%")

The accuracy of the model with KNN-Imputation is 71.51%


In [13]:
# Comparison with Simple Imputer --> mean (the default strategy)

si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [14]:
# Again creating model and checking accuracy

lr = LogisticRegression()

lr.fit(X_train_trf2, y_train)

y_pred = lr.predict(X_test_trf2)

acc2 = accuracy_score(y_test, y_pred)
print(f"The accuracy of the model with Simple-Imputation is {(acc2)*100:.2f}%")

The accuracy of the model with Simple-Imputation is 69.27%
