In [1]:
import numpy as np
import pandas as pd
data_fake = pd.DataFrame({
    'items': [np.nan, 2, 4, 8, 10],
    'age': [23, np.nan, 28, 32, 40],
    'cost': [9500, 11000, np.nan, np.nan, 14760]
})

In [2]:
data_fake

| | items | age | cost |
|---|---|---|---|
| 0 | NaN | 23.0 | 9500.0 |
| 1 | 2.0 | NaN | 11000.0 |
| 2 | 4.0 | 28.0 | NaN |
| 3 | 8.0 | 32.0 | NaN |
| 4 | 10.0 | 40.0 | 14760.0 |


If we dropped every row containing a NaN value, only one row (row 4) would remain.<br>
Instead, we can use an imputer to estimate the missing values from the known data.<br>
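
As a quick check of the row count mentioned above, the following sketch rebuilds the same frame and drops incomplete rows with `DataFrame.dropna()`. The names `complete_rows` and `missing_per_column` are only illustrative.

```python
import numpy as np
import pandas as pd

data_fake = pd.DataFrame({
    'items': [np.nan, 2, 4, 8, 10],
    'age': [23, np.nan, 28, 32, 40],
    'cost': [9500, 11000, np.nan, np.nan, 14760]
})

# Keep only the rows that have no NaN in any column.
complete_rows = data_fake.dropna()

# Count the NaN values in each feature.
missing_per_column = data_fake.isna().sum()

print(len(complete_rows))
print(missing_per_column)
```

Only the last row survives, so dropping rows would discard most of the observed values in this small example.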

There are two classes of imputers:
1. Univariate imputers: impute values in the *i*-th feature dimension using only the non-missing values of that same feature.
2. Multivariate imputers: use the entire set of available features to estimate the missing values.

# Simple imputer (univariate)
For each feature, it fills the missing values with a single statistic computed from that feature's observed values.<br>
e.g. the mean or the median of the column.
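
To make the univariate idea concrete, here is a minimal NumPy sketch of a median fill done by hand. It is an illustration of the strategy, not scikit-learn's implementation; the array `X` simply mirrors `data_fake`, and `col_median` and `filled` are illustrative names.

```python
import numpy as np

# Same values as data_fake: columns are items, age, cost.
X = np.array([
    [np.nan, 23.0, 9500.0],
    [2.0, np.nan, 11000.0],
    [4.0, 28.0, np.nan],
    [8.0, 32.0, np.nan],
    [10.0, 40.0, 14760.0],
])

# Median of each column, ignoring NaN entries.
col_median = np.nanmedian(X, axis=0)

# Replace each NaN with the median of its own column.
filled = np.where(np.isnan(X), col_median, X)

print(col_median)
print(filled)
```

Each column is handled independently, which is exactly what makes the approach univariate.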

In [3]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit_transform(data_fake)

array([[6.000e+00, 2.300e+01, 9.500e+03],
       [2.000e+00, 3.000e+01, 1.100e+04],
       [4.000e+00, 2.800e+01, 1.100e+04],
       [8.000e+00, 3.200e+01, 1.100e+04],
       [1.000e+01, 4.000e+01, 1.476e+04]])

In [4]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit_transform(data_fake)

array([[6.00000000e+00, 2.30000000e+01, 9.50000000e+03],
       [2.00000000e+00, 3.07500000e+01, 1.10000000e+04],
       [4.00000000e+00, 2.80000000e+01, 1.17533333e+04],
       [8.00000000e+00, 3.20000000e+01, 1.17533333e+04],
       [1.00000000e+01, 4.00000000e+01, 1.47600000e+04]])

In [7]:
# add_indicator=True appends one binary column per feature that had missing
# values, flagging which entries were imputed (1) and which were observed (0).
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean', add_indicator=True)
pd.DataFrame(imp_mean.fit_transform(data_fake))

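| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 6.0 | 23.0 | 9500.0 | 1.0 | 0.0 | 0.0 |
| 1 | 2.0 | 30.75 | 11000.0 | 0.0 | 1.0 | 0.0 |
| 2 | 4.0 | 28.0 | 11753.333333 | 0.0 | 0.0 | 1.0 |
| 3 | 8.0 | 32.0 | 11753.333333 | 0.0 | 0.0 | 1.0 |
| 4 | 10.0 | 40.0 | 14760.0 | 0.0 | 0.0 | 0.0 |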
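
In the output above, columns 0 to 2 are the imputed `items`, `age` and `cost`, and columns 3 to 5 are the matching indicators. The sketch below rebuilds the same layout by hand with NumPy, assuming a mean fill; `mask` and `with_indicator` are illustrative names, and this is not how scikit-learn implements it internally.

```python
import numpy as np

# Same values as data_fake: columns are items, age, cost.
X = np.array([
    [np.nan, 23.0, 9500.0],
    [2.0, np.nan, 11000.0],
    [4.0, 28.0, np.nan],
    [8.0, 32.0, np.nan],
    [10.0, 40.0, 14760.0],
])

# True where the original value was missing.
mask = np.isnan(X)

# Mean fill, column by column.
filled = np.where(mask, np.nanmean(X, axis=0), X)

# Append the mask as 0/1 columns to the right of the imputed features.
with_indicator = np.hstack([filled, mask.astype(float)])

print(with_indicator)
```

Keeping the mask lets a downstream model learn whether "was missing" is itself informative.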


# Multivariate imputers
Taking age and cost, we assume that the maintenance cost increases with age.<br>
We can therefore impute a higher cost for a higher age.<br>

We can use two different strategies.

## Iterative imputer
Based on a Bayesian ridge regression model by default: each missing value is predicted from all the other features.<br>
To impute the value for cost, the iterative imputer fits a regression model using items and age as features.<br>
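
The round-robin idea can be sketched with plain NumPy. Note the assumptions: this swaps the Bayesian ridge model for ordinary least squares (`np.linalg.lstsq`), starts from a column-mean fill, and runs a fixed, arbitrary number of rounds, so its numbers will not match `IterativeImputer`. It only illustrates the loop structure.

```python
import numpy as np

# Same values as data_fake: columns are items, age, cost.
X = np.array([
    [np.nan, 23.0, 9500.0],
    [2.0, np.nan, 11000.0],
    [4.0, 28.0, np.nan],
    [8.0, 32.0, np.nan],
    [10.0, 40.0, 14760.0],
])

missing = np.isnan(X)

# Initial guess: fill every gap with its column mean.
filled = np.where(missing, np.nanmean(X, axis=0), X)

n_rounds = 5  # arbitrary choice for this sketch
for _ in range(n_rounds):
    for col in range(X.shape[1]):
        miss = missing[:, col]
        if not miss.any():
            continue
        # Design matrix: intercept plus all the other (currently filled) features.
        others = np.delete(filled, col, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        # Fit on rows where this feature was observed...
        coef, *_ = np.linalg.lstsq(A[~miss], filled[~miss, col], rcond=None)
        # ...and overwrite only the originally missing entries with predictions.
        filled[miss, col] = A[miss] @ coef

print(filled)
```

With only five rows an unregularised fit can extrapolate poorly, which is one reason a regularised model such as Bayesian ridge is a sensible default.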

In [8]:
# IterativeImputer is experimental; this import must come first to enable it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

In [10]:
it = IterativeImputer()
pd.DataFrame(it.fit_transform(data_fake))

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.994732 | 23.0 | 9500.0 |
| 1 | 2.0 | 27.846014 | 11000.0 |
| 2 | 4.0 | 28.0 | 11046.924962 |
| 3 | 8.0 | 32.0 | 12283.865998 |
| 4 | 10.0 | 40.0 | 14760.0 |


## K-NN imputer
Imputes each missing value using the K most similar rows.<br>
The K-NN imputer looks for the K rows closest to the incomplete row, comparing them on the features that both rows have, among the rows where the target feature is known.<br>
So for a missing cost, it compares rows on items and age and uses the average cost of the K nearest rows as the imputed value.
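
The neighbour search can be sketched by hand. This assumes the imputer's default NaN-aware Euclidean distance (squared differences over the features both rows share, rescaled by total features over shared features) and uniform weights. `nan_euclidean` and `knn_fill` are hypothetical helper names written for this illustration, not scikit-learn functions.

```python
import numpy as np

# Same values as data_fake: columns are items, age, cost.
X = np.array([
    [np.nan, 23.0, 9500.0],
    [2.0, np.nan, 11000.0],
    [4.0, 28.0, np.nan],
    [8.0, 32.0, np.nan],
    [10.0, 40.0, 14760.0],
])

def nan_euclidean(a, b):
    """Euclidean distance over the features present in both rows,
    rescaled by (number of features / number of shared features)."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf  # sketch assumption: no shared feature means "infinitely far"
    squared = np.sum((a[shared] - b[shared]) ** 2)
    return np.sqrt(a.size / shared.sum() * squared)

def knn_fill(X, row, col, k=2):
    """Return the k nearest donor rows and the mean of their values in `col`."""
    # Donors are the other rows where the target feature is observed.
    donors = [r for r in range(len(X)) if r != row and not np.isnan(X[r, col])]
    donors.sort(key=lambda r: nan_euclidean(X[row], X[r]))
    nearest = donors[:k]
    return nearest, float(np.mean(X[nearest, col]))

for row, col in [(0, 0), (1, 1), (2, 2), (3, 2)]:
    print(row, col, knn_fill(X, row, col))
```

Because the features are not scaled, `cost` differences in the thousands dominate any distance that includes them, which strongly influences which rows count as neighbours.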

In [11]:
from sklearn.impute import KNNImputer
knn = KNNImputer(n_neighbors=2)
pd.DataFrame(knn.fit_transform(data_fake))

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 6.0 | 23.0 | 9500.0 |
| 1 | 2.0 | 30.0 | 11000.0 |
| 2 | 4.0 | 28.0 | 10250.0 |
| 3 | 8.0 | 32.0 | 12880.0 |
| 4 | 10.0 | 40.0 | 14760.0 |


As shown, we have different results between the two multivariate methods.<br>

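Focusing on the missing value for `items` in row 0, the imputer looked for the most similar rows that have a known `items` value. These are rows 2 and 3: their `cost` is missing, so they are compared with row 0 on `age` only (differences of 5 and 9). The imputed value is the average of their `items`: (4 + 8) / 2 = 6.<br>
It did not use rows 1 or 4 because their distance to row 0 includes the unscaled `cost` difference (1500 and 5260), which makes them far more distant.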
