Data Imputation is an area of data science used to infer missing values from surrounding context. It is useful in cases where one expects a value and there isn't one.

The can be entire data points. Or just features in a data point.

## One Dimensional Data

In [2]:
from sklearn.datasets import load_iris, load_wine
import pandas as pd
import plotly.express as px
import numpy as np

In [2]:
data = load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [4]:
px.scatter(df,x='petal length (cm)',y='petal width (cm)',trendline='ols')

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

We have some missing data. Whats the petal width of a flower that has a petal length of between 2 to 3 cm's. We can interpolate linearly with linear regression and allow us to fill in missing data.
This is a bit of a toy example but is acutally very releavant especially when looking at time series data.

![image.png](attachment:image.png)

Some time series models like ARIMA don't cope very well with missing data points and require all expected points to have values. Performing linear iterpolation along a short period is a good way to fill in these required missing data.

## When doesn't this work?

Take this graph for example. Try to interpolate the graph and work out what the values at x = 0 would be.

In [22]:
output = []
for i in range(-17,23):
    values = []
    x = i/5
    values.append(x)
    values.append(x**4 - 2*x**3 - 11*x**2 + 12*x + 60)
    output.append(values)

example_plot = output[0:6] + output [-6:]

In [24]:
fig = px.scatter(pd.DataFrame(example_plot,columns = ['x','y']),x='x',y='y')
fig.update_layout(yaxis_range=[0,100])

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [25]:
fig = px.scatter(pd.DataFrame(output,columns = ['x','y']),x='x',y='y')
fig.update_layout(yaxis_range=[0,100])

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

There are two problems one is that this is a nonlinear plot and also that whilst it looks like an X^2 plot it's actually an X^4 plot. Very sneaky.

Whilst you can fix the first one by looking at non linear iterpolation. The latter is likely impossible without outside information and it's why it's important to always consider both the context and the sampling.

Its also very important to note that we're looking at iterpolation right now. Extrapolation is whole another kettle of risky fish.

# Multi dimensional data

Things start to make a lot more sense when we think about multidimensional data.
We're off building a wine reccomendation website. And we sent John off in the lab to record a whole bunch of wine tests.

In [29]:
data = load_wine()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


In [42]:
import random
for i in range(0,25):
    x = random.randrange(10)
    y = random.randrange(10)
    df.iloc[x,y] = np.nan

In [43]:
df.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,,127.0,2.8,3.06,0.28,2.29,,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,,2.65,,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,,,3.24,0.3,,5.68,1.03,3.17,1185.0
3,14.37,,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,,,0.39,1.82,,1.04,2.93,735.0


We have the classics. We can fill the values with the mean, median and mode.

In [54]:
df['malic_acid'] = df['malic_acid'].fillna(df['malic_acid'].mean())
df['alcalinity_of_ash'] = df['alcalinity_of_ash'].fillna(df['alcalinity_of_ash'].median())
df['magnesium'] = df['magnesium'].fillna(df['magnesium'].mode())

In [53]:
df

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.710000,2.43,19.5,127.0,2.80,3.06,0.28,2.29,,1.04,3.92,1065.0
1,13.20,1.780000,2.14,11.2,,2.65,,0.26,1.28,4.38,1.05,3.40,1050.0
2,13.16,2.360000,2.67,18.6,,,3.24,0.30,,5.68,1.03,3.17,1185.0
3,14.37,2.338531,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0
4,13.24,2.590000,2.87,21.0,118.0,,,0.39,1.82,,1.04,2.93,735.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.650000,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0
174,13.40,3.910000,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0
175,13.27,4.280000,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0
176,13.17,2.590000,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0


Mean and Median are often good ideas for numerical values. Where as mode is often a good idea for categorical data.

# Clustering Approaches to Imputation

Credit to [this useful website](https://machinelearningmastery.com/knn-imputation-for-missing-values-in-machine-learning/)

Effectively here you're looking to find other data points that are similar to the data point that has values you're interested in finding. When we have similar data points we can use the values from them to fill in the gaps.

In [7]:
from sklearn.impute import KNNImputer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = pd.read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
dataframe.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,27
0,2.0,1,530101,38.5,66.0,28.0,3.0,3.0,,2.0,...,45.0,8.4,,,2.0,2,11300,0,0,2
1,1.0,1,534817,39.2,88.0,20.0,,,4.0,1.0,...,50.0,85.0,2.0,2.0,3.0,2,2208,0,0,2
2,2.0,1,530334,38.3,40.0,24.0,1.0,1.0,3.0,1.0,...,33.0,6.7,,,1.0,2,0,0,0,1
3,1.0,9,5290409,39.1,164.0,84.0,4.0,1.0,6.0,2.0,...,48.0,7.2,3.0,5.3,2.0,1,2208,0,0,1
4,2.0,1,530255,37.3,104.0,35.0,,,6.0,2.0,...,74.0,7.4,,,2.0,2,4300,0,0,2


In [9]:
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# print total missing
print('Missing: %d' % sum(np.isnan(X).flatten()))
# define imputer
imputer = KNNImputer()
# fit on the dataset
imputer.fit(X)
# transform the dataset
Xtrans = imputer.transform(X)
# print total missing
print('Missing: %d' % sum(np.isnan(Xtrans).flatten()))

Missing: 1605
Missing: 0


## How do we know if what we've done is any good?

Context is important. Imputing our values is best done for a purpose. And its only when you consider the overall purpose that you can know if your method of imputation is suitable.

In [26]:

# evaluate knn imputation and random forest for the horse colic dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# define modeling pipeline
model = RandomForestClassifier()
imputer = KNNImputer(n_neighbors=5)
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Mean Accuracy: 0.866 (0.051)
