## Comparing imputation strategies

### Introduction

In this notebook I implement and evaluate different strategies for handling missing values. As a last attempt, I use an autoencoder to reconstruct flawed instances within the dataset. I decided to use a small dataset from UCI about glass classification; you can find it here: https://www.kaggle.com/uciml/glass

### Setup

In [1]:
# deep learning libraries
import tensorflow as tf
from tensorflow import keras
import keras.backend as K

# common imports
import pandas as pd
import numpy as np

# setting random seed
np.random.seed(4)
tf.random.set_seed(4)

# Style setup
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=16)
mpl.rc('ytick', labelsize=12)
plt.style.use('fivethirtyeight')
plt.xkcd(False) 

Using TensorFlow backend.


<matplotlib.rc_context at 0x1f668c8de80>

### Loading the data

In [2]:
df = pd.read_csv(r'C:\Users\Aless\Downloads\\glass.csv')
df.shape

(214, 10)

In [3]:
df.columns

Index(['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'Type'], dtype='object')

### Creating missing values

Since the dataset has no missing values, I am going to create them artificially. For the purpose of this project I will add one missing value to 10% of the rows. Note that since this is a very small dataset, the classifier will be more sensitive to changes in the training set, so we hope to see noticeable differences between the strategies.

In [4]:
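missing_size = 0.1
# shuffle the row positions and pick 10% of them to receive a missing value
random = np.arange(df.shape[0])
np.random.shuffle(random)
missing_rows = random[:int(missing_size * df.shape[0])]
# complete rows (kept as they are) and rows that will be altered
data = df.drop(missing_rows, axis = 0).values 
missing_data = df.loc[missing_rows].values
# one random feature column per altered row (the last column, Type, is excluded)
missing_cols = np.random.randint(0, data.shape[1] - 1, int(missing_size * data.shape[0]))
# missing_cols is sized from data, not missing_data; zip stops at the shorter sequence
zips = list(zip(range(missing_data.shape[0]), missing_cols))
# 0 is used as a placeholder for the missing entry
for zip_ in zips:
    missing_data[zip_[0], zip_[1]] = 0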
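
The cell above pairs each selected row with one column through `zip`. As a minimal sketch of the same idea with `np.nan` as the marker, NumPy can do the pairing directly: two integer arrays of equal length index an array cell by cell. The array `values`, the seed and the sizes below are made up for illustration and are not the glass data. Note that pandas `df.iloc[rows, cols]` with two lists behaves differently and selects the whole row-by-column block.

```python
import numpy as np

rng = np.random.default_rng(0)            # assumption: any seed works here
values = rng.normal(size=(20, 5))         # hypothetical stand-in for the features

n_missing = 4                             # number of rows that get one gap
rows = rng.choice(values.shape[0], size=n_missing, replace=False)
cols = rng.integers(0, values.shape[1], size=n_missing)

masked = values.copy()
# Equal-length integer arrays are paired element-wise:
# (rows[0], cols[0]), (rows[1], cols[1]), ...
masked[rows, cols] = np.nan

nan_per_row = np.isnan(masked).sum(axis=1)
print(nan_per_row[rows])                  # NaN count in each selected row
print(int(np.isnan(masked).sum()))        # total number of NaN cells
```

With this pairing every selected row ends up with exactly one missing cell, which matches the setup described in the text.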

### First strategy: dropping missing values

As a baseline, I am going to evaluate a classifier on the smaller set without missing values. As base classifier I chose a support vector machine, since it is known to achieve good results even with a reduced number of instances. But first let us preprocess the data.

In [5]:
from sklearn.model_selection import train_test_split

X = data[:, :-1]
y = data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

from sklearn.preprocessing import StandardScaler

st_sc = StandardScaler()
st_sc.fit(X_train)
X_train = st_sc.transform(X_train)

#### Cross validating

In [6]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import make_scorer

svc = SVC(gamma = 'scale', class_weight = 'balanced')
auc = make_scorer(roc_auc_score)
scores = cross_val_score(svc, X_train, y_train, cv = 5, scoring = 'accuracy')
print('Mean accuracy on cross validation: ', np.round(scores.mean(), 4))

Mean accuracy on cross validation:  0.5794
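
The cell above fits the scaler once on the whole training split and then cross-validates the classifier. A possible variant, sketched here on synthetic data from `make_classification` (an assumption, not the glass dataset), wraps the scaler and the SVM in a scikit-learn pipeline so that the scaler is refit inside every fold. The names `X_demo`, `y_demo`, `model` and `demo_scores` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 9 features and 3 classes (not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=9,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# The scaler is fit only on the training part of each fold
model = make_pipeline(StandardScaler(),
                      SVC(gamma='scale', class_weight='balanced'))
demo_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print('Mean accuracy on cross validation: ', np.round(demo_scores.mean(), 4))
```

On a dataset this small the difference is usually minor, but the pipeline keeps the held-out fold completely unseen during preprocessing.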


### Imputing missing values 

As a second strategy I am going to use various types of imputer: this approach consists in replacing each missing value according to a predefined rule, such as the column mean or median. It keeps all the instances in the dataset, but it introduces a certain degree of approximation into the process.
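
To make the replacement rule concrete, here is a toy sketch with scikit-learn's `SimpleImputer` on a hand-written array (the array `toy` is made up for illustration). Each gap is filled with a statistic of the observed values in the same column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

toy = np.array([[1.0, 10.0],
                [2.0, np.nan],
                [np.nan, 30.0],
                [9.0, 20.0]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(toy)
median_filled = SimpleImputer(strategy='median').fit_transform(toy)

print(mean_filled[2, 0])     # 4.0, the mean of 1, 2 and 9
print(median_filled[2, 0])   # 2.0, the median of 1, 2 and 9
print(mean_filled[1, 1])     # 20.0, the mean of 10, 30 and 20
```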

In [7]:
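# mark the gaps as NaN; iloc with two arrays selects every row/column combination, not one cell per row
df.iloc[missing_rows, missing_cols] = np.nan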
def compute_imputer_scores(strategy, iterative = False):
    if not iterative:
        from sklearn.impute import SimpleImputer
        # transforming the data with the imputer
        imputer = SimpleImputer(missing_values = np.nan, strategy = strategy)
        data = imputer.fit_transform(df.values)
        # creating the set
        X = data[:, :-1]
        y = data[:, -1]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
        # preprocessing
        st_sc = StandardScaler()
        st_sc.fit(X_train)
        X_train = st_sc.transform(X_train)
        # cross validating
        svc = SVC(gamma = 'scale', class_weight = 'balanced')
        auc = make_scorer(roc_auc_score)
        scores = cross_val_score(svc, X_train, y_train, cv = 5, scoring = 'accuracy')
        return scores
    
    else:
        from sklearn.experimental import enable_iterative_imputer
        from sklearn.impute import IterativeImputer
        # transforming the data with the imputer
        imputer = IterativeImputer(missing_values = np.nan, max_iter = 50)
        data = imputer.fit_transform(df.values)
        # creating the set
        X = data[:, :-1]
        y = data[:, -1]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
        # preprocessing
        st_sc = StandardScaler()
        st_sc.fit(X_train)
        X_train = st_sc.transform(X_train)
        # cross validating
        svc = SVC(gamma = 'scale', class_weight = 'balanced')
        auc = make_scorer(roc_auc_score)
        scores = cross_val_score(svc, X_train, y_train, cv = 5, scoring = 'accuracy')
        return scores
    
print('Mean accuracy on cross validation with strategy = median: ', np.round(compute_imputer_scores(strategy = 'median').mean(), 5))
print('Mean accuracy on cross validation with strategy = mean: ', np.round(compute_imputer_scores(strategy = 'mean').mean(), 5))
print('Mean accuracy on cross validation with IterativeImputer: ', np.round(compute_imputer_scores(strategy = None, iterative = True).mean(), 5))

Mean accuracy on cross validation with strategy = median:  0.52862
Mean accuracy on cross validation with strategy = mean:  0.53507
Mean accuracy on cross validation with IterativeImputer:  0.55305


The cross-validated accuracy was negatively affected by the imputation strategies. This is not a general rule, but a bad imputation can lead to worse results, especially on small datasets.
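
Because the gaps in this kind of experiment are created artificially, the true values are known, so the quality of an imputation can also be checked directly instead of only through the classifier. The following is a minimal sketch on made-up Gaussian data; `truth`, `corrupted` and the helper `imputation_rmse` are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)
truth = rng.normal(loc=5.0, scale=2.0, size=(100, 4))   # hypothetical complete data

# Hide one cell in each of 10 distinct rows
rows = rng.choice(truth.shape[0], size=10, replace=False)
cols = rng.integers(0, truth.shape[1], size=10)
corrupted = truth.copy()
corrupted[rows, cols] = np.nan

def imputation_rmse(strategy):
    """Root mean squared error of the imputed cells against the hidden truth."""
    filled = SimpleImputer(strategy=strategy).fit_transform(corrupted)
    errors = filled[rows, cols] - truth[rows, cols]
    return float(np.sqrt(np.mean(errors ** 2)))

for strategy in ('mean', 'median'):
    print(strategy, round(imputation_rmse(strategy), 3))
```

A large error on the hidden cells would be one sign that the imputer is hurting rather than helping the downstream model.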

### Neural imputing with an autoencoder

An autoencoder is a neural network that tries to reproduce its inputs while forcing the data to pass through a bottleneck. This forces the network to learn patterns in the data in order to map them to the so-called latent space (the bottleneck just mentioned). We can use this powerful idea to build a small autoencoder able to reconstruct instances with a missing value.
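
As a rough intuition for the bottleneck, here is a linear analogue built with plain NumPy. It is PCA computed through an SVD, not the Keras network used in this notebook, and the data is synthetic: 9 features generated from 2 hidden factors plus a little noise. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 9 features driven by 2 hidden factors plus small noise
factors = rng.normal(size=(150, 2))
mixing = rng.normal(size=(2, 9))
X_lin = factors @ mixing + 0.05 * rng.normal(size=(150, 9))

mean = X_lin.mean(axis=0)
# The right singular vectors are the directions of largest variance
_, _, vt = np.linalg.svd(X_lin - mean, full_matrices=False)
components = vt[:2]                        # bottleneck of size 2

codings = (X_lin - mean) @ components.T    # "encoder": 9 features -> 2 codings
restored = codings @ components + mean     # "decoder": 2 codings -> 9 features

print(codings.shape)
print(float(np.mean((X_lin - restored) ** 2)))   # reconstruction error
```

Since the data really has only two underlying factors, two codings are enough to rebuild it almost exactly. A nonlinear autoencoder follows the same encode and decode pattern but can capture curved structure as well.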

In [8]:
codings_size = 2

encoder = keras.models.Sequential([keras.layers.Flatten(input_shape = [data.shape[1]]),
                                   keras.layers.Dense(8, activation = 'relu'),
                                   keras.layers.Dense(5, activation = 'relu'),
                                   keras.layers.Dense(codings_size)])
decoder = keras.models.Sequential([keras.layers.Flatten(input_shape = [codings_size]),
                                   keras.layers.Dense(5, activation = 'relu'),
                                   keras.layers.Dense(8, activation = 'relu'),
                                   keras.layers.Dense(data.shape[1])])
ae = keras.models.Sequential([encoder, decoder])
ae.compile(optimizer = 'nadam', loss = 'mean_squared_error')
ae.fit(data, data, epochs = 100, batch_size = 16)


<tensorflow.python.keras.callbacks.History at 0x1f605140cf8>

Now that we have a trained autoencoder we can use it to predict the missing data. Note that I am not taking the whole reconstruction from the autoencoder for each instance; instead I only take the output value at the position of the missing entry.

In [9]:
reconstructed_data = ae.predict(missing_data)
# copying the reconstructed missing values to the original array
for zip_ in zips:
    missing_data[zip_[0], zip_[1]] = reconstructed_data[zip_[0], zip_[1]]
# re-creating the sets
data = np.concatenate([data, missing_data])
X = data[:, :-1]
y = data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
# preprocessing
st_sc = StandardScaler()
st_sc.fit(X_train)
X_train = st_sc.transform(X_train)
# cross validating
svc = SVC(gamma = 'scale', class_weight = 'balanced')
auc = make_scorer(roc_auc_score)
scores = cross_val_score(svc, X_train, y_train, cv = 5, scoring = 'accuracy')
print('Mean accuracy on cross validation: ', np.round(scores.mean(), 4))

Mean accuracy on cross validation:  0.609
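
The cell-by-cell copy used above can also be written with a boolean mask, which keeps every observed value and takes the reconstruction only where a value was missing. This is a small sketch with hand-written arrays; `reconstruction` stands in for a hypothetical autoencoder output.

```python
import numpy as np

original = np.array([[1.0, 0.0, 3.0],
                     [4.0, 5.0, 0.0]])         # 0.0 marks the masked cells
reconstruction = np.array([[1.1, 2.2, 2.9],
                           [3.8, 5.3, 6.1]])   # hypothetical model output

mask = np.zeros_like(original, dtype=bool)
mask[[0, 1], [1, 2]] = True                    # positions of the masked cells

# Take the reconstruction where the mask is True, the original elsewhere
filled = np.where(mask, reconstruction, original)
print(filled)
```

Keeping an explicit mask also avoids confusing a placeholder such as 0 with a genuine zero in the data.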


### Conclusion

In [11]:
(194 - 216) / 194

-0.1134020618556701

The so-called neural imputing increased the classifier's cross-validated accuracy by about 3 percentage points (from 57.9% to 60.9%). Another interesting comparison is with the accuracy of the 'simple-imputed' models: the autoencoder outperformed all of them, by roughly 6 to 8 percentage points. Note that I tuned neither the autoencoder nor the IterativeImputer, so keep this in mind if you want to push performance further. The last point is about future study: to better evaluate the new model we should test it with more missing values (ideally realistic ones) and on different datasets. If the model keeps delivering these promising results, we can proceed to implement a class and a related library for large-scale use.
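
For reference, the gaps between strategies can be computed from the mean cross-validation accuracies printed in the cells above. The snippet only redoes that arithmetic; the dictionary keys are labels chosen here for readability.

```python
# Mean cross-validation accuracies as printed in this notebook
reported = {
    'drop rows': 0.5794,
    'median imputer': 0.52862,
    'mean imputer': 0.53507,
    'IterativeImputer': 0.55305,
    'autoencoder': 0.609,
}

baseline = reported['drop rows']
# Difference to the drop-rows baseline, in percentage points
gaps = {name: round((score - baseline) * 100, 2)
        for name, score in reported.items()}

for name, gap in gaps.items():
    print(f'{name:>17}: {gap:+.2f} pp')
```

These are differences between single cross-validation means on slightly different training sets, so they should be read as indicative rather than as a statistical test.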