# MIDAS demonstration

This notebook provides a brief demonstration of **MIDASpy**'s core functionalities. We show how to use the package to impute missing values in the Adult census dataset (which is commonly used for benchmarking in the machine learning literature).

Users of **MIDASpy** must have TensorFlow installed as a **pip** package in their Python environment. **MIDASpy** is compatible with TensorFlow 1.X and with TensorFlow 2.2 or later.


Once these packages have been installed, users can import the dependencies and load the data:

In [24]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
import tensorflow as tf
import MIDASpy as md

data_0 = pd.read_csv('adult_data.csv')
data_0.columns = data_0.columns.str.strip()  # assign the stripped names back
data_0.columns

Index(['Unnamed: 0', 'age', 'workclass', 'fnlwgt', 'education',
       'education_num', 'marital_status', 'occupation', 'relationship', 'race',
       'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
       'native_country', 'class_labels'],
      dtype='object')

As the dataset has a very low proportion of missingness (one of the reasons we selected it for the accuracy test), we randomly set 5,000 observed values in each column to missing. The `spike_in` frame records which cells were removed:

In [25]:
np.random.seed(441)

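def spike_in_generation(data):
    # Indicator frame: 1 marks a cell that will be set to missing
    spike_in = pd.DataFrame(np.zeros_like(data), columns=data.columns)
    for column in data.columns:
        # Sample 5000 row labels among the observed values of this column
        observed = data[column].index[data[column].notnull()]
        subset = np.random.choice(observed, 5000, replace=False)
        spike_in.loc[subset, column] = 1
    return spike_in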

spike_in = spike_in_generation(data_0)
original_value = data_0.loc[4, 'hours_per_week']
data_0[spike_in == 1] = np.nan
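
The spike-in step can be checked on a small example. The following is a minimal sketch, not part of **MIDASpy**: the helper `spike_in_mask`, the toy data, and the use of a NumPy `Generator` are all assumptions made for illustration. It shows that only observed cells are selected and that each column gains exactly the requested number of missing values.

```python
import numpy as np
import pandas as pd

def spike_in_mask(data, n_missing, rng):
    """Return a boolean mask marking n_missing observed cells per column."""
    mask = pd.DataFrame(False, index=data.index, columns=data.columns)
    for column in data.columns:
        # Only cells that are currently observed are eligible
        observed = data.index[data[column].notnull()]
        chosen = rng.choice(observed, size=n_missing, replace=False)
        mask.loc[chosen, column] = True
    return mask

# Hypothetical toy data standing in for the Adult dataset
toy = pd.DataFrame({
    "age": [39.0, 50.0, 38.0, 53.0, 28.0, 37.0],
    "hours_per_week": [40.0, 13.0, np.nan, 40.0, 40.0, 45.0],
})

rng = np.random.default_rng(441)
mask = spike_in_mask(toy, n_missing=2, rng=rng)

already_missing = toy.isnull()
toy_spiked = toy.mask(mask)  # cells where mask is True become NaN

# Count the cells that became missing because of the spike-in
new_missing_per_column = (toy_spiked.isnull() & ~already_missing).sum()
print(new_missing_per_column)
```

Keeping the mask separate from the data makes it possible to look up the removed cells later and compare imputed values against the originals.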

Next, we list the categorical variables and one-hot encode them using **MIDASpy**'s inbuilt preprocessing function `cat_conv`, which returns both the encoded data and a nested list of categorical column names that we can pass to the imputation algorithm. To construct the final, pre-processed data, we append the one-hot encoded categorical data to the non-categorical data and replace null values with `np.nan`:

In [26]:
categorical = ['workclass','marital_status','relationship','race','class_labels','sex','education','occupation','native_country']
data_cat, cat_cols_list = md.cat_conv(data_0[categorical])

data_0.drop(categorical, axis = 1, inplace = True)
constructor_list = [data_0]
constructor_list.append(data_cat)
data_in = pd.concat(constructor_list, axis=1)

na_loc = data_in.isnull()
data_in[na_loc] = np.nan
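
To make the two return values concrete, here is a rough sketch of an encoder with the same shape of output: an encoded frame plus a nested list of column names, one inner list per categorical variable. It is an illustration built on `pd.get_dummies`, not the actual implementation of `cat_conv`; the function name `one_hot_keep_missing` and the toy data are hypothetical, and `cat_conv` may differ in details such as how it handles binary variables.

```python
import numpy as np
import pandas as pd

def one_hot_keep_missing(cat_data):
    """One-hot encode each column, keeping rows with a missing category as NaN."""
    encoded_parts = []
    column_groups = []
    for name in cat_data.columns:
        col = cat_data[name]
        dummies = pd.get_dummies(col, prefix=name).astype(float)
        # A missing category should stay missing, not become a row of zeros
        dummies.loc[col.isnull(), :] = np.nan
        encoded_parts.append(dummies)
        column_groups.append(list(dummies.columns))
    return pd.concat(encoded_parts, axis=1), column_groups

# Hypothetical toy data with one missing value per variable
toy_cat = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private"],
    "sex": ["Male", "Female", np.nan, "Female"],
})

encoded, groups = one_hot_keep_missing(toy_cat)
print(encoded)
print(groups)
```

The nested list matters because it tells the model which output columns belong to the same variable.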

To inspect the first rows of the pre-processed data:


In [27]:
print(data_in.head())

   Unnamed: 0   age    fnlwgt  education_num  capital_gain  capital_loss  \
0         0.0  39.0   77516.0           13.0        2174.0           0.0   
1         1.0  50.0   83311.0           13.0           0.0           0.0   
2         2.0  38.0  215646.0            9.0           0.0           0.0   
3         3.0  53.0  234721.0            NaN           0.0           0.0   
4         4.0  28.0       NaN           13.0           0.0           NaN   

   hours_per_week  workclass_Federal-gov  workclass_Local-gov  \
0            40.0                    0.0                  0.0   
1            13.0                    0.0                  0.0   
2            40.0                    0.0                  0.0   
3            40.0                    0.0                  0.0   
4             NaN                    0.0                  0.0   

   workclass_Never-worked  ...  native_country_Portugal  \
0                     0.0  ...                      0.0   
1                     0.0  ...    

The data are now ready to be fed into the MIDAS algorithm, which involves three steps. First, we specify the dimensions, input corruption proportion, and other hyperparameters of the MIDAS neural network. Second, we build a MIDAS model based on the data. The nested list of one-hot-encoded column names should be passed to the `softmax_columns` argument, as MIDAS employs a softmax final-layer activation function for categorical variables. Third, we train the model on the data, setting the number of training epochs to 20 in this example:

In [29]:
imputer = md.Midas(layer_structure = [256,256], vae_layer = False, seed = 89, input_drop = 0.75)
imputer.build_model(data_in, softmax_columns = cat_cols_list)
imputer.train_model(training_epochs = 20)

Size index: [7, 8, 7, 6, 5, 2, 2, 16, 14, 41]

Computation graph constructed

Model initialised

Epoch: 0 , loss: 131061.55897771953
Epoch: 1 , loss: 94879.03236991112
Epoch: 2 , loss: 90927.99433900925
Epoch: 3 , loss: 88644.25104503706
Epoch: 4 , loss: 85632.2174963395
Epoch: 5 , loss: 80600.93999836173
Epoch: 6 , loss: 76585.74140164237
Epoch: 7 , loss: 75515.47067752703
Epoch: 8 , loss: 74397.19318722867
Epoch: 9 , loss: 73943.73610222293
Epoch: 10 , loss: 74009.57563943726
Epoch: 11 , loss: 74607.9188243621
Epoch: 12 , loss: 73648.4787952828
Epoch: 13 , loss: 73824.28025167923
Epoch: 14 , loss: 73013.18742640584
Epoch: 15 , loss: 73926.46652169684
Epoch: 16 , loss: 72989.6765536687
Epoch: 17 , loss: 73968.84397654202
Epoch: 18 , loss: 73285.7860169817
Epoch: 19 , loss: 73451.92790332159
Training complete. Saving file...
Model saved in file: tmp/MIDAS


<MIDASpy.midas_base.Midas at 0x7fc5292e25e0>
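
As background for the `softmax_columns` argument, the sketch below shows the general softmax function applied to one group of one-hot columns. It is the textbook definition written in plain Python, not code taken from **MIDASpy**, and the logit values are made up for illustration. The point is that the outputs for one categorical variable are non-negative and sum to one, so they can be read as category probabilities.

```python
import math

def softmax(logits):
    """Map raw scores for one categorical variable to probabilities."""
    # Subtracting the maximum avoids overflow without changing the result
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw network outputs for a three-category variable
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
print(probs)
```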

Once training is complete, we can generate any number of imputed datasets using the `generate_samples` function (here we set `m=10`). Users can then either write these imputations to separate .csv files or work with them directly in Python:

In [30]:
imputations = imputer.generate_samples(m=10).output_list

# Optionally write each imputed dataset to its own .csv file:
# n = 0
# for i in imputations:
#    file_out = "midas_imp_" + str(n) + ".csv"
#    i.to_csv(file_out, index=False)
#    n += 1

INFO:tensorflow:Restoring parameters from tmp/MIDAS
Model restored.
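
One simple way to work with the imputations directly is to collect the value imputed for a single cell across all datasets and summarise it. The sketch below assumes only that the imputations are a list of DataFrames; `toy_imputations` and its numbers are hypothetical stand-ins, not output from the model trained above.

```python
import pandas as pd

# Hypothetical stand-in for a list of three imputed datasets
toy_imputations = [
    pd.DataFrame({"hours_per_week": [40.0, 13.0, 38.0]}),
    pd.DataFrame({"hours_per_week": [40.0, 13.0, 42.0]}),
    pd.DataFrame({"hours_per_week": [40.0, 13.0, 40.0]}),
]

# The cell of interest: in this toy example, row 2 was missing before imputation
row, column = 2, "hours_per_week"
draws = pd.Series([df.loc[row, column] for df in toy_imputations])

# The mean is a point summary; the spread reflects imputation uncertainty
print(draws.mean(), draws.std())
```

Applied to the real `imputations` list, the same pattern could be used to compare the imputed values for a spiked-in cell with a stored original such as `original_value`.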
