# MIDAS demonstration

This notebook briefly demonstrates the **Midas** class, the Python software we have developed to implement MIDAS. We show how to use the class to multiply impute missing values in the Adult census dataset, the basis for our applied accuracy test.

To access the class, users must have TensorFlow installed as a **pip** package in their Python environment. We recommend creating a dedicated Conda environment before installing TensorFlow, to avoid conflicts with projects that use later versions of the package. MIDAS is written against the TensorFlow 1.X API; users who currently have TensorFlow 2.X can install a compatible version from the command line with `pip install tensorflow==1.14.0`.
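
As a quick check before running the rest of the notebook, the following sketch reports which major version of TensorFlow is visible in the active environment. It assumes the package is registered under the distribution name `tensorflow` (variants such as GPU builds may use a different name), and the helper `tensorflow_major_version` is a hypothetical name introduced here for illustration.

```python
from importlib import metadata


def tensorflow_major_version():
    """Return the installed TensorFlow major version, or None if it is absent."""
    try:
        # Assumes the distribution is named "tensorflow".
        version = metadata.version("tensorflow")
    except metadata.PackageNotFoundError:
        return None
    return int(version.split(".")[0])


major = tensorflow_major_version()
if major is None:
    print("TensorFlow is not installed in this environment.")
elif major != 1:
    print(f"Found TensorFlow {major}.x; this notebook assumes the 1.x API.")
else:
    print("TensorFlow 1.x found.")
```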


Once these packages have been installed, users can import the dependencies and load the data:

In [2]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
import tensorflow as tf
from midas import Midas

data_0 = pd.read_csv('data/adult_data.csv')
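# Assign the stripped names back; str.strip() alone does not modify data_0
data_0.columns = data_0.columns.str.strip()
data_0.columns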

Index(['Unnamed: 0', 'age', 'workclass', 'fnlwgt', 'education',
       'education_num', 'marital_status', 'occupation', 'relationship', 'race',
       'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
       'native_country', 'class_labels'],
      dtype='object')

As the dataset has a very low proportion of missingness (one of the reasons we selected it for the accuracy test), we randomly set 5000 observed values as missing in each column:

In [3]:
np.random.seed(441)

def spike_in_generation(data):
    spike_in = pd.DataFrame(np.zeros_like(data), columns= data.columns)
    for column in data.columns:
        subset = np.random.choice(data[column].index[data[column].notnull()], 5000, replace= False)
        spike_in.loc[subset, column] = 1
    return spike_in

spike_in = spike_in_generation(data_0)
original_value = data_0.loc[4, 'hours_per_week']
data_0[spike_in == 1] = np.nan
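
To make the spike-in procedure easier to follow, here is a minimal sketch of the same idea on a toy data frame. The toy data, the value of `n_spike`, and the use of a boolean mask with `DataFrame.mask` are illustrative choices, not part of the notebook's own code. Only cells that are currently observed are eligible, so each column gains exactly `n_spike` new missing values.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Toy data with one value that is already missing.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "sex": ["Male", "Male", None, "Female", "Female", "Male"],
})

n_spike = 2  # number of observed values to remove per column
mask = pd.DataFrame(False, index=toy.index, columns=toy.columns)
for column in toy.columns:
    # Sample only from rows where the column is currently observed.
    observed = toy.index[toy[column].notnull()]
    chosen = rng.choice(observed, n_spike, replace=False)
    mask.loc[chosen, column] = True

# Replace the selected cells with missing values.
spiked = toy.mask(mask)
print(spiked.isnull().sum())
```

Keeping the mask around, as the notebook does with `spike_in`, makes it possible to compare imputed values with the true values that were removed.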

Next, we list the categorical variables and one-hot encode each of them with the pandas function `get_dummies`, restoring missing values in the resulting dummy columns:

In [4]:
categorical = ['workclass','marital_status','relationship','race','class_labels','sex','education','occupation','native_country']

data_1 = data_0[categorical]
data_0.drop(categorical, axis = 1, inplace = True)

constructor_list = [data_0]
columns_list = []

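for column in data_1.columns:
    na_temp = data_1[column].isnull()  # rows where this variable is missing
    # dtype=float lets the dummy columns hold NaN without a dtype change
    temp = pd.get_dummies(data_1[column], prefix = column, dtype = float)
    temp[na_temp] = np.nan  # restore missingness across all dummy columns
    constructor_list.append(temp)
    columns_list.append(list(temp.columns.values))  # one list of names per variable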
    
data_0 = pd.concat(constructor_list, axis=1)

na_loc = data_0.isnull()
data_0[na_loc] = np.nan
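
The reason for restoring `NaN` after encoding is that `get_dummies` by default encodes a missing category as a row of zeros, which would be indistinguishable from an observed value. The sketch below shows the effect on a single toy column; the data are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"sex": ["Male", np.nan, "Female", "Male"]})

# Remember which rows are missing before encoding.
missing = toy["sex"].isnull()

# Encode as floats so the dummy columns can hold NaN.
dummies = pd.get_dummies(toy["sex"], prefix="sex").astype(float)

# Without this step, row 1 would be all zeros rather than missing.
dummies.loc[missing, :] = np.nan
print(dummies)
```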

To visualise the results:


In [5]:
print(data_0.head())

   Unnamed: 0   age    fnlwgt  education_num  capital_gain  capital_loss  \
0         0.0  39.0   77516.0           13.0        2174.0           0.0   
1         1.0  50.0   83311.0           13.0           0.0           0.0   
2         2.0  38.0  215646.0            9.0           0.0           0.0   
3         3.0  53.0  234721.0            NaN           0.0           0.0   
4         4.0  28.0       NaN           13.0           0.0           NaN   

   hours_per_week  workclass_Federal-gov  workclass_Local-gov  \
0            40.0                    0.0                  0.0   
1            13.0                    0.0                  0.0   
2            40.0                    0.0                  0.0   
3            40.0                    0.0                  0.0   
4             NaN                    0.0                  0.0   

   workclass_Never-worked  ...  native_country_Portugal  \
0                     0.0  ...                      0.0   
1                     0.0  ...    

The data are now ready to be fed into the MIDAS algorithm, which involves three steps. First, we specify the dimensions, input corruption proportion, and other hyperparameters of the MIDAS neural network. Second, we build a MIDAS model based on the data. The nested list of one-hot-encoded column names (`columns_list`) should be passed to the `softmax_columns` argument, as MIDAS employs a softmax final-layer activation function for categorical variables. Third, we train the model on the data, setting the number of training epochs to 20 in this example:

In [8]:
imputer = Midas(layer_structure = [256,256], vae_layer = False, seed = 89, input_drop = 0.75)
imputer.build_model(data_0, softmax_columns = columns_list)
imputer.train_model(training_epochs = 20)

Size index: [7, 8, 7, 6, 5, 2, 2, 16, 14, 41]

Computation graph constructed

Model initialised

Epoch: 0 , loss: 133848.50805376086
Epoch: 1 , loss: 95065.40653797715
Epoch: 2 , loss: 90628.25024318071
Epoch: 3 , loss: 85635.75156979542
Epoch: 4 , loss: 79943.5477344518
Epoch: 5 , loss: 76176.41991035591
Epoch: 6 , loss: 75168.03345590494
Epoch: 7 , loss: 73609.58660368713
Epoch: 8 , loss: 73962.98218317394
Epoch: 9 , loss: 73652.5491948159
Epoch: 10 , loss: 72959.83611860563
Epoch: 11 , loss: 73128.3418826282
Epoch: 12 , loss: 73564.0481312203
Epoch: 13 , loss: 73355.49027725159
Epoch: 14 , loss: 72929.31829857982
Epoch: 15 , loss: 72174.56946968689
Epoch: 16 , loss: 73270.2137865539
Epoch: 17 , loss: 71626.98738724095
Epoch: 18 , loss: 72509.55602243406
Epoch: 19 , loss: 72089.0846345634
Training complete. Saving file...
Model saved in file: tmp/MIDAS


<midas.midas_base.Midas at 0x1a32e33438>
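
To illustrate why the one-hot-encoded columns are passed in groups, the following NumPy sketch applies a softmax separately to each group of output columns, so that the predicted probabilities for the levels of each categorical variable sum to one. It is a conceptual illustration only, not the code MIDAS runs internally; the `logits` values and the `groups` index lists are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical raw outputs for 2 rows and 5 output columns:
# columns 0-1 belong to one categorical variable, columns 2-4 to another.
logits = np.array([[2.0, 0.0, 1.0, 1.0, 1.0],
                   [0.0, 0.0, -1.0, 3.0, 0.5]])
groups = [[0, 1], [2, 3, 4]]

probs = np.empty_like(logits)
for cols in groups:
    # Normalise within each variable's block of columns only.
    probs[:, cols] = softmax(logits[:, cols])

print(probs.round(3))
```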

Once training is complete, we can generate any number of imputed datasets using the `generate_samples` method (here we set `m` to 10). Users can then either write these imputations to separate .csv files or work with them directly in Python:

In [9]:
imputations = imputer.generate_samples(m=10).output_list 

# for n, imputed in enumerate(imputations):
#     file_out = "midas_imp_" + str(n) + ".csv"
#     imputed.to_csv(file_out, index=False)

Model restored.
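
One common way to work with the imputations directly is to compute the quantity of interest in each imputed dataset and combine the results with Rubin's rules. The sketch below pools the mean of a single column; it assumes each imputation is a pandas DataFrame, and both the helper `pool_means` and the toy data are hypothetical stand-ins for the real `imputations` list.

```python
import numpy as np
import pandas as pd


def pool_means(imputed_frames, column):
    """Pool a column mean across imputed datasets using Rubin's rules."""
    m = len(imputed_frames)
    estimates = np.array([df[column].mean() for df in imputed_frames])
    # Squared standard error of the mean within each dataset.
    variances = np.array([df[column].var(ddof=1) / len(df)
                          for df in imputed_frames])
    pooled = estimates.mean()
    within = variances.mean()        # average within-imputation variance
    between = estimates.var(ddof=1)  # variance of estimates across imputations
    total = within + (1 + 1 / m) * between
    return pooled, np.sqrt(total)


# Toy stand-ins: the last value differs because it was imputed.
toy_imputations = [
    pd.DataFrame({"hours_per_week": [40.0, 13.0, 40.0, 38.0]}),
    pd.DataFrame({"hours_per_week": [40.0, 13.0, 40.0, 42.0]}),
    pd.DataFrame({"hours_per_week": [40.0, 13.0, 40.0, 40.0]}),
]

estimate, std_error = pool_means(toy_imputations, "hours_per_week")
print(estimate, std_error)
```

The between-imputation term is what carries the uncertainty due to missingness into the final standard error, which is the main reason to analyse all of the imputed datasets rather than a single one.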
