# MIDASpy demonstration

This notebook provides a brief demonstration of **MIDASpy**'s core functionalities. We show how to use the package to impute missing values in the [Adult census dataset](https://github.com/MIDASverse/MIDASpy/blob/master/Examples/adult_data.csv) (which is commonly used for benchmarking machine learning tasks).

Users of **MIDASpy** must have **TensorFlow** installed as a **pip** package in their Python environment. **MIDASpy** is compatible with both **TensorFlow** 1.X and **TensorFlow** >= 2.2 versions.


Once these packages are installed, users can import the dependencies and load the data:

In [1]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
import tensorflow as tf
import MIDASpy as md

data_0 = pd.read_csv('adult_data.csv')
data_0.columns = data_0.columns.str.strip()  # assign back: str.strip() returns a new Index
data_0.columns

Index(['Unnamed: 0', 'age', 'workclass', 'fnlwgt', 'education',
       'education_num', 'marital_status', 'occupation', 'relationship', 'race',
       'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
       'native_country', 'class_labels'],
      dtype='object')

As the Adult dataset has very little missingness, we randomly set 5,000 observed values as missing in each column:

In [2]:
np.random.seed(441)

def spike_in_generation(data):
    spike_in = pd.DataFrame(np.zeros_like(data), columns= data.columns)
    for column in data.columns:
        subset = np.random.choice(data[column].index[data[column].notnull()], 5000, replace= False)
        spike_in.loc[subset, column] = 1
    return spike_in

spike_in = spike_in_generation(data_0)
original_value = data_0.loc[4, 'hours_per_week']
data_0[spike_in == 1] = np.nan
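
The spike-in step can be checked on a small scale. The sketch below applies the same idea to a hypothetical ten-row frame (`toy`, `spike_in_mask`, and `n_missing` are illustrative names, not part of **MIDASpy**), using a boolean mask and NumPy's `default_rng` instead of the global seed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy frame standing in for the Adult data
toy = pd.DataFrame({"a": np.arange(10.0), "b": list("abcdefghij")})

def spike_in_mask(data, n_missing, rng):
    """Return a boolean mask with n_missing True cells per column,
    drawn only from cells that are currently observed."""
    mask = pd.DataFrame(False, index=data.index, columns=data.columns)
    for column in data.columns:
        observed = data.index[data[column].notnull()]
        chosen = rng.choice(observed, size=n_missing, replace=False)
        mask.loc[chosen, column] = True
    return mask

mask = spike_in_mask(toy, n_missing=3, rng=rng)
toy_missing = toy.mask(mask)  # cells where the mask is True become NaN
added_missing = toy_missing.isnull().sum()
```

Keeping the mask around makes it possible to compare imputed values against the originals later, since it records exactly which cells were removed.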

Next, we list the categorical variables and one-hot encode them using **MIDASpy**'s inbuilt preprocessing function `cat_conv`, which returns both the encoded data and a nested list of categorical column names that we can pass to the imputation algorithm. To construct the final, pre-processed data, we append the one-hot encoded categorical data to the non-categorical data and replace null values with `np.nan`:

In [3]:
categorical = ['workclass','marital_status','relationship','race','class_labels','sex','education','occupation','native_country']
data_cat, cat_cols_list = md.cat_conv(data_0[categorical])

data_0.drop(categorical, axis = 1, inplace = True)
constructor_list = [data_0]
constructor_list.append(data_cat)
data_in = pd.concat(constructor_list, axis=1)

na_loc = data_in.isnull()
data_in[na_loc] = np.nan
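
To make the encoding step concrete, here is a minimal sketch of one-hot encoding that keeps missing labels missing. It uses `pandas.get_dummies` on a hypothetical single-column frame and is only an illustration of the idea; it is not necessarily how `cat_conv` is implemented, and `one_hot_keep_nan` is an invented helper name:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing label
toy = pd.DataFrame({"sex": ["Male", "Female", np.nan, "Male"]})

def one_hot_keep_nan(series):
    """One-hot encode a column; rows with a missing label become all-NaN
    rather than all-zero, so they are still treated as missing."""
    dummies = pd.get_dummies(series, prefix=series.name, dtype=float)
    dummies.loc[series.isnull(), :] = np.nan
    return dummies

encoded = one_hot_keep_nan(toy["sex"])
encoded_columns = encoded.columns.tolist()  # one column per observed label
```

Without the `NaN` step, a missing label would be encoded as a row of zeros, which an imputation model could not distinguish from an observed value.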

To visualize the results:


In [4]:
print(data_in.head())

   Unnamed: 0   age    fnlwgt  education_num  capital_gain  capital_loss  \
0         0.0  39.0   77516.0           13.0        2174.0           0.0   
1         1.0  50.0   83311.0           13.0           0.0           0.0   
2         2.0  38.0  215646.0            9.0           0.0           0.0   
3         3.0  53.0  234721.0            NaN           0.0           0.0   
4         4.0  28.0       NaN           13.0           0.0           NaN   

   hours_per_week  workclass_Federal-gov  workclass_Local-gov  \
0            40.0                    0.0                  0.0   
1            13.0                    0.0                  0.0   
2            40.0                    0.0                  0.0   
3            40.0                    0.0                  0.0   
4             NaN                    0.0                  0.0   

   workclass_Never-worked  ...  native_country_Portugal  \
0                     0.0  ...                      0.0   
1                     0.0  ...    

The data are now ready to be fed into the imputation algorithm, which involves three steps. First, we specify the dimensions, input corruption proportion, and other hyperparameters of the MIDAS neural network. Second, we build a MIDAS model based on the data. The nested list of one-hot-encoded column names (`cat_cols_list`) should be passed to the `softmax_columns` argument, as MIDAS employs a softmax final-layer activation function for categorical variables. Third, we train the model on the data, setting the number of training epochs to 20 in this example:

In [5]:
imputer = md.Midas(layer_structure = [256,256], vae_layer = False, seed = 89, input_drop = 0.75)
imputer.build_model(data_in, softmax_columns = cat_cols_list)
imputer.train_model(training_epochs = 20)

Size index: [7, 8, 7, 6, 5, 2, 2, 16, 14, 41]

Computation graph constructed

Model initialised



2021-09-21 20:07:10.734701: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-21 20:07:10.739202: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2021-09-21 20:07:10.739670: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-09-21 20:07:10.740138: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library

Epoch: 0 , loss: 131060.12472738164
Epoch: 1 , loss: 94869.06297356242
Epoch: 2 , loss: 90830.15935247378
Epoch: 3 , loss: 88528.95353520745
Epoch: 4 , loss: 85461.10624308855
Epoch: 5 , loss: 80486.11211142284
Epoch: 6 , loss: 76536.06388932974
Epoch: 7 , loss: 75540.07280830193
Epoch: 8 , loss: 74316.37026309592
Epoch: 9 , loss: 73984.68447112037
Epoch: 10 , loss: 73999.93967902707
Epoch: 11 , loss: 74538.63024502376
Epoch: 12 , loss: 73596.38242213098
Epoch: 13 , loss: 73786.58791174332
Epoch: 14 , loss: 72981.07277246477
Epoch: 15 , loss: 73847.48175952757
Epoch: 16 , loss: 72961.70744273734
Epoch: 17 , loss: 74021.42337696081
Epoch: 18 , loss: 73330.54337767755
Epoch: 19 , loss: 73336.20200380898
Training complete. Saving file...
Model saved in file: tmp/MIDAS


<MIDASpy.midas_base.Midas at 0x7f9345a3d8b0>
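
For readers unfamiliar with the two ingredients named above, the following NumPy sketch shows input corruption and a softmax output in isolation. It is a generic illustration, not **MIDASpy**'s internal code: `corrupt_inputs` and `softmax` are hypothetical helpers, and treating 0.75 as the probability of *keeping* an input is an assumption that should be checked against the **MIDASpy** documentation for `input_drop`.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_inputs(x, keep_prob, rng):
    """Randomly zero out inputs, keeping each one with probability keep_prob
    and rescaling the survivors (inverted dropout)."""
    keep = rng.random(x.shape) < keep_prob
    return np.where(keep, x / keep_prob, 0.0)

def softmax(logits):
    """Row-wise softmax: turns raw scores into class probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

x = rng.normal(size=(4, 6))            # hypothetical batch of inputs
corrupted = corrupt_inputs(x, keep_prob=0.75, rng=rng)  # assumed meaning of 0.75

logits = rng.normal(size=(4, 3))       # hypothetical scores for a 3-class variable
probs = softmax(logits)                # each row sums to 1
```

The softmax is what allows the one-hot columns of a single categorical variable to be read as class probabilities after imputation.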

Once training is complete, we can generate any number of imputed datasets (M) using the `generate_samples` function (here we set M as 10). Users can then either write these imputations to separate .CSV files or work with them directly in Python:

In [6]:
imputations = imputer.generate_samples(m=10).output_list 

# n = 1
# for i in imputations:
#    file_out = "midas_imp_" + str(n) + ".csv"
#    i.to_csv(file_out, index=False)
#    n += 1

INFO:tensorflow:Restoring parameters from tmp/MIDAS
Model restored.
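
Writing the imputations to disk can also be done with `enumerate`, which avoids a manual counter. The sketch below uses a hypothetical list of small DataFrames (`toy_imputations`) in place of the real `imputations` list and writes to a temporary directory so that it can be run anywhere:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the list of imputed DataFrames
toy_imputations = [pd.DataFrame({"age": [39.0, 50.0 + k]}) for k in range(3)]

out_dir = Path(tempfile.mkdtemp())
written = []
for n, imputed in enumerate(toy_imputations, start=1):
    file_out = out_dir / f"midas_imp_{n}.csv"
    imputed.to_csv(file_out, index=False)  # one CSV file per imputed dataset
    written.append(file_out)
```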


Finally, using the list of generated imputations, we can estimate M separate regression models and combine the parameter and variance estimates (see Rubin 1987) using **MIDASpy**'s `combine` function:

In [7]:
model = md.combine(y_var = "capital_gain", 
                   X_vars = ["education_num","age"],
                   df_list = imputations)

model



| | term | estimate | std.error | statistic | df | p.value |
|---|---|---|---|---|---|---|
| 0 | const | -845.602953 | 131.761974 | -6.417655 | 79.268249 | 9.372076e-09 |
| 1 | education_num | 57.687455 | 8.798527 | 6.55649 | 26.86249 | 5.076031e-07 |
| 2 | age | 33.277681 | 2.480652 | 13.414893 | 373.918334 | 0.0 |
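
The combination rules from Rubin (1987) are short enough to write out for a single coefficient. The sketch below pools M point estimates and their squared standard errors; `pool_rubin` is a hypothetical helper with made-up inputs, and it is not necessarily identical to what `combine` computes internally (for example, in any small-sample adjustments to the degrees of freedom):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one coefficient across M imputed datasets.

    estimates: the M point estimates
    variances: the M squared standard errors
    Returns (pooled estimate, pooled standard error, degrees of freedom).
    Assumes the estimates are not all identical (between-variance > 0).
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()             # pooled point estimate
    u_bar = variances.mean()             # within-imputation variance
    b = estimates.var(ddof=1)            # between-imputation variance
    t = u_bar + (1 + 1 / m) * b          # total variance
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, np.sqrt(t), df

# Made-up numbers for illustration only
estimate, std_error, df = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what makes the pooled standard error larger than the average of the per-dataset standard errors: it carries the extra uncertainty due to the missing data.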


### Handling one-hot encoded categories post-imputation

To impute categorical data, we one-hot encode the variable and then impute the probability of each class for each observation. For example, if we look at the one-hot encoded `workclass` variable in the imputed data, we see that it is represented by 8 columns, one for each label in the data:

In [8]:
workclasses = [x for x in imputations[0].columns if "workclass" in x]
imputations[0][workclasses].head()

| | workclass_Federal-gov | workclass_Local-gov | workclass_Never-worked | workclass_Private | workclass_Self-emp-inc | workclass_Self-emp-not-inc | workclass_State-gov | workclass_Without-pay |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


If we want to use the categorical version of a variable, we usually need to transform these probabilities back into a vector of labels. The simplest approach is to select the category with the highest probability for each observation. Having used `md.cat_conv()` to one-hot encode these variables earlier, we can use the resulting `cat_cols_list` object to do just that. The following code collapses all encoded columns back into single categorical columns:

In [9]:
flat_cats = [cat for variable in cat_cols_list for cat in variable]

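for i in range(len(imputations)):
    # For each variable, pick the one-hot column with the highest imputed probability
    tmp_cat = [imputations[i][x].idxmax(axis=1) for x in cat_cols_list]
    # Relies on cat_cols_list following the order of `categorical`
    cat_df = pd.DataFrame({categorical[j]: tmp_cat[j] for j in range(len(categorical))})
    # Append the label columns and drop the one-hot columns they replace
    imputations[i] = pd.concat([imputations[i], cat_df], axis = 1).drop(flat_cats, axis = 1)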
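
The effect of `idxmax(axis=1)` is easiest to see on a single variable. The frame below holds hypothetical class probabilities for a two-category variable; the values are invented for illustration:

```python
import pandas as pd

# Hypothetical imputed probabilities for one categorical variable
probs = pd.DataFrame({
    "sex_Female": [0.9, 0.2, 0.5],
    "sex_Male":   [0.1, 0.8, 0.5],
})

# For each row, return the name of the column holding the largest value.
# On an exact tie, the first such column is returned.
labels = probs.idxmax(axis=1)
```

Note that the result contains column names rather than the original labels, so each value still carries the variable-name prefix.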


If we now inspect the imputations we can see that our data is back to its original shape. Inspecting the `workclass` column, we see that the categories correspond to the one-hot encoded values identified earlier:

In [10]:
print(imputations[0].columns)

imputations[0]['workclass'].head()

Index(['Unnamed: 0', 'age', 'fnlwgt', 'education_num', 'capital_gain',
       'capital_loss', 'hours_per_week', 'workclass', 'marital_status',
       'relationship', 'race', 'class_labels', 'sex', 'education',
       'occupation', 'native_country'],
      dtype='object')


0           workclass_State-gov
1    workclass_Self-emp-not-inc
2             workclass_Private
3             workclass_Private
4             workclass_Private
Name: workclass, dtype: object
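
As the output shows, the collapsed values keep the `workclass_` prefix from the one-hot column names. A minimal sketch for recovering the bare labels follows, using a hypothetical Series that mirrors the output above; it assumes every value starts with the variable name followed by an underscore:

```python
import pandas as pd

# Hypothetical collapsed column, mirroring the output above
workclass = pd.Series(
    ["workclass_State-gov", "workclass_Self-emp-not-inc", "workclass_Private"],
    name="workclass",
)

prefix = f"{workclass.name}_"
# Guard the assumption that every value carries the prefix
assert workclass.str.startswith(prefix).all()

# Slice off the prefix instead of splitting on "_", since variable names
# and labels may themselves contain underscores
labels = workclass.str[len(prefix):]
```

The same slice can be applied to each entry of `categorical` in turn to clean every collapsed column in an imputed dataset.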