### __MIDASpy demonstration__

MIDASpy's core functionalities are demonstrated here by using it to impute missing responses to the 2018 Cooperative Congressional Election Study (CCES), an electoral survey conducted in the United States whose size and complexity pose computational difficulties for many existing multiple imputation algorithms.

The full CCES has 525 columns and 60,000 rows, the latter corresponding to individual survey respondents. After removing variables that either require extensive preprocessing or are unhelpful for imputation purposes (open-ended string variables, time indices, and ZIP code variables), the dataset contains 349 columns. The vast majority of these variables are categorical and must therefore be one-hot encoded for most multiple imputation software packages: each 1 × 60,000 categorical variable with K unique classes must be expanded into a K × 60,000 matrix of 1s and 0s. This expansion increases the number of columns to 1,914.
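
To make the expansion concrete, the following minimal sketch one-hot encodes a toy categorical column with pandas. The column name `region` and its values are made up for illustration, and MIDASpy's own `cat_conv()` is not used here.

```python
import pandas as pd

# Toy categorical variable with K = 3 unique classes (hypothetical data).
toy = pd.DataFrame({"region": ["south", "west", "south", "midwest"]})

# One indicator column per class: a single column becomes K columns.
one_hot = pd.get_dummies(toy["region"], prefix="region", dtype=int)

print(one_hot.shape)          # (4, 3)
print(list(one_hot.columns))  # ['region_midwest', 'region_south', 'region_west']
```

Each row contains exactly one 1, so the encoded block carries the same information as the original column at K times the width.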

_**Loading and preprocessing the data**_

We begin by loading MIDASpy, its dependencies, and additional packages called in the workflow. We then read in the formatted CCES data and sort variables into continuous, binary, and categorical types.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
import sys
import MIDASpy as md

In [2]:
data_in = pd.read_csv("cces_jss_format.csv")
cont_vars = ["citylength_1","numchildren","birthyr"]
vals = data_in.nunique()
cat_vars = list(data_in.columns[(vals.values > 2) & ~(data_in.columns.isin(cont_vars))])
bin_vars = list(data_in.columns[vals.values == 2])
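
The sorting rule above can be checked on a small example. The sketch below applies the same `nunique()`-based logic to a toy frame; the column names `voted` and `party` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy survey: one continuous, one binary, one categorical column.
toy = pd.DataFrame({
    "birthyr": [1950, 1984, 1999, 1972],
    "voted": ["yes", "no", np.nan, "yes"],
    "party": ["dem", "rep", "ind", np.nan],
})
toy_cont = ["birthyr"]

# nunique() ignores missing values by default, so NaN is not counted as a class.
toy_vals = toy.nunique()
toy_cat = list(toy.columns[(toy_vals.values > 2) & ~toy.columns.isin(toy_cont)])
toy_bin = list(toy.columns[toy_vals.values == 2])

print(toy_cat)  # ['party']
print(toy_bin)  # ['voted']
```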

Next, we apply the `.binary_conv()` function to the list of binary variables (which are not in dummy form), before appending them and the continuous variables to a `constructor_list` object, the basis for our final preprocessed dataset.

In [3]:
data_bin = data_in[bin_vars].apply(md.binary_conv)
constructor_list = [data_in[cont_vars], data_bin]

To one-hot encode categorical variables, we apply the `.cat_conv()` function to a dataframe containing them. We concatenate the resulting matrix to the existing `constructor_list` object.

In [4]:
data_cat = data_in[cat_vars]
data_oh, cat_col_list = md.cat_conv(data_cat)
constructor_list.append(data_oh)
data_0 = pd.concat(constructor_list, axis=1)

The final preprocessing step, which is nonessential, is to scale all variables between 0 and 1 to aid model convergence. We use scikit-learn’s `MinMaxScaler()` function for this step.

In [5]:
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data_0)
data_scaled = pd.DataFrame(data_scaled, columns = data_0.columns)
na_loc = data_scaled.isnull()
data_scaled[na_loc] = np.nan
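
As a rough illustration of what this step does, the sketch below applies column-wise min-max scaling by hand to a toy frame and then inverts it. It is a hand-rolled stand-in rather than scikit-learn's implementation, and the values are hypothetical. Missing entries are skipped when computing each column's minimum and maximum and remain missing afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "birthyr": [1950.0, 1984.0, np.nan, 2000.0],
    "numchildren": [0.0, 2.0, 4.0, np.nan],
})

# pandas skips NaN when computing column minima and maxima.
col_min, col_max = toy.min(), toy.max()
toy_scaled = (toy - col_min) / (col_max - col_min)

# Inverting the transformation recovers the original units.
toy_restored = toy_scaled * (col_max - col_min) + col_min

print(toy_scaled)
```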

_**Imputation**_

Once the data are preprocessed, training a MIDAS network with MIDASpy is straightforward. We declare an instance of the `Midas` class, pass our data to this object (including the sorted variable names) with the `.build_model()` function, and train the network for 10 epochs with the `.train_model()` function. For the purposes of this illustration, we maintain most of MIDASpy’s default hyperparameter settings.

In [6]:
imputer = md.Midas(layer_structure= [256,256], vae_layer = False, seed= 89, input_drop = 0.75)

In [7]:
imputer.build_model(data_scaled, binary_columns = bin_vars, softmax_columns = cat_col_list)

Size index: [3, 178, 6, 8, 6, 3, 3, 6, 6, 4, 3, 59, 3, 3, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 3, 5, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 6, 6, 6, 6, 6, 6, 6, 10, 10, 7, 4, 7, 8, 5, 8, 3, 5, 9, 5, 52, 17, 3, 3, 3, 3, 3, 6, 3, 23, 4, 7, 8, 12, 14, 11, 6, 6, 4, 7, 10, 5, 4, 4, 7, 3, 4, 6, 3, 7, 5, 4, 4, 4, 6, 5, 17, 51, 53, 53, 3, 98, 6, 6, 5, 17, 17, 4, 6, 3, 3, 3, 6, 6, 6, 10, 5, 5, 5, 5, 6, 5, 7, 5, 5, 5, 5, 224, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 51, 53, 53, 5, 51, 14, 5, 6, 5]

Computation graph constructed



<MIDASpy.midas_base.Midas at 0x16a85c6d0>

In [8]:
imputer.train_model(training_epochs = 10)

Model initialised

Epoch: 0 , loss: 186.26737846679688
Epoch: 1 , loss: 169.38942487792968
Epoch: 2 , loss: 163.48311638997396
Epoch: 3 , loss: 159.68743997802736
Epoch: 4 , loss: 157.04094825032553
Epoch: 5 , loss: 154.82602157389323
Epoch: 6 , loss: 153.35590602010092
Epoch: 7 , loss: 152.05749235839843
Epoch: 8 , loss: 151.08395079345703
Epoch: 9 , loss: 150.22736969604492
Training complete. Saving file...
Model saved in file: tmp/MIDAS


<MIDASpy.midas_base.Midas at 0x16a85c6d0>

Once the model is trained, we draw 10 completed datasets. When datasets are very large, as in this case, we recommend accessing each one separately rather than holding all of them in memory simultaneously. We thus construct a dataset generator using the `.yield_samples()` function.

In [9]:
imputations = imputer.yield_samples(m=10)
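
The memory argument can be illustrated with a plain Python generator. The function `toy_samples` below is a hypothetical stand-in, not MIDASpy's `.yield_samples()`; it only shows that a generator produces one dataset at a time and is exhausted after a single pass.

```python
import numpy as np
import pandas as pd

def toy_samples(m, n_rows=5, seed=0):
    """Yield m small random data frames one at a time (hypothetical stand-in)."""
    rng = np.random.default_rng(seed)
    for _ in range(m):
        yield pd.DataFrame({"x": rng.normal(size=n_rows)})

samples = toy_samples(m=3)

# Only one frame is materialised per iteration; keep just the summary needed.
means = [df["x"].mean() for df in samples]

print(len(means))          # 3
print(len(list(samples)))  # 0, the generator is already exhausted
```

If `imputations` behaves like an ordinary Python generator, a second pass over the completed datasets would require creating it again.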

_**Analysis of completed datasets**_

We analyze the 10 completed datasets using MIDASpy’s inbuilt `combine()` function. We estimate a simple linear probability model in which `"CC18_415a"`, a respondent’s degree of support for giving the United States Environmental Protection Agency power to regulate carbon dioxide emissions, is regressed on `"age"` (2018 − `"birthyr"`), a respondent’s age.

Users can ensure exact reproducibility of analytical results by saving completed datasets to disk. The trained MIDAS model itself is also saved by default to the location specified in the `savepath` argument of `Midas()`.

As we scaled the input dataset prior to imputation with the `MinMaxScaler()` function, for each completed dataset we first invert this transformation via scikit-learn’s `.inverse_transform()` function. We then convert predicted probabilities for `CC18_415a` into binary categories using a threshold of 0.5. To save memory, we append only the subset of each dataset needed for the analysis to a list.

In [10]:
analysis_dfs = []

In [11]:
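for df in imputations:
    # Undo the min-max scaling so variables return to their original units
    df_unscaled = scaler.inverse_transform(df)
    df_unscaled = pd.DataFrame(df_unscaled, columns = data_scaled.columns)
    # Derive age from birth year relative to the 2018 survey year
    df['age'] = 2018 - df_unscaled['birthyr']
    # Convert predicted probabilities into binary categories at a 0.5 threshold
    df['CC18_415a'] = np.where(df_unscaled['CC18_415a'] >= 0.5,1,0)
    # Keep only the columns needed for the analysis to save memory
    analysis_dfs.append(df.loc[:,["age","CC18_415a"]])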

INFO:tensorflow:Restoring parameters from tmp/MIDAS
Model restored.


In [12]:
model = md.combine(y_var = "CC18_415a", X_vars = ["age"], df_list = analysis_dfs)

In [13]:
model

|   | term | estimate | std.error | statistic | df | p.value |
|---|---|---|---|---|---|---|
| 0 | const | 0.934493 | 0.005515 | 169.4597 | 3056.421238 | 0.0 |
| 1 | age | -0.005259 | 0.000107 | -49.160665 | 4565.125518 | 0.0 |
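
Estimates from multiply imputed datasets are conventionally pooled with Rubin's rules. The sketch below works through that arithmetic on made-up per-dataset coefficients and standard errors. It assumes, without confirmation from the text above, that `combine()` follows the same logic, and `pool_rubin` is a hypothetical helper rather than part of MIDASpy.

```python
import numpy as np

def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m completed datasets using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2
    m = len(q)
    q_bar = q.mean()          # pooled point estimate
    within = u.mean()         # average within-imputation variance
    between = q.var(ddof=1)   # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, np.sqrt(total)

# Made-up slope estimates and standard errors from m = 4 analyses.
est, se = pool_rubin([-0.0052, -0.0054, -0.0051, -0.0053], [0.0001] * 4)
print(est, se)
```

The pooled standard error exceeds the per-dataset value because the between-imputation term reflects the additional uncertainty due to the missing data.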
