This Jupyter notebook shows how to generate artificial missing data under a multivariate MNAR (Missing Not At Random) mechanism with mdatagen and then impute it with a TensorFlow-based method named Partial Multiple Imputation With Variational Autoencoders (PMIVAE). In the original article, the authors proposed PMIVAE for data missing under the MNAR mechanism.

Reference: <BR>
Pereira RC, Abreu PH, Rodrigues PP. Partial Multiple Imputation With Variational Autoencoders: Tackling Not at Randomness in Healthcare Data. IEEE J Biomed Health Inform. 2022 Aug;26(8):4218-4227. doi: 10.1109/JBHI.2022.3172656. Epub 2022 Aug 11. PMID: 35511840.

In [1]:
# Import the libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

from mdatagen.multivariate.mMNAR import mMNAR

# Load the data
wisconsin = load_breast_cancer()
wisconsin_df = pd.DataFrame(data=wisconsin.data, columns=wisconsin.feature_names)  # Dataset used in the PMIVAE paper

X = wisconsin_df.copy()  # Features
y = wisconsin.target     # Label values

In [2]:
# Create an instance for the MNAR mechanism
generator = mMNAR(X=X,
                  y=y,
                  n_xmiss=X.shape[1],  # all features will receive missing values
                  threshold=0)  # highest values

# Generate missing data under the MNAR mechanism with a missing rate of up to 20%
generate_data = generator.random(missing_rate=20,
                                 deterministic=True)  # missingness depends on each feature's own values
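
To make the idea of missingness that depends on a feature's own values concrete, here is a minimal pandas sketch. It is not mdatagen's implementation: the helper `amputate_own_values` and its `highest` flag are hypothetical names introduced only for illustration. For each column, it removes the top (or bottom) fraction of values.

```python
import numpy as np
import pandas as pd


def amputate_own_values(df, missing_rate, highest=True):
    """Hypothetical helper: set the most extreme values of each column to NaN.

    missing_rate is a percentage (0-100) applied to every column.
    """
    out = df.astype(float).copy()
    n_remove = int(round(len(out) * missing_rate / 100))
    if n_remove == 0:
        return out
    for col in out.columns:
        # Order the rows by their own value in this column
        order = out[col].sort_values(ascending=not highest).index
        # Remove the n_remove most extreme values
        out.loc[order[:n_remove], col] = np.nan
    return out


toy = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [50, 40, 30, 20, 10]})
toy_missing = amputate_own_values(toy, missing_rate=40, highest=True)
print(toy_missing)
```

Because the removed cells are exactly those with the largest values, the missingness cannot be explained by the observed data alone, which is what makes this kind of pattern MNAR.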

In [3]:
# Share of missing cells over the whole DataFrame
qtd_miss = sum(generate_data.isna().sum())
data_dimension = generate_data.shape[0] * generate_data.shape[1]
print(f"Global Missing rate = {round(qtd_miss / data_dimension, 4) * 100}%")

Global Missing rate = 19.39%
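
The global rate can hide how unevenly missing values are spread across columns. As a sketch on a toy DataFrame (the helper name `missing_rates` is hypothetical), the same quantity can also be reported per feature:

```python
import numpy as np
import pandas as pd


def missing_rates(df):
    """Return (global rate, per-column rates) as percentages."""
    per_column = df.isna().mean() * 100               # share of NaN in each column
    global_rate = df.isna().to_numpy().mean() * 100   # share of NaN over all cells
    return global_rate, per_column


toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                    "b": [np.nan, np.nan, 3.0, 4.0]})
global_rate, per_column = missing_rates(toy)
print(f"Global missing rate = {global_rate:.2f}%")
print(per_column.round(2))
```

Applied to `generate_data`, the result would also count the `target` column unless it is dropped first, since the DataFrame returned by the generator still contains it.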


Once the missingness is introduced, we perform the imputation with PMIVAE, which is implemented in TensorFlow. To use this autoencoder, follow these steps: <br>

1. Go to GitHub repository: https://github.com/ricardodcpereira/PMIVAE
2. Clone the repository
3. Use the following code

In [4]:
# import sys
# sys.path.append("path/to/pmivae/folder")

from pmivae import ConfigVAE, PMIVAE

original_shape = X.shape

y_train = generate_data["target"]
X_train = generate_data.drop(columns="target")

vae_config = ConfigVAE()
vae_config.verbose = 0
vae_config.batch_size = 128
vae_config.validation_split = 0.2
vae_config.input_shape = (original_shape[1],)
vae_config.epochs = 200

pmivae_model = PMIVAE(vae_config, num_samples=200)
pmivae_model_trained = pmivae_model.fit(X=X_train.values,
                                        y=y_train)

In [5]:
data_imputed = pmivae_model_trained.transform(X_train.values)

In [6]:
pd.DataFrame(data_imputed, columns=X.columns)

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,0.973514,0.979600,0.967979,0.996791,0.118444,0.115564,0.136443,0.086983,0.198542,0.092786,...,0.973581,0.986063,0.971022,0.978695,0.140569,0.305402,0.345007,0.138460,0.319906,0.126009
1,0.988317,0.991404,0.984745,0.999256,0.081741,0.083662,0.095840,0.055880,0.159507,0.060012,...,0.988441,0.995054,0.987395,0.989881,0.104522,0.282253,0.331156,0.096495,0.288463,0.084480
2,0.986490,0.989981,0.982597,0.999037,0.087365,0.088643,0.102123,0.060483,0.165915,0.064881,...,0.986614,0.994058,0.985386,0.988453,0.110230,0.286262,0.333581,0.102970,0.293879,0.090776
3,0.787963,0.713743,0.713717,0.752805,0.316942,0.280540,0.263756,0.335186,0.228419,0.318733,...,0.738740,0.891356,0.813159,0.749234,0.346905,0.405063,0.385105,0.177233,0.317348,0.204692
4,0.972593,0.978849,0.966973,0.996588,0.120267,0.117116,0.138436,0.088589,0.200334,0.094471,...,0.972654,0.985445,0.969999,0.978022,0.142294,0.306400,0.345596,0.140526,0.321269,0.128094
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,0.977750,0.983033,0.972652,0.997653,0.109540,0.107942,0.126675,0.079217,0.189613,0.084629,...,0.977845,0.988828,0.975726,0.981823,0.132060,0.300353,0.342016,0.128342,0.313016,0.115850
565,0.962130,0.970212,0.955752,0.993880,0.138852,0.132813,0.158650,0.105255,0.217998,0.111918,...,0.962094,0.978037,0.958338,0.970472,0.159614,0.315991,0.351230,0.161502,0.334401,0.149440
566,0.974883,0.980714,0.969481,0.997083,0.115663,0.113190,0.133397,0.084544,0.195785,0.090226,...,0.974960,0.986971,0.972544,0.979702,0.137925,0.303857,0.344093,0.135304,0.317795,0.122832
567,0.942844,0.953886,0.935814,0.987023,0.166371,0.155698,0.188238,0.130841,0.242446,0.138578,...,0.942562,0.962844,0.936770,0.956927,0.184477,0.328620,0.358572,0.192262,0.351756,0.181266
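
Because the missing values were introduced artificially, the original values are known, so imputation quality can be measured on the cells that were removed. The following is a minimal sketch, assuming the original, amputated, and imputed arrays have the same shape and are on the same scale; `masked_mae` is a hypothetical helper and is not part of mdatagen or PMIVAE.

```python
import numpy as np


def masked_mae(x_true, x_missing, x_imputed):
    """Hypothetical helper: mean absolute error over the amputated cells only."""
    x_true = np.asarray(x_true, dtype=float)
    x_imputed = np.asarray(x_imputed, dtype=float)
    # True where a value was removed during amputation
    mask = np.isnan(np.asarray(x_missing, dtype=float))
    if not mask.any():
        raise ValueError("x_missing contains no missing values")
    return float(np.abs(x_true[mask] - x_imputed[mask]).mean())


x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
x_missing = np.array([[1.0, np.nan], [np.nan, 4.0]])
x_imputed = np.array([[1.0, 2.5], [2.0, 4.0]])
print(masked_mae(x_true, x_missing, x_imputed))
```

In this notebook the three arguments would correspond to `X`, `X_train`, and `data_imputed`, provided any scaling applied by the imputer is also applied to `X` before the comparison.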


In this Jupyter notebook, we have shown that our mdatagen package can be used alongside TensorFlow models by applying the PMIVAE imputation algorithm to data amputated with mdatagen. Note that mdatagen focuses on the data amputation step. Imputation algorithms from scikit-learn, TensorFlow, or other frameworks remain compatible because they belong to the separate data imputation step.
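
As an illustration of swapping in a different imputer, the sketch below uses scikit-learn's `KNNImputer` in place of PMIVAE. The toy DataFrame stands in for amputated data; in this notebook it would be `X_train`. The choice of `n_neighbors=2` is arbitrary and only suits the toy example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for a DataFrame amputated with mdatagen
toy_missing = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [10.0, np.nan, 30.0, 40.0, 50.0],
})

# Fill each missing cell from the nearest rows that have that feature observed
imputer = KNNImputer(n_neighbors=2)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy_missing),
                           columns=toy_missing.columns)
print(toy_imputed)
```

The amputation code does not change when the imputer does, which is the separation between the two steps described above.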