# Avoiding imputation of undesired data entries

To avoid imputing certain missing data points, construct an `imputable_matrix` (the do-not-impute, or DNI, matrix) of the same size and shape as the original dataset. In this matrix, use 1 for entries of the original dataset that are either non-missing or missing and viable for imputation. Use 0 for entries that are missing but not viable for imputation (for example, measurements at timepoints occurring after death). To use the `columns_ignore` argument in `run_cissvae()` or `ClusterDataset()`, make sure that the `imputable_matrix` has the same column names/indices as the original dataset.
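
The construction described above can be sketched with plain pandas. This is a minimal illustration on toy data, not part of the `ciss_vae` API: the column names, the `last_valid_timepoint` series, and the cutoff rule are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Toy longitudinal data: one row per subject, one column per timepoint.
df = pd.DataFrame(
    {
        "Y1": [1.2, np.nan, 0.7],
        "Y2": [np.nan, 2.1, np.nan],
        "Y3": [np.nan, 1.9, np.nan],
    }
)

# Hypothetical bookkeeping: position of the last timepoint at which each
# subject could have been measured (subject 2 died after the second timepoint).
last_valid_timepoint = pd.Series([2, 2, 1], index=df.index)

# Start from all ones: every entry is either observed or viable for imputation.
imputable_matrix = pd.DataFrame(1, index=df.index, columns=df.columns)

# Mark missing entries after a subject's last valid timepoint as non-imputable.
for position, column in enumerate(df.columns):
    after_cutoff = position > last_valid_timepoint
    imputable_matrix.loc[after_cutoff & df[column].isna(), column] = 0

print(imputable_matrix)
```

Building the matrix from the dataset's own index and columns keeps the two aligned by construction. Note that subject 0 is also missing `Y3`, but that entry stays 1 because it falls before the subject's cutoff.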

## Example using included dataset

First, load the dataset and load or create the DNI matrix. Here we can see that [1, Y12], [2, Y22], and [1, Y52] are marked as non-imputable.


In [1]:
import pandas as pd
from ciss_vae.data import load_example_dataset, load_dni
import ciss_vae
print(ciss_vae.__file__)

df_missing, _, clusters = load_example_dataset()

dni = load_dni()

dni.columns = df_missing.columns

print(f"Df missing:\n{df_missing.head(3).drop(df_missing.columns[:5].to_list(), axis=1)}, \n\nDo not impute:\n{dni.head(3).drop(df_missing.columns[:5].to_list(), axis=1)}")

/home/nfs/vaithid1/CISS-VAE/CISS-VAE/src/ciss_vae/__init__.py
Df missing:
        Y11  Y12        Y13        Y14        Y15        Y21  Y22        Y23  \
0 -4.049537  NaN        NaN -14.369151 -17.564448        NaN  NaN -35.772630   
1  0.546168  NaN -12.189518  -7.722474        NaN  -7.470250  NaN -25.924360   
2       NaN  NaN -20.358905 -15.126494 -17.251376 -18.448422  NaN -34.400862   

         Y24        Y25  ...       Y41  Y42       Y43       Y44       Y45  \
0 -28.098906 -30.242588  ... -0.904960  NaN       NaN -3.694852 -5.680293   
1 -17.231424 -18.695290  ...  2.624586  NaN -5.776195 -1.379495 -2.329604   
2 -27.250598 -28.839809  ...       NaN  NaN -7.215718 -3.350797 -6.895340   

        Y51  Y52       Y53       Y54       Y55  
0  2.587588  NaN -4.681195 -2.248406 -2.679081  
1  6.080512  NaN -2.290062 -0.887398  0.562532  
2  2.531148  NaN -5.427430 -1.330163 -2.324382  

[3 rows x 25 columns], 

Do not impute:
   Y11  Y12  Y13  Y14  Y15  Y21  Y22  Y23  Y24  Y25  ...  Y

## Using `run_cissvae()` with DNI matrix

The `run_cissvae()` function can accept the DNI matrix as an input. Make sure that the column names of the DNI matrix match those of the original dataset. 
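
Before passing the matrix on, it may help to confirm that it lines up with the data. The helper below is a hypothetical sketch (`check_dni_alignment` is not a `ciss_vae` function) that encodes the rules stated earlier: same shape, same column names, only 0/1 values, and 0 used only where the data is actually missing.

```python
import numpy as np
import pandas as pd


def check_dni_alignment(data: pd.DataFrame, dni: pd.DataFrame) -> None:
    """Raise ValueError if the DNI matrix does not line up with the data."""
    if data.shape != dni.shape:
        raise ValueError(f"Shape mismatch: data {data.shape} vs DNI {dni.shape}")
    if list(data.columns) != list(dni.columns):
        raise ValueError("Column names or column order differ")
    if not dni.isin([0, 1]).all().all():
        raise ValueError("DNI matrix must contain only 0 and 1")
    # A 0 should only mark an entry that is actually missing in the data.
    if ((dni.to_numpy() == 0) & data.notna().to_numpy()).any():
        raise ValueError("An observed value is marked as non-imputable")


# Toy example: the check passes silently when everything is consistent.
data = pd.DataFrame({"Y11": [0.5, np.nan], "Y12": [np.nan, np.nan]})
dni = pd.DataFrame({"Y11": [1, 1], "Y12": [0, 1]})
check_dni_alignment(data, dni)
```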

In [2]:
from ciss_vae.training.run_cissvae import run_cissvae
from ciss_vae.utils.helpers import plot_vae_architecture

imputed_data, vae, ds, history = run_cissvae(data = df_missing,
## Dataset params
    columns_ignore = df_missing.columns[:5], ## columns to ignore when selecting validation dataset (and clustering if you do not provide clusters). For example, demographic columns with no missingness.
    imputable_matrix=dni,
    clusters = clusters,
    print_dataset = False,
    
## VAE model params
    hidden_dims = [150, 120, 60], ## Dimensions of hidden layers, in order. One number per layer. 
    latent_dim = 15, ## Dimensions of latent embedding
    layer_order_enc = ["unshared", "unshared", "unshared"], ## order of shared vs unshared layers for encode (can use u or s instead of unshared, shared)
    layer_order_dec=["shared", "shared",  "shared"],  ## order of shared vs unshared layers for decode
    latent_shared=False, 
    output_shared=False, 
    batch_size = 4000, ## batch size for data loader
    return_model = True, ## if true, outputs imputed dataset and model, otherwise just outputs imputed dataset. Set to true to return model for `plot_vae_architecture`

## Initial Training params
    epochs = 5, ## default 

## Other params
    return_history = True, ## if true, will return training MSE history as pandas dataframe
    return_dataset=True
)

print(f"The successfully imputed dataset:\n{imputed_data.head()}\n\n")

IndexError: too many indices for tensor of dimension 1

In [None]:
ds

Below, we can see that [1, Y12], [2, Y22], and [1, Y52] are still NaN, even though other missing entries have been imputed. 

In [None]:
print(f"Imputed dataset:\n{imputed_data.drop(df_missing.columns[:5].to_list(), axis=1).head(3)}")
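
The same property can be checked programmatically instead of by eye. The sketch below uses toy stand-ins for `imputed_data` and `dni` and assumes the two frames are aligned by position. `dni_entries_still_missing` is a hypothetical helper, not part of `ciss_vae`.

```python
import numpy as np
import pandas as pd


def dni_entries_still_missing(imputed: pd.DataFrame, dni: pd.DataFrame) -> bool:
    """Return True if every entry marked 0 in `dni` is NaN in `imputed`."""
    blocked = dni.to_numpy() == 0
    return bool(imputed.isna().to_numpy()[blocked].all())


# Toy stand-ins: [0, Y12] is marked non-imputable and remains NaN.
toy_imputed = pd.DataFrame({"Y11": [0.5, 0.1], "Y12": [np.nan, -0.3]})
toy_dni = pd.DataFrame({"Y11": [1, 1], "Y12": [0, 1]})
print(dni_entries_still_missing(toy_imputed, toy_dni))
```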

As always, the VAE architecture can be plotted.

In [None]:
plot_vae_architecture(model = vae,
                        title = None, ## Set title of plot
                        ## Colors below are default
                        color_shared = "skyblue", 
                        color_unshared ="lightcoral",
                        color_latent = "gold",
                        color_input = "lightgreen",
                        color_output = "lightgreen",
                        figsize=(16, 8),
                        return_fig = False)

## Using the imputable matrix with the autotune function

In [None]:
import pandas as pd
from ciss_vae.data import load_example_dataset, load_dni
import ciss_vae
from ciss_vae.classes.cluster_dataset import ClusterDataset
from ciss_vae.training.autotune import SearchSpace, autotune

print(ciss_vae.__file__)

df_missing, _, clusters = load_example_dataset()

dni = load_dni()

dni.columns = df_missing.columns  ## match DNI column names to the dataset, as in the first example

dataset = ClusterDataset(
    data = df_missing,
    cluster_labels = clusters,
    imputable = dni,
)

dataset


In [None]:
searchspace = SearchSpace(
                 num_hidden_layers=(1, 4), ## Set number of hidden layers
                 hidden_dims=[64, 512], ## Allowable dimensions of hidden layers
                 latent_dim=[10, 100],
                 latent_shared=[True, False],
                 output_shared=[True,False],
                 lr=(1e-4, 1e-3),
                 decay_factor=(0.9, 0.999),
                 beta=0.01,
                 num_epochs=100,
                 batch_size=64,
                 num_shared_encode=[0, 1, 3],
                 num_shared_decode=[0, 1, 3],
                 encoder_shared_placement = ["at_end", "at_start", "alternating", "random"], ## where should the shared layers be placed in the encoder
                 decoder_shared_placement = ["at_end", "at_start", "alternating", "random"], ## where should the shared layers be placed in the decoder
                 refit_patience=2,
                 refit_loops=10,
                 epochs_per_loop = 100,
                 reset_lr_refit = [True, False])

results = autotune(
    search_space = searchspace,
    train_dataset = dataset,                   # ClusterDataset object
    save_model_path=None,
    save_search_space_path=None,
    n_trials=20,
    study_name="vae_autotune_v3",                 # Default study name
    device_preference="cuda",
    show_progress=True,                       # Show progress bar for training
    optuna_dashboard_db="sqlite:///test_dni_python.sqlite3",                  # If using optuna dashboard set db location here
    load_if_exists=True,                       # If using optuna dashboard, if study by 'study_name' already exists, will load that study
    seed = 42,     
)