# Encode data

This notebook runs part of the Multi-Omics Variational autoEncoder (MOVE) framework, which uses the structure identified by the VAE to extract associations between categorical data and all continuous datasets. In the MOVE paper we used it to identify drug associations in clinical and multi-omics data. This part is a guide to encoding the data so that it can be used as input to MOVE.

In [1]:
# Import functions
from hydra import initialize, compose

from move._utils.data_utils import generate_file


The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="../conf", config_name="main")


To encode the data, each dataset/data type must be in N x M format, where N is the number of samples/individuals and M is the number of features. To use dataset-specific weighting when training the VAE, you need to process the datasets individually or split them when you read them in. The continuous data is z-score normalised and the categorical data is one-hot encoded. The example below encodes each categorical and continuous dataset named in the configuration. To ensure the correct order, the IDs are used to sort the data accordingly.

In [2]:
with initialize(version_base=None, config_path="src/move/conf"):
    config = compose(config_name="main")

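def main(config=config):
    """Encode every categorical and continuous dataset named in the config."""

    # Read the data path, ID file, NA marker and dataset names from the config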
    path = config.data.processed_data_path
    ids_file_name = config.data.ids_file_name
    na_encoding = config.data.na_value
    categorical_names = config.model.categorical_names
    continuous_names = config.model.continuous_names
    
    # Encodes categorical data
    for cat_data in categorical_names:
        generate_file('categorical', path, cat_data, ids_file_name, na_encoding)
        print(f'Encoded {cat_data}')
    
    # Encodes continuous data 
    for con_data in continuous_names:
        generate_file('continuous', path, con_data, ids_file_name, na_encoding)    
        print(f'Encoded {con_data}')

if __name__ == "__main__":
    main()

Encoded diabetes_genotypes
Encoded baseline_drugs
Encoded baseline_categorical
Encoded baseline_continuous
Encoded baseline_transcriptomics
Encoded baseline_diet_wearables
Encoded baseline_proteomic_antibodies
Encoded baseline_target_metabolomics
Encoded baseline_untarget_metabolomics
Encoded baseline_metagenomics
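
To make the transformations described above more concrete, here is a minimal standalone sketch of ID-based ordering, z-score normalisation and one-hot encoding. It is not the implementation behind `generate_file`: the helper names (`zscore_columns`, `one_hot`), the toy data, the `"NA"` marker, and the choice to map missing values to zeros are all assumptions made for illustration. It uses only the standard library so it can be run anywhere.

```python
import math
from statistics import mean, pstdev


def zscore_columns(rows):
    """Z-score each column of an N x M list of lists (NaN marks a missing value).

    Assumption: missing values and constant columns are mapped to 0.0.
    """
    n_features = len(rows[0])
    out = [[0.0] * n_features for _ in rows]
    for j in range(n_features):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        mu = mean(observed)
        sd = pstdev(observed)  # population standard deviation
        for i, r in enumerate(rows):
            if not math.isnan(r[j]) and sd > 0:
                out[i][j] = (r[j] - mu) / sd
    return out


def one_hot(values, na_value="NA"):
    """One-hot encode a list of labels; na_value becomes an all-zero row."""
    categories = sorted({v for v in values if v != na_value})
    index = {c: k for k, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        if v != na_value:
            row[index[v]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical toy data, keyed by sample ID and deliberately out of order
ids = ["s1", "s2", "s3"]
continuous = {"s3": [3.0, 10.0], "s1": [1.0, 30.0], "s2": [2.0, math.nan]}
categorical = {"s2": "B", "s1": "A", "s3": "NA"}

# Use the ID list to put every dataset in the same sample order
cont_rows = [continuous[i] for i in ids]
cat_values = [categorical[i] for i in ids]

z = zscore_columns(cont_rows)
categories, encoded = one_hot(cat_values)

print(categories)  # ['A', 'B']
print(encoded)     # [[1, 0], [0, 1], [0, 0]]
```

Ordering every dataset by the same ID list before encoding is what keeps row i in each encoded matrix referring to the same individual.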
