# Encode data

This notebook runs part of the Multi-Omics Variational autoEncoder (MOVE) framework, which uses the structure identified by the VAE to extract associations between categorical data and all continuous datasets. In the MOVE paper, we used it to identify drug associations in clinical and multi-omics data. This part is a guide to encoding the data so that it can be used as input to MOVE.

In [1]:
# Import functions
from hydra import initialize, compose
from omegaconf import OmegaConf

from move._utils.data_utils import generate_file, merge_configs 

The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="../conf", config_name="main")


To encode the data, each dataset/data type must be in an N x M format, where N is the number of samples/individuals and M is the number of features. To use the dataset-specific weighting when training the VAE, you need to process the datasets individually or split them when you read them in. The continuous data is z-score normalised and the categorical data is one-hot encoded. The next code cell processes the categorical and continuous datasets named in the configuration. To ensure the correct order, the IDs are used to sort the data accordingly.
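The encoding steps described above can be sketched on toy data as follows. This is a minimal illustration, not the implementation behind `generate_file`: the helper names (`sort_by_ids`, `zscore_columns`, `one_hot`) are hypothetical, and the choices to replace missing continuous values with 0 after scaling and to encode missing categories as an all-zero row are assumptions made for the example.

```python
import numpy as np


def sort_by_ids(sample_ids, reference_ids, rows):
    """Reorder rows (one per sample) so they follow the order of reference_ids."""
    position = {sample_id: i for i, sample_id in enumerate(sample_ids)}
    return np.asarray(rows)[[position[sample_id] for sample_id in reference_ids]]


def zscore_columns(values):
    """Z-score normalise each feature (column) of an N x M array, ignoring NaNs."""
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values, axis=0)
    std = np.nanstd(values, axis=0)
    std[std == 0] = 1.0  # constant features: avoid division by zero
    scaled = (values - mean) / std
    # Assumption: missing values are set to 0 (the feature mean after scaling).
    return np.nan_to_num(scaled, nan=0.0)


def one_hot(labels, categories):
    """One-hot encode N labels into an N x len(categories) array.

    Assumption: a missing or unknown label becomes an all-zero row.
    """
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)))
    for row, label in enumerate(labels):
        if label in index:
            encoded[row, index[label]] = 1.0
    return encoded


# Toy data: three samples stored in a different order than the ID file.
reference_ids = ["s1", "s2", "s3"]
sample_ids = ["s3", "s1", "s2"]
continuous = [[3.0, 10.0], [1.0, 10.0], [2.0, np.nan]]
categorical = ["B", "A", None]

continuous_encoded = zscore_columns(sort_by_ids(sample_ids, reference_ids, continuous))
categorical_encoded = one_hot(sort_by_ids(sample_ids, reference_ids, categorical), ["A", "B"])
```

Each dataset is sorted against the same reference IDs before encoding, so row `i` refers to the same individual in every encoded array.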

In [2]:
# Initializing the default config 
with initialize(version_base=None, config_path="src/move/conf"):
    base_config = compose(config_name="main")

def main(base_config=base_config):
    
    # Merging the user-defined data.yaml with the base_config to override it
    # (only the 'data' config type is merged in this notebook).
    print('Overriding the default configuration with configuration from data.yaml')
    cfg = merge_configs(base_config=base_config, 
                        config_types=['data'])
    
    # Getting the variables used in the notebook
    path = cfg.data.processed_data_path
    ids_file_name = cfg.data.ids_file_name
    na_encoding = cfg.data.na_value
    categorical_names = cfg.data.categorical_names
    continuous_names = cfg.data.continuous_names    
    
    # Encoding categorical data
    print('Encoding categorical data')
    for cat_data in categorical_names:
        generate_file('categorical', path, cat_data, ids_file_name, na_encoding)
        print(f'  Encoded {cat_data}')
    
    # Encoding continuous data 
    print('Encoding continuous data')
    for con_data in continuous_names:
        generate_file('continuous', path, con_data, ids_file_name, na_encoding)    
        print(f'  Encoded {con_data}')

if __name__ == "__main__":
    main()

Overriding the default configuration with configuration from data.yaml

Configuration used: 
---
data:
  user_config: data.yaml
  na_value: 'nan'
  raw_data_path: data/
  interim_data_path: data/
  processed_data_path: data/
  version: v1
  ids_file_name: baseline_ids
  categorical_inputs:
  - name: diabetes_genotypes
    weight: 1
  - name: baseline_drugs
    weight: 1
  - name: baseline_categorical
    weight: 1
  continuous_inputs:
  - name: baseline_continuous
    weight: 2
  - name: baseline_transcriptomics
    weight: 1
  - name: baseline_diet_wearables
    weight: 1
  - name: baseline_proteomic_antibodies
    weight: 1
  - name: baseline_target_metabolomics
    weight: 1
  - name: baseline_untarget_metabolomics
    weight: 1
  - name: baseline_metagenomics
    weight: 1
  data_of_interest: baseline_drugs
  data_features_to_visualize:
  - drug_1
  - clinical_continuous_2
  - clinical_continuous_3
  categorical_names: ${names:${data.categorical_inputs}}
  continuous_names: ${names