# Encode data

This notebook runs part of the Multi-Omics Variational autoEncoder (MOVE) framework, which uses the structure identified by the VAE to extract associations of categorical data across all continuous datasets. In the MOVE paper we used it to identify drug associations in clinical and multi-omics data. This part is a guide to encoding the data so that it can be used as input to MOVE.

In [1]:
# Import functions
from hydra import initialize, compose

from move.utils.data_utils import read_ids, generate_file, merge_configs 
from move.utils.logger import get_logger


The notebook merges the user-defined configs in the data.yaml file with the default configs, letting the user-defined values override the defaults, and then reads the variables it needs.  \
To encode the data, each dataset (data type) must be in an N x M format, where N is the number of samples (individuals) and M is the number of features. To use dataset-specific weighting when training the VAE, process the datasets individually or split them when you read them in. Continuous data are z-score normalised and categorical data are one-hot encoded. Below is an example of processing continuous and categorical datasets. To ensure the correct order, the IDs are used to sort the data accordingly.
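To make the two encodings concrete, here is a minimal NumPy sketch of sorting by ID, z-score normalisation, and one-hot encoding. It is an illustration under stated assumptions, not the code inside `generate_file`: the helper names (`reorder_by_ids`, `zscore`, `one_hot`), the toy data, and the handling of missing values (zero after scaling, an all-zero vector after one-hot encoding) are all hypothetical choices made for this example.

```python
import numpy as np


def reorder_by_ids(file_ids, rows, ids):
    """Return `rows` reordered so that they follow the order of `ids`."""
    position = {sample_id: i for i, sample_id in enumerate(file_ids)}
    return [rows[position[sample_id]] for sample_id in ids]


def zscore(values):
    """Column-wise z-score of an N x M array; NaNs are ignored, then set to 0."""
    x = np.asarray(values, dtype=float)
    mean = np.nanmean(x, axis=0)
    std = np.nanstd(x, axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return np.nan_to_num((x - mean) / std, nan=0.0)


def one_hot(values, na_value="na"):
    """One-hot encode N x M labels into N x M x C; missing entries stay all-zero."""
    x = np.asarray(values, dtype=object)
    classes = sorted({v for v in x.ravel() if v != na_value})
    lookup = {c: k for k, c in enumerate(classes)}
    encoded = np.zeros(x.shape + (len(classes),), dtype=np.float32)
    for idx, v in np.ndenumerate(x):
        if v != na_value:
            encoded[idx + (lookup[v],)] = 1.0
    return encoded, classes


# Toy data: the files list samples in a different order than the ID file.
ids = ["s1", "s2", "s3"]
file_ids = ["s3", "s1", "s2"]
continuous_rows = [[3.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
categorical_rows = [["b"], ["a"], ["na"]]

z = zscore(reorder_by_ids(file_ids, continuous_rows, ids))
encoded, classes = one_hot(reorder_by_ids(file_ids, categorical_rows, ids))
```

After running the sketch, each column of `z` has mean 0 and standard deviation 1, and `encoded` has shape `(3, 1, 2)`, with the missing entry for `s2` left as zeros.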

In [2]:
# Initializing the default config 
with initialize(version_base=None, config_path="../src/move/conf"):
    base_config = compose(config_name="main")
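
The idea of overriding default configs with user-defined ones can be pictured as a recursive dictionary merge. The sketch below uses plain Python dictionaries. It is not the implementation of `merge_configs`, and both the `deep_merge` helper and the default values shown are hypothetical.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a new dict in which values from `overrides` replace those in `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them whole.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults and user settings, for illustration only.
default_cfg = {"data": {"na_value": "NA", "raw_data_path": "data/", "ids_has_header": True}}
user_cfg = {"data": {"na_value": "na", "ids_has_header": False}}

cfg = deep_merge(default_cfg, user_cfg)
```

Keys that the user does not set (here `raw_data_path`) keep their default values, while the user-defined `na_value` and `ids_has_header` take precedence.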

In [3]:
def main(base_config=base_config):
        
    # Making logger for data writing
    logger = get_logger(logging_path='./log',
                        file_name='01_encode_data.log',
                        script_name=__name__)
    
    # Override base_config with the user-defined configs from data.yaml
    cfg = merge_configs(base_config=base_config, 
                        config_types=['data'])
    
    # Getting the variables used in the notebook
    raw_data_path = cfg.data.raw_data_path
    interim_data_path = cfg.data.interim_data_path
    headers_path = cfg.data.headers_path
    
    ids_file_name = cfg.data.ids_file_name
    ids_has_header = cfg.data.ids_has_header
    ids_colname = cfg.data.ids_colname
    
    na_encoding = cfg.data.na_value
    categorical_names = cfg.data.categorical_names
    continuous_names = cfg.data.continuous_names    
    
    # Reading ids 
    ids = read_ids(raw_data_path, ids_file_name, ids_colname, ids_has_header)

    # Encoding categorical data
    logger.info('Encoding categorical data')
    for cat_data in categorical_names:
        generate_file('categorical', raw_data_path, interim_data_path, headers_path, cat_data, ids, na_encoding)
    
    # Encoding continuous data 
    logger.info('Encoding continuous data')
    for con_data in continuous_names:
        generate_file('continuous', raw_data_path, interim_data_path, headers_path, con_data, ids, na_encoding)    

if __name__ == "__main__":
    main()

INFO    root         

---------------- Starting running the script ---------------
INFO    data_utils   Overriding the default config with configs from data.yaml
INFO    data_utils   

Configuration used:
data:
  user_config: data.yaml
  na_value: na
  raw_data_path: data/
  interim_data_path: interim_data/
  processed_data_path: processed_data/
  headers_path: headers/
  version: v1
  ids_file_name: baseline_ids.txt
  ids_has_header: false
  ids_colname: 0
  categorical_inputs:
  - name: diabetes_genotypes
    weight: 1
  - name: baseline_drugs
    weight: 1
  - name: baseline_categorical
    weight: 1
  continuous_inputs:
  - name: baseline_continuous
    weight: 2
  - name: baseline_transcriptomics
    weight: 1
  - name: baseline_diet_wearables
    weight: 1
  - name: baseline_proteomic_antibodies
    weight: 1
  - name: baseline_target_metabolomics
    weight: 1
  - name: baseline_untarget_metabolomics
    weight: 1
  - name: baseline_metagenomics
    weight: 1
  data_of_interest