In [1]:
from pathlib import Path

from move.conf.schema import InputConfig
from move.tasks.encode_data import EncodeData

# Encode data

This notebook runs part of the Multi-Omics Variational autoEncoder (MOVE) framework, which uses the structure identified by the VAE to extract categorical data associations across all continuous datasets. In the MOVE paper, we used it to identify drug associations in clinical and multi-omics data. This part is a guide to encoding data so that it can be used as input to MOVE.

⚠️ The notebook takes user-defined configs in a `config/data` directory.

To encode the data, each dataset must be in TSV format. Each table has shape `N` &times; `M`, where `N` is the number of samples/individuals and `M` is the number of features. Continuous data is z-score normalized (`standardization`), whereas categorical data is one-hot encoded (`one_hot_encoding`). Below is an example of processing continuous and categorical datasets.
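To make the z-score step concrete, here is a minimal NumPy sketch on a toy table. It is not MOVE's own implementation, which may handle missing values or other details differently. The array `X` and its values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical continuous table: 6 samples (rows) x 3 features (columns)
X = rng.normal(loc=10.0, scale=3.0, size=(6, 3))

# Z-score each feature: subtract the column mean, divide by the column std
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# After standardization, every column has mean ~0 and std ~1
print(Z.mean(axis=0).round(6))
print(Z.std(axis=0).round(6))
```

Standardizing per feature puts features measured on different scales on a comparable footing before they are concatenated.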

The first step is to specify where our `random_small` dataset is located, which datasets it comprises, and what kind of pre-processing each dataset will undergo.

In [2]:
base_path = Path("../data")
interim_path = Path("../interim_data")

discrete_dnames = ["random.small.drugs"]
continuous_dnames = ["random.small.proteomics", "random.small.metagenomics"]

# Indicate which kind of pre-processing is required for each dataset.
# If a dataset has already been pre-processed, you can set this to 'none'
disc_conf = [InputConfig(name, preprocessing="one_hot_encode") for name in discrete_dnames]
cont_conf = [InputConfig(name, preprocessing="standardize") for name in continuous_dnames]

The next step is to run the `EncodeData` task.

In [3]:
task = EncodeData(base_path, interim_path, "random.small.ids", disc_conf, cont_conf)
task.run()

[INFO  - EncodeData]: Beginning task: encode data
[INFO  - EncodeData]: Encoding 'random.small.drugs'
[INFO  - EncodeData]: Encoding 'random.small.proteomics'
[INFO  - EncodeData]: Encoding 'random.small.metagenomics'


The data will be encoded accordingly and saved to the directory defined as `interim_data_path` in the `data` configuration (here, the `interim_path` passed to the task).

We can check how the data looks by loading it into a `MoveDataset` object. This type of object concatenates our datasets and keeps information such as the original dataset shapes and feature names.

The drug dataset has been encoded as a matrix of 500 samples &times; 20 drugs &times; 2 categories (either took or did not take the drug), whereas the proteomics and metagenomics datasets keep their original shapes.
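The samples &times; features &times; categories layout can be illustrated with a small NumPy sketch. This is only an illustration of one-hot encoding under assumed inputs, not the code MOVE runs internally. The `labels` array (0 = did not take the drug, 1 = took it) is hypothetical.

```python
import numpy as np

# Hypothetical categorical table: 4 samples x 3 drugs
# 0 = did not take the drug, 1 = took the drug
labels = np.array([
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
])
num_classes = 2

# Indexing an identity matrix with the labels turns each entry
# into a one-hot vector of length `num_classes`
one_hot = np.eye(num_classes, dtype=int)[labels]

print(one_hot.shape)  # (4, 3, 2): samples x drugs x categories
print(one_hot[0])     # one-hot rows for the first sample
```

Each entry of the original table becomes a vector with a single 1, which is why the encoded drug dataset gains a third dimension.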

In [4]:
from move.data.dataset import MoveDataset

dataset = MoveDataset.load(interim_path, discrete_dnames, continuous_dnames)
dataset

MOVE dataset (500 samples)
data,type,# features,# classes
random.small.drugs,discrete,20,2.0
random.small.proteomics,continuous,200,
random.small.metagenomics,continuous,1000,


We can also confirm that the mean of the continuous datasets is now close to 0, and the standard deviation is close to 1.

In [5]:
for continuous_dataset in dataset.continuous_datasets:
    print(f"{continuous_dataset.name}: mean = {continuous_dataset.tensor.mean():.3f}, std = {continuous_dataset.tensor.std():.3f}")

random.small.proteomics: mean = -0.000, std = 0.975
random.small.metagenomics: mean = 0.000, std = 0.975


Alternatively, you can directly read the config YAML and create a task object from its content.

Note that for this to work, the `config` directory must be in the same directory as your notebook.

In [6]:
from move.data.io import read_config

if not Path.cwd().joinpath("config/data").exists():
    raise FileNotFoundError("Requires a 'config/data' directory in the current working directory.")

config = read_config("random_small", "encode_data", "data.raw_data_path='../data'")
task = EncodeData.from_config(config)
task.run()

[INFO  - EncodeData]: Beginning task: encode data
[INFO  - EncodeData]: Encoding 'random.small.drugs'
[INFO  - EncodeData]: Encoding 'random.small.proteomics'
[INFO  - EncodeData]: Encoding 'random.small.metagenomics'
