In [1]:
from move.data import io
from move.tasks import encode_data

# Encode data

This notebook runs part of the Multi-Omics Variational autoEncoder (MOVE) framework, which uses the structure identified by the VAE to extract associations between categorical data and all continuous datasets. In the MOVE paper, we used it to identify drug associations in clinical and multi-omics data. This part is a guide to encoding data so that it can be used as input to MOVE.

⚠️ The notebook takes user-defined configs in a `config/data` directory.

To encode the data, each dataset must be in TSV format. Each table has `N` &times; `M` shape, where `N` is the number of samples/individuals and `M` is the number of features. Continuous data is z-score normalized, whereas categorical data is one-hot encoded. Below is an example of processing continuous and categorical datasets.
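To make the z-score step concrete, here is a minimal NumPy sketch of per-feature standardization on a synthetic `N` &times; `M` table. It is not MOVE's own code: the `zscore` helper and the synthetic data are assumptions for illustration, and the actual `encode_data` task may apply extra steps (for example, handling of missing values) that are not shown here.

```python
import numpy as np


def zscore(values: np.ndarray) -> np.ndarray:
    """Standardize each column (feature) to mean 0 and standard deviation 1."""
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    std[std == 0] = 1.0  # leave constant features at 0 instead of dividing by zero
    return (values - mean) / std


# Synthetic stand-in for a continuous dataset: 500 samples x 4 features
rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(500, 4))

scaled = zscore(raw)
print(scaled.mean(axis=0).round(3), scaled.std(axis=0).round(3))
```

Standardizing per feature puts measurements with very different scales on a comparable footing before they are passed to the model.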

The first step is to read the configuration called `random_small` and specify the pre-defined task called `encode_data`.

In [2]:
config = io.read_config("random_small", "encode_data")

The next step is to run the `encode_data` task, passing our `config` object.

In [3]:
encode_data(config.data)

[INFO  - encode_data]: Beginning task: encode data
[INFO  - encode_data]: Encoding 'random.small.drugs'
[INFO  - encode_data]: Encoding 'random.small.proteomics'
[INFO  - encode_data]: Encoding 'random.small.metagenomics'


Each dataset is encoded according to its type and saved to the directory defined as `interim_data_path` in the `data` configuration.

We can confirm how the data looks by loading it.

In [4]:
from pathlib import Path

path = Path(config.data.interim_data_path)

cat_datasets, cat_names, con_datasets, con_names = io.load_preprocessed_data(
    path,
    config.data.categorical_names,
    config.data.continuous_names,
)

In [5]:
assert len(cat_datasets) == 1  # one categorical dataset
assert len(con_datasets) == 2  # two continuous datasets
assert len(cat_names) == 1
assert len(con_names) == 2

The drug dataset has been encoded as a matrix of 500 samples &times; 20 drugs &times; 2 categories (either took or did not take the drug), whereas the proteomics and metagenomics datasets keep their original shapes.

In [6]:
dataset_names = config.data.categorical_names + config.data.continuous_names

for dataset, dataset_name in zip(cat_datasets + con_datasets, dataset_names):
    print(f"{dataset_name}: {dataset.shape}")

random.small.drugs: (500, 20, 2)
random.small.proteomics: (500, 200)
random.small.metagenomics: (500, 1000)
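The third axis of the drug dataset comes from one-hot encoding: every sample/feature entry becomes a vector with one slot per category. The sketch below reproduces that shape with plain NumPy on synthetic data. The `one_hot` helper and the 0/1 coding of the categories are assumptions for illustration, not MOVE's implementation.

```python
import numpy as np


def one_hot(codes: np.ndarray, num_categories: int) -> np.ndarray:
    """Turn an (N, M) table of integer category codes into an (N, M, C) array."""
    # Indexing the identity matrix with the codes picks one unit row per entry
    return np.eye(num_categories, dtype=np.float32)[codes]


# Synthetic stand-in: 500 samples x 20 drugs, 0 = did not take, 1 = took
rng = np.random.default_rng(0)
took_drug = rng.integers(0, 2, size=(500, 20))

encoded = one_hot(took_drug, num_categories=2)
print(encoded.shape)  # (500, 20, 2)
```

Taking `encoded.argmax(axis=-1)` recovers the original integer codes, so no information is lost by the encoding.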


We can also confirm that the mean of the continuous datasets is now close to 0, and the standard deviation is close to 1.

In [7]:
# Pair each continuous dataset with its name from the config
for dataset, dataset_name in zip(con_datasets, config.data.continuous_names):
    print(f"{dataset_name}: mean = {dataset.mean():.3f}, std = {dataset.std():.3f}")

random.small.proteomics: mean = -0.000, std = 0.975
random.small.metagenomics: mean = 0.000, std = 0.975
