AutoTransOP: translating omics signatures without orthologue requirements using deep learning

GitHub repository of the study:

AutoTransOP: Translating Omics Signatures without Orthologue Requirements using Deep Learning
Meimetis, N., Pullen, K. M., Zhu, D. Y., Nilsson, A., Hoang, T. N., Magliacane, S., & Lauffenburger, D. A.

Published in NPJ Systems Biology and Applications: https://doi.org/10.1038/s41540-024-00341-9

This repository is administered by the Lauffenburger Lab and @NickMeim. For questions contact meimetis@mit.edu

Trained models of this study are too big to be uploaded here and are available upon reasonable request.

The development of effective therapeutics and vaccines for human diseases requires a systematic understanding of human biology. While animal and in vitro culture models have successfully elucidated the molecular mechanisms of diseases in many studies, they still fail to adequately recapitulate human biology, as evidenced by the predominant likelihood of failure in clinical trials. To address this broadly important problem, we developed AutoTransOP, a neural network autoencoder framework to map omics profiles from designated species or cellular contexts into a global latent space, from which germane information can be mapped between different contexts. This approach performs as well as or better than extant machine learning methods and can identify animal/culture-specific molecular features predictive of other contexts, without requiring homology matching. For an especially challenging test case, we successfully apply our framework to a set of inter-species vaccine serology studies, where no one-to-one mapping between human and non-human primate features exists.
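
As a rough illustration of the idea described above, the following sketch shows how two context-specific autoencoders with different numbers of input features can share one latent space, so that a profile is translated by encoding it with the source model and decoding it with the target model. This is not the code of this repository: the class `LinearAutoencoder`, the function `translate`, the feature counts, and the latent dimension are hypothetical, the weights are random and untrained, and NumPy is used instead of the PyTorch models of the study.

```python
import numpy as np

rng = np.random.default_rng(0)


class LinearAutoencoder:
    """Untrained linear encoder/decoder pair for one species or context (illustrative only)."""

    def __init__(self, n_features: int, latent_dim: int):
        scale = 1.0 / np.sqrt(n_features)
        # Random weights stand in for trained parameters.
        self.w_enc = rng.normal(0.0, scale, size=(n_features, latent_dim))
        self.w_dec = rng.normal(0.0, scale, size=(latent_dim, n_features))

    def encode(self, x: np.ndarray) -> np.ndarray:
        """Project profiles (samples x features) into the shared latent space."""
        return x @ self.w_enc

    def decode(self, z: np.ndarray) -> np.ndarray:
        """Map latent vectors back into this context's feature space."""
        return z @ self.w_dec


def translate(x: np.ndarray, source: LinearAutoencoder, target: LinearAutoencoder) -> np.ndarray:
    """Translate profiles by encoding with the source model and decoding with the target model."""
    return target.decode(source.encode(x))


LATENT_DIM = 32  # hypothetical latent size
# Hypothetical feature counts: the two contexts need not share any features.
human = LinearAutoencoder(n_features=978, latent_dim=LATENT_DIM)
animal = LinearAutoencoder(n_features=650, latent_dim=LATENT_DIM)

animal_profiles = rng.normal(size=(4, 650))  # 4 synthetic samples
predicted_human = translate(animal_profiles, source=animal, target=human)
print(predicted_human.shape)  # (4, 978)
```

Because the two models only meet in the latent space, no feature-by-feature (orthologue) correspondence between the two input spaces is needed in this sketch.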

The current repository contains code for:

Initial evaluation of the quality and preprocessing of the data.
Embedding evaluations of the autoencoder models.
Deep learning models (autoencoder-based approaches) to translate transcriptomic profiles between different cell types.
Importance estimation of each feature for specific tasks.
Code to re-create the results of the research article.

Data

The transcriptomic signatures (level 5 profiles) of the L1000 CMap resource¹ are used for this study, together with data from the Bioconductor resource².

The transcriptomic profiles were generated by measuring 978 important (landmark) genes in cancer with a Luminex bead-based assay and computationally inferring the rest¹.

Details on how to access these data can be found in the data folder, but generally the main resources can be accessed here

Folder structure

article_supplementary_info : Folder containing code to re-create the supplementary figures and tables of the article
data : Folder that should contain the raw data of the study.
figures : Folder containing the scripts to produce the figures of the study (except for the serology and fibrosis case studies).
fibrosis : Folder containing code and data to re-create the results of the article regarding the fibrosis case study.
learning : Folder containing machine and deep learning algorithms and models and scripts to estimate genes' importance as considered by each model.
preprocessing : Folder containing scripts to pre-process the raw data and evaluate their quality.
- preprocessed_data : Here the pre-processed data to be used in the subsequent analysis are stored.
results : Here the results of a subsequent analysis should be stored.
postprocessing : Folder containing scripts to evaluate models' embeddings and genes' importance.
serology : Folder containing code and data to re-create the results of the article regarding the serology case study.
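
After cloning, it can be convenient to confirm that the folders listed above are present before running any script. The snippet below is a minimal sketch of such a check; the helper `missing_folders` is hypothetical and not part of this repository, and it assumes that `preprocessed_data` is nested inside `preprocessing`, as the list above suggests.

```python
from pathlib import Path

# Folder names taken from the list above.
EXPECTED_FOLDERS = [
    "article_supplementary_info",
    "data",
    "figures",
    "fibrosis",
    "learning",
    "preprocessing",
    "preprocessing/preprocessed_data",  # assumption: nested under preprocessing
    "results",
    "postprocessing",
    "serology",
]


def missing_folders(repo_root):
    """Return the expected folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_folders(".")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected folders are present.")
```

Run it from the root of the cloned repository; any folder it reports as missing (for example `results`) can be created by hand.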

Installation

The study utilizes multiple resources from the Python and R programming languages.

R dependencies: Check the list below and manually install the libraries you need.

Important Note:

This installation was performed in a Windows environment.
A Linux installation may require manually installing some external dependencies, especially for tidyverse. Please check the libraries' documentation online.
For a macOS installation, we encourage checking the libraries' documentation online.

In brief, the following R libraries and versions (although any version of the following libraries is appropriate) were used to produce the figures and results of the study:

R version 4.1.2
tidyverse 1.3.1
BiocManager 1.30.16
cmapR 1.4.0
org.Hs.eg.db 3.13.0
rhdf5 2.36.0
doFuture 0.12.0
doRNG 1.8.2
ggplot2 3.3.5
ggpubr 0.4.0
GeneExpressionSignature 1.38.0
caret 6.0-94
Rtsne 0.16
factoextra 1.0.7
ggpubr 0.6.0
ggpattern 1.1.0
ggridges 0.5.4
ggrepel 0.9.3
rstatix 0.7.2
patchwork 1.1.2.9000
dorothea 1.4.2
fgsea 1.18.0
AnnotationDbi 1.54.1
EGSEAdata 1.20.0
topGO 2.44.0
GO.db 3.13.0

Python dependencies: First, install conda (Anaconda) on your computer. You can then run the commands listed after the list of libraries in a bash terminal.

Important Note:

The PyTorch GPU installation depends on your NVIDIA GPU and CUDA version. Check the PyTorch installation guide for more information.
This installation was performed in a Windows environment. For other environments, check the libraries' documentation.

In brief, the following Python libraries and versions (although different versions are possibly also appropriate) were used:

python 3.8.8
seaborn 0.11.2 (version does not matter for this library)
numpy 1.20.3 (version does not matter for this library)
pandas 1.3.5 (version does not matter for this library)
matplotlib 3.5.1 (version does not matter for this library)
scipy 1.7.3
scikit-learn 1.0.2
CUDA version: 11.1 - 11.5 (very important for the pytorch GPU installation)
pytorch (it is important to download a version compatible with your system; see the PyTorch website for more information)
GPU: NVIDIA GeForce RTX 3060

# After installing anaconda, create a conda environment and activate it
# so that the packages below are installed into it:
conda create -n myenv python=3.8.8
conda activate myenv
conda install numpy
conda install pandas
conda install -c conda-forge scikit-learn
conda install -c conda-forge matplotlib
conda install seaborn
conda install -c anaconda scipy
# This is hardware-specific
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
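
Once the commands above have finished, a quick check from inside the activated environment can show which library versions were actually installed and whether PyTorch can see a GPU. The following is a minimal sketch, not part of this repository: the helper names `installed_versions` and `cuda_status` are hypothetical, and the reference versions are simply those listed above.

```python
from importlib import metadata

# Versions reported in this README; other versions may also be appropriate.
REPORTED_VERSIONS = {
    "numpy": "1.20.3",
    "pandas": "1.3.5",
    "matplotlib": "3.5.1",
    "seaborn": "0.11.2",
    "scipy": "1.7.3",
    "scikit-learn": "1.0.2",
}


def installed_versions(packages):
    """Return {package: installed version, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def cuda_status():
    """Describe whether PyTorch is importable and whether it can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "pytorch is not installed"
    if torch.cuda.is_available():
        return f"pytorch {torch.__version__}, CUDA device: {torch.cuda.get_device_name(0)}"
    return f"pytorch {torch.__version__}, no CUDA device visible"


if __name__ == "__main__":
    for name, version in installed_versions(REPORTED_VERSIONS).items():
        reported = REPORTED_VERSIONS[name]
        print(f"{name}: installed={version}, reported in README={reported}")
    print(cuda_status())
```

If no CUDA device is reported although an NVIDIA GPU is present, the installed PyTorch build probably does not match the local CUDA version.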

References

1. Subramanian, Aravind, et al. "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles." Cell 171.6 (2017): 1437-1452.
2. Gentleman, Robert C., et al. "Bioconductor: open software development for computational biology and bioinformatics." Genome biology 5.10 (2004): 1-16.

Lauffenburger-Lab/OmicTranslationBenchmark
