Lauffenburger-Lab/OmicTranslationBenchmark

AutoTransOP: translating omics signatures without orthologue requirements using deep learning

GitHub repository for the study:

AutoTransOP: Translating Omics Signatures without Orthologue Requirements using Deep Learning
Meimetis, N., Pullen, K. M., Zhu, D. Y., Nilsson, A., Hoang, T. N., Magliacane, S., & Lauffenburger, D. A.

Published in NPJ Systems Biology and Applications: https://doi.org/10.1038/s41540-024-00341-9

This repository is administered by the Lauffenburger Lab and @NickMeim. For questions, contact meimetis@mit.edu.

The trained models from this study are too large to upload here and are available upon reasonable request.

The development of effective therapeutics and vaccines for human diseases requires a systematic understanding of human biology. While animal and in vitro culture models have successfully elucidated the molecular mechanisms of diseases in many studies, they still fail to adequately recapitulate human biology, as evidenced by the high likelihood of failure in clinical trials. To address this broadly important problem, we developed AutoTransOP, a neural network autoencoder framework that maps omics profiles from designated species or cellular contexts into a global latent space, from which germane information can be mapped between different contexts. This approach performs as well as or better than extant machine learning methods and can identify animal/culture-specific molecular features predictive of other contexts, without requiring homology matching. For an especially challenging test case, we successfully apply our framework to a set of inter-species vaccine serology studies, where no 1-1 mapping between human and non-human primate features exists.
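
To make the idea of translating through a shared latent space concrete, here is a minimal shape-level sketch. It is not the architecture used in the study: the encoders and decoders are single untrained linear maps, and every name and size in it (`LinearAutoencoder`, `translate`, `LATENT_DIM`, 50 and 30 features) is a hypothetical choice made for illustration. It only shows that two contexts with different, unmatched feature sets can be connected through one latent space.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8  # hypothetical size of the shared latent space


class LinearAutoencoder:
    """Toy linear encoder/decoder pair for one context (species or cell type)."""

    def __init__(self, n_features, latent_dim=LATENT_DIM):
        # Random, untrained weights: this sketch only illustrates data flow.
        self.w_enc = rng.normal(scale=0.1, size=(n_features, latent_dim))
        self.w_dec = rng.normal(scale=0.1, size=(latent_dim, n_features))

    def encode(self, x):
        """Project profiles of this context into the shared latent space."""
        return x @ self.w_enc

    def decode(self, z):
        """Reconstruct profiles of this context from latent vectors."""
        return z @ self.w_dec


def translate(x, source, target):
    """Map profiles from the source context to the target context."""
    return target.decode(source.encode(x))


# Two contexts with different, unmatched feature sets (sizes are made up).
human = LinearAutoencoder(n_features=50)
primate = LinearAutoencoder(n_features=30)

primate_profiles = rng.normal(size=(4, 30))  # 4 samples, 30 features
predicted_human = translate(primate_profiles, source=primate, target=human)
print(predicted_human.shape)  # (4, 50)
```

Because each context has its own encoder and decoder, the two feature sets never need to be aligned; only the latent dimension is shared. In a real setting the weights would be learned, and the models would typically be nonlinear.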

The current repository contains code for:

  1. Initial evaluation of the quality and preprocessing of the data.
  2. Embedding evaluations of the autoencoder models.
  3. Deep learning models (autoencoder approaches) to translate transcriptomic profiles between different cell types.
  4. Importance estimation of each feature for specific tasks.
  5. Code to re-create the results of the research article.

Data

The transcriptomic signatures (level 5 profiles) of the L1000 CMap resource [1] are used in this study, together with data from the Bioconductor resource [2].

The transcriptomic profiles were generated by measuring 978 important (landmark) genes in cancer with a Luminex bead-based assay and computationally inferring the rest [1].

Details on how to access these data can be found in the data folder, but generally the main resources can be accessed here

Folder structure

  1. article_supplementary_info : Folder containing code to re-create the supplementary figures and tables of the article
  2. data : Folder that should contain the raw data of the study.
  3. figures : Folder containing the scripts to produce the figures of the study (except for the serology and fibrosis case studies).
  4. fibrosis : Folder containing code and data to re-create the results of the article regarding the fibrosis case study.
  5. learning : Folder containing machine learning and deep learning algorithms and models, plus scripts to estimate the importance each model assigns to genes.
  6. preprocessing : Folder containing scripts to pre-process the raw data and evaluate their quality.
    • preprocessed_data : Here the pre-processed data to be used in the subsequent analysis are stored.
  7. results : Folder where the results of subsequent analyses should be stored.
  8. postprocessing : Folder containing scripts to evaluate models' embeddings and genes' importance.
  9. serology : Folder containing code and data to re-create the results of the article regarding the serology case study.
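
As a quick sanity check after cloning, the folder list above can be verified with a few lines of Python. This is a sketch, not a script shipped with the repository: the helper name `missing_folders` is hypothetical, and it only checks the top-level folders named in the list, assuming it is run from the root of a local clone.

```python
from pathlib import Path

# Top-level folders named in the folder structure list.
EXPECTED_FOLDERS = [
    "article_supplementary_info",
    "data",
    "figures",
    "fibrosis",
    "learning",
    "preprocessing",
    "results",
    "postprocessing",
    "serology",
]


def missing_folders(repo_root):
    """Return the expected top-level folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the current working directory is the root of a local clone.
    missing = missing_folders(".")
    print("Missing folders:", missing if missing else "none")
```

Folders such as data and results are meant to be filled by the user, so they may be reported as missing in a fresh clone and can simply be created.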

Installation

The study utilizes multiple resources from the Python and R programming languages.

R dependencies: Check the list below and manually install the libraries you need.

Important Note:

  • This installation was performed in a Windows environment.
  • On Linux, some external dependencies may need to be installed manually, especially for tidyverse. Please check the libraries' online documentation.
  • On macOS, we encourage checking the libraries' online documentation.

In brief, the following R libraries and versions were used to produce the figures and results of the study (although any version of these libraries is appropriate):

  1. R version 4.1.2
  2. tidyverse 1.3.1
  3. BiocManager 1.30.16
  4. cmapR 1.4.0
  5. org.Hs.eg.db 3.13.0
  6. rhdf5 2.36.0
  7. doFuture 0.12.0
  8. doRNG 1.8.2
  9. ggplot2 3.3.5
  10. ggpubr 0.4.0
  11. GeneExpressionSignature 1.38.0
  12. caret 6.0-94
  13. Rtsne 0.16
  14. factoextra 1.0.7
  15. ggpubr 0.6.0
  16. ggpattern 1.1.0
  17. ggridges 0.5.4
  18. ggrepel 0.9.3
  19. rstatix 0.7.2
  20. patchwork 1.1.2.9000
  21. dorothea 1.4.2
  22. fgsea 1.18.0
  23. AnnotationDbi 1.54.1
  24. EGSEAdata 1.20.0
  25. topGO 2.44.0
  26. GO.db 3.13.0

Python dependencies: First, install conda (Anaconda) on your computer. You can then run the commands that follow the list of libraries in a bash terminal.

Important Note:

  • The PyTorch GPU installation changes according to your NVIDIA GPU and CUDA version. Check the PyTorch installation guide for more information.
  • This installation was performed in a Windows environment. For other environments, check the libraries' documentation.

In brief, the following Python libraries and versions were used (different versions are possibly also appropriate):

  1. python 3.8.8
  2. seaborn 0.11.2 (version does not matter for this library)
  3. numpy 1.20.3 (version does not matter for this library)
  4. pandas 1.3.5 (version does not matter for this library)
  5. matplotlib 3.5.1 (version does not matter for this library)
  6. scipy 1.7.3
  7. scikit-learn 1.0.2
  8. CUDA Version: 11.1 - 11.5 (Very important for pytorch GPU installation)
  9. pytorch (it is important to download a version compatible with your system; see the PyTorch website for more information)
  10. GPU: NVIDIA GeForce RTX 3060

# After installing Anaconda, create and activate a conda environment:
conda create -n myenv python=3.8.8
conda activate myenv
# Install the scientific Python libraries into the active environment
conda install numpy
conda install pandas
conda install -c conda-forge scikit-learn
conda install -c conda-forge matplotlib
conda install seaborn
conda install -c anaconda scipy
# This is hardware-specific: choose the cudatoolkit version that matches your GPU and driver
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
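
Once the environment is set up, the installed versions can be compared with the ones listed above. The following is a small sketch using only the Python standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_versions` is hypothetical, and pytorch is left out because no version is pinned for it in the list.

```python
from importlib import metadata

# Versions taken from the list of Python libraries above.
LISTED_VERSIONS = {
    "numpy": "1.20.3",
    "pandas": "1.3.5",
    "matplotlib": "3.5.1",
    "seaborn": "0.11.2",
    "scipy": "1.7.3",
    "scikit-learn": "1.0.2",
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, installed in installed_versions(LISTED_VERSIONS).items():
        print(f"{name}: installed={installed}, listed={LISTED_VERSIONS[name]}")
```

A mismatch is not necessarily a problem, since the list itself notes that other versions are possibly appropriate; the check is mainly useful for spotting packages that are missing from the environment.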

References

  1. Subramanian, Aravind, et al. "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles." Cell 171.6 (2017): 1437-1452.

  2. Gentleman, Robert C., et al. "Bioconductor: open software development for computational biology and bioinformatics." Genome biology 5.10 (2004): 1-16.