MultiPLIER

An unsupervised transfer learning approach for rare disease transcriptomics

Taroni JN, Grayson PC, Hu Q, Eddy S, Kretzler M, Merkel PA, and Greene CS⁺. MultiPLIER: a transfer learning framework reveals systemic features of rare autoimmune disease. bioRxiv. 2018.

⁺Correspondence via issues or to greenescientist@gmail.com

Data

Data used in this analysis repo were processed in greenelab/rheum-plier-data. Please see that repository for relevant citations.

Data and code, including items that are too large to be stored with Git LFS (e.g., some models), associated with v0.2.0 are available at the following DOI: 10.6084/m9.figshare.6982919.v2

Dependencies

We have prepared a Docker image that contains all the dependencies required to reproduce these analyses. See docker/Dockerfile for more information about dependencies.

After installing Docker (Docker documentation), the image can be obtained:

docker pull jtaroni/multi-plier:0.2.0

We use R notebooks for analysis, which can be run and modified using RStudio. RStudio is included on our Docker image. This guide from Andrew Heiss, specifically the Run locally with a GUI section, is a great starting point for working with RStudio and Docker.

Overview

Unsupervised machine learning methods provide a promising means to analyze and interpret large datasets. However, most datasets generated by individual researchers remain too small to fully benefit from these methods. In the case of rare diseases, there may be too few cases available, even when multiple studies are combined. We sought to determine whether machine learning models could be constructed from large public data compendia and then transferred to small datasets for subsequent analysis. We trained models using Pathway Level Information ExtractoR (PLIER) (GitHub) over datasets of different types and scales. Models constructed from large public datasets i) were more detailed than those constructed from individual datasets; ii) included features that aligned well to important biological factors; and iii) were transferable to rare disease datasets, where they describe biological processes related to disease severity more effectively than models trained within those datasets.

We call this approach MultiPLIER because we train on multiple datasets, tissues, and biological conditions.
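To make the transfer step more concrete, here is a minimal sketch of one way to project a new dataset into a latent space that was learned elsewhere: a ridge regression of the new expression matrix against the trained loadings, B = (ZᵀZ + λI)⁻¹ZᵀY. It is plain Python on toy matrices and is only an illustration under stated assumptions. The function names, the penalty value, and the data are hypothetical, and this is not the code used in this repository, where the analyses are written in R.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augment the square matrix a with the right-hand side columns.
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def project(z, y, l2):
    """Ridge projection: B = (Z^T Z + l2 * I)^-1 Z^T Y.

    z: genes x latent variables (loadings from a trained model)
    y: genes x samples (new expression data, same gene order as z)
    """
    zt = transpose(z)
    gram = matmul(zt, z)
    for i in range(len(gram)):
        gram[i][i] += l2
    return solve(gram, matmul(zt, y))


# Hypothetical loadings: 4 genes x 2 latent variables.
Z = [[1.0, 0.0],
     [1.0, 0.0],
     [0.0, 1.0],
     [0.0, 1.0]]

# Hypothetical new dataset: 4 genes x 3 samples.
Y = [[2.0, 0.0, 1.0],
     [2.0, 0.0, 1.0],
     [0.0, 4.0, 1.0],
     [0.0, 4.0, 1.0]]

# Latent variables x samples for the new dataset.
B = project(Z, Y, l2=2.0)
for row in B:
    print(row)
```

The sketch assumes the new dataset has already been restricted to the genes used in training and put in the same order and on a comparable scale. In practice that preprocessing matters at least as much as the projection itself.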

We focus on two sets of systemic autoimmune conditions in this project: one is a group of rare conditions and the other is a disease that is not rare. First, we establish that PLIER is appropriate for use in a single tissue, multi-dataset compendium (greenelab/rheum-plier-data/sle-wb) constructed from publicly available systemic lupus erythematosus (SLE) whole blood (WB) microarray data. We demonstrate that MultiPLIER, trained on the recount2 RNA-seq compendium, performs similarly in capturing certain cell type-specific signals and captures additional pathway signals over an SLE WB model. We also use MultiPLIER to analyze expression data from 3 tissues in anti-neutrophil cytoplasmic antibody (ANCA)-associated vasculitis (AAV), a family of rare diseases.

Overview figure of dataset-specific PLIER and MultiPLIER. Boxes with solid colored fills represent inputs to the model. White boxes with colored outlines represent model output. (A) PLIER (Mao et al., 2017) automatically extracts latent variables (LVs), shown as the matrix B, and their loadings (Z). We can train a PLIER model for each of three datasets from different tissues, which results in three dataset-specific latent spaces. (B) PLIER takes as input a prior information/knowledge matrix C and applies a constraint such that some of the loadings (Z) and therefore some of the LVs capture biological signal in the form of curated pathways or cell type-specific gene sets. (C) Ideally, an LV will map to a single gene set or a group of highly related gene sets to allow for easy interpretation of the model. PLIER applies a penalty on U to facilitate this. Purple fill in a cell indicates a non-zero value and a darker purple indicates a higher value. We show an undesirable U matrix in the top toy example (Ci) and a favorable U matrix in the bottom toy example (Cii). (D) If models have been trained on individual datasets, we may be required to find “matching” LVs in different dataset- or tissue-specific models using the loadings (Z) from each model. Using a metric like the Pearson correlation between loadings, we may or may not be able to find a well-correlated match between datasets. (E) The MultiPLIER approach: train a PLIER model on a large collection of uniformly processed data from many different biological contexts and conditions (recount2; Collado-Torres et al., 2017), which we call a MultiPLIER model, and then project the individual datasets into the MultiPLIER latent space. The hatched fill indicates the sample dataset of origin. (F) Latent variables from the MultiPLIER model can be tested for differential expression between disease and controls in multiple tissues.
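The matching problem described in panel (D) can be sketched as follows: for each LV in one model, find the LV in a second model whose loadings are most strongly Pearson-correlated with it. This is an illustrative sketch only, in plain Python. The function names and the toy loading matrices are hypothetical, and both matrices are assumed to cover the same genes in the same order.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant loading vector carries no correlation signal
    return cov / (sd_x * sd_y)


def best_matches(z_a, z_b):
    """For each LV (column) of z_a, return (best-matching column of z_b, r)."""
    cols_a = list(zip(*z_a))
    cols_b = list(zip(*z_b))
    matches = []
    for col_a in cols_a:
        scores = [pearson(col_a, col_b) for col_b in cols_b]
        best = max(range(len(scores)), key=lambda j: scores[j])
        matches.append((best, scores[best]))
    return matches


# Hypothetical loadings from two dataset-specific models: 5 genes x 2 LVs.
z_model_a = [[0.9, 0.0],
             [0.8, 0.1],
             [0.1, 0.7],
             [0.0, 0.9],
             [0.0, 0.0]]
z_model_b = [[0.0, 1.0],
             [0.2, 0.7],
             [0.8, 0.0],
             [0.9, 0.1],
             [0.1, 0.0]]

matches = best_matches(z_model_a, z_model_b)
for lv, (match, r) in enumerate(matches):
    print(f"model A LV {lv} best matches model B LV {match} (r = {r:.3f})")
```

In this toy case every LV has a well-correlated partner. With real dataset-specific models the best correlation may be low, which is the situation a single shared latent space is meant to avoid.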

For more information about the training set, please see this notebook.

Notebooks

Analysis notebooks are numbered and present in the top-level directory. We've enabled GitHub Pages for easy viewing of the notebooks. Some steps in the pipeline are R scripts rather than notebooks because they are computationally intensive; we exclude these from the TOC below.

Note that not all analyses present in this repository are included in the preprint.

The figure_notebooks directory contains notebooks that were used specifically to generate figures suitable for publication (figure_notebooks/figures).

License

This repository is dual licensed as BSD 3-Clause (source code) and CC0 1.0 (figures, documentation, and our arrangement of the facts contained in the underlying data), with the following exceptions:

recount2 data is licensed CC-BY.

