diff --git a/docs/_includes/common-input-formats.rst b/docs/_includes/common-input-formats.rst index c6a6421be..cb38f0bd6 100644 --- a/docs/_includes/common-input-formats.rst +++ b/docs/_includes/common-input-formats.rst @@ -1,6 +1,6 @@ -We support three types of input: VCF, FASTA, BED. If you want to predict effects of noncoding variants, use VCF format input. If you want to predict chromatin feature probabilities for DNA sequences, use FASTA format. If you want to predict RBP interaction probabilities for transcript sequences, you can also use FASTA format. If you want to specify sequences from the human reference genome (GRCh37/hg19), you can use BED format. See below for a quick introduction: +We support three types of input: VCF, FASTA, BED. If you want to predict effects of noncoding variants, use VCF format input. If you want to predict chromatin feature probabilities for DNA sequences, use FASTA format. If you want to specify sequences from the human reference genome, you can use BED format. See below for a quick introduction: -**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. Currently, the genome position needs to be in GRCh37/hg19. +**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. **FASTA format** input should include sequences of |bp_length|\ bp length each. If a sequence is different from |bp_length|\ bp: @@ -12,4 +12,4 @@ We support three types of input: VCF, FASTA, BED. If you want to predict effects - **Note**: This padding behavior is not recommended. 
N's were extremely rare in training data (only appearing in assembly gaps), and the model has not been evaluated with artificially padded sequences - **Strong recommendation**: Always provide sequences of exactly |bp_length|\ bp by including genomic flanking sequences -**BED format** provides another way to specify sequences in human reference genome (hg19). The BED input should specify |bp_length|\ bp-length regions. A minimal example is |bed_example|. The three columns are chromosome, start position, and end position. +**BED format** provides another way to specify sequences in human reference genome. A minimal example is |bed_example|. The three columns are chromosome, start position, and end position. diff --git a/docs/_includes/common-submission-info.rst b/docs/_includes/common-submission-info.rst index bfdb322a8..4a06a9691 100644 --- a/docs/_includes/common-submission-info.rst +++ b/docs/_includes/common-submission-info.rst @@ -1,7 +1,3 @@ -Genome coordinates -~~~~~~~~~~~~~~~~~~ -We support only ``GRCh37/hg19`` genome coordinates. You can use LiftOver to convert your coordinates to the correct version. - Large submissions ~~~~~~~~~~~~~~~~~ We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly. \ No newline at end of file diff --git a/docs/asdbrowser.rst b/docs/asdbrowser.rst new file mode 100644 index 000000000..90cda7d30 --- /dev/null +++ b/docs/asdbrowser.rst @@ -0,0 +1,23 @@ +============== +ASD Browser +============== + +Introduction +------------ + +The ASD browser allows exploration of predicted impact (chromatin effects and post-transcriptional RNA-binding protein impact) of de novo noncoding variants in autism probands in the `Simons Simplex Collection `_. 
+ +The variant prediction approach is described in the following manuscript: Zhou J, Park CY, Theesfeld CL, Wong AK, Yuan Y, Scheckel C, Fak JJ, Funk J, Yao K, Tajima Y, Packer A, Darnell RB, Troyanskaya OG. (2019). `Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk `_. Nature Genetics. + +Description +----------- + +This website provides a user-friendly interactive interface for exploring the sequence-based predicted effects of SSC ASD proband mutations. It shows both individual molecular-level effects at the chromatin (“DNA”) and RNA-binding protein (“RNA”) levels and Disease Impact Scores that summarize these molecular-level effects. The methodology and analysis are described in the manuscript “Whole-genome deep learning analysis reveals causal role of noncoding mutations in autism”. + +.. image:: img/genome_browser.png + +The Genome browser can be navigated by entering a genomic interval, a gene name, or interactively through zooming in/out and scrolling. The tracks “DNA Disease Impact Score” and “RNA Disease Impact Score” show the mutation disease impact scores (DIS) from the DNA and RNA models, respectively. DIS scores summarize molecular-level biochemical effects at the DNA and RNA levels into two scores based on regularized logistic regression classifiers trained with HGMD mutations. + +.. image:: img/genome_heatmap.png + +Individual molecular-level biochemical effects are shown as a heatmap. The biochemical features are sorted by the magnitude of the predicted effects of the center mutation. Each mutation may be clicked to center the genome browser and the heatmap at that mutation, or the heatmap may be dragged to alter the center mutation. The user can select “DNA features” or “RNA features” from the dropdown menu. Mousing over any individual prediction in the heatmap will display details in a tooltip. 
\ No newline at end of file diff --git a/docs/beluga.rst b/docs/beluga.rst index a176e1c8d..04000b3df 100644 --- a/docs/beluga.rst +++ b/docs/beluga.rst @@ -1,32 +1,27 @@ ======= -DeepSEA (Beluga) +Beluga (DeepSEA) ======= Introduction ------------ -DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. +DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features. -The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features. Beluga is described in: +Beluga is described in: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_. Nature Genetics (2018). -Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, **Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk**. Nature Genetics (2018). -To determine if certain features (ie. 
transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table `_ which has all the profiles used to train DeepSEA. +DeepSEA is originally described in the following manuscript: Jian Zhou, Olga G. Troyanskaya. `Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model `_ Nature Methods (2015). -DeepSEA is originally described in the following manuscript: - -Jian Zhou, Olga G. Troyanskaya. **Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model.** Nature Methods (2015). - -To determine if certain features (ie. transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table `_ which has all the profiles used to train DeepSEA. +To determine if certain features (ie. transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table `_ which has all the profiles used to train Beluga. Input ----- -DeepSEA predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be utilized for predicting chromatin features for any DNA sequence. +Beluga predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be utilized for predicting chromatin features for any DNA sequence. .. |bp_length| replace:: 2000 -.. |bed_example| replace:: ``chr1 109817091 109819090`` +.. |bed_example| replace:: ``chr5 134871851 134871852`` .. 
include:: _includes/common-input-formats.rst @@ -44,7 +39,7 @@ Regulatory feature scores Variant scores ~~~~~~~~~~~~~~ -* **Disease Impact Score (DIS)**: DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted transcriptional or post-transcriptional regulatory effects of these mutations (see Zhou et al., 2019). The predicted DIS probabilities are then converted into 'DIS e-values', computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is: +* **Disease Impact Score (DIS)**: DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted transcriptional or post-transcriptional regulatory effects of these mutations (see `Zhou et al., 2019 `_). The predicted DIS probabilities are then converted into 'DIS e-values', computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is: .. math:: -\log_{10}(\text{DIS e-value}_{feature}) diff --git a/docs/citations.rst b/docs/citations.rst index dadb09095..568f317ae 100644 --- a/docs/citations.rst +++ b/docs/citations.rst @@ -6,9 +6,32 @@ Tissue-specific networks, NetWAS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Greene CS*, Krishnan A*, Wong AK*, Ricciotti E, Zelaya RA, Himmelstein DS, Zhang R, Hartmann BM, Zaslavsky E, Sealfon SC, Chasman DI, FitzGerald GA, Dolinski K, Grosser T, Troyanskaya OG. (2015). `Understanding multicellular function and disease with human tissue-specific networks `_. Nature Genetics. 10.1038/ng.3259w. +Functional module detection +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG. (2016). `Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder `_. Nature Neuroscience. + +Sei +~~~~ +Chen, K. M., Wong, A. K., Troyanskaya, O. 
G., & Zhou, J. (2022), `A sequence-based global map of regulatory activity for deciphering human genetics `_. Nature Genetics. + +Beluga (DeepSEA) +~~~~~~~~~~~~~~~~ +Beluga model: Zhou, J., Theesfeld, C. L., Yao, K., Chen, K. M., Wong, A. K., & Troyanskaya, O. G. (2018), `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_. Nature Genetics. + +Original publication of the DeepSEA method: Zhou, J., & Troyanskaya, O. G. (2015), `Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model `_. Nature Methods. + +Seqweaver +~~~~~~~~~~ +Park, C. Y., Zhou, J., Wong, A. K., Chen, K. M., Theesfeld, C. L., Darnell, R. B., & Troyanskaya, O. G. (2021). `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk `_. Nature Genetics. + + Variant effect predictions (ExPecto) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, and Troyanskaya OG. (2018) Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk, Nature Genetics. +Zhou, J., Theesfeld, C. L., Yao, K., Chen, K. M., Wong, A. K., & Troyanskaya, O. G. (2018), `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_, Nature Genetics. + +ExPectoSC +~~~~~~~~~ +Sokolova, K., Theesfeld, C. L., Wong, A. K., Zhang, Z., Dolinski, K., & Troyanskaya, O. G. (2023), `Atlas of primary cell-type specific sequence models of gene expression and variant effects `_. Cell Reports Methods. Autism gene predictions ~~~~~~~~~~~~~~~~~~~~~~~ @@ -16,4 +39,4 @@ Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Pack Tissue-expression predictions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -\W. Ju#, C.S. Greene#, F. Eichinger, V. Nair, J.B. Hodgin, M. Bitzer, Y. Lee, Q. Zhu, M. Kehata, M. Li, M.P. Rastaldi, C.D. Cohen, O.G. Troyanskaya*, and M. Kretzler*. 
Defining cell type specificity at the transcriptional level in human disease. Genome Research. 23:1862-1873. 2013 +Ju, W.*, Greene, C. S.*, Eichinger, F., Nair, V., Hodgin, J. B., Bitzer, M., ... Troyanskaya, O. G.* & Kretzler, M.* (2013). `Defining cell-type specificity at the transcriptional level in human disease `_. Genome Research. diff --git a/docs/expecto.rst b/docs/expecto.rst index 218bbe54e..a3a767935 100644 --- a/docs/expecto.rst +++ b/docs/expecto.rst @@ -4,13 +4,12 @@ ExPecto Introduction ------------ -ExPecto is a framework for ab initio sequence-based prediction of mutation gene expression effects and disease risks. With this web interface, we provide an explorer of tissue-specific expression effect predictions. The current release contains all single nucleotide substitutions within 1kb to the representative TSS of a gene and all 1000 Genomes variants that passed a minimum predicted effect threshold (>0.3 log fold-change in any tissue). +ExPecto is a framework for ab initio sequence-based prediction of mutation gene expression effects and disease risks. With this web interface, we provide an explorer of tissue-specific expression effect predictions. -The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository `_. +The ExPecto framework is described in the following manuscript: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_, Nature Genetics (2018). -The ExPecto framework is described in the following manuscript: +The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository `_. -Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. 
Troyanskaya, Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk, Nature Genetics, 2018 Output ------ @@ -28,7 +27,7 @@ This is the bulk download `link `_. The full prediction of all 140 million mutations can be downloaded `here `_ (~125G). diff --git a/docs/expectosc.rst b/docs/expectosc.rst index 067cf1341..b01bd5ef0 100644 --- a/docs/expectosc.rst +++ b/docs/expectosc.rst @@ -8,11 +8,10 @@ Introduction ExPectoSC is a framework for `ab initio` sequence-based prediction of mutation gene expression effects for primary human cell types. With this web interface, we provide an explorer of cell type-specific expression effect predictions. The current release contains all ClinVar variants within +/- 20kb of the representative TSS of a gene. We use 1000 Genomes variant effects predictions for z-score normalization. No effect threshold was employed for the current release of the data. -The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository `_. +The ExPectoSC framework is described in the following manuscript: Ksenia Sokolova, Chandra L. Theesfeld, Aaron K. Wong, Zijun Zhang, Kara Dolinski and Olga G. Troyanskaya, `Atlas of primary cell-type specific sequence models of gene expression and variant effects `_. Cell Reports Methods (2023). -The ExPectoSC framework is described in the following manuscript: -Ksenia Sokolova, Chandra L. Theesfeld, Aaron K. Wong, Zijun Zhang, Kara Dolinski and Olga G. Troyanskaya, Atlas of primary cell-type specific sequence models of gene expression and variant effects, Submitted, 2023 +The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository `_. 
Website overview ---------------- @@ -66,5 +65,4 @@ Method Details -------------- ExPectoSC is a modular framework that uses a regularized linear module on top of a deep convolutional network model of chromatin profiling effects to predict cell type-specific expression. The framework is capable of predicting expression levels directly from sequence and is sensitive to sequence variations. -The chromatin predictions were computed using a DeepSEA "Beluga" model, using a sliding window approach of 2000bp width with 200bp step, for the 40kb region surrounding the TSS. An exponential condense function is then used to reduce the dimensionality of the data before using it in the module 2. - +The chromatin predictions were computed using a DeepSEA "Beluga" model, using a sliding window approach of 2000bp width with 200bp step, for the 40kb region surrounding the TSS. An exponential condense function is then used to reduce the dimensionality of the data before using it in the regularized linear module. diff --git a/docs/functional-networks.rst b/docs/functional-networks.rst index 0938a19bb..aa7f32343 100644 --- a/docs/functional-networks.rst +++ b/docs/functional-networks.rst @@ -1,18 +1,22 @@ -Functional Networks +Tissue-specific Networks =========================== -In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, probabilistically weighted and integrated, focused on a particular biological question. These questions can include, for example, the function of a gene, the relationship between two pathways, or the processes disrupted in a genetic disorder. (Huttenhower, et. 
al 2008) +In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, probabilistically weighted and integrated, focused on a particular tissue or process context. + +It is important to consider gene relationships within a tissue context as the precise actions of genes are frequently dependent on their tissue context, and human diseases result from the disordered interplay of tissue- and cell lineage–specific processes. These factors combine to make the understanding of tissue-specific gene functions, disease pathophysiology and gene-disease associations particularly challenging. + +Tissue-specific network construction is described in the following publication: Greene, C. S., Krishnan, A., Wong, A. K., Ricciotti, E., Zelaya, R. A., Himmelstein, D. S., ... & Troyanskaya, O. G. (2015). `Understanding multicellular function and disease with human tissue-specific networks `_. Nature Genetics. Method --------------------------- Briefly, functional integration relies on the construction of process-specific functional relationship networks. These are interaction networks in which each node represents a gene, each edge a functional relationship, and an edge between two genes is probabilistically weighted based on experimental evidence relating to those genes. We integrate evidence from many data sets, with each data set weighted in a process-specific manner. -One naïve Bayesian classifier is trained per biological area of interest (e.g. a tissue, or a specific biological process), using the appropriate gold standard for the biological context in addition to one global process-unaware classifier trained using the complete gold standard. 
Each classifier f consisted of a class node predicting the binary presence or absence of a functional relationship (FR) between two genes and n nodes conditioned on FR, each representing the value of a data set Dk. +One naïve Bayesian classifier is trained per biological area of interest (e.g. a tissue, or a specific biological process), using the appropriate gold standard for the biological context in addition to one global process-unaware classifier trained using the complete gold standard. Each classifier consisted of a class node predicting the binary presence or absence of a functional relationship (FR) between two genes and n nodes conditioned on FR, each representing the value of a data set. -Parameter regularization is performed as described in Steck and Jaakkola (2002) using mutual information between data sets to estimate a strength of prior belief for each data set. While a large amount of shared information does not guarantee a redundant data set, since the same subset of information could be shared many times, it provides a valuable quantitative estimate of data set uniqueness. +Parameter regularization is performed as described in `Steck and Jaakkola (2002) `_ using mutual information between data sets to estimate a strength of prior belief for each data set. While a large amount of shared information does not guarantee a redundant data set, since the same subset of information could be shared many times, it provides a valuable quantitative estimate of data set uniqueness. -Genomics data types +Data integration --------------------------- -We collected and integrated 987 genome-scale data sets encompassing approximately 38,000 conditions from an estimated 14,000 publications including both expression and interaction measurements. +We collected and integrated 987 genome-scale data sets encompassing approximately 38,000 conditions from an estimated 14,000 publications including both expression and interaction measurements. 
To integrate these data, we automatically assess each data set for its relevance to each of 144 tissue- and cell lineage–specific functional contexts. The resulting functional maps provide a detailed portrait of protein function and interactions in specific human tissues and cell lineages ranging from B lymphocytes to the renal glomerulus and the whole brain. This approach allows us to profile the specialized function of genes in a high-throughput manner, even in tissues and cell lineages for which no or few tissue-specific data exist. * Gene co-expression: All gene expression data sets are from NCBI's Gene Expression Omnibus (GEO). Genes with more than 30% of values missing were removed, and remaining missing values were imputed using ten nearest neighbors. Non-log-transformed data sets were log transformed. Expression measurements were summarized to Entrez identifiers, and duplicate identifiers were merged. The Pearson correlation was calculated for each gene pair, normalized with Fisher's z transform, mean subtracted and divided by the standard deviation. @@ -32,3 +36,12 @@ Contribution of dataset D to an edge functional relationship prediction (FR):: contribution(D) = P(FR | D) - P(FR) Note that the contributions will not sum to 1.0, as each contribution is measured separately. Generally, individual gene expression datasets will not contribute much to the posterior probability but cumulatively can make a significant contribution. + +Example +--------------------------- + +IL1B in blood vessel +~~~~~~~~~~~~~~~~~~~~~~~~~ +We examined and experimentally verified the tissue-specific molecular response of blood vessel cells to stimulation by IL-1β (IL1B), a pro-inflammatory cytokine. We anticipated that the genes most tightly connected to IL1B in the blood vessel network would be among those responding to IL-1β stimulation in blood vessel cells. 
We tested this hypothesis by profiling the gene expression of human aortic smooth muscle cells (HASMCs; the predominant cell type in blood vessels) stimulated with IL-1β. + +Examination of the genes whose expression was significantly upregulated at 2 h after stimulation showed that 18 of the 20 IL1B network neighbors were among the top 500 most upregulated genes in the experiment (P = 2.07 × 10\ :sup:`−23`). The blood vessel network was the most accurate tissue network in predicting this experimental outcome; none of the other 143 tissue-specific networks or the tissue-naive network performed as well when evaluated by each network's ability to predict the result of IL-1β stimulation on the cells. diff --git a/docs/in-silico-mutagenesis.rst b/docs/in-silico-mutagenesis.rst index a8e256429..ed42a6e0f 100644 --- a/docs/in-silico-mutagenesis.rst +++ b/docs/in-silico-mutagenesis.rst @@ -7,14 +7,16 @@ Introduction Perform "In silico saturated mutagenesis" (ISM) analysis to discover informative sequence features within any sequence. Specifically, it performs computational mutation scanning to assess the effect of mutating every base of the input sequence on chromatin feature predictions. This method for context-specific sequence feature extraction takes advantage of DeepSEA's ability to utilize flanking context sequence information. -Note that ISM only accepts a sequence (FASTA file) as input. +Note that ISM only accepts a sequence (FASTA file) as input. The input sequence should be 2000 base pairs long. + +The chromatin impact prediction is performed using the :doc:`beluga` model. ISM outputs effects for each of the three possible substitutions at each of the 2000 bases, across all chromatin features. 
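
The mutation scan described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names ``enumerate_substitutions`` and ``log2_odds_change`` are hypothetical and are not part of HumanBase or Beluga, and the probabilities passed to the scoring helper would come from a chromatin model that is not shown here. The scoring helper follows the log2 fold change of odds given in the Output section.

```python
import math

BASES = "ACGT"


def enumerate_substitutions(sequence):
    """Yield (position, ref, alt, mutated_sequence) for every single-base substitution.

    Each A/C/G/T position yields three alternatives; an ambiguous base such
    as N would yield four, so this sketch assumes an unambiguous sequence.
    """
    sequence = sequence.upper()
    for pos, ref in enumerate(sequence):
        for alt in BASES:
            if alt == ref:
                continue
            yield pos, ref, alt, sequence[:pos] + alt + sequence[pos + 1:]


def log2_odds_change(p_original, p_mutated):
    """log2 odds of the original prediction minus log2 odds of the mutated one."""
    def odds(p):
        return p / (1.0 - p)

    return math.log2(odds(p_original)) - math.log2(odds(p_mutated))


# Toy 2000 bp input; a real input would be a genomic sequence.
toy_sequence = "ACGT" * 500
mutations = list(enumerate_substitutions(toy_sequence))
print(len(mutations))  # 2000 positions x 3 substitutions = 6000
```

For a 2000 bp input the scan therefore produces 6000 mutated sequences, each of which is scored against every chromatin feature.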
Output ------ -The effect of a base substitution on a specific chromatin feature prediction was measured by log2 fold change of odds, where P0 represents the probability predicted for the original sequence and P1 represents the probability predicted for the mutated sequence: +The effect of a base substitution on a specific chromatin feature prediction was measured by log2 fold change of odds, where P\ :sub:`0` represents the probability predicted for the original sequence and P\ :sub:`1` represents the probability predicted for the mutated sequence: .. math:: \log_2 \left(\frac{P_0}{1 - P_0}\right) - \log_2 \left(\frac{P_1}{1 - P_1}\right) diff --git a/docs/index.rst b/docs/index.rst index 54bc2cb06..5a02b8c08 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -12,17 +12,10 @@ HumanBase User Guide About --------------------- -HumanBase is a “one stop shop” for biological and biomedical researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues, development, and human disease. +HumanBase is a resource for biological and biomedical researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues, development, and human disease. Data-driven integrative analyses are especially powerful because they reach beyond “known biological knowledge” represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research. Thus, carefully designed algorithms can drive the development of experimentally testable hypotheses, enabling deeper understanding of basic biology at the molecular level, pathophysiology, and paving the way to therapy and drug development. 
---------------------- -Example use case ---------------------- -A researcher who studies the role of the immune system and inflammation in chronic kidney disease wants to identify candidate genes for these disorders. Unfortunately, as with most specific disease contexts outside of cancer, few datasets are available for these diseases, none are focused on the role of inflammation or the immune system, and no dataset is specific to her cell-lineage of interest. Even identifying which genes are expressed in the cell type relevant to glomerular disease (podocytes) is currently impossible as this cell lineage cannot be isolated for high-throughput experiments in human. - -Using HumanBase, she will be able to examine data-driven predictions of genes expressed in the podocyte cells and analyze predicted functional and mechanistic networks specific to the kidney glomerulus. She could also provide the system with a list of relevant GWAS or family-based study results and the system will reprioritize these results based on the relevant functional maps. She will be able to iteratively refine this analysis by limiting the data used in the integration only to kidney datasets or by integrating her own data in the analysis. - --------------------- Licensing --------------------- @@ -40,8 +33,8 @@ Help topics :maxdepth: 2 :glob: + usage functional-networks - tissue-networks modules netwas deepsea @@ -51,4 +44,5 @@ Help topics expecto expectosc in-silico-mutagenesis + asdbrowser citations diff --git a/docs/modules.rst b/docs/modules.rst index aed6431ac..8a6d58ae3 100644 --- a/docs/modules.rst +++ b/docs/modules.rst @@ -4,9 +4,12 @@ Functional Module Detection HumanBase applies community detection to find cohesive gene clusters from a provided gene list and a selected relevant tissue. Genes within a cluster share local network neighborhoods and together form a cohesive, specific functional module. 
Module detection enables systematic association of genes - even functionally uncharacterized genes - to specific processes and phenotypes represented in the detected modules. Functional modules are identified with tissue-specific networks, which predict gene interactions from massive data collections. Thus the discovered modules potentially capture higher-order tissue-specific function. +Functional module detection is described in: Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG.(2016) `Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder `_. Nature Neuroscience. + + Method ------ -The approach\ :sup:`1` is based on shared k-nearest-neighbors (SKNN) and the Louvain community-finding algorithm to cluster the user-selected tissue network into distinct modules of tightly connected genes. The SKNN-based strategy has the advantages of alleviating the effect of high-degree genes and accentuating local network structure by connecting genes that are likely to be functionally clustered together. +The approach is based on shared k-nearest-neighbors (SKNN) and the Louvain community-finding algorithm to cluster the user-selected tissue network into distinct modules of tightly connected genes. The SKNN-based strategy has the advantages of alleviating the effect of high-degree genes and accentuating local network structure by connecting genes that are likely to be functionally clustered together. This technique proceeds as follows: (i) First, we create a subset of the user-selected network containing only the user-provided genes and all the edges between them. 
Given the resulting graph G with V nodes (user-provided genes) and E edges, with each edge between genes i and j associated with a weight p\ :sub:`ij`, @@ -17,9 +20,7 @@ This approach has two key desirable characteristics: (i) Choosing the highest k values instead of all edges deemphasizes high-degree 'hub' nodes and brings equal attention to highly specific edges between low-degree nodes; (ii) Emphasizing local network-structure by connecting nodes that share a number of local neighbors automatically links genes that are highly likely to be part of the same cluster. -We use a dynamic :code:`k = min(50, 0.2 * |V|)` to obtain the shared-nearest-neighbor tissue-specific network and apply the Louvain algorithm to cluster this network into distinct modules. To stabilize clustering across different runs of the Louvain algorithm, we run the algorithm 100 times and calculate cluster comembership scores for each pair of genes that was equal to the fraction of times (out of 100) the pair was assigned to the same cluster. Genes are assigned to clusters where their comembership score ≥ 0.9. +We use a dynamic :code:`k = min(50, 0.2 * |V|)`, where :code:`|V|` is the number of query genes, to obtain the shared-nearest-neighbor tissue-specific network and apply the Louvain algorithm to cluster this network into distinct modules. Krishnan et al. (2016) tested a range of values for k from 10 to 100 and showed that module node membership and cluster sizes are robust to this choice. To stabilize clustering across different runs of the Louvain algorithm, we run the algorithm 100 times and calculate a comembership score for each pair of genes, equal to the fraction of runs (out of 100) in which the pair was assigned to the same cluster. Genes are assigned to clusters where their comembership score ≥ 0.9. Resulting modules are then tested for functional enrichment using genes annotated to Gene Ontology biological process terms.
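The dynamic choice of k and the comembership scoring described above can be illustrated with a minimal sketch. It is not the HumanBase implementation: the function names, the toy input, and the decision to round ``0.2 * |V|`` down to an integer are assumptions made for illustration, and the Louvain runs themselves are represented only by their resulting gene-to-cluster assignments.

```python
from collections import Counter
from itertools import combinations


def dynamic_k(num_genes):
    """k = min(50, 0.2 * |V|); rounding down to an integer is an assumption."""
    return min(50, num_genes // 5)


def comembership_scores(partitions):
    """Fraction of runs in which each gene pair shares a cluster.

    `partitions` holds one dict per clustering run, mapping gene -> cluster
    label. Labels only need to be consistent within a single run.
    Pairs that never co-cluster are simply absent from the result.
    """
    counts = Counter()
    for partition in partitions:
        for a, b in combinations(sorted(partition), 2):
            if partition[a] == partition[b]:
                counts[(a, b)] += 1
    return {pair: n / len(partitions) for pair, n in counts.items()}


def stable_pairs(scores, threshold=0.9):
    """Gene pairs whose comembership score meets the threshold."""
    return sorted(pair for pair, score in scores.items() if score >= threshold)


# Toy input: 10 hypothetical runs over four genes. A and B always co-cluster,
# C joins them in only 5 of the 10 runs, and D is always on its own.
runs = [{"A": 0, "B": 0, "C": 0 if i < 5 else 1, "D": 2} for i in range(10)]
scores = comembership_scores(runs)
stable = stable_pairs(scores)
```

In this toy case only the pair (A, B) clears the 0.9 threshold; C co-clusters with A and B in half of the runs and is therefore not treated as a stable member.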
Representative processes and pathways enriched within each cluster are presented alongside of the cluster with their resulting Q value. The Q value of each term associated to the modules is calculated using one-sided Fisher's exact tests and Benjamini–Hochberg corrections to correct for multiple tests. - -1. Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG.(2016) Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature Neuroscience. diff --git a/docs/netwas.rst b/docs/netwas.rst index 6767e3086..efdf1232b 100644 --- a/docs/netwas.rst +++ b/docs/netwas.rst @@ -3,6 +3,8 @@ NetWAS - Network-wide Association Study ======================================= Tissue-specific networks provide a new means to generate hypotheses related to the molecular basis of human disease. We developed an approach, termed network-wide association study (NetWAS). In NetWAS, the statistical associations from a standard GWAS guide the analysis of functional networks. This reprioritization method is driven by discovery and does not depend on prior disease knowledge. NetWAS, in conjunction with tissue-specific networks, effectively reprioritizes statistical associations from distinct GWAS to identify disease-associated genes, and tissue-specific NetWAS better identifies genes associated with hypertension than either GWAS or tissue-naive NetWAS. +The NetWAS method is described in the following publication: Greene, C. S., Krishnan, A., Wong, A. K., Ricciotti, E., Zelaya, R. A., Himmelstein, D. S., ... & Troyanskaya, O. G. (2015). `Understanding multicellular function and disease with human tissue-specific networks `_. Nature Genetics. + Method --------------------------------------- NetWAS trains a support vector machine classifier using nominally significant (P < 0.01) genes as positive examples and 10,000 randomly selected non-significant (P ≥ 0.01) genes as negatives. 
The classifier is constructed using a tissue network relevant to a disease (e.g. kidney for hypertension), where the features of the classifier are the edge weights of the labeled examples to all the genes in the network. Genes are re-ranked using their distance from the hyperplane, which represent a network-based prioritization of a GWAS, termed NetWAS. @@ -62,21 +64,9 @@ When a NetWAS analysis finishes, a result file will be emailed to the provided a -Examples +Example --------------------------------------- Hypertension GWAS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Hypertension is a major cardiovascular risk factor and a complex trait involving a large number of genetic variants. We converted SNP-level association statistics into gene-level statistics for each of three recorded phenotypes—diastolic blood pressure (DBP), systolic blood pressure (SBP) and hypertension. Using the tissue-specific network for kidney, a tissue that has a central role in blood pressure control, NetWAS constructed a classifier that identified tissue-specific network connectivity patterns associated with the phenotype of interest. Genes annotated to hypertension phenotypes in the Online Mendelian Inheritance in Man (OMIM) database were more highly ranked by this classifier than by the initial GWAS. (`citation `_) - -.. figure:: https://media.nature.com/full/nature-assets/ng/journal/v47/n6/images/ng.3259-F5.jpg - :scale: 50% - - Genes ranked using GWAS (gray) and genes reprioritized using NetWAS (brown) were assessed for correspondence to genes known to be associated with hypertension phenotypes, regulatory processes and therapeutics. We compared individual (systolic blood pressure, SBP; diastolic blood pressure, DBP; hypertension, HTN) as well as combined hypertension endpoints. (a) Gene rankings were compared to OMIM-annotated hypertension genes using AUC. The AUC for the tissue-specific NetWAS is consistently higher than that for the original GWAS for all hypertension endpoints. 
Merging the network-based predictions for the three hypertension-related endpoints into a combined phenotype results in the best performance (AUC = 0.77; original GWAS AUC = 0.62; the dashed line at 0.5 denotes the AUC of a baseline random predictor). (b,c) Gene rankings were also assessed for enrichment of genes involved in the regulation of blood pressure (GO) (b) and targets of antihypertensive drugs (DrugBank) (c). The top NetWAS results were significantly enriched for genes involved in blood pressure regulation as well as for genes that are targets of antihypertensive drugs. Enrichment was calculated as a z score (Online Methods), with higher scores indicating a greater shift from the expected ranking toward the top of the list. In nearly all cases, the NetWAS ranking was both significantly enriched with the respective gene sets (z score > 1.645 ≈ P value < 0.05) and more enriched than in the original GWAS ranking. - - -Additional GWAS -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. figure:: https://media.nature.com/full/nature-assets/ng/journal/v47/n6/images/ng.3259-SF8.jpg - - Each bar shows the performance of NetWAS reprioritization as measured by the area under the curve (AUC) of documented disease associations with the disease specified in the label above the plot. The horizontal axis shows relevant networks (colored bars) and GWAS alone (gray bars), and the horizontal axis label describes the GWAS phenotype from which associations were obtained. diff --git a/docs/sei.rst b/docs/sei.rst index ebdc3c352..a4c6368cd 100644 --- a/docs/sei.rst +++ b/docs/sei.rst @@ -5,9 +5,11 @@ Sei Introduction ------------ -Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. 
Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. You can also find the Sei code repository `here `_ or read about our manuscript `here `_. +Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. -Sequence class-level variant effects are computed by comparing the predictions for the reference and the alternative alleles. A positive score indicates an increase in sequence class activity by the alternative allele and vice versa. Sequence class-level scores are computed by projecting the 21,907 chromatin profile predictions for the sequence to the unit vector that represents each sequence class. +Sei is described in the following manuscript: Kathleen M. Chen, Aaron K. Wong, Olga G. Troyanskaya and Jian Zhou, `A sequence-based global map of regulatory activity for deciphering human genetics `_. Nature Genetics (2022). + +The Sei code repository can be found `here `_. For older DeepSEA models see: :doc:`beluga` (2019) @@ -17,7 +19,7 @@ Input ----- .. |bp_length| replace:: 4096 -.. |bed_example| replace:: ``chr1 109817091 109821186`` +.. |bed_example| replace:: ``chr5 134871851 134871852`` ..
include:: _includes/common-input-formats.rst @@ -30,7 +32,7 @@ Output Sequence classes ~~~~~~~~~~~~~~~~~~~~~~~~~ -The Sei framework predicts 40 sequence class scores, covering a wide range of regulatory activities such as cell-type-specific enhancers and promoters, as well as 21,907 chromatin profiles for any DNA sequence. A full description of how Sei sequence scores are computed can be found in the `Sei paper (2022) `_. +The Sei framework predicts 40 sequence class scores, covering a wide range of regulatory activities such as cell-type-specific enhancers and promoters, as well as 21,907 chromatin profiles for any DNA sequence. Sequence class-level variant effects are computed by comparing the predictions for the reference and the alternative alleles. A positive score indicates an increase in sequence class activity by the alternative allele and vice versa. Sequence class-level scores are computed by projecting the 21,907 chromatin profile predictions for the sequence to the unit vector that represents each sequence class. A full description of how Sei sequence scores are computed can be found in the `Sei paper (2022) `_. To help interpretation, we grouped sequence classes into groups including P (Promoter), E (Enhancer), CTCF (CTCF-cohesin binding), TF (TF binding), PC (Polycomb-repressed), HET (Heterochromatin), TN (Transcription), and L (Low Signal) sequence classes. Please refer to our manuscript for a more detailed description of the sequence classes. diff --git a/docs/seqweaver.rst b/docs/seqweaver.rst index a238b4af4..13f968b95 100644 --- a/docs/seqweaver.rst +++ b/docs/seqweaver.rst @@ -40,6 +40,14 @@ Large submissions We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly.
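Splitting a large input into several submissions of fewer than 10,000 records can be done with a few lines of scripting. The sketch below is one possible approach, not a tool shipped with HumanBase: the function name, the 9,999-record chunk size, and the toy VCF-style records are all assumptions for illustration.

```python
def split_records(records, max_per_submission=9_999):
    """Split records into consecutive chunks of at most `max_per_submission`."""
    if max_per_submission < 1:
        raise ValueError("max_per_submission must be positive")
    return [
        records[i:i + max_per_submission]
        for i in range(0, len(records), max_per_submission)
    ]


# Hypothetical input: 25,000 tab-separated lines in the five-column layout
# (chromosome, position, name, reference allele, alternative allele).
variants = [f"chr1\t{pos}\t-\tG\tT" for pos in range(1, 25_001)]
chunks = split_records(variants)
chunk_sizes = [len(chunk) for chunk in chunks]
```

Each chunk can then be written to its own file and submitted separately; the original record order is preserved across chunks.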
+Downloads +--------- +The 1000 Genomes Project RBP LDScores used in GWAS analysis are available `here `_. + +gnomAD (v2.1) Seqweaver RBP target site dysregulation scores are available `here `_. + +The code for making variant effect predictions with the Seqweaver `RBP model `_ is available from this link (Zhou and Park et al., Nature Genetics 2019). Our `new version `_ simplifies the dependencies by using the `Selene `_ package and streamlines the prediction process for RBP models. + Output ------ @@ -68,4 +76,4 @@ Molecular-level biochemical effects prediction See also -------- * :doc:`sei` - Latest chromatin and regulatory impact model with 4096bp input sequences -* :doc:`beluga` - 2019 DeepSEA model with 2000bp input sequences +* :doc:`beluga` - 2018 DeepSEA model with 2000bp input sequences diff --git a/docs/usage.rst b/docs/usage.rst index 98ec99bc9..65463ec24 100644 --- a/docs/usage.rst +++ b/docs/usage.rst @@ -1,3 +1,62 @@ ===================== -Getting Started +Overview ===================== + +HumanBase is a resource for biological researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in humans, particularly in the context of specific cell types/tissues and human disease. A brief summary of some of the key tools included in HumanBase is below; for more detail on each tool, see the corresponding documentation page. + +This resource is not merely a public database of primary genomics data or biological literature. The data-driven integrative analyses (i.e. algorithms that “learn” from large genomic data collections) presented in HumanBase are especially powerful because they separate signal from noise in large biological data collections to reach beyond “existing biological knowledge” represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research.
Carefully designed algorithms can drive the development of experimentally testable hypotheses. Thus, HumanBase is a resource for biomedical researchers to incorporate into their research workflows, which they can use to interpret their experimental results and generate hypotheses for experimental follow-up. + +HumanBase is actively developed by the `Genomics group `_ at the `Flatiron Institute `_. + +Functional gene networks +------------------------ + +To leverage the vast collections of raw, noisy genomic data, the data must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, probabilistically weighted and integrated, focused on a particular biological question. These questions can include, for example, the function of a gene, the relationship between two pathways, or the processes disrupted in a genetic disorder (see `Huttenhower et al., 2009 `_). + +:doc:`Tissue-specific Networks ` +------------------------------------- +HumanBase builds genome-scale functional maps of human tissues by integrating a collection of data sets covering thousands of experiments contained in more than 14,000 distinct publications. We automatically assess each data set for its relevance to each of 144 tissue- and cell lineage–specific functional contexts. The resulting functional gene maps provide a detailed portrait of protein function and interactions in specific human tissues and cell lineages ranging from B lymphocytes to the renal glomerulus and the whole brain. This approach allows HumanBase to profile the specialized function of genes in a high-throughput manner, even in tissues and cell lineages for which no or few tissue-specific data exist (`Greene et al., 2015 `_).
+ +:doc:`Functional Module Detection ` +------------------------------------------------------------ +HumanBase applies community detection to find cohesive gene clusters from a provided gene list and a selected relevant tissue. Genes within a cluster share local network neighborhoods and together form a cohesive, specific functional module. Module detection enables systematic association of genes - even functionally uncharacterized genes - to specific processes and phenotypes represented in the detected modules. Functional modules are identified with tissue-specific networks, which predict gene interactions from massive data collections. Thus the discovered modules potentially capture higher-order tissue-specific function (`Krishnan et al., 2016 `_). + +:doc:`NetWAS ` +------------------------------- +Tissue-specific networks provide a new means to generate hypotheses related to the molecular basis of human disease. In NetWAS, the statistical associations from a standard GWAS guide the analysis of functional networks. NetWAS, in conjunction with tissue-specific networks, effectively reprioritizes statistical associations from GWAS to identify disease-associated genes. This reprioritization method is driven by GWAS discovery and does not depend on prior disease knowledge (`Greene et al., 2015 `_). + +:doc:`Sei ` and :doc:`Beluga (DeepSEA) ` +------------------------------------------------------------ +Beluga (a `2019 update `_ to the `DeepSEA framework `_) is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. Beluga can accurately predict the epigenetic state of a sequence, including transcription factor binding, DNase I sensitivities and histone marks in multiple cell types, and further utilizes this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants.
+ +Sei (`Chen et al., 2022 `_) provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes that integrate predictions for 21,907 chromatin profiles. + +:doc:`In-silico mutagenesis ` +------------------------------------------------------------ +HumanBase includes an in silico mutagenesis tool to view the predicted chromatin impact of all possible variants in a given genomic region. + + +:doc:`Seqweaver ` +----------------------------------------------- +Seqweaver (`Park et al., 2021 `_) is a deep learning framework designed to predict how genetic variants affect post-transcriptional RNA-binding protein (RBP) interactions. The model can predict the impact of genetic variants (including variants never seen in genomic databases) at single-nucleotide resolution. + + +:doc:`ExPecto ` and :doc:`ExPectoSC ` +---------------------------------------------------------------------------- +ExPecto (`Zhou et al., 2018 `_) and ExPectoSC (`Sokolova et al., 2023 `_) make highly accurate cell-type and tissue-specific predictions of gene expression solely from DNA sequence. The cell and tissue-specific impact of gene transcriptional dysregulation can be systematically probed ‘in silico’, at a scale not yet possible experimentally. Both models leverage deep learning-based sequence models trained on chromatin profiling data, and integrated with spatial transformation and regularized linear models. ExPecto is trained with bulk RNA sequencing data and ExPectoSC leverages single cell sequencing experiments that profile all cell types in primary human tissues. + +Citations +--------- +Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, Zhang R, Hartmann BM, Zaslavsky E, Sealfon SC, Chasman DI, FitzGerald GA, Dolinski K, Grosser T, Troyanskaya OG. (2015). `Understanding multicellular function and disease with human tissue-specific networks `_. Nature Genetics. 10.1038/ng.3259.
+ +Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG. (2016) `Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder `_. Nature Neuroscience. + +Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, and Troyanskaya OG. (2018) `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_. Nature Genetics. + +Zhou J, Troyanskaya OG. (2015). `Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model `_. Nature Methods. + +Chen KM, Wong AK, Troyanskaya OG, Zhou J. (2022) `A sequence-based global map of regulatory activity for deciphering human genetics `_. Nature Genetics. + +Park, C. Y., Zhou, J., Wong, A. K., Chen, K. M., Theesfeld, C. L., Darnell, R. B., & Troyanskaya, O. G. (2021). `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk `_. Nature Genetics. + +Sokolova, K., Theesfeld, C. L., Wong, A. K., Zhang, Z., Dolinski, K., & Troyanskaya, O. G. (2023). `Atlas of primary cell-type-specific sequence models of gene expression and variant effects `_. Cell Reports Methods.