From b67833ba879adfd747ba4ce3da8115a7bdf3dfa1 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 28 Jul 2025 13:26:33 -0400 Subject: [PATCH 01/17] merge index and usage --- docs/index.rst | 14 ++++------- docs/usage.rst | 63 +------------------------------------------------- 2 files changed, 6 insertions(+), 71 deletions(-) diff --git a/docs/index.rst b/docs/index.rst index 969d678cb..99b041ce5 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -7,25 +7,21 @@ HumanBase User Guide ===================== +HumanBase is a resource for biological researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues and human disease. A brief summary of some of the key tools included in HumanBase is below; for more detail on each tool, see the corresponding documentation page. + +This resource is not merely a public database of primary genomics data or biological literature. The data-driven integrative analyses (i.e. algorithms that “learn” from large genomic data collections) presented in HumanBase are especially powerful because they separate signal from noise in large biological data collections to reach beyond “existing biological knowledge” represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research. Carefully designed algorithms can drive the development of experimentally testable hypotheses. Thus, HumanBase is a resource for biomedical researchers to incorporate into their research workflows, which they can use to interpret their experimental results and generate hypotheses for experimental follow-up. --------------------- -About +Who are we? 
--------------------- -HumanBase is a resource for biological and biomedical researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues, development, and human disease. - -Data-driven integrative analyses are especially powerful because they reach beyond “known biological knowledge” represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research. Thus, carefully designed algorithms can drive the development of experimentally testable hypotheses, enabling deeper understanding of basic biology at the molecular level, pathophysiology, and paving the way to therapy and drug development. +HumanBase is actively developed by the `Genomics group `_ at the `Flatiron Institute `_. --------------------- Licensing --------------------- All data in HumanBase are freely available under a `CC-BY 4.0 `_ license. Please give appropriate credit, provide a link to the license, and indicate if changes were made. ---------------------- -Who are we? ---------------------- -HumanBase is actively developed by the `Genomics group `_ at the `Flatiron Institute `_ . - --------------------- Help topics --------------------- diff --git a/docs/usage.rst b/docs/usage.rst index 65463ec24..f18e8be73 100644 --- a/docs/usage.rst +++ b/docs/usage.rst @@ -1,62 +1 @@ -===================== -Overview -===================== - -HumanBase is a resource for biological researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues and human disease. A brief summary of some of the key tools included in HumanBase is below; for more detail on each tool, see the corresponding documentation page. - -This resource is not merely a public database of primary genomics data or biological literature. 
The data-driven integrative analyses (i.e. algorithms that “learn” from large genomic data collections) presented in HumanBase are especially powerful because they separate signal from noise in large biological data collections to reach beyond “existing biological knowledge” represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research. Carefully designed algorithms can drive the development of experimentally testable hypotheses. Thus, HumanBase is a resource for biomedical researchers to incorporate into their research workflows, which they can use to interpret their experimental results and generate hypotheses for experimental follow-up. - -HumanBase is actively developed by the `Genomics group `_ at the `Flatiron Institute `_. - -Functional gene networks ------------------------- - -In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, probabilistically weighted and integrated, focused on a particular biological question. These questions can include, for example, the function of a gene, the relationship between two pathways, or the processes disrupted in a genetic disorder (see `Huttenhower et al., 2009 `_). - -:doc:`Tissue-specific Networks ` -------------------------------------- -HumanBase builds genome-scale functional maps of human tissues by integrating a collection of data sets covering thousands of experiments contained in more than 14,000 distinct publications. We automatically assess each data set for its relevance to each of 144 tissue- and cell lineage–specific functional contexts. 
The resulting functional gene maps provide a detailed portrait of protein function and interactions in specific human tissues and cell lineages ranging from B lymphocytes to the renal glomerulus and the whole brain. This approach allows HumanBase to profile the specialized function of genes in a high-throughput manner, even in tissues and cell lineages for which no or few tissue-specific data exist (`Greene et al., 2015 `_). - -:doc:`Functional Module Detection ` ---------------------------- -HumanBase applies community detection to find cohesive gene clusters from a provided gene list and a selected relevant tissue. Genes within a cluster share local network neighborhoods and together form a cohesive, specific functional module. Module detection enables systematic association of genes - even functionally uncharacterized genes - to specific processes and phenotypes represented in the detected modules. Functional modules are identified with tissue-specific networks, which predict gene interactions from massive data collections. Thus the discovered modules potentially capture higher-order tissue-specific function (`Krishnan et al. 2016 `_). - -:doc:`NetWAS ` -------------------------------- -Tissue-specific networks provide a new means to generate hypotheses related to the molecular basis of human disease. In NetWAS, the statistical associations from a standard GWAS guide the analysis of functional networks. NetWAS, in conjunction with tissue-specific networks, effectively reprioritizes statistical associations from GWAS to identify disease-associated genes. This reprioritization method is driven by GWAS discovery and does not depend on prior disease knowledge (`Greene et al., 2015 `_). - -:doc:`Sei ` and :doc:`Beluga (DeepSEA) ` --------------------------------- -Beluga (a `2019 update `_ to the `DeepSEA framework `_) is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. 
Beluga can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. - -Sei (`Chen et al., 2022 `_) provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes by integrating predictions for 21,907 chromatin profiles. - -:doc:`In-silico mutagenesis ` ---------------------- -HumanBase includes an in silico mutagenesis tool to view the predicted chromatin impact of all possible variants in a given genomic region. - - -:doc:`Seqweaver ` ------------------------------------------------ -Seqweaver (`Park et al., 2021 `_) is a deep learning framework designed to predict how genetic variants affect post-transcriptional RNA-binding protein (RBP) interactions. The model can predict the impact of genetic variants (including variants never seen in genomic databases) at single-nucleotide resolution. - - -:doc:`ExPecto ` and :doc:`ExPectoSC ` ----------------------------------------------------------------------------- -ExPecto (`Zhou et al. 2018 `_) and ExPectoSC (`Sokolova et al. 2022 `_) make highly accurate cell-type and tissue-specific predictions of gene expression solely from DNA sequence. The cell and tissue-specific impact of gene transcriptional dysregulation can be systematically probed ‘in silico’, at a scale not yet possible experimentally. Both models leverage deep learning-based sequence models trained on chromatin profiling data, and integrated with spatial transformation and regularized linear models. ExPecto is trained with bulk RNA sequencing data and ExPectoSC leverages single cell sequencing experiments that profile all cell types in primary human tissues. 
- -Citations ---------- -Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, Zhang R, Hartmann BM, Zaslavsky E, Sealfon SC, Chasman DI, FitzGerald GA, Dolinski K, Grosser T, Troyanskaya OG. (2015). `Understanding multicellular function and disease with human tissue-specific networks `_. Nature Genetics. 10.1038/ng.3259w. - -Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG.(2016) `Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder _`. Nature Neuroscience. - -Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, and Troyanskaya OG. (2018) `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_, Nature Genetics. - -Zhou J, Troyanskaya OG. (2015). `Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model <`_. Nature Methods. - -Chen KM, Wong AK, Troyanskaya OG, Zhou J. (2022) `A sequence-based global map of regulatory activity for deciphering human genetics `_. Nature Methods. - -Park, C. Y., Zhou, J., Wong, A. K., Chen, K. M., Theesfeld, C. L., Darnell, R. B., & Troyanskaya, O. G. (2021). `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk `_. Nature genetics. - -Sokolova, K., Theesfeld, C. L., Wong, A. K., Zhang, Z., Dolinski, K., & Troyanskaya, O. G. (2023). `Atlas of primary cell-type-specific sequence models of gene expression and variant effects `_. Cell Reports Methods. +.. 
include:: index.rst \ No newline at end of file From d0bce35e60bce77e4963a0016f3eb0ae2e4f1cf3 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 28 Jul 2025 13:30:39 -0400 Subject: [PATCH 02/17] update autism gene prediction citation --- docs/citations.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/citations.rst b/docs/citations.rst index 568f317ae..d743fe146 100644 --- a/docs/citations.rst +++ b/docs/citations.rst @@ -35,7 +35,7 @@ Sokolova, K., Theesfeld, C. L., Wong, A. K., Zhang, Z., Dolinski, K., & Troyansk Autism gene predictions ~~~~~~~~~~~~~~~~~~~~~~~ -Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG.(2016) Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature Neuroscience. +Zhou, J.*, Park, C. Y.*, Theesfeld, C. L.*, Wong, A. K., Yuan, Y., Scheckel, C., ... & Troyanskaya, O. G. (2019). `Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk `_. Nature genetics, 51(6), 973-980. Tissue-expression predictions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From 68d2c0b64cebc8f635ed73c979c66c582c82d0fb Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 28 Jul 2025 13:36:15 -0400 Subject: [PATCH 03/17] copyedit autism varient effect section heading --- docs/citations.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/citations.rst b/docs/citations.rst index d743fe146..6a73c44ad 100644 --- a/docs/citations.rst +++ b/docs/citations.rst @@ -33,7 +33,7 @@ ExPectoSC ~~~~~~~~~ Sokolova, K., Theesfeld, C. L., Wong, A. K., Zhang, Z., Dolinski, K., & Troyanskaya, O. G. (2023), `Atlas of primary cell-type specific sequence models of gene expression and variant effects `_. Cell Reports Methods. -Autism gene predictions +Autism variant effect predictions ~~~~~~~~~~~~~~~~~~~~~~~ Zhou, J.*, Park, C. Y.*, Theesfeld, C. L.*, Wong, A. K., Yuan, Y., Scheckel, C., ... & Troyanskaya, O. 
G. (2019). `Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk `_. Nature genetics, 51(6), 973-980. From 678cec7d783d814a5826eda022d8751d7ce4cde6 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 4 Aug 2025 13:15:59 -0400 Subject: [PATCH 04/17] copyedit expectosc use case --- docs/use-cases/expectosc-use-case.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/use-cases/expectosc-use-case.rst b/docs/use-cases/expectosc-use-case.rst index 9b3e6209d..e377a58e8 100644 --- a/docs/use-cases/expectosc-use-case.rst +++ b/docs/use-cases/expectosc-use-case.rst @@ -5,7 +5,7 @@ ExPectoSC use case **Task: Which non-coding variants have a strong effect on the cell-type specific expression of a nearby gene?** -* Navigate to humanbase.io/expectosc or choose the “ExPectoSC” option from the main Analyses menu.Input a gene of interest (for example, PTEN). +* Navigate to humanbase.io/expectosc or choose the “ExPectoSC” option from the main Analyses menu. Input a gene of interest (for example, PTEN). .. figure:: ../img/use-cases/expectosc-1.png :align: center From a239eaaba2c993a48e6603a1b70f36ef7c7c74dd Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 4 Aug 2025 14:05:54 -0400 Subject: [PATCH 05/17] copyedit beluga use case --- docs/use-cases/beluga-use-case.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/use-cases/beluga-use-case.rst b/docs/use-cases/beluga-use-case.rst index 17b506b77..f9c8de7df 100644 --- a/docs/use-cases/beluga-use-case.rst +++ b/docs/use-cases/beluga-use-case.rst @@ -33,7 +33,7 @@ Beluga (DeepSEA) use case :width: 600px -* An alternative view allows users to see all 2002 predictions by Beluga at one time. Select the “Features” tab. In this view each dot is a chromatin feature predicted by Beluga and they are ranked by z-score. 
With this view a user can see both tails of the predictions and can assess how many features are predicted to increase with the variant (right side) or decrease(left side) the probability. Mouse over of a dot shows which feature is represented and the score. +* An alternative view allows users to see all 2002 predictions by Beluga at one time. Select the “Features” tab. In this view each dot is a chromatin feature predicted by Beluga and they are ranked by z-score. With this view a user can see both tails of the predictions and can assess how many features are predicted to increase with the variant (right side) or decrease (left side) the probability. Mouse over of a dot shows which feature is represented and the score. .. figure:: ../img/use-cases/beluga-5.png :align: center From e01d46a224895a127a8871054761d327ed65cd01 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 10:23:27 -0400 Subject: [PATCH 06/17] add sentence about ability to predict impact for unseen variants --- docs/beluga.rst | 2 +- docs/deepsea.rst | 3 ++- docs/expecto.rst | 2 +- docs/expectosc.rst | 2 +- docs/sei.rst | 2 +- docs/seqweaver.rst | 3 +-- 6 files changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/beluga.rst b/docs/beluga.rst index 04000b3df..895b6c106 100644 --- a/docs/beluga.rst +++ b/docs/beluga.rst @@ -5,7 +5,7 @@ Beluga (DeepSEA) Introduction ------------ -DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features. 
+DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. Importantly, this framework is trained without using any variant data, allowing it to predict the chromatin impact of any variant, including rare or previously unseen ones. The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features. Beluga is described in: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_. Nature Genetics (2018). diff --git a/docs/deepsea.rst b/docs/deepsea.rst index 1218e3bed..18c757c6a 100644 --- a/docs/deepsea.rst +++ b/docs/deepsea.rst @@ -5,7 +5,8 @@ DeepSEA Analysis Introduction ------------ -DeepSEA is a deep learning framework that predicts genomic variant effects with single nucleotide sensitivity on a wide range of regulatory features: transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types. +DeepSEA is a deep learning framework that predicts genomic variant effects with single nucleotide sensitivity on a wide range of regulatory features: transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types. Importantly, this framework is trained without using any variant data, allowing it to predict the chromatin impact of any variant, including rare or previously unseen ones. 
+ DeepSEA-based Methods --------------------- diff --git a/docs/expecto.rst b/docs/expecto.rst index a3a767935..e34fc7f6c 100644 --- a/docs/expecto.rst +++ b/docs/expecto.rst @@ -4,7 +4,7 @@ ExPecto Introduction ------------ -ExPecto is a framework for ab initio sequence-based prediction of mutation gene expression effects and disease risks. With this web interface, we provide an explorer of tissue-specific expression effect predictions. +ExPecto is a framework for ab initio sequence-based prediction of mutation gene expression effects and disease risks. Importantly, this framework is trained without using any variant data, allowing it to predict the expression effects of any variant, including rare or previously unseen ones. With this web interface, we provide an explorer of tissue-specific expression effect predictions. The ExPecto framework is described in the following manuscript: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_, Nature Genetics (2018). diff --git a/docs/expectosc.rst b/docs/expectosc.rst index b01bd5ef0..90f9219a2 100644 --- a/docs/expectosc.rst +++ b/docs/expectosc.rst @@ -6,7 +6,7 @@ ExPectoSC Introduction ------------ -ExPectoSC is a framework for `ab initio` sequence-based prediction of mutation gene expression effects for primary human cell types. With this web interface, we provide an explorer of cell type-specific expression effect predictions. The current release contains all ClinVar variants within +/- 20kb of the representative TSS of a gene. We use 1000 Genomes variant effects predictions for z-score normalization. No effect threshold was employed for the current release of the data. +ExPectoSC is a framework for `ab initio` sequence-based prediction of mutation gene expression effects for primary human cell types. 
With this web interface, we provide an explorer of cell type-specific expression effect predictions. The current release contains all ClinVar variants within +/- 20kb of the representative TSS of a gene. We use 1000 Genomes variant effects predictions for z-score normalization. No effect threshold was employed for the current release of the data. Importantly, this framework is trained without using any variant data, allowing it to predict the expression effects of any variant, including rare or previously unseen ones. The ExPectoSC framework is described in the following manuscript: Ksenia Sokolova, Chandra L. Theesfeld, Aaron K. Wong, Zijun Zhang, Kara Dolinski and Olga G. Troyanskaya, `Atlas of primary cell-type specific sequence models of gene expression and variant effects `_. Cell Reports Methods (2023). diff --git a/docs/sei.rst b/docs/sei.rst index a4c6368cd..9817aadf7 100644 --- a/docs/sei.rst +++ b/docs/sei.rst @@ -5,7 +5,7 @@ Sei Introduction ------------ -Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. +Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. 
Importantly, this framework is trained without using any variant data, allowing it to predict the regulatory impact of any variant, including rare or previously unseen ones. Sei is described in the following manuscript: Kathleen M. Chen, Aaron K. Wong, Olga G. Troyanskaya and Jian Zhou, `A sequence-based global map of regulatory activity for deciphering human genetics `_. Nature Genetics (2018). diff --git a/docs/seqweaver.rst b/docs/seqweaver.rst index 13f968b95..721b2c81b 100644 --- a/docs/seqweaver.rst +++ b/docs/seqweaver.rst @@ -6,8 +6,7 @@ Introduction ------------ Seqweaver is a deep learning framework designed to predict how genetic variants affect post-transcriptional RNA-binding protein (RBP) interactions. The model is trained on RBP-RNA interaction data -obtained from CLIP-seq experiments. Seqweaver consists of 232 RBP models spanning 88 distinct RBPs, and can predict the impact of genetic variants (including variants never seen in genomic -databases) at single-nucleotide resolution. +obtained from CLIP-seq experiments. Seqweaver consists of 232 RBP models spanning 88 distinct RBPs, and can predict the impact of genetic variants at single-nucleotide resolution. Importantly, this framework is trained without using any variant data, allowing it to predict the impact on RBP binding of any variant, including rare or previously unseen ones. Seqweaver is described in: Park CY, Zhou J, Wong AK, Chen KM, Theesfeld CL, Darnell RB, Troyanskaya OG. `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk `_. Nat Genet. 2021 Feb;53(2):166-173. 
From 60b56e8f63d8e3cc3ad0ec1fb892a54fcf8a183a Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 13:20:13 -0400 Subject: [PATCH 07/17] copyedit index --- docs/index.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/index.rst b/docs/index.rst index 99b041ce5..da19a1e39 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -7,9 +7,9 @@ HumanBase User Guide ===================== -HumanBase is a resource for biological researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues and human disease. A brief summary of some of the key tools included in HumanBase is below; for more detail on each tool, see the corresponding documentation page. +HumanBase is a resource for biological researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues and human disease. -This resource is not merely a public database of primary genomics data or biological literature. The data-driven integrative analyses (i.e. algorithms that “learn” from large genomic data collections) presented in HumanBase are especially powerful because they separate signal from noise in large biological data collections to reach beyond “existing biological knowledge” represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research. Carefully designed algorithms can drive the development of experimentally testable hypotheses. Thus, HumanBase is a resource for biomedical researchers to incorporate into their research workflows, which they can use to interpret their experimental results and generate hypotheses for experimental follow-up. +This resource is not merely a public database of primary genomics data or biological literature. 
The data-driven integrative analyses (i.e. algorithms that “learn” from large genomic data collections) presented in HumanBase are especially powerful because they separate signal from noise in large biological data collections to reach beyond existing biological knowledge represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research. Carefully designed algorithms can drive the development of experimentally testable hypotheses. Thus, HumanBase is a resource for biomedical researchers to incorporate into their research workflows, which they can use to interpret their experimental results and generate hypotheses for experimental follow-up. --------------------- Who are we? From dce2718c9c53352ca207ed1a985f21d6a3c81625 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 13:22:48 -0400 Subject: [PATCH 08/17] copyedit --- docs/use-cases/ctcf-disruption.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/use-cases/ctcf-disruption.rst b/docs/use-cases/ctcf-disruption.rst index d91905dfb..4db51f795 100644 --- a/docs/use-cases/ctcf-disruption.rst +++ b/docs/use-cases/ctcf-disruption.rst @@ -4,7 +4,7 @@ CTCF disruption example **Task: Investigate the impact of variants at a site experimentally determined to bind CTCF.** -HumanBase sequence models can be used to probe how sequence relates to function. For example, given the experimental observation that CTCF binds in a ChipSeq assay to the region chr1:762285-762358 (hg19) (`insulatordb.uthsc.edu `_), a researcher can ask which positions and bases (ACG or T) are important for binding through querying the model with a set of variants. +HumanBase sequence models can be used to probe how sequence relates to function. 
For example, given the experimental observation that CTCF binds in a ChipSeq assay to the region chr1:762285-762358 (hg19) (`insulatordb.uthsc.edu `_), a researcher can ask which positions and bases (A, C, G, or T) are important for binding through querying the model with a set of variants. The variants can be any possible set of variants. In item 1, we query a set of variants near the center of the CTCF binding region (could be from a screen, a set of variants observed in genome sequencing study, etc.). No more than 10,000 variants may be queried in a single submission to the webserver. In item 7, we query the model to predict the effect of every single possible mutation across the region with in silico mutagenesis. From 2cac492d6a16aa7111c39cf7dc7b1d5bda36bc10 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 13:28:52 -0400 Subject: [PATCH 09/17] copyedit asd browser use case --- docs/use-cases/asd-genome-browser.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/use-cases/asd-genome-browser.rst b/docs/use-cases/asd-genome-browser.rst index 70fe80a78..0d0f06601 100644 --- a/docs/use-cases/asd-genome-browser.rst +++ b/docs/use-cases/asd-genome-browser.rst @@ -2,7 +2,7 @@ ASD Genome Browser use case =========================== -**Task: What is the significance of a noncoding autism proband variation observed in the `Simons Simplex Collection `_?** +**Task: What is the significance of a noncoding autism proband variation observed in the** `Simons Simplex Collection `_ **?** * Select the ASD Browser analysis. Input a genomic region of interest. For example, here we view the predicted disease impact of variants in the vicinity of gene TENM3, centered on a predicted high DNA disease impact variant in an intronic region (see the help page for Beluga (DeepSEA) for information on how the DNA disease impact score is computed and the help page for Seqweaver for information on how the RNA disease impact score is computed.
From 7a341e49145442bd3b2fac1aa9d0acacac8b36a2 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 13:29:38 -0400 Subject: [PATCH 10/17] copyedit asd browser use case --- docs/use-cases/asd-genome-browser.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/use-cases/asd-genome-browser.rst b/docs/use-cases/asd-genome-browser.rst index 0d0f06601..f24656620 100644 --- a/docs/use-cases/asd-genome-browser.rst +++ b/docs/use-cases/asd-genome-browser.rst @@ -5,7 +5,7 @@ ASD Genome Browser use case **Task: What is the significance of a noncoding autism proband variation observed in the** `Simons Simplex Collection `_ **?** -* Select the ASD Browser analysis. Input a genomic region of interest. For example, here we view the predicted disease impact of variants in the vicinity of gene TENM3, centered on a predicted high DNA disease impact variant in an intronic region (see the help page for Beluga (DeepSEA) for information on how the DNA disease impact score is computed and the help page for Seqweaver for information on how the RNA disease impact score is computed. +* Select the ASD Browser analysis. Input a genomic region of interest. For example, here we view the predicted disease impact of variants in the vicinity of gene TENM3, centered on a predicted high DNA disease impact variant in an intronic region (see the help page for Beluga (DeepSEA) for information on how the DNA disease impact score is computed and the help page for Seqweaver for information on how the RNA disease impact score is computed). .. 
figure:: ../img/use-cases/asd-browser-1.png :align: center From 2cf2d34393309f6f5f41ba46741576bb12f4b899 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 13:31:30 -0400 Subject: [PATCH 11/17] copyedit module clustering use case --- docs/use-cases/functional-module-clustering.rst | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/use-cases/functional-module-clustering.rst b/docs/use-cases/functional-module-clustering.rst index 90dccab47..4582e258d 100644 --- a/docs/use-cases/functional-module-clustering.rst +++ b/docs/use-cases/functional-module-clustering.rst @@ -2,9 +2,7 @@ Functional module clustering use case ===================================== -This use case is drawn from Bishop et al. 2022, Inflammation Subtypes and Translating - -Inflammation-Related Genetic Findings in Schizophrenia and Related Psychoses: A Perspective on Pathways for Treatment Stratification and Novel Therapies +This use case is drawn from Bishop et al. 2022, Inflammation Subtypes and Translating Inflammation-Related Genetic Findings in Schizophrenia and Related Psychoses: A Perspective on Pathways for Treatment Stratification and Novel Therapies **Task: How can I partition my list of genes of interest into modules, and what are the functional enrichments in these modules?** From df736d986edf27391b646e7c4c384ed3503a29c3 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 13:34:52 -0400 Subject: [PATCH 12/17] copyedit functional networks doc --- docs/functional-networks.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/functional-networks.rst b/docs/functional-networks.rst index aa7f32343..39055c0cc 100644 --- a/docs/functional-networks.rst +++ b/docs/functional-networks.rst @@ -24,7 +24,7 @@ We collected and integrated 987 genome-scale data sets encompassing approximatel * TF regulation: To estimate shared transcription factor regulation between genes, we collected binding motifs from JASPAR. 
Genes were scored for the presence of transcription factor binding sites using the MEME software suite. Motif matches were treated as binary scores (present if P < 0.001). The final score for each gene pair was obtained by calculating the Pearson correlation between the motif association vectors for the genes. -* MSigDB purturbations and miRNA: Chemical and genetic perturbation (c2:CGP) and microRNA target (c3:MIR) profiles were downloaded from the Molecular Signatures Database (MSigDB). Each gene pair's score was the sum of shared profiles weighted by the specificity of each profile +* MSigDB perturbations and miRNA: Chemical and genetic perturbation (c2:CGP) and microRNA target (c3:MIR) profiles were downloaded from the Molecular Signatures Database (MSigDB). Each gene pair's score was the sum of shared profiles weighted by the specificity of each profile. Evidence From d24ee2b5b2b0d5913259fa12dc1c88d6db729715 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 13:39:56 -0400 Subject: [PATCH 13/17] copyedit modules doc --- docs/modules.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/modules.rst b/docs/modules.rst index 8a6d58ae3..2878a88e4 100644 --- a/docs/modules.rst +++ b/docs/modules.rst @@ -13,11 +13,11 @@ The approach is based on shared k-nearest-neighbors (SKNN) and the Louvain commu This technique proceeds as follows: (i) First, we create a subset of the user-selected network containing only the user-provided genes and all the edges between them. 
Given the resulting graph G with V nodes (user-provided genes) and E edges, with each edge between genes i and j associated with a weight p\ :sub:`ij`, - (ii) Calculate a new weight for the edge between each pair of nodes i and j that is equal to the number of k nearest neighbors (based on the original weights p\ :sub:`ij`) shared by i and j; + (ii) Calculate a new weight for the edge between each pair of nodes i and j that is equal to the number of k nearest neighbors (based on the original weights p\ :sub:`ij`) shared by i and j. (iii) Choose the top 5% of the edges based on the new edge weights, and apply a graph clustering algorithm. This approach has two key desirable characteristics: - (i) Choosing the highest k values instead of all edges deemphasizes high-degree 'hub' nodes and brings equal attention to highly specific edges between low-degree nodes; + (i) Choosing the highest k values instead of all edges deemphasizes high-degree 'hub' nodes and brings equal attention to highly specific edges between low-degree nodes. (ii) Emphasizing local network-structure by connecting nodes that share a number of local neighbors automatically links genes that are highly likely to be part of the same cluster. We use a dynamic :code:`k = min(50, 0.2 * |V|)` to obtain the shared-nearest-neighbor tissue-specific network and apply the Louvain algorithm to cluster this network into distinct modules, where V is the number of query genes. Krishnan et al. (2016) showed that module node membership and cluster sizes are robust by testing a range of values for k from 10 to 100. To stabilize clustering across different runs of the Louvain algorithm, we run the algorithm 100 times and calculate cluster comembership scores for each pair of genes that was equal to the fraction of times (out of 100) the pair was assigned to the same cluster. Genes are assigned to clusters where their comembership score ≥ 0.9. 
From 4c551930ee9445e9525c81553eff7a321c331134 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 13:46:45 -0400 Subject: [PATCH 14/17] copyedit beluga, seqweaver docs --- docs/beluga.rst | 2 +- docs/seqweaver.rst | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/beluga.rst b/docs/beluga.rst index 895b6c106..cf3e76aef 100644 --- a/docs/beluga.rst +++ b/docs/beluga.rst @@ -18,7 +18,7 @@ To determine if certain features (ie. transcription factors, marks, or cell type Input ----- -Beluga predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be utilized for predicting chromatin features for any DNA sequence. +Beluga predicts genomic variant effects on a wide range of chromatin features at the variant position (transcription factor binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). .. |bp_length| replace:: 2000 .. |bed_example| replace:: ``chr5 134871851 134871852`` diff --git a/docs/seqweaver.rst b/docs/seqweaver.rst index 721b2c81b..0749b4795 100644 --- a/docs/seqweaver.rst +++ b/docs/seqweaver.rst @@ -19,7 +19,7 @@ We support three types of input: VCF, FASTA, BED. If you want to predict effects Examples of all input formats are available in the job submission interface. See below for a quick introduction: -**VCF format** is used for specifying a genomic variant. A minimal example is ``chr8 38120276 [QKI disruption] C A +`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The six columns are chromosome, position, name, reference allele, alternative allele, and strand. Note that strand must be specified for Seqweaver but not for the other deep learning models. +**VCF format** is used for specifying a genomic variant. 
A minimal example is ``chr8 38120276 [QKI disruption] C A +`` (if you want to copy this text as input, you will need to change spaces to tabs). The six columns are chromosome, position, name, reference allele, alternative allele, and strand. Note that strand must be specified for Seqweaver but not for the other deep learning models. **FASTA format** input should include sequences of 1000 bp length each. If a sequence is different from 1000 bp: From a980bd5db86365b5ec8b7ce6c29b7a0dfd64b389 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 13:50:49 -0400 Subject: [PATCH 15/17] add paper reference to ism --- docs/in-silico-mutagenesis.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/in-silico-mutagenesis.rst b/docs/in-silico-mutagenesis.rst index ed42a6e0f..7ef34c9c3 100644 --- a/docs/in-silico-mutagenesis.rst +++ b/docs/in-silico-mutagenesis.rst @@ -9,7 +9,8 @@ Perform "In silico saturated mutagenesis" (ISM) analysis to discover informative Note that ISM only accepts a sequence (FASTA file) as input. The input FASTA file should be 2000 base pairs long. -The chromatin impact prediction is performed using the :doc:`beluga` model. +The chromatin impact prediction is performed using the :doc:`beluga` model. See the following manuscript: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_. Nature Genetics (2018). + ISM outputs effects for each of three possible substitutions of all 2000 bases, across all chromatin features. 
From 8b9396ed67ea91ae49f7775244ae76e3c023ab8c Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 13:53:46 -0400 Subject: [PATCH 16/17] copyedit citations --- docs/citations.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/citations.rst b/docs/citations.rst index 6a73c44ad..6124848a5 100644 --- a/docs/citations.rst +++ b/docs/citations.rst @@ -25,8 +25,8 @@ Seqweaver Park, C. Y., Zhou, J., Wong, A. K., Chen, K. M., Theesfeld, C. L., Darnell, R. B., & Troyanskaya, O. G. (2021). `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk `_. Nat Genet. -Variant effect predictions (ExPecto) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +ExPecto +~~~~~~~~ Zhou, J., Theesfeld, C. L., Yao, K., Chen, K. M., Wong, A. K., & Troyanskaya, O. G. (2018), `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_, Nature Genetics. ExPectoSC From b341d87fc6eea79e766e179ad6913287f5f23868 Mon Sep 17 00:00:00 2001 From: rssealfon Date: Mon, 11 Aug 2025 15:42:02 -0400 Subject: [PATCH 17/17] fix ExPecto bulk download description --- docs/expecto.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/expecto.rst b/docs/expecto.rst index e34fc7f6c..941f92c12 100644 --- a/docs/expecto.rst +++ b/docs/expecto.rst @@ -23,7 +23,7 @@ Download -------- Predicted expression effects ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -This is the bulk download `link `_ of all mutation predictions. +This is the bulk download `link `_ of 1000 Genomes variants that passed a minimum predicted effect threshold (>0.3 log fold-change in any tissue). Variation potential directionality scores ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~