4 changes: 2 additions & 2 deletions docs/beluga.rst
@@ -5,7 +5,7 @@ Beluga (DeepSEA)
Introduction
------------

DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factor binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features.
DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factor binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. Importantly, this framework is trained without using any variant data, allowing it to predict the chromatin impact of any variant, including rare or previously unseen ones. The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features.

Beluga is described in: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk <https://www.nature.com/articles/s41588-018-0160-6>`_. Nature Genetics (2018).

@@ -18,7 +18,7 @@ To determine if certain features (ie. transcription factors, marks, or cell type
Input
-----

Beluga predicts genomic variant effects on a wide range of chromatin features at the variant position (transcription factor binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be utilized for predicting chromatin features for any DNA sequence.
Beluga predicts genomic variant effects on a wide range of chromatin features at the variant position (transcription factor binding, DNase I hypersensitive sites, and histone marks in multiple human cell types).

.. |bp_length| replace:: 2000
.. |bed_example| replace:: ``chr5 134871851 134871852``
8 changes: 4 additions & 4 deletions docs/citations.rst
@@ -25,17 +25,17 @@ Seqweaver
Park, C. Y., Zhou, J., Wong, A. K., Chen, K. M., Theesfeld, C. L., Darnell, R. B., & Troyanskaya, O. G. (2021). `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk <https://www.nature.com/articles/s41588-020-00761-3>`_. Nat Genet.


Variant effect predictions (ExPecto)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ExPecto
~~~~~~~~
Zhou, J., Theesfeld, C. L., Yao, K., Chen, K. M., Wong, A. K., & Troyanskaya, O. G. (2018), `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk <https://www.nature.com/articles/s41588-018-0160-6>`_, Nature Genetics.

ExPectoSC
~~~~~~~~~
Sokolova, K., Theesfeld, C. L., Wong, A. K., Zhang, Z., Dolinski, K., & Troyanskaya, O. G. (2023), `Atlas of primary cell-type specific sequence models of gene expression and variant effects <https://www.cell.com/cell-reports-methods/fulltext/S2667-2375(23)00224-2>`_. Cell Reports Methods.

Autism gene predictions
Autism variant effect predictions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG. (2016) Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature Neuroscience.
Zhou, J.*, Park, C. Y.*, Theesfeld, C. L.*, Wong, A. K., Yuan, Y., Scheckel, C., ... & Troyanskaya, O. G. (2019). `Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk <https://www.nature.com/articles/s41588-019-0420-0>`_. Nature Genetics, 51(6), 973-980.

Tissue-expression predictions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3 changes: 2 additions & 1 deletion docs/deepsea.rst
@@ -5,7 +5,8 @@ DeepSEA Analysis
Introduction
------------

DeepSEA is a deep learning framework that predicts genomic variant effects with single nucleotide sensitivity on a wide range of regulatory features: transcription factor binding, DNase I hypersensitive sites, and histone marks in multiple human cell types.
DeepSEA is a deep learning framework that predicts genomic variant effects with single nucleotide sensitivity on a wide range of regulatory features: transcription factor binding, DNase I hypersensitive sites, and histone marks in multiple human cell types. Importantly, this framework is trained without using any variant data, allowing it to predict the chromatin impact of any variant, including rare or previously unseen ones.


DeepSEA-based Methods
---------------------
4 changes: 2 additions & 2 deletions docs/expecto.rst
@@ -4,7 +4,7 @@ ExPecto

Introduction
------------
ExPecto is a framework for ab initio sequence-based prediction of the effects of mutations on gene expression and disease risk. With this web interface, we provide an explorer of tissue-specific expression effect predictions.
ExPecto is a framework for ab initio sequence-based prediction of the effects of mutations on gene expression and disease risk. Importantly, this framework is trained without using any variant data, allowing it to predict the expression effects of any variant, including rare or previously unseen ones. With this web interface, we provide an explorer of tissue-specific expression effect predictions.

The ExPecto framework is described in the following manuscript: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk <https://www.nature.com/articles/s41588-018-0160-6>`_, Nature Genetics (2018).

@@ -23,7 +23,7 @@ Download
--------
Predicted expression effects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is the bulk download `link <http://deepsea.princeton.edu/media/code/expecto/combined_snps.0.3.zip>`_ of all mutation predictions.
This is the bulk download `link <http://deepsea.princeton.edu/media/code/expecto/combined_snps.0.3.zip>`_ of 1000 Genomes variants that passed a minimum predicted effect threshold (>0.3 log fold-change in any tissue).

Variation potential directionality scores
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion docs/expectosc.rst
@@ -6,7 +6,7 @@ ExPectoSC
Introduction
------------

ExPectoSC is a framework for `ab initio` sequence-based prediction of the effects of mutations on gene expression in primary human cell types. With this web interface, we provide an explorer of cell type-specific expression effect predictions. The current release contains all ClinVar variants within +/- 20kb of the representative TSS of a gene. We use 1000 Genomes variant effect predictions for z-score normalization. No effect threshold was employed for the current release of the data.
ExPectoSC is a framework for `ab initio` sequence-based prediction of the effects of mutations on gene expression in primary human cell types. With this web interface, we provide an explorer of cell type-specific expression effect predictions. The current release contains all ClinVar variants within +/- 20kb of the representative TSS of a gene. We use 1000 Genomes variant effect predictions for z-score normalization. No effect threshold was employed for the current release of the data. Importantly, this framework is trained without using any variant data, allowing it to predict the expression effects of any variant, including rare or previously unseen ones.

The ExPectoSC framework is described in the following manuscript: Ksenia Sokolova, Chandra L. Theesfeld, Aaron K. Wong, Zijun Zhang, Kara Dolinski and Olga G. Troyanskaya, `Atlas of primary cell-type specific sequence models of gene expression and variant effects <https://www.cell.com/cell-reports-methods/fulltext/S2667-2375(23)00224-2>`_. Cell Reports Methods (2023).

2 changes: 1 addition & 1 deletion docs/functional-networks.rst
@@ -24,7 +24,7 @@ We collected and integrated 987 genome-scale data sets encompassing approximatel

* TF regulation: To estimate shared transcription factor regulation between genes, we collected binding motifs from JASPAR. Genes were scored for the presence of transcription factor binding sites using the MEME software suite. Motif matches were treated as binary scores (present if P < 0.001). The final score for each gene pair was obtained by calculating the Pearson correlation between the motif association vectors for the genes.

* MSigDB perturbations and miRNA: Chemical and genetic perturbation (c2:CGP) and microRNA target (c3:MIR) profiles were downloaded from the Molecular Signatures Database (MSigDB). Each gene pair's score was the sum of shared profiles weighted by the specificity of each profile
* MSigDB perturbations and miRNA: Chemical and genetic perturbation (c2:CGP) and microRNA target (c3:MIR) profiles were downloaded from the Molecular Signatures Database (MSigDB). Each gene pair's score was the sum of shared profiles weighted by the specificity of each profile.
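
The TF regulation score described in the list above can be illustrated with a minimal sketch. The function names ``motif_vector`` and ``pearson`` are hypothetical and are not part of HumanBase or the MEME suite; the sketch assumes each gene already has one motif-match P value per motif and only shows the binarization (present if P < 0.001) and the Pearson correlation between two genes' motif vectors.

```python
from math import sqrt


def motif_vector(pvalues, threshold=0.001):
    """Binarize per-motif match P values: 1 if P < threshold, else 0."""
    return [1 if p < threshold else 0 for p in pvalues]


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # The correlation is undefined when a gene has all-0 or all-1 motifs.
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Made-up P values for two genes across four motifs (illustrative only).
gene_a = motif_vector([0.0001, 0.5, 0.0009, 0.001])
gene_b = motif_vector([0.0002, 0.2, 0.0005, 0.3])
score = pearson(gene_a, gene_b)
```

How gene pairs with an undefined correlation are handled in the actual pipeline is not stated here; returning NaN is an assumption of this sketch.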


Evidence
3 changes: 2 additions & 1 deletion docs/in-silico-mutagenesis.rst
@@ -9,7 +9,8 @@ Perform "In silico saturated mutagenesis" (ISM) analysis to discover informative

Note that ISM only accepts a sequence (FASTA file) as input. The input FASTA file should be 2000 base pairs long.

The chromatin impact prediction is performed using the :doc:`beluga` model.
The chromatin impact prediction is performed using the :doc:`beluga` model. See the following manuscript: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk <https://www.nature.com/articles/s41588-018-0160-6>`_. Nature Genetics (2018).


ISM outputs effects for each of three possible substitutions of all 2000 bases, across all chromatin features.
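
To make "each of three possible substitutions" concrete, the following minimal sketch enumerates every single-base substitution of an input sequence. The function name ``enumerate_substitutions`` is hypothetical and not part of HumanBase; scoring each mutated sequence with the Beluga model is not shown.

```python
from typing import Iterator, Tuple

BASES = "ACGT"


def enumerate_substitutions(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref_base, alt_base, mutated_sequence) for every
    single-base substitution of seq (0-based positions)."""
    seq = seq.upper()
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt == ref:
                continue  # skip the reference base itself
            yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


# A toy 4 bp sequence; a 2000 bp input of A/C/G/T bases yields 6000 mutants.
mutants = list(enumerate_substitutions("ACGT"))
```

Each position in an A/C/G/T sequence contributes exactly three mutated sequences, so the number of predictions grows linearly with sequence length.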

14 changes: 5 additions & 9 deletions docs/index.rst
@@ -7,25 +7,21 @@
HumanBase User Guide
=====================

HumanBase is a resource for biological researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues and human disease.

This resource is not merely a public database of primary genomics data or biological literature. The data-driven integrative analyses (i.e. algorithms that “learn” from large genomic data collections) presented in HumanBase are especially powerful because they separate signal from noise in large biological data collections to reach beyond existing biological knowledge represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research. Carefully designed algorithms can drive the development of experimentally testable hypotheses. Thus, HumanBase is a resource for biomedical researchers to incorporate into their research workflows, which they can use to interpret their experimental results and generate hypotheses for experimental follow-up.

---------------------
About
Who are we?
---------------------

HumanBase is a resource for biological and biomedical researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues, development, and human disease.

Data-driven integrative analyses are especially powerful because they reach beyond “known biological knowledge” represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research. Thus, carefully designed algorithms can drive the development of experimentally testable hypotheses, enabling deeper understanding of basic biology at the molecular level, pathophysiology, and paving the way to therapy and drug development.
HumanBase is actively developed by the `Genomics group <https://www.simonsfoundation.org/flatiron-institute/simons-center-for-data-analysis/genomics/>`_ at the `Flatiron Institute <https://www.simonsfoundation.org/flatiron-institute/>`_.

---------------------
Licensing
---------------------
All data in HumanBase are freely available under a `CC-BY 4.0 <https://creativecommons.org/licenses/by/4.0/>`_ license. Please give appropriate credit, provide a link to the license, and indicate if changes were made.

---------------------
Who are we?
---------------------
HumanBase is actively developed by the `Genomics group <https://www.simonsfoundation.org/flatiron-institute/simons-center-for-data-analysis/genomics/>`_ at the `Flatiron Institute <https://www.simonsfoundation.org/flatiron-institute/>`_ .

---------------------
Help topics
---------------------
4 changes: 2 additions & 2 deletions docs/modules.rst
@@ -13,11 +13,11 @@ The approach is based on shared k-nearest-neighbors (SKNN) and the Louvain commu

This technique proceeds as follows:
(i) First, we create a subset of the user-selected network containing only the user-provided genes and all the edges between them. Given the resulting graph G with V nodes (user-provided genes) and E edges, with each edge between genes i and j associated with a weight p\ :sub:`ij`,
(ii) Calculate a new weight for the edge between each pair of nodes i and j that is equal to the number of k nearest neighbors (based on the original weights p\ :sub:`ij`) shared by i and j;
(ii) Calculate a new weight for the edge between each pair of nodes i and j that is equal to the number of k nearest neighbors (based on the original weights p\ :sub:`ij`) shared by i and j.
(iii) Choose the top 5% of the edges based on the new edge weights, and apply a graph clustering algorithm.

This approach has two key desirable characteristics:
(i) Choosing the highest k values instead of all edges deemphasizes high-degree 'hub' nodes and brings equal attention to highly specific edges between low-degree nodes;
(i) Choosing the highest k values instead of all edges deemphasizes high-degree 'hub' nodes and brings equal attention to highly specific edges between low-degree nodes.
(ii) Emphasizing local network-structure by connecting nodes that share a number of local neighbors automatically links genes that are highly likely to be part of the same cluster.

We use a dynamic :code:`k = min(50, 0.2 * |V|)` to obtain the shared-nearest-neighbor tissue-specific network and apply the Louvain algorithm to cluster this network into distinct modules, where V is the number of query genes. Krishnan et al. (2016) showed that module node membership and cluster sizes are robust by testing a range of values for k from 10 to 100. To stabilize clustering across different runs of the Louvain algorithm, we run the algorithm 100 times and calculate a cluster comembership score for each pair of genes, equal to the fraction of runs (out of 100) in which the pair was assigned to the same cluster. Genes are assigned to clusters where their comembership score is ≥ 0.9.
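
The edge re-weighting steps above can be sketched as follows. This is a minimal illustration and not the HumanBase implementation: the function names are hypothetical, rounding ``0.2 * |V|`` down (with a floor of 1) is an assumption, and the Louvain clustering and the 100-run comembership step are not shown.

```python
def dynamic_k(num_genes):
    """k = min(50, 0.2 * |V|), rounded down and kept at least 1 (assumed)."""
    return max(1, min(50, num_genes // 5))


def shared_neighbor_weights(weights, k):
    """Count shared k nearest neighbors for every pair of nodes.

    weights: symmetric matrix (list of lists) of original edge weights p_ij.
    Returns {(i, j): shared neighbor count} for all i < j.
    """
    n = len(weights)
    neighbors = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # The nearest neighbors of i are the k nodes with the highest p_ij.
        others.sort(key=lambda j: weights[i][j], reverse=True)
        neighbors.append(set(others[:k]))
    return {
        (i, j): len(neighbors[i] & neighbors[j])
        for i in range(n)
        for j in range(i + 1, n)
    }


def top_edges(sknn, fraction=0.05):
    """Keep the top fraction of edges by new weight (at least one edge)."""
    ranked = sorted(sknn.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]


# Toy 4-gene network with made-up weights (illustrative only).
w = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
]
sknn = shared_neighbor_weights(w, k=2)
kept = top_edges(sknn)
```

In this toy network, genes 0 and 3 have a weak direct edge but share both of their nearest neighbors, so their re-weighted edge ranks highest. The retained edges would then be passed to a graph clustering algorithm such as Louvain.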
2 changes: 1 addition & 1 deletion docs/sei.rst
@@ -5,7 +5,7 @@ Sei
Introduction
------------

Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model.
Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. Importantly, this framework is trained without using any variant data, allowing it to predict the regulatory impact of any variant, including rare or previously unseen ones.

Sei is described in the following manuscript: Kathleen M. Chen, Aaron K. Wong, Olga G. Troyanskaya and Jian Zhou, `A sequence-based global map of regulatory activity for deciphering human genetics <https://www.nature.com/articles/s41588-022-01102-2>`_. Nature Genetics (2022).

5 changes: 2 additions & 3 deletions docs/seqweaver.rst
@@ -6,8 +6,7 @@ Introduction
------------

Seqweaver is a deep learning framework designed to predict how genetic variants affect post-transcriptional RNA-binding protein (RBP) interactions. The model is trained on RBP-RNA interaction data
obtained from CLIP-seq experiments. Seqweaver consists of 232 RBP models spanning 88 distinct RBPs, and can predict the impact of genetic variants (including variants never seen in genomic
databases) at single-nucleotide resolution.
obtained from CLIP-seq experiments. Seqweaver consists of 232 RBP models spanning 88 distinct RBPs, and can predict the impact of genetic variants at single-nucleotide resolution. Importantly, this framework is trained without using any variant data, allowing it to predict the impact on RBP binding of any variant, including rare or previously unseen ones.

Seqweaver is described in:
Park CY, Zhou J, Wong AK, Chen KM, Theesfeld CL, Darnell RB, Troyanskaya OG. `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk <https://www.nature.com/articles/s41588-020-00761-3>`_. Nat Genet. 2021 Feb;53(2):166-173.
@@ -20,7 +19,7 @@ We support three types of input: VCF, FASTA, BED. If you want to predict effects

Examples of all input formats are available in the job submission interface. See below for a quick introduction:

**VCF format** is used for specifying a genomic variant. A minimal example is ``chr8 38120276 [QKI disruption] C A +`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The six columns are chromosome, position, name, reference allele, alternative allele, and strand. Note that strand must be specified for Seqweaver but not for the other deep learning models.
**VCF format** is used for specifying a genomic variant. A minimal example is ``chr8 38120276 [QKI disruption] C A +`` (if you want to copy this text as input, you will need to change spaces to tabs). The six columns are chromosome, position, name, reference allele, alternative allele, and strand. Note that strand must be specified for Seqweaver but not for the other deep learning models.
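
Because the name column in the example contains a space, blindly replacing every space with a tab would split it into two columns. A minimal sketch of building the tab-separated line from its six fields instead (the helper name ``seqweaver_vcf_line`` is hypothetical and not part of HumanBase):

```python
def seqweaver_vcf_line(chrom, pos, name, ref, alt, strand):
    """Join the six columns (chromosome, position, name, reference allele,
    alternative allele, strand) with tabs."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    fields = [chrom, str(pos), name, ref, alt, strand]
    if any("\t" in field for field in fields):
        raise ValueError("fields must not contain tab characters")
    return "\t".join(fields)


line = seqweaver_vcf_line("chr8", 38120276, "[QKI disruption]", "C", "A", "+")
```

Splitting ``line`` on tabs recovers exactly six columns, with the name ``[QKI disruption]`` kept intact.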

**FASTA format** input should include sequences of 1000 bp length each. If a sequence is different from 1000 bp:
