Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions docs/_includes/common-input-formats.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
We support three types of input: VCF, FASTA, BED. If you want to predict effects of noncoding variants, use VCF format input. If you want to predict chromatin feature probabilities for DNA sequences, use FASTA format. If you want to predict RBP interaction probabilities for transcript sequences, you can also use FASTA format. If you want to specify sequences from the human reference genome (GRCh37/hg19), you can use BED format. See below for a quick introduction:
We support three types of input: VCF, FASTA, BED. If you want to predict effects of noncoding variants, use VCF format input. If you want to predict chromatin feature probabilities for DNA sequences, use FASTA format. If you want to specify sequences from the human reference genome, you can use BED format. See below for a quick introduction:

**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. Currently, the genome position needs to be in GRCh37/hg19.
**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele.

**FASTA format** input should include sequences of |bp_length|\ bp length each. If a sequence is different from |bp_length|\ bp:

Expand All @@ -12,4 +12,4 @@ We support three types of input: VCF, FASTA, BED. If you want to predict effects
- **Note**: This padding behavior is not recommended. N's were extremely rare in training data (only appearing in assembly gaps), and the model has not been evaluated with artificially padded sequences
- **Strong recommendation**: Always provide sequences of exactly |bp_length|\ bp by including genomic flanking sequences

**BED format** provides another way to specify sequences in human reference genome (hg19). The BED input should specify |bp_length|\ bp-length regions. A minimal example is |bed_example|. The three columns are chromosome, start position, and end position.
**BED format** provides another way to specify sequences in human reference genome. A minimal example is |bed_example|. The three columns are chromosome, start position, and end position.
4 changes: 0 additions & 4 deletions docs/_includes/common-submission-info.rst
Original file line number Diff line number Diff line change
@@ -1,7 +1,3 @@
Genome coordinates
~~~~~~~~~~~~~~~~~~
We support only ``GRCh37/hg19`` genome coordinates. You can use LiftOver to convert your coordinates to the correct version.

Large submissions
~~~~~~~~~~~~~~~~~
We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly.
23 changes: 23 additions & 0 deletions docs/asdbrowser.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
==============
ASD Browser
==============

Introduction
------------

The ASD browser allows exploration of predicted impact (chromatin effects and post-transcriptional RNA-binding protein impact) of de novo noncoding variants in autism probands in the `Simons Simplex Collection <https://www.sfari.org/resource/simons-simplex-collection/>`_.

The variant prediction approach is described in the following manuscript: Zhou J, Park CY, Theesfeld CL, Wong AK, Yuan Y, Scheckel C, Fak JJ, Funk J, Yao K, Tajima Y, Packer A, Darnell RB, Troyanskaya OG. (2019). `Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk <https://www.nature.com/articles/s41588-019-0420-0>`_. Nature Genetics.

Description
-----------

This website provides a user-friendly interactive interface for exploring the sequence-based predicted effects of SSC ASD proband mutations. Both individual molecular-level effects at chromatin (“DNA”) level and RNA-binding protein (“RNA”) level and Disease Impact Scores summarizing molecular level effects are shown. The methodology and analysis are described in the manuscript “Whole-genome deep learning analysis reveals causal role of noncoding mutations in autism”.

.. image:: img/genome_browser.png

The Genome browser can be navigated by entering a genomic interval, a gene name, or interactively through zooming in/out and scrolling. The tracks “DNA Disease Impact Score” and “RNA Disease Impact Score” show mutation disease impact score (DIS) from DNA and RNA models respectively. DIS scores summarize molecular-level biochemical effects at DNA and RNA level into two scores based on regularized logistic regression classifiers trained with HGMD mutations.

.. image:: img/genome_heatmap.png

Individual molecular-level biochemical effects are shown as a heatmap. The biochemical features are sorted by the magnitude of predicts effects of the center mutation. Each mutation may be clicked to center the genome browser and the heatmap at that mutation, or the heatmap may be dragged to alter the center mutation. The user can select “DNA features” or “RNA features” from the dropdown menu. Mousing over any individual prediction in the heatmap will display details in a tooltip.
21 changes: 8 additions & 13 deletions docs/beluga.rst
Original file line number Diff line number Diff line change
@@ -1,32 +1,27 @@
=======
DeepSEA (Beluga)
Beluga (DeepSEA)
=======

Introduction
------------

DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants.
DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features.

The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features. Beluga is described in:
Beluga is described in: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk <https://www.nature.com/articles/s41588-018-0160-6>`_. Nature Genetics (2018).

Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, **Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk**. Nature Genetics (2018).

To determine if certain features (ie. transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table <https://s3-us-west-2.amazonaws.com/humanbase-dev/deepsea/examples/41588_2019_420_MOESM9_ESM.csv>`_ which has all the profiles used to train DeepSEA.
DeepSEA is originally described in the following manuscript: Jian Zhou, Olga G. Troyanskaya. `Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model <https://www.nature.com/articles/nmeth.3547>`_ Nature Methods (2015).

DeepSEA is originally described in the following manuscript:

Jian Zhou, Olga G. Troyanskaya. **Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model.** Nature Methods (2015).

To determine if certain features (ie. transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table <https://s3-us-west-2.amazonaws.com/humanbase-dev/deepsea/examples/41588_2019_420_MOESM9_ESM.csv>`_ which has all the profiles used to train DeepSEA.
To determine if certain features (ie. transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table <https://s3-us-west-2.amazonaws.com/humanbase-dev/deepsea/examples/41588_2019_420_MOESM9_ESM.csv>`_ which has all the profiles used to train Beluga.


Input
-----

DeepSEA predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be utilized for predicting chromatin features for any DNA sequence.
Beluga predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be utilized for predicting chromatin features for any DNA sequence.

.. |bp_length| replace:: 2000
.. |bed_example| replace:: ``chr1 109817091 109819090``
.. |bed_example| replace:: ``chr5 134871851 134871852``

.. include:: _includes/common-input-formats.rst

Expand All @@ -44,7 +39,7 @@ Regulatory feature scores
Variant scores
~~~~~~~~~~~~~~

* **Disease Impact Score (DIS)**: DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted transcriptional or post-transcriptional regulatory effects of these mutations (See Zhou et. al, 2019). The predicted DIS probabilities are then converted into 'DIS e-values', computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is:
* **Disease Impact Score (DIS)**: DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted transcriptional or post-transcriptional regulatory effects of these mutations (See `Zhou et. al, 2019 <https://www.nature.com/articles/s41588-019-0420-0>`_). The predicted DIS probabilities are then converted into 'DIS e-values', computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is:

.. math::
-log10(DIS evalue_{feature})
Expand Down
27 changes: 25 additions & 2 deletions docs/citations.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,14 +6,37 @@ Tissue-specific networks, NetWAS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Greene CS*, Krishnan A*, Wong AK*, Ricciotti E, Zelaya RA, Himmelstein DS, Zhang R, Hartmann BM, Zaslavsky E, Sealfon SC, Chasman DI, FitzGerald GA, Dolinski K, Grosser T, Troyanskaya OG. (2015). `Understanding multicellular function and disease with human tissue-specific networks <http://www.nature.com/ng/journal/v47/n6/full/ng.3259.html>`_. Nature Genetics. 10.1038/ng.3259w.

Functional module detection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG.(2016) `Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder <https://www.nature.com/articles/nn.4353>`_. Nature Neuroscience.

Sei
~~~~
Chen, K. M., Wong, A. K., Troyanskaya, O. G., & Zhou, J. (2022), `A sequence-based global map of regulatory activity for deciphering human genetics <https://www.nature.com/articles/s41588-022-01102-2>`_. Nature Genetics (2018).

Beluga (DeepSEA)
~~~~~~~~~~~~~~~~
Beluga model: Zhou, J., Theesfeld, C. L., Yao, K., Chen, K. M., Wong, A. K., & Troyanskaya, O. G. (2018), `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk <https://www.nature.com/articles/s41588-018-0160-6>`_. Nature Genetics.

Original publication of the DeepSEA method: Zhou, J., & Troyanskaya, O. G. (2015) `Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model <https://www.nature.com/articles/nmeth.3547>`_ Nature Methods.

Seqweaver
~~~~~~~~~~
Park, C. Y., Zhou, J., Wong, A. K., Chen, K. M., Theesfeld, C. L., Darnell, R. B., & Troyanskaya, O. G. (2021). `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk <https://www.nature.com/articles/s41588-020-00761-3>`_. Nat Genet.


Variant effect predictions (ExPecto)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, and Troyanskaya OG. (2018) Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk, Nature Genetics.
Zhou, J., Theesfeld, C. L., Yao, K., Chen, K. M., Wong, A. K., & Troyanskaya, O. G. (2018), `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk <https://www.nature.com/articles/s41588-018-0160-6>`_, Nature Genetics.

ExPectoSC
~~~~~~~~~
Sokolova, K., Theesfeld, C. L., Wong, A. K., Zhang, Z., Dolinski, K., & Troyanskaya, O. G. (2023), `Atlas of primary cell-type specific sequence models of gene expression and variant effects <https://www.cell.com/cell-reports-methods/fulltext/S2667-2375(23)00224-2>`_. Cell Reports Methods.

Autism gene predictions
~~~~~~~~~~~~~~~~~~~~~~~
Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG.(2016) Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature Neuroscience.

Tissue-expression predictions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
\W. Ju#, C.S. Greene#, F. Eichinger, V. Nair, J.B. Hodgin, M. Bitzer, Y. Lee, Q. Zhu, M. Kehata, M. Li, M.P. Rastaldi, C.D. Cohen, O.G. Troyanskaya*, and M. Kretzler*. Defining cell type specificity at the transcriptional level in human disease. Genome Research. 23:1862-1873. 2013
Ju, W.*, Greene, C. S.8, Eichinger, F., Nair, V., Hodgin, J. B., Bitzer, M., ... Troyanskaya, O. G.* & Kretzler, M*. (2013). `Defining cell-type specificity at the transcriptional level in human disease <https://genome.cshlp.org/content/23/11/1862.full.pdf>`_. Genome Research.
9 changes: 4 additions & 5 deletions docs/expecto.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,13 +4,12 @@ ExPecto

Introduction
------------
ExPecto is a framework for ab initio sequence-based prediction of mutation gene expression effects and disease risks. With this web interface, we provide an explorer of tissue-specific expression effect predictions. The current release contains all single nucleotide substitutions within 1kb to the representative TSS of a gene and all 1000 Genomes variants that passed a minimum predicted effect threshold (>0.3 log fold-change in any tissue).
ExPecto is a framework for ab initio sequence-based prediction of mutation gene expression effects and disease risks. With this web interface, we provide an explorer of tissue-specific expression effect predictions.

The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository <https://github.com/FunctionLab/ExPecto>`_.
The ExPecto framework is described in the following manuscript: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk <https://www.nature.com/articles/s41588-018-0160-6>`_, Nature Genetics (2018).

The ExPecto framework is described in the following manuscript:
The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository <https://github.com/FunctionLab/ExPecto>`_.

Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk, Nature Genetics, 2018

Output
------
Expand All @@ -28,7 +27,7 @@ This is the bulk download `link <http://deepsea.princeton.edu/media/code/expecto

Variation potential directionality scores
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Variation potential of a gene in a tissue or cell-type can reflect the evolutionary constraint on its expression level. Specifically, we compute the variation potential directionality score as the sum of all directional mutation effects within 1kb to TSS. A negative variation potential indicates active expression and constraint toward higher expression level, and vice versa. The sum of absolute mutation effects, or the magnitudes, is predictive of tissue/condition-specificity of a gene. The variation potential directionality scores and the inferred evolution constraint probabilities can be downloaded here.
Variation potential of a gene in a tissue or cell-type can reflect the evolutionary constraint on its expression level. Specifically, we compute the variation potential directionality score as the sum of all directional mutation effects within 1kb to TSS. A negative variation potential indicates active expression and constraint toward higher expression level, and vice versa. The sum of absolute mutation effects, or the magnitudes, is predictive of tissue/condition-specificity of a gene. The variation potential directionality scores and the inferred evolution constraint probabilities can be downloaded `here <http://deepsea.princeton.edu/media/code/expecto/evocon.zip>`_.

The full prediction of all 140 million mutations can be downloaded `here <http://deepsea.princeton.edu/media/code/expecto/all1kbmutations.tar>`_ (~125G).

Expand Down
8 changes: 3 additions & 5 deletions docs/expectosc.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,11 +8,10 @@ Introduction

ExPectoSC is a framework for `ab initio` sequence-based prediction of mutation gene expression effects for primary human cell types. With this web interface, we provide an explorer of cell type-specific expression effect predictions. The current release contains all ClinVar variants within +/- 20kb of the representative TSS of a gene. We use 1000 Genomes variant effects predictions for z-score normalization. No effect threshold was employed for the current release of the data.

The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository <https://github.com/ksenia007/ExPectoSC>`_.
The ExPectoSC framework is described in the following manuscript: Ksenia Sokolova, Chandra L. Theesfeld, Aaron K. Wong, Zijun Zhang, Kara Dolinski and Olga G. Troyanskaya, `Atlas of primary cell-type specific sequence models of gene expression and variant effects <https://www.cell.com/cell-reports-methods/fulltext/S2667-2375(23)00224-2>`_. Cell Reports Methods (2023).

The ExPectoSC framework is described in the following manuscript:

Ksenia Sokolova, Chandra L. Theesfeld, Aaron K. Wong, Zijun Zhang, Kara Dolinski and Olga G. Troyanskaya, Atlas of primary cell-type specific sequence models of gene expression and variant effects, Submitted, 2023
The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository <https://github.com/ksenia007/ExPectoSC>`_.

Website overview
----------------
Expand Down Expand Up @@ -66,5 +65,4 @@ Method Details
--------------
ExPectoSC is a modular framework, that uses regularized linear module upon deep convolutional network model of chromatin profifiling effects to predict cell type specific expression. The framework is capable of predicting expression levels directly from sequence and is sensitive to the sequence variations.

The chromatin predictions were computed using a DeepSEA "Beluga" model, using sliding window approach of 2000bp width with 200bp step, for the 40kb region surrounding the TSS. Exponential condense function is then used to reduce the dimensionality of the data before using it in the module 2.

The chromatin predictions were computed using a DeepSEA "Beluga" model, using sliding window approach of 2000bp width with 200bp step, for the 40kb region surrounding the TSS. Exponential condense function is then used to reduce the dimensionality of the data before using it in the regularized linear module.
Loading