diff --git a/.gitignore b/.gitignore new file mode 100644 index 000000000..feec99438 --- /dev/null +++ b/.gitignore @@ -0,0 +1,34 @@ +# Build artifacts +docs/_build/ +out/ + +# Virtual environments +venv/ +env/ +.venv/ +.env/ + +# IDE files +.idea/ +.vscode/ +*.swp +*.swo +*~ + +# Python cache +__pycache__/ +*.py[cod] +*$py.class +*.so + +# OS files +.DS_Store +Thumbs.db + +# Claude.ai files +CLAUDE.md +.claude/ + +# Sphinx +.doctrees/ +*.doctree \ No newline at end of file diff --git a/.readthedocs.yaml b/.readthedocs.yaml new file mode 100644 index 000000000..c78adffff --- /dev/null +++ b/.readthedocs.yaml @@ -0,0 +1,18 @@ +# .readthedocs.yaml +# Read the Docs configuration file +version: 2 + +# Set the version of Python and other tools +build: + os: ubuntu-22.04 + tools: + python: "3.10" + +# Build documentation in the docs/ directory with Sphinx +sphinx: + configuration: docs/conf.py # Path adjusted to docs/conf.py + +# Optionally declare the Python requirements required to build your docs +python: + install: + - requirements: docs/requirements.txt \ No newline at end of file diff --git a/README.rst b/README.rst index 2a281ddd0..f6bcbb184 100644 --- a/README.rst +++ b/README.rst @@ -2,4 +2,88 @@ HumanBase documentation ======================= -This repository contains the documentation source files for `HumanBase `_. These files are Sphinx docs written with reStructuredText. +This repository contains the documentation source files for `HumanBase `_. These files are Sphinx docs written with reStructuredText. + +Build Status +------------ + +Check the Read the Docs build status and documentation: https://app.readthedocs.org/projects/humanbase/ + +The live documentation is available at: https://humanbase.readthedocs.io/ + +**Preview Builds:** Read the Docs can build documentation for pull requests, but this needs to be triggered. To preview your changes before merging: + +1. Create a pull request from your branch to ``master`` +2. 
Go to https://app.readthedocs.org/projects/humanbase/ to trigger a build + +Quick Start +----------- + +Prerequisites +~~~~~~~~~~~~~ + +* Python 3.8 or higher +* pip (Python package installer) +* Make (for building documentation) + +Local Development Setup +~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + # Clone the repository + git clone https://github.com/aaronkw/humanbase-docs.git + cd humanbase-docs + + # Create and activate a virtual environment + python3 -m venv venv + source venv/bin/activate + + # Install dependencies + pip install -r docs/requirements.txt + +Building Documentation Locally +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + # Navigate to the docs directory + cd docs + + # Clean previous builds (optional) + make clean + + # Build HTML documentation + make html + + # View the documentation + open _build/html/index.html # On macOS + # Or: python -m http.server -d _build/html 8000 # Then visit http://localhost:8000 + +Documentation Structure +----------------------- + +The documentation is organized as follows: + +* ``docs/`` - Main documentation directory + + * ``index.rst`` - Main table of contents and entry point + * ``conf.py`` - Sphinx configuration file + * ``requirements.txt`` - Python dependencies for building docs + * ``img/`` - Images and diagrams used in documentation + * Tool-specific documentation: + + * ``sei.rst`` - Sei/DeepSEA sequence-based predictions + * ``beluga.rst`` - DeepSEA (Beluga) chromatin profile predictions + * ``expecto.rst`` - ExPecto expression predictions + * ``expectosc.rst`` - ExPectoSC variant effect predictions + * ``netwas.rst`` - NetWAS network-based association studies + * ``in-silico-mutagenesis.rst`` - In silico mutagenesis analysis + + * Network documentation: + + * ``functional-networks.rst`` - Functional gene networks + * ``tissue-networks.rst`` - Tissue-specific networks + +* ``.readthedocs.yaml`` - Read the Docs build configuration +* ``README.rst`` - This file diff --git 
a/docs/_includes/common-input-formats.rst b/docs/_includes/common-input-formats.rst new file mode 100644 index 000000000..c061558e2 --- /dev/null +++ b/docs/_includes/common-input-formats.rst @@ -0,0 +1,15 @@ +We support three types of input: VCF, FASTA, and BED. If you want to predict effects of noncoding variants, use VCF format input. If you want to predict chromatin feature probabilities for DNA sequences, use FASTA format. If you want to specify sequences from the human reference genome, you can use BED format. See below for a quick introduction: + +**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. + +**FASTA format** input should include sequences of |bp_length|\ bp length each. If a sequence is not exactly |bp_length|\ bp long: + +* **Note**: The prediction is for the center base of the input sequence +* **Longer sequences**: Only the center |bp_length|\ bp will be used +* **Shorter sequences**: Sequences shorter than |bp_length|\ bp will be padded with 'N' bases evenly on both sides + + - **Important**: We do not recommend using FASTA input smaller than |bp_length|\ bp unless it is very close (only a few bp off) + - **Note**: This padding behavior is not recommended. N's were extremely rare in training data (only appearing in assembly gaps), and the model has not been evaluated with artificially padded sequences + - **Strong recommendation**: Always provide sequences of exactly |bp_length|\ bp by including genomic flanking sequences + +**BED format** provides another way to specify sequences in the human reference genome. A minimal example is |bed_example|. The three columns are chromosome, start position, and end position. 
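Reviewer note (not part of the patch): the FASTA length handling described in ``common-input-formats.rst`` above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the server's implementation. The names ``fit_to_length`` and ``vcf_line`` are hypothetical, and how an odd amount of padding or cropping is split between the two sides is an assumption, since the docs only say "evenly on both sides".

```python
def fit_to_length(seq: str, target: int = 2000) -> str:
    """Return `seq` center-cropped or N-padded to exactly `target` bases.

    Hypothetical helper for illustration; not taken from the HumanBase code.
    """
    extra = len(seq) - target
    if extra >= 0:
        # Longer (or equal): keep only the center `target` bases.
        start = extra // 2
        return seq[start:start + target]
    # Shorter: pad with 'N' as evenly as possible on both sides.
    # Placing the odd base on the right is an assumption.
    pad = -extra
    left = pad // 2
    return "N" * left + seq + "N" * (pad - left)


def vcf_line(chrom: str, pos: int, name: str, ref: str, alt: str) -> str:
    """Join the five minimal VCF columns with tabs, as the input requires."""
    return "\t".join([chrom, str(pos), name, ref, alt])


if __name__ == "__main__":
    print(fit_to_length("AACCGGTT", target=4))   # center crop
    print(fit_to_length("ACGT", target=8))       # N padding
    print(vcf_line("chr1", 109817590, "-", "G", "T"))
```

Because the docs advise against relying on padding, a caller would more likely use such a check to reject or extend short sequences with genomic flanks than to pad them.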
diff --git a/docs/_includes/common-submission-info.rst b/docs/_includes/common-submission-info.rst new file mode 100644 index 000000000..4a06a9691 --- /dev/null +++ b/docs/_includes/common-submission-info.rst @@ -0,0 +1,3 @@ +Large submissions +~~~~~~~~~~~~~~~~~ +We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly. \ No newline at end of file diff --git a/docs/asdbrowser.rst b/docs/asdbrowser.rst new file mode 100644 index 000000000..90cda7d30 --- /dev/null +++ b/docs/asdbrowser.rst @@ -0,0 +1,23 @@ +============== +ASD Browser +============== + +Introduction +------------ + +The ASD browser allows exploration of predicted impact (chromatin effects and post-transcriptional RNA-binding protein impact) of de novo noncoding variants in autism probands in the `Simons Simplex Collection `_. + +The variant prediction approach is described in the following manuscript: Zhou J, Park CY, Theesfeld CL, Wong AK, Yuan Y, Scheckel C, Fak JJ, Funk J, Yao K, Tajima Y, Packer A, Darnell RB, Troyanskaya OG. (2019). `Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk `_. Nature Genetics. + +Description +----------- + +This website provides a user-friendly interactive interface for exploring the sequence-based predicted effects of SSC ASD proband mutations. Both individual molecular-level effects at chromatin (“DNA”) level and RNA-binding protein (“RNA”) level and Disease Impact Scores summarizing molecular level effects are shown. The methodology and analysis are described in the manuscript “Whole-genome deep learning analysis reveals causal role of noncoding mutations in autism”. + +.. 
image:: img/genome_browser.png + +The Genome browser can be navigated by entering a genomic interval, a gene name, or interactively through zooming in/out and scrolling. The tracks “DNA Disease Impact Score” and “RNA Disease Impact Score” show the mutation disease impact score (DIS) from the DNA and RNA models, respectively. DIS scores summarize molecular-level biochemical effects at the DNA and RNA level into two scores based on regularized logistic regression classifiers trained with HGMD mutations. + +.. image:: img/genome_heatmap.png + +Individual molecular-level biochemical effects are shown as a heatmap. The biochemical features are sorted by the magnitude of the predicted effects of the center mutation. Each mutation may be clicked to center the genome browser and the heatmap at that mutation, or the heatmap may be dragged to alter the center mutation. The user can select “DNA features” or “RNA features” from the dropdown menu. Mousing over any individual prediction in the heatmap will display details in a tooltip. \ No newline at end of file diff --git a/docs/beluga.rst b/docs/beluga.rst index bef232f3d..cf3e76aef 100644 --- a/docs/beluga.rst +++ b/docs/beluga.rst @@ -1,89 +1,31 @@ ======= -DeepSEA (Beluga) +Beluga (DeepSEA) ======= Introduction ------------ -DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. +DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. 
DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. Importantly, this framework is trained without using any variant data, allowing it to predict the chromatin impact of any variant, including rare or previously unseen ones. The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features. -The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features. Beluga is described in: +Beluga is described in: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_. Nature Genetics (2018). -Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, **Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk**. Nature Genetics (2018). -To determine if certain features (ie. transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table `_ which has all the profiles used to train DeepSEA. +DeepSEA is originally described in the following manuscript: Jian Zhou, Olga G. Troyanskaya. `Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model `_ Nature Methods (2015). -DeepSEA is originally described in the following manuscript: - -Jian Zhou, Olga G. Troyanskaya. **Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model.** Nature Methods (2015). - -To determine if certain features (ie. 
transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table `_ which has all the profiles used to train DeepSEA. +To determine whether certain features (i.e., transcription factors, marks, or cell types) are accounted for in the model, refer to the `supplemental feature table `_, which lists all the profiles used to train Beluga. Input ----- -DeepSEA predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be utilized for predicting chromatin features for any DNA sequence. - -File formats -~~~~~~~~~~~~ -We support three types of input: vcf, fasta, bed. If you want to predict effects of noncoding variants, use vcf format input. If you want to predict chromatin feature probabilities for DNA sequences, use fasta format. If you want to specify sequences from the human reference genome (GRCh37/hg19), you can use bed format. See below for a quick introduction: - -**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. - -**Fasta format** input should include sequences of 2000bp length each. If a sequence is longer than 2000bp, only the center 2000bp will be used. 
A minimal example is :: - - >TestSequence - TGGGATTACAGGCGTGAGCCACCGCGCCCGGCCCATTGTACCATTCTTAT - GCCTTTGCGTCCTCATAGCTTAGCTCCCGTATATCAGTGAGAACATACTA - TGTTTGGTTTTCCATACCCGAGTTACTTCACTTAGAATAATAGTCTCCAA - TTTCATCCAGGTCAGTGCAAATGCGTTAATTCGTTCCTTTTATGGCTGAG - TAGTATTCCATCATATATATATACTACAGTTTCTTTATCCACTCGTAAAT - TGATGGGCATTTGTGTTGGAACACTTCTCCACTGCTGGTGGGAATGTAAA - TTAGTGCAGCCACTATGGATAACAGTGTGGAGATTTGTTAAAGAACTAAA - ACTAGAACTACCATTTGATCCAGCAATCCCACTACTGGGTATCTACCCAG - AAGAAAAGAAGTCATTATTTGAAAAAGATACTTGCACGGGCATGTTTATA - GCAGCACAATTCACAATTGTAGTTGTATTTCTTTAAGCGTGTCTTTTCAA - TATCTCTCATGTTTCTGGTATAGATGGTATATATGTTAATCTTGTTCCTG - AGGTCTGTTTTTTATTTTTGTCATTAAAGTGGGAATTAAATAGTTTTGTA - GTGCATATAAATTAAAGAAAAAGTTCACATAAGCATATTTGCCAATCATC - TCAAAATGCTATATTCTCCTTCACGGTTTTGAAAATAATTCAGGGTTTTC - TCTTCCTCATTGCTTTCCCACCAACTGACAGTATTATTTTCTTAGTCATT - TTACTGACCTTTGAAATTACTCCTTTGAGGTCTTCTAAAAAATTTTATGG - GCTCTGCTGCTTTTTGGTGGCCTCCTTGTATCATTTATTCTATTACAGGA - CGACTTACAAAAGGAAGCACATAAATTGACCCATATACATATCCTATCAT - TGGGGAGTTTCTGTGCAAATGTTATTTATTGGAAGCTATTACTAAGAATT - GTAAGAAAAATAATTGGTATTGATGCAGCTAGTATGGTTCCTGTAATTAT - CGTACTCAGCCACGTAAATCATAGCTATATGTAGCCAAAGATCCATGAAC - AAAATTTCCAGTAACATCATTATAATTCAAAAGGCAGACTTTCAGAACCA - GACAGACTTGAATTTAAATTCTAGCTTTACCACACATGAATTTAACCTTG - TGGAAGGTTAACCTATCTAAACTCATGTTTCTTCATTGGTAGCTGATAAA - ATTAAGGATCATGTATATAACCACCTAGTAGAGTTGTTTAAGAAACTGTT - AGAATTCCATAAATTGTTAGTATTAATGAGTTTTTGTTGGACATGTGTTA - GGCTAGGCCACTCCTTGACCTTCATAGAGGTATGGATTATGACACAAATT - CTAAACTGTAGGTAGGCATGGCTTTGTAGCAAGTATTAAAATAGTAAATA - TTTTATTTTTATAAGATAAATGTAAACCTTTTAAAAGTTTCATTACATTT - GTATTTATGAAATATCATCCTATATCAACTATAGAGAGAAGATCGCAAGA - AGGCAGTGGCAGCAGAGGCTCCAGTTAGGAGGCTACTAGTCCAAATACAT - TGCGATAAAAACTTGGCAAAAGGTGCTGGTAGTCTGATGAAATAAAGTAG - ATAAATTTTAGAGGTATTTATAAAATAATTAAAGAATATTCAATAATAGG - AGATATATTACCCAATAGAGTGGAGATTCAAAGATAACTCCGAAAGTTTT - TTGCTAAAGCAACATTTGGCTGTGCTATCATTTACTAAGAAAGACAACAA - GAGAGTAAAATCAAGTTTGAGGATGAAGTGAATTTATTCCTTTTTGATTG - 
ATACATAATTGACATGTAATAAAACCCACAATGTTAAGAGTTCGGTTTGA - TGTGCTTGACTATTTTAGGCACTGGTGTTATCACAACACAAGACAACAGA - TAGGACATTCTCAGAAAATTTTTTCATGTCCCTTTCCAGTCAGTTTCAAG - CCTTCTTTCCATGCAATAATTTTCTCACTTTGCCATTCTAGTAGGTGTGA - -**Bed format** provides another way to specify sequences in human reference genome (hg19). The bed input should specify 2000bp-length regions. A minimal example is ``chr1 109817091 109819090``. The three columns are chromosome, start position, and end position. - -Genome coordinates -~~~~~~~~~~~~~~~~~~ -We support only ``GRCh37/hg19`` genome coordinates. You can use LiftOver to convert your coordinates to the correct version. - -Large submissions -~~~~~~~~~~~~~~~~~ -We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly. +Beluga predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). + +.. |bp_length| replace:: 2000 +.. |bed_example| replace:: ``chr5 134871851 134871852`` + +.. include:: _includes/common-input-formats.rst + +.. include:: _includes/common-submission-info.rst Output ------ @@ -97,7 +39,7 @@ Regulatory feature scores Variant scores ~~~~~~~~~~~~~~ -* **Disease Impact Score (DIS)**: DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted transcriptional or post-transcriptional regulatory effects of these mutations (See Zhou et. al, 2019). The predicted DIS probabilities are then converted into 'DIS e-values', computed based on the empirical distributions of predicted effects for gnomAD variants. 
The final DIS score is: +* **Disease Impact Score (DIS)**: DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted transcriptional or post-transcriptional regulatory effects of these mutations (see `Zhou et al., 2019 `_). The predicted DIS probabilities are then converted into 'DIS e-values', computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is: .. math:: -log10(DIS evalue_{feature}) @@ -107,10 +49,3 @@ Variant scores .. math:: \sum -log10(evalue_{feature}) / N -In-silico mutagenesis ---------------------- -Perform "In silico saturated mutagenesis" (ISM) analysis to discover informative sequence features within any sequence. Specifically, it performs computational mutation scanning to assess the effect of mutating every base of the input sequence on chromatin feature predictions. This method for context-specific sequence feature extraction takes advantage of DeepSEA’s ability to utilize flanking context sequences information. - -Note that ISM only accepts a sequence (FASTA file) as input. - -ISM outputs effects for each of three possible substitutions of all 2000 bases, across all chromatin features. diff --git a/docs/citations.rst b/docs/citations.rst index dadb09095..ea650e862 100644 --- a/docs/citations.rst +++ b/docs/citations.rst @@ -4,16 +4,39 @@ Citations Tissue-specific networks, NetWAS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Greene CS*, Krishnan A*, Wong AK*, Ricciotti E, Zelaya RA, Himmelstein DS, Zhang R, Hartmann BM, Zaslavsky E, Sealfon SC, Chasman DI, FitzGerald GA, Dolinski K, Grosser T, Troyanskaya OG. (2015). `Understanding multicellular function and disease with human tissue-specific networks `_. Nature Genetics. 10.1038/ng.3259w. 
+Greene CS*, Krishnan A*, Wong AK*, Ricciotti E, Zelaya RA, Himmelstein DS, Zhang R, Hartmann BM, Zaslavsky E, Sealfon SC, Chasman DI, FitzGerald GA, Dolinski K, Grosser T, Troyanskaya OG. (2015). `Understanding multicellular function and disease with human tissue-specific networks `_. Nature Genetics. 10.1038/ng.3259. -Variant effect predictions (ExPecto) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, and Troyanskaya OG. (2018) Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk, Nature Genetics. +Functional module detection +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG. (2016) `Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder `_. Nature Neuroscience. -Autism gene predictions +Sei +~~~~ +Chen, K. M., Wong, A. K., Troyanskaya, O. G., & Zhou, J. (2022), `A sequence-based global map of regulatory activity for deciphering human genetics `_. Nature Genetics. + +Beluga (DeepSEA) +~~~~~~~~~~~~~~~~ +Beluga model: Zhou, J., Theesfeld, C. L., Yao, K., Chen, K. M., Wong, A. K., & Troyanskaya, O. G. (2018), `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_. Nature Genetics. + +Original publication of the DeepSEA method: Zhou, J., & Troyanskaya, O. G. (2015) `Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model `_ Nature Methods. + +Seqweaver +~~~~~~~~~~ +Park, C. Y., Zhou, J., Wong, A. K., Chen, K. M., Theesfeld, C. L., Darnell, R. B., & Troyanskaya, O. G. (2021). `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk `_. Nature Genetics. + + +ExPecto +~~~~~~~~ +Zhou, J., Theesfeld, C. L., Yao, K., Chen, K. M., Wong, A. K., & Troyanskaya, O. G. 
(2018), `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_, Nature Genetics. + +ExPectoSC +~~~~~~~~~ +Sokolova, K., Theesfeld, C. L., Wong, A. K., Zhang, Z., Dolinski, K., & Troyanskaya, O. G. (2023), `Atlas of primary cell-type specific sequence models of gene expression and variant effects `_. Cell Reports Methods. + +Autism variant effect predictions ~~~~~~~~~~~~~~~~~~~~~~~ -Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG.(2016) Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature Neuroscience. +Zhou, J.*, Park, C. Y.*, Theesfeld, C. L.*, Wong, A. K., Yuan, Y., Scheckel, C., ... & Troyanskaya, O. G. (2019). `Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk `_. Nature Genetics, 51(6), 973-980. Tissue-expression predictions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -\W. Ju#, C.S. Greene#, F. Eichinger, V. Nair, J.B. Hodgin, M. Bitzer, Y. Lee, Q. Zhu, M. Kehata, M. Li, M.P. Rastaldi, C.D. Cohen, O.G. Troyanskaya*, and M. Kretzler*. Defining cell type specificity at the transcriptional level in human disease. Genome Research. 23:1862-1873. 2013 +Ju, W.*, Greene, C. S.*, Eichinger, F., Nair, V., Hodgin, J. B., Bitzer, M., ... Troyanskaya, O. G.* & Kretzler, M.* (2013). `Defining cell-type specificity at the transcriptional level in human disease `_. Genome Research. diff --git a/docs/conf.py b/docs/conf.py index a1d19adbe..b59cabeed 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -14,6 +14,7 @@ import sys import os +from datetime import datetime # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the @@ -44,7 +45,7 @@ # General information about the project. 
project = u'HumanBase' -copyright = u'2022, Flatiron Institute' +copyright = u'{}, Flatiron Institute'.format(datetime.now().year) # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the @@ -101,6 +102,10 @@ import sphinx_rtd_theme html_theme = 'sphinx_rtd_theme' +html_context = { + "READTHEDOCS_VERSION": "stable", +} + # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. @@ -114,8 +119,6 @@ # 'navigation_depth': 4, # Depth of the headers shown in the navigation bar } -# Add any paths that contain custom themes here, relative to this directory. -html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] # on_rtd is whether we are on readthedocs.org, this line of code grabbed from docs.readthedocs.org on_rtd = os.environ.get('READTHEDOCS', None) == 'True' @@ -176,7 +179,7 @@ #html_show_sourcelink = True # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. -#html_show_sphinx = True +html_show_sphinx = False # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. #html_show_copyright = True diff --git a/docs/deepsea.rst b/docs/deepsea.rst index 8c7c4464c..18c757c6a 100644 --- a/docs/deepsea.rst +++ b/docs/deepsea.rst @@ -1,99 +1,22 @@ -======= -Sei / DeepSEA -======= +================= +DeepSEA Analysis +================= Introduction ------------ -Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. 
Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. You can also find the Sei code repository `here `_ or read about our manuscript `here `_. +DeepSEA is a deep learning framework that predicts genomic variant effects with single nucleotide sensitivity on a wide range of regulatory features: transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types. Importantly, this framework is trained without using any variant data, allowing it to predict the chromatin impact of any variant, including rare or previously unseen ones. -Sequence class-level variant effects are computed by comparing the predictions for the reference and the alternative alleles. A positive score indicates an increase in sequence class activity by the alternative allele and vice versa. Sequence class-level scores are computed by projecting the 21,907 chromatin profile predictions for the sequence to the unit vector that represents each sequence class. -For older DeepSEA models see: -:doc:`beluga` (2019) +DeepSEA-based Methods +--------------------- +The following analysis tools and methods in HumanBase are built upon the DeepSEA framework: -Input ------ - -File formats -~~~~~~~~~~~~ -We support three types of input: vcf, fasta, bed. If you want to predict effects of noncoding variants, use vcf format input. If you want to predict chromatin feature probabilities for DNA sequences, use fasta format. If you want to specify sequences from the human reference genome (GRCh37/hg19), you can use bed format. See below for a quick introduction: - -**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy cover this text as input, you will need to change spaces to tabs). 
The five columns are chromosome, position, name, reference allele, and alternative allele. Currently, the genome position needs to be in GRCh37/hg19 - -**Fasta format** input should include sequences of 4096bp length each. If a sequence is longer than 4096bp, only the center 4096bp will be used. - -**Bed format** provides another way to specify sequences in human reference genome (hg19). The bed input should specify 4096bp-length regions. A minimal example is ``chr1 109817091 109821186``. The three columns are chromosome, start position, and end position. - -Genome coordinates -~~~~~~~~~~~~~~~~~~ -We support only ``GRCh37/hg19`` genome coordinates. You can use LiftOver to convert your coordinates to the correct version. - -Large submissions -~~~~~~~~~~~~~~~~~ -We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly. - - -Output ------- - -Sequence classes -~~~~~~~~~~~~~~~~~~~~~~~~~ - -The Sei framework predicts 40 sequence class scores, covering a wide range of regulatory activities such as cell-type-specific enhancers and promoters, as well as 21,907 chromatin profiles for any DNA sequence. - -To help interpretation, we grouped sequence classes into groups including P (Promoter), E (Enhancer), CTCF (CTCF-cohesin binding), TF (TF binding), PC (Polycomb-repressed), HET (Heterochromatin), TN (Transcription), and L (Low Signal) sequence classes. Please refer to our manuscript for a more detailed description of the sequence classes. - -Note: sequence class predictions are only available for vcf inputs. 
- -:: - - | Sequence class label | Sequence class name | Rank by size | Group | - |---------------------:|----------------------------------:|-------------:|------:| - | PC1 | Polycomb / Heterochromatin | 0 | PC | - | L1 | Low signal | 1 | L | - | TN1 | Transcription | 2 | TN | - | TN2 | Transcription | 3 | TN | - | L2 | Low signal | 4 | L | - | E1 | Stem cell | 5 | E | - | E2 | Multi-tissue | 6 | E | - | E3 | Brain / Melanocyte | 7 | E | - | L3 | Low signal | 8 | L | - | E4 | Multi-tissue | 9 | E | - | TF1 | NANOG / FOXA1 | 10 | TF | - | HET1 | Heterochromatin | 11 | HET | - | E5 | B-cell-like | 12 | E | - | E6 | Weak epithelial | 13 | E | - | TF2 | CEBPB | 14 | TF | - | PC2 | Weak Polycomb | 15 | PC | - | E7 | Monocyte / Macrophage | 16 | E | - | E8 | Weak multi-tissue | 17 | E | - | L4 | Low signal | 18 | L | - | TF3 | FOXA1 / AR / ESR1 | 19 | TF | - | PC3 | Polycomb | 20 | PC | - | TN3 | Transcription | 21 | TN | - | L5 | Low signal | 22 | L | - | HET2 | Heterochromatin | 23 | HET | - | L6 | Low signal | 24 | L | - | P | Promoter | 25 | P | - | E9 | Liver / Intestine | 26 | E | - | CTCF | CTCF-Cohesin | 27 | CTCF | - | TN4 | Transcription | 28 | TN | - | HET3 | Heterochromatin | 29 | HET | - | E10 | Brain | 30 | E | - | TF4 | OTX2 | 31 | TF | - | HET4 | Heterochromatin | 32 | HET | - | L7 | Low signal | 33 | L | - | PC4 | Polycomb / Bivalent stem cell Enh | 34 | PC | - | HET5 | Centromere | 35 | HET | - | E11 | T-cell | 36 | E | - | TF5 | AR | 37 | TF | - | E12 | Erythroblast-like | 38 | E | - | HET6 | Centromere | 39 | HET | - - - -Regulatory feature scores -~~~~~~~~~~~~~~~~~~~~~~~~~ -* **diffs**: The difference between the the predicted probability of the reference allele and the alternative allele for a regulatory feature (:math:`p_{alt} -p_{ref}`). 
+* :doc:`sei` - Extended regulatory and chromatin effects of variants (2021) +* :doc:`beluga` - Chromatin effects of variants (2019) +* :doc:`seqweaver` - Post-transcriptional variant effects +* :doc:`expecto` - Tissue-specific gene expression effects of variants +* :doc:`expectosc` - Cell-type-specific gene expression effects for mutations +* :doc:`in-silico-mutagenesis` - Discover regulatory features of sequences +* :doc:`asdbrowser` - Predicted effects of ASD proband mutations \ No newline at end of file diff --git a/docs/expecto.rst b/docs/expecto.rst index d107273a6..cd6a379a8 100644 --- a/docs/expecto.rst +++ b/docs/expecto.rst @@ -4,23 +4,30 @@ ExPecto Introduction ------------ -ExPecto is a framework for ab initio sequence-based prediction of mutation gene expression effects and disease risks. With this web interface, we provide an explorer of tissue-specific expression effect predictions. The current release contains all single nucleotide substitutions within 1kb to the representative TSS of a gene and all 1000 Genomes variants that passed a minimum predicted effect threshold (>0.3 log fold-change in any tissue). +ExPecto is a framework for *ab initio* sequence-based prediction of mutation gene expression effects and disease risks. Importantly, this framework is trained without using any variant data, allowing it to predict the expression effects of any variant, including rare or previously unseen ones. With this web interface, we provide an explorer of tissue-specific expression effect predictions. + +The ExPecto framework is described in the following manuscript: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_, Nature Genetics (2018). The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository `_. 
-The ExPecto framework is described in the following manuscript: -Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk, Nature Genetics, 2018 +Output +------ +**Regulatory feature scores:** +The z-score, e-value, and probability diffs are computed as for the `DeepSEA (Beluga) model `_. + +**ExPecto expression effect:** +The ExPecto expression effect is the difference between the predicted expression levels for the reference and alternative alleles (see the `ExPecto paper (2018) `_). Download -------- Predicted expression effects ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -This is the bulk download `link `_ of all mutation predictions. +This is the bulk download `link `_ of 1000 Genomes variants that passed a minimum predicted effect threshold (>0.3 log fold-change in any tissue). Variation potential directionality scores ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Variation potential of a gene in a tissue or cell-type can reflect the evolutionary constraint on its expression level. Specifically, we compute the variation potential directionality score as the sum of all directional mutation effects within 1kb to TSS. A negative variation potential indicates active expression and constraint toward higher expression level, and vice versa. The sum of absolute mutation effects, or the magnitudes, is predictive of tissue/condition-specificity of a gene. The variation potential directionality scores and the inferred evolution constraint probabilities can be downloaded here. +Variation potential of a gene in a tissue or cell-type can reflect the evolutionary constraint on its expression level. Specifically, we compute the variation potential directionality score as the sum of all directional mutation effects within 1kb to TSS. 
A negative variation potential indicates active expression and constraint toward higher expression level, and vice versa. 
The sum of absolute mutation effects, or the magnitudes, is predictive of tissue/condition-specificity of a gene. The variation potential directionality scores and the inferred evolution constraint probabilities can be downloaded `here `_. The full prediction of all 140 million mutations can be downloaded `here `_ (~125G). @@ -30,4 +37,4 @@ ExPecto uses exponential basis function-based linear models upon deep convolutio For detailed procedures of the prediction, the chromatin predictions were computed from DeepSEA "Beluga" per 200bp bin, and 200 bins centered at TSS (40kb region) were used as input to predict expression effects. To reduce the dimensionality for ExPecto model training, the predicted chromatin spatial patterns were summarized to spatial features by 10 exponential basis functions. The summarized spatial features and gene expression levels were used to train regularized linear models for the final step of the prediction. The representative TSSes are selected based on FANTOM CAGE data. -We also propose a path toward ab initio disease risk prediction through combining the prediction of expression effects and the estimation of evolution constraints on expression levels. For example, mutations predicted to have strong negative expression effects on a positively constrained gene are predicted to be deleterious. We estimate evolutionary constraints through systematic profiling of potential mutation effects through in silico mutagenesis. As proof-of-principle we showed that this approach can predict the disease alleles from both curated HGMD disease mutation data and disease GWASes. +We also propose a path toward *ab initio* disease risk prediction through combining the prediction of expression effects and the estimation of evolution constraints on expression levels. For example, mutations predicted to have strong negative expression effects on a positively constrained gene are predicted to be deleterious. 
We estimate evolutionary constraints through systematic profiling of potential mutation effects through in silico mutagenesis. As proof-of-principle we showed that this approach can predict the disease alleles from both curated HGMD disease mutation data and disease GWASes. diff --git a/docs/clever.rst b/docs/expectosc.rst similarity index 66% rename from docs/clever.rst rename to docs/expectosc.rst index a368bfc53..08de44b25 100644 --- a/docs/clever.rst +++ b/docs/expectosc.rst @@ -1,20 +1,20 @@ -======= +========= ExPectoSC -======= +========= Introduction ------------ -ExPectoSC is a framework for `ab initio` sequence-based prediction of mutation gene expression effects for primary human cell types. With this web interface, we provide an explorer of cell type-specific expression effect predictions. The current release contains all ClinVar variants within +/- 20kb of the representative TSS of a gene. We use 1000 Genomes variant effects predictions for z-score normalization. No effect threshold was employed for the current release of the data. -The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository `_. +ExPectoSC is a framework for *ab initio* sequence-based prediction of mutation gene expression effects for primary human cell types. With this web interface, we provide an explorer of cell type-specific expression effect predictions. The current release contains all ClinVar variants within +/- 20kb of the representative TSS of a gene. We use 1000 Genomes variant effects predictions for z-score normalization. No effect threshold was employed for the current release of the data. Importantly, this framework is trained without using any variant data, allowing it to predict the expression effects of any variant, including rare or previously unseen ones. + +The ExPectoSC framework is described in the following manuscript: Ksenia Sokolova, Chandra L. Theesfeld, Aaron K. 
Wong, Zijun Zhang, Kara Dolinski and Olga G. Troyanskaya, `Atlas of primary cell-type specific sequence models of gene expression and variant effects `_. Cell Reports Methods (2023). -The ExPectoSC framework is described in the following manuscript: -Ksenia Sokolova, Chandra L. Theesfeld, Aaron K. Wong, Zijun Zhang, Kara Dolinski and Olga G. Troyanskaya, Atlas of primary cell-type specific sequence models of gene expression and variant effects, Submitted, 2023 +The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository `_. Website overview ------------- +---------------- After the user enters a gene name, the Primary View is returned showing the predictions for the pre-computed variants for the region (includes 1000G and ClinVar variants). The variants are oriented so that the lowest chromosomal coordinate for the gene region is on the left side of the screen. The heatmap colors represent the max effect cell type prediction within the organ system. Rows are grouped by organ system, and columns are variant locations: .. image:: img/expectosc_img1.png @@ -48,6 +48,10 @@ Drop-down menu in the upper left corner allows users to select multiple organ ce +Output +------ +To analyze the effect of a variant, we obtain predictions for the reference and alternative sequences and take their difference. To compare predictions across cell types, we normalized the predictions for each variant set to those of the 1000 Genomes variants, using z-scores computed per cell type. As a rough guideline, z-scores above ~3-5 represent more reliable predictions. See the `ExPectoSC paper (2023) `_. + Download -------- `ClinVar scaled non-coding predictions `_ @@ -59,7 +63,6 @@ Download Method Details -------------- -ExPectoSC is a modular framework, that uses regularized linear module upon deep convolutional network model of chromatin profifiling effects to predict cell type specific expression.
The framework is capable of predicting expression levels directly from sequence and is sensitive to the sequence variations. - -The chromatin predictions were computed using a DeepSEA "Beluga" model, using sliding window approach of 2000bp width with 200bp step, for the 40kb region surrounding the TSS. Exponential condense function is then used to reduce the dimensionality of the data before using it in the module 2. To analyze effect of the variants we get predictions for the reference and alternative sequences and compare the difference. +ExPectoSC is a modular framework that uses a regularized linear module on top of a deep convolutional network model of chromatin profiling effects to predict cell type-specific expression. The framework predicts expression levels directly from sequence and is sensitive to sequence variation. +The chromatin predictions were computed with a DeepSEA "Beluga" model, using a sliding window of 2000bp width and 200bp step over the 40kb region surrounding the TSS. An exponential condense function is then used to reduce the dimensionality of the data before it is passed to the regularized linear module. diff --git a/docs/functional-networks.rst b/docs/functional-networks.rst index 0938a19bb..4545518e4 100644 --- a/docs/functional-networks.rst +++ b/docs/functional-networks.rst @@ -1,18 +1,22 @@ -Functional Networks +Tissue-specific Networks =========================== -In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, probabilistically weighted and integrated, focused on a particular biological question.
These questions can include, for example, the function of a gene, the relationship between two pathways, or the processes disrupted in a genetic disorder. (Huttenhower, et. al 2008) +In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, probabilistically weighted and integrated, focused on a particular tissue or process context. + +It is important to consider gene relationships within a tissue context as the precise actions of genes are frequently dependent on their tissue context, and human diseases result from the disordered interplay of tissue- and cell lineage–specific processes. These factors combine to make the understanding of tissue-specific gene functions, disease pathophysiology and gene-disease associations particularly challenging. + +Tissue-specific network construction is described in the following publication: Greene, C. S., Krishnan, A., Wong, A. K., Ricciotti, E., Zelaya, R. A., Himmelstein, D. S., ... & Troyanskaya, O. G. (2015). `Understanding multicellular function and disease with human tissue-specific networks `_. Nature Genetics. Method --------------------------- Briefly, functional integration relies on the construction of process-specific functional relationship networks. These are interaction networks in which each node represents a gene, each edge a functional relationship, and an edge between two genes is probabilistically weighted based on experimental evidence relating to those genes. We integrate evidence from many data sets, with each data set weighted in a process-specific manner. -One naïve Bayesian classifier is trained per biological area of interest (e.g. 
a tissue, or a specific biological process), using the appropriate gold standard for the biological context in addition to one global process-unaware classifier trained using the complete gold standard. Each classifier f consisted of a class node predicting the binary presence or absence of a functional relationship (FR) between two genes and n nodes conditioned on FR, each representing the value of a data set Dk. +One naïve Bayesian classifier is trained per biological area of interest (e.g. a tissue, or a specific biological process), using the appropriate gold standard for the biological context in addition to one global process-unaware classifier trained using the complete gold standard. Each classifier consisted of a class node predicting the binary presence or absence of a functional relationship (FR) between two genes and n nodes conditioned on FR, each representing the value of a data set. -Parameter regularization is performed as described in Steck and Jaakkola (2002) using mutual information between data sets to estimate a strength of prior belief for each data set. While a large amount of shared information does not guarantee a redundant data set, since the same subset of information could be shared many times, it provides a valuable quantitative estimate of data set uniqueness. +Parameter regularization is performed as described in `Steck and Jaakkola (2002) `_ using mutual information between data sets to estimate a strength of prior belief for each data set. While a large amount of shared information does not guarantee a redundant data set, since the same subset of information could be shared many times, it provides a valuable quantitative estimate of data set uniqueness. -Genomics data types +Data integration --------------------------- -We collected and integrated 987 genome-scale data sets encompassing approximately 38,000 conditions from an estimated 14,000 publications including both expression and interaction measurements. 
+We collected and integrated 987 genome-scale data sets encompassing approximately 38,000 conditions from an estimated 14,000 publications including both expression and interaction measurements. To integrate these data, we automatically assess each data set for its relevance to each of 144 tissue- and cell lineage–specific functional contexts. The resulting functional maps provide a detailed portrait of protein function and interactions in specific human tissues and cell lineages ranging from B lymphocytes to the renal glomerulus and the whole brain. This approach allows us to profile the specialized function of genes in a high-throughput manner, even in tissues and cell lineages for which no or few tissue-specific data exist. * Gene co-expression: All gene expression data sets are from NCBI's Gene Expression Omnibus (GEO). Genes with more than 30% of values missing were removed, and remaining missing values were imputed using ten nearest neighbors. Non-log-transformed data sets were log transformed. Expression measurements were summarized to Entrez identifiers, and duplicate identifiers were merged. The Pearson correlation was calculated for each gene pair, normalized with Fisher's z transform, mean subtracted and divided by the standard deviation. @@ -20,7 +24,7 @@ We collected and integrated 987 genome-scale data sets encompassing approximatel * TF regulation: To estimate shared transcription factor regulation between genes, we collected binding motifs from JASPAR. Genes were scored for the presence of transcription factor binding sites using the MEME software suite. Motif matches were treated as binary scores (present if P < 0.001). The final score for each gene pair was obtained by calculating the Pearson correlation between the motif association vectors for the genes. -* MSigDB purturbations and miRNA: Chemical and genetic perturbation (c2:CGP) and microRNA target (c3:MIR) profiles were downloaded from the Molecular Signatures Database (MSigDB). 
Each gene pair's score was the sum of shared profiles weighted by the specificity of each profile +* MSigDB perturbations and miRNA: Chemical and genetic perturbation (c2:CGP) and microRNA target (c3:MIR) profiles were downloaded from the Molecular Signatures Database (MSigDB). Each gene pair's score was the sum of shared profiles weighted by the specificity of each profile. Evidence @@ -32,3 +36,12 @@ Contribution of dataset D to an edge functional relationship prediction (FR):: contribution(D) = P(FR | D) - P(FR) Note that the contributions will not sum to 1.0, as each contribution is measured separately. Generally, individual gene expression datasets will not contribute much to the posterior probability but cumulatively can make a significant contribution. + +Example +--------------------------- + +IL1B in blood vessel +~~~~~~~~~~~~~~~~~~~~~~~~~ +We examined and experimentally verified the tissue-specific molecular response of blood vessel cells to stimulation by IL-1β (IL1B), a pro-inflammatory cytokine. We anticipated that the genes most tightly connected to IL1B in the blood vessel network would be among those responding to IL-1β stimulation in blood vessel cells. We tested this hypothesis by profiling the gene expression of human aortic smooth muscle cells (HASMCs; the predominant cell type in blood vessels) stimulated with IL-1β. + +Examination of the genes whose expression was significantly upregulated at 2 h after stimulation showed that 18 of the 20 IL1B network neighbors were among the top 500 most upregulated genes in the experiment (P = 2.07 × 10\ :sup:`−23`). The blood vessel network was the most accurate tissue network in predicting this experimental outcome; none of the other 143 tissue-specific networks or the tissue-naive network performed as well when evaluated by each network's ability to predict the result of IL-1β stimulation on the cells.
diff --git a/docs/img/genome_browser.png b/docs/img/genome_browser.png new file mode 100644 index 000000000..db013ea1d Binary files /dev/null and b/docs/img/genome_browser.png differ diff --git a/docs/img/genome_heatmap.png b/docs/img/genome_heatmap.png new file mode 100644 index 000000000..f37f6c5ec Binary files /dev/null and b/docs/img/genome_heatmap.png differ diff --git a/docs/img/use-cases/asd-browser-1.png b/docs/img/use-cases/asd-browser-1.png new file mode 100644 index 000000000..1d9c26c27 Binary files /dev/null and b/docs/img/use-cases/asd-browser-1.png differ diff --git a/docs/img/use-cases/asd-browser-2.png b/docs/img/use-cases/asd-browser-2.png new file mode 100644 index 000000000..dda7c0d41 Binary files /dev/null and b/docs/img/use-cases/asd-browser-2.png differ diff --git a/docs/img/use-cases/beluga-1.png b/docs/img/use-cases/beluga-1.png new file mode 100644 index 000000000..f4c7094ef Binary files /dev/null and b/docs/img/use-cases/beluga-1.png differ diff --git a/docs/img/use-cases/beluga-2.png b/docs/img/use-cases/beluga-2.png new file mode 100644 index 000000000..7cd2a3725 Binary files /dev/null and b/docs/img/use-cases/beluga-2.png differ diff --git a/docs/img/use-cases/beluga-3.png b/docs/img/use-cases/beluga-3.png new file mode 100644 index 000000000..ed13ba9e6 Binary files /dev/null and b/docs/img/use-cases/beluga-3.png differ diff --git a/docs/img/use-cases/beluga-4.png b/docs/img/use-cases/beluga-4.png new file mode 100644 index 000000000..4f728323d Binary files /dev/null and b/docs/img/use-cases/beluga-4.png differ diff --git a/docs/img/use-cases/beluga-5.png b/docs/img/use-cases/beluga-5.png new file mode 100644 index 000000000..861eaa4f4 Binary files /dev/null and b/docs/img/use-cases/beluga-5.png differ diff --git a/docs/img/use-cases/comparing-networks-1.png b/docs/img/use-cases/comparing-networks-1.png new file mode 100644 index 000000000..4b2453bf5 Binary files /dev/null and b/docs/img/use-cases/comparing-networks-1.png differ diff 
--git a/docs/img/use-cases/comparing-networks-2.png b/docs/img/use-cases/comparing-networks-2.png new file mode 100644 index 000000000..dc2a2bfc0 Binary files /dev/null and b/docs/img/use-cases/comparing-networks-2.png differ diff --git a/docs/img/use-cases/comparing-networks-3.png b/docs/img/use-cases/comparing-networks-3.png new file mode 100644 index 000000000..67e3c6b1e Binary files /dev/null and b/docs/img/use-cases/comparing-networks-3.png differ diff --git a/docs/img/use-cases/comparing-networks-4.png b/docs/img/use-cases/comparing-networks-4.png new file mode 100644 index 000000000..990c2d1e6 Binary files /dev/null and b/docs/img/use-cases/comparing-networks-4.png differ diff --git a/docs/img/use-cases/ctcf-disruption-1.png b/docs/img/use-cases/ctcf-disruption-1.png new file mode 100644 index 000000000..181a01ebf Binary files /dev/null and b/docs/img/use-cases/ctcf-disruption-1.png differ diff --git a/docs/img/use-cases/ctcf-disruption-2.png b/docs/img/use-cases/ctcf-disruption-2.png new file mode 100644 index 000000000..aceb6d0c7 Binary files /dev/null and b/docs/img/use-cases/ctcf-disruption-2.png differ diff --git a/docs/img/use-cases/ctcf-disruption-3.png b/docs/img/use-cases/ctcf-disruption-3.png new file mode 100644 index 000000000..ab1e19e3b Binary files /dev/null and b/docs/img/use-cases/ctcf-disruption-3.png differ diff --git a/docs/img/use-cases/ctcf-disruption-4.png b/docs/img/use-cases/ctcf-disruption-4.png new file mode 100644 index 000000000..398736154 Binary files /dev/null and b/docs/img/use-cases/ctcf-disruption-4.png differ diff --git a/docs/img/use-cases/ctcf-disruption-5.png b/docs/img/use-cases/ctcf-disruption-5.png new file mode 100644 index 000000000..a53ce224a Binary files /dev/null and b/docs/img/use-cases/ctcf-disruption-5.png differ diff --git a/docs/img/use-cases/ctcf-disruption-6.png b/docs/img/use-cases/ctcf-disruption-6.png new file mode 100644 index 000000000..9c73375d9 Binary files /dev/null and 
b/docs/img/use-cases/ctcf-disruption-6.png differ diff --git a/docs/img/use-cases/ctcf-disruption-7.png b/docs/img/use-cases/ctcf-disruption-7.png new file mode 100644 index 000000000..643168dad Binary files /dev/null and b/docs/img/use-cases/ctcf-disruption-7.png differ diff --git a/docs/img/use-cases/ctcf-disruption-8.png b/docs/img/use-cases/ctcf-disruption-8.png new file mode 100644 index 000000000..6477b34c4 Binary files /dev/null and b/docs/img/use-cases/ctcf-disruption-8.png differ diff --git a/docs/img/use-cases/expecto-1.png b/docs/img/use-cases/expecto-1.png new file mode 100644 index 000000000..812bf1f34 Binary files /dev/null and b/docs/img/use-cases/expecto-1.png differ diff --git a/docs/img/use-cases/expecto-2.png b/docs/img/use-cases/expecto-2.png new file mode 100644 index 000000000..25872e0a5 Binary files /dev/null and b/docs/img/use-cases/expecto-2.png differ diff --git a/docs/img/use-cases/expecto-3.png b/docs/img/use-cases/expecto-3.png new file mode 100644 index 000000000..2e374cc25 Binary files /dev/null and b/docs/img/use-cases/expecto-3.png differ diff --git a/docs/img/use-cases/expecto-4.png b/docs/img/use-cases/expecto-4.png new file mode 100644 index 000000000..2b720fdc1 Binary files /dev/null and b/docs/img/use-cases/expecto-4.png differ diff --git a/docs/img/use-cases/expecto-5.png b/docs/img/use-cases/expecto-5.png new file mode 100644 index 000000000..05bce27f0 Binary files /dev/null and b/docs/img/use-cases/expecto-5.png differ diff --git a/docs/img/use-cases/expecto-6.png b/docs/img/use-cases/expecto-6.png new file mode 100644 index 000000000..d79a9b530 Binary files /dev/null and b/docs/img/use-cases/expecto-6.png differ diff --git a/docs/img/use-cases/expectosc-1.png b/docs/img/use-cases/expectosc-1.png new file mode 100644 index 000000000..27b5c3756 Binary files /dev/null and b/docs/img/use-cases/expectosc-1.png differ diff --git a/docs/img/use-cases/expectosc-2.png b/docs/img/use-cases/expectosc-2.png new file mode 100644 index 
000000000..9fc1c6ada Binary files /dev/null and b/docs/img/use-cases/expectosc-2.png differ diff --git a/docs/img/use-cases/expectosc-3.png b/docs/img/use-cases/expectosc-3.png new file mode 100644 index 000000000..390eac27d Binary files /dev/null and b/docs/img/use-cases/expectosc-3.png differ diff --git a/docs/img/use-cases/extracting-fasta-1.png b/docs/img/use-cases/extracting-fasta-1.png new file mode 100644 index 000000000..cf081079f Binary files /dev/null and b/docs/img/use-cases/extracting-fasta-1.png differ diff --git a/docs/img/use-cases/functional-enrichments-1.png b/docs/img/use-cases/functional-enrichments-1.png new file mode 100644 index 000000000..e087eedd7 Binary files /dev/null and b/docs/img/use-cases/functional-enrichments-1.png differ diff --git a/docs/img/use-cases/functional-enrichments-2.png b/docs/img/use-cases/functional-enrichments-2.png new file mode 100644 index 000000000..13f366a63 Binary files /dev/null and b/docs/img/use-cases/functional-enrichments-2.png differ diff --git a/docs/img/use-cases/functional-enrichments-3.png b/docs/img/use-cases/functional-enrichments-3.png new file mode 100644 index 000000000..d60ce73dd Binary files /dev/null and b/docs/img/use-cases/functional-enrichments-3.png differ diff --git a/docs/img/use-cases/functional-enrichments-4.png b/docs/img/use-cases/functional-enrichments-4.png new file mode 100644 index 000000000..e521c5aea Binary files /dev/null and b/docs/img/use-cases/functional-enrichments-4.png differ diff --git a/docs/img/use-cases/functional-module-1.png b/docs/img/use-cases/functional-module-1.png new file mode 100644 index 000000000..b64131ceb Binary files /dev/null and b/docs/img/use-cases/functional-module-1.png differ diff --git a/docs/img/use-cases/functional-module-2.png b/docs/img/use-cases/functional-module-2.png new file mode 100644 index 000000000..0c2335f98 Binary files /dev/null and b/docs/img/use-cases/functional-module-2.png differ diff --git 
a/docs/img/use-cases/in-silico-mutagenesis-1.png b/docs/img/use-cases/in-silico-mutagenesis-1.png new file mode 100644 index 000000000..1f7a61fe5 Binary files /dev/null and b/docs/img/use-cases/in-silico-mutagenesis-1.png differ diff --git a/docs/img/use-cases/in-silico-mutagenesis-2.png b/docs/img/use-cases/in-silico-mutagenesis-2.png new file mode 100644 index 000000000..9e2651a32 Binary files /dev/null and b/docs/img/use-cases/in-silico-mutagenesis-2.png differ diff --git a/docs/img/use-cases/netwas-1.png b/docs/img/use-cases/netwas-1.png new file mode 100644 index 000000000..b97064608 Binary files /dev/null and b/docs/img/use-cases/netwas-1.png differ diff --git a/docs/img/use-cases/netwas-2.png b/docs/img/use-cases/netwas-2.png new file mode 100644 index 000000000..fd4303535 Binary files /dev/null and b/docs/img/use-cases/netwas-2.png differ diff --git a/docs/img/use-cases/networks-1.png b/docs/img/use-cases/networks-1.png new file mode 100644 index 000000000..1dc4dcb07 Binary files /dev/null and b/docs/img/use-cases/networks-1.png differ diff --git a/docs/img/use-cases/networks-2.png b/docs/img/use-cases/networks-2.png new file mode 100644 index 000000000..a369df210 Binary files /dev/null and b/docs/img/use-cases/networks-2.png differ diff --git a/docs/img/use-cases/networks-3.png b/docs/img/use-cases/networks-3.png new file mode 100644 index 000000000..326710124 Binary files /dev/null and b/docs/img/use-cases/networks-3.png differ diff --git a/docs/img/use-cases/networks-4.png b/docs/img/use-cases/networks-4.png new file mode 100644 index 000000000..4c18a46a4 Binary files /dev/null and b/docs/img/use-cases/networks-4.png differ diff --git a/docs/img/use-cases/sei-1.png b/docs/img/use-cases/sei-1.png new file mode 100644 index 000000000..d9a69ee88 Binary files /dev/null and b/docs/img/use-cases/sei-1.png differ diff --git a/docs/img/use-cases/sei-2.png b/docs/img/use-cases/sei-2.png new file mode 100644 index 000000000..959e9f7b0 Binary files /dev/null and 
b/docs/img/use-cases/sei-2.png differ diff --git a/docs/img/use-cases/sei-3.png b/docs/img/use-cases/sei-3.png new file mode 100644 index 000000000..339cfe0c3 Binary files /dev/null and b/docs/img/use-cases/sei-3.png differ diff --git a/docs/img/use-cases/sei-4.png b/docs/img/use-cases/sei-4.png new file mode 100644 index 000000000..4deac8e51 Binary files /dev/null and b/docs/img/use-cases/sei-4.png differ diff --git a/docs/img/use-cases/seqweaver-1.png b/docs/img/use-cases/seqweaver-1.png new file mode 100644 index 000000000..f91d0c8c2 Binary files /dev/null and b/docs/img/use-cases/seqweaver-1.png differ diff --git a/docs/img/use-cases/seqweaver-2.png b/docs/img/use-cases/seqweaver-2.png new file mode 100644 index 000000000..bb9614345 Binary files /dev/null and b/docs/img/use-cases/seqweaver-2.png differ diff --git a/docs/img/use-cases/seqweaver-3.png b/docs/img/use-cases/seqweaver-3.png new file mode 100644 index 000000000..119608978 Binary files /dev/null and b/docs/img/use-cases/seqweaver-3.png differ diff --git a/docs/img/use-cases/seqweaver-4.png b/docs/img/use-cases/seqweaver-4.png new file mode 100644 index 000000000..909b819b8 Binary files /dev/null and b/docs/img/use-cases/seqweaver-4.png differ diff --git a/docs/in-silico-mutagenesis.rst b/docs/in-silico-mutagenesis.rst new file mode 100644 index 000000000..7ef34c9c3 --- /dev/null +++ b/docs/in-silico-mutagenesis.rst @@ -0,0 +1,25 @@ +===================== +In Silico Mutagenesis +===================== + +Introduction +------------ + +Perform "In silico saturated mutagenesis" (ISM) analysis to discover informative sequence features within any sequence. Specifically, it performs computational mutation scanning to assess the effect of mutating every base of the input sequence on chromatin feature predictions. This method for context-specific sequence feature extraction takes advantage of DeepSEA's ability to utilize flanking context sequences information. 
+ +Note that ISM only accepts a sequence (FASTA file) as input. The input FASTA file should be 2000 base pairs long. + +The chromatin impact prediction is performed using the :doc:`beluga` model. See the following manuscript: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk `_. Nature Genetics (2018). + + +ISM outputs effects for each of the three possible substitutions at all 2000 bases, across all chromatin features. + +Output +------ + +The effect of a base substitution on a specific chromatin feature prediction is measured as the log2 fold change of odds, where P\ :sub:`0` represents the probability predicted for the original sequence and P\ :sub:`1` represents the probability predicted for the mutated sequence: + +.. math:: + \log_2 \left(\frac{P_0}{1 - P_0}\right) - \log_2 \left(\frac{P_1}{1 - P_1}\right) + +(See the `DeepSEA paper (2015) `_.) \ No newline at end of file diff --git a/docs/index.rst b/docs/index.rst index dc2cd6884..da19a1e39 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -7,45 +7,39 @@ HumanBase User Guide ===================== +HumanBase is a resource for biological researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues and human disease. ---------------------- -About ---------------------- - -HumanBase is a “one stop shop” for biological and biomedical researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues, development, and human disease.
- -Data-driven integrative analyses are especially powerful because they reach beyond “known biological knowledge” represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research. Thus, carefully designed algorithms can drive the development of experimentally testable hypotheses, enabling deeper understanding of basic biology at the molecular level, pathophysiology, and paving the way to therapy and drug development. +This resource is not merely a public database of primary genomics data or biological literature. The data-driven integrative analyses (i.e. algorithms that “learn” from large genomic data collections) presented in HumanBase are especially powerful because they separate signal from noise in large biological data collections to reach beyond existing biological knowledge represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research. Carefully designed algorithms can drive the development of experimentally testable hypotheses. Thus, HumanBase is a resource for biomedical researchers to incorporate into their research workflows, which they can use to interpret their experimental results and generate hypotheses for experimental follow-up. --------------------- -Example use case +Who are we? --------------------- -A researcher who studies the role of the immune system and inflammation in chronic kidney disease wants to identify candidate genes for these disorders. Unfortunately, as with most specific disease contexts outside of cancer, few datasets are available for these diseases, none are focused on the role of inflammation or the immune system, and no dataset is specific to her cell-lineage of interest. 
Even identifying which genes are expressed in the cell type relevant to glomerular disease (podocytes) is currently impossible as this cell lineage cannot be isolated for high-throughput experiments in human. -Using HumanBase, she will be able to examine data-driven predictions of genes expressed in the podocyte cells and analyze predicted functional and mechanistic networks specific to the kidney glomerulus. She could also provide the system with a list of relevant GWAS or family-based study results and the system will reprioritize these results based on the relevant functional maps. She will be able to iteratively refine this analysis by limiting the data used in the integration only to kidney datasets or by integrating her own data in the analysis. +HumanBase is actively developed by the `Genomics group `_ at the `Flatiron Institute `_. --------------------- Licensing --------------------- All data in HumanBase are freely available under a `CC-BY 4.0 `_ license. Please give appropriate credit, provide a link to the license, and indicate if changes were made. ---------------------- -Who are we? ---------------------- -HumanBase is actively developed by the `Genomics group `_ at the `Flatiron Institute `_ . - --------------------- Help topics --------------------- .. toctree:: - :maxdepth: 2 + :maxdepth: 1 :glob: usage + use-cases functional-networks - tissue-networks modules netwas + deepsea sei beluga + seqweaver expecto + expectosc + in-silico-mutagenesis + asdbrowser citations diff --git a/docs/modules.rst b/docs/modules.rst index e9cf110f1..2878a88e4 100644 --- a/docs/modules.rst +++ b/docs/modules.rst @@ -1,25 +1,26 @@ =========================== -Functional module detection +Functional Module Detection =========================== HumanBase applies community detection to find cohesive gene clusters from a provided gene list and a selected relevant tissue. 
Genes within a cluster share local network neighborhoods and together form a cohesive, specific functional module. Module detection enables systematic association of genes - even functionally uncharacterized genes - to specific processes and phenotypes represented in the detected modules. Functional modules are identified with tissue-specific networks, which predict gene interactions from massive data collections. Thus the discovered modules potentially capture higher-order tissue-specific function. +Functional module detection is described in: Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG. (2016) `Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder `_. Nature Neuroscience. + + Method ------ -The approach\ :sup:`1` is based on shared k-nearest-neighbors (SKNN) and the Louvain community-finding algorithm to cluster the user-selected tissue network into distinct modules of tightly connected genes. The SKNN-based strategy has the advantages of alleviating the effect of high-degree genes and accentuating local network structure by connecting genes that are likely to be functionally clustered together. +The approach is based on shared k-nearest-neighbors (SKNN) and the Louvain community-finding algorithm to cluster the user-selected tissue network into distinct modules of tightly connected genes. The SKNN-based strategy has the advantages of alleviating the effect of high-degree genes and accentuating local network structure by connecting genes that are likely to be functionally clustered together. This technique proceeds as follows: (i) First, we create a subset of the user-selected network containing only the user-provided genes and all the edges between them.
Given the resulting graph G with V nodes (user-provided genes) and E edges, with each edge between genes i and j associated with a weight p\ :sub:`ij`, - (ii) Calculate a new weight for the edge between each pair of nodes i and j that is equal to the number of k nearest neighbors (based on the original weights p\ :sub:`ij`) shared by i and j; + (ii) Calculate a new weight for the edge between each pair of nodes i and j that is equal to the number of k nearest neighbors (based on the original weights p\ :sub:`ij`) shared by i and j. (iii) Choose the top 5% of the edges based on the new edge weights, and apply a graph clustering algorithm. This approach has two key desirable characteristics: - (i) Choosing the highest k values instead of all edges deemphasizes high-degree 'hub' nodes and brings equal attention to highly specific edges between low-degree nodes; + (i) Choosing the highest k values instead of all edges deemphasizes high-degree 'hub' nodes and brings equal attention to highly specific edges between low-degree nodes. (ii) Emphasizing local network-structure by connecting nodes that share a number of local neighbors automatically links genes that are highly likely to be part of the same cluster. -We use a dynamic :code:`k = min(50, 0.2 * |V|)` to obtain the shared-nearest-neighbor tissue-specific network and apply the Louvain algorithm to cluster this network into distinct modules. To stabilize clustering across different runs of the Louvain algorithm, we run the algorithm 100 times and calculate cluster comembership scores for each pair of genes that was equal to the fraction of times (out of 100) the pair was assigned to the same cluster. Genes are assigned to clusters where their comembership score ≥ 0.9. +We use a dynamic :code:`k = min(50, 0.2 * |V|)` to obtain the shared-nearest-neighbor tissue-specific network and apply the Louvain algorithm to cluster this network into distinct modules, where V is the number of query genes. Krishnan et al. 
(2016) showed that module node membership and cluster sizes are robust by testing a range of values for k from 10 to 100. To stabilize clustering across different runs of the Louvain algorithm, we run the algorithm 100 times and calculate cluster comembership scores for each pair of genes that was equal to the fraction of times (out of 100) the pair was assigned to the same cluster. Genes are assigned to clusters where their comembership score ≥ 0.9. Resulting modules are then tested for functional enrichment using genes annotated to Gene Ontology biological process terms. Representative processes and pathways enriched within each cluster are presented alongside of the cluster with their resulting Q value. The Q value of each term associated to the modules is calculated using one-sided Fisher's exact tests and Benjamini–Hochberg corrections to correct for multiple tests. - -1. Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG.(2016) Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature Neuroscience. diff --git a/docs/netwas.rst b/docs/netwas.rst index 6767e3086..faf71aaa7 100644 --- a/docs/netwas.rst +++ b/docs/netwas.rst @@ -3,23 +3,23 @@ NetWAS - Network-wide Association Study ======================================= Tissue-specific networks provide a new means to generate hypotheses related to the molecular basis of human disease. We developed an approach, termed network-wide association study (NetWAS). In NetWAS, the statistical associations from a standard GWAS guide the analysis of functional networks. This reprioritization method is driven by discovery and does not depend on prior disease knowledge. 
NetWAS, in conjunction with tissue-specific networks, effectively reprioritizes statistical associations from distinct GWAS to identify disease-associated genes, and tissue-specific NetWAS better identifies genes associated with hypertension than either GWAS or tissue-naive NetWAS. +The NetWAS method is described in the following publication: Greene, C. S., Krishnan, A., Wong, A. K., Ricciotti, E., Zelaya, R. A., Himmelstein, D. S., ... & Troyanskaya, O. G. (2015). `Understanding multicellular function and disease with human tissue-specific networks `_. Nature Genetics. + Method --------------------------------------- NetWAS trains a support vector machine classifier using nominally significant (P < 0.01) genes as positive examples and 10,000 randomly selected non-significant (P ≥ 0.01) genes as negatives. The classifier is constructed using a tissue network relevant to a disease (e.g. kidney for hypertension), where the features of the classifier are the edge weights of the labeled examples to all the genes in the network. Genes are re-ranked using their distance from the hyperplane, which represent a network-based prioritization of a GWAS, termed NetWAS. To calculate per-gene P values for a GWAS, we suggest the versatile gene-based association study (VEGAS) system. -We have performed and evaluated NetWAS on six GWAS: C-reactive protein levels (lnCRP), type 2 diabetes (T2D), body mass index (BMI), hypertension (ht), alzheimer's (adni) and advanced age-related macular degeneration (advanced AMD). +We have performed and evaluated NetWAS on six GWAS: C-reactive protein levels (lnCRP), type 2 diabetes (T2D), body mass index (BMI), hypertension (ht), alzheimer's (adni) and advanced age-related macular degeneration (advanced AMD). GWAS File --------------------------------------- NetWAS requires as input a GWAS result file, with per-gene p-values. 
We suggest the versatile gene-based association study (VEGAS) system for calculating gene p-values, but we also support forge and pseq formats. -* `VEGAS `_: versatile gene-based association study +* `VEGAS `_: versatile gene-based association study * `FORGE `_: multivariate calculation of gene-wide p-values from Genome-Wide Association Studies Authors and Affiliations -* `PLINK/SEQ `_: a library for the analysis of genetic variation data - - Note that the expected format is from the output of `Gene/group-based association tests `_ +* `PLINK/SEQ `_: a library for the analysis of genetic variation data NetWAS Results --------------------------------------- @@ -29,8 +29,8 @@ When a NetWAS analysis finishes, a result file will be emailed to the provided a # HumanBase NetWAS Analysis Results # # Job id: d7732f19-916d-4458-97b5-936b8d6345cb - # Job title: - # Email: + # Job title: + # Email: # Created: 2017-08-21 17:07:33 EDT # GWAS file: bmi-2012.out.txt # GWAS format: vegas @@ -62,21 +62,9 @@ When a NetWAS analysis finishes, a result file will be emailed to the provided a -Examples +Example --------------------------------------- Hypertension GWAS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Hypertension is a major cardiovascular risk factor and a complex trait involving a large number of genetic variants. We converted SNP-level association statistics into gene-level statistics for each of three recorded phenotypes—diastolic blood pressure (DBP), systolic blood pressure (SBP) and hypertension. Using the tissue-specific network for kidney, a tissue that has a central role in blood pressure control, NetWAS constructed a classifier that identified tissue-specific network connectivity patterns associated with the phenotype of interest. Genes annotated to hypertension phenotypes in the Online Mendelian Inheritance in Man (OMIM) database were more highly ranked by this classifier than by the initial GWAS. (`citation `_) - -.. 
figure:: https://media.nature.com/full/nature-assets/ng/journal/v47/n6/images/ng.3259-F5.jpg - :scale: 50% - - Genes ranked using GWAS (gray) and genes reprioritized using NetWAS (brown) were assessed for correspondence to genes known to be associated with hypertension phenotypes, regulatory processes and therapeutics. We compared individual (systolic blood pressure, SBP; diastolic blood pressure, DBP; hypertension, HTN) as well as combined hypertension endpoints. (a) Gene rankings were compared to OMIM-annotated hypertension genes using AUC. The AUC for the tissue-specific NetWAS is consistently higher than that for the original GWAS for all hypertension endpoints. Merging the network-based predictions for the three hypertension-related endpoints into a combined phenotype results in the best performance (AUC = 0.77; original GWAS AUC = 0.62; the dashed line at 0.5 denotes the AUC of a baseline random predictor). (b,c) Gene rankings were also assessed for enrichment of genes involved in the regulation of blood pressure (GO) (b) and targets of antihypertensive drugs (DrugBank) (c). The top NetWAS results were significantly enriched for genes involved in blood pressure regulation as well as for genes that are targets of antihypertensive drugs. Enrichment was calculated as a z score (Online Methods), with higher scores indicating a greater shift from the expected ranking toward the top of the list. In nearly all cases, the NetWAS ranking was both significantly enriched with the respective gene sets (z score > 1.645 ≈ P value < 0.05) and more enriched than in the original GWAS ranking. - - -Additional GWAS -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. figure:: https://media.nature.com/full/nature-assets/ng/journal/v47/n6/images/ng.3259-SF8.jpg - - Each bar shows the performance of NetWAS reprioritization as measured by the area under the curve (AUC) of documented disease associations with the disease specified in the label above the plot. 
The horizontal axis shows relevant networks (colored bars) and GWAS alone (gray bars), and the horizontal axis label describes the GWAS phenotype from which associations were obtained. diff --git a/docs/requirements.txt b/docs/requirements.txt new file mode 100644 index 000000000..87b956914 --- /dev/null +++ b/docs/requirements.txt @@ -0,0 +1,2 @@ +sphinx>=1.3 +sphinx_rtd_theme \ No newline at end of file diff --git a/docs/sei.rst b/docs/sei.rst index 124a44bb7..94c0c43d0 100644 --- a/docs/sei.rst +++ b/docs/sei.rst @@ -5,9 +5,11 @@ Sei Introduction ------------ -Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. You can also find the Sei code repository `here `_ or read about our manuscript `here `_. +Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. Importantly, this framework is trained without using any variant data, allowing it to predict the regulatory impact of any variant, including rare or previously unseen ones. -Sequence class-level variant effects are computed by comparing the predictions for the reference and the alternative alleles. 
A positive score indicates an increase in sequence class activity by the alternative allele and vice versa. Sequence class-level scores are computed by projecting the 21,907 chromatin profile predictions for the sequence to the unit vector that represents each sequence class. +Sei is described in the following manuscript: Kathleen M. Chen, Aaron K. Wong, Olga G. Troyanskaya and Jian Zhou, `A sequence-based global map of regulatory activity for deciphering human genetics `_. Nature Genetics (2022). + +The Sei code repository can be found `here `_. For older DeepSEA models see: :doc:`beluga` (2019) @@ -16,23 +18,12 @@ For older DeepSEA models see: Input ----- -File formats -~~~~~~~~~~~~ -We support three types of input: vcf, fasta, bed. If you want to predict effects of noncoding variants, use vcf format input. If you want to predict chromatin feature probabilities for DNA sequences, use fasta format. If you want to specify sequences from the human reference genome (GRCh37/hg19), you can use bed format. See below for a quick introduction: - -**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. Currently, the genome position needs to be in GRCh37/hg19 - -**Fasta format** input should include sequences of 4096bp length each. If a sequence is longer than 4096bp, only the center 4096bp will be used. - -**Bed format** provides another way to specify sequences in human reference genome (hg19). The bed input should specify 4096bp-length regions. A minimal example is ``chr1 109817091 109821186``. The three columns are chromosome, start position, and end position. +.. |bp_length| replace:: 4096 +.. |bed_example| replace:: ``chr5 134871851 134871852`` -Genome coordinates -~~~~~~~~~~~~~~~~~~ -We support only ``GRCh37/hg19`` genome coordinates. 
You can use LiftOver to convert your coordinates to the correct version. +.. include:: _includes/common-input-formats.rst -Large submissions -~~~~~~~~~~~~~~~~~ -We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly. +.. include:: _includes/common-submission-info.rst Output @@ -41,7 +32,7 @@ Output Sequence classes ~~~~~~~~~~~~~~~~~~~~~~~~~ -The Sei framework predicts 40 sequence class scores, covering a wide range of regulatory activities such as cell-type-specific enhancers and promoters, as well as 21,907 chromatin profiles for any DNA sequence. +The Sei framework predicts 40 sequence class scores, covering a wide range of regulatory activities such as cell-type-specific enhancers and promoters, as well as 21,907 chromatin profiles for any DNA sequence. Sequence class-level variant effects are computed by comparing the predictions for the reference and the alternative alleles. A positive score indicates an increase in sequence class activity by the alternative allele and vice versa. Sequence class-level scores are computed by projecting the 21,907 chromatin profile predictions for the sequence to the unit vector that represents each sequence class. A full description of how Sei sequence scores are computed can be found in the `Sei paper (2022) `_. To help interpretation, we grouped sequence classes into groups including P (Promoter), E (Enhancer), CTCF (CTCF-cohesin binding), TF (TF binding), PC (Polycomb-repressed), HET (Heterochromatin), TN (Transcription), and L (Low Signal) sequence classes. Please refer to our manuscript for a more detailed description of the sequence classes.
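The projection described for sequence class scores can be illustrated with a minimal sketch. It assumes the variant effect is simply the projected score of the alternative allele minus that of the reference allele, and it omits any further scaling or normalization described in the Sei paper. The function names and the three-element toy vectors are hypothetical; real inputs would have 21,907 entries.

```python
import math


def unit_vector(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def class_score(predictions, class_vector):
    """Project chromatin profile predictions onto the class unit vector."""
    unit = unit_vector(class_vector)
    return sum(p * u for p, u in zip(predictions, unit))


def class_variant_effect(ref_predictions, alt_predictions, class_vector):
    """Assumed definition: alt score minus ref score for one sequence class."""
    return (class_score(alt_predictions, class_vector)
            - class_score(ref_predictions, class_vector))


# Toy example with 3 "profiles" instead of 21,907 (hypothetical values).
toy_class_vector = [3.0, 4.0, 0.0]
ref = [0.1, 0.2, 0.9]
alt = [0.3, 0.4, 0.9]
effect = class_variant_effect(ref, alt, toy_class_vector)
print(f"sequence class effect: {effect:.2f}")
```

Under this sketch a positive value means the alternative allele increases the projected class activity, which matches the sign convention stated above.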
diff --git a/docs/seqweaver.rst b/docs/seqweaver.rst new file mode 100644 index 000000000..0749b4795 --- /dev/null +++ b/docs/seqweaver.rst @@ -0,0 +1,78 @@ +========== +Seqweaver +========== + +Introduction +------------ + +Seqweaver is a deep learning framework designed to predict how genetic variants affect post-transcriptional RNA-binding protein (RBP) interactions. The model is trained on RBP-RNA interaction data +obtained from CLIP-seq experiments. Seqweaver consists of 232 RBP models spanning 88 distinct RBPs, and can predict the impact of genetic variants at single-nucleotide resolution. Importantly, this framework is trained without using any variant data, allowing it to predict the impact on RBP binding of any variant, including rare or previously unseen ones. + +Seqweaver is described in: +Park CY, Zhou J, Wong AK, Chen KM, Theesfeld CL, Darnell RB, Troyanskaya OG. `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk `_. Nat Genet. 2021 Feb;53(2):166-173. + + +Input +----- + +We support three types of input: VCF, FASTA, BED. If you want to predict effects of noncoding variants, use VCF format input. If you want to predict RBP interaction probabilities directly from transcript sequences, you can use FASTA format. If you want to specify sequences from the human reference genome, you can use BED format. + +Examples of all input formats are available in the job submission interface. See below for a quick introduction: + +**VCF format** is used for specifying a genomic variant. A minimal example is ``chr8 38120276 [QKI disruption] C A +`` (if you want to copy this text as input, you will need to change spaces to tabs). The six columns are chromosome, position, name, reference allele, alternative allele, and strand. Note that strand must be specified for Seqweaver but not for the other deep learning models. + +**FASTA format** input should include sequences of 1000 bp length each. 
If a sequence is different from 1000 bp: + +* **Note**: The prediction is for the center base of the input sequence +* **Longer sequences**: Only the center 1000 bp will be used +* **Shorter sequences**: Sequences shorter than 1000 bp will be padded with 'N' bases evenly on both sides + + - **Important**: We do not recommend using FASTA input smaller than 1000 bp unless it is very close (only a few bp off) + - **Note**: This padding behavior is not recommended. N's were extremely rare in training data (only appearing in assembly gaps), and the model has not been evaluated with artificially padded sequences + - **Strong recommendation**: Always provide sequences of exactly 1000 bp by including genomic flanking sequences + +**BED format** provides another way to specify sequences in the human reference genome (hg19). The BED input should specify 1000 bp-length regions. A minimal example is ``chr1 109817090 109818091 . 0 -``. The columns are chromosome, start position, end position, name, score, and strand. + + +Large submissions +~~~~~~~~~~~~~~~~~ +We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly. + + +Downloads +--------- +The 1000 Genomes Project RBP LDScores used in GWAS analysis are available `here `_. + +gnomAD (v2.1) Seqweaver RBP target site dysregulation scores are available `here `_. + +The code for making variant effect predictions for Seqweaver `RBP model `_ is available from this link (Zhou and Park et al. Nature Gen 2019). Our `new version `_ simplifies the dependencies by using the `Selene `_ package and streamlines the prediction process for RBP models.
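The FASTA length handling described in the Input section (center-cropping longer sequences and padding shorter ones with 'N') can be sketched as below. This is an illustration only, not the server's code: the function name ``fit_to_window`` is hypothetical, and placing the extra 'N' on the right when the padding is odd is an assumption, since the text only says "evenly on both sides".

```python
def fit_to_window(sequence, window=1000, pad_char="N"):
    """Return a sequence of exactly `window` bases.

    Longer inputs keep only their center `window` bases; shorter inputs
    are padded with `pad_char` on both sides.
    """
    length = len(sequence)
    if length > window:
        start = (length - window) // 2
        return sequence[start:start + window]
    missing = window - length
    left = missing // 2
    # Assumption: when `missing` is odd, the extra pad base goes on the right.
    right = missing - left
    return pad_char * left + sequence + pad_char * right


# Small windows are used here only to keep the example readable.
print(fit_to_window("AACGTT", window=4))
print(fit_to_window("ACGT", window=8))
```

As the recommendations above note, padded input is best avoided; a check such as ``len(sequence) == 1000`` before submission is the safer workflow.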
+ +Output +------ + +Variant scores +~~~~~~~~~~~~~~ + +**Disease impact score:** DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted post-transcriptional regulatory effects of these mutations (see `Zhou et al., 2019 `_). The predicted DIS probabilities are then converted into DIS e-values, computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is: + +.. math:: + -\log_{10}(\text{DIS e-value}_{feature}) + +**Mean -log e-value:** For each predicted regulatory feature effect (:math:`abs(p_{alt}-p_{ref})`) of a variant, we calculate a feature e-value based on the empirical distribution of that feature’s effects among gnomAD variants (see Molecular-level biochemical effects prediction: e-value). The MLE score of a variant is + +.. math:: + \sum{-\log_{10}(\text{e-value}_{feature})}/N + +Molecular-level biochemical effects prediction +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**z-score:** A scaled score where the feature diff score (:math:`p_{alt} - p_{ref}`) is divided by the root mean square of the feature diff score across gnomAD variants. Note that this is “sign-preserving”, i.e. a negative z-score indicates that a mutation decreases the probability of a regulatory feature. + +**E-value:** E-value is defined as the expected proportion of SNPs with a larger predicted effect. We calculate an 'e-value' based on the empirical distribution of that feature’s effect (:math:`abs(p_{alt}-p_{ref})`) among gnomAD variants. For example, a feature e-value of 0.01 indicates that only 1% of gnomAD variants have a larger predicted effect. + +**Probability diffs:** The difference between the predicted probability of the reference allele and the alternative allele for a regulatory feature (:math:`p_{alt}-p_{ref}`).
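The score definitions above can be made concrete with a small sketch. It is a hedged illustration rather than the Seqweaver implementation: the function names are hypothetical, the background list stands in for the per-feature gnomAD distribution, and the ``floor`` that keeps the logarithm finite when an e-value is 0 is an assumption not stated in the text.

```python
import math
from bisect import bisect_right


def z_score(diff, background_diffs):
    """Sign-preserving score: diff divided by the RMS of background diffs."""
    mean_square = sum(d * d for d in background_diffs) / len(background_diffs)
    return diff / math.sqrt(mean_square)


def e_value(diff, background_diffs):
    """Fraction of background variants with a larger absolute effect."""
    sorted_abs = sorted(abs(d) for d in background_diffs)
    larger = len(sorted_abs) - bisect_right(sorted_abs, abs(diff))
    return larger / len(sorted_abs)


def mean_neg_log_e_value(e_values, floor=1e-6):
    """Average of -log10(e-value) over the N features of one variant.

    Assumption: e-values are floored so that log10(0) is never taken.
    """
    return sum(-math.log10(max(e, floor)) for e in e_values) / len(e_values)


# Hypothetical background of p_alt - p_ref values for one feature.
background = [-0.4, -0.2, -0.1, 0.1, 0.2, 0.3, 0.4, 0.5, -0.05, 0.05]
variant_diff = -0.3

print("z-score:", z_score(variant_diff, background))
print("e-value:", e_value(variant_diff, background))
print("MLE:", mean_neg_log_e_value([0.1, 0.01]))
```

Note that the e-value uses the absolute effect, so it is the same for a diff of 0.3 and -0.3, while the z-score keeps the sign.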
+ +See also +-------- +* :doc:`sei` - Latest chromatin and regulatory impact model with 4096bp input sequences +* :doc:`beluga` - 2018 DeepSEA model with 2000bp input sequences diff --git a/docs/usage.rst b/docs/usage.rst index 98ec99bc9..f18e8be73 100644 --- a/docs/usage.rst +++ b/docs/usage.rst @@ -1,3 +1 @@ -===================== -Getting Started -===================== +.. include:: index.rst \ No newline at end of file diff --git a/docs/use-cases.rst b/docs/use-cases.rst new file mode 100644 index 000000000..6d952c2a2 --- /dev/null +++ b/docs/use-cases.rst @@ -0,0 +1,42 @@ +========== +Use Cases +========== + +This section provides practical examples and tutorials demonstrating how to use various HumanBase tools for genomic analysis. Each use case walks through a specific research question and shows how to use the appropriate tools to answer it. + +These examples are drawn from real research publications and demonstrate the power of HumanBase's integrated approach to analyzing the functional effects of genetic variants, gene expression patterns, and biological networks. + +Sequence Analysis Use Cases +--------------------------- + +.. toctree:: + :maxdepth: 1 + + use-cases/ctcf-disruption + use-cases/expectosc-use-case + use-cases/expecto-use-case + use-cases/sei-use-case + use-cases/beluga-use-case + use-cases/seqweaver-use-case + use-cases/asd-genome-browser + use-cases/in-silico-mutagenesis-use-case + +Network Analysis Use Cases +-------------------------- + +.. toctree:: + :maxdepth: 1 + + use-cases/functional-enrichments + use-cases/comparing-networks + use-cases/functional-module-clustering + use-cases/networks-use-case + use-cases/netwas-use-case + +Tutorials +--------- + +.. 
toctree:: + :maxdepth: 1 + + use-cases/extracting-fasta diff --git a/docs/use-cases/asd-genome-browser.rst b/docs/use-cases/asd-genome-browser.rst new file mode 100644 index 000000000..f24656620 --- /dev/null +++ b/docs/use-cases/asd-genome-browser.rst @@ -0,0 +1,20 @@ +=========================== +ASD Genome Browser use case +=========================== + +**Task: What is the significance of a noncoding autism proband variation observed in the** `Simons Simplex Collection `_ **?** + + +* Select the ASD Browser analysis. Input a genomic region of interest. For example, here we view the predicted disease impact of variants in the vicinity of gene TENM3, centered on a predicted high DNA disease impact variant in an intronic region (see the help page for Beluga (DeepSEA) for information on how the DNA disease impact score is computed and the help page for Seqweaver for information on how the RNA disease impact score is computed). + +.. figure:: ../img/use-cases/asd-browser-1.png + :align: center + :width: 600px + + +* The heatmap below the genome browser view shows the predicted molecular level impact of each variant. Individual variants can be selected to reorder the feature list for the predicted highest effect features for that variant. For example, for the high predicted DNA disease impact variant at position 183066738 on chromosome 4, multiple chromatin features are predicted to be significantly altered. Users can select whether to view DNA-level or RNA-level effects of the variants. + +.. 
figure:: ../img/use-cases/asd-browser-2.png + :align: center + :width: 600px + diff --git a/docs/use-cases/beluga-use-case.rst b/docs/use-cases/beluga-use-case.rst new file mode 100644 index 000000000..f9c8de7df --- /dev/null +++ b/docs/use-cases/beluga-use-case.rst @@ -0,0 +1,41 @@ +========================= +Beluga (DeepSEA) use case +========================= + +**Task: What is the impact of a non-coding variant on the chromatin state of a sequence region?** + + +* Select the “Beluga” analysis from the main Analyses menu. Input noncoding variants of interest and submit the job. + +.. figure:: ../img/use-cases/beluga-1.png + :align: center + :width: 600px + + +* View visualizations of the impact of the input variants on the chromatin state of the sequence. Here, the disease impact scores of the query variants are visualized. The disease impact score is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted transcriptional or post-transcriptional regulatory effects of these mutations (see `Zhou et al., 2019 `_ and the Beluga (DeepSEA) documentation). Tabular representations of all predictions can also be viewed and downloaded. + +.. figure:: ../img/use-cases/beluga-2.png + :align: center + :width: 600px + + +* The variants can be viewed in their genomic context on a genome browser. + +.. figure:: ../img/use-cases/beluga-3.png + :align: center + :width: 600px + + +* The predicted chromatin effects can also be viewed on a heatmap. Here, the probability diffs (the difference between the predicted probability of the reference allele and the alternative allele for a regulatory feature) are visualized. + +.. figure:: ../img/use-cases/beluga-4.png + :align: center + :width: 600px + + +* An alternative view allows users to see all 2002 predictions by Beluga at one time. Select the “Features” tab. In this view each dot is a chromatin feature predicted by Beluga and they are ranked by z-score.
With this view a user can see both tails of the predictions and can assess how many features are predicted to increase (right side) or decrease (left side) in probability with the variant. Hovering over a dot shows which feature it represents and its score. + +.. figure:: ../img/use-cases/beluga-5.png + :align: center + :width: 600px + diff --git a/docs/use-cases/comparing-networks.rst b/docs/use-cases/comparing-networks.rst new file mode 100644 index 000000000..e65d2a5b8 --- /dev/null +++ b/docs/use-cases/comparing-networks.rst @@ -0,0 +1,36 @@ +=========================== +Comparing networks use case +=========================== + +This use case is drawn from Akat et al. 2022, Bronchopulmonary dysplasia and wnt pathway-associated single nucleotide polymorphisms + +**Task: How do the relationships between my genes of interest differ between lung tissue and fetal tissue?** + + +* Input bronchopulmonary dysplasia-related genes of interest. Select “Search.” + +.. figure:: ../img/use-cases/comparing-networks-1.png + :align: center + :width: 600px + + +* Select the first tissue type of interest (lung). Select “Add Network” and choose the second tissue type of interest (fetus). + +.. figure:: ../img/use-cases/comparing-networks-2.png + :align: center + :width: 600px + + +* Adjust the maximum number of genes for both networks. + +.. figure:: ../img/use-cases/comparing-networks-3.png + :align: center + :width: 600px + + +* Save the resulting networks. + +..
figure:: ../img/use-cases/comparing-networks-4.png + :align: center + :width: 600px + diff --git a/docs/use-cases/ctcf-disruption.rst b/docs/use-cases/ctcf-disruption.rst new file mode 100644 index 000000000..87c3dec90 --- /dev/null +++ b/docs/use-cases/ctcf-disruption.rst @@ -0,0 +1,70 @@ +======================= +CTCF disruption example +======================= + +**Task: Investigate the impact of variants at a site experimentally determined to bind CTCF.** + +HumanBase sequence models can be used to probe how sequence relates to function. For example, given the experimental observation that CTCF binds in a ChIP-seq assay to the region chr1:762285-762358 (hg19) (`insulatordb.uthsc.edu `_), a researcher can ask which positions and bases (A, C, G, or T) are important for binding by querying the model with a set of variants. + +Any set of variants can be queried. In Step 1, we query a set of variants near the center of the CTCF binding region (these could come from a screen, a set of variants observed in a genome sequencing study, etc.). No more than 10,000 variants may be queried in a single submission to the webserver. In Step 7, we query the model to predict the effect of every single possible mutation across the region with in silico mutagenesis. + +**Step 1:** We select "Beluga" from the Analyses menu, select the "Paste contents" radio button, input the variants in VCF format, and select "Submit." + +The VCF format is 5 columns of tab-separated data: chromosome, position, note, reference allele, alternative allele. + +The VCF file for the variants that we query looks like this: + +.. code-block:: text + + chr1 762318 [CTCF disruption] G A + chr1 762319 [CTCF disruption] T C + chr1 762320 [CTCF disruption] C A + chr1 762321 [CTCF disruption] A G + chr1 762322 [CTCF disruption] C A + +..
figure:: ../img/use-cases/ctcf-disruption-1.png + :align: center + :width: 600px + +**Step 2:** The result is presented in a probability difference heatmap where each block shows the difference in probability predicted for the reference allele binding the feature versus the alternative allele binding the feature. The results view shows that the 762322 C>A variant is predicted to most strongly disrupt CTCF binding among the variants queried. + +.. figure:: ../img/use-cases/ctcf-disruption-2.png + :align: center + :width: 600px + +**Step 3:** The results view also displays a genome browser which shows the variants in their genomic context. Additional information can be added to the browser, like sequence motif information, ChIP-seq experimental data, enhancer or other regulatory feature information - as in the UCSC genome browser. To do this, select the + option to the top right of the genome browser (highlighted with a red box in the image below) and choose the "Files" tab. This allows adding genome tracks in bigBed, bigWig, or indexed BAM format. Some useful genome tracks suitable for loading into the interface can be found at the following directory: `https://hgdownload.soe.ucsc.edu/gbdb/hg19/ `_ (the files with extension .bb are bigBed files). + +.. figure:: ../img/use-cases/ctcf-disruption-3.png + :align: center + :width: 600px + +**Step 4:** We upload a track of motifs from the JASPAR database (`JASPAR2018.bb `_) to investigate what motifs overlap the variant sites that we have queried. + +.. figure:: ../img/use-cases/ctcf-disruption-4.png + :align: center + :width: 600px + +**Step 5:** We then see a track of motif locations added to the genome browser. Zooming in, we see that our variants of interest do indeed overlap a CTCF motif. + +.. figure:: ../img/use-cases/ctcf-disruption-5.png + :align: center + :width: 600px + +**Step 6:** To further explore the predicted impact of variants at this site on CTCF binding, we can use the in silico mutagenesis tool.
We first use the NCBI Genome Data Viewer for hg19 at `https://www.ncbi.nlm.nih.gov/gdv/browser/genome/?id=GCF_000001405.25 `_ (using the download button at the top right, above the genome browser) to download a FASTA file of a 2000 base pair region centered at the experimentally determined CTCF binding site from insulatordb (see :doc:`extracting-fasta`). The region covered by the FASTA file is chr1:761322-763321. We then go to the in silico mutagenesis tool, upload our FASTA file, and press submit. + +.. figure:: ../img/use-cases/ctcf-disruption-6.png + :align: center + :width: 600px + +**Step 7:** We then select the feature of interest (CTCF binding in K562 cells) and press submit. + +.. figure:: ../img/use-cases/ctcf-disruption-7.png + :align: center + :width: 600px + +**Step 8:** We see that variants in the center of the region we queried (in the experimentally defined CTCF region) are predicted to most strongly disrupt CTCF binding in K562 cells. Dark blue cells indicate that almost any base substitution leads to a strong decrease in probability of binding; however, substitution of a T on the 3' side of the motif to an A or G is predicted to increase binding (yellow). From this view, to see how other sequence features/TFs might be affected by the mutations, we can choose another chromatin feature from the dropdown menu and press submit to view the predicted impact of variants on that feature. + +.. figure:: ../img/use-cases/ctcf-disruption-8.png + :align: center + :width: 600px + diff --git a/docs/use-cases/expecto-use-case.rst b/docs/use-cases/expecto-use-case.rst new file mode 100644 index 000000000..7ea9a9802 --- /dev/null +++ b/docs/use-cases/expecto-use-case.rst @@ -0,0 +1,42 @@ +================ +ExPecto use case +================ + +**Task: What is the impact of a noncoding variant on the tissue-specific expression of nearby genes?** + + +* Select the ExPecto analysis and input the query by gene name, VCF file, or SNP rsID.
We first demonstrate a query by gene name. + +.. figure:: ../img/use-cases/expecto-1.png + :align: center + :width: 600px + + +* View predictions for the tissue-specific gene expression of the gene of interest. Selecting a tissue system in the overview (for example, hematopoietic system) shows the expression impact in specific tissues within that tissue system. The expression effect is the log fold change between the expression levels of the reference and alternate alleles (see ExPecto documentation). The results can also be downloaded in tabular form using the “Download” button. + +.. figure:: ../img/use-cases/expecto-2.png + :align: center + :width: 600px + +.. figure:: ../img/use-cases/expecto-3.png + :align: center + :width: 600px + +**Step 3:** We next demonstrate a VCF-formatted query. We query the variant rs33944208 (chr11:5248389 G>A in hg19 coordinates) in VCF format. This variant is associated with beta-thalassemia and reduced expression of the HBB gene (Orkin et al. 1984, PMID: 6086605). Note that although a single variant is queried in this example, the interface supports simultaneously querying many (up to 10,000) variants. + +.. figure:: ../img/use-cases/expecto-4.png + :align: center + :width: 600px + +.. figure:: ../img/use-cases/expecto-5.png + :align: center + :width: 600px + +**Step 4:** In the genome browser results view, we see that this variant is in the 5’ UTR of the HBB gene. + +**Step 5:** ExPecto predicts that the variant downregulates the associated gene across cell types. + +.. 
figure:: ../img/use-cases/expecto-6.png + :align: center + :width: 600px + diff --git a/docs/use-cases/expectosc-use-case.rst b/docs/use-cases/expectosc-use-case.rst new file mode 100644 index 000000000..e377a58e8 --- /dev/null +++ b/docs/use-cases/expectosc-use-case.rst @@ -0,0 +1,27 @@ +================== +ExPectoSC use case +================== + +**Task: Which non-coding variants have a strong effect on the cell-type specific expression of a nearby gene?** + + +* Navigate to humanbase.io/expectosc or choose the “ExPectoSC” option from the main Analyses menu. Input a gene of interest (for example, PTEN). + +.. figure:: ../img/use-cases/expectosc-1.png + :align: center + :width: 600px + + +* View noncoding variants with a strong effect on the expression of a nearby gene (summarized at the tissue level). The “Expression Effect” scores (described in more detail in the documentation) are the predictions of variant effects normalized to 1000 Genomes variants. A “Download” button allows downloading predictions as a tsv file. + +.. figure:: ../img/use-cases/expectosc-2.png + :align: center + :width: 600px + + +* Select a tissue type (for example, brain) to view cell-type specific impacts of the noncoding variants on the expression of the corresponding gene. Cell-type specific variant effect predictions can also be downloaded as a table. + +.. figure:: ../img/use-cases/expectosc-3.png + :align: center + :width: 600px + diff --git a/docs/use-cases/extracting-fasta.rst b/docs/use-cases/extracting-fasta.rst new file mode 100644 index 000000000..911dc3e61 --- /dev/null +++ b/docs/use-cases/extracting-fasta.rst @@ -0,0 +1,16 @@ +================================ +Extracting a FASTA file tutorial +================================ + +**Task: Extract a FASTA file for a region of interest (for example, as input to the in silico mutagenesis tool).** + +We will use an external tool, the NCBI Genome Data Viewer, for this task. 
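Before opening the viewer, it can help to work out the coordinates of the window to download. The following is a minimal sketch, not part of HumanBase or NCBI tooling: the helper name ``centered_window`` is hypothetical, and it assumes 1-based inclusive coordinates and an integer midpoint. Under those assumptions it reproduces the 2000 base pair region used in the CTCF disruption example (chr1:761322-763321, centered on chr1:762285-762358).

```python
def centered_window(start, end, width):
    """Return (window_start, window_end) for a window of `width` bases
    centered on the region start..end.

    Assumption: coordinates are 1-based and inclusive, and the midpoint
    is rounded down to an integer position.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if end < start:
        raise ValueError("end must not be smaller than start")
    midpoint = (start + end) // 2
    # Place half of the window on each side of the midpoint.
    window_start = midpoint - width // 2 + 1
    window_end = window_start + width - 1
    if window_start < 1:
        raise ValueError("window extends past the start of the chromosome")
    return window_start, window_end


if __name__ == "__main__":
    # CTCF binding region from the CTCF disruption example (hg19).
    win_start, win_end = centered_window(762285, 762358, 2000)
    print(f"chr1:{win_start}-{win_end}")
```

The printed range can be pasted into the coordinate box of the viewer described below.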
+ +The NCBI Genome Data Viewer for hg19 can be accessed at the following link: https://www.ncbi.nlm.nih.gov/gdv/browser/genome/?id=GCF_000001405.25 + +Input the desired coordinates to download in the top left (red box). Once the correct genome region is displayed, select Download -> Download FASTA -> FASTA (Visible Range). This will download a FASTA file that can then be uploaded into HumanBase tools. + +.. figure:: ../img/use-cases/extracting-fasta-1.png + :align: center + :width: 600px + diff --git a/docs/use-cases/functional-enrichments.rst b/docs/use-cases/functional-enrichments.rst new file mode 100644 index 000000000..9f467286c --- /dev/null +++ b/docs/use-cases/functional-enrichments.rst @@ -0,0 +1,33 @@ +=========================================== +Functional enrichments in networks use case +=========================================== + +This use case is drawn from De Roover et al. 2021, Hypoxia induces DOT1L in articular cartilage to protect against osteoarthritis. + +**Task: What processes are enriched among genes functionally related to my genes of interest in cartilage tissue?** + + +* Input genes of interest (18 TFs identified as likely to regulate osteoarthritis-relevant gene DOT1L). Select “Search.” + +.. figure:: ../img/use-cases/functional-enrichments-1.png + :align: center + :width: 600px + + +* Select tissue type of interest (cartilage). + +.. figure:: ../img/use-cases/functional-enrichments-2.png + :align: center + :width: 600px + + +* View functional enrichments of the gene of interest and related genes in the cartilage network. Download the network and functional enrichments. The color of an edge in the functional network indicates the probability that the corresponding pair of genes is functionally related (the network edge weights are computed by Bayesian integration of the data compendium using the tissue-specific gold standard). + +.. figure:: ../img/use-cases/functional-enrichments-3.png + :align: center + :width: 600px + +.. 
figure:: ../img/use-cases/functional-enrichments-4.png + :align: center + :width: 600px + diff --git a/docs/use-cases/functional-module-clustering.rst b/docs/use-cases/functional-module-clustering.rst new file mode 100644 index 000000000..4582e258d --- /dev/null +++ b/docs/use-cases/functional-module-clustering.rst @@ -0,0 +1,22 @@ +===================================== +Functional module clustering use case +===================================== + +This use case is drawn from Bishop et al. 2022, Inflammation Subtypes and Translating Inflammation-Related Genetic Findings in Schizophrenia and Related Psychoses: A Perspective on Pathways for Treatment Stratification and Novel Therapies. + +**Task: How can I partition my list of genes of interest into modules, and what are the functional enrichments in these modules?** + + +* Select the “Modules” analysis. Input genes of interest (103 druggable schizophrenia-related genes). Select the desired network (central nervous system). Select “Search.” + +.. figure:: ../img/use-cases/functional-module-1.png + :align: center + :width: 600px + + +* View the modules identified by data-driven community clustering of the gene set of interest in the selected network. The functional enrichments of the detected modules can also be viewed, and the minimum module size cutoff can be adjusted. The resulting module clustering and enrichment table can be downloaded in multiple formats. + +.. 
figure:: ../img/use-cases/functional-module-2.png + :align: center + :width: 600px + diff --git a/docs/use-cases/in-silico-mutagenesis-use-case.rst b/docs/use-cases/in-silico-mutagenesis-use-case.rst new file mode 100644 index 000000000..3fe01cd52 --- /dev/null +++ b/docs/use-cases/in-silico-mutagenesis-use-case.rst @@ -0,0 +1,20 @@ +============================== +In Silico Mutagenesis use case +============================== + +**Task: Which base changes in a sequence of interest are likely to have a large impact on a specific chromatin feature?** + + +* Select “In silico mutagenesis” from the “Analyses” menu, input a sequence of interest, and select “Submit.” + +.. figure:: ../img/use-cases/in-silico-mutagenesis-1.png + :align: center + :width: 600px + + +* Select a chromatin feature and view the predicted impact of all possible base changes in the input sequence on the selected chromatin feature. The perturbed feature that is displayed is selected from the "Chromatin features" dropdown menu. See the In Silico Mutagenesis documentation or the `DeepSEA paper `_ for the interpretation of the log fold change scores. The in silico mutagenesis predictions can also be downloaded as a csv file. + +.. figure:: ../img/use-cases/in-silico-mutagenesis-2.png + :align: center + :width: 600px + diff --git a/docs/use-cases/netwas-use-case.rst b/docs/use-cases/netwas-use-case.rst new file mode 100644 index 000000000..1bb83a494 --- /dev/null +++ b/docs/use-cases/netwas-use-case.rst @@ -0,0 +1,20 @@ +=============== +NetWAS use case +=============== + +**Task: Which nominally significant hits in a GWAS of BMI are likely to be causal?** + + +* Select “GWAS re-prioritization (NetWAS)” analysis. Prepare a per-gene P-value summary file for the GWAS of interest (using the VEGAS framework is recommended for this). Upload the per-gene P-value summary file, select the GWAS file type, tissue context, and P-value threshold. Enter an email for receiving results and job title. 
Select “Submit.” + +.. figure:: ../img/use-cases/netwas-1.png + :align: center + :width: 600px + + +* NetWAS analysis results are provided by email. Results are ranked by an SVM classifier based on the similarity of the network connectivity of each gene to nominally significant training examples. + +.. figure:: ../img/use-cases/netwas-2.png + :align: center + :width: 600px + diff --git a/docs/use-cases/networks-use-case.rst b/docs/use-cases/networks-use-case.rst new file mode 100644 index 000000000..5fcd2b47f --- /dev/null +++ b/docs/use-cases/networks-use-case.rst @@ -0,0 +1,36 @@ +================= +Networks use case +================= + +Drawn from Cury et al. 2023, Transcriptional profiles and common genes link lung cancer with the development and severity of COVID-19. + +**Task: What genes are functionally related to my gene(s) of interest in lung tissue?** + + +* Input genes of interest (9 genes identified as upregulated in both SARS-CoV-2 infected lung cell lines and lung cancer samples). Select “Search.” + +.. figure:: ../img/use-cases/networks-1.png + :align: center + :width: 600px + + +* Select desired tissue and data types for the network. + +.. figure:: ../img/use-cases/networks-2.png + :align: center + :width: 600px + + +* Select the desired interaction confidence threshold and maximum number of genes. + +.. figure:: ../img/use-cases/networks-3.png + :align: center + :width: 600px + + +* Export the network as an image or as text. + +.. figure:: ../img/use-cases/networks-4.png + :align: center + :width: 600px + diff --git a/docs/use-cases/sei-use-case.rst b/docs/use-cases/sei-use-case.rst new file mode 100644 index 000000000..4a2a2bd63 --- /dev/null +++ b/docs/use-cases/sei-use-case.rst @@ -0,0 +1,33 @@ +============ +Sei use case +============ + +**Task: What is the impact of a non-coding variant on the regulatory activity of a sequence?** + + +* Select Sei from the analyses menu. Input noncoding variants of interest and submit the job. + +.. 
figure:: ../img/use-cases/sei-1.png + :align: center + :width: 600px + + +* View visualizations of the impact of the input variants on the regulatory class of the sequence. For example, the A to G mutation at position 23508363 on chromosome 10 decreases the sequence class score of the "TF3 FOXA1/AR/ESR" regulatory class. The method for computing the Sei sequence class scores can be found in the Methods/Sequence class scores section of `the Sei paper `_. A table of the results can also be viewed and downloaded. + +.. figure:: ../img/use-cases/sei-2.png + :align: center + :width: 600px + + +* Each variant can also be viewed in its genomic context in a genome browser. + +.. figure:: ../img/use-cases/sei-3.png + :align: center + :width: 600px + +* Heatmap visualizations of the sequence class score are also provided. By selecting the down arrow next to "Sequence class score" and choosing "Probability diffs", the user can alternatively view a probability difference heatmap of the molecular-level biochemical effects of the variant, where each block shows the difference in probability predicted for the reference allele having the epigenetic feature versus the alternative allele having the epigenetic feature. + +.. figure:: ../img/use-cases/sei-4.png + :align: center + :width: 600px + diff --git a/docs/use-cases/seqweaver-use-case.rst b/docs/use-cases/seqweaver-use-case.rst new file mode 100644 index 000000000..a8c9c3343 --- /dev/null +++ b/docs/use-cases/seqweaver-use-case.rst @@ -0,0 +1,36 @@ +================== +Seqweaver use case +================== + +Use case drawn from Park et al. 2021, Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk. + +**Task: What is the post-transcriptional impact of a noncoding variant on binding of RNA binding proteins?** + + +* Select the “SeqWeaver” analysis from the main analyses menu. Input noncoding variants of interest and submit the job. + +.. 
figure:: ../img/use-cases/seqweaver-1.png + :align: center + :width: 600px + + +* View visualizations of the impact of the input variants on the RNA binding protein (RBP) affinity of the sequence. Here, the disease impact score of the query variant is visualized. The disease impact score is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted post-transcriptional regulatory effects of these mutations (see `Zhou et al., 2019 `_ and the Seqweaver documentation). Tabular representations of all predictions can also be viewed and downloaded. + +.. figure:: ../img/use-cases/seqweaver-2.png + :align: center + :width: 600px + + +* The variant can be viewed in its genomic context on a genome browser. This variant lies in the DDHD2 gene. + +.. figure:: ../img/use-cases/seqweaver-3.png + :align: center + :width: 600px + + +* The heatmap view shows that the variant is predicted to disrupt binding of the schizophrenia-associated RNA binding protein QKI. The z-score is a scaled score where the feature diff score (:math:`p_{alt} - p_{ref}`) is divided by the root mean square of the feature diff score across gnomAD variants (see Seqweaver documentation). + +.. figure:: ../img/use-cases/seqweaver-4.png + :align: center + :width: 600px +
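The z-score scaling described for the Seqweaver heatmap can be illustrated with a short sketch. This is only an illustration of the stated formula, not the Seqweaver implementation: the function names and the background values are hypothetical, and the list ``background_diffs`` stands in for feature diff scores computed across gnomAD variants.

```python
import math


def root_mean_square(values):
    """Root mean square of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return math.sqrt(sum(v * v for v in values) / len(values))


def scaled_z_score(p_ref, p_alt, background_diffs):
    """Feature diff score (p_alt - p_ref) divided by the root mean square
    of the feature diff scores of a background variant set.

    Assumption: `background_diffs` holds p_alt - p_ref for the same
    feature across background variants (gnomAD in the text above).
    """
    scale = root_mean_square(background_diffs)
    if scale == 0:
        raise ValueError("background diff scores are all zero")
    return (p_alt - p_ref) / scale


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the calculation.
    background = [0.1, -0.1, 0.1, -0.1]
    print(scaled_z_score(p_ref=0.6, p_alt=0.4, background_diffs=background))
```

A negative value indicates that the alternative allele is predicted to reduce binding relative to the reference allele, scaled by how much typical background variants change the same feature.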