Merged
64 commits
325a138
Add gitignore
j-funk Jun 26, 2025
55d2d19
docs: Add SEI score computation reference to output section
j-funk Jun 26, 2025
7fd5d37
docs: Add output section to ExPecto documentation
j-funk Jun 26, 2025
3678503
docs: Make ExPecto output section headings bold
j-funk Jun 26, 2025
b63c528
Bring over table of contents and config improvements from seek branch
j-funk Jun 26, 2025
c5a0268
Add output section to ExPectoSC documentation
j-funk Jun 26, 2025
ff43415
Move In Silico Mutagenesis to separate documentation page
j-funk Jun 26, 2025
93e6a4b
Add Read the Docs configuration files
j-funk Jun 27, 2025
892d607
Update README
j-funk Jun 27, 2025
adac851
Minor copy edit caps
j-funk Jun 27, 2025
be4952f
Update README for deploying branch versions for review
j-funk Jun 30, 2025
648d172
Add Seqweaver documentation and refactor common input formats
j-funk Jun 30, 2025
c6e0713
Update Seqweaver doc with Rachel’s version
j-funk Jun 30, 2025
f437d58
update seqweaver doc
rssealfon Jul 14, 2025
d71f9be
add info about input example availability
rssealfon Jul 14, 2025
86f6020
Merge pull request #10 from rssealfon/rachels-edits
j-funk Jul 14, 2025
f1b5356
update beluga documentation
rssealfon Jul 14, 2025
503b2e2
update sei documentation
rssealfon Jul 14, 2025
74fd1e8
update expecto and clever documentation
rssealfon Jul 14, 2025
e08dd20
update usage.rst from intro page
rssealfon Jul 15, 2025
81b8462
update sei.rst from additional page
rssealfon Jul 15, 2025
f969c90
update expecto.rst from additional page
rssealfon Jul 15, 2025
3536014
update asd browser page
rssealfon Jul 15, 2025
b0d38f0
update modules txt
rssealfon Jul 21, 2025
fff4e61
merge tissue and functional networks pages
rssealfon Jul 21, 2025
4e61e9c
edit FMD, netwas pages
rssealfon Jul 21, 2025
f61eccd
edit ISM, etc
rssealfon Jul 21, 2025
17bc327
update citation page
rssealfon Jul 21, 2025
1772b74
Migrate ‘clever’ to ‘expectosc’
j-funk Jul 21, 2025
7a53a95
Add DeepSEA umbrella doc page
j-funk Jul 21, 2025
7caab5b
additional changes
rssealfon Jul 21, 2025
b03cabd
add download links to seqweaver
rssealfon Jul 21, 2025
13eb281
add link
rssealfon Jul 22, 2025
6d72baa
add images
rssealfon Jul 22, 2025
28aa515
Update deepsea.rst with revisions
j-funk Jul 22, 2025
27895cc
rename clever ->expectosc
rssealfon Jul 22, 2025
002a6b8
Delete docs/clever.rst
rssealfon Jul 22, 2025
f61916a
Merge branch 'hb-review-paper' into rachels-edits
j-funk Jul 22, 2025
4afd39f
Merge pull request #11 from rssealfon/rachels-edits
j-funk Jul 22, 2025
981347b
Add asdbrowser to deepsea.rst
j-funk Jul 22, 2025
a431dc6
Add asdbrowser images
j-funk Jul 22, 2025
f187b37
First draft of Use Cases
j-funk Jul 24, 2025
823a852
Update Footer
j-funk Jul 24, 2025
b67833b
merge index and usage
rssealfon Jul 28, 2025
d0bce35
update autism gene prediction citation
rssealfon Jul 28, 2025
68d2c0b
copyedit autism varient effect section heading
rssealfon Jul 28, 2025
678cec7
copyedit expectosc use case
rssealfon Aug 4, 2025
a239eaa
copyedit beluga use case
rssealfon Aug 4, 2025
e01d46a
add sentence about ability to predict impact for unseen variants
rssealfon Aug 11, 2025
60b56e8
copyedit index
rssealfon Aug 11, 2025
dce2718
copyedit
rssealfon Aug 11, 2025
2cac492
copyedit asd browser use case
rssealfon Aug 11, 2025
7a341e4
copyedit asd browser use case
rssealfon Aug 11, 2025
2cf2d34
copyedit module clustering use case
rssealfon Aug 11, 2025
df736d9
copyedit functional networks doc
rssealfon Aug 11, 2025
d24ee2b
copyedit modules doc
rssealfon Aug 11, 2025
4c55193
copyedit beluga, seqweaver docs
rssealfon Aug 11, 2025
a980bd5
add paper reference to ism
rssealfon Aug 11, 2025
8b9396e
copyedit citations
rssealfon Aug 11, 2025
b341d87
fix ExPecto bulk download description
rssealfon Aug 11, 2025
8a93ac0
Merge pull request #12 from rssealfon/hb-review-paper
j-funk Aug 12, 2025
fa169ea
Fix typos re: Rachel suggestion
j-funk Aug 12, 2025
d9aad37
Cleanup typos & copy errors
j-funk Aug 12, 2025
de3eb51
Update NetWAS links
j-funk Aug 12, 2025
34 changes: 34 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
# Build artifacts
docs/_build/
out/

# Virtual environments
venv/
env/
.venv/
.env/

# IDE files
.idea/
.vscode/
*.swp
*.swo
*~

# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so

# OS files
.DS_Store
Thumbs.db

# Claude.ai files
CLAUDE.md
.claude/

# Sphinx
.doctrees/
*.doctree
18 changes: 18 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,18 @@
# .readthedocs.yaml
# Read the Docs configuration file
version: 2

# Set the version of Python and other tools
build:
  os: ubuntu-22.04
  tools:
    python: "3.10"

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/conf.py # Path adjusted to docs/conf.py

# Optionally declare the Python requirements required to build your docs
python:
  install:
    - requirements: docs/requirements.txt
86 changes: 85 additions & 1 deletion README.rst
@@ -2,4 +2,88 @@
HumanBase documentation
=======================

This repository contains the documentation source files for `HumanBase <hb.flatironinstitute.org>`_. These files are Sphinx docs written with reStructuredText.
This repository contains the documentation source files for `HumanBase <https://hb.flatironinstitute.org>`_. These files are Sphinx docs written with reStructuredText.

Build Status
------------

Check the Read the Docs build status and documentation: https://app.readthedocs.org/projects/humanbase/

The live documentation is available at: https://humanbase.readthedocs.io/

**Preview Builds:** Read the Docs can build documentation for pull requests, but this needs to be triggered. To preview your changes before merging:

1. Create a pull request from your branch to ``master``
2. Go to https://app.readthedocs.org/projects/humanbase/ to trigger a build

Quick Start
-----------

Prerequisites
~~~~~~~~~~~~~

* Python 3.8 or higher
* pip (Python package installer)
* Make (for building documentation)

Local Development Setup
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    # Clone the repository
    git clone https://github.com/aaronkw/humanbase-docs.git
    cd humanbase-docs

    # Create and activate a virtual environment
    python3 -m venv venv
    source venv/bin/activate

    # Install dependencies
    pip install -r docs/requirements.txt

Building Documentation Locally
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    # Navigate to the docs directory
    cd docs

    # Clean previous builds (optional)
    make clean

    # Build HTML documentation
    make html

    # View the documentation
    open _build/html/index.html # On macOS
    # Or: python -m http.server -d _build/html 8000 # Then visit http://localhost:8000

Documentation Structure
-----------------------

The documentation is organized as follows:

* ``docs/`` - Main documentation directory

  * ``index.rst`` - Main table of contents and entry point
  * ``conf.py`` - Sphinx configuration file
  * ``requirements.txt`` - Python dependencies for building docs
  * ``img/`` - Images and diagrams used in documentation
  * Tool-specific documentation:

    * ``sei.rst`` - Sei/DeepSEA sequence-based predictions
    * ``beluga.rst`` - DeepSEA (Beluga) chromatin profile predictions
    * ``expecto.rst`` - ExPecto expression predictions
    * ``expectosc.rst`` - ExPectoSC variant effect predictions
    * ``netwas.rst`` - NetWAS network-based association studies
    * ``in-silico-mutagenesis.rst`` - In silico mutagenesis analysis

  * Network documentation:

    * ``functional-networks.rst`` - Functional gene networks
    * ``tissue-networks.rst`` - Tissue-specific networks

* ``.readthedocs.yaml`` - Read the Docs build configuration
* ``README.rst`` - This file
15 changes: 15 additions & 0 deletions docs/_includes/common-input-formats.rst
@@ -0,0 +1,15 @@
We support three types of input: VCF, FASTA, BED. If you want to predict effects of noncoding variants, use VCF format input. If you want to predict chromatin feature probabilities for DNA sequences, use FASTA format. If you want to specify sequences from the human reference genome, you can use BED format. See below for a quick introduction:

**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele.
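
As an illustration of the space-to-tab conversion mentioned above, the following is a minimal sketch. The helper name ``to_vcf_line`` is hypothetical and is not part of any HumanBase tool; it only assumes the five-column layout described in this section.

```python
def to_vcf_line(text: str) -> str:
    """Join whitespace-separated fields with tabs.

    Hypothetical helper: expects the five columns described above
    (chromosome, position, name, reference allele, alternative allele).
    """
    fields = text.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    return "\t".join(fields)


# Convert the minimal example from the text into a tab-separated line.
line = to_vcf_line("chr1 109817590 - G T")
print(repr(line))
```

Applying the helper to each line of a pasted variant list before submission avoids hand-editing the separators.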

**FASTA format** input should include sequences of |bp_length|\ bp length each. If a sequence is different from |bp_length|\ bp:

* **Note**: The prediction is for the center base of the input sequence
* **Longer sequences**: Only the center |bp_length|\ bp will be used
* **Shorter sequences**: Sequences shorter than |bp_length|\ bp will be padded with 'N' bases evenly on both sides

  - **Important**: We do not recommend using FASTA input smaller than |bp_length|\ bp unless it is very close (only a few bp off)
  - **Note**: This padding behavior is not recommended. N's were extremely rare in training data (only appearing in assembly gaps), and the model has not been evaluated with artificially padded sequences
  - **Strong recommendation**: Always provide sequences of exactly |bp_length|\ bp by including genomic flanking sequences
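
The cropping and padding rules above can be sketched as follows. This is an illustration of the described behavior, not the server's implementation: the function name ``fit_to_length`` is hypothetical, and the handling of an odd length difference (extra base on the right) is an assumption.

```python
def fit_to_length(seq: str, length: int = 2000) -> str:
    """Center-crop or N-pad a sequence to the target length.

    Hypothetical helper. Where the length difference is odd, the
    extra base goes on the right; that tie-break is an assumption.
    """
    extra = len(seq) - length
    if extra >= 0:
        # Longer (or equal): keep only the center `length` bases.
        start = extra // 2
        return seq[start:start + length]
    # Shorter: pad with 'N' as evenly as possible on both sides.
    missing = -extra
    left = missing // 2
    return "N" * left + seq + "N" * (missing - left)


print(fit_to_length("ACGTAC", 4))  # center crop of a longer sequence
print(fit_to_length("ACG", 6))     # padding of a shorter sequence
```

Running a check like this locally shows how many ``N`` bases a short input would receive, which helps decide whether to extend it with genomic flanking sequence instead.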

**BED format** provides another way to specify sequences in the human reference genome. A minimal example is |bed_example|. The three columns are chromosome, start position, and end position.
3 changes: 3 additions & 0 deletions docs/_includes/common-submission-info.rst
@@ -0,0 +1,3 @@
Large submissions
~~~~~~~~~~~~~~~~~
We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly.
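
One way to split a large set into submissions of fewer than 10,000 entries is sketched below. The function name ``chunk_records`` is hypothetical, and each list element is assumed to be one complete record (one variant line, or one whole FASTA entry).

```python
from typing import Iterator, List


def chunk_records(records: List[str], max_size: int = 9999) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `max_size` records.

    Hypothetical helper: the default keeps every batch under 10,000.
    """
    for start in range(0, len(records), max_size):
        yield records[start:start + max_size]


# Example: 25,000 placeholder records split into three batches.
batches = list(chunk_records([str(i) for i in range(25000)]))
print([len(batch) for batch in batches])
```

Each batch can then be written to its own file and submitted separately.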
23 changes: 23 additions & 0 deletions docs/asdbrowser.rst
@@ -0,0 +1,23 @@
==============
ASD Browser
==============

Introduction
------------

The ASD browser allows exploration of predicted impact (chromatin effects and post-transcriptional RNA-binding protein impact) of de novo noncoding variants in autism probands in the `Simons Simplex Collection <https://www.sfari.org/resource/simons-simplex-collection/>`_.

The variant prediction approach is described in the following manuscript: Zhou J, Park CY, Theesfeld CL, Wong AK, Yuan Y, Scheckel C, Fak JJ, Funk J, Yao K, Tajima Y, Packer A, Darnell RB, Troyanskaya OG. (2019). `Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk <https://www.nature.com/articles/s41588-019-0420-0>`_. Nature Genetics.

Description
-----------

This website provides a user-friendly interactive interface for exploring the sequence-based predicted effects of SSC ASD proband mutations. It shows both individual molecular-level effects, at the chromatin (“DNA”) level and the RNA-binding protein (“RNA”) level, and Disease Impact Scores that summarize those molecular-level effects. The methodology and analysis are described in the manuscript “Whole-genome deep learning analysis reveals causal role of noncoding mutations in autism”.

.. image:: img/genome_browser.png

The Genome browser can be navigated by entering a genomic interval or a gene name, or interactively by zooming in/out and scrolling. The tracks “DNA Disease Impact Score” and “RNA Disease Impact Score” show the mutation disease impact scores (DIS) from the DNA and RNA models, respectively. The DIS summarize molecular-level biochemical effects at the DNA and RNA levels into two scores, based on regularized logistic regression classifiers trained with HGMD mutations.

.. image:: img/genome_heatmap.png

Individual molecular-level biochemical effects are shown as a heatmap. The biochemical features are sorted by the magnitude of the predicted effects of the center mutation. Each mutation may be clicked to center the genome browser and the heatmap at that mutation, or the heatmap may be dragged to alter the center mutation. The user can select “DNA features” or “RNA features” from the dropdown menu. Mousing over any individual prediction in the heatmap will display details in a tooltip.
93 changes: 14 additions & 79 deletions docs/beluga.rst
@@ -1,89 +1,31 @@
=======
DeepSEA (Beluga)
Beluga (DeepSEA)
=======

Introduction
------------

DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants.
DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. Importantly, this framework is trained without using any variant data, allowing it to predict the chromatin impact of any variant, including rare or previously unseen ones. The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features.

The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features. Beluga is described in:
Beluga is described in: Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, `Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk <https://www.nature.com/articles/s41588-018-0160-6>`_. Nature Genetics (2018).

Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, **Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk**. Nature Genetics (2018).

To determine if certain features (ie. transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table <https://s3-us-west-2.amazonaws.com/humanbase-dev/deepsea/examples/41588_2019_420_MOESM9_ESM.csv>`_ which has all the profiles used to train DeepSEA.
DeepSEA is originally described in the following manuscript: Jian Zhou, Olga G. Troyanskaya. `Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model <https://www.nature.com/articles/nmeth.3547>`_ Nature Methods (2015).

DeepSEA is originally described in the following manuscript:

Jian Zhou, Olga G. Troyanskaya. **Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model.** Nature Methods (2015).

To determine if certain features (ie. transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table <https://s3-us-west-2.amazonaws.com/humanbase-dev/deepsea/examples/41588_2019_420_MOESM9_ESM.csv>`_ which has all the profiles used to train DeepSEA.
To determine if certain features (ie. transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table <https://s3-us-west-2.amazonaws.com/humanbase-dev/deepsea/examples/41588_2019_420_MOESM9_ESM.csv>`_ which has all the profiles used to train Beluga.


Input
-----

DeepSEA predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be utilized for predicting chromatin features for any DNA sequence.

File formats
~~~~~~~~~~~~
We support three types of input: vcf, fasta, bed. If you want to predict effects of noncoding variants, use vcf format input. If you want to predict chromatin feature probabilities for DNA sequences, use fasta format. If you want to specify sequences from the human reference genome (GRCh37/hg19), you can use bed format. See below for a quick introduction:

**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele.

**Fasta format** input should include sequences of 2000bp length each. If a sequence is longer than 2000bp, only the center 2000bp will be used. A minimal example is ::

>TestSequence
TGGGATTACAGGCGTGAGCCACCGCGCCCGGCCCATTGTACCATTCTTAT
GCCTTTGCGTCCTCATAGCTTAGCTCCCGTATATCAGTGAGAACATACTA
TGTTTGGTTTTCCATACCCGAGTTACTTCACTTAGAATAATAGTCTCCAA
TTTCATCCAGGTCAGTGCAAATGCGTTAATTCGTTCCTTTTATGGCTGAG
TAGTATTCCATCATATATATATACTACAGTTTCTTTATCCACTCGTAAAT
TGATGGGCATTTGTGTTGGAACACTTCTCCACTGCTGGTGGGAATGTAAA
TTAGTGCAGCCACTATGGATAACAGTGTGGAGATTTGTTAAAGAACTAAA
ACTAGAACTACCATTTGATCCAGCAATCCCACTACTGGGTATCTACCCAG
AAGAAAAGAAGTCATTATTTGAAAAAGATACTTGCACGGGCATGTTTATA
GCAGCACAATTCACAATTGTAGTTGTATTTCTTTAAGCGTGTCTTTTCAA
TATCTCTCATGTTTCTGGTATAGATGGTATATATGTTAATCTTGTTCCTG
AGGTCTGTTTTTTATTTTTGTCATTAAAGTGGGAATTAAATAGTTTTGTA
GTGCATATAAATTAAAGAAAAAGTTCACATAAGCATATTTGCCAATCATC
TCAAAATGCTATATTCTCCTTCACGGTTTTGAAAATAATTCAGGGTTTTC
TCTTCCTCATTGCTTTCCCACCAACTGACAGTATTATTTTCTTAGTCATT
TTACTGACCTTTGAAATTACTCCTTTGAGGTCTTCTAAAAAATTTTATGG
GCTCTGCTGCTTTTTGGTGGCCTCCTTGTATCATTTATTCTATTACAGGA
CGACTTACAAAAGGAAGCACATAAATTGACCCATATACATATCCTATCAT
TGGGGAGTTTCTGTGCAAATGTTATTTATTGGAAGCTATTACTAAGAATT
GTAAGAAAAATAATTGGTATTGATGCAGCTAGTATGGTTCCTGTAATTAT
CGTACTCAGCCACGTAAATCATAGCTATATGTAGCCAAAGATCCATGAAC
AAAATTTCCAGTAACATCATTATAATTCAAAAGGCAGACTTTCAGAACCA
GACAGACTTGAATTTAAATTCTAGCTTTACCACACATGAATTTAACCTTG
TGGAAGGTTAACCTATCTAAACTCATGTTTCTTCATTGGTAGCTGATAAA
ATTAAGGATCATGTATATAACCACCTAGTAGAGTTGTTTAAGAAACTGTT
AGAATTCCATAAATTGTTAGTATTAATGAGTTTTTGTTGGACATGTGTTA
GGCTAGGCCACTCCTTGACCTTCATAGAGGTATGGATTATGACACAAATT
CTAAACTGTAGGTAGGCATGGCTTTGTAGCAAGTATTAAAATAGTAAATA
TTTTATTTTTATAAGATAAATGTAAACCTTTTAAAAGTTTCATTACATTT
GTATTTATGAAATATCATCCTATATCAACTATAGAGAGAAGATCGCAAGA
AGGCAGTGGCAGCAGAGGCTCCAGTTAGGAGGCTACTAGTCCAAATACAT
TGCGATAAAAACTTGGCAAAAGGTGCTGGTAGTCTGATGAAATAAAGTAG
ATAAATTTTAGAGGTATTTATAAAATAATTAAAGAATATTCAATAATAGG
AGATATATTACCCAATAGAGTGGAGATTCAAAGATAACTCCGAAAGTTTT
TTGCTAAAGCAACATTTGGCTGTGCTATCATTTACTAAGAAAGACAACAA
GAGAGTAAAATCAAGTTTGAGGATGAAGTGAATTTATTCCTTTTTGATTG
ATACATAATTGACATGTAATAAAACCCACAATGTTAAGAGTTCGGTTTGA
TGTGCTTGACTATTTTAGGCACTGGTGTTATCACAACACAAGACAACAGA
TAGGACATTCTCAGAAAATTTTTTCATGTCCCTTTCCAGTCAGTTTCAAG
CCTTCTTTCCATGCAATAATTTTCTCACTTTGCCATTCTAGTAGGTGTGA

**Bed format** provides another way to specify sequences in human reference genome (hg19). The bed input should specify 2000bp-length regions. A minimal example is ``chr1 109817091 109819090``. The three columns are chromosome, start position, and end position.

Genome coordinates
~~~~~~~~~~~~~~~~~~
We support only ``GRCh37/hg19`` genome coordinates. You can use LiftOver to convert your coordinates to the correct version.

Large submissions
~~~~~~~~~~~~~~~~~
We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly.
Beluga predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types).

.. |bp_length| replace:: 2000
.. |bed_example| replace:: ``chr5 134871851 134871852``

.. include:: _includes/common-input-formats.rst

.. include:: _includes/common-submission-info.rst

Output
------
@@ -97,7 +39,7 @@ Regulatory feature scores
Variant scores
~~~~~~~~~~~~~~

* **Disease Impact Score (DIS)**: DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted transcriptional or post-transcriptional regulatory effects of these mutations (See Zhou et. al, 2019). The predicted DIS probabilities are then converted into 'DIS e-values', computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is:
* **Disease Impact Score (DIS)**: DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted transcriptional or post-transcriptional regulatory effects of these mutations (See `Zhou et. al, 2019 <https://www.nature.com/articles/s41588-019-0420-0>`_). The predicted DIS probabilities are then converted into 'DIS e-values', computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is:

.. math::
-log10(DIS evalue_{feature})
@@ -107,10 +49,3 @@ Variant scores
.. math::
\sum -log10(evalue_{feature}) / N
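
The two formulas in this section can be checked numerically with a short sketch. The function names are hypothetical and the e-values are made-up inputs for illustration; this is not the code used to produce the published scores.

```python
import math
from typing import Sequence


def dis_score(dis_evalue: float) -> float:
    """-log10 of a DIS e-value, as in the first formula."""
    return -math.log10(dis_evalue)


def mean_neg_log10_evalue(evalues: Sequence[float]) -> float:
    """Sum of -log10(e-value) over features, divided by N, as in the second formula."""
    return sum(-math.log10(e) for e in evalues) / len(evalues)


# Illustrative inputs only: smaller e-values give larger scores.
print(dis_score(0.01))
print(mean_neg_log10_evalue([0.1, 0.001]))
```

Because both scores are negative logarithms of e-values, a higher score corresponds to a predicted effect that is rarer in the reference distribution.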

In-silico mutagenesis
---------------------
Perform "In silico saturated mutagenesis" (ISM) analysis to discover informative sequence features within any sequence. Specifically, it performs computational mutation scanning to assess the effect of mutating every base of the input sequence on chromatin feature predictions. This method for context-specific sequence feature extraction takes advantage of DeepSEA’s ability to utilize flanking context sequences information.

Note that ISM only accepts a sequence (FASTA file) as input.

ISM outputs effects for each of three possible substitutions of all 2000 bases, across all chromatin features.