104 changes: 28 additions & 76 deletions docs/seqweaver.rst
@@ -10,96 +10,48 @@ obtained from CLIP-seq experiments. Seqweaver consists of 232 RBP models spannin
databases) at single-nucleotide resolution.

Seqweaver is described in:
Park CY, Zhou J, Wong AK, Chen KM, Theesfeld CL, Darnell RB, Troyanskaya OG. Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder
risk. Nat Genet. 2021 Feb;53(2):166-173.
Park CY, Zhou J, Wong AK, Chen KM, Theesfeld CL, Darnell RB, Troyanskaya OG. `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk <https://www.nature.com/articles/s41588-020-00761-3>`_. Nat Genet. 2021 Feb;53(2):166-173.


Input
-----

.. |bp_length| replace:: 1000
.. |bed_example| replace:: ``chr1 109817090 109818090``

.. include:: _includes/common-input-formats.rst


A minimal FASTA example of 4 sequences is shown below. Each sequence entry consists of:

- A header line starting with ``>`` followed by a sequence identifier/description
- The DNA sequence (A, C, G, T letters) on subsequent lines

Example::

>known_CEBP_binding_increase-GtoT__chr1_109817091_109818090
GTGCCTCTGGGAGGAGAGGGACTCCTGGGGGGCCTGCCCCTCATACGCCATCACCAAAAGGAAAGGACAAAGCCACACGC
AGCCAGGGCTTCACACCCTTCAGGCTGCACCCGGGCAGGCCTCAGAACGGTGAGGGGCCAGGGCAAAGGGTGTGCCTCGT
CCTGCCCGCACTGCCTCTCCCAGGAACTGGAAAAGCCCTGTCCGGTGAGGGGGCAGAAGGACTCAGCGCCCCTGGACCCC
CAAATGCTGCATGAACACATTTTCAGGGGAGCCTGTGCCCCCAGGCGGGGGTCGGGCAGCCCCAGCCCCTCTCCTTTTCC
TGGACTCTGGCCGTGCGCGGCAGCCCAGGTGTTTGCTCAGTTGCTGACCCAAAAGTGCTTCATTTTTCGTGCCCGCCCCG
CGCCCCGGGCAGGCCAGTCATGTGTTAAGTTGCGCTTCTTTGCTGTGATGTGGGTGGGGGAGGAAGAGTAAACACAGTGC
TGGCTCGGCTGCCCTGAGGTTGCTCAATCAAGCACAGGTTTCAAGTCTGGGTTCTGGTGTCCACTCACCCACCCCACCCC
CCAAAATCAGACAAATGCTACTTTGTCTAACCTGCTGTGGCCTCTGAGACATGTTCTATTTTTAACCCCTTCTTGGAATT
GGCTCTCTTCTTCAAAGGACCAGGTCCTGTTCCTCTTTCTCCCCGACTCCACCCCAGCTCCCTGTGAAGAGAGAGTTAAT
ATATTTGTTTTATTTATTTGCTTTTTGTGTTGGGATGGGTTCGTGTCCAGTCCCGGGGGTCTGATATGGCCATCACAGGC
TGGGTGTTCCCAGCAGCCCTGGCTTGGGGGCTTGACGCCCTTCCCCTTGCCCCAGGCCATCATCTCCCCACCTCTCCTCC
CCTCTCCTCAGTTTTGCCGACTGCTTTTCATCTGAGTCACCATTTACTCCAAGCATGTATTCCAGACTTGTCACTGACTT
TCCTTCTGGAGCAGGTGGCTAGAAAAAGAGGCTGTGGGCA
>known_FOXA2_binding_decrease-ReferenceAllele__chr10_23507864_23508863
CTTCTTTTTATCTCTTAACTAACTTACAATTTCTTACGTGATTTTAAAACTTGTTTTTCTATTTAAAACAACAGGGGCAA
CTGAACTTCACTTTCAAACAATATTTATTTCTATAAATCAGTGCAAAACATACTTATTGAAAATATATCTTGGGTCCAAG
GCTTCAAAGGGTAAAAAGAAAGATTTTAAATTATATCTAATATGTTACAATTGTTCTGTCCTTTAAAAACCTTTTCAGAT
CACCCCCTGGATGATTCTTCCCTAGAAGTCTCAGAGAATTAACAACACAATGTAATCTAGGTTTAAATTTGGGTTTCTCC
TGTGTTTCAGATACTGATGTTTGAGCTTTCTCTTCCTGACAAGCCACTTAAAGAGTCACTGTTACTTTGAGGTTTTATCT
GTAAGATTCGTGTCTTTTGGGCTCATTAAGAACATTTCCAAAGATTACAATGTCAATAGCACCTAATTACTGGACTGTGA
GAAAGGTCTTCTTGAGTACATAAAATCTGTGGCAGTGCACAGTACACAATGGGCAGCTCAGATCCCAAATTTTATCACAA
GTAAGTAGCAAACAAATTAATAATGTTACCTGTGCTCTCTTGGATAATTACTACTGCATAAAAACTGCTTTGAAATGTTG
CAGATAGTATTGTACCTCATTTTTTTAATCCCCTTAGAGTAACAAGGATTTATTTGTCTCAAACTTTCTATGTTGCATGC
ACCACTTGACTTTCTTGTTCTGTTTAGAATTTTTAGAACTTGCAACATAACAAAAAATCATTTTTAACCAGCCTAGGAAG
GACATATCACCTGATGTAACATTATTTTAAATTATATTTTGTATTTTACTTTACTCTTTTCAAAACATATACTGTATGTT
TTGATACTATTGCTAGATTTTATTTTTTACTTATGCCTGGTAGAAAATCAGCTATTAAAGAAGCAGAGGAGGCTGGACAC
AGTGGTTCATGTCTGTAATCGCTAGCACTTTGAAAGAGTA
>known_GATA1_binding_increase-TtoC__chr16_209210_210209
GGGCTTAGACAGAGGAGGGGAGGATTCAGATTTTAAATGGGTTGGCCACTGTAGGTCTATTAACGTGGTGACATTTGAGG
GAGTGGCAATACTAGGGAAGGGGCTTCAGGGGAGTGGCCAGGAGCTAGGGATAGAGGGAGGGAGGACAGGAGGCCTTGTC
TGTCTTTTCCTCCATATGTAAGTTTCAGGAGTGAGTGGGGGGTGTCGAGGGTGCTGTGCTCTCCGGCCTGAGCCTCAGGA
AGGAAGGGCAGTAGTCAGGGATGCCAGGGAAGGACAGTGGAGTAGGCTTTGTGGGGAACTTCACGGTTCCATTGTTGAGA
TGATTTGCTGGAGACACACAGATGAGGACATCAAATACATCCCTGGATCAGGCCCTGGGGCCTGAGTCCGGAAGAGAGGT
CTGTATGGACACACCCATCAATGGGAGCACCAGGACACAGATGGAGGCTAATGTCATGTTGTAGACAGGATGGGTGCTGA
GCTGCCACACCCACATTATCAGAAAATAACAGCACAGGCTTGGGGTGGAGGCGGGACACAAGACTAGCCAGAAGGAGAAA
GAAAGGTGAAAAGCTGTTGGTGCAAGGAAGCTCTTGGTATTTCCAATGGCTTGGGCACAGGCTGTGAGGGTGCCTGGGAC
GGCTTGTGGGGCACAGGCTGCAAGAGGTGCCCAGGACGGCTTGTGGGGCACAGGTTGTGAGAGGTGCCCTGGACGGCTTG
TGGGGCACAGGCTGTGAGAGGTGCCCAGGACGGCTTGTGGGGCACAGGCTGTGAGGGTGCCCGGGACGGCTTGTGGGGCA
CAGGTTGTGAGAGGTGCCCGGGACGGCTTGTGGGGCACAGGTTTCAGAGGTGCCCGGGACGGCTTGTGGGGCACAGGTTG
TGAGAGGTGCCCGGGACGGCTTGTGGGACACAGGTTGTGAGAGGTGCCTGGGACGGCTTGTGGGGCACAGGCTGTGAGGG
TGCCTGGGACGGCTTGTGGGGCACAGGTTGTGAGAGGTGC
>known_FOXA1_binding_increase-CtoT_chr16_52598689_52599688
GGCTCAAGCAGTCCTCCCATCTAGGCTTCCCAAAATGCTGGGATTACAGACATGAGCCACTGCACCCAGCCACAAAGATA
ACCTAAAGATGTGTTTACTTTGACCCAGGCAGTAGTTTAAAAAAGTTTTAATTTGTTGTTCACATTTAAAAACTGGACAA
TTTCTACATAAAAATCTGAATTACTCATGTCTCTTAAAAAAATAACATCTAGCAATGGTAGGCCCACATTCCTTCCTGAA
AATAATTAGCTGGGAAAGAGTAGGGACTGACCCCTTTAGACACGGTATAAATAGCATGGGAGTTGATCAGTAAATATTTG
CTGAATGAAAGAATACATGAATGAAAAGTCAGAGCCCTATAGGTCAGCATGGACGGCGGTAAAGGAACCTGGCTGAGCCT
GAAAGAGAATGTGATCTAAGATTAAATCCAGGATATGCTGGTAAATGTTTAACAGCCAACTCTTTGGGGAGGAAAAAAGT
CCCAATTTGTAGTGTTTGCTGATTATTGTGATGTAAATACTCCCATCATGACCAATTTCAAGCTACCAACATGCTGACAC
TGAACTTGGAGTTGGAAGGAGATGAACAGGCATAATCAGGTCTCGTGAGATGGCCCAAGCCGGCCCCAGCACTCCACTGT
TATATATGAGGCTAGAATTACTACATAACTGGAATAGCAACTTTCTGGACCATATGCCTGGAACACAGCAGGTGCTGAAT
AAATGTTTGTTGATCCAGGAACTGACTGTGTTGAAGCCCACAGATGGGAAATCAGTAGAAGGCAGGTAAGAGTAAAAAGA
AGGGCAGAGAATTGGGGGTACAGACCCCTGAACCATAAGTCAGAGGAATGTTGTACATGTTTTCAGATCCCTCACTGGTC
AAATGAAGGCAAAGGGTTAGATCTCTCCAAATCTTTAGAGGGACATGATGTAACTCCATTAAGTAACTCAGTGATTTTCA
ACATTAAAAAGTGTAATTATCTTTTCAAACTAAATATTAC

.. include:: _includes/common-submission-info.rst
We support three types of input: VCF, FASTA, BED. If you want to predict effects of noncoding variants, use VCF format input. If you want to predict RBP interaction probabilities directly from transcript sequences, you can use FASTA format. If you want to specify sequences from the human reference genome, you can use BED format.

Examples of all input formats are available in the job submission interface. See below for a quick introduction:

**VCF format** is used for specifying a genomic variant. A minimal example is ``chr8 38120276 [QKI disruption] C A +`` (if you want to copy this text as input, you will need to change the spaces between columns to tabs). The six columns are chromosome, position, name, reference allele, alternative allele, and strand. Note that strand must be specified for Seqweaver but not for the other deep learning models.
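
Because the name column in the example contains a space, replacing every space with a tab would split it into two columns. The following is a minimal sketch of building one tab-separated input line from the six fields described above. The function name ``vcf_line`` is hypothetical and is not part of the Seqweaver tooling, and restricting strand to ``+`` or ``-`` is an assumption.

```python
def vcf_line(chrom: str, pos: int, name: str, ref: str, alt: str, strand: str) -> str:
    """Join the six columns with tabs (hypothetical helper, not a Seqweaver API)."""
    # Assumption: strand is given as "+" or "-".
    if strand not in {"+", "-"}:
        raise ValueError("strand must be '+' or '-'")
    # Joining the fields explicitly keeps spaces inside the name column intact.
    return "\t".join([chrom, str(pos), name, ref, alt, strand])


line = vcf_line("chr8", 38120276, "[QKI disruption]", "C", "A", "+")
print(line)
```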

**FASTA format** input should include sequences of 1000 bp length each. If a sequence is different from 1000 bp:

* **Note**: The prediction is for the center base of the input sequence
* **Longer sequences**: Only the center 1000 bp will be used
* **Shorter sequences**: Sequences shorter than 1000 bp will be padded with 'N' bases evenly on both sides

- **Important**: We do not recommend using FASTA input smaller than 1000 bp unless it is very close (only a few bp off)
- **Note**: This padding behavior is not recommended. N's were extremely rare in training data (only appearing in assembly gaps), and the model has not been evaluated with artificially padded sequences
- **Strong recommendation**: Always provide sequences of exactly 1000 bp by including genomic flanking sequences
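
The length handling described above can be sketched as follows. This illustrates the stated behavior and is not the server's actual implementation: the function name ``fit_to_window`` is hypothetical, and how an odd number of padding bases or an odd surplus is divided between the two sides is an assumption.

```python
def fit_to_window(seq: str, window: int = 1000) -> str:
    """Center-crop or N-pad a sequence to `window` bases (illustrative sketch only)."""
    n = len(seq)
    if n > window:
        # Longer sequences: keep only the center `window` bases.
        start = (n - window) // 2
        return seq[start:start + window]
    # Shorter sequences: pad with 'N' on both sides.
    # Assumption: when the padding is odd, the extra 'N' goes on the right.
    pad = window - n
    left = pad // 2
    return "N" * left + seq + "N" * (pad - left)


print(len(fit_to_window("A" * 1200)))
print(fit_to_window("ACGT", window=8))
```

A check such as ``len(seq) == 1000`` before submission avoids relying on the padding path at all, in line with the recommendation above.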

**BED format** provides another way to specify sequences in the human reference genome (hg19). The BED input should specify regions 1000 bp in length. A minimal example is ``chr1 109817090 109818091 . 0 -``. The columns are chromosome, start position, end position, name, score, and strand.


Large submissions
~~~~~~~~~~~~~~~~~
We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly.
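
As a minimal sketch of the splitting suggestion above, the input records can be divided into batches that each stay under 10,000 entries. The function name ``split_into_batches`` and the default batch size of 9,999 are assumptions made for illustration.

```python
from typing import Iterator, List, Sequence


def split_into_batches(records: Sequence[str], max_size: int = 9_999) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `max_size` records (hypothetical helper)."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    for start in range(0, len(records), max_size):
        yield list(records[start:start + max_size])


# Example: 25,000 placeholder records become three submissions.
batches = list(split_into_batches([str(i) for i in range(25_000)]))
print([len(b) for b in batches])
```

Each batch can then be written to its own input file and submitted separately.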


Output
------

Variant scores
~~~~~~~~~~~~~~

**Disease impact score:** DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted post-transcriptional regulatory effects of these mutations (See Zhou et. al, 2019). The predicted DIS probabilities are then converted into 'DIS e-values', computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is:
**Disease impact score:** DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted post-transcriptional regulatory effects of these mutations (see `Zhou et al., 2019 <https://pubmed.ncbi.nlm.nih.gov/31133750/>`_). The predicted DIS probabilities are then converted into DIS e-values, computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is:

.. math::
    -\log_{10}(\text{DIS e-value}_{feature})

**Mean -log e-value:** For each predicted regulatory feature effect (:math:`abs(p_{alt}-p_{ref})`) of a variant, we calculate a 'feature e-value' based on the empirical distribution of that feature’s effects among gnomAD variants (see Molecular-level biochemical effects prediction: e-value). The MLE score of a variant is
**Mean -log e-value:** For each predicted regulatory feature effect (:math:`abs(p_{alt}-p_{ref})`) of a variant, we calculate a feature e-value based on the empirical distribution of that feature’s effects among gnomAD variants (see Molecular-level biochemical effects prediction: e-value). The MLE score of a variant is

.. math::
    \sum{-\log_{10}(\text{e-value}_{feature})}/N
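
The MLE formula above can be written as a short function. This is a sketch under the assumption that the per-feature e-values of one variant are already available as positive floats. The function name ``mean_neg_log_evalue`` is hypothetical, and the values in the example are made up for illustration.

```python
import math
from typing import Sequence


def mean_neg_log_evalue(evalues: Sequence[float]) -> float:
    """Mean of -log10(e-value) over N features (illustrative sketch)."""
    if not evalues:
        raise ValueError("at least one feature e-value is required")
    # Sum -log10 over all features, then divide by the number of features N.
    return sum(-math.log10(e) for e in evalues) / len(evalues)


# Made-up e-values for two features of one variant.
print(mean_neg_log_evalue([0.1, 0.001]))
```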
@@ -115,5 +67,5 @@ Molecular-level biochemical effects prediction

See also
--------
* :doc:`sei` - Latest model with 4096bp input sequences
* :doc:`sei` - Latest chromatin and regulatory impact model with 4096bp input sequences
* :doc:`beluga` - 2019 DeepSEA model with 2000bp input sequences