From f437d58a4cb0b6c77c9a16ae0893436fdd992a72 Mon Sep 17 00:00:00 2001
From: rssealfon
Date: Mon, 14 Jul 2025 12:21:10 -0400
Subject: [PATCH 1/2] update seqweaver doc

---
 docs/seqweaver.rst | 102 ++++++++++++---------------------------------
 1 file changed, 26 insertions(+), 76 deletions(-)

diff --git a/docs/seqweaver.rst b/docs/seqweaver.rst
index e9db6f846..9674405a6 100644
--- a/docs/seqweaver.rst
+++ b/docs/seqweaver.rst
@@ -10,83 +10,33 @@ obtained from CLIP-seq experiments. Seqweaver consists of 232 RBP models spannin
 databases) at single-nucleotide resolution.
 Seqweaver is described in:
-Park CY, Zhou J, Wong AK, Chen KM, Theesfeld CL, Darnell RB, Troyanskaya OG. Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder
-risk. Nat Genet. 2021 Feb;53(2):166-173.
+Park CY, Zhou J, Wong AK, Chen KM, Theesfeld CL, Darnell RB, Troyanskaya OG. `Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk `_. Nat Genet. 2021 Feb;53(2):166-173.
+
 Input
 -----
-.. |bp_length| replace:: 1000
-.. |bed_example| replace:: ``chr1 109817090 109818090``
-
-.. include:: _includes/common-input-formats.rst
-
-
-A minimal FASTA example of 4 sequences is shown below. Each sequence entry consists of:
-
-- A header line starting with ``>`` followed by a sequence identifier/description
-- The DNA sequence (A, C, G, T letters) on subsequent lines
-
-Example::
-
- >known_CEBP_binding_increase-GtoT__chr1_109817091_109818090
- GTGCCTCTGGGAGGAGAGGGACTCCTGGGGGGCCTGCCCCTCATACGCCATCACCAAAAGGAAAGGACAAAGCCACACGC
- AGCCAGGGCTTCACACCCTTCAGGCTGCACCCGGGCAGGCCTCAGAACGGTGAGGGGCCAGGGCAAAGGGTGTGCCTCGT
- CCTGCCCGCACTGCCTCTCCCAGGAACTGGAAAAGCCCTGTCCGGTGAGGGGGCAGAAGGACTCAGCGCCCCTGGACCCC
- CAAATGCTGCATGAACACATTTTCAGGGGAGCCTGTGCCCCCAGGCGGGGGTCGGGCAGCCCCAGCCCCTCTCCTTTTCC
- TGGACTCTGGCCGTGCGCGGCAGCCCAGGTGTTTGCTCAGTTGCTGACCCAAAAGTGCTTCATTTTTCGTGCCCGCCCCG
- CGCCCCGGGCAGGCCAGTCATGTGTTAAGTTGCGCTTCTTTGCTGTGATGTGGGTGGGGGAGGAAGAGTAAACACAGTGC
- TGGCTCGGCTGCCCTGAGGTTGCTCAATCAAGCACAGGTTTCAAGTCTGGGTTCTGGTGTCCACTCACCCACCCCACCCC
- CCAAAATCAGACAAATGCTACTTTGTCTAACCTGCTGTGGCCTCTGAGACATGTTCTATTTTTAACCCCTTCTTGGAATT
- GGCTCTCTTCTTCAAAGGACCAGGTCCTGTTCCTCTTTCTCCCCGACTCCACCCCAGCTCCCTGTGAAGAGAGAGTTAAT
- ATATTTGTTTTATTTATTTGCTTTTTGTGTTGGGATGGGTTCGTGTCCAGTCCCGGGGGTCTGATATGGCCATCACAGGC
- TGGGTGTTCCCAGCAGCCCTGGCTTGGGGGCTTGACGCCCTTCCCCTTGCCCCAGGCCATCATCTCCCCACCTCTCCTCC
- CCTCTCCTCAGTTTTGCCGACTGCTTTTCATCTGAGTCACCATTTACTCCAAGCATGTATTCCAGACTTGTCACTGACTT
- TCCTTCTGGAGCAGGTGGCTAGAAAAAGAGGCTGTGGGCA
- >known_FOXA2_binding_decrease-ReferenceAllele__chr10_23507864_23508863
- CTTCTTTTTATCTCTTAACTAACTTACAATTTCTTACGTGATTTTAAAACTTGTTTTTCTATTTAAAACAACAGGGGCAA
- CTGAACTTCACTTTCAAACAATATTTATTTCTATAAATCAGTGCAAAACATACTTATTGAAAATATATCTTGGGTCCAAG
- GCTTCAAAGGGTAAAAAGAAAGATTTTAAATTATATCTAATATGTTACAATTGTTCTGTCCTTTAAAAACCTTTTCAGAT
- CACCCCCTGGATGATTCTTCCCTAGAAGTCTCAGAGAATTAACAACACAATGTAATCTAGGTTTAAATTTGGGTTTCTCC
- TGTGTTTCAGATACTGATGTTTGAGCTTTCTCTTCCTGACAAGCCACTTAAAGAGTCACTGTTACTTTGAGGTTTTATCT
- GTAAGATTCGTGTCTTTTGGGCTCATTAAGAACATTTCCAAAGATTACAATGTCAATAGCACCTAATTACTGGACTGTGA
- GAAAGGTCTTCTTGAGTACATAAAATCTGTGGCAGTGCACAGTACACAATGGGCAGCTCAGATCCCAAATTTTATCACAA
- GTAAGTAGCAAACAAATTAATAATGTTACCTGTGCTCTCTTGGATAATTACTACTGCATAAAAACTGCTTTGAAATGTTG
- CAGATAGTATTGTACCTCATTTTTTTAATCCCCTTAGAGTAACAAGGATTTATTTGTCTCAAACTTTCTATGTTGCATGC
- ACCACTTGACTTTCTTGTTCTGTTTAGAATTTTTAGAACTTGCAACATAACAAAAAATCATTTTTAACCAGCCTAGGAAG
- GACATATCACCTGATGTAACATTATTTTAAATTATATTTTGTATTTTACTTTACTCTTTTCAAAACATATACTGTATGTT
- TTGATACTATTGCTAGATTTTATTTTTTACTTATGCCTGGTAGAAAATCAGCTATTAAAGAAGCAGAGGAGGCTGGACAC
- AGTGGTTCATGTCTGTAATCGCTAGCACTTTGAAAGAGTA
- >known_GATA1_binding_increase-TtoC__chr16_209210_210209
- GGGCTTAGACAGAGGAGGGGAGGATTCAGATTTTAAATGGGTTGGCCACTGTAGGTCTATTAACGTGGTGACATTTGAGG
- GAGTGGCAATACTAGGGAAGGGGCTTCAGGGGAGTGGCCAGGAGCTAGGGATAGAGGGAGGGAGGACAGGAGGCCTTGTC
- TGTCTTTTCCTCCATATGTAAGTTTCAGGAGTGAGTGGGGGGTGTCGAGGGTGCTGTGCTCTCCGGCCTGAGCCTCAGGA
- AGGAAGGGCAGTAGTCAGGGATGCCAGGGAAGGACAGTGGAGTAGGCTTTGTGGGGAACTTCACGGTTCCATTGTTGAGA
- TGATTTGCTGGAGACACACAGATGAGGACATCAAATACATCCCTGGATCAGGCCCTGGGGCCTGAGTCCGGAAGAGAGGT
- CTGTATGGACACACCCATCAATGGGAGCACCAGGACACAGATGGAGGCTAATGTCATGTTGTAGACAGGATGGGTGCTGA
- GCTGCCACACCCACATTATCAGAAAATAACAGCACAGGCTTGGGGTGGAGGCGGGACACAAGACTAGCCAGAAGGAGAAA
- GAAAGGTGAAAAGCTGTTGGTGCAAGGAAGCTCTTGGTATTTCCAATGGCTTGGGCACAGGCTGTGAGGGTGCCTGGGAC
- GGCTTGTGGGGCACAGGCTGCAAGAGGTGCCCAGGACGGCTTGTGGGGCACAGGTTGTGAGAGGTGCCCTGGACGGCTTG
- TGGGGCACAGGCTGTGAGAGGTGCCCAGGACGGCTTGTGGGGCACAGGCTGTGAGGGTGCCCGGGACGGCTTGTGGGGCA
- CAGGTTGTGAGAGGTGCCCGGGACGGCTTGTGGGGCACAGGTTTCAGAGGTGCCCGGGACGGCTTGTGGGGCACAGGTTG
- TGAGAGGTGCCCGGGACGGCTTGTGGGACACAGGTTGTGAGAGGTGCCTGGGACGGCTTGTGGGGCACAGGCTGTGAGGG
- TGCCTGGGACGGCTTGTGGGGCACAGGTTGTGAGAGGTGC
- >known_FOXA1_binding_increase-CtoT_chr16_52598689_52599688
- GGCTCAAGCAGTCCTCCCATCTAGGCTTCCCAAAATGCTGGGATTACAGACATGAGCCACTGCACCCAGCCACAAAGATA
- ACCTAAAGATGTGTTTACTTTGACCCAGGCAGTAGTTTAAAAAAGTTTTAATTTGTTGTTCACATTTAAAAACTGGACAA
- TTTCTACATAAAAATCTGAATTACTCATGTCTCTTAAAAAAATAACATCTAGCAATGGTAGGCCCACATTCCTTCCTGAA
- AATAATTAGCTGGGAAAGAGTAGGGACTGACCCCTTTAGACACGGTATAAATAGCATGGGAGTTGATCAGTAAATATTTG
- CTGAATGAAAGAATACATGAATGAAAAGTCAGAGCCCTATAGGTCAGCATGGACGGCGGTAAAGGAACCTGGCTGAGCCT
- GAAAGAGAATGTGATCTAAGATTAAATCCAGGATATGCTGGTAAATGTTTAACAGCCAACTCTTTGGGGAGGAAAAAAGT
- CCCAATTTGTAGTGTTTGCTGATTATTGTGATGTAAATACTCCCATCATGACCAATTTCAAGCTACCAACATGCTGACAC
- TGAACTTGGAGTTGGAAGGAGATGAACAGGCATAATCAGGTCTCGTGAGATGGCCCAAGCCGGCCCCAGCACTCCACTGT
- TATATATGAGGCTAGAATTACTACATAACTGGAATAGCAACTTTCTGGACCATATGCCTGGAACACAGCAGGTGCTGAAT
- AAATGTTTGTTGATCCAGGAACTGACTGTGTTGAAGCCCACAGATGGGAAATCAGTAGAAGGCAGGTAAGAGTAAAAAGA
- AGGGCAGAGAATTGGGGGTACAGACCCCTGAACCATAAGTCAGAGGAATGTTGTACATGTTTTCAGATCCCTCACTGGTC
- AAATGAAGGCAAAGGGTTAGATCTCTCCAAATCTTTAGAGGGACATGATGTAACTCCATTAAGTAACTCAGTGATTTTCA
- ACATTAAAAAGTGTAATTATCTTTTCAAACTAAATATTAC
-
-.. include:: _includes/common-submission-info.rst
+We support three types of input: VCF, FASTA, BED. If you want to predict effects of noncoding variants, use VCF format input. If you want to predict RBP interaction probabilities directly from transcript sequences, you can use FASTA format. If you want to specify sequences from the human reference genome, you can use BED format. See below for a quick introduction:
+
+**VCF format** is used for specifying a genomic variant. A minimal example is ``chr8 38120276 [QKI disruption] C A +`` (if you want to copy this text as input, you will need to change spaces to tabs). The six columns are chromosome, position, name, reference allele, alternative allele, and strand. Note that strand must be specified for Seqweaver but not for the other deep learning models.
+
+**FASTA format** input should include sequences of 1000 bp length each. If a sequence is different from 1000 bp:
+
+* **Note**: The prediction is for the center base of the input sequence
+* **Longer sequences**: Only the center 1000 bp will be used
+* **Shorter sequences**: Sequences shorter than 1000 bp will be padded with 'N' bases evenly on both sides
+
+ - **Important**: We do not recommend using FASTA input smaller than 1000 bp unless it is very close (only a few bp off)
+ - **Note**: This padding behavior is not recommended. N's were extremely rare in training data (only appearing in assembly gaps), and the model has not been evaluated with artificially padded sequences
+ - **Strong recommendation**: Always provide sequences of exactly 1000 bp by including genomic flanking sequences
+
+**BED format** provides another way to specify sequences in human reference genome (hg19). The BED input should specify 1000 bp-length regions. A minimal example is ``chr1 109817090 109818091 . 0 -``. The columns are chromosome, start position, end position, name, score, and strand.
+
+
+Large submissions
+~~~~~~~~~~~~~~~~~
+We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly.
+
 Output
 ------
@@ -94,12 +44,12 @@ Output
 Variant scores
 ~~~~~~~~~~~~~~
-**Disease impact score:** DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted post-transcriptional regulatory effects of these mutations (See Zhou et. al, 2019). The predicted DIS probabilities are then converted into 'DIS e-values', computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is:
+**Disease impact score:** DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted post-transcriptional regulatory effects of these mutations (See `Zhou et al., 2019 `_). The predicted DIS probabilities are then converted into DIS e-values, computed based on the empirical distributions of predicted effects for gnomAD variants. The final DIS score is:
 .. math:: -\log_{10}(DIS\ e-value_{feature})
-**Mean -log e-value:** For each predicted regulatory feature effect (:math:`abs(p_{alt}-p_{ref})`) of a variant, we calculate a 'feature e-value' based on the empirical distribution of that feature’s effects among gnomAD variants (see Molecular-level biochemical effects prediction: e-value). The MLE score of a variant is
+**Mean -log e-value:** For each predicted regulatory feature effect (:math:`abs(p_{alt}-p_{ref})`) of a variant, we calculate a feature e-value based on the empirical distribution of that feature’s effects among gnomAD variants (see Molecular-level biochemical effects prediction: e-value). The MLE score of a variant is
 .. math:: \sum{-\log_{10}(e-value_{feature})}/N
@@ -115,5 +65,5 @@ Molecular-level biochemical effects prediction
 See also
 --------
-* :doc:`sei` - Latest model with 4096bp input sequences
+* :doc:`sei` - Latest chromatin and regulatory impact model with 4096bp input sequences
 * :doc:`beluga` - 2019 DeepSEA model with 2000bp input sequences

From d71f9be63c24920ba2026cf43f46625dc13e8105 Mon Sep 17 00:00:00 2001
From: rssealfon
Date: Mon, 14 Jul 2025 13:14:20 -0400
Subject: [PATCH 2/2] add info about input example availability

---
 docs/seqweaver.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/seqweaver.rst b/docs/seqweaver.rst
index 9674405a6..a238b4af4 100644
--- a/docs/seqweaver.rst
+++ b/docs/seqweaver.rst
@@ -16,7 +16,9 @@ Park CY, Zhou J, Wong AK, Chen KM, Theesfeld CL, Darnell RB, Troyanskaya OG. `Ge
 Input
 -----
-We support three types of input: VCF, FASTA, BED. If you want to predict effects of noncoding variants, use VCF format input. If you want to predict RBP interaction probabilities directly from transcript sequences, you can use FASTA format. If you want to specify sequences from the human reference genome, you can use BED format. See below for a quick introduction:
+We support three types of input: VCF, FASTA, BED. If you want to predict effects of noncoding variants, use VCF format input. If you want to predict RBP interaction probabilities directly from transcript sequences, you can use FASTA format. If you want to specify sequences from the human reference genome, you can use BED format.
+
+Examples of all input formats are available in the job submission interface. See below for a quick introduction:
 **VCF format** is used for specifying a genomic variant. A minimal example is ``chr8 38120276 [QKI disruption] C A +`` (if you want to copy this text as input, you will need to change spaces to tabs). The six columns are chromosome, position, name, reference allele, alternative allele, and strand. Note that strand must be specified for Seqweaver but not for the other deep learning models.
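
Reviewer note on the VCF paragraph above: the example name ``[QKI disruption]`` itself contains a space, so replacing every space with a tab would presumably split the name into two columns. A minimal Python sketch of building the six-column line from explicit fields is given below. The helper name ``to_vcf_line`` is hypothetical and is not part of Seqweaver or its web server, and restricting strand to ``+`` or ``-`` is an assumption based only on the examples shown in the doc.

```python
def to_vcf_line(chrom, pos, name, ref, alt, strand):
    """Join the six columns (chromosome, position, name, reference allele,
    alternative allele, strand) with tabs.

    Hypothetical helper for illustration; not the server's own validation.
    """
    # Assumption: strand is '+' or '-', as in the examples in the doc.
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    fields = [chrom, str(pos), name, ref, alt, strand]
    # A tab inside a field would silently create an extra column.
    if any("\t" in field for field in fields):
        raise ValueError("fields must not contain tab characters")
    return "\t".join(fields)


# The minimal example from the doc, with the name kept as a single column.
line = to_vcf_line("chr8", 38120276, "[QKI disruption]", "C", "A", "+")
print(line.split("\t"))
# ['chr8', '38120276', '[QKI disruption]', 'C', 'A', '+']
```

Joining explicit fields, rather than substituting characters in the pasted text, keeps spaces inside the name column intact.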