diff --git a/docs/beluga.rst b/docs/beluga.rst new file mode 100644 index 000000000..439f68a52 --- /dev/null +++ b/docs/beluga.rst @@ -0,0 +1,116 @@ +================ +Beluga / DeepSEA +================ + +Introduction +------------ + +DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single-nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factor binding, DNase I sensitivity, and histone marks in multiple cell types, and uses this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. + +The 2019 version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features. Beluga is described in: + +Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, **Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk**. Nature Genetics (2018). + +To determine whether certain features (i.e., transcription factors, marks, or cell types) are accounted for in the model, refer to the `supplemental feature table `_, which lists all the profiles used to train DeepSEA. + +DeepSEA was originally described in the following manuscript: + +Jian Zhou, Olga G. Troyanskaya. **Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model.** Nature Methods (2015). + +To determine whether certain features (i.e., transcription factors, marks, or cell types) are accounted for in the model, refer to the `supplemental feature table `_, which lists all the profiles used to train DeepSEA. + + +Input +----- + +DeepSEA predicts genomic variant effects on a wide range of chromatin features at the variant position (transcription factor binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be used to predict chromatin features for any DNA sequence.
+ +File formats +~~~~~~~~~~~~ +We support three types of input: vcf, fasta, bed. If you want to predict effects of noncoding variants, use vcf format input. If you want to predict chromatin feature probabilities for DNA sequences, use fasta format. If you want to specify sequences from the human reference genome (GRCh37/hg19), you can use bed format. See below for a quick introduction: + +**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. + +**Fasta format** input should include sequences of 2000bp length each. If a sequence is longer than 2000bp, only the center 2000bp will be used. A minimal example is :: + + >TestSequence + TGGGATTACAGGCGTGAGCCACCGCGCCCGGCCCATTGTACCATTCTTAT + GCCTTTGCGTCCTCATAGCTTAGCTCCCGTATATCAGTGAGAACATACTA + TGTTTGGTTTTCCATACCCGAGTTACTTCACTTAGAATAATAGTCTCCAA + TTTCATCCAGGTCAGTGCAAATGCGTTAATTCGTTCCTTTTATGGCTGAG + TAGTATTCCATCATATATATATACTACAGTTTCTTTATCCACTCGTAAAT + TGATGGGCATTTGTGTTGGAACACTTCTCCACTGCTGGTGGGAATGTAAA + TTAGTGCAGCCACTATGGATAACAGTGTGGAGATTTGTTAAAGAACTAAA + ACTAGAACTACCATTTGATCCAGCAATCCCACTACTGGGTATCTACCCAG + AAGAAAAGAAGTCATTATTTGAAAAAGATACTTGCACGGGCATGTTTATA + GCAGCACAATTCACAATTGTAGTTGTATTTCTTTAAGCGTGTCTTTTCAA + TATCTCTCATGTTTCTGGTATAGATGGTATATATGTTAATCTTGTTCCTG + AGGTCTGTTTTTTATTTTTGTCATTAAAGTGGGAATTAAATAGTTTTGTA + GTGCATATAAATTAAAGAAAAAGTTCACATAAGCATATTTGCCAATCATC + TCAAAATGCTATATTCTCCTTCACGGTTTTGAAAATAATTCAGGGTTTTC + TCTTCCTCATTGCTTTCCCACCAACTGACAGTATTATTTTCTTAGTCATT + TTACTGACCTTTGAAATTACTCCTTTGAGGTCTTCTAAAAAATTTTATGG + GCTCTGCTGCTTTTTGGTGGCCTCCTTGTATCATTTATTCTATTACAGGA + CGACTTACAAAAGGAAGCACATAAATTGACCCATATACATATCCTATCAT + TGGGGAGTTTCTGTGCAAATGTTATTTATTGGAAGCTATTACTAAGAATT + GTAAGAAAAATAATTGGTATTGATGCAGCTAGTATGGTTCCTGTAATTAT + CGTACTCAGCCACGTAAATCATAGCTATATGTAGCCAAAGATCCATGAAC + 
AAAATTTCCAGTAACATCATTATAATTCAAAAGGCAGACTTTCAGAACCA + GACAGACTTGAATTTAAATTCTAGCTTTACCACACATGAATTTAACCTTG + TGGAAGGTTAACCTATCTAAACTCATGTTTCTTCATTGGTAGCTGATAAA + ATTAAGGATCATGTATATAACCACCTAGTAGAGTTGTTTAAGAAACTGTT + AGAATTCCATAAATTGTTAGTATTAATGAGTTTTTGTTGGACATGTGTTA + GGCTAGGCCACTCCTTGACCTTCATAGAGGTATGGATTATGACACAAATT + CTAAACTGTAGGTAGGCATGGCTTTGTAGCAAGTATTAAAATAGTAAATA + TTTTATTTTTATAAGATAAATGTAAACCTTTTAAAAGTTTCATTACATTT + GTATTTATGAAATATCATCCTATATCAACTATAGAGAGAAGATCGCAAGA + AGGCAGTGGCAGCAGAGGCTCCAGTTAGGAGGCTACTAGTCCAAATACAT + TGCGATAAAAACTTGGCAAAAGGTGCTGGTAGTCTGATGAAATAAAGTAG + ATAAATTTTAGAGGTATTTATAAAATAATTAAAGAATATTCAATAATAGG + AGATATATTACCCAATAGAGTGGAGATTCAAAGATAACTCCGAAAGTTTT + TTGCTAAAGCAACATTTGGCTGTGCTATCATTTACTAAGAAAGACAACAA + GAGAGTAAAATCAAGTTTGAGGATGAAGTGAATTTATTCCTTTTTGATTG + ATACATAATTGACATGTAATAAAACCCACAATGTTAAGAGTTCGGTTTGA + TGTGCTTGACTATTTTAGGCACTGGTGTTATCACAACACAAGACAACAGA + TAGGACATTCTCAGAAAATTTTTTCATGTCCCTTTCCAGTCAGTTTCAAG + CCTTCTTTCCATGCAATAATTTTCTCACTTTGCCATTCTAGTAGGTGTGA + +**Bed format** provides another way to specify sequences in human reference genome (hg19). The bed input should specify 2000bp-length regions. A minimal example is ``chr1 109817091 109819090``. The three columns are chromosome, start position, and end position. + +Genome coordinates +~~~~~~~~~~~~~~~~~~ +We support only ``GRCh37/hg19`` genome coordinates. You can use LiftOver to convert your coordinates to the correct version. + +Large submissions +~~~~~~~~~~~~~~~~~ +We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly. 
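The splitting suggested above can be sketched in a few lines of Python. This is a minimal illustration, not part of the web server or the standalone tool: the function name ``split_records`` and the constant ``MAX_PER_SUBMISSION`` are hypothetical, and the records used in the demo are placeholders rather than real variants.

```python
from typing import Iterable, Iterator, List

# Hypothetical constant: stay strictly below the 10,000 limit mentioned above.
MAX_PER_SUBMISSION = 9_999


def split_records(records: Iterable[str],
                  max_size: int = MAX_PER_SUBMISSION) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `max_size` non-empty records."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    batch: List[str] = []
    for record in records:
        record = record.rstrip("\n")
        if not record.strip():
            continue  # skip blank lines
        batch.append(record)
        if len(batch) == max_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


if __name__ == "__main__":
    # Placeholder records standing in for VCF or bed lines.
    placeholder = [f"record_{i}" for i in range(25_000)]
    for n, batch in enumerate(split_records(placeholder), start=1):
        print(f"submission {n}: {len(batch)} records")
```

Each batch can then be written to its own file and submitted separately. For FASTA input the unit to batch is a whole sequence record, not a single line, so the lines would need to be grouped per record first.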
+ +Output +------ + +Regulatory feature scores +~~~~~~~~~~~~~~~~~~~~~~~~~ +* **diffs**: The difference between the predicted probabilities of the alternative and reference alleles for a regulatory feature (:math:`p_{alt} - p_{ref}`). +* **e-value**: The expected proportion of SNPs with a larger predicted effect. We calculate the e-value from the empirical distribution of that feature's effect (:math:`|p_{alt} - p_{ref}|`) among gnomAD variants. For example, a feature e-value of 0.01 indicates that only 1% of gnomAD variants have a larger predicted effect. +* **z-score**: A scaled score in which the feature diff score (:math:`p_{alt} - p_{ref}`) is divided by the root mean square of the feature diff score across gnomAD variants. Note that this is "sign-preserving", i.e. a negative z-score indicates that a mutation **decreases** the probability of a regulatory feature. + +Variant scores +~~~~~~~~~~~~~~ + +* **Disease Impact Score (DIS)**: DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted transcriptional or post-transcriptional regulatory effects of these mutations (see Zhou et al., 2019). The predicted DIS probabilities are then converted into 'DIS e-values', computed from the empirical distribution of predicted effects for gnomAD variants. The final DIS score is: + + .. math:: + -\log_{10}(DIS\ evalue_{feature}) + +* **Mean -log e-value (MLE)**: For each predicted regulatory feature effect (:math:`|p_{alt} - p_{ref}|`) of a variant, we calculate a 'feature e-value' based on the empirical distribution of that feature's effects among gnomAD variants (see the e-value entry under Regulatory feature scores above). The MLE score of a variant is + + .. math:: + \frac{1}{N} \sum -\log_{10}(evalue_{feature}) + +In-silico mutagenesis +--------------------- +Perform "In silico saturated mutagenesis" (ISM) analysis to discover informative sequence features within any sequence.
Specifically, it performs computational mutation scanning to assess the effect of mutating every base of the input sequence on chromatin feature predictions. This method for context-specific sequence feature extraction takes advantage of DeepSEA’s ability to utilize flanking context sequences information. + +Note that ISM only accepts a sequence (FASTA file) as input. + +ISM outputs effects for each of three possible substitutions of all 2000 bases, across all chromatin features. diff --git a/docs/conf.py b/docs/conf.py index 561e2c3f3..a1d19adbe 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -44,7 +44,7 @@ # General information about the project. project = u'HumanBase' -copyright = u'2019, Simons Foundation' +copyright = u'2022, Flatiron Institute' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the diff --git a/docs/deepsea.rst b/docs/deepsea.rst index e1f140bfb..8c7c4464c 100644 --- a/docs/deepsea.rst +++ b/docs/deepsea.rst @@ -1,78 +1,30 @@ ======= -DeepSEA +Sei / DeepSEA ======= Introduction ------------ -DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. +Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. 
Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. You can also find the Sei code repository `here `_ or read about our manuscript `here `_. -The current version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chromatin features. Beluga is described in: +Sequence class-level variant effects are computed by comparing the predictions for the reference and the alternative alleles. A positive score indicates an increase in sequence class activity by the alternative allele and vice versa. Sequence class-level scores are computed by projecting the 21,907 chromatin profile predictions for the sequence to the unit vector that represents each sequence class. -Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, **Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk**. Nature Genetics (2018). +For older DeepSEA models see: +:doc:`beluga` (2019) -DeepSEA is originally described in the following manuscript: - -Jian Zhou, Olga G. Troyanskaya. **Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model.** Nature Methods (2015). - -To determine if certain features (ie. transcription factors, marks, or cell types) are present/accounted for in the model, refer to the `supplemental feature table `_ which has all the profiles used to train DeepSEA. Input ----- -DeepSEA predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be utilized for predicting chromatin features for any DNA sequence. - File formats ~~~~~~~~~~~~ We support three types of input: vcf, fasta, bed. 
If you want to predict effects of noncoding variants, use vcf format input. If you want to predict chromatin feature probabilities for DNA sequences, use fasta format. If you want to specify sequences from the human reference genome (GRCh37/hg19), you can use bed format. See below for a quick introduction: -**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. - -**Fasta format** input should include sequences of 2000bp length each. If a sequence is longer than 2000bp, only the center 2000bp will be used. A minimal example is :: - - >TestSequence - TGGGATTACAGGCGTGAGCCACCGCGCCCGGCCCATTGTACCATTCTTAT - GCCTTTGCGTCCTCATAGCTTAGCTCCCGTATATCAGTGAGAACATACTA - TGTTTGGTTTTCCATACCCGAGTTACTTCACTTAGAATAATAGTCTCCAA - TTTCATCCAGGTCAGTGCAAATGCGTTAATTCGTTCCTTTTATGGCTGAG - TAGTATTCCATCATATATATATACTACAGTTTCTTTATCCACTCGTAAAT - TGATGGGCATTTGTGTTGGAACACTTCTCCACTGCTGGTGGGAATGTAAA - TTAGTGCAGCCACTATGGATAACAGTGTGGAGATTTGTTAAAGAACTAAA - ACTAGAACTACCATTTGATCCAGCAATCCCACTACTGGGTATCTACCCAG - AAGAAAAGAAGTCATTATTTGAAAAAGATACTTGCACGGGCATGTTTATA - GCAGCACAATTCACAATTGTAGTTGTATTTCTTTAAGCGTGTCTTTTCAA - TATCTCTCATGTTTCTGGTATAGATGGTATATATGTTAATCTTGTTCCTG - AGGTCTGTTTTTTATTTTTGTCATTAAAGTGGGAATTAAATAGTTTTGTA - GTGCATATAAATTAAAGAAAAAGTTCACATAAGCATATTTGCCAATCATC - TCAAAATGCTATATTCTCCTTCACGGTTTTGAAAATAATTCAGGGTTTTC - TCTTCCTCATTGCTTTCCCACCAACTGACAGTATTATTTTCTTAGTCATT - TTACTGACCTTTGAAATTACTCCTTTGAGGTCTTCTAAAAAATTTTATGG - GCTCTGCTGCTTTTTGGTGGCCTCCTTGTATCATTTATTCTATTACAGGA - CGACTTACAAAAGGAAGCACATAAATTGACCCATATACATATCCTATCAT - TGGGGAGTTTCTGTGCAAATGTTATTTATTGGAAGCTATTACTAAGAATT - GTAAGAAAAATAATTGGTATTGATGCAGCTAGTATGGTTCCTGTAATTAT - CGTACTCAGCCACGTAAATCATAGCTATATGTAGCCAAAGATCCATGAAC - AAAATTTCCAGTAACATCATTATAATTCAAAAGGCAGACTTTCAGAACCA - GACAGACTTGAATTTAAATTCTAGCTTTACCACACATGAATTTAACCTTG 
- TGGAAGGTTAACCTATCTAAACTCATGTTTCTTCATTGGTAGCTGATAAA - ATTAAGGATCATGTATATAACCACCTAGTAGAGTTGTTTAAGAAACTGTT - AGAATTCCATAAATTGTTAGTATTAATGAGTTTTTGTTGGACATGTGTTA - GGCTAGGCCACTCCTTGACCTTCATAGAGGTATGGATTATGACACAAATT - CTAAACTGTAGGTAGGCATGGCTTTGTAGCAAGTATTAAAATAGTAAATA - TTTTATTTTTATAAGATAAATGTAAACCTTTTAAAAGTTTCATTACATTT - GTATTTATGAAATATCATCCTATATCAACTATAGAGAGAAGATCGCAAGA - AGGCAGTGGCAGCAGAGGCTCCAGTTAGGAGGCTACTAGTCCAAATACAT - TGCGATAAAAACTTGGCAAAAGGTGCTGGTAGTCTGATGAAATAAAGTAG - ATAAATTTTAGAGGTATTTATAAAATAATTAAAGAATATTCAATAATAGG - AGATATATTACCCAATAGAGTGGAGATTCAAAGATAACTCCGAAAGTTTT - TTGCTAAAGCAACATTTGGCTGTGCTATCATTTACTAAGAAAGACAACAA - GAGAGTAAAATCAAGTTTGAGGATGAAGTGAATTTATTCCTTTTTGATTG - ATACATAATTGACATGTAATAAAACCCACAATGTTAAGAGTTCGGTTTGA - TGTGCTTGACTATTTTAGGCACTGGTGTTATCACAACACAAGACAACAGA - TAGGACATTCTCAGAAAATTTTTTCATGTCCCTTTCCAGTCAGTTTCAAG - CCTTCTTTCCATGCAATAATTTTCTCACTTTGCCATTCTAGTAGGTGTGA - -**Bed format** provides another way to specify sequences in human reference genome (hg19). The bed input should specify 2000bp-length regions. A minimal example is ``chr1 109817091 109819090``. The three columns are chromosome, start position, and end position. +**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. Currently, the genome position needs to be in GRCh37/hg19 + +**Fasta format** input should include sequences of 4096bp length each. If a sequence is longer than 4096bp, only the center 4096bp will be used. + +**Bed format** provides another way to specify sequences in human reference genome (hg19). The bed input should specify 4096bp-length regions. A minimal example is ``chr1 109817091 109821186``. The three columns are chromosome, start position, and end position. 
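The input rules above (tab-separated VCF columns, the central 4096bp of longer sequences, 4096bp bed regions) can be checked before submitting. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the tie-breaking rule in ``center_crop`` is a guess (the server may crop differently when the excess length is odd), and the bed length is counted with both endpoints included, which is the convention implied by the minimal bed example.

```python
WINDOW = 4096  # sequence length stated above for this model


def spaces_to_tabs(line: str) -> str:
    """Turn a whitespace-separated record into a tab-separated one."""
    return "\t".join(line.split())


def center_crop(sequence: str, window: int = WINDOW) -> str:
    """Return the central `window` bases of a sequence.

    Assumption: when the excess length is odd, the extra base is dropped
    from the right-hand side. The server may break the tie differently.
    """
    if len(sequence) < window:
        raise ValueError(f"sequence is shorter than {window} bases")
    start = (len(sequence) - window) // 2
    return sequence[start:start + window]


def bed_region_length(line: str) -> int:
    """Length of a bed record, counting both the start and end positions."""
    _chrom, start, end = line.split()[:3]
    return int(end) - int(start) + 1


if __name__ == "__main__":
    print(repr(spaces_to_tabs("chr1 109817590 - G T")))
    print(bed_region_length("chr1 109817091 109821186") == WINDOW)
```

Passing ``window=2000`` applies the same checks to the older 2000bp models.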
Genome coordinates ~~~~~~~~~~~~~~~~~~ @@ -86,29 +38,62 @@ We recommend using the web server if you have <10,000 variants or sequences. You Output ------ -Regulatory feature scores +Sequence classes ~~~~~~~~~~~~~~~~~~~~~~~~~ -* **diffs**: The difference between the the predicted probability of the reference allele and the alternative allele for a regulatory feature (:math:`p_{alt} -p_{ref}`). -* **e-value**: E-value is defined as the expected proportion of SNPs with a larger predicted effect. We calculate an 'e-value' based on the empirical distribution of that feature's effect (:math:`abs(p_{alt} -p_{ref})`) among gnomAD variants. For example, a feature e-value of 0.01 indicates that only 1% of gnomAD variants have a larger predicted effect. -* **z-score**: A scaled score where the feature diff score (:math:`p_{alt} -p_{ref}`) is divided by the root mean square of the feature diff score across gnomAD variants. Note that this is "sign-preserving", i.e. a negative z-score indicates that a mutation **decreases** the probability of a regulatory feature. -Variant scores -~~~~~~~~~~~~~~ +The Sei framework predicts 40 sequence class scores, covering a wide range of regulatory activities such as cell-type-specific enhancers and promoters, as well as 21,907 chromatin profiles for any DNA sequence. + +To help interpretation, we grouped sequence classes into groups including P (Promoter), E (Enhancer), CTCF (CTCF-cohesin binding), TF (TF binding), PC (Polycomb-repressed), HET (Heterochromatin), TN (Transcription), and L (Low Signal) sequence classes. Please refer to our manuscript for a more detailed description of the sequence classes. + +Note: sequence class predictions are only available for vcf inputs. 
+ +:: + + | Sequence class label | Sequence class name | Rank by size | Group | + |---------------------:|----------------------------------:|-------------:|------:| + | PC1 | Polycomb / Heterochromatin | 0 | PC | + | L1 | Low signal | 1 | L | + | TN1 | Transcription | 2 | TN | + | TN2 | Transcription | 3 | TN | + | L2 | Low signal | 4 | L | + | E1 | Stem cell | 5 | E | + | E2 | Multi-tissue | 6 | E | + | E3 | Brain / Melanocyte | 7 | E | + | L3 | Low signal | 8 | L | + | E4 | Multi-tissue | 9 | E | + | TF1 | NANOG / FOXA1 | 10 | TF | + | HET1 | Heterochromatin | 11 | HET | + | E5 | B-cell-like | 12 | E | + | E6 | Weak epithelial | 13 | E | + | TF2 | CEBPB | 14 | TF | + | PC2 | Weak Polycomb | 15 | PC | + | E7 | Monocyte / Macrophage | 16 | E | + | E8 | Weak multi-tissue | 17 | E | + | L4 | Low signal | 18 | L | + | TF3 | FOXA1 / AR / ESR1 | 19 | TF | + | PC3 | Polycomb | 20 | PC | + | TN3 | Transcription | 21 | TN | + | L5 | Low signal | 22 | L | + | HET2 | Heterochromatin | 23 | HET | + | L6 | Low signal | 24 | L | + | P | Promoter | 25 | P | + | E9 | Liver / Intestine | 26 | E | + | CTCF | CTCF-Cohesin | 27 | CTCF | + | TN4 | Transcription | 28 | TN | + | HET3 | Heterochromatin | 29 | HET | + | E10 | Brain | 30 | E | + | TF4 | OTX2 | 31 | TF | + | HET4 | Heterochromatin | 32 | HET | + | L7 | Low signal | 33 | L | + | PC4 | Polycomb / Bivalent stem cell Enh | 34 | PC | + | HET5 | Centromere | 35 | HET | + | E11 | T-cell | 36 | E | + | TF5 | AR | 37 | TF | + | E12 | Erythroblast-like | 38 | E | + | HET6 | Centromere | 39 | HET | -* **Disease Impact Score (DIS)**: DIS is calculated by training a logistic regression model that prioritizes likely disease-associated mutations on the basis of the predicted transcriptional or post-transcriptional regulatory effects of these mutations (See Zhou et. al, 2019). The predicted DIS probabilities are then converted into 'DIS e-values', computed based on the empirical distributions of predicted effects for gnomAD variants. 
The final DIS score is: - .. math:: - -log10(DIS evalue_{feature}) -* **Mean -log e-value (MLE)**: For each predicted regulatory feature effect (:math:`abs(p_{alt}-p_{ref}`)) of a variant, we calculate a 'feature e-value' based on the empirical distribution of that feature's effects among gnomAD variants (see above Regulatory feature scores: e-value). The MLE score of a variant is - - .. math:: - \sum -log10(evalue_{feature}) / N - -In-silico mutagenesis ---------------------- -Perform "In silico saturated mutagenesis" (ISM) analysis to discover informative sequence features within any sequence. Specifically, it performs computational mutation scanning to assess the effect of mutating every base of the input sequence on chromatin feature predictions. This method for context-specific sequence feature extraction takes advantage of DeepSEA’s ability to utilize flanking context sequences information. - -Note that ISM only accepts a sequence (FASTA file) as input. - -ISM outputs effects for each of three possible substitutions of all 2000 bases, across all chromatin features. +Regulatory feature scores +~~~~~~~~~~~~~~~~~~~~~~~~~ +* **diffs**: The difference between the predicted probabilities of the alternative and reference alleles for a regulatory feature (:math:`p_{alt} - p_{ref}`). diff --git a/docs/index.rst b/docs/index.rst index a881bb76a..790119a5a 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -46,8 +46,6 @@ Help topics modules netwas deepsea + beluga expecto citations - - - diff --git a/docs/sei.rst b/docs/sei.rst new file mode 100644 index 000000000..441171e5e --- /dev/null +++ b/docs/sei.rst @@ -0,0 +1,67 @@ +======= +Sei +======= + +Introduction +------------ + +Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes.
Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone mark, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. You can also find the Sei code repository here (https://github.com/FunctionLab/sei-framework) or read our manuscript here (https://www.biorxiv.org/content/10.1101/2021.07.29.454384v1). + +Sequence class-level variant effects are computed by comparing the predictions for the reference and the alternative alleles. A positive score indicates that the alternative allele increases sequence class activity, and a negative score indicates a decrease. Sequence class-level scores are computed by projecting the 21,907 chromatin profile predictions for the sequence onto the unit vector that represents each sequence class. + +Input format +------------ + +VCF format is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. The genome position needs to be in GRCh38/hg38. + +Sequence classes +---------------- + +The Sei framework predicts 40 sequence class scores, covering a wide range of regulatory activities such as cell-type-specific enhancers and promoters, as well as 21,907 chromatin profiles for any DNA sequence. + +To aid interpretation, we grouped the sequence classes into the following groups: P (Promoter), E (Enhancer), CTCF (CTCF-cohesin binding), TF (TF binding), PC (Polycomb-repressed), HET (Heterochromatin), TN (Transcription), and L (Low Signal). Please refer to our manuscript for a more detailed description of the sequence classes.
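The projection described in the introduction can be illustrated with a toy example. This is a sketch under assumptions, not the Sei implementation: the function names are hypothetical, the vectors have 3 entries instead of 21,907, and the variant effect is taken here to be the plain difference between the alternative and reference projections, whereas the actual pipeline may apply additional normalization.

```python
import math
from typing import List, Sequence


def unit_vector(vector: Sequence[float]) -> List[float]:
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def project(predictions: Sequence[float], class_unit: Sequence[float]) -> float:
    """Scalar projection of the predictions onto a sequence class unit vector."""
    if len(predictions) != len(class_unit):
        raise ValueError("vectors must have the same length")
    return sum(p * u for p, u in zip(predictions, class_unit))


def variant_effect(ref: Sequence[float], alt: Sequence[float],
                   class_unit: Sequence[float]) -> float:
    """Assumed definition: alt projection minus ref projection."""
    return project(alt, class_unit) - project(ref, class_unit)


if __name__ == "__main__":
    # Toy numbers only; real vectors would have 21,907 entries.
    class_unit = unit_vector([3.0, 0.0, 4.0])
    ref = [0.1, 0.5, 0.2]
    alt = [0.3, 0.5, 0.2]
    effect = variant_effect(ref, alt, class_unit)
    print(f"effect = {effect:.2f}")  # positive: alt increases class activity
```

Under this definition the sign convention matches the text: a positive value means the alternative allele scores higher for that sequence class than the reference allele.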
+ +:: + + | Sequence class label | Sequence class name | Rank by size | Group | + |---------------------:|----------------------------------:|-------------:|------:| + | PC1 | Polycomb / Heterochromatin | 0 | PC | + | L1 | Low signal | 1 | L | + | TN1 | Transcription | 2 | TN | + | TN2 | Transcription | 3 | TN | + | L2 | Low signal | 4 | L | + | E1 | Stem cell | 5 | E | + | E2 | Multi-tissue | 6 | E | + | E3 | Brain / Melanocyte | 7 | E | + | L3 | Low signal | 8 | L | + | E4 | Multi-tissue | 9 | E | + | TF1 | NANOG / FOXA1 | 10 | TF | + | HET1 | Heterochromatin | 11 | HET | + | E5 | B-cell-like | 12 | E | + | E6 | Weak epithelial | 13 | E | + | TF2 | CEBPB | 14 | TF | + | PC2 | Weak Polycomb | 15 | PC | + | E7 | Monocyte / Macrophage | 16 | E | + | E8 | Weak multi-tissue | 17 | E | + | L4 | Low signal | 18 | L | + | TF3 | FOXA1 / AR / ESR1 | 19 | TF | + | PC3 | Polycomb | 20 | PC | + | TN3 | Transcription | 21 | TN | + | L5 | Low signal | 22 | L | + | HET2 | Heterochromatin | 23 | HET | + | L6 | Low signal | 24 | L | + | P | Promoter | 25 | P | + | E9 | Liver / Intestine | 26 | E | + | CTCF | CTCF-Cohesin | 27 | CTCF | + | TN4 | Transcription | 28 | TN | + | HET3 | Heterochromatin | 29 | HET | + | E10 | Brain | 30 | E | + | TF4 | OTX2 | 31 | TF | + | HET4 | Heterochromatin | 32 | HET | + | L7 | Low signal | 33 | L | + | PC4 | Polycomb / Bivalent stem cell Enh | 34 | PC | + | HET5 | Centromere | 35 | HET | + | E11 | T-cell | 36 | E | + | TF5 | AR | 37 | TF | + | E12 | Erythroblast-like | 38 | E | + | HET6 | Centromere | 39 | HET | diff --git a/docs/tissue-networks.rst b/docs/tissue-networks.rst index b85163045..932c7cc9e 100644 --- a/docs/tissue-networks.rst +++ b/docs/tissue-networks.rst @@ -18,7 +18,7 @@ Examples IL1B in blood vessel ~~~~~~~~~~~~~~~~~~~~~~~~~ -We examined and experimentally verified the tissue-specific molecular response of blood vessel cells to stimulation by IL-1β (IL1B), a proinflammatory cytokine. 
We anticipated that the genes most tightly connected to IL1B in the blood vessel network would be among those responding to IL-1β stimulation in blood vessel cells. We tested this hypothesis by profiling the gene expression of human aortic smooth muscle cells (HASMCs; the predominant cell type in blood vessels) stimulated with IL-1β. +We examined and experimentally verified the tissue-specific molecular response of blood vessel cells to stimulation by IL-1β (IL1B), a pro-inflammatory cytokine. We anticipated that the genes most tightly connected to IL1B in the blood vessel network would be among those responding to IL-1β stimulation in blood vessel cells. We tested this hypothesis by profiling the gene expression of human aortic smooth muscle cells (HASMCs; the predominant cell type in blood vessels) stimulated with IL-1β. Examination of the genes whose expression was significantly upregulated at 2 h after stimulation showed that 18 of the 20 IL1B network neighbors were among the top 500 most upregulated genes in the experiment (P = 2.07 × 10−23). The blood vessel network was the most accurate tissue network in predicting this experimental outcome; none of the other 143 tissue-specific networks or the tissue-naive network performed as well when evaluated by each network's ability to predict the result of IL-1β stimulation on the cells.