---
slug: genebuild-annotation
title: Genome annotation
description: How genome annotation is provided by Ensembl
---

# Genome annotation
How genomes are annotated by Ensembl.

## Genome preparation

The first phase of the Ensembl genome annotation pipeline involves loading an assembly into the Ensembl core database schema and then running a series of analyses on the loaded assembly to identify an initial set of genomic features.

### Repeat Elements
After the genomic sequence has been loaded into a database, it is screened for sequence patterns, including repeats, using the repeat-detection tool Red [(Girgis, H.Z., 2015)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0654-5), dustmasker [(Morgulis, A., _et al._, 2006)](https://www.liebertpub.com/doi/10.1089/cmb.2006.13.1028), and the tandem repeats finder, TRF [(Benson, G., 1999)](https://academic.oup.com/nar/article/27/2/573/1061099?login=true).

Combined, these tools provide quick and accurate identification of repeat elements, low-complexity DNA structures, and tandem repeats, respectively, as well as a soft-masked version of the genome that is used for the remainder of the analyses.
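
As an illustration of what soft-masking means, the sketch below lower-cases every base covered by a repeat interval while leaving the rest of the sequence in upper case. The function name `soft_mask` and the 0-based, end-exclusive interval convention are assumptions made for this example; they are not taken from the Ensembl pipeline or from the tools cited above.

```python
def soft_mask(sequence: str, intervals: list[tuple[int, int]]) -> str:
    """Lower-case every base covered by a (start, end) interval.

    Intervals are 0-based and end-exclusive (an assumption of this sketch).
    Overlapping intervals from different tools are handled naturally,
    because lower-casing a base twice has no further effect.
    """
    masked = list(sequence.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


# Hypothetical intervals, standing in for the output of two different tools.
repeat_hits = [(4, 8)]
tandem_hits = [(6, 12)]
print(soft_mask("ACGTACGTACGTACGT", repeat_hits + tandem_hits))  # ACGTacgtacgtACGT
```

Because the masked bases are only lower-cased rather than replaced, downstream aligners can still read through them if needed.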

### Low complexity features, _ab initio_ predictions
On the soft-masked genome, transcription start sites are predicted using Eponine-scan (Down, T.A., _et al_, 2002). Additionally, CpG islands (Larsen, F., _et al_, 1992) longer than 200 bp are predicted using the [command line version of CpG Finder](https://genome-source.gi.ucsc.edu/gitlist/kent.git/tree/master/src/utils/cpgIslandExt/) [(Gardiner-Garden, M., _et al_, 1987)](https://www.sciencedirect.com/science/article/pii/0022283687906899?via%3Dihub).

These predictions and features are for display purposes only; they are not used in the gene annotation process.
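
To make the CpG island criterion more concrete, here is a minimal sketch of the statistics usually computed for a candidate region: GC fraction and the observed/expected CpG ratio. Only the 200-base length cut-off comes from the text above. The GC and ratio thresholds are commonly quoted defaults and are assumptions here, and the function names are hypothetical; this is not the CpG Finder implementation.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / N.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq: str, min_length: int = 200,
                          min_gc: float = 0.5, min_obs_exp: float = 0.6) -> bool:
    """Apply the length cut-off from the text plus assumed GC/ratio thresholds."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) > min_length and gc_fraction >= min_gc and obs_exp > min_obs_exp


print(looks_like_cpg_island("CG" * 150))  # long, GC-rich, CpG-dense
print(looks_like_cpg_island("AT" * 150))  # long but contains no C or G
```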

## Protein-coding model generation
To predict the actual gene models and produce an annotation of the genome, several sources of biological information are used. This data is most commonly gathered at the species level or, in the absence of species-level data, at the genus level.

In addition to other annotated protein sets, these sources of data include transcriptomic and long-read data.

We process each source separately and then merge all the results, comparing them to each other and to the genome.

### Protein-to-genome
We retrieve protein sequences with [experimental evidence](https://www.uniprot.org/help/protein_existence) at protein or transcript level from [UniProt](https://www.uniprot.org/) [(The UniProt Consortium, 2017)](https://academic.oup.com/nar/article/45/D1/D158/2605721?login=true). A protein database is produced for each annotation, containing protein information from the species itself, the clade it belongs to, and some other related, highly curated species (often a model species such as human or mouse). Proteins from OrthoDB [(Kriventseva E.K., _et al_, 2019)](https://academic.oup.com/nar/article/47/D1/D807/5160989?login=true) are also added; they provide broad, targeted coverage of the proteome of interest.

We then perform a series of protein-to-genome alignments in an attempt to identify highly conserved structures.

Next we process this data using GenBlast [(She, R., _et al_, 2011)](https://academic.oup.com/bioinformatics/article/27/15/2141/403866?login=true), a splice-aware protein-to-genome aligner. This analysis is run with cut-offs of 50% coverage and 30% identity, an e-value of e-1, and the exon-repair option turned on. Only the top 10 models that pass the cut-offs are kept.
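
The cut-off and top-10 logic can be sketched as a simple post-filter. The coverage, identity and top-10 values come from the text above; everything else is an assumption for illustration: the `Model` record, the idea that ranking is done per query protein by a single `score` field, and the use of inclusive thresholds. The real selection is done by GenBlast and the pipeline, not by this code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    protein_id: str   # query protein the model was built from
    coverage: float   # percent of the query protein covered
    identity: float   # percent identity of the alignment
    score: float      # hypothetical ranking score (higher is better)


def filter_models(models, min_coverage=50.0, min_identity=30.0, top_n=10):
    """Keep, per protein, the top_n best-scoring models that pass both cut-offs."""
    passing = {}
    for model in models:
        if model.coverage >= min_coverage and model.identity >= min_identity:
            passing.setdefault(model.protein_id, []).append(model)
    return {
        protein_id: sorted(group, key=lambda m: m.score, reverse=True)[:top_n]
        for protein_id, group in passing.items()
    }


# Twelve passing models and one failing model for a made-up protein "P1".
candidates = [Model("P1", 80.0, 45.0, float(i)) for i in range(12)]
candidates.append(Model("P1", 40.0, 90.0, 100.0))  # fails the coverage cut-off
kept = filter_models(candidates)
print(len(kept["P1"]), kept["P1"][0].score)
```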

### Transcriptomic

#### RNA-Seq
We retrieve RNA-Seq data available at the European Nucleotide Archive, [ENA](https://www.ebi.ac.uk/ena/browser/home), for the species of interest. In the absence of species-level data, genus-level reads can be used. The reads are preprocessed by trimming with [Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/).

This data is then aligned to the genome using the ultrafast universal RNA-seq aligner, STAR [(Dobin, A., _et al_, 2013)](https://academic.oup.com/bioinformatics/article/29/1/15/272537?login=true), and assembled into models using the transcript assembler, Scallop [(Shao, M., _et al_, 2017)](https://doi.org/10.1038/nbt.4020).

The BAMs (sorted and indexed) are provided as interactive tracks for the annotated genome, with expression and model data available for review.

The protein-coding potential of these predictions is then validated with the fast and ultrasensitive protein aligner, DIAMOND [(Buchfink, B., _et al_, 2021)](https://www.nature.com/articles/s41592-021-01101-x), which aligns the longest ORF against a UniProt database containing eukaryotic proteins.

When this is insufficient, RNASamba [(Camargo, A.P., _et al_, 2020)](https://academic.oup.com/nargab/article/2/1/lqz024/5701461?login=true) and CPC2 [(Kang, Y., _et al_, 2017)](https://academic.oup.com/nar/article/45/W1/W12/3831091?login=true) are used.

#### Long-read
Long-read data (e.g. IsoSeq or Nanopore) is retrieved in a similar fashion from [ENA](https://www.ebi.ac.uk/ena/browser/home).

This data is mapped to the genome using Minimap2 [(Li, H., 2018)](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778?login=true), with the recommended settings for each type of data. Additionally, low-frequency intron/exon boundaries that are non-canonical are replaced with high-frequency boundary coordinates within 50 bp, and the low-frequency potential gaps between adjoining exons are filled in based on high-frequency observations of single exons with the same terminal boundary coordinates.
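
The boundary replacement step can be pictured with the following sketch. The 50 bp window comes from the text above. The `high_freq` read-count threshold, the rule of choosing the nearest high-frequency boundary (ties broken by higher count), and the function name are assumptions made for illustration; the actual pipeline may define frequency and choose replacements differently.

```python
def correct_boundary(position: int, canonical: bool, counts: dict[int, int],
                     high_freq: int = 10, window: int = 50) -> int:
    """Snap a low-frequency, non-canonical boundary to a nearby well-supported one.

    counts maps a boundary coordinate to the number of reads supporting it.
    Boundaries that are canonical or already well supported are left unchanged.
    """
    if canonical or counts.get(position, 0) >= high_freq:
        return position
    candidates = [
        p for p, n in counts.items()
        if n >= high_freq and abs(p - position) <= window
    ]
    if not candidates:
        return position  # nothing well supported within the window
    # Nearest candidate wins; among equally distant ones, the better supported.
    return min(candidates, key=lambda p: (abs(p - position), -counts[p]))


# Made-up read support for three boundary coordinates.
support = {1000: 40, 1007: 2, 1100: 30}
print(correct_boundary(1007, canonical=False, counts=support))  # 1000
```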

## sncRNA
Using a database of known sncRNAs from the [RFAM archive](https://rfam.org/) [(Kalvari, I., _et al_, 2018)](https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/cpbi.51), we use CMsearch [(part of the Infernal suite; Cui, X., _et al_, 2016)](https://academic.oup.com/bioinformatics/article/32/12/i332/2288851?login=true) to search for homology models of the sequences provided in the database.

Additionally, tRNAscan-SE [(Chan, P.P., _et al_, 2019)](https://link.springer.com/protocol/10.1007/978-1-4939-9173-0_1) allows us to find tRNA genes, which belong to one of the largest and most complex non-coding RNA families.

## Filtering and finalisation of the models
Once all the data resulting from the different analyses and sources is processed, we remove redundancies, filter out low-quality models, and prepare the final models for biological interpretation.

We do this mostly through in-house software developed to work best with our core database schema.

### Prioritising models at each locus
Low-quality models are removed, and data is collapsed and consolidated into a final set of gene models plus their associated non-redundant transcripts.

When collapsing the data, priority is given to models derived from transcriptomic data. Where RNA-seq data is fragmented or missing, homology data takes precedence.
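
The priority rule can be written as a small decision function. This is only a sketch: the `LocusModel` record, its `fragmented` flag, and the fallback to a fragmented RNA-seq model when no homology model exists are assumptions for illustration, not a description of the in-house software.

```python
from typing import NamedTuple, Optional


class LocusModel(NamedTuple):
    source: str        # "rnaseq" or "homology"
    fragmented: bool   # hypothetical flag marking an incomplete model


def pick_model(rnaseq: Optional[LocusModel],
               homology: Optional[LocusModel]) -> Optional[LocusModel]:
    """Prefer an intact transcriptomic model; otherwise fall back to homology."""
    if rnaseq is not None and not rnaseq.fragmented:
        return rnaseq
    if homology is not None:
        return homology
    # Assumption: a fragmented RNA-seq model is better than no model at all.
    return rnaseq


intact = LocusModel("rnaseq", fragmented=False)
broken = LocusModel("rnaseq", fragmented=True)
protein = LocusModel("homology", fragmented=False)
print(pick_model(intact, protein).source)  # rnaseq
print(pick_model(broken, protein).source)  # homology
```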

### UTR addition
The set of coding models is extended into the untranslated regions (UTRs) using RNA-seq data (if available).

UTR from RNA-seq alignments is added to protein models lacking UTR (such as the protein-to-genome alignment models) when the intron coordinates of the model missing UTR exactly match a subset of the intron coordinates of the UTR donor model.
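
The matching criterion amounts to a subset test on intron coordinates, sketched below. Introns are represented as `(start, end)` tuples, which is an assumption of this example, as is the choice to reject models with no introns at all (an empty set would otherwise match any donor). The sketch does not check that the matched introns are consecutive in the donor.

```python
Intron = tuple[int, int]


def can_donate_utr(model_introns: list[Intron], donor_introns: list[Intron]) -> bool:
    """True if every intron of the UTR-less model appears exactly in the donor."""
    model = set(model_introns)
    # Require at least one intron so that single-exon models never match trivially.
    return bool(model) and model <= set(donor_introns)


# Made-up coordinates: the donor has one extra intron in its 5' UTR.
protein_model = [(200, 300), (400, 500)]
rnaseq_donor = [(50, 100), (200, 300), (400, 500)]
print(can_donate_utr(protein_model, rnaseq_donor))        # True
print(can_donate_utr([(200, 301)], rnaseq_donor))         # False
```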

### Functional annotation and identification
We also use a [transformer model](https://github.com/Ensembl/gene_symbol_transformer) trained on our manually curated set of annotations to deduce gene function from sequence, and thus name each gene after its closest gene family. These names should be treated as an indication of function rather than as formal nomenclature, as the subtler naming conventions might not be followed strictly.

Finally, an Ensembl stable ID is assigned to each annotated feature.

When updating an existing annotation, or replacing it with a new assembly version, an extra step is taken to try to map the old stable IDs to the new features, to make them easier to work with.

## Appendix

The Ensembl gene set is generated automatically, meaning that gene models are annotated using the Ensembl gene annotation pipelines. The main focus of the pipelines is to generate a conservative set of protein-coding gene models.