Merged
@@ -4,7 +4,6 @@ title: BRAKER2 genome annotation
description: How genomes are annotated using the annotation pipeline BRAKER2 by Ensembl.
---
# Genome annotation using the annotation pipeline BRAKER2

How genomes are annotated by Ensembl using the BRAKER2 annotation pipeline.

## Genome preparation
@@ -14,12 +13,11 @@ The first phase of the Ensembl genome annotation pipeline involves loading an as
After the genomic sequence has been loaded into a database, it is screened for sequence patterns, including repeats, using the repeat-detection tool Red [(Girgis, H.Z., 2015)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0654-5), dustmasker [(Morgulis, A., _et al._, 2006)](https://www.liebertpub.com/doi/10.1089/cmb.2006.13.1028), and the tandem repeat finder TRF [(Benson, G., 1999)](https://academic.oup.com/nar/article/27/2/573/1061099?login=true).

## Low complexity feature prediction
Transcription start sites are predicted using Eponine-scan[6], CpG islands longer than 400 bp are predicted using CpG [(Larsen, F., _et al_, 1992)](https://www.sciencedirect.com/science/article/pii/088875439290024M?via%3Dihub), and tRNAs are predicted using tRNAscan-SE [(Chan, P.P., _et al_, 2019)](https://link.springer.com/protocol/10.1007/978-1-4939-9173-0_1). The results of Eponine-scan, CpG, and tRNAscan-SE are for display purposes only; they are not used in the gene annotation process.

## Model generation through the BRAKER2 annotation pipeline
[BRAKER2](https://github.com/Gaius-Augustus/BRAKER) [(Bruna, T., _et al_, 2021)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7787252/) is one of the most popular tools for automatic eukaryotic genome annotation. This pipeline has been included in the Ensembl gene annotation process to generate supplementary gene annotation tracks for non-vertebrate species with an existing Ensembl annotation, or a draft annotation for species without transcriptomic data. For more information, see our [blog post](https://www.ensembl.info/2022/05/24/rapid-release-33-contains-species-annotated-via-braker2/).

BRAKER2 includes [GeneMark-ES](https://genemark.bme.gatech.edu/) [(Lomsadze, A., _et al_, 2005)](https://academic.oup.com/nar/article/33/20/6494/1082033?login=true), a self-training algorithm for _ab initio_ gene prediction, and [Augustus](https://bioinf.uni-greifswald.de/augustus/) [(Stanke, M., _et al_, 2006)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1538822/), one of the most accurate gene prediction tools. BRAKER2 offers different run options depending on the available data; in the Ensembl annotation process, the pipeline is run in its default protein mode.

The set of proteins has been downloaded from [UniProt](https://www.uniprot.org/) [(The UniProt Consortium, 2017)](https://academic.oup.com/nar/article/45/D1/D158/2605721?login=true) and [OrthoDB](https://www.orthodb.org/) [(Kriventseva E.K., _et al_, 2019)](https://academic.oup.com/nar/article/47/D1/D807/5160989?login=true), using clade-specific filters.
@@ -3,82 +3,60 @@ slug: non-vertebrate-genome-annotation
title: Non-vertebrate genome annotation
description: How non-vertebrate genomes are annotated by Ensembl.
---

# Non-vertebrate genome annotation

How non-vertebrate genomes are annotated by Ensembl.

## Genome preparation

The genome preparation phase of the Ensembl gene annotation system involves loading an assembly into an Ensembl core database schema and then running a series of analyses on the loaded assembly to identify an initial set of genomic features.

The most important aspect of this phase is identifying repeat features, as soft masking of the genome is used extensively later in the annotation process.

### Repeat elements

After the genomic sequence has been loaded into a database, it is screened for sequence patterns, including repeats, using the repeat-detection tool Red [(Girgis, H.Z., 2015)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0654-5), dustmasker [(Morgulis, A., _et al_, 2006)](https://www.liebertpub.com/doi/10.1089/cmb.2006.13.1028), and the tandem repeat finder TRF [(Benson, G., 1999)](https://academic.oup.com/nar/article/27/2/573/1061099?login=true).

Combined, these tools provide a quick and accurate identification of repeat elements, low-complexity DNA structures, and tandem repeats, respectively, as well as a soft-masked version of the genome that is used for the remainder of the analyses.
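As a rough illustration of what soft masking means, the sketch below lower-cases the bases that fall inside repeat intervals while leaving the sequence itself unchanged. The function name, the interval convention (0-based, end-exclusive), and the example coordinates are assumptions made for illustration; they are not taken from Red, dustmasker, TRF, or the Ensembl pipeline code.

```python
def soft_mask(sequence: str, repeats: list[tuple[int, int]]) -> str:
    """Lower-case every base inside a repeat interval.

    Intervals are assumed to be 0-based and end-exclusive (an
    illustrative convention, not the Ensembl one).
    """
    masked = list(sequence.upper())
    for start, end in repeats:
        # Clamp each interval to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


# Toy sequence with a low-complexity poly-A run at positions 8..15.
genome = "ACGTACGT" + "A" * 8 + "CGTACGT"
# Hypothetical repeat calls, as a repeat finder might report them.
repeat_intervals = [(8, 16)]
masked_genome = soft_mask(genome, repeat_intervals)
print(masked_genome)
```

Because only the case changes, downstream tools that understand soft masking can skip or down-weight the lower-case regions, while the original bases remain recoverable with a simple upper-casing step.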

### Low complexity feature prediction

Transcription start sites are predicted using Eponine-scan[6], CpG islands longer than 400 bp are predicted using CpG [(Larsen, F., _et al_, 1992)](https://www.sciencedirect.com/science/article/pii/088875439290024M?via%3Dihub), and tRNAs are predicted using tRNAscan-SE [(Chan, P.P., _et al_, 2019)](https://link.springer.com/protocol/10.1007/978-1-4939-9173-0_1). The results of Eponine-scan, CpG, and tRNAscan-SE are for display purposes only; they are not used in the gene annotation process.

## Generating automatic, evidence-based protein-coding gene models

Genome annotation is generated primarily through alignment of publicly available transcriptomic data to the genome. Gaps in the annotation are filled via protein-to-genome alignments of a select set of proteins. The data and techniques employed to generate gene models are outlined here.

### Model generation through alignment of transcriptomic data to the genome

#### Aligning RNA-Seq data

Where available, RNA-seq data are downloaded from the European Nucleotide Archive ([ENA](https://www.ebi.ac.uk/ena/browser/home)) and used for annotation.

The reads are aligned to the genome using the RNA-seq aligner STAR [(Dobin, A., _et al_, 2013)](https://academic.oup.com/bioinformatics/article/29/1/15/272537?login=true), and models are assembled using the transcript assembler Scallop [(Shao, M., _et al_, 2017)](https://doi.org/10.1038/nbt.4020).

#### Aligning long-read transcriptomic data

Where available, long-read transcriptomic data (PacBio Iso-Seq or Oxford Nanopore) are downloaded from [ENA](https://www.ebi.ac.uk/ena/browser/home) and used in the annotation process. The long-read data are mapped to the genome using Minimap2 [(Li H., 2018)](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778?login=true) with the recommended settings for Iso-Seq and Nanopore data.


#### Validating protein-coding models

Protein-coding models are validated by aligning the longest ORF against a database of eukaryotic UniProt proteins using the protein aligner DIAMOND [(Buchfink, B., _et al_, 2021)](https://www.nature.com/articles/s41592-021-01101-x). If the alignments are insufficient, additional validation is performed using RNAsamba [(Camargo, A.P., _et al_, 2020)](https://academic.oup.com/nargab/article/2/1/lqz024/5701461?login=true) and CPC2 [(Kang, Y., _et al_, 2017)](https://academic.oup.com/nar/article/45/W1/W12/3831091?login=true).
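The text does not say how the longest ORF is extracted, so the following is only a minimal sketch of one common approach. It assumes the transcript sequence is already on the sense strand, that an ORF starts at the first `ATG` in a frame and ends at the next in-frame stop codon, and that incomplete ORFs are ignored. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript: str) -> str:
    """Return the longest ATG-to-stop ORF (stop codon included).

    Only the three forward frames are scanned; an empty string is
    returned when no complete ORF is found.
    """
    seq = transcript.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # close the ORF and keep scanning
    return best


example_transcript = "GGATGAAACCCTAGTT"
print(longest_orf(example_transcript))
```

In a workflow like the one described, the translation of such an ORF would be the query aligned against the UniProt protein database; the alignment step itself is not shown here.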

### Model generation through protein-to-genome alignments

Protein sequences are downloaded from public databases and aligned to the genome in a splice-aware manner using GenBlast (She, R., _et al_, 2011). The aligned proteins are selected subsets of UniProt (The UniProt Consortium, 2017) and OrthoDB (Kriventseva E.K., _et al_, 2019) proteins, chosen to provide both broad and targeted coverage of the proteome of the species of interest.

For GenBlast, a cut-off of 50% coverage and 30% identity, with an e-value threshold of e-1, is applied, and the exon repair option is enabled. The top 10 transcript models generated by GenBlast for each protein meeting these cut-offs are retained.
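The selection rule above can be sketched as a filter followed by a per-protein ranking. The 50% coverage, 30% identity, and top-10 values come from the text. The `Alignment` record, its `score` field used for ranking, and the choice to treat the thresholds as inclusive are assumptions for illustration, not GenBlast's actual output format or behaviour.

```python
from collections import defaultdict
from dataclasses import dataclass

MIN_COVERAGE = 50.0          # percent of the protein covered (from the text)
MIN_IDENTITY = 30.0          # percent identity (from the text)
MAX_MODELS_PER_PROTEIN = 10  # models retained per protein (from the text)


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one protein-to-genome model."""
    protein_id: str
    model_id: str
    coverage: float
    identity: float
    score: float  # assumed ranking score; higher is better


def select_models(alignments):
    """Keep models passing both cut-offs, then the top N per protein."""
    by_protein = defaultdict(list)
    for aln in alignments:
        # Thresholds are treated as inclusive here (an assumption).
        if aln.coverage >= MIN_COVERAGE and aln.identity >= MIN_IDENTITY:
            by_protein[aln.protein_id].append(aln)
    return {
        protein_id: sorted(alns, key=lambda a: a.score, reverse=True)[
            :MAX_MODELS_PER_PROTEIN
        ]
        for protein_id, alns in by_protein.items()
    }
```

Proteins with no model passing the cut-offs simply do not appear in the result, which mirrors the idea that only models "meeting these cut-offs" are retained.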

### Model generation through mapping of reference annotations to the genome

A whole-genome alignment against either the human or mouse GENCODE reference assembly is generated using LastZ [(Harris, R.S., 2007)](https://www.bx.psu.edu/~rsharris/rsharris_phd_thesis_2007.pdf). Syntenic regions are identified from this alignment, and protein-coding annotations from the most recently released gene set are mapped to the genome through localised alignments using CESAR 2.0 [(Sharma, V.P., _et al_, 2017)](https://academic.oup.com/bioinformatics/article/33/24/3985/4095639?login=true).

## Filtering and finalising protein-coding gene models

### Prioritising models at each locus

Low quality models are removed, and data is collapsed and consolidated into a final set of gene models plus their associated non-redundant transcripts. When collapsing the data, priority is given to models derived from transcriptomic data; where RNA-seq data is fragmented or missing, homology data takes precedence.

### Addition of UTR to coding models

The set of coding models is extended into the untranslated regions (UTRs) using RNA-seq data (if available). The criterion for adding UTR from RNA-seq alignments to protein models lacking UTR (such as the protein-to-genome alignment models) is that the intron coordinates from the model missing UTR exactly match a subset of the coordinates from the UTR donor model.
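The intron-matching criterion can be illustrated with a small sketch. It assumes exons are given as 1-based inclusive `(start, end)` pairs on the same strand, and it interprets "a subset" as plain set inclusion of intron coordinates. Both function names are hypothetical, and the real pipeline may apply further checks that are not described in the text.

```python
Exon = tuple[int, int]  # (start, end), assumed 1-based inclusive


def introns(exons: list[Exon]) -> list[Exon]:
    """Derive intron coordinates from the gaps between sorted exons."""
    ordered = sorted(exons)
    return [
        (left[1] + 1, right[0] - 1)
        for left, right in zip(ordered, ordered[1:])
    ]


def can_donate_utr(coding_exons: list[Exon], donor_exons: list[Exon]) -> bool:
    """True if every intron of the coding model is also a donor intron.

    Single-exon coding models have no introns and pass trivially; the
    text does not say how such cases are handled.
    """
    return set(introns(coding_exons)) <= set(introns(donor_exons))


# Protein-derived model without UTR: one intron at 301..399.
coding_model = [(200, 300), (400, 500)]
# RNA-seq model with an extra 5' exon and longer terminal exons.
donor_model = [(50, 100), (180, 300), (400, 650)]
print(can_donate_utr(coding_model, donor_model))
```

Requiring an exact match on intron boundaries means the donor's UTR is only borrowed when both models agree on the splicing of the shared region.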

## Creating the final gene set

### Identifying small non-coding RNAs

Small structured non-coding genes from [RFAM](https://rfam.org/) [(Nawrocki, E.P., _et al._, 2015)](https://pubmed.ncbi.nlm.nih.gov/25392425/) and [miRBase](https://www.mirbase.org/) [(Griffiths-Jones, S., _et al._, 2006)](https://pubmed.ncbi.nlm.nih.gov/16381832/) are analyzed using [NCBI-BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) [(Altschul, S.F., _et al._, 1990)](https://pubmed.ncbi.nlm.nih.gov/2231712/), and models are generated using the Infernal software suite [(Nawrocki, E.P. and S.R. Eddy, 2013)](https://academic.oup.com/bioinformatics/article/29/22/2933/316439?login=true).


### Identifying long non-coding RNAs

If a model fails to meet the criteria of any of the previously described categories, does not overlap a protein-coding gene, and has been constructed from transcriptomic data, then it is considered a potential lncRNA. Potential lncRNAs are additionally filtered to remove single-exon loci due to the unreliability of such models.
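The lncRNA rule is a conjunction of four conditions, which the sketch below spells out. The `Model` record and its field names are hypothetical stand-ins for whatever the pipeline tracks internally; only the conditions themselves come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    """Hypothetical summary of one gene model."""
    model_id: str
    exon_count: int
    from_transcriptomic_data: bool
    overlaps_protein_coding_gene: bool
    # Category assigned by earlier steps (e.g. "protein_coding"), or None.
    assigned_category: Optional[str] = None


def is_potential_lncrna(model: Model) -> bool:
    """Apply the four conditions described in the text."""
    return (
        model.assigned_category is None            # fits no earlier category
        and not model.overlaps_protein_coding_gene
        and model.from_transcriptomic_data
        and model.exon_count > 1                   # single-exon loci removed
    )


candidates = [
    Model("t1", exon_count=3, from_transcriptomic_data=True,
          overlaps_protein_coding_gene=False),
    Model("t2", exon_count=1, from_transcriptomic_data=True,
          overlaps_protein_coding_gene=False),
]
lncrnas = [m.model_id for m in candidates if is_potential_lncrna(m)]
print(lncrnas)
```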


## Appendix

The Ensembl gene set is generated automatically, meaning that gene models are annotated using the Ensembl gene annotation pipelines. The main focus of the pipelines is to generate a conservative set of protein-coding gene models.

Every gene model produced by the Ensembl gene annotation pipeline is supported by biological sequence evidence (see the “Supporting evidence” link on the left-hand menu of a Gene page or Transcript page); _ab initio_ models are not included in our gene set.