@@ -0,0 +1,45 @@
---
slug: community-genome-annotation
title: Community genome annotation
description: Requirements for community annotations to be imported into Ensembl.
---

# Community annotation formatting

We encourage our users, submitters and providers to keep submitting their assembly annotations to [INSDC](https://www.insdc.org/about-insdc/). However, we understand that this process may take longer than desired, delaying the genome's availability on our site. We may therefore accept GFF3 files handed over directly to us, although the assembly itself must already be publicly available in INSDC. We have put together a short protocol to help make sure a GFF3 file is valid and complies with our requirements.

First and foremost, the file must be in valid GFF3 format ([specifications](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)). This can be checked by uploading the file to the [GFF3 Validator](https://genometools.org/cgi-bin/gff3validator.cgi) from GenomeTools and clicking on `Validate this file!`. Run it at least once without the `tidy` box ticked to see what the issues are. You can then attempt to validate it again with the `tidy` option ticked. If your GFF3 file is larger than the size limit specified on the site, you will need to download the tool ([installation instructions](https://github.com/genometools/genometools/blob/master/README.md)) and run it on your computer with the command:

```bash
gt gff3validator <file.gff3>
```

If you want to use the `tidy` option, we recommend the following command:

```bash
gt gff3 -tidy -sort -retainids <file.gff3>
```

The tool can also be installed via conda (the package is distributed through the Bioconda channel, so that channel needs to be configured) by running the command:

```bash
conda install genometools-genometools
```

Once the GFF3 file has been validated, the next step is to make sure that the chromosome and scaffold names, as well as the underlying sequences, match those of the corresponding assembly in INSDC. The names are listed in the assembly report file, e.g. [GCA_032118995.1](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/032/118/995/GCA_032118995.1_ASM3211899v1/GCA_032118995.1_ASM3211899v1_assembly_report.txt). We use the names provided by INSDC as the main IDs to align ourselves with INSDC, but the initially submitted names will also be available in Ensembl. There is no need to provide the sequences themselves, since we will use the ones provided by INSDC/RefSeq.
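The name check described above can be approximated locally before handing the file over. The following is a minimal sketch, not an official Ensembl tool: the function names are hypothetical, and it assumes the assembly report is the tab-separated NCBI `assembly_report.txt` layout, where `#` lines are comments, column 0 is `Sequence-Name` and column 4 is `GenBank-Accn`. Which of those columns your GFF3 should use is an assumption to confirm for your assembly.

```python
def report_names(report_lines, column=0):
    """Collect the names found in one column of an assembly report.

    Assumption: tab-separated report with '#' comment lines, as in the
    NCBI assembly_report.txt files (column 0 = Sequence-Name,
    column 4 = GenBank-Accn).
    """
    names = set()
    for line in report_lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) > column:
            names.add(fields[column])
    return names


def gff3_seqids(gff3_lines):
    """Collect the sequence names (first column) used in a GFF3 file."""
    seqids = set()
    for line in gff3_lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence data
        if not line or line.startswith("#"):
            continue
        seqids.add(line.split("\t", 1)[0])
    return seqids


def unknown_seqids(gff3_lines, report_lines, column=0):
    """Return GFF3 sequence names that are absent from the report column."""
    return gff3_seqids(gff3_lines) - report_names(report_lines, column)
```

Both functions accept any iterable of lines, so open file handles can be passed directly, e.g. `unknown_seqids(open("annotation.gff3"), open("assembly_report.txt"))` (file names are placeholders). An empty result means every sequence name in the GFF3 file appears in the chosen report column.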

Finally, make sure each protein-coding mRNA produces exactly one translation, i.e. all CDS lines for the same transcript have the same CDS ID. For instance, here is what the GFF3 file section for GENEID_000001_p1 would look like:

```text
scaffold_name source_name gene 1 1000 . - . ID=GENEID_000001
scaffold_name source_name mRNA 1 1000 . - . ID=GENEID_000001_t1;Parent=GENEID_000001
scaffold_name source_name exon 1 400 . - . ID=GENEID_000001_t1-E1;Parent=GENEID_000001_t1
scaffold_name source_name exon 600 1000 . - . ID=GENEID_000001_t1-E2;Parent=GENEID_000001_t1
scaffold_name source_name CDS 100 400 . - 2 ID=GENEID_000001_p1;Parent=GENEID_000001_t1
scaffold_name source_name CDS 600 900 . - 0 ID=GENEID_000001_p1;Parent=GENEID_000001_t1
```
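The "one CDS ID per transcript" rule can be checked with a short script. The sketch below is illustrative only, and its function names are hypothetical. It assumes a real, tab-separated GFF3 file (the example above is shown with spaces for readability) and reports only transcripts whose CDS lines carry more than one ID or no ID at all. It does not detect mRNAs that have no CDS lines.

```python
from collections import defaultdict


def cds_ids_per_transcript(gff3_lines):
    """Map each transcript ID to the set of CDS IDs naming it as Parent."""
    cds_ids = defaultdict(set)
    for line in gff3_lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence data
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9 or columns[2] != "CDS":
            continue
        # Column 9 holds ';'-separated key=value attributes.
        attributes = dict(
            field.split("=", 1) for field in columns[8].split(";") if "=" in field
        )
        cds_id = attributes.get("ID")  # None when the CDS line has no ID
        for parent in attributes.get("Parent", "").split(","):
            if parent:
                cds_ids[parent].add(cds_id)
    return cds_ids


def transcripts_with_inconsistent_cds(gff3_lines):
    """Return transcripts whose CDS lines do not share exactly one ID."""
    return {
        transcript: ids
        for transcript, ids in cds_ids_per_transcript(gff3_lines).items()
        if len(ids) != 1 or None in ids
    }
```

Passing an open file handle, e.g. `transcripts_with_inconsistent_cds(open("annotation.gff3"))` (placeholder file name), returns an empty dictionary when every transcript with CDS lines uses a single CDS ID.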

Additionally, if some protein sequences contain sequence edits (e.g. selenocysteines), the provider will also need to supply a separate FASTA file with the sequences of those proteins, using the same IDs as the CDSs in the GFF3 file.
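To confirm that the FASTA IDs line up with the CDS IDs, a comparison along the following lines may help. This is a hedged sketch with hypothetical function names. It assumes the FASTA ID is the first whitespace-delimited token after `>` and that the GFF3 file is tab-separated.

```python
def fasta_ids(fasta_lines):
    """Collect record IDs: the first token after '>' on each header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def gff3_cds_ids(gff3_lines):
    """Collect the ID attribute of every CDS line in a GFF3 file."""
    ids = set()
    for line in gff3_lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence data
        columns = line.split("\t")
        if line.startswith("#") or len(columns) != 9 or columns[2] != "CDS":
            continue
        for field in columns[8].split(";"):
            if field.startswith("ID="):
                ids.add(field[3:])
    return ids


def fasta_ids_missing_from_gff3(fasta_lines, gff3_lines):
    """Return FASTA record IDs that match no CDS ID in the GFF3 file."""
    return fasta_ids(fasta_lines) - gff3_cds_ids(gff3_lines)
```

An empty result means every protein in the FASTA file corresponds to a CDS ID in the GFF3 file.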

> Note: *Bear in mind that any GFF3 file handed over without passing this validation process will be returned to the provider, and the genome will be postponed by at least one release.*

@@ -9,4 +9,6 @@
- name: BRAKER2 genome annotation
href: braker2-annotation.md
- name: Homology annotation
href: homology-annotation.md
href: homology-annotation.md
- name: Community genome annotation
href: community-annotation.md