Genome Annotation Tools

Introduction

Genome annotation is another nontrivial problem in bioinformatics. Approaches vary, but most of them share the same general pipeline for annotating a genome assembly. This pipeline includes:

1. Repeat Identification and Masking
2. Structural Annotation
   - Ab initio or evidence-driven gene predictors
   - Post-processing of gene prediction (further analysis of exons, introns, consensus sequences, etc.)
3. Functional Annotation (attaching biological information to the gene or protein sequences previously predicted)
   - Homology Search
   - Protein sequence analysis
   - Post-processing of homology search (gene products and their interactions, statistics, etc.)

The Tools

Several tools are available for each of these steps. Here we introduce some of the more commonly used ones, as seen in the literature listed in the resources below.

RepeatMasker

From the README:

"RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green, or by WU-Blast developed by Warren Gish."

RepeatMasker and its dependencies are installed/set up on the GBI AMI. To install RepeatMasker yourself, follow the instructions here: https://www.repeatmasker.org/RepeatMasker/.

RepeatMasker accepts files in FASTA format: enter the command followed by the FASTA file you want masked. Two commonly used options are --species <query species>, which masks your file using a pre-existing repeat database for a specific species, and --lib [filename], which masks your file using a custom library/database.

For example:

RepeatMasker --species <query species> your-assembly.fasta

or

RepeatMasker --lib [filename] your-assembly.fasta
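
After masking, it can be useful to check how much of the assembly was actually masked. The following is a minimal sketch, not part of RepeatMasker itself: masked_summary is a hypothetical helper name, and it assumes the masked sequence is a plain FASTA file in which hard-masked bases appear as N and soft-masked bases appear as lowercase letters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize masking in a FASTA file.
# Prints total bases, hard-masked bases (N/n) and soft-masked bases
# (any other lowercase letter).
masked_summary() {
  local fasta="$1"
  LC_ALL=C awk '
    /^>/ { next }                   # skip FASTA header lines
    {
      total += length($0)           # count every base on the line
      hard  += gsub(/[Nn]/, "")     # gsub returns how many N/n it removed
      soft  += gsub(/[a-z]/, "")    # lowercase letters that remain
    }
    END { printf "total=%d hard=%d soft=%d\n", total, hard, soft }
  ' "$fasta"
}

# Usage (the file name is a placeholder for your own masked output):
# masked_summary your-assembly.fasta.masked
```

RepeatMasker normally writes the masked sequence next to the input with a .masked suffix, so that is the file to pass in. A very low masked count on a large eukaryotic assembly may suggest that the chosen species or library did not match well.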

Ab Initio Gene Prediction with AUGUSTUS

One note before introducing AUGUSTUS: "While Augustus and SNAP are the most popular tools for ab initio prediction, they still necessitate the information of the closely related gene and genome model for screening against the newly sequenced genome." (2). So while this is a commonly used practice, it may not work for many of our de novo assemblies. Evidence-based gene prediction is covered after AUGUSTUS.

AUGUSTUS

From the website:

"AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences... It can be used as an ab initio program, which means it bases its prediction purely on the sequence. AUGUSTUS may also incorporate hints on the gene structure coming from extrinsic sources such as EST, MS/MS, protein alignments and syntenic genomic alignments. Since version 3.0 AUGUSTUS can also predict the genes simultaneously in several aligned genomes (see README-cgp.md)."

AUGUSTUS is already installed on the GBI AMI. To install it yourself, follow the instructions on their GitHub here: https://github.com/Gaius-Augustus/Augustus.

The basic command is:

augustus [parameters] --species=SPECIES queryfilename

If you want to see the current list of available species, use

augustus [parameters] --species=help

Another example (from (1)) writes the AUGUSTUS results in GFF3 format to an output file:

augustus --species=species.name --gff3=on genome.fasta > output.file
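
Once the output file exists, a quick sanity check is to count how many features of each type were predicted. The sketch below is an illustration only: gff_feature_counts is a hypothetical helper name, and it assumes the output is tab-separated GFF/GFF3 with the feature type (gene, CDS, etc.) in the third column and comment lines starting with #.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally feature types (column 3) in a GFF/GFF3 file.
# Comment lines and lines without the nine GFF columns are ignored.
gff_feature_counts() {
  local gff="$1"
  awk -F'\t' '
    !/^#/ && NF >= 9 { n[$3]++ }               # count by feature type
    END { for (f in n) print f "\t" n[f] }     # one "type<TAB>count" per line
  ' "$gff" | LC_ALL=C sort
}

# Usage (output.file is the AUGUSTUS output from the command above):
# gff_feature_counts output.file
```

The count on the gene line gives the number of predicted genes, which is a convenient number to compare between runs with different species parameters.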

Evidence-Driven Gene Prediction with BRAKER

From the GitHub page:

"BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET and AUGUSTUS in novel eukaryotic genomes."

BRAKER is already operational on the GBI AMI. To install it yourself, follow the instructions either at https://github.com/Gaius-Augustus/BRAKER#installation or in the GBI AMI documentation at https://github.com/Green-Biome-Institute/AWS/wiki/AWS-GBI-AMI-Documentation.

The command looks like:

braker.pl [OPTIONS] --genome=genome.fa {--bam=rnaseq.bam | --prot_seq=prot.fa}

where genome.fa is your assembly file, and the two alternative options in braces supply the evidence you have: RNA-Seq alignments (--bam) or protein sequences (--prot_seq).

Some other frequently used options:

--species=sname for creating output files relevant to a specific species,
--softmasking for when your input file is soft-masked (masking software can soft-mask a sequence: instead of replacing the masked nucleotides with Ns, it writes them in lowercase [A->a, G->g, etc.]),
--cores for setting the maximum number of cores on your computer that you would like to allocate to this analysis.

For more, use the help option:

braker.pl --help
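
To make the choice between the two evidence options concrete, here is a small sketch that only prints a braker.pl command and does not run it. braker_cmd is a hypothetical helper, the file names are placeholders, and the rule "a .bam file means RNA-Seq evidence, anything else means protein sequences" is an assumption made for this illustration. It uses only the options listed above; check braker.pl --help for the option names in your installed version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) a braker.pl command for one
# evidence file. Arguments: genome file, evidence file, core count.
braker_cmd() {
  local genome="$1" evidence="$2" cores="${3:-4}"   # default of 4 cores is arbitrary
  local evidence_opt
  case "$evidence" in
    *.bam) evidence_opt="--bam=$evidence" ;;        # RNA-Seq alignments
    *)     evidence_opt="--prot_seq=$evidence" ;;   # assumed to be protein FASTA
  esac
  echo "braker.pl --genome=$genome $evidence_opt --softmasking --cores=$cores"
}

braker_cmd genome.fa rnaseq.bam 8
braker_cmd genome.fa prot.fa
```

Printing the command first makes it easy to review the options before starting a run, since BRAKER runs can take a long time on a full assembly. The --softmasking flag is included on the assumption that the genome was soft-masked in the repeat masking step.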

Homology Search

"To investigate gene function or predict evolutionary associations between related sequences, newly assembled sequences are compared with gene sequences with known functions to find sequences with high homology." (1).

Several tools can be used here. The most common is BLAST, which we use to query the predicted genes from our newly assembled genomes against sequences with known functions.

For nucleotide sequences, we will use the blastn command, which is already set up on the GBI AMI. If you want to install it yourself, go to the NCBI website and follow their instructions: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
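
A nucleotide homology search could look like the following sketch. It is an illustration under stated assumptions rather than the procedure used on the GBI AMI: run_blastn and best_hits are hypothetical helper names, refdb and all file names are placeholders, and the e-value cutoff of 1e-5 is an arbitrary example value. The first function builds a local nucleotide database with makeblastdb and searches it with blastn in tabular format (-outfmt 6), whose twelfth column is the bit score. The second keeps the highest-scoring hit for each query.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: build a nucleotide database from reference
# sequences, then search it with the predicted genes.
# Arguments: query FASTA, reference FASTA, output file.
run_blastn() {
  local query="$1" ref="$2" out="$3"
  makeblastdb -in "$ref" -dbtype nucl -out refdb
  blastn -query "$query" -db refdb -outfmt 6 -evalue 1e-5 -out "$out"
}

# Hypothetical helper: keep the highest bit score hit per query from
# tabular (outfmt 6) BLAST output. Column 1 is the query ID and
# column 12 is the bit score.
best_hits() {
  local hits="$1"
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k12,12gr "$hits" |
    awk -F'\t' '!seen[$1]++'     # first line per query is its best hit
}

# Usage (file names are placeholders):
# run_blastn predicted-genes.fasta reference-genes.fasta hits.tsv
# best_hits hits.tsv > best-hits.tsv
```

Keeping only the best hit per query gives one candidate homolog for each predicted gene, which is a simple starting point for the functional annotation step.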

Resources:

(1) Kong, J., Huh, S., Won, J. I., Yoon, J., Kim, B., & Kim, K. (2019). GAAP: A Genome Assembly + Annotation Pipeline. BioMed Research International, 2019. https://doi.org/10.1155/2019/4767354

(2) Jung, H., Ventura, T., Sook Chung, J., Kim, W. J., Nam, B. H., Kong, H. J., Kim, Y. O., Jeon, M. S., & Eyun, S. Il. (2020). Twelve quick steps for genome assembly and annotation in the classroom. PLoS Computational Biology, 16(11), 1–25. https://doi.org/10.1371/journal.pcbi.1008325
