# Fungal Genome Annotation Using AUGUSTUS

AUGUSTUS is a gene prediction tool widely used in fungal genomics. It uses ab initio prediction combined with species-specific training models to identify protein-coding genes in assembled genomes. This guide walks through the essential steps to annotate a fungal genome using AUGUSTUS.

In [None]:
%%bash
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba


bin/micromamba


In [None]:
%%bash
./bin/micromamba create -y -n augustus_env -c bioconda -c conda-forge augustus


Let's annotate the genome of the fungus Fusarium oxysporum Fo47 (NCBI RefSeq assembly GCF_013085055.1):
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_013085055.1/

In [None]:
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/085/055/GCF_013085055.1_ASM1308505v1/GCF_013085055.1_ASM1308505v1_genomic.fna.gz

In [None]:
!gunzip GCF_013085055.1_ASM1308505v1_genomic.fna.gz

In [None]:
%%bash
eval "$(./bin/micromamba shell hook -s bash)"
micromamba activate augustus_env
augustus --species=fusarium \
         --uniqueGeneId=true \
         --noInFrameStop=true \
         --gff3=on \
         --codingseq=on \
         --protein=on \
         --outfile=Foxy.gff \
         GCF_013085055.1_ASM1308505v1_genomic.fna


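Extract the predicted coding and protein sequences from the annotation. `getAnnoFasta.pl` is installed inside `augustus_env`, so run it through micromamba:

In [None]:
!./bin/micromamba run -n augustus_env getAnnoFasta.pl Foxy.gff

This writes `Foxy.aa` (proteins) and `Foxy.codingseq` (coding sequences). Keep these files under distinct names, because you will compare them with the predictions from the masked genome later.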
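Before moving on, it can help to record how many features AUGUSTUS predicted on the unmasked genome, so the number can be compared after masking. The following is a minimal sketch, assuming the GFF3 file from the cell above is named `Foxy.gff`; the helper name `count_features` is hypothetical and not part of AUGUSTUS.

```python
from collections import Counter
from pathlib import Path


def count_features(gff_lines):
    """Count GFF feature types (column 3), skipping comments and blank lines."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:  # a well-formed GFF record has 9 columns
            counts[fields[2]] += 1
    return counts


# Assumption: Foxy.gff is the AUGUSTUS output written by the cell above.
gff_path = Path("Foxy.gff")
if gff_path.exists():
    with gff_path.open() as handle:
        for feature, n in count_features(handle).most_common():
            print(feature, n)
```

The `gene` row is the figure to keep for the later comparison. Lines starting with `#` are skipped because AUGUSTUS writes the protein and coding sequences as comments inside the GFF.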

# Masking the Genome for Repeats in Fungal Annotation
Genome masking is a critical preprocessing step before running gene prediction tools such as AUGUSTUS, BRAKER, or MAKER. Fungal genomes frequently contain large amounts of repetitive DNA, including transposable elements, tandem repeats, and low-complexity regions. If these repeats are not masked, annotation tools may incorrectly interpret them as protein-coding genes or exons, leading to false predictions.

## Why Repeat Masking Matters

### 1. Prevents False Gene Predictions

Repetitive sequences often resemble coding regions or exon boundaries. Without masking, gene predictors can:

- Call false CDS inside transposable elements
- Introduce spurious introns/exons
- Inflate gene counts artificially

### 2. Improves Accuracy of Training and Annotation

Tools like AUGUSTUS and SNAP use statistical models trained on real genes. Repeats introduce noise that confuses these models.

Masking helps ensure:

- Better exon–intron boundary detection
- Cleaner gene structures
- More accurate BUSCO scores

### 3. Essential for High-Quality Comparative Genomics

Repeats can inflate:

- Orthogroup counts
- Gene family expansions
- Genome similarity estimates

Masked genomes provide more reliable comparative analyses.

### 4. Reduces Computational Load

Large repeat regions slow down alignment, evidence integration, and gene calling. Masking simplifies the genome and accelerates annotation.

In [4]:
# Rename the FASTA file before masking it
!mv GCF_013085055.1_ASM1308505v1_genomic.fna fungus_genome.fna

## Install RepeatModeler and RepeatMasker

In [None]:
!wget -qO- https://micro.mamba.pm/api/micromamba/linux-64/latest \
    | tar -xvj bin/micromamba

!mv bin/micromamba /usr/local/bin/micromamba
!rmdir bin


In [30]:
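# Note: each `!` command runs in its own subshell, so this hook does not persist
# between cells; the cells below call `micromamba run -n repeats ...` instead.
!eval "$(micromamba shell hook -s bash)"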


In [None]:
!micromamba create -y -n repeats -c conda-forge -c bioconda \
    repeatmasker repeatmodeler rmblast hmmer trf perl



In [None]:
!micromamba run -n repeats RepeatMasker -h
!micromamba run -n repeats RepeatModeler -h



In [34]:
!micromamba run -n repeats BuildDatabase -name fungus_db fungus_genome.fna

Building database fungus_db:
  Reading fungus_genome.fna...
Number of sequences (bp) added to database: 12 ( 50358849 bp )
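The BuildDatabase log above reports 12 sequences and 50358849 bp. As an independent sanity check, the same two numbers can be recomputed from the FASTA file. This is a minimal sketch, assuming the input is `fungus_genome.fna` as in the cell above; `fasta_stats` is a hypothetical helper name.

```python
from pathlib import Path


def fasta_stats(lines):
    """Return (number of records, total bases) for FASTA-formatted lines."""
    n_records = 0
    total_bp = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_records += 1  # each header line starts a new record
        else:
            total_bp += len(line)
    return n_records, total_bp


# Assumption: fungus_genome.fna is the file passed to BuildDatabase above.
genome = Path("fungus_genome.fna")
if genome.exists():
    with genome.open() as handle:
        print(fasta_stats(handle))
```

If the tuple printed here differs from the BuildDatabase log, the database was probably built from a different or truncated file.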


In [None]:
!micromamba run -n repeats RepeatModeler -database fungus_db -threads 8


In [None]:
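# RepeatModeler writes its results to a run-specific RM_<pid>.<timestamp> directory;
# replace the directory name below with the one listed on your system.
!ls */consensi.fa.classified
!cp RM_9902.MonDec11742392025/consensi.fa.classified ./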
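Because the `RM_...` directory name changes with every run, hard-coding it makes the notebook hard to re-run. One possible alternative is to look the file up by pattern and take the most recently modified match. This is a sketch only; `latest_consensi` is a hypothetical helper, and it assumes the RepeatModeler output directories sit in the current working directory.

```python
from pathlib import Path


def latest_consensi(workdir="."):
    """Return the most recently modified RM_*/consensi.fa.classified, or None."""
    candidates = list(Path(workdir).glob("RM_*/consensi.fa.classified"))
    if not candidates:
        return None
    # Pick the newest file in case several RepeatModeler runs left output behind.
    return max(candidates, key=lambda p: p.stat().st_mtime)


found = latest_consensi()
if found is not None:
    print("Newest repeat library:", found)
```

The printed path can then be passed to `cp` in place of the hard-coded directory name.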

In [None]:
!micromamba run -n repeats RepeatMasker \
  -pa 8 \
  -lib consensi.fa.classified \
  -gff \
  fungus_genome.fna


In [None]:
!ls -lh fungus_genome.fna*

In [43]:
!mv */fungus_genome.fna.masked ./
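To see how much of the genome RepeatMasker actually masked, the fraction of masked bases can be computed directly from the FASTA files. The sketch below counts `N`/`n` (hard-masked) and lowercase (soft-masked) bases; `masked_fraction` is a hypothetical helper. Assembly gaps and any masking already present in the downloaded file are counted too, so it is worth running it on both the input and the `.masked` output and comparing the two values.

```python
from pathlib import Path


def masked_fraction(lines):
    """Fraction of bases that are N/n (hard-masked) or lowercase (soft-masked)."""
    masked = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base in "Nn" or base.islower())
    return masked / total if total else 0.0


# Assumption: both files are in the working directory, as in the cells above.
for name in ("fungus_genome.fna", "fungus_genome.fna.masked"):
    path = Path(name)
    if path.exists():
        with path.open() as handle:
            print(name, f"{masked_fraction(handle):.2%}")
```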

In [None]:
%%bash
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba


In [None]:
%%bash
./bin/micromamba create -y -n augustus_env -c bioconda -c conda-forge augustus


Annotate the Masked Genome

In [None]:
%%bash
eval "$(./bin/micromamba shell hook -s bash)"
micromamba activate augustus_env
augustus --species=fusarium \
         --uniqueGeneId=true \
         --noInFrameStop=true \
         --gff3=on \
         --codingseq=on \
         --protein=on \
         --outfile=Foxy_masked.gff \
         fungus_genome.fna.masked

Extract the coding and protein sequences predicted on the masked genome

In [None]:
!./bin/micromamba run -n augustus_env getAnnoFasta.pl Foxy_masked.gff
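One way to check whether masking matters for this genome is to count how many genes predicted on the unmasked assembly (`Foxy.gff`) overlap an annotated repeat. The sketch below is illustrative only: the helper names are hypothetical, it assumes the `-gff` option of RepeatMasker wrote its annotation to `fungus_genome.fna.out.gff`, and it assumes both files use the same sequence identifiers.

```python
from collections import defaultdict
from pathlib import Path


def read_intervals(gff_lines, feature=None):
    """Collect (start, end) per sequence from GFF lines; keep only `feature` if given."""
    intervals = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        if feature is not None and fields[2] != feature:
            continue
        intervals[fields[0]].append((int(fields[3]), int(fields[4])))
    return intervals


def count_overlapping(genes, repeats):
    """Count genes sharing at least one base with any repeat on the same sequence."""
    hits = 0
    for seqid, gene_list in genes.items():
        repeat_list = repeats.get(seqid, [])
        for g_start, g_end in gene_list:
            # Closed intervals overlap when each starts before the other ends.
            if any(g_start <= r_end and r_start <= g_end for r_start, r_end in repeat_list):
                hits += 1
    return hits


# Assumptions: file names as described in the paragraph above.
gene_gff = Path("Foxy.gff")
repeat_gff = Path("fungus_genome.fna.out.gff")
if gene_gff.exists() and repeat_gff.exists():
    with gene_gff.open() as g, repeat_gff.open() as r:
        genes = read_intervals(g, feature="gene")
        repeats = read_intervals(r)
    total = sum(len(v) for v in genes.values())
    print(count_overlapping(genes, repeats), "of", total, "genes overlap a repeat")
```

The pairwise comparison is quadratic per sequence, which is simple to read but slow on repeat-rich genomes; sorting the intervals or using an interval tree would scale better.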