# Genome annotation
Genome annotation is essential for understanding an organism's genetic blueprint, divided into two main processes: structural and functional annotation. Structural annotation identifies the locations of genes and genomic elements like coding regions, exons, and regulatory sequences, providing a map of the genome's architecture. Functional annotation assigns roles to these genes, linking them to biological processes, protein functions, and pathways. Together, these approaches offer a comprehensive view, crucial for research on complex pathogen-host interactions, as in phytoplasma studies, where understanding both gene locations and functions is key to advancing diagnostics and disease management.

In [None]:
# @title
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
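import sys
# NOTE: this path is hard-coded for Python 3.7; Miniconda3-latest ships whichever
# Python version is current, so adjust the version in the path if imports fail.
sys.path.append('/usr/local/lib/python3.7/site-packages/')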
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge

In [None]:
!conda install bioconda::repeatmodeler -y

In [None]:
!conda install bioconda::repeatmasker -y

# Genome Annotation: Tailoring Approaches for Prokaryotes and Eukaryotes

Annotation approaches vary based on the type of organism, with specific gene models suited to either prokaryotic or eukaryotic genomes. For prokaryotes, specialized annotation tools are essential to capture their unique genomic features, such as densely packed genes and operons, which differ significantly from the complex structures found in eukaryotic genomes. Certain annotators are designed exclusively for prokaryotic genomes and SHOULD NOT be used for eukaryotes, as they lack the capacity to handle introns and the intricate regulatory regions characteristic of eukaryotic genes. Choosing the right annotation tool is crucial for accurate genome analysis and meaningful biological insights.

**Let's annotate a prokaryotic genome with Prokka**

Prokka is designed for prokaryotic genomes and should not be used on eukaryotes!

Install ncbi-datasets to fetch genomes from GenBank

In [None]:
!conda install conda-forge::ncbi-datasets-cli -y

Download a Xanthomonas oryzae genome. Unzip the file, create a new folder named "genomes," and move the FASTA file containing the genome to the "genomes" folder.

In [None]:
!datasets download genome accession GCF_004355885.3 --include genome,seq-report
!unzip ncbi_dataset.zip
!mkdir genomes
!mv ncbi_dataset/data/GCF_004355885.3/*.fna genomes/

Install the Prokka tool

In [None]:
!conda install -c bioconda prokka -y

Run Prokka to annotate the Xanthomonas oryzae genome

In [None]:
!prokka --locustag xoo --outdir prokka_results genomes/GCF_004355885.3_ASM435588v3_genomic.fna

Your results will be saved in prokka_results. This folder contains several files with the gene predictions, protein sequences, functional annotation, a GenBank file, and a GFF file.
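
As a quick sanity check on the annotation, the feature types in the GFF file can be tallied. The cell below is a minimal sketch: `count_gff_features` is a helper name introduced here for illustration (it is not part of Prokka), and the example rows are made up in GFF3 layout rather than taken from a real run. It assumes the GFF file ends with a `##FASTA` section holding the genome sequence, which is skipped.

```python
from collections import Counter


def count_gff_features(gff_text):
    """Count rows per feature type (column 3) in GFF3-formatted text."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # everything after this marker is sequence, not annotation
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Made-up example rows in GFF3 layout (hypothetical, not real Prokka output)
example_gff = (
    "##gff-version 3\n"
    "contig1\tProdigal\tCDS\t10\t400\t.\t+\t0\tID=xoo_00001\n"
    "contig1\tProdigal\tCDS\t500\t900\t.\t-\t0\tID=xoo_00002\n"
    "contig1\tAragorn\ttRNA\t1000\t1075\t.\t+\t.\tID=xoo_00003\n"
    "##FASTA\n"
    ">contig1\n"
    "ATGC\n"
)
feature_counts = count_gff_features(example_gff)
print(feature_counts)
```

To apply it to the real output, read the `.gff` file inside prokka_results and pass its text to the function. The file prefix depends on the run date unless `--prefix` is set, so check the folder for the exact name.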

**Let's annotate a eukaryotic genome with Augustus**

Augustus is a gene predictor designed primarily for eukaryotic genomes, although it also ships with parameter sets for a few prokaryotic species.

Download a fungal genome, Fusarium oxysporum. Unzip the file, create a new folder named "genomes," and move the FASTA file containing the genome to the "genomes" folder.

In [None]:
!datasets download genome accession GCA_013085055.1 --include genome,seq-report

In [None]:
!unzip -o ncbi_dataset.zip

In [None]:
!cp ncbi_dataset/data/GCA_013085055.1/*.fna genomes/

Eukaryotic genomes often contain a significant proportion of repetitive sequences, which can introduce bias and inaccuracies during gene annotation. To address this, it is critical to mask repetitive elements prior to annotation, as these repeats can lead to the misidentification of genes or inflate gene counts.

In our analysis, we will first run RepeatModeler to build a library of the repeats in the genome and then RepeatMasker to mask those regions. The masked genome will then be annotated. To assess the impact of repeat masking on the annotation process, we will also annotate the unmasked genome and compare the results. This comparison will show any differences in gene prediction caused by repetitive elements and let us evaluate whether repeat masking is necessary and effective in our workflow.

First, let's create a new library of the repeats in your genome using RepeatModeler.

In [None]:
!BuildDatabase -name GCA_013085055 genomes/GCA_013085055.1_ASM1308505v1_genomic.fna
!RepeatModeler -database GCA_013085055

The output of RepeatModeler includes a newly generated library of repetitive elements, which can be found in a directory with the naming pattern RM_*/consensi.fa.classified. Replace the * in the folder name with the actual directory name generated by RepeatModeler in your specific run. Use this updated path in the code provided below to ensure accurate references to the repeat library.
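
Instead of searching for the directory by hand, the library path can be looked up from Python. This is a minimal sketch that assumes the `RM_*/consensi.fa.classified` naming pattern described above; `find_repeat_libraries` is a helper name introduced here, and the demo builds a throwaway directory with a made-up run name rather than using real RepeatModeler output.

```python
import glob
import os
import tempfile


def find_repeat_libraries(search_root="."):
    """Return all RM_*/consensi.fa.classified files under search_root, sorted."""
    pattern = os.path.join(search_root, "RM_*", "consensi.fa.classified")
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No repeat library matches {pattern}")
    return matches


# Demo in a temporary directory; "RM_12345.demo" is a made-up directory name
with tempfile.TemporaryDirectory() as tmp:
    run_dir = os.path.join(tmp, "RM_12345.demo")
    os.makedirs(run_dir)
    open(os.path.join(run_dir, "consensi.fa.classified"), "w").close()
    found = find_repeat_libraries(tmp)
    print(found)
```

In the notebook, `find_repeat_libraries(".")` would list the candidates in the working directory. If RepeatModeler was run more than once, several directories may match, and the intended one should be chosen explicitly.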

Now let's run RepeatMasker to mask the repeats

In [None]:
!RepeatMasker -gff -lib RM_*/consensi.fa.classified genomes/GCA_013085055.1_ASM1308505v1_genomic.fna
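
Once RepeatMasker has finished, one way to see how much of the genome was masked is to compute the fraction of masked bases in the `.masked` FASTA file. The cell below is a sketch under stated assumptions: `masked_fraction` is a helper name introduced here, it treats both `N` and lowercase letters as masked so it works for hard- or soft-masked output, and the example sequences are made up.

```python
def masked_fraction(fasta_text):
    """Return the fraction of bases that are N or lowercase in FASTA text."""
    total = 0
    masked = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # header line, not sequence
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base in "Nn" or base.islower())
    return masked / total if total else 0.0


# Made-up sequences: 4 of 10 bases hard-masked, 4 of 10 soft-masked
example_fasta = ">seq1\nACGTNNNNAC\n>seq2\nacgtACGTAC\n"
fraction = masked_fraction(example_fasta)
print(f"{fraction:.2%} of bases are masked")
```

Because an assembly can already contain `N` gaps or lowercase bases before masking, it is more informative to run the function on both the original and the masked FASTA and compare the two values.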

Install the Augustus annotation tool

In [None]:
!apt-get update
!apt-get install -y augustus

Run Augustus

Run Augustus on the unmasked genome

In [None]:
!augustus --species=fusarium --codingseq=on --protein=on genomes/GCA_013085055.1_ASM1308505v1_genomic.fna > augutus_annot.gff

Run Augustus on the masked genome

In [None]:
!augustus --species=fusarium --codingseq=on --protein=on genomes/GCA_013085055.1_ASM1308505v1_genomic.fna.masked > augutus_annot_masked.gff

Augustus will produce a GFF file that contains the structural annotation. This GFF file can be used to extract the gene and protein sequences.

In [None]:
!perl /usr/share/augustus/scripts/getAnnoFasta.pl augutus_annot.gff
!perl /usr/share/augustus/scripts/getAnnoFasta.pl augutus_annot_masked.gff

Count the predicted proteins and coding sequences in each annotation and compare the results

In [None]:
!grep -c '>' augutus_annot.aa
!grep -c '>' augutus_annot.codingseq
!grep -c '>' augutus_annot_masked.aa
!grep -c '>' augutus_annot_masked.codingseq
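
The same comparison can be made in Python, which puts the counts and their difference side by side. This is a minimal sketch: `count_fasta_records` and `compare_counts` are helper names introduced here, and the demo uses small made-up FASTA files in a temporary directory rather than the real Augustus output.

```python
import os
import tempfile


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def compare_counts(unmasked_path, masked_path):
    """Return record counts for both files and their difference."""
    unmasked = count_fasta_records(unmasked_path)
    masked = count_fasta_records(masked_path)
    return {"unmasked": unmasked, "masked": masked, "difference": unmasked - masked}


# Demo with made-up protein records (hypothetical file names and sequences)
with tempfile.TemporaryDirectory() as tmp:
    unmasked_demo = os.path.join(tmp, "unmasked_demo.aa")
    masked_demo = os.path.join(tmp, "masked_demo.aa")
    with open(unmasked_demo, "w") as fh:
        fh.write(">g1.t1\nMKV\n>g2.t1\nMAA\n>g3.t1\nMLL\n")
    with open(masked_demo, "w") as fh:
        fh.write(">g1.t1\nMKV\n>g2.t1\nMAA\n")
    demo_counts = compare_counts(unmasked_demo, masked_demo)
    print(demo_counts)
```

For the real data, call `compare_counts("augutus_annot.aa", "augutus_annot_masked.aa")`. A positive difference would mean fewer predictions after masking, which is the pattern expected if some predictions in the unmasked genome fell inside repeats.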