# Annotating the representative genome of *Hymenolepis microstoma*

*Hymenolepis microstoma* is a small tapeworm with a genome of 169 Mb. The genome was originally published in 2013 ([paper](https://doi.org/10.1038/nature12031)), and an annotated version is available under accession number [GCA_000469805.2](https://www.ncbi.nlm.nih.gov/assembly/GCA_000469805.2). This genome has two problems:

  1. The original annotation is of poor quality (this can be measured with BUSCO).

  2. A newer assembly version, without structural annotation of protein-coding genes, has been released in the meantime: [GCA_000469805.3](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/469/805/GCA_000469805.3_HMN_v3/).
  
Our task today is to annotate the new representative genome GCA_000469805.3 with several methods: BRAKER3, BRAKER2, BRAKER1, GALBA, and possibly TSEBRA combinations of the aforementioned. The goals are (a) to produce a high-quality annotation in a fully automated way, and (b) to find out which method produces the best results.

## Data

The genome is available at [/home/genomics/workshop_material/genome_annotation/genome/GCA_000469805.3_HMN_v3_genomic.fna](/home/genomics/workshop_material/genome_annotation/genome/GCA_000469805.3_HMN_v3_genomic.fna). We will be using the repeat masking provided by NCBI.

There are 44 Illumina RNA-Seq libraries listed for this species at the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra). Downloading and aligning all of them is not feasible in this workshop. Therefore, we have sampled reads and aligned them to this genome with VARUS ([paper](https://doi.org/10.1186/s12859-019-3182-x), [software](https://github.com/Gaius-Augustus/VARUS)). The resulting alignments are provided at [/home/genomics/workshop_material/genome_annotation/sra/Merged.bam](/home/genomics/workshop_material/genome_annotation/sra/Merged.bam).

In terms of OrthoDB taxonomy, the Metazoa partition is "the closest" for this species, available at [/home/genomics/workshop_material/genome_annotation/orthodb/Metazoa.fa](/home/genomics/workshop_material/genome_annotation/orthodb/Metazoa.fa).

Proteomes of close relatives (for GALBA) are [*Opisthokonta*](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/250/075/GCA_905250075.1_Rw_10x_megabubbles_genomic_v1/GCA_905250075.1_Rw_10x_megabubbles_genomic_v1_protein.faa.gz), [*Mesocestoides corti*](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/604/375/GCA_900604375.1_M_corti_Specht_Voge_0011_upd/GCA_900604375.1_M_corti_Specht_Voge_0011_upd_protein.faa.gz), [*Rodentolepis nana*](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/617/975/GCA_900617975.1_H_nana_Japan_0011_upd/GCA_900617975.1_H_nana_Japan_0011_upd_protein.faa.gz), [*Echinococcus multilocularis*](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/469/725/GCA_000469725.3_EMULTI002/GCA_000469725.3_EMULTI002_protein.faa.gz), [*Echinococcus granulosus*](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/524/195/GCF_000524195.1_ASM52419v1/GCF_000524195.1_ASM52419v1_protein.faa.gz), and [*Hydatigera taeniaeformis*](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/622/495/GCA_900622495.1_H_taeniaeformis_Canary_Islands_0011_upd/GCA_900622495.1_H_taeniaeformis_Canary_Islands_0011_upd_protein.faa.gz). These protein sets have been downloaded and concatenated, and their FASTA headers have been replaced by short, simple, and unique identifiers. The resulting file to be used with GALBA is available at [/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa](/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa).
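
The header replacement step can be sketched in Python as follows. This is only a minimal illustration, not the script that was actually used to build `proteins.fa`; the function name `simplify_headers`, the `prot` prefix, and the toy sequences are all hypothetical.

```python
def simplify_headers(fasta_lines, prefix="prot"):
    """Replace every FASTA header with a short, unique identifier.

    Returns the rewritten lines and a dict mapping new ID -> original header.
    """
    renamed, mapping = [], {}
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Number the records consecutively: prot1, prot2, ...
            new_id = f"{prefix}{len(mapping) + 1}"
            mapping[new_id] = line[1:]
            renamed.append(f">{new_id}")
        elif line:
            # Sequence lines are passed through unchanged; empty lines are dropped
            renamed.append(line)
    return renamed, mapping


# Toy input with made-up headers and sequences, for illustration only
toy = [
    ">CAD123.1 hypothetical protein [species A]",
    "MKTAYIAKQR",
    ">CAD123.1 hypothetical protein [species B]",
    "MSLLTEVETP",
]
renamed, mapping = simplify_headers(toy)
print("\n".join(renamed))
```

Keeping the returned mapping makes it possible to trace a short identifier back to its original header later on.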

## Group assignment

Runtime is a serious issue. Therefore, you are split into 4 different groups, and you will only complete the task of the group that you were assigned to. Execute the following code cell once to see which group you are in:

In [None]:
import random
# randint is inclusive on both ends, so (1, 4) covers exactly the four groups
print("You are in group", random.randint(1, 4))

Every task can fail. Should your task fail (e.g. because insufficient evidence was provided), pick a new group by re-executing the above code cell.

## Questions to all groups (with respect to the assigned task)

1. After executing the annotation task assigned to you, apply BUSCO to estimate proteome completeness (use [/home/genomics/workshop_material/genome_annotation/metazoa_odb10](/home/genomics/workshop_material/genome_annotation/metazoa_odb10)). What are the BUSCO scores?

2. How many genes and how many transcripts are predicted?

3. What is the mono:mult ratio (the ratio of mono-exonic to multi-exonic predictions)?

4. What is the median number of exons per transcript? 

5. What is the largest number of exons in a transcript?
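
The statistics asked for in questions 2 to 5 can be sketched in Python as shown here. This is a rough, independent illustration and not the `analyze_exons.py` script used later in this notebook, whose definitions may differ. It assumes an AUGUSTUS/BRAKER-style GTF in which every `CDS` line carries `transcript_id` and `gene_id` attributes, it counts CDS features as exons, and it computes the mono:mult ratio on the transcript level. The function name `exon_stats` and the toy records are hypothetical.

```python
import re
import statistics
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def exon_stats(gtf_lines, feature="CDS"):
    """Summarize gene, transcript and exon counts from GTF lines."""
    exons_per_tx = defaultdict(int)
    genes = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue
        tx = TRANSCRIPT_ID.search(cols[8])
        gene = GENE_ID.search(cols[8])
        if tx is None or gene is None:
            continue
        exons_per_tx[tx.group(1)] += 1
        genes.add(gene.group(1))
    counts = list(exons_per_tx.values())
    mono = sum(1 for c in counts if c == 1)
    mult = len(counts) - mono
    return {
        "genes": len(genes),
        "transcripts": len(counts),
        "mono_mult_ratio": mono / mult if mult else None,
        "median_exons": statistics.median(counts) if counts else None,
        "max_exons": max(counts, default=None),
    }


def cds(tx, gene, start, end):
    """Build one toy CDS line (made-up coordinates, for illustration only)."""
    attrs = f'transcript_id "{tx}"; gene_id "{gene}";'
    return "\t".join(["chr1", "toy", "CDS", str(start), str(end), ".", "+", "0", attrs])


toy_gtf = [
    cds("g1.t1", "g1", 1, 90), cds("g1.t1", "g1", 200, 290), cds("g1.t1", "g1", 400, 490),
    cds("g1.t2", "g1", 1, 90),
    cds("g2.t1", "g2", 1000, 1300),
]
stats = exon_stats(toy_gtf)
print(stats)
```

To apply the sketch to a real gene set, pass an open file handle instead of the toy list, for example `exon_stats(open("galba.gtf"))`.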

## Task for group 1: Apply GALBA

Use the following code cell to implement the application of GALBA, and execute GALBA. Use the genome file and the proteins.fa file mentioned above. Apply `--skipOptimize`.

In [None]:
%%script bash
T=4 # adjust to number of threads that you booted with

# delete output from a possible previous run if it exists
if [ -d GALBA_hymenolepsis ]
then
    rm -rf GALBA_hymenolepsis
fi

# --skipOptimize skips the AUGUSTUS parameter optimization to save time;
# remember to remove it if you are running a real job
time galba.pl --workingdir=GALBA_hymenolepsis --genome=/home/genomics/workshop_material/genome_annotation/genome/GCA_000469805.3_HMN_v3_genomic.fna \
    --prot_seq=/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa \
    --AUGUSTUS_BIN_PATH=/usr/bin/ --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize \
    2> galba.log

In [None]:
%%script bash

T=4 # adjust to number of threads that you booted with

source conda_init
conda activate busco_env

cd GALBA_hymenolepsis
# create the lineage link if not already present (without leaving the working directory)
mkdir -p busco_downloads
if [ ! -L busco_downloads/metazoa_odb10 ]
then
    ln -s /home/genomics/workshop_material/genome_annotation/busco/metazoa_odb10 busco_downloads/metazoa_odb10
fi

# delete old output if existing
if [ -d busco_galba_augustus ]
then
    rm -r busco_galba_augustus
fi
# run BUSCO
busco -m proteins -i augustus.hints.aa -o busco_galba_augustus \
        -l metazoa_odb10 -c ${T} &> busco_galba_augustus.log

# Count number of transcripts by counting FASTA headers
echo "Counting number of protein sequences = transcripts"
grep -c ">" augustus.hints.aa

# Number of genes, and descriptive statistics
echo "Computing some descriptive statistics for GALBA:"
analyze_exons.py -f galba.gtf
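
BUSCO reports its scores as a one-line summary of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`. To compare scores across groups, such a line can be turned into numbers with a few lines of Python. The sketch below is only an illustration: the function name `parse_busco_scores` is hypothetical, and the percentages in the example line are made up rather than taken from a real run.

```python
import re


def parse_busco_scores(summary_line):
    """Extract percentages (C, S, D, F, M, ...) and the total n from a BUSCO summary line."""
    scores = {
        key: float(value)
        for key, value in re.findall(r"([A-Z]):(\d+(?:\.\d+)?)%", summary_line)
    }
    total = re.search(r"n:(\d+)", summary_line)
    if total:
        scores["n"] = int(total.group(1))
    return scores


# Made-up percentages, for illustration only
example = "C:90.0%[S:85.0%,D:5.0%],F:4.0%,M:6.0%,n:954"
scores = parse_busco_scores(example)
print(scores)
```

Here `C` is the percentage of complete BUSCOs (split into single-copy `S` and duplicated `D`), `F` is fragmented, `M` is missing, and `n` is the number of BUSCO groups searched.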

## Task for group 2: Apply BRAKER3

Use the following code cell to implement the application of BRAKER3, and execute BRAKER3. Use the genome file, the OrthoDB Metazoa file, the proteins.fa file, and the Merged.bam file mentioned above. Apply `--skipOptimize`. Do not apply `--gm_max_intergenic 10000`. Expected runtime 10 minutes. If you find the BRAKER gene set to be of poor quality, consider running TSEBRA manually.

In [None]:
%%script bash

T=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER3_hymenolepsis ]
then
    rm -rf BRAKER3_hymenolepsis
fi

ORTHODB=/home/genomics/workshop_material/genome_annotation/orthodb/Metazoa.fa # adjust to suitable clade

# run BRAKER3
time braker.pl --workingdir=BRAKER3_hymenolepsis --genome=/home/genomics/workshop_material/genome_annotation/genome/GCA_000469805.3_HMN_v3_genomic.fna \
    --bam=/home/genomics/workshop_material/genome_annotation/sra/Merged.bam \
    --prot_seq=${ORTHODB},/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa \
    --AUGUSTUS_BIN_PATH=/usr/bin/ \
    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize

## Task for group 3: Apply BRAKER2

Use the following code cell to implement the application of BRAKER2, and execute BRAKER2. Use the genome file, the OrthoDB Metazoa file, and the proteins.fa file mentioned above. Apply `--skipOptimize`. Do not apply `--gm_max_intergenic 10000`. Expected runtime ... If you find the BRAKER gene set to be of poor quality, consider running TSEBRA manually.

In [None]:
%%script bash

T=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER2_hymenolepsis ]
then
    rm -rf BRAKER2_hymenolepsis
fi

ORTHODB=/home/genomics/workshop_material/genome_annotation/orthodb/Metazoa.fa # adjust to suitable clade

# run BRAKER2
time braker.pl --workingdir=BRAKER2_hymenolepsis --genome=/home/genomics/workshop_material/genome_annotation/genome/GCA_000469805.3_HMN_v3_genomic.fna \
    --prot_seq=${ORTHODB},/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa \
    --AUGUSTUS_BIN_PATH=/usr/bin/ \
    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize

## Task for group 4: Apply BRAKER1

Use the following code cell to implement the application of BRAKER1, and execute BRAKER1. Use the genome file and the Merged.bam file mentioned above. Apply `--skipOptimize`. Do not apply `--gm_max_intergenic 10000`. Expected runtime ... If you find the BRAKER gene set to be of poor quality, consider running TSEBRA manually.

In [15]:
%%script bash

T=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER1_hymenolepsis ]
then
    rm -rf BRAKER1_hymenolepsis
fi

# run BRAKER1
time braker.pl --workingdir=BRAKER1_hymenolepsis --genome=/home/genomics/workshop_material/genome_annotation/genome/GCA_000469805.3_HMN_v3_genomic.fna \
    --bam=/home/genomics/workshop_material/genome_annotation/sra/Merged.bam \
    --AUGUSTUS_BIN_PATH=/usr/bin/ \
    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize



#**********************************************************************************
#                               BRAKER CONFIGURATION                               
#**********************************************************************************
# BRAKER CALL: /opt/BRAKER/scripts/braker.pl --workingdir=BRAKER1_hymenolepsis --genome=/home/genomics/workshop_material/genome_annotation/genome/GCA_000469805.3_HMN_v3_genomic.fna --bam=/home/genomics/workshop_material/genome_annotation/sra/Merged.bam --AUGUSTUS_BIN_PATH=/usr/bin/ --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads 4 --skipOptimize
# Wed May  3 21:50:54 2023: braker.pl version 3.0.3
# Wed May  3 21:50:54 2023: Creating directory /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER1_hymenolepsis.
# Wed May  3 21:50:54 2023: Only RNA-Seq input detected, BRAKER will be executed in ET mode (BRAKER1).
# Wed May  3 21:50:54 2023: Configuring of BRAKER for using external tools...
# Wed May  3 21:50:54 2023: Tryin

CalledProcessError: Command 'b'\nT=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads\n\n# delete output from a possible previous run if it exists\nif [ -d BRAKER1_hymenolepsis ]\nthen\n    rm -rf BRAKER1_hymenolepsis\nfi\n\n# run BRAKER1\ntime braker.pl --workingdir=BRAKER1_hymenolepsis --genome=/home/genomics/workshop_material/genome_annotation/genome/GCA_000469805.3_HMN_v3_genomic.fna \\\n    --bam=/home/genomics/workshop_material/genome_annotation/sra/Merged.bam \\\n    --AUGUSTUS_BIN_PATH=/usr/bin/ \\\n    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \\\n    --skipOptimize\n'' returned non-zero exit status 1.