# Annotating the representative genome of *Paramecium tetraurelia*

*Paramecium tetraurelia* strain d4-2 has a very small so-called macronuclear genome (0.98 Mb). This is not the kind of genome that we usually expect to see, but its small size makes it a good example for our workshop.
  
Our task today is to annotate this genome with several methods: BRAKER3, BRAKER2, BRAKER1, GALBA, and possibly TSEBRA combinations of the aforementioned. The goals are to (a) produce a high-quality annotation in a fully automated way, and (b) find out which method produces the best results.

## Data

The genome is available at [/home/genomics/workshop_material/genome_annotation/genome/genome.fa](/home/genomics/workshop_material/genome_annotation/genome/genome.fa). We will be using the repeat masking provided by NCBI.

There are many Illumina RNA-Seq libraries listed for this species at the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra). It is not feasible to download and align all of them. Therefore, we have randomly selected SRR19666444, aligned it to the genome with Hisat2, converted the output to BAM format with samtools, and sorted the BAM file. The output is provided at [/home/genomics/workshop_material/genome_annotation/sra/SRR19666444.s.bam](/home/genomics/workshop_material/genome_annotation/sra/SRR19666444.s.bam).
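
The preprocessing described above (align, convert to BAM, sort) can be sketched as a small command builder. This is only an illustration, not the exact commands used to prepare the workshop file: the read file names, the index prefix, and the assumption of a paired-end library are hypothetical. The function only assembles and prints the commands; pass each list to `subprocess.run(cmd, check=True)` if you want to execute them.

```python
import shlex


def alignment_commands(genome, reads_1, reads_2, prefix, threads=4):
    """Return the HISAT2/samtools commands for align -> BAM -> sorted BAM.

    Assumes a paired-end library (reads_1/reads_2); file names are hypothetical.
    """
    index = f"{prefix}_index"
    return [
        # build the HISAT2 index of the genome
        ["hisat2-build", genome, index],
        # align the reads, writing SAM output
        ["hisat2", "-p", str(threads), "-x", index,
         "-1", reads_1, "-2", reads_2, "-S", f"{prefix}.sam"],
        # convert SAM to BAM
        ["samtools", "view", "-b", "-o", f"{prefix}.bam", f"{prefix}.sam"],
        # sort the BAM file by coordinate
        ["samtools", "sort", "-@", str(threads),
         "-o", f"{prefix}.s.bam", f"{prefix}.bam"],
    ]


commands = alignment_commands(
    "genome.fa", "SRR19666444_1.fastq", "SRR19666444_2.fastq", "SRR19666444"
)
for cmd in commands:
    print(shlex.join(cmd))
```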

In terms of OrthoDB taxonomy, the Alveolata partition is "the closest" for this species, available at [/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa](/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa). This is a rather "small" partition, and a priori, we do not know whether it will be sufficient for running BRAKER3 or BRAKER2 if used as the only protein input.

Proteomes of close relatives (for GALBA) have been downloaded and concatenated, and the FASTA headers have been replaced by short, simple, and unique headers. The resulting file to be used with GALBA is available at [/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa](/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa).
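
If you want to prepare such a protein file for your own species, the concatenation and header replacement described above could look roughly like the following sketch. The function name, the `prot` prefix, and the toy input files are assumptions for illustration; this is not the script that produced `proteins.fa`.

```python
import os
import tempfile


def concatenate_with_simple_headers(fasta_paths, out_path, prefix="prot"):
    """Concatenate FASTA files and replace every header by a short unique ID.

    Returns a dict mapping the new ID to the original header text.
    """
    mapping = {}
    counter = 0
    with open(out_path, "w") as out:
        for path in fasta_paths:
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        counter += 1
                        new_id = f"{prefix}{counter}"
                        mapping[new_id] = line[1:].strip()
                        out.write(f">{new_id}\n")
                    elif line.strip():
                        # sequence line: copy it, skipping empty lines
                        out.write(line.rstrip() + "\n")
    return mapping


# Tiny demonstration on two made-up FASTA files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "species_a.fa")
    second = os.path.join(tmp, "species_b.fa")
    with open(first, "w") as fh:
        fh.write(">sp|X1|long description here\nMKV\n")
    with open(second, "w") as fh:
        fh.write(">sp|X1|same accession in another file\nMAA\nGGT\n")
    merged = os.path.join(tmp, "merged.fa")
    id_map = concatenate_with_simple_headers([first, second], merged)
    with open(merged) as fh:
        print(fh.read())
    print(id_map)
```

Keeping the returned mapping lets you trace a simplified ID back to its original header later on.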

## Group assignment

Runtime is a serious issue. Therefore, you are split into 4 different groups, and you will only complete the task of the group that you were assigned to. Execute the following code cell once to see which group you are in:

In [None]:
import random
# randint includes both endpoints, so this draws one of the 4 groups
print("You are in group", random.randint(1, 4))

For all tasks, there is a possibility of failure. Should your task fail (e.g. because insufficient evidence was provided), pick a new group by re-executing the above code cell.

## Questions to all groups (with respect to the assigned task)

1. After executing the annotation task assigned to you, apply BUSCO to estimate proteome completeness (use [/home/genomics/workshop_material/genome_annotation/alveolata_odb10](/home/genomics/workshop_material/genome_annotation/alveolata_odb10)). What are the BUSCO scores?

2. How many genes and how many transcripts are predicted?

3. What is the mono:mult ratio (single-exon to multi-exon)?

4. What is the median number of exons per transcript? 

5. What is the largest number of exons in a transcript?
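
Questions 2 to 5 can be answered from the GTF file of your gene set. The following is a minimal sketch, not the `analyze_exons.py` script used in this workshop. It assumes a GTF in which each coding exon is a `CDS` line carrying `transcript_id "..."` and `gene_id "..."` attributes, and it computes the mono:mult ratio per transcript; both choices are assumptions that you may need to adapt.

```python
import re
import statistics
from collections import Counter

TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')
GENE_RE = re.compile(r'gene_id "([^"]+)"')


def gtf_exon_stats(lines, feature="CDS"):
    """Count genes, transcripts and exons per transcript from GTF lines."""
    exons_per_tx = Counter()
    genes = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        tx = TRANSCRIPT_RE.search(fields[8])
        gene = GENE_RE.search(fields[8])
        if tx is None or gene is None:
            continue
        exons_per_tx[tx.group(1)] += 1
        genes.add(gene.group(1))
    counts = list(exons_per_tx.values())
    mono = sum(1 for c in counts if c == 1)  # single-exon transcripts
    mult = len(counts) - mono                # multi-exon transcripts
    return {
        "genes": len(genes),
        "transcripts": len(counts),
        "mono": mono,
        "mult": mult,
        "mono_mult_ratio": mono / mult if mult else None,
        "median_exons": statistics.median(counts) if counts else None,
        "max_exons": max(counts) if counts else None,
    }


# Made-up toy records: gene g1 with two transcripts, gene g2 with one.
toy_gtf = [
    'chr1\tAUGUSTUS\tCDS\t1\t90\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n',
    'chr1\tAUGUSTUS\tCDS\t1\t30\t.\t+\t0\ttranscript_id "g1.t2"; gene_id "g1";\n',
    'chr1\tAUGUSTUS\tCDS\t61\t90\t.\t+\t0\ttranscript_id "g1.t2"; gene_id "g1";\n',
    'chr1\tAUGUSTUS\tCDS\t200\t230\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";\n',
    'chr1\tAUGUSTUS\tCDS\t260\t290\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";\n',
    'chr1\tAUGUSTUS\tCDS\t320\t350\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";\n',
]
print(gtf_exon_stats(toy_gtf))
```

For a real gene set, pass an open file handle instead of the toy list, e.g. `gtf_exon_stats(open("galba.gtf"))`.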

## Task for group 1: Apply GALBA

Use the following code cell to implement the application of GALBA, and execute GALBA. Use the genome file and the proteins.fa file mentioned above. Apply `--skipOptimize`.

In [1]:
%%script bash
T=4 # adjust to number of threads that you booted with

# delete output from a possible previous run if it exists
if [ -d GALBA_param ]
then
    rm -rf GALBA_param
fi

# --skipOptimize saves time in this workshop; remember to remove it if you are running a real job
time galba.pl --workingdir=GALBA_param --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa \
    --prot_seq=/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa \
    --AUGUSTUS_BIN_PATH=/usr/bin/ --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize \
    2> galba.log

# Fri May  5 16:31:07 2023: Log information is stored in file /home/katharina/git/GenomeAnnotation_Workshop2023/GALBA_param/GALBA.log


# Fri May  5 16:32:23 2023: ERROR: in file /opt/GALBA/scripts/galba.pl at line 3975
Training gene file in genbank format /home/katharina/git/GenomeAnnotation_Workshop2023/GALBA_param/train.ff.gb does not contain any training genes. At this stage, we performed a filtering step that discarded all genes that lead to etraining errors. If you lost all training genes, now, that means you probably have an extremely fragmented assembly where all training genes are incomplete, or similar.

real	1m16.321s
user	4m16.485s

In [None]:
%%script bash

T=4 # adjust to number of threads that you booted with

source conda_init
conda activate busco_env

cd GALBA_param
# create links if not already present
if [ ! -d busco_downloads ]
then
    mkdir busco_downloads
    cd busco_downloads
    if [ ! -L alveolata_odb10 ]
    then
        # lineage named in question 1
        ln -s /home/genomics/workshop_material/genome_annotation/alveolata_odb10 alveolata_odb10
    fi
    # return to the GALBA working directory
    cd ..
fi

# delete old output if existing
if [ -d busco_galba_augustus ]
then
    rm -r busco_galba_augustus
fi
# run BUSCO
busco -m proteins -i augustus.hints.aa -o busco_galba_augustus \
        -l alveolata_odb10 -c ${T} &> busco_galba_augustus.log

# Count number of transcripts by counting FASTA headers
echo "Counting number of protein sequences = transcripts"
grep -c ">" augustus.hints.aa

# Number of genes, and descriptive statistics
echo "Computing some descriptive statistics for GALBA:"
analyze_exons.py -f galba.gtf
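
To compare BUSCO scores between groups, it can help to pull the numbers out of the one-line summary that BUSCO writes to its log and short summary file. The sketch below assumes the common `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout of that line; the example string is made up for illustration and is not a result from this workshop.

```python
import re

BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(text):
    """Extract the BUSCO percentages and the number of BUSCOs from text."""
    match = BUSCO_RE.search(text)
    if match is None:
        raise ValueError("no BUSCO one-line summary found")
    scores = {key: float(match.group(key)) for key in ("C", "S", "D", "F", "M")}
    scores["n"] = int(match.group("n"))
    return scores


# Made-up example line, only to show the expected layout.
example = "\tC:90.0%[S:85.0%,D:5.0%],F:4.0%,M:6.0%,n:100\t"
print(parse_busco_summary(example))
```

The function accepts any text that contains the summary line, so you can pass it the full contents of the BUSCO log or summary file.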

## Task for group 2: Apply BRAKER3

Use the following code cell to implement the application of BRAKER3, and execute BRAKER3. Use the genome file, the OrthoDB Alveolata file, the proteins.fa file, and the SRR19666444.s.bam file mentioned above. Apply `--skipOptimize`. Do not apply `--gm_max_intergenic 10000`. Expected runtime 10 minutes. If you find the BRAKER gene set to be of poor quality, consider applying TSEBRA manually.

In [2]:
%%script bash

T=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER3_param ]
then
    rm -rf BRAKER3_param
fi

ORTHODB=/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa # adjust to suitable clade

# run BRAKER3
time braker.pl --workingdir=BRAKER3_param --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa \
    --bam=/home/genomics/workshop_material/genome_annotation/sra/SRR19666444.s.bam \
    --prot_seq=${ORTHODB},/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa \
    --AUGUSTUS_BIN_PATH=/usr/bin/ \
    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize

#**********************************************************************************
#                               BRAKER CONFIGURATION                               
#**********************************************************************************
# BRAKER CALL: /opt/BRAKER/scripts/braker.pl --workingdir=BRAKER3_hymenolepsis --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa --bam=/home/genomics/workshop_material/genome_annotation/sra/SRR19666444.s.bam --prot_seq=/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa,/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa --AUGUSTUS_BIN_PATH=/usr/bin/ --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads 4 --skipOptimize
# Fri May  5 16:36:00 2023: braker.pl version 3.0.3
# Fri May  5 16:36:00 2023: Creating directory /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_hymenolepsis.
# Fri May  5 16:36:00 2023: Creating directory /home/katharina/git/GenomeAnno

# Fri May  5 16:36:00 2023:Both protein and RNA-Seq data in input detected. BRAKER will be executed in ETP mode (BRAKER3).
#*********
# Fri May  5 16:36:00 2023: Log information is stored in file /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_hymenolepsis/braker.log


#*********
#*********
ERROR in file /opt/BRAKER/scripts/braker.pl at line 5473
Failed to execute: /usr/bin/perl /opt/ETP/bin/gmetp.pl --cfg /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_hymenolepsis/GeneMark-ETP/etp_config.yaml --workdir /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_hymenolepsis/GeneMark-ETP --bam /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_hymenolepsis/GeneMark-ETP/etp_data/ --cores 4 --softmask  1>/home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_hymenolepsis/errors/GeneMark-ETP.stdout 2>/home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_hymenolepsis/errors/GeneMark-ETP.stderr

CalledProcessError: Command 'b'\nT=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads\n\n# delete output from a possible previous run if it exists\nif [ -d BRAKER3_param ]\nthen\n    rm -rf BRAKER3_param\nfi\n\nORTHODB=/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa # adjust to suitable clade\n\n# run BRAKER3\ntime braker.pl --workingdir=BRAKER3_hymenolepsis --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa \\\n    --bam=/home/genomics/workshop_material/genome_annotation/sra/SRR19666444.s.bam \\\n    --prot_seq=${ORTHODB},/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa \\\n    --AUGUSTUS_BIN_PATH=/usr/bin/ \\\n    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \\\n    --skipOptimize\n'' returned non-zero exit status 1.

## Task for group 3: Apply BRAKER2

Use the following code cell to implement the application of BRAKER2, and execute BRAKER2. Use the genome file, the OrthoDB Alveolata file, and the proteins.fa file mentioned above. Apply `--skipOptimize`. Do not apply `--gm_max_intergenic 10000`. Expected runtime ... If you find the BRAKER gene set to be of poor quality, consider applying TSEBRA manually.

In [None]:
%%script bash

T=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER2_param ]
then
    rm -rf BRAKER2_param
fi

ORTHODB=/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa # adjust to suitable clade

# run BRAKER2
time braker.pl --workingdir=BRAKER2_param --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa \
    --prot_seq=${ORTHODB},/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa \
    --AUGUSTUS_BIN_PATH=/usr/bin/ \
    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize

## Task for group 4: Apply BRAKER1

Use the following code cell to implement the application of BRAKER1, and execute BRAKER1. Use the genome file and the SRR19666444.s.bam file mentioned above. Apply `--skipOptimize`. Do not apply `--gm_max_intergenic 10000`. Expected runtime ... If you find the BRAKER gene set to be of poor quality, consider applying TSEBRA manually.

In [None]:
%%script bash

T=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER1_param ]
then
    rm -rf BRAKER1_param
fi

# run BRAKER1
time braker.pl --workingdir=BRAKER1_param --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa \
    --bam=/home/genomics/workshop_material/genome_annotation/sra/SRR19666444.s.bam \
    --AUGUSTUS_BIN_PATH=/usr/bin/ \
    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize