# Annotating Chromosome 4 of *Babesia duncani*

Originally, we had intended to provide an exercise for annotating a complete small genome. However, the computational resources available on AWS for this workshop do not permit completion of the task within one afternoon. Therefore, we provide the annotation of Chromosome 4 of *Babesia duncani* as a hands-on exercise for practicing your skills. 


*Babesia duncani* is a protozoan parasite that infects red blood cells in humans. It is primarily transmitted through tick bites. *Babesia duncani* is known to cause a malaria-like illness called babesiosis, which can lead to symptoms such as fever, fatigue, muscle aches, and anemia. The [genome](https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/323732/) has a size of 10.4 Mb. We selected Chromosome 4 from this genome, which is about 1 Mb in size.

**Warning:** None of the self-training genome annotation tools (BRAKER1, BRAKER2, BRAKER3, GALBA) are particularly suitable for annotating partial genomes. We strongly advise against doing this in practice. Please always use genomes that are as complete as possible. The only reason we resort to a chunk of a genome here is the limited computational resources and time available during this workshop.
  
Your task today is to annotate this partial genome with several methods (BRAKER3, BRAKER2, BRAKER1, GALBA, and possibly TSEBRA combinations of these) in order to (a) produce a high-quality annotation in a fully automated way, and (b) find out which method produces the best results.


## Data

The genome is available at [/home/genomics/workshop_material/genome_annotation/genome/genome.fa](/home/genomics/workshop_material/genome_annotation/genome/genome.fa). We will be using the repeat masking provided by NCBI.

We downloaded and aligned the library SRR18907291 from the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra). (Paired-end Illumina library, aligned with Hisat2, converted and sorted with Samtools; only aligned reads were kept with Samtools to reduce the BAM file size.) It is not feasible to process this data set during the workshop, so we provide the file [/home/genomics/workshop_material/genome_annotation/sra/SRR18907291.s.mapped.bam](/home/genomics/workshop_material/genome_annotation/sra/SRR18907291.s.mapped.bam).

In terms of OrthoDB taxonomy, the Alveolata partition is "the closest" for this species, available at [/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa](/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa). This is a rather "small" partition, and a priori, we do not know whether it will be sufficient for running BRAKER3 or BRAKER2 if used as the only protein input. We recommend combining it with proteins of other *Babesia* species (see below).

Proteomes of other *Babesia* species were downloaded from NCBI and concatenated, and the FASTA headers were replaced by short, simple, and unique headers. The resulting file to be used with GALBA is available at [/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa](/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa).

## What to expect?

This genome has no complete reference annotation yet. Looking at other *Babesia* genomes, we expect 3,000 to 5,000 proteins in the full genome. Chromosome 4 makes up only about 10% of the total genome, so we expect roughly 300 to 500 proteins to be predicted in this part of the genome.
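The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes that genes are spread evenly across the genome, which need not hold for a single chromosome, and it uses the approximate sizes quoted earlier (10.4 Mb and about 1 Mb).

```python
# Rough expectation for the number of proteins on Chromosome 4.
# Assumption: gene density is uniform across the genome.
genome_size_mb = 10.4       # full genome size stated above
chromosome_size_mb = 1.0    # approximate size of Chromosome 4 stated above
fraction = chromosome_size_mb / genome_size_mb

low, high = 3000, 5000      # expected proteins in the full genome
expected_low = round(low * fraction)
expected_high = round(high * fraction)

print(f"Chromosome 4 is about {fraction:.1%} of the genome")
print(f"Expected proteins on Chromosome 4: {expected_low} to {expected_high}")
```

The exact fraction is slightly below 10%, so the computed range lands a little under the rounded "300 to 500" given in the text.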

We applied BUSCO to the genome as follows:

```
busco -m genome -i genome.fa -o busco_genome -l alveolata_odb10
```

and obtained this output:

```
C:17.5%[S:17.5%,D:0.0%],F:1.2%,M:81.3%,n:171
```

The BUSCO core gene set for Alveolata is relatively small (only 171 genes). However, we expect similar sensitivity scores with a well-performing tool for structural genome annotation.
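To compare BUSCO results across methods, it can help to turn the one-line summary into numbers. The following is a minimal sketch, not part of BUSCO itself: the function name `parse_busco_summary` is made up for this example, and the regular expression only covers the summary format shown above (other BUSCO versions may print additional fields).

```python
import re

def parse_busco_summary(line):
    """Parse a BUSCO one-line summary into percentages and the total n."""
    pattern = (
        r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
        r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
    )
    match = re.search(pattern, line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["n"] = int(scores["n"])  # total number of BUSCOs in the lineage set
    return scores

scores = parse_busco_summary("C:17.5%[S:17.5%,D:0.0%],F:1.2%,M:81.3%,n:171")
# Percentages are rounded in the summary, so this count is approximate.
complete = round(scores["C"] / 100 * scores["n"])
print(scores)
print("Approximate number of complete BUSCOs:", complete)
```

For the genome-level result above, this corresponds to roughly 30 of the 171 Alveolata BUSCOs being found complete on Chromosome 4.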

## Group assignment

Runtime is a serious issue. Therefore, you are split into 4 different groups, and you will at first only complete the task of the group that you were assigned to. Execute the following code cell once to see which group you are in:

In [None]:
import random
print("You are in group", random.randint(1, 4))  # randint includes both endpoints; there are 4 groups

For all tasks, there is a possibility of failure. Should your task fail (e.g. because insufficient evidence was provided), pick a new group by re-executing the above code cell.

## Questions to all groups (with respect to the assigned task)

1. After executing the annotation task assigned to you, apply BUSCO to estimate proteome completeness (use [/home/genomics/workshop_material/genome_annotation/alveolata_odb10](/home/genomics/workshop_material/genome_annotation/alveolata_odb10)). What are the BUSCO scores?

2. How many genes and how many transcripts are predicted?

3. What is the mono:mult ratio (monoexonic to multiexonic transcripts)?

4. What is the median number of exons per transcript? 

5. What is the largest number of exons in a transcript?

Fill in the results for your task(s) at our [Google Results Sheet](https://docs.google.com/spreadsheets/d/1WT3LGMrvWS-ai3xRjFLVvFMSUlLTfqMFnYISbQES99I/edit?usp=sharing) in the grey fields. If no results can be obtained, enter "NA".
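The code cells below use the workshop script `analyze_exons.py` to answer questions 2 to 5. To make clear what these numbers mean, here is a minimal sketch of how they could be computed from a GTF file. It is not the workshop script: the function `exon_stats` and the toy data are made up for illustration, and the sketch assumes AUGUSTUS-style attributes (`transcript_id "..."; gene_id "...";`) and counts each `CDS` feature as one exon.

```python
from collections import defaultdict
from statistics import median

def exon_stats(gtf_lines):
    """Summarize genes, transcripts and exons per transcript from GTF lines."""
    exons = defaultdict(int)  # transcript_id -> number of CDS features
    genes = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        # Column 9 looks like: transcript_id "g1.t1"; gene_id "g1";
        attrs = dict(
            part.strip().split(" ", 1)
            for part in fields[8].split(";")
            if part.strip()
        )
        genes.add(attrs["gene_id"].strip('"'))
        exons[attrs["transcript_id"].strip('"')] += 1
    counts = list(exons.values())
    if not counts:
        raise ValueError("no CDS features found")
    mono = sum(1 for c in counts if c == 1)
    mult = len(counts) - mono
    return {
        "genes": len(genes),
        "transcripts": len(counts),
        "mono": mono,
        "mult": mult,
        "mono_mult_ratio": mono / mult if mult else float("inf"),
        "median_exons": median(counts),
        "max_exons": max(counts),
    }

def cds(start, end, tx, gene):
    """Build one toy CDS line (hypothetical coordinates)."""
    attrs = f'transcript_id "{tx}"; gene_id "{gene}";'
    return "\t".join(["chr4", "AUGUSTUS", "CDS", str(start), str(end), ".", "+", "0", attrs])

toy_gtf = [
    cds(100, 200, "g1.t1", "g1"), cds(300, 400, "g1.t1", "g1"),
    cds(1000, 1200, "g2.t1", "g2"),
    cds(2000, 2100, "g3.t1", "g3"), cds(2200, 2300, "g3.t1", "g3"),
    cds(2400, 2500, "g3.t1", "g3"),
]
stats = exon_stats(toy_gtf)
print(stats)
```

On a real run, you would pass the lines of `braker.gtf` or `augustus.hints.gtf` instead of the toy list, for example via `open("braker.gtf")`.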

## Common pitfalls

   * **Warning** BRAKER or GALBA may warn you that the number of training genes is very low. In our example, we are using only a partial genome, and for today, we will ignore this warning. However, in a real-life scenario with a complete genome, you should be cautious when seeing such a warning. It is not advisable to train a gene finder with fewer than 200 genes. Having more than 200 training genes may or may not give good results. A large number of training genes (no warning shown) often leads to good results.
   
   * Sometimes, a step in the pipeline may fail when processing your specific dataset. BRAKER usually reports that the most common problem is a missing or expired file called ~/.gm_key. However, during the workshop, we can assure you that the key file is not expired or missing! To find the actual reason for the failure, you can check the error file of the last command executed by BRAKER. You can find that command in the braker.log file in the working directory. Most often, the reason for the failure is missing data. In this workshop, we won't add any additional data to solve this problem as it would increase the runtime. However, in real-life scenarios, you may be able to overcome the issue by adding more RNA-Seq data or more protein data.
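The search for the actual failure reason can be sketched as follows. This is a hedged helper, not part of BRAKER: the function name `summarize_failure` is made up, and it assumes that the working directory contains `braker.log` and an `errors/` directory with `*.stderr` files, as suggested by the failed command shown in the BRAKER3 output further down (`errors/GeneMark-ETP.stderr`).

```python
from pathlib import Path

def summarize_failure(workdir, n_log_lines=5):
    """Return the last lines of braker.log and all non-empty *.stderr files."""
    workdir = Path(workdir)
    log_file = workdir / "braker.log"
    log_tail = []
    if log_file.is_file():
        log_tail = log_file.read_text(errors="replace").splitlines()[-n_log_lines:]

    errors_dir = workdir / "errors"
    stderr_files = []
    if errors_dir.is_dir():
        # Sort by modification time so the most recent error file comes last.
        stderr_files = sorted(
            (p for p in errors_dir.glob("*.stderr") if p.stat().st_size > 0),
            key=lambda p: p.stat().st_mtime,
        )
    return log_tail, stderr_files

# "BRAKER3_babesia" is the working directory used in the task for group 2.
log_tail, stderr_files = summarize_failure("BRAKER3_babesia")
print("Last lines of braker.log:")
for line in log_tail:
    print("  ", line)
print("Non-empty error files, oldest first:")
for path in stderr_files:
    print("  ", path)
```

The last non-empty error file is usually the first place to look. If the working directory does not exist yet, both lists are simply empty.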

## Task for group 1: Apply GALBA

Use the following code cell to implement the application of GALBA, and execute GALBA. Use the genome file and the proteins.fa file mentioned above. Apply `--skipOptimize`. (Expected runtime: ~10 minutes.)

In [1]:
%%script bash
T=4 # adjust to number of threads that you booted with

# delete output from a possible previous run if it exists
if [ -d GALBA_babesia ]
then
    rm -rf GALBA_babesia
fi

time galba.pl --workingdir=GALBA_babesia --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa \
    --prot_seq=/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa \
    --AUGUSTUS_BIN_PATH=/usr/bin/ --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize

# Sun May  7 12:27:55 2023: Log information is stored in file /home/katharina/git/GenomeAnnotation_Workshop2023/GALBA_babesia/GALBA.log
#*********
#*********
#*********
#*********



real	10m51.353s
user	11m37.444s
sys	0m6.867s


In [4]:
%%script bash

T=4 # adjust to number of threads that you booted with

source conda_init
conda activate busco_env

cd GALBA_babesia
# create the link to the BUSCO lineage if not already present
# (no cd into busco_downloads, so that BUSCO below still runs in GALBA_babesia)
mkdir -p busco_downloads
if [ ! -L busco_downloads/alveolata_odb10 ]
then
    ln -s /home/genomics/workshop_material/genome_annotation/busco/alveolata_odb10 busco_downloads/alveolata_odb10
fi

# delete old output if existing
if [ -d busco_galba_babesia ]
then
    rm -r busco_galba_babesia
fi

# run BUSCO
busco -m proteins -i augustus.hints.aa -o busco_galba_babesia \
        -l alveolata_odb10 -c ${T} &> busco_galba_babesia.log

# Count number of transcripts by counting FASTA headers
echo "Counting number of protein sequences = transcripts"
grep -c ">" augustus.hints.aa

# Number of genes, and descriptive statistics
echo "Computing some descriptive statistics for GALBA:"
analyze_exons.py -f augustus.hints.gtf

Counting number of protein sequences = transcripts
516
Computing some descriptive statistics for GALBA:
Number of transcripts: 468
Largest number of exons in all transcripts: 2
Monoexonic transcripts: 240
Multiexonic transcripts: 228
Mono:Mult Ratio: 1.05
Boxplot of number of exons per transcript:
Min: 1
25%: 1
50%: 1
75%: 2
Max: 2


## Task for group 2: Apply BRAKER3

Use the following code cell to implement the application of BRAKER3, and execute BRAKER3. Use the genome file, the OrthoDB Alveolata file, the proteins.fa file, and the SRR18907291.s.mapped.bam file mentioned above. Apply `--skipOptimize`. Do not apply `--gm_max_intergenic 10000`. (Expected runtime: ~10 minutes.)

In [5]:
%%script bash

T=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER3_babesia ]
then
    rm -rf BRAKER3_babesia
fi

ORTHODB=/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa # adjust to suitable clade

# run BRAKER3
time braker.pl --workingdir=BRAKER3_babesia --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa \
    --bam=/home/genomics/workshop_material/genome_annotation/sra/SRR18907291.s.mapped.bam \
    --prot_seq=${ORTHODB},/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa \
    --AUGUSTUS_BIN_PATH=/usr/bin/ \
    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize

#**********************************************************************************
#                               BRAKER CONFIGURATION                               
#**********************************************************************************
# BRAKER CALL: /opt/BRAKER/scripts/braker.pl --workingdir=BRAKER3_babesia --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa --bam=/home/genomics/workshop_material/genome_annotation/sra/SRR18907291.s.mapped.bam --prot_seq=/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa,/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa --AUGUSTUS_BIN_PATH=/usr/bin/ --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads 4 --skipOptimize
# Sun May  7 12:45:53 2023: braker.pl version 3.0.3
# Sun May  7 12:45:53 2023: Creating directory /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_babesia.
# Sun May  7 12:45:53 2023: Creating directory /home/katharina/git/GenomeAnnotat

# Sun May  7 12:45:53 2023: Creating directory /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_babesia.
# Sun May  7 12:45:53 2023:Both protein and RNA-Seq data in input detected. BRAKER will be executed in ETP mode (BRAKER3).
#*********
# Sun May  7 12:45:54 2023: Log information is stored in file /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_babesia/braker.log


#*********
#*********
ERROR in file /opt/BRAKER/scripts/braker.pl at line 5473
Failed to execute: /usr/bin/perl /opt/ETP/bin/gmetp.pl --cfg /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_babesia/GeneMark-ETP/etp_config.yaml --workdir /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_babesia/GeneMark-ETP --bam /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_babesia/GeneMark-ETP/etp_data/ --cores 4 --softmask  1>/home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_babesia/errors/GeneMark-ETP.stdout 2>/home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER3_babesia/errors/GeneMark-ETP.stderr

CalledProcessError: Command 'b'\nT=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads\n\n# delete output from a possible previous run if it exists\nif [ -d BRAKER3_babesia ]\nthen\n    rm -rf BRAKER3_babesia\nfi\n\nORTHODB=/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa # adjust to suitable clade\n\n# run BRAKER3\ntime braker.pl --workingdir=BRAKER3_babesia --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa \\\n    --bam=/home/genomics/workshop_material/genome_annotation/sra/SRR18907291.s.mapped.bam \\\n    --prot_seq=${ORTHODB},/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa \\\n    --AUGUSTUS_BIN_PATH=/usr/bin/ \\\n    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \\\n    --skipOptimize\n'' returned non-zero exit status 1.

In [None]:
%%script bash

T=4 # adjust to number of threads that you booted with

# CANNOT BE EXECUTED, BRAKER3 DIES!

source conda_init
conda activate busco_env

cd BRAKER3_babesia
# create the link to the BUSCO lineage if not already present
# (no cd into busco_downloads, so that BUSCO below still runs in BRAKER3_babesia)
mkdir -p busco_downloads
if [ ! -L busco_downloads/alveolata_odb10 ]
then
    ln -s /home/genomics/workshop_material/genome_annotation/busco/alveolata_odb10 busco_downloads/alveolata_odb10
fi

# delete old output if existing
if [ -d busco_braker3_babesia ]
then
    rm -r busco_braker3_babesia
fi

# run BUSCO
busco -m proteins -i braker.aa -o busco_braker3_babesia \
        -l alveolata_odb10 -c ${T} &> busco_braker3_babesia.log

# Count number of transcripts by counting FASTA headers
echo "Counting number of protein sequences = transcripts"
grep -c ">" braker.aa

# Number of genes, and descriptive statistics
echo "Computing some descriptive statistics for BRAKER3:"
analyze_exons.py -f braker.gtf

## Task for group 3: Apply BRAKER2

Use the following code cell to implement the application of BRAKER2, and execute BRAKER2. Use the genome file, the OrthoDB Alveolata file, and the proteins.fa file mentioned above. Apply `--skipOptimize`. Do not apply `--gm_max_intergenic 10000`. (Expected runtime: ~30 minutes.)

In [None]:
%%script bash

T=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER2_babesia ]
then
    rm -rf BRAKER2_babesia
fi

ORTHODB=/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa # adjust to suitable clade

# run BRAKER2
time braker.pl --workingdir=BRAKER2_babesia --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa \
    --prot_seq=${ORTHODB},/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa \
    --AUGUSTUS_BIN_PATH=/usr/bin/ \
    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize

#**********************************************************************************
#                               BRAKER CONFIGURATION                               
#**********************************************************************************
# BRAKER CALL: /opt/BRAKER/scripts/braker.pl --workingdir=BRAKER2_babesia --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa --prot_seq=/home/genomics/workshop_material/genome_annotation/orthodb/Alveolata.fa,/home/genomics/workshop_material/genome_annotation/proteins/proteins.fa --AUGUSTUS_BIN_PATH=/usr/bin/ --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads 4 --skipOptimize
# Sun May  7 12:57:20 2023: braker.pl version 3.0.3
# Sun May  7 12:57:20 2023: Creating directory /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER2_babesia.
# Sun May  7 12:57:20 2023: Only Protein input detected, BRAKER will be executed in EP mode (BRAKER2).
# Sun May  7 12:57:20 2023: Configuring of BRAKER for using ext

# Sun May  7 12:57:20 2023: Log information is stored in file /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER2_babesia/braker.log


#*********
#*********
ProtHint Version 2.6.0
Copyright 2019, Georgia Institute of Technology, USA

Please cite
  - ProtHint: https://doi.org/10.1093/nargab/lqaa026
  - DIAMOND:  https://doi.org/10.1038/nmeth.3176
  - Spaln:    https://doi.org/10.1093/bioinformatics/btn460

Called from: /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER2_babesia
Cmd: /opt/ETP/bin/gmes/ProtHint/bin/prothint.py --threads=4 --geneMarkGtf /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER2_babesia/GeneMark-ES/genemark.gtf /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER2_babesia/genome.fa /home/katharina/git/GenomeAnnotation_Workshop2023/BRAKER2_babesia/proteins.fa

[Sun May  7 13:10:51 2023] Pre-processing protein input
[Sun May  7 13:10:53 2023] Skipping GeneMark-ES, using the supplied gene seeds file instead
[Sun May  7 13:10:53 2023] Translating gene seeds to proteins
[Sun May  7 13:10:53 2023] Translation of seeds finished
[Sun May  7 13:10:53 2023] Running DIAMOND
diamond v0.9.24.1

[Sun May  7 13:12:35 2023] Enqueueing pair 345/7835 (4.4%)
[...]
[Sun May  7 13:14:09 2023] Enqueueing pair 3409/7835 (43.5%). Est. time left: 00:02:03 (hh:mm:ss)
[...]
[Sun May  7 13:16:49 2023] Enqueueing pair 7428/7835 (94.8%). Est. time left: 00:00:14 (hh:mm:ss)

#*********
# The hints file(s) for GeneMark-EX contain less than 1000 introns. (In total, 887 unique introns are contained.)
# Genemark-EX might fail due to the low number of hints.
#*********


In [None]:
%%script bash

T=4 # adjust to number of threads that you booted with

source conda_init
conda activate busco_env

cd BRAKER2_babesia
# create the link to the BUSCO lineage if not already present
# (no cd into busco_downloads, so that BUSCO below still runs in BRAKER2_babesia)
mkdir -p busco_downloads
if [ ! -L busco_downloads/alveolata_odb10 ]
then
    ln -s /home/genomics/workshop_material/genome_annotation/busco/alveolata_odb10 busco_downloads/alveolata_odb10
fi

# delete old output if existing
if [ -d busco_braker2_babesia ]
then
    rm -r busco_braker2_babesia
fi

# run BUSCO
busco -m proteins -i braker.aa -o busco_braker2_babesia \
        -l alveolata_odb10 -c ${T} &> busco_braker2_babesia.log

# Count number of transcripts by counting FASTA headers
echo "Counting number of protein sequences = transcripts"
grep -c ">" braker.aa

# Number of genes, and descriptive statistics
echo "Computing some descriptive statistics for BRAKER2:"
analyze_exons.py -f braker.gtf

## Task for group 4: Apply BRAKER1

Use the following code cell to implement the application of BRAKER1, and execute BRAKER1. Use the genome file and the SRR18907291.s.mapped.bam file mentioned above. Apply `--skipOptimize`. Do not apply `--gm_max_intergenic 10000`. (Expected runtime: ~10 minutes.)

In [None]:
%%script bash

T=4 # adjust to number of threads that you booted with, takes ~30 minutes with 4 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER1_babesia ]
then
    rm -rf BRAKER1_babesia
fi

# run BRAKER1
time braker.pl --workingdir=BRAKER1_babesia --genome=/home/genomics/workshop_material/genome_annotation/genome/genome.fa \
    --bam=/home/genomics/workshop_material/genome_annotation/sra/SRR18907291.s.mapped.bam \
    --AUGUSTUS_BIN_PATH=/usr/bin/ \
    --AUGUSTUS_SCRIPTS_PATH=/usr/share/augustus/scripts/ --threads ${T} \
    --skipOptimize

## Do we have a sensitivity problem?

When comparing the obtained BUSCO scores, you may wonder why we are missing BUSCOs that were detected at the genome level. One possible reason is that MetaEuk, the gene finder employed to detect BUSCOs at the genome level, does not always ensure the validity of a gene structure. Another reason, particularly in BRAKER, may be the lack of evidence and poor training (remember, we are using only a small proportion of a small genome, and we skipped an optimization step for AUGUSTUS training). If you have time left, you may play with TSEBRA to generate a possibly superior gene set. Consider combining the gene sets of BRAKER1 and BRAKER2, or even of BRAKER1, BRAKER2, and BRAKER3. Keep in mind that we learned how to enforce a gene set even when evidence is lacking.
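As a starting point for the TSEBRA experiment, the sketch below only assembles and prints a command line; it does not run TSEBRA. Several things here are assumptions to be checked against `tsebra.py --help` on the workshop machine: the option letters (`-g` for gene sets, `-k` for gene sets to enforce, `-e` for hints, `-c` for the config file, `-o` for the output), the presence of `braker.gtf` and `hintsfile.gff` in each BRAKER working directory, and the placeholder config path `TSEBRA_CFG`. The helper `tsebra_command` is made up for this example.

```python
import shlex

def tsebra_command(gtfs, hints, cfg, out, keep_gtfs=()):
    """Assemble a tsebra.py call as an argument list (nothing is executed)."""
    cmd = ["tsebra.py", "-g", ",".join(gtfs), "-e", ",".join(hints),
           "-c", cfg, "-o", out]
    if keep_gtfs:
        # Gene sets passed here are meant to be kept even without evidence.
        cmd += ["-k", ",".join(keep_gtfs)]
    return cmd

TSEBRA_CFG = "default.cfg"  # hypothetical path; use the config shipped with TSEBRA

cmd = tsebra_command(
    gtfs=["BRAKER1_babesia/braker.gtf", "BRAKER2_babesia/braker.gtf"],
    hints=["BRAKER1_babesia/hintsfile.gff", "BRAKER2_babesia/hintsfile.gff"],
    cfg=TSEBRA_CFG,
    out="tsebra_braker1_braker2.gtf",
)
print(shlex.join(cmd))
```

Once the printed command looks right for your setup, you can paste it into a `%%script bash` cell and then evaluate the resulting gene set with BUSCO in the same way as for the other tasks.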

In [None]:
<details>
  <summary><b>Out of time but want to see results? Click here!</b></summary>

</details>