# Genome Annotation

Materials for the Genome Annotation a BRAKER & TSEBRA Genome Annotation workshop by Katharina Hoff (katharina.hoff@uni-greifswald.de).

In the following, we will walk through the process of genome annotation on the example of a small proportion of the *Arabidopsis thaliana* genome.

## Repeat masking

Repetitive sequences are a huge problem for genome annotation. Some repeats only coincidentally look like protein-coding genes, others (such as transposases) are protein-coding genes, but we usually are not interested in any of these "repeat genes" when trying to find protein-coding genes in a novel genome. Thus, a genome should be repeat-masked prior gene prediction. 

Repeat masking is a resource and time-consuming step that is out of scope for this workshop. We recommend using RepeatModeler2 ([paper](https://doi.org/10.1073/pnas.1921046117), [software](https://www.repeatmasker.org/RepeatModeler/) ) to construct a species-specific repeat library and mask the genome with RepeatMasker (ideally, you will perform these computations on a node with >70 threads, in a place with very fast storage i/o, possibly using RAM instead of actual hard drive as a temporary file storage place):

```
T=72 # you need a large number of threads and fast i/o storage
GENOME=/opt/BRAKER/example/genome.fa
DB=some_db_name_that_fits_to_species

BuildDatabase -name ${DB} ${GENOME}
RepeatModeler -database ${DB} -pa ${T} -LTRStruct
RepeatMasker -pa 72 -lib ${DB}-families.fa -xsmall ${GENOME}
```

This results in a file `${GENOME}.masked`. 

<details>
  <summary><b>Click to learn how to mask more rigorously when needed</b></summary>
Depending on the kind of genome, plenty of unmasked repeats may still persist. This is generally an issue to be expected in large genomes, such as vertebrate genomes, and you will notice the problem if the count of predicted proteins is extremely high. You can try to overcome "under-masking" with the following steps (we are suggesting to use GNU parallel to speed up the process):

```
ln -s genome.masked.fa genome.fa
splitMfasta.pl --minsize=25000000 ${GENOME}.masked

# Running TRF
ls genome.split.*.fa | parallel 'trf {} 2 7 7 80 10 50 500 -d -m -h'

# Parsing TRF output
# The script parseTrfOutput.py is from https://github.com/gatech-genemark/BRAKER2-exp
ls genome.split.*.fa.2.7.7.80.10.50.500.dat | parallel 'parseTrfOutput.py {} --minCopies 1 --statistics {}.STATS > {}.raw.gff 2> {}.parsedLog'

# Sorting parsed output..."
ls genome.split.*.fa.2.7.7.80.10.50.500.dat.raw.gff | parallel 'sort -k1,1 -k4,4n -k5,5n {} > {}.sorted 2> {}.sortLog'

# Merging gff...
FILES=genome.split.*.fa.2.7.7.80.10.50.500.dat.raw.gff.sorted
for f in $FILES
do
    bedtools merge -i $f | awk 'BEGIN{OFS="\t"} {print $1,"trf","repeat",$2+1,$3,".",".",".","."}' > $f.merged.gff 2> $f.bedtools_merge.log
done

# Masking FASTA chunk
ls genome.split.*.fa | parallel 'bedtools maskfasta -fi {} -bed {}.2.7.7.80.10.50.500.dat.raw.gff.sorted.merged.gff -fo {}.combined.masked -soft &> {}.bedools_mask.log'

# Concatenate split genome
cat genome.split.*.fa.combined.masked > genome.fa.combined.masked
```

The file `genome.fa.combined.masked` will be more rigorously masked.
</details>

## RNA-Seq alignment with HiSat2

Spliced alignments of RNA-Seq short reads are a valuable information source for predicting protein-coding genes with high accuracy.

<img src="et-rnaseq.png" alt="Figure 3 of Lomsadze et al. (2014) illustrates the use of RNA-Seq spliced alignments for predicting genes (with GeneMark-ET)." width="600"/>
Figure 3 of Lomsadze et al. (2014) illustrates the use of RNA-Seq spliced alignments for predicting genes (with GeneMark-ET, <a href=https://doi.org/10.1093/nar/gku55">Image Source</a>).

Executing HiSat2 is out of scope for the current session. You find a readily prepared alignment file in [/opt/BRAKER/example/RNAseq.bam](/opt/BRAKER/example/RNAseq.bam).

<details>
  <summary><b>If you want to see how such a file was prepared, click here and read.</b></summary>
  
We will map the *Arabidopsis thaliana* Illumina RNA-Seq reads from library SRR934391 in files [SRR934391_1.fastq.gz](/home/genomics/workshop_materials/genome_annotation/sra/SRR934391_1.fastq.gz) and [SRR934391_2.fastq.gz](/home/genomics/workshop_materials/genome_annotation/sra/SRR934391_2.fastq.gz). These are paired-end data, i.e. one file contains the forward reads while the other contains in the same order the reverse reads. The length of reads is in this case 100 nt.

We will use HiSat2 ([publication](https://doi.org/10.1038/s41587-019-0201-4), [software](https://github.com/DaehwanKimLab/hisat2)) to align these reads against a chunk of the *Arabidopsis thaliana* genome contained in the file [genome.fa](genome.fa). (You can in principle use any alignment tool capable of aligning RNA-seq reads to a genome, as long as it can perform spliced alignment.)

First, we need to build an index from the genome file:

```
# building the hisat2 index
hisat2-build /opt/BRAKER/example/genome.fa genome-idx 1> hisat2-build.log 2> hisat2-build.err
```

Inspect the log files [hisat2-build.log](hisat2-build.log) and [hisat2-build.err](hisat2-build.err) for possible errors.

Next, we align the RNA-seq reads against the genome. Consider to **not** do this on the Cesky Krumlov Workshop AWS resources. Performing this alignment took about 7 minutes with 70 threads. The precomputed output file is provided at `/home/genomics/workshop_materials/genome_annotation/sra/SRR934391.sam`, and we will continue to use that pre-computed file.

```
T=8 # adjust to number of threads that you booted with

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR934/SRR934391/SRR934391_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR934/SRR934391/SRR934391_2.fastq.gz

RNASEQDIR=.

time hisat2 -p ${T} -q -x genome-idx -1 ${RNASEQDIR}/SRR934391_1.fastq.gz \
    -2 ${RNASEQDIR}/SRR934391_2.fastq.gz -S rnaseq.sam \
    1> hisat2-align.log 2> hisat2-align.err
```

Our goal is to extract information on spliced alignments/intron positons from the alignment output file. To achieve this, we will use a tool called bam2hints that is part of the Augustus software suite ([software](https://github.com/Gaius-Augustus/Augustus)). However, this tool requires a sorted bam-file. Therefore, we first use Samtools ([paper](https://doi.org/10.1093/bioinformatics/btp352), [software](https://github.com/samtools) ) to convert the sam file to bam format:

```

T=8 # adjust to number of threads that you booted with, takes ~2 minutes with 4 threads

SAMFILE=/home/genomics/workshop_materials/genome_annotation/sra/SRR934391.sam

time samtools view -@${T} -bSh ${SAMFILE} -o rnaseq.bam

# if you computed your own rnaseq.sam file, delete it to save space on harddrive
if [ -f rnaseq.sam ]
then
    rm rnaseq.sam
fi
```

Then, we sort that bam file (this will require a bit less than 4 GB of RAM):

```
T=8 # adjust to number of threads that you booted with, takes ~2 minutes with 4 threads

time samtools sort -@${T} -n rnaseq.bam -o rnaseq.s.bam

# remove the unsorted bam file to save space
rm rnaseq.bam
```

Careful, above bam file is just an demo example! We will be using a different bam file for running BRAKER because the above BAM file does not contain sufficient data for running BRAKER3, successfully!
</details>




## Annotation of protein coding genes

Structural genome annotation is ideally performed by a combination of a statistical model (e.g. Hidden Markov Model derivate) and extrinsic evidence (e.g. from transcriptomics or known protein sequences). The statistical model parameters have to be adapted to the genomic properties of novel species. For adapting parameters, an initial set of high-quality training genes from the target species is required. This is tricky to obtain. BRAKER is a perl script that comprises several pipelines to automated the solution of this problem: fully automatically generate an initial set of training genes, train gene finders, and then predict genes with the trained parameters and extrinsic evidence.

We will first take an approach to structural genome annotation that takes advantage both of RNA-Seq data, and a large database of known proteins, using BRAKER3 ([poster from PAG2023](https://www.researchgate.net/profile/Lars-Gabriel-3/publication/367409816_The_BRAKER3_Genome_Annotation_Pipeline/links/63d14cbae922c50e99c29c7a/The-BRAKER3-Genome-Annotation-Pipeline.pdf), [software](https://github.com/Gaius-Augustus/BRAKER)). If sufficient transcriptome data is available, then BRAKER3 is usually the best choice of pipeline. However, in the lack of transcriptome data, we need to consider alternative approaches. 

If transcriptome evidence is available but it just was not sufficient for obtaining good results with BRAKER3, then running BRAKER1 with RNA-Seq evidence ([paper](https://doi.org/10.1093/bioinformatics/btv661)) and protein supported gene prediction with BRAKER2 ([paper](https://doi.org/10.1093/nargab/lqaa108)), combined by TSEBRA ([paper](https://doi.org/10.1186/s12859-021-04482-0), [software](https://github.com/Gaius-Augustus/TSEBRA)) is often a good option.

In the total absence of transcriptome data, we recommend running either BRAKER2 with a large database of proteins, alone, or for larger genomes (such as vertebrates), we the recommend application of GALBA ([preprint](https://www.biorxiv.org/content/10.1101/2023.04.10.536199v1.abstract), [software](https://github.com/Gaius-Augustus/GALBA)) with reference proteomes of a few closely related, already annotated species.

### BRAKER3

BRAKER3 uses spliced aligned RNA-Seq data from Hisat2 ([paper](https://www.nature.com/articles/s41587-019-0201-4), [software](https://github.com/DaehwanKimLab/hisat2)) for genome-guided transcriptome assembly with Stringtie ([paper](https://www.nature.com/articles/nbt.3122), [software](https://github.com/gpertea/stringtie)). GeneMarkS-T ([paper](https://academic.oup.com/nar/article/43/12/e78/2902598), [software](http://exon.gatech.edu/genemark/license_download.cgi)) is used to call protein coding genes in the transcripts. These transcripts are "noisy", therefore, a large database of proteins (e.g. an OrthoDB partition) is used to filter these predictions, using among other GeneMark-specific scripts including ProtHint ([software](https://github.com/gatech-genemark/ProtHint)) with Spaln ([paper](https://academic.oup.com/bioinformatics/article/24/21/2438/191484), [software](https://github.com/ogotoh/spaln)) the fast search tool DIAMOND ([paper](https://www.nature.com/articles/s41592-021-01101-x), [software](https://github.com/bbuchfink/diamond)). Both, the transcriptome and protein evidence is then used by GeneMark-ETP ([preprint](https://www.biorxiv.org/content/10.1101/2023.01.13.524024v1), [software](https://github.com/gatech-genemark/GeneMark-ETP)) for self-training this HMM-based gene finder. This generates a training set for AUGUSTUS ([paper](https://doi.org/10.1093/bioinformatics/btn013), [software](https://github.com/Gaius-Augustus/Augustus)). Both, the GeneMark-ETP, and the AUGUSTUS gene set incorporate the evidence to some extent, and these gene sets are merged with TSEBRA by BRAKER3.

Training AUGUSTUS for a novel species usually comprises a step called etraining that adapts species-specific parameters of the statistical model of AUGUSTUS, and a step called optimize_augustus.pl that optimizes meta-parameters of that model. optimize_augustus.pl is very time-consuming, it yields usually ~2 percent points of accuracy on gene level. For this session, will disable this step with --skipOptimize. If you ever want to annotate a real new genome, make sure to delete `--skipOptimize` from your BRAKER calls (and expect substantially longer runtime). Also, GeneMark-ETP usually has a longer runtime. We will here set the maximal intergenic region for GeneMark-ETP to 10000. Please never apply this setting to a real genome annotation task, and expect a larger runtime.

Note: we are in this Workshop using a toy data set instead of a real OrthoDB partition. OrthoDB partitions for real application use cases are available at https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/. We are also using the smallest BUSCO partition for compleasm ([paper](https://doi.org/10.1093/bioinformatics/btad595), [software](https://github.com/huangnengCSU/compleasm))within BRAKER. You should in reality use the BUSCO lineage that is fitting your species best.

In [17]:
%%script bash

T=8 # adjust to number of threads that you booted with, takes ~16 minutes with 8 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER3 ]
then
    rm -rf BRAKER3
fi

ORTHODB=/opt/BRAKER/example/subsampled_viri.fa # adjust to suitable clade of real OrthoDB from https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/
BUSCOLINEAGE=eukaryota_odb10 # adjust to more suitable lineage for real annotation runs

# run BRAKER3
time braker.pl --workingdir=BRAKER3 --genome=/opt/BRAKER/example/genome.fa \
    --bam=/opt/BRAKER/example/RNAseq.bam \
    --prot_seq=${ORTHODB} --busco_lineage=${BUSCOLINEAGE} --threads ${T} \
    --gm_max_intergenic 10000 --skipOptimize # remember to remove both these options for real jobs!
    # this call takes a few minutes even with --skipOptimize and --gm_max_intergenic 10000

#**********************************************************************************
#                               BRAKER CONFIGURATION                               
#**********************************************************************************
# BRAKER CALL: /opt/BRAKER/scripts/braker.pl --workingdir=BRAKER3 --genome=/opt/BRAKER/example/genome.fa --bam=/opt/BRAKER/example/RNAseq.bam --prot_seq=/home/katharina/git/BRAKER/example/subsampled_viri.fa --busco_lineage=eukaryota_odb10 --threads 8 --gm_max_intergenic 10000 --skipOptimize
# Sat Jan  6 18:23:29 2024: braker.pl version 3.0.7
# Sat Jan  6 18:23:29 2024: Creating directory /home/katharina/git/GenomeAnnotation_Workshop/BRAKER3.
# Sat Jan  6 18:23:29 2024: Creating directory /home/katharina/git/GenomeAnnotation_Workshop/BRAKER3.
# Sat Jan  6 18:23:29 2024:Both protein and RNA-Seq data in input detected. BRAKER will be executed in ETP mode (BRAKER3).
#*********
# Sat Jan  6 18:23:29 2024: Configuring of BRAKER for using extern

# Sat Jan  6 18:23:29 2024: Creating directory /home/katharina/git/GenomeAnnotation_Workshop/BRAKER3.
# Sat Jan  6 18:23:29 2024:Both protein and RNA-Seq data in input detected. BRAKER will be executed in ETP mode (BRAKER3).
#*********
# Sat Jan  6 18:23:29 2024: Log information is stored in file /home/katharina/git/GenomeAnnotation_Workshop/BRAKER3/braker.log


#*********
#*********


#*********
#*********



real	13m34.240s
user	33m17.623s
sys	1m11.487s


<details>
  <summary><b>Out of time, job died? Click here.</b></summary>
If you ran out of time (the BRAKER3 job takes substantial time), you may copy the most important files as follows from a notebook cell:

```
%%script bash
cp -r BRAKER3_precomputed_results BRAKER3
```
</details>

Let's inspect the output, the most important files are braker.gtf, Augustus/augustus.hints.gtf, and GeneMark-ETP/genemark.gtf:

In [18]:
%%script bash
cd BRAKER3
ls -lh braker.gtf Augustus/augustus.hints.gtf GeneMark-ETP/genemark.gtf

-rw-rw-r-- 1 katharina katharina 343K Jan  6 18:36 Augustus/augustus.hints.gtf
-rw-rw-r-- 1 katharina katharina 299K Jan  6 18:37 braker.gtf
-rw-rw-r-- 1 katharina katharina 361K Jan  6 18:36 GeneMark-ETP/genemark.gtf


The file [BRAKER3/what-to-cite.txt](BRAKER3/what-to-cite.txt) advises you on what papers should be cited if you were going to publish a manuscript on a gene set produced with BRAKER3.

braker.gtf is the main output. BRAKER internally runs compleasm to pick the best gene set according to BUSCO presence. Be aware of this when generating the following BUSCO plot for quality control. (The folder braker_original contains BRAKER predictions prior adding BUSCOs with compleasm in case you want to look at these.)

Before running BUSCO, we need to make sure that we have protein sequences of all three gene sets (only the braker.aa exists by default):

In [19]:
%%script bash
# generate protein (and coding seq file) from AUGUSTUS predictions
cd BRAKER3/Augustus
getAnnoFastaFromJoingenes.py -g /opt/BRAKER/example/genome.fa -o augustus.hints -f augustus.hints.gtf
# generate protein (and coding seq file) from GeneMark-ETP predictions
cd ../GeneMark-ETP
getAnnoFastaFromJoingenes.py -g /opt/BRAKER/example/genome.fa -o genemark -f genemark.gtf
# see file sizes
cd ../
ls -lh braker.aa GeneMark-ETP/genemark.aa Augustus/augustus.hints.aa
# Count number of transcripts by counting FASTA headers
echo "Counting number of protein sequences = transcripts"
grep -c ">" braker.aa GeneMark-ETP/genemark.aa Augustus/augustus.hints.aa

-rw-rw-r-- 1 katharina katharina 112K Jan  6 18:48 Augustus/augustus.hints.aa
-rw-rw-r-- 1 katharina katharina 100K Jan  6 18:37 braker.aa
-rw-rw-r-- 1 katharina katharina 110K Jan  6 18:48 GeneMark-ETP/genemark.aa
Counting number of protein sequences = transcripts
braker.aa:247
GeneMark-ETP/genemark.aa:249
Augustus/augustus.hints.aa:274


GALBA has a simple script to compute the ratio of mono- to multi-exonic genes (only counting one isoform if one gene has several alternative isoforms, that's why the transcript number differs from the number above for methods that contain alternative transcripts, such as AUGUSTUS and BRAKER):

In [20]:
%%script bash
cd BRAKER3
echo "Computing some descriptive statistics for BRAKER:"
analyze_exons.py -f braker.gtf
echo ""
echo "Doing the same for Augustus:"
analyze_exons.py -f Augustus/augustus.hints.gtf
echo ""
echo "And for GeneMark-ETP:"
analyze_exons.py -f GeneMark-ETP/genemark.gtf

Computing some descriptive statistics for BRAKER:
Number of transcripts: 241
Largest number of exons in all transcripts: 30
Monoexonic transcripts: 80
Multiexonic transcripts: 161
Mono:Mult Ratio: 0.5
Boxplot of number of exons per transcript:
Min: 1
25%: 1
50%: 3
75%: 7
Max: 30

Doing the same for Augustus:
Number of transcripts: 268
Largest number of exons in all transcripts: 19
Monoexonic transcripts: 73
Multiexonic transcripts: 195
Mono:Mult Ratio: 0.37
Boxplot of number of exons per transcript:
Min: 1
25%: 1
50%: 3
75%: 7
Max: 19

And for GeneMark-ETP:
Number of transcripts: 249
Largest number of exons in all transcripts: 32
Monoexonic transcripts: 55
Multiexonic transcripts: 194
Mono:Mult Ratio: 0.28
Boxplot of number of exons per transcript:
Min: 1
25%: 2
50%: 4
75%: 8
Max: 32


#### BUSCO assessment

BUSCO ([paper](https://doi.org/10.1002/cpz1.323), [software](https://gitlab.com/ezlab/busco)) can provide information on sensitivity with respect to a clade-specific core gene set. We will in the following use BUSCO to compare sensitivity in the BRAKER3, AUGUSTUS, and GeneMark-ETP gene set.

First, we find the closest BUSCO lineage (we are working on *Arabidopsis thaliana*):

In [10]:
%%script bash

source conda_init
conda activate busco_env

busco --list-datasets > busco_lineages.txt 2> busco_lineages.log

All available lineages are now in [busco_lineages.txt](busco_lineages.txt). (Check [busco_lineages.log](busco_lineages.log) for possible errors.)

Check at [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) the lineage of the target *Arabidopsis*. I believe the lineage is:

`cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae`

Now find a related lineage in [busco_lineages.txt](busco_lineages.txt). `brassicales_odb10` is the closest lineage. (If we had not wanted to save time when running BRAKER, we would also have used this lineage for the BRAKER run.)

Next, we run a BUSCO assessment on all gene sets (this takes ~4 minutes with 8 threads):

In [21]:
%%script bash

T=8 # adjust to number of threads that you booted with

source conda_init
conda activate busco_env

cd BRAKER3
# create links if not already present
if [ ! -L augustus.aa ]
then
    ln -s Augustus/augustus.hints.aa augustus.aa
    sleep 1 # not sure why we need to wait a few seconds, but otherwise system doesn't find the file
fi

if [ ! -L genemark.aa ]
then
    ln -s GeneMark-ETP/genemark.aa genemark.aa
    sleep 1 # not sure why we need to wait a few seconds, but otherwise system doesn't find the file
fi

if [ ! -d busco_downloads ]
then
    mkdir busco_downloads
    cd busco_downloads
    ln -s /home/genomics/workshop_materials/genome_annotation/busco/brassicales_odb10 brassicales_odb10
fi

GENESETS=(braker augustus genemark)

for g in ${GENESETS[@]}; do
    echo "Processing ${g}..."
    # delete old output if existing
    if [ -d busco_${g} ]
    then
        rm -r busco_${g}
    fi
    # run BUSCO
    busco -m proteins -i ${g}.aa -o busco_${g} \
        -l brassicales_odb10 -c ${T} &> busco_${g}.log
done

Processing braker...
Processing augustus...
Processing genemark...


Next, we visualize the BUSCO results:

In [22]:
%%script bash

source conda_init
conda activate busco_env

cd BRAKER3

# create BUSCO_summaries folder if not present
if ! [ -d BUSCO_summaries ]
then
    mkdir BUSCO_summaries
fi

# copy all BUSCO results into the summaries folder
cp busco_*/short_summary*.txt BUSCO_summaries

# generate BUSCO plot
generate_plot.py -wd BUSCO_summaries &> generate_plot.log

Check the file [generate_plot.log](generate_plot.log) for possible errors. This results in the following figure (stored at [BRAKER3/BUSCO_summaries/busco_figure.png](BRAKER3/BUSCO_summaries/busco_figure.png)):

<img src="BRAKER3/BUSCO_summaries/busco_figure.png" alt="BUSCO results" width="400"/>

The data that we used in this session was selected purely on the criterion of feasible runtime. In a real scenario, with a complete genome, the BUSCO plot should contain a much larger number of complete BUSCOs, and you are usually happy if the number BUSCOs in the final BRAKER3 gene set is higher or equal to the number of BUSCOs detected in the AUGUSTUS and GeneMark-ETP set, while the total number of transcripts does not grow into an unexpected way (e.g. having 80.000 proteins in a BRAKER gene set does seem odd in most cases...). If the same BUSCO lineage had been chosen for BRAKER and BUSCO, that would also be the case, here (BUSCOs in BRAKER being more complete than in AUGUSTUS).

But what can we do if there is no RNA-Seq data for a particular species? In that case, we can resort to using either BRAKER2 (for small and medium sized genomes, with a large database of proteins that might be only remotely related), or we may use GALBA (for large vertebrate genomes, with a few closely related reference proteomes).

### BRAKER2

BRAKER2 ([paper](https://doi.org/10.1093/nargab/lqaa108)) uses spliced alignment information from a huge database of proteins against the target genome. We typically use OrthoDB partitions of clades, hosted at https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/. Note: a set of proteins from one or a few related species is not sufficient for running BRAKER2. A particular set of proteins of a closely related species can be appended to a larger database for running BRAKER2. However, BRAKER2 is not an ideal tool for recovering a complete set of proteins from a related species.

The following call of BRAKER2 takes ~15 minutes on 4 threads, even when optimizing AUGUSTUS parameters is disabled:

In [23]:
%%script bash

T=8 # adjust to number of threads that you booted with

ORTHODB=/opt/BRAKER/example/subsampled_viri.fa # adjust to suitable OrthoDB clade, see BRAKER3
BUSCOLINEAGE=eukaryota_odb10 # adjust to more suitable lineage for real annotation runs

# delete output from a possible previous run if it exists
if [ -d BRAKER2 ]
then
    rm -rf BRAKER2
fi

time braker.pl --workingdir=BRAKER2 --genome=/opt/BRAKER/example/genome.fa --prot_seq=${ORTHODB} \
    --busco_lineage ${BUSCOLINEAGE} --threads ${T} \
    --gm_max_intergenic 10000 --skipOptimize \ # remember to remove both options if you are running a real job
    2> braker2.log

#**********************************************************************************
#                               BRAKER CONFIGURATION                               
#**********************************************************************************
# BRAKER CALL: /opt/BRAKER/scripts/braker.pl --workingdir=BRAKER2 --genome=/opt/BRAKER/example/genome.fa --prot_seq=/home/katharina/git/BRAKER/example/subsampled_viri.fa --busco_lineage eukaryota_odb10 --threads 8 --gm_max_intergenic 10000 --skipOptimize  # remember to remove both options if you are running a real job
# Sat Jan  6 18:59:57 2024: braker.pl version 3.0.7
# Sat Jan  6 18:59:57 2024: Creating directory /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2.
# Sat Jan  6 18:59:57 2024: Only Protein input detected, BRAKER will be executed in EP mode (BRAKER2).
# Sat Jan  6 18:59:57 2024: Configuring of BRAKER for using external tools...
# Sat Jan  6 18:59:57 2024: Trying to set $AUGUSTUS_CONFIG_PATH...
# Sat Jan  6 18:59:57 2024

# Sat Jan  6 18:59:57 2024: Log information is stored in file /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2/braker.log


ProtHint Version 2.6.0
Copyright 2019, Georgia Institute of Technology, USA

Please cite
  - ProtHint: https://doi.org/10.1093/nargab/lqaa026
  - DIAMOND:  https://doi.org/10.1038/nmeth.3176
  - Spaln:    https://doi.org/10.1093/bioinformatics/btn460

Called from: /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2
Cmd: /opt/ETP/bin/gmes/ProtHint/bin/prothint.py --threads=8 --geneMarkGtf /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2/GeneMark-ES/genemark.gtf /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2/genome.fa /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2/proteins.fa

[Sat Jan  6 19:02:29 2024] Pre-processing protein input
[Sat Jan  6 19:02:29 2024] Skipping GeneMark-ES, using the supplied gene seeds file instead
[Sat Jan  6 19:02:29 2024] Translating gene seeds to proteins
[Sat Jan  6 19:02:29 2024] Translation of seeds finished
[Sat Jan  6 19:02:29 2024] Running DIAMOND
diamond v0.9.24.125 | by Benjamin Buchfink <buchfink@gmail.com>
Licensed under the GNU 

[Sat Jan  6 19:02:31 2024] Enqueueing pair 60/1151 (5.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 62/1151 (5.3%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 63/1151 (5.4%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 64/1151 (5.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 65/1151 (5.6%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 66/1151 (5.7%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 67/1151 (5.8%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 68/1151 (5.9%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 70/1151 (6.0%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 71/1151 (6.1%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 72/1151 (6.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 73/1151 (6.3%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 74/1151 (6.4%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 75/1151 (6.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 76/1151 (6.6%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 78/1151 (6.7%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 79/1151 (6.8%)
[Sat Jan  6 19

[Sat Jan  6 19:02:31 2024] Enqueueing pair 219/1151 (19.0%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 220/1151 (19.1%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 221/1151 (19.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 223/1151 (19.3%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 224/1151 (19.4%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 225/1151 (19.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 226/1151 (19.6%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 227/1151 (19.7%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 228/1151 (19.8%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 230/1151 (19.9%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 231/1151 (20.0%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 232/1151 (20.1%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 233/1151 (20.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 234/1151 (20.3%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 235/1151 (20.4%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 236/1151 (20.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pa

[Sat Jan  6 19:02:31 2024] Enqueueing pair 377/1151 (32.7%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 378/1151 (32.8%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 379/1151 (32.9%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 380/1151 (33.0%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 381/1151 (33.1%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 383/1151 (33.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 384/1151 (33.3%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 385/1151 (33.4%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 386/1151 (33.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 387/1151 (33.6%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 388/1151 (33.7%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 390/1151 (33.8%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 391/1151 (33.9%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 392/1151 (34.0%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 393/1151 (34.1%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 394/1151 (34.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pa

[Sat Jan  6 19:02:31 2024] Enqueueing pair 535/1151 (46.4%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 536/1151 (46.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 537/1151 (46.6%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 538/1151 (46.7%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 539/1151 (46.8%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 540/1151 (46.9%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 541/1151 (47.0%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 543/1151 (47.1%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 544/1151 (47.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 545/1151 (47.3%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 546/1151 (47.4%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 547/1151 (47.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 548/1151 (47.6%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 550/1151 (47.7%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 551/1151 (47.8%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 552/1151 (47.9%)
[Sat Jan  6 19:02:31 2024] Enqueueing pa

[Sat Jan  6 19:02:31 2024] Enqueueing pair 692/1151 (60.1%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 693/1151 (60.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 695/1151 (60.3%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 696/1151 (60.4%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 697/1151 (60.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 698/1151 (60.6%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 699/1151 (60.7%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 700/1151 (60.8%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 701/1151 (60.9%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 703/1151 (61.0%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 704/1151 (61.1%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 705/1151 (61.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 706/1151 (61.3%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 707/1151 (61.4%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 708/1151 (61.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 710/1151 (61.6%)
[Sat Jan  6 19:02:31 2024] Enqueueing pa

[Sat Jan  6 19:02:31 2024] Enqueueing pair 850/1151 (73.8%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 851/1151 (73.9%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 852/1151 (74.0%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 853/1151 (74.1%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 855/1151 (74.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 856/1151 (74.3%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 857/1151 (74.4%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 858/1151 (74.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 859/1151 (74.6%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 860/1151 (74.7%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 861/1151 (74.8%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 863/1151 (74.9%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 864/1151 (75.0%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 865/1151 (75.1%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 866/1151 (75.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 867/1151 (75.3%)
[Sat Jan  6 19:02:31 2024] Enqueueing pa

[Sat Jan  6 19:02:31 2024] Enqueueing pair 1008/1151 (87.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1009/1151 (87.6%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1010/1151 (87.7%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1011/1151 (87.8%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1012/1151 (87.9%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1013/1151 (88.0%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1015/1151 (88.1%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1016/1151 (88.2%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1017/1151 (88.3%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1018/1151 (88.4%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1019/1151 (88.5%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1020/1151 (88.6%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1021/1151 (88.7%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1023/1151 (88.8%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1024/1151 (88.9%)
[Sat Jan  6 19:02:31 2024] Enqueueing pair 1025/1151 (89.0%)
[Sat Jan  6 19:02:31 202

#*********
# The hints file(s) for GeneMark-EX contain less than 1000 introns. (In total, 217 unique introns are contained.)
# Genemark-EX might fail due to the low number of hints.
#*********
#*********
# The hints file(s) for GeneMark-EX contain less than 150 introns with multiplicity >= 4! (In total, 217 unique introns are contained. 145 have a multiplicity >= 4.)
# Possibly, you are trying to run braker.pl on data that does not provide sufficient multiplicity information. This will e.g. happen if you try to use introns generated from assembled RNA-Seq transcripts; or if you try to run braker.pl in epmode with mappings from proteins without sufficient hits per locus. Or if you use the example data set.
# A low number of intron hints with sufficient multiplicity may result in a crash of GeneMark-EX (it should not crash with the example data set).
#*********
#*********
#*********


ProtHint Version 2.6.0
Copyright 2019, Georgia Institute of Technology, USA

Please cite
  - ProtHint: https://doi.org/10.1093/nargab/lqaa026
  - DIAMOND:  https://doi.org/10.1038/nmeth.3176
  - Spaln:    https://doi.org/10.1093/bioinformatics/btn460

Called from: /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2
Cmd: /opt/ETP/bin/gmes/ProtHint/bin/prothint.py --threads=8 --geneSeeds /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2/augustus.hints_iter1.gtf --prevGeneSeeds /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2/GeneMark-ES/genemark.gtf --prevSpalnGff /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2/Spaln/spaln_iter1.gff /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2/genome.fa /home/katharina/git/GenomeAnnotation_Workshop/BRAKER2/proteins.fa

[Sat Jan  6 19:12:25 2024] Pre-processing protein input
ProtHint is running in the iterative mode.
[Sat Jan  6 19:12:25 2024] Selecting a subset of data to run in the next iteration
[Sat Jan  6 19:12:26 2024] Tran

[Sat Jan  6 19:12:27 2024] Enqueueing pair 49/645 (7.5%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 50/645 (7.7%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 51/645 (7.9%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 52/645 (8.0%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 53/645 (8.2%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 54/645 (8.3%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 55/645 (8.5%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 56/645 (8.6%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 57/645 (8.8%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 58/645 (8.9%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 59/645 (9.1%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 60/645 (9.3%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 61/645 (9.4%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 62/645 (9.6%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 63/645 (9.7%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 64/645 (9.9%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 65/645 (10.0%)
[Sat Jan  6 19:12:27 2024] Enq

[Sat Jan  6 19:12:27 2024] Enqueueing pair 189/645 (29.3%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 190/645 (29.4%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 191/645 (29.6%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 192/645 (29.7%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 193/645 (29.9%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 194/645 (30.0%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 195/645 (30.2%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 196/645 (30.3%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 197/645 (30.5%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 198/645 (30.6%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 199/645 (30.8%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 200/645 (31.0%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 201/645 (31.1%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 202/645 (31.3%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 203/645 (31.4%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 204/645 (31.6%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 205/645 (31.7

[Sat Jan  6 19:12:27 2024] Enqueueing pair 328/645 (50.8%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 329/645 (51.0%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 330/645 (51.1%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 331/645 (51.3%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 332/645 (51.4%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 333/645 (51.6%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 334/645 (51.7%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 335/645 (51.9%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 336/645 (52.0%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 337/645 (52.2%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 338/645 (52.4%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 339/645 (52.5%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 340/645 (52.7%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 341/645 (52.8%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 342/645 (53.0%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 343/645 (53.1%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 344/645 (53.3

[Sat Jan  6 19:12:27 2024] Enqueueing pair 467/645 (72.4%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 468/645 (72.5%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 469/645 (72.7%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 470/645 (72.8%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 471/645 (73.0%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 472/645 (73.1%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 473/645 (73.3%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 474/645 (73.4%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 475/645 (73.6%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 476/645 (73.7%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 477/645 (73.9%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 478/645 (74.1%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 479/645 (74.2%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 480/645 (74.4%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 481/645 (74.5%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 482/645 (74.7%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 483/645 (74.8

[Sat Jan  6 19:12:27 2024] Enqueueing pair 606/645 (93.9%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 607/645 (94.1%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 608/645 (94.2%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 609/645 (94.4%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 610/645 (94.5%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 611/645 (94.7%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 612/645 (94.8%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 613/645 (95.0%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 614/645 (95.1%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 615/645 (95.3%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 616/645 (95.5%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 617/645 (95.6%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 618/645 (95.8%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 619/645 (95.9%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 620/645 (96.1%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 621/645 (96.2%)
[Sat Jan  6 19:12:27 2024] Enqueueing pair 622/645 (96.4

<details>
  <summary><b>Out of time, job died? Click here.</b></summary>
If you ran out of time (the BRAKER2 job takes substantial time), you may copy the most important files as follows from a notebook cell:

```
%%script bash
cp -r BRAKER2_precomputed_results BRAKER2
```
</details>

The most important output files are:

   * [BRAKER2/braker.gtf](BRAKER2/braker.gtf) - BRAKER gene predictions
   * [BRAKER2/Augustus/augustus.hints.gtf](BRAKER2/Augustus/augustus.hints.gtf) - intermediate AUGUSTUS gene predictions
   * [BRAKER2/GeneMark-EP/genemark.gtf](BRAKER2/GeneMark-EP/genemark.gtf) - intermediate GeneMark-EP gene predictions
   * [BRAKER2/hintsfile.gff](BRAKER2/hintsfile.gff) - hints that were used for running AUGUSTUS and TSEBRA in BRAKER
   
The file [BRAKER2/what-to-cite.txt](BRAKER2/what-to-cite.txt) advises you on what papers should be cited if you were going to publish a manuscript on a gene set produced with BRAKER2. 

All methods described for BRAKER3 (BUSCO, number of transcripts, mono:mult exon ratio, etc.) are of course applicable to BRAKER2, GALBA, and BRAKER1, as well. We will skip it here because of time constraints.

### GALBA

GALBA ([preprint](https://www.biorxiv.org/content/10.1101/2023.04.10.536199v1.abstract), [software](https://github.com/Gaius-Augustus/GALBA)) is a BRAKER-spinoff that uses miniprot ([paper](https://doi.org/10.1093/bioinformatics/btad014), [software](https://github.com/lh3/miniprot)) to generate a training gene set of AUGUSTUS. In contrast to the BRAKER2 and BRAKER3 pipelines, GALBA is not very good at using remotely related protein evidence. However, given reference proteins of several closely related species, GALBA is very good at recovering gene structures, particularly in large vertebrate genomes. You may execute GALBA as follows (using a toy example data set, it executes within a 3 minutes on 8 threads):

In [24]:
%%script bash

T=8 # adjust to number of threads that you booted with

# delete output from a possible previous run if it exists
if [ -d GALBA ]
then
    rm -rf GALBA
fi

time galba.pl --workingdir=GALBA --genome=/opt/BRAKER/example/genome.fa \
    --prot_seq=/opt/GALBA/example/proteins.fa \
    --threads ${T} \
    --skipOptimize \ # remember to remove both options if you are running a real job
    2> galba.log

ERROR: in file /opt/GALBA/scripts/./helpMod.pm at line 307
 found neither /usr/share/augustus/scripts/filter_gtf_by_txid.py nor /usr/scripts/filter_gtf_by_txid.py nor /home/katharina/scripts/filter_gtf_by_txid.py nor /opt/GALBA/scripts/filter_gtf_by_txid.py!
Please Check the environment variables AUGUSTUS_CONFIG_PATH and command line options AUGUSTUS_BIN_PATH and AUGUSTUS_SCRIPTS_PATH or install AUGUSTUS, again!

real	0m0.287s
user	0m0.210s
sys	0m0.073s


<details>
  <summary><b>Out of time, job died? Click here.</b></summary>
It is unlikely that GALBA will note complete, fast. However, you may copy the most important files as follows from a notebook cell:

```
%%script bash
cp -r GALBA_precomputed_results GALBA
```
</details>

The most important output files are:

   * [GALBA/galba.gtf](GALBA/galba.gtf) - gene predictions by GALBA
   * [GALBA/hintsfile.gff](GALBA/hintsfile.gff) - hints that were used for running AUGUSTUS in GALBA
   
The file [GALBA/what-to-cite.txt](GALBA/what-to-cite.txt) advises you on what papers should be cited if you were going to publish a manuscript on a gene set produced with GALBA.

### BRAKER1

Since BRAKER3, the pipeline for input of both RNA-Seq and a large database of proteins achieves usually higher accuracy than BRAKER1 with RNA-Seq, only, BRAKER1 is now rather a pipeline that we may resort to using if BRAKER3 died due to insufficient data. BRAKER1 also requires a certain amount of RNA-Seq alignments but that is less than what is required for transcriptome assembly with StringTie in BRAKER3.

BRAKER1 uses spliced alignment information from RNA-Seq for training GeneMark-ET ([paper](https://doi.org/10.1093/nar/gku557), [software](http://exon.gatech.edu/genemark/license_download.cgi)), for selecting a training gene set for AUGUSTUS, and for predicting genes with AUGUSTUS. 

We will run BRAKER1 to predict genes in the genomic sequence with the prepared RNA-Seq intron evidence. As before, we introduce options to save runtime (see BRAKER3 and BRAKER2) that should not be applied in a real-life annotation project.

In [None]:
%%script bash

T=8 # adjust to number of threads that you booted with, takes ~2.5 minutes on 4 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER1 ]
then
    rm -rf BRAKER1
fi

BUSCOLINEAGE=eukaryota_odb10 # adjust to more suitable lineage for real annotation runs

time braker.pl --workingdir=BRAKER1 --genome=/opt/BRAKER/example/genome.fa \
    --bam=/opt/BRAKER/example/RNAseq.bam --softmasking \
    --busco_lineage ${BUSCOLINEAGE} --threads ${T} \
    --gm_max_intergenic 10000 --skipOptimize #  remember to remove this option if you are running a real job
    # this call takes a few minutes even with --skipOptimize

#**********************************************************************************
#                               BRAKER CONFIGURATION                               
#**********************************************************************************
# BRAKER CALL: /opt/BRAKER/scripts/braker.pl --workingdir=BRAKER1 --genome=/opt/BRAKER/example/genome.fa --bam=/opt/BRAKER/example/RNAseq.bam --softmasking --busco_lineage eukaryota_odb10 --threads 8 --gm_max_intergenic 10000 --skipOptimize
# Sat Jan  6 19:13:38 2024: braker.pl version 3.0.7
# Sat Jan  6 19:13:38 2024: Creating directory /home/katharina/git/GenomeAnnotation_Workshop/BRAKER1.
# Sat Jan  6 19:13:38 2024: Only RNA-Seq input detected, BRAKER will be executed in ET mode (BRAKER1).
# Sat Jan  6 19:13:38 2024: Configuring of BRAKER for using external tools...
# Sat Jan  6 19:13:38 2024: Trying to set $AUGUSTUS_CONFIG_PATH...
# Sat Jan  6 19:13:38 2024: Found environment variable $AUGUSTUS_CONFIG_PATH.
# Sat Jan  6 19:13:38 2024:

# Sat Jan  6 19:13:39 2024: Log information is stored in file /home/katharina/git/GenomeAnnotation_Workshop/BRAKER1/braker.log


<details>
  <summary><b>Out of time, job died? Click here.</b></summary>
This job should easily complete within a few minutes, but you may copy the most important files as follows from a notebook cell:

```
%%script bash
cp -r BRAKER1_precomputed_results BRAKER1
```
</details>

Note that BRAKER by default expects scripts and binaries in a location relative to the `$AUGUSTUS_CONFIG_PATH`. We here changed the location of the `$AUGUSTUS_CONFIG_PATH` to a writable location. Therefore, we have to tell BRAKER where the scripts and binaries are (`--AUGUSTUS_BIN_PATH`, `--AUGUSTUS_CONFIG_PATH`).

The most important output files that we will later use for running TSEBRA are 

   * [BRAKER1/braker.gtf](BRAKER1/braker.gtf) - BRAKER gene predictions
   * [BRAKER1/Augustus/augustus.hints.gtf](BRAKER1/Augustus/augustus.hints.gtf) - intermediate AUGUSTUS gene predictions
   * [BRAKER1/GeneMark-EP/genemark.gtf](BRAKER1/GeneMark-EP/genemark.gtf) - intermediate GeneMark-EP gene predictions
   * [BRAKER1/hintsfile.gff](BRAKER1/hintsfile.gff) - hints that were used for running AUGUSTUS and TSEBRA in BRAKER
   
The file [BRAKER1/what-to-cite.txt](BRAKER1/what-to-cite.txt) advises you on what papers should be cited if you were going to publish a manuscript on a gene set produced with BRAKER1.

### TSEBRA

TSEBRA is a tool for selecting a highly accurate gene set from several input sets according to supporting extrinsic evidence. BRAKER internally executes TSEBRA to combine the GeneMark and the AUGUSTUS gene set. If all went well, you do not have run TSEBRA, separately, at all. However, one scenario where TSEBRA may be useful, remains:
   
   * BRAKER3 failed executing because the provided RNA-Seq data set was too small. In this case, you may wish to combine a BRAKER1 and a BRAKER2 gene set.
   
We will have a look at how to generally run TSEBRA on the example of merging the BRAKER1 and BRAKER2 gene set according to the respective evidence of these runs:

In [None]:
%%script bash

mkdir TSEBRA
cd TSEBRA
tsebra.py -g ../BRAKER1/Augustus/augustus.hints.gtf,../BRAKER1/GeneMark-ET/genemark.gtf,../BRAKER2/Augustus/augustus.hints.gtf,../BRAKER2/GeneMark-EP/genemark.gtf \
    -e ../BRAKER1/hintsfile.gff,../BRAKER2/hintsfile.gff -o tsebra.gtf 2> tsebra.log

Check the file [tsebra.log](tsebra.log) for possible errors. The final gene set is in file [tsebra.gtf](tsebra.gtf). 

## Data visualization in the UCSC Genome Browser

Visualization of gene structures in context with extrinsic evidence is essential for coming to a decision on whether a gene set "makes sense" or "does not make sense". Typical problems that you may observe in a genome browser include "split genes" (where evidence implies two genes should in fact be a single gene) or "joined genes" (where evidence implies one gene should be split into two genes).

The UCSC Genome Browser ([paper](https://doi.org/10.1101/gr.229102), [resource](https://genome.ucsc.edu/)) is one of the most popular genome browsers. It has the advantage that you do not have to install a browser instance on your own webserver. Instead, you only need to provide a certain data structure with your target data on a webserver. The UCSC Genome Browser servers can display your data from there. The data structures are called "track data hubs" or "assembly hubs" ([paper](https://doi.org/10.1093/bioinformatics/btt637)). 

MakeHub ([paper](https://doi.org/10.1016/j.gpb.2019.05.003), [software](https://github.com/Gaius-Augustus/MakeHub )) is a python script that fully automates the generation of such track data hubs for novel genomes. In the following, we will generate a simple track data hub for the genome sequence that we annotated with BRAKER3 (takes only a few seconds):

In [None]:
%%script bash

T=8 # adjust to number of threads that you booted with

time make_hub.py -e katharina.hoff@uni-greifswald.de \
    --genome /opt/BRAKER/example/genome.fa --long_label "A chunk from the Arabidopsis thaliana genome" \
    --short_label at_chunk  --bam /opt/BRAKER/example/RNAseq.bam --threads ${T} \
    --latin_name "Arabidopsis thaliana" \
    --assembly_version "artifically split custom assembly" \
    --hints BRAKER3/hintsfile.gff --gene_track BRAKER3/braker.gtf BRAKER3

You can't perform the suggested `scp` command from the apphub, unless you have privileges on a University of Greifswald webserver. We have therefore copied a prepared hub in advance. The `hub.txt` is available at https://bioinf.uni-greifswald.de/hubs/at_chunk/hub.txt . Remember that link.

In order to visualize your data, go to https://genome.ucsc.edu/ . Click on `My Data` -> `Track Hubs` -> choose the European mirror -> click on `Connected Hubs` and enter the link https://bioinf.uni-greifswald.de/hubs/at_chunk/hub.txt into the text window -> click on `Add Hub`. Congratulations, your Hub is now connected. You should be able to browse something like this: 

<img src="at_chunk.png" alt="UCSC Genome Browser example" width="1000"/>

### How to know which sequences to browse

The long sequences are usually the most interesting to look at. The following command gives you the names of sequences in the order of descending length, you can copy-paste the sequence names into the search window in the UCSC Genome Browser.

In [None]:
%%script bash

N=5 # how many longest sequences would you like to know about

summarizeACGTcontent.pl /opt/BRAKER/example/genome.fa | grep bases | head -${N} | sort -n \
   | perl -ne 'm/(\d+)\s+bases\.\s+(\S+)/; print "$2\t$1\n";'

## How to run BRAKER (and other software) in Docker

If you have a machine on which you have root permissions and Docker, you can run the exact same container as we have been using during this workshop as follows:

```
sudo docker run --rm -it -u root katharinahoff/bioinformatics-notebook:latest bash
```

You can execute all shell commands that we covered in this notebook in that container.

## How to run BRAKER, GALBA, MakeHub, etc. in Singularity

Please read instructions in the [README.md](README.md) file.

## Troubleshooting

### I have 80.000 genes predicted by BRAKER/TSEBRA in a full genome, what shall I do?

Please first check whether you are referring to genes, or to transcripts. BRAKER predicts alternative isoforms. If RNA-Seq data supports this, the number of alternative transcripts may be large, but likely true. If it's really genes that you counted, then 80.000 sounds way too much, indeed (unless you are dealing with a genome that has multiple copies of each chromosome). Most likely, GeneMark-ET/ES/EP/ETP produced highly fragmented training genes for AUGUSTUS. This will also lead to highly fragmented genes predicted by AUGUSTUS. First, check whether your genome has been masked for repeats. Consider using the additional TRF masking desribed at the top of this notebook. If that does not help, and if you have a protein set of closely related species at hand, consider using that protein set as sole training data for AUGUSTUS. You can use GALBA for this (https://github.com/Gaius-Augustus/GALBA).

### I have only 10.000 genes predicted by BRAKER/TSEBRA in a full genome, what shall I do?

Check whether the BRAKER output files in subfolders Augustus and GeneMark-* produced more genes than TSEBRA. By default, TSEBRA will discard genes without evidence. If you have only little evidence for your species, TSEBRA might be a bad idea. You can also try rerunning TSEBRA with enforcing one of the gene sets.  There are also species for which is "normal" to observe less than 10,000 genes, check the annotated relatives.

### How do I know how many genes to expect?

Hard to say. You can download gene sets of related species e.g. from NCBI Genomes, and count. Some gene sets tend to be "underannotated", i.e. they may represent rather the lower numbers of what might be realistic. Katharina usually gets nervous about more than 45000 genes and fewer than 15000 genes. These are definitely weird gene counts - but as stated before, there are cases where these are totally fine, too. Otherwise: always inspect your gene set in a Genome Browser such as the UCSC Genome Browser to identify problems.

### I have long isoseq RNA-Seq transcripts, can I put them into BRAKER?

No. But we have [other instructions](https://github.com/Gaius-Augustus/BRAKER/blob/master/docs/long_reads/long_read_protocol.md) for you. Please note: isoseq data does not always aid structural genome annotation over short read RNA-Seq data. By next week, there will be another solution on a PAG poster in the BRAKER software repository :-) Stay tuned...

### I opened an issue on GitHub about BRAKER or TSEBRA 100 days ago, nobody replied, why?

We are a small team of developers. We try our best and usually respond to well described and easy-to-solve issues within a rather short time frame. Solving other issues may take considerable amounts of time that we simply do not have, or they may be described in a way that we don't know what do with them... please be patient with us.

### I have a problem, whom do I tell?

Please read through the Issues on Github. If the issue does not exist, yet, open an issue.

# Ready to move on?

If you feel confident about your skills, take them to the next level. We have prepared chromosome 4 of a small genome in the following notebook: [Annotate_Babesia_duncani.ipynb](Annotate_Babesia_duncani.ipynb). The task is designed such that you will not complete all tasks during today's session. Instead, you will be randomly assigned with a sub-task, and we will merge the results of everyone who participates to gain a final overview of the results.

### The End