# Genome Annotation

Materials for the BRAKER, Galba, Tiberius, & TSEBRA Genome Annotation workshop by Katharina Hoff (katharina.hoff@uni-greifswald.de).

In the following, we will walk through the process of genome annotation using a small portion of the *Arabidopsis thaliana* genome as an example.

Runtime estimates are currently not from AWS but from 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz measured with 8 threads. Compute time on AWS should be shorter, even with 4 threads.

I was asked to label "important" and "optional" parts of this notebook for didactic purposes. Please be aware that the second notebook builds on the full contents of this notebook. If you skip something, you may be unable to complete a randomly selected task from the second notebook, or you may have to go back to this one. However, in practice, there are some parts that are more important than others. I will use the following labels:

  - ⛵ - for smooth sailing in the future, do not skip this
  - 🚣 - you can skip this, but you may have to go back

## ⛵ Repeat masking

Repetitive sequences are a major problem for genome annotation with classical methods such as BRAKER and Galba. Deep learning methods such as Tiberius are less affected: even a model trained on a repeat-masked genome may perform acceptably on an unmasked or poorly masked genome during inference. Some repeats only coincidentally look like protein-coding genes; others (such as transposases) are protein-coding genes, but we are usually not interested in any of these "repeat genes" when trying to find protein-coding genes in a novel genome. Thus, a genome should be repeat-masked prior to gene prediction.

Repeat masking is a resource- and time-consuming step that is out of scope for this workshop. If you plan on using BRAKER or Galba for genome annotation, or want to use a Tiberius model that was trained for softmasked genomes, we recommend using RepeatModeler2 ([paper](https://doi.org/10.1073/pnas.1921046117), [software](https://www.repeatmasker.org/RepeatModeler/)) to construct a species-specific repeat library and mask the genome with RepeatMasker (ideally, you will perform these computations on a node with >70 threads and very fast storage I/O, possibly using RAM instead of an actual hard drive for temporary files):

```
T=72 # you need a large number of threads and fast i/o storage
GENOME=/opt/BRAKER/example/genome.fa
DB=some_db_name_that_fits_to_species

BuildDatabase -name ${DB} ${GENOME}
RepeatModeler -database ${DB} -pa ${T} -LTRStruct
RepeatMasker -pa ${T} -lib ${DB}-families.fa -xsmall ${GENOME}
```

This results in a file `${GENOME}.masked`. 
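
Before moving on, it can be reassuring to check how much of the genome actually ended up soft-masked. The following is a minimal sketch, not part of the workshop tooling: the helper name `masked_fraction` and the toy FASTA are made up for illustration, and the helper simply counts lowercase `a`, `c`, `g`, `t`, `n` characters in sequence lines.

```shell
# Hypothetical helper: percentage of soft-masked (lowercase) bases in a FASTA file.
masked_fraction() {
    awk '!/^>/ {
             total += length($0)               # count all bases on this sequence line
             masked += gsub(/[acgtn]/, "")     # gsub returns the number of lowercase bases
         }
         END {
             if (total > 0) printf "%.1f\n", 100 * masked / total
             else print "0.0"
         }' "$1"
}

# Toy input (made up) so that the sketch runs without the workshop data
TOY=$(mktemp)
printf '>seq1\nACGTacgtAC\n>seq2\nnnnnnGGGGG\n' > "${TOY}"

MASKED_PCT=$(masked_fraction "${TOY}")
echo "soft-masked: ${MASKED_PCT}%"
rm -f "${TOY}"
```

On real data you would call the helper on the masked genome file, e.g. `masked_fraction ${GENOME}.masked`. What percentage to expect depends strongly on the species.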

<details>
  <summary><b>🚣 Click to learn how to mask more rigorously when needed</b></summary>
Depending on the kind of genome, plenty of unmasked repeats may still persist. This is generally an issue to be expected in large genomes, such as vertebrate genomes, and you will notice the problem if the count of predicted proteins is extremely high. You can try to overcome "under-masking" with the following steps (we are suggesting to use GNU parallel to speed up the process):

```
# link the RepeatMasker output under the name genome.fa so that the chunks are called genome.split.*.fa
ln -s ${GENOME}.masked genome.fa
splitMfasta.pl --minsize=25000000 genome.fa

# Running TRF
ls genome.split.*.fa | parallel 'trf {} 2 7 7 80 10 50 500 -d -m -h'

# Parsing TRF output
# The script parseTrfOutput.py is from https://github.com/gatech-genemark/BRAKER2-exp
ls genome.split.*.fa.2.7.7.80.10.50.500.dat | parallel 'parseTrfOutput.py {} --minCopies 1 --statistics {}.STATS > {}.raw.gff 2> {}.parsedLog'

# Sorting parsed output...
ls genome.split.*.fa.2.7.7.80.10.50.500.dat.raw.gff | parallel 'sort -k1,1 -k4,4n -k5,5n {} > {}.sorted 2> {}.sortLog'

# Merging gff...
FILES=genome.split.*.fa.2.7.7.80.10.50.500.dat.raw.gff.sorted
for f in $FILES
do
    bedtools merge -i $f | awk 'BEGIN{OFS="\t"} {print $1,"trf","repeat",$2+1,$3,".",".",".","."}' > $f.merged.gff 2> $f.bedtools_merge.log
done

# Masking FASTA chunk
ls genome.split.*.fa | parallel 'bedtools maskfasta -fi {} -bed {}.2.7.7.80.10.50.500.dat.raw.gff.sorted.merged.gff -fo {}.combined.masked -soft &> {}.bedtools_mask.log'

# Concatenate split genome
cat genome.split.*.fa.combined.masked > genome.fa.combined.masked
```

The file `genome.fa.combined.masked` will be more rigorously masked.
</details>

## 🚣 RNA-Seq alignment with HiSat2

Spliced alignments of RNA-Seq short reads are a valuable information source for predicting protein-coding genes with high accuracy for BRAKER.

Executing HiSat2 is out of scope for the current session. You will find a ready-made alignment file in [/opt/BRAKER/example/RNAseq.bam](/opt/BRAKER/example/RNAseq.bam). BRAKER can work directly with FASTQ files, or even with SRA identifiers, but the runtime will then be a bit longer. For this workshop, we will use the precomputed alignment file to save time.

<details>
  <summary><b>🚣 If you want to see how such a file was prepared, click here and read.</b></summary>
  
We will map the *Arabidopsis thaliana* Illumina RNA-Seq reads from library SRR934391 in files [SRR934391_1.fastq.gz](/home/genomics/workshop_materials/genome_annotation/sra/SRR934391_1.fastq.gz) and [SRR934391_2.fastq.gz](/home/genomics/workshop_materials/genome_annotation/sra/SRR934391_2.fastq.gz). These are paired-end data, i.e. one file contains the forward reads while the other contains in the same order the reverse reads. The length of reads is in this case 100 nt.

We will use HiSat2 ([publication](https://doi.org/10.1038/s41587-019-0201-4), [software](https://github.com/DaehwanKimLab/hisat2)) to align these reads against a chunk of the *Arabidopsis thaliana* genome contained in the file [genome.fa](genome.fa). (You can in principle use any alignment tool capable of aligning RNA-seq reads to a genome, as long as it can perform spliced alignment.)

First, we need to build an index from the genome file:

```
# building the hisat2 index
hisat2-build /opt/BRAKER/example/genome.fa genome-idx 1> hisat2-build.log 2> hisat2-build.err
```

Inspect the log files [hisat2-build.log](hisat2-build.log) and [hisat2-build.err](hisat2-build.err) for possible errors.

Next, we align the RNA-seq reads against the genome. Do NOT do this on the Cesky Krumlov Workshop AWS resources. Performing this alignment took about 7 minutes with 70 threads; you have only 3 threads.

```
T=8 # adjust to number of threads that you booted with

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR934/SRR934391/SRR934391_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR934/SRR934391/SRR934391_2.fastq.gz

RNASEQDIR=.

time hisat2 -p ${T} -q -x genome-idx -1 ${RNASEQDIR}/SRR934391_1.fastq.gz \
    -2 ${RNASEQDIR}/SRR934391_2.fastq.gz -S rnaseq.sam \
    1> hisat2-align.log 2> hisat2-align.err
```

Our goal is to extract information on spliced alignments/intron positions from the alignment output file. To achieve this, we will use a tool called bam2hints that is part of the Augustus software suite ([software](https://github.com/Gaius-Augustus/Augustus)). However, this tool requires a sorted BAM file. Therefore, we first use Samtools ([paper](https://doi.org/10.1093/bioinformatics/btp352), [software](https://github.com/samtools)) to convert the SAM file to BAM format:

```

T=8 # adjust to number of threads that you booted with, takes ~2 minutes with 4 threads

SAMFILE=/home/genomics/workshop_materials/genome_annotation/sra/SRR934391.sam

time samtools view -@${T} -bSh ${SAMFILE} -o rnaseq.bam

# if you computed your own rnaseq.sam file, delete it to save space on harddrive
if [ -f rnaseq.sam ]
then
    rm rnaseq.sam
fi
```

Then, we sort that bam file (this will require a bit less than 4 GB of RAM):

```
T=8 # adjust to number of threads that you booted with, takes ~2 minutes with 4 threads

time samtools sort -@${T} -n rnaseq.bam -o rnaseq.s.bam

# remove the unsorted bam file to save space
rm rnaseq.bam
```

Careful, the BAM file above is just a demo example! We will use a different BAM file for running BRAKER because the one above does not contain sufficient data for running BRAKER3 successfully.
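
To illustrate what "intron positions from spliced alignments" means, here is a minimal sketch that derives intron coordinates from the `N` operations in SAM CIGAR strings. It is not how bam2hints is implemented and does not produce hints in AUGUSTUS format; the toy SAM records are made up, and the output is simply sequence name, intron start, and intron end (1-based).

```shell
# Toy SAM records (made up): one spliced read, one unspliced read, one read with two introns
TOY=$(mktemp)
{
    printf 'read1\t0\tchr1\t101\t60\t50M200N50M\t*\t0\t0\t*\t*\n'
    printf 'read2\t0\tchr1\t500\t60\t100M\t*\t0\t0\t*\t*\n'
    printf 'read3\t0\tchr1\t1000\t60\t5S20M100N30M2D10M50N15M\t*\t0\t0\t*\t*\n'
} > "${TOY}"

# Walk through each CIGAR string; operations M, D, N, = and X consume reference bases,
# and every N operation marks an intron.
INTRONS=$(awk -F'\t' '!/^@/ {
    pos = $4
    cigar = $6
    while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
        len = substr(cigar, 1, RLENGTH - 1) + 0
        op = substr(cigar, RLENGTH, 1)
        if (op == "N") printf "%s\t%d\t%d\n", $3, pos, pos + len - 1
        if (op ~ /[MDN=X]/) pos += len
        cigar = substr(cigar, RLENGTH + 1)
    }
}' "${TOY}")

printf '%s\n' "${INTRONS}"
rm -f "${TOY}"
```

On real data, the same awk program could read from `samtools view rnaseq.s.bam`; counting identical output lines would then give the number of reads supporting each intron.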
</details>

There is no need to run RNA-Seq alignment, samtools sorting, etc. during the workshop. A suitable BAM file is provided at `/opt/BRAKER/example/RNAseq.bam`.

## ⛵ Annotation of protein coding genes

Here is a brief decision scheme to figure out "what to do" to annotate protein coding gene structures in your novel genome:

<img src="decision_tiberius.drawio.png" alt="Decision Scheme" width="700"/>


In this session, we will cover BRAKER3, BRAKER2, Galba, and Tiberius *ab initio*.

Let's start with BRAKER3 ([poster from PAG2023](https://www.researchgate.net/profile/Lars-Gabriel-3/publication/367409816_The_BRAKER3_Genome_Annotation_Pipeline/links/63d14cbae922c50e99c29c7a/The-BRAKER3-Genome-Annotation-Pipeline.pdf), [software](https://github.com/Gaius-Augustus/BRAKER)), the state-of-the-art pipeline for annotating genomes with transcriptome and protein evidence. BRAKER3 is a great option if there are no Tiberius parameters available for your clade, or if you need alternative splicing isoforms to be annotated.

In [None]:
%%script bash

T=8 # adjust to number of threads that you booted with, takes ~14 minutes with 8 threads

# delete output from a possible previous run if it exists
if [ -d BRAKER3 ]
then
    rm -rf BRAKER3
fi

ORTHODB=/opt/BRAKER/example/subsampled_viri.fa # adjust to suitable clade of real OrthoDB from https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/
BUSCOLINEAGE=eukaryota_odb10 # adjust to more suitable lineage for real annotation runs

# run BRAKER3
# remember to remove --gm_max_intergenic and --skipOptimize for real jobs!
time braker.pl --workingdir=BRAKER3 --genome=/opt/BRAKER/example/genome.fa \
    --bam=/opt/BRAKER/example/RNAseq.bam \
    --prot_seq=${ORTHODB} --busco_lineage=${BUSCOLINEAGE} --threads ${T} \
    --gm_max_intergenic 10000 --skipOptimize \
    &> braker3.log

<details>
  <summary><b>Out of time, job died? Click here.</b></summary>
If you ran out of time (the BRAKER3 job takes substantial time), you may copy the most important files as follows from a notebook cell:

```
%%script bash
# delete output from a possible previous run if it exists
if [ -d BRAKER3 ]
then
    rm -rf BRAKER3
fi
cp -r BRAKER3_precomputed_results BRAKER3
```
</details>

Let's inspect the output. The most important files are braker.gtf, Augustus/augustus.hints.gtf, and GeneMark-ETP/genemark.gtf:

In [None]:
%%script bash
cd BRAKER3
ls -lh braker.gtf Augustus/augustus.hints.gtf GeneMark-ETP/genemark.gtf

The file [BRAKER3/what-to-cite.txt](BRAKER3/what-to-cite.txt) advises you on what papers should be cited if you were going to publish a manuscript on a gene set produced with BRAKER3.

braker.gtf is the main output. BRAKER internally runs compleasm to pick the best gene set according to BUSCO presence. Be aware of this when generating the following BUSCO plot for quality control. (The folder braker_original contains the BRAKER predictions prior to adding BUSCOs with compleasm, in case you want to look at these.)

Before running BUSCO, we need to make sure that we have protein sequences of all three gene sets (only the braker.aa exists by default):

In [None]:
%%script bash
# generate protein (and coding seq file) from AUGUSTUS predictions
cd BRAKER3/Augustus
getAnnoFastaFromJoingenes.py -g /opt/BRAKER/example/genome.fa -o augustus.hints -f augustus.hints.gtf
# generate protein (and coding seq file) from GeneMark-ETP predictions
cd ../GeneMark-ETP
getAnnoFastaFromJoingenes.py -g /opt/BRAKER/example/genome.fa -o genemark -f genemark.gtf
# see file sizes
cd ../
ls -lh braker.aa GeneMark-ETP/genemark.aa Augustus/augustus.hints.aa
# Count number of transcripts by counting FASTA headers
echo "Counting number of protein sequences = transcripts"
grep -c ">" braker.aa GeneMark-ETP/genemark.aa Augustus/augustus.hints.aa

GALBA has a simple script to compute the ratio of mono- to multi-exonic genes. It counts only one isoform per gene, which is why the transcript number differs from the number above for methods that predict alternative transcripts, such as AUGUSTUS and BRAKER:

In [None]:
%%script bash
cd BRAKER3
echo "Computing some descriptive statistics for BRAKER:"
analyze_exons.py -f braker.gtf
echo ""
echo "Doing the same for Augustus:"
analyze_exons.py -f Augustus/augustus.hints.gtf
echo ""
echo "And for GeneMark-ETP:"
analyze_exons.py -f GeneMark-ETP/genemark.gtf
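
To make the mono:multi statistic concrete, here is a minimal awk sketch on a made-up toy GTF. It is not `analyze_exons.py`: it counts CDS features per transcript for all transcripts and does not select one isoform per gene, so its numbers can differ from the script's output on gene sets with alternative isoforms.

```shell
# Toy GTF (made up): g1.t1 has two CDS exons, g2.t1 and g3.t1 have one each
TOY=$(mktemp)
{
    printf 'chr1\tAUGUSTUS\tCDS\t100\t200\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n'
    printf 'chr1\tAUGUSTUS\tCDS\t300\t400\t.\t+\t1\ttranscript_id "g1.t1"; gene_id "g1";\n'
    printf 'chr1\tAUGUSTUS\tCDS\t900\t990\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";\n'
    printf 'chr1\tAUGUSTUS\tCDS\t1500\t1600\t.\t+\t0\ttranscript_id "g3.t1"; gene_id "g3";\n'
} > "${TOY}"

# Count CDS features per transcript, then classify transcripts as mono- or multi-exonic
EXON_STATS=$(awk -F'\t' '$3 == "CDS" {
        if (match($9, /transcript_id "[^"]+"/)) {
            tx = substr($9, RSTART + 15, RLENGTH - 16)   # strip the key and the quotes
            n[tx]++
        }
    }
    END {
        for (tx in n) { if (n[tx] == 1) mono++; else multi++ }
        printf "mono=%d multi=%d\n", mono, multi
    }' "${TOY}")

echo "${EXON_STATS}"
rm -f "${TOY}"
```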

#### ⛵ BUSCO assessment

BUSCO ([paper](https://doi.org/10.1002/cpz1.323), [software](https://gitlab.com/ezlab/busco)) can provide information on sensitivity with respect to a clade-specific core gene set. In the following, we will use BUSCO to compare sensitivity in the BRAKER3, AUGUSTUS, and GeneMark-ETP gene sets.

First, we find the closest BUSCO lineage (we are working on *Arabidopsis thaliana*):

In [None]:
%%script bash

source conda_init
conda activate busco_env

busco --list-datasets > busco_lineages.txt 2> busco_lineages.log

All available lineages are now in [busco_lineages.txt](busco_lineages.txt). (Check [busco_lineages.log](busco_lineages.log) for possible errors.)

Look up the lineage of the target *Arabidopsis* at [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy). I believe the lineage is:

`cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Camelineae`

Now find a related lineage in [busco_lineages.txt](busco_lineages.txt). `brassicales_odb10` is the closest lineage. (If we had not wanted to save time when running BRAKER, we would also have used this lineage for the BRAKER run.)
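
The lookup described above can be sketched as a small loop: walk the taxonomy lineage from the most specific to the most general taxon and stop at the first name for which a `<name>_odb10` dataset is listed. This is a rough sketch under assumptions: the dataset list below is a made-up excerpt (the real `busco_lineages.txt` is longer), and BUSCO dataset names do not always match taxonomy names literally (e.g. `eudicots_odb10` versus "eudicotyledons"), so a manual check remains necessary.

```shell
# Lineage as listed in the text (shortened at the root)
LINEAGE='Eukaryota; Viridiplantae; Streptophyta; Embryophyta; eudicotyledons; rosids; malvids; Brassicales; Brassicaceae; Camelineae'

# Made-up excerpt of a dataset list; on real data use busco_lineages.txt
DATASETS=$(mktemp)
printf '%s\n' ' - eukaryota_odb10' '     - viridiplantae_odb10' '         - embryophyta_odb10' \
    '             - eudicots_odb10' '                 - brassicales_odb10' > "${DATASETS}"

# Reverse the lineage (most specific taxon first) and report the first taxon with a dataset
CLOSEST=$(printf '%s\n' "${LINEAGE}" | tr ';' '\n' | sed 's/^ *//; s/ *$//' | tr 'A-Z' 'a-z' \
    | awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }' \
    | while read -r taxon; do
          if grep -qw -- "${taxon}_odb10" "${DATASETS}"; then
              echo "${taxon}_odb10"
              break
          fi
      done)

echo "closest lineage found: ${CLOSEST}"
rm -f "${DATASETS}"
```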

Next, we run a BUSCO assessment on all gene sets (this takes ~4 minutes with 8 threads):

In [None]:
%%script bash

T=8 # adjust to number of threads that you booted with

source conda_init
conda activate busco_env

cd BRAKER3
# create links if not already present
if [ ! -L augustus.aa ]
then
    ln -s Augustus/augustus.hints.aa augustus.aa
    sleep 1 # not sure why we need to wait a few seconds, but otherwise system doesn't find the file
fi

if [ ! -L genemark.aa ]
then
    ln -s GeneMark-ETP/genemark.aa genemark.aa
    sleep 1 # not sure why we need to wait a few seconds, but otherwise system doesn't find the file
fi

if [ ! -d busco_downloads ]
then
    mkdir busco_downloads
    cd busco_downloads
    ln -s /home/genomics/workshop_materials/genome_annotation/busco/brassicales_odb10 brassicales_odb10
    cd ..
fi

GENESETS=(braker augustus genemark)

for g in ${GENESETS[@]}; do
    echo "Processing ${g}..."
    # delete old output if existing
    if [ -d busco_${g} ]
    then
        rm -r busco_${g}
    fi
    # run BUSCO
    busco -m proteins -i ${g}.aa -o busco_${g} \
        -l brassicales_odb10 -c ${T} &> busco_${g}.log
done

Next, we visualize the BUSCO results:

In [None]:
%%script bash

source conda_init
conda activate busco_env

cd BRAKER3

# create BUSCO_summaries folder if not present
if ! [ -d BUSCO_summaries ]
then
    mkdir BUSCO_summaries
fi

# copy all BUSCO results into the summaries folder
cp busco_*/short_summary*.txt BUSCO_summaries

# generate BUSCO plot
generate_plot.py -wd BUSCO_summaries &> generate_plot.log

Check the file [generate_plot.log](generate_plot.log) for possible errors. This results in the following figure (stored at [BRAKER3/BUSCO_summaries/busco_figure.png](BRAKER3/BUSCO_summaries/busco_figure.png)):

<img src="BRAKER3/BUSCO_summaries/busco_figure.png" alt="BUSCO results" width="400"/>

The data used in this session was selected purely for feasible runtime. In a real scenario with a complete genome, the BUSCO plot should contain a much larger number of complete BUSCOs. You are usually happy if the number of BUSCOs in the final BRAKER3 gene set is higher than or equal to the number detected in the AUGUSTUS and GeneMark-ETP sets, while the total number of transcripts does not grow unexpectedly (e.g. 80,000 proteins in a BRAKER gene set seems odd in most cases). If the same BUSCO lineage had been chosen for BRAKER and BUSCO, that would also be the case here (BUSCOs in BRAKER being more complete than in AUGUSTUS).
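
If you prefer numbers over a plot for this comparison, the percentage of complete BUSCOs can be pulled out of the short summary files. The sketch below assumes that each summary contains BUSCO's one-line notation of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the helper name `extract_complete`, the file names, and all values are made up for illustration.

```shell
# Hypothetical helper: print the "C:" (complete) percentage from a BUSCO short summary file
extract_complete() {
    sed -n 's/.*C:\([0-9.]*\)%\[S:.*/\1/p' "$1" | head -n 1
}

# Toy summary files with made-up values
WORK=$(mktemp -d)
printf '\tC:85.0%%[S:80.0%%,D:5.0%%],F:5.0%%,M:10.0%%,n:100\n' > "${WORK}/short_summary.braker.txt"
printf '\tC:80.0%%[S:78.0%%,D:2.0%%],F:8.0%%,M:12.0%%,n:100\n' > "${WORK}/short_summary.augustus.txt"

BRAKER_C=$(extract_complete "${WORK}/short_summary.braker.txt")
AUGUSTUS_C=$(extract_complete "${WORK}/short_summary.augustus.txt")
printf 'braker\t%s\naugustus\t%s\n' "${BRAKER_C}" "${AUGUSTUS_C}"
rm -rf "${WORK}"
```

On the workshop data, the helper could be applied in a loop over `BRAKER3/BUSCO_summaries/short_summary*.txt`.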

But what can we do if there is no RNA-Seq data for a particular species? In that case, we can resort to using either BRAKER2 (for small and medium sized genomes, with a large database of proteins that might be only remotely related), or we may use GALBA (for large vertebrate genomes, with a few closely related reference proteomes).

### ⛵ BRAKER2

BRAKER2 ([paper](https://doi.org/10.1093/nargab/lqaa108)) uses spliced alignment information from a huge database of proteins against the target genome. We typically use OrthoDB partitions of clades, hosted at https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/. Note: a set of proteins from one or a few related species is not sufficient for running BRAKER2. A particular set of proteins of a closely related species can be appended to a larger database for running BRAKER2. However, BRAKER2 is not an ideal tool for recovering a complete set of proteins from a related species.

The following call of BRAKER2 takes ~14 minutes on 8 threads:

In [None]:
%%script bash

T=8 # adjust to number of threads that you booted with

ORTHODB=/opt/BRAKER/example/subsampled_viri.fa # adjust to suitable OrthoDB clade, see BRAKER3
BUSCOLINEAGE=eukaryota_odb10 # adjust to more suitable lineage for real annotation runs

# delete output from a possible previous run if it exists
if [ -d BRAKER2 ]
then
    rm -rf BRAKER2
fi

# remember to remove --gm_max_intergenic and --skipOptimize for real jobs!
time braker.pl --workingdir=BRAKER2 --genome=/opt/BRAKER/example/genome.fa --prot_seq=${ORTHODB} \
    --busco_lineage ${BUSCOLINEAGE} --threads ${T} \
    --gm_max_intergenic 10000 --skipOptimize \
    2> braker2.log

<details>
  <summary><b>Out of time, job died? Click here.</b></summary>
If you ran out of time (the BRAKER2 job takes substantial time), you may copy the most important files as follows from a notebook cell:

```
%%script bash
# delete output from a possible previous run if it exists
if [ -d BRAKER2 ]
then
    rm -rf BRAKER2
fi
cp -r BRAKER2_precomputed_results BRAKER2
```
</details>

The most important output files are:

   * [BRAKER2/braker.gtf](BRAKER2/braker.gtf) - BRAKER gene predictions
   * [BRAKER2/Augustus/augustus.hints.gtf](BRAKER2/Augustus/augustus.hints.gtf) - intermediate AUGUSTUS gene predictions
   * [BRAKER2/GeneMark-EP/genemark.gtf](BRAKER2/GeneMark-EP/genemark.gtf) - intermediate GeneMark-EP gene predictions
   * [BRAKER2/hintsfile.gff](BRAKER2/hintsfile.gff) - hints that were used for running AUGUSTUS and TSEBRA in BRAKER
   
The file [BRAKER2/what-to-cite.txt](BRAKER2/what-to-cite.txt) advises you on what papers should be cited if you were going to publish a manuscript on a gene set produced with BRAKER2. 
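
A quick way to get a feeling for a hints file is to count the hints per feature type (third column). The sketch below runs on a made-up toy file in GFF layout; the source names and attribute values are illustrative only and are not copied from real BRAKER output.

```shell
# Toy hints in GFF layout (made up)
TOY=$(mktemp)
{
    printf 'chr1\tProtHint\tintron\t201\t300\t5\t+\t.\tsrc=P;mult=5;pri=4\n'
    printf 'chr1\tProtHint\tintron\t501\t650\t2\t+\t.\tsrc=P;mult=2;pri=4\n'
    printf 'chr1\tProtHint\tstart\t100\t102\t3\t+\t0\tsrc=P;mult=3;pri=4\n'
    printf 'chr1\tProtHint\tstop\t898\t900\t1\t+\t0\tsrc=P;mult=1;pri=4\n'
} > "${TOY}"

# Count hints per feature type, skipping comment lines and incomplete lines
HINT_COUNTS=$(awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                          END { for (t in n) print t "\t" n[t] }' "${TOY}" | sort)

printf '%s\n' "${HINT_COUNTS}"
rm -f "${TOY}"
```

On the workshop data, the same awk command could be pointed at `BRAKER2/hintsfile.gff`.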

All methods described for BRAKER3 (BUSCO, number of transcripts, mono:multi exon ratio, etc.) are of course applicable to BRAKER2, GALBA, and BRAKER1 as well. We will skip them here because of time constraints.

### ⛵ GALBA

GALBA ([preprint](https://www.biorxiv.org/content/10.1101/2023.04.10.536199v1.abstract), [software](https://github.com/Gaius-Augustus/GALBA)) is a BRAKER spin-off that uses miniprot ([paper](https://doi.org/10.1093/bioinformatics/btad014), [software](https://github.com/lh3/miniprot)) to generate a training gene set for AUGUSTUS. In contrast to the BRAKER2 and BRAKER3 pipelines, GALBA is not very good at using remotely related protein evidence. However, given reference proteins of several closely related species, GALBA is very good at recovering gene structures, particularly in large vertebrate genomes. You may execute GALBA as follows (on this toy example data set, it executes within 5 minutes on 8 threads):

In [None]:
%%script bash

T=8 # adjust to number of threads that you booted with

# delete output from a possible previous run if it exists
if [ -d GALBA ]
then
    rm -rf GALBA
fi

# remember to remove --skipOptimize for real jobs!
time galba.pl --workingdir=GALBA --genome=/opt/BRAKER/example/genome.fa \
    --prot_seq=/opt/GALBA/example/proteins.fa \
    --threads ${T} \
    --skipOptimize \
    2> galba.log

<details>
  <summary><b>Out of time, job died? Click here.</b></summary>
It is unlikely that GALBA will not complete in time. However, you may copy the most important files as follows from a notebook cell:

```
%%script bash
# delete output from a possible previous run if it exists
if [ -d GALBA ]
then
    rm -rf GALBA
fi
cp -r GALBA_precomputed_results GALBA
```
</details>

The most important output files are:

   * [GALBA/galba.gtf](GALBA/galba.gtf) - gene predictions by GALBA
   * [GALBA/hintsfile.gff](GALBA/hintsfile.gff) - hints that were used for running AUGUSTUS in GALBA
   
The file [GALBA/what-to-cite.txt](GALBA/what-to-cite.txt) advises you on what papers should be cited if you were going to publish a manuscript on a gene set produced with GALBA.

### ⛵ Tiberius

Tiberius ([paper](https://doi.org/10.1093/bioinformatics/btae685), [software](https://github.com/Gaius-Augustus/Tiberius)) is a deep learning-based tool for ab initio gene prediction. Unlike traditional gene finders that require species-specific training, Tiberius uses pre-trained models that can predict genes across a wide range of species without any training data from the target species.

Tiberius is particularly useful when:
- No RNA-Seq data is available
- No closely related protein sequences are available
- You want a quick initial annotation to compare against other methods

The container includes a conda environment `tiberius_env` with Tiberius installed. The model we will use is `eudicotyledons.yaml`, which is suitable for our *Arabidopsis thaliana* example. This model was trained on softmasked genomes and thus should ideally be applied to a softmasked genome (lowercase letters for repeats).

Let's run Tiberius on our genome (on this toy example data set, it executes within 1.5 minutes on the CPU; note that Tiberius will be much faster on a GPU if you want to annotate a larger genome).

It is surprisingly tricky to run Tiberius in Jupyter notebooks on AWS. We usually do not have to change so many bash variables to run Tiberius in Singularity. With the PYTHONPATH export, we are telling the computer: "Ignore the default Python folders and ONLY look in the 'tiberius_env' folder for the tools you need (like Keras and Optree)." With the MPLBACKEND export, we are telling the computer: "Just do the math and don't try to draw anything on my screen."

In [None]:
%%bash
# delete output from a possible previous run if it exists
if [ -d Tiberius ]
then
    rm -rf Tiberius
fi
mkdir Tiberius

export PYTHONPATH=/opt/conda/envs/tiberius_env/lib/python3.10/site-packages
export MPLBACKEND=Agg

# Run Tiberius
time python /opt/Tiberius/tiberius.py \
    --genome /opt/BRAKER/example/genome.fa \
    --out Tiberius/tiberius.gtf \
    --model_cfg /opt/Tiberius/model_cfg/eudicotyledons.yaml \
    --seq_len 259992 \
    --protseq Tiberius/tiberius.aa
    # Note: --seq_len is required for TensorFlow >= 2.13 (max 259992)

Do not worry about the tensorflow/CUDA warnings, they are expected on CPU-only systems.

<details>
  <summary><b>Out of time, job died? Click here.</b></summary>
If you ran out of time (the Tiberius job should be relatively quick, but just in case), you may copy the most important files as follows from a notebook cell:

```
%%script bash
# delete output from a possible previous run if it exists
if [ -d Tiberius ]
then
    rm -rf Tiberius
fi
cp -r Tiberius_precomputed_results Tiberius
```
</details>

The most important output files are:

   * [Tiberius/tiberius.gtf](Tiberius/tiberius.gtf) - Tiberius gene predictions
   * [Tiberius/tiberius.aa](Tiberius/tiberius.aa) - Tiberius protein sequences
   
Tiberius predictions can be combined with other gene sets using TSEBRA if desired. Note that Tiberius is an *ab initio* predictor, meaning it does not use any extrinsic evidence - it relies solely on the sequence patterns learned during deep learning training. Tiberius is very good at predicting gene structures that, in other pipelines like BRAKER3, would typically require extrinsic evidence. However, University of Greifswald Bioinformatics is currently developing a new pipeline for integrating extrinsic evidence derived gene models directly with Tiberius genes, stay tuned!

### 🚣 TSEBRA

TSEBRA is a tool for selecting a highly accurate gene set from several input sets according to supporting extrinsic evidence. BRAKER internally executes TSEBRA to combine the GeneMark and the AUGUSTUS gene sets. If all went well, you do not have to run TSEBRA separately at all. However, one scenario remains where TSEBRA may be useful:
   
   * You want to combine gene sets from different annotation pipelines, e.g. BRAKER3 and Tiberius.
   
We will look at how to run TSEBRA, using the example of merging the BRAKER3 and Tiberius gene sets. We will retain the BRAKER3 genes according to evidence and enforce the Tiberius genes:

In [None]:
%%script bash

# delete output from a possible previous run if it exists
if [ -d TSEBRA ]
then
    rm -rf TSEBRA
fi

mkdir TSEBRA
cd TSEBRA
tsebra.py -k ../Tiberius/tiberius.gtf -g ../BRAKER3/Augustus/augustus.hints.gtf,../BRAKER3/GeneMark-ETP/genemark.gtf \
    -e ../BRAKER3/hintsfile.gff -o tsebra.gtf 2> tsebra.log

# Generate protein sequences from TSEBRA output
getAnnoFastaFromJoingenes.py -g /opt/BRAKER/example/genome.fa -o tsebra -f tsebra.gtf

Check the file [tsebra.log](TSEBRA/tsebra.log) for possible errors. The final gene set is in file [tsebra.gtf](TSEBRA/tsebra.gtf). The protein sequences are in [tsebra.aa](TSEBRA/tsebra.aa).

## 🚣 Data visualization in the UCSC Genome Browser

Visualization of gene structures in context with extrinsic evidence is essential for coming to a decision on whether a gene set "makes sense" or "does not make sense". Typical problems that you may observe in a genome browser include "split genes" (where evidence implies two genes should in fact be a single gene) or "joined genes" (where evidence implies one gene should be split into two genes).

The UCSC Genome Browser ([paper](https://doi.org/10.1101/gr.229102), [resource](https://genome.ucsc.edu/)) is one of the most popular genome browsers. It has the advantage that you do not have to install a browser instance on your own webserver. Instead, you only need to provide a certain data structure with your target data on a webserver. The UCSC Genome Browser servers can display your data from there. The data structures are called "track data hubs" or "assembly hubs" ([paper](https://doi.org/10.1093/bioinformatics/btt637)). 

MakeHub ([paper](https://doi.org/10.1016/j.gpb.2019.05.003), [software](https://github.com/Gaius-Augustus/MakeHub)) is a Python script that fully automates the generation of such track data hubs for novel genomes. In the following, we will generate a simple track data hub for the genome sequence that we annotated with BRAKER3 (this takes only a few seconds; simply repeat the execution if it fails the first time due to internet connectivity problems):

In [None]:
%%script bash

T=8 # adjust to number of threads that you booted with

time make_hub.py -e katharina.hoff@uni-greifswald.de \
    --genome /opt/BRAKER/example/genome.fa --long_label "A chunk from the Arabidopsis thaliana genome" \
    --short_label at_chunk  --bam /opt/BRAKER/example/RNAseq.bam --threads ${T} \
    --latin_name "Arabidopsis thaliana" \
    --assembly_version "artificially split custom assembly" \
    --hints BRAKER3/hintsfile.gff --gene_track BRAKER3/braker.gtf BRAKER3

# Tiberius does not report start and stop codons, this is why gtf2genePred in MakeHub needs the gene and
# transcript lines to be formatted more strictly than for BRAKER; we also remove the cds_type
awk -F'\t' 'BEGIN {OFS="\t"} {
    if ($3 == "gene") {
        $9 = "gene_id \"" $9 "\";"
    } else if ($3 == "transcript") {
        # Extracts "g1" from "g1.t1" to use as gene_id
        split($9, a, ".");
        $9 = "transcript_id \"" $9 "\"; gene_id \"" a[1] "\";"
    }
    print $0
}' Tiberius/tiberius.gtf | sed -E 's/ ?cds_type=[^; ]+;?//g' > Tiberius/tiberius_for_makehub.gtf

time make_hub.py --short_label at_chunk -e katharina.hoff@uni-greifswald.de -A \
    --gene_track Tiberius/tiberius_for_makehub.gtf Tiberius

You can't perform the suggested `scp` command from the apphub, unless you have privileges on a University of Greifswald webserver. We have therefore copied a prepared hub in advance. The `hub.txt` is available at https://bioinf.uni-greifswald.de/hubs/at_chunk/hub.txt . Remember that link.

In order to visualize your data, go to https://genome.ucsc.edu/ in **Chrome** (do not use Firefox since it does not seem to work properly with the UCSC Genome Browser at the moment). Click on `My Data` -> `Track Hubs` -> choose the European mirror -> click on `Connected Hubs` and enter the link https://bioinf.uni-greifswald.de/hubs/at_chunk/hub.txt into the text window -> click on `Add Hub` -> click on `Go`. Congratulations, your Hub is now connected. You should see something like this: 

<img src="at_chunk.png" alt="UCSC Genome Browser example" width="1000"/>

### 🚣 How to know which sequences to browse

The long sequences are usually the most interesting to look at. The following command lists the sequence names in order of descending length; you can copy-paste the sequence names into the search window of the UCSC Genome Browser.

In [None]:
%%script bash

N=5 # how many longest sequences would you like to know about

# sort by length (descending) first, then keep the N longest sequences
summarizeACGTcontent.pl /opt/BRAKER/example/genome.fa | grep bases | sort -nr | head -n ${N} \
   | perl -ne 'm/(\d+)\s+bases\.\s+(\S+)/; print "$2\t$1\n";'
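
If `summarizeACGTcontent.pl` is not at hand, the same ranking can be sketched with awk and sort alone. This is a minimal sketch on a made-up toy FASTA, not a replacement for the Augustus script: it only reports sequence name and length.

```shell
# Toy FASTA (made up); the sequence "long" spans two lines on purpose
TOY=$(mktemp)
printf '>short\nACGT\n>long\nACGTACGTAC\nGG\n>medium\nACGTAC\n' > "${TOY}"

N=2 # how many of the longest sequences to report

# Sum the line lengths per sequence, then sort by length (descending) and keep the top N
LONGEST=$(awk '/^>/ { if (name != "") print name "\t" len
                      name = substr($1, 2); len = 0; next }
               { len += length($0) }
               END { if (name != "") print name "\t" len }' "${TOY}" \
          | sort -k2,2nr | head -n "${N}")

printf '%s\n' "${LONGEST}"
rm -f "${TOY}"
```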

## 🚣 How to run BRAKER, GALBA, Tiberius in Docker or Singularity

**Important:** The container used in this workshop (`katharinahoff/bioinformatics-notebook`) bundles multiple tools together for teaching convenience only. **For real-life applications, always use the separate, officially maintained containers:**

| Tool | Docker Container | Singularity |
|------|------------------|-------------|
| BRAKER | `docker pull teambraker/braker3` | `singularity build braker3.sif docker://teambraker/braker3:latest` |
| GALBA | `docker pull katharinahoff/galba` | `singularity build galba.sif docker://katharinahoff/galba:latest` |
| Tiberius | `docker pull larsgabriel23/tiberius` | `singularity build tiberius.sif docker://larsgabriel23/tiberius:latest` |

The separate containers are:
- **Regularly updated** with bug fixes and new features
- **Better tested** for their specific tool
- **Recommended** for production use and publications

### Running with Docker

Example for BRAKER3:
```
sudo docker run --rm -it -u $(id -u):$(id -g) -v /path/to/data:/data teambraker/braker3:latest braker.pl --genome=/data/genome.fa --bam=/data/rnaseq.bam --workingdir=/data/braker_out
```

### Running with Singularity

Example for BRAKER3:
```
singularity exec -B /path/to/data:/data braker3.sif braker.pl --genome=/data/genome.fa --bam=/data/rnaseq.bam --workingdir=/data/braker_out
```

For detailed instructions, please refer to the documentation on the respective GitHub repositories:
- BRAKER: https://github.com/Gaius-Augustus/BRAKER
- GALBA: https://github.com/Gaius-Augustus/GALBA
- Tiberius: https://github.com/Gaius-Augustus/Tiberius

Also see the [README.md](README.md) file for additional Singularity instructions specific to this workshop.

## Troubleshooting

### 🚣 I have 80,000 genes predicted by BRAKER/Tiberius/TSEBRA in a full genome, what shall I do?

Please first check whether you are referring to genes or to transcripts. BRAKER predicts alternative isoforms. If RNA-Seq data supports this, the number of alternative transcripts may be large, but likely true. Tiberius *ab initio* predicts only one isoform per gene; with evidence, it also predicts alternative transcripts. If it is really genes that you counted, then 80,000 is indeed far too many (unless you are dealing with a genome that has multiple copies of each chromosome). In BRAKER, GeneMark-ET/ES/EP/ETP most likely produced highly fragmented training genes for AUGUSTUS. This will also lead to highly fragmented genes predicted by AUGUSTUS. First, check whether your genome has been masked for repeats. Consider using the additional TRF masking described at the top of this notebook. If that does not help, and if you have a protein set of closely related species at hand, consider using that protein set as the sole training data for AUGUSTUS. You can use GALBA for this (https://github.com/Gaius-Augustus/GALBA). If Tiberius caused the issue, try running BRAKER3 instead.
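
The first check, genes versus transcripts, can be sketched by counting distinct `gene_id` and `transcript_id` values on the CDS lines of a GTF file. The toy GTF below is made up, and the sketch assumes that CDS lines carry both attributes in the quoted GTF notation.

```shell
# Toy GTF (made up): gene g1 has two isoforms, gene g2 has one
TOY=$(mktemp)
{
    printf 'chr1\tAUGUSTUS\tCDS\t100\t200\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n'
    printf 'chr1\tAUGUSTUS\tCDS\t100\t260\t.\t+\t0\ttranscript_id "g1.t2"; gene_id "g1";\n'
    printf 'chr1\tAUGUSTUS\tCDS\t900\t990\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";\n'
} > "${TOY}"

# Collect distinct gene and transcript identifiers from CDS lines and count them
GENE_STATS=$(awk -F'\t' '$3 == "CDS" {
        if (match($9, /gene_id "[^"]+"/)) genes[substr($9, RSTART, RLENGTH)] = 1
        if (match($9, /transcript_id "[^"]+"/)) txs[substr($9, RSTART, RLENGTH)] = 1
    }
    END {
        for (g in genes) ng++
        for (t in txs) nt++
        printf "genes=%d transcripts=%d\n", ng, nt
    }' "${TOY}")

echo "${GENE_STATS}"
rm -f "${TOY}"
```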

### 🚣 I have only 10,000 genes predicted by BRAKER/TSEBRA in a full genome, what shall I do?

Check whether the BRAKER output files in the subfolders Augustus and GeneMark-* contain more genes than the TSEBRA output. By default, TSEBRA will discard genes without evidence. If you have only little evidence for your species, TSEBRA might be a bad idea. You can also try rerunning TSEBRA while enforcing one of the gene sets. There are also species for which it is "normal" to observe fewer than 10,000 genes; check the annotated relatives.

### 🚣 How do I know how many genes to expect?

Hard to say. You can download gene sets of related species e.g. from NCBI Genomes, and count. Some gene sets tend to be "underannotated", i.e. they may represent rather the lower numbers of what might be realistic. Katharina usually gets nervous about more than 45000 genes and fewer than 15000 genes. These are definitely weird gene counts - but as stated before, there are cases where these are totally fine, too. Otherwise: always inspect your gene set in a Genome Browser such as the UCSC Genome Browser to identify problems.

### 🚣 I have long Iso-Seq RNA-Seq transcripts, can I put them into BRAKER?

Yes, you will find a dedicated container and instructions on the poster from PAG 2024 at https://github.com/Gaius-Augustus/BRAKER/blob/master/docs/posters/poster_PAG2024.pdf . Note: this works only with Iso-Seq data, not with a mix of short and long reads. If you have both data types, you could run it once with long reads, once with short reads, and then combine the results with TSEBRA.

The Tiberius evidence integration pipeline natively supports the combination of Iso-Seq and RNA-Seq data.

### 🚣 I opened an issue on GitHub about BRAKER, Tiberius or TSEBRA 100 days ago, nobody replied, why?

We are a small team of developers. We try our best and usually respond to well-described and easy-to-solve issues within a rather short time frame. Solving other issues may take considerable amounts of time that we simply do not have, or they may be described in a way that leaves us not knowing what to do with them... please be patient with us. At this point in time, we are no longer fixing bugs in BRAKER. Consider using Tiberius if you can.

### ⛵ I have a problem, whom do I tell?

Please read through the Issues on GitHub. If the issue does not exist yet, open an issue.

# ⛵ Ready to move on?

If you feel confident about your skills, take them to the next level. We have prepared chromosome 4 of a small genome in the following notebook: [Annotate_Babesia_duncani.ipynb](Annotate_Babesia_duncani.ipynb). The task is designed such that you will not complete all tasks during today's session. Instead, you will be randomly assigned a sub-task, and we will merge the results of everyone who participates to gain a final overview of the results.

### The End