# Submodule 2: Assessment of genome assembly and genome annotation
--------
## Overview
In this submodule, you will begin with the genome that you assembled in Submodule 1. The primary goal is to assess the quality of the assembled genome through the lens of what we call the "5 Cs": Contiguity, Completeness, Contamination, Coverage, and Content. Using a combination of bioinformatics tools, participants will evaluate the assembled genome and generate outputs that include visualizations, a cleaned genome sequence, and functional annotations. These outputs will be used in Submodule 4.


### Learning Objectives
Through this submodule, users will gain hands-on experience in quality assessment, resulting in a deeper understanding of genomic data integrity and the significance of accurate genome sequences.

- **Understand and Apply the 5 Cs of Genome Quality**:  
  Understand how to assess the overall quality of a genome sequence by examining Contiguity (QUAST), Completeness (BUSCO), Contamination (BLAST/BlobTools), Coverage (BWA/Samtools), and Content (Bakta gene annotations).

- **Understand Core Bioinformatic File Formats and Interpret Visualizations**:  
  Gain proficiency in using bioinformatics tools and foster skills in data analysis and interpretation.

- **Relate the Central Dogma of Molecular Biology to Genome Annotation**:  
  Connect the principles of the central dogma (DNA → RNA → Protein) to the process of genome annotation, understanding how gene annotations contribute to functional genomics and biological interpretations.

- **Produce a Clean and Annotated Genome**:  
  Participants will refine the genome based on their assessments, ensuring a high-quality, annotated genome that can be used for further analysis or research applications.

## **Install required software**

Several additional tools are required for Submodule 2: quast, busco, bwa, samtools, blast, blobtools, and prokka. As with Submodule 1, the tools are preinstalled in a Docker image, but we will demonstrate how to install them using __[Conda](https://docs.conda.io/en/latest/)__. Note that the annotation step later in this submodule is run with Bakta rather than Prokka.

Each piece of software, along with links to publications and documentation, will be described in turn. Below is a brief summary of these tools.

### List of software
| **Tool**       | **Description**                                                                                                                                                           |
|:---------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **QUAST**      | Used for evaluating and reporting the quality of genome assemblies by comparing them against reference genomes or generating statistical summaries.                          |
| **BUSCO**      | Utilized for assessing genome completeness by searching for conserved single-copy orthologs from specific lineage datasets.                                                 |
| **BWA**        | A fast and memory-efficient tool for aligning sequence reads to large reference genomes, commonly used in variant calling pipelines.                                         |
| **Samtools**   | Used for manipulating and processing sequence alignments stored in SAM/BAM format. Essential for sorting, indexing, and viewing alignment files.                            |
| **BLAST**      | A widely used tool for comparing an input sequence to a database of sequences, identifying regions of local similarity and aiding in functional annotation.                  |
| **BlobTools**  | A versatile tool for visualizing and analyzing genome assemblies, helping to identify contamination or misassembled regions by correlating sequence features with taxonomy.   |
| **Prokka**     | Used for rapid annotation of prokaryotic genomes, identifying genes, coding sequences, rRNAs, tRNAs, and other genomic features.                                             |

In [None]:
%%bash

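# Install all tools using mamba (a conda alternative) with specific versions.
# The command is only printed with echo, not executed, because the tools are preinstalled in the Docker image.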

echo "mamba install --channel bioconda \
    quast=5.2.0 \
    busco=5.4.6 \
    bwa=0.7.18 \
    samtools=1.18 \
    blast=2.15.0 \
    blobtools=1.0.1 \
    prokka=1.14.6 \
    -y > /dev/null 2>&1"

echo "Installation of quast, busco, bwa, samtools, blast, blobtools, and prokka complete."

quast.py --version
busco --version
bakta --version

## Starting Data

This submodule starts with the **genome FASTA** file, which will be the primary input for all programs. We will define this as the variable *genome* now and use it for the remainder of the workflow. This makes it easy to change the starting data if a user wants to run this tutorial with their own data.

We will also need the original reads from Submodule 1 for the read mapping step. We will use BWA to calculate sequencing coverage, and it requires **paired-end sequencing reads in FASTQ format**. We define these here as *forward* for the R1 sequencing reads and *reverse* for the R2 sequencing reads.

In [None]:
%%bash

# starting genome from submodule 1
prev_genome=assembled-genome/genome.fasta

# raw reads from submodule 1
prev_forward=raw-reads/reads_1.fastq.gz
prev_reverse=raw-reads/reads_2.fastq.gz

# link to a new location (so we can use custom datasets)
mkdir -p submodule02_data/
genome=submodule02_data/genome.fasta
forward=submodule02_data/reads_1.fastq.gz
reverse=submodule02_data/reads_2.fastq.gz

# Create symbolic links using absolute paths (this way it doesn't use more space).
# -f replaces an existing link so the cell can be re-run without errors.
ln -sf "$(realpath "$prev_genome")" "$genome"
ln -sf "$(realpath "$prev_reverse")" "$reverse"
ln -sf "$(realpath "$prev_forward")" "$forward"

ls submodule02_data/

## Process 1: **Contiguity** assessment using QUAST
- Program: **QUAST (Quality Assessment Tool for Genome Assemblies)**
- Citation: *Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072-1075.*
- Manual: https://github.com/ablab/quast

QUAST is a tool used to evaluate and compare the quality of genome assemblies by providing metrics such as N50, number of contigs, genome length, and misassemblies. Did you get one contig representing your entire genome? Or did you get thousands of contigs representing a highly fragmented genome?

QUAST has many functionalities that we will not cover in this tutorial. I encourage you to explore them, but for now we are going to use it in its simplest form. The program parses the genome FASTA file and records statistics about each contig, such as its length and GC content. This is the type of information you would typically report in a publication or use to compare different assemblers and options.

The **input** to the program is the genome assembly **FASTA**, and the outputs are various tables and an HTML/PDF report you can export and view.

In [None]:
%%bash

genome=submodule02_data/genome.fasta
echo "Running QUAST on genome FASTA file:" $genome

# run quast on the genome assembly
quast.py $genome -o output-quast > logfile_quast.txt 2>&1

# display output directory contents
echo "QUAST complete, output directory contents:"
ls -l output-quast/*

In [None]:
%%bash

# open output file
cat output-quast/report.txt

In [None]:
from IPython.display import IFrame
from IPython.display import Image
IFrame('output-quast/report.html', width=1000, height=550)

### Explanation of QUAST outputs
We used QUAST to assess the contiguity of the genome assembly; the results describe how well the genome was put back together. These statistics allow us to describe our dataset within a manuscript and to compare different assembly programs, algorithms, and parameters. As mentioned in Submodule 1, a typical bacterial genome consists of a single circular chromosome, so ideally we would expect to end up with a single contig. Unfortunately, this is rarely the case, for reasons we will discuss below.

The QUAST report displayed above provides some basic statistics about our genome assembly. Focus your attention on the values starting at '# contigs'. As described at the top of the report, these statistics are based on all contigs in the assembly with length >= 500 bp, a typical cutoff for genome assembly submissions. NCBI accepts submission of contigs >= 250 bp in their genome assembly database.

#### Focused Results
```
# contigs                   30      
Largest contig              512,707  
Total length                1,693,722 
GC (%)                      30.29   
N50                         180742  
L50                         3       
```

The *de novo* genome assembly resulted in a **total length of 1,693,722 bp across 30 contigs**. If the species of interest is known, you can compare the total genome length against publicly available datasets.

The **N50 of the assembly is 180,742 bp**, meaning that half of the total assembly length is contained in contigs that are at least this long. N50 is a widely used metric for assessing contiguity, with higher N50 values indicating better assembly quality. The **L50 is 3**, which complements the N50 by showing the smallest number of contigs needed to cover half the genome length. In this case, **the three longest contigs account for at least half of the assembly**. Metrics like N90 and L90 are similar to N50 and L50 but focus on 90% of the genome length.
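
To make the N50 and L50 definitions concrete, here is a minimal Python sketch that computes both values from a list of contig lengths. The function name `n50_l50` and the toy lengths are made up for illustration; this is not how QUAST itself is implemented, only the same definition applied by hand.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    # Walk through the contigs from longest to shortest
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that pushes the running sum past half of the
        # assembly defines N50 (its length) and L50 (how many contigs so far)
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly (lengths in bp), not the tutorial genome
toy_lengths = [500, 400, 300, 200, 100]
n50, l50 = n50_l50(toy_lengths)
print(f"N50 = {n50} bp, L50 = {l50}")
```

For the toy assembly the total length is 1,500 bp, so the two longest contigs (500 + 400 bp) are enough to reach half of it, giving an N50 of 400 bp and an L50 of 2.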

What qualifies as a "good" N50 depends on the genome being assembled. More contiguous assemblies, meaning those with a high N50 and a low L50, are generally considered better. However, what is achievable also depends on the complexity of the organism's genome and the sequencing approach used.

QUAST also provides us with the **GC content (30.29%)**, a metric describing the nucleotide composition of the genome. According to **Chargaff's second rule** (see below), GC content is generally constant within a species but can vary widely between unrelated species. Comparing GC content across species can therefore provide clues about evolutionary relationships, ecological adaptations, and functional constraints.


## Chargaff's Second Rule

Chargaff's **second rule** states that within a double-stranded DNA molecule, the base composition is species-specific but exhibits certain patterns:

1. **Base Pair Equality**: The proportion of adenine (A) roughly equals the proportion of thymine (T), and the proportion of guanine (G) roughly equals the proportion of cytosine (C). This rule is foundational to the understanding of complementary base pairing in DNA.
   - %A ≈ %T
   - %G ≈ %C
       
2. **Species-Specific GC Content**: While the total GC content (%G + %C) and AT content (%A + %T) can vary widely between species, they are relatively consistent within the genome of a given species. This variation is a hallmark of species identity and evolutionary lineage.

### Example
A genome with 40% GC content will have 60% AT content, with G and C each making up roughly 20% of the bases and A and T each making up roughly 30%.
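
The base-pairing pattern described above can be checked on a small example. The sketch below is an illustration only: the helper names and the 10-base sequence are hypothetical. It builds both strands of a duplex from one strand and reports the base composition, which shows why %A matches %T and %G matches %C once both strands are counted.

```python
from collections import Counter

# Translation table used to complement each base
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def base_composition(sequence):
    """Return the fraction of A, C, G and T among unambiguous bases."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def gc_content(sequence):
    """Return the GC fraction (between 0 and 1) of a DNA sequence."""
    composition = base_composition(sequence)
    return composition["G"] + composition["C"]


# Hypothetical single strand with an uneven base composition
toy_strand = "AAAGGCATTA"
# Counting both strands of the duplex restores the A/T and G/C balance
duplex = toy_strand + reverse_complement(toy_strand)

for base, fraction in base_composition(duplex).items():
    print(f"{base}: {fraction:.2f}")
print(f"GC content: {gc_content(duplex):.2f}")
```

Note that the GC content is the same whether it is computed on one strand or on the duplex, which is why tools such as QUAST can report it from the assembled contigs alone.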

---

### Significance of Chargaff's Second Rule

1. **Foundation for Watson and Crick's Model**: Chargaff's observations were critical for deducing the double-helix structure of DNA, where base pairing (A-T and G-C) ensures the rules hold true.

2. **Comparative Genomics**: The species-specific nature of GC content allows researchers to use it as a comparative tool for identifying evolutionary relationships, detecting horizontal gene transfer, or distinguishing between microbial strains.

3. **Genome Assembly Quality**: Deviations from expected GC content in an assembly may indicate errors, contamination, or sequencing biases, making it a valuable metric in genome analysis tools like QUAST.

---

In essence, Chargaff's second rule highlights the balance and specificity in DNA's molecular composition, emphasizing both its biochemical properties and its role in evolution and species differentiation.

In [None]:
%%bash

# Save select Quast Results into log file
log_file="genome-assessment-log.txt"

echo "Saving Select Quast Results in log file"
echo "Quast results:" > $log_file
grep -A 4 "# contigs" output-quast/report.txt | tail -n 5 >> $log_file
echo "----------------------------------" >> $log_file

cat $log_file

## Process 2: **Completeness** assessment using BUSCO

- Program: **BUSCO - Benchmarking Universal Single-Copy Orthologs**
- Citation: *Seppey, M., Manni, M., & Zdobnov, E. M. (2019). BUSCO: assessing genome assembly and annotation completeness. Gene prediction: methods and protocols, 227-245.*  
- Manual: https://busco.ezlab.org/

BUSCO is a program used to assess the completeness of a genome assembly in terms of how many expected universal genes are found. It makes use of the OrthoDB sets of single-copy orthologs that are found in at least 90% of the organisms in a given group. There are different datasets for various taxonomic groups (Eukaryotes, Metazoa, Bacteria, Gammaproteobacteria, etc.). The idea is that a newly sequenced genome *should* contain most of these highly conserved genes. If your genome is missing a large portion of these single-copy orthologs, your assembly may not be complete.


<p align="center">
  <img src="images/busco_sampling.png" width="40%"/>
</p>


The input to the program is your genome assembly (contigs) and your choice of lineage dataset. The output is a directory with a short summary of the results, a full table with the coordinates of each orthologous gene located in your assembly, and a directory with the nucleotide and amino acid sequences of all the identified genes.

We will focus on the main summary output as a simple QC assessment of our assembly. The outputs provided by BUSCO have many other uses, however, such as phylogenomics and gene prediction.

In [None]:
%%bash

# View available sets
busco --list-datasets

In [None]:
%%bash

# starting data
genome=submodule02_data/genome.fasta

# lineage to search against
lineage=bacteria

# run BUSCO
busco -i $genome -m genome -o output-busco -l $lineage --cpu 24 -f

### Explanation of BUSCO output
The output displayed above provides a summary of the BUSCO completeness analysis. See the short summary file (here `output-busco/short_summary.specific.bacteria_odb10.output-busco.txt`) for the detailed report. This file summarizes the main finding: how many of the expected genes did we find? The summary breaks the report into four main categories: **complete single-copy genes, complete duplicated genes, fragmented genes, and missing genes**.

We hope that the majority of our genes will be found as 'complete single-copy'. Duplicated genes could indicate that a particular gene underwent a duplication event, or that we have a misassembly and essentially have two copies of a region of our genome. Fragmented genes are an artifact of the fact that our genome did not assemble perfectly. Some of our genome is fragmented into multiple contigs, and as a result some of our genes are going to be fragmented as well. This is why it is important to inspect the N50 of the genome with QUAST. We want the majority of our contigs to be at least as long as a gene; if they are not, then we will have many fragmented genes.

### Other BUSCO results

Next we will view the full table (here `output-busco/run_bacteria_odb10/full_table.tsv`). This file shows the coordinates of all the associated single-copy genes in our genome. It also provides the status of each ortholog (complete, duplicated, fragmented, or missing). This TSV file can be exported and viewed in Excel.

The final files we will examine are in a directory called 'single_copy_busco_sequences/'. This houses the nucleotide and amino acid sequences of the single-copy genes. It is a rich resource for comparative genomics and other sorts of analyses.
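
As a complement to viewing the table by eye, the status column can be tallied with a few lines of Python. This is a minimal sketch that assumes the tab-separated layout described above (comment lines starting with `#`, then the BUSCO id in the first column and the status in the second). The function name and the example rows are hypothetical and are not taken from the tutorial genome.

```python
from collections import defaultdict


def tally_busco_status(lines):
    """Count unique BUSCO ids per status from full-table lines."""
    ids_by_status = defaultdict(set)
    for line in lines:
        # Skip blank lines and the commented header block
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        busco_id, status = fields[0], fields[1]
        # A duplicated BUSCO appears on several rows, so count ids, not rows
        ids_by_status[status].add(busco_id)
    return {status: len(ids) for status, ids in ids_by_status.items()}


# Hypothetical rows in the same layout as the BUSCO full table
example_rows = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength",
    "id_0001\tComplete\tcontig_1\t100\t1300\t+\t850.2\t400",
    "id_0002\tComplete\tcontig_2\t5000\t5900\t-\t610.0\t300",
    "id_0003\tDuplicated\tcontig_1\t9000\t9600\t+\t400.5\t200",
    "id_0003\tDuplicated\tcontig_7\t150\t750\t+\t398.1\t200",
    "id_0004\tFragmented\tcontig_9\t10\t310\t-\t120.3\t100",
    "id_0005\tMissing",
]

print(tally_busco_status(example_rows))
```

To run it on the real output, the same function can be given an open file handle, for example `with open("output-busco/run_bacteria_odb10/full_table.tsv") as handle: print(tally_busco_status(handle))`.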

In [None]:
%%bash

# examine the table (header line first, then rows grouped by status)
echo "Header:"
grep '# Busco id' output-busco/run_bacteria_odb10/full_table.tsv
echo '#############################################'

# see the categories of genes
echo "Fragmented genes:"
awk -F'\t' '$2 == "Fragmented"' output-busco/run_bacteria_odb10/full_table.tsv

echo '#############################################'
echo "Missing genes:"
awk -F'\t' '$2 == "Missing"' output-busco/run_bacteria_odb10/full_table.tsv

echo '#############################################'
echo "Duplicated genes:"
awk -F'\t' '$2 == "Duplicated"' output-busco/run_bacteria_odb10/full_table.tsv


In [None]:
%%bash

# Save BUSCO Results
log_file="genome-assessment-log.txt"

echo "BUSCO results:" >> $log_file
grep -A 4 'Complete and single-copy BUSCOs' output-busco/short_summary.specific.bacteria_odb10.output-busco.txt \
    | awk -F'\t' '{print $3"\t"$2}' >> $log_file
echo "----------------------------------" >> $log_file

cat $log_file

## Process 3: **Coverage** assessment using BWA

- Program: **BWA** and **Samtools**
- Citations: *Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324*, *Li, H., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352*
- BWA manual: https://bio-bwa.sourceforge.net/bwa.shtml
- SAMtools manual: http://www.htslib.org/doc/samtools-1.2.html
- SAM format specifications: https://samtools.github.io/hts-specs/SAMv1.pdf

Read mapping refers to the process of aligning short reads to a reference sequence. This reference can be a complete genome, a transcriptome, or, in our case, a *de novo* assembly. Read mapping is fundamental to many commonly used pipelines such as differential expression or SNP analysis. We will use it to calculate the average coverage of each of our contigs and the overall coverage of our genome (a requirement for GenBank submission). The main output of read mapping is a **Sequence Alignment Map (SAM)** file. The file records where each sequencing read matches our assembly and how it maps. There are hundreds of programs that use SAM files as a primary input. A BAM file is the binary version of a SAM file, and the two can be converted very easily using samtools.

Many programs perform read mapping. The recommended program depends on what you are trying to do. My favorite is 'BWA mem', which balances performance and accuracy well. The input to the program is a reference assembly and the reads to map (forward and reverse). The output is a SAM file. By default BWA writes the SAM file to standard output, so I redirect it to a file. There are lots of options; please see the manual to understand the ones I am using.
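
To give a feel for what a SAM record encodes, the sketch below decodes the FLAG field (the second column of each alignment line) into readable labels, using the bit values defined in the SAM specification linked above. The label strings, the function names, and the example record are illustrative choices, not output from this tutorial.

```python
# Bit values of the SAM FLAG field, as defined in the SAM specification
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]


def describe_alignment(sam_line):
    """Extract read name, contig, position and decoded flags from one SAM line."""
    fields = sam_line.rstrip("\n").split("\t")
    read_name, flag, contig, position = fields[0], int(fields[1]), fields[2], int(fields[3])
    return read_name, contig, position, decode_flag(flag)


# Hypothetical alignment line (header lines starting with '@' are not alignments)
example = "read_1\t99\tcontig_1\t100\t60\t50M\t=\t300\t250\t*\t*"
print(describe_alignment(example))
```

A flag of 99 (1 + 2 + 32 + 64) therefore describes a read that is paired, mapped in a proper pair, whose mate maps to the reverse strand, and that is the first read of its pair. `samtools flagstat` summarizes these same bits across the whole file.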

In [None]:
%%bash

#genome and reads
genome=submodule02_data/genome.fasta
forward=submodule02_data/reads_1.fastq.gz
reverse=submodule02_data/reads_2.fastq.gz


# Step 1: Index your reference genome. This is a requirement before read mapping.
bwa index $genome
# Step 2: Map the reads and construct a SAM file.
bwa mem -t 24 $genome $forward $reverse > raw_mapped.sam
# view the file with less, note that to see the data you have to scroll down past all the headers (@SQ).
# head -n 200 raw_mapped.sam | less -S

In [None]:
%%bash

# Convert the SAM to a BAM and sort it by coordinate.
# Note: unmapped reads are kept; add -F 4 to 'samtools view' to drop them.
samtools view -@ 24 -b raw_mapped.sam | samtools sort -@ 24 - -o sorted_mapped.bam

# Examine how many reads mapped with samtools
samtools flagstat sorted_mapped.bam

# index the new bam file
samtools index sorted_mapped.bam

# Optional: calculate per base coverage with bedtools
#bedtools genomecov -ibam sorted_mapped.bam > coverage.out
# Optional: calculate per contig coverage with gen_input_table.py
#gen_input_table.py  --isbedfiles $fasta coverage.out >  coverage_table.tsv
# This outputs a simple file with two columns, the contig header and the average coverage.

## Process 4: Taxonomic assignment using BLAST and blobtools


- Program: **BLAST**
- Manual: https://www.ncbi.nlm.nih.gov/books/NBK279690/

Using BLAST on the command line works essentially the same as NCBI BLAST on the web, except that we have more control. We can specify more options, such as output formats, and use our own local databases. It is also much more useful for pipelines and workflows because it can be automated; you don't need to open a web page and fill out any forms.

As a quick example for how BLAST works we will use the same 16S_sequence and BLAST it against our genome assembly. Before we begin we will make a database out of our contig assembly. This is done to construct a set of files that BLAST can use to speed up its sequence lookup. In the end it means we have to wait less time for our results.

#### Make a BLAST db from your contig files

The only required inputs are a FASTA file (our contigs), the database type (nucl or prot), and an output name for the new database.


In [None]:
%%bash

genome=submodule02_data/genome.fasta

ref_genomes=wgs-nf/databases/blast_db/reference-genome.fasta

makeblastdb -in $ref_genomes -dbtype nucl -out wgs-nf/databases/blast_db/blast_db

ls wgs-nf/databases/blast_db/

### BLAST genome assembly against the nt database

<p align="center">
  <img src="images/nucleutide-blast-cover.png" width="30%"/>
</p>


We store a local copy of the complete nucleotide database on our server. We will use it to assign a rough taxonomy to every sequence in our assembly, to identify non-target contaminants (like human and other bacteria), and to confirm our species identification from the 16S BLAST. Later we will use the output file as an input to blobtools to visualize this information. blobtools requires a specifically formatted BLAST file, so I provide a script that runs BLAST to the program's specification. We simply provide the script with our contigs file and it completes the task. This is a simple script that is not much different from the example we ran above. It automatically formats a meaningful output name.
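
The tabular BLAST output is easy to post-process. As a minimal sketch, assuming the 12 tab-separated columns requested with `-outfmt` in this tutorial (qseqid through bitscore), the Python below keeps the highest-scoring hit for each contig. The function name and the example rows are hypothetical.

```python
# Column order of the tabular BLAST output used in this tutorial
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(lines):
    """Return the highest-bitscore hit per query sequence."""
    best = {}
    for line in lines:
        if not line.strip():
            continue
        record = dict(zip(BLAST_COLUMNS, line.rstrip("\n").split("\t")))
        # Convert the numeric fields we compare or report
        record["pident"] = float(record["pident"])
        record["bitscore"] = float(record["bitscore"])
        current = best.get(record["qseqid"])
        if current is None or record["bitscore"] > current["bitscore"]:
            best[record["qseqid"]] = record
    return best


# Hypothetical hits: two for contig_1, one for contig_2
example_hits = [
    "contig_1\tREF_A\t99.5\t1000\t5\t0\t1\t1000\t1\t1000\t0.0\t1800",
    "contig_1\tREF_B\t90.0\t800\t80\t0\t1\t800\t1\t800\t1e-100\t900",
    "contig_2\tREF_B\t97.0\t500\t15\t0\t1\t500\t20\t519\t0.0\t850",
]

for contig, hit in best_hits(example_hits).items():
    print(contig, hit["sseqid"], hit["pident"], hit["bitscore"])
```

Blobtools performs its own, more careful taxonomy assignment from these hits, so this sketch is only meant for a quick look at which subject sequence each contig matches best.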

In [None]:
%%bash

genome=submodule02_data/genome.fasta
database=wgs-nf/databases/blast_db/blast_db

blastn \
    -task megablast \
    -query $genome \
    -db $database \
    -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' \
    -culling_limit 5 \
    -max_target_seqs 10 \
    -num_threads 24 \
    -evalue 1e-5 \
    -out genome.vs.blastdb.cul5.maxt10.1e5.megablast.out > /dev/null 2>&1

# view the results
echo -e "qseqid\t\t\t\t\tsseqid\t\tpident\tlength\tmismatch gapopen qstart qend\tsstart\tsend\tevalue\tbitscore"
head genome.vs.blastdb.cul5.maxt10.1e5.megablast.out

## Combine datasets into a blobtools database

- Program: **BlobTools**
- Citation: *Laetsch, D. R., & Blaxter, M. L. (2017). BlobTools: Interrogation of genome assemblies. F1000Research, 6, 1287*
- Manual: https://blobtools.readme.io/docs

Blobtools is a tool to visualize our genome assembly. It is also useful for filtering read and assembly datasets. There are three main inputs to the program: 1.) the genome assembly **FASTA** file (the one we used for BLAST and BWA), 2.) a 'hits' file generated from **BLAST**, and 3.) a SAM or **BAM** file. The main outputs of the program are blobplots, which plot the GC content, coverage, taxonomy, and contig lengths on a single graph.

The first step (blobtools create) in this short pipeline takes all of our input files and creates a lookup table that is used for plotting and constructing tables. This step does the bulk of the work, parsing the BLAST file to assign taxonomy to each of our sequences and parsing the BAM file to calculate coverage information.

After that is complete we will use 'blobtools view' to output all the data into a human readable table. Finally we will use 'blobtools plot' to construct the blobplot visuals.  


In [None]:
%%bash

# BLAST results from previous step
blast_results=genome.vs.blastdb.cul5.maxt10.1e5.megablast.out

# lookup table that provides taxonomic IDs for BLAST DB accessions. This enables conversion of accessions to full taxonomic ranks (kingdom, phylum, etc.)
taxid_lookup=wgs-nf/databases/blast_db/accessions_to_taxids.txt

cat $taxid_lookup

# Add taxids to BLAST results
# blobtools taxify --help
blobtools taxify \
    -f $blast_results \
    -m $taxid_lookup \
    -s 0 \
    -t 1

# -s is the column of sequenceID of subject in taxID mapping file
# -t is the column of TaxID of sequenceID in taxID mapping file
head genome.vs.blastdb.cul5.maxt10.1e5.megablast.taxified.out

In [None]:
%%bash

# input data
genome=submodule02_data/genome.fasta
bam=sorted_mapped.bam
blast_taxified=genome.vs.blastdb.cul5.maxt10.1e5.megablast.taxified.out

# taxonomy database
taxdb=wgs-nf/databases/blast_db/nodesDB.txt

# Create lookup table
blobtools create -i $genome -b $bam -t $blast_taxified -o blob_out --db $taxdb

# Create output table
blobtools view -i blob_out.blobDB.json -r all -o blob_taxonomy

# # view the table, I remove headers with grep -v and view with tabview
# grep -v '##' blob_taxonomy.blob_out.blobDB.table.txt

# Plot the data
blobtools plot -i blob_out.blobDB.json -r genus

## Filter non-target sequences from *de novo* assembly

<p align="center">
  <img src="images/example_blobplot.png" width="50%"/>
</p>

The x-axis on these plots is GC content, and the y-axis is the coverage (log transformed). The size of each 'blob' reflects the length of the contig. Colors represent taxonomic assignment (the -r option lets you choose which rank to view). The concept behind these plots, and ultimately behind assembly filtering, is that each organism has a characteristic GC content. For example, Streptomyces has an average GC content of about 0.72, while other bacteria can go as low as 0.2. In addition, contamination most likely has much lower coverage than the rest of your assembly. Combine that with the taxonomic assignments and you have multiple lines of evidence to identify your non-target contigs. In the plot above you can fairly easily see which contigs we plan to remove.
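
The "multiple lines of evidence" idea can be expressed as a small filter. The sketch below is only an illustration under stated assumptions: the contig records, the genus labels, the GC range, the coverage cutoff, and the rule of requiring at least two lines of evidence are all hypothetical choices, and in practice the values would come from the blobtools table and from inspecting the blobplot.

```python
def evidence_against(contig, target_genus, gc_range, min_coverage):
    """List the reasons a contig looks unlike the target organism."""
    reasons = []
    if contig["genus"] != target_genus:
        reasons.append("taxonomy")
    if not (gc_range[0] <= contig["gc"] <= gc_range[1]):
        reasons.append("gc")
    if contig["coverage"] < min_coverage:
        reasons.append("coverage")
    return reasons


def flag_non_target(contigs, target_genus, gc_range, min_coverage, min_evidence=2):
    """Return {contig name: reasons} for contigs with enough evidence against them."""
    flagged = {}
    for contig in contigs:
        reasons = evidence_against(contig, target_genus, gc_range, min_coverage)
        if len(reasons) >= min_evidence:
            flagged[contig["name"]] = reasons
    return flagged


# Hypothetical per-contig values (GC as a fraction, coverage as mean read depth)
toy_contigs = [
    {"name": "contig_1", "gc": 0.30, "coverage": 95.0, "genus": "TargetGenus"},
    {"name": "contig_2", "gc": 0.31, "coverage": 12.0, "genus": "TargetGenus"},
    {"name": "contig_3", "gc": 0.41, "coverage": 3.0, "genus": "Homo"},
    {"name": "contig_4", "gc": 0.62, "coverage": 88.0, "genus": "OtherGenus"},
]

flagged = flag_non_target(
    toy_contigs, target_genus="TargetGenus", gc_range=(0.25, 0.35), min_coverage=20.0
)
for name, reasons in flagged.items():
    print(name, "flagged because of:", ", ".join(reasons))
```

Requiring two or more lines of evidence keeps a low-coverage contig from the target organism (contig_2 here) in the assembly, while contigs that disagree on both taxonomy and GC content are flagged for removal. Any such thresholds should be chosen by looking at your own blobplot rather than copied from this example.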

## Genome annotation using Bakta

- Program: **Bakta** - Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
- Citation: *Schwengers, O., Jelonek, L., Dieckmann, M. A., Beyvers, S., Blom, J., & Goesmann, A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial genomics, 7(11), 000685.*  
- Manual: https://github.com/oschwengers/bakta
- Annotation GFF3 file format: https://useast.ensembl.org/info/website/upload/gff3.html


Bakta is a tool for fast, taxon-independent annotation of bacterial genomes. Annotation results are exported in **GFF3** and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files that can be submitted with your genome to GenBank. Bakta is an alternative to the computationally demanding NCBI PGAP and the highly customizable Prokka. We use it here for both its lightweight design (i.e. easy installation) and speed (~5 mins).

Bakta annotates ncRNA cis-regulatory regions, oriC/oriV/oriT, and assembly gaps as well as standard feature types: tRNA, tmRNA, rRNA, ncRNA genes, CRISPR, CDS, and pseudogenes. Just like its inspiration (Prokka), Bakta is a workflow that relies on other, more specific annotation tools. We will focus the lesson on the annotation of tRNAs, rRNAs, and protein-coding genes (PCGs). Here is a full list of the tools used by Bakta.

- tRNAscan-SE (2.0.8) https://doi.org/10.1101/614032 http://lowelab.ucsc.edu/tRNAscan-SE
- Aragorn (1.2.38) http://dx.doi.org/10.1093/nar/gkh152 http://130.235.244.92/ARAGORN
- INFERNAL (1.1.4) https://dx.doi.org/10.1093%2Fbioinformatics%2Fbtt509 http://eddylab.org/infernal
- PILER-CR (1.06) https://doi.org/10.1186/1471-2105-8-18 http://www.drive5.com/pilercr
- Pyrodigal (2.1.0) https://doi.org/10.21105/joss.04296 https://github.com/althonos/pyrodigal
- PyHMMER (0.10.0) https://doi.org/10.21105/joss.04296 https://github.com/althonos/pyhmmer
- Diamond (2.0.14) https://doi.org/10.1038/nmeth.3176 https://github.com/bbuchfink/diamond
- Blast+ (2.12.0) https://www.ncbi.nlm.nih.gov/pubmed/2231712 https://blast.ncbi.nlm.nih.gov
- AMRFinderPlus (3.10.23) https://github.com/ncbi/amr
- DeepSig (1.2.5) https://doi.org/10.1093/bioinformatics/btx818

The program is simple to run, but does a lot. After you start the analysis in the next block of code, I suggest watching the following video about genome annotations.

In [None]:
%%bash
# Copy the data from an AWS S3 bucket.

# output location
outdir=wgs-nf/databases/bakta_db/
mkdir -p "$outdir"

# copy from aws S3 bucket
aws s3 cp s3://nh-inbre-genome-sequencing-and-comparative-genomic-analysis/databases/bakta-light/ $outdir --recursive --quiet

In [None]:
%%bash

genome=submodule02_data/genome.fasta
database=wgs-nf/databases/bakta_db/

## setup bakta database (not required)
# wget https://zenodo.org/record/10522951/files/db-light.tar.gz
# tar -xzf db-light.tar.gz
# rm db-light.tar.gz
# mv X wgs-nf/databases/

# Run bakta with default options.
#bakta --help
bakta $genome -o output-bakta/ --db $database --threads 24

## Visualize Dataset with Circos

- Program: **pyCirclize** - a circular visualization Python package built on matplotlib, developed for easily plotting circular figures such as Circos plots and chord diagrams in Python.
- Manual: https://github.com/moshi4/pyCirclize

We can use the Circos class in the pycirclize module to visualize our circular bacterial genome. Using some Python code, we can visualize the genome contiguity, coverage, and content in one plot. This code uses the GFF3 and GBFF files from Bakta to add annotations to the contig assembly. Using the per-base coverage, we can see the coverage across contigs and look for any regions with abnormal coverage.

<p align="center">
  <img src="images/example_genome.png" width="40%"/>
</p>
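
The per-base coverage used for the plot can also be summarized per contig. The following is a minimal sketch that assumes the three-column, tab-separated output of `bedtools genomecov -d` (contig name, 1-based position, depth). The function name and the example lines are hypothetical.

```python
from collections import defaultdict


def mean_coverage_per_contig(lines):
    """Average the per-base depth for each contig."""
    total_depth = defaultdict(int)
    n_positions = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        contig, _position, depth = line.rstrip("\n").split("\t")
        total_depth[contig] += int(depth)
        n_positions[contig] += 1
    return {contig: total_depth[contig] / n_positions[contig] for contig in total_depth}


# Hypothetical per-base coverage lines (contig, position, depth)
example_lines = [
    "contig_1\t1\t10",
    "contig_1\t2\t12",
    "contig_1\t3\t14",
    "contig_2\t1\t0",
    "contig_2\t2\t4",
]

for contig, mean_depth in mean_coverage_per_contig(example_lines).items():
    print(f"{contig}\t{mean_depth:.1f}")
```

Once the per-base coverage file has been written, the same function can read it directly, for example `with open("per_base_coverage.bed") as handle: print(mean_coverage_per_contig(handle))`. Contigs whose mean depth is far below or above the rest are worth a closer look.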


In [None]:
%%bash

# get per base coverage
bedtools genomecov -ibam sorted_mapped.bam -d > per_base_coverage.bed

In [None]:
%%bash

# We can call the script with the --help flag to learn how to run the program
scripts/circos.py --help

<div class="alert alert-block alert-info"><b>Tip</b>: Most programs will have a built in help menu. Call the program with the flag <code>--help</code> or <code>-h</code> to learn how to run unfamiliar programs.</div>

In [None]:
%%bash

# Bakta names its output files after the input file (genome.fasta), unless --prefix is set
scripts/circos.py --species "Enter your species here" --gff output-bakta/genome.gff3 --gbk output-bakta/genome.gbff --bed per_base_coverage.bed

In [None]:
%%bash
amrfinder_update --force_update --database /home/sagemaker-user/Genome-Sequencing-and-Comparative-Genomic-Analysis/databases/bakta_db/amrfinderplus-db

In [None]:
Image(filename='circos_plot.png')

## Conclusion and wrap up

End of Submodule 2. After learning about the general methods of assessing a genome assembly, we ran the tools and identified potential problems with our data.

Submodule 4 will begin with a directory of proteomes. One of the final steps of our workflow was running Bakta to annotate our genome. One of its main outputs is the FAA proteome.

Be sure to shut down your instance. We will run one more bit of code to copy our genome file for use in a later submodule.