# Submodule 2: Assessment of genome assembly and genome annotation
--------
## Overview
In this submodule, you will begin with the genome that you assembled in Submodule 1. The primary goal of this submodule is to assess the quality of the assembled genome through the lens of what we call the "5 Cs": Contiguity, Completeness, Contamination, Coverage, and Content. By utilizing a combination of bioinformatics tools, participants will evaluate the assembled genome and generate outputs that include visualizations, a cleaned genome sequence, and functional annotations. These outputs will be used in Submodule 4.


### Learning Objectives
Through this submodule, users will gain hands-on experience in quality assessment, resulting in a deeper understanding of genomic data integrity and the significance of accurate genome sequences.

- **Understand and Apply the 5 Cs of Genome Quality**:  
  Understand how to assess the overall quality of a genome sequence by examining Contiguity (QUAST), Completeness (BUSCO), Contamination (BLAST/BlobTools), Coverage (BWA/Samtools), and Content (Prokka gene annotations).

- **Generate and Interpret Visualizations**:  
  Gain proficiency in using bioinformatics tools and foster skills in data analysis and interpretation.

- **Relate the Central Dogma of Molecular Biology to Genome Annotation**:  
  Connect the principles of the central dogma (DNA → RNA → Protein) to the process of genome annotation, understanding how gene annotations contribute to functional genomics and biological interpretations.

- **Produce a Clean and Annotated Genome**:  
  Participants will refine the genome based on their assessments, ensuring a high-quality, annotated genome that can be used for further analysis or research applications.

## **Install required software**

Several additional tools are required for Submodule 2: quast, busco, bwa, samtools, blast, blobtools, and prokka. As with Submodule 1, the tools are preinstalled in a Docker image, but we will demonstrate how to install them using __[Conda](https://docs.conda.io/en/latest/)__.

Each piece of software, along with links to publications and documentation, will be described in turn. Below is a brief summary of these tools.

### List of software
| **Tool**       | **Description**                                                                                                                                                           |
|:---------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **QUAST**      | Used for evaluating and reporting the quality of genome assemblies by comparing them against reference genomes or generating statistical summaries.                          |
| **BUSCO**      | Utilized for assessing genome completeness by searching for conserved single-copy orthologs from specific lineage datasets.                                                 |
| **BWA**        | A fast and memory-efficient tool for aligning sequence reads to large reference genomes, commonly used in variant calling pipelines.                                         |
| **Samtools**   | Used for manipulating and processing sequence alignments stored in SAM/BAM format. Essential for sorting, indexing, and viewing alignment files.                            |
| **BLAST**      | A widely used tool for comparing an input sequence to a database of sequences, identifying regions of local similarity and aiding in functional annotation.                  |
| **BlobTools**  | A versatile tool for visualizing and analyzing genome assemblies, helping to identify contamination or misassembled regions by correlating sequence features with taxonomy.   |
| **Prokka**     | Used for rapid annotation of prokaryotic genomes, identifying genes, coding sequences, rRNAs, tRNAs, and other genomic features.                                             |

In [1]:
%%bash

# The tools are preinstalled in the Docker image, so this cell only prints the
# mamba (a conda alternative) command that would install them with pinned versions

echo "mamba install --channel bioconda \
    quast=5.2.0 \
    busco=5.4.6 \
    bwa=0.7.18 \
    samtools=1.18 \
    blast=2.15.0 \
    blobtools=1.0.1 \
    prokka=1.14.6 \
    -y > /dev/null 2>&1"

echo "Installation of quast, busco, bwa, samtools, blast, blobtools, and prokka complete."

mamba install --channel bioconda     quast=5.2.0     busco=5.4.6     bwa=0.7.18     samtools=1.18     blast=2.15.0     blobtools=1.0.1     prokka=1.14.6     -y > /dev/null 2>&1
Installation of quast, busco, bwa, samtools, blast, blobtools, and prokka complete.


## Starting Data

This submodule starts with the **genome FASTA** file, which is the primary input for all programs. We will define it as the variable *genome* and reuse that variable for the remainder of the workflow, so the starting data can easily be changed if you want to run this tutorial with your own data. Because each `%%bash` cell runs in its own shell, the variable is redefined at the top of every cell that uses it.

We will also need the original reads from Submodule 1 for the read mapping step. We will use BWA to map the reads to the assembly so that we can calculate sequencing coverage; it requires **paired-end sequencing reads in FASTQ format**. We will define these here as *forward* for the R1 sequencing reads and *reverse* for the R2 sequencing reads.

In [2]:
%%bash

# starting genome from submodule 1
genome=assembled-genome/contigs.fasta

# raw reads from submodule 1
forward=raw-reads/SRR10056829_1.fastq.gz
reverse=raw-reads/SRR10056829_2.fastq.gz

## Process 1: **Contiguity** assessment using QUAST
- Program: **QUAST (Quality Assessment Tool for Genome Assemblies)**
- Citation: *Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072-1075.*
- Manual: https://github.com/ablab/quast

QUAST is a tool used to evaluate and compare the quality of genome assemblies by providing metrics such as N50, number of contigs, genome length, and misassemblies. Did you get one contig representing your entire genome? Or did you get thousands of contigs representing a highly fragmented genome?

QUAST has many functionalities that we will not explore in this tutorial. I encourage you to explore them; for now we are going to use it in its simplest form. The program parses the genome FASTA file and records statistics about each contig, such as its length and GC content. This type of information is something you would typically report in a publication or use to compare different assemblers or assembly options.

The **input** to the program is the genome assembly **FASTA**, and the outputs are various tables and an HTML/PDF report you can export and view.

In [3]:
%%bash

genome=assembled-genome/contigs.fasta

# run quast on the genome assembly
quast.py $genome

/opt/conda/bin/quast.py assembled-genome/contigs.fasta

Version: 5.2.0

System information:
  OS: Linux-5.10.230-223.885.amzn2.x86_64-x86_64-with-glibc2.35 (linux_64)
  Python version: 3.10.14
  CPUs number: 2

Started: 2025-01-03 16:06:45

Logging to /home/sagemaker-user/Genome-Sequencing-and-Comparative-Genomic-Analysis/quast_results/results_2025_01_03_16_06_45/quast.log
NOTICE: Maximum number of threads is set to 1 (use --threads option to set it manually)

CWD: /home/sagemaker-user/Genome-Sequencing-and-Comparative-Genomic-Analysis
Main parameters: 
  MODE: default, threads: 1, min contig length: 500, min alignment length: 65, min alignment IDY: 95.0, \
  ambiguity: one, min local misassembly length: 200, min extensive misassembly length: 1000

Contigs:
  Pre-processing...
  assembled-genome/contigs.fasta ==> contigs

2025-01-03 16:06:49
Running Basic statistics processor...
  Contig files: 
    contigs
  Calculating N50 and L50...
    contigs, N50 = 180742, L50 = 3, auN = 228778.9
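
To make the N50, L50, and auN values reported by QUAST less abstract, here is a minimal Python sketch of how these statistics can be computed from a list of contig lengths. The function name `assembly_stats` and the toy lengths are illustrative assumptions, not part of QUAST, and by default QUAST only counts contigs of at least 500 bp, so results on a real assembly may differ slightly.

```python
def assembly_stats(lengths):
    """Return (N50, L50, auN) for a list of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)

    # Walk from the longest contig down until half of the assembly is covered.
    running = 0
    n50 = l50 = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 2 >= total:
            n50, l50 = length, count
            break

    # auN is the length-weighted mean contig length: sum(L^2) / sum(L).
    aun = sum(length * length for length in ordered) / total
    return n50, l50, aun


# Toy contig lengths (hypothetical, not the tutorial assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50, aun = assembly_stats(toy_lengths)
print(n50, l50, round(aun, 1))
```

For the toy data, the two longest contigs (80 + 70 = 150) already cover half of the 300 bp total, so N50 is 70 and L50 is 2.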

## Process 2: **Completeness** assessment using BUSCO

- Program: **BUSCO - Benchmarking Universal Single-Copy Orthologs**
- Citation: *Seppey, M., Manni, M., & Zdobnov, E. M. (2019). BUSCO: assessing genome assembly and annotation completeness. Gene prediction: methods and protocols, 227-245.*  
- Manual: https://busco.ezlab.org/

BUSCO assesses the completeness of a genome assembly by counting how many near-universal genes it contains. The program uses the OrthoDB sets of single-copy orthologs that are found in at least 90% of the organisms in a given group. There are different datasets for various taxonomic groups (Eukaryotes, Metazoa, Bacteria, Gammaproteobacteria, etc.). The idea is that a newly sequenced genome *should* contain most of these highly conserved genes. If your genome is missing a large portion of these single-copy orthologs, it may indicate that your assembly is not complete.


<p align="center">
  <img src="images/busco_sampling.png" width="40%"/>
</p>


The input to the program is your genome assembly (contigs) as well as a selection of which database to use. The output is a directory with a short summary of the results, a full table with the coordinates of each orthologous gene found in your assembly, and a directory with the nucleotide and amino acid sequences of all the identified genes.

We will focus on the main summary output as a simple QC assessment of our assembly. The other outputs provided by BUSCO have many uses, however, such as phylogenomics and gene prediction.

In [4]:
%%bash

# View available sets
busco --list-datasets

2025-01-03 16:07:28 INFO:	Downloading information on latest versions of BUSCO data...
2025-01-03 16:07:30 INFO:	Downloading file 'https://busco-data.ezlab.org/v5/data/information/lineages_list.2024-12-05.txt.tar.gz'
2025-01-03 16:07:31 INFO:	Decompressing file '/home/sagemaker-user/Genome-Sequencing-and-Comparative-Genomic-Analysis/busco_downloads/information/lineages_list.2024-12-05.txt.tar.gz'
################################################

Datasets available to be used with BUSCO v5.8.1 and above:

- archaea_odb12
    - euryarchaeota_odb12
        - methanomicrobia_odb12
            - methanosarcinaceae_odb12
            - methanosarcina_odb12
            - methanomicrobiales_odb12
               - methanomicrobiaceae_odb12
        - halobacteria_odb12
            - halobacteriales_odb12
                - halobacteriaceae_odb12
                - haloarculaceae_odb12
                    - haloarcula_odb12
            - natrialbaceae_odb12
                - natrinema_odb12
         

In [7]:
%%bash

genome=assembled-genome/contigs.fasta

# run BUSCO
busco -i $genome -m genome -o output-busco -l bacteria --cpu 24

2025-01-03 16:09:27 INFO:	***** Start a BUSCO v5.4.6 analysis, current time: 01/03/2025 16:09:27 *****
2025-01-03 16:09:27 INFO:	Configuring BUSCO with local environment
2025-01-03 16:09:27 INFO:	Mode is genome
2025-01-03 16:09:27 INFO:	Downloading information on latest versions of BUSCO data...
2025-01-03 16:09:29 INFO:	Input file is /home/sagemaker-user/Genome-Sequencing-and-Comparative-Genomic-Analysis/assembled-genome/contigs.fasta
2025-01-03 16:09:29 INFO:	Downloading file 'https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2024-01-08.tar.gz'
2025-01-03 16:09:31 INFO:	Decompressing file '/home/sagemaker-user/Genome-Sequencing-and-Comparative-Genomic-Analysis/busco_downloads/lineages/bacteria_odb10.tar.gz'
2025-01-03 16:09:33 INFO:	Running BUSCO using lineage dataset bacteria_odb10 (prokaryota, 2024-01-08)
2025-01-03 16:09:33 INFO:	Running 1 job(s) on bbtools, starting at 01/03/2025 16:09:33
2025-01-03 16:09:35 INFO:	[bbtools]	1 of 1 task(s) completed
2025-01-03 16:09:35 

### Examine the BUSCO output.
The first file we will look at is the 'short_summary_busco_output.txt'. This file summarizes the main findings: how many of the expected genes did we find? The summary breaks the report into four main categories: **complete single-copy genes, complete duplicated genes, fragmented genes, and missing genes**. 

We hope that the majority of our genes will be found as 'complete single-copy'. Duplicated genes could indicate that a particular gene underwent a duplication event, or that a misassembly left us with two copies of a region of our genome. Fragmented genes are an artifact of an imperfect assembly: some of our genome is fragmented into multiple contigs, so some of our genes will be fragmented as well. This is why it is important to inspect the N50 of the genome with QUAST. We want the majority of our contigs to be at least as long as a gene; if they are not, we will have many fragmented genes as a result.

Next we will view the 'full_table_busco_output.tsv' file. This file shows the coordinates of all the single-copy genes identified in our genome. It also provides the status of each ortholog (complete, duplicated, fragmented, or missing). This TSV file can be exported and viewed in Excel.

The final files we will examine are in a directory called 'single_copy_busco_sequences/'. This houses the nucleotide and amino acid sequences of the identified genes. It is a rich source for comparative genomics and other sorts of analyses.
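
The four categories described above can also be tallied directly from the full table. The following is a minimal Python sketch, assuming the tab-separated layout shown in the table header (BUSCO id in the first column, status in the second). The function name `tally_busco` and the example rows are hypothetical and are not real BUSCO results.

```python
from collections import Counter


def tally_busco(lines):
    """Count unique BUSCO ids per status from full-table lines."""
    status_by_id = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blanks
        fields = line.rstrip("\n").split("\t")
        # A duplicated BUSCO appears on several rows, so key on the id.
        status_by_id[fields[0]] = fields[1]
    return Counter(status_by_id.values())


# Hypothetical rows in the same layout as the full table.
example = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength",
    "1at2\tComplete\tNODE_1\t10\t500\t+\t300.0\t160",
    "2at2\tDuplicated\tNODE_1\t600\t900\t+\t200.0\t100",
    "2at2\tDuplicated\tNODE_2\t50\t350\t-\t198.0\t100",
    "3at2\tFragmented\tNODE_3\t5\t120\t+\t40.0\t38",
    "4at2\tMissing",
]

counts = tally_busco(example)
total = sum(counts.values())
for status in ("Complete", "Duplicated", "Fragmented", "Missing"):
    print(f"{status}: {counts[status]} ({100 * counts[status] / total:.1f}%)")
```

To run it on real data, pass an open file handle instead of the example list, for instance the `full_table.tsv` file used in the next cell.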

In [15]:
%%bash

# examine the table: print the column header line
echo "Header:"
grep '# Busco id' output-busco/run_bacteria_odb10/full_table.tsv
echo '#############################################'

# see the categories of genes
echo "Fragmented genes:"
awk -F'\t' '$2 == "Fragmented"' output-busco/run_bacteria_odb10/full_table.tsv

echo '#############################################'
echo "Missing genes:"
awk -F'\t' '$2 == "Missing"' output-busco/run_bacteria_odb10/full_table.tsv


echo '#############################################'
echo "Duplicate genes:"
# BUSCO labels this status "Duplicated" in the full table
awk -F'\t' '$2 == "Duplicated"' output-busco/run_bacteria_odb10/full_table.tsv


Header:
# Busco id	Status	Sequence	Gene Start	Gene End	Strand	Score	Length	OrthoDB url	Description
#############################################
Fragmented genes:
841869at2	Fragmented	NODE_2_length_184940_cov_88.881445	72417	73193	-	82.2	144	https://v10-1.orthodb.org/?query=841869at2	UDP-N-acetylenolpyruvoylglucosamine reductase
1822215at2	Fragmented	NODE_10_length_47326_cov_75.539443	38096	38521	-	130.2	108	https://v10-1.orthodb.org/?query=1822215at2	Ribosomal protein L13
1842956at2	Fragmented	NODE_4_length_104950_cov_85.510082	63946	64599	+	24.7	43	https://v10-1.orthodb.org/?query=1842956at2	Biotin--acetyl-CoA-carboxylase ligase
1990141at2	Fragmented	NODE_1_length_512707_cov_68.489338	117764	118036	+	95.8	63	https://v10-1.orthodb.org/?query=1990141at2	Ribosomal protein S15
2040741at2	Fragmented	NODE_2_length_184940_cov_88.881445	59294	59527	+	38.1	42	https://v10-1.orthodb.org/?query=2040741at2	Ribosomal protein L24
2063644at2	Fragmented	NODE_2_length_184940_cov_88.881445	58480	58665	

## Process 3: **Coverage** assessment using BWA

- Program: **BWA** and **Samtools**
- Citations: *Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324*, *Li, H., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352*
- BWA manual: https://bio-bwa.sourceforge.net/bwa.shtml
- SAMtools manual: http://www.htslib.org/doc/samtools-1.2.html
- SAM format specifications: https://samtools.github.io/hts-specs/SAMv1.pdf

Read mapping refers to the process of aligning short reads to a reference sequence. This reference can be a complete genome, a transcriptome, or, in our case, a de novo assembly. Read mapping is fundamental to many commonly used pipelines, such as differential expression or SNP analysis. We will be using it to calculate the average coverage of each of our contigs and the overall coverage of our genome (a requirement for GenBank submission). The main output of read mapping is a **Sequence Alignment/Map (SAM)** file. The file records where each sequencing read matches our assembly and how it maps. There are hundreds of programs that use SAM files as a primary input. A BAM file is the binary version of a SAM file, and the two can be converted very easily using samtools.

Many programs perform read mapping, and the recommended program depends on what you are trying to do. My favorite is 'BWA mem', which balances performance and accuracy well. The input to the program is a reference assembly and the reads to map (forward and reverse). The output is a SAM file. By default BWA writes the SAM output to standard output, so I redirect it to a file. There are lots of options; please see the manual to understand the ones I am using.
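
Much of the "how it maps" information in a SAM record is packed into the FLAG field (the second column) as a sum of bit values defined in the SAM specification. As a small illustration, the Python sketch below decodes a FLAG into readable labels; the label names and the `decode_flag` helper are my own shorthand, not samtools output.

```python
# Bit values from the SAM specification; the label text is informal shorthand.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of every bit that is set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64: a properly paired first read whose mate is on the reverse strand.
print(decode_flag(99))
# 147 = 1 + 2 + 16 + 128: the matching second read, itself on the reverse strand.
print(decode_flag(147))
```

These are the same categories that `samtools flagstat` counts, for example "properly paired", "read1", and "supplementary".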

In [16]:
%%bash

# starting genome from submodule 1
genome=assembled-genome/contigs.fasta

# raw reads from submodule 1
forward=raw-reads/SRR10056829_1.fastq.gz
reverse=raw-reads/SRR10056829_2.fastq.gz

# Step 1: Index your reference genome. This is a requirement before read mapping.
bwa index $genome
# Step 2: Map the reads and construct a SAM file.
bwa mem -t 24 $genome $forward $reverse > raw_mapped.sam
# view the file with less, note that to see the data you have to scroll down past all the headers (@SQ).
head -n 200 raw_mapped.sam | less -S

[bwa_index] Pack FASTA... 0.01 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.40 seconds elapse.
[bwa_index] Update BWT... 0.01 sec
[bwa_index] Pack forward-only FASTA... 0.01 sec
[bwa_index] Construct SA from BWT and Occ... 0.24 sec
[main] Version: 0.7.18-r1243-dirty
[main] CMD: bwa index assembled-genome/contigs.fasta
[main] Real time: 0.927 sec; CPU: 0.683 sec
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2400000 sequences (240000000 bp)...
[M::process] read 782560 sequences (78256000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (35, 1187092, 2, 39)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (98, 120, 203)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 413)
[M::mem_pestat] mean and std.dev: (148.31, 75.23)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 518)
[M::mem_pestat] analyzing insert size distribu

@SQ	SN:NODE_1_length_512707_cov_68.489338	LN:512707
@SQ	SN:NODE_2_length_184940_cov_88.881445	LN:184940
@SQ	SN:NODE_3_length_180742_cov_76.756479	LN:180742
@SQ	SN:NODE_4_length_104950_cov_85.510082	LN:104950
@SQ	SN:NODE_5_length_103244_cov_79.882449	LN:103244
@SQ	SN:NODE_6_length_97797_cov_68.419922	LN:97797
@SQ	SN:NODE_7_length_89725_cov_76.544173	LN:89725
@SQ	SN:NODE_8_length_82897_cov_77.232382	LN:82897
@SQ	SN:NODE_9_length_60299_cov_72.795349	LN:60299
@SQ	SN:NODE_10_length_47326_cov_75.539443	LN:47326
@SQ	SN:NODE_11_length_45856_cov_136.897164	LN:45856
@SQ	SN:NODE_12_length_29079_cov_92.851640	LN:29079
@SQ	SN:NODE_13_length_25629_cov_151.390162	LN:25629
@SQ	SN:NODE_14_length_24523_cov_77.475764	LN:24523
@SQ	SN:NODE_15_length_23101_cov_86.272846	LN:23101
@SQ	SN:NODE_16_length_21883_cov_60.060808	LN:21883
@SQ	SN:NODE_17_length_13932_cov_88.718743	LN:13932
@SQ	SN:NODE_18_length_12703_cov_68.384567	LN:12703
@SQ	SN:NODE_19_length_7107_cov_55.165060	LN:7107
@SQ	SN:NODE_20_length_5811_cov

In [17]:
%%bash

# Convert the SAM to BAM and sort it by coordinate.
# Note: this keeps unmapped reads; add -F 4 to 'samtools view' to drop them.
samtools view -@ 24 -Sb  raw_mapped.sam  | samtools sort -@ 24 - -o sorted_mapped.bam

# Examine how many reads mapped with samtools
samtools flagstat sorted_mapped.bam

# Index the sorted BAM file
samtools index sorted_mapped.bam

# Optional: calculate coverage with bedtools
#bedtools genomecov -ibam sorted_mapped.bam > coverage.out
# Optional: calculate per contig coverage with gen_input_table.py
#gen_input_table.py  --isbedfiles $fasta coverage.out >  coverage_table.tsv
# This outputs a simple file with two columns, the contig header and the average coverage.

[bam_sort_core] merging from 0 files and 24 in-memory blocks...


3182761 + 0 in total (QC-passed reads + QC-failed reads)
3182560 + 0 primary
0 + 0 secondary
201 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
3177280 + 0 mapped (99.83% : N/A)
3177079 + 0 primary mapped (99.83% : N/A)
3182560 + 0 paired in sequencing
1591280 + 0 read1
1591280 + 0 read2
3155510 + 0 properly paired (99.15% : N/A)
3173626 + 0 with itself and mate mapped
3453 + 0 singletons (0.11% : N/A)
11772 + 0 with mate mapped to a different chr
11528 + 0 with mate mapped to a different chr (mapQ>=5)
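
As a rough sanity check on these numbers, average coverage can be estimated as (number of mapped reads × read length) / genome length. The Python sketch below applies that formula to the primary mapped read count from flagstat and the 100 bp read length implied by the BWA log. The genome length is a placeholder assumption (the total assembly size is not shown in this section), so substitute the total length that QUAST reports for your assembly.

```python
def expected_coverage(n_reads, read_length, genome_length):
    """Estimate average depth as total sequenced bases / genome length."""
    return n_reads * read_length / genome_length


primary_mapped = 3177079           # "primary mapped" line in the flagstat output
read_length = 100                  # BWA log: 240000000 bp in 2400000 sequences
assumed_genome_length = 4_000_000  # hypothetical value, replace with your assembly size

estimate = expected_coverage(primary_mapped, read_length, assumed_genome_length)
print(f"Estimated average coverage: {estimate:.1f}x")
```

This is only an approximation: soft-clipped bases and uneven coverage across contigs mean that a per-base calculation from the BAM file will give a somewhat different value.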


## Process 4: **Contamination** assessment using BLAST and BlobTools

- Manual: https://www.ncbi.nlm.nih.gov/books/NBK279690/

Command-line BLAST works essentially the same as NCBI BLAST on the web, except that we have more control. We can specify more options, such as output formats, and we can use our own local databases. It is also much more useful for pipelines and workflows because it can be automated: you don't need to open a web page and fill out any forms.

As a quick example of how BLAST works, we will use the same 16S_sequence and BLAST it against our genome assembly. Before we begin, we will make a database out of our contig assembly. This constructs a set of files that BLAST uses to speed up its sequence lookup, which means we wait less time for our results.

#### Make a BLAST db from your contig files

The only required inputs are a FASTA file (our contigs), the database type (nucl or prot), and an output name for the new database.


In [19]:
%%bash

genome=assembled-genome/contigs.fasta

makeblastdb -in $genome -dbtype nucl -out contigs_db 



Building a new DB, current time: 01/03/2025 16:37:27
New DB name:   /home/sagemaker-user/Genome-Sequencing-and-Comparative-Genomic-Analysis/contigs_db
New DB title:  assembled-genome/contigs.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 127 sequences in 0.022356 seconds.




### BLAST genome assembly against the nt database

<p align="center">
  <img src="images/nucleutide-blast-cover.png" width="30%"/>
</p>


We store a local copy of the complete nucleotide database on our server. We will use it to assign a rough taxonomy to every sequence in our assembly, to identify non-target contaminants (like human and other bacteria), and to confirm our species identification from the 16S BLAST. Later we will use the output file as an input to blobtools to visualize this information. blobtools requires a specifically formatted BLAST file, so the command below sets the output format to the program's specification. We simply provide our contigs file and it completes the task. The command is not much different from the example we ran above, and it builds a meaningful output name from the settings used.

In [20]:
%%bash

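genome=assembled-genome/contigs.fasta
database=data/ncbi_nt/refseq
outname=blast_nt

# The first three output columns (qseqid, staxids, bitscore) are what blobtools reads.
# The trailing '&' runs BLAST in the background; let it finish before running blobtools.
blastn \
    -task megablast \
    -query "$genome" \
    -db "$database" \
    -outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
    -culling_limit 5 \
    -max_target_seqs 10 \
    -num_threads 24 \
    -evalue 1e-5 \
    -out "$outname.vs.nt.cul5.maxt10.1e5.megablast.out" &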


BLAST Database error: No alias or index file found for nucleotide database [data/ncbi_nt/nt] in search path [/home/sagemaker-user/Genome-Sequencing-and-Comparative-Genomic-Analysis::]
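
Once the BLAST search completes successfully, the tabular output defined by the `-outfmt` string has 18 tab-separated columns: the three leading fields, the 12 standard (`std`) fields, and the three trailing taxonomy/title fields. The Python sketch below shows one way to read such rows and keep the highest-scoring hit per contig. It is an illustration only: blobtools applies its own taxonomy rules, the `best_hits` helper is hypothetical, and the example rows (contig names, taxa, scores) are made up.

```python
# The 12 fields that 'std' expands to in BLAST tabular output.
STD = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
       "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
# Prefix the std fields so they do not collide with the leading qseqid/bitscore.
COLUMNS = (["qseqid", "staxids", "bitscore"]
           + [f"std_{name}" for name in STD]
           + ["sscinames", "sskingdoms", "stitle"])


def best_hits(lines):
    """Return {contig: (bitscore, scientific name, taxid)} for the top hit."""
    best = {}
    for line in lines:
        if not line.strip():
            continue
        row = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
        score = float(row["bitscore"])
        contig = row["qseqid"]
        if contig not in best or score > best[contig][0]:
            best[contig] = (score, row["sscinames"], row["staxids"])
    return best


def make_row(contig, taxid, bitscore, name):
    """Build one made-up 18-column row for demonstration purposes."""
    std = [contig, "subject1", "99.0", "1000", "10", "0",
           "1", "1000", "1", "1000", "0.0", str(bitscore)]
    return "\t".join([contig, taxid, str(bitscore)] + std
                     + [name, "Bacteria", "illustrative title"])


rows = [
    make_row("NODE_1", "1883", 1500.0, "Streptomyces sp."),
    make_row("NODE_1", "1883", 900.0, "Streptomyces sp."),
    make_row("NODE_2", "9606", 400.0, "Homo sapiens"),
]
for contig, (score, name, taxid) in best_hits(rows).items():
    print(contig, name, taxid, score)
```

A contig whose best hit is an unexpected organism is a candidate contaminant, which is exactly what the blobplots in the next step help to confirm.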


## Combine datasets into a blobtools database

- Program: **BlobTools**
- Citation: *Laetsch, D. R., & Blaxter, M. L. (2017). BlobTools: Interrogation of genome assemblies. F1000Research, 6, 1287*
- Manual: https://blobtools.readme.io/docs

Blobtools is a tool to visualize our genome assembly. It is also useful for filtering read and assembly data sets. There are three main inputs to the program: 1.) the genome assembly **FASTA** file (the one we used for BLAST and BWA), 2.) a 'hits' file generated by **BLAST**, and 3.) a SAM or **BAM** file. The main outputs of the program are blobplots, which plot the GC content, coverage, taxonomy, and contig lengths on a single graph.  

The first step (blobtools create) in this short pipeline takes all of our input files and creates a lookup table that is used for plotting and constructing tables. This step does the bulk of the work: it parses the BLAST file to assign taxonomy to each of our sequences and parses the BAM file to calculate coverage information.  

After that is complete we will use 'blobtools view' to output all the data into a human readable table. Finally we will use 'blobtools plot' to construct the blobplot visuals.  


In [None]:
%%bash

genome=assembled-genome/contigs.fasta
# BLAST hits file written by the blastn step above
hits=blast_nt.vs.nt.cul5.maxt10.1e5.megablast.out

# Create lookup table
blobtools create --help
blobtools create -i "$genome" -b sorted_mapped.bam -t "$hits" -o blob_out

# Create output table
blobtools view --help
blobtools view -i blob_out.blobDB.json -r all -o blob_taxonomy

# View the table; grep -v removes the '##' header lines
grep -v '##' blob_taxonomy.blob_out.blobDB.table.txt

# Plot the data
blobtools plot --help
blobtools plot -i blob_out.blobDB.json -r genus

## Filter non-target sequences from *de novo* assembly

<p align="center">
  <img src="images/example_blobplot.png" width="50%"/>
</p>

The x-axis on these plots is GC content, and the y-axis is the coverage (log transformed). The size of each 'blob' is the length of the contig. Colors represent taxonomic assignment (the -r option lets you choose which rank to view). The concept behind these plots, and ultimately behind assembly filtering, is that each organism has a characteristic GC content. For example, Streptomyces has an average GC content of about 0.72, while other bacteria can go as low as 0.2. In addition, contamination most likely has much lower coverage than the rest of your assembly. Combine that with the taxonomic assignments and you have multiple lines of evidence to identify your non-target contigs. In the plot above you can fairly easily see which contigs we plan to remove.
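
To illustrate the filtering logic, the Python sketch below computes GC content for each contig, reads the coverage value embedded in SPAdes-style contig names (such as `NODE_1_length_512707_cov_68.489338`), and flags contigs that fall outside chosen limits. The thresholds, helper names, and toy sequences are all assumptions for demonstration. Note also that the `cov` value in a SPAdes header is a k-mer coverage, not the read coverage that blobtools derives from the BAM file.

```python
import re


def gc_fraction(sequence):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def spades_coverage(header):
    """Extract the coverage value from a SPAdes-style contig name, if present."""
    match = re.search(r"_cov_([0-9.]+)", header)
    return float(match.group(1)) if match else None


def flag_contigs(contigs, gc_range, min_cov):
    """Return the names of contigs with unusual GC content or low coverage."""
    flagged = []
    for header, seq in contigs.items():
        gc = gc_fraction(seq)
        cov = spades_coverage(header)
        gc_outlier = not (gc_range[0] <= gc <= gc_range[1])
        low_cov = cov is not None and cov < min_cov
        if gc_outlier or low_cov:
            flagged.append(header)
    return flagged


# Toy contigs (hypothetical): one typical, one GC outlier, one low-coverage.
toy_contigs = {
    "NODE_1_length_10_cov_70.5": "GGCCGGCCAT",
    "NODE_2_length_10_cov_65.0": "ATATATATGC",
    "NODE_3_length_10_cov_3.0": "GGCCGGCCTA",
}
print(flag_contigs(toy_contigs, gc_range=(0.6, 0.9), min_cov=10))
```

In practice you would confirm each flagged contig against its taxonomic assignment before removing it, since plasmids and repeats can also show unusual GC content or coverage.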

## Genome annotation using Bakta

- Program: **Bakta** - Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
- Citation: *Schwengers, O., Jelonek, L., Dieckmann, M. A., Beyvers, S., Blom, J., & Goesmann, A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial genomics, 7(11), 000685.*
- Manual: https://github.com/oschwengers/bakta
- Annotation GFF3 file format: https://useast.ensembl.org/info/website/upload/gff3.html


Bakta is a tool for fast, taxon-independent annotation of bacterial genomes. Annotation results are exported in **GFF3** and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files that can be submitted with your genome to GenBank. Bakta is an alternative to the computationally demanding NCBI PGAP and the highly customizable Prokka. We use it here for both its lightweight design (i.e. easy installation) and its speed (~5 mins).

Bakta annotates ncRNA cis-regulatory regions, oriC/oriV/oriT, and assembly gaps, as well as the standard feature types: tRNA, tmRNA, rRNA, ncRNA genes, CRISPR, CDS, and pseudogenes. Just like its inspiration (Prokka), Bakta is a workflow that relies on other, more specific annotation tools. We will focus the lesson on the annotation of tRNAs, rRNAs, and protein-coding genes (PCGs). Here is a full list of the tools used by Bakta. 

- tRNAscan-SE (2.0.8) https://doi.org/10.1101/614032 http://lowelab.ucsc.edu/tRNAscan-SE
- Aragorn (1.2.38) http://dx.doi.org/10.1093/nar/gkh152 http://130.235.244.92/ARAGORN
- INFERNAL (1.1.4) https://dx.doi.org/10.1093%2Fbioinformatics%2Fbtt509 http://eddylab.org/infernal
- PILER-CR (1.06) https://doi.org/10.1186/1471-2105-8-18 http://www.drive5.com/pilercr
- Pyrodigal (2.1.0) https://doi.org/10.21105/joss.04296 https://github.com/althonos/pyrodigal
- PyHMMER (0.10.0) https://doi.org/10.21105/joss.04296 https://github.com/althonos/pyhmmer
- Diamond (2.0.14) https://doi.org/10.1038/nmeth.3176 https://github.com/bbuchfink/diamond
- Blast+ (2.12.0) https://www.ncbi.nlm.nih.gov/pubmed/2231712 https://blast.ncbi.nlm.nih.gov
- AMRFinderPlus (3.10.23) https://github.com/ncbi/amr
- DeepSig (1.2.5) https://doi.org/10.1093/bioinformatics/btx818
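
Annotation of protein-coding genes rests on the central dogma: a coding sequence in the DNA is transcribed to mRNA and translated codon by codon into protein. The Python sketch below walks through those steps for a toy gene using the standard genetic code, whose codon assignments are shared by the bacterial translation table. The function names and the 12-base sequence are illustrative assumptions; real gene callers also handle alternative start codons, both strands, and all reading frames.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G; '*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def transcribe(coding_strand):
    """DNA coding strand -> mRNA (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(cds):
    """Translate a DNA coding sequence until the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3].upper(), "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


toy_gene = "ATGGCTAAATGA"  # hypothetical CDS: Met-Ala-Lys-stop
print(transcribe(toy_gene))
print(translate(toy_gene))

# A gene on the minus strand must be reverse-complemented before translation.
minus_strand_view = reverse_complement(toy_gene)
print(translate(reverse_complement(minus_strand_view)))
```

This is why each CDS in an annotation file carries a strand as well as start and end coordinates: the same stretch of DNA encodes different proteins depending on the strand and frame it is read in.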

The program is simple to run, but does a lot. After you start the analysis in the next block of code, I suggest watching the following video about genome annotations.

In [None]:
%%bash

genome=assembled-genome/contigs.fasta

# Run bakta with default options.
#bakta --help
bakta $genome -o output-bakta/ --db /home/sagemaker-user/Genome-Sequencing-and-Comparative-Genomic-Analysis/databases/db-light --threads 2
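
Once the annotation finishes, a quick way to summarize the genome content is to count the feature types in the GFF3 file. GFF3 is a nine-column, tab-separated format in which the third column holds the feature type. The Python sketch below is a minimal example under that assumption; `count_feature_types` is a hypothetical helper and the example lines are invented rather than real Bakta output.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 features by type (column 3), ignoring comments and FASTA."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an optional embedded sequence section follows this marker
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Invented GFF3 lines for demonstration only.
example_gff = [
    "##gff-version 3",
    "contig_1\tExample\tCDS\t100\t400\t.\t+\t0\tID=gene_1;product=hypothetical protein",
    "contig_1\tExample\tCDS\t500\t900\t.\t-\t0\tID=gene_2;product=hypothetical protein",
    "contig_1\tExample\ttRNA\t1000\t1075\t.\t+\t.\tID=trna_1;product=tRNA-Ala",
    "contig_2\tExample\trRNA\t10\t1550\t.\t+\t.\tID=rrna_1;product=16S ribosomal RNA",
    "##FASTA",
    ">contig_1",
    "ACGT",
]
for feature_type, n in sorted(count_feature_types(example_gff).items()):
    print(feature_type, n)
```

To summarize the real annotation, open the GFF3 file written to the output directory and pass the file handle to the function instead of the example list.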

## Visualize Dataset with Circos

- Program: **Circos** - pyCirclize is a circular visualization Python package built on matplotlib. It was developed to make it easy to plot attractive circular figures, such as Circos plots and chord diagrams, in Python.
- Manual: https://github.com/moshi4/pyCirclize

We can use the Circos class in the pycirclize module to visualize our circular bacterial genome. Using some Python code, we can visualize the genome contiguity, coverage, and content in one plot. This code uses the GFF3 and GBFF files from Bakta to add annotations to the contig assembly. Using the per-base coverage, we can see the coverage across contigs and look for any regions with abnormal coverage.

<p align="center">
  <img src="images/example_genome.png" width="40%"/>
</p>


In [None]:
%%bash

# get per base coverage
bedtools genomecov -ibam sorted_mapped.bam -d > per_base_coverage.bed
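
With the `-d` option, `bedtools genomecov` writes one line per position with three tab-separated columns: contig name, 1-based position, and depth. The Python sketch below shows how that output could be reduced to an average depth per contig. The helper name `mean_depth_per_contig` and the example lines are assumptions for illustration.

```python
from collections import defaultdict


def mean_depth_per_contig(lines):
    """Average the per-base depth (column 3) for each contig (column 1)."""
    depth_sum = defaultdict(int)
    n_positions = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        contig, _position, depth = line.rstrip("\n").split("\t")
        depth_sum[contig] += int(depth)
        n_positions[contig] += 1
    return {contig: depth_sum[contig] / n_positions[contig] for contig in depth_sum}


# Made-up per-base lines in the same three-column layout.
example_depth = [
    "contig_a\t1\t10",
    "contig_a\t2\t20",
    "contig_a\t3\t30",
    "contig_b\t1\t0",
    "contig_b\t2\t4",
]
for contig, depth in mean_depth_per_contig(example_depth).items():
    print(f"{contig}\t{depth:.1f}")
```

Because the real file has one line for every base in the assembly, reading it line by line from an open file handle, as this function allows, avoids loading it all into memory.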

In [None]:
%%bash

# We can call the script with the --help flag to learn how to run the program
scripts/circos.py --help

<div class="alert alert-block alert-info"><b>Tip</b>: Most programs will have a built in help menu. Call the program with the flag <code>--help</code> or <code>-h</code> to learn how to run unfamiliar programs.</div>

In [None]:
%%bash

scripts/circos.py --species "Enter your species here" --gff output-bakta/contigs.gff3 --gbk output-bakta/contigs.gbff --bed per_base_coverage.bed

<p align="center">
  <img src="circos_plot.png" width="80%"/>
</p>