# Submodule 2: Assessment of genome assembly and genome annotation

In this submodule, you will begin with the genome that you assembled in Submodule 1. The primary goal is to assess the quality of the assembled genome through the lens of what we call the "5 Cs": Contiguity, Completeness, Contamination, Coverage, and Content. Using a combination of bioinformatics tools, you will evaluate the assembled genome and generate outputs that include visualizations, a cleaned genome sequence, and functional annotations. These outputs will be used in Submodule 4.


### Learning Objectives

Through this submodule, users will gain hands-on experience in quality assessment, resulting in a deeper understanding of genomic data integrity and the significance of accurate genome sequences.

- **Understand and Apply the 5 Cs of Genome Quality**:  
  Learn to assess the overall quality of a genome sequence by examining Contiguity (QUAST), Completeness (BUSCO), Contamination (BLAST/BlobTools), Coverage (BWA/Samtools), and Content (Prokka gene annotations).

- **Generate and Interpret Visualizations**:  
  Gain proficiency in using bioinformatics tools to produce visual summaries of assembly quality, and build skills in analyzing and interpreting them.

- **Relate the Central Dogma of Molecular Biology to Genome Annotation**:  
  Connect the principles of the central dogma (DNA → RNA → Protein) to the process of genome annotation, understanding how gene annotations contribute to functional genomics and biological interpretations.

- **Produce a Clean and Annotated Genome**:  
  Participants will refine the genome based on their assessments, ensuring a high-quality, annotated genome that can be used for further analysis or research applications.

## **Install required software**

Several additional tools are required for Submodule 2: QUAST, BUSCO, BWA, Samtools, BLAST, BlobTools, and Prokka. As with Submodule 1, we will install these tools using __[Conda](https://docs.conda.io/en/latest/)__.

### List of software
| **Tool**       | **Description**                                                                                                                                                           |
|:---------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **QUAST**      | Used for evaluating and reporting the quality of genome assemblies by comparing them against reference genomes or generating statistical summaries.                          |
| **BUSCO**      | Utilized for assessing genome completeness by searching for conserved single-copy orthologs from specific lineage datasets.                                                 |
| **BWA**        | A fast and memory-efficient tool for aligning sequence reads to large reference genomes, commonly used in variant calling pipelines.                                         |
| **Samtools**   | Used for manipulating and processing sequence alignments stored in SAM/BAM format. Essential for sorting, indexing, and viewing alignment files.                            |
| **BLAST**      | A widely used tool for comparing an input sequence to a database of sequences, identifying regions of local similarity and aiding in functional annotation.                  |
| **BlobTools**  | A versatile tool for visualizing and analyzing genome assemblies, helping to identify contamination or misassembled regions by correlating sequence features with taxonomy.   |
| **Prokka**     | Used for rapid annotation of prokaryotic genomes, identifying genes, coding sequences, rRNAs, tRNAs, and other genomic features.                                             |

```bash
%%bash

# Install all tools using mamba (a faster drop-in alternative to conda), pinning specific versions.
# The confirmation message is chained with && so it prints only if the install succeeds.

mamba install --channel bioconda \
    quast=5.2.0 \
    busco=5.4.6 \
    bwa=0.7.18 \
    samtools=1.18 \
    blast=2.15.0 \
    blobtools=1.0.1 \
    prokka=1.14.6 \
    -y \
  && echo "Installation of quast, busco, bwa, samtools, blast, blobtools, and prokka complete."
```

```text
bioconda/linux-64                                           Using cache
bioconda/noarch                                             Using cache
conda-forge/linux-64                                        Using cache
conda-forge/noarch                                          Using cache
nvidia/linux-64                                             Using cache
nvidia/noarch                                               Using cache
pytorch/linux-64                                            Using cache
pytorch/noarch                                              Using cache

Looking for: ['quast=5.2.0', 'busco=5.4.6', 'bwa=0.7.18', 'samtools=1.18', 'blast=2.15.0', 'blobtools=1.0.1', 'prokka=1.14.6']

Pinned packages:
  - python 3.9.*

Could not solve for environment specs
The following packages are incompatible
└─ blobtools 1.0.1**  is installable with the potential options
   ├─ blobtools 1.0.1 would require
   │  └─ python [2.7* |>=2.7,<2.8.0a0 ], which can be installed;
   └─ blobtools 1.0.1 would require
      └─ matplotlib 2.0.2 , which does not exist (perhaps a missing channel).
```

The solver could not satisfy these version pins. As the output shows, blobtools 1.0.1 requires Python 2.7, which conflicts with the pinned Python 3.9, and it also requires matplotlib 2.0.2, which is not available in the configured channels. Because the solve failed, none of the tools were installed and the confirmation message is not printed. This conflict must be resolved before the BlobTools steps in section 2.4 can be run.
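A failed solve is easy to miss in long installer output, so it can help to confirm that each tool is actually on the `PATH` before moving on. The following is a minimal sketch, not part of the original notebook: `check_tools` is a hypothetical helper, and the executable names (`quast.py` for QUAST, `blastn` for BLAST) are assumptions about what the Bioconda packages install.

```bash
# check_tools: hypothetical helper that reports which executables are on PATH.
# Prints one line per tool and returns non-zero if any tool is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names assumed for the packages listed in the install step.
check_tools quast.py busco bwa samtools blastn blobtools prokka \
    || echo "One or more tools are missing; re-check the install step."
```

If any line reads `MISSING`, revisit the solver output from the install step before continuing.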


## 2.1 - Contiguity assessment using QUAST
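
Among the contiguity statistics QUAST reports is N50: the contig length such that contigs of that length or longer contain at least half of the assembled bases. To make that definition concrete, here is a rough sketch that computes N50 from a FASTA file using only standard shell tools. It is an illustration of the definition, not how QUAST itself is implemented; `n50` is a hypothetical helper and `assembly.fasta` is a placeholder filename.

```bash
# n50: hypothetical helper that prints the N50 of the FASTA file given as $1.
n50() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # header line: emit the previous contig length
        { gsub(/[ \t\r]/, ""); len += length($0) }       # sequence line: strip whitespace, add its length
        END { if (len > 0) print len }                   # emit the final contig length
    ' "$1" \
    | sort -nr \
    | awk '
        { lens[NR] = $1; total += $1 }                   # contig lengths arrive longest first
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                # stop at the first contig where the running sum reaches half of the total
                if (running * 2 >= total) { print lens[i]; exit }
            }
        }
    '
}

# Placeholder filename; substitute the assembly produced in Submodule 1.
if [ -f assembly.fasta ]; then
    n50 assembly.fasta
fi
```

A higher N50 for the same total length generally indicates a more contiguous assembly, which is why it is usually read alongside the contig count and total assembly size.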

## 2.2 - Completeness assessment using BUSCO

## 2.3 - Coverage assessment using BWA

## 2.4 - Taxonomic assignment using BLAST and blobtools

### 2.4.1 - BLAST genome assembly against the nt database

### 2.4.2 - Combine datasets into a blobtools database

## 2.5 - Filter non-target sequences from de novo assembly

## 2.6 - Genome annotation using Prokka