# Genome Assembly and Annotation Workflow

This notebook describes a workflow for assembling and annotating a prokaryotic genome, starting from raw paired-end reads. It consists of (1) quality control using FastQC and Trimmomatic, (2) assembly using SPAdes, (3) assembly evaluation using QUAST, and (4) annotation using Prokka.

The workflow described here is intended primarily for the assembly and annotation of prokaryotic genomes. SPAdes was initially built for assembling bacterial genomes and was later optimized for small eukaryotic genomes as well. Prokka, however, was built specifically for annotating bacterial, archaeal, and viral genomes.

---
## <font color ='blue'>How to Use This Notebook</font>

1. Open Jupyter Notebook with the command below and select this notebook.
>`jupyter-notebook`
2. To run the cells in this notebook, press Shift+Enter.

---
## Tools Used
1. <b>FastQC v0.11.9</b> for quality checking.
2. <b>Trimmomatic v0.39</b> for filtering and trimming reads.
3. <b>SPAdes v3.15.4</b> for genome assembly.
4. <b>QUAST v5.2.0</b> for assembly evaluation.
5. <b>Prokka v1.14.6</b> for determining coding regions and their identities.
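
Before running the workflow, it can help to confirm that each tool is reachable from the notebook's environment. The snippet below is a minimal sketch that assumes the executables are named exactly as they are invoked in the later cells (`fastqc`, `trimmomatic`, `spades.py`, `quast.py`, `prokka`); the helper name `find_tools` is hypothetical. It only checks that the programs are on `PATH`, not that their versions match the list above.

```python
import shutil

# Executable names as they are invoked later in this notebook.
TOOLS = ["fastqc", "trimmomatic", "spades.py", "quast.py", "prokka"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report which tools were found in the current environment.
for name, path in find_tools(TOOLS).items():
    print(f"{name:12s} {path if path else 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` should be installed (or its environment activated) before continuing.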

---
## Starting Files

1. This Jupyter notebook.
2. Directories for organizing the data. To make the folders, run the following code block:

In [None]:
#Create the demo folder and its subfolders (-p skips folders that already exist)
!mkdir -p genome_assembly_and_annot_demo_folder
%cd genome_assembly_and_annot_demo_folder
!mkdir -p \
0-raw-reads \
1-fastqc \
2-trimmomatic \
3-spades \
4-quast \
5-prokka

---
## Acknowledgement
The data used for this demonstration come from a <i>Pseudomonas</i> sp. B10 isolate sequenced by <a href="https://journals.asm.org/doi/10.1128/MRA.00237-19">León-Zayas et al. (2019)</a>. The SRA accession ID of the data used is SRR8835125.

<i>León-Zayas, R., Roberts, C., Vague, M., & Mellies, J. L. (2019). Draft genome sequences of five environmental bacterial isolates that degrade polyethylene terephthalate plastic. Microbiology Resource Announcements, 8(25), e00237-19.</i>

---
## Table of Contents
 * [**Step 1: Downloading Demo Data**](#Step-1:-Downloading-Demo-Data)  
 * [**Step 2: Initial Checking of Reads**](#Step-2:-Initial-Checking-of-Reads)  
 * [**Step 3: Trimming and Filtering**](#Step-3:-Trimming-and-Filtering)  
     * [Running Trimmomatic](#Running-Trimmomatic)
     * [Post-Trimmomatic QC](#Post-Trimmomatic-QC)
 * [**Step 4: Assembly**](#Step-4:-Assembly)
     * [SPAdes: Default](#SPAdes:-Default)
     * [SPAdes: --careful](#SPAdes:---careful)
     * [SPAdes: --isolate](#SPAdes:---isolate)
 * [**Step 5: Assembly Checking**](#Step-5:-Assembly-Checking)
 * [**Step 6: Annotation**](#Step-6:-Annotation)
---

# <font color = 'gray'>Step 1: Downloading Demo Data</font>

In [None]:
!wget -O 0-raw-reads/forward.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR883/005/SRR8835125/SRR8835125_1.fastq.gz
!wget -O 0-raw-reads/reverse.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR883/005/SRR8835125/SRR8835125_2.fastq.gz
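
After the download, a quick sanity check is to confirm that both files are readable and contain the same number of reads. The sketch below assumes standard four-line FASTQ records; the helper names `count_fastq_reads` and `check_pair` are hypothetical. A truncated gzip download will typically raise an error while the file is being read.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a FASTQ file (read through gzip if the name ends in .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def check_pair(forward_path, reverse_path):
    """Return (forward count, reverse count, whether the counts match)."""
    forward = count_fastq_reads(forward_path)
    reverse = count_fastq_reads(reverse_path)
    return forward, reverse, forward == reverse


# Only run the check if the demo files are present.
raw = Path("0-raw-reads")
if (raw / "forward.fastq.gz").exists() and (raw / "reverse.fastq.gz").exists():
    print(check_pair(raw / "forward.fastq.gz", raw / "reverse.fastq.gz"))
```

The same functions can be pointed at the Trimmomatic output files later on to see how many read pairs survived trimming.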

---
# <font color = 'gray'>Step 2: Initial Checking of Reads</font>

First, we have to check the quality of the raw reads to assess if sequencing errors may be present and to decide how the reads should be trimmed.

In [None]:
#Generate the fastqc reports
!fastqc \
    -o 1-fastqc/ \
    0-raw-reads/*

In [None]:
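#View the fastqc reports
from pathlib import Path
import webbrowser
#Use absolute file:// URLs; passing bare relative paths to webbrowser is not portable
for report in ['1-fastqc/forward_fastqc.html', '1-fastqc/reverse_fastqc.html']:
    webbrowser.open(Path(report).resolve().as_uri(), new=2)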
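
Besides the HTML report, FastQC writes a `.zip` archive per input file that contains a tab-separated `summary.txt` with one PASS/WARN/FAIL status per module. The sketch below reads that file to list the modules that need attention. It assumes the archive sits next to the HTML report (for example `1-fastqc/forward_fastqc.zip`); the helper names are hypothetical.

```python
import zipfile
from pathlib import Path


def read_fastqc_summary(zip_path):
    """Return {module name: status} from the summary.txt inside a FastQC archive."""
    with zipfile.ZipFile(zip_path) as archive:
        member = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        text = archive.read(member).decode("utf-8")
    summary = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Columns: status, module name, input file name
        status, module = line.split("\t")[:2]
        summary[module] = status
    return summary


def flagged_modules(summary):
    """List the modules whose status is anything other than PASS."""
    return [module for module, status in summary.items() if status != "PASS"]


# Only inspect archives that actually exist in the output folder.
for name in ["forward", "reverse"]:
    archive_path = Path("1-fastqc") / f"{name}_fastqc.zip"
    if archive_path.exists():
        print(name, flagged_modules(read_fastqc_summary(archive_path)))
```

A WARN or FAIL is not necessarily a problem; it only marks the plots worth looking at before choosing trimming parameters.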

---
# <font color = 'gray'>Step 3: Trimming and Filtering</font>

### Running Trimmomatic

Next, trim the reads based on the FastQC reports. Check Trimmomatic's printed output to see how many read pairs survive filtering. If you are losing too many reads on your own data, try more lenient trimming and filtering parameters.

In [None]:
!trimmomatic PE \
    0-raw-reads/forward.fastq.gz \
    0-raw-reads/reverse.fastq.gz \
    2-trimmomatic/ofp.fastq \
    2-trimmomatic/ofu.fastq \
    2-trimmomatic/orp.fastq \
    2-trimmomatic/oru.fastq \
    LEADING:30 \
    TRAILING:30 \
    SLIDINGWINDOW:5:18 \
    MINLEN:75
                    
#Abbreviations:
#     ofp = output forward paired
#     ofu = output forward unpaired
#     orp = output reverse paired
#     oru = output reverse unpaired
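
To make the effect of the four trimming steps above more concrete, the sketch below applies the same parameters, in the same order, to a single quality string. It is a simplified illustration of the idea, not Trimmomatic's actual implementation: it assumes Phred+33 encoding and cuts at the start of the first window whose mean quality falls below the threshold, which may differ in detail from where Trimmomatic places the cut. The function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_read(quality_string, leading=30, trailing=30,
              window=5, window_q=18, minlen=75):
    """Return the (start, end) slice to keep, or None if the read is dropped."""
    scores = phred_scores(quality_string)
    start, end = 0, len(scores)
    # LEADING: remove bases below the threshold from the 5' end
    while start < end and scores[start] < leading:
        start += 1
    # TRAILING: remove bases below the threshold from the 3' end
    while end > start and scores[end - 1] < trailing:
        end -= 1
    # SLIDINGWINDOW: cut where the mean quality of a window first drops too low
    for pos in range(start, end - window + 1):
        if sum(scores[pos:pos + window]) / window < window_q:
            end = pos
            break
    # MINLEN: drop reads that end up too short
    if end - start < minlen:
        return None
    return start, end


# 'I' encodes Phred 40 and '#' encodes Phred 2 in Phred+33.
print(trim_read("#" * 5 + "I" * 90 + "#" * 5))
```

Lowering `leading`, `trailing`, or `window_q`, or reducing `minlen`, corresponds to the more lenient settings mentioned above.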

### Post-Trimmomatic QC

Finally, check the quality of the trimmed and filtered reads and decide whether this is acceptable for your downstream steps.

In [None]:
#Generate the fastqc reports for the trimmed and filtered reads
!fastqc \
    -o 1-fastqc/ \
    -t 2 \
    2-trimmomatic/ofp.fastq \
    2-trimmomatic/orp.fastq

In [None]:
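#View the fastqc reports
from pathlib import Path
import webbrowser
#Use absolute file:// URLs; passing bare relative paths to webbrowser is not portable
for report in ['1-fastqc/ofp_fastqc.html', '1-fastqc/orp_fastqc.html']:
    webbrowser.open(Path(report).resolve().as_uri(), new=2)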

---
# <font color = 'gray'>Step 4: Assembly</font>

No single assembler or set of parameters yields the "best" assembly for every dataset. Hence, for the next steps, we will generate different assemblies using different modes in SPAdes.

Note that these steps may take a while (~30 minutes on an i5 10th gen processor). If you want to skip these steps, please copy the relevant files from the sample data folder to the demo folder.

### SPAdes: Default

In [None]:
!spades.py \
    -1 2-trimmomatic/ofp.fastq \
    -2 2-trimmomatic/orp.fastq \
    -o 3-spades/1-spades_result_default \
    --threads 2 \
    --memory 6

### SPAdes: <font face = 'Consolas'>--careful</font>

The <font face = 'Consolas'><b>--careful</b></font> option in SPAdes will attempt to reduce the number of mismatches and short indels.

In [None]:
!spades.py \
    --careful \
    -1 2-trimmomatic/ofp.fastq \
    -2 2-trimmomatic/orp.fastq \
    -o 3-spades/2-spades_result_careful \
    --threads 2 \
    --memory 6

### SPAdes: <font face = 'Consolas'>--isolate</font>

The <font face = 'Consolas'><b>--isolate</b></font> option in SPAdes is recommended for high-coverage isolate and multi-cell data.

In [None]:
!spades.py \
    --isolate \
    -1 2-trimmomatic/ofp.fastq \
    -2 2-trimmomatic/orp.fastq \
    -o 3-spades/3-spades_result_isolate \
    --threads 2 \
    --memory 6
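
The three SPAdes runs above differ only in the mode flag and the output folder. As a sketch, the same commands can be built from a small table of modes, which makes it easier to change the thread count or memory limit in one place. The function name `spades_command` is hypothetical, and the call that would actually launch SPAdes is left commented out.

```python
import subprocess  # only needed if the run line below is uncommented

# Mode name -> extra SPAdes flag (None means the default mode).
MODES = {"default": None, "careful": "--careful", "isolate": "--isolate"}


def spades_command(mode, index, threads=2, memory_gb=6):
    """Build the argument list for one SPAdes run, mirroring the cells above."""
    command = ["spades.py"]
    if MODES[mode]:
        command.append(MODES[mode])
    command += [
        "-1", "2-trimmomatic/ofp.fastq",
        "-2", "2-trimmomatic/orp.fastq",
        "-o", f"3-spades/{index}-spades_result_{mode}",
        "--threads", str(threads),
        "--memory", str(memory_gb),
    ]
    return command


for index, mode in enumerate(MODES, start=1):
    command = spades_command(mode, index)
    print(" ".join(command))
    # subprocess.run(command, check=True)  # uncomment to actually run SPAdes
```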

---
# <font color = 'gray'>Step 5: Assembly Checking</font>

To decide which of the three assemblies is best, we can evaluate them using the metrics produced by QUAST. The command below does not use a reference genome, so the report will contain fewer metrics. If you have a reference genome, you may specify its file path with the <font face = 'Consolas'><b>-r</b></font> option.

Compare the results of the 3 assemblies and decide which one to use for the succeeding steps.

In [None]:
#Generate the evaluation report
!quast.py \
    -o 4-quast \
    -l "default, careful, isolate" \
    3-spades/1-spades_result_default/scaffolds.fasta \
    3-spades/2-spades_result_careful/scaffolds.fasta \
    3-spades/3-spades_result_isolate/scaffolds.fasta

In [None]:
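#Open the report of quast
from pathlib import Path
import webbrowser
#Use an absolute file:// URL; passing a bare relative path to webbrowser is not portable
webbrowser.open(Path('4-quast/report.html').resolve().as_uri(), new=2)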
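
One of the metrics QUAST reports is N50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The sketch below computes it directly from a FASTA file to show what the number means. The function names are hypothetical, and the values may differ from the QUAST report because QUAST by default ignores contigs shorter than 500 bp.

```python
from pathlib import Path


def read_sequence_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length at which the running sum of sorted lengths reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Report N50 for whichever of the three assemblies exist.
for folder in ["1-spades_result_default", "2-spades_result_careful",
               "3-spades_result_isolate"]:
    scaffolds = Path("3-spades") / folder / "scaffolds.fasta"
    if scaffolds.exists():
        print(folder, n50(read_sequence_lengths(scaffolds)))
```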

---
# <font color = 'gray'>Step 6: Annotation</font>

The Prokka pipeline identifies coding sequences (CDS) using Prodigal and then determines the identity of each predicted CDS by performing BLAST searches in a hierarchical manner against its built-in databases. It also predicts other genomic features, such as tRNA and rRNA genes.

Optionally, if you want to annotate your assembled genome using a reference genome, you may supply the <font face = 'Consolas'><b>--proteins</b></font> option, which takes a FASTA or GenBank file of your reference as input.

Note that the last argument of the command below uses the scaffolds produced by the <b>default</b> mode of SPAdes. Replace it with whichever assembly you've selected as the "best".

In [None]:
!prokka \
    --outdir 5-prokka \
    --force \
    --compliant \
    3-spades/1-spades_result_default/scaffolds.fasta
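
Once Prokka finishes, a quick way to inspect the result is to count the annotated features by type. The sketch below reads the GFF3 file that Prokka writes to the output folder and tallies the third column (feature type), stopping at the `##FASTA` section where the sequences begin. It assumes the file name ends in `.gff` (the prefix depends on the run date unless one is set explicitly); the function name is hypothetical.

```python
import glob
from collections import Counter


def count_gff_features(gff_path):
    """Count features by type (GFF3 column 3), ignoring the embedded FASTA section."""
    counts = Counter()
    with open(gff_path) as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 9:
                counts[fields[2]] += 1
    return counts


# Summarize every GFF file found in the Prokka output folder.
for gff_file in glob.glob("5-prokka/*.gff"):
    for feature_type, count in sorted(count_gff_features(gff_file).items()):
        print(f"{feature_type}\t{count}")
```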