# Submodule 1: Introduction to genome sequencing and assembly
--------
Genomics is the comprehensive study of an organism's complete set of DNA, including all of its genes. It provides a deep understanding of the genetic blueprint that governs the biology, function, and behavior of organisms. 

With the advent of advanced sequencing technologies, genomics has become an essential tool for exploring the genetic basis of health and disease, understanding evolution, and studying biodiversity. A critical component of genomics is genome assembly, which involves reconstructing the genome of an organism from short DNA sequences generated by sequencing technologies. Genome assembly is a foundational step that allows researchers to generate a complete picture of the genetic material in a given organism. Assessing the quality and completeness of a genome assembly ensures that it accurately reflects the original genome, providing a reliable foundation for further analysis.

Comparative genomics builds on genome assembly by comparing the genomes of different species or individuals. This field focuses on identifying similarities and differences in DNA sequences to understand evolutionary relationships, discover conserved genetic elements, and reveal the genetic underpinnings of adaptation and diversity. Comparative genomics can uncover how species evolve and adapt over time and identify genes associated with specific traits or diseases.

In this tutorial, we will explore the key concepts and methodologies in genomics, including how to assemble a genome, how to assess its quality, how to annotate the genome, and how to perform comparative analyses.
 

### Learning Objectives:

+ Understand how high-throughput sequencing data is generated.

+ Develop an understanding of core bioinformatic input/output formats as they relate to comparative genomics.

+ Acquire the skills to assemble raw sequencing reads into a draft genome, assess the quality of the genome assembly, and annotate the genome sequence. 

+ Learn to perform comparative genomic analyses to identify similarities and differences across genomes, run phylogenomic analyses, construct pangenomes to capture genetic diversity, and apply these techniques to address biological questions and hypotheses. 

## Background 01: What is a genome?

## Background 02: How is sequencing data produced?

### What is Next-Generation Sequencing (NGS)?

Next-Generation Sequencing (NGS) is a high-throughput method that allows for rapid sequencing of DNA. Illumina remains the most widely used sequencing platform. It is accurate, inexpensive, and fast, providing up to 540 Gb of data on a single flow cell in about 1-2 days. Long-read sequencing technologies (e.g. PacBio and Nanopore) are also becoming widely used in the research community.

Figure 1 outlines the basic workflow for sequencing a microbial genome. It begins with the **isolation of bacterial cultures**, which involves culturing bacteria from a sample on selective media to obtain a pure strain. Once a pure culture is established, **DNA extraction** is performed to isolate the genomic DNA from the bacterial cells.

For Illumina-based sequencing, the DNA is **fragmented** into smaller pieces that are attached to the specially coated surface of a flow cell, where each fragment is amplified into a dense cluster of identical copies. Through a process called **sequencing by synthesis**, fluorescently labeled nucleotides are incorporated one at a time into these clusters. Each incorporated nucleotide emits a distinct signal, allowing the sequencer to determine the DNA sequence in real time.

This technology enables millions of fragments to be sequenced simultaneously, significantly increasing throughput and reducing costs.


Isolate Bacteria            |  Extract DNA
:-------------------------:|:-------------------------:
![alt text](https://github.com/Joseph7e/HCGS-Genomics-Tutorial/blob/master/petri.jpg?raw=true)  |  <img src="https://www.cephamls.com/wp-content/uploads/2019/02/DNA-Extraction-Figure-3-22.jpg" width="420">
<div style="text-align: center;">
<img src="https://github.com/Joseph7e/HCGS-Genomics-Tutorial/blob/master/fragmentation3.png?raw=true" width="700">
</div>

Prepare Library           |  Sequence DNA
:-------------------------:|:-------------------------:
<img src="https://jef.works//assets/blog/librarystructure.png" width="520">  |  <img src="https://github.com/Joseph7e/HCGS-Genomics-Tutorial/blob/master/hiseq.png?raw=true" width="320">

Figure 1. Sequencing a bacterial isolate.

## How next-generation sequencing works

The following video describes the process in detail.

[![Sequencing by Synthesis](images/cluster-generation.jpg)](https://www.youtube.com/watch?v=p4vKJJlNTKA)

In [34]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/p4vKJJlNTKA" frameborder="0" allowfullscreen></iframe>

## **Starting Dataset**

The dataset we will use in this tutorial is publicly available from the NCBI Sequence Read Archive (SRA); it is downloaded in a later step.


## **Install required software**

Four main tools are required for Submodule 1: fastq-dump, fastqc, fastp, and spades. We will install these tools using __[Conda](https://docs.conda.io/en/latest/)__, an open-source package management system and environment management tool. Conda helps to easily install, update, and manage software dependencies across different platforms.

### List of software
- **sra-tools** - The tool `fastq-dump` will be used to retrieve publicly available data from the NCBI Sequence Read Archive (SRA).

- **fastqc** - Used to assess the quality of raw sequencing data by generating visual reports on metrics like read length, GC content, and sequence duplication.

- **fastp** - Used for quality control and pre-processing of FASTQ files, including read trimming, adapter removal, and filtering out low-quality reads.

- **spades** - Used for genome assembly, creating contigs from high-quality, pre-processed sequencing reads.


We will install specific versions of these tools, which ensures consistent behavior and reproducibility of results across different environments and systems.

<div class="alert alert-block alert-info"><b>Tip</b>: The code below is written in Bash. The <code>%%bash</code> at the beginning of the cell instructs the Jupyter notebook to execute the code as a Bash script when the cell is run. This is the same code you would run from a standard command-line terminal.</div>

<div class="alert alert-block alert-info"><b>Tip</b>: To execute the code, either press <b>Ctrl-Enter</b> while the cell is selected or press the play button at the top of the notebook.</div>

In [28]:
%%bash

# Install all tools using mamba (a conda alternative) with specified versions

mamba install --channel bioconda \
    python=3.9 \
    sra-tools=3.1.1 \
    fastqc=0.11.8 \
    fastp=0.23.4 \
    spades=4.0.0 \
    -y > /dev/null 2>&1


echo -e "\033[1mInstallation of sra-tools, fastqc, fastp, and spades complete.\nDisplaying versions:\033[0m"

# Confirm installation by checking versions
fastq-dump --version
fastqc --version
fastp --version
spades.py --version


Installation of sra-tools, fastqc, fastp, and spades complete.
Displaying versions:

fastq-dump : 3.1.1

FastQC v0.11.8


fastp 0.23.4


SPAdes genome assembler v4.0.0


<div class= "alert alert-block alert-info"><b>Tip</b>: use <code>\</code> to break a long command into multiple lines</div>

<div class= "alert alert-block alert-info"><b>Tip</b>: use <code>> /dev/null 2>&1</code> to redirect both standard output and standard error to <i>/dev/null</i>, effectively discarding them.</div>

## **Download starting data**

In this tutorial we use data obtained from the NCBI Sequence Read Archive (SRA), accession SRR10056829.

The data were generated on an Illumina HiSeq 2500, are paired-end, and each read is 100 bp in length.

<p align="center">
    <img src="images/jupyterNotebook_annotated.png" alt="jupyterNotebook" width="50%"/>
</p>

In [45]:
%%bash

# Capture the SRA accession in a variable
accession=SRR10056829

# Download data from the SRA using the variable from above

# prefetch downloads the run's .sra file into a directory named after the accession (5.629s)
prefetch -v "$accession"

# fastq-dump converts the .sra file to FASTQ, writes forward and reverse reads to separate files,
# and compresses them using standard gzip compression. (1m33.96s)
fastq-dump --outdir raw-reads --gzip --split-files "$accession"/"$accession".sra

# Remove the prefetch directory
rm -r "$accession"

echo Process Complete

2024-10-07T21:27:16 prefetch.3.1.1: 1) Resolving 'SRR10056829'...
2024-10-07T21:27:16 prefetch.3.1.1: 'tools/ascp/disabled': not found in configuration
2024-10-07T21:27:16 prefetch.3.1.1: Checking 'ascp'
2024-10-07T21:27:16 prefetch.3.1.1: 'ascp': not found
2024-10-07T21:27:16 prefetch.3.1.1: Checking 'ascp'
2024-10-07T21:27:16 prefetch.3.1.1: 'ascp': not found
2024-10-07T21:27:16 prefetch.3.1.1: Checking '/usr/bin/ascp'
2024-10-07T21:27:16 prefetch.3.1.1: '/usr/bin/ascp': not found
2024-10-07T21:27:16 prefetch.3.1.1: Checking '/usr/bin/ascp'
2024-10-07T21:27:16 prefetch.3.1.1: '/usr/bin/ascp': not found
2024-10-07T21:27:16 prefetch.3.1.1: Checking '/opt/aspera/bin/ascp'
2024-10-07T21:27:16 prefetch.3.1.1: '/opt/aspera/bin/ascp': not found
2024-10-07T21:27:16 prefetch.3.1.1: Checking '/opt/aspera/bin/ascp'
2024-10-07T21:27:16 prefetch.3.1.1: '/opt/aspera/bin/ascp': not found
2024-10-07T21:27:16 prefetch.3.1.1: Checking '/home/ec2-user/.aspera/connect/bin/ascp'
2024-10-07T21:27:16 prefe

### FASTQ file format
The data we just downloaded are in **FASTQ** file format. Let's examine this data and file format.

Let's run three code cells. The first confirms that the data we downloaded exists. The second defines the forward and reverse variable names for later use and prints the first entry of the forward reads. The third prints the length of the first read.



In [49]:
%%bash

# If the above code worked you should have two FASTQ files in a directory called raw-reads
ls raw-reads/

SRR10056829_1.fastq.gz
SRR10056829_2.fastq.gz


In [59]:
%%bash

# define variable names for reads
forward=raw-reads/SRR10056829_1.fastq.gz
reverse=raw-reads/SRR10056829_2.fastq.gz

# view the first four lines
echo "-------- FASTQ READ ---------"
zcat $forward | head -n 4
echo "-----------------------------"

-------- FASTQ READ ---------
@SRR10056829.1 1 length=100
NCCCCAAGGAATAACATTCATAACCCCCATAGCCGAAGTAATAATCAATAACATTGAAGTCTTTCTTATACCTAAACGCTCATAAATAGGTAATAATGCA
+SRR10056829.1 1 length=100
#0<FFFFFFFFFFIIFFIIIIIIFIIIIIIIIIFIFFIIIIIIIIFFFIIIIIIIIFIIIIIIIIIIFFFFFFFBFFFFFFFFFFFFFBFBBFFFFFFFF
-----------------------------


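The above code should have displayed the first four lines of the FASTQ file. Each sequencing read in a FASTQ-formatted file is made up of four lines: a header line that begins with `@` and carries the read identifier, the nucleotide sequence, a separator line that begins with `+`, and a string of quality scores with one character per base.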
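The four-line layout can be made concrete with a short Python sketch. It is only an illustration: the helper names `parse_fastq_record` and `phred_scores` are hypothetical, the record is a shortened, made-up example rather than a read from the tutorial dataset, and the quality string is assumed to use the common Phred+33 encoding.

```python
# A minimal sketch: split one four-line FASTQ record into named fields.
# parse_fastq_record and phred_scores are hypothetical helper names.

def parse_fastq_record(lines):
    """Return (read_id, sequence, quality_string) from four FASTQ lines."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@"):
        raise ValueError("header line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("separator line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# A shortened, made-up record (not taken from the tutorial dataset)
record = [
    "@example.1 1 length=8\n",
    "NCCCCAAG\n",
    "+example.1 1 length=8\n",
    "#0<FFFFI\n",
]

read_id, sequence, quality = parse_fastq_record(record)
scores = phred_scores(quality)
print(read_id)
print(sequence)
print(scores)
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 40 means roughly one expected error in 10,000 base calls.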

The length of the sequencing reads varies depending on the type of sequencing instrument and flow cell used, but it should be consistent across the raw data. In our example dataset the reads are 100 bp in length. Illumina sequencers produce reads with lengths ranging between 50 and 300 bp. Run the following code to print the length of the first read.

In [53]:
%%bash
zcat raw-reads/SRR10056829_1.fastq.gz | head -n 2 | tail -n 1 | wc -c

101


<div class= "alert alert-block alert-info"><b>Tip</b>: The number above is one higher than the actual read length because the <code>wc</code> command also counts the newline character at the end of the line.</div>
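Checking a single read does not show whether every read has the same length. As a rough sketch, the Python below tallies read lengths across a whole gzip-compressed FASTQ file. The function name `read_length_counts` is hypothetical, and the demonstration writes a tiny made-up file so that the example runs without the tutorial data.

```python
import gzip
import os
import tempfile
from collections import Counter


def read_length_counts(fastq_path):
    """Tally read lengths in a gzip-compressed FASTQ file."""
    counts = Counter()
    with gzip.open(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # The sequence is the second line of every four-line record
            if line_number % 4 == 1:
                counts[len(line.rstrip("\n"))] += 1
    return counts


# Demonstration on a tiny, made-up file so the sketch runs anywhere
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("@r1\nACGTACGT\n+\nIIIIIIII\n")
    handle.write("@r2\nACGTACGT\n+\nIIIIIIII\n")
    handle.write("@r3\nACGTA\n+\nIIIII\n")

length_counts = read_length_counts(demo_path)
print(dict(length_counts))
```

To apply the same function to the tutorial data, pass it the path of a downloaded file, for example `read_length_counts("raw-reads/SRR10056829_1.fastq.gz")`.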

### Do we have enough data? 

A typical bacterial genome is around 5 million basepairs in length. However, this can range from XXX to XXX depending on the species in question.

### What is depth of 'coverage'?

### How do we determine our total throughput?

INSERT FLASH CARD WITH QUESTIONS ABOUT READ LENGTH AND COVERAGE

In [64]:
%%bash

# count the lines in the forward read file. Each read entry is exactly four lines long,
# so dividing this number by 4 gives the number of forward reads.
zcat raw-reads/*_1.fastq.gz | wc -l

6365120


In [63]:
%%bash

# we can do this calculation from the terminal with echo and bc (bc is the terminal calculator)
num_lines=6365120
read_length=100
genome_size=5000000

# each FASTQ entry spans four lines, so divide the line count by 4 to get the number of reads
num_reads=$(echo $num_lines " / 4" | bc)

# calculate our total throughput (multiply by 2 because the data is paired-end)
echo "Total number of base pairs sequenced:"
throughput=$(echo $num_reads " * " $read_length " * 2" | bc)
echo $throughput

# calculate our estimated coverage
echo "Total coverage for genome dataset:"
echo $throughput " / " $genome_size | bc


Total number of base pairs sequenced:
318256000
Total coverage for genome dataset:
63
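The same arithmetic can be wrapped in a small Python function: coverage is the total number of sequenced bases divided by the genome size. This is a sketch only. The function name `estimate_coverage` is hypothetical, the input values are made up for illustration and are not the tutorial dataset, and the genome size is an estimate that you supply.

```python
def estimate_coverage(num_lines, read_length, genome_size, paired=True):
    """Estimate sequencing depth from the line count of one FASTQ file.

    num_lines   -- number of lines in the forward FASTQ file
    read_length -- length of each read in base pairs
    genome_size -- estimated genome size in base pairs
    paired      -- True when a second file of mates of equal size exists
    """
    num_reads = num_lines // 4          # four lines per FASTQ entry
    mates = 2 if paired else 1          # count both mates for paired-end data
    throughput = num_reads * read_length * mates
    coverage = throughput / genome_size
    return num_reads, throughput, coverage


# Made-up example values (hypothetical, not the tutorial dataset)
num_reads, throughput, coverage = estimate_coverage(
    num_lines=8_000_000,
    read_length=150,
    genome_size=5_000_000,
)
print(num_reads, throughput, coverage)
```

Unlike the integer division performed by `bc`, this version keeps the fractional part of the coverage estimate.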


### 1.5.2 Quality Assessment with FASTQC

Program: FASTQC  
Manual: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/  
FASTQC explained:  
Conda:  

FastQC is a program that summarizes read qualities and base composition. Since we have millions of reads, there is no practical way to do this by hand. We call the program to parse the FASTQ files and compute some basic statistics about the data. The input to the program is one or more FASTQ files and the output is an HTML file with several figures. The manual linked above describes each of the output figures. I mainly focus on the first graph, which visualizes our average read qualities, and the last figure, which shows the adapter content. Note that this program does not do anything to your data; it merely reads it and writes a report.

In [None]:
%%bash
# make a directory to store the output (-p avoids an error if it already exists)
mkdir -p output-fastqc

# run the program on the two read files (R1 and R2).
fastqc raw-reads/*.fastq.gz -o output-fastqc --threads 24

# the resulting folder should contain a zipped archive and an html file, we can ignore the zipped archive which is redundant.
ls output-fastqc/

In [None]:
## view HTML report
from IPython.display import IFrame
IFrame('output-fastqc/SRR10056829_1_fastqc.html', width=1000, height=550)
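To give a feel for what a per-position quality summary involves, here is a simplified Python sketch that averages Phred scores at each read position. It is not FastQC's implementation, and FastQC reports richer statistics than a mean. The function name `mean_quality_per_position` and the quality strings are made up for illustration, and Phred+33 encoding is assumed.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    Reads may differ in length; each position is averaged over the
    reads that are long enough to cover it.
    """
    totals = []
    counts = []
    for quality in quality_strings:
        for position, char in enumerate(quality):
            if position == len(totals):
                # First time this position is seen: extend both lists
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Made-up quality strings (hypothetical, not from the tutorial dataset)
demo_qualities = ["IIIF#", "IIFF", "IIII#"]
means = mean_quality_per_position(demo_qualities)
print(means)
```

In this toy input the mean drops sharply at the final position, which is the kind of pattern that quality trimming is meant to address.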

## 1.6 - Quality and adapter trimming with FASTP

Program: fastp - an ultra-fast all-in-one FASTQ preprocessor
Citation: *Chen, S. (2023). Ultrafast one‐pass FASTQ data preprocessing, quality control, and deduplication using fastp. Imeta, 2(2), e107.*
Manual: https://github.com/OpenGene/fastp
Conda: https://anaconda.org/bioconda/fastp

Description:
Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication.

In [None]:
%%bash

# make a directory to store the output (-p avoids an error if it already exists)
mkdir -p output-fastp

# define input variables
forward=raw-reads/SRR10056829_1.fastq.gz
reverse=raw-reads/SRR10056829_2.fastq.gz

# run the program
fastp --in1 $forward --in2 $reverse --out1 output-fastp/trimmed_1.fastq.gz --out2 output-fastp/trimmed_2.fastq.gz

# the resulting folder should contain new trimmed read files
ls output-fastp/
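The idea of filtering out low-quality reads can be sketched in a few lines of Python. This is a conceptual illustration, not fastp's implementation. The function name `passes_quality_filter` and the thresholds `min_phred=15` and `max_low_fraction=0.4` are assumptions chosen for the example, so consult the fastp manual for the tool's actual options and defaults.

```python
def passes_quality_filter(quality, min_phred=15, max_low_fraction=0.4, offset=33):
    """Keep a read unless too many of its bases have a low Phred score.

    The thresholds are illustrative assumptions, not values read from fastp.
    """
    if not quality:
        return False
    low_quality_bases = sum(1 for char in quality if ord(char) - offset < min_phred)
    return low_quality_bases / len(quality) <= max_low_fraction


# Made-up quality strings (Phred+33 assumed): 'I' decodes to 40, '#' to 2
demo_reads = {
    "good_read": "IIIIIIII",
    "mostly_good": "IIII##II",
    "poor_read": "##II####",
}
kept = [name for name, quality in demo_reads.items() if passes_quality_filter(quality)]
print(kept)
```

With paired-end data a filter like this has to be applied to read pairs together, so that the forward and reverse files stay in the same order.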

## 1.7 - Genome Assembly

Program: SPAdes - St. Petersburg genome assembler
Citation: *Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., ... & Pevzner, P. A. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5), 455-477.*
Manual: https://ablab.github.io/spades/
Conda: https://anaconda.org/bioconda/spades


Description: 
SPAdes (St. Petersburg genome assembler) is a genome assembly algorithm that was designed for single-cell and multi-cell bacterial data sets. It might not be suitable for large genome projects.

SPAdes works with Ion Torrent, PacBio, Oxford Nanopore, and Illumina paired-end, mate-pair, and single reads.

In [None]:
%%bash
## Assemble trimmed reads using the SPAdes assembler.

# define input
forward=output-fastp/trimmed_1.fastq.gz
reverse=output-fastp/trimmed_2.fastq.gz

# assemble reads with SPAdes
spades.py -1 $forward -2 $reverse -o output-spades
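SPAdes is a de Bruijn graph assembler: reads are broken into overlapping k-mers, and contigs are read off as paths through the resulting graph. The toy Python sketch below shows only that core idea. It is not how SPAdes is implemented, and it ignores sequencing errors, repeats, reverse complements, and the use of several k values. The function names and the two short reads are made up for illustration.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched_path(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:
            break  # stop rather than loop forever on a cycle
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Two made-up overlapping reads (hypothetical data)
demo_reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn_graph(demo_reads, k=3)
contig = walk_unbranched_path(graph, "AC")
print(dict(graph))
print(contig)
```

The two reads overlap by four bases, and the walk through the graph merges them into a single six-base contig.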