# Submodule 1: Introduction to genome sequencing and assembly
--------
## Overview

Genomics is the comprehensive study of an organism's complete set of DNA, including all of its genes. It provides a deep understanding of the genetic blueprint that governs the biology, function, and behavior of organisms. 

With the advent of advanced sequencing technologies, genomics has become an essential tool for exploring the genetic basis of health and disease, understanding evolution, and studying biodiversity. A critical component of genomics is genome assembly, which involves reconstructing the genome of an organism from short DNA sequences generated by sequencing technologies. Genome assembly is a foundational step that allows researchers to generate a complete picture of the genetic material in a given organism. Assessing the quality and completeness of a genome assembly ensures that it accurately reflects the original genome, providing a reliable foundation for further analysis.

Comparative genomics builds on genome assembly by comparing the genomes of different species or individuals. This field focuses on identifying similarities and differences in DNA sequences to understand evolutionary relationships, discover conserved genetic elements, and reveal the genetic underpinnings of adaptation and diversity. Comparative genomics can uncover how species evolve and adapt over time and identify genes associated with specific traits or diseases.

In this tutorial, we will explore the key concepts and methodologies in genomics, including how to assemble a genome, how to assess its quality, how to annotate the genome, and how to perform comparative analyses.

<p align="center">
  <img src="images/diagram-WGS.png" width="70%"/>
</p>



### Learning Objectives:

+ Understand how high-throughput sequencing data is generated. 

+ Develop an understanding of core bioinformatic input/output formats as they relate to comparative genomics. 

+ Acquire the skills to assemble raw sequencing reads into a draft genome, assess the quality of the genome assembly, and annotate the genome sequence. 

+ Learn to perform comparative genomic analyses to identify similarities and differences across genomes, run phylogenomic analyses, construct pangenomes to capture genetic diversity, and apply these techniques to address biological questions and hypotheses. 

## Background: How is sequencing data produced?

### What is Next-Generation Sequencing (NGS)?

Next-Generation Sequencing (NGS) is a high-throughput method that allows for rapid sequencing of DNA. Illumina remains the most widely used sequencing platform. It is accurate, inexpensive, and fast, providing up to 540 Gb of data on a single flow cell in about 1-2 days. Long-read sequencing technologies (e.g., PacBio and Oxford Nanopore) are also becoming widely used in the research community and will be discussed in later modules.

Figure 1 outlines the basic workflow for sequencing a microbial genome. It begins with the **isolation of bacterial cultures**, which involves culturing bacteria from a sample on selective media to obtain a pure strain. Once a pure culture is established, **DNA extraction** is performed to isolate the genomic DNA from the bacterial cells.

For Illumina-based sequencing, the DNA is **fragmented** into smaller pieces that attach to the specially coated surface of a flow cell, where each fragment is amplified into a dense cluster of identical copies. Through a process called **sequencing by synthesis**, fluorescently labeled nucleotides are incorporated one at a time into these clusters. Each incorporated nucleotide emits a distinct signal, allowing the sequencer to determine the DNA sequence in real time.

This technology enables millions of fragments to be sequenced simultaneously, significantly increasing throughput and reducing costs.



Isolate Bacteria            |  Extract DNA
:-------------------------:|:-------------------------:
![alt text](https://github.com/Joseph7e/HCGS-Genomics-Tutorial/blob/master/petri.jpg?raw=true)  |  <img src="https://www.cephamls.com/wp-content/uploads/2019/02/DNA-Extraction-Figure-3-22.jpg" width="420">
<div style="text-align: center;">
<img src="https://github.com/Joseph7e/HCGS-Genomics-Tutorial/blob/master/fragmentation3.png?raw=true" width="700">
</div>

Prepare Library           |  Sequence DNA
:-------------------------:|:-------------------------:
<img src="https://jef.works//assets/blog/librarystructure.png" width="520">  |  <img src="https://github.com/Joseph7e/HCGS-Genomics-Tutorial/blob/master/hiseq.png?raw=true" width="320">




## Video Lesson: How next-generation sequencing works.

The following video describes the process in detail.

[![Sequencing by Synthesis](images/cluster-generation.jpg)](https://www.youtube.com/watch?v=p4vKJJlNTKA)

In [None]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/p4vKJJlNTKA" frameborder="0" allowfullscreen></iframe>

## **Install required software**

Four main tools are required for Submodule 1: fastq-dump (part of sra-tools), fastqc, fastp, and spades. We will install these tools using __[Conda](https://docs.conda.io/en/latest/)__, an open-source package management system and environment management tool. Conda helps to easily install, update, and manage software dependencies across different platforms.

We will install specific versions of these tools to ensure consistent behavior and reproducible results across different environments and systems.

### List of software

| **Tool**       | **Description**                                                                                                                                                           |
|:---------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **sra-tools**      | fastq-dump is used to retrieve publicly available data from the NCBI Sequence Read Archive (SRA).                                                 |
| **fastqc**        | Used to assess the quality of raw sequencing data by generating visual reports on metrics like read length, GC content, and sequence duplication.|
| **fastp**   | Used for quality control and pre-processing of FASTQ files, including read trimming, adapter removal, and filtering out low-quality reads.                          |
| **spades**      | Used for genome assembly, creating contigs from high-quality, pre-processed sequencing reads.                  |  

<div class="alert alert-block alert-warning"> <b>Attention:</b> The code below is written in Bash. The <code>%%bash</code> at the beginning of the cell instructs the Jupyter notebook to execute the code as a Bash script when the cell is run. This is the same code you would run from a standard command-line terminal. </div>

<div class="alert alert-block alert-info"><b>Tip</b>: To execute the code, either press <b>Ctrl+Enter</b> when the cell is selected or press the play button at the top of the notebook.</div>

In [None]:
%%bash
# Install tools using mamba (a conda alternative) with specified versions

mamba install --channel bioconda \
    python=3.9 \
    sra-tools=3.1.1 \
    fastqc=0.11.8 \
    fastp=0.23.4 \
    spades=4.0.0 \
    -y > /dev/null 2>&1


echo -e "\033[1mInstallation of sra-tools, fastqc, fastp, and spades complete.\nDisplaying versions:\033[0m"

# Confirm installation by checking versions
fastq-dump --version
fastqc --version
fastp --version
spades.py --version


<div class= "alert alert-block alert-info"><b>Tip</b>: use <code>\</code> to break a long command into multiple lines</div>

<div class= "alert alert-block alert-info"><b>Tip</b>: use <code>> /dev/null 2>&1</code> to redirect standard output and standard error to <i>/dev/null</i>, effectively discarding them so your screen doesn't get flooded with text.</div>

## **Starting data**
The data used in this module are described in a manuscript comparing phenotypic and whole-genome sequencing-derived antimicrobial resistance (AMR) profiles (Painset et al. 2020) and are available under the SRA accession SRR10056829. Genomic DNA was extracted, fragmented, and tagged for multiplexing with **Nextera XT DNA Sample Preparation Kits**. Paired-end data were produced on an **Illumina HiSeq 2500** platform.

The manuscript includes sequencing read datasets for 528 isolates of *Campylobacter* spp. (452 *C. jejuni* and 76 *C. coli*) from human (494), food (21) and environmental (2) sources. We will start by walking through the process on a single dataset and later expand the analysis to a larger subset of these data.

In [None]:
%%bash

# Capture the SRA accession in a variable 
accession=SRR10056829

# Download data from the SRA using the variable from above 

# prefetch downloads the SRA record (.sra file) into a directory named after the accession (5.629s)
prefetch -v "$accession"

# fastq-dump converts the .sra file into FASTQ reads, one file per mate (--split-files), and compresses them with gzip (1m33.96s)
fastq-dump --outdir raw-reads --gzip --split-files "$accession"/"$accession".sra

# Remove the prefetch directory
rm -r "$accession"

echo Process Complete

<div class= "alert alert-block alert-info"><b>Tip</b>: Assigning certain values to variables (like the accession above) allows you to easily adjust input to a set of commands and reduces the amount of typing for long or complicated file names</div>

### FASTQ file format
The data we just downloaded are in **FASTQ** file format. Let's examine this data and file format.

Let's run three sections of code to learn about the FASTQ file format. For a detailed description of FASTQ files, see the NIGMS module [Fundamentals of Bioinformatics](https://github.com/NIGMS/Fundamentals-of-Bioinformatics).

In [None]:
%%bash

# If the above code worked you should have two FASTQ files in a directory called raw-reads
ls raw-reads/

<div class="alert alert-block alert-warning"> <b>Attention:</b> Notice that the files above have the .gz extension. FASTQ data is commonly gzip compressed to save space. Most bioinformatic tools accept or require gzipped FASTQs. </div>

In [None]:
%%bash

# define variable names for reads
forward=raw-reads/SRR10056829_1.fastq.gz
reverse=raw-reads/SRR10056829_2.fastq.gz

# view the first four lines
echo "-------- FASTQ READ ---------"
zcat $forward | head -n 4
echo "-----------------------------"

The above code should have displayed the first four lines of the FASTQ file. **Each sequencing read in a FASTQ formatted file is made up of four lines**. 
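
As a minimal sketch of that four-line layout, the Python snippet below groups lines into records and labels each part. The record shown is a made-up example rather than a read from SRR10056829, and the names `FastqRecord` and `read_fastq` are chosen here for illustration only.

```python
from itertools import islice
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str     # line 1: "@" followed by the read identifier
    sequence: str   # line 2: the base calls
    separator: str  # line 3: "+" (optionally followed by the identifier again)
    quality: str    # line 4: one quality character per base


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group text lines into four-line FASTQ records."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        chunk = list(islice(stripped, 4))
        if not chunk:
            return
        if len(chunk) != 4 or not chunk[0].startswith("@") or not chunk[2].startswith("+"):
            raise ValueError("malformed FASTQ record")
        yield FastqRecord(*chunk)


# Hypothetical record for illustration (not taken from SRR10056829)
example = ["@read1\n", "GATTACA\n", "+\n", "IIIIHHG\n"]
record = next(read_fastq(example))
print(record.header, record.sequence, len(record.quality))
```

To try this on the downloaded data, the lines could come from `gzip.open("raw-reads/SRR10056829_1.fastq.gz", "rt")` instead of the in-memory list.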

The length of the sequencing reads can vary depending on the type of sequencing instrument and flow cell used, but it should be consistent across the raw data. In our example dataset the reads are 100 bp in length. Illumina sequencers produce reads with lengths ranging between 50 and 300 bp. Run the following code to print the length of the first read.

In [None]:
%%bash
zcat raw-reads/SRR10056829_1.fastq.gz | head -n 2 | tail -n 1 | wc -c

<div class= "alert alert-block alert-info"><b>Tip</b>: The number above is one higher than the actual read length because the <code>wc</code> command also counts the newline character at the end of the line.</div>

## Do we have enough data? 
Most bacterial genomes are composed of a single, circular chromosome, although some bacteria may have linear chromosomes or multiple chromosomes. Plasmids (small circular DNA molecules) often coexist with the main chromosome and carry additional genes. Compared to eukaryotes, a typical bacterial genome is generally small, ranging from about 0.5 to 10 megabases (Mb) in size. For example, *Escherichia coli* has a genome size of approximately 4.6 Mb. 

| Organism                        | Genome Size (Mb) |
|----------------------------------|------------------|
| *Escherichia coli* (Bacteria)    | 4.6              |
| *Streptomyces coelicolor* (Bacteria) | 8.7          |
| *Mycoplasma genitalium* (Bacteria)   | 0.58         |
| *Homo sapiens* (Human)           | 3,200            |
| *Drosophila melanogaster* (Fruit Fly) | 139          |
| *Arabidopsis thaliana* (Thale Cress) | 135           |
| *Saccharomyces cerevisiae* (Yeast)   | 12.1         |
| *Caenorhabditis elegans* (Nematode)  | 100           |
| *Mus musculus* (Mouse)           | 2,700            |
| *Zea mays* (Maize)               | 2,300            |


### What is 'coverage'?
**Breadth of coverage** refers to the percentage of the genome that has been sequenced at least once. It tells you how much of the genome has any sequencing data aligned to it, but it doesn’t indicate how many times each base has been sequenced. High breadth of coverage means most or all regions of the genome are represented in the sequencing data.

**Depth of coverage**, on the other hand, refers to the average number of times each base in the genome is sequenced, often written with an "x" suffix (e.g., 30x depth). Depth is important for accuracy: when each base is sequenced multiple times, confidence in variant calling improves and the impact of individual sequencing errors is reduced.

When researchers say *coverage* they usually mean *depth of coverage*.


The equation for genome depth of coverage is:

$$
\text{Coverage} = \frac{N \times L}{G}
$$

Where:
- $N$ = Number of reads (for paired-end data, count both the forward and reverse reads)
- $L$ = Read length
- $G$ = Genome size


The total number of bases sequenced, the numerator in the above equation, is also called the total **throughput** for the sample.
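
The equation can be sketched as a small Python helper. The function names and the example numbers (one million 150 bp reads, a 5 Mb genome) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def depth_of_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: (N * L) / G."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    throughput = num_reads * read_length  # total bases sequenced
    return throughput / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Rearranged equation: N = (Coverage * G) / L, rounded up to a whole read."""
    return math.ceil(target_depth * genome_size / read_length)


# Hypothetical example: 1,000,000 reads of 150 bp against a 5 Mb genome
print(depth_of_coverage(1_000_000, 150, 5_000_000))  # 30.0
print(reads_needed(50, 150, 5_000_000))              # 1666667
```

Rearranging the equation this way is useful when planning a sequencing run: pick a target depth, then estimate how many reads are required.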


### How can we determine our total throughput and coverage?

Let's do the math based on the data we have so far. We know the read length, and we can use a best guess of 5 Mb for the genome size for now. So let's count our number of reads.

In [None]:
%%bash

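# Count the lines in each read file. Each read entry is exactly four lines long, so dividing by 4 gives the number of reads.
# The forward and reverse files should report the same number.
zcat raw-reads/*_1.fastq.gz | wc -l
zcat raw-reads/*_2.fastq.gz | wc -l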

In [None]:
%%bash

# we can do this calculation from the terminal with echo and bc (bc is the terminal calculator)
# number of reads in one file = line count from the previous cell divided by 4
num_reads=$(echo "6365120 / 4" | bc)
read_length=100
# adjust this number to the organism's genome size.
genome_size=5000000

# calculate our total throughput (multiplied by 2 because the data are paired-end: forward and reverse files)
echo "Total number of base pairs sequenced:"
throughput=$(echo "$num_reads * $read_length * 2" | bc)
echo $throughput

# calculate our estimated coverage (bc performs integer division by default)
echo "The estimated depth of coverage for this genome dataset:"
echo "$throughput / $genome_size" | bc


## Step 1: Quality Assessment with FASTQC

Program: FASTQC  
Manual: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/  
FASTQC explained:  https://www.bioinformatics.babraham.ac.uk/projects/fastqc/


FastQC is a program that summarizes read qualities and base composition. Since we have millions of reads, there is no practical way to do this by hand. We call the program to parse the FASTQ files and compute some basic statistics about the data. The input is one or more FASTQ files and the output is an HTML file with several figures. The link above describes what each output figure shows. I mainly focus on the first graph, which visualizes our read qualities, and the last figure, which shows the adapter content. Note that this program does not change your data; it merely reads it and writes a report.
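
To make the idea of "read qualities" concrete, here is a minimal Python sketch that decodes quality strings and averages them per read position. It assumes Phred+33 encoding, and it is a simplified illustration rather than FastQC's own implementation. The function names and the two quality strings are invented for the example.

```python
from typing import Iterable, List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(char) - offset for char in quality]


def mean_quality_per_position(quality_strings: Iterable[str]) -> List[float]:
    """Average Phred score at each read position across all reads."""
    totals: List[int] = []
    counts: List[int] = []
    for quality in quality_strings:
        for position, score in enumerate(phred_scores(quality)):
            if position == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[position] += score
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Hypothetical quality strings: "I" = Q40, "5" = Q20, "#" = Q2
print(mean_quality_per_position(["IIII", "II5#"]))  # [40.0, 40.0, 30.0, 21.0]
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the base call is wrong and Q40 means 1 in 10,000.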

In [None]:
%%bash
# make a directory to store the output (-p avoids an error if it already exists)
mkdir -p output-fastqc

# run the program on the two read files (R1 and R2).
fastqc raw-reads/*.fastq.gz -o output-fastqc --threads 24

# the resulting folder should contain a zipped archive and an html file, we can ignore the zipped archive which is redundant.
ls output-fastqc/

In [None]:
## view HTML report
from IPython.display import IFrame
IFrame('output-fastqc/SRR10056829_1_fastqc.html', width=1000, height=550)

## Step 2: Quality and adapter trimming with FASTP

- Program: **fastp - an ultra-fast all-in-one FASTQ preprocessor**
- Citation: *Chen, S. (2023). Ultrafast one‐pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta, 2(2), e107.*
- Manual: https://github.com/OpenGene/fastp
- Conda: https://anaconda.org/bioconda/fastp
- Description: Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication.
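
One kind of filtering such a preprocessor performs is discarding reads with too many low-quality bases. The Python sketch below illustrates that idea only. The function name and threshold values are assumptions picked for the example and should not be read as fastp's actual algorithm or settings; the fastp manual documents its real options.

```python
def passes_quality_filter(
    quality: str,
    min_phred: int = 15,            # illustrative threshold for a "good" base
    max_low_fraction: float = 0.4,  # illustrative limit on the share of low-quality bases
    min_length: int = 15,           # illustrative minimum read length
    offset: int = 33,               # assumes Phred+33 encoding
) -> bool:
    """Return True if a read is long enough and has few enough low-quality bases."""
    if len(quality) < min_length:
        return False
    low_quality_bases = sum(1 for char in quality if ord(char) - offset < min_phred)
    return low_quality_bases / len(quality) <= max_low_fraction


# Hypothetical quality strings: "I" = Q40, "#" = Q2
print(passes_quality_filter("I" * 20))             # True: every base is high quality
print(passes_quality_filter("#" * 10 + "I" * 10))  # False: half the bases are below Q15
print(passes_quality_filter("IIII"))               # False: shorter than min_length
```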

In [None]:
%%bash

# make a directory to store the output (-p avoids an error if it already exists)
mkdir -p output-fastp

# define input variables
forward=raw-reads/SRR10056829_1.fastq.gz
reverse=raw-reads/SRR10056829_2.fastq.gz

# run the program
fastp --in1 $forward --in2 $reverse --out1 output-fastp/trimmed_1.fastq.gz --out2 output-fastp/trimmed_2.fastq.gz

# the resulting folder should contain new trimmed read files
ls output-fastp/

## Step 3: Genome Assembly

- Program: **SPAdes - St. Petersburg genome assembler**
- Citation: *Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., ... & Pevzner, P. A. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5), 455-477.*  
- Manual: https://ablab.github.io/spades/  

SPAdes (St. Petersburg genome assembler) is a genome assembly algorithm designed for single-cell and multi-cell bacterial datasets. It may not be suitable for large genome projects. SPAdes works with Ion Torrent, PacBio, Oxford Nanopore, and Illumina paired-end, mate-pair, and single-end reads.

<div class="alert alert-block alert-info"><b>Tip</b>: I recommend running the code below, then watching the following video tutorial or reading more while it runs.</div>

In [None]:
%%bash
## Assemble trimmed reads using the SPAdes assembler.

# define input
forward=output-fastp/trimmed_1.fastq.gz
reverse=output-fastp/trimmed_2.fastq.gz

# assemble reads with SPAdes
spades.py -1 $forward -2 $reverse -o output-spades --threads 24 > logfile_spades.txt 2>&1


## Video Lesson: How genome assembly of short reads works.

The following video describes the process in detail. Image credit: https://homolog.us/Tutorials/book4/p2.1.html

[![De Bruijn graph assembly](images/debruijn.png)](https://www.youtube.com/watch?v=ZmF6QROPlTU)

## Video Recap

### De Bruijn Graphs in Genome Assembly

A **de Bruijn graph** is a structure used in genome assembly to represent sequences using **k-mers**. Here’s how it works:

1. **K-mers**: These are short sequences of length *k* derived from a longer DNA sequence. For example, from the sequence "AGCT", the 2-mers are "AG", "GC", and "CT".

2. **Graph Construction**: Each k-mer becomes a vertex in the graph. An edge is drawn from k-mer **A** to k-mer **B** if the last *k-1* nucleotides of **A** match the first *k-1* nucleotides of **B**. This creates a directed graph where paths through the vertices represent potential sequences in the genome.

3. **Contigs and Scaffolds**:
   - **Contigs**: These are contiguous sequences formed by traversing paths in the graph.
   - **Scaffolds**: Longer constructs that may combine multiple contigs, providing additional structure to the genome assembly.

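De Bruijn graphs handle large volumes of short-read data efficiently because identical k-mers from many reads collapse into a single vertex. Repetitive sequences, however, create branching paths in the graph that can break the assembly into multiple contigs.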
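
The construction described above can be sketched in a few lines of Python. This toy version follows the recap's formulation (k-mers as vertices, edges between k-mers that overlap by *k-1* nucleotides) and ignores reverse complements, sequencing errors, and coverage, all of which a real assembler such as SPAdes has to handle. The function names are chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def kmers(sequence: str, k: int) -> List[str]:
    """All overlapping substrings of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each k-mer to the k-mers whose first k-1 bases match its last k-1 bases."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))

    # Index k-mers by their (k-1)-prefix so successors can be looked up directly
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)

    return {node: sorted(by_prefix.get(node[1:], ())) for node in sorted(nodes)}


print(kmers("AGCT", 2))
# ['AG', 'GC', 'CT']
print(build_de_bruijn_graph(["AGCT", "GCTA"], 3))
# {'AGC': ['GCT'], 'CTA': [], 'GCT': ['CTA']}
```

In the second example the two overlapping reads share the k-mer `GCT`, so the graph contains a single path `AGC -> GCT -> CTA`, which spells out the contig `AGCTA`.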
  
  
### Key challenges for genome assembly:
**<u>Intrinsic Challenges:</u>**
1. **Heterozygosity:** The alleles of a gene are not the same, yet we force them into a single consensus sequence.
2. **Paralogy vs. Allelism**: New genes arise from existing genes through duplication, which results in two or more similar genes in an organism. The two alleles of a gene in a diploid organism are also very similar. How do you tell a duplicated gene from alleles of the same gene?
3. **Sequence complexity**: Simple sequence repeats (SSR), large scale repeats like transposable elements (TEs).

**<u>Extrinsic Challenges:</u>**
1. **Quality of DNA sequences (sequencing errors):** Each sequencing technology has specific patterns of error.
2. **Length of DNA sequence reads**: Shorter reads are less likely to be unique or to include unique k-mers (see the de Bruijn graph recap above).
3. **Coverage**: Sequencing depth varies randomly across the genome, so some regions will have low coverage.
4. **Memory intensive**: Assembly inherently requires large amounts of RAM, plus storage for input and output.
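
The link between sequence length and uniqueness can be illustrated with a small Python sketch that measures what fraction of k-mers in a sequence occur exactly once. The toy sequence contains a short repeat and is invented for the example, as is the function name.

```python
from collections import Counter


def unique_kmer_fraction(sequence: str, k: int) -> float:
    """Fraction of k-mer positions in the sequence whose k-mer occurs exactly once."""
    kmer_list = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmer_list:
        raise ValueError("k must be between 1 and the sequence length")
    counts = Counter(kmer_list)
    unique_positions = sum(1 for kmer in kmer_list if counts[kmer] == 1)
    return unique_positions / len(kmer_list)


# Hypothetical toy sequence containing the repeat "ACG"
toy = "ACGACGACT"
for k in (2, 4, 7):
    print(k, unique_kmer_fraction(toy, k))
```

For this toy sequence the fraction rises from 0.125 at k=2 to 1.0 at k=7: once a k-mer is long enough to span the repeat, every position becomes unique. The same reasoning explains why longer reads resolve repeats that shorter reads cannot.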




In [29]:
%%bash
# make a copy of the genome sequence in a new directory. This data will be used for Submodule 2.

mkdir -p assembled-genome
cp output-spades/contigs.fasta assembled-genome/