# Submodule 1: Introduction to genome sequencing and assembly
--------
Genomics is the comprehensive study of an organism's complete set of DNA, including all of its genes. It provides a deep understanding of the genetic blueprint that governs the biology, function, and behavior of organisms. 

With the advent of advanced sequencing technologies, genomics has become an essential tool for exploring the genetic basis of health and disease, understanding evolution, and studying biodiversity. A critical component of genomics is genome assembly, which involves reconstructing the genome of an organism from short DNA sequences generated by sequencing technologies. Genome assembly is a foundational step that allows researchers to generate a complete picture of the genetic material in a given organism. Assessing the quality and completeness of a genome assembly ensures that it accurately reflects the original genome, providing a reliable foundation for further analysis.

Comparative genomics builds on genome assembly by comparing the genomes of different species or individuals. This field focuses on identifying similarities and differences in DNA sequences to understand evolutionary relationships, discover conserved genetic elements, and reveal the genetic underpinnings of adaptation and diversity. Comparative genomics can uncover how species evolve and adapt over time and identify genes associated with specific traits or diseases.

In this tutorial, we will explore the key concepts and methodologies in genomics, including how to assemble a genome, assess its quality, and perform comparative analyses. Through hands-on examples, you will learn how to analyze genomic data, interpret results, and apply various computational tools to gain deeper insights into the genetic information that defines life.

 

### Learning Objectives:

+ Understand how high-throughput sequencing data is generated.

+ Develop an understanding of core bioinformatic input/output formats as they relate to comparative genomics.

+ Acquire the skills to assemble raw sequencing reads into a draft genome, assess the quality of the genome assembly, and annotate the genome sequence. 

+ Learn to perform comparative genomic analyses to identify similarities and differences across genomes, run phylogenomic analyses, construct pangenomes to capture genetic diversity, and apply these techniques to address biological questions and hypotheses. 


## 1.1 - Generation of sequencing data

Isolate Bacteria            |  Extract DNA
:-------------------------:|:-------------------------:
![alt text](https://github.com/Joseph7e/HCGS-Genomics-Tutorial/blob/master/petri.jpg?raw=true)  |  <img src="https://www.cephamls.com/wp-content/uploads/2019/02/DNA-Extraction-Figure-3-22.jpg" width="420">
<img src="https://github.com/Joseph7e/HCGS-Genomics-Tutorial/blob/master/fragmentation3.png?raw=true" width="800">

Prepare Library           |  Sequence DNA
:-------------------------:|:-------------------------:
<img src="https://jef.works//assets/blog/librarystructure.png" width="520">  |  <img src="https://github.com/Joseph7e/HCGS-Genomics-Tutorial/blob/master/hiseq.png?raw=true" width="320">


## 1.2 - How next-generation sequencing works

[![sequencing by synthesis](img/youtube-video-sequencing.PNG)](https://www.youtube.com/watch?v=p4vKJJlNTKA&t=9s "Sequencing")



## 1.3 - Comparison of sequencing instruments



## 1.4 - Starting data

Each of these samples represents the genome of a unique and novel microbe that has not been seen before (except by me).

The data used in this tutorial were generated on an Illumina HiSeq 2500. The reads are paired-end, and each read is 250 bp in length.

<p align="center">
    <img src="images/jupyterNotebook_annotated.png" alt="jupyterNotebook" width="50%"/>
</p>
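A FASTQ file stores each read as four lines (header, sequence, separator, quality string), so read counts and read lengths can be checked with standard command-line tools. The sketch below is only an illustration: `demo.fastq.gz`, its two reads, and the `fastq_stats` helper are made up so the block runs without any downloaded data, and it assumes each sequence sits on a single line.

```shell
# Build a tiny, made-up FASTQ file so the example runs without any downloads.
# (demo.fastq.gz and its two reads are hypothetical, for illustration only.)
workdir=$(mktemp -d)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGTAC\n+\nIIIIII\n' | gzip > "$workdir/demo.fastq.gz"

# Each FASTQ record is four lines: header, sequence, separator, qualities.
# The sequence is line 2 of every record, i.e. NR % 4 == 2.
fastq_stats() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { n++; total += length($0) }
        END { if (n > 0) printf "reads=%d bases=%d mean_length=%.1f\n", n, total, total / n }'
}

stats=$(fastq_stats "$workdir/demo.fastq.gz")
echo "$stats"

# Clean up the temporary directory.
rm -r "$workdir"
```

Pointed at one of the tutorial's real read files, the same function should report a mean length at or near 250 if the reads are untrimmed.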

### 1.4.1 - Download sra-tools to access public data available on the NCBI Sequence Read Archive (SRA)

In [11]:
%%bash

# Install sra-tools (which provides prefetch and fastq-dump) using mamba (a conda alternative)
mamba install --channel bioconda python=3.9 sra-tools -y

Transaction

  Prefix: /home/ec2-user/anaconda3/envs/python3

  All requested packages already installed


Looking for: ['python=3.9', 'sra-tools']

### 1.4.2 - Download sequence data from the SRA
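Before running the download, it can help to confirm that the sra-tools programs are actually visible on the `PATH` of the shell the notebook uses, since a package can be installed in one environment while the cell runs in another. The following is a minimal sketch; `require_tools` is a hypothetical helper, not part of sra-tools.

```shell
# Report which of the named programs are missing from PATH.
# Returns 0 if all are found, 1 otherwise. (require_tools is a hypothetical helper.)
require_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Check the two programs used in the download step.
if require_tools prefetch fastq-dump; then
    echo "sra-tools is ready"
else
    echo "sra-tools is not on PATH; check that the right conda environment is active" >&2
fi
```

If either program is reported as missing, activating the environment that mamba installed into (or re-running the install inside the active one) is usually the first thing to try.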



In [1]:
%%bash

# Set up the working directory and move into it
mkdir -p wgs-workflow
cd wgs-workflow

# Capture the SRA accession in a variable
accession=SRR10056829

# Download data from the SRA using the variable from above

# prefetch downloads the run from the SRA as a .sra file, stored in a directory named after the accession (5.629s)
prefetch -v "$accession"

# fastq-dump converts the .sra file to FASTQ, writes forward and reverse reads to separate files (--split-files), and compresses them with gzip (1m33.96s)
fastq-dump --outdir raw-reads --gzip --split-files "$accession"/"$accession".sra

# Remove the prefetch directory
rm -r "$accession"

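Once the download succeeds, a quick sanity check is to confirm that the forward and reverse files hold the same number of reads. The sketch below assumes the `<accession>_1.fastq.gz` / `<accession>_2.fastq.gz` naming that `fastq-dump --split-files` produces for paired-end runs; the `count_reads` and `check_pair` helpers and the `DEMO0001` accession are hypothetical and exist only so the block runs on its own.

```shell
# Count reads in a gzipped FASTQ file (four lines per read).
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Compare read counts of the two mate files for an accession.
# Assumes the <accession>_1.fastq.gz / <accession>_2.fastq.gz naming.
check_pair() {
    dir=$1
    acc=$2
    for f in "$dir/${acc}_1.fastq.gz" "$dir/${acc}_2.fastq.gz"; do
        [ -f "$f" ] || { echo "missing file: $f" >&2; return 1; }
    done
    n1=$(count_reads "$dir/${acc}_1.fastq.gz")
    n2=$(count_reads "$dir/${acc}_2.fastq.gz")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $acc has $n1 read pairs"
    else
        echo "MISMATCH: $acc has $n1 forward and $n2 reverse reads" >&2
        return 1
    fi
}

# Demo on a made-up pair of files (accession DEMO0001 is hypothetical).
demo=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' | gzip > "$demo/DEMO0001_1.fastq.gz"
printf '@r1/2\nTTAA\n+\nIIII\n@r2/2\nCCGG\n+\nIIII\n' | gzip > "$demo/DEMO0001_2.fastq.gz"
pair_report=$(check_pair "$demo" DEMO0001)
echo "$pair_report"

# Clean up the temporary directory.
rm -r "$demo"
```

For the tutorial data, the equivalent call from inside `wgs-workflow` would be `check_pair raw-reads SRR10056829`.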