# 1. Downloading and assembling microbial sequence data

Welcome to the **Downloading and assembling microbial sequence data** supplementary materials. This document is written as a [Jupyter](https://jupyter.org/) notebook, which means that it contains both written documentation and runnable code in the same document.

There are two ways to run the code here: (1) copy and paste the bash commands into a separate terminal, or (2) run the cells directly in Jupyter.

## 1.1. Copy-paste commands to a separate terminal

Copy the command from any code cell below and paste it into a terminal.

## 1.2. Running from within Jupyter

The easiest way to run the code is to click the **Play** button in the Jupyter notebook as shown below:

![run-jupyter.png](../images/run-jupyter.png)

This will run a single cell (block of code or text) and advance to the next cell. A cell that's running will be marked with an asterisk `*`:

![jupyter-cell-run.png](../images/jupyter-cell-run.png)

If you instead wish to run everything in this notebook at once, find the **Kernel** option in the menu at the top and select **Restart Kernel and Run All Cells...**.

![restart-kernel-jupyter.png](../images/restart-kernel-jupyter.png)

# 2. Download genomes from NCBI

In this section we will go over how to download genomes from [NCBI's Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra/) by using the [SRA Tools](https://github.com/ncbi/sra-tools) command-line software provided by NCBI.

## 2.1. Install `sra-tools` using conda

This will create a new conda environment `sra-tools` and install the package `sra-tools` in this environment.

The `-y` option here means to automatically answer `yes` to any prompts for input by conda. This is important when running using Jupyter (but can be left out if you are running via the command-line).

In [None]:
conda create -y -n sra-tools sra-tools

## 2.2. Verify `sra-tools` is working

Let's verify that the tools found in the package `sra-tools` are working properly. We can do this by running one of the installed commands, `prefetch`, like so:

In [None]:
conda run -n sra-tools prefetch --help

Notice that the command starts with `conda run -n sra-tools [...]`. The `conda run` part activates the named environment before running a tool, which is useful when each tool is installed in a separate environment (as we are doing here). Alternatively, you can run `conda activate sra-tools` first and then `prefetch --help`, but this will not work when running in Jupyter.
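
If a `conda run` command fails, it can help to first confirm that conda is on your `PATH` and that the named environment exists. The following is a minimal sketch; `check_env` is a hypothetical helper name (not part of conda), and it assumes that `conda env list` prints the environment name in the first column.

```shell
# Hypothetical helper: report whether a conda environment with this name exists.
check_env() {
    env_name="$1"
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found on PATH"
        return 1
    fi
    # `conda env list` prints one environment per line, with the name first.
    if conda env list | awk '{ print $1 }' | grep -qx "$env_name"; then
        echo "environment '$env_name' exists"
    else
        echo "environment '$env_name' not found"
        return 1
    fi
}

check_env sra-tools || echo "Create it first with the command from section 2.1."
```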

## 2.3. Prefetch genomes (`prefetch`)

The [prefetch](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump) command that is part of the `sra-tools` package can be used to download and store sequence read data from NCBI's Sequence Read Archive. You can run it like so:

In [None]:
conda run -n sra-tools prefetch SRR3028792

You can even pass multiple accession numbers to `prefetch` like so:

In [None]:
conda run -n sra-tools prefetch SRR3028792 SRR3028793
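
For longer lists, it may be more convenient to keep the accessions in a text file and loop over them. The sketch below assumes a hypothetical file `accessions.txt` with one accession per line; the `fetch_all` function and the `DRY_RUN` variable are illustrative names, not part of `sra-tools`. With `DRY_RUN=1` the commands are only printed, so you can preview them without downloading anything.

```shell
# Hypothetical helper: prefetch every accession listed in a file (one per line).
# Set DRY_RUN=1 to print the commands instead of running them.
fetch_all() {
    list_file="$1"
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines and comment lines starting with '#'.
        case "$acc" in
            ''|'#'*) continue ;;
        esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "conda run -n sra-tools prefetch $acc"
        else
            # Redirect stdin so the command cannot consume the rest of the list.
            conda run -n sra-tools prefetch "$acc" < /dev/null
        fi
    done < "$list_file"
}

# Example: write a small list, then preview the commands without downloading.
printf 'SRR3028792\nSRR3028793\n' > accessions.txt
DRY_RUN=1 fetch_all accessions.txt
```

Running `fetch_all accessions.txt` without `DRY_RUN=1` would perform the actual downloads.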

The `prefetch` command downloads the data from NCBI and stores it on your computer in a special directory by default. You can see which directory using the `srapath` command:

In [None]:
conda run -n sra-tools srapath SRR3028792

The identifier to pass to `prefetch` can be found on the NCBI website.

For example, for `SRR3028792`, you can find this identifier (and the associated biological sample) at https://www.ncbi.nlm.nih.gov/sra/?term=SRR3028792

![genome-sra-identifier.png](../images/genome-sra-identifier.png)

## 2.4. Convert to fastq (`fasterq-dump`)

Once you've prefetched the sequence data you can convert to the **fastq** format using `fasterq-dump` (which is named based on the older `fastq-dump` tool).

In [None]:
conda run -n sra-tools fasterq-dump SRR3028792

This will write two `*.fastq` files to your current directory, which contain the forward (`_1.fastq`) and reverse (`_2.fastq`) reads.

We can use the `ls` command to **l**i**s**t the files and show additional information (such as file size).

In [None]:
ls -lh *.fastq

The `396M` part shows us that `SRR3028792_1.fastq` is **396 MB**. The `-lh` part is interpreted as follows: `l` is for **long list**, which means extra information (such as file size) is printed, and `h` is for **human readable**, which means that file sizes are converted into easier-to-read units (like MB, GB, etc.) instead of being left as bytes.

We can compress these files to save on storage space using the `gzip` command.

*Tip: To speed this up, you can use `pigz` (parallel gzip) instead of `gzip`, which will take advantage of multiple processing cores on your machine. Just run `pigz *.fastq`. If you don't have `pigz` installed it can be installed with `conda install pigz`.*

In [None]:
gzip *.fastq

In [None]:
ls -lth *.fastq.gz

Now the files end in `.fastq.gz` instead of `.fastq` and are much smaller (`SRR3028792_1.fastq.gz` is **110 MB** compared to **396 MB** for the uncompressed file).
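
Note that `gzip` replaces each file with its compressed version. If you would rather keep the original, one option is `gzip -c`, which writes the compressed data to standard output. The sketch below uses a small, made-up FASTQ file in a temporary directory (the file names and the `workdir` variable are illustrative) so it can be run without any downloaded data.

```shell
# Work in a temporary directory so no real data files are touched.
workdir=$(mktemp -d)

# Create a small, made-up FASTQ file with 200 identical-looking reads.
for i in $(seq 1 200); do
    printf '@read%s\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n' "$i"
done > "$workdir/demo.fastq"

# -c writes the compressed data to standard output, leaving demo.fastq in place.
gzip -c "$workdir/demo.fastq" > "$workdir/demo.fastq.gz"

# Compare sizes in bytes (wc -c counts bytes; tr strips padding some systems add).
orig_bytes=$(wc -c < "$workdir/demo.fastq" | tr -d ' ')
gz_bytes=$(wc -c < "$workdir/demo.fastq.gz" | tr -d ' ')
echo "original: $orig_bytes bytes, compressed: $gz_bytes bytes"
```

The exact ratio depends on the data; this repetitive demo file compresses far better than real reads would.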

We can also look at the file contents using the `cat` (or `zcat` for files ending in `.gz`) commands along with `head`.

In [None]:
zcat SRR3028792_1.fastq.gz | head

The command `zcat` decompresses a `*.gz` file and prints its contents. Since these are large files, we likely only want to look at the first few lines. The command `head` prints the first 10 lines by default and ignores the rest. The `|` character tells bash to take the output of `zcat SRR3028792_1.fastq.gz` (every line of the file) and pass it as input to `head`.

The end result is we see the first 10 lines of the `SRR3028792_1.fastq.gz` file.
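
A FASTQ file stores each read as four lines (a header starting with `@`, the sequence, a `+` separator, and the quality string), so `head -n 4` shows exactly one read. The sketch below uses a tiny, made-up file in a temporary directory so it runs on its own; `gzip -dc` is used in place of `zcat` because it behaves the same way and tends to be more portable (for example, on macOS).

```shell
# Work in a temporary directory with a tiny, made-up FASTQ file of three reads.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nFFFF\n@read3\nTTAA\n+\n####\n' \
    | gzip -c > "$workdir/tiny.fastq.gz"

# Decompress to standard output and keep only the first record (4 lines).
gzip -dc "$workdir/tiny.fastq.gz" | head -n 4
```

This prints the four lines belonging to `@read1`. For the real data, the same pattern would be `zcat SRR3028792_1.fastq.gz | head -n 4`.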

We can do the same thing for the reverse reads.

In [None]:
zcat SRR3028792_2.fastq.gz | head

*Tip: In addition to `head` which prints the first few lines of a file, there is also `tail` which prints the last few lines of a file.*
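
Because each read occupies four lines in the usual FASTQ layout, the number of reads is the line count divided by four. For paired-end data, the forward and reverse files would normally contain the same number of reads, which makes this a quick sanity check. The sketch below is an illustration on made-up files; `count_reads` is a hypothetical helper name.

```shell
# Hypothetical helper: count reads in a gzip-compressed FASTQ file,
# assuming the usual four lines per read.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Made-up paired files with two reads each, so the example runs on its own.
workdir=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' | gzip -c > "$workdir/pair_1.fastq.gz"
printf '@r1/2\nTTGA\n+\nIIII\n@r2/2\nCCAA\n+\nIIII\n' | gzip -c > "$workdir/pair_2.fastq.gz"

fwd=$(count_reads "$workdir/pair_1.fastq.gz")
rev=$(count_reads "$workdir/pair_2.fastq.gz")
echo "forward: $fwd reads, reverse: $rev reads"

if [ "$fwd" -ne "$rev" ]; then
    echo "Warning: forward and reverse read counts differ" >&2
fi
```

On the downloaded data, you could call `count_reads SRR3028792_1.fastq.gz` and `count_reads SRR3028792_2.fastq.gz` in the same way.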

# 3. Quality filter files (`fastp`)

The software [fastp](https://github.com/OpenGene/fastp) can be used to generate a report of the quality of the sequence reads as well as remove any poor-quality reads which might cause issues for downstream analysis.

## 3.1. Install `fastp` using conda

We can first install `fastp` using conda.

In [None]:
conda create -y -n fastp fastp

Now let's take a look at all the different options available for running `fastp`.

In [None]:
conda run -n fastp fastp --help

## 3.2. Evaluate quality of sequence reads

To run `fastp` on our set of sequence reads, we can pass in the `*.fastq.gz` files to the command and specify the output file names.

In [None]:
conda run -n fastp fastp --in1 SRR3028792_1.fastq.gz --in2 SRR3028792_2.fastq.gz --out1 SRR3028792_1.fp.fastq.gz --out2 SRR3028792_2.fp.fastq.gz --html SRR3028792.html --json SRR3028792.json
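
The input, output, and report names in the command above all follow the same pattern based on the accession. If you plan to process more than one sample, a small function can build the command from the accession alone. This is only a sketch: `fastp_command` is a hypothetical helper, and it prints the command rather than running it so that it is safe to try.

```shell
# Hypothetical helper: print the fastp command for one accession, following
# the file-naming pattern used in this notebook.
fastp_command() {
    acc="$1"
    echo "conda run -n fastp fastp" \
        "--in1 ${acc}_1.fastq.gz --in2 ${acc}_2.fastq.gz" \
        "--out1 ${acc}_1.fp.fastq.gz --out2 ${acc}_2.fp.fastq.gz" \
        "--html ${acc}.html --json ${acc}.json"
}

# Preview the command for the accession used in this notebook.
fastp_command SRR3028792
```

To execute the printed command instead of only previewing it, you could pipe it to bash: `fastp_command SRR3028792 | bash`.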

# 4. Genome assembly (`skesa`)

## 4.1. Install `skesa`

In [None]:
conda create -y -n skesa skesa

## 4.2. Assemble genome

In [None]:
conda run -n skesa skesa --reads SRR3028792_1.fp.fastq.gz,SRR3028792_2.fp.fastq.gz --contigs_out SRR3028792.fasta
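
Before generating a full quality report, a quick look at the number of contigs and the total assembly length can confirm that the assembly produced sensible output. The sketch below runs on a tiny, made-up FASTA file (the file name and variables are illustrative); the same `grep` and `awk` commands could be pointed at `SRR3028792.fasta`.

```shell
# A tiny, made-up FASTA file with two contigs (a sequence may span several lines).
workdir=$(mktemp -d)
printf '>Contig_1\nACGTACGT\nACGT\n>Contig_2\nGGCC\n' > "$workdir/demo_contigs.fasta"

# Each contig starts with a header line beginning with '>'.
n_contigs=$(grep -c '^>' "$workdir/demo_contigs.fasta")

# Sum the lengths of all non-header lines to get the total assembly size.
total_bp=$(awk '!/^>/ { total += length($0) } END { print total + 0 }' "$workdir/demo_contigs.fasta")

echo "contigs: $n_contigs, total length: $total_bp bp"
```

For the demo file this reports 2 contigs and a total length of 16 bp.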

# 5. Evaluate assembly quality (`quast`)

## 5.1. Install `quast`

In [None]:
conda create -y -n quast quast

In [None]:
conda run -n quast quast --help

## 5.2. Create a quality report with `quast`

In [None]:
conda run -n quast quast SRR3028792.fasta

### 5.2.1. Zip up report for download

Let's zip up the quast report for download so we can open it in our browser. You can create a `*.zip` archive from the command-line with the `zip` command.

In [None]:
zip -r quast.zip quast_results

# 6. Where to go from here

An assembled genome file (`SRR3028792.fasta`) often forms the basis for other downstream analyses. Here is an example you can try on your own:

1. Try uploading `SRR3028792.fasta` to CARD/RGI (which will identify genes associated with antimicrobial resistance): https://card.mcmaster.ca/analyze/rgi