# 1. Downloading and assembling microbial sequence data

Welcome to the **Downloading and assembling microbial sequence data** supplementary materials. This document is written as a [Jupyter](https://jupyter.org/) notebook, which means that it contains both written documentation and runnable code in the same document.

There are two ways to run the code here: (1) copy-paste the bash commands into a separate terminal, or (2) run them directly in Jupyter.

## 1.1. Copy-paste commands to a separate terminal

## 1.2. Running from within Jupyter

The easiest way to run the code is to click the **Play** button in the Jupyter notebook as shown below:

![run-jupyter.png](../images/run-jupyter.png)

This will run a single cell (block of code or text) and advance to the next cell. A cell that's running will be marked with an asterisk `*`:

![jupyter-cell-run.png](../images/jupyter-cell-run.png)

If you instead wish to run everything in this notebook at once, find the **Kernel** option in the menu at the top and select **Restart Kernel and Run All Cells...**.

![restart-kernel-jupyter.png](../images/restart-kernel-jupyter.png)

# 2. Download genomes from NCBI

In this section we will go over how to download genomes from [NCBI's Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra/) by using the [SRA Tools](https://github.com/ncbi/sra-tools) command-line software provided by NCBI.

## 2.1. Install `sra-tools` using conda

The command below creates a new conda environment named `sra-tools` and installs the package `sra-tools` into it.

The `-y` option here means to automatically answer `yes` to any prompts for input by conda. This is important when running using Jupyter (but can be left out if you are running via the command-line).

In [1]:
conda create -y -n sra-tools sra-tools

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/CSCScience.ca/apetkau/miniconda3/envs/sra-tools

  added / updated specs:
    - sra-tools


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-1_gnu
  c-ares             conda-forge/linux-64::c-ares-1.18.1-h7f98852_0
  ca-certificates    conda-forge/linux-64::ca-certificates-2021.10.8-ha878542_0
  curl               conda-forge/linux-64::curl-7.81.0-h2574ce0_0
  hdf5               conda-forge/linux-64::hdf5-1.10.6-nompi_h6a2412b_1114
  icu                conda-forge/linux-64::icu-69.1-h9c3ff4c_0
  krb5               conda-forge/linux-64::krb5-1.19.2-hcc1bbae_3
  libcurl            conda-forge/linux-64::libcurl-7.81.0-h2574ce0_0
  libedit            conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
  libev              con

## 2.2. Verify `sra-tools` is working

Let's verify that the tools found in the package `sra-tools` are working properly. We can do this by running one of the installed commands, `prefetch`, like so:

In [2]:
conda run -n sra-tools prefetch --help


Usage: prefetch [ options ] [ accessions(s)... ]

Parameters:

  accessions(s)                    list of accessions to process


Options:

  -T|--type <file-type>            Specify file type to download. Default: sra
  -N|--min-size <size>             Minimum file size to download in KB
                                     (inclusive).
  -X|--max-size <size>             Maximum file size to download in KB
                                     (exclusive). Default: 20G
  -f|--force <no|yes|all|ALL>      Force object download - one of: no, yes,
                                     all, ALL. no [default]: skip download if
                                     the object if found and complete; yes:
                                     download it even if it is found and is
                                     complete; all: ignore lock files (stale
                                     locks or it is being downloaded by
                                     another process - use at your own

Notice that I ran the command by first specifying `conda run -n sra-tools [...]`. The `conda run` part lets you specify a particular environment to activate before running a tool. It's useful when each tool is installed in a separate environment (as we are doing here). You can also run `conda activate sra-tools` first and then `prefetch --help` (but this will not work if running in Jupyter).

## 2.3. Prefetch genomes (`prefetch`)

The [prefetch](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump) command that is part of the `sra-tools` package can be used to download and store sequence read data from NCBI's Sequence Read Archive. You can run it like so:

In [3]:
conda run -n sra-tools prefetch SRR3028792


2022-01-21T19:03:53 prefetch.2.11.0: 1) 'SRR3028792' is found locally
2022-01-21T19:03:53 prefetch.2.11.0: 'SRR3028792' has 0 unresolved dependencies




You can even pass multiple accession numbers to `prefetch` like so:

In [4]:
conda run -n sra-tools prefetch SRR3028792 SRR3028793


2022-01-21T19:03:55 prefetch.2.11.0: 1) 'SRR3028792' is found locally
2022-01-21T19:03:55 prefetch.2.11.0: 'SRR3028792' has 0 unresolved dependencies

2022-01-21T19:03:55 prefetch.2.11.0: 2) 'SRR3028793' is found locally
2022-01-21T19:03:55 prefetch.2.11.0: 'SRR3028793' has 0 unresolved dependencies
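
When the list of accessions grows, it can be easier to keep them in a text file and loop over it. The following is a minimal sketch and not part of the original notebook: the file name `accessions.txt` and the `DRY_RUN` variable are assumptions made for illustration. With `DRY_RUN=1` (the default here) the loop only prints each command, so it can be tried without downloading anything.

```shell
# Hypothetical file listing one accession per line (assumed name).
printf '%s\n' SRR3028792 SRR3028793 > accessions.txt

# DRY_RUN=1 only prints the commands; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

while read -r accession; do
    # Skip blank lines in the accession list.
    [ -z "$accession" ] && continue
    if [ "$DRY_RUN" = "1" ]; then
        echo "conda run -n sra-tools prefetch $accession"
    else
        # Redirect stdin so the command cannot consume the accession list.
        conda run -n sra-tools prefetch "$accession" < /dev/null
    fi
done < accessions.txt
```

Keeping the accessions in a file also leaves a record of exactly which datasets were downloaded.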




The `prefetch` command downloads the data from NCBI and stores it on your computer in a special directory by default. You can see which directory using the `srapath` command:

In [5]:
conda run -n sra-tools srapath SRR3028792

/home/CSCScience.ca/apetkau/ncbi/public/sra/SRR3028792.sra




This tells us that there is a file `SRR3028792.sra` that has been downloaded to the path above which contains the sequence data.

To find out which identifier to pass to `prefetch`, look up the data on the NCBI website.

For example, for `SRR3028792` you can find this identifier (and the associated biological sample) at https://www.ncbi.nlm.nih.gov/sra/?term=SRR3028792

![genome-sra-identifier.png](../images/genome-sra-identifier.png)

## 2.4. Convert to fastq (`fasterq-dump`)

Once you've prefetched the sequence data you can convert it to the **fastq** format using `fasterq-dump` (which is named after the older `fastq-dump` tool). We need to convert from the `.sra` format to the `.fastq` format as most tools work with **fastq** files (instead of **sra** files, which are specific to NCBI's Sequence Read Archive).

In [6]:
conda run -n sra-tools fasterq-dump SRR3028792

spots read      : 824,262
reads read      : 1,648,524
reads written   : 1,648,524




This will write two `*.fastq` files to your current directory which contain the forward (`_1.fastq`) and reverse (`_2.fastq`) reads.

We can use the `ls` command to **l**i**s**t the files and show additional information (such as file size).

In [7]:
ls -lh *.fastq

-rw-r--r-- 1 apetkau grp_apetkau 396M Jan 21 13:03 SRR3028792_1.fastq
-rw-r--r-- 1 apetkau grp_apetkau 398M Jan 21 13:03 SRR3028792_2.fastq



The `396M` part shows us that `SRR3028792_1.fastq` is **396 MB**. The `-lh` part is interpreted as follows: `l` is for **long list**, which means extra information (such as file size) is printed, `h` is for **human readable** which means that file sizes are converted into easier to read units (like MB, GB, etc) instead of being left as bytes. 

We can compress these files to save on storage space using the `gzip` command.

*Tip: To speed this up, you can use `pigz` (parallel gzip) instead of `gzip`, which will take advantage of multiple processing cores on your machine. Just run `pigz *.fastq`. If you don't have `pigz` installed it can be installed with `conda install pigz`.*

In [8]:
gzip *.fastq




In [9]:
ls -lth *.fastq.gz

-rw-r--r-- 1 apetkau grp_apetkau 138M Jan 21 13:03 SRR3028792_2.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 110M Jan 21 13:03 SRR3028792_1.fastq.gz



Now the files end in `.fastq.gz` instead of `.fastq` and are much smaller (`SRR3028792_1.fastq.gz` is **110 MB** compared to **396 MB** for the uncompressed file).
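
Plain `gzip` replaces the original file with the compressed one. If you would rather keep the uncompressed file around, you can write the compressed copy to a new file with `gzip -c`. Below is a minimal sketch on a small synthetic file; the file name `toy.fastq` and its contents are made up for illustration, and real FASTQ files will compress by a different ratio.

```shell
# Build a small, repetitive stand-in file (hypothetical contents, not real reads).
for i in $(seq 1 1000); do
    printf '@read.%s\nACGTACGTACGTACGTACGT\n+\nFFFFFFFFFFFFFFFFFFFF\n' "$i"
done > toy.fastq

# -c writes the compressed data to standard output, leaving toy.fastq in place.
gzip -c toy.fastq > toy.fastq.gz

# Compare the sizes in bytes.
original_bytes=$(wc -c < toy.fastq)
compressed_bytes=$(wc -c < toy.fastq.gz)
echo "original: $original_bytes bytes, compressed: $compressed_bytes bytes"
```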

We can also look at the file contents using the `cat` (or `zcat` for files ending in `.gz`) commands along with `head`.

In [10]:
zcat SRR3028792_1.fastq.gz | head

@SRR3028792.1 1 length=90
CTTCATAATCAGGCGATAAATGCCCACCACTTTAGGCTTTCTGGCGCGGATAGCCTCCCCAATAAAATCTTTACGCGTACGCTTTGCTTC
+SRR3028792.1 1 length=90
AABC-C--CE,-,C++@,,,<,CCC,BB,:CCE,,,;CCCCC,,6+,+++8,,:,99,,8,,:,,,,<CCEE9C,@+8+8++88C,9CCC
@SRR3028792.2 2 length=271
ATACACGCCAGCGCGAATATTTATCGCTAGCATCAGGAAAACCAACGAAACAGCCCTCGTTGATTTGCCCCCTGAACAGTGGGCTTTTTTCGCGTACAGATGGCTCCGGCATTAGACCTGTGACATCCCGATACTATTCACATTCCGCAGCTCATACTTCACTTCTGCCCGTAATACGCCTTCGGACCATGCTTACGCATGACGTGTTTATTCATCAGATCATCGTCGATGGGATTCAGCGCCGGGTTAATGCCGCGCGCAATCCACGCCA
+SRR3028792.2 2 length=271
-86A<F-@F7+C+C@77C,CFF9EE8FE7+C,CE,,,,,,,;@,,8,,,,+,,9:,9,+::,,CEE,CEEDEF7,,:,,,,,,:CEFFEB=+8+++++,,:,,49A?++++8B,,,:??,:,,:,CEFF7+8++:,:A@,>,DDDD,+++88@,@,@CCC,@@B>@,>@D*@**6,5*4*6>?***>:*<,5=>,>*;5=*,,*2*<:C+<CF7A?+++3**;:*9C))7+***29F**9)7<5))8>)*9*8;5;)7)7((-57)5(4:(
@SRR3028792.3 3 length=181
TGTCGAGATGGTGGCTGAACTGCATCGGGTCTATAGCGCCCAGACTCAGGCGTATGTCCTGTTTAATGAAGAGAGTGCACTTTCGCAAGCGCTTCTGTTGCAACGTCATGGGGAAGAAGATTTTCTTGCCTTTCTCCGGGC

The command `zcat` uncompresses any `*.gz` files and then prints out the contents. Since these are large files we likely only want to look at the first few lines in the file. The command `head` will print out the first 10 lines by default (and ignore all the rest). The `|` character tells bash that we want to take the output of `zcat SRR3028792_1.fastq.gz` (which will print all the lines of a file) and forward it into the `head` command (which will print out the first 10 lines of data and ignore the rest).

The end result is we see the first 10 lines of the `SRR3028792_1.fastq.gz` file.
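
The same kind of pipe can also be used to count reads. In the output above each read takes up four lines (header, sequence, `+` line, quality string), so the number of reads is the number of lines divided by four. Here is a minimal sketch using a tiny made-up file (`toy_reads.fastq.gz` is an assumed name) in place of the real data. It uses `gzip -dc`, which decompresses to standard output just like `zcat` does here.

```shell
# Create a tiny gzip-compressed FASTQ file with three made-up reads.
printf '@r1\nACGT\n+\nFFFF\n@r2\nGGCC\n+\nFFFF\n@r3\nTTAA\n+\nFFFF\n' | gzip -c > toy_reads.fastq.gz

# Decompress to standard output, count lines, and divide by 4 lines per read.
read_count=$(gzip -dc toy_reads.fastq.gz | awk 'END { print NR / 4 }')
echo "reads: $read_count"
```

Run against `SRR3028792_1.fastq.gz`, the result should agree with the `spots read` value reported by `fasterq-dump`, assuming every read spans exactly four lines.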

We can do the same thing for the reverse reads.

In [11]:
zcat SRR3028792_2.fastq.gz | head

@SRR3028792.1 1 length=300
CTCTCAAACCTTCCTCGTTACTTTTTTCTTTCCGCCTCTCTCCTCCCCCCCCCGCCTCCCCTCTTTTCCCTTTCTCTCCTGCTTCTGCCCCTCTCTCTTCTCCTCCTCTGCCTCTTCCTCCTCCCCCTTACTTCTTCCTCTCTTTCCTCCCCCTCTCCTTCCTCCTCCCCCCCTCTCCCTCCCCCCTTCCCTCCCCTCTCCCTCCCCCCCCTCCCCCCCCCTCCCTCCCTCCTCCCCCCTCTCTCTCCCCTCTCTCCCTCCCCCCTCCCCCCCCTCTCCTCCCCCCCCCCCCTCCCCCCC
+SRR3028792.1 1 length=300
--,-8,-,;:,;,;,;,:,,,,,6;;@,<@,;;,,8,8;,66;;;,88++++++678,,,,6,,6,,,6,9:?,59,:95,,:9,:,,,,9:+995:9:,:,9,4,999,,9,94,99,,8,,,++84,,,,3,8,,,,,63,,3,,33,+++++,6,,,,,,,,,,,+++*******0******))))*)+*))))))0)))))))))(((((((((((,((,((,(((((((((((((,(())))(,((((.))(((((((((((((((((((,((()((((((((,((((,,(((((
@SRR3028792.2 2 length=300
TTGCTTCGATTTCTCTCTCCCTTCACCCGGCGCTGACTCCCCTCTCCTCTTCTCTGATGCCTCCTCACTTCATCCTTCCTCCTCCTCCCCCTTCGTATTACTTTCCTCCGTCCCTTCTGCTCTCCTCCCTCTCCCTCTTTTCTCTCTTTCACATTTCTACTTCCGCCCCCCTCTTTCCCCCCACCCACCCCCCTTTTCCGCGTGCCTCTCCACTCTCGCTTTTTCCTTCCTTCTCCTGTTTCTCCCTCTCACTCTTCCCTCTTGCTTTTTTCTTTCTCTTCTACCCCTCTGCCCCTTCCC
+SRR3028792.2 2 

*Tip: In addition to `head` which prints the first few lines of a file, there is also `tail` which prints the last few lines of a file.*
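
Both commands accept `-n` to control how many lines are printed. Since each read in these files spans four lines, `head -n 4` shows just the first read and `tail -n 4` just the last one. Here is a small sketch on a made-up file (`toy_tail.fastq` is an assumed name):

```shell
# Two made-up reads, four lines each.
printf '@r1\nACGT\n+\nFFFF\n@r2\nGGCC\n+\nIIII\n' > toy_tail.fastq

# First read only (first 4 lines).
first_read=$(head -n 4 toy_tail.fastq)
# Last read only (last 4 lines).
last_read=$(tail -n 4 toy_tail.fastq)

echo "$first_read"
echo "$last_read"
```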

# 3. Quality filter files (`fastp`)

The software [fastp](https://github.com/OpenGene/fastp) can be used to generate a report of the quality of the sequence reads as well as remove any poor-quality reads which might cause issues for downstream analysis.

## 3.1. Install `fastp` using conda

We can first install `fastp` using conda.

In [12]:
conda create -y -n fastp fastp

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/CSCScience.ca/apetkau/miniconda3/envs/fastp

  added / updated specs:
    - fastp


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-1_gnu
  fastp              bioconda/linux-64::fastp-0.23.2-hd36eab0_1
  isa-l              conda-forge/linux-64::isa-l-2.30.0-ha770c72_4
  libdeflate         conda-forge/linux-64::libdeflate-1.9-h7f98852_0
  libgcc-ng          conda-forge/linux-64::libgcc-ng-11.2.0-h1d223b6_11
  libgomp            conda-forge/linux-64::libgomp-11.2.0-h1d223b6_11
  libstdcxx-ng       conda-forge/linux-64::libstdcxx-ng-11.2.0-he4da1e4_11


Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate fastp
#
# To deactivate an a

Now let's take a look at all the different options available for running `fastp`.

In [13]:
conda run -n fastp fastp --help

usage: fastp [options] ... 
options:
  -i, --in1                            read1 input file name (string [=])
  -o, --out1                           read1 output file name (string [=])
  -I, --in2                            read2 input file name (string [=])
  -O, --out2                           read2 output file name (string [=])
      --unpaired1                      for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
      --unpaired2                      for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --unpaired1 (default mode), both unpaired reads will be written to this same file. (string [=])
      --overlapped_out                 for each read pair, output the overlapped region if it has no any mismatched base. (string [=])
      --failed_out                     specify the file to store reads that cannot pass the filters. (string [=])
  -m, --merge  

## 3.2. Evaluate quality of sequence reads

To run `fastp` on our set of sequence reads, we can pass in the `*.fastq.gz` files to the command and specify the output file names.

In [14]:
conda run -n fastp fastp --in1 SRR3028792_1.fastq.gz --in2 SRR3028792_2.fastq.gz --out1 SRR3028792_1.fp.fastq.gz --out2 SRR3028792_2.fp.fastq.gz --html SRR3028792.html --json SRR3028792.json

Read1 before filtering:
total reads: 824262
total bases: 176567188
Q20 bases: 171950867(97.3855%)
Q30 bases: 163432314(92.561%)

Read2 before filtering:
total reads: 824262
total bases: 177556496
Q20 bases: 153139830(86.2485%)
Q30 bases: 134061609(75.5036%)

Read1 after filtering:
total reads: 788353
total bases: 166850080
Q20 bases: 163779407(98.1596%)
Q30 bases: 156905038(94.0395%)

Read2 after filtering:
total reads: 788353
total bases: 166698912
Q20 bases: 149106629(89.4467%)
Q30 bases: 132133429(79.2647%)

Filtering result:
reads passed filter: 1576706
reads failed due to low quality: 71818
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 46646
bases trimmed due to adapters: 1033979

Duplication rate: 0.665322%

Insert size peak (evaluated by paired-end reads): 35

JSON report: SRR3028792.json
HTML report: SRR3028792.html

fastp --in1 SRR3028792_1.fastq.gz --in2 SRR3028792_2.fastq.gz --out1 SRR3028792_1.fp.fastq.gz --out2 SRR3028792_2.

The meaning of the following command-line arguments is as follows:

* `--in1 SRR3028792_1.fastq.gz --in2 SRR3028792_2.fastq.gz`: The first pair of fastq reads (`--in1`) and second pair (`--in2`) to examine.
* `--out1 SRR3028792_1.fp.fastq.gz --out2 SRR3028792_2.fp.fastq.gz`: The output file names for reads that pass the default quality filters, `--out1` for the first pair and `--out2` for the second pair.
* `--html SRR3028792.html`: The name of the HTML report, which can be opened in a web browser.
* `--json SRR3028792.json`: The name of the JSON report (an alternative format to HTML).

Once `fastp` is finished you can open up the `SRR3028792.html` report in any web browser. In Jupyter you can right-click and select **Open in New Browser Tab**:

![jupyter-fastp-report.png](../images/jupyter-fastp-report.png)
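
The `Filtering result` block in the `fastp` output above can be turned into a percentage with a little arithmetic. The sketch below uses the two counts printed there (1576706 reads passed, 71818 failed due to low quality) and `awk` for the floating-point division. The variable names are only illustrative.

```shell
# Counts copied from the "Filtering result" section of the fastp output.
passed=1576706
failed_low_quality=71818

# Total reads before filtering (forward + reverse): 2 x 824262.
total=$((passed + failed_low_quality))

# bash arithmetic is integer-only, so use awk for the percentage.
percent_passed=$(awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.2f", 100 * p / t }')
echo "total reads: $total"
echo "passed filter: $percent_passed%"
```

The total matches the `reads read` value reported by `fasterq-dump` earlier, which is a quick sanity check that no reads went missing between steps.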

# 4. Genome assembly (`skesa`)

Now that we have removed poor-quality reads from our **fastq** files we can move on to genome assembly, which attempts to fit all the smaller sequence reads in the **fastq** files together to produce longer contiguous sequences (contigs). Ideally you would have one contig per chromosome/plasmid but often chromosomes/plasmids will be broken up into smaller contigs, normally at repetitive regions on the genome that are longer than the read length.

## 4.1. Install skesa

We will use [skesa](https://github.com/ncbi/SKESA) to assemble the sequence reads. Skesa is a *de-novo* genome assembler that is designed specifically for microbial genomes. The first step is installing the software using conda.

In [15]:
conda create -y -n skesa skesa

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/CSCScience.ca/apetkau/miniconda3/envs/skesa

  added / updated specs:
    - skesa


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-1_gnu
  boost              conda-forge/linux-64::boost-1.70.0-py38h9de70de_1
  boost-cpp          conda-forge/linux-64::boost-cpp-1.70.0-h7b93d67_3
  bzip2              conda-forge/linux-64::bzip2-1.0.8-h7f98852_4
  ca-certificates    conda-forge/linux-64::ca-certificates-2021.10.8-ha878542_0
  icu                conda-forge/linux-64::icu-67.1-he1b5a44_0
  ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.36.1-hea4e1c9_2
  libblas            conda-forge/linux-64::libblas-3.9.0-13_linux64_openblas
  libcblas           conda-forge/linux-64::libcblas-3.9.0-13_linux64_openblas
  libffi  

In [16]:
conda run -n skesa skesa --help


General options:
  -h [ --help ]                 Produce help message
  -v [ --version ]              Print version
  --cores arg (=0)              Number of cores to use (default all) [integer]
  --memory arg (=32)            Memory available (GB, only for sorted counter) 
                                [integer]
  --hash_count                  Use hash counter [flag]
  --estimated_kmers arg (=100)  Estimated number of unique kmers for bloom 
                                filter (millions, only for hash counter) 
                                [integer]
  --skip_bloom_filter           Don't do bloom filter; use --estimated_kmers as
                                the hash table size (only for hash counter) 
                                [flag]

Input/output options : at least one input providing reads for assembly must be specified:
  --reads arg                   Input fasta/fastq file(s) for reads (could be 
                                used multiple times for different ru

## 4.2. Assemble genome

Next we will run `skesa` on our filtered **fastq** files.

In [17]:
conda run -n skesa skesa --reads SRR3028792_1.fp.fastq.gz,SRR3028792_2.fp.fastq.gz --contigs_out SRR3028792.fasta

skesa --reads SRR3028792_1.fp.fastq.gz,SRR3028792_2.fp.fastq.gz --contigs_out SRR3028792.fasta 

Total mates: 1576706 Paired reads: 788353
Reads acquired in  14.197494s wall, 13.940000s user + 0.230000s system = 14.170000s CPU (99.8%)

Kmer len: 19
Raw kmers: 305168274 Memory needed (GB): 5.85923 Memory available (GB): 29.8726 1 cycle(s) will be performed
Distinct kmers: 172827
Kmer count in  3.678642s wall, 148.840000s user + 9.670000s system = 158.510000s CPU (4308.9%)
Uniq kmers merging in  0.172947s wall, 0.060000s user + 0.480000s system = 0.540000s CPU (312.2%)
Adapters: 0 Reads before: 1576706 Sequence before: 333548982 Reads after: 1576706 Sequence after: 333548982 Reads clipped: 0
Adapters clipped in  3.858329s wall, 148.900000s user + 10.150000s system = 159.050000s CPU (4122.3%)

Kmer len: 21
Raw kmers: 302014862 Memory needed (GB): 5.79869 Memory available (GB): 29.783 1 cycle(s) will be performed
Distinct kmers: 5886038
Kmer count in  3.480091s wall, 151.330000s user + 9.1

The meaning of the command-line arguments is as follows:

* `--reads SRR3028792_1.fp.fastq.gz,SRR3028792_2.fp.fastq.gz`: The sequence reads. For paired-end data each pair of files needs to be separated by a comma `,`.
* `--contigs_out SRR3028792.fasta`: The output file containing the longer contiguous sequences (contigs), in **fasta** format.
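
Since every file name follows from the accession number, the same `fastp` and `skesa` commands can be generated for several samples in a loop. The following is a dry-run sketch, not part of the original notebook: it only writes the commands to a file (`commands.txt`, an assumed name) so they can be reviewed first, and it assumes the fastq files for each accession already exist.

```shell
for s in SRR3028792 SRR3028793; do
    # Quality filtering, using the same options as shown earlier.
    echo "conda run -n fastp fastp --in1 ${s}_1.fastq.gz --in2 ${s}_2.fastq.gz --out1 ${s}_1.fp.fastq.gz --out2 ${s}_2.fp.fastq.gz --html ${s}.html --json ${s}.json"
    # Assembly of the filtered reads.
    echo "conda run -n skesa skesa --reads ${s}_1.fp.fastq.gz,${s}_2.fp.fastq.gz --contigs_out ${s}.fasta"
done > commands.txt

# Review the generated commands before running any of them.
cat commands.txt
```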

## 4.3. Examine output contigs (`seqkit`)

We can examine the output contigs file quickly on the command-line using the [seqkit](https://bioinf.shenwei.me/seqkit/) command-line tool.

### 4.3.1. Install `seqkit`

To install `seqkit` we can use conda.

In [18]:
conda create -y -n seqkit seqkit

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/CSCScience.ca/apetkau/miniconda3/envs/seqkit

  added / updated specs:
    - seqkit


The following NEW packages will be INSTALLED:

  seqkit             bioconda/linux-64::seqkit-2.1.0-h9ee0642_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate seqkit
#
# To deactivate an active environment, use
#
#     $ conda deactivate




In [19]:
conda run -n seqkit seqkit stats --help

simple statistics of FASTA/Q files

Tips:
  1. For lots of small files (especially on SDD), use big value of '-j' to
     parallelize counting.

Usage:
  seqkit stats [flags]

Aliases:
  stats, stat

Flags:
  -a, --all                  all statistics, including quartiles of seq length, sum_gap, N50
  -b, --basename             only output basename of files
  -E, --fq-encoding string   fastq quality encoding. available values: 'sanger', 'solexa', 'illumina-1.3+', 'illumina-1.5+', 'illumina-1.8+'. (default "sanger")
  -G, --gap-letters string   gap letters (default "- .")
  -h, --help                 help for stats
  -i, --stdin-label string   label for replacing default "-" for stdin (default "-")
  -T, --tabular              output in machine-friendly tabular format

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
      --id-ncbi                   

### 4.3.2. Print stats of contigs

To print basic stats on the assembled genome (`SRR3028792.fasta`) we can run the following:

In [20]:
conda run -n seqkit seqkit stats -a SRR3028792.fasta

file              format  type  num_seqs    sum_len  min_len    avg_len    max_len   Q1      Q2       Q3  sum_gap      N50  Q20(%)  Q30(%)
SRR3028792.fasta  FASTA   DNA         38  4,842,245      370  127,427.5  1,022,474  803  45,160  184,096        0  298,538       0       0




This gives us some basic information on our assembled genome. For example, `sum_len` is the total length of all contigs (and so the approximate size of our genome, around 4.8 million bp), and `num_seqs` is the number of contigs in our assembly.
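
The `N50` column is worth a closer look. Under the usual definition, you sort the contigs from longest to shortest and add up their lengths; N50 is the length of the contig at which the running total first reaches half of `sum_len`. The sketch below recomputes `num_seqs`, `sum_len` and N50 for a toy FASTA file using only standard tools. The file `toy_contigs.fasta` and its sequences are made up, and this illustrates the definition rather than how `seqkit` implements it.

```shell
# Toy assembly with contig lengths 10, 6, 4 and 2 (made-up sequences).
cat > toy_contigs.fasta <<'EOF'
>contig_1
ACGTACGTAC
>contig_2
ACGTAC
>contig_3
ACGT
>contig_4
AC
EOF

# Number of contigs = number of header lines.
num_seqs=$(grep -c '^>' toy_contigs.fasta)

# One length per contig (sequence lines between headers are summed).
lengths=$(awk '/^>/ { if (len) print len; len = 0; next }
               { len += length($0) }
               END { if (len) print len }' toy_contigs.fasta)

# Total assembly length.
sum_len=$(echo "$lengths" | awk '{ s += $1 } END { print s }')

# Walk from longest to shortest; report the length where the running total reaches half.
n50=$(echo "$lengths" | sort -rn | awk -v total="$sum_len" \
    '{ cum += $1; if (!found && cum >= total / 2) { print $1; found = 1 } }')

echo "num_seqs=$num_seqs sum_len=$sum_len N50=$n50"
```

Here the total is 22, so half is 11. The longest contig (10) falls just short, and adding the second (6) passes it, so N50 is 6.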

# 5. Evaluate assembly quality (`quast`)

While `seqkit stats` gives us a quick way to examine the assembled genome, it can also be nice to have a visual and interactive report. This is where [quast](https://github.com/ablab/quast) comes in, as it can produce an interactive report that you can open in your web browser.

You can upload the `SRR3028792.fasta` file to the quast website at <http://cab.cc.spbu.ru/quast/>. However, we will go over how to run quast on the command-line (useful if you have many genomes you want to examine at once).

## 5.1. Install `quast`

The first step is to install a local version of `quast` using conda.

In [21]:
conda create -y -n quast quast

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/CSCScience.ca/apetkau/miniconda3/envs/quast

  added / updated specs:
    - quast


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-1_gnu
  blast              bioconda/linux-64::blast-2.12.0-pl5262h3289130_0
  brotli             conda-forge/linux-64::brotli-1.0.9-h7f98852_6
  brotli-bin         conda-forge/linux-64::brotli-bin-1.0.9-h7f98852_6
  bzip2              conda-forge/linux-64::bzip2-1.0.8-h7f98852_4
  c-ares             conda-forge/linux-64::c-ares-1.18.1-h7f98852_0
  ca-certificates    conda-forge/linux-64::ca-certificates-2021.10.8-ha878542_0
  certifi            conda-forge/linux-64::certifi-2021.10.8-py37h89c1867_1
  circos             bioconda/noarch::circos-0.69.8-hdfd78af_1
  curl               conda-forge/

In [22]:
conda run -n quast quast --help

QUAST: Quality Assessment Tool for Genome Assemblies
Version: 5.0.2

Usage: python /home/CSCScience.ca/apetkau/miniconda3/envs/quast/bin/quast [options] <files_with_contigs>

Options:
-o  --output-dir  <dirname>       Directory to store all result files [default: quast_results/results_<datetime>]
-r                <filename>      Reference genome file
-g  --features [type:]<filename>  File with genomic feature coordinates in the reference (GFF, BED, NCBI or TXT)
                                  Optional 'type' can be specified for extracting only a specific feature type from GFF
-m  --min-contig  <int>           Lower threshold for contig length [default: 500]
-t  --threads     <int>           Maximum number of threads [default: 25% of CPUs]

Advanced options:
-s  --split-scaffolds                 Split assemblies by continuous fragments of N's and add such "contigs" to the comparison
-l  --labels "label, label, ..."      Names of assemblies to use in reports, comma-separated. If cont

## 5.2. Create a quality report with `quast`

Now we can run quast as follows:

In [23]:
conda run -n quast quast SRR3028792.fasta

/home/CSCScience.ca/apetkau/miniconda3/envs/quast/bin/quast SRR3028792.fasta

Version: 5.0.2

System information:
  OS: Linux-4.15.0-161-generic-x86_64-with-debian-buster-sid (linux_64)
  Python version: 3.7.12
  CPUs number: 56

Started: 2022-01-21 13:07:14

Logging to /home/CSCScience.ca/apetkau/workspace/2022-01-26-Downloading-and-assembling-microbial-sequence-data/tutorial/quast_results/results_2022_01_21_13_07_14/quast.log
NOTICE: Maximum number of threads is set to 14 (use --threads option to set it manually)

CWD: /home/CSCScience.ca/apetkau/workspace/2022-01-26-Downloading-and-assembling-microbial-sequence-data/tutorial
Main parameters: 
  MODE: default, threads: 14, minimum contig length: 500, minimum alignment length: 65, \
  ambiguity: one, threshold for extensive misassembly size: 1000

Contigs:
  Pre-processing...
  SRR3028792.fasta ==> SRR3028792

2022-01-21 13:07:15
Running Basic statistics processor...
  Contig files: 
    SRR3028792
  Calculating N50 and L50...
    SRR

If you have more than one genome to evaluate you can list them all on the command line (`quast genome1.fasta genome2.fasta ...`) or just use `quast *.fasta`.

### 5.2.1 Zip up report for download

The quast report is written to `quast_results/latest/report.html`, which you can open in a web browser. However, if you are running this in Jupyter, you first have to download the entire results directory to your local computer before viewing it.

We can download the entire `quast_results/` directory by first zipping up the quast report. You can create a `*.zip` archive from the command-line with the `zip` command.

In [24]:
zip -r quast.zip quast_results

  adding: quast_results/ (stored 0%)
  adding: quast_results/results_2022_01_21_13_07_14/ (stored 0%)
  adding: quast_results/results_2022_01_21_13_07_14/report.tsv (deflated 57%)
  adding: quast_results/results_2022_01_21_13_07_14/quast.log (deflated 71%)
  adding: quast_results/results_2022_01_21_13_07_14/transposed_report.tsv (deflated 58%)
  adding: quast_results/results_2022_01_21_13_07_14/report.pdf (deflated 27%)
  adding: quast_results/results_2022_01_21_13_07_14/icarus.html (deflated 85%)
  adding: quast_results/results_2022_01_21_13_07_14/report.tex (deflated 64%)
  adding: quast_results/results_2022_01_21_13_07_14/transposed_report.txt (deflated 67%)
  adding: quast_results/results_2022_01_21_13_07_14/transposed_report.tex (deflated 60%)
  adding: quast_results/results_2022_01_21_13_07_14/report.html (deflated 73%)
  adding: quast_results/results_2022_01_21_13_07_14/basic_stats/ (stored 0%)
  adding: quast_results/results_2022_01_21_13_07_14/basic_stats/SRR3028792_GC_content

Now you can download the zip file by right-clicking `quast.zip` and selecting **Download**.

![jupyter-quast-zip.png](../images/jupyter-quast-zip.png)

Next, you can unzip the `quast.zip` file on your local computer and open `quast_results/latest/report.html` in a web browser. You should end up with something that looks like:

![quast-report.png](../images/quast-report.png)

# 6. Where to go from here

An assembled genome file (`SRR3028792.fasta`) often forms the basis for other downstream analysis. Here are a few examples you can try on your own.

## 6.1. Anti-microbial resistance

Try uploading `SRR3028792.fasta` to CARD/RGI (which will identify genes associated with antimicrobial resistance): https://card.mcmaster.ca/analyze/rgi

## 6.2. Multi-locus sequence typing

Try downloading and running [mlst](https://github.com/tseemann/mlst) on the `SRR3028792.fasta` file, which will give you the [Multi-locus sequence type](https://en.wikipedia.org/wiki/Multilocus_sequence_typing) of the genome. You can do this with the following commands.

First we will install [mamba](https://mamba.readthedocs.io/en/latest/), which is a faster version of `conda`. For most cases using the `conda` command is okay, but sometimes conda is slow at installing tools and `mamba` is meant to speed things up. I found that `conda` was too slow to install `mlst`.

In [3]:
conda install -y mamba

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

(base) 


Next we can use `mamba` to install `mlst`. `mamba` is meant to replace `conda`, so you can run the same `conda create` command you're used to, but with `mamba create` in its place.

In [7]:
# Text starting with "#" is a comment and not a command
# The below is the equivalent `conda` command which we are replacing with `mamba`
# I use "-q" to print less information when running mamba
# conda create -y -n mlst mlst
mamba create -q -y -n mlst mlst

  Package                                       Version  Build             Channel                    Size
────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Install:
────────────────────────────────────────────────────────────────────────────────────────────────────────────

  + _libgcc_mutex                                   0.1  conda_forge       conda-forge/linux-64     Cached
  + _openmp_mutex                                   4.5  1_gnu             conda-forge/linux-64     Cached
  + any2fasta                                     0.4.2  hdfd78af_3        bioconda/noarch          Cached
  + blast                                        2.12.0  pl5262h3289130_0  bioconda/linux-64        Cached
  + bzip2                                         1.0.8  h7f98852_4        conda-forge/linux-64     Cached
  + c-ares

Once the tools are installed with `mamba`, we use the regular `conda` command to run them.

In [2]:
conda run -n mlst mlst SRR3028792.fasta

SRR3028792.fasta	senterica	15	aroC(2)	dnaN(7)	hemD(9)	hisD(9)	purE(5)	sucA(9)	thrA(12)

[14:22:22] This is mlst 2.19.0 running on linux with Perl 5.026002
[14:22:22] Checking mlst dependencies:
[14:22:22] Found 'blastn' => /home/CSCScience.ca/apetkau/miniconda3/envs/mlst/bin/blastn
[14:22:22] Found 'any2fasta' => /home/CSCScience.ca/apetkau/miniconda3/envs/mlst/bin/any2fasta
[14:22:23] Found blastn: 2.12.0+ (002012)
[14:22:23] Excluding 2 schemes: ecoli_2 abaumannii
[14:22:24] Found exact allele match senterica.hemD-9
[14:22:24] Found exact allele match senterica.hisD-9
[14:22:24] Found exact allele match senterica.dnaN-7
[14:22:24] Found exact allele match ecoli.gyrB-213
[14:22:24] Found exact allele match senterica.aroC-2
[14:22:24] Found exact allele match senterica.sucA-9
[14:22:24] Found exact allele match senterica.purE-5
[14:22:24] Found exact allele match senterica.thrA-12
[14:22:25] You can use --label XXX to replace an ugly filename in the output.
[14:22:25] Done.
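
The result line at the top of the `mlst` output is tab-separated: file name, scheme, sequence type, then one column per allele. That makes it easy to pull out individual fields with `awk`. Here is a sketch using the result line shown above, stored in a hypothetical file `mlst.tsv`:

```shell
# The result line from above, written with literal tabs (mlst.tsv is an assumed name).
printf 'SRR3028792.fasta\tsenterica\t15\taroC(2)\tdnaN(7)\themD(9)\thisD(9)\tpurE(5)\tsucA(9)\tthrA(12)\n' > mlst.tsv

# Column 2 is the scheme and column 3 is the sequence type.
scheme=$(awk -F '\t' '{ print $2 }' mlst.tsv)
sequence_type=$(awk -F '\t' '{ print $3 }' mlst.tsv)
echo "scheme=$scheme ST=$sequence_type"
```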




## 6.3. Serotyping

As this is a *Salmonella* genome, try running the [sistr](https://github.com/phac-nml/sistr_cmd) command on the genome to get the [serotype](https://en.wikipedia.org/wiki/Serotype).

In [5]:
mamba create -q -y -n sistr sistr_cmd

  Package                         Version  Build                Channel                    Size
─────────────────────────────────────────────────────────────────────────────────────────────────
  Install:
─────────────────────────────────────────────────────────────────────────────────────────────────

  + _libgcc_mutex                     0.1  conda_forge          conda-forge/linux-64     Cached
  + _openmp_mutex                     4.5  1_gnu                conda-forge/linux-64     Cached
  + blast                          2.12.0  pl5262h3289130_0     bioconda/linux-64        Cached
  + blosc                          1.21.0  h9c3ff4c_0           conda-forge/linux-64     Cached
  + bzip2                           1.0.8  h7f98852_4           conda-forge/linux-64     Cached
  + c-ares                         1.18.1  h7f98852_0           conda-forge/linux-64     Cached


In [10]:
conda run -n sistr sistr --help

usage: sistr_cmd [-h] [-i fasta_path genome_name] [-f OUTPUT_FORMAT]
                 [-o OUTPUT_PREDICTION] [-M] [-p CGMLST_PROFILES]
                 [-n NOVEL_ALLELES] [-a ALLELES_OUTPUT] [-T TMP_DIR] [-K]
                 [--use-full-cgmlst-db] [--no-cgmlst] [-m] [--qc] [-t THREADS]
                 [-v] [-V]
                 [F [F ...]]

SISTR (Salmonella In Silico Typing Resource) Command-line Tool
Serovar predictions from whole-genome sequence assemblies by determination of antigen gene and cgMLST gene alleles using BLAST.

Note about using the "--use-full-cgmlst-db" flag:
    The "centroid" allele database is ~10% the size of the full set so analysis is much quicker with the "centroid" vs "full" set of alleles. Results between 2 cgMLST allele sets should not differ.

If you find this program useful in your research, please cite as:

The Salmonella In Silico Typing Resource (SISTR): an open web-accessible tool for rapidly typing and subtyping draft Salmonella genome assemblies.


In [11]:
conda run -n sistr sistr -f csv -o SRR3028792.sistr.csv SRR3028792.fasta




The output is saved to the file `SRR3028792.sistr.csv`. We can view it with `cat`. Also, since it's a **csv** file, you can download it and open it in any spreadsheet software (e.g., Excel).

In [12]:
cat SRR3028792.sistr.csv

cgmlst_ST,cgmlst_distance,cgmlst_found_loci,cgmlst_genome_match,cgmlst_matching_alleles,cgmlst_subspecies,fasta_filepath,genome,h1,h2,o_antigen,serogroup,serovar,serovar_antigen,serovar_cgmlst
3331578236,0.018181818181818188,330,2017-MER-0640,324,enterica,/home/CSCScience.ca/apetkau/workspace/2022-01-26-Downloading-and-assembling-microbial-sequence-data/tutorial/SRR3028792.fasta,SRR3028792,r,"1,2","1,4,[5],12",B,Heidelberg,Heidelberg|Bradford,Heidelberg
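
If you only want a field or two on the command line, be aware that splitting this file on commas is not straightforward: the `h2` and `o_antigen` values are quoted and contain commas themselves, so column numbers counted from the left come out wrong. The sketch below shows the pitfall and one shortcut, counting columns from the right. The shortcut only works because none of the last four columns contain commas in this output; a spreadsheet or a proper CSV parser is the robust option. The file name `sistr_example.csv` is an assumption.

```shell
# The two lines shown above, saved to a hypothetical file.
cat > sistr_example.csv <<'EOF'
cgmlst_ST,cgmlst_distance,cgmlst_found_loci,cgmlst_genome_match,cgmlst_matching_alleles,cgmlst_subspecies,fasta_filepath,genome,h1,h2,o_antigen,serogroup,serovar,serovar_antigen,serovar_cgmlst
3331578236,0.018181818181818188,330,2017-MER-0640,324,enterica,/home/CSCScience.ca/apetkau/workspace/2022-01-26-Downloading-and-assembling-microbial-sequence-data/tutorial/SRR3028792.fasta,SRR3028792,r,"1,2","1,4,[5],12",B,Heidelberg,Heidelberg|Bradford,Heidelberg
EOF

# Pitfall: "serovar" is the 13th header column, but a naive split lands inside a quoted field.
naive=$(tail -n 1 sistr_example.csv | awk -F ',' '{ print $13 }')

# Shortcut: count from the right, where no field contains a comma.
serovar=$(tail -n 1 sistr_example.csv | awk -F ',' '{ print $(NF-2) }')
serogroup=$(tail -n 1 sistr_example.csv | awk -F ',' '{ print $(NF-3) }')

echo "naive column 13: $naive"
echo "serogroup=$serogroup serovar=$serovar"
```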

