# Module: Assembly of Sequencing Data Into Contigs

Short reads produced by next-generation sequencing technologies can be assembled into longer contiguous sequences, commonly called contigs. Longer sequences give better resolution for downstream analyses such as functional annotation and comparative genomics, and so provide a clearer and more accurate picture of the genomic or transcriptomic architecture of the biological system under study.

This module introduces tools widely used for _de novo_ sequence assembly: (1) SPAdes for genome assembly, (2) MEGAHIT for metagenomics, and (3) Trinity for RNA-seq. It also covers the evaluation of assembly quality using (meta)QUAST.

Created by: _Microbial Oceanography Laboratory (MOLab)_

---
## How to Use This Notebook

1. Make sure the tools are already installed (see **Tools Used** below if they are not).
2. Activate the conda environment, replacing the environment name as needed.
```bash
conda activate assembly-env
```
3. Open Jupyter Notebook with the command below and select this notebook.
```bash
jupyter notebook
```
4. To run a cell in this notebook, press Shift+Enter.

---
## Tools Used
1. **SPAdes**
2. **MEGAHIT**
3. **QUAST**
4. **Trinity**

To install tools (1) to (3), locate the `assembly.yaml` file in the same folder as this notebook in the repository, then run the command below in the terminal:
```bash
conda env create -f assembly.yaml
```

To install tool (4), a Docker image of Trinity can be pulled. Instructions are outlined here: [Trinity Docker Installation](https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-in-Docker).

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
Assembly is resource-intensive and may not be feasible to run on a personal computer. Access to a high-performance computing (HPC) cluster is ideal; alternatively, you can run these tools on the <a href='https://usegalaxy.eu/'>Galaxy EU web server</a>.
</div>

---
## Starting Files 

1. Illumina paired-end FASTQ files that have already undergone quality control (i.e., low-quality regions and adapters have been trimmed); see the **Quality Control Module**.
2. (Optional) FASTA file of reference genome assembly.

---
## Expected Outputs

1. FASTA file of assembly.
2. Quality reports.

---
## Table of Contents
 * [**Genome Assembly**](#Genome-Assembly)
     * [Default Mode](#Default)
     * [Careful Mode](#--careful)
     * [Isolate Mode](#--isolate)
 * [**Metagenome Assembly**](#Metagenome-Assembly)
 * [**RNA-Seq Assembly**](#RNA-Seq-Assembly)
 * [**Assembly Evaluation**](#Assembly-Evaluation)
     * [QUAST](#QUAST)

----
# <font color = 'gray'>Genome Assembly</font>

For genome assembly, we will use `spades`. The genome assembly component of `spades` has been developed for bacterial isolates. If your target organism has a much larger genome, consider other tools that have been optimized for such scenarios.

No single assembler or set of assembly parameters yields the "best" assembly for every dataset. Hence, in the cells below, we generate several assemblies using different SPAdes modes. Specifically, we run `spades` with the default, `--careful`, and `--isolate` settings. We assume that your forward and reverse reads are named `WGS1_ofp.fastq` and `WGS1_orp.fastq`, respectively.

If you want to explore more options in `spades`, you can refer to their documentation: [SPAdes Wiki](https://github.com/ablab/spades/wiki).
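
Before launching an assembly, it can help to confirm that the forward and reverse files are still in sync after quality control. The cell below is a minimal sketch of such a check, not part of SPAdes itself: the helper names `count_fastq_reads` and `pairs_in_sync` are hypothetical, and the code assumes plain four-line FASTQ records (optionally gzip-compressed).

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly 4 lines per record."""
    # Pick a gzip-aware opener for .gz files, plain open() otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def pairs_in_sync(forward, reverse):
    """Return True when both files of a paired-end library hold the same number of reads."""
    return count_fastq_reads(forward) == count_fastq_reads(reverse)


# File names follow the assumption stated above; the check is skipped if they are absent.
forward, reverse = Path("WGS1_ofp.fastq"), Path("WGS1_orp.fastq")
if forward.exists() and reverse.exists():
    print("Pairs in sync:", pairs_in_sync(forward, reverse))
```

This only compares read counts; it does not verify that read names match between the two files.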

### Default

In [None]:
!spades.py \
    -1 WGS1_ofp.fastq \
    -2 WGS1_orp.fastq \
    -o spades_result_default \
    --threads 2 \
    --memory 6

### `--careful`

The `--careful` option in SPAdes will attempt to reduce the number of mismatches and short indels.

In [None]:
!spades.py \
    --careful \
    -1 WGS1_ofp.fastq \
    -2 WGS1_orp.fastq \
    -o spades_result_careful \
    --threads 2 \
    --memory 6

### `--isolate`

The `--isolate` option in SPAdes is recommended for high-coverage isolate and multi-cell data.

In [None]:
!spades.py \
    --isolate \
    -1 WGS1_ofp.fastq \
    -2 WGS1_orp.fastq \
    -o spades_result_isolate \
    --threads 2 \
    --memory 6

In the output folder of `spades`, two important FASTA files are produced: `contigs.fasta` and `scaffolds.fasta`. We use `scaffolds.fasta`, in which contigs have been joined into longer sequences where the paired-end information supports it.
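
To get a quick feel for an assembly before running a full evaluation, you can inspect the sequence names in the SPAdes FASTA output. The sketch below assumes the usual SPAdes naming pattern (`NODE_<number>_length_<bp>_cov_<coverage>`); the function names and the length and coverage thresholds are hypothetical and chosen only for illustration.

```python
import re
from pathlib import Path

# Assumed SPAdes header pattern, e.g. ">NODE_1_length_5230_cov_14.72"
HEADER_RE = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def parse_spades_header(header):
    """Return (node_number, length_bp, coverage) parsed from a SPAdes sequence name."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"Not a SPAdes-style header: {header!r}")
    node, length, cov = match.groups()
    return int(node), int(length), float(cov)


def list_scaffolds(fasta_path, min_length=500, min_cov=2.0):
    """List (node, length, coverage) for sequences passing both illustrative thresholds."""
    kept = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                node, length, cov = parse_spades_header(line)
                if length >= min_length and cov >= min_cov:
                    kept.append((node, length, cov))
    return kept


# Output folder name taken from the default-mode run above; skipped if it does not exist.
scaffolds = Path("spades_result_default/scaffolds.fasta")
if scaffolds.exists():
    kept = list_scaffolds(scaffolds)
    print(f"{len(kept)} scaffolds pass the length and coverage thresholds")
```

Very short or very low-coverage sequences are often worth a second look, but suitable cutoffs depend on your data.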

----
# <font color = 'gray'>Metagenome Assembly</font>

Although `spades` has a `--meta` option optimized for shotgun metagenome assembly, here we use `megahit` instead because it assembles faster and is more resource-efficient.

Below, we run `megahit` on a single paired-end library (`M_ofp.fastq.gz` and `M_orp.fastq.gz`) using default options. If you have the time and resources, explore other parameters such as `--presets`, which applies groups of assembly settings tuned for metagenomes of different complexity or for improved sensitivity. Also note that, by default, `megahit` only retains contigs of at least 200 bp.

Check out other megahit assembly parameters here: [MEGAHIT metagenome assembly](https://www.metagenomics.wiki/tools/assembly/megahit).

In [None]:
!megahit \
    -1 M_ofp.fastq.gz \
    -2 M_orp.fastq.gz \
    -o megahit_output_dir

<div class="alert alert-block alert-warning">
<b>Individual assembly versus Co-assembly</b> 

In multi-sample datasets, part of the research objective is often to compare samples and/or sample groups. For assembly, this is easiest with a single reference assembly, since it makes comparisons between samples more direct. There are generally two routes to one. The first is to <i>assemble the libraries individually</i> and then merge the assemblies into a single reference assembly. The second is to <i>co-assemble</i> your sequencing libraries, that is, to pool the reads from all libraries and generate a single assembly from the pooled reads.
    
So, which approach should you use? There is no single answer, since both methods have pros and cons. Co-assembly (the second route) may improve recovery of low-coverage regions because reads are pooled, but it may also assemble reads from closely related strains into the same contig. Individual assembly (the first route) avoids this problem but does not benefit from the improved coverage that co-assembly provides. Ultimately, the best option, if possible, is to try both approaches. If you are limited on time and resources, select the method whose features better suit your objectives.
    
To perform co-assembly using <code>megahit</code>, supply the paths to your FASTQ files as comma-separated lists: all forward files to <code>-1</code> and all reverse files, in the same order, to <code>-2</code>.
    
The following readings may provide more insights into these two methods:
    <li><a href='https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-022-01259-2'>Evaluating metagenomic assembly approaches for biome-specific gene catalogues</a></li>
    <li><a href='https://angus.readthedocs.io/en/2019/recovering-rep-genomes-from-mgs.html#to-co-assemble-or-not-to-co-assemble'>Recovering “genomes” from metagenomes</a></li>
</div>
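
As an illustration of the co-assembly input format, the sketch below builds a `megahit` command from a list of paired files. The helper `megahit_coassembly_command`, the sample file names (`M1_*`, `M2_*`), and the output folder name are hypothetical; only the `-1`, `-2`, and `-o` options used earlier in this notebook are assumed.

```python
import shlex


def megahit_coassembly_command(pairs, out_dir):
    """Build a megahit command that pools several paired-end libraries.

    pairs is a list of (forward, reverse) file paths, one tuple per library.
    """
    if not pairs:
        raise ValueError("At least one paired-end library is required")
    # megahit takes comma-separated lists; forward and reverse files must be in the same order.
    forward = ",".join(fwd for fwd, _ in pairs)
    reverse = ",".join(rev for _, rev in pairs)
    return ["megahit", "-1", forward, "-2", reverse, "-o", out_dir]


# Hypothetical libraries from two samples.
pairs = [
    ("M1_ofp.fastq.gz", "M1_orp.fastq.gz"),
    ("M2_ofp.fastq.gz", "M2_orp.fastq.gz"),
]
print(shlex.join(megahit_coassembly_command(pairs, "megahit_coassembly")))
```

The printed command can be pasted into a terminal or into a notebook cell prefixed with `!`.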

After producing the metagenome assembly, you can calculate various assembly metrics using `quast`, as described in the [Assembly Evaluation](#Assembly-Evaluation) section.

----
# <font color = 'gray'>RNA-Seq Assembly</font>

Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data.

_Source: [Trinity Wiki](https://github.com/trinityrnaseq/trinityrnaseq/wiki)_

Below, we assume that `Trinity` has been set up using Docker. Make sure the Docker daemon is running before executing the cell below. If it is not, start it by running the following command in the terminal.

```bash
sudo service docker start
```

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
Trinity prefers to work with stranded RNA-seq data. Check what protocol you used in generating your RNA-seq data and supply the appropriate strandedness in the <code>--SS_lib_type</code> option. You can see more details here: <a href='https://github.com/trinityrnaseq/trinityrnaseq/wiki/Running-Trinity#strand_specific_assembly'>Strand-specific assembly</a>
</div>

In [None]:
!docker run --rm -v `pwd`:`pwd` trinityrnaseq/trinityrnaseq Trinity \
    --seqType fq \
    --SS_lib_type 'RF' \
    --left `pwd`/reads_1.fq.gz \
    --right `pwd`/reads_2.fq.gz \
    --max_memory 1G \
    --CPU 4 \
    --output `pwd`/trinity_out_dir

----
# <font color = 'gray'>Assembly Evaluation</font>

Once you’ve generated your assemblies, the next critical step is to evaluate the "quality" of the resulting contigs. Many commonly used assembly metrics focus on contiguity, which measures how long and contiguous the assembled sequences are. While contiguity provides useful insights, it does not always reflect the accuracy of an assembly.

To assess accuracy, the ideal approach is to compare your assembly to a closely related reference genome. A reference serves as a benchmark to evaluate how biologically accurate your assembly is or how well it aligns with known sequences. However, references may not always be available. For instance, if you are working with a non-model organism, high-quality assemblies might not exist in public databases. Similarly, for shotgun metagenomics, defining a suitable reference can be challenging or even impossible due to the complex and diverse nature of microbial communities.

In such cases, you may need to rely primarily on contiguity-based metrics as a proxy for assembly quality. While not perfect, these metrics can still provide a starting point for evaluating your assemblies when more comprehensive assessments are not feasible.
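
N50 is one of the most common contiguity metrics: it is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The sketch below computes it directly from a FASTA file so the definition is concrete. The function names are hypothetical, and the values may differ from a QUAST report because QUAST applies its own minimum contig length before computing statistics.

```python
from pathlib import Path


def read_fasta_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None  # running length of the sequence being read
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("No sequences supplied")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Output folder name taken from the default-mode SPAdes run; skipped if it does not exist.
assembly = Path("spades_result_default/scaffolds.fasta")
if assembly.exists():
    lengths = read_fasta_lengths(assembly)
    print(f"{len(lengths)} sequences, total {sum(lengths)} bp, N50 {n50(lengths)} bp")
```

A higher N50 indicates a more contiguous assembly, but, as noted above, it says nothing about whether the contigs are correct.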

### QUAST

`quast` is one of the most widely used tools for assessing assembly quality. Below, we use the outputs of the [Genome Assembly](#Genome-Assembly) section as an example. The goal is to compare the assemblies produced by the three SPAdes modes. The table below describes the options and inputs provided.

| option/input | description |
| :-: | :- |
| `-o` | Output folder |
| `-l` | Header labels in the report. Should be in the same order as how the assembly files are supplied. |
| `-r` | Reference genome file |
| `spades_result_default/scaffolds.fasta`, `spades_result_careful/scaffolds.fasta`, `spades_result_isolate/scaffolds.fasta` | Input assembly files |

In [None]:
!quast.py \
    -o quast_out \
    -l "default, careful, isolate" \
    -r reference_genome.fasta \
    spades_result_default/scaffolds.fasta \
    spades_result_careful/scaffolds.fasta \
    spades_result_isolate/scaffolds.fasta

You can also add the `-b` option, which searches for single-copy marker genes using BUSCO. The presence or absence of these marker genes can be used as a proxy for the completeness and contamination of your assembly. This option is useful, for example, when assessing the quality of a metagenome-assembled genome (MAG). The evaluation methods discussed in the **Binning Module** are also applicable in this scenario.
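
Once QUAST finishes, the results can also be inspected programmatically. The sketch below assumes that the output folder contains a tab-separated `report.tsv` whose first row holds the assembly labels and whose remaining rows hold one metric each; the function name `read_quast_report` is hypothetical, and the layout should be checked against your own output.

```python
import csv
from pathlib import Path


def read_quast_report(tsv_path):
    """Return {assembly_label: {metric_name: value_as_text}} from a QUAST-style TSV."""
    with open(tsv_path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    labels = header[1:]  # the first column holds metric names
    report = {label: {} for label in labels}
    for row in body:
        metric, values = row[0], row[1:]
        for label, value in zip(labels, values):
            report[label][metric] = value
    return report


# Output folder name taken from the quast.py call above; skipped if the report is absent.
report_path = Path("quast_out/report.tsv")
if report_path.exists():
    report = read_quast_report(report_path)
    for label, metrics in report.items():
        print(label, metrics.get("N50"), metrics.get("# contigs"))
```

Values are kept as text because some metrics are not plain numbers.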

For RNA-seq data, refer to the following page, which discusses different ways to evaluate an assembly generated by Trinity: [Transcriptome Assembly Quality Assessment](https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Assembly-Quality-Assessment).