# Module: OTU Clustering in QIIME2

This notebook is a guide to processing **raw paired-end demultiplexed reads** in QIIME2. It covers quality checking of raw reads, primer trimming, read merging, quality filtering, dereplication, and OTU picking.

This module was built with the following as the main references: <a href = 'https://github.com/LangilleLab/microbiome_helper/wiki/Amplicon-SOP-v2-(qiime2-2020.8)'>LangilleLab SOP</a>, <a href = 'https://docs.qiime2.org/2021.2/tutorials/moving-pictures/'>"Moving pictures" Tutorial</a>, and <a href = 'https://docs.qiime2.org/2021.2/tutorials/atacama-soils/'>"Atacama soil microbiome" tutorial</a>.

Created by: _Microbial Oceanography Laboratory (MOLab)_

---
## How to Use This Notebook

1. Activate the conda environment in a terminal window. Change the environment name to match your own installation.
>`conda activate qiime2-2023.2`
2. Open jupyter notebook with the command below and select the notebook.
>`jupyter notebook`
3. To run the cells in this notebook, press Shift+Enter.

---
## Tools Used
1. **QIIME 2 Amplicon Distribution**
    - Installation procedure can be found here: [QIIME2 native installation](https://docs.qiime2.org/2024.10/install/native/)

---
## Starting Files 

1. Paired-end demultiplexed FASTQ dataset imported as QIIME2 artifact (filename: `seqs.qza`, location: `0-raw-sequences`)
2. Directories to organize the files. Run the command below:

In [None]:
!mkdir -p \
0-raw-sequences \
1-cleanup

---
## Expected Outputs

1. `.qza` of type `FeatureTable[Frequency]`
2. `.qza` of type `FeatureData[Sequence]`

---
## Table of Contents
 * [**Data Processing**](#Data-Processing)
     * [Inspecting raw data](#Inspecting-raw-data)
     * [Trimming primers](#Trim-primers)
     * [Merging reads](#Merging-reads)
     * [Quality filtering](#Quality-filtering)
     * [Dereplicating](#Dereplicating)
     * [OTU clustering](#OTU-clustering)

----
# <font color = 'gray'>Data Processing</font>

To prepare our sequences, we have to perform several steps:

1. Inspect raw data
2. Trim primers
3. Merge paired-end reads
4. Filter sequences by quality
5. Dereplicate
6. Pick OTUs

### Inspecting raw data

Our sequences are already *demultiplexed*, meaning they have already been separated by sample. We can therefore go straight to the `demux` plugin to summarize and visualize them. **QIIME 2 visualizations** have the extension `.qzv`. They can be viewed at http://view.qiime2.org, or inline by importing the `qiime2` module.

In [None]:
# Make summary of the QIIME2 artifact (.qza file)
!qiime demux summarize \
    --i-data  0-raw-sequences/seqs.qza \
    --p-n 100000 \
    --o-visualization 0-raw-sequences/seqs.qzv

In [None]:
#Visualize
import qiime2 as q2
q2.Visualization.load('0-raw-sequences/seqs.qzv')

Open the visualization summary and go to the **Interactive Quality Plot**. Here, we can see the average quality score of the reads at each position. In general, we want to maintain a score above 30. 
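
To see why 30 is a common cutoff, it helps to convert PHRED scores into error probabilities using the standard definition Q = -10 log10(P). The short sketch below is only an illustration; `phred_to_error_probability` is a hypothetical helper name, not part of QIIME2.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a PHRED score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# Print the error probability and accuracy for a few common scores
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

A score of 30 corresponds to roughly one expected error per 1,000 base calls.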

### Trim primers
To remove the primers from our sequences, we use the `cutadapt` plugin. The primers used were E572F/E1009R, which are <b>18 bp</b> and <b>20 bp</b> long, respectively. Removing the primers is especially important when they contain ambiguous (degenerate) bases, which might otherwise be mistaken for chimeric or low-quality positions. You can explore the primer sequences, lengths, and predicted amplicon sizes in this excellent app: <a href="https://app.pr2-primers.org/">PR-2 Primers</a>.
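
As a quick sanity check on the primers themselves, the sketch below expands the IUPAC ambiguity codes in each primer into a regular expression and reports the primer length and the number of distinct sequences it represents. The helper names (`primer_to_regex`, `count_variants`) and the example read are hypothetical and for illustration only; `cutadapt` does its own matching internally.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer: str) -> str:
    """Turn a degenerate primer into a regex, e.g. 'AY' -> 'A[CT]'."""
    parts = []
    for code in primer.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def count_variants(primer: str) -> int:
    """Number of distinct non-degenerate sequences the primer can represent."""
    n = 1
    for code in primer.upper():
        n *= len(IUPAC[code])
    return n


FORWARD = "CYGCGGTAATTCCAGCTC"    # E572F, as passed to --p-front-f
REVERSE = "AYGGTATCTRATCRTCTTYG"  # E1009R, as passed to --p-front-r

for name, primer in (("E572F", FORWARD), ("E1009R", REVERSE)):
    print(name, f"{len(primer)} bp,", count_variants(primer), "variants:",
          primer_to_regex(primer))

# Hypothetical read that starts with one variant of the forward primer
example_read = "CTGCGGTAATTCCAGCTC" + "ACGTACGT"
print(bool(re.match(primer_to_regex(FORWARD), example_read)))
```

If you substitute your own primers, the printed lengths are a convenient way to confirm how many bases you expect to lose from each read after trimming.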

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
If you are not using the E572F/E1009R primer pair, you must replace the sequences given to the <code>--p-front-f</code> and <code>--p-front-r</code> options.
</div>

<div class="alert alert-block alert-info">
<b>Tip:</b> 
    
Inspect the standard output of the <code>cutadapt trim-paired</code> command. Look for anything unusual and adjust the options accordingly. For instance, if a large fraction of reads is discarded, you can either increase <code>--p-error-rate</code> or disable <code>--p-discard-untrimmed</code> (although you may end up with lower-quality sequences).
</div>
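
To get a feel for what a given `--p-error-rate` means in practice, the sketch below estimates how many mismatches would be tolerated for each primer. It assumes cutadapt's documented rule that the number of allowed errors is the error rate multiplied by the length of the matched primer, rounded down, and it only considers full-length primer matches. `max_mismatches` is a hypothetical helper, not part of cutadapt or QIIME2.

```python
import math

# Primer lengths stated in this notebook
PRIMER_LENGTHS = {"E572F": 18, "E1009R": 20}


def max_mismatches(primer_length: int, error_rate: float) -> int:
    """Allowed errors for a full-length match (assumed rule: floor(length * rate))."""
    return math.floor(primer_length * error_rate)


for error_rate in (0.0, 0.1, 0.2):
    allowed = {name: max_mismatches(length, error_rate)
               for name, length in PRIMER_LENGTHS.items()}
    print(f"error rate {error_rate}: {allowed}")
```

With an error rate of 0, as used in this notebook, no mismatches are tolerated, so reads with even a single sequencing error in the primer region are left untrimmed and then discarded by `--p-discard-untrimmed`.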

In [None]:
!qiime cutadapt trim-paired \
    --i-demultiplexed-sequences 0-raw-sequences/seqs.qza \
    --p-front-f CYGCGGTAATTCCAGCTC  \
    --p-front-r AYGGTATCTRATCRTCTTYG  \
    --p-error-rate 0 \
    --p-discard-untrimmed \
    --o-trimmed-sequences 1-cleanup/1-primer-trimmed-seqs.qza \
    --verbose

In [None]:
#Check after trimming primers
!qiime demux summarize \
    --i-data  1-cleanup/1-primer-trimmed-seqs.qza \
    --p-n 100000 \
    --o-visualization 1-cleanup/1-primer-trimmed-seqs.qzv

In [None]:
#Visualize
import qiime2 as q2
q2.Visualization.load('1-cleanup/1-primer-trimmed-seqs.qzv')

### Merging reads
Now, we merge our forward and reverse reads using `vsearch`. Set the minimum overlap length to the value you expect given the region being amplified and the length of the reads.

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
Although <code>vsearch merge-pairs</code> is a quite robust utility, the value provided to the <code>--p-minovlen</code> option should be adjusted to whatever is applicable to your dataset to avoid potential erroneous merging.
</div>
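
One rough way to choose a value for `--p-minovlen` is to estimate the overlap you expect between the trimmed forward and reverse reads. The sketch below does this arithmetic. The read length (300) and primer-free amplicon length (400) are placeholder assumptions, not properties of this dataset, so replace them with your own numbers; `expected_overlap` is a hypothetical helper.

```python
def expected_overlap(forward_len: int, reverse_len: int, amplicon_len: int) -> int:
    """Bases shared by both reads when together they span the whole amplicon."""
    return forward_len + reverse_len - amplicon_len


# Placeholder values (assumptions): adjust to your sequencing run and primer set
READ_LENGTH = 300          # raw read length per direction
FORWARD_PRIMER_LEN = 18    # E572F, removed during primer trimming
REVERSE_PRIMER_LEN = 20    # E1009R, removed during primer trimming
AMPLICON_LEN = 400         # amplicon length WITHOUT primers (hypothetical)

forward_len = READ_LENGTH - FORWARD_PRIMER_LEN
reverse_len = READ_LENGTH - REVERSE_PRIMER_LEN
overlap = expected_overlap(forward_len, reverse_len, AMPLICON_LEN)
print(f"Expected overlap: {overlap} bp")
```

A minimum overlap set somewhat below the expected overlap leaves room for natural length variation in the amplicon. A zero or negative result means the reads cannot overlap at all.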

<div class="alert alert-block alert-info">
<b>Tip:</b> 
    
Inspect the standard output of the <code>vsearch merge-pairs</code> command and the visualization file, <code>2-merged-seqs.qzv</code>. Check the percentage of reads merged and the mean length of the merged reads. If the fraction of merged reads is low, the standard output will also report why reads failed to merge.
</div>

In [None]:
!qiime vsearch merge-pairs \
    --i-demultiplexed-seqs 1-cleanup/1-primer-trimmed-seqs.qza \
    --o-merged-sequences 1-cleanup/2-merged-seqs.qza \
    --p-minovlen 140 \
    --verbose

In [None]:
#Check output after joining reads
!qiime demux summarize \
    --i-data  1-cleanup/2-merged-seqs.qza \
    --p-n 100000 \
    --o-visualization 1-cleanup/2-merged-seqs.qzv

In [None]:
#Visualize
import qiime2 as q2
q2.Visualization.load('1-cleanup/2-merged-seqs.qzv')

### Quality filtering
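In the next step, we filter the merged reads using a minimum PHRED score of **30**. Base calls scoring below this threshold are treated as low quality: `quality-filter q-score` truncates a read at the first run of low-quality bases and discards it if too little of the read remains.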

<div class="alert alert-block alert-info">
<b>Tip:</b> 
    
Inspect the output and adjust the PHRED score threshold if too many reads are purged.
</div>
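
The sketch below gives a simplified picture of this kind of filtering on a list of per-base quality scores. It is not the plugin's actual implementation: `truncate_read` and `passes_filter` are hypothetical helpers, and the window size (3) and minimum length fraction (0.75) are assumed defaults corresponding to `--p-quality-window` and `--p-min-length-fraction`. The plugin's exact boundary handling may differ, so check `qiime quality-filter q-score --help` for your version.

```python
def truncate_read(quals, min_quality=30, quality_window=3):
    """Return how many leading bases are kept.

    The read is cut at the start of the first run of more than
    `quality_window` consecutive scores below `min_quality`.
    """
    run_start = 0
    run_len = 0
    for i, q in enumerate(quals):
        if q < min_quality:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > quality_window:
                return run_start
        else:
            run_len = 0
    return len(quals)


def passes_filter(quals, min_quality=30, quality_window=3, min_length_fraction=0.75):
    """Keep the read only if enough of it survives truncation."""
    kept = truncate_read(quals, min_quality, quality_window)
    return kept >= min_length_fraction * len(quals)


# Toy quality profiles (hypothetical)
good_tail_drop = [35] * 16 + [20] * 4            # truncated at 16 of 20 bases, kept
early_drop = [35] * 10 + [20] * 4 + [35] * 6     # truncated at 10 of 20 bases, discarded
short_dip = [35] * 10 + [20] * 3 + [35] * 7      # dip too short to trigger truncation

for name, quals in (("good_tail_drop", good_tail_drop),
                    ("early_drop", early_drop),
                    ("short_dip", short_dip)):
    print(name, truncate_read(quals), passes_filter(quals))
```

Lowering the threshold makes fewer positions count as low quality, so fewer reads are truncated or discarded.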

In [None]:
!qiime quality-filter q-score \
    --i-demux 1-cleanup/2-merged-seqs.qza \
    --o-filtered-sequences 1-cleanup/3-merged-qc-seqs.qza \
    --p-min-quality 30 \
    --o-filter-stats 1-cleanup/3-merged-qc-stats.qza \
    --verbose

In [None]:
#Check post QC data
!qiime demux summarize \
    --i-data 1-cleanup/3-merged-qc-seqs.qza \
    --p-n 100000 \
    --o-visualization 1-cleanup/3-merged-qc-seqs.qzv

In [None]:
#Visualize
import qiime2 as q2
q2.Visualization.load('1-cleanup/3-merged-qc-seqs.qzv')

### Dereplicating

Dereplication collapses identical sequences into a single representative while recording how often each one occurs. This reduces the number of sequences that must be compared, which makes it easier to cluster them by a similarity threshold (e.g. OTU clustering). It is done with the `vsearch dereplicate-sequences` command, which outputs a feature table containing counts of unique sequences across all samples and a sequence file, both as QIIME2 artifacts.
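
The idea can be illustrated with a few lines of plain Python. This is a toy sketch on made-up reads, not what `vsearch` does internally; the sample names, sequences, and the `dereplicate` helper are all hypothetical.

```python
from collections import Counter, defaultdict

# Toy reads as (sample_id, sequence) pairs (hypothetical data)
reads = [
    ("sampleA", "ACGTAC"),
    ("sampleA", "ACGTAC"),
    ("sampleA", "TTGACA"),
    ("sampleB", "ACGTAC"),
    ("sampleB", "GGGCCC"),
    ("sampleB", "GGGCCC"),
]


def dereplicate(reads):
    """Map each unique sequence to its per-sample counts."""
    table = defaultdict(Counter)
    for sample_id, sequence in reads:
        table[sequence][sample_id] += 1
    return table


table = dereplicate(reads)

# One row per unique sequence, one count per sample
for sequence in sorted(table):
    print(sequence, dict(table[sequence]))
```

Six reads collapse to three unique sequences, and the per-sample counts play the role of the feature table.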

In [None]:
!qiime vsearch dereplicate-sequences \
    --i-sequences 1-cleanup/3-merged-qc-seqs.qza \
    --o-dereplicated-table 1-cleanup/4-drp-table.qza \
    --o-dereplicated-sequences 1-cleanup/4-drp-seqs.qza

Let's take a peek at the resulting table:

In [None]:
# Summarize table
!qiime feature-table summarize \
    --i-table 1-cleanup/4-drp-table.qza \
    --o-visualization 1-cleanup/4-drp-table.qzv

In [None]:
# Visualize
import qiime2 as q2
q2.Visualization.load('1-cleanup/4-drp-table.qzv')

### OTU clustering

Next, we group sequences according to a percent similarity threshold, a process called OTU clustering. There are three general approaches to OTU clustering:

1. De novo
2. Closed reference
3. Open reference

More details about these three methods are discussed here: [OTU picking strategies](http://qiime.org/tutorials/otu_picking.html).

In this tutorial, we will be using open reference clustering. As the reference, we will use a database curated by MOLab, which combines the SILVA and Nordicana databases. The custom 18S V4 reference database can be found here: [MOLab SILVA + Nordicana Reference DB](https://drive.google.com/drive/u/0/folders/1ZLRp73X96ukmFFW7AUJuV3WAHuq6xQ8q). Other QIIME2-formatted reference databases are also available on the [QIIME2 data resources page](https://docs.qiime2.org/2024.10/data-resources/).
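
To make the open reference strategy concrete, here is a toy sketch of its two stages: sequences are first matched against the reference, and whatever fails to match is then clustered de novo. This is a simplification for illustration only. The `identity` function just counts matching positions in equal-length sequences, whereas `vsearch` computes identity from alignments; the sequences, the 0.9 threshold, and all helper names are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions (equal-length sequences only)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_hit(seq, centroids, threshold):
    """Return the id of the most similar centroid at or above the threshold, else None."""
    hits = [(identity(seq, cseq), cid) for cid, cseq in centroids.items()]
    hits = [hit for hit in hits if hit[0] >= threshold]
    return max(hits)[1] if hits else None


def open_reference_cluster(queries, reference, threshold):
    """queries: {sequence: count}; reference: {id: sequence}."""
    otu_counts, unmatched = {}, {}

    # Stage 1 (closed reference): assign queries to reference sequences
    for seq, count in queries.items():
        hit = best_hit(seq, reference, threshold)
        if hit is None:
            unmatched[seq] = count
        else:
            otu_counts[hit] = otu_counts.get(hit, 0) + count

    # Stage 2 (de novo): cluster the leftovers, most abundant first
    new_centroids = {}
    for seq, count in sorted(unmatched.items(), key=lambda item: -item[1]):
        hit = best_hit(seq, new_centroids, threshold)
        if hit is None:
            hit = f"denovo{len(new_centroids)}"
            new_centroids[hit] = seq
        otu_counts[hit] = otu_counts.get(hit, 0) + count

    return otu_counts, new_centroids


# Hypothetical 10 bp sequences; 0.9 identity allows one mismatch
reference = {"ref1": "ACGTACGTAC"}
queries = {
    "ACGTACGTAC": 5,  # exact match to ref1
    "ACGTACGTAT": 2,  # one mismatch to ref1
    "TTTTGGGGCC": 4,  # no reference hit, becomes a new centroid
    "TTTTGGGGCA": 1,  # one mismatch to the new centroid
    "GGGGGGGGGG": 1,  # no hit anywhere, becomes a second new centroid
}

otu_counts, new_centroids = open_reference_cluster(queries, reference, threshold=0.9)
print(otu_counts)
print(new_centroids)
```

In this toy run, seven reads are assigned to the reference OTU and the remaining six form two new OTUs, which is why open reference clustering retains sequences that a purely closed reference approach would drop.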

In [None]:
!qiime vsearch cluster-features-open-reference \
    --i-table 1-cleanup/4-drp-table.qza \
    --i-sequences 1-cleanup/4-drp-seqs.qza \
    --p-perc-identity 0.98 \
    --i-reference-sequences ../classifier/silva-138-nord-drp-seq.qza \
    --o-clustered-table 1-cleanup/5-clust-OTU-table.qza \
    --o-clustered-sequences 1-cleanup/5-clust-OTU-seqs.qza \
    --o-new-reference-sequences 1-cleanup/5-clust-OTU-ref-seqs.qza

In [None]:
# Summarize OTU table
!qiime feature-table summarize \
    --i-table 1-cleanup/5-clust-OTU-table.qza  \
    --o-visualization 1-cleanup/5-clust-OTU-table.qzv 

In [None]:
#Visualize
import qiime2 as q2
q2.Visualization.load('1-cleanup/5-clust-OTU-table.qzv')

In [None]:
#Check clustered OTU seqs
!qiime feature-table tabulate-seqs \
    --i-data 1-cleanup/5-clust-OTU-seqs.qza \
    --o-visualization 1-cleanup/5-clust-OTU-seqs.qzv

In [None]:
# Visualize
import qiime2 as q2
q2.Visualization.load('1-cleanup/5-clust-OTU-seqs.qzv')