<h1>QIIME2 Workflow for OTU Clustering</h1>

This notebook is a guide to processing raw, demultiplexed paired-end reads with QIIME2. It covers quality checking of raw reads, primer trimming, read merging, OTU picking, taxonomic assignment, and data export.

This workflow was built with the following as the main references: <a href = 'https://github.com/LangilleLab/microbiome_helper/wiki/Amplicon-SOP-v2-(qiime2-2020.8)'>LangilleLab SOP</a>, <a href = 'https://docs.qiime2.org/2021.2/tutorials/moving-pictures/'>"Moving pictures" Tutorial</a>, and <a href = 'https://docs.qiime2.org/2021.2/tutorials/atacama-soils/'>"Atacama soil microbiome" tutorial</a>.

Written for Day 1 of Bioinformatics Workshop by the Microbial Oceanography Laboratory. Credits: LBR dela Peña, BW Hingpit, JB Quijano, D Purganan. 

---
## <font color ='blue'>How to Use This Notebook</font>

1. Activate the conda environment in a terminal window. Change the environment name to match your installation.
>`conda activate qiime2-2021.4`
2. Open Jupyter Notebook with the command below and select this notebook.
>`jupyter-notebook`
3. To run the cells in this notebook, press Shift+Enter.

---
## Tools Used
1. <b>QIIME2 2021.4</b>

---
## Starting Files 

1. This Jupyter notebook
2. Raw amplicon sequencing data files found in the folder <font face="Consolas">**amplicon_sample_data/raw_sequences**</font>
3. Naive Bayes classifier and reference sequences found in the folder <font face="Consolas">**classifier**</font>
4. Directories for organizing the data. To make the folders, run the following code block:

In [None]:
!mkdir clustering_demo_folder
%cd clustering_demo_folder
!mkdir \
0-raw-sequences \
1-cleanup \
2-tax-assign

---
## Acknowledgement
The data used for this demonstration are from 8 samples collected and sequenced by <a href="https://www.researchgate.net/publication/345988236_Diversity_of_Marine_Eukaryotic_Picophytoplankton_Communities_with_Emphasis_on_Mamiellophyceae_in_Northwestern_Philippines">dela Peña et al. (2021)</a>:

<i>Dela Peña, L. B. R. O., Tejada, A. J. P., Quijano, J. B., Alonzo, K. H., Gernato, E. G., Caril, A., ... & Onda, D. F. L. (2021). Diversity of Marine Eukaryotic Picophytoplankton Communities with Emphasis on Mamiellophyceae in Northwestern Philippines. Philipp. J. Sci, 150, 27-42.</i>

---
## Table of Contents
 * [**Step 1: Data Preparation**](#Step-1:-Data-Preparation)  
     * [Download data](#Download-data)
     * [Making the manifest file](#Making-the-manifest-file)
     * [Importing sequences](#Importing-sequences)  
     * [Quality checking](#Quality-checking)
 * [**Step 2: Data Processing**](#Step-2:-Data-Processing)  
     * [Trimming primers](#Trim-primers)
     * [Merging reads](#Merging-reads)
     * [Quality filtering](#Quality-filtering)
     * [Dereplicating](#Dereplicating)
     * [OTU clustering](#OTU-clustering)
     * [Filtering and chimera removal](#Filtering-singletons-and-chimeras)
 * [**Step 3: Assigning Taxonomy**](#Step-3:-Assign-Taxonomy)
     * [Feature data summaries](#Feature-data-summaries)
     * [Taxonomy assignment](#Taxonomy-assignment)
     * [Exporting OTU tables](#Exporting-OTU-tables)
---

# <font color = 'gray'>Step 1: Data Preparation</font>


### Download data

The data that will be used for the demonstration of this workflow was taken from the study mentioned in the Acknowledgement section.

In [None]:
!wget -i ../amplicon_sample_data/data-links.txt -P ./0-raw-sequences/

### Making the manifest file

Before we import our data, we have to make a **manifest file** that lists, for each sample, the absolute paths to its forward and reverse read files.

In [None]:
import glob
import os

import pandas as pd

# Collect the forward (_1) and reverse (_2) read files of each sample
fpath = os.path.join(os.getcwd(), "0-raw-sequences")
samples = {}
for filepath in sorted(glob.glob(os.path.join(fpath, "*.fastq.gz"))):
    filename = os.path.basename(filepath)
    if filename.endswith("_1.fastq.gz"):
        column = "forward-absolute-filepath"
    elif filename.endswith("_2.fastq.gz"):
        column = "reverse-absolute-filepath"
    else:
        continue  # skip files that do not follow the <sampleID>_1/_2.fastq.gz pattern
    sample = filename[:-len("_1.fastq.gz")]
    # Keying by sample ID keeps each sample paired with its own two files
    samples.setdefault(sample, {})[column] = filepath

# One row per sample, with the columns in the order expected by the manifest format
manifest = pd.DataFrame.from_dict(samples, orient="index").sort_index()
manifest = manifest[["forward-absolute-filepath", "reverse-absolute-filepath"]]
manifest.index.name = "sampleID"
manifest.to_csv("manifest.txt", sep="\t")

The <font face="Consolas">**manifest.txt**</font> file lists the sample ID (or SRA number) and the absolute paths to the forward and reverse reads of each sample.
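Before importing, it can help to confirm that every sample in the manifest points to two existing read files. The sketch below is one possible check using only the standard library. The function name `check_manifest` is hypothetical, and the code assumes the three-column, tab-separated layout with a `sampleID` column produced by the cell above.

```python
import csv
import os


def check_manifest(path):
    """Return a list of problems found in a paired-end manifest (empty if none)."""
    problems = []
    path_columns = ("forward-absolute-filepath", "reverse-absolute-filepath")
    with open(path, newline="") as handle:
        # DictReader uses the first line as column names and skips blank lines
        for row in csv.DictReader(handle, delimiter="\t"):
            sample = row.get("sampleID", "")
            for column in path_columns:
                filepath = row.get(column) or ""
                if not os.path.isabs(filepath):
                    problems.append(f"{sample}: {column} is not an absolute path")
                elif not os.path.isfile(filepath):
                    problems.append(f"{sample}: {column} does not exist")
    return problems
```

Calling `check_manifest('manifest.txt')` should return an empty list when every path is absolute and every file is present.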


### Importing sequences
Now that we have prepared all the necessary files, we can run our first QIIME command: importing the sequence data.

In [None]:
# Import the sequences
# Insert the path to the manifest file after '--input-path'
!qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path manifest.txt \
    --output-path 0-raw-sequences/seqs.qza \
    --input-format PairedEndFastqManifestPhred33V2



This converts the sequence data into a **QIIME artifact**. Artifacts have the extension '.qza'.
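A `.qza` file is a zip archive, so its contents can be listed without QIIME. The following is a minimal sketch using Python's standard `zipfile` module; the helper name `list_artifact_files` is hypothetical.

```python
import zipfile


def list_artifact_files(path):
    """Return the sorted file names stored inside a .qza or .qzv archive."""
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())
```

For example, `list_artifact_files('0-raw-sequences/seqs.qza')` should list the imported FASTQ files together with the metadata and provenance files that QIIME stores alongside them.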

### Quality checking

Our sequences are already *demultiplexed*, meaning they are already separated by sample. We therefore skip demultiplexing and use the `demux` plugin only to summarize and visualize our sequences. **QIIME visualizations** have the extension '.qzv'. The .qzv files can be viewed at http://view.qiime2.org, or we can import the `qiime2` module to view the visualizations inline.



In [None]:
# Make summary of the QIIME2 artifact (.qza file)
!qiime demux summarize \
    --i-data  0-raw-sequences/seqs.qza \
    --p-n 100000 \
    --o-visualization 0-raw-sequences/seqs.qzv

In [None]:
# Load the visualization inline
import qiime2 as q2
q2.Visualization.load('0-raw-sequences/seqs.qzv')

Open the visualization summary and go to the **Interactive Quality Plot**. Here, we can see the average quality score of the reads at each position. In general, we want to maintain a score above 30. 
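To see why 30 is a common threshold, recall that a PHRED score Q encodes the probability P of a wrong base call as Q = -10 · log10(P). The short sketch below converts a few scores to error probabilities; the function name is illustrative only.

```python
def phred_to_error_probability(q):
    """Convert a PHRED quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4%}")
```

A score of 30 corresponds to roughly one expected error per 1,000 base calls, and a score of 20 to one per 100.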
 
----
# <font color = 'gray'>Step 2: Data Processing</font>

To prepare our sequences, we have to perform several steps:
1. Trim primers
2. Merge paired-end reads
3. Filter sequences by quality
4. Dereplicate
5. Pick OTUs
6. Filter chimeras and singletons

⚠️  Some commands here may take a long time, depending on the capacity of your machine. If at any point you think a command is taking too long, you can copy the files from the <font face="Consolas">**amplicon_sample_data/output_data/otu_clustering**</font> folder.

### Trim primers
To remove the primers from our sequences, we use the `cutadapt` plugin. The primers used were E572F/E1009R, which are <b>18 bp</b> and <b>20 bp</b> long, respectively. Removing the primers is especially important when they contain ambiguous (degenerate) bases, which might otherwise be mistaken for chimeric or low-quality positions. You can explore primer sequences, lengths, and predicted amplicon sizes in the excellent <a href="https://app.pr2-primers.org/">PR-2 Primers</a> app.

<font color = 'red'>NOTE: Remember to set the primer pair sequences that apply to your data in the <font face = 'Consolas'><b>--p-front-f</b></font> and <font face = 'Consolas'><b>--p-front-r</b></font> options.</font>
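The primers in this workflow contain degenerate IUPAC codes such as `Y` (C or T) and `R` (A or G). As a rough illustration of what that means, the sketch below expands a degenerate primer into a regular expression and counts how many concrete sequences it represents. The helper names are hypothetical and this is not how `cutadapt` matches primers internally.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Build a regular expression matching every expansion of a degenerate primer."""
    parts = []
    for base in primer.upper():
        options = IUPAC[base]
        parts.append(options if len(options) == 1 else f"[{options}]")
    return "".join(parts)


def count_variants(primer):
    """Count the concrete sequences represented by a degenerate primer."""
    total = 1
    for base in primer.upper():
        total *= len(IUPAC[base])
    return total


forward = "CYGCGGTAATTCCAGCTC"
reverse = "AYGGTATCTRATCRTCTTYG"
for name, primer in (("forward", forward), ("reverse", reverse)):
    print(name, len(primer), "bp,", count_variants(primer), "variants:", primer_to_regex(primer))

# A read starting with one concrete version of the forward primer matches the pattern
print(bool(re.match(primer_to_regex(forward), "CTGCGGTAATTCCAGCTCAAAA")))
```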

In [None]:
!qiime cutadapt trim-paired \
    --i-demultiplexed-sequences 0-raw-sequences/seqs.qza \
    --p-front-f CYGCGGTAATTCCAGCTC  \
    --p-front-r AYGGTATCTRATCRTCTTYG  \
    --p-error-rate 0 \
    --p-discard-untrimmed \
    --o-trimmed-sequences 1-cleanup/1-primer-trimmed-seqs.qza \
    --verbose

In [None]:
#Check after trimming primers
!qiime demux summarize \
    --i-data  1-cleanup/1-primer-trimmed-seqs.qza \
    --p-n 100000 \
    --o-visualization 1-cleanup/1-primer-trimmed-seqs.qzv

In [None]:
# Load the visualization inline
import qiime2 as q2
q2.Visualization.load('1-cleanup/1-primer-trimmed-seqs.qzv')

### Merging reads
Now, we merge our forward and reverse reads using `vsearch`. Adjust the minimum overlap length (`--p-minovlen`) to the value you expect based on the length of the amplified region and the length of the reads.
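The expected overlap follows from simple arithmetic: the two reads together must cover the merged fragment, so whatever exceeds the fragment length is shared by both reads. The sketch below shows the calculation. All numbers in the example are hypothetical placeholders, not measurements from this dataset; substitute the read and fragment lengths seen in your own quality plots.

```python
def expected_overlap(forward_len, reverse_len, merged_len):
    """Number of bases shared by the two reads when they merge into one fragment."""
    return forward_len + reverse_len - merged_len


# Hypothetical example: 300 bp reads with an 18 bp and a 20 bp primer removed,
# merging into a 400 bp fragment
forward_len = 300 - 18
reverse_len = 300 - 20
print(expected_overlap(forward_len, reverse_len, merged_len=400))
```

A minimum overlap set somewhat below the expected value leaves room for natural length variation in the amplified region.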

In [None]:
!qiime vsearch join-pairs \
    --i-demultiplexed-seqs 1-cleanup/1-primer-trimmed-seqs.qza \
    --o-joined-sequences 1-cleanup/2-merged-seqs.qza \
    --p-minovlen 140 \
    --verbose

In [None]:
#Check output after joining reads
!qiime demux summarize \
    --i-data  1-cleanup/2-merged-seqs.qza \
    --p-n 100000 \
    --o-visualization 1-cleanup/2-merged-seqs.qzv

In [None]:
# Load the visualization inline
import qiime2 as q2
q2.Visualization.load('1-cleanup/2-merged-seqs.qzv')

### Quality filtering
In the next step, we filter the merged reads by quality. We set the minimum acceptable PHRED score to **30**; base calls scoring below this value are treated as low quality.

In [None]:
!qiime quality-filter q-score \
    --i-demux 1-cleanup/2-merged-seqs.qza \
    --o-filtered-sequences 1-cleanup/3-merged-qc-seqs.qza \
    --p-min-quality 30 \
    --o-filter-stats 1-cleanup/3-merged-qc-stats.qza

In [None]:
#Check post QC data
!qiime demux summarize \
    --i-data 1-cleanup/3-merged-qc-seqs.qza \
    --p-n 100000 \
    --o-visualization 1-cleanup/3-merged-qc-seqs.qzv

In [None]:
# Load the visualization inline
import qiime2 as q2
q2.Visualization.load('1-cleanup/3-merged-qc-seqs.qzv')

### Dereplicating

Dereplication collapses identical sequences into a single representative while recording how many times each was observed. It can be done using `vsearch`, and outputs a table and a sequence artifact.
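The idea behind dereplication can be sketched in a few lines of plain Python. This toy version only counts identical strings in a single list, whereas the QIIME 2 action also keeps track of counts per sample; the function name and the reads are made up for illustration.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical sequences and count how often each one occurs."""
    counts = Counter(reads)
    # most_common() orders the unique sequences by decreasing abundance
    return counts.most_common()


reads = ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC"]
for sequence, count in dereplicate(reads):
    print(sequence, count)
```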

In [None]:
!qiime vsearch dereplicate-sequences \
    --i-sequences 1-cleanup/3-merged-qc-seqs.qza \
    --o-dereplicated-table 1-cleanup/4-drp-table.qza \
    --o-dereplicated-sequences 1-cleanup/4-drp-seqs.qza

Let's take a peek at the resulting table:

In [None]:
# Summarize table
!qiime feature-table summarize \
    --i-table 1-cleanup/4-drp-table.qza \
    --o-visualization 1-cleanup/4-drp-table.qzv

In [None]:
import qiime2 as q2
# Visualize
q2.Visualization.load('1-cleanup/4-drp-table.qzv')

### OTU clustering

`vsearch` can also perform OTU picking, which clusters sequences according to their similarity. OTU clustering can be done against a reference database by grouping sequences that match the same reference sequence. Here we use open-reference clustering at 98% identity, in which sequences that do not match any reference sequence are clustered *de novo*. For this step, we will use the reference sequences curated by MOLab, which combine the SILVA and Nordicana databases. Other QIIME2-formatted reference databases are also available in the <a href="https://docs.qiime2.org/2021.4/data-resources/">QIIME2 data resources page</a>.
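The logic of clustering against references first and forming new clusters for everything else can be illustrated with a toy example. The sketch below is a deliberate simplification and not the algorithm `vsearch` implements: identity is computed position by position on equal-length strings instead of by alignment, and all names and sequences are invented.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def toy_open_reference(seqs, references, threshold):
    """Assign each sequence to a reference ID, or to a new de novo cluster."""
    centroids = dict(references)  # cluster ID -> centroid sequence, references first
    assignments = {}
    new_clusters = 0
    for seq_id, seq in seqs.items():
        hit = None
        for cluster_id, centroid in centroids.items():
            if identity(seq, centroid) >= threshold:
                hit = cluster_id
                break
        if hit is None:
            # No centroid is similar enough: this sequence starts a new cluster
            new_clusters += 1
            hit = f"denovo{new_clusters}"
            centroids[hit] = seq
        assignments[seq_id] = hit
    return assignments


references = {"ref1": "ACGTACGTAC"}
seqs = {
    "s1": "ACGTACGTAC",  # identical to ref1
    "s2": "ACGTACGTAA",  # one mismatch to ref1
    "s3": "TTTTTTTTTT",  # unlike ref1, starts a new cluster
    "s4": "TTTTTTTTTA",  # one mismatch to s3
}
print(toy_open_reference(seqs, references, threshold=0.9))
```

With real amplicons the threshold would be 0.98, as in the command below; 0.9 is used here only because the toy sequences are ten bases long.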

In [None]:
# Reference: Molab silva nord sequences
!qiime vsearch cluster-features-open-reference \
    --i-table 1-cleanup/4-drp-table.qza \
    --i-sequences 1-cleanup/4-drp-seqs.qza \
    --p-perc-identity 0.98 \
    --i-reference-sequences ../classifier/silva-138-nord-drp-seq.qza \
    --o-clustered-table 1-cleanup/5-clust-OTU-table.qza \
    --o-clustered-sequences 1-cleanup/5-clust-OTU-seqs.qza \
    --o-new-reference-sequences 1-cleanup/5-clust-OTU-ref-seqs.qza

In [None]:
# Summarize OTU table
!qiime feature-table summarize \
    --i-table 1-cleanup/5-clust-OTU-table.qza  \
    --o-visualization 1-cleanup/5-clust-OTU-table.qzv 

In [None]:
#Visualize
import qiime2 as q2
q2.Visualization.load('1-cleanup/5-clust-OTU-table.qzv')

In [None]:
#Check clustered OTU seqs
!qiime feature-table tabulate-seqs \
    --i-data 1-cleanup/5-clust-OTU-seqs.qza \
    --o-visualization 1-cleanup/5-clust-OTU-seqs.qzv

In [None]:
# Visualize
import qiime2 as q2
q2.Visualization.load('1-cleanup/5-clust-OTU-seqs.qzv')

### Filtering singletons and chimeras

We can remove singletons using `feature-table filter-features`. The `min-frequency` parameter is set to 2 to remove features that occur only once across all samples.
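Note that the threshold applies to the total count of a feature across all samples, not to its count within each sample. The sketch below mimics that rule on a small dictionary-based table; the function name and the counts are invented for illustration.

```python
def filter_min_frequency(table, min_frequency=2):
    """Keep features whose total count across all samples is at least min_frequency."""
    return {
        feature: counts
        for feature, counts in table.items()
        if sum(counts.values()) >= min_frequency
    }


table = {
    "OTU_1": {"sampleA": 10, "sampleB": 4},
    "OTU_2": {"sampleA": 1, "sampleB": 0},  # singleton, removed
    "OTU_3": {"sampleA": 1, "sampleB": 1},  # total of 2, kept
}
print(sorted(filter_min_frequency(table)))
```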

In [None]:
#Filter features with frequency of <= 1
!qiime feature-table filter-features \
    --i-table 1-cleanup/5-clust-OTU-table.qza  \
    --p-min-frequency 2 \
    --o-filtered-table 1-cleanup/6-filtered-OTU-table.qza 

#Remove the same features in the sequence file as well
!qiime feature-table filter-seqs \
    --i-data 1-cleanup/5-clust-OTU-seqs.qza \
    --i-table 1-cleanup/6-filtered-OTU-table.qza \
    --o-filtered-data 1-cleanup/6-filtered-OTU-seqs.qza

In [None]:
# Summarize OTU table with singletons removed
!qiime feature-table summarize \
    --i-table 1-cleanup/6-filtered-OTU-table.qza  \
    --o-visualization 1-cleanup/6-filtered-OTU-table.qzv 

In [None]:
q2.Visualization.load('1-cleanup/6-filtered-OTU-table.qzv')

In [None]:
#Check OTU seqs with singletons removed
!qiime feature-table tabulate-seqs \
    --i-data 1-cleanup/6-filtered-OTU-seqs.qza \
    --o-visualization 1-cleanup/6-filtered-OTU-seqs.qzv

In [None]:
q2.Visualization.load('1-cleanup/6-filtered-OTU-seqs.qzv')

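Chimeras can be detected using `vsearch uchime-denovo`. The command writes its outputs, including the non-chimeric sequences used for filtering afterwards, to the <font face="Consolas">**3-chimeras**</font> folder.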

In [None]:
#Detect chimeras
!qiime vsearch uchime-denovo \
    --i-sequences 1-cleanup/6-filtered-OTU-seqs.qza \
    --i-table 1-cleanup/6-filtered-OTU-table.qza \
    --output-dir 3-chimeras/

In [None]:
#Remove chimeras from the table
!qiime feature-table filter-features \
    --i-table 1-cleanup/6-filtered-OTU-table.qza \
    --m-metadata-file 3-chimeras/nonchimeras.qza \
    --o-filtered-table 1-cleanup/7-OTU-table.qza

#Remove chimeras from the sequences
!qiime feature-table filter-seqs \
    --i-data 1-cleanup/6-filtered-OTU-seqs.qza \
    --i-table 1-cleanup/7-OTU-table.qza \
    --o-filtered-data 1-cleanup/7-OTU-rep-seqs.qza

---
# <font color = 'gray'>Step 3: Assign Taxonomy</font>


### Feature data summaries
After filtering, the resulting data can be explored using `feature-table summarize` and `feature-table tabulate-seqs`. The former reports how many sequences are associated with each sample and with each feature (OTU), along with histograms and related summary statistics. The latter maps feature IDs to sequences and provides links to easily BLAST each sequence against the NCBI nt database.


In [None]:
# Summarize OTU table
!qiime feature-table summarize \
    --i-table 1-cleanup/7-OTU-table.qza \
    --o-visualization 1-cleanup/7-OTU-table.qzv 

In [None]:
import qiime2 as q2
# Visualize
q2.Visualization.load('1-cleanup/7-OTU-table.qzv')

The overview shows how many OTUs were determined in our samples. In the *Interactive Sample Detail* tab, you can see how many OTUs were detected in each sample. In the last tab (*Feature Detail*), you can see the OTU IDs, their frequencies, and their occurrence in the samples.
🤔 What is the most frequently detected OTU?

In [None]:
# Map OTUs to sequences
!qiime feature-table tabulate-seqs \
    --i-data 1-cleanup/7-OTU-rep-seqs.qza \
    --o-visualization 1-cleanup/7-OTU-rep-seqs.qzv

In [None]:
import qiime2 as q2
# Visualize
q2.Visualization.load('1-cleanup/7-OTU-rep-seqs.qzv')

This summary reports descriptive statistics on the OTU sequences, such as their number and lengths. Scroll down to the **Sequence Table** summary. Try to find the most frequent OTU (determined from the previous visualization). Clicking on the link in the **sequence** column will take you to NCBI BLAST, which will match that sequence with other publicly available sequences.
🤔 What organism is identified?

### Taxonomy assignment

To annotate the metabarcoding data, we use a classifier trained on a reference database, which assigns the sequences to their taxonomic identities through the `feature-classifier` plugin and its scikit-learn-based `classify-sklearn` method. For 18S rRNA eukaryotic data, we will use a curated database, which has been optimized for the specific region targeted by the primers used in this run (18S V4 region). This database was annotated in the Microbial Oceanography Laboratory and features entries from both the <a href="https://www.arb-silva.de/">SILVA</a> and <a href="http://www.cen.ulaval.ca/nordicanad/dpage.aspx?doi=45409XD-79A199B76BCC4110">Nordicana</a> databases. Other references for eukaryotic sequences can be used, such as the <a href="https://pr2-database.org/">PR2 database</a>, which has high-quality reference sequences curated by other experts.

In [None]:
# Classify using scikit-learn (sklearn)
!qiime feature-classifier classify-sklearn \
    --i-classifier ../classifier/silva-138-nord-classifier.2021-4.qza \
    --i-reads 1-cleanup/7-OTU-rep-seqs.qza \
    --o-classification 2-tax-assign/1-OTU-taxa.qza \
    --verbose

In [None]:
#Tabulate predictions
!qiime metadata tabulate \
    --m-input-file 2-tax-assign/1-OTU-taxa.qza \
    --o-visualization 2-tax-assign/1-OTU-taxa.qzv

In [None]:
# Visualize
import qiime2 as q2
q2.Visualization.load('2-tax-assign/1-OTU-taxa.qzv')

We can view an interactive taxonomic bar plot to see the composition of each sample.

After loading the visualization, set *Level* to 7 to view the most resolved classification. If sample metadata are available in the plot, you can also sort the samples by their metadata.
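Choosing a level amounts to truncating each lineage to its first ranks and grouping OTUs that share the truncated label. The sketch below shows only the truncation step. It assumes lineages are stored as semicolon-separated strings, and both the function name and the example lineage are hypothetical placeholders rather than entries from the classifier used here.

```python
def collapse_lineage(lineage, level):
    """Truncate a semicolon-separated lineage to its first `level` ranks."""
    ranks = [rank.strip() for rank in lineage.split(";")]
    return "; ".join(ranks[:level])


# Hypothetical lineage string used only for illustration
lineage = "d__Eukaryota; p__ExamplePhylum; c__ExampleClass"
for level in (1, 2, 3):
    print(level, collapse_lineage(lineage, level))
```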

In [None]:
#generate a taxa barplot
!qiime taxa barplot \
    --i-table 1-cleanup/7-OTU-table.qza \
    --i-taxonomy 2-tax-assign/1-OTU-taxa.qza \
    --o-visualization 2-tax-assign/2-bar-plots-OTU.qzv

In [None]:
#Visualize
import qiime2 as q2
q2.Visualization.load('2-tax-assign/2-bar-plots-OTU.qzv')

### Exporting OTU tables

We can export the OTU tables in a format we can use in other programs, such as R.

In [None]:
!qiime tools export --input-path 1-cleanup/7-OTU-table.qza  --output-path exported
!qiime tools export --input-path 2-tax-assign/1-OTU-taxa.qza --output-path exported

#Copy taxonomy.tsv to biom-taxonomy.tsv, replacing the first line (i.e. the header) with this tab-separated line:
# #OTUID taxonomy confidence
!sed '1c#OTUID\ttaxonomy\tconfidence' exported/taxonomy.tsv > exported/biom-taxonomy.tsv

In [None]:
!biom add-metadata \
    -i exported/feature-table.biom \
    -o exported/otu-table-with-taxonomy.biom \
    --observation-metadata-fp exported/biom-taxonomy.tsv \
    --sc-separated taxonomy

!biom convert \
    -i exported/otu-table-with-taxonomy.biom \
    -o exported/otu-feature-table-with-tax.tsv \
    --to-tsv \
    --header-key taxonomy

Now, we have an OTU table, showing the feature ID, frequencies in samples, and assigned taxonomy.
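Once exported, the table can also be inspected programmatically, for instance to rank OTUs by their total read count. The sketch below uses only the standard library and a small made-up table. It assumes the layout produced by the conversion above: a first comment line, a header line, one column per sample, and the taxonomy in the last column. The function name `rank_otus` is hypothetical.

```python
import csv
import io


def rank_otus(handle):
    """Return (OTU ID, total count, taxonomy) tuples sorted by decreasing total count."""
    rows = csv.reader(handle, delimiter="\t")
    next(rows)  # skip the leading comment line of the exported table
    header = next(rows)
    # Sample columns sit between the ID column and the taxonomy column
    sample_columns = range(1, len(header) - 1)
    ranked = []
    for row in rows:
        if not row:
            continue
        total = sum(float(row[i]) for i in sample_columns)
        ranked.append((row[0], total, row[-1]))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up table for illustration only
example = io.StringIO(
    "# Constructed from biom file\n"
    "#OTU ID\tsampleA\tsampleB\ttaxonomy\n"
    "otu1\t3.0\t1.0\tlineage A\n"
    "otu2\t10.0\t5.0\tlineage B\n"
)
ranking = rank_otus(example)
print(ranking[0])
```

To apply it to the real output, open `exported/otu-feature-table-with-tax.tsv` and pass the file handle to the function.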

In [None]:
import pandas as pd

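# The first line of the file is a comment, so the column names are read from the second line (header=1)
OTUtable = pd.read_csv('exported/otu-feature-table-with-tax.tsv', sep="\t", header=1)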
OTUtable