<h1>QIIME2 Workflow for DADA2 Denoising</h1>

This notebook is a guide to processing raw, demultiplexed paired-end reads with QIIME2. It covers quality checking of raw reads, primer trimming, denoising, and taxonomic assignment.

This workflow was built with the following as the main references: <a href = 'https://github.com/LangilleLab/microbiome_helper/wiki/Amplicon-SOP-v2-(qiime2-2020.8)'>LangilleLab SOP</a>, <a href = 'https://docs.qiime2.org/2021.2/tutorials/moving-pictures/'>"Moving pictures" Tutorial</a>, and <a href = 'https://docs.qiime2.org/2021.2/tutorials/atacama-soils/'>"Atacama soil microbiome" tutorial</a>.

---

<h2><font color ='blue'>How to Use This Notebook</font></h2>

1. Activate the conda environment in a terminal window. Change the environment name to match your own installation.
>`conda activate qiime2-2021.4`
2. Open jupyter notebook with the command below and select the notebook.
>`jupyter-notebook`
3. To run the cells in this notebook, press Shift+Enter.

---

## Tools Used
1. <b>QIIME2 2021.4</b>

---

## Starting Files 

1. This Jupyter notebook
2. Raw amplicon sequencing data files found in the folder <font face="Consolas">**amplicon_sample_data/raw_sequences**</font>
3. Naive Bayes classifier and reference sequences found in the folder <font face="Consolas">**classifier**</font>.
4. Directories for organizing the data. To create the folders, run the following code block:

In [None]:
!mkdir -p denoising_demo_folder
%cd denoising_demo_folder
!mkdir -p \
0-raw-sequences \
1-cleanup \
2-tax-assign

---
## Acknowledgement
The data used for this demonstration are from 8 samples collected and sequenced by <a href="https://www.researchgate.net/publication/345988236_Diversity_of_Marine_Eukaryotic_Picophytoplankton_Communities_with_Emphasis_on_Mamiellophyceae_in_Northwestern_Philippines">dela Peña et al. (2021)</a>:

<i>Dela Peña, L. B. R. O., Tejada, A. J. P., Quijano, J. B., Alonzo, K. H., Gernato, E. G., Caril, A., ... & Onda, D. F. L. (2021). Diversity of Marine Eukaryotic Picophytoplankton Communities with Emphasis on Mamiellophyceae in Northwestern Philippines. Philipp. J. Sci, 150, 27-42.</i>

---
## Table of Contents
 * [**Step 1: Data Preparation**](#Step-1:-Data-Preparation)  
     * [Download data](#Download-data)
     * [Making the manifest file](#Making-the-manifest-file)
     * [Importing sequences](#Importing-sequences)  
     * [Quality checking](#Quality-checking)
 * [**Step 2: Data Processing**](#Step-2:-Data-Processing)  
     * [Trimming primers](#Trim-primers)
     * [Denoising](#Denoising-with-DADA2)
 * [**Step 3: Assigning Taxonomy**](#Step-3:-Assign-Taxonomy)
     * [Taxonomy assignment](#Taxonomy-assignment)
     * [Exporting ASV tables](#Exporting-ASV-tables)
---

# <font color = 'gray'>Step 1: Data Preparation</font>

### Download data

The data used to demonstrate this workflow come from the study cited in the Acknowledgement section. The command below downloads every file listed in `data-links.txt`.

In [None]:
!wget -i ../amplicon_sample_data/data-links.txt -P ./0-raw-sequences/

### Making the manifest file

Before we import our data, we have to make a **manifest file** that maps each sample ID to the absolute paths of its forward and reverse read files.

In [None]:
import glob
import os

import pandas as pd

# Map each sample ID to its forward (_1) and reverse (_2) read file
fpath = os.path.join(os.getcwd(), "0-raw-sequences")
forwardpaths, reversepaths = {}, {}
for filepath in sorted(glob.glob(os.path.join(fpath, "*.fastq.gz"))):
    filename = os.path.basename(filepath)
    if filename.endswith("_1.fastq.gz"):
        forwardpaths[filename[:-len("_1.fastq.gz")]] = filepath
    elif filename.endswith("_2.fastq.gz"):
        reversepaths[filename[:-len("_2.fastq.gz")]] = filepath

# Keep only samples with both reads so every row pairs the right files
sampleIDs = sorted(set(forwardpaths) & set(reversepaths))
manifest = pd.DataFrame({
    'sample-id': sampleIDs,
    'forward-absolute-filepath': [forwardpaths[s] for s in sampleIDs],
    'reverse-absolute-filepath': [reversepaths[s] for s in sampleIDs],
})

# Write the tab-separated manifest
manifest.to_csv('manifest.txt', sep='\t', index=False)

The <font face="Consolas">**manifest.txt**</font> file lists each sample ID (or SRA number) together with the absolute paths to its forward and reverse reads.
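
Before importing, it can help to sanity-check the manifest. The following is a minimal sketch, not part of QIIME2: the function name `check_manifest` and the toy sample IDs and paths are made up for illustration. It assumes only that the first column holds the sample IDs and that the two path columns are named as in this notebook.

```python
import csv
import io
import os

PATH_COLUMNS = ("forward-absolute-filepath", "reverse-absolute-filepath")


def check_manifest(handle, require_files=False):
    """Return a list of problems found in a tab-separated paired-end manifest."""
    reader = csv.DictReader(handle, delimiter="\t")
    fields = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in PATH_COLUMNS if c not in fields]
    if problems:
        return problems

    id_column = fields[0]  # assumption: the first column holds the sample IDs
    seen = set()
    for row in reader:
        sample = row[id_column]
        if sample in seen:
            problems.append(f"{sample}: duplicate sample ID")
        seen.add(sample)
        for column in PATH_COLUMNS:
            if not os.path.isabs(row[column]):
                problems.append(f"{sample}: {column} is not an absolute path")
            elif require_files and not os.path.isfile(row[column]):
                problems.append(f"{sample}: {row[column]} does not exist")
        if row[PATH_COLUMNS[0]] == row[PATH_COLUMNS[1]]:
            problems.append(f"{sample}: forward and reverse paths are identical")
    return problems


# Toy manifest with made-up sample IDs and paths; sampleB has a relative path
toy_manifest = (
    "sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n"
    "sampleA\t/data/sampleA_1.fastq.gz\t/data/sampleA_2.fastq.gz\n"
    "sampleB\t/data/sampleB_1.fastq.gz\tdata/sampleB_2.fastq.gz\n"
)
toy_problems = check_manifest(io.StringIO(toy_manifest))
print(toy_problems)
```

To check the real file, open `manifest.txt` and pass the handle with `require_files=True` so that missing read files are reported as well.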


### Importing sequences
Now that we have prepared all the necessary files, we can run our first QIIME command: importing the sequence data.

In [None]:
# Import the sequences
# Give the path to the manifest file after '--input-path'
!qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path manifest.txt \
    --output-path 0-raw-sequences/seqs.qza \
    --input-format PairedEndFastqManifestPhred33V2



This converts the sequence data into a **QIIME artifact**. Artifacts have the extension '.qza'.
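
A '.qza' file is an ordinary zip archive, so its contents can be listed with Python's standard library. The sketch below is illustrative only: the helper name `list_archive` is hypothetical, and the toy archive uses a made-up identifier and contents that merely imitate the usual layout (one top-level directory holding `metadata.yaml` and a `data` folder).

```python
import os
import tempfile
import zipfile


def list_archive(path):
    """Return the sorted member names stored inside a zip-based archive."""
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())


# Build a toy archive that imitates the layout (identifier and contents are made up)
toy_dir = tempfile.mkdtemp()
toy_path = os.path.join(toy_dir, "toy.qza")
with zipfile.ZipFile(toy_path, "w") as archive:
    archive.writestr("toy-uuid/metadata.yaml", "uuid: toy-uuid\ntype: ToyType\nformat: ToyFormat\n")
    archive.writestr("toy-uuid/data/example.txt", "payload\n")

toy_members = list_archive(toy_path)
print(toy_members)
```

After the import cell has run, `list_archive('0-raw-sequences/seqs.qza')` should list the files stored in the real artifact.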

### Quality checking

Our sequences are already *demultiplexed*, meaning they are already separated by sample, so no demultiplexing step is needed. We can use the `demux` plugin directly to summarize and visualize them. **QIIME visualizations** have the extension '.qzv'. The .qzv files can be viewed at http://view.qiime2.org, or we can import the `qiime2` module to view them inline.



In [None]:
# Make summary of the QIIME2 artifact (.qza file)
!qiime demux summarize \
    --i-data 0-raw-sequences/seqs.qza \
    --p-n 100000 \
    --o-visualization 0-raw-sequences/seqs.qzv

In [None]:
import qiime2 as q2
# Visualize
q2.Visualization.load('0-raw-sequences/seqs.qzv')

---
# <font color = 'gray'>Step 2: Data Processing</font>

This stage involves only two steps:
1. Trim primers
2. Denoising with DADA2

The DADA2 workflow wraps read merging, quality control (QC), dereplication, and chimera filtering. Unlike the OTU clustering workflow, it therefore needs no separate commands for those steps.
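
To make one of the wrapped steps concrete, dereplication collapses identical reads into unique sequences with abundance counts. The sketch below is a conceptual illustration with made-up reads, not DADA2's actual implementation.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into (sequence, count) pairs, most abundant first."""
    counts = Counter(reads)
    # Sort by decreasing abundance, then alphabetically so ties are deterministic
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy reads (made up for illustration)
toy_reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTGA", "ACGA"]
unique_reads = dereplicate(toy_reads)
print(unique_reads)
```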

### Trim primers
To remove the primers from our sequences, we use the `cutadapt` plugin. The primers used were E572F/E1009R, which are <b>18 bp</b> and <b>20 bp</b> long, respectively. Removing primers is especially important when they contain ambiguous (degenerate) bases, which could otherwise be mistaken for chimeric or low-quality positions. You can explore primer sequences, lengths, and predicted amplicon sizes in the <a href="https://app.pr2-primers.org/">PR2 Primers</a> app.
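
The primer lengths and their ambiguous positions can be checked with a few lines of Python. This is a minimal sketch using the standard IUPAC nucleotide codes; the helper names `primer_to_regex` and `count_variants` are hypothetical, and the primer sequences are the ones passed to `cutadapt` in this notebook.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Turn a degenerate primer into a regular expression over A, C, G, T."""
    parts = []
    for base in primer.upper():
        bases = IUPAC[base]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def count_variants(primer):
    """Number of distinct non-degenerate sequences the primer represents."""
    total = 1
    for base in primer.upper():
        total *= len(IUPAC[base])
    return total


forward = "CYGCGGTAATTCCAGCTC"
reverse = "AYGGTATCTRATCRTCTTYG"
for name, primer in (("forward", forward), ("reverse", reverse)):
    print(name, len(primer), count_variants(primer), primer_to_regex(primer))

# One concrete variant of the forward primer (Y resolved to T) matches the pattern
print(bool(re.fullmatch(primer_to_regex(forward), "CTGCGGTAATTCCAGCTC")))
```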

<font color = 'red'>NOTE: Remember to set the primer pair sequences that apply to your data in the <font face = 'Consolas'><b>--p-front-f</b></font> and <font face = 'Consolas'><b>--p-front-r</b></font> options.</font>

In [None]:
!qiime cutadapt trim-paired \
    --i-demultiplexed-sequences 0-raw-sequences/seqs.qza \
    --p-front-f CYGCGGTAATTCCAGCTC  \
    --p-front-r AYGGTATCTRATCRTCTTYG  \
    --p-error-rate 0 \
    --p-discard-untrimmed \
    --o-trimmed-sequences 1-cleanup/1-primer-trimmed-seqs.qza

In [None]:
#Check quality after trimming primers
!qiime demux summarize \
    --i-data 1-cleanup/1-primer-trimmed-seqs.qza \
    --p-n 100000 \
    --o-visualization 1-cleanup/1-primer-trimmed-seqs.qzv

In [None]:
#Visualize
import qiime2 as q2
q2.Visualization.load('1-cleanup/1-primer-trimmed-seqs.qzv')

### Denoising with DADA2
QIIME2 offers two denoising methods, Deblur and DADA2. This workflow uses DADA2, a pipeline for inferring amplicon sequence variants (ASVs) from high-throughput sequencing (HTS) data.
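
Because DADA2 merges the forward and reverse reads, the truncation lengths given in `--p-trunc-len-f` and `--p-trunc-len-r` must leave enough overlap between the two reads. The arithmetic can be sketched as follows. The function names are hypothetical, the amplicon length of 400 is a placeholder rather than a measured value for these primers, and the minimum overlap of 12 is the default of DADA2's `mergePairs` function in R, which may differ in your setup.

```python
def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Bases shared by the truncated forward and reverse reads of one amplicon."""
    return trunc_len_f + trunc_len_r - amplicon_len


def can_merge(trunc_len_f, trunc_len_r, amplicon_len, min_overlap=12):
    """True if the truncated reads overlap by at least min_overlap bases."""
    return expected_overlap(trunc_len_f, trunc_len_r, amplicon_len) >= min_overlap


# 258 and 237 are the truncation lengths used in this notebook's denoising cell;
# amplicon_len=400 is a HYPOTHETICAL placeholder (primers removed), replace it
# with the expected length for your own primer pair
overlap = expected_overlap(258, 237, amplicon_len=400)
print(overlap, can_merge(258, 237, amplicon_len=400))
```

If the overlap comes out too small for the longest amplicons you expect, truncate less aggressively on one or both reads.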

<font color = 'red'>NOTE: Change the truncation lengths given in the <font face = 'Consolas'><b>--p-trunc-len-f</b></font> (forward reads) and <font face = 'Consolas'><b>--p-trunc-len-r</b></font> (reverse reads) options. Each read is cut at that position, so the bases beyond it at the 3' end are removed, and reads shorter than the truncation length are discarded. Base your choice on the quality report of the primer-trimmed sequences, while leaving enough overlap for the forward and reverse reads to merge.</font>

In [None]:
#Denoising with dada2
!qiime dada2 denoise-paired \
    --i-demultiplexed-seqs 1-cleanup/1-primer-trimmed-seqs.qza \
    --p-trunc-len-f 258 \
    --p-trunc-len-r 237 \
    --o-table 1-cleanup/2-table-dada2.qza \
    --o-representative-sequences 1-cleanup/2-rep-seqs-dada2.qza \
    --o-denoising-stats 1-cleanup/2-stats-dada2.qza \
    --p-n-threads 0

In [None]:
#Creating a visualization file of the denoising stats output
!qiime metadata tabulate \
    --m-input-file 1-cleanup/2-stats-dada2.qza \
    --o-visualization 1-cleanup/2-stats-dada2.qzv

In [None]:
#Visualize
q2.Visualization.load('1-cleanup/2-stats-dada2.qzv')
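
The denoising stats are easiest to judge as percentages of the input reads. The sketch below computes them with pandas on a toy table. The read counts are made up, and the column names (`input`, `filtered`, `denoised`, `merged`, `non-chimeric`) are assumed to match those in the stats visualization, so check them against your own output.

```python
import pandas as pd


def retention(stats):
    """Percentage of input reads surviving each DADA2 stage, per sample."""
    stages = ["filtered", "denoised", "merged", "non-chimeric"]
    return stats[stages].div(stats["input"], axis=0).mul(100).round(1)


# Toy stats table with made-up read counts
toy_stats = pd.DataFrame(
    {
        "input": [1000, 2000],
        "filtered": [900, 1500],
        "denoised": [880, 1400],
        "merged": [800, 1000],
        "non-chimeric": [750, 900],
    },
    index=pd.Index(["sampleA", "sampleB"], name="sample-id"),
)
toy_retention = retention(toy_stats)
print(toy_retention)
```

A sharp drop at the `merged` stage usually suggests that the truncation lengths leave too little overlap between the forward and reverse reads.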

In [None]:
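#Creating a visualization file of the feature table
#NOTE: metadata.txt (the sample metadata file) is not created in this notebook; place it in the working directory first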
!qiime feature-table summarize \
    --i-table 1-cleanup/2-table-dada2.qza \
    --o-visualization 1-cleanup/2-table-dada2.qzv \
    --m-sample-metadata-file metadata.txt

In [None]:
#Visualize
q2.Visualization.load('1-cleanup/2-table-dada2.qzv')

In [None]:
#Creating a visualization file of the ASV sequences
!qiime feature-table tabulate-seqs \
    --i-data 1-cleanup/2-rep-seqs-dada2.qza \
    --o-visualization 1-cleanup/2-rep-seqs-dada2.qzv

In [None]:
#Visualize
q2.Visualization.load('1-cleanup/2-rep-seqs-dada2.qzv')

---
# <font color = 'gray'>Step 3: Assign Taxonomy</font>


### Taxonomy assignment

To annotate the metabarcoding data, we classify the ASV sequences against a reference database using the `classify-sklearn` method of the `feature-classifier` plugin, which is built on scikit-learn. For 18S rRNA eukaryotic data, we will use a curated database that has been optimized for the specific region targeted by the primers used in this run (18S V4 region). This database was annotated in the Microbial Oceanography Laboratory and features entries from both the <a href="https://www.arb-silva.de/">SILVA</a> and <a href="http://www.cen.ulaval.ca/nordicanad/dpage.aspx?doi=45409XD-79A199B76BCC4110">Nordicana</a> databases. Other references for eukaryotic sequences can be used, such as the <a href="https://pr2-database.org/">PR2 database</a>, which contains high-quality reference sequences curated by experts.

<font color = 'red'>NOTE: Replace the file given to the <font face = 'Consolas'><b>--i-classifier</b></font> option with the classifier you will use.</font>

In [None]:
#Using the SILVA/Nordicana classifier to assign taxonomies to the ASV sequences
!qiime feature-classifier classify-sklearn \
    --i-reads 1-cleanup/2-rep-seqs-dada2.qza \
    --i-classifier ../classifier/silva-138-nord-classifier.2021-4.qza \
    --o-classification 2-tax-assign/1-asv-taxa.qza 

In [None]:
#Tabulate predictions
!qiime metadata tabulate \
    --m-input-file 2-tax-assign/1-asv-taxa.qza \
    --o-visualization 2-tax-assign/1-asv-taxa.qzv

In [None]:
#Visualize
q2.Visualization.load('2-tax-assign/1-asv-taxa.qzv')

We can view an interactive taxonomic barplot to see the composition of each sample.

After loading the visualization, set *Level* to 7 to view the most resolved classification. You can also sort the samples by their metadata.
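
Conceptually, choosing a level in the barplot truncates each semicolon-separated lineage to its first N ranks and sums the counts of features that end up sharing a label. The following sketch illustrates that idea on made-up lineages and counts. The function names are hypothetical, and the real labels depend on the classifier used.

```python
from collections import defaultdict


def truncate_lineage(lineage, level):
    """Keep the first `level` ranks of a semicolon-separated lineage."""
    ranks = [rank.strip() for rank in lineage.split(";") if rank.strip()]
    return "; ".join(ranks[:level])


def collapse(counts, level):
    """Sum counts of lineages that become identical at the given level."""
    collapsed = defaultdict(int)
    for lineage, count in counts.items():
        collapsed[truncate_lineage(lineage, level)] += count
    return dict(collapsed)


# Toy lineages and counts (made up for illustration)
toy_counts = {
    "Eukaryota; Archaeplastida; Chlorophyta; Mamiellophyceae": 120,
    "Eukaryota; Archaeplastida; Chlorophyta; Trebouxiophyceae": 30,
    "Eukaryota; Stramenopiles; Ochrophyta; Bacillariophyceae": 50,
}
collapsed = collapse(toy_counts, 3)
print(collapsed)
```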

In [None]:
#Generate a taxa barplot; the metadata file enables sorting the samples by metadata
!qiime taxa barplot \
    --i-table 1-cleanup/2-table-dada2.qza \
    --i-taxonomy 2-tax-assign/1-asv-taxa.qza \
    --m-metadata-file metadata.txt \
    --o-visualization 2-tax-assign/2-bar-plots-asv.qzv

In [None]:
#Visualize
import qiime2 as q2
q2.Visualization.load('2-tax-assign/2-bar-plots-asv.qzv')

### Exporting ASV tables

We can export the ASV table, taxonomy assignments, and representative sequences in formats we can use in other programs, such as R.
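
One step of the export renames the header of the exported `taxonomy.tsv` so that the taxonomy column is called `taxonomy`, the name the `--sc-separated taxonomy` option refers to. The notebook does this with `sed`. A pure-Python equivalent might look like the sketch below, where the helper name `rewrite_header` is hypothetical and the toy file contents are made up.

```python
import os
import tempfile


def rewrite_header(src, dst, header="#OTUID\ttaxonomy\tconfidence"):
    """Copy src to dst, replacing the first line with the given header."""
    with open(src) as infile, open(dst, "w") as outfile:
        infile.readline()  # discard the original header line
        outfile.write(header + "\n")
        for line in infile:
            outfile.write(line)


# Toy taxonomy file (made-up feature ID and lineage) in a temporary folder
toy_dir = tempfile.mkdtemp()
toy_src = os.path.join(toy_dir, "taxonomy.tsv")
toy_dst = os.path.join(toy_dir, "biom-taxonomy.tsv")
with open(toy_src, "w") as handle:
    handle.write("Feature ID\tTaxon\tConfidence\n")
    handle.write("asv1\tEukaryota; Chlorophyta\t0.98\n")

rewrite_header(toy_src, toy_dst)
with open(toy_dst) as handle:
    toy_lines = handle.read().splitlines()
print(toy_lines)
```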

In [None]:
#Export feature table
!qiime tools export --input-path 1-cleanup/2-table-dada2.qza  --output-path exported

#Export taxonomy assignments
!qiime tools export --input-path 2-tax-assign/1-asv-taxa.qza --output-path exported

#Export representative sequences
!qiime tools export --input-path 1-cleanup/2-rep-seqs-dada2.qza --output-path exported

#Replace the first line (the header) of taxonomy.tsv with the line below
#and save the result as biom-taxonomy.tsv:
# #OTUID taxonomy confidence
!sed '1c#OTUID\ttaxonomy\tconfidence' exported/taxonomy.tsv > exported/biom-taxonomy.tsv

In [None]:
#Add the taxonomy to the feature table as observation metadata
!biom add-metadata \
    -i exported/feature-table.biom \
    -o exported/asv-table-with-taxonomy.biom \
    --observation-metadata-fp exported/biom-taxonomy.tsv \
    --sc-separated taxonomy

#Convert feature table with taxonomy from .biom to .tsv format
!biom convert \
    -i exported/asv-table-with-taxonomy.biom \
    -o exported/asv-feature-table-with-tax.tsv \
    --to-tsv \
    --header-key taxonomy

#Convert feature table from .biom to .tsv format
!biom convert \
    -i exported/feature-table.biom \
    -o exported/feature-table.tsv \
    --to-tsv

In [None]:
import pandas as pd

OTUtable = pd.read_csv('exported/asv-feature-table-with-tax.tsv', sep="\t", header = 1)
OTUtable
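
Once the table is in pandas, per-sample summaries take only a few lines. The sketch below works on a toy table with made-up counts that imitates the layout of the exported TSV: a comment line, then a header row starting with `#OTU ID`, with `taxonomy` as the last column. That layout is an assumption to check against your own file.

```python
import io

import pandas as pd

# Toy table imitating the exported TSV (all values are made up)
toy_tsv = (
    "# Constructed from biom file\n"
    "#OTU ID\tsampleA\tsampleB\ttaxonomy\n"
    "asv1\t10.0\t0.0\tEukaryota; Chlorophyta\n"
    "asv2\t5.0\t20.0\tEukaryota; Ochrophyta\n"
)
table = pd.read_csv(io.StringIO(toy_tsv), sep="\t", header=1, index_col=0)

# Separate the count columns from the taxonomy labels
counts = table.drop(columns="taxonomy")

reads_per_sample = counts.sum(axis=0)        # total reads in each sample
asvs_per_sample = (counts > 0).sum(axis=0)   # number of ASVs observed in each sample
print(reads_per_sample)
print(asvs_per_sample)
```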