<h1>QIIME2 Workflow for DADA2 Denoising of PacBio CCS Reads</h1>

This notebook is a guide to processing PacBio circular consensus sequences (CCS) with QIIME2. It uses the DADA2 plugin in QIIME2, which covers the following steps: primer removal, filtering and trimming, error learning, and denoising.

This workflow was built using the following as the main references: the <a href="https://github.com/LangilleLab/microbiome_helper/wiki/PacBio-CCS-Amplicon-SOP-v1-%28qiime2%29">LangilleLab SOP</a> and the <a href="https://docs.qiime2.org/2021.2/tutorials/moving-pictures/">"Moving pictures" Tutorial</a>.

<h2><font color ='blue'>How to Use This Notebook</font></h2>

1. Activate the conda environment in a terminal window. Change the environment name to match your installation. QIIME2 support for PacBio CCS denoising with DADA2 started with version 2022.2, so use that version or a more recent one.
>`conda activate qiime2-2022.2`
2. Launch Jupyter Notebook with the command below and select this notebook.
>`jupyter-notebook`
3. To run the cells in this notebook, press Shift+Enter.

## Starting Files 

1. This Jupyter notebook
2. Directories for organizing the data. To make the folders, run the following code block:

In [None]:
!mkdir denoising_ccs_demo_folder
%cd denoising_ccs_demo_folder
!mkdir \
0-raw-sequences \
1-cleanup \
2-tax-assign

## Acknowledgement
The data used for this demonstration are from <a href="https://academic.oup.com/nar/article/47/18/e103/5527971">Callahan et al. (2019)</a>. The SRA accession IDs of the data used are SRR8557472 to SRR8557480.

---
## Table of Contents
 * [**Step 1: Downloading Data**](#Step-1:-Downloading-Data)  
 * [**Step 2: Data Preparation**](#Step-2:-Data-Preparation)  
     * [Create manifest file](#Create-manifest-file)
     * [Import reads](#Import-reads)
 * [**Step 3: Denoising**](#Step-3:-Denoising)  
---

# <font color = 'gray'>Step 1: Downloading Data</font>

The command below downloads the demo files and saves them inside the <font face="Consolas">**0-raw-sequences**</font> folder.

In [None]:
!wget \
    -P 0-raw-sequences \
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR855/002/SRR8557472/SRR8557472_subreads.fastq.gz \
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR855/003/SRR8557473/SRR8557473_subreads.fastq.gz \
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR855/004/SRR8557474/SRR8557474_subreads.fastq.gz \
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR855/005/SRR8557475/SRR8557475_subreads.fastq.gz \
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR855/006/SRR8557476/SRR8557476_subreads.fastq.gz \
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR855/007/SRR8557477/SRR8557477_subreads.fastq.gz \
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR855/008/SRR8557478/SRR8557478_subreads.fastq.gz \
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR855/009/SRR8557479/SRR8557479_subreads.fastq.gz \
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR855/000/SRR8557480/SRR8557480_subreads.fastq.gz
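
The nine download URLs above share a regular layout, so they can also be generated from the accession IDs instead of being typed by hand. The sketch below is only an illustration: `ena_fastq_url` is a hypothetical helper, and the directory pattern is inferred solely from the URLs listed above, so other accessions may be laid out differently.

```python
def ena_fastq_url(accession, suffix="_subreads.fastq.gz"):
    """Build an FTP URL for a 10-character run accession such as SRR8557472.

    Assumption: the layout matches the URLs used in this notebook.
    """
    if len(accession) != 10:
        raise ValueError("this sketch only handles 10-character accessions")
    prefix = accession[:6]         # first six characters, e.g. "SRR855"
    subdir = "00" + accession[-1]  # "00" plus the last digit, e.g. "002"
    return (
        "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/"
        f"{prefix}/{subdir}/{accession}/{accession}{suffix}"
    )

# Accessions SRR8557472 to SRR8557480, as listed in the Acknowledgement
accessions = [f"SRR{number}" for number in range(8557472, 8557481)]
urls = [ena_fastq_url(accession) for accession in accessions]
print("\n".join(urls))
```

The printed list could be saved to a text file and passed to a download tool, which keeps the accession range in one place.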

In [None]:
#Extract the compressed FASTQ files (only files ending in .gz)
!gunzip 0-raw-sequences/*.gz
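
Before moving on, it can help to confirm that every file was downloaded and extracted completely. The following is a minimal sketch, not part of the original workflow: `fastq_summary` is a hypothetical helper, and it assumes each FASTQ record occupies exactly four lines (no wrapped sequences).

```python
import glob
import os

def fastq_summary(path):
    """Return (read count, shortest read, longest read) for an uncompressed FASTQ file."""
    count, shortest, longest = 0, None, 0
    with open(path) as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # the second line of each 4-line record is the sequence
                length = len(line.strip())
                count += 1
                shortest = length if shortest is None else min(shortest, length)
                longest = max(longest, length)
    return count, shortest, longest

# Summarize every extracted file in the raw-sequence folder
for path in sorted(glob.glob(os.path.join("0-raw-sequences", "*.fastq"))):
    n_reads, min_len, max_len = fastq_summary(path)
    print(f"{os.path.basename(path)}: {n_reads} reads, length {min_len}-{max_len}")
```

A file with zero reads, or far fewer reads than the others, usually points to an interrupted download.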

---
# <font color = 'gray'>Step 2: Data Preparation</font>

### Create manifest file

Now that the reads have been downloaded and extracted, we are nearly ready to import them into QIIME2. We just need to create a manifest file so that QIIME2 knows which reads to import.

In [None]:
import glob
import os

import pandas as pd

# Collect the absolute path of every extracted FASTQ file
fastq_dir = os.path.join(os.getcwd(), "0-raw-sequences")
abs_paths = sorted(glob.glob(os.path.join(fastq_dir, "*.fastq")))

# Derive each sample ID from its own file name so IDs and paths stay paired,
# e.g. SRR8557472_subreads.fastq -> SRR8557472
sample_ids = [os.path.basename(path).rsplit("_", 2)[0] for path in abs_paths]

# Write a tab-separated manifest with one row per sample
manifest = pd.DataFrame({"sample-id": sample_ids, "absolute-filepath": abs_paths})
manifest.to_csv("manifest.txt", sep="\t", index=False)
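
A quick consistency check on the manifest can catch mismatched rows before the import step. The sketch below is an illustration under stated assumptions: `check_manifest` is a hypothetical helper, it expects the file path in a second column named `absolute-filepath`, and it assumes each file name starts with its sample ID, as the demo files do.

```python
import csv
import os

def check_manifest(manifest_path):
    """Return a list of problems found in a tab-separated manifest (empty if none)."""
    with open(manifest_path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if not rows or len(rows[0]) < 2 or rows[0][1] != "absolute-filepath":
        return ["the second column header should be 'absolute-filepath'"]
    problems, seen = [], set()
    for row in rows[1:]:
        if len(row) < 2:
            problems.append(f"incomplete row: {row}")
            continue
        sample, path = row[0], row[1]
        if sample in seen:
            problems.append(f"duplicate sample ID: {sample}")
        seen.add(sample)
        if not os.path.isabs(path):
            problems.append(f"{sample}: path is not absolute: {path}")
        elif not os.path.isfile(path):
            problems.append(f"{sample}: file not found: {path}")
        elif not os.path.basename(path).startswith(sample):
            problems.append(f"{sample}: file name does not start with the sample ID")
    return problems

# Run the check only if the manifest has already been written
if os.path.exists("manifest.txt"):
    issues = check_manifest("manifest.txt")
    print("\n".join(issues) if issues else "manifest looks consistent")
```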

### Import reads

Now we can import the reads into QIIME2 for further processing.

In [None]:
!qiime tools import \
    --type SampleData[SequencesWithQuality] \
    --input-path manifest.txt \
    --output-path 1-cleanup/ccs_reads.qza \
    --input-format SingleEndFastqManifestPhred33V2

In [None]:
#Check the imported sequences
!qiime demux summarize \
    --i-data 1-cleanup/ccs_reads.qza \
    --o-visualization 1-cleanup/ccs_reads.qzv

In [None]:
#Visualize
import qiime2 as q2
q2.Visualization.load("1-cleanup/ccs_reads.qzv")

---
# <font color = 'gray'>Step 3: Denoising</font>

The command below does more than denoise the reads.

It first reorients the reads. PacBio CCS sequence files are typically in mixed orientation, that is, each read is randomly oriented in either the 5' &#8594; 3' or the 3' &#8594; 5' direction. To put all reads in the same direction, we specify the forward and reverse primers in the <font face="Consolas">--p-front</font> and <font face="Consolas">--p-adapter</font> arguments, respectively.

The command also removes the specified primers and discards reads shorter than <font face="Consolas">--p-min-len</font> or longer than <font face="Consolas">--p-max-len</font>.

Finally, it dereplicates the sequences and attempts to remove chimeric sequences.
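
The idea behind reorientation and primer removal can be pictured with a toy example. This is not how DADA2 implements the step; it is a minimal sketch in which `primer_regex`, `orient_and_trim`, and the example read are hypothetical, and primers must match exactly (no mismatches allowed).

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    """Reverse-complement a sequence, including IUPAC ambiguity codes."""
    return seq.translate(COMPLEMENT)[::-1]

def primer_regex(primer):
    """Turn a degenerate primer into a regular expression."""
    return "".join(f"[{IUPAC[base]}]" for base in primer)

def orient_and_trim(read, front, adapter):
    """Return the sequence between the primers in forward orientation, or None."""
    pattern = re.compile(
        primer_regex(front) + "(.*)" + primer_regex(reverse_complement(adapter))
    )
    # Try the read as given, then its reverse complement
    for candidate in (read, reverse_complement(read)):
        match = pattern.search(candidate)
        if match:
            return match.group(1)
    return None

front = "AGRGTTYGATYMTGGCTCAG"    # forward primer from the command in this notebook
adapter = "RGYTACCTTGTTACGACTT"   # reverse primer from the command in this notebook

# Made-up read: one concrete forward primer, a short insert, and the
# reverse complement of one concrete reverse primer
insert = "ACGTACGTAA"
forward_read = "AGAGTTTGATCCTGGCTCAG" + insert + reverse_complement("GGTTACCTTGTTACGACTT")
flipped_read = reverse_complement(forward_read)

print(orient_and_trim(forward_read, front, adapter))
print(orient_and_trim(flipped_read, front, adapter))
```

Both calls recover the same insert, which is the point of reorienting: after this step, a read and its reverse complement are treated as the same sequence.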

In [None]:
!qiime dada2 denoise-ccs \
    --i-demultiplexed-seqs 1-cleanup/ccs_reads.qza \
    --p-front AGRGTTYGATYMTGGCTCAG \
    --p-adapter RGYTACCTTGTTACGACTT \
    --p-n-threads 4 \
    --p-min-len 1000 \
    --p-max-len 1600 \
    --o-table 1-cleanup/ccs_denoised_table.qza \
    --o-representative-sequences 1-cleanup/ccs_denoised_rep_seqs.qza \
    --o-denoising-stats 1-cleanup/ccs_denoised_stats.qza \
    --verbose

In [None]:
#Create visualization files
!qiime feature-table summarize \
    --i-table 1-cleanup/ccs_denoised_table.qza \
    --o-visualization 1-cleanup/ccs_denoised_table.qzv

!qiime feature-table tabulate-seqs \
    --i-data 1-cleanup/ccs_denoised_rep_seqs.qza \
    --o-visualization 1-cleanup/ccs_denoised_rep_seqs.qzv

!qiime metadata tabulate \
    --m-input-file 1-cleanup/ccs_denoised_stats.qza \
    --o-visualization 1-cleanup/ccs_denoised_stats.qzv

In [None]:
#Visualize table
import qiime2 as q2
q2.Visualization.load("1-cleanup/ccs_denoised_table.qzv")

In [None]:
#Visualize rep seqs
q2.Visualization.load("1-cleanup/ccs_denoised_rep_seqs.qzv")

In [None]:
#Visualize denoising stats
q2.Visualization.load("1-cleanup/ccs_denoised_stats.qzv")

After denoising, you can proceed to feature filtering as described in the **Amplicon_Feature_Filtering.ipynb** and **Amplicon_Clustering_Pipeline.ipynb** notebooks. However, since the samples here have a much shallower sequencing depth, adjust the frequency thresholds at which you filter your features.
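
One possible way to scale a frequency threshold to sequencing depth is to express it as a fraction of the mean sample depth. The sketch below is only an illustration of that heuristic: `min_frequency_cutoff` is a hypothetical helper, the fraction of 0.1% is a placeholder assumption rather than a value taken from the notebooks named above, and the example depths are made up.

```python
def min_frequency_cutoff(sample_depths, fraction=0.001):
    """Return a minimum feature frequency as a fraction of the mean sample depth."""
    if not sample_depths:
        raise ValueError("need at least one sample depth")
    mean_depth = sum(sample_depths) / len(sample_depths)
    # Never go below 1 read, even for very shallow samples
    return max(1, round(mean_depth * fraction))

# Hypothetical per-sample read counts after denoising; replace them with the
# per-sample frequencies shown in ccs_denoised_table.qzv
example_depths = [5200, 4800, 6100]
cutoff = min_frequency_cutoff(example_depths)
print(f"suggested minimum feature frequency: {cutoff}")
```

The resulting number is the kind of value that could be supplied as a minimum frequency when filtering features; whether a fraction of mean depth suits your data is a judgment call.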

You may also proceed to taxonomy assignment as described in the **Amplicon_Clustering_Pipeline.ipynb** and **Amplicon_Denoising_Pipeline.ipynb** notebooks. Note, however, that the classifier used in those notebooks is specific to the 18S V4 region. Select a classifier that covers your entire amplicon.