# QIIME2 Filtering Options for Feature Table and Sequences

This notebook showcases filtering options that can be applied after generating a feature table and representative sequences through either clustering or denoising. Further filtering can help reduce noise in your data. Choose the filtering options and parameters that suit your objectives.

Other filtering examples (removal of singletons and chimeras) can also be found in the <font face="Consolas">**Amplicon_Clustering_Pipeline.ipynb**</font> Jupyter notebook.

This notebook is mainly based on <a href = 'https://github.com/LangilleLab/microbiome_helper/wiki/Amplicon-SOP-v2-(qiime2-2020.8)'>LangilleLab SOP</a>.

---

## <font color ='blue'>How to Use This Notebook</font>

1. Activate the conda environment in a terminal window. Change the environment name to match your installation.
>`conda activate qiime2-2021.4`
2. Open Jupyter Notebook with the command below and select this notebook.
>`jupyter-notebook`
3. To run the cells in this notebook, press Shift+Enter.

---

## Tools Used
1. <b>QIIME2 2021.4</b>

---

## Starting Files 

1. This Jupyter notebook.
2. Feature table and sequences, which will be copied from <font face="Consolas">**amplicon_sample_data/asv_table_and_sequences**</font> using the code block below. The sample data used here are the outputs of the <font face="Consolas">**Amplicon_Denoising_Pipeline.ipynb**</font> Jupyter notebook.
3. Directories for organizing the data. To make the folders and copy the sample data, run the following code blocks:

In [None]:
#Create demo directories
!mkdir data_filtering_demo_folder
%cd data_filtering_demo_folder
!mkdir 0-feature_table_and_sequences

In [None]:
!cp ../amplicon_sample_data/asv_table_and_sequences/* 0-feature_table_and_sequences/

---
## Table of Contents
 * [**1: Filtering Rare Features**](#1:-Filtering-Rare-Features)  
 * [**2: Filter Taxons and Unclassified Features**](#2:-Filter-Taxons-and-Unclassified-Features)  
 * [**3: Exclude Low-depth Samples**](#3:-Exclude-Low-depth-Samples)
     * [Generate Rarefaction Curves](#Generate-Rarefaction-Curves)
     * [Filter Based on Rarefaction Curves](#Filter-Based-on-Rarefaction-Curves)
---

# <font color = 'gray'>1: Filtering Rare Features</font>

The first filtering step removes low-frequency features. The key option is <font face = 'Consolas'><b>--p-min-frequency</b></font>, which sets the minimum total frequency a feature must have to be retained. For MiSeq data, a good value is 0.1% of the mean frequency per sample, which removes ASVs that are likely generated by bleed-through between runs.

To calculate this, open the visualization of the feature table (<font face = 'Consolas'>**2-table-dada2.qzv**</font>). In the "Frequency per sample" section, note the mean frequency and multiply it by 0.001. Then use the result as the value of the <font face = 'Consolas'><b>--p-min-frequency</b></font> option.

However, if you think you are losing too many features, you can be more lenient. For example, you can opt to remove only singletons, as is done in the <font face="Consolas">**Amplicon_Clustering_Pipeline.ipynb**</font> notebook.
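The 0.1% calculation described above can be sketched in a few lines of Python. This is only an illustration: the mean frequency and the per-feature totals are hypothetical values, the helper names are made up for this example, and rounding the cutoff up to the next integer is an assumption rather than a rule from the SOP.

```python
import math

def min_frequency_cutoff(mean_sample_frequency, fraction=0.001):
    """Return a feature-frequency cutoff as a fraction of the mean frequency per sample."""
    # Assumption: round up so the cutoff is never below the requested fraction.
    return math.ceil(mean_sample_frequency * fraction)

def features_kept(feature_totals, cutoff):
    """Return the IDs of features whose total frequency across samples meets the cutoff."""
    return [fid for fid, total in feature_totals.items() if total >= cutoff]

# Hypothetical mean frequency, as read from the "Frequency per sample" section.
mean_frequency = 7950.0
cutoff = min_frequency_cutoff(mean_frequency)
print("Value for --p-min-frequency:", cutoff)

# Hypothetical total counts per feature, to show which ones would survive.
totals = {"asv_a": 1200, "asv_b": 8, "asv_c": 3}
print("Features kept:", features_kept(totals, cutoff))
```

Trying a few values of `fraction` on your own totals is a quick way to see how many features a stricter or more lenient cutoff would remove.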

<font color = 'red'>NOTE: Replace the value specified in <font face = 'Consolas'><b>--p-min-frequency</b></font> with the value applicable to your case.</font>

In [None]:
#Visualize 2-table-dada2.qzv
import qiime2 as q2
q2.Visualization.load('0-feature_table_and_sequences/2-table-dada2.qzv')

In [None]:
#Filter the table
!qiime feature-table filter-features \
    --i-table 0-feature_table_and_sequences/2-table-dada2.qza \
    --p-min-frequency 8 \
    --p-min-samples 1 \
    --o-filtered-table 1-filt-table.qza

In [None]:
#Filter the sequences as well
!qiime feature-table filter-seqs \
    --i-data 0-feature_table_and_sequences/2-rep-seqs-dada2.qza \
    --i-table 1-filt-table.qza \
    --o-filtered-data 1-filt-seqs.qza

Now, let's create visualization files to see how many features are left.

In [None]:
#Create .qzv file of filtered table and sequences
!qiime feature-table summarize \
    --i-table 1-filt-table.qza \
    --o-visualization 1-filt-table.qzv

!qiime feature-table tabulate-seqs \
    --i-data 1-filt-seqs.qza \
    --o-visualization 1-filt-seqs.qzv

In [None]:
#Visualize filtered table
q2.Visualization.load('1-filt-table.qzv')

In [None]:
#Visualize filtered sequences
q2.Visualization.load('1-filt-seqs.qzv')

---
# <font color = 'gray'>2: Filter Taxons and Unclassified Features</font>

The second filtering step removes features classified as belonging to certain taxonomic groups (in the example below, putative metazoan and fungal sequences). You may also want to remove features that are unclassified at the phylum level (done here with the <b><font face = 'Consolas'>--p-include</font></b> option, which keeps only features whose taxonomy contains <font face = 'Consolas'>p__</font>). However, if you are working with a poorly characterized environment or searching for a novel taxon, it may be best to retain unclassified features.
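To make the include/exclude logic concrete, here is a minimal Python sketch of substring-based taxonomy filtering. It only approximates what the QIIME 2 command does: the function names and taxonomy strings are hypothetical, and details such as case handling and other matching modes are ignored.

```python
def matches_any(taxon, terms):
    """Return True if the taxonomy string contains at least one search term."""
    return any(term in taxon for term in terms)

def filter_by_taxonomy(taxonomy, include=None, exclude=None):
    """Keep feature IDs that match an include term (if given) and no exclude term."""
    include_terms = include.split(",") if include else []
    exclude_terms = exclude.split(",") if exclude else []
    kept = []
    for feature_id, taxon in taxonomy.items():
        # Drop features lacking every include term, e.g. no phylum annotation.
        if include_terms and not matches_any(taxon, include_terms):
            continue
        # Drop features that match any excluded group.
        if matches_any(taxon, exclude_terms):
            continue
        kept.append(feature_id)
    return kept

# Hypothetical taxonomy assignments, for illustration only.
taxonomy = {
    "asv_1": "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "asv_2": "d__Eukaryota; p__Cnidaria",
    "asv_3": "d__Bacteria",
    "asv_4": "Unassigned",
}
kept = filter_by_taxonomy(
    taxonomy, include="p__", exclude="p__Metazoa,p__Fungi,p__Cnidaria"
)
print(kept)
```

In this toy example only `asv_1` survives: `asv_2` matches an excluded phylum, while `asv_3` and `asv_4` carry no `p__` label at all.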

In [None]:
#Filter the specified taxon labels on the feature table
!qiime taxa filter-table \
    --i-table 0-feature_table_and_sequences/2-table-dada2.qza \
    --i-taxonomy 0-feature_table_and_sequences/1-asv-taxa.qza \
    --p-exclude p__Metazoa,p__Fungi,p__Porifera,p__Cnidaria,p__Lophophorata,p__Platyhelminthes \
    --p-include p__ \
    --o-filtered-table 2-filt-table.qza

In [None]:
#Filter the specified taxon labels on the feature sequences
!qiime taxa filter-seqs \
    --i-sequences 0-feature_table_and_sequences/2-rep-seqs-dada2.qza \
    --i-taxonomy 0-feature_table_and_sequences/1-asv-taxa.qza \
    --p-exclude p__Metazoa,p__Fungi,p__Porifera,p__Cnidaria,p__Lophophorata,p__Platyhelminthes \
    --p-include p__ \
    --o-filtered-sequences 2-filt-seqs.qza

To check whether the specified sequences were removed, we can use the taxa barplot. With the taxonomic level set to 2, features unclassified at the phylum level and the taxa specified in the <font face="Consolas">**--p-exclude**</font> option should no longer appear.

In [None]:
#Generate taxa barplot post-filtering
!qiime taxa barplot \
    --i-table 2-filt-table.qza \
    --i-taxonomy 0-feature_table_and_sequences/1-asv-taxa.qza \
    --o-visualization 2-taxa-barplot.qzv

In [None]:
#Visualize taxa barplot after filtering
q2.Visualization.load('2-taxa-barplot.qzv')

Now, let's create visualization files of the filtered data.

In [None]:
#Create .qzv file of filtered table and sequences
!qiime feature-table summarize \
    --i-table 2-filt-table.qza \
    --o-visualization 2-filt-table.qzv

!qiime feature-table tabulate-seqs \
    --i-data 2-filt-seqs.qza \
    --o-visualization 2-filt-seqs.qzv

In [None]:
#Visualize filtered table
q2.Visualization.load('2-filt-table.qzv')

In [None]:
#Visualize filtered sequences
q2.Visualization.load('2-filt-seqs.qzv')

---
# <font color = 'gray'>3: Exclude Low-depth Samples</font>

The final filtering example excludes **samples** whose total feature frequency falls below a specified minimum (<b><font face = 'Consolas'>--p-min-frequency</font></b>). The steps below choose this value from rarefaction curves, which help assess the sampling depth at which the diversity of the samples starts to level out. Another possible minimum frequency cutoff is 2000, which, according to this <a href = 'https://forum.qiime2.org/t/deep-sequencing/3586/2'>forum post</a>, is enough to capture community diversity. However, each dataset is unique, and it is best to try different values and see what works for your own application.

If you think you are losing too many samples, you may decrease the minimum cutoff value. You may also skip these steps if you do not want to exclude any samples.
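As a rough illustration of how a minimum-frequency cutoff acts on samples, the sketch below sums the counts per sample and keeps those at or above the cutoff. The table contents, sample names, and the `samples_kept` helper are hypothetical; the QIIME 2 command operates on the real feature table artifact instead.

```python
def samples_kept(table, min_frequency):
    """Return (depths, kept) for a table given as {sample_id: {feature_id: count}}."""
    # A sample's depth is the sum of all its feature counts.
    depths = {sample: sum(counts.values()) for sample, counts in table.items()}
    kept = [sample for sample, depth in depths.items() if depth >= min_frequency]
    return depths, kept

# Hypothetical counts, for illustration only.
table = {
    "sample_A": {"asv_1": 1800, "asv_2": 700},
    "sample_B": {"asv_1": 900, "asv_3": 400},
    "sample_C": {"asv_2": 2500, "asv_3": 100},
}
depths, kept = samples_kept(table, min_frequency=2000)
print(depths)
print("Samples kept:", kept)
```

Running this with several candidate cutoffs shows how many samples each choice would discard before you commit to one.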

First, let us visualize our feature table.

In [None]:
#Visualize
q2.Visualization.load('0-feature_table_and_sequences/2-table-dada2.qzv')

### Generate Rarefaction Curves

In the visualization, look for the sample with the highest depth (in the <i>Interactive Sample Detail</i> tab) and use its depth as the value of the <b><font face = 'Consolas'>--p-max-depth</font></b> option when generating the rarefaction curves.
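For intuition about what a rarefaction curve measures, the sketch below subsamples one sample's reads to several depths and averages the Shannon index at each depth. It is a simplified, standard-library-only illustration and not the QIIME 2 implementation: the counts are hypothetical, and the base-2 logarithm and ten iterations per depth are assumptions made for this example.

```python
import math
import random

def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from per-feature counts."""
    # Expand counts into one entry per read, labelled by feature index.
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    if depth > len(reads):
        raise ValueError("depth exceeds the total frequency of the sample")
    subsampled = [0] * len(counts)
    for i in rng.sample(reads, depth):
        subsampled[i] += 1
    return subsampled

def shannon(counts, base=2):
    """Shannon diversity index of a count vector (base 2 assumed here)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)

rng = random.Random(0)  # fixed seed so the demo is repeatable
sample = [500, 300, 150, 40, 8, 2]  # hypothetical per-feature counts (1000 reads)

for depth in (10, 100, 1000):
    values = [shannon(rarefy(sample, depth, rng)) for _ in range(10)]
    print(depth, round(sum(values) / len(values), 3))
```

When the averaged index stops changing much as the depth grows, the curve has levelled out, which is the pattern to look for in the visualization.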

<font color = 'red'>NOTE: Replace the <font face = 'Consolas'><b>--p-max-depth</b></font> value based on the summary visualization <font face = 'Consolas'><b>2-table-dada2.qzv</b></font>.</font>

In [None]:
#Generate the rarefaction curves
!qiime diversity alpha-rarefaction \
    --i-table 0-feature_table_and_sequences/2-table-dada2.qza \
    --p-max-depth 12760 \
    --p-steps 50 \
    --p-metrics "shannon" \
    --o-visualization 3-rrfctn_curves.qzv

In [None]:
#Visualize
q2.Visualization.load('3-rrfctn_curves.qzv')

### Filter Based on Rarefaction Curves

Finally, inspect the rarefaction curves, determine the depth at which they start to level out, and use this as the minimum cutoff for sample filtering. In this case, 1500 seems to be a good value, since all curves have already plateaued by then. Moreover, since no sample has a sampling depth below 1500, none will be filtered out.

<font color = 'red'>NOTE: Replace the <font face = 'Consolas'><b>--p-min-frequency</b></font> value based on the rarefaction curves.</font>

In [None]:
#Filter samples based on a specified minimum frequency
!qiime feature-table filter-samples \
    --i-table 0-feature_table_and_sequences/2-table-dada2.qza \
    --p-min-frequency 1500 \
    --o-filtered-table 3-filt-table.qza

In [None]:
#Filter the sequences as well
!qiime feature-table filter-seqs \
    --i-data 0-feature_table_and_sequences/2-rep-seqs-dada2.qza \
    --i-table 3-filt-table.qza \
    --o-filtered-data 3-filt-seqs.qza

In [None]:
#Create .qzv file
!qiime feature-table summarize \
    --i-table 3-filt-table.qza \
    --o-visualization 3-filt-table.qzv

!qiime feature-table tabulate-seqs \
    --i-data 3-filt-seqs.qza \
    --o-visualization 3-filt-seqs.qzv

In [None]:
#Visualize filtered table
q2.Visualization.load('3-filt-table.qzv')

In [None]:
#Visualize filtered sequences
q2.Visualization.load('3-filt-seqs.qzv')