<h1>Module: Filtering Features and Samples in QIIME2</h1>

Some features (e.g., OTUs or ASVs) add noise to the data or are simply not relevant to the analysis, and some samples may need to be excluded. This module shows how to remove such features and/or samples from the feature table and the representative sequences.

This module was built with the following as the main references: [LangilleLab SOP](https://github.com/LangilleLab/microbiome_helper/wiki/Amplicon-SOP-v2-(qiime2-2020.8)), ["Moving pictures" Tutorial](https://docs.qiime2.org/2021.2/tutorials/moving-pictures/), and [QIIME2 filtering feature tables](https://docs.qiime2.org/jupyterbooks/cancer-microbiome-intervention-tutorial/030-tutorial-downstream/010-filtering.html).

Created by: _Microbial Oceanography Laboratory (MOLab)_

---
## How to Use This Notebook

1. Activate the conda environment in a terminal window. Change the environment name to match your installation.
```bash
conda activate qiime2-2023.2
```
2. Open Jupyter Notebook with the command below and select this notebook.
```bash
jupyter notebook
```
3. To run the cells in this notebook, press Shift+Enter.

---
## Tools Used
1. **QIIME 2 Amplicon Distribution**
    - Installation procedure can be found here: [QIIME2 native installation](https://docs.qiime2.org/2024.10/install/native/)

---
## Starting Files 

1. This Jupyter notebook.
2. QIIME2 artifact of type `FeatureTable[Frequency]` (named `feature-table.qza` below).
3. QIIME2 artifact of type `FeatureData[Sequence]` (named `rep-seqs.qza` below).

---
## Expected Outputs

1. Filtered feature table (`.qza` of type `FeatureTable[Frequency]`).
2. Filtered feature sequences (`.qza` of type `FeatureData[Sequence]`).

---
## Table of Contents
 * [**Filtering Singletons**](#Filtering-Singletons)  
 * [**Filtering Chimeras**](#Filtering-Chimeras)
     * [Identification of chimeras](#Identification-of-chimeras)
     * [Removing chimeras](#Removing-chimeras)
 * [**Filtering Taxa**](#Filtering-Taxa)
 * [**Other Filtering Options**](#Other-Filtering-Options)

---
# <font color = 'gray'>Filtering Singletons</font>

Singletons are features with a total frequency of 1, i.e. features observed only once across all samples. We can remove them from the feature table using `feature-table filter-features`. Setting `--p-min-frequency` to 2 removes the features occurring only once.

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
Removing singletons is optional. However, because some low-abundance reads may be artifacts, this step gives you an opportunity to clean your data further. If you are interested in rare taxa, you can perform a separate analysis of the low-abundance OTUs only.
</div>

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
If you are using DADA2, singletons are removed automatically by default.
</div>

In [None]:
!qiime feature-table filter-features \
    --i-table feature-table.qza  \
    --p-min-frequency 2 \
    --o-filtered-table feature-table-no-singletons.qza
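
To make the effect of `--p-min-frequency 2` concrete, the following is a minimal Python sketch of the same idea on a toy table. It is only an illustration, not QIIME 2's implementation: the feature IDs, the counts, and the helper `filter_min_frequency` are all hypothetical.

```python
# Toy feature table (hypothetical data): feature ID -> counts per sample.
feature_table = {
    "ASV_1": {"A": 10, "B": 5},
    "ASV_2": {"A": 1, "B": 0},  # total frequency 1: a singleton
    "ASV_3": {"A": 1, "B": 1},  # total frequency 2: retained
}


def filter_min_frequency(table, min_frequency):
    """Keep features whose total count across all samples is >= min_frequency."""
    return {
        feature_id: counts
        for feature_id, counts in table.items()
        if sum(counts.values()) >= min_frequency
    }


no_singletons = filter_min_frequency(feature_table, min_frequency=2)
print(sorted(no_singletons))  # ['ASV_1', 'ASV_3']
```

Note that `ASV_3` is kept even though it has a count of 1 in each sample, because the threshold applies to the total frequency across samples.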

To remove the singletons in `rep-seqs.qza` as well, we subsequently run `feature-table filter-seqs`.

In [None]:
!qiime feature-table filter-seqs \
    --i-data rep-seqs.qza \
    --i-table feature-table-no-singletons.qza \
    --o-filtered-data rep-seqs-no-singletons.qza
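
Conceptually, passing `--i-table` to `filter-seqs` keeps only the sequences whose feature IDs are still present in the filtered table. A rough Python sketch of that behavior, using hypothetical IDs and sequences and a hypothetical helper `filter_seqs_by_table`:

```python
# Hypothetical representative sequences: feature ID -> sequence.
rep_seqs = {
    "ASV_1": "ACGTACGT",
    "ASV_2": "TTGACCGA",
    "ASV_3": "GGCATTAC",
}

# Hypothetical filtered table in which ASV_2 has already been removed.
filtered_table = {
    "ASV_1": {"A": 10, "B": 5},
    "ASV_3": {"A": 1, "B": 1},
}


def filter_seqs_by_table(seqs, table):
    """Keep only sequences whose feature ID is present in the table."""
    return {fid: seq for fid, seq in seqs.items() if fid in table}


filtered_seqs = filter_seqs_by_table(rep_seqs, filtered_table)
print(sorted(filtered_seqs))  # ['ASV_1', 'ASV_3']
```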

Produce visualizations (`.qzv`) for the filtered feature table and representative sequences.

In [None]:
!qiime feature-table summarize \
    --i-table feature-table-no-singletons.qza  \
    --o-visualization feature-table-no-singletons.qzv

In [None]:
!qiime feature-table tabulate-seqs \
    --i-data rep-seqs-no-singletons.qza \
    --o-visualization rep-seqs-no-singletons.qzv

---
# <font color = 'gray'>Filtering Chimeras</font>

Chimeras are artifact sequences generated during PCR, formed when fragments from two or more parent templates are joined. Since they are not true biological sequences, we can remove them using the following steps.

### Identification of chimeras

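First, identify chimeric and non-chimeric sequences using `vsearch uchime-denovo`. With `--output-dir`, the results are written to the `chimeras/` directory, including `nonchimeras.qza`, which is used in the next step.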

In [None]:
!qiime vsearch uchime-denovo \
    --i-sequences rep-seqs.qza \
    --i-table feature-table.qza \
    --output-dir chimeras/

### Removing chimeras

Next, remove the features identified as chimeric from the feature table and the sequences. Passing `chimeras/nonchimeras.qza` as metadata keeps only the features whose IDs appear in it.

In [None]:
!qiime feature-table filter-features \
    --i-table feature-table.qza \
    --m-metadata-file chimeras/nonchimeras.qza \
    --o-filtered-table feature-table-no-chimeras.qza
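
As a sanity check, it can help to know what fraction of reads survives chimera removal. The sketch below illustrates the idea in plain Python with hypothetical per-feature totals and a hypothetical set of non-chimeric IDs; in practice these numbers would come from the `.qzv` summaries rather than from this code.

```python
# Hypothetical total read counts per feature before chimera filtering.
feature_totals = {"ASV_1": 120, "ASV_2": 30, "ASV_3": 50}

# Hypothetical IDs flagged as non-chimeric (the role played by nonchimeras.qza).
nonchimera_ids = {"ASV_1", "ASV_3"}

# Keep only the features whose IDs are in the non-chimeric set.
kept = {fid: n for fid, n in feature_totals.items() if fid in nonchimera_ids}

# Fraction of reads retained after dropping chimeric features.
reads_retained = sum(kept.values()) / sum(feature_totals.values())
print(f"{reads_retained:.0%} of reads retained")  # 85% of reads retained
```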

In [None]:
!qiime feature-table filter-seqs \
    --i-data rep-seqs.qza \
    --i-table feature-table-no-chimeras.qza \
    --o-filtered-data rep-seqs-no-chimeras.qza

Produce visualizations (`.qzv`) for the chimera-filtered feature table and representative sequences.

In [None]:
!qiime feature-table summarize \
    --i-table feature-table-no-chimeras.qza  \
    --o-visualization feature-table-no-chimeras.qzv

In [None]:
!qiime feature-table tabulate-seqs \
    --i-data rep-seqs-no-chimeras.qza \
    --o-visualization rep-seqs-no-chimeras.qzv

---
# <font color = 'gray'>Filtering Taxa</font>

This section demonstrates how you can exclude some taxonomic groups from the OTU table and OTU representative sequences.

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
Taxa filtering is also optional; whether to apply it depends on the scope of your study.
</div>

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
This step assumes that you have assigned taxonomic identities to the representative sequences (<code>rep-seqs-taxa.qza</code>) and that the taxonomic labels follow the same format as SILVA. See the <code>Metabarcoding Taxonomic Assignment.ipynb</code> notebook or module.
</div>

The code below removes from the feature table the features whose taxonomic annotation contains any of the terms listed in `--p-exclude`. In addition, setting `--p-include` to `p__` retains only the features that have at least a phylum-level annotation.

In [None]:
!qiime taxa filter-table \
    --i-table feature-table.qza \
    --i-taxonomy rep-seqs-taxa.qza \
    --p-exclude p__Metazoa,p__Fungi,p__Porifera,p__Cnidaria,p__Lophophorata,p__Platyhelminthes \
    --p-include p__ \
    --o-filtered-table feature-table-tax-filtered.qza
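
The combined effect of `--p-include` and `--p-exclude` can be sketched in Python as substring matching on the taxonomy labels. This is a simplified illustration under stated assumptions, not the plugin's actual code: the labels and the helper `ids_to_keep` are hypothetical, and matching here is case-sensitive for simplicity.

```python
# Hypothetical SILVA-style taxonomy labels: feature ID -> annotation.
taxonomy = {
    "ASV_1": "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "ASV_2": "d__Eukaryota; p__Cnidaria",
    "ASV_3": "d__Bacteria",  # no phylum-level annotation
    "ASV_4": "Unassigned",
}


def ids_to_keep(taxonomy, include, exclude):
    """Keep IDs whose label contains an include term and no exclude term."""
    include_terms = include.split(",")
    exclude_terms = exclude.split(",")
    keep = set()
    for feature_id, label in taxonomy.items():
        has_include = any(term in label for term in include_terms)
        has_exclude = any(term in label for term in exclude_terms)
        if has_include and not has_exclude:
            keep.add(feature_id)
    return keep


kept = ids_to_keep(
    taxonomy,
    include="p__",
    exclude="p__Metazoa,p__Fungi,p__Cnidaria",
)
print(sorted(kept))  # ['ASV_1']
```

Here `ASV_2` is dropped by the exclude list, while `ASV_3` and `ASV_4` are dropped because their labels lack a `p__` field.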

Then you can apply the same filter to `rep-seqs.qza`.

In [None]:
!qiime taxa filter-seqs \
    --i-sequences rep-seqs.qza \
    --i-taxonomy rep-seqs-taxa.qza \
    --p-exclude p__Metazoa,p__Fungi,p__Porifera,p__Cnidaria,p__Lophophorata,p__Platyhelminthes \
    --p-include p__ \
    --o-filtered-sequences rep-seqs-taxa-filtered.qza

---
# <font color = 'gray'>Other Filtering Options</font>

Another useful option is `--p-where`. It takes an SQLite `WHERE` clause, which lets you filter your feature table and/or feature sequences according to your metadata.

Sample metadata:

| sample-id | SITE | MONTH |
|-----------|------|-------|
| A         | S1   | JAN   |
| B         | S1   | FEB   |
| C         | S2   | MAR   |
| D         | S2   | APR   |
| E         | S3   | MAY   |
| F         | S3   | JUN   |

Consider the metadata table displayed above. Because the filter is based on sample metadata, the code below uses the `filter-samples` method to select only the samples whose value under the `MONTH` column is `APR` or `MAR`. The output file name is only an example.

In [None]:
!qiime feature-table filter-samples \
    --i-table feature-table.qza \
    --m-metadata-file metadata.txt \
    --p-where "[MONTH]='APR' OR [MONTH]='MAR'" \
    --o-filtered-table feature-table-mar-apr.qza
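
To preview which samples a `--p-where` clause would select, the same clause can be tried on the example metadata with Python's built-in `sqlite3` module. This is a hedged sketch: the in-memory table named `metadata` is an assumption made for illustration, and QIIME 2's internal handling of the clause may differ in detail.

```python
import sqlite3

# Rows of the example metadata table: (sample-id, SITE, MONTH).
rows = [
    ("A", "S1", "JAN"),
    ("B", "S1", "FEB"),
    ("C", "S2", "MAR"),
    ("D", "S2", "APR"),
    ("E", "S3", "MAY"),
    ("F", "S3", "JUN"),
]

# Load the metadata into a temporary in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE metadata ("sample-id" TEXT, SITE TEXT, MONTH TEXT)')
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)

# The same clause that is passed to --p-where.
where = "[MONTH]='APR' OR [MONTH]='MAR'"
query = f'SELECT "sample-id" FROM metadata WHERE {where} ORDER BY "sample-id"'
selected = [row[0] for row in conn.execute(query)]
conn.close()

print(selected)  # ['C', 'D']
```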