<h1>QIIME2 Statistical Analyses</h1>

This notebook is a guide to running downstream analyses in QIIME2, starting from a feature table, feature sequences, and taxonomic assignments. It covers rarefying samples, generating PCoA plots, testing for significant differences in diversity between/among groups, and differential abundance testing.

This workflow was built with the following as the main references: <a href = 'https://docs.qiime2.org/2021.2/tutorials/moving-pictures/'>"Moving pictures" Tutorial</a>

Written for Day 2 of Bioinformatics Workshop by the Microbial Oceanography Laboratory. Credits: LBR dela Peña, BW Hingpit, JB Quijano, D Purganan.

---

## <font color ='blue'>How to Use This Notebook</font>

1. Activate the conda environment in a terminal window. Change the environment name to match your own installation.
>`conda activate qiime2-2021.4`
2. Open jupyter notebook with the command below and select the notebook.
>`jupyter-notebook`
3. To run the cells in this notebook, press Shift+Enter.

---

## Tools Used
1. <b>QIIME2 2021.4</b>

---

## Starting Files 

1. This Jupyter notebook 
2. OTU table and representative sequences generated from <font face="Consolas">**Amplicon_Clustering_Pipeline.ipynb**</font>. These data can be copied from <font face="Consolas">**amplicon_sample_data/otu_table_and_sequences**</font> by running the code block below.
3. Data directories. Run the code block below to create the directories.

In [None]:
!mkdir stat_analyses_demo_folder
%cd stat_analyses_demo_folder
!mkdir \
0-feature-table-and-sequences \
1-cleanup \
2-phylogeny \
3-rarefy \
4-diversity \
5-ancom

In [None]:
!cp ../amplicon_sample_data/otu_table_and_sequences/* 0-feature-table-and-sequences/
!cp ../amplicon_sample_data/metadata-day-2.txt ./

---
## Table of Contents
 * [**Filtering feature table and representative sequences**](#Filtering-OTU-table-and-rep-sequences)  
     * [Filtering metazoan and fungal sequences](#Filtering-metazoan-and-fungal-sequences)
     * [Double-checking](#Double-checking)
 * [**Alpha Rarefaction**](#Alpha-Rarefaction)  
     * [Checking the number of features per sample](#Checking-the-number-of-features-per-sample)
     * [Making alpha rarefaction curves](#Making-alpha-rarefaction-curves)
     * [Choosing a sampling depth and making a phylogenetic tree](#Choosing-a-sampling-depth-and-making-a-phylogenetic-tree)  
     * [Rarefying samples](#Rarefying-samples)
 * [**PCoA plots and statistical testing**](#PCoA-plots-and-statistical-testing)  
     * [Visualizing beta diversity indices](#Visualizing-beta-diversity-indices)
     * [Alpha significance testing between/among groups](#Alpha-significance-testing-between/among-groups)
     * [Beta significance testing between/among groups](#Beta-significance-testing-between/among-groups)  
     * [OTU table with annotation for other stats](#OTU-table-with-annotation-for-other-stats)
 * [**Differential abundance testing with ANCOM**](#Differential-abundance-testing-with-ANCOM)  
---

# <font color = 'gray'>Filtering OTU table and rep sequences</font>

### Filtering metazoan and fungal sequences

The `7-OTU-table.qza` and `7-OTU-rep-seqs.qza` files will be filtered to exclude putative metazoan and fungal sequences.

This is done to focus on protists only.
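
The exclusion step can be pictured as a search over taxonomy strings: any feature whose annotation contains one of the search terms is dropped. The snippet below is a minimal sketch of that idea in plain Python, not the plugin's implementation. The feature IDs, taxonomy strings, and counts are hypothetical, and the case-insensitive substring match is an assumption of this sketch.

```python
# Hypothetical toy data; these IDs and counts are not from the workshop dataset.
taxonomy = {
    "otu1": "d__Eukaryota;p__Metazoa;c__Arthropoda",
    "otu2": "d__Eukaryota;p__Dinoflagellata;c__Dinophyceae",
    "otu3": "d__Eukaryota;p__Fungi;c__Ascomycota",
    "otu4": "d__Eukaryota;p__Ochrophyta;c__Bacillariophyta",
}
table = {
    "otu1": {"siteA": 120, "siteB": 4},
    "otu2": {"siteA": 35, "siteB": 60},
    "otu3": {"siteA": 0, "siteB": 18},
    "otu4": {"siteA": 77, "siteB": 91},
}


def exclude_taxa(table, taxonomy, terms):
    """Keep only features whose taxonomy string contains none of the terms.

    Assumption of this sketch: matching is a case-insensitive substring test.
    """
    lowered = [term.lower() for term in terms]
    return {
        feature_id: counts
        for feature_id, counts in table.items()
        if not any(term in taxonomy[feature_id].lower() for term in lowered)
    }


filtered = exclude_taxa(table, taxonomy, ["p__Metazoa", "p__Fungi"])
print(sorted(filtered))  # IDs of the features that survive the filter
```

Because the same taxonomy is used to filter both the table and the representative sequences, the two outputs should contain the same set of feature IDs.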

In [None]:
# Filtering OTU table
!qiime taxa filter-table \
    --i-table 0-feature-table-and-sequences/7-OTU-table.qza \
    --i-taxonomy 0-feature-table-and-sequences/1-OTU-taxa.qza \
    --p-exclude p__Metazoa,p__Fungi,p__Porifera,p__Cnidaria,p__Lophophorata,p__Platyhelminthes \
    --o-filtered-table 1-cleanup/1-otu-table-cleaned.qza

# Filtering representative sequences
!qiime taxa filter-seqs \
    --i-sequences 0-feature-table-and-sequences/7-OTU-rep-seqs.qza \
    --i-taxonomy 0-feature-table-and-sequences/1-OTU-taxa.qza \
    --p-exclude p__Metazoa,p__Fungi,p__Porifera,p__Cnidaria,p__Lophophorata,p__Platyhelminthes \
    --o-filtered-sequences 1-cleanup/1-otu-rep-seqs-cleaned.qza

### Double-checking
To double-check, taxonomy barplots will be made to see if any metazoan or fungal sequences are still present in the filtered OTU table.

In [None]:
# Taxonomy barplots
!qiime taxa barplot \
    --i-table 1-cleanup/1-otu-table-cleaned.qza \
    --i-taxonomy 0-feature-table-and-sequences/1-OTU-taxa.qza \
    --m-metadata-file metadata-day-2.txt \
    --o-visualization 1-cleanup/1-taxa-barplot-cleaned.qzv

#Visualize
import qiime2 as q2
q2.Visualization.load('1-cleanup/1-taxa-barplot-cleaned.qzv')

# <font color = 'gray'>Alpha Rarefaction</font>

### Checking the number of features per sample
In this step, the `1-otu-table-cleaned.qza` file will be used to view the alpha rarefaction curves.

Alpha rarefaction curves can be used to visualize whether a sample has been sufficiently sequenced to represent its true diversity.

To choose a maximum depth for the rarefaction curves, first check the number of sequences (feature frequency) per sample in the `1-otu-table-cleaned.qza` file. Run the code block below and go to the <i>Interactive Sample Detail</i> tab to find the samples with the lowest and highest sequence counts.

In [None]:
# Summarize OTU table
!qiime feature-table summarize \
    --i-table 1-cleanup/1-otu-table-cleaned.qza \
    --o-visualization 1-cleanup/1-otu-table-cleaned.qzv

import qiime2 as q2
# Visualize
q2.Visualization.load('1-cleanup/1-otu-table-cleaned.qzv')

### Making alpha rarefaction curves
In the <i>Interactive Sample Detail</i> tab, you can see that the lowest and highest per-sample sequence counts are 2329 and 13131, respectively.

To view the rarefaction curves clearly for all the samples, we will set the maximum depth to around 10000 reads. Samples with fewer reads than a given depth simply drop out of the curve at that depth.
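
As a rough illustration of what a rarefaction curve measures, the sketch below subsamples one sample without replacement at several depths and averages the Shannon index over repeated draws. It is a simplified stand-in written in plain Python, not the QIIME2 implementation. The count vector, the depths, the number of iterations, and the use of log base 2 are all assumptions of this sketch.

```python
import math
import random


def subsample(counts, depth, rng):
    """Draw `depth` reads without replacement from a vector of feature counts."""
    pool = [i for i, count in enumerate(counts) for _ in range(count)]
    drawn = rng.sample(pool, depth)
    result = [0] * len(counts)
    for feature_index in drawn:
        result[feature_index] += 1
    return result


def shannon(counts):
    """Shannon index of a count vector (log base 2 assumed in this sketch)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def rarefaction_curve(counts, depths, iterations=10, seed=0):
    """Mean Shannon index at each depth; None where the sample is too shallow."""
    rng = random.Random(seed)
    curve = {}
    for depth in depths:
        if depth > sum(counts):
            curve[depth] = None
            continue
        values = [shannon(subsample(counts, depth, rng)) for _ in range(iterations)]
        curve[depth] = sum(values) / iterations
    return curve


# Hypothetical sample with 1000 reads spread over six features.
sample = [500, 300, 150, 40, 8, 2]
curve = rarefaction_curve(sample, depths=[10, 100, 500, 1000, 2000])
for depth, value in curve.items():
    print(depth, "n/a" if value is None else round(value, 3))
```

A curve that stops rising as depth increases suggests that additional sequencing would add little new diversity for that sample.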

In [None]:
!qiime diversity alpha-rarefaction \
    --i-table 1-cleanup/1-otu-table-cleaned.qza \
    --p-max-depth 10000 \
    --p-metrics 'shannon' \
    --o-visualization 3-rarefy/1-otu-table-cleaned_arare.qzv

# Visualize
q2.Visualization.load('3-rarefy/1-otu-table-cleaned_arare.qzv')

### Choosing a sampling depth and making a phylogenetic tree
After inspecting the rarefaction curves, we will select the sampling depth at which to rarefy the samples. Rarefaction subsamples every sample to the same number of sequences (here, the smallest per-sample sequence count), which allows diversity to be compared between sites despite unequal sequencing effort.

NOTE: The smallest per-sample sequence count is not always used as the sampling depth. This decision should be based on the rarefaction curves. If the selected sampling depth does not fall on the plateaued part of the curves, statistical analyses may be affected because the diversity of the more deeply sequenced samples will be underestimated. Samples with fewer sequences than the chosen depth are dropped from the analysis.
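
The trade-off behind the choice of sampling depth can be made concrete with a little arithmetic: a higher depth keeps more reads per sample but discards every sample that falls below it. The sketch below is only an illustration. The lowest and highest totals mirror the values quoted above, while the two middle values and the sample names are invented.

```python
def retention(sample_totals, depth):
    """Summarize which samples and what share of reads survive a given depth."""
    kept = {name: total for name, total in sample_totals.items() if total >= depth}
    return {
        "samples_kept": len(kept),
        "samples_dropped": sorted(set(sample_totals) - set(kept)),
        # Every retained sample contributes exactly `depth` reads after rarefying.
        "fraction_reads_kept": depth * len(kept) / sum(sample_totals.values()),
    }


# Hypothetical per-sample read totals (sample names are made up).
totals = {"s1": 2329, "s2": 5400, "s3": 7800, "s4": 13131}

for depth in (2329, 6000):
    print(depth, retention(totals, depth))
```

Comparing a few candidate depths this way, alongside the rarefaction curves, makes it easier to justify the value passed to `--p-sampling-depth`.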

Before rarefying the samples and computing alpha and beta diversity indices, run the code block below to build the tree that will be used for phylogeny-based diversity analyses (e.g., Faith's PD and UniFrac).

In [None]:
# Generate a tree for phylogenetic diversity analyses
!qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences 1-cleanup/1-otu-rep-seqs-cleaned.qza \
    --o-alignment 2-phylogeny/aligned-rep-seqs.qza \
    --o-masked-alignment 2-phylogeny/masked-aligned-rep-seqs.qza \
    --o-tree 2-phylogeny/unrooted-tree.qza \
    --o-rooted-tree 2-phylogeny/rooted-tree.qza

### Rarefying samples
As the rarefaction curves for the samples appear to have plateaued by the smallest per-sample sequence count (i.e., 2329), the sampling depth will be set to this number.

In [None]:
# Alpha and Beta Diversity Analyses
!qiime diversity core-metrics-phylogenetic \
    --i-phylogeny 2-phylogeny/rooted-tree.qza \
    --i-table 1-cleanup/1-otu-table-cleaned.qza \
    --p-sampling-depth 2329 \
    --m-metadata-file metadata-day-2.txt \
    --output-dir 4-diversity/1-core-metrics-results

---
# <font color = 'gray'>PCoA plots and statistical testing</font>

### Visualizing beta diversity indices
After looking for discontinuities in the data using hierarchical clustering, the grouping of sites will now be viewed in a reduced ordination space using principal coordinates analysis (PCoA).

Run the code blocks below to view the PCoA plot of each beta diversity metric. You can select a <i>color category</i> for any metadata column, for example, the cluster category, to see if any data points group together.
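
Each PCoA plot is built from a matrix of pairwise distances between samples. To show what two of the non-phylogenetic metrics measure, here is a minimal sketch in plain Python with hypothetical count vectors. Bray-Curtis uses abundances, while Jaccard uses only presence/absence. The UniFrac metrics additionally need the phylogenetic tree and are not sketched here.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


def jaccard(u, v):
    """Jaccard distance based on presence/absence of each feature."""
    present_u = {i for i, a in enumerate(u) if a > 0}
    present_v = {i for i, b in enumerate(v) if b > 0}
    shared = present_u & present_v
    return 1 - len(shared) / len(present_u | present_v)


# Hypothetical rarefied counts for two sites over three features.
site_a = [8, 0, 2]
site_b = [4, 4, 2]

print("Bray-Curtis:", bray_curtis(site_a, site_b))
print("Jaccard:", round(jaccard(site_a, site_b), 3))
```

Both values range from 0 (identical communities) to 1 (nothing shared), so points that sit close together in a PCoA plot correspond to sites with small pairwise distances.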

In [None]:
# Jaccard distance
q2.Visualization.load('4-diversity/1-core-metrics-results/jaccard_emperor.qzv')

In [None]:
# Bray-Curtis dissimilarity
q2.Visualization.load('4-diversity/1-core-metrics-results/bray_curtis_emperor.qzv')

In [None]:
# Unweighted UniFrac
q2.Visualization.load('4-diversity/1-core-metrics-results/unweighted_unifrac_emperor.qzv')

In [None]:
# Weighted UniFrac
q2.Visualization.load('4-diversity/1-core-metrics-results/weighted_unifrac_emperor.qzv')

### Alpha significance testing between/among groups
After visualizing the samples in ordination space and deciding on clusters, the groups will be tested for significant differences in alpha diversity. This checks whether one cluster of sites is significantly more diverse than another.

Run the code blocks below to test for significance and view the results.
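
The `alpha-group-significance` visualizer reports Kruskal-Wallis tests, which compare groups by the ranks of their values rather than by the values themselves. As a rough sketch of the statistic involved, the code below computes the H statistic for hypothetical Shannon values. It assumes there are no tied values and stops short of the p-value, which would require a chi-square distribution.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a dict of group name -> list of values.

    Simplification of this sketch: tied values are not supported.
    """
    pooled = sorted(value for values in groups.values() for value in values)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch does not handle tied values")
    rank = {value: position + 1 for position, value in enumerate(pooled)}
    n_total = len(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    weighted = sum(
        sum(rank[value] for value in values) ** 2 / len(values)
        for values in groups.values()
    )
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Hypothetical Shannon values for three clusters of sites.
shannon_by_cluster = {
    "cluster1": [3.1, 3.4, 3.3],
    "cluster2": [4.0, 4.2, 4.5],
    "cluster3": [5.1, 5.0, 5.3],
}
h_statistic = kruskal_wallis_h(shannon_by_cluster)
print(round(h_statistic, 3))
```

A larger H means the groups' ranks are more separated. With small groups like these, even complete separation may not reach significance, which is worth keeping in mind when reading the visualization.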

In [None]:
# Shannon
!qiime diversity alpha-group-significance \
  --i-alpha-diversity 4-diversity/1-core-metrics-results/shannon_vector.qza \
  --m-metadata-file metadata-day-2.txt \
  --o-visualization 4-diversity/2-shannon-group-significance.qzv

q2.Visualization.load('4-diversity/2-shannon-group-significance.qzv')

In [None]:
# Faith's PD
!qiime diversity alpha-group-significance \
  --i-alpha-diversity 4-diversity/1-core-metrics-results/faith_pd_vector.qza \
  --m-metadata-file metadata-day-2.txt \
  --o-visualization 4-diversity/3-faith_pd-group-significance.qzv

q2.Visualization.load('4-diversity/3-faith_pd-group-significance.qzv')

### Beta significance testing between/among groups
Next, the groups will be tested for significant differences in beta diversity. This checks whether community composition differs significantly between/among groups.

Run the code blocks below to test for significance and view the results.
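
By default, `beta-group-significance` uses PERMANOVA, which compares the observed between-group separation in the distance matrix against the separation obtained after shuffling the group labels. The code below is a minimal sketch of that idea in plain Python. The distance matrix, the labels, and the seed are hypothetical, and this is an illustration rather than the plugin's code.

```python
import itertools
import random


def pseudo_f(dist, labels):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    pairs = itertools.combinations(range(n), 2)
    ss_total = sum(dist[i][j] ** 2 for i, j in pairs) / n
    ss_within = 0.0
    for group in groups:
        members = [i for i, label in enumerate(labels) if label == group]
        member_pairs = itertools.combinations(members, 2)
        ss_within += sum(dist[i][j] ** 2 for i, j in member_pairs) / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(dist, labels, permutations=999, seed=0):
    """Return the observed pseudo-F and a permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical example: six sites placed on a line, forming two tight clusters.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = ["A", "A", "A", "B", "B", "B"]

f_value, p_value = permanova(dist, labels)
print("pseudo-F:", f_value, "p-value:", p_value)
```

The `--p-pairwise` flag used in the cells that follow repeats this kind of test for every pair of groups.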

In [None]:
# Bray-Curtis
!qiime diversity beta-group-significance \
  --i-distance-matrix 4-diversity/1-core-metrics-results/bray_curtis_distance_matrix.qza \
  --m-metadata-file metadata-day-2.txt \
  --m-metadata-column cluster \
  --o-visualization 4-diversity/4-bray-curtis-cluster-significance.qzv \
  --p-pairwise

q2.Visualization.load('4-diversity/4-bray-curtis-cluster-significance.qzv')

In [None]:
# Unweighted UniFrac
!qiime diversity beta-group-significance \
  --i-distance-matrix 4-diversity/1-core-metrics-results/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file metadata-day-2.txt \
  --m-metadata-column cluster \
  --o-visualization 4-diversity/5-unweighted-unifrac-cluster-significance.qzv \
  --p-pairwise

q2.Visualization.load('4-diversity/5-unweighted-unifrac-cluster-significance.qzv')

---
# <font color = 'gray'>Differential abundance testing with ANCOM</font>

Differential abundance testing identifies features (or taxa at a chosen taxonomic level) whose abundance differs significantly between groups of samples.

### Find differentially abundant features

ANCOM requires a <i>FeatureTable[Composition]</i> artifact as input. To convert our OTU table to this artifact type, run the code block below first.
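
The reason for this conversion is that ANCOM works with log-ratios between features, and a log-ratio is undefined whenever one of the counts is zero. Adding a small constant to every cell removes the zeros. The sketch below illustrates this with a hypothetical two-feature table and a pseudocount of 1, and is not the plugin's implementation.

```python
import math


def add_pseudocount(table, pseudocount=1):
    """Add a constant to every count so that no cell is zero."""
    return {
        feature_id: [count + pseudocount for count in counts]
        for feature_id, counts in table.items()
    }


# Hypothetical counts for two features across three samples.
table = {"otu1": [0, 12, 30], "otu2": [5, 0, 9]}
composition = add_pseudocount(table)

# Log-ratios between the two features are now defined for every sample.
log_ratios = [
    math.log(a / b) for a, b in zip(composition["otu1"], composition["otu2"])
]
print(composition)
print([round(value, 3) for value in log_ratios])
```

Without the pseudocount, the first and second samples would produce a division by zero or a logarithm of zero.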

In [None]:
!qiime composition add-pseudocount \
    --i-table 1-cleanup/1-otu-table-cleaned.qza \
    --o-composition-table 5-ancom/1-otu-table-add-psdcnt.qza

Run the code block below to see which features differ significantly in abundance between samples grouped according to the `cluster` metadata category.

In [None]:
!qiime composition ancom \
    --i-table 5-ancom/1-otu-table-add-psdcnt.qza \
    --m-metadata-file metadata-day-2.txt \
    --m-metadata-column cluster \
    --o-visualization 5-ancom/2-ancom-cluster.qzv

In [None]:
#Visualize
q2.Visualization.load("5-ancom/2-ancom-cluster.qzv")

### Find differentially abundant genera

We can also test for differential abundance at a specific taxonomic level. To do this, we must first collapse the feature table to the desired taxonomic level (in the example below, level 6, genus).
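
Collapsing a table means summing the counts of all features that share the same taxonomy down to the chosen level. The sketch below shows the idea on a hypothetical table with three-level taxonomy strings, collapsed to level 2. It simply truncates each taxonomy string, and any handling of features annotated to fewer levels than requested is left out.

```python
def collapse(table, taxonomy, level):
    """Sum per-sample counts of features sharing the first `level` ranks."""
    collapsed = {}
    for feature_id, counts in table.items():
        ranks = [rank.strip() for rank in taxonomy[feature_id].split(";")]
        key = ";".join(ranks[:level])
        if key in collapsed:
            collapsed[key] = [a + b for a, b in zip(collapsed[key], counts)]
        else:
            collapsed[key] = list(counts)
    return collapsed


# Hypothetical features, taxonomy strings, and counts for two samples.
taxonomy = {
    "otu1": "d__Eukaryota; p__Dinoflagellata; c__Dinophyceae",
    "otu2": "d__Eukaryota; p__Dinoflagellata; c__Syndiniales",
    "otu3": "d__Eukaryota; p__Ochrophyta; c__Bacillariophyta",
}
table = {"otu1": [10, 0], "otu2": [5, 7], "otu3": [2, 20]}

collapsed = collapse(table, taxonomy, level=2)
for taxon, counts in collapsed.items():
    print(taxon, counts)
```

The collapsed table has one row per taxon instead of one row per OTU, so the ANCOM results at this level are reported by taxon name.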

In [None]:
# Collapse table
!qiime taxa collapse \
    --i-table 1-cleanup/1-otu-table-cleaned.qza \
    --i-taxonomy 0-feature-table-and-sequences/1-OTU-taxa.qza \
    --p-level 6 \
    --o-collapsed-table 5-ancom/3-otu-table-collapsed-lvl6.qza

# Add pseudocount
!qiime composition add-pseudocount \
    --i-table 5-ancom/3-otu-table-collapsed-lvl6.qza \
    --o-composition-table 5-ancom/3-otu-table-lvl6-add-psdcnt.qza

We can then run differential abundance testing again to determine which genera differ significantly between samples from different clusters.

In [None]:
!qiime composition ancom \
    --i-table 5-ancom/3-otu-table-lvl6-add-psdcnt.qza \
    --m-metadata-file metadata-day-2.txt \
    --m-metadata-column cluster \
    --o-visualization 5-ancom/4-ancom-lvl6-cluster.qzv

In [None]:
#Visualize
q2.Visualization.load("5-ancom/4-ancom-lvl6-cluster.qzv")