# Module: Diversity Analysis in QIIME2

In diversity analysis, we explore the richness and composition of communities to understand patterns of biodiversity across different environments or conditions.

Here, we illustrate how various diversity indices can be calculated in QIIME2. Additionally, we will see different outputs that can be used for data exploration (PCoA) and summaries of statistical tests.

The following resources were used for this tutorial: ["Moving Pictures" Tutorial](https://docs.qiime2.org/2024.10/tutorials/moving-pictures/).

Created by: _Microbial Oceanography Laboratory (MOLab)_

---
## How to Use This Notebook

1. Activate the conda environment in a terminal window. Change the environment name below to match your own installation.
>`conda activate qiime2-2023.2`
2. Start Jupyter Notebook with the command below and select this notebook.
>`jupyter notebook`
3. To run the cells in this notebook, press Shift+Enter.

---
## Tools Used
1. **QIIME 2 Amplicon Distribution**
    - Installation procedure can be found here: [QIIME2 native installation](https://docs.qiime2.org/2024.10/install/native/)

---
## Starting Files 

1. `.qza` of type `FeatureTable[Frequency]`. Can be generated from OTU clustering or denoising.
2. `.qza` of type `FeatureData[Sequence]`. Can be generated from OTU clustering or denoising.

---
## Expected Outputs

1. Rarefied feature table (`.qza` of type `FeatureTable[Frequency]`).
2. Various `.qza` and `.qzv` files related to alpha rarefaction and calculation of diversity indices. 

---
## Table of Contents
 * [**Alpha Rarefaction**](#Alpha-Rarefaction)
     * [Checking the number of features per sample](#Checking-the-number-of-features-per-sample)
     * [Making alpha rarefaction curves](#Making-alpha-rarefaction-curves)
 * [**Calculation of Diversity Indices**](#Calculation-of-Diversity-Indices)
     * [Generate phylogenetic tree](#Generate-phylogenetic-tree)
     * [Rarefy and calculate diversity indices](#Rarefy-and-calculate-diversity-indices)
 * [**Principal Coordinate Analysis**](#Principal-Coordinate-Analysis)
 * [**Statistical Testing**](#Statistical-Testing)
     * [Alpha significance testing](#Alpha-significance-testing)
     * [Alpha correlation](#Alpha-correlation)
     * [Beta significance testing](#Beta-significance-testing)

---
# <font color = 'gray'>Alpha Rarefaction</font>

Rarefaction is simply a process of randomly subsampling the data. In this context, we randomly select _N_ reads from each sample so that each sample would end up having equal sampling depth. The reason for this is that different samples often have varying sequencing depths, which can introduce bias when comparing diversity metrics across samples. By rarefying to a uniform depth, we mitigate the impact of sequencing depth variability, allowing for a more accurate comparison of diversity metrics between samples.

Below we will see how to select the rarefaction depth, _N_.

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
Although rarefaction normalizes library depth, it is not free from criticism. Many argue that it discards valid data, making inefficient use of the sequencing library. Additionally, rare taxa might be lost during random subsampling, leading to an underestimation of community richness. It also decreases statistical power when trying to infer differentially abundant taxa.
</div>
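
To make the subsampling step concrete, here is a minimal pure-Python sketch of rarefying a single sample. The function name `rarefy`, the feature IDs, and the counts are hypothetical and chosen only for illustration; this is not how QIIME2 implements rarefaction internally.

```python
import random

def rarefy(counts, depth, seed=0):
    """Randomly draw `depth` reads without replacement from one sample.

    `counts` maps feature IDs to read counts. Returns None when the sample
    has fewer than `depth` reads, since such a sample cannot be rarefied
    to that depth and would be omitted.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand the counts into one entry per read, then draw reads at random.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    subsample = rng.sample(reads, depth)
    rarefied = {}
    for feature in subsample:
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied

# Hypothetical sample: one dominant feature and two rarer ones.
sample = {"ASV_1": 950, "ASV_2": 45, "ASV_3": 5}
rarefied = rarefy(sample, depth=100)
print(rarefied)
print("reads kept:", sum(rarefied.values()))
```

Running the sketch with different seeds shows that the rarest feature can disappear from the subsample, which is the loss of rare taxa mentioned in the note above.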

### Checking the number of features per sample

First, we look at the number of reads per sample. Run the code blocks below, and in the visualization, go to the _Interactive Sample Detail_ tab. Take note of the samples with the lowest and highest number of reads.

In [None]:
!qiime feature-table summarize \
    --i-table feature-table.qza \
    --o-visualization feature-table.qzv

In [None]:
import qiime2 as q2
q2.Visualization.load('feature-table.qzv')

### Making alpha rarefaction curves

Next, we make an alpha rarefaction curve. This plot shows the alpha diversity of each sample rarefied at different depths. For the `--p-max-depth` argument, enter the highest per-sample read count noted above (here, `100000` is only a placeholder). Increase the value of `--p-steps` if you want a smoother curve. You can also calculate additional diversity indices by repeating the `--p-metrics` argument.

In [None]:
!qiime diversity alpha-rarefaction \
    --i-table feature-table.qza \
    --p-max-depth 100000 \
    --p-steps 20 \
    --p-metrics 'shannon' \
    --p-metrics 'observed_features' \
    --p-metrics 'chao1' \
    --o-visualization alpha-rare-curve.qzv

In [None]:
import qiime2 as q2
q2.Visualization.load('alpha-rare-curve.qzv')
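
For intuition about two of the indices plotted in the curve, the sketch below computes observed features (richness) and the Shannon index for two hypothetical samples. The log base of 2 is an assumption made for this example; the base used by a given tool or version may differ, which changes the scale of the values but not their ranking.

```python
import math

def observed_features(counts):
    """Richness: number of features with at least one read."""
    return sum(1 for n in counts if n > 0)

def shannon(counts, base=2):
    """Shannon index: -sum(p * log(p)) over features with non-zero counts."""
    total = sum(counts)
    proportions = [n / total for n in counts if n > 0]
    return -sum(p * math.log(p, base) for p in proportions)

# Hypothetical samples with the same richness but different evenness.
even_sample = [25, 25, 25, 25]
uneven_sample = [97, 1, 1, 1]

for name, counts in (("even", even_sample), ("uneven", uneven_sample)):
    print(name, observed_features(counts), round(shannon(counts), 3))
```

Both samples contain four observed features, but the uneven sample has a lower Shannon value because a single feature dominates it.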

<div class="alert alert-block alert-success">
<b>Question:</b> 
    
Does the alpha diversity index of all samples plateau? If yes, at what depth should you rarefy your samples? If not, again, at what depth should you rarefy your samples and how many samples will be omitted if rarefied at this depth?
</div>

If you are interested in characterizing general patterns in your samples, selecting an index that takes into account both richness and evenness (such as Shannon) may be more relevant. Based on the above alpha rarefaction curve for the selected index, select the rarefaction depth, _N_. In the next section, we will see how to rarefy the samples at this depth.
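
When weighing candidate depths, it can help to tabulate how many samples would survive each one. The following is a small sketch under the assumption that you have copied the per-sample read totals from the feature table summary into a dictionary; the sample names and numbers are hypothetical.

```python
# Hypothetical per-sample read totals taken from the feature table summary.
sample_depths = {"S1": 42000, "S2": 35500, "S3": 31000, "S4": 12000, "S5": 58000}

def retained_samples(depths, sampling_depth):
    """Return the samples with at least `sampling_depth` reads."""
    return sorted(s for s, n in depths.items() if n >= sampling_depth)

for candidate in (10000, 30000, 40000):
    kept = retained_samples(sample_depths, candidate)
    dropped = len(sample_depths) - len(kept)
    print(f"depth {candidate}: keep {len(kept)} samples, omit {dropped}")
```

A deeper rarefaction depth keeps more reads per sample but omits more samples, so the choice is a trade-off between the two.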

---
# <font color = 'gray'>Calculation of Diversity Indices</font>



### Generate phylogenetic tree

Before rarefying the samples and computing diversity indices, run the code block below to build a tree file, which will be used for phylogeny-based diversity analyses (e.g., Faith's PD and UniFrac).

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
If you are not interested in phylogeny-aware metrics, you can skip this step.
</div>

In [None]:
!qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences feature-rep-seqs.qza \
    --o-alignment aligned-rep-seqs.qza \
    --o-masked-alignment masked-aligned-rep-seqs.qza \
    --o-tree unrooted-tree.qza \
    --o-rooted-tree rooted-tree.qza

### Rarefy and calculate diversity indices

Rarefaction and calculation of alpha and beta diversity indices (including phylogeny-aware metrics) can be done in a single command (`diversity core-metrics-phylogenetic`). Besides the feature table (`feature-table.qza`) and a rooted phylogenetic tree (`rooted-tree.qza`), you will also need your metadata file (`metadata.txt`), which will be used in grouping your samples for statistical analyses.

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
Replace the value specified in the <code>--p-sampling-depth</code> parameter by an appropriate depth for your dataset.
</div>

In [None]:
!qiime diversity core-metrics-phylogenetic \
    --i-phylogeny rooted-tree.qza \
    --i-table feature-table.qza \
    --m-metadata-file metadata.txt \
    --p-sampling-depth 30000 \
    --output-dir core-metrics-results

---
# <font color = 'gray'>Principal Coordinate Analysis</font>

Principal coordinate analysis (PCoA) is a multivariate analysis technique that begins with a distance matrix, typically derived from a beta diversity index. It then projects samples into a lower-dimensional space (commonly 2 or 3 dimensions) to facilitate visualization of potential clusters within the dataset.

Inside the `core-metrics-results` output of the code above, you will see multiple `.qzv` files containing PCoA plots (e.g., `jaccard_emperor.qzv`). You can view these files inline (code below) or through the [QIIME2 visualizer](https://view.qiime2.org/) to see the PCoA plots.

**Jaccard Distance**

In [None]:
import qiime2 as q2
q2.Visualization.load('core-metrics-results/jaccard_emperor.qzv')

**Bray-Curtis Distance**

In [None]:
q2.Visualization.load('core-metrics-results/bray_curtis_emperor.qzv')

**Unweighted UniFrac Distance**

In [None]:
q2.Visualization.load('core-metrics-results/unweighted_unifrac_emperor.qzv')

**Weighted UniFrac Distance**

In [None]:
q2.Visualization.load('core-metrics-results/weighted_unifrac_emperor.qzv')
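
Each of these plots starts from a pairwise distance matrix. As a rough illustration of the two non-phylogenetic distances, the sketch below computes the Jaccard and Bray-Curtis distances between two hypothetical samples using their textbook definitions. The count vectors are invented for this example, and the UniFrac distances are not covered because they also require the phylogenetic tree.

```python
def jaccard_distance(a, b):
    """1 - (shared features / features present in either sample)."""
    present_a = {i for i, n in enumerate(a) if n > 0}
    present_b = {i for i, n in enumerate(b) if n > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)

def bray_curtis_distance(a, b):
    """Sum of absolute count differences divided by the total count."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator

# Hypothetical rarefied counts for the same four features in two samples.
sample_a = [10, 0, 5, 5]
sample_b = [10, 5, 0, 5]

print("Jaccard:", jaccard_distance(sample_a, sample_b))
print("Bray-Curtis:", bray_curtis_distance(sample_a, sample_b))
```

Jaccard only uses presence or absence, while Bray-Curtis also uses abundances, which is why the two plots can show different clustering for the same dataset.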

---
# <font color = 'gray'>Statistical Testing</font>

### Alpha significance testing

In QIIME2, we test whether alpha diversity measures differ significantly between or among sample groups using the non-parametric Kruskal-Wallis rank-sum test. This test ranks the alpha diversity values across all samples and checks whether the rank sums of the sample groups differ more than expected by chance.

The code below automatically recognizes the categorical variables in your metadata file. In the output visualization, you can then select various categorical metadata columns to check if statistical tests are significant.

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
Replace the <code>.qza</code> file specified in the <code>--i-alpha-diversity</code> parameter to select a different alpha diversity metric.
</div>

In [None]:
!qiime diversity alpha-group-significance \
  --i-alpha-diversity core-metrics-results/shannon_vector.qza \
  --m-metadata-file metadata.txt \
  --o-visualization shannon-group-significance.qzv

In [None]:
q2.Visualization.load('shannon-group-significance.qzv')
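
To show what the Kruskal-Wallis test measures, here is a minimal sketch of its H statistic in plain Python. The group values are hypothetical, and the sketch omits the tie correction and the p-value (obtained by comparing H against a chi-square distribution with one fewer degree of freedom than the number of groups), so treat it as an illustration rather than a replacement for the QIIME2 output.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_g^2 / n_g) - 3 (N + 1), without tie correction."""
    pooled = [v for group in groups for v in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)

# Hypothetical Shannon values for three sample groups.
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_statistic = kruskal_wallis_h(groups)
print("H =", round(h_statistic, 3))
```

A larger H means the groups occupy more distinct portions of the overall ranking.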

### Alpha correlation

You can also check if the per-sample alpha diversity measures are correlated with a continuous variable in your metadata.

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
Replace the <code>.qza</code> file specified in the <code>--i-alpha-diversity</code> parameter to select a different alpha diversity metric.
</div>

In [None]:
!qiime diversity alpha-correlation \
  --i-alpha-diversity core-metrics-results/shannon_vector.qza \
  --m-metadata-file metadata.txt \
  --o-visualization shannon-correlation.qzv

In [None]:
q2.Visualization.load('shannon-correlation.qzv')
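
As a sketch of what a rank-based correlation does, the code below computes Spearman's coefficient by ranking both variables and taking the Pearson correlation of the ranks. Using Spearman here is an assumption made for illustration; check the `--p-method` option of your QIIME2 version for the method actually applied. The variable names and values are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-sample Shannon values and a continuous metadata column.
shannon_values = [3.1, 3.8, 4.0, 4.6, 5.2]
temperature = [18.0, 21.5, 22.0, 25.0, 29.5]
rho = spearman(shannon_values, temperature)
print("Spearman rho =", round(rho, 3))
```

Because only ranks are used, the coefficient captures any monotonic relationship, not just a linear one.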

### Beta significance testing

Here, you will test whether the distances (beta diversity measures) between samples within a group are smaller than the distances between samples in different groups. QIIME2 provides two methods for this: (1) `beta-group-significance`, which performs PERMANOVA using a single variable, and (2) `adonis`, which also uses PERMANOVA but allows for more complex multi-factor models. This flexibility in `adonis` is particularly useful for testing multiple variables or controlling for covariates. If you are considering a single variable only, both methods should yield comparable results, although p-values can differ slightly because they are based on random permutations.

Below, you will see how to run the `adonis` method in QIIME2.

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
Replace the <code>.qza</code> file specified in the <code>--i-distance-matrix</code> parameter to select a different beta diversity metric.
</div>

<div class="alert alert-block alert-info">
<b>Note:</b> 

The <code>--p-formula</code> uses the same notation as R formulas. Here we want to look at the main effects of <i>site</i> and <i>season</i> only, hence the formula <code>site+season</code>. You could also include interaction terms. For example: <code>site+season+site:season</code>.
</div>

In [None]:
!qiime diversity adonis \
    --i-distance-matrix core-metrics-results/unweighted_unifrac_distance_matrix.qza \
    --m-metadata-file metadata.txt \
    --p-formula "site+season" \
    --o-visualization adonis-test.qzv

In [None]:
q2.Visualization.load('adonis-test.qzv')
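
For intuition about what PERMANOVA computes, the sketch below implements the single-factor case in plain Python: a pseudo-F statistic from the squared distances, and a p-value from random permutations of the group labels. The distance matrix and labels are hypothetical, and the multi-factor formulas handled by `adonis` are not covered, so this is an illustration under simplifying assumptions rather than the QIIME2 implementation.

```python
import random
from itertools import combinations

def pseudo_f(dist, labels):
    """Pseudo-F: among-group over within-group variation, from squared distances."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for group in groups:
        members = [i for i, label in enumerate(labels) if label == group]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def permanova(dist, labels, permutations=999, seed=0):
    """Return the observed pseudo-F and a permutation p-value."""
    rng = random.Random(seed)
    f_observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point noise on equal values.
        if pseudo_f(dist, shuffled) >= f_observed - 1e-12:
            hits += 1
    return f_observed, (hits + 1) / (permutations + 1)

# Hypothetical data: six samples from two sites, close within a site (0.1)
# and far apart between sites (0.9).
labels = ["A", "A", "A", "B", "B", "B"]
dist = [[0.0 if i == j else (0.1 if labels[i] == labels[j] else 0.9)
         for j in range(6)] for i in range(6)]

f_value, p_value = permanova(dist, labels)
print("pseudo-F =", round(f_value, 2), "p =", p_value)
```

With only six samples there are few distinct ways to relabel the groups, so the smallest achievable p-value is limited; real datasets with more samples per group give the test more resolution.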