# 3. Diversity

> **Goal:** Compare the samples according to their diversity.

---

**Overview**

This section examines the microbial diversity of the given samples, focusing on both within-sample complexity (alpha diversity) and between-sample differences (beta diversity). These measures allow the assessment of how microbial community structure varies with different metadata attributes.

The workflow is organized into three key steps:

1. **Sampling depth assessment**  
   Sequencing depths are inspected and an appropriate threshold is defined.

2. **Bootstrapping on Euler**  
   To achieve a sufficient number of iterations, the bootstrapping of the k-mer–based diversity calculation is performed on Euler.

3. **Diversity computation and statistical testing**  
   - **Alpha diversity:** Used to evaluate richness or evenness within individual samples.  
   - **Beta diversity:** Distance metrics such as Bray–Curtis and Jaccard quantify community differences between samples.  
     The analysis first applies **QIIME’s `beta-group-significance`**, which tests for group-level differences in beta diversity.  
   - **Follow-up tests:**  
     - **PERMANOVA (adonis):** Applied to determine whether any detected significance persists when accounting for multiple covariates (multivariable effects).  
     - **PERMDISP** (`beta-group-significance` with `--p-method permdisp`): Used to assess whether observed differences are driven by heterogeneous dispersion rather than true shifts in community composition.

Together, these steps provide a robust framework for identifying which metadata attributes are associated with meaningful differences in microbial diversity.

---

**Table of Contents**

- [3.1 Import packages](#3.1-Import-packages)
- [3.2 Import the data](#3.2-Import-the-data)
- [3.3 Determination of the sampling depth](#3.3-Determination-of-the-sampling-depth)
- [3.4 Bootstrapping on Euler](#3.4-Bootstrapping-on-Euler)
    - [3.4.1 Import files](#3.4.1-Import-files)
    - [3.4.2 Bootstrapping script](#3.4.2-Bootstrapping-script)
- [3.5 Diversity](#3.5-Diversity)
    - [3.5.1 Alpha diversity](#3.5.1-Alpha-diversity)
    - [3.5.2 Beta diversity](#3.5.2-Beta-diversity)
        - [3.5.2.1 Bray-Curtis](#3.5.2.1-Bray-Curtis)
        - [3.5.2.2 Jaccard](#3.5.2.2-Jaccard)

## 3.1 Import packages

In [1]:
# Importing all required packages at the start of the notebook
import IPython

from qiime2 import Visualization

import pandas as pd

## 3.2 Import the data

In [2]:
# Location
data_dir = "Project_data/Diversity"
! mkdir -p "$data_dir"

## 3.3 Determination of the sampling depth

In [3]:
# Create a summary table
! qiime feature-table summarize \
    --i-table Project_data/Taxonomy/table_filtered.qza \
    --m-sample-metadata-file Project_data/Metadata/updated_fungut_metadata.tsv \
    --o-visualization $data_dir/table_filtered.qzv

Saved Visualization to: Project_data/Diversity/table_filtered.qzv

In [4]:
Visualization.load(f"{data_dir}/table_filtered.qzv")

In [5]:
# Determine the sampling depth
! qiime diversity alpha-rarefaction \
    --i-table Project_data/Taxonomy/table_filtered.qza \
    --p-max-depth 80000 \
    --m-metadata-file Project_data/Metadata/updated_fungut_metadata.tsv \
    --o-visualization $data_dir/alpha-rarefaction.qzv

Saved Visualization to: Project_data/Diversity/alpha-rarefaction.qzv

In [6]:
Visualization.load(f"{data_dir}/alpha-rarefaction.qzv")

Based on the alpha rarefaction curves, a sampling depth of 20,000 was chosen, since the Shannon and observed-features metrics start to plateau at this point. According to the feature table summary, this depth retains 2,720,000 reads (40.81%) across 136 samples (90.67%).
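The retention figures follow from simple arithmetic: every sample with at least as many reads as the sampling depth is kept and contributes exactly that many reads (136 × 20,000 = 2,720,000). The following is a minimal sketch of that calculation. The function name `retention_at_depth` and the read counts are hypothetical and are not taken from the project data.

```python
def retention_at_depth(reads_per_sample, depth):
    """Summarize how many samples and reads survive rarefaction to `depth`."""
    total_samples = len(reads_per_sample)
    total_reads = sum(reads_per_sample.values())
    # Samples with fewer reads than the depth are discarded entirely
    kept = [s for s, n in reads_per_sample.items() if n >= depth]
    # Every kept sample contributes exactly `depth` reads
    retained_reads = len(kept) * depth
    return {
        "samples_kept": len(kept),
        "sample_fraction": len(kept) / total_samples,
        "reads_kept": retained_reads,
        "read_fraction": retained_reads / total_reads,
    }


# Hypothetical read counts for four samples (not the project data)
example = {"S1": 50_000, "S2": 25_000, "S3": 19_000, "S4": 80_000}
summary = retention_at_depth(example, depth=20_000)
print(summary)
```

Raising the depth keeps more reads per sample but drops more samples, which is the trade-off inspected in the summary table.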

## 3.4 Bootstrapping on Euler
The diversity analysis was performed using the `q2-boots` plugin for QIIME 2. According to the plugin's [documentation](https://f1000research.com/articles/14-87/v1), runs with more than 100 iterations do not scale well. Therefore, the analysis was run on the Euler cluster with `n = 1000` to ensure robust results. As this plugin was not included in the previously installed MOSHPIT distribution, the Amplicon distribution had to be installed additionally via Miniconda.
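To illustrate the idea behind bootstrapped alpha diversity, the sketch below resamples one sample's reads with replacement many times and reports the median Shannon entropy. It is a conceptual illustration only: it skips the k-mer step of `kmer-diversity`, it is not the `q2-boots` implementation, and the function names and counts are hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def shannon(counts):
    """Shannon entropy (base 2) of a list of feature counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def bootstrap_shannon(counts, depth, n, seed=0):
    """Median Shannon entropy over `n` resamples of `depth` reads each."""
    rng = random.Random(seed)
    features = range(len(counts))
    values = []
    for _ in range(n):
        # Draw reads with replacement, weighted by the observed abundances
        draws = rng.choices(features, weights=counts, k=depth)
        resampled = Counter(draws)
        values.append(shannon(list(resampled.values())))
    return statistics.median(values)


# Hypothetical feature counts for a single sample
sample_counts = [500, 300, 150, 40, 10]
estimate = bootstrap_shannon(sample_counts, depth=200, n=100, seed=42)
print(round(estimate, 3))
```

More iterations make the median less dependent on any single resample, which is why a large `n` is attractive despite the computational cost.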

### 3.4.1 Import files
As with the 2.Taxonomy script, the files required to run the bootstrapping on Euler were uploaded to Polybox for download by the script running on Euler.

### 3.4.2 Bootstrapping script
The following script was run on Euler.

```bash
#!/bin/bash
#SBATCH --job-name=bootstraping
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=32G
#SBATCH --output=bootstraping_%j.out
#SBATCH --error=bootstraping_%j.err
#SBATCH --mail-type=END,FAIL

# Activate conda
source ~/miniconda3/etc/profile.d/conda.sh
conda activate qiime2-amplicon-2025.10

# Data folder
data_dir="ProjectData"


# Download the metadata, feature table and representative sequences
module load eth_proxy

# table_filtered
wget --content-disposition -nc --progress=dot:giga -P "$data_dir" https://polybox.ethz.ch/index.php/s/dpP6wHGcgWe2ncd/download
# rep-seqs_filtered
wget --content-disposition -nc --progress=dot:giga -P "$data_dir" https://polybox.ethz.ch/index.php/s/Kkb47x6PgmesMgt/download
# updated_fungut_metadata
wget --content-disposition -nc --progress=dot:giga -P "$data_dir" https://polybox.ethz.ch/index.php/s/WdQ3GPTSEC6qqZm/download

echo "Download done!"

# Run the bootstrapping
qiime boots kmer-diversity \
  --i-table "$data_dir/table_filtered.qza" \
  --i-sequences "$data_dir/rep-seqs_filtered.qza" \
  --m-metadata-file "$data_dir/updated_fungut_metadata.tsv" \
  --p-sampling-depth 20000 \
  --p-n 1000 \
  --p-replacement \
  --p-alpha-average-method median \
  --p-beta-average-method medoid \
  --output-dir "$data_dir/boots-kmer-diversity"

echo "Bootstrapping done!"
```
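The script averages the bootstrapped distance matrices with the `medoid` method. As a rough illustration, the sketch below picks the matrix with the smallest total distance to all other matrices. It assumes a Euclidean distance between the upper triangles of the matrices, which may differ from what `q2-boots` uses internally. The function name and the example matrices are hypothetical.

```python
import math


def medoid_index(matrices):
    """Index of the matrix with the smallest total distance to all others."""

    def flatten(m):
        # Use the upper triangle only, since distance matrices are symmetric
        return [m[i][j] for i in range(len(m)) for j in range(i + 1, len(m))]

    flat = [flatten(m) for m in matrices]
    # Sum of Euclidean distances from each matrix to every other matrix
    totals = [sum(math.dist(a, b) for b in flat) for a in flat]
    return totals.index(min(totals))


# Three hypothetical 3x3 bootstrap distance matrices
boots = [
    [[0.0, 0.2, 0.8], [0.2, 0.0, 0.6], [0.8, 0.6, 0.0]],
    [[0.0, 0.3, 0.7], [0.3, 0.0, 0.6], [0.7, 0.6, 0.0]],
    [[0.0, 0.5, 0.5], [0.5, 0.0, 0.6], [0.5, 0.6, 0.0]],
]
print(medoid_index(boots))
```

Unlike an element-wise mean or median, a medoid is always one of the matrices that was actually computed, so it remains a valid distance matrix.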

## 3.5 Diversity
The files created by the script on Euler were downloaded and then uploaded to Polybox so that this notebook can access them.

In [7]:
%%bash -s $data_dir

# Shannon
wget --content-disposition -nc --progress=dot:giga -P "$1" https://polybox.ethz.ch/index.php/s/XewRqjqBxPjsBMK/download
# Scatter plot
wget --content-disposition -nc --progress=dot:giga -P "$1" https://polybox.ethz.ch/index.php/s/SbRQpewZm9LDcrW/download
# Bray-Curtis distance matrix
wget --content-disposition -nc --progress=dot:giga -P "$1" https://polybox.ethz.ch/index.php/s/X689NN8ziCxCtaM/download
# Jaccard distance matrix
wget --content-disposition -nc --progress=dot:giga -P "$1" https://polybox.ethz.ch/index.php/s/jJ8MQj3oBx9ckyw/download

chmod -R +rxw "$1"

--2025-12-18 16:52:36--  https://polybox.ethz.ch/index.php/s/XewRqjqBxPjsBMK/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81583927 (78M) [application/octet-stream]
Saving to: ‘Project_data/Diversity/shannon.qza’

     0K ........ ........ ........ ........ 41%  153M 0s
 32768K ........ ........ ........ ........ 82%  165M 0s
 65536K ........ .....                     100%  148M=0.5s

2025-12-18 16:52:37 (157 MB/s) - ‘Project_data/Diversity/shannon.qza’ saved [81583927/81583927]

--2025-12-18 16:52:37--  https://polybox.ethz.ch/index.php/s/SbRQpewZm9LDcrW/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 145333965 (139M) [application/octet-stream]
Saving to: ‘Project_data

### 3.5.1 Alpha diversity

In [8]:
! qiime diversity alpha-group-significance \
    --i-alpha-diversity $data_dir/shannon.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --o-visualization $data_dir/alpha_group_significance.qzv

Saved Visualization to: Project_data/Diversity/alpha_group_significance.qzv

In [9]:
Visualization.load(f"{data_dir}/alpha_group_significance.qzv")

### 3.5.2 Beta diversity
Beta diversity was analysed with both distance matrices (Bray-Curtis and Jaccard) obtained during bootstrapping.  
**Note**: The analyses performed with the QIIME 2 diversity plugin vary slightly from one run to another, since there is no argument to set a seed for reproducible results. Increasing the number of permutations reduces this variation but cannot eliminate it completely.
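The size of this run-to-run variation can be estimated. A permutation p-value is a proportion estimated from random permutations, so its Monte Carlo standard error is approximately sqrt(p(1 - p) / N) for N permutations. The sketch below applies this binomial approximation to a hypothetical p-value of 0.02. The function name is an assumption made for illustration.

```python
import math


def permutation_p_standard_error(p, n_permutations):
    """Approximate Monte Carlo standard error of a permutation p-value."""
    return math.sqrt(p * (1 - p) / n_permutations)


# Compare the QIIME 2 default (999) with the value used in this notebook (9999)
for n in (999, 9999):
    se = permutation_p_standard_error(0.02, n)
    print(f"{n} permutations: p = 0.02 +/- {se:.4f}")
```

Going from 999 to 9999 permutations shrinks the standard error by roughly a factor of three, which is why p-values close to the significance level can still change sides between runs.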

In [10]:
Visualization.load(f"{data_dir}/scatter_plot.qzv")

#### 3.5.2.1 Bray-Curtis

##### Beta-group-significance

In [11]:
meta_cols = ["age_range", "sex_sample", "diet_type_sample", "ibd_sample", "gluten_sample", "continent", "bmi_category", "urban_category"]

for col in meta_cols:
    output_name = f"{data_dir}/bray_curtis-{col}-significance.qzv"
    print(f"Running for column: {col}")

    ! qiime diversity beta-group-significance \
        --i-distance-matrix $data_dir/braycurtis.qza \
        --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
        --m-metadata-column {col} \
        --p-permutations 9999 \
        --p-pairwise \
        --o-visualization {output_name}
    
# Excluded because they raise errors: country_sample (some unique values), bmi_sample (numeric type)

Running for column: age_range
Saved Visualization to: Project_data/Diversity/bray_curtis-age_range-significance.qzv
Running for column: sex_sample
Saved Visualization to: Project_data/Diversity/bray_curtis-sex_sample-significance.qzv
Running for column: diet_type_sample
Saved Visualization to: Project_data/Diversity/bray_curtis-diet_type_sample-significance.qzv
Running for column: ibd_sample
Saved Visualization to: Project_data/Diversity/bray_curtis-ibd_sample-significance.qzv
Running for column: gluten_sample
Saved Visualization to: Project_data/Diversity/bray_curtis-gluten_sample-significance.qzv
Running for column: continent
Saved Visualization to: Project_data/Diversity/bray_curtis-continent-significance.qzv
Running for column: bmi_ca

In [12]:
Visualization.load(f"{data_dir}/bray_curtis-age_range-significance.qzv")

In [13]:
Visualization.load(f"{data_dir}/bray_curtis-sex_sample-significance.qzv")

In [14]:
Visualization.load(f"{data_dir}/bray_curtis-diet_type_sample-significance.qzv")

In [15]:
Visualization.load(f"{data_dir}/bray_curtis-ibd_sample-significance.qzv")

In [16]:
Visualization.load(f"{data_dir}/bray_curtis-gluten_sample-significance.qzv")

In [17]:
Visualization.load(f"{data_dir}/bray_curtis-continent-significance.qzv")

In [18]:
Visualization.load(f"{data_dir}/bray_curtis-bmi_category-significance.qzv")

In [19]:
Visualization.load(f"{data_dir}/bray_curtis-urban_category-significance.qzv")
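The loop above skips `country_sample` and `bmi_sample` because they raised errors. A small pre-check can flag such columns before calling QIIME 2. The sketch below is an assumption about which columns are problematic (numeric columns, and columns where every value is unique or only one group exists). It does not reproduce the actual validation done by `beta-group-significance`, and the helper name and example values are hypothetical.

```python
def usable_for_group_test(values):
    """Return True if a metadata column can define groups for a group test."""
    present = [v for v in values if v not in ("", None)]
    # Numeric columns describe a gradient, not groups
    try:
        [float(v) for v in present]
        return False
    except (TypeError, ValueError):
        pass
    n_groups = len(set(present))
    # Need at least two groups and at least one group with repeated samples
    return 2 <= n_groups < len(present)


# Hypothetical metadata columns (not the project data)
metadata = {
    "continent": ["Europe", "Europe", "Oceania", "Asia"],
    "country_sample": ["CH", "DE", "NZ", "JP"],
    "bmi_sample": ["21.5", "30.1", "25.0", "19.8"],
}
usable = [col for col, vals in metadata.items() if usable_for_group_test(vals)]
print(usable)
```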

Of the tested attributes, only continent was significant overall (p = 0.0206). In addition, a pairwise significance was detected between Europe and Oceania (q = 0.04200). As mentioned above, these values vary slightly between runs.
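The q-values in the pairwise tables are p-values adjusted for multiple comparisons. The sketch below shows the Benjamini-Hochberg procedure, assuming this is the correction applied to the pairwise results. The p-values are hypothetical and the function is an illustration, not the code used by QIIME 2.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the q-values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values from four pairwise comparisons
pairwise_p = [0.01, 0.04, 0.03, 0.5]
print(benjamini_hochberg(pairwise_p))
```

Because the adjustment depends on the number of comparisons, attributes with many groups need smaller raw p-values to reach a significant q-value.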

##### Adonis

In [20]:
! qiime diversity adonis \
    --i-distance-matrix $data_dir/braycurtis.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata_noNaN.tsv \
    --p-formula "age_range + sex_sample + diet_type_sample + ibd_sample + gluten_sample + continent + bmi_category + urban_category" \
    --p-permutations 9999 \
    --o-visualization $data_dir/bray_curtis-adonis_multi.qzv

Saved Visualization to: Project_data/Diversity/bray_curtis-adonis_multi.qzv

In [21]:
Visualization.load(f"{data_dir}/bray_curtis-adonis_multi.qzv")

Notes:
- After accounting for all other metadata variables, the p-value for continent is slightly above the significance level, so continent is no longer significant.
- None of the other metadata variables has a significant effect.

##### PERMDISP

In [22]:
# Check if difference between groups is due to group variance, not true group centroid separation
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/braycurtis.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --m-metadata-column continent \
    --p-permutations 9999 \
    --p-method permdisp \
    --o-visualization $data_dir/braycurtis_continent_dispersion.qzv

Saved Visualization to: Project_data/Diversity/braycurtis_continent_dispersion.qzv

In [23]:
Visualization.load(f"{data_dir}/braycurtis_continent_dispersion.qzv")

Notes:
- The dispersion does not differ significantly between continents, i.e. the groups are similarly spread. The significant PERMANOVA signal therefore reflects differences in group centroids (actual community shifts) rather than one group simply being more spread out.
- The effect of continent is no longer significant once the other attributes are accounted for (see Adonis above).
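To make the dispersion test more concrete, the sketch below measures each sample's distance to its group centroid and compares those distances between groups with a permutation-based F test. It is a simplified illustration: it takes coordinates directly instead of deriving them from a distance matrix via PCoA, and it permutes the distances rather than residuals. All names and data points are hypothetical.

```python
import math
import random


def centroid_distances(points):
    """Distance of each point to the centroid of its own group."""
    dims = range(len(points[0]))
    center = [sum(p[d] for p in points) / len(points) for d in dims]
    return [math.dist(p, center) for p in points]


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    values = [v for g in groups for v in g]
    grand = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (between / df_between) / (within / df_within)


def permdisp(groups_of_points, permutations=999, seed=0):
    """Observed F and permutation p-value for equal dispersion across groups."""
    dists = [centroid_distances(g) for g in groups_of_points]
    observed = f_statistic(dists)
    sizes = [len(g) for g in dists]
    pooled = [v for g in dists for v in g]
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        # Reassign the centroid distances to groups at random
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Two hypothetical groups with the same centroid but different spread
tight = [(0.1, 0.0), (-0.2, 0.0), (0.0, 0.1), (0.1, -0.1)]
wide = [(1.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (1.0, -1.0)]
f_obs, p_value = permdisp([tight, wide], permutations=999, seed=1)
print(f"F = {f_obs:.2f}, p = {p_value:.3f}")
```

In this toy example both groups share the same centroid, so any group difference comes from spread alone. This is the situation the dispersion test is meant to detect before a PERMANOVA result is interpreted as a community shift.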

In [24]:
# Visualize possible clusterings
! qiime diversity pcoa \
    --i-distance-matrix $data_dir/braycurtis.qza \
    --o-pcoa $data_dir/braycurtis_pcoa.qza

! qiime emperor plot \
    --i-pcoa $data_dir/braycurtis_pcoa.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --o-visualization $data_dir/braycurtis_pcoa_continent.qzv

Saved PCoAResults to: Project_data/Diversity/braycurtis_pcoa.qza
Saved Visualization to: Project_data/Diversity/braycurtis_pcoa_continent.qzv

In [25]:
Visualization.load(f"{data_dir}/braycurtis_pcoa_continent.qzv")

Notes:
- No continent-specific clusters are visible; the only visible clusters contain samples from several continents.

#### 3.5.2.2 Jaccard

##### Beta-group-significance

In [43]:
for col in meta_cols:
    output_name = f"{data_dir}/jaccard-{col}-significance.qzv"
    print(f"Running for column: {col}")

    ! qiime diversity beta-group-significance \
        --i-distance-matrix $data_dir/jaccard.qza \
        --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
        --m-metadata-column {col} \
        --p-permutations 9999 \
        --p-pairwise \
        --o-visualization {output_name}
    
# Excluded because they raise errors: country_sample (only unique values), bmi_sample (numeric type)

Running for column: age_range
Saved Visualization to: Project_data/Diversity/jaccard-age_range-significance.qzv
Running for column: sex_sample
Saved Visualization to: Project_data/Diversity/jaccard-sex_sample-significance.qzv
Running for column: diet_type_sample
Saved Visualization to: Project_data/Diversity/jaccard-diet_type_sample-significance.qzv
Running for column: ibd_sample
Saved Visualization to: Project_data/Diversity/jaccard-ibd_sample-significance.qzv
Running for column: gluten_sample
Saved Visualization to: Project_data/Diversity/jaccard-gluten_sample-significance.qzv
Running for column: continent
Saved Visualization to: Project_data/Diversity/jaccard-continent-significance.qzv
Running for column: bmi_category

In [44]:
Visualization.load(f"{data_dir}/jaccard-age_range-significance.qzv")

In [45]:
Visualization.load(f"{data_dir}/jaccard-sex_sample-significance.qzv")

In [46]:
Visualization.load(f"{data_dir}/jaccard-diet_type_sample-significance.qzv")

In [47]:
Visualization.load(f"{data_dir}/jaccard-ibd_sample-significance.qzv")

In [48]:
Visualization.load(f"{data_dir}/jaccard-continent-significance.qzv")

In [49]:
Visualization.load(f"{data_dir}/jaccard-gluten_sample-significance.qzv")

In [50]:
Visualization.load(f"{data_dir}/jaccard-bmi_category-significance.qzv")

In [51]:
Visualization.load(f"{data_dir}/jaccard-urban_category-significance.qzv")

Notes:
Detected significances:
- Age range (p = 0.0312): no pairwise significance
- Diet type (p = 0.1891, not significant overall): pairwise significance for Omnivore <-> Vegan (q = 0.034500) and Omnivore (no red meat) <-> Vegan (q = 0.034500)
- Continent (p = 0.0001): Europe <-> North America (q = 0.00015) and Europe <-> Oceania (q = 0.00015)

##### Adonis

In [52]:
! qiime diversity adonis \
    --i-distance-matrix $data_dir/jaccard.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata_noNaN.tsv \
    --p-formula "age_range + sex_sample + diet_type_sample + ibd_sample + gluten_sample + continent + bmi_category + urban_category" \
    --o-visualization $data_dir/jaccard-adonis_multi.qzv

Saved Visualization to: Project_data/Diversity/jaccard-adonis_multi.qzv

In [53]:
Visualization.load(f"{data_dir}/jaccard-adonis_multi.qzv")

Notes:
- Columns with significant results from Adonis:
    - age_range (p = 0.022)
    - continent (p = 0.001)

Interpretation:
- Continent shows the strongest effect, since it is significant in both tests.
- The effect of age range is partially masked by other metadata.
- The diet_type_sample signal is confounded by other variables (continent, age, etc.).

##### PERMDISP

In [63]:
# Check if difference between groups is due to group variance, not true group centroid separation
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/jaccard.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --m-metadata-column age_range \
    --p-permutations 9999 \
    --p-method permdisp \
    --o-visualization $data_dir/jaccard_age_range_dispersion.qzv

Saved Visualization to: Project_data/Diversity/jaccard_age_range_dispersion.qzv

In [64]:
Visualization.load(f"{data_dir}/jaccard_age_range_dispersion.qzv")

Notes:
- Non-significant result for age range
- Homogeneous dispersion between groups; the PERMANOVA result is not driven by differences in dispersion

In [65]:
# Check if difference between groups is due to group variance, not true group centroid separation
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/jaccard.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --m-metadata-column diet_type_sample \
    --p-permutations 9999 \
    --p-method permdisp \
    --o-visualization $data_dir/jaccard_diet_type_sample_dispersion.qzv

Saved Visualization to: Project_data/Diversity/jaccard_diet_type_sample_dispersion.qzv

In [66]:
Visualization.load(f"{data_dir}/jaccard_diet_type_sample_dispersion.qzv")

Notes (to be confirmed):
- Non-significant result for diet type
- Homogeneous dispersion between groups; the PERMANOVA result is not driven by differences in dispersion

In [67]:
# Check if difference between groups is due to group variance, not true group centroid separation
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/jaccard.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --m-metadata-column continent \
    --p-permutations 9999 \
    --p-method permdisp \
    --o-visualization $data_dir/jaccard_continent_dispersion.qzv

Saved Visualization to: Project_data/Diversity/jaccard_continent_dispersion.qzv

In [68]:
Visualization.load(f"{data_dir}/jaccard_continent_dispersion.qzv")

Notes:
- Significant result for continent (important!)
- Dispersion is not homogeneous between groups, so the PERMANOVA result for continent may be driven by differences in dispersion rather than by a shift in community composition

In [69]:
# Visualize possible clusterings
! qiime diversity pcoa \
    --i-distance-matrix $data_dir/jaccard.qza \
    --o-pcoa $data_dir/jaccard_pcoa.qza

! qiime emperor plot \
    --i-pcoa $data_dir/jaccard_pcoa.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --o-visualization $data_dir/jaccard_pcoa_continent.qzv

Saved PCoAResults to: Project_data/Diversity/jaccard_pcoa.qza
Saved Visualization to: Project_data/Diversity/jaccard_pcoa_continent.qzv

In [70]:
Visualization.load(f"{data_dir}/jaccard_pcoa_continent.qzv")

Notes:
- No clusters are visible for any of the three attributes of interest (age range, diet type, continent)