# 1.Import packages

In [1]:
# Importing all required packages at the start of the notebook
import IPython

from qiime2 import Visualization

import pandas as pd

# 2.Import the data

In [2]:
# Location
data_dir = "Project_data/Diversity"
! mkdir -p "$data_dir"

# 3.Determination of the sampling depth

In [3]:
! qiime feature-table summarize \
    --i-table Project_data/Taxonomy/table_filtered.qza \
    --m-sample-metadata-file Project_data/Metadata/updated_fungut_metadata.tsv \
    --o-visualization $data_dir/table_filtered.qzv

Saved Visualization to: Project_data/Diversity/table_filtered.qzv

In [4]:
Visualization.load(f"{data_dir}/table_filtered.qzv")

In [5]:
! qiime diversity alpha-rarefaction \
    --i-table Project_data/Taxonomy/table_filtered.qza \
    --p-max-depth 80000 \
    --m-metadata-file Project_data/Metadata/updated_fungut_metadata.tsv \
    --o-visualization $data_dir/alpha-rarefaction.qzv

Saved Visualization to: Project_data/Diversity/alpha-rarefaction.qzv

In [6]:
Visualization.load(f"{data_dir}/alpha-rarefaction.qzv")

Based on the alpha rarefaction curves, a sampling depth of 20,000 was chosen, since the Shannon and observed-features metrics start to plateau at this point. According to the feature table summary, this sampling depth retains 2,720,000 reads (40.81%) across 136 samples (90.67%).
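The retention figures follow from simple arithmetic: every sample with at least as many reads as the sampling depth is kept and subsampled to exactly that depth, so 136 samples at 20,000 reads give 2,720,000 retained reads. A minimal sketch of this calculation is shown below; the function name `retention_at_depth` and the example read counts are hypothetical and are not the project data.

```python
def retention_at_depth(reads_per_sample, depth):
    """Return (samples_kept, reads_kept, sample_fraction, read_fraction)."""
    # Samples with fewer reads than the sampling depth are dropped
    kept = [n for n in reads_per_sample if n >= depth]
    samples_kept = len(kept)
    # Each retained sample is subsampled to exactly `depth` reads
    reads_kept = samples_kept * depth
    return (
        samples_kept,
        reads_kept,
        samples_kept / len(reads_per_sample),
        reads_kept / sum(reads_per_sample),
    )


# Hypothetical per-sample read counts (illustration only)
example_counts = [50_000, 30_000, 20_000, 12_000, 8_000]
print(retention_at_depth(example_counts, 20_000))

# Consistency check of the figure reported above: 136 samples x 20,000 reads
print(136 * 20_000)
```

A lower depth keeps more samples but fewer reads per sample, which is the trade-off the rarefaction curves help to judge.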

# 4.Euler
The diversity analysis was performed using the `q2-boots` plugin for QIIME2. To run the bootstrapping with a sufficiently high number of iterations (`n = 1000`), this step was performed on Euler. As this plugin was not included in the previously installed MOSHPIT distribution, the Amplicon distribution had to be installed additionally via Miniconda.
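To illustrate the idea behind bootstrapped alpha diversity, the sketch below resamples one sample's feature counts with replacement to a fixed depth, computes Shannon entropy for each replicate, and reports the median across replicates. This is a simplified illustration under stated assumptions, not the `q2-boots` implementation; the function names and the toy counts are hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def shannon(counts):
    """Shannon entropy (log base 2) of a collection of feature counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def bootstrap_shannon(feature_counts, depth, n, seed=0):
    """Median Shannon entropy over n resamples drawn with replacement."""
    rng = random.Random(seed)
    features = list(range(len(feature_counts)))
    values = []
    for _ in range(n):
        # Draw `depth` reads with replacement, weighted by observed counts
        draw = rng.choices(features, weights=feature_counts, k=depth)
        values.append(shannon(Counter(draw).values()))
    return statistics.median(values)


# Toy sample with four features (hypothetical counts); small n and depth
# keep the example fast, whereas the analysis used n = 1000 at depth 20000
print(bootstrap_shannon([400, 300, 200, 100], depth=500, n=50))
```

Averaging over many replicates reduces the dependence of the result on a single random subsample, which is why a high number of iterations was used.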

## 4.1 Import files
As with the 2.Taxonomy script, the files required to run the bootstrapping on Euler were uploaded to Polybox for download by the script running on Euler.

## 4.2 Bootstrapping script
The following script was run on Euler.

```bash
#!/bin/bash
#SBATCH --job-name=bootstraping
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=32G
#SBATCH --output=bootstraping_%j.out
#SBATCH --error=bootstraping_%j.err
#SBATCH --mail-type=END,FAIL

# Activate conda
source ~/miniconda3/etc/profile.d/conda.sh
conda activate qiime2-amplicon-2025.10

# Data folder
data_dir="ProjectData"


# Download the metadata, feature table, and representative sequences
module load eth_proxy

wget --content-disposition -nc --progress=dot:giga -P "$data_dir" https://polybox.ethz.ch/index.php/s/e7ieANgiAn26oBs/download
wget --content-disposition -nc --progress=dot:giga -P "$data_dir" https://polybox.ethz.ch/index.php/s/xNSLKnR2y3QG9eb/download
wget --content-disposition -nc --progress=dot:giga -P "$data_dir" https://polybox.ethz.ch/index.php/s/KscLWzSGnkmEmY5/download

echo "Download done!"

# Run the bootstrapping
qiime boots kmer-diversity \
  --i-table $data_dir/table_filtered.qza \
  --i-sequences $data_dir/rep-seqs_filtered.qza \
  --m-metadata-file $data_dir/updated_fungut_metadata.tsv \
  --p-sampling-depth 20000 \
  --p-n 1000 \
  --p-replacement \
  --p-alpha-average-method median \
  --p-beta-average-method medoid \
  --output-dir $data_dir/boots-kmer-diversity

echo "Bootstrapping done!"
```
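The script averages the bootstrapped distance matrices with the `medoid` method. As a rough illustration of what a medoid is, the sketch below picks, from a set of replicate distance matrices, the one with the smallest total distance to all others. How `q2-boots` measures the distance between matrices is not stated here, so the Frobenius distance used below is an assumption, and the function names and toy matrices are hypothetical.

```python
import math


def matrix_distance(a, b):
    """Frobenius distance between two equally sized matrices (assumed metric)."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def medoid_index(matrices):
    """Index of the matrix with the smallest summed distance to all others."""
    totals = [
        sum(matrix_distance(m, other) for other in matrices) for m in matrices
    ]
    return min(range(len(matrices)), key=totals.__getitem__)


# Three toy 2x2 replicate distance matrices (hypothetical values)
replicates = [
    [[0.0, 0.2], [0.2, 0.0]],
    [[0.0, 0.5], [0.5, 0.0]],
    [[0.0, 0.9], [0.9, 0.0]],
]
print(medoid_index(replicates))  # 1: the middle matrix is closest to the others
```

Unlike an element-wise mean, a medoid is always one of the observed replicates, so the result is a distance matrix that was actually computed from a resampled table.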

# 5.Diversity
The files created by the script on Euler were downloaded and then uploaded to Polybox so that this notebook can access them.

In [7]:
%%bash -s $data_dir

wget --content-disposition -nc --progress=dot:giga -P "$1" https://polybox.ethz.ch/index.php/s/nmb4j2YDSJbjJP2/download
wget --content-disposition -nc --progress=dot:giga -P "$1" https://polybox.ethz.ch/index.php/s/sYGkqwCffpcK8Si/download
wget --content-disposition -nc --progress=dot:giga -P "$1" https://polybox.ethz.ch/index.php/s/XJFWGkkNYfZSyse/download
wget --content-disposition -nc --progress=dot:giga -P "$1" https://polybox.ethz.ch/index.php/s/joMGaF5g3sNA6fT/download

chmod -R +rxw "$1"

--2025-12-02 14:10:48--  https://polybox.ethz.ch/index.php/s/nmb4j2YDSJbjJP2/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
--2025-12-02 14:10:48--  https://polybox.ethz.ch/index.php/s/sYGkqwCffpcK8Si/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
--2025-12-02 14:10:48--  https://polybox.ethz.ch/index.php/s/XJFWGkkNYfZSyse/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
--2025-12-02 14:10:49--  https://polybox.ethz.ch/index.php/s/joMGaF5g3sNA6fT/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox

## 5.1 Alpha diversity

In [8]:
! qiime diversity alpha-group-significance \
  --i-alpha-diversity $data_dir/shannon.qza \
  --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
  --o-visualization $data_dir/alpha_group_significance.qzv

Saved Visualization to: Project_data/Diversity/alpha_group_significance.qzv

In [9]:
Visualization.load(f"{data_dir}/alpha_group_significance.qzv")

## 5.2 Beta diversity
The beta diversity analysis was run with both distance matrices (Bray-Curtis and Jaccard) obtained during the bootstrapping.
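The two metrics capture different aspects of dissimilarity: Bray-Curtis uses abundances, while Jaccard only uses presence or absence. The sketch below computes both for two count vectors to make the difference concrete; the function names and example counts are hypothetical and only illustrate the standard definitions.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum of |a_i - b_i| over sum of (a_i + b_i)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


def jaccard(a, b):
    """Jaccard distance on presence/absence: 1 - |A and B| / |A or B|."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, y in enumerate(b) if y > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)


# Two toy samples with counts for three features (hypothetical values)
sample_a = [10, 0, 5]
sample_b = [5, 5, 0]
print(bray_curtis(sample_a, sample_b))
print(jaccard(sample_a, sample_b))
```

Because Jaccard ignores abundances, rare features weigh as much as dominant ones, which can lead to different conclusions from the same table.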

In [10]:
Visualization.load(f"{data_dir}/scatter_plot.qzv")

### 5.2.1 Bray-Curtis

In [11]:
meta_data_df = pd.read_csv(f"{data_dir}/../Differential_Abundance/metadata_gluten_clean.tsv", sep="\t")

In [12]:
meta_data_df.columns

Index(['ID', 'country_sample', 'state_sample', 'latitude_sample',
       'longitude_sample', 'sex_sample', 'age_years_sample',
       'height_cm_sample', 'weight_kg_sample', 'bmi_sample',
       'diet_type_sample', 'ibd_sample', 'gluten_sample', 'age_range',
       'bmi_category', 'continent', 'gluten_clean'],
      dtype='object')

In [13]:
meta_cols = ["age_range", "sex_sample", "diet_type_sample", "ibd_sample", "gluten_clean", "continent", "bmi_category"]

for col in meta_cols:
    output_name = f"{data_dir}/bray_curtis-{col}-significance.qzv"
    print(f"Running for column: {col}")

    ! qiime diversity beta-group-significance \
        --i-distance-matrix $data_dir/braycurtis.qza \
        --m-metadata-file $data_dir/../Differential_Abundance/metadata_gluten_clean.tsv \
        --m-metadata-column {col} \
        --p-permutations 9999 \
        --p-pairwise \
        --o-visualization {output_name}
    
# Not included: country_sample (only unique values) and bmi_sample (numeric type) raise errors

Running for column: age_range
Saved Visualization to: Project_data/Diversity/bray_curtis-age_range-significance.qzv
Running for column: sex_sample
Saved Visualization to: Project_data/Diversity/bray_curtis-sex_sample-significance.qzv
Running for column: diet_type_sample
Saved Visualization to: Project_data/Diversity/bray_curtis-diet_type_sample-significance.qzv
Running for column: ibd_sample
Saved Visualization to: Project_data/Diversity/bray_curtis-ibd_sample-significance.qzv
Running for column: gluten_clean
Saved Visualization to: Project_data/Diversity/bray_curtis-gluten_clean-significance.qzv
Running for column: continent
Saved Visualization to: Project_data/Diversity/bray_curtis-continent-significance.qzv
Running for column: bmi_cate

In [14]:
Visualization.load(f"{data_dir}/bray_curtis-age_range-significance.qzv")

In [15]:
Visualization.load(f"{data_dir}/bray_curtis-diet_type_sample-significance.qzv")

In [16]:
Visualization.load(f"{data_dir}/bray_curtis-ibd_sample-significance.qzv")

In [17]:
Visualization.load(f"{data_dir}/bray_curtis-sex_sample-significance.qzv")

In [18]:
Visualization.load(f"{data_dir}/bray_curtis-continent-significance.qzv")

In [19]:
Visualization.load(f"{data_dir}/bray_curtis-gluten_clean-significance.qzv")

In [20]:
Visualization.load(f"{data_dir}/bray_curtis-bmi_category-significance.qzv")

### Adonis

In [21]:
# A separate copy of the metadata file is needed because NaN values cause issues
meta = pd.read_csv(f"{data_dir}/../Differential_Abundance/metadata_gluten_clean.tsv", sep="\t")

In [22]:
meta.isna().sum()

ID                   0
country_sample       1
state_sample        58
latitude_sample      5
longitude_sample     5
sex_sample           1
age_years_sample     5
height_cm_sample     3
weight_kg_sample     2
bmi_sample           3
diet_type_sample     5
ibd_sample           7
gluten_sample        6
age_range            5
bmi_category         3
continent            1
gluten_clean         6
dtype: int64

In [23]:
meta = meta.fillna("missing")
meta.isna().sum()

ID                  0
country_sample      0
state_sample        0
latitude_sample     0
longitude_sample    0
sex_sample          0
age_years_sample    0
height_cm_sample    0
weight_kg_sample    0
bmi_sample          0
diet_type_sample    0
ibd_sample          0
gluten_sample       0
age_range           0
bmi_category        0
continent           0
gluten_clean        0
dtype: int64

In [24]:
print(meta['continent'].value_counts())

continent
Europe           58
Oceania          47
North America    44
missing           1
Name: count, dtype: int64


In [25]:
meta.to_csv(f"{data_dir}/metadata_cleaned.tsv", sep="\t", index=False)

In [26]:
! qiime diversity adonis \
    --i-distance-matrix $data_dir/braycurtis.qza \
    --m-metadata-file $data_dir/metadata_cleaned.tsv \
    --p-formula "ibd_sample + age_range + sex_sample + continent + bmi_category + diet_type_sample" \
    --o-visualization $data_dir/bray_curtis-adonis_multi.qzv

Saved Visualization to: Project_data/Diversity/bray_curtis-adonis_multi.qzv

In [27]:
Visualization.load(f"{data_dir}/bray_curtis-adonis_multi.qzv")

Notes:
- Community composition differs significantly by continent, even after accounting for the other metadata variables in the model.
- None of the other metadata variables has a significant effect.

In [28]:
# Check if difference between groups is due to group variance, not true group centroid separation
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/braycurtis.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --m-metadata-column continent \
    --p-permutations 999 \
    --p-method permdisp \
    --o-visualization $data_dir/braycurtis_continent_dispersion.qzv

Saved Visualization to: Project_data/Diversity/braycurtis_continent_dispersion.qzv

In [29]:
Visualization.load(f"{data_dir}/braycurtis_continent_dispersion.qzv")

Notes:
- PERMDISP is not significant, which is the desired outcome: the groups are about equally spread.
- The significant PERMANOVA signal therefore reflects differences in group centroids (actual community shifts) rather than one group simply being more spread out.
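For readers unfamiliar with PERMANOVA, the sketch below computes the pseudo-F statistic from a distance matrix and estimates a p-value by permuting the group labels. It is a minimal single-factor illustration of the general technique, not the code QIIME 2 runs; the function names and the toy data are hypothetical.

```python
import random
from itertools import combinations


def pseudo_f(dist, groups):
    """PERMANOVA pseudo-F for one grouping factor, from squared distances."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for label in labels:
        members = [i for i in range(n) if groups[i] == label]
        ss_within += (
            sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
        )
    ss_between = ss_total - ss_within
    a = len(labels)
    return (ss_between / (a - 1)) / (ss_within / (n - a))


def permanova(dist, groups, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data: eight points on a line forming two well-separated groups
points = [0, 1, 2, 3, 10, 11, 12, 13]
toy_dist = [[abs(p - q) for q in points] for p in points]
toy_groups = ["A"] * 4 + ["B"] * 4
f_stat, p_value = permanova(toy_dist, toy_groups)
print(f_stat, p_value)
```

The within-group sum of squares in the denominator grows when a group is widely spread, so the statistic responds to dispersion as well as to centroid location; this is why a separate dispersion test is run alongside it.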

In [30]:
# Visualize possible clusterings
! qiime diversity pcoa \
    --i-distance-matrix $data_dir/braycurtis.qza \
    --o-pcoa $data_dir/braycurtis_pcoa.qza

! qiime emperor plot \
    --i-pcoa $data_dir/braycurtis_pcoa.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --o-visualization $data_dir/braycurtis_pcoa_continent.qzv

Saved PCoAResults to: Project_data/Diversity/braycurtis_pcoa.qza
Saved Visualization to: Project_data/Diversity/braycurtis_pcoa_continent.qzv

In [31]:
Visualization.load(f"{data_dir}/braycurtis_pcoa_continent.qzv")

Notes:
- No continent-specific clusters are visible; the only visible clusters contain samples from mixed continents.

### 5.2.2 Jaccard

In [32]:
meta_cols = ["age_range", "sex_sample", "diet_type_sample", "ibd_sample", "gluten_clean", "continent", "bmi_category"]

for col in meta_cols:
    output_name = f"{data_dir}/jaccard-{col}-significance.qzv"
    print(f"Running for column: {col}")

    ! qiime diversity beta-group-significance \
        --i-distance-matrix $data_dir/jaccard.qza \
        --m-metadata-file $data_dir/../Differential_Abundance/metadata_gluten_clean.tsv \
        --m-metadata-column {col} \
        --p-permutations 9999 \
        --p-pairwise \
        --o-visualization {output_name}
    
# Not included: country_sample (only unique values) and bmi_sample (numeric type) raise errors

Running for column: age_range
Saved Visualization to: Project_data/Diversity/jaccard-age_range-significance.qzv
Running for column: sex_sample
Saved Visualization to: Project_data/Diversity/jaccard-sex_sample-significance.qzv
Running for column: diet_type_sample
Saved Visualization to: Project_data/Diversity/jaccard-diet_type_sample-significance.qzv
Running for column: ibd_sample
Saved Visualization to: Project_data/Diversity/jaccard-ibd_sample-significance.qzv
Running for column: gluten_clean
Saved Visualization to: Project_data/Diversity/jaccard-gluten_clean-significance.qzv
Running for column: continent
Saved Visualization to: Project_data/Diversity/jaccard-continent-significance.qzv
Running for column: bmi_category

In [33]:
Visualization.load(f"{data_dir}/jaccard-age_range-significance.qzv")

In [34]:
Visualization.load(f"{data_dir}/jaccard-diet_type_sample-significance.qzv")

In [35]:
Visualization.load(f"{data_dir}/jaccard-ibd_sample-significance.qzv")

In [36]:
Visualization.load(f"{data_dir}/jaccard-sex_sample-significance.qzv")

In [37]:
Visualization.load(f"{data_dir}/jaccard-continent-significance.qzv")

In [38]:
Visualization.load(f"{data_dir}/jaccard-gluten_clean-significance.qzv")

In [39]:
Visualization.load(f"{data_dir}/jaccard-bmi_category-significance.qzv")

### Adonis

In [40]:
! qiime diversity adonis \
    --i-distance-matrix $data_dir/jaccard.qza \
    --m-metadata-file $data_dir/metadata_cleaned.tsv \
    --p-formula "ibd_sample + age_range + sex_sample + continent + bmi_category + diet_type_sample" \
    --o-visualization $data_dir/jaccard-adonis_multi.qzv

Saved Visualization to: Project_data/Diversity/jaccard-adonis_multi.qzv

In [41]:
Visualization.load(f"{data_dir}/jaccard-adonis_multi.qzv")

Notes:
- Significant pairwise comparisons from beta-group-significance:
    - Omnivore <-> Vegan
    - Vegan <-> Vegetarian
    - Europe <-> North America
    - Europe <-> Oceania
- Significant terms from Adonis:
    - age_range
    - continent

Interpretation:
- Continent has the strongest effect, since it is significant in both tests.
- The effect of age_range is partially masked by other metadata variables.
- The diet_type_sample signal is confounded by other variables (continent, age, etc.).

In [42]:
# Check if difference between groups is due to group variance, not true group centroid separation
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/jaccard.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --m-metadata-column continent \
    --p-permutations 999 \
    --p-method permdisp \
    --o-visualization $data_dir/jaccard_continent_dispersion.qzv

Saved Visualization to: Project_data/Diversity/jaccard_continent_dispersion.qzv

In [43]:
Visualization.load(f"{data_dir}/jaccard_continent_dispersion.qzv")

Notes:
- PERMDISP is significant, which is problematic: the groups differ in spread.
- The significant PERMANOVA signal may therefore not reflect differences in group centroids (actual community shifts). It may instead arise because one group is more spread out, since unequal dispersion alone can produce a significant PERMANOVA result.

In [46]:
# Visualize possible clusterings
! qiime diversity pcoa \
    --i-distance-matrix $data_dir/jaccard.qza \
    --o-pcoa $data_dir/jaccard_pcoa.qza

! qiime emperor plot \
    --i-pcoa $data_dir/jaccard_pcoa.qza \
    --m-metadata-file $data_dir/../Metadata/updated_fungut_metadata.tsv \
    --o-visualization $data_dir/jaccard_pcoa_continent.qzv

Saved PCoAResults to: Project_data/Diversity/jaccard_pcoa.qza
Saved Visualization to: Project_data/Diversity/jaccard_pcoa_continent.qzv

In [47]:
Visualization.load(f"{data_dir}/jaccard_pcoa_continent.qzv")

Summary:   
Overall, alpha diversity showed no significant differences, indicating similar within-sample diversity across metadata attributes such as age, continent, and diet.
For beta diversity, Bray-Curtis and Jaccard distances were examined. The Jaccard results turned out to be unreliable because group dispersions differed significantly, so only the Bray-Curtis results are interpreted. With Bray-Curtis, a significant difference was detected between samples originating from different continents: gut microbiome composition differs between participants from different continents.