# 02 Diversity

This notebook calculates diversity metrics for both the non-collapsed and collapsed feature tables. The diversity measures of the non-collapsed feature table will be used to assess inter-infant differences and temporal trajectories. The diversity measures from the collapsed feature table will be used to correlate behavioural outcome measures.


<img src="./figures/workflow_diversity_interinf.jpg" width="58%"> <img src="./figures/workflow_diversity_outcomemeasures.jpg" width="40%"  align="top">

## Setup
Activate the `microbEvolve` environment before running this Jupyter notebook. As before, this notebook can also be executed on a SLURM cluster by submitting the following job from the `scripts/` directory:

```bash
sbatch --time=03:59:00 --cpus-per-task=4 --mem-per-cpu=10G --output=slurm-%j.out --error=slurm-%j.err --wrap="bash -c 'module load eth_proxy && source $HOME/.bashrc && conda activate microbEvolve && jupyter nbconvert --to notebook --execute ./02_diversity.ipynb --output ./02_diversity.ipynb'"
```

This step loads all required packages and stores the paths to the scripts and data directories in the variables `scripts_dir` and `data_dir`.

In [1]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
scripts_dir = "src"
data_dir = "../data"

## Rarefaction

We used alpha rarefaction to evaluate how sequencing depth affects within-sample diversity and to select a depth that preserves both diversity estimates and sample retention. Depths of 20 000 and 15 000 reads would have removed many samples without providing meaningful improvements in diversity estimates. The Shannon curves showed a steep rise up to roughly 5 000 reads and then reached a near-plateau between 5 000 and 10 000 reads, which indicates that the overall community structure is already captured in this range. The observed-features curves increased more slowly and did not fully plateau, which is expected for richness metrics, but the additional features gained beyond 9 000 reads were small compared with the large rise at low depths. 

A depth of 9 000 reads retains almost all samples while still lying within the stable range indicated by the Shannon curves. We therefore set the sampling depth to 9 000 reads, which balances diversity saturation against sample retention.
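The retention side of this trade-off can be checked directly from per-sample read totals. The following is a minimal sketch, not the pipeline's code: the function name `samples_retained` and the toy counts are hypothetical, and the table is assumed to be laid out as samples (rows) by features (columns).

```python
import pandas as pd


def samples_retained(table: pd.DataFrame, depths) -> dict:
    """Count the samples whose total read count reaches each candidate depth.

    `table` is assumed to be a samples x features count table.
    """
    totals = table.sum(axis=1)  # total reads per sample
    return {depth: int((totals >= depth).sum()) for depth in depths}


# Toy counts for illustration only, not the study's feature table.
toy = pd.DataFrame(
    {"f1": [6000, 12000, 3000, 15000], "f2": [4000, 9000, 2000, 6000]},
    index=["s1", "s2", "s3", "s4"],
)
retention = samples_retained(toy, [5000, 9000, 15000, 20000])
```

Applied to the real feature table, the same call with the depths discussed above would show how many samples each candidate depth discards.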

## K-mer size comparison

We primarily chose an alignment-free k-mer approach to calculate diversity without relying on phylogenetic trees or reference databases. This allows for an unbiased and efficient view of the raw sequence data.

We compared k-mer sizes 12, 14, and 16 to identify the parameter that best preserves within-sample diversity patterns across the 2-, 4-, and 6-month groups.

We first tested how k-mer size affects within-sample diversity: Shannon entropy and Pielou’s evenness remained nearly identical across all three k-mer sizes.
This stability shows that alpha-diversity estimates do not depend on the k-mer choice and that the observed patterns are not artifacts of the parameter.
Because k-mer 12 reproduced the same diversity structure as k-mer 14 and 16 but with fewer computational demands, it offered the most efficient representation.
We also checked how the first two PCoA axes from Bray–Curtis distances behaved: the cumulative variance explained declined slightly with increasing k-mer size (48% → 47% → 46%).
This trend supported the broader impression that larger k-mers did not capture additional structure.
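For readers unfamiliar with how the "variance explained" of PCoA axes is obtained, here is a minimal NumPy sketch of classical PCoA on Bray–Curtis dissimilarities. It is an illustration under assumptions, not the pipeline's implementation: the function names and toy counts are hypothetical, and negative eigenvalues (which can occur because Bray–Curtis is not Euclidean) are simply clipped to zero, which is only one of several conventions.

```python
import numpy as np


def bray_curtis(counts) -> np.ndarray:
    """Pairwise Bray-Curtis dissimilarity between the rows of a count matrix."""
    x = np.asarray(counts, dtype=float)
    diff = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=2)
    total = (x[:, None, :] + x[None, :, :]).sum(axis=2)
    return diff / total


def pcoa_proportion_explained(dist: np.ndarray) -> np.ndarray:
    """Proportion of variance per PCoA axis, sorted from largest to smallest."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (dist ** 2) @ centering  # double-centred matrix
    eigvals = np.linalg.eigvalsh(gower)[::-1]  # descending order
    eigvals = np.clip(eigvals, 0.0, None)  # assumption: drop negative eigenvalues
    return eigvals / eigvals.sum()


# Toy samples x features counts, for illustration only.
toy_counts = np.array([[10, 0, 5], [8, 2, 5], [0, 12, 3], [1, 10, 4]])
proportions = pcoa_proportion_explained(bray_curtis(toy_counts))
first_two = proportions[:2].sum()  # cumulative share of the first two axes
```

The percentages quoted above correspond to `first_two` computed on the real distance matrices for each k-mer size.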

We therefore selected k-mer 12 because it produced stable Shannon and Pielou estimates that matched those from larger k-mers while avoiding unnecessary parameter inflation. The alpha-diversity consistency across 12, 14, and 16 provided the strongest evidence that k-mer 12 is the appropriate and efficient choice for downstream analyses. The code for this comparison analysis can be found in the archive folder. 
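To make the compared quantities concrete, the sketch below builds a read-weighted k-mer profile for one sample and computes Shannon entropy and Pielou's evenness from it. It is a simplified illustration, not the archived comparison code: the function names and toy sequences are hypothetical, and the toy uses k = 2, 3, 4 only because the example sequences are far shorter than real amplicons.

```python
import math
from collections import Counter


def sample_kmer_profile(seq_counts: dict, k: int) -> Counter:
    """Sum k-mer counts over sequences, weighting each sequence by its read count."""
    profile = Counter()
    for seq, reads in seq_counts.items():
        for i in range(len(seq) - k + 1):
            profile[seq[i:i + k]] += reads
    return profile


def shannon_entropy(counts, base=2.0) -> float:
    """Shannon entropy H = -sum(p * log(p)) over the non-zero counts."""
    values = [c for c in counts if c > 0]
    total = sum(values)
    return -sum((c / total) * math.log(c / total, base) for c in values)


def pielou_evenness(counts) -> float:
    """Pielou's evenness J = H / ln(S), with S the number of observed features."""
    values = [c for c in counts if c > 0]
    if len(values) < 2:
        return float("nan")  # evenness is undefined for a single feature
    return shannon_entropy(values, base=math.e) / math.log(len(values))


# Toy sample: sequence -> read count (hypothetical values).
toy_sample = {"ACGTACGTACGTACGT": 30, "ACGTTTGACCAGTACG": 10}
results = {}
for k in (2, 3, 4):
    profile = sample_kmer_profile(toy_sample, k)
    results[k] = (shannon_entropy(profile.values()), pielou_evenness(profile.values()))
```

Comparing the entries of `results` across k mirrors the check described above: if both metrics barely move as k grows, the smaller k is sufficient.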

To check whether the k-mer results were robust, we performed a secondary analysis using standard phylogeny-based core diversity metrics. These metrics rely on an inferred phylogenetic tree and therefore capture evolutionary relationships that are not considered in the alignment-free k-mer approach.


## Bootstrapping

We then applied bootstrapping, as recommended in one of our course notebooks, to avoid relying on a single rarefied subsample. Bootstrapping draws many subsamples at the chosen depth and computes diversity metrics for each draw. This approach shows the variability in the estimates and reduces the risk that one arbitrary subsample shapes the results. We used it to obtain more robust diversity estimates and to ensure that the chosen sampling depth produces stable metrics across repeated resampling, without assuming that a single subsample reflects the full data structure.
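The idea of repeated resampling can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than the code behind `diversity.sh`: the function names, the toy counts, and the number of draws are hypothetical, and reads are drawn with replacement via a multinomial distribution (drawing without replacement, as in classical rarefaction, would use a multivariate hypergeometric draw instead).

```python
import numpy as np


def shannon_entropy(counts: np.ndarray) -> float:
    """Shannon entropy (base 2) of a vector of counts."""
    counts = counts[counts > 0].astype(float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def resampled_shannon(counts, depth: int, n_draws: int, seed: int = 0) -> np.ndarray:
    """Shannon entropy of `n_draws` resamples of one sample at a fixed depth."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    # Assumption: sampling with replacement, proportional to observed counts.
    draws = rng.multinomial(depth, counts / counts.sum(), size=n_draws)
    return np.array([shannon_entropy(draw) for draw in draws])


# Toy feature counts for a single sample (hypothetical values).
toy_sample = np.array([5000, 3000, 1500, 400, 100])
estimates = resampled_shannon(toy_sample, depth=9000, n_draws=100)
summary = (estimates.mean(), estimates.std())  # centre and spread across draws
```

A small spread in `estimates` indicates that the metric is stable at the chosen depth, whereas a single draw would hide this variability.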

## Feature table selection
We calculated diversity metrics using both the non-collapsed and the collapsed feature tables. Using both variants allows flexibility depending on the analysis: the collapsed feature table prevents infants with multiple samples from dominating the results, while the non-collapsed table preserves all available data.
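As an illustration of what collapsing means, the sketch below merges all samples that belong to the same infant into one row. How the collapsed table in this project was actually built is not shown in this notebook, so everything here is an assumption: the function name `collapse_by_infant`, the grouping by infant ID alone, the choice between mean and sum, and the toy data.

```python
import pandas as pd


def collapse_by_infant(table: pd.DataFrame, infant_ids: pd.Series, how: str = "mean") -> pd.DataFrame:
    """Collapse a samples x features table to one row per infant.

    `infant_ids` maps sample IDs (the index of `table`) to infant IDs.
    """
    groups = infant_ids.reindex(table.index)  # align the mapping to the table rows
    grouped = table.groupby(groups)
    if how == "mean":
        return grouped.mean()
    if how == "sum":
        return grouped.sum()
    raise ValueError(f"unsupported aggregation: {how!r}")


# Toy data: two samples from infant_A, one from infant_B (hypothetical values).
toy_table = pd.DataFrame(
    {"f1": [10, 20, 30], "f2": [0, 4, 6]}, index=["s1", "s2", "s3"]
)
toy_ids = pd.Series({"s1": "infant_A", "s2": "infant_A", "s3": "infant_B"})
collapsed = collapse_by_infant(toy_table, toy_ids)
```

After collapsing, each infant contributes exactly one row regardless of how many samples were collected, which is the property that matters when correlating diversity with per-infant outcome measures.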

## ASV diversity
After the discussion in our presentation, we also estimated ASV-based diversity to compare temporal changes in alpha diversity with the k-mer-based approach. The rest of the analysis, however, is based on k-mer-derived diversity.

In [3]:
! sh {scripts_dir}/diversity.sh

[2025-12-18 17:19:43] Starting diversity script
[2025-12-18 17:19:43] Starting rarefaction...

Saved Visualization to: ../data/processed/alpha_rarefaction.qzv

[2025-12-18 17:23:31] Rarefaction completed
[2025-12-18 17:23:31] Starting bootstrapping...

Saved Collection[FeatureTable[Frequency]] to: ../data/raw/boots_kmer_diversity_collapsed/resampled_tables

Saved Collection[FeatureTable[Frequency]] to: ../data/raw/boots_kmer_diversity_collapsed/kmer_tables

Saved Collection[SampleData[AlphaDiversity]] to: ../data/raw/boots_kmer_diversity_collapsed/alpha_diversities

Saved Collection[DistanceMatrix] to: ../data/raw/boots_kmer_diversity_collapsed/distance_matrices

Saved Collection[PCoAResults] to: ../data/raw/boots_kmer_diversity_collapsed/pcoas

Saved Visualization to: ../data/raw/boots_kmer_diversity_collapsed/scatter_plot.qzv

Saved Collection[FeatureTable[Frequency]] to: ../data/raw/boots_kmer_diversity/resampled_tables

Saved Collection[FeatureTable[Frequency]] to: ../data/raw/boots_kmer_diversity/kmer_tables

Saved Collection[SampleData[AlphaDiversity]] to: ../data/raw/boots_kmer_diversity/alpha_diversities

Saved Collection[DistanceMatrix] to: ../data/raw/boots_kmer_diversity/distance_matrices

Saved Collection[PCoAResults] to: ../data/raw/boots_kmer_diversity/pcoas

Saved Visualization to: ../data/raw/boots_kmer_diversity/scatter_plot.qzv

QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.

Saved Collection[FeatureTable[Frequency]] to: ../data/raw/boots_core_metrics_collapsed/resampled_tables

Saved Collection[SampleData[AlphaDiversity]] to: ../data/raw/boots_core_metrics_collapsed/alpha_diversities

Saved Collection[DistanceMatrix] to: ../data/raw/boots_core_metrics_collapsed/distance_matrices

Saved Collection[PCoAResults] to: ../data/raw/boots_core_metrics_collapsed/pcoas

Saved Collection[Visualization] to: ../data/raw/boots_core_metrics_collapsed/emperor_plots

Saved Visualization to: ../data/raw/boots_core_metrics_collapsed/scatter_plot.qzv

Saved Collection[FeatureTable[Frequency]] to: ../data/raw/boots_core_metrics/resampled_tables

Saved Collection[SampleData[AlphaDiversity]] to: ../data/raw/boots_core_metrics/alpha_diversities

Saved Collection[DistanceMatrix] to: ../data/raw/boots_core_metrics/distance_matrices

Saved Collection[PCoAResults] to: ../data/raw/boots_core_metrics/pcoas

Saved Collection[Visualization] to: ../data/raw/boots_core_metrics/emperor_plots

Saved Visualization to: ../data/raw/boots_core_metrics/scatter_plot.qzv

[2025-12-18 18:05:59] Bootstrapping completed
[2025-12-18 18:05:59] Generating alpha-significance and alpha-correlation ...

Saved Visualization to: ../data/processed/shannon_kmer_significance.qzv

Saved Visualization to: ../data/processed/shannon_kmer_correlation.qzv

Saved Visualization to: ../data/processed/shannon_core_significance_collapsed.qzv

Saved Visualization to: ../data/processed/shannon_core_correlation_collapsed.qzv

QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.

Saved Visualization to: ../data/processed/shannon_core_significance.qzv

Saved Visualization to: ../data/processed/shannon_core_correlation.qzv

[2025-12-18 18:09:19] Generating alpha-significance and alpha-correlation ...

[2025-12-18 18:09:19] Diversity script completed successfully!


In [4]:
Visualization.load(f"{data_dir}/processed/alpha_rarefaction.qzv")

<img src="./figures/kmer_comparison.png" width="58%">