# 02 Diversity

This notebook calculates diversity metrics for both the non-collapsed and the collapsed feature tables. The diversity measures from the non-collapsed feature table will be used to assess inter-infant differences and temporal trajectories. The diversity measures from the collapsed feature table will be used to test for correlations with behavioural outcome measures.


<img src="./figures/workflow_diversity_interinf.jpg" width="58%"> <img src="./figures/workflow_diversity_outcomemeasures.jpg" width="40%"  align="top">

## Setup
Activate the `microbEvolve` environment before running this Jupyter notebook.

This step loads all required packages and stores the paths to the scripts and data directories in the variables `scripts_dir` and `data_dir`.

In [2]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [3]:
scripts_dir = "src"
data_dir = "../data"

## Rarefaction

We used alpha rarefaction to evaluate how sequencing depth affects within-sample diversity and to select a depth that preserves both diversity estimates and sample retention. Depths of 20 000 and 15 000 reads would have removed many samples without providing meaningful improvements in diversity estimates. The Shannon curves showed a steep rise up to roughly 5 000 reads and then reached a near-plateau between 5 000 and 10 000 reads, which indicates that the overall community structure is already captured in this range. The observed-features curves increased more slowly and did not fully plateau, which is expected for richness metrics, but the additional features gained beyond 9 000 reads were small compared with the large rise at low depths. 

A depth of 9 000 reads retains almost all samples while still lying within the stable range indicated by the Shannon curves. We therefore set the sampling depth to 9 000 reads, a balanced choice that preserves diversity saturation and maximizes sample retention.
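The trade-off between depth and sample retention can be illustrated with a minimal sketch. It is not the pipeline's implementation: the function names `rarefy` and `retention` and the toy read counts are assumptions made for illustration, and the numbers are invented rather than taken from the study data.

```python
import random

def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement; return None if the sample is too shallow."""
    total = sum(counts.values())
    if total < depth:
        return None  # the sample would be dropped at this depth
    # Expand the counts into one label per read, then subsample.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    rarefied = {}
    for feature in rng.sample(reads, depth):
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied

def retention(table, depth):
    """Fraction of samples that have at least `depth` reads."""
    kept = sum(1 for counts in table.values() if sum(counts.values()) >= depth)
    return kept / len(table)

# Toy feature table (hypothetical counts, not the study data).
table = {
    "s1": {"asv_a": 6000, "asv_b": 3500, "asv_c": 500},
    "s2": {"asv_a": 9000, "asv_b": 7000},
    "s3": {"asv_a": 4600, "asv_c": 4600},
    "s4": {"asv_a": 12000, "asv_b": 9000},
}

for depth in (9000, 15000, 20000):
    print(depth, retention(table, depth))
```

In this toy table every sample survives at 9 000 reads, while the two deeper thresholds discard half or more of the samples, which mirrors the reasoning behind the chosen depth.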

## K-mer size comparison

We chose an alignment-free k-mer approach as our primary method so that diversity can be calculated without relying on phylogenetic trees or reference databases. This gives a reference-free and computationally efficient view of the raw sequence data.
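To make the idea of a k-mer profile concrete, the sketch below counts all overlapping k-mers in a sample's sequences and weights each sequence by its read count. This is a generic illustration and not the tool used in `diversity.sh`; the function name `kmer_profile` and the toy sequences are assumptions.

```python
from collections import Counter

def kmer_profile(seq_counts, k):
    """Sum k-mer counts over all sequences, weighting each sequence by its read count."""
    profile = Counter()
    for sequence, n_reads in seq_counts.items():
        # Slide a window of length k over the sequence.
        for start in range(len(sequence) - k + 1):
            profile[sequence[start:start + k]] += n_reads
    return profile

# Toy sample: sequence -> read count (hypothetical, far shorter than real amplicons).
sample = {"ACGTACGTAC": 3, "ACGTTTGTAC": 1}

for k in (4, 6):
    profile = kmer_profile(sample, k)
    print(k, len(profile), sum(profile.values()))
```

Diversity metrics are then computed on the k-mer counts instead of the sequence counts, so two sequences that differ by a single base still share most of their k-mers.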

We compared k-mer sizes 12, 14, and 16 to identify the parameter that best preserves within-sample diversity patterns across the 2-, 4-, and 6-month groups.

We first tested how k-mer size affects within-sample diversity: Shannon entropy and Pielou’s evenness remained nearly identical across all three k-mer sizes.
This stability shows that, within the tested range, alpha-diversity estimates do not depend on the k-mer size and that the observed patterns are not artifacts of this parameter.
Because k-mer size 12 reproduced the same diversity structure as sizes 14 and 16 at a lower computational cost, it offered the most efficient representation.
We also checked how the first two PCoA axes of the Bray–Curtis ordination behaved: the cumulative variance explained declined slightly with increasing k-mer size (48%, 47%, and 46% for k-mer sizes 12, 14, and 16, respectively).
This trend supported the broader impression that larger k-mers did not capture additional structure.
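For reference, the three metrics named above can be written out in a few lines. This is a minimal sketch using the standard definitions, assuming base-2 logarithms for Shannon entropy; the function names are our own and the pipeline computes these values with its own implementation.

```python
import math

def shannon(counts, base=2):
    """Shannon entropy H = -sum(p_i * log(p_i)) over features with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)

def pielou(counts, base=2):
    """Pielou's evenness J = H / log(S), where S is the number of observed features."""
    observed = sum(1 for c in counts if c > 0)
    if observed < 2:
        return 0.0  # convention chosen here; evenness is undefined for a single feature
    return shannon(counts, base) / math.log(observed, base)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))

even = [10, 10, 10, 10]
skewed = [97, 1, 1, 1]
print(shannon(even), pielou(even))
print(shannon(skewed), pielou(skewed))
print(bray_curtis(even, skewed))
```

A perfectly even community reaches the maximum evenness of 1, while a community dominated by one feature scores much lower even though both contain the same number of features.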

We therefore selected k-mer size 12 because it produced stable Shannon and Pielou estimates that matched those from the larger k-mer sizes without the added computational cost. The consistency of alpha diversity across sizes 12, 14, and 16 provided the strongest evidence that k-mer size 12 is an appropriate and efficient choice for downstream analyses. The code for this comparison can be found in the archive folder.

To check whether the k-mer results were robust, we performed a secondary analysis using standard phylogeny-based core diversity metrics. These metrics rely on an inferred phylogenetic tree and therefore capture evolutionary relationships that are not considered in the alignment-free k-mer approach.


## Bootstrapping

We then applied bootstrapping, as recommended in one of our course notebooks, to avoid relying on a single rarefied subsample. Bootstrapping draws many subsamples at the chosen depth and computes diversity metrics for each draw. This approach shows the variability in the estimates and reduces the risk that one arbitrary subsample shapes the results. We used it to obtain more robust diversity estimates and to ensure that the chosen sampling depth produces stable metrics across repeated resampling, without assuming that a single subsample reflects the full data structure.
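The resampling idea can be sketched as follows. This is an illustration under stated assumptions, not the procedure run by `diversity.sh`: we assume subsampling without replacement, base-2 Shannon entropy, and a hypothetical helper `bootstrap_shannon`; the toy counts and the depth are scaled down so the example runs quickly.

```python
import math
import random
import statistics

def shannon(counts):
    """Base-2 Shannon entropy of a count vector."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def bootstrap_shannon(counts, depth, n_draws, seed=0):
    """Mean and standard deviation of Shannon entropy over repeated subsamples."""
    rng = random.Random(seed)
    # One entry per read, labelled with the index of its feature.
    reads = [i for i, n in enumerate(counts) for _ in range(n)]
    values = []
    for _ in range(n_draws):
        tallies = [0] * len(counts)
        for i in rng.sample(reads, depth):  # assumption: without replacement
            tallies[i] += 1
        values.append(shannon(tallies))
    return statistics.mean(values), statistics.stdev(values)

# Toy sample with 1 000 reads, subsampled to 900 reads 50 times.
counts = [500, 300, 150, 50]
mean_h, sd_h = bootstrap_shannon(counts, depth=900, n_draws=50)
print(mean_h, sd_h, shannon(counts))
```

A small standard deviation across draws indicates that the estimate at the chosen depth does not hinge on any single subsample.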

## Feature table selection
We calculated diversity metrics using both the non-collapsed and the collapsed feature tables. Using both variants allows flexibility depending on the analysis: the collapsed feature table prevents infants with multiple samples from dominating the results, while the non-collapsed feature table preserves all available data.
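As an illustration of what collapsing does, the sketch below merges all samples from the same infant into one row. The aggregation rule is an assumption: we sum the counts here, whereas the actual pipeline may use a different rule (for example a mean or median). The names `collapse_by_infant` and `sample_to_infant` and the toy data are hypothetical.

```python
from collections import defaultdict

def collapse_by_infant(table, sample_to_infant):
    """Merge samples that belong to the same infant by summing their feature counts."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for sample_id, counts in table.items():
        infant = sample_to_infant[sample_id]
        for feature, n in counts.items():
            collapsed[infant][feature] += n  # assumption: sum as the aggregation rule
    return {infant: dict(counts) for infant, counts in collapsed.items()}

# Toy data: infant_1 contributes three samples, infant_2 only one.
table = {
    "s1": {"asv_a": 10, "asv_b": 5},
    "s2": {"asv_a": 8},
    "s3": {"asv_b": 2, "asv_c": 1},
    "s4": {"asv_a": 4, "asv_c": 6},
}
sample_to_infant = {"s1": "infant_1", "s2": "infant_1", "s3": "infant_1", "s4": "infant_2"}

collapsed = collapse_by_infant(table, sample_to_infant)
print(collapsed)
```

After collapsing, each infant is represented by exactly one row, so an infant with three samples carries the same weight in a correlation as an infant with one.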

## ASV diversity
After assessing alpha diversity with the k-mer-based approach, we also examined ASV-based diversity metrics. This allows a direct comparison between the alignment-free and ASV-based methods and helps assess whether the choice of method influences the observed diversity patterns.

In [None]:
! sh {scripts_dir}/diversity.sh

In [None]:
Visualization.load(f"{data_dir}/processed/alpha_rarefaction.qzv")

<img src="./figures/kmer_comparison.png" width="58%">