# Feature Table and Metadata preparation

This notebook covers the preparation of the feature table and the metadata.

The number of samples collected from each infant at a given timepoint varies, depending on how often a child produces a stool. This leads to uneven sampling frequency across infants and timepoints. Therefore we decided to work with two feature tables for the downstream analysis:
- non-collapsed
- collapsed  

The collapsed feature table is collapsed by infant at each timepoint to control for the uneven sampling frequency. This version is used for correlations with behavioural outcome measures, so that over-represented infants do not skew the results.
The non-collapsed feature table is used for all other analyses, so that all data are included and no information is lost unnecessarily.

<img src=./figures/workflow_collapsed_noncollapsed.jpg alt="Description" width="750" height="">

## Setup
Activate the environment `microbEvolve` before running this Jupyter notebook.

As before, this notebook can be executed on a SLURM cluster when submitted from the `scripts/` directory:

```bash
sbatch --time=03:59:00 --cpus-per-task=4 --mem-per-cpu=10G --output=slurm-%j.out --error=slurm-%j.err --wrap="bash -c 'module load eth_proxy && source $HOME/.bashrc && conda activate microbEvolve && jupyter nbconvert --to notebook --execute ./01-2_featuretable_metadata_preparation.ipynb --output ./01-2_featuretable_metadata_preparation.ipynb'"
```

This step loads all required packages and stores the paths to the scripts and data directories in the variables `scripts_dir` and `data_dir`.

In [1]:
import os
import pandas as pd
from qiime2 import Artifact, Metadata, Visualization
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
scripts_dir = "src"
data_dir = "../data"

## Merge metadata_per_sample and metadata_per_age
The metadata consists of two files, `metadata_per_sample` and `metadata_per_age`.  
- `metadata_per_sample` contains information for each sample, including the infant it comes from and the timepoint of collection.
- `metadata_per_age` contains information about each infant at a specific timepoint, including measurements describing developmental state, sleep rhythm, and sleep quality.

We merge these two files into a single metadata file `metadata.tsv` to simplify handling in further analyses. 
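The merge can be pictured as a left join of the per-age measurements onto every per-sample row. The following is a minimal, dependency-free sketch on made-up rows, not the code in `merge_metadata.py`. It assumes that both files share the columns `infant_id` and `timepoint` as join keys; the column `sleep_quality` and all values are hypothetical.

```python
# Toy rows standing in for the two metadata files (all values are made up).
per_sample = [
    {"sampleid": "S1", "infant_id": "A", "timepoint": "2 months"},
    {"sampleid": "S2", "infant_id": "A", "timepoint": "2 months"},
    {"sampleid": "S3", "infant_id": "B", "timepoint": "4 months"},
]
per_age = [
    {"infant_id": "A", "timepoint": "2 months", "sleep_quality": 0.7},
    {"infant_id": "B", "timepoint": "4 months", "sleep_quality": 0.4},
]


def merge_metadata(per_sample, per_age, keys=("infant_id", "timepoint")):
    """Left-join per-age measurements onto every per-sample row.

    The join keys are an assumption about the real files.
    """
    lookup = {tuple(row[k] for k in keys): row for row in per_age}
    merged = []
    for row in per_sample:
        age_row = lookup.get(tuple(row[k] for k in keys), {})
        extra = {k: v for k, v in age_row.items() if k not in keys}
        merged.append({**row, **extra})
    return merged


merged = merge_metadata(per_sample, per_age)
for row in merged:
    print(row)
```

Every sample from the same infant and timepoint receives the same per-age values, which is why the per-age measurements are repeated across rows in the merged file.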

Additionally, we create a metadata file `metadata_withtypes.tsv` that includes the data type of each column. This file is used later to assess the significance of beta diversity differences between timepoints.
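In QIIME 2 metadata files, column types can be declared in a directive row that starts with `#q2:types` and sits directly below the header, with each column marked as `categorical` or `numeric`. The sketch below shows one way such a file could be produced. The helper names `column_type` and `to_typed_tsv`, the toy rows, and the rule used to infer types are assumptions for illustration; the actual script may assign types differently.

```python
import csv
import io

# Toy metadata rows (values are made up).
rows = [
    {"sampleid": "S1", "infant_id": "A", "timepoint": 2},
    {"sampleid": "S2", "infant_id": "B", "timepoint": 4},
]


def column_type(rows, column):
    """Label a column 'numeric' if every value is an int or float."""
    is_numeric = all(
        isinstance(row[column], (int, float)) and not isinstance(row[column], bool)
        for row in rows
    )
    return "numeric" if is_numeric else "categorical"


def to_typed_tsv(rows, id_column="sampleid"):
    """Return TSV text with a '#q2:types' directive row below the header."""
    columns = [id_column] + [c for c in rows[0] if c != id_column]
    types = ["#q2:types"] + [column_type(rows, c) for c in columns[1:]]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerow(types)
    writer.writerows([row[c] for c in columns] for row in rows)
    return buffer.getvalue()


typed_tsv = to_typed_tsv(rows)
print(typed_tsv)
```

Declaring the types explicitly avoids relying on type inference, for example when a column of numeric codes should be treated as categorical.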

In [3]:
!python "{scripts_dir}/merge_metadata.py"

Merging metadata...


Metadata file written to ../data/raw/metadata.tsv
Metadata file with type information written to ../data/raw/metadata_withtypes.tsv


## Collapse feature table and metadata
The number of samples collected from each infant at a given timepoint varies, depending on how often the child produces a stool. The `sample_number` column in `data/raw/metadata_per_sample.tsv` indicates the order in which the samples were taken.
Since the number of samples is not consistent across infants at a specific timepoint, this can introduce biases in the data, as infants with more samples may disproportionately influence the results. 
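A quick way to see this imbalance is to count the samples per infant and timepoint. The snippet below is a small illustration on made-up rows; only the column names `infant_id` and `timepoint` are taken from the metadata described here.

```python
from collections import Counter

# Toy per-sample rows (values are made up).
samples = [
    {"sampleid": "S1", "infant_id": "A", "timepoint": "2 months"},
    {"sampleid": "S2", "infant_id": "A", "timepoint": "2 months"},
    {"sampleid": "S3", "infant_id": "A", "timepoint": "2 months"},
    {"sampleid": "S4", "infant_id": "B", "timepoint": "2 months"},
]

# Number of samples contributed by each infant at each timepoint.
counts = Counter((row["infant_id"], row["timepoint"]) for row in samples)

for (infant, timepoint), n in sorted(counts.items()):
    print(f"infant {infant} at {timepoint}: {n} sample(s)")
```

In this toy case infant A contributes three of the four samples at the 2-month timepoint, so any per-sample statistic would be dominated by that infant.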

We decided to define one reference sample per timepoint for each infant by averaging the abundance of each ASV. The metadata was then collapsed so that each infant has a single representative sample per timepoint.

This was done in three steps: 
1. **Create an intermediate metadata file**  
    A new column, `infant_time`, was added to the existing `metadata.tsv`, combining `infant_id` and `timepoint`. The updated metadata was then saved as an intermediate file (`metadata_infant_time.tsv`).
    All samples from the same infant at the same timepoint share the same `infant_time` entry.
    This field is later used to collapse the feature table.  

2. **Collapse the feature table**  
    Using `qiime feature-table group`, we grouped the samples in the feature table by the `infant_time` column.
    The mean-ceiling method was applied to average ASV abundances across the samples, producing one representative table entry per infant per timepoint.  

3. **Collapse the metadata**  
    The original `sampleid` column was removed, and `infant_time` was renamed to `sampleid`, which becomes the new unique identifier. The metadata was then collapsed to have one representative entry per infant per timepoint.
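The three steps above can be sketched in plain Python on toy data. This is an illustration of the logic only, not the project scripts or the QIIME 2 implementation. Several details are assumptions: the underscore used to join `infant_id` and `timepoint`, the choice to keep the first metadata row of each group, and all sample and ASV values. The mean-ceiling mode is modelled as the ceiling of the arithmetic mean.

```python
import math
from collections import defaultdict

# Toy metadata and feature table (values are made up).
metadata = [
    {"sampleid": "S1", "infant_id": "A", "timepoint": 2},
    {"sampleid": "S2", "infant_id": "A", "timepoint": 2},
    {"sampleid": "S3", "infant_id": "B", "timepoint": 2},
]
table = {
    "S1": {"ASV1": 10, "ASV2": 0},
    "S2": {"ASV1": 5, "ASV2": 3},
    "S3": {"ASV1": 7, "ASV2": 1},
}

# Step 1: add the grouping column (the "_" separator is an assumption).
for row in metadata:
    row["infant_time"] = f"{row['infant_id']}_{row['timepoint']}"

# Step 2: group samples by infant_time and take the ceiling of the mean
# abundance of each ASV. All samples are assumed to list the same ASVs.
groups = defaultdict(list)
for row in metadata:
    groups[row["infant_time"]].append(table[row["sampleid"]])

collapsed_table = {
    group: {
        asv: math.ceil(sum(sample[asv] for sample in samples) / len(samples))
        for asv in samples[0]
    }
    for group, samples in groups.items()
}

# Step 3: drop the old sampleid, use infant_time as the new sampleid, and
# keep the first row per group (assumed rule for choosing the representative).
collapsed_metadata = {}
for row in metadata:
    rest = {k: v for k, v in row.items() if k not in ("sampleid", "infant_time")}
    collapsed_metadata.setdefault(
        row["infant_time"], {"sampleid": row["infant_time"], **rest}
    )
collapsed_metadata = list(collapsed_metadata.values())

print(collapsed_table)
print(collapsed_metadata)
```

For infant A, the two samples have ASV1 counts of 10 and 5, so the mean of 7.5 is rounded up to 8. Rounding up keeps the table integer-valued and prevents a rare ASV from being averaged down to zero.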

In [4]:
!python "{scripts_dir}/infant_time_metadata.py"

Starting script infant_time_metadata...
Intermediate metadata file metadata_infant_time.tsv created successfully!


In [5]:
! sh {scripts_dir}/collapse_featuretable.sh

[2025-12-18 11:15:05] Starting script to collapse feature table...
[2025-12-18 11:15:05] Collapse feature table to have on representative sample per infant per timepoint...


Saved FeatureTable[Frequency] to: ../data/raw/table_collapsed.qza

[2025-12-18 11:16:01] Feature table collapsed successfully!
[2025-12-18 11:16:01] Script to collapse feature table completed successfully!


In [6]:
!python "{scripts_dir}/collapse_metadata.py"

Starting script collapse_metadata script...
Metadata collapsed successfully, collapsed_metadata.tsv and metadata_collapsed_withtypes.tsv stored in ../data/raw
