# S3. Change of microbial communities between different timepoints 

Author: Marc Kesselring


This Jupyter notebook analyzes how the microbial communities change between the two timepoints (abduction and recovery), using ANCOM-BC for differential abundance testing.

**Exercise overview:**<br>
[1. Setup](#setup)<br>
[2. Filter data and run ANCOM-BC](#filter)<br>
[3. Statistical Evaluation](#ancom)<br>
[4. Visualization](#visual)<br>

<a id='setup'></a>

## 1. Setup

In [36]:
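import os
import subprocess

import matplotlib.pyplot as plt
import pandas as pd
import qiime2 as q2
import seaborn as sns
from qiime2 import Artifact, Visualization
from scipy.stats import shapiro, kruskal, f_oneway

# Directory format used in section 3 to read the ANCOM-BC differentials
# (assumed to be exported by the q2-composition plugin)
from q2_composition import DataLoafPackageDirFmt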

%matplotlib inline

In [2]:
raw_data_dir = "../data/raw"
data_dir = "../data/processed"
vis_dir  = "../results"

<a id='filter'></a>

## 2. Filter data and run ANCOM-BC

The data were already filtered in notebook 06_DifferentalAbundance.ipynb: features were retained only if they had a minimum frequency of 25 and were present in at least 5 samples. Afterwards, the features were collapsed to the phylum, class, order, family, genus and species levels. In addition, the metadata is filtered to contain only patients that have a sample for both timepoints (abduction and recovery).
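
The filtering itself happens in the earlier notebook and is not shown here. As a rough illustration of the two thresholds, the following minimal sketch applies them to a made-up table with pandas. It assumes that "minimum frequency" means the total count of a feature across all samples; the table, the feature names and the variable names are hypothetical.

```python
import pandas as pd

# Toy feature table (made-up counts): rows are samples, columns are features
table = pd.DataFrame(
    {
        "feat_a": [10, 5, 4, 3, 2, 6],   # total 30, observed in 6 samples
        "feat_b": [30, 0, 0, 0, 0, 0],   # total 30, observed in 1 sample
        "feat_c": [1, 1, 1, 1, 1, 1],    # total 6, observed in 6 samples
    },
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)

min_frequency = 25  # assumed: minimum total count of a feature over all samples
min_samples = 5     # minimum number of samples in which a feature is observed

total_count = table.sum(axis=0)
n_samples_present = (table > 0).sum(axis=0)

# A feature is kept only if it passes both thresholds
keep = (total_count >= min_frequency) & (n_samples_present >= min_samples)
filtered_table = table.loc[:, keep]

print(list(filtered_table.columns))
```

In this toy table only `feat_a` passes both thresholds: `feat_b` is abundant but observed in a single sample, and `feat_c` is widespread but too rare.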

In [12]:
# Load metadata as dataframe
meta = pd.read_csv(f"{data_dir}/metadata_binned.tsv", sep="\t")

# Identify the Patient_IDs that occur exactly twice in the metadata
patient_counts = meta.Patient_ID.value_counts()
true_patient_ids = patient_counts[patient_counts == 2].index

# Filter the meta table to include only rows with these Patient_IDs
filtered_meta = meta[meta.Patient_ID.isin(true_patient_ids)]

# Display the filtered meta table
filtered_meta

| Unnamed: 0 | sample-id | Patient_ID | Stool_Consistency | Patient_Sex | Sample_Day | Recovery_Day | Cohort_Number | Cohort_Number_Bin |
|---|---|---|---|---|---|---|---|---|
| 0 | EG2580 | P042 | liquid | F | 13 | 17.0 | 2 | Recovery |
| 1 | EG2559 | P043 | liquid | M | 15 | 17.0 | 2 | Recovery |
| 2 | EG2537 | P042 | liquid | F | 0 | 17.0 | 1 | Abduction |
| 3 | EG2518 | P043 | liquid | M | 0 | 17.0 | 1 | Abduction |
| 5 | EG2473 | P055 | semi-formed | M | 20 | 22.0 | 2 | Recovery |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 96 | EG2638 | P017 | semi-formed | M | 12 | 17.0 | 2 | Recovery |
| 97 | EG2608 | P034 | formed | F | 0 | 18.0 | 1 | Abduction |
| 98 | EG2591 | P017 | liquid | M | 0 | 17.0 | 1 | Abduction |
| 99 | EG0141 | P032 | liquid | F | 0 | 21.0 | 1 | Abduction |


In [14]:
# Generate tsv file from altered metadata
filtered_meta.to_csv(f"{data_dir}/timepoint_filtered_metadata_binned.tsv", sep='\t', index=False)
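
Keeping patients with exactly two rows does not by itself guarantee one sample per timepoint, since a patient could in principle have two samples from the same timepoint. A minimal sketch of a stricter check, using made-up metadata with the same column names (`Patient_ID`, `Cohort_Number_Bin`), could look like this; the toy values are hypothetical.

```python
import pandas as pd

# Made-up metadata: P1 is properly paired, P2 has two Recovery samples,
# P3 has only an Abduction sample
toy_meta = pd.DataFrame({
    "sample-id": ["a1", "a2", "b1", "b2", "c1"],
    "Patient_ID": ["P1", "P1", "P2", "P2", "P3"],
    "Cohort_Number_Bin": ["Abduction", "Recovery", "Recovery", "Recovery", "Abduction"],
})

# Count the samples per patient and timepoint
counts = pd.crosstab(toy_meta["Patient_ID"], toy_meta["Cohort_Number_Bin"])

# A patient counts as paired if it has exactly one sample at each timepoint
paired_ids = counts[(counts["Abduction"] == 1) & (counts["Recovery"] == 1)].index
paired_meta = toy_meta[toy_meta["Patient_ID"].isin(paired_ids)]

print(list(paired_ids))
```

On the toy data only `P1` is retained, although `P2` also has two rows.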

##### To use ANCOM-BC, the feature table and the metadata file have to contain exactly the same sample-ids. The following for-loop runs through all levels of the collapsed taxa feature tables, restricts each table to the sample-ids in the filtered metadata file and then runs ANCOM-BC on it. Additionally, a barplot and a results visualization are generated from the ANCOM-BC output.
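
The sample-id matching described above can also be sketched with an index intersection instead of a merge. This is only an illustration on made-up data, not the approach used in the loop; the taxa, sample-ids and variable names are hypothetical.

```python
import pandas as pd

# Made-up feature table (samples x taxa) and metadata with partly different samples
toy_table = pd.DataFrame(
    {"taxon_x": [5, 0, 7], "taxon_y": [1, 2, 3]},
    index=pd.Index(["s1", "s2", "s3"], name="sample-id"),
)
toy_meta = pd.DataFrame({
    "sample-id": ["s3", "s1", "s9"],
    "Cohort_Number_Bin": ["Recovery", "Abduction", "Abduction"],
})

# Sample-ids that occur in both the feature table and the metadata
shared_ids = toy_table.index.intersection(toy_meta["sample-id"])

# Restrict both objects to the shared sample-ids
matched_table = toy_table.loc[shared_ids]
matched_meta = toy_meta[toy_meta["sample-id"].isin(shared_ids)]

# Both now describe exactly the same set of samples
assert set(matched_table.index) == set(matched_meta["sample-id"])
```

Compared with merging and dropping the metadata columns again, this variant never mixes metadata columns into the feature table, so no column list has to be kept in sync with the metadata file.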

In [88]:
# Define the collapsed taxonomy levels (l7 = species down to l2 = phylum)
levels = ["l7", "l6", "l5", "l4", "l3", "l2"]


# Loop through the levels and run the commands
for level in levels:
    try:
        print(f"Running commands for level: {level}")
        
        # Keep only the samples present in the filtered metadata, so that every patient has a sample for both timepoints
        data_level = q2.Artifact.load(f"{data_dir}/table_abund_{level}.qza").view(pd.DataFrame)
        combined_level = filtered_meta.merge(data_level, left_on='sample-id', right_index=True, how='inner')
        # Drop the metadata columns again so that only the feature counts remain
        combined_drop_level = combined_level.drop(['Patient_ID', 'Stool_Consistency', 'Patient_Sex', 'Sample_Day', 'Recovery_Day', 'Cohort_Number', 'Cohort_Number_Bin'], axis=1)
        combined_drop_level.set_index('sample-id', inplace=True)
        # Write one TSV copy per level (the original file name lacked the {level} placeholder) and keep the sample-id index
        combined_drop_level.to_csv(f"{data_dir}/table_abund_{level}_filtered.tsv", sep='\t')
        
        
        # Save feature table as qza artifact
        table_level = Artifact.import_data('FeatureTable[Frequency]',combined_drop_level)
        table_level.save(f"{data_dir}/table_abund_{level}_filtered.qza")
        
        # Run ANCOM-BC
        subprocess.run([
            "qiime", "composition", "ancombc",
            "--i-table", f"{data_dir}/table_abund_{level}_filtered.qza",
            "--m-metadata-file", f"{data_dir}/timepoint_filtered_metadata_binned.tsv",
            "--p-formula", "Cohort_Number_Bin",
            "--o-differentials", f"{data_dir}/ancombc_cohort_number_{level}_differentials.qza"
        ], check=True)
        
        # Generate a barplot
        subprocess.run([
            "qiime", "composition", "da-barplot",
            "--i-data", f"{data_dir}/ancombc_cohort_number_{level}_differentials.qza",
            "--o-visualization", f"{data_dir}/ancombc_cohort_number_{level}_barplot.qzv"
        ], check=True)
        
        # Generate a results table
        subprocess.run([
            "qiime", "composition", "tabulate",
            "--i-data", f"{data_dir}/ancombc_cohort_number_{level}_differentials.qza",
            "--o-visualization", f"{data_dir}/ancombc_cohort_number_{level}_results.qzv"
        ], check=True)
        
        print(f"Commands for level {level} completed successfully!")
    except subprocess.CalledProcessError as e:
        print(f"Error running commands for level {level}: {e}")


Running commands for level: l7
Saved FeatureData[DifferentialAbundance] to: ../data/processed/ancombc_cohort_number_l7_differentials.qza
Saved Visualization to: ../data/processed/ancombc_cohort_number_l7_barplot.qzv
Saved Visualization to: ../data/processed/ancombc_cohort_number_l7_results.qzv
Commands for level l7 completed successfully!
Running commands for level: l6
Saved FeatureData[DifferentialAbundance] to: ../data/processed/ancombc_cohort_number_l6_differentials.qza
Saved Visualization to: ../data/processed/ancombc_cohort_number_l6_barplot.qzv
Saved Visualization to: ../data/processed/ancombc_cohort_number_l6_results.qzv
Commands for level l6 completed successfully!
Running commands for level: l5
Saved FeatureData[DifferentialAbundance] to: ../data/processed/ancombc_cohort_number_l5_differentials.qza
Saved Visualization to: ../data/processed/ancombc_cohort_number_l5_barplot.qzv
Saved Visualization to: ../data/processed/ancombc_cohort_number_l5_results.qzv
Commands for level l5 c

<a id='ancom'></a>

## 3. Statistical Evaluation

##### For every level of collapsed taxa, the ANCOM-BC differentials artifact is filtered for features with q-values <= 0.05 using a for-loop.
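
The q-value filter can be sketched on made-up data as follows. The sketch selects the slices by name rather than by position; the dictionary keys, the helper `get_slice` and all values are hypothetical stand-ins, and only the column name `Cohort_Number_BinRecovery` is taken from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for a dictionary of ANCOM-BC slices (keys are assumed names)
toy_slices = {
    "q_val_slice.csv": pd.DataFrame({
        "id": ["t1", "t2", "t3"],
        "Cohort_Number_BinRecovery": [0.002, 0.20, 0.05],
    }),
    "lfc_slice.csv": pd.DataFrame({
        "id": ["t1", "t2", "t3"],
        "Cohort_Number_BinRecovery": [1.5, -0.1, -2.0],
    }),
}


def get_slice(slices, prefix):
    """Return the single slice whose key starts with `prefix`,
    indexed by its first column and with `prefix` added to the column names."""
    matches = [key for key in slices if key.startswith(prefix)]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one slice starting with {prefix!r}, found {matches}")
    frame = slices[matches[0]].copy()
    frame = frame.set_index(frame.columns[0])
    return frame.add_prefix(f"{prefix}_")


lfc = get_slice(toy_slices, "lfc")
q_val = get_slice(toy_slices, "q_val")
df = pd.concat([lfc, q_val], axis=1, join="inner")

# Keep the features whose q-value is at most 0.05, together with their log-fold change
q_col = "q_val_Cohort_Number_BinRecovery"
significant = df.loc[df[q_col] <= 0.05, ["lfc_Cohort_Number_BinRecovery", q_col]]
print(significant)
```

Selecting by name makes the result independent of the order in which the slices are returned, which positional indexing silently relies on.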

In [110]:
# Map the collapsed level names to their taxonomic ranks
taxonomy_levels = {
    "l7": "species",
    "l6": "genus",
    "l5": "family",
    "l4": "order",
    "l3": "class",
    "l2": "phylum"
}

# Initialize a dictionary to store results
results = {}

# Iterate through each taxonomy level
for level, name in taxonomy_levels.items():
    print(f"Processing taxonomy level: {name} ({level})")
    
    # Load the ANCOM-BC results as a directory format
    artifact_path = f'{data_dir}/ancombc_cohort_number_{level}_differentials.qza'
    dirfmt = q2.Artifact.load(artifact_path).view(DataLoafPackageDirFmt)

    # Extract data slices
    slices = {str(relpath): view for relpath, view in dirfmt.data_slices.iter_views(pd.DataFrame)}

    # Prepare the dataframes
    # (note: this relies on the slices being returned in the order lfc, p_val, q_val)
    lfc = slices[list(slices.keys())[0]]
    lfc.set_index(lfc.columns[0], inplace=True)
    lfc.columns = ['lfc_' + col for col in lfc.columns]

    p_val = slices[list(slices.keys())[1]]
    p_val.set_index(p_val.columns[0], inplace=True)
    p_val.columns = ['p_val_' + col for col in p_val.columns]

    q_val = slices[list(slices.keys())[2]]
    q_val.set_index(q_val.columns[0], inplace=True)
    q_val.columns = ['q_val_' + col for col in q_val.columns]

    # Combine the dataframes
    df = pd.concat([lfc, p_val, q_val], axis=1, join='inner')

    # Select and count the features with a significant q-value (<= 0.05) for the timepoint effect
    cohort_significant = df.loc[df['q_val_Cohort_Number_BinRecovery'] <= 0.05, 'q_val_Cohort_Number_BinRecovery']
    cohort_significant_num = len(cohort_significant)

    # Store the results
    results[level] = {
        "taxonomy_level": name,
        "cohort_significant_num": cohort_significant_num,
        "cohort_significant_tax": cohort_significant
    }

    print(f"Finished processing {name}. Cohort_significant_num: {cohort_significant_num}, Cohort_significant_taxa: {cohort_significant}")

# Convert results to a DataFrame for better visualization
#results_df = pd.DataFrame.from_dict(results, orient='index')

# Display the results
#print(results_df)

Processing taxonomy level: species (l7)
Finished processing species. Cohort_significant_num: 4, Cohort_significant_taxa: id
d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__Clostridium_sensu_stricto_1;s__            0.002497
d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Blautia;s__                              0.005644
d__Bacteria;p__Firmicutes;c__Bacilli;o__Erysipelotrichales;f__Erysipelotrichaceae;g__[Clostridium]_innocuum_group;s__    0.015709
d__Bacteria;p__Firmicutes;c__Negativicutes;o__Veillonellales-Selenomonadales;__;__;__                                    0.005145
Name: q_val_Cohort_Number_BinRecovery, dtype: float64
Processing taxonomy level: genus (l6)
Finished processing genus. Cohort_significant_num: 4, Cohort_significant_taxa: id
d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__Clostridium_sensu_stricto_1            0.002497
d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirale

<a id='visual'></a>

## 4. Visualization