# S3. Change of microbial communities between different timepoints 

Author: Marc Kesselring


This Jupyter notebook analyzes how microbial communities change between two timepoints (abduction and recovery).

**Exercise overview:**<br>
[1. Setup](#setup)<br>
[2. Filter data](#filter)<br>
[3. Analysis of composition of microbiomes](#ancom)<br>

<a id='setup'></a>

## 1. Setup

In [36]:
import os
import matplotlib.pyplot as plt
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization
import seaborn as sns
from scipy.stats import shapiro, kruskal, f_oneway
import subprocess
from qiime2 import Artifact

%matplotlib inline

In [2]:
raw_data_dir = "../data/raw"
data_dir = "../data/processed"
vis_dir  = "../results"

<a id='filter'></a>

## 2. Filter data

The feature data was already filtered in notebook 06_DifferentalAbundance.ipynb. Features were retained only if they had a minimum frequency of 25 and were present in at least 5 samples. The features were then collapsed to the phylum, class, order, family, genus and species levels. In this notebook, the metadata is additionally restricted to patients who have a sample for both timepoints (abduction and recovery).
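
The two feature thresholds described above can be illustrated with plain pandas. This is only a sketch on a toy table, not the filtering code from the earlier notebook. It assumes that "minimum frequency" means the total count of a feature across all samples; the names `table`, `min_frequency` and `min_samples` are hypothetical.

```python
import pandas as pd

# Toy feature table (hypothetical data): rows are samples, columns are features.
table = pd.DataFrame(
    {
        "feat_A": [10, 5, 4, 3, 2, 1],  # total 25, present in 6 samples
        "feat_B": [30, 0, 0, 0, 0, 0],  # total 30, present in 1 sample
        "feat_C": [1, 1, 1, 1, 1, 0],   # total 5, present in 5 samples
    },
    index=[f"S{i}" for i in range(1, 7)],
)

min_frequency = 25  # assumed: minimum total count across all samples
min_samples = 5     # minimum number of samples with a non-zero count

total_frequency = table.sum(axis=0)
n_samples_present = (table > 0).sum(axis=0)

# A feature is kept only if it passes both thresholds.
keep = (total_frequency >= min_frequency) & (n_samples_present >= min_samples)
filtered_table = table.loc[:, keep]

print(list(filtered_table.columns))
```

In this toy table only `feat_A` passes both thresholds: `feat_B` is abundant but occurs in a single sample, and `feat_C` is widespread but too rare.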

In [12]:
# Load metadata as dataframe
meta = pd.read_csv(f"{data_dir}/metadata_binned.tsv", sep="\t")

# Identify the Patient_IDs with a count of 2
true_patient_ids = meta.Patient_ID.value_counts()[meta.Patient_ID.value_counts() == 2].index

# Filter the meta table to include only rows with these Patient_IDs
filtered_meta = meta[meta.Patient_ID.isin(true_patient_ids)]

# Display the filtered meta table
filtered_meta

,sample-id,Patient_ID,Stool_Consistency,Patient_Sex,Sample_Day,Recovery_Day,Cohort_Number,Cohort_Number_Bin
0,EG2580,P042,liquid,F,13,17.0,2,Recovery
1,EG2559,P043,liquid,M,15,17.0,2,Recovery
2,EG2537,P042,liquid,F,0,17.0,1,Abduction
3,EG2518,P043,liquid,M,0,17.0,1,Abduction
5,EG2473,P055,semi-formed,M,20,22.0,2,Recovery
...,...,...,...,...,...,...,...,...
96,EG2638,P017,semi-formed,M,12,17.0,2,Recovery
97,EG2608,P034,formed,F,0,18.0,1,Abduction
98,EG2591,P017,liquid,M,0,17.0,1,Abduction
99,EG0141,P032,liquid,F,0,21.0,1,Abduction
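
Counting two rows per `Patient_ID` does not by itself guarantee that the two rows belong to different timepoints. The sketch below shows one possible extra check on toy metadata that reuses the column names from the table above. The data and the names `toy_meta`, `paired_ids` and `paired_meta` are hypothetical.

```python
import pandas as pd

# Toy metadata (hypothetical values) with the column names used in this notebook.
toy_meta = pd.DataFrame(
    {
        "sample-id": ["S1", "S2", "S3", "S4", "S5", "S6"],
        "Patient_ID": ["P1", "P1", "P2", "P2", "P3", "P3"],
        "Cohort_Number_Bin": [
            "Abduction", "Recovery",
            "Abduction", "Recovery",
            "Recovery", "Recovery",  # two samples, but only one timepoint
        ],
    }
)

per_patient = toy_meta.groupby("Patient_ID")["Cohort_Number_Bin"]
n_timepoints = per_patient.nunique()  # distinct timepoints per patient
n_samples = per_patient.size()        # number of samples per patient

# Keep patients with exactly two samples that cover two distinct timepoints.
paired_ids = n_timepoints[(n_timepoints == 2) & (n_samples == 2)].index
paired_meta = toy_meta[toy_meta["Patient_ID"].isin(paired_ids)]

print(list(paired_ids))
```

Here patient `P3` has two samples but both are from the recovery timepoint, so a count-based filter alone would keep it while this check drops it.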


In [14]:
# Generate tsv file from altered metadata
filtered_meta.to_csv(f"{data_dir}/timepoint_filtered_metadata_binned.tsv", sep='\t', index=False)

In [88]:
# Define the taxonomic levels to process
levels = ["l7", "l6", "l5", "l4", "l3", "l2"]


# Loop through the levels and run the commands
for level in levels:
    try:
        print(f"Running commands for level: {level}")
        
        # Keep only samples present in the filtered metadata so that every patient has a sample for both timepoints
        data_level = q2.Artifact.load(f"{data_dir}/table_abund_{level}.qza").view(pd.DataFrame)
        combined_level = filtered_meta.merge(data_level, left_on='sample-id', right_index=True, how='inner')
        # Drop the metadata columns again so that only sample IDs and feature counts remain
        combined_drop_level = combined_level.drop(['Patient_ID', 'Stool_Consistency', 'Patient_Sex', 'Sample_Day', 'Recovery_Day', 'Cohort_Number', 'Cohort_Number_Bin'], axis=1)
        combined_drop_level.set_index('sample-id', inplace=True)
        # Write one TSV per level and keep the sample-id index in the file
        combined_drop_level.to_csv(f"{data_dir}/table_abund_{level}_filtered.tsv", sep='\t', index=True)
        
        
        # Save feature table as qza artifact
        table_level = Artifact.import_data('FeatureTable[Frequency]',combined_drop_level)
        table_level.save(f"{data_dir}/table_abund_{level}_filtered.qza")
        
        # Run ANCOM-BC
        subprocess.run([
            "qiime", "composition", "ancombc",
            "--i-table", f"{data_dir}/table_abund_{level}_filtered.qza",
            "--m-metadata-file", f"{data_dir}/timepoint_filtered_metadata_binned.tsv",
            "--p-formula", "Cohort_Number_Bin",
            "--o-differentials", f"{data_dir}/ancombc_cohort_number_{level}_differentials.qza"
        ], check=True)
        
        # Generate a barplot
        subprocess.run([
            "qiime", "composition", "da-barplot",
            "--i-data", f"{data_dir}/ancombc_cohort_number_{level}_differentials.qza",
            "--o-visualization", f"{data_dir}/ancombc_cohort_number_{level}_barplot.qzv"
        ], check=True)
        
        # Generate a results table
        subprocess.run([
            "qiime", "composition", "tabulate",
            "--i-data", f"{data_dir}/ancombc_cohort_number_{level}_differentials.qza",
            "--o-visualization", f"{data_dir}/ancombc_cohort_number_{level}_results.qzv"
        ], check=True)
        
        print(f"Commands for level {level} completed successfully!")
    except subprocess.CalledProcessError as e:
        print(f"Error running commands for level {level}: {e}")


Running commands for level: l7
Saved FeatureData[DifferentialAbundance] to: ../data/processed/ancombc_cohort_number_l7_differentials.qza
Saved Visualization to: ../data/processed/ancombc_cohort_number_l7_barplot.qzv
Saved Visualization to: ../data/processed/ancombc_cohort_number_l7_results.qzv
Commands for level l7 completed successfully!
Running commands for level: l6
Saved FeatureData[DifferentialAbundance] to: ../data/processed/ancombc_cohort_number_l6_differentials.qza
Saved Visualization to: ../data/processed/ancombc_cohort_number_l6_barplot.qzv
Saved Visualization to: ../data/processed/ancombc_cohort_number_l6_results.qzv
Commands for level l6 completed successfully!
Running commands for level: l5
Saved FeatureData[DifferentialAbundance] to: ../data/processed/ancombc_cohort_number_l5_differentials.qza
Saved Visualization to: ../data/processed/ancombc_cohort_number_l5_barplot.qzv
Saved Visualization to: ../data/processed/ancombc_cohort_number_l5_results.qzv
Commands for level l5 c
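
The level codes in the loop above are terse. As a small sketch, the mapping below pairs each code with a rank name so that progress messages are easier to read. The mapping is an assumption based on the ranks listed in section 2 (phylum through species) and is not stated explicitly in this notebook; `rank_by_level` and `describe_level` are hypothetical names.

```python
# Assumed mapping from level code to taxonomic rank.
rank_by_level = {
    "l2": "phylum",
    "l3": "class",
    "l4": "order",
    "l5": "family",
    "l6": "genus",
    "l7": "species",
}


def describe_level(level):
    """Return a label such as 'l6 (genus)' for use in progress messages."""
    return f"{level} ({rank_by_level[level]})"


labels = [describe_level(level) for level in ["l7", "l6", "l5", "l4", "l3", "l2"]]
print(labels)
```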

In [32]:
# Filter feature table such that it only contains samples present in the above altered metadata file
data = q2.Artifact.load(f'{data_dir}/table_abund_l7.qza').view(pd.DataFrame)
combined = filtered_meta.merge(data, left_on='sample-id', right_index=True, how='inner')
combined_drop = combined.drop(['Patient_ID', 'Stool_Consistency', 'Patient_Sex', 'Sample_Day', 'Recovery_Day','Cohort_Number', 'Cohort_Number_Bin'], axis=1)
combined_drop.set_index('sample-id', inplace=True)
# Keep the sample-id index in the TSV so that rows stay identifiable
combined_drop.to_csv(f"{data_dir}/table_abund_l7_filtered.tsv", sep='\t', index=True)

In [37]:
# Save the data frame as qza artifact
table = Artifact.import_data('FeatureTable[Frequency]', combined_drop)
table.save(f"{data_dir}/table_abund_l7_filtered.qza")

'../data/processed/table_abund_l7_filtered.qza'

<a id='ancom'></a>

## 3. Analysis of composition of microbiomes

##### Run ANCOM-BC to test whether taxa are differentially abundant between the two cohorts (abduction and recovery)

In [38]:
# Run ANCOM-BC
! qiime composition ancombc \
    --i-table $data_dir/table_abund_l7_filtered.qza \
    --m-metadata-file $data_dir/timepoint_filtered_metadata_binned.tsv \
    --p-formula Cohort_Number_Bin \
    --o-differentials $data_dir/ancombc_cohort_number_differentials_l7.qza

Saved FeatureData[DifferentialAbundance] to: ../data/processed/ancombc_cohort_number_differentials_l7.qza

##### Generate a barplot and tabular results from the ANCOM-BC

In [39]:
# Generate a barplot of differentially abundant taxa between timepoints
! qiime composition da-barplot \
    --i-data $data_dir/ancombc_cohort_number_differentials_l7.qza \
    --o-visualization $data_dir/ancombc_cohort_number_da_barplot_l7.qzv

# Generate a table of these same values for all taxa
! qiime composition tabulate \
    --i-data $data_dir/ancombc_cohort_number_differentials_l7.qza \
    --o-visualization $data_dir/ancombc_cohort_number_results_l7.qzv

Saved Visualization to: ../data/processed/ancombc_cohort_number_da_barplot_l7.qzv
Saved Visualization to: ../data/processed/ancombc_cohort_number_results_l7.qzv

##### Inspect barplot and tabular results visually

In [41]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_da_barplot_l7.qzv")

In [42]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_results_l7.qzv")

##### Load ANCOM-BC results into a data frame for further analysis

In [66]:
from q2_composition import DataLoafPackageDirFmt

dirfmt_cohort = q2.Artifact.load(f'{data_dir}/ancombc_cohort_number_differentials_l7.qza')
# view it as that directory format
dirfmt_cohort = dirfmt_cohort.view(DataLoafPackageDirFmt)

# this directory format has a model attribute called `data_slices`
# each of which represents a CSV in the directory

slices = {}
for relpath, view in dirfmt_cohort.data_slices.iter_views(pd.DataFrame):
    slices[str(relpath)] = view

In [67]:
lfc_coh = list(slices.values())[0]
lfc_coh.set_index(lfc_coh.columns[0], inplace=True)
lfc_coh.columns = ['lfc_' + col for col in lfc_coh.columns]
p_val_coh = list(slices.values())[1]
p_val_coh.set_index(p_val_coh.columns[0], inplace=True)
p_val_coh.columns = ['p_val_' + col for col in p_val_coh.columns]
q_val_coh = list(slices.values())[2]
q_val_coh.set_index(q_val_coh.columns[0], inplace=True)
q_val_coh.columns = ['q_val_' + col for col in q_val_coh.columns]

df_coh = pd.concat([lfc_coh, p_val_coh, q_val_coh], axis=1, join='inner')
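
Selecting slices by position, as in the cell above, depends on the order in which the slices are returned. A sketch of a name-based alternative follows, using toy data frames in place of the real slices. The file names (`lfc_slice.csv` and so on), the values and the helper `get_slice` are assumptions for illustration; the actual keys of `slices` should be inspected before relying on them.

```python
import pandas as pd

# Toy stand-ins for the slices dictionary; file names and values are hypothetical.
toy_slices = {
    "q_val_slice.csv": pd.DataFrame(
        {"id": ["taxon_1", "taxon_2"], "Cohort_Number_BinRecovery": [0.01, 0.60]}
    ),
    "lfc_slice.csv": pd.DataFrame(
        {"id": ["taxon_1", "taxon_2"], "Cohort_Number_BinRecovery": [-2.5, 0.1]}
    ),
    "p_val_slice.csv": pd.DataFrame(
        {"id": ["taxon_1", "taxon_2"], "Cohort_Number_BinRecovery": [0.001, 0.50]}
    ),
}


def get_slice(slices, prefix):
    """Return the one slice whose name starts with `prefix`, indexed by its
    first column and with `prefix` prepended to the remaining column names."""
    matches = [name for name in slices if name.startswith(prefix)]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one slice for {prefix!r}, found {matches}")
    df = slices[matches[0]].copy()
    df = df.set_index(df.columns[0])
    df.columns = [f"{prefix}_{col}" for col in df.columns]
    return df


# Combine the three slices by name, independent of dictionary order.
toy_df = pd.concat(
    [get_slice(toy_slices, prefix) for prefix in ("lfc", "p_val", "q_val")],
    axis=1,
    join="inner",
)
print(toy_df.columns.tolist())
```

The helper raises an error when a prefix matches no slice or more than one, which makes a changed file layout visible instead of silently mislabeling columns.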

##### Extract features where the false discovery rate corrected p-value (q-value) is <= 0.05

In [68]:
df_coh.loc[df_coh.q_val_Cohort_Number_BinRecovery <= 0.05]

id,lfc_(Intercept),lfc_Cohort_Number_BinRecovery,p_val_(Intercept),p_val_Cohort_Number_BinRecovery,q_val_(Intercept),q_val_Cohort_Number_BinRecovery
d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__Clostridium_sensu_stricto_1;s__,4.369164,-2.763962,1.310634e-16,3.5e-05,8.91231e-15,0.002497
d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Blautia;s__,6.625425,-2.51611,1.8850709999999998e-38,8.1e-05,1.3384e-36,0.005644
d__Bacteria;p__Firmicutes;c__Bacilli;o__Erysipelotrichales;f__Erysipelotrichaceae;g__[Clostridium]_innocuum_group;s__,4.175642,-2.41461,2.137762e-15,0.000228,1.4323e-13,0.015709
d__Bacteria;p__Firmicutes;c__Negativicutes;o__Veillonellales-Selenomonadales;__;__;__,-0.654978,0.953809,0.0001944263,7.2e-05,0.009332462,0.005145
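
The full taxonomy strings are hard to read at a glance. The sketch below condenses the four significant rows from the table above into a short label and a direction of change. It assumes that Abduction is the reference level, as the column name `Cohort_Number_BinRecovery` suggests, so a negative log-fold change is read as lower abundance at recovery. The helper `lowest_assigned_rank` and the frame names are hypothetical.

```python
import pandas as pd

# Log-fold changes and q-values copied from the table above.
sig = pd.DataFrame(
    {
        "lfc": [-2.763962, -2.51611, -2.41461, 0.953809],
        "q_val": [0.002497, 0.005644, 0.015709, 0.005145],
    },
    index=[
        "d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__Clostridium_sensu_stricto_1;s__",
        "d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Blautia;s__",
        "d__Bacteria;p__Firmicutes;c__Bacilli;o__Erysipelotrichales;f__Erysipelotrichaceae;g__[Clostridium]_innocuum_group;s__",
        "d__Bacteria;p__Firmicutes;c__Negativicutes;o__Veillonellales-Selenomonadales;__;__;__",
    ],
)


def lowest_assigned_rank(taxon):
    """Return the deepest rank label that carries a name, e.g. 'g__Blautia'."""
    for label in reversed(taxon.split(";")):
        # Skip placeholders such as 's__' or '__' that have no name after the prefix.
        if label.split("__", 1)[-1]:
            return label
    return taxon


summary = pd.DataFrame(
    {
        "taxon": [lowest_assigned_rank(t) for t in sig.index],
        # Assumed interpretation: the sign is relative to the Abduction reference level.
        "direction": [
            "higher at Recovery" if lfc > 0 else "lower at Recovery"
            for lfc in sig["lfc"]
        ],
        "q_val": sig["q_val"].to_numpy(),
    }
)
print(summary.to_string(index=False))
```

Under that assumption, three taxa within Firmicutes are lower at recovery and the Veillonellales-Selenomonadales order is higher.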
