# S3. Change of microbial communities between different timepoints 

Author: Marc Kesselring


In this Jupyter Notebook the change of microbial communities between different timepoints is analyzed.

**Exercise overview:**<br>
[1. Setup](#setup)<br>
[2. Filter data](#filter)<br>
[3. Analysis of composition of microbiomes](#ancom)<br>
[4. Mapping significant features to taxa](#taxon)<br>
[5. Feature abundance for both cohorts](#cohort)<br>

<a id='setup'></a>

## 1. Setup

In [15]:
import os
import matplotlib.pyplot as plt
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization
import seaborn as sns
from scipy.stats import shapiro, kruskal, f_oneway
import subprocess

%matplotlib inline

In [3]:
raw_data_dir = "../data/raw"
data_dir = "../data/processed"
vis_dir  = "../results"

<a id='filter'></a>

## 2. Filter data

The data were already filtered in notebook 06_DifferentalAbundance.ipynb: features were retained only if they had a minimum frequency of 25 and were present in at least 5 samples. The features were then collapsed to the phylum, class, order, family, genus and species levels.
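The two filtering criteria can be illustrated on a small table. The following is a minimal pandas sketch, not the command used in the earlier notebook. The toy table and the helper name `filter_features` are hypothetical, and "minimum frequency" is taken here to mean the total count of a feature across all samples.

```python
import pandas as pd

def filter_features(table, min_frequency=25, min_samples=5):
    """Keep feature columns that pass both thresholds.

    `table` is assumed to have samples as rows and features as columns.
    """
    total_frequency = table.sum(axis=0)          # total count per feature
    n_samples_present = (table > 0).sum(axis=0)  # samples with a non-zero count
    keep = (total_frequency >= min_frequency) & (n_samples_present >= min_samples)
    return table.loc[:, keep]

# Hypothetical toy table: 6 samples x 3 features
toy = pd.DataFrame(
    {
        "feat_a": [10, 5, 4, 3, 2, 1],   # total 25, present in 6 samples: kept
        "feat_b": [100, 0, 0, 0, 0, 0],  # abundant but present in 1 sample: dropped
        "feat_c": [1, 1, 1, 1, 1, 0],    # present in 5 samples but total 5: dropped
    },
    index=[f"sample_{i}" for i in range(6)],
)
filtered = filter_features(toy)
print(list(filtered.columns))
```

Both conditions have to hold at once, so a feature that is very abundant in a single sample is removed just like a feature that is widespread but rare.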

<a id='ancom'></a>

## 3. Analysis of composition of microbiomes

##### Run ANCOM-BC to investigate whether taxa are differentially abundant between the two cohorts

In [11]:
# Run ANCOM-BC
! qiime composition ancombc \
    --i-table $data_dir/table_abund.qza \
    --m-metadata-file $data_dir/metadata_binned.tsv \
    --p-formula Cohort_Number_Bin \
    --o-differentials $data_dir/ancombc_cohort_number_differentials.qza

Saved FeatureData[DifferentialAbundance] to: ../data/processed/ancombc_cohort_number_differentials.qza

##### Generate a barplot and tabular results from the ANCOM-BC

In [12]:
# Generate a barplot of differentially abundant taxa between cohorts
! qiime composition da-barplot \
    --i-data $data_dir/ancombc_cohort_number_differentials.qza \
    --o-visualization $data_dir/ancombc_cohort_number_da_barplot.qzv

# Generate a table of these same values for all taxa
! qiime composition tabulate \
    --i-data $data_dir/ancombc_cohort_number_differentials.qza \
    --o-visualization $data_dir/ancombc_cohort_number_results.qzv

Saved Visualization to: ../data/processed/ancombc_cohort_number_da_barplot.qzv
Saved Visualization to: ../data/processed/ancombc_cohort_number_results.qzv

##### Inspect barplot and tabular results visually

In [17]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_da_barplot.qzv")

In [18]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_results.qzv")

##### Load ANCOM-BC results into a data frame for further analysis

In [29]:
from q2_composition import DataLoafPackageDirFmt

dirfmt_cohort = q2.Artifact.load(f'{data_dir}/ancombc_cohort_number_differentials.qza')
# view it as that directory format
dirfmt_cohort = dirfmt_cohort.view(DataLoafPackageDirFmt)

# this directory format has a model attribute called `data_slices`
# each of which represents a CSV in the directory

slices = {}
for relpath, view in dirfmt_cohort.data_slices.iter_views(pd.DataFrame):
    slices[str(relpath)] = view

In [30]:
lfc_cohort = list(slices.values())[0]
lfc_cohort.set_index(lfc_cohort.columns[0], inplace=True)
lfc_cohort.columns = ['lfc_' + col for col in lfc_cohort.columns]
p_val_cohort = list(slices.values())[1]
p_val_cohort.set_index(p_val_cohort.columns[0], inplace=True)
p_val_cohort.columns = ['p_val_' + col for col in p_val_cohort.columns]
q_val_cohort = list(slices.values())[2]
q_val_cohort.set_index(q_val_cohort.columns[0], inplace=True)
q_val_cohort.columns = ['q_val_' + col for col in q_val_cohort.columns]

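# Combine the three slices into one frame; it is named df_cohort because the following cells refer to it by that name
df_cohort = pd.concat([lfc_cohort, p_val_cohort, q_val_cohort], axis=1, join='inner')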
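Selecting slices by position relies on the order in which the files are iterated. A minimal sketch of selecting them by name instead is shown here. The slice names (`lfc_slice.csv` and so on) and the helper `combine_slices` are assumptions made for illustration and should be checked against `slices.keys()`.

```python
import pandas as pd

def combine_slices(slices, prefixes=("lfc", "p_val", "q_val")):
    """Combine selected slices into one frame with prefixed column names."""
    frames = []
    for prefix in prefixes:
        # Pick the single slice whose file name starts with the prefix
        matches = [name for name in slices if name.startswith(prefix)]
        if len(matches) != 1:
            raise KeyError(f"expected one slice for {prefix!r}, found {matches}")
        frame = slices[matches[0]]
        frame = frame.set_index(frame.columns[0])  # first column holds the feature id
        frames.append(frame.add_prefix(f"{prefix}_"))
    return pd.concat(frames, axis=1, join="inner")

# Hypothetical stand-in for the `slices` dictionary, deliberately out of order
toy_slices = {
    "q_val_slice.csv": pd.DataFrame({"id": ["f1", "f2"], "CohortRecovery": [0.01, 0.9]}),
    "lfc_slice.csv": pd.DataFrame({"id": ["f1", "f2"], "CohortRecovery": [2.0, -0.1]}),
    "p_val_slice.csv": pd.DataFrame({"id": ["f1", "f2"], "CohortRecovery": [0.001, 0.5]}),
}
combined = combine_slices(toy_slices)
print(combined.columns.tolist())
```

Raising an error when a prefix matches zero or several files makes a changed output layout visible immediately instead of silently mislabelling columns.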

##### Extract features where the p-value corrected for multiple testing (q-value) is <= 0.05

In [36]:
df_cohort.loc[df_cohort.q_val_Cohort_Number_BinRecovery <= 0.05]

id,lfc_(Intercept),lfc_Cohort_Number_BinRecovery,p_val_(Intercept),p_val_Cohort_Number_BinRecovery,q_val_(Intercept),q_val_Cohort_Number_BinRecovery
c923fa454975cd4424cc1fa448968444,-0.392504,2.113322,0.04024519,4.927738e-06,1.0,0.000626
95126920a496aaf496fa0c4f89e16e5b,2.032863,2.352877,1.944249e-12,0.0001514977,2.313657e-10,0.018483
afc3e6543b9af325490ba4bed4a3f654,-0.759205,2.136853,1.298099e-05,5.624343e-06,0.001414928,0.000709
5a0f522431143dce1339d7359fc37599,3.986123,-2.138339,1.285529e-25,0.0001114365,1.645477e-23,0.013818
aeb03963939e00b75d7370f4be601417,3.301611,-2.586496,6.988208e-13,1.666767e-06,8.455731e-11,0.000215
833bf02443c2dece76422ef394ce48d0,3.347799,-2.098431,1.149156e-12,0.0001762682,1.378987e-10,0.021328
d383d75128d7423a9bbdb2076120e365,3.808993,-2.780357,2.937471e-20,7.970993e-08,3.730589e-18,1e-05
df009054f19d9aac55f8a5bc2eeaa409,2.869416,-1.911654,3.714499e-11,0.0001482533,4.345964e-09,0.018235
6a125442b3d882bd11b5cfe1866470fd,2.09629,-2.11633,3.326888e-07,1.82957e-06,3.726115e-05,0.000234
e3bff2e5d94dbb2b69f466ee85a1acf4,1.986358,-1.895627,1.723736e-05,7.652178e-05,0.001861635,0.009565
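The q-values in this table appear consistent with a Holm step-down adjustment over 130 tested features: for example, the smallest p-value gives 7.970993e-08 × 130 ≈ 1.0e-05 and the second smallest gives 1.666767e-06 × 129 ≈ 0.000215. The sketch below shows how such an adjustment can be computed. `holm_adjust` is a hypothetical helper written for illustration, not the code that ANCOM-BC runs.

```python
import numpy as np

def holm_adjust(p_values):
    """Holm step-down adjustment of a 1-D collection of p-values."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest p-value (k = 0 .. m-1) is multiplied by (m - k)
    scaled = p[order] * (m - np.arange(m))
    # Adjusted values must not decrease with rank and are capped at 1
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted  # restore the input order
    return adjusted

example = holm_adjust([0.01, 0.04, 0.03])
print(example)
```

In the example the sorted p-values 0.01, 0.03 and 0.04 are multiplied by 3, 2 and 1, and the running maximum then lifts the last value from 0.04 to 0.06.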


<a id='taxon'></a>

## 4. Mapping significant features to taxa

In [55]:
pd.set_option('max_colwidth', 200)

In [51]:
# note: QIIME 2 artifact files can be loaded as python objects! This is how.
taxa = q2.Artifact.load(f'{data_dir}/taxonomy.qza')
# view as a `pandas.DataFrame`. Note: Only some Artifact types can be transformed to DataFrames
taxa = taxa.view(pd.DataFrame)

In [52]:
# Feature IDs with q-value <= 0.05 in the ANCOM-BC results
significant_ids = [
    'c923fa454975cd4424cc1fa448968444',
    '95126920a496aaf496fa0c4f89e16e5b',
    'afc3e6543b9af325490ba4bed4a3f654',
    '5a0f522431143dce1339d7359fc37599',
    'aeb03963939e00b75d7370f4be601417',
    '833bf02443c2dece76422ef394ce48d0',
    'd383d75128d7423a9bbdb2076120e365',
    'df009054f19d9aac55f8a5bc2eeaa409',
    '6a125442b3d882bd11b5cfe1866470fd',
    'e3bff2e5d94dbb2b69f466ee85a1acf4',
]
ancom = taxa.loc[significant_ids]

In [57]:
ancom

Feature ID,Taxon,Confidence
c923fa454975cd4424cc1fa448968444,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__,0.9999988617600988
95126920a496aaf496fa0c4f89e16e5b,d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__,0.9999999899462466
afc3e6543b9af325490ba4bed4a3f654,d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__,0.9999999769429168
5a0f522431143dce1339d7359fc37599,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__,0.9999999976258208
aeb03963939e00b75d7370f4be601417,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__,0.9999999641579512
833bf02443c2dece76422ef394ce48d0,d__Bacteria;p__Firmicutes;c__Bacilli;o__Erysipelotrichales;f__Erysipelatoclostridiaceae;g__Erysipelatoclostridium;s__,0.9999997011969576
d383d75128d7423a9bbdb2076120e365,d__Bacteria;p__Firmicutes;c__Bacilli;o__Erysipelotrichales;f__Erysipelotrichaceae;g__[Clostridium]_innocuum_group;s__,0.9999975834612012
df009054f19d9aac55f8a5bc2eeaa409,d__Bacteria;p__Firmicutes;c__Clostridia;o__Peptostreptococcales-Tissierellales;f__Peptostreptococcaceae;g__Romboutsia;s__,0.992520181789368
6a125442b3d882bd11b5cfe1866470fd,d__Bacteria;p__Firmicutes;c__Clostridia;o__Peptostreptococcales-Tissierellales;f__Peptostreptococcaceae;g__Intestinibacter;s__,0.99396907266501
e3bff2e5d94dbb2b69f466ee85a1acf4,d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__Clostridium_sensu_stricto_1;s__,0.9993446613517992
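Rather than typing the feature IDs by hand, the significant rows can be selected from the ANCOM-BC table and joined to the taxonomy. This is a minimal sketch on hypothetical toy frames that mimic the columns of the ANCOM-BC results and of `taxa`; the helper `annotate_significant` is not part of the notebook.

```python
import pandas as pd

Q_COL = "q_val_Cohort_Number_BinRecovery"
LFC_COL = "lfc_Cohort_Number_BinRecovery"

def annotate_significant(results, taxonomy, alpha=0.05):
    """Return significant features with their genus and direction of change."""
    significant = results.loc[results[Q_COL] <= alpha, [LFC_COL, Q_COL]]
    annotated = significant.join(taxonomy["Taxon"], how="left")
    # The genus is the field that starts with 'g__' in the taxonomy string
    annotated["genus"] = annotated["Taxon"].str.extract(r"g__([^;]*)", expand=False)
    # A positive log-fold change means higher abundance in the Recovery bin
    annotated["direction"] = annotated[LFC_COL].apply(
        lambda lfc: "higher at recovery" if lfc > 0 else "lower at recovery"
    )
    return annotated

# Hypothetical toy inputs
toy_results = pd.DataFrame(
    {LFC_COL: [2.1, -2.5, 0.3], Q_COL: [0.001, 0.0002, 0.8]},
    index=["f1", "f2", "f3"],
)
toy_taxa = pd.DataFrame(
    {"Taxon": [
        "d__Bacteria;p__Firmicutes;g__Lactobacillus;s__",
        "d__Bacteria;p__Firmicutes;g__Streptococcus;s__",
        "d__Bacteria;p__Firmicutes;g__Romboutsia;s__",
    ]},
    index=["f1", "f2", "f3"],
)
annotated = annotate_significant(toy_results, toy_taxa)
print(annotated[["genus", "direction"]])
```

Deriving the list from the q-value column keeps the taxonomy lookup in step with the results if the threshold or the input table changes.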


<a id='cohort'></a>

## 5. Feature abundance for both cohorts

In [58]:
# Load features and metadata into dataframes, transpose features to align rows and columns with the format of data frame 'ancom'
features = q2.Artifact.load(f"{data_dir}/table-filtered.qza")
metadata = pd.read_csv(f"{raw_data_dir}/metadata.tsv", sep='\t')
features = features.view(pd.DataFrame).transpose()

In [59]:
# Inner join the feature table with the 10 identified differentially abundant features into one dataframe
df1 = pd.concat([ancom, features], axis=1, join='inner')

In [60]:
# Set Sample_Name as index and join differentially abundant features with metadata, then omit metadata other than Cohort_Number
df2 = df1.transpose()
metadata.index = metadata['Sample_Name']
df3 = pd.concat([df2, metadata], axis=1, join='outer')
df4 = df3.drop(columns = ['Sample_Name', 'Patient_ID', 'Stool_Consistency', 'Patient_Sex', 'Sample_Day', 'Recovery_Day'])
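The same join can be written more compactly when only one metadata column is needed. The sketch below uses hypothetical toy data with the column names from this notebook (`Sample_Name`, `Cohort_Number`). It assumes a feature table with samples as rows, that is, the orientation before the transpose.

```python
import pandas as pd

# Hypothetical toy inputs
counts = pd.DataFrame(
    {"f1": [0, 12, 300], "f2": [50, 0, 4], "f3": [1, 1, 1]},
    index=["s1", "s2", "s3"],
)
metadata = pd.DataFrame(
    {
        "Sample_Name": ["s1", "s2", "s3"],
        "Cohort_Number": [1, 1, 2],
        "Patient_Sex": ["F", "M", "F"],
    }
)
significant_ids = ["f1", "f2"]

cohort = metadata.set_index("Sample_Name")["Cohort_Number"]
# Inner join keeps only samples present in both the table and the metadata
abundance = counts[significant_ids].join(cohort, how="inner")

# Split by cohort; the count columns keep their numeric dtype throughout
by_cohort = {
    number: frame.drop(columns="Cohort_Number")
    for number, frame in abundance.groupby("Cohort_Number")
}
print(by_cohort[1])
```

Because no text rows are mixed into the frame, the columns stay numeric and no later conversion step is required in this variant.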

In [72]:
# Generate one data frame per cohort, then drop the Cohort_Number column
df_abduction = df4[df4['Cohort_Number'] == 1]
df_recovery = df4[df4['Cohort_Number'] == 2]
df_abduction = df_abduction.drop(columns = ['Cohort_Number'])
df_recovery = df_recovery.drop(columns = ['Cohort_Number'])

##### Convert columns to numeric to use the describe() function

In [73]:
# Function to convert columns to numeric
def convert_to_numeric(col):
    # Convert to numeric, coercing values that cannot be parsed to NaN
    return pd.to_numeric(col, errors='coerce')

# Apply the function to all columns in the DataFrame
df_abduction = df_abduction.apply(convert_to_numeric)

In [74]:
df_abduction.describe()

statistic,c923fa454975cd4424cc1fa448968444,95126920a496aaf496fa0c4f89e16e5b,afc3e6543b9af325490ba4bed4a3f654,5a0f522431143dce1339d7359fc37599,aeb03963939e00b75d7370f4be601417,833bf02443c2dece76422ef394ce48d0,d383d75128d7423a9bbdb2076120e365,df009054f19d9aac55f8a5bc2eeaa409,6a125442b3d882bd11b5cfe1866470fd,e3bff2e5d94dbb2b69f466ee85a1acf4
count,54.0,54.0,54.0,54.0,54.0,54.0,54.0,54.0,54.0,54.0
mean,7.5,453.333333,7.12963,1308.425926,1855.351852,2212.5,1882.537037,2279.611111,996.759259,2938.425926
std,36.523319,2756.837715,46.318848,2264.945654,3516.213809,5157.699227,4223.671709,5987.250559,3789.564081,10068.587755
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,3.75,0.0,38.75,0.0,0.0,10.25,0.0,0.0,0.0
50%,0.0,23.0,0.0,262.5,258.0,209.0,193.0,28.0,0.0,0.0
75%,0.0,66.75,0.0,1726.5,1499.25,1362.75,1554.25,2109.25,588.75,290.25
max,265.0,20278.0,338.0,10911.0,14867.0,28565.0,19940.0,34335.0,27511.0,56796.0
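The conversion is needed because the transposed frame briefly held text rows (such as `Taxon`), which leaves every column with `object` dtype, and `describe()` then reports only count, unique, top and freq. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical column that mixes text and counts, as after a transpose
mixed = pd.DataFrame({"f1": ["d__Bacteria", 0.0, 265.0, 12.0]})
counts_only = mixed.iloc[1:]  # drop the text row; the dtype stays object

print(counts_only["f1"].dtype)                 # object
print(counts_only.describe().index.tolist())   # count, unique, top, freq

# Converting each column restores a numeric dtype and the numeric summary
numeric = counts_only.apply(pd.to_numeric, errors="coerce")
print(numeric["f1"].dtype)                     # float64
print(numeric.describe().loc["mean", "f1"])
```

With `errors="coerce"` any value that cannot be parsed becomes NaN, which `describe()` then excludes from the count.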


In [75]:
# Convert all columns of the recovery cohort to numeric, reusing convert_to_numeric defined earlier
df_recovery = df_recovery.apply(convert_to_numeric)

In [76]:
df_recovery.describe()

statistic,c923fa454975cd4424cc1fa448968444,95126920a496aaf496fa0c4f89e16e5b,afc3e6543b9af325490ba4bed4a3f654,5a0f522431143dce1339d7359fc37599,aeb03963939e00b75d7370f4be601417,833bf02443c2dece76422ef394ce48d0,d383d75128d7423a9bbdb2076120e365,df009054f19d9aac55f8a5bc2eeaa409,6a125442b3d882bd11b5cfe1866470fd,e3bff2e5d94dbb2b69f466ee85a1acf4
count,48.0,48.0,48.0,48.0,48.0,48.0,48.0,48.0,48.0,48.0
mean,2955.020833,13608.020833,1414.041667,772.708333,158.708333,245.916667,248.0625,298.8125,64.020833,2.125
std,16633.506344,32960.337739,5145.334582,2944.364903,756.912737,921.962911,1294.267235,1982.02547,443.549344,8.900454
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,60.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,31.25,483.5,9.0,41.0,0.0,8.25,1.5,9.5,0.0,0.0
max,114479.0,134934.0,23673.0,14721.0,4989.0,4160.0,8826.0,13742.0,3073.0,57.0


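##### Subtract the feature abundances in cohort 1 (timepoint of abduction) from cohort 2 (timepoint of recovery)

Positive values indicate increased feature abundance at recovery, whilst negative values indicate decreased feature abundance at recovery. Note that `subtract` aligns rows by sample name. The two cohorts share no samples (54 + 48 = 102 rows in the result), so with `fill_value=0` each recovery sample keeps its abundance and each abduction sample contributes its negated abundance. The summary therefore describes this pooled set of values, not paired per-sample differences.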

In [77]:
df_diff_cohort = df_recovery.subtract(df_abduction, fill_value=0)
df_diff_cohort.describe()

statistic,c923fa454975cd4424cc1fa448968444,95126920a496aaf496fa0c4f89e16e5b,afc3e6543b9af325490ba4bed4a3f654,5a0f522431143dce1339d7359fc37599,aeb03963939e00b75d7370f4be601417,833bf02443c2dece76422ef394ce48d0,d383d75128d7423a9bbdb2076120e365,df009054f19d9aac55f8a5bc2eeaa409,6a125442b3d882bd11b5cfe1866470fd,e3bff2e5d94dbb2b69f466ee85a1acf4
count,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0
mean,1386.627451,6163.77451,661.656863,-329.068627,-907.558824,-1055.598039,-879.901961,-1066.235294,-497.568627,-1554.637255
std,11443.686431,23649.107033,3581.772545,2795.694452,2788.391085,3984.416829,3359.005763,4723.524626,2812.56635,7441.313314
min,-265.0,-20278.0,-338.0,-10911.0,-14867.0,-28565.0,-19940.0,-34335.0,-27511.0,-56796.0
25%,0.0,-34.25,0.0,-325.75,-298.25,-294.25,-241.5,-37.75,-15.0,-1.5
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,48.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,114479.0,134934.0,23673.0,14721.0,4989.0,4160.0,8826.0,13742.0,3073.0,57.0
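Because the two cohorts contain different samples, a per-cohort summary is an alternative way to compare them. The sketch below, on hypothetical toy frames standing in for `df_abduction` and `df_recovery`, computes the difference in mean abundance per feature (recovery minus abduction). It is an illustration, not a step of the original analysis.

```python
import pandas as pd

# Hypothetical toy cohorts with disjoint sample names
toy_abduction = pd.DataFrame(
    {"f1": [0.0, 10.0, 20.0], "f2": [300.0, 100.0, 200.0]},
    index=["s1", "s2", "s3"],
)
toy_recovery = pd.DataFrame(
    {"f1": [100.0, 300.0], "f2": [0.0, 20.0]},
    index=["s4", "s5"],
)

# Means are taken within each cohort first, so unequal cohort sizes do not matter
mean_diff = toy_recovery.mean() - toy_abduction.mean()
print(mean_diff)

# For comparison: subtracting the frames directly aligns rows by sample name,
# so every sample appears once, with abduction values negated
pooled = toy_recovery.subtract(toy_abduction, fill_value=0)
print(pooled.shape)
```

In the toy data `f1` rises by 190 on average and `f2` falls by 190, while the directly subtracted frame simply has one row per sample from either cohort.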
