# S3. Change of microbial communities between different timepoints 

Author: Marc Kesselring


This Jupyter notebook analyzes how the microbial communities change between timepoints.

**Exercise overview:**<br>
[1. Setup](#setup)<br>
[2. Filter data and run ANCOM-BC](#filter)<br>
[3. Statistical Evaluation](#ancom)<br>
[4. Visualization](#visual)<br>

<a id='setup'></a>

## 1. Setup

In [1]:
import os
import matplotlib.pyplot as plt
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization
import seaborn as sns
from scipy.stats import shapiro, kruskal, f_oneway
import subprocess
from qiime2 import Artifact

%matplotlib inline

In [2]:
raw_data_dir = "../data/raw"
data_dir = "../data/processed"
vis_dir  = "../results"

<a id='filter'></a>

## 2. Filter data and run ANCOM-BC

The data was already filtered in notebook 06_DifferentalAbundance.ipynb: features were retained only if they had a minimum frequency of 25 and were present in at least 5 samples. The features were then collapsed to the phylum, class, order, family, genus and species levels. Additionally, the metadata was filtered to contain only samples from patients that have a measurement for both timepoints (abduction and recovery).
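The two thresholds described above can be illustrated on a toy table. This is a minimal pandas sketch, not the command used in the earlier notebook: the function name `filter_features` and the toy values are hypothetical, and "minimum frequency" is assumed to mean the total count of a feature summed over all samples.

```python
import pandas as pd


def filter_features(table: pd.DataFrame, min_frequency: int = 25, min_samples: int = 5) -> pd.DataFrame:
    """Keep feature columns that pass both thresholds (samples are rows)."""
    total_frequency = table.sum(axis=0)      # total count of each feature over all samples
    prevalence = (table > 0).sum(axis=0)     # number of samples in which each feature occurs
    keep = (total_frequency >= min_frequency) & (prevalence >= min_samples)
    return table.loc[:, keep]


# Toy table: 6 samples x 3 features (hypothetical values)
toy = pd.DataFrame(
    {
        "feat_a": [10, 5, 4, 3, 2, 1],  # total 25, present in 6 samples
        "feat_b": [30, 0, 0, 0, 0, 0],  # total 30, present in only 1 sample
        "feat_c": [1, 1, 1, 1, 1, 0],   # present in 5 samples, total only 5
    },
    index=[f"S{i}" for i in range(1, 7)],
)

filtered = filter_features(toy)
print(list(filtered.columns))
```

Only `feat_a` passes both criteria; the other two features each fail one of them.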

In [12]:
# Load metadata as dataframe
meta = pd.read_csv(f"{data_dir}/metadata_binned.tsv", sep="\t")

# Identify the Patient_IDs with a count of 2
true_patient_ids = meta.Patient_ID.value_counts()[meta.Patient_ID.value_counts() == 2].index

# Filter the meta table to include only rows with these Patient_IDs
filtered_meta = meta[meta.Patient_ID.isin(true_patient_ids)]

# Display the filtered meta table
filtered_meta

Unnamed: 0,sample-id,Patient_ID,Stool_Consistency,Patient_Sex,Sample_Day,Recovery_Day,Cohort_Number,Cohort_Number_Bin
0,EG2580,P042,liquid,F,13,17.0,2,Recovery
1,EG2559,P043,liquid,M,15,17.0,2,Recovery
2,EG2537,P042,liquid,F,0,17.0,1,Abduction
3,EG2518,P043,liquid,M,0,17.0,1,Abduction
5,EG2473,P055,semi-formed,M,20,22.0,2,Recovery
...,...,...,...,...,...,...,...,...
96,EG2638,P017,semi-formed,M,12,17.0,2,Recovery
97,EG2608,P034,formed,F,0,18.0,1,Abduction
98,EG2591,P017,liquid,M,0,17.0,1,Abduction
99,EG0141,P032,liquid,F,0,21.0,1,Abduction
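Counting two rows per patient selects paired patients only if no patient was sampled twice at the same timepoint. A stricter check, sketched here on hypothetical toy metadata, counts the distinct values of `Cohort_Number_Bin` per patient instead of the number of rows.

```python
import pandas as pd

# Toy metadata (hypothetical): P2 has two samples, but both from the same timepoint
toy_meta = pd.DataFrame(
    {
        "sample-id": ["S1", "S2", "S3", "S4", "S5"],
        "Patient_ID": ["P1", "P1", "P2", "P2", "P3"],
        "Cohort_Number_Bin": ["Abduction", "Recovery", "Abduction", "Abduction", "Recovery"],
    }
)

# Row-count criterion, as used in the cell above
counts = toy_meta["Patient_ID"].value_counts()
count_based_ids = set(counts[counts == 2].index)

# Stricter criterion: the patient must have two distinct timepoints
n_timepoints = toy_meta.groupby("Patient_ID")["Cohort_Number_Bin"].nunique()
paired_ids = n_timepoints[n_timepoints == 2].index
paired_meta = toy_meta[toy_meta["Patient_ID"].isin(paired_ids)]

print(sorted(count_based_ids), sorted(paired_meta["Patient_ID"].unique()))
```

In the toy data the row-count criterion also keeps `P2`, while the distinct-timepoint criterion keeps only `P1`. Whether this difference matters for the real metadata depends on whether any patient has replicate samples at one timepoint.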


In [14]:
# Generate tsv file from altered metadata
filtered_meta.to_csv(f"{data_dir}/timepoint_filtered_metadata_binned.tsv", sep='\t', index=False)

##### In order to use ANCOM-BC, the feature table and the metadata file have to contain exactly the same sample IDs. This for-loop runs through all levels of the collapsed-taxa feature tables, matches their sample IDs to the filtered metadata file and then runs ANCOM-BC on them. Additionally, barplot and results visualizations are generated from the ANCOM-BC output.

In [88]:
# Define the data directory and levels
levels = ["l7", "l6", "l5", "l4", "l3", "l2"]


# Loop through the levels and run the commands
for level in levels:
    try:
        print(f"Running commands for level: {level}")
        
        # Filter the feature table to only contain samples present in the filtered metadata,
        # so that every patient has a sample for both timepoints
        data_level = q2.Artifact.load(f"{data_dir}/table_abund_{level}.qza").view(pd.DataFrame)
        combined_level = filtered_meta.merge(data_level, left_on='sample-id', right_index=True, how='inner')
        # Drop the metadata columns so that only sample-id and the feature counts remain
        combined_drop_level = combined_level.drop(['Patient_ID', 'Stool_Consistency', 'Patient_Sex', 'Sample_Day', 'Recovery_Day', 'Cohort_Number', 'Cohort_Number_Bin'], axis=1)
        combined_drop_level.set_index('sample-id', inplace=True)
        # Write one TSV per level and keep the sample-id index in the file
        combined_drop_level.to_csv(f"{data_dir}/table_abund_{level}_filtered.tsv", sep='\t', index=True)
        
        
        # Save feature table as qza artifact
        table_level = Artifact.import_data('FeatureTable[Frequency]',combined_drop_level)
        table_level.save(f"{data_dir}/table_abund_{level}_filtered.qza")
        
        # Run ANCOM-BC
        subprocess.run([
            "qiime", "composition", "ancombc",
            "--i-table", f"{data_dir}/table_abund_{level}_filtered.qza",
            "--m-metadata-file", f"{data_dir}/timepoint_filtered_metadata_binned.tsv",
            "--p-formula", "Cohort_Number_Bin",
            "--o-differentials", f"{data_dir}/ancombc_cohort_number_{level}_differentials.qza"
        ], check=True)
        
        # Generate a barplot
        subprocess.run([
            "qiime", "composition", "da-barplot",
            "--i-data", f"{data_dir}/ancombc_cohort_number_{level}_differentials.qza",
            "--o-visualization", f"{data_dir}/ancombc_cohort_number_{level}_barplot.qzv"
        ], check=True)
        
        # Generate a results table
        subprocess.run([
            "qiime", "composition", "tabulate",
            "--i-data", f"{data_dir}/ancombc_cohort_number_{level}_differentials.qza",
            "--o-visualization", f"{data_dir}/ancombc_cohort_number_{level}_results.qzv"
        ], check=True)
        
        print(f"Commands for level {level} completed successfully!")
    except subprocess.CalledProcessError as e:
        print(f"Error running commands for level {level}: {e}")


Running commands for level: l7
Saved FeatureData[DifferentialAbundance] to: ../data/processed/ancombc_cohort_number_l7_differentials.qza
Saved Visualization to: ../data/processed/ancombc_cohort_number_l7_barplot.qzv
Saved Visualization to: ../data/processed/ancombc_cohort_number_l7_results.qzv
Commands for level l7 completed successfully!
Running commands for level: l6
Saved FeatureData[DifferentialAbundance] to: ../data/processed/ancombc_cohort_number_l6_differentials.qza
Saved Visualization to: ../data/processed/ancombc_cohort_number_l6_barplot.qzv
Saved Visualization to: ../data/processed/ancombc_cohort_number_l6_results.qzv
Commands for level l6 completed successfully!
Running commands for level: l5
Saved FeatureData[DifferentialAbundance] to: ../data/processed/ancombc_cohort_number_l5_differentials.qza
Saved Visualization to: ../data/processed/ancombc_cohort_number_l5_barplot.qzv
Saved Visualization to: ../data/processed/ancombc_cohort_number_l5_results.qzv
Commands for level l5 c
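The requirement stated before the loop, that the feature table and the metadata share the same sample IDs, can be checked explicitly before ANCOM-BC is called. This is a minimal sketch on toy data; the helper `shared_samples` is hypothetical and not part of QIIME 2.

```python
import pandas as pd


def shared_samples(table: pd.DataFrame, meta: pd.DataFrame, id_column: str = "sample-id"):
    """Restrict a samples-by-features table and its metadata to the sample IDs found in both."""
    shared = table.index.intersection(meta[id_column])
    return table.loc[shared], meta[meta[id_column].isin(shared)]


# Toy inputs (hypothetical): S9 is missing from the metadata, S3 is missing from the table
toy_table = pd.DataFrame({"taxon_x": [3.0, 0.0, 7.0]}, index=["S1", "S2", "S9"])
toy_meta = pd.DataFrame(
    {
        "sample-id": ["S1", "S2", "S3"],
        "Cohort_Number_Bin": ["Abduction", "Recovery", "Recovery"],
    }
)

table_matched, meta_matched = shared_samples(toy_table, toy_meta)
same_ids = set(table_matched.index) == set(meta_matched["sample-id"])
print(sorted(table_matched.index), same_ids)
```

Applying the same restriction to both inputs guarantees that they agree, whereas the loop above filters only the feature table.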

<a id='ancom'></a>

## 3. Statistical Evaluation

##### Filtering ANCOM-BC differentials artifact for q-values <= 0.05 for all levels of collapsed taxa using a for-loop

In [110]:
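# DataLoafPackageDirFmt is needed below to view the ANCOM-BC differentials;
# it is provided by the q2-composition plugin
from q2_composition import DataLoafPackageDirFmt

# Define the taxonomy levels
taxonomy_levels = {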
    "l7": "species",
    "l6": "genus",
    "l5": "family",
    "l4": "order",
    "l3": "class",
    "l2": "phylum"
}

# Initialize a dictionary to store results
results = {}

# Iterate through each taxonomy level
for level, name in taxonomy_levels.items():
    print(f"Processing taxonomy level: {name} ({level})")
    
    # Load the ANCOM-BC results as a directory format
    artifact_path = f'{data_dir}/ancombc_cohort_number_{level}_differentials.qza'
    dirfmt = q2.Artifact.load(artifact_path).view(DataLoafPackageDirFmt)

    # Extract data slices
    slices = {str(relpath): view for relpath, view in dirfmt.data_slices.iter_views(pd.DataFrame)}

    # Prepare the dataframes
    lfc = slices[list(slices.keys())[0]]
    lfc.set_index(lfc.columns[0], inplace=True)
    lfc.columns = ['lfc_' + col for col in lfc.columns]

    p_val = slices[list(slices.keys())[1]]
    p_val.set_index(p_val.columns[0], inplace=True)
    p_val.columns = ['p_val_' + col for col in p_val.columns]

    q_val = slices[list(slices.keys())[2]]
    q_val.set_index(q_val.columns[0], inplace=True)
    q_val.columns = ['q_val_' + col for col in q_val.columns]

    # Combine the dataframes
    df = pd.concat([lfc, p_val, q_val], axis=1, join='inner')

    # Select and count the features that differ significantly between cohorts (q <= 0.05)
    cohort_significant = df['q_val_Cohort_Number_BinRecovery'].loc[df['q_val_Cohort_Number_BinRecovery'] <= 0.05]
    cohort_significant_num = len(cohort_significant)

    # Store the results
    results[level] = {
        "taxonomy_level": name,
        "cohort_significant_num": cohort_significant_num,
        "cohort_significant_tax": cohort_significant
    }

    print(f"Finished processing {name}. Cohort_significant_num: {cohort_significant_num}, Cohort_significant_taxa: {cohort_significant}")

# Convert results to a DataFrame for better visualization
#results_df = pd.DataFrame.from_dict(results, orient='index')

# Display the results
#print(results_df)

Processing taxonomy level: species (l7)
Finished processing species. Cohort_significant_num: 4, Cohort_significant_taxa: id
d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__Clostridium_sensu_stricto_1;s__            0.002497
d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Blautia;s__                              0.005644
d__Bacteria;p__Firmicutes;c__Bacilli;o__Erysipelotrichales;f__Erysipelotrichaceae;g__[Clostridium]_innocuum_group;s__    0.015709
d__Bacteria;p__Firmicutes;c__Negativicutes;o__Veillonellales-Selenomonadales;__;__;__                                    0.005145
Name: q_val_Cohort_Number_BinRecovery, dtype: float64
Processing taxonomy level: genus (l6)
Finished processing genus. Cohort_significant_num: 4, Cohort_significant_taxa: id
d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__Clostridium_sensu_stricto_1            0.002497
d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirale
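The q-value filter applied in the loop can be reproduced on a small table. The sketch below uses the same column naming as the combined dataframe above, but the taxa and all numbers are hypothetical. It also keeps the log-fold change next to the q-value so that the sign of the change is visible.

```python
import pandas as pd

q_col = "q_val_Cohort_Number_BinRecovery"
lfc_col = "lfc_Cohort_Number_BinRecovery"

# Toy ANCOM-BC style results (hypothetical values)
toy_results = pd.DataFrame(
    {
        lfc_col: [-1.8, 0.2, 2.1],
        q_col: [0.002, 0.40, 0.05],
    },
    index=["taxon_a", "taxon_b", "taxon_c"],
)

# Keep rows with q <= 0.05 (the threshold is inclusive, as in the loop above)
significant = toy_results[toy_results[q_col] <= 0.05].copy()
significant["lfc_sign"] = significant[lfc_col].apply(lambda v: "positive" if v > 0 else "negative")

print(significant[[lfc_col, q_col, "lfc_sign"]])
```

Because the comparison is inclusive, `taxon_c` with q = 0.05 is retained alongside `taxon_a`.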

<a id='visual'></a>

## 4. Visualization

### Visualizations of barplots and results tables from ANCOM-BC for all levels (phylum, class, order, family, genus, species)

#### Phylum

In [3]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l2_barplot.qzv")

In [4]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l2_results.qzv")

#### Class

In [5]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l3_barplot.qzv")

In [6]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l3_results.qzv")

#### Order

In [7]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l4_barplot.qzv")

In [8]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l4_results.qzv")

#### Family

In [9]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l5_barplot.qzv")

In [10]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l5_results.qzv")

#### Genus

In [11]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l6_barplot.qzv")

In [12]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l6_results.qzv")

#### Species

In [13]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l7_barplot.qzv")

In [14]:
Visualization.load(f"{data_dir}/ancombc_cohort_number_l7_results.qzv")

### Analysis with Longitudinal

In [None]:
#filter metadata for patients present in both cohorts

# Load metadata
metadata = pd.read_csv(f"{data_dir}/metadata.tsv", sep='\t')

# Ensure `Cohort_Number` contains both 1 (pre) and 2 (post) for comparison
valid_patients = metadata.groupby('Patient_ID')['Cohort_Number'].apply(lambda x: set(x)).reset_index()
valid_patients = valid_patients[valid_patients['Cohort_Number'].apply(lambda x: {1, 2}.issubset(x))]

# Filter metadata for these patients
metadata_pre_post = metadata[metadata['Patient_ID'].isin(valid_patients['Patient_ID'])]

# Row declaring the QIIME 2 column types; it is added directly below the header
new_row = {
    'sample-id': '#q2:types',  # Marks this row as the column-type directive
    'Patient_ID': 'categorical',
    'Stool_Consistency': 'categorical',
    'Patient_Sex': 'categorical',
    'Sample_Day': 'numeric',
    'Recovery_Day': 'numeric',
    'Cohort_Number': 'numeric'  # Treated as numeric so that it can serve as the state column
}

# Convert the new row to a DataFrame
new_row_df = pd.DataFrame([new_row])

# Concatenate the new row with the original DataFrame
metadata_pre_post = pd.concat([new_row_df, metadata_pre_post], ignore_index=True)

# Save the filtered metadata
metadata_pre_post.to_csv(f"{data_dir}/metadata_pre_post.tsv", sep='\t', index=False)

pd.read_csv(f"{data_dir}/metadata_pre_post.tsv", sep='\t')

Unnamed: 0,sample-id,Patient_ID,Stool_Consistency,Patient_Sex,Sample_Day,Recovery_Day,Cohort_Number
0,#q2:types,categorical,categorical,categorical,numeric,numeric,numeric
1,EG2580,P042,liquid,F,13,17.0,2
2,EG2559,P043,liquid,M,15,17.0,2
3,EG2537,P042,liquid,F,0,17.0,1
4,EG2518,P043,liquid,M,0,17.0,1
...,...,...,...,...,...,...,...
66,EG2638,P017,semi-formed,M,12,17.0,2
67,EG2608,P034,formed,F,0,18.0,1
68,EG2591,P017,liquid,M,0,17.0,1
69,EG0141,P032,liquid,F,0,21.0,1
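Once the `#q2:types` row is in the file, reading it back with pandas treats that row as data, which turns the numeric columns into text. The sketch below, on a hypothetical three-column file held in memory, shows one way to skip the directive row when the file is needed as a dataframe again.

```python
import io

import pandas as pd

# Minimal metadata file with a #q2:types row (hypothetical content)
tsv_text = (
    "sample-id\tPatient_ID\tCohort_Number\n"
    "#q2:types\tcategorical\tnumeric\n"
    "S1\tP1\t1\n"
    "S2\tP1\t2\n"
)

# Naive read: the directive row becomes the first data row
naive = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# skiprows=[1] skips the second line of the file, i.e. the row right below the header
clean = pd.read_csv(io.StringIO(tsv_text), sep="\t", skiprows=[1])

print(naive["Cohort_Number"].dtype, clean["Cohort_Number"].dtype)
```

With the directive row skipped, `Cohort_Number` is parsed as a number again and the dataframe contains only real samples.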


In [4]:
#Importing the metadata from the Shannon and Faith PD results
Shannon_categorical = pd.read_csv(f'{data_dir}/core-metrics-results-bt/shannon-group-significance_exported/metadata.tsv', sep='\t')
FaithPD_categorical = pd.read_csv(f'{data_dir}/core-metrics-results-bt/faith-pd-group-significance_exported/metadata.tsv', sep='\t')
metadata = pd.read_csv(f"{data_dir}/metadata.tsv", sep="\t")

#Merging both tables for easier handling and changing Shannon Entropy and Faith PD to numerical for plotting
categorical = pd.merge(Shannon_categorical, FaithPD_categorical, how='inner', on=['Patient_ID', 'id', 'Patient_Sex', 'Stool_Consistency'])
categorical = categorical.loc[categorical.index != 0]
categorical = categorical.sort_values(by="Patient_ID", ascending=True) 
categorical['shannon_entropy'] = pd.to_numeric(categorical['shannon_entropy'], errors='coerce')
categorical['faith_pd'] = pd.to_numeric(categorical['faith_pd'], errors='coerce')
categorical.shape
categorical.rename(columns={'id': 'sample-id'}, inplace=True)
metadata_alpha = pd.merge(metadata, categorical,  how='left', on=['Patient_ID', 'sample-id', 'Patient_Sex', 'Stool_Consistency'])
metadata_alpha.to_csv(f"{data_dir}/metadata_alpha.tsv", sep="\t", index=False)
metadata_alpha

metadata_pre_post_alpha = pd.merge(metadata_pre_post, categorical, how='left', on=['Patient_ID', 'sample-id', 'Patient_Sex', 'Stool_Consistency'])
metadata_pre_post_alpha.to_csv(f"{data_dir}/metadata_pre_post_alpha.tsv", sep="\t", index=False)
pd.read_csv(f"{data_dir}/metadata_pre_post_alpha.tsv", sep='\t')

Unnamed: 0,sample-id,Patient_ID,Stool_Consistency,Patient_Sex,Sample_Day,Recovery_Day,Cohort_Number,shannon_entropy,faith_pd
0,#q2:types,categorical,categorical,categorical,numeric,numeric,numeric,,
1,EG2580,P042,liquid,F,13,17.0,2,3.199968,10.672487
2,EG2559,P043,liquid,M,15,17.0,2,1.489482,7.371690
3,EG2537,P042,liquid,F,0,17.0,1,,
4,EG2518,P043,liquid,M,0,17.0,1,,
...,...,...,...,...,...,...,...,...,...
66,EG2638,P017,semi-formed,M,12,17.0,2,2.746580,14.813480
67,EG2608,P034,formed,F,0,18.0,1,3.506510,13.212383
68,EG2591,P017,liquid,M,0,17.0,1,4.920576,23.524030
69,EG0141,P032,liquid,F,0,21.0,1,2.967400,12.361355


In [5]:
#Creates a FeatureTable[RelativeFrequency] that is needed for the qiime longitudinal pairwise-differences command
! qiime feature-table relative-frequency \
  --i-table $data_dir/table-filtered.qza \
  --o-relative-frequency-table $data_dir/relative-frequency-table.qza

Saved FeatureTable[RelativeFrequency] to: ../data/processed/relative-frequency-table.qza
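As a reminder of what the relative-frequency conversion does, the sketch below divides each sample's counts by that sample's total on a toy table. The values are hypothetical, and samples are assumed to be rows, as in the dataframes viewed from QIIME 2 artifacts elsewhere in this notebook.

```python
import pandas as pd

# Toy count table: samples are rows, features are columns (hypothetical values)
counts = pd.DataFrame({"feat_a": [8, 0], "feat_b": [2, 5]}, index=["S1", "S2"])

# Divide every row by its own sum so that each sample adds up to 1
relative = counts.div(counts.sum(axis=1), axis=0)

print(relative)
```

Each row of `relative` sums to 1, so values are comparable between samples with different sequencing depths.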

### Testing alpha diversity differences between cohorts pairwise

In [6]:
! qiime longitudinal pairwise-differences \
  --i-table $data_dir/relative-frequency-table.qza \
  --m-metadata-file $data_dir/metadata_pre_post_alpha.tsv \
  --p-state-column Cohort_Number \
  --p-state-1 1 \
  --p-state-2 2 \
  --p-individual-id-column Patient_ID \
  --p-replicate-handling random \
  --p-metric shannon_entropy \
  --o-visualization $data_dir/pairwise_differences_pre_post_shannon.qzv
 

Saved Visualization to: ../data/processed/pairwise_differences_pre_post_shannon.qzv

In [7]:
Visualization.load(f"{data_dir}/pairwise_differences_pre_post_shannon.qzv")

In [None]:
! qiime longitudinal pairwise-differences \
  --m-metadata-file $data_dir/metadata_pre_post_alpha.tsv \
  --p-metric faith_pd \
  --p-state-column Cohort_Number \
  --p-state-1 1 \
  --p-state-2 2 \
  --p-individual-id-column Patient_ID \
  --p-replicate-handling random \
  --o-visualization $data_dir/pairwise_differences_pre_post_faith_pd.qzv

Saved Visualization to: ../data/processed/pairwise_differences_pre_post_faith_pd.qzv

In [5]:
Visualization.load(f"{data_dir}/pairwise_differences_pre_post_faith_pd.qzv")

These pairwise tests indicate that alpha diversity is, on average within patients, significantly lower in cohort 2 (recovery) than in cohort 1 (abduction).
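The idea behind a pairwise comparison can be sketched in pandas: compute one difference per patient (cohort 2 minus cohort 1) and test whether decreases dominate. The data below are hypothetical, and the exact sign test is a deliberately simple stand-in, not the test that the QIIME 2 visualization reports.

```python
from math import comb

import pandas as pd

# Toy paired data (hypothetical values): one sample per patient and cohort
toy = pd.DataFrame(
    {
        "Patient_ID": ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"],
        "Cohort_Number": [1, 2, 1, 2, 1, 2, 1, 2],
        "shannon_entropy": [4.0, 3.0, 3.5, 2.0, 5.0, 4.5, 3.0, 3.2],
    }
)

# One row per patient, one column per cohort
wide = toy.pivot(index="Patient_ID", columns="Cohort_Number", values="shannon_entropy")
diff = (wide[2] - wide[1]).dropna()  # negative value = lower diversity in cohort 2

n_decrease = int((diff < 0).sum())
n_nonzero = int((diff != 0).sum())


def sign_test_p(k: int, n: int) -> float:
    """Two-sided exact sign test for k decreases among n non-zero differences."""
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_p(n_decrease, n_nonzero)
print(n_decrease, n_nonzero, p_value)
```

With only four toy patients, three decreases out of four are not significant; the point of the sketch is the per-patient differencing, which is what distinguishes a paired test from a comparison of group means.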

### Testing changes of feature abundance on a single patient level

In [10]:
#Importing the metadata from the Shannon and Faith PD results
Shannon_categorical = pd.read_csv(f'{data_dir}/core-metrics-results-bt/shannon-group-significance_exported/metadata.tsv', sep='\t')
FaithPD_categorical = pd.read_csv(f'{data_dir}/core-metrics-results-bt/faith-pd-group-significance_exported/metadata.tsv', sep='\t')
metadata = pd.read_csv(f"{data_dir}/metadata.tsv", sep="\t")

#Merging both tables for easier handling and changing Shannon Entropy and Faith PD to numerical for plotting
categorical = pd.merge(Shannon_categorical, FaithPD_categorical, how='inner', on=['Patient_ID', 'id', 'Patient_Sex', 'Stool_Consistency'])
categorical = categorical.loc[categorical.index != 0]
categorical = categorical.sort_values(by="Patient_ID", ascending=True) 
categorical['shannon_entropy'] = pd.to_numeric(categorical['shannon_entropy'], errors='coerce')
categorical['faith_pd'] = pd.to_numeric(categorical['faith_pd'], errors='coerce')
categorical.shape
categorical.rename(columns={'id': 'sample-id'}, inplace=True)
metadata_alpha = pd.merge(metadata, categorical,  how='left', on=['Patient_ID', 'sample-id', 'Patient_Sex', 'Stool_Consistency'])
metadata_alpha.to_csv(f"{data_dir}/metadata_alpha.tsv", sep="\t", index=False)
metadata_alpha

Unnamed: 0,sample-id,Patient_ID,Stool_Consistency,Patient_Sex,Sample_Day,Recovery_Day,Cohort_Number,shannon_entropy,faith_pd
0,EG2580,P042,liquid,F,13,17.0,2,3.199968,10.672487
1,EG2559,P043,liquid,M,15,17.0,2,1.489482,7.371690
2,EG2537,P042,liquid,F,0,17.0,1,,
3,EG2518,P043,liquid,M,0,17.0,1,,
4,EG2490,P030,formed,F,0,,1,3.333076,10.057748
...,...,...,...,...,...,...,...,...,...
97,EG2608,P034,formed,F,0,18.0,1,3.506510,13.212383
98,EG2591,P017,liquid,M,0,17.0,1,4.920576,23.524030
99,EG0141,P032,liquid,F,0,21.0,1,2.967400,12.361355
100,EG0031,P021,formed,M,20,24.0,2,2.242202,6.405608


In [11]:
#filtering for features that are present in at least 10 samples
! qiime feature-table filter-features \
  --i-table $data_dir/table-filtered.qza \
  --p-min-samples 10 \
  --o-filtered-table $data_dir/table-filtered-min-abund.qza

Saved FeatureTable[Frequency] to: ../data/processed/table-filtered-min-abund.qza

In [None]:
#volatility
! qiime longitudinal feature-volatility \
  --i-table $data_dir/table-filtered-min-abund.qza  \
  --m-metadata-file $data_dir/metadata_alpha.tsv \
  --p-state-column Cohort_Number \
  --p-individual-id-column Patient_ID \
  --p-n-estimators 10 \
  --p-random-state 17 \
  --output-dir $data_dir/feat-volatility-min

Usage: [94mqiime longitudinal feature-volatility[0m [OPTIONS]

  Identify features that are predictive of a numeric metadata column,
  state_column (e.g., time), and plot their relative frequencies across states
  using interactive feature volatility plots. A supervised learning regressor
  is used to identify important features and assess their ability to predict
  sample states. state_column will typically be a measure of time, but any
  numeric metadata column can be used.

[1mInputs[0m:
  [94m[4m--i-table[0m ARTIFACT [32mFeatureTable[Frequency][0m
                          Feature table containing all features that should
                          be used for target prediction.            [35m[required][0m
[1mParameters[0m:
  [94m[4m--m-metadata-file[0m METADATA...
    (multiple arguments   Sample metadata file containing
     will be merged)      [4mindividual-id-column[0m.                     [35m[required][0m
  [94m[4m--p-state-column[0m TEXT   Metadata co

In [3]:
Visualization.load(f"{data_dir}/feat-volatility-min/volatility_plot.qzv")

In [14]:
! qiime tools export \
    --input-path $data_dir/feat-volatility-min/volatility_plot.qzv \
    --output-path $data_dir/feat-volatility-min/volatility_plot-exported

Exported ../data/processed/feat-volatility-min/volatility_plot.qzv as Visualization to directory ../data/processed/feat-volatility-min/volatility_plot-exported


In [4]:
volatility = pd.read_csv(f"{data_dir}/feat-volatility-min/volatility_plot-exported/data.tsv", sep='\t')

volatility.head()

Unnamed: 0,id,Patient_ID,Stool_Consistency,Patient_Sex,Cohort_Number,648070229fc4f45e01a9481f1beefe43,df009054f19d9aac55f8a5bc2eeaa409,d383d75128d7423a9bbdb2076120e365,aeb03963939e00b75d7370f4be601417,833bf02443c2dece76422ef394ce48d0,...,8cd50fa8bc80d1145a11f65333b2fe23,c0b3f40bd9e0962679113b410585c85e,90a6eda58afa2847935edae38233a43a,8d0f2844d6fcf4ffd3249e83ca344f01,ded7ceb2681f4b74942422edb1406cb0,9a3f5d12656e19e24dccd8ffcce90434,de5f6f7a885554dc30b281d16ef0efcd,b97625685f91c7fc630c3ff4b804cd0f,f659dd4ff0fac85ec37a1bedd2946721,b9d1c6dd86d6b91715f33f4ec265f144
0,#q2:types,categorical,categorical,categorical,numeric,numeric,numeric,numeric,numeric,numeric,...,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric
1,EG2580,P042,liquid,F,2,0.00107802182994206,0,0,0,1.22502480675234e-05,...,0,0.545644424299592,0,0,0,0.00309318763704965,0,0,0,0
2,EG2559,P043,liquid,M,2,0,0,0,0,0,...,0,0.846282888229476,0,0,0.000522255192878338,0.000181998021760633,0,0,0,0
3,EG2537,P042,liquid,F,1,0,0,0,0,0,...,0,0.541666666666667,0,0,0,0,0,0,0,0
4,EG2518,P043,liquid,M,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
change = pd.read_csv(f"{data_dir}/feat-volatility-min/volatility_plot-exported/feature_metadata.tsv", sep='\t')

# Convert 'importance' to numeric, setting non-numeric values to NaN
change['importance'] = pd.to_numeric(change['importance'], errors='coerce')

# Filter for numeric values > 0
change = change[change['importance'] > 0]

# Reduce the DataFrame to the desired columns
reduced_change = change[
    ["id", "importance", "Cumulative Avg Decrease", "Cumulative Avg Increase", "Net Avg Change"]]


# Save the reduced DataFrame to an Excel file
reduced_change.to_excel(f"{vis_dir}/volatility-reduced_change_table.xlsx", index=False)

change

Unnamed: 0,id,importance,Cumulative Avg Decrease,Cumulative Avg Increase,Net Avg Change,Global Variance,Global Mean,Global Median,Global Standard Deviation,Global CV (%)
1,648070229fc4f45e01a9481f1beefe43,0.152867,-0.0006031659266517,0.0,-0.0006031659266517,1.7351385962861e-05,0.0013259908632871,0.0,0.0041654994853992,3.14142397261535
2,df009054f19d9aac55f8a5bc2eeaa409,0.146382,-0.0139335086919329,0.0,-0.0139335086919329,0.0011467792330575,0.0113133597237111,0.0,0.0338641290019031,2.9932866830822
3,d383d75128d7423a9bbdb2076120e365,0.123987,-0.0139653589025482,0.0,-0.0139653589025482,0.0015053434598059,0.0118085156609011,8.09018800004156e-05,0.0387987559053892,3.28565901249171
4,aeb03963939e00b75d7370f4be601417,0.102578,-0.0121807607641158,0.0,-0.0121807607641158,0.0004136045508613,0.0083943390752024,0.0,0.0203372699952899,2.42273630039175
5,833bf02443c2dece76422ef394ce48d0,0.0781,-0.0204583099594443,0.0,-0.0204583099594443,0.0022105759935848,0.0132088048008495,3.93047619916105e-05,0.0470167628998942,3.55950168155037
6,045fd2f376df8ab160c365aa9811b1eb,0.047759,-0.0060518015339083,0.0,-0.0060518015339083,0.0009634504354194,0.0064617345482329,0.0,0.0310394979891661,4.80358605842979
7,439ee71de7eb52a826e95b385e1b1731,0.041407,-0.017745849514363,0.0,-0.017745849514363,0.0030376478967373,0.0094005487261017,0.0,0.0551148609427377,5.86294082915659
8,0e376c68726959309777cd950dc65fc0,0.02961,-0.0001751403884882,0.0,-0.0001751403884882,5.12338868435488e-07,0.0002836069544412,0.0,0.0007157785051505,2.523839750547
9,e96e7b1c7d4de490dbb32be165504c2e,0.025512,-0.0019282442587017,0.0,-0.0019282442587017,0.000181989134804,0.0037036334355149,0.0,0.013490334866268,3.64245952013129
10,e5d04ee31cd7dabfbaf2166dcf6ede2b,0.024412,-0.0129150519014728,0.0,-0.0129150519014728,0.0019156070675295,0.0069617997953578,0.0,0.0437676486406294,6.28682954511475


### Testing differences in relative abundance of features between cohorts pairwise

In [17]:
#Creates a FeatureTable[RelativeFrequency] that is needed for the qiime longitudinal pairwise-differences command
! qiime feature-table relative-frequency \
  --i-table $data_dir/table-filtered-min-abund.qza \
  --o-relative-frequency-table $data_dir/relative-frequency-table-min-abund.qza

Saved FeatureTable[RelativeFrequency] to: ../data/processed/relative-frequency-table-min-abund.qza

Extracts the feature IDs of both tables and compares the number of features before and after filtering

In [6]:
# Load the FeatureTable[RelativeFrequency]
feature_table = Artifact.load(f"{data_dir}/relative-frequency-table.qza")
# Extract the feature table as a Pandas DataFrame
table = feature_table.view(pd.DataFrame)
# Get the list of feature IDs
features = table.columns.tolist()

# Load the FeatureTable[RelativeFrequency]
feature_table_filtered = Artifact.load(f"{data_dir}/relative-frequency-table-min-abund.qza")
# Extract the feature table as a Pandas DataFrame
table_filtered = feature_table_filtered.view(pd.DataFrame)
# Get the list of feature IDs
features_filtered = table_filtered.columns.tolist()
len(features_filtered)

print("The feature table originally contained", len(features), "features and after filtering for features that are present in at least 10 samples", len(features_filtered),"features remain")

The feature table originally contained 1990 features and after filtering for features that are present in at least 10 samples 147 features remain


Pairwise analysis of whether the relative abundance of a feature changes between cohort 1 (abduction) and cohort 2 (recovery)

In [10]:
#can be used to check whether the difference in abundance of a specific feature is actually significant
! qiime longitudinal pairwise-differences \
  --i-table $data_dir/relative-frequency-table-min-abund.qza \
  --m-metadata-file $data_dir/metadata_pre_post_alpha.tsv \
  --p-metric d383d75128d7423a9bbdb2076120e365 \
  --p-state-column Cohort_Number \
  --p-state-1 1 \
  --p-state-2 2 \
  --p-individual-id-column Patient_ID \
  --p-replicate-handling random \
  --o-visualization $data_dir/pairwise_differences_pre_post_feature.qzv

Visualization.load(f"{data_dir}/pairwise_differences_pre_post_feature.qzv")

Saved Visualization to: ../data/processed/pairwise_differences_pre_post_feature.qzv

After testing a few randomly chosen features, it becomes clear that some, but not all, of the features identified by the volatility analysis show a significant decrease or increase between the cohorts when tested pairwise at the level of individual patients. Since the features have to be tested one at a time and I could not find a way to test them all at once, the analysis was stopped at this point; it is left in because it may still be of interest.
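One possible workaround, which was not run here, is to generate the single-feature command once per feature ID and execute the commands in a loop. The sketch below only builds the argument lists, reusing the flags from the cell above; the helper `build_pairwise_commands` and the per-feature output file names are hypothetical.

```python
def build_pairwise_commands(feature_ids, data_dir="../data/processed"):
    """Build one `qiime longitudinal pairwise-differences` argument list per feature ID."""
    commands = []
    for feature_id in feature_ids:
        commands.append([
            "qiime", "longitudinal", "pairwise-differences",
            "--i-table", f"{data_dir}/relative-frequency-table-min-abund.qza",
            "--m-metadata-file", f"{data_dir}/metadata_pre_post_alpha.tsv",
            "--p-metric", feature_id,
            "--p-state-column", "Cohort_Number",
            "--p-state-1", "1",
            "--p-state-2", "2",
            "--p-individual-id-column", "Patient_ID",
            "--p-replicate-handling", "random",
            # Hypothetical naming scheme: one visualization per feature
            "--o-visualization", f"{data_dir}/pairwise_differences_pre_post_{feature_id}.qzv",
        ])
    return commands


# Two feature IDs taken from the volatility table above
feature_ids = [
    "d383d75128d7423a9bbdb2076120e365",
    "df009054f19d9aac55f8a5bc2eeaa409",
]
commands = build_pairwise_commands(feature_ids)

# Each list could then be executed with subprocess.run(cmd, check=True)
for cmd in commands:
    print(" ".join(cmd))
```

Testing many features this way produces many p-values, so a multiple-testing correction would be needed before any single feature is interpreted as significant.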