#### S2. Predictive Biomarkers

Author: Willem Fuetterer


This Jupyter notebook analyzes characteristics of the microbial communities and searches for biomarkers that predict the speed of recovery.

**Exercise overview:**<br>
[1. Setup](#setup)<br>
[2. Identification of correct sampling depth](#depth)<br>
[3. Calculating the alpha diversity](#calc)<br>
[4. Testing the associations between categorical metadata columns and the diversity metric](#categorical)<br>
[5. Testing whether numeric sample metadata columns are correlated with microbial community richness](#numeric)<br>
[6. Loading the results into variables to plot figures](#figures)<br>

<a id='setup'></a>

## 1. Setup

In [2]:
# importing all required packages & notebook extensions at the start of the notebook
import os
import biom
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization
from scipy.stats import spearmanr
from scipy.stats import linregress

%matplotlib inline

In [3]:
# defining location of data
raw_data_dir = "../data/raw"
data_dir = "../data/processed"
vis_dir  = "../results"

<a id='depth'></a>

## 2. Statistical Analysis and Correlation with Recovery Speed

In [7]:
# Load metadata
metadata = pd.read_csv(f"{data_dir}/metadata.tsv", sep='\t')

# Define diversity metrics
metrics = ['shannon', 'evenness', 'faith_pd']  # Metrics to analyze
results = []  # To store results for each metric

# Function to perform Spearman correlation for one cohort
def calculate_spearman(diversity_metric_name, diversity_qza_file, metadata, cohort_number):
    # Load diversity vector (.qza file) with the Python API
    # (a leading "!" would run the line in the shell and return an SList)
    diversity_metric = q2.Artifact.load(diversity_qza_file)
    
    # Extract the diversity vector as a pandas Series
    diversity_series = diversity_metric.view(pd.Series)
    
    # Convert to a DataFrame and reset the index
    diversity_df = diversity_series.reset_index()
    diversity_df.columns = ['sample-id', diversity_metric_name]  # Rename columns for clarity
    
    # Merge diversity data with metadata
    merged = metadata.set_index('sample-id').join(diversity_df.set_index('sample-id'))
    
    # Keep only the samples of the requested cohort
    cohort_samples = merged[merged['Cohort_Number'] == cohort_number]
    
    # Drop rows with missing values in either variable
    cohort_samples = cohort_samples.dropna(subset=[diversity_metric_name, 'Recovery_Day'])
    
    # Number of samples entering the correlation
    n_samples = cohort_samples.shape[0]
    
    # Perform Spearman correlation between the diversity metric and Recovery_Day
    correlation, p_value = spearmanr(cohort_samples[diversity_metric_name], cohort_samples['Recovery_Day'])
    
    return {
        'metric': diversity_metric_name,
        'Spearman correlation': correlation,
        'p-value': p_value,
        'n': n_samples
    }

# Loop through each metric and calculate the results
for metric in metrics:
    # Define file paths based on the metric
    diversity_qza_file = f"{data_dir}/core-metrics-results-bt/{metric}_vector.qza"
    
    # Calculate Spearman correlation for the current metric
    # (Cohort_Number == 1 is the cohort the later cells call pretransplant)
    result = calculate_spearman(metric, diversity_qza_file, metadata, cohort_number=1)
    
    # Append the result to the results list
    results.append(result)

# Create a DataFrame to display results
results_df = pd.DataFrame(results)

# Save results to a CSV file
#results_df.to_csv(f"{data_dir}/alpha_diversity_correlation_results.csv", index=False) -> change to Excel later

# Print the results
print(results_df)
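The join, cohort filter, and rank correlation inside `calculate_spearman` can be tried without any QIIME 2 artifacts. The sketch below is only an illustration under stated assumptions: the toy tables, their values, and the helper name `spearman_for_cohort` are hypothetical, and Spearman's rho is computed as the Pearson correlation of average ranks so that only pandas is needed.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the metadata table and a diversity vector.
toy_metadata = pd.DataFrame({
    "sample-id": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "Cohort_Number": [1, 1, 1, 1, 2, 1],
    "Recovery_Day": [10.0, 20.0, 30.0, 40.0, 15.0, None],
})
toy_shannon = pd.Series(
    [4.0, 3.0, 2.0, 1.0, 2.5, 3.5],
    index=pd.Index(["S1", "S2", "S3", "S4", "S5", "S6"], name="sample-id"),
    name="shannon",
)

def spearman_for_cohort(metadata, diversity, cohort_number):
    """Join, filter to one cohort, drop incomplete rows, and rank-correlate."""
    merged = metadata.set_index("sample-id").join(diversity)
    subset = merged[merged["Cohort_Number"] == cohort_number]
    subset = subset.dropna(subset=[diversity.name, "Recovery_Day"])
    # Spearman's rho equals Pearson's r computed on average ranks.
    rho = subset[diversity.name].rank().corr(subset["Recovery_Day"].rank())
    return rho, len(subset)

rho, n = spearman_for_cohort(toy_metadata, toy_shannon, cohort_number=1)
print(f"rho={rho:.2f}, n={n}")
```

In this toy data the four complete cohort 1 samples are perfectly inversely ranked, so rho is -1 with n = 4; S5 is excluded by the cohort filter and S6 by the missing `Recovery_Day`.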

The microbiome composition of each patient can probably be extracted from the exported taxa bar plot visualization:

In [8]:
! qiime tools export \
    --input-path $data_dir/taxa-bar-plots-filtered.qzv \
    --output-path $data_dir/taxa-bar-plots-filtered-exported

Exported ../data/processed/taxa-bar-plots-filtered.qzv as Visualization to directory ../data/processed/taxa-bar-plots-filtered-exported


In [31]:
taxonomic_composition_l7 = pd.read_csv(f'{data_dir}/taxa-bar-plots-filtered-exported/level-7.csv')
metadata_faith_shannon = pd.read_csv(f'{data_dir}/metadata_faith_shannon.csv')
taxonomic_composition_l7.rename(columns={'index': 'id'}, inplace=True)
# Sort both DataFrames by 'id' in ascending order
taxonomic_composition_l7 = taxonomic_composition_l7.sort_values(by='id', ascending=True)
metadata_faith_shannon = metadata_faith_shannon.sort_values(by='id', ascending=True)
# Display the table
taxonomic_composition_l7


Unnamed: 0,id,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__;__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Roseburia;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Fournierella;s__,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lachnoclostridium;s__,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Morganellaceae;g__Morganella;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Blautia;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__Butyricicoccaceae;g__UCG-009;s__,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Rikenellaceae;g__Alistipes;s__,...,d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;__;__,d__Bacteria;p__Actinobacteriota;c__Actinobacteria;o__Corynebacteriales;f__Dietziaceae;g__Dietzia;s__,d__Bacteria;p__Actinobacteriota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriales_Incertae_Sedis;g__uncultured;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lactonifactor;s__,Patient_ID,Stool_Consistency,Patient_Sex,Sample_Day,Recovery_Day,Cohort_Number
0,EG0024,94.0,0.0,0.0,1967.0,23.0,0.0,54296.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P004,formed,F,0.0,34.0,1.0
1,EG0031,0.0,0.0,0.0,83.0,9.0,0.0,23.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P021,formed,M,20.0,24.0,2.0
2,EG0039,0.0,202.0,0.0,3994.0,553.0,0.0,16239.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P073,formed,M,0.0,,1.0
3,EG0055,0.0,27.0,0.0,3004.0,0.0,0.0,69.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P020,liquid,F,0.0,28.0,1.0
4,EG0057,0.0,0.0,0.0,45.0,0.0,0.0,128.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P004,formed,F,35.0,34.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,EG2580,7452.0,0.0,0.0,90426.0,14163.0,0.0,102.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P042,liquid,F,13.0,17.0,2.0
98,EG2591,48514.0,12862.0,11.0,17750.0,5903.0,0.0,8062.0,5.0,6697.0,...,0.0,0.0,0.0,73.0,P017,liquid,M,0.0,17.0,1.0
99,EG2608,89.0,0.0,0.0,3015.0,8.0,0.0,29357.0,0.0,10.0,...,0.0,0.0,0.0,0.0,P034,formed,F,0.0,18.0,1.0
100,EG2638,87.0,22.0,21099.0,2924.0,313.0,0.0,214.0,0.0,1522.0,...,0.0,0.0,0.0,0.0,P017,semi-formed,M,12.0,17.0,2.0


Reduce the data to samples taken before the abduction (`Cohort_Number == 1`):

In [32]:
# Filter the rows where Cohort_Number equals 1
tax_comp_pretransplant = taxonomic_composition_l7[taxonomic_composition_l7['Cohort_Number'] == 1]
tax_comp_pretransplant


Unnamed: 0,id,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__;__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Roseburia;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Fournierella;s__,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lachnoclostridium;s__,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Morganellaceae;g__Morganella;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Blautia;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__Butyricicoccaceae;g__UCG-009;s__,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Rikenellaceae;g__Alistipes;s__,...,d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;__;__,d__Bacteria;p__Actinobacteriota;c__Actinobacteria;o__Corynebacteriales;f__Dietziaceae;g__Dietzia;s__,d__Bacteria;p__Actinobacteriota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriales_Incertae_Sedis;g__uncultured;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lactonifactor;s__,Patient_ID,Stool_Consistency,Patient_Sex,Sample_Day,Recovery_Day,Cohort_Number
0,EG0024,94.0,0.0,0.0,1967.0,23.0,0.0,54296.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P004,formed,F,0.0,34.0,1.0
2,EG0039,0.0,202.0,0.0,3994.0,553.0,0.0,16239.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P073,formed,M,0.0,,1.0
3,EG0055,0.0,27.0,0.0,3004.0,0.0,0.0,69.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P020,liquid,F,0.0,28.0,1.0
5,EG0070,14937.0,4439.0,0.0,6514.0,479.0,0.0,53557.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P062,semi-formed,F,0.0,27.0,1.0
9,EG0136,79.0,1557.0,76.0,11029.0,543.0,0.0,47881.0,0.0,36.0,...,0.0,0.0,0.0,6.0,P027,formed,M,0.0,33.0,1.0
10,EG0141,0.0,0.0,0.0,4669.0,47196.0,0.0,1481.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P032,liquid,F,0.0,21.0,1.0
12,EG0194,340.0,0.0,10.0,12358.0,2962.0,0.0,22505.0,0.0,7.0,...,0.0,0.0,0.0,0.0,P029,liquid,M,0.0,20.0,1.0
14,EG0236,538.0,0.0,0.0,512.0,949.0,0.0,4184.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P033,semi-formed,M,0.0,47.0,1.0
16,EG0280,250.0,0.0,0.0,46670.0,1605.0,0.0,26711.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P051,formed,M,0.0,11.0,1.0
17,EG0282,189.0,0.0,0.0,10992.0,0.0,0.0,73281.0,0.0,0.0,...,0.0,0.0,0.0,0.0,P044,liquid,F,0.0,23.0,1.0


In [33]:
print(tax_comp_pretransplant.dtypes)

id                                                                                                    object
d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__;__                   float64
d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Roseburia;s__        float64
d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Fournierella;s__    float64
d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__     float64
                                                                                                      ...   
Stool_Consistency                                                                                     object
Patient_Sex                                                                                           object
Sample_Day                                                                                           float64
Recovery_Day       

In [34]:
# List of columns to exclude
exclude_columns = ['id', 'Patient_ID', 'Stool_Consistency', 'Patient_Sex', 
                   'Sample_Day', 'Recovery_Day', 'Cohort_Number']

# Drop rows with NaN in Recovery_Day
tax_comp_pretransplant_cleaned = tax_comp_pretransplant.dropna(subset=['Recovery_Day'])

# Select only relevant columns for correlation
columns_for_corr = [col for col in tax_comp_pretransplant_cleaned.columns if col not in exclude_columns]
df_for_corr = tax_comp_pretransplant_cleaned[columns_for_corr]

# Calculate Spearman correlation of Recovery_Day with other columns
correlations = df_for_corr.corrwith(tax_comp_pretransplant_cleaned['Recovery_Day'], method='spearman')

correlations



d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__;__                                             0.029704
d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Roseburia;s__                                  0.029931
d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Fournierella;s__                             -0.145632
d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__                              -0.203581
d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lachnoclostridium;s__                         -0.042213
                                                                                                                                 ...   
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Pseudoalteromonadaceae;g__Pseudoalteromonas;s__         NaN
d__Bacteria;p__Firmicutes;c__Clostridia;o__Clost
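The `NaN` entries in this output are consistent with taxa whose counts do not vary across the retained samples: with a single distinct value, all ranks are tied and the correlation is undefined. A minimal sketch of detecting such columns before correlating follows; the column names and counts are hypothetical, and Spearman's rho is again computed as Pearson's r on ranks.

```python
import pandas as pd

# Hypothetical counts: taxon_b is zero in every sample, so its ranks are all tied.
toy = pd.DataFrame({
    "taxon_a": [5.0, 0.0, 12.0, 7.0],
    "taxon_b": [0.0, 0.0, 0.0, 0.0],
    "Recovery_Day": [30.0, 10.0, 40.0, 20.0],
})

taxa = toy.drop(columns="Recovery_Day")

# A column with a single distinct value has zero variance, so no correlation is defined.
constant_taxa = [col for col in taxa.columns if taxa[col].nunique() <= 1]
informative = taxa.drop(columns=constant_taxa)

# Spearman as Pearson on ranks, applied to the informative columns only.
ranked_recovery = toy["Recovery_Day"].rank()
rho = informative.rank().corrwith(ranked_recovery)

print("constant:", constant_taxa)
print(rho)
```

Dropping the constant columns first keeps the result table free of undefined rows; keeping them and recording `NaN` is an equally valid choice when the full list of taxa should stay visible.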

In [37]:
import pandas as pd
from scipy.stats import spearmanr

# tax_comp_pretransplant holds the Cohort_Number == 1 samples selected above
# Step 1: Remove rows with NaN in 'Recovery_Day'
tax_comp_pretransplant_clean = tax_comp_pretransplant.dropna(subset=['Recovery_Day'])

# Step 2: Exclude the columns we don't want to correlate with 'Recovery_Day'
exclude_columns = ['id', 'Patient_ID', 'Stool_Consistency', 'Patient_Sex', 'Sample_Day', 'Recovery_Day', 'Cohort_Number']
columns_to_correlate = [col for col in tax_comp_pretransplant_clean.columns if col not in exclude_columns]

# Step 3: Calculate the Spearman correlation for each column
correlation_results = []
for column in columns_to_correlate:
    # Check if the column has constant values
    if tax_comp_pretransplant_clean[column].nunique() == 1:
        # If constant, the correlation is undefined: record NaN and flag the row with a sample size of 0
        corr, p_value = float('nan'), float('nan')
        sample_size = 0
    else:
        # Calculate Spearman correlation between 'Recovery_Day' and the current column
        corr, p_value = spearmanr(tax_comp_pretransplant_clean['Recovery_Day'], tax_comp_pretransplant_clean[column])
        # Calculate sample size (number of valid pairs)
        sample_size = tax_comp_pretransplant_clean[['Recovery_Day', column]].dropna().shape[0]
    
    # Append the results
    correlation_results.append({
        'correlated_column': column,
        'correlation': corr,
        'p_value': p_value,
        'sample_size': sample_size
    })

# Step 4: Create a DataFrame from the results
correlation_df = pd.DataFrame(correlation_results)

# Display the results
correlation_df


Unnamed: 0,correlated_column,correlation,p_value,sample_size
0,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lac...,0.029704,0.836076,51
1,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lac...,0.029931,0.834842,51
2,d__Bacteria;p__Firmicutes;c__Clostridia;o__Osc...,-0.145632,0.307878,51
3,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactob...,-0.203581,0.151894,51
4,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lac...,-0.042213,0.768666,51
...,...,...,...,...
267,d__Bacteria;p__Proteobacteria;c__Gammaproteoba...,,,0
268,d__Bacteria;p__Firmicutes;c__Clostridia;o__Clo...,0.101064,0.480401,51
269,d__Bacteria;p__Actinobacteriota;c__Actinobacte...,,,0
270,d__Bacteria;p__Actinobacteriota;c__Coriobacter...,0.061863,0.666284,51
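With several hundred taxa in the table, the rows of interest are easier to find once undefined correlations are dropped and the rest is ordered by p-value. The sketch below uses a small hypothetical table with the same column names as `correlation_df`; the taxon labels and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical results table with the same columns as correlation_df.
toy_results = pd.DataFrame({
    "correlated_column": ["taxon_a", "taxon_b", "taxon_c", "taxon_d"],
    "correlation": [0.03, -0.45, float("nan"), 0.20],
    "p_value": [0.84, 0.001, float("nan"), 0.15],
    "sample_size": [51, 51, 0, 51],
})

ranked = (
    toy_results
    .dropna(subset=["correlation", "p_value"])   # drop taxa with undefined correlation
    .assign(abs_correlation=lambda df: df["correlation"].abs())
    .sort_values(["p_value", "abs_correlation"], ascending=[True, False])
    .reset_index(drop=True)
)
print(ranked[["correlated_column", "correlation", "p_value"]])
```

Setting `pd.set_option("display.max_colwidth", None)` before displaying the real table shows the full taxonomy strings instead of the truncated ones.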


Further steps: calculate the correlation of recovery day with diversity, composition, specific species, features, genes, etc., to see whether any of these is predictive of a quick recovery.
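Because each of these comparisons adds more tests, one possible next step is to adjust the p-values for multiple testing before calling anything predictive. This is a suggestion rather than part of the analysis above. The sketch implements the Benjamini-Hochberg procedure by hand; the function name `benjamini_hochberg` and the toy p-values are hypothetical.

```python
import numpy as np
import pandas as pd

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values; NaN inputs stay NaN."""
    p = pd.Series(p_values, dtype=float)
    valid = p.dropna().sort_values()
    m = len(valid)
    # Scale each sorted p-value by m / rank.
    scaled = valid.to_numpy() * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards and cap at 1.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1].clip(max=1.0)
    out = pd.Series(np.nan, index=p.index)
    out.loc[valid.index] = adjusted
    return out

# Hypothetical p-values; the NaN mimics a taxon with an undefined correlation.
toy_p = pd.Series([0.001, 0.02, 0.04, np.nan, 0.8], index=list("abcde"))
q = benjamini_hochberg(toy_p)
print(q)
```

Applied to the real results, this would be called as `benjamini_hochberg(correlation_df["p_value"])`, assuming the table has a unique index.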