#### S2. Predictive Biomarkers

Author: Willem Fuetterer


This Jupyter Notebook analyzes characteristics of the microbial communities and searches for biomarkers that predict the speed of recovery.

**Exercise overview:**<br>
[1. Setup](#setup)<br>
[2. Statistical Analysis and Correlation with Recovery Speed](#depth)<br>
[3. Calculating the alpha diversity](#calc)<br>
[4. Testing the associations between categorical metadata columns and the diversity metric](#categorical)<br>
[5. Testing whether numeric sample metadata columns are correlated with microbial community richness](#numeric)<br>
[6. Loading the results into variables to plot figures](#figures)<br>

<a id='setup'></a>

## 1. Setup

In [3]:
# importing all required packages & notebook extensions at the start of the notebook
import os
import biom
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization
from scipy.stats import spearmanr
from scipy.stats import linregress

%matplotlib inline

In [1]:
# defining location of data
raw_data_dir = "../data/raw"
data_dir = "../data/processed"
vis_dir  = "../results"

<a id='depth'></a>

## 2. Statistical Analysis and Correlation with Recovery Speed

In [2]:
# Load metadata
metadata = pd.read_csv(f"{data_dir}/metadata.tsv", sep='\t')

# Define diversity metrics
metrics = ['shannon', 'evenness', 'faith_pd']  # Metrics to analyze
results = []  # To store results for each metric

# Function to perform Spearman correlation
def calculate_spearman(diversity_metric_name, diversity_qza_file, metadata):
    """Correlate one alpha diversity metric with Recovery_Day using Spearman's rho."""
    # Load diversity vector (.qza file); qiime2 is imported as q2 in the setup cell
    diversity_metric = q2.Artifact.load(diversity_qza_file)

    # Extract the diversity vector as a pandas Series
    diversity_series = diversity_metric.view(pd.Series)

    # Convert to a DataFrame and reset the index
    diversity_df = diversity_series.reset_index()
    diversity_df.columns = ['sample-id', diversity_metric_name]  # Rename columns for clarity

    # Merge diversity data with metadata
    merged = metadata.set_index('sample-id').join(diversity_df.set_index('sample-id'))

    # Keep only samples with Cohort_Number == 1 (treated here as the post-transplantation samples)
    post_transplant = merged[merged['Cohort_Number'] == 1]

    # Drop samples that lack either the diversity value or Recovery_Day
    post_transplant = post_transplant.dropna(subset=[diversity_metric_name, 'Recovery_Day'])

    # Record how many samples enter the correlation
    n_samples = post_transplant.shape[0]

    # Perform Spearman correlation between the diversity metric and Recovery_Day
    correlation, p_value = spearmanr(post_transplant[diversity_metric_name], post_transplant['Recovery_Day'])

    return {
        'metric': diversity_metric_name,
        'Spearman correlation': correlation,
        'p-value': p_value,
        'n': n_samples
    }

# Loop through each metric and calculate the results
for metric in metrics:
    # Define file paths based on the metric
    diversity_qza_file = f"{data_dir}/core-metrics-results-bt/{metric}_vector.qza"

    # Calculate Spearman correlation for the current metric
    result = calculate_spearman(metric, diversity_qza_file, metadata)
    
    # Append the result to the results list
    results.append(result)

# Create a DataFrame to display results
results_df = pd.DataFrame(results)

# Saving is disabled for now (TODO: export to Excel instead of CSV)
# results_df.to_csv(f"{data_dir}/alpha_diversity_correlation_results.csv", index=False)

# Print the results
print(results_df)

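`calculate_spearman` records `n` but never acts on it, and Spearman's rho is undefined when either variable is constant. A minimal sketch of a guard follows, assuming a hypothetical helper `spearman_with_guard` and an arbitrary threshold `min_n=5`. Neither is part of this notebook, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def spearman_with_guard(x, y, min_n=5):
    """Hypothetical helper: Spearman correlation that returns NaN
    when too few complete pairs remain or one variable is constant."""
    paired = pd.DataFrame({"x": x, "y": y}).dropna()
    n = len(paired)
    # min_n is an assumed threshold, not a value taken from the notebook
    if n < min_n or paired["x"].nunique() < 2 or paired["y"].nunique() < 2:
        return {"rho": np.nan, "p": np.nan, "n": n}
    rho, p = spearmanr(paired["x"], paired["y"])
    return {"rho": float(rho), "p": float(p), "n": n}


# Made-up diversity values and recovery days; two pairs are incomplete
toy = spearman_with_guard(
    [3.1, 2.8, np.nan, 4.0, 3.5, 2.2, 3.9],
    [20, 24, 18, np.nan, 15, 30, 14],
)
# Only two samples: the guard returns NaN instead of a misleading rho
few = spearman_with_guard([1.0, 2.0], [2.0, 1.0])
print(toy)
print(few)
```

Returning NaN keeps one row per metric in `results_df`, so metrics with too few samples stay visible instead of silently disappearing.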

Further steps: I would like to calculate the correlation of recovery day with diversity, composition, specific species, features, genes, etc., to see whether any of these are predictive of a quick recovery.
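Testing many features against recovery day at once inflates the number of false positives, so the p-values need a multiple-testing correction. The sketch below is one possible way to do this, not an analysis from this notebook. It assumes a samples × features table indexed by sample ID, uses the Benjamini-Hochberg procedure as a suggested correction, and runs on made-up data. The names `correlate_features`, `benjamini_hochberg`, and the `taxon_*` columns are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def correlate_features(feature_table, recovery_day):
    """Spearman correlation of every feature column with recovery day."""
    shared = feature_table.index.intersection(recovery_day.dropna().index)
    rows = []
    for feature in feature_table.columns:
        values = feature_table.loc[shared, feature]
        if values.nunique() < 2:
            continue  # rank correlation is undefined for constant features
        rho, p = spearmanr(values, recovery_day.loc[shared])
        rows.append({"feature": feature, "rho": rho, "p": p})
    out = pd.DataFrame(rows)
    if out.empty:
        return out
    out["q"] = benjamini_hochberg(out["p"])
    return out.sort_values("q").reset_index(drop=True)


# Made-up example data: 12 samples, 4 hypothetical taxa
samples = [f"S{i}" for i in range(1, 13)]
recovery = pd.Series(np.arange(10, 34, 2), index=samples, name="Recovery_Day")
rng = np.random.default_rng(0)
table = pd.DataFrame(
    {
        "taxon_A": 100 - 2 * recovery.to_numpy(),  # decreases with recovery day
        "taxon_B": rng.poisson(20, size=12),       # unrelated noise
        "taxon_C": rng.poisson(20, size=12),       # unrelated noise
        "taxon_const": np.zeros(12, dtype=int),    # constant, gets skipped
    },
    index=samples,
)
hits = correlate_features(table, recovery)
print(hits)
```

The same pattern could be reused for genes or other per-sample features, as long as they can be arranged in a table with one row per sample.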