#### S2. Predictive Biomarkers

Author: Willem Fuetterer


This Jupyter Notebook analyzes characteristics of the microbial communities and looks for biomarkers that predict the speed of recovery.

**Exercise overview:**<br>
[1. Setup](#setup)<br>
[2. Identification of correct sampling depth](#depth)<br>
[3. Calculating the alpha diversity](#calc)<br>
[4. Testing the associations between categorical metadata columns and the diversity metric](#categorical)<br>
[5. Testing whether numeric sample metadata columns are correlated with microbial community richness](#numeric)<br>
[6. Loading the results into variables to plot figures](#figures)<br>

<a id='setup'></a>

## 1. Setup

In [3]:
# importing all required packages & notebook extensions at the start of the notebook
import os
import biom
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization
from scipy.stats import spearmanr
from scipy.stats import linregress

%matplotlib inline

In [4]:
# assigning variables throughout the notebook

# location of this week's data and all the results produced by this notebook
# - this should be a path relative to your working directory
raw_data_dir = "../data/raw"
data_dir = "../data/processed"
vis_dir  = "../results"

<a id='depth'></a>

## 2. Statistical Analysis and Correlation with Recovery Speed

In [59]:
# Load metadata
metadata = pd.read_csv(f"{data_dir}/metadata.tsv", sep='\t')

# Define diversity metrics
metrics = ['shannon', 'evenness', 'faith_pd']  # Metrics to analyze
results = []  # To store results for each metric

# Function to perform Spearman correlation
def calculate_spearman(diversity_metric_name, diversity_qza_file, metadata):
    # Load diversity vector (.qza file); qiime2 is imported as q2 in the setup cell
    diversity_metric = q2.Artifact.load(diversity_qza_file)

    # Extract the diversity vector as a pandas Series
    diversity_series = diversity_metric.view(pd.Series)

    # Convert to a DataFrame and reset the index
    diversity_df = diversity_series.reset_index()
    diversity_df.columns = ['sample-id', diversity_metric_name]  # Rename columns for clarity

    # Merge diversity data with metadata
    merged = metadata.set_index('sample-id').join(diversity_df.set_index('sample-id'))

    # Filter post-transplantation samples (if only interested in post-transplant)
    post_transplant = merged[merged['Cohort_Number'] == 1]

    # Drop samples that lack the diversity value or the recovery day
    post_transplant = post_transplant.dropna(subset=[diversity_metric_name, 'Recovery_Day'])

    # Record how many samples enter the correlation
    n_samples = post_transplant.shape[0]

    # Perform Spearman correlation between the diversity metric and Recovery_Day
    correlation, p_value = spearmanr(post_transplant[diversity_metric_name], post_transplant['Recovery_Day'])

    return {
        'metric': diversity_metric_name,
        'Spearman correlation': correlation,
        'p-value': p_value,
        'n': n_samples
    }

# Loop through each metric and calculate the results
for metric in metrics:
    # Define file paths based on the metric
    diversity_qza_file = f"{data_dir}/core-metrics-results-bt/{metric}_vector.qza"

    # Calculate Spearman correlation for the current metric
    result = calculate_spearman(metric, diversity_qza_file, metadata)
    
    # Append the result to the results list
    results.append(result)

# Create a DataFrame to display results
results_df = pd.DataFrame(results)

# Save results to a CSV file
#results_df.to_csv(f"{data_dir}/alpha_diversity_correlation_results.csv", index=False)

# Print the results
print(results_df)

     metric  Spearman correlation   p-value   n
0   shannon             -0.095223  0.519707  48
1  evenness             -0.116704  0.429564  48
2  faith_pd             -0.139653  0.343797  48




## What I would like to do is calculate the correlation of recovery day with diversity, composition, specific species, features, genes, etc. to see if any of these are predictive of a quick recovery
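
The idea above can be sketched for a feature table as follows. This is only a minimal illustration, assuming a samples x features table of relative abundances and a recovery day per sample. The function name `correlate_features_with_recovery`, the genus names, and all values are made up for the example and do not come from the study data.

```python
import pandas as pd
from scipy.stats import spearmanr


def correlate_features_with_recovery(features, recovery_day):
    """Spearman correlation of every feature column with the recovery day."""
    # Keep only samples present in both inputs
    shared = features.index.intersection(recovery_day.index)
    features, recovery_day = features.loc[shared], recovery_day.loc[shared]

    rows = []
    for name in features.columns:
        values = features[name]
        # A constant feature cannot be ranked, so it is skipped
        if values.nunique() < 2:
            continue
        rho, p_value = spearmanr(values, recovery_day)
        rows.append({'feature': name, 'rho': rho, 'p-value': p_value, 'n': len(shared)})

    # Strongest absolute correlations first
    result = pd.DataFrame(rows)
    return result.sort_values('rho', key=lambda s: s.abs(), ascending=False).reset_index(drop=True)


# Hypothetical toy data: 6 samples, 3 made-up genera
sample_ids = ['s1', 's2', 's3', 's4', 's5', 's6']
recovery_day = pd.Series([10, 12, 15, 18, 21, 25], index=sample_ids)
features = pd.DataFrame({
    'genus_A': [0.30, 0.26, 0.20, 0.15, 0.11, 0.05],  # falls as recovery day rises
    'genus_B': [0.10, 0.10, 0.10, 0.10, 0.10, 0.10],  # constant, will be skipped
    'genus_C': [0.20, 0.50, 0.10, 0.40, 0.30, 0.60],
}, index=sample_ids)

feature_correlations = correlate_features_with_recovery(features, recovery_day)
print(feature_correlations)
```

When many features are tested this way, the p-values would need a multiple-testing correction before any feature is called predictive.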

other

In [52]:
!qiime tools export \
  --input-path $data_dir/core-metrics-results-bt/bray_curtis_pcoa_results.qza \
  --output-path $data_dir/core-metrics-results-bt/bray_curtis_pcoa_exported

Exported ../data/processed/core-metrics-results-bt/bray_curtis_pcoa_results.qza as OrdinationDirectoryFormat to directory ../data/processed/core-metrics-results-bt/bray_curtis_pcoa_exported

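PERMANOVA (and the other `beta-group-significance` tests) only accepts categorical metadata columns, so the call below on the numeric `Recovery_Day` column fails and prints the usage message instead.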

In [15]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results-bt/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/metadata.tsv \
    --m-metadata-column Recovery_Day \
    --p-method permdisp \
    --o-visualization $data_dir/beta-group-significance.qzv


Usage: qiime diversity beta-group-significance [OPTIONS]

  Determine whether groups of samples are significantly different from one
  another using a permutation-based statistical test.

Inputs:
  --i-distance-matrix ARTIFACT
    DistanceMatrix     Matrix of distances between pairs of samples.
                                                                    [required]
Parameters:
  --m-metadata-file METADATA
  --m-metadata-column COLUMN  MetadataColumn[Categorical]
                       Categorical sample metadata column.          [required]
  --p-method TEXT Choices('permanova', 'anosim', 'permdisp')
                       The group significance test to be applied.
                                                        [default: 'permanova']
  --p-pairwise / --p-no-pairwise
                       Perform pairwise tests between all pairs
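
Since the usage message requires a `MetadataColumn[Categorical]`, one conceivable workaround would be to bin the numeric recovery day into groups first. The sketch below only illustrates that binning step with pandas on toy values. The column name `Recovery_Group`, the labels `fast`/`slow`, and the median split are assumptions made for the example, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy metadata; the real values live in metadata.tsv
toy_metadata = pd.DataFrame({
    'sample-id': ['s1', 's2', 's3', 's4', 's5', 's6'],
    'Recovery_Day': [10, 12, 15, 18, 21, 25],
})

# Split the numeric column at its median into two equally sized groups.
# 'Recovery_Group' and the labels 'fast'/'slow' are made-up names.
toy_metadata['Recovery_Group'] = pd.qcut(
    toy_metadata['Recovery_Day'], q=2, labels=['fast', 'slow']
).astype(str)

print(toy_metadata)
```

Binning discards information and the cut-off is arbitrary, so a rank correlation on the numeric values, as used for the alpha diversity metrics, keeps more of the signal.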

This approach can be ignored

In [17]:
! qiime tools peek $data_dir/table-filtered.qza #-> FeatureTable[Frequency]

UUID:        702565bb-ce3d-472e-acd6-4b914601f892
Type:        FeatureTable[Frequency]
Data format: BIOMV210DirFmt


In [21]:
! qiime taxa collapse \
  --i-table $data_dir/table-filtered.qza \
  --i-taxonomy $data_dir/taxonomy.qza \
  --p-level 6 \
  --o-collapsed-table $data_dir/genus-table.qza

Saved FeatureTable[Frequency] to: ../data/processed/genus-table.qza

In [22]:
! qiime feature-table filter-features-conditionally \
  --i-table $data_dir/genus-table.qza \
  --p-prevalence 0.1 \
  --p-abundance 0.01 \
  --o-filtered-table $data_dir/filtered-genus-table.qza

Saved FeatureTable[Frequency] to: ../data/processed/filtered-genus-table.qza
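
To make the two thresholds in the filtering step concrete, here is a rough pandas sketch of the rule as I understand it from the command's description: a feature is kept only if its relative abundance reaches `abundance` in at least a `prevalence` fraction of samples. The helper `filter_conditionally` and the toy counts are hypothetical, and this is not QIIME 2's own implementation.

```python
import pandas as pd


def filter_conditionally(counts, abundance, prevalence):
    """Keep features reaching `abundance` in at least `prevalence` of samples."""
    # Convert counts (samples x features) to per-sample relative abundances
    relative = counts.div(counts.sum(axis=1), axis=0)
    # Fraction of samples in which each feature reaches the abundance threshold
    fraction_of_samples = (relative >= abundance).mean(axis=0)
    keep = fraction_of_samples >= prevalence
    return counts.loc[:, keep]


# Hypothetical toy counts: 4 samples, 3 made-up genera, 1000 reads per sample
toy_counts = pd.DataFrame({
    'genus_A': [500, 600, 400, 700],
    'genus_B': [495, 400, 600, 300],
    'genus_C': [5, 0, 0, 0],  # at most 0.5% of any sample
}, index=['s1', 's2', 's3', 's4'])

# Same thresholds as in the cell above
filtered_counts = filter_conditionally(toy_counts, abundance=0.01, prevalence=0.1)
print(filtered_counts)
```

With these thresholds the rare `genus_C` is removed because it never reaches 1% of a sample, while the two abundant genera pass.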

In [23]:
! qiime feature-table relative-frequency \
  --i-table $data_dir/filtered-genus-table.qza \
  --o-relative-frequency-table $data_dir/genus-rf-table.qza

Saved FeatureTable[RelativeFrequency] to: ../data/processed/genus-rf-table.qza

Volatility plots

In [28]:
! qiime longitudinal volatility \
  --i-table $data_dir/genus-rf-table.qza \
  --p-state-column Recovery_Day \
  --m-metadata-file $data_dir/metadata.tsv \
  --p-individual-id-column Patient_ID \
  --o-visualization $data_dir/volatility-plot-1.qzv

Saved Visualization to: ../data/processed/volatility-plot-1.qzv

In [29]:
Visualization.load(f"{data_dir}/volatility-plot-1.qzv")