#### S2. Predictive Biomarkers

Author: Willem Fuetterer


This Jupyter Notebook characterizes the microbial communities and searches for biomarkers that predict the speed of recovery.

**Exercise overview:**<br>
[1. Setup](#setup)<br>
[2. Data preparation](#depth)<br>
[3. Calculating Spearman correlation between the abundance of single microbial taxa on all levels and the recovery time](#calc)<br>
[4. Calculating Spearman correlation between the abundance of pairs of microbial taxa on the species level and the recovery time](#four)<br>
[5. Calculating Spearman correlation between the alpha diversity (Shannon Entropy and Faith PD) and the recovery time](#five)<br>
[6. Calculating Spearman correlation between the beta diversity (Weighted Unifrac & Bray Curtis) and duration of recovery](#six)<br>
[7. Comparing recovered and deceased patients](#seven)<br>


<a id='setup'></a>

## 1. Setup

In [41]:
# importing all required packages & notebook extensions at the start of the notebook
import os
import biom
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization
from scipy import stats
from scipy.stats import spearmanr
from scipy.stats import linregress
import itertools
from statsmodels.stats.multitest import multipletests

%matplotlib inline



In [2]:
# defining location of data
raw_data_dir = "../data/raw"
data_dir = "../data/processed"
vis_dir  = "../results"

<a id='depth'></a>

## 2. Data preparation

#### Exporting species abundance separated by sample

In [8]:
! qiime tools export \
    --input-path $data_dir/taxa-bar-plots-filtered.qzv \
    --output-path $data_dir/taxa-bar-plots-filtered-exported

Exported ../data/processed/taxa-bar-plots-filtered.qzv as Visualization to directory ../data/processed/taxa-bar-plots-filtered-exported


#### Loading species abundance and alpha diversity separated by sample

In [31]:
# Load the data into DataFrames
taxonomic_composition_l7 = pd.read_csv(f'{data_dir}/taxa-bar-plots-filtered-exported/level-7.csv')
metadata_faith_shannon = pd.read_csv(f'{data_dir}/metadata_faith_shannon.csv')
# Rename the column so that the sample id has the same name in both tables
taxonomic_composition_l7.rename(columns={'index': 'id'}, inplace=True)
# Perform a left join on the sample id and the shared metadata columns
taxonomic_composition_diversity = taxonomic_composition_l7.merge(metadata_faith_shannon, 
                                          on=['id', 'Patient_ID', 'Stool_Consistency', 
                                               'Patient_Sex', 'Sample_Day', 'Recovery_Day', 
                                               'Cohort_Number'], 
                                          how='left')

# Sort the DataFrame by 'id' in ascending order
taxonomic_composition_diversity = taxonomic_composition_diversity.sort_values(by='id', ascending=True)
# Display the resulting DataFrame
taxonomic_composition_diversity


Unnamed: 0,id,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__;__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Roseburia;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Fournierella;s__,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lachnoclostridium;s__,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Morganellaceae;g__Morganella;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Blautia;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__Butyricicoccaceae;g__UCG-009;s__,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Rikenellaceae;g__Alistipes;s__,...,d__Bacteria;p__Actinobacteriota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriales_Incertae_Sedis;g__uncultured;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lactonifactor;s__,Patient_ID,Stool_Consistency,Patient_Sex,Sample_Day,Recovery_Day,Cohort_Number,shannon_entropy,faith_pd
0,EG0024,94.0,0.0,0.0,1967.0,23.0,0.0,54296.0,0.0,0.0,...,0.0,0.0,P004,formed,F,0.0,34.0,1.0,2.350224,7.792528
1,EG0031,0.0,0.0,0.0,83.0,9.0,0.0,23.0,0.0,0.0,...,0.0,0.0,P021,formed,M,20.0,24.0,2.0,2.242202,6.405608
2,EG0039,0.0,202.0,0.0,3994.0,553.0,0.0,16239.0,0.0,0.0,...,0.0,0.0,P073,formed,M,0.0,,1.0,3.354383,11.444418
3,EG0055,0.0,27.0,0.0,3004.0,0.0,0.0,69.0,0.0,0.0,...,0.0,0.0,P020,liquid,F,0.0,28.0,1.0,1.098737,8.205870
4,EG0057,0.0,0.0,0.0,45.0,0.0,0.0,128.0,0.0,0.0,...,0.0,0.0,P004,formed,F,35.0,34.0,2.0,1.709977,8.711084
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,EG2580,7452.0,0.0,0.0,90426.0,14163.0,0.0,102.0,0.0,0.0,...,0.0,0.0,P042,liquid,F,13.0,17.0,2.0,3.199968,10.672487
98,EG2591,48514.0,12862.0,11.0,17750.0,5903.0,0.0,8062.0,5.0,6697.0,...,0.0,73.0,P017,liquid,M,0.0,17.0,1.0,4.920576,23.524030
99,EG2608,89.0,0.0,0.0,3015.0,8.0,0.0,29357.0,0.0,10.0,...,0.0,0.0,P034,formed,F,0.0,18.0,1.0,3.506510,13.212383
100,EG2638,87.0,22.0,21099.0,2924.0,313.0,0.0,214.0,0.0,1522.0,...,0.0,0.0,P017,semi-formed,M,12.0,17.0,2.0,2.746580,14.813480


<a id='calc'></a>

## 3. Calculating Spearman correlation between the abundance of single microbial taxa on all levels and the recovery time

#### (Domain, Phylum, Class, Order, Family, Genus, Species)
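
The exported column names encode the full lineage with the rank prefixes `d__`, `p__`, `c__`, `o__`, `f__`, `g__` and `s__`, and levels without an assignment appear as `s__` or `__`. The following is a minimal sketch of how such a name could be split into ranks; the helper names `parse_lineage` and `deepest_named_rank` are hypothetical and not part of the notebook or of QIIME 2.

```python
# Rank prefixes as they appear in the exported column names (d__, p__, ...).
RANK_PREFIXES = {
    "d": "Domain", "p": "Phylum", "c": "Class", "o": "Order",
    "f": "Family", "g": "Genus", "s": "Species",
}


def parse_lineage(column_name):
    """Split a lineage string into a {rank: name} dict (hypothetical helper).

    Levels without a name (e.g. 's__' or '__') are reported as None.
    """
    ranks = list(RANK_PREFIXES.values())
    parsed = {}
    for position, field in enumerate(column_name.split(";")):
        prefix, _, name = field.partition("__")
        # A bare '__' has no prefix, so fall back to the position in the lineage
        rank = RANK_PREFIXES.get(prefix, ranks[position])
        parsed[rank] = name or None
    return parsed


def deepest_named_rank(column_name):
    """Return (rank, name) of the most specific level that carries a name."""
    named = [(rank, name) for rank, name in parse_lineage(column_name).items() if name]
    return named[-1]


roseburia = "d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Roseburia;s__"
unclassified = "d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__;__"

print(deepest_named_rank(roseburia))     # ('Genus', 'Roseburia')
print(deepest_named_rank(unclassified))  # ('Family', 'Lachnospiraceae')
```

Such a helper can make the long, truncated column names in the result tables easier to read, for example by adding a column with the deepest named rank.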




In [27]:
# Define function to calculate Spearman correlation
def calculate_spearman_correlation(taxonomic_composition, exclude_columns):
    # Filter the rows where Cohort_Number equals 1 and drop NaN in 'Recovery_Day'
    tax_comp_pretransplant = taxonomic_composition[taxonomic_composition['Cohort_Number'] == 1]
    tax_comp_pretransplant_clean = tax_comp_pretransplant.dropna(subset=['Recovery_Day'])
    
    # Get columns to correlate
    columns_to_correlate = [col for col in tax_comp_pretransplant_clean.columns if col not in exclude_columns]

    # Calculate Spearman correlation for each column
    correlation_results = []
    for column in columns_to_correlate:
        if tax_comp_pretransplant_clean[column].nunique() == 1:
            corr, p_value = float('nan'), float('nan')
            sample_size = 0
        else:
            corr, p_value = spearmanr(tax_comp_pretransplant_clean['Recovery_Day'], tax_comp_pretransplant_clean[column])
            sample_size = tax_comp_pretransplant_clean[['Recovery_Day', column]].dropna().shape[0]
        
        correlation_results.append({
            'correlated_column': column,
            'correlation': corr,
            'p_value': p_value,
            'sample_size': sample_size
        })
    
    return pd.DataFrame(correlation_results)

# Taxonomic level names
taxonomic_levels = ["Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"]

# Loop over taxonomic levels 1-7
all_correlation_results = []
for level in range(1, 8):
    # Load the taxonomic composition for the current level
    taxonomic_composition = pd.read_csv(f'{data_dir}/taxa-bar-plots-filtered-exported/level-{level}.csv')
    taxonomic_composition.rename(columns={'index': 'id'}, inplace=True)
    taxonomic_composition = taxonomic_composition.sort_values(by='id', ascending=True)

    # Exclude the columns we don't want to correlate with 'Recovery_Day'
    exclude_columns = ['id', 'Patient_ID', 'Stool_Consistency', 'Patient_Sex', 'Sample_Day', 'Recovery_Day', 'Cohort_Number', 'shannon_entropy', 'faith_pd']
    
    # Call the correlation calculation function
    correlation_level = calculate_spearman_correlation(taxonomic_composition, exclude_columns)
    
    # Add the taxonomic level as a new column
    correlation_level['level'] = level
    correlation_level['taxonomic_rank'] = taxonomic_levels[level - 1]  # Add the rank name

    # Append to the list of all results
    all_correlation_results.append(correlation_level)

# Combine all correlation results into one DataFrame
all_correlation_results_df = pd.concat(all_correlation_results, ignore_index=True)

# ---- Filter and Correct for Multiple Testing ----
# Drop columns without a defined p-value (constant columns);
# copy so that the new column is added to an independent DataFrame
all_correlation_results_filtered = all_correlation_results_df.dropna(subset=['p_value']).copy()

# Extract p-values for correction
p_values = all_correlation_results_filtered['p_value'].values

# Apply Benjamini-Hochberg FDR correction
rejected, corrected_p_values, _, _ = multipletests(p_values, method='fdr_bh')

# Add corrected p-values to the DataFrame
all_correlation_results_filtered['corrected_p_value'] = corrected_p_values

# Filter significant results before and after correction
significant_results_raw = all_correlation_results_filtered[all_correlation_results_filtered['p_value'] < 0.05]
significant_results_corrected = all_correlation_results_filtered[all_correlation_results_filtered['corrected_p_value'] < 0.05]

# ---- Sort Results ----
# Sort by taxonomic level first, then by (corrected) p-value
significant_results_raw_sorted = significant_results_raw.sort_values(by=['level', 'p_value'], ascending=[True, True])
significant_results_corrected_sorted = significant_results_corrected.sort_values(by=['level', 'corrected_p_value'], ascending=[True, True])

significant_results_raw_sorted


Unnamed: 0,correlated_column,correlation,p_value,sample_size,level,taxonomic_rank,corrected_p_value
86,d__Bacteria;p__Actinobacteriota;c__Coriobacter...,-0.347597,0.012445,51,5,Family,0.960647
195,d__Bacteria;p__Actinobacteriota;c__Coriobacter...,-0.347597,0.012445,51,6,Genus,0.960647
213,d__Bacteria;p__Actinobacteriota;c__Coriobacter...,-0.339033,0.014945,51,6,Genus,0.960647
363,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lac...,-0.326704,0.019289,51,6,Genus,0.960647
230,d__Bacteria;p__Patescibacteria;c__Saccharimona...,-0.309856,0.026917,51,6,Genus,0.960647
258,d__Bacteria;p__Firmicutes;c__Bacilli;o__Erysip...,-0.309019,0.027353,51,6,Genus,0.960647
322,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactob...,-0.285429,0.042327,51,6,Genus,0.960647
336,d__Bacteria;p__Firmicutes;c__Bacilli;o__Erysip...,0.28371,0.04364,51,6,Genus,0.960647
273,d__Bacteria;p__Firmicutes;c__Clostridia;o__Osc...,-0.280866,0.045886,51,6,Genus,0.960647
467,d__Bacteria;p__Actinobacteriota;c__Coriobacter...,-0.347597,0.012445,51,7,Species,0.960647


In [16]:
# Save the results
significant_results_raw_sorted.to_excel(f'{vis_dir}/predictive_biomarker_species_significant_results_uncorrected.xlsx', index=False)

In this context, a positive Spearman correlation between the abundance of a taxon and Recovery_Day means that higher abundance is associated with a longer recovery time, whereas a negative correlation means that higher abundance is associated with a shorter recovery time. Taxa of either kind are candidate predictive biomarkers, with the latter being associated with a quicker recovery. Note, however, that none of the correlations listed above remains significant after FDR correction (corrected p-value = 0.96 throughout), so they should be read as hypotheses rather than as established biomarkers.
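
To make the sign convention concrete, here is a dependency-free sketch of the Spearman coefficient (the Pearson correlation of the rank-transformed values) applied to toy data. The data and the helper names `average_ranks` and `spearman_rho` are hypothetical; the notebook itself uses `scipy.stats.spearmanr`.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: recovery days and abundances of two made-up taxa
recovery_day = [17, 18, 24, 28, 34]
taxon_slow = [0, 5, 90, 400, 2000]  # abundance rises with recovery time
taxon_fast = [900, 300, 40, 10, 0]  # abundance falls with recovery time

rho_slow = spearman_rho(recovery_day, taxon_slow)
rho_fast = spearman_rho(recovery_day, taxon_fast)
print(f"rho_slow = {rho_slow:+.2f}, rho_fast = {rho_fast:+.2f}")
```

Because only the ranks enter the calculation, the very different abundance scales of the two toy taxa do not matter; only the direction and consistency of the monotonic trend do.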

<a id='four'></a>

## 4. Calculating Spearman correlation between the abundance of pairs of microbial taxa on the species level and the recovery time

In [95]:
# Filter the rows where Cohort_Number equals 1
tax_comp_pretransplant = taxonomic_composition_diversity[taxonomic_composition_diversity['Cohort_Number'] == 1]

# Remove rows with NaN in 'Recovery_Day'; copy so that a helper column can be added safely
tax_comp_pretransplant_clean = tax_comp_pretransplant.dropna(subset=['Recovery_Day']).copy()

# Exclude the columns we don't want to correlate with 'Recovery_Day'
exclude_columns = ['id', 'Patient_ID', 'Stool_Consistency', 'Patient_Sex', 'Sample_Day', 'Recovery_Day', 'Cohort_Number', 'shannon_entropy', 'faith_pd']
columns_to_correlate = [col for col in tax_comp_pretransplant_clean.columns if col not in exclude_columns]

# ---- Iterating Over Pairs of Species Abundances ----

# Generate all possible pairs of species abundance columns
species_abundance_columns = list(columns_to_correlate)  # All the columns containing species abundances
species_pairs = list(itertools.combinations(species_abundance_columns, 2))

# ---- Calculate Spearman Correlation for Each Pair of Species ----

correlation_results = []
for species_pair in species_pairs:
    # Calculate the combined abundance for the pair of species (sum of their abundances)
    tax_comp_pretransplant_clean.loc[:, 'combined_abundance'] = tax_comp_pretransplant_clean[list(species_pair)].sum(axis=1)
    
    # Drop rows where 'combined_abundance' or 'Recovery_Day' is NaN
    tax_comp_pretransplant_clean_filtered = tax_comp_pretransplant_clean.dropna(subset=['combined_abundance', 'Recovery_Day'])

    # Check if the combined abundance has variance (not constant)
    if tax_comp_pretransplant_clean_filtered['combined_abundance'].nunique() > 1:
        # Calculate Spearman correlation between 'Recovery_Day' and the combined abundance of the species pair
        corr_combined, p_value_combined = spearmanr(tax_comp_pretransplant_clean_filtered['Recovery_Day'], tax_comp_pretransplant_clean_filtered['combined_abundance'])
        sample_size_combined = tax_comp_pretransplant_clean_filtered[['Recovery_Day', 'combined_abundance']].dropna().shape[0]
        
        # Append the results with both species names included in the output
        correlation_results.append({
            'species_1': species_pair[0],
            'species_2': species_pair[1],
            'correlation': corr_combined,
            'p_value': p_value_combined,
            'sample_size': sample_size_combined
        })

# Create a DataFrame from the results
correlation_species = pd.DataFrame(correlation_results)

# ---- Correct, Filter and Sort ----

# Drop pairs without a defined p-value; copy so that the new column is added to an independent DataFrame
correlation_species_filtered = correlation_species.dropna(subset=['p_value']).copy()

# Perform Benjamini-Hochberg FDR correction on the p-values
p_values = correlation_species_filtered['p_value'].values
rejected, pvals_corrected, _, _ = multipletests(p_values, method='fdr_bh')

# Add the corrected p-values to the results
correlation_species_filtered['corrected_p_value'] = pvals_corrected
# Remember the table shape; its first entry is the number of performed tests
shape = correlation_species_filtered.shape
# Keep the results significant before correction, then those still significant after correction
correlation_species_filtered = correlation_species_filtered[correlation_species_filtered['p_value'] < 0.05]
correlation_species_filtered_pvalue = correlation_species_filtered[correlation_species_filtered['corrected_p_value'] < 0.05]

# Sort the results by uncorrected p-value
correlation_species_sorted = correlation_species_filtered.sort_values(by='p_value', ascending=True)


In [91]:
correlation_species_sorted

Unnamed: 0,species_1,species_2,correlation,p_value,sample_size,corrected_p_value
10783,d__Bacteria;p__Actinobacteriota;c__Coriobacter...,d__Bacteria;p__Firmicutes;c__Bacilli;o__Erysip...,-0.470834,0.000489,51,0.999409
10752,d__Bacteria;p__Actinobacteriota;c__Coriobacter...,d__Bacteria;p__Fusobacteriota;c__Fusobacteriia...,-0.425308,0.001863,51,0.999409
10811,d__Bacteria;p__Actinobacteriota;c__Coriobacter...,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lac...,-0.422890,0.001991,51,0.999409
18091,d__Bacteria;p__Patescibacteria;c__Saccharimona...,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactob...,-0.419409,0.002187,51,0.999409
14766,d__Bacteria;p__Actinobacteriota;c__Coriobacter...,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lac...,-0.418495,0.002242,51,0.999409
...,...,...,...,...,...,...
20585,d__Bacteria;p__Firmicutes;c__Clostridia;o__Osc...,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lac...,0.276485,0.049527,51,0.999409
31578,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactob...,d__Bacteria;p__Firmicutes;c__Clostridia;o__Osc...,-0.276460,0.049549,51,0.999409
20488,d__Bacteria;p__Firmicutes;c__Clostridia;o__Osc...,d__Bacteria;p__Firmicutes;c__Negativicutes;o__...,0.276413,0.049590,51,0.999409
12954,d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphy...,d__Bacteria;p__Firmicutes;c__Clostridia;o__Osc...,-0.276369,0.049627,51,0.999409


In [118]:
print("Number of performed tests:", shape[0])
print("Number of significant results before correcting for multiple testing:", len(correlation_species_sorted)) 
print("Number of significant results after correcting for multiple testing:", len(correlation_species_filtered_pvalue)) 

Number of performed tests: 36421
Number of significant results before correcting for multiple testing: 1155
Number of significant results after correcting for multiple testing: 0


Because of the large number of tests, many false positives are expected among the uncorrected results. After adjusting for the number of tests (Benjamini-Hochberg FDR correction), no species pair remains significant.
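
The effect of the correction can be illustrated with a small, dependency-free sketch of the Benjamini-Hochberg step-up procedure. The function name `benjamini_hochberg` and the toy p-values are hypothetical; the notebook itself relies on `multipletests(..., method='fdr_bh')` from statsmodels.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical toy p-values: four are below 0.05 before correction
toy_p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
toy_adjusted = benjamini_hochberg(toy_p_values)
n_raw = sum(p < 0.05 for p in toy_p_values)
n_adjusted = sum(p < 0.05 for p in toy_adjusted)
print(n_raw, n_adjusted)

# Rough expectation for this notebook: if no pair were truly associated and the
# tests were independent, about alpha * n_tests raw p-values would fall below alpha
n_tests, alpha = 36421, 0.05
expected_by_chance = n_tests * alpha
print(round(expected_by_chance))
```

Under these simplifying assumptions, roughly 1821 uncorrected hits would be expected by chance alone, which is more than the 1155 observed. The pairwise tests share taxa and are therefore not independent, so this figure is only an order-of-magnitude guide.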

In [67]:
! qiime tools export \
    --input-path $data_dir/table.qzv \
    --output-path $data_dir/table-exported

Exported ../data/processed/table.qzv as Visualization to directory ../data/processed/table-exported


<a id='five'></a>

## 5. Calculating Spearman correlation between the alpha diversity (Shannon Entropy and Faith PD) and the recovery time

In [137]:
# Filter out NaN values from all three columns: 'Recovery_Day', 'shannon_entropy', 'faith_pd'
filtered_df = tax_comp_pretransplant_clean.dropna(subset=['Recovery_Day', 'shannon_entropy', 'faith_pd'])

# Initialize an empty list to store the results for 'shannon_entropy' and 'faith_pd'
correlation_entropy_faith = []

# List of columns to correlate with 'Recovery_Day'
columns_to_correlate = ['shannon_entropy', 'faith_pd']

# Calculate Spearman correlation for each column
for column in columns_to_correlate:
    # Calculate Spearman correlation between 'Recovery_Day' and the current column
    corr, p_value = spearmanr(filtered_df['Recovery_Day'], filtered_df[column])
    
    # Calculate sample size (number of valid pairs)
    sample_size = filtered_df[['Recovery_Day', column]].dropna().shape[0]
    
    # Append the results
    correlation_entropy_faith.append({
        'correlated_column': column,
        'correlation': corr,
        'p_value': p_value,
        'sample_size': sample_size
    })

# Create a DataFrame from the results for entropy and faith_pd
correlation_entropy_faith_df = pd.DataFrame(correlation_entropy_faith)

# Drop undefined results, sort by p-value and rename the column for reporting
correlation_entropy_faith_filtered = correlation_entropy_faith_df.dropna(subset=['p_value'])
correlation_entropy_faith_sorted = (
    correlation_entropy_faith_filtered
    .sort_values(by='p_value', ascending=True)
    .rename(columns={'correlated_column': 'diversity_metric'})
)

correlation_entropy_faith_sorted.to_excel(f'{vis_dir}/predicted_biomarker_alphadiversity.xlsx', index=False)

correlation_entropy_faith_sorted

Unnamed: 0,diversity_metric,correlation,p_value,sample_size
1,faith_pd,-0.139653,0.343797,48
0,shannon_entropy,-0.095223,0.519707,48


There is no significant correlation between alpha diversity and recovery time (Faith PD: p = 0.34; Shannon entropy: p = 0.52).

<a id='six'></a>

## 6. Calculating Spearman correlation between the beta diversity (Weighted Unifrac & Bray Curtis) and duration of recovery

In [121]:
# Load metadata
metadata = pd.read_csv(f'{data_dir}/metadata.tsv', sep='\t')

# Filter for Cohort_Number == 1 and exclude rows where Recovery_Day is NaN
filtered_metadata = metadata[(metadata['Cohort_Number'] == 1) & metadata['Recovery_Day'].notna()]

# Save the filtered metadata
filtered_metadata.to_csv(f'{data_dir}/filtered_metadata.tsv', sep='\t', index=False)


## Weighted Unifrac

In [17]:
! qiime diversity beta-correlation \
    --i-distance-matrix $data_dir/core-metrics-results-bt/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/filtered_metadata.tsv \
    --m-metadata-column Recovery_Day \
    --p-intersect-ids \
    --o-metadata-distance-matrix $data_dir/core-metrics-results-bt/beta-correlation/spearman-recov-day-pre-w-unifrac.qza \
    --o-mantel-scatter-visualization $data_dir/core-metrics-results-bt/beta-correlation/scatter-plot-recov-day-pre-w-unifrac.qzv

Saved DistanceMatrix to: ../data/processed/core-metrics-results-bt/beta-correlation/spearman-recov-day-pre-w-unifrac.qza
Saved Visualization to: ../data/processed/core-metrics-results-bt/beta-correlation/scatter-plot-recov-day-pre-w-unifrac.qzv

In [18]:
#Visualization.load(f"{data_dir}/core-metrics-results-bt/beta-correlation/scatter-plot-recov-day-pre-w-unifrac.qzv")
! qiime tools view $data_dir/core-metrics-results-bt/beta-correlation/scatter-plot-recov-day-pre-w-unifrac.qzv

Press the 'q' key, Control-C, or Control-D to quit. This view may no longer be accessible or work correctly after quitting.

No significant correlation: Spearman rho = 0.029, p-value = 0.716 (n = 48).

## Bray Curtis

In [19]:
! qiime diversity beta-correlation \
    --i-distance-matrix $data_dir/core-metrics-results-bt/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/filtered_metadata.tsv \
    --m-metadata-column Recovery_Day \
    --p-intersect-ids \
    --o-metadata-distance-matrix $data_dir/core-metrics-results-bt/beta-correlation/spearman-recov-day-pre-bray-curtis.qza \
    --o-mantel-scatter-visualization $data_dir/core-metrics-results-bt/beta-correlation/scatter-plot-recov-day-pre-bray-curtis.qzv

Saved DistanceMatrix to: ../data/processed/core-metrics-results-bt/beta-correlation/spearman-recov-day-pre-bray-curtis.qza
Saved Visualization to: ../data/processed/core-metrics-results-bt/beta-correlation/scatter-plot-recov-day-pre-bray-curtis.qzv

In [21]:
#Visualization.load(f"{data_dir}/core-metrics-results-bt/beta-correlation/scatter-plot-recov-day-pre-bray-curtis.qzv")
! qiime tools view $data_dir/core-metrics-results-bt/beta-correlation/scatter-plot-recov-day-pre-bray-curtis.qzv

Press the 'q' key, Control-C, or Control-D to quit. This view may no longer be accessible or work correctly after quitting.

No significant correlation: Spearman rho = 0.011, p-value = 0.886 (n = 48).
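
As the output names of `qiime diversity beta-correlation` suggest, the command turns the numeric metadata column into a distance matrix and compares it with the community distance matrix using a Mantel test. The sketch below illustrates that idea on hypothetical toy data. For brevity it uses Pearson correlation on the flattened upper triangles, whereas the values reported above are Spearman correlations; all names (`pearson`, `upper_triangle`, `mantel`) and the toy matrices are assumptions, not the QIIME 2 implementation.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def upper_triangle(matrix, order):
    """Flatten the upper triangle after relabelling the samples by `order`."""
    return [matrix[order[i]][order[j]] for i, j in combinations(range(len(order)), 2)]


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Two-sided permutation Mantel test; returns (statistic, p_value)."""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    flat_a = upper_triangle(dist_a, identity)
    observed = pearson(flat_a, upper_triangle(dist_b, identity))
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute sample labels of the second matrix
        if abs(pearson(flat_a, upper_triangle(dist_b, order))) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical toy data for five samples
recovery_day = [17, 18, 24, 28, 34]
metadata_dist = [[abs(a - b) for b in recovery_day] for a in recovery_day]
community_dist = [
    [0.0, 0.2, 0.5, 0.6, 0.9],
    [0.2, 0.0, 0.4, 0.5, 0.8],
    [0.5, 0.4, 0.0, 0.3, 0.6],
    [0.6, 0.5, 0.3, 0.0, 0.4],
    [0.9, 0.8, 0.6, 0.4, 0.0],
]

statistic, p_value = mantel(community_dist, metadata_dist)
print(f"Mantel r = {statistic:.3f}, p = {p_value:.3f}")
```

Permuting sample labels rather than individual distances keeps each permuted matrix a valid distance matrix, which is why the test is run on whole rows and columns.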

### -> Possible next step: correlate the recovery day with composition, features, genes, etc. to see whether any of these are predictive of a quick recovery

<a id='seven'></a>

## 7. Comparing recovered and deceased patients

(assuming that a NaN value for Recovery_Day means that the patient died)

In [42]:
# Filter for Cohort_Number == 1; copy so that the new column is added to an independent DataFrame
metadata_survival = taxonomic_composition_diversity[taxonomic_composition_diversity['Cohort_Number'] == 1].copy()
# Add the 'outcome' column based on the 'Recovery_Day' column
metadata_survival['outcome'] = metadata_survival['Recovery_Day'].apply(
    lambda x: 'recovered' if pd.notna(x) else 'died'
)
# Save the filtered table with the added outcome column
metadata_survival.to_csv(f'{data_dir}/metadata_survival.tsv', sep='\t', index=False)

metadata_survival.head()


Unnamed: 0,id,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__;__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Roseburia;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Fournierella;s__,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lachnoclostridium;s__,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Morganellaceae;g__Morganella;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Blautia;s__,d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__Butyricicoccaceae;g__UCG-009;s__,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Rikenellaceae;g__Alistipes;s__,...,d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lactonifactor;s__,Patient_ID,Stool_Consistency,Patient_Sex,Sample_Day,Recovery_Day,Cohort_Number,shannon_entropy,faith_pd,outcome
0,EG0024,94.0,0.0,0.0,1967.0,23.0,0.0,54296.0,0.0,0.0,...,0.0,P004,formed,F,0.0,34.0,1.0,2.350224,7.792528,recovered
2,EG0039,0.0,202.0,0.0,3994.0,553.0,0.0,16239.0,0.0,0.0,...,0.0,P073,formed,M,0.0,,1.0,3.354383,11.444418,died
3,EG0055,0.0,27.0,0.0,3004.0,0.0,0.0,69.0,0.0,0.0,...,0.0,P020,liquid,F,0.0,28.0,1.0,1.098737,8.20587,recovered
5,EG0070,14937.0,4439.0,0.0,6514.0,479.0,0.0,53557.0,0.0,0.0,...,0.0,P062,semi-formed,F,0.0,27.0,1.0,4.982222,19.296875,recovered
9,EG0136,79.0,1557.0,76.0,11029.0,543.0,0.0,47881.0,0.0,36.0,...,6.0,P027,formed,M,0.0,33.0,1.0,4.137198,19.930942,recovered


#### Alpha diversity

In [49]:
# Group the data by 'outcome'
grouped = metadata_survival.groupby('outcome')

# Initialize a dictionary to store T-test results
t_test_results = {
    'alpha_diversity': [],
    't_statistic': [],
    'p_value': []
}

# Perform the T-test for 'shannon_entropy' and 'faith_pd'
for column in ['shannon_entropy', 'faith_pd']:
    # Get the two groups based on outcome
    group_recovered = grouped.get_group('recovered')[column]
    group_died = grouped.get_group('died')[column]
    
    # Perform T-test
    t_stat, p_value = stats.ttest_ind(group_recovered, group_died, nan_policy='omit')
    
    # Append results to dictionary
    t_test_results['alpha_diversity'].append(column)
    t_test_results['t_statistic'].append(t_stat)
    t_test_results['p_value'].append(p_value)

# Convert the results to a DataFrame
survival_alphadiversity = pd.DataFrame(t_test_results)

#Save data to results
survival_alphadiversity.to_excel(f'{vis_dir}/predicted_biomarker_survival_alphadiversity.xlsx', index=False)

# Print the results dataframe
survival_alphadiversity


Unnamed: 0,alpha_diversity,t_statistic,p_value
0,shannon_entropy,1.498005,0.140547
1,faith_pd,0.801787,0.426547


#### Species

In [47]:
# Exclude metadata columns
exclude_columns = [
    'Patient_ID', 'Stool_Consistency', 'Patient_Sex', 'Sample_Day', 
    'Recovery_Day', 'Cohort_Number', 'shannon_entropy', 'faith_pd', 'outcome'
]

# Select numeric columns that are not in the exclude list
numeric_columns = metadata_survival.select_dtypes(include='number').columns.difference(exclude_columns)

# Initialize a dictionary to store T-test results
t_test_results = {
    'species': [],
    't_statistic': [],
    'p_value': []
}

# Perform the T-test for each selected numeric feature
for column in numeric_columns:
    # Get the two groups based on outcome
    group_recovered = metadata_survival[metadata_survival['outcome'] == 'recovered'][column]
    group_died = metadata_survival[metadata_survival['outcome'] == 'died'][column]
    
    # Perform T-test
    t_stat, p_value = stats.ttest_ind(group_recovered, group_died, nan_policy='omit')
    
    # Append results to dictionary
    t_test_results['species'].append(column)
    t_test_results['t_statistic'].append(t_stat)
    t_test_results['p_value'].append(p_value)

# Convert the results to a DataFrame
t_test_df = pd.DataFrame(t_test_results)

# Remove rows where p-value is NaN
t_test_df = t_test_df.dropna(subset=['p_value'])

# Apply Benjamini-Hochberg correction (FDR control)
_, corrected_p_values, _, _ = multipletests(t_test_df['p_value'], method='fdr_bh')

# Add the corrected p-values to the DataFrame
t_test_df['corrected_p_value'] = corrected_p_values

# Filter the results for corrected p-values below 0.05
filtered_t_test_df = t_test_df[t_test_df['corrected_p_value'] < 0.05]

#Save data to results
filtered_t_test_df.to_excel(f'{vis_dir}/predicted_biomarker_survival_species.xlsx', index=False)

# Print the filtered results dataframe
filtered_t_test_df


Unnamed: 0,species,t_statistic,p_value,corrected_p_value
71,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactob...,-4.955356,8e-06,0.00098
76,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactob...,-4.954873,8e-06,0.00098
180,d__Bacteria;p__Firmicutes;c__Clostridia;o__Osc...,-4.563199,3.1e-05,0.002518
199,d__Bacteria;p__Firmicutes;c__Clostridia;o__Osc...,-3.831598,0.000345,0.016755
271,d__Bacteria;p__Verrucomicrobiota;c__Verrucomic...,-4.068649,0.000161,0.00979


A negative t-statistic indicates that the mean abundance of these taxa is higher in the group of deceased patients than in the group of recovered patients.
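
The sign follows from the argument order: with the recovered group passed first, the statistic is built from (mean of recovered) minus (mean of deceased). A minimal, dependency-free sketch of the pooled-variance statistic on hypothetical toy abundances illustrates this; the helper name `student_t` and the data are assumptions, and the notebook itself uses `stats.ttest_ind`, whose default also assumes equal variances.

```python
from statistics import mean, variance


def student_t(group_a, group_b):
    """Pooled-variance two-sample t statistic, ordered as (group_a - group_b)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical abundances of one taxon in the two outcome groups
recovered = [10, 0, 25, 5, 15]
died = [120, 80, 200]

t_recovered_first = student_t(recovered, died)
t_died_first = student_t(died, recovered)
print(f"{t_recovered_first:.2f} {t_died_first:.2f}")
```

Swapping the two groups flips the sign but leaves the magnitude, and hence the p-value, unchanged.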