
# 04. Beta Diversity 

Author: Marc Kesselring


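This Jupyter notebook analyzes the beta diversity of the samples: the ordination plots are inspected visually, group differences are tested with adonis and pairwise PERMANOVA, and the association with `Cohort_Number` is checked with a beta correlation.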

**Exercise overview:**<br>
[1. Setup](#setup)<br>
[2. Visual Inspection](#inspection)<br>
[3. Statistical Analysis](#stat)<br>
[4. Beta correlation](#corr)<br>

<a id='setup'></a>

## 1. Setup

In [2]:
# importing all required packages & notebook extensions at the start of the notebook
import os
import biom
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization
%matplotlib inline

In [3]:
# directory paths used throughout the notebook
raw_data_dir = "../data/raw"
data_dir = "../data/processed"
vis_dir  = "../results"

<a id='inspection'></a>

## 2. Visual Inspection

In [4]:
Visualization.load(f"{data_dir}/core-metrics-results-bt/weighted_unifrac_emperor.qzv")

In [5]:
Visualization.load(f"{data_dir}/core-metrics-results-bt/bray_curtis_emperor.qzv")
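
For orientation, the Bray-Curtis dissimilarity behind the second Emperor plot can be worked out by hand as the sum of absolute count differences divided by the sum of all counts. The sketch below is only an illustration: the helper name `bray_curtis` and the feature counts are made up and do not come from this study.

```python
import numpy as np

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two count vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.abs(x - y).sum() / (x + y).sum()

# Made-up feature counts for two hypothetical samples
sample_a = [10, 0, 5, 5]
sample_b = [5, 5, 5, 5]

print(bray_curtis(sample_a, sample_b))  # 0.25
```

A value of 0 means the two samples have identical counts, and 1 means they share no features.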

<a id='stat'></a>

## 3. Statistical Analysis

In [11]:
# Fill missing values with 0 (note: fillna(0) applies to every column, not only the recovery-day column)
md = pd.read_csv(f"{data_dir}/metadata.tsv", sep='\t')
md = md.fillna(0)
md.to_csv(f"{data_dir}/metadata_fillna.tsv", sep='\t', index=False)
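
Because `fillna(0)` replaces gaps in every column, it may help to see how the fill could be limited to a single column. This is a sketch on toy data, not the author's method: the column names mirror the adonis formula used below, while the values and the `sample-id` column are invented.

```python
import numpy as np
import pandas as pd

# Toy metadata; column names follow the adonis formula, values are invented
toy_md = pd.DataFrame({
    "sample-id": ["s1", "s2", "s3"],
    "Recovery_Day": [3.0, np.nan, 7.0],
    "Stool_Consistency": ["formed", None, "liquid"],
})

# Passing a dict fills only the listed column and leaves other gaps visible
filled = toy_md.fillna({"Recovery_Day": 0})

print(filled["Recovery_Day"].tolist())                 # [3.0, 0.0, 7.0]
print(int(filled["Stool_Consistency"].isna().sum()))   # 1
```

Whether gaps in other columns should be filled, dropped, or kept is a study-specific decision that this sketch does not settle.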

##### Using `qiime diversity adonis` to test for statistically significant differences in beta diversity with the R formula `Cohort_Number*Stool_Consistency*Patient_Sex*Sample_Day*Recovery_Day`
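
In an R formula the `*` operator stands for all main effects plus all of their interactions, so five factors joined by `*` describe 31 candidate terms. The sketch below enumerates them; `expand_star_formula` is a hypothetical helper written for this illustration, and the number of rows adonis actually reports may differ, for example if some terms cannot be estimated from the data.

```python
from itertools import combinations

def expand_star_formula(formula):
    """List the terms that an R formula made only of `*` operators expands to."""
    factors = [f.strip() for f in formula.split("*")]
    terms = []
    # order 1 = main effects, order 2 = two-way interactions, and so on
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            terms.append(":".join(combo))
    return terms

terms = expand_star_formula(
    "Cohort_Number*Stool_Consistency*Patient_Sex*Sample_Day*Recovery_Day"
)
print(len(terms))  # 31
print(terms[:6])
```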

In [29]:
! qiime diversity adonis \
--i-distance-matrix $data_dir/core-metrics-results-bt/weighted_unifrac_distance_matrix.qza \
--m-metadata-file $data_dir/metadata_fillna.tsv \
--p-formula Cohort_Number*Stool_Consistency*Patient_Sex*Sample_Day*Recovery_Day \
--o-visualization $data_dir/core-metrics-results-bt/beta-diversity/weighted_unifrac_adonis.qzv

Saved Visualization to: ../data/processed/core-metrics-results-bt/beta-diversity/weighted_unifrac_adonis.qzv

In [30]:
! qiime diversity adonis \
--i-distance-matrix $data_dir/core-metrics-results-bt/bray_curtis_distance_matrix.qza \
--m-metadata-file $data_dir/metadata_fillna.tsv \
--p-formula Cohort_Number*Stool_Consistency*Patient_Sex*Sample_Day*Recovery_Day \
--o-visualization $data_dir/core-metrics-results-bt/beta-diversity/bray_curtis_adonis.qzv

Saved Visualization to: ../data/processed/core-metrics-results-bt/beta-diversity/bray_curtis_adonis.qzv

### Viewing the generated qzv files and downloading their tsv files for further analysis

In [31]:
Visualization.load(f"{data_dir}/core-metrics-results-bt/beta-diversity/weighted_unifrac_adonis.qzv")

In [32]:
Visualization.load(f"{data_dir}/core-metrics-results-bt/beta-diversity/bray_curtis_adonis.qzv")

### Adjusting for multiple testing with the Bonferroni correction in pandas, after uploading the tsv files into this environment

In [33]:
# Load adonis result
df_unifrac = pd.read_csv(f"{data_dir}/core-metrics-results-bt/beta-diversity/adonis_weighted_unifrac.tsv", sep="\t")

# Add a Bonferroni-adjusted p-value column
num_tests = len(df_unifrac)  # Number of rows in the table, used as the number of tests (rows without a p-value are counted too)
df_unifrac['p_value_bonferroni'] = df_unifrac['Pr(>F)'] * num_tests

# Ensure adjusted p-values do not exceed 1
df_unifrac['p_value_bonferroni'] = df_unifrac['p_value_bonferroni'].clip(upper=1)

# Save the adjusted results to a new file
df_unifrac.to_csv(f"{data_dir}/core-metrics-results-bt/beta-diversity/adonis_weighted_unifrac_bonferroni.tsv", sep='\t', index=False)

print(df_unifrac[df_unifrac['p_value_bonferroni']<0.05])

               Df  SumsOfSqs   MeanSqs    F.Model        R2  Pr(>F)  \
Cohort_Number   1   5.262276  5.262276  14.615373  0.127197   0.001   

               p_value_bonferroni  
Cohort_Number               0.025  
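
The Bonferroni step multiplies each p-value by the number of tests and caps the result at 1. The cell above takes the number of table rows as the number of tests. The sketch below shows an alternative that counts only rows that actually carry a p-value. It is not the author's method; `bonferroni_adjust` is a hypothetical helper and the p-values are invented.

```python
import numpy as np
import pandas as pd

def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of non-missing p-values, capped at 1."""
    p = pd.Series(p_values, dtype=float)
    n_tests = int(p.notna().sum())  # rows without a p-value are not counted as tests
    return (p * n_tests).clip(upper=1.0)

# Invented p-values; NaN stands in for a row that carries no test
toy_p = pd.Series([0.001, 0.02, 0.4, np.nan], index=["A", "B", "A:B", "Residuals"])
adjusted = bonferroni_adjust(toy_p)
print(adjusted)
```

With three real tests, 0.001 becomes 0.003, 0.4 is capped at 1, and the missing entry stays missing. Counting fewer tests makes the correction less conservative, so the choice should be stated alongside the results.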


In [34]:
# Load adonis result
df_bray_curtis = pd.read_csv(f"{data_dir}/core-metrics-results-bt/beta-diversity/adonis_bray_curtis.tsv", sep="\t")

# Add a Bonferroni-adjusted p-value column
num_tests = len(df_bray_curtis)  # Number of rows in the table, used as the number of tests (rows without a p-value are counted too)
df_bray_curtis['p_value_bonferroni'] = df_bray_curtis['Pr(>F)'] * num_tests

# Ensure adjusted p-values do not exceed 1
df_bray_curtis['p_value_bonferroni'] = df_bray_curtis['p_value_bonferroni'].clip(upper=1)

# Save the adjusted results to a new file
df_bray_curtis.to_csv(f"{data_dir}/core-metrics-results-bt/beta-diversity/adonis_bray_curtis_bonferroni.tsv", sep='\t', index=False)

print(df_bray_curtis[df_bray_curtis['p_value_bonferroni']<0.05])

               Df  SumsOfSqs   MeanSqs   F.Model        R2  Pr(>F)  \
Cohort_Number   1   1.409494  1.409494  3.128293  0.033285   0.001   

               p_value_bonferroni  
Cohort_Number               0.025  


### Pairwise PERMANOVA on column `Cohort_Number` to obtain group significance plots

In [13]:
# Map the numeric Cohort_Number codes to categorical labels
metadata = pd.read_csv(f"{data_dir}/metadata.tsv", sep='\t')
metadata['Cohort_Number_Bin'] = metadata['Cohort_Number'].map({1: 'Abduction', 2: 'Recovery'})
metadata.to_csv(f"{data_dir}/metadata_binned.tsv", sep='\t', index=False)
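
`Series.map` with a dict returns NaN for any value that is not a key, so a quick check for unmapped codes can catch samples that would otherwise end up without a group label. A minimal sketch on invented cohort codes, where the code 3 is deliberately left out of the mapping:

```python
import pandas as pd

# Invented cohort codes; 3 is deliberately absent from the mapping
toy = pd.DataFrame({"Cohort_Number": [1, 2, 2, 3]})
labels = {1: "Abduction", 2: "Recovery"}

toy["Cohort_Number_Bin"] = toy["Cohort_Number"].map(labels)

# Series.map leaves NaN wherever the key is missing from the dict
unmapped = toy.loc[toy["Cohort_Number_Bin"].isna(), "Cohort_Number"].unique()
print(unmapped.tolist())  # [3]
```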

##### Weighted UniFrac

In [36]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results-bt/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/metadata_binned.tsv \
    --m-metadata-column Cohort_Number_Bin \
    --p-pairwise \
    --o-visualization $data_dir/core-metrics-results-bt/beta-correlation/weighted_unifrac-Cohort-number-significance.qzv

Saved Visualization to: ../data/processed/core-metrics-results-bt/beta-correlation/weighted_unifrac-Cohort-number-significance.qzv

In [40]:
Visualization.load(f"{data_dir}/core-metrics-results-bt/beta-correlation/weighted_unifrac-Cohort-number-significance.qzv")
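
To make the PERMANOVA test less of a black box, the sketch below computes the pseudo-F statistic from a distance matrix and estimates a p-value by shuffling the group labels. It is a simplified NumPy illustration under stated assumptions, not the QIIME 2 implementation: the function names, the seed, and the one-dimensional toy data are all made up.

```python
import numpy as np

def pseudo_f(dist, groups):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    dist = np.asarray(dist, dtype=float)
    groups = np.asarray(groups)
    n = len(groups)
    labels = np.unique(groups)
    upper = np.triu_indices(n, k=1)
    ss_total = (dist[upper] ** 2).sum() / n
    ss_within = 0.0
    for g in labels:
        idx = np.flatnonzero(groups == g)
        sub = dist[np.ix_(idx, idx)]
        ss_within += (np.triu(sub, k=1) ** 2).sum() / len(idx)
    ss_among = ss_total - ss_within
    a = len(labels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(dist, groups, permutations=999, seed=0):
    """Observed pseudo-F and a permutation p-value obtained by shuffling labels."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    f_obs = pseudo_f(dist, groups)
    f_perm = np.array(
        [pseudo_f(dist, rng.permutation(groups)) for _ in range(permutations)]
    )
    p_value = (np.sum(f_perm >= f_obs) + 1) / (permutations + 1)
    return f_obs, p_value

# Toy example: six samples on a line, two well separated groups
coords = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
toy_dist = np.abs(coords[:, None] - coords[None, :])
toy_groups = np.array(["Abduction"] * 3 + ["Recovery"] * 3)

f_stat, p_val = permanova(toy_dist, toy_groups)
print(f_stat)  # 150.0
print(p_val)
```

With only six samples there are few distinct label arrangements, so the permutation p-value cannot become very small even though the groups are clearly separated.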

##### Bray-Curtis

In [41]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results-bt/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/metadata_binned.tsv \
    --m-metadata-column Cohort_Number_Bin \
    --p-pairwise \
    --o-visualization $data_dir/core-metrics-results-bt/beta-correlation/bray_curtis-Cohort-number-significance.qzv

Saved Visualization to: ../data/processed/core-metrics-results-bt/beta-correlation/bray_curtis-Cohort-number-significance.qzv

In [42]:
Visualization.load(f"{data_dir}/core-metrics-results-bt/beta-correlation/bray_curtis-Cohort-number-significance.qzv")

<a id='corr'></a>

## 4. Beta correlation

### Testing column Cohort_Number for beta correlation

In [43]:
! qiime diversity beta-correlation \
    --i-distance-matrix $data_dir/core-metrics-results-bt/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/metadata.tsv \
    --m-metadata-column Cohort_Number \
    --p-intersect-ids \
    --o-metadata-distance-matrix $data_dir/core-metrics-results-bt/beta-correlation/weighted_unifrac_spearman.qza \
    --o-mantel-scatter-visualization $data_dir/core-metrics-results-bt/beta-correlation/weighted_unifrac_scatter-plot.qzv

Saved DistanceMatrix to: ../data/processed/core-metrics-results-bt/beta-correlation/weighted_unifrac_spearman.qza
Saved Visualization to: ../data/processed/core-metrics-results-bt/beta-correlation/weighted_unifrac_scatter-plot.qzv

In [44]:
Visualization.load(f"{data_dir}/core-metrics-results-bt/beta-correlation/weighted_unifrac_scatter-plot.qzv")

In [45]:
! qiime diversity beta-correlation \
    --i-distance-matrix $data_dir/core-metrics-results-bt/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/metadata.tsv \
    --m-metadata-column Cohort_Number \
    --p-intersect-ids \
    --o-metadata-distance-matrix $data_dir/core-metrics-results-bt/beta-correlation/bray_curtis_spearman.qza \
    --o-mantel-scatter-visualization $data_dir/core-metrics-results-bt/beta-correlation/bray_curtis_scatter-plot.qzv

Saved DistanceMatrix to: ../data/processed/core-metrics-results-bt/beta-correlation/bray_curtis_spearman.qza
Saved Visualization to: ../data/processed/core-metrics-results-bt/beta-correlation/bray_curtis_scatter-plot.qzv

In [46]:
Visualization.load(f"{data_dir}/core-metrics-results-bt/beta-correlation/bray_curtis_scatter-plot.qzv")
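
The beta-correlation step turns the numeric metadata column into its own distance matrix and compares it with the beta-diversity distances through a Mantel test. The sketch below illustrates that idea with a Spearman-based Mantel test in NumPy and pandas. It is an approximation for intuition only, not the QIIME 2 code: the function names, the seed, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    rx = pd.Series(x).rank().to_numpy()
    ry = pd.Series(y).rank().to_numpy()
    return np.corrcoef(rx, ry)[0, 1]

def mantel_spearman(d1, d2, permutations=999, seed=0):
    """Two-sided Mantel test between two square distance matrices."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    n = d1.shape[0]
    upper = np.triu_indices(n, k=1)
    rho_obs = spearman(d1[upper], d2[upper])
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(permutations):
        perm = rng.permutation(n)
        d1_perm = d1[np.ix_(perm, perm)]  # shuffle rows and columns together
        if abs(spearman(d1_perm[upper], d2[upper])) >= abs(rho_obs):
            hits += 1
    return rho_obs, (hits + 1) / (permutations + 1)

# Toy data: invented cohort codes and invented one-dimensional "community" positions
cohort = np.array([1, 1, 1, 2, 2, 2])
coords = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
meta_dist = np.abs(cohort[:, None] - cohort[None, :])
beta_dist = np.abs(coords[:, None] - coords[None, :])

rho, p_value = mantel_spearman(beta_dist, meta_dist)
print(rho, p_value)
```

Because `Cohort_Number` takes only two values, its distance matrix contains only zeros and ones, so the Mantel correlation here effectively asks whether between-cohort pairs are farther apart than within-cohort pairs.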

##### Result: weak positive correlation for column `Cohort_Number`