Correlating gut microbiome composition with obesity and type 2 diabetes using statistical analysis.
This project investigates the relationship between gut microbiome composition and metabolic health, specifically obesity (measured by BMI) and type 2 diabetes (measured by HbA1c levels). The goal is to analyze statistical correlations between specific bacterial species and these health markers. The microbiome dataset was taken from Kaggle, and the obesity and HbA1c dataset was taken from Zenodo; the two were merged into a single dataset. Rows with missing values were dropped before the statistical analysis. The categorical microbiome column (Organism Name) was converted into dummy variables for numerical analysis.
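The preprocessing described above can be sketched on toy data. This is only an illustration: the two small tables, the `Sample ID` join key, and the organism names are hypothetical stand-ins, since the actual merge key of the Kaggle and Zenodo files is not stated here.

```python
import pandas as pd

# Hypothetical toy tables standing in for the Kaggle and Zenodo files.
microbiome_df = pd.DataFrame({
    "Sample ID": [1, 2, 3, 4],  # assumed join key, not from the source
    "Organism Name": ["Bacteroides", "Prevotella", "Bacteroides", None],
})
health_df = pd.DataFrame({
    "Sample ID": [1, 2, 3, 4],
    "BMI, kg/m2": [31.2, 24.8, 28.5, 22.0],
    "HbA1c, mmol/mol": [52.0, 36.0, None, 40.0],
})

# Merge on the shared key, then drop every row with a missing value.
merged = microbiome_df.merge(health_df, on="Sample ID", how="inner").dropna()

# One-hot encode the organism names and append the indicator columns.
dummies = pd.get_dummies(merged["Organism Name"], prefix="Organism", dtype=int)
encoded = pd.concat([merged, dummies], axis=1)
print(encoded)
```

Note that each row of `dummies` contains exactly one 1, so the indicator columns sum to 1 for every sample.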
Four parametric and four non-parametric tests were applied to explore associations.
Parametric tests:
1. Pearson Correlation: Measures linear correlation between microbiome abundance and BMI/HbA1c.
2. T-test: Compares BMI between two microbiome abundance groups.
3. ANOVA (F-test): Assesses BMI differences across multiple microbiome abundance levels.
4. Linear Regression: Determines how microbiome abundance predicts BMI.
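The four parametric tests can be tried on synthetic data as a rough sketch. The `abundance` and `bmi` arrays below are randomly generated assumptions, not values from the dataset, and `scipy.stats.linregress` is swapped in for scikit-learn's `LinearRegression` to keep the example to one library.

```python
import numpy as np
from scipy.stats import pearsonr, ttest_ind, f_oneway, linregress

rng = np.random.default_rng(0)
abundance = rng.uniform(0.0, 1.0, size=90)               # synthetic relative abundance
bmi = 25.0 + 5.0 * abundance + rng.normal(0.0, 2.0, 90)  # synthetic BMI with a linear trend

# 1. Pearson correlation between abundance and BMI.
r, r_p = pearsonr(abundance, bmi)

# 2. t-test: BMI in the low-abundance half vs. the high-abundance half.
high = abundance > np.median(abundance)
t_stat, t_p = ttest_ind(bmi[~high], bmi[high])

# 3. One-way ANOVA across abundance tertiles (levels 0, 1, 2).
level = np.digitize(abundance, np.quantile(abundance, [1 / 3, 2 / 3]))
f_stat, f_p = f_oneway(*[bmi[level == k] for k in range(3)])

# 4. Simple linear regression of BMI on abundance.
fit = linregress(abundance, bmi)

print(f"Pearson r={r:.3f} (p={r_p:.3g})")
print(f"t={t_stat:.3f} (p={t_p:.3g})")
print(f"F={f_stat:.3f} (p={f_p:.3g})")
print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.3f}")
```

For a single predictor, the regression's `rvalue` equals the Pearson coefficient, so tests 1 and 4 carry the same information in this setting.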
Non-parametric tests:
1. Spearman Correlation: Measures monotonic relationships between microbiome and BMI/HbA1c.
2. Mann-Whitney U Test: Compares BMI in high vs. low microbiome groups.
3. Kruskal-Wallis Test: Checks BMI differences across microbiome levels.
4. Kendall’s Tau: Measures rank agreement between microbiome and BMI/HbA1c, based on concordant and discordant pairs.
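A matching sketch for the four non-parametric tests, again on synthetic data. The arrays are generated assumptions with a monotonic but nonlinear trend, which is the kind of relationship rank-based tests are meant to pick up.

```python
import numpy as np
from scipy.stats import spearmanr, mannwhitneyu, kruskal, kendalltau

rng = np.random.default_rng(1)
abundance = rng.uniform(0.0, 1.0, size=90)                    # synthetic relative abundance
bmi = 25.0 + 5.0 * abundance ** 3 + rng.normal(0.0, 1.0, 90)  # monotonic, nonlinear trend

# 1. Spearman rank correlation.
rho, rho_p = spearmanr(abundance, bmi)

# 2. Mann-Whitney U: BMI in the low-abundance half vs. the high-abundance half.
high = abundance > np.median(abundance)
u_stat, u_p = mannwhitneyu(bmi[~high], bmi[high], alternative="two-sided")

# 3. Kruskal-Wallis across abundance tertiles (levels 0, 1, 2).
level = np.digitize(abundance, np.quantile(abundance, [1 / 3, 2 / 3]))
h_stat, h_p = kruskal(*[bmi[level == k] for k in range(3)])

# 4. Kendall's tau.
tau, tau_p = kendalltau(abundance, bmi)

print(f"Spearman rho={rho:.3f} (p={rho_p:.3g})")
print(f"Mann-Whitney U={u_stat:.1f} (p={u_p:.3g})")
print(f"Kruskal-Wallis H={h_stat:.3f} (p={h_p:.3g})")
print(f"Kendall tau={tau:.3f} (p={tau_p:.3g})")
```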
The study aims to provide statistical evidence linking gut microbiome composition to obesity and type 2 diabetes.

In [None]:
from google.colab import files
uploaded = files.upload()

!pip install --upgrade pandas

import pandas as pd
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau, ttest_ind, f_oneway, mannwhitneyu, kruskal
from sklearn.linear_model import LinearRegression


# Load the merged dataset (uploaded above) and drop rows with missing values.
file_name = "microbiome to obesity and T2D combine.xlsx"
data = pd.read_excel(file_name)
data = data.dropna()


# One-hot encode the organism names and append the indicator columns.
microbiome = data['Organism Name']
microbiome_dummies = pd.get_dummies(microbiome, prefix='Organism', dtype=float)
data = pd.concat([data, microbiome_dummies], axis=1)
bmi = data['BMI, kg/m2']
hba1c = data['HbA1c, mmol/mol']


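# Pearson correlation of each organism indicator with BMI and HbA1c.
# Every result is kept in pearson_results; the plain variables hold the
# last organism's values, which the summary at the end prints.
pearson_results = {}
for organism in microbiome_dummies.columns:
    indicator = data[organism].astype(float)
    pearson_corr_bmi, p_value_bmi = pearsonr(indicator, bmi)
    pearson_corr_hba1c, p_value_hba1c = pearsonr(indicator, hba1c)
    pearson_results[organism] = (pearson_corr_bmi, p_value_bmi,
                                 pearson_corr_hba1c, p_value_hba1c)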


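# Total of the indicator columns per sample. With one-hot columns this is
# always 1, so the median split below can leave one group empty.
microbiome_total = data[microbiome_dummies.columns].sum(axis=1)
print(microbiome_total.describe())

microbiome_median = microbiome_total.median()
group1 = bmi[microbiome_total <= microbiome_median]
group2 = bmi[microbiome_total > microbiome_median]

# The t-test needs two non-empty groups.
if not group1.empty and not group2.empty:
    t_stat, t_p_value = ttest_ind(group1, group2)
else:
    print("Warning: median split produced an empty group; skipping t-test.")
    t_stat, t_p_value = np.nan, np.nan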

# Bin samples into up to three quantile groups ("Low", "Medium", "High").
if microbiome_total.nunique() < 2:
    # A constant total cannot be split into groups.
    groups = [bmi]
else:
    microbiome_bins = pd.qcut(microbiome_total, q=3, labels=False, duplicates='drop')
    unique_bins = sorted(microbiome_bins.dropna().unique())
    labels = ["Low", "Medium", "High"][:len(unique_bins)]
    microbiome_bins = microbiome_bins.map(dict(zip(unique_bins, labels)))
    groups = [bmi[microbiome_bins == label] for label in labels]

# ANOVA needs at least two groups.
if len(groups) >= 2:
    anova_f_stat, anova_p_value = f_oneway(*groups)
else:
    print("Warning: fewer than two microbiome groups; skipping ANOVA.")
    anova_f_stat, anova_p_value = np.nan, np.nan


X = data[microbiome_dummies.columns]
y = bmi.values
reg = LinearRegression().fit(X, y)
regression_coef = reg.coef_



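# Rank correlations of the per-sample total with BMI and HbA1c.
# If the total is constant, these are undefined and come back as NaN.
spearman_corr_bmi, spearman_p_bmi = spearmanr(microbiome_total, bmi)
spearman_corr_hba1c, spearman_p_hba1c = spearmanr(microbiome_total, hba1c)

# Mann-Whitney U needs two non-empty groups.
if not group1.empty and not group2.empty:
    u_stat, u_p_value = mannwhitneyu(group1, group2)
else:
    u_stat, u_p_value = np.nan, np.nan

# Kruskal-Wallis needs at least two groups.
if len(groups) >= 2:
    kruskal_stat, kruskal_p_value = kruskal(*groups)
else:
    kruskal_stat, kruskal_p_value = np.nan, np.nan

kendall_corr_bmi, kendall_p_bmi = kendalltau(microbiome_total, bmi)
kendall_corr_hba1c, kendall_p_hba1c = kendalltau(microbiome_total, hba1c)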


print("Parametric Tests:")
print(f"1. Pearson Correlation (BMI): {pearson_corr_bmi}, p-value: {p_value_bmi}")
print(f"   Pearson Correlation (HbA1c): {pearson_corr_hba1c}, p-value: {p_value_hba1c}")
print(f"2. T-test: t-statistic: {t_stat}, p-value: {t_p_value}")
print(f"3. ANOVA: F-statistic: {anova_f_stat}, p-value: {anova_p_value}")
print(f"4. Linear Regression Coefficient (Microbiome vs BMI): {regression_coef}")

print("\nNon-Parametric Tests:")
print(f"1. Spearman Correlation (BMI): {spearman_corr_bmi}, p-value: {spearman_p_bmi}")
print(f"   Spearman Correlation (HbA1c): {spearman_corr_hba1c}, p-value: {spearman_p_hba1c}")
print(f"2. Mann-Whitney U: U-statistic: {u_stat}, p-value: {u_p_value}")
print(f"3. Kruskal-Wallis: H-statistic: {kruskal_stat}, p-value: {kruskal_p_value}")
print(f"4. Kendall Tau (BMI): {kendall_corr_bmi}, p-value: {kendall_p_bmi}")
print(f"   Kendall Tau (HbA1c): {kendall_corr_hba1c}, p-value: {kendall_p_hba1c}")


Saving microbiome to obesity and T2D combine.xlsx to microbiome to obesity and T2D combine (17).xlsx
count    707.0
mean       1.0
std        0.0
min        1.0
25%        1.0
50%        1.0
75%        1.0
max        1.0
dtype: float64


TypeError: at least two inputs are required; got 1.
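
The `describe()` output above (mean 1.0, std 0.0 over 707 rows) is what one-hot encoding of a single categorical column produces: each row has exactly one indicator set, so the row total is always 1. A minimal sketch on made-up organism labels shows the effect; with only one distinct total there is only one group, which is consistent with `f_oneway` receiving a single input.

```python
import pandas as pd

# Made-up organism labels, for illustration only.
organisms = pd.Series(["A", "B", "A", "C", "B"], name="Organism Name")
dummies = pd.get_dummies(organisms, prefix="Organism", dtype=int)

# Each row has exactly one indicator set, so every row total is 1.
row_total = dummies.sum(axis=1)
print(row_total.tolist())  # [1, 1, 1, 1, 1]

# A median split on a constant column puts every sample in the "<=" group.
median = row_total.median()
n_low = int((row_total <= median).sum())
n_high = int((row_total > median).sum())
print(n_low, n_high)  # 5 0
```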

In [None]:
from google.colab import files
uploaded = files.upload()


import pandas as pd
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau, ttest_ind, f_oneway, mannwhitneyu, kruskal
from sklearn.linear_model import LinearRegression


# Load the merged dataset (uploaded above) and drop rows with missing values.
file_name = "microbiome to obesity and T2D combine.xlsx"
data = pd.read_excel(file_name)
data = data.dropna()


# One-hot encode the organism names and append the indicator columns.
microbiome = data['Organism Name']
microbiome_dummies = pd.get_dummies(microbiome, prefix='Organism', dtype=float)
data = pd.concat([data, microbiome_dummies], axis=1)
bmi = data['BMI, kg/m2']
hba1c = data['HbA1c, mmol/mol']


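# Pearson correlation of each organism indicator with BMI and HbA1c.
# Every result is kept in pearson_results; the plain variables hold the
# last organism's values, which the summary at the end prints.
pearson_results = {}
for organism in microbiome_dummies.columns:
    indicator = data[organism].astype(float)
    pearson_corr_bmi, p_value_bmi = pearsonr(indicator, bmi)
    pearson_corr_hba1c, p_value_hba1c = pearsonr(indicator, hba1c)
    pearson_results[organism] = (pearson_corr_bmi, p_value_bmi,
                                 pearson_corr_hba1c, p_value_hba1c)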


# Total of the indicator columns per sample. With one-hot columns this is
# always 1, so the median split below can leave one group empty.
microbiome_total = data[microbiome_dummies.columns].sum(axis=1)
print(microbiome_total.describe())

microbiome_median = microbiome_total.median()
group1 = bmi[microbiome_total <= microbiome_median]
group2 = bmi[microbiome_total > microbiome_median]

# Both two-group tests need non-empty groups. Rescaling the median cannot
# help when every total is identical (it only moves all samples to the other
# group, so a retry loop never terminates); the tests are skipped instead.
if not group1.empty and not group2.empty:
    t_stat, t_p_value = ttest_ind(group1, group2)
    u_stat, u_p_value = mannwhitneyu(group1, group2)
else:
    print("Warning: One or both groups are empty. Skipping t-test and Mann-Whitney U test.")
    t_stat = t_p_value = u_stat = u_p_value = np.nan


# Bin samples into up to three quantile groups ("Low", "Medium", "High").
if microbiome_total.nunique() < 2:
    # A constant total cannot be split into groups.
    groups = [bmi]
else:
    microbiome_bins = pd.qcut(microbiome_total, q=3, labels=False, duplicates='drop')
    unique_bins = sorted(microbiome_bins.dropna().unique())
    labels = ["Low", "Medium", "High"][:len(unique_bins)]
    microbiome_bins = microbiome_bins.map(dict(zip(unique_bins, labels)))
    groups = [bmi[microbiome_bins == label] for label in labels]

# ANOVA needs at least two groups.
if len(groups) >= 2:
    anova_f_stat, anova_p_value = f_oneway(*groups)
else:
    print("Warning: fewer than two microbiome groups; skipping ANOVA.")
    anova_f_stat, anova_p_value = np.nan, np.nan

X = data[microbiome_dummies.columns]
y = bmi.values
reg = LinearRegression().fit(X, y)
regression_coef = reg.coef_



# Rank correlations of the per-sample total with BMI and HbA1c.
# If the total is constant, these are undefined and come back as NaN.
spearman_corr_bmi, spearman_p_bmi = spearmanr(microbiome_total, bmi)
spearman_corr_hba1c, spearman_p_hba1c = spearmanr(microbiome_total, hba1c)

# Mann-Whitney U was already computed (or skipped) with the two-group check above.

# Kruskal-Wallis needs at least two groups.
if len(groups) >= 2:
    kruskal_stat, kruskal_p_value = kruskal(*groups)
else:
    kruskal_stat, kruskal_p_value = np.nan, np.nan

kendall_corr_bmi, kendall_p_bmi = kendalltau(microbiome_total, bmi)
kendall_corr_hba1c, kendall_p_hba1c = kendalltau(microbiome_total, hba1c)


print("Parametric Tests:")
print(f"1. Pearson Correlation (BMI): {pearson_corr_bmi}, p-value: {p_value_bmi}")
print(f"   Pearson Correlation (HbA1c): {pearson_corr_hba1c}, p-value: {p_value_hba1c}")
print(f"2. T-test: t-statistic: {t_stat}, p-value: {t_p_value}")
print(f"3. ANOVA: F-statistic: {anova_f_stat}, p-value: {anova_p_value}")
print(f"4. Linear Regression Coefficient (Microbiome vs BMI): {regression_coef}")

print("\nNon-Parametric Tests:")
print(f"1. Spearman Correlation (BMI): {spearman_corr_bmi}, p-value: {spearman_p_bmi}")
print(f"   Spearman Correlation (HbA1c): {spearman_corr_hba1c}, p-value: {spearman_p_hba1c}")
print(f"2. Mann-Whitney U: U-statistic: {u_stat}, p-value: {u_p_value}")
print(f"3. Kruskal-Wallis: H-statistic: {kruskal_stat}, p-value: {kruskal_p_value}")
print(f"4. Kendall Tau (BMI): {kendall_corr_bmi}, p-value: {kendall_p_bmi}")
print(f"   Kendall Tau (HbA1c): {kendall_corr_hba1c}, p-value: {kendall_p_hba1c}")


Saving microbiome to obesity and T2D combine.xlsx to microbiome to obesity and T2D combine (26).xlsx
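
Because the dataset records an organism name per row rather than an abundance value, one possible way to get well-defined groups is to compare BMI across organisms directly. The sketch below is an assumption about how that could look, not the method used above; the toy table, its organism labels, and the group means are synthetic.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(2)

# Synthetic stand-in: 20 samples for each of three made-up organisms.
toy = pd.DataFrame({
    "Organism Name": np.repeat(["A", "B", "C"], 20),
    "BMI, kg/m2": np.concatenate([
        rng.normal(24.0, 2.0, 20),
        rng.normal(27.0, 2.0, 20),
        rng.normal(30.0, 2.0, 20),
    ]),
})

# One BMI array per organism; organisms with fewer than two samples are skipped.
groups = [
    g["BMI, kg/m2"].to_numpy()
    for _, g in toy.groupby("Organism Name")
    if len(g) >= 2
]

# Both tests need at least two groups.
if len(groups) >= 2:
    f_stat, f_p = f_oneway(*groups)
    h_stat, h_p = kruskal(*groups)
else:
    f_stat = f_p = h_stat = h_p = np.nan

print(f"ANOVA F={f_stat:.3f} (p={f_p:.3g})")
print(f"Kruskal-Wallis H={h_stat:.3f} (p={h_p:.3g})")
```

Grouping by organism avoids the constant row total entirely, at the cost of testing "does BMI differ between organisms" rather than "does BMI change with abundance".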
