# Multiomics BMI Paper — Gut Microbiome-based Obesity Classifier in the Arivale Cohort

***Analyzed by Tomasz Wilmanski originally, and modified by Kengo Watanabe***  

This Jupyter Notebook (Python 3 kernel) generates the random forest models that classify participants as normal vs. obese (based on either BMI or MetBMI) from the Arivale baseline gut microbiome dataset, and calculates the class predictions (label and probability) for the Arivale cohort from the testing (hold-out) sets. Because DeLong's test did not appear to be available in Python at the time, it was performed in a separate sub-notebook with an R kernel.  

Input files:  
* Arivale baseline gut microbiome taxon abundances: 220902_Multiomics-BMI-NatMed1stRevision_Microbiome-DataCleaning_AlphaDiversity-and-TaxonAbundance_final.tsv  
* Arivale baseline BMI and MetBMI: 220803_Multiomics-BMI-NatMed1stRevision_DeltaBMI-misclassification_biologicalBMI-baseline-summary-BothSex.tsv  
* ROC curve of the classifiers: 221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-DeLong-ver5_Arivale-wenceslaus_\[BMI/MetBMI\]-ROC-curve.tsv  
* DeLong's test result: 221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-DeLong-ver5_Arivale-wenceslaus_result-summary.tsv  

Output figures and tables:  
* Figure 4c, 4d  
* Tables for Supplementary Data 1, 10  
* Intermediate tables for the sub-notebook (class predictions)  

Original notebook (memo for my future tracing):  
* wenceslaus:\[JupyterLab HOME\]/220621_Multiomics-BMI-NatMedRevision/221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus.ipynb  

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#For Arial font
#!conda install -c conda-forge -y mscorefonts
##-> The below was also needed in matplotlib 3.4.2
#import shutil
#import matplotlib
#shutil.rmtree(matplotlib.get_cachedir())
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display
import time

import random
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats import multitest as multi
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from decimal import Decimal, ROUND_HALF_UP
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, precision_recall_curve, auc
from sklearn.metrics import roc_auc_score, average_precision_score
from statsmodels.stats import weightstats

!conda list

# packages in environment at /opt/conda/envs/arivale-py3:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
analytics                 0.1                      pypi_0    pypi
argon2-cffi               21.1.0           py39h3811e60_0    conda-forge
arivale-data-interface    0.1.0                    pypi_0    pypi
async_generator           1.10                       py_0    conda-forge
atk-1.0                   2.36.0               h3371d22_4    conda-forge
attrs                     21.2.0             pyhd8ed1ab_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
biopython                 1.79             py39h3811e60_0    conda-forge
bleach 

## 1. Data preparation

> The necessary files were copied from the dalek server in advance.  

### 1-1. Gut microbiome taxon abundance

In [None]:
#Import cleaned table for baseline gut microbiome data
fileDir = './ImportData/'
ipynbName = '220902_Multiomics-BMI-NatMed1stRevision_Microbiome-DataCleaning_'
fileName = 'AlphaDiversity-and-TaxonAbundance_final.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('public_client_id')
print('Before:', tempDF.shape)

#Take only the taxon abundances
tempL1 = ['Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
tempL2 = []
for taxon_rank in tempL1:
    tempL = tempDF.loc[:, tempDF.columns.str.contains(taxon_rank+':')].columns.tolist()
    tempL = sorted(tempL)
    for taxon in tempL:
        tempL2.append(taxon)
tempDF = tempDF[tempL2]
print('After taking the target taxonomic ranks:', tempDF.shape)
tempDF1 = tempDF.columns.to_series().str.split(pat=':', expand=True)
display(tempDF1[0].value_counts())

display(tempDF)

biomeDF = tempDF

### 1-2. BMI and omics-inferred BMI classes

In [None]:
#Import cleaned table for baseline measured and biological BMIs
fileDir = './ImportData/'
ipynbName = '220803_Multiomics-BMI-NatMed1stRevision_DeltaBMI-misclassification_'
fileName = 'biologicalBMI-baseline-summary-BothSex.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('public_client_id')
print('Original:', len(tempDF))

#Take the participants having gut microbiome data
tempDF = tempDF.loc[tempDF.index.isin(biomeDF.index.tolist())]
print(' -> with gut microbiome data:', len(tempDF))

#Clean to handle easier in this notebook
tempDF.columns = tempDF.columns.str.replace('Base', '')

#Select the BMI class and covariates (just for the display in Jupyter notebook)
tempL1 = tempDF.loc[:, tempDF.columns.str.contains('_class')].columns.tolist()
tempL2 = ['BMI', 'Sex', 'Age', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
tempL = [col_n for sublist in [tempL1, tempL2] for col_n in sublist]
tempDF = tempDF[tempL]

display(tempDF)
tempL = []
for bmi_class in tempL1:
    tempL.append(tempDF[bmi_class].value_counts())
tempDF1 = pd.concat(tempL, axis=1)
tempDF1 = tempDF1.sort_index(ascending=True)
display(tempDF1)

bmiDF = tempDF

### 1-3. Split the cohort into 5 sets

> In this case, the split sets differ between the BMI and MetBMI classes. Hence, a randomization step is added to reduce the bias caused by participants common to both targets, which would arise if the sets were simply defined by row order.  
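
The shuffle-then-split idea described above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's exact code: `toy_df`, `shuffled_ids`, and `testing_set` are hypothetical names, and the sketch splits the shuffled ID list rather than the DataFrame itself.

```python
import random

import numpy as np
import pandas as pd

# Hypothetical toy cohort: 12 participants with a class label.
toy_df = pd.DataFrame(
    {"BMI_class": ["Normal", "Obese"] * 6},
    index=[f"P{i:02d}" for i in range(12)],
)

random.seed(123)  # Fixed seed so the shuffle is reproducible
shuffled_ids = random.sample(toy_df.index.tolist(), len(toy_df))

# Split the shuffled IDs into 5 nearly equal hold-out sets.
n_models = 5
id_chunks = np.array_split(np.array(shuffled_ids), n_models)
testing_set = pd.Series(
    {str(pid): f"Model_{k + 1:02d}" for k, chunk in enumerate(id_chunks) for pid in chunk},
    name="Testing_BMI",
)
print(testing_set.value_counts().sort_index())
```

Each participant appears in exactly one hold-out set, and the set sizes differ by at most one.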

In [None]:
tempD1 = {'BothSex':bmiDF}
yvarL = ['BMI', 'MetBMI']#Classifier target in this study; fixed throughout this notebook
classD = {'Normal':0, 'Obese':1}#Target class label and its code in this study; fixed throughout this notebook
nmodels = 5#Fixed throughout this notebook
random.seed(123)#For reproducibility (while checking the independence of the sets below)

tempD2 = {}
for sex in tempD1.keys():
    tempD3 = {}
    for yvar in yvarL:
        #Select target classes for the classifier
        tempDF = tempD1[sex]
        tempDF = tempDF.loc[tempDF[yvar+'_class'].isin(list(classD.keys()))]
        #Randomize the row order
        tempL = tempDF.index.tolist()
        tempL = random.sample(tempL, len(tempL))
        tempDF = tempDF.loc[tempL]
        #Split cohort to define the training and testing (hold-out) sets
        tempL = np.array_split(tempDF, nmodels)#List of DFs
        tempD = {}
        for model_k in range(nmodels):
            tempDF1 = tempL[model_k]
            model_n = 'Model_'+str(model_k+1).zfill(2)
            tempS = pd.Series(np.repeat(model_n, len(tempDF1)),
                              index=tempDF1.index, name='Testing_'+yvar)
            tempD[model_k] = tempS
        tempS = pd.concat(list(tempD.values()), axis=0)
        tempD3[yvar] = tempS
    tempDF = pd.concat(list(tempD3.values()), axis=1)#NaN for out-of-target participant
    #Add the info to bmiDF
    tempDF1 = tempD1[sex]
    tempDF = pd.merge(tempDF1, tempDF, left_index=True, right_index=True, how='left')
    tempD2[sex] = tempDF
    
    print(sex)
    display(tempDF)
    for yvar in yvarL:
        tempDF1 = tempDF.loc[~tempDF['Testing_'+yvar].isnull()]
        print(' - '+yvar)
        display(tempDF1.describe(include='all'))
        display(tempDF1['Testing_'+yvar].value_counts())
    print('')
#Update
bmiDF = tempD2['BothSex']

### 1-4. Check independence among the sets

> Perform Pearson's chi-squared test for categorical variables (using scipy library) and ANOVA for numeric variables (using statsmodels library). Note that scipy.stats.f_oneway() doesn't report degrees of freedom.  
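
As a rough illustration of the two tests named above, the sketch below runs them on hypothetical toy data using scipy only. Because `scipy.stats.f_oneway()` returns just the F statistic and P-value, the degrees of freedom are derived by hand here; the notebook itself obtains them from the statsmodels ANOVA table instead.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(123)
# Hypothetical toy data: 100 participants assigned to 5 sets.
toy_df = pd.DataFrame({
    "Set": np.repeat([f"Model_{k + 1:02d}" for k in range(5)], 20),
    "Sex": rng.choice(["F", "M"], size=100),
    "Age": rng.normal(50, 10, size=100),
})

# Pearson's chi-squared test (no Yates' correction) on the Sex x Set table.
table = pd.crosstab(toy_df["Sex"], toy_df["Set"])
chi2, chi2_p, chi2_dof, _ = stats.chi2_contingency(table, correction=False)

# One-way ANOVA; f_oneway returns only F and P, so derive the DoF by hand.
groups = [g["Age"].to_numpy() for _, g in toy_df.groupby("Set")]
f_stat, anova_p = stats.f_oneway(*groups)
dof_between = len(groups) - 1           # k - 1
dof_within = len(toy_df) - len(groups)  # N - k
print(chi2_dof, (dof_between, dof_within))
```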

In [None]:
#Compare the sets per classifier target (i.e., 5 sets per comparison)
tempD1 = {'BothSex':bmiDF}
for sex in tempD1.keys():
    tempD2 = {}
    for yvar in yvarL:
        tempDF = tempD1[sex]
        tempDF = tempDF.loc[~tempDF['Testing_'+yvar].isnull()]
        tempL1 = []#For variable name
        tempL2 = []#Test name
        tempL3 = []#For degrees of freedom
        tempL4 = []#For test statistic
        tempL5 = []#For P-value
        #Categorical variables
        #tempL = tempDF.select_dtypes(exclude=[np.number]).columns.tolist()
        tempL = [yvar+'_class', 'Sex']
        for col_n in tempL:
            if (col_n=='Sex')&(sex in ['Female', 'Male']):
                continue
            else:
                tempDF1 = pd.crosstab(tempDF[col_n], tempDF['Testing_'+yvar])
                #Pearson's chi-squared test
                chi2, pval, dof, tempA = stats.chi2_contingency(tempDF1, correction=False)
                tempL1.append(col_n)
                tempL2.append('Pearson\'s chi-squared test')
                tempL3.append(dof)
                tempL4.append(chi2)
                tempL5.append(pval)
        #Numeric variables
        tempL = tempDF.select_dtypes(include=[np.number]).columns.tolist()
        for col_n in tempL:
            #ANOVA
            formula = col_n+' ~ C(Testing_'+yvar+')'
            model = smf.ols(formula, data=tempDF).fit()
            tempDF1 = anova_lm(model, typ=1)#ANOVA type doesn't matter in this case
            tempL1.append(col_n)
            tempL2.append('ANOVA')
            dof1 = tempDF1['df'].astype('int64').loc['C(Testing_'+yvar+')']#Between-groups
            dof2 = tempDF1['df'].astype('int64').loc['Residual']#Within-groups
            tempL3.append((dof1, dof2))
            tempL4.append(tempDF1['F'].loc['C(Testing_'+yvar+')'])
            tempL5.append(tempDF1['PR(>F)'].loc['C(Testing_'+yvar+')'])
        tempDF = pd.DataFrame({'Variable':tempL1,
                               'StatisticalTest':tempL2,
                               'N':len(tempDF),
                               'DoF':tempL3,
                               'Statistic':tempL4,
                               'Pval':tempL5})
        #P-value adjustment (within classifier target) by using Benjamini–Hochberg method
        tempDF['AdjPval_within'] = multi.multipletests(tempDF['Pval'], alpha=0.05, method='fdr_bh',
                                                       is_sorted=False, returnsorted=False)[1]
        tempDF['Classifier'] = yvar+' class'
        tempD2[yvar] = tempDF
    tempDF = pd.concat(list(tempD2.values()), axis=0)
    #P-value adjustment (across all tests) by using Benjamini–Hochberg method
    tempDF['AdjPval_all'] = multi.multipletests(tempDF['Pval'], alpha=0.05, method='fdr_bh',
                                                is_sorted=False, returnsorted=False)[1]
    tempDF = tempDF.set_index(['Classifier', 'Variable'])
    print(sex)
    display(tempDF)
    #Save
    fileDir = './ExportData/'
    ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
    fileName = 'split-sets-independency_within-classifier-'+sex+'.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

In [None]:
#Compare the sets across classifier targets per BMI class (i.e., 10 sets per comparison)
tempD1 = {'BothSex':bmiDF}
for sex in tempD1.keys():
    tempD2 = {}
    for bmi_class in classD.keys():
        tempD = {}
        for yvar in yvarL:
            tempDF = tempD1[sex]
            tempDF = tempDF.loc[tempDF[yvar+'_class']==bmi_class].copy()#Copy to avoid assigning a new column to a slice
            tempDF['Testing'] = yvar+'_'+tempDF['Testing_'+yvar]
            tempDF = tempDF.loc[:, ~tempDF.columns.str.contains('Testing_')]
            tempD[yvar] = tempDF
        tempDF = pd.concat(list(tempD.values()), axis=0)
        tempL1 = []#For variable name
        tempL2 = []#Test name
        tempL3 = []#For degrees of freedom
        tempL4 = []#For test statistic
        tempL5 = []#For P-value
        #Categorical variables
        #tempL = tempDF.select_dtypes(exclude=[np.number]).columns.tolist()
        tempL = ['Sex']
        for col_n in tempL:
            if (col_n=='Sex')&(sex in ['Female', 'Male']):
                continue
            else:
                tempDF1 = pd.crosstab(tempDF[col_n], tempDF['Testing'])
                #Pearson's chi-squared test
                chi2, pval, dof, tempA = stats.chi2_contingency(tempDF1, correction=False)
                tempL1.append(col_n)
                tempL2.append('Pearson\'s chi-squared test')
                tempL3.append(dof)
                tempL4.append(chi2)
                tempL5.append(pval)
        #Numeric variables
        tempL = tempDF.select_dtypes(include=[np.number]).columns.tolist()
        for col_n in tempL:
            #ANOVA
            formula = col_n+' ~ C(Testing)'
            model = smf.ols(formula, data=tempDF).fit()
            tempDF1 = anova_lm(model, typ=1)#ANOVA type doesn't matter in this case
            tempL1.append(col_n)
            tempL2.append('ANOVA')
            dof1 = tempDF1['df'].astype('int64').loc['C(Testing)']#Between-groups
            dof2 = tempDF1['df'].astype('int64').loc['Residual']#Within-groups
            tempL3.append((dof1, dof2))
            tempL4.append(tempDF1['F'].loc['C(Testing)'])
            tempL5.append(tempDF1['PR(>F)'].loc['C(Testing)'])
        tempDF = pd.DataFrame({'Variable':tempL1,
                               'StatisticalTest':tempL2,
                               'N':len(tempDF),
                               'DoF':tempL3,
                               'Statistic':tempL4,
                               'Pval':tempL5})
        #P-value adjustment (within BMI class) by using Benjamini–Hochberg method
        tempDF['AdjPval_within'] = multi.multipletests(tempDF['Pval'], alpha=0.05, method='fdr_bh',
                                                       is_sorted=False, returnsorted=False)[1]
        tempDF['BMIorMetBMIclass'] = bmi_class
        tempD2[bmi_class] = tempDF
    tempDF = pd.concat(list(tempD2.values()), axis=0)
    #P-value adjustment (across all tests) by using Benjamini–Hochberg method
    tempDF['AdjPval_all'] = multi.multipletests(tempDF['Pval'], alpha=0.05, method='fdr_bh',
                                                is_sorted=False, returnsorted=False)[1]
    tempDF = tempDF.set_index(['BMIorMetBMIclass', 'Variable'])
    print(sex)
    display(tempDF)
    #Save
    fileDir = './ExportData/'
    ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
    fileName = 'split-sets-independency_across-classifiers-'+sex+'.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

## 2. Preprocessing

### 2-1. Standardization

> Although standardization is unnecessary for RF classifiers in general, certain standardization would be required to harmonize the Arivale and TwinsUK datasets.  
> -> In this version, harmonization is not needed because the classifiers are generated per cohort. However, standardization is still performed as a prerequisite for PCA. Of note, log-transformation is applied before standardization because it was not done in the data cleaning notebook.  
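
The log-then-Z-score order described above can be sketched with plain NumPy and pandas. The data and the names `toy_abund`, `log_abund`, and `z_abund` are hypothetical; the sketch uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
# Hypothetical right-skewed, non-negative abundances (3 taxa, 200 samples).
toy_abund = pd.DataFrame(rng.lognormal(mean=0.0, sigma=1.5, size=(200, 3)),
                         columns=["taxon_a", "taxon_b", "taxon_c"])

log_abund = np.log1p(toy_abund)  # Equivalent to np.log(x + 1)
# Column-wise Z-score with the population SD (ddof=0).
z_abund = (log_abund - log_abund.mean(axis=0)) / log_abund.std(axis=0, ddof=0)

print(z_abund.mean(axis=0).round(6).tolist())
print(z_abund.std(axis=0, ddof=0).round(6).tolist())
```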

In [None]:
tempDF = biomeDF
tempD1 = {'BothSex':bmiDF}

#Standardization
tempD2 = {}
for sex in tempD1.keys():
    tempDF1 = tempD1[sex]
    tempDF1 = tempDF.loc[tempDF1.index.tolist()]
    
    #Check
    print(sex+':', tempDF1.shape)
    print(' - Negative values in DF:', (tempDF<0).to_numpy().sum(axis=None))
    print(' - Before:')
    tempS = pd.Series(stats.skew(tempDF1), index=tempDF1.columns, name='Skewness')
    display(tempS.describe())
    
    #log-transformation
    tempDF1 = np.log(tempDF1 + 1)
    
    #Z-score transformation
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    tempA = scaler.fit_transform(tempDF1)#Column direction
    tempDF1 = pd.DataFrame(data=tempA, index=tempDF1.index, columns=tempDF1.columns)
    
    tempD2[sex] = tempDF1
    
    #Confirmation
    print(' - After:')
    tempS = pd.Series(stats.skew(tempDF1), index=tempDF1.columns, name='Skewness')
    display(tempS.describe())
    tempDF2 = tempDF1.describe()
    tempDF2.loc['Skewness'] = tempS
    display(tempDF2)
    sns.set(style='ticks', font='Arial', context='notebook')
    plt.figure(figsize=(4, 3))
    for col_i in range(3):
        sns.distplot(tempDF1.iloc[:, col_i], label=tempDF1.columns[col_i])
    sns.despine()
    plt.xlabel(r'$Z$'+'-score')
    plt.ylabel('Density')
    plt.legend(bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1)
    plt.show()

#Update
biomeDF = tempD2['BothSex']

### 2-2. PCA for feature selection

> Because the number of features is large relative to the sample size, PCA is applied to the input taxon abundances, and the top principal components are used as the input features.  
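
A minimal sketch of this kind of PCA-based reduction is shown below, assuming random toy data and a hypothetical cap `top_x` on the number of components kept. It uses scikit-learn's default solver rather than the randomized solver configured in the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(123)
toy_x = rng.normal(size=(40, 200))  # Hypothetical: 40 samples, 200 features
toy_x = (toy_x - toy_x.mean(axis=0)) / toy_x.std(axis=0)  # Standardize first

top_x = 10  # Hypothetical cap on the number of PCs kept as features
pca = PCA(n_components=min(toy_x.shape), random_state=123).fit(toy_x)
explained = pca.explained_variance_ratio_ * 100  # Percentage per component
projection = pca.transform(toy_x)[:, :top_x]     # Keep the top X components
print(projection.shape, round(explained[:top_x].sum(), 1))
```

The retained columns of `projection` would then replace the original features as classifier input.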

In [None]:
tempD1 = {'BothSex':biomeDF}
#pc_cutoff = 0.1#Cutoff of the PC's explained variance [%]
pc_topX = 50#Set the maximum number of input features
nplot = 5
tempD2 = {'BothSex':bmiDF}
tempD3 = {'Underweight':'blue', 'Normal':'green', 'Overweight':'orange', 'Obese':'red'}

#PCA for feature selection
tempD4 = {}
for sex in tempD1.keys():
    print(sex)
    tempDF = tempD1[sex]#Already standardized
    
    #Perform PCA
    nPCs = np.min(tempDF.shape)
    model = PCA(n_components=nPCs, svd_solver='randomized', iterated_power='auto', random_state=123)
    model.fit(tempDF)
    
    #Explained variance
    tempS = pd.Series(data=model.explained_variance_ratio_*100,
                      index=['PC'+str(i+1) for i in range(nPCs)], name='ExplainedVariance')
    print(' - Percentage of variance explained by each component:')
    display(tempS)
    ##Retrieve cutoff value at top X
    pc_cutoff = tempS.iloc[(pc_topX-1)]
    ##Scree plot
    tempDF1 = tempS.reset_index()
    tempDF1['PC'] = [i+1 for i in range(nPCs)]
    sns.set(style='ticks', font='Arial', context='talk')
    plt.figure(figsize=(4, 3))
    p = sns.lineplot(data=tempDF1, x='PC', y='ExplainedVariance', color='k')
    sns.despine()
    p.set(xlim=(0.5, nPCs+0.5))
    p.axhline(y=pc_cutoff, linestyle="--", color='crimson', zorder=0)
    plt.ylabel('Explained variance [%]')
    plt.xlabel('Principal component number')
    plt.show()
    ##Scree plot (log-scale)
    tempDF1['ExplainedVariance_log10'] = np.log10(tempDF1['ExplainedVariance'])
    sns.set(style='ticks', font='Arial', context='talk')
    plt.figure(figsize=(4, 3))
    p = sns.lineplot(data=tempDF1, x='PC', y='ExplainedVariance_log10', color='k')
    sns.despine()
    p.set(xlim=(0.5, nPCs+0.5))
    p.axhline(y=np.log10(pc_cutoff), linestyle="--", color='crimson', zorder=0)
    plt.ylabel('Explained variance [%]\n(log-scale)')
    plt.xlabel('Principal component number')
    plt.show()
    ##Integrate the explained variance into PC label
    tempL = []
    for i in range(nPCs):
        round_value = Decimal(str(tempS['PC'+str(i+1)])).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
        tempL.append('PC'+str(i+1)+' ('+str(round_value)+'%)')
    
    #Select PCs as input features
    tempS = tempS.loc[tempS>=pc_cutoff]
    print(' - PCs for input features:')
    print('   - Total explained variance:', tempS.sum())
    display(tempS.describe())
    
    #Projection
    tempDF1 = pd.DataFrame(data=model.transform(tempDF), index=tempDF.index, columns=tempL)
    tempDF1 = tempDF1.iloc[:, :len(tempS)]
    print(' - Projection DF:', tempDF1.shape)
    display(tempDF1)
    
    #PC component
    tempDF2 = pd.DataFrame(data=model.components_, index=tempL, columns=tempDF.columns)
    tempDF2 = tempDF2.iloc[:len(tempS), :]
    print(' - PC component DF:', tempDF2.shape)
    display(tempDF2)
    
    #Save
    fileDir = './ExportData/'
    ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
    fileName = 'TaxonAbundance-PCA-projection-'+sex+'.tsv'
    tempDF1.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    fileName = 'TaxonAbundance-PCA-component-'+sex+'.tsv'
    tempDF2.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    tempD4[sex] = tempDF1
    
    #Visualize sample distribution in the projected spaces
    tempDF = tempD2[sex]
    tempS = tempDF['BMI_class']
    tempDF = pd.merge(tempDF1.iloc[:, :(nplot+1)], tempS, left_index=True, right_index=True, how='left')
    sns.set(style='ticks', font='Arial', context='talk')
    p = sns.PairGrid(data=tempDF, hue='BMI_class', hue_order=tempD3.keys(), palette=tempD3)
    p.map_lower(sns.scatterplot, edgecolor='0.3', alpha=0.5, s=15)
    for i, j in zip(*np.triu_indices_from(p.axes, 0)):
        p.axes[i, j].set_visible(False)
    p.add_legend(bbox_to_anchor=(0.6, 0.6), loc='center right', frameon=True)
    plt.show()
    
    print('')

#Update
biomeDF = tempD4['BothSex']

### 2-3. Class label encoding

> In this case, there are only two labels (Normal vs. Obese). Hence, they are encoded manually.  
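
The manual encoding can be illustrated with a small, hypothetical Series. One point worth noting is that labels outside the mapping (for example, Overweight) become NaN, so the encoded column is float-typed.

```python
import pandas as pd

class_codes = {"Normal": 0, "Obese": 1}
# Hypothetical class labels for four participants.
toy_classes = pd.Series(["Normal", "Obese", "Overweight", "Normal"],
                        index=["P01", "P02", "P03", "P04"], name="BMI_class")

# Labels missing from the mapping become NaN, so the result is float-typed.
toy_codes = toy_classes.map(class_codes)
print(toy_codes.tolist())  # [0.0, 1.0, nan, 0.0]
```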

In [None]:
tempD1 = {'BothSex':bmiDF}

tempD2 = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    #Encoding
    for yvar in yvarL:
        tempDF[yvar+'_class_code'] = tempDF[yvar+'_class'].map(classD)
    tempD2[sex] = tempDF
    
    #Confirmation
    print(sex+':', tempDF.shape)
    display(tempDF.describe(include='all'))
    print('')

#Update
bmiDF = tempD2['BothSex']

## 3. BMI classifier

In [None]:
yvar = 'BMI'

### 3-1. Select the target class participants

In [None]:
tempD1 = {'BothSex':bmiDF}
tempD2 = {'BothSex':biomeDF}

tempD3 = {}
tempD4 = {}
for sex in tempD1.keys():
    #Select target classes for the classifier
    tempDF1 = tempD1[sex]
    tempDF1 = tempDF1.loc[tempDF1[yvar+'_class'].isin(list(classD.keys()))]
    tempDF2 = tempD2[sex]
    tempDF2 = tempDF2.loc[tempDF1.index.tolist()]
    
    tempD3[sex] = tempDF1
    tempD4[sex] = tempDF2
    
    print(sex+':')
    display(tempDF1.describe(include='all'))
    display(tempDF2.describe(include='all'))
    print('')

yDF_B = tempD3['BothSex']
xDF_B = tempD4['BothSex']

### 3-2. Random forest with cross-validation

> Gut microbiome-based RF classifiers are generated using the 5-fold iteration scheme (with 5-fold cross-validation).  
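
The scheme above (an outer loop over 5 hold-out sets, with 5-fold cross-validated hyperparameter tuning inside each training portion) can be sketched on synthetic data as follows. The data, the fold assignment `fold_ids`, and the deliberately tiny grid are assumptions chosen to keep the run short; they are not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data and a deliberately tiny grid to keep the run short.
x, y = make_classification(n_samples=100, n_features=10, random_state=123)
fold_ids = np.arange(len(y)) % 5  # Hold-out set membership per sample
grid = {"n_estimators": [10, 30], "max_features": ["sqrt", 0.5]}

proba = np.full(len(y), np.nan)  # Hold-out predicted probability of class 1
for k in range(5):
    test = fold_ids == k
    search = GridSearchCV(
        RandomForestClassifier(criterion="entropy", random_state=123,
                               class_weight="balanced_subsample"),
        grid, cv=5, refit=True)
    search.fit(x[~test], y[~test])  # Tuning sees the training portion only
    proba[test] = search.best_estimator_.predict_proba(x[test])[:, 1]
print(np.isnan(proba).sum())
```

Because each sample is predicted only by the model that never saw it during tuning or fitting, the collected probabilities can be evaluated as hold-out predictions.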

In [None]:
#Both sex model
tempDF1 = xDF_B#Input features (PCs of the standardized taxon abundances)
tempDF2 = yDF_B#Encoded true class label and testing set label
ncvs = 5
parameters = {'n_estimators':[int(value) for value in np.geomspace(100, 1000, num=10)],
              'max_features':[value for sublist in [['log2', 'sqrt'], np.linspace(0.05, 1.0, num=20)] for value in sublist]}
model = RandomForestClassifier(
    #n_estimators=100,
    criterion='entropy',#Changed from default (gini) according to Wilmanski, T. et al. Nat. Biotechnol. 2019
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    #max_features='sqrt',
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,#Keep the default because scores are calculated manually
    n_jobs=-1,#Use all processors; be mindful of other jobs on the server
    random_state=123,#To maintain reproducibility
    verbose=0,
    warm_start=False,
    class_weight='balanced_subsample',#Automatically adjust class weights based on the bootstrap sample
    ccp_alpha=0.0,
    max_samples=None)
modelcv = GridSearchCV(model, parameters, refit=True, cv=ncvs)

#Perform random forest
featureDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For feature importance
featureDF.index.rename('Variable', inplace=True)
predictS_label = pd.Series(name=yvar+'_class_predicted')#For predictions of class label
predictS_proba = pd.Series(name=yvar+'_class_predicted-probability')#For predictions of class probability
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing_'+yvar]!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing_'+yvar]==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the true class label
    yDF_train = pd.DataFrame(yDF_train[yvar+'_class_code'])#Not Series but DF
    #yDF_test = pd.DataFrame(yDF_test[yvar+'_class_code'])#Performance score is calculated later
    
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    modelcv.fit(xDF_train, yDF_train, sample_weight=None)#Weight was set by class_weight parameter
    #Check the best hyperparameter set decided by cross validation
    print(model_n+':', modelcv.best_params_)
    #Extract the best estimator
    model_best = modelcv.best_estimator_
    #Save feature importance
    featureDF[model_n] = model_best.feature_importances_#Impurity-based feature importances
    
    #Prediction for testing dataset using the fitted model k
    ##Label
    tempS = pd.Series(model_best.predict(xDF_test),
                      index=xDF_test.index, name=predictS_label.name)
    predictS_label = pd.concat([predictS_label, tempS], axis=0)
    ##Probability
    tempS = pd.Series(model_best.predict_proba(xDF_test)[:, 1],#Take only the values for class 1
                      index=xDF_test.index, name=predictS_proba.name)
    predictS_proba = pd.concat([predictS_proba, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV RF:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing_'+yvar, yvar+'_class', yvar+'_class_code']], predictS_label,
                  left_index=True, right_index=True, how='left')
tempDF = pd.merge(tempDF, predictS_proba, left_index=True, right_index=True, how='left')
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
fileName = yvar+'class-BothSex.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
measBMI_B = tempDF
featureDF_B = featureDF

### 3-3. Model performance
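
As a small worked example of the metrics computed in this section, the sketch below derives sensitivity, specificity, and AUC-ROC from hypothetical labels and probabilities. The 0.5 threshold used to obtain `y_pred` is an assumption for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.6, 0.8, 0.7, 0.3, 0.9, 0.2])
y_pred = (y_prob >= 0.5).astype(int)  # Assumed 0.5 decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # Recall, true positive rate
specificity = tn / (tn + fp)  # 1 - false positive rate
auc_roc = roc_auc_score(y_true, y_prob)
print(sensitivity, specificity, auc_roc)  # 0.75 0.75 0.875
```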

In [None]:
#Evaluation with testing (hold-out) dataset
tempD1 = {'BothSex':measBMI_B}
tempD2 = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    tempDF1 = pd.DataFrame(index=pd.Index(['Sensitivity', 'Specificity', 'Precision', 'AUC-ROC', 'AUC-PR', 'AP'],
                                          name='Metric'))
    for model_k in range(nmodels):
        model_n = 'Model_'+str(model_k+1).zfill(2)
        tempS1 = tempDF[yvar+'_class_code'].loc[tempDF['Testing_'+yvar]==model_n]
        tempS2 = tempDF[yvar+'_class_predicted'].loc[tempDF['Testing_'+yvar]==model_n]
        tempS3 = tempDF[yvar+'_class_predicted-probability'].loc[tempDF['Testing_'+yvar]==model_n]
        
        tn, fp, fn, tp = confusion_matrix(tempS1, tempS2).ravel()
        sensitivity = tp/(tp+fn)#a.k.a. recall, TPR
        specificity = tn/(tn+fp)#a.k.a. 1-FPR
        precision = tp/(tp+fp)
        
        #ROC curve
        fprA, tprA, thresholdA = roc_curve(tempS1, tempS3, pos_label=1)
        auc_roc = auc(fprA, tprA)
        #auc_roc = roc_auc_score(tempS1, tempS3)#Same as the above
        
        #Precision-recall curve
        precisionA, recallA, thresholdA = precision_recall_curve(tempS1, tempS3, pos_label=1)
        auc_pr = auc(recallA, precisionA)#Linear interpolation
        ap = average_precision_score(tempS1, tempS3)#Without interpolation
        
        tempDF1[model_n] = [sensitivity, specificity, precision, auc_roc, auc_pr, ap]
    #Summarize
    tempS1 = tempDF1.mean(axis=1)
    tempS1.name = 'Mean'
    tempS2 = tempDF1.std(axis=1, ddof=1)#Sample standard deviation
    tempS2.name = 'SD'
    tempS3 = tempDF1.std(axis=1, ddof=1)/np.sqrt(len(tempDF1.columns))
    tempS3.name = 'SEM'
    tempDF = pd.concat([tempDF1, tempS1, tempS2, tempS3], axis=1)
    
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
    fileName = yvar+'class-'+sex+'-performance.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    tempD2[sex] = tempDF
    print(sex)
    display(tempDF)
    print('')

measBMI_B_metrics = tempD2['BothSex']

### 3-4. Clean feature importance dataframe

In [None]:
tempD1 = {'BothSex':featureDF_B}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    #Summarize
    tempL1 = []
    tempL2 = []
    for row_n in tempDF.index.tolist():
        tempL1.append(tempDF.loc[row_n].mean())
        tempL2.append(tempDF.loc[row_n].std(ddof=1))#Sample standard deviation
    tempDF['Mean'] = tempL1
    tempDF['SD'] = tempL2
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
    fileName = yvar+'class-'+sex+'-feature-importance.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    #Check
    print(sex+':')
    print(' - Variables:', len(tempDF))
    tempDF1 = tempDF.loc[tempDF['Mean']>0.01]
    print(' - Variables with the mean of feature importances > 0.01:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.loc[tempDF['Mean']>0.05]
    print(' - Variables with the mean of feature importances > 0.05:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.sort_values(by='Mean', ascending=False)
    display(tempDF1.iloc[:30])
    print('')

## 4. MetBMI classifier

In [None]:
yvar = 'MetBMI'

### 4-1. Select the target class participants

In [None]:
tempD1 = {'BothSex':bmiDF}
tempD2 = {'BothSex':biomeDF}

tempD3 = {}
tempD4 = {}
for sex in tempD1.keys():
    #Select target classes for the classifier
    tempDF1 = tempD1[sex]
    tempDF1 = tempDF1.loc[tempDF1[yvar+'_class'].isin(list(classD.keys()))]
    tempDF2 = tempD2[sex]
    tempDF2 = tempDF2.loc[tempDF1.index.tolist()]
    
    tempD3[sex] = tempDF1
    tempD4[sex] = tempDF2
    
    print(sex+':')
    display(tempDF1.describe(include='all'))
    display(tempDF2.describe(include='all'))
    print('')

yDF_B = tempD3['BothSex']
xDF_B = tempD4['BothSex']

### 4-2. Random forest with cross-validation

> Gut microbiome-based RF classifiers are generated using the 5-fold iteration scheme (with 5-fold cross-validation).  

In [None]:
#Both sex model
tempDF1 = xDF_B#Input features (PCs of the standardized taxon abundances)
tempDF2 = yDF_B#Encoded true class label and testing set label
ncvs = 5
parameters = {'n_estimators':[int(value) for value in np.geomspace(100, 1000, num=10)],
              'max_features':[value for sublist in [['log2', 'sqrt'], np.linspace(0.05, 1.0, num=20)] for value in sublist]}
model = RandomForestClassifier(
    #n_estimators=100,
    criterion='entropy',#Changed from default (gini) according to Wilmanski, T. et al. Nat. Biotechnol. 2019
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    #max_features='sqrt',
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,#Keep the default because scores are calculated manually
    n_jobs=-1,#Use all processors; be mindful of other jobs on the server
    random_state=123,#To maintain reproducibility
    verbose=0,
    warm_start=False,
    class_weight='balanced_subsample',#Automatically adjust class weights based on the bootstrap sample
    ccp_alpha=0.0,
    max_samples=None)
modelcv = GridSearchCV(model, parameters, refit=True, cv=ncvs)

#Perform random forest
featureDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For feature importance
featureDF.index.rename('Variable', inplace=True)
predictS_label = pd.Series(name=yvar+'_class_predicted')#For predictions of class label
predictS_proba = pd.Series(name=yvar+'_class_predicted-probability')#For predictions of class probability
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing_'+yvar]!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing_'+yvar]==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the true class label
    yDF_train = pd.DataFrame(yDF_train[yvar+'_class_code'])#Not Series but DF
    #yDF_test = pd.DataFrame(yDF_test[yvar+'_class_code'])#Performance score is calculated later
    
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    modelcv.fit(xDF_train, yDF_train, sample_weight=None)#Weight was set by class_weight parameter
    #Check the best hyperparameter set decided by cross validation
    print(model_n+':', modelcv.best_params_)
    #Extract the best estimator
    model_best = modelcv.best_estimator_
    #Save feature importance
    featureDF[model_n] = model_best.feature_importances_#Impurity-based feature importances
    
    #Prediction for testing dataset using the fitted model k
    ##Label
    tempS = pd.Series(model_best.predict(xDF_test),
                      index=xDF_test.index, name=predictS_label.name)
    predictS_label = pd.concat([predictS_label, tempS], axis=0)
    ##Probability
    tempS = pd.Series(model_best.predict_proba(xDF_test)[:, 1],#Take only the values for class 1
                      index=xDF_test.index, name=predictS_proba.name)
    predictS_proba = pd.concat([predictS_proba, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV RF:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing_'+yvar, yvar+'_class', yvar+'_class_code']], predictS_label,
                  left_index=True, right_index=True, how='left')
tempDF = pd.merge(tempDF, predictS_proba, left_index=True, right_index=True, how='left')
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
fileName = yvar+'class-BothSex.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
metBMI_B = tempDF
featureDF_B = featureDF
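
> The scheme above can be sketched on synthetic data as follows. This is only an illustrative sketch, not the notebook's pipeline: the data come from `make_classification`, the hold-out sets are drawn with `StratifiedKFold` (the notebook instead reads a predefined `Testing_` column), and the grid is reduced to keep the run short.  

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (hypothetical; not the Arivale dataset)
X, y = make_classification(n_samples=90, n_features=8, n_informative=4, random_state=0)

# Outer loop: each sample belongs to exactly one hold-out (testing) set
outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
proba = np.full(len(y), np.nan)  # Hold-out probability of class 1
best_params = []

for train_idx, test_idx in outer.split(X, y):
    # Inner loop: hyperparameters are tuned on the training part only
    search = GridSearchCV(
        RandomForestClassifier(criterion='entropy', class_weight='balanced_subsample',
                               random_state=123),
        param_grid={'n_estimators': [10, 20], 'max_features': ['sqrt', 0.5]},
        cv=3, refit=True)
    search.fit(X[train_idx], y[train_idx])
    best_params.append(search.best_params_)
    # The refitted best estimator predicts only the samples it never saw
    proba[test_idx] = search.best_estimator_.predict_proba(X[test_idx])[:, 1]

print(best_params)
print('Samples without a hold-out prediction:', int(np.isnan(proba).sum()))
```

> Because every sample receives its prediction from a model that was tuned and fitted without it, the pooled predictions can be evaluated as out-of-sample values.  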

### 4-3. Model performance

In [None]:
#Evaluation with testing (hold-out) dataset
tempD1 = {'BothSex':metBMI_B}
tempD2 = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    tempDF1 = pd.DataFrame(index=pd.Index(['Sensitivity', 'Specificity', 'Precision', 'AUC-ROC', 'AUC-PR', 'AP'],
                                          name='Metric'))
    for model_k in range(nmodels):
        model_n = 'Model_'+str(model_k+1).zfill(2)
        tempS1 = tempDF[yvar+'_class_code'].loc[tempDF['Testing_'+yvar]==model_n]
        tempS2 = tempDF[yvar+'_class_predicted'].loc[tempDF['Testing_'+yvar]==model_n]
        tempS3 = tempDF[yvar+'_class_predicted-probability'].loc[tempDF['Testing_'+yvar]==model_n]
        
        tn, fp, fn, tp = confusion_matrix(tempS1, tempS2).ravel()
        sensitivity = tp/(tp+fn)#a.k.a. recall, TPR
        specificity = tn/(tn+fp)#a.k.a. 1-FPR
        precision = tp/(tp+fp)
        
        #ROC curve
        fprA, tprA, thresholdA = roc_curve(tempS1, tempS3, pos_label=1)
        auc_roc = auc(fprA, tprA)
        #auc_roc = roc_auc_score(tempS1, tempS3)#Same as the above
        
        #Precision-recall curve
        precisionA, recallA, thresholdA = precision_recall_curve(tempS1, tempS3, pos_label=1)
        auc_pr = auc(recallA, precisionA)#Linear interpolation
        ap = average_precision_score(tempS1, tempS3)#Without interpolation
        
        tempDF1[model_n] = [sensitivity, specificity, precision, auc_roc, auc_pr, ap]
    #Summarize
    tempS1 = tempDF1.mean(axis=1)
    tempS1.name = 'Mean'
    tempS2 = tempDF1.std(axis=1, ddof=1)#Sample standard deviation
    tempS2.name = 'SD'
    tempS3 = tempDF1.std(axis=1, ddof=1)/np.sqrt(len(tempDF1.columns))
    tempS3.name = 'SEM'
    tempDF = pd.concat([tempDF1, tempS1, tempS2, tempS3], axis=1)
    
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
    fileName = yvar+'class-'+sex+'-performance.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    tempD2[sex] = tempDF
    print(sex)
    display(tempDF)
    print('')

metBMI_B_metrics = tempD2['BothSex']
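
> As a worked example of the metrics above, the following sketch derives sensitivity, specificity, and precision from toy labels, and the SEM from toy per-model AUC values. All numbers are hypothetical and chosen only so that the arithmetic is easy to follow.  

```python
import numpy as np

# Toy hold-out labels and predictions (hypothetical values)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1, 1, 0])

# Confusion matrix counts with class 1 as the positive class
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

sensitivity = tp / (tp + fn)  # Recall, true positive rate
specificity = tn / (tn + fp)  # 1 - false positive rate
precision = tp / (tp + fp)
print(sensitivity, specificity, precision)

# Summary across models: sample SD (ddof=1) and SEM = SD / sqrt(n)
aucs = np.array([0.70, 0.74, 0.66, 0.72, 0.68])  # Hypothetical per-model AUCs
sd = aucs.std(ddof=1)
sem = sd / np.sqrt(len(aucs))
print(aucs.mean(), sd, sem)
```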

### 4-4. Clean feature importance dataframe

In [None]:
tempD1 = {'BothSex':featureDF_B}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    #Summarize
    tempL1 = []
    tempL2 = []
    for row_n in tempDF.index.tolist():
        tempL1.append(tempDF.loc[row_n].mean())
        tempL2.append(tempDF.loc[row_n].std(ddof=1))#Sample standard deviation
    tempDF['Mean'] = tempL1
    tempDF['SD'] = tempL2
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
    fileName = yvar+'class-'+sex+'-feature-importance.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    #Check
    print(sex+':')
    print(' - Variables:', len(tempDF))
    tempDF1 = tempDF.loc[tempDF['Mean']>0.01]
    print(' - Variables with the mean of feature importances > 0.01:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.loc[tempDF['Mean']>0.05]
    print(' - Variables with the mean of feature importances > 0.05:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.sort_values(by='Mean', ascending=False)
    display(tempDF1.iloc[:30])
    print('')

# — Move to the R sub-notebook —

## 5. Comparison between classifiers

### 5-1. ROC curve

#### 5-1-1. Per model

> Refer to the Examples on the scikit-learn website.  

In [None]:
tempD1 = {'BMI':measBMI_B, 'MetBMI':metBMI_B}
tempD2 = {'BMI':measBMI_B_metrics, 'MetBMI':metBMI_B_metrics}
tempD3 = {'BMI':'k', 'MetBMI':'b'}

#Prepare each ROC curve
mean_fpr = np.linspace(0, 1, 100)#X-coordinate
tempD = {}
for yvar in tempD1.keys():
    tempDF = tempD1[yvar]
    tempL = tempDF['Testing_'+yvar].unique().tolist()
    tprs = []
    for model_n in tempL:
        tempS1 = tempDF[yvar+'_class_code'].loc[tempDF['Testing_'+yvar]==model_n]
        tempS2 = tempDF[yvar+'_class_predicted-probability'].loc[tempDF['Testing_'+yvar]==model_n]
        fprA, tprA, thresholdA = roc_curve(tempS1, tempS2, pos_label=1)
        interp_tpr = np.interp(mean_fpr, fprA, tprA)
        interp_tpr[0] = 0.0#Left border
        interp_tpr[-1] = 1.0#Right border
        tprs.append(interp_tpr)
    tempD[yvar] = tprs

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
plt.setp(ax, xlim=(0.0, 1.01), xticks=np.arange(0, 1.1, 0.2))
plt.setp(ax, ylim=(0.0, 1.01), yticks=np.arange(0, 1.1, 0.2))
for yvar in tempD.keys():
    tprs = tempD[yvar]
    print(yvar+': n =', len(tprs), 'models')#Check length for the following SEM calculation (just in case)
    #Prepare mean and SEM of AUC
    tempS = tempD2[yvar].loc['AUC-ROC']
    mean = tempS.loc['Mean']
    mean_text = str(Decimal(mean).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))
    sem = tempS.loc['SEM']
    sem_text = str(Decimal(sem).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))
    display(tempS)
    #Mean line
    mean_tpr = np.mean(tprs, axis=0)
    print(' -> Cf. AUC of the mean ROC curve in plot:', auc(mean_fpr, mean_tpr))
    ax.plot(mean_fpr, mean_tpr, color=tempD3[yvar], lw=2,
            label=yvar+' class\n(AUC = '+mean_text+' ± '+sem_text+')')
    #SEM range
    sem_tpr = np.std(tprs, axis=0, ddof=1)/np.sqrt(len(tprs))
    ax.fill_between(mean_fpr, mean_tpr-sem_tpr, mean_tpr+sem_tpr,
                    color=tempD3[yvar], alpha=0.2)
##Random classification line
ax.plot([0, 1], [0, 1], linestyle="--", lw=2, color='r', zorder=0)
sns.despine()
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.legend(fontsize='small', title='Obesity classifier', title_fontsize='medium',
           bbox_to_anchor=(1, 0), loc='lower right', borderaxespad=0.25,
           handlelength=1.5, handletextpad=0.5)
plt.show()
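
> The averaging step can be illustrated with two toy ROC curves. This is a sketch under assumptions: the curve points are hypothetical, `trapezoid_auc` is a helper introduced only for this example, and the grid has 101 points (rather than the 100 used above) so that the toy knots fall on the grid.  

```python
import numpy as np

# Two toy ROC curves evaluated at different thresholds (hypothetical points)
curves = [
    (np.array([0.0, 0.2, 0.5, 1.0]), np.array([0.0, 0.6, 0.8, 1.0])),
    (np.array([0.0, 0.1, 0.4, 0.7, 1.0]), np.array([0.0, 0.4, 0.7, 0.9, 1.0])),
]

# Interpolate every curve onto one common FPR grid
mean_fpr = np.linspace(0, 1, 101)
tprs = []
for fpr, tpr in curves:
    interp_tpr = np.interp(mean_fpr, fpr, tpr)
    interp_tpr[0] = 0.0   # Left border
    interp_tpr[-1] = 1.0  # Right border
    tprs.append(interp_tpr)
tprs = np.array(tprs)

# Pointwise mean and SEM across curves
mean_tpr = tprs.mean(axis=0)
sem_tpr = tprs.std(axis=0, ddof=1) / np.sqrt(len(tprs))

def trapezoid_auc(x, y):
    """Area under a curve by the trapezoidal rule (hypothetical helper)."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

print('AUC of the mean ROC curve:', trapezoid_auc(mean_fpr, mean_tpr))
```

> On a coarser grid, the AUC of the plotted mean curve can deviate slightly from the mean of the per-model AUCs, which is why both values are printed above.  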

#### 5-1-2. Overall population

In [None]:
tempD1 = {'BMI':measBMI_B, 'MetBMI':metBMI_B}
tempD2 = {'BMI':measBMI_B_metrics, 'MetBMI':metBMI_B_metrics}
tempD3 = {'BMI':'k', 'MetBMI':'b'}

#Prepare overall ROC curve
fprs = np.linspace(0, 1, 100)#X-coordinate
tempD = {}
for yvar in tempD1.keys():
    tempDF = tempD1[yvar]
    tempS1 = tempDF[yvar+'_class_code']
    tempS2 = tempDF[yvar+'_class_predicted-probability']
    fprA, tprA, thresholdA = roc_curve(tempS1, tempS2, pos_label=1)
    tprs = np.interp(fprs, fprA, tprA)
    tprs[0] = 0.0#Left border
    tprs[-1] = 1.0#Right border
    tempD[yvar] = tprs

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
plt.setp(ax, xlim=(0.0, 1.01), xticks=np.arange(0, 1.1, 0.2))
plt.setp(ax, ylim=(0.0, 1.01), yticks=np.arange(0, 1.1, 0.2))
for yvar in tempD.keys():
    tprs = tempD[yvar]
    print(yvar+':', len(tprs), 'interpolated points')#Check just in case
    #Prepare AUC
    auc_roc = auc(fprs, tprs)
    auc_text = str(Decimal(auc_roc).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    print(' - AUC:', auc_roc)
    tempS = tempD2[yvar].loc['AUC-ROC']
    display(tempS)
    #ROC curve
    ax.plot(fprs, tprs, color=tempD3[yvar], lw=2,
            label=yvar+' class\n(AUC = '+auc_text+')')
##Random classification line
ax.plot([0, 1], [0, 1], linestyle="--", lw=2, color='r', zorder=0)
sns.despine()
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.legend(fontsize='small', title='Obesity classifier', title_fontsize='medium',
           bbox_to_anchor=(1, 0), loc='lower right', borderaxespad=0.25,
           handlelength=1.5, handletextpad=0.5)
plt.show()

#### 5-1-3. Overall population (pROC package)

> Because the ROC curve interpolated above is not exactly the same as the one used in DeLong's test, the ROC curve actually used in the test is imported here.  

In [None]:
tempD1 = {'BMI':'k', 'MetBMI':'b'}
cohort = 'Arivale'

#Prepare the overall ROC curve used in DeLong test
tempD2 = {}
for yvar in tempD1.keys():
    #Import
    fileDir = './ExportData/'
    ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-DeLong-ver5_Arivale-wenceslaus_'
    fileName = yvar+'-ROC-curve.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t')
    #Calculate TPR and FPR
    tempDF['TPR'] = tempDF['Sensitivity']
    tempDF['FPR'] = 1 - tempDF['Specificity']
    print(yvar)
    display(tempDF)
    tempD2[yvar] = tempDF

#Prepare test summary
fileDir = './ExportData/'
ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-DeLong-ver5_Arivale-wenceslaus_'
fileName = 'result-summary.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t')
tempDF = tempDF.set_index('Variable')
display(tempDF)
tempD3 = {}
for yvar in tempD1.keys():
    tempD3[yvar] = tempDF['Estimate_'+yvar].iloc[0]
pval = tempDF['Pval'].iloc[0]

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
plt.setp(ax, xlim=(0.0, 1.01), xticks=np.arange(0, 1.1, 0.2))
plt.setp(ax, ylim=(0.0, 1.01), yticks=np.arange(0, 1.1, 0.2))
for yvar in tempD1.keys():
    tempDF = tempD2[yvar]
    #Prepare AUC
    auc_roc = tempD3[yvar]
    auc_text = str(Decimal(auc_roc).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    #ROC curve
    ax.plot(tempDF['FPR'], tempDF['TPR'], color=tempD1[yvar], lw=2,
            label=yvar+' class\n(AUC = '+auc_text+')')
##Random classification line
ax.plot([0, 1], [0, 1], linestyle="--", lw=2, color='r', zorder=0)
##Add annotation line
xcoord_0 = 0.4#Manually adjusted
xcoord_1 = 0.425#Manually adjusted
ycoord_0 = 0.090#Manually adjusted
ycoord_1 = 0.230#Manually adjusted
ax.plot([xcoord_1, xcoord_0, xcoord_0, xcoord_1],
        [ycoord_0, ycoord_0, ycoord_1, ycoord_1],
        lw=1.5, c='k')
##Add P-value annotation
if pval<0.001:
    label = '***'
elif pval<0.01:
    label = '**'
elif pval<0.05:
    label = '*'
else:
    pval_text = str(Decimal(pval).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))
    label = r'$P$'+' = '+pval_text
if label in ['***', '**', '*']:
    text_offset = -0.015
    text_size = 'medium'
else:
    text_offset = 0.0
    text_size = 'x-small'
ax.annotate(label, xy=(2*xcoord_0-xcoord_1, (ycoord_0+ycoord_1)/2+text_offset),
            horizontalalignment='right', verticalalignment='center',
            fontsize=text_size, color='k')
##Add axis title
ax.set_title(cohort+' cohort', {'fontsize':'large'})
sns.despine()
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.legend(fontsize='small', title='Obesity classifier', title_fontsize='medium',
           bbox_to_anchor=(1, 0), loc='lower right', borderaxespad=0.25,
           handlelength=1.5, handletextpad=0.5)
##Save
fileDir = './ExportFigures/'
ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
fileName = 'ROC-curve.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()
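
> The P-value annotation rule used in the plot can be written as a small function. This is a sketch: `pval_to_label` is a hypothetical helper name that does not appear in the notebook, and the thresholds are copied from the cell above.  

```python
from decimal import Decimal, ROUND_HALF_UP

def pval_to_label(pval):
    """Return '***', '**', or '*' for significant P-values, else a 'P = 0.xx' label."""
    if pval < 0.001:
        return '***'
    elif pval < 0.01:
        return '**'
    elif pval < 0.05:
        return '*'
    # Decimal with ROUND_HALF_UP avoids the round-half-to-even behavior of round()
    pval_text = str(Decimal(pval).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))
    return r'$P$' + ' = ' + pval_text

for value in [0.0005, 0.005, 0.03, 0.125, 0.2]:
    print(value, '->', pval_to_label(value))
```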

### 5-2. Precision–recall curve

#### 5-2-1. Per model

> Refer to the Examples on the scikit-learn website.  

In [None]:
tempD1 = {'BMI':measBMI_B, 'MetBMI':metBMI_B}
tempD2 = {'BMI':measBMI_B_metrics, 'MetBMI':metBMI_B_metrics}
tempD3 = {'BMI':'k', 'MetBMI':'b'}

#Prepare each PR curve
mean_recall = np.linspace(0, 1, 100)#X-coordinate
tempD = {}
for yvar in tempD1.keys():
    tempDF = tempD1[yvar]
    tempL = tempDF['Testing_'+yvar].unique().tolist()
    precisions = []
    for model_n in tempL:
        tempS1 = tempDF[yvar+'_class_code'].loc[tempDF['Testing_'+yvar]==model_n]
        tempS2 = tempDF[yvar+'_class_predicted-probability'].loc[tempDF['Testing_'+yvar]==model_n]
        precisionA, recallA, thresholdA = precision_recall_curve(tempS1, tempS2, pos_label=1)
        precisionA = np.flip(precisionA)#To make the first element 1
        recallA = np.flip(recallA)#To make the first element 0
        interp_precision = np.interp(mean_recall, recallA, precisionA)
        interp_precision[0] = 1.0#Left border
        random_precision = len(tempS1.loc[tempS1==1])/len(tempS1)#No-skill classifier precision
        interp_precision[-1] = random_precision#Right border
        precisions.append(interp_precision)
    tempD[yvar] = precisions

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
plt.setp(ax, xlim=(0.0, 1.01), xticks=np.arange(0, 1.1, 0.2))
plt.setp(ax, ylim=(0.0, 1.01), yticks=np.arange(0, 1.1, 0.2))
for yvar in tempD.keys():
    precisions = tempD[yvar]
    print(yvar+': n =', len(precisions), 'models')#Check length for the following SEM calculation (just in case)
    #Prepare mean and SEM of AUC
    tempS = tempD2[yvar].loc['AUC-PR']
    mean = tempS.loc['Mean']
    mean_text = str(Decimal(mean).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))
    sem = tempS.loc['SEM']
    sem_text = str(Decimal(sem).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))
    display(tempS)
    #Mean line
    mean_precision = np.mean(precisions, axis=0)
    print(' -> Cf. AUC of the mean PR curve in plot:', auc(mean_recall, mean_precision))
    ax.plot(mean_recall, mean_precision, color=tempD3[yvar], lw=2,
            label=yvar+' class\n(AUC = '+mean_text+' ± '+sem_text+')')
    #SEM range
    sem_precision = np.std(precisions, axis=0, ddof=1)/np.sqrt(len(precisions))
    ax.fill_between(mean_recall, mean_precision-sem_precision, mean_precision+sem_precision,
                    color=tempD3[yvar], alpha=0.2)
    #Random classification line
    tempDF = tempD1[yvar]
    tempS = tempDF[yvar+'_class_code']
    random_precision = len(tempS.loc[tempS==1])/len(tempS)#No-skill classifier precision
    print(' - No-skill classifier precision (overall):', random_precision)
    print('   -> Cf. the mean among testing sets:', mean_precision[-1])
    ax.axhline(y=random_precision, linestyle="--", lw=2, color=tempD3[yvar], zorder=0)
sns.despine()
plt.ylabel('Precision')
plt.xlabel('Recall')
plt.legend(fontsize='small', title='Obesity classifier', title_fontsize='medium',
           bbox_to_anchor=(1, 0), loc='lower right', borderaxespad=0.25,
           handlelength=1.5, handletextpad=0.5)
plt.show()
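
> The flipping step and the no-skill baseline can be checked on toy values. This is a sketch with hypothetical points arranged in the order that `precision_recall_curve` returns them (recall decreasing to 0, precision ending at 1).  

```python
import numpy as np

# Toy PR points in the order returned by precision_recall_curve (hypothetical values)
precision = np.array([0.5, 0.6, 0.75, 1.0, 1.0])
recall = np.array([1.0, 0.75, 0.5, 0.25, 0.0])

# np.interp expects increasing x-coordinates, so both arrays are reversed
recall_inc = np.flip(recall)
precision_inc = np.flip(precision)
grid = np.linspace(0, 1, 9)  # Common recall grid
interp_precision = np.interp(grid, recall_inc, precision_inc)
print(interp_precision)

# No-skill baseline: the fraction of positive samples in the evaluated set
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])
no_skill = float(np.mean(y_true == 1))
print('No-skill classifier precision:', no_skill)
```

> Without the flip, `np.interp` would receive decreasing x-coordinates, for which its result is not meaningful.  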

#### 5-2-2. Overall population

In [None]:
tempD1 = {'BMI':measBMI_B, 'MetBMI':metBMI_B}
tempD2 = {'BMI':measBMI_B_metrics, 'MetBMI':metBMI_B_metrics}
tempD3 = {'BMI':'k', 'MetBMI':'b'}

#Prepare overall PR curve
recalls = np.linspace(0, 1, 100)#X-coordinate
tempD = {}
for yvar in tempD1.keys():
    tempDF = tempD1[yvar]
    tempS1 = tempDF[yvar+'_class_code']
    tempS2 = tempDF[yvar+'_class_predicted-probability']
    precisionA, recallA, thresholdA = precision_recall_curve(tempS1, tempS2, pos_label=1)
    precisionA = np.flip(precisionA)#To make the first element 1
    recallA = np.flip(recallA)#To make the first element 0
    precisions = np.interp(recalls, recallA, precisionA)
    precisions[0] = 1.0#Left border
    random_precision = len(tempS1.loc[tempS1==1])/len(tempS1)#No-skill classifier precision
    precisions[-1] = random_precision#Right border
    tempD[yvar] = precisions

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
plt.setp(ax, xlim=(0.0, 1.01), xticks=np.arange(0, 1.1, 0.2))
plt.setp(ax, ylim=(0.0, 1.01), yticks=np.arange(0, 1.1, 0.2))
for yvar in tempD.keys():
    precisions = tempD[yvar]
    print(yvar+':', len(precisions), 'interpolated points')#Check just in case
    #Prepare AUC
    auc_pr = auc(recalls, precisions)
    auc_text = str(Decimal(auc_pr).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    print(' - AUC:', auc_pr)
    tempS = tempD2[yvar].loc['AUC-PR']
    display(tempS)
    #PR curve
    ax.plot(recalls, precisions, color=tempD3[yvar], lw=2,
            label=yvar+' class\n(AUC = '+auc_text+')')
    #Random classification line
    tempDF = tempD1[yvar]
    tempS = tempDF[yvar+'_class_code']
    random_precision = len(tempS.loc[tempS==1])/len(tempS)#No-skill classifier precision
    print(' - No-skill classifier precision (overall):', random_precision)
    ax.axhline(y=random_precision, linestyle="--", lw=2, color=tempD3[yvar], zorder=0)
sns.despine()
plt.ylabel('Precision')
plt.xlabel('Recall')
plt.legend(fontsize='small', title='Obesity classifier', title_fontsize='medium',
           bbox_to_anchor=(1, 0), loc='lower right', borderaxespad=0.25,
           handlelength=1.5, handletextpad=0.5)
plt.show()

### 5-3. Prediction metrics

In [None]:
cohort = 'Arivale'
tempD1 = {'BMI class':measBMI_B_metrics, 'MetBMI class':metBMI_B_metrics}
tempD2 = {'BMI class':'0.5', 'MetBMI class':'b'}
tempD3 = {'AUC-ROC':'AUC (ROC)', 'Sensitivity':'Sensitivity',
          'Specificity':'Specificity', 'AP':'Precision'}

#Prepare DF
tempL = []
for classifier in tempD1.keys():
    tempDF = tempD1[classifier].T
    tempDF = tempDF.loc[tempDF.index.str.contains('Model_')]
    tempDF.index.rename('Model', inplace=True)
    tempDF = tempDF[list(tempD3.keys())]
    tempDF.columns = tempDF.columns.map(tempD3)
    tempDF['Classifier'] = classifier
    tempL.append(tempDF)
    
    print(classifier)
    display(tempDF.describe(include='all'))
tempDF = pd.concat(tempL, axis=0)

#Statistical tests
control = list(tempD1.keys())[0]
contrast = list(tempD1.keys())[1]
tempDF1 = pd.DataFrame(columns=['Control', 'Contrast', 'Control_N', 'Contrast_N', 'DoF', 'tStat', 'Pval'])
for metric in tempD3.values():
    tempS1 = tempDF[metric].loc[tempDF['Classifier']==control]
    tempS2 = tempDF[metric].loc[tempDF['Classifier']==contrast]
    #Two-sided Welch's t-test
    tstat, pval, dof = weightstats.ttest_ind(tempS1, tempS2,
                                             alternative='two-sided', usevar='unequal')
    size1 = len(tempS1)
    size2 = len(tempS2)
    tempDF1.loc[metric] = [control, contrast, size1, size2, dof, tstat, pval]
tempDF1.index.rename('Metric', inplace=True)
display(tempDF1)
##Save
fileDir = './ExportData/'
ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
fileName = 'performance-comparison.tsv'
tempDF1.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

#Plot
tempDF = tempDF.reset_index().melt(var_name='Metric', value_name='Value',
                                   id_vars=['Classifier', 'Model'], value_vars=list(tempD3.values()))
axis_ymin = 0.0
axis_ymax = 1.1
ymin = 0.0
ymax = 1.0
yinter = 0.2
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(3.5, 3.5))
sns.barplot(data=tempDF, y='Value', x='Metric', order=tempD3.values(),
            hue='Classifier', hue_order=tempD2.keys(), dodge=True, palette=tempD2,
            ci=95, capsize=0.4/2, errwidth=1.5, errcolor='black', edgecolor='black')
p = sns.stripplot(data=tempDF, y='Value', x='Metric', order=tempD3.values(),
                  hue='Classifier', hue_order=tempD2.keys(), dodge=True, jitter=0.3/2,
                  size=5, edgecolor='black', color='gray', linewidth=1, alpha=0.4)
sns.despine()
p.set(ylim=(axis_ymin, axis_ymax), yticks=np.arange(ymin, ymax+yinter/10, yinter))
plt.setp(p.get_xticklabels(), rotation=70,
         horizontalalignment='right', verticalalignment='center', rotation_mode='anchor')
##Random classification line
p.axhline(y=0.5, linestyle="--", lw=2, color='r', zorder=0)
##P-value annotation
for x_i, metric in enumerate(tempD3.values()):
    #Control
    rect_0 = p.patches[x_i]
    xcoord_0 = rect_0.get_x() + rect_0.get_width()/2
    ycoord_0 = rect_0.get_height()
    #Contrast
    rect_1 = p.patches[x_i+len(tempD3)]
    xcoord_1 = rect_1.get_x() + rect_1.get_width()/2
    ycoord_1 = rect_1.get_height()
    #Standard point of marker
    xcoord = (xcoord_0+xcoord_1)/2
    ycoord = tempDF['Value'].loc[tempDF['Metric']==metric].max()
    #Add annotation lines
    aline_offset = yinter/10
    aline_length = yinter/10 + aline_offset
    p.plot([xcoord_0, xcoord_0, xcoord_1, xcoord_1],
           [ycoord+aline_offset, ycoord+aline_length, ycoord+aline_length, ycoord+aline_offset],
           lw=1.5, c='k')
    #Retrieve P-value
    pval = tempDF1['Pval'].loc[metric]
    if pval<0.001:
        label = '***'
    elif pval<0.01:
        label = '**'
    elif pval<0.05:
        label = '*'
    else:
        pval_text = str(Decimal(pval).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))
        label = r'$P$'+' = '+pval_text
    #Add annotation text
    if label in ['***', '**', '*']:
        text_offset = yinter/25
        text_size = 'medium'
    else:
        text_offset = yinter/5
        text_size = 'x-small'
    p.annotate(label, xy=(xcoord, ycoord+text_offset),
               horizontalalignment='center', verticalalignment='bottom',
               fontsize=text_size, color='k')
##Add axis title
p.set_title(cohort+' cohort', {'fontsize':'large'})
plt.ylabel('Out-of-sample metric')
plt.xlabel('')
##Modify legend
h, l = p.get_legend_handles_labels()
h = h[2:]#Remove sns.stripplot legend (legend=False didn't work...)
l = l[2:]#Remove sns.stripplot legend (legend=False didn't work...)
plt.legend(h, l, fontsize='medium', title='Obesity classifier', title_fontsize='medium',
           bbox_to_anchor=(0.5, -0.45), loc='upper center', borderaxespad=0.25,
           handlelength=1.5, handletextpad=0.5)
##Save
fileDir = './ExportFigures/'
ipynbName = '221010_Multiomics-BMI-NatMed1stRevision_Microbiome-RFclassifier-ver5_Arivale-wenceslaus_'
fileName = 'performance-comparison.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()
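
> For reference, the t-statistic and the degrees of freedom of Welch's t-test can be reproduced by hand. This is a sketch with hypothetical metric values, and `welch_t` is a helper introduced only for this example; the P-value itself additionally requires the t-distribution, which `weightstats.ttest_ind` handles in the cell above.  

```python
import math
from statistics import mean, variance

def welch_t(x1, x2):
    """Return Welch's t-statistic (x1 minus x2) and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    v1 = variance(x1) / n1  # Squared standard error of group 1 (sample variance, ddof=1)
    v2 = variance(x2) / n2  # Squared standard error of group 2
    tstat = (mean(x1) - mean(x2)) / math.sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return tstat, dof

# Hypothetical per-model metric values for two classifiers
control = [0.60, 0.62, 0.58, 0.61, 0.59]
contrast = [0.70, 0.75, 0.65, 0.72, 0.68]
tstat, dof = welch_t(control, contrast)
print('tStat:', tstat, 'DoF:', dof)
```

> The degrees of freedom are generally not an integer, unlike in the equal-variance t-test.  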

# — End of this notebook —