# Multiomics BMI Paper — Relationships between ∆BMI and Clinical Definition-based Misclassification

***by Kengo Watanabe***  

This Jupyter Notebook (Python 3 kernel) assesses differences in ∆BMI (1) between the metabolically healthy normal-weight (MHNW) and metabolically unhealthy normal-weight (MUNW) groups and (2) between the metabolically healthy obese (MHO) and metabolically unhealthy obese (MUO) groups, with metabolic health assigned by a clinical definition in the baseline Arivale cohort.  

Input files:  
* Arivale baseline biological BMIs and covariates: 220803_Multiomics-BMI-NatMed1stRevision_DeltaBMI-misclassification_biologicalBMI-baseline-summary-BothSex.tsv  
* Arivale baseline metabolic health condition: 220720_Multiomics-BMI-NatMedRevision_Misclassification_metabolic-health-summary.tsv  

Output figures and tables:  
* Figure 3b  
* Table for Supplementary Data 10  

Original notebook (memo for my future tracing):  
* dalek:\[JupyterLab HOME\]/220621_Multiomics-BMI-NatMedRevision/220804_Multiomics-BMI-NatMed1stRevision_DeltaBMI-misclassification_ClinicalDefinition-ver2.ipynb  

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#For Arial font
#!conda install -c conda-forge -y mscorefonts
##-> The below was also needed in matplotlib 3.4.2
#import shutil
#import matplotlib
#shutil.rmtree(matplotlib.get_cachedir())
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display
import time

from sklearn.preprocessing import StandardScaler
import statsmodels.formula.api as smf
from statsmodels.stats import multitest as multi
from decimal import Decimal, ROUND_HALF_UP

!conda list

# packages in environment at /opt/conda/envs/arivale-py3:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
analytics                 0.1                      pypi_0    pypi
argon2-cffi               21.1.0           py39h3811e60_0    conda-forge
arivale-data-interface    0.1.0                    pypi_0    pypi
async_generator           1.10                       py_0    conda-forge
atk-1.0                   2.36.0               h3371d22_4    conda-forge
attrs                     21.2.0             pyhd8ed1ab_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
biopython                 1.79             py39h3811e60_0    conda-forge
bleach 

## 1. Prepare datasets

### 1-1. ∆BMI and covariates

In [None]:
#Import cleaned table for baseline measured and biological BMIs
fileDir = './ExportData/'
ipynbName = '220803_Multiomics-BMI-NatMed1stRevision_DeltaBMI-misclassification_'
fileName = 'biologicalBMI-baseline-summary-BothSex.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('public_client_id')

#Remove 'Base' from the column names for easier handling in this notebook
tempDF.columns = tempDF.columns.str.replace('Base', '')

#Calculate ∆BMI as the relative difference of each biological BMI from the measured BMI [% BMI]
tempL = ['MetBMI', 'ProtBMI', 'ChemBMI', 'CombiBMI']
for bbmi in tempL:
    tempDF['Delta'+bbmi] = (tempDF[bbmi] - tempDF['BMI']) / tempDF['BMI'] * 100

#Select the misclassification and covariates (just for the display in Jupyter notebook)
tempL1 = tempDF.loc[:, tempDF.columns.str.contains('Delta')].columns.tolist()
tempL2 = tempDF.loc[:, tempDF.columns.str.contains('_class')].columns.tolist()
tempL3 = ['BMI', 'Sex', 'Age', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
tempL = [col_n for sublist in [tempL1, tempL2, tempL3] for col_n in sublist]
tempDF = tempDF[tempL]

display(tempDF)
display(tempDF.describe(include='all'))
print('NaN in DF:', tempDF.isnull().to_numpy().sum(axis=None))

bmiDF = tempDF

In [None]:
tempDF = bmiDF

#Check skewness
tempDF1 = tempDF.select_dtypes(include=[np.number])
tempDF2 = tempDF1.describe()
tempDF2.loc['Skewness'] = stats.skew(tempDF1)
display(tempDF2)

> –> Based on the skewness check, ∆BMI can safely be assumed to follow a normal distribution.  
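
As a minimal sketch of the two quantities used above, the snippet below computes ∆BMI as a percentage of the measured BMI and a skewness statistic with NumPy only. The helper names (`delta_bmi_percent`, `sample_skewness`) and the toy values are hypothetical and not part of the Arivale data. The skewness formula is the biased Fisher-Pearson coefficient, which is what `scipy.stats.skew` returns with its default settings.

```python
import numpy as np

def delta_bmi_percent(biological_bmi, measured_bmi):
    """Relative difference of a biological BMI from the measured BMI [% BMI]."""
    biological_bmi = np.asarray(biological_bmi, dtype=float)
    measured_bmi = np.asarray(measured_bmi, dtype=float)
    return (biological_bmi - measured_bmi) / measured_bmi * 100

def sample_skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5 (central moments)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered**2)
    m3 = np.mean(centered**3)
    return m3 / m2**1.5

# Toy values (hypothetical, for illustration only)
measured = np.array([20.0, 25.0, 30.0, 40.0])
biological = np.array([22.0, 25.0, 27.0, 44.0])
delta = delta_bmi_percent(biological, measured)
print(delta)
print(sample_skewness(delta))
```

A skewness near zero indicates a symmetric distribution, which is consistent with normality but does not prove it.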

### 1-2. Metabolic health condition

In [None]:
#Import cleaned table for metabolic health condition
fileDir = './ExportData/'
ipynbName = '220720_Multiomics-BMI-NatMedRevision_Misclassification_'
fileName = 'metabolic-health-summary.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('public_client_id')

display(tempDF)
display(tempDF.describe(include='all'))
print('NaN in DF:', tempDF.isnull().to_numpy().sum(axis=None))

metabDF = tempDF

> –> Six participants had NaN for the metabolic health condition.  

In [None]:
tempDF1 = bmiDF
tempDF2 = metabDF

#Check the baseline BMI distribution per metabolic condition within each BMI class
tempDF = pd.merge(tempDF1, tempDF2['Metabolically'], left_index=True, right_index=True, how='inner')
for bmi_class in ['Normal', 'Obese']:
    tempDF3 = tempDF.loc[tempDF['BMI_class']==bmi_class]
    print(bmi_class+': n =', len(tempDF3))
    display(tempDF3.groupby('Metabolically')['BMI'].describe())

> –> It would be safer to adjust for the baseline BMI in the statistical tests.  

## 2. Regression analysis for ∆BMI

> Of note, because ∆BMI values are assumed to be normally distributed, ordinary least squares (OLS) linear regression (i.e., a GLM with Gaussian family and identity link) can be used directly.  

### 2-1. Perform OLS linear regression

> Model: ∆BMI ~ b0 + b1\*C(MetabolicCondition) + b2\*BMI + b3\*C(Sex) + b4\*Age + b5\*AncestryPCs (PC1–PC5)  
> Main aim: Assess the difference in each ∆BMI between the metabolically healthy and unhealthy groups within each BMI class.  
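
To illustrate how b1 is read in a model of this shape, here is a minimal NumPy-only sketch on synthetic data. It is not the statsmodels pipeline used in this notebook. All variable names and numbers are hypothetical, and the covariate set is reduced to BMI and age for brevity. The metabolic condition is treatment-coded (Healthy = 0 as the reference, Unhealthy = 1), and the numeric variables are z-scored with the population standard deviation, as `StandardScaler` does.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (hypothetical): 0 = Healthy (reference), 1 = Unhealthy
unhealthy = rng.integers(0, 2, size=n)
bmi = rng.normal(22.0, 1.5, size=n)
age = rng.normal(45.0, 10.0, size=n)
true_shift = 0.5  # assumed group difference in raw ∆BMI units
delta_bmi = true_shift * unhealthy + 0.1 * bmi + rng.normal(0.0, 0.1, size=n)

def zscore(x):
    """Z-score with the population SD (ddof=0)."""
    return (x - x.mean()) / x.std()

# Design matrix: intercept, treatment-coded dummy, z-scored covariates
X = np.column_stack([np.ones(n), unhealthy, zscore(bmi), zscore(age)])
y = zscore(delta_bmi)

# Ordinary least squares fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b_unhealthy = beta[1]
print(b_unhealthy)
```

Because the outcome is z-scored, `b_unhealthy` is the adjusted Unhealthy-minus-Healthy difference expressed in standard deviations of ∆BMI. Multiplying it by `delta_bmi.std()` returns the difference on the original scale.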

In [None]:
tempDF1 = bmiDF
tempDF2 = metabDF
tempL1 = ['Normal', 'Obese']
tempL2 = ['MetBMI', 'ProtBMI', 'ChemBMI', 'CombiBMI']

t_start = time.time()
tempD1 = {}
for bmi_class in tempL1:
    #Processing for OLS linear regression
    ##Gather all necessary variables into a single DF
    tempS = tempDF2['Metabolically']
    tempDF = pd.merge(tempDF1, tempS, left_index=True, right_index=True, how='left')
    ##Select the target participants
    tempDF = tempDF.loc[tempDF['BMI_class']==bmi_class]
    ##Drop NaN in the metabolic health condition
    tempDF = tempDF.dropna()
    ##Z-score transformation
    tempDF3 = tempDF.select_dtypes(include=[np.number])
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    tempA = scaler.fit_transform(tempDF3)#Column direction
    tempDF3 = pd.DataFrame(data=tempA, index=tempDF3.index, columns=tempDF3.columns)
    ###Recover the categorical variables
    tempDF4 = tempDF.select_dtypes(exclude=[np.number])
    tempDF = pd.merge(tempDF3, tempDF4, left_index=True, right_index=True, how='left')
    ##Add a constant for the intercept
    ###–> The statsmodels formula API adds the intercept automatically, as R does.
    ##Sort so that Healthy comes first (bcoef coding: Healthy = 0, Unhealthy = 1)
    tempDF = tempDF.sort_values(by='Metabolically', ascending=True)
    ##One-hot encoding for categorical covariates
    ###–> The statsmodels formula API encodes categorical variables automatically.
    
    tempD2 = {}
    for bbmi in tempL2:
        #OLS linear regression
        ##Fit univariate model
        formula = 'Delta'+bbmi+' ~ C(Metabolically)'
        fit_res1 = smf.ols(formula, data=tempDF).fit()
        ##Fit full model
        formula = 'Delta'+bbmi+' ~ C(Metabolically)'\
            '+ BMI + C(Sex) + Age + PC1 + PC2 + PC3 + PC4 + PC5'
        fit_res2 = smf.ols(formula, data=tempDF).fit()
        
        #Summarize the result
        tempDF3 = pd.DataFrame({'DeltaBMI':[bbmi]})
        ##Save the sample size for each group
        tempDF3['N'] = len(tempDF)
        tempDF3['nHealthy'] = len(tempDF.loc[tempDF['Metabolically']=='Healthy'])
        tempDF3['nUnhealthy'] = len(tempDF.loc[tempDF['Metabolically']=='Unhealthy'])
        ##Save R2 [%]
        tempDF3['UnivarR2'] = fit_res1.rsquared*100
        tempDF3['R2'] = fit_res2.rsquared*100
        ##Save beta-coefficient of the target variable
        tempDF3['Bcoef'] = fit_res2.params['C(Metabolically)[T.Unhealthy]']
        tempDF3['BcoefSE'] = fit_res2.bse['C(Metabolically)[T.Unhealthy]']
        ##Save t-statistic of the target variable
        tempDF3['tStat'] = fit_res2.tvalues['C(Metabolically)[T.Unhealthy]']
        ##Save residual degrees of freedom
        tempDF3['DoF'] = int(fit_res2.df_resid)
        ##Save P-value of the target variable
        tempDF3['Pval'] = fit_res2.pvalues['C(Metabolically)[T.Unhealthy]']
        
        tempD2[bbmi] = tempDF3
    
    #Clean the results (pd.DataFrame) across bBMIs
    tempDF = pd.concat(list(tempD2.values()), axis=0)
    tempDF['BMIclass'] = bmi_class
    
    tempD1[bmi_class] = tempDF
t_elapsed = time.time() - t_start
print('Elapsed time for',
      len(tempL1)*len(tempL2), 'OLS linear regressions (',
      len(tempL1), 'BMI classes x',
      len(tempL2), 'bBMIs):',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean the results (pd.DataFrame) across BMI classes
tempDF = pd.concat(list(tempD1.values()), axis=0)
##Clean the column order by setting index
tempDF = tempDF.set_index(['BMIclass', 'DeltaBMI'])

#P-value adjustment (across BMI classes within each bBMI) by using Benjamini–Hochberg method
tempD = {}
for bbmi in tempL2:
    tempL = [(bmi_class, bbmi) for bmi_class in tempL1]
    tempS = tempDF['Pval'].loc[tempL]
    tempA = multi.multipletests(tempS, alpha=0.05, method='fdr_bh',
                                is_sorted=False, returnsorted=False)[1]
    tempS = pd.Series(tempA, index=tempS.index, name='AdjPval_within')
    tempD[bbmi] = tempS
tempS = pd.concat(list(tempD.values()), axis=0)
tempDF = pd.merge(tempDF, tempS, left_index=True, right_index=True, how='left')

#P-value adjustment (across all tests) by using Benjamini–Hochberg method
tempDF['AdjPval_all'] = multi.multipletests(tempDF['Pval'], alpha=0.05, method='fdr_bh',
                                            is_sorted=False, returnsorted=False)[1]

display(tempDF)

#Save
fileDir = './ExportData/'
ipynbName = '220804_Multiomics-BMI-NatMed1stRevision_DeltaBMI-misclassification_ClinicalDefinition-ver2_'
fileName = 'result-summary.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

resDF = tempDF
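
The Benjamini–Hochberg adjustment applied above via `multipletests(..., method='fdr_bh')` can be sketched in plain NumPy as follows. The function name `benjamini_hochberg` and the example P-values are hypothetical; this is an illustration of the procedure, not a replacement for the statsmodels call.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    # Scale each sorted P-value by m / rank
    ranked = pvals[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest P-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

# Hypothetical P-values, for illustration only
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(adj)
```

The adjusted value depends on the number of tests in the family, which is why `AdjPval_within` (two tests per bBMI) and `AdjPval_all` (all eight tests) can differ for the same raw P-value.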

### 2-2. Visualization

In [None]:
tempD1 = {'MetBMI':'b', 'ProtBMI':'r', 'ChemBMI':'g', 'CombiBMI':'m'}
tempD2 = {'Healthy':'0.8', 'Unhealthy':'crimson'}
tempL1 = ['Normal', 'Obese']
tempDF1 = bmiDF
tempDF2 = metabDF
tempDF3 = resDF

#Prepare DF
tempS = tempDF2['Metabolically']
tempDF = pd.merge(tempDF1, tempS, left_index=True, right_index=True, how='left')
##Select the target participants
tempDF = tempDF.loc[tempDF['BMI_class'].isin(tempL1)]
##Drop NaN in the metabolic health condition
tempDF = tempDF.dropna()

#Check sample size
print('N (total):', len(tempDF))
print(' - BMI class:', tempDF['BMI_class'].value_counts().sort_index(ascending=True).to_dict())
for bmi_class in tempL1:
    tempDF1 = tempDF.loc[tempDF['BMI_class']==bmi_class]
    print('   - '+bmi_class+' BMI class - Metabolic condition:',
          tempDF1['Metabolically'].value_counts().sort_index(ascending=True).to_dict())

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD1),
                         figsize=(7.1, 3), sharex=True, sharey=True,
                         gridspec_kw={'width_ratios':[1, 1, 1, 1]})
axis_ymin = -37.5
axis_ymax = 50
ymin = -30
ymax = 45
yinter = 15
margin = 0.49
#Set shared axis range
plt.setp(axes, ylim=(axis_ymin, axis_ymax), yticks=np.arange(ymin, ymax+yinter/10, yinter))
plt.setp(axes, xlim=(0-margin, len(tempD2)-1+margin))#To eliminate excess white space
for ax_i, ax in enumerate(axes.flat):
    bbmi = list(tempD1.keys())[ax_i]
    sns.boxplot(data=tempDF, y='Delta'+bbmi, x='BMI_class', order=tempL1,
                hue='Metabolically', hue_order=tempD2.keys(), dodge=True, palette=tempD2,
                showfliers=False,#flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4},
                showcaps=True, notch=True, ax=ax)
    #Axis setting
    if ax_i==0:
        plt.setp(ax, xlabel='', ylabel=r'$\Delta$'+'BMI [% BMI]')
    else:
        plt.setp(ax.get_yticklabels(), visible=False)
        plt.setp(ax, xlabel='', ylabel='')
    sns.despine()
    plt.setp(ax.get_xticklabels(), rotation=70,
             horizontalalignment='right', verticalalignment='center', rotation_mode='anchor')
    #P-value annotation
    lines = ax.get_lines()#Line2D: [[Q1, Q1-1.5IQR], [Q3, Q3+1.5IQR], [Q1, Q1], [Q3, Q3], [Med, Med], [flier]]
    lines_unit = 5 + int(False)#showfliers=False
    for class_i in range(len(tempL1)):
        #Healthy
        whisker_0 = lines[class_i*lines_unit*len(tempD2) + lines_unit*0 + 1]
        xcoord_0 = whisker_0._x[1]#Q3+1.5IQR
        ycoord_0 = whisker_0._y[1]#Q3+1.5IQR
        #Unhealthy
        whisker_1 = lines[class_i*lines_unit*len(tempD2) + lines_unit*1 + 1]
        xcoord_1 = whisker_1._x[1]#Q3+1.5IQR
        ycoord_1 = whisker_1._y[1]#Q3+1.5IQR
        #Standard point for annotation
        xcoord = (xcoord_0+xcoord_1)/2
        ycoord = max(ycoord_0, ycoord_1)
        #Add annotation lines
        aline_offset = yinter/5
        aline_length = yinter/5 + aline_offset/2
        ax.plot([xcoord_0, xcoord_0, xcoord_1, xcoord_1],
                [ycoord+aline_offset, ycoord+aline_length, ycoord+aline_length, ycoord+aline_offset],
                lw=1.5, c='k')
        #Retrieve P-value
        bmi_class = tempL1[class_i]
        pval = tempDF3.loc[(bmi_class, bbmi), 'AdjPval_all']
        if pval<0.001:
            label = '***'
        elif pval<0.01:
            label = '**'
        elif pval<0.05:
            label = '*'
        else:
            pval_text = str(Decimal(pval).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))
            label = r'$P$'+' = '+pval_text
        #Add annotation text
        if label in ['***', '**', '*']:
            text_offset = yinter/12
            text_size = 'medium'
        else:
            text_offset = yinter/3
            text_size = 'x-small'
        ax.annotate(label, xy=(xcoord, ycoord+text_offset),
                    horizontalalignment='center', verticalalignment='bottom',
                    fontsize=text_size, color='k')
    #Facet settings
    ax.set_title(bbmi, {'fontsize':'medium'})
    xoff = 0.025
    yoff = 0.01
    rect = plt.Rectangle((xoff, 1+yoff), 1-xoff, 0.15,#Manual adjustment
                         transform=ax.transAxes, facecolor=tempD1[bbmi], alpha=0.3,
                         clip_on=False, linewidth=0, zorder=0.5)
    ax.add_patch(rect)
    #Change the default boxplot settings
    for line in lines:
        line.set_color('k')
    for box in ax.artists:
        box.set_edgecolor('k')
    #Legend
    if ax_i==len(tempD1)-1:
        ax.legend(title='Metabolic\ncondition', title_fontsize='medium', fontsize='medium',
                  bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=0.5,
                  handlelength=1.5, handletextpad=0.5)
    else:
        ax.get_legend().remove()
##Save
fileDir = './ExportFigures/'
ipynbName = '220804_Multiomics-BMI-NatMed1stRevision_DeltaBMI-misclassification_ClinicalDefinition-ver2_'
fileName = 'DeltaBMI-all.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()
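
The P-value annotation rule used in the figure can be isolated as a small helper, sketched below. The function name `pvalue_label` is hypothetical, and the plain-text `P = ` prefix stands in for the mathtext label used in the plot.

```python
from decimal import Decimal, ROUND_HALF_UP

def pvalue_label(pval):
    """Map an adjusted P-value to a star label or a two-decimal text label."""
    if pval < 0.001:
        return '***'
    if pval < 0.01:
        return '**'
    if pval < 0.05:
        return '*'
    # Round half up to two decimals (the built-in round() rounds half to even)
    rounded = Decimal(pval).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
    return 'P = ' + str(rounded)

for p in [0.0005, 0.005, 0.02, 0.125, 0.5]:
    print(p, '->', pvalue_label(p))
```

Note that `Decimal(pval)` uses the exact binary value of the float, so a P-value that is not exactly representable may round differently from its printed decimal form.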

# — End of notebook —