# Multiomics BMI Paper — Beta-coefficients of WHtR LASSO Models

***by Kengo Watanabe***  

This Jupyter Notebook (Python 3 kernel) visualizes the beta-coefficients of the blood omics-based WHtR LASSO models and assesses the relationships between WHtR and the retained variables using regression analysis in the Arivale cohort.  

Input files:  
* Arivale baseline WHtR: 220621_Multiomics-BMI-NatMedRevision_WHtR-DataCleaning_baseline-WHtR-final-cohort.tsv  
* Arivale baseline blood omics (preprocessed): 210104_Biological-BMI-paper_RF-imputation_baseline-\[metDF/protDF/chemDF/combiDF\]-with-RF-imputation.tsv  
* Arivale baseline WHtR predictions: 220822_Multiomics-BMI-NatMed1stRevision_WHtR-baseline-LASSO-ver2_\[MetWHtR/ProtWHtR/ChemWHtR/CombiWHtR\]-\[Female/Male/BothSex\].tsv  
* Biological WHtR models: 220822_Multiomics-BMI-NatMed1stRevision_WHtR-baseline-LASSO-ver2_\[MetWHtR/ProtWHtR/ChemWHtR/CombiWHtR\]-\[Female/Male/BothSex\]-LASSObcoefs.tsv  

Output figures and tables:  
* Supplementary Figure 7j, 7k  
* Tables for Supplementary Data 9  

Original notebook (memo for my future tracing):  
* dalek:\[JupyterLab HOME\]/220621_Multiomics-BMI-NatMedRevision/220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2.ipynb  

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#For Arial font
#!conda install -c conda-forge -y mscorefonts
##-> The below was also needed in matplotlib 3.4.2
#import shutil
#import matplotlib
#shutil.rmtree(matplotlib.get_cachedir())
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display
import time

from sklearn.preprocessing import StandardScaler
#!conda install -c conda-forge -y matplotlib-venn
from matplotlib_venn import venn3, venn3_circles, venn2, venn2_circles
import statsmodels.formula.api as smf
from statsmodels.stats import multitest as multi
import matplotlib.patches as mpatches

!conda list


## 1. Data preparation

### 1-1. Import the cleaned dataframes

In [None]:
#Import the baseline WHtR dataframe
fileDir = './ExportData/'
ipynbName = '220621_Multiomics-BMI-NatMedRevision_WHtR-DataCleaning_'
fileName = 'baseline-WHtR-final-cohort.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('public_client_id')
##Take WHtR and general covariates (without Race in this study)
tempL = ['log_BaseWHtR', 'log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
tempDF = tempDF[tempL]

display(tempDF)

whtrDF = tempDF

In [None]:
tempDF1 = whtrDF

#Import the cleaned baseline omics dataframes
fileDir = '../210104_Biological-BMI-paper/ExportData/'
ipynbName = '210104_Biological-BMI-paper_RF-imputation_'
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'Race']
tempD = {}
for df_n in ['metDF', 'chemDF', 'protDF', 'combiDF']:
    fileName = 'baseline-'+df_n+'-with-RF-imputation.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
    tempDF = tempDF.set_index('public_client_id')
    print(df_n+' original shape:', tempDF.shape)
    #Drop BMI and covariates
    tempDF = tempDF.drop(columns=tempL)
    #Extract the individuals having WHtR
    tempDF = tempDF.loc[tempDF1.index.tolist()]
    display(tempDF)
    tempD[df_n] = tempDF

metDF = tempD['metDF']
chemDF = tempD['chemDF']
protDF = tempD['protDF']
combiDF = tempD['combiDF']

### 1-2. Stratification with sex

In [None]:
#Stratify the cohort with sex
whtrDF_F = whtrDF.loc[whtrDF['Sex']=='F']
whtrDF_M = whtrDF.loc[whtrDF['Sex']=='M']
whtrDF_B = whtrDF#Alias of the same object, not a copy
print('Female, Male, Both sex = ', len(whtrDF_F), ', ', len(whtrDF_M), ', ', len(whtrDF_B))

## 2. Metabolomics

### 2-1. Prepare and standardize variables

> Variables, including bWHtR (as a reference) and covariates, are standardized for OLS linear regression.  
> Note: in this version, unstandardized WHtR is used as the dependent variable. WHtR in the following DFs is nevertheless standardized along with the other columns (purely to keep the code simple) and is replaced with the unstandardized values before the regressions.  
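
The standardization step described above can be sketched on a toy table. This is a minimal illustration, not the notebook's own code: the values are made up, and plain pandas arithmetic with `ddof=0` stands in for `StandardScaler`, which also divides by the population SD by default.

```python
import pandas as pd

# Hypothetical toy cohort; column names mirror the notebook, values are made up
toy = pd.DataFrame(
    {
        "log_BaseWHtR": [-0.70, -0.55, -0.62, -0.48],
        "BaseAge": [35.0, 52.0, 44.0, 61.0],
        "Sex": ["F", "M", "F", "M"],
    },
    index=pd.Index(["01", "02", "03", "04"], name="public_client_id"),
)

# Z-score every numeric column (ddof=0 gives the population SD)
numeric = toy.select_dtypes(include="number")
zscored = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Re-attach the categorical covariates, which were left out of the scaling
standardized = zscored.join(toy.select_dtypes(exclude="number"))

# Put the unstandardized dependent variable back before any regression
standardized["log_BaseWHtR"] = toy["log_BaseWHtR"]
print(standardized)
```

Scaling all numeric columns at once and then overwriting the dependent variable keeps the preparation loop short, which appears to be the trade-off the note above refers to.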

In [None]:
tempDF = metDF
tempD1 = {'Female':whtrDF_F, 'Male':whtrDF_M, 'BothSex':whtrDF_B}
bwhtr = 'MetWHtR'
tempD2 = {}
for sex in tempD1.keys():
    #Add covariates (and WHtR) while selecting the cohort
    tempDF1 = tempD1[sex]
    tempDF1 = pd.merge(tempDF, tempDF1, left_index=True, right_index=True, how='inner')
    
    #Add biological WHtR
    fileDir = './ExportData/'
    ipynbName = '220822_Multiomics-BMI-NatMed1stRevision_WHtR-baseline-LASSO-ver2_'
    fileName = bwhtr+'-'+sex+'.tsv'
    tempDF2 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id':str})
    tempDF2 = tempDF2.set_index('public_client_id')
    tempDF1['log_Base'+bwhtr] = tempDF2['log_Base'+bwhtr]
    
    #Z-score transformation
    tempDF2 = tempDF1.select_dtypes(include=[np.number])
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    tempA = scaler.fit_transform(tempDF2)#Column direction
    tempDF2 = pd.DataFrame(data=tempA, index=tempDF2.index, columns=tempDF2.columns)
    ##Recover categorical covariates
    tempDF3 = tempDF1.select_dtypes(exclude=[np.number])
    tempDF1 = pd.merge(tempDF2, tempDF3, left_index=True, right_index=True, how='left')
    
    print(sex)
    display(tempDF1.describe(include='all'))
    tempD2[sex] = tempDF1

metDF_F = tempD2['Female']
metDF_M = tempD2['Male']
metDF_B = tempD2['BothSex']

### 2-2. Correlation b/w all pairwise variables

In [None]:
#Female and male cohorts
tempD1 = {'Female':metDF_F, 'Male':metDF_M}
bwhtr = 'MetWHtR'
tempD2 = {}
for sex in tempD1.keys():
    print(sex+' cohort:')
    tempDF = tempD1[sex]
    #Compute correlation matrix
    tempL = [item for sublist in [whtrDF.columns.tolist(), ['log_Base'+bwhtr]] for item in sublist]
    tempDF1 = tempDF.drop(columns=tempL)
    tempDF1 = tempDF1.corr(method='pearson')
    print('• Combinations:', int(len(tempDF1)*(len(tempDF1)-1)/2))
    
    #Extract lower triangle matrix
    tempDF1 = tempDF1.where(np.tril(np.ones(tempDF1.shape), k=-1).astype(bool), other=np.nan)#np.bool was removed in NumPy 1.24
    tempDF1.index.rename('Variable1', inplace=True)
    tempDF1 = tempDF1.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
    tempDF1 = tempDF1.dropna()
    print('• nrows:', len(tempDF1))
    print('• |Pearson\'s r| > 0.8:', len(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8]))
    display(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8])
    
    tempD2[sex] = tempDF1

#Distribution
tempD3 = {'Female':'tab:red', 'Male':'tab:blue'}
sns.set(style='ticks', font='Arial', context='notebook')
plt.figure(figsize=(4, 3))
for sex in tempD2.keys():
    tempS = tempD2[sex]['Pearson_r']
    sns.distplot(tempS, color=tempD3[sex], label=sex).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
sns.despine()
plt.xlabel('Pearson\'s '+r'$r$')
plt.ylabel('Density')
plt.legend(loc='upper right')
plt.show()
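
The lower-triangle extraction used in this section can be checked on a small example. The sketch below uses random data and hypothetical variable names (`v1` to `v4`); it only illustrates that masking with `np.tril(..., k=-1)` leaves exactly one row per unordered pair.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 50 samples of 4 made-up variables
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["v1", "v2", "v3", "v4"])

corr = toy.corr(method="pearson")

# Keep only entries strictly below the diagonal (k=-1 drops the diagonal itself)
mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
lower = corr.where(mask)
lower.index.name = "Variable1"

# Long format: one row per variable pair
pairs = (
    lower.reset_index()
    .melt(id_vars="Variable1", var_name="Variable2", value_name="Pearson_r")
    .dropna()
)

n = corr.shape[0]
print(len(pairs), "pairs; expected", n * (n - 1) // 2)
```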

In [None]:
#Both sex cohort
tempDF = metDF_B
bwhtr = 'MetWHtR'
bwhtr_color = 'b'

print('Both sex cohort:')
#Compute correlation matrix
tempL = [item for sublist in [whtrDF.columns.tolist(), ['log_Base'+bwhtr]] for item in sublist]
tempDF = tempDF.drop(columns=tempL)
tempDF = tempDF.corr(method='pearson')
print('• Combinations:', int(len(tempDF)*(len(tempDF)-1)/2))

#Extract lower triangle matrix
tempDF = tempDF.where(np.tril(np.ones(tempDF.shape), k=-1).astype(bool), other=np.nan)#np.bool was removed in NumPy 1.24
tempDF.index.rename('Variable1', inplace=True)
tempDF = tempDF.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
tempDF = tempDF.dropna()
print('• nrows:', len(tempDF))
print('• |Pearson\'s r| > 0.8:', len(tempDF.loc[abs(tempDF['Pearson_r'])>0.8]))
display(tempDF.loc[abs(tempDF['Pearson_r'])>0.8])

#Distribution
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(4, 3))
sns.distplot(tempDF['Pearson_r'], color=bwhtr_color).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
plt.axvline(x=tempDF['Pearson_r'].max(), **{'linestyle':'--', 'color':bwhtr_color})
plt.axvline(x=tempDF['Pearson_r'].min(), **{'linestyle':'--', 'color':bwhtr_color})
sns.despine()
plt.ylabel('Density')
plt.xlabel('Pearson\'s '+r'$r$')
plt.show()

### 2-3. Prepare the beta-coefficients of LASSO models

In [None]:
bwhtr = 'MetWHtR'
tempD = {}
for sex in ['Female', 'Male', 'BothSex']:
    #Import the LASSO beta-coefficients
    fileDir = './ExportData/'
    ipynbName = '220822_Multiomics-BMI-NatMed1stRevision_WHtR-baseline-LASSO-ver2_'
    fileName = bwhtr+'-'+sex+'-LASSObcoefs.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Variable')
    tempDF = tempDF.drop(index=['Intercept'])
    tempD[sex] = tempDF
    
    #Check
    print(sex+':')
    print(' - Variables:', len(tempDF))
    tempDF1 = tempDF.loc[tempDF['nZeros']!=10]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient in all 10 models:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF1.sort_values(by='Mean', ascending=False)
    display(tempDF1)
    print('')

metWHtR_F_bcoefs = tempD['Female']
metWHtR_M_bcoefs = tempD['Male']
metWHtR_B_bcoefs = tempD['BothSex']
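
For orientation, the `Mean`, `SD`, and `nZeros` columns read above could be derived from the per-model coefficients roughly as follows. This is a sketch under assumptions: the per-model column names (`LASSO_1` to `LASSO_10`), the toy values, and the use of the sample SD are all hypothetical, since the upstream notebook that wrote these files is not shown here.

```python
import pandas as pd

# Hypothetical coefficients of 3 variables across 10 fitted LASSO models
coefs = pd.DataFrame(
    [[0.02] * 10, [0.01] * 4 + [0.0] * 6, [0.0] * 10],
    index=pd.Index(["varA", "varB", "varC"], name="Variable"),
    columns=[f"LASSO_{i}" for i in range(1, 11)],
)

summary = coefs.copy()
summary["Mean"] = coefs.mean(axis=1)
summary["SD"] = coefs.std(axis=1)  # sample SD (ddof=1); an assumption
summary["nZeros"] = (coefs == 0).sum(axis=1)  # models that dropped the variable

# nZeros != 10: retained in at least one model; nZeros == 0: retained in all
retained_any = summary.index[summary["nZeros"] != 10].tolist()
retained_all = summary.index[summary["nZeros"] == 0].tolist()
print(retained_any, retained_all)
```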

### 2-4. Variables retained in all 10 models

In [None]:
tempDF1 = metWHtR_F_bcoefs
tempDF2 = metWHtR_M_bcoefs
tempDF3 = metWHtR_B_bcoefs
category = 'Metabolomics'

#Variables with non-zero beta-coefficient in all 10 models
tempS1 = set(tempDF1.loc[tempDF1['nZeros']==0].index)
tempS2 = set(tempDF2.loc[tempDF2['nZeros']==0].index)
tempS3 = set(tempDF3.loc[tempDF3['nZeros']==0].index)

#Robust variables shared across the models
var_111 = tempS1 & tempS2 & tempS3
print('Common variables with non-zero beta-coefficient in all 10 models:\n', var_111)
var_110 = (tempS1 & tempS2) - var_111
var_101 = (tempS1 & tempS3) - var_111
var_011 = (tempS2 & tempS3) - var_111
var_100 = tempS1 - (var_110 | var_101 | var_111)
var_010 = tempS2 - (var_110 | var_011 | var_111)
var_001 = tempS3 - (var_101 | var_011 | var_111)

#The number of each subset
var_subset = [len(var_100), len(var_010), len(var_110), len(var_001),
              len(var_101), len(var_011), len(var_111)]
print('Each subset:', var_subset)

#Venn diagram
sns.set(font='Arial', context='talk')
venn3(subsets=var_subset, set_labels=('Female', 'Male', 'Both sex'),
      set_colors=('tab:red', 'tab:blue', 'tab:green'), alpha=0.4)
venn3_circles(subsets=var_subset)
plt.title('Robust variables\n—'+category+'—', fontdict={'fontsize':24})
plt.show()
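
The seven region sizes passed to `venn3` follow the order (100, 010, 110, 001, 101, 011, 111). A minimal sketch with made-up sets shows the same set algebra and confirms that the exclusive regions partition the union; the set contents are purely illustrative.

```python
# Hypothetical robust-variable sets for the three models
female = {"a", "b", "c", "d"}
male = {"b", "c", "e"}
both = {"c", "d", "e", "f"}

# Exclusive regions: each element lands in exactly one of them
only_111 = female & male & both
only_110 = (female & male) - both
only_101 = (female & both) - male
only_011 = (male & both) - female
only_100 = female - male - both
only_010 = male - female - both
only_001 = both - female - male

# Same ordering as the var_subset list above
subset_sizes = [len(s) for s in (only_100, only_010, only_110, only_001,
                                 only_101, only_011, only_111)]
print(subset_sizes)

# The region sizes should add up to the size of the union
total = len(female | male | both)
```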

In [None]:
tempDF1 = metWHtR_F_bcoefs
tempDF2 = metWHtR_M_bcoefs
tempDF3 = metWHtR_B_bcoefs
bwhtr_color = 'b'

#Variables with non-zero beta-coefficient in all 10 models
tempS1 = set(tempDF1.loc[tempDF1['nZeros']==0].index)
tempS2 = set(tempDF2.loc[tempDF2['nZeros']==0].index)
tempS3 = set(tempDF3.loc[tempDF3['nZeros']==0].index)

#Union of the robust variables
tempL = list(tempS1 | tempS2 | tempS3)
print('Union of the variables with non-zero beta-coefficient in all 10 models:', len(tempL))

#Merge (use the original bcoefs to also include variables with zeros in some of the 10 models)
tempDF1 = tempDF1.loc[tempDF1.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF2 = tempDF2.loc[tempDF2.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF3 = tempDF3.loc[tempDF3.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF = pd.concat([tempDF1, tempDF2, tempDF3], axis=0)
tempDF['Model'] = np.repeat(['Female', 'Male', 'Both sex'], [len(tempDF1), len(tempDF2), len(tempDF3)])

#tidyr::gather in R
tempDF1 = tempDF.reset_index().melt(var_name='LASSO', value_name='bcoef', id_vars=['Variable', 'Model'])

#Order for plot
tempDF2 = tempDF1.loc[tempDF1['Model']=='Both sex'].groupby('Variable', as_index=True).agg({'bcoef':'mean'})
tempDF2 = tempDF2.sort_values(by='bcoef', ascending=False)

#Plot
sns.set(style='ticks', font='Arial', context='talk')
sns.catplot(data=tempDF1, y='Variable', order=tempDF2.index.tolist(), x='bcoef', kind='box',
            hue='Model', palette='Set1', dodge=True, height=35, aspect=0.35,
            showfliers=True, flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4})
plt.grid(axis='x', linestyle='--', color='black')
for row_i in range(len(tempDF2)):
    if row_i%2 == 0:
        plt.axhspan(ymin=row_i-0.5, ymax=row_i+0.5, facecolor=bwhtr_color, alpha=0.2, zorder=0)
plt.xlabel(r'$\beta$'+'-coefficient in LASSO model')
plt.ylabel('')
plt.show()
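
The wide-to-long reshaping (`melt`, the pandas counterpart of `tidyr::gather`) and the mean-based ordering used for the plot can be sketched as below. The variable names, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per variable, one column per LASSO model
wide = pd.DataFrame(
    {"LASSO_1": [0.01, -0.02, 0.03], "LASSO_2": [0.03, -0.04, 0.03]},
    index=pd.Index(["varA", "varB", "varC"], name="Variable"),
)

# Long format: one row per (variable, model) pair, as the box plots expect
long = wide.reset_index().melt(id_vars="Variable", var_name="LASSO", value_name="bcoef")

# Plot order: variables sorted by their mean coefficient, largest first
order = (
    long.groupby("Variable")["bcoef"].mean()
    .sort_values(ascending=False)
    .index.tolist()
)
print(order)
```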

In [None]:
#Both sex model
tempDF = metWHtR_B_bcoefs
bwhtr_color = 'b'
bwhtr = 'MetWHtR'
sex = 'BothSex'

#Variables with non-zero beta-coefficient in all 10 models
tempDF = tempDF.loc[tempDF['nZeros']==0].sort_values(by='Mean', ascending=False)
tempDF = tempDF.drop(columns=['Mean', 'SD', 'nZeros'])
print('Variables with non-zero beta-coefficient in all 10 models:', len(tempDF))

#tidyr::gather in R
tempDF1 = tempDF.reset_index().melt(var_name='LASSO', value_name='bcoef', id_vars=['Variable'])

#Plot
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(5, 20))
p = sns.boxplot(data=tempDF1, y='Variable', x='bcoef', color=bwhtr_color, dodge=False, saturation=1,
                showfliers=True, flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4},
                showcaps=True, notch=False)
p.set(xlim=(-0.046, 0.046), xticks=np.arange(-0.04, 0.041, 0.02))#Fixed across omics
p.grid(axis='x', linestyle='--', color='black')
sns.despine()
##Change default dull line color of sns.boxplot (saturation parameter is for patch)
for line in p.get_lines():
    line.set_color('k')
for box in list(p.artists)+list(p.patches):#Boxes appear in p.patches under newer matplotlib
    box.set_edgecolor('k')
##Add background color
for row_i in range(len(tempDF)):
    if row_i%2 == 0:
        plt.axhspan(ymin=row_i-0.5, ymax=row_i+0.5, facecolor=bwhtr_color, alpha=0.2, zorder=0)
plt.ylabel('')
plt.xlabel(r'$\beta$'+'-coefficient in LASSO model\n [log-scaled WHtR per s.d.]')
##Save
fileDir = './ExportFigures/'
ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
fileName = bwhtr+'-'+sex+'-bcoef_non-zero-in-all.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

### 2-5. Correlation b/w pairwise variables retained in all 10 models

In [None]:
#Both sex model
tempDF1 = metWHtR_B_bcoefs
tempDF2 = metDF_B
bwhtr_color = 'b'

print('Both sex model:')
#Variables with non-zero beta-coefficient in all 10 models
tempL = tempDF1.loc[tempDF1['nZeros']==0].index.tolist()

#Compute correlation matrix
tempDF = tempDF2[tempL]
tempDF = tempDF.corr(method='pearson')
print('• Combinations:', int(len(tempDF)*(len(tempDF)-1)/2))

#Extract lower triangle matrix
tempDF = tempDF.where(np.tril(np.ones(tempDF.shape), k=-1).astype(bool), other=np.nan)#np.bool was removed in NumPy 1.24
tempDF.index.rename('Variable1', inplace=True)
tempDF = tempDF.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
tempDF = tempDF.dropna()
print('• nrows:', len(tempDF))
print('• |Pearson\'s r| > 0.8:', len(tempDF.loc[abs(tempDF['Pearson_r'])>0.8]))
display(tempDF.loc[abs(tempDF['Pearson_r'])>0.8])

#Distribution
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(4, 3))
sns.distplot(tempDF['Pearson_r'], color=bwhtr_color).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
plt.axvline(x=tempDF['Pearson_r'].max(), **{'linestyle':'--', 'color':bwhtr_color})
plt.axvline(x=tempDF['Pearson_r'].min(), **{'linestyle':'--', 'color':bwhtr_color})
sns.despine()
plt.ylabel('Density')
plt.xlabel('Pearson\'s '+r'$r$')
plt.show()

print('Variables with non-zero beta-coefficient in all 10 models:', len(tempL))
#Clustermap
tempDF = tempDF2[tempL]
tempDF = tempDF.corr(method='pearson')
sns.set(style='ticks', font='Arial', context='notebook')
cmap = sns.diverging_palette(220, 20, as_cmap=True)
cm = sns.clustermap(tempDF, method='ward', metric='euclidean', cmap=cmap,
                    row_cluster=True, col_cluster=True, row_linkage=None, col_linkage=None,
                    row_colors=None, col_colors=None, xticklabels=True, yticklabels=True,
                    dendrogram_ratio=(0.1, 0.1), cbar_pos=(0.9, 0.02, 0.02, 0.1),
                    figsize=(15, 15), **{'center':0, 'vmin':-1, 'vmax':1})
cm.cax.set_title('Pearson\'s '+r'$r$')
hm = cm.ax_heatmap.get_position()
rd = cm.ax_row_dendrogram.get_position()
cd = cm.ax_col_dendrogram.get_position()
cm.ax_heatmap.set_position([hm.x0, hm.y0, hm.width, hm.height])
cm.ax_row_dendrogram.set_position([rd.x0+rd.width*0.5, rd.y0, rd.width*0.5, rd.height])
cm.ax_col_dendrogram.set_position([cd.x0, cd.y0, cd.width, cd.height*0.5])
plt.show()

### 2-6. Explained variance in WHtR by the variables retained in ≥1 model

> OLS linear regression model for significance: log(WHtR) ~ b0 + b1\*Variable + b2\*Sex + b3\*BaseAge + b4\*PCs  
> Univariate model: log(WHtR) ~ b0 + b1\*Variable  
>
> Of note, biological WHtR is included for comparison.  
> As in the LASSO models, unstandardized WHtR is used as the dependent variable.  
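
To make the two models concrete, the sketch below fits a univariate and a covariate-adjusted least-squares model on synthetic data and compares their R². It is illustrative only: the data are simulated, a single made-up covariate (`age`) stands in for Sex, BaseAge, and the PCs, and `numpy.linalg.lstsq` replaces the `smf.ols` formula interface that the notebook itself uses.

```python
import numpy as np

# Synthetic data; effect sizes are arbitrary
rng = np.random.default_rng(0)
n = 200
variable = rng.normal(size=n)
age = rng.normal(size=n)
y = 0.5 * variable + 0.3 * age + rng.normal(scale=0.5, size=n)

def fit_ols(y, *predictors):
    """Least-squares fit with an intercept; returns (R2, coefficients)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot, beta

r2_uni, beta_uni = fit_ols(y, variable)        # y ~ b0 + b1*Variable
r2_full, beta_full = fit_ols(y, variable, age)  # y ~ b0 + b1*Variable + b2*age
print(f"Univariate R2: {r2_uni*100:.1f}%, adjusted-model R2: {r2_full*100:.1f}%")
```

Because the univariate model is nested in the adjusted one, its R² can never exceed that of the adjusted model.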

In [None]:
tempD1 = {'Female':metWHtR_F_bcoefs, 'Male':metWHtR_M_bcoefs, 'BothSex':metWHtR_B_bcoefs}
tempD2 = {'Female':metDF_F, 'Male':metDF_M, 'BothSex':metDF_B}
tempD3 = {'Female':whtrDF_F, 'Male':whtrDF_M, 'BothSex':whtrDF_B}
yvar = 'WHtR'
bwhtr = 'MetWHtR'
topX = 30
for sex in tempD1.keys():
    print(sex+' model:')
    
    #Variables with non-zero beta-coefficient in at least 1 of the 10 models
    tempDF = tempD1[sex]
    tempL = tempDF.loc[tempDF['nZeros']==10].index.tolist()#To be excluded
    
    #Prepare DF for regressions
    tempDF = tempD2[sex].drop(columns=tempL)
    tempDF = tempDF.rename(columns={'log_Base'+bwhtr:bwhtr+' model'})
    #Replace the standardized dependent variable with the unstandardized
    tempDF1 = tempD3[sex]
    tempDF['log_Base'+yvar] = tempDF1['log_Base'+yvar]
    
    #Independent variables for regressions (including bWHtR)
    tempL = tempDF.drop(columns=whtrDF.columns.tolist()).columns.tolist()
    
    #Perform OLS linear regression
    tempL1 = []#For sample size
    tempL2 = []#For R2 in the univariate model
    tempL3 = []#For R2
    tempL4 = []#For beta-coefficient
    tempL5 = []#For SE of beta-coefficient
    tempL6 = []#For t-statistic
    tempL7 = []#For residual degrees of freedom
    tempL8 = []#For P-value
    t_start = time.time()
    for var in tempL:
        #Rename independent variable
        tempDF1 = tempDF.rename(columns={var: 'Variable'})
        #No explicit constant is needed: like R, the smf formula interface adds the intercept automatically
        
        #Save sample size
        tempL1.append(len(tempDF1))
        
        #Fit univariate model
        formula = 'log_Base'+yvar+' ~ Variable'
        fit_res = smf.ols(formula, data=tempDF1).fit()
        #Save R2 [%]
        tempL2.append(fit_res.rsquared*100)

        #Fit full model
        if sex=='BothSex':
            formula = 'log_Base'+yvar+' ~ Variable + C(Sex) + BaseAge + PC1 + PC2 + PC3 + PC4 + PC5'
        else:#Sex is invariant
            formula = 'log_Base'+yvar+' ~ Variable + BaseAge + PC1 + PC2 + PC3 + PC4 + PC5'
        fit_res = smf.ols(formula, data=tempDF1).fit()
        #Save R2 [%]
        tempL3.append(fit_res.rsquared*100)
        #Save beta-coefficient of the variable
        tempL4.append(fit_res.params['Variable'])
        tempL5.append(fit_res.bse['Variable'])
        #Save t-statistic of the variable
        tempL6.append(fit_res.tvalues['Variable'])
        #Save residual degrees of freedom
        tempL7.append(int(fit_res.df_resid))
        #Save P-value of the variable
        tempL8.append(fit_res.pvalues['Variable'])
    t_elapsed = time.time() - t_start
    print('Elapsed time for', len(tempL), 'OLS linear regressions (including '+bwhtr+'):',
          round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')
    
    #Clean the results
    tempDF = pd.DataFrame({'N':tempL1, 'UnivarR2':tempL2, 'R2':tempL3,
                           'OLSbcoef':tempL4, 'OLSbcoefSE':tempL5,
                           'tStat':tempL6, 'DoF':tempL7,  'Pval':tempL8},
                          index=pd.Index(tempL, name='Variable'))
    ##P-value adjustment by using Benjamini–Hochberg method
    tempDF['AdjPval'] = multi.multipletests(tempDF['Pval'], alpha=0.05, method='fdr_bh',
                                            is_sorted=False, returnsorted=False)[1]
    ##Add the LASSO results
    tempDF1 = tempD1[sex]
    tempDF['LASSObcoef'] = tempDF1.loc[tempDF1.index.isin(tempL)]['Mean']#NaN in bWHtR reference
    tempDF['LASSOnZeros'] = tempDF1.loc[tempDF1.index.isin(tempL)]['nZeros']#NaN in bWHtR reference
    tempDF = tempDF.sort_values(by='AdjPval', ascending=True)
    
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
    fileName = bwhtr+'-'+sex+'-OLS.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    #Extract significant variables
    tempDF = tempDF[tempDF['AdjPval']<0.05]
    print('Variables significantly associated with WHtR (FDR < 0.05):', len(tempDF)-1)#Minus 1 for the bWHtR reference
    ##Top X (+ bWHtR)
    tempDF = tempDF.iloc[0:(topX+1)]
    display(tempDF)
    ##Category and color
    tempL = []
    for row_n in tempDF.index.tolist():
        if row_n==bwhtr+' model':
            tempL.append('Positive association')
        else:
            bcoef = tempDF.loc[row_n, 'OLSbcoef']
            if bcoef>0:
                tempL.append('Positive association')
            elif bcoef<0:
                tempL.append('Negative association')
            else:
                tempL.append('Error?')
    tempDF['Association'] = tempL
    tempD = {'Positive association':'tab:red', 'Negative association':'tab:blue'}
    ##Plot
    tempDF = tempDF.reset_index()
    sns.set(style='ticks', font='Arial', context='talk')
    plt.figure(figsize=(10, 4))
    p = sns.barplot(data=tempDF, y='UnivarR2', x='Variable', hue='Association', hue_order=tempD.keys(),
                    dodge=False, palette=tempD, edgecolor='k')
    p.set(ylim=(0, 80), yticks=np.arange(0, 80, 15))#Fixed across omics
    p.grid(axis='y', linestyle='--', color='black')
    sns.despine()
    plt.xticks(rotation=90, horizontalalignment='center')
    plt.xlabel('')
    plt.ylabel('Explained variance in WHtR [%]')
    plt.legend(loc='upper right')
    if sex=='BothSex':#Save
        fileDir = './ExportFigures/'
        ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
        fileName = bwhtr+'-'+sex+'-OLS-top'+str(topX)+'.tif'
        plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                          pil_kwargs={'compression':'tiff_lzw'})
    plt.show()
    print('')
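
The Benjamini–Hochberg adjustment applied above through `multipletests(..., method='fdr_bh')` can be written out by hand for a few P-values. This is a plain NumPy sketch of the standard procedure, not the statsmodels implementation; the function name `bh_adjust` and the input values are hypothetical.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest P-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest P-value downwards
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
print(adj)  # approximately [0.02, 0.04, 0.04, 0.02]
```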

## 3. Proteomics

### 3-1. Prepare and standardize variables

> Variables, including bWHtR (as a reference) and covariates, are standardized for OLS linear regression.  
> Note: in this version, unstandardized WHtR is used as the dependent variable. WHtR in the following DFs is nevertheless standardized along with the other columns (purely to keep the code simple) and is replaced with the unstandardized values before the regressions.  

In [None]:
tempDF = protDF
tempD1 = {'Female':whtrDF_F, 'Male':whtrDF_M, 'BothSex':whtrDF_B}
bwhtr = 'ProtWHtR'
tempD2 = {}
for sex in tempD1.keys():
    #Add covariates (and WHtR) while selecting the cohort
    tempDF1 = tempD1[sex]
    tempDF1 = pd.merge(tempDF, tempDF1, left_index=True, right_index=True, how='inner')
    
    #Add biological WHtR
    fileDir = './ExportData/'
    ipynbName = '220822_Multiomics-BMI-NatMed1stRevision_WHtR-baseline-LASSO-ver2_'
    fileName = bwhtr+'-'+sex+'.tsv'
    tempDF2 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id':str})
    tempDF2 = tempDF2.set_index('public_client_id')
    tempDF1['log_Base'+bwhtr] = tempDF2['log_Base'+bwhtr]
    
    #Z-score transformation
    tempDF2 = tempDF1.select_dtypes(include=[np.number])
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    tempA = scaler.fit_transform(tempDF2)#Column direction
    tempDF2 = pd.DataFrame(data=tempA, index=tempDF2.index, columns=tempDF2.columns)
    ##Recover categorical covariates
    tempDF3 = tempDF1.select_dtypes(exclude=[np.number])
    tempDF1 = pd.merge(tempDF2, tempDF3, left_index=True, right_index=True, how='left')
    
    print(sex)
    display(tempDF1.describe(include='all'))
    tempD2[sex] = tempDF1

protDF_F = tempD2['Female']
protDF_M = tempD2['Male']
protDF_B = tempD2['BothSex']

### 3-2. Correlation b/w all pairwise variables

In [None]:
#Female and male cohorts
tempD1 = {'Female':protDF_F, 'Male':protDF_M}
bwhtr = 'ProtWHtR'
tempD2 = {}
for sex in tempD1.keys():
    print(sex+' cohort:')
    tempDF = tempD1[sex]
    #Compute correlation matrix
    tempL = [item for sublist in [whtrDF.columns.tolist(), ['log_Base'+bwhtr]] for item in sublist]
    tempDF1 = tempDF.drop(columns=tempL)
    tempDF1 = tempDF1.corr(method='pearson')
    print('• Combinations:', int(len(tempDF1)*(len(tempDF1)-1)/2))
    
    #Extract lower triangle matrix
    tempDF1 = tempDF1.where(np.tril(np.ones(tempDF1.shape), k=-1).astype(bool), other=np.nan)#np.bool was removed in NumPy 1.24
    tempDF1.index.rename('Variable1', inplace=True)
    tempDF1 = tempDF1.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
    tempDF1 = tempDF1.dropna()
    print('• nrows:', len(tempDF1))
    print('• |Pearson\'s r| > 0.8:', len(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8]))
    display(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8])
    
    tempD2[sex] = tempDF1

#Distribution
tempD3 = {'Female':'tab:red', 'Male':'tab:blue'}
sns.set(style='ticks', font='Arial', context='notebook')
plt.figure(figsize=(4, 3))
for sex in tempD2.keys():
    tempS = tempD2[sex]['Pearson_r']
    sns.distplot(tempS, color=tempD3[sex], label=sex).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
sns.despine()
plt.xlabel('Pearson\'s '+r'$r$')
plt.ylabel('Density')
plt.legend(loc='upper right')
plt.show()

In [None]:
#Both sex cohort
tempDF = protDF_B
bwhtr = 'ProtWHtR'
bwhtr_color = 'r'

print('Both sex cohort:')
#Compute correlation matrix
tempL = [item for sublist in [whtrDF.columns.tolist(), ['log_Base'+bwhtr]] for item in sublist]
tempDF = tempDF.drop(columns=tempL)
tempDF = tempDF.corr(method='pearson')
print('• Combinations:', int(len(tempDF)*(len(tempDF)-1)/2))

#Extract lower triangle matrix
tempDF = tempDF.where(np.tril(np.ones(tempDF.shape), k=-1).astype(bool), other=np.nan)#np.bool was removed in NumPy 1.24
tempDF.index.rename('Variable1', inplace=True)
tempDF = tempDF.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
tempDF = tempDF.dropna()
print('• nrows:', len(tempDF))
print('• |Pearson\'s r| > 0.8:', len(tempDF.loc[abs(tempDF['Pearson_r'])>0.8]))
display(tempDF.loc[abs(tempDF['Pearson_r'])>0.8])

#Distribution
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(4, 3))
sns.distplot(tempDF['Pearson_r'], color=bwhtr_color).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
plt.axvline(x=tempDF['Pearson_r'].max(), **{'linestyle':'--', 'color':bwhtr_color})
plt.axvline(x=tempDF['Pearson_r'].min(), **{'linestyle':'--', 'color':bwhtr_color})
sns.despine()
plt.ylabel('Density')
plt.xlabel('Pearson\'s '+r'$r$')
plt.show()

### 3-3. Prepare the beta-coefficients of LASSO models

In [None]:
bwhtr = 'ProtWHtR'
tempD = {}
for sex in ['Female', 'Male', 'BothSex']:
    #Import the LASSO beta-coefficients
    fileDir = './ExportData/'
    ipynbName = '220822_Multiomics-BMI-NatMed1stRevision_WHtR-baseline-LASSO-ver2_'
    fileName = bwhtr+'-'+sex+'-LASSObcoefs.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Variable')
    tempDF = tempDF.drop(index=['Intercept'])
    tempD[sex] = tempDF
    
    #Check
    print(sex+':')
    print(' - Variables:', len(tempDF))
    tempDF1 = tempDF.loc[tempDF['nZeros']!=10]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient in all 10 models:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF1.sort_values(by='Mean', ascending=False)
    display(tempDF1)
    print('')

protWHtR_F_bcoefs = tempD['Female']
protWHtR_M_bcoefs = tempD['Male']
protWHtR_B_bcoefs = tempD['BothSex']

### 3-4. Variables retained in all 10 models

In [None]:
tempDF1 = protWHtR_F_bcoefs
tempDF2 = protWHtR_M_bcoefs
tempDF3 = protWHtR_B_bcoefs
category = 'Proteomics'

#Variables with non-zero beta-coefficient in all 10 models
tempS1 = set(tempDF1.loc[tempDF1['nZeros']==0].index)
tempS2 = set(tempDF2.loc[tempDF2['nZeros']==0].index)
tempS3 = set(tempDF3.loc[tempDF3['nZeros']==0].index)

#Robust variables shared across the models
var_111 = tempS1 & tempS2 & tempS3
print('Common variables with non-zero beta-coefficient in all 10 models:\n', var_111)
var_110 = (tempS1 & tempS2) - var_111
var_101 = (tempS1 & tempS3) - var_111
var_011 = (tempS2 & tempS3) - var_111
var_100 = tempS1 - (var_110 | var_101 | var_111)
var_010 = tempS2 - (var_110 | var_011 | var_111)
var_001 = tempS3 - (var_101 | var_011 | var_111)

#The number of each subset
var_subset = [len(var_100), len(var_010), len(var_110), len(var_001),
              len(var_101), len(var_011), len(var_111)]
print('Each subset:', var_subset)

#Venn diagram
sns.set(font='Arial', context='talk')
venn3(subsets=var_subset, set_labels=('Female', 'Male', 'Both sex'),
      set_colors=('tab:red', 'tab:blue', 'tab:green'), alpha=0.4)
venn3_circles(subsets=var_subset)
plt.title('Robust variables\n—'+category+'—', fontdict={'fontsize':24})
plt.show()

In [None]:
tempDF1 = protWHtR_F_bcoefs
tempDF2 = protWHtR_M_bcoefs
tempDF3 = protWHtR_B_bcoefs
bwhtr_color = 'r'

#Variables with non-zero beta-coefficient in all 10 models
tempS1 = set(tempDF1.loc[tempDF1['nZeros']==0].index)
tempS2 = set(tempDF2.loc[tempDF2['nZeros']==0].index)
tempS3 = set(tempDF3.loc[tempDF3['nZeros']==0].index)

#Union of the robust variables
tempL = list(tempS1 | tempS2 | tempS3)
print('Union of the variables with non-zero beta-coefficient in all 10 models:', len(tempL))

#Merge (use the original bcoefs to also include variables with zeros in some of the 10 models)
tempDF1 = tempDF1.loc[tempDF1.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF2 = tempDF2.loc[tempDF2.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF3 = tempDF3.loc[tempDF3.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF = pd.concat([tempDF1, tempDF2, tempDF3], axis=0)
tempDF['Model'] = np.repeat(['Female', 'Male', 'Both sex'], [len(tempDF1), len(tempDF2), len(tempDF3)])

#tidyr::gather in R
tempDF1 = tempDF.reset_index().melt(var_name='LASSO', value_name='bcoef', id_vars=['Variable', 'Model'])

#Order for plot
tempDF2 = tempDF1.loc[tempDF1['Model']=='Both sex'].groupby('Variable', as_index=True).agg({'bcoef':'mean'})
tempDF2 = tempDF2.sort_values(by='bcoef', ascending=False)

#Plot
sns.set(style='ticks', font='Arial', context='talk')
sns.catplot(data=tempDF1, y='Variable', order=tempDF2.index.tolist(), x='bcoef', kind='box',
            hue='Model', palette='Set1', dodge=True, height=14, aspect=0.6,
            showfliers=True, flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4})
plt.grid(axis='x', linestyle='--', color='black')
for row_i in range(len(tempDF2)):
    if row_i%2 == 0:
        plt.axhspan(ymin=row_i-0.5, ymax=row_i+0.5, facecolor=bwhtr_color, alpha=0.2, zorder=0)
plt.xlabel(r'$\beta$'+'-coefficient in LASSO model')
plt.ylabel('')
plt.show()

In [None]:
#Both sex model
tempDF = protWHtR_B_bcoefs
bwhtr_color = 'r'
bwhtr = 'ProtWHtR'
sex = 'BothSex'

#Variables with non-zero beta-coefficient in all 10 models
tempDF = tempDF.loc[tempDF['nZeros']==0].sort_values(by='Mean', ascending=False)
tempDF = tempDF.drop(columns=['Mean', 'SD', 'nZeros'])
print('Variables with non-zero beta-coefficient in all 10 models:', len(tempDF))

#tidyr::gather in R
tempDF1 = tempDF.reset_index().melt(var_name='LASSO', value_name='bcoef', id_vars=['Variable'])

#Plot
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(5, 9))
p = sns.boxplot(data=tempDF1, y='Variable', x='bcoef', color=bwhtr_color, dodge=False, saturation=1,
                showfliers=True, flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4},
                showcaps=True, notch=False)
p.set(xlim=(-0.046, 0.046), xticks=np.arange(-0.04, 0.041, 0.02))#Fixed across omics
p.grid(axis='x', linestyle='--', color='black')
sns.despine()
##Change default dull line color of sns.boxplot (saturation parameter is for patch)
for line in p.get_lines():
    line.set_color('k')
for box in p.artists:
    box.set_edgecolor('k')
##Add background color
for row_i in range(len(tempDF)):
    if row_i%2 == 0:
        plt.axhspan(ymin=row_i-0.5, ymax=row_i+0.5, facecolor=bwhtr_color, alpha=0.2, zorder=0)
plt.ylabel('')
plt.xlabel(r'$\beta$'+'-coefficient in LASSO model\n [log-scaled WHtR per s.d.]')
##Save
fileDir = './ExportFigures/'
ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
fileName = bwhtr+'-'+sex+'-bcoef_non-zero-in-all.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

### 3-5. Correlation b/w pairwise variables retained in all 10 models

In [None]:
#Both sex model
tempDF1 = protWHtR_B_bcoefs
tempDF2 = protDF_B
bwhtr_color = 'r'

print('Both sex model:')
#Variables with non-zero beta-coefficient in all 10 models
tempL = tempDF1.loc[tempDF1['nZeros']==0].index.tolist()

#Compute correlation matrix
tempDF = tempDF2[tempL]
tempDF = tempDF.corr(method='pearson')
print('• Combinations:', int(len(tempDF)*(len(tempDF)-1)/2))

#Extract lower triangle matrix
tempDF = tempDF.where(np.tril(np.ones(tempDF.shape), k=-1).astype(bool), other=np.nan)
tempDF.index.rename('Variable1', inplace=True)
tempDF = tempDF.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
tempDF = tempDF.dropna()
print('• nrows:', len(tempDF))
print('• |Pearson\'s r| > 0.8:', len(tempDF.loc[abs(tempDF['Pearson_r'])>0.8]))
display(tempDF.loc[abs(tempDF['Pearson_r'])>0.8])

#Distribution
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(4, 3))
sns.distplot(tempDF['Pearson_r'], color=bwhtr_color).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
plt.axvline(x=tempDF['Pearson_r'].max(), **{'linestyle':'--', 'color':bwhtr_color})
plt.axvline(x=tempDF['Pearson_r'].min(), **{'linestyle':'--', 'color':bwhtr_color})
sns.despine()
plt.ylabel('Density')
plt.xlabel('Pearson\'s '+r'$r$')
plt.show()

print('Variables with non-zero beta-coefficient in all 10 models:', len(tempL))
#Clustermap
tempDF = tempDF2[tempL]
tempDF = tempDF.corr(method='pearson')
sns.set(style='ticks', font='Arial', context='notebook')
cmap = sns.diverging_palette(220, 20, as_cmap=True)
cm = sns.clustermap(tempDF, method='ward', metric='euclidean', cmap=cmap,
                    row_cluster=True, col_cluster=True, row_linkage=None, col_linkage=None,
                    row_colors=None, col_colors=None, xticklabels=True, yticklabels=True,
                    dendrogram_ratio=(0.15, 0.15), cbar_pos=(0.91, 0.025, 0.02, 0.06),
                    figsize=(7, 7), **{'center':0, 'vmin':-1, 'vmax':1})
cm.cax.set_title('Pearson\'s '+r'$r$')
hm = cm.ax_heatmap.get_position()
rd = cm.ax_row_dendrogram.get_position()
cd = cm.ax_col_dendrogram.get_position()
cm.ax_heatmap.set_position([hm.x0, hm.y0, hm.width, hm.height])
cm.ax_row_dendrogram.set_position([rd.x0+rd.width*0.5, rd.y0, rd.width*0.5, rd.height])
cm.ax_col_dendrogram.set_position([cd.x0, cd.y0, cd.width, cd.height*0.5])
plt.show()

### 3-6. Explained variance in WHtR by the variables retained in ≥1 model

> OLS linear regression model for significance: log(WHtR) ~ b0 + b1\*Variable + b2\*Sex + b3\*BaseAge + b4\*PCs  
> Univariate model: log(WHtR) ~ b0 + b1\*Variable  
>
> Of note, biological WHtR is included for comparison.  
> As in the LASSO models, unstandardized WHtR is used as the dependent variable.  
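
As a minimal sketch of the two models above, the following fits the univariate and the covariate-adjusted regressions on synthetic data with the statsmodels formula API. The column names mirror those used in this notebook, but the data, effect sizes, and seed are made up for illustration, and only two PCs are included for brevity.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

#Synthetic cohort (all values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
toyDF = pd.DataFrame({
    'Variable': rng.normal(size=n),
    'Sex': rng.choice(['F', 'M'], size=n),
    'BaseAge': rng.normal(size=n),
    'PC1': rng.normal(size=n),
    'PC2': rng.normal(size=n),
})
toyDF['log_BaseWHtR'] = (-0.6 + 0.05*toyDF['Variable'] + 0.02*toyDF['BaseAge']
                         + rng.normal(scale=0.05, size=n))

#Univariate model -> explained variance [%]
uni_res = smf.ols('log_BaseWHtR ~ Variable', data=toyDF).fit()
univar_r2 = uni_res.rsquared*100

#Adjusted model -> beta-coefficient, SE, t-statistic, DoF, and P-value of the variable
full_res = smf.ols('log_BaseWHtR ~ Variable + C(Sex) + BaseAge + PC1 + PC2', data=toyDF).fit()
print('Univariate R2 [%]:', round(univar_r2, 1))
print('Adjusted model:', full_res.params['Variable'], full_res.bse['Variable'],
      full_res.tvalues['Variable'], int(full_res.df_resid), full_res.pvalues['Variable'])
```

The univariate R2 is what is plotted as "explained variance", whereas the P-value for significance is taken from the adjusted model.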

In [None]:
tempD1 = {'Female':protWHtR_F_bcoefs, 'Male':protWHtR_M_bcoefs, 'BothSex':protWHtR_B_bcoefs}
tempD2 = {'Female':protDF_F, 'Male':protDF_M, 'BothSex':protDF_B}
tempD3 = {'Female':whtrDF_F, 'Male':whtrDF_M, 'BothSex':whtrDF_B}
yvar = 'WHtR'
bwhtr = 'ProtWHtR'
topX = 30
for sex in tempD1.keys():
    print(sex+' model:')
    
    #Variables with non-zero beta-coefficient in at least 1 of 10 models
    tempDF = tempD1[sex]
    tempL = tempDF.loc[tempDF['nZeros']==10].index.tolist()#To be excluded
    
    #Prepare DF for regressions
    tempDF = tempD2[sex].drop(columns=tempL)
    tempDF = tempDF.rename(columns={'log_Base'+bwhtr:bwhtr+' model'})
    #Replace the standardized dependent variable with the unstandardized
    tempDF1 = tempD3[sex]
    tempDF['log_Base'+yvar] = tempDF1['log_Base'+yvar]
    
    #Independent variables for regressions (including bWHtR)
    tempL = tempDF.drop(columns=whtrDF.columns.tolist()).columns.tolist()
    
    #Perform OLS linear regression
    tempL1 = []#For sample size
    tempL2 = []#For R2 in the univariate model
    tempL3 = []#For R2
    tempL4 = []#For beta-coefficient
    tempL5 = []#For SE of beta-coefficient
    tempL6 = []#For t-statistic
    tempL7 = []#For residual degrees of freedom
    tempL8 = []#For P-value
    t_start = time.time()
    for var in tempL:
        #Rename independent variable
        tempDF1 = tempDF.rename(columns={var: 'Variable'})
        #Add a constant for the intercept -> Similar to R, smf automatically adds a constant
        
        #Save sample size
        tempL1.append(len(tempDF1))
        
        #Fit univariate model
        formula = 'log_Base'+yvar+' ~ Variable'
        fit_res = smf.ols(formula, data=tempDF1).fit()
        #Save R2 [%]
        tempL2.append(fit_res.rsquared*100)

        #Fit full model
        if sex=='BothSex':
            formula = 'log_Base'+yvar+' ~ Variable + C(Sex) + BaseAge + PC1 + PC2 + PC3 + PC4 + PC5'
        else:#Sex is invariant
            formula = 'log_Base'+yvar+' ~ Variable + BaseAge + PC1 + PC2 + PC3 + PC4 + PC5'
        fit_res = smf.ols(formula, data=tempDF1).fit()
        #Save R2 [%]
        tempL3.append(fit_res.rsquared*100)
        #Save beta-coefficient of the variable
        tempL4.append(fit_res.params['Variable'])
        tempL5.append(fit_res.bse['Variable'])
        #Save t-statistic of the variable
        tempL6.append(fit_res.tvalues['Variable'])
        #Save residual degrees of freedom
        tempL7.append(int(fit_res.df_resid))
        #Save P-value of the variable
        tempL8.append(fit_res.pvalues['Variable'])
    t_elapsed = time.time() - t_start
    print('Elapsed time for', len(tempL), 'OLS linear regressions (including '+bwhtr+'):',
          round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')
    
    #Clean the results
    tempDF = pd.DataFrame({'N':tempL1, 'UnivarR2':tempL2, 'R2':tempL3,
                           'OLSbcoef':tempL4, 'OLSbcoefSE':tempL5,
                           'tStat':tempL6, 'DoF':tempL7,  'Pval':tempL8},
                          index=pd.Index(tempL, name='Variable'))
    ##P-value adjustment by using Benjamini–Hochberg method
    tempDF['AdjPval'] = multi.multipletests(tempDF['Pval'], alpha=0.05, method='fdr_bh',
                                            is_sorted=False, returnsorted=False)[1]
    ##Add the LASSO results
    tempDF1 = tempD1[sex]
    tempDF['LASSObcoef'] = tempDF1.loc[tempDF1.index.isin(tempL)]['Mean']#NaN in bWHtR reference
    tempDF['LASSOnZeros'] = tempDF1.loc[tempDF1.index.isin(tempL)]['nZeros']#NaN in bWHtR reference
    tempDF = tempDF.sort_values(by='AdjPval', ascending=True)
    
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
    fileName = bwhtr+'-'+sex+'-OLS.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    #Extract significant variables
    tempDF = tempDF[tempDF['AdjPval']<0.05]
    print('Variables significantly associated with WHtR (FDR < 0.05):', len(tempDF)-1)#bWHtR reference
    ##Top X (+ bWHtR)
    tempDF = tempDF.iloc[0:(topX+1)]
    display(tempDF)
    ##Category and color
    tempL = []
    for row_n in tempDF.index.tolist():
        if row_n==bwhtr+' model':
            tempL.append('Positive association')
        else:
            bcoef = tempDF.loc[row_n, 'OLSbcoef']
            if bcoef>0:
                tempL.append('Positive association')
            elif bcoef<0:
                tempL.append('Negative association')
            else:
                tempL.append('Error?')
    tempDF['Association'] = tempL
    tempD = {'Positive association':'tab:red', 'Negative association':'tab:blue'}
    ##Plot
    tempDF = tempDF.reset_index()
    sns.set(style='ticks', font='Arial', context='talk')
    plt.figure(figsize=(10, 4))
    p = sns.barplot(data=tempDF, y='UnivarR2', x='Variable', hue='Association', hue_order=tempD.keys(),
                    dodge=False, palette=tempD, edgecolor='k')
    p.set(ylim=(0, 80), yticks=np.arange(0, 80, 15))#Fixed across omics
    p.grid(axis='y', linestyle='--', color='black')
    sns.despine()
    plt.xticks(rotation=90, horizontalalignment='center')
    plt.xlabel('')
    plt.ylabel('Explained variance in WHtR [%]')
    plt.legend(loc='upper right')
    if sex=='BothSex':#Save
        fileDir = './ExportFigures/'
        ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
        fileName = bwhtr+'-'+sex+'-OLS-top'+str(topX)+'.tif'
        plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                          pil_kwargs={'compression':'tiff_lzw'})
    plt.show()
    print('')

## 4. Clinical labs

### 4-1. Prepare and standardize variables

> Variables, including bWHtR (as a reference) and covariates, are standardized for OLS linear regression.  
> –> In this version, unstandardized WHtR is used as the dependent variable; however, WHtR within the following DFs is still standardized (just for code simplicity).  
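
The following minimal sketch illustrates the transformation on a toy table: numeric columns are z-scored column-wise, and the categorical covariate is re-attached afterwards. It uses plain pandas rather than `StandardScaler`; the two should agree as long as the population s.d. (`ddof=0`) is used, which is what `StandardScaler` computes. The toy values and analyte name are made up.

```python
import numpy as np
import pandas as pd

#Toy table with two numeric columns and one categorical covariate (values are made up)
toyDF = pd.DataFrame({'Analyte1': [1.0, 2.0, 3.0, 4.0],
                      'BaseAge': [30.0, 40.0, 50.0, 60.0],
                      'Sex': ['F', 'M', 'F', 'M']},
                     index=pd.Index(['id1', 'id2', 'id3', 'id4'], name='public_client_id'))

#Z-score transformation of the numeric columns (population s.d., i.e., ddof=0)
numDF = toyDF.select_dtypes(include=[np.number])
zDF = (numDF - numDF.mean(axis=0)) / numDF.std(axis=0, ddof=0)

#Recover the categorical covariate
catDF = toyDF.select_dtypes(exclude=[np.number])
zDF = pd.merge(zDF, catDF, left_index=True, right_index=True, how='left')
print(zDF)
```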

In [None]:
tempDF = chemDF
tempD1 = {'Female':whtrDF_F, 'Male':whtrDF_M, 'BothSex':whtrDF_B}
bwhtr = 'ChemWHtR'
tempD2 = {}
for sex in tempD1.keys():
    #Add covariates (and WHtR) while selecting the cohort
    tempDF1 = tempD1[sex]
    tempDF1 = pd.merge(tempDF, tempDF1, left_index=True, right_index=True, how='inner')
    
    #Add biological WHtR
    fileDir = './ExportData/'
    ipynbName = '220822_Multiomics-BMI-NatMed1stRevision_WHtR-baseline-LASSO-ver2_'
    fileName = bwhtr+'-'+sex+'.tsv'
    tempDF2 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id':str})
    tempDF2 = tempDF2.set_index('public_client_id')
    tempDF1['log_Base'+bwhtr] = tempDF2['log_Base'+bwhtr]
    
    #Z-score transformation
    tempDF2 = tempDF1.select_dtypes(include=[np.number])
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    tempA = scaler.fit_transform(tempDF2)#Column direction
    tempDF2 = pd.DataFrame(data=tempA, index=tempDF2.index, columns=tempDF2.columns)
    ##Recover categorical covariates
    tempDF3 = tempDF1.select_dtypes(exclude=[np.number])
    tempDF1 = pd.merge(tempDF2, tempDF3, left_index=True, right_index=True, how='left')
    
    print(sex)
    display(tempDF1.describe(include='all'))
    tempD2[sex] = tempDF1

chemDF_F = tempD2['Female']
chemDF_M = tempD2['Male']
chemDF_B = tempD2['BothSex']

### 4-2. Correlation b/w all pairwise variables
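
As a small illustration of the reshaping used in this subsection, the sketch below turns a correlation matrix into a long table with one row per unique variable pair. The toy data and variable names are made up; `k=-1` keeps the strictly lower triangle, so the diagonal and the mirrored duplicates are dropped.

```python
import numpy as np
import pandas as pd

#Toy data with three variables (values are made up)
toyDF = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                      'B': [2.0, 4.0, 6.0, 8.5],
                      'C': [4.0, 3.0, 2.0, 1.5]})
corrDF = toyDF.corr(method='pearson')

#Keep the strictly lower triangle, then reshape to long format
mask = np.tril(np.ones(corrDF.shape), k=-1).astype(bool)
pairDF = corrDF.where(mask, other=np.nan)
pairDF = pairDF.rename_axis(index='Variable1')
pairDF = pairDF.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
pairDF = pairDF.dropna()
print(pairDF)
```

The number of remaining rows should equal n(n-1)/2 for n variables, which is the check printed as "Combinations" and "nrows" in the cells of this subsection.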

In [None]:
#Female and male cohorts
tempD1 = {'Female':chemDF_F, 'Male':chemDF_M}
bwhtr = 'ChemWHtR'
tempD2 = {}
for sex in tempD1.keys():
    print(sex+' cohort:')
    tempDF = tempD1[sex]
    #Compute correlation matrix
    tempL = [item for sublist in [whtrDF.columns.tolist(), ['log_Base'+bwhtr]] for item in sublist]
    tempDF1 = tempDF.drop(columns=tempL)
    tempDF1 = tempDF1.corr(method='pearson')
    print('• Combinations:', int(len(tempDF1)*(len(tempDF1)-1)/2))
    
    #Extract lower triangle matrix
    tempDF1 = tempDF1.where(np.tril(np.ones(tempDF1.shape), k=-1).astype(bool), other=np.nan)
    tempDF1.index.rename('Variable1', inplace=True)
    tempDF1 = tempDF1.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
    tempDF1 = tempDF1.dropna()
    print('• nrows:', len(tempDF1))
    print('• |Pearson\'s r| > 0.8:', len(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8]))
    display(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8])
    
    tempD2[sex] = tempDF1

#Distribution
tempD3 = {'Female':'tab:red', 'Male':'tab:blue'}
sns.set(style='ticks', font='Arial', context='notebook')
plt.figure(figsize=(4, 3))
for sex in tempD2.keys():
    tempS = tempD2[sex]['Pearson_r']
    sns.distplot(tempS, color=tempD3[sex], label=sex).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
sns.despine()
plt.xlabel('Pearson\'s '+r'$r$')
plt.ylabel('Density')
plt.legend(loc='upper right')
plt.show()

In [None]:
#Both sex cohort
tempDF = chemDF_B
bwhtr = 'ChemWHtR'
bwhtr_color = 'g'

print('Both sex cohort:')
#Compute correlation matrix
tempL = [item for sublist in [whtrDF.columns.tolist(), ['log_Base'+bwhtr]] for item in sublist]
tempDF = tempDF.drop(columns=tempL)
tempDF = tempDF.corr(method='pearson')
print('• Combinations:', int(len(tempDF)*(len(tempDF)-1)/2))

#Extract lower triangle matrix
tempDF = tempDF.where(np.tril(np.ones(tempDF.shape), k=-1).astype(bool), other=np.nan)
tempDF.index.rename('Variable1', inplace=True)
tempDF = tempDF.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
tempDF = tempDF.dropna()
print('• nrows:', len(tempDF))
print('• |Pearson\'s r| > 0.8:', len(tempDF.loc[abs(tempDF['Pearson_r'])>0.8]))
display(tempDF.loc[abs(tempDF['Pearson_r'])>0.8])

#Distribution
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(4, 3))
sns.distplot(tempDF['Pearson_r'], color=bwhtr_color).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
plt.axvline(x=tempDF['Pearson_r'].max(), **{'linestyle':'--', 'color':bwhtr_color})
plt.axvline(x=tempDF['Pearson_r'].min(), **{'linestyle':'--', 'color':bwhtr_color})
sns.despine()
plt.ylabel('Density')
plt.xlabel('Pearson\'s '+r'$r$')
plt.show()

### 4-3. Prepare the beta-coefficients of LASSO models
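
For orientation, the sketch below shows how summary columns such as `Mean`, `SD`, and `nZeros` could be derived from the beta-coefficients of 10 LASSO fits, and how `nZeros` separates the robust variables from the retained ones. This is an assumption about the table layout, not the upstream notebook's code: the column names `LASSO_1` to `LASSO_10`, the toy values, and the `ddof` of the s.d. are all hypothetical.

```python
import numpy as np
import pandas as pd

#Toy beta-coefficients from 10 hypothetical LASSO fits (values and names are made up)
toyDF = pd.DataFrame({'LASSO_'+str(i+1): [0.02, 0.0, 0.0 if i < 3 else -0.01] for i in range(10)},
                     index=pd.Index(['VarA', 'VarB', 'VarC'], name='Variable'))

#Summary columns analogous to those read in this notebook
sumDF = toyDF.copy()
sumDF['Mean'] = toyDF.mean(axis=1)
sumDF['SD'] = toyDF.std(axis=1, ddof=1)#ddof is an assumption
sumDF['nZeros'] = (toyDF == 0).sum(axis=1)

#Robust: non-zero in all 10 models; retained: non-zero in at least 1 model
robustL = sumDF.loc[sumDF['nZeros']==0].index.tolist()
retainedL = sumDF.loc[sumDF['nZeros']!=10].index.tolist()
print('Robust:', robustL)
print('Retained:', retainedL)
```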

In [None]:
bwhtr = 'ChemWHtR'
tempD = {}
for sex in ['Female', 'Male', 'BothSex']:
    #Import the LASSO beta-coefficients
    fileDir = './ExportData/'
    ipynbName = '220822_Multiomics-BMI-NatMed1stRevision_WHtR-baseline-LASSO-ver2_'
    fileName = bwhtr+'-'+sex+'-LASSObcoefs.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Variable')
    tempDF = tempDF.drop(index=['Intercept'])
    tempD[sex] = tempDF
    
    #Check
    print(sex+':')
    print(' - Variables:', len(tempDF))
    tempDF1 = tempDF.loc[tempDF['nZeros']!=10]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient in all 10 models:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF1.sort_values(by='Mean', ascending=False)
    display(tempDF1)
    print('')

chemWHtR_F_bcoefs = tempD['Female']
chemWHtR_M_bcoefs = tempD['Male']
chemWHtR_B_bcoefs = tempD['BothSex']

### 4-4. Variables retained in all 10 models
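
The region sizes passed to `venn3` follow the order (100, 010, 110, 001, 101, 011, 111), where each digit marks membership in the first, second, and third set. A minimal sketch of that partition with made-up variable names:

```python
#Toy sets of robust variables per model (names are made up)
setF = {'v1', 'v2', 'v3', 'v7'}
setM = {'v2', 'v3', 'v4'}
setB = {'v3', 'v4', 'v5', 'v6', 'v7'}

#Disjoint regions, labeled by membership in (Female, Male, Both sex)
var_111 = setF & setM & setB
var_110 = (setF & setM) - var_111
var_101 = (setF & setB) - var_111
var_011 = (setM & setB) - var_111
var_100 = setF - (var_110 | var_101 | var_111)
var_010 = setM - (var_110 | var_011 | var_111)
var_001 = setB - (var_101 | var_011 | var_111)

#Region sizes in the order (100, 010, 110, 001, 101, 011, 111)
var_subset = [len(var_100), len(var_010), len(var_110), len(var_001),
              len(var_101), len(var_011), len(var_111)]
print('Each subset:', var_subset)
```

Because the seven regions are disjoint, their sizes should sum to the size of the union of the three sets, which is a quick sanity check.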

In [None]:
tempDF1 = chemWHtR_F_bcoefs
tempDF2 = chemWHtR_M_bcoefs
tempDF3 = chemWHtR_B_bcoefs
category = 'Clinical labs'

#Variables with non-zero beta-coefficient in all 10 models
tempS1 = set(tempDF1.loc[tempDF1['nZeros']==0].index)
tempS2 = set(tempDF2.loc[tempDF2['nZeros']==0].index)
tempS3 = set(tempDF3.loc[tempDF3['nZeros']==0].index)

#Robust variables common to all three models
var_111 = tempS1 & tempS2 & tempS3
print('Common variables with non-zero beta-coefficient in all 10 models:\n', var_111)
var_110 = (tempS1 & tempS2) - var_111
var_101 = (tempS1 & tempS3) - var_111
var_011 = (tempS2 & tempS3) - var_111
var_100 = tempS1 - (var_110 | var_101 | var_111)
var_010 = tempS2 - (var_110 | var_011 | var_111)
var_001 = tempS3 - (var_101 | var_011 | var_111)

#The number of each subset
var_subset = [len(var_100), len(var_010), len(var_110), len(var_001),
              len(var_101), len(var_011), len(var_111)]
print('Each subset:', var_subset)

#Venn diagram
sns.set(font='Arial', context='talk')
venn3(subsets=var_subset, set_labels=('Female', 'Male', 'Both sex'),
      set_colors=('tab:red', 'tab:blue', 'tab:green'), alpha=0.4)
venn3_circles(subsets=var_subset)
plt.title('Robust variables\n—'+category+'—', fontdict={'fontsize':24})
plt.show()

In [None]:
tempDF1 = chemWHtR_F_bcoefs
tempDF2 = chemWHtR_M_bcoefs
tempDF3 = chemWHtR_B_bcoefs
bwhtr_color = 'g'

#Variables with non-zero beta-coefficient in all 10 models
tempS1 = set(tempDF1.loc[tempDF1['nZeros']==0].index)
tempS2 = set(tempDF2.loc[tempDF2['nZeros']==0].index)
tempS3 = set(tempDF3.loc[tempDF3['nZeros']==0].index)

#Union of the robust variables
tempL = list(tempS1 | tempS2 | tempS3)
print('Union of the variables with non-zero beta-coefficient in all 10 models:', len(tempL))

#Merge (use original bcoefs to rescue variables with zeros in any of the 10 models)
tempDF1 = tempDF1.loc[tempDF1.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF2 = tempDF2.loc[tempDF2.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF3 = tempDF3.loc[tempDF3.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF = pd.concat([tempDF1, tempDF2, tempDF3], axis=0)
tempDF['Model'] = np.repeat(['Female', 'Male', 'Both sex'], [len(tempDF1), len(tempDF2), len(tempDF3)])

#tidyr::gather in R
tempDF1 = tempDF.reset_index().melt(var_name='LASSO', value_name='bcoef', id_vars=['Variable', 'Model'])

#Order for plot
tempDF2 = tempDF1.loc[tempDF1['Model']=='Both sex'].groupby('Variable', as_index=True).agg({'bcoef':'mean'})
tempDF2 = tempDF2.sort_values(by='bcoef', ascending=False)

#Plot
sns.set(style='ticks', font='Arial', context='talk')
sns.catplot(data=tempDF1, y='Variable', order=tempDF2.index.tolist(), x='bcoef', kind='box',
            hue='Model', palette='Set1', dodge=True, height=9, aspect=1.1,
            showfliers=True, flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4})
plt.grid(axis='x', linestyle='--', color='black')
for row_i in range(len(tempDF2)):
    if row_i%2 == 0:
        plt.axhspan(ymin=row_i-0.5, ymax=row_i+0.5, facecolor=bwhtr_color, alpha=0.2, zorder=0)
plt.xlabel(r'$\beta$'+'-coefficient in LASSO model')
plt.ylabel('')
plt.show()

In [None]:
#Both sex model
tempDF = chemWHtR_B_bcoefs
bwhtr_color = 'g'
bwhtr = 'ChemWHtR'
sex = 'BothSex'

#Variables with non-zero beta-coefficient in all 10 models
tempDF = tempDF.loc[tempDF['nZeros']==0].sort_values(by='Mean', ascending=False)
tempDF = tempDF.drop(columns=['Mean', 'SD', 'nZeros'])
print('Variables with non-zero beta-coefficient in all 10 models:', len(tempDF))

#tidyr::gather in R
tempDF1 = tempDF.reset_index().melt(var_name='LASSO', value_name='bcoef', id_vars=['Variable'])

#Plot
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(5, 8))
p = sns.boxplot(data=tempDF1, y='Variable', x='bcoef', color=bwhtr_color, dodge=False, saturation=1,
                showfliers=True, flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4},
                showcaps=True, notch=False)
p.set(xlim=(-0.046, 0.046), xticks=np.arange(-0.04, 0.041, 0.02))#Fixed across omics
p.grid(axis='x', linestyle='--', color='black')
sns.despine()
##Change default dull line color of sns.boxplot (saturation parameter is for patch)
for line in p.get_lines():
    line.set_color('k')
for box in p.artists:
    box.set_edgecolor('k')
##Add background color
for row_i in range(len(tempDF)):
    if row_i%2 == 0:
        plt.axhspan(ymin=row_i-0.5, ymax=row_i+0.5, facecolor=bwhtr_color, alpha=0.2, zorder=0)
plt.ylabel('')
plt.xlabel(r'$\beta$'+'-coefficient in LASSO model\n [log-scaled WHtR per s.d.]')
##Save
fileDir = './ExportFigures/'
ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
fileName = bwhtr+'-'+sex+'-bcoef_non-zero-in-all.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

### 4-5. Correlation b/w pairwise variables retained in all 10 models

In [None]:
#Both sex model
tempDF1 = chemWHtR_B_bcoefs
tempDF2 = chemDF_B
bwhtr_color = 'g'

print('Both sex model:')
#Variables with non-zero beta-coefficient in all 10 models
tempL = tempDF1.loc[tempDF1['nZeros']==0].index.tolist()

#Compute correlation matrix
tempDF = tempDF2[tempL]
tempDF = tempDF.corr(method='pearson')
print('• Combinations:', int(len(tempDF)*(len(tempDF)-1)/2))

#Extract lower triangle matrix
tempDF = tempDF.where(np.tril(np.ones(tempDF.shape), k=-1).astype(bool), other=np.nan)
tempDF.index.rename('Variable1', inplace=True)
tempDF = tempDF.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
tempDF = tempDF.dropna()
print('• nrows:', len(tempDF))
print('• |Pearson\'s r| > 0.8:', len(tempDF.loc[abs(tempDF['Pearson_r'])>0.8]))
display(tempDF.loc[abs(tempDF['Pearson_r'])>0.8])

#Distribution
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(4, 3))
sns.distplot(tempDF['Pearson_r'], color=bwhtr_color).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
plt.axvline(x=tempDF['Pearson_r'].max(), **{'linestyle':'--', 'color':bwhtr_color})
plt.axvline(x=tempDF['Pearson_r'].min(), **{'linestyle':'--', 'color':bwhtr_color})
sns.despine()
plt.ylabel('Density')
plt.xlabel('Pearson\'s '+r'$r$')
plt.show()

print('Variables with non-zero beta-coefficient in all 10 models:', len(tempL))
#Clustermap
tempDF = tempDF2[tempL]
tempDF = tempDF.corr(method='pearson')
sns.set(style='ticks', font='Arial', context='notebook')
cmap = sns.diverging_palette(220, 20, as_cmap=True)
cm = sns.clustermap(tempDF, method='ward', metric='euclidean', cmap=cmap,
                    row_cluster=True, col_cluster=True, row_linkage=None, col_linkage=None,
                    row_colors=None, col_colors=None, xticklabels=True, yticklabels=True,
                    dendrogram_ratio=(0.15, 0.15), cbar_pos=(0.85, 0.07, 0.02, 0.1),
                    figsize=(7, 7), **{'center':0, 'vmin':-1, 'vmax':1})
cm.cax.set_title('Pearson\'s '+r'$r$')
hm = cm.ax_heatmap.get_position()
rd = cm.ax_row_dendrogram.get_position()
cd = cm.ax_col_dendrogram.get_position()
cm.ax_heatmap.set_position([hm.x0, hm.y0, hm.width, hm.height])
cm.ax_row_dendrogram.set_position([rd.x0+rd.width*0.5, rd.y0, rd.width*0.5, rd.height])
cm.ax_col_dendrogram.set_position([cd.x0, cd.y0, cd.width, cd.height*0.5])
plt.show()

### 4-6. Explained variance in WHtR by the variables retained in ≥1 model

> OLS linear regression model for significance: log(WHtR) ~ b0 + b1\*Variable + b2\*Sex + b3\*BaseAge + b4\*PCs  
> Univariate model: log(WHtR) ~ b0 + b1\*Variable  
>
> Of note, biological WHtR is included for comparison.  
> As in the LASSO models, unstandardized WHtR is used as the dependent variable.  
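
The P-values from the adjusted model are corrected with the Benjamini–Hochberg method across all tested variables. As a rough sketch of what that adjustment does, the hypothetical helper `bh_adjust` below reimplements it in NumPy on made-up P-values; it is meant for illustration only, and should correspond to the adjusted P-values returned by `multipletests(..., method='fdr_bh')`.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values (hypothetical helper for illustration)."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    #Scale each sorted P-value by m/rank
    ranked = pvals[order] * m / np.arange(1, m+1)
    #Enforce monotonicity from the largest P-value downwards, then cap at 1
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    ranked = np.minimum(ranked, 1.0)
    #Return in the original order
    adj = np.empty(m)
    adj[order] = ranked
    return adj

toy_pvals = [0.01, 0.04, 0.03, 0.005]#Made-up P-values
adj_pvals = bh_adjust(toy_pvals)
print(adj_pvals)
```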

In [None]:
tempD1 = {'Female':chemWHtR_F_bcoefs, 'Male':chemWHtR_M_bcoefs, 'BothSex':chemWHtR_B_bcoefs}
tempD2 = {'Female':chemDF_F, 'Male':chemDF_M, 'BothSex':chemDF_B}
tempD3 = {'Female':whtrDF_F, 'Male':whtrDF_M, 'BothSex':whtrDF_B}
yvar = 'WHtR'
bwhtr = 'ChemWHtR'
topX = 30
for sex in tempD1.keys():
    print(sex+' model:')
    
    #Variables with non-zero beta-coefficient in at least 1 of 10 models
    tempDF = tempD1[sex]
    tempL = tempDF.loc[tempDF['nZeros']==10].index.tolist()#To be excluded
    
    #Prepare DF for regressions
    tempDF = tempD2[sex].drop(columns=tempL)
    tempDF = tempDF.rename(columns={'log_Base'+bwhtr:bwhtr+' model'})
    #Replace the standardized dependent variable with the unstandardized
    tempDF1 = tempD3[sex]
    tempDF['log_Base'+yvar] = tempDF1['log_Base'+yvar]
    
    #Independent variables for regressions (including bWHtR)
    tempL = tempDF.drop(columns=whtrDF.columns.tolist()).columns.tolist()
    
    #Perform OLS linear regression
    tempL1 = []#For sample size
    tempL2 = []#For R2 in the univariate model
    tempL3 = []#For R2
    tempL4 = []#For beta-coefficient
    tempL5 = []#For SE of beta-coefficient
    tempL6 = []#For t-statistic
    tempL7 = []#For residual degrees of freedom
    tempL8 = []#For P-value
    t_start = time.time()
    for var in tempL:
        #Rename independent variable
        tempDF1 = tempDF.rename(columns={var: 'Variable'})
        #Add a constant for the intercept -> Similar to R, smf automatically adds a constant
        
        #Save sample size
        tempL1.append(len(tempDF1))
        
        #Fit univariate model
        formula = 'log_Base'+yvar+' ~ Variable'
        fit_res = smf.ols(formula, data=tempDF1).fit()
        #Save R2 [%]
        tempL2.append(fit_res.rsquared*100)

        #Fit full model
        if sex=='BothSex':
            formula = 'log_Base'+yvar+' ~ Variable + C(Sex) + BaseAge + PC1 + PC2 + PC3 + PC4 + PC5'
        else:#Sex is invariant
            formula = 'log_Base'+yvar+' ~ Variable + BaseAge + PC1 + PC2 + PC3 + PC4 + PC5'
        fit_res = smf.ols(formula, data=tempDF1).fit()
        #Save R2 [%]
        tempL3.append(fit_res.rsquared*100)
        #Save beta-coefficient of the variable
        tempL4.append(fit_res.params['Variable'])
        tempL5.append(fit_res.bse['Variable'])
        #Save t-statistic of the variable
        tempL6.append(fit_res.tvalues['Variable'])
        #Save residual degrees of freedom
        tempL7.append(int(fit_res.df_resid))
        #Save P-value of the variable
        tempL8.append(fit_res.pvalues['Variable'])
    t_elapsed = time.time() - t_start
    print('Elapsed time for', len(tempL), 'OLS linear regressions (including '+bwhtr+'):',
          round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')
    
    #Clean the results
    tempDF = pd.DataFrame({'N':tempL1, 'UnivarR2':tempL2, 'R2':tempL3,
                           'OLSbcoef':tempL4, 'OLSbcoefSE':tempL5,
                           'tStat':tempL6, 'DoF':tempL7,  'Pval':tempL8},
                          index=pd.Index(tempL, name='Variable'))
    ##P-value adjustment by using Benjamini–Hochberg method
    tempDF['AdjPval'] = multi.multipletests(tempDF['Pval'], alpha=0.05, method='fdr_bh',
                                            is_sorted=False, returnsorted=False)[1]
    ##Add the LASSO results
    tempDF1 = tempD1[sex]
    tempDF['LASSObcoef'] = tempDF1.loc[tempDF1.index.isin(tempL)]['Mean']#NaN in bWHtR reference
    tempDF['LASSOnZeros'] = tempDF1.loc[tempDF1.index.isin(tempL)]['nZeros']#NaN in bWHtR reference
    tempDF = tempDF.sort_values(by='AdjPval', ascending=True)
    
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
    fileName = bwhtr+'-'+sex+'-OLS.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    #Extract significant variables
    tempDF = tempDF[tempDF['AdjPval']<0.05]
    print('Variables significantly associated with WHtR (FDR < 0.05):', len(tempDF)-1)#bWHtR reference
    ##Top X (+ bWHtR)
    tempDF = tempDF.iloc[0:(topX+1)]
    display(tempDF)
    ##Category and color
    tempL = []
    for row_n in tempDF.index.tolist():
        if row_n==bwhtr+' model':
            tempL.append('Positive association')
        else:
            bcoef = tempDF.loc[row_n, 'OLSbcoef']
            if bcoef>0:
                tempL.append('Positive association')
            elif bcoef<0:
                tempL.append('Negative association')
            else:
                tempL.append('Error?')
    tempDF['Association'] = tempL
    tempD = {'Positive association':'tab:red', 'Negative association':'tab:blue'}
    ##Plot
    tempDF = tempDF.reset_index()
    sns.set(style='ticks', font='Arial', context='talk')
    plt.figure(figsize=(10, 4))
    p = sns.barplot(data=tempDF, y='UnivarR2', x='Variable', hue='Association', hue_order=tempD.keys(),
                    dodge=False, palette=tempD, edgecolor='k')
    p.set(ylim=(0, 80), yticks=np.arange(0, 80, 15))#Fixed across omics
    p.grid(axis='y', linestyle='--', color='black')
    sns.despine()
    plt.xticks(rotation=90, horizontalalignment='center')
    plt.xlabel('')
    plt.ylabel('Explained variance in WHtR [%]')
    plt.legend(loc='upper right')
    if sex=='BothSex':#Save
        fileDir = './ExportFigures/'
        ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
        fileName = bwhtr+'-'+sex+'-OLS-top'+str(topX)+'.tif'
        plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                          pil_kwargs={'compression':'tiff_lzw'})
    plt.show()
    print('')

## 5. Metabolomics, Proteomics, and Clinical labs-combined omics

### 5-1. Prepare and standardize variables

> Variables, including bWHtR (as a reference) and covariates, are standardized for OLS linear regression.  
> –> In this version, unstandardized WHtR is used as the dependent variable; however, WHtR within the following DFs is still standardized (just for code simplicity).  
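
One consequence of standardizing only the independent variable can be sketched as follows: the beta-coefficient is expressed per s.d. of the variable, in units of the unstandardized dependent variable, while R2 is unchanged by the rescaling. The data, effect size, and the helper `fit_simple_ols` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_raw = rng.normal(loc=50, scale=10, size=300)#Made-up analyte values
y = -0.6 + 0.004*x_raw + rng.normal(scale=0.05, size=300)#Made-up log-scaled outcome

def fit_simple_ols(x, y):
    #Return (slope, R2) of y ~ b0 + b1*x (hypothetical helper for illustration)
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1 - resid.var() / y.var()
    return coef[1], r2

#Z-score transformation of the independent variable only
x_std = (x_raw - x_raw.mean()) / x_raw.std(ddof=0)
slope_raw, r2_raw = fit_simple_ols(x_raw, y)
slope_std, r2_std = fit_simple_ols(x_std, y)
print('Slope per unit:', slope_raw, '| Slope per s.d.:', slope_std)
print('R2 (raw):', r2_raw, '| R2 (standardized):', r2_std)
```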

In [None]:
tempDF = combiDF
tempD1 = {'Female':whtrDF_F, 'Male':whtrDF_M, 'BothSex':whtrDF_B}
bwhtr = 'CombiWHtR'
tempD2 = {}
for sex in tempD1.keys():
    #Add covariates (and WHtR) while selecting the cohort
    tempDF1 = tempD1[sex]
    tempDF1 = pd.merge(tempDF, tempDF1, left_index=True, right_index=True, how='inner')
    
    #Add biological WHtR
    fileDir = './ExportData/'
    ipynbName = '220822_Multiomics-BMI-NatMed1stRevision_WHtR-baseline-LASSO-ver2_'
    fileName = bwhtr+'-'+sex+'.tsv'
    tempDF2 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id':str})
    tempDF2 = tempDF2.set_index('public_client_id')
    tempDF1['log_Base'+bwhtr] = tempDF2['log_Base'+bwhtr]
    
    #Z-score transformation
    tempDF2 = tempDF1.select_dtypes(include=[np.number])
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    tempA = scaler.fit_transform(tempDF2)#Column direction
    tempDF2 = pd.DataFrame(data=tempA, index=tempDF2.index, columns=tempDF2.columns)
    ##Recover categorical covariates
    tempDF3 = tempDF1.select_dtypes(exclude=[np.number])
    tempDF1 = pd.merge(tempDF2, tempDF3, left_index=True, right_index=True, how='left')
    
    print(sex)
    display(tempDF1.describe(include='all'))
    tempD2[sex] = tempDF1

combiDF_F = tempD2['Female']
combiDF_M = tempD2['Male']
combiDF_B = tempD2['BothSex']

### 5-2. Correlation b/w all pairwise variables

In [None]:
#Female and male cohorts
tempD1 = {'Female':combiDF_F, 'Male':combiDF_M}
bwhtr = 'CombiWHtR'
tempD2 = {}
for sex in tempD1.keys():
    print(sex+' cohort:')
    tempDF = tempD1[sex]
    #Compute correlation matrix
    tempL = [item for sublist in [whtrDF.columns.tolist(), ['log_Base'+bwhtr]] for item in sublist]
    tempDF1 = tempDF.drop(columns=tempL)
    tempDF1 = tempDF1.corr(method='pearson')
    print('• Combinations:', int(len(tempDF1)*(len(tempDF1)-1)/2))
    
    #Extract lower triangle matrix
    tempDF1 = tempDF1.where(np.tril(np.ones(tempDF1.shape), k=-1).astype(bool), other=np.nan)
    tempDF1.index.rename('Variable1', inplace=True)
    tempDF1 = tempDF1.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
    tempDF1 = tempDF1.dropna()
    print('• nrows:', len(tempDF1))
    print('• |Pearson\'s r| > 0.8:', len(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8]))
    display(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8])
    
    tempD2[sex] = tempDF1

#Distribution
tempD3 = {'Female':'tab:red', 'Male':'tab:blue'}
sns.set(style='ticks', font='Arial', context='notebook')
plt.figure(figsize=(4, 3))
for sex in tempD2.keys():
    tempS = tempD2[sex]['Pearson_r']
    sns.distplot(tempS, color=tempD3[sex], label=sex).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
sns.despine()
plt.xlabel('Pearson\'s '+r'$r$')
plt.ylabel('Density')
plt.legend(loc='upper right')
plt.show()

In [None]:
#Both sex cohort
tempDF = combiDF_B
bwhtr = 'CombiWHtR'
bwhtr_color = 'm'

print('Both sex cohort:')
#Compute correlation matrix
tempL = [item for sublist in [whtrDF.columns.tolist(), ['log_Base'+bwhtr]] for item in sublist]
tempDF = tempDF.drop(columns=tempL)
tempDF = tempDF.corr(method='pearson')
print('• Combinations:', int(len(tempDF)*(len(tempDF)-1)/2))

#Extract lower triangle matrix
tempDF = tempDF.where(np.tril(np.ones(tempDF.shape), k=-1).astype(bool), other=np.nan)
tempDF.index.rename('Variable1', inplace=True)
tempDF = tempDF.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
tempDF = tempDF.dropna()
print('• nrows:', len(tempDF))
print('• |Pearson\'s r| > 0.8:', len(tempDF.loc[abs(tempDF['Pearson_r'])>0.8]))
display(tempDF.loc[abs(tempDF['Pearson_r'])>0.8])

#Distribution
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(4, 3))
sns.distplot(tempDF['Pearson_r'], color=bwhtr_color).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
plt.axvline(x=tempDF['Pearson_r'].max(), **{'linestyle':'--', 'color':bwhtr_color})
plt.axvline(x=tempDF['Pearson_r'].min(), **{'linestyle':'--', 'color':bwhtr_color})
sns.despine()
plt.ylabel('Density')
plt.xlabel('Pearson\'s '+r'$r$')
plt.show()

### 5-3. Prepare the beta-coefficients of LASSO models
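
The imported tables carry the summary columns `Mean`, `SD`, and `nZeros`, where `nZeros` counts the models (out of 10) in which a variable's beta-coefficient is zero. The following sketch shows how such columns could be derived and how the two filters used in this section behave. The table layout, the column names `LASSO_1` to `LASSO_10`, the variable names, and the use of the sample SD are assumptions for illustration; the upstream notebook may compute them differently.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficient table: rows = variables, columns = 10 fitted models
rng = np.random.default_rng(2)
bcoefDF = pd.DataFrame(rng.normal(0, 0.01, (4, 10)),
                       index=pd.Index(['varA', 'varB', 'varC', 'varD'], name='Variable'),
                       columns=[f'LASSO_{i+1}' for i in range(10)])
bcoefDF.loc['varB', ['LASSO_3', 'LASSO_7']] = 0.0  # dropped in 2 of the 10 models
bcoefDF.loc['varD', :] = 0.0                       # never retained

# Summary columns (assumption: SD is the sample SD, ddof=1)
modelCols = bcoefDF.columns.tolist()
bcoefDF['Mean'] = bcoefDF[modelCols].mean(axis=1)
bcoefDF['SD'] = bcoefDF[modelCols].std(axis=1)
bcoefDF['nZeros'] = (bcoefDF[modelCols] == 0).sum(axis=1)

# nZeros == 0  -> non-zero in all 10 models
# nZeros != 10 -> non-zero in at least 1 model
robust = bcoefDF.loc[bcoefDF['nZeros'] == 0].index.tolist()
retained = bcoefDF.loc[bcoefDF['nZeros'] != 10].index.tolist()
print('Non-zero in all 10 models:', robust)
print('Non-zero in at least 1 model:', retained)
```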

In [None]:
bwhtr = 'CombiWHtR'
tempD = {}
for sex in ['Female', 'Male', 'BothSex']:
    #Import the LASSO beta-coefficients
    fileDir = './ExportData/'
    ipynbName = '220822_Multiomics-BMI-NatMed1stRevision_WHtR-baseline-LASSO-ver2_'
    fileName = bwhtr+'-'+sex+'-LASSObcoefs.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Variable')
    tempDF = tempDF.drop(index=['Intercept'])
    tempD[sex] = tempDF
    
    #Check
    print(sex+':')
    print(' - Variables:', len(tempDF))
    tempDF1 = tempDF.loc[tempDF['nZeros']!=10]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient in all 10 models:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF1.sort_values(by='Mean', ascending=False)
    display(tempDF1)
    print('')

combiWHtR_F_bcoefs = tempD['Female']
combiWHtR_M_bcoefs = tempD['Male']
combiWHtR_B_bcoefs = tempD['BothSex']

### 5-4. Variables retained in all 10 models

In [None]:
tempDF1 = combiWHtR_F_bcoefs
tempDF2 = combiWHtR_M_bcoefs
tempDF3 = combiWHtR_B_bcoefs
category = 'Combined omics'

#Variables with non-zero beta-coefficient in all 10 models
tempS1 = set(tempDF1.loc[tempDF1['nZeros']==0].index)
tempS2 = set(tempDF2.loc[tempDF2['nZeros']==0].index)
tempS3 = set(tempDF3.loc[tempDF3['nZeros']==0].index)

#Robust variables shared among the models
var_111 = tempS1 & tempS2 & tempS3
print('Common variables with non-zero beta-coefficient in all 10 models:\n', var_111)
var_110 = (tempS1 & tempS2) - var_111
var_101 = (tempS1 & tempS3) - var_111
var_011 = (tempS2 & tempS3) - var_111
var_100 = tempS1 - (var_110 | var_101 | var_111)
var_010 = tempS2 - (var_110 | var_011 | var_111)
var_001 = tempS3 - (var_101 | var_011 | var_111)

#The number of each subset
var_subset = [len(var_100), len(var_010), len(var_110), len(var_001),
              len(var_101), len(var_011), len(var_111)]
print('Each subset:', var_subset)

#Venn diagram
sns.set(font='Arial', context='talk')
venn3(subsets=var_subset, set_labels=('Female', 'Male', 'Both sex'),
      set_colors=('tab:red', 'tab:blue', 'tab:green'), alpha=0.4)
venn3_circles(subsets=var_subset)
plt.title('Robust variables\n—'+category+'—', fontdict={'fontsize':24})
plt.show()

In [None]:
tempDF1 = combiWHtR_F_bcoefs
tempDF2 = combiWHtR_M_bcoefs
tempDF3 = combiWHtR_B_bcoefs

#Variables with non-zero beta-coefficient in all 10 models
tempS1 = set(tempDF1.loc[tempDF1['nZeros']==0].index)
tempS2 = set(tempDF2.loc[tempDF2['nZeros']==0].index)
tempS3 = set(tempDF3.loc[tempDF3['nZeros']==0].index)

#Union of the robust variables
tempL = list(tempS1 | tempS2 | tempS3)
print('Union of the variables with non-zero beta-coefficient in all 10 models:', len(tempL))

#Merge (use the original bcoefs to rescue variables with a zero in any of the 10 models)
tempDF1 = tempDF1.loc[tempDF1.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF2 = tempDF2.loc[tempDF2.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF3 = tempDF3.loc[tempDF3.index.isin(tempL)].drop(columns=['Mean', 'SD', 'nZeros'])
tempDF = pd.concat([tempDF1, tempDF2, tempDF3], axis=0)
tempDF['Model'] = np.repeat(['Female', 'Male', 'Both sex'], [len(tempDF1), len(tempDF2), len(tempDF3)])

#tidyr::gather in R
tempDF1 = tempDF.reset_index().melt(var_name='LASSO', value_name='bcoef', id_vars=['Variable', 'Model'])

#Order for plot and category for shading
tempDF2 = tempDF1.loc[tempDF1['Model']=='Both sex'].groupby('Variable', as_index=True).agg({'bcoef':'mean'})
tempDF2 = tempDF2.sort_values(by='bcoef', ascending=False)
tempDF2 = pd.DataFrame(index=tempDF2.index)
tempL1 = []
tempL2 = []
for row_n in tempDF2.index.tolist():
    if row_n in metDF.columns.tolist():
        tempL1.append('Metabolomics')
        tempL2.append('b')
    elif row_n in chemDF.columns.tolist():
        tempL1.append('Clinical labs')
        tempL2.append('g')
    elif row_n in protDF.columns.tolist():
        tempL1.append('Proteomics')
        tempL2.append('r')
    else:
        tempL1.append('Error?')
        tempL2.append('k')
tempDF2['Category'] = tempL1
tempDF2['Color'] = tempL2

#Plot
sns.set(style='ticks', font='Arial', context='talk')
sns.catplot(data=tempDF1, y='Variable', order=tempDF2.index.tolist(), x='bcoef', kind='box',
            hue='Model', palette='Set1', dodge=True, height=20, aspect=0.7,
            showfliers=True, flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4})
plt.grid(axis='x', linestyle='--', color='black')
for row_i in range(len(tempDF2)):
    cat_color = tempDF2['Color'].iloc[row_i]
    if row_i%2 == 0:
        plt.axhspan(ymin=row_i-0.5, ymax=row_i+0.5, facecolor=cat_color, alpha=0.2, zorder=0)
    else:
        plt.axhspan(ymin=row_i-0.5, ymax=row_i+0.5, facecolor=cat_color, alpha=0.4, zorder=0)
plt.xlabel(r'$\beta$'+'-coefficient in LASSO model')
plt.ylabel('')
plt.show()

In [None]:
#Both sex model
tempDF = combiWHtR_B_bcoefs
bwhtr = 'CombiWHtR'
sex = 'BothSex'

#Variables with non-zero beta-coefficient in all 10 models
tempDF = tempDF.loc[tempDF['nZeros']==0].sort_values(by='Mean', ascending=False)
tempDF = tempDF.drop(columns=['Mean', 'SD', 'nZeros'])
print('Variables with non-zero beta-coefficient in all 10 models:', len(tempDF))

#tidyr::gather in R
tempDF1 = tempDF.reset_index().melt(var_name='LASSO', value_name='bcoef', id_vars=['Variable'])

#Category for shading
tempDF2 = pd.DataFrame(index=tempDF.index)
tempL1 = []
tempL2 = []
for row_n in tempDF2.index.tolist():
    if row_n in metDF.columns.tolist():
        tempL1.append('Metabolomics')
        tempL2.append('b')
    elif row_n in chemDF.columns.tolist():
        tempL1.append('Clinical labs')
        tempL2.append('g')
    elif row_n in protDF.columns.tolist():
        tempL1.append('Proteomics')
        tempL2.append('r')
    else:
        tempL1.append('Error?')
        tempL2.append('k')
tempDF2['Category'] = tempL1
tempDF2['Color'] = tempL2

#Plot
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(5, 12.5))
p = sns.boxplot(data=tempDF1, y='Variable', order=tempDF2.index.tolist(), x='bcoef',
                palette=tempDF2['Color'], dodge=False, saturation=1,
                showfliers=True, flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4},
                showcaps=True, notch=False)
#p.set(xlim=(-0.046, 0.046), xticks=np.arange(-0.04, 0.041, 0.02))#Fixed across omics
p.grid(axis='x', linestyle='--', color='k')
sns.despine()
##Change default dull line color of sns.boxplot (saturation parameter is for patch)
for line in p.get_lines():
    line.set_color('k')
for box in p.artists:
    box.set_edgecolor('k')
##Add background color
for row_i in range(len(tempDF2)):
    cat_color = tempDF2['Color'].iloc[row_i]
    plt.axhspan(ymin=row_i-0.5, ymax=row_i+0.5, facecolor=cat_color, alpha=0.4, zorder=0)
plt.ylabel('')
plt.xlabel(r'$\beta$'+'-coefficient in LASSO model\n [log-scaled WHtR per s.d.]')
#Add legend
legend1 = mpatches.Patch(facecolor='b', edgecolor='k', label='Metabolomics')
legend2 = mpatches.Patch(facecolor='r', edgecolor='k', label='Proteomics')
legend3 = mpatches.Patch(facecolor='g', edgecolor='k', label='Clinical labs')
plt.legend(handles=[legend1, legend2, legend3], fontsize='large',
           title='Omics category', title_fontsize='x-large',
           bbox_to_anchor=(-0.7, 0.0), loc='lower right', borderaxespad=0)#Manual adjustment
##Save
fileDir = './ExportFigures/'
ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
fileName = bwhtr+'-'+sex+'-bcoef_non-zero-in-all.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

### 5-5. Correlation b/w pairwise variables retained in all 10 models

In [None]:
#Both sex model
tempDF1 = combiWHtR_B_bcoefs
tempDF2 = combiDF_B
bwhtr_color = 'm'
bwhtr = 'CombiWHtR'
sex = 'BothSex'

print('Both sex model:')
#Variables with non-zero beta-coefficient in all 10 models
tempL = tempDF1.loc[tempDF1['nZeros']==0].index.tolist()

#Compute correlation matrix
tempDF = tempDF2[tempL]
tempDF = tempDF.corr(method='pearson')
print('• Combinations:', int(len(tempDF)*(len(tempDF)-1)/2))

#Extract lower triangle matrix
tempDF = tempDF.where(np.tril(np.ones(tempDF.shape), k=-1).astype(bool), other=np.nan)
tempDF.index.rename('Variable1', inplace=True)
tempDF = tempDF.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
tempDF = tempDF.dropna()
print('• nrows:', len(tempDF))
print('• |Pearson\'s r| > 0.8:', len(tempDF.loc[abs(tempDF['Pearson_r'])>0.8]))
display(tempDF.loc[abs(tempDF['Pearson_r'])>0.8])

#Distribution
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(4, 3))
sns.distplot(tempDF['Pearson_r'], color=bwhtr_color).set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
plt.axvline(x=tempDF['Pearson_r'].max(), **{'linestyle':'--', 'color':bwhtr_color})
plt.axvline(x=tempDF['Pearson_r'].min(), **{'linestyle':'--', 'color':bwhtr_color})
sns.despine()
plt.ylabel('Density')
plt.xlabel('Pearson\'s '+r'$r$')
plt.show()

print('Variables with non-zero beta-coefficient in all 10 models:', len(tempL))
tempDF = tempDF2[tempL]
tempDF = tempDF.corr(method='pearson')
#Category
tempS = pd.Series(index=tempDF.index, name='Omics category', dtype=object)#Holds color strings
for row_n in tempS.index.tolist():
    if row_n in metDF.columns.tolist():
        tempS[row_n] = 'tab:blue'
    elif row_n in chemDF.columns.tolist():
        tempS[row_n] = 'tab:green'
    elif row_n in protDF.columns.tolist():
        tempS[row_n] = 'tab:red'
    else:#Just in case
        tempS[row_n] = 'k'
#Clustermap
sns.set(style='ticks', font='Arial', context='notebook')
cmap = sns.diverging_palette(220, 20, as_cmap=True)
cm = sns.clustermap(tempDF, method='ward', metric='euclidean', cmap=cmap,
                    row_cluster=True, col_cluster=True, row_linkage=None, col_linkage=None,
                    row_colors=tempS, col_colors=tempS, xticklabels=True, yticklabels=True,
                    dendrogram_ratio=(0.1, 0.1), colors_ratio=(0.02, 0.02),
                    cbar_pos=(0.7, 0.2, 0.25, 0.025), cbar_kws={"orientation": "horizontal"},
                    figsize=(11, 11), **{'center':0, 'vmin':-1, 'vmax':1})
cm.cax.set_title('Pearson\'s '+r'$r$', size='xx-large', verticalalignment='bottom')
cm.cax.tick_params(labelsize='x-large')
hm = cm.ax_heatmap.get_position()
rd = cm.ax_row_dendrogram.get_position()
cd = cm.ax_col_dendrogram.get_position()
cm.ax_heatmap.set_position([hm.x0, hm.y0, hm.width, hm.height])
cm.ax_row_dendrogram.set_position([rd.x0+rd.width*0.5, rd.y0, rd.width*0.5, rd.height])
cm.ax_col_dendrogram.set_position([cd.x0, cd.y0, cd.width, cd.height*0.5])
##Row/column color bar legend (the axis is the same as cm.cax!)
legend1 = mpatches.Patch(color='tab:blue', label='Metabolomics')
legend2 = mpatches.Patch(color='tab:red', label='Proteomics')
legend3 = mpatches.Patch(color='tab:green', label='Clinical labs')
plt.legend(handles=[legend1, legend2, legend3], fontsize='x-large',
           title=tempS.name, title_fontsize='xx-large',
           bbox_to_anchor=(0.5, 0), loc='upper center', borderaxespad=3, frameon=False)
##Save
fileDir = './ExportFigures/'
ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
fileName = 'correlation-'+bwhtr+'-'+sex+'-vars_non-zero-in-all.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

### 5-6. Explained variance in WHtR by the variables retained in ≥1 model

> OLS linear regression model for significance: log(WHtR) ~ b0 + b1\*Variable + b2\*Sex + b3\*BaseAge + b4\*PCs  
> Univariate model: log(WHtR) ~ b0 + b1\*Variable  
>
> Of note, biological WHtR is included for comparison.  
> As in the LASSO models, unstandardized WHtR is used as the dependent variable.  
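
The two models above, followed by the Benjamini–Hochberg adjustment, can be sketched on simulated data as follows. The variables `analyteA` and `analyteB`, the simulated effect sizes, and the use of a single principal component are assumptions for illustration only; the result column names follow this notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats import multitest as multi

# Simulated cohort (hypothetical; predictors are already on a z-score-like scale)
rng = np.random.default_rng(1)
n = 200
toyDF = pd.DataFrame({'Sex': rng.choice(['F', 'M'], size=n),
                      'BaseAge': rng.normal(0, 1, n),
                      'PC1': rng.normal(0, 1, n),
                      'analyteA': rng.normal(0, 1, n),
                      'analyteB': rng.normal(0, 1, n)})
# Simulated outcome: depends on analyteA and age, but not on analyteB
toyDF['log_BaseWHtR'] = (-0.6 + 0.05*toyDF['analyteA'] + 0.02*toyDF['BaseAge']
                         + rng.normal(0, 0.05, n))

rows = []
for var in ['analyteA', 'analyteB']:
    # Rename the tested variable so that one formula serves every variable
    df = toyDF.rename(columns={var: 'Variable'})
    # Univariate model: its R2 is the explained variance in WHtR
    uni = smf.ols('log_BaseWHtR ~ Variable', data=df).fit()
    # Covariate-adjusted model: its P-value is used for significance
    full = smf.ols('log_BaseWHtR ~ Variable + C(Sex) + BaseAge + PC1', data=df).fit()
    rows.append({'Variable': var, 'UnivarR2': uni.rsquared*100, 'R2': full.rsquared*100,
                 'OLSbcoef': full.params['Variable'], 'Pval': full.pvalues['Variable']})
resDF = pd.DataFrame(rows).set_index('Variable')

# Benjamini-Hochberg adjustment across all tested variables
resDF['AdjPval'] = multi.multipletests(resDF['Pval'], alpha=0.05, method='fdr_bh')[1]
print(resDF)
```

Keeping the univariate fit separate from the adjusted fit means that the reported explained variance reflects the variable alone, whereas significance is assessed after accounting for the covariates.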

In [None]:
tempD1 = {'Female':combiWHtR_F_bcoefs, 'Male':combiWHtR_M_bcoefs, 'BothSex':combiWHtR_B_bcoefs}
tempD2 = {'Female':combiDF_F, 'Male':combiDF_M, 'BothSex':combiDF_B}
tempD3 = {'Female':whtrDF_F, 'Male':whtrDF_M, 'BothSex':whtrDF_B}
yvar = 'WHtR'
bwhtr = 'CombiWHtR'
topX = 30
for sex in tempD1.keys():
    print(sex+' model:')
    
    #Variables with a non-zero beta-coefficient in at least 1 of the 10 models
    tempDF = tempD1[sex]
    tempL = tempDF.loc[tempDF['nZeros']==10].index.tolist()#To be excluded
    
    #Prepare DF for regressions
    tempDF = tempD2[sex].drop(columns=tempL)
    tempDF = tempDF.rename(columns={'log_Base'+bwhtr:bwhtr+' model'})
    #Replace the standardized dependent variable with the unstandardized
    tempDF1 = tempD3[sex]
    tempDF['log_Base'+yvar] = tempDF1['log_Base'+yvar]
    
    #Independent variables for regressions (including bWHtR)
    tempL = tempDF.drop(columns=whtrDF.columns.tolist()).columns.tolist()
    
    #Perform OLS linear regression
    tempL1 = []#For sample size
    tempL2 = []#For R2 in the univariate model
    tempL3 = []#For R2
    tempL4 = []#For beta-coefficient
    tempL5 = []#For SE of beta-coefficient
    tempL6 = []#For t-statistic
    tempL7 = []#For residual degrees of freedom
    tempL8 = []#For P-value
    t_start = time.time()
    for var in tempL:
        #Rename independent variable
        tempDF1 = tempDF.rename(columns={var: 'Variable'})
        #No need to add a constant for the intercept -> Similar to R, smf automatically adds one
        
        #Save sample size
        tempL1.append(len(tempDF1))
        
        #Fit univariate model
        formula = 'log_Base'+yvar+' ~ Variable'
        fit_res = smf.ols(formula, data=tempDF1).fit()
        #Save R2 [%]
        tempL2.append(fit_res.rsquared*100)

        #Fit full model
        if sex=='BothSex':
            formula = 'log_Base'+yvar+' ~ Variable + C(Sex) + BaseAge + PC1 + PC2 + PC3 + PC4 + PC5'
        else:#Sex is invariant
            formula = 'log_Base'+yvar+' ~ Variable + BaseAge + PC1 + PC2 + PC3 + PC4 + PC5'
        fit_res = smf.ols(formula, data=tempDF1).fit()
        #Save R2 [%]
        tempL3.append(fit_res.rsquared*100)
        #Save beta-coefficient of the variable
        tempL4.append(fit_res.params['Variable'])
        tempL5.append(fit_res.bse['Variable'])
        #Save t-statistic of the variable
        tempL6.append(fit_res.tvalues['Variable'])
        #Save residual degrees of freedom
        tempL7.append(int(fit_res.df_resid))
        #Save P-value of the variable
        tempL8.append(fit_res.pvalues['Variable'])
    t_elapsed = time.time() - t_start
    print('Elapsed time for', len(tempL), 'OLS linear regressions (including '+bwhtr+'):',
          round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')
    
    #Clean the results
    tempDF = pd.DataFrame({'N':tempL1, 'UnivarR2':tempL2, 'R2':tempL3,
                           'OLSbcoef':tempL4, 'OLSbcoefSE':tempL5,
                           'tStat':tempL6, 'DoF':tempL7,  'Pval':tempL8},
                          index=pd.Index(tempL, name='Variable'))
    ##P-value adjustment by using Benjamini–Hochberg method
    tempDF['AdjPval'] = multi.multipletests(tempDF['Pval'], alpha=0.05, method='fdr_bh',
                                            is_sorted=False, returnsorted=False)[1]
    ##Add the LASSO results
    tempDF1 = tempD1[sex]
    tempDF['LASSObcoef'] = tempDF1.loc[tempDF1.index.isin(tempL)]['Mean']#NaN in bWHtR reference
    tempDF['LASSOnZeros'] = tempDF1.loc[tempDF1.index.isin(tempL)]['nZeros']#NaN in bWHtR reference
    tempDF = tempDF.sort_values(by='AdjPval', ascending=True)
    
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
    fileName = bwhtr+'-'+sex+'-OLS.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    #Extract significant variables
    tempDF = tempDF[tempDF['AdjPval']<0.05]
    print('Variables significantly associated with WHtR (FDR < 0.05):', len(tempDF)-1)#Minus 1 for the bWHtR reference
    ##Top X (+ bWHtR)
    tempDF = tempDF.iloc[0:(topX+1)]
    display(tempDF)
    ##Category and color
    tempL = []
    for row_n in tempDF.index.tolist():
        if row_n==bwhtr+' model':
            tempL.append('Positive association')
        else:
            bcoef = tempDF.loc[row_n, 'OLSbcoef']
            if bcoef>0:
                tempL.append('Positive association')
            elif bcoef<0:
                tempL.append('Negative association')
            else:
                tempL.append('Error?')
    tempDF['Association'] = tempL
    tempD = {'Positive association':'tab:red', 'Negative association':'tab:blue'}
    ##Plot
    tempDF = tempDF.reset_index()
    sns.set(style='ticks', font='Arial', context='talk')
    plt.figure(figsize=(10, 4))
    p = sns.barplot(data=tempDF, y='UnivarR2', x='Variable', hue='Association', hue_order=tempD.keys(),
                    dodge=False, palette=tempD, edgecolor='k')
    p.set(ylim=(0, 80), yticks=np.arange(0, 80, 15))#Fixed across omics
    p.grid(axis='y', linestyle='--', color='black')
    sns.despine()
    plt.xticks(rotation=90, horizontalalignment='center')
    plt.xlabel('')
    plt.ylabel('Explained variance in WHtR [%]')
    plt.legend(loc='upper right')
    if sex=='BothSex':#Save
        fileDir = './ExportFigures/'
        ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
        fileName = bwhtr+'-'+sex+'-OLS-top'+str(topX)+'.tif'
        plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                          pil_kwargs={'compression':'tiff_lzw'})
    plt.show()
    print('')

## 6. Comparison b/w omics

### 6-1. Pearson's r

#### 6-1-1. Actual values in LASSO models

In [None]:
#Both sex cohort
sex = 'BothSex'
tempL1 = [metDF_B, protDF_B, chemDF_B, combiDF_B]
tempL2 = ['log_BaseMetWHtR', 'log_BaseProtWHtR', 'log_BaseChemWHtR', 'log_BaseCombiWHtR']

print('All variables')
tempD = {'Metabolomics':'b', 'Proteomics':'r', 'Clinical labs':'g', 'Combined omics':'m'}
tempDF = pd.DataFrame(columns=['Category', 'Variable1', 'Variable2', 'Pearson_r'])
countL1 = []
countL2 = []
for df_i in range(len(tempL1)):
    print(list(tempD.keys())[df_i])
    #Compute correlation matrix
    tempL = [item for sublist in [whtrDF.columns.tolist(), [tempL2[df_i]]] for item in sublist]
    tempDF1 = tempL1[df_i].drop(columns=tempL)
    tempDF1 = tempDF1.corr(method='pearson')
    print(' - Combinations:', int(len(tempDF1)*(len(tempDF1)-1)/2))
    #Extract lower triangle matrix
    tempDF1 = tempDF1.where(np.tril(np.ones(tempDF1.shape), k=-1).astype(bool), other=np.nan)
    tempDF1.index.rename('Variable1', inplace=True)
    tempDF1 = tempDF1.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
    tempDF1 = tempDF1.dropna()
    print(' - nrows:', len(tempDF1))
    countL1.append(len(tempDF1))
    print(' - |Pearson\'s r| > 0.8:', len(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8]))
    countL2.append(len(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8]))
    #Clean
    tempDF1['Category'] = list(tempD.keys())[df_i]
    tempDF = pd.concat([tempDF, tempDF1], axis=0)
print('Confirmation')
display(tempDF['Category'].value_counts())

#Distribution with violinplot
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(4, 3))
ax = sns.violinplot(data=tempDF, x='Pearson_r', y='Category',
                    order=list(tempD.keys()), palette=tempD, dodge=False, scale='width', inner='box')
ax.set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
sns.despine()
##Add annotation
offset_x = 1.75
ax.annotate('| Pearson\'s '+r'$r$'+' | > 0.8', (offset_x, -1), fontsize='medium', annotation_clip=False,
            verticalalignment='center', horizontalalignment='center')
for df_i in range(len(tempD.keys())):
    ax.annotate(f'{countL2[df_i]:,}'+' / '+f'{countL1[df_i]:,}'+' pairs',
                (offset_x, df_i), fontsize='small', annotation_clip=False,
                verticalalignment='center', horizontalalignment='center')
plt.xlabel('Pearson\'s '+r'$r$')
plt.ylabel('')
##Save
fileDir = './ExportFigures/'
ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
fileName = 'correlation-'+sex+'-all-vars-pair.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

In [None]:
#Both sex cohort
sex = 'BothSex'
tempL1 = [metDF_B, protDF_B, chemDF_B, combiDF_B]
tempL2 = [metWHtR_B_bcoefs, protWHtR_B_bcoefs, chemWHtR_B_bcoefs, combiWHtR_B_bcoefs]

print('Variables with non-zero beta-coefficient in all 10 models')
tempD = {'Metabolomics':'b', 'Proteomics':'r', 'Clinical labs':'g', 'Combined omics':'m'}
tempDF = pd.DataFrame(columns=['Category', 'Variable1', 'Variable2', 'Pearson_r'])
countL1 = []
countL2 = []
for df_i in range(len(tempL1)):
    print(list(tempD.keys())[df_i])
    ##Variables with non-zero beta-coefficient in all 10 models
    tempDF1 = tempL2[df_i]
    tempL = tempDF1.loc[tempDF1['nZeros']==0].index.tolist()
    #Compute correlation matrix
    tempDF1 = tempL1[df_i].loc[:, tempL]
    tempDF1 = tempDF1.corr(method='pearson')
    print(' - Combinations:', int(len(tempDF1)*(len(tempDF1)-1)/2))
    #Extract lower triangle matrix
    tempDF1 = tempDF1.where(np.tril(np.ones(tempDF1.shape), k=-1).astype(bool), other=np.nan)
    tempDF1.index.rename('Variable1', inplace=True)
    tempDF1 = tempDF1.reset_index().melt(var_name='Variable2', value_name='Pearson_r', id_vars=['Variable1'])
    tempDF1 = tempDF1.dropna()
    print(' - nrows:', len(tempDF1))
    countL1.append(len(tempDF1))
    print(' - |Pearson\'s r| > 0.8:', len(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8]))
    countL2.append(len(tempDF1.loc[abs(tempDF1['Pearson_r'])>0.8]))
    #Clean
    tempDF1['Category'] = list(tempD.keys())[df_i]
    tempDF = pd.concat([tempDF, tempDF1], axis=0)
print('Confirmation')
display(tempDF['Category'].value_counts())

#Distribution with violinplot
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(4, 3))
ax = sns.violinplot(data=tempDF, x='Pearson_r', y='Category',
                    order=list(tempD.keys()), palette=tempD, dodge=False, scale='width', inner='box')
ax.set(xlim=(-1, 1), xticks=np.arange(-1, 1.1, 0.5))
sns.despine()
##Add annotation
offset_x = 1.75
ax.annotate('| Pearson\'s '+r'$r$'+' | > 0.8', (offset_x, -1), fontsize='medium', annotation_clip=False,
            verticalalignment='center', horizontalalignment='center')
for df_i in range(len(tempD.keys())):
    ax.annotate(f'{countL2[df_i]:,}'+' / '+f'{countL1[df_i]:,}'+' pairs',
                (offset_x, df_i), fontsize='small', annotation_clip=False,
                verticalalignment='center', horizontalalignment='center')
plt.xlabel('Pearson\'s '+r'$r$')
plt.ylabel('')
##Save
fileDir = './ExportFigures/'
ipynbName = '220823_Multiomics-BMI-NatMed1stRevision_WHtR-LASSO-bcoef-ver2_'
fileName = 'correlation-'+sex+'-LASSO-vars-pair.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

### 6-2. Variables retained in all 10 models

#### 6-2-1. vs. Metabolomics

In [None]:
tempD1 = {'Female':metWHtR_F_bcoefs, 'Male':metWHtR_M_bcoefs, 'Both sex':metWHtR_B_bcoefs}
tempD2 = {'Female':combiWHtR_F_bcoefs, 'Male':combiWHtR_M_bcoefs, 'Both sex':combiWHtR_B_bcoefs}
tempT1 = ('MetWHtR', 'CombiWHtR')
tempT2 = ('b', 'm')

for sex in tempD1.keys():
    print(sex+':')
    tempDF1 = tempD1[sex]
    tempDF2 = tempD2[sex]
    
    #Variables with non-zero beta-coefficient in all 10 models
    tempS1 = set(tempDF1.loc[tempDF1['nZeros']==0].index)
    tempS2 = set(tempDF2.loc[tempDF2['nZeros']==0].index)
    
    #Restrict the combined-omics set to analytes of the same omics type
    tempS2 = tempS2 & set(tempDF1.index)
    
    #Common variables with non-zero beta-coefficient in all 10 models
    tempS3 = tempS1 & tempS2
    print('Common variables with non-zero beta-coefficient in all 10 models:\n', tempS3)
    
    #Not common variables with non-zero beta-coefficient in all 10 models
    tempS1 = tempS1 - tempS3
    tempS2 = tempS2 - tempS3
    
    #The number of each subset
    tempL = [len(tempS1), len(tempS2), len(tempS3)]
    print('Each subset:', tempL)
    
    #Venn diagram
    sns.set(font='Arial', context='talk')
    venn2(subsets=tempL, set_labels=tempT1, set_colors=tempT2, alpha=0.7)
    venn2_circles(subsets=tempL)
    plt.title('—'+sex+' model—', fontdict={'fontsize':24})
    plt.show()
    print('')

#### 6-2-2. vs. Clinical labs

In [None]:
tempD1 = {'Female':chemWHtR_F_bcoefs, 'Male':chemWHtR_M_bcoefs, 'Both sex':chemWHtR_B_bcoefs}
tempD2 = {'Female':combiWHtR_F_bcoefs, 'Male':combiWHtR_M_bcoefs, 'Both sex':combiWHtR_B_bcoefs}
tempT1 = ('ChemWHtR', 'CombiWHtR')
tempT2 = ('g', 'm')

for sex in tempD1.keys():
    print(sex+':')
    tempDF1 = tempD1[sex]
    tempDF2 = tempD2[sex]
    
    #Variables with non-zero beta-coefficient in all 10 models
    tempS1 = set(tempDF1.loc[tempDF1['nZeros']==0].index)
    tempS2 = set(tempDF2.loc[tempDF2['nZeros']==0].index)
    
    #Restrict the combined-omics set to analytes of the same omics type
    tempS2 = tempS2 & set(tempDF1.index)
    
    #Common variables with non-zero beta-coefficient in all 10 models
    tempS3 = tempS1 & tempS2
    print('Common variables with non-zero beta-coefficient in all 10 models:\n', tempS3)
    
    #Not common variables with non-zero beta-coefficient in all 10 models
    tempS1 = tempS1 - tempS3
    tempS2 = tempS2 - tempS3
    
    #The number of each subset
    tempL = [len(tempS1), len(tempS2), len(tempS3)]
    print('Each subset:', tempL)
    
    #Venn diagram
    sns.set(font='Arial', context='talk')
    venn2(subsets=tempL, set_labels=tempT1, set_colors=tempT2, alpha=0.7)
    venn2_circles(subsets=tempL)
    plt.title('—'+sex+' model—', fontdict={'fontsize':24})
    plt.show()
    print('')

#### 6-2-3. vs. Proteomics

In [None]:
tempD1 = {'Female':protWHtR_F_bcoefs, 'Male':protWHtR_M_bcoefs, 'Both sex':protWHtR_B_bcoefs}
tempD2 = {'Female':combiWHtR_F_bcoefs, 'Male':combiWHtR_M_bcoefs, 'Both sex':combiWHtR_B_bcoefs}
tempT1 = ('ProtWHtR', 'CombiWHtR')
tempT2 = ('r', 'm')

for sex in tempD1.keys():
    print(sex+':')
    tempDF1 = tempD1[sex]
    tempDF2 = tempD2[sex]
    
    #Variables with non-zero beta-coefficient in all 10 models
    tempS1 = set(tempDF1.loc[tempDF1['nZeros']==0].index)
    tempS2 = set(tempDF2.loc[tempDF2['nZeros']==0].index)
    
    #Restrict the combined-omics set to analytes of the same omics type
    tempS2 = tempS2 & set(tempDF1.index)
    
    #Common variables with non-zero beta-coefficient in all 10 models
    tempS3 = tempS1 & tempS2
    print('Common variables with non-zero beta-coefficient in all 10 models:\n', tempS3)
    
    #Not common variables with non-zero beta-coefficient in all 10 models
    tempS1 = tempS1 - tempS3
    tempS2 = tempS2 - tempS3
    
    #The number of each subset
    tempL = [len(tempS1), len(tempS2), len(tempS3)]
    print('Each subset:', tempL)
    
    #Venn diagram
    sns.set(font='Arial', context='talk')
    venn2(subsets=tempL, set_labels=tempT1, set_colors=tempT2, alpha=0.7)
    venn2_circles(subsets=tempL)
    plt.title('—'+sex+' model—', fontdict={'fontsize':24})
    plt.show()
    print('')

# — End of this notebook —