# Multiomics BMI Paper — Elastic Net-modeling for BMI from the Arivale Baseline Omics

***by Kengo Watanabe***  

This Jupyter Notebook (with Python 3 kernel) generated the elastic net linear regression models for predicting BMI (biological BMI) from each of the Arivale baseline blood omic datasets, and calculated the testing (hold-out) set-derived BMI predictions for the Arivale cohort.  

Input files:  
* Arivale baseline BMI and blood omics (preprocessed): 210104_Biological-BMI-paper_RF-imputation_baseline-\[metDF/protDF/chemDF/combiDF\]-with-RF-imputation.tsv  

Output figures and tables:  
* Intermediate tables for other notebooks (BMI predictions, beta-coefficients)  

Original notebook (memo for my future tracing):  
* dalek:\[JupyterLab HOME\]/220621_Multiomics-BMI-NatMedRevision/220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet.ipynb  

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#For Arial font
#!conda install -c conda-forge -y mscorefonts
##-> The below was also needed in matplotlib 3.4.2
#import shutil
#import matplotlib
#shutil.rmtree(matplotlib.get_cachedir())
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display
import time

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV
from sklearn.linear_model import LinearRegression
from statsmodels.stats import weightstats
from statsmodels.stats import multitest as multi
from decimal import Decimal, ROUND_HALF_UP

!conda list

# packages in environment at /opt/conda/envs/arivale-py3:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
analytics                 0.1                      pypi_0    pypi
argon2-cffi               21.1.0           py39h3811e60_0    conda-forge
arivale-data-interface    0.1.0                    pypi_0    pypi
async_generator           1.10                       py_0    conda-forge
atk-1.0                   2.36.0               h3371d22_4    conda-forge
attrs                     21.2.0             pyhd8ed1ab_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
biopython                 1.79             py39h3811e60_0    conda-forge
bleach 

## 1. Data preparation

> The following code is identical to the code used in the baseline LASSO modeling. Hence, each participant is assigned to the same testing (hold-out) set in both analyses.  

### 1-1. Import the cleaned dataframes

In [None]:
#Import the baseline BMI dataframe
fileDir = '../210104_Biological-BMI-paper/ExportData/'
ipynbName = '210104_Biological-BMI-paper_RF-imputation_'
fileName = 'baseline-combiDF-with-RF-imputation.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('public_client_id')
##Take BMI and general covariates (without Race in this study)
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
tempDF = tempDF[tempL]

display(tempDF)

bmiDF = tempDF

In [None]:
#Import the cleaned baseline omics dataframes
fileDir = '../210104_Biological-BMI-paper/ExportData/'
ipynbName = '210104_Biological-BMI-paper_RF-imputation_'
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'Race']
tempD = {}
for df_n in ['metDF', 'chemDF', 'protDF', 'combiDF']:
    fileName = 'baseline-'+df_n+'-with-RF-imputation.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
    tempDF = tempDF.set_index('public_client_id')
    print(df_n+' original shape:', tempDF.shape)
    #Drop BMI and covariates
    tempDF = tempDF.drop(columns=tempL)
    display(tempDF)
    tempD[df_n] = tempDF

metDF = tempD['metDF']
chemDF = tempD['chemDF']
protDF = tempD['protDF']
combiDF = tempD['combiDF']

### 1-2. Stratification with sex

In [None]:
#Stratify the cohort with sex
bmiDF_F = bmiDF.loc[bmiDF['Sex']=='F']
bmiDF_M = bmiDF.loc[bmiDF['Sex']=='M']
bmiDF_B = bmiDF#Not a copy; just another name for the same DF
print('Female, Male, Both sex = ', len(bmiDF_F), ', ', len(bmiDF_M), ', ', len(bmiDF_B))

### 1-3. Split the cohort into 10 sets

In [None]:
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
nmodels = 10
tempD2 = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    #Split cohort to define the training and testing (hold-out) sets
    tempL = np.array_split(tempDF, nmodels)#List of DFs
    tempD = {}
    for model_k in range(nmodels):
        tempDF1 = tempL[model_k]
        model_n = 'Model_'+str(model_k+1).zfill(2)
        tempS = pd.Series(np.repeat(model_n, len(tempDF1)),
                          index=tempDF1.index, name='Testing')
        tempD[model_k] = tempS
    tempS = pd.concat(list(tempD.values()), axis=0)
    #Add the info to bmiDF
    tempDF = pd.merge(tempDF, tempS, left_index=True, right_index=True, how='left')
    tempD2[sex] = tempDF
    print(sex)
    display(tempDF)
    display(tempDF['Testing'].value_counts())
    print('')
#Update
bmiDF_F = tempD2['Female']
bmiDF_M = tempD2['Male']
bmiDF_B = tempD2['BothSex']

### 1-4. Check independence among the sets

> See the notebook for LASSO.  
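
> The check performed in the LASSO notebook is not reproduced here. As a minimal sanity-check sketch, assuming a toy cohort (`toyDF`, with hypothetical IDs and values) in place of `bmiDF`, the following confirms that an order-based split like the one in section 1-3 puts every participant in exactly one hold-out set and that training and hold-out IDs never overlap.  

```python
import numpy as np
import pandas as pd

# Toy cohort standing in for bmiDF (hypothetical IDs and values)
toyDF = pd.DataFrame({'log_BaseBMI': np.linspace(3.0, 3.6, 23)},
                     index=['id'+str(i).zfill(2) for i in range(23)])
nmodels = 10

# Same idea as section 1-3: contiguous, order-based split into 10 hold-out sets
# (row positions are split here instead of the DataFrame itself)
chunks = np.array_split(np.arange(len(toyDF)), nmodels)
testing = pd.Series('', index=toyDF.index, name='Testing')
for model_k, pos in enumerate(chunks):
    testing.iloc[pos] = 'Model_'+str(model_k+1).zfill(2)

# Every participant belongs to exactly one hold-out set
assert testing.index.is_unique
assert (testing != '').all()
sizes = testing.value_counts()
assert sizes.sum() == len(toyDF)
assert sizes.max() - sizes.min() <= 1#Set sizes differ by at most one

# Training and hold-out IDs never overlap for any model
for model_n in sizes.index:
    test_ids = set(testing.index[testing == model_n])
    train_ids = set(testing.index[testing != model_n])
    assert test_ids.isdisjoint(train_ids)
```

> Because the split depends only on row order and involves no random number, rerunning it on the same dataframe gives the same assignment.  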

## 2. Metabolomics

### 2-1. Standardization

> Note that all preprocessing, including outlier elimination, missingness filtering, imputation, and standardization, should ideally be performed within each cross-validation fold, not across the whole dataset, to minimize potential data leakage. In this study, however, an external dataset is used for “validation” of the findings (i.e., as a testing set for the fitted models). Hence, the robustness of preprocessing is prioritized.  
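
> For reference only, since this notebook does not do it: a minimal sketch of fold-wise standardization, assuming synthetic data and a fixed-penalty `ElasticNet` (the `alpha=0.01` value is arbitrary and for illustration). Wrapping the scaler and the model in a scikit-learn pipeline makes the scaler refit on the training part of each fold, so the held-out part never contributes to the mean and SD.  

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (hypothetical): 120 samples, 8 features, 2 informative
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = 0.5*X[:, 0] - 0.3*X[:, 1] + rng.normal(scale=0.1, size=120)

# Scaler and model are fitted together on each training fold
pipe = make_pipeline(StandardScaler(), ElasticNet(alpha=0.01, l1_ratio=0.5))
cv = KFold(n_splits=10, shuffle=False)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='r2')
print('Out-of-fold R2 [mean]:', scores.mean())
```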

In [None]:
tempDF = metDF
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempD2 = {}
for sex in tempD1.keys():
    tempDF1 = tempD1[sex]
    tempDF1 = tempDF.loc[tempDF1.index.tolist()]
    #Z-score transformation
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    tempA = scaler.fit_transform(tempDF1)#Column direction
    tempDF1 = pd.DataFrame(data=tempA, index=tempDF1.index, columns=tempDF1.columns)
    tempD2[sex] = tempDF1
    
    #Confirmation
    print(sex+':', tempDF1.shape)
    display(tempDF1.describe())
    sns.set(style='ticks', font='Arial', context='notebook')
    plt.figure(figsize=(4, 3))
    for col_i in range(3):
        sns.distplot(tempDF1.iloc[:, col_i], label=tempDF1.columns[col_i])
    sns.despine()
    plt.xlabel(r'$Z$'+'-score')
    plt.ylabel('Density')
    plt.legend(bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1)
    plt.show()
    print('')

metDF_F = tempD2['Female']
metDF_M = tempD2['Male']
metDF_B = tempD2['BothSex']
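
> A note on reading the `describe()` output above: `StandardScaler` divides by the population SD (`ddof=0`), whereas pandas reports the sample SD (`ddof=1`), so the displayed `std` is slightly larger than 1. A minimal sketch with a toy dataframe (`toyDF`, hypothetical values):  

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 50 samples, 3 variables
rng = np.random.default_rng(0)
toyDF = pd.DataFrame(rng.normal(loc=5, scale=2, size=(50, 3)), columns=['a', 'b', 'c'])

# Z-score transformation in the column direction, as in the cell above
zDF = pd.DataFrame(StandardScaler().fit_transform(toyDF),
                   index=toyDF.index, columns=toyDF.columns)

sd_pop = zDF.std(ddof=0)#What StandardScaler normalizes to 1
sd_sample = zDF.std(ddof=1)#What describe() displays
print(sd_pop.round(6).tolist(), sd_sample.round(6).tolist())
```

> With n samples, the sample SD of the Z-scores equals sqrt(n/(n-1)), which is negligible for a cohort of this size.  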

### 2-2. Elastic net with cross-validation
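
> For reference, the objective minimized by the elastic net as I understand it from the scikit-learn documentation, where $n$ is the number of training samples, $w$ the beta-coefficients, $\alpha$ the penalization amount chosen by cross-validation, and $\rho$ the `l1_ratio` (fixed at 0.5 in the cells below). The intercept is fitted separately and is not penalized.  

$$
\min_{w}\; \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha\,\rho\,\lVert w \rVert_1 + \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

> With $\rho = 1$ this reduces to the LASSO penalty, and with $\rho = 0$ to a ridge-type penalty.  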

In [None]:
#Female model
tempDF1 = metDF_F#Standardized independent variables
tempDF2 = bmiDF_F#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseMetBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model).astype('float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-Female.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
metBMI_F_bcoefs = bcoefDF
metBMI_F_intercept = interceptDF
metBMI_F_R2 = scoreL
metBMI_F = tempDF

In [None]:
#Male model
tempDF1 = metDF_M#Standardized independent variables
tempDF2 = bmiDF_M#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseMetBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model).astype('float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-Male.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
metBMI_M_bcoefs = bcoefDF
metBMI_M_intercept = interceptDF
metBMI_M_R2 = scoreL
metBMI_M = tempDF

In [None]:
#Both sex model
tempDF1 = metDF_B#Standardized independent variables
tempDF2 = bmiDF_B#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseMetBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model).astype('float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-BothSex.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
metBMI_B_bcoefs = bcoefDF
metBMI_B_intercept = interceptDF
metBMI_B_R2 = scoreL
metBMI_B = tempDF
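
> The three cells above differ only in their inputs and output names. A minimal sketch of how the hold-out loop could be factored into one function, assuming toy data; `fit_holdout_models` and all toy names are hypothetical and are not used elsewhere in this notebook. The `normalize`, `n_alphas`, and `precompute` arguments are left at their defaults here to keep the sketch short.  

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV

def fit_holdout_models(xDF, yS, testingS, cv=10):
    """Fit one ElasticNetCV per hold-out set; return coefficients, R2 list, predictions."""
    #Align everything to the row order of xDF
    yS = yS.reindex(xDF.index)
    testingS = testingS.reindex(xDF.index)
    bcoefDF = pd.DataFrame(index=xDF.columns, dtype='float64')
    scoreL = []
    predL = []
    for model_n in sorted(testingS.unique()):
        is_test = (testingS == model_n).to_numpy()
        model = ElasticNetCV(l1_ratio=0.5, eps=0.05, fit_intercept=True, cv=cv)
        #Internal CV uses the training participants only
        model.fit(xDF.loc[~is_test], yS.loc[~is_test])
        bcoefDF[model_n] = model.coef_
        #Out-of-sample R2 and predictions for the hold-out participants
        scoreL.append(model.score(xDF.loc[is_test], yS.loc[is_test]))
        predL.append(pd.Series(model.predict(xDF.loc[is_test]), index=xDF.index[is_test]))
    predictS = pd.concat(predL).reindex(xDF.index).rename('log_pred')
    return bcoefDF, scoreL, predictS

#Toy usage (hypothetical data): 100 participants, 6 features, 5 hold-out sets
rng = np.random.default_rng(0)
ids = ['id'+str(i).zfill(3) for i in range(100)]
xDF = pd.DataFrame(rng.normal(size=(100, 6)), index=ids,
                   columns=['f'+str(j) for j in range(6)])
yS = pd.Series(0.4*xDF['f0'] - 0.2*xDF['f1'] + rng.normal(scale=0.05, size=100),
               name='log_BaseBMI')
testingS = pd.Series('', index=ids, name='Testing')
for model_k, pos in enumerate(np.array_split(np.arange(100), 5)):
    testingS.iloc[pos] = 'Model_'+str(model_k+1).zfill(2)

bcoefDF, scoreL, predictS = fit_holdout_models(xDF, yS, testingS, cv=5)
print('Out-of-sample R2 [mean]:', np.mean(scoreL))
```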

### 2-3. Prediction accuracy

In [None]:
#Summary
tempD1 = {'Female':metBMI_F_R2, 'Male':metBMI_M_R2, 'Both sex':metBMI_B_R2}
tempD2 = {'Female':metBMI_F, 'Male':metBMI_M, 'Both sex':metBMI_B}
yvar = 'BaseBMI'
yvar_model = 'BaseMetBMI'

for sex in tempD1.keys():
    tempL = tempD1[sex]
    print(sex+' model')
    print(' - Out-of-sample R2 [Mean ± SD]:',
          np.mean(tempL), '±', np.std(tempL, ddof=1))#Sample standard deviation
    print(' - Out-of-sample R2 [Mean ± SEM]:',
          np.mean(tempL), '±', np.std(tempL, ddof=1)/np.sqrt(len(tempL)))
    tempDF = tempD2[sex]
    print(' - Observed vs. predicted log_'+yvar+': (Pearson\'s r, P) =',
          stats.pearsonr(tempDF['log_'+yvar], tempDF['log_'+yvar_model]))
    print(' - Observed vs. predicted '+yvar+': (Pearson\'s r, P) =',
          stats.pearsonr(tempDF[yvar], tempDF[yvar_model]))
    display(tempDF.describe())

> Check the difference between the sex-specific and sex-mixed models for now. Note that this is only a rough comparison because the sample sizes differ.  

In [None]:
#Prepare DF
tempDF = pd.DataFrame({'Female':metBMI_F_R2, 'Male':metBMI_M_R2, 'Both sex':metBMI_B_R2})
tempDF = tempDF.melt(var_name='Cohort', value_name='R2', value_vars=tempDF.columns.tolist())

#Plot
sns.set(style='ticks', font='Arial', context='notebook')
plt.figure(figsize=(3, 1))
sns.barplot(data=tempDF, y='Cohort', x='R2', hue='Cohort', palette='Set1', dodge=False, edgecolor='black',
            ci=95, capsize=0.4, errwidth=1.5, errcolor='black')
sns.stripplot(data=tempDF, y='Cohort', x='R2', hue='Cohort', dodge=False, size=8, edgecolor='black',
              linewidth=1, alpha=0.4, palette={'Female':'gray', 'Male':'gray', 'Both sex':'gray'})
sns.despine()
plt.xlabel('Out-of-sample '+r'$R^2$'+' in 10 elastic net models\n[mean with 95% CI]')
plt.ylabel('')
plt.legend('', frameon=False)
plt.show()
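
> The error bars drawn by `sns.barplot` with `ci=95` are bootstrap confidence intervals of the mean, not the SD or SEM printed in the summary cell. A minimal sketch of a percentile bootstrap, assuming placeholder R2 values (`toy_r2` is made up for illustration and is not a result); `bootstrap_ci_mean` is a hypothetical helper, and seaborn's own resampling will differ in its random draws.  

```python
import numpy as np

def bootstrap_ci_mean(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    #Resample with replacement: one row of indices per bootstrap replicate
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    half = (100 - ci) / 2
    return np.percentile(boot_means, [half, 100 - half])

#Placeholder values only (not from this study)
toy_r2 = [0.61, 0.58, 0.66, 0.55, 0.63, 0.60, 0.57, 0.65, 0.62, 0.59]
ci_low, ci_high = bootstrap_ci_mean(toy_r2)
print('Mean:', np.mean(toy_r2), '95% CI:', ci_low, '-', ci_high)
```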

In [None]:
tempD1 = {'Female':metBMI_F, 'Male':metBMI_M, 'Both sex':metBMI_B}
yvar = 'BaseBMI'
yvar_model = 'BaseMetBMI'
range_min = np.min([df[var].min() for df in tempD1.values() for var in [yvar, yvar_model]])
range_max = np.max([df[var].max() for df in tempD1.values() for var in [yvar, yvar_model]])
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

#Plot measured vs. predicted per model
tempD2 = {'Female':'tab:red', 'Male':'tab:blue', 'Both sex':'tab:green'}
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(10, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    tempDF = tempD1[sex]
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x=yvar_model, y=yvar, color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Measured '+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF[yvar_model], tempDF[yvar])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==2:
        ax2_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax2_pos[0]+ax2_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Predicted '+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

#Plot difference b/w sex-specific and sex-mixed models
tempD2 = {'Female':'tab:red', 'Male':'tab:blue'}
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(6.5, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    #Prepare DF
    tempS1 = tempD1[sex][yvar_model]
    tempS1.name = 'Sex-specific'
    tempS2 = tempD1['Both sex'][yvar_model]
    tempS2.name = 'Sex-mixed'
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='inner')
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Sex-specific', y='Sex-mixed', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Sex-mixed b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Sex-specific'], tempDF['Sex-mixed'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==1:
        ax1_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax1_pos[0]+ax1_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Sex-specific b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()
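
> The P-value formatting is repeated in both plotting loops above. A minimal sketch of the same rules as one function; `format_pval` is a hypothetical helper name, and it is meant to return the same strings as the inline code for 0 ≤ P ≤ 1.  

```python
from decimal import Decimal, ROUND_HALF_UP

def format_pval(pval):
    """Format a P-value as in the annotation code (matplotlib mathtext for small values)."""
    if pval == 1.0:
        return '1.0'
    if pval == 0.0:
        return '0.0'
    #Scientific notation with extra digits, then round the significand half-up
    sci = f'{Decimal(str(pval)):.3E}'
    significand, exponent = sci.split('E')
    exponent = -int(exponent)#Magnitude of the negative exponent
    significand = Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
    if significand == Decimal('10.0'):#e.g., 9.96E-5 -> 1.0E-4
        significand = Decimal('1.0')
        exponent -= 1
    significand = str(significand)
    if exponent > 2:
        return significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)
    elif exponent > 0:
        return '0.'+'0'*(exponent-1)+significand.replace('.', '')
    else:
        return significand

for p in [0.5, 0.0345, 9.96e-5, 1.2e-10]:
    print(p, '->', format_pval(p))
```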

### 2-4. Clean beta-coefficient dataframe

In [None]:
tempD1 = {'Female':metBMI_F_bcoefs, 'Male':metBMI_M_bcoefs, 'BothSex':metBMI_B_bcoefs}
tempD2 = {'Female':metBMI_F_intercept, 'Male':metBMI_M_intercept, 'BothSex':metBMI_B_intercept}
yvar_model = 'MetBMI'
model_method = 'ElasticNet'
tempD3 = {}
for sex in tempD1.keys():
    #Combine variables and intercept
    tempDF1 = tempD1[sex]
    tempDF2 = tempD2[sex]
    tempDF = pd.concat([tempDF1, tempDF2], axis=0)
    #Summarize
    tempL1 = []
    tempL2 = []
    tempL3 = []
    for row_n in tempDF.index.tolist():
        tempL1.append(tempDF.loc[row_n].mean())
        tempL2.append(tempDF.loc[row_n].std(ddof=1))#Sample standard deviation
        tempL3.append((tempDF.loc[row_n]==0.0).astype('int64').sum())
    tempDF['Mean'] = tempL1
    tempDF['SD'] = tempL2
    tempDF['nZeros'] = tempL3
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
    fileName = yvar_model+'-'+sex+'-'+model_method+'bcoefs.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    #Check
    tempDF2 = tempDF.loc[tempDF.index.isin(['Intercept'])]#Retrieve as pd.DataFrame
    tempDF = tempDF.drop(index=['Intercept'])
    print(sex+':')
    print(' - Variables:', len(tempDF))
    tempDF1 = tempDF.loc[tempDF['nZeros']!=10]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient in all 10 models:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF1.sort_values(by='Mean', ascending=False)
    display(tempDF1)
    display(tempDF2)#Intercept
    print('')
    
    tempD3[sex] = tempDF
#Update for using R2 transition analysis
metBMI_F_bcoefs = tempD3['Female']
metBMI_M_bcoefs = tempD3['Male']
metBMI_B_bcoefs = tempD3['BothSex']
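
> The row-by-row loop above can also be written with row-wise pandas reductions. A minimal sketch, assuming a toy coefficient table (`toyDF`, hypothetical values). The summaries are computed from the model columns only, so the new `Mean`, `SD`, and `nZeros` columns never feed into each other.  

```python
import numpy as np
import pandas as pd

#Toy beta-coefficients (hypothetical): 3 variables x 3 models
toyDF = pd.DataFrame({'Model_01': [0.2, 0.0, -0.1],
                      'Model_02': [0.3, 0.0, 0.0],
                      'Model_03': [0.1, 0.0, -0.2]},
                     index=pd.Index(['varA', 'varB', 'varC'], name='Variable'))

model_cols = toyDF.columns.tolist()#Fix the model columns before adding summaries
summaryDF = toyDF.copy()
summaryDF['Mean'] = toyDF[model_cols].mean(axis=1)
summaryDF['SD'] = toyDF[model_cols].std(axis=1, ddof=1)#Sample standard deviation
summaryDF['nZeros'] = (toyDF[model_cols] == 0.0).sum(axis=1)
print(summaryDF)
```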

## 3. Proteomics

### 3-1. Standardization

> Note that all preprocessing, including outlier elimination, missingness filtering, imputation, and standardization, should ideally be performed within each cross-validation fold, not across the whole dataset, to minimize potential data leakage. In this study, however, an external dataset is used for “validation” of the findings (i.e., as a testing set for the fitted models). Hence, the robustness of preprocessing is prioritized.  

In [None]:
tempDF = protDF
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempD2 = {}
for sex in tempD1.keys():
    tempDF1 = tempD1[sex]
    tempDF1 = tempDF.loc[tempDF1.index.tolist()]
    #Z-score transformation
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    tempA = scaler.fit_transform(tempDF1)#Column direction
    tempDF1 = pd.DataFrame(data=tempA, index=tempDF1.index, columns=tempDF1.columns)
    tempD2[sex] = tempDF1
    
    #Confirmation
    print(sex+':', tempDF1.shape)
    display(tempDF1.describe())
    sns.set(style='ticks', font='Arial', context='notebook')
    plt.figure(figsize=(4, 3))
    for col_i in range(3):
        sns.distplot(tempDF1.iloc[:, col_i], label=tempDF1.columns[col_i])
    sns.despine()
    plt.xlabel(r'$Z$'+'-score')
    plt.ylabel('Density')
    plt.legend(bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1)
    plt.show()
    print('')

protDF_F = tempD2['Female']
protDF_M = tempD2['Male']
protDF_B = tempD2['BothSex']

### 3-2. Elastic net with cross-validation

In [None]:
#Female model
tempDF1 = protDF_F#Standardized independent variables
tempDF2 = bmiDF_F#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseProtBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model).astype('float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-Female.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
protBMI_F_bcoefs = bcoefDF
protBMI_F_intercept = interceptDF
protBMI_F_R2 = scoreL
protBMI_F = tempDF

In [None]:
#Male model
tempDF1 = protDF_M#Standardized independent variables
tempDF2 = bmiDF_M#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseProtBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model).astype('float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-Male.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
protBMI_M_bcoefs = bcoefDF
protBMI_M_intercept = interceptDF
protBMI_M_R2 = scoreL
protBMI_M = tempDF

In [None]:
#Both sex model
tempDF1 = protDF_B#Standardized independent variables
tempDF2 = bmiDF_B#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseProtBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model).astype('float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-BothSex.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
protBMI_B_bcoefs = bcoefDF
protBMI_B_intercept = interceptDF
protBMI_B_R2 = scoreL
protBMI_B = tempDF
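
> The Female, Male, and Both sex cells above differ only in their inputs and output names, so the shared loop could be wrapped in one function. The block below is a minimal sketch of that idea, not the code used for the paper: `fit_holdout_models`, the toy data, and the `SampleID` index name are hypothetical, and the `ElasticNetCV` settings other than `l1_ratio` and `cv` are left at the library defaults for brevity.  

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV


def fit_holdout_models(xDF, yS, testS, ncvs=10):
    """Fit one ElasticNetCV per hold-out label and predict only the held-out rows.

    Hypothetical helper: xDF = standardized predictors, yS = dependent variable,
    testS = label of the model for which each sample is the testing (hold-out) set.
    """
    bcoefD, interceptD, scoreL, predictL = {}, {}, [], []
    for model_n in sorted(testS.unique()):
        is_test = (testS == model_n).to_numpy()
        model = ElasticNetCV(l1_ratio=0.5, cv=ncvs)
        #Internal cross-validation uses the training rows only
        model.fit(xDF.loc[~is_test], yS.loc[~is_test])
        bcoefD[model_n] = model.coef_
        interceptD[model_n] = model.intercept_
        #Out-of-sample R2 and predictions for the hold-out rows
        scoreL.append(model.score(xDF.loc[is_test], yS.loc[is_test]))
        predictL.append(pd.Series(model.predict(xDF.loc[is_test]),
                                  index=xDF.index[is_test]))
    bcoefDF = pd.DataFrame(bcoefD, index=xDF.columns)
    interceptDF = pd.DataFrame(interceptD, index=['Intercept'])
    predictS = pd.concat(predictL, axis=0).reindex(xDF.index)
    return bcoefDF, interceptDF, scoreL, predictS


#Toy data (synthetic, for illustration only)
rng = np.random.default_rng(0)
idx = pd.Index(['ID_'+str(i).zfill(3) for i in range(100)], name='SampleID')
xDF = pd.DataFrame(rng.normal(size=(100, 5)), index=idx,
                   columns=['Var'+str(i+1) for i in range(5)])
yS = pd.Series(3.3 + 0.1*xDF['Var1'] + rng.normal(scale=0.05, size=100),
               name='log_BaseBMI')
testS = pd.Series(['Model_'+str(i%5+1).zfill(2) for i in range(100)],
                  index=idx, name='Testing')

bcoefDF, interceptDF, scoreL, predictS = fit_holdout_models(xDF, yS, testS, ncvs=5)
print('Out-of-sample R2 [Mean ± SD]:', np.mean(scoreL), '±', np.std(scoreL, ddof=1))
```

> Collecting the per-model predictions in a list and concatenating once avoids starting from an empty `pd.Series`, and every sample receives exactly one prediction, from the model that did not see it during training.  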

### 3-3. Prediction accuracy

In [None]:
#Summary
tempD1 = {'Female':protBMI_F_R2, 'Male':protBMI_M_R2, 'Both sex':protBMI_B_R2}
tempD2 = {'Female':protBMI_F, 'Male':protBMI_M, 'Both sex':protBMI_B}
yvar = 'BaseBMI'
yvar_model = 'BaseProtBMI'

for sex in tempD1.keys():
    tempL = tempD1[sex]
    print(sex+' model')
    print(' - Out-of-sample R2 [Mean ± SD]:',
          np.mean(tempL), '±', np.std(tempL, ddof=1))#Sample standard deviation
    print(' - Out-of-sample R2 [Mean ± SEM]:',
          np.mean(tempL), '±', np.std(tempL, ddof=1)/np.sqrt(len(tempL)))
    tempDF = tempD2[sex]
    print(' - Observed vs. predicted log_'+yvar+': (Pearson\'s r, P) =',
          stats.pearsonr(tempDF['log_'+yvar], tempDF['log_'+yvar_model]))
    print(' - Observed vs. predicted '+yvar+': (Pearson\'s r, P) =',
          stats.pearsonr(tempDF[yvar], tempDF[yvar_model]))
    display(tempDF.describe())

> For now, check the difference between the sex-specific and sex-mixed models. Note that this is only a rough comparison because the sample sizes differ.  

In [None]:
#Prepare DF
tempDF = pd.DataFrame({'Female':protBMI_F_R2, 'Male':protBMI_M_R2, 'Both sex':protBMI_B_R2})
tempDF = tempDF.melt(var_name='Cohort', value_name='R2', value_vars=tempDF.columns.tolist())

#Plot
sns.set(style='ticks', font='Arial', context='notebook')
plt.figure(figsize=(3, 1))
sns.barplot(data=tempDF, y='Cohort', x='R2', hue='Cohort', palette='Set1', dodge=False, edgecolor='black',
            ci=95, capsize=0.4, errwidth=1.5, errcolor='black')
sns.stripplot(data=tempDF, y='Cohort', x='R2', hue='Cohort', dodge=False, size=8, edgecolor='black',
              linewidth=1, alpha=0.4, palette={'Female':'gray', 'Male':'gray', 'Both sex':'gray'})
sns.despine()
plt.xlabel('Out-of-sample '+r'$R^2$'+' in 10 elastic net models\n[mean with 95% CI]')
plt.ylabel('')
plt.legend('', frameon=False)
plt.show()

In [None]:
tempD1 = {'Female':protBMI_F, 'Male':protBMI_M, 'Both sex':protBMI_B}
yvar = 'BaseBMI'
yvar_model = 'BaseProtBMI'
range_min = np.min([df[var].min() for df in tempD1.values() for var in [yvar, yvar_model]])
range_max = np.max([df[var].max() for df in tempD1.values() for var in [yvar, yvar_model]])
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

#Plot measured vs. predicted per model
tempD2 = {'Female':'tab:red', 'Male':'tab:blue', 'Both sex':'tab:green'}
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(10, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    tempDF = tempD1[sex]
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x=yvar_model, y=yvar, color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Measured '+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF[yvar_model], tempDF[yvar])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==2:
        ax2_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax2_pos[0]+ax2_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Predicted '+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

#Plot difference b/w sex-specific and sex-mixed models
tempD2 = {'Female':'tab:red', 'Male':'tab:blue'}
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(6.5, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    #Prepare DF
    tempS1 = tempD1[sex][yvar_model].rename('Sex-specific')#Renamed copy; the source DF is untouched
    tempS2 = tempD1['Both sex'][yvar_model].rename('Sex-mixed')
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='inner')
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Sex-specific', y='Sex-mixed', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Sex-mixed b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Sex-specific'], tempDF['Sex-mixed'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==1:
        ax1_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax1_pos[0]+ax1_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Sex-specific b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()
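
> The P-value annotation logic is repeated in both plotting loops above. As a minimal sketch, it could be factored into one helper; `format_pval` is a hypothetical name, and the function is meant to reproduce the same branches (plain decimals down to the order of 10^-2, mathtext scientific notation below that) rather than to change the output.  

```python
from decimal import Decimal, ROUND_HALF_UP


def format_pval(pval):
    """Return the annotation text for a P-value (hypothetical helper)."""
    if pval == 1.0:
        return '1.0'
    if pval == 0.0:
        return '0.0'
    #Take more digits first, then round the significand half-up to one decimal place
    sci_text = f'{Decimal(str(pval)):.3E}'
    significand, exponent = sci_text.split(sep='E-')
    significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
    exponent = int(exponent)
    if significand == '10.0':
        #Rounding carried over (e.g., 9.96 -> 10.0), so shift by one order of magnitude
        significand = '1.0'
        exponent -= 1
    if exponent > 2:
        return significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)
    elif exponent > 0:
        return '0.'+'0'*(exponent-1)+significand.replace('.', '')
    return significand


for p in [1.0, 0.5, 0.0345, 9.96e-5, 2.5e-10]:
    print(p, '->', format_pval(p))
```

> Like the inline version, this assumes 0 < P < 1 in the last branch, because the split on `'E-'` relies on a negative exponent.  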

### 3-4. Clean beta-coefficient dataframe

In [None]:
tempD1 = {'Female':protBMI_F_bcoefs, 'Male':protBMI_M_bcoefs, 'BothSex':protBMI_B_bcoefs}
tempD2 = {'Female':protBMI_F_intercept, 'Male':protBMI_M_intercept, 'BothSex':protBMI_B_intercept}
yvar_model = 'ProtBMI'
model_method = 'ElasticNet'
tempD3 = {}
for sex in tempD1.keys():
    #Combine variables and intercept
    tempDF1 = tempD1[sex]
    tempDF2 = tempD2[sex]
    tempDF = pd.concat([tempDF1, tempDF2], axis=0)
    #Summarize
    tempL1 = []
    tempL2 = []
    tempL3 = []
    for row_n in tempDF.index.tolist():
        tempL1.append(tempDF.loc[row_n].mean())
        tempL2.append(tempDF.loc[row_n].std(ddof=1))#Sample standard deviation
        tempL3.append((tempDF.loc[row_n]==0.0).astype('int64').sum())
    tempDF['Mean'] = tempL1
    tempDF['SD'] = tempL2
    tempDF['nZeros'] = tempL3
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
    fileName = yvar_model+'-'+sex+'-'+model_method+'bcoefs.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    #Check
    tempDF2 = tempDF.loc[tempDF.index.isin(['Intercept'])]#Retrieve as pd.DataFrame
    tempDF = tempDF.drop(index=['Intercept'])
    print(sex+':')
    print(' - Variables:', len(tempDF))
    tempDF1 = tempDF.loc[tempDF['nZeros']!=10]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient in all 10 models:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF1.sort_values(by='Mean', ascending=False)
    display(tempDF1)
    display(tempDF2)#Intercept
    print('')
    
    tempD3[sex] = tempDF
#Update for use in the R2 transition analysis
protBMI_F_bcoefs = tempD3['Female']
protBMI_M_bcoefs = tempD3['Male']
protBMI_B_bcoefs = tempD3['BothSex']
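
> The row-wise loop in the cell above computes the mean, SD, and number of zeros per variable. A minimal sketch of the same summary with vectorized pandas calls is shown below; the toy coefficient table and the names `toyDF`, `model_cols`, and `summaryDF` are assumptions for illustration.  

```python
import numpy as np
import pandas as pd

#Toy coefficient table: rows = variables, columns = models (hypothetical values)
toyDF = pd.DataFrame({'Model_01': [0.10, 0.00, -0.20],
                      'Model_02': [0.30, 0.00, 0.00],
                      'Model_03': [0.20, 0.00, -0.40]},
                     index=pd.Index(['VarA', 'VarB', 'VarC'], name='Variable'))

#Fix the model columns first so that the summary columns never feed into each other
model_cols = toyDF.columns.tolist()
summaryDF = toyDF.copy()
summaryDF['Mean'] = toyDF[model_cols].mean(axis=1)
summaryDF['SD'] = toyDF[model_cols].std(axis=1, ddof=1)#Sample standard deviation
summaryDF['nZeros'] = (toyDF[model_cols] == 0.0).sum(axis=1)
print(summaryDF)
```

> Here `VarB` is zero in every model (`nZeros` equals the number of models), which corresponds to a variable that the elastic net never retained.  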

## 4. Clinical labs

### 4-1. Standardization

> Note that all preprocessing, including outlier elimination, missingness filtering, imputation, and standardization, should be performed within each cross-validation fold, not across the whole dataset, to minimize potential data leakage. In this study, however, an external dataset is used for “validation” of the findings (i.e., as a testing set for the fitted models). Hence, the robustness of preprocessing is prioritized.  
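
> For reference, the in-fold alternative described in the note could look like the sketch below. This is not what this notebook does: the data are synthetic, and the names `X`, `y`, `outer_cv`, and `r2_scores` as well as the `random_state` values are assumptions for illustration. scikit-learn's `make_pipeline` refits `StandardScaler` on the training rows of every outer split, so the hold-out rows never contribute to the scaling parameters.  

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#Synthetic stand-in for an omics matrix (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 20))
y = 0.5*X[:, 0] - 0.3*X[:, 1] + rng.normal(scale=0.5, size=200)

outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)
r2_scores = []
for train_idx, test_idx in outer_cv.split(X):
    #The scaler is fitted on the training rows only, then applied to the hold-out rows
    pipe = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio=0.5, cv=5))
    pipe.fit(X[train_idx], y[train_idx])
    r2_scores.append(pipe.score(X[test_idx], y[test_idx]))

print('Out-of-sample R2 [Mean ± SD]:', np.mean(r2_scores), '±', np.std(r2_scores, ddof=1))
```

> In this sketch, the internal folds that `ElasticNetCV` uses to choose alpha still share the scaler fitted on the whole outer training set; only the outer hold-out evaluation is fully separated from the scaling step.  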

In [None]:
tempDF = chemDF
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempD2 = {}
for sex in tempD1.keys():
    tempDF1 = tempD1[sex]
    tempDF1 = tempDF.loc[tempDF1.index.tolist()]
    #Z-score transformation
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    tempA = scaler.fit_transform(tempDF1)#Column direction
    tempDF1 = pd.DataFrame(data=tempA, index=tempDF1.index, columns=tempDF1.columns)
    tempD2[sex] = tempDF1
    
    #Confirmation
    print(sex+':', tempDF1.shape)
    display(tempDF1.describe())
    sns.set(style='ticks', font='Arial', context='notebook')
    plt.figure(figsize=(4, 3))
    for col_i in range(3):
        sns.distplot(tempDF1.iloc[:, col_i], label=tempDF1.columns[col_i])
    sns.despine()
    plt.xlabel(r'$Z$'+'-score')
    plt.ylabel('Density')
    plt.legend(bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1)
    plt.show()
    print('')

chemDF_F = tempD2['Female']
chemDF_M = tempD2['Male']
chemDF_B = tempD2['BothSex']

### 4-2. Elastic net with cross-validation

In [None]:
#Female model
tempDF1 = chemDF_F#Standardized independent variables
tempDF2 = bmiDF_F#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseChemBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model).astype('float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-Female.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
chemBMI_F_bcoefs = bcoefDF
chemBMI_F_intercept = interceptDF
chemBMI_F_R2 = scoreL
chemBMI_F = tempDF

In [None]:
#Male model
tempDF1 = chemDF_M#Standardized independent variables
tempDF2 = bmiDF_M#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseChemBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model).astype('float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-Male.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
chemBMI_M_bcoefs = bcoefDF
chemBMI_M_intercept = interceptDF
chemBMI_M_R2 = scoreL
chemBMI_M = tempDF

In [None]:
#Both sex model
tempDF1 = chemDF_B#Standardized independent variables
tempDF2 = bmiDF_B#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseChemBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model).astype('float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-BothSex.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
chemBMI_B_bcoefs = bcoefDF
chemBMI_B_intercept = interceptDF
chemBMI_B_R2 = scoreL
chemBMI_B = tempDF

### 4-3. Prediction accuracy

In [None]:
#Summary
tempD1 = {'Female':chemBMI_F_R2, 'Male':chemBMI_M_R2, 'Both sex':chemBMI_B_R2}
tempD2 = {'Female':chemBMI_F, 'Male':chemBMI_M, 'Both sex':chemBMI_B}
yvar = 'BaseBMI'
yvar_model = 'BaseChemBMI'

for sex in tempD1.keys():
    tempL = tempD1[sex]
    print(sex+' model')
    print(' - Out-of-sample R2 [Mean ± SD]:',
          np.mean(tempL), '±', np.std(tempL, ddof=1))#Sample standard deviation
    print(' - Out-of-sample R2 [Mean ± SEM]:',
          np.mean(tempL), '±', np.std(tempL, ddof=1)/np.sqrt(len(tempL)))
    tempDF = tempD2[sex]
    print(' - Observed vs. predicted log_'+yvar+': (Pearson\'s r, P) =',
          stats.pearsonr(tempDF['log_'+yvar], tempDF['log_'+yvar_model]))
    print(' - Observed vs. predicted '+yvar+': (Pearson\'s r, P) =',
          stats.pearsonr(tempDF[yvar], tempDF[yvar_model]))
    display(tempDF.describe())

> For now, check the difference between the sex-specific and sex-mixed models. Note that this is only a rough comparison because the sample sizes differ.  

In [None]:
#Prepare DF
tempDF = pd.DataFrame({'Female':chemBMI_F_R2, 'Male':chemBMI_M_R2, 'Both sex':chemBMI_B_R2})
tempDF = tempDF.melt(var_name='Cohort', value_name='R2', value_vars=tempDF.columns.tolist())

#Plot
sns.set(style='ticks', font='Arial', context='notebook')
plt.figure(figsize=(3, 1))
sns.barplot(data=tempDF, y='Cohort', x='R2', hue='Cohort', palette='Set1', dodge=False, edgecolor='black',
            ci=95, capsize=0.4, errwidth=1.5, errcolor='black')
sns.stripplot(data=tempDF, y='Cohort', x='R2', hue='Cohort', dodge=False, size=8, edgecolor='black',
              linewidth=1, alpha=0.4, palette={'Female':'gray', 'Male':'gray', 'Both sex':'gray'})
sns.despine()
plt.xlabel('Out-of-sample '+r'$R^2$'+' in 10 elastic net models\n[mean with 95% CI]')
plt.ylabel('')
plt.legend('', frameon=False)
plt.show()

In [None]:
tempD1 = {'Female':chemBMI_F, 'Male':chemBMI_M, 'Both sex':chemBMI_B}
yvar = 'BaseBMI'
yvar_model = 'BaseChemBMI'
range_min = np.min([df[var].min() for df in tempD1.values() for var in [yvar, yvar_model]])
range_max = np.max([df[var].max() for df in tempD1.values() for var in [yvar, yvar_model]])
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

#Plot measured vs. predicted per model
tempD2 = {'Female':'tab:red', 'Male':'tab:blue', 'Both sex':'tab:green'}
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(10, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    tempDF = tempD1[sex]
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x=yvar_model, y=yvar, color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Measured '+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF[yvar_model], tempDF[yvar])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==2:
        ax2_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax2_pos[0]+ax2_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Predicted '+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

#Plot difference b/w sex-specific and sex-mixed models
tempD2 = {'Female':'tab:red', 'Male':'tab:blue'}
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(6.5, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    #Prepare DF
    tempS1 = tempD1[sex][yvar_model].rename('Sex-specific')#Renamed copy; the source DF is untouched
    tempS2 = tempD1['Both sex'][yvar_model].rename('Sex-mixed')
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='inner')
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Sex-specific', y='Sex-mixed', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Sex-mixed b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Sex-specific'], tempDF['Sex-mixed'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==1:
        ax1_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax1_pos[0]+ax1_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Sex-specific b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

### 4-4. Clean beta-coefficient dataframe

In [None]:
tempD1 = {'Female':chemBMI_F_bcoefs, 'Male':chemBMI_M_bcoefs, 'BothSex':chemBMI_B_bcoefs}
tempD2 = {'Female':chemBMI_F_intercept, 'Male':chemBMI_M_intercept, 'BothSex':chemBMI_B_intercept}
yvar_model = 'ChemBMI'
model_method = 'ElasticNet'
tempD3 = {}
for sex in tempD1.keys():
    #Combine variables and intercept
    tempDF1 = tempD1[sex]
    tempDF2 = tempD2[sex]
    tempDF = pd.concat([tempDF1, tempDF2], axis=0)
    #Summarize
    tempL = tempDF.columns.tolist()#Model columns only (before adding the summary columns)
    tempDF['Mean'] = tempDF[tempL].mean(axis=1)
    tempDF['SD'] = tempDF[tempL].std(axis=1, ddof=1)#Sample standard deviation
    tempDF['nZeros'] = (tempDF[tempL]==0.0).sum(axis=1)#Number of models with zero beta-coefficient
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
    fileName = yvar_model+'-'+sex+'-'+model_method+'bcoefs.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    #Check
    tempDF2 = tempDF.loc[tempDF.index.isin(['Intercept'])]#Retrieve as pd.DataFrame
    tempDF = tempDF.drop(index=['Intercept'])
    print(sex+':')
    print(' - Variables:', len(tempDF))
    tempDF1 = tempDF.loc[tempDF['nZeros']!=10]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient in all 10 models:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF1.sort_values(by='Mean', ascending=False)
    display(tempDF1)
    display(tempDF2)#Intercept
    print('')
    
    tempD3[sex] = tempDF
#Update for using R2 transition analysis
chemBMI_F_bcoefs = tempD3['Female']
chemBMI_M_bcoefs = tempD3['Male']
chemBMI_B_bcoefs = tempD3['BothSex']

## 5. Metabolomics, Proteomics, and Clinical labs-combined omics

### 5-1. Standardization

> Note that all preprocessing, including outlier elimination, missingness filtering, imputation, and standardization, should ideally be performed within each cross-validation fold, not across the whole dataset, to minimize potential data leakage. In this study, however, an external dataset is used for “validation” of the findings (i.e., a testing set for the fitted models). Hence, the robustness of preprocessing is prioritized.  
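
As a minimal sketch of the within-fold alternative mentioned in the note (not what this notebook does), the scaling parameters can be estimated from the training rows only and then applied to the hold-out rows. The data, the split, and the helper name `zscore_within_fold` are hypothetical; scikit-learn's `StandardScaler` offers the same separation through `fit` on the training set and `transform` on the testing set.

```python
import numpy as np

def zscore_within_fold(x_train, x_test):
    """Z-score both sets using the mean and SD of the training set only."""
    mean = x_train.mean(axis=0)
    sd = x_train.std(axis=0, ddof=0)  # Population SD, as StandardScaler uses
    return (x_train - mean) / sd, (x_test - mean) / sd

# Hypothetical data: 100 samples x 3 analytes, the last 10 samples held out
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
x_train, x_test = x[:90], x[90:]

z_train, z_test = zscore_within_fold(x_train, x_test)
print(z_train.mean(axis=0).round(6), z_train.std(axis=0).round(6))
```

With this arrangement, the hold-out rows never contribute to the mean or SD used for scaling, so their z-scores are generally not centered exactly at zero.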

In [None]:
tempDF = combiDF
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempD2 = {}
for sex in tempD1.keys():
    tempDF1 = tempD1[sex]
    tempDF1 = tempDF.loc[tempDF1.index.tolist()]
    #Z-score transformation
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    tempA = scaler.fit_transform(tempDF1)#Column direction
    tempDF1 = pd.DataFrame(data=tempA, index=tempDF1.index, columns=tempDF1.columns)
    tempD2[sex] = tempDF1
    
    #Confirmation
    print(sex+':', tempDF1.shape)
    display(tempDF1.describe())
    sns.set(style='ticks', font='Arial', context='notebook')
    plt.figure(figsize=(4, 3))
    for col_i in range(3):
        sns.distplot(tempDF1.iloc[:, col_i], label=tempDF1.columns[col_i])
    sns.despine()
    plt.xlabel(r'$Z$'+'-score')
    plt.ylabel('Density')
    plt.legend(bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1)
    plt.show()
    print('')

combiDF_F = tempD2['Female']
combiDF_M = tempD2['Male']
combiDF_B = tempD2['BothSex']

### 5-2. Elastic net with cross-validation

In [None]:
#Female model
tempDF1 = combiDF_F#Standardized independent variables
tempDF2 = bmiDF_F#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.01, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseCombiBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model, dtype='float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-Female.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
combiBMI_F_bcoefs = bcoefDF
combiBMI_F_intercept = interceptDF
combiBMI_F_R2 = scoreL
combiBMI_F = tempDF

In [None]:
#Male model
tempDF1 = combiDF_M#Standardized independent variables
tempDF2 = bmiDF_M#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.01, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseCombiBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model, dtype='float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-Male.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
combiBMI_M_bcoefs = bcoefDF
combiBMI_M_intercept = interceptDF
combiBMI_M_R2 = scoreL
combiBMI_M = tempDF

In [None]:
#Both sex model
tempDF1 = combiDF_B#Standardized independent variables
tempDF2 = bmiDF_B#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.01, n_alphas=200, alphas=None, fit_intercept=True,
                     normalize=False, precompute='auto', cv=ncvs)
yvar_model = 'BaseCombiBMI'

#Perform elastic net
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model, dtype='float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
    model.fit(xDF_train, yDF_train)
    #Check the penalization amount decided by cross validation
    print(model_n+': alpha =', model.alpha_)
    #Save parameters
    bcoefDF[model_n] = model.coef_#w in the cost function formula
    interceptDF[model_n] = model.intercept_#Independent term in decision function
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of '+str(ncvs)+'-fold CV elastic net:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-BothSex.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
combiBMI_B_bcoefs = bcoefDF
combiBMI_B_intercept = interceptDF
combiBMI_B_R2 = scoreL
combiBMI_B = tempDF

### 5-3. Prediction accuracy

In [None]:
#Summary
tempD1 = {'Female':combiBMI_F_R2, 'Male':combiBMI_M_R2, 'Both sex':combiBMI_B_R2}
tempD2 = {'Female':combiBMI_F, 'Male':combiBMI_M, 'Both sex':combiBMI_B}
yvar = 'BaseBMI'
yvar_model = 'BaseCombiBMI'

for sex in tempD1.keys():
    tempL = tempD1[sex]
    print(sex+' model')
    print(' - Out-of-sample R2 [Mean ± SD]:',
          np.mean(tempL), '±', np.std(tempL, ddof=1))#Sample standard deviation
    print(' - Out-of-sample R2 [Mean ± SEM]:',
          np.mean(tempL), '±', np.std(tempL, ddof=1)/np.sqrt(len(tempL)))
    tempDF = tempD2[sex]
    print(' - Observed vs. predicted log_'+yvar+': (Pearson\'s r, P) =',
          stats.pearsonr(tempDF['log_'+yvar], tempDF['log_'+yvar_model]))
    print(' - Observed vs. predicted '+yvar+': (Pearson\'s r, P) =',
          stats.pearsonr(tempDF[yvar], tempDF[yvar_model]))
    display(tempDF.describe())

> Check the difference between the sex-specific and sex-mixed models for now. Note that this is a rough comparison because the sample sizes differ.  

In [None]:
#Prepare DF
tempDF = pd.DataFrame({'Female':combiBMI_F_R2, 'Male':combiBMI_M_R2, 'Both sex':combiBMI_B_R2})
tempDF = tempDF.melt(var_name='Cohort', value_name='R2', value_vars=tempDF.columns.tolist())

#Plot
sns.set(style='ticks', font='Arial', context='notebook')
plt.figure(figsize=(3, 1))
sns.barplot(data=tempDF, y='Cohort', x='R2', hue='Cohort', palette='Set1', dodge=False, edgecolor='black',
            ci=95, capsize=0.4, errwidth=1.5, errcolor='black')
sns.stripplot(data=tempDF, y='Cohort', x='R2', hue='Cohort', dodge=False, size=8, edgecolor='black',
              linewidth=1, alpha=0.4, palette={'Female':'gray', 'Male':'gray', 'Both sex':'gray'})
sns.despine()
plt.xlabel('Out-of-sample '+r'$R^2$'+' in 10 elastic net models\n[mean with 95% CI]')
plt.ylabel('')
plt.legend('', frameon=False)
plt.show()

In [None]:
tempD1 = {'Female':combiBMI_F, 'Male':combiBMI_M, 'Both sex':combiBMI_B}
yvar = 'BaseBMI'
yvar_model = 'BaseCombiBMI'
range_min = np.min([df[var].min() for df in tempD1.values() for var in [yvar, yvar_model]])
range_max = np.max([df[var].max() for df in tempD1.values() for var in [yvar, yvar_model]])
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

#Plot measured vs. predicted per model
tempD2 = {'Female':'tab:red', 'Male':'tab:blue', 'Both sex':'tab:green'}
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(10, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    tempDF = tempD1[sex]
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x=yvar_model, y=yvar, color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Measured '+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF[yvar_model], tempDF[yvar])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==2:
        ax2_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax2_pos[0]+ax2_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Predicted '+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

#Plot difference b/w sex-specific and sex-mixed models
tempD2 = {'Female':'tab:red', 'Male':'tab:blue'}
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(6.5, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    #Prepare DF
    tempS1 = tempD1[sex][yvar_model].rename('Sex-specific')#Renamed copy; the source DF is not modified
    tempS2 = tempD1['Both sex'][yvar_model].rename('Sex-mixed')
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='inner')
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Sex-specific', y='Sex-mixed', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Sex-mixed b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Sex-specific'], tempDF['Sex-mixed'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==1:
        ax1_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax1_pos[0]+ax1_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Sex-specific b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

### 5-4. Clean beta-coefficient dataframe

In [None]:
tempD1 = {'Female':combiBMI_F_bcoefs, 'Male':combiBMI_M_bcoefs, 'BothSex':combiBMI_B_bcoefs}
tempD2 = {'Female':combiBMI_F_intercept, 'Male':combiBMI_M_intercept, 'BothSex':combiBMI_B_intercept}
yvar_model = 'CombiBMI'
model_method = 'ElasticNet'
tempD3 = {}
for sex in tempD1.keys():
    #Combine variables and intercept
    tempDF1 = tempD1[sex]
    tempDF2 = tempD2[sex]
    tempDF = pd.concat([tempDF1, tempDF2], axis=0)
    #Summarize
    tempL = tempDF.columns.tolist()#Model columns only (before adding the summary columns)
    tempDF['Mean'] = tempDF[tempL].mean(axis=1)
    tempDF['SD'] = tempDF[tempL].std(axis=1, ddof=1)#Sample standard deviation
    tempDF['nZeros'] = (tempDF[tempL]==0.0).sum(axis=1)#Number of models with zero beta-coefficient
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
    fileName = yvar_model+'-'+sex+'-'+model_method+'bcoefs.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    #Check
    tempDF2 = tempDF.loc[tempDF.index.isin(['Intercept'])]#Retrieve as pd.DataFrame
    tempDF = tempDF.drop(index=['Intercept'])
    print(sex+':')
    print(' - Variables:', len(tempDF))
    tempDF1 = tempDF.loc[tempDF['nZeros']!=10]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient in all 10 models:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF1.sort_values(by='Mean', ascending=False)
    display(tempDF1)
    display(tempDF2)#Intercept
    print('')
    
    tempD3[sex] = tempDF
#Update for using R2 transition analysis
combiBMI_F_bcoefs = tempD3['Female']
combiBMI_M_bcoefs = tempD3['Male']
combiBMI_B_bcoefs = tempD3['BothSex']

## 6. Standard clinical measures

> A model predicting BMI is generated using ordinary least squares (OLS) linear regression with sex, age, HDL-cholesterol, LDL-cholesterol, triglycerides (TG), glucose, insulin, and HOMA-IR. To compare directly with the elastic net models, the regression analysis is applied to each of the same 10 split sets.  
>
> Of note, a previous study (Cirulli, E.T. et al. Cell Metab. 2019) generated a “conventional model” from sex, age, HDL, LDL, TG, and total cholesterol, which explained 31% of the variance in BMI (probably the in-sample R2). However, their model doesn't include information about glucose, which is broadly used to assess metabolic health risks (Stefan, N. et al. Cell Metab. 2017). Also, total cholesterol was collinear with LDL and TG, at least in the Arivale cohort, strongly enough to yield negative beta-coefficients for them. HOMA-IR works as a kind of interaction term between glucose and insulin.  
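
To illustrate why HOMA-IR behaves like a glucose-by-insulin interaction term, the sketch below uses the commonly cited formula, fasting glucose [mg/dL] × fasting insulin [µU/mL] / 405. Whether the Arivale `HOMA-IR` column was derived with exactly this formula and these units is an assumption here, and the function name `homa_ir` and the input values are hypothetical.

```python
import numpy as np

def homa_ir(glucose_mg_dl, insulin_uU_ml):
    """HOMA-IR from fasting glucose [mg/dL] and fasting insulin [uU/mL]."""
    return np.asarray(glucose_mg_dl) * np.asarray(insulin_uU_ml) / 405.0

# Hypothetical values (not the Arivale data)
glucose = np.array([85.0, 95.0, 110.0])
insulin = np.array([5.0, 10.0, 20.0])
values = homa_ir(glucose, insulin)
print(values)

# On the log scale, the product becomes a sum of the two log-transformed terms
log_check = np.log(glucose) + np.log(insulin) - np.log(405.0)
```

Because the index is a product of the two measurements, a linear model containing glucose, insulin, and HOMA-IR resembles a model with two main effects plus their product term, up to the scaling constant.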

### 6-1. Prepare independent variables

In [None]:
tempDF1 = bmiDF
tempDF2 = chemDF

#Select the independent variables
tempD = {'Sex':'Sex', 'BaseAge':'Age'}
tempDF1 = tempDF1[tempD.keys()]
tempDF1 = tempDF1.rename(columns=tempD)
tempD = {'HDL CHOL DIRECT':'HDL-cholesterol',
         'LDL-CHOL CALCULATION':'LDL-cholesterol',
         'TRIGLYCERIDES':'Triglycerides',
         'GLUCOSE':'Glucose',
         'INSULIN':'Insulin',
         'HOMA-IR':'HOMA-IR'}
tempDF2 = tempDF2[list(tempD.keys())]
tempDF2 = tempDF2.rename(columns=tempD)
tempDF = pd.merge(tempDF1, tempDF2, left_index=True, right_index=True, how='inner')

display(tempDF.describe(include='all'))

standDF = tempDF

### 6-2. Standardization

> Note that all preprocessing, including outlier elimination, missingness filtering, imputation, and standardization, should ideally be performed within each cross-validation fold, not across the whole dataset, to minimize potential data leakage. In this study, however, an external dataset is used for “validation” of the findings (i.e., a testing set for the fitted models). Hence, the robustness of preprocessing is prioritized.  

In [None]:
tempDF = standDF
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempD2 = {}
for sex in tempD1.keys():
    tempDF1 = tempD1[sex]
    tempDF1 = tempDF.loc[tempDF1.index.tolist()]
    #Z-score transformation
    tempDF2 = tempDF1.select_dtypes(include=[np.number])
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    tempA = scaler.fit_transform(tempDF2)#Column direction
    tempDF2 = pd.DataFrame(data=tempA, index=tempDF2.index, columns=tempDF2.columns)
    ##Recover categorical covariates
    tempDF3 = tempDF1.select_dtypes(exclude=[np.number])
    tempDF1 = pd.merge(tempDF2, tempDF3, left_index=True, right_index=True, how='left')
    tempD2[sex] = tempDF1
    
    #Confirmation
    print(sex+':', tempDF1.shape)
    display(tempDF1.describe(include='all'))
    tempDF1 = tempDF1.select_dtypes(include=[np.number])
    sns.set(style='ticks', font='Arial', context='notebook')
    plt.figure(figsize=(4, 3))
    for col_i in range(len(tempDF1.columns)):
        sns.distplot(tempDF1.iloc[:, col_i], label=tempDF1.columns[col_i])
    sns.despine()
    plt.xlabel(r'$Z$'+'-score')
    plt.ylabel('Density')
    plt.legend(bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1)
    plt.show()
    print('')

standDF_F = tempD2['Female']
standDF_M = tempD2['Male']
standDF_B = tempD2['BothSex']

### 6-3. One-hot encoding for categorical variables

> While statsmodels automatically recognizes categorical variables, one-hot encoding is required in scikit-learn.  
> -> In this case, sex is the only categorical variable; hence, it is mapped manually. (cf. In many cases, category_encoders is more useful than sklearn.preprocessing or pandas.get_dummies.)  
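
As a small illustration of the manual mapping, the sketch below compares it with `pandas.get_dummies` on a hypothetical `Sex` column (not the Arivale data). With two categories and the first level dropped, both approaches give the same single 0/1 indicator.

```python
import pandas as pd

# Hypothetical example column
sexS = pd.Series(['F', 'M', 'M', 'F'], name='Sex')

# Manual mapping, as done in this notebook
mappedS = sexS.map({'F': 0, 'M': 1})

# Indicator from pandas, dropping the first level ('F')
dummyDF = pd.get_dummies(sexS, drop_first=True).astype('int64')

print(mappedS.tolist())
print(dummyDF['M'].tolist())
```

The manual mapping keeps the original column name (`Sex`), whereas `get_dummies` names the indicator after the retained level (`M`).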

In [None]:
tempD1 = {'Female':standDF_F, 'Male':standDF_M, 'BothSex':standDF_B}
tempD2 = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    #One-hot encoding for sex
    tempD = {'F':0, 'M':1}
    tempDF['Sex'] = tempDF['Sex'].map(tempD)
    tempD2[sex] = tempDF
#Update
standDF_F = tempD2['Female']
standDF_M = tempD2['Male']
standDF_B = tempD2['BothSex']

### 6-4. OLS linear regression

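> In scikit-learn, constant variables are automatically assigned a zero coefficient; hence, although sex is included in the model, there is no need to modify the code between the sex-specific and sex-mixed versions.  
>
> Note that, in contrast to ElasticNetCV, the current LinearRegression seems to produce 2D ndarrays (for both the coefficients and the predictions) when the target y is passed as a single-column DataFrame, even though y is effectively 1D; hence, they are flattened in the code.  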
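
The two remarks above can be checked on synthetic data. In this sketch (hypothetical data and variable names), a constant column receives a coefficient of essentially zero because it carries no information once the intercept is fitted, and a single-column 2D target yields 2D `coef_` and predictions, which is why `.flatten()` is applied in the OLS cells.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 50 samples, 2 informative variables and 1 constant column
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
X = np.column_stack([x, np.ones(50)])  # Third column is constant (like sex in a sex-specific cohort)
y = 1.5 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.1, size=50)

model_1d = LinearRegression(fit_intercept=True).fit(X, y)                 # 1D target
model_2d = LinearRegression(fit_intercept=True).fit(X, y.reshape(-1, 1))  # Single-column 2D target

print(model_1d.coef_.shape, model_2d.coef_.shape)
print(model_1d.coef_[2])  # Coefficient of the constant column
```

Both fits estimate the same coefficients; only the array shapes differ, so flattening the 2D output is a lossless way to store it column by column.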

In [None]:
#Female model
tempDF1 = standDF_F#Standardized independent variables
tempDF2 = bmiDF_F#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
model = LinearRegression(fit_intercept=True, normalize=False)
yvar_model = 'BaseStandBMI'

#Perform OLS linear regression
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model, dtype='float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model using training dataset
    model.fit(xDF_train, yDF_train)
    #Save parameters
    bcoefDF[model_n] = model.coef_.flatten()#Estimated coefficients for the linear regression problem
    interceptDF[model_n] = model.intercept_#Independent term in the linear model
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test).flatten(),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of OLS linear regression:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-Female.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
standBMI_F_bcoefs = bcoefDF
standBMI_F_intercept = interceptDF
standBMI_F_R2 = scoreL
standBMI_F = tempDF

In [None]:
#Male model
tempDF1 = standDF_M#Standardized independent variables
tempDF2 = bmiDF_M#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
model = LinearRegression(fit_intercept=True, normalize=False)
yvar_model = 'BaseStandBMI'

#Perform OLS linear regression
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model, dtype='float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model using training dataset
    model.fit(xDF_train, yDF_train)
    #Save parameters
    bcoefDF[model_n] = model.coef_.flatten()#Estimated coefficients for the linear regression problem
    interceptDF[model_n] = model.intercept_#Independent term in the linear model
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test).flatten(),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of OLS linear regression:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-Male.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
standBMI_M_bcoefs = bcoefDF
standBMI_M_intercept = interceptDF
standBMI_M_R2 = scoreL
standBMI_M = tempDF

In [None]:
#Both sex model
tempDF1 = standDF_B#Standardized independent variables
tempDF2 = bmiDF_B#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
model = LinearRegression(fit_intercept=True)#normalize=False was the default; the argument was removed in scikit-learn 1.2
yvar_model = 'BaseStandBMI'

#Perform OLS linear regression
bcoefDF = pd.DataFrame(index=tempDF1.columns).astype('float64')#For beta-coefficients
bcoefDF.index.rename('Variable', inplace=True)
interceptDF = pd.DataFrame(index=['Intercept']).astype('float64')#For intercept
interceptDF.index.rename('Variable', inplace=True)
scoreL = []#For the coefficient of determination R2
predictS = pd.Series(name='log_'+yvar_model).astype('float64')#For predictions
t_start = time.time()
for model_k in range(nmodels):
    #Prepare training and testing (hold-out) datasets in model k
    model_n = 'Model_'+str(model_k+1).zfill(2)
    yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
    yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
    xDF_train = tempDF1.loc[yDF_train.index.tolist()]
    xDF_test = tempDF1.loc[yDF_test.index.tolist()]
    #Retrieve the dependent variable
    yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
    yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
    #Fitting model using training dataset
    model.fit(xDF_train, yDF_train)
    #Save parameters
    bcoefDF[model_n] = model.coef_.flatten()#Estimated coefficients for the linear regression problem
    interceptDF[model_n] = model.intercept_#Independent term in the linear model
    #Evaluation with testing (hold-out) dataset
    scoreL.append(model.score(xDF_test, yDF_test))
    #Prediction for testing dataset using the fitted model k
    tempS = pd.Series(model.predict(xDF_test).flatten(),
                      index=xDF_test.index, name='log_'+yvar_model)
    predictS = pd.concat([predictS, tempS], axis=0)
t_elapsed = time.time() - t_start
print('Elapsed time for '+str(nmodels)+' models of OLS linear regression:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean prediction DF
tempDF = pd.merge(tempDF2[['Testing', 'log_'+yvar]], predictS,
                  left_index=True, right_index=True, how='left')
##Convert to original scale
tempDF[yvar] = np.e**tempDF['log_'+yvar]
tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
display(tempDF)

#Save the cleaned prediction DF
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-BothSex.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
standBMI_B_bcoefs = bcoefDF
standBMI_B_intercept = interceptDF
standBMI_B_R2 = scoreL
standBMI_B = tempDF

### 6-5. Prediction accuracy

In [None]:
#Summary
tempD1 = {'Female':standBMI_F_R2, 'Male':standBMI_M_R2, 'Both sex':standBMI_B_R2}
tempD2 = {'Female':standBMI_F, 'Male':standBMI_M, 'Both sex':standBMI_B}
yvar = 'BaseBMI'
yvar_model = 'BaseStandBMI'

for sex in tempD1.keys():
    tempL = tempD1[sex]
    print(sex+' model')
    print(' - Out-of-sample R2 [Mean ± SD]:',
          np.mean(tempL), '±', np.std(tempL, ddof=1))#Sample standard deviation
    print(' - Out-of-sample R2 [Mean ± SEM]:',
          np.mean(tempL), '±', np.std(tempL, ddof=1)/np.sqrt(len(tempL)))
    tempDF = tempD2[sex]
    print(' - Observed vs. predicted log_'+yvar+': (Pearson\'s r, P) =',
          stats.pearsonr(tempDF['log_'+yvar], tempDF['log_'+yvar_model]))
    print(' - Observed vs. predicted '+yvar+': (Pearson\'s r, P) =',
          stats.pearsonr(tempDF[yvar], tempDF[yvar_model]))
    display(tempDF.describe())

> For now, check the difference between the sex-specific and sex-mixed models. Note that this is only a rough comparison because the sample sizes differ.  

In [None]:
#Prepare DF
tempDF = pd.DataFrame({'Female':standBMI_F_R2, 'Male':standBMI_M_R2, 'Both sex':standBMI_B_R2})
tempDF = tempDF.melt(var_name='Cohort', value_name='R2', value_vars=tempDF.columns.tolist())

#Plot
sns.set(style='ticks', font='Arial', context='notebook')
plt.figure(figsize=(3, 1))
sns.barplot(data=tempDF, y='Cohort', x='R2', hue='Cohort', palette='Set1', dodge=False, edgecolor='black',
            ci=95, capsize=0.4, errwidth=1.5, errcolor='black')
sns.stripplot(data=tempDF, y='Cohort', x='R2', hue='Cohort', dodge=False, size=8, edgecolor='black',
              linewidth=1, alpha=0.4, palette={'Female':'gray', 'Male':'gray', 'Both sex':'gray'})
sns.despine()
plt.xlabel('Out-of-sample '+r'$R^2$'+' in 10 OLS-LR models\n[mean with 95% CI]')
plt.ylabel('')
plt.legend('', frameon=False)
plt.show()

In [None]:
tempD1 = {'Female':standBMI_F, 'Male':standBMI_M, 'Both sex':standBMI_B}
yvar = 'BaseBMI'
yvar_model = 'BaseStandBMI'
range_min = np.min([df[var].min() for df in tempD1.values() for var in [yvar, yvar_model]])
range_max = np.max([df[var].max() for df in tempD1.values() for var in [yvar, yvar_model]])
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

#Plot measured vs. predicted per model
tempD2 = {'Female':'tab:red', 'Male':'tab:blue', 'Both sex':'tab:green'}
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(10, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    tempDF = tempD1[sex]
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x=yvar_model, y=yvar, color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Measured '+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF[yvar_model], tempDF[yvar])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Keep extra digits here; the final rounding uses ROUND_HALF_UP below
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save axes positions to place the shared x-axis label later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==2:
        ax2_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax2_pos[0]+ax2_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Predicted '+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

#Plot difference b/w sex-specific and sex-mixed models
tempD2 = {'Female':'tab:red', 'Male':'tab:blue'}
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(6.5, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    #Prepare DF
    tempS1 = tempD1[sex][yvar_model]
    tempS1.name = 'Sex-specific'
    tempS2 = tempD1['Both sex'][yvar_model]
    tempS2.name = 'Sex-mixed'
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='inner')
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Sex-specific', y='Sex-mixed', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Sex-mixed b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Sex-specific'], tempDF['Sex-mixed'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Keep extra digits here; the final rounding uses ROUND_HALF_UP below
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save axes positions to place the shared x-axis label later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==1:
        ax1_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax1_pos[0]+ax1_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Sex-specific b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

### 6-6. Clean beta-coefficient dataframe

In [None]:
tempD1 = {'Female':standBMI_F_bcoefs, 'Male':standBMI_M_bcoefs, 'BothSex':standBMI_B_bcoefs}
tempD2 = {'Female':standBMI_F_intercept, 'Male':standBMI_M_intercept, 'BothSex':standBMI_B_intercept}
yvar_model = 'StandBMI'
model_method = 'OLS'
for sex in tempD1.keys():
    #Combine variables and intercept
    tempDF1 = tempD1[sex]
    tempDF2 = tempD2[sex]
    tempDF = pd.concat([tempDF1, tempDF2], axis=0)
    #Summarize
    tempL1 = []
    tempL2 = []
    tempL3 = []
    for row_n in tempDF.index.tolist():
        tempL1.append(tempDF.loc[row_n].mean())
        tempL2.append(tempDF.loc[row_n].std(ddof=1))#Sample standard deviation
        tempL3.append((tempDF.loc[row_n]==0.0).astype('int64').sum())
    tempDF['Mean'] = tempL1
    tempDF['SD'] = tempL2
    tempDF['nZeros'] = tempL3
    #Save the cleaned DF
    fileDir = './ExportData/'
    ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
    fileName = yvar_model+'-'+sex+'-'+model_method+'bcoefs.tsv'
    tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)
    
    #Check
    tempDF2 = tempDF.loc[tempDF.index.isin(['Intercept'])]#Retrieve as pd.DataFrame
    tempDF = tempDF.drop(index=['Intercept'])
    print(sex+':')
    print(' - Variables:', len(tempDF))
    tempDF1 = tempDF.loc[tempDF['nZeros']!=10]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient in all 10 models:', len(tempDF1),
          '(', len(tempDF1)/len(tempDF)*100, '%)')
    tempDF1 = tempDF1.sort_values(by='Mean', ascending=False)
    display(tempDF1)
    display(tempDF2)#Intercept
    print('')

## 7. Comparison b/w models

### 7-1. Out-of-sample R2

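> Assess significance using a two-sided Welch's t-test from the statsmodels library, which returns the degrees of freedom along with the test statistic and P-value. Note that scipy.stats.ttest_ind() did not report degrees of freedom in the SciPy version used here.  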
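>
> As a hedged illustration (not part of the original analysis), the Welch statistic and its Welch–Satterthwaite degrees of freedom can be reproduced by hand with the standard library only. The helper name `welch_t_and_dof` and the toy R2 values are hypothetical; the P-value itself would still come from the t-distribution via statsmodels or SciPy.  

```python
import math
import statistics

def welch_t_and_dof(sample1, sample2):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n1, n2 = len(sample1), len(sample2)
    #Variance of each group mean (sample variance with ddof=1, divided by n)
    v1 = statistics.variance(sample1) / n1
    v2 = statistics.variance(sample2) / n2
    tstat = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(v1 + v2)
    dof = (v1 + v2)**2 / (v1**2/(n1 - 1) + v2**2/(n2 - 1))
    return tstat, dof

#Toy out-of-sample R2 values (made up for illustration only)
r2_control = [0.10, 0.12, 0.14, 0.16, 0.18]
r2_contrast = [0.50, 0.55, 0.60, 0.65, 0.70]
tstat, dof = welch_t_and_dof(r2_control, r2_contrast)
print('t =', round(tstat, 3), ', DoF =', round(dof, 3))
```

> With unequal variances, the degrees of freedom fall between min(n1, n2) - 1 and n1 + n2 - 2 and are generally non-integer, which is why they are reported explicitly in the comparison table.  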

In [None]:
tempD1 = {'Standard measures':standBMI_B_R2,
          'Metabolomics':metBMI_B_R2, 'Proteomics':protBMI_B_R2,
          'Clinical labs':chemBMI_B_R2, 'Combined omics':combiBMI_B_R2}
tempD2 = {'Standard measures':'0.5',
          'Metabolomics':'b', 'Proteomics':'r',
          'Clinical labs':'g', 'Combined omics':'m'}

#Prepare DF
tempDF = pd.DataFrame(tempD1)
display(tempDF.describe())

#Statistical tests
control = list(tempD1.keys())[0]
tempL = list(tempD1.keys())[1:]
tempDF1 = pd.DataFrame(columns=['Control', 'Contrast', 'Control_N', 'contrast_N', 'DoF', 'tStat', 'Pval'])
for contrast in tempL:
    tempS1 = tempDF[control]
    tempS2 = tempDF[contrast]
    #Two-sided Welch's t-test
    tstat, pval, dof = weightstats.ttest_ind(tempS1, tempS2,
                                             alternative='two-sided', usevar='unequal')
    size1 = len(tempS1)
    size2 = len(tempS2)
    tempDF1.loc[contrast+'-vs-'+control] = [control, contrast, size1, size2, dof, tstat, pval]
##P-value adjustment by using Benjamini–Hochberg method
tempDF1['AdjPval'] = multi.multipletests(tempDF1['Pval'], alpha=0.05, method='fdr_bh',
                                         is_sorted=False, returnsorted=False)[1]
tempDF1.index.rename('ComparisonLabel', inplace=True)
display(tempDF1)
##Save
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = 'R2-comparison-BothSex.tsv'
tempDF1.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

#Plot
tempDF = tempDF.melt(var_name='Category', value_name='R2', value_vars=tempDF.columns.tolist())
axis_ymin = 0.0
axis_ymax = 1.2
ymin = 0.0
ymax = 0.8
yinter = 0.2
aline_ymin = 0.8
aline_yinter = 0.1
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(3.5, 4))
sns.barplot(data=tempDF, y='R2', x='Category', order=tempD2.keys(),
            hue='Category', hue_order=tempD2.keys(), dodge=False, palette=tempD2,
            ci=95, capsize=0.4, errwidth=1.5, errcolor='black', edgecolor='black')
p = sns.stripplot(data=tempDF, y='R2', x='Category', order=tempD2.keys(),
                  hue='Category', hue_order=tempD2.keys(), dodge=False, jitter=0.3,
                  size=5, edgecolor='black', color='gray', linewidth=1, alpha=0.4)
sns.despine()
p.set(ylim=(axis_ymin, axis_ymax), yticks=np.arange(ymin, ymax+yinter/10, yinter))
plt.setp(p.get_xticklabels(), rotation=70,
         horizontalalignment='right', verticalalignment='center', rotation_mode='anchor')
##P-value annotation
for row_i in range(len(tempDF1)):
    #Control
    group_0 = tempDF1['Control'].iloc[row_i]
    xcoord_0 = list(tempD2.keys()).index(group_0)
    #Contrast
    group_1 = tempDF1['Contrast'].iloc[row_i]
    xcoord_1 = list(tempD2.keys()).index(group_1)
    #Standard point of marker
    xcoord = (xcoord_0+xcoord_1)/2
    ycoord = aline_ymin + aline_yinter*row_i
    #Add annotation lines
    aline_offset = yinter/10
    aline_length = yinter/10 + aline_offset
    p.plot([xcoord_0, xcoord_0, xcoord_1, xcoord_1],
           [ycoord+aline_offset, ycoord+aline_length, ycoord+aline_length, ycoord+aline_offset],
           lw=1.5, c='k')
    #Retrieve P-value
    pval = tempDF1['AdjPval'].iloc[row_i]
    if pval<0.001:
        label = '***'
    elif pval<0.01:
        label = '**'
    elif pval<0.05:
        label = '*'
    else:
        pval_text = str(Decimal(str(pval)).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))#Via str() to avoid binary float artifacts
        label = r'$P$'+' = '+pval_text
    #Add annotation text
    if label in ['***', '**', '*']:
        text_offset = yinter/25
        text_size = 'medium'
    else:
        text_offset = yinter/5
        text_size = 'x-small'
    p.annotate(label, xy=(xcoord, ycoord+text_offset),
               horizontalalignment='center', verticalalignment='bottom',
               fontsize=text_size, color='k')
plt.ylabel('Out-of-sample '+r'$R^2$')
plt.xlabel('')
p.get_legend().remove()
##Save
fileDir = './ExportFigures/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = 'R2-comparison-BothSex.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

In [None]:
#Plot sex-specific version as a single figure
tempD1 = {'Standard measures':standBMI_F_R2,
          'Metabolomics':metBMI_F_R2, 'Proteomics':protBMI_F_R2,
          'Clinical labs':chemBMI_F_R2, 'Combined omics':combiBMI_F_R2}
tempD2 = {'Standard measures':standBMI_M_R2,
          'Metabolomics':metBMI_M_R2, 'Proteomics':protBMI_M_R2,
          'Clinical labs':chemBMI_M_R2, 'Combined omics':combiBMI_M_R2}
tempD1 = {'Female':tempD1, 'Male':tempD2}
tempD2 = {'Standard measures':'0.5',
          'Metabolomics':'b', 'Proteomics':'r',
          'Clinical labs':'g', 'Combined omics':'m'}

#Prepare DFs
tempD3 = {}
tempD = {}
for sex in tempD1.keys():
    #Prepare DF
    tempDF = pd.DataFrame(tempD1[sex])
    tempD3[sex] = tempDF
    print(sex)
    display(tempDF.describe())
    
    #Statistical tests
    control = list(tempD2.keys())[0]
    tempL = list(tempD2.keys())[1:]
    tempDF1 = pd.DataFrame(columns=['Control', 'Contrast', 'Control_N', 'contrast_N', 'DoF', 'tStat', 'Pval'])
    for contrast in tempL:
        tempS1 = tempDF[control]
        tempS2 = tempDF[contrast]
        #Two-sided Welch's t-test
        tstat, pval, dof = weightstats.ttest_ind(tempS1, tempS2,
                                                 alternative='two-sided', usevar='unequal')
        size1 = len(tempS1)
        size2 = len(tempS2)
        tempDF1.loc[contrast+'-vs-'+control] = [control, contrast, size1, size2, dof, tstat, pval]
    ##P-value adjustment (within sex) by using Benjamini–Hochberg method
    tempDF1['AdjPval_sex'] = multi.multipletests(tempDF1['Pval'], alpha=0.05, method='fdr_bh',
                                                 is_sorted=False, returnsorted=False)[1]
    tempDF1['Sex'] = sex
    tempD[sex] = tempDF1
tempDF1 = pd.concat(list(tempD.values()), axis=0)
##P-value adjustment (across all tests) by using Benjamini–Hochberg method
tempDF1['AdjPval_all'] = multi.multipletests(tempDF1['Pval'], alpha=0.05, method='fdr_bh',
                                             is_sorted=False, returnsorted=False)[1]
tempDF1.index.rename('ComparisonLabel', inplace=True)
tempDF1 = tempDF1.reset_index().set_index(['Sex', 'ComparisonLabel'])
display(tempDF1)
##Save
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = 'R2-comparison-FemaleMale.tsv'
tempDF1.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

#Plot
axis_ymin = 0.0
axis_ymax = 1.2
ymin = 0.0
ymax = 0.8
yinter = 0.2
aline_ymin = 0.8
aline_yinter = 0.1
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD3),
                         figsize=(3.5*len(tempD3), 4), sharex=True, sharey=True)
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD3.keys())[ax_i]
    tempDF = tempD3[sex]
    tempDF = tempDF.melt(var_name='Category', value_name='R2', value_vars=tempDF.columns.tolist())
    sns.barplot(data=tempDF, y='R2', x='Category', order=tempD2.keys(),
                hue='Category', hue_order=tempD2.keys(), dodge=False, palette=tempD2,
                ci=95, capsize=0.4, errwidth=1.5, errcolor='black', edgecolor='black', ax=ax)
    sns.stripplot(data=tempDF, y='R2', x='Category', order=tempD2.keys(),
                  hue='Category', hue_order=tempD2.keys(), dodge=False, jitter=0.3,
                  size=5, edgecolor='black', color='gray', linewidth=1, alpha=0.4, ax=ax)
    #P-value annotation
    tempDF2 = tempDF1.loc[sex]#MultiIndex
    for row_i in range(len(tempDF2)):
        #Control
        group_0 = tempDF2['Control'].iloc[row_i]
        xcoord_0 = list(tempD2.keys()).index(group_0)
        #Contrast
        group_1 = tempDF2['Contrast'].iloc[row_i]
        xcoord_1 = list(tempD2.keys()).index(group_1)
        #Standard point of marker
        xcoord = (xcoord_0+xcoord_1)/2
        ycoord = aline_ymin + aline_yinter*row_i
        #Add annotation lines
        aline_offset = yinter/10
        aline_length = yinter/10 + aline_offset
        ax.plot([xcoord_0, xcoord_0, xcoord_1, xcoord_1],
                [ycoord+aline_offset, ycoord+aline_length, ycoord+aline_length, ycoord+aline_offset],
                lw=1.5, c='k')
        #Retrieve P-value
        pval = tempDF2['AdjPval_all'].iloc[row_i]
        if pval<0.001:
            label = '***'
        elif pval<0.01:
            label = '**'
        elif pval<0.05:
            label = '*'
        else:
            pval_text = str(Decimal(str(pval)).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))#Via str() to avoid binary float artifacts
            label = r'$P$'+' = '+pval_text
        #Add annotation text
        if label in ['***', '**', '*']:
            text_offset = yinter/25
            text_size = 'medium'
        else:
            text_offset = yinter/5
            text_size = 'x-small'
        ax.annotate(label, xy=(xcoord, ycoord+text_offset),
                    horizontalalignment='center', verticalalignment='bottom',
                    fontsize=text_size, color='k')
    #Facet label
    ax.set_title(sex, {'fontsize':'large'})
    #Legend
    ax.get_legend().remove()
    #Axis setting
    plt.setp(ax.get_xticklabels(), rotation=70,
             horizontalalignment='right', verticalalignment='center', rotation_mode='anchor')
    if ax_i==0:
        plt.setp(ax, xlabel='', ylabel='Out-of-sample '+r'$R^2$')
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
sns.despine()
plt.setp(axes, ylim=(axis_ymin, axis_ymax), yticks=np.arange(ymin, ymax+yinter/10, yinter))
##Save
fileDir = './ExportFigures/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = 'R2-comparison-FemaleMale.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

### 7-2. Measured vs. Predicted

> In this version, P-values from Pearson's correlation tests are adjusted across categories.  
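>
> As a hedged sketch of what adjusting across categories means, the Benjamini–Hochberg step-up procedure can be written with the standard library only. The function name `bh_adjust` and the toy P-values are hypothetical; the notebook itself relies on statsmodels `multipletests(method='fdr_bh')`, which is expected to give the same adjusted values for the same input.  

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvals)
    #Indices of the P-values, from the largest to the smallest
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0]*m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos#Rank of this P-value in ascending order (1 = smallest)
        #Scale by m/rank and enforce monotonicity from the largest P-value down
        running_min = min(running_min, pvals[idx]*m/rank)
        adjusted[idx] = running_min
    return adjusted

#Toy P-values, one per category (made up for illustration only)
toy_pvals = {'Metabolomics':0.005, 'Proteomics':0.04,
             'Clinical labs':0.03, 'Combined omics':0.01}
toy_adj = dict(zip(toy_pvals.keys(), bh_adjust(list(toy_pvals.values()))))
print(toy_adj)
```

> Because every P-value is scaled by the total number of tests, adding or removing a category changes the adjusted values of all the others.  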

In [None]:
tempD1 = {'Standard measures':standBMI_B,
          'Metabolomics':metBMI_B, 'Proteomics':protBMI_B,
          'Clinical labs':chemBMI_B, 'Combined omics':combiBMI_B}
tempD2 = {'Standard measures':'BaseStandBMI',
          'Metabolomics':'BaseMetBMI', 'Proteomics':'BaseProtBMI',
          'Clinical labs':'BaseChemBMI', 'Combined omics':'BaseCombiBMI'}
tempD3 = {'Standard measures':'0.5',
          'Metabolomics':'b', 'Proteomics':'r',
          'Clinical labs':'g', 'Combined omics':'m'}
yvar='BaseBMI'
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

#Statistical tests
tempDF1 = pd.DataFrame(columns=['N', 'DoF', 'Pearson_r', 'Pval'])
for category in tempD1.keys():
    tempDF = tempD1[category]
    xvar = tempD2[category]
    #Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF[xvar], tempDF[yvar])
    size = len(tempDF)
    dof = size - 2
    tempDF1.loc[category] = [size, dof, pearson_r, pval]
##P-value adjustment by using Benjamini–Hochberg method
tempDF1['AdjPval'] = multi.multipletests(tempDF1['Pval'], alpha=0.05, method='fdr_bh',
                                         is_sorted=False, returnsorted=False)[1]
tempDF1.index.rename('Category', inplace=True)
tempDF1['N'] = tempDF1['N'].astype('int64')#Otherwise, float64!
tempDF1['DoF'] = tempDF1['DoF'].astype('int64')#Otherwise, float64!
display(tempDF1)
##Save
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = 'regplot-comparison-BothSex.tsv'
tempDF1.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

#Plot
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD3),
                         figsize=(3.5*len(tempD3), 3.5+1), sharex=True, sharey=True)
axis_xymin = 7.5
axis_xymax = 62.5
xymin = 10
xymax = 60
xyinter = 10
#Set axis range first; otherwise, regression line can be truncated differently
plt.setp(axes, xlim=(axis_xymin, axis_xymax), xticks=np.arange(xymin, xymax+xyinter/10, xyinter))
plt.setp(axes, ylim=(axis_xymin, axis_xymax), yticks=np.arange(xymin, xymax+xyinter/10, xyinter))
for ax_i, ax in enumerate(axes.flat):
    category = list(tempD1.keys())[ax_i]
    #Prepare DF
    tempDF = tempD1[category]
    xvar = tempD2[category]
    #Scatterplot with regression line
    sns.regplot(data=tempDF, x=xvar, y=yvar, color=tempD3[category],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':25}, ax=ax)
    #Draw Y=X as reference
    ax.plot([axis_xymin, axis_xymax], [axis_xymin, axis_xymax],
            color='black', linestyle=(0, (1, 2)), zorder=0)
    #Annotate Pearson's correlation
    pearson_r = tempDF1['Pearson_r'].loc[category]
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    pval = tempDF1['AdjPval'].loc[category]
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Keep extra digits here; the final rounding uses ROUND_HALF_UP below
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='small', color='k')
    #Facet label
    ax.set_title(category, {'fontsize':'large'})
    #Axis setting
    if ax_i%len(tempD3)==0:
        plt.setp(ax, xlabel='', ylabel='Measured '+axis_label)
    elif ax_i==np.median(range(len(tempD3))):
        plt.setp(ax, xlabel='Predicted '+axis_label, ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
sns.despine()
#Reset and generate common axis title
#plt.setp(axes, xlabel='', ylabel='')
fig.tight_layout(pad=0.75)
#fig.text(x=0.525, y=0.0175,#Manual adjustment
#         s='Predicted '+axis_label, fontsize='medium',
#         verticalalignment='top', horizontalalignment='center')
#fig.text(x=0.0125, y=0.5,#Manual adjustment
#         s='Measured '+axis_label, fontsize='medium',
#         verticalalignment='center', horizontalalignment='right', rotation='vertical')
##Save
fileDir = './ExportFigures/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = 'regplot-comparison-BothSex.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

## 8. Performance transition vs. critical variables

> If a variable critical to the elastic net models is eliminated from the input variables, performance should drop.  
> -> Repeat the elastic net fitting while dropping the top variable at each iteration: the variable that was retained across all 10 models and had the highest absolute value of the mean beta-coefficient.  
>
> Because the calculation is time-consuming, the female/male model versions are skipped.  
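>
> The selection rule for the top variable can be sketched with the standard library only. This is a hedged illustration, not the notebook's implementation: `pick_top_variable` and the toy beta-coefficients are hypothetical, and the toy loop reuses the same coefficients at every step, whereas the actual procedure refits the elastic net after each drop.  

```python
import statistics

def pick_top_variable(bcoefs):
    """bcoefs maps variable name -> list of beta-coefficients (one per model)."""
    #Keep only variables with a non-zero beta-coefficient in every model
    retained = {var: vals for var, vals in bcoefs.items()
                if all(v != 0.0 for v in vals)}
    if not retained:
        return None#No variable was retained across all models
    #Largest absolute value of the mean beta-coefficient
    return max(retained, key=lambda var: abs(statistics.mean(retained[var])))

#Toy beta-coefficients from three models (made up for illustration only)
toy_bcoefs = {'var_A':[0.30, 0.00, 0.35],#Zeroed out in one model -> never eligible
              'var_B':[-0.20, -0.25, -0.15],
              'var_C':[0.05, 0.10, 0.15]}
dropped = []
while True:
    remaining = {k: v for k, v in toy_bcoefs.items() if k not in dropped}
    top = pick_top_variable(remaining)
    if top is None:
        break
    dropped.append(top)
print(dropped)
```

> Note that a variable with a large coefficient in most models is still skipped if it was shrunk to zero in any single model, and that the sign of the mean is ignored when ranking.  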

### 8-1. Metabolomics

In [None]:
#Both sex model
niterations = 100
sex = 'BothSex'
tempDF1 = metDF_B#Standardized independent variables
tempDF2 = bmiDF_B#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     precompute='auto', cv=ncvs)#normalize=False (the former default) was removed in scikit-learn 1.2
yvar_model = 'BaseMetBMI'

#Perform elastic net while dropping the top retained variable
print('Start:', time.ctime(time.time()))
dropL = []#Initialize
tempL = ['Model_'+str(model_k+1).zfill(2) for model_k in range(nmodels)]
scoreDF = pd.DataFrame(columns=tempL).astype('float64')#For the coefficient of determination R2
scoreDF.index.rename('Iteration', inplace=True)
for iter_n in range(niterations+1):
    print('Iteration', iter_n)
    t_start = time.time()
    #Drop the top variables
    tempDF = tempDF1.drop(columns=dropL)
    
    #Perform elastic net
    bcoefDF = pd.DataFrame(index=tempDF.columns).astype('float64')#For beta-coefficients
    bcoefDF.index.rename('Variable', inplace=True)
    scoreL = []#For the coefficient of determination R2
    for model_k in range(nmodels):
        #Prepare training and testing (hold-out) datasets in model k
        model_n = 'Model_'+str(model_k+1).zfill(2)
        yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
        yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
        xDF_train = tempDF.loc[yDF_train.index.tolist()]
        xDF_test = tempDF.loc[yDF_test.index.tolist()]
        #Retrieve the dependent variable
        yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
        yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
        #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
        model.fit(xDF_train, yDF_train)
        #Save parameters
        bcoefDF[model_n] = model.coef_#w in the cost function formula
        #Evaluation with testing (hold-out) dataset
        scoreL.append(model.score(xDF_test, yDF_test))
    
    #Prediction accuracy
    print(' - Out-of-sample R2 [Mean ± SEM]:',
          np.mean(scoreL), '±', np.std(scoreL, ddof=1)/np.sqrt(len(scoreL)))
    if np.mean(scoreL)<0:
        print('-> Finish')
        break
    
    #Add the scores to scoreDF
    scoreDF.loc['Iteration_'+str(iter_n).zfill(3)] = scoreL
    
    #Clean beta-coefficient dataframe
    count = len(bcoefDF)#Input variables
    tempDF = bcoefDF#Just rename to reuse the same code without modification
    ##Summarize
    tempL1 = []
    tempL3 = []
    for row_n in tempDF.index.tolist():
        tempL1.append(tempDF.loc[row_n].mean())
        tempL3.append((tempDF.loc[row_n]==0.0).astype('int64').sum())
    tempDF['Mean'] = tempL1
    tempDF['nZeros'] = tempL3
    
    #Identify the top variable
    ##Filter the variables with non-zero beta-coefficient in all 10 models
    tempDF = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF),
          '(', len(tempDF)/count*100, '%)')
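    ##Identify the variable having the highest absolute value of the mean beta-coefficient
    tempS = np.abs(tempDF['Mean'])
    tempS = tempS.sort_values(ascending=False)
    if len(tempS)==0:#No variable was retained by all models (tempS.index[0] would raise IndexError)
        print(' - Top variable: None -> Finish')
        dropL.append(np.nan)#Keep dropL aligned with the rows of scoreDF
        break
    print(' - Top variable:', tempS.index[0])
    ##Update the top variable list
    dropL.append(tempS.index[0])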
    
    t_elapsed = time.time() - t_start
    print(' - Elapsed time:', round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')
print('Finish:', time.ctime(time.time()))

#Save in case of a connection timeout error
scoreDF['TopVariable'] = dropL
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-'+sex+'-R2transition.tsv'
scoreDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

In [None]:
#Visualize performance transition
yvar_model = 'MetBMI'
tempDF1 = metBMI_B_bcoefs#All input variables with the beta-coefficient summary
sex = 'BothSex'
model_color = 'b'

#Re-load scoreDF in case of a connection timeout error
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model+'-'+sex+'-R2transition.tsv'
tempDF2 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Iteration')

#Prepare DF
tempDF2['IterNum'] = list(range(len(tempDF2)))
tempDF = tempDF2.drop(columns=['TopVariable'])
tempDF = tempDF.reset_index().melt(var_name='Model', value_name='R2', id_vars=['Iteration', 'IterNum'])

#Retrieve the original robust variables
tempL = tempDF1.loc[tempDF1['nZeros']==0].index.tolist()
tempDF2 = tempDF2.loc[tempDF2['TopVariable'].isin(tempL)]
tempDF2 = tempDF2.loc[:, ~tempDF2.columns.str.contains('Model_')]
print('Original robust variables:', len(tempDF2))
display(tempDF2)

#Plot
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(5, 3))
p = sns.lineplot(data=tempDF, x='IterNum', y='R2', estimator='mean', ci=95, color='black')
p.set(xlim=(0, 100), ylim=(0, 0.8), yticks=np.arange(0, 0.81, 0.2))
sns.despine()
plt.xlabel('Iteration number')
plt.ylabel('Out-of-sample '+r'$R^2$')
for row_i in range(len(tempDF2)):
    iternum = tempDF2['IterNum'].iloc[row_i]
    p.axvspan(xmin=iternum, xmax=iternum+1, facecolor=model_color, alpha=0.4, zorder=0)
plt.margins(0, 0, tight=True)
##Save
fileDir = './ExportFigures/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model+'-'+sex+'-R2transition.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()
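
The cells in this section repeat the same loop for each omics dataset: fit one elastic net per hold-out split, find the variable that is retained by every model and has the largest absolute mean beta-coefficient, drop it, and refit. The following is a minimal sketch of that loop as a reusable function, run on synthetic data. The function name `fit_holdout_models`, the synthetic variables (`var0`, `var1`, ...), the three hold-out splits, and `cv=3` are assumptions made for illustration only; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV


def fit_holdout_models(xDF, yS, testS, l1_ratio=0.5, eps=0.05, cv=3):
    """Fit one ElasticNetCV per hold-out split (hypothetical helper).

    Returns the list of hold-out R2 scores and the top robust variable
    (non-zero coefficient in every model, largest |mean coefficient|),
    or None when no variable is retained by all models.
    """
    bcoefDF = pd.DataFrame(index=xDF.columns, dtype='float64')
    scoreL = []
    for model_n in sorted(testS.unique()):
        is_test = testS == model_n
        model = ElasticNetCV(l1_ratio=l1_ratio, eps=eps, cv=cv)
        model.fit(xDF.loc[~is_test], yS.loc[~is_test])
        bcoefDF[model_n] = model.coef_
        scoreL.append(model.score(xDF.loc[is_test], yS.loc[is_test]))
    robustDF = bcoefDF.loc[(bcoefDF != 0.0).all(axis=1)]
    top = robustDF.mean(axis=1).abs().idxmax() if len(robustDF) > 0 else None
    return scoreL, top


# Synthetic data: only var0 and var1 carry signal
rng = np.random.default_rng(0)
n_samples, n_vars = 150, 6
xDF = pd.DataFrame(rng.standard_normal((n_samples, n_vars)),
                   columns=['var' + str(i) for i in range(n_vars)])
yS = 3.0 * xDF['var0'] + 2.0 * xDF['var1'] + 0.1 * rng.standard_normal(n_samples)
testS = pd.Series(['Model_' + str(i % 3 + 1).zfill(2) for i in range(n_samples)],
                  index=xDF.index)

# Iteratively drop the top variable until the mean hold-out R2 turns negative
dropL, meanR2L = [], []
for iter_n in range(3):
    scoreL, top = fit_holdout_models(xDF.drop(columns=dropL), yS, testS)
    if np.mean(scoreL) < 0 or top is None:
        break
    meanR2L.append(np.mean(scoreL))
    dropL.append(top)
print(dropL, meanR2L)
```

Wrapping the loop in a function would let the four subsections differ only in their inputs (`tempDF1`, `eps`, and the output file name), at the cost of departing from the cell-by-cell record kept here.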

### 8-2. Proteomics

In [None]:
#Both sex model
niterations = 100
sex = 'BothSex'
tempDF1 = protDF_B#Standardized independent variables
tempDF2 = bmiDF_B#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     precompute='auto', cv=ncvs)#normalize=False (the default) is omitted; the argument was removed in scikit-learn 1.2
yvar_model = 'BaseProtBMI'

#Perform elastic net while dropping the top retained variable
print('Start:', time.ctime(time.time()))
dropL = []#Initialize
tempL = ['Model_'+str(model_k+1).zfill(2) for model_k in range(nmodels)]
scoreDF = pd.DataFrame(columns=tempL).astype('float64')#For the coefficient of determination R2
scoreDF.index.rename('Iteration', inplace=True)
for iter_n in range(niterations+1):
    print('Iteration', iter_n)
    t_start = time.time()
    #Drop the top variables
    tempDF = tempDF1.drop(columns=dropL)
    
    #Perform elastic net
    bcoefDF = pd.DataFrame(index=tempDF.columns).astype('float64')#For beta-coefficients
    bcoefDF.index.rename('Variable', inplace=True)
    scoreL = []#For the coefficient of determination R2
    for model_k in range(nmodels):
        #Prepare training and testing (hold-out) datasets in model k
        model_n = 'Model_'+str(model_k+1).zfill(2)
        yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
        yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
        xDF_train = tempDF.loc[yDF_train.index.tolist()]
        xDF_test = tempDF.loc[yDF_test.index.tolist()]
        #Retrieve the dependent variable
        yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
        yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
        #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
        model.fit(xDF_train, yDF_train)
        #Save parameters
        bcoefDF[model_n] = model.coef_#w in the cost function formula
        #Evaluation with testing (hold-out) dataset
        scoreL.append(model.score(xDF_test, yDF_test))
    
    #Prediction accuracy
    print(' - Out-of-sample R2 [Mean ± SEM]:',
          np.mean(scoreL), '±', np.std(scoreL, ddof=1)/np.sqrt(len(scoreL)))
    if np.mean(scoreL)<0:
        print('-> Finish')
        break
    
    #Add the scores to scoreDF
    scoreDF.loc['Iteration_'+str(iter_n).zfill(3)] = scoreL
    
    #Clean beta-coefficient dataframe
    count = len(bcoefDF)#Input variables
    tempDF = bcoefDF#Just rename to reuse the same code without modification
    ##Summarize
    tempL1 = []
    tempL3 = []
    for row_n in tempDF.index.tolist():
        tempL1.append(tempDF.loc[row_n].mean())
        tempL3.append((tempDF.loc[row_n]==0.0).astype('int64').sum())
    tempDF['Mean'] = tempL1
    tempDF['nZeros'] = tempL3
    
    #Identify the top variable
    ##Filter the variables with non-zero beta-coefficient in all 10 models
    tempDF = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF),
          '(', len(tempDF)/count*100, '%)')
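    ##Identify the variable having the highest absolute value of the mean beta-coefficient
    tempS = np.abs(tempDF['Mean'])
    tempS = tempS.sort_values(ascending=False)
    if len(tempS)==0:#No variable was retained by all models (tempS.index[0] would raise IndexError)
        print(' - Top variable: None -> Finish')
        dropL.append(np.nan)#Keep dropL aligned with the rows of scoreDF
        break
    print(' - Top variable:', tempS.index[0])
    ##Update the top variable list
    dropL.append(tempS.index[0])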
    
    t_elapsed = time.time() - t_start
    print(' - Elapsed time:', round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')
print('Finish:', time.ctime(time.time()))

#Save in case of a connection timeout error
scoreDF['TopVariable'] = dropL
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-'+sex+'-R2transition.tsv'
scoreDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

In [None]:
#Visualize performance transition
yvar_model = 'ProtBMI'
tempDF1 = protBMI_B_bcoefs#All input variables with the beta-coefficient summary
sex = 'BothSex'
model_color = 'r'

#Re-load scoreDF in case of a connection timeout error
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model+'-'+sex+'-R2transition.tsv'
tempDF2 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Iteration')

#Prepare DF
tempDF2['IterNum'] = list(range(len(tempDF2)))
tempDF = tempDF2.drop(columns=['TopVariable'])
tempDF = tempDF.reset_index().melt(var_name='Model', value_name='R2', id_vars=['Iteration', 'IterNum'])

#Retrieve the original robust variables
tempL = tempDF1.loc[tempDF1['nZeros']==0].index.tolist()
tempDF2 = tempDF2.loc[tempDF2['TopVariable'].isin(tempL)]
tempDF2 = tempDF2.loc[:, ~tempDF2.columns.str.contains('Model_')]
print('Original robust variables:', len(tempDF2))
display(tempDF2)

#Plot
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(5, 3))
p = sns.lineplot(data=tempDF, x='IterNum', y='R2', estimator='mean', ci=95, color='black')
p.set(xlim=(0, 100), ylim=(0, 0.8), yticks=np.arange(0, 0.81, 0.2))
sns.despine()
plt.xlabel('Iteration number')
plt.ylabel('Out-of-sample '+r'$R^2$')
for row_i in range(len(tempDF2)):
    iternum = tempDF2['IterNum'].iloc[row_i]
    p.axvspan(xmin=iternum, xmax=iternum+1, facecolor=model_color, alpha=0.4, zorder=0)
plt.margins(0, 0, tight=True)
##Save
fileDir = './ExportFigures/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model+'-'+sex+'-R2transition.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()
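
The beta-coefficient summary in the cells above (`Mean` and `nZeros` per variable) is computed with a row-wise loop. As a small illustration, the sketch below checks on toy data that a vectorized pandas version gives the same summary. The coefficient values and the names `varA`, `varB`, `varC`, and `summaryDF` are made up for this example.

```python
import pandas as pd

# Toy beta-coefficients (made-up values): rows are variables, columns are models
bcoefDF = pd.DataFrame({'Model_01': [0.5, 0.0, -0.2],
                        'Model_02': [0.3, 0.1, -0.4]},
                       index=pd.Index(['varA', 'varB', 'varC'], name='Variable'))

# Row-wise loop, as in the cells above
meanL, nzeroL = [], []
for row_n in bcoefDF.index.tolist():
    meanL.append(bcoefDF.loc[row_n].mean())
    nzeroL.append((bcoefDF.loc[row_n] == 0.0).astype('int64').sum())

# Vectorized alternative: both summaries are computed from the model columns
# only, before any summary column is added to the dataframe
summaryDF = bcoefDF.assign(Mean=bcoefDF.mean(axis=1),
                           nZeros=(bcoefDF == 0.0).sum(axis=1))

# Robust variables (non-zero in all models) and the top variable among them
robustDF = summaryDF.loc[summaryDF['nZeros'] == 0]
top_variable = robustDF['Mean'].abs().sort_values(ascending=False).index[0]
print(summaryDF)
print('Top variable:', top_variable)
```

Computing both summaries before adding either column matters: once `Mean` is stored in the dataframe, a later row-wise mean or zero count would include it.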

### 8-3. Clinical labs

In [None]:
#Both sex model
niterations = 100
sex = 'BothSex'
tempDF1 = chemDF_B#Standardized independent variables
tempDF2 = bmiDF_B#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.05, n_alphas=200, alphas=None, fit_intercept=True,
                     precompute='auto', cv=ncvs)#normalize=False (the default) is omitted; the argument was removed in scikit-learn 1.2
yvar_model = 'BaseChemBMI'

#Perform elastic net while dropping the top retained variable
print('Start:', time.ctime(time.time()))
dropL = []#Initialize
tempL = ['Model_'+str(model_k+1).zfill(2) for model_k in range(nmodels)]
scoreDF = pd.DataFrame(columns=tempL).astype('float64')#For the coefficient of determination R2
scoreDF.index.rename('Iteration', inplace=True)
for iter_n in range(niterations+1):
    print('Iteration', iter_n)
    t_start = time.time()
    #Drop the top variables
    tempDF = tempDF1.drop(columns=dropL)
    
    #Perform elastic net
    bcoefDF = pd.DataFrame(index=tempDF.columns).astype('float64')#For beta-coefficients
    bcoefDF.index.rename('Variable', inplace=True)
    scoreL = []#For the coefficient of determination R2
    for model_k in range(nmodels):
        #Prepare training and testing (hold-out) datasets in model k
        model_n = 'Model_'+str(model_k+1).zfill(2)
        yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
        yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
        xDF_train = tempDF.loc[yDF_train.index.tolist()]
        xDF_test = tempDF.loc[yDF_test.index.tolist()]
        #Retrieve the dependent variable
        yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
        yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
        #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
        model.fit(xDF_train, yDF_train)
        #Save parameters
        bcoefDF[model_n] = model.coef_#w in the cost function formula
        #Evaluation with testing (hold-out) dataset
        scoreL.append(model.score(xDF_test, yDF_test))
    
    #Prediction accuracy
    print(' - Out-of-sample R2 [Mean ± SEM]:',
          np.mean(scoreL), '±', np.std(scoreL, ddof=1)/np.sqrt(len(scoreL)))
    if np.mean(scoreL)<0:
        print('-> Finish')
        break
    
    #Add the scores to scoreDF
    scoreDF.loc['Iteration_'+str(iter_n).zfill(3)] = scoreL
    
    #Clean beta-coefficient dataframe
    count = len(bcoefDF)#Input variables
    tempDF = bcoefDF#Just rename to reuse the same code without modification
    ##Summarize
    tempL1 = []
    tempL3 = []
    for row_n in tempDF.index.tolist():
        tempL1.append(tempDF.loc[row_n].mean())
        tempL3.append((tempDF.loc[row_n]==0.0).astype('int64').sum())
    tempDF['Mean'] = tempL1
    tempDF['nZeros'] = tempL3
    
    #Identify the top variable
    ##Filter the variables with non-zero beta-coefficient in all 10 models
    tempDF = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF),
          '(', len(tempDF)/count*100, '%)')
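    ##Identify the variable having the highest absolute value of the mean beta-coefficient
    tempS = np.abs(tempDF['Mean'])
    tempS = tempS.sort_values(ascending=False)
    if len(tempS)==0:#No variable was retained by all models (tempS.index[0] would raise IndexError)
        print(' - Top variable: None -> Finish')
        dropL.append(np.nan)#Keep dropL aligned with the rows of scoreDF
        break
    print(' - Top variable:', tempS.index[0])
    ##Update the top variable list
    dropL.append(tempS.index[0])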
    
    t_elapsed = time.time() - t_start
    print(' - Elapsed time:', round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')
print('Finish:', time.ctime(time.time()))

#Save in case of a connection timeout error
scoreDF['TopVariable'] = dropL
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-'+sex+'-R2transition.tsv'
scoreDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

In [None]:
#Visualize performance transition
yvar_model = 'ChemBMI'
tempDF1 = chemBMI_B_bcoefs#All input variables with the beta-coefficient summary
sex = 'BothSex'
model_color = 'g'

#Re-load scoreDF in case of a connection timeout error
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model+'-'+sex+'-R2transition.tsv'
tempDF2 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Iteration')

#Prepare DF
tempDF2['IterNum'] = list(range(len(tempDF2)))
tempDF = tempDF2.drop(columns=['TopVariable'])
tempDF = tempDF.reset_index().melt(var_name='Model', value_name='R2', id_vars=['Iteration', 'IterNum'])

#Retrieve the original robust variables
tempL = tempDF1.loc[tempDF1['nZeros']==0].index.tolist()
tempDF2 = tempDF2.loc[tempDF2['TopVariable'].isin(tempL)]
tempDF2 = tempDF2.loc[:, ~tempDF2.columns.str.contains('Model_')]
print('Original robust variables:', len(tempDF2))
display(tempDF2)

#Plot
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(5, 3))
p = sns.lineplot(data=tempDF, x='IterNum', y='R2', estimator='mean', ci=95, color='black')
p.set(xlim=(0, 100), ylim=(0, 0.8), yticks=np.arange(0, 0.81, 0.2))
sns.despine()
plt.xlabel('Iteration number')
plt.ylabel('Out-of-sample '+r'$R^2$')
for row_i in range(len(tempDF2)):
    iternum = tempDF2['IterNum'].iloc[row_i]
    p.axvspan(xmin=iternum, xmax=iternum+1, facecolor=model_color, alpha=0.4, zorder=0)
plt.margins(0, 0, tight=True)
##Save
fileDir = './ExportFigures/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model+'-'+sex+'-R2transition.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()
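
Each iteration above reports the out-of-sample R2 as mean ± SEM across the hold-out models, where the SEM is the sample standard deviation (`ddof=1`) divided by the square root of the number of models. The short sketch below reproduces that arithmetic and cross-checks it against `scipy.stats.sem`. The R2 values are made-up numbers for illustration, not results from this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical hold-out R2 values from 10 models (made-up numbers)
scoreL = [0.68, 0.71, 0.66, 0.70, 0.69, 0.72, 0.65, 0.70, 0.67, 0.71]

# Mean and standard error of the mean, as printed in the loop above
mean_r2 = np.mean(scoreL)
sem_r2 = np.std(scoreL, ddof=1) / np.sqrt(len(scoreL))

# scipy.stats.sem uses ddof=1 by default, so the two should agree
sem_check = stats.sem(scoreL)
print('Out-of-sample R2 [Mean ± SEM]:', mean_r2, '±', sem_r2)
print('Agrees with scipy.stats.sem:', np.isclose(sem_r2, sem_check))
```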

### 8-4. Combined omics

In [None]:
#Both sex model
niterations = 100
sex = 'BothSex'
tempDF1 = combiDF_B#Standardized independent variables
tempDF2 = bmiDF_B#Not-standardized dependent variable and info about testing set
yvar = 'BaseBMI'
nmodels = len(tempDF2['Testing'].unique())#Reset just in case
ncvs = 10
model = ElasticNetCV(l1_ratio=0.5, eps=0.01, n_alphas=200, alphas=None, fit_intercept=True,
                     precompute='auto', cv=ncvs)#normalize=False (the default) is omitted; the argument was removed in scikit-learn 1.2
yvar_model = 'BaseCombiBMI'

#Perform elastic net while dropping the top retained variable
print('Start:', time.ctime(time.time()))
dropL = []#Initialize
tempL = ['Model_'+str(model_k+1).zfill(2) for model_k in range(nmodels)]
scoreDF = pd.DataFrame(columns=tempL).astype('float64')#For the coefficient of determination R2
scoreDF.index.rename('Iteration', inplace=True)
for iter_n in range(niterations+1):
    print('Iteration', iter_n)
    t_start = time.time()
    #Drop the top variables
    tempDF = tempDF1.drop(columns=dropL)
    
    #Perform elastic net
    bcoefDF = pd.DataFrame(index=tempDF.columns).astype('float64')#For beta-coefficients
    bcoefDF.index.rename('Variable', inplace=True)
    scoreL = []#For the coefficient of determination R2
    for model_k in range(nmodels):
        #Prepare training and testing (hold-out) datasets in model k
        model_n = 'Model_'+str(model_k+1).zfill(2)
        yDF_train = tempDF2.loc[tempDF2['Testing']!=model_n]
        yDF_test = tempDF2.loc[tempDF2['Testing']==model_n]
        xDF_train = tempDF.loc[yDF_train.index.tolist()]
        xDF_test = tempDF.loc[yDF_test.index.tolist()]
        #Retrieve the dependent variable
        yDF_train = pd.DataFrame(yDF_train['log_'+yvar])#Not Series but DF
        yDF_test = pd.DataFrame(yDF_test['log_'+yvar])#Not Series but DF
        #Fitting model with cross-validation using training dataset (i.e., internal training/validation datasets)
        model.fit(xDF_train, yDF_train)
        #Save parameters
        bcoefDF[model_n] = model.coef_#w in the cost function formula
        #Evaluation with testing (hold-out) dataset
        scoreL.append(model.score(xDF_test, yDF_test))
    
    #Prediction accuracy
    print(' - Out-of-sample R2 [Mean ± SEM]:',
          np.mean(scoreL), '±', np.std(scoreL, ddof=1)/np.sqrt(len(scoreL)))
    if np.mean(scoreL)<0:
        print('-> Finish')
        break
    
    #Add the scores to scoreDF
    scoreDF.loc['Iteration_'+str(iter_n).zfill(3)] = scoreL
    
    #Clean beta-coefficient dataframe
    count = len(bcoefDF)#Input variables
    tempDF = bcoefDF#Just rename to reuse the same code without modification
    ##Summarize
    tempL1 = []
    tempL3 = []
    for row_n in tempDF.index.tolist():
        tempL1.append(tempDF.loc[row_n].mean())
        tempL3.append((tempDF.loc[row_n]==0.0).astype('int64').sum())
    tempDF['Mean'] = tempL1
    tempDF['nZeros'] = tempL3
    
    #Identify the top variable
    ##Filter the variables with non-zero beta-coefficient in all 10 models
    tempDF = tempDF.loc[tempDF['nZeros']==0]
    print(' - Variables with non-zero beta-coefficient:', len(tempDF),
          '(', len(tempDF)/count*100, '%)')
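    ##Identify the variable having the highest absolute value of the mean beta-coefficient
    tempS = np.abs(tempDF['Mean'])
    tempS = tempS.sort_values(ascending=False)
    if len(tempS)==0:#No variable was retained by all models (tempS.index[0] would raise IndexError)
        print(' - Top variable: None -> Finish')
        dropL.append(np.nan)#Keep dropL aligned with the rows of scoreDF
        break
    print(' - Top variable:', tempS.index[0])
    ##Update the top variable list
    dropL.append(tempS.index[0])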
    
    t_elapsed = time.time() - t_start
    print(' - Elapsed time:', round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')
print('Finish:', time.ctime(time.time()))

#Save in case of a connection timeout error
scoreDF['TopVariable'] = dropL
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model.replace('Base', '')+'-'+sex+'-R2transition.tsv'
scoreDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

In [None]:
#Visualize performance transition
yvar_model = 'CombiBMI'
tempDF1 = combiBMI_B_bcoefs#All input variables with the beta-coefficient summary
sex = 'BothSex'
model_color = 'm'

#Re-load scoreDF in case of a connection timeout error
fileDir = './ExportData/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model+'-'+sex+'-R2transition.tsv'
tempDF2 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Iteration')

#Prepare DF
tempDF2['IterNum'] = list(range(len(tempDF2)))
tempDF = tempDF2.drop(columns=['TopVariable'])
tempDF = tempDF.reset_index().melt(var_name='Model', value_name='R2', id_vars=['Iteration', 'IterNum'])

#Retrieve the original robust variables
tempL = tempDF1.loc[tempDF1['nZeros']==0].index.tolist()
tempDF2 = tempDF2.loc[tempDF2['TopVariable'].isin(tempL)]
tempDF2 = tempDF2.loc[:, ~tempDF2.columns.str.contains('Model_')]
print('Original robust variables:', len(tempDF2))
display(tempDF2)

#Plot
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(5, 3))
p = sns.lineplot(data=tempDF, x='IterNum', y='R2', estimator='mean', ci=95, color='black')
p.set(xlim=(0, 100), ylim=(0, 0.8), yticks=np.arange(0, 0.81, 0.2))
sns.despine()
plt.xlabel('Iteration number')
plt.ylabel('Out-of-sample '+r'$R^2$')
for row_i in range(len(tempDF2)):
    iternum = tempDF2['IterNum'].iloc[row_i]
    p.axvspan(xmin=iternum, xmax=iternum+1, facecolor=model_color, alpha=0.4, zorder=0)
plt.margins(0, 0, tight=True)
##Save
fileDir = './ExportFigures/'
ipynbName = '220827_Multiomics-BMI-NatMed1stRevision_BMI-baseline-ElasticNet_'
fileName = yvar_model+'-'+sex+'-R2transition.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()
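
The visualization cells in this section reshape the saved R2 table in two ways: a long-format table for the line plot, and a filtered table of the iterations whose dropped variable was among the originally robust variables, which determines the shaded spans. The sketch below walks through both steps on a tiny table. All values and variable names (`var_a`, `var_b`, `var_c`, `original_robustL`, `longDF`, `shadeDF`) are hypothetical and only illustrate the reshaping.

```python
import pandas as pd

# Tiny stand-in for the saved R2 transition table (made-up values)
scoreDF = pd.DataFrame(
    {'Model_01': [0.70, 0.65, 0.60],
     'Model_02': [0.72, 0.66, 0.58],
     'TopVariable': ['var_a', 'var_b', 'var_c']},
    index=pd.Index(['Iteration_000', 'Iteration_001', 'Iteration_002'],
                   name='Iteration'))
# Hypothetical list of variables that were robust in the original model
original_robustL = ['var_a', 'var_c']

# Long format for the line plot: one row per (iteration, model) pair
scoreDF['IterNum'] = list(range(len(scoreDF)))
longDF = scoreDF.drop(columns=['TopVariable'])
longDF = longDF.reset_index().melt(var_name='Model', value_name='R2',
                                   id_vars=['Iteration', 'IterNum'])

# Iterations to shade: the dropped variable was originally robust
shadeDF = scoreDF.loc[scoreDF['TopVariable'].isin(original_robustL)]
shadeDF = shadeDF.loc[:, ~shadeDF.columns.str.contains('Model_')]
spans = [(int(i), int(i) + 1) for i in shadeDF['IterNum']]
print(longDF)
print(spans)
```

Each `(start, end)` pair in `spans` corresponds to one `axvspan` call in the plotting code, so the shaded band for an iteration covers the interval from its iteration number to the next.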

## — End of this notebook —