# Multiomics BMI Paper — Longitudinal BMI Predictions from the Arivale Time-series Omics Using the Baseline LASSO Models

***by Kengo Watanabe***  

This Jupyter Notebook (with Python 3 kernel) calculated longitudinal BMI predictions from each of the Arivale time-series blood omic datasets, using the LASSO linear regression models that were fitted on the Arivale baseline datasets.  

Input files:  
* Arivale baseline BMI and blood omics (preprocessed): 210104_Biological-BMI-paper_RF-imputation_baseline-\[metDF/protDF/chemDF/combiDF\]-with-RF-imputation.tsv  
* Arivale time-series blood omics (preprocessed): 220804_Multiomics-BMI-NatMed1stRevision_RF-imputation-ver2_time-series-\[metDF/protDF/chemDF/combiDF\]-with-RF-imputation.tsv  
* Baseline biological BMI models: 220801_Multiomics-BMI-NatMed1stRevision_BMI-baseline-LASSO_\[MetBMI/ProtBMI/ChemBMI/CombiBMI\]-\[Female/Male/BothSex\]-LASSObcoefs.tsv  
* Arivale baseline BMI predictions: 220801_Multiomics-BMI-NatMed1stRevision_BMI-baseline-LASSO_\[MetBMI/ProtBMI/ChemBMI/CombiBMI\]-\[Female/Male/BothSex\].tsv  

Output figures and tables:  
* Intermediate tables for other notebooks (BMI predictions)  

Original notebook (memo for my future tracing):  
* dalek:\[JupyterLab HOME\]/220621_Multiomics-BMI-NatMedRevision/220805_Multiomics-BMI-NatMed1stRevision_BMI-longitudinal-LASSO.ipynb  

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#For Arial font
#!conda install -c conda-forge -y mscorefonts
##-> The below was also needed in matplotlib 3.4.2
#import shutil
#import matplotlib
#shutil.rmtree(matplotlib.get_cachedir())
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display
import time

from sklearn.preprocessing import StandardScaler
from decimal import Decimal, ROUND_HALF_UP

!conda list

# packages in environment at /opt/conda/envs/arivale-py3:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
analytics                 0.1                      pypi_0    pypi
argon2-cffi               21.1.0           py39h3811e60_0    conda-forge
arivale-data-interface    0.1.0                    pypi_0    pypi
async_generator           1.10                       py_0    conda-forge
atk-1.0                   2.36.0               h3371d22_4    conda-forge
attrs                     21.2.0             pyhd8ed1ab_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
biopython                 1.79             py39h3811e60_0    conda-forge
bleach 

## 1. Cohort preparation

> The following code is identical to the code used in the baseline LASSO modeling. Hence, the correspondence between each participant and the testing (hold-out) set of each LASSO model is maintained.  
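> A minimal sketch of why rerunning the split reproduces the same hold-out assignment: `np.array_split` cuts rows by position into contiguous chunks, so the mapping depends only on the row order and the number of models. The participant IDs and the helper `assign_holdout` are hypothetical, and the split is applied to row positions rather than to the dataframe itself.  

```python
import numpy as np
import pandas as pd

# Hypothetical participant IDs; the real cohort comes from the baseline BMI table
ids = pd.Index([f'ID{i:02d}' for i in range(23)], name='public_client_id')
nmodels = 10

def assign_holdout(index, nmodels):
    """Label each participant with the model whose hold-out set contains them."""
    # Cut positions 0..n-1 into nmodels contiguous, near-equal chunks
    chunks = np.array_split(np.arange(len(index)), nmodels)
    labels = np.empty(len(index), dtype=object)
    for model_k, chunk in enumerate(chunks):
        labels[chunk] = 'Model_' + str(model_k + 1).zfill(2)
    return pd.Series(labels, index=index, name='Testing')

# Two independent runs on the same row order give the same assignment
testing_a = assign_holdout(ids, nmodels)
testing_b = assign_holdout(ids, nmodels)
print(testing_a.value_counts().sort_index())
```

> The assignment is only stable if the input rows arrive in the same order each time, which is why the same input file and the same code are reused here.  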

### 1-1. Import the cleaned dataframe

> Omics dataframes are imported at each section later.  

In [None]:
#Import the baseline BMI dataframe
fileDir = '../210104_Biological-BMI-paper/ExportData/'
ipynbName = '210104_Biological-BMI-paper_RF-imputation_'
fileName = 'baseline-combiDF-with-RF-imputation.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('public_client_id')
##Take BMI and general covariates (without Race in this study)
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
tempDF = tempDF[tempL]

display(tempDF)

bmiDF = tempDF

### 1-2. Stratification with sex

In [None]:
#Stratify the cohort with sex
bmiDF_F = bmiDF.loc[bmiDF['Sex']=='F']
bmiDF_M = bmiDF.loc[bmiDF['Sex']=='M']
bmiDF_B = bmiDF#Not copy just rename
print('Female, Male, Both sex = ', len(bmiDF_F), ', ', len(bmiDF_M), ', ', len(bmiDF_B))

### 1-3. Split the cohort into 10 sets

In [None]:
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
nmodels = 10
tempD2 = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    #Split cohort to define the training and testing (hold-out) sets
    tempL = np.array_split(tempDF, nmodels)#List of DFs
    tempD = {}
    for model_k in range(nmodels):
        tempDF1 = tempL[model_k]
        model_n = 'Model_'+str(model_k+1).zfill(2)
        tempS = pd.Series(np.repeat(model_n, len(tempDF1)),
                          index=tempDF1.index, name='Testing')
        tempD[model_k] = tempS
    tempS = pd.concat(list(tempD.values()), axis=0)
    #Add the info to bmiDF
    tempDF = pd.merge(tempDF, tempS, left_index=True, right_index=True, how='left')
    tempD2[sex] = tempDF
    print(sex)
    display(tempDF)
    display(tempDF['Testing'].value_counts())
    print('')
#Update
bmiDF_F = tempD2['Female']
bmiDF_M = tempD2['Male']
bmiDF_B = tempD2['BothSex']

## 2. Metabolomics

### 2-1. Prepare the cleaned omics dataframe

> Because the fitted models are used for predictions, the dependent variable (BMI) is not needed. (Of note, BMI was not necessarily available at the same time point as the omics measurements.) However, the baseline omics DF needs to be prepared in addition to the time-series DF, because the time-series data are standardized with the baseline distribution.  
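> A minimal sketch of the standardization idea, assuming toy values: the mean and SD are estimated on the baseline rows only and then applied unchanged to the follow-up rows. The column names and numbers are hypothetical; `ddof=0` matches the population SD that `StandardScaler` uses.  

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two analytes, three baseline rows, four follow-up rows
base = pd.DataFrame({'analyte_1': [1.0, 2.0, 3.0],
                     'analyte_2': [10.0, 20.0, 60.0]})
follow = pd.DataFrame({'analyte_1': [2.0, 4.0, 0.0, 2.0],
                       'analyte_2': [30.0, 30.0, 80.0, 10.0]})

# Baseline mean and population SD (ddof=0) per column
mu = base.mean(axis=0)
sigma = base.std(axis=0, ddof=0)

# Both tables are transformed with the baseline parameters
base_z = (base - mu) / sigma
follow_z = (follow - mu) / sigma

print(base_z.describe())
print(follow_z.describe())
```

> Only the baseline table is guaranteed to have mean 0 and SD 1 after this transformation; the follow-up Z-scores express each measurement relative to the baseline distribution.  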

In [None]:
df_n = 'metDF'

#Import the cleaned baseline omics dataframes
fileDir = '../210104_Biological-BMI-paper/ExportData/'
ipynbName = '210104_Biological-BMI-paper_RF-imputation_'
fileName = 'baseline-'+df_n+'-with-RF-imputation.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('public_client_id')
print(df_n+' original shape:', tempDF.shape)

#Drop BMI and covariates
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'Race']
tempDF = tempDF.drop(columns=tempL)
display(tempDF)

baseDF = tempDF

In [None]:
df_n = 'metDF'

#Import the cleaned time-series omics dataframes
fileDir = './ExportData/'
ipynbName = '220804_Multiomics-BMI-NatMed1stRevision_RF-imputation-ver2_'
fileName = 'time-series-'+df_n+'-with-RF-imputation.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('KeyIndex')
tempDF = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
print(df_n+' original shape:', tempDF.shape)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))

#Drop BMI and covariates
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'Race']
tempDF = tempDF.drop(columns=tempL)
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))

tsDF = tempDF

In [None]:
#Check consistency
print('Baseline DF:')
display(baseDF.describe())
print('Baseline from the time-series DF:')
tempDF = tsDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
tempDF = tempDF.drop_duplicates('public_client_id', keep='first')
tempL = ['public_client_id', 'days_in_program', 'Season']
tempDF = tempDF.drop(columns=tempL)
display(tempDF.describe())

> –> Confirmed that the baseline measurements were consistent after the improved imputation.  

### 2-2. Standardization with the baseline distribution

In [None]:
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempL = ['public_client_id', 'days_in_program', 'Season']

#Check just in case
tempA1 = baseDF.columns.to_numpy()
tempA2 = tsDF.drop(columns=tempL).columns.to_numpy()
print('nVariables is consistent between baseline and time-series DFs:',
      len(tempA1)==len(tempA2))
print('Variable order is consistent between baseline and time-series DFs:',
      (tempA1==tempA2).sum()==len(tempA1))

tempD2 = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    tempL1 = tempDF.index.tolist()
    #Prepare baseline DF
    tempDF1 = baseDF.loc[tempL1]
    #Compute the mean and std for Z-score transformation based on the baseline distribution
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    scaler.fit(tempDF1)#Column direction
    #Z-score transformation of the baseline DF (just for confirmation)
    tempA = scaler.transform(tempDF1)
    tempDF1 = pd.DataFrame(data=tempA, index=tempDF1.index, columns=tempDF1.columns)
    
    #Prepare time-series DF
    tempDF2 = tsDF.loc[tsDF['public_client_id'].isin(tempL1)]
    tempDF = tempDF2.drop(columns=tempL)
    #Z-score transformation of the time-series DF with the baseline distribution
    tempA = scaler.transform(tempDF)
    tempDF = pd.DataFrame(data=tempA, index=tempDF.index, columns=tempDF.columns)
    #Recover the time-series metadata
    tempDF2 = tempDF2[tempL]
    tempDF2 = pd.merge(tempDF2, tempDF, left_index=True, right_index=True, how='left')
    
    print(sex)
    display(tempDF2)
    tempD2[sex] = tempDF2
    
    #Confirmation
    tempD = {'Baseline DF':tempDF1, 'Time-series DF':tempDF2}
    for df_n in tempD.keys():
        tempDF = tempD[df_n]
        print(' - '+df_n, tempDF.shape)
        if df_n=='Time-series DF':
            print('    -> Unique ID:', len(tempDF['public_client_id'].unique()))
            tempDF = tempDF.drop(columns=tempL)
        display(tempDF.describe())
        sns.set(style='ticks', font='Arial', context='notebook')
        plt.figure(figsize=(4, 3))
        for col_i in range(3):
            sns.distplot(tempDF.iloc[:, col_i], label=tempDF.columns[col_i])
        sns.despine()
        plt.xlabel(r'$Z$'+'-score')
        plt.ylabel('Density')
        plt.legend(bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1)
        plt.show()
    print('')

tsDF_F = tempD2['Female']
tsDF_M = tempD2['Male']
tsDF_B = tempD2['BothSex']

> –> Confirmed that the baseline summary was identical to the previous one (from the baseline LASSO modeling).  

### 2-3. Import the fitted LASSO models with the baseline measurements

In [None]:
yvar_model = 'MetBMI'
tempD = {}
for sex in ['Female', 'Male', 'BothSex']:
    #Import the LASSO beta-coefficients (including intercept)
    fileDir = './ExportData/'
    ipynbName = '220801_Multiomics-BMI-NatMed1stRevision_BMI-baseline-LASSO_'
    fileName = yvar_model+'-'+sex+'-LASSObcoefs.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Variable')
    #Drop summary columns
    tempL = ['Mean', 'SD', 'nZeros']
    tempDF = tempDF.drop(columns=tempL)
    tempD[sex] = tempDF
    
    #Check
    print(sex+':')
    print(' - Variables (without intercept):', len(tempDF)-1)
    display(tempDF)
    print('')

modelDF_F = tempD['Female']
modelDF_M = tempD['Male']
modelDF_B = tempD['BothSex']

### 2-4. Calculate predictions using the fitted models

> According to the LassoCV source, the `self.predict(X)` method calls the `self._decision_function(X)` method, which returns `safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_`. In this case, `safe_sparse_dot()` simply corresponds to a dot product. Hence, manual calculation from the beta-coefficients and intercept is implemented here.  
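> A minimal sketch of the equivalence behind the manual calculation, using random numbers in place of real data: adding a column of ones to the data and stacking the intercept as the last row of the coefficient table gives the same result as "dot product plus intercept". All array names and sizes are hypothetical.  

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_models = 5, 3, 2

X = rng.normal(size=(n_samples, n_features))     # stand-in for Z-scored analytes
coefs = rng.normal(size=(n_features, n_models))  # one column of beta-coefficients per model
intercepts = rng.normal(size=n_models)           # one intercept per model

# Form used by a fitted linear model: X . coef + intercept
pred_direct = X @ coefs + intercepts

# Equivalent form: dummy intercept variable (ones) as the last data column,
# intercept as the last row of the coefficient table
X_aug = np.hstack([X, np.ones((n_samples, 1))])
B_aug = np.vstack([coefs, intercepts])
pred_aug = X_aug @ B_aug

print(np.allclose(pred_direct, pred_aug))
```

> The equivalence holds only when the column order of the data matches the row order of the coefficient table, which is what the consistency check in the next cell verifies.  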

In [None]:
tempD1 = {'Female':tsDF_F, 'Male':tsDF_M, 'BothSex':tsDF_B}
tempD2 = {'Female':modelDF_F, 'Male':modelDF_M, 'BothSex':modelDF_B}
tempL = ['public_client_id', 'days_in_program', 'Season']

tempD = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    tempDF1 = tempDF.drop(columns=tempL)
    tempDF2 = tempD2[sex]
    
    #Add dummy intercept variable to data DF
    tempDF1['Intercept'] = 1.0
    
    #Check just in case
    tempA1 = tempDF1.columns.to_numpy()
    tempA2 = tempDF2.index.to_numpy()
    print(sex)
    print(' - nVariables is consistent between data and model DFs:',
          len(tempA1)==len(tempA2))
    print(' - Variable order is consistent between data and model DFs:',
          (tempA1==tempA2).sum()==len(tempA1))
    
    #Calculate prediction
    tempA = np.dot(tempDF1, tempDF2)
    tempDF1 = pd.DataFrame(tempA, index=tempDF1.index, columns=tempDF2.columns)
    
    #Recover the time-series metadata
    tempDF = tempDF[tempL]
    tempDF = pd.merge(tempDF, tempDF1, left_index=True, right_index=True, how='left')
    
    tempD[sex] = tempDF
    display(tempDF)
    print('')

predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']

> To obtain one single prediction for each participant and each time point, the calculated predictions from each model can be averaged.  
> ***–> However, in addition to the overfitting problem for the baseline predictions, there is a potential risk of data leakage even for the longitudinal predictions.***  
> –> In this version, the single prediction for each participant and each time point is taken from the model for which the participant was included in the baseline testing (hold-out) set.  
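> A minimal sketch of this selection on toy values: for each row, the prediction is read from the column named in `Testing`. It uses positional indexing instead of the row-by-row loop in the next cell, which is an alternative formulation and not the code used for the paper; the values and the output column names are hypothetical.  

```python
import numpy as np
import pandas as pd

# Hypothetical log-scale predictions from two models, plus hold-out labels
pred = pd.DataFrame({
    'public_client_id': ['A', 'A', 'B', 'C'],
    'Model_01': [3.10, 3.20, 3.30, 3.40],
    'Model_02': [3.15, 3.25, 3.35, 3.45],
    'Testing': ['Model_01', 'Model_01', 'Model_02', 'Model_01'],
})

model_cols = [c for c in pred.columns if c.startswith('Model_')]
# Column position of each row's hold-out model (-1 would mean an unknown label)
col_pos = pd.Index(model_cols).get_indexer(pred['Testing'])
row_pos = np.arange(len(pred))

# Pick one value per row, then convert back to the original scale
pred['log_pred'] = pred[model_cols].to_numpy()[row_pos, col_pos]
pred['pred'] = np.exp(pred['log_pred'])
print(pred[['public_client_id', 'Testing', 'log_pred', 'pred']])
```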

In [None]:
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempD2 = {'Female':predictDF_F, 'Male':predictDF_M, 'BothSex':predictDF_B}
yvar_model = 'MetBMI'
tempD = {}
for sex in tempD1.keys():
    #Retrieve the predictions for the testing (hold-out) set
    tempDF = tempD1[sex]
    tempS = tempDF['Testing']
    tempDF = tempD2[sex]
    tempDF = pd.merge(tempDF, tempS, left_on='public_client_id', right_index=True, how='left')
    tempL = []
    for row_i in range(len(tempDF)):
        model_n = tempDF['Testing'].iloc[row_i]
        tempL.append(tempDF[model_n].iloc[row_i])
    tempDF['log_'+yvar_model] = tempL
    
    #Drop the temporal prediction columns
    tempDF = tempDF.loc[:, ~tempDF.columns.str.contains('Model_')]
    
    #Convert to original scale
    tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
    
    tempD[sex] = tempDF
    print(sex)
    display(tempDF)
    print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
    print('')
#Update
predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']

### 2-5. Clean and save predictions

In [None]:
#Add the baseline info
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'BothSex':predictDF_B}
tempD2 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
yvar_model = 'MetBMI'
tempD = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    
    #Retrieve the baseline predictions
    tempDF1 = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
    tempDF1 = tempDF1.drop_duplicates('public_client_id', keep='first')
    tempDF1 = tempDF1.reset_index().set_index('public_client_id')
    tempDF1 = tempDF1.rename(columns={yvar_model:'Base'+yvar_model})
    
    #Add baseline BMI and covariate info
    tempDF2 = tempD2[sex]
    ##Replace the log-scaled BMI with the original scaled
    tempS = np.e**tempDF2['log_BaseBMI']
    tempS.name = 'BaseBMI'
    tempDF2 = pd.merge(tempS, tempDF2, left_index=True, right_index=True, how='left')
    tempDF2 = tempDF2.drop(columns=['log_BaseBMI', 'Testing'])
    tempDF1 = pd.merge(tempDF1['Base'+yvar_model], tempDF2,
                       left_index=True, right_index=True, how='left')
    
    #Obesity classification
    for bmi in ['BMI', yvar_model]:
        tempL = []
        for value in tempDF1['Base'+bmi].tolist():
            if np.isnan(value):
                tempL.append('NotCalculated')
            elif value < 18.5:
                tempL.append('Underweight')
            elif value < 25:
                tempL.append('Normal')
            elif value < 30:
                tempL.append('Overweight')
            elif value >= 30:
                tempL.append('Obese')
            else:#Just in case
                tempL.append('Error?')
        tempDF1['Base'+bmi+'_class'] = tempL
    
    #Check baseline summary
    print(sex+' baseline summary:')
    display(tempDF1.describe(include='all'))
    for bmi in ['BMI', yvar_model]:
        print('Base'+bmi+'_class:')
        tempS1 = tempDF1['Base'+bmi+'_class'].value_counts()
        tempDF2 = pd.DataFrame({'Count':tempS1, 'Percentage':tempS1/len(tempDF1)*100})
        display(tempDF2)
    
    #Merge
    tempDF = pd.merge(tempDF, tempDF1, left_on='public_client_id', right_index=True, how='left')
    tempD[sex] = tempDF
    display(tempDF)
    print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
    print('')
#Update
predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']
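
> A compact alternative sketch of the obesity classification above, assuming the same cut-offs (18.5, 25, and 30) with left-closed intervals. It uses `pd.cut` instead of the explicit loop; the BMI values are hypothetical.  

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values, including boundary cases and a missing value
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, np.nan], name='BaseBMI')

# right=False gives intervals [18.5, 25), [25, 30), [30, inf), as in the loop
bins = [-np.inf, 18.5, 25.0, 30.0, np.inf]
labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
bmi_class = pd.cut(bmi, bins=bins, labels=labels, right=False)

# Missing BMI values get an explicit label instead of NaN
bmi_class = bmi_class.astype(object).where(bmi.notna(), 'NotCalculated')
print(bmi_class.value_counts())
```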

In [None]:
#Save
yvar_model = 'MetBMI'

#Sex-stratified models
tempDF = pd.concat([predictDF_F, predictDF_M], axis=0)
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
fileDir = './ExportData/'
ipynbName = '220805_Multiomics-BMI-NatMed1stRevision_BMI-longitudinal-LASSO_'
fileName = yvar_model+'-FemaleMale.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

#Sex-mixed model
tempDF = predictDF_B
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
fileDir = './ExportData/'
ipynbName = '220805_Multiomics-BMI-NatMed1stRevision_BMI-longitudinal-LASSO_'
fileName = yvar_model+'-BothSex.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

### 2-6. Check consistency

In [None]:
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'Both sex':predictDF_B}
yvar_model = 'MetBMI'
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

#Plot difference b/w sex-specific and sex-mixed models
tempD2 = {'Female':'tab:red', 'Male':'tab:blue'}
range_min = np.min([df[var].min() for df in tempD1.values() for var in [yvar_model]])
range_max = np.max([df[var].max() for df in tempD1.values() for var in [yvar_model]])
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(6.5, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    #Prepare DF
    tempS1 = tempD1[sex][yvar_model]
    tempS1.name = 'Sex-specific'
    tempS2 = tempD1['Both sex'][yvar_model]
    tempS2.name = 'Sex-mixed'
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='inner')
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Sex-specific', y='Sex-mixed', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Sex-mixed b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Sex-specific'], tempDF['Sex-mixed'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==1:
        ax1_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax1_pos[0]+ax1_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Sex-specific b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

In [None]:
#Check consistency of the baseline predictions
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'Both sex':predictDF_B}
tempD2 = {'Female':'Female', 'Male':'Male', 'Both sex':'BothSex'}
yvar_model = 'MetBMI'
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

tempD = {}
for sex in tempD2.keys():
    #Retrieve the baseline predictions
    tempDF = tempD1[sex]
    tempDF = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
    tempDF = tempDF.drop_duplicates('public_client_id', keep='first')
    tempDF = tempDF.reset_index().set_index('public_client_id')
    tempS1 = tempDF['Base'+yvar_model]
    tempS1.name = 'Current'
    
    #Import the previous baseline prediction DF
    fileDir = './ExportData/'
    ipynbName = '220801_Multiomics-BMI-NatMed1stRevision_BMI-baseline-LASSO_'
    fileName = yvar_model+'-'+tempD2[sex]+'.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id':str})
    tempDF = tempDF.set_index('public_client_id')
    tempS2 = tempDF['Base'+yvar_model]
    tempS2.name = 'Previous'
    
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='outer')
    tempD[sex] = tempDF
    
    #Check exact values
    print(sex)
    print(' - Participant is consistent:', len(tempDF)==len(tempDF.dropna()))
    tempDF1 = tempDF.loc[tempDF['Current']!=tempDF['Previous']]
    print(' - Inconsistent baseline predictions:',
          len(tempDF1), '(', len(tempDF1)/len(tempDF)*100, '[%])')
    display(tempDF1)
    
    #Check obesity classification
    for bbmi in ['Current', 'Previous']:
        tempL = []
        for value in tempDF[bbmi].tolist():
            if np.isnan(value):
                tempL.append('NotCalculated')
            elif value < 18.5:
                tempL.append('Underweight')
            elif value < 25:
                tempL.append('Normal')
            elif value < 30:
                tempL.append('Overweight')
            elif value >= 30:
                tempL.append('Obese')
            else:#Just in case
                tempL.append('Error?')
        tempDF[bbmi+'_Base'+yvar_model+'_class'] = tempL
        print(' - '+bbmi+'_Base'+yvar_model+'_class:')
        tempS = tempDF[bbmi+'_Base'+yvar_model+'_class'].value_counts()
        tempDF1 = pd.DataFrame({'Count':tempS, 'Percentage':tempS/len(tempDF)*100})
        display(tempDF1)

#Plot current vs. previous baseline predictions per model
tempD2 = {'Female':'tab:red', 'Male':'tab:blue', 'Both sex':'tab:green'}
range_min = np.min([df[var].min() for df in tempD.values() for var in ['Current', 'Previous']])
range_max = np.max([df[var].max() for df in tempD.values() for var in ['Current', 'Previous']])
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(10, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    tempDF = tempD[sex]
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Previous', y='Current', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Current b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Previous'], tempDF['Current'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==2:
        ax2_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax2_pos[0]+ax2_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Previous b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

> –> The inconsistent predictions agreed to at least six decimal places, so the differences are probably due to floating-point rounding. Indeed, the bBMI classes were fully consistent.  
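> A minimal sketch of a tolerance-based comparison that separates rounding noise from real discrepancies. The values are hypothetical, and the absolute tolerance of 1e-6 is an assumption chosen to mirror the six decimal places noted above.  

```python
import numpy as np
import pandas as pd

# Hypothetical current vs. previous baseline predictions
df = pd.DataFrame({
    'Current':  [24.9999999, 31.2, 18.5],
    'Previous': [24.9999998, 31.2, 18.5],
})

# Exact comparison flags the first row although the difference is about 1e-7
exact_mismatch = df['Current'] != df['Previous']

# Comparison within an absolute tolerance treats all rows as consistent
close = np.isclose(df['Current'], df['Previous'], rtol=0.0, atol=1e-6)

print('Exact mismatches:', int(exact_mismatch.sum()))
print('Mismatches beyond tolerance:', int((~close).sum()))
```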

## 3. Proteomics

### 3-1. Prepare the cleaned omics dataframe

> Because the fitted models are used for predictions, the dependent variable (BMI) is not needed. (Of note, BMI was not necessarily available at the same time point as the omics measurements.) However, the baseline omics DF needs to be prepared in addition to the time-series DF, because the time-series data are standardized with the baseline distribution.  

In [None]:
df_n = 'protDF'

#Import the cleaned baseline omics dataframes
fileDir = '../210104_Biological-BMI-paper/ExportData/'
ipynbName = '210104_Biological-BMI-paper_RF-imputation_'
fileName = 'baseline-'+df_n+'-with-RF-imputation.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('public_client_id')
print(df_n+' original shape:', tempDF.shape)

#Drop BMI and covariates
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'Race']
tempDF = tempDF.drop(columns=tempL)
display(tempDF)

baseDF = tempDF

In [None]:
df_n = 'protDF'

#Import the cleaned time-series omics dataframes
fileDir = './ExportData/'
ipynbName = '220804_Multiomics-BMI-NatMed1stRevision_RF-imputation-ver2_'
fileName = 'time-series-'+df_n+'-with-RF-imputation.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('KeyIndex')
tempDF = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
print(df_n+' original shape:', tempDF.shape)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))

#Drop BMI and covariates
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'Race']
tempDF = tempDF.drop(columns=tempL)
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))

tsDF = tempDF

In [None]:
#Check consistency
print('Baseline DF:')
display(baseDF.describe())
print('Baseline from the time-series DF:')
tempDF = tsDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
tempDF = tempDF.drop_duplicates('public_client_id', keep='first')
tempL = ['public_client_id', 'days_in_program', 'Season']
tempDF = tempDF.drop(columns=tempL)
display(tempDF.describe())

> –> Confirmed that the baseline measurements were consistent after the improved imputation.  

### 3-2. Standardization with the baseline distribution

In [None]:
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempL = ['public_client_id', 'days_in_program', 'Season']

#Check just in case
tempA1 = baseDF.columns.to_numpy()
tempA2 = tsDF.drop(columns=tempL).columns.to_numpy()
print('nVariables is consistent between baseline and time-series DFs:',
      len(tempA1)==len(tempA2))
print('Variable order is consistent between baseline and time-series DFs:',
      (tempA1==tempA2).sum()==len(tempA1))

tempD2 = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    tempL1 = tempDF.index.tolist()
    #Prepare baseline DF
    tempDF1 = baseDF.loc[tempL1]
    #Compute the mean and std for Z-score transformation based on the baseline distribution
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    scaler.fit(tempDF1)#Column direction
    #Z-score transformation of the baseline DF (just for confirmation)
    tempA = scaler.transform(tempDF1)
    tempDF1 = pd.DataFrame(data=tempA, index=tempDF1.index, columns=tempDF1.columns)
    
    #Prepare time-series DF
    tempDF2 = tsDF.loc[tsDF['public_client_id'].isin(tempL1)]
    tempDF = tempDF2.drop(columns=tempL)
    #Z-score transformation of the time-series DF with the baseline distribution
    tempA = scaler.transform(tempDF)
    tempDF = pd.DataFrame(data=tempA, index=tempDF.index, columns=tempDF.columns)
    #Recover the time-series metadata
    tempDF2 = tempDF2[tempL]
    tempDF2 = pd.merge(tempDF2, tempDF, left_index=True, right_index=True, how='left')
    
    print(sex)
    display(tempDF2)
    tempD2[sex] = tempDF2
    
    #Confirmation
    tempD = {'Baseline DF':tempDF1, 'Time-series DF':tempDF2}
    for df_n in tempD.keys():
        tempDF = tempD[df_n]
        print(' - '+df_n, tempDF.shape)
        if df_n=='Time-series DF':
            print('    -> Unique ID:', len(tempDF['public_client_id'].unique()))
            tempDF = tempDF.drop(columns=tempL)
        display(tempDF.describe())
        sns.set(style='ticks', font='Arial', context='notebook')
        plt.figure(figsize=(4, 3))
        for col_i in range(3):
            sns.distplot(tempDF.iloc[:, col_i], label=tempDF.columns[col_i])
        sns.despine()
        plt.xlabel(r'$Z$'+'-score')
        plt.ylabel('Density')
        plt.legend(bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1)
        plt.show()
    print('')

tsDF_F = tempD2['Female']
tsDF_M = tempD2['Male']
tsDF_B = tempD2['BothSex']

> –> Confirmed that the baseline summary was identical to the previous one (from the baseline LASSO modeling).  

### 3-3. Import the fitted LASSO models with the baseline measurements

In [None]:
yvar_model = 'ProtBMI'
tempD = {}
for sex in ['Female', 'Male', 'BothSex']:
    #Import the LASSO beta-coefficients (including intercept)
    fileDir = './ExportData/'
    ipynbName = '220801_Multiomics-BMI-NatMed1stRevision_BMI-baseline-LASSO_'
    fileName = yvar_model+'-'+sex+'-LASSObcoefs.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Variable')
    #Drop summary columns
    tempL = ['Mean', 'SD', 'nZeros']
    tempDF = tempDF.drop(columns=tempL)
    tempD[sex] = tempDF
    
    #Check
    print(sex+':')
    print(' - Variables (without intercept):', len(tempDF)-1)
    display(tempDF)
    print('')

modelDF_F = tempD['Female']
modelDF_M = tempD['Male']
modelDF_B = tempD['BothSex']

### 3-4. Calculate predictions using the fitted models

> According to the LassoCV source, the `self.predict(X)` method calls the `self._decision_function(X)` method, which returns `safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_`. In this case, `safe_sparse_dot()` simply corresponds to a dot product. Hence, manual calculation from the beta-coefficients and intercept is implemented here.  

In [None]:
tempD1 = {'Female':tsDF_F, 'Male':tsDF_M, 'BothSex':tsDF_B}
tempD2 = {'Female':modelDF_F, 'Male':modelDF_M, 'BothSex':modelDF_B}
tempL = ['public_client_id', 'days_in_program', 'Season']

tempD = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    tempDF1 = tempDF.drop(columns=tempL)
    tempDF2 = tempD2[sex]
    
    #Add dummy intercept variable to data DF
    tempDF1['Intercept'] = 1.0
    
    #Check just in case
    tempA1 = tempDF1.columns.to_numpy()
    tempA2 = tempDF2.index.to_numpy()
    print(sex)
    print(' - nVariables is consistent between data and model DFs:',
          len(tempA1)==len(tempA2))
    print(' - Variable order is consistent between data and model DFs:',
          (tempA1==tempA2).sum()==len(tempA1))
    
    #Calculate prediction
    tempA = np.dot(tempDF1, tempDF2)
    tempDF1 = pd.DataFrame(tempA, index=tempDF1.index, columns=tempDF2.columns)
    
    #Recover the time-series metadata
    tempDF = tempDF[tempL]
    tempDF = pd.merge(tempDF, tempDF1, left_index=True, right_index=True, how='left')
    
    tempD[sex] = tempDF
    display(tempDF)
    print('')

predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']

> To obtain a single prediction for each participant at each time point, the predictions calculated from the individual models could be averaged.  
> ***–> However, in addition to the overfitting problem for the baseline predictions, there is a potential risk of data leakage even for the longitudinal predictions.***  
> –> In this version, the single prediction for each participant at each time point is taken from the model for which the participant was included in the baseline testing (hold-out) set.  
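> A minimal sketch of this hold-out selection on toy data: the participant IDs, the `Model_1` to `Model_3` column names, and all values are hypothetical, and `log_pred` is an illustrative column name. Each row takes the value from the column named in `Testing`.  

```python
import pandas as pd

# Hypothetical predictions from three models for repeated measurements
predDF = pd.DataFrame({
    'public_client_id': ['A', 'A', 'B', 'C'],
    'Model_1': [3.10, 3.12, 3.30, 3.50],
    'Model_2': [3.20, 3.22, 3.31, 3.52],
    'Model_3': [3.15, 3.18, 3.35, 3.49],
})
# Hypothetical assignment: the model whose testing set contained each participant
testing = pd.Series({'A': 'Model_2', 'B': 'Model_1', 'C': 'Model_3'}, name='Testing')

merged = predDF.merge(testing, left_on='public_client_id', right_index=True, how='left')
# For every row, pick the value from the column named in 'Testing'
merged['log_pred'] = [merged[model].iloc[i] for i, model in enumerate(merged['Testing'])]

print(merged[['public_client_id', 'Testing', 'log_pred']])
```

> Because the assignment is per participant, all time points of one participant are predicted by the same model, which never saw that participant during fitting.  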

In [None]:
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempD2 = {'Female':predictDF_F, 'Male':predictDF_M, 'BothSex':predictDF_B}
yvar_model = 'ProtBMI'
tempD = {}
for sex in tempD1.keys():
    #Retrieve the predictions for the testing (hold-out) set
    tempDF = tempD1[sex]
    tempS = tempDF['Testing']
    tempDF = tempD2[sex]
    tempDF = pd.merge(tempDF, tempS, left_on='public_client_id', right_index=True, how='left')
    tempL = []
    for row_i in range(len(tempDF)):
        model_n = tempDF['Testing'].iloc[row_i]
        tempL.append(tempDF[model_n].iloc[row_i])
    tempDF['log_'+yvar_model] = tempL
    
    #Drop the temporary prediction columns
    tempDF = tempDF.loc[:, ~tempDF.columns.str.contains('Model_')]
    
    #Convert to original scale
    tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
    
    tempD[sex] = tempDF
    print(sex)
    display(tempDF)
    print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
    print('')
#Update
predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']

### 3-5. Clean and save predictions

In [None]:
#Add the baseline info
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'BothSex':predictDF_B}
tempD2 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
yvar_model = 'ProtBMI'
tempD = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    
    #Retrieve the baseline predictions
    tempDF1 = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
    tempDF1 = tempDF1.drop_duplicates('public_client_id', keep='first')
    tempDF1 = tempDF1.reset_index().set_index('public_client_id')
    tempDF1 = tempDF1.rename(columns={yvar_model:'Base'+yvar_model})
    
    #Add baseline BMI and covariate info
    tempDF2 = tempD2[sex]
    ##Replace the log-scaled BMI with the original scaled
    tempS = np.e**tempDF2['log_BaseBMI']
    tempS.name = 'BaseBMI'
    tempDF2 = pd.merge(tempS, tempDF2, left_index=True, right_index=True, how='left')
    tempDF2 = tempDF2.drop(columns=['log_BaseBMI', 'Testing'])
    tempDF1 = pd.merge(tempDF1['Base'+yvar_model], tempDF2,
                       left_index=True, right_index=True, how='left')
    
    #Obesity classification
    for bmi in ['BMI', yvar_model]:
        tempL = []
        for value in tempDF1['Base'+bmi].tolist():
            if np.isnan(value):
                tempL.append('NotCalculated')
            elif value < 18.5:
                tempL.append('Underweight')
            elif value < 25:
                tempL.append('Normal')
            elif value < 30:
                tempL.append('Overweight')
            elif value >= 30:
                tempL.append('Obese')
            else:#Just in case
                tempL.append('Error?')
        tempDF1['Base'+bmi+'_class'] = tempL
    
    #Check baseline summary
    print(sex+' baseline summary:')
    display(tempDF1.describe(include='all'))
    for bmi in ['BMI', yvar_model]:
        print('Base'+bmi+'_class:')
        tempS1 = tempDF1['Base'+bmi+'_class'].value_counts()
        tempDF2 = pd.DataFrame({'Count':tempS1, 'Percentage':tempS1/len(tempDF1)*100})
        display(tempDF2)
    
    #Merge
    tempDF = pd.merge(tempDF, tempDF1, left_on='public_client_id', right_index=True, how='left')
    tempD[sex] = tempDF
    display(tempDF)
    print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
    print('')
#Update
predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']

In [None]:
#Save
yvar_model = 'ProtBMI'

#Sex-stratified models
tempDF = pd.concat([predictDF_F, predictDF_M], axis=0)
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
fileDir = './ExportData/'
ipynbName = '220805_Multiomics-BMI-NatMed1stRevision_BMI-longitudinal-LASSO_'
fileName = yvar_model+'-FemaleMale.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

#Sex-mixed model
tempDF = predictDF_B
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
fileDir = './ExportData/'
ipynbName = '220805_Multiomics-BMI-NatMed1stRevision_BMI-longitudinal-LASSO_'
fileName = yvar_model+'-BothSex.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

### 3-6. Check consistency

In [None]:
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'Both sex':predictDF_B}
yvar_model = 'ProtBMI'
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

#Plot difference b/w sex-specific and sex-mixed models
tempD2 = {'Female':'tab:red', 'Male':'tab:blue'}
range_min = np.min([df[var].min() for df in tempD1.values() for var in [yvar_model]])
range_max = np.max([df[var].max() for df in tempD1.values() for var in [yvar_model]])
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(6.5, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    #Prepare DF
    tempS1 = tempD1[sex][yvar_model]
    tempS1.name = 'Sex-specific'
    tempS2 = tempD1['Both sex'][yvar_model]
    tempS2.name = 'Sex-mixed'
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='inner')
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Sex-specific', y='Sex-mixed', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Sex-mixed b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Sex-specific'], tempDF['Sex-mixed'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==1:
        ax1_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax1_pos[0]+ax1_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Sex-specific b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

In [None]:
#Check consistency of the baseline predictions
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'Both sex':predictDF_B}
tempD2 = {'Female':'Female', 'Male':'Male', 'Both sex':'BothSex'}
yvar_model = 'ProtBMI'
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

tempD = {}
for sex in tempD2.keys():
    #Retrieve the baseline predictions
    tempDF = tempD1[sex]
    tempDF = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
    tempDF = tempDF.drop_duplicates('public_client_id', keep='first')
    tempDF = tempDF.reset_index().set_index('public_client_id')
    tempS1 = tempDF['Base'+yvar_model]
    tempS1.name = 'Current'
    
    #Import the previous baseline prediction DF
    fileDir = './ExportData/'
    ipynbName = '220801_Multiomics-BMI-NatMed1stRevision_BMI-baseline-LASSO_'
    fileName = yvar_model+'-'+tempD2[sex]+'.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id':str})
    tempDF = tempDF.set_index('public_client_id')
    tempS2 = tempDF['Base'+yvar_model]
    tempS2.name = 'Previous'
    
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='outer')
    tempD[sex] = tempDF
    
    #Check exact values
    print(sex)
    print(' - Participant is consistent:', len(tempDF)==len(tempDF.dropna()))
    tempDF1 = tempDF.loc[tempDF['Current']!=tempDF['Previous']]
    print(' - Inconsistent baseline predictions:',
          len(tempDF1), '(', len(tempDF1)/len(tempDF)*100, '[%])')
    display(tempDF1)
    
    #Check obesity classification
    for bbmi in ['Current', 'Previous']:
        tempL = []
        for value in tempDF[bbmi].tolist():
            if np.isnan(value):
                tempL.append('NotCalculated')
            elif value < 18.5:
                tempL.append('Underweight')
            elif value < 25:
                tempL.append('Normal')
            elif value < 30:
                tempL.append('Overweight')
            elif value >= 30:
                tempL.append('Obese')
            else:#Just in case
                tempL.append('Error?')
        tempDF[bbmi+'_Base'+yvar_model+'_class'] = tempL
        print(' - '+bbmi+'_Base'+yvar_model+'_class:')
        tempS = tempDF[bbmi+'_Base'+yvar_model+'_class'].value_counts()
        tempDF1 = pd.DataFrame({'Count':tempS, 'Percentage':tempS/len(tempDF)*100})
        display(tempDF1)

#Plot current vs. previous baseline predictions per model
tempD2 = {'Female':'tab:red', 'Male':'tab:blue', 'Both sex':'tab:green'}
range_min = np.min([df[var].min() for df in tempD.values() for var in ['Current', 'Previous']])
range_max = np.max([df[var].max() for df in tempD.values() for var in ['Current', 'Previous']])
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(10, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    tempDF = tempD[sex]
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Previous', y='Current', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Current b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Previous'], tempDF['Current'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==2:
        ax2_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax2_pos[0]+ax2_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Previous b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

> –> The inconsistent predictions agreed to at least six decimal places, so the differences are probably due to floating-point error. Indeed, the bBMI classes are fully consistent.  
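> A minimal sketch of a tolerance-based comparison that separates floating-point noise from real discrepancies: the values are hypothetical, and the absolute tolerance of 1e-6 is an assumption chosen to mirror the "six decimal places" observation.  

```python
import numpy as np
import pandas as pd

# Hypothetical values: 'previous' lost a few digits, e.g. through a text file round trip
current = pd.Series([24.123456789012345, 31.5, 18.499999999], name='Current')
previous = pd.Series([24.123456789012, 31.5, 18.499999999], name='Previous')

# Exact comparison flags the first pair as inconsistent
exact_mismatch = int((current != previous).sum())
# Comparison with an absolute tolerance does not
tol_mismatch = int((~np.isclose(current, previous, rtol=0, atol=1e-6)).sum())

print('Exact mismatches:', exact_mismatch)
print('Mismatches beyond 1e-6:', tol_mismatch)
```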

## 4. Clinical labs

### 4-1. Prepare the cleaned omics dataframe

> Because the fitted models are used for predictions, the dependent variable (BMI) is not needed. (Of note, BMI was not necessarily available at the same time points as the omics measurements.) However, the baseline omics DF must be prepared in addition to the time-series DF because it defines the distribution used for standardization.  
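> A minimal sketch of why the baseline DF is needed: the time-series values are Z-scored with the baseline mean and SD, not with their own. The analyte names and values are hypothetical. The population SD (`ddof=0`) is used because that is what `StandardScaler` computes.  

```python
import numpy as np
import pandas as pd

# Hypothetical baseline and time-series measurements for two analytes
base = pd.DataFrame({'analyte_1': [1.0, 2.0, 3.0], 'analyte_2': [10.0, 20.0, 60.0]})
ts = pd.DataFrame({'analyte_1': [2.0, 5.0], 'analyte_2': [30.0, 30.0]})

# Mean and population SD are estimated from the baseline only
mu = base.mean(axis=0)
sd = base.std(axis=0, ddof=0)

base_z = (base - mu) / sd  # mean 0 and SD 1 by construction
ts_z = (ts - mu) / sd      # expressed relative to the baseline distribution

print(ts_z)
```

> With this scaling, a time-series Z-score of 0 means "equal to the baseline mean", so the baseline-fitted coefficients can be applied to later time points unchanged.  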

In [None]:
df_n = 'chemDF'

#Import the cleaned baseline omics dataframes
fileDir = '../210104_Biological-BMI-paper/ExportData/'
ipynbName = '210104_Biological-BMI-paper_RF-imputation_'
fileName = 'baseline-'+df_n+'-with-RF-imputation.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('public_client_id')
print(df_n+' original shape:', tempDF.shape)

#Drop BMI and covariates
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'Race']
tempDF = tempDF.drop(columns=tempL)
display(tempDF)

baseDF = tempDF

In [None]:
df_n = 'chemDF'

#Import the cleaned time-series omics dataframes
fileDir = './ExportData/'
ipynbName = '220804_Multiomics-BMI-NatMed1stRevision_RF-imputation-ver2_'
fileName = 'time-series-'+df_n+'-with-RF-imputation.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('KeyIndex')
tempDF = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
print(df_n+' original shape:', tempDF.shape)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))

#Drop BMI and covariates
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'Race']
tempDF = tempDF.drop(columns=tempL)
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))

tsDF = tempDF

In [None]:
#Check consistency
print('Baseline DF:')
display(baseDF.describe())
print('Baseline from the time-series DF:')
tempDF = tsDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
tempDF = tempDF.drop_duplicates('public_client_id', keep='first')
tempL = ['public_client_id', 'days_in_program', 'Season']
tempDF = tempDF.drop(columns=tempL)
display(tempDF.describe())

> –> Confirmed that the baseline measurements were consistent after the improved imputation.  

### 4-2. Standardization with the baseline distribution

In [None]:
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempL = ['public_client_id', 'days_in_program', 'Season']

#Check just in case
tempA1 = baseDF.columns.to_numpy()
tempA2 = tsDF.drop(columns=tempL).columns.to_numpy()
print('nVariables is consistent between baseline and time-series DFs:',
      len(tempA1)==len(tempA2))
print('Variable order is consistent between baseline and time-series DFs:',
      (tempA1==tempA2).sum()==len(tempA1))

tempD2 = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    tempL1 = tempDF.index.tolist()
    #Prepare baseline DF
    tempDF1 = baseDF.loc[tempL1]
    #Compute the mean and std for Z-score transformation based on the baseline distribution
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    scaler.fit(tempDF1)#Column direction
    #Z-score transformation of the baseline DF (just for confirmation)
    tempA = scaler.transform(tempDF1)
    tempDF1 = pd.DataFrame(data=tempA, index=tempDF1.index, columns=tempDF1.columns)
    
    #Prepare time-series DF
    tempDF2 = tsDF.loc[tsDF['public_client_id'].isin(tempL1)]
    tempDF = tempDF2.drop(columns=tempL)
    #Z-score transformation of the time-series DF with the baseline distribution
    tempA = scaler.transform(tempDF)
    tempDF = pd.DataFrame(data=tempA, index=tempDF.index, columns=tempDF.columns)
    #Recover the time-series metadata
    tempDF2 = tempDF2[tempL]
    tempDF2 = pd.merge(tempDF2, tempDF, left_index=True, right_index=True, how='left')
    
    print(sex)
    display(tempDF2)
    tempD2[sex] = tempDF2
    
    #Confirmation
    tempD = {'Baseline DF':tempDF1, 'Time-series DF':tempDF2}
    for df_n in tempD.keys():
        tempDF = tempD[df_n]
        print(' - '+df_n, tempDF.shape)
        if df_n=='Time-series DF':
            print('    -> Unique ID:', len(tempDF['public_client_id'].unique()))
            tempDF = tempDF.drop(columns=tempL)
        display(tempDF.describe())
        sns.set(style='ticks', font='Arial', context='notebook')
        plt.figure(figsize=(4, 3))
        for col_i in range(3):
            sns.distplot(tempDF.iloc[:, col_i], label=tempDF.columns[col_i])
        sns.despine()
        plt.xlabel(r'$Z$'+'-score')
        plt.ylabel('Density')
        plt.legend(bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1)
        plt.show()
    print('')

tsDF_F = tempD2['Female']
tsDF_M = tempD2['Male']
tsDF_B = tempD2['BothSex']

> –> Confirmed that the baseline summary was exactly the same as before (i.e., as in the baseline LASSO modeling).  

### 4-3. Import the fitted LASSO models with the baseline measurements

In [None]:
yvar_model = 'ChemBMI'
tempD = {}
for sex in ['Female', 'Male', 'BothSex']:
    #Import the LASSO beta-coefficients (including intercept)
    fileDir = './ExportData/'
    ipynbName = '220801_Multiomics-BMI-NatMed1stRevision_BMI-baseline-LASSO_'
    fileName = yvar_model+'-'+sex+'-LASSObcoefs.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Variable')
    #Drop summary columns
    tempL = ['Mean', 'SD', 'nZeros']
    tempDF = tempDF.drop(columns=tempL)
    tempD[sex] = tempDF
    
    #Check
    print(sex+':')
    print(' - Variables (without intercept):', len(tempDF)-1)
    display(tempDF)
    print('')

modelDF_F = tempD['Female']
modelDF_M = tempD['Male']
modelDF_B = tempD['BothSex']

### 4-4. Calculate predictions using the fitted models

> According to the LassoCV source code, the `self.predict(X)` method calls the `self._decision_function(X)` method, which returns `safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_`. In this case, `safe_sparse_dot()` simply corresponds to a dot product. Hence, manual calculation from the beta-coefficients and intercept is implemented here.  
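> A positional dot product is only correct when the data columns follow the same order as the rows of the coefficient table, which is why the order is checked before the calculation. As a hedged illustration on hypothetical toy data (the names `x1`, `x2`, and `Model_1` are made up), the sketch below shows how a shuffled column order silently changes a positional result, whereas pandas `DataFrame.dot` aligns by label.  

```python
import numpy as np
import pandas as pd

# Hypothetical data (with dummy intercept column) and coefficient table
data = pd.DataFrame({'x1': [1.0, 2.0], 'x2': [0.5, -0.5], 'Intercept': [1.0, 1.0]})
model = pd.DataFrame({'Model_1': [0.2, -0.4, 3.0]}, index=['x1', 'x2', 'Intercept'])

# Correct result when the column order matches the coefficient order
pred_ordered = np.dot(data.to_numpy(), model.to_numpy())

# Same data with shuffled columns
shuffled = data[['Intercept', 'x2', 'x1']]
pred_positional = np.dot(shuffled.to_numpy(), model.to_numpy())  # order-dependent
pred_labelled = shuffled.dot(model)                              # aligned by label

print(np.allclose(pred_positional, pred_ordered))
print(np.allclose(pred_labelled.to_numpy(), pred_ordered))
```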

In [None]:
tempD1 = {'Female':tsDF_F, 'Male':tsDF_M, 'BothSex':tsDF_B}
tempD2 = {'Female':modelDF_F, 'Male':modelDF_M, 'BothSex':modelDF_B}
tempL = ['public_client_id', 'days_in_program', 'Season']

tempD = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    tempDF1 = tempDF.drop(columns=tempL)
    tempDF2 = tempD2[sex]
    
    #Add dummy intercept variable to data DF
    tempDF1['Intercept'] = 1.0
    
    #Check just in case
    tempA1 = tempDF1.columns.to_numpy()
    tempA2 = tempDF2.index.to_numpy()
    print(sex)
    print(' - nVariables is consistent between data and model DFs:',
          len(tempA1)==len(tempA2))
    print(' - Variable order is consistent between data and model DFs:',
          (tempA1==tempA2).sum()==len(tempA1))
    
    #Calculate prediction
    tempA = np.dot(tempDF1, tempDF2)
    tempDF1 = pd.DataFrame(tempA, index=tempDF1.index, columns=tempDF2.columns)
    
    #Recover the time-series metadata
    tempDF = tempDF[tempL]
    tempDF = pd.merge(tempDF, tempDF1, left_index=True, right_index=True, how='left')
    
    tempD[sex] = tempDF
    display(tempDF)
    print('')

predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']

> To obtain a single prediction for each participant at each time point, the predictions calculated from the individual models could be averaged.  
> ***–> However, in addition to the overfitting problem for the baseline predictions, there is a potential risk of data leakage even for the longitudinal predictions.***  
> –> In this version, the single prediction for each participant at each time point is taken from the model for which the participant was included in the baseline testing (hold-out) set.  
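> The hold-out selection assumes that every participant in the time-series DF has exactly one assigned model and that the assigned name exists among the prediction columns. A minimal sketch of such a sanity check follows. All IDs and model names are hypothetical, and this check is not part of the original workflow.  

```python
import pandas as pd

# Hypothetical prediction columns, hold-out assignment, and time-series IDs
pred_cols = ['Model_1', 'Model_2', 'Model_3']
testing = pd.Series({'A': 'Model_2', 'B': 'Model_1', 'C': 'Model_3'}, name='Testing')
ts_ids = pd.Series(['A', 'A', 'B', 'C', 'C'], name='public_client_id')

# Participants without any hold-out assignment
missing_ids = sorted(set(ts_ids) - set(testing.index))
# Assigned model names that have no prediction column
unknown_models = sorted(set(testing) - set(pred_cols))
# Participants assigned more than once
duplicated_ids = testing.index[testing.index.duplicated()].tolist()

print('Missing IDs:', missing_ids)
print('Unknown models:', unknown_models)
print('Duplicated IDs:', duplicated_ids)
```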

In [None]:
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempD2 = {'Female':predictDF_F, 'Male':predictDF_M, 'BothSex':predictDF_B}
yvar_model = 'ChemBMI'
tempD = {}
for sex in tempD1.keys():
    #Retrieve the predictions for the testing (hold-out) set
    tempDF = tempD1[sex]
    tempS = tempDF['Testing']
    tempDF = tempD2[sex]
    tempDF = pd.merge(tempDF, tempS, left_on='public_client_id', right_index=True, how='left')
    tempL = []
    for row_i in range(len(tempDF)):
        model_n = tempDF['Testing'].iloc[row_i]
        tempL.append(tempDF[model_n].iloc[row_i])
    tempDF['log_'+yvar_model] = tempL
    
    #Drop the temporary prediction columns
    tempDF = tempDF.loc[:, ~tempDF.columns.str.contains('Model_')]
    
    #Convert to original scale
    tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
    
    tempD[sex] = tempDF
    print(sex)
    display(tempDF)
    print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
    print('')
#Update
predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']

### 4-5. Clean and save predictions

In [None]:
#Add the baseline info
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'BothSex':predictDF_B}
tempD2 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
yvar_model = 'ChemBMI'
tempD = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    
    #Retrieve the baseline predictions
    tempDF1 = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
    tempDF1 = tempDF1.drop_duplicates('public_client_id', keep='first')
    tempDF1 = tempDF1.reset_index().set_index('public_client_id')
    tempDF1 = tempDF1.rename(columns={yvar_model:'Base'+yvar_model})
    
    #Add baseline BMI and covariate info
    tempDF2 = tempD2[sex]
    ##Replace the log-scaled BMI with the original scaled
    tempS = np.e**tempDF2['log_BaseBMI']
    tempS.name = 'BaseBMI'
    tempDF2 = pd.merge(tempS, tempDF2, left_index=True, right_index=True, how='left')
    tempDF2 = tempDF2.drop(columns=['log_BaseBMI', 'Testing'])
    tempDF1 = pd.merge(tempDF1['Base'+yvar_model], tempDF2,
                       left_index=True, right_index=True, how='left')
    
    #Obesity classification
    for bmi in ['BMI', yvar_model]:
        tempL = []
        for value in tempDF1['Base'+bmi].tolist():
            if np.isnan(value):
                tempL.append('NotCalculated')
            elif value < 18.5:
                tempL.append('Underweight')
            elif value < 25:
                tempL.append('Normal')
            elif value < 30:
                tempL.append('Overweight')
            elif value >= 30:
                tempL.append('Obese')
            else:#Just in case
                tempL.append('Error?')
        tempDF1['Base'+bmi+'_class'] = tempL
    
    #Check baseline summary
    print(sex+' baseline summary:')
    display(tempDF1.describe(include='all'))
    for bmi in ['BMI', yvar_model]:
        print('Base'+bmi+'_class:')
        tempS1 = tempDF1['Base'+bmi+'_class'].value_counts()
        tempDF2 = pd.DataFrame({'Count':tempS1, 'Percentage':tempS1/len(tempDF1)*100})
        display(tempDF2)
    
    #Merge
    tempDF = pd.merge(tempDF, tempDF1, left_on='public_client_id', right_index=True, how='left')
    tempD[sex] = tempDF
    display(tempDF)
    print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
    print('')
#Update
predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']

In [None]:
#Save
yvar_model = 'ChemBMI'

#Sex-stratified models
tempDF = pd.concat([predictDF_F, predictDF_M], axis=0)
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
fileDir = './ExportData/'
ipynbName = '220805_Multiomics-BMI-NatMed1stRevision_BMI-longitudinal-LASSO_'
fileName = yvar_model+'-FemaleMale.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

#Sex-mixed model
tempDF = predictDF_B
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
fileDir = './ExportData/'
ipynbName = '220805_Multiomics-BMI-NatMed1stRevision_BMI-longitudinal-LASSO_'
fileName = yvar_model+'-BothSex.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

### 4-6. Check consistency

In [None]:
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'Both sex':predictDF_B}
yvar_model = 'ChemBMI'
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

#Plot difference b/w sex-specific and sex-mixed models
tempD2 = {'Female':'tab:red', 'Male':'tab:blue'}
range_min = np.min([df[var].min() for df in tempD1.values() for var in [yvar_model]])
range_max = np.max([df[var].max() for df in tempD1.values() for var in [yvar_model]])
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(6.5, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    #Prepare DF
    tempS1 = tempD1[sex][yvar_model]
    tempS1.name = 'Sex-specific'
    tempS2 = tempD1['Both sex'][yvar_model]
    tempS2.name = 'Sex-mixed'
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='inner')
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Sex-specific', y='Sex-mixed', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Sex-mixed b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Sex-specific'], tempDF['Sex-mixed'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==1:
        ax1_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax1_pos[0]+ax1_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Sex-specific b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

In [None]:
#Check consistency of the baseline predictions
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'Both sex':predictDF_B}
tempD2 = {'Female':'Female', 'Male':'Male', 'Both sex':'BothSex'}
yvar_model = 'ChemBMI'
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

tempD = {}
for sex in tempD2.keys():
    #Retrieve the baseline predictions
    tempDF = tempD1[sex]
    tempDF = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
    tempDF = tempDF.drop_duplicates('public_client_id', keep='first')
    tempDF = tempDF.reset_index().set_index('public_client_id')
    tempS1 = tempDF['Base'+yvar_model]
    tempS1.name = 'Current'
    
    #Import the previous baseline prediction DF
    fileDir = './ExportData/'
    ipynbName = '220801_Multiomics-BMI-NatMed1stRevision_BMI-baseline-LASSO_'
    fileName = yvar_model+'-'+tempD2[sex]+'.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id':str})
    tempDF = tempDF.set_index('public_client_id')
    tempS2 = tempDF['Base'+yvar_model]
    tempS2.name = 'Previous'
    
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='outer')
    tempD[sex] = tempDF
    
    #Check exact values
    print(sex)
    print(' - Participant is consistent:', len(tempDF)==len(tempDF.dropna()))
    tempDF1 = tempDF.loc[tempDF['Current']!=tempDF['Previous']]
    print(' - Inconsistent baseline predictions:',
          len(tempDF1), '(', len(tempDF1)/len(tempDF)*100, '[%])')
    display(tempDF1)
    
    #Check obesity classification
    for bbmi in ['Current', 'Previous']:
        tempL = []
        for value in tempDF[bbmi].tolist():
            if np.isnan(value):
                tempL.append('NotCalculated')
            elif value < 18.5:
                tempL.append('Underweight')
            elif value < 25:
                tempL.append('Normal')
            elif value < 30:
                tempL.append('Overweight')
            elif value >= 30:
                tempL.append('Obese')
            else:#Just in case
                tempL.append('Error?')
        tempDF[bbmi+'_Base'+yvar_model+'_class'] = tempL
        print(' - '+bbmi+'_Base'+yvar_model+'_class:')
        tempS = tempDF[bbmi+'_Base'+yvar_model+'_class'].value_counts()
        tempDF1 = pd.DataFrame({'Count':tempS, 'Percentage':tempS/len(tempDF)*100})
        display(tempDF1)

#Plot current vs. previous baseline predictions per model
tempD2 = {'Female':'tab:red', 'Male':'tab:blue', 'Both sex':'tab:green'}
range_min = np.min([df[var].min() for df in tempD.values() for var in ['Current', 'Previous']])
range_max = np.max([df[var].max() for df in tempD.values() for var in ['Current', 'Previous']])
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(10, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    tempDF = tempD[sex]
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Previous', y='Current', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Current b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Previous'], tempDF['Current'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==2:
        ax2_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax2_pos[0]+ax2_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Previous b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

> –> The inconsistent predictions matched to at least six decimal places, so the differences are probably due to floating-point round-off. In fact, the bBMI classes are fully consistent.  
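
> As a complement to the exact `!=` check above, a tolerance-based comparison can separate round-off noise from real disagreement. The following is only a minimal sketch on toy values (not Arivale data); the name `toyDF` and the `atol=1e-6` threshold are assumptions chosen to mirror the "six decimal places" observation.  

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged 'Current' vs. 'Previous' baseline predictions
# (illustrative values only; not taken from the Arivale data)
toyDF = pd.DataFrame({'Current': [22.1234561, 27.5, 31.2, np.nan],
                      'Previous': [22.1234562, 27.5, 30.9, np.nan]},
                     index=['id1', 'id2', 'id3', 'id4'])

# Exact comparison: flags tiny round-off differences, and NaN != NaN is True
exact_mismatch = toyDF['Current'] != toyDF['Previous']

# Tolerance-based comparison (assumed absolute tolerance of 1e-6);
# equal_nan=True treats a pair of missing values as matching
close = np.isclose(toyDF['Current'].to_numpy(), toyDF['Previous'].to_numpy(),
                   rtol=0.0, atol=1e-6, equal_nan=True)
tol_mismatch = pd.Series(~close, index=toyDF.index)

print('Exact mismatches:', int(exact_mismatch.sum()))
print('Mismatches beyond tolerance:', int(tol_mismatch.sum()))
```

> With such a check, only rows that differ beyond the tolerance would need to be inspected manually.  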

## 5. Combined omics

### 5-1. Prepare the cleaned omics dataframe

> Because the fitted models are used for predictions, the dependent variable (BMI) is not needed. (Of note, BMI was not necessarily measured at the same time points as the omics.) However, both the time-series and the baseline omics DFs need to be prepared, because the standardization uses the baseline distribution.  
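
> The role of the baseline DF can be illustrated with a small sketch: the scaler is fitted on the baseline rows only, and the same mean and SD are then applied to the time-series rows. The toy tables and analyte names (`toy_baseDF`, `toy_tsDF`, `a`, `b`) are hypothetical and serve only as an illustration.  

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy baseline and time-series tables with hypothetical analytes 'a' and 'b'
toy_baseDF = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]},
                          index=['id1', 'id2', 'id3'])
toy_tsDF = pd.DataFrame({'a': [1.0, 4.0], 'b': [10.0, 20.0]},
                        index=['id1_t0', 'id1_t1'])

# Fit on the baseline distribution only (column-wise mean and SD)
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
scaler.fit(toy_baseDF)

# Apply the baseline mean and SD to the time-series measurements
toy_tsZ = pd.DataFrame(scaler.transform(toy_tsDF),
                       index=toy_tsDF.index, columns=toy_tsDF.columns)

# Equivalent manual calculation; StandardScaler uses the population SD (ddof=0)
manualZ = (toy_tsDF - toy_baseDF.mean()) / toy_baseDF.std(ddof=0)
print(toy_tsZ)
print('Matches manual Z-score:', np.allclose(toy_tsZ, manualZ))
```

> Note that the time-series Z-scores are not expected to have mean 0 and SD 1, because they are expressed relative to the baseline distribution.  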

In [None]:
df_n = 'combiDF'

#Import the cleaned baseline omics dataframes
fileDir = '../210104_Biological-BMI-paper/ExportData/'
ipynbName = '210104_Biological-BMI-paper_RF-imputation_'
fileName = 'baseline-'+df_n+'-with-RF-imputation.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('public_client_id')
print(df_n+' original shape:', tempDF.shape)

#Drop BMI and covariates
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'Race']
tempDF = tempDF.drop(columns=tempL)
display(tempDF)

baseDF = tempDF

In [None]:
df_n = 'combiDF'

#Import the cleaned time-series omics dataframes
fileDir = './ExportData/'
ipynbName = '220804_Multiomics-BMI-NatMed1stRevision_RF-imputation-ver2_'
fileName = 'time-series-'+df_n+'-with-RF-imputation.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.set_index('KeyIndex')
tempDF = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
print(df_n+' original shape:', tempDF.shape)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))

#Drop BMI and covariates
tempL = ['log_BaseBMI', 'Sex', 'BaseAge', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'Race']
tempDF = tempDF.drop(columns=tempL)
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))

tsDF = tempDF

In [None]:
#Check consistency
print('Baseline DF:')
display(baseDF.describe())
print('Baseline from the time-series DF:')
tempDF = tsDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
tempDF = tempDF.drop_duplicates('public_client_id', keep='first')
tempL = ['public_client_id', 'days_in_program', 'Season']
tempDF = tempDF.drop(columns=tempL)
display(tempDF.describe())

> –> Confirmed that the baseline measurements were consistent after the improved imputation.  

### 5-2. Standardization with the baseline distribution

In [None]:
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempL = ['public_client_id', 'days_in_program', 'Season']

#Check just in case
tempA1 = baseDF.columns.to_numpy()
tempA2 = tsDF.drop(columns=tempL).columns.to_numpy()
print('nVariables is consistent between baseline and time-series DFs:',
      len(tempA1)==len(tempA2))
print('Variable order is consistent between baseline and time-series DFs:',
      np.array_equal(tempA1, tempA2))#Also safe when the lengths differ

tempD2 = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    tempL1 = tempDF.index.tolist()
    #Prepare baseline DF
    tempDF1 = baseDF.loc[tempL1]
    #Compute the mean and std for Z-score transformation based on the baseline distribution
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
    scaler.fit(tempDF1)#Column direction
    #Z-score transformation of the baseline DF (just for confirmation)
    tempA = scaler.transform(tempDF1)
    tempDF1 = pd.DataFrame(data=tempA, index=tempDF1.index, columns=tempDF1.columns)
    
    #Prepare time-series DF
    tempDF2 = tsDF.loc[tsDF['public_client_id'].isin(tempL1)]
    tempDF = tempDF2.drop(columns=tempL)
    #Z-score transformation of the time-series DF with the baseline distribution
    tempA = scaler.transform(tempDF)
    tempDF = pd.DataFrame(data=tempA, index=tempDF.index, columns=tempDF.columns)
    #Recover the time-series metadata
    tempDF2 = tempDF2[tempL]
    tempDF2 = pd.merge(tempDF2, tempDF, left_index=True, right_index=True, how='left')
    
    print(sex)
    display(tempDF2)
    tempD2[sex] = tempDF2
    
    #Confirmation
    tempD = {'Baseline DF':tempDF1, 'Time-series DF':tempDF2}
    for df_n in tempD.keys():
        tempDF = tempD[df_n]
        print(' - '+df_n, tempDF.shape)
        if df_n=='Time-series DF':
            print('    -> Unique ID:', len(tempDF['public_client_id'].unique()))
            tempDF = tempDF.drop(columns=tempL)
        display(tempDF.describe())
        sns.set(style='ticks', font='Arial', context='notebook')
        plt.figure(figsize=(4, 3))
        for col_i in range(3):
            sns.distplot(tempDF.iloc[:, col_i], label=tempDF.columns[col_i])
        sns.despine()
        plt.xlabel(r'$Z$'+'-score')
        plt.ylabel('Density')
        plt.legend(bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1)
        plt.show()
    print('')

tsDF_F = tempD2['Female']
tsDF_M = tempD2['Male']
tsDF_B = tempD2['BothSex']

> –> Confirmed that the baseline summary was identical to the previous one (from the baseline LASSO modeling).  

### 5-3. Import the fitted LASSO models with the baseline measurements

In [None]:
yvar_model = 'CombiBMI'
tempD = {}
for sex in ['Female', 'Male', 'BothSex']:
    #Import the LASSO beta-coefficients (including intercept)
    fileDir = './ExportData/'
    ipynbName = '220801_Multiomics-BMI-NatMed1stRevision_BMI-baseline-LASSO_'
    fileName = yvar_model+'-'+sex+'-LASSObcoefs.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('Variable')
    #Drop summary columns
    tempL = ['Mean', 'SD', 'nZeros']
    tempDF = tempDF.drop(columns=tempL)
    tempD[sex] = tempDF
    
    #Check
    print(sex+':')
    print(' - Variables (without intercept):', len(tempDF)-1)
    display(tempDF)
    print('')

modelDF_F = tempD['Female']
modelDF_M = tempD['Male']
modelDF_B = tempD['BothSex']

### 5-4. Calculate predictions using the fitted models

> According to the LassoCV source, the self.predict(X) method calls the self._decision_function(X) method, which in turn returns `safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_`. In this case, safe_sparse_dot() simply corresponds to a dot product. Hence, manual calculation from the beta-coefficients and intercept is implemented here.  
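
> This equivalence can be checked on synthetic data. The sketch below is an illustration only: it assumes a plain `Lasso` with an arbitrary `alpha` (instead of the fitted LassoCV models), random toy data, and the intercept appended as the last coefficient, which is one possible layout.  

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 1.5 + X @ np.array([0.8, 0.0, -0.5, 0.0]) + rng.normal(scale=0.1, size=50)

# Fit a LASSO model; alpha is an arbitrary value for this sketch
model = Lasso(alpha=0.01).fit(X, y)

# Add a dummy intercept variable (constant 1.0) as the last column of the data
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
# Stack the beta-coefficients with the intercept in the same order
beta = np.append(model.coef_, model.intercept_)

# Manual prediction as a dot product
manual_pred = np.dot(X_aug, beta)
print('Manual == predict():', np.allclose(manual_pred, model.predict(X)))
```

> The agreement depends on the column order of the data matching the coefficient order, which is why the variable order is checked before the dot product.  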

In [None]:
tempD1 = {'Female':tsDF_F, 'Male':tsDF_M, 'BothSex':tsDF_B}
tempD2 = {'Female':modelDF_F, 'Male':modelDF_M, 'BothSex':modelDF_B}
tempL = ['public_client_id', 'days_in_program', 'Season']

tempD = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    tempDF1 = tempDF.drop(columns=tempL)
    tempDF2 = tempD2[sex]
    
    #Add dummy intercept variable to data DF
    tempDF1['Intercept'] = 1.0
    
    #Check just in case
    tempA1 = tempDF1.columns.to_numpy()
    tempA2 = tempDF2.index.to_numpy()
    print(sex)
    print(' - nVariables is consistent between data and model DFs:',
          len(tempA1)==len(tempA2))
    print(' - Variable order is consistent between data and model DFs:',
          np.array_equal(tempA1, tempA2))#Also safe when the lengths differ
    
    #Calculate prediction
    tempA = np.dot(tempDF1, tempDF2)
    tempDF1 = pd.DataFrame(tempA, index=tempDF1.index, columns=tempDF2.columns)
    
    #Recover the time-series metadata
    tempDF = tempDF[tempL]
    tempDF = pd.merge(tempDF, tempDF1, left_index=True, right_index=True, how='left')
    
    tempD[sex] = tempDF
    display(tempDF)
    print('')

predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']

> To obtain a single prediction for each participant and each time point, the predictions calculated from each model could be averaged.  
> ***–> However, in addition to the overfitting problem for the baseline predictions, there is a potential risk of data leakage even for the longitudinal predictions.***  
> –> In this version, the single prediction for each participant and each time point is taken from the model whose baseline testing (hold-out) set included that participant.  
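
> The selection rule can be sketched on toy data as follows. The column names `Model_1` and `Model_2`, the toy values, and the vectorized lookup are assumptions for illustration; the cell below performs the same selection with a row-wise loop.  

```python
import numpy as np
import pandas as pd

# Toy log-scale predictions from two hypothetical models;
# 'Testing' names the model whose hold-out set included the participant
toyDF = pd.DataFrame({'public_client_id': ['id1', 'id1', 'id2'],
                      'Model_1': [3.10, 3.12, 3.30],
                      'Model_2': [3.20, 3.22, 3.40],
                      'Testing': ['Model_1', 'Model_1', 'Model_2']})

# Column position of the hold-out model for each row
model_cols = ['Model_1', 'Model_2']
col_idx = pd.Index(model_cols).get_indexer(toyDF['Testing'])

# Pick one prediction per row from the corresponding model column
values = toyDF[model_cols].to_numpy()
toyDF['log_pred'] = values[np.arange(len(toyDF)), col_idx]

# Convert to the original scale
toyDF['pred'] = np.exp(toyDF['log_pred'])
print(toyDF[['public_client_id', 'Testing', 'log_pred', 'pred']])
```

> Because each participant belongs to the hold-out set of one model only, all time points of that participant are predicted by a model that never saw the participant during fitting.  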

In [None]:
tempD1 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
tempD2 = {'Female':predictDF_F, 'Male':predictDF_M, 'BothSex':predictDF_B}
yvar_model = 'CombiBMI'
tempD = {}
for sex in tempD1.keys():
    #Retrieve the predictions for the testing (hold-out) set
    tempDF = tempD1[sex]
    tempS = tempDF['Testing']
    tempDF = tempD2[sex]
    tempDF = pd.merge(tempDF, tempS, left_on='public_client_id', right_index=True, how='left')
    tempL = []
    for row_i in range(len(tempDF)):
        model_n = tempDF['Testing'].iloc[row_i]
        tempL.append(tempDF[model_n].iloc[row_i])
    tempDF['log_'+yvar_model] = tempL
    
    #Drop the temporary per-model prediction columns
    tempDF = tempDF.loc[:, ~tempDF.columns.str.contains('Model_')]
    
    #Convert to original scale
    tempDF[yvar_model] = np.e**tempDF['log_'+yvar_model]
    
    tempD[sex] = tempDF
    print(sex)
    display(tempDF)
    print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
    print('')
#Update
predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']

### 5-5. Clean and save predictions

In [None]:
#Add the baseline info
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'BothSex':predictDF_B}
tempD2 = {'Female':bmiDF_F, 'Male':bmiDF_M, 'BothSex':bmiDF_B}
yvar_model = 'CombiBMI'
tempD = {}
for sex in tempD1.keys():
    tempDF = tempD1[sex]
    
    #Retrieve the baseline predictions
    tempDF1 = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
    tempDF1 = tempDF1.drop_duplicates('public_client_id', keep='first')
    tempDF1 = tempDF1.reset_index().set_index('public_client_id')
    tempDF1 = tempDF1.rename(columns={yvar_model:'Base'+yvar_model})
    
    #Add baseline BMI and covariate info
    tempDF2 = tempD2[sex]
    ##Replace the log-scaled BMI with the original scaled
    tempS = np.e**tempDF2['log_BaseBMI']
    tempS.name = 'BaseBMI'
    tempDF2 = pd.merge(tempS, tempDF2, left_index=True, right_index=True, how='left')
    tempDF2 = tempDF2.drop(columns=['log_BaseBMI', 'Testing'])
    tempDF1 = pd.merge(tempDF1['Base'+yvar_model], tempDF2,
                       left_index=True, right_index=True, how='left')
    
    #Obesity classification
    for bmi in ['BMI', yvar_model]:
        tempL = []
        for value in tempDF1['Base'+bmi].tolist():
            if np.isnan(value):
                tempL.append('NotCalculated')
            elif value < 18.5:
                tempL.append('Underweight')
            elif value < 25:
                tempL.append('Normal')
            elif value < 30:
                tempL.append('Overweight')
            elif value >= 30:
                tempL.append('Obese')
            else:#Just in case
                tempL.append('Error?')
        tempDF1['Base'+bmi+'_class'] = tempL
    
    #Check baseline summary
    print(sex+' baseline summary:')
    display(tempDF1.describe(include='all'))
    for bmi in ['BMI', yvar_model]:
        print('Base'+bmi+'_class:')
        tempS1 = tempDF1['Base'+bmi+'_class'].value_counts()
        tempDF2 = pd.DataFrame({'Count':tempS1, 'Percentage':tempS1/len(tempDF1)*100})
        display(tempDF2)
    
    #Merge
    tempDF = pd.merge(tempDF, tempDF1, left_on='public_client_id', right_index=True, how='left')
    tempD[sex] = tempDF
    display(tempDF)
    print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
    print('')
#Update
predictDF_F = tempD['Female']
predictDF_M = tempD['Male']
predictDF_B = tempD['BothSex']

In [None]:
#Save
yvar_model = 'CombiBMI'

#Sex-stratified models
tempDF = pd.concat([predictDF_F, predictDF_M], axis=0)
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
fileDir = './ExportData/'
ipynbName = '220805_Multiomics-BMI-NatMed1stRevision_BMI-longitudinal-LASSO_'
fileName = yvar_model+'-FemaleMale.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

#Sex-mixed model
tempDF = predictDF_B
display(tempDF)
print(' -> Unique ID:', len(tempDF['public_client_id'].unique()))
fileDir = './ExportData/'
ipynbName = '220805_Multiomics-BMI-NatMed1stRevision_BMI-longitudinal-LASSO_'
fileName = yvar_model+'-BothSex.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

### 5-6. Check consistency

In [None]:
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'Both sex':predictDF_B}
yvar_model = 'CombiBMI'
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

#Plot difference b/w sex-specific and sex-mixed models
tempD2 = {'Female':'tab:red', 'Male':'tab:blue'}
range_min = np.min([df[var].min() for df in tempD1.values() for var in [yvar_model]])
range_max = np.max([df[var].max() for df in tempD1.values() for var in [yvar_model]])
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(6.5, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    #Prepare DF
    tempS1 = tempD1[sex][yvar_model]
    tempS1.name = 'Sex-specific'
    tempS2 = tempD1['Both sex'][yvar_model]
    tempS2.name = 'Sex-mixed'
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='inner')
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Sex-specific', y='Sex-mixed', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Sex-mixed b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Sex-specific'], tempDF['Sex-mixed'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==1:
        ax1_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax1_pos[0]+ax1_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Sex-specific b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

In [None]:
#Check consistency of the baseline predictions
tempD1 = {'Female':predictDF_F, 'Male':predictDF_M, 'Both sex':predictDF_B}
tempD2 = {'Female':'Female', 'Male':'Male', 'Both sex':'BothSex'}
yvar_model = 'CombiBMI'
axis_label = 'BMI [kg m'+r'$^{-2}$'+']'

tempD = {}
for sex in tempD2.keys():
    #Retrieve the baseline predictions
    tempDF = tempD1[sex]
    tempDF = tempDF.sort_values(by=['public_client_id', 'days_in_program'], ascending=True)
    tempDF = tempDF.drop_duplicates('public_client_id', keep='first')
    tempDF = tempDF.reset_index().set_index('public_client_id')
    tempS1 = tempDF['Base'+yvar_model]
    tempS1.name = 'Current'
    
    #Import the previous baseline prediction DF
    fileDir = './ExportData/'
    ipynbName = '220801_Multiomics-BMI-NatMed1stRevision_BMI-baseline-LASSO_'
    fileName = yvar_model+'-'+tempD2[sex]+'.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id':str})
    tempDF = tempDF.set_index('public_client_id')
    tempS2 = tempDF['Base'+yvar_model]
    tempS2.name = 'Previous'
    
    tempDF = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='outer')
    tempD[sex] = tempDF
    
    #Check exact values
    print(sex)
    print(' - Participant is consistent:', len(tempDF)==len(tempDF.dropna()))
    tempDF1 = tempDF.loc[tempDF['Current']!=tempDF['Previous']]
    print(' - Inconsistent baseline predictions:',
          len(tempDF1), '(', len(tempDF1)/len(tempDF)*100, '[%])')
    display(tempDF1)
    
    #Check obesity classification
    for bbmi in ['Current', 'Previous']:
        tempL = []
        for value in tempDF[bbmi].tolist():
            if np.isnan(value):
                tempL.append('NotCalculated')
            elif value < 18.5:
                tempL.append('Underweight')
            elif value < 25:
                tempL.append('Normal')
            elif value < 30:
                tempL.append('Overweight')
            elif value >= 30:
                tempL.append('Obese')
            else:#Just in case
                tempL.append('Error?')
        tempDF[bbmi+'_Base'+yvar_model+'_class'] = tempL
        print(' - '+bbmi+'_Base'+yvar_model+'_class:')
        tempS = tempDF[bbmi+'_Base'+yvar_model+'_class'].value_counts()
        tempDF1 = pd.DataFrame({'Count':tempS, 'Percentage':tempS/len(tempDF)*100})
        display(tempDF1)

#Plot current vs. previous baseline predictions per model
tempD2 = {'Female':'tab:red', 'Male':'tab:blue', 'Both sex':'tab:green'}
range_min = np.min([df[var].min() for df in tempD.values() for var in ['Current', 'Previous']])
range_max = np.max([df[var].max() for df in tempD.values() for var in ['Current', 'Previous']])
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(10, 3), sharex=True, sharey=True,
                         gridspec_kw={'hspace':0.1, 'wspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    tempDF = tempD[sex]
    #Y=X as reference
    ax.plot([range_min, range_max], [range_min, range_max], color='black', linestyle=(0, (1, 2)))
    #Regplot
    sns.regplot(data=tempDF, x='Previous', y='Current', color=tempD2[sex],
                scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':30}, ax=ax)
    if ax_i%len(tempD2)==0:
        plt.setp(ax, xlabel='', ylabel='Current b'+axis_label)
    else:
        plt.setp(ax, xlabel='', ylabel='')
        plt.setp(ax.get_yticklabels(), visible=False)
    ##Annotate Pearson's correlation
    pearson_r, pval = stats.pearsonr(tempDF['Previous'], tempDF['Current'])
    r_text = str(Decimal(str(pearson_r)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    if pval==1.0:
        pval_text = '1.0'
    elif pval==0.0:
        pval_text = '0.0'
    else:
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $'...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    text = 'Pearson\'s '+r'$r$'+' = '+r_text+'\n'+r'$P$'+' = '+pval_text
    ax.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
                horizontalalignment='left', verticalalignment='top',
                multialignment='left', fontsize='x-small', color='k')
    ##Facet label
    ax.set_title(sex, {'fontsize':'medium'})
    #Save position to generate facet and legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==2:
        ax2_pos = ax.get_position().bounds
sns.despine()
fig.text(x=(ax0_pos[0]+(ax2_pos[0]+ax2_pos[2]))/2, y=ax0_pos[1]-ax0_pos[3]*0.2,#Minor manual adjustment
         s='Previous b'+axis_label, fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
plt.show()

> –> The inconsistent predictions matched to at least six decimal places, so the differences are probably due to floating-point round-off. In fact, the bBMI classes are fully consistent.  

# — End of this notebook —