# Cohort 60x60 AUCs analysis

### Imports and environment setup

- Date of run: 2024-08-19
- Environment: Python 3.12
- Packages required: pandas, numpy, sklearn, statsmodels, seaborn, matplotlib

In [1]:
# Add the code directory (which contains the utils module) to the import path
import sys
sys.path.append('../code/')

In [2]:
# Library imports
import pandas as pd
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import seaborn as sns

# Utils imports
import cohort_analysis_utils as utils

In [3]:
# Remove warnings for readability
import warnings
warnings.filterwarnings('ignore')

# Remove cell printing limits
pd.set_option('display.max_rows', None)


# Data loading and preprocessing

The original Excel file (available [here](<https://mimarkdx.sharepoint.com/sites/Scientific/Documentos compartidos/General/PHASE 6 - SOFTWARE DEVELOPMENT/DATA/../../../../../../:x:/s/Scientific/Eaw9d-fa2BREg_iZB1SL02YBG4mfVaJtoylG46bROmXVJA?e=8chcN7>)) was saved as a CSV file in the data folder of this repository, with fields separated by tabs.

In [4]:
df_240 = pd.read_csv('../data/ruo_240.csv', sep='\t', index_col=0, header=0)

In [5]:
# Harmonization of column names
df_240 = utils.normalize_column_names(df_240)

In [6]:
# Ensure numeric columns are treated as such
cols_240_to_num = ['Age', 'Collected_volume_mL',
                   'MMP9', 'HSPB1', 'PERM', 'Total_protein_BCA']
df_240 = utils.cols_as_numbers(df_240, cols_240_to_num)


In [7]:
# Ensure categorical columns are treated as such
df_240 = utils.cols_as_category(df_240, {'Pathology':{'Benign': 0, 'Endometrial cancer': 1}})
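
The two helpers above live in `cohort_analysis_utils`, whose code is not shown in this notebook. As a rough sketch of what such preprocessing steps usually involve (an assumption, not the actual implementation), numeric coercion can be done with `pd.to_numeric` and label encoding with `Series.map`. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for df_240 (values are hypothetical).
toy = pd.DataFrame({
    "Age": ["61", "58", "n/a"],
    "MMP9": ["1200.5", "<LOD", "830"],
    "Pathology": ["Benign", "Endometrial cancer", "Benign"],
})

# Coerce text to numbers; anything unparseable becomes NaN instead of raising.
for col in ["Age", "MMP9"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Map the pathology labels to the 0/1 codes used in the notebook.
label_codes = {"Benign": 0, "Endometrial cancer": 1}
toy["Pathology"] = toy["Pathology"].map(label_codes)

print(toy.dtypes)
```

Coercing with `errors="coerce"` makes unparseable entries explicit as missing values, so they can be counted or dropped before modelling.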

# Execution parameters

In [11]:
PLOT_ROCS = True
MAX_BIOMARKER_COUNT = 3
RESULTS_PATH = '../data/results/240'

# Columns to be considered as biomarkers
BIOMARKERS_240 = ['MMP9', 'HSPB1', 'PERM']

NORMALIZING_COL_240 = 'Total_protein_BCA' # Column to be used for normalizing the biomarkers
VOLUME_COL = 'Collected_volume_mL' # Column to be used as volume for scatters and undoing the dilution

## Methods

Description of the methods used to compute the AUCs.

### Direct

No transformations were made to the readout of each biomarker.

Because the samples were processed differently, the biomarker and total protein readouts correlate with the collected volume, and the collected volume in turn correlates with the target variable (disease). The performance of the biomarkers in the 60x60 therefore masks the correlation between disease and volume. Volume is a good classification variable, but we want to remove its influence because we cannot control it.
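
One way to inspect the volume dependence described here is to correlate the log-transformed readouts with the log-transformed volume. The sketch below uses synthetic data, so the sign and strength of the correlations are arbitrary and say nothing about the real cohort. Only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100

# Synthetic stand-in for the cohort; the volume dependence is built in on purpose.
volume = rng.uniform(0.2, 3.0, size=n)  # hypothetical collected volume, in mL
toy = pd.DataFrame({
    "Collected_volume_mL": volume,
    "MMP9": np.exp(5.0 - 0.8 * np.log(volume) + rng.normal(0, 0.5, n)),
    "Total_protein_BCA": np.exp(3.0 - 0.6 * np.log(volume) + rng.normal(0, 0.5, n)),
})

# Pearson correlation of each log readout with log volume.
log_toy = np.log(toy)
volume_corr = log_toy.corr()["Collected_volume_mL"].drop("Collected_volume_mL")
print(volume_corr)
```

On the real data, the same two lines at the end could be applied to `df_240[[VOLUME_COL] + BIOMARKERS_240 + [NORMALIZING_COL_240]]`, provided all values are strictly positive.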

### Normalized

To remove the volume effect, total protein was proposed as a normalizing variable. The values used for classification are then the concentration ratios $[Bmk]/[TP]$.
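
As a small illustration of the ratio, the sketch below normalizes one biomarker column by total protein. It also shows that, on a log scale, the ratio becomes a difference of logs, which is relevant because the models in this notebook are run with `apply_log=True`. The values are hypothetical, and the exact transformation applied inside `cohort_analysis_utils` is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical readouts; units are whatever the assay reports.
toy = pd.DataFrame({
    "MMP9": [1200.0, 300.0, 50.0],
    "Total_protein_BCA": [600.0, 300.0, 100.0],
})

# Normalized value: biomarker concentration divided by total protein.
toy["MMP9_norm"] = toy["MMP9"] / toy["Total_protein_BCA"]

# On a log scale the ratio is a difference of logs.
toy["log_MMP9_norm"] = np.log(toy["MMP9"]) - np.log(toy["Total_protein_BCA"])

print(toy)
```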

In discussions, our collaborators from Santiago noted that ratios can be misleading, especially if the denominator variable is not independent of the numerator or of other variables in the analysis.

### Kronmal

Given the potential drawbacks of the normalization method raised by our colleagues, we also computed the performance of the biomarkers with Kronmal’s method, which adds the normalizing variable ($[TP]$, total protein) to the model as a separate variable instead of dividing by it.

The drawbacks of this method are that adding $[TP]$ to the model makes the result harder to interpret, and that it removes the volume correlation neither for $[TP]$ nor for the biomarkers.
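
To make the difference between the three methods concrete, the sketch below fits a logistic regression on synthetic data in three ways: biomarker alone (direct), log ratio (normalized), and biomarker plus total protein as separate covariates (Kronmal). The use of scikit-learn's `LogisticRegression`, the in-sample AUC, and the synthetic data are all assumptions made for illustration. The actual modelling code is in `cohort_analysis_utils` and may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200

# Synthetic data: the biomarker depends on both total protein and disease status.
y = rng.integers(0, 2, size=n)
log_tp = rng.normal(0.0, 1.0, size=n)
log_bmk = 0.8 * log_tp + 1.0 * y + rng.normal(0.0, 1.0, size=n)
toy = pd.DataFrame({"log_bmk": log_bmk, "log_tp": log_tp, "Pathology": y})
toy["log_ratio"] = toy["log_bmk"] - toy["log_tp"]


def in_sample_auc(features):
    """Fit a logistic regression on the given columns and return its AUC on the same data."""
    model = LogisticRegression().fit(toy[features], toy["Pathology"])
    scores = model.predict_proba(toy[features])[:, 1]
    return roc_auc_score(toy["Pathology"], scores)


auc_direct = in_sample_auc(["log_bmk"])               # biomarker alone
auc_normalized = in_sample_auc(["log_ratio"])         # log([Bmk]/[TP])
auc_kronmal = in_sample_auc(["log_bmk", "log_tp"])    # [TP] as its own covariate

print(auc_direct, auc_normalized, auc_kronmal)
```

On a log scale, the normalized model is the special case of the Kronmal model in which the two coefficients are forced to be equal and opposite. The Kronmal model lets the data decide how much weight total protein receives.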



In [12]:
METHODS = ['direct', 'normalized', 'kronmal']

# Computing the models

All the functions to generate the models are included in the [cohort_analysis_utils.py](../code/cohort_analysis_utils.py) file.

In [14]:
models_120 = utils.compute_all_models_and_save(
                            df=df_240,
                            biomarkers=BIOMARKERS_240,
                            normalizing_col=NORMALIZING_COL_240, 
                            volume_col=VOLUME_COL,
                            volume_added=0.5,
                            apply_log=True,
                            avoid_same_biomarker=True,
                            methods=METHODS,
                            max_biomarker_count=MAX_BIOMARKER_COUNT,
                            folder_name=RESULTS_PATH,
                            plot_rocs=PLOT_ROCS,
                            )
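
With three biomarkers and `MAX_BIOMARKER_COUNT = 3` there are seven possible panels, which matches the seven rows in each results table. How `compute_all_models_and_save` enumerates them is not shown in this notebook. A minimal sketch, assuming plain unordered combinations:

```python
from itertools import combinations

# Same values as in the execution parameters of this notebook.
BIOMARKERS_240 = ['MMP9', 'HSPB1', 'PERM']
MAX_BIOMARKER_COUNT = 3

# Every panel of 1 up to MAX_BIOMARKER_COUNT distinct biomarkers.
panels = [
    combo
    for size in range(1, MAX_BIOMARKER_COUNT + 1)
    for combo in combinations(BIOMARKERS_240, size)
]

for panel in panels:
    print(panel)
```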

## Running other analyses

### Create scatterplots for all the pairs of biomarkers

In [15]:
for biomarker1 in BIOMARKERS_240:
    for biomarker2 in BIOMARKERS_240[BIOMARKERS_240.index(biomarker1)+1:]:
        utils.plot_scatter_to_file(df_240, 
                                biomarker1, 
                                biomarker2, 
                                normalizing_col=NORMALIZING_COL_240, 
                                apply_log_x=True,
                                apply_log_y=True,
                                hue='Pathology', 
                                folder=RESULTS_PATH+'/scatters/')

# Results

## Direct

The results for the direct method are presented below. The full results are available [here](<../data/results/240/direct/max_3.csv>) (they are stored in the folder "data/results/240/direct/").

In [17]:
df_results_240_direct = pd.read_csv(RESULTS_PATH+'/direct/max_3.csv', sep=',', header=0)
df_results_240_direct[['Biomarker_1','Biomarker_2','Biomarker_3','AUC']].head(n=10)


|   | Biomarker_1 | Biomarker_2 | Biomarker_3 | AUC |
|---|---|---|---|---|
| 0 | MMP9 |  |  | 0.84381 |
| 1 | PERM |  |  | 0.83853 |
| 2 | MMP9 | PERM |  | 0.76825 |
| 3 | MMP9 | HSPB1 | PERM | 0.74198 |
| 4 | MMP9 | HSPB1 |  | 0.7281 |
| 5 | HSPB1 | PERM |  | 0.72782 |
| 6 | HSPB1 |  |  | 0.67763 |


Let's also see how the biomarkers performed individually.

In [18]:
df_results_240_direct[df_results_240_direct['Biomarker_2'].isnull() & df_results_240_direct['Biomarker_3'].isnull()][['Biomarker_1','AUC']]

|   | Biomarker_1 | AUC |
|---|---|---|
| 0 | MMP9 | 0.84381 |
| 1 | PERM | 0.83853 |
| 6 | HSPB1 | 0.67763 |


## Kronmal

In [19]:
df_results_240_kronmal = pd.read_csv(RESULTS_PATH+'/kronmal/max_3.csv', sep=',', header=0)
df_results_240_kronmal[['Biomarker_1','Biomarker_2','Biomarker_3','AUC']].head(n=10)


|   | Biomarker_1 | Biomarker_2 | Biomarker_3 | AUC |
|---|---|---|---|---|
| 0 | MMP9 | PERM |  | 0.79153 |
| 1 | MMP9 | HSPB1 | PERM | 0.7914 |
| 2 | MMP9 |  |  | 0.77653 |
| 3 | MMP9 | HSPB1 |  | 0.77634 |
| 4 | HSPB1 | PERM |  | 0.76462 |
| 5 | PERM |  |  | 0.76423 |
| 6 | HSPB1 |  |  | 0.57327 |


In [20]:
df_results_240_kronmal[df_results_240_kronmal['Biomarker_2'].isnull() & df_results_240_kronmal['Biomarker_3'].isnull()][['Biomarker_1','AUC']]

|   | Biomarker_1 | AUC |
|---|---|---|
| 2 | MMP9 | 0.77653 |
| 5 | PERM | 0.76423 |
| 6 | HSPB1 | 0.57327 |


## Normalized

In [21]:
df_results_240_normalized = pd.read_csv(RESULTS_PATH+'/normalized/max_3.csv', sep=',', header=0)
df_results_240_normalized[['Biomarker_1','Biomarker_2','Biomarker_3','AUC']].head(n=10)

|   | Biomarker_1 | Biomarker_2 | Biomarker_3 | AUC |
|---|---|---|---|---|
| 0 | PERM |  |  | 0.82372 |
| 1 | MMP9 |  |  | 0.81688 |
| 2 | MMP9 | PERM |  | 0.79576 |
| 3 | MMP9 | HSPB1 | PERM | 0.75254 |
| 4 | MMP9 | HSPB1 |  | 0.74094 |
| 5 | HSPB1 | PERM |  | 0.73513 |
| 6 | HSPB1 |  |  | 0.63462 |


In [22]:
df_results_240_normalized[df_results_240_normalized['Biomarker_2'].isnull() & df_results_240_normalized['Biomarker_3'].isnull()][['Biomarker_1','AUC']]

|   | Biomarker_1 | AUC |
|---|---|---|
| 0 | PERM | 0.82372 |
| 1 | MMP9 | 0.81688 |
| 6 | HSPB1 | 0.63462 |


## ROC curves

ROC curves were computed for every method, every biomarker combination, and every cohort. To check a specific ROC curve, use the following folder structure: "data/results/240/<method>/<max_biomarkers>/rocs/".

We observed that setting the operating point at 0.97 sensitivity comes at the price of poor specificity. As an example, let’s look at one of the best-scoring classifiers across the methods above: PERM using the ‘normalized’ method. Its ROC image is located at "data/results/240/normalized/max_3/rocs/PERM.png".
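
The trade-off can be read directly from a ROC table: keep the rows that reach the target sensitivity and take the one with the highest specificity. The sketch below does this on a made-up table that reuses the column names of the per-classifier CSV files. The numbers are hypothetical, and treating 0.97 as a minimum (rather than an exact value) is an assumption.

```python
import pandas as pd

# Hypothetical ROC table with the same column names as the ROC CSV files.
roc = pd.DataFrame({
    "Threshold":   [0.70, 0.55, 0.40, 0.30, 0.20],
    "Sensitivity": [0.30, 0.70, 0.90, 0.97, 1.00],
    "Specificity": [0.98, 0.85, 0.60, 0.35, 0.10],
})

TARGET_SENSITIVITY = 0.97

# Rows that meet the sensitivity requirement.
eligible = roc[roc["Sensitivity"] >= TARGET_SENSITIVITY]

# Among those, the row with the best specificity is the operating point.
operating_point = eligible.loc[eligible["Specificity"].idxmax()]
print(operating_point)
```

The same selection can be applied to a real table loaded with `pd.read_csv` from the `rocs` folder.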

 

In [26]:
roc_image_path = "../data/results/240/normalized/max_3/rocs/PERM.png"
display(HTML("<img src='"+roc_image_path+"'>"))


The sensitivities and specificities for all the thresholds are stored in CSV files in the same folder.


In [27]:
roc_table_path = "../data/results/240/normalized/max_3/rocs/PERM.csv"
roc_table = pd.read_csv(roc_table_path, sep=',', header=0)
roc_table


|   | Threshold | Sensitivity | Specificity | NPV | PPV |
|---|---|---|---|---|---|
| 0 | inf | 0.0 | 1.0 | 0.48 | 0.0 |
| 1 | 0.721182 | 0.007692 | 1.0 | 0.481928 | 1.0 |
| 2 | 0.682961 | 0.2 | 1.0 | 0.535714 | 1.0 |
| 3 | 0.681497 | 0.2 | 0.983333 | 0.531532 | 0.928571 |
| 4 | 0.679455 | 0.223077 | 0.983333 | 0.538813 | 0.935484 |
| 5 | 0.678204 | 0.223077 | 0.966667 | 0.534562 | 0.878788 |
| 6 | 0.675114 | 0.253846 | 0.966667 | 0.544601 | 0.891892 |
| 7 | 0.675004 | 0.253846 | 0.958333 | 0.542453 | 0.868421 |
| 8 | 0.670207 | 0.330769 | 0.958333 | 0.569307 | 0.895833 |
| 9 | 0.669414 | 0.330769 | 0.95 | 0.567164 | 0.877551 |
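
For reference, the four metrics in this table follow from the confusion counts at each threshold. The sketch below computes them with a hypothetical helper, `diagnostic_metrics`. The counts used are an assumption: they were chosen so that the ratios reproduce row 3 of the table, and the notebook itself does not state the underlying counts. Returning 0.0 for an empty denominator mirrors the PPV of 0.0 shown in row 0, where no sample is called positive.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from confusion counts."""
    return {
        "Sensitivity": tp / (tp + fn),                      # cancers correctly flagged
        "Specificity": tn / (tn + fp),                      # benign correctly cleared
        "PPV": tp / (tp + fp) if (tp + fp) else 0.0,        # 0.0 when nothing is called positive
        "NPV": tn / (tn + fn) if (tn + fn) else 0.0,        # 0.0 when nothing is called negative
    }


# Assumed counts, consistent with row 3 of the table above.
metrics = diagnostic_metrics(tp=26, fp=2, tn=118, fn=104)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```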
