In this notebook, the goal is twofold:
1. Model the different covariates from the text only
2. Model the outcome given the predicted covariates, and compare this with a model built on the true covariates

An important consideration is that the split must be the same across all notebooks, so we save it to keep all experiments consistent.

In [None]:
import numpy as np
import pandas as pd 

### Open data and embedding

In [None]:
# Embedding to use for the notebook
embedding_type = 'BERT' # BERT, clinicalBERT, gpt, gpt+framing

In [None]:
embedding = pd.read_csv('data/{}_embedding.csv'.format(embedding_type), index_col = 0)
outcomes  = pd.read_csv('data/TGCA_Merged.csv', index_col = 0)
embedding = embedding.loc[outcomes.index]

In [None]:
assert (outcomes.index == embedding.index).all(), 'Misaligned index may create an issue - How is the embedding obtained?'

In [None]:
outcomes.head()

In [None]:
embedding.head()

### Split the data 

We propose different evaluation procedures:
1. One-hospital-out evaluation: to evaluate how well the model generalises outside the cohort. To limit the number of splits, we compute this only for hospitals with more than 100 patients.
2. One-cancer-group-out evaluation

In [None]:
split = pd.DataFrame({
        'Hospital': pd.factorize(outcomes.Hospital.replace({'Other': np.nan}))[0],
        'Grouping' : pd.factorize(outcomes.grouping.replace({'Other': np.nan}))[0],
    }, index = outcomes.index).replace({-1: np.nan})
split.to_csv('results/split.csv')
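To illustrate how the fold identifiers above are built, here is a minimal sketch on a toy table (the hospital names and patient ids are made up for illustration). `pd.factorize` encodes missing values with the sentinel `-1`, which is why the final `replace` maps `-1` back to `NaN`: those patients belong to no test fold.

```python
import numpy as np
import pandas as pd

# Toy data: hospital names and patient ids are hypothetical
toy = pd.DataFrame({'Hospital': ['A', 'B', 'Other', 'A', 'B']},
                   index = ['p1', 'p2', 'p3', 'p4', 'p5'])

# 'Other' becomes NaN, which factorize encodes with the sentinel -1
codes, uniques = pd.factorize(toy.Hospital.replace({'Other': np.nan}))

# Map the sentinel back to NaN so that these patients have no fold
toy_split = pd.Series(codes, index = toy.index, name = 'Hospital').replace({-1: np.nan})
print(toy_split)
```

Another notebook can then reload `results/split.csv` with `index_col = 0` instead of recomputing the folds.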

---------

### Model the different covariates and save

We aim to predict each manually extracted covariate from the text.

Then we save these covariates for future predictions.

In [None]:
# Copy so that the assignment below does not write into a view of `outcomes`
outcomes_to_predict = outcomes[['type', 'gender', 'race', 'ajcc_pathologic_tumor_stage']].copy()
outcomes_to_predict['ajcc_pathologic_tumor_stage'] = outcomes_to_predict.ajcc_pathologic_tumor_stage.astype('category')
outcomes_to_predict_dummy = pd.get_dummies(outcomes_to_predict, dummy_na = True).astype(int)

In [None]:
outcomes_to_predict_dummy.to_csv('data/binary_embedding.csv')
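As a small illustration of the encoding step above, the sketch below applies the same `pd.get_dummies(..., dummy_na = True)` call to a toy table (values are made up). Each object or categorical column is expanded into one indicator column per level, plus a `_nan` column that flags missing values.

```python
import numpy as np
import pandas as pd

# Toy covariates: values are hypothetical
toy = pd.DataFrame({'type': ['X', 'Y', np.nan], 'stage': [1.0, 2.0, np.nan]})

# Numeric columns are only expanded once cast to 'category'
toy['stage'] = toy.stage.astype('category')

toy_dummy = pd.get_dummies(toy, dummy_na = True).astype(int)
print(toy_dummy.columns.tolist())
```

Every row has exactly one active indicator per original column, so a missing stage is itself a class that the model can predict.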

In [None]:
# For simplicity, we rely on a NN from sklearn for this task
from sklearn.neural_network import MLPClassifier

In [None]:
predictions = {}
for split_type in split.columns:
    predictions[split_type] = pd.DataFrame().reindex_like(outcomes_to_predict_dummy)
    for fold in split[split_type].dropna().unique():
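        # Patients without a fold (NaN) compare unequal to every fold, 
        # so they are always part of the training set
        train = split[split_type].values != fold
        test = split[split_type].values == fold

        # No hidden layer: the network reduces to a linear model with logistic outputs
        model = MLPClassifier(hidden_layer_sizes = [], random_state = 42, 
                              learning_rate_init = 0.01, max_iter = 10, 
                              early_stopping = True).fit(embedding[train].values, outcomes_to_predict_dummy[train].values)
        # Fill in the rows of the held-out fold with the predicted probabilities
        predictions[split_type][test] = model.predict_proba(embedding[test].values)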
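One consequence of the masks used in the loop above is worth spelling out: a patient whose fold is `NaN` is used for training in every fold but never appears in a test set, so their rows in `predictions` stay `NaN`. A minimal sketch with hypothetical fold identifiers:

```python
import numpy as np
import pandas as pd

# Hypothetical fold identifiers; p3 has no fold
fold_ids = pd.Series([0, 1, np.nan, 0, 1], index = ['p1', 'p2', 'p3', 'p4', 'p5'])

for fold in fold_ids.dropna().unique():
    train = fold_ids.values != fold   # NaN != fold evaluates to True
    test = fold_ids.values == fold    # NaN == fold evaluates to False
    print(fold, fold_ids.index[train].tolist(), fold_ids.index[test].tolist())

# Patients that never receive an out-of-fold prediction
never_tested = fold_ids.index[fold_ids.isna()].tolist()
print(never_tested)
```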

In [None]:
predictions = pd.concat(predictions)

-----------

### Binarise and save

In [None]:
# Binarisation: argmax over the indicator columns for categorical covariates, 
# 0.5 threshold for the others
binarised_predictions = []
for column in outcomes_to_predict.columns:
    if column == 'ajcc_pathologic_tumor_stage': 
        pred_col = predictions.loc[:, predictions.columns.str.contains(column)].idxmax(axis = 1).str.replace(column + '_', '').astype(float)
    elif column == 'type':
        pred_col = predictions.loc[:, predictions.columns.str.contains(column)].idxmax(axis = 1).str.replace(column + '_', '')
    else:
        pred_col = predictions.loc[:, column] > 0.5

    binarised_predictions.append(pred_col.rename(column))
binarised_predictions = pd.concat(binarised_predictions, axis = 1)
binarised_predictions.head()
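The argmax step for categorical covariates can be sketched on toy probabilities (the class names and values are made up). The scores of the indicator columns need not sum to one, since each output is predicted independently; `idxmax` simply keeps the most likely level. This sketch matches columns with `startswith`, a slightly stricter assumption than the `contains` used above.

```python
import pandas as pd

# Hypothetical predicted probabilities for a 'type' covariate with three levels
toy_pred = pd.DataFrame({'type_X': [0.7, 0.2], 'type_Y': [0.2, 0.5], 'type_Z': [0.1, 0.3]},
                        index = ['p1', 'p2'])

# Select the indicator columns of the covariate, then keep the most likely level
type_columns = toy_pred.columns.str.startswith('type_')
predicted_type = toy_pred.loc[:, type_columns].idxmax(axis = 1).str.replace('type_', '', regex = False)
print(predicted_type.tolist())
```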

In [None]:
binarised_predictions.to_csv('data/{}_predicted_binary.csv'.format(embedding_type))

----------

### Measure performance of the extraction

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
performance = {}
for split_type in split.columns:
    folds = split[split_type].dropna().unique()
    performance[split_type] = pd.DataFrame(index = folds, columns = predictions.columns)
    for fold in folds:
        test = split[split_type] == fold
        for dimension in predictions.columns:
            mean = outcomes_to_predict_dummy.loc[test, dimension].mean()
            if mean != 1 and mean != 0:
                # AUC is only defined when the fold contains both classes
                performance[split_type].loc[fold, dimension] = roc_auc_score(outcomes_to_predict_dummy.loc[test, dimension], predictions.loc[(split_type, test[test].index), dimension])
performance = pd.concat(performance)
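The check on `mean` above is needed because the AUC is undefined when a test fold contains a single class for a given dimension. The same guard can be written as a small helper; `safe_auc` is a hypothetical name used for illustration, not part of sklearn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def safe_auc(y_true, y_score):
    """Hypothetical helper: AUC, or NaN when only one class is present."""
    y_true = np.asarray(y_true)
    if np.unique(y_true).size < 2:
        # Single-class fold: the AUC is undefined, report it as missing
        return np.nan
    return roc_auc_score(y_true, y_score)

print(safe_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(safe_auc([1, 1, 1], [0.2, 0.5, 0.9]))
```

Reporting `NaN` keeps the fold in the table while letting `describe()` ignore it in the summary statistics.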

In [None]:
performance.loc['Hospital'].astype(float).describe()

In [None]:
performance.loc['Grouping'].astype(float).describe()