## Reproduce the analysis for Task 2, corresponding to Figure 3 in the paper

This notebook contains the code and analysis for reproducing the results shown in Figure 3.

In [2]:
from scipy.stats import bootstrap, permutation_test
import pandas as pd
from pathlib import Path
from functools import partial
from tqdm import tqdm 
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates["custom"] = go.layout.Template(
    layout=go.Layout(
        colorway=px.colors.qualitative.D3,
    )
)

from utils import get_score, get_score_difference

### Load the prediction CSV files for each implementation approach

We load the saved prediction files for each implementation approach. Each CSV contains the predicted probability of malignancy (column `conf_scores_class_1`) and the true label from the dataset (column `target`).
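
As a minimal sketch of the expected layout, the snippet below reads a tiny in-memory CSV with the two columns the analysis uses, `target` and `conf_scores_class_1`. The rows are invented for illustration only, and the real prediction files may contain additional columns.

```python
import io

import pandas as pd

# Hypothetical file contents: these values are invented for illustration.
csv_text = """target,conf_scores_class_1
0,0.12
1,0.87
0,0.40
1,0.65
"""

df = pd.read_csv(io.StringIO(csv_text))

# The analysis passes (labels, scores) pairs to the metric functions.
pred_set = (df["target"].values, df["conf_scores_class_1"].values)
print(pred_set[0].tolist())
print(pred_set[1].tolist())
```

Keeping labels and scores together in one tuple makes it straightforward to resample them jointly later, so each label stays attached to its prediction.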

In [3]:

path = Path("../outputs/predictions/task2")

In [4]:
implementation_dict = {
    "Supervised": [csv_path for csv_path in path.glob("supervised_random*.csv")],
    "Supervised (Finetuned)": [csv_path for csv_path in path.glob("supervised_finetuned*.csv")],
    "Foundation (Features)": [csv_path for csv_path in path.glob("foundation_features*.csv")],
    "Foundation (Finetuned)": [csv_path for csv_path in path.glob("foundation_finetuned*.csv")],
}
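
The analysis cell further down derives the training-data fraction from each file name. The sketch below isolates that rule in a helper so it can be checked on its own. The helper name `parse_data_percentage` and the example file names are assumptions made for illustration; only the splitting rule itself is taken from the notebook code.

```python
from pathlib import Path


def parse_data_percentage(csv_path: Path) -> float:
    """Return the training-data fraction encoded in a prediction file name."""
    parts = csv_path.stem.split("_")
    # Stems with more than two "_"-separated parts are assumed to carry the
    # percentage as the second-to-last part; otherwise assume the full set.
    if len(parts) > 2:
        return float(parts[-2]) / 100
    return 1.0


# Hypothetical file names, chosen only to exercise both branches.
print(parse_data_percentage(Path("foundation_features_50_percent.csv")))
print(parse_data_percentage(Path("foundation_features.csv")))
```

Factoring the rule out like this would also avoid repeating the same expression for the reference implementation and for each implementation it is compared against.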

### Analysis for computing metrics for each of the implementation approaches

Here we compute mAP and AUC for each implementation approach, along with 95% confidence intervals. Each approach is also compared against every other approach trained on the same percentage of data, which yields a confidence interval for the metric difference and a p-value. Confidence intervals are computed by bootstrapping, and p-values come from a permutation test.
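
To make the two procedures concrete, here is a self-contained sketch on synthetic data. It does not use the notebook's `utils.get_score` or `utils.get_score_difference`; the functions `auc_roc` and `auc_difference` are stand-ins written for this example, and the labels and scores are randomly generated rather than real predictions. Resampling is paired, so each label stays with its predictions.

```python
from functools import partial

import numpy as np
from scipy.stats import bootstrap, permutation_test, rankdata


def auc_roc(y_true, y_score):
    """AUC-ROC via the rank-sum (Mann-Whitney U) identity."""
    y_true = np.asarray(y_true).astype(bool)
    ranks = rankdata(y_score)  # average ranks, so ties are handled
    n_pos = y_true.sum()
    n_neg = y_true.size - n_pos
    return (ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auc_difference(y_true, score_a, score_b):
    """Difference in AUC between two models scored on the same labels."""
    return auc_roc(y_true, score_a) - auc_roc(y_true, score_b)


# Synthetic stand-in data: model A separates the classes better than model B.
rng = np.random.default_rng(0)
y_true = np.tile([0, 1], 100)
score_a = 0.3 * y_true + rng.random(200)
score_b = 0.1 * y_true + rng.random(200)

# 95% CI for one model's AUC; paired=True resamples labels and scores together.
auc_ci = bootstrap((y_true, score_a), auc_roc, paired=True, vectorized=False,
                   method="basic", n_resamples=200, confidence_level=0.95,
                   random_state=0)

# 95% CI for the AUC difference between the two models.
diff_ci = bootstrap((y_true, score_a, score_b), auc_difference, paired=True,
                    vectorized=False, method="basic", n_resamples=200,
                    confidence_level=0.95, random_state=0)

# permutation_type="samples" randomly swaps the two models' predictions within
# each case, which matches the null hypothesis that the models are exchangeable.
perm = permutation_test((score_a, score_b), partial(auc_difference, y_true),
                        permutation_type="samples", vectorized=False,
                        n_resamples=200, alternative="two-sided",
                        random_state=0)

print("AUC CI:", auc_ci.confidence_interval)
print("AUC difference CI:", diff_ci.confidence_interval)
print("p-value:", perm.pvalue)
```

The labels are bound with `partial` in the permutation test because only the two prediction vectors should be exchanged; the labels stay fixed.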

In [9]:
pbar = tqdm(total=len(implementation_dict) * len(implementation_dict["Supervised"]))
results = []

# The study uses 1000 resamples; to save time, we reproduce the results with only 10 here,
# so the confidence intervals and p-values are only rough approximations.
N_RESAMPLES = 10

for implementation_name, implementation_list in implementation_dict.items():
    for model_prediction_csv in implementation_list:
        data_percentage = float(model_prediction_csv.stem.split("_")[-2])/100 if len(model_prediction_csv.stem.split("_")) > 2 else 1.0
        df = pd.read_csv(model_prediction_csv)

        pred_set = (df["target"].values, df["conf_scores_class_1"].values)
        
        mAP =  get_score(*pred_set)
        AUC = get_score(*pred_set, fn="auc_roc")

        map_ci = bootstrap(pred_set, get_score, method="basic", n_resamples=N_RESAMPLES, confidence_level=0.95, paired=True)
        auc_ci = bootstrap(pred_set, partial(get_score, fn="auc_roc"), method="basic", n_resamples=N_RESAMPLES, confidence_level=0.95, paired=True)
    
        
        row = {
            "Implementation": implementation_name,
            "Data Percentage": data_percentage,
            "mAP": mAP,
            "mAP_low_CI": map_ci.confidence_interval.low,
            "mAP_high_CI": map_ci.confidence_interval.high,
            "AUC": AUC,
            "AUC_low_CI": auc_ci.confidence_interval.low,
            "AUC_high_CI": auc_ci.confidence_interval.high,
            }
        
        # Compute statistics for comparison between this implementation and all other ones (difference CI and p-value)
        compare_implementations = {k: v for k, v in implementation_dict.items() if k != implementation_name}
        for _implementation_name, _implementations_list in compare_implementations.items():
            for _model_prediction_csv in _implementations_list:
                _data_percentage = float(_model_prediction_csv.stem.split("_")[-2])/100 if len(_model_prediction_csv.stem.split("_")) > 2 else 1.0
                if data_percentage == _data_percentage:

                    _df = pd.read_csv(_model_prediction_csv)
                    _pred = _df["conf_scores_class_1"].values
                    _pred_set = (*pred_set, _pred)
                    
                    # Get confidence intervals and p-values for the AUC difference between this implementation and the other ones
                    diff_ci = bootstrap(_pred_set, partial(get_score_difference, fn="auc_roc"), method='basic', n_resamples=N_RESAMPLES, confidence_level=0.95, paired=True)
                    perm_test = permutation_test((_pred_set[1], _pred_set[2]), partial(get_score_difference, _pred_set[0],fn="auc_roc", sample_target=False), permutation_type='samples', n_resamples=N_RESAMPLES, alternative='two-sided', vectorized=True)
                    
                    row[f"AUC_diff_CI_low_{_implementation_name}"] = diff_ci.confidence_interval.low
                    row[f"AUC_diff_CI_high_{_implementation_name}"] = diff_ci.confidence_interval.high
                    row[f"AUC_pval_{_implementation_name}"] = perm_test.pvalue
                    
                    # Get confidence intervals and p-values for mAP between this implementation and the other ones
                    diff_ci = bootstrap(_pred_set, partial(get_score_difference), method='basic', n_resamples=N_RESAMPLES, confidence_level=0.95, paired=True)
                    perm_test = permutation_test((_pred_set[1], _pred_set[2]), partial(get_score_difference, _pred_set[0], fn="average_precision", sample_target=False), permutation_type='samples', n_resamples=N_RESAMPLES, alternative='two-sided', vectorized=True)
                    
            
                    
                    row[f"mAP_diff_CI_low_{_implementation_name}"] = diff_ci.confidence_interval.low
                    row[f"mAP_diff_CI_high_{_implementation_name}"] = diff_ci.confidence_interval.high
                    row[f"mAP_pval_{_implementation_name}"] = perm_test.pvalue
        
        results.append(row)
        pbar.update(1)


100%|██████████| 16/16 [00:05<00:00,  2.80it/s]

In [10]:
results_df = pd.DataFrame(results)
results_df.sort_values(by=["Data Percentage", "Implementation"], inplace=True, ascending=True)

### Generate the figures
The figures are reproduced using Plotly.
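
Plotly's `error_y` and `error_y_minus` arguments expect offsets from the plotted value rather than absolute bounds, which is why the cells below convert each confidence interval into `e_plus_*` and `e_minus_*` columns. A minimal sketch of that conversion follows; the table and its numbers are invented for illustration, and only the column naming follows the notebook.

```python
import pandas as pd

# Invented numbers for illustration only.
demo = pd.DataFrame({
    "Implementation": ["A", "B"],
    "AUC": [0.80, 0.70],
    "AUC_low_CI": [0.70, 0.65],
    "AUC_high_CI": [0.85, 0.80],
})

# Upper offset: distance from the point estimate up to the CI's upper bound.
demo["e_plus_AUC"] = demo["AUC_high_CI"] - demo["AUC"]
# Lower offset: distance from the point estimate down to the CI's lower bound.
demo["e_minus_AUC"] = demo["AUC"] - demo["AUC_low_CI"]

print(demo[["Implementation", "e_plus_AUC", "e_minus_AUC"]].round(3))
```

Because the two offsets are stored separately, the error bars can be asymmetric around the point estimate.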

In [13]:
results_df_ = results_df[results_df["Data Percentage"] == 1].copy()  # copy so the column assignments below do not trigger pandas' SettingWithCopyWarning
for metric in ["mAP", "AUC"]:
    results_df_[f"e_plus_{metric}"] = results_df_[f"{metric}_high_CI"] - results_df_[metric]
    results_df_[f"e_minus_{metric}"] = results_df_[metric] - results_df_[f"{metric}_low_CI"] 

    colors = ['49,130,189', '0, 163, 213', '115,115,115', '189,189,189']

    fig = px.bar(results_df_, x="Implementation", y=metric, error_y=f"e_plus_{metric}", error_y_minus=f"e_minus_{metric}", color='Implementation',
                 template="simple_white",  labels={'Model': '', metric: metric, "Implementation": "Implementation approaches"}, color_discrete_sequence=[f"rgb({color})" for color in colors],
                 range_y=[0.4, 1])               

    title = "Full training set"
    fig.update_layout(
        title=title,
        title_x=0.5,
        width=400,
        height=500,
        autosize=False,
        template="simple_white",
        legend=dict(orientation="h"),
        xaxis=dict(showticklabels=False),
        yaxis=dict(showgrid=True),
        xaxis_title=None,
        showlegend=True,
    )

    fig.show()
    fig.data = []




In [14]:
for metric in ["mAP", "AUC"]:
    results_df[f"e_plus_{metric}"] = results_df[f"{metric}_high_CI"] - results_df[metric]
    results_df[f"e_minus_{metric}"] = results_df[metric] - results_df[f"{metric}_low_CI"] 

    colors = ['49,130,189', '0, 163, 213', '115,115,115', '189,189,189']  # one color per implementation, matching the bar charts
    fig = px.line(results_df, x="Data Percentage", y=metric, error_y=f"e_plus_{metric}",
                   error_y_minus=f"e_minus_{metric}", color='Implementation', markers=True, template="simple_white", 
                     labels={'Data Percentage': 'Percentage', metric: metric}, color_discrete_sequence=[f"rgb({color})" for color in colors], range_y=[0.4, 1])
    
    fig.update_traces(marker=dict(size=10)) 

    title = "Percentages of training data"
    fig.update_traces(
        error_y = dict( 
            thickness=1,
        ),
        
    )
    fig.update_layout(title=title, width=600, height=500, autosize=True, 
        showlegend=True,
    legend=dict(
        yanchor="bottom",
        y=0.01,
        orientation="h",
        xanchor="right",
        x=1.2
    ), template='simple_white', title_x=0.5,
        yaxis=dict(showgrid=True),
        xaxis=dict(
            tickmode="array",
            tickvals=[0.1, 0.2, 0.5, 1],
            ticktext=["10%", "20%", "50%", "100%"],
            autorange="reversed",
        ),
    )
   
    fig.show()
    fig.data = []
