In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from evaluation import evaluate

# Example Evaluation

For evaluation, we always compute both the unadjusted and the adjusted performance. Because the adjusted metrics are computed relative to a baseline classifier, an appropriate baseline must be provided. We use AlexNet as the baseline and have made its results available for download (see README).

After users submit their diagnostic prediction scores using our template files, they receive result files containing the unadjusted metrics on SAM, SAM-C and SAM-P for each individual transformation. From these, the unadjusted and adjusted *mCBE*, *relative mCBE* and *mFR* can be calculated with our code, either interactively or via the command line.
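As a rough illustration of how per-transformation results could be aggregated into a single score, the sketch below averages the user's error rates (unadjusted) and, for the adjusted score, first divides each transformation's error by the baseline's error on the same transformation. This normalization is an assumption modeled on the ImageNet-C mean corruption error convention; the dictionaries, transformation names and helper functions are hypothetical, and the provided `evaluate` may differ in detail.

```python
from statistics import mean


def unadjusted_mean(user_errors: dict[str, float]) -> float:
    """Plain average of the user's per-transformation error rates."""
    return mean(user_errors.values())


def adjusted_mean(user_errors: dict[str, float],
                  baseline_errors: dict[str, float]) -> float:
    """Average of per-transformation ratios user / baseline (assumed normalization)."""
    if user_errors.keys() != baseline_errors.keys():
        raise ValueError("user and baseline must cover the same transformations")
    return mean(user_errors[t] / baseline_errors[t] for t in user_errors)


# Hypothetical error rates (fractions in [0, 1]); names and values are illustrative only.
user = {"blur": 0.20, "noise": 0.30}
baseline = {"blur": 0.25, "noise": 0.40}
print(f"Unadjusted: {unadjusted_mean(user) * 100:.2f}")           # Unadjusted: 25.00
print(f"Adjusted: {adjusted_mean(user, baseline) * 100:.2f}")     # Adjusted: 77.50
```

Under this convention an adjusted score below 100 means the model is more robust than the baseline, and above 100 means it is less robust.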

**mCBE and mFR**

Calculating the unadjusted/adjusted *mCBE* and *mFR* requires the path to a scored SAM-C/SAM-P or SAM-C-Extra/SAM-P-Extra submission file and the path to the corresponding baseline file (preferably our AlexNet baseline). *mCBE* is obtained from SAM-C files and *mFR* from SAM-P files. Users must make sure that the DatasetName (e.g. SAM-P or SAM-C) matches for both file paths.
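Since mismatched files silently produce meaningless scores, a quick sanity check before calling `evaluate` can help. The sketch below is a hypothetical helper that infers the dataset from the file name, assuming the `<Model>_<DatasetName>.csv` naming used throughout this notebook; it does not read the DatasetName stored inside the files.

```python
from pathlib import Path


def dataset_name(path: str) -> str:
    """Infer the dataset from a file named '<Model>_<DatasetName>.csv' (assumed convention)."""
    stem = Path(path).stem  # e.g. "ResNet50_SAM-C"
    if "_" not in stem:
        raise ValueError(f"cannot infer dataset from {path!r}")
    return stem.rsplit("_", 1)[1]  # e.g. "SAM-C"


def check_same_dataset(user_file: str, baseline_file: str) -> str:
    """Raise if user and baseline files refer to different datasets; return the dataset."""
    user_ds, base_ds = dataset_name(user_file), dataset_name(baseline_file)
    if user_ds != base_ds:
        raise ValueError(f"dataset mismatch: {user_ds!r} vs {base_ds!r}")
    return user_ds


print(check_same_dataset("example_sub/ResNet50_SAM-C.csv", "baseline/Baseline_SAM-C.csv"))
```

Splitting on the last underscore keeps model names that themselves contain underscores intact, as long as the dataset names (SAM-C, SAM-C-Extra, ...) do not.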

**relative mCBE**

Calculating the unadjusted/adjusted *relative mCBE* requires two additional file paths, pointing to the files that contain the 'clean' performance, i.e. the performance of the user's classifier and of the baseline classifier on SAM.
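The clean files are needed because a relative score measures the *degradation* from clean to transformed data rather than the raw error. A minimal sketch, assuming the ImageNet-C style definition of relative corruption error (subtract the clean error from each transformation's error, then, for the adjusted variant, divide by the baseline's degradation); all names and numbers are hypothetical.

```python
from statistics import mean


def relative_unadjusted(user_errors: dict[str, float], user_clean: float) -> float:
    """Mean gap between per-transformation error and clean error."""
    return mean(e - user_clean for e in user_errors.values())


def relative_adjusted(user_errors: dict[str, float], user_clean: float,
                      baseline_errors: dict[str, float], baseline_clean: float) -> float:
    """Mean of user degradation divided by baseline degradation (assumed definition)."""
    ratios = []
    for t, e in user_errors.items():
        base_gap = baseline_errors[t] - baseline_clean
        if base_gap == 0:
            raise ZeroDivisionError(f"baseline shows no degradation on {t!r}")
        ratios.append((e - user_clean) / base_gap)
    return mean(ratios)


# Hypothetical error rates on transformed data and on clean data (SAM).
user, user_clean = {"blur": 0.20, "noise": 0.30}, 0.10
baseline, baseline_clean = {"blur": 0.30, "noise": 0.50}, 0.20
print(f"{relative_unadjusted(user, user_clean) * 100:.2f}")
print(f"{relative_adjusted(user, user_clean, baseline, baseline_clean) * 100:.2f}")
```

A model can thus have a low unadjusted mCBE but a relative mCBE above 100 if its clean error is low and it degrades more than the baseline under the transformations.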

## 1. Interactive

### SAM-C performance

In [3]:
# User submission file and matching baseline file (both SAM-C)
perf = evaluate("example_sub/ResNet50_SAM-C.csv", "baseline/Baseline_SAM-C.csv")
print(f"Unadjusted mCBE: {perf['unadjusted']*100:.2f}")
print(f"Adjusted mCBE: {perf['adjusted']*100:.2f}")

Unadjusted mCBE: 26.99
Adjusted mCBE: 88.29


In [4]:
# Passing the clean (SAM) files of the user model and the baseline yields relative mCBE
perf = evaluate("example_sub/ResNet50_SAM-C.csv", "baseline/Baseline_SAM-C.csv", "example_sub/ResNet50_SAM.csv", "baseline/Baseline_SAM.csv")
print(f"Unadjusted relative mCBE: {perf['unadjusted']*100:.2f}")
print(f"Adjusted relative mCBE: {perf['adjusted']*100:.2f}")

Unadjusted relative mCBE: 8.82
Adjusted relative mCBE: 118.40


### SAM-P performance

In [5]:
perf = evaluate("example_sub/ResNet50_SAM-P.csv", "baseline/Baseline_SAM-P.csv")
print(f"Unadjusted mFR: {perf['unadjusted']*100:.2f}")
print(f"Adjusted mFR: {perf['adjusted']*100:.2f}")

Unadjusted mFR: 6.73
Adjusted mFR: 138.71


### SAM-C-Extra performance

In [6]:
perf = evaluate("example_sub/ResNet50_SAM-C-Extra.csv", "baseline/Baseline_SAM-C-Extra.csv")
print(f"Unadjusted mCBE: {perf['unadjusted']*100:.2f}")
print(f"Adjusted mCBE: {perf['adjusted']*100:.2f}")

Unadjusted mCBE: 28.16
Adjusted mCBE: 87.40


In [7]:
perf = evaluate("example_sub/ResNet50_SAM-C-Extra.csv", "baseline/Baseline_SAM-C-Extra.csv", "example_sub/ResNet50_SAM.csv", "baseline/Baseline_SAM.csv")
print(f"Unadjusted relative mCBE: {perf['unadjusted']*100:.2f}")
print(f"Adjusted relative mCBE: {perf['adjusted']*100:.2f}")

Unadjusted relative mCBE: 9.98
Adjusted relative mCBE: 94.32


### SAM-P-Extra performance

In [8]:
perf = evaluate("example_sub/ResNet50_SAM-P-Extra.csv", "baseline/Baseline_SAM-P-Extra.csv")
print(f"Unadjusted mFR: {perf['unadjusted']*100:.2f}")
print(f"Adjusted mFR: {perf['adjusted']*100:.2f}")

Unadjusted mFR: 6.64
Adjusted mFR: 141.03


## 2. Command line
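The command-line flags used in the cells below mirror the positional arguments of `evaluate`: `-uf` (user file), `-bf` (baseline file), and optionally `-cluf`/`-clbf` (clean user/baseline files for relative mCBE). To score several datasets in one go, the invocations can be generated from Python. The sketch below is a hypothetical convenience helper; it only builds and prints the commands, and actually running them assumes `evaluation.py` is in the working directory.

```python
import shlex
import sys
from typing import List, Optional


def build_command(user_file: str, baseline_file: str,
                  clean_user_file: Optional[str] = None,
                  clean_baseline_file: Optional[str] = None) -> List[str]:
    """Assemble an evaluation.py call using the flags shown in this notebook."""
    if (clean_user_file is None) != (clean_baseline_file is None):
        raise ValueError("relative mCBE needs both clean files (-cluf and -clbf)")
    cmd = [sys.executable, "evaluation.py", "-uf", user_file, "-bf", baseline_file]
    if clean_user_file is not None:
        cmd += ["-cluf", clean_user_file, "-clbf", clean_baseline_file]
    return cmd


for ds in ["SAM-C", "SAM-P", "SAM-C-Extra", "SAM-P-Extra"]:
    cmd = build_command(f"example_sub/ResNet50_{ds}.csv", f"baseline/Baseline_{ds}.csv")
    print(shlex.join(cmd))
    # To execute: import subprocess; subprocess.run(cmd, check=True)
```

Using `sys.executable` ensures the script runs with the same interpreter as the notebook kernel.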

### SAM-C performance

In [29]:
! python evaluation.py -uf "example_sub/ResNet50_SAM-C.csv" -bf "baseline/Baseline_SAM-C.csv"

Unadjusted metric: 26.99
Adjusted metric: 88.29


In [30]:
! python evaluation.py -uf "example_sub/ResNet50_SAM-C.csv" -bf "baseline/Baseline_SAM-C.csv" -cluf "example_sub/ResNet50_SAM.csv" -clbf "baseline/Baseline_SAM.csv"

Unadjusted metric: 8.82
Adjusted metric: 118.40


### SAM-P performance

In [31]:
! python evaluation.py -uf "example_sub/ResNet50_SAM-P.csv" -bf "baseline/Baseline_SAM-P.csv"

Unadjusted metric: 6.73
Adjusted metric: 138.71


### SAM-C-Extra performance

In [32]:
! python evaluation.py -uf "example_sub/ResNet50_SAM-C-Extra.csv" -bf "baseline/Baseline_SAM-C-Extra.csv"

Unadjusted metric: 28.16
Adjusted metric: 87.40


In [33]:
! python evaluation.py -uf "example_sub/ResNet50_SAM-C-Extra.csv" -bf "baseline/Baseline_SAM-C-Extra.csv" -cluf "example_sub/ResNet50_SAM.csv" -clbf "baseline/Baseline_SAM.csv"

Unadjusted metric: 9.98
Adjusted metric: 94.32


### SAM-P-Extra performance

In [34]:
! python evaluation.py -uf "example_sub/ResNet50_SAM-P-Extra.csv" -bf "baseline/Baseline_SAM-P-Extra.csv"

Unadjusted metric: 6.64
Adjusted metric: 141.03
