This script verifies that commonly used metrics for binary classification are not invariant to the class distribution of the test set. Suppose we are given a binary classifier that predicts class 0 perfectly (100% accuracy) and class 1 with 50% accuracy. Now consider two test sets: the first contains 1 sample of class 0 and 2 samples of class 1; the second contains 2 samples of each class. Assuming the samples are i.i.d., the predicted labels/scores are given in the following cell.
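
As a quick sanity check of this setup, the sketch below recomputes the per-class accuracy (recall) on both test sets in plain Python. The helper `per_class_recall` is a hypothetical name introduced only for illustration; the label lists are copied from the code cell.

```python
def per_class_recall(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    recall = {}
    for cls in sorted(set(y_true)):
        # Predictions made for the samples whose true label is `cls`.
        preds = [p for t, p in zip(y_true, y_pred) if t == cls]
        recall[cls] = sum(p == cls for p in preds) / len(preds)
    return recall


# Labels copied from the notebook's code cell.
gt1, pred1 = [0, 1, 1], [0, 0, 1]
gt2, pred2 = [0, 0, 1, 1], [0, 0, 0, 1]

print(per_class_recall(gt1, pred1))  # {0: 1.0, 1: 0.5}
print(per_class_recall(gt2, pred2))  # {0: 1.0, 1: 0.5}
```

The classifier behaves identically on both test sets; only the class proportions differ.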

We evaluate 6 metrics: weighted F1, matthews_corrcoef ($\phi$), cohen_kappa_score ($\kappa$), average_precision_score (AUPRC), and our proposed balanced accuracy and balanced F1. The results are as follows:


| Metric | Dataset 1 | Dataset 2 | Invariant |
| :----: | :-------: | :-------: | :-------: |
| Weighted F1 | 0.6667 | 0.7333 | ❌ |
| $\phi$ | 0.5000 | 0.5774 | ❌ |
| $\kappa$ | 0.4000 | 0.5000 | ❌ |
| AUPRC | 0.8333 | 0.7500 | ❌ |
| Balanced Acc | 0.7500 | 0.7500 | ✅ |
| Balanced F1 | 0.7333 | 0.7333 | ✅ |


**Conclusion: of the six metrics evaluated, only the two proposed balanced metrics give the same value on both datasets, i.e. are invariant to the class distribution of the test set.**
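
To see where the dependence on class distribution comes from, the following is a minimal from-scratch sketch that recomputes balanced accuracy, $\phi$ and $\kappa$ from the confusion counts using their textbook definitions. It does not use `lib.core.evaluate` and is not the repository's implementation; the function names are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    return tn, fp, fn, tp


def balanced_accuracy(tn, fp, fn, tp):
    # Mean of per-class recalls; each term is normalised by its own class size.
    return 0.5 * (tn / (tn + fp) + tp / (tp + fn))


def phi(tn, fp, fn, tp):
    # Matthews correlation coefficient; the denominator mixes both class sizes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


def kappa(tn, fp, fn, tp):
    # Cohen's kappa: observed agreement corrected for chance agreement.
    n = tn + fp + fn + tp
    p_obs = (tn + tp) / n
    p_exp = ((tn + fp) * (tn + fn) + (tp + fn) * (tp + fp)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)


datasets = {
    "dataset 1": ([0, 1, 1], [0, 0, 1]),
    "dataset 2": ([0, 0, 1, 1], [0, 0, 0, 1]),
}
for name, (gt, pred) in datasets.items():
    counts = confusion_counts(gt, pred)
    print(f"{name}: bal_acc={balanced_accuracy(*counts):.4f} "
          f"phi={phi(*counts):.4f} kappa={kappa(*counts):.4f}")
# dataset 1: bal_acc=0.7500 phi=0.5000 kappa=0.4000
# dataset 2: bal_acc=0.7500 phi=0.5774 kappa=0.5000
```

Balanced accuracy only uses ratios taken within each class, so adding more samples of one class leaves it unchanged. $\phi$ and $\kappa$ combine counts across classes, so their values shift with the class proportions.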

In [21]:
import script._init_paths
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score, average_precision_score, f1_score
from lib.core.evaluate import balanced_f1, balanced_accuracy_score


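# Dataset 1: 1 sample of class 0, 2 samples of class 1
gt1 = [0, 1, 1]    # ground truth labels
pred1 = [0, 0, 1]  # hard predictions; also passed as scores to average_precision_score

# Dataset 2: 2 samples of each class
gt2 = [0, 0, 1, 1]    # ground truth labels
pred2 = [0, 0, 0, 1]  # hard predictions; also passed as scores to average_precision_score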

print('Weighted F1 for dataset 1: %.4f' %(f1_score(gt1, pred1, average="weighted")))
print('Weighted F1 for dataset 2: %.4f' %(f1_score(gt2, pred2, average="weighted")))
print('phi coefficient for dataset 1: %.4f' %(matthews_corrcoef(gt1, pred1)))
print('phi coefficient for dataset 2: %.4f' %(matthews_corrcoef(gt2, pred2)))
print('cohen kappa for dataset 1: %.4f' %(cohen_kappa_score(gt1, pred1)))
print('cohen kappa for dataset 2: %.4f' %(cohen_kappa_score(gt2, pred2)))
print('AUPRC for dataset 1: %.4f' %(average_precision_score(gt1, pred1)))
print('AUPRC for dataset 2: %.4f' %(average_precision_score(gt2, pred2)))
print('Balanced Acc for dataset 1: %.4f' %(balanced_accuracy_score(gt1, pred1)))
print('Balanced Acc for dataset 2: %.4f' %(balanced_accuracy_score(gt2, pred2)))
print('Balanced F1 for dataset 1: %.4f' %(balanced_f1(gt1, pred1)))
print('Balanced F1 for dataset 2: %.4f' %(balanced_f1(gt2, pred2)))

Weighted F1 for dataset 1: 0.6667
Weighted F1 for dataset 2: 0.7333
phi coefficient for dataset 1: 0.5000
phi coefficient for dataset 2: 0.5774
cohen kappa for dataset 1: 0.4000
cohen kappa for dataset 2: 0.5000
AUPRC for dataset 1: 0.8333
AUPRC for dataset 2: 0.7500
Balanced Acc for dataset 1: 0.7500
Balanced Acc for dataset 2: 0.7500
Balanced F1 for dataset 1: 0.7333
Balanced F1 for dataset 2: 0.7333
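
The AUPRC values can also be reproduced by hand, assuming `average_precision_score` uses the step-wise sum $\text{AP} = \sum_n (R_n - R_{n-1}) P_n$ over score thresholds, where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold. With hard 0/1 predictions there are only two thresholds: at score 1 the precision is 1 and the recall is 0.5; at score 0 every sample is predicted positive, so the recall is 1 and the precision equals the fraction of class 1 samples.

$$
\text{AP}_1 = (0.5 - 0)\cdot 1 + (1 - 0.5)\cdot \tfrac{2}{3} = \tfrac{5}{6} \approx 0.8333,
\qquad
\text{AP}_2 = (0.5 - 0)\cdot 1 + (1 - 0.5)\cdot \tfrac{2}{4} = 0.75
$$

The second term is the share of class 1 in the test set ($2/3$ versus $2/4$), which is why AUPRC differs between the two datasets.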
