This script verifies that commonly used metrics for binary classification are not invariant to the class distribution of the test set. Suppose we are given a binary classifier that predicts class 0 perfectly (100% accuracy) and class 1 with 50% accuracy. Now consider two test sets: the first contains 1 sample of class 0 and 2 samples of class 1; the second contains 2 samples of each class. Assuming the samples are i.i.d., the predicted labels/scores are given in the following cell.

We evaluate 6 metrics: weighted F1, `matthews_corrcoef` ($\phi$), `cohen_kappa_score` ($\kappa$), `average_precision_score` (AUPRC), and our proposed balanced accuracy (Balanced Acc) and balanced F1. The results follow:


| Metric | Dataset 1 | Dataset 2 | Invariant |
| :----: | :-------: | :-------: | :-------: |
| Weighted F1 | 0.6667 | 0.7333 | ❌ |
| $\phi$ | 0.5000 | 0.5774 | ❌  |
| $\kappa$ | 0.4000 | 0.5000 | ❌  |
| AUPRC | 0.8333 | 0.7500 | ❌  |
| Balanced Acc | 0.7500 | 0.7500 | ✅ |
| Balanced F1 | 0.7333 | 0.7333 | ✅ |


**Conclusion: of the six metrics tested, only the proposed balanced metrics are invariant to the class distribution of the test set.**
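
To see where the dependence on the class distribution comes from, it can help to write precision as a function of the class prior. The sketch below is an illustration and not part of the original script; the helper name `precision_from_rates` is hypothetical. Per-class recall (the true positive rate) is a property of the classifier alone, but precision mixes rates from both classes weighted by their priors, so it moves when the class ratio changes.

```python
def precision_from_rates(tpr, fpr, prior_pos):
    """Precision of the positive class from TPR, FPR and the positive-class prior.

    Hypothetical helper for illustration: precision = pi*TPR / (pi*TPR + (1-pi)*FPR).
    """
    tp = prior_pos * tpr            # expected fraction of true positives
    fp = (1.0 - prior_pos) * fpr    # expected fraction of false positives
    return tp / (tp + fp)


# Treat class 0 as "positive": it is always recognised (TPR = 1.0), and half of
# the class-1 samples are mistaken for it (FPR = 0.5), as in the toy classifier.
p_dataset1 = precision_from_rates(1.0, 0.5, prior_pos=1 / 3)  # 1 of 3 samples is class 0
p_dataset2 = precision_from_rates(1.0, 0.5, prior_pos=1 / 2)  # 2 of 4 samples are class 0
print(f"{p_dataset1:.4f} {p_dataset2:.4f}")  # 0.5000 0.6667
```

The classifier is identical in both cases, yet the class-0 precision changes from 0.5 to 0.6667, and every metric built on precision inherits that shift.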

In [21]:
import script._init_paths
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score, average_precision_score, f1_score
from lib.core.evaluate import balanced_f1, balanced_accuracy_score


# Dataset 1
gt1 = [0, 1, 1] # ground truth labels
pred1 = [0, 0, 1] # predicted labels/scores

# Dataset 2
gt2 = [0, 0, 1, 1] # ground truth labels
pred2 = [0, 0, 0, 1] # predicted labels/scores

print('Weighted F1 for dataset 1: %.4f' %(f1_score(gt1, pred1, average="weighted")))
print('Weighted F1 for dataset 2: %.4f' %(f1_score(gt2, pred2, average="weighted")))
print('phi coefficient for dataset 1: %.4f' %(matthews_corrcoef(gt1, pred1)))
print('phi coefficient for dataset 2: %.4f' %(matthews_corrcoef(gt2, pred2)))
print('cohen kappa for dataset 1: %.4f' %(cohen_kappa_score(gt1, pred1)))
print('cohen kappa for dataset 2: %.4f' %(cohen_kappa_score(gt2, pred2)))
print('AUPRC for dataset 1: %.4f' %(average_precision_score(gt1, pred1)))
print('AUPRC for dataset 2: %.4f' %(average_precision_score(gt2, pred2)))
print('Balanced Acc for dataset 1: %.4f' %(balanced_accuracy_score(gt1, pred1)))
print('Balanced Acc for dataset 2: %.4f' %(balanced_accuracy_score(gt2, pred2)))
print('Balanced F1 for dataset 1: %.4f' %(balanced_f1(gt1, pred1)))
print('Balanced F1 for dataset 2: %.4f' %(balanced_f1(gt2, pred2)))

Weighted F1 for dataset 1: 0.6667
Weighted F1 for dataset 2: 0.7333
phi coefficient for dataset 1: 0.5000
phi coefficient for dataset 2: 0.5774
cohen kappa for dataset 1: 0.4000
cohen kappa for dataset 2: 0.5000
AUPRC for dataset 1: 0.8333
AUPRC for dataset 2: 0.7500
Balanced Acc for dataset 1: 0.7500
Balanced Acc for dataset 2: 0.7500
Balanced F1 for dataset 1: 0.7333
Balanced F1 for dataset 2: 0.7333
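
The implementations of `balanced_accuracy_score` and `balanced_f1` in `lib.core.evaluate` are not shown here. As a minimal sketch of one way to obtain prior-invariant scores, the code below row-normalises the confusion matrix so that every true class carries the same total weight before recall, precision and F1 are computed. This is an assumption that happens to reproduce the values printed above; it is not necessarily the paper's definition, and the name `balanced_scores` is hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def balanced_scores(gt, pred):
    """Return (balanced accuracy, class-averaged F1) from a row-normalised confusion matrix.

    Illustrative only: assumes every class appears in gt and is predicted at
    least once, otherwise the divisions below would hit zero.
    """
    cm = confusion_matrix(gt, pred).astype(float)  # rows: true class, columns: predicted class
    cm /= cm.sum(axis=1, keepdims=True)            # each true class now has total weight 1
    recall = np.diag(cm)                           # per-class recall
    precision = np.diag(cm) / cm.sum(axis=0)       # precision under equal class priors
    f1 = 2 * precision * recall / (precision + recall)
    return recall.mean(), f1.mean()


for name, gt, pred in [("dataset 1", [0, 1, 1], [0, 0, 1]),
                       ("dataset 2", [0, 0, 1, 1], [0, 0, 0, 1])]:
    acc, f1 = balanced_scores(gt, pred)
    print(f"{name}: {acc:.4f} {f1:.4f}")  # 0.7500 0.7333 for both datasets
```

Because the normalised matrix is the same for both datasets, any score derived from it is identical across them by construction.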


As proven in Theorem 1 of the paper, only the proposed balanced metrics are *invariant* to label distribution shift. Therefore, to retain consistency with the standard split (balanced test set), only these two metrics can be used on an *imbalanced* test set. As discussed in the paper, this enables a much larger test set, which improves the quality and reliability of the evaluation statistics. We provide empirical evidence for this claim with the following toy example.

In the following cell, we consider a 3-way classification problem in which only 10 samples are available for the tail class 2. We can construct an imbalanced test set with per-class sample sizes [1000, 100, 10] (used for the balanced metrics), whereas the largest possible balanced test set has per-class sample sizes [10, 10, 10] (used for the remaining, non-invariant metrics). Suppose we are given a classifier that outputs logits/scores for each class following the probability table below:

| Class | Dim 0 | Dim 1 | Dim 2 |
| :---: | :---: | :---: | :---: |
| 0     | Uniform(0, 3) | Uniform(0, 2) | Uniform(0, 1) | 
| 1     | Uniform(0, 2) | Uniform(0, 3) | Uniform(0, 2) | 
| 2     | Uniform(0, 1) | Uniform(0, 2) | Uniform(0, 3) | 

where Uniform(a, b) stands for a continuous uniform distribution on the interval [a, b]. Given a random seed, we resample the model output from the probability table above and compute the balanced metrics and the other metrics. We repeat the process for multiple random seeds and track the standard deviation of each metric across all seeds:

| Metric | 3 seeds | 5 seeds | 10 seeds | 20 seeds | 50 seeds | 100 seeds |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | 
| Weighted F1 | 0.0322 | 0.0563 | 0.1036 |  0.0892 | 0.0760 | 0.0853 | 
| $\phi$ | 0.0513 | 0.0800 |  0.1427 | 0.1281 | 0.1140 | 0.1275 | 
| $\kappa$ | 0.0471 | 0.0800 |  0.1435 | 0.1281 | 0.1132 | 0.1269 | 
| AUPRC | 0.0369 | 0.0289 |  0.0625 | 0.0703 | 0.0696 | 0.0763 | 
| AUROC | 0.0295 | **0.0242** | 0.0539 | 0.0613 | 0.0602 | 0.0672 | 
| Balanced Acc | 0.0310 | 0.0414 |  0.0512 | 0.0471 | 0.0522 | 0.0483 | 
| Balanced F1 | **0.0286** | 0.0374 |  **0.0504** | **0.0463** | **0.0506** | **0.0468** | 

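For each column, the metric with the lowest standard deviation is **bolded**.

**Conclusion: the proposed balanced metrics achieve consistently low variance, with Balanced F1 attaining the lowest standard deviation in five of the six columns. Because they are invariant to the label distribution, they can be evaluated on the much larger imbalanced test set.**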
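
The sampling scheme in the probability table can also be written in vectorised form. The following is a minimal sketch and not the notebook's own generator: the names `UPPER` and `sample_scores` are hypothetical, and it uses `np.random.default_rng`, so the draws differ from those produced with `np.random.seed`.

```python
import numpy as np

# Upper bounds from the probability table: row = true class, column = score dimension.
UPPER = np.array([[3.0, 2.0, 1.0],
                  [2.0, 3.0, 2.0],
                  [1.0, 2.0, 3.0]])


def sample_scores(class_sizes, seed=0):
    """Draw scores for every sample; returns (labels, scores, predicted labels)."""
    rng = np.random.default_rng(seed)
    # Labels 0, 1, 2 repeated according to the requested per-class sizes.
    gt = np.repeat(np.arange(len(class_sizes)), class_sizes)
    # One Uniform(0, upper) draw per (sample, dimension); the upper bound is
    # looked up from the row of the table that matches each sample's class.
    scores = rng.uniform(0.0, UPPER[gt])
    return gt, scores, scores.argmax(axis=1)


gt, scores, pred = sample_scores([1000, 100, 10], seed=0)
print(scores.shape)  # (1110, 3)
```

Indexing `UPPER` with the label vector replaces the three per-class blocks with a single broadcasted draw, which keeps the table and the code visibly in sync.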

In [95]:
import script._init_paths
import numpy as np
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score, average_precision_score, f1_score, roc_auc_score
from lib.core.evaluate import balanced_f1, balanced_accuracy_score


# Dataset random generator: draws per-class scores from the probability table above
def data_gen(seed=0, samples=(10, 10, 10)):
    np.random.seed(seed)
    sample0, sample1, sample2 = samples
    gt0 = np.zeros(sample0)
    score0 = np.concatenate([np.random.uniform(0, 3, sample0).reshape(-1, 1),
                             np.random.uniform(0, 2, sample0).reshape(-1, 1),
                             np.random.uniform(0, 1, sample0).reshape(-1, 1)], axis=1)

    gt1 = np.ones(sample1)
    score1 = np.concatenate([np.random.uniform(0, 2, sample1).reshape(-1, 1),
                             np.random.uniform(0, 3, sample1).reshape(-1, 1),
                             np.random.uniform(0, 2, sample1).reshape(-1, 1)], axis=1)

    gt2 = np.ones(sample2) * 2
    score2 = np.concatenate([np.random.uniform(0, 1, sample2).reshape(-1, 1),
                             np.random.uniform(0, 2, sample2).reshape(-1, 1),
                             np.random.uniform(0, 3, sample2).reshape(-1, 1)], axis=1)
    
    gt = np.concatenate([gt0, gt1, gt2])
    score = np.concatenate([score0, score1, score2])
    softmax = np.exp(score)/np.sum(np.exp(score), axis=1).reshape(-1, 1)
    pred = np.argmax(score, axis=1)
    return gt.astype('int16'), softmax, pred

# Convert ground truth labels to one-hot labels
def one_hot(gt):
    encoded = np.zeros((gt.size, gt.max() + 1))
    encoded[np.arange(gt.size), gt] = 1
    return encoded

w_f1 = []
phi = []
ck = []
auprc = []
auroc = []
b_acc = []
b_f1 = []

for seed in range(0, 100):
    # Balanced metrics are evaluated on the large, imbalanced test set.
    # data_gen seeds NumPy itself, so no separate np.random.seed call is needed.
    gt, _, pred = data_gen(seed, samples=[1000, 100, 10])
    b_acc.append(balanced_accuracy_score(gt, pred))
    b_f1.append(balanced_f1(gt, pred))
    # The remaining metrics are evaluated on the small, balanced test set.
    gt, softmax, pred = data_gen(seed, samples=[10, 10, 10])
    w_f1.append(f1_score(gt, pred, average="weighted"))
    phi.append(matthews_corrcoef(gt, pred))
    ck.append(cohen_kappa_score(gt, pred))
    auprc.append(average_precision_score(one_hot(gt), softmax))
    auroc.append(roc_auc_score(one_hot(gt), softmax))

print('Weighted f1 standard deviation: %.4f' %(np.std(w_f1)))
print('Phi coefficient standard deviation: %.4f' %(np.std(phi)))
print('Cohen kappa standard deviation: %.4f' %(np.std(ck)))
print('AUPRC standard deviation: %.4f' %(np.std(auprc)))      
print('AUROC standard deviation: %.4f' %(np.std(auroc)))
print('Balanced Acc standard deviation: %.4f' %(np.std(b_acc)))
print('Balanced F1 standard deviation: %.4f' %(np.std(b_f1)))

Weighted f1 standard deviation: 0.0853
Phi coefficient standard deviation: 0.1275
Cohen kappa standard deviation: 0.1269
AUPRC standard deviation: 0.0763
AUROC standard deviation: 0.0672
Balanced Acc standard deviation: 0.0483
Balanced F1 standard deviation: 0.0468
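
The cell above prints the standard deviation for 100 seeds only. A minimal sketch of how the other columns of the table could be tabulated from a single run is given below, assuming each column uses the first *n* seeds (the notebook does not state this, and the helper name `std_by_seed_count` is hypothetical).

```python
import numpy as np


def std_by_seed_count(values, counts=(3, 5, 10, 20, 50, 100)):
    """Standard deviation of the first n per-seed values, for each n in counts.

    Counts larger than the number of available values are skipped.
    """
    values = np.asarray(values, dtype=float)
    return {n: float(np.std(values[:n])) for n in counts if n <= len(values)}


# Made-up stand-in values; in the notebook one would pass a list such as b_f1.
toy_scores = np.random.default_rng(0).normal(loc=0.8, scale=0.05, size=100)
for n, sd in std_by_seed_count(toy_scores).items():
    print(f"{n:>3} seeds: {sd:.4f}")
```

Like the cell above, this uses `np.std` with its default `ddof=0`, i.e. the population standard deviation.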
