# Fairness estimation with ANOVA

We want to test whether or not our models output equally high malignancy probabilities $P(y=\text{malignant} \mid x)$ across skin color groups.

We'll perform this analysis on the outputs of our three best experiments to assess their fairness and compare them.

In [1]:
import os
import numpy as np
from scipy.stats import f_oneway

In [2]:
# Load data
root_dir = "C:\\Users\\Duje\\Desktop\\fer\\8. semestar\\lumen\\rezultati\\02 eksperimenti\\"
experiments = [
    "10 transformer\\normal", 
    "12 domain discriminative\\new", 
    "16 efficient m\\new",
]

y_true = []
groups = []
y_prob = []

for exp in experiments:
    eval_dir = os.path.join(root_dir, exp, "eval")
    y_true.append(np.load(os.path.join(eval_dir, "best_model", "y_true.npy")))
    groups.append(np.load(os.path.join(eval_dir, "best_model", "groups.npy")))

    # We only need the posterior probability of the malignant class
    p = np.load(os.path.join(eval_dir, "best_model", "probs.npy"))
    y_prob.append(p[:, 1])


# We know that y_true and groups are the same across experiments
assert (y_true[0] == y_true[1]).all() and (y_true[1] == y_true[2]).all()
assert (groups[0] == groups[1]).all() and (groups[1] == groups[2]).all()
y_true = y_true[0]
groups = groups[0]

# One column for each model
y_prob = np.stack(y_prob, axis=1)

y_true.shape, groups.shape, y_prob.shape

((6562,), (6562,), (6562, 3))

## Positive subset

We want to test whether the trained models behave differently across skin color groups on ground-truth (GT) malignant samples, regardless of whether the model labeled them correctly.

This gives us an overall view of each model's fairness and checks for confidence disparities across all malignant samples.

- $H_0$ - Mean probabilities $P(y=\text{malignant} \mid x)$ are equal for GT malignant samples across all skin color groups
- $H_1$ - At least one group's mean probability differs from the others
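
To make the test statistic concrete, here is a minimal sketch of the one-way ANOVA $F$ statistic (between-group mean square divided by within-group mean square) computed by hand and checked against `scipy.stats.f_oneway`. The helper `one_way_anova_f`, the group sizes, and the synthetic uniform samples are illustrative assumptions, not part of this analysis.

```python
import numpy as np
from scipy.stats import f_oneway


def one_way_anova_f(samples):
    """Compute the one-way ANOVA F statistic for a list of 1-D samples."""
    k = len(samples)                        # number of groups
    n_total = sum(len(s) for s in samples)  # total number of observations
    grand_mean = np.concatenate(samples).mean()

    # Between-group sum of squares: spread of the group means around the grand mean
    ss_between = sum(len(s) * (np.mean(s) - grand_mean) ** 2 for s in samples)
    # Within-group sum of squares: spread of the observations around their group mean
    ss_within = sum(((s - np.mean(s)) ** 2).sum() for s in samples)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Synthetic "probabilities" with arbitrary group sizes (NOT the notebook's data)
rng = np.random.default_rng(0)
toy_samples = [rng.uniform(0, 1, size=n) for n in (20, 50, 40, 15)]

f_manual = one_way_anova_f(toy_samples)
f_scipy = f_oneway(*toy_samples).statistic
print(f_manual, f_scipy)
```

A large $F$ means the group means differ by more than the within-group variability would explain, which is what leads to a small p-value.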

In [3]:
group_0 = y_prob[np.logical_and(y_true == 1, groups == 0), :]
group_1 = y_prob[np.logical_and(y_true == 1, groups == 1), :]
group_2 = y_prob[np.logical_and(y_true == 1, groups == 2), :]
group_3 = y_prob[np.logical_and(y_true == 1, groups == 3), :]

group_0.shape, group_1.shape, group_2.shape, group_3.shape

((17, 3), (58, 3), (46, 3), (16, 3))

In [4]:
# Mean and std values
all_groups = [group_0, group_1, group_2, group_3]

print("\t\tExp 1\t\tExp 2\t\tExp 3")
for i, group in enumerate(all_groups):
    print(f"Group {i}", end='\t')
    mean = np.mean(group, axis=0)
    std = np.std(group, axis=0)
    for m, s in zip(mean, std):
        print(f"\t{100*m:.2f} +- {100*s:.2f}", end="")
    print()


		Exp 1		Exp 2		Exp 3
Group 0		83.81 +- 32.31	32.04 +- 27.95	78.38 +- 33.42
Group 1		50.82 +- 41.04	22.96 +- 27.22	43.50 +- 41.02
Group 2		65.81 +- 37.61	25.08 +- 25.30	58.08 +- 40.39
Group 3		73.51 +- 28.47	26.05 +- 22.27	65.32 +- 36.11


In [5]:
F = f_oneway(group_0, group_1, group_2, group_3)
F.statistic, F.pvalue

(array([4.14200588, 0.52022699, 3.98315294]),
 array([0.00764976, 0.6690862 , 0.00937104]))

$\implies$ We can reject $H_0$ at the $0.05$ significance level in experiments 1 and 3.

This indicates the presence of skin-color bias in those two models; more specifically, those models are not equally confident in their decisions across groups. Looking at the mean values per group, models 1 and 3 seem to give more opportunities (higher confidence) to the minority skin color groups. However, knowing that group 1 made up the majority of the training data, we can safely assume that the results on the other groups are overly optimistic.

The mean values for experiment 2 suggest a very stable confidence across skin color groups, despite the fact that the training data was highly imbalanced.
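
The decision rule used for the positive subset can be written out explicitly. This is only a sketch: the p-values are copied from the `f_oneway` output of this section, and the names `alpha` and `reject_h0` are introduced here for illustration.

```python
import numpy as np

alpha = 0.05  # significance level used in the text

# p-values copied from the positive-subset f_oneway output (experiments 1, 2, 3)
pvalues = np.array([0.00764976, 0.6690862, 0.00937104])

# Reject H0 when the p-value falls below the significance level
reject_h0 = pvalues < alpha

for i, (p, rejected) in enumerate(zip(pvalues, reject_h0), start=1):
    decision = "reject H0" if rejected else "fail to reject H0"
    print(f"Experiment {i}: p = {p:.4f} -> {decision}")
```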

## True positive subset

Looking only at the samples that were correctly labeled as malignant, we can check whether our model was less confident in assigning that label for some skin color groups than for others.

- $H_0$ - Mean probabilities $P(y=\text{malignant} \mid x)$ are equal for correctly classified malignant samples across all skin color groups
- $H_1$ - At least one group's mean probability differs from the others
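
Selecting the true positives of one group amounts to combining three boolean masks: ground truth is malignant, the sample belongs to the group, and the predicted malignant probability exceeds the decision threshold. The sketch below assumes a threshold of 0.5; the helper `true_positive_probs` and the toy arrays are hypothetical and only illustrate the selection.

```python
import numpy as np


def true_positive_probs(y_prob, y_true, groups, group_id, threshold=0.5):
    """Return malignant-class probabilities of the true positives in one group."""
    is_tp = (y_true == 1) & (groups == group_id) & (y_prob > threshold)
    return y_prob[is_tp]


# Toy arrays (NOT the notebook's data)
toy_y_true = np.array([1, 1, 0, 1, 1, 0])
toy_groups = np.array([0, 0, 0, 1, 1, 1])
toy_y_prob = np.array([0.9, 0.3, 0.8, 0.7, 0.6, 0.2])

tp_group_0 = true_positive_probs(toy_y_prob, toy_y_true, toy_groups, group_id=0)
tp_group_1 = true_positive_probs(toy_y_prob, toy_y_true, toy_groups, group_id=1)
print(tp_group_0, tp_group_1)
```

Because each model produces different predictions, the number of true positives per group generally differs between experiments, so the test has to be run separately for each model.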

In [6]:
# The number of TPs varies between experiments
stats = []
pvals = []

for exp in range(len(experiments)):
    print(f"Experiment {exp+1}:")
    group_0 = y_prob[np.logical_and(y_true == 1, groups == 0), exp]
    group_1 = y_prob[np.logical_and(y_true == 1, groups == 1), exp]
    group_2 = y_prob[np.logical_and(y_true == 1, groups == 2), exp]
    group_3 = y_prob[np.logical_and(y_true == 1, groups == 3), exp]
    
    # Keep only true positives: GT malignant samples predicted malignant (p > 0.5)
    group_0 = group_0[group_0 > 0.5]
    group_1 = group_1[group_1 > 0.5]
    group_2 = group_2[group_2 > 0.5]
    group_3 = group_3[group_3 > 0.5]

    all_groups = [group_0, group_1, group_2, group_3]

    for i, group in enumerate(all_groups):
        print(f"\tGroup {i}", end='\t')
        mean = np.mean(group)
        std = np.std(group)
        print(f"  {100*mean:.2f} +- {100*std:.2f}", end="")
        print()

    F = f_oneway(group_0, group_1, group_2, group_3)
    stats.append(F.statistic)
    pvals.append(F.pvalue)

np.array(stats), np.array(pvals)

Experiment 1:
	Group 0	  94.96 +- 11.24
	Group 1	  87.85 +- 15.83
	Group 2	  88.52 +- 14.83
	Group 3	  82.72 +- 15.69
Experiment 2:
	Group 0	  65.02 +- 4.85
	Group 1	  63.94 +- 4.66
	Group 2	  63.91 +- 4.80
	Group 3	  62.77 +- 6.90
Experiment 3:
	Group 0	  92.96 +- 12.04
	Group 1	  86.54 +- 15.73
	Group 2	  90.26 +- 11.82
	Group 3	  88.05 +- 13.36


(array([1.61153694, 0.13112771, 0.74056832]),
 array([0.19251901, 0.94083616, 0.53122427]))

$\implies$ We fail to reject the null hypothesis in all three experiments.

When only considering true positive predictions, we see no statistically significant evidence of confidence disparities. 

Unlike the previous ANOVA, this analysis did not reveal any bias in experiments 1 and 3. This further shows that looking only at TP samples gives a slightly biased and less general view of a model's fairness. Thus, we suspect that the models in experiments 1 and 3 could be fair (consistently confident) in the cases where they are correct, even though they do show some bias in general. The model in experiment 2, by contrast, shows no significant bias in either analysis.