<a href="https://colab.research.google.com/github/KCachel/FairRankTune/blob/main/examples/4_statisticalparitymetrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

We need to install [FairRankTune](https://github.com/KCachel/FairRankTune).

In [1]:
!pip install FairRankTune



We need to import FairRankTune along with some other packages.

In [2]:
import FairRankTune as frt
import numpy as np
import pandas as pd
from FairRankTune import RankTune, Metrics, Rankers

# Metrics (Evaluating Rankings for Fairness)
The [Metrics](https://kcachel.github.io/FairRankTune/Metrics/) module contains several fairness metrics for assessing ranked lists. Each metric evaluates the ranking(s) in the passed `ranking_df` parameter (a Pandas dataframe).

In this overview, we demonstrate the statistical parity metrics. Statistical parity, a sub-type of group fairness, asks that groups receive a proportional share of the positive outcome. In rankings, the positive outcome can be the exposure or attention of the viewer, or a share of the top-ranked positions. Statistical parity is also known as demographic parity and explicitly does not use relevance scores. For example, we might want to know whether groups receive comparable amounts of exposure regardless of the relevance scores associated with items.


## Modular Metrics
A key feature of the `Metrics` module in `FairRankTune` is that it gives toolkit users multiple choices for how to calculate a given top-level fairness metric. For instance, for group exposure (EXP), a statistical parity metric, `Metrics` offers seven ways of calculating the top-level exposure metric (e.g., min-max ratio, max absolute difference, L-2 norm of the per-group exposures, etc.).


Below are the formulas supported for combining per-group metrics. In the formulas, $V = [V_{1}, ..., V_{G}]$ is an array of per-group metrics and $G$ is the number of groups. The `combo` variable is passed directly in the function call. The range of a given fairness metric depends on the formula used to aggregate the per-group metrics. The range and its corresponding "most fair" value are provided in the table.

| **Combo Variable in ```FairRankTune```** | **Formula** | **Range** | **Most Fair** |
|---|:---:|:---:|:---:|
| ```MinMaxRatio``` |  $\min_{g} V_{g} / \max_{g} V_{g}$ | [0,1] | 1 |
| ```MaxMinRatio``` |  $\max_{g} V_{g} / \min_{g} V_{g}$ | [1, $\infty$) | 1 |
| ```MaxMinDiff``` |  $\max_{g} V_{g} - \min_{g} V_{g}$ |  [0,1] | 0 |
| ```MaxAbsDiff``` | $\max_{g} \mid V_{g} - V_{mean} \mid$ |  [0, $\infty$) | 0 |
| ```MeanAbsDev``` | $\frac{1}{G} \sum_{g} \mid V_{g} - V_{mean}\mid$ | [0, $\infty$) | 0 |
| ```LTwo```| $\lVert V \rVert_2$ | [0, $\infty$) | 0 |
|  ```Variance``` |  $\frac{1}{G} \sum_{g} (V_{g} - V_{mean})^2$ | [0, $\infty$) | 0 |
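
To make the aggregation step concrete, here is a minimal NumPy sketch of the seven `combo` formulas. The `combine` function is a hypothetical helper written for this illustration and is not part of FairRankTune. Two choices are assumptions: `LTwo` is computed as the unsquared L2 norm and `Variance` as the population variance (dividing by $G$), because those choices reproduce the LTwo and Variance values printed in the outputs of this notebook. The input `V` is the pair of per-group average exposures from the EXP example.

```python
import numpy as np

def combine(per_group, combo):
    """Aggregate per-group values V into one score (hypothetical helper)."""
    v = np.asarray(list(per_group), dtype=float)
    if combo == "MinMaxRatio":
        return v.min() / v.max()
    if combo == "MaxMinRatio":
        return v.max() / v.min()
    if combo == "MaxMinDiff":
        return v.max() - v.min()
    if combo == "MaxAbsDiff":
        return np.abs(v - v.mean()).max()
    if combo == "MeanAbsDev":
        return np.abs(v - v.mean()).mean()
    if combo == "LTwo":
        return np.linalg.norm(v)  # assumption: unsquared L2 norm
    if combo == "Variance":
        return v.var()  # assumption: population variance (divide by G)
    raise ValueError(f"unknown combo: {combo}")

# Per-group average exposures for 'M' and 'W' from the EXP example
V = [0.5197142341886783, 0.3018532329225326]
for combo in ["MinMaxRatio", "MaxMinRatio", "MaxMinDiff", "MaxAbsDiff",
              "MeanAbsDev", "LTwo", "Variance"]:
    print(combo, combine(V, combo))
```

With only two groups, `MaxAbsDiff` and `MeanAbsDev` coincide, since both groups sit at the same distance from the mean.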


Here, we have split the statistical parity metrics into the modular metrics and the metrics that are not meta-metric composable.

# Statistical Parity Modular Metrics

## Group Exposure EXP
[EXP](https://kcachel.github.io/FairRankTune/Metrics/#group-exposure-exp) compares the average exposures of groups in the ranking(s) and does not consider relevances or scores associated with items. It aligns with the fairness concept of statistical parity. The per-group metric is the group average exposure, whereby the exposure of item $x_i$ in ranking $\tau$ is $exposure(\tau,x_i) = 1 / \log_2(\tau(x_i)+1)$ and the average exposure for group $g_j$ is $avgexp(\tau,g_j) = \sum_{x_i \in g_{j}}exposure(\tau,x_i)/|g_{j}|$. The range of EXP and its "most fair" value depend on the `combo` variable.

In the example below we calculate EXP with every aggregation function. The average exposures for 'M' (men) and 'W' (women) are the same in every call, but the EXP value varies depending on the meta-metric used.

In [3]:
ranking_df = pd.DataFrame(["Joe", "Jack", "Nick", "David", "Mark", "Josh", "Dave",
                          "Bella", "Heidi", "Amy"])
item_group_dict = dict(Joe= "M",  David= "M", Bella= "W", Heidi= "W", Amy = "W", Mark= "M", Josh= "M", Dave= "M", Jack= "M", Nick= "M")

#Calculate EXP
EXPMaxMinDiff, exps_MaxMinDiff = frt.Metrics.EXP(ranking_df, item_group_dict, 'MaxMinDiff')
print("EXP (MaxMinDiff): ", EXPMaxMinDiff, "avg_exposures: ", exps_MaxMinDiff)

EXPMinMaxRatio, exps_MinMaxRatio = frt.Metrics.EXP(ranking_df, item_group_dict, 'MinMaxRatio')
print("EXP (MinMaxRatio): ", EXPMinMaxRatio, "avg_exposures: ", exps_MinMaxRatio)

EXPMaxMinRatio, exps_MaxMinRatio = frt.Metrics.EXP(ranking_df, item_group_dict, 'MaxMinRatio')
print("EXP (MaxMinRatio): ", EXPMaxMinRatio, "avg_exposures: ", exps_MaxMinRatio)

EXPMaxAbsDiff, exps_MaxAbsDiff = frt.Metrics.EXP(ranking_df, item_group_dict, 'MaxAbsDiff')
print("EXP (MaxAbsDiff): ", EXPMaxAbsDiff, "avg_exposures: ", exps_MaxAbsDiff)


EXPMeanAbsDev, exps_MeanAbsDev = frt.Metrics.EXP(ranking_df, item_group_dict, 'MeanAbsDev')
print("EXP (MeanAbsDev): ", EXPMeanAbsDev, "avg_exposures: ", exps_MeanAbsDev)



EXPLTwo, exps_LTwo = frt.Metrics.EXP(ranking_df, item_group_dict, 'LTwo')
print("EXP (LTwo): ", EXPLTwo, "avg_exposures: ", exps_LTwo)

EXPVariance, exps_Variance = frt.Metrics.EXP(ranking_df, item_group_dict, 'Variance')
print("EXP (Variance): ", EXPVariance, "avg_exposures: ", exps_Variance)

EXP (MaxMinDiff):  0.21786100126614577 avg_exposures:  {'M': 0.5197142341886783, 'W': 0.3018532329225326}
EXP (MinMaxRatio):  0.5808061682084833 avg_exposures:  {'M': 0.5197142341886783, 'W': 0.3018532329225326}
EXP (MaxMinRatio):  1.721744800136222 avg_exposures:  {'M': 0.5197142341886783, 'W': 0.3018532329225326}
EXP (MaxAbsDiff):  0.10893050063307291 avg_exposures:  {'M': 0.5197142341886783, 'W': 0.3018532329225326}
EXP (MeanAbsDev):  0.10893050063307289 avg_exposures:  {'M': 0.5197142341886783, 'W': 0.3018532329225326}
EXP (LTwo):  0.6010143587670008 avg_exposures:  {'M': 0.5197142341886783, 'W': 0.3018532329225326}
EXP (Variance):  0.011865853968171892 avg_exposures:  {'M': 0.5197142341886783, 'W': 0.3018532329225326}
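
As a cross-check on the definition, the per-group average exposure can be recomputed directly from the formula $1 / \log_2(\tau(x_i)+1)$ without the library. This is a minimal sketch, not FairRankTune's implementation; `avg_exposure` is a hypothetical helper and positions are assumed to start at 1 for the top item.

```python
import math

ranking = ["Joe", "Jack", "Nick", "David", "Mark", "Josh", "Dave",
           "Bella", "Heidi", "Amy"]
item_group = dict(Joe="M", David="M", Bella="W", Heidi="W", Amy="W",
                  Mark="M", Josh="M", Dave="M", Jack="M", Nick="M")

def avg_exposure(ranking, item_group):
    """Average 1/log2(position + 1) per group (hypothetical helper)."""
    totals, counts = {}, {}
    for position, item in enumerate(ranking, start=1):
        group = item_group[item]
        totals[group] = totals.get(group, 0.0) + 1.0 / math.log2(position + 1)
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

exps = avg_exposure(ranking, item_group)
print("avg_exposures:", exps)
print("MaxMinDiff:", max(exps.values()) - min(exps.values()))
```

The printed averages can be compared against the `avg_exposures` dictionary returned by `frt.Metrics.EXP` above.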


## Attention Weighted Rank Fairness (AWRF)

[AWRF](https://kcachel.github.io/FairRankTune/Metrics/#attention-weighted-rank-fairness-awrf) compares the average attention of groups in the ranking(s) and does not consider relevances or scores associated with items. It aligns with the fairness concept of statistical parity. In contrast to EXP, AWRF uses a geometric discount on the "attention" assigned to positions in a ranking. The per-group metric is the group average attention, whereby the attention score for item $x_i$ in ranking $\tau$ is $attention(\tau,x_i) = 100 \times (1 - p)^{(\tau(x_i) -1)} \times p$, where $p$ is a parameter representing the proportion of attention received by the first (top) ranked item. The range of AWRF and its "most fair" value depend on the `combo` variable.

In the example below we calculate AWRF with every aggregation function. The average attention for 'M' (men) and 'W' (women) is the same in every call, but the AWRF value varies depending on the meta-metric used.


In [4]:
ranking_df = pd.DataFrame(["Joe", "Jack", "Nick", "David", "Mark", "Josh", "Dave",
                          "Bella", "Heidi", "Amy"])
item_group_dict = dict(Joe= "M",  David= "M", Bella= "W", Heidi= "W", Amy = "W", Mark= "M", Josh= "M", Dave= "M", Jack= "M", Nick= "M")

#Calculate AWRF
p = .1 # parameter representing the proportion of attention received by the first position
AWRFMaxMinDiff, AWRFs_MaxMinDiff = frt.Metrics.AWRF(ranking_df, item_group_dict, p, 'MaxMinDiff')
print("AWRF (MaxMinDiff): ", AWRFMaxMinDiff, "avg_attention: ", AWRFs_MaxMinDiff)

AWRFMinMaxRatio, AWRFs_MinMaxRatio = frt.Metrics.AWRF(ranking_df, item_group_dict, p, 'MinMaxRatio')
print("AWRF (MinMaxRatio): ", AWRFMinMaxRatio, "avg_attention: ", AWRFs_MinMaxRatio)

AWRFMaxMinRatio, AWRFs_MaxMinRatio = frt.Metrics.AWRF(ranking_df, item_group_dict, p, 'MaxMinRatio')
print("AWRF (MaxMinRatio): ", AWRFMaxMinRatio, "avg_attention: ", AWRFs_MaxMinRatio)

AWRFMaxAbsDiff, AWRFs_MaxAbsDiff = frt.Metrics.AWRF(ranking_df, item_group_dict, p, 'MaxAbsDiff')
print("AWRF (MaxAbsDiff): ", AWRFMaxAbsDiff, "avg_attention: ", AWRFs_MaxAbsDiff)

AWRFMeanAbsDev, AWRFs_MeanAbsDev = frt.Metrics.AWRF(ranking_df, item_group_dict, p, 'MeanAbsDev')
print("AWRF (MeanAbsDev): ", AWRFMeanAbsDev, "avg_attention: ", AWRFs_MeanAbsDev)

AWRFLTwo, AWRFs_LTwo = frt.Metrics.AWRF(ranking_df, item_group_dict, p, 'LTwo')
print("AWRF (LTwo): ", AWRFLTwo, "avg_attention: ", AWRFs_LTwo)

AWRFVariance, AWRFs_Variance = frt.Metrics.AWRF(ranking_df, item_group_dict, p, 'Variance')
print("AWRF (Variance): ", AWRFVariance, "avg_attention: ", AWRFs_Variance)

AWRF (MaxMinDiff):  3.132286098571428 avg_attention:  {'M': 7.45290142857143, 'W': 4.320615330000002}
AWRF (MinMaxRatio):  0.5797225914509614 avg_attention:  {'M': 7.45290142857143, 'W': 4.320615330000002}
AWRF (MaxMinRatio):  1.7249629646088922 avg_attention:  {'M': 7.45290142857143, 'W': 4.320615330000002}
AWRF (MaxAbsDiff):  1.5661430492857145 avg_attention:  {'M': 7.45290142857143, 'W': 4.320615330000002}
AWRF (MeanAbsDev):  1.566143049285714 avg_attention:  {'M': 7.45290142857143, 'W': 4.320615330000002}
AWRF (LTwo):  8.614723241859432 avg_attention:  {'M': 7.45290142857143, 'W': 4.320615330000002}
AWRF (Variance):  2.4528040508259545 avg_attention:  {'M': 7.45290142857143, 'W': 4.320615330000002}
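
The geometric attention discount can also be checked by hand. The sketch below applies $100 \times (1 - p)^{(\tau(x_i) - 1)} \times p$ to each position and averages per group; `avg_attention` is a hypothetical helper written for illustration, not the library's code, and it assumes the top position is 1.

```python
ranking = ["Joe", "Jack", "Nick", "David", "Mark", "Josh", "Dave",
           "Bella", "Heidi", "Amy"]
item_group = dict(Joe="M", David="M", Bella="W", Heidi="W", Amy="W",
                  Mark="M", Josh="M", Dave="M", Jack="M", Nick="M")

def avg_attention(ranking, item_group, p):
    """Average 100 * (1 - p)**(position - 1) * p per group (hypothetical helper)."""
    totals, counts = {}, {}
    for position, item in enumerate(ranking, start=1):
        group = item_group[item]
        attention = 100 * (1 - p) ** (position - 1) * p
        totals[group] = totals.get(group, 0.0) + attention
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

attn = avg_attention(ranking, item_group, p=0.1)
print("avg_attention:", attn)
print("MinMaxRatio:", min(attn.values()) / max(attn.values()))
```

Because attention is scaled by 100, difference-based aggregations of AWRF are not confined to [0, 1], as the MaxMinDiff output above shows.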


## Exposure Rank Biased Precision Proportionality (ERBP)

[ERBP](https://kcachel.github.io/FairRankTune/Metrics/#exposure-rank-biased-precision-proportionality-erbp) assesses whether groups receive exposure proportional to their size, whereby exposure is based on the Rank Biased Precision metric. This metric does not consider relevances or scores associated with items. Exposure in ERBP is determined differently than in [exposure (EXP)](#group-exposure-exp). Specifically, the calculation is based on the [Rank Biased Precision (RBP) metric](https://dl.acm.org/doi/10.1145/1416950.1416952). The per-group metric is the group average exposure, whereby exposure is measured exactly as in [ERBE](https://kcachel.github.io/FairRankTune/Metrics/#exposure-rank-biased-precision-equality-erbe). Group average exposure for group $g_j$ in ranking $\tau$ is $avgexpRBP(\tau,g_j) = (1 - \gamma) \sum_{x_i \in g_{j}}exposureRBP(\tau,x_i)/|g_{j}|$. The range of ERBP and its "most fair" value depend on the `combo` variable.

In the example below we calculate ERBP with every aggregation function. The average exposures for 'M' (men) and 'W' (women) are the same in every call, but the ERBP value varies depending on the meta-metric used.


In [5]:
ranking_df = pd.DataFrame(["Joe", "Jack", "Nick", "David", "Mark", "Josh", "Dave",
                          "Bella", "Heidi", "Amy"])
item_group_dict = dict(Joe= "M",  David= "M", Bella= "W", Heidi= "W", Amy = "W", Mark= "M", Josh= "M", Dave= "M", Jack= "M", Nick= "M")

#Calculate ERBP
decay = .75 # parameter representing gamma, which controls the importance of higher ranks
ERBPMaxMinDiff, ERBPs_MaxMinDiff = frt.Metrics.ERBP(ranking_df, item_group_dict, decay, 'MaxMinDiff')
print("ERBP (MaxMinDiff): ", ERBPMaxMinDiff, "avg RBP exposure: ", ERBPs_MaxMinDiff)

ERBPMinMaxRatio, ERBPs_MinMaxRatio = frt.Metrics.ERBP(ranking_df, item_group_dict, decay, 'MinMaxRatio')
print("ERBP (MinMaxRatio): ", ERBPMinMaxRatio, "avg RBP exposure: ", ERBPs_MinMaxRatio)

ERBPMaxMinRatio, ERBPs_MaxMinRatio = frt.Metrics.ERBP(ranking_df, item_group_dict, decay, 'MaxMinRatio')
print("ERBP (MaxMinRatio): ", ERBPMaxMinRatio, "avg RBP exposure: ", ERBPs_MaxMinRatio)

ERBPMaxAbsDiff, ERBPs_MaxAbsDiff = frt.Metrics.ERBP(ranking_df, item_group_dict, decay, 'MaxAbsDiff')
print("ERBP (MaxAbsDiff): ", ERBPMaxAbsDiff, "avg RBP exposure: ", ERBPs_MaxAbsDiff)

ERBPMeanAbsDev, ERBPs_MeanAbsDev = frt.Metrics.ERBP(ranking_df, item_group_dict, decay, 'MeanAbsDev')
print("ERBP (MeanAbsDev): ", ERBPMeanAbsDev, "avg RBP exposure: ", ERBPs_MeanAbsDev)

ERBPLTwo, ERBPs_LTwo = frt.Metrics.ERBP(ranking_df, item_group_dict, decay, 'LTwo')
print("ERBP (LTwo): ", ERBPLTwo, "avg RBP exposure: ", ERBPs_LTwo)

ERBPVariance, ERBPs_Variance = frt.Metrics.ERBP(ranking_df, item_group_dict, decay, 'Variance')
print("ERBP (Variance): ", ERBPVariance, "avg RBP exposure: ", ERBPs_Variance)

ERBP (MaxMinDiff):  0.09806455884660993 avg RBP exposure:  {'M': 0.12378801618303571, 'W': 0.02572345733642578}
ERBP (MinMaxRatio):  0.20780248467986195 avg RBP exposure:  {'M': 0.12378801618303571, 'W': 0.02572345733642578}
ERBP (MaxMinRatio):  4.812261997447183 avg RBP exposure:  {'M': 0.12378801618303571, 'W': 0.02572345733642578}
ERBP (MaxAbsDiff):  0.04903227942330497 avg RBP exposure:  {'M': 0.12378801618303571, 'W': 0.02572345733642578}
ERBP (MeanAbsDev):  0.049032279423304966 avg RBP exposure:  {'M': 0.12378801618303571, 'W': 0.02572345733642578}
ERBP (LTwo):  0.1264324689621714 avg RBP exposure:  {'M': 0.12378801618303571, 'W': 0.02572345733642578}
ERBP (Variance):  0.0024041644254450554 avg RBP exposure:  {'M': 0.12378801618303571, 'W': 0.02572345733642578}
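
The text does not spell out $exposureRBP$, so the sketch below assumes the usual RBP discount $exposureRBP(\tau, x_i) = \gamma^{\tau(x_i) - 1}$ with positions starting at 1. Under that assumption the per-group averages agree with the `avg RBP exposure` values printed above. `avg_rbp_exposure` is a hypothetical helper, not the library's code.

```python
ranking = ["Joe", "Jack", "Nick", "David", "Mark", "Josh", "Dave",
           "Bella", "Heidi", "Amy"]
item_group = dict(Joe="M", David="M", Bella="W", Heidi="W", Amy="W",
                  Mark="M", Josh="M", Dave="M", Jack="M", Nick="M")

def avg_rbp_exposure(ranking, item_group, gamma):
    """(1 - gamma) * sum(gamma**(position - 1)) / group size, per group.

    Assumption: exposureRBP for an item at a 1-based position is
    gamma**(position - 1).
    """
    totals, counts = {}, {}
    for position, item in enumerate(ranking, start=1):
        group = item_group[item]
        totals[group] = totals.get(group, 0.0) + gamma ** (position - 1)
        counts[group] = counts.get(group, 0) + 1
    return {group: (1 - gamma) * totals[group] / counts[group]
            for group in totals}

rbp = avg_rbp_exposure(ranking, item_group, gamma=0.75)
print("avg RBP exposure:", rbp)
print("MaxMinRatio:", max(rbp.values()) / min(rbp.values()))
```

A larger `gamma` decays more slowly, so lower positions keep more of the exposure and the gap between the two groups narrows.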


## Attribute Rank Parity (ARP)

[ARP](https://kcachel.github.io/FairRankTune/Metrics/#attribute-rank-parity-arp) compares the number of mixed pairs won by groups in the ranking(s) and does not consider relevances or scores associated with items. It aligns with the fairness concept of statistical parity. ARP decomposes the ranking into pairwise comparisons. A mixed pair contains items from two different groups, and the item "on top" is said to "win" the pair. The per-group metric is the average mixed pairs won by each group, calculated as $avgpairs(\tau, g_i) = \# ~mixedpairswon(g_i) / \# totalmixedpairs(g_i)$ in ranking $\tau$. The range of ARP and its "most fair" value depend on the `combo` variable.

In the example below we calculate ARP with every aggregation function. The mixed-pair win ratios for 'M' (men) and 'W' (women) are the same in every call, but the ARP value varies depending on the meta-metric used.

In [6]:
ranking_df = pd.DataFrame(["Joe", "Jack", "Nick", "David", "Mark", "Josh",
                          "Bella",  "Dave", "Heidi", "Amy"])
item_group_dict = dict(Joe= "M",  David= "M", Bella= "W", Heidi= "W", Amy = "W", Mark= "M", Josh= "M", Dave= "M", Jack= "M", Nick= "M")

#Calculate ARP
ARPMaxMinDiff, group_pairs_MaxMinDiff = frt.Metrics.ARP(ranking_df, item_group_dict, 'MaxMinDiff')
print("ARP (MaxMinDiff): ", ARPMaxMinDiff, "group mixed pairs ratio: ", group_pairs_MaxMinDiff)

ARPMinMaxRatio, group_pairs_MinMaxRatio = frt.Metrics.ARP(ranking_df, item_group_dict, 'MinMaxRatio')
print("ARP (MinMaxRatio): ", ARPMinMaxRatio, "group mixed pairs ratio: ", group_pairs_MinMaxRatio)

ARPMaxMinRatio, group_pairs_MaxMinRatio = frt.Metrics.ARP(ranking_df, item_group_dict,'MaxMinRatio')
print("ARP (MaxMinRatio): ", ARPMaxMinRatio, "group mixed pairs ratio: ", group_pairs_MaxMinRatio)

ARPMaxAbsDiff, group_pairs_MaxAbsDiff = frt.Metrics.ARP(ranking_df, item_group_dict,'MaxAbsDiff')
print("ARP (MaxAbsDiff): ", ARPMaxAbsDiff, "group mixed pairs ratio: ", group_pairs_MaxAbsDiff)

ARPMeanAbsDev, group_pairs_MeanAbsDev = frt.Metrics.ARP(ranking_df, item_group_dict,'MeanAbsDev')
print("ARP (MeanAbsDev): ", ARPMeanAbsDev, "group mixed pairs ratio: ", group_pairs_MeanAbsDev)

ARPLTwo, group_pairs_LTwo = frt.Metrics.ARP(ranking_df, item_group_dict,'LTwo')
print("ARP (LTwo): ", ARPLTwo, "group mixed pairs ratio: ", group_pairs_LTwo)

ARPVariance, group_pairs_Variance = frt.Metrics.ARP(ranking_df, item_group_dict,'Variance')
print("ARP (Variance): ", ARPVariance, "group mixed pairs ratio: ", group_pairs_Variance)

ARP (MaxMinDiff):  0.9047619047619047 group mixed pairs ratio:  {'M': 0.9523809523809523, 'W': 0.047619047619047616}
ARP (MinMaxRatio):  0.05 group mixed pairs ratio:  {'M': 0.9523809523809523, 'W': 0.047619047619047616}
ARP (MaxMinRatio):  20.0 group mixed pairs ratio:  {'M': 0.9523809523809523, 'W': 0.047619047619047616}
ARP (MaxAbsDiff):  0.4523809523809524 group mixed pairs ratio:  {'M': 0.9523809523809523, 'W': 0.047619047619047616}
ARP (MeanAbsDev):  0.45238095238095233 group mixed pairs ratio:  {'M': 0.9523809523809523, 'W': 0.047619047619047616}
ARP (LTwo):  0.9535706854524183 group mixed pairs ratio:  {'M': 0.9523809523809523, 'W': 0.047619047619047616}
ARP (Variance):  0.2046485260770975 group mixed pairs ratio:  {'M': 0.9523809523809523, 'W': 0.047619047619047616}
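
The pair counting behind ARP is easy to reproduce by brute force. In the sketch below, `mixed_pair_win_ratio` is a hypothetical helper (not FairRankTune's implementation) that walks over every pair of positions, credits the higher-ranked item's group when the two items belong to different groups, and divides by the number of mixed pairs each group takes part in. It assumes at least two groups are present.

```python
ranking = ["Joe", "Jack", "Nick", "David", "Mark", "Josh",
           "Bella", "Dave", "Heidi", "Amy"]
item_group = dict(Joe="M", David="M", Bella="W", Heidi="W", Amy="W",
                  Mark="M", Josh="M", Dave="M", Jack="M", Nick="M")

def mixed_pair_win_ratio(ranking, item_group):
    """Fraction of mixed pairs won by each group (hypothetical helper)."""
    groups = [item_group[item] for item in ranking]
    n = len(groups)
    sizes = {}
    for group in groups:
        sizes[group] = sizes.get(group, 0) + 1
    wins = {group: 0 for group in sizes}
    for i in range(n):
        for j in range(i + 1, n):
            if groups[i] != groups[j]:
                wins[groups[i]] += 1  # the item ranked higher wins the pair
    # A group of size s takes part in s * (n - s) mixed pairs
    return {group: wins[group] / (sizes[group] * (n - sizes[group]))
            for group in sizes}

ratios = mixed_pair_win_ratio(ranking, item_group)
print("group mixed pairs ratio:", ratios)
```

In this ranking there are 7 × 3 = 21 mixed pairs, and the only one won by 'W' is Bella over Dave, which gives the 1/21 ratio seen in the output above.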


# Statistical Parity Single Formulation Metrics



## Normalized Discounted KL-Divergence (NDKL)

[NDKL](https://kcachel.github.io/FairRankTune/Metrics/#normalized-discounted-kl-divergence-ndkl) assesses the representation of groups in discrete prefixes of the ranking. It does not consider the relevances or scores associated with items. It aligns with the fairness concept of statistical parity and is assessed on a single ranking. The NDKL of ranking $\tau$ with respect to groups $G$ is defined as:
$\frac{1}{Z}\sum^{|X|}_{i = 1}\frac{1}{\log_{2}(i +1 )}d_{KL}(D_{\tau_i} || D_{X})$
where $d_{KL}(D_{\tau_i} || D_{X})$ is the [KL-divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between the group proportions of the first $i$ positions in $\tau$ and the group proportions of the item set $X$, and $Z = \sum_{i = 1}^{|X|} \frac{1}{\log_2(i + 1)}$. NDKL ranges from 0 to $\infty$ and is most fair at 0.

In [7]:
ranking_df = pd.DataFrame(["Joe", "Jack", "Nick", "David", "Mark", "Josh", "Dave",
                          "Bella", "Heidi", "Amy"])
item_group_dict = dict(Joe= "M",  David= "M", Bella= "W", Heidi= "W", Amy = "W", Mark= "M", Josh= "M", Dave= "M", Jack= "M", Nick= "M")

NDKL= frt.Metrics.NDKL(ranking_df, item_group_dict)
print("NDKL:", NDKL)

NDKL: 0.2925554332073208
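
The NDKL formula can be traced prefix by prefix with a short pure-Python sketch. The `ndkl` function here is a hypothetical re-implementation for illustration only. It assumes the KL-divergence uses the natural logarithm and that a group absent from a prefix contributes zero to the divergence. Under those assumptions the result agrees with the value printed above to about five decimal places; the small remaining gap may come from how the library treats empty groups, which is not confirmed here.

```python
import math

ranking = ["Joe", "Jack", "Nick", "David", "Mark", "Josh", "Dave",
           "Bella", "Heidi", "Amy"]
item_group = dict(Joe="M", David="M", Bella="W", Heidi="W", Amy="W",
                  Mark="M", Josh="M", Dave="M", Jack="M", Nick="M")

def ndkl(ranking, item_group):
    """Discounted, normalized KL-divergence over prefixes (hypothetical helper)."""
    n = len(ranking)
    groups = sorted({item_group[item] for item in ranking})
    # Group proportions in the full item set X
    overall = {g: sum(item_group[item] == g for item in ranking) / n
               for g in groups}
    counts = {g: 0 for g in groups}
    weighted_sum, z = 0.0, 0.0
    for i, item in enumerate(ranking, start=1):
        counts[item_group[item]] += 1
        kl = 0.0
        for g in groups:
            share = counts[g] / i  # proportion of g in the first i positions
            if share > 0:  # assumption: absent groups contribute 0
                kl += share * math.log(share / overall[g])
        discount = 1.0 / math.log2(i + 1)
        weighted_sum += discount * kl
        z += discount
    return weighted_sum / z

ndkl_value = ndkl(ranking, item_group)
print("NDKL (sketch):", ndkl_value)
```

The first seven prefixes contain only 'M', so each contributes the same divergence $\ln(1/0.7)$, and the logarithmic discount is what makes those early, unbalanced prefixes dominate the score.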
