<a href="https://colab.research.google.com/github/KCachel/fairranktune/blob/main/examples/5_scorebasedmetrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

We need to install [FairRankTune](https://github.com/KCachel/FairRankTune).

In [1]:
!pip install FairRankTune

Collecting FairRankTune
  Downloading fairranktune-0.0.6-py3-none-any.whl (20 kB)
Installing collected packages: FairRankTune
Successfully installed FairRankTune-0.0.6


We need to import FairRankTune along with some other packages.

In [2]:
import FairRankTune as frt
import numpy as np
import pandas as pd
from FairRankTune import RankTune, Metrics, Rankers

# Metrics (Evaluating Rankings for Fairness)
The [Metrics](https://kcachel.github.io/fairranktune/metrics/) module contains several fairness metrics for assessing ranked lists. Each metric evaluates the ranking(s) in the passed `ranking_df` parameter (a Pandas dataframe).

In this overview, we will demonstrate the score-based metrics for both group and individual fairness.

# Score-based Group Fairness Metrics

## A Note about Modular Metric Implementations
A key feature of the `Metrics` module in `FairRankTune` is that it gives users multiple ways to calculate a given top-level fairness metric. For instance, for group exposure (EXP), a statistical parity metric, `Metrics` offers seven ways of calculating the top-level exposure metric (e.g., min-max ratio, max absolute difference, L2 norm of per-group exposures). All score-based group fairness metrics in the `Metrics` module use this modular implementation.


Below are the formulas supported for combining per-group metrics. In the formulas, $V = [V_{1}, ..., V_{G}]$ is an array of per-group metrics and $G$ is the number of groups. The `combo` variable is passed directly in the function call. The range of a fairness metric depends on the formula used to aggregate the per-group metrics. The range and the corresponding "most fair" value are provided in the table.

| **Combo Variable in ```FairRankTune```** | **Formula** | **Range** | **Most Fair** |
|---|:---:|:---:|:---:|
| ```MinMaxRatio``` | $\min_{g} V_{g} / \max_{g} V_{g}$ | [0, 1] | 1 |
| ```MaxMinRatio``` | $\max_{g} V_{g} / \min_{g} V_{g}$ | [1, $\infty$) | 1 |
| ```MaxMinDiff``` | $\max_{g} V_{g} - \min_{g} V_{g}$ | [0, 1] | 0 |
| ```MaxAbsDiff``` | $\max_{g} \mid V_{g} - V_{mean} \mid$ | [0, $\infty$) | 0 |
| ```MeanAbsDev``` | $\frac{1}{G} \sum_{g} \mid V_{g} - V_{mean} \mid$ | [0, $\infty$) | 0 |
| ```LTwo``` | $\lVert V \rVert_2$ | [0, $\infty$) | 0 |
| ```Variance``` | $\frac{1}{G} \sum_{g} (V_{g} - V_{mean})^2$ | [0, $\infty$) | 0 |
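As a rough illustration of the aggregation step, the sketch below combines an array of per-group values with NumPy. It is not FairRankTune's own code: the helper name `combine_per_group` is hypothetical, and `LTwo` is computed as the plain L2 norm and `Variance` with a $1/G$ divisor because those choices reproduce the values printed in the EXPU example later in this notebook.

```python
import numpy as np

def combine_per_group(values, combo):
    """Aggregate per-group metric values V into one score (illustrative only)."""
    v = np.asarray(values, dtype=float)
    if combo == "MinMaxRatio":
        return v.min() / v.max()
    if combo == "MaxMinRatio":
        return v.max() / v.min()
    if combo == "MaxMinDiff":
        return v.max() - v.min()
    if combo == "MaxAbsDiff":
        return np.abs(v - v.mean()).max()
    if combo == "MeanAbsDev":
        return np.abs(v - v.mean()).mean()
    if combo == "LTwo":
        # Assumption: unsquared L2 norm, chosen to match the printed outputs.
        return np.linalg.norm(v)
    if combo == "Variance":
        # Assumption: population variance (divide by G), chosen to match the printed outputs.
        return v.var()
    raise ValueError(f"unknown combo: {combo}")

# Per-group exposure/utility ratios copied from the EXPU output in this notebook.
per_group = [0.2198666475823326, 0.324508855088885, 0.1667055091416431]
combos = ["MinMaxRatio", "MaxMinRatio", "MaxMinDiff", "MaxAbsDiff",
          "MeanAbsDev", "LTwo", "Variance"]
results = {c: float(combine_per_group(per_group, c)) for c in combos}
for name, value in results.items():
    print(name, value)
```

Because every aggregation works on the same per-group array, switching `combo` changes only the summary statistic, not the per-group values returned alongside it.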



## Exposure Utility (EXPU)

[EXPU](https://kcachel.github.io/fairranktune/metrics/#exposure-utility-expu) assesses whether groups receive exposure proportional to their relevance in the ranking(s). This is a form of group fairness that considers the scores (relevances) associated with items. The per-group metric is the ratio of group average exposure to group average utility, where group average exposure is measured exactly as in [EXP](https://kcachel.github.io/fairranktune/metrics/#group-exposure-exp). Group average utility for group $g_j$ is $avgutil(\tau,g_j) = \sum_{\forall x_i \in g_{j}}x_i^{util_{\tau}}/|g_{j}|$, where $x_i^{util_{\tau}}$ is the utility (or relevance score) for candidate $x_i$ in ranking $\tau$. The range of EXPU and its "most fair" value depend on the `combo` variable.

[Singh et al.](https://dl.acm.org/doi/10.1145/3219819.3220088) refer to EXPU as "Disparate Treatment". As pointed out by Raj et al., this terminology is inconsistent with the use of the term in the broader algorithmic fairness literature, so ```FairRankTune``` uses the term "Exposure Utility", as introduced in [Raj et al.](https://dl.acm.org/doi/10.1145/3477495.3532018).

In the example below we calculate EXPU across all aggregation functions.
Note that the relevance scores associated with the ranking(s) in `relevance_df` must be between 0 and 1. The first returned object is a float specifying the EXPU value and the second returned object is a dictionary of average exposure and average utility ratios for each group (keys are group ids).
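To make the per-group quantity concrete, here is a minimal sketch for a single toy ranking. It assumes exposure decays as $1/\log_2(\text{position}+1)$, borrowing the attention formula given for IAA in this notebook; the helper name `exposure_utility_ratios`, the toy items, and the dict-based inputs are all hypothetical and are not part of the FairRankTune API.

```python
import math
from collections import defaultdict

def exposure_utility_ratios(ranking, item_group, relevance):
    """Return {group: average exposure / average utility} for one ranking."""
    exposure_sum = defaultdict(float)
    utility_sum = defaultdict(float)
    group_size = defaultdict(int)
    for position, item in enumerate(ranking, start=1):
        group = item_group[item]
        # Assumption: logarithmic position discount for exposure.
        exposure_sum[group] += 1.0 / math.log2(position + 1)
        utility_sum[group] += relevance[item]
        group_size[group] += 1
    ratios = {}
    for group, size in group_size.items():
        avg_exposure = exposure_sum[group] / size
        avg_utility = utility_sum[group] / size
        ratios[group] = avg_exposure / avg_utility
    return ratios

# Toy data: four items ranked a > b > c > d, two groups, relevance in [0, 1].
ranking = ["a", "b", "c", "d"]
item_group = {"a": "g0", "b": "g1", "c": "g0", "d": "g1"}
relevance = {"a": 1.0, "b": 0.8, "c": 0.6, "d": 0.4}

ratios = exposure_utility_ratios(ranking, item_group, relevance)
print(ratios)
```

In this toy ranking group `g0` holds positions 1 and 3, so its average exposure is 0.75 against an average utility of 0.8, giving a ratio of about 0.9375. The per-group ratios would then be passed through one of the `combo` aggregations.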

In [3]:
#Generate two biased (phi = 0) rankings of 1000 items, with three groups of 100, 700, and 200 items each.
seed = 2 #For reproducibility
ranking_df, item_group_dict, relevance_df = frt.RankTune.ScoredGenFromGroups(np.asarray([.1, .7, .2]),  1000, 0, 2, 'uniform', seed)

#Calculate EXPU
EXPUMaxMinDiff, EXPUs_MaxMinDiff = frt.Metrics.EXPU(ranking_df, item_group_dict, relevance_df, 'MaxMinDiff')
print("EXPU (MaxMinDiff): ", EXPUMaxMinDiff, "grp exp/util ratios: ", EXPUs_MaxMinDiff)

EXPUMinMaxRatio, EXPUs_MinMaxRatio = frt.Metrics.EXPU(ranking_df, item_group_dict, relevance_df, 'MinMaxRatio')
print("EXPU (MinMaxRatio): ", EXPUMinMaxRatio, "grp exp/util ratios: ", EXPUs_MinMaxRatio)

EXPUMaxMinRatio, EXPUs_MaxMinRatio = frt.Metrics.EXPU(ranking_df, item_group_dict, relevance_df, 'MaxMinRatio')
print("EXPU (MaxMinRatio): ", EXPUMaxMinRatio, "grp exp/util ratios: ", EXPUs_MaxMinRatio)

EXPUMaxAbsDiff, EXPUs_MaxAbsDiff = frt.Metrics.EXPU(ranking_df, item_group_dict, relevance_df, 'MaxAbsDiff')
print("EXPU (MaxAbsDiff): ", EXPUMaxAbsDiff, "grp exp/util ratios: ", EXPUs_MaxAbsDiff)

EXPUMeanAbsDev, EXPUs_MeanAbsDev = frt.Metrics.EXPU(ranking_df, item_group_dict, relevance_df, 'MeanAbsDev')
print("EXPU (MeanAbsDev): ", EXPUMeanAbsDev, "grp exp/util ratios: ", EXPUs_MeanAbsDev)

EXPULTwo, EXPUs_LTwo = frt.Metrics.EXPU(ranking_df, item_group_dict, relevance_df, 'LTwo')
print("EXPU (LTwo): ", EXPULTwo, "grp exp/util ratios: ", EXPUs_LTwo)

EXPUVariance, EXPUs_Variance = frt.Metrics.EXPU(ranking_df, item_group_dict, relevance_df, 'Variance')
print("EXPU (Variance): ", EXPUVariance, "grp exp/util ratios: ", EXPUs_Variance)

EXPU (MaxMinDiff):  0.1578033459472419 grp exp/util ratios:  {0: 0.2198666475823326, 1: 0.324508855088885, 2: 0.1667055091416431}
EXPU (MinMaxRatio):  0.5137163640603317 grp exp/util ratios:  {0: 0.2198666475823326, 1: 0.324508855088885, 2: 0.1667055091416431}
EXPU (MaxMinRatio):  1.9465994660870063 grp exp/util ratios:  {0: 0.2198666475823326, 1: 0.324508855088885, 2: 0.1667055091416431}
EXPU (MaxAbsDiff):  0.08748185115126478 grp exp/util ratios:  {0: 0.2198666475823326, 1: 0.324508855088885, 2: 0.1667055091416431}
EXPU (MeanAbsDev):  0.058321234100843174 grp exp/util ratios:  {0: 0.2198666475823326, 1: 0.324508855088885, 2: 0.1667055091416431}
EXPU (LTwo):  0.4259554748191025 grp exp/util ratios:  {0: 0.2198666475823326, 1: 0.324508855088885, 2: 0.1667055091416431}
EXPU (Variance):  0.004297554913811049 grp exp/util ratios:  {0: 0.2198666475823326, 1: 0.324508855088885, 2: 0.1667055091416431}


##  Exposure Realized Utility (EXPRU)

[EXPRU](https://kcachel.github.io/fairranktune/metrics/#exposure-realized-utility-expru) assesses whether groups are clicked on in proportion to their relevance in the ranking(s). This is a form of group fairness that considers the scores (relevances) associated with items. The per-group metric is the ratio of group average click-through rate to group average utility, where group average utility is measured exactly as in EXPU. The average click-through rate for group $g_j$ is $avgctr(\tau,g_j) = \sum_{\forall x_i \in g_{j}}x_i^{ctr_{\tau}}/|g_{j}|$, where $x_i^{ctr_{\tau}}$ is the click-through rate for candidate $x_i$ in ranking $\tau$. The range of EXPRU and its "most fair" value depend on the `combo` variable used for per-group aggregation.

[Singh et al.](https://dl.acm.org/doi/10.1145/3219819.3220088) refer to EXPRU as "Disparate Impact". As pointed out by Raj et al., this terminology is inconsistent with the use of the term in the broader algorithmic fairness literature, so ```FairRankTune``` uses the term "Exposure Realized Utility", as introduced in [Raj et al.](https://dl.acm.org/doi/10.1145/3477495.3532018).


Note that the relevance scores associated with the ranking(s) in `relevance_df` must be between 0 and 1, and the click-through rates in `ctr_df` must be between 0 (no clicks) and 1 (100% click-through rate). The first returned object is a float specifying the EXPRU value and the second returned object is a dictionary holding, for each group, the ratio of average click-through rate to average utility (keys are group ids).

In [4]:
ranking_df = pd.DataFrame(["Joe", "Jack", "Nick", "David", "Mark", "Josh", "Dave",
                          "Bella", "Heidi", "Amy"])
relevance_df = pd.DataFrame([1, .9, .8, .82, .78, .71, .6,
                          .59, .58, .56])
ctr_df = pd.DataFrame([.99, .978, .88, .86, .85, .9, .83,
                          .82, .82, .62])
item_group_dict = dict(Joe= "M",  David= "M", Bella= "W", Heidi= "W", Amy = "W", Mark= "M", Josh= "M", Dave= "M", Jack= "M", Nick= "M")

#Calculate EXPRU
EXPRUMaxMinDiff, EXPRUs_MaxMinDiff = frt.Metrics.EXPRU(ranking_df, item_group_dict, relevance_df, ctr_df, 'MaxMinDiff')
print("EXPRU (MaxMinDiff): ", EXPRUMaxMinDiff, "grp ctr/util ratios: ", EXPRUs_MaxMinDiff)

EXPRUMinMaxRatio, EXPRUs_MinMaxRatio = frt.Metrics.EXPRU(ranking_df, item_group_dict, relevance_df, ctr_df, 'MinMaxRatio')
print("EXPRU (MinMaxRatio): ", EXPRUMinMaxRatio, "grp ctr/util ratios: ", EXPRUs_MinMaxRatio)

EXPRUMaxMinRatio, EXPRUs_MaxMinRatio = frt.Metrics.EXPRU(ranking_df, item_group_dict, relevance_df, ctr_df, 'MaxMinRatio')
print("EXPRU (MaxMinRatio): ", EXPRUMaxMinRatio, "grp ctr/util ratios: ", EXPRUs_MaxMinRatio)

EXPRUMaxAbsDiff, EXPRUs_MaxAbsDiff = frt.Metrics.EXPRU(ranking_df, item_group_dict, relevance_df, ctr_df, 'MaxAbsDiff')
print("EXPRU (MaxAbsDiff): ", EXPRUMaxAbsDiff, "grp ctr/util ratios: ", EXPRUs_MaxAbsDiff)

EXPRUMeanAbsDev, EXPRUs_MeanAbsDev = frt.Metrics.EXPRU(ranking_df, item_group_dict, relevance_df, ctr_df, 'MeanAbsDev')
print("EXPRU (MeanAbsDev): ", EXPRUMeanAbsDev, "grp ctr/util ratios: ", EXPRUs_MeanAbsDev)

EXPRULTwo, EXPRUs_LTwo = frt.Metrics.EXPRU(ranking_df, item_group_dict, relevance_df, ctr_df, 'LTwo')
print("EXPRU (LTwo): ", EXPRULTwo, "grp ctr/util ratios: ", EXPRUs_LTwo)

EXPRUVariance, EXPRUs_Variance = frt.Metrics.EXPRU(ranking_df, item_group_dict, relevance_df, ctr_df, 'Variance')
print("EXPRU (Variance): ", EXPRUVariance, "grp ctr/util ratios: ", EXPRUs_Variance)

EXPRU (MaxMinDiff):  0.18550276652962783 grp ctr/util ratios:  {'M': 1.1208556149732622, 'W': 1.30635838150289}
EXPRU (MinMaxRatio):  0.8580000946476742 grp ctr/util ratios:  {'M': 1.1208556149732622, 'W': 1.30635838150289}
EXPRU (MaxMinRatio):  1.1655010369324446 grp ctr/util ratios:  {'M': 1.1208556149732622, 'W': 1.30635838150289}
EXPRU (MaxAbsDiff):  0.09275138326481391 grp ctr/util ratios:  {'M': 1.1208556149732622, 'W': 1.30635838150289}
EXPRU (MeanAbsDev):  0.09275138326481391 grp ctr/util ratios:  {'M': 1.1208556149732622, 'W': 1.30635838150289}
EXPRU (LTwo):  1.7213046013242226 grp ctr/util ratios:  {'M': 1.1208556149732622, 'W': 1.30635838150289}
EXPRU (Variance):  0.008602819097536402 grp ctr/util ratios:  {'M': 1.1208556149732622, 'W': 1.30635838150289}
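The per-group ratios printed above can be checked by hand with plain pandas. The sketch below re-enters the same names, relevance scores, and click-through rates from the EXPRU cell and divides each group's mean click-through rate by its mean relevance; it does not call FairRankTune, and the column names are chosen here for illustration.

```python
import pandas as pd

# Same toy data as the EXPRU cell, aligned by ranking position.
ranking = ["Joe", "Jack", "Nick", "David", "Mark", "Josh", "Dave",
           "Bella", "Heidi", "Amy"]
relevance = [1, .9, .8, .82, .78, .71, .6, .59, .58, .56]
ctr = [.99, .978, .88, .86, .85, .9, .83, .82, .82, .62]
item_group = dict(Joe="M", David="M", Bella="W", Heidi="W", Amy="W",
                  Mark="M", Josh="M", Dave="M", Jack="M", Nick="M")

frame = pd.DataFrame({"item": ranking, "relevance": relevance, "ctr": ctr})
frame["group"] = frame["item"].map(item_group)

# Mean click-through rate and mean relevance per group, then their ratio.
group_means = frame.groupby("group")[["ctr", "relevance"]].mean()
ctr_util_ratios = (group_means["ctr"] / group_means["relevance"]).to_dict()
print(ctr_util_ratios)
```

Up to floating-point rounding, this gives roughly 1.1209 for group `M` and 1.3064 for group `W`, the same per-group values that every `combo` choice reports in the output above.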


## Exposure Rank Biased Precision Proportional to Relevance (ERBR)

[ERBR](https://kcachel.github.io/fairranktune/metrics/#exposure-rank-biased-precision-proportional-to-relevance-erbr) assesses whether groups receive exposure proportional to how many relevant items are in the group. It aligns with the fairness concept of statistical parity. This is a form of group fairness that considers the scores (relevances) associated with items. The per-group metric is the ratio of group exposure to the number of relevant items belonging to the given group, where exposure is measured exactly as in ERBE. This ratio for group $g_j$ is $expRBP2rel(\tau,g_j) = (1 - \gamma) \sum_{\forall x_i \in g_{j}}exposureRBP(\tau,x_i)/|g_{j}^{rel}|$, where $|g_{j}^{rel}|$ is the count of relevant items in group $g_{j}$. The range of ERBR and its "most fair" value depend on the `combo` variable.

Note that the relevance scores associated with the ranking(s) in `relevance_df` must be either 0 or 1. The first returned object is a float specifying the ERBR value and the second returned object is a dictionary of exposure-to-relevance ratios for each group (keys are group ids).
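The ratio can be sketched for one toy ranking as follows. This assumes the usual rank-biased precision exposure, $exposureRBP(\tau,x_i) = \gamma^{\tau(x_i)-1}$, which this notebook does not spell out; the helper name `rbp_exposure_to_relevance` and the toy data are hypothetical and not part of the FairRankTune API.

```python
from collections import defaultdict

def rbp_exposure_to_relevance(ranking, item_group, relevance, gamma):
    """Return {group: (1 - gamma) * sum of RBP exposure / number of relevant items}."""
    exposure_sum = defaultdict(float)
    relevant_count = defaultdict(int)
    for position, item in enumerate(ranking, start=1):
        group = item_group[item]
        # Assumption: RBP exposure decays geometrically with rank position.
        exposure_sum[group] += gamma ** (position - 1)
        if relevance[item] == 1:
            relevant_count[group] += 1
    ratios = {}
    for group, exposure in exposure_sum.items():
        if relevant_count[group] == 0:
            # The ratio is undefined when a group has no relevant items.
            ratios[group] = float("nan")
        else:
            ratios[group] = (1 - gamma) * exposure / relevant_count[group]
    return ratios

# Toy data: binary relevance, two groups, decay gamma = 0.8.
ranking = ["a", "b", "c", "d"]
item_group = {"a": "A", "b": "A", "c": "B", "d": "B"}
relevance = {"a": 1, "b": 0, "c": 1, "d": 1}

erbr_ratios = rbp_exposure_to_relevance(ranking, item_group, relevance, gamma=0.8)
print(erbr_ratios)
```

Here group `A` occupies the top two positions but has only one relevant item, so its ratio (about 0.36) is well above that of group `B` (about 0.1152), which has two relevant items at the bottom of the ranking.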

In [5]:
#Generate two biased (phi = 0) rankings of 1000 items, with three groups of 100, 700, and 200 items each.
seed = 2 #For reproducibility
decay = .8
ranking_df, item_group_dict, relevance_df = frt.RankTune.ScoredGenFromGroups(np.asarray([.1, .7, .2]),  1000, 0, 2, 'uniform', seed)

#Calculate ERBR
ERBRMaxMinDiff, ERBRs_MaxMinDiff = frt.Metrics.ERBR(ranking_df, item_group_dict, relevance_df, decay, 'MaxMinDiff')
print("ERBR (MaxMinDiff): ", ERBRMaxMinDiff, "grp exp/rel ratios: ", ERBRs_MaxMinDiff)

ERBRMinMaxRatio, ERBRs_MinMaxRatio = frt.Metrics.ERBR(ranking_df, item_group_dict, relevance_df, decay, 'MinMaxRatio')
print("ERBR (MinMaxRatio): ", ERBRMinMaxRatio, "grp exp/rel ratios: ", ERBRs_MinMaxRatio)

ERBRMaxMinRatio, ERBRs_MaxMinRatio = frt.Metrics.ERBR(ranking_df, item_group_dict, relevance_df, decay, 'MaxMinRatio')
print("ERBR (MaxMinRatio): ", ERBRMaxMinRatio, "grp exp/rel ratios: ", ERBRs_MaxMinRatio)

ERBRMaxAbsDiff, ERBRs_MaxAbsDiff = frt.Metrics.ERBR(ranking_df, item_group_dict, relevance_df, decay, 'MaxAbsDiff')
print("ERBR (MaxAbsDiff): ", ERBRMaxAbsDiff, "grp exp/rel ratios: ", ERBRs_MaxAbsDiff)

ERBRMeanAbsDev, ERBRs_MeanAbsDev = frt.Metrics.ERBR(ranking_df, item_group_dict, relevance_df, decay, 'MeanAbsDev')
print("ERBR (MeanAbsDev): ", ERBRMeanAbsDev, "grp exp/rel ratios: ", ERBRs_MeanAbsDev)

ERBRLTwo, ERBRs_LTwo = frt.Metrics.ERBR(ranking_df, item_group_dict, relevance_df, decay, 'LTwo')
print("ERBR (LTwo): ", ERBRLTwo, "grp exp/rel ratios: ", ERBRs_LTwo)

ERBRVariance, ERBRs_Variance = frt.Metrics.ERBR(ranking_df, item_group_dict, relevance_df, decay, 'Variance')
print("ERBR (Variance): ", ERBRVariance, "grp exp/rel ratios: ", ERBRs_Variance)

ERBR (MaxMinDiff):  0.010500506400700346 grp exp/rel ratios:  {0: 0.010500506400700346, 1: 3.622881107451691e-32, 2: 1.2843519883724114e-12}
ERBR (MinMaxRatio):  3.4501965611963804e-30 grp exp/rel ratios:  {0: 0.010500506400700346, 1: 3.622881107451691e-32, 2: 1.2843519883724114e-12}
ERBR (MaxMinRatio):  2.898385591263945e+29 grp exp/rel ratios:  {0: 0.010500506400700346, 1: 3.622881107451691e-32, 2: 1.2843519883724114e-12}
ERBR (MaxAbsDiff):  0.00700033760003878 grp exp/rel ratios:  {0: 0.010500506400700346, 1: 3.622881107451691e-32, 2: 1.2843519883724114e-12}
ERBR (MeanAbsDev):  0.004666891733359186 grp exp/rel ratios:  {0: 0.010500506400700346, 1: 3.622881107451691e-32, 2: 1.2843519883724114e-12}
ERBR (LTwo):  0.010500506400700346 grp exp/rel ratios:  {0: 0.010500506400700346, 1: 3.622881107451691e-32, 2: 1.2843519883724114e-12}
ERBR (Variance):  2.4502363257258353e-05 grp exp/rel ratios:  {0: 0.010500506400700346, 1: 3.622881107451691e-32, 2: 1.2843519883724114e-12}


# Score-based Individual Fairness Metrics



## Inequity of Amortized Attention (IAA)

[IAA](https://kcachel.github.io/fairranktune/metrics/#inequity-of-amortized-attention-iaa) assesses whether a series of rankings is individually fair, meaning items receive attention similar to their relevance. IAA measures the difference, via the $L_1$ norm, between the cumulative attention and cumulative relevance of items in the rankings. The attention of an item $x_i$ in ranking $\tau$ is $attention(\tau,x_i) = 1 / log_2(\tau(x_i)+1)$ and the relevance of an item is a score normalized to $[0, 1]$. IAA ranges from 0 to $\infty$ and is most fair at 0.

Note that the relevance scores associated with the rankings in `relevance_df` must be between 0 and 1. The returned object is a float specifying the IAA value.
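A literal reading of this description can be sketched in a few lines. The helper `iaa_sketch` and its list-and-dict inputs are hypothetical; the library may normalize or aggregate differently, so values from this sketch are not expected to match `frt.Metrics.IAA`.

```python
import math
from collections import defaultdict

def iaa_sketch(rankings, relevances):
    """L1 distance between cumulative attention and cumulative relevance per item."""
    cumulative_attention = defaultdict(float)
    cumulative_relevance = defaultdict(float)
    for ranking, relevance in zip(rankings, relevances):
        for position, item in enumerate(ranking, start=1):
            # Attention formula from the text: 1 / log2(position + 1).
            cumulative_attention[item] += 1.0 / math.log2(position + 1)
            cumulative_relevance[item] += relevance[item]
    return sum(abs(cumulative_attention[item] - cumulative_relevance[item])
               for item in cumulative_attention)

# Toy data: two rankings over the same three items, relevance in [0, 1].
rankings = [["a", "b", "c"], ["b", "a", "c"]]
relevances = [{"a": 1.0, "b": 0.5, "c": 0.2},
              {"a": 1.0, "b": 0.5, "c": 0.2}]

iaa_value = iaa_sketch(rankings, relevances)
print(iaa_value)
```

For this toy input the result is about 1.6: item `a` receives less attention than its cumulative relevance, while `b` and `c` receive more, and the $L_1$ norm adds up all three gaps regardless of direction.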

In [6]:
#Generate two biased (phi = 0) rankings of 1000 items, with two groups of 100 and 900 items each.
seed = 2 #For reproducibility
ranking_df, item_group_dict, relevance_df = frt.RankTune.ScoredGenFromGroups(np.asarray([.1, .9]),  1000, 0, 2, 'uniform', seed)

#Calculate IAA
IAA = frt.Metrics.IAA(ranking_df, relevance_df)
print("IAA value: ", IAA)

IAA value:  373.55172571864875
