# Fairness auditing for subgroups using Fairness Aware Counterfactuals for Subgroups (FACTS)

[FACTS](https://arxiv.org/abs/2306.14978) is an efficient, model-agnostic, highly parameterizable, and explainable framework for auditing subgroup fairness through counterfactual explanations. FACTS focuses on identifying a specific type of bias: the *difficulty in achieving recourse*. In short, it considers the population that has received the unfavorable outcome from an ML model (the *affected population*) and tries to identify differences, between affected subpopulations, in how difficult it is to change the model's decision to the favorable outcome.

In this notebook, we will see how to use this algorithm for discovering subgroups where the bias of a model (logistic regression for simplicity) between Males and Females is high.

We will use the Adult dataset from UCI ([reference](https://archive.ics.uci.edu/ml/datasets/adult)).

# Preliminaries

## Import dependencies

As usual in Python, the first step is to import all necessary packages.

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from aif360.sklearn.datasets.openml_datasets import fetch_adult
from aif360.sklearn.detectors.facts.clean import clean_dataset
from aif360.sklearn.detectors.facts import FACTS, FACTS_bias_scan

from IPython.display import display

import warnings
warnings.filterwarnings("ignore")



Below, you can set the `random_seed` variable to `None` if you would like the pseudo-random parts to change between runs. We have fixed it to a specific value for reproducibility.

In [2]:
random_seed = 131313 # for reproducibility

## Load Dataset

In [3]:
# load the adult dataset and perform some simple preprocessing steps
# See output for a glimpse of the final dataset's characteristics
X, y, sample_weight = fetch_adult()
data = clean_dataset(X.assign(income=y), "adult")
display(data.head())

# split into train-test data
y = data['income']
X = data.drop('income', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=random_seed, stratify=y)

| | age | workclass | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (16.999, 26.0] | Private | 7.0 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0.0 | 0.0 | FullTime | United-States | 0 |
| 1 | (34.0, 41.0] | Private | 9.0 | Married-civ-spouse | Farming-fishing | Married | White | Male | 0.0 | 0.0 | OverTime | United-States | 0 |
| 2 | (26.0, 34.0] | Local-gov | 12.0 | Married-civ-spouse | Protective-serv | Married | White | Male | 0.0 | 0.0 | FullTime | United-States | 1 |
| 3 | (41.0, 50.0] | Private | 10.0 | Married-civ-spouse | Machine-op-inspct | Married | Black | Male | 7688.0 | 0.0 | FullTime | United-States | 1 |
| 4 | (26.0, 34.0] | Private | 6.0 | Never-married | Other-service | Not-in-family | White | Male | 0.0 | 0.0 | MidTime | United-States | 0 |


## Example Model to be used for Auditing

We use the training set to fit a simple logistic regression model. This serves as the demonstrative model, which we then treat as a black box and audit with our algorithm.

Of course, any model can be used in its place. Our purpose here is not to produce a good model, but to audit the fairness of an existing one.

In [4]:
#### here, we incrementally build the example model. It consists of one preprocessing step,
#### which is to turn categorical features into the respective one-hot encodings, and
#### a simple scikit-learn logistic regressor.
categorical_features = X.select_dtypes(include=["object", "category"]).columns.to_list()
categorical_features_onehot_transformer = ColumnTransformer(
    transformers=[
        ("one-hot-encoder", OneHotEncoder(), categorical_features)
    ],
    remainder="passthrough"
)
model = Pipeline([
    ("one-hot-encoder", categorical_features_onehot_transformer),
    ("clf", LogisticRegression(max_iter=1500))
])

#### train the model
model = model.fit(X_train, y_train)

In [5]:
# showcase model's accuracy
y_pred = model.predict(X_test)
print(f"Accuracy = {(y_test.values == y_pred).sum() / y_test.shape[0]:.2%}")

Accuracy = 85.16%


# A Practical Example of FACTS

The real essence of our work starts here. Specifically, we showcase the generation of candidate subpopulation groups and counterfactuals and the detection of those groups that exhibit the greatest unfairness, with respect to one of several metrics.

## Load and Fit FACTS

In [6]:
# load FACTS framework with:
# - the model to be audited
# - protected attribute "sex" and
# - assigning equal, unit weights to all features for cost computation.
# - no features forbidden from changing (list here any features that must never change).
detector = FACTS(
    clf=model,
    prot_attr="sex",
    feature_weights={f: 1 for f in X.columns},
    feats_not_allowed_to_change=[]
)
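To make the role of `feature_weights` more concrete, here is a minimal sketch of how a cost could be assigned to an action when every changed feature is categorical: the cost is the sum of the weights of the features that the action changes. The helper `action_cost` is hypothetical and is not part of `aif360`; the library's treatment of numerical features may well differ.

```python
def action_cost(action, feature_weights):
    """Sum the weights of the features that an action changes.

    Hypothetical helper: assumes every changed feature is categorical,
    so each change contributes exactly its weight to the total cost.
    """
    return float(sum(feature_weights[feature] for feature in action))


# Unit weights for the two features used in this illustration.
feature_weights = {"age": 1, "hours-per-week": 1}

two_feature_action = {"age": "(41.0, 50.0]", "hours-per-week": "OverTime"}
one_feature_action = {"age": "(41.0, 50.0]"}

cost_two = action_cost(two_feature_action, feature_weights)
cost_one = action_cost(one_feature_action, feature_weights)
print(cost_two, cost_one)
```

Under this assumption, raising the weight of a feature makes every action that touches it proportionally more expensive.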

In [7]:
# generates candidate subpopulation groups for bias and candidate actions
detector = detector.fit(X_test)

Computing candidate subgroups.
100% | 1046/1046 [00:00<00:00, 523287.45it/s]
Number of subgroups: 563
Computing candidate recourses for all subgroups.
100% | 563/563 [00:00<00:00, 50669.32it/s]
Computing percentages of individuals flipped by each action independently.
100% | 590/590 [00:13<00:00, 43.37it/s]
Computing percentages of individuals flipped by any action with cost up to c, for every c
100% | 416/416 [00:12<00:00, 32.57it/s]


## Detect Groups with Unfairness in Protected Subgroups (using "Equal Choice for Recourse" metric)

Here we demonstrate the `bias_scan` method of our detector, which ranks subpopulation groups from most to least unfair, with respect to the chosen metric and, of course, the protected attribute.

For the purposes of the demo, we use the "Equal Choice for Recourse" definition / metric. This posits that the classifier acts fairly for the group in question if the protected subgroups can choose among the same number of sufficiently effective actions to achieve recourse. By sufficiently effective we mean those actions (out of all candidates) which work for at least $100\phi \%$ (for some $\phi \in [0,1]$) of the subgroup.

Given this definition, the respective unfairness *metric* is defined to be the difference in the number of sufficiently effective actions between the two protected subgroups.

**Suggestion**: this metric may be useful when the aim is to guarantee that protected subgroups have a similar range of options for attaining a favorable outcome. For example, when evaluating job candidates, an employer may wish to ensure that applicants from different backgrounds who currently fail to meet expectations have an equally wide array of career or retraining opportunities that could land them the job, so as to promote diversity across all sectors of the company and the many roles within them.

In [8]:
# Detects the top `top_count` most biased groups based on the given metric
# available metrics are:
# - equal-effectiveness
# - equal-choice-for-recourse
# - equal-effectiveness-within-budget
# - equal-cost-of-effectiveness
# - equal-mean-recourse
# - fair-tradeoff
# a short description for each metric is given below
detector.bias_scan(
    metric="equal-choice-for-recourse",
    phi=0.1,
    top_count=3
)

In [9]:
# prints the result into a nicely formatted report
detector.print_recourse_report(
    show_action_costs=False,
    show_subgroup_costs=True,
    show_unbiased_subgroups=False,
)

If age = (26.0, 34.0], hours-per-week = FullTime:
	Protected Subgroup 'Female', 10.59% covered
		Make age = (41.0, 50.0], hours-per-week = OverTime with effectiveness 7.73%.
		Make age = (41.0, 50.0] with effectiveness 3.98%.
		Make age = (34.0, 41.0], hours-per-week = OverTime with effectiveness 5.39%.
		Aggregate cost of the above recourses = 0.00
	Protected Subgroup 'Male', 13.78% covered
		Make age = (41.0, 50.0], hours-per-week = OverTime with effectiveness 19.66%.
		Make age = (41.0, 50.0] with effectiveness 10.63%.
		Make age = (34.0, 41.0], hours-per-week = OverTime with effectiveness 13.39%.
		Aggregate cost of the above recourses = -3.00
	Bias against Female with respect to equal-choice-f

### Example Output Breakdown

Let us now dissect the above example and the output we see, one step at a time.

#### Prelude: $\phi = 0.1$

As we mentioned in the general description of this metric, this is the parameter that determines whether we consider an action sufficiently effective or not. So, here, we consider an action effective if it manages to flip the prediction for at least 10% of the individuals under study, and ineffective otherwise.

#### **age = (26.0, 34.0], hours-per-week = FullTime**

This is the first (hence, most biased) group. The group description is mostly self-explanatory: everything inside this block concerns the affected individuals who are older than 26 and at most 34 years old and have a full-time job. Since the output has the same structure for all groups, let us take this group as an example and further dissect the output we see in this block.

#### *Protected subgroups 'Male' / 'Female'*

We split the population of this group according to the protected attribute. Hence, we distinguish between males that are 26-34 years old and have a full-time job and females that are 26-34 years old and have a full-time job.

The "covered" percentage reported here signifies that, out of all affected females, 10.59% are 26-34 years old and have a full-time job, while the respective percentage for males is 13.78%.
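The coverage figure can be illustrated with a minimal sketch. The helper `coverage` and the four toy records are assumptions made for illustration and are not part of the FACTS API: coverage is taken to be the fraction of affected individuals of one protected subgroup that match every feature value in the group description.

```python
def coverage(affected, subgroup):
    """Fraction of affected individuals matching every feature value in `subgroup`."""
    if not affected:
        return 0.0
    matches = [
        person
        for person in affected
        if all(person.get(feature) == value for feature, value in subgroup.items())
    ]
    return len(matches) / len(affected)


# Toy affected females (made-up records, not taken from the Adult dataset).
affected_females = [
    {"age": "(26.0, 34.0]", "hours-per-week": "FullTime"},
    {"age": "(26.0, 34.0]", "hours-per-week": "OverTime"},
    {"age": "(41.0, 50.0]", "hours-per-week": "FullTime"},
    {"age": "(16.999, 26.0]", "hours-per-week": "FullTime"},
]
subgroup = {"age": "(26.0, 34.0]", "hours-per-week": "FullTime"}

female_coverage = coverage(affected_females, subgroup)
print(f"{female_coverage:.2%} covered")
```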

#### *Make age = (41.0, 50.0], hours-per-week = OverTime*

This is one of the 3 actions we have tried to apply on the individuals in the current subpopulation group. We report the action, along with its effectiveness and, optionally, the cost; here we omit the action cost because the "Equal Choice for Recourse" metric does not take it into account.

At this point, let us give a more direct interpretation of the **effectiveness**. In this case, for example, the interpretation could be the following: if all females aged 26-34 with full-time jobs change their age group to 41-50 years old and their working hours to overtime, then 7.73% of them will actually manage to receive the positive prediction from the model. The rest will still receive the negative prediction.
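This interpretation can be sketched in a few lines. Everything here is an assumption made for illustration: `toy_model` is a hand-written stand-in for the audited classifier, `effectiveness` is a hypothetical helper rather than a FACTS function, and the records are made up.

```python
def toy_model(person):
    """Stand-in classifier (hypothetical): 1 is the favorable outcome."""
    return int(person["hours-per-week"] == "OverTime" and person["education-num"] >= 10)


def effectiveness(model, individuals, action):
    """Fraction of individuals who receive the favorable outcome after the action."""
    if not individuals:
        return 0.0
    flipped = 0
    for person in individuals:
        # Apply the action: overwrite only the features it mentions.
        counterfactual = {**person, **action}
        flipped += model(counterfactual) == 1
    return flipped / len(individuals)


# Four affected individuals: all currently receive the unfavorable outcome.
affected = [
    {"age": "(26.0, 34.0]", "hours-per-week": "FullTime", "education-num": 9},
    {"age": "(26.0, 34.0]", "hours-per-week": "FullTime", "education-num": 13},
    {"age": "(26.0, 34.0]", "hours-per-week": "FullTime", "education-num": 7},
    {"age": "(26.0, 34.0]", "hours-per-week": "FullTime", "education-num": 10},
]
action = {"age": "(41.0, 50.0]", "hours-per-week": "OverTime"}

eff = effectiveness(toy_model, affected, action)
print(f"effectiveness = {eff:.2%}")
```

Note that the action is applied to everyone in the group, but only those whose counterfactual crosses the decision boundary count towards the effectiveness.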

#### *Protected Subgroups' Aggregate Cost*

The "aggregate cost of the above recourses" message shows how we quantify the *cost of recourse* for all actions in each protected subgroup.

This is derived directly from the definition of each metric. Here, for example, we use the "Equal Choice for Recourse" metric, which counts the number of effective actions available to each of the protected subgroups. In this group, females have no (sufficiently) effective actions, and as such we say that they gain 0 units. Males have 3 effective actions, so they gain 3 units.

Finally, to keep the formalization of having costs everywhere, we rephrase this instead into males having a recourse cost of -3 and females having a recourse cost of 0.

As we also mention in the next paragraph, the final bias score of the subgroup is nothing more than the absolute difference of these 2 costs.

#### *Bias Deduction / Metric Application*

Given the above, one can see that the (same) actions, if applied to females of the subpopulation group, never reach 10% effectiveness, while for males they achieve up to 19.66%! This is why we argue that, in terms of recourse bias, this group exhibits bias against females.

This is, of course, with respect to the "Equal Choice for Recourse" metric, which posits that the 2 protected subgroups should have the same number of effective actions. Since none of the 3 actions are sufficiently effective for females, and all 3 of them are sufficiently effective for males, we score this group as having a bias measure of $|0 - 3| = 3$.
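The arithmetic behind this score can be reproduced with a minimal sketch. The effectiveness values are copied from the report above; the helper `count_effective_actions` is hypothetical and only mirrors the description in the text, not the library's internals.

```python
def count_effective_actions(effectivenesses, phi):
    """Number of actions whose effectiveness is at least phi."""
    return sum(1 for eff in effectivenesses if eff >= phi)


# Effectiveness of the three actions, as printed in the report above.
female_eff = [0.0773, 0.0398, 0.0539]
male_eff = [0.1966, 0.1063, 0.1339]
phi = 0.1

# Each sufficiently effective action is a gain, i.e. a negative cost.
female_cost = -count_effective_actions(female_eff, phi)
male_cost = -count_effective_actions(male_eff, phi)
bias_score = abs(female_cost - male_cost)
print(female_cost, male_cost, bias_score)

# With a much stricter threshold, neither subgroup has an effective action.
strict_score = abs(
    count_effective_actions(female_eff, 0.7) - count_effective_actions(male_eff, 0.7)
)
print(strict_score)
```

For these three actions, raising $\phi$ to 0.7 leaves both subgroups with zero sufficiently effective actions, so the score for this group drops to 0.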

### Example without Bias of Recourse

For completeness, we also demonstrate how, for some choices of metrics and parameters, FACTS may fail to find any subpopulation groups that exhibit bias between the protected populations, and thus deduce that in this case there is no recourse-related bias.

In [10]:
detector.bias_scan(
    metric="equal-choice-for-recourse",
    phi=0.7,
    top_count=3
)

In [11]:
# prints the result into a nicely formatted report
detector.print_recourse_report(
    show_action_costs=False,
    show_subgroup_costs=True,
    show_unbiased_subgroups=False,
)

With the given parameters, no recourses showing unfairness have been found!


## Alternative API

We also provide a more succinct API in the form of a wrapper function. This is closer in style to the API of existing `aif360` detectors.

The previous example could be run equivalently with the following.

In [12]:
most_biased_subgroups = FACTS_bias_scan(
    X=X_test,
    clf=model,
    prot_attr="sex",
    feature_weights={f: 1 for f in X.columns},
    feats_not_allowed_to_change=[],
    metric="equal-choice-for-recourse",
    phi=0.1,
    top_count=3,
    verbose=False, # hides progress bars
    print_recourse_report=True,
    show_action_costs=True,
    show_subgroup_costs=True,
)

If age = (26.0, 34.0], hours-per-week = FullTime:
	Protected Subgroup 'Female', 10.59% covered
		Make age = (41.0, 50.0], hours-per-week = OverTime with effectiveness 7.73% and counterfactual cost = 2.0.
		Make age = (41.0, 50.0] with effectiveness 3.98% and counterfactual cost = 1.0.
		Make age = (34.0, 41.0], hours-per-week = OverTime with effectiveness 5.39% and counterfactual cost = 2.0.
		Aggregate cost of the above recourses = 0.00
	Protected Subgroup 'Male', 13.78% covered
		Make age = (41.0, 50.0], hours-per-week = OverTime with effectiveness 19.66% and counterfactual cost = 2.0.
		Make age = (41.0, 50.0] with effectiveness 10.63% and counterfactual cost = 1.0.
		Make age = (34.0, 41.0], hours-per-week = OverTime with effe

In [13]:
most_biased_subgroups

[({'hours-per-week': 'FullTime', 'native-country': 'United-States'}, 1),
 ({'age': Interval(26.0, 34.0, closed='right'), 'hours-per-week': 'FullTime'},
  3),
 ({'age': Interval(26.0, 34.0, closed='right'),
   'capital-loss': 0.0,
   'hours-per-week': 'FullTime'},
  2)]
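The returned list pairs each subgroup description with its bias score and, in the output above, is not ordered by score. A minimal sketch of ranking it follows. The list is re-typed by hand from the output above, with plain strings standing in for the pandas `Interval` objects, so that the snippet runs on its own.

```python
# Hand-copied from the output above; strings stand in for pandas Interval objects.
results = [
    ({"hours-per-week": "FullTime", "native-country": "United-States"}, 1),
    ({"age": "(26.0, 34.0]", "hours-per-week": "FullTime"}, 3),
    ({"age": "(26.0, 34.0]", "capital-loss": 0.0, "hours-per-week": "FullTime"}, 2),
]

# Sort from the highest to the lowest bias score.
ranked = sorted(results, key=lambda item: item[1], reverse=True)

for subgroup, score in ranked:
    description = ", ".join(f"{feature} = {value}" for feature, value in subgroup.items())
    print(f"score {score}: {description}")

top_subgroup, top_score = ranked[0]
```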

# Short Description of all Definitions / Metrics of Subgroup Recourse Fairness

Here we give a brief description of each of the metrics available in our framework apart from "Equal Choice for Recourse".

## Equal Effectiveness

The classifier is considered to act fairly for a population group if the same proportion of individuals in the protected subgroups can achieve recourse.

**Suggestion**: this metric ignores costs altogether and compares only the percentage of males vs. females that can cross the model's decision boundary by the same actions. We would use it in applications where the goal is equal impact, in the sense that a change (or a set thereof) affects the same proportion of individuals in the protected subgroups. For example, in a hiring scenario, a similar proportion of males and females are expected to benefit from the same change.
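As a rough illustration of this definition, the sketch below compares the proportion of individuals in each protected subgroup that achieve recourse. The helper `recourse_rate` and the boolean flags are made up for illustration; each flag is assumed to record whether at least one candidate action flips that individual's prediction.

```python
def recourse_rate(achieves_recourse):
    """Proportion of individuals for whom at least one action achieves recourse."""
    if not achieves_recourse:
        return 0.0
    return sum(achieves_recourse) / len(achieves_recourse)


# Toy flags: True if some candidate action flips the individual's prediction.
female_flags = [True, False, False, False, False]
male_flags = [True, True, False, False, False]

female_rate = recourse_rate(female_flags)
male_rate = recourse_rate(male_flags)
effectiveness_gap = abs(male_rate - female_rate)
print(f"female: {female_rate:.0%}, male: {male_rate:.0%}, gap: {effectiveness_gap:.0%}")
```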

## Equal Effectiveness within Budget

The classifier is considered to act fairly for a population group if the same proportion of individuals in the protected subgroups can achieve recourse with a cost at most $c$, where $c$ is some user-provided cost budget.

**Suggestion**: this metric is similar to the above, but puts a bound on how large the cost of an action can be. It could be used to exclude changes with undesirably large cost, e.g., by allowing salary changes only up to 10K.
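A minimal sketch of the budgeted variant follows, under the assumption that for every individual we already know the cheapest action that flips their prediction (`None` if no action works). The helper `recourse_rate_within_budget` and the cost values are hypothetical.

```python
def recourse_rate_within_budget(min_costs, budget):
    """Proportion of individuals who achieve recourse with cost at most `budget`."""
    if not min_costs:
        return 0.0
    reached = sum(1 for cost in min_costs if cost is not None and cost <= budget)
    return reached / len(min_costs)


# Toy data: cheapest successful action per individual, None if none succeeds.
female_min_costs = [1.0, 2.0, 3.0, None]
male_min_costs = [1.0, 1.0, 2.0, None]
budget = 1.0  # the user-provided cost budget c

female_rate = recourse_rate_within_budget(female_min_costs, budget)
male_rate = recourse_rate_within_budget(male_min_costs, budget)
budget_gap = abs(male_rate - female_rate)
print(female_rate, male_rate, budget_gap)
```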

## Equal Cost of Effectiveness

The classifier is considered to act fairly for a population group if the minimum cost required to be sufficiently effective in the protected subgroups is equal. Again, as in "Equal Choice for Recourse", by "sufficiently effective" we refer to those actions that successfully flip the model's decision for at least $100\phi \%$ (for $\phi \in [0,1]$) of the subgroup.

**Suggestion**: this metric could be useful when an external factor imposes a specific threshold, e.g. in credit risk assessment, a guideline which states that the effort required to be 80% certain that you will have your loan accepted should be the same for males and females.
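The definition can be sketched as follows, with made-up `(cost, effectiveness)` pairs and a hypothetical helper `min_cost_of_effectiveness`. Returning infinity when no action is sufficiently effective is a choice made for this sketch, not necessarily what the library does.

```python
import math


def min_cost_of_effectiveness(actions, phi):
    """Cheapest cost among actions with effectiveness >= phi (inf if there is none)."""
    costs = [cost for cost, eff in actions if eff >= phi]
    return min(costs, default=math.inf)


# Toy (cost, effectiveness) pairs for the same three actions in each subgroup.
female_actions = [(1.0, 0.20), (2.0, 0.40), (3.0, 0.60)]
male_actions = [(1.0, 0.30), (2.0, 0.55), (3.0, 0.70)]
phi = 0.5

female_min_cost = min_cost_of_effectiveness(female_actions, phi)
male_min_cost = min_cost_of_effectiveness(male_actions, phi)
cost_gap = abs(female_min_cost - male_min_cost)
print(female_min_cost, male_min_cost, cost_gap)
```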

## Equal (Conditional) Mean Recourse

This definition extends the notion of *burden* from the literature ([reference](https://dl.acm.org/doi/10.1145/3375627.3375812)) to the case where not all individuals may achieve recourse. Omitting some details, given any set of individuals, the **conditional mean recourse cost** is the mean recourse cost among the subset of individuals that can actually achieve recourse, i.e. by at least one of the available actions.

Given the above, this definition considers the classifier to act fairly for a population group if the (conditional) mean recourse cost for the protected subgroups is the same.

**Suggestion**: this metric compares the mean cost required to achieve recourse for the protected subgroups. It could be useful in a scenario like loan approval, where one needs to ensure that the cost of the changes needed to receive the loan is the same for males and females on average.
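A minimal sketch of the conditional mean follows, again assuming we know each individual's cheapest successful action (`None` when no action achieves recourse). The helper name and the numbers are made up for illustration.

```python
def conditional_mean_recourse_cost(min_costs):
    """Mean recourse cost over the individuals that can achieve recourse at all."""
    achieved = [cost for cost in min_costs if cost is not None]
    if not achieved:
        return None  # nobody achieves recourse, so the mean is undefined
    return sum(achieved) / len(achieved)


# Toy data: cheapest successful action per individual, None if none succeeds.
female_min_costs = [1.0, 2.0, 3.0, None]
male_min_costs = [1.0, 1.0, 2.0, None]

female_mean = conditional_mean_recourse_cost(female_min_costs)
male_mean = conditional_mean_recourse_cost(male_min_costs)
mean_gap = abs(female_mean - male_mean)
print(f"female: {female_mean:.2f}, male: {male_mean:.2f}, gap: {mean_gap:.2f}")
```

Individuals without any successful action are excluded from the mean rather than counted with zero or infinite cost, which is what makes the mean "conditional".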

## Fair Effectiveness-Cost Trade-Off

This is the strictest definition, which considers the classifier to act fairly for a population group only if the protected subgroups have the same effectiveness-cost distribution (checked in the implementation via a statistical test).

Equivalently, Equal Effectiveness within Budget must hold for *every* value of the cost budget $c$.

**Suggestion**: this metric considers all available actions and compares all their possible trade-offs between effectiveness and cost among the protected subgroups. This could be useful for cases where the protected attribute should have absolutely no impact on the available options to achieve recourse, such as in high-risk situations like estimating the risk that a convicted individual will act unlawfully in the future (as in the well-known [COMPAS dataset](https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis)).
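To give a feel for what comparing effectiveness-cost distributions means, the sketch below builds the curve "proportion achieving recourse with cost at most $c$" for each protected subgroup and reports the largest vertical gap between the two curves. This is only an illustration with made-up data and hypothetical helper names: it computes a gap statistic and does not perform the statistical test used in the implementation.

```python
def effectiveness_cost_curve(min_costs, budgets):
    """Proportion of individuals achieving recourse within each budget."""
    total = len(min_costs)
    if total == 0:
        return [0.0 for _ in budgets]
    return [
        sum(1 for cost in min_costs if cost is not None and cost <= budget) / total
        for budget in budgets
    ]


# Toy data: cheapest successful action per individual, None if none succeeds.
female_min_costs = [1.0, 2.0, 3.0, None]
male_min_costs = [1.0, 1.0, 2.0, None]
budgets = [1.0, 2.0, 3.0]

female_curve = effectiveness_cost_curve(female_min_costs, budgets)
male_curve = effectiveness_cost_curve(male_min_costs, budgets)

# Largest difference between the two curves over all budgets.
max_gap = max(abs(m - f) for m, f in zip(male_curve, female_curve))
print(female_curve, male_curve, max_gap)
```

In this toy example the curves coincide only at the largest budget, so Equal Effectiveness within Budget would hold for $c = 3$ but not for every $c$.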