# FairSD settings
This example uses the UCI Adult dataset, where the objective is to predict whether a person earns more (label 1) or less (label 0) than $50,000 a year.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
#Import dataset
d = fetch_openml(data_id=1590, as_frame=True)
X = d.data
d_train=pd.get_dummies(X)
y_true = (d.target == '>50K') * 1
#training the classifier
classifier = DecisionTreeClassifier(min_samples_leaf=10, max_depth=4)
classifier.fit(d_train, y_true)
#Producing y_pred
y_pred = classifier.predict(d_train)

## SubgroupDiscoveryTask Object
This class holds all the parameters used by the subgroup discovery algorithms.<br/>
**<u>Parameters</u>:**

|name |type |default value |description
|:-----|:-----|:-----|:----- 
|X| pandas.DataFrame or <br/> numpy.ndarray| |dataset.
|y_true| pandas.DataFrame,<br/> pandas.Series or <br/> numpy.ndarray| |ground truth.
|y_pred| pandas.DataFrame,<br/> pandas.Series or <br/> numpy.ndarray| |predicted label.
|feature_names| list of strings| None| see below.
|nominal_features| list of strings| None| see below.
|numeric_features| list of strings| None| see below.
|qf | String or <br/>callable object| 'equalized_odds_difference'| quality function to use.
|discretizer| String| 'equalfreq'| see below.
|num_bins | int| 6|  see below.
|dynamic_discretization| bool| True| see below.
|result_set_size| int| 5| maximum number of subgroups <br/>that the sg discovery will return.
|depth| int| 3| maximum number of descriptors<br/> that a description can contain.
|min_quality| float| 0.1| minimum quality<br/> that a subgroup needs to be selected. 
|min_support| int| 200| minimum size <br/>that a subgroup needs to be selected. 

### X and feature_names
The X parameter can be a pandas DataFrame or a NumPy array. If we pass a NumPy array, we must also pass the feature_names parameter, a list with the column names of X:

In [2]:
import fairsd as fsd
task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred) # X is a pandas.DataFrame

#X as numpy array: the column names must be passed explicitly
x_np    = X.to_numpy()
columns = list(X.columns)
task=fsd.SubgroupDiscoveryTask(x_np, y_true, y_pred, feature_names = columns)

### nominal_features and numeric_features
These two parameters (lists of strings) specify which columns of X hold nominal values and which hold numeric values. For attributes that do not appear in either list, the data type is inferred automatically.<br/>
**Example:**<br/>
    Let's analyze the education-num attribute

In [3]:
educationnum_val = X['education-num'].unique()
print(np.sort(educationnum_val))

[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16.]


The values of this attribute are the integers from 1 to 16. <br/>
By default the package treats this attribute as numeric, but if we want to treat it as nominal we can use the nominal_features parameter:

In [4]:
task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, nominal_features = ['education-num'])

### discretizer
This parameter determines the algorithm used to discretize the numerical features.
It is set to 'equalfreq' by default, but other options are available:<br/>

|discretizer value |description 
|:-----|:----- 
|'equalfreq'|approximate equal frequency discretization.
|'equalwidth'|equal width discretization.
|'mdlp'|minimum description length principle discretization (Fayyad and Irani approach).

In [5]:
#example
task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, discretizer='mdlp')
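
To make the difference between the first two options concrete, here is a minimal pure-Python sketch of how equal-width and equal-frequency cut points can be computed. The helper names `equal_width_edges` and `equal_freq_edges` and the toy `ages` list are illustrative assumptions; this is not the code used inside fairsd.

```python
def equal_width_edges(values, num_bins):
    """Cut points that split [min, max] into num_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(1, num_bins)]


def equal_freq_edges(values, num_bins):
    """Cut points that put roughly the same number of values in each bin.

    Each cut point is the largest value of its bin, so bins are right-closed.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_bins - 1] for i in range(1, num_bins)]


# Hypothetical toy data: most values are small, one is an outlier.
ages = [18, 19, 20, 21, 22, 23, 24, 30, 45, 90]

width_cuts = equal_width_edges(ages, 3)
freq_cuts = equal_freq_edges(ages, 3)
print(width_cuts)  # [42.0, 66.0]
print(freq_cuts)   # [20, 23]
```

With the outlier 90 in the data, the equal-width cut points leave most values in the first bin, whereas the equal-frequency cut points yield bins of 3, 3 and 4 values.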

### num_bins
This parameter determines the maximum number of bins that the discretization of a numerical feature will produce. Fewer than num_bins bins may be produced because of the min_support constraint:<br/>
* the equal-frequency and mdlp discretizations never produce bins smaller than min_support, since such bins could never be used to build a description with large enough support.<br/>

If a value lower than two is used, the number of bins is decided automatically.


In [6]:
#example
task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, discretizer='equalwidth', num_bins = 4)
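
The interaction between num_bins and min_support can be illustrated with a small calculation. The function `effective_num_bins` below is a hypothetical helper, written under the assumption that an equal-frequency discretizer simply caps the number of bins so that each bin can still hold at least min_support rows; the actual rule used by fairsd may differ.

```python
def effective_num_bins(n_rows, num_bins, min_support):
    """Largest bin count <= num_bins such that equal-frequency bins
    can each hold at least min_support rows (assumed rule)."""
    if min_support <= 0:
        return num_bins
    return max(1, min(num_bins, n_rows // min_support))


# Plenty of rows: the requested number of bins is kept.
many_rows = effective_num_bins(48842, num_bins=6, min_support=200)
# Only 900 rows: six bins would hold 150 rows each, below min_support.
few_rows = effective_num_bins(900, num_bins=6, min_support=200)
print(many_rows)  # 6
print(few_rows)   # 4
```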

### dynamic_discretization
If set to False, each numerical feature is discretized only once, before the subgroup discovery algorithm starts.<br/>
If set to True, the discretization of numerical features is integrated into the subgroup discovery algorithm: it is performed each time the algorithm tries to expand a description with a numerical attribute.

In [7]:
#example
task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, dynamic_discretization = False)

## Quality Measures
### Fairlearn Quality measures
All the Fairlearn metrics can be used as quality measures (the qf parameter of the SubgroupDiscoveryTask object). See the Fairlearn documentation [here](https://fairlearn.github.io/v0.6.0/api_reference/fairlearn.metrics.html).<br/>
**The predefined fairlearn metrics are:**


|Metric name | Description
|:-----|:-----
| demographic_parity_difference |  Defined as the absolute value of the difference in the **selection rates** between a subgroup and its negation.
|demographic_parity_ratio | Defined as the ratio between the smallest and the largest group-level **selection rate**, between a subgroup and its negation.
|equalized_odds_difference | The greater of two metrics: true_positive_rate_difference and false_positive_rate_difference. The former is the difference between the TPRs, between a subgroup and its negation. The latter is defined similarly, but for FPRs.
|equalized_odds_ratio | The smaller of two metrics: true_positive_rate_ratio and false_positive_rate_ratio. The former is the ratio between the TPRs, between a subgroup and its negation. The latter is defined similarly, but for FPRs.
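
The definitions in the table can be checked by hand on a tiny example. The sketch below reimplements three of them in plain Python for a single subgroup and its negation; the function names (`dp_difference`, `dp_ratio`, `eo_difference`) and the toy labels are assumptions made for illustration, and in practice the Fairlearn implementations should be used.

```python
def rate(flags):
    """Fraction of 1s in a list; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def split(in_subgroup):
    """Row indices inside the subgroup and inside its negation."""
    inside = [i for i, m in enumerate(in_subgroup) if m]
    outside = [i for i, m in enumerate(in_subgroup) if not m]
    return inside, outside


def selection_rate(y_pred, idx):
    return rate([y_pred[i] for i in idx])


def cond_rate(y_true, y_pred, idx, label):
    """TPR when label == 1, FPR when label == 0."""
    return rate([y_pred[i] for i in idx if y_true[i] == label])


def dp_difference(y_true, y_pred, in_subgroup):
    a, b = split(in_subgroup)
    return abs(selection_rate(y_pred, a) - selection_rate(y_pred, b))


def dp_ratio(y_true, y_pred, in_subgroup):
    a, b = split(in_subgroup)
    rates = [selection_rate(y_pred, a), selection_rate(y_pred, b)]
    return min(rates) / max(rates) if max(rates) > 0 else 0.0


def eo_difference(y_true, y_pred, in_subgroup):
    a, b = split(in_subgroup)
    tpr_diff = abs(cond_rate(y_true, y_pred, a, 1) - cond_rate(y_true, y_pred, b, 1))
    fpr_diff = abs(cond_rate(y_true, y_pred, a, 0) - cond_rate(y_true, y_pred, b, 0))
    return max(tpr_diff, fpr_diff)


# Hypothetical toy data: the first four rows form the subgroup.
y_true_toy = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0]
mask = [True, True, True, True, False, False, False, False]

dp = dp_difference(y_true_toy, y_pred_toy, mask)  # |0.75 - 0.25|
ratio = dp_ratio(y_true_toy, y_pred_toy, mask)    # 0.25 / 0.75
eo = eo_difference(y_true_toy, y_pred_toy, mask)  # max(|1.0 - 0.5|, |0.5 - 0.0|)
print(dp, ratio, eo)
```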

We can initialize a SubgroupDiscoveryTask object in this way:

In [8]:
import fairsd as fsd
from fairlearn.metrics import demographic_parity_ratio
task = fsd.SubgroupDiscoveryTask(X, y_true, y_pred, qf=demographic_parity_ratio)

Or, more concisely, for these four predefined Fairlearn metrics we can pass the metric name as a string:

In [9]:
task = fsd.SubgroupDiscoveryTask(X, y_true, y_pred, qf='demographic_parity_ratio')

Since version 0.6.0, Fairlearn also offers the possibility to "create a scalar returning metric function based on aggregation of a disaggregated metric".<br/>
**Example:**

In [10]:
from sklearn.metrics import accuracy_score
from fairlearn.metrics import make_derived_metric
derived_metric = make_derived_metric(metric = accuracy_score, transform = 'difference')
task = fsd.SubgroupDiscoveryTask(X, y_true, y_pred, qf=derived_metric)

For more details see the Fairlearn documentation.

### Customized Quality Measures
It is possible to create a quality measure by extending the class [QualityFunction](https://github.com/MaurizioPulizzi/fairsd/blob/main/fairsd/qualitymeasures.py#L3):

In [11]:
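class MyQualityMeasure(fsd.QualityFunction):
    def evaluate(self, y_true = None, y_pred=None, sensitive_features=None):
        # dummy measure: every subgroup gets the same quality
        return 0.5
    
# pass the bound method of an instance, so that self is supplied automatically
task = fsd.SubgroupDiscoveryTask(X, y_true, y_pred, qf=MyQualityMeasure().evaluate)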

## Descriptions and Quality Measures
A subgroup description is formed by the conjunction of zero or more descriptors.<br/>
A descriptor is a statement of the form "attribute_name = attribute_value" for nominal attributes or "attribute_name = range" for numerical attributes.<br/>
Example of a description: " sex = 'Male' AND age = (18, 30] ". <br/>
The top-k subgroup discovery task in this package returns the descriptions of the k subgroups that exhibit the greatest disparity.<br/>
There is no single definition of subgroup disparity: its meaning depends on the quality measure used.
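
As an illustration of how a description selects rows, the following sketch represents each descriptor as a predicate and a description as a list of predicates that must all hold. The helpers `nominal`, `numeric` and `covers` and the toy rows are hypothetical and do not mirror the classes used inside fairsd.

```python
def nominal(attribute, value):
    """Descriptor of the form attribute = value."""
    return lambda row: row[attribute] == value


def numeric(attribute, low, high):
    """Descriptor of the form attribute = (low, high], a right-closed interval."""
    return lambda row: low < row[attribute] <= high


def covers(description, row):
    """A description is a conjunction: every descriptor must hold.
    The empty description covers every row."""
    return all(descriptor(row) for descriptor in description)


# Hypothetical toy rows.
rows = [
    {"sex": "Male", "age": 25},
    {"sex": "Male", "age": 42},
    {"sex": "Female", "age": 29},
    {"sex": "Male", "age": 18},
]

# sex = 'Male' AND age = (18, 30]
description = [nominal("sex", "Male"), numeric("age", 18, 30)]
subgroup_mask = [covers(description, row) for row in rows]
print(subgroup_mask)  # [True, False, False, False]
```

The last row is excluded because the interval is open on the left, so an age of exactly 18 falls outside (18, 30]. A boolean mask like this one is what a quality measure needs in order to compare the subgroup with its negation.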

**All metrics in the [fairlearn.metrics](https://fairlearn.github.io/v0.6.0/api_reference/fairlearn.metrics.html) module are symmetrical:** they always return a value between 0 and 1 and do not distinguish whether a subgroup is "positively" or "negatively" dissimilar. For example, the descriptions "married = True" and "married = False" will always have the same quality.


## Implemented Subgroup Discovery Algorithms
This package offers two Top-K Subgroup Discovery Algorithms: **BeamSearch** and **DSSD**.

### BeamSearch Algorithm
BeamSearch is a heuristic algorithm that offers a good trade-off between the completeness of exhaustive search algorithms and the speed of greedy algorithms. <br/> This trade-off can be adjusted via the beam width: as the beam width tends to infinity the algorithm becomes an exhaustive search, while with a beam width of 1 it becomes a greedy search.
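
The role of the beam width can be seen in a generic, self-contained sketch of level-wise beam search over descriptions. Everything here (the `beam_search` and `quality` functions, the toy rows, the rule of one descriptor per attribute) is an illustrative assumption rather than the fairsd implementation, and constraints such as min_support and min_quality are left out.

```python
import heapq

# Hypothetical toy data: "pred" is the classifier output.
rows = [
    {"sex": "M", "edu": "BSc", "pred": 1},
    {"sex": "M", "edu": "BSc", "pred": 1},
    {"sex": "M", "edu": "BSc", "pred": 1},
    {"sex": "M", "edu": "HS", "pred": 0},
    {"sex": "F", "edu": "BSc", "pred": 1},
    {"sex": "F", "edu": "BSc", "pred": 0},
    {"sex": "F", "edu": "HS", "pred": 0},
    {"sex": "F", "edu": "HS", "pred": 0},
]
descriptors = [("sex", "M"), ("sex", "F"), ("edu", "BSc"), ("edu", "HS")]


def quality(description):
    """Toy quality: absolute selection-rate gap between subgroup and negation."""
    inside = [r["pred"] for r in rows if all(r[a] == v for a, v in description)]
    outside = [r["pred"] for r in rows if not all(r[a] == v for a, v in description)]
    if not inside or not outside:
        return 0.0
    return abs(sum(inside) / len(inside) - sum(outside) / len(outside))


def beam_search(descriptors, quality, beam_width, depth, k):
    """Expand descriptions level by level, keeping the beam_width best per level."""
    beam = [()]  # start from the empty description
    seen, results = set(), []
    for _ in range(depth):
        candidates = []
        for desc in beam:
            used = {attr for attr, _ in desc}
            for d in descriptors:
                if d[0] in used:  # at most one descriptor per attribute
                    continue
                new = tuple(sorted(desc + (d,)))
                if new not in seen:
                    seen.add(new)
                    candidates.append((quality(new), new))
        if not candidates:
            break
        best = heapq.nlargest(beam_width, candidates)
        results.extend(best)
        beam = [desc for _, desc in best]
    return heapq.nlargest(k, results)


wide = beam_search(descriptors, quality, beam_width=100, depth=2, k=3)
greedy = beam_search(descriptors, quality, beam_width=1, depth=2, k=3)
print(wide)
print(greedy)
```

With a beam wide enough to hold every candidate, each level is explored completely; with `beam_width=1` only the single best description of each level is refined further.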

In [12]:
# Beam search algorithm usage
from fairsd import SubgroupDiscoveryTask
from fairsd import BeamSearch
task = SubgroupDiscoveryTask(X, y_true, y_pred)
resultset = BeamSearch().execute(task)

The beam width is 20 by default, but it is possible to specify a different value:

In [13]:
resultset = BeamSearch(beam_width=10).execute(task)
print(resultset)

education = "Bachelors" AND marital-status = "Married-civ-spouse" 
education = "Bachelors" AND marital-status = "Married-civ-spouse" AND capital-loss = (-infinite, 0.0] 
education = "Bachelors" AND marital-status = "Married-civ-spouse" AND capital-gain = (-infinite, 0.0] 
education = "Bachelors" AND marital-status = "Married-civ-spouse" AND race = "White" 
education = "Bachelors" AND marital-status = "Married-civ-spouse" AND sex = "Male" 



#### Redundancy
As we can see, the returned result set is somewhat redundant: it contains many variants of essentially the same subgroup. <br/>
Many top-k subgroup discovery algorithms suffer from this problem.<br/>
To mitigate it, we suggest using the DSSD algorithm.

### DSSD (Diverse Subgroup Set Discovery) algorithm
This algorithm is a variant of the Beam Search algorithm that also takes into account the redundancy of the generated subgroups.<br/>
In this package a cover-based redundancy definition is used: roughly, the more tuples two subgroups have in common, the more redundant they are considered. <br/>
The degree to which redundancy is mitigated is determined by the alpha parameter, which must be between 0 and 1: the higher alpha is, the less subgroup redundancy is taken into account.<br/>

This algorithm is described in detail in Van Leeuwen and Knobbe's paper "Diverse Subgroup Set Discovery".
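
The effect of alpha can be sketched with a simplified cover-based selection step. The weighting used below (each row of a candidate counts alpha raised to the number of already selected subgroups that cover it) follows one reading of the cover-based scheme in the cited paper; the function names, the toy candidates, and the assumption that fairsd applies exactly this rule are all illustrative.

```python
def cover_weight(cover, selected_covers, alpha):
    """Average over the rows in `cover` of alpha ** (times the row is already covered)."""
    if not cover:
        return 0.0
    total = 0.0
    for row in cover:
        times_covered = sum(row in c for c in selected_covers)
        total += alpha ** times_covered
    return total / len(cover)


def select_diverse(candidates, k, alpha):
    """Greedily pick k candidates (quality, name, cover) by quality * cover_weight."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        covers = [cover for _, _, cover in selected]
        best = max(remaining, key=lambda c: c[0] * cover_weight(c[2], covers, alpha))
        selected.append(best)
        remaining.remove(best)
    return [name for _, name, _ in selected]


# Hypothetical candidates: (quality, description, set of covered row ids).
candidates = [
    (0.80, "A", {1, 2, 3, 4}),
    (0.79, "A AND x", {1, 2, 3}),  # near-duplicate of A
    (0.60, "B", {7, 8, 9}),        # disjoint from A
]

no_penalty = select_diverse(candidates, 2, alpha=1.0)
with_penalty = select_diverse(candidates, 2, alpha=0.5)
print(no_penalty)    # ['A', 'A AND x']
print(with_penalty)  # ['A', 'B']
```

With alpha equal to 1 the weight is always 1 and the selection reduces to a plain top-k by quality; with a lower alpha, the near-duplicate is penalized and the disjoint subgroup is preferred.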


In [14]:
# DSSD algorithm usage
from fairsd import DSSD
resultset = DSSD(beam_width=10, a = 0.9).execute(task) # "a" parameter represents alpha, 0.9 by default 
print(resultset)

education = "Bachelors" AND marital-status = "Married-civ-spouse" 
education = "Bachelors" AND relationship = "Husband" 
education-num = (13.0, +infinite] AND marital-status = "Married-civ-spouse" 
education-num = (13.0, +infinite] AND race = "Asian-Pac-Islander" 
relationship = "Husband" AND capital-gain = (7298.0, +infinite] 



As we can see, just by looking at the descriptions in the result set, the redundancy now appears to be greatly reduced.