This repository contains the work done on the implementation and testing of three different fair machine learning algorithms based on decision trees for binary classification with one binary sensitive attribute, and multiobjective evolutionary procedures using this algorithms to achieve the best balance between accuracy and fairness. You can read the project report in the file MasterThesis.pdf.
Fairness in machine learning is a field that has gained great prominence and relevance in recent years. This area deals with the study and quantification of different measures of fairness on various specific problems and decision processes, as well as the development and implementation of solutions to create fairer systems. Unfortunately, at the limits of joint optimization between accuracy and fairness, a trade-off is reached, so that requiring a fairer system will necessarily imply less accuracy and vice versa. Due to this fact, multiobjective optimization emerges as a method that allows finding a wide range of solutions exploring this optimization frontier. Genetic algorithms represent the state of the art in these optimization processes.
This project will be developed within this area. After conducting a review of the context from a mathematical perspective, analyzing various interpretations of mathematical fairness, and exploring the theory behind multi-objective optimization, the creation of 3 novel algorithms based on decision trees for fair binary classification with a binary sensitive attribute will be proposed. The first algorithm, named Fair Decision Tree (FDT), modifies the information gain criterion during decision tree training to incorporate fairness. The second, named Fair Genetic Pruning (FGP), proposes a genetic optimization procedure on prunings in a matrix tree, which will be the tree that perfectly classifies the training sample. The last one, named Fair LightGBM (FLGBM), modifies the loss function of the LightGBM algorithm to incorporate fairness. For the two algorithms that do not inherently use genetic optimization processes, a genetic hyperparameter optimization procedure based on the NSGA-II algorithm will be employed. This will enable finding models on the joint optimization frontier, specifically for accuracy and fairness objectives. A study will be conducted to test these algorithms alongside a baseline algorithm (classic decision tree) on 10 relevant datasets within the field. Each dataset will be split into 10 training, validation, and test partitions using different random seeds. The average results obtained demonstrate how these algorithms are capable of finding a wide range of Pareto-optimal solutions. Overall, FDT and FLGBM algorithms outperform the baseline algorithm across several quality indicators calculated on the average Pareto fronts. The FGP algorithm has also shown promising results but has not surpassed the baseline algorithm on these indicators. However, it is the algorithm that achieved the best processing time results. All the algorithms have found models that perform better on both objectives than classic models like the COMPAS model developed by Northpointe.
-
FairDT (FDT): Modification of the impurity criterion calculation during decision tree training to also consider fairness. Its general expression is:
$$(1-\lambda) \cdot k \cdot \text{gini/entropy} - \lambda \cdot \text{fairness criterion}$$
Where$\lambda$ is the hyperparameter that controls the importance given to the fairness criterion used. k is a normalization factor, with$k=2$ if the Gini impurity criterion is used, and$k=1$ is the entropy impurity criterion is used, since the fairness criterion will take values in the range$[0,1]$ . Different strategies are proposed for calculating the value of the fairness criterion at any internal node, as a classification is needed to compute it. Finally, a method named "Fair probabilistic prediction" is used. -
Fair Genetic Pruning (FGP): Genetic pruning of a matrix decision tree (the largest decision tree that can be built to perfectly classify the training sample, considering different balances for both classes). This algorithm is an evolutionary algorithm, where each individual in the population is represented by the codes of the nodes where the prunings occur. These codes are lexicographically ordered for efficiency, avoiding pruning already pruned branches of the tree. Crossover assigns prunings from parents to children one by one, ensuring that each pruning assigned is performed in a non-pruned node. Mutations select a leaf of the tree and traverse up or down the matrix tree structure by a certain number of random levels, which depend on the matrix tree depth, then apply or substitute the pruning. This process is repeated a random number of times, depending on the number of leaves of the individual.
-
FairLGBM (FLGBM): Modification of the loss function in the LightGBM algorithm to incorporate fairness. In fact, it has the following real structure:
Where$$L_f(Y, \Sigma,P) = k(1-\lambda) L(Y,\Sigma) +\lambda\left| \frac{π_{-,0,1}^T \Sigma}{π_{-,0,1}^T 1} -\frac{π_{-,0,0}^T \Sigma}{π_{-,0,0}^T 1} \right|$$ $L$ is the logloss function,$\Sigma$ are the prediction scores (after applying the logistic function) and$\left| \frac{π_{-,0,1}^T \Sigma}{π_{-,0,1}^T 1} -\frac{π_{-,0,0}^T \Sigma}{π_{-,0,0}^T 1} \right|$ is the continuous extension of$\text{FPR}_{\text{diff}}$ fairness criterion. This continuous extension is highly advantageous for the calculation of the derivative and hessian of this function, which is necessary for its implementation in a LightGBM algorithm.$k=\frac{-1}{\ln{0.5}}$ is a "normalization factor", considered using as the worst prediction the one made by a dummy classifier ($\Sigma=(0.5,0.5,\dots, 0.5$ )). This is not the worst prediction, but it serves as a good normalization point.
The experimentation involved testing each algorithm with 10 datasets for binary classification that are well-known in the field of fairness in machine learning (adult, compas, diabetes, dutch, german, insurance, obesity, parkinson, ricci, and student), containing 1 binary protected attribute. To obtain robust and reproducible results, each experiment was run a total of 10 times with different values of the random seed (from 100 to 109), which controls the partitioning of data into training, validation, and test sets, as well as other pseudo-random processes. Average results were calculated from the outcomes obtained in each run for each algorithm. The experimentation was conducted with the three developed algorithms, as well as with a decision tree (DT).
The hyperparameters that define the decision space of each algorithm are as follows:
- DT:
- criterion: gini / entropy.
- max_depth: maximum depth of the tree.
- min_samples_split: minimum number of samples required to split an internal node.
- max_leaf_nodes: maximum number of leaf nodes the final tree can have.
- class_weight: weight assigned to each class to be predicted.
- FDT:
- same parameters, plus:
- fair_param: parameter that controls the balance between the impurity criterion and the fairness criterion during tree learning.
- FGP
- The method itself is a genetic algorithm that returns a large number of solutions. Instead of hyperparameter optimization of a base classifier, this method is applied directly.
- FLGBM
- num_leaves: number of leaves in the tree.
- min_data_in_leaf: minimum amount of data required in a leaf node to allow splitting.
- max_depth: maximum depth of the tree.
- learning_rate: learning rate of the algorithm.
- n_estimators: number of weak classifiers to be built.
- feature_fraction: fraction of features used to build the model.
- class_weight: weight assigned to each class to be predicted (internally uses scale_pos_weight).
- fair_param: controls the balance between the standard loss function (logloss) of the algorithm and the fairness function considered.
The objectives to minimize during the experimentation are as follows:
-
Inverted G-mean (gmean_inv): The geometric mean criterion is defined as the square root of the product of the true positive rate and the true negative rate
$(\sqrt{\text{TPR} \cdot \text{TNR}})$ . Since it is a minimization objective,$1-\sqrt{\text{TPR} \cdot \text{TNR}}$ will be used. -
Difference in False Positive Rate (fpr_diff): This is the difference between the probabilities
$\text{FPR}_{\text{diff}} = |P[p=1|Y=0,A=0]-P[p=1|Y=0,A=1]|$ , where$p$ is the result of the classifier,$Y$ is the attribute to be predicted, and$A$ is the sensitive attribute.
An overall structure of the experimentation conducted for each dataset can be seen in this Figure. The population size used was
The results show that FDT and FLGBM algorithms generally achieve better average Pareto fronts with respect to almost all quality indicators compared to the base DT algorithm. Although the FGP algorithm did not improve upon these quality indicators relative to the base algorithm, it still produced very good models. Additionally, it achieved the best results in terms of CPU processing time. The models found demonstrate significant improvement over commonly used classic models, such as the COMPAS model from Northpointe.
The rankings of the quality measures on the average Pareto fronts found can also be consulted here:
The average CPU processing times of each generation (in seconds) can also be consulted here:
Overall, these good results show that there is ample room for improvement in the construction of fair classification models for joint optimization of accuracy and fairness.
Listed below are the libraries and dependencies required to run your own experiments. You can try using higher versions of all libraries except Cython:
- python=3.10.12
- matplotlib=3.8.3
- pandas=2.2.1
- scikit-learn=1.4.1.post1
- pydotplus=2.0.2
- imblearn
- cython=0.29.37
- lightgbm=4.3.0 (from the official lightgbm webpage)
- seaborn=0.13.2
- pygmo=2.19.5
- Author: David Villar Martos
- Contributors: David Villar Martos
- Project director: Jorge Casillas Barranquero