Fairness in Machine Learning
This page is a collaborative effort by several authors from the University of Valladolid (P. Gordaliza, H. Inouzhe, E. del Barrio) and the University of Toulouse (P. Besse, N. Couellan, J.-M. Loubes, L. Risser).
The whole machinery of machine learning relies on the fact that a decision rule can be learnt from a set of labeled examples called the learning sample. This decision rule is then applied to the whole population, which is assumed to follow the same underlying distribution. In many cases, learning samples present biases, either due to a real but unwanted bias in the observations (societal bias, for instance) or due to the way the data are processed (multiple sensors, parallelized inference, evolution of the distribution, unbalanced samples...). Hence the goal of this research is twofold: first, to detect, analyze and remove such biases, which is called fair learning; second, to understand how the biases are created and to provide more robust, certifiable and explainable methods to tackle distributional effects in machine learning, including transfer learning, consensus learning, theoretical bounds and robustness.
In recent years, I have developed new tools for applications of optimal transport in machine learning and statistics, including new tests for classification using the Wasserstein distance and statistical properties of Fréchet means of distributions seen as Wasserstein barycenters. This work has important developments in machine learning for fairness issues and for robustness with respect to stress, changes and evolution of the distributions underlying machine learning algorithms.
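To make the Wasserstein distance concrete, here is a minimal sketch (not the authors' implementation) of the empirical 1-Wasserstein distance between two equal-size one-dimensional samples, which could be used to compare score distributions across groups; in 1-D the optimal transport plan is monotone, so the distance reduces to a mean absolute difference of sorted values:

```python
import numpy as np

def wasserstein_1d(x, y):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    In 1-D the optimal transport plan is monotone, so W1 reduces to the
    mean absolute difference between the sorted samples.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y)))

# Hypothetical use: compare the score distributions of two groups
rng = np.random.default_rng(0)
scores_a = rng.normal(0.0, 1.0, 5000)
scores_b = rng.normal(0.5, 1.0, 5000)
print(wasserstein_1d(scores_a, scores_b))  # close to the mean shift of 0.5
```

For unequal sample sizes or higher dimensions one would work with quantile functions or a dedicated optimal-transport solver instead.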
The main direction of this research project deals with the theoretical properties of statistical inference under fairness constraints, modelling how the output of an algorithm can depend on an unwanted variable whose influence is nevertheless present in the learning sample. We propose to rewrite the framework of fair learning using tools from mathematical statistics and optimal transport theory to obtain new methods and bounds (connected to differential privacy). We will consider extensions to other statistical methods (tests, PCA, PLS, matrix factorizations, Bayesian networks), to unsupervised learning and to machine learning methods (regression or ranking models, online algorithms, deep networks, GANs...). Our aim is to provide new feasible algorithms that promote fairness by adding constraints. Finally, replacing the notion of independence with a notion of causality can provide new ways of understanding algorithms and of adding prior knowledge, such as acceptability, logical or physical constraints, to AI.
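As a toy illustration of promoting fairness by adding a constraint (a sketch under simplifying assumptions, not one of the algorithms developed in this project), the example below trains a logistic regression by gradient descent on the log-loss plus a squared demographic-parity penalty, i.e. the squared difference of mean predicted scores between the two groups:

```python
import numpy as np

def fair_logreg(X, y, s, lam=0.0, lr=0.5, epochs=2000):
    """Logistic regression minimizing log-loss + lam * (parity gap)^2,
    where the parity gap is the difference of mean predicted scores
    between the groups s == 1 and s == 0."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) / n                      # log-loss gradient
        gap = p[s == 1].mean() - p[s == 0].mean()     # demographic-parity gap
        dp = p * (1.0 - p)                            # sigmoid derivative
        dgap = (X[s == 1] * dp[s == 1, None]).mean(axis=0) \
             - (X[s == 0] * dp[s == 0, None]).mean(axis=0)
        w -= lr * (grad + 2.0 * lam * gap * dgap)     # chain rule on gap^2
    return w

def parity_gap(X, s, w):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return p[s == 1].mean() - p[s == 0].mean()

# Synthetic data where the informative feature is correlated with s
rng = np.random.default_rng(1)
s = rng.integers(0, 2, 1000)
x1 = rng.normal(s.astype(float), 1.0)            # feature shifted by group
X = np.column_stack([x1, np.ones(1000)])         # feature + intercept
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-(2.0 * x1 - 1.0)))).astype(float)

w_plain = fair_logreg(X, y, s, lam=0.0)
w_fair = fair_logreg(X, y, s, lam=10.0)
print(abs(parity_gap(X, s, w_plain)), abs(parity_gap(X, s, w_fair)))
```

Increasing `lam` shrinks the parity gap at the cost of predictive accuracy, the usual fairness/accuracy trade-off.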
Use Cases:
- Adult census. Fair analysis of a dataset. Exploration and detection of individual discrimination by testing, or of group discrimination by estimating a disparate impact (DI), with confidence-interval estimation of the DI (Besse et al. 2018-a). Bias of the learning sample and bias of the predictions.
- Propublica. Analysis of the recidivism score (COMPAS) marketed by the company equivant. Three discrimination criteria are considered between detainees of Caucasian vs. non-Caucasian ethnic origin: disparate impact, prediction errors, and asymmetry of the confusion matrix. They are successively analyzed on the raw data, on the COMPAS estimate of recidivism risk, and on an elementary logistic regression estimate.
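For the Adult census use case, the disparate impact and its confidence interval can be sketched as follows (a delta-method interval on log(DI), in the spirit of Besse et al. 2018-a but not their exact code):

```python
import numpy as np

def disparate_impact_ci(yhat, s, z=1.96):
    """Disparate impact DI = P(yhat=1 | s=0) / P(yhat=1 | s=1), with an
    approximate 95% confidence interval obtained by the delta method on
    log(DI). Assumes both groups have some positive decisions."""
    yhat = np.asarray(yhat, dtype=float)
    s = np.asarray(s)
    p0, n0 = yhat[s == 0].mean(), int((s == 0).sum())
    p1, n1 = yhat[s == 1].mean(), int((s == 1).sum())
    di = p0 / p1
    se_log = np.sqrt((1 - p0) / (n0 * p0) + (1 - p1) / (n1 * p1))
    return di, (di * np.exp(-z * se_log), di * np.exp(z * se_log))

# Hypothetical decisions: group 0 gets 20% positives, group 1 gets 40%
yhat = np.array([1] * 20 + [0] * 80 + [1] * 40 + [0] * 60)
s = np.array([0] * 100 + [1] * 100)
di, (lo, hi) = disparate_impact_ci(yhat, s)
print(di, lo, hi)  # DI = 0.5; the "four-fifths rule" flags DI below 0.8
```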
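For the Propublica use case, comparing prediction errors and the asymmetry of the confusion matrix across groups amounts to computing per-group false-positive and false-negative rates; a minimal sketch (with hypothetical data, not the COMPAS dataset itself):

```python
import numpy as np

def group_error_rates(y, yhat, s):
    """False-positive and false-negative rates per sensitive group, to
    compare the asymmetry of the confusion matrix across groups.
    Group labels in s are assumed to be integer codes."""
    rates = {}
    for g in np.unique(s):
        m = s == g
        fpr = ((yhat == 1) & (y == 0) & m).sum() / ((y == 0) & m).sum()
        fnr = ((yhat == 0) & (y == 1) & m).sum() / ((y == 1) & m).sum()
        rates[int(g)] = {"FPR": float(fpr), "FNR": float(fnr)}
    return rates

# Hypothetical outcomes where group 1 receives more false positives
y    = np.array([0, 0, 1, 1, 0, 0, 1, 1])   # true labels
yhat = np.array([0, 0, 1, 0, 1, 1, 1, 1])   # predicted labels
s    = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # sensitive group
print(group_error_rates(y, yhat, s))
```

Unequal FPR/FNR across groups is precisely the asymmetry criterion mentioned above: one group bears more false alarms, the other more missed detections.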