# Sklearn compatible Grid Search for regression

Grid search is an in-processing technique that can be used for fair classification or fair regression. For classification, it reduces fair classification to a sequence of cost-sensitive classification problems and returns the deterministic classifier with the lowest empirical error, subject to fair classification constraints, among the candidates searched. For regression, it uses the same principle to return a deterministic regressor with the lowest empirical error subject to the constraint of bounded group loss. The code for grid search wraps the source class `fairlearn.reductions.GridSearch` available in the https://github.com/fairlearn/fairlearn library, licensed under the MIT License, Copyright Microsoft Corporation.

In [1]:
import numpy as np
import pandas as pd

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler

from aif360.sklearn.datasets import fetch_lawschool_gpa
from aif360.sklearn.inprocessing import GridSearchReduction
from aif360.sklearn.metrics import difference

### Loading data

Datasets are formatted as separate `X` (# samples x # features) and `y` (# samples x # labels) DataFrames. The index of each DataFrame contains the protected attribute values for each sample. Datasets may also load a `sample_weight` object to be used with certain algorithms/metrics. This format keeps aif360 compatible with scikit-learn objects.

For example, we can easily load the law school GPA dataset from tempeh with the following lines:

In [2]:
X_train, y_train = fetch_lawschool_gpa("train", numeric_only=True)
X_test, y_test = fetch_lawschool_gpa("test", numeric_only=True)
X_train.head()

| race (index) | lsat | ugpa | race |
|---|---|---|---|
| 0 | 38.0 | 3.3 | 0 |
| 1 | 34.0 | 4.0 | 1 |
| 1 | 34.0 | 3.9 | 1 |
| 1 | 45.0 | 3.3 | 1 |
| 1 | 39.0 | 2.5 | 1 |
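
To make the index convention concrete, here is a minimal sketch on a hypothetical toy DataFrame. The values and the construction are illustrative assumptions, not how `fetch_lawschool_gpa` builds its output. The protected attribute is kept both as a column and as the index, so the group labels can still be read from the index after the column is dropped.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "lsat": [38.0, 34.0, 45.0],
    "ugpa": [3.3, 4.0, 3.3],
    "race": [0, 1, 1],
})

# Keep "race" as a column and also use it as the index.
toy = toy.set_index("race", drop=False)

# Dropping the column leaves the group labels available in the index.
features = toy.drop(columns=["race"])
groups = features.index.get_level_values("race")

# Per-group mean of a feature, grouped by the index level.
mean_lsat = features.groupby(level="race")["lsat"].mean()
print(mean_lsat)
```

Carrying the group labels in the index is what allows group-wise metrics to be computed from `y_test` alone, without passing the protected attribute separately.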


We normalize the continuous values, making sure to preserve the column names and index that carry the protected attributes, since grid search reduction needs this information.

In [3]:
scaler = MinMaxScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

X_train.head()

| race (index) | lsat | ugpa | race |
|---|---|---|---|
| 0 | 0.72973 | 0.825 | 0.0 |
| 1 | 0.621622 | 1.0 | 1.0 |
| 1 | 0.621622 | 0.975 | 1.0 |
| 1 | 0.918919 | 0.825 | 1.0 |
| 1 | 0.756757 | 0.625 | 1.0 |
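
With its default feature range of [0, 1], `MinMaxScaler` maps each value to `(x - min) / (max - min)` using the training minimum and maximum of that column. The sketch below applies this formula by hand to the `lsat` values shown earlier. The range of 11 to 48 is an assumption chosen only because it is consistent with the rows displayed; it is not taken from the dataset documentation.

```python
def min_max_scale(x, x_min, x_max):
    """Map x from [x_min, x_max] onto [0, 1]."""
    return (x - x_min) / (x_max - x_min)

# Assumed training range for lsat (inferred for illustration, not documented).
LSAT_MIN, LSAT_MAX = 11.0, 48.0

# Raw lsat values from the first rows of the training data.
raw_lsat = [38.0, 34.0, 45.0, 39.0]

# Round to six decimals to mirror the precision of the displayed output.
scaled_lsat = [round(min_max_scale(v, LSAT_MIN, LSAT_MAX), 6) for v in raw_lsat]
print(scaled_lsat)
```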


### Running metrics

With the data in this format, we can easily train a scikit-learn model and get predictions for the test data. We drop the protected attribute columns so that they are not used in the model.

In [4]:
tt = TransformedTargetRegressor(LinearRegression(), transformer=scaler)
tt = tt.fit(X_train.drop(["race"], axis=1), y_train)
y_pred = tt.predict(X_test.drop(["race"], axis=1))
lr_mae = mean_absolute_error(y_test, y_pred)
lr_mae

0.7400826321650612

We can just as easily assess how the mean absolute error differs across groups:

In [5]:
lr_mae_diff = difference(mean_absolute_error, y_test, y_pred)
lr_mae_diff

0.20392590525744636
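
As a rough illustration of what this number represents, the sketch below computes a group difference by hand on toy arrays. It assumes the difference is the metric on the unprivileged group minus the metric on the privileged group, with group `1` treated as privileged. The helper `group_difference` is a hypothetical name, not part of aif360.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error of two equal-length arrays."""
    return float(np.mean(np.abs(y_true - y_pred)))

def group_difference(metric, y_true, y_pred, groups, priv_group=1):
    """Metric on the unprivileged samples minus metric on the privileged ones."""
    priv = groups == priv_group
    return metric(y_true[~priv], y_pred[~priv]) - metric(y_true[priv], y_pred[priv])

# Toy data, made up for illustration.
y_true = np.array([3.0, 2.0, 4.0, 1.0])
y_pred = np.array([2.5, 2.0, 3.0, 2.0])
groups = np.array([0, 0, 1, 1])

gap = group_difference(mae, y_true, y_pred, groups)
print(gap)
```

A value near zero means both groups see a similar error; the sign indicates which group is served worse.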

### Grid Search

Reuse the base model for the candidate regressors. Base models should implement a `fit` method that accepts a `sample_weight` argument. For details, refer to the docs.

In [6]:
estimator = TransformedTargetRegressor(LinearRegression(), transformer=scaler)
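
To illustrate what accepting a sample weight in `fit` means, here is a minimal sketch of a hypothetical toy estimator, `WeightedMeanRegressor`, that predicts the weighted mean of the targets. It is only an illustration of the `fit(X, y, sample_weight=...)` signature; it does not implement the full scikit-learn estimator API and is not meant to be passed to `GridSearchReduction`.

```python
import numpy as np

class WeightedMeanRegressor:
    """Hypothetical toy estimator: predicts the weighted mean of y."""

    def fit(self, X, y, sample_weight=None):
        y = np.asarray(y, dtype=float)
        # With sample_weight=None, np.average falls back to the plain mean.
        self.mean_ = float(np.average(y, weights=sample_weight))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

X = np.zeros((2, 1))
y = np.array([1.0, 3.0])

unweighted = WeightedMeanRegressor().fit(X, y).predict(X)
weighted = WeightedMeanRegressor().fit(X, y, sample_weight=[1.0, 3.0]).predict(X)
print(unweighted, weighted)
```

Weighting the second sample more heavily pulls the fitted value toward its target, which is how a reduction approach can steer a base model by reweighting samples.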

Search for the best regressor and observe the mean absolute error. Grid search for regression uses `"GroupLoss"` to specify bounded group loss as its constraint. Accordingly, we need to specify a loss function, such as `"Absolute"`. Other options include `"Square"` and `"ZeroOne"`. When the loss is `"Absolute"` or `"Square"`, we also specify the expected range of the `y` values in `min_val` and `max_val`. For details on the implementation of these loss functions, see the fairlearn library at https://github.com/fairlearn/fairlearn/blob/master/fairlearn/reductions/_moments/bounded_group_loss.py.
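
The idea of a bounded group loss constraint can be sketched as follows: compute the average loss within each group and require every group's average to stay below a bound. This is a simplified illustration with hypothetical helper names, not fairlearn's implementation. Clipping targets and predictions to `[min_val, max_val]` is an assumption about how the expected range might be used.

```python
import numpy as np

def group_losses(y_true, y_pred, groups, min_val, max_val):
    """Average absolute loss per group, after clipping to the expected range."""
    y_true = np.clip(y_true, min_val, max_val)
    y_pred = np.clip(y_pred, min_val, max_val)
    loss = np.abs(y_true - y_pred)
    return {int(g): float(loss[groups == g].mean()) for g in np.unique(groups)}

def satisfies_bound(losses, upper_bound):
    """True if every group's average loss is at most upper_bound."""
    return all(v <= upper_bound for v in losses.values())

# Toy data, made up for illustration.
y_true = np.array([3.0, 2.0, 4.0, 1.0])
y_pred = np.array([2.5, 2.0, 3.0, 5.0])
groups = np.array([0, 0, 1, 1])

losses = group_losses(y_true, y_pred, groups, min_val=1.0, max_val=4.0)
print(losses, satisfies_bound(losses, upper_bound=1.0))
```

Unlike an overall error target, this kind of constraint is violated as soon as any single group exceeds the bound, even if the average over all samples is low.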

In [7]:
np.random.seed(0)  # needed for reproducibility
grid_search_red = GridSearchReduction(prot_attr="race",
                                      estimator=estimator,
                                      constraints="GroupLoss",
                                      loss="Absolute",
                                      min_val=y_train.min(),
                                      max_val=y_train.max(),
                                      grid_size=10,
                                      drop_prot_attr=True)
grid_search_red.fit(X_train, y_train)
gs_pred = grid_search_red.predict(X_test)
gs_mae = mean_absolute_error(y_test, gs_pred)
print(gs_mae)

# Check that the mean absolute error is comparable
assert abs(gs_mae-lr_mae) < 0.08

0.7622719376746614


In [8]:
gs_mae_diff = difference(mean_absolute_error, y_test, gs_pred)
print(gs_mae_diff)

# Check that the difference across groups decreased
assert gs_mae_diff < lr_mae_diff

0.06122151904963535
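
To put the two results side by side, the short calculation below expresses the trade-off in relative terms. The four numbers are copied from the outputs above; the variable names are chosen for this sketch only.

```python
# Values copied from the cell outputs above.
baseline_mae, reduced_mae = 0.7400826321650612, 0.7622719376746614
baseline_gap, reduced_gap = 0.20392590525744636, 0.06122151904963535

# Relative increase in overall error after grid search reduction.
rel_mae_increase = (reduced_mae - baseline_mae) / baseline_mae

# Relative reduction in the between-group difference.
rel_gap_reduction = 1 - reduced_gap / baseline_gap

print(f"MAE increase: {rel_mae_increase:.1%}")
print(f"Group gap reduction: {rel_gap_reduction:.1%}")
```

In this run, the overall error rises by about 3% while the between-group difference in error falls by about 70%.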
