In [1]:
from folktables.acs import adult_filter
from folktables import ACSDataSource, BasicProblem, generate_categories
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression



### 1. Load and Preprocess the data
We are going to work with the [Folktables](https://github.com/socialfoundations/folktables#quick-start-examples) dataset (*you have already worked with it*). I have chosen some variables for you, but you can add more (*if you like*); here is the [full list](https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.pdf) of variables (some of them are not available in `ACSDataSource`).

Today we are going to debias a regression model with respect to the `SEX` variable. Your model should predict the *Total person's income* (I've binarized it with `target_transform=lambda x: x > 25000`; you can choose another threshold).

In [2]:
data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["CA"], download=True)

ACSIncomeNew = BasicProblem(
    features=[
        'AGEP', # include AGE
        'COW', # include class of worker
        'SCHL', # include school education
        'WKHP', # include reported working hours
        'SEX', # include sex
        'MAR', # marital status
        # some random, possibly noisy
        'PWGTP', # person weight
        'JWMNP', # travel time to work
        'INTP', # interest, dividends, etc.
    ],
    target='PINCP',
    target_transform=lambda x: x > 25000,    
    group='SEX',
    preprocess=adult_filter,
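    # NOTE: the second positional argument of np.nan_to_num is `copy`, so NaNs become 0.0 here;
    # pass nan=-1 if the intent is to fill them with -1.
    postprocess=lambda x: np.nan_to_num(x, -1),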
)

Here is a small snippet to get the names of the categorical variables. I convert the categoricals into one-hot encodings (*you don't have to, depending on what assumptions you make about the data*). **Don't forget to normalise the continuous features (if you plan to use cross-validation, normalise them within each fold rather than on the whole table).**

In [3]:
definition_df = data_source.get_definitions(download=True)
categories = generate_categories(features=ACSIncomeNew.features, definition_df=definition_df)
# here I convert categoricals into one-hot encoded (you don't have to, depending on what assumptions you use about the data)
features, labels, groups = ACSIncomeNew.df_to_pandas(acs_data, categories=categories, dummies=True)
########### Normalize continuous features
## YOUR CODE (if relevant)
###########
features.head()

Unnamed: 0,AGEP,WKHP,PWGTP,JWMNP,INTP,"COW_Employee of a private for-profit company or business, or of an individual, for wages, salary, or commissions","COW_Employee of a private not-for-profit, tax-exempt, or charitable organization",COW_Federal government employee,"COW_Local government employee (city, county, etc.)","COW_Self-employed in own incorporated business, professional practice or farm",...,SCHL_Professional degree beyond a bachelor's degree,SCHL_Regular high school diploma,"SCHL_Some college, but less than 1 year",SEX_Female,SEX_Male,MAR_Divorced,MAR_Married,MAR_Never married or under 15 years old,MAR_Separated,MAR_Widowed
0,30.0,40.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,21.0,20.0,52.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,65.0,8.0,33.0,25.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,33.0,40.0,53.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
4,18.0,18.0,106.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


Splitting the data into train and test sets. **Again, if you plan to use cross-validation, normalise the features only inside each fold.**

In [4]:
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features.values, labels.values.reshape(-1), groups, test_size=0.2, random_state=0)
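
As a minimal sketch of the normalisation step, the snippet below standardises only the continuous columns, using statistics computed on the training split and reusing them on the test split. The toy arrays, the helper name `standardize_train_test`, and the `continuous_idx` positions are assumptions for illustration; in this notebook they would correspond to the positions of `AGEP`, `WKHP`, `PWGTP`, `JWMNP` and `INTP` in `features`.

```python
import numpy as np

def standardize_train_test(X_train, X_test, continuous_idx):
    """Z-score the selected columns using training-set statistics only."""
    X_train = np.array(X_train, dtype=float)  # np.array copies, inputs stay untouched
    X_test = np.array(X_test, dtype=float)
    mean = X_train[:, continuous_idx].mean(axis=0)
    std = X_train[:, continuous_idx].std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    X_train[:, continuous_idx] = (X_train[:, continuous_idx] - mean) / std
    X_test[:, continuous_idx] = (X_test[:, continuous_idx] - mean) / std
    return X_train, X_test

# Toy data (hypothetical): two continuous columns followed by one one-hot column.
toy_train = np.array([[30.0, 40.0, 1.0], [21.0, 20.0, 0.0], [65.0, 8.0, 1.0]])
toy_test = np.array([[33.0, 40.0, 0.0]])
toy_train_std, toy_test_std = standardize_train_test(toy_train, toy_test, [0, 1])
```

Inside cross-validation the same function would be called once per fold, with the fold's training part as `X_train`, so that no information from the held-out part leaks into the scaling.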

### 2.  Regression model (without Fairness constraints)
Let's first train a simple **Logistic Regression**. 
1. Use L2 penalty to train the model (you should find the optimal value for the regularizer)
2. Calculate the total performance metric
3. Calculate and compare the performance metric for each `SEX` group (use your favourite metric introduced during the course).

In [None]:
##########
### YOUR CODE HERE
##########
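
One possible shape for the three steps above, sketched on synthetic data rather than on the Folktables features. The data generator, the grid `candidate_C`, and the choice of accuracy as the metric are all assumptions; `sklearn`'s `LogisticRegression` uses an L2 penalty by default, and its `C` parameter is the inverse of the regularisation strength.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the notebook's data (hypothetical): 3 features, two group labels.
n = 600
X = rng.normal(size=(n, 3))
group = rng.integers(1, 3, size=n)  # group label 1 or 2
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, group, test_size=0.2, random_state=0)

# Step 1: pick C (inverse L2 strength) by 5-fold cross-validation on the training split.
candidate_C = [0.01, 0.1, 1.0, 10.0]
cv_scores = {
    C: cross_val_score(LogisticRegression(C=C, max_iter=1000),
                       X_tr, y_tr, cv=5, scoring="accuracy").mean()
    for C in candidate_C
}
best_C = max(cv_scores, key=cv_scores.get)

# Step 2: overall test accuracy of the refitted model.
model = LogisticRegression(C=best_C, max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)
overall_acc = (pred == y_te).mean()

# Step 3: the same metric computed separately for each group.
acc_by_group = {int(g): (pred[g_te == g] == y_te[g_te == g]).mean()
                for g in np.unique(g_te)}
```

Accuracy is only one option here; any metric from the course (for example, per-group false positive rates) can be computed with the same boolean masks.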

### 3. Constrained Regression Model 
Now let's try to include the [Fairness Constraint](https://arxiv.org/abs/1706.02409)! You'll have to implement a couple of things from scratch (as it is tricky to add a custom constraint function in `sklearn`). To optimise the cost function, let's use `scipy.optimize.fmin_tnc`; you can supply the gradient through its `fprime` argument. The pieces to implement are:
1. Logistic Regression
2. L2 penalisation
3. **Group** Fairness Constraint

When you have finished the implementation, evaluate the performance for several values of the fairness weight $\lambda$.

#### Detailed breakdown
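The group fairness constraint looks like this:

$$ 
f(\beta,S) = \left( \frac{1}{n_1 n_2} \sum\limits_{\substack{(x_i,y_i)\in S_1 \\ (x_j,y_j)\in S_2}} d(y_i,y_j) (\beta^T  \textbf{x}_i - \beta^T \textbf{x}_j)  \right)^2 
$$

where $S_1$ and $S_2$ are the two groups (here defined by `SEX`) and $n_1$, $n_2$ are their sizes.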


For the constrained optimization we have to solve a problem of the form:

$$ \min_\beta \left( \ell (\beta,S) + \lambda f(\beta,S)  +\gamma \Vert \beta \Vert_2^2 \right) $$ 

where $\ell$ is some loss function, $f$ is the constraint function, and $\gamma \Vert \beta \Vert_2^2$ is the L2 regularization term (we use it to avoid overfitting).
(Basically we are minimizing the Lagrangian $\mathscr{L} = \ell (\beta,S) + \lambda f(\beta,S)  +\gamma \Vert \beta \Vert_2^2$ with respect to $\beta$; in the ML literature $\mathscr{L}$ is often denoted $J$.)

Because we are doing classification, we are going to use logistic regression. The log loss function is:
$$
\ell = \frac{1}{m}\sum_{i=1}^m\left[ -y_i \log(g(x_i)) - (1-y_i)\log(1-g(x_i)) \right], \text{where } g(x_i) = \frac{1}{1+\exp(-\beta^T x_i)}
$$
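
A minimal NumPy sketch of the log loss above, assuming `X` is an $m \times p$ array and `beta` a length-$p$ vector. The clipping constants and the helper name `log_loss` are assumptions added for numerical safety and illustration, not part of the formula.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)); z is clipped to avoid overflow in exp."""
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(beta, X, y, eps=1e-12):
    """Mean binary cross-entropy of the model g(x) = sigmoid(beta^T x)."""
    p = np.clip(sigmoid(X @ beta), eps, 1 - eps)  # keep log() finite
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

# Toy check (hypothetical data): with beta = 0 every prediction is 0.5,
# so the loss equals log(2) regardless of the labels.
X_toy = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y_toy = np.array([1.0, 0.0, 1.0])
loss_at_zero = log_loss(np.zeros(2), X_toy, y_toy)
```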

For the distance function we follow the approach from Berk et al. (2017) and set:
$$d(y_i,y_j) = \begin{cases}
            1, &         \text{if } y_i=y_j,\\
            0, &         \text{if } y_i\neq y_j.
    \end{cases}$$
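
With this indicator distance, the group fairness term can be sketched in NumPy as follows. Both function names and the toy data are hypothetical. The first version follows the double sum literally and needs an $n_1 \times n_2$ array, which is likely too large for the full dataset; the second uses the fact that only pairs with equal labels contribute, so the pair sum can be accumulated per label.

```python
import numpy as np

def fairness_penalty_pairwise(beta, X, y, group, g1, g2):
    """Literal double sum over all cross-group pairs (O(n1 * n2) memory)."""
    in1, in2 = group == g1, group == g2
    scores1, scores2 = X[in1] @ beta, X[in2] @ beta
    same_label = (y[in1][:, None] == y[in2][None, :]).astype(float)  # d(y_i, y_j)
    diffs = scores1[:, None] - scores2[None, :]                      # beta^T x_i - beta^T x_j
    return (np.sum(same_label * diffs) / (in1.sum() * in2.sum())) ** 2

def fairness_penalty_grouped(beta, X, y, group, g1, g2):
    """Same quantity without the pairwise array: sum label by label."""
    in1, in2 = group == g1, group == g2
    scores = X @ beta
    total = 0.0
    for label in np.unique(y):
        a = scores[in1 & (y == label)]  # group-1 scores with this label
        b = scores[in2 & (y == label)]  # group-2 scores with this label
        # sum over all pairs (a_i - b_j) = |b| * sum(a) - |a| * sum(b)
        total += len(b) * a.sum() - len(a) * b.sum()
    return (total / (in1.sum() * in2.sum())) ** 2

# Toy data (hypothetical) to compare the two versions.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = rng.integers(0, 2, size=40)
group_toy = rng.integers(1, 3, size=40)
beta_toy = rng.normal(size=3)

penalty_slow = fairness_penalty_pairwise(beta_toy, X_toy, y_toy, group_toy, 1, 2)
penalty_fast = fairness_penalty_grouped(beta_toy, X_toy, y_toy, group_toy, 1, 2)
```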
    
To minimize the total loss function we also need the gradient of $\mathscr{L}$ with respect to $\beta$. Here, to update the $\beta$ values, we are just going to use the gradient without the fairness constraint; this will make our lives considerably easier. The $j$-th element of the gradient is defined as follows:
$$
\frac{\partial \mathscr{L}}{\partial \beta_j} \approx \frac{1}{m}\left( \sum_i  (g(x_i) - y_i) x_{ij} \right)+ 2\gamma \beta_j
$$
where $x_{ij}$ is the $j$-th feature of sample $i$.
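
The approximate gradient can be checked numerically. The sketch below assumes the L2 term is the squared norm $\gamma \Vert \beta \Vert_2^2$ (which is what the $2\gamma\beta_j$ term corresponds to) and leaves the fairness term out, as described above. The function names and toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def cost_no_fairness(beta, X, y, gamma_):
    """Log loss plus gamma * ||beta||_2^2 (fairness term omitted)."""
    p = np.clip(sigmoid(X @ beta), 1e-12, 1 - 1e-12)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)) + gamma_ * beta @ beta

def gradient_no_fairness(beta, X, y, gamma_):
    """Vectorised form of (1/m) * sum_i (g(x_i) - y_i) x_ij + 2 * gamma * beta_j."""
    return X.T @ (sigmoid(X @ beta) - y) / len(y) + 2 * gamma_ * beta

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = (rng.random(50) < 0.5).astype(float)
beta0 = rng.normal(size=3)

# Central finite differences as an independent check of the analytic gradient.
h = 1e-6
numeric = np.array([
    (cost_no_fairness(beta0 + h * e, X_toy, y_toy, 0.1)
     - cost_no_fairness(beta0 - h * e, X_toy, y_toy, 0.1)) / (2 * h)
    for e in np.eye(3)
])
analytic = gradient_no_fairness(beta0, X_toy, y_toy, 0.1)
```

Once $\lambda f(\beta,S)$ is added to the cost, this gradient no longer matches it exactly, which is why the formula above is written with $\approx$.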

In [8]:
import scipy.optimize as opt

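def sigmoid(x):
    """Logistic (sigmoid) function used by logistic regression."""
    raise NotImplementedError


def compute_gradient(beta, X, y, lambda_, gamma_):
    """Gradient of the cost with respect to beta."""
    raise NotImplementedError


def compute_cost(beta, X, y, lambda_, gamma_):
    """Log loss plus the fairness term (weight lambda_) and the L2 term (weight gamma_)."""
    raise NotImplementedError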


########## This is the optimisation function
#result = opt.fmin_tnc(func=compute_cost, x0=beta, fprime=compute_gradient, maxfun=500,
#                      args=(X_train, y_train, lambda_, gamma_), xtol=1e-7)
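
To illustrate only the mechanics of the `fmin_tnc` call, here is a small sketch on synthetic data. The function bodies are assumptions, not the intended solution: `lambda_` is accepted to match the stub signatures, but the fairness term is deliberately left out, and the L2 term is taken as $\gamma \Vert \beta \Vert_2^2$. `fmin_tnc` returns a tuple of the solution, the number of function evaluations, and a return code.

```python
import numpy as np
import scipy.optimize as opt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def compute_cost(beta, X, y, lambda_, gamma_):
    # Sketch only: lambda_ is unused because the fairness term is omitted here.
    p = np.clip(sigmoid(X @ beta), 1e-12, 1 - 1e-12)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)) + gamma_ * beta @ beta

def compute_gradient(beta, X, y, lambda_, gamma_):
    return X.T @ (sigmoid(X @ beta) - y) / len(y) + 2 * gamma_ * beta

# Synthetic data (hypothetical): intercept column plus two features.
rng = np.random.default_rng(0)
X_toy = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([0.5, 2.0, -1.0])
y_toy = (rng.random(200) < sigmoid(X_toy @ true_beta)).astype(float)

beta_init = np.zeros(X_toy.shape[1])
beta_hat, n_evals, return_code = opt.fmin_tnc(
    func=compute_cost, x0=beta_init, fprime=compute_gradient,
    args=(X_toy, y_toy, 0.0, 0.01), maxfun=500, xtol=1e-7, disp=0)
```

To evaluate several fairness weights, the same call can be repeated in a loop over $\lambda$ values once `compute_cost` includes the fairness term, recording the overall and per-group metrics for each fit.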