# Examples and exercises for Lecture Adversarial Regularization Regimes for Classification Tasks

In [2]:
import os
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
import numpy as np

from risk_learning.arr import (
    convert_to_categorical,
    make_feature_combination_array,
    make_feature_combination_score_array,
    make_trend_reports, 
    make_data_trend_reports
)

## Example: Simpson's Paradox Data

In [3]:
datadir = Path(os.getcwd()) / 'data'
data_path = datadir / 'adversarial-default-for-x-validation.csv'

df = pd.read_csv(data_path)
df

Unnamed: 0,default,gender,occupation
0,0,0,1
1,1,0,0
2,1,1,1
3,0,0,0
4,0,1,1
...,...,...,...
595,0,0,0
596,0,0,1
597,1,0,0
598,1,0,0


In [4]:
label_mapping_values = dict(gender=[0, 1], occupation=[0, 1])
data_categories = label_mapping_values.copy()
data_categories['default'] = [0, 1]
df = convert_to_categorical(df, data_categories)
df.head(10)

Unnamed: 0,default,gender,occupation
0,0,0,1
1,1,0,0
2,1,1,1
3,0,0,0
4,0,1,1
5,1,0,0
6,0,0,0
7,1,0,0
8,0,0,0
9,1,0,0


## Exercise: Simpson or not?

Difficulty: (*)

Prove that this dataset exhibits Simpson's paradox.
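One way to write down the condition to check, with $D$ for default, $G$ for gender and $O$ for occupation (this notation is introduced here for illustration and is not taken from the lecture), is a reversal of the following kind, or its mirror image:

$$
P(D=1 \mid G=1, O=o) > P(D=1 \mid G=0, O=o) \quad \text{for } o \in \{0, 1\},
\qquad \text{while} \qquad
P(D=1 \mid G=1) < P(D=1 \mid G=0).
$$

Such a reversal is possible because each pooled rate is a mixture of the subgroup rates,

$$
P(D=1 \mid G=g) = \sum_{o \in \{0,1\}} P(D=1 \mid G=g, O=o)\, P(O=o \mid G=g),
$$

and the occupation weights $P(O=o \mid G=g)$ can differ between the two genders.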

In [5]:
n_records = df.shape[0]

mask_100 = ((df['gender'] == 1) & (df['occupation'] == 0) & (df['default'] == 1))
mask_10_ = ((df['gender'] == 1) & (df['occupation'] == 0))
P_100_10_ = sum(mask_100) / sum(mask_10_)  # share of defaulters among all men with occupation 0

# same, but for occupation 1
mask_110 = ((df['gender'] == 1) & (df['occupation'] == 1) & (df['default'] == 1))
mask_11_ = ((df['gender'] == 1) & (df['occupation'] == 1))
P_110_11_ = sum(mask_110) / sum(mask_11_)  # share of defaulters among all men with occupation 1

# same, but ignoring occupation
mask_10 = ((df['gender'] == 1) & (df['default'] == 1))
mask_1_ = ((df['gender'] == 1))
P_10__1_ = sum(mask_10) / sum(mask_1_)  # share of defaulters among all men


## same, but for women

mask_000 = ((df['gender'] == 0) & (df['occupation'] == 0) & (df['default'] == 1))
mask_00_ = ((df['gender'] == 0) & (df['occupation'] == 0))
P_000_00_ = sum(mask_000) / sum(mask_00_)  # share of defaulters among all women with occupation 0

# same, but for occupation 1
mask_010 = ((df['gender'] == 0) & (df['occupation'] == 1) & (df['default'] == 1))
mask_01_ = ((df['gender'] == 0) & (df['occupation'] == 1))
P_010_01_ = sum(mask_010) / sum(mask_01_)  # share of defaulters among all women with occupation 1

# same, but ignoring occupation
mask_00 = ((df['gender'] == 0) & (df['default'] == 1))
mask_0_ = ((df['gender'] == 0))
P_00__0_ = sum(mask_00) / sum(mask_0_)  # share of defaulters among all women


tabela = pd.DataFrame({'Women': [P_000_00_, P_010_01_, P_00__0_],
                       'Men': [P_100_10_, P_110_11_, P_10__1_]},
                      index=['occupation 0', 'occupation 1', 'occupation 0&1'])
tabela


Unnamed: 0,Women,Men
occupation 0,0.770936,1.0
occupation 1,0.045455,0.276471
occupation 0&1,0.733645,0.284884


The table above shows the default rates conditioned on gender and occupation. Within each occupation, the default rate is higher for men (gender 1) than for women (gender 0). The third row gives the default rate irrespective of occupation, and there the rate is higher for women. The direction of the comparison reverses once occupation is ignored, so the dataset exhibits Simpson's paradox.
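The same kind of table can also be built with a `groupby` instead of hand-written masks. The following is a minimal sketch on a hypothetical ten-row toy frame (not the lecture dataset), assuming plain 0/1 integer columns; the helper name `default_rate_table` is made up for illustration. With categorical columns, such as those returned by `convert_to_categorical`, the `default` column would probably need an `astype(int)` cast first.

```python
import pandas as pd

# Hypothetical toy records, NOT the lecture dataset: 0/1 integer columns.
toy = pd.DataFrame({
    "gender":     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "occupation": [0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
    "default":    [1, 1, 1, 0, 0, 1, 1, 0, 0, 0],
})


def default_rate_table(frame):
    """Default rate per (occupation, gender), plus one pooled row per gender."""
    # The mean of a 0/1 column within a group is the conditional default rate.
    by_occupation = (
        frame.groupby(["occupation", "gender"])["default"].mean().unstack("gender")
    )
    by_occupation.index = [f"occupation {k}" for k in by_occupation.index]
    # Pooled rate: condition on gender only, ignoring occupation.
    pooled = frame.groupby("gender")["default"].mean().rename("pooled")
    return pd.concat([by_occupation, pooled.to_frame().T])


rates = default_rate_table(toy)
print(rates)

# Columns are the gender codes 0 and 1; rows are the two occupations and "pooled".
subgroups = ["occupation 0", "occupation 1"]
higher_within = bool((rates.loc[subgroups, 1] > rates.loc[subgroups, 0]).all())
lower_pooled = bool(rates.loc["pooled", 1] < rates.loc["pooled", 0])
print("Simpson reversal:", higher_within and lower_pooled)
```

In this toy frame gender 1 has the higher rate inside each occupation and the lower rate once the occupations are pooled, which is the pattern the exercise asks for.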

## Exercises: non-trivial regularization regime

* Which optimizer ("solver") for logistic regression seems best suited for the above dataset? https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* Calculate the "true" trends for female default for each occupation subgroup. Note that sklearn uses the inverse regularization parameter $C$, so to approximate the usual unregularized case $c=0$, set $C$ to a large value. Difficulty: (**)
* Show that this dataset is adversarial for logistic regression for inverse regularization parameter $C=0.05$. Difficulty: (**)
* Show that this dataset is still adversarial for k-fold cross-validated logistic regression if $k=5$, the default setting.
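
As a warm-up for the exercises above, the effect of the inverse regularization parameter $C$ can be seen on synthetic data. The following is a minimal sketch, not a solution: the data, the coefficients used to generate it, and the helper name `fit_coefficients` are all assumptions made for illustration. It relies only on the default L2 penalty of `LogisticRegression`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data, NOT the lecture dataset: two binary features
# and a noisy binary label, so the classes are not linearly separable.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 2))
p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 1.0 * X[:, 1])))
y = (rng.random(400) < p).astype(int)


def fit_coefficients(C):
    """Fit L2-penalized logistic regression and return the feature coefficients."""
    model = LogisticRegression(C=C, solver="lbfgs", max_iter=1000)
    model.fit(X, y)
    return model.coef_.ravel()


coef_weak = fit_coefficients(1e6)     # large C: almost no regularization
coef_strong = fit_coefficients(0.05)  # small C: strong shrinkage toward zero

print("C=1e6 :", coef_weak)
print("C=0.05:", coef_strong)
```

The coefficient vector fitted with $C=0.05$ has a smaller norm than the one fitted with $C=10^6$. Whether that shrinkage is enough to distort a subgroup trend depends on the data, which is what the exercises ask you to check.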