# Fairness Audit Tool


#### What do you need?
 - pandas
 - os.path
 - pick_labels (py file)
 - import_data (py file)
 

#### Data:
 - Data is in csv format
 - Columns for ground truth (what is your model aiming to recreate?), model results and protected characteristic you would like to investigate available
 

 
#### What do you get?
 - Find out if there is a difference in outcomes for your ground truth, dependent on the protected characteristic. 
 - Find out if there is a difference in statistical parity for your model predictions, dependent on the protected characteristic.
 - Comparisons are 1-1. For example, if your protected characteristic contains A, B and C, then results will compare A and B; A and C; and B and C.
 - If you have an expected difference (for example, an illness that men are more likely to get than women), you can find the expected differences you should have, accounting for that.
 






### Import required modules

In [17]:
#Import docs
import pandas as pd
pd.options.mode.chained_assignment = None
import import_data as imd
import pick_labels as pl

### Load data and pick columns

In [19]:
#Load data
df = imd.import_data()
#Pick requirements
ground_truth = pl.ground_truth(df)
model_results = pl.model_result(df, ground_truth)
protected_chara = pl.protected_chara(df, ground_truth, model_results)
#List the potential values for protected characteristic
protected_values = list(df[protected_chara].unique())
print("The potential values of your protected characteristic are ", protected_values)
pot_results = list(df[model_result].unique())

Please enter the full location of your data on your device:  D:\SORT\Data\predictions_clinical_only_with_percent_bins_death.csv


Thank you, this is the first 5 rows of the dataframe you have loaded.
   age sex  mort30  phat.clinical  phat.sort_clinical  phat.sort  \
0   32   F   False          0.005            0.002277   0.007290   
1   68   F   False          0.005            0.003473   0.015828   
2   58   F   False          0.005            0.000870   0.001231   
3   66   M   False          0.005            0.001134   0.002009   
4   68   M   False          0.005            0.001134   0.002009   

  percentage_clin_mort percentage_sort_clin_mort percentage_sort_mort  \
0                  <1%                       <1%                  <1%   
1                  <1%                       <1%              1%-2.5%   
2                  <1%                       <1%                  <1%   
3                  <1%                       <1%                  <1%   
4                  <1%                       <1%                  <1%   

   pred_death  
0       False  
1       False  
2       False  
3       False  
4 

Please enter the name of the column containing your ground truth:  mort30
Please enter the name of the column containing your model results:  pred_death
Please enter the name of the name of the column containing the characteristic you would like to audit:  sex


The potential values of your protected characteristic are  ['F', 'M']


### Is there an expected bias?

In [24]:
##Find expected bias
a=0
b=0
rate=0
while a==0:
    while b==0:
        expected = input("Do you expect there to be explainable bias in your data? For example, you may be looking at a disease that has a higher rate in men. (Yes/No)")
        if expected == "Yes" or expected == "Y" or expected == "yes":
            b=1
        elif expected == "No" or expected == "N" or expected == "no":
            b=1
            a=1
        else:
            print("Please answer Yes or No.")
    if a==1:
        break
    d=0
    while d==0:
        expected = input("Do you know the what the rate of difference is?")
        if expected == "Yes" or expected == "Y" or expected == "yes":
            d=1
        elif expected == "No" or expected == "N" or expected == "no":
            a=1
            d=1
        else:
            print("Please answer Yes or No.")
    if a==1:
        break
    e=0
    while e==0:
        rate = input("Please enter your expected rate. If you do not know the expected rate, please enter 0. For example, if disease is twice as high in men you should enter 2. ")
        try:
            val = float(rate)
            c=0
            while c==0:
                lean = input("Please enter the group that has the higher rate, as it shows in your data")
                if lean not in protected_values:
                    print("please ensure that your xxx is one of your protected values")
                else:
                    print("We will take this into consideration")
                    a=1
                    e=1
                    c=1
        except ValueError:
            print("That's not a float!")
            

Do you expect there to be explainable bias in your data? For example, you may be looking at a disease that has a higher rate in men. (Yes/No) no


### Bias in the original data?

In [28]:
#Information about data
df_info = pd.DataFrame(columns=['Group 1', 'Group 2 ', 'Difference', 'Ground_Truth'])
AA = []
BB = []
CC = []
DD = []
EE = []
a=0
b=0
#Run through all potential protected values twice to pair up all possibilities
for j in range(0,len(protected_values)):
    privileged_group = protected_values[j]
    for i in range(j,len(protected_values)):
        unprivileged_group = protected_values[i]
        if unprivileged_group == privileged_group:
            cool = "cool"
        else:
            #Split new dataframes for each characteristic
            unprivileged_df = df[df[protected_chara] == unprivileged_group]
            privileged_df = df[df[protected_chara] == privileged_group]

            privileged_df[ground_truth] = privileged_df[ground_truth].astype(str)
            unprivileged_df[ground_truth] = unprivileged_df[ground_truth].astype(str)
            
            #run through all potential results
            for k in range(0, len(pot_results)):
                percent = pot_results[k]
                prob_of_death_unprivileged = len(unprivileged_df[unprivileged_df[ground_truth] == percent])/len(unprivileged_df)
                prob_of_death_privileged = len(privileged_df[privileged_df[ground_truth] == percent])/len(privileged_df)
                
                AA.append(privileged_group)
                BB.append(unprivileged_group)
                CC.append(prob_of_death_unprivileged - prob_of_death_privileged)
                DD.append(percent)

                #If there is an expected difference
                if rate != 0 and (lean == privileged_group or lean == unprivileged_group):
                    
                    df_r = df[df[protected_chara] == lean]
                    M_t = len(df_r)
                    M_d = len(df_r[df_r[ground_truth] == percent])
                    if lean == privileged_group:
                        Fdf_r = unprivileged_df
                    else:
                        Fdf_r = privileged_df
                    expected_result = M_d/M_t - (M_d/int(rate))/len(Fdf_r)
                    EE.append(expected_result)
                   
df_info['Group 1'] = AA
df_info['Group 2'] = BB
df_info['Difference'] = CC
df_info['Ground_Truth'] = DD
if rate != 0 and (lean == privileged_group or lean == unprivileged_group):
    df_info['Expected'] = EE 
    dif = [a_i - b_i for a_i, b_i in zip(CC, EE)]
if any(abs(t)>0.1 for t in CC):
    if dif:
        if any(r>0.1 for r in dif):
            print("There is some unexpected bias in this data. This is equivalent to...")
        else:
            print("There is some bias in this data, however it matches with your expected levels.")
    else:
        print("There is some unexpected bias in this data. This is equivalent to...")
else:
    print("There is little to no bias in this data")
    


There is little to no bias in this data


The following table contains more information about these results

In [29]:
print(df_info)

  Group 1 Group 2   Difference Ground_Truth Group 2
0       F      NaN   -0.006108        False       M
1       F      NaN    0.006108         True       M


### Bias in the model results?

In [30]:
#Evaluation of model
df_sp = pd.DataFrame(columns=['Group 1', 'Group 2', 'Statistical_Parity', 'Model_Result'])
FF = []
GG = []
HH = []
II = []
JJ = []

#Run through protected values twice to pair them all up
prob_of_death_list = []
for j in range(0,len(protected_values)):
    privileged_group = protected_values[j]
    for i in range(j,len(protected_values)):
        unprivileged_group = protected_values[i]
        if unprivileged_group == privileged_group:
            cool = "cool"
        else:
            unprivileged_df = df[df[protected_chara] == unprivileged_group]
            privileged_df = df[df[protected_chara] == privileged_group]
            
            #Run through all potential results
            for k in range(0, len(pot_results)):
                percent = pot_results[k]
                prob_of_death_unprivileged = len(unprivileged_df[unprivileged_df[model_result]==percent])/len(unprivileged_df)
                prob_of_death_privileged = len(privileged_df[privileged_df[model_result]==percent])/len(privileged_df)
                prob_of_death = prob_of_death_unprivileged - prob_of_death_privileged
                prob_of_death_list.append([[privileged_group,unprivileged_group], percent, prob_of_death])
                FF.append(privileged_group)
                GG.append(unprivileged_group)
                HH.append(prob_of_death_unprivileged - prob_of_death_privileged)
                II.append(percent)
                
                #If there is an expected difference...
                if rate != 0 and (lean == privileged_group or lean == unprivileged_group):
                    
                    df_r = df[df[protected_chara] == lean]
                    M_t = len(df_r)
                    M_d = len(df_r[df_r[ground_truth] == percent])
                    if lean == privileged_group:
                        Fdf_r = unprivileged_df
                    else:
                        Fdf_r = privileged_df
                    expected_result = M_d/M_t - (M_d/int(rate))/len(Fdf_r)
                    JJ.append(expected_result)
            

                
df_sp['Group 1'] = FF
df_sp['Group 2'] = GG
df_sp['Statistical_Parity'] = HH
df_sp['Model_Result'] = II
if rate != 0 and (lean == privileged_group or lean == unprivileged_group):
    df_sp['Expected'] = JJ
    dif_2 = [a_i - b_i for a_i, b_i in zip(HH, JJ)]

if any(abs(t)>0.1 for t in HH):
    if dif:
        if any(r>0.1 for r in dif):
            print("There is some unexpected bias in these results. This is equivalent to...")
        else:
            print("There is some bias in these results, however it matches with your expected levels.")
    else:
        print("There is some unexpected bias in these results. This is equivalent to...")
else:
    print("There is little to no bias in these results")



There is little to no bias in these results


The following table contains more information about these results

In [31]:
print(df_sp)

  Group 1 Group 2  Statistical_Parity Model_Result
0       F       M           -0.011801        False
1       F       M            0.011801         True
