# Data Preparation

### Required packages

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import zscore, zmap
from sklearn.model_selection import train_test_split
import statsmodels.stats.proportion as proportion
from statsmodels.stats.proportion import proportions_ztest
import warnings
warnings.filterwarnings('ignore')
!pip install pyreadstat
!wget -q https://www.dropbox.com/s/7rd15y2jdam5wls/carsales.sav
!wget -q https://www.dropbox.com/s/0ctjebkulbqv23z/Professional_Sales_Applicants.sav



### Our training data contains 313 records, including protected attributes, assessment dimension percentiles and the rating score of each candidate. Our test data has over 25,000 rows, including protected attributes, assessment dimension percentiles and the final recommendation result.

In [17]:
df= pd.read_spss('carsales.sav')
test= pd.read_spss("Professional_Sales_Applicants.sav")

### Clean data

In [22]:
df=df.dropna()
test=test.dropna()

All rows with null values are dropped. The null values themselves are analyzed in the EDA notebook.

# Modeling

## Relationship between the existing OverallRating and (Age, Gender, Ethnicity)


### OLS regression on all protected attributes to test whether age, gender and ethnicity are related to the overall rating score

Group the categories with few records into an "Other" category.

In [37]:
df2=df.copy()
df2.loc[df2["Race"] == "Two or More Races (not Hispanic or Latino)","Race"]="Other"
df2.loc[df2["Race"] == "American Indian or Alaska Native (not Hispanic or Latino)","Race"]="Other"
df2.loc[df2["Race"] == "Native Hawaiian or Other Pacific Islander (not Hispanic or Latino)","Race"]="Other"
# Drop the "(not Hispanic or Latino)" suffix as a literal string, then trim the leftover whitespace
df2['Race']=df2['Race'].str.replace('(not Hispanic or Latino)','',regex=False).str.strip()
df2["Race"].value_counts()

White                         160
Hispanic or Latino             52
Black or African American      46
Asian                          20
Other                          20
Prefer Not To Say              15
Name: Race, dtype: int64
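
The three manual `.loc` assignments above can also be written as a single rule. A minimal sketch, assuming a hypothetical cutoff `min_count` and toy data (neither comes from the notebook): any category with fewer than `min_count` rows is relabelled "Other".

```python
import pandas as pd

# Toy data standing in for df2["Race"]; the labels and counts are made up.
race = pd.Series(["White"] * 5 + ["Asian"] * 3
                 + ["Two or More Races"] * 1
                 + ["American Indian or Alaska Native"] * 1)

min_count = 2  # hypothetical cutoff, not a value used in the notebook

counts = race.value_counts()
rare = counts[counts < min_count].index             # labels seen fewer than min_count times
collapsed = race.where(~race.isin(rare), "Other")   # keep common labels, replace rare ones

print(collapsed.value_counts())
```

A rule like this keeps the grouping consistent if the same cleaning step has to be applied to both the training and the test data.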

Create dummy variables.

In [38]:
race=pd.get_dummies(df2['Race'],drop_first=False)
Ageband=pd.get_dummies(df2['AgeBand'],drop_first=False)
Gender=pd.get_dummies(df2['Gender'],drop_first=False)
protected=pd.concat([race,Ageband,Gender], axis=1)

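Regress the actual score on age, gender and ethnicity.

In [39]:
# `race` and `Ageband` are the dummy frames created above (patsy looks them up in the calling scope); `Gender` is the df2 column
results_actual= smf.ols('OverallRating ~ race+Ageband+Gender', data=df2).fit()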
print(results_actual.summary())

                            OLS Regression Results                            
Dep. Variable:          OverallRating   R-squared:                       0.048
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     1.161
Date:                Wed, 04 Nov 2020   Prob (F-statistic):              0.307
Time:                        22:41:29   Log-Likelihood:                -460.00
No. Observations:                 313   AIC:                             948.0
Df Residuals:                     299   BIC:                             1000.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

Our null hypothesis is that the protected attributes have no influence on the overall rating score.

The results show that, at alpha = 0.05, race[0], race[1], race[5] and ageband[1:4] have a statistically significant influence on the overall rating score.

## Fit the model, predict OverallRating and check for bias

### Train/test split: the 'df' dataframe is the training dataset and the 'test' dataframe is the test dataset

In [40]:
X_train = df.iloc[:,6:28]
X_test = test.iloc[:,15:35]
X_train_0, X_test_0, y_train_0, y_test_0 = train_test_split(X_train, df.OverallRating, test_size=0.3, random_state=0)

Fit the model

In [41]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression() 
lm.fit(X_train_0, y_train_0)
y_pred_0 = lm.predict(X_test_0)  # predictions for the held-out 30% of df
print('Coefficients: \n', lm.coef_)
print('Intercept: \n', lm.intercept_)
print('R2 for Train', lm.score( X_train_0, y_train_0 ))
print('R2 for Test ', lm.score(X_test_0, y_test_0))

Coefficients: 
 [-0.25393264 -0.37952731 -0.34631502  0.77078372 -0.37040078 -0.85139293
 -0.66300667 -0.28013756  0.22917723  0.05768146 -0.50121535  0.1369348
  0.1114772   0.25731077  0.85323279  0.45429727  0.25837847  0.42546243
  1.36135422 -0.28227065]
Intercept: 
 2.8195778342515054
R2 for Train 0.19593613101934548
R2 for Test  0.013018479813734607


### Get the recommendation rate from the historical records

**The recommendation rate in the test data set is 77.2%**

In [89]:
recommendornot=test['OverallFitPassFail'].value_counts()
total=recommendornot.sum()
# value_counts sorts in descending order, so position 0 is the most frequent outcome
recommend_rate=recommendornot.iloc[0]/total
print('Recommend rate is {} '.format('{0:0.1%}'.format(recommend_rate)))

Recommend rate is 77.2% 


Predict the OverallRating for the training data set.

In [43]:
X = df.iloc[:,6:28]
df2['Prediction']=lm.predict(X)

Sort the candidates by prediction

In [44]:
df2.sort_values("Prediction", inplace=True)

Select the top 77.2% as recommended and the bottom 22.8% as not recommended.

There are 313 records in total. We sorted them by predicted OverallRating, so the split point is 313 × 22.8% ≈ 71.

We then look up the predicted score of the lowest-ranked recommended candidate.

In [92]:
df2.iloc[71:72,:]['Prediction']

272    2.959518
Name: Prediction, dtype: float64

In [93]:
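# The score of the applicant on the dividing line is distinct, so it can serve as the cutoff.
# Reading it from the data avoids comparing against a value that was rounded for display.
threshold = df2['Prediction'].iloc[71]
df2['recommend']=0
df2.loc[df2['Prediction']>= threshold,'recommend']=1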
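
The cutoff logic described above (sort ascending, treat the bottom 22.8% as not recommended) can be sketched in plain Python. This is an illustrative sketch with made-up scores; `split_by_rate` is a hypothetical helper, not part of the notebook.

```python
def split_by_rate(scores, recommend_rate):
    """Return (threshold, flags); flags[i] is 1 if scores[i] is at or above the cutoff."""
    n = len(scores)
    n_reject = round(n * (1 - recommend_rate))   # size of the bottom group
    ordered = sorted(scores)
    threshold = ordered[n_reject]                # lowest score that is still recommended
    flags = [1 if s >= threshold else 0 for s in scores]
    return threshold, flags

# Made-up scores, not taken from the data set
scores = [2.1, 3.4, 2.9, 3.0, 2.5, 3.8, 2.7, 3.1, 3.3, 2.8]
threshold, flags = split_by_rate(scores, recommend_rate=0.772)
print(threshold, sum(flags))

# The same rounding gives the split points used in this notebook
print(round(313 * (1 - 0.772)), round(28727 * (1 - 0.772)))
```

If several candidates share the cutoff score, all of them are flagged as recommended, so the recommended group can end up slightly larger than the target share.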

Check the regression result on the training data.

In [94]:
# Regression of the predicted score on the protected attributes
results_predict= smf.ols('Prediction ~ Race+AgeBand+Gender', data=df2).fit()
print(results_predict.summary())

                            OLS Regression Results                            
Dep. Variable:             Prediction   R-squared:                       0.040
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.9493
Date:                Wed, 04 Nov 2020   Prob (F-statistic):              0.502
Time:                        22:50:46   Log-Likelihood:                -194.31
No. Observations:                 313   AIC:                             416.6
Df Residuals:                     299   BIC:                             469.1
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                                         coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
Inte

**From the summary, all independent variables have p-values larger than 0.05, so we conclude that age, race and gender have no significant influence on the predicted overall rating.**

### Predict the OverallRating for the test data set

In [49]:
# Predict the OverallRating for the test data set
predict=lm.predict(X_test)
test['pred']=predict

Select the top 77.2% as recommended and the bottom 22.8% as not recommended.

There are 28727 records in total. We sorted them by predicted OverallRating, so the split point is 28727 × 22.8% ≈ 6550.

We then look up the predicted score of the lowest-ranked recommended candidate.

In [95]:
# Sort the candidates by predicted OverallRating
test.sort_values("pred", inplace=True)

Not_Recommend  = test.iloc[0:6550,:] 
Recommend  = test.iloc[6550:28727,:] 

In [97]:
test.iloc[6550:6551,:]

Unnamed: 0,CMX,ClientId,ClientName,Industry_Final,AgeRange,Ethnicity,Gender,TestBatteryInstanceId,JobName,OutMatchJobProfileId,IsMobile,TbiTimeToCompletesecs,OverallFitScore,OverallFitBand,OverallFitPassFail,Accommodation_tile,Assertiveness_tile,CautiousThinking_tile,Competitiveness_tile,CriticismTolerance_tile,DetailInterest_tile,FollowThrough_tile,InterpersonalInsight_tile,Multitasking_tile,ObjectiveThinking_tile,Optimism_tile,PositiveViewofPeople_tile,PreferenceforStructure_tile,ProcessFocused_tile,RealisticThinking_tile,ReflectiveThinking_tile,Sociability_tile,SocialRestraint_tile,WorkIndependence_tile,WorkIntensity_tile,pred,recommend
9192,1.0,41.0,CarMax,Retail,20-29,Hispanic or Latino,Male,877294.0,Sales Consultant,220.0,Yes,13.0,4.5,Strongest,Recommend,0.5793,0.9929,1.0,1.0,0.7104,0.8665,0.71,0.9325,0.9239,0.6727,1.0,1.0,0.7511,0.6286,0.962,0.7107,1.0,1.0,0.11,1.0,2.814016,1


Split the recommended and poor candidates.

In [98]:
# The score of the candidate on the dividing line is distinct, so it can serve as the cutoff.
# Cutoff = predicted score of the first recommended candidate (position 6550 after sorting).
threshold = test['pred'].iloc[6550]
test['recommend']=0
test.loc[test['pred']>= threshold,'recommend']=1

Compute the share of recommended and not recommended candidates for each gender.

In [106]:
recommend_male_tile = round(Recommend.loc[Recommend['Gender'] == 'Male'].shape[0]/test.loc[test['Gender'] == 'Male'].shape[0],2)
recommend_female_tile = round(Recommend.loc[Recommend['Gender'] == 'Female'].shape[0]/test.loc[test['Gender'] == 'Female'].shape[0],2)
poor_male_tile = round(Not_Recommend.loc[Not_Recommend['Gender'] == 'Male'].shape[0]/test.loc[test['Gender'] == 'Male'].shape[0],2)
poor_female_tile = round(Not_Recommend.loc[Not_Recommend['Gender'] == 'Female'].shape[0]/test.loc[test['Gender'] == 'Female'].shape[0],2)

Build a matrix of recommendation rates by gender.

In [107]:
pd.DataFrame([[recommend_male_tile,recommend_female_tile],[poor_male_tile,poor_female_tile]],columns=['male','female'],index=['recommend','poor'])

| | male | female |
|---|---|---|
| recommend | 0.78 | 0.75 |
| poor | 0.22 | 0.25 |


Next we apply the four-fifths rule. It prescribes that the selection rate for any disadvantaged group should be no less than four-fifths of the rate for the group with the highest selection rate.

A selection rate for any race, sex, or ethnic group which is less than 4/5 of the rate for the group with the highest rate will generally be regarded by the Federal enforcement agencies as evidence of adverse impact.

The selection rate for female candidates is 75%, which is well above 80% of the selection rate for male candidates (0.8 × 78% = 62.4%). Under the four-fifths rule there is therefore no evidence of bias between male and female candidates.
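
The four-fifths check can be written out as a small calculation. A minimal sketch using the rounded rates from the table above; `impact_ratios` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def impact_ratios(selection_rates):
    """Divide each group's selection rate by the highest group's rate."""
    best = max(selection_rates.values())
    return {group: rate / best for group, rate in selection_rates.items()}

# Rounded selection rates from the gender table above
rates = {"male": 0.78, "female": 0.75}

for group, ratio in impact_ratios(rates).items():
    status = "ok" if ratio >= 0.8 else "possible adverse impact"
    print(f"{group}: impact ratio {ratio:.2f} ({status})")
```

Because the helper takes a dictionary, the same check could be applied to the age or ethnicity groups by passing their selection rates.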


## Statistical tests on protected attributes

Next we apply a Z-test / chi-square test. First we test the relationship between gender and the recommendation result.



### Gender and recommendation result

Here our null hypothesis **H0** is that **there is no difference between the proportions of female and male candidates who are recommended**, and our alternative hypothesis **H1** is that **the two proportions differ**.

In [117]:
female_recommend=Recommend.loc[Recommend['Gender'] == 'Female'].shape[0] #7164
female_total=test.loc[test['Gender'] == 'Female'].shape[0]        #9515
male_recommend=Recommend.loc[Recommend['Gender'] == 'Male'].shape[0]   #14697
male_total=test.loc[test['Gender'] == 'Male'].shape[0]          #18811

pass_gender = np.array([female_recommend, male_recommend])
total_gender  = np.array([female_total,   male_total])

chisq, pvalue, table = proportion.proportions_chisquare(pass_gender, total_gender)
print('Results are ','chisq =%.3f, pvalue = %.3f'%(chisq, pvalue))

Results are  chisq =28.896, pvalue = 0.000


With a chi-square statistic of 28.896 and a p-value below 0.001, we reject H0 and conclude that gender is associated with the recommendation result.

### Same test with Age and Ethnicity

**All tests reject the null hypothesis, which implies that candidates' age and race are not independent of their recommendation results.**
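
To show what the test computes, the statistic for two groups can be reproduced by hand. A minimal sketch using only the standard library: it runs a pooled two-proportion z-test on the counts from the cell above and relies on the fact that, for two groups, the chi-square statistic equals z squared. `two_proportion_test` is a hypothetical helper, not a statsmodels function.

```python
import math

def two_proportion_test(pass_a, n_a, pass_b, n_b):
    """Pooled two-sample z-test for proportions; returns (z, chisq, pvalue)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    chisq = z * z                               # with two groups, chi-square equals z squared
    pvalue = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value from the normal distribution
    return z, chisq, pvalue

# Counts from the cell above: female 7164 of 9515, male 14697 of 18811
z, chisq, pvalue = two_proportion_test(7164, 9515, 14697, 18811)
print("z = %.3f, chisq = %.3f, pvalue = %.2g" % (z, chisq, pvalue))
```

The negative sign of z reflects that the female recommendation rate is the lower of the two.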

In [118]:
test['AgeRange'].value_counts()

20-29                12789
30-39                 5053
40-49                 3265
50-59                 2791
16-19                 2360
60 or over            1471
Prefer Not To Say      930
                        70
Name: AgeRange, dtype: int64

In [119]:
Recommend.loc[Recommend['AgeRange'] == '20-29'].shape[0]

9759

In [120]:
Recommend.loc[Recommend['AgeRange'] == '30-39'].shape[0]

3850

In [121]:
Recommend.loc[Recommend['AgeRange'] == '40-49'].shape[0]

2526

In [122]:
Recommend.loc[Recommend['AgeRange'] == '50-59'].shape[0]

2228

In [123]:
Recommend.loc[Recommend['AgeRange'] == '16-19'].shape[0]

1832

In [124]:
Recommend.loc[Recommend['AgeRange'] == '60 or over'].shape[0]

1182

In [125]:
Recommend.loc[Recommend['AgeRange'] == 'Prefer Not To Say'].shape[0]

739

In [126]:
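# Order: 20-29, 30-39, 40-49, 50-59, 16-19, 60 or over, Prefer Not To Say
# Counts were copied by hand from the cells above (the 40-49 cell reports 2526, not 2520)
pass_age = np.array([9759,3850,2520,2228,1832,1182,739])
total_age = np.array([12789,5053,3265,2791,2360,1471,930])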

chisq, pvalue, table = proportion.proportions_chisquare(pass_age, total_age)
print('Results are ','chisq =%.3f, pvalue = %.3f'%(chisq, pvalue))

Results are  chisq =30.819, pvalue = 0.000


We reject H0 because of the high chi-square statistic and a p-value close to 0.
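
Copying counts by hand from cell outputs is easy to get wrong (the array above has 2520 for the 40-49 band where the cell output shows 2526). A minimal sketch of building both arrays from the data instead, using a toy frame whose column names mirror `test` (`AgeRange`, `recommend`); the rows are made up.

```python
import pandas as pd

# Toy stand-in for the `test` frame; values are invented for illustration.
toy = pd.DataFrame({
    "AgeRange":  ["20-29", "20-29", "20-29", "30-39", "30-39", "40-49"],
    "recommend": [1, 1, 0, 1, 0, 1],
})

# One row per group: number recommended and group size
summary = toy.groupby("AgeRange")["recommend"].agg(passed="sum", total="count")
print(summary)

pass_age = summary["passed"].to_numpy()
total_age = summary["total"].to_numpy()
# pass_age and total_age can then be passed to proportion.proportions_chisquare
```

On the real data, rows with a blank `AgeRange` would probably need to be filtered out first, since the hand-built arrays above leave that group out.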

In [78]:
test['Ethnicity'].value_counts()

White (not Hispanic or Latino)                                        12005
Black or African American (not Hispanic or Latino)                     7636
Hispanic or Latino                                                     4964
Two or More Races (not Hispanic or Latino)                             1512
Asian (not Hispanic or Latino)                                         1187
Prefer Not To Say                                                       791
Other                                                                   251
American Indian or Alaska Native (not Hispanic or Latino)               162
Native Hawaiian or Other Pacific Islander (not Hispanic or Latino)      151
                                                                         70
Name: Ethnicity, dtype: int64

In [80]:
# race

In [81]:
test.loc[test["Ethnicity"] == "Two or More Races (not Hispanic or Latino)","Ethnicity"]="Other"
test.loc[test["Ethnicity"] == "American Indian or Alaska Native (not Hispanic or Latino)","Ethnicity"]="Other"
test.loc[test["Ethnicity"] == "Native Hawaiian or Other Pacific Islander (not Hispanic or Latino)","Ethnicity"]="Other"
test["Ethnicity"].value_counts()

White (not Hispanic or Latino)                        12005
Black or African American (not Hispanic or Latino)     7636
Hispanic or Latino                                     4964
Other                                                  2076
Asian (not Hispanic or Latino)                         1187
Prefer Not To Say                                       791
                                                         70
Name: Ethnicity, dtype: int64

In [82]:
Recommend.loc[Recommend['Ethnicity'] == 'White (not Hispanic or Latino)'].shape[0]

9926

In [83]:
Recommend.loc[Recommend['Ethnicity'] == 'Black or African American (not Hispanic or Latino)'].shape[0]

5902

In [84]:
Recommend.loc[Recommend['Ethnicity'] == 'Hispanic or Latino'].shape[0]

3867

In [85]:
Recommend.loc[Recommend['Ethnicity'] == 'Asian (not Hispanic or Latino)'].shape[0]

925

In [86]:
Recommend.loc[Recommend['Ethnicity'] == 'Other'].shape[0]

1674

In [87]:
Recommend.loc[Recommend['Ethnicity'] == 'Prefer Not To Say'].shape[0]

625

In [88]:
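# Order: White, Black or African American, Hispanic or Latino, Asian, Other, Prefer Not To Say
# Counts were copied by hand; some differ by one from the cell outputs above (9926 and 625 recommended; 7636 and 791 total)
pass_race = np.array([9927,5902,3867,925,1674,624])
total_race = np.array([12005,7635,4964,1187,2076,790])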

chisq, pvalue, table = proportion.proportions_chisquare(pass_race, total_race)
print('Results are ','chisq =%.3f, pvalue = %.3f'%(chisq, pvalue))

Results are  chisq =106.839, pvalue = 0.000


Same result as above: we reject H0, so ethnicity is also associated with the recommendation result.