# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names, then for black-sounding names
print(sum(data[data.race=='w'].call))
print(sum(data[data.race=='b'].call))

235.0
157.0


In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [5]:
data[['id','race','call']].head(5)

Unnamed: 0,id,race,call
0,b,w,0.0
1,b,w,0.0
2,b,b,0.0
3,b,b,0.0
4,b,w,0.0


In [6]:
data.call.sum()

392.0

<div class="span5 alert alert-success">
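<h4> 1a. What test is appropriate for this problem? </h4>
 <p>A two-sample test for a difference in proportions (an A/B test) is appropriate: the outcome is binary (callback or no callback) and the two groups are defined by randomly assigned names. A two-proportion z-test and a permutation test are both reasonable choices.</p>
 <h4>1b. Does CLT apply?</h4>
 <p>Yes. Names were assigned to résumés at random, and each group is large (2,435 résumés, with 235 and 157 callbacks respectively), so the sampling distribution of each sample proportion, and of their difference, is approximately normal.</p>
 <h4>2. What are the null and alternate hypotheses?</h4>
 <p> H0: Black applicants are equally likely to get a call from employers as White applicants</p>
 <p> Ha: Black applicants are less likely to get a call from employers than White applicants</p>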
</div>
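
A common rule of thumb for applying the CLT to a proportion is the success-failure condition: each group should contain at least 10 callbacks and at least 10 non-callbacks. The sketch below checks this with the counts reported in this notebook (235 of 2,435 white-sounding and 157 of 2,435 black-sounding résumés). The helper name `success_failure_ok` and the threshold of 10 are illustrative assumptions, not part of the original exercise.

```python
def success_failure_ok(successes, n, threshold=10):
    """Return True if both the successes and the failures reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


# (callbacks, résumés sent) per group, taken from the notebook's printed counts
groups = {"w": (235, 2435), "b": (157, 2435)}

for race, (calls, n) in groups.items():
    print(f"race={race}: condition met = {success_failure_ok(calls, n)}")
```

Both groups clear the threshold by a wide margin, which supports using a normal approximation for the difference in proportions.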

### 3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.

In [7]:
#Determine the total number of applicants and applicants by race.
numof_applicants = data.race.count()
numof_wht_applicants = data.race[data.race=='w'].count()
numof_blk_applicants = data.race[data.race=='b'].count()
numof_wht_applicants_called = sum(data[data.race=='w'].call)
numof_blk_applicants_called = sum(data[data.race=='b'].call)

#Determine the race and callback proportion.
p_wht = numof_wht_applicants/numof_applicants
p_blk = numof_blk_applicants/numof_applicants
p_call_blk = numof_blk_applicants_called/numof_blk_applicants
p_call_wht = numof_wht_applicants_called/numof_wht_applicants



print(f'The total number of applicants = {numof_applicants}')
print(f'\n The number of black applicants = {numof_blk_applicants}')
print(f'\n The number of black applicants called = {numof_blk_applicants_called}')
print(f'\n The likelihood to be called if black is = {p_call_blk}')
print(f'\n The number of white applicants = {numof_wht_applicants}')
print(f'\n The number of white applicants called = {numof_wht_applicants_called}')
print(f'\n The likelihood to be called if white is = {p_call_wht}')
print(f'\n The probability of choosing a black application is = {p_blk}')
print(f'\n The probability of choosing a white application is = {p_wht}')



The total number of applicants = 4870

 The number of black applicants = 2435

 The number of black applicants called = 157.0

 The likelihood to be called if black is = 0.06447638603696099

 The number of white applicants = 2435

 The number of white applicants called = 235.0

 The likelihood to be called if white is = 0.09650924024640657

 The probability of choosing a black application is = 0.5

 The probability of choosing a white application is = 0.5


In [23]:
# Using the Frequentist method. 

alpha = 0.05
prop_diff = p_call_wht - p_call_blk # difference in the proportion of callbacks of whites vs black applicants
p_called = (numof_blk_applicants_called + numof_wht_applicants_called)/numof_applicants # proportion of all callbacks of the entire sample


mu_zero = 0 #assuming H0 p_call_blk = p_call_wht

# Standard error of the proportion difference assuming H0 p_call_blk = p_call_wht = p_called
# (both groups have the same size, hence the factor of 2)
prop_diff_stderr = np.sqrt(2*p_called*(1-p_called)/numof_blk_applicants)

print(f'The standard error is = {prop_diff_stderr}')

z_stat = abs(prop_diff - mu_zero)/prop_diff_stderr

print(f'\nz-statistic = {z_stat}')

# 95% confidence interval: observed difference ± margin of error.
# The margin of error uses the critical z-value (not the observed z-statistic)
# and the unpooled standard error, because the interval does not assume H0.
z_crit = stats.norm.ppf(1 - alpha/2)
ci_stderr = np.sqrt(p_call_wht*(1-p_call_wht)/numof_wht_applicants
                    + p_call_blk*(1-p_call_blk)/numof_blk_applicants)
ME_frq = z_crit * ci_stderr
print(f'\nThe margin of error ME is = ±{ME_frq:.6f}')

# confidence interval.

conf_int = (prop_diff - ME_frq, prop_diff + ME_frq) 
print(f'\n95% Confidence Interval for the difference in the proportion of whites getting a call back vs blacks is \
between {conf_int[0]:.6f} and {conf_int[1]:.6f}')

# two-sided p-value (conservative for the one-sided Ha)
p_val = stats.norm.sf(z_stat)*2

if p_val < alpha :
    print(f'\nThe p-value is : {p_val} < {alpha}. We reject the null hypothesis,\
    \n"H0: Black applicants are equally as likely to get a call from employers as White applicants" \
    \nand suggest the alternate "Ha : Black applicants are less likely to get a call from employers than White applicants"')
else:
    print(f'The p-value is: {p_val} > {alpha}.We fail to reject the null hypothesis H0. \
    There is a likelihood Black applicants are equally as likely to get a call from employers as White applicants')

The standard error is = 0.007796894036170457

z-statistic = 4.108412152434346

The margin of error ME is = ±0.015255

95% Confidence Interval for the difference in the proportion of whites getting a call back vs blacks is between 0.016778 and 0.047288

The p-value is : 3.983886837585077e-05 < 0.05. We reject the null hypothesis,    
"H0: Black applicants are equally as likely to get a call from employers as White applicants"     
and suggest the alternate "Ha : Black applicants are less likely to get a call from employers than White applicants"
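
As a cross-check on the z-test, the same counts can be arranged as a 2×2 table and passed to a Pearson chi-square test. For a 2×2 table without continuity correction, the chi-square statistic equals the square of the pooled two-proportion z-statistic, so the two approaches should agree. This is a sketch that hard-codes the counts printed above; it is not part of the original exercise.

```python
import numpy as np
from scipy import stats

# Rows: white-sounding, black-sounding; columns: callback, no callback.
table = np.array([[235, 2435 - 235],
                  [157, 2435 - 157]])

# correction=False disables the Yates continuity correction so that
# the statistic matches the squared z-statistic.
chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)

# Pooled two-proportion z-statistic recomputed from the same counts.
n = 2435
p_pool = (235 + 157) / (2 * n)
z = (235 / n - 157 / n) / np.sqrt(2 * p_pool * (1 - p_pool) / n)

print(f"chi-square = {chi2:.3f}, z^2 = {z**2:.3f}, p-value = {p_chi2:.2e}")
```

The chi-square p-value corresponds to the two-sided z-test p-value, since squaring the statistic discards the direction of the difference.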


In [32]:
# Using the permutation method

np.random.seed(42)

def permutation_sample(data1, data2):
    """Pool two data sets, scramble the pooled data, and split it back into samples of the original sizes."""
    data = np.concatenate((data1, data2))
    permuted_data = np.random.permutation(data)
    perm_smpl1 = permuted_data[:len(data1)]
    perm_smpl2 = permuted_data[len(data1):]

    return perm_smpl1, perm_smpl2


def draw_perm_replicate(data1, data2, func, size=1):
    """Use permutation_sample to generate "size" replicates of the statistic computed by "func"."""
    perm_replicates = np.empty(size)

    for i in range(size):
        perm_samp1, perm_samp2 = permutation_sample(data1, data2)
        perm_replicates[i] = func(perm_samp1, perm_samp2)

    # return only after every replicate has been filled in
    return perm_replicates

called = True
not_called = False

blk = np.array([called] * int(numof_blk_applicants_called)+ [not_called] * int(numof_blk_applicants - numof_blk_applicants_called))
wht = np.array([called] * int(numof_wht_applicants_called)+ [not_called] * int(numof_wht_applicants - numof_wht_applicants_called))

def diff_frac(blks,whts):
    frac_blks = np.sum(blks)/len(blks)
    frac_whts = np.sum(whts)/len(whts)
    return frac_whts - frac_blks

# Permutation replicates (simulated under H0, where the race labels are exchangeable)

perm_reps = draw_perm_replicate(blk,wht,diff_frac,10000)

# Central 95% range of the difference under H0 (a null distribution, not a confidence interval)
blk1,blk2 = np.percentile(perm_reps,[2.5,97.5])
print(f'\nUnder H0, 95% of the permuted differences in callback proportion (white minus black) fall \
between {blk1:.6f} and {blk2:.6f}')

# One-sided p-value: fraction of permuted differences at least as large as the observed one
p = np.sum(perm_reps >= prop_diff)/len(perm_reps)


if p < alpha :
    print(f'\nThe p-value is : {p} < {alpha}. We reject the null hypothesis,\
    \n"H0: Black applicants are equally as likely to get a call from employers as White applicants" \
    \nand suggest the alternate "Ha : Black applicants are less likely to get a call from employers than White applicants"')
else:
    print(f'The p-value is: {p} > {alpha}.We fail to reject the null hypothesis H0.\
    There is a likelihood Black applicants are equally as likely to get a call from employers as White applicants')


In [29]:
# Using the Bootstrap Method

def bootstrap_replicate_1d(data, func):
    """Creating a bootstrap replicate (a statistic computed from a bootstrap sample)"""
    # create a bootstrap sample (the data resampled with replacement)
    bs_sample = np.random.choice(data,size=len(data))
    bs_replicate = func(bs_sample)
    
    return bs_replicate


def draw_bs_rep(data, func, size=1):
    """Draw bootstrap replicates"""
    #initialize array of replicates
    btstrp_reps = np.empty(size)
    
    #Generate replicates
    for i in range(size):
        btstrp_reps[i] = bootstrap_replicate_1d(data,func)
        
    return btstrp_reps

called = True
not_called = False

blk = np.array([called] * int(numof_blk_applicants_called)+ [not_called] * int(numof_blk_applicants - numof_blk_applicants_called))
wht = np.array([called] * int(numof_wht_applicants_called)+ [not_called] * int(numof_wht_applicants - numof_wht_applicants_called))

def race_proportion(race):
    frac_race = np.sum(race)/len(race)
    return frac_race 


# Bootstrap Replicates

bs_reps_blk = draw_bs_rep(blk,race_proportion,10000)
bs_reps_wht = draw_bs_rep(wht,race_proportion,10000)

bs_replicate = bs_reps_wht - bs_reps_blk

# 95% Confidence interval
blk1,blk2 = np.percentile(bs_replicate,[2.5,97.5])
print(f'\nWe have 95% confidence that the difference in the proportion of whites getting a call back vs blacks is \
between {blk1:6f} and {blk2:6f}')


# Approximate one-sided bootstrap p-value: the fraction of bootstrap differences at or below zero,
# i.e. how often resampling shows no advantage for white-sounding names.
# (Comparing against prop_diff would give about 0.5 by construction, because the
# bootstrap replicates are centered on the observed difference.)
p = np.sum(bs_replicate <= 0)/len(bs_replicate)


if p < alpha :
    print(f'\nThe p-value is : {p} < {alpha}. We reject the null hypothesis,\
    \n"H0: Black applicants are equally as likely to get a call from employers as White applicants" \
    \nand suggest the alternate "Ha : Black applicants are less likely to get a call from employers than White applicants"')
else:
    print(f'The p-value is: {p} > {alpha}.We fail to reject the null hypothesis H0.\
    There is a likelihood Black applicants are equally as likely to get a call from employers as White applicants')


We have 95% confidence that the difference in the proportion of whites getting a call back vs blacks is between 0.016838 and 0.047228
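
Because each observation is a 0/1 outcome, resampling a group of `n` résumés with replacement is equivalent to drawing the number of callbacks from a binomial distribution with the observed callback rate. The sketch below uses that equivalence to produce bootstrap replicates without an explicit Python loop. The seed, the variable names, and the use of `np.random.default_rng` are assumptions made for illustration; the counts are those printed earlier in the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

n = 2435          # résumés per group
n_reps = 10_000   # number of bootstrap replicates

# Number of callbacks in each bootstrap resample, converted to a rate.
wht_rates = rng.binomial(n, 235 / n, size=n_reps) / n
blk_rates = rng.binomial(n, 157 / n, size=n_reps) / n

# Bootstrap replicates of the difference in callback rates (white minus black).
diff_reps = wht_rates - blk_rates

ci_low, ci_high = np.percentile(diff_reps, [2.5, 97.5])
print(f"95% bootstrap interval for the difference: {ci_low:.6f} to {ci_high:.6f}")
```

The resulting interval should be close to the one reported by the loop-based version, up to random variation from the different random number generator.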


<div class="span5 alert alert-success">
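<p> <h4> 4. Write a story describing the statistical significance in the context of the original problem.</h4>
<p> Résumés with white-sounding names received callbacks 9.65% of the time (235 of 2,435), against 6.45% (157 of 2,435) for otherwise identical résumés with black-sounding names, a gap of about 3.2 percentage points. The two-proportion z-test gives z ≈ 4.11 and a p-value of about 0.00004, well below 0.05, and the bootstrap 95% confidence interval for the gap (roughly 1.7 to 4.7 percentage points) excludes zero. We therefore reject the null hypothesis
"H0: Black applicants are equally as likely to get a call from employers as White applicants" in favor of the alternate "Ha : Black applicants are less likely to get a call from employers than White applicants"</p>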
   <h4> 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?</h4>
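<p> From the analysis, race is an important factor, but the analysis does not show that it is the most important one, because race is the only variable examined. The experiment assigned names at random to otherwise identical résumés, so it isolates the effect of the name; it says nothing about how that effect compares with education, skills, or experience. To compare factors, the analysis could be extended to model callbacks as a function of race together with the other résumé attributes in the dataset (for example education and yearsexp), such as with a logistic regression.</p> 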
 </p>
</div>
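
Statistical significance says the gap is unlikely to be due to chance, but not how large it is. One way to describe the size of the effect is the ratio of callback rates and the odds ratio, sketched below from the counts printed in this notebook. These measures describe the effect of the name alone; they do not rank race against other résumé attributes, and the variable names here are illustrative.

```python
# Counts taken from the notebook output: callbacks and résumés per group.
calls_w, calls_b, n = 235, 157, 2435

rate_w = calls_w / n
rate_b = calls_b / n

# Ratio of callback rates (white-sounding relative to black-sounding names).
risk_ratio = rate_w / rate_b

# Odds ratio: odds of a callback for white-sounding vs black-sounding names.
odds_ratio = (calls_w / (n - calls_w)) / (calls_b / (n - calls_b))

print(f"callback rate ratio (white/black): {risk_ratio:.2f}")
print(f"odds ratio (white/black): {odds_ratio:.2f}")
```

A rate ratio of about 1.5 means résumés with white-sounding names received roughly 50% more callbacks than those with black-sounding names in this sample.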