# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.
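
As a minimal sketch of how this layout leads to a per-group callback rate, the snippet below uses a handful of made-up rows (hypothetical values, not taken from the dataset) and plain Python so that it runs without the data file. The helper name `callback_rates` is introduced here for illustration only.

```python
from collections import defaultdict

# Hypothetical toy rows in the same shape as the dataset: one entry per resume.
rows = [
    {"race": "b", "call": 0},
    {"race": "w", "call": 1},
    {"race": "b", "call": 1},
    {"race": "w", "call": 0},
    {"race": "w", "call": 1},
    {"race": "b", "call": 0},
]

def callback_rates(rows):
    """Return the fraction of resumes with call == 1 for each race value."""
    calls = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[row["race"]] += 1
        calls[row["race"]] += row["call"]
    return {race: calls[race] / totals[race] for race in totals}

rates = callback_rates(rows)
print(rates)
```

On the real DataFrame, `data.groupby('race').call.mean()` should give the equivalent per-group rates.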

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [34]:
import pandas as pd
import numpy as np
from scipy import stats

In [35]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [36]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [37]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

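A two-sample z-test for the difference between two proportions is appropriate here: each résumé yields a binary outcome (callback or no callback), and the question is whether the callback rate differs between two randomly assigned groups. The CLT applies. It states that the sum or mean of a sufficiently large number of independent draws of a random variable with finite variance, continuous or discrete, is approximately normally distributed. Each callback is a Bernoulli draw and both groups are large (the white-sounding group alone has 235 callbacks), so the sample proportions and their difference are approximately normal.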
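
The CLT claim can be checked with a small simulation. This is only a sketch under assumed values: the callback probability `p`, the sample size `n`, and the number of simulated samples are hypothetical and not taken from the dataset. It uses the standard library so it runs on its own.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

p = 0.08          # hypothetical callback probability, not taken from the data
n = 1000          # hypothetical number of resumes per sample
n_samples = 1000  # number of simulated samples

def sample_proportion(p, n):
    """Proportion of successes among n independent Bernoulli(p) draws."""
    return sum(random.random() < p for _ in range(n)) / n

proportions = [sample_proportion(p, n) for _ in range(n_samples)]

# CLT prediction for the sampling distribution of the proportion
predicted_mean = p
predicted_sd = (p * (1 - p) / n) ** 0.5

print('simulated mean:', statistics.mean(proportions), 'predicted:', predicted_mean)
print('simulated sd:  ', statistics.stdev(proportions), 'predicted:', predicted_sd)

# Under normality roughly 95% of sample proportions fall within 1.96 sd of p
within = sum(abs(x - p) <= 1.96 * predicted_sd for x in proportions) / n_samples
print('share within 1.96 sd:', within)
```

If the normal approximation holds, the simulated mean and standard deviation should sit close to the predicted ones even though each individual draw is only 0 or 1.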

Null hypothesis: H0 = Race has no effect on the callback rate, i.e. the callback rates for black-sounding and white-sounding names are equal.

Alternative hypothesis: H1 = Race has an effect on the callback rate, i.e. the two callback rates differ.
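
One way to write the hypotheses and the usual pooled test statistic formally is sketched below. The symbols are introduced here for illustration and are not used elsewhere in the notebook: $p_w$ and $p_b$ are the true callback rates, $\hat{p}_w$ and $\hat{p}_b$ the sample rates, $n_w$ and $n_b$ the group sizes, and $\hat{p}$ the pooled callback rate over both groups.

$$
H_0: p_w = p_b \qquad H_1: p_w \neq p_b
$$

$$
z = \frac{\hat{p}_w - \hat{p}_b}{\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{n_w} + \dfrac{1}{n_b}\right)}}
$$

Under $H_0$ and the CLT, $z$ is approximately standard normal.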

In [61]:
w = data[data.race=='w']
b = data[data.race=='b']

In [266]:
# Your solution to Q3 here
# H0 = mean is the same
# H1 = mean is different.
from scipy import stats
# Using Bootstrap

def bs_perm_rep_diff(data1,data2,func,size=1):
    """Permutation replicates: pool the 2 series, shuffle the pooled values, split them back into 2 groups of the original sizes, and return func(group1) - func(group2) for each of `size` shuffles."""
    pooled = np.concatenate((data1,data2))
    res = np.empty(size)
    for i in range(size):
        # Shuffle the pooled values so that group labels are reassigned at random
        shuffled = np.random.permutation(pooled)
        data1_p = shuffled[:len(data1)]
        data2_p = shuffled[len(data1):]
        res[i] = func(data1_p) - func(data2_p)
    return(res)

# Calculate the difference in proportion/mean
obs_mean_diff = w.call.mean() - b.call.mean()
print('The observed mean difference is:',obs_mean_diff, w.call.mean() - b.call.mean())

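# Permutation replicates of the difference under H0 (group labels are exchangeable)
wb_bs_reps = bs_perm_rep_diff(w.call,b.call,np.mean,10000)
# One-sided p-value: share of replicates at least as large as the observed difference
p = sum(wb_bs_reps>=obs_mean_diff)/len(wb_bs_reps)
print('The significance level of p via bootstrap is :',p)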



def bs_samples(data1,data2):
    data1_p = np.random.choice(data1,len(data1))
    data2_p = np.random.choice(data2,len(data2))
    return(data1_p,data2_p)

def mean_diff(data1,data2):
    """Difference between the means of the two samples."""
    return(data1.mean() - data2.mean())

def bs_reps(data1,data2,func,size=1):
    res = np.empty(size)
    for i in range(size):
        data1_p , data2_p = bs_samples(data1,data2)
        res[i] = func(data1_p,data2_p)
    return(res)

res_p = bs_reps(w.call,b.call,mean_diff,10000)

print('Bootstrap confidence interval is: ',np.percentile(res_p,[2.5,97.5]), res_p.mean())



# Using Frequentist

# Reject the null hypothesis if P(a difference P1 - P2 this large | H0 is true) < 0.05.
# Let's figure out the z-score of the observed value.
# z = (observed difference - difference assumed under H0) / standard error of (P1 - P2)
# Under H0 both groups share one proportion P, so the standard error of (P1 - P2)
# is √(P(1-P)(1/n1 + 1/n2)), where n1 and n2 are the two group sizes.
# Here we estimate P by the combined (pooled) callback rate.
mean_com = sum(data.call)/(len(data))
z_diff = (obs_mean_diff - 0) / np.sqrt(mean_com*(1-mean_com)*(1/len(w) + 1/len(b)))
# One-sided p-value: upper-tail probability of a z-score at least this large
p_com = stats.norm.sf(z_diff)
print ('The frequentist p-value of having {:.3f} as z-score is : {:.2e}'.format(z_diff, p_com))

# Standard error of the difference without pooling: Sigma(P1-P2) = √(P1(1-P1)/n1 + P2(1-P2)/n2)
sig_diff = np.sqrt(w.call.mean()*(1-w.call.mean())/w.call.size + b.call.mean()*(1-b.call.mean())/b.call.size)
print('Standard error using frequentist in difference of 2 sets is: ', sig_diff)
print('Confidence interval using frequentist : ', stats.norm.ppf(0.025) * sig_diff + obs_mean_diff , stats.norm.ppf(0.975) * sig_diff + obs_mean_diff)


The observed mean difference is: 0.03203285485506058 0.03203285485506058
The significance level of p via bootstrap is : 0.0001
Bootstrap confidence interval is:  [ 0.01642711  0.0476386 ] 0.0320086243674
The frequentist p-value of having 4.108 as z-score is : 1.99e-05
Standard error using frequentist in difference of 2 sets is:  0.00778337058606
Confidence interval using frequentist :  0.016777728828 0.0472879808821


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

In [257]:
sum(np.random.choice(w.call,len(w.call)))

252.0