
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# pd.read_stata is the public entry point for reading Stata .dta files
data = pd.read_stata('job_market_data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [8]:
# What test is appropriate for this problem? Does CLT apply?
data.describe()

Unnamed: 0,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,...,educreq,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind
count,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,...,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0
mean,3.61848,3.661396,7.842916,0.052772,0.411499,0.097125,0.448049,215.637782,3.48152,0.559548,...,0.106776,0.437166,0.07269,0.082957,0.03039,0.08501,0.213963,0.267762,0.154825,0.165092
std,0.714997,1.219126,5.044612,0.223601,0.492156,0.296159,0.497345,148.127551,2.038036,0.496492,...,0.30886,0.496087,0.259654,0.275845,0.171676,0.278926,0.410143,0.442838,0.361775,0.371302
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,7.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,3.0,5.0,0.0,0.0,0.0,0.0,27.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.0,4.0,6.0,0.0,0.0,0.0,0.0,267.0,4.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.0,4.0,9.0,0.0,1.0,0.0,1.0,313.0,6.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
max,4.0,7.0,44.0,1.0,1.0,1.0,1.0,903.0,6.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [26]:
data[['call','race']].groupby('race').count()

      call
race      
b     2435
w     2435


In [33]:
data.call[(data['race'] == 'b')].head()

2    0.0
3    0.0
7    0.0
8    0.0
9    0.0
Name: call, dtype: float32

# What test is appropriate for this problem? Does CLT apply?

       call
race       
b     157.0
w     235.0
      call
race      
b     2435
w     2435
0.064476386037
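
The totals above suggest one reasonable answer to this question: each resume's outcome is a 0/1 (Bernoulli) variable, so a two-sample z-test for a difference in proportions fits, provided the CLT's normal approximation holds. A common rule of thumb (an assumption here; some texts use 5 rather than 10) asks for at least 10 callbacks and 10 non-callbacks in each group. A minimal sketch using the counts printed above; `clt_ok` and `MIN_COUNT` are hypothetical names introduced for illustration:

```python
# Counts copied from the notebook output above.
callbacks = {"b": 157, "w": 235}
resumes = {"b": 2435, "w": 2435}

# Rule-of-thumb threshold (an assumption; some texts use 5 instead of 10).
MIN_COUNT = 10


def clt_ok(successes, n, min_count=MIN_COUNT):
    """Return True when both successes and failures reach min_count."""
    return successes >= min_count and (n - successes) >= min_count


for group in ("b", "w"):
    rate = callbacks[group] / resumes[group]
    ok = clt_ok(callbacks[group], resumes[group])
    print("{}: callback rate = {:.4f}, CLT condition met: {}".format(group, rate, ok))
```

Because the names were assigned to resumes at random, the two groups can also be treated as independent samples, which the z-test requires.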
0.0965092402464


# What are the null and alternate hypotheses?

In [None]:
# H0: black-sounding and white-sounding resumes have the same callback rate (p_w - p_b = 0)
# H1: the callback rates for black-sounding and white-sounding resumes differ (p_w - p_b != 0)
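
Stated symbolically, the hypotheses and the pooled two-sided test statistic can be sketched as follows. The notation is introduced here for illustration: $p_w$ and $p_b$ are the true callback probabilities, $x_w$ and $x_b$ the observed callback counts, and $n_w$ and $n_b$ the number of resumes in each group.

```latex
\begin{aligned}
H_0 &: p_w - p_b = 0 \\
H_1 &: p_w - p_b \neq 0 \\
z &= \frac{\hat{p}_w - \hat{p}_b}
          {\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_w} + \frac{1}{n_b}\right)}},
\qquad
\hat{p} = \frac{x_w + x_b}{n_w + n_b}
\end{aligned}
```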

# Compute margin of error, confidence interval, and p-value.

In [121]:
import numpy as np
# Group once, then report callbacks and resume counts per race
calls_by_race = data[['call','race']].groupby('race')
print('sum {}'.format(calls_by_race.sum()))
print('')
print('count {}'.format(calls_by_race.count()))

sum        call
race       
b     157.0
w     235.0

count       call
race      
b     2435
w     2435


In [122]:
# Callback rate (sample proportion) for each group
p2 = data.call[(data['race'] == 'b')].sum()/data.call[(data['race'] == 'b')].count()
p1 = data.call[(data['race'] == 'w')].sum()/data.call[(data['race'] == 'w')].count()
n = 2435.0  # resumes per group

p1_p2 = p1-p2

print('diff between sample proportions: {}'.format(p1_p2))

# Variance of each sample proportion: p*(1-p)/n
v1 = (p1*(1-p1))/n
v2 = (p2*(1-p2))/n

print('callback rate of white: {}'.format(p1))
print('callback rate of black: {}'.format(p2))

print('variance of white callback rate: {}'.format(v1))
print('variance of black callback rate: {}'.format(v2))

# Standard error of the difference (unpooled): sqrt of the summed variances
p_sd = np.sqrt(v1 + v2)

print('standard error of the difference: {}'.format(p_sd))

diff between sample proportions: 0.0320328542094
callback rate of white: 0.0965092402464
callback rate of black: 0.064476386037
variance of white callback rate: 3.5809119833e-05
variance of black callback rate: 2.47717378565e-05
standard error of the difference: 0.00778337058668


In [62]:
# Margin of error at 95% confidence: critical value 1.96 times the standard error
moe = 1.96*(p_sd)
print('diff between sample proportions: {}'.format(p1_p2))
print('margin of error (1.96 * standard error): {}'.format(moe))
print('ANS: 95% confidence interval for the true difference: {} to {}'.format(p1_p2 - moe, p1_p2 + moe))

# The interval excludes 0, so H0 is rejected at the 5% significance level.


diff between sample proportions: 0.0320328542094
margin of error (1.96 * standard error): 0.0152554063499
ANS: 95% confidence interval for the true difference: 0.0167774478596 to 0.0472882605593
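
The interval calculation can also be wrapped in a small reusable function that derives the critical value from the confidence level instead of hard-coding 1.96. This is a sketch, not the notebook's own method: `diff_proportion_ci` is a hypothetical helper, the counts are copied from the outputs above, and the standard library's `statistics.NormalDist` (Python 3.8+) is used so the block runs without the data file.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Return (difference, margin of error, (low, high)) for p1 - p2.

    Uses the unpooled standard error, as in the cell above.
    """
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Two-sided critical value, roughly 1.96 for 95% confidence.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    margin = z_crit * se
    return diff, margin, (diff - margin, diff + margin)


# White-sounding: 235 of 2435 callbacks; black-sounding: 157 of 2435.
diff, margin, (low, high) = diff_proportion_ci(235, 2435, 157, 2435)
print("difference: {:.4f}".format(diff))
print("margin of error: {:.4f}".format(margin))
print("95% CI: {:.4f} to {:.4f}".format(low, high))
```

Passing a different `confidence` (for example 0.99) widens the interval without any other change to the code.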


# Discuss statistical significance.

In [112]:
print('assuming H0 is true, how likely is a difference of at least {} ?'.format(p1_p2))

assuming H0 is true, how likely is a difference of at least 0.0320328542094 ?


In [107]:
# Under H0 the callback probabilities for black- and white-sounding resumes are equal,
# so pool both groups: total callbacks divided by total resumes (2*n)

combined_p = (data.call[(data['race'] == 'b')].sum() + data.call[(data['race'] == 'w')].sum())/(2*n)
# Pooled standard error of the difference; both groups contain n resumes
sd = np.sqrt(((2*combined_p)*(1-combined_p))/(n))

print('p1_p2: {}'.format(p1_p2))

z = (p1_p2 - 0)/(sd)

print('sd: {}'.format(sd))
print('ANS: < 0.01% chance (two-sided) of getting a z score at least as extreme as: {} given H0 is true'.format(z))

p1_p2: 0.0320328542094
sd: 0.00779689403617
ANS: < 0.01% chance (two-sided) of getting a z score at least as extreme as: 4.10841215243 given H0 is true
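
The cell above reports a z score but never computes the p-value itself. A minimal sketch of that last step follows, assuming a two-sided test and the pooled standard error. `two_proportion_ztest` is a hypothetical helper rather than a library function, and the counts are copied from the outputs above.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: p1 == p2, using the pooled SE."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Probability of a |z| at least this large under the standard normal.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# White-sounding: 235 of 2435 callbacks; black-sounding: 157 of 2435.
z, p_value = two_proportion_ztest(235, 2435, 157, 2435)
print("z = {:.4f}".format(z))
print("two-sided p-value = {:.2e}".format(p_value))
```

A p-value this far below 0.05 means the observed gap in callback rates would be very unlikely if names had no effect, so the difference is statistically significant. It does not by itself measure how large the effect is; the confidence interval does that.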
