# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding names. The 'call' column has two values, 1 and 0, indicating whether or not the resume received a call from an employer.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [4]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [5]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p><h3>Answer to Q1:</h3> <br>A chi-squared test is appropriate for this problem, since it determines whether the observed frequencies in one or more categories differ significantly from the frequencies expected under the null hypothesis. The CLT applies because the sample is large (well above the usual rule-of-thumb minimum of 30) and the race labels are assigned to the resumes at random.</p>
<p><h3>Answer to Q2:</h3> <br><li>Null hypothesis: race has no impact on the rate of callbacks for resumes.</li>
    <li>Alternate hypothesis: race has a significant impact on the rate of callbacks for resumes.</li></p>
</div>
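For a 0/1 outcome such as `call`, a common rule of thumb for the CLT is stricter than "more than 30 rows": each group should contain at least about 10 successes and 10 failures. The following is a minimal sketch of that check. The helper name `clt_ok` and the threshold of 10 are assumptions made for illustration, and the example array is synthetic, not the study data.

```python
import numpy as np

def clt_ok(calls, min_count=10):
    """Rule-of-thumb check (hypothetical helper): enough successes and failures."""
    calls = np.asarray(calls)
    successes = int(calls.sum())
    failures = int(calls.size - successes)
    return successes >= min_count and failures >= min_count

# Synthetic illustration only (not the study data): 200 resumes, 16 callbacks
example_calls = np.array([1] * 16 + [0] * 184)
print(clt_ok(example_calls))
```

In this notebook the same check could be run per group, for example `clt_ok(data[data.race=='w'].call)` and `clt_ok(data[data.race=='b'].call)`.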

In [11]:
w = data[data.race=='w']
b = data[data.race=='b']

<div class="span5 alert alert-success">
<p><h3>Answer to Q3:</h3><br>1. Run 10,000 resampling trials (strictly speaking a permutation test, since the calls are reshuffled rather than resampled with replacement): randomly reorder the calls, assign the first half to white-sounding names and the second half to black-sounding names, and compute the difference in the number of calls: <br><li>the 95% interval of the simulated differences is (-36, 36),</li> <li>the margin of error is 36,</li> <li>and the p-value, the fraction of simulated differences greater than or equal to the observed difference in calls between white-sounding and black-sounding names, is 0.</li> </p>
<p>2. Apply the chi-squared test:<br><li>the statistic is 15.52,</li><li>the p-value is 8.16e-5,</li><li>the central 95% range of the chi-squared distribution with one degree of freedom is (0.00, 5.02), which the observed statistic far exceeds.</li></p><br>
<p><b>Conclusion:</b> Both tests give p-values very close to 0 and well below 0.05, so we reject the null hypothesis in favor of the alternate hypothesis (race has a significant impact on the rate of callbacks for resumes).</p>
</div>

In [35]:
# the entire data size
n = len(data)
# observed difference in callbacks between white-sounding and black-sounding names
diff = np.sum(w.call) - np.sum(b.call)
# run 10,000 permutation trials
n_trials = 10000
bs_diff = np.empty(n_trials)
for i in range(n_trials):
    # shuffle the call column, breaking any link between race and callbacks
    reorder = np.random.permutation(data.call)
    # treat the first len(w) entries as white-sounding and the rest as black-sounding,
    # then record the difference in callbacks
    bs_diff[i] = np.sum(reorder[:len(w)]) - np.sum(reorder[len(w):])
# 95% interval of the simulated differences, margin of error (half the interval width), and one-sided p-value
bs_interval = np.percentile(bs_diff, [2.5, 97.5])
bs_margin = (bs_interval[1] - bs_interval[0]) / 2
pval = np.sum(bs_diff >= diff) / n_trials
bs_margin, bs_interval, pval

(36.0, array([-36.,  36.]), 0.0)
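The cell above shuffles the calls across both groups, which simulates the null hypothesis. A bootstrap in the usual sense instead resamples each group with replacement to estimate the uncertainty of the observed difference. The following is a minimal sketch of that alternative, not the method used above. The helper name `bootstrap_rate_diff`, the fixed seed, and the synthetic arrays are assumptions made for illustration.

```python
import numpy as np

def bootstrap_rate_diff(calls_a, calls_b, n_boot=10_000, seed=0):
    """Percentile bootstrap for the difference in callback rates (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(calls_a, dtype=float)
    b = np.asarray(calls_b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # resample each group with replacement, keeping its original size
        diffs[i] = rng.choice(a, size=a.size).mean() - rng.choice(b, size=b.size).mean()
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    # return the 95% interval and its half-width as the margin of error
    return lower, upper, (upper - lower) / 2

# Synthetic illustration only (not the study data)
group_a = np.array([1] * 100 + [0] * 900)   # 10% callback rate
group_b = np.array([1] * 60 + [0] * 940)    # 6% callback rate
lower, upper, margin = bootstrap_rate_diff(group_a, group_b, n_boot=2000)
```

With the notebook's data the call would be `bootstrap_rate_diff(w.call, b.call)`. An interval that excludes 0 would point to the same conclusion as the permutation test.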

In [44]:
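# Chi-squared goodness-of-fit test: statistic and p-value
# (expected counts: callbacks split evenly between the two groups)
statistic_chi, pval_chi = stats.chisquare([sum(w.call), sum(b.call)], f_exp=sum(data.call)/2)
# 2.5% and 97.5% quantiles of the chi-squared distribution with df=1
# (the range expected under the null hypothesis, not a confidence interval for the effect)
interval_chi = stats.chi2.ppf(q=0.025, df=1), stats.chi2.ppf(q=0.975, df=1)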
statistic_chi, pval_chi, interval_chi

(15.520408163265307,
 8.161930359704385e-05,
 (0.0009820691171752583, 5.023886187314888))
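The chi-squared quantiles above describe the null distribution of the test statistic, so they do not give a margin of error for the difference in callback rates. One frequentist way to get a margin of error, a confidence interval, and a p-value is a two-proportion z-test. The following is a minimal sketch of it. The helper name `two_proportion_ztest` is hypothetical, and the example counts are synthetic rather than taken from the study.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(success_a, n_a, success_b, n_b, alpha=0.05):
    """Two-sided z-test and CI for a difference in proportions (hypothetical helper)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    diff = p_a - p_b
    # pooled standard error: valid under the null hypothesis of equal rates
    p_pool = (success_a + success_b) / (n_a + n_b)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * stats.norm.sf(abs(z))
    # unpooled standard error: used for the confidence interval
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = stats.norm.ppf(1 - alpha / 2) * se
    return z, p_value, margin, (diff - margin, diff + margin)

# Synthetic illustration only: 100/1000 callbacks vs 60/1000 callbacks
z, p_value, margin, ci = two_proportion_ztest(100, 1000, 60, 1000)
```

With the notebook's data the call would be `two_proportion_ztest(sum(w.call), len(w), sum(b.call), len(b))`.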

<div class="span5 alert alert-success">
<p> <h3>Answer to Q4:</h3><br> Sadly, the above analysis indicates that employers are more likely to offer interviews to applicants with white-sounding names than to applicants with black-sounding names. </p><br>

<p><h3>Answer to Q5:</h3><br> This analysis does not show that race/name is the most important factor in callback success, because we did not analyze the influence of other factors, such as education. To understand the relationship between race/name and callback success more completely, I would include the other available factors in the analysis as well.</p>
</div>
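As a first step toward the amended analysis described in the answer to Q5, callback rates can be compared within each level of another factor. The following is a minimal sketch under stated assumptions: the helper name `callback_rate_by` is hypothetical, the example frame is synthetic, and a stratified table like this is only a rough check, not a full multivariate model.

```python
import pandas as pd

def callback_rate_by(df, factor, group="race", outcome="call"):
    """Mean callback rate for each (factor level, group) pair (hypothetical helper)."""
    return df.groupby([factor, group])[outcome].mean().unstack(group)

# Synthetic illustration only (not the study data)
example = pd.DataFrame({
    "race":      ["w", "w", "b", "b", "w", "w", "b", "b"],
    "education": [3, 3, 3, 3, 4, 4, 4, 4],
    "call":      [1, 0, 0, 0, 1, 1, 1, 0],
})
rates = callback_rate_by(example, "education")
print(rates)
```

With the notebook's data the call would be `callback_rate_by(data, 'education')`. If the gap between the 'w' and 'b' columns persists across education levels, education alone does not explain the difference.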