# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [43]:
import pandas as pd
import numpy as np
from scipy import stats

In [44]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [45]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [46]:
data.tail()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
4865,b,99,3,2,1,0,0,0,1,313,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,Private
4866,a,99b,4,4,6,0,0,0,0,285,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,
4867,a,99b,4,6,8,0,1,0,0,21,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,
4868,a,99b,4,4,2,0,1,1,0,267,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,
4869,a,99b,4,3,7,0,0,0,1,274,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,


In [47]:
# keep only the columns relevant to the analysis
df = data[['race', 'call']]
# resumes with black-sounding names
dfb = df.loc[df['race'] == 'b']
# resumes with white-sounding names
dfw = df.loc[df['race'] == 'w']
# black-sounding names that received no callback
# (each mask is built from the frame it filters, not from df)
dfbnc = dfb.loc[dfb['call'] == 0]
# black-sounding names that received a callback
dfbc = dfb.loc[dfb['call'] == 1]
# white-sounding names that received no callback
dfwnc = dfw.loc[dfw['call'] == 0]
# white-sounding names that received a callback
dfwc = dfw.loc[dfw['call'] == 1]
print('The number of black sounding names that did not get called back is:', len(dfbnc))
print('The number of black sounding names that did get called back is:', len(dfbc))
print('The number of white sounding names that did not get called back is:', len(dfwnc))
print('The number of white sounding names that did get called back is:', len(dfwc))

The number of black sounding names that did not get called back is: 2278
The number of black sounding names that did get called back is: 157
The number of white sounding names that did not get called back is: 2200
The number of white sounding names that did get called back is: 235


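A two-sample z-test for the difference between two proportions is appropriate here, because the outcome is binary (callback or no callback) and both groups are large. The central limit theorem can be applied to a proportion when each group contains at least 10 successes and at least 10 failures. Both groups far exceed that threshold (157 callbacks and 2278 non-callbacks for black-sounding names; 235 and 2200 for white-sounding names), and names were assigned to resumes at random, so the central limit theorem applies in this case.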
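The rule of thumb above can be checked directly. This is a minimal sketch that hard-codes the four counts printed earlier in the notebook; the threshold of 10 and the names `counts`, `MIN_COUNT`, and `clt_ok` are illustrative assumptions, not part of the original analysis.

```python
# Callback counts copied from the notebook output above.
counts = {
    "black": {"called": 157, "not_called": 2278},
    "white": {"called": 235, "not_called": 2200},
}

# Rule-of-thumb threshold (assumption): at least 10 successes and 10 failures per group.
MIN_COUNT = 10

# True only if every cell in every group meets the threshold.
clt_ok = all(
    n >= MIN_COUNT
    for group in counts.values()
    for n in group.values()
)
print("CLT rule of thumb satisfied:", clt_ok)
```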

In [48]:
#probability of being called back if your name sounds white
pw = len(dfwc)/len(dfw)
#probability of being called back if your name sounds black
pb = len(dfbc)/len(dfb)
print("Probability of being called back if your name sounds white is:", pw)
print("Probability of being called back if your name sounds black is:", pb)

Probability of being called back if your name sounds white is: 0.09650924024640657
Probability of being called back if your name sounds black is: 0.06447638603696099


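The null hypothesis is that the callback rate is the same for resumes with black-sounding and white-sounding names, i.e. that race has no effect on employment screening. The alternate hypothesis is that the two callback rates differ, i.e. that race does have an effect.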
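Written out in symbols, the hypotheses and the usual pooled two-proportion test statistic look like this. The notation is an assumption introduced here for illustration: $p_w$ and $p_b$ are the true callback rates, $\hat{p}_w$ and $\hat{p}_b$ the observed rates, $x_w$ and $x_b$ the callback counts, and $n_w$ and $n_b$ the group sizes.

$$
H_0: p_w = p_b \qquad H_A: p_w \neq p_b
$$

$$
z = \frac{\hat{p}_w - \hat{p}_b}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_w} + \frac{1}{n_b}\right)}},
\qquad
\hat{p} = \frac{x_w + x_b}{n_w + n_b}
$$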

In [49]:
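# import ztest explicitly; the weightstats submodule is not guaranteed
# to be loaded by "import statsmodels.stats"
from statsmodels.stats.weightstats import ztest

# callback indicators (0 or 1) for each group
black = list(map(int, dfb['call'].tolist()))
white = list(map(int, dfw['call'].tolist()))

# returns (z statistic, two-sided p-value) for the difference in means
ztest(black, white)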


(-4.114705266723095, 3.8767444246104316e-05)
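Exercise 3 also asks for a margin of error and a confidence interval. The sketch below computes them from the counts printed earlier, using textbook two-proportion formulas and only the Python standard library. The 95% confidence level is an assumption (the exercise does not fix one), and the variable names are illustrative. Because it uses a pooled-proportion standard error rather than the sample variances that `ztest` estimates, its z statistic may differ slightly from the value above.

```python
from math import sqrt
from statistics import NormalDist

# Counts copied from the notebook output above.
calls_b, n_b = 157, 157 + 2278   # black-sounding names
calls_w, n_w = 235, 235 + 2200   # white-sounding names

p_b = calls_b / n_b
p_w = calls_w / n_w
diff = p_w - p_b                 # difference in callback rates

# Assumption: 95% confidence level.
confidence = 0.95
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Unpooled standard error, used for the confidence interval.
se_unpooled = sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
margin_of_error = z_crit * se_unpooled
ci_low, ci_high = diff - margin_of_error, diff + margin_of_error

# Pooled standard error, used for the test of H0: p_w == p_b.
p_pool = (calls_b + calls_w) / (n_b + n_w)
se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z_stat = diff / se_pooled
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))   # two-sided

print("Difference in callback rates:", diff)
print("Margin of error:", margin_of_error)
print("Confidence interval:", (ci_low, ci_high))
print("z statistic:", z_stat)
print("p-value:", p_value)
```

If the resulting interval excludes zero, that is consistent with rejecting the null hypothesis at the corresponding significance level.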

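We investigated whether race is a statistically significant factor in employment screening. The p-value (about 0.00004) is far below 1%, so we reject the null hypothesis: resumes with black-sounding names received significantly fewer callbacks (about 6.4%) than otherwise identical resumes with white-sounding names (about 9.7%). Because the names were assigned at random, this difference is evidence that prospective employers discriminate against applicants with black-sounding names when screening resumes.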

This analysis does not show that race is the most important factor in callback success across the whole United States; it shows only that race has a significant effect. To find the most important factor, the analysis would need to compare the effect of race with that of the other variables, such as years of experience or military experience, and to determine whether the observed effect is local, driven by local socioeconomic circumstances, rather than nationwide.