# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating a black-sounding or white-sounding name. The 'call' column has two values, 1 and 0, indicating whether or not the resume received a callback from an employer.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy as sp
%matplotlib inline

In [2]:
df = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(df[df.race=='b'].call)

157.0

In [4]:
df.head()

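| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |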


### 1. What test is appropriate for this problem? Does CLT apply?
> A two-sample z-test for a difference in proportions is appropriate for this problem because we are comparing the callback proportions of two groups. The question is whether the measured difference could plausibly happen by chance, or whether it is statistically significant.

### Does CLT apply?
The requirements for applying CLT to this problem are the following:

1. The size of each population should be large compared to the samples. We are comparing samples of ~2000 resumes representing white and black applicants. The actual populations behind these samples most likely number in the millions, so this requirement is met.

2. The samples for each group are large enough to justify using a normal distribution to model the difference in proportions (minimum 40 observations). This requirement is also met: there are more than 40 observations in each sample group.

3. The samples are independent. This requirement is met because the result of one submitted resume presumably had no effect on any other.

For these reasons, CLT does apply.
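
A common way to make requirement 2 concrete for proportions is the success/failure condition: each group should contain at least about 10 successes and 10 failures. The sketch below checks that rule of thumb. The helper name `normal_approx_ok` and the default threshold of 10 are assumptions chosen for illustration, not part of the original exercise.

```python
def normal_approx_ok(successes, n, threshold=10):
    """Return True when both successes and failures reach the threshold.

    Hypothetical helper: threshold=10 is a common rule of thumb,
    not a value taken from the exercise.
    """
    failures = n - successes
    return successes >= threshold and failures >= threshold
```

In this notebook it could be called once per group, passing the number of callbacks and the number of resumes sent for that group.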


### 2. What are the null and alternate hypotheses?



**null hypothesis**: proportion of responses to 'white' names = proportion of responses to 'black' names

**alternative hypothesis**: proportion of responses to 'white' names != proportion of responses to 'black' names
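
Written symbolically, the two-sided test can be sketched as follows. The notation is introduced here for illustration: $p_w$ and $p_b$ are the true callback proportions, $\hat{p}_w$ and $\hat{p}_b$ the sample proportions, $x$ the number of callbacks, and $n$ the number of resumes sent in each group.

$$
H_0: p_w = p_b \qquad H_A: p_w \neq p_b
$$

Under $H_0$ both groups share one proportion, so the usual test statistic uses a pooled estimate $\hat{p}$:

$$
z = \frac{\hat{p}_b - \hat{p}_w}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_b} + \frac{1}{n_w}\right)}}, \qquad \hat{p} = \frac{x_b + x_w}{n_b + n_w}
$$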

### 3. Compute margin of error, confidence interval, and p-value.

In [5]:
call_df = df[['race', 'call']]

# split data by race
w_df = call_df.loc[call_df['race'] == 'w']
b_df = call_df.loc[call_df['race'] == 'b']

# find number of resumes sent out
length_w = len(w_df['call'])
length_b = len(b_df['call'])

# find number of call backs
callbacks_w = w_df['call'].sum()
callbacks_b = b_df['call'].sum()

# find proportions of call backs to resumes sent
p_w = callbacks_w / length_w
p_b = callbacks_b / length_b

The population proportions are not known in this case, so the standard error is estimated from the sample proportions instead.

In [6]:
# calculate unpooled standard error of the difference in proportions
# (used for the confidence interval, which does not assume the null hypothesis)
std_err = np.sqrt(p_b * (1 - p_b) / length_b + p_w * (1 - p_w) / length_w)

# critical z value for a 95% confidence level
z_crit = 1.96

# calculate margin of error
margin_error = z_crit * std_err
print("Margin of error: %.3f" % margin_error)

Margin of error: 0.015

In [7]:
# calculate 95% confidence interval for the difference (p_b - p_w)
upper = (p_b - p_w) + margin_error
lower = (p_b - p_w) - margin_error

print("Lower bound of CI: %.3f" % lower)
print("Upper bound of CI: %.3f" % upper)

Lower bound of CI: -0.047
Upper bound of CI: -0.017


In [8]:
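# calculate pooled sample proportion (the null hypothesis assumes equal proportions)
pool_prop = (callbacks_b + callbacks_w) / (length_b + length_w)

# calculate pooled standard error and z-score
pool_std_err = np.sqrt(pool_prop * (1 - pool_prop) * ((1/length_b) + (1/length_w)))
z_score = (p_b - p_w) / pool_std_err
print("z-score: %.3f" % z_score)

# calculate two-sided p-value using scipy
p_value = 2 * (1 - sp.special.ndtr(abs(z_score)))
print("p-value: %.5f" % p_value)

z-score: -4.108
p-value: 0.00004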
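
As a cross-check on the calculation above, here is a minimal sketch of a two-proportion z-test that uses only the Python standard library. The function name `two_proportion_ztest` is hypothetical, and the counts in the usage lines are illustrative placeholders rather than values from the dataset. It uses the pooled standard error for the test and the unpooled standard error for the confidence interval.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2, confidence=0.95):
    """Two-sided z-test and confidence interval for p1 - p2.

    x1, x2: number of successes (callbacks); n1, n2: number of trials (resumes).
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # Pooled standard error: used for the test, since H0 assumes p1 == p2
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error: used for the confidence interval
    se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * se_unpooled

    return z, p_value, (diff - margin, diff + margin)


# Illustrative counts only (not taken from the dataset)
z, p_value, ci = two_proportion_ztest(30, 200, 45, 200)
print("z = %.3f, p = %.4f, CI = (%.3f, %.3f)" % (z, p_value, ci[0], ci[1]))
```

To apply it to this notebook, the callback counts and resume counts computed earlier for each group would be passed in place of the placeholder numbers.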


### 4. Write a story describing the statistical significance in the context of the original problem.

> The test above yields a p-value well below the conventional 0.05 threshold. In other words, if callback rates were truly the same for both groups, a difference as large as the one observed would be very unlikely to arise by chance. We can reject the null hypothesis and conclude that the race suggested by a name does play a role in whether or not a person gets a callback for a job interview.

### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

> No. The most this analysis can say is that race/name is a factor in callback success, not that it is the most important factor. The test looked at a single variable in isolation. Finding the most important factors would require a broader analysis that also includes the other variables present in this dataset, such as education and years of experience.