# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [5]:
w = data[data.race=='w']
b = data[data.race=='b']

In [44]:
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.figure_factory as ff

init_notebook_mode(connected=True)

In [21]:
# visualize the two distributions 
trace0 = go.Histogram(x=w.call, name='white')
trace1 = go.Histogram(x=b.call, name='black')

iplot([trace0, trace1], filename='black_white_hist.html')

The histogram above shows that 235 "white" resumes received callbacks, while only 157 "black" resumes received callbacks.

In [22]:
# visualize the two normalized distributions 
trace0 = go.Histogram(x=w.call, name='white', histnorm='probability')
trace1 = go.Histogram(x=b.call, name='black', histnorm='probability')

iplot([trace0, trace1], filename='black_white_hist.html')

As a percentage of the total, 6.45% of "black" resumes received callbacks while 9.65% of "white" resumes received callbacks.
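The counts and percentages above can be reproduced with a grouped summary. The following is a minimal sketch on a stand-in DataFrame rather than the real file: the name `toy` is hypothetical, and the group size of 2,435 resumes per race is an assumption inferred from the reported counts (235, 157) and rates (9.65%, 6.45%). On the real dataset the same `groupby` call would be applied to `data`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset. The group size (2,435) is an
# assumption inferred from the counts and percentages reported above.
n_per_group = 2435
toy = pd.DataFrame({
    'race': ['w'] * n_per_group + ['b'] * n_per_group,
    'call': [1.0] * 235 + [0.0] * (n_per_group - 235)
          + [1.0] * 157 + [0.0] * (n_per_group - 157),
})

# Callback count, group size, and callback rate (in percent) for each group
summary = toy.groupby('race')['call'].agg(['sum', 'count', 'mean'])
summary['rate_pct'] = (summary['mean'] * 100).round(2)
print(summary)
```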

### What test is appropriate for this problem? Does CLT apply?

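Each resume can be modeled as a Bernoulli trial because:
1. Each trial is assumed to be independent. Whether or not a resume receives a callback does not depend on whether any of the other resumes in the experiment received a callback.
2. Each trial has a pass/fail outcome (1 if the resume received a callback and 0 if not).

<br>
Because the question is whether two callback proportions differ, a two-sample test of proportions (a two-proportion z-test) is appropriate. The central limit theorem does apply here: the callback rate of each group is the sample mean of many independent 0/1 outcomes, and both groups are large (235 and 157 callbacks, with far more non-callbacks in each), so the sampling distribution of each rate, and of the difference between them, is approximately normal.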

### What are the null and alternate hypotheses?

The null hypothesis (H_0): The probability of a "white" resume receiving a callback is equal to the probability of a "black" resume receiving a callback.

<br>
The alternate hypothesis (H_A): The probability of a "white" resume receiving a callback is **not** equal to the probability of a "black" resume receiving a callback.

### Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
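
#### Frequentist approach:

One way to carry out the frequentist side of this exercise is a two-proportion z-test, assuming the normal approximation is acceptable for samples of this size. The sketch below is not taken from the original analysis. It uses only the callback counts reported earlier (235 and 157), and the group size of 2,435 resumes per race is an assumption inferred from those counts and the reported percentages. The p-value uses the pooled standard error (the rates are equal under H_0), while the confidence interval uses the unpooled standard error.

```python
import math
from statistics import NormalDist

# Callback counts reported earlier in the notebook. The group size of 2,435
# is an assumption inferred from those counts and the reported percentages.
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

p_w, p_b = calls_w / n_w, calls_b / n_b
diff = p_w - p_b

# z-test under H_0: both groups share a single pooled callback probability
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z = diff / se_pooled
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

# 95% confidence interval for the difference, using the unpooled standard error
se_unpooled = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * se_unpooled
ci = (diff - margin, diff + margin)

print(f"difference in callback rates: {diff:.4f}")
print(f"z = {z:.2f}, p-value = {p_value:.2g}")
print(f"margin of error: {margin:.4f}, 95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```

If the confidence interval excludes zero and the p-value falls below the chosen significance level, the data are inconsistent with equal callback probabilities.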

#### Bootstrapping approach:

In [53]:
# Seed random number generator
np.random.seed(42)

# Simulate 10,000 batches of 100 resumes per group, drawing the number of
# callbacks from a binomial distribution with the observed callback rates
# (a parametric simulation rather than a resampling of the observed data)
n_callbacks_black = np.random.binomial(100, 0.0645, size=10000)
n_callbacks_white = np.random.binomial(100, 0.0965, size=10000)
# plot the simulated distributions with a fitted normal curve
fig = ff.create_distplot([n_callbacks_black, n_callbacks_white], group_labels=['black', 'white'], curve_type='normal')

iplot(fig, filename='distribution.html')

In [42]:
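# plot the empirical CDFs of the simulation results for both groups
trace0 = go.Histogram(x=n_callbacks_white, name='white', cumulative={'enabled': True}, histnorm='probability')
trace1 = go.Histogram(x=n_callbacks_black, name='black', cumulative={'enabled': True}, histnorm='probability')

iplot([trace0, trace1], filename='binomial_hist.html')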

In [None]:
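import matplotlib.pyplot as plt

# ecdf was not defined in this notebook; this is a standard implementation
def ecdf(values):
    """Return the sorted values and their empirical CDF heights."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Compute CDF: x, y
x, y = ecdf(n_callbacks_black)

# Plot the CDF with axis labels
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('number of callbacks out of 100 resumes')
_ = plt.ylabel('CDF')

# Show the plot
plt.show()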

In [24]:
import matplotlib.pyplot as plt

# Initialize the number of successes
n_successes = np.empty(1000)

# Compute the number of successes in 100 Bernoulli trials, 1000 times
# (np.random.binomial stands in for the undefined perform_bernoulli_trials)
for i in range(1000):
    n_successes[i] = np.random.binomial(100, 0.5)


# Plot the histogram with default number of bins; label your axes
_ = plt.hist(n_successes, density=True)
_ = plt.xlabel('number of successes out of 100 trials')
_ = plt.ylabel('probability')

# Show the plot
plt.show()
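
The simulations above draw from a binomial distribution rather than resampling the observed data. A resampling version of the analysis could look like the sketch below, which is an illustration under stated assumptions and not the original author's method. It rebuilds the two 0/1 callback arrays from the reported counts (235 and 157), assumes 2,435 resumes per group, computes a percentile bootstrap confidence interval for the difference in callback rates, and estimates a two-sided p-value with a permutation test that shuffles the race labels.

```python
import numpy as np

rng = np.random.default_rng(42)

# Rebuild the 0/1 callback arrays from the reported counts. The group size
# of 2,435 per race is an assumption inferred from the reported percentages.
n_w, n_b = 2435, 2435
white = np.concatenate([np.ones(235), np.zeros(n_w - 235)])
black = np.concatenate([np.ones(157), np.zeros(n_b - 157)])
observed_diff = white.mean() - black.mean()

n_reps = 10_000

# Bootstrap: resample each group with replacement and record the difference
boot_diffs = np.empty(n_reps)
for i in range(n_reps):
    boot_w = rng.choice(white, size=n_w, replace=True)
    boot_b = rng.choice(black, size=n_b, replace=True)
    boot_diffs[i] = boot_w.mean() - boot_b.mean()

# 95% percentile interval and the corresponding margin of error
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
margin = (ci_high - ci_low) / 2

# Permutation test of H_0: shuffle the pooled outcomes so that race labels
# carry no information, then see how often the difference is as extreme
pooled = np.concatenate([white, black])
perm_diffs = np.empty(n_reps)
for i in range(n_reps):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_w].mean() - shuffled[n_w:].mean()
p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))

print(f"observed difference: {observed_diff:.4f}")
print(f"bootstrap 95% CI: ({ci_low:.4f}, {ci_high:.4f}), margin: {margin:.4f}")
print(f"permutation p-value: {p_value:.4f}")
```

With 10,000 permutations the smallest nonzero p-value this procedure can report is 0.0001, so a printed value of 0 should be read as "less than 1 in 10,000".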


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>