# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [5]:
w = data[data.race=='w']
b = data[data.race=='b']

In [6]:
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.figure_factory as ff

init_notebook_mode(connected=True)

In [7]:
# visualize the two distributions 
trace0 = go.Histogram(x=w.call, name='white')
trace1 = go.Histogram(x=b.call, name='black')

iplot([trace0, trace1], filename='black_white_hist.html')

The histogram above shows that 235 "white" resumes received callbacks, while only 157 "black" resumes received callbacks.

In [8]:
# visualize the two normalized distributions 
trace0 = go.Histogram(x=w.call, name='white', histnorm='probability')
trace1 = go.Histogram(x=b.call, name='black', histnorm='probability')

iplot([trace0, trace1], filename='black_white_hist.html')

As a percentage of the total, 6.45% of "black" resumes received callbacks while 9.65% of "white" resumes received callbacks.
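The percentages above follow directly from the callback counts and the group size. A minimal sketch, assuming the counts are typed in by hand rather than read from the dataset (the group size of 2435 is the value printed later in this notebook):

```python
# Callback counts read off the histogram; 2435 resumes per group is the
# sample size printed later in this notebook.
callbacks = {"white": 235, "black": 157}
n_per_group = 2435

rates = {race: count / n_per_group for race, count in callbacks.items()}
for race, rate in rates.items():
    print(f"{race}: {rate:.2%} of resumes received a callback")
```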

### 1. What test is appropriate for this problem? Does CLT apply?

A two-sample t-test can be used to compare the means of the black and white samples. Because `call` is coded as 0 or 1, each sample mean is the proportion of resumes that received a callback, so the test asks whether the callback rate is the same for both groups.

<br>
The central limit theorem applies here: each `call` value is an independent 0/1 outcome, and each group of 2,435 resumes contains well over ten callbacks and ten non-callbacks, so the sampling distribution of each sample proportion is approximately normal.
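One common way to check whether the normal approximation is reasonable for a proportion is the success/failure rule of thumb. The sketch below assumes a threshold of at least 10 callbacks and 10 non-callbacks per group; that threshold is a convention, not something taken from the exercise.

```python
n = 2435  # resumes per group, as printed later in this notebook
callbacks = {"white": 235, "black": 157}
THRESHOLD = 10  # conventional rule-of-thumb minimum (an assumption)

for race, successes in callbacks.items():
    failures = n - successes
    ok = successes >= THRESHOLD and failures >= THRESHOLD
    print(f"{race}: {successes} callbacks, {failures} non-callbacks, condition met: {ok}")

# True only if every group passes the success/failure check
clt_ok = all(s >= THRESHOLD and n - s >= THRESHOLD for s in callbacks.values())
```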

### 2. What are the null and alternate hypotheses?

The null hypothesis (H_0): The proportion of "white" resumes that receive a callback is equal to the proportion of "black" resumes receiving a callback.

p_w - p_b = 0

<br>
The alternate hypothesis (H_A): The proportion of "white" resumes that receive a callback is **NOT** equal to the proportion of "black" resumes receiving a callback.

p_w - p_b != 0

### 3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.

#### Frequentist Approach:

For the frequentist approach we can conduct a two-sample t-test for the difference of means of the two samples. In this case, the mean of each sample is equal to the proportion of callbacks. This is equivalent to testing whether the white and black samples come from the same distribution (null hypothesis), or they come from two different distributions (alternate hypothesis).

In [9]:
# the number of samples n is equal for both black and white sample populations
n = len(w.call)

# compute means
w_mean = np.mean(w.call)
b_mean = np.mean(b.call)

# compute standard deviations
w_std = np.std(w.call)
b_std = np.std(b.call)

print('Number of samples: %s' % n)

Number of samples: 2435


In [10]:
# compute the standard error
SE = np.sqrt((w_std**2)/n + (b_std**2)/n)

print('Standard Error: %s' % round(SE, 5))

Standard Error: 0.00778


I'm interested in testing my hypothesis at the 99% confidence level, which corresponds to a significance level of alpha=0.01. Now I can compute the two-sample t-statistic for the difference of means of my two samples. 

In [11]:
# compute the t-statistic
t = (w_mean - b_mean)/SE
t

4.115583422082968

In [12]:
# compute degrees of freedom (conservative choice: smaller sample size minus 1)
DF = n - 1

# compute the p-value associated with the t-stat
p_value = stats.t.sf(t, DF)*2
p_value

3.990507764697911e-05

This corresponds to a p-value of 0.00004, which is much smaller than my significance level of alpha=0.01. Therefore I can reject the null hypothesis at the 99% confidence level in favor of the alternative hypothesis that the proportion of white resumes receiving a callback is **NOT** equal to the proportion of black resumes receiving a callback.
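As a cross-check, the same hypothesis can be tested with a pooled two-proportion z-test, which is a common alternative for 0/1 outcomes. This is a sketch of that swapped-in technique, not the method used in the cells above; the counts are typed in by hand.

```python
from math import sqrt
from statistics import NormalDist

n_w = n_b = 2435       # resumes per group
x_w, x_b = 235, 157    # callbacks per group

p_w, p_b = x_w / n_w, x_b / n_b
# Under H_0 both groups share one callback rate, so pool the counts.
p_pool = (x_w + x_b) / (n_w + n_b)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))

z_stat = (p_w - p_b) / se_pool
# Two-sided p-value from the standard normal distribution.
p_two_sided = 2 * NormalDist().cdf(-abs(z_stat))
print(f"z = {z_stat:.2f}, two-sided p-value = {p_two_sided:.1e}")
```

With samples this large, the pooled z-statistic should land close to the t-statistic computed above, so both tests point to the same conclusion.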

Since we have support for the alternate hypothesis, we can calculate the difference between the two proportions and the 99% confidence intervals.

In [13]:
# difference in sample means
diff = round(w_mean - b_mean, 3)

# critical value for a two-sided 99% CI (for n this large, t is close to z = 2.576)
t_star = 2.576

# calculate upper and lower CI
lower_CI = round(diff - t_star*SE, 3)
upper_CI = round(diff + t_star*SE, 3)

print(f"The difference in proportion of callbacks between black and \
white populations is {diff} with a lower 99% CI of {lower_CI} and \
an upper 99% CI of {upper_CI}")

The difference in proportion of callbacks between black and white populations is 0.032 with a lower 99% CI of 0.012 and an upper 99% CI of 0.052
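Question 3 also asks for the margin of error, which is the half-width of the confidence interval: the critical value times the standard error. The sketch below assumes a normal approximation (reasonable for samples this large) and derives the two-sided critical value from the confidence level instead of hard-coding it.

```python
from math import sqrt
from statistics import NormalDist

n = 2435
p_w, p_b = 235 / n, 157 / n

# Unpooled standard error of the difference in proportions.
se = sqrt(p_w * (1 - p_w) / n + p_b * (1 - p_b) / n)

confidence = 0.99
# Two-sided critical value: leave (1 - confidence) / 2 in each tail.
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin_of_error = z_crit * se
diff = p_w - p_b
print(f"critical value  = {z_crit:.3f}")
print(f"margin of error = {margin_of_error:.3f}")
print(f"interval        = ({diff - margin_of_error:.3f}, {diff + margin_of_error:.3f})")
```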


#### Bootstrapping approach:

This problem can be modeled as a Bernoulli trial because:
1. Each trial is assumed to be independent. Whether or not a resume receives a callback does not depend on whether any of the other resumes in the experiment received a callback or not.
2. Each trial has a pass/fail outcome (1 if the resume received a callback and 0 if not).

For the bootstrapping approach, we can thus model the two resume samples using np.random.binomial, take the difference of their means (proportion of callbacks), and build a bootstrap population of these differences. Because the samples are simulated from the observed callback rates rather than resampled from the data, this is a parametric bootstrap. Then we can use a z-test to test the null and alternate hypotheses.

The function below creates a "black" and "white" bootstrap sample, computes the difference in their means (proportion of callbacks), stores the difference in p_bootstrap, and does this n_samples times.

In [14]:
def create_bootstrap_sample(sample_size=100, n_samples=1000, random_seed=None):
    """Create a bootstrap sample population of the difference of 
    proportions between the black and white sample populations."""
    if random_seed is not None:
        # Seed random number generator
        np.random.seed(random_seed)
    
    # create an empty array to store the difference in proportions
    # from each bootstrap sample
    p_bootstrap = np.empty(n_samples)
    
    for i in range(n_samples):
        # sample proportion of the black sample (observed callback rate: 6.45%)
        b_sample_p = np.random.binomial(sample_size, 0.0645, size=1)/sample_size
        # sample proportion of the white sample (observed callback rate: 9.65%)
        w_sample_p = np.random.binomial(sample_size, 0.0965, size=1)/sample_size

        diff = w_sample_p - b_sample_p

        # store the difference of proportions (diff is a one-element array)
        p_bootstrap[i] = diff[0]
        
    return p_bootstrap

In [15]:
# create the bootstrap sample population using the above function
p_bootstrap = create_bootstrap_sample(sample_size=n, n_samples=1000, random_seed=42)
print('Bootstrap population mean: %s' % np.mean(p_bootstrap))
print('Bootstrap population std: %s' % np.std(p_bootstrap))

Bootstrap population mean: 0.03186324435318275
Bootstrap population std: 0.007482501470597068


It's important to note that the sample size of each bootstrap sample must match the size of the original black and white samples, because a larger sample size leads to a lower standard deviation of the bootstrap differences. The number of bootstrap samples we create, however, only determines how precisely that distribution is estimated; it does not change its variance.
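The dependence on sample size can be seen from the theoretical standard deviation of the difference between two binomial proportions. The sketch below uses that formula with the observed callback rates; the helper name `se_of_difference` is hypothetical and the extra sample sizes are chosen only for illustration.

```python
from math import sqrt

P_WHITE, P_BLACK = 0.0965, 0.0645  # observed callback rates used above

def se_of_difference(sample_size):
    """Theoretical std of (white rate - black rate) for one replicate."""
    return sqrt(P_WHITE * (1 - P_WHITE) / sample_size
                + P_BLACK * (1 - P_BLACK) / sample_size)

for size in (100, 2435, 10000):
    print(f"sample_size={size:>5}: std of difference is about {se_of_difference(size):.4f}")
```

The number of bootstrap replicates does not appear in the formula at all, which is why drawing more replicates sharpens the estimate of the distribution without narrowing it.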

In [16]:
# plot the bootstrap distribution
fig = ff.create_distplot([p_bootstrap], 
                         group_labels=['bootstrap population'],
                         curve_type='normal', bin_size=0.002)

iplot(fig, filename='bootstrap_distribution.html')

We can get an idea of our test result by visualizing the empirical cumulative distribution function (ECDF) of the bootstrap sample population.

In [17]:
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, len(data) + 1) / n

    return x, y

In [18]:
# compute the ecdf of the bootstrap distribution
x, y = ecdf(p_bootstrap)

In [19]:
# plot the ecdf
trace0 = go.Scatter(x=x, y=y, mode='markers', name='bootstrap sample population')

iplot([trace0], filename='bootstrap_ecdf.html')

Scrolling over to the left, we can see that no data point actually reaches 0. This means it is highly unlikely that the difference in the proportions of the black and white samples is equal to 0, which once again supports the alternative hypothesis that the proportion of "white" resumes receiving a callback is not equal to the proportion of "black" resumes receiving a callback.
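Reading the ECDF at 0 can also be done numerically: it is the fraction of bootstrap differences at or below zero. A minimal sketch, where `ecdf_at` is a hypothetical helper and the replicate values are made up purely for illustration:

```python
from bisect import bisect_right

def ecdf_at(sorted_values, x):
    """Fraction of observations less than or equal to x."""
    return bisect_right(sorted_values, x) / len(sorted_values)

# Made-up replicate differences, for illustration only.
toy_diffs = sorted([0.021, 0.027, 0.030, 0.032, 0.033, 0.036, 0.041, 0.045])

print(ecdf_at(toy_diffs, 0.0))    # share of replicates at or below zero
print(ecdf_at(toy_diffs, 0.032))  # share at or below 0.032
```

Applied to the real replicates, a value of 0 at `x = 0` would match what the plot shows.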

I'll use the same confidence level as in the frequentist approach of 99%, with a corresponding alpha value of 0.01.

In [20]:
# define alpha for the 99% confidence interval
alpha = 0.01

# one-sided critical z-score for alpha (the two-sided value would be -2.576)
z_score = -2.33

Since we are interested in testing the null hypothesis that the difference in means (proportions) of the black and white sample populations is equal to zero, we use the value 0 together with the bootstrap sample population mean and standard deviation to compute a z-score.

In [21]:
# calculate z-score
z = (0-np.mean(p_bootstrap))/np.std(p_bootstrap)

print('Z-score: %s' % round(z, 2))
print('Critical Z: %s' % z_score)

Z-score: -4.26
Critical Z: -2.33


The z-score of -4.26 is far below the critical Z of -2.33, and also below the two-sided critical value of -2.576. It corresponds to a two-sided p-value of about 0.00002, which is again much less than alpha=0.01. Thus we reach the same conclusion as in the frequentist approach: we reject the null hypothesis in favor of the alternate hypothesis.
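The critical values and the p-value quoted above can be reproduced from the standard normal distribution instead of a lookup table. A short sketch using the standard library's `statistics.NormalDist`, with the z-score typed in from the output above:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1
alpha = 0.01

z_crit_one_sided = std_normal.inv_cdf(alpha)      # all of alpha in the lower tail
z_crit_two_sided = std_normal.inv_cdf(alpha / 2)  # alpha split across both tails

z = -4.26  # z-score reported above
p_two_sided = 2 * std_normal.cdf(-abs(z))

print(f"one-sided critical z: {z_crit_one_sided:.3f}")
print(f"two-sided critical z: {z_crit_two_sided:.3f}")
print(f"two-sided p-value:    {p_two_sided:.1e}")
```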

In [22]:
# the standard deviation of the bootstrap differences is itself the
# standard error of the difference in proportions
SE = np.std(p_bootstrap)
print('Standard Error: %s' % round(SE, 6))

Standard Error: 0.007483


In [23]:
# difference in sample means
diff = round(np.mean(p_bootstrap), 3)

# critical z-value for a two-sided 99% CI
z_star = 2.576

# calculate upper and lower CI
lower_CI = round(diff - z_star*SE, 3)
upper_CI = round(diff + z_star*SE, 3)

print(f"The difference in proportion of callbacks between black and \
white populations is {diff} with a lower 99% CI of {lower_CI} and \
an upper 99% CI of {upper_CI}")

The difference in proportion of callbacks between black and white populations is 0.032 with a lower 99% CI of 0.013 and an upper 99% CI of 0.051
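An alternative to simulating from the binomial distribution is the resampling bootstrap, which draws from the observed 0/1 outcomes with replacement and reads the confidence interval straight from the percentiles of the replicates. The sketch below is one way this could look; it rebuilds the two samples from the callback counts instead of loading the dataset, and the seed is arbitrary.

```python
import random

random.seed(42)  # arbitrary seed, only for reproducibility

n = 2435
white = [1] * 235 + [0] * (n - 235)  # 1 = callback, 0 = no callback
black = [1] * 157 + [0] * (n - 157)

def callback_rate(calls):
    return sum(calls) / len(calls)

n_replicates = 1000
diffs = []
for _ in range(n_replicates):
    # Resample each group with replacement, keeping the original group size.
    w_star = random.choices(white, k=n)
    b_star = random.choices(black, k=n)
    diffs.append(callback_rate(w_star) - callback_rate(b_star))

diffs.sort()
# 99% percentile interval: drop 0.5% of the replicates from each tail.
tail = n_replicates // 200
lower, upper = diffs[tail], diffs[-tail - 1]
print(f"99% percentile interval: ({lower:.3f}, {upper:.3f})")
```

The percentile interval makes no normality assumption, so it is a useful check on the z-based interval.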


<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

### 4. Write a story describing the statistical significance in the context of the original problem.

Recall:

The null hypothesis (H_0): The proportion of "white" resumes that receive a callback is equal to the proportion of "black" resumes receiving a callback.

The alternate hypothesis (H_A): The proportion of "white" resumes that receive a callback is **NOT** equal to the proportion of "black" resumes receiving a callback.

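In both the frequentist and bootstrapping approaches, I found statistically significant evidence at the 99% confidence level to reject the null hypothesis in favor of the alternate hypothesis that the proportion of "black" resumes that receive callbacks is **NOT** equal to the proportion of "white" resumes that receive callbacks. In the context of the experiment, resumes with white-sounding names received callbacks 9.65% of the time, compared with 6.45% for otherwise identical resumes with black-sounding names. Because names were assigned to resumes at random, this gap of about 3.2 percentage points is unlikely to be explained by chance or by differences in resume quality.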
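Statistical significance says the gap is unlikely to be zero, but the story is easier to tell with effect sizes. The sketch below computes a few simple ones from the callback counts; the choice of summaries (absolute gap, ratio, resumes per callback) is mine and not part of the exercise.

```python
n = 2435
white_callbacks, black_callbacks = 235, 157

p_w, p_b = white_callbacks / n, black_callbacks / n

absolute_gap = p_w - p_b    # difference in callback rates
relative_ratio = p_w / p_b  # white callback rate relative to black

print(f"absolute gap: {absolute_gap * 100:.1f} percentage points")
print(f"relative ratio: {relative_ratio:.2f}")
print(f"resumes sent per callback: white {1 / p_w:.1f}, black {1 / p_b:.1f}")
```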

### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

The results of my analysis do not imply that race/name is the most important factor in callback success. The result only implies that there is a difference in callback success between two populations (black and white). There may in fact be other underlying factors that drive callback success which were not tested for here. For example, the level of education could be more important.

To test for these other factors, I would repeat the analysis for each factor of interest, comparing the proportions of callbacks between the two populations that the factor defines. For example, I could run the same tests on two populations where one contains candidates with education beyond a bachelor's degree and the other contains candidates with at most a bachelor's degree. I could then determine which factors show a statistically significant difference and treat the one with the largest difference in callback proportions as the most important.
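The per-factor comparison described above amounts to grouping resumes by a column and comparing callback rates between the groups. The sketch below shows the idea on a handful of made-up rows: `callback_rates_by` is a hypothetical helper and the records are invented for illustration, although `honors` is one of the columns in the dataset.

```python
from collections import defaultdict

# Hypothetical toy records (race, honors, call), for illustration only.
toy_rows = [
    ("w", 1, 1), ("w", 1, 0), ("w", 0, 0), ("w", 0, 1),
    ("b", 1, 1), ("b", 1, 0), ("b", 0, 0), ("b", 0, 0),
]

def callback_rates_by(rows, key_index):
    """Callback rate for each value of the column at key_index."""
    totals = defaultdict(lambda: [0, 0])  # value -> [callbacks, resumes]
    for row in rows:
        totals[row[key_index]][0] += row[-1]
        totals[row[key_index]][1] += 1
    return {value: calls / count for value, (calls, count) in totals.items()}

by_race = callback_rates_by(toy_rows, 0)
by_honors = callback_rates_by(toy_rows, 1)
print("gap by race:  ", by_race["w"] - by_race["b"])
print("gap by honors:", by_honors[1] - by_honors[0])
```

Each gap would then go through the same significance test as the race comparison before the factors are ranked.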