# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [3]:
# 1 What test is appropriate for this problem? Does CLT apply?
print('Total white sounding names', len(data[data.race=='w']), 'is larger than 30')
print('Total calls for whites', sum(data[data.race=='w'].call), 'as well as non-calls', len(data[data.race=='w']) - sum(data[data.race=='w'].call), 'are greater than 10')
print('Total black sounding names', len(data[data.race=='b']), 'is larger than 30')
print('Total calls for blacks', sum(data[data.race=='b'].call), 'as well as non-calls', len(data[data.race=='b']) - sum(data[data.race=='b'].call), 'are greater than 10')

Total white sounding names 2435 is larger than 30
Total calls for whites 235.0 as well as non-calls 2200.0 are greater than 10
Total black sounding names 2435 is larger than 30
Total calls for blacks 157.0 as well as non-calls 2278.0 are greater than 10


<p>Both samples are very large: the data are split equally between white-sounding and black-sounding names, with 2435 resumes in each group.</p>
<p>A two-sample z-test for proportions is appropriate. By the central limit theorem, with samples well above 30 elements the sampling distribution of each sample proportion is approximately normal, and the standard error computed from the sample is a good estimate of the population value.</p>
<p>data['call'] is a Bernoulli variable, with 1 representing success in getting a callback and 0 a failure. Both groups contain well over 10 successes and 10 failures, so the normal approximation provided by the central limit theorem holds for both proportions.</p>

<p>2. What are the null and alternate hypotheses?</p>
<p>The null hypothesis would be that individuals with black sounding names get proportionally as many callbacks as ones with white sounding names, and the alternative hypothesis would be that there is a difference in proportional number of callbacks.</p>
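<p>Written out in symbols, the hypotheses and the pooled z statistic used later in this notebook can be sketched as follows. The notation is ours rather than the exercise's: p_b and p_w are the true callback rates, x_b and x_w the callback counts, and n_b and n_w the group sizes.</p>

```latex
H_0:\; p_b = p_w \qquad H_1:\; p_b \neq p_w

z = \frac{\hat{p}_b - \hat{p}_w}
         {\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_b} + \frac{1}{n_w}\right)}},
\qquad
\hat{p} = \frac{x_b + x_w}{n_b + n_w}
```

<p>When both groups have the same size n, the denominator reduces to the square root of 2 p̂ (1 − p̂) / n.</p>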

In [4]:
# We split the data by race
w = data[data.race=='w']
b = data[data.race=='b']

In [5]:
# Your solution to Q3 here
# Using the bootstrap approach
def draw_bs_replicates(data, func, size=1):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = func(np.random.choice(data, size=len(data)))

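    return bs_replicates

# Pooled callback rate across both groups (the common rate under the null hypothesis)
mean_prop = np.mean(data.call)
# Bootstrap replicates of each group's callback rate and the absolute difference between them
bootstrap_white = draw_bs_replicates(w.call, np.mean, size=10000)
bootstrap_black = draw_bs_replicates(b.call, np.mean, size=10000)
bootstrap_replicates = np.absolute(bootstrap_white - bootstrap_black)
# Shift each group so that both share the pooled mean, which simulates the null hypothesis
translated_white = w.call - np.mean(w.call) + mean_prop
translated_black = b.call - np.mean(b.call) + mean_prop
translated_bootstrap_white = draw_bs_replicates(translated_white, np.mean, size=10000)
translated_bootstrap_black = draw_bs_replicates(translated_black, np.mean, size=10000)
translated_bootstrap_replicates = np.absolute(translated_bootstrap_white - translated_bootstrap_black)

# Fraction of replicate pairs in which the simulated null difference is at least as large
# as the bootstrapped difference (the comparison is element by element)
p = np.sum(bootstrap_replicates <= translated_bootstrap_replicates) / 10000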
print('The p value using the bootstrap approach is', p, 'therefore we reject the null hypothesis and accept the alternative')
print('hypothesis that the callback rate between black and white sounding names is different')

The p value using the bootstrap approach is 0.0031 therefore we reject the null hypothesis and accept the alternative
hypothesis that the callback rate between black and white sounding names is different
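<p>The comparison above pairs each bootstrapped difference with a simulated null difference instead of comparing the null replicates with the single observed difference. A more conventional resampling check is a permutation test. The sketch below is an illustration rather than the notebook's own method: it uses only the standard library and the callback counts printed earlier (235 of 2435 and 157 of 2435) instead of the dataset, and the name permutation_p_value is hypothetical.</p>

```python
import random

# Counts taken from the output printed earlier in this notebook
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

# Observed absolute difference in callback rates
observed_diff = abs(calls_w / n_w - calls_b / n_b)


def permutation_p_value(n_reps=10000, seed=0):
    """Two-sided permutation p value for the difference in callback rates."""
    rng = random.Random(seed)
    n_total = n_w + n_b
    total_calls = calls_w + calls_b
    extreme = 0
    for _ in range(n_reps):
        # Under the null the labels are exchangeable, so we pick at random which
        # resumes receive the callbacks; indices below n_w play the "white" group.
        chosen = rng.sample(range(n_total), total_calls)
        perm_calls_w = sum(1 for idx in chosen if idx < n_w)
        perm_calls_b = total_calls - perm_calls_w
        perm_diff = abs(perm_calls_w / n_w - perm_calls_b / n_b)
        if perm_diff >= observed_diff:
            extreme += 1
    return extreme / n_reps


p_perm = permutation_p_value()
print('Permutation p value:', p_perm)
```

<p>The exact value depends on the random draws, so no figure is quoted here. With 10000 replicates the smallest nonzero p value this sketch can report is 0.0001.</p>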


In [6]:
# Using the frequentist approach
# Since we don't have the standard deviation of the population and our samples are above 30 elements we can estimate the margin of 
# error by multiplying the standard error of the sample proportion by the z factor.
std_err_diff = np.sqrt(np.mean(w.call)*(1-np.mean(w.call))/len(w)+np.mean(b.call)*(1-np.mean(b.call))/(len(b)))
# We now have to provide an appropriate z-value for the most commonly used 95% confidence level which in this case is two tailed
# as we are interested in differences in either direction
conf_95 = stats.norm.ppf(0.975)
# Therefore the margin of error is given by the standard error by the z factor at that confidence level
mrg_err_diff = conf_95*std_err_diff
print('The margin of error is +-', mrg_err_diff, 'at a 95% confidence level')

The margin of error is +- 0.015255126027 at a 95% confidence level


<p>The margin of error for the difference in callback proportions between the two samples is approximately 1.5 percentage points at a 95% confidence level. A margin this small, below 2 percentage points, means the observed difference is a fairly precise estimate of the true difference.</p>
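<p>As a cross-check of the arithmetic, the same margin of error can be recomputed from the callback counts printed earlier, without loading the dataset. This is a minimal sketch that uses only the standard library in place of NumPy and SciPy; the variable names are illustrative.</p>

```python
from math import sqrt
from statistics import NormalDist

# Counts taken from the output printed earlier in this notebook
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157

# Sample callback proportions for each group
p_w = calls_w / n_w
p_b = calls_b / n_b

# Unpooled standard error of the difference between two proportions
std_err = sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)

# Two-tailed critical value for a 95% confidence level
z_crit = NormalDist().inv_cdf(0.975)

margin = z_crit * std_err
print('Margin of error:', margin)
```

<p>The printed value should agree with the margin reported above to several decimal places.</p>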

In [7]:
# We calculate the confidence interval by subtracting the margin of error from, and adding it to, the difference in sample proportions
conf_int_diff = []
conf_int_diff.append(np.mean(w.call)-np.mean(b.call)-mrg_err_diff)
conf_int_diff.append(np.mean(w.call)-np.mean(b.call)+mrg_err_diff)
print('The confidence interval is', conf_int_diff, 'at a 95% confidence level')

The confidence interval is [0.016777728828047841, 0.047287980882073311] at a 95% confidence level


<p>The resulting confidence interval suggests that resumes with white-sounding names received callbacks at a rate between 1.7 and 4.7 percentage points higher (to two significant figures) than resumes with black-sounding names, at a 95% confidence level.</p>

In [8]:
# To calculate the p value we subtract the target difference from the difference in sample proportions and divide it by the
# standard error that would apply if both samples had the same rate of callbacks.
# The target difference in this case is 0, as the null hypothesis is that both samples receive the same proportion of callbacks.
# We calculate the pooled proportion by adding the total number of successes of both samples and dividing it
# by the total size of both samples.
new_prop = (sum(w.call)+sum(b.call))/(len(w.call)+len(b.call))
# Both groups have the same size, so the pooled standard error simplifies to sqrt(2*p*(1-p)/n)
z = (np.mean(b.call)-np.mean(w.call)-0)/np.sqrt(2*(new_prop*(1-new_prop))/len(w.call))
# The p value is twice the area in the tail beyond |z| standard deviations from 0;
# we multiply it by 2 as the alternative hypothesis is that there is a difference in either direction
p = 2 * stats.norm.cdf(-abs(z))
print('The p value using the frequentist approach is', p, 'therefore we reject the null hypothesis and accept the')
print('alternative hypothesis that the callback rate between black and white sounding names is different')

The p value using the frequentist approach is 3.9838854095e-05 therefore we reject the null hypothesis and accept the
alternative hypothesis that the callback rate between black and white sounding names is different


<p>The p value is well below 0.05, so we reject the null hypothesis in favor of the alternative hypothesis.</p>

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

<p>4. Write a story describing the statistical significance in the context of the original problem</p>
<p>The hypothesis test on 4870 resumes, half randomly assigned black-sounding names and the other half white-sounding names, found that race was a factor in determining the callback rate of a resume.</p>
<p>The resumes were randomly assigned white-sounding and black-sounding names, and each group was very large and contained many successful and unsuccessful callbacks, so the conditions for applying the central limit theorem were met.</p>
<p>Despite the high number of resumes, it is still possible that by chance the random assignment was skewed, leaving the two groups different in other respects and making the results unreliable.</p>
<p>To check this, we will compute the Wasserstein distance between the two groups on each feature to measure how dissimilar they are.</p>

In [9]:
data_keys = list(data.keys())
new_df = data.copy()
dist_list = []
# we simply factorise the data as it is not important which number represents which category in this case
for i in data_keys:
    if data[i].dtype == 'object':
        new_df[i] = pd.factorize(new_df[i])[0]
# we fill all the nan values with 0 to avoid errors though this will affect the accuracy of the results somewhat
new_df = new_df.fillna(0)
# we divide the factorised data into race groups as previously done
d0 = new_df[new_df.race==0]
d1 = new_df[new_df.race==1]
# we drop the race and firstname columns to improve accuracy in the comparison as they are purposefully different in the context of
# this study
d0 = d0.drop(['race', 'firstname'], axis=1)
d1 = d1.drop(['race', 'firstname'], axis=1)
data_keys.remove('race')
data_keys.remove('firstname')
# We compute the Wasserstein distance to find the dissimilarity between features of the two distributions
# We found EOF errors when we first tried this part of the code so we wrap it in try/except
try:
    for i in data_keys:
        # compute the distance once per feature and reuse it
        dist = stats.wasserstein_distance(d0[i], d1[i])
        dist_list.append(dist)
        if dist > 0.2:
            print('Dissimilarity in the two distributions on', i, 'feature is above 0.2 at', dist)
except Exception as err:
    print('unexpected_error', err)
print('Average dissimilarity rate', np.mean(dist_list))

Dissimilarity in the two distributions on occupspecific feature is above 0.2 at 3.03655030801
Average dissimilarity rate 0.053790333452


<p>From this result we see that the dissimilarity is mostly insignificant, averaging only 0.05, and that the average is dominated by the occupspecific feature, which is an extreme outlier. We couldn't find documentation on the study's data in the original
  <a href="https://www.nber.org/papers/w9873.pdf">paper</a>, so we can't tell for sure what that feature means at the moment.
</p>
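<p>One caveat when reading these distances: the Wasserstein distance is expressed in the units of the feature, so a fixed threshold such as 0.2 is not directly comparable across features with different scales. The sketch below illustrates this with a simplified one-dimensional version for equal-size samples (both groups here have 2435 rows). The helper wasserstein_1d is hypothetical and written with the standard library only; it is not SciPy's implementation.</p>

```python
def wasserstein_1d(a, b):
    """W1 distance between two equal-size 1-D samples: mean gap between sorted values."""
    if len(a) != len(b):
        raise ValueError('this simplified version needs equal-size samples')
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Two toy samples that differ by a constant shift of 5
group_a = [0, 1, 3]
group_b = [5, 6, 8]
dist = wasserstein_1d(group_a, group_b)

# The same samples measured in units 100 times smaller
dist_scaled = wasserstein_1d([100 * x for x in group_a], [100 * x for x in group_b])

print(dist, dist_scaled)  # 5.0 500.0
```

<p>A feature stored as large numeric codes can therefore show a large distance without the two groups differing in any meaningful way, which may be part of why occupspecific stands out.</p>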
<p>We can thus be reasonably confident that race/name was a factor in determining callback success, though we should be careful when applying the conclusions of this study to race relations in the US as a whole, as this study only covered the populations of the Boston and Chicago urban areas.</p>
<p>A meta-study covering more diverse populations that better represent the makeup of the country would be necessary to make the conclusions applicable to the country as a whole.</p>

<p>5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?</p>
<p>The results of the analysis only show that first names associated with black or white people are a factor in determining callback success.</p>
<p>Whether it is the most significant factor is a completely different question, which would require analyzing all other possible factors such as gender, education level, years of experience, etc. We did not study these, so we can't conclude that race is the most important factor in determining callback rate.</p>
<p>To determine the most significant factor in callback success, we would have to test all factors that could affect it and compare their effects.</p>
<p>We would also have to contend with the possibility that some factors affecting callback success were not recorded in the study.</p>