# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
# import libraries
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# read the data
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# preliminary look at the data
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [4]:
# get the list of all columns in case it is needed later
data.columns

Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')

In [5]:
# create a dataset with the two columns we need, race and call
bw_data = data[['race','call']]

# further separate candidates with black or white sounding names
black = data[bw_data.race=='b']
white = data[bw_data.race=='w']
bw_data.head(5)

Unnamed: 0,race,call
0,w,0.0
1,w,0.0
2,b,0.0
3,b,0.0
4,w,0.0


In [6]:
# descriptive statistics for the dataset
bw_data.describe()

Unnamed: 0,call
count,4870.0
mean,0.080493
std,0.272079
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [7]:
# Find total number of candidates, candidates with black sounding names and those with white sounding names
n=len(bw_data)
b_n=len(black)
w_n=len(white)

print('Total number of candidates are %d, out of which %d have black sounding names and %d have white sounding names' %(n,b_n,w_n))

Total number of candidates are 4870, out of which 2435 have black sounding names and 2435 have white sounding names


<div>
<h1>Questions</h1>
   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
</div>

<div>
<h1>Solutions</h1>
<p>Since we do not know the population standard deviation, we use a two-sample t-test rather than a z-test. With samples this large, the two tests give nearly identical results.</p>
<p> The conditions for the CLT are:

1. Randomization Condition: The data must be sampled randomly. From the information given, we can assume that the sample was chosen randomly.

2. Independence Assumption: The sample values must be independent of each other, meaning that the occurrence of one event has no influence on the next. If people or items were selected randomly, we can usually assume that the independence assumption is met.

3. 10% Condition: When the sample is drawn without replacement (usually the case), the sample size, n, should be no more than 10% of the population. Since the total population is in the millions, we can safely assume that the sample is less than 10% of the population.

4. Sample Size Assumption: The sample size must be sufficiently large. Here each group contains 2,435 resumes, far more than the 30 usually considered sufficient. </p>

<p> The null and alternate hypotheses </p>
<ul>
<li>H<sub>0</sub>: callback rates are the same for candidates with black-sounding and white-sounding names, i.e. there is no discrimination in the US job market</li>
<li>H<sub>A</sub>: callback rates are NOT the same for candidates with black-sounding and white-sounding names, indicating discrimination in the US job market</li>
</ul>
</div>
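
For a 0/1 outcome such as `call`, the sample-size condition is often stated as a rule of thumb: each group should contain at least 10 successes and 10 failures. The following is a minimal sketch of that check, using the callback counts reported later in this notebook (157 of 2,435 and 235 of 2,435). The helper name `large_sample_ok` and the threshold of 10 are illustrative assumptions, not part of the original exercise.

```python
def large_sample_ok(successes, n, threshold=10):
    """Return True when both successes and failures reach the threshold.

    The threshold of 10 is a common rule of thumb (an assumption here).
    """
    failures = n - successes
    return successes >= threshold and failures >= threshold

# (callbacks, resumes) per group, taken from the notebook's own counts
groups = {'b': (157, 2435), 'w': (235, 2435)}

checks = {race: large_sample_ok(s, n) for race, (s, n) in groups.items()}
print(checks)
```

Both groups pass comfortably, which supports treating the sampling distribution of the callback rates as approximately normal.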

<div>
<p><b>Question 3: Compute margin of error, confidence interval, and p-value.</b></p>
</div>

In [8]:
# break-down by success rate and race
success_rate = bw_data.groupby(['call', 'race']).size()
success_rate

call  race
0.0   b       2278
      w       2200
1.0   b        157
      w        235
dtype: int64

In [9]:
# percentage calculation of callbacks for candidates
print('Out of', len(black),'applicants with black sounding names',len(black[black.call==1]),'received a callback where', len(black[black.call==0]),'did not')
print('So in all %3.2f percent applicants were successful' %((len(black[black.call==1]))/(len(black))*100))
print('Out of', len(white),'applicants with white sounding names',len(white[white.call==1]),'received a callback where', len(white[white.call==0]),'did not')
print('So in all %3.2f percent applicants were successful' %((len(white[white.call==1]))/(len(white))*100))


Out of 2435 applicants with black sounding names 157 received a callback where 2278 did not
So in all 6.45 percent applicants were successful
Out of 2435 applicants with white sounding names 235 received a callback where 2200 did not
So in all 9.65 percent applicants were successful


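From this initial analysis, we can see that more candidates with white-sounding names received callbacks than candidates with black-sounding names. To test the hypotheses, we use a two-sample t-test and obtain its t-statistic and p-value.

If the absolute value of the t-statistic is greater than the critical value (1.96 at the 95% confidence level), or equivalently if the p-value is less than 0.05, we reject the null hypothesis.
Otherwise, we fail to reject the null hypothesis.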

In [10]:
# 'call' is a 0/1 column, so its mean is the overall callback rate
sample_std = bw_data['call'].std()
sample_mean = bw_data['call'].mean()
cv = 1.96  # critical value for a two-sided test at the 95% confidence level

# calculate the standard error and margin of error of the overall callback rate
std_error = sample_std / np.sqrt(len(bw_data))
margin_error = cv * std_error
print('Standard error is %f and margin of error is %f' %(std_error, margin_error))

# calculate the confidence interval for the overall callback rate
ci_high = sample_mean + margin_error
ci_low = sample_mean - margin_error
print('Confidence interval is between %f and %f' %(ci_low,ci_high))

# two-sample t-test comparing the callback rates of the two groups
tv, pv = stats.ttest_ind(black['call'], white['call'])
print('t-statistic:',tv,'p-value',pv)

# two-sided test: compare the absolute t-statistic to the critical value
if abs(tv) >= cv:
    print('Absolute t-statistic is greater than critical value.')
    print('Reject null hypothesis.')
    print('There is discrimination in the US job market between candidates with black sounding and white sounding names')
else:
    print('Absolute t-statistic is less than critical value.')
    print('Fail to reject null hypothesis.')
    print('There is NO evidence of discrimination in the US job market between candidates with black sounding and white sounding names')

Standard error is 0.003899 and margin of error is 0.007642
Confidence interval is between 0.072851 and 0.088134
t-statistic: -4.11470529086 p-value 3.94080210313e-05
Absolute t-statistic is greater than critical value.
Reject null hypothesis.
There is discrimination in the US job market between candidates with black sounding and white sounding names
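
The interval above describes the overall callback rate across both groups. As a complementary sketch, the margin of error, confidence interval and p-value for the *difference* in callback rates can be computed directly from the counts reported earlier (157 of 2,435 and 235 of 2,435). Note that this uses a two-proportion z-test rather than the t-test used in this notebook, and only the Python standard library. The function name `diff_in_proportions` is an illustrative assumption.

```python
import math

def diff_in_proportions(success_a, n_a, success_b, n_b, z_crit=1.96):
    """Two-proportion z-test and confidence interval for p_b - p_a."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    diff = p_b - p_a

    # unpooled standard error, used for the confidence interval
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = z_crit * se_unpooled

    # pooled standard error, used for the test under H0 (equal rates)
    p_pool = (success_a + success_b) / (n_a + n_b)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled

    # two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, margin, (diff - margin, diff + margin), z, p_value

# a = black-sounding names, b = white-sounding names (counts from the notebook)
diff, margin, ci, z, p_value = diff_in_proportions(157, 2435, 235, 2435)
print('Difference in callback rates: %.4f' % diff)
print('Margin of error: %.4f' % margin)
print('95%% confidence interval: %.4f to %.4f' % ci)
print('z = %.2f, p-value = %.2e' % (z, p_value))
```

The interval for the difference excludes zero, which agrees with the t-test result.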


<div>
<b>Question: 
4. Write a story describing the statistical significance in the context of the original problem.
5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis? </b>
</div>

To put the statistical significance in context, we can look at the other columns and check whether any factor besides the callback rate differs between the two groups. To do so, we run a t-test on each column and flag those whose p-value falls below the 0.05 threshold that corresponds to a 95% confidence level.

In [11]:
# compile list of all columns in the dataset
cols = list(data.columns.values)

In [12]:
# compare every column between the white and black datasets
# to see whether any other factor differs significantly (p < 0.05)
for i in cols:
    try:
        t, p = stats.ttest_ind(white[i], black[i])
        if p < 0.05:
            print(i, t, p)
    except Exception:
        # skip columns the t-test cannot handle (e.g. text columns)
        pass

computerskills -2.16642710428 0.0303269339554
call 4.11470529086 3.94080210313e-05




Computer skills have a p-value below 0.05 (about 0.03, to be precise).
This leads us to check whether a difference in computer skills between the two groups could account for the lower callback rate of candidates with black-sounding names.
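
Because the loop tests many columns at once, some p-values below 0.05 are expected by chance alone. One common and conservative safeguard is a Bonferroni correction, which divides the threshold by the number of tests. The sketch below applies it to the two p-values printed above. The number of tests `m` is an assumption: the 65 columns of the dataset are used as an upper bound, since the notebook does not report how many columns were actually tested.

```python
alpha = 0.05
m = 65  # assumption: upper bound on the number of tests (total columns in the dataset)

# p-values copied from the output of the loop above
p_values = {'computerskills': 0.0303269339554, 'call': 3.94080210313e-05}

# Bonferroni correction: compare each p-value to alpha / m
bonferroni_alpha = alpha / m
still_significant = {name: p < bonferroni_alpha for name, p in p_values.items()}

print('Adjusted threshold: %.6f' % bonferroni_alpha)
print(still_significant)
```

Under this assumption, the difference in `call` remains significant while the difference in `computerskills` does not.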

In [13]:
# create a dataset with the columns for computer skills added 
bwcomp_data = data[['race','call','computerskills']]

In [14]:
# descriptive analysis
bwcomp_data.describe()

Unnamed: 0,call,computerskills
count,4870.0,4870.0
mean,0.080493,0.820534
std,0.272079,0.383782
min,0.0,0.0
25%,0.0,1.0
50%,0.0,1.0
75%,0.0,1.0
max,1.0,1.0


In [15]:
# analyse the difference between computer skills and callback rates
compskills = bwcomp_data.groupby(['call', 'race','computerskills']).size()
compskills

call  race  computerskills
0.0   b     0                  379
            1                 1899
      w     0                  410
            1                 1790
1.0   b     0                   29
            1                  128
      w     0                   56
            1                  179
dtype: int64

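The table above shows that, overall, more candidates with black-sounding names have computer skills (2,027 versus 1,969 for white-sounding names), yet fewer of them received callbacks. Computer skills therefore do not explain the gap in callback rates. We reject the null hypothesis: the data indicate racial discrimination in the United States labor market. Note that this analysis shows that race has a statistically significant effect on callbacks. It does not show that race is the most important factor, which would require comparing its effect with that of the other resume characteristics.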
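
To make the comparison concrete, the callback rate can be computed within each computer-skills level from the counts in the table above. This is a minimal sketch using only the standard library; the dictionary layout and the helper name `callback_rate` are illustrative assumptions.

```python
# counts copied from the grouped table:
# (race, computerskills) -> (no callback, callback)
counts = {
    ('b', 0): (379, 29),
    ('b', 1): (1899, 128),
    ('w', 0): (410, 56),
    ('w', 1): (1790, 179),
}

def callback_rate(no_call, call):
    """Share of resumes in a group that received a callback."""
    return call / (no_call + call)

rates = {key: callback_rate(*value) for key, value in counts.items()}

for (race, skills), rate in sorted(rates.items()):
    print('race=%s computerskills=%d callback rate=%.2f%%' % (race, skills, 100 * rate))
```

At both skill levels the callback rate is higher for white-sounding names, so the gap persists after accounting for computer skills.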