# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


# Question 1

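A two-sample z-test for proportions is appropriate here: the outcome is binary (call or no call), and we want to know whether the callback rate differs between resumes with white-sounding and black-sounding names.

The CLT applies. Names were assigned to resumes at random, so whether one resume is called can reasonably be treated as independent of whether another is. Each group is also large, with far more than 10 callbacks and 10 non-callbacks, so the sampling distribution of each sample proportion is approximately normal.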
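
The sample-size part of the CLT condition can be checked with a few lines of arithmetic. This is a minimal sketch using counts that appear elsewhere in this notebook: 157 callbacks out of 2435 resumes with black-sounding names, and 235 callbacks for white-sounding names, a figure inferred here from the printed proportion (0.0965 × 2435). The threshold of 10 is a common rule of thumb and an assumption, not something the exercise specifies.

```python
# Rule-of-thumb check for the normal approximation to a sample proportion:
# each group should have at least MIN_COUNT successes and MIN_COUNT failures.
MIN_COUNT = 10  # assumed threshold (common convention)

groups = {
    "black-sounding": {"n": 2435, "calls": 157},
    "white-sounding": {"n": 2435, "calls": 235},  # inferred: 0.0965 * 2435
}

def normal_approx_ok(n, calls, min_count=MIN_COUNT):
    """Return True when both callbacks and non-callbacks reach min_count."""
    return calls >= min_count and (n - calls) >= min_count

checks = {name: normal_approx_ok(g["n"], g["calls"]) for name, g in groups.items()}
for name, ok in checks.items():
    print(name, "-> normal approximation reasonable:", ok)
```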

# Question 2

As stated above, we want to check whether the callback rate differs between resumes with white-sounding and black-sounding names.

Let pb be the proportion of resumes with black-sounding names that received a call, and pw the proportion for resumes with white-sounding names:

<b>Null Hypothesis:</b> pb - pw = 0 (the two callback rates are equal) <br>
<b>Alternate Hypothesis:</b> pb - pw ≠ 0 (the two callback rates differ).

# Question 3

Each call outcome is a Bernoulli trial, so this problem comes down to comparing two sample proportions. For that we need the size and callback proportion of each group, plus the standard deviation of the sampling distribution of the difference between the two proportions.
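
For reference, the quantities involved can be written out as below. The symbols $x_b, x_w$ (callback counts), $n_b, n_w$ (group sizes), $\hat{p}$ (pooled callback proportion) and $z^{*}$ (normal critical value) are notation introduced here for illustration; $p_b$ and $p_w$ are the sample proportions pb and pw defined in Question 2.

$$
\mathrm{SE}_{\mathrm{CI}} = \sqrt{\frac{p_b(1-p_b)}{n_b} + \frac{p_w(1-p_w)}{n_w}}, \qquad
\mathrm{CI} = (p_w - p_b) \pm z^{*}\,\mathrm{SE}_{\mathrm{CI}}
$$

$$
\hat{p} = \frac{x_b + x_w}{n_b + n_w}, \qquad
\mathrm{SE}_{0} = \sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_b} + \frac{1}{n_w}\right)}, \qquad
z = \frac{p_w - p_b}{\mathrm{SE}_{0}}
$$

The confidence interval uses the unpooled standard error because it makes no assumption about the two rates. The test statistic uses the pooled one because the null hypothesis assumes both groups share a single callback rate.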

In [13]:
black = data[data.race=='b']
white = data[data.race=='w']

pb = sum(black.call)/len(black)
pw = sum(white.call)/len(white)

dif = pw - pb

print('Number of resumes with black-sounding names: ', len(black))
print('Number of resumes with white-sounding names: ', len(white))

print('\nProportion of resumes with black-sounding names called: ', pb)
print('Proportion of resumes with white-sounding names called: ', pw)

print('\nDifference between white and black proportions: ', dif)

Number of resumes with black-sounding names:  2435
Number of resumes with white-sounding names:  2435

Proportion of resumes with black-sounding names called:  0.064476386037
Proportion of resumes with white-sounding names called:  0.0965092402464

Difference between white and black proportions:  0.0320328542094


Here we can see that the dataset is perfectly balanced: both groups have the same number of observations, which makes them directly comparable. <br>

The two sample proportions also differ by about 3.2 percentage points. Next we check whether a difference this large could plausibly arise by chance. <br>

Let's go:

In [31]:
# Sampling variance of each group's callback proportion, then the
# standard deviation of the difference between the two proportions

# IMPORTANT: each call is a Bernoulli trial, so the sampling variance of a proportion is p*(1-p)/n

black_var = pb*(1-pb)/len(black)
white_var = pw*(1-pw)/len(white)

# Standard deviation of the difference

dif_std = np.sqrt(black_var + white_var)

# Under the CLT the sampling distribution is approximately normal, so find the z critical value for 95% confidence

tr = stats.norm.ppf(.975)

# Margin of error

margin = tr*dif_std

# Confidence interval

conf = (dif - margin, dif + margin)

# p value

# Under the null hypothesis pw - pb = 0, which is the mean of the sampling distribution of the difference.
# Its standard deviation uses the pooled callback proportion and the size of each group:


pop_prop = sum(data.call)/len(data)
dif_std_null = np.sqrt(pop_prop*(1-pop_prop)*(1.0/len(black) + 1.0/len(white)))

z = dif / dif_std_null

# Two-sided p-value: probability of a |z| at least this large under the null
pvalue = 2*stats.norm.sf(abs(z))


# Results.. 

print("Margin of error: ", margin)
print("Confidence Interval (95%): ", conf)
print("\nP-value: ", "{:.2e}".format(pvalue))


Margin of error:  0.0152551260282
Confidence Interval (95%):  (0.016777728181230755, 0.047287980237660412)

P-value:  3.98e-05
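
The same calculation can be packaged as a small helper that needs only the four counts. This is a sketch, not part of the original exercise: the function name `two_proportion_ztest` is hypothetical, it relies on `statistics.NormalDist` from the Python standard library (3.8+) instead of scipy, and the 235 callbacks for white-sounding names is inferred from the printed proportion.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2, confidence=0.95):
    """Two-sided z-test and confidence interval for p2 - p1 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1
    # Unpooled standard error for the confidence interval
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = crit * se
    # Pooled standard error for the test, since H0 assumes p1 == p2
    pooled = (x1 + x2) / (n1 + n2)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"diff": diff, "margin": margin,
            "ci": (diff - margin, diff + margin), "z": z, "p_value": p_value}

# black-sounding: 157 of 2435 called; white-sounding: 235 of 2435 (inferred)
result = two_proportion_ztest(157, 2435, 235, 2435)
print(result)
```

Passing counts instead of a DataFrame keeps the helper independent of the `.dta` file, so it can be reused for any other pair of groups.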


# Question 4

The results above give statistical evidence of a real difference in callback rates between resumes with black-sounding names and resumes with white-sounding names.

In the sample, resumes with white-sounding names were called back about 3.2 percentage points more often. The 95% confidence interval for that difference runs from about 1.7 to 4.7 percentage points and does not include zero, so the gap seen in the sample is a plausible estimate of the gap in the population.

Finally, the p-value is very small: if race had no effect on callbacks, a difference this large would be very unlikely to arise by chance alone.

# Question 5

No, this analysis is not strong evidence that race is the most important factor, because it looks at only one variable. The dataset has 65 columns describing the resumes, and another one of them, or a combination of them, could be more relevant than race. Once the other columns are taken into account, the effect of race could also turn out to be even stronger than the one I found.

This could be assessed with a regression that includes the other columns as predictors (logistic regression suits the binary call outcome), or with chi-squared tests of each categorical column against call.
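
As a sketch of the chi-squared idea, the test of independence for a 2×2 table can be computed by hand. Only the race-by-call table is used below, because its counts are the only ones visible in this notebook (the 235 is inferred from the printed proportion). The function name `chi2_2x2` is hypothetical, and the same function would apply to any other binary column once its counts are tabulated. With one degree of freedom the p-value reduces to `erfc(sqrt(chi2 / 2))`, so no external library is needed.

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p_value). Hypothetical helper."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if row and column were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of chi-squared with 1 degree of freedom
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

# Rows: black-sounding, white-sounding; columns: called, not called
stat, p_value = chi2_2x2(157, 2435 - 157, 235, 2435 - 235)
print("chi2 =", stat, "p-value =", p_value)
```

For a 2×2 table this statistic is the square of the pooled two-proportion z statistic, so it is the same test in another form. Chi-squared tests also extend to columns with more than two categories, although the `erfc` shortcut used here is specific to one degree of freedom.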