# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [4]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [5]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


# 1. What test is appropriate for this problem? Does CLT apply?

In this problem we compare two groups to check whether there is racial discrimination. Because the outcome (`call`) is binary, the appropriate test is a two-sample z-test for the difference between two proportions.

In [27]:
Nwhite = len(data[data.race=="w"])
Nblack = len(data[data.race=="b"])
Nwhite, Nblack

(2435, 2435)

The sample contains 2435 résumés with white-sounding names and 2435 with black-sounding names. With samples this large the CLT applies: the sampling distribution of each group's callback proportion is approximately normal.
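A common rule of thumb for applying the CLT to proportions is that each group should contain at least 10 successes and 10 failures. The sketch below checks that condition. It assumes the callback counts implied by the cells in this notebook: 157 callbacks for black-sounding names, and 235 for white-sounding names (derived from the proportion 0.0965 of 2435). The helper name `clt_ok` and the threshold of 10 are illustrative choices, not part of the original analysis.

```python
# Callback counts (assumption: 157 is reported in the notebook,
# 235 is derived from 0.0965 * 2435 for white-sounding names).
groups = {"b": (157, 2435), "w": (235, 2435)}


def clt_ok(successes, n, threshold=10):
    """Rule of thumb: at least `threshold` successes and failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


# One boolean per group: True means the normal approximation is reasonable.
checks = {race: clt_ok(k, n) for race, (k, n) in groups.items()}
print(checks)
```

Both groups pass the check by a wide margin, which supports the normal approximation used in the rest of the analysis.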

# 2. What are the null and alternate hypotheses?

### Null Hypothesis

$H_0$: No difference between the two distributions

$P_b=P_w$

$P_b-P_w=0$

### Alternate hypothesis

$H_1$: There is a difference between the two distributions

$P_b\neq P_w$

$P_b-P_w\neq0$

# 3. Compute margin of error, confidence interval, and p-value.

Let's model each callback as a Bernoulli trial. A résumé with a black-sounding name receives a call with probability $P_b$, and a résumé with a white-sounding name receives a call with probability $P_w$.

Now for a Bernoulli distribution we have that $\mu = P$ so

$$ \mu_b=P_b$$
$$ \sigma_b^2=P_b(1-P_b)$$
$$ \mu_w=P_w$$
$$ \sigma_w^2=P_w(1-P_w)$$

In [50]:
black = data[data.race=="b"]
white = data[data.race=="w"]
P_b=len(black.call[black.call!=0])*1.0/len(black.call)
S_b2=P_b*(1-P_b)/len(black.call)
P_w=len(white.call[white.call!=0])*1.0/len(white.call)
S_w2=P_w*(1-P_w)/len(white.call)

The estimated probabilities of receiving a call for black-sounding and white-sounding names are:

In [51]:
P_b,P_w

(0.06447638603696099, 0.09650924024640657)

with standard errors $\sqrt{P(1-P)/n}$ of

In [56]:
np.sqrt(S_b2),np.sqrt(S_w2)

(0.0049771214428119461, 0.0059840721781280661)

respectively.
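The division by the sample size in `S_b2` and `S_w2` comes from the variance of a sample proportion: averaging $n$ independent Bernoulli trials gives a variance of $P(1-P)/n$. The simulation below is a minimal sketch of that fact, assuming a callback probability of 0.0645 (close to $P_b$) and a fixed random seed; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

p = 0.0645      # assumed callback probability, close to P_b
n = 2435        # résumés per group
n_sim = 20000   # number of simulated samples

# Each binomial draw is the number of callbacks in one simulated sample
# of n résumés; dividing by n gives the sample proportion.
p_hat = rng.binomial(n, p, size=n_sim) / n

empirical_se = p_hat.std(ddof=1)
theoretical_se = np.sqrt(p * (1 - p) / n)

print(float(empirical_se), float(theoretical_se))
```

The two values should agree to within about a percent, which is the sampling noise expected from 20000 simulated samples.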

The difference between the two proportions is $\mu_{bw}$, with standard error $\sigma_{bw}$:

In [63]:
mu_bw = P_b-P_w
S_bw = np.sqrt(S_b2 +S_w2)
mu_bw, S_bw

(-0.032032854209445585, 0.0077833705866767544)

To quantify the uncertainty in $\mu_{bw}$ we use the $95\%$ confidence interval, whose margin of error is $1.96\,\sigma_{bw}\approx0.015$.

In [62]:
CI_min=mu_bw-1.96*S_bw
CI_max=mu_bw+1.96*S_bw
CI_min,CI_max

(-0.047288260559332024, -0.016777447859559147)

$(-0.047,-0.017)$
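The same interval can be obtained without hard-coding 1.96 by asking `scipy.stats.norm.ppf` for the critical value. The sketch below is self-contained and assumes the callback counts implied by this notebook (157 and 235 out of 2435 each); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

n_b, n_w = 2435, 2435
p_b, p_w = 157 / n_b, 235 / n_w   # assumed counts, see lead-in

diff = p_b - p_w
# Unpooled standard error of the difference between two proportions.
se = np.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)

confidence = 0.95
# Two-sided critical value: 97.5th percentile of the standard normal.
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)

margin = z_crit * se
ci = (diff - margin, diff + margin)
print(float(margin), float(ci[0]), float(ci[1]))
```

Changing `confidence` to 0.99 widens the interval accordingly, which is harder to do correctly when the critical value is typed in by hand.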

Now let's compute the p-value, assuming that the null hypothesis is true.

Under $H_0$, $P_b=P_w=P$. Since both groups have the same size $N$, the pooled estimate is $P=\frac{P_b+P_w}{2}$, and the standard error of the difference is $\sigma=\sqrt{\frac{2P(1-P)}{N}}$

In [67]:
P = (P_b+P_w)/(2)
Sigma = np.sqrt( 2*P*(1-P)/Nblack )
P,Sigma

(0.08049281314168377, 0.0077968940361704568)

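The test statistic is the observed difference measured in units of its standard error under $H_0$:
$$ z=\frac{\mu_{bw}-0}{\sigma}=\frac{-0.032-0.0}{0.0078}$$

In [70]:
z = (mu_bw-0.0)/Sigma
round(z, 3)

-4.108

For a two-sided test at the $5\%$ level the critical z value is 1.96, and $|z|\approx4.1$ lies far in the tail of the normal distribution; the corresponding two-sided p-value is about $4\times10^{-5}$. It is therefore very unlikely to observe a difference of $-0.032$ if $P_b=P_w$, so we reject the null hypothesis: there is a difference between the two distributions.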
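As a cross-check, the two-proportion z statistic and its two-sided p-value can be computed directly from the callback counts. This is a minimal sketch, assuming the counts implied by this notebook (157 and 235 callbacks out of 2435 each) and using the survival function `scipy.stats.norm.sf` for the tail probability; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

k_b, k_w, n = 157, 235, 2435      # assumed counts, see lead-in
p_b, p_w = k_b / n, k_w / n

# Pooled proportion under H0 (both groups share one callback rate).
p_pool = (k_b + k_w) / (2 * n)
se_pool = np.sqrt(2 * p_pool * (1 - p_pool) / n)

# z statistic: observed difference in units of its standard error.
z = (p_b - p_w) / se_pool

# Two-sided p-value: probability of a |z| at least this large under H0.
p_value = 2 * stats.norm.sf(abs(z))
print(float(z), float(p_value))
```

Note that the p-value is a tail probability, so it always lies between 0 and 1; the z statistic itself is what gets compared with the critical value 1.96.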

# 4. Write a story describing the statistical significance in the context of the original problem.

The study examined the level of racial discrimination in the United States labor market by randomly assigning black-sounding or white-sounding names to 4870 identical résumés and observing the impact on requests for interviews from employers. Résumés with black-sounding names received a callback about $6.4\%$ of the time, compared with about $9.7\%$ for white-sounding names. The difference of about $3.2$ percentage points ($95\%$ confidence interval: $1.7$ to $4.7$) is statistically significant at the $1\%$ level.

# 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

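The analysis above shows that the difference in callback rates is statistically significant; it does not show that race/name is the most important factor, because the other résumé attributes in the dataset were not compared. To check that the result is not due to chance, we run a permutation test: we randomly reassign the names and check how often the simulations produce a difference as large as the observed one.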

In [71]:
class Hypothesis_testing(object):
    
    def distance(self, dist1, dist2):
        '''
        test statistic: difference between the two sample means
        '''
        return dist1.mean() - dist2.mean()
    
    def resample_with_model(self, n, pool):
        '''
        uses permutation to create samples from Ho
        '''
        np.random.shuffle(pool)  # shuffles the pooled array in place
        return self.distance(pool[:n], pool[n:])
    
    def test_hypothesis(self, dist1, dist2, niters=1000):
        '''
        one-sided permutation p-value, in the direction of the observed difference
        '''
        n = len(dist1)
        pool= np.hstack((dist1, dist2))
        sampling_dist= np.array([self.resample_with_model(n, pool) for i in range(niters)] )
        observed_d= self.distance(dist1, dist2)
        # np.mean returns a float fraction (sum(...)/len(...) truncates to 0 under Python 2)
        if (observed_d >= 0):
            p_value= np.mean(sampling_dist>= observed_d)
        else:
            p_value= np.mean(sampling_dist<= observed_d)
        #plt.hist(sampling_dist)
        #plt.plot(observed_d*np.ones(300), np.arange(300), color='gray')
        return p_value

In [76]:
# named `tester` so that the scipy `stats` import is not overwritten
tester= Hypothesis_testing()
p_value= tester.test_hypothesis(white.call, black.call)
p_value

0

The result above suggests that $p<0.001$, which is the smallest non-zero value that 1000 permutations can resolve. Since the p-value is smaller than $1\%$, the null hypothesis is rejected: the difference in callback proportions is unlikely to be due to chance alone.
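Because the hypotheses in section 2 are two-sided, a two-sided permutation test is a natural variant. The sketch below is one possible version, not the method used above: it rebuilds the two groups from the assumed callback counts (157 and 235 out of 2435), uses a seeded NumPy generator for reproducibility, and adds 1 to the numerator and denominator so that the estimated p-value is never exactly zero. The function name `permutation_p_value` and the choice of 2000 iterations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 2435
# Synthetic 0/1 callback vectors rebuilt from the assumed counts.
black_calls = np.r_[np.ones(157), np.zeros(n - 157)]
white_calls = np.r_[np.ones(235), np.zeros(n - 235)]


def permutation_p_value(a, b, n_iter=2000, rng=rng):
    """Two-sided permutation p-value for a difference in means."""
    observed = a.mean() - b.mean()
    pool = np.concatenate([a, b])
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        # Randomly reassign group labels by permuting the pooled outcomes.
        shuffled = rng.permutation(pool)
        diffs[i] = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
    # Count permutations at least as extreme as the observed difference.
    extreme = np.sum(np.abs(diffs) >= abs(observed))
    # The +1 correction counts the observed labelling as one permutation.
    return (extreme + 1) / (n_iter + 1)


p_perm = permutation_p_value(white_calls, black_calls)
print(float(p_perm))
```

With this correction the smallest reportable value is $1/(n_{iter}+1)$, which makes explicit how much resolution a given number of permutations provides.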