# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

In [None]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [None]:
# number of callbacks for black-sounding names
data[data.race == 'b'].call.sum()

In [None]:
data.head()

<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

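##### For this problem, a two-sample z-test for proportions is appropriate, and bootstrap resampling offers a simulation-based alternative.
##### The CLT applies: names were assigned to résumés at random, and each group is large enough for the sample callback proportion to be approximately normally distributed.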
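The bootstrap approach mentioned above can be sketched as follows. This is a minimal illustration, not the notebook's own solution: it assumes the callback counts used later in this notebook (235 of 2435 for white-sounding names, 157 of 2435 for black-sounding names), rebuilds equivalent 0/1 arrays instead of loading the `.dta` file, and uses only the standard library. The names `bootstrap_rate`, `n_reps`, and the seed are hypothetical choices.

```python
import random

# Assumed counts, taken from the z-test cell later in this notebook.
n_w, x_w = 2435, 235  # white-sounding names: resumes, callbacks
n_b, x_b = 2435, 157  # black-sounding names: resumes, callbacks

# Rebuild 0/1 callback lists equivalent to the 'call' column of each group.
calls_w = [1] * x_w + [0] * (n_w - x_w)
calls_b = [1] * x_b + [0] * (n_b - x_b)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n_reps = 1000            # hypothetical number of bootstrap replicates


def bootstrap_rate(calls):
    """Callback rate of one bootstrap sample drawn with replacement."""
    return sum(rng.choices(calls, k=len(calls))) / len(calls)


# Bootstrap distribution of the difference in callback rates.
diffs = sorted(
    bootstrap_rate(calls_w) - bootstrap_rate(calls_b) for _ in range(n_reps)
)

observed_diff = x_w / n_w - x_b / n_b

# Percentile 95% confidence interval (integer arithmetic for the indices).
ci_low = diffs[n_reps * 25 // 1000]
ci_high = diffs[n_reps * 975 // 1000 - 1]

print("observed difference: {:.4f}".format(observed_diff))
print("bootstrap 95% CI: ({:.4f}, {:.4f})".format(ci_low, ci_high))
```

If the interval excludes zero, the bootstrap agrees with rejecting equal callback rates. More replicates would give smoother percentile estimates at the cost of run time.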

In [None]:
w = data[data.race=='w']
b = data[data.race=='b']

In [None]:
# Your solution to Q3 here

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>


#### Counts of White-Sounding and Black-Sounding Names

In [None]:
data.shape

### Total number of résumés in the experiment: 4870

In [None]:
w.shape

In [None]:
b.shape

##### Résumés with white-sounding names: 2435
##### Résumés with black-sounding names: 2435

In [None]:
whites_getting_calls = w[w.call==1]
black_getting_calls = b[b.call==1]

In [None]:
black_getting_calls.shape

In [None]:
whites_getting_calls.shape

##### Résumés with black-sounding names that received calls: 157
##### Résumés with white-sounding names that received calls: 235

### Z-Test for Comparing Two Proportions

##### Null Hypothesis: the proportions of résumés with white-sounding and black-sounding names that receive interview calls are the same.
##### Alternate Hypothesis: the proportions of résumés with white-sounding and black-sounding names that receive interview calls are not the same.
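These hypotheses can also be tested by simulation, without relying on the normal approximation. The sketch below is a permutation test, a resampling technique related to (but not the same as) the bootstrap: under the null hypothesis the race labels are exchangeable, so the 392 callbacks are scattered at random over all 4870 résumés and the difference in rates is recomputed. The counts are assumed from the z-test cell later in this notebook; `n_reps` and the seed are hypothetical choices.

```python
import random

# Assumed counts, taken from the z-test cell later in this notebook.
n_w, x_w = 2435, 235  # white-sounding names: resumes, callbacks
n_b, x_b = 2435, 157  # black-sounding names: resumes, callbacks

n_total = n_w + n_b
x_total = x_w + x_b
observed_diff = x_w / n_w - x_b / n_b

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_reps = 2000           # hypothetical number of permutations

extreme = 0
for _ in range(n_reps):
    # Place the callbacks on random resume slots; slots below n_w are
    # treated as the white-sounding group, the rest as black-sounding.
    slots = rng.sample(range(n_total), x_total)
    perm_w = sum(1 for s in slots if s < n_w)
    perm_b = x_total - perm_w
    perm_diff = perm_w / n_w - perm_b / n_b
    # Two-sided: count differences at least as large in magnitude.
    if abs(perm_diff) >= abs(observed_diff):
        extreme += 1

# Add-one correction so the estimated p-value is never exactly zero.
p_value = (extreme + 1) / (n_reps + 1)
print("permutation p-value: {:.4f}".format(p_value))
```

The resolution of this p-value is limited by `n_reps`: with 2000 permutations it cannot go below roughly 0.0005.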

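#### Two-Sample z-Test for Proportions
$$z = \frac{p_1 - p_2}{\sqrt{p\,(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where $p_1 = x_1/n_1$ and $p_2 = x_2/n_2$ are the sample callback proportions, and $p = (x_1 + x_2)/(n_1 + n_2)$ is the pooled proportion under the null hypothesis.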


In [None]:
# __future__ imports must be the first statements in the cell
from __future__ import print_function
from __future__ import division

import numpy as np
import pandas as pd
import seaborn as sns
from pylab import *  # wildcard import; brings sqrt and plotting helpers into scope
from IPython.display import Image
import matplotlib.ticker as mtick

import scipy.stats as stats
import statsmodels.stats.weightstats as wstats
from collections import OrderedDict

%matplotlib inline

In [None]:
whites_getting_calls = 235
total_sample_of_whites = 2435

black_getting_calls = 157
total_sample_of_blacks = 2435
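With the counts above, the callback rates and a common rule of thumb for the CLT can be checked directly. This is a small sketch using shortened, hypothetical variable names; the threshold of at least 10 successes and 10 failures per group is a conventional guideline assumed here, not something stated in the exercise.

```python
# Counts from the cell above.
n_w, x_w = 2435, 235  # white-sounding names: resumes, callbacks
n_b, x_b = 2435, 157  # black-sounding names: resumes, callbacks

# Sample callback rates for each group.
p_w = x_w / n_w
p_b = x_b / n_b

# Rule of thumb (assumed): each group needs >= 10 successes and >= 10 failures
# for the normal approximation to the sample proportion to be reasonable.
clt_ok = all(count >= 10 for count in (x_w, n_w - x_w, x_b, n_b - x_b))

print("white callback rate: {:.4f}".format(p_w))
print("black callback rate: {:.4f}".format(p_b))
print("CLT rule of thumb satisfied: {}".format(clt_ok))
```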

In [None]:
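# Two-sample z-test for proportions, implemented from scratch
def ztest_proportion_two_samples(x1, n1, x2, n2, one_sided=False):
    # Sample proportions of successes in each group
    p1 = x1 / n1
    p2 = x2 / n2

    # Pooled proportion and standard error under the null hypothesis p1 == p2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

    z = (p1 - p2) / se
    # Upper-tail probability of |z|; doubled for a two-sided test
    p_value = stats.norm.sf(abs(z))
    if not one_sided:
        p_value *= 2
    return z, p_value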


In [None]:
z,p = ztest_proportion_two_samples(whites_getting_calls, total_sample_of_whites, black_getting_calls, total_sample_of_blacks, one_sided=False)
print(' z-stat = {z} \n p-value = {p}'.format(z=z,p=p))

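##### The p-value is well below the conventional 0.05 significance level, so we reject the null hypothesis that the two callback proportions are equal.
##### This indicates that the race implied by the name on a résumé has a statistically significant effect on whether the résumé receives an interview call.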
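The exercise also asks for a margin of error and a confidence interval. A minimal frequentist sketch for the difference in callback rates follows. It assumes a 95% confidence level and uses the unpooled standard error, since an interval estimate does not assume the two proportions are equal. It relies only on the standard library (`statistics.NormalDist` needs Python 3.8 or later), and the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Counts from the z-test cell above.
n_w, x_w = 2435, 235  # white-sounding names: resumes, callbacks
n_b, x_b = 2435, 157  # black-sounding names: resumes, callbacks

p_w = x_w / n_w
p_b = x_b / n_b
diff = p_w - p_b

# Unpooled standard error of the difference between two proportions.
se = sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)

# Critical value for a two-sided 95% interval (assumed confidence level).
z_crit = NormalDist().inv_cdf(0.975)

margin = z_crit * se
ci_low, ci_high = diff - margin, diff + margin

print("difference in callback rates: {:.4f}".format(diff))
print("margin of error: {:.4f}".format(margin))
print("95% CI: ({:.4f}, {:.4f})".format(ci_low, ci_high))
```

An interval that lies entirely above zero is consistent with the rejection of the null hypothesis by the z-test.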