# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [5]:
data.columns

Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')

In [12]:
new_df = data[['race','call']]
w = new_df[new_df.race == 'w']
b = new_df[new_df.race == 'b']
b_callbacks = b.call.sum()
w_callbacks = w.call.sum()
percent_b = b_callbacks / len(b)
percent_w = w_callbacks / len(w)
print('Number of Black Entries: ',len(b))
print('Number of Black Callbacks: ', b_callbacks)
print('Percent of Black Callbacks: ', percent_b)
print('Number of White Entries: ', len(w))
print('Number of White Callbacks: ', w_callbacks)
print('Percent of White Callbacks: ', percent_w)


Number of Black Entries:  2435
Number of Black Callbacks:  157.0
Percent of Black Callbacks:  0.064476386037
Number of White Entries:  2435
Number of White Callbacks:  235.0
Percent of White Callbacks:  0.0965092402464


In [30]:
tabulated = pd.crosstab(index = new_df.call, columns = new_df.race)
print(tabulated)

race     b     w
call            
0.0   2278  2200
1.0    157   235


This sample deals with proportions of a discrete (binary) outcome rather than a continuous variable. The CLT applies here, so we can use the two-sample z-test for proportions described [here](http://math.mercyhurst.edu/~griff/courses/m109/Lectures/sect8.4.pdf) and below. The CLT requires that the observations be independent and randomly drawn (satisfied here because names were randomly assigned to the resumes) and that each group be large enough to contain a reasonable number of both callbacks and non-callbacks. With **_n_** = 2435 resumes per group, far more than 30, a z-statistic is more appropriate than a t-statistic.
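
The sample-size requirement can be checked directly. The sketch below uses the counts printed earlier in this notebook and assumes the common rule of thumb that each group should contain at least 10 callbacks and 10 non-callbacks; the threshold and the helper name `clt_conditions_met` are illustrative choices, not part of the original exercise.

```python
# Counts taken from the printed output earlier in this notebook.
counts = {
    "b": {"n": 2435, "callbacks": 157},
    "w": {"n": 2435, "callbacks": 235},
}


def clt_conditions_met(n, successes, threshold=10):
    """Return True when both successes and failures reach the threshold.

    The threshold of 10 is a rule of thumb (an assumption), not a hard limit.
    """
    failures = n - successes
    return successes >= threshold and failures >= threshold


# Check each group separately.
checks = {race: clt_conditions_met(c["n"], c["callbacks"]) for race, c in counts.items()}
print(checks)
```

Both groups pass comfortably, which supports using the normal approximation for the difference in proportions.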

z = $\frac{(\hat p_1 - \hat p_2) - (p_1 - p_2)}{\sqrt{\bar{p}\bar{q}(\frac{1}{n_1}+\frac{1}{n_2})}}$

Where:

* $\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$

* $\bar{q} = 1 - \bar{p}$

* $\hat p_1 = \frac{x_1}{n_1}$

* $\hat p_2 = \frac{x_2}{n_2}$

$x_1$ = total number of white callbacks: **w_callbacks**

$x_2$ = total number of black callbacks: **b_callbacks**

$n_1$ = total number of white entries: **len(w)**

$n_2$ = total number of black entries: **len(b)**

$p_1$ = true (population) proportion of white callbacks, estimated by $\hat p_1$:   **percent_w**

$p_2$ = true (population) proportion of black callbacks, estimated by $\hat p_2$:   **percent_b**

The null hypothesis $H_0$ is that there is no difference between the callback proportions for whites and blacks, so $p_1 - p_2$ is 0 under $H_0$. The alternative hypothesis is that the proportion of white callbacks **does not** equal the proportion of black callbacks.

> $H_0$: $P_w$ = $P_b$

> $H_1$: $P_w$ $\neq$ $P_b$

#### Calculate the z-statistic

In [14]:
#(p1-p2)
difference = percent_w - percent_b
#calculate p-bar
p_bar = (w_callbacks + b_callbacks) / (len(w) + len(b))
#calculate q-bar
q_bar = 1 - p_bar
#calculate p-hat 1
p_hat_1 = np.divide(w_callbacks,len(w))
#calculate p-hat 2
p_hat_2 = np.divide(b_callbacks,len(b))
#calculate (1/n1) + (1/n2)
x = np.add(np.divide(1,len(w)),np.divide(1,len(b)))

In [27]:
#calculate z-score
z = (p_hat_1 - p_hat_2) / np.sqrt(p_bar * q_bar * x)
#calculate two-sided p-value (abs(z) keeps it valid if the difference is negative)
p = (1 - stats.norm.cdf(abs(z))) * 2
print('The z-score is: {}\nThe p-value is: {}'.format(z,p))

The z-score is: 4.108412152434346
The p-value is: 3.983886837577444e-05


A z-score of 4.1 means the observed difference in callback proportions is about 4 standard errors away from the difference of 0 expected under the null hypothesis. With a p-value of about 4e-5, far below the conventional 0.05 significance level, we reject the null hypothesis. There is a statistically significant difference in the callback proportions between blacks and whites.
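
The exercise also asks for a resampling approach. A minimal sketch of a permutation test follows, assuming only the callback counts reported above; the seed, the number of permutations, and the variable names are arbitrary illustrative choices. The race labels are shuffled repeatedly to see how often a difference at least as large as the observed one arises by chance.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# Group sizes and callback counts from the printed output above.
n_w, n_b = 2435, 2435
x_w, x_b = 235, 157

# Rebuild the pooled 0/1 callback outcomes from the counts.
calls = np.concatenate([np.ones(x_w + x_b), np.zeros(n_w + n_b - x_w - x_b)])
observed_diff = x_w / n_w - x_b / n_b

n_perm = 10_000  # assumed number of permutations
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    # Shuffling the outcomes is equivalent to reassigning race labels at random.
    shuffled = rng.permutation(calls)
    perm_diffs[i] = shuffled[:n_w].mean() - shuffled[n_w:].mean()

# Two-sided p-value; the +1 terms avoid reporting an exact zero.
n_extreme = np.sum(np.abs(perm_diffs) >= abs(observed_diff))
p_perm = (n_extreme + 1) / (n_perm + 1)
print('Permutation p-value: {}'.format(p_perm))
```

With 10,000 permutations the smallest p-value this procedure can report is about 1e-4, so it can only show that the p-value is very small, not reproduce the z-test value exactly.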

#### Margin of Error
z * $\sqrt{\bar{p}\bar{q}(\frac{1}{n_1}+\frac{1}{n_2})}$

where z is the critical value for the desired confidence level. For example, for a 95% confidence interval, the table [here](http://www.statisticshowto.com/probability-and-statistics/find-critical-values/#CommonCI) gives a critical value of 1.96.
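
Instead of reading a table, the critical value can be computed. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `z_critical` is an illustrative assumption. `scipy.stats.norm.ppf(0.975)` would give the same number.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical value of the standard normal for a confidence level."""
    alpha = 1 - confidence
    # Half of alpha sits in each tail, so look up the 1 - alpha/2 quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)


z_95 = z_critical(0.95)
print(round(z_95, 2))  # 1.96
```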

In [28]:
#calculate the margin of error
moe = 1.96 * np.sqrt(p_bar * q_bar * x)
print('The margin of error is: {}'.format(moe))

The margin of error is: 0.015281912310894095
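
The margin of error above uses the pooled proportion, which assumes the null hypothesis is true. For a confidence interval, a common alternative is the unpooled standard error, which lets each group keep its own proportion. The following is a sketch of that variant using the counts from this notebook; it is offered for comparison and is not the calculation used in the cells above.

```python
import math

# Group sizes and callback counts from the printed output above.
n_w, n_b = 2435, 2435
x_w, x_b = 235, 157

p_w = x_w / n_w
p_b = x_b / n_b

# Unpooled standard error: each group contributes its own variance term.
se_unpooled = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)

# 1.96 is the 95% critical value used elsewhere in this notebook.
moe_unpooled = 1.96 * se_unpooled
ci_unpooled = (p_w - p_b - moe_unpooled, p_w - p_b + moe_unpooled)
print('Unpooled margin of error: {}'.format(moe_unpooled))
print('Unpooled 95% CI: {}'.format(ci_unpooled))
```

With these counts the unpooled margin is slightly smaller than the pooled one, and the resulting interval still excludes 0.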


#### Confidence Interval values
Using the margin of error, we can take our sample difference, add and subtract the m.o.e and establish (in this case) the 95% confidence interval. 

* Confidence Interval = ($p_1 - p_2$) $\pm$ (margin of error)

In [29]:
ci = [(difference-moe), (difference+moe)]
print('With 95% confidence, the mean difference in call back proportions between whites and blacks is in the range of:\n{}'.format(ci))

With 95% confidence, the mean difference in call back proportions between whites and blacks is in the range of:
[0.016750941898551489, 0.047314766520339682]
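
A bootstrap version of the same interval can be sketched as follows. It assumes only the group sizes and callback counts reported earlier; the seed and the number of resamples are arbitrary. Each group is resampled with replacement, and the 2.5th and 97.5th percentiles of the resampled differences form a 95% percentile interval.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Group sizes and callback counts from the printed output above.
n_w, n_b = 2435, 2435
x_w, x_b = 235, 157

# Rebuild each group's 0/1 callback outcomes from the counts.
w_calls = np.concatenate([np.ones(x_w), np.zeros(n_w - x_w)])
b_calls = np.concatenate([np.ones(x_b), np.zeros(n_b - x_b)])

n_boot = 10_000  # assumed number of bootstrap resamples
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    # Resample each group independently, with replacement.
    w_sample = rng.choice(w_calls, size=n_w, replace=True)
    b_sample = rng.choice(b_calls, size=n_b, replace=True)
    boot_diffs[i] = w_sample.mean() - b_sample.mean()

# Percentile bootstrap interval for the difference in proportions.
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
print('Bootstrap 95% CI: [{}, {}]'.format(ci_low, ci_high))
```

The exact endpoints vary slightly with the seed, but they should land close to the frequentist interval above.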


## $\chi^2$ Chi-squared Test for Equality of Proportions

The chi-square test compares the expected frequencies to observed frequencies. The null hypothesis is rejected if observed and expected frequencies are too far apart. A manual example can be seen [here](http://math.hws.edu/javamath/ryan/ChiSquare.html).

In [32]:
chi, pval, _, _ = stats.chi2_contingency(tabulated)

In [33]:
print('Chi-squared value is: {}\nThe p-value is: {}'.format(chi,pval))

Chi-squared value is: 16.44902858418937
The p-value is: 4.997578389963255e-05


With 1 degree of freedom and a chi-square statistic of about 16.4, the p-value is approximately 5e-5. This is substantially less than 0.05, so we again reject the null hypothesis that the difference in proportions is 0. The statistic differs slightly from $z^2 \approx 16.9$ because `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default. The chi-square test confirms what we derived using the z-statistic and the CLT.
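
To make the comparison of observed and expected frequencies concrete, the statistic can be reproduced by hand from the contingency table printed earlier. This is a sketch using only the standard library; the function name `chi_square` and its `yates` flag are illustrative. The p-value line relies on the identity that, for 1 degree of freedom, the upper-tail chi-square probability equals `erfc(sqrt(x / 2))`.

```python
import math

# Observed counts: rows are call = 0 and call = 1, columns are race b and w.
observed = [[2278, 2200],
            [157, 235]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count for each cell under independence of race and callback.
expected = [[r * c / total for c in col_totals] for r in row_totals]


def chi_square(observed, expected, yates=False):
    """Pearson chi-square statistic, optionally with Yates' correction."""
    stat = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                # Shrink each deviation by 0.5, without going below zero.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / e
    return stat


chi2_plain = chi_square(observed, expected)
chi2_yates = chi_square(observed, expected, yates=True)

# Upper-tail p-value, valid for 1 degree of freedom only.
p_yates = math.erfc(math.sqrt(chi2_yates / 2))
print('Uncorrected: {}\nYates-corrected: {}\np-value: {}'.format(chi2_plain, chi2_yates, p_yates))
```

The uncorrected statistic matches the square of the z-score, while the corrected one matches the value reported by `chi2_contingency`.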

### Conclusion
Both the two-sample z-test and the chi-square test give the same result: race is a statistically significant factor in the proportion of callbacks. Because the 'b' and 'w' labels were assigned randomly to the resumes, this difference is unlikely to be explained by other resume characteristics. This does not, however, mean that race is the only or the most important factor in callback success. A multivariate analysis of the other supplied variables would be needed, as some may weigh more heavily on callbacks than race.