
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns

In [2]:
# Load the Stata file with the public pandas reader
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

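| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |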


We need to focus on two columns for this particular exercise - race and call. Below we will construct a dataframe that removes all extraneous variables from the data provided. We can then also construct two dataframes by race.

In [5]:
df = data.loc[:,["race", "call"]]
df_white = df[df.race == "w"]
df_black = df[df.race == "b"]
print("Total resumes: {}, Resumes assigned race = black: {}, Resumes assigned race = white: {}".format(
        len(df), len(df_black), len(df_white)))

Total resumes: 4870, Resumes assigned race = black: 2435, Resumes assigned race = white: 2435


## What test is appropriate for this problem? Does CLT apply?

We want to perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes. Since `call` is a binary outcome, each group's callback rate is a proportion, so the most appropriate test for this problem is a two-sample z-test comparing two proportions.

To apply the central limit theorem, our sample data must meet certain assumptions and conditions. These conditions are tailored to the fact that we are performing a two-proportion comparison.

First, let's suppose we have two populations (white and black) with proportions equal to $P_1$ and $P_2$. Suppose further that we take all possible samples of size $n_1$ and $n_2$.


* The size of each population is large relative to the sample drawn from the population. That is, $N_1$ is large relative to $n_1$, and $N_2$ is large relative to $n_2$.


* The samples from each population are big enough to justify using a normal distribution to model differences between proportions. The sample sizes will be big enough when the following conditions are met: $n_1P_1 > 10$, $n_1(1 -P_1) > 10$, $n_2P_2 > 10$, and $n_2(1 - P_2) > 10$.


* The samples are independent; that is, observations in population 1 are not affected by observations in population 2, and vice versa.


We know that the samples are independent since the data came from a randomized field experiment involving nearly 5,000 résumés sent in response to over 1,300 newspaper ads. As our population consists of all black and white US job seekers, it is also straightforward to see that $N_1$ and $N_2$ are large relative to $n_1$ and $n_2$ respectively.

We just need to confirm that $n_1P_1 > 10$, $n_1(1 -P_1) > 10$, $n_2P_2 > 10$, and $n_2(1 - P_2) > 10$. This can be seen to be true from below.


In [6]:
n1 = len(df_white)
n2 = len(df_black)
p1 = sum(df_white.call)/n1
p2 = sum(df_black.call)/n2

In [7]:
n1*p1, n1*(1-p1), n2*p2, n2*(1-p2)

(235.0, 2200.0, 157.0, 2278.0)

With these conditions met, we now know the following:

* The sampling distribution of the difference between sample proportions is approximately normal. We know this from the central limit theorem.

* The expected value of the difference between all possible sample proportions is equal to the difference between population proportions. Thus, $E(p_1 - p_2) = P_1 - P_2$.

* The standard deviation of the difference between sample proportions ($\sigma_d$) is approximately equal to:

$\sigma_d = \sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}$
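
As a rough illustration, the standard deviation formula above can be evaluated by plugging in the sample proportions as estimates of $P_1$ and $P_2$. The sketch below uses only the Python standard library and hard-codes the callback counts reported earlier in this notebook (235 of 2,435 white-sounding and 157 of 2,435 black-sounding résumés). The variable names are illustrative and are not part of the original analysis.

```python
import math

# Counts hard-coded from the notebook output above (an assumption for illustration)
calls_white, n_white = 235, 2435
calls_black, n_black = 157, 2435

# Sample proportions stand in for the unknown population proportions P_1 and P_2
p_white = calls_white / n_white
p_black = calls_black / n_black

# Unpooled standard error of the difference between the two sample proportions
se_unpooled = math.sqrt(
    p_white * (1 - p_white) / n_white + p_black * (1 - p_black) / n_black
)
print(se_unpooled)
```

This unpooled form is the one normally used for a confidence interval on the difference; a hypothesis test that assumes equal proportions typically uses a pooled estimate instead.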


## What are the null and alternate hypotheses?

$H_0$: No difference between proportions, $P_1 = P_2$

$H_1$: Difference between proportions, $P_1 \neq P_2$

Here we also define our significance level. Let's take $\alpha = 0.05$.

## Compute margin of error, confidence interval, and p-value.

Assume $H_0$ is true.

If the probability of observing a difference in sample proportions at least as extreme as ours, given $H_0$, is less than $0.05$, then we can reject $H_0$.

Find the observed difference in sample proportions:

In [8]:
p_diff = p1 - p2
p_diff

0.032032854209445585

Since we are assuming no difference between $P_1$ and $P_2$ under $H_0$, we can estimate the common proportion $\bar{P}$ as if both groups were drawn from the same population. This is the pooled proportion.

In [9]:
p_pool = (sum(df_white.call) + sum(df_black.call))/(n1 + n2)
p_pool

0.080492813141683772

Calculate the standard error of the difference under $H_0$: $\sigma_{\bar{P_1}-\bar{P_2}} = \sqrt{\bar{P}(1-\bar{P})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}$

In [12]:
SE_Diff = np.sqrt(p_pool*(1-p_pool)*((1/n1)+(1/n2)))
SE_Diff

0.0077968940361704568
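
The z statistic that feeds the p-value calculation is the observed difference divided by this pooled standard error. The following is a minimal standard-library sketch, with the callback counts hard-coded from the outputs above and illustrative variable names. It uses `math.erfc` in place of `stats.norm.sf`, relying on the identity that the two-sided p-value for a standard normal equals `erfc(|z| / sqrt(2))`.

```python
import math

# Counts hard-coded from the notebook outputs above (an assumption for illustration)
calls_white, n_white = 235, 2435
calls_black, n_black = 157, 2435

# Observed difference in callback rates
diff = calls_white / n_white - calls_black / n_black

# Pooled proportion and pooled standard error under H_0 (P_1 = P_2)
pooled = (calls_white + calls_black) / (n_white + n_black)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_white + 1 / n_black))

# z statistic and two-sided p-value from the standard normal distribution
z = diff / se_pooled
p_value = math.erfc(abs(z) / math.sqrt(2))
print(z, p_value)
```

Computing `z` from the data rather than typing it in avoids rounding drift between cells.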

In [11]:
# Two-sided p-value; 4.1080 is the z statistic p_diff / SE_Diff (about 4.108), entered by hand
stats.norm.sf(4.1080)*2

3.9910010435886011e-05
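
The margin of error and confidence interval asked for in the exercise can be sketched as follows. This assumes a 95% confidence level (matching $\alpha = 0.05$) and, as is conventional for a confidence interval on a difference in proportions, uses the unpooled standard error rather than the pooled one from the hypothesis test. Counts are hard-coded from the outputs above, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Counts hard-coded from the notebook outputs above (an assumption for illustration)
calls_white, n_white = 235, 2435
calls_black, n_black = 157, 2435

p_white = calls_white / n_white
p_black = calls_black / n_black
diff = p_white - p_black

# Unpooled standard error: no assumption that the two proportions are equal
se_unpooled = math.sqrt(
    p_white * (1 - p_white) / n_white + p_black * (1 - p_black) / n_black
)

# Critical value for a two-sided 95% interval (assumed confidence level)
z_crit = NormalDist().inv_cdf(0.975)

margin_of_error = z_crit * se_unpooled
ci_low, ci_high = diff - margin_of_error, diff + margin_of_error
print(margin_of_error, (ci_low, ci_high))
```

Under these assumptions the interval runs from roughly 0.017 to 0.047 and excludes zero, which is consistent with the p-value of about $4 \times 10^{-5}$ being far below $\alpha = 0.05$. On that basis we would reject $H_0$: the difference in callback rates between white-sounding and black-sounding names is statistically significant.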