
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [15]:
import pandas as pd
import numpy as np
from scipy import stats

In [16]:
# Load the Stata file through the public pandas reader
data = pd.read_stata('data/us_job_market_discrimination.dta')
data = data[['race', 'call']]

In [17]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

1) What test is appropriate for this problem? Does CLT apply?

The conditions needed to apply the CLT are:

a. Independence - Black-sounding and white-sounding names were randomly assigned to the resumes. If the assignment was done correctly, the observations can be treated as independent.

b. Sample size - Each group contains 2,435 resumes, far more than 30. Because the outcome is binary, the success-failure condition also matters: each group has well over 10 callbacks and 10 non-callbacks.

So the CLT applies: the sampling distribution of each sample proportion is approximately normal and centered at the population proportion. A two-sample z-test for the difference between two proportions is therefore appropriate.
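The sample-size conditions can be checked directly from the counts. The sketch below hard-codes the callback counts that appear in the per-group tables later in this notebook; the thresholds (30 observations per group, 10 per outcome) are common rules of thumb, and the names `groups`, `MIN_PER_GROUP`, `MIN_PER_CELL`, and `clt_ok` are introduced here for illustration only.

```python
# Counts taken from the per-group callback tables in this notebook.
groups = {
    "white-sounding": {"called": 235, "not_called": 2200},
    "black-sounding": {"called": 157, "not_called": 2278},
}

MIN_PER_GROUP = 30   # rule-of-thumb sample size for the CLT (assumption)
MIN_PER_CELL = 10    # success-failure condition for proportions (assumption)

clt_ok = {}
for name, counts in groups.items():
    n = counts["called"] + counts["not_called"]
    clt_ok[name] = (
        n > MIN_PER_GROUP
        and counts["called"] >= MIN_PER_CELL
        and counts["not_called"] >= MIN_PER_CELL
    )
    print(name, "n =", n, "conditions met:", clt_ok[name])
```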

2) What are the null and alternate hypotheses?

The null hypothesis is that the difference observed in the sample between the callback rate for black-sounding names and the callback rate for white-sounding names is due to chance.

So in the population, the proportion of resumes with black-sounding names that receive a call equals the proportion of resumes with white-sounding names that receive a call.

The alternative hypothesis is that the difference observed in the sample is not due to chance, and that the callback proportion for white-sounding names exceeds the callback proportion for black-sounding names.
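Written symbolically, the two hypotheses might look as follows. The symbols $p_w$ and $p_b$ are notation introduced here (not used elsewhere in the notebook) for the population callback proportions of resumes with white-sounding and black-sounding names.

```latex
H_0:\; p_w - p_b = 0
\qquad
H_A:\; p_w - p_b > 0
```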

3) Compute margin of error, confidence interval, and p-value.

In [18]:
data_white = data[data.race.str.contains('w')]
total_white_group = data_white.groupby('call').count()
total_white_group

| call | race (count) |
|------|--------------|
| 0    | 2200         |
| 1    | 235          |


In [50]:
total_white = 2200 + 235
mean_white_selected = 235.0 / total_white
mean_white_selected

0.09650924024640657

In [53]:
import math
variance_white = ((235*(1-mean_white_selected)**2) + (2200*(0-mean_white_selected)**2))/(total_white-1)
std_white = math.sqrt(variance_white)
std_white

0.2953489980097223
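For 0/1 data the sample variance has a shortcut form, $\hat{p}(1-\hat{p})\,n/(n-1)$, which should agree with the long-form sum of squared deviations used in the cell above. The sketch below compares the two and cross-checks against NumPy's `std(ddof=1)`; the variable names are illustrative, and the counts are the white-sounding group's counts from the table above.

```python
import math
import numpy as np

n_called, n_not_called = 235, 2200   # white-sounding group counts
n = n_called + n_not_called
p_hat = n_called / float(n)

# Long form, as in the cell above: squared deviation of every 0/1 outcome.
var_long = (n_called * (1 - p_hat) ** 2 + n_not_called * (0 - p_hat) ** 2) / (n - 1)

# Shortcut for 0/1 data: p(1-p), rescaled by n/(n-1) for the sample variance.
var_short = p_hat * (1 - p_hat) * n / (n - 1)

# Cross-check with NumPy on an explicit array of 0/1 outcomes.
calls = np.array([1] * n_called + [0] * n_not_called)
std_numpy = calls.std(ddof=1)

print(math.sqrt(var_long), math.sqrt(var_short), std_numpy)
```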

In [26]:
data_black = data[data.race.str.contains('b')]
total_black_group = data_black.groupby('call').count()
total_black_group

| call | race (count) |
|------|--------------|
| 0    | 2278         |
| 1    | 157          |
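The two per-group tables could also be produced in one step with a single `groupby` on `race`. To keep the sketch self-contained, it rebuilds a stand-in frame (`demo`, a hypothetical name) from the counts shown in the tables instead of reading the `.dta` file; in the notebook itself the same aggregation would presumably be applied to `data`.

```python
import pandas as pd

# Stand-in frame rebuilt from the counts in the two tables
# (assumption: in the notebook, `data` would be used instead).
demo = pd.DataFrame({
    "race": ["w"] * 2435 + ["b"] * 2435,
    "call": [1] * 235 + [0] * 2200 + [1] * 157 + [0] * 2278,
})

# Callbacks, group size, and callback rate for each race label.
summary = demo.groupby("race")["call"].agg(["sum", "count", "mean"])
summary.columns = ["callbacks", "resumes", "callback_rate"]
print(summary)
```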


In [54]:
total_black = 2278 + 157
mean_black_selected = 157.0 / total_black
mean_black_selected

0.06447638603696099

In [59]:
variance_black = ((157 *  (1-mean_black_selected) ** 2 ) + (2278 * (0 - mean_black_selected) ** 2))/(total_black-1)
std_black = math.sqrt(variance_black)
std_black

0.24565008364706123

Since the population standard deviation is unknown in both cases, we'll use the sample standard deviation as an estimate.


In [57]:
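# Caution: len(data) counts the resumes in both groups (4870),
# not only the 2435 resumes in the white-sounding group (total_white)
std_error_white = std_white / math.sqrt(len(data))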
std_error_white

0.004232247150096068

In [64]:
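# Caution: len(data) counts the resumes in both groups (4870),
# not only the 2435 resumes in the black-sounding group (total_black)
std_error_black = std_black / math.sqrt(len(data))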
std_error_black

0.0035200792060987875
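The standard error of a group's proportion is normally computed with that group's own sample size. A minimal sketch of that variant follows, using the group sizes and sample standard deviations reported earlier in the notebook; the names `se_white_group` and `se_black_group` are hypothetical and chosen so they do not overwrite the notebook's variables. With n = 2435 per group instead of 4870, the standard errors come out larger by a factor of about $\sqrt{2}$.

```python
import math

n_white = 2200 + 235   # total_white in the notebook
n_black = 2278 + 157   # total_black in the notebook
std_white = 0.2953489980097223    # sample standard deviations computed above
std_black = 0.24565008364706123

# Standard error of each group's callback proportion, using that group's own n.
se_white_group = std_white / math.sqrt(n_white)
se_black_group = std_black / math.sqrt(n_black)

print(se_white_group, se_black_group)
```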

So now we can calculate the margin of error for a 95% confidence interval for the two cases:

In [66]:
margin_error_white = 1.96 * std_error_white
margin_error_white

0.008295204414188294

In [67]:
margin_error_black = 1.96 * std_error_black
margin_error_black

0.006899355243953623

In [71]:
conf_interval_white_upper = (mean_white_selected + margin_error_white) * 100
conf_interval_white_lower = (mean_white_selected - margin_error_white) * 100
# print() with a single argument runs under both Python 2 and Python 3
print(str(conf_interval_white_upper) + " %")
print(str(conf_interval_white_lower) + " %")

10.4804444661 %
8.82140358322 %


In [73]:
conf_interval_black_upper = (mean_black_selected + margin_error_black) * 100
conf_interval_black_lower = (mean_black_selected - margin_error_black) * 100
# print() with a single argument runs under both Python 2 and Python 3
print(str(conf_interval_black_upper) + " %")
print(str(conf_interval_black_lower) + " %")

7.13757412809 %
5.7577030793 %
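The exercise also asks for a p-value, and the question of interest is the difference between the two callback proportions. One way to approach it, sketched here under the assumption of a one-sided two-proportion z-test, is to use an unpooled standard error for the confidence interval of the difference and a pooled standard error for the test statistic. All variable names below are illustrative; the counts are the ones from the tables above.

```python
import math
from scipy import stats

called_white, n_white = 235, 2435
called_black, n_black = 157, 2435

p_white = called_white / float(n_white)
p_black = called_black / float(n_black)
diff = p_white - p_black

# 95% confidence interval for the difference (unpooled standard error).
se_diff = math.sqrt(p_white * (1 - p_white) / n_white +
                    p_black * (1 - p_black) / n_black)
z_crit = stats.norm.ppf(0.975)   # two-sided 95% critical value, about 1.96
margin_diff = z_crit * se_diff
ci_diff = (diff - margin_diff, diff + margin_diff)

# Test statistic under H0 (equal proportions): pooled standard error.
p_pooled = (called_white + called_black) / float(n_white + n_black)
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1.0 / n_white + 1.0 / n_black))
z_stat = diff / se_pooled
p_value = stats.norm.sf(z_stat)  # one-sided: P(Z > z_stat)

print("difference:", diff)
print("95% CI:", ci_diff)
print("z:", z_stat, "one-sided p-value:", p_value)
```

If the interval for the difference excludes zero and the p-value falls below the chosen significance level (0.05 for a 95% interval), the data would be inconsistent with the null hypothesis of equal callback rates.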
