
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [6]:
# Load the Stata file and keep only the two columns used in this analysis.
data = pd.read_stata('data/us_job_market_discrimination.dta')
data = data[['race', 'call']]

In [33]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0
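
The same count, together with the group size and callback rate, can be read off in one pass with `groupby`. The sketch below is illustrative only: it rebuilds a stand-in DataFrame (`demo`, a hypothetical name) from the counts reported in this notebook so that it runs without the `.dta` file, and `summary` is likewise a name chosen for this example.

```python
import numpy as np
import pandas as pd

# Stand-in for the real data, rebuilt from the counts in this notebook:
# 'b': 157 callbacks and 2278 non-callbacks; 'w': 235 callbacks and 2200 non-callbacks.
demo = pd.DataFrame({
    'race': ['b'] * 2435 + ['w'] * 2435,
    'call': np.concatenate([
        np.ones(157), np.zeros(2278),   # black-sounding names
        np.ones(235), np.zeros(2200),   # white-sounding names
    ]),
})

# Per group: number of callbacks (sum), number of resumes (count), callback rate (mean).
summary = demo.groupby('race')['call'].agg(['sum', 'count', 'mean'])
print(summary)
```

With the real data, replacing `demo` with `data` should give the same table.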

1) What test is appropriate for this problem? Does CLT apply?

The conditions needed to apply the CLT are:

a. Independence - Names that sound black or white were randomly assigned to the resumes. If done correctly, this should ensure sufficient independence in the data.

b. Sample size - Each group contains 2,435 resumes, well above 30, and each group has more than 10 callbacks and more than 10 non-callbacks.

So the CLT applies: the sampling distribution of each group's callback proportion is nearly normal and centered at the population proportion. A two-sample z-test for the difference between two proportions is therefore appropriate.
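
For proportions, the sample-size condition is commonly checked with the success-failure rule: at least 10 observed successes and 10 observed failures in each group. A minimal sketch of that check, using the counts from the tables in this notebook; the helper `meets_success_failure` and the threshold of 10 are assumptions made for this example.

```python
# Counts taken from the groupby tables in this notebook.
counts = {
    'b': {'calls': 157, 'n': 2435},
    'w': {'calls': 235, 'n': 2435},
}

def meets_success_failure(calls, n, threshold=10):
    """Return True if both successes and failures reach the threshold."""
    return calls >= threshold and (n - calls) >= threshold

checks = {race: meets_success_failure(c['calls'], c['n'])
          for race, c in counts.items()}
print(checks)
```

Both groups pass, which supports using the normal approximation.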

2) What are the null and alternate hypotheses?

The null hypothesis is that the difference in callback rates seen in the sample between resumes with black-sounding names and resumes with white-sounding names is due to chance.

So in the population, the proportion of resumes with black-sounding names that receive a call should equal the proportion of resumes with white-sounding names that receive a call.

The alternative hypothesis is that the difference seen in the sample is not due to chance and the callback rate for white-sounding names exceeds the callback rate for black-sounding names.
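
In symbols, with $p_w$ and $p_b$ denoting the population callback proportions for white-sounding and black-sounding names (notation introduced here for illustration), these one-sided hypotheses can be written as:

$$
H_0: p_w - p_b = 0 \qquad \text{versus} \qquad H_A: p_w - p_b > 0
$$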

3) Compute margin of error, confidence interval, and p-value.

In [21]:
data_white = data[data.race.str.contains('w')]
total_white_group = data_white.groupby('call').count()
total_white_group

| call | race |
|------|------|
| 0    | 2200 |
| 1    | 235  |


In [26]:
total_white = 2200 + 235
p_white_selected = 235.0 / total_white
p_white_selected

0.09650924024640657

In [41]:
data_black = data[data.race.str.contains('b')]
total_black_group = data_black.groupby('call').count()
total_black_group

| call | race |
|------|------|
| 0    | 2278 |
| 1    | 157  |


In [42]:
total_black = 2278 + 157
p_black_selected = 157.0 / total_black
p_black_selected

0.06447638603696099

In [48]:
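import math
# Expected callback count (n * p) and binomial standard deviation of that count.
mean_black = p_black_selected * total_black
std_black = math.sqrt(p_black_selected * total_black * (1 - p_black_selected))
# NOTE: this equals sqrt(p * (1 - p)), the spread of a single 0/1 outcome;
# divide by sqrt(total_black) again for the standard error of the proportion.
std_error_black = std_black / math.sqrt(total_black)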
std_error_black

0.24559963697158382

In [51]:
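# NOTE: std_error_white equals sqrt(p * (1 - p)), the spread of a single 0/1 outcome;
# divide by sqrt(total_white) again for the standard error of the proportion.
mean_white = p_white_selected * total_white
std_white = math.sqrt(p_white_selected * total_white * (1 - p_white_selected))
std_error_white = std_white / math.sqrt(total_white)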
std_error_white

0.295288345170391
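
Question 3 asks for the margin of error, confidence interval, and p-value for the difference in callback rates. The following is a minimal sketch of one way to compute them, assuming a two-sample z-test for proportions and a 95% confidence level (the level is an assumption; the exercise does not state one). It uses the counts from the tables in this notebook and the standard error of a sample proportion, sqrt(p(1 - p)/n). All variable names are chosen for this example.

```python
import math
from scipy import stats

# Counts from the groupby tables in this notebook.
calls_w, n_w = 235.0, 2435
calls_b, n_b = 157.0, 2435

p_w = calls_w / n_w
p_b = calls_b / n_b
diff = p_w - p_b

# Unpooled standard error of the difference, used for the confidence interval.
se_diff = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)

# 95% margin of error and confidence interval (assumed confidence level).
z_crit = stats.norm.ppf(0.975)
margin = z_crit * se_diff
ci = (diff - margin, diff + margin)

# Pooled standard error under H0 (equal proportions), used for the test statistic.
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1.0 / n_w + 1.0 / n_b))
z = diff / se_pool

# One-sided p-value for the alternative that p_w exceeds p_b.
p_value = stats.norm.sf(z)

print('diff=%.4f margin=%.4f ci=(%.4f, %.4f) z=%.2f p=%.2g'
      % (diff, margin, ci[0], ci[1], z, p_value))
```

Under these assumptions the difference is roughly 0.032 with a z statistic of about 4.1, and the confidence interval excludes zero. For a two-sided test, the p-value would be doubled.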