# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [104]:
import pandas as pd
import numpy as np
from scipy import stats

In [105]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [106]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [107]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [108]:
# Inspect the columns of the dataframe:
data.info()
# check NULL count for each column in the dataframe:
data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 65 columns):
id                    4870 non-null object
ad                    4870 non-null object
education             4870 non-null int8
ofjobs                4870 non-null int8
yearsexp              4870 non-null int8
honors                4870 non-null int8
volunteer             4870 non-null int8
military              4870 non-null int8
empholes              4870 non-null int8
occupspecific         4870 non-null int16
occupbroad            4870 non-null int8
workinschool          4870 non-null int8
email                 4870 non-null int8
computerskills        4870 non-null int8
specialskills         4870 non-null int8
firstname             4870 non-null object
sex                   4870 non-null object
race                  4870 non-null object
h                     4870 non-null float32
l                     4870 non-null float32
call                  4870 non-null float32
city        

id            0
ad            0
education     0
ofjobs        0
yearsexp      0
             ..
trade         0
busservice    0
othservice    0
missind       0
ownership     0
Length: 65, dtype: int64

*Since there are no null values, no data wrangling is required.*

In [109]:
# display the fields of relevance for EDA
print(data[['race', 'call']])

     race  call
0       w   0.0
1       w   0.0
2       b   0.0
3       b   0.0
4       w   0.0
...   ...   ...
4865    b   0.0
4866    b   0.0
4867    w   0.0
4868    b   0.0
4869    w   0.0

[4870 rows x 2 columns]


In [110]:
# number of callbacks for black-sounding names
total_b_calls = sum(data[data.race == 'b'].call)
print("Total calls for black-sounding names:", total_b_calls)
# number of black-sounding names
total_b = data[data.race == 'b'].race.size
print("Total black-sounding names:",total_b)
# Proportion of blacks that were called
prop_b = total_b_calls/total_b * 100
print("Proportion of blacks that were called %:",prop_b)

Total calls for black-sounding names: 157.0
Total black-sounding names: 2435
Proportion of blacks that were called %: 6.447638603696099


In [111]:
# total number of names
print("Total population:",data.race.size)

Total population: 4870


In [112]:
# number of callbacks for white-sounding names
total_w_calls = sum(data[data.race == 'w'].call)
print("Total calls for white-sounding names:",total_w_calls)
# number of white-sounding names
total_w = data[data.race == 'w'].race.size
print("Total white-sounding names:",total_w)
# Proportion of whites that were called
prop_w = total_w_calls/total_w * 100
print("Proportion of whites that were called %:",prop_w)

Total calls for white-sounding names: 235.0
Total white-sounding names: 2435
Proportion of whites that were called %: 9.650924024640657


In [113]:
# Difference in callback rates (percentage points)
diff = prop_w - prop_b
print("Proportion difference:",diff)

Proportion difference: 3.2032854209445585


**Q1. What test is appropriate for this problem? Does CLT apply?**

Each resume yields a binary outcome (callback = 1, no callback = 0), so each callback is a Bernoulli trial and the number of callbacks in each group follows a binomial distribution. The CLT applies because its conditions are met: names were assigned to resumes at random, the observations can be treated as independent, and each group has at least 10 callbacks and at least 10 non-callbacks (np >= 10 and n(1-p) >= 10). Hence, a **Difference in Proportion Confidence Interval and Two-Sided Z-Test** is appropriate for comparing these two callback rates.
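The success-failure condition mentioned above can be checked directly from the counts reported earlier (157 and 235 callbacks out of 2435 resumes each). This is a minimal sketch; the helper name `normal_approx_ok` and the threshold of 10 are illustrative choices, not part of the original notebook.

```python
# Counts copied from the notebook output above.
groups = {
    "b": {"calls": 157, "n": 2435},
    "w": {"calls": 235, "n": 2435},
}


def normal_approx_ok(successes, n, threshold=10):
    """Success-failure check: n*p and n*(1-p) must both reach `threshold`.

    With p estimated as successes/n, n*p is the success count and
    n*(1-p) is the failure count.
    """
    failures = n - successes
    return successes >= threshold and failures >= threshold


for race, g in groups.items():
    print(race,
          "successes:", g["calls"],
          "failures:", g["n"] - g["calls"],
          "normal approximation OK:", normal_approx_ok(g["calls"], g["n"]))
```

Both groups clear the threshold by a wide margin, which supports using the normal approximation for the difference in proportions.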

**Q2. What are the null and alternate hypotheses?**

<font color=blue>**Ho: There is no difference between the callback rates for black-sounding and white-sounding names (p2 - p1 = 0). <br>
H1: There is a difference between the callback rates for black-sounding and white-sounding names (p2 - p1 ≠ 0).**</font><br>
Each group has far more than 30 observations, so a z-statistic is appropriate.<br>
Significance Level = .05

In [115]:
# percentage callback and variance for black-sounding names
p1 = prop_b/100
n1 = total_b
var_b = (p1*(1-p1)/n1)
print("Variance for black:",var_b)
# percentage callback and variance for white-sounding names
p2 = prop_w/100
n2 = total_w
var_w = (p2*(1-p2)/n2)
print("Variance for white:",var_w)

Variance for black: 2.4771737856498466e-05
Variance for white: 3.580911983304638e-05


In [116]:
# Sampling distribution of p2 - p1: the variances of two independent estimates add
var_b_w = var_b + var_w
std_b_w = np.sqrt(var_b_w)
print("Difference in proportions:",abs(p1-p2))

Difference in proportions: 0.032032854209445585


**Q3. Compute margin of error, confidence interval, and p-value?**

In [117]:
# 95% confidence interval for the difference in callback rates (about 0.03)
# The critical z value for 95% confidence is 1.96 according to the z-table.
# computing the margin of error
moe = 1.96*std_b_w
print("Margin of error:",moe)
#computing the confidence interval
ci = abs(p1-p2) + np.array([-1, 1]) * moe
print("Confidence interval", ci)

Margin of error: 0.015255406349886438
Confidence interval [0.01677745 0.04728826]
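
The exercise also asks for a bootstrap approach, which the cells above do not include. The following is a minimal sketch of a percentile bootstrap for the same confidence interval. It rebuilds the 0/1 callback arrays from the counts reported above rather than reading the data file; the seed, the number of replicates, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary fixed seed for reproducibility

# Rebuild the 0/1 callback arrays from the counts reported above.
n_b, calls_b = 2435, 157
n_w, calls_w = 2435, 235
b_calls = np.concatenate([np.ones(calls_b), np.zeros(n_b - calls_b)])
w_calls = np.concatenate([np.ones(calls_w), np.zeros(n_w - calls_w)])

n_reps = 5_000  # number of bootstrap replicates (illustrative choice)
boot_diffs = np.empty(n_reps)
for i in range(n_reps):
    # Resample each group with replacement, keeping the group sizes fixed.
    b_sample = rng.choice(b_calls, size=n_b, replace=True)
    w_sample = rng.choice(w_calls, size=n_w, replace=True)
    boot_diffs[i] = w_sample.mean() - b_sample.mean()

# Percentile bootstrap: the middle 95% of the replicate differences.
boot_ci = np.percentile(boot_diffs, [2.5, 97.5])
boot_moe = (boot_ci[1] - boot_ci[0]) / 2
print("Bootstrap margin of error:", boot_moe)
print("Bootstrap confidence interval:", boot_ci)
```

The bootstrap interval should land close to the z-based interval, since the sampling distribution of the difference is nearly normal at this sample size.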


In [118]:
# Standard Error 
SE = std_b_w
print("Standard Error:",SE)

Standard Error: 0.0077833705866767544


**Hypothesis Test**<br>

*Assuming the null hypothesis is true (p2 - p1 = 0), how often would we see a difference between the callback rates at least as large as the one observed?*<br>

*If that happens often, then the difference we see between the white and black callback rates is probably due to chance. If it happens rarely, then the effect is significant.*<br>

In other words, how many standard deviations is the observed p2 - p1 from the difference of 0 expected under the null hypothesis, and what is the probability of a result at least that extreme? If that probability is less than 5%, the result is significant.
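
For reference, the pooled two-proportion z-statistic for this kind of test can be written as follows. The symbols $x_b, x_w$ (callback counts) and $n_b, n_w$ (group sizes) are notation introduced here; they correspond to `total_b_calls`, `total_w_calls`, `total_b`, and `total_w` in the code.

$$
\hat{p} = \frac{x_b + x_w}{n_b + n_w}, \qquad
SE_0 = \sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_b} + \frac{1}{n_w}\right)}, \qquad
z = \frac{\hat{p}_w - \hat{p}_b}{SE_0}
$$

When the groups are the same size, $n_b = n_w = n$, the standard error simplifies to $SE_0 = \sqrt{2\,\hat{p}\,(1-\hat{p})/n}$. The two-sided p-value is $2\,P(Z \ge |z|)$ for a standard normal $Z$.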

In [103]:
# New calculation for std because we are assuming the null hypothesis is true
# Under the null p1 = p2 = p_pop, so the standard error uses the pooled proportion
# p_pop is the proportion of callbacks for the whole sample, disregarding race
p_pop = ( total_b_calls + total_w_calls ) / ( total_b + total_w )
std = np.sqrt( (2 * p_pop * (1 - p_pop) ) / n1 ) # valid because n1 = n2
# Calculating z-score
# How many standard deviations the observed difference (p2 - p1)
# lies from the difference of 0 expected under the null hypothesis
z_score = ((p2 - p1) - 0) / std
print("Z-score: {:.3f}".format(z_score))
p_value = stats.norm.sf(abs(z_score))*2
print("p_value: {:.2e}".format(p_value))

Z-score: 4.108
p_value: 3.98e-05


*The probability of getting a z-score of about 4.1 or more extreme is very small if the null hypothesis is true.*<br>

*Therefore, with a z-score this high, the p-value (about 0.00004) is far below the .05 significance level and we reject the null hypothesis.*
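
As a resampling counterpart to the z-test, a permutation test simulates the null hypothesis directly by shuffling the callback outcomes across the two groups. This is a minimal sketch under stated assumptions: the outcome array is rebuilt from the counts reported above, and the seed, replicate count, and variable names are illustrative. Because both groups have 2435 resumes, comparing callback counts is equivalent to comparing callback rates, and it avoids floating-point ties.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed for reproducibility

n_b, calls_b = 2435, 157
n_w, calls_w = 2435, 235
total_calls = calls_b + calls_w

# All 4870 outcomes pooled together, as if race labels did not matter.
all_calls = np.concatenate([
    np.ones(total_calls, dtype=int),
    np.zeros(n_b + n_w - total_calls, dtype=int),
])

observed_count_diff = calls_w - calls_b  # 78 more callbacks for 'w'

n_reps = 5_000  # number of permutations (illustrative choice)
perm_count_diffs = np.empty(n_reps, dtype=int)
for i in range(n_reps):
    shuffled = rng.permutation(all_calls)
    k_w = shuffled[:n_w].sum()    # callbacks landing in the relabelled 'w' group
    k_b = total_calls - k_w       # the rest land in the relabelled 'b' group
    perm_count_diffs[i] = k_w - k_b

# Two-sided p-value: share of shuffles at least as extreme as the observed gap.
perm_p_value = np.mean(np.abs(perm_count_diffs) >= observed_count_diff)
print("Permutation p-value:", perm_p_value)
```

A simulated p-value of 0 only means that no shuffle out of `n_reps` was as extreme as the observed gap, so the true p-value is below roughly 1/`n_reps`; it does not mean the probability is exactly zero.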

**Q4. Describing the statistical significance in the context of the original problem.**

We are 95% confident that the true difference between the callback rates for white-sounding and black-sounding names is between 0.017 and 0.047 (1.7 to 4.7 percentage points).<br>

Because this interval lies entirely above zero, the data indicate a real difference between the two groups in which individuals with white-sounding names are favored and called back more often.<br>

The p-value is far below the .05 significance level, so a difference this large would be very unlikely if the null hypothesis were true. This means that the effect that we see (a difference in the proportion of callbacks between white-sounding and black-sounding names) is statistically significant.

<font color=red>**Race has a significant impact on the rate of callbacks for resumes.**</font>

**Q5. Does this analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how could this analysis be amended?**<br>

No. The analysis only looked at the relationship between race and callbacks. The data include a number of other variables (for example education, years of experience, and honors) that may have an equal or greater impact on the callback rate, so this analysis cannot determine whether race is the most important factor for callbacks.

To find out, we must test these other variables against the callback rate and compare their effects with the effect of race.

The next step in this analysis would be to compute callback rates for the other factors in the table and test them for significance and correlation.
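
One way to start on that next step is a small reusable helper that runs the same pooled two-proportion z-test for any binary split of the resumes. This is only a sketch: the function name `two_proportion_ztest` is hypothetical, and since callback counts for the other columns are not shown in this notebook, the usage line reuses the race counts as a sanity check.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(calls_a, n_a, calls_b, n_b):
    """Two-sided pooled z-test for H0: p_a == p_b.

    Returns (z, p_value), where z is positive when group a has the
    higher callback rate.
    """
    p_a = calls_a / n_a
    p_b = calls_b / n_b
    p_pool = (calls_a + calls_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Sanity check with the race counts reported above ('w' vs 'b').
z, p_value = two_proportion_ztest(235, 2435, 157, 2435)
print(f"z = {z:.3f}, p-value = {p_value:.2e}")
```

For another binary column such as `honors`, the four counts could be obtained with something like `data.groupby('honors').call.agg(['sum', 'size'])` and passed to the same function. Separate tests of this kind do not rank factors by importance on their own; a model that includes several variables at once, such as a logistic regression, would be a more direct way to compare them.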