# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [39]:
import pandas as pd
import numpy as np
from scipy import stats
import math

In [5]:
data = pd.read_stata('us_job_market_discrimination.dta')

In [6]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [7]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

In [13]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [10]:
w = data[data.race=='w']
b = data[data.race=='b']

In [18]:
df_w = pd.DataFrame([w.call,w.race])
df_w.shape

(2, 2435)

In [20]:
df_w.head()
df_w = df_w.transpose()

In [22]:
df_w.head()

Unnamed: 0,call,race
0,0,w
1,0,w
4,0,w
5,0,w
6,0,w


In [24]:
df_b = pd.DataFrame([b.call,b.race])
df_b.head()

Unnamed: 0,2,3,7,8,9,10,12,14,17,19,...,4850,4853,4856,4857,4858,4859,4864,4865,4866,4868
call,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
race,b,b,b,b,b,b,b,b,b,b,...,b,b,b,b,b,b,b,b,b,b


In [25]:
print(df_b.shape)

(2, 2435)


In [26]:
df_b = df_b.transpose()

In [28]:
df_b.head()

Unnamed: 0,call,race
2,0,b
3,0,b
7,0,b
8,0,b
9,0,b


In [30]:
n1 = len(data[data['race'] == 'b'])
n1

2435

In [32]:
n2 = len(data[data['race'] == 'w'])
n2

2435

In [33]:
p1 = sum(data[data['race'] == 'b'].call)/n1 # black callback rate
p2 = sum(data[data['race'] == 'w'].call)/n2 # white callback rate

In [34]:
print(p1,p2)

0.06447638603696099 0.09650924024640657


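# Question 1 - What test is appropriate for this problem? Does CLT apply?

A two-sample z-test for the difference between two proportions is appropriate, because the response (callback or no callback) is binary and we are comparing the callback rates of two independent groups.
The CLT applies: each group contains 2435 resumes (4870 in total), the race labels were assigned randomly, and each group has well over 10 callbacks and 10 non-callbacks (157 and 2278 for black-sounding names, 235 and 2200 for white-sounding names), so the sampling distribution of the difference in proportions is approximately normal.

# Question 2 - What are the null and alternate hypotheses?

H0: The callback rates for black-sounding and white-sounding resumes are equal (p_w - p_b = 0).
H1: The callback rates for black-sounding and white-sounding resumes differ (p_w - p_b ≠ 0).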
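
Whether the normal approximation behind the z-test is reasonable can be checked with the usual "at least 10 successes and 10 failures per group" rule of thumb. The sketch below is only illustrative: the counts are the ones reported in this notebook, while the threshold of 10 and the helper name `normal_approx_ok` are assumptions introduced here.

```python
# Group sizes and callback counts as reported in this notebook
groups = {
    "b": {"n": 2435, "callbacks": 157},
    "w": {"n": 2435, "callbacks": 235},
}

THRESHOLD = 10  # common rule of thumb (an assumption, not a hard rule)


def normal_approx_ok(n, successes, threshold=THRESHOLD):
    """Return True if both successes and failures reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


# Apply the check to each group separately
checks = {race: normal_approx_ok(g["n"], g["callbacks"]) for race, g in groups.items()}
print(checks)
```

If either entry came back `False`, an exact or resampling-based test would be a safer choice than the z-test.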

In [7]:
# Your solution to Q3 here

# Question-3 - Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.


In [42]:
#Frequentist
#Compute SE
p = (p1 * n1 + p2 * n2) / (n1 + n2)
SE = math.sqrt(p*(1-p)*((1/n1)+(1/n2)))
SE

0.007796894036170457

In [43]:
# We are calculating at the 5% significance level (two-sided), so our critical value is 1.96
crit = 1.96
margin = abs(crit*SE)
print("Margin of error for the difference in callback rates: +/- %0.6F" % margin)

Margin of error for the difference in callback rates: +/- 0.015282


# Computing confidence Interval

In [45]:
# Race is our explanatory variable and call is our response variable
# Observed Difference
point_est = p2 - p1
point_est

0.032032854209445585

In [47]:
CI = [point_est - margin, point_est + margin]
CI

[0.01675094189855149, 0.04731476652033968]
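
The interval above is built from the pooled standard error, which assumes the two callback rates are equal. That assumption belongs to the hypothesis test. For a confidence interval the more common choice is the unpooled standard error, which estimates each group's variance separately. The following is a minimal sketch of that alternative, using the counts reported in this notebook; the variable names are introduced here for illustration only.

```python
import math

# Counts as reported in this notebook
n_b, n_w = 2435, 2435
calls_b, calls_w = 157, 235

p_b = calls_b / n_b  # black-sounding callback rate
p_w = calls_w / n_w  # white-sounding callback rate
diff = p_w - p_b     # observed difference in callback rates

# Unpooled standard error: each group contributes its own variance term
se_unpooled = math.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)

z_crit = 1.96  # critical value for a 95% two-sided interval
margin_unpooled = z_crit * se_unpooled
ci_unpooled = (diff - margin_unpooled, diff + margin_unpooled)

print("SE (unpooled): %.6f" % se_unpooled)
print("95%% CI: (%.6f, %.6f)" % ci_unpooled)
```

With group sizes this large and equal, the pooled and unpooled versions should be close, so the conclusion is unlikely to change.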

# P-Value

In [48]:
# Setting the parameters assuming Ho is true
# Expected Difference of p_w - p_b
null = 0

#Create a pooled proportion as expected value of calls across black & white 
p_pool = (sum(data.call)/(len(data.call)))
p_pool

0.08049281314168377

In [50]:
# z-score: observed difference (p_w - p_b) minus the expected difference under H0 (null),
# divided by the pooled standard error SE computed above

z = (point_est - null)/SE
p_values = stats.norm.sf(abs(z))*2  # two-sided p-value
print("Z-score is equal to : %6.3F  p-value equal to: %6.7F" % (z,p_values))

Z-score is equal to :  4.108  p-value equal to: 0.0000398


# The p-value is less than 0.05, so we reject the null hypothesis: there is a statistically significant difference between callback rates for black-sounding and white-sounding names

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

# Bootstrap method (the code below is a permutation test; I still don't understand the math behind the bootstrap approach)


In [52]:
def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""
    data = np.concatenate((data1,data2))  # Concatenate the data sets: data
    permuted_data = np.random.permutation(data)  # Permute the concatenated array: permuted_data
    
    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]

    return perm_sample_1, perm_sample_2


def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""

    # Initialize array of replicates: perm_replicates
    perm_replicates = np.empty(size)

    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1,data_2)

        # Compute the test statistic
        perm_replicates[i] = func(perm_sample_1,perm_sample_2)

    return perm_replicates

def diff_in_props(rep_1, rep_2):
    """Compute the difference in proportions of two datasets"""
    count_1= np.count_nonzero(rep_1 == 1)
    count_2= np.count_nonzero(rep_2 == 1)
    prop1 = count_1/len(rep_1)
    prop2 = count_2/len(rep_2)
    return prop1-prop2

w_arr=df_w.call.values
b_arr=df_b.call.values
original_diff=diff_in_props(w_arr,b_arr)


In [53]:
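# Draw 100,000 permutation replicates of the difference in callback rates (white - black)
reps=draw_perm_reps(w_arr,b_arr,diff_in_props, 100000)

# One-sided p-value: fraction of replicates at least as large as the observed difference
p_val=np.sum(reps>=original_diff) /len(reps)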

In [55]:
p_val

2e-05

# Null Hypothesis can be rejected
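
The permutation test gives a p-value but no margin of error or confidence interval. A bootstrap can supply those: resample each group with replacement, recompute the difference in callback rates each time, and take percentiles of the resulting differences. The sketch below is one possible way to do this, not the only one. It uses the counts reported in this notebook; the seed, the number of replicates, and the variable names are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Counts as reported in this notebook
n_b, n_w = 2435, 2435
calls_b, calls_w = 157, 235
n_reps = 10_000  # number of bootstrap replicates (arbitrary choice)

# Resampling n binary outcomes with replacement from a group whose callback
# rate is p yields a Binomial(n, p) number of callbacks, so we draw that
# count directly instead of resampling the 0/1 array element by element.
boot_b = rng.binomial(n_b, calls_b / n_b, size=n_reps) / n_b
boot_w = rng.binomial(n_w, calls_w / n_w, size=n_reps) / n_w
boot_diffs = boot_w - boot_b  # bootstrap replicates of (p_w - p_b)

# Percentile 95% confidence interval and the corresponding half-width
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
boot_margin = (ci_high - ci_low) / 2

print("Bootstrap 95%% CI: (%.4f, %.4f)" % (ci_low, ci_high))
print("Bootstrap margin of error: %.4f" % boot_margin)
```

If the interval excludes zero, the bootstrap agrees with the permutation test that the two callback rates differ.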

# Just a Thought


In [58]:
data_contingency = pd.crosstab(data.race,data.call)

In [59]:
data_contingency

call,0.0,1.0
race,Unnamed: 1_level_1,Unnamed: 2_level_1
b,2278,157
w,2200,235


In [60]:
from scipy.stats import chi2_contingency

In [61]:
chi2_contingency(data_contingency)

(16.44902858418937, 4.997578389963255e-05, 1, array([[2239.,  196.],
        [2239.,  196.]]))

Since both callbacks and race are categorical variables, a chi-square test of independence can also be used to check whether they are related.
Since the p-value is less than 0.05, we reject the null hypothesis of independence and conclude that there is a relationship between callbacks and race.
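
For a 2x2 table, the chi-square test and the two-proportion z-test are closely linked: without a continuity correction, the chi-square statistic equals the square of the pooled z-score. By default, `chi2_contingency` applies Yates' continuity correction to 2x2 tables, which is why the statistic of 16.449 reported above is slightly smaller than the squared z-score of 4.108. The sketch below illustrates the relationship, using the contingency table printed in this notebook; the variable names are introduced here for illustration.

```python
import math

import numpy as np
from scipy.stats import chi2_contingency

# Contingency table from this notebook: rows = race (b, w), columns = call (0, 1)
observed = np.array([[2278, 157],
                     [2200, 235]])

# Chi-square test WITHOUT Yates' continuity correction
chi2_stat, chi2_p, dof, expected = chi2_contingency(observed, correction=False)

# Pooled two-proportion z-score computed from the same table
n_b, n_w = observed.sum(axis=1)
calls_b, calls_w = observed[:, 1]
p_pool = (calls_b + calls_w) / (n_b + n_w)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))
z = (calls_w / n_w - calls_b / n_b) / se

print("chi-square (no correction): %.4f" % chi2_stat)
print("z squared:                  %.4f" % z**2)
```

The two printed values should agree up to floating-point error, so the two tests answer the same question in this setting.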

# 4- Write a story describing the statistical significance in the context of the original problem.

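We can infer that there is a statistically significant difference between callback rates for white-sounding and black-sounding resumes: about 9.65% of resumes with white-sounding names received a callback, compared with about 6.45% of resumes with black-sounding names (z = 4.108, p ≈ 0.00004). Because names were assigned to resumes at random, this gap can be attributed to the perceived race of the name rather than to differences in the resumes themselves.
A chi-square test was also run as a check, and it likewise indicates a relationship between callback rate and race.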

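# 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

No. The analysis shows that race has a significant effect on callbacks, but it does not tell us whether race is the most important factor, because no other factor was tested. The dataset contains many other potentially important features, such as years of experience, education level, and number of previous jobs, which would need to be analysed alongside race (for example, by modelling callbacks as a function of race and these features together) before their relative importance could be compared.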