# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [101]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

In [None]:
warnings.filterwarnings("ignore")

In [4]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [5]:
data.shape

(4870, 65)

In [102]:
w = data[data.race=='w']
b = data[data.race=='b']

In [49]:
np.mean(w.call)-np.mean(b.call)

0.03203285485506058

In [7]:
pd.options.display.max_columns = None

In [8]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,email,computerskills,specialskills,firstname,sex,race,h,l,call,city,kind,adid,fracblack,fracwhite,lmedhhinc,fracdropout,fraccolp,linc,col,expminreq,schoolreq,eoe,parent_sales,parent_emp,branch_sales,branch_emp,fed,fracblack_empzip,fracwhite_empzip,lmedhhinc_empzip,fracdropout_empzip,fraccolp_empzip,linc_empzip,manager,supervisor,secretary,offsupport,salesrep,retailsales,req,expreq,comreq,educreq,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,1,0,0,1,0,Allison,f,w,0.0,1.0,0.0,c,a,384.0,0.98936,0.0055,9.527484,0.274151,0.037662,8.706325,1.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,6,1,1,1,0,Kristen,f,w,1.0,0.0,0.0,c,a,384.0,0.080736,0.888374,10.408828,0.233687,0.087285,9.532859,0.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,1,1,0,1,0,Lakisha,f,b,0.0,1.0,0.0,c,a,384.0,0.104301,0.83737,10.466754,0.101335,0.591695,10.540329,1.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,5,0,1,1,1,Latonya,f,b,1.0,0.0,0.0,c,a,384.0,0.336165,0.63737,10.431908,0.108848,0.406576,10.412141,0.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,5,1,1,1,0,Carrie,f,w,1.0,0.0,0.0,c,a,385.0,0.397595,0.180196,9.876219,0.312873,0.030847,8.728264,0.0,some,,1.0,9.4,143.0,9.4,143.0,0.0,0.204764,0.727046,10.619399,0.070493,0.369903,10.007352,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

## Q1: What test is appropriate for this problem? Does CLT apply?

<b> Binomial Distribution </b> : https://en.wikipedia.org/wiki/Binomial_distribution

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own boolean-valued outcome: success/yes/true/one (with probability p) or failure/no/false/zero (with probability q = 1 − p).

A single success/failure experiment is also called a Bernoulli trial. For a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution.

In this problem, each value in the column <b> call </b> is the outcome of a Bernoulli trial: success is the value 1 (the applicant received a callback) and failure is the value 0 (the applicant did not).

Each record of the column <b> call </b> represents one Bernoulli trial, whereas the number of successes (call value = 1) among the 2435 resumes in each group follows a binomial distribution.
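As a small illustration of the relationship described above, the sketch below simulates Bernoulli trials and sums them into a binomial count. The callback probability `p = 0.08` and the random seed are illustrative assumptions, not values read from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n = 2435   # resumes per group, as stated above
p = 0.08   # ASSUMED callback probability, for illustration only

# Each resume is one Bernoulli trial: 1 = callback, 0 = no callback.
trials = rng.binomial(1, p, size=n)

# The number of callbacks among the n resumes is one draw from Binomial(n, p).
n_callbacks = int(trials.sum())

# Theoretical mean and standard deviation of the binomial count.
expected_callbacks = n * p
sd_callbacks = np.sqrt(n * p * (1 - p))

print(n_callbacks, expected_callbacks, round(float(sd_callbacks), 2))
```

The simulated count should land within a few standard deviations of `n * p`.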

<b> Central Limit Theorem </b> : https://en.wikipedia.org/wiki/Central_limit_theorem

In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed.

In its common form, the random variables must be identically distributed.

In this problem the resumes are randomly assigned a black-sounding or a white-sounding name, so the callback outcome for each resume can be treated as a random variable. Because the name assigned to one resume does not depend on the name assigned to any other, it is also reasonable to treat the records as independent.

Although the number of successes (call value = 1) is binomially distributed, the CLT tells us that the proportion of successes, which is simply the normalized sum of the individual call outcomes, will tend toward a normal distribution for samples of this size.

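<b> In conclusion, the CLT applies to this problem. As for the appropriate test, a two-sample test comparing the callback proportions of the two groups is appropriate here. Because the population proportions are not known, they must be estimated from the samples; with samples this large, the t-test used below and the z-test for proportions give nearly identical results.</b>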
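The CLT claim above can be checked with a quick simulation. The sketch below draws many hypothetical groups of 2435 resumes and compares the spread of the sample proportions with the normal approximation. The callback probability `p = 0.08`, the number of simulated groups and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 2435          # resumes per group
p = 0.08          # ASSUMED callback probability, for illustration only
n_groups = 20_000 # number of simulated groups (arbitrary choice)

# Common rule of thumb for the normal approximation:
# at least about 10 expected successes and 10 expected failures.
rule_of_thumb_ok = (n * p >= 10) and (n * (1 - p) >= 10)

# Sample proportion of callbacks in each simulated group.
props = rng.binomial(n, p, size=n_groups) / n

# Standard error predicted by the normal approximation.
theory_se = np.sqrt(p * (1 - p) / n)

# If the proportions are roughly normal, about 95% of them
# should fall within 1.96 standard errors of p.
coverage = float(np.mean(np.abs(props - p) <= 1.96 * theory_se))

print(rule_of_thumb_ok, round(float(props.std()), 5), round(float(theory_se), 5), coverage)
```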

## Q2: What are the null and alternate hypotheses?

<b> Null Hypothesis (H0)</b>: The proportions of calls made to white-sounding names and black-sounding names are the same.

<b> Alternate Hypothesis (H1)</b>: The proportions of calls made to white-sounding names and black-sounding names are not the same.
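Written symbolically, the two hypotheses and the pooled test statistic for a difference in proportions look as follows. The symbols are introduced here for illustration: $p_w$ and $p_b$ are the true callback rates for white-sounding and black-sounding names, $x_w$ and $x_b$ the observed numbers of callbacks, and $n_w$ and $n_b$ the group sizes.

```latex
H_0:\; p_w = p_b \qquad\qquad H_1:\; p_w \neq p_b

\hat{p}_w = \frac{x_w}{n_w}, \qquad
\hat{p}_b = \frac{x_b}{n_b}, \qquad
\hat{p} = \frac{x_w + x_b}{n_w + n_b}

z = \frac{\hat{p}_w - \hat{p}_b}
         {\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{n_w} + \dfrac{1}{n_b}\right)}}
```

Under $H_0$ both groups share a single callback rate, which is why the standard error uses the pooled estimate $\hat{p}$.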







## Q3: Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches

## Frequentist approach

In [12]:
# calculate the values of variables required for the statistical test
nb = len(b) # black sample size
nw = len(w) # white sample size
pb = sum(b.call)/nb # black callback rate
pw = sum(w.call)/nw # white callback rate

In [13]:
callb = sum(b.call)
callw = sum(w.call)

In [28]:
p_pooled = (callb + callw)/(nb + nw) # pooled callback rate of both groups

# standard error of the difference under H0 (both groups share the pooled rate)
std_err = np.sqrt(p_pooled*(1-p_pooled)*((1/nb)+(1/nw)))

test_stat_equal = (pw-pb)/std_err

# two-sided p-value from the t distribution
p_val_tstat_equal = stats.t.sf(abs(test_stat_equal), len(data)-1)*2


print('frequentist t-stat value:',test_stat_equal,' frequentist p_value',p_val_tstat_equal)

frequentist t-stat value: 4.108412152434346  frequentist p_value 4.0493178875903686e-05
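To see how little the choice of reference distribution matters at this sample size, the sketch below takes the test statistic printed above and computes the two-sided p-value under both the t distribution (with `len(data) - 1 = 4869` degrees of freedom, as in the cell above) and the standard normal distribution. The statistic is hard-coded from the printed output, so this is a check on the arithmetic rather than a new analysis.

```python
from scipy import stats

test_stat = 4.108412152434346  # test statistic copied from the output above
dof = 4870 - 1                 # len(data) - 1, as used in the cell above

# Two-sided p-values under the two reference distributions.
p_t = 2 * stats.t.sf(abs(test_stat), dof)
p_z = 2 * stats.norm.sf(abs(test_stat))

print('t distribution p-value     :', p_t)
print('normal distribution p-value:', p_z)
```

The t distribution has slightly heavier tails, so its p-value is marginally larger, but both are far below any conventional significance level.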


In [45]:
#Margin of error and confidence interval

std_white = np.sqrt((pw*(1-pw))/nw)
std_black = np.sqrt((pb*(1-pb))/nb)

std_diff = np.sqrt(std_white ** 2 + std_black ** 2)

marerr = 1.96 * std_diff

# confidence interval for the difference in proportions
conf_low = (pw - pb) - marerr
conf_high = (pw - pb) + marerr
conf_int = [conf_low,conf_high]


print('Frequentist Margin of Error : {:0.4f}'. format(marerr))
print('Frequentist Confidence Interval :{}'.format(conf_int))
print('Frequentist p-value :{}'.format(p_val_tstat_equal))

Frequentist Margin of Error : 0.0153
Frequentist Confidence Interval :[0.016777447859559147, 0.047288260559332024]
Frequentist p-value :4.0493178875903686e-05
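The margin-of-error calculation above can be wrapped in a small helper that accepts any confidence level. This is a sketch, not part of the original analysis: the function name `diff_proportion_ci` is hypothetical, and the callback counts 235 and 157 (out of 2435 resumes per group) are inferred from the difference and pooled rate printed in this notebook rather than read from the data.

```python
import numpy as np
from scipy import stats

def diff_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Margin of error and confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error: each group keeps its own estimated rate.
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Critical value of the standard normal for the requested confidence level.
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z_crit * se
    diff = p1 - p2
    return margin, (diff - margin, diff + margin)

# ASSUMED counts, inferred from the outputs in this notebook.
margin, (ci_low, ci_high) = diff_proportion_ci(235, 2435, 157, 2435)
print('margin of error: {:.4f}'.format(margin))
print('95% CI: [{:.4f}, {:.4f}]'.format(ci_low, ci_high))
```

Passing `confidence=0.99` widens the interval, which is a quick way to check how sensitive the conclusion is to the chosen level.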


## Permutation Test Approach

This approach is very similar to the bootstrap approach but is more restrictive: its null hypothesis states not only that the proportion of successes is the same for both groups, but also that the data in both groups come from the same distribution.


In [77]:
# Function to create a permutation sample by concatenating the two group samples and then shuffling the combined data
def permutation_sample(d1, d2):
    d = np.concatenate((d1,d2))
    perm_data = np.random.permutation(d)
    perm_sample_1 = perm_data[:len(d1)]
    perm_sample_2 = perm_data[len(d1):]
    
    return perm_sample_1, perm_sample_2

# Function to compute the difference in proportions between two groups
def diff_prop(d1, d2):
    return (sum(d1) / len(d1) - (sum(d2) / len(d2)))

# Function to generate permutation replicates of the difference in proportions between the two groups

def draw_perm_reps(d1, d2, func, size = 1):
    perm_replicates = np.empty(size)

    for i in range(size):
        perm_sample_1, perm_sample_2 = permutation_sample(d1, d2)
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)

    return perm_replicates

In [65]:
observed_diff_prop = diff_prop (w['call'],b['call'])
print(observed_diff_prop)

0.032032854209445585


In [103]:
w['call'] = w['call'].astype(int)
b['call'] = b['call'].astype(int)

In [108]:
# Generate 10000 permutation replicates of the difference in success proportions between the two groups
perm_replicates = draw_perm_reps(w['call'],b['call'], diff_prop, size = 10000)

In [115]:
perm_replicates[0:10]

array([ 0.00246407,  0.0164271 , -0.00164271, -0.00082136,  0.00082136,
       -0.00657084,  0.00903491,  0.        ,  0.00246407,  0.00492813])

In [116]:
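# one-sided p-value: fraction of permutation replicates at least as large as the observed difference
perm_p = np.sum(perm_replicates >= observed_diff_prop)/len(perm_replicates)
print('The p-value using the permutation test approach is {:0.5f}'.format(perm_p))

# central 95% range of the replicates generated under the null hypothesis;
# this is an acceptance region for the null, not a confidence interval for the observed difference
perm_conf_int = np.percentile(perm_replicates, [2.5,97.5])
print('The 95% range of the permutation replicates under the null is {}'.format(perm_conf_int))

The p-value using the permutation test approach is 0.00010
The 95% range of the permutation replicates under the null is [-0.01478439  0.01560575]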
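The cell above counts only replicates on one side of the null distribution, whereas the alternative hypothesis in Q2 is two-sided. The sketch below shows one way a two-sided permutation p-value could be computed. It is self-contained: the 0/1 callback arrays are rebuilt from counts (235 and 157 callbacks out of 2435 resumes per group) that are inferred from the outputs in this notebook, and the `(count + 1) / (reps + 1)` convention, which avoids reporting a p-value of exactly zero, is a choice made here rather than something the original analysis uses.

```python
import numpy as np

rng = np.random.default_rng(42)

# ASSUMED data: 0/1 callback arrays rebuilt from counts inferred from this notebook.
calls_w = np.concatenate([np.ones(235, dtype=int), np.zeros(2435 - 235, dtype=int)])
calls_b = np.concatenate([np.ones(157, dtype=int), np.zeros(2435 - 157, dtype=int)])

observed = calls_w.mean() - calls_b.mean()
pooled = np.concatenate([calls_w, calls_b])
n_w = len(calls_w)

n_reps = 2000
reps = np.empty(n_reps)
for i in range(n_reps):
    # Shuffle the pooled outcomes and split them back into two groups.
    shuffled = rng.permutation(pooled)
    reps[i] = shuffled[:n_w].mean() - shuffled[n_w:].mean()

# Two-sided: count replicates at least as extreme in either direction.
n_extreme = int(np.sum(np.abs(reps) >= abs(observed)))
p_two_sided = (n_extreme + 1) / (n_reps + 1)

print('observed difference:', observed)
print('two-sided permutation p-value:', p_two_sided)
```

With only a few thousand replicates the smallest reportable p-value is about `1 / n_reps`, so the result is best read as an upper bound.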


In [17]:
# Cross-check the frequentist result with the two-proportion z-test from statsmodels
from statsmodels.stats.proportion import proportions_ztest
proportions_ztest(np.array([callw,callb]), np.array([nw,nb]))

(4.108412152434346, 3.983886837585077e-05)
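A further cross-check is Pearson's chi-square test on the 2x2 table of callbacks by group. Without continuity correction, its statistic equals the square of the pooled z statistic and its p-value matches the two-sided z-test. The sketch below uses `scipy.stats.chi2_contingency`; the cell counts are inferred from the outputs in this notebook (235 and 157 callbacks out of 2435 resumes per group) and should be treated as an assumption.

```python
import numpy as np
from scipy import stats

# Rows: white-sounding, black-sounding names; columns: callback, no callback.
# ASSUMED counts, inferred from the outputs in this notebook.
table = np.array([
    [235, 2435 - 235],
    [157, 2435 - 157],
])

# correction=False disables Yates' continuity correction so the result
# is directly comparable with the z-test.
chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)

print('chi-square statistic:', chi2)
print('square root of statistic:', np.sqrt(chi2))
print('p-value:', p_value)
```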

## Bootstrap Approach

As noted above, the null hypothesis in the bootstrap approach makes no statement about the underlying distributions of the two groups. It states only that the proportion of successes is the same in both groups, or in other words, that there is no statistical difference between the two proportions of successes.

In [117]:
#Function to generate a bootstrap sample, i.e. a sample drawn with replacement, for a single group
def bootstrap(data, func):
    
    bs_samples = np.random.choice(data, len(data))
    return func(bs_samples)

# Function to create bootstrap replicates using a summarizing/aggregating function
def draw_bootstrap_samples(data, func, size = 1):
    
    bs_replicates = np.empty(size)
    
    # Generate Replicates
    for i in range(size):
        bs_replicates[i] = bootstrap(data,func)
    return bs_replicates

In [140]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [141]:
w = data[data.race=='w']
b = data[data.race=='b']

In [142]:
observed_prop = (sum(w.call)/len(w)) - (sum(b.call)/len(b))

In [143]:
# Joint success proportion of the combined groups
wb_prop = np.sum(data.call)/len(data)

In [144]:
# Create a new data set for both the groups by shifting the proportion to the joint proportion of the combined groups 
w.call = w.call - sum(w.call)/nw + wb_prop
b.call = b.call - sum(b.call)/nb + wb_prop

In [145]:
# verify the shifted success proportion 
print(w.call.mean())
print(b.call.mean())

0.08049110323190689
0.08049406856298447


In [146]:
# Draw 10000 bootstrap replicates of mean of each bootstrap sample for each group
w_samples = draw_bootstrap_samples(w.call, np.mean, 10000 )
b_samples = draw_bootstrap_samples(b.call, np.mean, 10000)

In [147]:
# Calculate the difference of proportion between the bootstrap replicates
wb_sample_difference = w_samples - b_samples

In [148]:
# Calculate the p-value as the fraction of replicate differences at least as large as the observed difference
p_val_boot = np.sum(wb_sample_difference >= observed_prop)/len(wb_sample_difference)
print('The p-value using the bootstrap approach is {}'.format(p_val_boot))

The p-value using the bootstrap approach is 0.0
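A reported p-value of exactly 0.0 only means that none of the 10000 replicates reached the observed difference. The sketch below shows an alternative way to simulate the null hypothesis that keeps the data binary: both groups are resampled at the pooled callback rate. It relies on the fact that drawing `n` values with replacement from 0/1 data with mean `p` gives a count distributed as Binomial(n, p). The counts (235 and 157 callbacks out of 2435 resumes per group) are inferred from this notebook's outputs, and the `(count + 1) / (reps + 1)` convention is a choice made here, not the original method.

```python
import numpy as np

rng = np.random.default_rng(7)

# ASSUMED counts, inferred from the outputs in this notebook.
n_w, n_b = 2435, 2435
x_w, x_b = 235, 157

observed = x_w / n_w - x_b / n_b
p_pooled = (x_w + x_b) / (n_w + n_b)

n_reps = 10_000
# Under H0 both groups share the pooled rate, so each bootstrap
# sample's callback count is a Binomial(n, p_pooled) draw.
null_w = rng.binomial(n_w, p_pooled, size=n_reps) / n_w
null_b = rng.binomial(n_b, p_pooled, size=n_reps) / n_b
null_diffs = null_w - null_b

# Two-sided p-value that can never be exactly zero.
n_extreme = int(np.sum(np.abs(null_diffs) >= abs(observed)))
p_boot = (n_extreme + 1) / (n_reps + 1)

print('replicates at least as extreme:', n_extreme)
print('bootstrap p-value (upper bound):', p_boot)
```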


In [149]:
# As the mean was shifted for the hypothesis testing the two groups have to be reloaded to calculate the confidence interval
#using bootstrap method
w = data[data.race=='w']
b = data[data.race=='b']

In [150]:
w_samples_raw = draw_bootstrap_samples(w.call, np.mean, 10000 )
b_samples_raw = draw_bootstrap_samples(b.call, np.mean, 10000)

In [151]:
wb_sample_difference_raw = w_samples_raw - b_samples_raw

In [152]:
boot_conf_int = np.percentile(wb_sample_difference_raw, [2.5,97.5])
print('The confidence interval using the bootstrap approach is {}'.format(boot_conf_int))

The confidence interval using the bootstrap approach is [0.01683778 0.04722793]
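Q3 also asks for a margin of error, which the bootstrap cells above do not report. One simple option, sketched below, is to take half the width of the percentile interval. The sketch is self-contained and uses the binomial shortcut for resampling 0/1 data; the counts (235 and 157 callbacks out of 2435 resumes per group) are inferred from this notebook's outputs and are an assumption.

```python
import numpy as np

rng = np.random.default_rng(11)

# ASSUMED counts, inferred from the outputs in this notebook.
n_w, n_b = 2435, 2435
x_w, x_b = 235, 157

n_reps = 10_000
# Resampling each group with replacement: the resampled callback count
# follows Binomial(n, observed rate of that group).
boot_w = rng.binomial(n_w, x_w / n_w, size=n_reps) / n_w
boot_b = rng.binomial(n_b, x_b / n_b, size=n_reps) / n_b
boot_diffs = boot_w - boot_b

# Percentile confidence interval and its half-width as a margin of error.
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
boot_margin = (ci_high - ci_low) / 2

print('bootstrap 95% CI: [{:.4f}, {:.4f}]'.format(ci_low, ci_high))
print('bootstrap margin of error: {:.4f}'.format(boot_margin))
```

The half-width should be close to the frequentist margin of error of about 0.0153 reported earlier.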


## Q4: Write a story describing the statistical significance in the context of the original problem.

The p-values from all three approaches, i.e. the frequentist approach (t-test), the permutation test and the bootstrap approach, are very close to 0. We can therefore reject the null hypothesis and conclude that the callback rates for white-sounding and black-sounding names are statistically different, with white-sounding names receiving the higher callback rate. The observed difference is about 3.2 percentage points, with a 95% confidence interval of roughly 1.7 to 4.7 percentage points.


## Q5: Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

No, we can't conclude that race is the most important factor in callback success, as we haven't evaluated the other variables with respect to callback success.

We can run a logistic regression on the whole data set, with callback success as the target (dependent) variable and the other variables as independent variables, to see the effect of each variable on callback success while controlling for the others. The coefficients of the independent variables will then give us an idea of the most important factors in callback success.
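The sketch below illustrates the kind of logistic regression described above. It runs on synthetic data, not the study data: the predictor names mirror columns that exist in the dataset (`yearsexp`, `honors`), `is_white` is a hypothetical 0/1 recoding of `race`, and the "true" coefficients are invented purely to generate an outcome. The model is fitted with a hand-written Newton-Raphson loop so that the example depends only on NumPy; in practice a library routine such as `statsmodels.formula.api.logit` would be the more usual choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# SYNTHETIC stand-in data (not the study data).
is_white = rng.integers(0, 2, size=n)    # hypothetical 0/1 recoding of race
yearsexp = rng.integers(1, 21, size=n)   # years of experience
honors = rng.integers(0, 2, size=n)      # 1 = resume lists honors

# INVENTED coefficients, used only to generate the synthetic outcome.
names = ['intercept', 'is_white', 'yearsexp', 'honors']
true_beta = np.array([-3.0, 0.45, 0.04, 0.8])

X = np.column_stack([np.ones(n), is_white, yearsexp, honors])
prob = 1 / (1 + np.exp(-X @ true_beta))
call = rng.binomial(1, prob)

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; return coefficients and standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-X @ beta))      # fitted probabilities
        w = mu * (1 - mu)                     # Bernoulli variances
        hessian = X.T @ (X * w[:, None])      # observed information matrix
        gradient = X.T @ (y - mu)             # score vector
        beta = beta + np.linalg.solve(hessian, gradient)
    mu = 1 / (1 + np.exp(-X @ beta))
    w = mu * (1 - mu)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    return beta, np.sqrt(np.diag(cov))

beta_hat, std_err = fit_logistic(X, call)
for name, b, s in zip(names, beta_hat, std_err):
    print('{:>10s}: coef = {: .3f}, std err = {:.3f}'.format(name, b, s))
```

Raw coefficients are measured per unit of each predictor, so comparing the importance of variables on different scales usually calls for standardizing the predictors or comparing odds ratios over meaningful ranges.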