# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [10]:
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns

In [4]:
data = pd.io.stata.read_stata(r'C:\Users\Godfather\Documents\Desktop folders\Data Science\Springboard\6.2 (2)-Project Examine Racial Discrimination Using EDA\1538005773_dsc_racial_disc\EDA_racial_discrimination\data/us_job_market_discrimination.dta')

In [7]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [5]:
# Checking all the column names
for col in data.columns: 
    print(col) 

id
ad
education
ofjobs
yearsexp
honors
volunteer
military
empholes
occupspecific
occupbroad
workinschool
email
computerskills
specialskills
firstname
sex
race
h
l
call
city
kind
adid
fracblack
fracwhite
lmedhhinc
fracdropout
fraccolp
linc
col
expminreq
schoolreq
eoe
parent_sales
parent_emp
branch_sales
branch_emp
fed
fracblack_empzip
fracwhite_empzip
lmedhhinc_empzip
fracdropout_empzip
fraccolp_empzip
linc_empzip
manager
supervisor
secretary
offsupport
salesrep
retailsales
req
expreq
comreq
educreq
compreq
orgreq
manuf
transcom
bankreal
trade
busservice
othservice
missind
ownership


In [6]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [11]:
# ECDF Function

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""

    # Number of data points: n
    n = len(data)

    # x: sort the data
    x = np.sort(data)

    # y: range for y-axis
    y = np.arange(1, n+1) / n

    return x, y

# Basic calculations on callbacks

In [13]:
# aggregate values
r = np.sum(data.call)
n = len(data)
p = r/n

w = data[data.race == 'w']
b = data[data.race == 'b']

# white-sounding names
w_r = np.sum(w.call)
w_n = len(w)
w_p = (w_r / w_n)


# black-sounding names
b_r = np.sum(b.call)
b_n = len(b)
b_p = (b_r / b_n)


# Build the summary under a new name so the DataFrame `data` is not overwritten
summary = {'CB': np.array([w_r, b_r, r]).astype(int),
           'No CB': np.array([w_n - w_r, b_n - b_r, n - r]).astype(int),
           'Total': np.array([w_n, b_n, n]).astype(int),
           '% Success': np.array(['{:.2%}'.format(w_p), '{:.2%}'.format(b_p), '{:.2%}'.format(p)])}

table = pd.DataFrame(summary, columns = ['CB', 'No CB', 'Total', '% Success'],
                     index = ['White-sounding names', 'Black-sounding names', 'Aggregate'])
table

Unnamed: 0,CB,No CB,Total,% Success
White-sounding names,235,2200,2435,9.65%
Black-sounding names,157,2278,2435,6.45%
Aggregate,392,4478,4870,8.05%


# Q1. What test is appropriate for this problem? Does CLT apply?

Using the collected sample, we are trying to evaluate whether race influences the callback rate. The null hypothesis for this problem is that the callback proportions in the two populations ('b' and 'w') from which the two samples are drawn are equal. Therefore, a two-sample Z-test for proportions will be used. Each resume is a Bernoulli trial in which a callback is labeled a success, so the number of callbacks in each group follows a binomial distribution.

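In order to compare the proportions using a frequentist approach, the two-sample Z-test for proportions will be used:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where $\hat{p}_1$ and $\hat{p}_2$ are the two sample proportions and $\hat{p} = \dfrac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$ is the pooled proportion.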

When the sample sizes n1 and n2 are large enough and the proportions in each sample are not too close to 0 or 1, the difference between the two sample proportions is approximately normally distributed.
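As a worked example of the Z formula, the sketch below plugs in the counts from the summary table above, using only the standard library. The name `p_pooled` is introduced here for illustration; the counts are those reported in the table.

```python
import math

# Counts taken from the summary table above
w_r, w_n = 235, 2435   # callbacks / resumes, white-sounding names
b_r, b_n = 157, 2435   # callbacks / resumes, black-sounding names

p1 = w_r / w_n
p2 = b_r / b_n

# Pooled proportion under the null hypothesis of equal callback rates
p_pooled = (w_r + b_r) / (w_n + b_n)

# Standard error of the difference in proportions under the null hypothesis
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / w_n + 1 / b_n))

z = (p1 - p2) / se
print(f"pooled p = {p_pooled:.4f}, SE = {se:.5f}, z = {z:.3f}")
```

The resulting z-statistic of roughly 4.1 is the same quantity that the Z-test function in the frequentist section computes.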

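The CLT applies to this problem. The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution. A sample proportion is the mean of 0/1 outcomes, so the theorem covers it. A common rule of thumb asks for more than 30 observations; for proportions, each group should also contain at least 10 successes and 10 failures. Both conditions hold here: each group has 2,435 resumes, with 235 and 157 callbacks respectively, and the random assignment of names supports treating the observations as independent.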
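One way to make the sample-size check concrete is the success/failure rule of thumb for proportions. The sketch below is illustrative only: the threshold of 10 is a common textbook convention and an assumption here, not something stated in the exercise, and the counts come from the summary table above.

```python
# (callbacks, no callbacks) per group, from the summary table above
groups = {
    "white-sounding": (235, 2200),
    "black-sounding": (157, 2278),
}

# Rule-of-thumb minimum count per cell (an assumption, not from the exercise)
threshold = 10

# The normal approximation is usually considered reasonable when every
# group has at least `threshold` successes and `threshold` failures.
conditions_met = all(
    successes >= threshold and failures >= threshold
    for successes, failures in groups.values()
)

for name, (successes, failures) in groups.items():
    print(f"{name}: {successes} callbacks, {failures} without a callback")
print("Normal approximation reasonable:", conditions_met)
```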

# Q2. What are the null and alternate hypotheses?

A null hypothesis states that there is no relationship between the two variables. It is denoted by $H_0$.
Here, the probability of success (getting a callback) is the same for resumes with white-sounding names and resumes with black-sounding names.

$H_0: p_w - p_b = 0$

An alternative hypothesis states that there is a relationship between the two variables. It is denoted by $H_a$.
Here, the probability of success is not the same for resumes with white-sounding names as it is for those with black-sounding names.

$H_a: p_w - p_b \neq 0$

Both hypotheses are statements about the population callback probabilities $p_w$ and $p_b$, not about the sample proportions.

# Q3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.

## Frequentist Approach

$$(\hat{p}_1 - \hat{p}_2) \pm z \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

In [14]:
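def ztest_proportions_two_samples(r1, n1, r2, n2, one_sided=False):
    """Return the z-statistic and p-value for a two-sample Z-test of proportions."""
    p1 = r1/n1
    p2 = r2/n2

    # Pooled proportion and standard error under the null hypothesis
    p_pooled = (r1+r2)/(n1+n2)
    se = np.sqrt(p_pooled*(1-p_pooled)*(1/n1+1/n2))

    z = (p1-p2)/se
    # Upper-tail probability of |z|, doubled for a two-sided test
    p_value = 1-stats.norm.cdf(abs(z))
    if not one_sided:
        p_value *= 2
    return z, p_value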

In [18]:
# 95% confidence interval
prop_diff = w_p - b_p
print('Observed difference in proportions: \t {}\n'.format(prop_diff))

z_crit = 1.96
# Unpooled variance of each sample proportion
var_w = w_p*(1-w_p)/w_n
var_b = b_p*(1-b_p)/b_n
se_diff = np.sqrt(var_w + var_b)
ci_high = prop_diff + z_crit*se_diff
ci_low = prop_diff - z_crit*se_diff

z_stat, p_val = ztest_proportions_two_samples(w_r, w_n, b_r, b_n)
print('z-stat: \t\t\t {}\np-value: \t\t\t {}'.format(z_stat, p_val))

print('95% confidence interval: \t {} - {}'.format(ci_low, ci_high))
moe = (ci_high - ci_low)/2
print('Margin of error: \t\t +/-{}'.format(moe))

Observed difference in proportions: 	 0.032032854209445585

z-stat: 			 4.108412152434346
p-value: 			 3.983886837577444e-05
95% confidence interval: 	 0.016777447859559147 - 0.047288260559332024
Margin of error: 		 +/-0.015255406349886438
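The hard-coded `z_crit = 1.96` and the reported p-value can both be recovered from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the standard library rather than `scipy.stats`. The name `std_normal` is introduced here for illustration, and the z-statistic is copied from the output above.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Two-sided critical value for a 95% confidence level
confidence = 0.95
z_crit = std_normal.inv_cdf(1 - (1 - confidence) / 2)

# Two-sided p-value for the z-statistic reported above
z_stat = 4.108412152434346
p_val = 2 * (1 - std_normal.cdf(z_stat))

print(f"z_crit = {z_crit:.4f}")
print(f"p-value = {p_val:.3e}")
```

Changing `confidence` to another level, such as 0.99, gives the corresponding critical value without consulting a table.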


## Bootstrap

In [19]:
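# Pool all outcomes: under the null hypothesis, callbacks do not depend on race
all_callbacks = np.array([True] * int(r) + [False] * int(n-r))

size = 10000

bs_reps_diff = np.empty(size)

for i in range(size):
    # Resample each group from the pooled outcomes, with replacement
    w_bs_callbacks = np.sum(np.random.choice(all_callbacks, size=w_n))
    b_bs_callbacks = np.sum(np.random.choice(all_callbacks, size=b_n))

    # Difference in callback rates for this replicate
    bs_reps_diff[i] = w_bs_callbacks/w_n - b_bs_callbacks/b_n

# One-sided p-value: fraction of null replicates at least as large as the observed difference
bs_p_value = np.sum(bs_reps_diff >= prop_diff) / len(bs_reps_diff)

# Central 95% of the differences simulated under the null hypothesis
bs_ci = np.percentile(bs_reps_diff, [2.5, 97.5])
bs_mean_diff = np.mean(bs_reps_diff)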

print('obs diff: {}\n'.format(prop_diff))
print('BOOTSTRAP RESULTS\np-value: {}\n95% conf. int.: {}'.format(bs_p_value, bs_ci))

obs diff: 0.032032854209445585

BOOTSTRAP RESULTS
p-value: 0.0
95% conf. int.: [-0.01560575  0.01520534]
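The interval printed above is centered near zero because the replicates are drawn from the pooled data, so it describes the differences expected under the null hypothesis. A bootstrap confidence interval for the observed difference itself can be sketched by resampling each group separately. This is a minimal sketch, not the notebook's method. The seed is arbitrary and the counts are taken from the summary table. Drawing binomial counts is used as a shortcut, since resampling n binary outcomes with replacement from a group with observed rate p is equivalent to drawing from Binomial(n, p).

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

w_r, w_n = 235, 2435   # callbacks / resumes, white-sounding names
b_r, b_n = 157, 2435   # callbacks / resumes, black-sounding names
n_reps = 10_000

# Bootstrap each group separately: the number of callbacks in a resample
# of size n from a group with observed rate p is a Binomial(n, p) draw.
w_counts = rng.binomial(w_n, w_r / w_n, size=n_reps)
b_counts = rng.binomial(b_n, b_r / b_n, size=n_reps)

# Difference in callback rates for each bootstrap replicate
diff_reps = w_counts / w_n - b_counts / b_n

# Percentile confidence interval and the implied margin of error
ci_low, ci_high = np.percentile(diff_reps, [2.5, 97.5])
moe = (ci_high - ci_low) / 2

print(f"95% bootstrap CI for the difference: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"Approximate margin of error: +/-{moe:.4f}")
```

The interval should land close to the frequentist interval reported earlier, and it excludes zero.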


## Analysis

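The p-values from the frequentist and bootstrap approaches are both well below the α = 0.05 significance level, so the null hypothesis is rejected in favor of the alternate hypothesis: the perception of race based on the name on the resume does have an effect on whether an applicant receives a callback. The bootstrap p-value of 0.0 means that none of the 10,000 replicates simulated under the null hypothesis produced a difference as large as the observed one, so the p-value is below 1/10,000.

Under the null hypothesis, 95% of the simulated differences in proportions fall between about -0.016 and 0.015. The observed difference of 0.032 lies far outside this range. The frequentist 95% confidence interval for the difference, 0.017 to 0.047 (a margin of error of about 0.015), does not include zero.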

# Q4. Write a story describing the statistical significance in the context of the original problem.

The proportion of callbacks received for resumes with white-sounding names (9.65%) is significantly higher than the proportion for resumes with black-sounding names (6.45%). A difference this large would be very unlikely if the name had no effect on callbacks. In the sample provided, resumes with white-sounding names are approximately 50% more likely to receive a callback.
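The "approximately 50% more likely" figure is the ratio of the two callback rates. A short check, using the counts from the summary table (the variable names are illustrative):

```python
# Callback rates from the summary table
w_rate = 235 / 2435   # white-sounding names
b_rate = 157 / 2435   # black-sounding names

# Ratio of callback rates (relative rate)
relative_rate = w_rate / b_rate

print(f"Callback rate ratio: {relative_rate:.2f}")
print(f"Relative increase: {relative_rate - 1:.0%}")
```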



# Q5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

According to the study, researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

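According to this description, the researchers took great care to make sure that the resumes were identical aside from the names. So, examining the resumes more closely for differences in education, work experience, etc. should not be helpful if the study was executed properly. The analysis therefore isolates the effect of the name, but it does not compare that effect with the effects of other factors, so it cannot show that race/name is the most important factor in callback success. The methods of the study could be examined more closely for sample bias in the cities and the hiring managers selected, etc.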