# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [125]:
import pandas as pd
import numpy as np
from scipy import stats

To address these questions, let's first look at the dataset. It contains 4870 rows and 65 columns, so it is quite wide. Of these rows, 2435 correspond to black-sounding names and 2435 to white-sounding names.

In [126]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

#data.info()
#data.head()

print(data[data.race=='b']['race'].value_counts()) 
print(data[data.race=='w']['race'].value_counts())

b    2435
Name: race, dtype: int64
w    2435
Name: race, dtype: int64


Let's compute the number of callbacks for black-sounding names and for white-sounding names, stored in mean_b and mean_w. The probabilities of a callback for the two types of names are prob_b = mean_b/2435 and prob_w = mean_w/2435. Both probabilities are rather small, although mean_b and mean_w are rather large. The sample size N = 2435 is large as well, so the central limit theorem is applicable. The numbers of callbacks (for black-sounding and white-sounding names) can be treated as binomial random variables with means mean_b and mean_w respectively. Their standard deviations are std_b = sqrt(prob_b * (1 - prob_b) * N) and std_w = sqrt(prob_w * (1 - prob_w) * N). Since N, mean_b and mean_w are all large, the numbers of callbacks approximately follow normal distributions.

In [127]:
# number of callbacks for black-sounding names
mean_b = sum(data[data.race=='b'].call)

# number of callbacks for white-sounding names
mean_w = sum(data[data.race=='w'].call)

print("number of callbacks for black-sounding names:", mean_b)
print("number of callbacks for white-sounding names:", mean_w)

# probabilities of a callback for both types of names
prob_b = mean_b/2435
prob_w = mean_w/2435


print("probability of a callbacks for a black-sounding name:", prob_b)
print("probability of a callbacks for a white-sounding name:", prob_w)

# the standard deviations
std_w = np.sqrt(prob_w * (1- prob_w) *2435)
std_b = np.sqrt(prob_b * (1- prob_b) *2435)

print("the std of the number of callback for black-sounding names:" , std_b)
print("the std of the number of callback for white-sounding names:" , std_w)

number of callbacks for black-sounding names: 157.0
number of callbacks for white-sounding names: 235.0
probability of a callbacks for a black-sounding name: 0.064476386037
probability of a callbacks for a white-sounding name: 0.0965092402464
the std of the number of callback for black-sounding names: 12.1192907132
the std of the number of callback for white-sounding names: 14.5712157537
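
The claim that the normal approximation applies can be checked with a common rule of thumb: the expected numbers of successes and of failures should both be at least 10. The sketch below assumes that threshold (a convention, not something stated in the exercise) and reuses the counts printed above; the helper name `normal_approx_ok` is hypothetical.

```python
# Counts taken from the output above.
n = 2435          # resumes per group
calls_b = 157     # callbacks for black-sounding names
calls_w = 235     # callbacks for white-sounding names


def normal_approx_ok(successes, n, threshold=10):
    """Rule of thumb (assumed): successes and failures are both >= threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


print("black-sounding:", normal_approx_ok(calls_b, n))
print("white-sounding:", normal_approx_ok(calls_w, n))
```

Both groups have far more than 10 callbacks and 10 non-callbacks, which is consistent with treating the counts as approximately normal.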


Let's determine the range of plausible values for the expected numbers of callbacks around mean_b and mean_w. We use the stats.norm.interval function from the stats module with a 95 percent confidence level. Under the normal approximation, the 95 percent interval for the number of callbacks for black-sounding names runs from 133.246626684 to 180.753373316, while the interval for white-sounding names runs from 206.440941912 to 263.559058088. These ranges do not overlap: the lower bound for white-sounding names is greater than the upper bound for black-sounding names.

In [128]:
interv_b = stats.norm.interval(0.95, loc = mean_b, scale = std_b)
interv_w = stats.norm.interval(0.95, loc = mean_w, scale = std_w)

print("With 95 percent probability the average number of callbacks for black-sounding\
      names will lie between", interv_b[0], "and", interv_b[1])

print('\n')
print("With 95 percent probability the average number of callbacks for white-sounding\
      names will lie between", interv_w[0], "and", interv_w[1])

With 95 percent probability the average number of callbacks for black-sounding      names will lie between 133.246626684 and 180.753373316


With 95 percent probability the average number of callbacks for white-sounding      names will lie between 206.440941912 and 263.559058088
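
Comparing two separate intervals is a conservative check. A more direct route, sketched here under the same normal approximation, is a confidence interval for the difference in callback rates, which also gives the margin of error asked for in the exercises. The variable names are illustrative, and the counts are the ones printed earlier.

```python
import numpy as np
from scipy import stats

n_b = n_w = 2435
calls_b, calls_w = 157, 235

p_b = calls_b / n_b
p_w = calls_w / n_w
diff = p_w - p_b

# Unpooled standard error of the difference between two proportions.
se_diff = np.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)

# Critical value for a two-sided 95 percent interval (about 1.96).
z_crit = stats.norm.ppf(0.975)
margin_of_error = z_crit * se_diff
ci_low, ci_high = diff - margin_of_error, diff + margin_of_error

print("difference in callback rates:", diff)
print("margin of error:", margin_of_error)
print("95 percent confidence interval:", (ci_low, ci_high))
```

If the resulting interval excludes zero, the data are inconsistent with equal callback rates at the 5 percent level.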


This suggests that the numbers of callbacks for black-sounding and white-sounding names would remain well separated if the experiment were repeated many times. Let's confirm this finding with a hypothesis test. The null hypothesis is that mean_b is equal to mean_w, i.e. that race plays no role in callbacks. To compute the corresponding p-value, we use the function stats.norm.cdf from the stats module, which gives the probability of observing a count as low as mean_b under the normal distribution fitted to white-sounding names. The p-value is very small, meaning that the difference between the numbers of callbacks for black-sounding and white-sounding names is statistically significant.

In [129]:
p_value = stats.norm.cdf(mean_b, loc= mean_w, scale = std_w)
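
The calculation above compares mean_b against the distribution for white-sounding names only. A commonly used alternative, not part of the original notebook, is the two-proportion z-test, which pools both groups under the null hypothesis of equal callback rates. A minimal sketch using the counts printed earlier:

```python
import numpy as np
from scipy import stats

n_b = n_w = 2435
calls_b, calls_w = 157, 235

p_b, p_w = calls_b / n_b, calls_w / n_w

# Under H0 both groups share one callback rate, estimated by pooling.
p_pool = (calls_b + calls_w) / (n_b + n_w)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))

z_stat = (p_w - p_b) / se_pool
# Two-sided p-value: probability of a |z| at least this large under H0.
p_value_two_sided = 2 * stats.norm.sf(abs(z_stat))

print("z statistic:", z_stat)
print("two-sided p-value:", p_value_two_sided)
```

This version accounts for sampling variability in both groups rather than in one only.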

The finding that the difference between the numbers of callbacks for black-sounding and white-sounding names is statistically significant is based on the data in one column only, the callbacks themselves. On its own this does not establish racial discrimination, because the decision whether to call back could depend on data given in other columns. To clarify the picture, one should check whether the data for black-sounding and white-sounding names in the other columns are at least approximately the same. One should investigate the columns that may play an important role in a callback decision: education, yearsexp, honors, volunteer, military, workinschool, computerskills, specialskills and possibly some others. One can create an additional column 'SKILLS' containing a numerical value equal to a weighted sum of all the relevant skills. After that, one should repeat the statistical analysis, testing the hypothesis that the callback means for black-sounding and white-sounding names are equal separately for candidates with the same rating in the 'SKILLS' column.
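
The stratified comparison described above might look like the following sketch. The equal weighting of skill columns, the helper name `callback_rates_by_skills`, and the tiny `toy` frame are all assumptions made for illustration; the toy rows are not taken from the dataset.

```python
import pandas as pd


def callback_rates_by_skills(df, skill_cols, weights=None):
    """Callback rate per race within each value of a weighted SKILLS score."""
    if weights is None:
        # Assumption: every skill column counts equally.
        weights = {col: 1 for col in skill_cols}
    scored = df.copy()
    scored["SKILLS"] = sum(weights[col] * scored[col] for col in skill_cols)
    rates = scored.groupby(["SKILLS", "race"])["call"].mean().unstack("race")
    # Gap in callback rate between white- and black-sounding names.
    rates["gap"] = rates["w"] - rates["b"]
    return rates


# Toy rows for illustration only (not real data).
toy = pd.DataFrame({
    "race":           ["b", "w", "b", "w", "b", "w", "b", "w"],
    "honors":         [0, 0, 0, 0, 1, 1, 1, 1],
    "computerskills": [1, 1, 1, 1, 1, 1, 1, 1],
    "call":           [0, 1, 0, 0, 0, 1, 1, 1],
})

rates = callback_rates_by_skills(toy, ["honors", "computerskills"])
print(rates)
```

On the real data one would pass the skill-related columns listed above and then test the gap within each 'SKILLS' stratum, keeping in mind that small strata give noisy estimates.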

In [130]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 65 columns):
id                    4870 non-null object
ad                    4870 non-null object
education             4870 non-null int8
ofjobs                4870 non-null int8
yearsexp              4870 non-null int8
honors                4870 non-null int8
volunteer             4870 non-null int8
military              4870 non-null int8
empholes              4870 non-null int8
occupspecific         4870 non-null int16
occupbroad            4870 non-null int8
workinschool          4870 non-null int8
email                 4870 non-null int8
computerskills        4870 non-null int8
specialskills         4870 non-null int8
firstname             4870 non-null object
sex                   4870 non-null object
race                  4870 non-null object
h                     4870 non-null float32
l                     4870 non-null float32
call                  4870 non-null float32
city        