
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
Perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 

****

In [62]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [63]:
data = pd.io.stata.read_stata('C:/Users/Esme/Desktop/SpringBoard/Lesson_8_Jupyter_Human_Temp/data_wrangling_json/racial_disc/data/us_job_market_discrimination.dta')

In [64]:
# number of callbacks for black-sounding names
total_black = sum(data[data.race=='b'].call)
print("total black calls:", total_black)

total_white = sum(data[data.race=='w'].call)
print("total white calls:", total_white)

total black calls: 157.0
total white calls: 235.0


****

# Exercise

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.
    
You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
   
****

In [65]:
print(data.head(5))

  id ad  education  ofjobs  yearsexp  honors  volunteer  military  empholes  \
0  b  1          4       2         6       0          0         0         1   
1  b  1          3       3         6       0          1         1         0   
2  b  1          4       1         6       0          0         0         0   
3  b  1          3       4         6       0          1         0         1   
4  b  1          3       3        22       0          0         0         0   

   occupspecific    ...      compreq  orgreq  manuf  transcom  bankreal trade  \
0             17    ...          1.0     0.0    1.0       0.0       0.0   0.0   
1            316    ...          1.0     0.0    1.0       0.0       0.0   0.0   
2             19    ...          1.0     0.0    1.0       0.0       0.0   0.0   
3            313    ...          1.0     0.0    1.0       0.0       0.0   0.0   
4            313    ...          1.0     1.0    0.0       0.0       0.0   0.0   

  busservice othservice  missind  owne

In [80]:
b_data = data[data.race=='b']
w_data = data[data.race=='w']


print('total results with black sounding names:', len(b_data))
print('total results with white sounding names:', len(w_data))

  id ad  education  ofjobs  yearsexp  honors  volunteer  military  empholes  \
2  b  1          4       1         6       0          0         0         0   
3  b  1          3       4         6       0          1         0         1   
7  b  1          3       4        21       0          1         0         1   
8  b  1          4       3         3       0          0         0         0   
9  b  1          4       2         6       0          1         0         0   

   occupspecific    ...      compreq  orgreq  manuf  transcom  bankreal trade  \
2             19    ...          1.0     0.0    1.0       0.0       0.0   0.0   
3            313    ...          1.0     0.0    1.0       0.0       0.0   0.0   
7            313    ...          1.0     1.0    0.0       0.0       0.0   0.0   
8            316    ...          0.0     0.0    0.0       0.0       0.0   1.0   
9            263    ...          0.0     0.0    0.0       0.0       0.0   1.0   

  busservice othservice  missind  owne

1) We can treat the observations as independent (a 1 or 0 in the B group does not affect the W group), the overall sample is much larger than 30, and the sample contains equal numbers of black-sounding and white-sounding names.

Therefore the CLT applies, and a two-sided t-test could work. 
I will compute the mean call-back rate of the B and W groups and use a two-sided t-test to decide whether to reject or fail to reject the null hypothesis:
2) 
H0: the mean call-back rates for black-sounding and white-sounding names are the same
H1: the mean call-back rates are not the same
Significance level (alpha) of 0.05
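
Because `call` only takes the values 0 and 1, each group mean is a proportion, so a two-proportion z-test is a common alternative to the t-test for these hypotheses. The sketch below is an illustration, not the method used in the rest of this notebook: the helper `two_proportion_ztest` is a hypothetical name, and the inputs are the call-back counts and group sizes that this notebook prints.

```python
import math
from scipy import stats

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equal proportions, using the pooled standard error."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under H0 both groups share one rate, so pool the counts to estimate it.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

# 157 and 235 call-backs, with 2435 resumes per group (values printed in this notebook).
z, p_value = two_proportion_ztest(157, 2435, 235, 2435)
print("z =", z)
print("p =", p_value)
```

With samples this large, the z-test and the t-test should lead to the same decision about H0.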

3) Compute margin of error, confidence interval, and p-value.

To get these values I will clean up the data and isolate what I need. I tried the same stats.ttest_ind() I used in the last problem, but I keep running into an error involving NaN values. So I will check and clean the data (drop NaN values) in the race and call columns. 

In [81]:
simple_data = data[['race', 'call']].copy()
clean_data = simple_data.dropna()

b_data_c = clean_data[clean_data.race=='b']
w_data_c = clean_data[clean_data.race=='w']

print('total results with black sounding names:', len(b_data_c))
print('total results with white sounding names:', len(w_data_c))


total_black_c = sum(clean_data[clean_data.race=='b'].call)
print("total black calls:", total_black_c)

total_white_c = sum(clean_data[clean_data.race=='w'].call)
print("total white calls:", total_white_c)



total results with black sounding names: 2435
total results with white sounding names: 2435
total black calls: 157.0
total white calls: 235.0
  race  call
2    b   0.0
3    b   0.0
7    b   0.0
8    b   0.0
9    b   0.0


Same number of results as above: no rows were removed, and ttest_ind is still not working. I will use ttest_ind_from_stats to get the p-value. 
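
The failing `ttest_ind` call is not shown, so its cause is a guess. One possibility is that whole DataFrames (including the string column `race`) were passed instead of the numeric `call` column. The sketch below uses a small synthetic DataFrame (`toy`, hypothetical values for illustration only) to show how to count missing values per column and how to pass one numeric column per group.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Small synthetic stand-in for the real data (hypothetical values, illustration only).
toy = pd.DataFrame({
    "race": ["b", "b", "b", "b", "w", "w", "w", "w"],
    "call": np.array([0, 0, 1, 0, 1, 0, 1, 1], dtype="float32"),
})

# Count missing values per column before deciding whether dropna() is needed.
missing = toy.isna().sum()
print(missing)

# Pass only the numeric 'call' column for each group, not the whole DataFrame.
b_calls = toy.loc[toy.race == "b", "call"]
w_calls = toy.loc[toy.race == "w", "call"]

# equal_var=False requests Welch's t-test, matching the setting used in this notebook.
t_toy, p_toy = stats.ttest_ind(b_calls, w_calls, equal_var=False)
print("t =", t_toy, "p =", p_toy)
```

If the same pattern works on the real data, `ttest_ind` and `ttest_ind_from_stats` can be compared directly.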

In [87]:
mean_b = np.mean(b_data_c)
mean_w = np.mean(w_data_c)
var_b = np.var(b_data_c)
var_w = np.var(w_data_c)
n_b = len(b_data_c)
n_w = len(w_data_c)

std_b = np.std(b_data_c)
std_w = np.std(w_data_c)

t, p = stats.ttest_ind_from_stats(mean_b, np.sqrt(var_b), n_b,
                           mean_w, np.sqrt(var_w), n_w,
                           equal_var= False)

print("ttest_ind_from_stats: t=", t)
print("ttest_ind_from_stats: p=", p)

print("Mean of Black names call back rate:", mean_b)
print("Mean of White names call back rate:", mean_w)

print("Variance of black names call back rate:", var_b)
print("Variance of white names call back rate:", var_w)

print("Std of black names call back rate:", std_b)
print("Std of white names call back rate:", std_w)


ttest_ind_from_stats: t= call   -4.115583
dtype: float32
ttest_ind_from_stats: p= [  3.92801240e-05]
Mean of Black names call back rate: call    0.064476
dtype: float32
Mean of White names call back rate: call    0.096509
dtype: float32
Variance of black names call back rate: call    0.060319
dtype: float32
Variance of white names call back rate: call    0.087193
dtype: float32
Std of black names call back rate: call    0.245599
dtype: float32
Std of white names call back rate: call    0.295285
dtype: float32


## As we have the mean we can get the margin of error: 

In [90]:
se_CI = np.sqrt((mean_b*(1 - mean_b)/(len(b_data_c))) + (mean_w*(1 - mean_w) /(len(w_data_c))))
se_CI

# We are calculating at the 5% significance level so our critical value is 1.96
crit = 1.96
margin = abs(crit*se_CI)
print("The true population proportion lies +/- %0.6F around the point estimate" % margin)

The true population proportion lies +/- 0.015255 around the point estimate
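
As a cross-check, the same margin can be recomputed from the raw counts alone. This is a minimal sketch: `margin_of_error_diff` is a hypothetical helper, and it derives the critical value from the normal distribution with `stats.norm.ppf` instead of hard-coding 1.96.

```python
import math
from scipy import stats

def margin_of_error_diff(p1, n1, p2, n2, confidence=0.95):
    """Margin of error for the difference between two independent proportions."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    # Unpooled standard error of the difference p2 - p1.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return crit * se

# Call-back counts and group sizes as printed in this notebook.
margin_check = margin_of_error_diff(157 / 2435, 2435, 235 / 2435, 2435)
print("margin of error: %0.6f" % margin_check)
```

The standard error here is unpooled because the interval describes the difference itself and does not assume the two rates are equal.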


The p-value is statistically significant (3.92801240e-05 < 0.05), so we reject the null hypothesis; i.e., there is a statistically significant difference between the rates of call-backs for white-sounding and black-sounding names. 

However there are more numbers to find. 

## The confidence interval:
To find the confidence interval, I treat race as the explanatory variable and call as the response (dependent) variable, and compute the interval for the difference in call-back rates between the two groups. 

In [95]:
# difference in probability
mean_diff= mean_w - mean_b
print(mean_diff)

CI_high = mean_diff + margin
CI_low = mean_diff - margin

print("The proportion of CVs with white-sounding names that receive a call is between %0.6F and %0.6F higher than the proportion of CVs with black-sounding names" % (CI_low,CI_high))


call    0.032033
dtype: float32
The proportion of CVs with white-sounding names that receive a call is between 0.016777 and 0.047288 higher than the proportion of CVs with black-sounding names


The confidence interval agrees with the p-value: 
white-sounding names receive more call-backs than black-sounding names.

## 4) Discuss what it means: 

The p-value of about 0.0000393 means that, if the null hypothesis were true, the chance of seeing a difference in call-back rates at least this large would be about 0.00393%. The confidence interval agrees: it lies entirely above zero, so in the data provided the call-back rate for white-sounding names is higher than for black-sounding names to a statistically significant degree. 
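
One way to make the meaning of the p-value concrete is a permutation check: if race had no effect, shuffling the call outcomes between the two groups should produce a difference as large as the observed one reasonably often. The sketch below is an illustration that is not part of the original analysis. It rebuilds the outcomes from the counts printed in this notebook, and the number of shuffles (2000) and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n_per_group = 2435
# Rebuild the 0/1 outcomes from the printed counts: 157 + 235 call-backs in total.
calls = np.zeros(2 * n_per_group)
calls[:157 + 235] = 1

# Observed difference in call-back rates (white minus black).
observed_diff = 235 / n_per_group - 157 / n_per_group

n_perm = 2000
extreme = 0
for _ in range(n_perm):
    # Shuffling breaks any link between group membership and outcome (H0).
    shuffled = rng.permutation(calls)
    diff = shuffled[:n_per_group].mean() - shuffled[n_per_group:].mean()
    if abs(diff) >= observed_diff:
        extreme += 1

p_perm = extreme / n_perm
print("observed difference:", observed_diff)
print("permutation p-value estimate:", p_perm)
```

With only 2000 shuffles the estimate cannot resolve values below 1/2000, so it can only confirm that the p-value is very small, not reproduce the exact figure from the t-test.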