In [2]:
import os
import sys
import json

import numpy as np
import pylab as pl
import scipy.stats as stats

%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Test 1: Z-Test for Prisoners Employed After Release

**NULL HYPOTHESIS:** the percentage of former prisoners employed after release is the same or lower for candidates who participated in the program as for the control group, at significance level $\alpha = 0.05$.

$H_0: P_0 - P_1 \geq 0$

$H_a: P_0 - P_1 < 0$

$\alpha = 0.05$

This is a TEST OF PROPORTIONS. We use the Binomial distribution since each subject is a yes/no (Bernoulli) trial: the former inmate was or was not ever employed in a CEO transitional job (second row in the table above). Here $P_0$ is the proportion for the control group and $P_1$ the proportion for the program group:

$P_0=0.035, P_1=0.701$

In [3]:
alpha=0.05
# we like fractions better than percentages; as a rule of thumb, use either fractions or counts
P_0 = 3.5 * 0.01 
P_1 = 70.1 * 0.01

if P_0 - P_1 >= 0:
    # we are done
    print ("the Null holds")
else:
    print ("we must assess the statistical significance")

n_0 = 409
n_1 = 564

# let's get the counts by multiplying by the sample size
Nt_0 = P_0 * n_0
Nt_1 = P_1 * n_1

we must assess the statistical significance


START WITH Z TEST

The z test compares the observed result with the value expected under the null hypothesis, measured in units of the standard deviation of the expected distribution (the standard error). It tells you how many standard deviations from the expected mean an observation lies, under the assumption of normality.
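Written out with the quantities used in this notebook, the statistic computed in the cells that follow has the form below. The symbols $\hat{p}$ (pooled proportion) and $SE$ (standard error) are labels introduced here for readability; they correspond to the `p` and `se` functions in the code.

$$\hat{p} = \frac{P_0 n_0 + P_1 n_1}{n_0 + n_1}, \qquad SE = \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}, \qquad z = \frac{P_1 - P_0}{SE}$$

With this sign convention, a large positive $z$ means the program group proportion is well above the control group proportion.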

In [4]:
# define the pooled sample proportion first
sp = (P_0 * n_0 + P_1 * n_1) / (n_1 + n_0)
print (sp)

0.4210472764645426


In [5]:
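# pooled proportion of the two samples
p = lambda p0, p1, n0, n1: (p0 * n0 + p1 * n1) / (n0 + n1)
# standard error of the difference in proportions, using the pooled proportion
se = lambda p, n0, n1: np.sqrt(p * (1 - p) * (1.0 / n0 + 1.0 / n1))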

In [6]:
# z score: difference between two proportions in units of the standard error
zscore = lambda p0, p1, s : (p0 - p1) / s
z_2y = zscore(P_1, P_0, se(p(P_0, P_1, n_0, n_1), n_0, n_1))
print (z_2y)

20.7697865408


In [7]:
## p-value for employment after 2 years:
## our statistic is (way) larger than the largest z value in the table,
## so the true p-value is smaller than the one we would calculate using
## (e.g.) 0.9998 (the table tops out at 1.0000).
## Using 0.9998 is therefore a **conservative** approach.

p_2y = 1 - 0.9998


def report_result(p, a):
    '''Prints whether the p value p is smaller than the significance
    level a, and whether the Null hypothesis is rejected.
    '''
    print('is the p value ' +
          '{0:.2f} smaller than the critical value {1:.2f}?'.format(p, a))
    if p < a:
        print("YES!")
    else:
        print("NO!")

    print('the Null hypothesis is {}'.format(
        'rejected' if p < a else 'not rejected'))

    
report_result(p_2y, alpha)

is the p value 0.00 smaller than the critical value 0.05?
YES!
the Null hypothesis is rejected
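As a cross-check on the table lookup, the upper-tail probability can be computed directly. This is a minimal sketch, assuming `scipy.stats.norm.sf` (the survival function, `1 - cdf`) is an acceptable substitute for the printed table; the name `z_obs` is introduced here for illustration and simply holds the z statistic printed above.

```python
from scipy import stats

z_obs = 20.7697865408  # z statistic printed above (hypothetical variable name)

# upper-tail probability P(Z >= z_obs) under a standard normal distribution
p_exact = stats.norm.sf(z_obs)

# the table-based value 1 - 0.9998 is an upper bound on the exact p-value
print(p_exact < 1 - 0.9998)  # True
```

The exact value is vanishingly small, so the conservative table-based bound leads to the same conclusion.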


## Test 2: Z-Test for Recidivism

What if we used the values for whether the former inmate was or was not "Convicted of a felony" (row 10) under Recidivism (Years 1-3)?

Null hypothesis? $H_0$? $H_a$?

$P_0 = ??, P_1= ??$

Look up the data table and insert the appropriate values to get the appropriate result! You can use the functions I defined above, with different arguments:

`P_0 = ...`, `P_1 = ...`, `z_3y = ...`, `p_3y = ...`, `report_result(...)`

**Null Hypothesis $H_0$:** the percentage of former prisoners convicted of a felony after release is the same or greater for candidates who participated in the program as for the control group, at significance level $\alpha = 0.05$.

$H_0: P_1 - P_0 \geq 0$

$H_a: P_1 - P_0 < 0$

$\alpha = 0.05$

$P_0 = 0.117, P_1 = 0.100$

In [8]:
alpha=0.05

P_0 = 11.7 * 0.01 
P_1 = 10 * 0.01

if P_1 - P_0 >= 0:
    print ("the Null holds")
else:
    print ("we must assess the statistical significance")

n_0 = 409
n_1 = 564

# let's get the counts by multiplying by the sample size
Nt_0 = P_0 * n_0
Nt_1 = P_1 * n_1

we must assess the statistical significance


In [9]:
z_3y = zscore(P_1, P_0, se(p(P_0, P_1, n_0, n_1), n_0, n_1))
print (z_3y)

-0.846282982605


Using the table, $p_{3y} \approx 0.2$, which is greater than $\alpha = 0.05$.

In [10]:
p_3y = .2
report_result(p_3y, alpha)

is the p value 0.20 smaller than the critical value 0.05?
NO!
the Null hypothesis is not rejected
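The table lookup can be checked numerically. The sketch below assumes a lower-tail test, since $H_a: P_1 - P_0 < 0$ and the z statistic was computed as program minus control; `z_obs` and `p_lower` are names introduced here for illustration.

```python
from scipy import stats

z_obs = -0.846282982605  # z statistic printed above for the felony test
alpha = 0.05

# lower-tail probability P(Z <= z_obs) under a standard normal distribution
p_lower = stats.norm.cdf(z_obs)

print(round(p_lower, 2))
print(p_lower < alpha)
```

The computed value agrees with the roughly 0.2 read off the table, so the conclusion is unchanged.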


## Test 3: Chi-Squared for Employment

### Now let's do it with the $\chi^2$ test (for the initial test, employment)

In [14]:
def evalChisq(values):
    '''Evaluates the chi sq statistic from a contingency table
    Arguments:
    values: 2x2 array or list, the contingency table
    '''
    # convert first, so that lists are accepted and counts are floats
    values = np.array(values, dtype=float)
    if values.shape != (2, 2):
        print ("must pass a 2x2 array")
        return -1
    # expected counts: row total * column total / grand total
    E = np.empty_like(values)
    for j in range(values.shape[1]):
        for i in range(values.shape[0]):
            E[i][j] = ((values[i,:].sum() * values[:,j].sum()) /
                        (values).sum())
    return ((values - E)**2 / E).sum()

In [16]:
Ntot = 973 # a + b + c + d = tot

# second row: 3.5% of the control group were employed (P_0 = 0.035), 96.5% were not
sample_values = np.array([[0.701 * 564, 0.299 * 564], [0.035 * 409, 0.965 * 409]])

print("{:.2f}".format(evalChisq(sample_values)))

431.38


About 432 is way larger than 3.84.

Why am I mentioning 3.84? **3.84 is the $\chi^2$ value (for 1 degree of freedom) where p = 0.05 = alpha.**

How does the chi square statistic that you derived compare? **My chi square value is about 432; it likely differs because the formulas used to calculate the two statistics are derived differently.**

Please state what that means in terms of your Null hypothesis in a markdown cell below! **Because 432 > 3.84, p_data < 0.05 = alpha. Therefore, the null hypothesis is rejected.**
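The 3.84 threshold can be reproduced rather than looked up. This is a small sketch, assuming one degree of freedom, which is (rows - 1) * (columns - 1) for a 2x2 contingency table; the variable names are illustrative.

```python
from scipy import stats

alpha = 0.05
dof = 1  # (rows - 1) * (columns - 1) for a 2x2 table

# chi^2 value that leaves probability alpha in the upper tail
critical_value = stats.chi2.ppf(1 - alpha, dof)

print(round(critical_value, 2))  # 3.84
```

Any statistic above this value corresponds to a p-value below alpha.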

## Test 4: Chi-squared repeated with felony attribute

| convicted of a felony | yes         | no          | Total |
|-----------------------|-------------|-------------|-------|
| test sample           | 0.1 * 564   | 0.9 * 564   | 564   |
| control sample        | 0.117 * 409 | 0.883 * 409 | 409   |
| total                 | 104.253     | 868.747     | 973   |

In [17]:
Ntot = 973 # a + b + c + d = tot

sample_values = np.array([[0.1 * 564, 0.9 * 564], [0.117 * 409, 0.883 * 409]])

print(evalChisq(sample_values))

0.716194886647


**Because 0.716 < 3.84, p_data > 0.05 = alpha. Therefore, the null hypothesis cannot be rejected.**
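As a sanity check, the same statistic can be obtained from SciPy. The sketch below assumes `scipy.stats.chi2_contingency` with `correction=False` (no Yates continuity correction) is the right counterpart of the hand-written `evalChisq`; `z_felony` is a name introduced here that holds the z statistic printed in Test 2.

```python
import numpy as np
from scipy import stats

# same contingency table as above: [convicted, not convicted] per group
table = np.array([[0.1 * 564, 0.9 * 564],
                  [0.117 * 409, 0.883 * 409]])

# correction=False gives the plain Pearson chi^2 statistic
chi2_stat, p_value, dof, expected = stats.chi2_contingency(table, correction=False)

z_felony = -0.846282982605  # z statistic printed in Test 2

# for a 2x2 table, the uncorrected chi^2 equals the square of the pooled z
print(np.isclose(chi2_stat, z_felony ** 2))  # True
```

Note that the p-value returned here is two-sided, so it is about twice the one-sided value used in the z test; both are well above alpha.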