This notebook, by [felipe.alonso@urjc.es](mailto:felipe.alonso@urjc.es)

In this notebook we will:

1. Solve hypothesis testing exercices for **comparing two proportions**

2. Solve hypothesis testing for **contingency tables**


## Preliminars

#### How to build a contingency table

- There are different options here, but a quick an easy way is to use the [pd.crosstab](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) function

#### Other uses of chi-square statistic

- [Feature selection](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) in machine learning: if a feature is independent of the target then is uninformative.


In [1]:
import pandas as pd
import numpy as np

housing_data = pd.read_csv('../data/AmesHousing.csv',sep=',', decimal = '.')

# Display the info.
display(housing_data.head())

# Get categorical variables
print('- LIST of CAT. VARIABLES:\n\n',housing_data.columns[housing_data.dtypes == 'object'].values)

# Get values of a selected feature
column = 'MS Zoning'
print('\n- VALUES of {}:\n\n{}'.format(column,housing_data[column].unique()))

# Contingency table
print('\n- CONTINGENCY TABLE:')
pd.crosstab(housing_data['House Style'],housing_data['Neighborhood'],margins = True) 

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


- LIST of CAT. VARIABLES:

 ['MS Zoning' 'Street' 'Alley' 'Lot Shape' 'Land Contour' 'Utilities'
 'Lot Config' 'Land Slope' 'Neighborhood' 'Condition 1' 'Condition 2'
 'Bldg Type' 'House Style' 'Roof Style' 'Roof Matl' 'Exterior 1st'
 'Exterior 2nd' 'Mas Vnr Type' 'Exter Qual' 'Exter Cond' 'Foundation'
 'Bsmt Qual' 'Bsmt Cond' 'Bsmt Exposure' 'BsmtFin Type 1' 'BsmtFin Type 2'
 'Heating' 'Heating QC' 'Central Air' 'Electrical' 'Kitchen Qual'
 'Functional' 'Fireplace Qu' 'Garage Type' 'Garage Finish' 'Garage Qual'
 'Garage Cond' 'Paved Drive' 'Pool QC' 'Fence' 'Misc Feature' 'Sale Type'
 'Sale Condition']

- VALUES of MS Zoning:

['RL' 'RH' 'FV' 'RM' 'C (all)' 'I (all)' 'A (agr)']

- CONTINGENCY TABLE:


Neighborhood,Blmngtn,Blueste,BrDale,BrkSide,ClearCr,CollgCr,Crawfor,Edwards,Gilbert,Greens,...,NridgHt,OldTown,SWISU,Sawyer,SawyerW,Somerst,StoneBr,Timber,Veenker,All
House Style,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1.5Fin,0,0,0,56,8,0,21,43,1,0,...,0,73,22,10,3,0,0,2,0,314
1.5Unf,0,0,0,8,0,0,1,0,0,0,...,0,2,0,0,1,0,0,0,0,19
1Story,28,3,0,32,22,158,41,102,26,8,...,110,70,8,107,53,89,36,49,17,1481
2.5Fin,0,0,0,0,0,0,1,0,0,0,...,0,3,4,0,0,0,0,0,0,8
2.5Unf,0,0,0,2,0,0,4,1,0,0,...,0,13,1,0,0,0,0,0,0,24
2Story,0,7,30,10,10,98,33,24,119,0,...,56,75,13,8,58,92,15,15,3,873
SFoyer,0,0,0,0,0,5,0,13,0,0,...,0,2,0,14,5,0,0,0,0,83
SLvl,0,0,0,0,4,6,2,11,19,0,...,0,1,0,12,5,1,0,6,4,128
All,28,10,30,108,44,267,103,194,165,8,...,166,239,48,151,125,182,51,72,24,2930


# 1. Comparing two proportions

In [2]:
from scipy.stats import norm

# custom function
def two_proportion_test(x1,n1,x2,n2,message):
    
    # p1 = x1/n1
    # p2 = x2/n2
    # H0: p1=p2 vs H1:p1 != p2
    
    p1 = x1/n1
    p2 = x2/n2
    n = n1+n2
    
    p_pool = (x1+x2)/n
    se = np.sqrt(p_pool*(1-p_pool)*(1/n1+1/n2))

    z_statistic = (p1-p2)/se
    p_value = 2*norm().cdf(-1*np.abs(z_statistic))
    print('p-value = ', p_value)

    if p_value < 0.05: 
        print('Thus, we reject the null hypothesis ({})'.format(message))
    else:
        print('Thus, we FAIL to reject the null hypothesis (no evidence that {})'.format(message))

### Exercise 1:

Time magazine reported the result of a telephone poll of 800 adult Americans (smokers vs non-smokers). The question posed of the Americans who were surveyed was: "Should the federal tax on cigarettes be raised to pay for health care reform?" The results of the survey were the following_

- 351 out of 605 non-smokers said 'yes'
- 41 out of 195 smokers said 'yes'

<div class="alert alert-block alert-info">
Is there sufficient evidence at 5% confidence level to conclude that the two populations differ significantly with respect to their opinions?
</div>

In [3]:
# p1: proportion of non-smokers that said 'yes'
# p2: proportion of smokers that said 'yes'
# H0: p1=p2 vs H1:p1 != p2

two_proportion_test(x1=351,n1=605,x2=41,n2=195,message = 'populations differ')

p-value =  2.566230446480293e-19
Thus, we reject the null hypothesis (populations differ)


### Exercise 2:

A 30-year study was conducted with nearly 90,000 female participants. During a 5-year screening period, each woman was randomized to one of two groups: in the first group, women received regular mammograms to screen for breast cancer, and in the second group, women received regular non-mammogram breast cancer exams. No intervention was made during the following 25 years of the study, and we’ll consider death resulting from breast cancer over the full 30-year period. Results from the study are summarized in the following table

|Treatment |Death fro breast cancer|No death from breast cancer|
|---|-:-|---:|
|Mammogram|500|44425|
|Control|505|44405|

<div class="alert alert-block alert-info">
Can we conclude that mammograms have no benefits or harm?
</div>

In [4]:
# p1: proportion of deaths receiving mammograms
# p2: proportion of deaths receiving non-mammogram examination
# H0: p1=p2 vs H1 p1 != p2

two_proportion_test(x1=500,n1=44425,x2=505,n2=44405,message = 'mammograms have benefits or harm')

p-value =  0.8683157113444111
Thus, we FAIL to reject the null hypothesis (no evidence that mammograms have benefits or harm)


This example was extracted from example 6.18 from OpenIntro book. There are some interesting conclusions that arises from this example (also extracted from the book):

- We do not accept the null hypothesis, which means we don’t have sufficient evidence to conclude that mammograms reduce or increase breast cancer deaths.
- If mammograms are helpful or harmful, the data suggest the effect isn’t very large.
- Are mammograms more or less expensive than a non-mammogram breast exam? If one option is much more expensive than the other and doesn’t offer clear benefits, then we should lean towards the less expensive option.
- The study’s authors also found that mammograms led to overdiagnosis of breast cancer, which means some breast cancers were found (or thought to be found) but that these cancers would not cause symptoms during patients’ lifetimes. That is, something else would kill the patient before breast cancer symptoms appeared. This means some patients may have been treated for breast cancer unnecessarily, and this treatment is another cost to consider. It is also important to recognize that overdiagnosis can cause unnecessary physical or emotional harm to patients.

### Exercise 3

[Meuer and Woessner](https://journals.sagepub.com/doi/abs/10.1177/1477370818809663) describe an experiment to test the effect of electronic monitoring (tagging) on “low-risk” prisoners. Forty-eight (male) prisoners were randomly allocated to two groups:

* In the experimental group, the prisoner served the last part of his sentence under “supervised early work release”, involving the use of an open prison and electronic tagging.
* In the control group, the prisoner served the last part of his sentence in prison, as normal.

Following the end of the sentence, the prisoners were followed up for two years. It was recorded whether each prisoner reoffended. The results were as follows:

|group|sample size|	number reoffending|	\% reoffending|
|---|---|---|---|
|experimental|	24|	7|	29.2%|
|control|	30|	15|	50.0%|

<div class="alert alert-block alert-info">
Can we conclude that early release and tagging of prisoners affect the likelihood of reoffending?
</div>

In [5]:
# p1: proportion of reoffending (experimental group)
# p2: proportion of reoffending (control group)
# H0: p1=p2 vs H1 p1 != p2

two_proportion_test(x1=7,n1=24,x2=15,n2=30,
                    message = 'early release and tagging of prisoners affect the likelihood of reoffending')

p-value =  0.12156686001105872
Thus, we FAIL to reject the null hypothesis (no evidence that early release and tagging of prisoners affect the likelihood of reoffending)


# 2. Hypothesis testing for contingency tables

SciPy stats provides with a number of functions to perform inference analysis for contingency tables:

- [`chi2_contingency`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html#scipy.stats.chi2_contingency)

- [`fisher_exact`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html#scipy.stats.fisher_exact)

- [`expected_freq`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.expected_freq.html#scipy.stats.contingency.expected_freq)


### Exercise 4:

We consider data from a random sample of 275 jurors in a small county. Jurors identified their racial group, as shown in the following table

|Race|White| Black| Hispanic| Other|
|---|---|---|---|---|
|Representation in juries (counts) |205| 26| 25| 19|    
|Registered voters (%)|0.72 |0.07 |0.12 |0.09|
 

<div class="alert alert-block alert-info">
Are these jurors racially representative of the population?
</div>

In [6]:
from scipy.stats import chi2

# This is a one-way table example, and we cannot us the chi2_contingency Python function

# First, we convert the list [205,26,25,19] to a numpy array, so we can do math operations on it
observed = np.array([205,26,25,19]) 
n = observed.sum()
expected = n*np.array([0.72,0.07,0.12,0.09])

# Calculate the Chi2 statistic
chi2_statistic = ( (observed-expected)**2/expected ).sum()
print(chi2_statistic)

# Calculate the p-value
N = len(observed)
p_value = 1-chi2(df=N-1).cdf(chi2_statistic)
print('p-value = ', p_value)

message = 'there is racial bias in the juror selection'
if p_value < 0.05: 
    print('Thus, we reject the null hypothesis ({})'.format(message))
else:
    print('Thus, we FAIL to reject the null hypothesis (no evidence that {})'.format(message))

5.889610389610387
p-value =  0.1171061913085063
Thus, we FAIL to reject the null hypothesis (no evidence that there is racial bias in the juror selection)


### Exercise 5:

In a survey of 237 students smoking habits and exercise levels were observed

|Smoking status| exercise: regular|exercise: some/none|
|---|---|---|        
|Never|87|102|
|Occasional|12|7|
|Regular|9|8|
|Heavy|7|4|


<div class="alert alert-block alert-info">
Is smoking status independent of exercise level?
</div>

In [7]:
from scipy.stats import chi2_contingency

# Since this is not a 2x2 contingency table, we cannot use Fisher exact's test Python implementation

#H0: smoking status and exercise level are independent 
#H1: smoking status depends on exercise level

obs = [[87,102],[12,8], [9,8], [7,4]]

chi2_statistic, p_value, df, expected_counts = chi2_contingency(obs)

print('p-value = ', p_value)

message = 'smoking status depends on exercise level'
if p_value < 0.05: 
    print('Thus, we reject the null hypothesis ({})'.format(message))
else:
    print('Thus, we FAIL to reject the null hypothesis (no evidence that {})'.format(message))

p-value =  0.4465424972323605
Thus, we FAIL to reject the null hypothesis (no evidence that smoking status depends on exercise level)


#### Exercise 6:

The table below shows the observed frequencies of different kinds of crime in three neighborhoods.

|Violence|	Theft|	Vandalism|**Total**|
|---|---|---|---|
|Neighborhood1|	16|	25|	42|	**83**|
|Neighborhood2|	15|	18|	16|	**49**|
|Neighborhood3|	39|	36|	30|	**105**|
|**Total**	|70	|79	|88	|237|


<div class="alert alert-block alert-info">
What are the expected counts of this table? Is there an association between different neighbourhoods and types of crime?
</div>


In [8]:
# Since this is not a 2x2 contingency table, we cannot use Fisher exact's test Python implementation

#H0: there is no association between different neighbourhoods and types of crime (they are independent)
#H1: there is an association between different neighbourhoods and types of crime

obs = [[16,25],[15,18], [39,36]]

chi2_statistic, p_value, df, expected_counts = chi2_contingency(obs)

print('The expected counts are: \n', expected_counts.round(2))

print('p-value = ', p_value)

message = 'there is an association between different neighbourhoods and types of crime'
if p_value < 0.05: 
    print('Thus, we reject the null hypothesis ({})'.format(message))
else:
    print('Thus, we FAIL to reject the null hypothesis (no evidence that {})'.format(message))

The expected counts are: 
 [[19.26 21.74]
 [15.5  17.5 ]
 [35.23 39.77]]
p-value =  0.4002374266857166
Thus, we FAIL to reject the null hypothesis (no evidence that there is an association between different neighbourhoods and types of crime)


### Exercise 7:

You have quite a lot of plants in and outside your house, some of which have flowers, and some of which don't. Your flower data is presented below: 

|Flowering |Indoors|	Outdoors|
|---|---|---|
|Flower	|7	|3|
|No flower|	1|	12|


<div class="alert alert-block alert-info">
Is flowering independent from the plant being indoors or outdoors?
</div>

In [9]:
# We can use use Fisher exact's test 
from scipy.stats import fisher_exact

# H0: Flowering does not depend on plant location (indoor/outdoor)
# H1: Flowering depends on plant location (indoor/outdoor)

obs = [[7,3],[1,12]]
_, p_value = fisher_exact(obs)
print('p-value = ', p_value)

message = 'flowering depends on plant location (indoor/outdoor)'
if p_value < 0.05: 
    print('Thus, we reject the null hypothesis ({})'.format(message))
else:
    print('Thus, we FAIL to reject the null hypothesis (no evidence that {})'.format(message))
    
# For comparison purposes, we add the chi2_contingency test
chi2_statistic, p_value, df, expected_counts = chi2_contingency(obs)
print('p-value = ', p_value)

if p_value < 0.05: 
    print('Thus, we reject the null hypothesis ({})'.format(message))
else:
    print('Thus, we FAIL to reject the null hypothesis (no evidence that {})'.format(message))

p-value =  0.005898261114306346
Thus, we reject the null hypothesis (flowering depends on plant location (indoor/outdoor))
p-value =  0.007616400216780019
Thus, we reject the null hypothesis (flowering depends on plant location (indoor/outdoor))


### Exercise 8:

The table below describes residents of a Madrid neighborhood based on their car ownership and public transportation usage.

| Public vs Cars  | Owns car | Does not own car| Total|
|---|---|---|---|
|Uses public transport|34|94|128|
|Does not use public transport|126|17|143|
|Total|160|111|271|  



<div class="alert alert-block alert-info">
Is there an association between car ownership and public transportation usage? If there was no association, how many individuals would we expect to not own a car and not use public transport?
</div>


In [10]:
# Question 1: is there an association between car ownership and public transportation usage?
# H0: There is no association between car ownership and public transportation usage
# H1: There is an association between car ownership and public transportation usage

obs = [[34,94],[126,17]]
_, p_value = fisher_exact(obs)
print('p-value = ', p_value)

message = 'There is an association between car ownership and public transportation usage'
if p_value < 0.05: 
    print('Thus, we reject the null hypothesis ({})'.format(message))
else:
    print('Thus, we FAIL to reject the null hypothesis (no evidence that {})'.format(message))
    
# Question 2: 
# If there was no association, how many individuals would we expect to not own a car and not use public transport?

# Sol: if events were independent, then
p_not_own_car = 111/271
p_not_use_public = 143/271

n_individuals = 271 * p_not_own_car * p_not_use_public
print('\n# of individuals = ', n_individuals)

p-value =  3.449987535043802e-26
Thus, we reject the null hypothesis (There is an association between car ownership and public transportation usage)

# of individuals =  58.57195571955719
