This notebook, by [felipe.alonso@urjc.es](mailto:felipe.alonso@urjc.es)

In this notebook we will:

1. Solve hypothesis testing exercises for **comparing two proportions**

2. Solve hypothesis testing exercises for **contingency tables**


## Preliminaries

#### How to build a contingency table

- There are different options here, but a quick and easy way is to use the [pd.crosstab](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) function
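
As a minimal sketch of how `pd.crosstab` works, the cell below builds a contingency table from a small made-up DataFrame. The column names `smoker` and `exercise` and all values are hypothetical and are not taken from the housing data.

```python
import pandas as pd

# Hypothetical toy data: one row per observation, two categorical columns.
df = pd.DataFrame({
    "smoker":   ["yes", "no", "no", "yes", "no", "no"],
    "exercise": ["low", "high", "low", "low", "high", "high"],
})

# Rows are the levels of the first argument, columns the levels of the second.
table = pd.crosstab(df["smoker"], df["exercise"])
print(table)

# margins=True appends row and column totals under the label "All".
table_with_totals = pd.crosstab(df["smoker"], df["exercise"], margins=True)
print(table_with_totals)
```

The resulting table of counts (without the margins) can be passed directly to the SciPy functions used later in this notebook.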

#### Other uses of chi-square statistic

- [Feature selection](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) in machine learning: if a feature is independent of the target, then it is uninformative.
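
To illustrate the idea, here is a small sketch using `sklearn.feature_selection.chi2` on made-up data. The two binary features and the target below are assumptions chosen for illustration: the first feature copies the target, while the second is evenly spread across both classes.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up binary target and two non-negative features (chi2 requires X >= 0).
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
informative = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # identical to the target
uninformative = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # same distribution in both classes
X = np.column_stack([informative, uninformative])

# chi2 returns one statistic and one p-value per feature (column of X).
scores, p_values = chi2(X, y)
for name, score, p in zip(["informative", "uninformative"], scores, p_values):
    print(f"{name}: chi2 = {score:.3f}, p-value = {p:.3f}")
```

A larger statistic (smaller p-value) suggests the feature depends on the target; the feature that is balanced across classes gets a statistic of zero.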


In [None]:
import pandas as pd
import numpy as np

housing_data = pd.read_csv('./data/AmesHousing.csv', sep=',', decimal='.')

In [None]:
housing_data.head(10)

# 1. Comparing two proportions

### Exercise 1

Time magazine reported the results of a telephone poll of 800 adult Americans (smokers and non-smokers). The question posed to those surveyed was: "Should the federal tax on cigarettes be raised to pay for health care reform?" The results of the survey were the following:

- 351 out of 605 non-smokers said 'yes'
- 41 out of 195 smokers said 'yes'

<div class="alert alert-block alert-info">
Is there sufficient evidence at the 5% significance level to conclude that the two populations differ significantly with respect to their opinions?
</div>
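
One possible approach, sketched here under the usual large-sample assumptions, is a two-sided pooled z-test for the difference of two proportions. The helper name `two_proportion_ztest` is illustrative and not part of any library.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided pooled z-test for H0: p1 == p2 (illustrative helper)."""
    p1, p2 = x1 / n1, x2 / n2
    # Under H0 both groups share one proportion, estimated by pooling the samples.
    p_pool = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

# Counts from the exercise: non-smokers (351 of 605) vs smokers (41 of 195).
z, p_value = two_proportion_ztest(351, 605, 41, 195)
print(f"z = {z:.3f}, p-value = {p_value:.3g}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```

The same helper can be reused for the other two-proportion exercises in this section by changing the four counts.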

In [None]:
# your code here
# ...

### Exercise 2

A 30-year study was conducted with nearly 90,000 female participants. During a 5-year screening period, each woman was randomized to one of two groups: in the first group, women received regular mammograms to screen for breast cancer, and in the second group, women received regular non-mammogram breast cancer exams. No intervention was made during the following 25 years of the study, and we’ll consider death resulting from breast cancer over the full 30-year period. Results from the study are summarized in the following table:

|Treatment|Death from breast cancer|No death from breast cancer|
|---|---:|---:|
|Mammogram|500|44425|
|Control|505|44405|

<div class="alert alert-block alert-info">
Can we conclude that mammograms have no benefits or harm?
</div>

In [None]:
# your code here
# ...

### Exercise 3

[Meuer and Woessner](https://journals.sagepub.com/doi/abs/10.1177/1477370818809663) describe an experiment to test the effect of electronic monitoring (tagging) on “low-risk” prisoners. Fifty-four (male) prisoners were randomly allocated to two groups:

* In the experimental group, the prisoner served the last part of his sentence under “supervised early work release”, involving the use of an open prison and electronic tagging.
* In the control group, the prisoner served the last part of his sentence in prison, as normal.

Following the end of the sentence, the prisoners were followed up for two years. It was recorded whether each prisoner reoffended. The results were as follows:

|Group|Sample size|Number reoffending|% reoffending|
|---|---|---|---|
|Experimental|24|7|29.2%|
|Control|30|15|50.0%|

<div class="alert alert-block alert-info">
Can we conclude that early release and tagging of prisoners affect the likelihood of reoffending?
</div>

In [None]:
# your code here
# ...

# 2. Hypothesis testing for contingency tables

SciPy's `stats` module provides a number of functions to perform inference on contingency tables:

- [`chi2_contingency`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html#scipy.stats.chi2_contingency)

- [`fisher_exact`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html#scipy.stats.fisher_exact)

- [`expected_freq`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.expected_freq.html#scipy.stats.contingency.expected_freq)
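
The following sketch shows how these three functions can be called on a made-up 2x2 table; the counts are hypothetical and do not come from any exercise.

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import expected_freq

# Made-up 2x2 table of counts for two categorical variables.
observed = np.array([[20, 30],
                     [30, 20]])

# Expected counts under independence: row total * column total / grand total.
expected = expected_freq(observed)
print(expected)

# Chi-square test of independence. correction=False turns off Yates'
# continuity correction, which is applied by default to 2x2 tables.
chi2_stat, p_chi2, dof, expected_from_test = stats.chi2_contingency(
    observed, correction=False
)
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p-value = {p_chi2:.4f}")

# Fisher's exact test, a common choice for 2x2 tables with small counts.
odds_ratio, p_fisher = stats.fisher_exact(observed)
print(f"odds ratio = {odds_ratio:.3f}, p-value = {p_fisher:.4f}")
```

Note that `chi2_contingency` also returns the expected counts, so a separate call to `expected_freq` is only needed when the test itself is not required.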


### Exercise 4

We consider data from a random sample of 275 jurors in a small county. Jurors identified their racial group, as shown in the following table:

|Race|White| Black| Hispanic| Other|
|---|---|---|---|---|
|Representation in juries (counts)|205|26|25|19|
|Registered voters (proportion)|0.72|0.07|0.12|0.09|
 

<div class="alert alert-block alert-info">
Are these jurors racially representative of the population?
</div>
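
This question compares one row of counts against known population proportions, which is a goodness-of-fit problem rather than a test of independence. One possible sketch uses `scipy.stats.chisquare`, a function that is not among those listed at the start of this section.

```python
import numpy as np
from scipy import stats

# Observed juror counts and registered-voter proportions from the table above.
observed = np.array([205, 26, 25, 19])
proportions = np.array([0.72, 0.07, 0.12, 0.09])

# Expected counts if jurors were drawn in proportion to registered voters.
expected = proportions * observed.sum()
print(expected)

# Goodness-of-fit test; degrees of freedom = number of categories - 1.
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2_stat:.3f}, p-value = {p_value:.4f}")
```

A p-value above the chosen significance level would mean the data do not provide evidence against representativeness.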

In [None]:
# your code here
# ...

### Exercise 5

In a survey of 237 students, smoking habits and exercise levels were recorded:

|Smoking status| exercise: regular|exercise: some/none|
|---|---|---|        
|Never|87|102|
|Occasional|12|7|
|Regular|9|8|
|Heavy|7|4|


<div class="alert alert-block alert-info">
Is smoking status independent of exercise level?
</div>

In [None]:
# your code here
# ...

### Exercise 6

The table below shows the observed frequencies of different kinds of crime in three neighborhoods.

|Neighborhood|Violence|Theft|Vandalism|**Total**|
|---|---|---|---|---|
|Neighborhood1|16|25|42|**83**|
|Neighborhood2|15|18|16|**49**|
|Neighborhood3|39|36|30|**105**|
|**Total**|70|79|88|237|


<div class="alert alert-block alert-info">
What are the expected counts for this table? Is there an association between neighborhood and type of crime?
</div>


In [None]:
# your code here
# ...

### Exercise 7

You have quite a lot of plants in and outside your house, some of which have flowers, and some of which don't. Your flower data is presented below: 

|Flowering|Indoors|Outdoors|
|---|---|---|
|Flower|7|3|
|No flower|1|12|


<div class="alert alert-block alert-info">
Is flowering independent of the plant being indoors or outdoors?
</div>
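
With counts this small, the chi-square approximation may be unreliable. A common rule of thumb (an assumption here, not stated in the exercise) is to prefer Fisher's exact test when any expected count falls below 5. A sketch of that check:

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import expected_freq

# Rows: flower / no flower; columns: indoors / outdoors (from the table above).
observed = np.array([[7, 3],
                     [1, 12]])

# Check the expected counts under independence against the rule of thumb.
expected = expected_freq(observed)
small_cells = int((expected < 5).sum())
print(expected)
print(f"cells with expected count below 5: {small_cells}")

# Fisher's exact test does not rely on the large-sample approximation.
odds_ratio, p_value = stats.fisher_exact(observed, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.1f}, p-value = {p_value:.4f}")
```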

In [None]:
# your code here
# ...

### Exercise 8

The table below describes residents of a Madrid neighborhood based on their car ownership and public transportation usage.

| Public vs Cars  | Owns car | Does not own car| Total|
|---|---|---|---|
|Uses public transport|34|94|128|
|Does not use public transport|126|17|143|
|Total|160|111|271|  



<div class="alert alert-block alert-info">
Is there an association between car ownership and public transportation usage? If there were no association, how many individuals would we expect to neither own a car nor use public transport?
</div>


In [None]:
# your code here
# ...