# Hypothesis Testing - Part 1. Comparing Proportions (z test)

The two-proportion z test is a hypothesis test typically used to determine whether the proportions of two populations are different.

**Assumptions**  

Conditions for inference:  
1) Random sample (both samples are random)  
2) Large counts: $n_1 p_1 > 10$, $n_1(1-p_1) > 10$, $n_2 p_2 > 10$, $n_2(1-p_2) > 10$  
3) Observations in each sample are independent (either sampled with replacement, or the sample is less than 10% of the population)

The $n \geq 30$ condition is not needed when dealing with proportions; it applies to means.
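
The large-counts condition in item 2 can be checked in a couple of lines. This is a minimal sketch: the helper name `large_counts_ok` is hypothetical (it is not part of this notebook), and it uses the threshold of 10 stated above.

```python
def large_counts_ok(p_hat, n, threshold=10):
    """Return True when expected successes and failures both exceed the threshold."""
    successes = n * p_hat
    failures = n * (1 - p_hat)
    return successes > threshold and failures > threshold


# Illustrative values only: a proportion of 0.05 in samples of 2000 and of 100
ok_large = large_counts_ok(0.05, 2000)  # 100 successes, 1900 failures
ok_small = large_counts_ok(0.05, 100)   # only about 5 successes
print(ok_large, ok_small)
```

The check has to pass for both samples before the normal approximation behind the z test is reasonable.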


**Hypothesis**  
**Null:** Proportions are equal.  
**Alternative:** Proportions are not equal (two tailed) / one proportion is greater or less than the other (one tailed)

If the p value is less than the significance level then we can reject the null hypothesis.


The following Resources have been used:  


*This notebook is from a series on Hypothesis Testing* 
1. ***Hypothesis Testing - Comparing Proportions (z test)***
2. *Hypothesis Testing - Comparing Means (t test)*  
3. *Hypothesis Testing - Chi Sq*  
4. *Hypothesis Testing - ANOVA*  

#### Libraries

In [197]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import datasets

#### Import the data 
Here we are using the stroke data set from Kaggle.

In [198]:
stroke = pd.read_csv('./data/healthcare-dataset-stroke-data.csv')
stroke.head()

| |id|gender|age|hypertension|heart_disease|ever_married|work_type|Residence_type|avg_glucose_level|bmi|smoking_status|stroke|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|9046|Male|67.0|0|1|Yes|Private|Urban|228.69|36.6|formerly smoked|1|
|1|51676|Female|61.0|0|0|Yes|Self-employed|Rural|202.21| |never smoked|1|
|2|31112|Male|80.0|0|1|Yes|Private|Rural|105.92|32.5|never smoked|1|
|3|60182|Female|49.0|0|0|Yes|Private|Urban|171.23|34.4|smokes|1|
|4|1665|Female|79.0|1|0|Yes|Self-employed|Rural|174.12|24.0|never smoked|1|


In [199]:
print("The proportion of patients having a stroke is {:.2%}".format(stroke['stroke'].mean()))
print("The total of number of patients having a stroke is {}".format(stroke['stroke'].sum()))

stroke.describe()

The proportion of patients having a stroke is 4.87%
The total of number of patients having a stroke is 249


| |id|age|hypertension|heart_disease|avg_glucose_level|bmi|stroke|
|---|---|---|---|---|---|---|---|
|count|5110.0|5110.0|5110.0|5110.0|5110.0|4909.0|5110.0|
|mean|36517.829354|43.226614|0.097456|0.054012|106.147677|28.893237|0.048728|
|std|21161.721625|22.612647|0.296607|0.226063|45.28356|7.854067|0.21532|
|min|67.0|0.08|0.0|0.0|55.12|10.3|0.0|
|25%|17741.25|25.0|0.0|0.0|77.245|23.5|0.0|
|50%|36932.0|45.0|0.0|0.0|91.885|28.1|0.0|
|75%|54682.0|61.0|0.0|0.0|114.09|33.1|0.0|
|max|72940.0|82.0|1.0|1.0|271.74|97.6|1.0|


#### Create the proportions

In [200]:
# Proportion with the condition and group size, per gender and per residence type.
# The string 'mean' replaces np.mean; agg already returns a DataFrame.
gender_stroke_df = stroke.groupby('gender').agg(proportion = ('stroke', 'mean'), patients = ('stroke', 'count'))
gender_hd_df = stroke.groupby('gender').agg(proportion = ('heart_disease', 'mean'), patients = ('heart_disease', 'count'))
residence_stroke_df = stroke.groupby('Residence_type').agg(proportion = ('stroke', 'mean'), patients = ('stroke', 'count'))
residence_hd_df = stroke.groupby('Residence_type').agg(proportion = ('heart_disease', 'mean'), patients = ('heart_disease', 'count'))


display(gender_stroke_df)
display(gender_hd_df)
display(residence_stroke_df)
display(residence_hd_df)

|gender|proportion|patients|
|---|---|---|
|Female|0.047094|2994|
|Male|0.051064|2115|
|Other|0.0|1|


|gender|proportion|patients|
|---|---|---|
|Female|0.037742|2994|
|Male|0.077069|2115|
|Other|0.0|1|


|Residence_type|proportion|patients|
|---|---|---|
|Rural|0.045346|2514|
|Urban|0.052003|2596|


|Residence_type|proportion|patients|
|---|---|---|
|Rural|0.053302|2514|
|Urban|0.0547|2596|


## 1. Confidence Intervals for difference in Proportions
The confidence interval for *one proportion* is calculated using the following formula
$$
 \hat p \pm z \sqrt{\frac{\hat p(1-\hat p)}{n}}
$$

The confidence interval for *two proportions* is as follows
$$
 \hat p_1 - \hat p_2 \pm z \sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1}+\frac{\hat p_2(1-\hat p_2)}{n_2}}
$$


#### Create a function to calculate the CI of the difference between the second and the first proportion
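
The function cell itself does not appear in this export. As a rough stand-in, the two-proportion formula above could be sketched as follows. The name `two_proportion_ci` and its signature are assumptions for illustration and are not the notebook's `ci_two_proportions`. The sketch takes the critical value from the standard library's `NormalDist` so that it runs on its own.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(p1, p2, n1, n2, confidence=0.95):
    """Two-sided confidence interval for p2 - p1 using the unpooled standard error."""
    # Critical z for a two-sided interval, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff - z * se, diff + z * se


# Stroke proportions by gender from the table above (male first, then female)
low, high = two_proportion_ci(0.051064, 0.047094, 2115, 2994)
print(round(low, 4), round(high, 4))
```

An interval that contains zero is consistent with no difference between the two proportions at that confidence level.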

#### Check the confidence Intervals for stroke by gender

In [201]:
male_p = gender_stroke_df.loc['Male','proportion']
female_p = gender_stroke_df.loc['Female','proportion']
male_n = gender_stroke_df.loc['Male','patients']
female_n = gender_stroke_df.loc['Female','patients']
ci_two_proportions(male_p,female_p,male_n,female_n,2,0.95)

A z test has been calculated at 95.0% confidence for the difference of two proportions: 
0.0471 and 0.0511 with respective sample sizes 2994 and 2115.  The z statistic is 1.960
We are 95.0% confident that the difference in proportions is between -0.016 and 0.0081
Because the confidence interval includes zero, we conclude that the difference is not statistically significant


(-0.016, 0.0081)

## 2. Confidence Intervals for one Proportion

#### Adapt the function to work for one population
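
The adapted function is also missing from this export. A minimal sketch of the one-proportion interval might look like this. It assumes a hypothetical helper `one_proportion_ci` (not the notebook's `ci_one_proportion`) and a two-sided interval.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_ci(p_hat, n, confidence=0.95):
    """Two-sided confidence interval for a single population proportion."""
    # Critical z for a two-sided interval at the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Male stroke proportion and sample size from the gender table above
low, high = one_proportion_ci(0.051064, 2115)
print(round(low, 4), round(high, 4))
```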

#### Check some Examples using the stroke data set

In [202]:
#male CI
ci_one_proportion(male_p,male_n,2,0.95)

A z test has been calculated at 95.0% confidence for a sample proportion : 
0.0511 and size 2115.0000. The z statistic is 1.960
We are 95.0% confident that the population proportion is between 0.0417 and 0.0604


(0.0417, 0.0604)

In [203]:
# Female CI
ci_one_proportion(female_p,female_n,2,0.95)

A z test has been calculated at 95.0% confidence for a sample proportion : 
0.0471 and size 2994.0000. The z statistic is 1.960
We are 95.0% confident that the population proportion is between 0.0395 and 0.0547


(0.0395, 0.0547)

In [204]:
# Let's try heart disease
male_p = gender_hd_df.loc['Male','proportion']
female_p = gender_hd_df.loc['Female','proportion']
male_n = gender_hd_df.loc['Male','patients']
female_n = gender_hd_df.loc['Female','patients']
ci_two_proportions(male_p,female_p,male_n,female_n,2,0.95)

A z test has been calculated at 95.0% confidence for the difference of two proportions: 
0.0377 and 0.0771 with respective sample sizes 2994 and 2115.  The z statistic is 1.960
We are 95.0% confident that the difference in proportions is between -0.0526 and -0.0261
Because the confidence interval does not includes zero, we conclude that the difference is statistically significant


(-0.0526, -0.0261)

In [205]:
# Male CI. Note: tail=1 gives z = 1.645, so this two-sided interval has 90% coverage rather than 95%
ci_one_proportion(male_p,male_n,1,0.95)

A z test has been calculated at 95.0% confidence for a sample proportion : 
0.0771 and size 2115.0000. The z statistic is 1.645
We are 95.0% confident that the population proportion is between 0.0675 and 0.0866


(0.0675, 0.0866)

In [206]:
# Female CI. Note: tail=1 gives z = 1.645, so this two-sided interval has 90% coverage rather than 95%
ci_one_proportion(female_p,female_n,1,0.95)

A z test has been calculated at 95.0% confidence for a sample proportion : 
0.0377 and size 2994.0000. The z statistic is 1.645
We are 95.0% confident that the population proportion is between 0.032 and 0.0435


(0.032, 0.0435)

### Example - Opinion split by city region (additional two proportion CI example)

Duncan is investigating whether residents in the city support the construction of a new high school. He is curious about the difference of opinion between the north and the south parts of the city. He obtained the following information from a random sample.


|Supports|North|South|
|---|---|---|
|Yes|54|77|
|No|66|63|
|Total|120|140|

What is the 90% CI for the difference in the proportions who support the project ($p_S - p_N$)?

In [207]:
# Here we use proportion=False because we are entering raw counts rather than proportions
ci_two_proportions(54,77,120,140,2,0.9,False)

A z test has been calculated at 90.0% confidence for the difference of two proportions: 
0.5500 and 0.4500 with respective sample sizes 140 and 120.  The z statistic is 1.645
We are 90.0% confident that the difference in proportions is between -0.0018 and 0.2018
Because the confidence interval includes zero, we conclude that the difference is not statistically significant


(-0.0018, 0.2018)

## 3. Hypothesis Testing Proportions
In the code above we calculated confidence intervals. We can also run a hypothesis test to determine whether there is a significant difference between the populations. The approach is similar, with one key difference: because the null hypothesis assumes both samples share the same proportion, the standard error uses a pooled proportion rather than the two separate sample proportions.

Here we test whether the difference in proportions is 0. The pooled proportion $\hat p = \frac{x_1 + x_2}{n_1 + n_2}$ gives the standard error of the estimate:
$$
SE = \sqrt{\hat p(1-\hat p)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}
$$
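
To make the pooled standard error concrete, here is a small sketch using the counts from the school survey example (54 of 120 in the north, 77 of 140 in the south). The helper names `pooled_se` and `unpooled_se` are hypothetical and exist only for this comparison.

```python
from math import sqrt


def pooled_se(x1, n1, x2, n2):
    """Standard error of p2 - p1 under H0, where both groups share one proportion."""
    p = (x1 + x2) / (n1 + n2)  # pooled proportion: total successes / total sample size
    return sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


def unpooled_se(x1, n1, x2, n2):
    """Standard error of p2 - p1 as used for the confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


se_pooled = pooled_se(54, 120, 77, 140)
se_unpooled = unpooled_se(54, 120, 77, 140)
print(round(se_pooled, 4), round(se_unpooled, 4))
```

The two values are close here because the sample proportions are similar. They diverge more as the proportions move apart.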


#### Create a function to calculate the p value for comparing two proportions

In [208]:
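def compare_two_proportions(x1, x2, n1, n2, tail, alpha, proportion=True):
    """Two-proportion z test of H0: p2 - p1 = 0 using the pooled standard error.

    x1 and x2 are sample proportions when proportion=True, otherwise success counts.
    tail is 1 or 2 and alpha is the significance level.
    """
    if proportion:
        p1 = x1
        p2 = x2
        # Pooled proportion: weighted average of the two sample proportions
        p = (x1*n1 + x2*n2)/(n1+n2)
    else:
        p1 = x1/n1
        p2 = x2/n2
        # Pooled proportion: total successes over total sample size
        p = (x1+x2)/(n1+n2)

    p21 = p2 - p1

    # Unlike the confidence intervals, the standard error uses the pooled proportion
    se = np.sqrt(p*(1-p)*(1/n1 + 1/n2))

    # Test statistic (the observed z, not a critical value)
    z_stat = p21/se

    # Tail area beyond |z|, multiplied by 2 for a two-tailed test.
    # With tail=1 this is only valid when z points in the direction of H1.
    p_val = tail*stats.norm.cdf(-np.abs(z_stat))
    if p_val <= alpha:
        print("We reject the Null Hypothesis with a pvalue {:.6f}".format(p_val))
    else:
        print("We fail to reject the Null Hypothesis with a pvalue {:.6f}".format(p_val))

    print("The z stat is ", z_stat)
    print("The SE", se)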

#### Example Male/Female Stroke (Two tail)
$ H_0 $ : Difference in proportion of male and female having stroke = 0  
$ H_1 $ : Difference in proportion of male and female having stroke $ \neq $ 0   

In [209]:
# Let's try stroke
male_p = gender_stroke_df.loc['Male','proportion']
female_p = gender_stroke_df.loc['Female','proportion']
male_n = gender_stroke_df.loc['Male','patients']
female_n = gender_stroke_df.loc['Female','patients']

# Test Null hypothesis: The chance of getting a stroke is the same for men and women (2 sided test at 5% significance)
compare_two_proportions(male_p,female_p,male_n,female_n, 2, 0.05 ,proportion =True)

We fail to reject the Null Hypothesis with a pvalue 0.516302
The z stat is  -0.6490565013589283
The SE 0.006116018254449811


#### Example Male/Female Heart Disease (Two tail)
$ H_0 $ : Difference in proportion of male and female having heart disease = 0  
$ H_1 $ : Difference in proportion of male and female having heart disease $ \neq $ 0   

In [210]:
# Let's try heart disease - clear difference
male_p = gender_hd_df.loc['Male','proportion']
female_p = gender_hd_df.loc['Female','proportion']
male_n = gender_hd_df.loc['Male','patients']
female_n = gender_hd_df.loc['Female','patients']

# Test Null hypothesis: The chance of having heart disease is the same for men and women (2 sided test at 5% significance)
compare_two_proportions(male_p,female_p,male_n,female_n, 2, 0.05 ,proportion =True)

We reject the Null Hypothesis with a pvalue 0.000000
The z stat is  -6.124496144701803
The SE 0.006421166088093382


#### Example Urban/Rural Stroke (One tail)
$ H_0 $ : Urban proportion with stroke = Rural proportion with stroke  
$ H_1 $ : Urban proportion with stroke $ > $ Rural proportion with stroke

In [211]:
rural_p = residence_stroke_df.loc['Rural','proportion']
urban_p = residence_stroke_df.loc['Urban','proportion']
rural_n = residence_stroke_df.loc['Rural','patients']
urban_n = residence_stroke_df.loc['Urban','patients']

# Test Null hypothesis: The chance of getting a stroke is the same for rural and urban residents (1 sided test at 10% significance)
compare_two_proportions(rural_p,urban_p,rural_n,urban_n, 1, 0.1 ,proportion =True)

# Just to illustrate: the result would be significant at 15%, but choosing alpha after seeing the p value is p hacking
compare_two_proportions(rural_p,urban_p,rural_n,urban_n, 1, 0.15 ,proportion =True)

We fail to reject the Null Hypothesis with a pvalue 0.134580
The z stat is  1.1050012851200206
The SE 0.006024445130730861
We reject the Null Hypothesis with a pvalue 0.134580
The z stat is  1.1050012851200206
The SE 0.006024445130730861
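
`compare_two_proportions` computes the one-tailed p value from $|z|$, so it returns the same number whichever way the alternative points. A minimal sketch of a direction-aware calculation follows. It assumes a hypothetical helper `one_sided_p_value` (not part of the notebook) and reuses the z statistic printed above.

```python
from statistics import NormalDist


def one_sided_p_value(z_stat, alternative):
    """p value for a one-tailed z test.

    alternative="larger" tests H1: p2 > p1; alternative="smaller" tests H1: p2 < p1.
    """
    cdf = NormalDist().cdf(z_stat)
    if alternative == "larger":
        return 1 - cdf  # area in the upper tail
    if alternative == "smaller":
        return cdf      # area in the lower tail
    raise ValueError("alternative must be 'larger' or 'smaller'")


# z statistic for urban minus rural, copied from the output above
z = 1.1050012851200206
p_larger = one_sided_p_value(z, "larger")
p_smaller = one_sided_p_value(z, "smaller")
print(round(p_larger, 4), round(p_smaller, 4))
```

For the urban versus rural example the "larger" alternative reproduces the p value reported above, while the opposite alternative gives a p value close to 1.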
