
# Lesson 3: Hypothesis Testing for Proportions

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

## Lesson Outline


  1. Hypothesis Testing for Proportions
    - One-Sample Binomial Test
    - One-Sample Z-Test
    - Chi-Squared Goodness of Fit Test




We now try to find out whether a specified fraction of the population has a given property. We tackle two cases: small sample size and large sample size.

## Example 1: Test if the employee attrition rate is more than 15%


### Small Sample Size: Binomial Test

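For small sample sizes (n<30), the normal approximation can be unreliable, so we calculate the p-value directly from the binomial distribution. Because this test is exact, it also remains valid for larger samples.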

A binomial distribution gives the probability of each possible number of successes in a sequence of independent experiments that share the same success probability. In other words, it returns the probability that exactly $k$ out of $N$ experiments are successes (e.g. the probability of getting $k$ heads after tossing a coin $N$ times).
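As a minimal sketch of this idea (not the statsmodels implementation), the probabilities can be computed directly from the binomial formula. The function names `binom_pmf` and `binom_tail` and the coin-toss numbers are illustrative assumptions, not part of the lesson's dataset.

```python
from math import comb


def binom_pmf(k, n_trials, p):
    """Probability of exactly k successes in n_trials independent trials."""
    return comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)


def binom_tail(k, n_trials, p):
    """Probability of k or more successes (a one-sided 'larger' p-value)."""
    return sum(binom_pmf(i, n_trials, p) for i in range(k, n_trials + 1))


# Hypothetical example: 8 heads in 10 tosses of a fair coin.
p_exact = binom_pmf(8, 10, 0.5)   # 45 / 1024
p_tail = binom_tail(8, 10, 0.5)   # (45 + 10 + 1) / 1024

print('P(exactly 8 heads):', p_exact)
print('P(8 or more heads):', p_tail)
```

The tail sum is the quantity a one-sided binomial test reports as its p-value. This direct formula is only meant for small counts; with large samples the intermediate terms become too large or too small for floating point, which is one reason to rely on a library routine.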

statsmodels has the [binom_test](https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.binom_test.html) function that can calculate this value.

``` python
p_val = statsmodels.stats.proportion.binom_test(count,
                                                nobs,
                                                prop=0.5,
                                                alternative='two-sided')
```



Let's load our dataset

In [3]:
df = pd.read_csv('/content/drive/MyDrive/Refocus Module/Jupyter Notebook-20230501T012835Z-001/Jupyter Notebook/datasets/hr_employee_attrition.csv')
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DepartmentInCompany,DistanceHome,Education,Educ,EmployeeNumber,EnvironmentSatisfaction,...,PerformanceRating,RelationshipSatisfaction,StandardHours,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1.0,2,Life Sciences,1,2,...,3,1,80,8,0,1,6,4,0,5
1,-49,No,Travel_Frequently,279,Research & Development,8.0,1,Life Sciences,2,3,...,4,4,80,10,3,3,10,7,1,7
2,1,Yes,Travel_Rarely,1373,Research & Development,2.0,2,Other,4,4,...,3,2,80,7,3,3,0,0,0,0
3,2,No,Travel_Frequently,1392,Research & Development,3.0,4,Life Sciences,5,4,...,3,3,80,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2.0,1,Medical,7,1,...,3,4,80,6,3,3,2,2,2,2


In [4]:
df.Attrition.value_counts()

No     1234
Yes     236
Name: Attrition, dtype: int64

In [5]:
resigned = df[df.Attrition=='Yes']
not_resigned = df[df.Attrition=='No']

In [6]:
print('Count of Resigned Employees:', len(resigned))
print('Count of Remaining Employees:', len(not_resigned))
print('Count of All Employees:', len(resigned)+len(not_resigned))
print('Proportion of Resigned Employees:', len(resigned)/(len(resigned)+len(not_resigned)))

Count of Resigned Employees: 236
Count of Remaining Employees: 1234
Count of All Employees: 1470
Proportion of Resigned Employees: 0.16054421768707483


We define the null and alternative hypotheses as follows:

$H_0$: Fraction of Resigned Employees $\leq 0.15$ <br>
$H_A$: Fraction of Resigned Employees $> 0.15$

We set our alpha (significance level) to 0.05.

In [7]:
from statsmodels.stats.proportion import binom_test

In [8]:
alpha = 0.05
p_val = binom_test(count=len(resigned),
                   nobs=len(resigned)+len(not_resigned),
                   prop=0.15,
                   alternative='larger')

print('p-value (one-sided):', p_val)

p-value (one-sided): 0.1369501423188985


In [9]:
if p_val < alpha:
    print (f'Reject the Null Hypothesis (alpha = {alpha}).')
else:
    print (f'Fail to Reject the Null Hypothesis (alpha = {alpha}).')

Fail to Reject the Null Hypothesis (alpha = 0.05).


Conclusion: There is not enough evidence to say that the employee attrition rate is more than 15%.

### Large Sample Size: Z-Test

With larger sample sizes (n≥30), the binomial distribution can be approximated by the normal distribution (see also: central limit theorem). This lets us use the Z-test for proportions: we calculate the test statistic, z, then convert it to the corresponding p-value.

If $N$ is our sample size, $N_q$ is the number of successes in our sample, and $q$ is the population fraction we are testing against, we can use the following equations:

$\hat{q} = \frac{N_q}{N}$ <br>
$\sigma_{\hat{q}} = \sqrt{\frac{q(1 - q)}{N}}$

The test statistic, z, is then computed as shown below, and the corresponding p-value is obtained from it.

$z = \frac{\hat{q} - q}{\sigma_{\hat{q}}}$

<br>

Similarly, statsmodels has the [proportions_ztest](https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportions_ztest.html) function that can calculate the p-value.

``` python
z_stat, p_val = statsmodels.stats.proportion.proportions_ztest(count,
                                                               nobs,
                                                               value=None,
                                                               alternative='two-sided',
                                                               prop_var=False)
```
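The formulas above can also be worked through by hand. The following is a minimal sketch using only the Python standard library and the attrition counts from the previous example (236 resigned out of 1470). The variable names are illustrative, and the sketch shows two possible choices for the standard error: one based on the null proportion $q$ (as in the formula above) and one based on the sample proportion $\hat{q}$.

```python
from math import sqrt
from statistics import NormalDist

# Counts from the attrition example above.
n_success = 236      # N_q: resigned employees
n_total = 1470       # N: all employees
q = 0.15             # proportion under the null hypothesis

q_hat = n_success / n_total

# Standard error using the null proportion q, as in the formula above.
se_null = sqrt(q * (1 - q) / n_total)
z_null = (q_hat - q) / se_null

# Standard error using the sample proportion q_hat instead.
se_sample = sqrt(q_hat * (1 - q_hat) / n_total)
z_sample = (q_hat - q) / se_sample

# One-sided ("larger") p-value: area of the standard normal to the right of z.
std_normal = NormalDist()
p_null = 1 - std_normal.cdf(z_null)
p_sample = 1 - std_normal.cdf(z_sample)

print('z (null variance):  ', z_null, 'p-value:', p_null)
print('z (sample variance):', z_sample, 'p-value:', p_sample)
```

According to the statsmodels documentation, `proportions_ztest` with `prop_var=False` bases the variance on the sample proportion, so it should correspond to `p_sample` here. Passing the null proportion as `prop_var` (for example `prop_var=0.15`) would instead use the null-based standard error shown in the formula. The two p-values differ only slightly for this sample.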



In [10]:
from statsmodels.stats.proportion import proportions_ztest

In [11]:
z_stat, p_val = proportions_ztest(count=len(resigned),
                                  nobs=len(resigned)+len(not_resigned),
                                  value=0.15,
                                  alternative='larger',
                                  prop_var=False)
print('p-value (one-sided):', p_val)

p-value (one-sided): 0.1353989735075387


In [12]:
if p_val < alpha:
    print (f'Reject the Null Hypothesis (alpha = {alpha}).')
else:
    print (f'Fail to Reject the Null Hypothesis (alpha = {alpha}).')

Fail to Reject the Null Hypothesis (alpha = 0.05).


Conclusion: There is not enough evidence to say that the employee attrition rate is more than 15%.

## Tests for Categorical Data

### Example 3: Test if students prefer the 3 ice cream flavors equally

In this example, we will utilize the Ice Cream dataset from the University of Sheffield. This data was collected on 200 high school students with scores on various tests, including a video game and a puzzle. The outcome measure in this analysis is the student’s favorite flavor of ice cream – vanilla, chocolate or strawberry.


In [13]:
df_icecream = pd.read_csv('/content/drive/MyDrive/Refocus Module/Jupyter Notebook-20230501T012835Z-001/Jupyter Notebook/datasets/IceCream.csv')
df_icecream.head()

Unnamed: 0,id,female,ice_cream,video,puzzle
0,70,0,2,47,57
1,121,1,1,63,61
2,86,0,3,58,31
3,141,0,3,53,56
4,172,0,1,53,61


### The $\chi^2$ Goodness-of-Fit Test (chi-square)

In the $\chi^2$ Goodness-of-Fit Test, we aim to test if our sample distribution matches an expected (hypothesized) population distribution. As with the previous tests, we calculate a test statistic ($\chi^2$) and use it to identify the corresponding p-value.

Here, we calculate the observed frequencies ($O_i$) for each of the $K$ categories and compare each of them to the expected frequencies ($E_i$) using the following summation:

$\chi^2 = \sum^K_{i = 1}\frac{\left(O_i - E_i\right)^2}{E_i}$

<br>
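To make the summation concrete, here is a minimal hand-computed sketch using only the standard library. It uses the three flavor counts from the ice cream data analyzed in this lesson (95, 47, and 58) and assumes equal expected counts. The closed-form p-value in the last step is a shortcut that holds only for 2 degrees of freedom, so this is an illustration rather than a general replacement for a library routine.

```python
from math import exp

# Observed counts for the three flavor codes (1, 2, 3) in this lesson's data.
observed = [95, 47, 58]

# Under the null hypothesis of equal preference, every category has the
# same expected count.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

# Chi-squared statistic: sum of (O_i - E_i)^2 / E_i over all K categories.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With K = 3 categories there are K - 1 = 2 degrees of freedom. For exactly
# 2 degrees of freedom the upper-tail probability is exp(-chi_sq / 2);
# other degrees of freedom need a library routine.
p_val = exp(-chi_sq / 2)

print('chi-squared:', chi_sq)
print('p-value:', p_val)
```

A large statistic means the observed counts sit far from the expected ones, which translates into a small p-value.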

scipy has the [chisquare](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) goodness of fit function that can calculate the p-value.

``` python
chi_sq, p_val = scipy.stats.chisquare(f_obs, f_exp=None, ddof=0, axis=0)
```



In [14]:
from scipy.stats import chisquare

We define the null and alternative hypotheses as follows:

$H_0$: The three ice cream flavors are equally preferred by students <br>
$H_A$: The three ice cream flavors are not equally preferred by students

We set our alpha (significance level) to 0.05.

In [15]:
obs = df_icecream.ice_cream.value_counts()
n_obs = len(df_icecream)

obs

1    95
3    58
2    47
Name: ice_cream, dtype: int64

In [16]:
print('Total students surveyed:', n_obs)

Total students surveyed: 200


In [17]:
alpha = 0.05
f_obs = obs.values

chi_sq, p_val = chisquare(f_obs=f_obs, f_exp=None)  # f_exp=None: all categories are assumed equally likely

print('p-value:', p_val)

p-value: 7.598307042939823e-05


In [18]:
if p_val < alpha:
    print (f'Reject the Null Hypothesis (alpha = {alpha}).')
else:
    print (f'Fail to Reject the Null Hypothesis (alpha = {alpha}).')

Reject the Null Hypothesis (alpha = 0.05).


Conclusion: The three ice cream flavors are not equally preferred; students like some flavors more than others.

## Lesson Summary

| Test          | Variables Compared                     | Function                                       |
|---------------|----------------------------------------|------------------------------------------------|
| Binomial Test | 1 variable vs. Fixed Value (n<30)      | statsmodels.stats.proportion.binom_test        |
| Z-Test        | 1 variable vs. Fixed Value (n≥30)      | statsmodels.stats.proportion.proportions_ztest |
| $\chi^2$ Test | 1 variable across different categories | scipy.stats.chisquare                          |