
# Lesson 2: Hypothesis Testing for Means

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

In this notebook, we aim to learn the following:
1. Install and utilize the `scipy` and `statsmodels` packages for hypothesis testing
2. Identify and implement the applicable tests for varying problems
  a. Hypothesis Testing for Means (Lesson 2)
    - One-Sample T-Test
    - Two-Sample Unpaired T-Test
    - Two-Sample Paired T-Test

  b. Hypothesis Testing for Proportions (Lesson 3)
    - One-Sample Binomial Test
    - One-Sample Z-Test
    - Chi-Squared Goodness of Fit Test




# The Scipy and Statsmodels Package

**[Scipy](https://scipy.org)** (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering.

- Can be installed with pip: `pip install scipy`


**[Statsmodels](https://www.statsmodels.org/stable/index.html)** provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

- Can also be installed with pip: `pip install statsmodels`

## Recall:

$p$-value = probability of observing results at least as extreme as ours, given that the null hypothesis is true

$\alpha$ = significance level: the probability threshold below which a p-value leads us to reject the null hypothesis

In hypothesis testing, we fix $\alpha$ before running the test and then follow these steps:

1. Set up null and alternative hypotheses
    * Determine the appropriate test to use
    * Determine the test *sidedness* and *distribution*
2. Set significance level, $\alpha$
3. Calculate the *test statistic*
4. Find the p-value, $p$
5. Compare the p-value to the significance level
    * Reject the null if $p < \alpha$
    * Otherwise, Fail to reject the null
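
The steps above can be sketched end to end as follows. This is a minimal illustration only: the sample is randomly generated (it is not one of the lesson's datasets), and the hypothesized mean of 50 is an arbitrary assumption.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical sample: 200 random draws, NOT taken from the lesson's datasets
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=52.0, scale=10.0, size=200)

# Step 1: H0: mu <= 50, HA: mu > 50 -> one-sample, right-sided t-test
hypothesized_mean = 50.0

# Step 2: set the significance level
alpha = 0.05

# Steps 3 and 4: compute the test statistic and the p-value
t_stat, p_val = ttest_1samp(sample, hypothesized_mean, alternative='greater')

# Step 5: compare the p-value to the significance level
decision = 'Reject the null' if p_val < alpha else 'Fail to reject the null'
print(f't = {t_stat:.3f}, p = {p_val:.4f} -> {decision}')
```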


## T-Test

When testing for means, we can calculate p-values using a T-Test. This test assumes that our data samples are **independent** (they are randomly selected, and the value of any data point does not depend on other observations) and approximately **normal** (they follow the normal distribution/bell curve). With large sample sizes, the normality assumption matters less because the sample mean is itself approximately normal (see also: central limit theorem).

In this test, we calculate a t-statistic from the means and standard deviations of our sample/s using the corresponding formula. We can then find the p-value by comparing the t-statistic against a table of critical values.

In the scipy package, the t-test formula and critical values are already implemented for us.


Sample Critical Values Table:
![T-Table](http://www.ttable.org/uploads/2/1/7/9/21795380/published/9754276.png?1517416376)
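
Instead of reading a printed table, the same lookup can be done with the t-distribution in `scipy.stats`. The sketch below uses made-up inputs (a t-statistic of 2.0 and 20 degrees of freedom) purely for illustration.

```python
from scipy.stats import t

# Hypothetical inputs, chosen only for illustration
t_stat = 2.0   # observed t-statistic
dof = 20       # degrees of freedom (N - 1 for a one-sample test)
alpha = 0.05

# Right-tailed critical value: the t value with area alpha to its right
t_crit = t.ppf(1 - alpha, df=dof)

# Right-tailed p-value: area to the right of the observed statistic
p_val = t.sf(t_stat, df=dof)

print(f'critical value = {t_crit:.3f}, p-value = {p_val:.4f}')
```

With 20 degrees of freedom and $\alpha = 0.05$, the one-tailed critical value is about 1.725, which matches the usual printed t-table. Comparing the statistic to the critical value and comparing the p-value to $\alpha$ always lead to the same decision.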





### Example 1: From our employee attrition dataset, test if the average age of our employees is greater than 36 years old

**One-Sample Mean**

When doing hypothesis testing, we consider the variables that we need to observe and match them to the corresponding test. In this example, we are looking at one variable (age) and comparing its mean to a fixed value (36 years old).

To compare the sample mean with the hypothesized population mean (i.e. to test if the average age of employees is greater than 36), we use the following equation to calculate the t-statistic. We can then find the p-value by comparing the t-statistic against a table of critical values.

<br>
$t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{N}}}$

$\bar{x}$ and $s$ are the mean and standard deviation of our sample, respectively. $\mu$ is the population mean stated in our null hypothesis, and $N$ is the number of observations in the sample.
<br>


scipy has the [ttest_1samp](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html) function that implements this calculation and determines the corresponding p-value.

``` python
t_stat, p_val = scipy.stats.ttest_1samp(a,
                                        popmean,
                                        axis=0,
                                        nan_policy='propagate',
                                        alternative='two-sided')
```

The first parameter `a` is the sample data (age) and `popmean` is the population mean under the null hypothesis (36 years old). The signature above shows the default `alternative='two-sided'`; for this example we pass `alternative='greater'` instead, so the p-value is the probability of seeing a t-statistic at least this large if the average age were 36.
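
To see what the `alternative` parameter changes, the sketch below runs the same test three ways on a small, made-up array of ages (hypothetical values, not taken from the attrition dataset).

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical ages, NOT taken from the attrition dataset
ages = np.array([38, 41, 35, 44, 39, 37, 42, 36, 40, 43])

# Default: two-sided test (mean differs from 36 in either direction)
_, p_two_sided = ttest_1samp(ages, 36)

# One-sided tests: mean greater than 36, and mean less than 36
t_stat, p_greater = ttest_1samp(ages, 36, alternative='greater')
_, p_less = ttest_1samp(ages, 36, alternative='less')

print('two-sided:', p_two_sided)
print('greater:  ', p_greater)
print('less:     ', p_less)
```

The t-statistic is the same in all three calls; only the tail area changes. Because the sample mean here is above 36, the `'greater'` p-value is half of the two-sided one, and the `'greater'` and `'less'` p-values add up to 1.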

Let's load our dataset

In [3]:
df = pd.read_csv('/content/drive/MyDrive/Refocus Module/Jupyter Notebook-20230501T012835Z-001/Jupyter Notebook/datasets/hr_employee_attrition.csv')
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DepartmentInCompany,DistanceHome,Education,Educ,EmployeeNumber,EnvironmentSatisfaction,...,PerformanceRating,RelationshipSatisfaction,StandardHours,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1.0,2,Life Sciences,1,2,...,3,1,80,8,0,1,6,4,0,5
1,-49,No,Travel_Frequently,279,Research & Development,8.0,1,Life Sciences,2,3,...,4,4,80,10,3,3,10,7,1,7
2,1,Yes,Travel_Rarely,1373,Research & Development,2.0,2,Other,4,4,...,3,2,80,7,3,3,0,0,0,0
3,2,No,Travel_Frequently,1392,Research & Development,3.0,4,Life Sciences,5,4,...,3,3,80,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2.0,1,Medical,7,1,...,3,4,80,6,3,3,2,2,2,2


In [4]:
df.Age.describe()

count    1470.000000
mean       36.706803
std         9.784563
min       -59.000000
25%        30.000000
50%        36.000000
75%        43.000000
max        60.000000
Name: Age, dtype: float64

We first define the null and alternative hypothesis. The null hypothesis will be the reverse of what we are testing:

$H_0$: $\mu_{age} \leq 36$ years old

$H_A$: $\mu_{age} > 36$ years old

We then set our alpha to 0.05

In [5]:
from scipy.stats import ttest_1samp

In [6]:
alpha = 0.05

t_stat, p_val = ttest_1samp(df['Age'], 36, alternative='greater')
print ('Test statistic: ', t_stat)
print ('p-value (one-sided): ', p_val)

Test statistic:  2.7695898660579283
p-value (one-sided):  0.0028416573468577617
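
As a sanity check, the t-statistic reported above can be reproduced by hand from the one-sample formula and the summary statistics printed by `df.Age.describe()`. The numbers below are copied from that output; the variable names are only illustrative.

```python
import math
from scipy.stats import t

# Summary statistics copied from the df.Age.describe() output
n = 1470
sample_mean = 36.706803
sample_std = 9.784563
pop_mean = 36  # value under the null hypothesis

# t = (x_bar - mu) / (s / sqrt(N))
standard_error = sample_std / math.sqrt(n)
t_manual = (sample_mean - pop_mean) / standard_error

# Right-tailed p-value with N - 1 degrees of freedom
p_manual = t.sf(t_manual, df=n - 1)

print('Manual test statistic:', t_manual)
print('Manual p-value (one-sided):', p_manual)
```

Up to rounding in the copied summary statistics, both values agree with the `ttest_1samp` output.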


In [7]:
if p_val < alpha:
    print (f'Reject the Null Hypothesis (alpha = {alpha}).')
else:
    print (f'Fail to Reject the Null Hypothesis (alpha = {alpha}).')

Reject the Null Hypothesis (alpha = 0.05).


Conclusion: At $\alpha = 0.05$, we reject the null hypothesis. The data support the claim that the average age of our employees is greater than 36 years old.

### Example 2: Are the remaining employees older than those who resigned?

**Two-Sample Unpaired**

When comparing 2 variables from different samples (e.g. resigned/not resigned, male/female, etc.), we use the Two-Sample Unpaired T-Test.

Here we use Welch's t-test, which compares the means ($\bar{x_i}$) and standard deviations ($s_i$) of both samples. It does not assume that the two population variances are equal, and the two sample sizes ($N_i$) may or may not be equal.
<br>
$t = \frac{(\bar{x_1} - \bar{x_2})}{S}$
<br>

$S = \sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}$


scipy has the [ttest_ind](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) function to calculate the p-value.

``` python
t_stat, p_val = scipy.stats.ttest_ind(a, b, axis=0,
                                      equal_var=True,
                                      nan_policy='propagate',
                                      permutations=None,
                                      random_state=None,
                                      alternative='two-sided',
                                      trim=0)
```

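In this function, we pass the 2 arrays that we are testing and the `alternative` parameter. The signature above shows the default `equal_var=True` (Student's t-test, which assumes equal variances); we pass `equal_var=False` to run Welch's t-test.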

We first define the null and alternative hypothesis:

$H_0$: $\mu_{age\_resigned} \geq \mu_{age\_not\_resigned}$

$H_A$: $\mu_{age\_resigned} < \mu_{age\_not\_resigned}$

We set our alpha to 0.05

Then, we load our samples from the dataset

In [8]:
df.groupby('Attrition').Age.describe()

Attrition,count,mean,std,min,25%,50%,75%,max
No,1234.0,37.324959,9.637328,-59.0,31.0,36.0,43.0,60.0
Yes,236.0,33.474576,9.932292,1.0,27.0,31.5,39.0,58.0


In [9]:
age_resigned = df.loc[df.Attrition=='Yes', 'Age']
age_not_resigned = df.loc[df.Attrition=='No', 'Age']

Note that the Two-Sample Unpaired T-Test will work even if the counts are different (1234 retained employees vs. 236 resigned employees)

In [10]:
from scipy.stats import ttest_ind

In [11]:
alpha = 0.05

t_stat, p_val = ttest_ind(age_resigned, age_not_resigned, equal_var=False, alternative='less')
print ('Test statistic: ', t_stat)
print ('p-value (one-sided): ', p_val)

Test statistic:  -5.482250970756793
p-value (one-sided):  4.219565905678923e-08
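
The Welch statistic above can also be reproduced from the group summaries printed by `df.groupby('Attrition').Age.describe()`. The sketch below copies those numbers, applies the formula for $S$ by hand, and cross-checks the result with `scipy.stats.ttest_ind_from_stats`, which works from summary statistics instead of raw arrays. The variable names are only illustrative.

```python
import math
from scipy.stats import ttest_ind_from_stats

# Group summaries copied from the groupby('Attrition') describe() output
mean_yes, std_yes, n_yes = 33.474576, 9.932292, 236    # resigned
mean_no, std_no, n_no = 37.324959, 9.637328, 1234      # not resigned

# Welch statistic: difference in means over the combined standard error S
s_combined = math.sqrt(std_yes**2 / n_yes + std_no**2 / n_no)
t_manual = (mean_yes - mean_no) / s_combined

# Cross-check with SciPy, which also supplies the one-sided p-value
t_scipy, p_scipy = ttest_ind_from_stats(mean_yes, std_yes, n_yes,
                                        mean_no, std_no, n_no,
                                        equal_var=False, alternative='less')

print('Manual test statistic:', t_manual)
print('SciPy test statistic: ', t_scipy)
print('p-value (one-sided):  ', p_scipy)
```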


In [12]:
if p_val < alpha:
    print (f'Reject the Null Hypothesis (alpha = {alpha}).')
else:
    print (f'Fail to Reject the Null Hypothesis (alpha = {alpha}).')

Reject the Null Hypothesis (alpha = 0.05).


Conclusion: At $\alpha = 0.05$, we reject the null hypothesis. The data support the claim that the average age of employees who resigned is lower than the average age of those who stayed.

### Example 3: From an experiment with varying diet programs, test if the average weight after 6 weeks is less than the pre-diet weight

In this example, let's use the `diet dataset` to identify the effectiveness of diet programs on weight loss.

The data shows a sample of people who subscribed to 3 different diet types. Their weights were taken before starting the diet (`pre.weight`) and 6 weeks after starting their diets (`weight6weeks`).

**Two-Sample Paired**

When we have 2 measurements from the same sample (e.g. student diagnostic and post-lecture exam scores, pre-diet vs. post-diet weights, etc.), we can use the Two-Sample Paired T-Test.

The test statistic can then be calculated from the equation below:

$t = \frac{\bar{x_{12}} - \delta_{12}}{\frac{s_d}{\sqrt{N}}}$

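In this case, $\bar{x_{12}}$ is the mean of the paired differences (equivalently, $\bar{x_1} - \bar{x_2}$), $\delta_{12}$ is the mean difference assumed under the null hypothesis (usually 0), $s_d$ is the standard deviation of the paired differences, and $N$ is the number of pairs.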
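
One way to read this formula is that a paired t-test is a one-sample t-test on the per-subject differences. The sketch below checks this on a small set of made-up before/after weights (hypothetical values, not from the diet dataset), assuming a hypothesized mean difference of 0.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Hypothetical before/after weights for the same six people
before = np.array([72.0, 80.5, 65.2, 90.1, 77.3, 68.4])
after = np.array([70.1, 78.0, 65.0, 86.5, 75.9, 67.0])

# Paired differences, one per person
differences = before - after
n = len(differences)

# Paired formula with a hypothesized mean difference of 0
t_manual = differences.mean() / (differences.std(ddof=1) / np.sqrt(n))

# The paired test and a one-sample test on the differences should agree
t_paired, p_paired = ttest_rel(before, after, alternative='greater')
t_onesample, p_onesample = ttest_1samp(differences, 0, alternative='greater')

print('Manual:    ', t_manual)
print('ttest_rel: ', t_paired, p_paired)
print('ttest_1samp on differences:', t_onesample, p_onesample)
```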

<br>

scipy has the [ttest_rel](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html) function that implements this equation and calculates the p-value.

``` python
t_stat, p_val = scipy.stats.ttest_rel(a, b, axis=0,
                                      nan_policy='propagate',
                                      alternative='two-sided')
```

In this case, both arrays must come from the same subjects (e.g. the same person measured twice), be in the same order, and have equal counts.

First, we load the dataset:

In [14]:
df_diets = pd.read_csv('/content/drive/MyDrive/Refocus Module/Jupyter Notebook-20230501T012835Z-001/Jupyter Notebook/datasets/Diet.csv')
df_diets.head()

Unnamed: 0,Person,gender,Age,Height,pre.weight,Diet,weight6weeks
0,25,,41,171,60,2,60.0
1,26,,32,174,103,2,103.0
2,1,0.0,22,159,58,1,54.2
3,2,0.0,46,192,60,1,54.0
4,3,0.0,55,170,64,1,63.3


Since we wish to test if the diets were effective for weight loss after 6 weeks, we define the null and alternative hypotheses as follows:

$H_0$: $\mu_{pre.weight} \leq \mu_{weight6weeks}$

$H_A$: $\mu_{pre.weight} > \mu_{weight6weeks}$

We set our alpha to 0.05

In [15]:
df_diets[['pre.weight', 'weight6weeks']].describe()

Unnamed: 0,pre.weight,weight6weeks
count,78.0,78.0
mean,72.525641,68.680769
std,8.723344,8.924504
min,58.0,53.0
25%,66.0,61.85
50%,72.0,68.95
75%,78.0,73.825
max,103.0,103.0


In [16]:
from scipy.stats import ttest_rel

In [17]:
alpha = 0.05

t_stat, p_val = ttest_rel(df_diets['pre.weight'], df_diets['weight6weeks'], alternative='greater')
print ('Test statistic: ', t_stat)
print ('p-value (one-sided): ', p_val)

Test statistic:  13.308753851748712
p-value (one-sided):  5.861180231207614e-22


In [18]:
if p_val < alpha:
    print (f'Reject the Null Hypothesis (alpha = {alpha}).')
else:
    print (f'Fail to Reject the Null Hypothesis (alpha = {alpha}).')

Reject the Null Hypothesis (alpha = 0.05).


Conclusion: The average weight after 6 weeks is lower than the average weight before the diet.

| Test                | Variables Compared               | Scipy Function          | Assumptions |
|---------------------|----------------------------------|-------------------------|-------------|
| One-Sample Mean     | 1 variable vs. fixed value       | scipy.stats.ttest_1samp | Observations are independent and approximately normal; 1 sample compared to a hypothesized population mean |
| Two-Sample Unpaired | 2 variables (different subjects) | scipy.stats.ttest_ind   | Observations are independent and approximately normal; samples taken from different populations |
| Two-Sample Paired   | 2 variables (same subject)       | scipy.stats.ttest_rel   | Paired differences are independent and approximately normal; samples can be directly paired together |