# Chapter 5: Hypothesis Testing in Python

## Hypothesis Testing

Hypothesis testing is a procedure for deciding whether data support an assumption about a population. We'll start with the T-test, a way to tell whether group means differ, and then cover the Mann-Whitney U test, ANOVA, and the Chi-Square test.

## What is Hypothesis Testing?

Hypothesis testing starts from a claim about a situation, for instance, "students at School X study for an average of 2 hours every day." We then gather data to see whether it is consistent with that claim.
- Null Hypothesis (H0): The default statement we put to the test, usually one of "no difference" or "no effect." For example, "students at School X study for an average of 2 hours every day."
- Alternative Hypothesis (HA): The statement we accept when the data give enough evidence against H0. For example, "students at School X do not study for an average of 2 hours every day."

It's like a courtroom trial: the null hypothesis is presumed true, like a defendant presumed innocent, and we reject it in favor of the alternative hypothesis only when the data provide strong enough evidence against it.


## T-tests

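A T-test checks whether a difference in means (between a sample and a reference value, or between two groups) is larger than chance alone would explain. There are three common variants: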

- One-sample T-test: "Does this coffee look like it came from a pot that averages 70 degrees?"
- Two-sample T-test: "Are men's and women's average weights different?"
- Paired-sample T-test: "Did people's average stress levels change after using a meditation app for a month?"

A T-test gives us two values: the t-statistic and the p-value. The larger the absolute value of the t-statistic, the larger the difference between the means relative to the variability in the data. The p-value is the probability of observing a result at least as extreme as ours if the null hypothesis were true; it is not the probability that the null hypothesis is true. If the p-value is less than a chosen significance level (conventionally 0.05), we call the difference statistically significant and reject the null hypothesis.

### For a One Sample T-Test

- The null hypothesis is that the mean age of users (provided as the ages array) equals 30.
- The alternative hypothesis is that the mean age is not 30.

In [1]:
import numpy as np
from scipy import stats

# One Sample T-Test
ages = np.array([25, 33, 26, 25, 27, 27, 27, 29, 30, 31, 33])  # mean = 28.45
t_statistic, p_value = stats.ttest_1samp(ages, 30)
print("t-statistic:", t_statistic)  # -1.74
print("p-value:", p_value)  # 0.1124 so we don't have enough statistical evidence to claim that the mean age of users is different from 30


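ages = np.random.normal(loc=33, scale=5, size=90)  # random sample from a population with mean 33
t_statistic, p_value = stats.ttest_1samp(ages, 30)
print("t-statistic:", t_statistic)  # varies from run to run because no random seed is set; 3.862 in the output below
print("p-value:", p_value)  # 0.0002 in the output below, so we reject the null hypothesis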

t-statistic: -1.7405028310295299
p-value: 0.11239328363861013
t-statistic: 3.86215700410574
p-value: 0.00021283871266502
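
The first result above can be reproduced by hand, which shows where the two numbers come from. The sketch below assumes the standard one-sample formula, t = (sample mean - hypothesized mean) / (sample standard deviation / sqrt(n)), with n - 1 degrees of freedom; the variable names are illustrative and not part of the original cell.

```python
import numpy as np
from scipy import stats

ages = np.array([25, 33, 26, 25, 27, 27, 27, 29, 30, 31, 33])
hypothesized_mean = 30

n = ages.size
sample_mean = ages.mean()
sample_std = ages.std(ddof=1)             # ddof=1 gives the sample standard deviation
standard_error = sample_std / np.sqrt(n)  # estimated spread of the sample mean

t_manual = (sample_mean - hypothesized_mean) / standard_error
# Two-sided p-value: chance of a |t| at least this large if H0 were true
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy's built-in test
t_scipy, p_scipy = stats.ttest_1samp(ages, hypothesized_mean)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

Both lines should agree with the first t-statistic and p-value printed above.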


### For a Two Sample T-Test

Imagine you want to test whether two teams in your office, management and developers, spend the same number of hours in meetings. After collecting data from each team, you can use a two-sample T-test to find out.

- The null hypothesis is that the mean meeting hours of the management team equal the mean meeting hours of the developer team.
- The alternative hypothesis is that the mean meeting hours of the two teams are different.


In [2]:
# Meeting hours for the management and developer teams (two independent groups)
management_hours = np.array([3, 2, 3, 3, 3, 2, 3, 3])
developer_hours = np.array([2, 2, 2, 2, 3, 2, 2.5, 2])
print("Management hours mean:", management_hours.mean())
print("Developer hours mean:", developer_hours.mean())

t_statistic, p_value = stats.ttest_ind(management_hours, developer_hours)
print("t-statistic:", t_statistic)
print("p-value:", p_value)

significance_level = 0.05
if p_value < significance_level:
    print("We reject the null hypothesis. There is a significant difference in meeting hours.")
else:
    print("We fail to reject the null hypothesis. There is no significant difference in meeting hours.")

Management hours mean: 2.75
Developer hours mean: 2.1875
t-statistic: 2.6790325100441423
p-value: 0.017979525890164865
We reject the null hypothesis. There is a significant difference in meeting hours.
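
The paired-sample T-test listed earlier is not demonstrated in this chapter. A minimal sketch follows, assuming hypothetical stress scores for the same eight people before and after a month with a meditation app (the numbers are made up purely for illustration). `stats.ttest_rel` tests whether the mean of the per-person differences is zero, which is the same as running a one-sample T-test on those differences.

```python
import numpy as np
from scipy import stats

# Hypothetical stress scores (illustrative values, not real measurements).
# Position i in both arrays refers to the same person.
stress_before = np.array([7, 6, 8, 5, 9, 7, 6, 8])
stress_after = np.array([6, 5, 6, 5, 7, 6, 6, 7])

# Paired-sample T-test on the two related samples
t_paired, p_paired = stats.ttest_rel(stress_before, stress_after)
print("t-statistic:", t_paired)
print("p-value:", p_paired)

# Equivalent formulation: one-sample T-test of the differences against 0
differences = stress_before - stress_after
t_diff, p_diff = stats.ttest_1samp(differences, 0)
print("same result from the differences:", t_diff, p_diff)
```

Use the paired version only when each value in one group has a natural partner in the other; for unrelated groups, `stats.ttest_ind` as in the cell above is the appropriate call.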


## Mann-Whitney U test

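The Mann-Whitney U test is a non-parametric alternative to the two-sample T-test. It compares two independent samples by rank and helps determine whether they differ significantly when the data doesn't meet the normality assumption for T-tests.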

In [3]:
import numpy as np
from scipy.stats import mannwhitneyu

# Data on time spent (in minutes) on the website by users
time_A = np.array([31, 22, 39, 27, 35, 28, 34, 26, 23, 33])
time_B = np.array([26, 25, 30, 28, 29, 28, 27, 30, 27, 28])

# Perform the Mann-Whitney U test
U, p = mannwhitneyu(time_A, time_B)

# Print out the results
print(f'U-value: {U}')  # 60
print(f'p-value: {p}')  # 0.47 thus we fail to reject the null hypothesis; there is no evidence of a significant difference between the two groups

U-value: 60.0
p-value: 0.4699923731723571
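
To see where the U-value comes from, the statistic can be rebuilt from ranks. The sketch below assumes the usual definition, U = (rank sum of the first sample) - n(n + 1)/2, and a recent SciPy release in which `mannwhitneyu` reports the statistic for the first sample; older releases may report the statistic of the other sample instead.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

time_A = np.array([31, 22, 39, 27, 35, 28, 34, 26, 23, 33])
time_B = np.array([26, 25, 30, 28, 29, 28, 27, 30, 27, 28])

# Rank all observations together; tied values share their average rank
combined = np.concatenate([time_A, time_B])
ranks = rankdata(combined)

n_A = time_A.size
rank_sum_A = ranks[:n_A].sum()  # ranks belonging to time_A

# U for sample A: its rank sum minus the smallest rank sum it could have
U_manual = rank_sum_A - n_A * (n_A + 1) / 2

U_scipy, _ = mannwhitneyu(time_A, time_B)
print("U from ranks:", U_manual)
print("U from SciPy:", U_scipy)
```

Because only the ordering of the values matters, the test is insensitive to outliers and to the exact shape of the distribution.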


## ANOVA

Analysis of Variance or ANOVA is a way to determine if there are significant differences between the **means (or averages) of three or more groups**.

ANOVA assumes three things:
- Normality: The data from each group looks like a normal distribution.
- Homogeneity of Variance: Each group has the same spread or variance.
- Independence: Each data point doesn't depend on the others.
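
The first two assumptions can be checked roughly in code. One common approach, offered here as a sketch and not the only option, is a Shapiro-Wilk test per group for normality and Levene's test across groups for equal variances. The data are the apple weights used in the One-way ANOVA example of this chapter; with only five values per group these tests have little power, so treat the result as a sanity check. Independence depends on how the data were collected and cannot be tested this way.

```python
from scipy import stats

# Same apple weights as in the One-way ANOVA example of this chapter
groups = {
    "Apple1": [162.5, 165.0, 167.5, 160.0, 158.5],
    "Apple2": [175.0, 177.5, 172.5, 170.0, 160.5],
    "Apple3": [182.5, 185.0, 180.0, 177.5, 165.5],
}

# Normality: Shapiro-Wilk per group (H0: the group is normally distributed)
normality_p_values = {}
for name, weights in groups.items():
    _, p_normal = stats.shapiro(weights)
    normality_p_values[name] = p_normal
    print(f"{name}: Shapiro-Wilk p-value = {p_normal:.3f}")

# Homogeneity of variance: Levene's test (H0: all groups have equal variance)
_, p_levene = stats.levene(*groups.values())
print(f"Levene p-value = {p_levene:.3f}")
```

A small p-value in either test is a warning that the corresponding assumption may not hold.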

### One-way ANOVA
Think of One-way ANOVA like a game where you're comparing the average scores (means) of several teams (groups). The ultimate goal is to figure out if there is at least one team scoring differently than the others.

The One-way ANOVA test outputs an F-statistic and a p-value. A simple way to think about the F-statistic is as a signal-to-noise ratio:

- Signal: How much the group means differ from each other.
- Noise: How much the group members differ among themselves.

If the teams' average scores are all similar, the signal is about the same size as the noise, yielding an F-statistic close to 1.0 or below. But if at least one team's average score is substantially different from the others, the signal grows relative to the noise, resulting in an F-statistic well above 1.0.


In [4]:
import pandas as pd

# Sample weights for 3 different apple types
data = pd.DataFrame({
    'apple_type': ['Apple1']*5 + ['Apple2']*5 + ['Apple3']*5,
    'weight': [162.5, 165.0, 167.5, 160.0, 158.5, 175.0, 177.5, 172.5, 170.0, 160.5, 182.5, 185.0, 180.0, 177.5, 165.5]
})

# Select weights for each apple type
apple1_weights = data['weight'][data['apple_type'] == 'Apple1']
apple2_weights = data['weight'][data['apple_type'] == 'Apple2']
apple3_weights = data['weight'][data['apple_type'] == 'Apple3']

from scipy import stats
# Perform One-way ANOVA
f_value, p_value = stats.f_oneway(apple1_weights, 
                                  apple2_weights, 
                                  apple3_weights)

# Print the F-value and P-value
print("F-value:", f_value)  # 7.845
print("P-value:", p_value)  # 0.0066, below 0.05, so we reject the idea that all apple types have the same average weight

F-value: 7.845172641301955
P-value: 0.006623953105761937
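
The signal-to-noise reading of the F-statistic can be made concrete by computing it by hand for the same apple weights. This is a sketch under the standard one-way ANOVA definitions: signal is the between-group mean square and noise is the within-group mean square. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([162.5, 165.0, 167.5, 160.0, 158.5]),  # Apple1
    np.array([175.0, 177.5, 172.5, 170.0, 160.5]),  # Apple2
    np.array([182.5, 185.0, 180.0, 177.5, 165.5]),  # Apple3
]

all_weights = np.concatenate(groups)
grand_mean = all_weights.mean()
k = len(groups)             # number of groups
n_total = all_weights.size  # total number of observations

# Signal: how far each group mean sits from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Noise: how far individual values sit from their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)  # upper tail of the F distribution
print("F-value:", f_manual)
print("P-value:", p_manual)
```

The result should match the `stats.f_oneway` output above.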


## Chi-Square

The Chi-Square test is a handy tool for assessing whether there are **significant differences between observed and expected frequencies in one or more categories**.

The Chi-Square Test assumes two things:
- Randomness: The data was randomly sampled.
- Adequacy: Each category has an expected frequency of at least five, which the chi-square approximation needs in order to be reliable.


In [5]:
import pandas as pd

# Observations
data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Yellow', 'Purple'],
    'Observed': [30, 20, 15, 10, 25],
    'Expected': [20, 20, 20, 20, 20]
})

# Prepare observed and expected frequencies
observed_frequencies = data['Observed']
expected_frequencies = data['Expected']

from scipy import stats

# Perform Chi-Square Test
chi_square_stat, p_value = stats.chisquare(observed_frequencies,
                                           expected_frequencies)

# Print the chi-square statistic and P-value
print("Chi-Square Statistic:", chi_square_stat)  # 12.5
print("P-value:", p_value)  # 0.014 suggesting that our observed color distribution is statistically different from what we expected

Chi-Square Statistic: 12.5
P-value: 0.013995792487650894
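
The statistic above can be recomputed directly from its definition, the sum over categories of (observed - expected)^2 / expected. The sketch below assumes the goodness-of-fit setting used in the cell, with degrees of freedom equal to the number of categories minus one.

```python
import numpy as np
from scipy import stats

observed = np.array([30, 20, 15, 10, 25])
expected = np.array([20, 20, 20, 20, 20])

# Each category contributes its squared deviation, scaled by the expected count
contributions = (observed - expected) ** 2 / expected
chi_square_manual = contributions.sum()

degrees_of_freedom = observed.size - 1
# p-value: upper tail of the chi-square distribution
p_manual = stats.chi2.sf(chi_square_manual, degrees_of_freedom)

print("Contribution per category:", contributions)
print("Chi-Square Statistic:", chi_square_manual)
print("P-value:", p_manual)
```

Looking at the per-category contributions shows which colors drive the result: categories whose observed count is far from the expected 20 contribute the most.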
