# ANOVA (Analysis of Variance)
#### ANOVA is a statistical test that compares the means of three or more groups to determine whether at least one of them differs significantly from the others. It does so by asking whether the variation between the group means is larger than would be expected from the random variation within the groups.

## One-Way ANOVA:
##### One-Way ANOVA is a specific type of ANOVA that tests whether there is a significant difference in the means of three or more independent groups based on a single factor or independent variable.
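
For reference, the test statistic behind this comparison can be written as a ratio of two mean squares. The notation here is an assumption of this note, not taken from the examples: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, group means $\bar{x}_j$ and grand mean $\bar{x}$.

$$
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{\sum_{j=1}^{k} n_j \left(\bar{x}_j - \bar{x}\right)^2 / (k - 1)}
         {\sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2 / (N - k)}
$$

A large $F$ means the group means are spread out more than the within-group noise would explain, which is what leads to rejecting H₀.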

## 1. Testing if there is a significant difference in the mean heights of plants across different fertilizer types.
- Null Hypothesis (H₀): There is no significant difference in the mean height of plants across different fertilizer types (μ₁ = μ₂ = μ₃).
- Alternative Hypothesis (H₁): At least one fertilizer type leads to a different mean height compared to the others.

In [2]:
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Sample data for 3 fertilizer types
fertilizer_A = [15, 16, 14, 13, 17]
fertilizer_B = [20, 19, 21, 22, 18]
fertilizer_C = [30, 28, 32, 31, 29]

# Perform One-Way ANOVA
F_statistic, p_value = stats.f_oneway(fertilizer_A, fertilizer_B, fertilizer_C)
print("F-statistic:", F_statistic)
print("p-value:", p_value)
print('*'*50)
if p_value < 0.05:
    print("Reject H₀: There is a significant difference in the mean height of plants.")
    
    # Perform Tukey's Post-hoc Test
    data = np.array([fertilizer_A, fertilizer_B, fertilizer_C]).flatten()
    groups = ['A']*len(fertilizer_A) + ['B']*len(fertilizer_B) + ['C']*len(fertilizer_C)
    tukey = pairwise_tukeyhsd(data, groups, alpha=0.05)
    print(tukey)
else:
    print("Fail to reject H₀: No significant difference.")


F-statistic: 116.66666666666622
p-value: 1.3694561117818263e-08
**************************************************
Reject H₀: There is a significant difference in the mean height of plants.
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     A      B      5.0 0.0008  2.3321  7.6679   True
     A      C     15.0    0.0 12.3321 17.6679   True
     B      C     10.0    0.0  7.3321 12.6679   True
----------------------------------------------------
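
The F-statistic reported above can be reproduced by hand, which makes it easier to see what `stats.f_oneway` measures. The following is a minimal sketch using only the standard library; the helper name `one_way_f` is hypothetical, and it returns only the statistic and degrees of freedom, not the p-value.

```python
from statistics import mean


def one_way_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


fertilizer_A = [15, 16, 14, 13, 17]
fertilizer_B = [20, 19, 21, 22, 18]
fertilizer_C = [30, 28, 32, 31, 29]

f_stat, df_b, df_w = one_way_f(fertilizer_A, fertilizer_B, fertilizer_C)
print(round(f_stat, 4), df_b, df_w)
```

For the fertilizer data the between-group sum of squares is about 583.33 and the within-group sum of squares is 30, so the ratio of mean squares agrees with the value printed by `stats.f_oneway`.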


## 2. Testing if the mean satisfaction scores differ across marketing campaigns.
- Null Hypothesis (H₀): The mean satisfaction scores are the same across all marketing campaigns.
- Alternative Hypothesis (H₁): At least one marketing campaign leads to a significantly different mean satisfaction score.

In [3]:
campaign_A = [75, 78, 80, 79, 74]
campaign_B = [68, 70, 72, 71, 69]
campaign_C = [82, 85, 87, 83, 80]

f_stats,p_value = stats.f_oneway(campaign_A,campaign_B,campaign_C)
print(f'f_stats: {f_stats}\np_value: {p_value}')

print('*'*50)

alpha = 0.05

if p_value < alpha:
    print('Reject the null hypothesis: at least one marketing campaign leads to a significantly different mean')

    data = np.array([campaign_A,campaign_B,campaign_C]).flatten()
    groups = ['a']*len(campaign_A) + ['b']*len(campaign_B) + ['c']*len(campaign_C)
    tukey = pairwise_tukeyhsd(data,groups,alpha=0.05)
    print(tukey)
else:
    print('Failed to reject the null hypothesis')

f_stats: 40.88484848484848
p_value: 4.392506381684241e-06
**************************************************
Reject the null hypothesis: at least one marketing campaign leads to a significantly different mean
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     a      b     -7.2 0.0011 -11.1571 -3.2429   True
     a      c      6.2 0.0034   2.2429 10.1571   True
     b      c     13.4    0.0   9.4429 17.3571   True
-----------------------------------------------------


## 3. Testing if the mean weight loss differs between three exercise routines.
- Null Hypothesis (H₀): The mean weight loss is the same across all exercise routines.
- Alternative Hypothesis (H₁): At least one exercise routine results in a significantly different mean weight loss.

In [9]:
routine_A = [2, 3, 2.5, 3.5, 2.8]
routine_B = [1, 1.5, 1.2, 1.3, 1.8]
routine_C = [4, 4.5, 4.3, 4.1, 4.6]

f_stats,p_value = stats.f_oneway(routine_A,routine_B,routine_C)
print(f'f_stats: {f_stats}\np_value: {p_value}')

print('*'*50)

alpha = 0.05

if p_value < alpha:
    print('Reject H₀: at least one exercise routine has a different mean weight loss')
    print('*'*50)
    # Tukey HSD

    data = np.array([routine_A,routine_B,routine_C]).flatten()
    groups = ['a']*len(routine_A) + ['b'] * len(routine_B) + ['c'] * len(routine_C)

    tukey = pairwise_tukeyhsd(data,groups,alpha = 0.05)
    print(tukey)
else:
    print('Failed to reject the null hypothesis')

f_stats: 68.8704883227176
p_value: 2.6487654813248643e-07
**************************************************
Reject H₀: at least one exercise routine has a different mean weight loss
**************************************************
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     a      b     -1.4 0.0003 -2.0686 -0.7314   True
     a      c     1.54 0.0001  0.8714  2.2086   True
     b      c     2.94    0.0  2.2714  3.6086   True
----------------------------------------------------


## 4. Testing if customer satisfaction differs across different age groups.
- Null Hypothesis (H₀): Customer satisfaction is the same across age groups.
- Alternative Hypothesis (H₁): At least one age group has a different mean customer satisfaction.

In [13]:
group_18_25 = [80, 85, 90, 87, 82]
group_26_35 = [75, 80, 78, 79, 77]
group_36_50 = [60, 65, 62, 64, 63]

f_stats,p_value = stats.f_oneway(group_18_25,group_26_35,group_36_50)
print(f'f_stats: {f_stats}\np_value: {p_value}')

print('*'*50)

alpha = 0.05

if p_value < alpha:
    print('Reject H0: At least one age group has a different mean customer satisfaction')
    print('*'*50)
    # tukey hsd
    data = np.array([group_18_25,group_26_35,group_36_50]).flatten()
    group = ['18-25']*len(group_18_25) + ['26-35']*len(group_26_35) + ['36-50']  * len(group_36_50)

    tukey = pairwise_tukeyhsd(data,group,alpha=0.05)
    print(tukey)
else:
    print('Failed to reject the null hypothesis')

f_stats: 82.03463203463215
p_value: 1.0022733490302527e-07
**************************************************
Reject H0: At least one age group has a different mean customer satisfaction
**************************************************
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
 18-25  26-35     -7.0 0.0047 -11.6821  -2.3179   True
 18-25  36-50    -22.0    0.0 -26.6821 -17.3179   True
 26-35  36-50    -15.0    0.0 -19.6821 -10.3179   True
------------------------------------------------------
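
Each example builds the `data` array and the group labels by hand, which only works if both are written in the same order. It also relies on `np.array([...]).flatten()`, which assumes every group has the same number of observations. A minimal sketch of a helper that keeps values and labels aligned follows; the name `stack_groups` is hypothetical and not part of SciPy or statsmodels.

```python
def stack_groups(samples):
    """Flatten {label: values} into two parallel lists (data, labels)."""
    data, labels = [], []
    for label, values in samples.items():
        data.extend(values)
        # One label per observation, so the two lists stay aligned
        labels.extend([label] * len(values))
    return data, labels


samples = {
    '18-25': [80, 85, 90, 87, 82],
    '26-35': [75, 80, 78, 79, 77],
    '36-50': [60, 65, 62, 64, 63],
}

data, labels = stack_groups(samples)
print(list(zip(labels, data))[:3])
```

The two lists should be usable wherever the examples pass `data` and `groups`, for instance `pairwise_tukeyhsd(data, labels, alpha=0.05)`, and the helper does not require the groups to be the same size.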


## 5. Testing if product ratings differ across categories.
- Null Hypothesis (H₀): Product ratings are the same across categories.
- Alternative Hypothesis (H₁): At least one product category has a different mean rating.

In [15]:
category_A_ratings = [4, 5, 3, 4, 5]
category_B_ratings = [2, 3, 2, 3, 1]
category_C_ratings = [5, 5, 4, 4, 5]

f_stats,p_value = stats.f_oneway(category_A_ratings,category_B_ratings,category_C_ratings)
print(f'f_stats: {f_stats}\np_value: {p_value}')

print('*'*50)

alpha = 0.05

if p_value < alpha:
    print('[Reject the H0 : At least one product category has a different mean rating]')

    # tukey hsd test
    data = np.array([category_A_ratings,category_B_ratings,category_C_ratings]).flatten()
    groups = ['cat a'] * len(category_A_ratings) + ['cat b'] * len(category_B_ratings) + ['cat c'] * len(category_C_ratings)

    tukey = pairwise_tukeyhsd(data, groups,alpha=0.05)

    print(tukey)
else:
    print('Failed to reject the null hypothesis')

f_stats: 14.588235294117638
p_value: 0.0006126222478125284
**************************************************
[Reject the H0 : At least one product category has a different mean rating]
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
 cat a  cat b     -2.0 0.0033 -3.2702 -0.7298   True
 cat a  cat c      0.4 0.6863 -0.8702  1.6702  False
 cat b  cat c      2.4 0.0008  1.1298  3.6702   True
----------------------------------------------------
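
This is the first table with a `False` row, so it is a good place to spell out how the columns relate. In these tables `meandiff` is the mean of `group2` minus the mean of `group1`, `lower` and `upper` bound the confidence interval for that difference, and `reject` is `True` when the interval excludes zero. The sketch below re-derives the `reject` column from the interval bounds copied from the table above; `excludes_zero` is a hypothetical helper name.

```python
# (group1, group2, meandiff, lower, upper), copied from the Tukey table above
rows = [
    ('cat a', 'cat b', -2.0, -3.2702, -0.7298),
    ('cat a', 'cat c', 0.4, -0.8702, 1.6702),
    ('cat b', 'cat c', 2.4, 1.1298, 3.6702),
]


def excludes_zero(lower, upper):
    """True when the confidence interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


decisions = [excludes_zero(lower, upper) for _, _, _, lower, upper in rows]

for (g1, g2, diff, lower, upper), reject in zip(rows, decisions):
    verdict = 'differ' if reject else 'no detectable difference'
    print(f'{g1} vs {g2}: meandiff={diff:+.1f}, CI=({lower}, {upper}) -> {verdict}')
```

Here categories A and C cannot be distinguished at the 5% level, while category B is rated lower than both.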


# Apply One-Way ANOVA and the post-hoc Tukey HSD test to the Titanic dataset

In [19]:
# loading module
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# loading titanic dataset

titanic = sns.load_dataset('titanic')
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


### columns used in the analysis (categorical and numeric):

- pclass: Passenger class (1, 2, 3)
- sex: Gender of the passenger (male, female)
- age: Age of the passenger
- fare: Ticket fare
- embarked: Embarkation point (C = Cherbourg, Q = Queenstown, S = Southampton)
- survived: Survival status (0 = No, 1 = Yes)

In [None]:
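# preprocessing: drop rows with missing values in the columns used below

# .copy() guards against pandas' SettingWithCopyWarning on the assignments that follow
titanic = titanic.dropna(subset=['age','fare','pclass','sex','embarked']).copy()

# convert object columns to the categorical dtype
titanic['sex'] = titanic['sex'].astype('category')
titanic['embarked'] = titanic['embarked'].astype('category')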

### question 1: Testing if the mean fare differs by passenger class
- Null Hypothesis (H₀): The mean fare is the same across all passenger classes (pclass).
- Alternative Hypothesis (H₁): The mean fare differs across passenger classes.

In [33]:
# group fares by passenger class
pclass_1 = titanic[titanic['pclass']==1]['fare']
pclass_2 = titanic[titanic['pclass']==2]['fare']
pclass_3 = titanic[titanic['pclass']==3]['fare']

f_stats,p_value = stats.f_oneway(pclass_1,pclass_2,pclass_3)

print(f'f_stats: {f_stats}\np_value: {p_value}')
print('*'*50)

# alpha value 
alpha = 0.05

if p_value < alpha:
    print('Reject the null hypothesis: the mean fare differs across passenger classes')

    groups = ['1'] * len(pclass_1) + ['2'] * len(pclass_2) + ['3'] * len(pclass_3)
    data = np.concatenate([pclass_1,pclass_2,pclass_3])

    # tukey hsd test
    tukey = pairwise_tukeyhsd(data,groups,alpha=alpha)
    print(tukey)
else:
    print('Failed to reject the null hypothesis')

f_stats: 199.51520026428128
p_value: 1.821885989198717e-69
**************************************************
Reject the null hypothesis: the mean fare differs across passenger classes
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
     1      2 -66.5766    0.0 -77.1241  -56.029   True
     1      3 -74.8187    0.0  -83.866 -65.7714   True
     2      3  -8.2421 0.0913 -17.4769   0.9927  False
------------------------------------------------------


### question 2: Testing if the mean fare differs by embarkation point (Embarked)
- Null Hypothesis (H₀): The mean fare is the same across all embarkation points (C, Q, S).
- Alternative Hypothesis (H₁): The mean fare differs across embarkation points.

In [None]:
embarked_c = titanic[titanic['embarked']=='C']['fare']
embarked_q = titanic[titanic['embarked']=='Q']['fare']
embarked_s = titanic[titanic['embarked']=='S']['fare']

f_stats,p_value = stats.f_oneway(embarked_c,embarked_q,embarked_s)

print(f'f_stats: {f_stats}\np_value: {p_value}')
print('*'*50)

# alpha value 
alpha = 0.05

if p_value < alpha:
    print('Reject the null hypothesis: the mean fare differs across embarkation points')

    # concatenate in the same order as the labels below (C, Q, S);
    # a mismatched order attaches the wrong label to each fare
    data = np.concatenate([embarked_c,embarked_q,embarked_s])
    group = ['embarked_c'] * len(embarked_c) + ['embarked_q'] * len(embarked_q) + ['embarked_s'] * len(embarked_s)

    # tukey hsd test
    tukey = pairwise_tukeyhsd(data,group,alpha=alpha)
    print(tukey)

else:
    print('Failed to reject the null hypothesis')


f_stats: 35.892190453113834
p_value: 1.4184349041133077e-15
**************************************************
Reject the null hypothesis: the mean fare differs across embarkation points
(Tukey HSD table not shown: the recorded table came from a run in which `data` was concatenated in S, C, Q order while the labels were in C, Q, S order, so its rows were mislabelled. Re-run the cell to regenerate it.)


### question 3: Testing if the mean age differs by gender
- Null Hypothesis (H₀): The mean age is the same for male and female passengers.
- Alternative Hypothesis (H₁): The mean age differs by gender.

In [42]:
male_age =  titanic[titanic['sex']=='male']['age']
female_age = titanic[titanic['sex']=='female']['age']

f_stats,p_value = stats.f_oneway(male_age,female_age)

print(f'f_stats: {f_stats}\np_value: {p_value}')
print('*'*50)

# alpha value 
alpha = 0.05

if p_value < alpha:
    print('Reject the null hypothesis: the mean age differs by gender')

    # np.concatenate is available in all NumPy versions (np.concat requires NumPy 2.0+)
    data = np.concatenate([male_age,female_age])
    group = ['male_age'] * len(male_age) + ['female_age'] * len(female_age) 

    # tukey hsd test
    tukey = pairwise_tukeyhsd(data,group,alpha=alpha)
    print(tukey)

else:
    print('Failed to reject the null hypothesis')

f_stats: 7.032925899599193
p_value: 0.008181084109248551
**************************************************
Reject the null hypothesis: the mean age differs by gender
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
  group1    group2  meandiff p-adj  lower  upper  reject
--------------------------------------------------------
female_age male_age   2.9815 0.0082 0.7742 5.1887   True
--------------------------------------------------------
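
With only two groups, one-way ANOVA is equivalent to an independent two-sample t-test with pooled variance: the F-statistic equals the square of the t-statistic. This is consistent with the single Tukey row above reporting the same p-value (0.0082) as the ANOVA. The sketch below checks the identity on two small samples; the ages are hypothetical values chosen for illustration, not rows from the Titanic dataset, and the helper names are assumptions of this example.

```python
from math import isclose, sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t-statistic assuming equal variances."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))


def one_way_f(*groups):
    """One-way ANOVA F-statistic (between / within mean squares)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (len(groups) - 1)) / (ss_within / (n_total - len(groups)))


# Hypothetical ages, for illustration only
ages_1 = [22.0, 35.0, 54.0, 28.0, 40.0]
ages_2 = [26.0, 38.0, 27.0, 14.0, 31.0]

t_stat = pooled_t(ages_1, ages_2)
f_stat = one_way_f(ages_1, ages_2)
print(isclose(t_stat ** 2, f_stat))
```

In practice this means a two-group comparison can be run directly as a t-test, and the post-hoc Tukey step adds no new information.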
