# ANOVA
ANOVA, short for analysis of variance, is a statistical test.

The idea behind ANOVA is to compare different groups of samples to determine whether there is a significant difference between the groups.

ANOVA extends the t-test and the z-test, which compare two groups, to comparisons of more than two groups.

The null hypothesis of ANOVA is that all group means are equal. The alternative hypothesis is that at least one group mean differs from the others.

ANOVA is an omnibus test, meaning it tests the data as a whole. In other words, it does not tell you which specific groups were significantly different from each other; it only tells you that at least two groups were different.

## Types of ANOVA
There are three main types of ANOVA:

1. One-way ANOVA
2. Two-way ANOVA
3. N-way ANOVA

**One-way ANOVA**

One-way ANOVA is used to compare the means of two or more groups defined by a single categorical independent variable (factor). The dependent variable is continuous.

For example, you could use a one-way ANOVA to compare the height of people living in different cities.

**Two-way ANOVA**

Two-way ANOVA is used to compare two or more groups of samples across two independent variables.

For example, you could use a two-way ANOVA to compare the height of people grouped by city and by age group.

**N-way ANOVA**

N-way ANOVA is used to compare two or more groups of samples across N independent variables.

## Assumptions of ANOVA

ANOVA has three main assumptions:

1. The samples are independent.
2. The data within each group (equivalently, the model residuals) are approximately normally distributed.
3. The variance of each group is equal.

If these assumptions are not met, you may not be able to trust the results of your ANOVA.

In [31]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

## 1. ONE-WAY ANOVA

In [32]:
# Sample data: Growth of plants with three types of fertilizers
fertilizer1 = [20, 22, 19, 24, 25]
fertilizer2 = [28, 30, 27, 26, 29]
fertilizer3 = [18, 20, 22, 19, 24]

# Perform the one-way ANOVA
f_stat, p_val = stats.f_oneway(fertilizer1, fertilizer2, fertilizer3)

print("F-statistic:", f_stat)
print("p-value:", p_val)

# Report the decision at the 0.05 significance level (f-strings interpolate p_val)
if p_val < 0.05:
    print(f"Reject null hypothesis: The means are not equal, as the p-value: {p_val} is less than 0.05")
else:
    print(f"Failed to reject null hypothesis: No significant difference between the means, as the p-value: {p_val} is not less than 0.05")

F-statistic: 15.662162162162158
p-value: 0.0004515404760997282
Reject null hypothesis: The means are not equal, as the p-value: 0.0004515404760997282 is less than 0.05
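
The F-statistic above can be reproduced by hand, which shows what "analysis of variance" means in practice: the variability between the group means is compared with the variability within the groups. The following is a minimal sketch using NumPy. The variable names (`ss_between`, `ss_within`, `f_manual`, and so on) are illustrative choices, not part of any library.

```python
import numpy as np

# Same fertilizer data as in the one-way example
groups = [
    np.array([20, 22, 19, 24, 25]),
    np.array([28, 30, 27, 26, 29]),
    np.array([18, 20, 22, 19, 24]),
]

k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares divide each sum of squares by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

# The F-statistic is the ratio of the two mean squares
f_manual = ms_between / ms_within
print("F-statistic (manual):", round(f_manual, 4))
```

The two sums of squares (about 154.53 and 59.2) are the same `sum_sq` values that the statsmodels ANOVA table reports for this data.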


## ONE-WAY ANOVA Using Statsmodels

In [33]:
# !pip install statsmodels

In [34]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [35]:
# Create a dataframe
df = pd.DataFrame({"fertilizer": ["fertilizer1"] * 5 + ["fertilizer2"] * 5 + ["fertilizer3"] * 5,
                   "growth": fertilizer1 + fertilizer2 + fertilizer3})
df

     fertilizer  growth
0   fertilizer1      20
1   fertilizer1      22
2   fertilizer1      19
3   fertilizer1      24
4   fertilizer1      25
5   fertilizer2      28
6   fertilizer2      30
7   fertilizer2      27
8   fertilizer2      26
9   fertilizer2      29
10  fertilizer3      18
11  fertilizer3      20
12  fertilizer3      22
13  fertilizer3      19
14  fertilizer3      24


In [36]:
# Fit the model (predict growth based on fertilizer)
model = ols("growth ~ fertilizer", data = df).fit() # ols stands for ordinary least squares

In [37]:
# Perform ANOVA and print the summary table
anova_table = sm.stats.anova_lm(model, typ = 2)
print(anova_table)

                sum_sq    df          F    PR(>F)
fertilizer  154.533333   2.0  15.662162  0.000452
Residual     59.200000  12.0        NaN       NaN


In [38]:
# Look up the fertilizer p-value by row label and report the decision at the 0.05 level
p_val = anova_table.loc["fertilizer", "PR(>F)"]
if p_val < 0.05:
    print(f"Reject null hypothesis: The means are not equal, as the p-value: {p_val:.6f} is less than 0.05")
else:
    print(f"Failed to reject null hypothesis: No significant difference between the means, as the p-value: {p_val:.6f} is not less than 0.05")

Reject null hypothesis: The means are not equal, as the p-value: 0.000452 is less than 0.05
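
Because ANOVA is an omnibus test, the rejection above does not say which fertilizers differ from each other. One common follow-up is a post-hoc pairwise comparison such as Tukey's HSD. The sketch below uses `pairwise_tukeyhsd` from statsmodels and rebuilds the dataframe so that it runs on its own. The names `df_tukey` and `tukey` are illustrative.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Same fertilizer data as in the one-way example, in long format
df_tukey = pd.DataFrame({
    "fertilizer": ["fertilizer1"] * 5 + ["fertilizer2"] * 5 + ["fertilizer3"] * 5,
    "growth": [20, 22, 19, 24, 25, 28, 30, 27, 26, 29, 18, 20, 22, 19, 24],
})

# Compare every pair of fertilizers while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=df_tukey["growth"],
                          groups=df_tukey["fertilizer"],
                          alpha=0.05)

# The summary lists one row per pair, with a reject column for each comparison
print(tukey.summary())
```

For this data, the comparison should flag fertilizer2 as different from both fertilizer1 and fertilizer3, while fertilizer1 and fertilizer3 are not distinguishable at the 0.05 level.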


## 2. TWO-WAY ANOVA

In [39]:
# Sample data
df = pd.DataFrame({
    "Growth": [20, 22, 19, 24, 25, 28, 30, 27, 26, 29, 18, 20, 22, 19, 24,
               21, 23, 20, 25, 26, 29, 31, 28, 27, 30, 19, 21, 23, 20, 25],
    "Fertilizer": ["F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2", 
                   "F3", "F3", "F3", "F3", "F3", "F1", "F1", "F1", "F1", "F1", 
                   "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3"],
    "Sunlight": ["High", "High", "High", "High", "High", "High", "High", "High", "High", "High", 
                 "High", "High", "High", "High", "High", "Low", "Low", "Low", "Low", "Low", 
                 "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low"]
})
df.sample(4)

    Growth Fertilizer Sunlight
1       22         F1     High
17      20         F1      Low
23      27         F2      Low
25      19         F3      Low


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Growth      30 non-null     int64 
 1   Fertilizer  30 non-null     object
 2   Sunlight    30 non-null     object
dtypes: int64(1), object(2)
memory usage: 852.0+ bytes


In [41]:
# Perform two-way ANOVA
model = ols('Growth ~ C(Fertilizer) + C(Sunlight) + C(Fertilizer):C(Sunlight)', data = df).fit()
anova_table = sm.stats.anova_lm(model, typ = 2)
print(anova_table)

                                 sum_sq    df             F        PR(>F)
C(Fertilizer)              3.090667e+02   2.0  3.132432e+01  2.038888e-07
C(Sunlight)                7.500000e+00   1.0  1.520270e+00  2.295198e-01
C(Fertilizer):C(Sunlight)  1.226021e-28   2.0  1.242589e-29  1.000000e+00
Residual                   1.184000e+02  24.0           NaN           NaN


In [42]:
# Test the main effect of Fertilizer (looked up by row label) at the 0.05 level
p_val = anova_table.loc["C(Fertilizer)", "PR(>F)"]
if p_val < 0.05:
    print(f"Reject null hypothesis: The fertilizer means are not equal, as the p-value: {p_val:.3g} is less than 0.05")
else:
    print(f"Failed to reject null hypothesis: No significant difference between the fertilizer means, as the p-value: {p_val:.3g} is not less than 0.05")

Reject null hypothesis: The fertilizer means are not equal, as the p-value: 2.04e-07 is less than 0.05


## Interpretation
- For One-Way ANOVA, if the p-value is less than 0.05, it suggests a significant difference in means among the groups.
- For Two-Way ANOVA, we look at the p-values for each factor and their interaction. A p-value less than 0.05 indicates a significant effect.
  
These examples should give you a good starting point for conducting ANOVA analyses in Python. Remember, the interpretation of your results should always take into account the context of your data and the specific question you are trying to answer.

## 3. N-WAY ANOVA (FACTORIAL ANOVA)
N-way ANOVA, also known as factorial ANOVA, is used when you have more than two independent variables. It allows you to analyze the effects of each factor on the dependent variable and the interaction effects between factors.

**Example: Three-Way ANOVA**

Suppose we have an experimental data set with three factors:

- Fertilizer Type (3 levels: F1, F2, F3)
- Sunlight Exposure (2 levels: High, Low)
- Watering Frequency (2 levels: Regular, Sparse)

We want to study the impact of these factors and their interactions on plant growth.

In [43]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
df = pd.DataFrame({
    "Growth": [20, 22, 19, 24, 25, 28, 30, 27, 26, 29, 18, 20, 22, 19, 24,
               21, 23, 20, 25, 26, 29, 31, 28, 27, 30, 19, 21, 23, 20, 25,
               20, 22, 21, 23, 24, 26, 28, 25, 27, 29, 17, 19, 21, 18, 20],
    "Fertilizer": ["F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2", 
                   "F3", "F3", "F3", "F3", "F3", "F1", "F1", "F1", "F1", "F1", 
                   "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3",
                   "F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2", 
                   "F3", "F3", "F3", "F3", "F3"],
    "Sunlight": ["High", "High", "High", "High", "High", "High", "High", "High", "High", "High", 
                 "High", "High", "High", "High", "High", "Low", "Low", "Low", "Low", "Low", 
                 "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low",
                 "High", "High", "High", "High", "High", "High", "High", "High", "High", "High", 
                 "High", "High", "High", "High", "High"],
    "Watering": ["Regular", "Regular", "Regular", "Regular", "Regular", 
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse", 
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Regular", "Regular", "Regular", "Regular", "Regular", 
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular"]
})

df.head()

Unnamed: 0,Growth,Fertilizer,Sunlight,Watering
0,20,F1,High,Regular
1,22,F1,High,Regular
2,19,F1,High,Regular
3,24,F1,High,Regular
4,25,F1,High,Regular


In [44]:
# Fit the model
model = ols('Growth ~ C(Fertilizer) + C(Sunlight) + C(Watering) + C(Fertilizer):C(Sunlight) + C(Fertilizer):C(Watering) + C(Sunlight):C(Watering) + C(Fertilizer):C(Sunlight):C(Watering)', data = df).fit()

In [45]:
# Perform three-way ANOVA
anova_results = sm.stats.anova_lm(model, typ = 2)
print(anova_results)

                                             sum_sq    df             F  \
C(Fertilizer)                          5.606667e+01   2.0  6.950413e+00   
C(Sunlight)                           -1.267188e-11   1.0 -3.141788e-12   
C(Watering)                           -1.726407e-11   1.0 -4.280348e-12   
C(Fertilizer):C(Sunlight)              5.813222e-15   2.0  7.206474e-16   
C(Fertilizer):C(Watering)             -3.159168e-14   2.0 -3.916325e-15   
C(Sunlight):C(Watering)                2.054444e+01   1.0  5.093664e+00   
C(Fertilizer):C(Sunlight):C(Watering)  1.088889e+00   2.0  1.349862e-01   
Residual                               1.573000e+02  39.0           NaN   

                                         PR(>F)  
C(Fertilizer)                          0.011967  
C(Sunlight)                            1.000000  
C(Watering)                            1.000000  
C(Fertilizer):C(Sunlight)              1.000000  
C(Fertilizer):C(Watering)              1.000000  
C(Sunlight):C(Watering) 

In [49]:
# Test the main effect of Fertilizer (looked up by row label) at the 0.05 level
p_val = anova_results.loc["C(Fertilizer)", "PR(>F)"]
if p_val < 0.05:
    print(f"Reject null hypothesis: The fertilizer means are not equal, as the p-value: {p_val:.6f} is less than 0.05")
else:
    print(f"Failed to reject null hypothesis: No significant difference between the fertilizer means, as the p-value: {p_val:.6f} is not less than 0.05")

Reject null hypothesis: The fertilizer means are not equal, as the p-value: 0.011967 is less than 0.05


## Interpretation
In the output, you'll see p-values for:

- The main effects of each factor (Fertilizer, Sunlight, Watering)
- The interaction effects between two factors (e.g., Fertilizer:Sunlight)
- The interaction effect among all three factors (Fertilizer:Sunlight:Watering)

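A p-value less than 0.05 typically suggests a statistically significant effect. However, interpreting ANOVA results can be complex, especially with interactions. In this sample data, Sunlight and Watering are perfectly confounded: every High row is Regular and every Low row is Sparse, so their separate effects cannot be distinguished. That is why several rows of the table show near-zero or slightly negative sums of squares with p-values of 1, and those rows should not be read as evidence that the factors have no effect. You should consider the practical significance and the context of your experiment alongside the statistical results.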
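
A factorial ANOVA can only separate the effects of two factors if their levels are actually crossed in the data. A cross-tabulation of the factor columns is a quick way to check this before fitting the model. The sketch below rebuilds only the Sunlight and Watering columns of the three-way sample data. The names `design`, `cell_counts`, and `empty_cells` are illustrative.

```python
import pandas as pd

# Sunlight and Watering columns of the three-way sample data, built compactly
design = pd.DataFrame({
    "Sunlight": ["High"] * 15 + ["Low"] * 15 + ["High"] * 15,
    "Watering": ["Regular"] * 15 + ["Sparse"] * 15 + ["Regular"] * 15,
})

# Count the observations in every Sunlight x Watering combination
cell_counts = pd.crosstab(design["Sunlight"], design["Watering"])
print(cell_counts)

# A zero means that combination was never observed
empty_cells = int((cell_counts == 0).to_numpy().sum())
print("Empty cells:", empty_cells)
```

Here the High/Sparse and Low/Regular cells are empty, so the Sunlight and Watering effects in this sample cannot be told apart.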

Remember, ANOVA makes certain assumptions (normality, homogeneity of variance, and independence), which should be tested before running the analysis.
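
As a rough illustration of such checks, the sketch below applies two tests from `scipy.stats` to the one-way fertilizer data: the Shapiro-Wilk test for normality of the residuals and Levene's test for equal variances. These are one possible choice among several, and independence usually has to be argued from the study design rather than tested. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Fertilizer data from the one-way example
groups = [
    [20, 22, 19, 24, 25],
    [28, 30, 27, 26, 29],
    [18, 20, 22, 19, 24],
]

# Normality: Shapiro-Wilk test on the residuals (each value minus its group mean)
residuals = np.concatenate([np.array(g) - np.mean(g) for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.4f}")
print(f"Levene p-value: {levene_p:.4f}")
```

In both tests, a p-value below 0.05 would suggest that the corresponding assumption is violated. With only five observations per group, these tests have little power, so plots of the residuals are a useful complement.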