
# **ANOVA**
ANOVA is a statistical test that stands for analysis of variance.

ANOVA was developed by statistician and evolutionary biologist Ronald Fisher. The idea behind ANOVA is to compare different groups of samples to determine whether there is a significant difference between the groups.

ANOVA extends the t-test and the z-test, which compare at most two groups, to comparisons of more than two groups.

**The null hypothesis** of ANOVA is that all group means are equal.

**The alternative hypothesis** is that at least one group mean differs from the others.

ANOVA is an omnibus test, meaning it tests the data as a whole. In other words, it does not tell you which specific groups were significantly different from each other; it only tells you that at least two groups were different.

# **Types of ANOVA**
There are three main types of ANOVA:

- One-way ANOVA
- Two-way ANOVA
- N-way ANOVA

## **One-way ANOVA**
One-way ANOVA is used to compare the means of two or more groups defined by a single categorical independent variable (factor). The outcome being compared is continuous.

For example, you could use a one-way ANOVA to compare the height of people living in different cities.

## **Two-way ANOVA**
Two-way ANOVA is used to compare group means across two categorical independent variables.

For example, you could use a two-way ANOVA to compare the height of people across two factors at once, such as city and age group.

## **N-way ANOVA**
N-way ANOVA is used to compare two or more groups of samples across N independent variables.

# **Assumptions of ANOVA**
ANOVA has three main assumptions:

- The samples are independent.
- The data in each group (equivalently, the model residuals) are approximately normally distributed.
- The variance of each group is equal.

If these assumptions are not met, you may not be able to trust the results of your ANOVA.
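
A minimal sketch of how the normality and equal-variance assumptions might be checked with SciPy, using the fertilizer measurements from the one-way example in this notebook. `scipy.stats.shapiro` tests each group for normality and `scipy.stats.levene` tests for equal variances. The variable names here are illustrative, and with only five observations per group these tests have little power, so treat the p-values as a rough guide.

```python
from scipy import stats

# Fertilizer growth measurements (same values as the one-way example in this notebook)
groups = {
    "fertilizer1": [20, 22, 19, 24, 25],
    "fertilizer2": [28, 30, 27, 26, 29],
    "fertilizer3": [18, 20, 22, 19, 24],
}

# Normality: Shapiro-Wilk test per group (null hypothesis: the data are normal)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# Equal variances: Levene's test across all groups (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene p = {levene_p:.3f}")
```

A small p-value in either test is evidence against the corresponding assumption.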

# **ANOVA: comparing the means of 3 or more groups by analysing variances**

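Even though the name says variance, ANOVA is actually used to compare means: it asks whether the variation between the group means is large relative to the variation within the groups.

ANOVA is typically used to compare the means of 3 or more groups.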
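
To make the link between variances and means concrete, here is a sketch that computes the F-statistic by hand with NumPy, using the fertilizer data from the one-way example in this notebook. The variable names are my own. The between-group and within-group sums of squares are divided by their degrees of freedom, and their ratio is the F-statistic.

```python
import numpy as np
from scipy import stats

# Fertilizer growth measurements (same values as the one-way example in this notebook)
groups = [
    np.array([20, 22, 19, 24, 25]),
    np.array([28, 30, 27, 26, 29]),
    np.array([18, 20, 22, 19, 24]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)  # upper tail of the F distribution

print("F-statistic:", f_manual)
print("p-value:", p_manual)
```

The result should agree with `stats.f_oneway` on the same data, up to floating-point rounding.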

**Why Not Use Multiple t-tests Instead?**

**Because:** Running many t-tests increases Type-I error (false positives). ANOVA keeps the overall error rate controlled.
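
A rough sketch of how quickly the error rate grows. It assumes the pairwise tests are independent, which is not exactly true for t-tests that share groups, so the numbers are an approximation. The function name `familywise_error_rate` is hypothetical.

```python
from math import comb


def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes the tests are independent, so the result is only an approximation.
    """
    n_tests = comb(n_groups, 2)  # every pair of groups gets its own t-test
    return n_tests, 1 - (1 - alpha) ** n_tests


for n_groups in (3, 4, 5):
    n_tests, fwer = familywise_error_rate(n_groups)
    print(f"{n_groups} groups -> {n_tests} t-tests, familywise error rate ~ {fwer:.3f}")
```

With three groups, the chance of at least one false positive is already about 14%, well above the nominal 5%.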

## **One-way ANOVA with SciPy**

In [9]:
import scipy.stats as stats

# Sample data: Growth of plants with three types of fertilizers
fertilizer1 = [20, 22, 19, 24, 25]
fertilizer2 = [28, 30, 27, 26, 29]
fertilizer3 = [18, 20, 22, 19, 24]

# Perform the one-way ANOVA
f_stat, p_val = stats.f_oneway(fertilizer1, fertilizer2, fertilizer3)

print("F-statistic:", f_stat)
print("p-value:", p_val)

# Print the conclusion depending on whether the p-value is below 0.05
if p_val < 0.05:
    print(f"Reject null hypothesis: The means are not equal, as the p-value: {p_val} is less than 0.05")
else:
    print(f"Fail to reject null hypothesis: no evidence that the means differ, as the p-value: {p_val} is not less than 0.05")

F-statistic: 15.662162162162158
p-value: 0.00045154047609972817
Reject null hypothesis: The means are not equal, as the p-value: 0.00045154047609972817 is less than 0.05


# **One-way ANOVA with statsmodels**

In [5]:
!pip install statsmodels --quiet

In [6]:
# One-way ANOVA using statsmodels
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [11]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create DataFrame properly
df = pd.DataFrame({
    "fertilizer": ["fertilizer1"] * 5 + ["fertilizer2"] * 5 + ["fertilizer3"] * 5,
    "growth": fertilizer1 + fertilizer2 + fertilizer3
})

print(df)


     fertilizer  growth
0   fertilizer1      20
1   fertilizer1      22
2   fertilizer1      19
3   fertilizer1      24
4   fertilizer1      25
5   fertilizer2      28
6   fertilizer2      30
7   fertilizer2      27
8   fertilizer2      26
9   fertilizer2      29
10  fertilizer3      18
11  fertilizer3      20
12  fertilizer3      22
13  fertilizer3      19
14  fertilizer3      24


In [12]:
df['fertilizer'].value_counts()

| fertilizer | count |
|---|---|
| fertilizer1 | 5 |
| fertilizer2 | 5 |
| fertilizer3 | 5 |


In [14]:

model = ols("growth ~ fertilizer", data=df).fit()  # fit the one-way ANOVA model

# With additional factor columns (not present in this df), more terms could be added:
# model = ols("growth ~ fertilizer + water", data=df).fit()  # two-way ANOVA (main effects only)
# model = ols("growth ~ fertilizer + water + sunlight", data=df).fit()  # three-way ANOVA (main effects only)

In [30]:
# Perform ANOVA and print the summary table
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table.head()

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| fertilizer | 154.533333 | 2.0 | 15.662162 | 0.000452 |
| Residual | 59.200000 | 12.0 | NaN | NaN |


Since the p-value (0.000452) is less than 0.05, there is a significant difference in mean growth between the fertilizers.

In [18]:
# .iloc[0] selects the first row (the fertilizer term) by position
if anova_table["PR(>F)"].iloc[0] > 0.05:
  print("No significant difference in mean growth across fertilizers - fail to reject the null hypothesis")
else:
  print("There is a difference - reject the null hypothesis")

There is a difference - reject the null hypothesis


# **Two-way ANOVA**

In [28]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
data = pd.DataFrame({
    "Growth": [20, 22, 19, 24, 25, 28, 30, 27, 26, 29, 18, 20, 22, 19, 24,
               21, 23, 20, 25, 26, 29, 31, 28, 27, 30, 19, 21, 23, 20, 25],
    "Fertilizer": ["F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2",
                   "F3", "F3", "F3", "F3", "F3", "F1", "F1", "F1", "F1", "F1",
                   "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3"],
    "Sunlight": ["High", "High", "High", "High", "High", "High", "High", "High", "High", "High",
                 "High", "High", "High", "High", "High", "Low", "Low", "Low", "Low", "Low",
                 "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low"]
})

# Perform two-way ANOVA
model = ols('Growth ~ C(Fertilizer) + C(Sunlight) + C(Fertilizer):C(Sunlight)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

anova_table.head()

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(Fertilizer) | 309.0667 | 2.0 | 31.32432 | 2.038888e-07 |
| C(Sunlight) | 7.5 | 1.0 | 1.52027 | 0.2295198 |
| C(Fertilizer):C(Sunlight) | 6.44124e-28 | 2.0 | 6.528284e-29 | 1.0 |
| Residual | 118.4 | 24.0 | NaN | NaN |


In [29]:
# p-value of the Fertilizer main effect (first row of the table)
p_fertilizer = anova_table["PR(>F)"].iloc[0]
if p_fertilizer < 0.05:
    print(f"Reject null hypothesis for Fertilizer: the group means are not equal, as the p-value: {p_fertilizer:.3g} is less than 0.05")
else:
    print(f"Fail to reject null hypothesis for Fertilizer: no evidence that the group means differ, as the p-value: {p_fertilizer:.3g} is not less than 0.05")

Reject null hypothesis for Fertilizer: the group means are not equal, as the p-value: 2.04e-07 is less than 0.05


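**Interpretation**

For **One-Way ANOVA**, if the p-value is less than 0.05, it suggests a significant difference in means among the groups.

For **Two-Way ANOVA**, we look at the p-values for each factor and their interaction. In the two-way table above, Fertilizer is significant (p ≈ 2.0e-07), while Sunlight (p ≈ 0.23) and the Fertilizer:Sunlight interaction (p = 1.0) are not.

A p-value less than 0.05 indicates a significant effect at the conventional 5% level.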

These examples should give you a good starting point for conducting ANOVA analyses in Python. Remember, the interpretation of your results should always take into account the context of your data and the specific question you are trying to answer.

# **N-way ANOVA**
N-way ANOVA, also known as factorial ANOVA, is used when you have more than two independent variables. It allows you to analyze the effects of each factor on the dependent variable and the interaction effects between factors.

**Suppose we have an experimental data set with three factors:**

- Fertilizer Type (3 levels: F1, F2, F3)
- Sunlight Exposure (2 levels: High, Low)
- Watering Frequency (2 levels: Regular, Sparse)
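
As a quick sketch, the treatment combinations of a full factorial design with these levels can be enumerated with `itertools.product`. The dictionary `factors` is just an illustrative way to hold the levels listed above.

```python
from itertools import product

# Factor levels as listed above
factors = {
    "Fertilizer": ["F1", "F2", "F3"],
    "Sunlight": ["High", "Low"],
    "Watering": ["Regular", "Sparse"],
}

# Every combination of one level per factor is one cell of the design
cells = list(product(*factors.values()))

print("Number of cells:", len(cells))  # 3 * 2 * 2 = 12
for cell in cells:
    print(dict(zip(factors, cell)))
```

A factorial ANOVA can separate the effect of each factor only if the data cover these combinations.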

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [23]:
# Sample data
data = pd.DataFrame({
    "Growth": [20, 22, 19, 24, 25, 28, 30, 27, 26, 29, 18, 20, 22, 19, 24,
               21, 23, 20, 25, 26, 29, 31, 28, 27, 30, 19, 21, 23, 20, 25,
               20, 22, 21, 23, 24, 26, 28, 25, 27, 29, 17, 19, 21, 18, 20],
    "Fertilizer": ["F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2",
                   "F3", "F3", "F3", "F3", "F3", "F1", "F1", "F1", "F1", "F1",
                   "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3",
                   "F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2",
                   "F3", "F3", "F3", "F3", "F3"],
    "Sunlight": ["High", "High", "High", "High", "High", "High", "High", "High", "High", "High",
                 "High", "High", "High", "High", "High", "Low", "Low", "Low", "Low", "Low",
                 "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low",
                 "High", "High", "High", "High", "High", "High", "High", "High", "High", "High",
                 "High", "High", "High", "High", "High"],
    "Watering": ["Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular"]
})

In [25]:
# Fit the model
model = ols('Growth ~ C(Fertilizer) + C(Sunlight) + C(Watering) + C(Fertilizer):C(Sunlight) + C(Fertilizer):C(Watering) + C(Sunlight):C(Watering) + C(Fertilizer):C(Sunlight):C(Watering)', data=data).fit()

# Perform three-way ANOVA
anova_results = sm.stats.anova_lm(model, typ=2)

anova_results.head()

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(Fertilizer) | 468.0444 | 2.0 | 58.02204 | 2.050614e-12 |
| C(Sunlight) | -2.833807e-13 | 1.0 | -7.025969e-14 | 1.0 |
| C(Watering) | -6.093706e-13 | 1.0 | -1.510836e-13 | 1.0 |
| C(Fertilizer):C(Sunlight) | 1.01492e-13 | 2.0 | 1.258165e-14 | 1.0 |
| C(Fertilizer):C(Watering) | -1.9436e-13 | 2.0 | -2.409422e-14 | 1.0 |


In [27]:
# Print the conclusion for the Fertilizer main effect (first row of the table)
p_fertilizer = anova_results["PR(>F)"].iloc[0]
if p_fertilizer < 0.05:
    print(f"Reject null hypothesis for Fertilizer: the group means are not equal, as the p-value: {p_fertilizer:.3g} is less than 0.05")
else:
    print(f"Fail to reject null hypothesis for Fertilizer: no evidence that the group means differ, as the p-value: {p_fertilizer:.3g} is not less than 0.05")

Reject null hypothesis for Fertilizer: the group means are not equal, as the p-value: 2.05e-12 is less than 0.05


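# **Interpretation**
In the output, you'll see p-values for:

- The main effects of each factor (Fertilizer, Sunlight, Watering)
- The interaction effects between two factors (e.g., Fertilizer:Sunlight)
- The interaction effect among all three factors (Fertilizer:Sunlight:Watering)

Note that `anova_results.head()` displays only the first five rows; display `anova_results` itself to see the remaining terms and the residual row.

A p-value less than 0.05 typically suggests a statistically significant effect. In this example only Fertilizer is significant. The sums of squares for the other terms shown are zero up to floating-point error (some slightly negative), which suggests that the sample data do not allow the model to estimate those effects separately.

However, interpreting ANOVA results can be complex, especially with interactions. You should consider the practical significance and the context of your experiment alongside the statistical results.

Remember, ANOVA makes certain assumptions (normality, homogeneity of variance, and independence), which should be checked as part of the analysis.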
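
Before reading interaction p-values, it can help to confirm that the factors vary independently of one another in the data. A minimal sketch using `pandas.crosstab` on the Sunlight and Watering columns of the three-factor sample data, rebuilt compactly here in the same row order. The name `design` is illustrative.

```python
import pandas as pd

# Sunlight and Watering columns of the three-factor sample data, in the same row order
design = pd.DataFrame({
    "Sunlight": ["High"] * 15 + ["Low"] * 15 + ["High"] * 15,
    "Watering": ["Regular"] * 15 + ["Sparse"] * 15 + ["Regular"] * 15,
})

# Count how many rows fall into each Sunlight x Watering combination
counts = pd.crosstab(design["Sunlight"], design["Watering"])
print(counts)

# Combinations that never occur in the data
empty_cells = int((counts == 0).sum().sum())
print("Empty cells:", empty_cells)
```

Every High row is Regular and every Low row is Sparse, so two of the four combinations are empty. The two factors carry the same information in this sample, and their effects cannot be told apart.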

# **Post-hoc Tests for N-Way ANOVA (Factorial ANOVA)**

## **Since there is a significant difference, which groups differ from which, and by how much?**

A post-hoc test compares every pair of groups:

- how much group one differs from group two
- how much group two differs from group three
- how much group one differs from group three

In [35]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [36]:
# Create DataFrame properly
df = pd.DataFrame({
    "fertilizer": ["fertilizer1"] * 5 + ["fertilizer2"] * 5 + ["fertilizer3"] * 5,
    "growth": fertilizer1 + fertilizer2 + fertilizer3
})
tukey = pairwise_tukeyhsd(df['growth'], df['fertilizer'], alpha=0.05)
print(tukey)

      Multiple Comparison of Means - Tukey HSD, FWER=0.05      
   group1      group2   meandiff p-adj   lower    upper  reject
---------------------------------------------------------------
fertilizer1 fertilizer2      6.0 0.0029   2.2523  9.7477   True
fertilizer1 fertilizer3     -1.4 0.5928  -5.1477  2.3477  False
fertilizer2 fertilizer3     -7.4 0.0005 -11.1477 -3.6523   True
---------------------------------------------------------------


**Interpretation of Each Row**
### **Fertilizer1 vs Fertilizer2**

Mean difference = 6.0

Fertilizer2 gives 6 units more growth than Fertilizer1.

p-value = 0.0029 → Significant

reject = True

✔ There IS a significant difference.
Fertilizer2 increases growth much more than Fertilizer1.

### **Fertilizer1 vs Fertilizer3**

Mean difference = -1.4

Fertilizer3 gives 1.4 units less growth than Fertilizer1, which is a small difference.

p-value = 0.5928 → NOT significant

reject = False

✘ There is NO significant difference between Fertilizer1 and Fertilizer3.

### **Fertilizer2 vs Fertilizer3**

Mean difference = -7.4

Fertilizer3 gives 7.4 units LESS growth than Fertilizer2.

p-value = 0.0005 → Significant

reject = True

✔ There IS a large and statistically significant difference.
Fertilizer2 performs much better than Fertilizer3.
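
The `meandiff` column can be reproduced by hand: it is the mean of `group2` minus the mean of `group1`. A short sketch using only the standard library, with the same fertilizer values as above (the dictionary names are my own):

```python
from itertools import combinations
from statistics import mean

# Fertilizer growth measurements used in the Tukey test above
groups = {
    "fertilizer1": [20, 22, 19, 24, 25],
    "fertilizer2": [28, 30, 27, 26, 29],
    "fertilizer3": [18, 20, 22, 19, 24],
}

# For every pair, subtract the first group's mean from the second group's mean
mean_diffs = {}
for group1, group2 in combinations(groups, 2):
    mean_diffs[(group1, group2)] = mean(groups[group2]) - mean(groups[group1])
    print(f"{group1} vs {group2}: meandiff = {mean_diffs[(group1, group2)]:.1f}")
```

This reproduces only the differences. The adjusted p-values and confidence intervals come from the Tukey HSD procedure itself.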

# **Tukey post-hoc test on another dataset**

In [37]:

# Sample data
data = pd.DataFrame({
    "Growth": [20, 22, 19, 24, 25, 28, 30, 27, 26, 29, 18, 20, 22, 19, 24,
               21, 23, 20, 25, 26, 29, 31, 28, 27, 30, 19, 21, 23, 20, 25,
               20, 22, 21, 23, 24, 26, 28, 25, 27, 29, 17, 19, 21, 18, 20],
    "Fertilizer": ["F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2",
                   "F3", "F3", "F3", "F3", "F3", "F1", "F1", "F1", "F1", "F1",
                   "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3",
                   "F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2",
                   "F3", "F3", "F3", "F3", "F3"],
    "Sunlight": ["High", "High", "High", "High", "High", "High", "High", "High", "High", "High",
                 "High", "High", "High", "High", "High", "Low", "Low", "Low", "Low", "Low",
                 "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low",
                 "High", "High", "High", "High", "High", "High", "High", "High", "High", "High",
                 "High", "High", "High", "High", "High"],
    "Watering": ["Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular"]
})


In [38]:
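# Concatenate the three factor labels so that each factor combination becomes one group
groups = data['Fertilizer'] + data['Sunlight'] + data['Watering']
tukey = pairwise_tukeyhsd(data['Growth'], groups, alpha=0.05)
print(tukey)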

        Multiple Comparison of Means - Tukey HSD, FWER=0.05        
    group1        group2    meandiff p-adj   lower    upper  reject
-------------------------------------------------------------------
F1HighRegular   F1LowSparse      1.0 0.9419  -2.2956  4.2956  False
F1HighRegular F2HighRegular      5.5    0.0   2.8092  8.1908   True
F1HighRegular   F2LowSparse      7.0    0.0   3.7044 10.2956   True
F1HighRegular F3HighRegular     -2.2 0.1647  -4.8908  0.4908  False
F1HighRegular   F3LowSparse     -0.4 0.9991  -3.6956  2.8956  False
  F1LowSparse F2HighRegular      4.5 0.0027   1.2044  7.7956   True
  F1LowSparse   F2LowSparse      6.0 0.0004   2.1946  9.8054   True
  F1LowSparse F3HighRegular     -3.2 0.0613  -6.4956  0.0956  False
  F1LowSparse   F3LowSparse     -1.4 0.8775  -5.2054  2.4054  False
F2HighRegular   F2LowSparse      1.5 0.7478  -1.7956  4.7956  False
F2HighRegular F3HighRegular     -7.7    0.0 -10.3908 -5.0092   True
F2HighRegular   F3LowSparse     -5.9 0.0001  -9.

# **Further reading**
[04_anova.ipynb in python-ka-chilla-2024](https://github.com/AammarTufail/python-ka-chilla-2024/blob/main/06_statistics/04_anova.ipynb)