
In [1]:
%%html
<style>
    *{
        font-family: 'Arial', sans-serif;
        align-items: center;
        justify-content: center;
        max-width: 1000px;
        font-size: 18px;
        line-height: 1.5;
    }
    img{
        display: flex;
        margin-left: auto;
        margin-right: auto;
        width: 700px;
        height: auto;
        text-align: center;
        border-radius: 15px;
    }
    
</style>

**`ANOVA`**, which stands for Analysis of Variance, is a statistical test used to analyze the difference between the means of three or more groups.

A one-way ANOVA uses one independent variable, while a two-way ANOVA uses two independent variables.

# When to use a one-way ANOVA

Use a one-way ANOVA when you have collected data about one categorical independent variable and one quantitative dependent variable. The independent variable should have at least three levels (i.e. at least three different groups or categories).

ANOVA tells you if the dependent variable changes according to the level of the independent variable. For example:

- Your independent variable is social media use, and you assign groups to low, medium, and high levels of social media use to find out if there is a difference in hours of sleep per night.

- Your independent variable is brand of soda, and you collect data on Coke, Pepsi, Sprite, and Fanta to find out if there is a difference in the price per 100ml.

- Your independent variable is type of fertilizer, and you treat crop fields with mixtures 1, 2 and 3 to find out if there is a difference in crop yield.

The null hypothesis (H0) of ANOVA is that there is no difference among group means. The alternative hypothesis (Ha) is that at least one group mean differs from the others.

If you only want to compare two groups, use a t-test instead.
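As a quick illustration of the two-group case, here is a minimal sketch using `scipy.stats.ttest_ind`. The two arrays are made-up example yields (an assumption for illustration), and `equal_var=True` is chosen to mirror ANOVA's equal-variance assumption. With exactly two groups, the one-way ANOVA F statistic equals the square of this t statistic, so the two tests agree.

```python
import numpy as np
from scipy import stats

# Illustrative data for two groups only (assumed example values)
group_a = np.array([20, 22, 19, 20, 23, 18, 20])
group_b = np.array([28, 30, 27, 26, 28, 30, 29])

# Independent two-sample t-test with pooled (equal) variances
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=True)

# One-way ANOVA on the same two groups, for comparison
f_stat, f_p = stats.f_oneway(group_a, group_b)

print(f"t = {t_stat:.3f}, p = {t_p:.3g}")
print(f"F = {f_stat:.3f}, t^2 = {t_stat**2:.3f}")
```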

# How ANOVA Works:

- Purpose: ANOVA tests if there are any statistically significant differences between the means of three or more independent (unrelated) groups.

- F Test: It uses the F statistic to compare the variance between groups to the variance within groups. If the between-group variance is significantly larger than the within-group variance, it suggests that at least one group mean is different.

- Assumptions: ANOVA assumes that the observations are independent, that the dependent variable is approximately normally distributed within each group, and that the groups have equal variances (homogeneity of variance).
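One possible way to screen the normality and equal-variance assumptions is with the Shapiro-Wilk and Levene tests from `scipy.stats`. This is only a sketch: the three arrays are illustrative fertilizer yields, and with samples this small both tests have little power, so plots and subject knowledge matter as well. Independence cannot be tested this way; it comes from how the data were collected.

```python
import numpy as np
from scipy import stats

# Illustrative yields for three groups (example values)
groups = [
    np.array([20, 22, 19, 20, 23, 18, 20]),
    np.array([28, 30, 27, 26, 28, 30, 29]),
    np.array([23, 25, 27, 26, 22, 24, 23]),
]

# Shapiro-Wilk per group: a small p-value suggests a departure from normality
shapiro_pvalues = []
for i, g in enumerate(groups, start=1):
    w_stat, w_p = stats.shapiro(g)
    shapiro_pvalues.append(w_p)
    print(f"Group {i}: Shapiro-Wilk W = {w_stat:.3f}, p = {w_p:.3f}")

# Levene's test: a small p-value suggests the group variances are unequal
lev_stat, lev_p = stats.levene(*groups)
print(f"Levene W = {lev_stat:.3f}, p = {lev_p:.3f}")
```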

# Steps in ANOVA:

- Group Means: Calculate the mean of each group.
- Overall Mean: Calculate the overall mean of the data.
- Between-Group Variance: Calculate how much each group mean deviates from the overall mean.
- Within-Group Variance: Calculate the variance within each group.
- F Ratio: Calculate the F statistic, which is the ratio of between-group variance to within-group variance.
- Significance: Compare the F statistic to a critical value to determine if the observed differences are statistically significant.
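The steps above can be sketched by hand with NumPy. This is a minimal illustration on three example groups of fertilizer yields, not a replacement for a library routine; the variable names (`ss_between`, `ms_within`, and so on) are our own. The result is compared against `scipy.stats.f_oneway` as a sanity check.

```python
import numpy as np
from scipy import stats

# Illustrative yields for three groups (example values)
groups = [
    np.array([20, 22, 19, 20, 23, 18, 20]),
    np.array([28, 30, 27, 26, 28, 30, 29]),
    np.array([23, 25, 27, 26, 22, 24, 23]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()   # overall mean
k = len(groups)                  # number of groups
n_total = all_values.size        # total number of observations

# Between-group sum of squares: group means vs. the overall mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations vs. their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares = sums of squares divided by their degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
# p-value: upper tail of the F distribution with (k - 1, n_total - k) df
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

print(f"F = {f_manual:.4f}, p = {p_manual:.3g}")
print(f"scipy F = {stats.f_oneway(*groups).statistic:.4f}")
```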

# Results Interpretation:

- Significant F: If the F statistic is larger than the critical value (equivalently, the p-value is below the chosen significance level), it suggests that at least one group mean is significantly different from the others.
- Not Significant: If the F statistic is not larger than the critical value, there is not enough evidence to conclude that the group means differ; the observed differences are consistent with chance variation.
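To make the decision rule concrete, here is a small sketch that looks up the critical value with `scipy.stats.f.ppf`. It assumes a significance level of 0.05 and reuses the F statistic of about 40.09 (3 groups, 21 observations) from the fertilizer example in this notebook; the variable names are ours.

```python
from scipy import stats

alpha = 0.05        # assumed significance level
df_between = 2      # number of groups - 1
df_within = 18      # total observations - number of groups
f_observed = 40.09  # F statistic from the fertilizer example (rounded)

# Critical value: the F value that cuts off the upper alpha tail
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
# Equivalent view: probability of an F at least this large under H0
p_value = stats.f.sf(f_observed, df_between, df_within)

reject_null = f_observed > f_critical
print(f"Critical F = {f_critical:.3f}, observed F = {f_observed}")
print(f"p-value = {p_value:.3g}, reject H0: {reject_null}")
```

Comparing F to the critical value and comparing the p-value to alpha always lead to the same decision.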

# Two-way ANOVA

A two-way ANOVA is a statistical method used when you want to examine the effect of two categorical independent variables on a quantitative dependent variable, and possibly their interaction effect on the dependent variable. Here's a simplified summary of when and how to use a two-way ANOVA:

# When to Use Two-Way ANOVA:

- Two Categorical Independent Variables: You have two factors (independent variables) that are categorical, and you're interested in their effects on a quantitative outcome.
- Quantitative Dependent Variable: The outcome you're measuring is quantitative, meaning it represents amounts or counts that can be averaged.
- Interest in Interaction: You want to know not just the separate effects of each independent variable on the dependent variable but also whether there's an interaction effect between the two independent variables.

# How Two-Way ANOVA Works:

- Three Null Hypotheses: Two-way ANOVA tests three null hypotheses: (1) no difference in means across the levels of the first independent variable, (2) no difference in means across the levels of the second independent variable, and (3) no interaction effect between the two independent variables.

- F Test: It uses the F statistic to compare variances, similar to one-way ANOVA, but it considers both independent variables and their interaction.

- Assumptions: The data should meet certain assumptions, including homogeneity of variance, independence of observations, and a normally distributed dependent variable.

# Steps in Two-Way ANOVA:

- Test for Main Effects: Examine if there are any significant differences in the dependent variable across the levels of each independent variable.

- Test for Interaction Effect: Determine if the effect of one independent variable on the dependent variable changes across the levels of the other independent variable.

- Post-hoc Testing: If significant effects are found, further tests (like Tukey's HSD) can identify where those differences lie.
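As a sketch of the post-hoc step, the block below runs Tukey's HSD on three illustrative fertilizer groups using `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later, which we assume here). For simplicity it compares the levels of a single factor; `statsmodels` also offers `pairwise_tukeyhsd` for the same purpose.

```python
import numpy as np
from scipy import stats

# Illustrative yields for three fertilizers (example values)
f1 = np.array([20, 22, 19, 20, 23, 18, 20])
f2 = np.array([28, 30, 27, 26, 28, 30, 29])
f3 = np.array([23, 25, 27, 26, 22, 24, 23])

# Tukey's HSD compares every pair of groups while controlling
# the family-wise error rate
result = stats.tukey_hsd(f1, f2, f3)

# result.pvalue[i, j] is the adjusted p-value for group i vs. group j
names = ["F1", "F2", "F3"]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"{names[i]} vs {names[j]}: p = {result.pvalue[i, j]:.4f}")
```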

# Interpreting Results:

- Significant Main Effects: If either independent variable shows a significant effect, it means the mean of the dependent variable differs across the levels of that variable.

- Significant Interaction Effect: If the interaction is significant, the impact of one independent variable on the dependent variable depends on the level of the other independent variable.

- No Significant Effects: If no significant effects are found, it suggests that neither the independent variables nor their interaction significantly affect the dependent variable.
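One informal way to see what an interaction means is to tabulate the cell means. The sketch below builds a small fertilizer-by-irrigation data set (the same illustrative values used in the two-way example in this notebook) and compares the irrigated vs. non-irrigated gap within each fertilizer. If the gap were very different from one fertilizer to another, that would hint at an interaction; the formal F test is still needed to judge significance.

```python
import pandas as pd

# Illustrative two-factor data (example values)
df = pd.DataFrame({
    "Yield": [20, 22, 19, 28, 30, 27, 23, 25, 27,
              20, 23, 18, 26, 28, 30, 22, 24, 23],
    "Fertilizer": ["F1"] * 3 + ["F2"] * 3 + ["F3"] * 3
                  + ["F1"] * 3 + ["F2"] * 3 + ["F3"] * 3,
    "Irrigation": ["Yes"] * 9 + ["No"] * 9,
})

# Mean yield for every fertilizer/irrigation combination
cell_means = df.pivot_table(values="Yield", index="Fertilizer",
                            columns="Irrigation", aggfunc="mean")

# Irrigation gap within each fertilizer; similar gaps across rows
# are what the absence of an interaction looks like
cell_means["Yes - No"] = cell_means["Yes"] - cell_means["No"]
print(cell_means.round(2))
```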

# Implementation in Python

Imagine we're looking at the effect of different types of fertilizer on crop yield. We have three types of fertilizer, and we've collected yield data (in units of output) for fields treated with each type.

In [2]:
import numpy as np
import scipy.stats as stats

# Example data: Crop yields (in arbitrary units) for different fertilizers
fertilizer1_yields = np.array([20, 22, 19, 20, 23, 18, 20])
fertilizer2_yields = np.array([28, 30, 27, 26, 28, 30, 29])
fertilizer3_yields = np.array([23, 25, 27, 26, 22, 24, 23])

# One-way ANOVA: tests the null hypothesis that all three group means are equal
f_stat, p_value = stats.f_oneway(fertilizer1_yields, fertilizer2_yields, fertilizer3_yields)

print(f"F-statistic: {f_stat}")
print(f"P-value: {p_value}")


F-statistic: 40.09090909090908
P-value: 2.3397686246935857e-07


Here we're comparing the yields of crops grown with three different fertilizers to see if there's a statistically significant difference in their means. If the p-value is below a chosen threshold (commonly 0.05), we reject the null hypothesis and conclude that at least one fertilizer leads to a different average yield than the others. In this example the p-value is about 2.3e-07, far below 0.05, so we reject the null hypothesis.

Now assume we want to investigate not only the type of fertilizer but also whether the use of an irrigation system affects crop yield. Our two independent variables are now fertilizer type and irrigation system presence.

In [3]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data: Crop yields (in arbitrary units) for different fertilizers and irrigation systems
data = {
    "Yield": [20, 22, 19, 28, 30, 27, 23, 25, 27, 20, 23, 18, 26, 28, 30, 22, 24, 23],
    "Fertilizer": ["F1", "F1", "F1", "F2", "F2", "F2", "F3", "F3", "F3", "F1", "F1", "F1", "F2", "F2", "F2", "F3", "F3", "F3"],
    "Irrigation": ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No"]
}
df = pd.DataFrame(data)


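# Fit a linear model with both main effects and their interaction
model = ols('Yield ~ C(Fertilizer) + C(Irrigation) + C(Fertilizer):C(Irrigation)', data=df).fit()
# Build the ANOVA table using Type II sums of squares
anova_results = sm.stats.anova_lm(model, typ=2)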

print(anova_results)


                                 sum_sq    df          F    PR(>F)
C(Fertilizer)                184.333333   2.0  27.650000  0.000032
C(Irrigation)                  2.722222   1.0   0.816667  0.383940
C(Fertilizer):C(Irrigation)    3.444444   2.0   0.516667  0.609192
Residual                      40.000000  12.0        NaN       NaN


Here we're looking at how both fertilizer type and irrigation (and their interaction) affect crop yield. The model includes both main effects and the interaction effect between fertilizer type and irrigation presence. The results tell us whether each factor has a significant effect on crop yields, and whether there's a significant interaction effect between them.

We can conclude that the type of fertilizer has a significant effect on crop yields (p = 0.000032), indicating that different fertilizers lead to statistically different yields. However, the effects of irrigation and of the interaction between fertilizer type and irrigation are not statistically significant (p = 0.383940 and p = 0.609192, respectively), so these data give no evidence that irrigation, alone or in combination with fertilizer type, influences crop yields.

# Resources

- [One-way ANOVA](https://www.scribbr.com/statistics/one-way-anova/)

- [Two-way ANOVA](https://www.scribbr.com/statistics/two-way-anova/)