<center>
<h1><b>ANOVA</b></h1>
</center>

-------------

**ANOVA** (Analysis of Variance) is a statistical method used to compare the means of three or more groups to see if at least one group mean is different from the others. It's like a more advanced version of the t-test, which compares the means of two groups.

### Types of ANOVA

1. **One-Way ANOVA**
2. **Two-Way ANOVA**
3. **Repeated Measures ANOVA**


### One-Way ANOVA

**Use Case**: When you have one independent variable with three or more levels (groups) and one dependent variable.

**Example**: A company wants to test if three different training programs lead to different levels of employee productivity. The independent variable is the training program (with three levels: Program A, Program B, Program C), and the dependent variable is productivity.

**Conditions for Usage**:
- The groups should be independent.
- The dependent variable should be continuous and approximately normally distributed.
- Homogeneity of variances

  Think of variance as the difference between the heights of people in a group. If everyone is about the same height, the variance is small. If there are both very short and very tall people, the variance is large.

  **Homogeneity of variances** means that the variances (the amount of spread in the data) in different groups should be roughly the same. In other words, the spread of heights in Group 1 should be similar to the spread of heights in Group 2.
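
One common way to check this condition is Levene's test. The sketch below applies `scipy.stats.levene` to the three training-program samples used later in this notebook. The 0.05 cutoff and the variable names are assumptions made for illustration, not a fixed rule.

```python
import scipy.stats as stats

# Productivity scores per training program (the notebook's sample data)
program_a = [89, 85, 90, 92, 88]
program_b = [78, 82, 79, 81, 77]
program_c = [92, 94, 93, 95, 91]

# Levene's test: the null hypothesis is that all groups have equal variances
levene_stat, levene_p = stats.levene(program_a, program_b, program_c)
print(f"Levene statistic: {levene_stat:.3f}, p-value: {levene_p:.3f}")

# Assumed decision rule: treat the variances as similar when p > 0.05
variances_look_equal = levene_p > 0.05
print(f"Variances look similar: {variances_look_equal}")
```

A large p-value here only means the test found no evidence of unequal spread; with five observations per group it has limited power.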

**Limitations**:
- Only one independent variable can be tested.
- Does not show which groups are different, only that at least one group is different.


### Two-Way ANOVA

**Use Case**: When you have two independent variables and one dependent variable.

**Example**: A retail company wants to understand the effect of both store location (Lekki, Ikeja, Epe) and promotion type (discount, buy-one-get-one-free, no promotion) on sales. Here, the independent variables are store location and promotion type, and the dependent variable is sales.

**Conditions for Usage**:
- The groups should be independent.
- The dependent variable should be continuous and approximately normally distributed.
- Homogeneity of variances.

**Limitations**:
- More complex interpretation.
- Interactions between variables can be difficult to understand.

### Repeated Measures ANOVA

**Use Case**: When you measure the same subjects under different conditions or over time.

**Example**: A tech company wants to test the effectiveness of different website designs on user satisfaction over three different interactions. The independent variable is the website design (Design A, Design B, Design C), and the dependent variable is user satisfaction scores.

**Conditions for Usage**:
- The same subjects are used in all conditions.
- The dependent variable should be continuous and approximately normally distributed.
- Sphericity (the variances of the differences between all combinations of related groups should be equal).

**Limitations**:
- Assumes sphericity; if not met, results can be misleading.
- Can be sensitive to missing data.

## Practical Python Implementations

In [1]:
# Import libraries
import pandas as pd
import scipy.stats as stats

### One-Way ANOVA
**Example**: A company wants to test if three different training programs lead to different levels of employee productivity. The independent variable is the training program (with three levels: Program A, Program B, Program C), and the dependent variable is productivity.

In [2]:
# Sample Data
data = {
    'Program_A': [89, 85, 90, 92, 88],
    'Program_B': [78, 82, 79, 81, 77],
    'Program_C': [92, 94, 93, 95, 91]
}

df = pd.DataFrame(data)

f_stat, p_val = stats.f_oneway(df['Program_A'], df['Program_B'], df['Program_C'])
print(f'F-Statistic: {f_stat}, p-value: {p_val}')

F-Statistic: 53.88148148148136, p-value: 1.0119342241180066e-06


### One-Way ANOVA Interpretation

- **F-Statistic**: 53.881
- **P-Value**: 0.000001011 (approximately)

We measured the productivity levels of employees who participated in three different training programs (Program A, Program B, Program C). We used a special tool (ANOVA) to see if there is a significant difference in productivity levels among the three training programs or if any observed differences are just by chance.

#### What We Found:
1. **The F-Statistic (53.881)**: This number is the ratio of the variation between the three programs' average productivity to the variation among employees within the same program. The larger it is, the more the program averages differ relative to the ordinary spread within each program.

2. **The P-Value (0.000001011)**: This very small number is the probability of seeing differences at least this large if the three programs actually had the same average productivity. Since it is much smaller than 0.05 (the conventional 5-in-100 cutoff), differences like these would be very unlikely to arise by chance alone.
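
To make these two numbers less of a black box, the sketch below rebuilds them by hand from the same sample data: the F-statistic as the between-group mean square divided by the within-group mean square, and the p-value as the upper tail of the F distribution. Variable names such as `ss_between` are chosen for illustration.

```python
import numpy as np
import scipy.stats as stats

groups = [
    np.array([89, 85, 90, 92, 88]),  # Program A
    np.array([78, 82, 79, 81, 77]),  # Program B
    np.array([92, 94, 93, 95, 91]),  # Program C
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# p-value: probability of an F at least this large under the null hypothesis
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against scipy's built-in one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.3f}, scipy F = {f_scipy:.3f}")
print(f"manual p = {p_manual:.3e}, scipy p = {p_scipy:.3e}")
```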

**In Simple Words**:
The data give strong evidence that at least one of the training programs leads to a different level of productivity than the others, and that this is not just luck or random chance. It's like finding a clear difference in the number of candies in three different bags and being very sure it's not an accident.

So, we can say with a lot of confidence that the training programs have different effects on employee productivity. This means we should look deeper to see which specific programs are making the difference and by how much.
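
One rough way to put a number on "by how much" is eta squared, the share of the total variation in productivity that is associated with the training program. The sketch below computes it from the same sample data. Using eta squared here is an assumption for illustration; it is not part of the original analysis.

```python
import numpy as np

groups = {
    'Program_A': [89, 85, 90, 92, 88],
    'Program_B': [78, 82, 79, 81, 77],
    'Program_C': [92, 94, 93, 95, 91],
}

values = np.concatenate(list(groups.values()))
grand_mean = values.mean()

# Variation between program means versus total variation in the data
ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups.values())
ss_total = ((values - grand_mean) ** 2).sum()

# Eta squared: proportion of total variation associated with the program
eta_squared = ss_between / ss_total

for name, scores in groups.items():
    print(f"{name}: mean productivity = {np.mean(scores):.1f}")
print(f"Eta squared: {eta_squared:.3f}")
```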

### Practical Business Example:

Imagine a company called "TechCorp" that implemented three different training programs to improve employee productivity. The company wants to know if there's a significant difference in productivity resulting from these programs.

- **Program A**: Standard Training
- **Program B**: Advanced Training
- **Program C**: Mixed Training

After measuring productivity levels, ANOVA results showed:

- **F-Statistic**: 53.881
- **P-Value**: 0.000001011

#### What This Means for TechCorp:
1. **Big Differences in Productivity**: The F-Statistic (53.881) shows that productivity differs far more between programs than it varies within each program.
2. **Very Low Chance of Randomness**: The P-Value (0.000001011) means differences this large would be very unlikely if the programs were equally effective.

**In Simple Words**:
TechCorp has strong evidence that employee productivity differs between the training programs beyond what random variation would explain. Now, post-hoc tests like Tukey's HSD can be used to find out which specific programs differ from each other and which is the most effective.
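
A minimal sketch of that post-hoc step, assuming `statsmodels` is available (the notebook already uses it for the two-way example). It reshapes the sample data to long format and runs `pairwise_tukeyhsd`. The column names `Program` and `Productivity` are illustrative.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

data = {
    'Program_A': [89, 85, 90, 92, 88],
    'Program_B': [78, 82, 79, 81, 77],
    'Program_C': [92, 94, 93, 95, 91],
}

# Reshape to long format: one row per employee, with a group label column
long_df = pd.DataFrame(data).melt(var_name='Program', value_name='Productivity')

# Tukey's HSD compares every pair of programs while controlling the
# family-wise error rate at alpha
tukey = pairwise_tukeyhsd(endog=long_df['Productivity'],
                          groups=long_df['Program'],
                          alpha=0.05)
print(tukey.summary())
```

Each row of the summary covers one pair of programs; the `reject` column indicates whether that pair differs at the chosen alpha.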


--------------

### Two-Way ANOVA

**Example**: A retail company wants to understand the effect of both store location (Lekki, Ikeja, Epe) and promotion type (discount, buy-one-get-one-free, no promotion) on sales. Here, the independent variables are store location and promotion type, and the dependent variable is sales.

In [3]:
# importation of libraries
import pandas as pd  # Import pandas library for data manipulation
import statsmodels.api as sm  # Import statsmodels library for statistical modeling
from statsmodels.formula.api import ols  # Import OLS function for linear regression

In [4]:
# Sample Data 
data = {
    'Location': ['Lekki']*9 + ['Ikeja']*9 + ['Epe']*9,
    'Promotion': ['Discount']*3 + ['BOGO']*3 + ['No Promotion']*3 +
                 ['Discount']*3 + ['BOGO']*3 + ['No Promotion']*3 +
                 ['Discount']*3 + ['BOGO']*3 + ['No Promotion']*3,
    'Sales': [200, 210, 220, 230, 240, 250, 260, 270, 280,
              290, 300, 310, 320, 330, 340, 350, 360, 370,
              380, 390, 400, 410, 420, 430, 440, 450, 460]
}

# Create a DataFrame from the sample data
df = pd.DataFrame(data)
df.sample(3)  # Show 3 randomly chosen rows

| | Location | Promotion | Sales |
|---|---|---|---|
| 0 | Lekki | Discount | 200 |
| 22 | Epe | BOGO | 420 |
| 18 | Epe | Discount | 380 |


In [5]:
# Performing two-way ANOVA analysis
# Fit a linear model with both main effects and their interaction
model = ols('Sales ~ C(Location) + C(Promotion) + C(Location):C(Promotion)', data=df).fit()

# Generate the Type II two-way ANOVA table (anova_lm already returns a DataFrame)
anova_df = sm.stats.anova_lm(model, typ=2)

# Display the two-way ANOVA table
anova_df

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(Location) | 145800.0 | 2.0 | 729.0 | 5.965891e-18 |
| C(Promotion) | 16200.0 | 2.0 | 81.0 | 1e-09 |
| C(Location):C(Promotion) | 2.0262159999999998e-26 | 4.0 | 5.065539e-29 | 1.0 |
| Residual | 1800.0 | 18.0 | | |


### Two-Way ANOVA Interpretation

- **F-Statistic (C(Location))**: 729.0
- **P-Value (C(Location))**: 5.97e-18 (approximately)

- **F-Statistic (C(Promotion))**: 81.0
- **P-Value (C(Promotion))**: 1.00e-09 (approximately)

- **F-Statistic (Interaction: C(Location):C(Promotion))**: 5.07e-29
- **P-Value (Interaction: C(Location):C(Promotion))**: 1.00

We measured the sales performance of stores in three different locations (Lekki, Ikeja, Epe) under three different promotion strategies (Discount, BOGO (buy-one-get-one-free), No Promotion). We used a special tool (Two-Way ANOVA) to see if there is a significant difference in sales performance based on location, promotion strategy, or their interaction, or if any observed differences are just by chance.

#### What We Found:
1. **The F-Statistic for Location (729.0)**: This number compares the variation in average sales between locations to the leftover variation within each location-promotion combination. The larger it is, the more the location averages differ relative to that leftover spread.

2. **The P-Value for Location (5.97e-18)**: This extremely small number is the probability of seeing location differences at least this large if location had no effect on sales. Since it is much smaller than 0.05, differences like these would be very unlikely to arise by chance alone.

3. **The F-Statistic for Promotion (81.0)**: This number makes the same comparison for promotion strategies. A larger number means the promotion averages differ more relative to the leftover spread.

4. **The P-Value for Promotion (1.00e-09)**: This very small number is the probability of seeing promotion differences at least this large if promotion had no effect on sales. Since it is much smaller than 0.05, differences like these would be very unlikely to arise by chance alone.

5. **The F-Statistic for Interaction (5.07e-29)**: This number measures whether the effect of a promotion changes from one location to another. It is effectively zero (only floating-point rounding error) because, in this sample data, each promotion shifts average sales by the same amount in every location.

6. **The P-Value for Interaction (1.00)**: There is no evidence of an interaction between location and promotion strategy; the interaction effect is not statistically significant.

**In Simple Words**:
The data give strong evidence that both the location and the type of promotion affect sales performance. However, there is no evidence of an interaction: the effect of a promotion does not depend on the location. It's like finding that sales differ from store to store and from promotion to promotion, but that each promotion changes sales by the same amount in every store.
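
A quick way to see what "no interaction" looks like is a table of average sales for every location-promotion combination. The sketch below builds it with `pivot_table` from the same sample data and compares one pair of promotions across locations. The name `cell_means` is illustrative.

```python
import pandas as pd

data = {
    'Location': ['Lekki']*9 + ['Ikeja']*9 + ['Epe']*9,
    'Promotion': (['Discount']*3 + ['BOGO']*3 + ['No Promotion']*3) * 3,
    'Sales': [200, 210, 220, 230, 240, 250, 260, 270, 280,
              290, 300, 310, 320, 330, 340, 350, 360, 370,
              380, 390, 400, 410, 420, 430, 440, 450, 460]
}
df = pd.DataFrame(data)

# Average sales for every Location x Promotion combination
cell_means = df.pivot_table(index='Location', columns='Promotion',
                            values='Sales', aggfunc='mean')
print(cell_means)

# With no interaction, the gap between two promotions is the same in every location
bogo_vs_discount = cell_means['BOGO'] - cell_means['Discount']
print(bogo_vs_discount)
```

If the gap between two promotions changed from one location to another, that would be the signature of an interaction.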

### Practical Business Example:

Imagine a retail company called "StoreABC" that has stores in three locations (Lekki, Ikeja, Epe) and runs three different promotion strategies to boost sales.

- **Locations**: Lekki, Ikeja, Epe
- **Promotions**: Discount, BOGO (Buy One Get One Free), No Promotion

After measuring sales performance, Two-Way ANOVA results showed:

- **F-Statistic (Location)**: 729.0
- **P-Value (Location)**: 5.97e-18
- **F-Statistic (Promotion)**: 81.0
- **P-Value (Promotion)**: 1.00e-09
- **F-Statistic (Interaction)**: 5.07e-29
- **P-Value (Interaction)**: 1.00

#### What This Means for StoreABC:
1. **Big Differences in Sales by Location**: The F-Statistic (729.0) and its very small p-value indicate significant differences in sales between Lekki, Ikeja, and Epe.
2. **Big Differences in Sales by Promotion**: The F-Statistic (81.0) and its very small p-value indicate significant differences in sales between the different promotions.
3. **No Significant Interaction Effect**: The interaction effect is not significant, meaning the effect of a promotion does not change from one location to another.

**In Simple Words**:
StoreABC has strong evidence that both the store location and the type of promotion affect sales performance. The lack of interaction means that the effectiveness of a promotion is consistent across all locations, so StoreABC can choose a promotion strategy without tailoring it to each location.

For further analysis, StoreABC can perform post-hoc tests like Tukey's HSD (Honestly Significant Difference) to determine which specific locations and promotions differ significantly. Also, for strategy optimization, StoreABC can plan its locations and promotions based on the specific effects observed for each factor independently.

----------------

### Repeated Measures ANOVA

**Example**: A tech company wants to test the effectiveness of different website designs on user satisfaction over three different interactions. The independent variable is the website design (Design A, Design B, Design C), and the dependent variable is user satisfaction scores.


In [6]:
# importation of library
import pingouin as pg

In [7]:
# Sample Data
data = {
    'User': list(range(1, 31)) * 3,  # 30 users repeated for each design (A, B, C)
    'Design': ['A']*30 + ['B']*30 + ['C']*30,  # Each design (A, B, C) applied to 30 users
    'Satisfaction': [70, 75, 72, 74, 73, 76, 78, 77, 79, 80, 85, 88, 84, 82, 81, 87, 86, 89, 90, 91, 83, 78, 80, 85, 87, 89, 90, 88, 85, 87,
                     65, 70, 68, 66, 67, 69, 71, 73, 72, 74, 76, 78, 75, 77, 79, 80, 82, 84, 81, 83, 85, 87, 89, 86, 88, 90, 92, 94, 93, 91,
                     60, 62, 61, 63, 64, 65, 67, 66, 68, 69, 70, 71, 73, 75, 74, 76, 78, 77, 79, 80, 82, 81, 84, 86, 85, 87, 88, 89, 90, 91]
}

# Create a DataFrame from the sample data
df = pd.DataFrame(data)

In [8]:
# Perform repeated measures ANOVA
anova = pg.rm_anova(dv='Satisfaction',    # Dependent variable (user satisfaction scores)
                    within='Design',      # Within-subject factor (website design)
                    subject='User',       # Subject identifier (user)
                    data=df)              # DataFrame containing the data

# Display the ANOVA table (rm_anova already returns a DataFrame)
RManova_df = anova
RManova_df

| | Source | ddof1 | ddof2 | F | p-unc | p-GG-corr | ng2 | eps | sphericity | W-spher | p-spher |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Design | 2 | 58 | 28.864675 | 1.993196e-09 | 2e-06 | 0.109198 | 0.576874 | False | 0.266519 | 9.124002e-09 |


### Repeated Measures ANOVA Interpretation

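- **F-Statistic**: 28.864675
- **P-Value (uncorrected)**: 1.993196e-09 (approximately 0.00000000199)
- **P-Value (Greenhouse-Geisser corrected)**: 2e-06 (approximately)
- **Sphericity**: not met (W = 0.266519, p = 9.124002e-09)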

We measured user satisfaction scores for three different website designs (Design A, Design B, Design C) over three different interactions to see if there's a significant difference in satisfaction scores between the designs or if any observed differences are just by chance.

### What We Found:
1. **The F-Statistic (28.864675)**: This number compares the variation between the average satisfaction scores of the three designs to the leftover variation after accounting for differences between users. The larger it is, the more the design averages differ relative to that leftover spread.

2. **The P-Value (1.993196e-09)**: This very small number is the probability of seeing differences at least this large if the three designs actually had the same average satisfaction. Since it is much smaller than 0.05 (the conventional 5-in-100 cutoff), differences like these would be very unlikely to arise by chance alone. Note that the output table reports `sphericity = False`, so the sphericity condition is violated here; the Greenhouse-Geisser corrected p-value (`p-GG-corr`, about 2e-06) is the safer one to report, and it is also far below 0.05.

**In Simple Words**:
The data give strong evidence that at least one of the website designs leads to a different level of user satisfaction than the others, and that this is not just luck or random chance. It's like finding a clear difference in user happiness after using three differently designed websites and being very sure it's not an accident.

So, we can say with a lot of confidence that the website designs have different effects on user satisfaction. This means we should look deeper to see which specific designs are making the difference and by how much.

### Practical Business Example:

Imagine a tech company called "WebInnovate" that wants to test the effectiveness of different website designs on user satisfaction over three different interactions.

- **Design A**: Classic Website Design
- **Design B**: Modern Website Design
- **Design C**: Experimental Website Design

After measuring user satisfaction scores for these three designs over three interactions, the Repeated Measures ANOVA results showed:

- **F-Statistic**: 28.864675
- **P-Value (uncorrected)**: 1.993196e-09
- **P-Value (Greenhouse-Geisser corrected, since sphericity was violated)**: 2e-06

#### What This Means for WebInnovate:
1. **Big Differences in User Satisfaction**: The F-Statistic (28.864675) shows that satisfaction differs far more between designs than the leftover variation would explain.
2. **Very Low Chance of Randomness**: The P-Value (1.993196e-09) means differences this large would be very unlikely if the designs were equally satisfying.

**In Simple Words**:
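WebInnovate has strong evidence that user satisfaction differs between the website designs beyond what random variation would explain. Now, post-hoc comparisons can be used to find out which specific designs differ. Because the same users rated every design, paired comparisons with a multiple-comparison correction (for example, Bonferroni) fit this design better than Tukey's HSD, which assumes independent groups.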
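
For a within-subject design like this one, a common follow-up is a set of paired t-tests with a correction for multiple comparisons. The sketch below uses `scipy.stats.ttest_rel` on the same sample data and applies a Bonferroni adjustment by hand. The choice of Bonferroni and the `results` dictionary are assumptions for illustration, not part of the original analysis.

```python
from itertools import combinations
import scipy.stats as stats

# Satisfaction scores per design; position i in each list is the same user
scores = {
    'A': [70, 75, 72, 74, 73, 76, 78, 77, 79, 80, 85, 88, 84, 82, 81,
          87, 86, 89, 90, 91, 83, 78, 80, 85, 87, 89, 90, 88, 85, 87],
    'B': [65, 70, 68, 66, 67, 69, 71, 73, 72, 74, 76, 78, 75, 77, 79,
          80, 82, 84, 81, 83, 85, 87, 89, 86, 88, 90, 92, 94, 93, 91],
    'C': [60, 62, 61, 63, 64, 65, 67, 66, 68, 69, 70, 71, 73, 75, 74,
          76, 78, 77, 79, 80, 82, 81, 84, 86, 85, 87, 88, 89, 90, 91],
}

pairs = list(combinations(scores, 2))  # (A, B), (A, C), (B, C)
results = {}
for first, second in pairs:
    # Paired t-test: compares the two designs within each user
    t_stat, p_raw = stats.ttest_rel(scores[first], scores[second])
    # Bonferroni: multiply by the number of comparisons, cap at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    results[(first, second)] = (t_stat, p_raw, p_adj)
    print(f"{first} vs {second}: t = {t_stat:.2f}, adjusted p = {p_adj:.4g}")
```

Pairs whose adjusted p-value falls below 0.05 are the designs that differ under this (fairly conservative) correction.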

<center>
<h1><b>Thank You</b></h1>
</center>