
# Day 26: Introduction to ANOVA - Understanding Group Differences

Welcome to Day 26! Today, we’re diving into **ANOVA (Analysis of Variance)** — a powerful statistical tool used when comparing the means of three or more groups.

In previous lessons, we've looked at comparing means between two groups using t-tests. But what if you have **three or more groups**? Running multiple t-tests increases the risk of false positives. That's where **ANOVA** comes in.

By the end of this lesson, you will:

- Understand the purpose of ANOVA and when to use it  
- Learn about the assumptions behind ANOVA  
- Visualize and compare group differences using boxplots  
- Perform a one-way ANOVA in Python using real-world data  
- Interpret the results to decide whether there are significant group differences  

**Why this matters:**  
Whether you're analyzing the impact of different teaching strategies, medical treatments, marketing campaigns, or product versions, ANOVA helps answer a fundamental question: do the group means really differ, or could the observed differences be due to chance?

Let’s start by exploring how this method works and when to use it.


## Quick Review from Previous Class  
### Summary of Day 25 (A/B Testing and Statistical Comparison)

In the last class, we focused on **A/B Testing** and its application for comparing two groups, particularly in business and marketing contexts:

- Setting up **hypothesis tests** for A/B comparisons
- Calculating **sample size and power** to detect differences
- Implementing **t-tests** and **proportion tests** to analyze A/B test outcomes
- Visualizing results to make clear and actionable conclusions

We discussed:
- The difference between **paired and unpaired tests** in A/B settings
- How to correctly randomize and assign users to groups
- Interpreting p-values and confidence intervals in A/B testing



## Prepare the data

We are working with a `Diet Study` dataset, which simulates a real-world weight loss intervention experiment.
Each observation corresponds to an individual participant's health measurements and weight before and after undergoing a dietary program for six weeks.
Our goal is to explore the effect of different diets on weight change and assess whether the observed differences are statistically significant.

The dataset includes the following relevant columns:

- Person: Unique identifier for each participant.

- Gender: Binary indicator of gender (0 = Female, 1 = Male).

- Age: Age of the participant in years.

- Height: Participant’s height in centimeters.

- pre.weight: Participant’s body weight before starting the diet, in kilograms.

- Diet: Categorical variable indicating the diet group (1, 2, or 3), representing different dietary programs.

- weight6weeks: Participant’s body weight after completing 6 weeks on the diet, in kilograms.

Load the data into a variable called `diet` from the following URL:

```text
https://raw.githubusercontent.com/liger1apwm/MAT-301_Applied_Stats_Data_Analysis/refs/heads/main/data/Diet_R.csv
```

### Import the Libraries

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Import the dataset

In [None]:
url = 'https://raw.githubusercontent.com/liger1apwm/MAT-301_Applied_Stats_Data_Analysis/refs/heads/main/data/Diet_R.csv'
diet = pd.read_csv(url)
diet.head()

Display summary information for the dataset using the `.info()` method.

Check the gender column and assess any issues within the column (discussion)
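
As a starting point for that discussion, the sketch below inspects a small made-up table rather than the real dataset. The column name `gender` and the single missing value are assumptions for illustration only; what the actual file contains should be checked directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the diet dataset (values are made up)
toy = pd.DataFrame({
    "Person": [1, 2, 3, 4],
    "gender": [0, 1, np.nan, 1],   # one missing value, added on purpose
    "Age": [41, 32, 22, 46],
})

# Column types and non-null counts; a lower non-null count flags missing data
toy.info()

# Count each category, including missing entries
gender_counts = toy["gender"].value_counts(dropna=False)
n_missing = toy["gender"].isna().sum()

print(gender_counts)
print(f"Missing gender values: {n_missing}")
```

A column that should hold only 0 and 1 but shows up with a float or object dtype is often a sign of missing or malformed entries.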

### Data Preparation

- Column Cleaning: Standardizes column names by removing spaces, converting to lowercase, and replacing periods with underscores.

- Weight Loss Calculation: Adds a new column capturing each participant’s weight loss over 6 weeks.
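
The two preparation steps can be sketched on a tiny made-up table. The values and the stray spaces in the column names are hypothetical. `regex=False` is passed explicitly so that the period is treated as a literal character whatever the default of the installed pandas version.

```python
import pandas as pd

# Hypothetical toy rows; column names mimic the untidy originals
toy = pd.DataFrame({
    " Person": [1, 2],
    "pre.weight": [60.0, 72.5],
    "weight6weeks ": [57.5, 70.0],
})

# Strip whitespace, lowercase, and replace literal periods with underscores
toy.columns = (
    toy.columns.str.strip().str.lower().str.replace(".", "_", regex=False)
)

# Weight loss = starting weight minus weight after six weeks
toy["weight_loss"] = toy["pre_weight"] - toy["weight6weeks"]

print(toy)
```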


In [None]:
# Clean column names (remove whitespace, lowercase, fix dots)
diet.columns = diet.columns.str.strip().str.lower().str.replace('.', '_', regex=False)

# Calculate weight loss
diet['weight_loss'] = diet['pre_weight'] - diet['weight6weeks']

### Visualization

- We will make a boxplot that shows the distribution of weight loss across diet groups, highlighting medians, variability, and outliers.

- Interpretation: This plot helps visually compare how much weight was lost in each of the three diet groups.

In [None]:
# Create the boxplot
plt.figure(figsize=(8, 5))
sns.boxplot(x='diet', y='weight_loss', hue='diet', data=diet, palette='pastel', legend=False)

# Add titles and labels
plt.title('Weight Loss by Diet Group (6 Weeks)', fontsize=14)
plt.xlabel('Diet Group')
plt.ylabel('Weight Loss (kg)')
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

We will use ANOVA to determine whether the type of diet plan has a statistically significant effect on weight loss. But first let's understand ANOVA.

## 1. What Is ANOVA?
ANOVA stands for Analysis of Variance. It's a statistical method used to compare the means of three or more groups to determine if at least one group mean is significantly different from the others.

Why not use multiple t-tests?

Using multiple t-tests increases the risk of Type I errors (false positives). ANOVA controls this error rate, providing a more reliable comparison across multiple groups.
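
To see how quickly the risk grows, the sketch below computes the chance of at least one false positive when every pair of groups is compared at α = 0.05. It assumes the tests are independent, which pairwise t-tests on shared groups are not, so the numbers are a rough approximation rather than exact error rates.

```python
from math import comb

alpha = 0.05  # significance level of each individual t-test

# Approximate familywise error rate for k groups compared pairwise
fwer_by_groups = {}
for k in (3, 4, 5):
    n_tests = comb(k, 2)                    # number of pairwise comparisons
    fwer = 1 - (1 - alpha) ** n_tests       # P(at least one false positive)
    fwer_by_groups[k] = fwer
    print(f"{k} groups -> {n_tests} tests -> FWER about {fwer:.3f}")
```

With three groups there are three pairwise tests, and the approximate chance of at least one false positive is already about 0.143 rather than 0.05.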

## 2. When to Use ANOVA
Use ANOVA when:

- You have one categorical independent variable (e.g., teaching method) with three or more groups.

- Your dependent variable is numerical (e.g., test scores).

- You want to test the hypothesis: "Are all group means equal?"

## 3. What Does ANOVA Do?
ANOVA compares:

- Between-group variability: Differences among the group means.

- Within-group variability: Variability of observations within each group.

It calculates an F-statistic, which is the ratio of between-group variance to within-group variance. A larger F-value means the group means are spread out more than the within-group variability alone would explain, which is stronger evidence that at least one group mean differs.
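
The ratio can be worked out by hand to make the two sources of variability concrete. The sketch below uses made-up numbers for three groups, computes the between-group and within-group mean squares, and compares the result with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical measurements for three groups (values are made up)
groups = [
    np.array([2.0, 3.5, 1.5, 4.0, 3.0]),
    np.array([3.0, 2.5, 4.5, 3.5, 2.0]),
    np.array([5.5, 6.0, 4.0, 7.0, 5.0]),
]

k = len(groups)                              # number of groups
n_total = sum(len(g) for g in groups)        # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares divide each sum of squares by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

f_scipy, p_scipy = f_oneway(*groups)
print(f"Manual F: {f_manual:.4f}")
print(f"SciPy F:  {f_scipy:.4f} (p = {p_scipy:.4f})")
```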

## 4. Assumptions of ANOVA
Before applying ANOVA, ensure that:

- Independence: Observations are independent within and across groups.

- Normality: The data in each group are approximately normally distributed.

- Homogeneity of variances: The variances among the groups are approximately equal.
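
Two of these assumptions can be screened with standard tests. The sketch below is one possible approach, not the only one: it applies the Shapiro-Wilk test (`scipy.stats.shapiro`) to each group for normality and Levene's test (`scipy.stats.levene`) for equal variances. The data are simulated, and the group names are hypothetical. Independence comes from the study design and cannot be checked this way.

```python
import numpy as np
from scipy.stats import levene, shapiro

# Simulated data for three hypothetical groups (not the diet dataset)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=3.0, scale=2.0, size=25),
    "B": rng.normal(loc=3.2, scale=2.0, size=25),
    "C": rng.normal(loc=5.0, scale=2.0, size=25),
}

# Normality: Shapiro-Wilk per group (a small p-value suggests non-normality)
shapiro_pvalues = {}
for name, values in groups.items():
    stat, p = shapiro(values)
    shapiro_pvalues[name] = p
    print(f"Group {name}: Shapiro-Wilk p = {p:.3f}")

# Equal variances: Levene's test (a small p-value suggests unequal variances)
levene_stat, levene_p = levene(*groups.values())
print(f"Levene p = {levene_p:.3f}")
```

These tests are a screening aid; plots such as boxplots or Q-Q plots are usually worth checking alongside them.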

## 5. ANOVA (One-Way ANOVA) Application


**One-Way ANOVA** (Analysis of Variance) is a statistical technique used to determine whether the **means of three or more independent groups** are significantly different. It applies when you have **one independent categorical variable** with **three or more levels** and **one continuous dependent variable**. (With only two levels, it gives the same conclusion as an equal-variance two-sample t-test.)

### Why Use One-Way ANOVA Instead of Multiple T-Tests?  
Using multiple t-tests increases the risk of **Type I error** (false positives). ANOVA overcomes this by comparing **all group means simultaneously**, thus controlling the error rate.


To perform a one-way ANOVA in Python, we use the `f_oneway` function from the `scipy.stats` module.

```python
from scipy.stats import f_oneway

```

What does f_oneway do?
f_oneway performs a one-way ANOVA test.

- It compares the means of two or more independent groups to determine if there is a statistically significant difference among them.

- It returns two values:
  -  F-statistic: A number that indicates the ratio of variance between the groups to the variance within the groups.

  - p-value: Tells us whether the observed differences between group means are statistically significant.

Example:
```python
f_stat, p_value = f_oneway(group1, group2, group3)

```

Here, group1, group2, and group3 are arrays or lists of numerical values (e.g., test scores) from different groups.

----

#### Hypotheses for One-Way ANOVA (Comparing Weight Loss Across Diet Groups)

- Null Hypothesis (H₀): The mean weight loss is the same across all three diet groups.
$$H_0: \mu_1 = \mu_2 = \mu_3$$
- Alternative Hypothesis (H₁): At least one diet group has a different mean weight loss from the others.
$$H_1: \text{At least one } \mu_i \text{ differs from the others}$$

Interpretation:
- If p_value < 0.05: There is a statistically significant difference between at least one pair of group means.

- If p_value >= 0.05: We do not have enough evidence to say the group means are different.
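
The decision rule above can be sketched on three tiny made-up groups. The numbers are chosen so the arithmetic is easy to follow and are not taken from any dataset.

```python
from scipy.stats import f_oneway

# Hypothetical groups with means 2, 3, and 6 (values are made up)
group1 = [1, 2, 3]
group2 = [2, 3, 4]
group3 = [5, 6, 7]

alpha = 0.05
f_stat, p_value = f_oneway(group1, group2, group3)

# Reject the null hypothesis of equal means when p falls below alpha
reject_null = p_value < alpha

print(f"F-statistic: {f_stat:.2f}")   # between MS = 13, within MS = 1, so F = 13.00
print(f"p-value: {p_value:.4f}")      # about 0.0066
print("Reject H0" if reject_null else "Fail to reject H0")
```

Here the third group sits well above the other two relative to the spread within each group, so the null hypothesis of equal means is rejected at the 0.05 level.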


We will create the groups and run ANOVA to interpret the results!

In [None]:
from scipy.stats import f_oneway

# Group the data by 'diet' and collect the 'weight_loss' values
groups = diet.groupby('diet')['weight_loss'].apply(list)


# Run one-way ANOVA
f_stat, p_value = f_oneway(*groups)

# Display results
print(f"F-statistic: {f_stat:.2f}")
print(f"p-value: {p_value:.4f}")

Interpretation:

- Since the p-value is less than 0.05, we reject the null hypothesis.
- There is a statistically significant difference in weight loss between at least one pair of diet groups.

## 6. What’s Next If ANOVA Is Significant?
ANOVA indicates that at least one group mean is different but doesn't specify which groups differ.

Next step:
- Perform a post-hoc test, such as Tukey’s Honest Significant Difference (HSD), to identify which specific group means differ.



To perform post-hoc analysis after a significant ANOVA result, we use the `pairwise_tukeyhsd` function from the `statsmodels` library.

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd
```

What is pairwise_tukeyhsd?
- This function performs Tukey’s Honest Significant Difference (HSD) test, a post-hoc test used after ANOVA.

- It checks all possible pairwise comparisons between group means to identify which ones are significantly different.

- It controls the familywise error rate (FWER), the probability of making at least one Type I error across all the pairwise comparisons.

In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [None]:
# Run Tukey's HSD test
tukey_result = pairwise_tukeyhsd(endog=diet['weight_loss'],
                                 groups=diet['diet'],
                                 alpha=0.05)

# Print the result
print(tukey_result)

#### Interpretation:

- There is **no significant difference** in weight loss between **Diet 1 and Diet 2**.
- **Diet 3 leads to significantly more weight loss** compared to both **Diet 1** and **Diet 2**.
- The results suggest that **Diet 3 may be the most effective** in terms of weight reduction over the 6-week period.

#### Conclusion:
Tukey’s HSD helps pinpoint **which groups are different**, not just that a difference exists. This is essential when an ANOVA test is significant.
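
To pull the significant pairs out of a Tukey result in code rather than reading the printed table, one option is sketched below on made-up data with hypothetical group labels `A`, `B`, and `C`. It assumes the `reject` flags on the result follow the same pair order as the printed table (A-B, A-C, B-C), which is worth confirming against the output of `print(result)`.

```python
from itertools import combinations

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical data: groups A and B are close, group C is clearly higher
values = np.array([1, 2, 3, 2, 3, 4, 5, 6, 7], dtype=float)
labels = np.array(["A"] * 3 + ["B"] * 3 + ["C"] * 3)

result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(result)

# Pair each reject flag with its two group labels
pairs = list(combinations(result.groupsunique, 2))
significant_pairs = [
    (str(a), str(b)) for (a, b), rejected in zip(pairs, result.reject) if rejected
]
print("Significantly different pairs:", significant_pairs)
```

In this toy example only the comparisons involving group C are flagged, which mirrors the pattern described in the interpretation above.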

#### **Try it yourself:**



Load the same restaurant transaction dataset used in the Day 24 notebook:

In [None]:
tips_data = pd.read_csv("https://raw.githubusercontent.com/liger1apwm/MAT-301_Applied_Stats_Data_Analysis/refs/heads/main/data/tips_feedback.csv")

day_mapping = {"Fri":"Friday",
               "Sat":"Saturday",
               "Mon":"Monday",
               "Thur":"Thursday",
               "Tues":"Tuesday",
               "Weds":"Wednesday",
               "Sun":"Sunday"
}

tips_data['day'] = tips_data['day'].map(day_mapping)

tips_data.head()

The restaurant management wants to determine whether customer satisfaction varies depending on the party size. They hypothesize that larger groups may have a different dining experience compared to smaller groups, potentially affecting satisfaction scores.

Research Question:
Is there a statistically significant difference in customer satisfaction based on the size of the party?

<details><summary>Answer:</summary>

```python
# Group the data by 'size' and collect the 'satisfaction' values into lists
groups = tips_data.groupby('size')['satisfaction'].apply(list)

# Run one-way ANOVA
f_stat, p_value = f_oneway(*groups)

# Display results
print(f"F-statistic: {f_stat:.2f}")
print(f"p-value: {p_value:.4f}")
```
</details>

Use Tukey’s Honest Significant Difference (HSD) test to check which party sizes, if any, differ from each other:

<details><summary>Answer:</summary>

```python
# Run Tukey's HSD test
tukey_result = pairwise_tukeyhsd(endog=tips_data['satisfaction'],
                                 groups=tips_data['size'],
                                 alpha=0.05)

# Print the result
print(tukey_result)
```
</details>

## 7. Summary: Understanding ANOVA

- **ANOVA (Analysis of Variance)** is a statistical method used to compare the means of **three or more groups**.

- It tests the **null hypothesis** that all group means are equal, by analyzing the ratio of **between-group variability** to **within-group variability**.

- A **significant ANOVA result** (p < 0.05) indicates that **at least one group mean is different**, but it does **not** tell us which specific groups differ.

- **Key Assumptions** to check before applying ANOVA:
  - **Independence**: Observations in each group should be independent.
  - **Normality**: Data in each group should be approximately normally distributed.
  - **Homogeneity of variance**: Variances across groups should be similar.

- If the ANOVA result is significant, use **post-hoc tests** like **Tukey’s HSD** to determine **which pairs of group means differ significantly**.

- ANOVA helps avoid the increased Type I error risk that comes with running multiple t-tests across several groups.
