# 🧮 One-Way ANOVA (Analysis of Variance)

**One-Way ANOVA** is a statistical method used to determine whether there are any **significant differences between the means** of three or more independent groups.  

It tests whether all group means are **equal** in the population, or whether **at least one group mean** differs from the others.

You typically use One-Way ANOVA when:
- You have **one categorical independent variable** (that divides data into groups), and  
- **One continuous dependent variable** (whose mean you want to compare).

**Example:**  
Suppose you want to test whether three different teaching methods lead to different average test scores among students.  
One-Way ANOVA can tell you whether the difference in mean scores is statistically significant or just due to random variation.


## 🧠 Hypothesis in One-Way ANOVA

In One-Way ANOVA, we test whether all group means are equal or at least one group mean is different.

### 🔹 Null Hypothesis (H₀)
All population means are equal.  
$$
H_0 : \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k
$$

### 🔹 Alternative Hypothesis (H₁)
At least one group mean is different.  
$$
H_1 : \mu_i \neq \mu_j \text{ for at least one pair } i \neq j
$$

---

## ⚙️ Steps to Perform One-Way ANOVA

1. **Calculate the Mean of Each Group**
   $$
   \bar{X_i} = \frac{\sum_{j=1}^{n_i} X_{ij}}{n_i}
   $$
   where $n_i$ is the number of observations in group $i$.

2. **Calculate the Grand Mean (Overall Mean)**
   $$
   \bar{X} = \frac{\sum X_{ij}}{N}
   $$
   where $N$ is the total number of observations across all groups.

3. **Compute the Between-Group Sum of Squares (SSB)**
   $$
   SS_B = \sum n_i (\bar{X_i} - \bar{X})^2
   $$

4. **Compute the Within-Group Sum of Squares (SSW)**
   $$
   SS_W = \sum \sum (X_{ij} - \bar{X_i})^2
   $$

5. **Calculate Degrees of Freedom**
   $$
   df_B = k - 1 \quad \text{and} \quad df_W = N - k
   $$
   where $k$ = number of groups.

6. **Compute Mean Squares**
   $$
   MS_B = \frac{SS_B}{df_B} \quad \text{and} \quad MS_W = \frac{SS_W}{df_W}
   $$

7. **Calculate the F-statistic**
   $$
   F = \frac{MS_B}{MS_W}
   $$

8. **Find the p-value**
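   Find the probability of observing an F-value at least as large as the calculated one under the F-distribution with $df_B$ and $df_W$ degrees of freedom.  
   If $p < 0.05$ (the conventional significance level), reject $H_0$; otherwise, fail to reject $H_0$.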
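
The eight steps above can be sketched with NumPy on a small made-up dataset. The scores below are hypothetical values chosen only for illustration, and `one_way_anova` is a helper name introduced here, not part of any library. `scipy.stats.f.sf` supplies the upper-tail probability for step 8.

```python
import numpy as np
from scipy import stats

def one_way_anova(groups):
    """Return (F, p, df_between, df_within) for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    group_means = [g.mean() for g in groups]                  # step 1
    grand_mean = np.concatenate(groups).mean()                # step 2

    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))    # step 3
    ss_within = sum(((g - m) ** 2).sum()
                    for g, m in zip(groups, group_means))     # step 4

    df_between, df_within = k - 1, n_total - k                # step 5
    ms_between = ss_between / df_between                      # step 6
    ms_within = ss_within / df_within
    f_stat = ms_between / ms_within                           # step 7
    p_value = stats.f.sf(f_stat, df_between, df_within)       # step 8
    return f_stat, p_value, df_between, df_within

# Hypothetical test scores for three teaching methods
method_a = [78, 82, 85, 80, 79]
method_b = [88, 90, 85, 91, 87]
method_c = [70, 75, 72, 68, 74]

f_stat, p_value, df_b, df_w = one_way_anova([method_a, method_b, method_c])
print(f"F({df_b}, {df_w}) = {f_stat:.3f}, p = {p_value:.5f}")
```

The result can be compared with `scipy.stats.f_oneway(method_a, method_b, method_c)`, which should give the same F-statistic and p-value up to floating-point error.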

---

## 🧾 Decision Rule
- If $F_{calculated} > F_{critical}$ (equivalently, $p < 0.05$) → **Reject H₀** (at least one mean differs).  
- If $F_{calculated} \leq F_{critical}$ (equivalently, $p \geq 0.05$) → **Fail to Reject H₀** (there is not enough evidence that the means differ).
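
As a minimal sketch of the critical-value route, `scipy.stats.f.ppf` returns $F_{critical}$ for a chosen significance level. The values of `alpha`, `df_between`, `df_within`, and `f_calculated` below are illustrative assumptions, not results from this notebook.

```python
from scipy.stats import f

alpha = 0.05
df_between, df_within = 2, 27     # e.g. k = 3 groups, N = 30 observations
f_calculated = 4.2                # hypothetical F-statistic

# Critical value: the F-value that leaves probability alpha in the upper tail
f_critical = f.ppf(1 - alpha, df_between, df_within)
# p-value: upper-tail probability beyond the calculated F-value
p_value = f.sf(f_calculated, df_between, df_within)

# Both criteria lead to the same decision
reject_by_f = f_calculated > f_critical
reject_by_p = p_value < alpha
print(f"F critical = {f_critical:.3f}, p = {p_value:.4f}, reject H0: {reject_by_f}")
```

The two comparisons are equivalent because the p-value falls below `alpha` exactly when the calculated F-value exceeds the critical value.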

---

## 📊 Interpretation
A significant F-value means that **not all group means are the same**,  
but it does not show **which groups differ**.  
For that, you perform a **post-hoc test** such as **Tukey's HSD**.
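
A post-hoc comparison might look like the sketch below, which uses `scipy.stats.tukey_hsd` (available in recent SciPy releases). The three samples are hypothetical scores made up for illustration, and the group labels in `names` are assumptions introduced here.

```python
from scipy.stats import tukey_hsd

# Hypothetical scores for three groups
group_a = [78, 82, 85, 80, 79]
group_b = [88, 90, 85, 91, 87]
group_c = [70, 75, 72, 68, 74]

result = tukey_hsd(group_a, group_b, group_c)

# result.pvalue[i, j] is the adjusted p-value for comparing group i with group j
names = ["A", "B", "C"]
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: p = {result.pvalue[i, j]:.4f}")
```

Each pairwise p-value below the chosen significance level points to a pair of groups whose means differ.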


In [34]:
import pandas as pd
import seaborn as sns
import numpy as np

In [35]:
data = sns.load_dataset('titanic')

## 🚢 Dataset Used: Titanic Dataset

In this analysis, we are using the **Titanic dataset**, which contains information about the passengers who were aboard the Titanic ship.  
The dataset includes various attributes such as passenger class, age, gender, survival status, fare, and more.

### 🎯 Objective
We want to test whether the **average age of passengers differs across different passenger classes** (1st, 2nd, and 3rd class).

In other words, we are checking if the **passenger class has a significant effect on the average age** of passengers.

### 💡 Explanation
Here:
- The **independent variable (factor)** is the **passenger class** (categorical variable with 3 groups: 1st, 2nd, and 3rd class).  
- The **dependent variable** is the **age** of the passengers (a continuous variable).

So, we are applying **One-Way ANOVA** to compare the **mean ages** among the three passenger classes.

### 🧠 Hypothesis

**Null Hypothesis (H₀):**  
The mean age of passengers is **the same** in all three classes.  
$$
H_0 : \mu_{1st} = \mu_{2nd} = \mu_{3rd}
$$

**Alternative Hypothesis (H₁):**  
The mean age of **at least one** passenger class differs from the others.  
$$
H_1 : \mu_i \neq \mu_j \text{ for at least one pair } i \neq j
$$

If the ANOVA gives a **p-value less than 0.05**, we reject the null hypothesis and conclude that **mean age differs between at least two passenger classes**.


In [36]:
data.head()

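|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|----------|--------|-----|-----|-------|-------|------|----------|-------|-----|------------|------|-------------|-------|-------|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |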


In [37]:
data.isna().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [38]:
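# Note: dropna() with no arguments removes every row that has a missing value
# in any column (including deck), not just rows with a missing age.
data = data.dropna()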
data.isna().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64
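
Because `deck` is missing for 688 rows, calling `dropna()` without arguments discards most of the dataset, even though only `age` and `pclass` are needed for this test. One possible alternative is to drop rows only where the analysis columns are missing. The sketch below uses a small hypothetical DataFrame (`toy`) rather than the Titanic data, so the effect is easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic columns used in this analysis
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "age":    [38.0, np.nan, 27.0, 22.0, 26.0],
    "deck":   ["C", "C", np.nan, np.nan, np.nan],
})

# Removes rows with a missing value in any column
dropped_any = toy.dropna()
# Removes rows only when age or pclass is missing
dropped_subset = toy.dropna(subset=["age", "pclass"])

print(len(dropped_any), len(dropped_subset))
```

Here the unrestricted call keeps only 1 of the 5 rows, while the subset call keeps 4. Applying `subset=["age", "pclass"]` to the Titanic data would keep more rows than the 182 used below, so the ANOVA results would differ from the ones in this notebook.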

In [39]:
dic = {'p_1':data[data['pclass'] == 1]['age'], 
       'p_2':data[data['pclass'] == 2]['age'],
       'p_3':data[data['pclass'] == 3]['age']}


In [None]:
from scipy.stats import f

def test(groups: dict):
    """Run a one-way ANOVA on a dict mapping group name -> series of values."""

    # Step 1: mean of each group
    groups_mean = {}
    for name, value in groups.items():
        groups_mean[name] = value.mean()

    # Step 2: grand mean over all observations
    grand_mean = count = 0
    for values in groups.values():
        for i in values:
            grand_mean += i
            count += 1
    grand_mean = grand_mean / count

    # Step 3: between-group sum of squares (SSB)
    ssb = 0
    for group_name, value in groups.items():
        ssb += len(value) * (groups_mean[group_name] - grand_mean) ** 2

    # Step 4: within-group sum of squares (SSW)
    ssw = 0
    for name, value in groups.items():
        for i in value:
            ssw += (i - groups_mean[name]) ** 2

    # Step 5: degrees of freedom
    ssw_df = count - len(groups)
    ssb_df = len(groups) - 1

    # Steps 6-7: mean squares and F-statistic
    msw = ssw / ssw_df
    msb = ssb / ssb_df
    f_stats = msb / msw

    # Step 8: p-value (upper-tail probability of the F-distribution)
    p_value = 1 - f.cdf(f_stats, ssb_df, ssw_df)

    print("===== One Way Anova =====")
    print("No. Groups:", len(groups))
    print("No. Observations:", count)
    print("========")
    print("Group means")
    for i, j in groups_mean.items():
        print("group name:", i, "group mean:", j)
    print("========")
    print("grand mean:", grand_mean)
    print("Sum of Squares Within:", ssw)
    print("Sum of Squares Between:", ssb)
    print("Within-group degrees of freedom:", ssw_df)
    print("Between-group degrees of freedom:", ssb_df)
    print("========")
    print("f-statistic", f_stats)
    print("p-value", p_value)
    print("========")
    print("===== Interpretation =====")
    if p_value < 0.05:
        print("H₁ wins! → There is a significant difference between the group means.")
    else:
        print("No significant difference found between the group means.")

test(dic)

===== One Way Anova =====
No. Groups: 3
No. Observations: 182
========
Group means
group name: p_1 group mean: 37.54407643312102
group name: p_2 group mean: 25.266666666666666
group name: p_3 group mean: 21.0
========
grand mean: 35.62318681318681
Sum of Squares Within: 40126.974724416126
Sum of Squares Between: 4326.539827232217
Within-group degrees of freedom: 179
Between-group degrees of freedom: 2
========
f-statistic 9.650000210498495
p-value 0.00010470399891171489
========
===== Interpretation =====
H₁ wins! → There is a significant difference between the group means.
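
The hand-written function can be cross-checked against `scipy.stats.f_oneway`, which performs the same one-way ANOVA. The sketch below wraps it in a hypothetical helper, `anova_by_group`, that builds the groups with `groupby`. The `toy` DataFrame holds made-up ages for illustration only, not Titanic values.

```python
import pandas as pd
from scipy.stats import f_oneway

def anova_by_group(df, factor, value):
    """Run SciPy's one-way ANOVA of `value` across the levels of `factor`."""
    # One array of observations per factor level, ignoring missing values
    samples = [g[value].dropna().to_numpy() for _, g in df.groupby(factor)]
    return f_oneway(*samples)

# Hypothetical ages, for illustration only
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "age":    [40.0, 45.0, 38.0, 30.0, 28.0, 33.0, 22.0, 25.0, 19.0],
})

result = anova_by_group(toy, "pclass", "age")
print(f"F = {result.statistic:.3f}, p = {result.pvalue:.5f}")
```

In this notebook, `anova_by_group(data, "pclass", "age")` should reproduce the F-statistic and p-value printed above up to floating-point error, since it runs the same test on the same rows.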
