# **SHETH L.U.J. & SIR M.V. COLLEGE**
**Swati Mahajan | T093**
## **Practical No. 5**

**Aim** :- ANOVA (Analysis of Variance)
*  Perform one-way ANOVA to compare means across multiple groups.
*  Conduct post-hoc tests to identify significant differences between group means.


## Data Loading and Setup

In [1]:
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the dataset
try:
    # encoding='latin1' avoids a UnicodeDecodeError when reading this file
    df = pd.read_csv('Datasets/Global_Superstore2.csv', encoding='latin1')
    print("Dataset 'Global_Superstore2.csv' loaded successfully.")
    print("Shape of the dataset:", df.shape)
except FileNotFoundError:
    print("Error: 'Datasets/Global_Superstore2.csv' not found. Make sure the file is in the correct directory.")
    df = pd.DataFrame() # Create an empty DataFrame to prevent further errors

Dataset 'Global_Superstore2.csv' loaded successfully.
Shape of the dataset: (51290, 24)


## 1. One-Way ANOVA (F-Test)

**Hypothesis:** Is there a significant difference in Sales among the different Regions?

This test determines whether mean Sales is the same across all geographical regions (the null hypothesis) or whether at least one region's mean differs. A significant result does not say which regions differ; identifying them requires a post-hoc test.
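To make the F-test concrete, here is a minimal sketch that computes the one-way ANOVA F statistic by hand (between-group mean square divided by within-group mean square) and compares it with `scipy.stats.f_oneway`. The data are synthetic and purely illustrative; the group means, spread, and sample sizes are assumptions, not values from `Global_Superstore2.csv`.

```python
import numpy as np
from scipy import stats

# Illustrative synthetic data (NOT taken from Global_Superstore2.csv):
# three groups of "sales" values with assumed means 100, 105 and 120.
rng = np.random.default_rng(0)
samples = [rng.normal(loc=mu, scale=10.0, size=50) for mu in (100.0, 105.0, 120.0)]

k = len(samples)                         # number of groups
n_total = sum(len(s) for s in samples)   # total number of observations
grand_mean = np.concatenate(samples).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

# Mean squares, F statistic, and upper-tail p-value of the F distribution
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Reference result from SciPy
f_scipy, p_scipy = stats.f_oneway(*samples)

print(f"manual: F = {f_manual:.4f}, p = {p_manual:.3e}")
print(f"scipy : F = {f_scipy:.4f}, p = {p_scipy:.3e}")
```

The two lines should agree up to floating-point rounding, which shows that `f_oneway` is the ratio of between-group to within-group variability.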

In [2]:
if not df.empty and df['Region'].nunique() >= 2:
    # Prepare the data for ANOVA
    # Collect one Series of Sales values per Region
    sales_by_region = [sales for _, sales in df.groupby('Region')['Sales']]

    # Perform one-way ANOVA by unpacking the per-region samples
    F, p = stats.f_oneway(*sales_by_region)

    print("Hypothesis: Are Sales different across Regions?")
    print("p-value for significance is: ", p)

    if p < 0.05:
        print("Conclusion: Reject the null hypothesis; there is a significant difference in Sales across Regions.")
    else:
        print("Conclusion: Fail to reject the null hypothesis; there is no significant difference in Sales across Regions.")

Hypothesis: Are Sales different across Regions?
p-value for significance is:  7.28547664877952e-134
Conclusion: Reject the null hypothesis; there is a significant difference in Sales across Regions.
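The aim also calls for a post-hoc test to find which group means differ once the ANOVA is significant. One common choice is Tukey's HSD, available in statsmodels as `pairwise_tukeyhsd`. The sketch below runs it on a small synthetic frame; the `demo` DataFrame, its three region names, and its values are hypothetical stand-ins rather than the Superstore data.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical demo data (NOT the Superstore dataset): three regions,
# two with similar assumed means and one clearly higher.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Region": np.repeat(["North", "South", "West"], 60),
    "Sales": np.concatenate([
        rng.normal(100.0, 10.0, 60),
        rng.normal(101.0, 10.0, 60),
        rng.normal(130.0, 10.0, 60),
    ]),
})

# Tukey's HSD compares every pair of groups while controlling the
# family-wise error rate at alpha.
tukey = pairwise_tukeyhsd(endog=demo["Sales"], groups=demo["Region"], alpha=0.05)

# One row per pair: mean difference, adjusted p-value, confidence interval,
# and whether the null hypothesis of equal means is rejected.
print(tukey.summary())
```

On the notebook's data the equivalent call would presumably be `pairwise_tukeyhsd(endog=df['Sales'], groups=df['Region'])`. With 13 regions this would give 78 pairwise comparisons, so the summary table would be long.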


## 2. Two-Way ANOVA (F-Test)

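**Hypothesis:** Do Sales depend on Region, on customer Segment, or on an interaction between the two? The formula term `C(Region) * C(Segment)` fits both main effects plus their interaction, so each gets its own F-test.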

In [3]:
if not df.empty and df['Region'].nunique() >= 2 and df['Segment'].nunique() >= 2:
    # Fit the ANOVA model using the Ordinary Least Squares (ols) function
    # C() indicates that the variable should be treated as categorical
    model = ols('Sales ~ C(Region) * C(Segment)', data=df).fit()
    
    print(f"Overall model F({model.df_model:.0f}, {model.df_resid:.0f}) = {model.fvalue:.3f}, p = {model.f_pvalue:.4f}")
    
    # Perform two-way ANOVA and print the results table
    res = sm.stats.anova_lm(model, typ=2)
    print("\nTwo-Way ANOVA Results:")
    print(res)

Overall model F(38, 51251) = 17.869, p = 0.0000

Two-Way ANOVA Results:
                            sum_sq       df          F         PR(>F)
C(Region)             1.562991e+08     12.0  55.475822  7.666667e-134
C(Segment)            8.032951e+04      2.0   0.171070   8.427632e-01
C(Region):C(Segment)  3.056067e+06     24.0   0.542351   9.658303e-01
Residual              1.203300e+10  51251.0        NaN            NaN
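In the table above, Region is highly significant while Segment (p ≈ 0.84) and the Region × Segment interaction (p ≈ 0.97) are not. With more than 51,000 rows, even small differences can give tiny p-values, so it helps to look at an effect size as well. The sketch below computes an approximate eta-squared (each effect's sum of squares divided by the sum of all rows) from the `sum_sq` values printed above. The `res_demo` frame is a hand-copied stand-in for `res`. Because Type II sums of squares need not add up exactly to the total in an unbalanced design, treat the figures as rough.

```python
import pandas as pd

# Sums of squares copied by hand from the Two-Way ANOVA table above
res_demo = pd.DataFrame(
    {"sum_sq": [1.562991e08, 8.032951e04, 3.056067e06, 1.203300e10]},
    index=["C(Region)", "C(Segment)", "C(Region):C(Segment)", "Residual"],
)

# Approximate eta-squared: share of the summed sums of squares per effect
ss_total = res_demo["sum_sq"].sum()
res_demo["eta_sq"] = res_demo["sum_sq"] / ss_total

# Show the effects only (the Residual row is not an effect)
print(res_demo["eta_sq"].drop("Residual").round(4))
```

Under this approximation Region accounts for only about 1.3% of the variation in Sales, and Segment and the interaction for far less. The Region effect is statistically significant but small.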
