Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

In [1]:
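# ANOVA assumes:

# Independence of observations – Each observation is unrelated to every other, both within and between groups.

# Normality – The residuals (equivalently, the scores within each group) should be approximately normally distributed.

# Homogeneity of variances (homoscedasticity) – The population variance should be the same in every group.

# Examples of Violations:

# Using the same subjects in multiple groups (violates independence; the p-value can no longer be trusted).

# Strongly skewed or heavy-tailed data, such as incomes or reaction times (violates normality, mainly a concern with small samples).

# One group with a much larger spread than the others (violates homogeneity; most harmful when group sizes are unequal).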
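
The normality and equal-variance assumptions can be checked numerically. The following is a minimal sketch, assuming a small illustrative sample of three groups (the values are made up for demonstration); it uses the Shapiro-Wilk test on the residuals and Levene's test across groups, both from scipy.stats.

```python
import numpy as np
from scipy import stats

# Illustrative scores for three groups (assumed values, for demonstration only)
groups = {
    "A": np.array([22, 23, 25, 20, 21]),
    "B": np.array([30, 28, 27, 29, 26]),
    "C": np.array([35, 33, 32, 34, 31]),
}

# Residuals: each observation minus its own group mean
residuals = np.concatenate([g - g.mean() for g in groups.values()])

# Shapiro-Wilk: the null hypothesis is that the residuals are normally distributed
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: the null hypothesis is that all groups have equal variance
levene_stat, levene_p = stats.levene(*groups.values())

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value:       {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may not hold. Independence cannot be tested this way; it has to be ensured by the study design.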

Q2. What are the three types of ANOVA, and in what situations would each be used?

In [2]:
# One-way ANOVA: Used when comparing means of 3+ groups based on one independent variable.

# Example: Compare test scores across three teaching methods.

# Two-way ANOVA: Used when the outcome is compared across two independent variables (factors), including their possible interaction.

# Example: Test scores based on teaching method and student gender.

# Repeated Measures ANOVA: Used when the same subjects are measured multiple times.

# Example: Blood pressure measured before, during, and after treatment.
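
As a rough illustration of the repeated-measures case, the sketch below fits a one-factor repeated measures ANOVA with statsmodels' AnovaRM. The blood pressure readings and the column names (subject, time, bp) are hypothetical, chosen only to mirror the example above.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical data: 5 subjects, each measured at 3 time points
bp = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5] * 3,
    "time": ["before"] * 5 + ["during"] * 5 + ["after"] * 5,
    "bp": [140, 138, 150, 145, 142,
           135, 132, 144, 139, 136,
           128, 127, 137, 133, 130],
})

# 'time' is the within-subject factor; every subject has one reading per level
result = AnovaRM(data=bp, depvar="bp", subject="subject", within=["time"]).fit()
table = result.anova_table
print(table)
```

AnovaRM expects a balanced design, that is, exactly one observation per subject for each level of the within-subject factor.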

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [4]:
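# ANOVA splits total variability into two additive parts, SST = SSE + SSR:
# SST (Total Sum of Squares): Total variation of all observations around the grand mean.
# SSE (Explained Sum of Squares, also written SSB for "between"): Variation of the group means around the grand mean.
# SSR (Residual Sum of Squares, also called within-group or error): Variation of observations around their own group mean.
# Note: some textbooks use SSE for the error term instead; here SSE means "explained", as in Q4.
# It's important because the F-statistic is the ratio of the explained mean square to the residual mean square,
# so the partition shows where the variation lies and whether group means differ by more than within-group noise.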

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [5]:
import pandas as pd
import numpy as np
from scipy import stats

# Sample data
data = {
    'group': ['A']*5 + ['B']*5 + ['C']*5,
    'score': [22, 23, 25, 20, 21, 30, 28, 27, 29, 26, 35, 33, 32, 34, 31]
}
df = pd.DataFrame(data)

# Overall mean
grand_mean = df['score'].mean()

# SST
sst = sum((df['score'] - grand_mean) ** 2)

# Group means
group_means = df.groupby('group')['score'].mean()

# SSE (between groups)
sse = sum(df.groupby('group').size() * (group_means - grand_mean)**2)

# SSR (within groups)
df['group_mean'] = df['group'].map(group_means)
ssr = sum((df['score'] - df['group_mean']) ** 2)

print(f"SST: {sst}, SSE (Explained): {sse}, SSR (Residual): {ssr}")


SST: 326.93333333333334, SSE (Explained): 292.1333333333334, SSR (Residual): 34.8
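
The sums of squares lead directly to the F-statistic. As a sketch, using the same scores as the sample data above, the snippet below computes F from the partition by hand and compares it with scipy.stats.f_oneway; the variable names are chosen for illustration.

```python
import numpy as np
from scipy import stats

# Same scores as the sample data, one array per group
groups = [
    np.array([22, 23, 25, 20, 21], dtype=float),  # group A
    np.array([30, 28, 27, 29, 26], dtype=float),  # group B
    np.array([35, 33, 32, 34, 31], dtype=float),  # group C
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Explained (between-group) and residual (within-group) sums of squares
ss_explained = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_residual = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, n - k within groups
k, n = len(groups), len(all_scores)
df_explained, df_residual = k - 1, n - k

# F is the ratio of the two mean squares
f_manual = (ss_explained / df_explained) / (ss_residual / df_residual)
p_manual = stats.f.sf(f_manual, df_explained, df_residual)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"Manual: F = {f_manual:.4f}, p = {p_manual:.3g}")
print(f"SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.3g}")
```

Both routes should agree up to floating-point rounding, which is a quick way to confirm that the partition was computed correctly.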


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [7]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

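# Example data: a balanced 2x2 design with 10 observations per cell.
# The two factors must be crossed (every program occurs with every
# experience level). If both factors change in lockstep, their effects
# cannot be separated and the ANOVA table comes back as NaN.
np.random.seed(42)  # fixed seed so the random scores are reproducible
df = pd.DataFrame({
    'score': np.random.randint(60, 100, size=40),
    'program': ['A', 'B'] * 20,
    'experience': ['novice'] * 20 + ['experienced'] * 20
})

# Main effects of each factor plus the program x experience interaction
model = ols('score ~ C(program) + C(experience) + C(program):C(experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)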


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In [8]:
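# F-statistic = 5.23, p-value = 0.02

# Since p = 0.02 < 0.05 (the usual significance level), we reject the null hypothesis that all group means are equal.

# Conclusion: At least one group mean differs significantly from the others.
# The F-test does not say which groups differ; a post-hoc test such as Tukey's HSD is needed for that.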
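
A significant F-test is usually followed by pairwise comparisons. The sketch below runs Tukey's HSD with scipy.stats.tukey_hsd (available in SciPy 1.8 and later) on three illustrative groups; the scores are assumed sample values and are not the data behind the F = 5.23 result above.

```python
import numpy as np
from scipy import stats

# Illustrative scores for three groups (assumed values)
a = np.array([22, 23, 25, 20, 21], dtype=float)
b = np.array([30, 28, 27, 29, 26], dtype=float)
c = np.array([35, 33, 32, 34, 31], dtype=float)
labels = ["A", "B", "C"]

# Tukey's HSD compares every pair of groups while controlling
# the family-wise error rate
res = stats.tukey_hsd(a, b, c)

# res.statistic[i, j] is mean(group i) - mean(group j);
# res.pvalue[i, j] is the adjusted p-value for that pair
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"{labels[i]} vs {labels[j]}: "
              f"mean diff = {res.statistic[i, j]:.2f}, "
              f"p = {res.pvalue[i, j]:.4f}")
```

Pairs with an adjusted p-value below the chosen significance level are the ones whose means differ.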

