Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.
Assumptions for ANOVA:

Independence of Observations: Each group must be composed of independent observations. For example, one person’s score should not influence another person’s score.

Violation Example: If participants in one group are friends and discuss the experiment, this could influence their responses.
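Normality: The data within each group (equivalently, the model residuals) should be approximately normally distributed.

Violation Example: If the data are heavily skewed or contain extreme outliers, the p-value from the F-test may be unreliable, especially when groups are small or unequal in size.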
Homogeneity of Variances (Homoscedasticity): The variance among the groups should be approximately equal.

Violation Example: If one group has a much larger variance than the others, the assumption is violated, leading to potentially invalid results.
Impact of Violations:

Independence Violation: Can lead to underestimated variability and inflated Type I error rates.
Normality Violation: ANOVA is robust to moderate violations of normality, but severe violations can affect Type I error rates and power.
Homogeneity of Variances Violation: Can lead to inaccurate F-statistics and p-values. In such cases, corrections like Welch’s ANOVA might be used.
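The normality and equal-variance assumptions can be screened before running ANOVA. The following is a minimal sketch, assuming SciPy is available. The three groups are simulated (hypothetical) data, and these tests are a rough guide rather than a definitive check.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three simulated groups (not from any real study)
rng = np.random.default_rng(42)
groups = {
    "A": rng.normal(loc=50, scale=5, size=30),
    "B": rng.normal(loc=52, scale=5, size=30),
    "C": rng.normal(loc=55, scale=15, size=30),  # deliberately larger spread
}

# Shapiro-Wilk test per group: a small p-value suggests non-normality
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk, group {name}: p = {p:.3f}")

# Levene's test across groups: a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(*groups.values(), center="median")
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

With very small groups these tests have little power, so plots of the residuals are a useful complement.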
Q2. What are the three types of ANOVA, and in what situations would each be used?
One-Way ANOVA:

Use: When comparing the means of three or more independent groups based on one factor.
Example: Comparing test scores of students from three different teaching methods.
Two-Way ANOVA:

Use: When examining how two factors affect the mean of one outcome, including whether the factors interact.
Example: Examining the effect of teaching method (three levels) and gender (two levels) on test scores.
Repeated Measures ANOVA:

Use: When comparing the means of three or more conditions or time points measured on the same subjects.
Example: Measuring the effect of different diets on the same group of participants over three months.
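As an illustration of the repeated measures case, here is a minimal sketch using AnovaRM from statsmodels. The data frame, its column names, and the weights are hypothetical values invented for the example.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: 4 participants weighed in each of 3 months
long_df = pd.DataFrame({
    "subject": ["s1"] * 3 + ["s2"] * 3 + ["s3"] * 3 + ["s4"] * 3,
    "month": ["m1", "m2", "m3"] * 4,
    "weight": [80.0, 78.0, 77.0, 90.0, 89.0, 86.0,
               70.0, 69.0, 68.0, 85.0, 82.0, 81.0],
})

# AnovaRM needs balanced data: every subject measured in every month
result = AnovaRM(long_df, depvar="weight", subject="subject",
                 within=["month"]).fit()
rm_table = result.anova_table
print(rm_table)
```

Because each participant serves as their own control, differences between participants are removed from the error term.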
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
Partitioning of Variance in ANOVA:

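ANOVA breaks down the total variability in the data into two components, variability between groups and variability within groups, so that SST = SSB + SSW.
Total Sum of Squares (SST): Represents the total variability in the data around the overall mean.
Between-Group Sum of Squares (SSB, called SSE for "explained" in Q4): Represents the variability due to differences between group means.
Within-Group Sum of Squares (SSW, called SSR for "residual" in Q4): Represents the variability of observations around their own group mean. Note that many textbooks use these abbreviations the other way around (SSE for error, SSR for regression), so check which definition is in use.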
Importance:

Understanding the partitioning helps in identifying how much of the total variability is explained by the factor(s) being studied. It is crucial for calculating the F-statistic and for determining if the observed group differences are statistically significant.
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

import numpy as np

# Example data
group1 = [23, 20, 22]
group2 = [27, 29, 28]
group3 = [25, 26, 24]

# Combine data
data = np.array(group1 + group2 + group3)
group_labels = ['group1'] * len(group1) + ['group2'] * len(group2) + ['group3'] * len(group3)

# Calculate means
overall_mean = np.mean(data)
means = {label: np.mean(data[np.array(group_labels) == label]) for label in set(group_labels)}

# Calculate SST
sst = np.sum((data - overall_mean) ** 2)

# Calculate SSE (Between-group sum of squares)
sse = np.sum([len(data[np.array(group_labels) == label]) * (mean - overall_mean) ** 2 for label, mean in means.items()])

# Calculate SSR (Within-group sum of squares)
ssr = np.sum([(value - means[label]) ** 2 for value, label in zip(data, group_labels)])

sst, sse, ssr


(68.88888888888889, 60.2222222222222, 8.666666666666666)
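
The sums of squares printed above lead directly to the F-statistic. The following is a minimal sketch, assuming SciPy is available for the F distribution. The variable names ms_between and ms_within are chosen here for illustration.

```python
from scipy import stats

# Sums of squares taken from the output above
sse = 60.2222222222222    # between-group (explained)
ssr = 8.666666666666666   # within-group (residual)
k, n = 3, 9               # number of groups, total observations

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = sse / (k - 1)
ms_within = ssr / (n - k)

# F-statistic and its upper-tail p-value
f_stat = ms_between / ms_within
p_value = stats.f.sf(f_stat, k - 1, n - k)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

A large F relative to the F(k - 1, n - k) distribution indicates that the group means differ more than within-group noise would explain.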

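# Two-way ANOVA with statsmodels
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data: one observation per Factor1 x Factor2 cell
data = {
    'Factor1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Factor2': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'Value': [23, 20, 22, 27, 29, 28, 25, 26, 24]
}
df = pd.DataFrame(data)

# Fit a main-effects model. With a single observation per cell, the
# interaction model 'Value ~ C(Factor1) * C(Factor2)' is saturated (zero
# residual degrees of freedom), so its F-tests cannot be computed.
# With replicated cells, use '*' instead of '+' to test the interaction.
model = ols('Value ~ C(Factor1) + C(Factor2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)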

anova_table


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?
Conclusion:

The p-value (0.02) is less than the significance level (typically 0.05), so we reject the null hypothesis.
Interpretation:

There is statistically significant evidence that at least one group mean differs from the others. The F-statistic of 5.23 means the between-group mean square is about 5.2 times the within-group mean square, a ratio that would be unlikely if all group means were equal. The test does not show which groups differ; identifying them requires a post-hoc comparison.
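The decision rule can be sketched in code. The question does not give the degrees of freedom, so the values below (three groups of ten observations) are a hypothetical assumption, and the resulting p-value will not exactly match the reported 0.02.

```python
from scipy import stats

f_stat = 5.23
alpha = 0.05

# Hypothetical design: 3 groups of 10 observations (not stated in the question)
df_between, df_within = 2, 27

# Upper-tail probability of the F distribution and the critical value at alpha
p_value = stats.f.sf(f_stat, df_between, df_within)
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)

# Reject the null hypothesis of equal means when p < alpha (equivalently F > F_crit)
reject_null = p_value < alpha
print(f"p = {p_value:.4f}, critical F = {f_crit:.3f}, reject H0: {reject_null}")
```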
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?
Handling Missing Data:

Listwise Deletion: Exclude any participant with missing data. This can reduce sample size and power.
Pairwise Deletion: Use all available data points without excluding entire participants.
Imputation: Replace missing values with estimated ones, such as mean imputation, regression imputation, or multiple imputation.
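Mixed-Effects Models: These models are fit by likelihood-based methods and use every available observation from each participant, so participants with some missing measurements still contribute to the analysis.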
Potential Consequences:

Listwise Deletion: Can lead to biased results if the data are not missing completely at random.
Pairwise Deletion: Can lead to inconsistent results across different analyses.
Imputation: Can introduce bias if the imputation method is not appropriate.
Mixed-Effects Models: Generally more robust but require more complex statistical techniques.
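The trade-off between listwise deletion and mean imputation can be seen on a small example. This is a minimal sketch with pandas. The table of four subjects and three time points is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: rows are subjects, columns are time points
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [6.0, np.nan, 8.0, 9.0],
        "t3": [7.0, 8.0, np.nan, 10.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4"], name="subject"),
)

# Listwise deletion: drop every subject with at least one missing value
listwise = wide.dropna()

# Mean imputation: replace each missing value with its column mean
imputed = wide.fillna(wide.mean())

print(f"Subjects kept after listwise deletion: {len(listwise)} of {len(wide)}")
print(f"Variance of t2, observed values only: {wide['t2'].var():.3f}")
print(f"Variance of t2, after mean imputation: {imputed['t2'].var():.3f}")
```

Listwise deletion halves the sample here, while mean imputation keeps every subject but shrinks the variance, which can make effects look more precise than they are.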
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.
Common Post-Hoc Tests:

Tukey’s Honestly Significant Difference (HSD): Used when you need to compare all possible pairs of group means. It was designed for equal sample sizes; the Tukey-Kramer variant handles unequal sizes.

Example: Comparing test scores across multiple teaching methods to identify which pairs of methods differ significantly.
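Bonferroni Correction: Controls the Type I error rate across multiple comparisons by testing each one at the significance level divided by the number of comparisons. Best suited to a small number of planned comparisons, because it becomes very conservative as the number grows.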

Example: Comparing mean blood pressure levels between three different medications.
Scheffé’s Method: More conservative and used for all possible contrasts. Suitable for complex comparisons.

Example: Comparing various diet plans and their effects on weight loss.
Dunnett’s Test: Compares each treatment group to a single control group.

Example: Comparing the efficacy of several new drugs to a standard drug.
Example Situation:

After conducting a one-way ANOVA to compare the mean test scores across four different teaching methods, you find a significant effect. A post-hoc test, like Tukey’s HSD, would be necessary to determine which specific teaching methods differ from each other.
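A post-hoc comparison of this kind can be sketched with pairwise_tukeyhsd from statsmodels. For continuity, the example reuses the three groups from Q4 rather than four teaching methods. The variable names are chosen here for illustration.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Same values as the Q4 example data
scores = np.array([23, 20, 22, 27, 29, 28, 25, 26, 24], dtype=float)
labels = np.array(["group1"] * 3 + ["group2"] * 3 + ["group3"] * 3)

# Tukey's HSD compares every pair of groups at a family-wise alpha of 0.05
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())
```

Each row of the summary reports one pair of groups, the difference in their means, a confidence interval, and whether the null hypothesis of equal means is rejected for that pair.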