Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.


Q2. What are the three types of ANOVA, and in what situations would each be used?


Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?


Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?


Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?


Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.


Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

1. Assumptions of ANOVA (Analysis of Variance):

Independence: Observations within and across groups are independent of each other.
Normality: The residuals (each observation minus its group mean) should be approximately normally distributed. Heavy skew or extreme outliers violate this assumption.
Homogeneity of Variances (Homoscedasticity): The variances of the groups should be roughly equal. When they differ markedly, especially with unequal group sizes, the Type I error rate of the F-test can deviate from the nominal level.
Examples of Violations:

Non-independence: In a repeated measures design, where the same subjects are measured in multiple groups, observations from the same subject are correlated rather than independent.
Non-normality: A highly skewed outcome, such as income or reaction time, can distort the F-test, particularly when the groups are small.
Heteroscedasticity: The variance in one group is much larger or smaller than in the others, for example when one treatment produces far more variable responses than the control.
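
The normality and equal-variance assumptions above can be checked before running the ANOVA. The following is a minimal sketch, assuming three synthetic groups generated only for illustration (the names `groups` and `residuals` are not from the original). It uses the Shapiro-Wilk test on the residuals and Levene's test across the groups; small p-values suggest a violation.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder groups (assumption: three groups of 20 observations each)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=20) for mu in (10.0, 11.0, 12.0)]

# Normality: Shapiro-Wilk test on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

Independence cannot be tested this way; it has to be judged from how the data were collected.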

2. The three main types of ANOVA are:

One-Way ANOVA: Used when you have one categorical independent variable (factor) with more than two levels (groups). It tests if there are significant differences in the means of the groups.

Two-Way ANOVA: Used when you have two categorical independent variables (factors) and you want to examine their main effects and potential interaction effect on a continuous dependent variable.

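Repeated Measures ANOVA: Used when you have repeated measurements on the same subjects under different conditions or time points (within-subjects design). It assesses the effect of the within-subject factor(s) while accounting for the correlation between measurements from the same subject. When a between-subject factor is also included, the design is called a mixed-design ANOVA.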

3. The partitioning of variance in ANOVA refers to the process of dividing the total variance in the data into different components, including explained variance (due to factors) and unexplained variance (due to random variation). Understanding this concept is crucial because it helps in:

Assessing the contribution of each factor to the overall variability in the data.
Evaluating the significance of factors.
Understanding the proportion of variance that can be attributed to systematic effects.
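
For a one-way design, the partition can be written out explicitly. The notation below is an assumption introduced for illustration: k groups, n_i observations in group i, x_ij the j-th observation in group i, x̄_i the mean of group i, and x̄ the overall mean. SSE is the explained (between-group) sum of squares and SSR the residual (within-group) sum of squares, matching the naming used in Q4.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2}_{\text{SST}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2}_{\text{SSE (explained)}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2}_{\text{SSR (residual)}}
```

The F-statistic compares the two components after dividing each by its degrees of freedom: F = (SSE / (k - 1)) / (SSR / (N - k)), where N is the total number of observations.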

In [None]:
4.
import numpy as np
import scipy.stats as stats

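# Example data: illustrative placeholder values, replace with your actual data
group1_data = [23, 25, 21, 27, 24]
group2_data = [30, 28, 33, 29, 31]
group3_data = [22, 20, 25, 23, 21]
data = [group1_data, group2_data, group3_data]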

# Calculate the overall mean
overall_mean = np.mean(np.concatenate(data))

# Calculate SST (Total Sum of Squares)
sst = np.sum([(x - overall_mean)**2 for group in data for x in group])

# Calculate SSE (Explained Sum of Squares)
sse = np.sum([len(group) * (np.mean(group) - overall_mean)**2 for group in data])

# Calculate SSR (Residual Sum of Squares)
ssr = sst - sse

# Degrees of freedom
df_total = len(np.concatenate(data)) - 1
df_groups = len(data) - 1
df_residual = df_total - df_groups

# Calculate the mean squares (MS) for groups and residuals
ms_groups = sse / df_groups
ms_residual = ssr / df_residual

# F-statistic
f_statistic = ms_groups / ms_residual

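# P-value (the survival function is more accurate than 1 - cdf for very small p-values)
p_value = stats.f.sf(f_statistic, df_groups, df_residual)

print(f"SST = {sst:.3f}, SSE = {sse:.3f}, SSR = {ssr:.3f}")
print(f"F = {f_statistic:.3f}, p = {p_value:.4f}")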


In [None]:
5.
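import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data; expected columns: dependent_variable, factor1, factor2
data = pd.read_csv("your_data.csv")  # Replace with your data file

# Fit a two-way ANOVA model with both main effects and their interaction
formula = "dependent_variable ~ C(factor1) + C(factor2) + C(factor1):C(factor2)"
model = ols(formula, data=data).fit()

# Build the ANOVA table once (Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)

# Main effects: sum of squares attributed to each factor
main_effect_factor1 = anova_table.loc["C(factor1)", "sum_sq"]
main_effect_factor2 = anova_table.loc["C(factor2)", "sum_sq"]

# Interaction effect: sum of squares attributed to the factor1-by-factor2 term
interaction_effect = anova_table.loc["C(factor1):C(factor2)", "sum_sq"]

# The "F" and "PR(>F)" columns hold the test statistic and p-value for each effect
print(anova_table)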


6. With an F-statistic of 5.23 and a p-value of 0.02, the result is statistically significant at the conventional 0.05 level, so you can conclude that at least one group mean differs from the others. Specifically:

The null hypothesis (that all group means are equal) is rejected because the p-value is less than the chosen significance level (e.g., 0.05).

However, you would need post-hoc tests to determine which specific groups differ from each other, as the one-way ANOVA only tells you that there are differences but not where those differences are.

7. Handling missing data in a repeated measures ANOVA can be complex. Common methods include:

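Listwise Deletion: Remove every subject that has any missing measurement. This reduces sample size and statistical power, and it biases the results unless the data are missing completely at random.

Imputation: Estimate missing values based on available data. Mean imputation shrinks the variance and can distort the F-test; regression imputation or multiple imputation preserves more of the data's structure but still introduces bias if the imputation model is wrong.

Mixed-Effects Models: Fit a linear mixed-effects model instead of a classical repeated measures ANOVA. It uses every available observation without dropping whole subjects and gives valid estimates when the data are missing at random.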

The potential consequences of different methods include biased results, loss of power, and incorrect inferences. Choosing an appropriate method depends on the nature of the missing data and the assumptions underlying the analysis.
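
The cost of listwise deletion can be seen by counting what survives. The following is a minimal sketch, assuming a small hypothetical wide-format table (the column names `subject`, `cond_1`, `cond_2`, `cond_3` and all values are placeholders, not real data). It compares the subjects kept after listwise deletion with the observations kept when the data are reshaped to long format, which is the layout a mixed-effects model would use.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per condition
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "cond_1": [5.1, 4.8, np.nan, 5.5, 5.0],
    "cond_2": [5.6, np.nan, 5.9, 6.1, 5.4],
    "cond_3": [6.0, 5.7, 6.2, 6.4, np.nan],
})

# Listwise deletion: keep only subjects observed under every condition
complete_cases = wide.dropna()

# Long format: keep every observed value, dropping only the missing cells
long_df = wide.melt(id_vars="subject", var_name="condition", value_name="score").dropna()

print(f"Subjects before deletion: {len(wide)}")
print(f"Subjects after listwise deletion: {len(complete_cases)}")
print(f"Observations retained in long format: {len(long_df)} of {wide.shape[0] * 3}")
```

Here three missing cells remove three of the five subjects under listwise deletion, while the long format loses only the three missing values themselves.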

8. Common post-hoc tests after ANOVA include:

Tukey's Honestly Significant Difference (HSD): Used when you have multiple groups and want to compare all possible pairs of means. It controls the family-wise error rate.

Bonferroni Correction: Used to control the overall Type I error rate when conducting multiple pairwise comparisons. It is more conservative than Tukey's HSD.

Dunnett's Test: Used when you have one control group and want to compare it to multiple treatment groups.

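Scheffé's Test: Used when you want to test complex contrasts (for example, one group against the average of two others) as well as pairwise comparisons. It works with unequal sample sizes and is the most conservative option for simple pairwise comparisons. When group variances are unequal, the Games-Howell test is a common alternative.

Holm-Bonferroni Method: A step-down procedure that controls the family-wise error rate and is never less powerful than the plain Bonferroni correction.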

You would use these post-hoc tests when you have rejected the null hypothesis in ANOVA and need to determine which specific groups differ from each other. For example, after finding a significant difference in test scores between different teaching methods in ANOVA, you might use Tukey's HSD to identify which pairs of methods lead to significantly different scores.
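
The difference between the Bonferroni and Holm-Bonferroni adjustments can be sketched with statsmodels' `multipletests`. The scores and group names below are placeholders invented for illustration; the unadjusted p-values come from ordinary pairwise t-tests.

```python
from itertools import combinations

from scipy import stats
from statsmodels.stats.multitest import multipletests

# Placeholder scores for three hypothetical teaching methods
groups = {
    "method_1": [72, 75, 78, 71, 74, 77],
    "method_2": [80, 83, 79, 85, 82, 81],
    "method_3": [74, 76, 79, 73, 77, 75],
}

# Unadjusted p-values from all pairwise two-sample t-tests
pairs = list(combinations(groups, 2))
raw_p = [stats.ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]

# Adjust the same p-values with each method
_, p_bonf, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
_, p_holm, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for (a, b), p0, pb, ph in zip(pairs, raw_p, p_bonf, p_holm):
    print(f"{a} vs {b}: raw p = {p0:.4f}, Bonferroni = {pb:.4f}, Holm = {ph:.4f}")
```

The Holm-adjusted p-value for each pair is never larger than the Bonferroni-adjusted one, which is why Holm is usually preferred when a general-purpose correction is needed.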

In [None]:
9.

import scipy.stats as stats

# Example data: illustrative placeholder values, replace with your actual measurements
diet_A = [2.1, 3.4, 1.8, 2.9, 3.0]
diet_B = [4.2, 3.9, 5.1, 4.6, 4.0]
diet_C = [2.8, 3.1, 2.5, 3.6, 2.9]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Interpret the results
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: There are significant differences between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis: There are no significant differences between the mean weight loss of the three diets.")

# Print F-statistic and p-value
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")


This code will perform a one-way ANOVA to determine if there are any significant differences in weight loss between diets A, B, and C. The F-statistic and p-value will be reported, and you can interpret the results based on the significance level.

In [None]:
10.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data (replace with your actual data); expected columns:
# completion_time, software_program, experience_level
data = pd.read_csv("your_data.csv")  # Replace with your data file

# Fit a two-way ANOVA model with both main effects and their interaction
formula = "completion_time ~ C(software_program) + C(experience_level) + C(software_program):C(experience_level)"
model = ols(formula, data=data).fit()

# Build the ANOVA table once; each row holds sum_sq, df, F and PR(>F)
anova_table = sm.stats.anova_lm(model, typ=2)

# Map a readable label to each row of the ANOVA table
effects = {
    "main effect of software program": "C(software_program)",
    "main effect of experience level": "C(experience_level)",
    "interaction between software program and experience level": "C(software_program):C(experience_level)",
}

# Interpret the results: compare each p-value (not the sum of squares) with alpha
alpha = 0.05  # Significance level
for label, term in effects.items():
    f_value = anova_table.loc[term, "F"]
    p_value = anova_table.loc[term, "PR(>F)"]
    if p_value < alpha:
        decision = "Reject the null hypothesis: there is a significant"
    else:
        decision = "Fail to reject the null hypothesis: there is no significant"
    # Report the F-statistic and p-value alongside the decision
    print(f"{decision} {label} (F = {f_value:.3f}, p = {p_value:.4f}).")


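11. To determine if there are significant differences in test scores between the control group and the experimental group, you can conduct a two-sample t-test using Python. Because only two groups are compared, a significant t-test already shows that these two groups differ, so no separate post-hoc test is needed; comparing the two group means shows which method scored higher. Here's an example of how to do it: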

In [None]:
import scipy.stats as stats

# Example data: illustrative placeholder values, replace with the 50 scores from each group
control_group_scores = [72, 68, 75, 70, 66, 74, 71, 69]  # Scores for the control group
experimental_group_scores = [78, 74, 80, 73, 77, 79, 72, 76]  # Scores for the experimental group

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_scores)

# Set the significance level
alpha = 0.05

# Interpret the results
if p_value < alpha:
    print("Reject the null hypothesis: There are significant differences in test scores between the two groups.")
else:
    print("Fail to reject the null hypothesis: There are no significant differences in test scores between the two groups.")


12. To determine if there are significant differences in daily sales between the three retail stores, and to conduct a post-hoc test if needed, you can use Python with the statsmodels library. Repeated measures ANOVA applies when the same subjects are measured under different conditions or time points. Here each of the 30 days contributes one sales figure per store, so the day plays the role of the subject and the store is the within-subject factor, which fits a repeated measures design. A one-way ANOVA is a simpler alternative, but it treats the three stores' observations as independent and ignores the pairing by day, so it can lose power when sales vary a lot from day to day.

Here's how you can conduct a one-way ANOVA and a post-hoc test if needed:

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data (replace with your actual data)
data = pd.read_csv("your_data.csv")  # Replace with your data file

# Fit a one-way ANOVA model
formula = "sales ~ C(store)"
model = ols(formula, data=data).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Interpret the results
alpha = 0.05  # Significance level
if anova_table.loc["C(store)", "PR(>F)"] < alpha:
    print("Reject the null hypothesis: There are significant differences in daily sales between the three stores.")

    # Perform a post-hoc Tukey's HSD test
    posthoc = pairwise_tukeyhsd(data['sales'], data['store'], alpha=alpha)
    print(posthoc)
else:
    print("Fail to reject the null hypothesis: There are no significant differences in daily sales between the three stores.")
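
The repeated measures design that the question asks for can be sketched with statsmodels' `AnovaRM`, treating each day as the subject and the store as the within-subject factor. The data below are synthetic placeholders generated only so the example runs; the column names `day`, `store`, and `sales` and the store means are assumptions. `AnovaRM` requires balanced data, meaning exactly one sales figure per store per day.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Synthetic placeholder data (assumption): 30 days, one sales figure per store per day
rng = np.random.default_rng(42)
days = np.arange(1, 31)
day_effect = rng.normal(0, 50, size=days.size)  # variation shared by all stores on a day
store_means = {"A": 1000.0, "B": 1040.0, "C": 980.0}  # hypothetical average sales

rows = []
for store, mean in store_means.items():
    sales = mean + day_effect + rng.normal(0, 30, size=days.size)
    rows.append(pd.DataFrame({"day": days, "store": store, "sales": sales}))
data = pd.concat(rows, ignore_index=True)

# Repeated measures ANOVA: day is the subject, store is the within-subject factor
rm_results = AnovaRM(data, depvar="sales", subject="day", within=["store"]).fit()
print(rm_results.anova_table)
```

The resulting table reports the F value, the numerator and denominator degrees of freedom, and the p-value for the store effect. If the effect is significant, paired comparisons between stores with a Holm or Bonferroni adjustment respect the day-level pairing better than Tukey's HSD, which assumes independent groups.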
