Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Assumptions required to use ANOVA and examples of violations:

Independence of observations: The observations within and across groups should be independent of each other. Violations occur when observations are related, for example when the same subject is measured repeatedly or when subjects are matched in pairs, yet the data are analyzed as if the groups were independent.

Homogeneity of variance: The variances of the groups being compared should be roughly equal. Violations occur when one group's variance is several times larger than another's, which is especially harmful when the group sizes are also unequal.

Normality of residuals: The residuals (the differences between the observed values and their group means) should follow a normal distribution. Violations occur when the residuals are not normally distributed, such as when they exhibit strong skewness or heavy tails.
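
The last two assumptions can be checked numerically. The following is a minimal sketch, assuming three independent groups of made-up values (the data and variable names are hypothetical); it uses Levene's test for equal variances and the Shapiro-Wilk test on the pooled residuals. Independence cannot be tested this way, since it depends on how the data were collected.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three independent groups
group_a = np.array([10.0, 15.0, 20.0, 14.0, 16.0])
group_b = np.array([12.0, 18.0, 22.0, 17.0, 19.0])
group_c = np.array([8.0, 14.0, 16.0, 11.0, 13.0])
groups = [group_a, group_b, group_c]

# Homogeneity of variance: Levene's test (H0: all group variances are equal)
levene_stat, levene_p = stats.levene(*groups)

# Normality of residuals: in a one-way ANOVA the residuals are the
# deviations of each observation from its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may be violated, although with samples this small both tests have little power.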

Q2. What are the three types of ANOVA, and in what situations would each be used?

One-Way ANOVA: Used when there is a single categorical independent variable (factor) with two or more levels (groups), and the goal is to compare the means of the dependent variable across those groups.

Two-Way ANOVA: Used when there are two independent variables (factors) and their interactions, allowing for the examination of the main effects of each factor and their interaction on the dependent variable.

Three-Way (N-Way) ANOVA: Used when there are three or more independent variables (factors), allowing for the examination of the main effect of each factor as well as the two-way and higher-order interactions among them on the dependent variable.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning of variance in ANOVA and its importance:

Partitioning of variance in ANOVA refers to the decomposition of the total variation observed in the data into separate sources of variation. This decomposition is important because it shows how much of the observed variation each factor accounts for, and because the F-statistic is built from it: the F-statistic compares the between-groups variation with the within-groups variation, each divided by its degrees of freedom. In a one-way ANOVA, the total variation is divided into two components:

Between-Groups Variation: Variation due to differences between the group means. It represents the variation explained by the factor(s) under investigation.

Within-Groups Variation: Variation within each group that is not accounted for by the factor(s). It represents the variation that is not explained by the factors and is often considered as random or error variation.
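
For a one-way design, the partition can be written out explicitly. The notation below is an assumption, since the text above does not define symbols: k groups, n_j observations in group j, N observations in total, group means \bar{x}_j, and grand mean \bar{x}.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2}_{\text{total}}
= \underbrace{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2}_{\text{between groups}}
+ \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}_{\text{within groups}}
\qquad
F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
```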

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?


In [1]:
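import numpy as np

# Example data: three groups with three observations each
group1 = [10, 15, 20]
group2 = [12, 18, 22]
group3 = [8, 14, 16]

data = np.concatenate([group1, group2, group3])

group_means = [np.mean(group1), np.mean(group2), np.mean(group3)]
grand_mean = np.mean(data)

# Total sum of squares: squared deviations of every observation from the grand mean
sst = np.sum((data - grand_mean) ** 2)

# Explained (between-group) sum of squares: squared deviation of each group
# mean from the grand mean, weighted by the group size
sse = np.sum([(gm - grand_mean) ** 2 * len(g) for gm, g in zip(group_means, [group1, group2, group3])])

# Residual (within-group) sum of squares: the part the group means leave unexplained
# Note: the abbreviations follow the question; many textbooks instead use
# SSE for the error (residual) term and SSR for the explained term
ssr = sst - sse

print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)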


SST: 168.0
SSE: 32.66666666666666
SSR: 135.33333333333334
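
The sums of squares lead directly to the F-statistic. The sketch below reuses the same three groups, divides each sum of squares by its degrees of freedom, and compares the hand-computed result with `scipy.stats.f_oneway`. The names `ss_between` and `ss_within` are illustrative and correspond to SSE and SSR as defined in the question.

```python
import numpy as np
from scipy import stats

groups = [[10, 15, 20], [12, 18, 22], [8, 14, 16]]
data = np.concatenate(groups)
grand_mean = data.mean()

k = len(groups)       # number of groups
n_total = len(data)   # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(((np.array(g) - np.mean(g)) ** 2).sum() for g in groups)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)  # upper-tail probability

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print("Manual F:", f_manual, "p:", p_manual)
print("SciPy  F:", f_scipy, "p:", p_scipy)
```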


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?



In [7]:
import pandas as pd
import statsmodels.api as sm

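# Create a DataFrame with the data (one observation per Factor1 x Factor2 combination)
data = pd.DataFrame({'Factor1': [1, 1, 2, 2, 3, 3],
                     'Factor2': [1, 2, 1, 2, 1, 2],
                     'Response': [10, 12, 15, 18, 20, 22]})

# Fit the model with both main effects and their interaction.
# Factor1 and Factor2 are numeric here, so the formula treats them as
# continuous predictors. Wrapping them in C(...) would treat them as
# categorical factors, which needs replicated observations in each cell
# to leave residual degrees of freedom.
model = sm.formula.ols('Response ~ Factor1 + Factor2 + Factor1:Factor2', data=data).fit()

# Get the ANOVA table (Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract the sum of squares attributed to each main effect and to the interaction
main_effect_factor1 = anova_table['sum_sq']['Factor1']
main_effect_factor2 = anova_table['sum_sq']['Factor2']
interaction_effect = anova_table['sum_sq']['Factor1:Factor2']

# Print the sums of squares for the main effects and the interaction
print("Main Effect of Factor1:", main_effect_factor1)
print("Main Effect of Factor2:", main_effect_factor2)
print("Interaction Effect:", interaction_effect)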


Main Effect of Factor1: 100.00000000000003
Main Effect of Factor2: 8.166666666666607
Interaction Effect: 3.1554436208840416e-30


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

With an F-statistic of 5.23 and a p-value of 0.02, we reject the null hypothesis that all group means are equal at the 0.05 significance level: at least one group mean differs from the others.

The F-statistic is the ratio of the between-group mean square to the within-group mean square, so a value of 5.23 means the variation between the group means is about five times what within-group noise alone would produce. The p-value of 0.02 means that, if all group means were truly equal, an F-statistic at least this large would occur only about 2% of the time. Because 0.02 is below the chosen significance level (commonly 0.05), the result is statistically significant.

However, the one-way ANOVA alone does not show which specific groups differ from one another. Post-hoc tests or pairwise comparisons are typically conducted to identify the specific group differences.
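
The link between an F-statistic and its p-value can be reproduced with the F distribution in SciPy. The question does not state the degrees of freedom, so the values below (3 groups, 15 observations) are a hypothetical choice; under that assumption the p-value comes out close to the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical degrees of freedom: 3 groups and 15 observations in total
df_between = 2    # k - 1
df_within = 12    # N - k

# Survival function: P(F >= f_statistic) if all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Smallest F that would be significant at alpha = 0.05
f_critical = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha = 0.05: {f_critical:.2f}")
```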


Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In a repeated measures ANOVA, missing data can be handled in several ways. The classical analysis requires a complete set of measurements for every subject, so the simplest option is listwise deletion, where any subject with a missing value is dropped entirely. Pairwise deletion instead excludes a subject only from the individual comparisons that involve its missing values. A third approach is to impute missing values, replacing them with estimates based on the available data (for example, mean imputation or multiple imputation).

The potential consequences of using different methods to handle missing data in repeated measures ANOVA include:

Bias: Any of these methods can bias the results if the reason the data are missing is related to the outcome variable or to other variables in the analysis (that is, if the data are not missing completely at random).

Reduced power: Deleting cases results in a loss of statistical power because fewer data points are used. Listwise deletion is the most costly, since a single missing value removes all of a subject's measurements.

Inaccurate estimates: Imputed values are estimates, so they carry additional uncertainty. Treating them as if they were observed understates the standard errors, and the accuracy of the imputation depends on the validity of the assumptions behind it.
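
The trade-off between deleting and imputing can be seen on a small example. This is a minimal sketch on hypothetical wide-format data (one row per subject, one column per time point); the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame(
    {"t1": [5.0, 6.0, 7.0, 5.5, 6.5],
     "t2": [5.5, np.nan, 7.5, 6.0, 7.0],
     "t3": [6.0, 6.5, np.nan, 6.5, 7.5]},
    index=pd.Index(["s1", "s2", "s3", "s4", "s5"], name="subject"),
)

# Listwise deletion: keep only subjects with complete data
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with the mean of its time point
mean_imputed = wide.fillna(wide.mean())

print("Subjects kept after listwise deletion:", len(complete_cases))
print("Variance of t2, observed values only:", wide["t2"].var())
print("Variance of t2, after mean imputation:", mean_imputed["t2"].var())
```

Listwise deletion discards two of the five subjects here, while mean imputation keeps all five but shrinks the variance of the imputed columns, which makes the data look more precise than they are.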


Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Tukey's Honestly Significant Difference (HSD) test: This test compares all possible pairs of group means and controls the familywise error rate. It is suitable when the goal is to identify which specific groups differ significantly from each other.

Bonferroni correction: This method adjusts the significance level for multiple pairwise comparisons. It is more conservative than Tukey's HSD test and controls the familywise error rate.

Dunnett's test: This test compares each group mean to a control group mean. It is useful when there is a specific control group to which other groups are compared.

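Scheffe's test: This test covers any contrast among the group means, not only pairwise differences, including comparisons chosen after looking at the data. It is more flexible than Tukey's HSD test but also more conservative, so it has less power for simple pairwise comparisons.

Example of a situation where a post-hoc test is necessary: if a one-way ANOVA comparing three diets gives a significant F-test, it only shows that at least one mean differs. A post-hoc test such as Tukey's HSD is then needed to find out which pairs of diets differ.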
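
As an illustration of a pairwise post-hoc comparison, the sketch below runs Tukey's HSD with `pairwise_tukeyhsd` from statsmodels. The weight-loss values are hypothetical and chosen only to make the example runnable.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical weight-loss data for three diets (5 participants each)
values = np.array([2, 4, 3, 1, 2,    # diet A
                   6, 7, 5, 6, 8,    # diet B
                   1, 3, 2, 1, 3])   # diet C
labels = np.array(["A"] * 5 + ["B"] * 5 + ["C"] * 5)

# Tukey's HSD compares every pair of groups while holding the
# familywise error rate at alpha
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())
```

The summary lists one row per pair of groups, and the `reject` column marks the pairs whose mean difference is significant at the chosen alpha.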

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.


In [9]:
import scipy.stats as stats

# Weight loss data for each diet (illustrative sample of 5 per diet, not all 50 participants)
diet_A = [2, 4, 3, 1, 2]
diet_B = [3, 5, 4, 2, 1]
diet_C = [1, 3, 2, 1, 3]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the F-statistic and p-value
print("F-statistic:", f_statistic)
print("P-value:", p_value)


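F-statistic: 0.7916666666666669
P-value: 0.47538921646792204

Interpretation: with F ≈ 0.79 and p ≈ 0.475, which is well above 0.05, we fail to reject the null hypothesis of equal means. This sample shows no statistically significant difference in mean weight loss among the three diets.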


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [10]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with the data (illustrative sample of 6 employees, not all 30)
data = pd.DataFrame({'Software': ['A', 'B', 'C', 'A', 'B', 'C'],
                     'Experience': ['Novice', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Experienced'],
                     'Time': [10, 12, 15, 18, 20, 22]})

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()

# Print the ANOVA table
print(sm.stats.anova_lm(model, typ=2))


                              sum_sq   df         F    PR(>F)
C(Software)                 0.465675  2.0  0.008242  0.935937
C(Experience)                    NaN  1.0       NaN       NaN
C(Software):C(Experience)  28.016667  2.0  0.495870  0.554269
Residual                   56.500000  2.0       NaN       NaN


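Interpretation: the NaN entries (and the accompanying runtime warning) arise because the six-row sample leaves two cells empty: no experienced employee uses Program A and no novice uses Program C. The model is therefore rank-deficient, the main effects and the interaction cannot be estimated separately, and this table should not be used to draw conclusions about either.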


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.


In [11]:
import scipy.stats as stats

# Test scores for the control group (illustrative sample, not all 100 students)
control_scores = [75, 80, 85, 90, 70]

# Test scores for the experimental group
experimental_scores = [85, 90, 95, 80, 75]

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

# Print the t-statistic and p-value
print("T-statistic:", t_statistic)
print("P-value:", p_value)


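T-statistic: -1.0
P-value: 0.34659350708733416

Interpretation: with p ≈ 0.347, which is above 0.05, we fail to reject the null hypothesis. These data show no statistically significant difference in mean test scores between the control and experimental groups. No post-hoc test is needed: the result is not significant, and with only two groups a significant t-test would already identify the one pair that differs.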


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
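import pandas as pd
from itertools import combinations
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Illustrative data in long format. The 51 sales values fill 17 days per store;
# a full dataset for the question would hold 30 days per store.
n_days = 17
data = pd.DataFrame({'Store': ['A'] * n_days + ['B'] * n_days + ['C'] * n_days,
                     'Day': list(range(1, n_days + 1)) * 3,
                     'Sales': [100, 110, 95, 105, 120, 100, 105, 98, 102, 112,
                               80, 85, 90, 95, 105, 78, 88, 92, 98, 102,
                               115, 120, 118, 100, 110, 105, 98, 102, 95, 105,
                               92, 98, 105, 95, 100, 105, 100, 110, 105, 98, 102,
                               88, 92, 98, 102, 112, 105, 95, 100, 110, 95]})

# Fit the repeated measures ANOVA: each day is the "subject" measured
# under all three stores, and Store is the within-subject factor
result = AnovaRM(data, depvar='Sales', subject='Day', within=['Store']).fit()

# Print the ANOVA table
print(result)

# Perform post-hoc test: paired t-tests with a Bonferroni correction.
# Tukey's HSD assumes independent groups, so paired comparisons are used
# here because the same days are observed for every store.
wide = data.pivot(index='Day', columns='Store', values='Sales')
pairs = list(combinations(wide.columns, 2))
for a, b in pairs:
    t_stat, p_val = stats.ttest_rel(wide[a], wide[b])
    p_adj = min(p_val * len(pairs), 1.0)
    print(f"{a} vs {b}: t = {t_stat:.3f}, Bonferroni-adjusted p = {p_adj:.4f}")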
