In [1]:
# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
# the validity of the results.

In [2]:
# Independence of Observations: Data points must be independent. Violation example: Using repeated measures on the same
# subjects.

# Normality: Data within each group should be normally distributed. Violation example: Skewed data or outliers.

# Homogeneity of Variances: Variances among groups should be equal. Violation example: One group has much larger 
# variance than others.

# Violations can lead to incorrect conclusions, such as false positives or reduced statistical power.

In [3]:
# Q2. What are the three types of ANOVA, and in what situations would each be used?

In [4]:
# One-Way ANOVA:
#     Use: When comparing the means of three or more independent groups based on one factor or independent variable.
#     Situation: Testing the effectiveness of different diets (Diet A, Diet B, Diet C) on weight loss.\
    
# Two-Way ANOVA:
#     Use: When comparing the means of groups based on two factors, and it can also assess the interaction between these factors.
#     Situation: Studying the effect of different teaching methods (Method A, Method B) and different study times 
#     (1 hour, 2 hours) on student performance.
    
# Repeated Measures ANOVA:
#     Use: When comparing means across multiple measurements taken from the same subjects under different conditions or at different times.
#     Situation: Evaluating the impact of a training program on the same group's performance over several time points (e.g., before, during, and after training).

In [5]:
# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [6]:
# The partitioning of variance is a fundamental concept in Analysis of Variance (ANOVA). It involves dividing the total
# variance in the response variable into:

# Between-group variance (SSB): Variance between different groups or levels of the factor(s)
# Within-group variance (SSW): Variance within each group or level of the factor(s)
# Error variance (SSE): Variance that cannot be explained by the factor(s)

# Understanding the partitioning of variance is crucial in ANOVA because it helps determine the significance of the 
# factor(s) and estimate the effect size.

In [7]:
# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
# sum of squares (SSR) in a one-way ANOVA using Python?

In [8]:
import numpy as np

In [13]:
def calculate_anova(y, groups):
    unique_groups = np.unique(groups)
    group_means = np.array([np.mean(y[groups == i]) for i in unique_groups])
    overall_mean = np.mean(y)
    SST = np.sum((y - overall_mean) ** 2)
    SSE = np.sum([(y[groups == i] - group_means[j]) ** 2 for j, i in enumerate(unique_groups)]).sum()
    SSR = SST - SSE
    return SST, SSE, SSR


In [14]:
y = np.array([23, 21, 19, 24, 20, 22, 18, 25, 19, 21, 20, 23])
groups = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

In [15]:
SST, SSE, SSR = calculate_anova(y, groups)

In [16]:
SST, SSE, SSR

(52.25, 50.25, 2.0)

In [17]:
# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [18]:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

In [19]:
np.random.seed(0)
A = np.repeat(['A1', 'A2'], 10)
B = np.tile(['B1', 'B2'], 10)
y = np.random.normal(0, 1, 20)

In [20]:
import pandas as pd
df = pd.DataFrame({'y': y, 'A': A, 'B': B})

In [21]:
model = ols('y ~ C(A) + C(B) + C(A):C(B)', data=df).fit()

In [22]:
anova_table = anova_lm(model, typ=2)

In [23]:
anova_table

Unnamed: 0,sum_sq,df,F,PR(>F)
C(A),0.569117,1.0,0.752153,0.398616
C(B),1.705237,1.0,2.253666,0.152774
C(A):C(B),0.075458,1.0,0.099727,0.75624
Residual,12.106404,16.0,,


In [24]:
# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
# What can you conclude about the differences between the groups, and how would you interpret these
# results?

In [25]:
# With an F-statistic of 5.23 and a p-value of 0.02, you can conclude that there are statistically significant differences
# between the group means. Since the p-value is less than 0.05, it suggests that at least one group mean is different from 
# the others. Post-hoc tests would be needed to identify which specific groups differ.

In [26]:
# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
# consequences of using different methods to handle missing data?

In [27]:
# Listwise Deletion: Exclude subjects with missing data, but this reduces sample size and may bias results.
# Pairwise Deletion: Use all available data, risking inconsistent sample sizes.
# Mean Imputation: Replace missing values with the mean, but it underestimates variability.
# LOCF: Carry forward the last observation, which may introduce bias if patterns change.
# Multiple Imputation: Impute missing values using statistical models, providing more accurate estimates but is complex.
# Mixed-Effects Models: Handle missing data within the model, offering unbiased estimates if modeled correctly.

In [28]:
# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
# an example of a situation where a post-hoc test might be necessary.

In [29]:
# Tukey's HSD: Compares all pairs of means, controlling for Type I error. Use: To compare multiple treatments, like different 
# medications.

# Bonferroni Correction: Adjusts significance levels for multiple comparisons. Use: When comparing several pairs of means.

# Scheffé Test: Handles complex comparisons and group combinations. Use: When comparing various combinations of treatments.

# Dunnett's Test: Compares multiple groups to a single control. Use: To test different drug dosages against a placebo.

# Example: After a significant ANOVA result comparing teaching methods, Tukey's HSD can identify which specific methods 
# differ in their effectiveness.

In [30]:
# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
# 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
# to determine if there are any significant differences between the mean weight loss of the three diets.
# Report the F-statistic and p-value, and interpret the results.

In [31]:
from scipy.stats import f_oneway

In [32]:
diet_A = np.random.normal(loc=5, scale=1, size=17) 
diet_B = np.random.normal(loc=4.5, scale=1, size=17)  
diet_C = np.random.normal(loc=4, scale=1, size=16)

In [33]:
f_stat, p_value = f_oneway(diet_A, diet_B, diet_C)

In [34]:
f_stat, p_value

(8.192248308455516, 0.0008865394976159794)

In [35]:
if p_value < 0.05:
    print("There are significant differences between the mean weight loss of the diets.")
else:
    print("There are no significant differences between the mean weight loss of the diets.")

There are significant differences between the mean weight loss of the diets.


In [36]:
# Q10. A company wants to know if there are any significant differences in the average time it takes to
# complete a task using three different software programs: Program A, Program B, and Program C. They
# randomly assign 30 employees to one of the programs and record the time it takes each employee to
# complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
# interaction effects between the software programs and employee experience level (novice vs.
# experienced). Report the F-statistics and p-values, and interpret the results.

In [45]:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from scipy import stats

In [39]:
np.random.seed(0)
program_A_novice = np.random.normal(60, 10, 10) 
program_A_experienced = np.random.normal(40, 10, 10)  
program_B_novice = np.random.normal(70, 10, 10) 
program_B_experienced = np.random.normal(50, 10, 10)  
program_C_novice = np.random.normal(80, 10, 10)  
program_C_experienced = np.random.normal(60, 10, 10)  

In [40]:
df = pd.DataFrame({
    'time': np.concatenate([program_A_novice, program_A_experienced, program_B_novice, program_B_experienced, program_C_novice, program_C_experienced]),
    'program': np.repeat(['A', 'A', 'B', 'B', 'C', 'C'], 10),
    'experience': np.tile(['novice', 'experienced'], 30)
})

In [42]:
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=df).fit()
anova_table = anova_lm(model, typ=2)


In [43]:
anova_table

Unnamed: 0,sum_sq,df,F,PR(>F)
C(program),1073.743353,2.0,2.445944,0.096191
C(experience),0.069109,1.0,0.000315,0.985908
C(program):C(experience),557.57131,2.0,1.270125,0.28905
Residual,11852.712779,54.0,,


In [44]:
# Q11. An educational researcher is interested in whether a new teaching method improves student test
# scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
# experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
# two-sample t-test using Python to determine if there are any significant differences in test scores
# between the two groups. If the results are significant, follow up with a post-hoc test to determine which
# group(s) differ significantly from each other.

In [47]:
np.random.seed(0)
control_group = np.random.normal(70, 10, 50) 
experimental_group = np.random.normal(80, 10, 50)

In [48]:
t_stat, p_val = stats.ttest_ind(control_group, experimental_group)

In [49]:
t_stat, p_val

(-4.131173276068804, 7.60404836914434e-05)

In [50]:
# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
# retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
# on those days. Conduct a repeated measures ANOVA using Python to determine if there are any

# significant differences in sales between the three stores. If the results are significant, follow up with a post-
# hoc test to determine which store(s) differ significantly from each other.

In [51]:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [52]:
np.random.seed(0)
store_A = np.random.normal(1000, 100, 30)  
store_B = np.random.normal(1200, 100, 30)  
store_C = np.random.normal(1100, 100, 30) 

In [53]:
df = pd.DataFrame({
    'tore': np.repeat(['A', 'B', 'C'], 30),
    'ales': np.concatenate([store_A, store_B, store_C])
})

In [55]:
model = ols('sales ~ C(store)', data=df).fit()
anova_table = anova_lm(model, typ=2)

PatsyError: Error evaluating factor: NameError: name 'store' is not defined
    sales ~ C(store)
            ^^^^^^^^