Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Assumptions of ANOVA:

Normality: The dependent variable should be approximately normally distributed within each group.
Homogeneity of Variance (Homoscedasticity): The variances of the groups should be approximately equal.
Independence: Observations within each group should be independent of each other.
Random Sampling: The data should be obtained through a random sampling process.
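
Violations and Impacts:

Normality Violation: Strongly skewed or heavy-tailed data (for example, reaction times or incomes) can make the p-values inaccurate, especially with small or unequal group sizes. Transformations (such as a log transform) or non-parametric alternatives such as the Kruskal-Wallis test might be considered.
Homogeneity of Variance Violation: If one group is far more variable than the others, the pooled error term no longer describes every group, and the F-test can become too liberal or too conservative, particularly when group sizes are unequal. Welch's ANOVA or a variance-stabilizing transformation may be used.
Independence Violation: If observations are not independent (for example, several measurements from the same subject treated as separate observations), this is pseudoreplication, and the F-test will tend to be overly optimistic. This should be addressed through proper study design or with statistical methods for repeated measures.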
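
The first two assumptions can be checked numerically before running the ANOVA. The following is a minimal sketch, assuming three small example groups (hypothetical values) and using the Shapiro-Wilk and Levene tests from SciPy. With only five observations per group these tests have little power, so a non-significant result is weak evidence at best. Independence cannot be checked this way; it follows from how the data were collected.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three independent groups
group1 = np.array([10, 12, 15, 8, 11])
group2 = np.array([14, 18, 20, 16, 22])
group3 = np.array([8, 10, 12, 11, 14])
groups = {"group1": group1, "group2": group2, "group3": group3}

# Normality: Shapiro-Wilk test within each group (H0: the data are normal)
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variance: Levene's test across groups (H0: equal variances)
levene_stat, levene_p = stats.levene(group1, group2, group3)

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value (for example, below 0.05) in either test suggests that the corresponding assumption may not hold.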

Q2. What are the three types of ANOVA, and in what situations would each be used?

Types of ANOVA:

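One-Way ANOVA: Used when comparing the means of three or more independent groups defined by a single factor (with only two groups it is equivalent to an independent two-sample t-test).
Two-Way ANOVA: Used when two independent variables (factors) may affect the dependent variable; it tests the main effect of each factor and their interaction.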
Repeated Measures ANOVA: Used when measurements are taken on the same set of subjects under different conditions (within-subjects design).

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning of Variance:

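ANOVA decomposes the total sum of squares (SST) into two components: the explained (between-group) sum of squares (SSE) and the residual (within-group) sum of squares (SSR).
SST = SSE + SSR
Note that many textbooks use SSE for the error (residual) sum of squares; here SSE denotes the explained part throughout.

Importance: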

Understanding how variance is partitioned helps identify the sources of variation in the data.
It allows us to assess the proportion of total variance explained by the model (SSE) and the proportion that remains unexplained (SSR).

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?



In [1]:
import scipy.stats as stats
import numpy as np

# Example data for three groups
group1 = [10, 12, 15, 8, 11]
group2 = [14, 18, 20, 16, 22]
group3 = [8, 10, 12, 11, 14]

# Combine data from all groups
all_data = np.concatenate([group1, group2, group3])

# Calculate SST (total variation around the grand mean)
grand_mean = np.mean(all_data)
sst = np.sum((all_data - grand_mean)**2)

# Calculate SSE (explained, between-group): group size times the squared
# deviation of each group mean from the grand mean
sse = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in (group1, group2, group3))

# Calculate SSR (residual, within-group)
ssr = sst - sse

print(f"SST: {sst:.2f}")
print(f"SSE: {sse:.2f}")
print(f"SSR: {ssr:.2f}")


SST: 245.60
SSE: 158.80
SSR: 86.80
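
A useful cross-check on any sums-of-squares calculation is to compute the between-group and within-group parts independently, confirm that they add up to the total, and confirm that the F-statistic built from them matches SciPy's. The sketch below does this for the same three groups; the variable names (ss_between, ss_within, f_manual) are illustrative only, with ss_between corresponding to SSE and ss_within to SSR in the notation used here.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([10, 12, 15, 8, 11]),
    np.array([14, 18, 20, 16, 22]),
    np.array([8, 10, 12, 11, 14]),
]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()
k = len(groups)          # number of groups
n_total = all_data.size  # total number of observations

# Between-group (explained) sum of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group (residual) sum of squares, computed directly
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total sum of squares
ss_total = ((all_data - grand_mean) ** 2).sum()

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"Between + within: {ss_between + ss_within:.2f}, total: {ss_total:.2f}")
print(f"F (manual): {f_manual:.4f}, F (scipy): {f_scipy:.4f}")
```

If the between-group term ever equals the total, the group means have most likely been replaced by the grand mean somewhere in the calculation.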


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?



In [2]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd

# Example data for two-way ANOVA
data = {'A': ['A1']*5 + ['A2']*5 + ['A3']*5,
        'B': ['B1', 'B2', 'B1', 'B2', 'B1']*3,
        'Value': [10, 12, 15, 8, 11, 14, 18, 20, 16, 22, 8, 10, 12, 11, 14]}

df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Value ~ A * B', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

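# Mean squares (sum_sq / df) for each main effect and the interaction.
# These measure the variation attributable to each term; the F and PR(>F)
# columns of anova_table test whether each effect is significant.
main_effect_A = anova_table.loc['A', 'sum_sq'] / anova_table.loc['A', 'df']
main_effect_B = anova_table.loc['B', 'sum_sq'] / anova_table.loc['B', 'df']
interaction_effect = anova_table.loc['A:B', 'sum_sq'] / anova_table.loc['A:B', 'df']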

print(f"Main Effect A: {main_effect_A}")
print(f"Main Effect B: {main_effect_B}")
print(f"Interaction Effect: {interaction_effect}")


Main Effect A: 79.40000000000005
Main Effect B: 8.100000000000003
Interaction Effect: 0.43333333333334045


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

The p-value (0.02) is below the conventional significance level of 0.05, so you would reject the null hypothesis that all group means are equal. This indicates that at least one group mean differs from the others; it does not say which groups differ, so a post-hoc test is needed for that. The F-statistic (5.23) is the ratio of the between-group variance to the within-group variance: the between-group mean square is about 5.2 times the within-group mean square. For given degrees of freedom, a larger F-statistic is stronger evidence against the null hypothesis.
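
The link between an F-statistic and its p-value can be sketched with SciPy's F distribution. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are assumptions for illustration only, and the resulting p-value need not equal the 0.02 quoted in the question.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical degrees of freedom (not given in the question):
# 3 groups -> df_between = 2; 30 observations -> df_within = 27
df_between = 2
df_within = 27

# p-value: upper-tail area of the F distribution beyond the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value at the 5% significance level
critical_value = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha = 0.05: {critical_value:.3f}")
print(f"Reject H0: {f_statistic > critical_value}")
```

Comparing the p-value with 0.05 and comparing the F-statistic with the critical value are two ways of making the same decision.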

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in repeated measures ANOVA can be done using various methods:

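Pairwise deletion: Uses every case that has the data needed for a given comparison, so different comparisons may be based on different subsets of subjects.
Listwise deletion: Removes every subject with any missing measurement.
Imputation: Replaces missing values with estimated values.

Potential consequences:

Listwise deletion reduces the sample size and statistical power, and both deletion methods can bias the results unless the data are missing completely at random.
Single-value imputation (such as mean imputation) understates variability, which can make standard errors too small; multiple imputation reflects the extra uncertainty but adds complexity.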
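
The trade-off can be seen on a small example. The following is a minimal sketch, assuming hypothetical wide-format data with one row per subject and one column per condition (the column names t1, t2, t3 are made up for illustration). It compares listwise deletion with mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per condition
scores = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "t1": [10.0, 12.0, 11.0, 9.0, 13.0],
    "t2": [11.0, np.nan, 12.0, 10.0, 14.0],
    "t3": [12.0, 14.0, np.nan, 11.0, 15.0],
})
conditions = ["t1", "t2", "t3"]

# Listwise deletion: keep only subjects with complete data
listwise = scores.dropna(subset=conditions)

# Mean imputation: fill each missing value with the mean of its column
imputed = scores.copy()
imputed[conditions] = imputed[conditions].fillna(imputed[conditions].mean())

print(f"Subjects before deletion: {len(scores)}, after: {len(listwise)}")
print(f"Variance of t2, observed values only: {scores['t2'].var():.4f}")
print(f"Variance of t2, after mean imputation: {imputed['t2'].var():.4f}")
```

Listwise deletion discards two of the five subjects here, while mean imputation keeps all five but shrinks the variance of the imputed column.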

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

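Common post-hoc tests include Tukey's HSD, the Bonferroni correction, and Scheffé's method. Tukey's HSD is suited to comparing all pairs of group means. The Bonferroni correction is simple and works well for a small number of planned comparisons, but it becomes conservative as the number of comparisons grows. Scheffé's method is the most conservative of the three and is suited to complex contrasts beyond simple pairwise comparisons. Post-hoc tests are used when ANOVA indicates a significant difference among groups, because the ANOVA itself does not identify which specific groups differ.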

Example: In a one-way ANOVA comparing the effectiveness of three teaching methods, if ANOVA shows a significant difference, a post-hoc test would be necessary to determine which pairs of teaching methods are significantly different from each other.
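
The teaching-methods scenario could be sketched as follows, assuming hypothetical scores for three methods and using scipy.stats.tukey_hsd (available in SciPy 1.8 and later). The data and the labels A, B, and C are illustrative only.

```python
from scipy import stats

# Hypothetical test scores for three teaching methods
method_a = [10, 12, 15, 8, 11]
method_b = [14, 18, 20, 16, 22]
method_c = [8, 10, 12, 11, 14]

# Step 1: omnibus one-way ANOVA
f_statistic, p_value = stats.f_oneway(method_a, method_b, method_c)
print(f"ANOVA: F = {f_statistic:.3f}, p = {p_value:.4f}")

# Step 2: Tukey's HSD for all pairwise comparisons
# (normally interpreted only when the ANOVA is significant)
result = stats.tukey_hsd(method_a, method_b, method_c)

labels = ["A", "B", "C"]
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        # result.pvalue[i, j] is the adjusted p-value for comparing group i with group j
        print(f"{labels[i]} vs {labels[j]}: adjusted p = {result.pvalue[i, j]:.4f}")
```

In this illustrative data, method B differs from both A and C, while A and C do not differ from each other.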

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.



In [3]:
import scipy.stats as stats

# Example data for three diets (illustrative values, 25 per diet; replace with the study's measurements)
diet_A = [2, 3, 4, 5, 3, 2, 4, 6, 5, 3, 2, 4, 5, 3, 2, 4, 6, 5, 3, 2, 4, 5, 3, 2, 4]
diet_B = [3, 4, 5, 6, 4, 3, 5, 7, 6, 4, 3, 5, 6, 4, 3, 5, 7, 6, 4, 3, 5, 6, 4, 3, 5]
diet_C = [4, 5, 6, 7, 5, 4, 6, 8, 7, 5, 4, 6, 7, 5, 4, 6, 8, 7, 5, 4, 6, 7, 5, 4, 6]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("Reject the null hypothesis. There are significant differences between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the mean weight loss of the three diets.")


F-statistic: 15.09054325955734
P-value: 3.3622599171676966e-06
Reject the null hypothesis. There are significant differences between the mean weight loss of the three diets.


Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.



In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with the data
data = pd.DataFrame({
    'Software': ['A']*10 + ['B']*10 + ['C']*10,
    'Experience': ['Novice']*5 + ['Experienced']*5 + ['Novice']*5 + ['Experienced']*5 + ['Novice']*5 + ['Experienced']*5,
    'Time': [15, 16, 14, 18, 17, 20, 25, 22, 24, 23, 12, 13, 14, 15, 16, 18, 22, 21, 20, 19, 10, 11, 12, 13, 14, 28, 26, 27, 30, 29]
})

# Fit a two-way ANOVA model
model = ols('Time ~ C(Software) * C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)


                           sum_sq    df           F        PR(>F)
C(Software)                  50.4   2.0    9.333333  1.003391e-03
C(Experience)               691.2   1.0  256.000000  2.641534e-14
C(Software):C(Experience)   154.4   2.0   28.592593  4.454649e-07
Residual                     64.8  24.0         NaN           NaN


Interpretation:

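All three p-values in the ANOVA table are below the 0.05 significance level:

Software: F = 9.33, p ≈ 0.001, so mean completion time differs across the three programs.
Experience: F = 256.00, p ≈ 2.6e-14, so mean completion time differs between novice and experienced employees.
Software:Experience interaction: F = 28.59, p ≈ 4.5e-07, so the effect of the program depends on the employee's experience level.

Because the interaction is significant, the main effects should be interpreted with caution and the individual cell means examined.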
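
When the interaction term is significant, as it is in the table above, the cell means show where the effect comes from. The following sketch reuses the same data as the cell above and tabulates the mean time for each Software and Experience combination; the names cell_means and gap are illustrative.

```python
import pandas as pd

# Same data as in the two-way ANOVA cell above
data = pd.DataFrame({
    'Software': ['A']*10 + ['B']*10 + ['C']*10,
    'Experience': (['Novice']*5 + ['Experienced']*5) * 3,
    'Time': [15, 16, 14, 18, 17, 20, 25, 22, 24, 23,
             12, 13, 14, 15, 16, 18, 22, 21, 20, 19,
             10, 11, 12, 13, 14, 28, 26, 27, 30, 29]
})

# Mean completion time per cell: rows are programs, columns are experience levels
cell_means = data.groupby(['Software', 'Experience'])['Time'].mean().unstack()
print(cell_means)

# Difference between experienced and novice employees within each program
gap = cell_means['Experienced'] - cell_means['Novice']
print(gap)
```

The gap between experience levels is much larger for Program C than for Programs A and B, which is the pattern the interaction term detects.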

Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.



In [5]:
import scipy.stats as stats

# Example data for control group
control_group = [75, 78, 80, 82, 79, 77, 76, 81, 83, 79, 74, 78, 80, 82, 79]

# Example data for experimental group
experimental_group = [85, 88, 90, 92, 89, 87, 86, 91, 93, 89, 84, 88, 90, 92, 89]

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("Reject the null hypothesis. There is a significant difference in test scores between the control and experimental groups.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in test scores between the control and experimental groups.")


T-statistic: -10.472806010719456
P-value: 3.437682700718616e-11
Reject the null hypothesis. There is a significant difference in test scores between the control and experimental groups.
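
With only two groups, a significant t-test already shows which groups differ, so no separate post-hoc test is needed; the negative t-statistic indicates that the control group's mean is lower than the experimental group's. As a brief illustration, the sketch below uses the same example data to show that a one-way ANOVA on two groups gives the same answer as the pooled-variance t-test, with F equal to t squared.

```python
import math
from scipy import stats

control_group = [75, 78, 80, 82, 79, 77, 76, 81, 83, 79, 74, 78, 80, 82, 79]
experimental_group = [85, 88, 90, 92, 89, 87, 86, 91, 93, 89, 84, 88, 90, 92, 89]

# Pooled-variance two-sample t-test (the default for ttest_ind)
t_statistic, t_p = stats.ttest_ind(control_group, experimental_group)

# One-way ANOVA on the same two groups
f_statistic, f_p = stats.f_oneway(control_group, experimental_group)

print(f"t squared: {t_statistic**2:.4f}")
print(f"F:         {f_statistic:.4f}")
print(f"p-values agree: {math.isclose(t_p, f_p, rel_tol=1e-4)}")
```

Post-hoc tests only become necessary when three or more groups are compared.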


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.



In [11]:
import pandas as pd
import pingouin as pg

# Sample data (replace with your actual data)
data = {
    'Store_A': [30, 35, 25, 40, 45, 50, 35, 40, 30, 45, 35, 40, 50, 55, 45, 50, 55, 60, 50, 55, 45, 40, 35, 30, 25, 20, 15, 10, 20, 30],
    'Store_B': [25, 30, 40, 35, 50, 45, 40, 35, 30, 25, 20, 15, 25, 30, 40, 35, 40, 30, 25, 20, 30, 35, 40, 45, 50, 55, 60, 55, 50, 45],
    'Store_C': [20, 25, 30, 35, 25, 20, 30, 35, 40, 45, 50, 55, 45, 40, 50, 55, 60, 50, 45, 55, 60, 35, 40, 45, 50, 55, 40, 35, 30, 25],
}

df = pd.DataFrame(data)

# Add a 'Subject' column that identifies each day (the unit measured repeatedly)
df['Subject'] = range(1, len(df) + 1)

# Reshape to long format: one row per (Subject, Store) observation
long_df = df.melt(id_vars=['Subject'], value_name='Sales', var_name='Store')

# Perform repeated measures ANOVA
rm_anova = pg.rm_anova(data=long_df, dv='Sales', within='Store', subject='Subject')

# Print ANOVA results
print(rm_anova)

# Post-hoc test only if the ANOVA is significant
# (newer pingouin releases name this function pg.pairwise_tests)
if rm_anova['p-unc'].iloc[0] < 0.05:
    posthoc = pg.pairwise_ttests(data=long_df, dv='Sales', within='Store', subject='Subject', parametric=True, padjust='bonf')
    print(posthoc)


  Source  ddof1  ddof2         F     p-unc  p-GG-corr       ng2       eps  \
0  Store      2     58  0.798885  0.454718   0.435682  0.021104  0.837901   

   sphericity   W-spher   p-spher  
0       False  0.806541  0.049291  


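This example assumes the data is in wide format, with each store's sales in a separate column and one row per day. The 'Subject' column identifies the day, which is the unit measured repeatedly, and 'Store' is the within-subject factor.

In the output above, F(2, 58) = 0.80 with p = 0.455, so there is no significant difference in mean daily sales between the three stores, and the post-hoc block is not executed. Mauchly's test indicates that sphericity is violated (p-spher = 0.049), but the Greenhouse-Geisser corrected p-value (0.436) leads to the same conclusion.

Make sure to replace the sample data with your actual data. When the ANOVA is significant, the pairwise_ttests function performs Bonferroni-adjusted paired comparisons between the levels of the within-subject factor (stores).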