Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Assumptions of ANOVA:

Normality: Within each group, the dependent variable (equivalently, the model residuals) should be approximately normally distributed.
Homogeneity of Variances: The variances of the dependent variable should be approximately equal across groups.
Independence: Observations within and between groups should be independent of each other.

Examples of Violations:

Non-Normality: ANOVA tolerates moderate departures from normality when group sizes are reasonably large, but strongly skewed or heavy-tailed data in small samples can produce inaccurate p-values. Transformations (for example, a log transform) or non-parametric alternatives such as the Kruskal-Wallis test may be considered.
Heteroscedasticity (Unequal Variances): When group variances differ, especially with unequal group sizes, the pooled error term no longer represents the within-group variability, so the F-test can be too liberal or too conservative. Using Welch's ANOVA or transforming the data may be options.
Dependent Observations: Violations of independence, such as repeated measurements on the same subjects analyzed as if they were separate subjects, understate the error variance and can lead to inflated Type I error rates.
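
The first two assumptions can be checked numerically before running the ANOVA. The following is a minimal sketch, assuming hypothetical simulated groups (group_a, group_b, group_c are illustrative names, not data from this assignment) and using the Shapiro-Wilk and Levene tests from SciPy. Independence cannot be tested this way; it has to be judged from the study design.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical groups: group_c is simulated with a much larger spread
group_a = rng.normal(loc=10, scale=1.0, size=30)
group_b = rng.normal(loc=11, scale=1.0, size=30)
group_c = rng.normal(loc=12, scale=4.0, size=30)
groups = (group_a, group_b, group_c)

# Normality within each group (Shapiro-Wilk; H0: the sample is normal)
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

# Homogeneity of variances (Levene; H0: all group variances are equal)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
print(f"Levene p-value: {levene_p:.4f}")
```

A small p-value from either test signals that the corresponding assumption is doubtful for that data set.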


Q2. What are the three types of ANOVA, and in what situations would each be used?

Three Types of ANOVA:

One-Way ANOVA: Used when comparing means of three or more independent groups (treatments).
Two-Way ANOVA: Examines the influence of two different categorical independent variables on one dependent variable.
Repeated Measures ANOVA: Used when the same subjects are used for each treatment (within-subjects design).

Situations for Each:

One-Way ANOVA: Comparing the average scores across different categories or groups.
Two-Way ANOVA: Investigating the impact of two independent variables and their interaction on a dependent variable.
Repeated Measures ANOVA: Assessing changes in a dependent variable over time or under different conditions within the same subjects.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning of Variance:
In ANOVA, the total variance in the data is partitioned into different components:

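Total Sum of Squares (SST): Variability of the individual data points around the overall mean.
Explained Sum of Squares (SSE): Variability of the group means around the overall mean, weighted by group size (the between-group sum of squares). Note that many textbooks use SSE for the error sum of squares; here SSE always means explained.
Residual Sum of Squares (SSR): Variability of the observations around their own group mean (the within-group sum of squares).

The components add up: SST = SSE + SSR.

Importance: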
Understanding the partitioning of variance helps quantify how much of the total variability in the data is due to the differences between groups and how much is due to random variability within groups. This understanding is crucial for interpreting the F-statistic and drawing meaningful conclusions from the analysis.
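
Written out for a one-way design with k groups, n_j observations in group j, and N observations in total, the partition can be sketched as follows. The symbols x_{ij}, \bar{x}_j, and \bar{x} are notation introduced here for illustration; SSE and SSR follow this document's usage (explained and residual).

```latex
\begin{align*}
SST &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2 \\
SSE &= \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2 \\
SSR &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 \\
SST &= SSE + SSR \\
F &= \frac{SSE / (k - 1)}{SSR / (N - k)}
\end{align*}
```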

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [2]:
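import numpy as np

# Three example groups (one factor with three levels)
group1 = [3, 4, 5, 6]
group2 = [7, 8, 9, 10]
group3 = [11, 12, 13, 14]
groups = [group1, group2, group3]

# Pool all observations into an array and compute the overall mean
all_data = np.array(group1 + group2 + group3)
overall_mean = np.mean(all_data)

# SST: squared deviations of every observation from the overall mean
sst = np.sum((all_data - overall_mean) ** 2)

# SSE (explained): squared deviations of group means from the overall mean,
# weighted by group size
group_means = [np.mean(group) for group in groups]
sse = np.sum([len(group) * (mean - overall_mean) ** 2
              for group, mean in zip(groups, group_means)])

# SSR (residual): the within-group variability left after removing SSE
ssr = sst - sse
print(f"SST: {sst}")
print(f"SSE: {sse}")
print(f"SSR: {ssr}")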


SST: 143.0
SSE: 128.0
SSR: 15.0
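
The sums of squares above lead directly to the F-statistic. As a cross-check, the sketch below recomputes SSE and SSR for the same three groups, forms F by hand from the mean squares, and compares it with scipy.stats.f_oneway. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

groups = [np.array([3, 4, 5, 6]),
          np.array([7, 8, 9, 10]),
          np.array([11, 12, 13, 14])]
all_data = np.concatenate(groups)
overall_mean = all_data.mean()

# Explained (between-group) and residual (within-group) sums of squares
sse = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
ssr = sum(((g - g.mean()) ** 2).sum() for g in groups)

k = len(groups)          # number of groups
n_total = all_data.size  # total number of observations

# F is the ratio of the two mean squares
f_manual = (sse / (k - 1)) / (ssr / (n_total - k))

# SciPy's one-way ANOVA should give the same F
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"Manual F: {f_manual:.2f}")
print(f"SciPy F:  {f_scipy:.2f}, p-value: {p_scipy:.2e}")
```

Here SSE = 128 and SSR = 15 with 2 and 9 degrees of freedom, so F = 64 / (15/9) = 38.4.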


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
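import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data: a 2 x 2 design with two observations per cell.
# Replication is needed so that the model with an interaction term
# still leaves residual degrees of freedom for the F-tests.
data = pd.DataFrame({
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B2', 'B2'],
    'Value': [3, 4, 5, 7, 7, 8, 9, 12]
})

# Fit a two-way ANOVA model with both main effects and their interaction
formula = 'Value ~ A + B + A:B'
model = ols(formula, data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Select the rows for the main effects and the interaction by label
columns = ['sum_sq', 'F', 'PR(>F)']
main_effects = anova_table.loc[['A', 'B'], columns]
interaction_effect = anova_table.loc['A:B', columns]

print(f"Main Effects:\n{main_effects}")
print(f"Interaction Effect:\n{interaction_effect}")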


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

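An F-statistic of 5.23 means the between-group mean square is about 5.23 times the within-group mean square. Because the p-value of 0.02 is below the conventional significance level of 0.05, you can reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from the others. The ANOVA itself does not identify which groups differ; a post-hoc test such as Tukey's HSD is needed for that.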
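
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not give. The sketch below assumes 3 groups and 18 observations (degrees of freedom 2 and 15), chosen only because they are roughly consistent with the reported p-value; they are not taken from the question.

```python
from scipy import stats

f_statistic = 5.23
df_between, df_within = 2, 15  # assumed: 3 groups, 18 observations in total

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value that F must exceed to be significant at alpha = 0.05
critical_f = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha = 0.05: {critical_f:.2f}")
print("Reject H0" if f_statistic > critical_f else "Fail to reject H0")
```

Under these assumed degrees of freedom, comparing F with the critical value and comparing the p-value with 0.05 lead to the same decision.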

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in repeated measures ANOVA is crucial for obtaining valid results. There are several methods to handle missing data:

Complete Case Analysis (CCA): Exclude every subject with any missing measurement. This reduces sample size and power, and it biases results unless the data are missing completely at random.

Mean Imputation: Replace missing values with the mean of the observed values. This shrinks the variance artificially, so standard errors are underestimated and results can be biased.

Interpolation or Last Observation Carried Forward (LOCF): Use adjacent data points to estimate missing values. This assumes the outcome did not change during the gap and introduces bias when it did.

Multiple Imputation: Generate several plausible imputed datasets, analyze each one, and combine the results. This carries the uncertainty about the missing values through to the final estimates.
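
The first three strategies are easy to compare on a toy example. The sketch below assumes a hypothetical wide-format table (one row per subject, one column per time point; the names scores, t1, t2, t3 are illustrative) and uses standard pandas calls. Multiple imputation is not shown because it needs a dedicated imputation model.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: 4 subjects, 3 time points, 2 missing values
scores = pd.DataFrame(
    {"t1": [10.0, 12.0, 11.0, 9.0],
     "t2": [11.0, np.nan, 12.0, 10.0],
     "t3": [13.0, 14.0, np.nan, 12.0]},
    index=["s1", "s2", "s3", "s4"],
)

# Complete case analysis: drop every subject with any missing value
complete_cases = scores.dropna()

# Mean imputation: fill each gap with the mean of that time point
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's last observed value forward in time
locf = scores.ffill(axis=1)

print("Subjects kept by complete case analysis:", list(complete_cases.index))
print("Mean-imputed t2 for s2:", mean_imputed.loc["s2", "t2"])
print("LOCF t2 for s2:", locf.loc["s2", "t2"])
```

Complete case analysis keeps only two of the four subjects here, while the two imputation methods fill the same gap with different values (11.0 versus 12.0), which shows how the choice of method can change the data that enter the ANOVA.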

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Common post-hoc tests used after ANOVA include:

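Tukey's Honestly Significant Difference (HSD): Used to compare all pairs of group means while controlling the familywise Type I error rate. It is exact for equal sample sizes; the Tukey-Kramer variant handles unequal sizes.
Bonferroni Correction: Divides the significance level by the number of comparisons to control the familywise error rate. It works for any set of comparisons but becomes conservative as their number grows.
Sidak Correction: Similar to Bonferroni but slightly less conservative; it is exact when the comparisons are independent.
Dunnett's Test: Used when comparing each treatment group to a single control group rather than all pairs.
Fisher's Least Significant Difference (LSD): Simple and quick, but it does not adjust for multiple comparisons, so it is the most liberal of these options.

Example: If a one-way ANOVA comparing three diets is significant, it shows only that at least one mean differs. A post-hoc test such as Tukey's HSD is then needed to determine whether A differs from B, A from C, or B from C.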
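
As an illustration of a post-hoc test in practice, the sketch below runs Tukey's HSD with statsmodels on the three example groups from Q4. The labels group1, group2, and group3 are reused here for illustration only.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Observations and their group labels, in matching order
values = np.array([3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype=float)
labels = np.repeat(["group1", "group2", "group3"], 4)

# All pairwise comparisons with the familywise error rate held at 0.05
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)

print(tukey.summary())
print("Reject H0 for each pair:", tukey.reject)
```

The summary lists one row per pair of groups with the mean difference, an adjusted p-value, and a confidence interval, so it shows which specific pairs differ.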

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [10]:
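import scipy.stats as stats
import numpy as np

# Placeholder data: 10 random values per diet stand in for the real
# weight-loss measurements, which are not given. No seed is set, so
# the numbers change on every run.
diet_A = np.random.rand(10)
diet_B = np.random.rand(10)
diet_C = np.random.rand(10)

# One-way ANOVA across the three diets
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)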
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")
if p_value < 0.05:
    print("There is a significant difference in mean weight loss between at least two diets.")
else:
    print("There is no significant difference in mean weight loss between the diets.")


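F-statistic: 0.8951996540625464
P-value: 0.4203073873439122
There is no significant difference in mean weight loss between the diets.

Interpretation: With p ≈ 0.42, well above 0.05, we fail to reject the null hypothesis of equal mean weight loss across the three diets. This is expected here, because the placeholder values for all three diets are drawn from the same uniform distribution.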


Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [11]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate random data
np.random.seed(42)  # Set seed for reproducibility
data = pd.DataFrame({
    'Software': np.random.choice(['A', 'B', 'C'], size=30),
    'Experience': np.random.choice(['Novice', 'Experienced'], size=30),
    'Time': np.random.randint(10, 30, size=30)
})

# Fit two-way ANOVA model
formula = 'Time ~ Software + Experience + Software:Experience'
model = ols(formula, data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print results
print(anova_table)

# Interpretation: report every effect against alpha = 0.05
print("\nInterpretation:")
alpha = 0.05
for effect in ['Software', 'Experience', 'Software:Experience']:
    p = anova_table.loc[effect, 'PR(>F)']
    if p < alpha:
        print(f"Significant effect of {effect} (p = {p:.3f}).")
    else:
        print(f"No significant effect of {effect} (p = {p:.3f}).")

                         sum_sq    df         F    PR(>F)
Software              18.499381   2.0  0.262465  0.771333
Experience             3.099871   1.0  0.087960  0.769338
Software:Experience    6.641795   2.0  0.094232  0.910406
Residual             845.800000  24.0       NaN       NaN

Interpretation:
No significant effect of Software (p = 0.771).
No significant effect of Experience (p = 0.769).
No significant effect of Software:Experience (p = 0.910).

Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [19]:
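import scipy.stats as stats
import numpy as np

# Placeholder data: 10 random values per group stand in for the real
# test scores, which are not given. No seed is set, so the numbers
# change on every run.
control_group = np.random.rand(10)
experimental_group = np.random.rand(10)

# Independent two-sample t-test (assumes equal variances by default)
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)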
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")
if p_value < 0.05:
    print("There is a significant difference in test scores between the control and experimental groups.")
else:
    print("There is no significant difference in test scores between the groups.")


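T-statistic: 2.5704487365178315
P-value: 0.019257584836331103
There is a significant difference in test scores between the control and experimental groups.

Interpretation: With p ≈ 0.019 < 0.05, the test rejects the null hypothesis of equal mean scores. Because both placeholder samples are drawn from the same uniform distribution, this particular rejection is a false positive (Type I error) and says nothing about a real teaching method. No post-hoc test is needed: with only two groups, the t-test already compares the only possible pair.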


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [20]:
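import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Placeholder data: 30 days, each with one sales figure per store.
# Day acts as the repeated "subject"; Store is the within-subject factor.
np.random.seed(42)  # Set seed for reproducibility
data = pd.DataFrame({
    'Day': list(range(30)) * 3,
    'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    # size=90 draws one value per row; without it a single scalar is
    # repeated for every row and the ANOVA has no variability to test
    'Sales': np.random.randint(90, size=90)
})

# Repeated measures ANOVA: Sales by Store, with Day as the subject
rm_anova = AnovaRM(data, depvar='Sales', subject='Day', within=['Store'])
results = rm_anova.fit()
print(results)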
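
The question also asks for a post-hoc follow-up when the repeated measures ANOVA is significant. One common option is pairwise paired t-tests with a Bonferroni correction. The sketch below assumes hypothetical simulated sales in wide format (one row per day, one column per store); the means and spreads are made up for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical daily sales for 30 days, one column per store
sales = pd.DataFrame({
    "A": rng.normal(100, 10, size=30),
    "B": rng.normal(105, 10, size=30),
    "C": rng.normal(120, 10, size=30),
})

pairs = list(combinations(sales.columns, 2))  # (A, B), (A, C), (B, C)
adjusted_p = {}
for first, second in pairs:
    # Paired test: the same days are measured in both stores
    t_stat, p_raw = stats.ttest_rel(sales[first], sales[second])
    # Bonferroni: multiply by the number of comparisons, capped at 1
    adjusted_p[(first, second)] = min(p_raw * len(pairs), 1.0)

for (first, second), p_adj in adjusted_p.items():
    verdict = "differ" if p_adj < 0.05 else "do not differ significantly"
    print(f"Store {first} vs Store {second}: adjusted p = {p_adj:.4f} ({verdict})")
```

The paired test is used because each day contributes one observation to every store, which matches the repeated measures design.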

