Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.
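
A minimal sketch of how two of the usual ANOVA assumptions could be screened in Python. The usual assumptions are independent observations, approximately normal values within each group, and roughly equal group variances. Independence depends on the study design (for example, measuring the same subject in several groups violates it) and is not tested here. The simulated groups and all variable names are illustrative assumptions; group C is deliberately given a larger spread to mimic a violation of homogeneity of variance.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups, group C has a much larger spread
rng = np.random.default_rng(0)
samples = {
    "A": rng.normal(50, 5, 30),
    "B": rng.normal(52, 5, 30),
    "C": rng.normal(55, 15, 30),
}

# Normality within each group: Shapiro-Wilk (small p suggests non-normality)
for name, values in samples.items():
    w_stat, p_normal = stats.shapiro(values)
    print(f"Group {name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_normal:.3f}")

# Homogeneity of variance: Levene's test (small p suggests unequal variances)
levene_stat, levene_p = stats.levene(*samples.values())
print(f"Levene statistic = {levene_stat:.3f}, p = {levene_p:.4f}")
```

When Levene's test rejects equal variances, a Welch-type ANOVA or a non-parametric alternative such as Kruskal-Wallis is usually considered instead of the classical F-test.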

Q2. What are the three types of ANOVA, and in what situations would each be used?

There are three main types of ANOVA, each used depending on the experimental design and the research question being addressed. These types are:

1. One-Way ANOVA
Description: One-way ANOVA is used to compare the means of three or more independent groups based on a single factor (independent variable). It tests whether there is a statistically significant difference between the group means.

When to Use:
When you have one categorical independent variable (factor) with multiple levels (groups) and one continuous dependent variable.
Example: Comparing the average exam scores of students across three different teaching methods (Group A, Group B, Group C). The independent variable is the "teaching method" (with three levels), and the dependent variable is the "exam score".

2. Two-Way ANOVA
Description: Two-way ANOVA is used to evaluate the effect of two independent variables (factors) simultaneously on a dependent variable. It also assesses whether there is an interaction effect between the two factors.

When to Use:
When you have two categorical independent variables, each with multiple levels, and one continuous dependent variable.
Example: Analyzing the effect of both "teaching method" (three levels: A, B, C) and "student gender" (two levels: male, female) on exam scores. Here, you are interested in the main effects of each factor (teaching method and gender) as well as any interaction effect between the two.

3. Repeated Measures ANOVA
Description: Repeated measures ANOVA is used when the same subjects are measured multiple times under different conditions or over time. It accounts for the correlation between repeated measurements on the same subjects.

When to Use:
When you have one group of subjects measured multiple times (e.g., at different time points) or under different conditions.
Example: Measuring the blood pressure of a group of participants before, during, and after a treatment. The dependent variable is "blood pressure," and the repeated measurements allow for analyzing changes over time.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning of Variance in ANOVA
ANOVA (Analysis of Variance) is a statistical technique that partitions the total variability in a dataset (the total sum of squares) into two components: between-group variance and within-group variance.

Between-group variance: This measures the variability between the means of different groups. It reflects how much the means of the groups differ from each other.

Within-group variance: This measures the variability within each group. It reflects how much the individual data points within a group differ from the group's mean.

Why is Partitioning of Variance Important?

Hypothesis Testing: ANOVA's primary goal is to determine whether the means of multiple groups differ significantly. Partitioning the variance lets us compare between-group variability with within-group variability through the F-statistic, the ratio of the between-group mean square to the within-group mean square. If this ratio is large enough, the differences between the means are unlikely to be due to chance alone.

Effect Size: Understanding the partitioning of variance helps us assess the magnitude of the differences between the groups. A large between-group variance relative to the within-group variance indicates a substantial effect size.

Understanding Variation: Partitioning variance provides insight into the sources of variability in the data. This can help researchers identify factors that are driving the differences between groups.
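
The partition can be checked numerically. The sketch below uses a small made-up dataset (the group names and values are assumptions for illustration) to show that the total sum of squares equals the between-group plus within-group sums of squares, and computes eta-squared, one common effect-size measure defined as the between-group share of the total.

```python
import numpy as np

# Hypothetical data: three groups of three observations each
groups = {
    "low": np.array([4.0, 5.0, 6.0]),
    "mid": np.array([7.0, 8.0, 9.0]),
    "high": np.array([10.0, 11.0, 12.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Total variability: squared deviations of every value from the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# Between-group variability: group means around the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())

# Within-group variability: values around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

# Eta-squared: proportion of total variability explained by group membership
eta_squared = ss_between / ss_total

print(f"SS total = {ss_total}, SS between = {ss_between}, SS within = {ss_within}")
print(f"eta-squared = {eta_squared:.2f}")
```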

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
import pandas as pd

# Example dataset
data = {
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [5, 7, 9, 8, 10, 12, 15, 17, 19]
}

# Convert to a DataFrame
df = pd.DataFrame(data)

# Calculate overall mean
overall_mean = df['Value'].mean()

# Calculate group means
group_means = df.groupby('Group')['Value'].mean()

# Total Sum of Squares (SST)
sst = sum((df['Value'] - overall_mean) ** 2)

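# Explained Sum of Squares (SSE), i.e. the between-group sum of squares.
# Naming follows the question; many texts use SSE for the error (residual) term instead.
sse = sum(df.groupby('Group').size() * (group_means - overall_mean) ** 2)

# Residual Sum of Squares (SSR), i.e. the within-group sum of squares
ssr = sum((df['Value'] - df['Group'].map(group_means)) ** 2)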

# Print the results
print(f"Total Sum of Squares (SST): {sst}")
print(f"Explained Sum of Squares (SSE): {sse}")
print(f"Residual Sum of Squares (SSR): {ssr}")


Total Sum of Squares (SST): 182.0
Explained Sum of Squares (SSE): 158.0
Residual Sum of Squares (SSR): 24.0
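
As a follow-up sketch, the sums of squares printed above can be turned into the F-statistic by dividing each by its degrees of freedom. The variable names are illustrative; the cross-check uses `scipy.stats.f_oneway` on the same nine values.

```python
from scipy import stats

# Sums of squares from the output above (naming follows the question)
sse_between = 158.0  # explained / between-group
ssr_within = 24.0    # residual / within-group

k = 3  # number of groups
n = 9  # total number of observations

df_between = k - 1
df_within = n - k

# Mean squares and F-statistic
ms_between = sse_between / df_between
ms_within = ssr_within / df_within
f_manual = ms_between / ms_within

# p-value from the upper tail of the F distribution
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA on the same data
f_scipy, p_scipy = stats.f_oneway([5, 7, 9], [8, 10, 12], [15, 17, 19])

print(f"Manual F = {f_manual:.2f}, p = {p_manual:.4f}")
print(f"SciPy  F = {f_scipy:.2f}, p = {p_scipy:.4f}")
```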


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example dataset
data = {
    'Factor_A': ['Low', 'Low', 'Low', 'High', 'High', 'High', 'Low', 'Low', 'Low', 'High', 'High', 'High'],
    'Factor_B': ['Type1', 'Type1', 'Type1', 'Type1', 'Type1', 'Type1', 'Type2', 'Type2', 'Type2', 'Type2', 'Type2', 'Type2'],
    'Value': [10, 12, 14, 22, 24, 26, 9, 11, 13, 21, 23, 25]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Value ~ C(Factor_A) + C(Factor_B) + C(Factor_A):C(Factor_B)', data=df).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Display the ANOVA table
print(anova_table)


                               sum_sq   df             F    PR(>F)
C(Factor_A)              4.320000e+02  1.0  1.080000e+02  0.000006
C(Factor_B)              3.000000e+00  1.0  7.500000e-01  0.411694
C(Factor_A):C(Factor_B)  8.998931e-28  1.0  2.249733e-28  1.000000
Residual                 3.200000e+01  8.0           NaN       NaN
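
To see where the table's numbers come from, the main effects and the interaction can also be read directly off the group means. This is a sketch for the balanced 2x2 design above; the helper names (`cell_means`, `effect_a`, and so on) are illustrative. An interaction contrast of zero matches the near-zero interaction sum of squares in the table.

```python
import pandas as pd

# Same data as in the cell above
df = pd.DataFrame({
    'Factor_A': ['Low'] * 3 + ['High'] * 3 + ['Low'] * 3 + ['High'] * 3,
    'Factor_B': ['Type1'] * 6 + ['Type2'] * 6,
    'Value': [10, 12, 14, 22, 24, 26, 9, 11, 13, 21, 23, 25],
})

# Mean of each Factor_A x Factor_B cell (rows: Factor_A, columns: Factor_B)
cell_means = df.groupby(['Factor_A', 'Factor_B'])['Value'].mean().unstack()

# Main effects: differences between marginal means
marginal_a = df.groupby('Factor_A')['Value'].mean()
marginal_b = df.groupby('Factor_B')['Value'].mean()
effect_a = marginal_a['High'] - marginal_a['Low']
effect_b = marginal_b['Type2'] - marginal_b['Type1']

# Interaction: does the effect of Factor_A change across levels of Factor_B?
effect_a_type1 = cell_means.loc['High', 'Type1'] - cell_means.loc['Low', 'Type1']
effect_a_type2 = cell_means.loc['High', 'Type2'] - cell_means.loc['Low', 'Type2']
interaction = effect_a_type1 - effect_a_type2

print(cell_means)
print(f"Main effect of Factor_A (High - Low): {effect_a}")
print(f"Main effect of Factor_B (Type2 - Type1): {effect_b}")
print(f"Interaction contrast: {interaction}")
```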


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

Interpreting ANOVA Results: F-Statistic and p-value
Understanding the Results:

F-Statistic (5.23): This value represents the ratio of between-group variance to within-group variance. A higher F-statistic suggests that the differences between group means are more pronounced relative to the variability within the groups.
p-value (0.02): This value indicates the probability of observing an F-statistic as extreme or more extreme than the one obtained, assuming there are no true differences between the group means. A lower p-value suggests that the observed differences are less likely to be due to chance.

Conclusion:
Given an F-statistic of 5.23 and a p-value of 0.02, we can conclude that there are statistically significant differences between the group means. The p-value is less than the commonly used alpha level of 0.05, indicating that the observed differences are unlikely to be due to chance.

Interpretation:

Reject the null hypothesis: The null hypothesis states that all group means are equal. Since the p-value is less than 0.05, we can reject this null hypothesis.
There are significant differences: This means that at least one group mean differs from at least one other. The ANOVA alone does not say which groups differ.
Further analysis: To pinpoint which specific groups differ, post-hoc tests such as Tukey's HSD or Bonferroni-corrected pairwise comparisons can be used.
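
The decision rule can be sketched in code. The question does not state the degrees of freedom, so the values below (3 groups and 17 observations, giving df = 2 and 14) are purely an assumption chosen for illustration; with them the F distribution gives a p-value close to the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2   # assumed: 3 groups
df_within = 14   # assumed: 17 observations in total
alpha = 0.05

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: smallest F that would be significant at the chosen alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, critical F = {f_critical:.2f}, reject H0: {reject_null}")
```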

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Handling Missing Data in Repeated Measures ANOVA
Repeated measures ANOVA is a statistical technique used to analyze data collected from the same individuals over multiple time points or conditions. When dealing with missing data in this context, it's crucial to handle it appropriately to maintain the integrity of the analysis.

Common Methods to Handle Missing Data
Listwise Deletion: This method involves excluding any participant with missing data from the analysis. While straightforward, it can lead to a significant reduction in sample size, especially if many participants have missing values.

Pairwise Deletion: In this approach, only participants with complete data for a particular pair of time points are included in the analysis for that pair. This can result in different sample sizes for different comparisons, making interpretation more complex.

Mean Imputation: This involves replacing missing values with the mean of the available data for that participant. However, it can underestimate the variance and bias the results if the missing values are not missing at random.

Last Observation Carried Forward (LOCF): This method assumes that a participant's last observed value remains constant until the next measurement. LOCF can introduce bias if the missing values represent a systematic pattern.

Multiple Imputation: This technique involves creating multiple plausible datasets by imputing missing values using statistical models. It provides more accurate estimates but can be computationally intensive.

Potential Consequences of Different Methods

Bias: Different methods can introduce bias into the results. For example, listwise deletion and mean imputation can bias estimates unless the data are missing completely at random.

Loss of Power: Excluding participants with missing data (listwise deletion) can reduce statistical power, making it harder to detect significant effects.

Type I Error Rate: The choice of method can affect the Type I error rate (the probability of falsely rejecting the null hypothesis). Some methods may inflate or deflate the Type I error rate.

Interpretation: The choice of method can influence the interpretation of the results. For instance, LOCF may mask true changes over time if missing values represent a systematic pattern.
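
A small sketch of three of the methods above using pandas, on a made-up wide-format table (one row per subject, one column per time point). The data, column names, and variable names are assumptions for illustration. It shows how listwise deletion shrinks the sample while LOCF and per-participant mean imputation keep every subject but fill in values that were never observed.

```python
import numpy as np
import pandas as pd

# Hypothetical blood-pressure readings at three time points, with two missing values
wide = pd.DataFrame(
    {
        "t1": [120.0, 130.0, 125.0, 140.0],
        "t2": [118.0, np.nan, 123.0, 135.0],
        "t3": [115.0, 126.0, np.nan, 131.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4"], name="subject"),
)

# Listwise deletion: keep only subjects with complete data
listwise = wide.dropna()

# LOCF: carry each subject's last observed value forward across time points
locf = wide.ffill(axis=1)

# Mean imputation: replace a missing value with that subject's own mean
row_mean = wide.apply(lambda row: row.fillna(row.mean()), axis=1)

print(f"Subjects kept by listwise deletion: {len(listwise)} of {len(wide)}")
print(locf)
print(row_mean)
```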

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Common Post-Hoc Tests in ANOVA
Post-hoc tests are used to identify which specific groups differ significantly from each other after a significant ANOVA result. Here are some common post-hoc tests:

Tukey's Honestly Significant Difference (HSD)
When to use: When you want to compare all possible pairs of group means.
Advantages: Controls the family-wise error rate (FWER), making it suitable for multiple comparisons.
Disadvantages: Can be overly conservative, especially with unequal sample sizes.

Bonferroni Correction
When to use: When you want to compare all possible pairs of group means.
Advantages: Controls the FWER.
Disadvantages: Can be overly conservative, especially with many comparisons.

Sidak Correction
When to use: When you want to compare all possible pairs of group means.
Advantages: Controls the FWER, but it's generally less conservative than Bonferroni.
Disadvantages: Can still be conservative in some cases.

Dunnett's Test
When to use: When you want to compare all groups to a control group.
Advantages: More powerful than Tukey's HSD for comparing to a control group.
Disadvantages: Less powerful for comparing non-control groups.

Fisher's Least Significant Difference (LSD)
When to use: When the overall ANOVA is significant, there are only a few groups, and you are willing to accept a higher FWER.
Advantages: More powerful than other post-hoc tests.
Disadvantages: Can be overly liberal, increasing the risk of Type I errors.

Example:
Scenario: A researcher conducted a study to investigate the effects of three different teaching methods (Method A, Method B, and Method C) on student test scores. After performing an ANOVA, the researcher found a significant difference between the means of the three groups. To determine which specific methods resulted in significantly different scores, a post-hoc test would be necessary.

Choosing a post-hoc test:

If the researcher wants to compare all possible pairs of methods, Tukey's HSD, Bonferroni, or Sidak correction could be used.
If there's a control group (e.g., traditional teaching method), Dunnett's test could be appropriate.
If the researcher is willing to accept a higher FWER, Fisher's LSD could be used.
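
As a rough sketch of the Bonferroni and Sidak corrections for the teaching-method scenario, the code below runs all pairwise t-tests with SciPy and adjusts for the number of comparisons by hand. The scores are invented for illustration and the variable names are assumptions.

```python
from itertools import combinations
from scipy import stats

# Hypothetical test scores for three teaching methods
scores = {
    "A": [70, 72, 68, 75, 71],
    "B": [78, 80, 77, 82, 79],
    "C": [72, 74, 70, 73, 75],
}

alpha = 0.05
pairs = list(combinations(scores, 2))
m = len(pairs)  # number of pairwise comparisons

# Per-comparison significance thresholds that keep the FWER at alpha
bonferroni_alpha = alpha / m
sidak_alpha = 1 - (1 - alpha) ** (1 / m)

# Pairwise t-tests with Bonferroni-adjusted p-values (capped at 1)
bonferroni_p = {}
for first, second in pairs:
    t_stat, p_raw = stats.ttest_ind(scores[first], scores[second])
    bonferroni_p[(first, second)] = min(1.0, p_raw * m)
    print(f"{first} vs {second}: raw p = {p_raw:.4f}, "
          f"Bonferroni p = {bonferroni_p[(first, second)]:.4f}")

print(f"Bonferroni threshold: {bonferroni_alpha:.5f}, Sidak threshold: {sidak_alpha:.5f}")
```

The Sidak threshold comes out slightly larger than the Bonferroni one, which is why Sidak is described as the less conservative of the two.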

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [6]:
import numpy as np
import pandas as pd
from scipy import stats

# Simulate weight loss data for three diets
np.random.seed(42)  # For reproducibility

# Example data: 50 participants, randomly assigned to Diet A, B, or C
data = {
    'Diet': ['A'] * 17 + ['B'] * 17 + ['C'] * 16,  # 17 participants in Diet A, 17 in B, 16 in C
    'Weight_Loss': np.concatenate([
        np.random.normal(5, 1.5, 17),  # Diet A
        np.random.normal(7, 1.2, 17),  # Diet B
        np.random.normal(6, 1.4, 16)   # Diet C
    ])
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(
    df[df['Diet'] == 'A']['Weight_Loss'],
    df[df['Diet'] == 'B']['Weight_Loss'],
    df[df['Diet'] == 'C']['Weight_Loss']
)

# Print the results
print(f"F-statistic: {f_statistic:.2f}")
print(f"P-value: {p_value:.4f}")


F-statistic: 8.76
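P-value: 0.0006

Interpretation: With F = 8.76 and p = 0.0006, which is below the 0.05 significance level, we reject the null hypothesis that the three diets produce the same mean weight loss. At least one diet differs from another; a post-hoc test such as Tukey's HSD would be needed to identify which pairs differ. The data are simulated, so the result illustrates the procedure only.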


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [7]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Simulate data
np.random.seed(42)  # For reproducibility

# Create a dataset with 30 employees
data = {
    'Software': np.repeat(['A', 'B', 'C'], 10),
    'Experience': np.tile(['Novice', 'Experienced'], 15),
    'Time': np.concatenate([
        np.random.normal(25, 3, 10),  # Program A
        np.random.normal(20, 2.5, 10),  # Program B
        np.random.normal(22, 2.8, 10)   # Program C
    ])
}

df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=df).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)  # Type II ANOVA

# Display the ANOVA table
print(anova_table)


                               sum_sq    df          F        PR(>F)
C(Software)                350.500394   2.0  34.874203  7.924432e-08
C(Experience)                0.036619   1.0   0.007287  9.326795e-01
C(Software):C(Experience)    0.532895   2.0   0.053022  9.484697e-01
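Residual                   120.605041  24.0        NaN           NaN

Interpretation: The main effect of software is significant (F = 34.87, p < 0.001), so mean completion time differs between at least two of the programs. The main effect of experience (F = 0.007, p = 0.93) and the software-by-experience interaction (F = 0.05, p = 0.95) are not significant. This is expected here because the simulated times depend only on the program, not on the experience labels.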


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [8]:
import numpy as np
import pandas as pd
from scipy import stats

# Simulate test score data
np.random.seed(42)  # For reproducibility

# Generate test scores for control and experimental groups
control_scores = np.random.normal(75, 10, 50)  # Control group
experimental_scores = np.random.normal(80, 10, 50)  # Experimental group

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

# Print results
print(f"T-statistic: {t_statistic:.2f}")
print(f"P-value: {p_value:.4f}")

# Interpret the results
if p_value < 0.05:
    print("There is a significant difference between the test scores of the two groups.")
else:
    print("There is no significant difference between the test scores of the two groups.")


T-statistic: -4.11
P-value: 0.0001
There is a significant difference between the test scores of the two groups.


In [9]:
# Calculate Cohen's d
mean_control = np.mean(control_scores)
mean_experimental = np.mean(experimental_scores)
std_control = np.std(control_scores, ddof=1)
std_experimental = np.std(experimental_scores, ddof=1)

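# Pooled standard deviation (simple average of the two variances,
# which is appropriate here because both groups have the same size)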
pooled_std = np.sqrt(((std_control**2 + std_experimental**2) / 2))

# Cohen's d
cohens_d = (mean_experimental - mean_control) / pooled_std
print(f"Cohen's d: {cohens_d:.2f}")


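Cohen's d: 0.82

Interpretation: The t-test is significant (t = -4.11, p = 0.0001), with the experimental group scoring higher on average; the negative sign arises because the control group is passed first. Because only two groups are compared, no post-hoc test is needed: the t-test already identifies the one pair that differs. A Cohen's d of 0.82 is conventionally considered a large effect.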
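
One way to see why a post-hoc test adds nothing with two groups: a one-way ANOVA on two groups is equivalent to the equal-variance t-test, with F equal to t squared and identical p-values. The sketch below regenerates the same simulated scores and checks this.

```python
import numpy as np
from scipy import stats

# Same simulated scores as above
np.random.seed(42)
control_scores = np.random.normal(75, 10, 50)
experimental_scores = np.random.normal(80, 10, 50)

# Two-sample t-test (equal variances assumed, the SciPy default)
t_statistic, p_t = stats.ttest_ind(control_scores, experimental_scores)

# One-way ANOVA on the same two groups
f_statistic, p_f = stats.f_oneway(control_scores, experimental_scores)

print(f"t^2 = {t_statistic ** 2:.4f}, F = {f_statistic:.4f}")
print(f"t-test p = {p_t:.6f}, ANOVA p = {p_f:.6f}")
```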


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [14]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulate sales data for three stores over 30 days
np.random.seed(42)  # For reproducibility

# Generate daily sales data
store_A_sales = np.random.normal(200, 25, 30)
store_B_sales = np.random.normal(220, 30, 30)
store_C_sales = np.random.normal(210, 20, 30)

# Create a DataFrame
data = {
    'Day': np.tile(np.arange(1, 31), 3),
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Sales': np.concatenate([store_A_sales, store_B_sales, store_C_sales])
}
df = pd.DataFrame(data)

# Perform repeated measures ANOVA
anova = AnovaRM(df, 'Sales', 'Day', within=['Store'])
anova_results = anova.fit()
print("Repeated Measures ANOVA Results:")
print(anova_results)

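# Post-hoc test: Tukey's HSD
# Note: Tukey's HSD treats the stores as independent groups, so it ignores the
# pairing by day; paired comparisons with a multiplicity correction respect the
# repeated-measures design more closely.
# Prepare the data for Tukey's HSD test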
df_posthoc = df.copy()
df_posthoc['Store'] = df_posthoc['Store'].astype('category')

# Perform Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=df_posthoc['Sales'], groups=df_posthoc['Store'], alpha=0.05)
print("\nPost-Hoc Test Results:")
print(tukey)


Repeated Measures ANOVA Results:
               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  6.5881 2.0000 58.0000 0.0026


Post-Hoc Test Results:
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B  21.0688 0.0025   6.4988 35.6388   True
     A      C  14.9614 0.0428   0.3914 29.5313   True
     B      C  -6.1074 0.5791 -20.6774  8.4625  False
-----------------------------------------------------
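
Because the three stores are measured on the same days, a post-hoc analysis that respects the pairing may be preferable to Tukey's HSD, which assumes independent groups. The sketch below is one such alternative, not the method used above: paired t-tests on the same simulated sales with a Bonferroni adjustment. Variable names are illustrative.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Regenerate the same simulated sales (same seed and order as above)
np.random.seed(42)
sales = {
    "A": np.random.normal(200, 25, 30),
    "B": np.random.normal(220, 30, 30),
    "C": np.random.normal(210, 20, 30),
}

pairs = list(combinations(sales, 2))
n_comparisons = len(pairs)

# Paired t-test for each pair of stores, matched by day
paired_results = {}
for first, second in pairs:
    t_stat, p_raw = stats.ttest_rel(sales[first], sales[second])
    p_adj = min(1.0, p_raw * n_comparisons)  # Bonferroni adjustment
    paired_results[(first, second)] = (t_stat, p_raw, p_adj)
    print(f"{first} vs {second}: t = {t_stat:.2f}, raw p = {p_raw:.4f}, "
          f"Bonferroni p = {p_adj:.4f}")
```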
