# **Statistics Advanced 6**

Q1: Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

#### Assumptions Required for ANOVA:

1. **Independence of Observations**:
    - Each observation should be independent of the others. This means that the data collected from different groups should not influence each other.
    - **Violation Example**: If multiple measurements are taken from the same subject or if there is some inherent connection between groups (like repeated measures without proper adjustments), this assumption is violated.

2. **Normality**:
    - The residuals (equivalently, the data within each group) should be approximately normally distributed.
    - **Violation Example**: If the data is heavily skewed or has outliers, the normality assumption might be violated. For instance, if you are comparing the income levels of different groups and the income data is skewed, this could impact the results of ANOVA.

3. **Homogeneity of Variances (Homoscedasticity)**:
    - The variance among the groups should be approximately equal.
    - **Violation Example**: If one group has much higher variability in scores than another group, this assumption is violated. For instance, comparing the test scores of different teaching methods where one method results in highly variable scores compared to the others.

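4. **Fixed Effects Model**:
    - Standard ANOVA treats the levels of the factor(s) as fixed: they are the specific levels of interest, not a random sample from a larger set of possible levels.
    - **Violation Example**: If the levels are randomly sampled (for example, a handful of schools drawn from all schools in a region) but analyzed as fixed, the model does not match the design; a random-effects or mixed-effects model is more appropriate.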

#### Examples of Violations and Their Impact:

1. **Independence Violation**:
    - **Example**: Measuring the same set of students multiple times without proper adjustments.
    - **Impact**: This can lead to an inflation of type I error rates (false positives), where the test indicates a significant effect when there isn't one.

2. **Normality Violation**:
    - **Example**: Analyzing reaction times that are positively skewed without transformation.
    - **Impact**: ANOVA is robust to some deviations from normality, but severe violations can lead to incorrect conclusions, particularly in small sample sizes.

3. **Homogeneity of Variances Violation**:
    - **Example**: Comparing the effectiveness of three drugs where one drug has highly variable side effects compared to the others.
    - **Impact**: Violations of this assumption can lead to inaccurate F-statistics and p-values, potentially leading to incorrect inferences about group differences.

4. **Fixed Effects Violation**:
    - **Example**: The groups being compared are a random sample from a larger population of possible groups rather than fixed levels of interest.
    - **Impact**: Conclusions apply only to the groups that were sampled. Generalizing to the wider population of groups requires a random-effects model; treating the levels as fixed can understate the uncertainty of such generalizations.

To mitigate the impact of these violations, researchers can:
- Use transformations to normalize data.
- Apply robust statistical techniques like Welch's ANOVA when variances are unequal.
- Use non-parametric tests like the Kruskal-Wallis test if normality is severely violated.
- Ensure proper experimental design to maintain independence of observations.
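
Two of the mitigations above can be sketched with SciPy alone. This is a minimal illustration, assuming made-up right-skewed samples (the values and the names `group_a`, `group_b`, `group_c` are hypothetical, not data from this notebook). It runs a one-way ANOVA on log-transformed values and, as an alternative, the rank-based Kruskal-Wallis test.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed samples (e.g. reaction times in ms); illustrative only.
group_a = [310, 295, 330, 420, 305, 900]
group_b = [350, 365, 340, 480, 1200, 372]
group_c = [410, 450, 430, 1500, 445, 520]

# Option 1: log-transform to reduce skew, then run the usual one-way ANOVA.
log_f, log_p = stats.f_oneway(np.log(group_a), np.log(group_b), np.log(group_c))

# Option 2: Kruskal-Wallis, a rank-based test that does not assume normality.
h_stat, kw_p = stats.kruskal(group_a, group_b, group_c)

print(f"ANOVA on log scale: F = {log_f:.3f}, p = {log_p:.4f}")
print(f"Kruskal-Wallis:     H = {h_stat:.3f}, p = {kw_p:.4f}")
```

After a transformation, the hypothesis being tested concerns means on the transformed scale, so results should be reported and interpreted on that scale.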

Understanding and checking these assumptions before performing ANOVA is crucial to ensure the validity and reliability of the results.
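
As a rough sketch of how two of these checks might look in Python, the example below applies the Shapiro-Wilk test to the residuals (normality) and Levene's test to the groups (equal variances). The three small samples are illustrative values, and the variable names are assumptions made for this example. Independence cannot be verified this way; it follows from the study design.

```python
import numpy as np
from scipy import stats

# Illustrative samples; replace with actual data.
group1 = [2.1, 2.5, 2.3, 2.8, 2.7]
group2 = [3.2, 3.3, 3.1, 3.0, 3.4]
group3 = [1.5, 1.7, 1.6, 1.8, 1.7]
groups = [np.asarray(g) for g in (group1, group2, group3)]

# Residuals: each observation minus the mean of its own group.
residuals = np.concatenate([g - g.mean() for g in groups])

# Normality of residuals (H0: residuals come from a normal distribution).
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances (H0: all groups have equal variance).
# SciPy's default center='median' is the Brown-Forsythe variant of Levene's test.
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene:       W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption is questionable. With samples this small the tests have little power, so plots of the residuals are a useful complement.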

Q2: What are the three types of ANOVA, and in what situations would each be used?

1. **One-Way ANOVA**:
    - **Situation**: Used when comparing the means of three or more independent groups based on a single factor.
    - **Example**: Comparing the average test scores of students from three different teaching methods.

2. **Two-Way ANOVA**:
    - **Situation**: Used when comparing the means of groups based on two independent factors. It can also assess the interaction between the two factors.
    - **Example**: Studying the effect of different teaching methods and different study times on test scores.

3. **Repeated Measures ANOVA**:
    - **Situation**: Used when the same subjects are measured multiple times under different conditions.
    - **Example**: Measuring the blood pressure of patients before, during, and after administering a drug.

Q3: What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

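**Partitioning of Variance:**

- **Total Sum of Squares (SST)**: The total variation in the data, measured as the squared deviations of every observation from the overall mean.
- **Explained Sum of Squares (SSE)**: The variation between groups, i.e. the part explained by group membership (squared deviations of the group means from the overall mean, weighted by group size).
- **Residual Sum of Squares (SSR)**: The variation within groups (error), i.e. the squared deviations of observations from their own group mean.

In a one-way ANOVA these add up: SST = SSE + SSR. Some texts swap the abbreviations, using SSE for the error sum of squares and SSR for the regression (explained) sum of squares; this document uses the definitions above throughout.

**Importance:**

- Understanding how variance is partitioned helps in determining whether the differences between group means are significant or if they could have occurred by chance.
- It is the basis of the F-statistic, the ratio of the between-group mean square to the within-group mean square, which is used to test the null hypothesis in ANOVA.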
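
Written out for a one-way design, the partition and the resulting F-statistic take the form below. The notation is an assumption introduced here: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the overall mean. SSE denotes the explained sum of squares and SSR the residual sum of squares, matching the definitions in this answer.

```latex
\begin{aligned}
\mathrm{SST} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2, \\
\mathrm{SSE} &= \sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2, \\
\mathrm{SSR} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2, \\
\mathrm{SST} &= \mathrm{SSE} + \mathrm{SSR}, \\
F &= \frac{\mathrm{SSE}/(k-1)}{\mathrm{SSR}/(N-k)}.
\end{aligned}
```

Under the null hypothesis of equal group means, $F$ follows an F-distribution with $k-1$ and $N-k$ degrees of freedom.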

Q4: How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

# Sample data (replace with actual data)
group1 = [2.1, 2.5, 2.3, 2.8, 2.7]
group2 = [3.2, 3.3, 3.1, 3.0, 3.4]
group3 = [1.5, 1.7, 1.6, 1.8, 1.7]

# Combine the data
data = group1 + group2 + group3
groups = ['group1']*len(group1) + ['group2']*len(group2) + ['group3']*len(group3)

df = pd.DataFrame({'group': groups, 'value': data})

# Calculate group means and overall mean
group_means = df.groupby('group')['value'].mean()
overall_mean = df['value'].mean()

# Total Sum of Squares (SST)
sst = sum((df['value'] - overall_mean)**2)

# Explained Sum of Squares (SSE)
sse = sum(df.groupby('group').size() * (group_means - overall_mean)**2)

# Residual Sum of Squares (SSR)
ssr = sum((df['value'] - df['group'].map(group_means))**2)

print(f"SST: {sst}, SSE: {sse}, SSR: {ssr}")


SST: 6.417333333333333, SSE: 5.937333333333333, SSR: 0.4799999999999999
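
The sums of squares above lead directly to the F-statistic. The sketch below recomputes them with NumPy for the same sample data, derives F and its p-value by hand, and cross-checks against `scipy.stats.f_oneway`. The variable names (`f_manual`, `eta_squared`, and so on) are choices made for this example.

```python
import numpy as np
from scipy import stats

# Same sample data as in the cell above.
samples = [
    np.array([2.1, 2.5, 2.3, 2.8, 2.7]),
    np.array([3.2, 3.3, 3.1, 3.0, 3.4]),
    np.array([1.5, 1.7, 1.6, 1.8, 1.7]),
]
all_values = np.concatenate(samples)
grand_mean = all_values.mean()

# Explained (between-group) and residual (within-group) sums of squares.
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in samples)
ssr = sum(((g - g.mean()) ** 2).sum() for g in samples)

df_between = len(samples) - 1               # k - 1
df_within = len(all_values) - len(samples)  # N - k

# F is the ratio of the two mean squares.
f_manual = (sse / df_between) / (ssr / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's built-in one-way ANOVA.
f_scipy, p_scipy = stats.f_oneway(*samples)

# Eta squared: share of total variation explained by group membership.
eta_squared = sse / (sse + ssr)

print(f"Manual: F = {f_manual:.4f}, p = {p_manual:.3e}")
print(f"SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.3e}")
print(f"Eta squared: {eta_squared:.4f}")
```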


Q5: In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data (replace with actual data)
data = {
    'FactorA': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A1', 'A1', 'A2', 'A2'],
    'FactorB': ['B1', 'B2', 'B2', 'B1', 'B2', 'B1', 'B1', 'B2', 'B2', 'B1'],
    'Value': [20, 21, 19, 23, 22, 25, 24, 26, 27, 28]
}

df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Value ~ C(FactorA) + C(FactorB) + C(FactorA):C(FactorB)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)


                          sum_sq   df         F    PR(>F)
C(FactorA)             20.416667  1.0  2.070423  0.200233
C(FactorB)              0.416667  1.0  0.042254  0.843934
C(FactorA):C(FactorB)   0.416667  1.0  0.042254  0.843934
Residual               59.166667  6.0       NaN       NaN


Q6: Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

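**Conclusion:**

- Since the p-value (0.02) is below the conventional significance level (α = 0.05), we reject the null hypothesis that all group means are equal.

**Interpretation:**

- The result is statistically significant: at least one group mean differs from at least one of the others.
- An F-statistic of 5.23 means the between-group mean square is about five times the within-group mean square.
- The F-test alone does not show which groups differ; a post-hoc test such as Tukey's HSD is needed to locate the differences.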
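
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not state. As a sketch only, the example below assumes 3 groups and 15 observations in total (so 2 and 12 degrees of freedom); these values are hypothetical and chosen purely for illustration.

```python
from scipy import stats

f_stat = 5.23
df_between = 2   # assumed: 3 groups, so k - 1 = 2
df_within = 12   # assumed: 15 observations in total, so N - k = 12
alpha = 0.05

# p-value: probability of an F at least this large if all group means are equal.
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value: the smallest F that would be significant at level alpha.
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha = {alpha}: {f_critical:.3f}")
print("Reject H0" if f_stat > f_critical else "Fail to reject H0")
```

Comparing F with the critical value and comparing the p-value with α are equivalent decisions; different degrees of freedom would give a different p-value for the same F.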

Q7: In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

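**Handling Missing Data:**

- **Listwise Deletion**: Removing every subject that has any missing measurement.
- **Pairwise Deletion**: Using all available data for each comparison without excluding entire subjects.
- **Mean Substitution**: Replacing missing values with the mean of the available data.
- **Multiple Imputation**: Creating several plausible completed datasets from a statistical model, analyzing each, and pooling the results.

**Consequences:**

- **Listwise Deletion**: Reduces the sample size and statistical power; in a repeated measures design a single missing time point discards all of that subject's data. Estimates can also be biased if the data are not missing completely at random.
- **Pairwise Deletion**: Can lead to biased estimates if the data are not missing completely at random, and different comparisons end up based on different subsets of subjects.
- **Mean Substitution**: Underestimates the variability, which shrinks standard errors and can lead to biased estimates.
- **Multiple Imputation**: Generally gives less biased estimates when data are missing at random, but is more complex and computationally intensive.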
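
The first and third consequences can be seen in a small pandas sketch. The wide-format table below (one row per subject, one column per time point) contains hypothetical values invented for this example; the names `wide`, `listwise`, and `mean_filled` are likewise assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point.
wide = pd.DataFrame(
    {
        "t1": [120.0, 132.0, 128.0, 141.0, 135.0],
        "t2": [118.0, np.nan, 125.0, 138.0, 130.0],
        "t3": [115.0, 127.0, np.nan, 134.0, 129.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4", "s5"], name="subject"),
)

# Listwise deletion: drop every subject with at least one missing measurement.
listwise = wide.dropna()

# Mean substitution: fill each time point's gaps with that column's mean.
mean_filled = wide.fillna(wide.mean())

print(f"Subjects before: {len(wide)}, after listwise deletion: {len(listwise)}")
print(f"Variance of t2, observed values only:    {wide['t2'].var():.2f}")
print(f"Variance of t2, after mean substitution: {mean_filled['t2'].var():.2f}")
```

Listwise deletion drops two of the five subjects here, while mean substitution keeps everyone but shrinks the sample variance, because the filled-in values sit exactly at the mean.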

Q8: What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

**Common Post-Hoc Tests:**

- **Tukey's HSD**: Used to compare all possible pairs of means while controlling the family-wise error rate. Example: Comparing the means of multiple teaching methods.
- **Bonferroni Correction**: Adjusts the significance level (α divided by the number of comparisons) when multiple comparisons are made. It is simple but conservative, so it suits a small number of planned comparisons. Example: Comparing the means of different drug treatments while controlling the family-wise Type I error.
- **Scheffé Test**: Used when all possible contrasts between means need to be tested, including complex contrasts; it is the most conservative of the three. Example: Comparing the means of different diet plans with unequal sample sizes.

**Situation for Post-Hoc Test:**

- After finding a significant F-statistic in ANOVA, a post-hoc test determines which specific groups are significantly different from each other.
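
A minimal sketch of two of these procedures, assuming SciPy 1.8 or later (where `scipy.stats.tukey_hsd` is available) and reusing the three illustrative samples from Q4. The Bonferroni part simply multiplies each pairwise t-test p-value by the number of comparisons; the dictionary names are choices made for this example.

```python
from itertools import combinations
from scipy import stats

# Illustrative samples (same values as the one-way example in Q4).
samples = {
    "group1": [2.1, 2.5, 2.3, 2.8, 2.7],
    "group2": [3.2, 3.3, 3.1, 3.0, 3.4],
    "group3": [1.5, 1.7, 1.6, 1.8, 1.7],
}
names = list(samples)
pairs = list(combinations(range(len(names)), 2))

# Tukey's HSD: all pairwise comparisons, family-wise error rate controlled.
tukey = stats.tukey_hsd(*samples.values())
tukey_p = {(names[i], names[j]): float(tukey.pvalue[i, j]) for i, j in pairs}

# Bonferroni: pairwise t-tests with p-values multiplied by the number of tests.
bonferroni_p = {}
for i, j in pairs:
    _, p_raw = stats.ttest_ind(samples[names[i]], samples[names[j]])
    bonferroni_p[(names[i], names[j])] = min(1.0, p_raw * len(pairs))

for pair in tukey_p:
    print(f"{pair[0]} vs {pair[1]}: Tukey p = {tukey_p[pair]:.4g}, "
          f"Bonferroni p = {bonferroni_p[pair]:.4g}")
```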

Q9: Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of three diets: A, B, and C. Report the F-statistic and p-value, and interpret the results.

In [3]:
import numpy as np
import scipy.stats as stats

# Sample data (replace with actual data)
diet_A = [2.1, 2.5, 2.3, 2.8, 2.7, 2.2, 2.4, 2.6, 2.5, 2.7, 2.3, 2.8, 2.9, 2.5, 2.6, 2.8, 2.7, 2.9, 2.4, 2.5]
diet_B = [3.2, 3.3, 3.1, 3.0, 3.4, 3.3, 3.1, 3.2, 3.5, 3.1, 3.2, 3.0, 3.4, 3.3, 3.2, 3.5, 3.3, 3.1, 3.2, 3.4]
diet_C = [1.5, 1.7, 1.6, 1.8, 1.7, 1.5, 1.6, 1.8, 1.7, 1.9, 1.6, 1.8, 1.7, 1.5, 1.6, 1.9, 1.7, 1.8, 1.6, 1.7]

# Conduct one-way ANOVA
f_stat, p_val = stats.f_oneway(diet_A, diet_B, diet_C)

print(f"F-statistic: {f_stat}, P-value: {p_val}")

# Interpretation
if p_val < 0.05:
    print("Reject the null hypothesis. There are significant differences between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis. There are no significant differences between the mean weight loss of the three diets.")


F-statistic: 402.41039790879984, P-value: 2.4157375915394667e-34
Reject the null hypothesis. There are significant differences between the mean weight loss of the three diets.


Q10: Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced).

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data (replace with actual data)
data = {
    'Program': ['A']*10 + ['B']*10 + ['C']*10 + ['A']*10 + ['B']*10 + ['C']*10,
    'Experience': ['Novice']*30 + ['Experienced']*30,
    'Time': [30, 35, 28, 31, 32, 34, 29, 27, 33, 30, 25, 30, 26, 28, 27, 26, 29, 31, 30, 28, 20, 22, 21, 23, 24, 22, 25, 21, 23, 20, 15, 18, 14, 16, 17, 15, 18, 14, 19, 16, 19, 21, 18, 22, 23, 19, 20, 21, 23, 22, 18, 17, 19, 20, 18, 19, 21, 20, 18, 19]
}

# Create DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

# Interpretation
print("\nInterpretation:")
for factor in ['C(Program)', 'C(Experience)', 'C(Program):C(Experience)']:
    # Look up the p-value by row label and column name in a single .loc call
    if anova_table.loc[factor, 'PR(>F)'] < 0.05:
        print(f"There is a significant effect of {factor}.")
    else:
        print(f"There is no significant effect of {factor}.")


                               sum_sq    df           F        PR(>F)
C(Program)                 168.233333   2.0   23.919431  3.638865e-08
C(Experience)             1050.016667   1.0  298.582938  1.166851e-23
C(Program):C(Experience)   340.833333   2.0   48.459716  8.882029e-13
Residual                   189.900000  54.0         NaN           NaN

Interpretation:
There is a significant effect of C(Program).
There is a significant effect of C(Experience).
There is a significant effect of C(Program):C(Experience).
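
When the interaction is significant, the main effects are best read alongside the cell means, because the effect of one factor depends on the level of the other. The sketch below rebuilds the same sample data and tabulates the mean time for each Program and Experience combination; `cell_means` and `experience_gap` are names introduced for this example.

```python
import pandas as pd

# Same sample data as in the cell above.
df = pd.DataFrame({
    'Program': ['A']*10 + ['B']*10 + ['C']*10 + ['A']*10 + ['B']*10 + ['C']*10,
    'Experience': ['Novice']*30 + ['Experienced']*30,
    'Time': [30, 35, 28, 31, 32, 34, 29, 27, 33, 30,
             25, 30, 26, 28, 27, 26, 29, 31, 30, 28,
             20, 22, 21, 23, 24, 22, 25, 21, 23, 20,
             15, 18, 14, 16, 17, 15, 18, 14, 19, 16,
             19, 21, 18, 22, 23, 19, 20, 21, 23, 22,
             18, 17, 19, 20, 18, 19, 21, 20, 18, 19],
})

# Mean time per cell: rows are programs, columns are experience levels.
cell_means = df.groupby(['Program', 'Experience'])['Time'].mean().unstack('Experience')

# How much longer novices take than experienced employees, per program.
experience_gap = cell_means['Novice'] - cell_means['Experienced']

print(cell_means)
print("\nNovice minus Experienced, by program:")
print(experience_gap)
```

The gap between novices and experienced employees differs by program in this sample, which is the pattern the interaction term picks up.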


Q11: Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the control group and the experimental group.

In [5]:
import numpy as np
import scipy.stats as stats

# Sample data (replace with actual data)
control_group = [75, 78, 72, 74, 76, 73, 71, 79, 77, 74, 73, 72, 76, 78, 75, 74, 77, 79, 72, 76, 73, 75, 77, 74, 73, 72, 76, 78, 75, 74]
experimental_group = [82, 85, 83, 84, 86, 85, 84, 83, 85, 82, 84, 83, 85, 86, 84, 85, 83, 84, 85, 83, 84, 85, 86, 84, 85, 83, 84, 85, 83, 84]

# Conduct two-sample t-test
t_stat, p_val = stats.ttest_ind(control_group, experimental_group)

print(f"T-statistic: {t_stat}, P-value: {p_val}")

# Interpretation
if p_val < 0.05:
    print("Reject the null hypothesis. There are significant differences in test scores between the control and experimental groups.")
else:
    print("Fail to reject the null hypothesis. There are no significant differences in test scores between the control and experimental groups.")


T-statistic: -20.041339095162105, P-value: 9.451070788284656e-28
Reject the null hypothesis. There are significant differences in test scores between the control and experimental groups.
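
By default `ttest_ind` assumes equal variances in both groups. As a hedged alternative, the sketch below compares the sample variances and runs Welch's t-test (`equal_var=False`), which drops that assumption, on the same sample data.

```python
import numpy as np
from scipy import stats

# Same sample data as in the cell above.
control_group = [75, 78, 72, 74, 76, 73, 71, 79, 77, 74, 73, 72, 76, 78, 75,
                 74, 77, 79, 72, 76, 73, 75, 77, 74, 73, 72, 76, 78, 75, 74]
experimental_group = [82, 85, 83, 84, 86, 85, 84, 83, 85, 82, 84, 83, 85, 86, 84,
                      85, 83, 84, 85, 83, 84, 85, 86, 84, 85, 83, 84, 85, 83, 84]

# Sample variances (ddof=1) give a quick look at the equal-variance assumption.
var_control = np.var(control_group, ddof=1)
var_experimental = np.var(experimental_group, ddof=1)

# Student's t-test (pooled variance) versus Welch's t-test (no pooling).
student_t, student_p = stats.ttest_ind(control_group, experimental_group)
welch_t, welch_p = stats.ttest_ind(control_group, experimental_group, equal_var=False)

print(f"Variances: control = {var_control:.2f}, experimental = {var_experimental:.2f}")
print(f"Student: t = {student_t:.3f}, p = {student_p:.3e}")
print(f"Welch:   t = {welch_t:.3f}, p = {welch_p:.3e}")
```

Because both groups have the same size, the two t-statistics coincide; only the degrees of freedom, and therefore the p-value, differ.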


Q12: Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores.

In [6]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Sample data (replace with actual data)
data = {
    'Day': list(range(1, 31)) * 3,
    'Store': ['A']*30 + ['B']*30 + ['C']*30,
    'Sales': [200, 220, 210, 215, 230, 225, 220, 210, 215, 225, 230, 240, 220, 215, 225, 230, 235, 240, 225, 230, 240, 235, 230, 225, 220, 215, 225, 230, 220, 215,
              210, 215, 220, 225, 230, 235, 240, 225, 230, 235, 220, 215, 210, 220, 225, 230, 215, 220, 225, 210, 220, 225, 230, 240, 220, 215, 230, 225, 230, 235,
              210, 215, 220, 210, 215, 220, 225, 230, 235, 240, 230, 225, 220, 215, 230, 225, 220, 230, 225, 220, 215, 210, 220, 215, 220, 225, 230, 235, 240, 225]
}

# Create DataFrame
df = pd.DataFrame(data)

# Fit the repeated measures ANOVA model
aovrm = AnovaRM(df, 'Sales', 'Day', within=['Store'])
res = aovrm.fit()

print(res)

# Interpretation
print("\nInterpretation:")
if res.anova_table['Pr > F'].iloc[0] < 0.05:
    print("Reject the null hypothesis. There are significant differences in sales between the three stores.")
else:
    print("Fail to reject the null hypothesis. There are no significant differences in sales between the three stores.")


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  0.1132 2.0000 58.0000 0.8932


Interpretation:
Fail to reject the null hypothesis. There are no significant differences in sales between the three stores.
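
In the analysis above, each `Day` plays the role of the subject and `Store` is the within-subject factor. statsmodels' `AnovaRM` supports only fully balanced within-subject designs, so it can help to confirm that every day has exactly one observation per store before fitting. The sketch below does this on a small hypothetical dataset (4 days, 3 stores); the values and the names `long_df`, `counts`, and `is_balanced` are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format data: 4 days x 3 stores, one sales figure per cell.
long_df = pd.DataFrame({
    "Day": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "Store": ["A", "B", "C"] * 4,
    "Sales": [200, 210, 210, 220, 215, 215, 210, 220, 220, 215, 225, 210],
})

# Count observations per (Day, Store) cell; a missing cell shows up as 0.
counts = long_df.groupby(["Day", "Store"]).size().unstack(fill_value=0)

# Balanced means exactly one observation in every cell.
is_balanced = bool((counts == 1).all().all())

# Wide layout: one row per day (the "subject"), one column per store.
wide = long_df.pivot(index="Day", columns="Store", values="Sales")

print(f"Balanced design: {is_balanced}")
print(wide)
```

If the check fails, the missing or duplicated cells need to be resolved (or a different model chosen) before a repeated measures ANOVA is fitted.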




# **Complete**