# Statistics Advance 6 Assignment

## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

**Answer:**

ANOVA assumes:
- **Independence:** Observations are independent of each other.
- **Normality:** The residuals (equivalently, the data within each group) are approximately normally distributed.
- **Homogeneity of variances:** All groups have the same variance.

**Violations:**
- Non-independence (e.g., repeated measures without accounting for it)
- Non-normality (e.g., skewed data)
- Unequal variances (heteroscedasticity)

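Violations distort the F-test: the actual Type I error rate can drift away from the nominal level (e.g., 0.05), or power can drop. Non-independence is the most serious violation, and unequal variances do the most damage when group sizes are also unequal.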
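
The normality and equal-variance assumptions can be screened before running ANOVA. The following is a minimal sketch, assuming simulated data for three hypothetical groups; it uses the Shapiro-Wilk test on the residuals and Levene's test on the groups, both from `scipy.stats`. Independence cannot be tested this way and has to be justified by the study design.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: three groups of 30 with different means and equal spread
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (5, 6, 7)]

# Normality: Shapiro-Wilk on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.4f}")
print(f"Levene p-value: {levene_p:.4f}")
```

A small p-value in either test suggests the corresponding assumption is doubtful, in which case a rank-based alternative such as `scipy.stats.kruskal` may be worth considering.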

## Q2. What are the three types of ANOVA, and in what situations would each be used?

**Answer:**

- **One-way ANOVA:** Compares means of three or more groups based on one independent variable. Used when testing one factor (e.g., diet type).
- **Two-way ANOVA:** Compares means based on two independent variables (factors) and can test for interaction effects. Used when testing two factors (e.g., diet type and gender).
- **Repeated measures ANOVA:** Used when the same subjects are measured under several conditions or at several time points (e.g., before, during, and after treatment).

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

**Answer:**

Partitioning of variance in ANOVA divides the total variability in the data into components:
- **Between-group variance:** Variability due to differences between group means.
- **Within-group variance:** Variability within each group.

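Understanding this matters because the F-statistic is built from these two components: it is the ratio of the between-group mean square to the within-group mean square. A large ratio means the group means differ by more than the variability within groups would explain by chance.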
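
The partition can be written out as follows. The notation is an assumption introduced here for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $\bar{x}_i$ the mean of group $i$, and $\bar{x}$ the grand mean.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2}_{\text{total}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2}_{\text{between groups}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2}_{\text{within groups}}
$$

$$
F=\frac{\text{between-group sum of squares}/(k-1)}{\text{within-group sum of squares}/(N-k)}
$$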

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
import numpy as np

group1 = [10, 12, 9, 11]
group2 = [20, 22, 19, 21]
group3 = [30, 32, 29, 31]
data = group1 + group2 + group3
groups = [group1, group2, group3]

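# Grand mean across all observations
grand_mean = np.mean(data)

# Total sum of squares: every observation against the grand mean
SST = np.sum((np.array(data) - grand_mean) ** 2)
# Explained (between-group) sum of squares: group means against the grand mean, weighted by group size
SSE = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
# Residual (within-group) sum of squares: observations against their own group mean
SSR = sum(np.sum((np.array(g) - np.mean(g)) ** 2) for g in groups)

# The partition SST = SSE + SSR should hold up to floating-point error
print(f"SST: {SST:.2f}, SSE (Between): {SSE:.2f}, SSR (Within): {SSR:.2f}")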

## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'score': [85, 90, 88, 92, 78, 80, 79, 81],
    'factor1': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'factor2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y']
})

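# Fit a linear model with both main effects and their interaction
model = ols('score ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data=data).fit()
# Each row of the table reports the sum of squares, F-statistic, and p-value for one effect
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)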

## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

**Answer:**

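Since the p-value (0.02) is less than the conventional significance level of 0.05, we reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from the others. The F-statistic is the ratio of the between-group mean square to the within-group mean square, so a value of 5.23 means the variability between groups is about five times the variability within groups. The test does not say which groups differ; a post-hoc test is needed for that.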

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

**Answer:**

- **Handling missing data:**
  - Listwise deletion (remove subjects with missing data)
  - Imputation (fill in missing values)
  - Mixed-effects models (can handle missing data)
- **Consequences:**
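  - Listwise deletion reduces sample size and power, and biases estimates unless the data are missing completely at random
  - Imputation preserves sample size but can introduce bias or understate variability if done poorly (e.g., single mean imputation)
  - Mixed-effects models use all available observations and remain valid when data are missing at random, at the cost of a more complex analysis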
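
The difference between listwise deletion and a mixed-effects model can be sketched as below. This is a minimal illustration, assuming simulated long-format data with hypothetical column names (`subject`, `time`, `score`) and a random intercept per subject fitted with `statsmodels.formula.api.mixedlm`; it is not the only way to set up such a model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subjects, times = 20, ["t1", "t2", "t3"]

# Hypothetical data: subject-specific baseline plus a time effect
subject_effect = rng.normal(0, 3, n_subjects)
rows = [
    {"subject": s, "time": t,
     "score": 50 + 2 * k + subject_effect[s] + rng.normal(0, 2)}
    for s in range(n_subjects)
    for k, t in enumerate(times)
]
df = pd.DataFrame(rows)

# Knock out six observations to mimic missing data
missing_idx = rng.choice(len(df), size=6, replace=False)
df.loc[missing_idx, "score"] = np.nan

# Rows that were actually observed
observed = df.dropna(subset=["score"])

# Listwise deletion: keep only subjects with all three measurements
counts = observed.groupby("subject").size()
full_subjects = counts[counts == len(times)].index
listwise = observed[observed["subject"].isin(full_subjects)]

# Mixed-effects model: random intercept per subject, uses every observed row
model = smf.mixedlm("score ~ C(time)", observed, groups=observed["subject"]).fit()

print(f"Rows kept by listwise deletion: {len(listwise)}")
print(f"Rows used by the mixed model: {len(observed)}")
print(model.summary())
```

Listwise deletion discards every remaining measurement from a subject with a single missing value, while the mixed model keeps those measurements.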

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

**Answer:**

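- **Tukey's HSD:** Used when comparing all possible pairs of group means; it controls the family-wise error rate across those pairs.
- **Bonferroni correction:** Divides the significance level by the number of comparisons to control Type I error. It suits a small number of planned comparisons and becomes conservative as the number grows.
- **Scheffé's test:** The most conservative of the three; used when testing all possible contrasts, including complex ones that go beyond pairwise comparisons.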

**Example:** If ANOVA shows significant differences among three diets, a post-hoc test identifies which specific diets differ.
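
The diet example can be sketched with Tukey's HSD. This is a minimal illustration, assuming simulated weight-loss values for three hypothetical diets and using `pairwise_tukeyhsd` from `statsmodels`.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical weight loss (kg) for 20 participants per diet
loss = np.concatenate([
    rng.normal(5, 1, 20),  # diet A
    rng.normal(6, 1, 20),  # diet B
    rng.normal(7, 1, 20),  # diet C
])
diet = np.repeat(["A", "B", "C"], 20)

# Compare every pair of diets while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=loss, groups=diet, alpha=0.05)
print(tukey.summary())
```

Each row of the summary covers one pair of diets and reports the mean difference, an adjusted p-value, and whether the null hypothesis of equal means is rejected.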

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [None]:
import numpy as np
import scipy.stats as stats

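# Simulated weight loss for 50 participants (17 + 17 + 16); seed fixed for reproducibility
np.random.seed(0)
diet_A = np.random.normal(5, 1, 17)
diet_B = np.random.normal(6, 1, 17)
diet_C = np.random.normal(7, 1, 16)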

f_stat, p_value = stats.f_oneway(diet_A, diet_B, diet_C)
print(f"F-statistic: {f_stat:.2f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("There are significant differences between the diets.")
else:
    print("No significant differences between the diets.")

## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

np.random.seed(0)
# Simulated data: 30 employees, 10 per program, 5 novice and 5 experienced within each program
data = pd.DataFrame({
    'time': np.random.normal(30, 5, 30),
    'program': np.repeat(['A', 'B', 'C'], 10),
    'experience': np.tile(np.repeat(['novice', 'experienced'], 5), 3)
})

model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [None]:
import numpy as np
from scipy.stats import ttest_ind

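# Simulated test scores for 100 students (50 per group); seed fixed for reproducibility
np.random.seed(0)
control = np.random.normal(75, 10, 50)
experimental = np.random.normal(80, 10, 50)

# Independent two-sample t-test (assumes equal variances by default)
t_stat, p_value = ttest_ind(control, experimental)
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.4f}")
# With only two groups, a significant t-test already identifies which groups differ,
# so no post-hoc test is needed
if p_value < 0.05:
    print("Significant difference between groups.")
else:
    print("No significant difference between groups.")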

## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

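np.random.seed(0)
# Simulated data in long format: each of the 30 days acts as the repeated-measures "subject",
# with one sales figure per store per day
data = pd.DataFrame({
    'subject': np.tile(np.arange(1, 31), 3),
    'store': np.repeat(['A', 'B', 'C'], 30),
    'sales': np.random.normal(100, 10, 90)
})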

anova = AnovaRM(data, 'sales', 'subject', within=['store']).fit()
print(anova)
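
The post-hoc step the question asks for is only meaningful if the repeated measures ANOVA is significant. One possible follow-up, sketched here as an assumption rather than the only option, is a set of paired t-tests between stores with a Bonferroni correction. The data are regenerated with the same seed so the block runs on its own.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

np.random.seed(0)
data = pd.DataFrame({
    'subject': np.tile(np.arange(1, 31), 3),
    'store': np.repeat(['A', 'B', 'C'], 30),
    'sales': np.random.normal(100, 10, 90)
})

# Reshape to one row per day and one column per store
wide = data.pivot(index='subject', columns='store', values='sales')

# Paired t-test for every pair of stores (same days, so observations are paired)
pairs = list(combinations(wide.columns, 2))
p_raw = [ttest_rel(wide[a], wide[b]).pvalue for a, b in pairs]

# Bonferroni adjustment for the three comparisons
reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method='bonferroni')

for (a, b), p, r in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: adjusted p-value = {p:.4f}, significant = {r}")
```

Because the sales figures here are drawn from a single distribution for all stores, significant pairs would be false positives; with real data, the adjusted p-values identify which stores differ.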