### Q1: Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

A1.
1. **Independence of Observations**: The observations must be independent of each other.
   - *Violation Example*: Repeated measurements on the same subjects without accounting for this in the model.
   
2. **Normality**: The dependent variable (equivalently, the model residuals) should be approximately normally distributed within each group.
   - *Violation Example*: Strongly skewed data or extreme outliers, which can be checked with a Q-Q plot or the Shapiro-Wilk test.

3. **Homogeneity of Variances (Homoscedasticity)**: The variances within each group should be approximately equal.
   - *Violation Example*: One group has a much larger variance compared to others, which can be tested using Levene’s Test or Bartlett’s Test.

Violations of these assumptions can invalidate the ANOVA results and lead to incorrect conclusions. For instance, when group variances are unequal, the F-test can be too liberal (inflated Type I error) or too conservative (reduced power), particularly when group sizes also differ.
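
The normality and equal-variance checks mentioned above can be sketched with SciPy. This is a minimal illustration on simulated data, not a full diagnostic workflow; the group names, sample sizes, and the deliberately inflated spread of `group_c` are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Simulated groups (hypothetical): group_c is given a much larger spread on purpose.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10, scale=1, size=30)
group_b = rng.normal(loc=10, scale=1, size=30)
group_c = rng.normal(loc=10, scale=5, size=30)

# Normality within each group: Shapiro-Wilk (a small p-value suggests non-normality).
for name, values in [("A", group_a), ("B", group_b), ("C", group_c)]:
    w_stat, p_norm = stats.shapiro(values)
    print(f"Shapiro-Wilk, group {name}: W = {w_stat:.3f}, p = {p_norm:.3f}")

# Homogeneity of variances: Levene's test (a small p-value suggests unequal variances).
lev_stat, p_levene = stats.levene(group_a, group_b, group_c)
print(f"Levene: W = {lev_stat:.3f}, p = {p_levene:.4f}")
```

If Levene's test rejects equal variances, Welch's ANOVA or a nonparametric alternative such as the Kruskal-Wallis test is usually considered instead of the classical F-test.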

### Q2: What are the three types of ANOVA, and in what situations would each be used?

A2.
1. **One-Way ANOVA**: Used when comparing the means of three or more independent groups based on one factor.
   - *Example*: Comparing test scores across three different teaching methods.

2. **Two-Way ANOVA**: Used when comparing the means of groups based on two factors, which allows for the analysis of main effects and interaction effects.
   - *Example*: Comparing the effects of diet (low-fat, low-carb) and exercise (none, moderate, high) on weight loss.

3. **Repeated Measures ANOVA**: Used when the same subjects are measured multiple times under different conditions.
   - *Example*: Measuring the impact of different diets on the same group of individuals over three time periods.

### Q3: What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

A3. **Partitioning of Variance in ANOVA**:
- **Total Sum of Squares (SST)**: Measures the total variability in the data.
- **Between-Group Sum of Squares (SSB)**: Measures the variability between the groups.
- **Within-Group Sum of Squares (SSW)**: Measures the variability within each group.

**Importance**:
Partitioning of variance helps in understanding how much of the total variability in the data can be attributed to the differences between group means (SSB) versus the variability within groups (SSW). This is crucial for determining the significance of the factors being studied.
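
Written out for a one-way design, the partition takes the form below. The notation is an assumption introduced here for illustration: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, group means $\bar{x}_j$, and grand mean $\bar{x}$.

```latex
\begin{aligned}
SST &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{x} \right)^2 \\
SSB &= \sum_{j=1}^{k} n_j \left( \bar{x}_j - \bar{x} \right)^2 \\
SSW &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{x}_j \right)^2 \\
SST &= SSB + SSW, \qquad F = \frac{SSB / (k - 1)}{SSW / (N - k)}
\end{aligned}
```

The F-statistic compares the between-group mean square with the within-group mean square, so a large SSB relative to SSW yields a large F.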

### Q4: How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?


In [7]:
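#A4.
# The explained sum of squares corresponds to SSB (between groups) and the
# residual sum of squares to SSW (within groups); SST = SSB + SSW.
import numpy as np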

group1 = [5, 7, 9, 6, 8]
group2 = [7, 8, 6, 7, 9]
group3 = [6, 5, 7, 8, 6]

data = group1 + group2 + group3
overall_mean = np.mean(data)

sst = sum((x - overall_mean)**2 for x in data)

group_means = [np.mean(group1), np.mean(group2), np.mean(group3)]
n = len(group1)  # all groups have the same size here; use each group's own length otherwise
ssb = sum(n * (group_mean - overall_mean)**2 for group_mean in group_means)

ssw = sum(sum((x - group_mean)**2 for x in group) for group, group_mean in zip([group1, group2, group3], group_means))

print(f"SST: {sst}, SSB: {ssb}, SSW: {ssw}")

SST: 22.93333333333334, SSB: 2.533333333333333, SSW: 20.4
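
The sums of squares above lead directly to the F-statistic. The following sketch, which reuses the same three groups, computes F by hand and cross-checks it against `scipy.stats.f_oneway`; the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

group1 = [5, 7, 9, 6, 8]
group2 = [7, 8, 6, 7, 9]
group3 = [6, 5, 7, 8, 6]
groups = [group1, group2, group3]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group and within-group sums of squares.
ssb = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
ssw = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups.
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares; the p-value is the upper tail of the F distribution.
f_manual = (ssb / df_between) / (ssw / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA.
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```

Both routes should agree up to floating-point error, which is a convenient sanity check on a hand-rolled decomposition.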


### Q5: In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?


In [6]:
#A5.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data (no random seed is set, so the table below will differ between runs)
data = {
    'Factor1': np.tile(['A', 'B', 'C'], 10),
    'Factor2': np.repeat(['X', 'Y'], 15),
    'Value': np.random.randn(30)
}
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Value ~ C(Factor1) * C(Factor2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

                          sum_sq    df         F    PR(>F)
C(Factor1)              0.411303   2.0  0.345436  0.711374
C(Factor2)              0.044610   1.0  0.074932  0.786628
C(Factor1):C(Factor2)   0.470428   2.0  0.395092  0.677919
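Residual               14.288130  24.0       NaN       NaN

The `C(Factor1)` and `C(Factor2)` rows test the two main effects, and the `C(Factor1):C(Factor2)` row tests their interaction. Because the values are pure random noise, none of the effects is significant here.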



### Q6: Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

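A6. Since the p-value (0.02) is less than the conventional significance level (α = 0.05), we reject the null hypothesis that all group means are equal.
Interpretation: At least one group mean differs significantly from the others. The F-test alone does not show which groups differ; a post-hoc test is needed for that.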

### Q7: In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

A7. Missing data can be handled in several ways:
1. **Listwise Deletion**: Removing subjects with any missing data.
   - *Consequence*: Reduced sample size and lower power, plus biased estimates if the data are not missing completely at random.
   
2. **Mean Imputation**: Replacing missing values with the mean of the observed data.
   - *Consequence*: Underestimation of variability, potentially leading to biased results.
   
3. **Multiple Imputation**: Creating several completed datasets from a model of the missing values and pooling the results, which accounts for the uncertainty of the imputed values.
   - *Consequence*: Generally more accurate and reliable, but computationally intensive.
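
The first two approaches can be compared on a small example. The sketch below uses a hypothetical wide-format table (one row per subject, one column per time point; all values invented for illustration) to show how listwise deletion shrinks the sample and how mean imputation shrinks the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated-measures data with two missing measurements.
scores = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0, 9.0],
        "t2": [6.0, np.nan, 8.0, 9.0, 10.0],
        "t3": [7.0, 8.0, np.nan, 10.0, 11.0],
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Listwise deletion: drop every subject with at least one missing value.
listwise = scores.dropna()

# Mean imputation: replace each missing value with its column (time-point) mean.
mean_imputed = scores.fillna(scores.mean())

print(f"Subjects kept after listwise deletion: {len(listwise)} of {len(scores)}")
print("Variance per time point, observed values only:")
print(scores.var())
print("Variance per time point, after mean imputation:")
print(mean_imputed.var())
```

The imputed columns show a smaller variance than the observed values, which is the underestimation of variability noted above.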

### Q8: What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

A8. Common post-hoc tests include:
1. **Tukey's HSD**: Used when comparing all possible pairs of means.
   - *Example*: Comparing multiple treatment groups to identify which ones differ.
   
2. **Bonferroni Correction**: Used to adjust the significance level for multiple comparisons.
   - *Example*: Testing multiple hypotheses while controlling the family-wise error rate.
   
3. **Scheffé's Test**: Used for comparing complex contrasts of means.
   - *Example*: Comparing the average of several group means to another group.

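**Example Situation**:
If a one-way ANOVA comparing test scores across three teaching methods indicates a significant difference, the F-test does not say which methods differ. A post-hoc test such as Tukey's HSD can identify the specific pairs of groups that differ from each other.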
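
As an illustration of the Bonferroni approach, the sketch below runs all pairwise independent t-tests on three simulated groups and multiplies each p-value by the number of comparisons. The group labels, means, and sample sizes are assumptions chosen for the example.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Simulated groups (hypothetical): group C is given a higher mean than A and B.
rng = np.random.default_rng(42)
samples = {
    "A": rng.normal(loc=50, scale=5, size=20),
    "B": rng.normal(loc=50, scale=5, size=20),
    "C": rng.normal(loc=60, scale=5, size=20),
}

pairs = list(combinations(samples, 2))
n_comparisons = len(pairs)

results = {}
for g1, g2 in pairs:
    t_stat, p_raw = stats.ttest_ind(samples[g1], samples[g2])
    # Bonferroni: multiply by the number of comparisons and cap at 1.
    p_adj = min(p_raw * n_comparisons, 1.0)
    results[(g1, g2)] = (p_raw, p_adj)
    print(f"{g1} vs {g2}: t = {t_stat:.3f}, raw p = {p_raw:.4f}, adjusted p = {p_adj:.4f}")
```

Bonferroni is simple but conservative; with many groups, Tukey's HSD typically retains more power for all-pairs comparisons.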

### Q9: A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.


In [5]:
#A9.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

np.random.seed(0)
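n_per_group = 50 // 3  # 16 per diet, so only 48 of the 50 participants are simulated
data = {
    'Diet': np.repeat(['A', 'B', 'C'], n_per_group),
    # Caution: np.tile cycles the shifts 1, 2, 3 row by row, so they do not line up with
    # the np.repeat labels above and every diet ends up with roughly the same expected mean.
    # np.repeat([1, 2, 3], n_per_group) would give each diet its own shift.
    'WeightLoss': np.random.randn(n_per_group * 3) + np.tile([1, 2, 3], n_per_group)
}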
df = pd.DataFrame(data)

model = ols('WeightLoss ~ C(Diet)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


             sum_sq    df         F    PR(>F)
C(Diet)    5.787445   2.0  1.903686  0.160827
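Residual  68.402838  45.0       NaN       NaN

Interpretation: F(2, 45) = 1.90 with p = 0.161, which is above 0.05, so we fail to reject the null hypothesis. These simulated data show no significant difference in mean weight loss between the three diets.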
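
The simulated data above pair `np.repeat` labels with `np.tile` shifts. A minimal sketch of how the two functions order their output, using hypothetical groups of two rows each, shows why that pairing matters when building grouped data:

```python
import numpy as np

labels = np.repeat(["A", "B", "C"], 2)       # each label repeated in a block
offsets_tile = np.tile([1, 2, 3], 2)         # the whole sequence repeated
offsets_repeat = np.repeat([1, 2, 3], 2)     # each value repeated in a block

print(labels.tolist())          # ['A', 'A', 'B', 'B', 'C', 'C']
print(offsets_tile.tolist())    # [1, 2, 3, 1, 2, 3]
print(offsets_repeat.tolist())  # [1, 1, 2, 2, 3, 3]
```

Only `offsets_repeat` gives every row of a group the same shift; with `offsets_tile`, each group receives a mix of all three shifts, so the group means are not separated as intended.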



### Q10: A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.


In [3]:
#A10.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
np.random.seed(0)
data = {
    'Program': np.tile(['A', 'B', 'C'], 10),
    'Experience': np.repeat(['Novice', 'Experienced'], 15),
    'Time': np.random.randn(30) + np.tile([1, 2, 3], 10)
}
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Time ~ C(Program) * C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

                             sum_sq    df         F    PR(>F)
C(Program)                 5.338375   2.0  2.277381  0.124277
C(Experience)              1.788304   1.0  1.525801  0.228700
C(Program):C(Experience)   0.458415   2.0  0.195563  0.823669
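Residual                  28.129023  24.0       NaN       NaN

Interpretation: none of the effects is significant at the 0.05 level. The main effect of Program gives F(2, 24) = 2.28 with p = 0.124, the main effect of Experience gives F(1, 24) = 1.53 with p = 0.229, and the Program × Experience interaction gives F(2, 24) = 0.20 with p = 0.824. For these simulated data there is no evidence that completion time depends on the program, on experience level, or on their combination.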



### Q11: An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.


In [2]:
#A11.
import scipy.stats as stats
import numpy as np


np.random.seed(0)
control_group = np.random.normal(70, 10, 50)
experimental_group = np.random.normal(75, 10, 50)

t_stat, p_value = stats.ttest_ind(control_group, experimental_group)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

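if p_value < 0.05:
    # A two-sided test only shows that the means differ, so report the direction explicitly.
    direction = "higher" if experimental_group.mean() > control_group.mean() else "lower"
    print(f"Significant difference: the experimental group scored {direction} on average.")
else:
    print("No significant difference in test scores between the two groups.")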

T-statistic: -1.6677351961320235
P-value: 0.09856078338184604
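No significant difference in test scores between the two groups.

Interpretation: t = -1.67 with p = 0.099, which is above 0.05, so we fail to reject the null hypothesis of equal mean scores. With only two groups no post-hoc test is needed, because the t-test already compares the single possible pair.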




### Q12: A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [1]:
#A12. 

import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM

np.random.seed(0)
data = {
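    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Day': np.tile(range(30), 3),
    # Caution: np.tile cycles the shifts 100, 110, 120 row by row, so each shift follows
    # the day (day % 3) rather than the store; np.repeat([100, 110, 120], 30) would give
    # each store its own mean.
    'Sales': np.random.randn(90) + np.tile([100, 110, 120], 30)
}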
df = pd.DataFrame(data)

aovrm = AnovaRM(df, 'Sales', 'Day', within=['Store'])
res = aovrm.fit()
print(res)

if res.anova_table['Pr > F'].iloc[0] < 0.05:
    print("Significant differences found between stores. Conducting post-hoc tests...")
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    posthoc = pairwise_tukeyhsd(df['Sales'], df['Store'], alpha=0.05)
    print(posthoc)
else:
    print("No significant differences in sales between the stores.")


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  4.2820 2.0000 58.0000 0.0184

Significant differences found between stores. Conducting post-hoc tests...




Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
     A      B  -0.7324 0.9363 -5.7849 4.3201  False
     A      C  -0.5766   0.96 -5.6291 4.4759  False
     B      C   0.1558  0.997 -4.8967 5.2083  False
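---------------------------------------------------

Interpretation: the repeated measures ANOVA reports a significant store effect (F(2, 58) = 4.28, p = 0.018), yet Tukey's HSD rejects none of the pairwise comparisons. The two results disagree because `pairwise_tukeyhsd` treats the stores as independent samples and ignores the pairing by day, so day-to-day variation inflates its error term. Paired comparisons, such as paired t-tests with a Bonferroni correction, would match the repeated-measures design.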
