## Q1. Assumptions Required to Use ANOVA and Examples of Violations

*Assumptions:*
1. *Independence of observations:* Each observation should be independent of the others: every subject belongs to exactly one group, and measurements in one group do not influence those in another.
2. *Normality:* The residuals (equivalently, the data within each group) should be approximately normally distributed.
3. *Homogeneity of variances (homoscedasticity):* The population variances of the groups should be roughly equal.

*Examples of Violations:*
1. *Independence violation:* If a researcher measures the same subjects multiple times and analyzes the data with a standard between-subjects ANOVA instead of a repeated measures design, this assumption is violated.
2. *Normality violation:* If the data in one or more groups are heavily skewed or contain extreme outliers, the p-value of the F-test can be inaccurate, especially with small samples.
3. *Homogeneity violation:* If one group has a much larger variance than the others, the assumption is violated; the F-test can then have an inflated Type I error rate, particularly when group sizes are unequal.
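
The normality and equal-variance assumptions can be checked before running ANOVA. The following is a minimal sketch, assuming hypothetical simulated data in which group C is deliberately given a larger spread; it uses the Shapiro-Wilk and Levene tests from SciPy. Independence cannot be tested this way, since it follows from how the study was designed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data: three groups of 30 observations; group C has a larger spread
groups = {
    "A": rng.normal(loc=50, scale=5, size=30),
    "B": rng.normal(loc=52, scale=5, size=30),
    "C": rng.normal(loc=51, scale=15, size=30),
}

# Normality: Shapiro-Wilk test per group (H0: the sample comes from a normal distribution)
shapiro_p = {name: stats.shapiro(x).pvalue for name, x in groups.items()}

# Homogeneity of variances: Levene's test (H0: all groups have equal variance)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk, group {name}: p = {p:.3f}")
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.4g}")
```

A small Levene p-value suggests the equal-variance assumption does not hold, in which case Welch's ANOVA or a non-parametric test such as Kruskal-Wallis is a common alternative.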


## Q2. Types of ANOVA and Their Situations

1. *One-way ANOVA:* Used when comparing means of three or more independent (unrelated) groups.
   - Example: Comparing the average test scores of students from three different schools.
   
2. *Two-way ANOVA:* Used when examining the influence of two categorical independent variables, and their interaction, on one continuous dependent variable.
   - Example: Studying the effect of teaching method (traditional vs. new) and gender on student test scores.

3. *Repeated Measures ANOVA:* Used when the same subjects are measured under every condition or at every time point (i.e., a within-subjects design).
   - Example: Measuring the blood pressure of the same patients before, one week after, and one month after starting a new medication. (With only two time points, a paired t-test is sufficient.)

## Q3. Partitioning of Variance in ANOVA

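ANOVA partitions the total variation in the data, measured as sums of squared deviations, into components:
1. *Total Sum of Squares (SST):* Squared deviations of every observation from the overall mean.
2. *Explained Sum of Squares (SSE):* Variation explained by the model: squared deviations of the group means from the overall mean, weighted by group size (between-group variation).
3. *Residual Sum of Squares (SSR):* Variation within the groups: squared deviations of each observation from its own group mean.

The components satisfy SST = SSE + SSR. Some textbooks use the abbreviations the other way around (SSE for the error sum of squares, SSR for the regression sum of squares); here SSE always means explained and SSR always means residual.

Understanding this partition shows how much of the total variation is explained by group membership versus variation within the groups. The F-statistic compares the two after each is divided by its degrees of freedom.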
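
Written out as formulas, the partition looks as follows. The notation is introduced here for illustration and is not from the original text: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, $x_{ij}$ the $i$-th observation in group $j$, $\bar{x}_j$ the mean of group $j$, and $\bar{x}$ the overall mean.

```latex
\begin{aligned}
\mathrm{SST} &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2 \\
\mathrm{SSE} &= \sum_{j=1}^{k} n_j \, (\bar{x}_j - \bar{x})^2 \\
\mathrm{SSR} &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 \\
\mathrm{SST} &= \mathrm{SSE} + \mathrm{SSR} \\
F &= \frac{\mathrm{SSE} / (k - 1)}{\mathrm{SSR} / (N - k)}
\end{aligned}
```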


## Q4. Calculating SST, SSE, and SSR in One-way ANOVA Using Python

```python
import numpy as np
import pandas as pd
from scipy import stats

# Sample data: three groups of 10 standard-normal values
# (no random seed is set, so the values change on each run)
data = {'group': ['A']*10 + ['B']*10 + ['C']*10,
        'values': np.random.randn(30)}
df = pd.DataFrame(data)

# Grouping data
grouped_data = df.groupby('group')['values']

# Mean of each group and overall mean
group_means = grouped_data.mean()
overall_mean = df['values'].mean()

# SST (Total Sum of Squares): each observation vs. the overall mean
sst = ((df['values'] - overall_mean)**2).sum()

# SSE (Explained Sum of Squares): group means vs. the overall mean, weighted by group size
sse = (grouped_data.size() * (group_means - overall_mean)**2).sum()

# SSR (Residual Sum of Squares): each observation vs. its own group mean
ssr = ((df['values'] - df['group'].map(group_means))**2).sum()

sst, sse, ssr
```

Output from one run:

```
(32.38919911046362, 2.530948957956093, 29.858250152507534)
```
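
As a sanity check, the sums of squares can be turned into an F-statistic by hand and compared with `scipy.stats.f_oneway`. This is a minimal sketch on hypothetical simulated data (the group means 0.0, 0.5, and 1.0 are made up for illustration), using the same naming as above: SSE for explained, SSR for residual.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: three groups of 10 observations each
groups = [rng.normal(loc=m, scale=1.0, size=10) for m in (0.0, 0.5, 1.0)]
all_values = np.concatenate(groups)
overall_mean = all_values.mean()

k = len(groups)        # number of groups
n = all_values.size    # total number of observations

sst = ((all_values - overall_mean) ** 2).sum()
sse = sum(g.size * (g.mean() - overall_mean) ** 2 for g in groups)  # between groups
ssr = sum(((g - g.mean()) ** 2).sum() for g in groups)              # within groups

# F = mean square between / mean square within
f_manual = (sse / (k - 1)) / (ssr / (n - k))
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"SST = {sst:.4f}, SSE + SSR = {sse + ssr:.4f}")
print(f"F (manual) = {f_manual:.4f}, F (scipy) = {f_scipy:.4f}")
```

The two printed F values should agree, and SST should equal SSE + SSR up to floating-point rounding.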

## Q5. Calculating Main Effects and Interaction Effects in Two-way ANOVA Using Python

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data: 3 software programs x 2 experience levels, 10 observations per cell
np.random.seed(0)
data = pd.DataFrame({
    'software': np.repeat(['A', 'B', 'C'], 20),
    'experience': np.tile(np.repeat(['novice', 'experienced'], 10), 3),
    'time': np.random.randn(60) * 5 + 20
})

# Two-way ANOVA with interaction (Type II sums of squares)
model = ols('time ~ C(software) * C(experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

anova_table
```

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(software) | 232.853491 | 2.0 | 4.436187 | 0.016452 |
| C(experience) | 3.777445 | 1.0 | 0.143931 | 0.705891 |
| C(software):C(experience) | 31.044751 | 2.0 | 0.591446 | 0.557072 |
| Residual | 1417.218194 | 54.0 | | |

The `C(software)` and `C(experience)` rows are the main effects, and the `C(software):C(experience)` row is the interaction effect.


## Q6. Interpreting F-statistic and p-value from One-way ANOVA

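If a one-way ANOVA yields an F-statistic of 5.23 and a p-value of 0.02:
- The p-value of 0.02 is below the conventional significance level of 0.05, so we reject the null hypothesis that all group means are equal: at least one group mean differs from the others.
- The F-statistic is the ratio of the between-group mean square to the within-group mean square, so F = 5.23 means the variation between group means is about 5.2 times what within-group variation alone would produce.
- The test does not say which groups differ; a post-hoc test (see Q8) is needed for that.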
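
The p-value is the upper-tail area of the F distribution at the observed statistic, so it depends on the degrees of freedom, which the question does not give. The sketch below assumes hypothetical degrees of freedom (3 groups, 30 observations) purely for illustration, so it will not reproduce the 0.02 quoted above exactly.

```python
from scipy import stats

f_stat = 5.23

# Hypothetical degrees of freedom: 3 groups (k - 1 = 2) and 30 observations (N - k = 27)
df_between, df_within = 2, 27

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value at alpha = 0.05: F must exceed this to reject the null hypothesis
f_crit = stats.f.ppf(0.95, df_between, df_within)

print(f"p = {p_value:.4f}, critical F = {f_crit:.3f}")
print("Reject H0" if f_stat > f_crit else "Fail to reject H0")
```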

## Q7. Handling Missing Data in Repeated Measures ANOVA

*Methods:*
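1. *Listwise Deletion:* Removing every subject who has a missing value at any time point.
   - Consequence: Loss of data reduces power, and results are biased unless the data are missing completely at random (MCAR).

2. *Imputation:* Filling in missing values with estimates, such as the mean or median of the observed values, or with multiple imputation.
   - Consequence: Single-value imputation (mean, median) shrinks variability and understates standard errors; multiple imputation avoids this at the cost of more work.

3. *Mixed-effects Models:* Fitting a model that uses every available observation, so subjects with partial data still contribute.
   - Consequence: More complex, but estimates remain valid when data are missing at random (MAR), which often makes this the best way to retain all available data.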
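
The three approaches can be compared side by side. This is a minimal sketch on hypothetical simulated data (20 subjects, 3 time points, 5 values removed by hand); the column names `subject`, `time`, and `score` are made up for illustration. The mixed model is a random-intercept model fitted with `statsmodels.formula.api.mixedlm`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical wide-format data: 20 subjects measured at 3 time points
n_subjects = 20
subject_effect = rng.normal(0, 3, size=n_subjects)  # per-subject baseline shift
wide = pd.DataFrame({
    "subject": np.arange(n_subjects),
    "t1": 50 + subject_effect + rng.normal(0, 2, n_subjects),
    "t2": 52 + subject_effect + rng.normal(0, 2, n_subjects),
    "t3": 55 + subject_effect + rng.normal(0, 2, n_subjects),
})

# Remove five measurements to simulate dropout
wide.loc[[1, 4, 7], "t3"] = np.nan
wide.loc[[10, 15], "t2"] = np.nan

# 1. Listwise deletion: keep only subjects with all three measurements
complete_cases = wide.dropna()

# 2. Mean imputation: replace each missing value with that time point's mean
mean_imputed = wide.fillna(wide.mean())

# 3. Mixed-effects model: reshape to long format and keep every observed value
long = (
    wide.melt(id_vars="subject", var_name="time", value_name="score")
    .dropna()
    .reset_index(drop=True)
)
mixed = smf.mixedlm("score ~ C(time)", data=long, groups=long["subject"]).fit()

print(f"Listwise deletion keeps {len(complete_cases)} of {n_subjects} subjects")
print(f"Mixed model uses {len(long)} of {3 * n_subjects} observations")
print(mixed.fe_params)
```

Listwise deletion discards whole subjects, while the long-format model only loses the individual measurements that are actually missing.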

## Q8. Common Post-hoc Tests after ANOVA

1. *Tukey's HSD:* Compares all possible pairs of group means while controlling the family-wise error rate.
   - Example: After finding a significant difference in ANOVA, use Tukey's HSD to find which groups differ.

2. *Bonferroni Correction:* Divides the significance level by the number of comparisons (equivalently, multiplies each p-value by that number) to control the family-wise Type I error rate.
   - Example: When performing multiple pairwise t-tests.

3. *Scheffé Test:* More conservative than Tukey's HSD for pairwise comparisons; it controls the error rate across all possible contrasts, not only pairwise ones.
   - Example: Comparing the average of two treatment groups against a control group.
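
The first two tests are available in statsmodels. Below is a minimal sketch on hypothetical simulated scores, where group C is deliberately given a higher mean; it runs Tukey's HSD with `pairwise_tukeyhsd` and a Bonferroni adjustment of pairwise t-tests with `multipletests`. The Scheffé test is not shown.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)

# Hypothetical scores: groups A and B share a mean, group C is clearly higher
samples = {
    "A": rng.normal(70, 5, 30),
    "B": rng.normal(70, 5, 30),
    "C": rng.normal(80, 5, 30),
}
values = np.concatenate(list(samples.values()))
labels = np.repeat(list(samples.keys()), 30)

# Tukey's HSD: all pairwise comparisons in one call
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())

# Bonferroni: pairwise t-tests, then adjust the raw p-values
pairs = list(combinations(samples, 2))
raw_p = [stats.ttest_ind(samples[a], samples[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4g}, reject H0 = {r}")
```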

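## Q9. One-way ANOVA for Weight Loss Data Using Python

```python
from scipy import stats

# Sample data: 50 participants across three diets
# (no random seed is set, so results vary between runs)
data = {
    'Diet': ['A']*17 + ['B']*17 + ['C']*16,
    'WeightLoss': np.random.randn(50) * 2 + 5
}
df = pd.DataFrame(data)

# One-way ANOVA
f_stat, p_val = stats.f_oneway(df[df['Diet'] == 'A']['WeightLoss'],
                               df[df['Diet'] == 'B']['WeightLoss'],
                               df[df['Diet'] == 'C']['WeightLoss'])

f_stat, p_val

# Interpretation:
# - Report the F-statistic and p-value.
# - If the p-value is less than 0.05, conclude that at least one diet's mean weight loss differs from the others.
# - Otherwise, there is not enough evidence of a difference between the diets.
```

Output from one run:

```
(2.766424791352801, 0.073142890570744)
```

With F = 2.77 and p = 0.073 > 0.05, we fail to reject the null hypothesis: this sample shows no statistically significant difference in mean weight loss among the three diets.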

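## Q10. Two-way ANOVA for Software Programs and Experience Levels Using Python

```python
# Rebuild the Q5 data set: `data` was reassigned to the diet data in Q9
np.random.seed(0)
data = pd.DataFrame({
    'software': np.repeat(['A', 'B', 'C'], 20),
    'experience': np.tile(np.repeat(['novice', 'experienced'], 10), 3),
    'time': np.random.randn(60) * 5 + 20
})

model = ols('time ~ C(software) * C(experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

anova_table

# Interpretation:
# - Report the F-statistics and p-values for both main effects and the interaction effect.
# - A p-value below 0.05 indicates a statistically significant effect.
```

Because the seed and the data match Q5, this should reproduce the table shown there: the main effect of software is significant (F = 4.44, p = 0.016), while the main effect of experience (p = 0.706) and the software-by-experience interaction (p = 0.557) are not.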

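## Q11. Two-sample t-test for Teaching Methods Using Python

```python
# Sample data: test scores for a control group and an experimental group
np.random.seed(0)
control = np.random.randn(50) * 5 + 70
experiment = np.random.randn(50) * 5 + 75

# Two-sample t-test (assumes equal variances by default)
t_stat, p_val = stats.ttest_ind(control, experiment)

t_stat, p_val

# Interpretation:
# - If p-value < 0.05, conclude there is a significant difference in test scores between the two groups.
```

```
(-4.131173276068804, 7.60404836914434e-05)
```

Here p is about 0.00008, well below 0.05, so the difference in mean test scores is statistically significant. The negative t-statistic means the control group's mean is lower than the experimental group's mean.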
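
With exactly two groups, the pooled-variance t-test and a one-way ANOVA are the same test: the F-statistic equals the square of the t-statistic, and the p-values match. The sketch below illustrates this on hypothetical simulated scores, and also shows Welch's t-test (`equal_var=False`), which drops the equal-variance assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical test scores for two teaching methods
control = rng.normal(70, 5, 50)
experiment = rng.normal(75, 5, 50)

# Pooled-variance t-test (the default) and one-way ANOVA on the same data
t_stat, t_p = stats.ttest_ind(control, experiment)
f_stat, f_p = stats.f_oneway(control, experiment)

print(f"t^2 = {t_stat**2:.4f}, F = {f_stat:.4f}")
print(f"t-test p = {t_p:.3g}, ANOVA p = {f_p:.3g}")

# Welch's t-test does not assume equal variances
welch_t, welch_p = stats.ttest_ind(control, experiment, equal_var=False)
print(f"Welch t = {welch_t:.4f}, p = {welch_p:.3g}")
```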

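## Q12. Repeated Measures ANOVA for Sales Data Using Python

```python
from statsmodels.stats.anova import AnovaRM

# Sample data: sales for each of 3 stores on each of 30 days
# (no random seed is set, so results vary between runs)
data = {
    'Store': np.tile(['A', 'B', 'C'], 30),
    'Day': np.repeat(np.arange(30), 3),
    'Sales': np.random.randn(90) * 10 + 200
}
df = pd.DataFrame(data)

# Repeated Measures ANOVA: each day is the repeated unit ("subject"),
# observed once in every store (the within factor)
aovrm = AnovaRM(df, 'Sales', 'Day', within=['Store'])
res = aovrm.fit()

res.summary()

# Interpretation:
# - Report the F-statistic and p-value for the Store effect.
# - If significant, follow up with pairwise comparisons that respect the repeated-measures design
#   (e.g. paired t-tests with a Bonferroni correction); Tukey's HSD assumes independent groups.
```

Output from one run:

| | F Value | Num DF | Den DF | Pr > F |
|---|---|---|---|---|
| Store | 1.7820 | 2.0000 | 58.0000 | 0.1774 |

With F(2, 58) = 1.78 and p = 0.177 > 0.05, this run shows no statistically significant difference in mean sales among the three stores, so no post-hoc test is needed.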
