### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

#### Assumptions:
1. Independence of Observations:

Observations must be independent of one another, both within and across groups.

Violation Example: Measuring the same subjects multiple times without considering repeated measures.

2. Normality:
The dependent variable should be approximately normally distributed for each group.

Violation Example: Highly skewed data or presence of outliers.

3. Homogeneity of Variances:
The variance among the groups should be approximately equal.

Violation Example: One group's variance is several times larger than another's, which matters most when group sizes are also unequal.

#### Impact of Violations:
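Violations of these assumptions can lead to inaccurate conclusions. For example, treating correlated observations as independent typically understates the error variance and inflates the F-statistic; strong non-normality can distort p-values, particularly with small samples; and unequal variances can inflate Type I error rates, especially when group sizes are unequal.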
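
The normality and equal-variance assumptions can be screened before running ANOVA. The sketch below is one possible check on hypothetical simulated groups (the data, group names, and group sizes are assumptions, not taken from this notebook); it uses the Shapiro-Wilk test per group and Levene's test across groups. Independence cannot be tested this way and has to follow from the study design.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: three groups of 30 observations each
groups = {
    'A': rng.normal(loc=50, scale=5, size=30),
    'B': rng.normal(loc=52, scale=5, size=30),
    'C': rng.normal(loc=55, scale=5, size=30),
}

# Normality: Shapiro-Wilk test per group (null hypothesis: data are normal)
shapiro_pvalues = {name: stats.shapiro(values).pvalue
                   for name, values in groups.items()}

# Homogeneity of variances: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_pvalues.items():
    print(f"Shapiro-Wilk, group {name}: p = {p:.3f}")
print(f"Levene: statistic = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may not hold; with large samples these tests can flag deviations that are too small to matter in practice.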

### Q2. What are the three types of ANOVA, and in what situations would each be used?

#### Types of ANOVA

1. One-Way ANOVA:
Used when comparing the means of three or more independent groups based on one factor.

Example: Comparing the average scores of students from different teaching methods.

2. Two-Way ANOVA:
Used to evaluate the effect of two different categorical independent variables on a continuous dependent variable, including interaction effects.

Example: Analyzing the impact of teaching method and gender on students' scores.

3. Repeated Measures ANOVA:
Used when the same subjects are measured multiple times under different conditions.

Example: Measuring blood pressure of patients before, during, and after treatment.

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

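ANOVA partitions the total variation observed in the data into components attributable to different sources:
- Total Sum of Squares (SST): Total variation of the observations around the grand mean.
- Explained Sum of Squares (SSE): Variation explained by the model (between-group variation).
- Residual Sum of Squares (SSR): Variation not explained by the model (within-group variation).

The components add up: SST = SSE + SSR. Note that some textbooks use SSE for the error (residual) sum of squares and SSR for the regression (explained) sum of squares; this notebook follows the naming used in the questions.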

#### Importance:
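- Understanding how variance is partitioned helps in determining the contribution of different factors to the total variability. A large between-group share relative to the within-group share indicates that group membership accounts for much of the variability.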
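
For a one-way design, the partition can be written out explicitly. The notation here is an assumption introduced for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ the $j$-th observation in group $i$, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the grand mean. These are the standard textbook formulas, labeled with this notebook's SSE/SSR naming.

```latex
\begin{aligned}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2 \\
SSE &= \sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2 \\
SSR &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2 \\
SST &= SSE + SSR \\
F &= \frac{SSE / (k - 1)}{SSR / (N - k)}
\end{aligned}
```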

### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?


In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats

# Example data
data = {'Group': np.repeat(['A', 'B', 'C'], 10),
        'Value': np.random.normal(loc=50, scale=5, size=30)}
df = pd.DataFrame(data)

# One-way ANOVA
anova_result = stats.f_oneway(df[df['Group'] == 'A']['Value'],
                              df[df['Group'] == 'B']['Value'],
                              df[df['Group'] == 'C']['Value'])

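# Calculate SST, SSE, and SSR
grand_mean = df['Value'].mean()
group_means = df.groupby('Group')['Value'].mean()
group_sizes = df.groupby('Group')['Value'].size()
# Total sum of squares: deviations of every observation from the grand mean
sst = np.sum((df['Value'] - grand_mean)**2)
# Explained (between-group) sum of squares: group-size-weighted deviations of group means
sse = np.sum(group_sizes * (group_means - grand_mean)**2)
# Residual (within-group) sum of squares: deviations of observations from their own group mean
ssr = np.sum((df['Value'] - df.groupby('Group')['Value'].transform('mean'))**2)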

print(f"SST: {sst}, SSE: {sse}, SSR: {ssr}")
print(f"F-statistic: {anova_result.statistic}, p-value: {anova_result.pvalue}")


SST: 502.08112942829115, SSE: 15.545814429242743, SSR: 486.5353149990484
F-statistic: 0.4313530556259582, p-value: 0.6540280054434859
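
The F-statistic can also be recovered by hand from the sums of squares, which makes the link between the partition and the test explicit. The sketch below copies SSE and SSR from the output above and assumes the design of the example (3 groups, 30 observations in total); it should reproduce the F-statistic and p-value that `f_oneway` reported.

```python
from scipy import stats

# Values copied from the printed output above
sse = 15.545814429242743   # explained (between-group) sum of squares
ssr = 486.5353149990484    # residual (within-group) sum of squares
k, n = 3, 30               # number of groups, total number of observations

df_between = k - 1         # degrees of freedom for the group effect
df_within = n - k          # degrees of freedom for the residual

ms_between = sse / df_between   # mean square between groups
ms_within = ssr / df_within     # mean square within groups

f_stat = ms_between / ms_within
# Upper-tail probability of the F distribution gives the p-value
p_value = stats.f.sf(f_stat, df_between, df_within)

print(f"F = {f_stat:.4f}, p = {p_value:.4f}")
```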


### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?



In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
data = {'Factor1': np.repeat(['A', 'B'], 30),
        'Factor2': np.tile(np.repeat(['X', 'Y'], 15), 2),
        'Value': np.random.normal(loc=50, scale=5, size=60)}
df = pd.DataFrame(data)

# Two-way ANOVA
model = ols('Value ~ C(Factor1) + C(Factor2) + C(Factor1):C(Factor2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


                            sum_sq    df         F    PR(>F)
C(Factor1)                7.692084   1.0  0.388726  0.535501
C(Factor2)                9.154745   1.0  0.462642  0.499194
C(Factor1):C(Factor2)     1.251788   1.0  0.063260  0.802336
Residual               1108.125230  56.0       NaN       NaN
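
To see what the table's rows measure, it can help to look at the cell means directly. The sketch below rebuilds a 2x2 layout like the one above with freshly simulated data (the random generator and seed are assumptions, so the numbers will not match the table) and computes marginal means for the main effects and a difference-of-differences contrast for the interaction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical balanced 2x2 design with 15 observations per cell
df = pd.DataFrame({
    'Factor1': np.repeat(['A', 'B'], 30),
    'Factor2': np.tile(np.repeat(['X', 'Y'], 15), 2),
    'Value': rng.normal(loc=50, scale=5, size=60),
})

# Cell means: one entry per Factor1 x Factor2 combination
cell_means = df.pivot_table(values='Value', index='Factor1',
                            columns='Factor2', aggfunc='mean')

# Main effects compare the marginal means of each factor
marginal_f1 = df.groupby('Factor1')['Value'].mean()
marginal_f2 = df.groupby('Factor2')['Value'].mean()

# Interaction contrast for a 2x2 design: difference of differences.
# A value near zero means the X-Y gap is about the same for A and B.
interaction = ((cell_means.loc['A', 'X'] - cell_means.loc['A', 'Y'])
               - (cell_means.loc['B', 'X'] - cell_means.loc['B', 'Y']))

print(cell_means)
print(f"Main effect of Factor1 (A - B): {marginal_f1['A'] - marginal_f1['B']:.3f}")
print(f"Main effect of Factor2 (X - Y): {marginal_f2['X'] - marginal_f2['Y']:.3f}")
print(f"Interaction contrast: {interaction:.3f}")
```

The ANOVA table then tests whether these differences are larger than would be expected from the residual variation alone.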


### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

#### Interpretation of One-Way ANOVA Results

Given an F-statistic of 5.23 and a p-value of 0.02:
- Conclusion: Since the p-value (0.02) is less than the conventional significance level of 0.05, we reject the null hypothesis that all group means are equal.
- Interpretation: At least one group mean differs significantly from the others. The F-test alone does not identify which groups differ, so a post-hoc test is needed to locate the differences.


### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

#### Methods for Handling Missing Data in Repeated Measures ANOVA

1. Listwise Deletion: Removing any subject with a missing value at any time point.

- Consequence: Reduced sample size and power; results can be biased unless the data are missing completely at random.

2. Mean Imputation: Replacing missing values with the mean of the observed values.

- Consequence: Underestimates variability, which can bias results and make effects look more significant than they are.

3. Multiple Imputation: Replacing missing values with multiple sets of simulated values and pooling the results across the imputed datasets.

- Consequence: Generally less biased and reflects the uncertainty in the imputed values, but more complex to implement.
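
The first two consequences can be seen on a small example. The sketch below uses hypothetical wide-format data (10 subjects, three time points, with column names chosen for illustration) and compares listwise deletion with mean imputation. Multiple imputation is not shown because it needs a dedicated imputation model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical wide-format data: 10 subjects measured at three time points
wide = pd.DataFrame(rng.normal(loc=120, scale=10, size=(10, 3)),
                    columns=['before', 'during', 'after'])

# Remove three values to simulate missing measurements
wide.loc[[1, 4], 'during'] = np.nan
wide.loc[7, 'after'] = np.nan

# Listwise deletion: drop every subject with any missing time point
listwise = wide.dropna()

# Mean imputation: fill each column's gaps with that column's observed mean
mean_imputed = wide.fillna(wide.mean())

print(f"Subjects kept after listwise deletion: {len(listwise)} of {len(wide)}")
print(f"Variance of 'during', observed values only: {wide['during'].var():.2f}")
print(f"Variance of 'during', after mean imputation: {mean_imputed['during'].var():.2f}")
```

Listwise deletion discards whole subjects, including their observed values, while mean imputation keeps every subject but shrinks the variance of the imputed column.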

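### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

#### Common Post-Hoc Tests
1. Tukey's HSD: Used for all pairwise comparisons of group means after a significant ANOVA, while controlling the family-wise error rate.
- Example: Comparing all pairs of group means in a study with three diets.
2. Bonferroni Correction: Adjusts the significance level to account for multiple comparisons by dividing it by the number of comparisons.
- Example: When performing multiple t-tests on the same data, such as three pairwise tests each evaluated at 0.05 / 3.

A post-hoc test is necessary when the ANOVA is significant and there are three or more groups, because the F-test shows only that some difference exists, not which groups differ.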
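
Both approaches can be sketched on the three-diet example. The data below are simulated and hypothetical, and the code assumes a SciPy version that provides `scipy.stats.tukey_hsd` (1.8 or later). The Bonferroni part multiplies each raw p-value by the number of comparisons, which is equivalent to dividing the significance level.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical weight-loss data for three diets, 50 participants each
diets = {
    'A': rng.normal(loc=5, scale=1.5, size=50),
    'B': rng.normal(loc=7, scale=1.5, size=50),
    'C': rng.normal(loc=6, scale=1.5, size=50),
}
names = list(diets)

# Tukey's HSD: all pairwise comparisons with family-wise error control.
# tukey.pvalue[i, j] is the adjusted p-value for group i versus group j.
tukey = stats.tukey_hsd(*diets.values())

# Bonferroni: independent-samples t-test per pair, p-value scaled by the
# number of comparisons and capped at 1
pairs = list(combinations(range(len(names)), 2))
bonferroni = {}
for i, j in pairs:
    raw_p = stats.ttest_ind(diets[names[i]], diets[names[j]]).pvalue
    bonferroni[(names[i], names[j])] = min(raw_p * len(pairs), 1.0)

for (a, b), p in bonferroni.items():
    i, j = names.index(a), names.index(b)
    print(f"{a} vs {b}: Tukey p = {tukey.pvalue[i, j]:.4f}, Bonferroni p = {p:.4f}")
```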

### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.



In [11]:
import numpy as np
import pandas as pd
import scipy.stats as stats

# Example data: simulates 50 participants per diet (150 in total) for illustration
np.random.seed(0)
data = {'Diet': np.repeat(['A', 'B', 'C'], 50),
        'WeightLoss': np.concatenate([
            np.random.normal(loc=5, scale=1.5, size=50),  # Diet A
            np.random.normal(loc=7, scale=1.5, size=50),  # Diet B
            np.random.normal(loc=6, scale=1.5, size=50)   # Diet C
        ])}

df = pd.DataFrame(data)

# One-way ANOVA
anova_result = stats.f_oneway(
    df[df['Diet'] == 'A']['WeightLoss'],
    df[df['Diet'] == 'B']['WeightLoss'],
    df[df['Diet'] == 'C']['WeightLoss'])

print(f"F-statistic: {anova_result.statistic}, p-value: {anova_result.pvalue}")


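F-statistic: 16.978550056714944, p-value: 2.3228194271141875e-07

Interpretation: The p-value (about 2.3e-07) is far below 0.05, so we reject the null hypothesis that the three diets produce the same mean weight loss. At least one diet differs from the others; a post-hoc test such as Tukey's HSD would identify which pairs differ.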


### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.




In [12]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
np.random.seed(0)
data = {'Program': np.repeat(['A', 'B', 'C'], 30),
        'Experience': np.tile(np.repeat(['Novice', 'Experienced'], 15), 3)}

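# Assumed mean completion time for each Program x Experience cell (illustrative values)
cell_means = {('A', 'Novice'): 20, ('A', 'Experienced'): 18,
              ('B', 'Novice'): 22, ('B', 'Experienced'): 19,
              ('C', 'Novice'): 17, ('C', 'Experienced'): 21}

# Draw 15 noisy observations per cell (90 in total) so that the model has
# within-cell variation from which to estimate the residual variance
times = []
for prog in ['A', 'B', 'C']:
    for exp in ['Novice', 'Experienced']:
        times.extend(np.random.normal(loc=cell_means[(prog, exp)], scale=2, size=15))

data['Time'] = times
df = pd.DataFrame(data)

# Two-way ANOVA
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


The printed table has one row per effect: C(Program) (2 degrees of freedom) and C(Experience) (1 degree of freedom) are the main effects, C(Program):C(Experience) (2 degrees of freedom) is the interaction, and Residual has 84 degrees of freedom. An effect is significant at the 5% level when its PR(>F) value is below 0.05. If the interaction is significant, the main effects should be interpreted with caution, because the effect of the program then depends on experience level.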


### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats

# Example data
np.random.seed(0)
control_group = np.random.normal(loc=75, scale=10, size=50)
experimental_group = np.random.normal(loc=80, scale=10, size=50)

# Two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

print(f"T-statistic: {t_statistic}, p-value: {p_value}")

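# With only two groups, a significant t-test already identifies the pair
# that differs, so no separate post-hoc test is required
if p_value < 0.05:
    print("The two group means differ significantly; no post-hoc test is needed with two groups.")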


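T-statistic: -1.6677351961320235, p-value: 0.09856078338184605

Interpretation: The p-value (about 0.099) is greater than 0.05, so we fail to reject the null hypothesis. This sample does not provide sufficient evidence that the new teaching method changes mean test scores. Post-hoc tests apply to designs with three or more groups; with two groups, the t-test already compares the only pair.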


### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM

# Generate data for each store
np.random.seed(0)
store_data = {
    # Labels are ordered to match the Sales blocks below: 30 days of Store A,
    # then 30 days of Store B, then 30 days of Store C
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Day': np.tile(np.arange(1, 31), 3),
    'Sales': np.concatenate([
        np.random.normal(loc=200, scale=10, size=30),  # Store A
        np.random.normal(loc=210, scale=10, size=30),  # Store B
        np.random.normal(loc=215, scale=10, size=30)   # Store C
    ])
}

# Create DataFrame
df = pd.DataFrame(store_data)

# Repeated Measures ANOVA: each day acts as the subject, measured once per store
aovrm = AnovaRM(df, 'Sales', 'Day', within=['Store'])
res = aovrm.fit()
print(res)

# Post-hoc test if significant
if res.anova_table['Pr > F'].iloc[0] < 0.05:
    print("Follow-up with post-hoc test to determine specific differences between stores.")


The output is a one-row table for Store with the F value, 2 numerator and 58 denominator degrees of freedom, and Pr > F. A Pr > F value below 0.05 indicates that mean daily sales differ between at least two stores, in which case pairwise comparisons are needed to determine which stores differ.
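
If the repeated measures ANOVA is significant, one common follow-up is a set of paired t-tests with a Bonferroni adjustment. The sketch below is a minimal illustration on hypothetical simulated sales (the generator, seed, and store means are assumptions); paired tests are used because every store is observed on the same 30 days.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical sales for 30 days; position i in every array is the same day
sales = {
    'A': rng.normal(loc=200, scale=10, size=30),
    'B': rng.normal(loc=210, scale=10, size=30),
    'C': rng.normal(loc=215, scale=10, size=30),
}

# All unordered pairs of stores: (A, B), (A, C), (B, C)
pairs = list(combinations(sales, 2))

adjusted = {}
for a, b in pairs:
    # Paired test, because both stores are measured on the same days
    raw_p = stats.ttest_rel(sales[a], sales[b]).pvalue
    # Bonferroni adjustment: scale by the number of comparisons, cap at 1
    adjusted[(a, b)] = min(raw_p * len(pairs), 1.0)

for (a, b), p in adjusted.items():
    print(f"Store {a} vs Store {b}: Bonferroni-adjusted p = {p:.4f}")
```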

