In [None]:
# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

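Assumptions:
Independence of observations: each observation is unrelated to the others, both within and across groups.
Normality: the residuals (equivalently, the data within each group) are approximately normally distributed.
Homogeneity of variances: the groups share a common variance.

Violations:
Independence: If observations are dependent (e.g., repeated measurements on the same subject analyzed as if they were independent), standard errors are understated and the Type I error rate is inflated.
Normality: Strongly skewed or heavy-tailed data (e.g., incomes or reaction times) can distort the F-test, especially with small samples; a transformation or a non-parametric test such as Kruskal-Wallis may be needed.
Homogeneity of variances: Unequal variances (heteroscedasticity), particularly when group sizes also differ, can make the F-test too liberal or too conservative; Welch's ANOVA does not assume equal variances.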
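
The normality and equal-variance assumptions can be screened before running the ANOVA. The sketch below is one possible check, assuming SciPy is available; it applies the Shapiro-Wilk test to each group and Levene's test across groups. The three small samples are illustrative (they match the diet data used in Q9), and with so few observations these tests have little power, so a non-significant result is weak evidence at best.

```python
from scipy import stats

# Illustrative samples (same values as the diet example in Q9)
groups = {
    "A": [5, 6, 7, 4, 6],
    "B": [6, 7, 6, 5, 8],
    "C": [8, 9, 7, 8, 9],
}

# Shapiro-Wilk per group: a small p-value suggests a departure from normality
shapiro_p = {}
for name, values in groups.items():
    w_stat, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk, group {name}: W = {w_stat:.3f}, p = {p:.3f}")

# Levene's test across groups: a small p-value suggests unequal variances
lev_stat, lev_p = stats.levene(*groups.values())
print(f"Levene: W = {lev_stat:.3f}, p = {lev_p:.3f}")
```

Independence cannot be tested this way; it has to follow from how the data were collected.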

In [None]:
# Q2. What are the three types of ANOVA, and in what situations would each be used?

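One-way ANOVA: Compares the means of three or more independent groups defined by a single factor (e.g., weight loss under three diets).
Two-way ANOVA: Examines two factors at once, testing each factor's main effect and their interaction (e.g., task time by software program and experience level).
Repeated Measures ANOVA: Used when the same subjects are measured multiple times or under every condition (e.g., the same participants tested before, during, and after a treatment), so the observations are not independent.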

In [None]:
# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

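Variance is partitioned into:
Total sum of squares (SST): Total variability of the observations around the grand mean.
Explained sum of squares (SSE): Variability of the group means around the grand mean, i.e., the part explained by group differences.
Residual sum of squares (SSR): Variability of the observations around their own group means, i.e., the part the model leaves unexplained.

These satisfy SST = SSE + SSR. Understanding this helps assess how much of the total variability is accounted for by the factors under study, and it is the basis of the F-statistic, which is the ratio of the explained mean square to the residual mean square.
Note: some textbooks use SSE for the error (residual) sum of squares; the labels here follow the wording of Q4.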

In [2]:
# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

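import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Small illustrative data set: three groups with three observations each
data = {'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'value': [23, 21, 22, 30, 32, 31, 40, 42, 41]}
df = pd.DataFrame(data)

# Fit a one-way model with group as a categorical factor
model = ols('value ~ C(group)', data=df).fit()

# In the table, sum_sq for C(group) is the explained sum of squares (SSE) and
# sum_sq for Residual is the residual sum of squares (SSR); SST is their sum.
anova_table = sm.stats.anova_lm(model)
print(anova_table)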


           df  sum_sq  mean_sq      F    PR(>F)
C(group)  2.0   542.0    271.0  271.0  0.000001
Residual  6.0     6.0      1.0    NaN       NaN
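
The table reports SSE and SSR but not SST. As a cross-check, the three sums of squares can also be computed directly from their definitions. The following is a minimal sketch using NumPy on the same nine values; the variable names are illustrative.

```python
import numpy as np

# Same data as above, stored per group
groups = {
    "A": np.array([23, 21, 22], dtype=float),
    "B": np.array([30, 32, 31], dtype=float),
    "C": np.array([40, 42, 41], dtype=float),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# SST: squared deviations of every observation from the grand mean
sst = ((all_values - grand_mean) ** 2).sum()

# SSE (explained): squared deviations of group means from the grand mean,
# weighted by group size
sse = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups.values())

# SSR (residual): squared deviations of observations from their group mean
ssr = sum(((v - v.mean()) ** 2).sum() for v in groups.values())

print(f"SST = {sst:.1f}, SSE = {sse:.1f}, SSR = {ssr:.1f}")
```

SSE and SSR agree with the sum_sq column of the table (542.0 and 6.0), and SST equals their sum.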


In [3]:
# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

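import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Illustrative data: one observation per factor1 x factor2 cell (no replication)
df = pd.DataFrame({
    'factor1': ['A', 'A', 'B', 'B'],
    'factor2': ['X', 'Y', 'X', 'Y'],
    'value': [23, 22, 30, 31]
})

# Main effects plus interaction. With a single observation per cell this model
# is saturated: it leaves 0 residual degrees of freedom, so the F-statistics and
# p-values cannot be estimated (0.0 / NaN in the printed table). Testing the
# interaction requires at least two observations per cell.
model = ols('value ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data=df).fit()
anova_table = sm.stats.anova_lm(model)
print(anova_table)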


                        df        sum_sq       mean_sq    F  PR(>F)
C(factor1)             1.0  6.400000e+01  6.400000e+01  0.0     NaN
C(factor2)             1.0  1.262177e-29  1.262177e-29  0.0     NaN
C(factor1):C(factor2)  1.0  1.000000e+00  1.000000e+00  0.0     NaN
Residual               0.0  1.729183e-27           inf  NaN     NaN
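
To see how the main effects and the interaction are actually calculated, the sums of squares can be worked out by hand on a design that has replication. The sketch below assumes a hypothetical balanced 2x2 layout with two observations per cell (the extra values are invented for illustration and are not part of the data above) and uses SciPy only to turn each F-statistic into a p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: cells[i, j] holds the replicates for
# factor1 level i (A, B) and factor2 level j (X, Y)
cells = np.array([
    [[23, 24], [22, 21]],   # factor1 = A
    [[30, 31], [31, 32]],   # factor1 = B
], dtype=float)

a, b, n = cells.shape               # levels of factor1, factor2, replicates
grand = cells.mean()
cell_means = cells.mean(axis=2)
f1_means = cells.mean(axis=(1, 2))  # marginal means of factor1
f2_means = cells.mean(axis=(0, 2))  # marginal means of factor2

# Main-effect sums of squares: marginal means around the grand mean
ss_f1 = b * n * ((f1_means - grand) ** 2).sum()
ss_f2 = a * n * ((f2_means - grand) ** 2).sum()

# Interaction: what is left of each cell mean after removing both main effects
inter = cell_means - f1_means[:, None] - f2_means[None, :] + grand
ss_int = n * (inter ** 2).sum()

# Residual: replicates around their own cell mean
ss_res = ((cells - cell_means[:, :, None]) ** 2).sum()
df_res = a * b * (n - 1)
ms_res = ss_res / df_res

f_f1 = (ss_f1 / (a - 1)) / ms_res
f_f2 = (ss_f2 / (b - 1)) / ms_res
f_int = (ss_int / ((a - 1) * (b - 1))) / ms_res

p_f1 = stats.f.sf(f_f1, a - 1, df_res)
p_f2 = stats.f.sf(f_f2, b - 1, df_res)
p_int = stats.f.sf(f_int, (a - 1) * (b - 1), df_res)

print(f"factor1:     F = {f_f1:.2f}, p = {p_f1:.4f}")
print(f"factor2:     F = {f_f2:.2f}, p = {p_f2:.4f}")
print(f"interaction: F = {f_int:.2f}, p = {p_int:.4f}")
```

Because each cell now has two observations, the residual has 4 degrees of freedom and every F-statistic is defined.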


In [None]:
# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
# What can you conclude about the differences between the groups, and how would you interpret these results?

With an F-statistic of 5.23 and a p-value of 0.02, the null hypothesis that all group means are equal is rejected at the 0.05 significance level (though it would not be rejected at 0.01). At least one group mean differs from the others, but the F-test does not say which one; post-hoc tests are needed to identify the specific groups that differ.

In [None]:
# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

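Methods for handling Missing Data in Repeated Measures ANOVA:
  Listwise deletion: Removes every subject that has a missing value at any time point, since the classical analysis needs complete data for each subject.
  Imputation: Fills missing values using other data (e.g., the condition mean, the last observation carried forward, or multiple imputation).
  Mixed-effects models: An alternative to the classical analysis that uses all available observations without deleting or imputing.

Consequences: Deletion reduces sample size and power, and it biases the results unless the data are missing completely at random. Single-value imputation such as mean imputation understates variability and can introduce bias if not done carefully; multiple imputation reduces that problem.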
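
The trade-off can be seen on a small example. The sketch below assumes a hypothetical wide-format table (one row per subject, one column per time point; all names and values are invented) and compares listwise deletion with mean imputation using pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data in wide format; NaN marks a missing value
wide = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 14.0],
        "t3": [12.0, 14.0, np.nan, 15.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4"], name="subject"),
)

# Listwise deletion: drop every subject with at least one missing value
listwise = wide.dropna()

# Mean imputation: replace each missing value with its column (time point) mean
imputed = wide.fillna(wide.mean())

print("Subjects before / after listwise deletion:", len(wide), "/", len(listwise))
print("Variance of t2, observed values only:", round(wide["t2"].var(), 3))
print("Variance of t2, after mean imputation:", round(imputed["t2"].var(), 3))
```

Deletion halves the sample here, while mean imputation keeps all four subjects but shrinks the variance of the imputed column, which is the source of the bias mentioned above.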

In [None]:
# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Common Post-hoc Tests:
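Tukey’s HSD: Used when you want to compare all pairs of groups after a significant ANOVA; it controls the family-wise error rate across every pairwise comparison.
Bonferroni: Divides the significance level by the number of comparisons. It is more conservative and best suited to a small number of pre-planned pairwise comparisons.
Example: If a one-way ANOVA on weight loss under three diets is significant, a post-hoc test is needed to determine which pairs of diets (A vs. B, A vs. C, B vs. C) actually differ.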
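
Both procedures can be sketched with SciPy. The example below assumes SciPy 1.8 or later (for `scipy.stats.tukey_hsd`) and reuses the small diet samples from Q9; the Bonferroni part is a simple manual correction of pairwise t-test p-values, not a dedicated library routine.

```python
from itertools import combinations
from scipy import stats

# Illustrative samples (same values as the diet example in Q9)
diets = {
    "A": [5, 6, 7, 4, 6],
    "B": [6, 7, 6, 5, 8],
    "C": [8, 9, 7, 8, 9],
}
names = list(diets)
pairs = list(combinations(range(len(names)), 2))

# Tukey's HSD: all pairwise comparisons with family-wise error control
tukey = stats.tukey_hsd(*diets.values())
for i, j in pairs:
    print(f"Tukey {names[i]} vs {names[j]}: p = {tukey.pvalue[i, j]:.4f}")

# Bonferroni: pairwise t-tests, each p-value multiplied by the number of tests
bonferroni = {}
for i, j in pairs:
    _, p = stats.ttest_ind(diets[names[i]], diets[names[j]])
    bonferroni[(names[i], names[j])] = min(1.0, p * len(pairs))
    print(f"Bonferroni {names[i]} vs {names[j]}: adjusted p = "
          f"{bonferroni[(names[i], names[j])]:.4f}")
```

A pair is declared different when its adjusted p-value falls below the chosen significance level.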

In [4]:
# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of 
# the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. 
# Report the F-statistic and p-value, and interpret the results.

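# One-Way ANOVA for Diets: Using Python's scipy.stats.f_oneway function.
# The samples below are a small illustrative subset (5 per diet), not the full 50 participants.
from scipy import stats
diet_A = [5, 6, 7, 4, 6]
diet_B = [6, 7, 6, 5, 8]
diet_C = [8, 9, 7, 8, 9]
F_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)
print(F_statistic, p_value)

# Interpretation: F is about 8.06 and p is about 0.006, which is below 0.05, so the null
# hypothesis of equal mean weight loss is rejected. At least one diet differs from the
# others; a post-hoc test would show which ones.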

8.060606060606064 0.006037863050699394


In [5]:
# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: 
# Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. 
# Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level
# (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

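import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Illustrative data: one observation per software x experience cell,
# not the full sample of 30 employees
df = pd.DataFrame({
    'software': ['A', 'A', 'B', 'B', 'C', 'C'],
    'experience': ['novice', 'experienced', 'novice', 'experienced', 'novice', 'experienced'],
    'time': [30, 25, 40, 35, 50, 45]
})

# With a single observation per cell the interaction model is saturated
# (0 residual degrees of freedom), so the table below shows sums of squares
# but no usable F-statistics or p-values. Several employees per cell are
# needed to test the main effects and the interaction.
model = ols('time ~ C(software) + C(experience) + C(software):C(experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model)
print(anova_table)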


                            df        sum_sq       mean_sq    F  PR(>F)
C(software)                2.0  4.000000e+02  2.000000e+02  0.0     NaN
C(experience)              1.0  3.750000e+01  3.750000e+01  0.0     NaN
C(software):C(experience)  2.0  2.303474e-28  1.151737e-28  0.0     NaN
Residual                   0.0  1.968997e-27           inf  NaN     NaN


In [None]:
# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the 
# control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
# two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, 
# follow up with a post-hoc test to determine which group(s) differ significantly from each other.

# Two-Sample t-test: Using Python's scipy.stats.ttest_ind:

from scipy import stats
control_group = [80, 85, 90, 88, 82]
experimental_group = [85, 90, 95, 93, 91]
t_stat, p_value = stats.ttest_ind(control_group, experimental_group)
print(t_stat, p_value)

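# If significant (p < 0.05), the two group means differ. With only two groups no post-hoc
# test is needed: the t-test is already the single pairwise comparison, and the sign of
# t_stat (or the two group means) shows which group scored higher.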

In [6]:

# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C.
# They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
# significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

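import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Illustrative data: two sales figures per store, not the full 30 days
df = pd.DataFrame({
    'store': ['A', 'B', 'C', 'A', 'B', 'C'],
    'sales': [300, 250, 400, 320, 270, 410]
})

# Note: this fits a one-way (between-groups) ANOVA, which treats the six sales
# figures as independent. A repeated measures analysis would also need a day
# identifier so that the three stores can be compared within each day.
model = ols('sales ~ C(store)', data=df).fit()
anova_table = sm.stats.anova_lm(model)
print(anova_table)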


           df   sum_sq  mean_sq          F    PR(>F)
C(store)  2.0  21700.0  10850.0  72.333333  0.002896
Residual  3.0    450.0    150.0        NaN       NaN
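
A repeated measures version treats each day as the "subject" measured under all three stores. The sketch below uses `AnovaRM` from statsmodels and assumes that the six figures are ordered as day 1 (stores A, B, C) followed by day 2 (stores A, B, C); the `day` column is an added assumption, since the original data carry no day identifier.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Same sales figures, with an assumed day identifier added
df_rm = pd.DataFrame({
    'day':   [1, 1, 1, 2, 2, 2],
    'store': ['A', 'B', 'C', 'A', 'B', 'C'],
    'sales': [300, 250, 400, 320, 270, 410],
})

# Repeated measures ANOVA: store is the within-subject factor, day is the subject
result = AnovaRM(data=df_rm, depvar='sales', subject='day', within=['store']).fit()
rm_table = result.anova_table
print(rm_table)
```

Because day-to-day variation is removed from the error term, the F-statistic differs from the one-way result above. With the full 30 days, a significant result could be followed by paired comparisons between stores with a Bonferroni correction; two days are too few for that to be meaningful.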
