Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Assumptions of ANOVA:

   1. Normality: The data within each group (equivalently, the model residuals) should be approximately normally distributed.

   2. Homogeneity of variance: The variance of the data should be approximately equal across groups.

   3. Independence: The observations should be independent of each other, both within and between groups.
   
Examples of Violations:

    Non-normality: If the data within groups are far from normal, the p-value from the F-test may be inaccurate, especially with small samples. For example, heavily skewed data or data with extreme outliers are likely to violate the assumption of normality.

    Heteroscedasticity: If the group variances differ, the F-test can reject the null hypothesis too often or too rarely, and the problem is worse when the group sizes are also unequal. For example, if the variance in one group is much larger than the variance in another group, the assumption of homogeneity of variance is likely to be violated.

    Non-independence: If the observations are not independent of each other, the ANOVA results may not be accurate. For example, if the same subjects are measured in more than one group, their observations are correlated and the assumption of independence is violated; a repeated measures ANOVA is the appropriate design in that case.

    Outliers: Outliers can distort the normal distribution of data making it difficult to meet the assumption of normality.

    Missing Data: A large amount of missing data reduces the sample size and can unbalance the groups. If the missingness is related to the outcome, the estimated group means may also be biased.

    Violation of normality and homogeneity of variance: If the data is not normally distributed and the variance is not equal across groups, the results of the ANOVA test may be inaccurate. In such cases, a non-parametric test could be used instead: Kruskal-Wallis when there are three or more groups, or the Mann-Whitney U-test when there are only two. If only the equal-variance assumption fails, Welch's ANOVA is another option.
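
The assumption checks described above can be sketched in Python. This is a minimal illustration on simulated data, not part of the original answer: the three groups, their sizes, and the variable names are assumptions, and the tests shown (Shapiro-Wilk for normality, Levene for equal variances, Kruskal-Wallis as a rank-based fallback) are one common choice among several.

```python
import numpy as np
from scipy import stats

# Simulated scores for three groups (assumed data, for illustration only)
rng = np.random.default_rng(42)
groups = {
    'A': rng.normal(loc=10, scale=2, size=30),
    'B': rng.normal(loc=12, scale=2, size=30),
    'C': rng.normal(loc=11, scale=2, size=30),
}

# Normality: Shapiro-Wilk test within each group
normality_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    normality_p[name] = p

# Homogeneity of variance: Levene's test across all groups
_, levene_p = stats.levene(*groups.values())

# Rank-based alternative to one-way ANOVA for three or more groups
_, kruskal_p = stats.kruskal(*groups.values())

for name, p in normality_p.items():
    print(f'Shapiro-Wilk p-value, group {name}: {p:.3f}')
print(f'Levene p-value: {levene_p:.3f}')
print(f'Kruskal-Wallis p-value: {kruskal_p:.3f}')
```

A small p-value from Shapiro-Wilk or Levene (for example, below 0.05) suggests the corresponding assumption may not hold. Independence cannot be checked this way; it depends on how the data were collected.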

Q2. What are the three types of ANOVA, and in what situations would each be used?

The three types of ANOVA are:

    One-way ANOVA: This type of ANOVA is used when there is only one factor being tested, and it has three or more levels. For example, a one-way ANOVA could be used to test whether there is a difference in average test scores between students in three different schools.

    Two-way ANOVA: This type of ANOVA is used when there are two factors being tested, and each factor has two or more levels. It tests the main effect of each factor and the interaction between them. For example, a two-way ANOVA could be used to test whether average test scores differ by gender (male vs. female), by school (three different schools), and whether the gender difference depends on the school.

    N-way ANOVA: This type of ANOVA is used when there are three or more factors being tested, and each factor has two or more levels. For example, an N-way ANOVA could be used to test whether there is a difference in average test scores between male and female students in three different schools, with different teachers, using different textbooks, and at different times of day.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the process of breaking down the total variability in a dataset into separate sources of variability based on the factors being studied. In a one-way ANOVA, the total variation in a response variable (SST) is partitioned into two components: the variation between the groups being compared (the "between-group" variation, SSB) and the variation within each group (the "within-group" variation, SSW), so that SST = SSB + SSW.

The partitioning of variance is important because it allows researchers to determine the relative contributions of different sources of variation to the total variation in the response variable. The F-statistic is built directly from it, as the ratio of the between-group mean square to the within-group mean square, so the partition is what makes it possible to test hypotheses about differences between group means and to estimate effect sizes. Additionally, a large within-group component can indicate that important sources of variation are missing from the model and may need to be controlled for in the analysis.
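
The partition can be verified by hand. The sketch below uses NumPy on three small groups of scores (the same nine values that appear in the Q4 example); the variable names are illustrative.

```python
import numpy as np

# Three groups of three scores each
groups = {
    'A': np.array([7, 8, 6]),
    'B': np.array([5, 4, 6]),
    'C': np.array([8, 9, 7]),
}
all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()

# Total variation: every score against the grand mean
ss_total = ((all_scores - grand_mean) ** 2).sum()

# Between-group variation: each group mean against the grand mean, weighted by group size
ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups.values())

# Within-group variation: every score against its own group mean
ss_within = sum(((v - v.mean()) ** 2).sum() for v in groups.values())

# F is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)

print(f'SST = {ss_total:.2f}')               # SST = 20.00
print(f'SSB = {ss_between:.2f}')             # SSB = 14.00
print(f'SSW = {ss_within:.2f}')              # SSW = 6.00
print(f'F   = {f_statistic:.2f}')            # F   = 7.00
```

Here 14 of the 20 units of total variation are attributable to differences between the group means, and SSB + SSW reproduces SST exactly.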


Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [3]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a data frame with the data
data = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                     'score': [7, 8, 6, 5, 4, 6, 8, 9, 7]})

# Fit the ANOVA model
model = ols('score ~ group', data=data).fit()

# Total sum of squares (SST): squared deviations of each score from the grand mean
grand_mean = data['score'].mean()
ss_total = ((data['score'] - grand_mean) ** 2).sum()

# Explained sum of squares (SSE): squared deviations of the fitted group means from the grand mean
ss_explained = ((model.fittedvalues - grand_mean) ** 2).sum()

# Residual sum of squares (SSR): squared deviations of each score from its fitted group mean
ss_residual = ((data['score'] - model.fittedvalues) ** 2).sum()

# Sums of squares are not divided by degrees of freedom (that would give mean squares)
print(f'SST = {ss_total:.2f}')
print(f'SSE = {ss_explained:.2f}')
print(f'SSR = {ss_residual:.2f}')


SST = 20.00
SSE = 14.00
SSR = 6.00


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [6]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

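# Load data into a Pandas dataframe ('data.csv' and the column names below are placeholders)
data = pd.read_csv('data.csv')

# Fit the two-way ANOVA model; C() marks each factor as categorical
model = ols('dependent_variable ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data=data).fit()

# A single ANOVA table (Type II sums of squares) holds both kinds of effect
anova_table = sm.stats.anova_lm(model, typ=2)

# Main effects are the single-factor rows; the interaction is the factor1:factor2 row
main_effects = anova_table.loc[['C(factor1)', 'C(factor2)']]
interaction_effect = anova_table.loc[['C(factor1):C(factor2)']]

print(main_effects)
print(interaction_effect)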


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

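A one-way ANOVA with an F-statistic of 5.23 and a p-value of 0.02 indicates a statistically significant difference between the groups at the 0.05 significance level: because 0.02 < 0.05, we reject the null hypothesis that all group means are equal.

Specifically, there is evidence that at least one of the group means is different from the others. The ANOVA alone does not show which groups differ or by how much; a post-hoc test is needed for that.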
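
The link between an F-statistic and its p-value can be sketched with SciPy. The question does not give the degrees of freedom, so the values below (3 groups, 30 observations) are purely hypothetical, and the resulting p-value will not necessarily equal the 0.02 quoted in the question.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2    # hypothetical: 3 groups - 1
df_within = 27    # hypothetical: 30 observations - 3 groups
alpha = 0.05

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the F that must be exceeded to reject at the chosen alpha
critical_f = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = f_statistic > critical_f
print(f'p-value    = {p_value:.4f}')
print(f'critical F = {critical_f:.3f}')
print('Reject H0' if reject_null else 'Fail to reject H0')
```

Comparing the p-value with alpha and comparing F with the critical value are two views of the same decision, so they always agree.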

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

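Handling missing data in repeated measures ANOVA is challenging because each subject is measured several times, and a standard repeated measures ANOVA needs a complete set of measurements for every subject. Common methods, and the consequences of each, are:

    Listwise deletion: Every subject with at least one missing measurement is dropped. This is simple, but it reduces the sample size and statistical power, and it can bias the results unless the data are missing completely at random.

    Pairwise deletion: Each comparison uses all subjects who have data for the time points involved. More data is retained, but different comparisons rest on different subsets of subjects, which can make the results inconsistent with each other.

    Imputation: Missing values are replaced with estimates. Simple approaches such as mean imputation keep every subject but understate the variability in the data, which can make differences look more significant than they are; multiple imputation is designed to account for that extra uncertainty.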
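
The effect of two of these methods can be seen on a small example. This is a minimal pandas sketch on made-up data in wide format (one row per subject, one column per time point); the values and column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up repeated measurements: 5 subjects, 3 time points, 2 missing values
wide = pd.DataFrame({
    'subject': [1, 2, 3, 4, 5],
    't1': [5.0, 6.0, 7.0, 5.5, 6.5],
    't2': [5.5, np.nan, 7.5, 6.0, 7.0],
    't3': [6.0, 6.5, np.nan, 6.5, 7.5],
}).set_index('subject')

# Listwise deletion: drop every subject with any missing time point
listwise = wide.dropna()

# Mean imputation: fill each gap with the mean of that time point
imputed = wide.fillna(wide.mean())

print('Subjects before deletion:', len(wide))
print('Subjects after listwise deletion:', len(listwise))
print('Variance of t2, observed values only:', round(wide['t2'].var(), 3))
print('Variance of t2, after mean imputation:', round(imputed['t2'].var(), 3))
```

Listwise deletion loses two of the five subjects, while mean imputation keeps all five but shrinks the variance of the imputed column, which is the source of its overly optimistic standard errors.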

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

After conducting an ANOVA, post-hoc tests can be used to determine which specific groups are significantly different from each other. Some common post-hoc tests include Tukey's Honestly Significant Difference (HSD), Scheffe's test, Bonferroni correction, and Fisher's Least Significant Difference (LSD).

Tukey's HSD is a commonly used post-hoc test that compares all possible pairs of group means while controlling the family-wise error rate (FWER). It was derived for equal sample sizes; the Tukey-Kramer version handles unequal ones. It is recommended when there are many groups and the primary goal is to compare all possible pairs of groups.

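Scheffe's test is a more conservative post-hoc test that controls the FWER for all possible contrasts, including complex ones such as one group against the average of two others. It can be used with unequal sample sizes, but it still assumes normality and equal variances; when the variances are unequal, the Games-Howell test is a common alternative.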

Bonferroni correction is a method that adjusts the significance level of each pairwise comparison to control for multiple comparisons. It is a very conservative approach that reduces the likelihood of type I errors but increases the likelihood of type II errors.

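Fisher's LSD is a less conservative post-hoc test that runs unadjusted pairwise t-tests, and only after the overall F-test is significant. It does not control the FWER when there are more than three groups, so it is best reserved for comparisons among three groups with approximately equal variances.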

An example situation where a post-hoc test might be necessary is when a one-way ANOVA shows that there is a significant difference between groups but does not identify which specific groups are different from each other. In this case, a post-hoc test can be used to determine which groups are significantly different from each other.
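
As an illustration of that situation, the sketch below runs Tukey's HSD on three small groups using `pairwise_tukeyhsd` from statsmodels. The nine scores are the same values used in the Q4 example; treating them as a post-hoc scenario is an assumption made here for demonstration.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Three groups of three scores each
data = pd.DataFrame({
    'group': ['A'] * 3 + ['B'] * 3 + ['C'] * 3,
    'score': [7, 8, 6, 5, 4, 6, 8, 9, 7],
})

# Compare every pair of groups while holding the family-wise error rate at 5%
tukey = pairwise_tukeyhsd(endog=data['score'], groups=data['group'], alpha=0.05)

# One row per pair: mean difference, adjusted p-value, confidence interval, decision
print(tukey.summary())
```

Each row of the summary answers the question the ANOVA leaves open: whether that particular pair of groups differs once the multiple comparisons are accounted for.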

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [8]:
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# generate data (simulated; assumes 50 participants per diet, 150 in total)
np.random.seed(123)  # for reproducibility
diet_a = np.random.normal(loc=10, scale=2, size=50)
diet_b = np.random.normal(loc=8, scale=2, size=50)
diet_c = np.random.normal(loc=6, scale=2, size=50)

# combine data into a single DataFrame
data = pd.DataFrame({
    'weight_loss': np.concatenate([diet_a, diet_b, diet_c]),
    'diet': ['A'] * 50 + ['B'] * 50 + ['C'] * 50
})

data_A = data[data['diet'] == 'A']['weight_loss']
data_B = data[data['diet'] == 'B']['weight_loss']
data_C = data[data['diet'] == 'C']['weight_loss']

# conduct one-way ANOVA
f_statistic, p_value = f_oneway(data_A, data_B, data_C)
# print results
print('F-statistic:', f_statistic)
print('p-value:', p_value)
if p_value < 0.05:
    print('At least one diet has a significantly different mean weight loss.')
else:
    print('There is no significant difference in the mean weight loss of the three diets.')


F-statistic: 37.03885406173804
p-value: 9.413909285242866e-14
At least one diet has a significantly different mean weight loss.


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# generate random data (simulated: 90 employees, 15 per software/experience cell, no built-in effects)
np.random.seed(123)
data = pd.DataFrame({
    'software': np.repeat(['A', 'B', 'C'], 30),
    'experience': np.tile(['novice', 'experienced'], 45),
    'time': np.random.normal(loc=20, scale=5, size=90)
})

# fit the two-way ANOVA model
model = ols('time ~ C(software) + C(experience) + C(software):C(experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=2)

# print the ANOVA table
print(table)

                                sum_sq    df         F    PR(>F)
C(software)                  18.292844   2.0  0.264784  0.768009
C(experience)                19.341996   1.0  0.559941  0.456374
C(software):C(experience)    20.839120   2.0  0.301641  0.740401
Residual                   2901.603591  84.0       NaN       NaN

Interpretation: none of the p-values is below 0.05 (software: F = 0.26, p = 0.77; experience: F = 0.56, p = 0.46; interaction: F = 0.30, p = 0.74), so there is no evidence of a main effect of software or of experience level, and no evidence of an interaction between them. This is expected here, because all of the simulated times were drawn from the same normal distribution.


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [8]:
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Build a small illustrative data set (the exercise describes 100 students; only six scores are used here)
data = pd.DataFrame({
    'Group': ['Control', 'Experimental', 'Control', 'Experimental', 'Control', 'Experimental'],
    'Score': [75, 80, 65, 90, 70, 85]
})

# Separate the scores into two groups
control_scores = data[data['Group'] == 'Control']['Score']
experimental_scores = data[data['Group'] == 'Experimental']['Score']

# Conduct a two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

# Print the results
print("T-statistic:", t_statistic)
print("P-value:", p_value)

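# Conduct a post-hoc test if the result is significant.
# With only two groups the t-test already shows which groups differ, so Tukey's HSD
# simply reproduces the same p-value; it is run here because the exercise asks for it.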
if p_value < 0.05:
    posthoc = pairwise_tukeyhsd(data['Score'], data['Group'])
    print(posthoc)


T-statistic: -3.6742346141747677
P-value: 0.021311641128756713
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
Control Experimental     15.0 0.0213 3.6652 26.3348   True
----------------------------------------------------------


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a post-
hoc test to determine which store(s) differ significantly from each other.

In [9]:
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM

# Generate example data
np.random.seed(123)
store_a = np.random.normal(loc=100, scale=20, size=30)
store_b = np.random.normal(loc=110, scale=15, size=30)
store_c = np.random.normal(loc=90, scale=25, size=30)
sales_data = pd.DataFrame({'store_a': store_a, 'store_b': store_b, 'store_c': store_c})

# Convert data to long format
sales_data = pd.melt(sales_data.reset_index(), id_vars=['index'], value_vars=['store_a', 'store_b', 'store_c'])
sales_data.columns = ['day', 'store', 'sales']

# Conduct repeated measures ANOVA
rm_anova = AnovaRM(data=sales_data, depvar='sales', subject='day', within=['store'])
rm_results = rm_anova.fit()

print(rm_results)


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
store  7.8521 2.0000 58.0000 0.0010
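
The repeated measures ANOVA reports F(2, 58) = 7.85 with p = 0.001, so mean sales differ significantly between the stores and the post-hoc step requested by the exercise is warranted. One possible follow-up, sketched here as an assumption rather than as the only valid method, is a set of pairwise paired t-tests with a Bonferroni correction. The data are regenerated with the same seed so that the block runs on its own; the 0.05 threshold and the variable names are illustrative.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Regenerate the example sales data (same seed and order as the cell above)
np.random.seed(123)
wide = pd.DataFrame({
    'store_a': np.random.normal(loc=100, scale=20, size=30),
    'store_b': np.random.normal(loc=110, scale=15, size=30),
    'store_c': np.random.normal(loc=90, scale=25, size=30),
})

# Paired t-test for every pair of stores; sales are paired by day
pairs = list(combinations(wide.columns, 2))
rows = []
for first, second in pairs:
    t_stat, p_raw = stats.ttest_rel(wide[first], wide[second])
    # Bonferroni: multiply each p-value by the number of comparisons, capped at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    rows.append({
        'pair': f'{first} vs {second}',
        't': t_stat,
        'p_raw': p_raw,
        'p_bonferroni': p_adj,
        'reject': p_adj < 0.05,
    })

posthoc = pd.DataFrame(rows)
print(posthoc)
```

Pairs with `reject` equal to True are the stores whose mean daily sales differ after correcting for the three comparisons.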

