#### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Ans: There are three primary assumptions in ANOVA:
1. The responses for each factor level have a normal population distribution.
2. These distributions have the same variance.
3. The data are independent.

Examples of violations that could impact validity:
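1. Non-normality: if the data within a group are heavily skewed, the group mean may be a poor measure of central tendency, and the p-value from the F-test can be unreliable, especially with small samples.
2. Unequal variances: if one group has a much larger variance than the others, the pooled error variance is distorted and the test can lead to erroneous conclusions, particularly when group sizes are unequal.
3. Non-independence: if the observations are related (for example, repeated measurements on the same subject), a standard ANOVA understates the true uncertainty and a repeated measures design is more appropriate.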
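
The first two assumptions can be checked before running an ANOVA. The following is a minimal sketch, assuming hypothetical simulated groups (`group_a`, `group_b`, `group_c` are not part of the assignment data) and using the Shapiro-Wilk and Levene tests from `scipy.stats`. Independence is a property of the study design and cannot be verified by a test like these.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn with a fixed seed for reproducibility
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=52, scale=5, size=30)
group_c = rng.exponential(scale=5, size=30)  # deliberately skewed

groups = {"A": group_a, "B": group_b, "C": group_c}

# Shapiro-Wilk: a small p-value suggests the group is not normally distributed
normality_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    normality_p[name] = p

# Levene: a small p-value suggests the group variances are not equal
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)

for name, p in normality_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.4f}")
print(f"Levene p-value: {levene_p:.4f}")
```

A small p-value from either test is a warning sign rather than a verdict; with large samples, ANOVA is fairly robust to mild departures from normality.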

#### Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans: 
1. One-Way ANOVA:
A one-way ANOVA has just one independent variable. Eg: Differences in IQ can be assessed by Country, and Country can have 2, 20, or more different categories to compare.

2. Two-Way ANOVA:
A two-way ANOVA uses two independent variables and can also test whether they interact. Extending the example above: a two-way ANOVA can examine differences in IQ scores by Country and Gender.

3. N-Way ANOVA:
A researcher can also use more than two independent variables, and this is an n-way ANOVA (with n being the number of independent variables you have). Eg: potential differences in IQ scores can be examined by Country, Gender, Age group, Ethnicity, etc, simultaneously.

#### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans: Partitioning of variance in ANOVA: The process of decomposing the total variation in the response variable into different sources of variation, such as variation due to the treatments, variation due to random error, and so on.

Understanding the partitioning of variance is important because the F-statistic is built from it: the test compares the variation between groups with the variation within groups, and the same components are used to test hypotheses and make inferences about the population.
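
For a one-way design, the decomposition can be written out explicitly. The notation here is an assumption introduced for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ the $j$-th observation in group $i$, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the overall mean.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
= \underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{SS_{\text{between}}}
+ \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{SS_{\text{within}}},
\qquad
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
```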

#### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import pandas as pd

# create sample data as a pandas dataframe
data = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [10, 12, 14, 15, 16, 18, 20, 22, 24]
})

overall_mean = data['value'].mean()
# mean of each observation's own group, aligned row by row with the data
group_means = data.groupby('group')['value'].transform('mean')

# total: every observation vs. the overall mean
sst = ((data['value'] - overall_mean) ** 2).sum()
# explained (between groups): group means vs. the overall mean
sse = ((group_means - overall_mean) ** 2).sum()
# residual (within groups): every observation vs. its own group mean
ssr = ((data['value'] - group_means) ** 2).sum()

print(f"SST: {sst:.4f}")
print(f"SSE: {sse:.4f}")
print(f"SSR: {ssr:.4f}")

SST: 171.5556
SSE: 150.8889
SSR: 20.6667


#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [13]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate dummy data
np.random.seed(1)
n = 50
df = pd.DataFrame({
    'Sales': np.random.normal(loc=50, scale=10, size=n*6),
    'Store': np.repeat(['A', 'B', 'C'], n*2),
    'Time': np.tile(['Before', 'After'], n*3),
    'Gender': np.tile(np.repeat(['Male', 'Female'], n), 3)
})

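# Full factorial model: Store, Time, and Gender main effects plus every two-way and three-way interaction
model = ols('Sales ~ C(Store) + C(Time) + C(Store):C(Time) + C(Gender) + C(Store):C(Gender) + C(Time):C(Gender) + C(Store):C(Time):C(Gender)', data=df).fit()
# Type II sums of squares: each term is tested after the other terms that do not contain it
anova_table = sm.stats.anova_lm(model, typ=2)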


print(anova_table)

                                  sum_sq     df         F    PR(>F)
C(Store)                      105.069864    2.0  0.590967  0.554462
C(Time)                        48.194673    1.0  0.542143  0.462145
C(Gender)                     145.867553    1.0  1.640868  0.201237
C(Store):C(Time)              419.623246    2.0  2.360176  0.096227
C(Store):C(Gender)              4.887141    2.0  0.027488  0.972889
C(Time):C(Gender)               1.090264    1.0  0.012264  0.911896
C(Store):C(Time):C(Gender)    398.248576    2.0  2.239954  0.108315
Residual                    25602.221131  288.0       NaN       NaN


#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

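Ans: Assuming a significance level of 0.05, there is evidence of a statistically significant difference between the groups. The F-statistic of 5.23 means that the between-group mean square is 5.23 times the within-group mean square, so the group means vary more than would be expected from the variation within the groups alone.

The p-value of 0.02 means that, if there were no true difference between the group means, the probability of obtaining an F-statistic at least as large as 5.23 would be only 2%.

Because 0.02 < 0.05, we reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from the others. The ANOVA alone does not show which groups differ; a post-hoc test is needed for that.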
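
The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not state the degrees of freedom, so the design below (3 groups, 30 observations in total) is a hypothetical assumption and the resulting p-value will not necessarily equal 0.02.

```python
from scipy import stats

f_statistic = 5.23
# Hypothetical design: 3 groups and 30 observations in total
k, n_total = 3, 30
df_between = k - 1        # numerator degrees of freedom
df_within = n_total - k   # denominator degrees of freedom

# Survival function = P(F >= f_statistic) under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)
# Smallest F that would be significant at the 5% level for these df
f_critical = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha=0.05: {f_critical:.2f}")
```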

#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Ans:
1. Listwise deletion: Any participant with missing data on any measurement is dropped from the analysis. This is a simple and commonly used approach, but it reduces the sample size and power, and it can bias the results unless the data are missing completely at random (MCAR).

2. Pairwise deletion: All available data are used for each comparison, even if some data are missing. This retains more data than listwise deletion, but different comparisons then rest on different subsets of participants, and the estimates can be biased if the data are not MCAR.

3. Imputation: Missing values are replaced with estimates based on other available data. Mean imputation is simple but shrinks the variance and can distort the F-test; regression imputation uses the relationships between variables but still understates uncertainty; multiple imputation creates several completed datasets and pools the results, so it reflects the uncertainty caused by the missing values.
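
The trade-off between deleting and imputing can be seen on a small example. This is a sketch on hypothetical wide-format data (the `scores` dataframe and its values are invented for illustration), comparing listwise deletion with mean imputation in pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 14.0],
    "t2": [11.0, np.nan, 13.0, 15.0],
    "t3": [12.0, 14.0, np.nan, 16.0],
}, index=["s1", "s2", "s3", "s4"])

# Listwise deletion: keep only subjects with complete data
listwise = scores.dropna()

# Mean imputation: replace each missing value with its column mean
mean_imputed = scores.fillna(scores.mean())

print("Subjects kept after listwise deletion:", list(listwise.index))
print("Variance of t2 (observed values):", scores["t2"].var())
print("Variance of t2 (mean-imputed):", mean_imputed["t2"].var())
```

Listwise deletion halves the sample here, while mean imputation keeps every subject but lowers the variance of `t2`, which is the mechanism behind the distorted F-test.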

#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Ans:
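1. Tukey's HSD test compares all possible pairs of group means while controlling the family-wise error rate. It assumes homogeneous variances and is most accurate with equal sample sizes; the Tukey-Kramer variant handles unequal sample sizes. Use it when every pairwise comparison is of interest.

2. Bonferroni correction is not a test in itself but an adjustment for multiple comparisons: each comparison is tested at α/m, where m is the number of comparisons. It is a conservative approach that reduces the chance of making a type I error (false positive) at the cost of power, and it suits a small number of pre-planned comparisons.

3. Scheffé's method is more conservative still. It controls the error rate for all possible contrasts, not only pairwise ones, and can be used when sample sizes are unequal, so it fits exploratory comparisons of combinations of groups. It still assumes homogeneous variances; when the variances are heterogeneous, a test such as Games-Howell is a more suitable choice.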

Eg: Suppose a researcher is interested in comparing the mean test scores of three different teaching methods (A, B, and C) in a study. The researcher conducts an ANOVA and finds a significant difference between the groups. A post-hoc test, such as Tukey's HSD, can be used to determine which groups have significantly different means from each other. If the test reveals that Method A has a significantly higher mean than Methods B and C, while Methods B and C do not differ significantly from each other, the researcher can conclude that Method A is the most effective teaching method.
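
The teaching-method scenario can be sketched with Bonferroni-adjusted pairwise t-tests. The scores below are hypothetical values invented for illustration, and the manual adjustment (multiply each p-value by the number of comparisons, cap at 1) is one simple way to apply the correction, not the only one.

```python
from itertools import combinations
from scipy import stats

# Hypothetical test scores for the three teaching methods
scores = {
    "A": [85, 88, 90, 92, 87, 91],
    "B": [78, 80, 75, 79, 82, 77],
    "C": [79, 81, 76, 80, 78, 83],
}

pairs = list(combinations(scores, 2))
results = {}
for g1, g2 in pairs:
    _, p_raw = stats.ttest_ind(scores[g1], scores[g2])
    # Bonferroni: multiply each p-value by the number of comparisons, cap at 1
    results[(g1, g2)] = min(p_raw * len(pairs), 1.0)

for (g1, g2), p_adj in results.items():
    print(f"{g1} vs {g2}: adjusted p = {p_adj:.4f}")
```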

#### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [2]:
import pandas as pd
import scipy.stats as stats

# Create a dictionary of the data
data = {'diet_A': [2, 5, 1, 3, 6, 7, 2, 4, 3, 5, 1, 2, 3, 4, 1, 5, 6, 3, 2, 4,
                   1, 3, 5, 7, 8],
        'diet_B': [3, 6, 7, 8, 2, 1, 3, 5, 2, 1, 4, 6, 7, 8, 2, 1, 3, 6, 7, 8,
                   4, 5, 6, 7, 8],
        'diet_C': [5, 4, 6, 8, 9, 10, 4, 3, 5, 4, 6, 7, 9, 10, 5, 4, 6, 7, 8,
                   5, 4, 6, 7, 9, 8]}
df = pd.DataFrame(data)


f_statistic, p_value = stats.f_oneway(df['diet_A'], df['diet_B'], df['diet_C'])

print('F-Statistic:', f_statistic)
print('P-Value:', p_value)

F-Statistic: 9.090825688073394
P-Value: 0.0003017972362803229

Ans: With F ≈ 9.09 and p ≈ 0.0003, which is below 0.05, we reject the null hypothesis that the three diets produce the same mean weight loss. At least one diet's mean differs from the others; a post-hoc test such as Tukey's HSD would be needed to identify which diets differ.


#### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a dataframe with the data
data = {'time': [13.2, 14.1, 15.2, 11.9, 12.5, 13.8, 16.1, 14.9, 15.5, 17.2, 12.3, 13.1, 14.3, 10.9, 11.5, 13.8, 15.1, 16.5, 18.2, 13.9, 15.5, 16.8, 14.5, 15.1, 16.3, 11.9, 12.5, 13.8, 15.1, 16.5, 17.2],
        'program': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A', 'B'],
        'experience': ['novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced']
       }
df = pd.DataFrame(data)

model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

                             sum_sq    df         F    PR(>F)
C(program)                11.585359   2.0  1.873805  0.174473
C(experience)              9.618560   1.0  3.111393  0.089965
C(program):C(experience)   5.662162   2.0  0.915793  0.413210
Residual                  77.285000  25.0       NaN       NaN

Ans: At the 5% level, none of the effects is statistically significant: program (F ≈ 1.87, p ≈ 0.174), experience (F ≈ 3.11, p ≈ 0.090), and the program × experience interaction (F ≈ 0.92, p ≈ 0.413). The data therefore do not provide evidence that completion time depends on the software program, on experience level, or on the combination of the two.


#### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [5]:
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

control = [75, 80, 85, 70, 90, 82, 78, 88, 72, 79]
experimental = [80, 85, 90, 77, 92, 86, 81, 91, 76, 83]

t_stat, p_val = stats.ttest_ind(control, experimental)

print("Two-Sample t-test Results:")
print("t-value: {:.2f}".format(t_stat))
print("p-value: {:.4f}".format(p_val))
if p_val < 0.05:
    print("The results are statistically significant at the 5% level.")
else:
    print("The results are not statistically significant at the 5% level.")

data = pd.DataFrame({'score': control + experimental, 'group': ['control'] * 10 + ['experimental'] * 10})
tukey_results = pairwise_tukeyhsd(data['score'], data['group'], alpha=0.05)

print("\nPost-hoc Test (Tukey HSD):")
print(tukey_results)

Two-Sample t-test Results:
t-value: -1.53
p-value: 0.1438
The results are not statistically significant at the 5% level.

Post-hoc Test (Tukey HSD):
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper reject
--------------------------------------------------------
control experimental      4.2 0.1438 -1.574 9.974  False
--------------------------------------------------------

Ans: The t-test gives t ≈ -1.53 and p ≈ 0.144, so the 4.2-point difference in mean scores between the experimental and control groups is not statistically significant at the 5% level. Because the result is not significant and there are only two groups, no post-hoc test is needed; with two groups, Tukey's HSD simply reproduces the t-test p-value (0.1438).


#### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [6]:
import pandas as pd


df = pd.DataFrame({
    'Day': range(1, 31),
    'Store A': [10, 15, 12, 9, 8, 11, 13, 10, 16, 11, 14, 13, 12, 9, 11, 10, 14, 12, 11, 13, 12, 10, 11, 13, 12, 9, 11, 10, 12, 13],
    'Store B': [8, 9, 10, 11, 7, 8, 12, 10, 9, 11, 8, 12, 7, 10, 9, 8, 11, 12, 10, 11, 9, 8, 10, 9, 11, 12, 8, 10, 11, 9],
    'Store C': [5, 7, 6, 8, 9, 6, 5, 7, 8, 6, 5, 7, 9, 6, 8, 7, 5, 6, 8, 7, 5, 6, 7, 9, 8, 7, 6, 8, 5, 6]
})

df_long = pd.melt(df, id_vars=['Day'], value_vars=['Store A', 'Store B', 'Store C'], var_name='Store', value_name='Sales')

In [7]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

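# OLS approximation: Day enters as a numeric covariate here, not as a repeated-measures blocking factor
model = ols('Sales ~ Store + Day + Store:Day', data=df_long).fit()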

sm.stats.anova_lm(model, typ=2)

               sum_sq    df          F        PR(>F)
Store      355.755556   2.0  69.832944  1.369431e-18
Day          0.560957   1.0   0.220226  6.400838e-01
Store:Day    1.375083   2.0   0.269922  7.640989e-01
Residual   213.963960  84.0        NaN           NaN
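
A repeated measures ANOVA treats each day as a block measured under every store, so day-to-day variation is removed from the error term. The following is a minimal sketch of that calculation with NumPy, assuming a small hypothetical table (`sales`, 4 days by 3 stores) rather than the 30-day data above. statsmodels also provides an `AnovaRM` class in `statsmodels.stats.anova` for the same purpose.

```python
import numpy as np
from scipy import stats

# Hypothetical sales: rows are days (the repeated "subjects"), columns are stores
sales = np.array([
    [10.0, 8.0, 5.0],
    [12.0, 9.0, 7.0],
    [11.0, 10.0, 6.0],
    [13.0, 9.0, 8.0],
])
n_days, n_stores = sales.shape
grand_mean = sales.mean()

# Partition the total sum of squares into store, day, and error components
ss_total = ((sales - grand_mean) ** 2).sum()
ss_store = n_days * ((sales.mean(axis=0) - grand_mean) ** 2).sum()
ss_day = n_stores * ((sales.mean(axis=1) - grand_mean) ** 2).sum()
ss_error = ss_total - ss_store - ss_day  # store-by-day residual

df_store = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)

# Day-to-day variation is removed from the error term before forming F
f_stat = (ss_store / df_store) / (ss_error / df_error)
p_value = stats.f.sf(f_stat, df_store, df_error)
print(f"F({df_store}, {df_error}) = {f_stat:.2f}, p = {p_value:.5f}")
```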


In [8]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

posthoc = pairwise_tukeyhsd(df_long['Sales'], df_long['Store'])

print(posthoc.summary())

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
 group1  group2 meandiff p-adj  lower   upper  reject
-----------------------------------------------------
Store A Store B     -1.9   0.0 -2.8699 -0.9301   True
Store A Store C  -4.8333   0.0 -5.8032 -3.8635   True
Store B Store C  -2.9333   0.0 -3.9032 -1.9635   True
-----------------------------------------------------

Ans: The Store effect is highly significant (F ≈ 69.83, p ≈ 1.4e-18), while Day and the Store × Day interaction are not. Tukey's HSD rejects equality for all three pairs of stores (the adjusted p-values are displayed as 0.0): Store A has the highest mean daily sales, Store B is lower by about 1.9, and Store C is lower than Store A by about 4.83.
