## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.



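To use the ANOVA test, we make the following assumptions:

The residuals are normally distributed. Example of a violation: a strongly skewed outcome such as income, where a few extreme values dominate.

Group populations have a common variance. Example of a violation: one group's scores are far more spread out than the others', which distorts the F-test, especially when group sizes are unequal.

All samples are drawn independently of each other. Example of a violation: the same participants appear in more than one group.

Within each sample, the observations are sampled randomly and independently of each other. Example of a violation: students from the same classroom are treated as independent observations even though they share a teacher.

Factor effects are additive. Example of a violation: the effect of one factor multiplies, rather than adds to, the effect of another.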
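
The first two assumptions can be checked empirically. The following is a minimal sketch, assuming three hypothetical groups generated with NumPy (the data and variable names are illustrative and do not come from a real study). It applies the Shapiro-Wilk test to the residuals and Levene's test to the group variances. The third group is deliberately given a much larger spread, as an example of violating the common-variance assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 observations each.
# group_c is given a much larger spread on purpose, to illustrate
# a violation of the common-variance assumption.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=52, scale=5, size=30)
group_c = rng.normal(loc=54, scale=20, size=30)
groups = [group_a, group_b, group_c]

# Residuals in one-way ANOVA are deviations from each group's own mean.
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: a small p-value suggests the residuals are not normal.
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: a small p-value suggests the group variances differ.
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
```

The independence assumptions cannot be verified by a test of this kind; they have to be ensured by the study design.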

## Q2. What are the three types of ANOVA, and in what situations would each be used?


Types: 1. One-Way ANOVA, 2. Two-Way ANOVA, 3. N-Way ANOVA

**One-Way ANOVA**:
One-Way ANOVA is used when we are looking at the impact of a single factor on a particular outcome. For instance, if we want to explore how IQ scores vary by country, country is the one factor and One-Way ANOVA is the appropriate test.

**Two-Way ANOVA**:
Moving a step further, Two-Way ANOVA, a form of factorial ANOVA, allows us to examine the effect of two different factors on an outcome simultaneously, including whether the two factors interact. Building on our previous example, we could look at how both country and gender influence IQ scores.

**N-Way ANOVA**:

When researchers have more than two factors to consider, they turn to N-Way ANOVA, where “n” represents the number of independent variables in the analysis. This could mean examining how IQ scores are influenced by a combination of factors like country, gender, age group, and ethnicity all at once. N-Way ANOVA shows how these factors interact with each other and what their combined effect on the dependent variable is.
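
As a minimal sketch of the simplest case, a One-Way ANOVA can be run with `scipy.stats.f_oneway`. The three groups below are hypothetical scores invented for illustration. Designs with two or more factors are usually fitted as a model instead, for example with the statsmodels formula interface.

```python
from scipy import stats

# Hypothetical scores for three groups (one factor with three levels).
group_1 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
group_2 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
group_3 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

# f_oneway tests the null hypothesis that all group means are equal.
f_stat, p_value = stats.f_oneway(group_1, group_2, group_3)

print("F-statistic:", f_stat)
print("p-value:", p_value)
```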

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?


**What is the partitioning of variance in ANOVA**:

Partitioning, or splitting up, is the core idea of ANOVA. To use a house analogy, the total sum of squares ($SS_{Total}$) is a big empty house that we want to split into smaller rooms. In a one-factor design, $SS_{Total}$ is partitioned using this formula:

$$SS_{Total} = SS_{Effect} + SS_{Error}$$

Here $SS_{Effect}$ is the variance we can attribute to the means of the different groups, and $SS_{Error}$ is the leftover variance that we cannot explain. $SS_{Effect}$ and $SS_{Error}$ are the partitions of $SS_{Total}$: they are the little rooms.

Link: https://stats.libretexts.org/Bookshelves/Applied_Statistics/Answering_Questions_with_Data_-__Introductory_Statistics_for_Psychology_Students_(Crump)/08%3A_Repeated_Measures_ANOVA/8.02%3A_Partioning_the_Sums_of_Squares

**Why Partitioning of Variance is Important in ANOVA**:

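By partitioning the total variance into components, ANOVA separates the variation explained by the factor from the unexplained (error) variation. The F-test compares the two: a large ratio of explained to unexplained variance, each divided by its degrees of freedom, is evidence that the group means differ.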
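
For a one-way design the partition can be written out explicitly. The notation below is an assumption introduced for illustration: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, $x_{ij}$ the $i$-th observation in group $j$, $\bar{x}_j$ the mean of group $j$, and $\bar{x}$ the grand mean.

$$
\begin{aligned}
SS_{Total} &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}\right)^2 \\
SS_{Effect} &= \sum_{j=1}^{k} n_j \left(\bar{x}_j - \bar{x}\right)^2 \\
SS_{Error} &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2 \\
F &= \frac{SS_{Effect} / (k - 1)}{SS_{Error} / (N - k)}
\end{aligned}
$$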

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import pandas as pd

#create pandas DataFrame
df = pd.DataFrame({'hours': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                             3, 4, 4, 4, 5, 5, 6, 7, 7, 8],
                   'score': [68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                             88, 85, 89, 94, 93, 94, 96, 89, 92, 97]})

#view first five rows of DataFrame
df.head()

|   | hours | score |
|---|-------|-------|
| 0 | 1 | 68 |
| 1 | 1 | 76 |
| 2 | 1 | 74 |
| 3 | 2 | 80 |
| 4 | 2 | 76 |


Next, we’ll use the OLS() function from the statsmodels library to fit a simple linear regression model using score as the response variable and hours as the predictor variable:

In [2]:
import statsmodels.api as sm

#define response variable
y = df['score']

#define predictor variable
x = df[['hours']]

#add constant to predictor variables
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y, x).fit()

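Lastly, we can use the following formulas to calculate the SST, SSR, and SSE values of the model. Note that in this code SSE denotes the sum of squared errors (the residual part) and SSR the sum of squares due to regression (the explained part), which is the reverse of the abbreviations used in the question: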

In [3]:
import numpy as np

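#calculate sse (sum of squared errors: the residual, unexplained variation)
sse = np.sum((model.fittedvalues - df.score)**2)
print("SSE: ", sse)


#calculate ssr (sum of squares regression: the variation explained by the model)
ssr = np.sum((model.fittedvalues - df.score.mean())**2)
print("SSR: ", ssr)

#calculate sst (total sum of squares; equals ssr + sse for OLS with an intercept)
sst = ssr + sse
print("SST: ", sst)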


SSE:  331.0748847926267
SSR:  917.4751152073769
SST:  1248.5500000000036


Information: https://www.statology.org/sst-ssr-sse-in-python/
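
The example above uses a regression model. For a one-way ANOVA with a categorical factor, the same three quantities can be computed directly from the group means. The following is a minimal sketch, assuming a small hypothetical data set with one factor (`group`) and three levels. To avoid the ambiguous SSE/SSR abbreviations, the variables are named after what they measure.

```python
import numpy as np
import pandas as pd

# Hypothetical one-way layout: one factor ("group") with three levels.
data = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 4),
    "value": [4, 5, 6, 5, 7, 8, 9, 8, 10, 11, 12, 11],
})

grand_mean = data["value"].mean()
# Mean of each observation's own group, aligned with the rows.
group_means = data.groupby("group")["value"].transform("mean")

# Total: every observation against the grand mean.
ss_total = np.sum((data["value"] - grand_mean) ** 2)
# Between groups (explained): group means against the grand mean.
ss_between = np.sum((group_means - grand_mean) ** 2)
# Within groups (residual): observations against their own group mean.
ss_within = np.sum((data["value"] - group_means) ** 2)

print("SS total:  ", ss_total)
print("SS between:", ss_between)
print("SS within: ", ss_within)
```

For this data the group means are 5, 8 and 11 and the grand mean is 8, so the three values are 78, 72 and 6, and the total equals the sum of the two parts.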

## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?



In [10]:
# Importing libraries 
import statsmodels.api as sm 
from statsmodels.formula.api import ols 
import numpy as np
import pandas as pd

#create data
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})


#perform two-way ANOVA
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()
sm.stats.anova_lm(model, typ=2)



|   | sum_sq | df | F | PR(>F) |
|---|--------|----|---|--------|
| C(water) | 8.533333 | 1.0 | 16.0 | 0.000527 |
| C(sun) | 24.866667 | 2.0 | 23.3125 | 2e-06 |
| C(water):C(sun) | 2.466667 | 2.0 | 2.3125 | 0.120667 |
| Residual | 12.8 | 24.0 |  |  |


We can see the following p-values for each of the factors in the table:


water: p-value = .000527

sun: p-value = .000002

water*sun: p-value = .120667

Since the p-values for water and sun are both less than .05, this means that both factors have a statistically significant effect on plant height.

And since the p-value for the interaction effect (.120667) is not less than .05, this tells us that there is no significant interaction effect between sunlight exposure and watering frequency.

Note: Although the ANOVA results tell us that watering frequency and sunlight exposure have a statistically significant effect on plant height, we would need to perform post-hoc tests to determine exactly how different levels of water and sunlight affect plant height.

## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

The F-statistic is the ratio of the between-group mean square to the within-group mean square. A value of 5.23 means the variation among the group means is about five times as large as the variation expected from within-group noise alone, which points to a real difference among the group means.

The p-value of 0.02 is below the conventional significance level of 0.05, so we reject the null hypothesis that all group means are equal: at least one group mean differs from the others. The result would not be significant at the stricter 0.01 level, and the F-test alone does not say which groups differ; that requires a post-hoc test.

Information from: https://surveysparrow.com/blog/anova/

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?


**LISTWISE DELETION**:

By far the most common approach to missing data is to simply omit those cases with missing data and to run our analyses on what remains. Thus if 5 subjects in Group 1 don't show up to be tested, that group is 5 observations short.  Or if 5 individuals have missing scores on one or more variables, we simply omit those individuals from the analysis. This approach is usually called listwise deletion, but it is also known as complete case analysis. 

Although listwise deletion often results in a substantial decrease in the sample size available for the analysis, it does have important advantages. In particular, under the assumption that data are missing completely at random, it leads to unbiased parameter estimates. Unfortunately, even when the data are MCAR there is a loss in power using this approach, especially if we have to rule out a large number of subjects. And when the data are not MCAR, bias results. (For example when low income individuals are less likely to report their income level, the resulting mean is biased in favor of higher incomes.) The alternative approaches discussed below should be considered as a replacement for listwise deletion, though in some cases we may be better off to "bite the bullet" and fall back on listwise deletion.
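
As a minimal sketch of listwise deletion in a repeated measures setting, assume hypothetical data in wide format (one row per subject, one column per measurement occasion). The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated-measures data in wide format:
# one row per subject, one column per measurement occasion.
scores = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "time_1": [10.0, 12.0, 11.0, np.nan, 9.0],
    "time_2": [11.0, np.nan, 13.0, 12.0, 10.0],
    "time_3": [12.0, 14.0, 15.0, 13.0, 11.0],
})

# Listwise deletion: drop every subject with at least one missing score.
complete_cases = scores.dropna()

print("Subjects before deletion:", len(scores))
print("Subjects after deletion: ", len(complete_cases))
```

Only 2 of the 15 scores are missing, yet 2 of the 5 subjects are removed, which illustrates how quickly listwise deletion reduces the sample size and therefore the power.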


**MULTIPLE IMPUTATION**:

Just like the old-fashioned imputation methods, Multiple Imputation fills in estimates for the missing data.  But to capture the uncertainty in those estimates, MI estimates the values multiple times. Because it uses an imputation method with error built in, the multiple estimates should be similar, but not identical.

The result is multiple data sets with identical values for all of the non-missing values and slightly different values for the imputed values in each data set. The statistical analysis of interest, such as ANOVA or logistic regression, is performed separately on each data set, and the results are then combined. Because of the variation in the imputed values, there should also be variation in the parameter estimates, leading to appropriate estimates of standard errors and appropriate p-values.

Source: 

1. https://www.uvm.edu/~statdhtx/StatPages/Missing_Data/Missing-Part-One.html
2. https://www.theanalysisfactor.com/missing-data-two-recommended-solutions/

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

**Tukey's Post Hoc Tests**:

Tukey’s Honest Significant Difference (HSD) test is a post hoc test commonly used to assess the significance of differences between pairs of group means. Tukey HSD is often a follow up to one-way ANOVA, when the F-test has revealed the existence of a significant difference between some of the tested groups.

**Scheffé's method**:

In statistics, Scheffé's method, named after American statistician Henry Scheffé, is a method for adjusting significance levels in a linear regression analysis to account for multiple comparisons. It is particularly useful in analysis of variance (a special case of regression analysis), and in constructing simultaneous confidence bands for regressions involving basis functions.

Scheffé's method is a single-step multiple comparison procedure which applies to the set of estimates of all possible contrasts among the factor level means, not just the pairwise differences considered by the Tukey–Kramer method. It works on similar principles as the Working–Hotelling procedure for estimating mean responses in regression, which applies to the set of all possible factor levels.

**Holm–Bonferroni method**:

In statistics, the Holm–Bonferroni method, also called the Holm method or Bonferroni–Holm method, is used to counteract the problem of multiple comparisons. It is intended to control the family-wise error rate (FWER) and offers a simple test uniformly more powerful than the Bonferroni correction. It is named after Sture Holm, who codified the method, and Carlo Emilio Bonferroni.
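
The step-down logic of the Holm method can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not a library API: `holm_adjust` is a hypothetical helper name, and the raw p-values are invented. The smallest p-value is multiplied by the number of tests m, the next by m - 1, and so on, and the adjusted values are forced to be non-decreasing.

```python
import numpy as np

def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)  # indices, smallest p first
    # Step-down multipliers: m for the smallest p, then m-1, ..., 1.
    scaled = p[order] * (m - np.arange(m))
    # Enforce monotonicity and cap the adjusted values at 1.
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted  # undo the sorting
    return adjusted

# Hypothetical raw p-values from three pairwise comparisons.
raw_p = [0.01, 0.04, 0.03]
adjusted_p = holm_adjust(raw_p)
reject = adjusted_p < 0.05

print("Adjusted p-values:", adjusted_p)
print("Reject H0:", reject)
```

With these inputs the adjusted p-values are approximately 0.03, 0.06 and 0.06, so only the first comparison remains significant at the 5% level. In practice the same correction is available in statsmodels through `multipletests` with `method='holm'`.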


**Example on where the Post Hoc Test is necessary**:

A researcher wants to investigate differences in the effectiveness of TikTok, Instagram and Facebook influencers in promoting a nutraceutical brand. Let’s say that, by ANOVA, the null hypothesis (that all three influencer types have similar effectiveness) is rejected. A post-hoc pairwise comparison may then reveal that Instagram influencers have a significantly higher effectiveness in promoting the brand than TikTok and Facebook influencers, while the latter two are similar.
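
A post-hoc comparison for the influencer example could look like the sketch below. It assumes hypothetical, randomly generated effectiveness scores (the means, spread and sample sizes are invented) and uses `pairwise_tukeyhsd` from statsmodels to compare every pair of platforms.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical effectiveness scores for the three influencer types.
rng = np.random.default_rng(42)
scores = np.concatenate([
    rng.normal(loc=60, scale=5, size=20),  # TikTok
    rng.normal(loc=70, scale=5, size=20),  # Instagram
    rng.normal(loc=61, scale=5, size=20),  # Facebook
])
platform = np.repeat(["TikTok", "Instagram", "Facebook"], 20)

# Compare every pair of group means while controlling the
# family-wise error rate at 5%.
tukey = pairwise_tukeyhsd(endog=scores, groups=platform, alpha=0.05)
print(tukey.summary())
```

The summary table lists one row per pair of platforms with the mean difference, a confidence interval and a reject flag, which identifies the pairs that differ.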

## Q9. The following data represent the test scores of two groups of students: Group A: 80, 85, 90, 92, 87, 83; Group B: 75, 78, 82, 79, 81, 84. Conduct an F-test at the 1% significance level to determine if the variances are significantly different.

In [2]:
import numpy as np
import scipy.stats as stat

In [13]:
group_a = [80, 85, 90, 92, 87, 83]
group_b = [75, 78, 82, 79, 81, 84]
#significance level of 1%
alpha = 0.01

In [4]:
#sample variances (ddof=1 divides by n - 1)
variance_a = np.var(group_a, ddof=1)
variance_b = np.var(group_b, ddof=1)

In [5]:
f_value = variance_a/variance_b

In [6]:
df_a = len(group_a) - 1
df_b = len(group_b) - 1

In [11]:
#two-sided p-value: twice the smaller tail probability
p_value = 2 * min(stat.f.cdf(f_value, df_a, df_b), stat.f.sf(f_value, df_a, df_b))

In [12]:
print('Degree of freedom 1:',df_a)
print('Degree of freedom 2:',df_b)
print("F-statistic:", round(f_value, 4))
print("p-value:", round(p_value, 4))

Degree of freedom 1: 5
Degree of freedom 2: 5
F-statistic: 1.9443
p-value: 0.4831


In [14]:
if p_value < alpha:
    print("Reject the null hypothesis that Var(X) == Var(Y)")
else:
    print("Fail to reject the null hypothesis that Var(X) == Var(Y)")

Fail to reject the null hypothesis that Var(X) == Var(Y)
