## Q1.  ANOVA assumes that the data within each group (equivalently, the residuals) are approximately normally distributed. It also assumes homogeneity of variance, which means that the variances of the groups should be approximately equal, and that the observations are independent of each other.
### Potential assumption violations include: implicit factors (lack of independence within a sample), lack of independence between samples, outliers (apparent nonnormality caused by a few data points), and nonnormality of entire samples.
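As a rough illustration of how the normality and equal-variance assumptions might be checked, the sketch below runs a Shapiro-Wilk test on each group and Levene's test across the groups. It reuses the diet scores from the one-way ANOVA example in this document. The choice of these two tests is an assumption of this sketch, not something the answer above prescribes, and independence has to be judged from the study design rather than from a test like these.

```python
from scipy import stats

# Diet scores taken from the one-way ANOVA example in this document.
groups = {
    "dietA": [89, 89, 88, 78, 79],
    "dietB": [93, 92, 94, 89, 88],
    "dietC": [89, 88, 89, 93, 90],
}

# Normality: Shapiro-Wilk per group (null hypothesis: the sample is normal).
shapiro_p = {}
for name, values in groups.items():
    _, p_value = stats.shapiro(values)
    shapiro_p[name] = p_value
    print(f"Shapiro-Wilk {name}: p = {p_value:.3f}")

# Homogeneity of variance: Levene's test (null hypothesis: equal variances).
# SciPy centers on the median by default (the Brown-Forsythe variant).
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may be violated. With only five observations per group, these tests have little power, so plots of the residuals are usually worth a look as well.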

## Q2.  Commonly, ANOVAs are used in three ways: one-way ANOVA, two-way ANOVA, and N-way ANOVA.
### N-way ANOVAs (for example, a three-way ANOVA) are useful for understanding complex interactions in which more than one variable may influence the result. They have many applications in finance, social science, and medical research, among a host of other fields.

## Q3.  An ANOVA uses an F-test to evaluate whether the variance among the groups is greater than the variance within the groups. Another way to view this is as a partition of variance: we divide the total variance in the data into its various sources.
### ANOVA is helpful for comparing the means of three or more groups. It serves the same purpose as running multiple two-sample t-tests, but it results in fewer type I errors and is appropriate for a range of problems. ANOVA assesses group differences by comparing the means of each group and by partitioning the total variance into its separate sources.
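The partition described above can be sketched by hand in plain Python. The example below reuses the diet scores from the one-way ANOVA example in this document, splits the total sum of squares into a between-group and a within-group part, and forms the F statistic from the two mean squares. The variable names are illustrative only.

```python
# Diet scores taken from the one-way ANOVA example in this document.
groups = [
    [89, 89, 88, 78, 79],  # dietA
    [93, 92, 94, 89, 88],  # dietB
    [89, 88, 89, 93, 90],  # dietC
]

all_values = [v for g in groups for v in g]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(g) / len(g) for g in groups]

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (m - grand_mean) ** 2
                 for g, m in zip(groups, group_means))
# Within-group sum of squares: how far each value is from its own group mean.
ss_within = sum((v - m) ** 2
                for g, m in zip(groups, group_means) for v in g)
# Total sum of squares: how far each value is from the grand mean.
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

df_between = len(groups) - 1               # k - 1
df_within = len(all_values) - len(groups)  # N - k

# F is the ratio of the between-group to the within-group mean square.
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SS_between + SS_within = {ss_between + ss_within:.4f}")
print(f"SS_total               = {ss_total:.4f}")
print(f"F = {f_stat:.4f}")  # about 4.3501
```

The two sums of squares add up to the total, and the resulting F agrees with the `f_oneway` statistic reported for the same data in Q9.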

## Q4.  code

In [3]:
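import numpy as np
import pandas as pd
import statsmodels.api as sm

# Example data: 20 observations of 'hours' and 'score'
df = pd.DataFrame({'hours': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                             3, 4, 4, 4, 5, 5, 6, 7, 7, 8],
                   'score': [68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                             88, 85, 89, 94, 93, 94, 96, 89, 92, 97]})

# Fit a simple linear regression of score on hours (with an intercept)
y = df['score']
x = sm.add_constant(df[['hours']])
model = sm.OLS(y, x).fit()

# SSE: sum of squared differences between fitted and observed values
sse = np.sum((model.fittedvalues - df.score)**2)
print(sse)
# SSR: sum of squared differences between fitted values and the mean score
ssr = np.sum((model.fittedvalues - df.score.mean())**2)
print(ssr)
# SST: total sum of squares, which equals SSR + SSE for OLS with an intercept
sst = ssr + sse
print(sst)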

331.07488479262696
917.4751152073725
1248.5499999999995


## Q5.  code

In [7]:
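import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe
# Note: 'Fertilizer' and 'Watering' are built identically here, so the two
# factors are fully confounded and their effects cannot be separated.
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.repeat(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11,
                                     14, 15, 16, 16, 17, 18, 14, 13,
                                     14, 14, 14, 15, 16, 16, 17, 18,
                                     14, 13, 14, 14, 14, 15]})

# Performing two-way ANOVA (both main effects plus their interaction)
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',
            data=dataframe).fit()
# anova_lm selects the sum-of-squares type with `typ` (not `type`);
# typ=1 gives the sequential (Type I) table shown below. Use typ=2 for Type II.
result = sm.stats.anova_lm(model, typ=1)

# Print the result
print(result)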

                             df     sum_sq   mean_sq         F    PR(>F)
C(Fertilizer)               1.0   0.033333  0.033333  0.012069  0.913305
C(Watering)                 1.0   0.000369  0.000369  0.000133  0.990865
C(Fertilizer):C(Watering)   1.0   0.040866  0.040866  0.014796  0.904053
Residual                   28.0  77.333333  2.761905       NaN       NaN


## Q7.  Repeated measures ANOVA cannot handle missing values. If any measurement is missing for a participant (or animal, or other subject), you need to remove the data for that subject entirely from the data table before running the ANOVA. Beginning with Prism 8, Prism offers an alternative method to analyze repeated measures data: fitting a mixed effects model.
### Missing data can result in bias, although this need not always be the case; it depends on the missing data mechanism and the statistical approach applied. In a complete case analysis, even a low percentage of missing values can cause substantial bias, while a high percentage does not necessarily cause any.
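To make the cost of a complete case analysis concrete, here is a minimal pandas sketch. It reuses the cars, oils, and mileage layout from the repeated measures example in this document and blanks out one reading; which reading is missing is purely hypothetical. Every car that has any missing reading is then dropped.

```python
import numpy as np
import pandas as pd

# Same layout as the repeated measures example in this document:
# 5 cars (subjects), each measured once with each of 4 oils.
data = pd.DataFrame({
    "Cars": np.repeat([1, 2, 3, 4, 5], 4),
    "Oil": np.tile([1, 2, 3, 4], 5),
    "Mileage": [36, 38, 30, 29, 34, 38, 30, 29, 34, 28,
                38, 32, 38, 34, 20, 44, 26, 28, 34, 50],
})
data["Mileage"] = data["Mileage"].astype(float)

# Hypothetical: pretend the reading for car 3 with oil 2 was lost.
data.loc[(data["Cars"] == 3) & (data["Oil"] == 2), "Mileage"] = np.nan

# Complete case (listwise) deletion: keep only cars with no missing readings.
missing_per_car = data["Mileage"].isna().groupby(data["Cars"]).sum()
complete_cars = missing_per_car[missing_per_car == 0].index
complete = data[data["Cars"].isin(complete_cars)]

print(f"rows before: {len(data)}, rows after: {len(complete)}")
print(f"cars kept: {sorted(complete['Cars'].unique().tolist())}")
```

A single missing reading removes all four rows for that car, including its three valid measurements. A mixed effects model can avoid this loss because it can use the remaining observations from the affected subject.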

## Q8.  The post hoc test I'll use is Tukey's method. There are a variety of post hoc tests you can choose from, but Tukey's method is the most common for comparing all possible group pairings. There are two ways to present post hoc test results: adjusted p-values and simultaneous confidence intervals.
### Once you have determined that differences exist among the means, post hoc range tests and pairwise multiple comparisons can determine which means differ. Range tests identify homogeneous subsets of means that are not different from each other.
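As a sketch of what such a pairwise comparison might look like in Python, the example below applies Tukey's method to the diet scores from the one-way ANOVA example in this document, using `pairwise_tukeyhsd` from statsmodels. The 0.05 significance level is an assumption of this sketch.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Diet scores taken from the one-way ANOVA example in this document,
# flattened into one array with a matching array of group labels.
scores = np.array([89, 89, 88, 78, 79,   # dietA
                   93, 92, 94, 89, 88,   # dietB
                   89, 88, 89, 93, 90],  # dietC
                  dtype=float)
diets = np.repeat(["dietA", "dietB", "dietC"], 5)

# Tukey's HSD compares every pair of groups while controlling the
# family-wise error rate at alpha.
tukey = pairwise_tukeyhsd(endog=scores, groups=diets, alpha=0.05)

# The summary lists, for each pair, the mean difference, an adjusted
# p-value, the simultaneous confidence interval, and the reject decision.
print(tukey.summary())
```

The summary table covers both presentation styles mentioned above: an adjusted p-value and a simultaneous confidence interval for each pair of diets.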

## Q9.  code

In [18]:
from scipy.stats import f_oneway

dietA = [89, 89, 88, 78, 79]
dietB = [93, 92, 94, 89, 88]
dietC = [89, 88, 89, 93, 90]

f_oneway(dietA, dietB, dietC)

F_onewayResult(statistic=4.35011990407674, pvalue=0.03795204795237708)

## Q10.  code

In [19]:
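import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Note: 'Fertilizer' and 'Watering' are built identically here, so the two
# factors are fully confounded and their effects cannot be separated.
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.repeat(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11,
                                     14, 15, 16, 16, 17, 18, 14, 13,
                                     14, 14, 14, 15, 16, 16, 17, 18,
                                     14, 13, 14, 14, 14, 15]})

# Two-way ANOVA with both main effects plus their interaction
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',
            data=dataframe).fit()
# anova_lm selects the sum-of-squares type with `typ` (not `type`);
# typ=1 gives the sequential (Type I) table shown below. Use typ=2 for Type II.
result = sm.stats.anova_lm(model, typ=1)

# Print the result
print(result)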

                             df     sum_sq   mean_sq         F    PR(>F)
C(Fertilizer)               1.0   0.033333  0.033333  0.012069  0.913305
C(Watering)                 1.0   0.000369  0.000369  0.000133  0.990865
C(Fertilizer):C(Watering)   1.0   0.040866  0.040866  0.014796  0.904053
Residual                   28.0  77.333333  2.761905       NaN       NaN


## Q12.  code

In [20]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
  
# Create the data
dataframe = pd.DataFrame({'Cars': np.repeat([1, 2, 3, 4, 5], 4),
                          'Oil': np.tile([1, 2, 3, 4], 5),
                          'Mileage': [36, 38, 30, 29,
                                      34, 38, 30, 29,
                                      34, 28, 38, 32,
                                      38, 34, 20, 44,
                                      26, 28, 34, 50]})
  
# Conduct the repeated measures ANOVA
print(AnovaRM(data=dataframe, depvar='Mileage',
              subject='Cars', within=['Oil']).fit())

              Anova
    F Value Num DF  Den DF Pr > F
---------------------------------
Oil  0.5679 3.0000 12.0000 0.6466

