1) Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Analysis of Variance (ANOVA) is a statistical technique used to test for significant differences among means of two or more groups. In order to use ANOVA, several assumptions must be met. Violations of these assumptions can lead to inaccurate or invalid results.

The assumptions required for ANOVA are:

1) Normality: The data in each group (more precisely, the residuals) should follow an approximately normal distribution.
2) Homogeneity of variance: The variance of the data in each group should be approximately equal.
3) Independence: The observations should be independent of each other, both within and across groups.

Examples of violations:
1) Violation of normality assumption: If the data in each group are not normally distributed, ANOVA may not be valid. For example, if the data are strongly skewed or contain extreme outliers, the normality assumption may be violated. In such cases, non-parametric tests like the Kruskal-Wallis test may be used instead.
2) Violation of homogeneity of variance assumption: If the variance of the data in each group is not approximately equal, ANOVA may not be valid. For example, if the data in one group have a much larger variance than the others, this assumption may be violated. In such cases, Welch's ANOVA or a non-parametric alternative may be used.
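3) Violation of independence assumption: If the observations in each group are not independent of each other, ANOVA may not be valid. For example, if the same participant is measured in multiple conditions or at multiple time points, the observations are correlated and this assumption is violated. In such cases, a repeated measures ANOVA may be used instead.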
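
The first two assumptions can be checked numerically before running the ANOVA. The sketch below is one possible way to do so with SciPy, using the Shapiro-Wilk test for normality and Levene's test for equal variances; the group values are hypothetical and only illustrate the calls. Independence cannot be tested this way, since it follows from how the study was designed.

```python
from scipy import stats

# Hypothetical measurements for three independent groups (illustrative values only)
group_a = [5.1, 4.9, 5.6, 5.8, 5.3, 5.0, 5.5, 5.2]
group_b = [6.2, 6.0, 6.8, 6.4, 6.1, 6.6, 6.3, 6.5]
group_c = [4.1, 4.4, 3.9, 4.6, 4.2, 4.0, 4.5, 4.3]
groups = {"A": group_a, "B": group_b, "C": group_c}

# Normality: Shapiro-Wilk test within each group (null hypothesis: data are normal)
normality_p = {}
for name, values in groups.items():
    statistic, p = stats.shapiro(values)
    normality_p[name] = p
    print(f"Shapiro-Wilk, group {name}: W={statistic:.3f}, p={p:.3f}")

# Homogeneity of variance: Levene's test across groups (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)
print(f"Levene: W={levene_stat:.3f}, p={levene_p:.3f}")

# If either check fails, the Kruskal-Wallis test is a rank-based alternative
kw_stat, kw_p = stats.kruskal(group_a, group_b, group_c)
print(f"Kruskal-Wallis: H={kw_stat:.3f}, p={kw_p:.4f}")
```

A small p-value in the Shapiro-Wilk or Levene test suggests the corresponding assumption is doubtful. With small samples these tests have little power, so plots of the residuals are a useful complement.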

2) What are the three types of ANOVA, and in what situations would each be used?

ANOVA (Analysis of Variance) is a statistical technique used to analyze the differences between the means of two or more groups. Three commonly used types of ANOVA are:

1) One-way ANOVA: One-way ANOVA is used when you want to compare the means of three or more groups defined by a single independent variable. For example, if you want to compare the average income of people from different countries, you can use one-way ANOVA.
2) Two-way ANOVA: Two-way ANOVA is used when you want to compare group means across two independent variables and to test whether the two variables interact. For example, if you want to compare the average income of people across both country and gender, you can use two-way ANOVA.
3) Mixed ANOVA: Mixed ANOVA is used when you want to compare the means of two or more groups with both within-subjects and between-subjects factors. For example, if you want to compare the effect of a drug on two groups of people, with one group receiving the drug and the other receiving a placebo, and you want to measure the effect at different time points, you can use mixed ANOVA.

Each type of ANOVA is used in different situations, depending on the number of independent variables and the design of the study. One-way ANOVA is used when there is one independent variable and the groups are independent. Two-way ANOVA is used when there are two independent variables and the groups are independent. Mixed ANOVA is used when there are both within-subjects and between-subjects factors.

3) What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA (Analysis of Variance) refers to the process of decomposing the total variation in a data set into its component parts, which can be attributed to different sources of variation. In ANOVA, the total variation in the dependent variable is partitioned into two or more components: the variation due to differences among the groups being compared and the variation due to random error or individual differences.
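
For a one-way design, the decomposition can be written out explicitly. The notation here is an assumption introduced for illustration: k groups, n_j observations in group j, N observations in total, y_ij the i-th observation in group j, ȳ_j the mean of group j, and ȳ the grand mean.

```latex
\begin{aligned}
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2}_{\text{total}}
&= \underbrace{\sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2}_{\text{between groups}}
 + \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2}_{\text{within groups}} \\
F &= \frac{\text{between} / (k - 1)}{\text{within} / (N - k)}
\end{aligned}
```

The F-statistic compares the two components after each is divided by its degrees of freedom.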

The partitioning of variance is important because it helps researchers to understand the relative contributions of different sources of variation to the overall variation in their data, and to determine whether the treatment effect is statistically significant. By partitioning the total variation in a data set into its component parts, researchers can determine the proportion of the total variation that is accounted for by the treatment effect, as well as the proportion that is due to chance or individual differences.

Understanding the partitioning of variance is also important because it allows researchers to calculate effect sizes, which are measures of the strength of the relationship between the independent and dependent variables. Effect sizes are important because they provide a more meaningful interpretation of the results of statistical analyses than p-values alone, and they can be used to compare the strength of the treatment effect across different studies.
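
As a minimal sketch of how an effect size follows from the partition, the code below computes the between-group, within-group, and total sums of squares by hand with NumPy and then derives eta squared (between / total). The data are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical scores for three groups (illustrative values only)
groups = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group variation: group means around the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total variation: observations around the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# Eta squared: share of the total variation attributable to group membership
eta_squared = ss_between / ss_total

print("SS between:", ss_between)
print("SS within:", ss_within)
print("SS total:", ss_total)
print("eta squared:", round(eta_squared, 3))
```

Here the between-group and within-group parts add up to the total, and eta squared shows that most of the variation in this toy data set lies between groups.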

4) How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

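To calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python, you can fit the model with the statsmodels library and read the sums of squares from the fitted result. In statsmodels, `ess` is the explained sum of squares and `ssr` is the sum of squared residuals.

The cell below imports the necessary modules, fits the model, and prints the three quantities: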

In [1]:
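import pandas as pd
from statsmodels.formula.api import ols

# Toy data: three groups with two observations each
data = {'group': ['A', 'A', 'B', 'B', 'C', 'C'],
        'value': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)

# One-way ANOVA expressed as a linear model with a categorical predictor
model = ols('value ~ group', data=df).fit()

ssr = model.ssr    # residual (within-group) sum of squares
sse = model.ess    # explained (between-group) sum of squares
sst = sse + ssr    # total sum of squares; equals model.centered_tss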

print('SSE:', sse)
print('SSR:', ssr)
print('SST:', sst)


SSE: 16.0
SSR: 1.5
SST: 17.5


5) In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
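import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# 'mydata.csv', 'outcome_var', 'var1', and 'var2' are placeholders for your own data
data = pd.read_csv('mydata.csv')

# C() marks var1 and var2 as categorical factors;
# '*' expands to both main effects plus their interaction
model = ols('outcome_var ~ C(var1) * C(var2)', data=data).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)

# Each row of the ANOVA table holds sum_sq, df, F, and PR(>F) for one effect
main_effect_var1 = table.loc['C(var1)', ['F', 'PR(>F)']]
main_effect_var2 = table.loc['C(var2)', ['F', 'PR(>F)']]
interaction_effect = table.loc['C(var1):C(var2)', ['F', 'PR(>F)']]
print('Main effect of var1:', main_effect_var1.to_dict())
print('Main effect of var2:', main_effect_var2.to_dict())
print('Interaction effect:', interaction_effect.to_dict())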

6) Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

If you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, you can conclude that, at the conventional 0.05 significance level, there is evidence that not all group means are equal in the population from which the samples were drawn.

The F-statistic is the ratio of the variance between the groups to the variance within the groups. A larger F-statistic means the group means are spread out more than the within-group variability would predict, which is evidence against equal means. The p-value of 0.02 means that, if all population means were equal, an F-statistic of 5.23 or larger would be observed only about 2% of the time.

Therefore, we can reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from the others. However, the ANOVA alone does not tell us which specific group(s) differ.

In terms of interpretation, these results suggest that the group means differ, but the nature and magnitude of the differences require further analysis, such as post-hoc tests to locate the differences and an effect size to quantify them. Whether the differences matter in practice depends on the context and the goals of the study.
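
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not state. The sketch below assumes a hypothetical design with 3 groups and 15 observations in total, purely to show how the p-value and the critical value are obtained from the F distribution with SciPy.

```python
from scipy import stats

f_statistic = 5.23
# Hypothetical design: 3 groups and 15 observations in total (not stated in the question)
k, n_total = 3, 15
df_between = k - 1
df_within = n_total - k

# Survival function = P(F >= observed value) under the null hypothesis of equal means
p_value = stats.f.sf(f_statistic, df_between, df_within)
# Smallest F that would be significant at alpha = 0.05 for these degrees of freedom
f_critical = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha=0.05: {f_critical:.2f}")
print("reject H0:", f_statistic > f_critical)
```

Under these assumed degrees of freedom the p-value lands close to the 0.02 quoted in the question. A different design would give a different p-value for the same F-statistic.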

7) In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA can be a challenging task since the data are correlated across time points and within-subjects, and ignoring the missing data can lead to biased estimates and reduced statistical power. Here are a few common methods to handle missing data in a repeated measures ANOVA:

1) Complete-case analysis: In this approach, only complete cases (i.e., subjects with data at all time points) are included in the analysis, and subjects with any missing value are excluded. While this method is straightforward and easy to implement, it reduces the sample size and can lead to biased results if the data are not missing completely at random (MCAR).
2) Pairwise deletion: In this approach, each comparison uses every subject that has data for the time points involved, so a subject with a missing value is dropped only from the comparisons that need that value. While this method allows more subjects to be included in the analysis, the sample differs from one comparison to the next, and it can lead to biased results if the missing data are not MCAR.
3) Multiple imputation: In this approach, missing data are imputed multiple times to create several completed datasets, which are then analyzed separately, and the results are combined. This method can provide more accurate estimates and preserve statistical power, especially if the data are missing at random (MAR).

The consequences of using different methods to handle missing data in a repeated measures ANOVA can be significant. Using complete-case analysis or pairwise deletion can lead to biased results and reduced statistical power, especially if the missing data are not MCAR. Multiple imputation can provide more accurate estimates and preserve statistical power, but it can be computationally intensive and requires assumptions about the missing data mechanism. The choice of method should depend on the characteristics of the missing data and the goals of the analysis, and it is recommended to conduct sensitivity analyses to examine the robustness of the results to different methods of handling missing data.
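
The effect of the first two approaches on sample size can be seen in a small pandas sketch. The wide-format table and its column names (`subject`, `t1`, `t2`, `t3`) are hypothetical and stand in for real repeated measures data.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame(
    {
        "subject": [1, 2, 3, 4, 5],
        "t1": [10.0, 12.0, 11.0, 13.0, 9.0],
        "t2": [11.0, np.nan, 12.0, 14.0, 10.0],
        "t3": [13.0, 14.0, np.nan, 15.0, 11.0],
    }
)
time_cols = ["t1", "t2", "t3"]

# Complete-case analysis: keep only subjects observed at every time point
complete = wide.dropna(subset=time_cols)
print("Subjects kept (complete cases):", len(complete), "of", len(wide))

# Pairwise (available-case) view: each pair of time points uses every subject
# that has both values, so the sample size varies from pair to pair
pair_sizes = {}
for a, b in [("t1", "t2"), ("t1", "t3"), ("t2", "t3")]:
    pair_sizes[(a, b)] = wide[[a, b]].dropna().shape[0]
    print(f"{a} vs {b}: n = {pair_sizes[(a, b)]}")
```

Two missing values are enough to remove two of the five subjects from a complete-case analysis, which is why the choice of method can change both power and the estimates.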

8) What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used to compare specific pairs of groups after obtaining a significant result in an ANOVA, and they can help identify which groups differ significantly from each other. Here are some common post-hoc tests used after ANOVA, along with their characteristics and situations where they might be used:

1) Tukey's Honestly Significant Difference (HSD): This test compares all possible pairs of means while controlling the overall Type I error rate. It is commonly used when all pairwise comparisons are of interest, the group sizes are equal or nearly equal, and the variances are homogeneous across groups.

2) Bonferroni correction: This method divides the alpha level by the number of comparisons to control the family-wise error rate. It is simple and applies to any set of tests, but it becomes conservative as the number of comparisons grows, so it is best suited to a small number of planned comparisons.

3) Scheffe's test: This test adjusts the alpha level based on the number of contrasts being made, which can be useful when testing multiple hypotheses simultaneously. It is commonly used when there are a small number of groups and a large number of contrasts.

4) Games-Howell test: This test does not assume equal variances across groups and can be used when the assumption of equal variances is violated. It is commonly used when the sample sizes are unequal or when the variances are heterogeneous across groups.

An example of a situation where a post-hoc test might be necessary is a study that examines the effects of different treatments on a medical condition. Suppose that an ANOVA is used to compare the mean improvement scores of four treatment groups, and the ANOVA result indicates a significant difference among the groups. A post-hoc test can be used to determine which specific pairs of groups differ significantly from each other, which can help identify the most effective treatment(s) for the medical condition. In this case, Tukey's HSD test or Bonferroni correction can be used to compare all possible pairs of means, while controlling the overall Type I error rate.
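
A minimal sketch of the four-treatment scenario with Tukey's HSD, using `pairwise_tukeyhsd` from statsmodels. The improvement scores and the group labels `T1` to `T4` are hypothetical and serve only to illustrate the call.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical improvement scores for four treatment groups (illustrative values only)
scores = {
    "T1": [12, 14, 11, 13, 15, 12],
    "T2": [18, 17, 19, 20, 16, 18],
    "T3": [13, 12, 14, 11, 13, 15],
    "T4": [22, 24, 21, 23, 25, 22],
}
values = np.concatenate([np.asarray(v, dtype=float) for v in scores.values()])
labels = np.repeat(list(scores.keys()), [len(v) for v in scores.values()])

# All pairwise comparisons with the family-wise error rate held at alpha
result = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(result.summary())
```

The summary lists one row per pair of groups, with the mean difference, an adjusted p-value, a confidence interval, and whether the null hypothesis of equal means is rejected for that pair.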



9) A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [6]:
import numpy as np
from scipy.stats import f_oneway
diet_A = np.array([4.5, 5.2, 6.1, 4.9, 5.7, 5.8, 6.2, 4.8, 5.6, 5.0,
                   5.9, 6.3, 5.1, 6.0, 5.5, 5.3, 5.4, 4.7, 5.8, 6.4,
                   4.6, 5.2, 6.2, 5.9, 5.3])
diet_B = np.array([3.9, 3.8, 4.2, 4.5, 3.7, 4.1, 4.3, 3.5, 3.9, 4.0,
                   4.2, 4.4, 4.1, 4.6, 4.5, 4.0, 4.3, 4.4, 4.7, 4.2,
                   4.1, 4.3, 4.5, 4.7, 4.4])
diet_C = np.array([2.8, 3.1, 2.9, 3.3, 3.2, 3.6, 3.0, 3.5, 3.4, 2.9,
                   3.2, 3.3, 3.1, 3.5, 3.4, 3.6, 3.0, 3.1, 3.3, 3.5,
                   3.2, 3.4, 3.1, 3.6, 3.2])
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)
print('F-statistic:', f_statistic)
print('p-value:', p_value)

F-statistic: 205.54449472096522
p-value: 1.7337754997814795e-30

Interpretation: The p-value is far below 0.05, so we reject the null hypothesis that the three diets produce the same mean weight loss; at least one diet differs from the others. A post-hoc test such as Tukey's HSD is needed to identify which pairs of diets differ.


10) A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
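# Time below holds 60 values (30 for Program A, then 30 for Program B); no times for
# Program C are listed, so the frame covers A and B only. All columns must have equal length.
data = pd.DataFrame({
    'Program': ['A']*30 + ['B']*30,
    'Experience': (['Novice']*15 + ['Experienced']*15) * 2,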
    'Time': [32, 29, 31, 28, 33, 30, 31, 27, 30, 28, 33, 29, 30, 31, 28, 27, 34, 29, 33, 30, 31, 32, 28, 29, 30, 32, 31, 30, 29, 28,
             35, 37, 38, 36, 39, 36, 35, 37, 38, 36, 39, 37, 36, 38, 36, 35, 40, 38, 39, 37, 36, 35, 37, 38, 36, 38, 39, 37, 36, 35]
})
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)

11) An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [10]:
import numpy as np
from scipy.stats import ttest_ind
control_scores = np.array([78, 85, 69, 82, 79, 94, 80, 83, 76, 81, 87, 75, 71, 88, 90, 77, 84, 79, 86, 85, 73, 80, 75, 82, 81, 78, 83, 90, 85, 84, 86, 88, 79, 84, 80, 72, 81, 83, 75, 70, 87, 79, 86, 82, 76, 84, 88, 80, 74, 89, 77, 73, 81, 85, 77, 79, 83, 82, 85, 71, 90, 84, 77, 82, 89, 83, 85, 76, 81, 84, 86, 88, 79, 84, 75, 82, 81, 77, 86, 83, 79, 84, 85, 78, 83, 87, 88, 80, 76, 84, 75, 81, 79, 83, 82, 85, 72, 89, 86, 80, 88, 84, 85, 83, 81, 78])
experimental_scores = np.array([92, 81, 76, 91, 85, 87, 94, 90, 82, 86, 79, 92, 88, 86, 85, 95, 83, 86, 90, 94, 80, 89, 84, 91, 78, 86, 84, 92, 85, 88, 83, 94, 89, 87, 80, 81, 90, 83,84, 87, 92, 90, 84, 78, 89, 87, 91, 85, 92, 84, 90, 88, 91, 84, 89, 82, 87, 85, 81, 89, 88, 84, 86, 81, 92, 88, 86, 83, 90, 87, 86, 85, 90, 89, 88, 81, 91, 85, 83, 88, 92, 88, 86, 83, 91, 85, 87, 89, 86, 82, 80, 92, 87, 83, 85, 90, 91, 80, 89, 85, 91, 88, 92])
t_stat, p_value = ttest_ind(control_scores, experimental_scores)
print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.4f}")

T-statistic: -8.13
P-value: 0.0000

Interpretation: The p-value is below 0.05, so the mean test scores of the two groups differ significantly. The negative t-statistic means the control group scored lower on average than the experimental group. Because only two groups are compared, the t-test itself identifies which groups differ, and no post-hoc test is needed.


12) A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [11]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
data = {'store': ['A']*30 + ['B']*30 + ['C']*30,
        'day': list(range(1,31))*3,
        'sales': [4,5,7,8,5,6,9,11,10,12,14,15,16,18,19,20,22,23,21,19,18,17,16,15,14,13,12,11,10,9] +
                 [3,2,4,6,7,8,9,12,14,16,17,18,20,22,21,20,18,17,15,14,13,12,10,9,7,6,5,4,3,2] +
                 [5,4,6,8,9,10,11,14,13,15,17,19,21,22,23,24,25,26,28,27,25,24,23,22,21,20,18,17,16,15]}

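df = pd.DataFrame(data)
# Day is entered as a blocking factor, so the stores are compared within the same days
rm_anova = ols('sales ~ C(store) + C(day)', data=df).fit()
print(rm_anova.summary())

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Note: pairwise_tukeyhsd treats the stores as independent groups and ignores the day blocking
tukey = pairwise_tukeyhsd(df['sales'], df['store'], alpha=0.05)
print(tukey.summary())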

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.863
Model:                            OLS   Adj. R-squared:                  0.790
Method:                 Least Squares   F-statistic:                     11.79
Date:                Thu, 30 Mar 2023   Prob (F-statistic):           1.58e-15
Time:                        03:43:03   Log-Likelihood:                -209.10
No. Observations:                  90   AIC:                             482.2
Df Residuals:                      58   BIC:                             562.2
Df Model:                          31                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         3.2889      1.835      1.792