Assumptions of ANOVA:

Independence of observations
Normality of residuals
Homogeneity of variances

Examples of violations:

Independence: Each observation must be unrelated to the others. If the observations are correlated, the standard errors are typically underestimated, which overstates the significance of the results. For example, measuring the same participants before and after a treatment produces paired observations; analysing them as if they came from separate groups violates the independence assumption.
Normality: If the residuals are far from normally distributed, for example strongly skewed or heavy-tailed, the p-values of the F-test can be inaccurate, particularly in small samples. This can be checked using normal probability (Q-Q) plots or statistical tests such as the Shapiro-Wilk test.
Homogeneity of variances: If the variances of the groups are not equal, the pooled within-group variance no longer describes every group, and the F-test can become too liberal or too conservative, especially when group sizes differ. For example, a group with a much larger variance than the others can dominate the error term and obscure real differences between the groups. This can be checked using statistical tests such as Levene's test.
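
The two checks named above can be sketched with scipy. This is a minimal illustration, not part of the original analysis: the three groups are synthetic values drawn from a random number generator, and the variable names are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only (not taken from any real study)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (5.0, 5.5, 6.0)]

# In a one-way ANOVA the residuals are the observations minus their own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: the null hypothesis is that the residuals are normally distributed
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: the null hypothesis is that all groups have equal variance
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
```

A small p-value in either test is evidence against the corresponding assumption; a large one does not prove the assumption holds, so the plots remain worth inspecting.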

One-way ANOVA: Used when there is one independent variable with two or more levels (typically three or more, since two groups can be compared with a t-test), and the dependent variable is continuous. For example, a one-way ANOVA could be used to compare the mean height of trees in three different soil types.
Two-way ANOVA: Used when there are two independent variables, and the dependent variable is continuous. For example, a two-way ANOVA could be used to compare the mean weight of dogs based on breed and gender.
Repeated measures ANOVA: Used when the same participants are measured multiple times under different conditions. For example, a repeated measures ANOVA could be used to compare the mean reaction time of participants in three different lighting conditions.

The partitioning of variance in ANOVA refers to the process of dividing the total variance of the dependent variable into different sources of variance. This is important because it allows us to determine the proportion of the variance that can be explained by the independent variable(s) and the proportion that is due to other factors or random error. This helps us to understand the relative importance of the independent variable(s) in explaining the variability in the dependent variable.

In [1]:
# Sample data
group1 = [1, 2, 3, 4, 5]
group2 = [2, 3, 4, 5, 6]
group3 = [3, 4, 5, 6, 7]
groups = [group1, group2, group3]

# Concatenate the groups into a single list
data = group1 + group2 + group3

# Calculate the overall (grand) mean
mean = sum(data) / len(data)

# Calculate the total sum of squares (SST)
SST = sum((x - mean)**2 for x in data)

# Calculate the sum of squares due to error (SSE): deviations of each
# observation from its own group mean, not from the overall mean
SSE = sum((x - sum(g) / len(g))**2 for g in groups for x in g)

# Calculate the sum of squares due to treatment (SSR), i.e. between groups
SSR = SST - SSE

print(data)
print(mean)
print(SST)
print(SSE)
print(SSR)


[1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 3, 4, 5, 6, 7]
4.0
40.0
30.0
10.0
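
As a follow-up to the partitioning idea, the sketch below turns the between-group and within-group sums of squares into mean squares, an F-statistic, and the proportion of variance explained (eta squared). It reuses the three sample groups from the cell above; the variable names are chosen for this example, and scipy's `f_oneway` is used only as a cross-check.

```python
from scipy import stats

group1 = [1, 2, 3, 4, 5]
group2 = [2, 3, 4, 5, 6]
group3 = [3, 4, 5, 6, 7]
groups = [group1, group2, group3]

n_total = sum(len(g) for g in groups)   # total number of observations
k = len(groups)                         # number of groups
grand_mean = sum(sum(g) for g in groups) / n_total

# Between-group sum of squares: group means around the grand mean, weighted by group size
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: observations around their own group mean
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

# Mean squares divide each sum of squares by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within                    # → 2.0
eta_squared = ss_between / (ss_between + ss_within)  # → 0.25

# Cross-check against scipy's one-way ANOVA
f_scipy, p_value = stats.f_oneway(*groups)
print(f_manual, f_scipy, eta_squared)
```

Here a quarter of the total variability is attributable to group membership, and the hand-computed F-statistic should agree with the value scipy reports.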


In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the data ("data.csv" is a placeholder; the file must contain the three
# columns named in the formula below)
data = pd.read_csv("data.csv")

# Fit the two-way ANOVA model; `a * b` expands to both main effects plus their interaction
model = ols('dependent_variable ~ independent_variable1 * independent_variable2', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Get the sum of squares attributed to each main effect and to the interaction
main_effect1 = anova_table.loc['independent_variable1', 'sum_sq']
main_effect2 = anova_table.loc['independent_variable2', 'sum_sq']
interaction_effect = anova_table.loc['independent_variable1:independent_variable2', 'sum_sq']


FileNotFoundError: ignored

A one-way ANOVA tests whether there are significant differences between the means of two or more groups. The F-statistic is the ratio of the mean square between groups (the between-group sum of squares divided by its degrees of freedom) to the mean square within groups. The p-value is the probability of obtaining an F-statistic at least as large as the observed one if all group means were truly equal.

In this case, since the p-value is less than 0.05 (assuming a significance level of 0.05), we reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from the others. However, the ANOVA result alone does not tell us which specific groups differ. To identify them, we would need to conduct post-hoc tests.

In a repeated measures ANOVA, where the same participants are measured on multiple occasions, missing data can be a common issue. There are several methods to handle missing data, including:

Complete Case Analysis: Only the cases with complete data are included in the analysis, and the cases with missing data are excluded.

Pairwise Deletion: Only the available data for each pairwise comparison is included in the analysis, and the cases with missing data are excluded for that particular comparison.

Mean Substitution: The missing values are replaced by the mean value of the available data for that variable.

Multiple Imputation: Each missing value is replaced with several plausible values drawn from a model of the observed data, producing multiple completed data sets whose analyses are then pooled.

The choice of method depends on the nature and extent of the missing data, as well as the assumptions of the analysis. However, using different methods to handle missing data can lead to different results and affect the validity of the conclusions.
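
Two of the simpler strategies above, complete case analysis and mean substitution, can be sketched with pandas. The table below is hypothetical (one row per participant, one column per condition) and exists only to show the mechanics; the column and index names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: rows are participants, columns are conditions
scores = pd.DataFrame(
    {
        "cond1": [10.0, 12.0, np.nan, 11.0],
        "cond2": [11.0, np.nan, 13.0, 12.0],
        "cond3": [12.0, 14.0, 15.0, 13.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Complete case analysis: keep only participants with no missing values
complete_cases = scores.dropna()

# Mean substitution: replace each missing value with the mean of its column
mean_filled = scores.fillna(scores.mean())

print(complete_cases)
print(mean_filled)
```

Complete case analysis halves the sample in this toy table, while mean substitution keeps every participant but shrinks the variance of the filled columns, which is one reason the two approaches can lead to different conclusions.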

Post-hoc tests are used to determine which specific groups are significantly different from each other after a significant ANOVA result is obtained. There are several post-hoc tests available, including:

Tukey's HSD (Honestly Significant Difference) test: This test compares all possible pairs of group means and controls for the overall Type I error rate.

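Bonferroni correction: This method adjusts the p-value of each pairwise comparison (equivalently, it divides the significance level by the number of comparisons) to control the overall Type I error rate.

Scheffe's test: This test controls the family-wise error rate for all possible contrasts, not only pairwise comparisons, which makes it more conservative than Tukey's HSD when only pairwise comparisons are of interest.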

Games-Howell test: This test does not assume equal variances and can be used when the assumption of equal variances is violated.

An example situation where a post-hoc test might be necessary is when a one-way ANOVA is conducted to compare the average scores of students in three different schools. If the ANOVA result is significant, a post-hoc test can be used to determine which specific schools are significantly different from each other in terms of their average scores.
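
The school comparison described above could be followed up with Tukey's HSD roughly as follows. This is a sketch on synthetic scores generated for the example (the school means of 70, 72, and 80 are assumptions, not real results), using `pairwise_tukeyhsd` from statsmodels.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic test scores for three schools (illustrative values only)
rng = np.random.default_rng(42)
scores = np.concatenate([
    rng.normal(70, 5, 30),  # School A
    rng.normal(72, 5, 30),  # School B
    rng.normal(80, 5, 30),  # School C
])
schools = np.repeat(["School A", "School B", "School C"], 30)

# Compare every pair of schools while controlling the family-wise error rate at 5%
tukey = pairwise_tukeyhsd(endog=scores, groups=schools, alpha=0.05)
print(tukey.summary())
```

The summary lists one row per pair of schools with the mean difference, an adjusted p-value, a confidence interval, and whether the null hypothesis of equal means is rejected.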

In [3]:
import pandas as pd
import scipy.stats as stats

# create a pandas dataframe with the weight loss data
# (105 observations, so 35 labels per diet keeps both columns the same length)
data = {'diet': ['A']*35 + ['B']*35 + ['C']*35,
        'weight_loss': [2.5, 3.2, 4.1, 2.9, 3.5, 4.2, 2.4, 3.1, 4.0, 2.8,
                        3.4, 4.3, 2.6, 3.3, 4.2, 2.5, 3.2, 4.1, 2.9, 3.5,
                        4.2, 2.4, 3.1, 4.0, 2.8, 3.4, 4.3, 2.6, 3.3, 4.2,
                        2.5, 3.2, 4.1, 2.9, 3.5, 4.2, 2.4, 3.1, 4.0, 2.8,
                        3.4, 4.3, 2.6, 3.3, 4.2, 2.5, 3.2, 4.1, 2.9, 3.5,
                        4.2, 2.4, 3.1, 4.0, 2.8, 3.4, 4.3, 2.6, 3.3, 4.2,
                        2.5, 3.2, 4.1, 2.9, 3.5, 4.2, 2.4, 3.1, 4.0, 2.8,
                        3.4, 4.3, 2.6, 3.3, 4.2, 2.5, 3.2, 4.1, 2.9, 3.5,
                        4.2, 2.4, 3.1, 4.0, 2.8, 3.4, 4.3, 2.6, 3.3, 4.2,
                        2.5, 3.2, 4.1, 2.9, 3.5, 4.2, 2.4, 3.1, 4.0, 2.8,
                        3.4, 4.3, 2.6, 3.3, 4.2]}

df = pd.DataFrame(data)

# conduct the one-way ANOVA across the three diets
result = stats.f_oneway(df[df['diet'] == 'A']['weight_loss'],
                        df[df['diet'] == 'B']['weight_loss'],
                        df[df['diet'] == 'C']['weight_loss'])

# print the results
print('F-statistic:', result.statistic)
print('p-value:', result.pvalue)

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

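# Create a dataframe with the data
# NOTE: each `...` stands for observations elided from the original. Python treats it
# as the literal Ellipsis object, so the model fit below fails until the full data are supplied.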
data = {'Program': ['A', 'A', 'A', ..., 'C', 'C', 'C'],
        'Experience': ['Novice', 'Novice', 'Experienced', ..., 'Experienced', 'Experienced', 'Experienced'],
        'Time': [10, 15, 20, ..., 30, 35, 40]}
df = pd.DataFrame(data)

# Fit the ANOVA model with interaction
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print(aov_table)


ValueError: ignored

In [5]:
import scipy.stats as stats

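# Create two lists with the test scores for the two groups
# NOTE: each `...` stands for scores elided from the original. Python treats it as the
# literal Ellipsis object, so the t-test below fails until the full lists are supplied.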
control_scores = [80, 85, 90, ..., 75, 80, 85]
experimental_scores = [85, 90, 95, ..., 80, 85, 90]

# Conduct the two-sample t-test
t_stat, p_val = stats.ttest_ind(control_scores, experimental_scores)

print("t-statistic:", t_stat)
print("p-value:", p_val)


TypeError: ignored
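
A runnable variant of the same comparison is sketched below with synthetic scores (random draws with assumed means of 80 and 85, not the elided data). It also shows how the test relates to ANOVA: with exactly two groups, the one-way F-statistic equals the square of the pooled-variance t-statistic.

```python
import numpy as np
from scipy import stats

# Synthetic test scores for illustration only
rng = np.random.default_rng(7)
control_scores = rng.normal(80, 5, 25)
experimental_scores = rng.normal(85, 5, 25)

# Standard two-sample t-test (assumes equal variances)
t_stat, p_val = stats.ttest_ind(control_scores, experimental_scores)

# Welch's t-test drops the equal-variance assumption
welch_t, welch_p = stats.ttest_ind(control_scores, experimental_scores, equal_var=False)

# One-way ANOVA on the same two groups: F equals t squared
f_stat, f_p = stats.f_oneway(control_scores, experimental_scores)

print("t-statistic:", t_stat, "p-value:", p_val)
print("Welch p-value:", welch_p)
print("F-statistic:", f_stat, "t squared:", t_stat ** 2)
```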

In [6]:
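import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Create a dataframe with the data
# NOTE: each `...` stands for observations elided from the original. Python treats it
# as the literal Ellipsis object, so the model fit below fails until the full data are supplied.
data = {'Store': ['A', 'B', 'C', ..., 'A', 'B', 'C'],
        'Day': [1, 1, 1, ..., 30, 30, 30],
        'Sales': [100, 120, 90, ..., 80, 110, 100]}
df = pd.DataFrame(data)

# Fit the repeated measures ANOVA: 'Sales' is the dependent variable, each 'Day' is the
# repeated unit (passed as the subject), and 'Store' is the within factor
model = AnovaRM(df, 'Sales', 'Day', within=['Store']).fit()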

# Print the ANOVA table
print(model.anova_table)


ValueError: ignored
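
The same repeated measures design can be run on synthetic data as a sketch. The sales figures below are random draws generated for the example (ten days, three stores, one observation per day and store), not the elided values from the cell above.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Synthetic, balanced data: every day has exactly one sales figure per store
rng = np.random.default_rng(3)
stores = ["A", "B", "C"]
rows = [
    {"Day": day, "Store": store, "Sales": rng.normal(100, 10)}
    for day in range(1, 11) for store in stores
]
df = pd.DataFrame(rows)

# Day is the repeated unit (subject) and Store is the within factor
results = AnovaRM(df, depvar="Sales", subject="Day", within=["Store"]).fit()
print(results.anova_table)
```

`AnovaRM` expects a balanced layout like this one; with missing day and store combinations it raises an error rather than imputing values.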