In [None]:
1:
    
ANOVA (Analysis of Variance) is a statistical technique used to compare the means of two or more groups to determine whether
they are significantly different. There are several assumptions that must be met in order to use ANOVA effectively:

1.Normality assumption: The data within each group (more precisely, the residuals of the model) should be approximately
normally distributed. This means that the distribution of scores within each group should follow a bell-shaped curve.

2.Homogeneity of variance assumption: The variances of the data in each group should be equal. This
means that the spread of scores within each group should be similar.

3.Independence assumption: The observations should be independent of each other, both within and between groups. This
means that there should be no systematic relationship between one score and another, for example between the scores
in one group and the scores in another group.

Examples of violations that could impact the validity of ANOVA results include:

1.Non-normality: If the data within each group are strongly skewed or contain extreme outliers, the p-value of the
F-test can be inaccurate, especially with small samples. ANOVA is fairly robust to mild non-normality when the
groups are reasonably large.

2.Heterogeneity of variance: If the variances of the data in each group differ substantially, the F-test can reject
the null hypothesis too often or too rarely, particularly when the group sizes are also unequal. This can lead to
incorrect conclusions about whether or not there are significant differences between the groups.

3.Dependence: If the observations are not independent, the within-group variance no longer reflects the true error
and the p-value is unreliable. For example, if the same participants are used in multiple groups, their scores are
correlated across groups, and a repeated measures ANOVA is the appropriate analysis.

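It is important to check for violations of these assumptions before conducting ANOVA, for example with the Shapiro-Wilk
test for normality and Levene's test for equal variances, and to use appropriate corrective measures if necessary.
If the assumptions are severely violated, another statistical technique, such as Welch's ANOVA (unequal variances)
or the Kruskal-Wallis test (non-normal data), may be more appropriate for the data.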
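
The checks described above can be sketched as follows. This is a minimal illustration rather than a complete diagnostic workflow: the three groups are made-up values, and the 0.05 cut-off is only the conventional choice.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three independent groups (illustrative values only)
groups = {
    "A": np.array([5.0, 7.0, 9.0, 8.0, 6.0]),
    "B": np.array([3.0, 2.0, 4.0, 5.0, 7.0]),
    "C": np.array([9.0, 8.0, 6.0, 5.0, 7.0]),
}

# Normality: Shapiro-Wilk test on each group (a small p-value suggests non-normality)
normality_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    normality_p[name] = p
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")

# Homogeneity of variance: Levene's test across all groups
# (a small p-value suggests that the group variances differ)
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene p-value: {levene_p:.3f}")

# Independence cannot be tested this way; it follows from the study design.
```

With samples this small the tests have little power, so a plot of the residuals is usually worth inspecting as well.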
    

    

In [None]:
2:
There are three types of ANOVA:

1.One-Way ANOVA: This type of ANOVA is used when comparing the means of two or more groups on
a single independent variable or factor. For example, a researcher may want to compare the mean
test scores of students in three different schools.

2.Two-Way ANOVA: This type of ANOVA is used when comparing the means of two or more groups on
two independent variables or factors. For example, a researcher may want to compare the mean 
test scores of students in three different schools, while also examining the effect of gender
on test scores.

3.Three-Way ANOVA: This type of ANOVA is used when comparing the means of two or more groups 
on three independent variables or factors. For example, a researcher may want to compare the
mean test scores of students in three different schools, while also examining the effect of
gender and ethnicity on test scores.

The type of ANOVA to use depends on the number of independent variables (factors) being examined. One-Way ANOVA is used
when there is a single factor, and it tests whether the group means differ across the levels of that factor. Two-Way ANOVA
is used when there are two factors, and it tests the main effect of each factor on the outcome variable as well as the
interaction between the two factors. Three-Way ANOVA is used when there are three factors, and it tests the three main
effects, the three two-way interactions, and the three-way interaction.
    
    
    

In [None]:
3:
    
Partitioning of variance in ANOVA refers to the process of dividing the total variation in the
outcome variable into different components, each of which represents a different source of variation.
The components are then used to determine the significance of the factors being examined in the analysis.

There are three main components of variation in ANOVA, each measured as a sum of squares:

1.Between-group variance: This component represents the variation in the outcome variable between the different
groups being compared in the analysis. It is calculated by taking the difference between each group mean and the
overall mean of the outcome variable, squaring it, multiplying it by the number of observations in the group, and
summing over the groups.

2.Within-group variance: This component represents the variation in the outcome variable within each group being
compared in the analysis. It is calculated by taking the difference between each observation and the mean of its
group, squaring it, and summing over all observations. The within-group variance is also referred to as the error
variance or residual variance.

3.Total variance: This component represents the overall variation in the outcome variable across all groups being
compared in the analysis. It equals the sum of the between-group and within-group sums of squares.

Dividing the between-group and within-group sums of squares by their degrees of freedom (k - 1 and N - k for k groups
and N observations) gives the mean squares, and the ratio of the two mean squares is the F-statistic.

Understanding the partitioning of variance is important because it allows researchers to determine the contribution
of each factor being examined in the analysis to the overall variation in the outcome variable. This information is 
used to determine whether there are significant differences between the groups being compared, and to identify the most
important factors that are associated with those differences. By understanding the partitioning of variance, researchers
can make more accurate and informed conclusions about the results of their analysis.
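
Written out for a one-way design, the partition looks as follows. The notation is an assumption made for this sketch: x_ij is observation i in group j, n_j is the size of group j, k is the number of groups, and N is the total number of observations.

```latex
\begin{aligned}
SS_{\text{total}}   &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2 \\
SS_{\text{between}} &= \sum_{j=1}^{k} n_j \, (\bar{x}_j - \bar{x})^2 \\
SS_{\text{within}}  &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 \\
SS_{\text{total}}   &= SS_{\text{between}} + SS_{\text{within}} \\
F &= \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
\end{aligned}
```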



    

In [None]:
4:
    
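In Python, we can use NumPy to calculate the total sum of squares (SST), explained (between-group)
sum of squares (SSE), and residual (within-group) sum of squares (SSR) in a one-way ANOVA. Note that some
textbooks use the opposite abbreviations, with SSE for the error sum of squares and SSR for the regression sum of squares.

Here is example code that demonstrates how to calculate these values: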
    
    

In [4]:
import scipy.stats as stats
import numpy as np
# Sample data
group1 = [5, 7, 9, 8, 6]
group2 = [3, 2, 4, 5, 7]
group3 = [9, 8, 6, 5, 7]

# Concatenate the data from all groups into a single array
data = np.concatenate([group1, group2, group3])

# Compute the total sum of squares (SST)
mean = np.mean(data)
SST = np.sum((data - mean)**2)

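# Compute the explained (between-group) sum of squares (SSE).
# Multiplying by len(group1) is valid here because all three groups have the same size.
group_means = np.array([np.mean(group1), np.mean(group2), np.mean(group3)])
SSE = np.sum((group_means - mean)**2 * len(group1))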

# Compute the residual sum of squares (SSR)
SSR = SST - SSE

print("SST: ", SST)
print("SSE: ", SSE)
print("SSR: ", SSR)


SST:  60.933333333333344
SSE:  26.13333333333333
SSR:  34.80000000000001
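
As a cross-check, the sums of squares can be turned into an F-statistic and compared with `scipy.stats.f_oneway`. The sketch below reuses the three sample groups from the cell above; the variable names are chosen here for illustration, and the between-group sum is weighted by each group's own size so that it would also work for unequal groups.

```python
import numpy as np
from scipy import stats

# Same sample data as in the cell above
group1 = [5, 7, 9, 8, 6]
group2 = [3, 2, 4, 5, 7]
group3 = [9, 8, 6, 5, 7]
groups = [np.array(g, dtype=float) for g in (group1, group2, group3)]

data = np.concatenate(groups)
grand_mean = data.mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 and N - k
df_between = len(groups) - 1
df_within = len(data) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)

# scipy computes the same statistic directly
f_scipy, p_value = stats.f_oneway(*groups)

print("F (manual):", f_manual)
print("F (scipy): ", f_scipy)
print("p-value:   ", p_value)
```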


In [None]:
5:
In a two-way ANOVA, we can calculate the main effects and the interaction effect in Python by fitting
a linear model with the ols function from the statsmodels library and passing it to the anova_lm function.

Here is example code that demonstrates how to calculate the main effects and the interaction
effect in a two-way ANOVA using Python:
    
    
    

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data; the file is expected to contain the columns 'factor1', 'factor2' and 'outcome'
data = pd.read_csv("data.csv")

# Fit a linear model with both main effects and their interaction.
# C(...) marks a column as categorical; 'a * b' expands to a + b + a:b.
model = ols('outcome ~ C(factor1) * C(factor2)', data=data).fit()

# Type II ANOVA table: one row per main effect, one for the interaction, one for the residual.
# Separate one-way tests per factor would ignore the other factor and cannot test the interaction.
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# F-statistic and p-value of each effect
print("Main effect 1: ", anova_table.loc['C(factor1)', ['F', 'PR(>F)']].tolist())
print("Main effect 2: ", anova_table.loc['C(factor2)', ['F', 'PR(>F)']].tolist())
print("Interaction effect: ", anova_table.loc['C(factor1):C(factor2)', ['F', 'PR(>F)']].tolist())

#Output
# The values depend on the contents of data.csv, which is not included here.

In [None]:
6:
If you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, it means
that there is evidence of significant differences between the groups at the conventional 0.05 significance level.
Specifically, the F-statistic of 5.23 indicates that the variation between the group means is about 5.23 times the
variation within the groups, and the p-value of 0.02 indicates that, if all group means were truly equal, an
F-statistic at least this large would be observed only about 2% of the time.

To interpret these results, you would reject the null hypothesis that the means of the groups are all equal and
conclude that at least one group mean differs from the others. However, you would need to perform further analysis
(such as post-hoc tests) to determine which groups are significantly different from each other. Additionally, you
would need to consider the context of the study, including the size of the differences and not only their statistical
significance, before drawing conclusions or making decisions based on the results.
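
The link between an F-statistic and its p-value can be sketched with the F-distribution in `scipy.stats`. The degrees of freedom below are hypothetical (3 groups and 30 observations), because the question does not state them, so the computed p-value will not match the quoted 0.02 exactly.

```python
from scipy import stats

f_statistic = 5.23
df_between, df_within = 2, 27   # hypothetical: k - 1 and N - k for 3 groups, 30 observations
alpha = 0.05

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the F-statistic that would be exactly significant at alpha
critical_value = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print("p-value:", round(p_value, 4))
print("critical F:", round(critical_value, 3))
print("reject null hypothesis:", reject_null)
```

Comparing the p-value with alpha and comparing the F-statistic with the critical value are two views of the same decision rule.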
    
    
    
    
    

In [None]:
7:
    
Handling missing data in a repeated measures ANOVA can be challenging because each participant
has multiple measurements, and missing data on any of these measurements can impact the results.
Here are a few common methods for handling missing data in a repeated measures ANOVA:

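1.'Listwise deletion' (complete-case analysis): This method involves deleting any participant who has missing data on
one or more measurements. (Pairwise deletion, by contrast, drops a participant only from the comparisons that involve
the missing measurement.) This can result in a smaller sample size and potentially biased results if the missing
data are related to the outcome or other variables in the analysis.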

2.'Mean imputation': This method involves replacing missing values with the mean value of the available data. 
This can artificially reduce the variability in the data and result in biased estimates of the standard errors and p-values.

3.'Last observation carried forward (LOCF)': This method involves replacing missing values with the value of
the last measurement that was observed. This can result in biased estimates of the treatment effect if the
missing data are related to changes in the outcome over time.

4.'Multiple imputation': This method involves creating multiple plausible values for each missing observation based
on the observed data and imputing the missing values based on these plausible values. This can produce unbiased
estimates of the treatment effect and standard errors if the assumptions of the imputation model are met.

The potential consequences of using different methods to handle missing data in a repeated measures ANOVA can
be significant. Deleting participants with incomplete data reduces the sample size and the statistical power, and it
biases the results unless the data are missing completely at random. Mean imputation and LOCF both understate the
variability in the data, which makes standard errors and p-values too small, and LOCF can additionally bias the
estimated treatment effect when the outcome changes over time. Multiple imputation is generally considered the most
robust of these methods, but it requires careful consideration of the assumptions of the imputation model and can be
computationally intensive. Ultimately, the choice of method for handling missing data in a repeated measures ANOVA
should be based on the nature of the missing data and the assumptions of the analysis.
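
The first three methods can be sketched with pandas on a small wide-format table. The data and column names are hypothetical: one row per participant and one column per time point. Multiple imputation is left out because it needs a dedicated imputation model.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: rows are participants, columns are time points
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 9.0, 11.0],
        "t2": [11.0, np.nan, 10.0, 12.0],
        "t3": [13.0, 14.0, np.nan, 12.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Deleting incomplete participants: keep only rows without any missing value
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with the mean of its time point
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each participant's last observed value forward across time points
locf = scores.ffill(axis=1)

print(complete_cases)
print(mean_imputed)
print(locf)
```

Only two of the four participants survive the deletion, which shows how quickly the sample shrinks, while both imputation methods keep all four rows at the cost of inventing values.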
    
    

In [None]:
8:
    
Post-hoc tests are used after an ANOVA to determine which group means are significantly different 
from each other when the overall F-test is significant. Here are some common post-hoc tests used
after ANOVA, along with situations where they might be appropriate:

1.Tukey's HSD (Honestly Significant Difference): This test is appropriate when you want to compare all possible
pairs of group means and the group variances are similar. It was designed for equal sample sizes (the Tukey-Kramer
variant handles unequal sizes), and it controls the family-wise error rate.

2.Bonferroni correction: This method is appropriate when you want to control the family-wise error rate for a small
number of planned comparisons. It divides the alpha level by the number of comparisons (equivalently, it multiplies
each p-value by that number), so it becomes very conservative when the number of comparisons is large.

3.Scheffe's test: This test is appropriate when you want to test complex contrasts (for example, one group against the
average of two others) in addition to pairwise comparisons, and it works with unequal sample sizes. It is the most
conservative of these tests while controlling the family-wise error rate.

4.Games-Howell test: This test is appropriate when you have unequal sample sizes and/or unequal variances and want to
compare all possible pairs of group means. It is an alternative to Tukey's HSD that does not assume equal variances.

A situation where a post-hoc test might be necessary is when you conduct an ANOVA and find a significant F-statistic,
indicating that there are significant differences among the group means. However, the ANOVA does not tell you which 
groups are significantly different from each other. In this case, you would use a post-hoc test to make pairwise comparisons
between the group means and determine which differences are significant. For example, suppose you conduct an ANOVA on the test
scores of students who took three different courses and find a significant F-statistic. A post-hoc test, such as Tukey's HSD or
Bonferroni correction, would be used to determine which pairs of courses had significantly different mean test scores.
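
The course example can be sketched with `pairwise_tukeyhsd` from statsmodels. The scores below are invented for illustration (five students per course), so only the structure of the result carries over to real data.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical test scores, five students in each of three courses
scores = np.array([78, 82, 85, 80, 79,    # course A
                   88, 91, 86, 90, 89,    # course B
                   79, 81, 83, 80, 82])   # course C
courses = np.repeat(["A", "B", "C"], 5)

# Compare every pair of courses while holding the family-wise error rate at 5%
result = pairwise_tukeyhsd(endog=scores, groups=courses, alpha=0.05)
print(result)

# One entry per pair, in the order A-B, A-C, B-C
print(result.reject)
```

In this made-up data, course B stands apart from A and C, while A and C are not distinguishable, which is the kind of detail the overall F-test cannot provide.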



In [None]:
9:
    
Here is example Python code to conduct a one-way ANOVA on the weight loss data for
diets A, B, and C:
 

In [7]:
import pandas as pd
import scipy.stats as stats

# create a dataframe with weight loss data
df = pd.DataFrame({'diet': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'weight_loss': [10.1, 9.5, 8.3, 9.7, 8.9, 7.8, 11.2, 10.9, 9.8, 8.6, 7.9, 6.8, 8.5, 7.6, 6.5, 9.2, 8.7, 7.4, 10.5, 10.1, 9.2, 8.4, 7.5, 6.4, 11.8, 10.6, 9.4, 9.2, 8.1, 7.3, 10.3, 9.8, 8.7, 8.5, 7.4, 6.3, 9.9, 8.8, 7.5, 11.1, 10.5, 9.3, 8.3, 7.2, 6.2, 11.4, 10.2, 9.1, 8.7, 7.6, 6.8]})

# conduct a one-way ANOVA
f_statistic, p_value = stats.f_oneway(df.loc[df['diet'] == 'A', 'weight_loss'], df.loc[df['diet'] == 'B', 'weight_loss'], df.loc[df['diet'] == 'C', 'weight_loss'])

# print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)


F-statistic: 10.434233853260244
p-value: 0.00017270289255455048


In [None]:
Interpreting the results, we can see that the p-value (about 0.00017) is far below the typical alpha level
of 0.05, so we reject the null hypothesis that the three diets have the same mean weight loss. An F-statistic
of about 10.43 means that the variation between the diet means is roughly ten times the variation within
the diets, which is very unlikely to arise by chance alone. The ANOVA does not show which diets differ,
so we would need to conduct further analysis, such as post-hoc tests, to determine which diets have
significantly different mean weight loss.

In [None]:
10:
    To conduct a two-way ANOVA in Python, we can use the statsmodels package.
    Here is example code to analyze the data:
    

In [None]:
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create the dataset
data = {'Program': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
                    'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
                    'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
        'Experience': ['Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice',
                       'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced',
                       'Experienced', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice',
                       'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced',
                       'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice',
                       'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice'],
        'Time': [22, 24, 21, 23, 19, 20, 18, 21, 24, 25, 27, 28, 25, 26, 23,
                 30, 31, 28, 27, 29, 30, 34, 32, 31, 28, 29, 28, 26, 27, 30,
                 26, 24, 28, 27, 29, 32, 30, 31, 29, 27, 31, 32, 34, 33, 30,
                 28, 29, 27, 28, 30, 25, 28, 27, 25, 26]}

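# The three lists above have different lengths (45, 50 and 55 entries), so
# pd.DataFrame(data) would raise a ValueError. As a stopgap (an assumption, not
# part of the original design), keep only the first 45 entries of each list.
n = min(len(values) for values in data.values())
df = pd.DataFrame({name: values[:n] for name, values in data.items()})

# Fit the ANOVA model with both main effects and their interaction
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(aov_table)

#Output:
# A table with the columns sum_sq, df, F and PR(>F) and one row each for
# C(Program), C(Experience), C(Program):C(Experience) and Residual.
# The values depend on how the data lists above are completed.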







In [None]:
11:
 Here is the Python code to conduct a two-sample t-test and post-hoc test on the data:


In [None]:
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

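# Create a dataframe with the data.
# The score list has 31 values, so 15 control + 15 experimental labels would not fit.
# Treating every score after the first 15 as experimental is an assumption.
scores = [85, 89, 76, 78, 92, 81, 79, 83, 88, 77, 82, 84, 80, 87, 75, 86, 90, 72, 73, 74, 71, 69, 70, 68, 67, 65, 66, 63, 61, 62, 60]
data = pd.DataFrame({
    'score': scores,
    'group': ['control']*15 + ['experimental']*(len(scores) - 15)
})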

# Compute the two-sample t-test
t_statistic, p_value = stats.ttest_ind(data[data['group'] == 'control']['score'], data[data['group'] == 'experimental']['score'])
print("Two-sample t-test results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Conduct a post-hoc test (Tukey's HSD)
posthoc = pairwise_tukeyhsd(data['score'], data['group'])
print("\nPost-hoc test (Tukey's HSD) results:")
print(posthoc)


In [None]:
#Output:
# The cell prints the t-statistic and p-value of the two-sample t-test, followed by
# the Tukey HSD table with a single control vs. experimental row (meandiff, p-adj,
# lower and upper confidence bounds, reject). The values depend on the scores entered above.


In [None]:
If the p-value of the two-sample t-test is below 0.05, we conclude that the mean test scores of the
control and experimental groups differ significantly. The sign of the t-statistic (control minus
experimental) shows which group scored higher. For the scores listed above, the first 15 values
(control) average 82.4 and the remaining values average about 69.8, so it is the control group
that scored higher. With only two groups, Tukey's HSD makes a single comparison and leads to the
same conclusion as the t-test, so a post-hoc test adds information only when three or more groups
are compared.

In [None]:
12:
Since this is a repeated measures design, I am using a one-way repeated measures ANOVA.
I am using the 'pingouin' library to perform the analysis.
    
   

In [None]:
import pingouin as pg
import pandas as pd
import random

# Create a sample dataframe with random sales data:
# each of the 30 days is measured in all three stores
data = {'Day': list(range(1, 31)) * 3,
        'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
        'Sales': [random.randint(50, 100) for i in range(90)]}

df = pd.DataFrame(data)

# Perform the one-way repeated measures ANOVA.
# 'Store' is the within-subject factor and each 'Day' is the repeated unit (subject).
aov = pg.rm_anova(dv='Sales', within='Store', subject='Day', data=df)
print(aov)

# Perform post-hoc pairwise comparisons between stores with paired t-tests and a
# Bonferroni correction. pg.pairwise_tukey only supports between-subject factors.
# (pairwise_tests is called pairwise_ttests in older pingouin releases.)
posthoc = pg.pairwise_tests(data=df, dv='Sales', within='Store', subject='Day', padjust='bonf')
print(posthoc)


In [None]:
#Output:
# The sales figures are random, so the numbers differ on every run.
# rm_anova prints one row for the Store effect (F, uncorrected p-value, effect size).
# pairwise_tests prints one row for each pair of stores (A-B, A-C, B-C) with the
# uncorrected and the Bonferroni-corrected p-value.
  