# ANOVA

This chapter is based on Danielle Navarro's excellent Learning Statistics with R book and was adapted for Python by Todd Gureckis (2020) and Shannon Tubridy (2021).


In [None]:
import numpy.random as npr
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.formula.api as smf
import statsmodels.api as sm
import pingouin as pg
import warnings


## ANOVA

In the stats_regression_part1.ipynb notebook we constructed linear models that attempt to describe some outcome measurement (like parent grumpiness) as a function of one or more predictor variables like quantity of parent sleep or baby sleep.

We could also describe ANOVA in the same way: we are attempting to account for variance in some outcome measure based on one or more predictor or grouping variables. 

In fact, in many cases regression and ANOVA are different ways of examining the same model. The key difference for our current purposes is that we will _usually_ use regression when the predictor variable(s) are continuous or otherwise numerically meaningful, and ANOVA when the predictor variables are categorical or are factors.
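To make the link between regression and ANOVA concrete, here is a minimal sketch on made-up data (the group means, sample sizes, and variable names are assumptions for illustration, not part of the analyses in this notebook). It fits a regression with 0/1 "dummy" columns for group membership and checks that the resulting F statistic matches the one from a one-way ANOVA computed with `scipy.stats.f_oneway()`.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)

# hypothetical data: three groups of 30 observations with different means
group_means = [0.0, 0.5, 1.0]
n_per_group = 30
y = np.concatenate([rng.normal(m, 1, n_per_group) for m in group_means])
labels = np.repeat([0, 1, 2], n_per_group)

# regression view: an intercept plus one 0/1 column for each non-reference group
X = np.column_stack([np.ones(len(y)),
                     (labels == 1).astype(float),
                     (labels == 2).astype(float)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# F statistic for the regression: explained vs. residual variance
ss_resid = np.sum((y - X @ beta) ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
df_model = X.shape[1] - 1
df_resid = len(y) - X.shape[1]
F_regression = ((ss_total - ss_resid) / df_model) / (ss_resid / df_resid)

# ANOVA view: the same groups passed to a one-way ANOVA
F_anova = stats.f_oneway(*[y[labels == g] for g in range(3)]).statistic

print(F_regression, F_anova)
```

The two printed values should agree up to floating point error, because both computations partition the same variance in the same way.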

### One way ANOVA

When we have one experimental manipulation with multiple levels, like different experimental groups, and we want to assess its effect on an outcome variable, we can use a one-way ANOVA.

First let's simulate some data in an experiment where we have four groups of participants.

We will make use of the numpy.random function normal(), which takes a mean, a standard deviation, and the number of samples desired, and returns an array of numbers sampled from a Gaussian distribution with that mean and standard deviation.

In [None]:
mu = 10
sd = 1
n_samples = 20

# get 20 samples from a normal distribution with mean 10 and sd 1
random_nums = npr.normal(mu, sd, n_samples)

sns.displot(random_nums)


We'll make fake data by getting data from some number of trials for each participant. 

Each participant will be assigned to one experimental group, and the groups will differ in the mean of the Gaussian distribution we are sampling from. This will have the effect of giving us a difference, on average, between groups but also some variability.

In [None]:
# four levels of exp_condition factor
exp_conditions = ['A', 'B', 'C', 'control']

# each participant will do 50 trials
n_trials = 50

# 25 people per exp_condition
participants_per_group = 25

total_participants = len(exp_conditions) * participants_per_group

# make a list of participant numbers using list comprehension
# for each i between 1 and number of participants, make a string
# that is e.g., `sub-1`, `sub-2`, etc
# and put the results of all that into the sub_nums list
sub_nums = [f'sub-{i}' for i in range(1, total_participants+1)]

# use slicing to view the first four 
# elements of the sub_nums list
sub_nums[:4]

In [None]:
# the list comprehension at the end of the last cell
# is equivalent to this for loop:

# set up an empty list so 
# we can append values to it
sub_nums = []

# loop over the numbers 1 through total_participants
# and make an id that is of the form 'sub-NN'
for i in range(1, total_participants+1):
    sub_nums.append(f'sub-{i}')
    
    
sub_nums[:4]

#### For loops using enumerate


Sometimes we loop directly over an iterable object like a list:

```
for s in some_list:
    print(s)
```

And sometimes we loop over a range of numbers and then use that to do some indexing on a list or other object:

```
some_list = ['a','b','c']

for i in range(0,len(some_list)):
    print(some_list[i])
    
```

An alternative to those is the enumerate() function which takes an iterable (like a list) as input and returns both the index positions _and_ the values. This can be quite useful for times when we both want everything from the list and want the current index position on each loop:

```
some_list = ['a','b','c']

for i, s in enumerate(some_list):
    print(f'i = {i}')
    print(f's = {s}')
```

##### Reminder: 'f' strings are a way to combine strings and variables

The f string is denoted by a leading f and then quotation marks around the rest. Strings or letters that you want to use as is can just be typed in and will appear in red (in Jupyter). Any variables you want to include go inside curly brackets {} and their current value will be inserted into the f string.
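As a quick illustration, here is a small sketch with made-up variable names (`participant` and `score` are hypothetical and not used elsewhere in this notebook). It also shows that a format specifier such as `:.2f` can be placed after the variable name to control how many decimal places are shown.

```python
participant = 7
score = 0.12345

# plain text is kept as is; the parts in {} are replaced by the variable values
label = f'sub-{participant}: mean = {score:.2f}'
print(label)  # sub-7: mean = 0.12
```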

In [None]:
# remind ourselves about the experimental variables we already defined
print(exp_conditions)
print(participants_per_group)
print(sub_nums[:5])
print(n_trials)

## make some simulated data

In [None]:
# loop over the experimental groups
# for each one, generate 25 participants of random data
# we can get 50 trials of data for each person and then
# for each participant, get their average response to be used
# in our ANOVA

# set up some empty lists so we can append results as we 
# go through the loop
exp_group = []
avg_response = []

# use enumerate(). This will give us the index
# position of each entry in exp_conditions (stored in i)
# as well as the group name itself (stored in 'group')
for i, group in enumerate(exp_conditions):

    # simulate individual participants:
    for s in range(participants_per_group):
        
        # get n_trials worth of data for this 'person'
        # we will use the index position taken from enumerate
        # as the input for the mean value in 
        # npr.normal(mu, sd, n_samples)
        s_trials = npr.normal(i, 1, n_trials)
        
        # get the average of this person's trial data
        # and store it in the avg_response list
        avg_response.append(np.mean(s_trials))
        
        # keep track of this person's experimental group
        exp_group.append(group)
    
    
# make a dataframe that has subject id, avg response, and group
one_way_df = pd.DataFrame({'id': sub_nums,
                          'response': avg_response,
                          'exp_group': exp_group})


one_way_df


Now we have 'data' from 100 people, 25 assigned to each of four experimental groups, and we have an average score for each person.
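Before plotting, it can help to check the group sizes and group means numerically. The following is a minimal sketch that rebuilds a similar simulated dataset under a fixed random seed (the name `demo_df` and the seed are assumptions made for this example) and summarizes it with `groupby()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# rebuild a dataset like the one above: 4 groups x 25 people,
# each person's score is the mean of 50 simulated trials
rows = []
for i, group in enumerate(['A', 'B', 'C', 'control']):
    for s in range(25):
        rows.append({'exp_group': group,
                     'response': rng.normal(i, 1, 50).mean()})
demo_df = pd.DataFrame(rows)

# mean, standard deviation, and number of people in each group
summary = demo_df.groupby('exp_group')['response'].agg(['mean', 'std', 'count'])
print(summary)
```

Each group should show a count of 25, and the group means should sit close to the values used in the simulation (0, 1, 2, and 3).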

We can visualize the mean response within each group using sns.catplot():

In [None]:
sns.catplot(x = 'exp_group', 
            y = 'response', 
            kind='bar',
            data=one_way_df)

It looks like there is an effect of exp group on response, and we can test this statistically with a one way ANOVA.

As mentioned above, there is a close link between ANOVA and linear regression. To start, we will set up our model using syntax we have already seen before in the regression notebook:

In [None]:
oneway_model = smf.ols(formula = 'response ~ C(exp_group)', data=one_way_df)
oneway_model = oneway_model.fit()

To this point we have done *exactly* the same thing as fitting a regression model, with the exception that we put C() around our predictor variable. This is to ensure that it is treated as a *categorical* variable. To get an ANOVA table from this we use sm.stats.anova_lm() and input the fitted model we've already estimated:

In [None]:
anova_table = sm.stats.anova_lm(oneway_model, typ=1)
anova_table




Well, well, well... looks like an ANOVA result. Notice also that the anova_lm() function can take a typ= argument in addition to the fitted model itself. This corresponds to the type 1, type 2, and type 3 sum of squares versions of ANOVA. We won't discuss the specifics of their application here, but will simply note the flexibility of the anova_lm() function.

Our resulting ANOVA table has the info we would need to interpret and write up our results. In this case there seems to be a main effect of experimental group, as indicated by the PR(>F) value (the p value) being much less than .05. We also have the F statistic itself, along with the degrees of freedom.
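To see where the numbers in an ANOVA table come from, here is a minimal sketch that computes them by hand for a one-way design. The data are freshly simulated and the names (`groups`, `ss_between`, `ss_within`) are assumptions for this example. The between-group and within-group sums of squares play the role of the `sum_sq` column, dividing each by its degrees of freedom gives `mean_sq`, their ratio is `F`, and the upper tail of the F distribution gives `PR(>F)`.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)

# hypothetical scores for four groups of 25 people (group means 0, 1, 2, 3)
groups = {name: rng.normal(mu, 1, 25)
          for mu, name in enumerate(['A', 'B', 'C', 'control'])}

all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()

# variability of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
# variability of individual scores around their own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups.values())

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

# F is the ratio of the two mean squares
F = (ss_between / df_between) / (ss_within / df_within)
# p value: probability of an F at least this large if there were no group effect
p = stats.f.sf(F, df_between, df_within)

print(f'F({df_between}, {df_within}) = {F:.2f}, p = {p:.3g}')
```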



In [None]:
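# row 0 of the table is the exp_group effect, row 1 is the residual
df1 = anova_table['df'].iloc[0]
df2 = anova_table['df'].iloc[1]
F = anova_table['F'].iloc[0]
p_val = anova_table['PR(>F)'].iloc[0]

results_string = f'F({df1:.0f}, {df2:.0f}) = {F:.2f}, p = {p_val:.3g}'
results_string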

#### Post-hoc pairwise comparisons

Our ANOVA result indicates that the exp group has an effect on response, but doesn't tell which groups are different from each other. This is the domain of _post-hoc_ follow-up comparisons.

There are a variety of ways to compute these comparisons. One common approach is to compute **Tukey's HSD** which will give us the significance of differences between each pairwise combination of groups in the data while maintaining Type I error expectation in the face of multiple comparisons.
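To see why a correction is needed, here is a small worked example. It counts the pairwise comparisons among four groups and computes how likely at least one false alarm would be if each comparison were run as a separate, uncorrected test at alpha = .05. The calculation assumes the tests are independent, which is a simplification, so treat the number as a rough guide.

```python
from itertools import combinations

groups = ['A', 'B', 'C', 'control']

# every unordered pair of groups
pairs = list(combinations(groups, 2))

alpha = 0.05
# chance of at least one false alarm across all pairs (assuming independent tests)
familywise = 1 - (1 - alpha) ** len(pairs)

print(len(pairs))            # 6
print(round(familywise, 3))  # 0.265
```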

There is a Tukey HSD function in the statsmodels library. Let's import it and use it.

The function takes in an `endog=` argument which is the outcome variable, a `groups=` argument which is an array of group labels (one per observation, here the levels of our single factor), and a desired experiment-wide false alarm rate (`alpha=`).

In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# perform Tukey's test
tukey = pairwise_tukeyhsd(endog=one_way_df['response'],
                          groups=one_way_df['exp_group'],
                          alpha=0.05)

print(tukey)

Each row of the output from Tukey's test shows us the two groups being compared (group1 and group2), the average difference between them, the adjusted p value (accounting for multiple comparisons), the lower and upper bounds of the confidence interval on that difference, and whether the null hypothesis of no difference is rejected.

In our simulated data, each group was different from every other group, but in real data this isn't necessarily the case, even if there is a main effect of group.
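If you want to filter or sort the comparisons, it can be handy to collect them in a DataFrame. The following is a sketch on freshly simulated data (the names `demo_df`, `demo_tukey`, and `tukey_df` are assumptions for this example). It relies on the `groupsunique`, `meandiffs`, `pvalues`, and `reject` attributes of the result object, and it assumes the comparisons are ordered like `itertools.combinations()` over the sorted group names, which matches the printed table in the statsmodels versions we are aware of.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(3)
names = ['A', 'B', 'C', 'control']

# hypothetical per-person average scores: 25 people per group
demo_df = pd.DataFrame({
    'exp_group': np.repeat(names, 25),
    'response': np.concatenate([rng.normal(i, 0.15, 25) for i in range(4)]),
})

demo_tukey = pairwise_tukeyhsd(endog=demo_df['response'],
                               groups=demo_df['exp_group'],
                               alpha=0.05)

# assumption: comparisons follow the order of combinations() over the group names
pairs = list(combinations(demo_tukey.groupsunique, 2))
tukey_df = pd.DataFrame({
    'group1': [a for a, b in pairs],
    'group2': [b for a, b in pairs],
    'meandiff': demo_tukey.meandiffs,
    'p_adj': demo_tukey.pvalues,
    'reject': demo_tukey.reject,
})

# keep only the comparisons where the null hypothesis was rejected
print(tukey_df[tukey_df['reject']])
```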

### Two (or more) way ANOVA

Many experiments will include a crossing of multiple experimental conditions or factors. 

For example we might have four participant groups and two kinds of tasks people are doing, and we can ask whether the experimental group has an effect on the responses, whether the task has an effect on the responses, and/or whether there is an interaction of those two factors on performance.

We'll generate some data to include a second factor that we can examine.

In [None]:
exp_conditions = ['A', 'B', 'C', 'control']
trial_types = ['easy', 'hard']

n_trials = 50

participants_per_group = 25
total_participants = len(exp_conditions) * participants_per_group

# make a list of participant numbers using list comprehension
sub_nums = [f'sub-{i}' for i in range(1, total_participants+1)]

# loop over the experimental groups
# for each one, generate 25 participants of random data
# we can get 50 trials of data for each person and then
# for each participant, get their average response to be used
# in our ANOVA

# set up some empty lists so we can append results as we 
# go through the loop
exp_group = []
avg_response = []
t_types = []
subs = []

# use enumerate(). This will give us the index
# position of each entry in exp_conditions (stored in i)
# as well as the group name itself (stored in 'group')
for i, group in enumerate(exp_conditions):

    # simulate individual participants:
    for s in range(participants_per_group):
        
        for it, tr in enumerate(trial_types):
        
        
            # get n_trials worth of data for this 'person'
            # we will use the index position taken from enumerate
            # as the input for the mean value in 
            # npr.normal(mu, sd, n_samples)
            
            # for each group except 'control', increase the mean by one
            # if it's the hard condition
            if group != 'control':
                s_trials = npr.normal(i+it, 1, n_trials)
            else:
                s_trials = npr.normal(i, 1, n_trials)

            # get the average of this person's trial data
            # and store it in the avg_response list
            avg_response.append(np.mean(s_trials))

            # keep track of this person's experimental group
            exp_group.append(group)
            
            # track the trial type
            t_types.append(tr)
            
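            # the subject number: use the unique ids in sub_nums so that
            # ids are not repeated across experimental groups
            subs.append(sub_nums[i * participants_per_group + s])
    
# make a dataframe that has subject id, avg response, group, and trial type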
two_way_df = pd.DataFrame({'id': subs,
                          'response': avg_response,
                          'exp_group': exp_group,
                          'trial_type': t_types})


two_way_df


First let's plot the data using sns.catplot() but now adding the hue= argument so that we can account for two grouping or categorical variables.

We'll also use the `kind='point'` argument to generate a plot of the type that is often used for plotting multi-way ANOVA.

In [None]:
sns.catplot(x = 'trial_type',
            y = 'response', 
            hue = 'exp_group', 
            kind = 'point',
            data=two_way_df)

plt.ylim(-2,5)

In these data it looks like the response varies based on the group (the lines for the different groups are vertically separated) and there is also a difference based on trial type (as we move across the x-axis, hard trials are usually associated with an increased response value compared to easy trials).

Lastly, it looks like there might be an interaction between exp group and trial type: the effect of trial type does not seem to be the same in the control group compared to groups A, B, and C.
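One way to put numbers on what the plot shows is to tabulate the mean of each group by trial type cell and then take the hard minus easy difference within each group. The sketch below rebuilds a similar dataset under a fixed seed (the names `demo_df` and `cell_means` are assumptions for this example). If the differences are not roughly the same across groups, that points toward an interaction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# rebuild a dataset like the one above: the hard condition adds 1 to the
# mean for every group except 'control'
rows = []
for i, group in enumerate(['A', 'B', 'C', 'control']):
    for s in range(25):
        for it, tr in enumerate(['easy', 'hard']):
            shift = it if group != 'control' else 0
            rows.append({'exp_group': group,
                         'trial_type': tr,
                         'response': rng.normal(i + shift, 1, 50).mean()})
demo_df = pd.DataFrame(rows)

# one row per group, one column per trial type, cell = mean response
cell_means = demo_df.pivot_table(index='exp_group',
                                 columns='trial_type',
                                 values='response',
                                 aggfunc='mean')

# the effect of trial type within each group
cell_means['hard_minus_easy'] = cell_means['hard'] - cell_means['easy']
print(cell_means.round(2))
```

In this simulation the difference should be close to 1 for groups A, B, and C and close to 0 for the control group.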

Now we can do a two-way ANOVA to check all this.

In [None]:
twoway_model = smf.ols(formula = 'response ~ C(exp_group) + C(trial_type) + C(exp_group):C(trial_type)', 
                       data=two_way_df)

twoway_model = twoway_model.fit()

twoway_table = sm.stats.anova_lm(twoway_model, typ=2)
twoway_table

Take a look at our formula for the model:

>`'response ~ C(exp_group) + C(trial_type) + C(exp_group):C(trial_type)'`

What we've done is include our two categorical predictors (exp_group and trial_type), specifying them as categorical with C(), and then also asked the model to estimate the effect of the interaction between those factors using the `:` to join them.

For convenience one can also use the following syntax in order to get a full crossing (main effect and interaction) of the factors:

>`'response ~ C(exp_group)*C(trial_type)'`
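To see what the interaction term adds, and to confirm that the two ways of writing the formula describe the same model, here is a minimal sketch on freshly simulated data (the names `demo_df`, `additive`, `explicit`, and `shorthand` are assumptions for this example). It compares the number of estimated coefficients in a model without the interaction against the two full models.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# hypothetical per-person average scores, 25 per group x trial type cell
rows = []
for i, group in enumerate(['A', 'B', 'C', 'control']):
    for it, tr in enumerate(['easy', 'hard']):
        shift = it if group != 'control' else 0
        for s in range(25):
            rows.append({'exp_group': group,
                         'trial_type': tr,
                         'response': rng.normal(i + shift, 0.15)})
demo_df = pd.DataFrame(rows)

# main effects only
additive = smf.ols('response ~ C(exp_group) + C(trial_type)', data=demo_df).fit()
# main effects plus interaction, written out with ':'
explicit = smf.ols('response ~ C(exp_group) + C(trial_type) + C(exp_group):C(trial_type)',
                   data=demo_df).fit()
# the same thing using the '*' shorthand
shorthand = smf.ols('response ~ C(exp_group)*C(trial_type)', data=demo_df).fit()

# number of coefficients in each model
print(len(additive.params), len(explicit.params), len(shorthand.params))
# the two full models estimate the same set of coefficients
print(list(explicit.params.index) == list(shorthand.params.index))
```

The additive model has 5 coefficients (intercept, three group contrasts, one trial type contrast), while the full models have 8: the interaction contributes one extra coefficient for each combination of a non-reference group with the non-reference trial type.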

In [None]:
twoway_model = smf.ols(formula = 'response ~ C(exp_group)*C(trial_type)', 
                       data=two_way_df)

twoway_model = twoway_model.fit()

twoway_table = sm.stats.anova_lm(twoway_model, typ=2)
twoway_table

Once again we can use anova_lm() on our ols results to obtain the ANOVA table. Now we see a row for each factor, along with p values for the main effect, and the interaction row (C(exp_group):C(trial_type)).

In our case we have significant main effects as well as a significant interaction, as indicated by the small p values (PR(>F)).

**NOTE**: Python is displaying our very small p values in scientific notation. So 1.29e-166 means take 1.29 and move the decimal 166 places to the left. This is a very small number.  

Perhaps a clearer example is that 5e-2 is .05:

In [None]:
p = 5e-2
p

We will not spend much time talking about ANOVA follow-up analyses for interactions because they are not as straightforward as doing Tukey's test. Fundamentally what we want to do is understand the interaction. One approach is to plot the data like before, and the interaction is often clear.

Another approach is to use pg.pairwise_ttests() to get the pairwise comparison between all crossings of the factors and focus on the rows that have a combined Contrast, and look for places where the significance varies across conditions.

The pg.pairwise_ttests() function takes in the dependent variable (dv=), the grouping factors (between=[]), the data, and an optional adjustment for multiple comparisons (padjust=).

In [None]:
# note: newer pingouin releases rename this function to pg.pairwise_tests()
pg.pairwise_ttests(dv='response', 
                   between=['exp_group', 'trial_type'], 
                   data=two_way_df, 
                   padjust='bonferroni')


The output gives us many t-test results, but we are interested in the interaction. In particular, we are interested in whether the effect of easy and hard trials is the same for each experimental group. The last four rows here show us those results: for groups A, B, and C, the test between easy and hard is significant (p<.05), whereas for the control group that comparison is not significant. That is our interaction.
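The same follow-up logic can be sketched without pingouin by running one easy vs. hard t-test inside each group and applying a Bonferroni correction by hand. This is a simplified stand-in for the table above, not a reproduction of it: the data are freshly simulated, the names (`demo_df`, `simple_effects`) are assumptions for this example, and, as in the analysis above, easy and hard scores are treated as independent samples.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(6)

# rebuild a dataset like the one above: the hard condition adds 1 to the
# mean for every group except 'control'
rows = []
for i, group in enumerate(['A', 'B', 'C', 'control']):
    for s in range(25):
        for it, tr in enumerate(['easy', 'hard']):
            shift = it if group != 'control' else 0
            rows.append({'exp_group': group,
                         'trial_type': tr,
                         'response': rng.normal(i + shift, 1, 50).mean()})
demo_df = pd.DataFrame(rows)

# one test per group, so the Bonferroni correction multiplies each p by 4
n_tests = demo_df['exp_group'].nunique()

simple_effects = {}
for group, sub_df in demo_df.groupby('exp_group'):
    easy = sub_df.loc[sub_df['trial_type'] == 'easy', 'response']
    hard = sub_df.loc[sub_df['trial_type'] == 'hard', 'response']
    t, p = stats.ttest_ind(hard, easy)
    # corrected p value, capped at 1
    simple_effects[group] = min(p * n_tests, 1.0)

for group, p_corrected in simple_effects.items():
    print(f'{group}: corrected p = {p_corrected:.3g}')
```

An interaction shows up here as a trial type effect that is present in some groups (A, B, and C in this simulation) but not reliably present in others.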