# Patsy contrast tutorial

This tutorial for the patsy package is based on a combination of Schad, Vasishth, Hohenstein, and Kliegl (2020), a tutorial on contrast coding written for R, and the UCLA guide to contrast coding systems for categorical variables: https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.

In [2]:
# Import packages for conducting the tutorial
import pandas as pd 
import numpy as np 
import seaborn as sns
from statsmodels.formula.api import ols

# The general issue

Data from experimental designs are traditionally analysed with some variant of the analysis of variance (ANOVA), chosen to suit the design. Standard practice is to fit the desired ANOVA, check the F-test for significance, and then run post hoc pairwise comparisons of all the differences with some correction (typically Bonferroni). This approach is limiting, though, when researchers have a priori, theory-driven hypotheses about specific comparisons before seeing the data.

The tutorial below uses frequentist statistics, but the contrasts are applied in a Bayesian setting in the notebooks that follow.

In [37]:
# Simulate data for tutorial
np.random.seed(1)
F1 = np.random.normal(0.8,0.2,5)
F2 = np.random.normal(0.6,0.2,5)
F3 = np.random.normal(0.4,0.2,5) 
F4 = np.random.normal(0.2,0.2,5)
n = 5

# Specify the data for the two-group dataset
# Put the data into a Python dictionary
data = {'Subject': range(n),
        'F1': F1,
        'F2': F2
        }

data_2 = {'Subject': range(n),
        'F1': F1,
        'F2': F2,
        'F3': F3,
        'F4': F4
        }

# Using the specified python dictionary above 
# to generate a Pandas dataframe. 
df = pd.DataFrame(data, columns = ['Subject','F1','F2'])
df = pd.melt(df,id_vars=['Subject'],var_name='F', value_name='DV')
df["Subject"] = df["Subject"] + 1


# Recode the factor levels as integers (F1 -> 1, F2 -> 2)
df["F_Recoded"] = df["F"].map({"F1": 1, "F2": 2}).astype(int)


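|   | Subject | F  | DV       | F_Recoded |
|---|---------|----|----------|-----------|
| 0 | 1       | F1 | 1.124869 | 1         |
| 1 | 2       | F1 | 0.677649 | 1         |
| 2 | 3       | F1 | 0.694366 | 1         |
| 3 | 4       | F1 | 0.585406 | 1         |
| 4 | 5       | F1 | 0.973082 | 1         |
| 5 | 1       | F2 | 0.139692 | 2         |
| 6 | 2       | F2 | 0.948962 | 2         |
| 7 | 3       | F2 | 0.447759 | 2         |
| 8 | 4       | F2 | 0.663808 | 2         |
| 9 | 5       | F2 | 0.550126 | 2         |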


In [5]:
df_2 = pd.DataFrame(data_2, columns = ['Subject','F1','F2','F3','F4'])
df_2 = pd.melt(df_2,id_vars=['Subject'],var_name='F', value_name='DV')     
df_2;    

# Treatment Contrasts
The default contrast in many statistical software packages is the treatment contrast. The name comes from its common use in medical settings, where treatments are compared to a baseline (control group).

In [51]:
# This is demonstrated here by fitting an OLS regression to the simulated dataset above

mod = ols("DV ~ F", data=df)
res = mod.fit()

print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                     DV   R-squared:                       0.234
Model:                            OLS   Adj. R-squared:                  0.138
Method:                 Least Squares   F-statistic:                     2.442
Date:                Sun, 22 Nov 2020   Prob (F-statistic):              0.157
Time:                        13:43:11   Log-Likelihood:                0.24060
No. Observations:                  10   AIC:                             3.519
Df Residuals:                       8   BIC:                             4.124
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.8111      0.118      6.867      0.0

## $$ \text{Intercept (F1)} = \hat{\mu}_1 = 0.81$$
## $$ \text{Slope (F2)} = \hat{\mu}_2 - \hat{\mu}_1 = -0.26$$

With treatment coding, the intercept is the mean of the baseline (control) group and the slope ($\beta_1$) is the difference between the two group means. Note that this coding was generated by default by the `ols` function, so be careful with interpretation when the contrasts have not been specified explicitly.
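As a quick check of this interpretation, the sketch below refits the two-group model with plain NumPy least squares instead of `ols`. The DV values are copied from the two-group data printed earlier; the dummy variable `x` (0 for F1, 1 for F2) is written out by hand here and is assumed to match what the formula interface builds by default.

```python
import numpy as np

# DV values copied from the two-group data (F1 first, then F2)
f1 = np.array([1.124869, 0.677649, 0.694366, 0.585406, 0.973082])
f2 = np.array([0.139692, 0.948962, 0.447759, 0.663808, 0.550126])
y = np.concatenate([f1, f2])

# Treatment coding written out by hand: x = 0 for F1 (baseline), x = 1 for F2
x = np.concatenate([np.zeros(5), np.ones(5)])
X = np.column_stack([np.ones_like(x), x])  # columns: intercept, dummy

# Ordinary least squares solution
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

print(f"intercept = {intercept:.4f}, mean(F1) = {f1.mean():.4f}")
print(f"slope     = {slope:.4f}, mean(F2) - mean(F1) = {f2.mean() - f1.mean():.4f}")
```

The intercept reproduces the F1 mean and the slope reproduces the difference between the F2 and F1 means, which is what treatment coding is expected to estimate.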

# General linear model formulation

Why do we get these regression coefficients?

We can understand the observed beta coefficients by writing the model above in general linear form, $y = \beta_0 + \beta_1 x$, where $x = 0$ for the F1 condition and $x = 1$ for the F2 condition. For F1, $y = \beta_0 + \beta_1 \cdot 0 = \beta_0$; for F2, $y = \beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1$. So $\beta_0$ is the F1 mean and $\beta_1$ is the F2 mean minus the F1 mean. This is a very simple case, and the defaults give the results described, but the order of the coding can be changed (see the later, more complex example). Reversing the coding makes F2 the baseline and simply flips the sign of the slope coefficient.
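To see the sign flip concretely, the following sketch fits the same data twice with NumPy, once with F1 coded 0 and once with F2 coded 0. The DV values are copied from the two-group data printed earlier, and the helper `fit_line` is hypothetical, written only for this illustration.

```python
import numpy as np

# DV values copied from the two-group data (F1 first, then F2)
f1 = np.array([1.124869, 0.677649, 0.694366, 0.585406, 0.973082])
f2 = np.array([0.139692, 0.948962, 0.447759, 0.663808, 0.550126])
y = np.concatenate([f1, f2])


def fit_line(x, y):
    """Return (beta_0, beta_1) of y = beta_0 + beta_1 * x by least squares."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


x_default = np.repeat([0.0, 1.0], 5)   # F1 = 0 (baseline), F2 = 1
x_reversed = np.repeat([1.0, 0.0], 5)  # F1 = 1, F2 = 0 (baseline)

b_default = fit_line(x_default, y)
b_reversed = fit_line(x_reversed, y)

print("default :", b_default)   # intercept is the F1 mean
print("reversed:", b_reversed)  # intercept is the F2 mean, slope has the opposite sign
```

Only the baseline and the sign of the slope change; the fitted group means are the same under both codings.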

# Formulating the contrasts above as NHSTs

Treatment coding in a categorical regression tests each level of the IV against a baseline level. In the two-level example above (the default coding), the null hypothesis significance test (NHST) is therefore $H_0: \beta_1 = 0$, because $\beta_1$ represents the difference between the two group means. The model also produces a test for the intercept term, $H_0: \beta_0 = 0$. This is usually of less interest, as it is essentially a one-sample t-test of whether the baseline group mean differs significantly from 0.
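The two tests can be reproduced by hand. The sketch below computes the standard errors and t statistics for both coefficients with NumPy, using the usual OLS formulas and the DV values copied from the two-group data printed earlier. It is an illustration of the arithmetic, not a replacement for the `ols` summary.

```python
import numpy as np

# DV values copied from the two-group data (F1 first, then F2)
f1 = np.array([1.124869, 0.677649, 0.694366, 0.585406, 0.973082])
f2 = np.array([0.139692, 0.948962, 0.447759, 0.663808, 0.550126])
y = np.concatenate([f1, f2])
x = np.repeat([0.0, 1.0], 5)               # treatment coding: F1 = 0, F2 = 1
X = np.column_stack([np.ones_like(x), x])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
df_resid = len(y) - X.shape[1]              # 10 observations - 2 parameters
sigma2 = residuals @ residuals / df_resid   # residual variance estimate
cov_beta = sigma2 * np.linalg.inv(X.T @ X)  # covariance matrix of the estimates
se = np.sqrt(np.diag(cov_beta))
t_values = beta / se                        # t statistics for H0: beta_j = 0

print("t for intercept:", round(t_values[0], 3))
print("t for slope    :", round(t_values[1], 3))
print("t^2 for slope  :", round(t_values[1] ** 2, 3))
```

With a single predictor, the squared t statistic for the slope equals the model F-statistic, so it should agree with the 2.442 reported in the regression summary, and the intercept t statistic should agree with the 6.867 shown there.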

# Reordering treatment coding from default

In [54]:
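from patsy.contrasts import Treatment
# Levels listed in reverse order
Two_levels_examp = [2,1]
# reference=0 selects the first listed level (here 2) as the baseline
contrast = Treatment(reference=0).code_without_intercept(Two_levels_examp)
print(contrast.matrix)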

[[0.]
 [1.]]


In [56]:
print(contrast.matrix[df.F_Recoded-1, :])

[[0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]]
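If the aim is to make F2 the baseline instead of F1, one option is to pass the level name to `Treatment(reference=...)`. The sketch below assumes the level labels `"F1"` and `"F2"` used in the data frame and only prints the resulting coding; it is an illustration and not necessarily how the later notebooks set the reference level.

```python
from patsy.contrasts import Treatment

levels = ["F1", "F2"]

# Default treatment coding: the first level (F1) is the reference
default_coding = Treatment().code_without_intercept(levels)

# Name F2 as the reference level instead
reversed_coding = Treatment(reference="F2").code_without_intercept(levels)

# Rows of each matrix follow the order of `levels`
print(default_coding.matrix)
print(default_coding.column_suffixes)
print(reversed_coding.matrix)
print(reversed_coding.column_suffixes)
```

In a model formula the same choice can be written as `DV ~ C(F, Treatment(reference='F2'))`, in which case the intercept estimates the F2 mean and the slope estimates the F1 mean minus the F2 mean.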


# Sum contrasts

This contrast differs from the treatment contrast: each coefficient compares the mean of one factor level against the grand mean of all levels of the factor.
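A minimal sketch of sum coding on the four-level data follows. It re-creates the simulated values with the same seed and parameters as the simulation cell, and it writes the coding matrix out by hand in NumPy so the arithmetic is visible. The matrix is meant to mirror the coding that `patsy.contrasts.Sum` produces with its default settings, where the last level is the omitted one.

```python
import numpy as np

# Re-create the four-level data with the same seed and parameters as the simulation cell
np.random.seed(1)
groups = [np.random.normal(m, 0.2, 5) for m in [0.8, 0.6, 0.4, 0.2]]  # F1..F4
y = np.concatenate(groups)
level = np.repeat(np.arange(4), 5)  # index (0..3) of each row's factor level

# Sum coding for four levels; the last level is coded -1 in every column
sum_coding = np.array([[ 1,  0,  0],
                       [ 0,  1,  0],
                       [ 0,  0,  1],
                       [-1, -1, -1]], dtype=float)

X = np.column_stack([np.ones(len(y)), sum_coding[level]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

cell_means = np.array([g.mean() for g in groups])
grand_mean = cell_means.mean()
print("intercept:", beta[0], "grand mean:", grand_mean)
print("slopes                 :", beta[1:])
print("F1..F3 minus grand mean:", cell_means[:3] - grand_mean)
```

The intercept estimates the grand mean, and each slope estimates how far one level's mean lies from it. The deviation for F4 is not estimated directly; it is the negative sum of the other three.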

# Repeated Contrasts

This contrast compares groups in successive order. For example, if the groups differ by a manipulation with low, medium, and high settings, low is compared to medium and then medium is compared to high.
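One way to build such a coding, sketched below under the assumption that the four simulated levels F1 to F4 are treated as ordered, is to write the desired comparisons as a hypothesis matrix and invert it, the approach described by Schad et al. (2020). The code uses NumPy only; in patsy this kind of coding corresponds to `patsy.contrasts.Diff`, which is not called here.

```python
import numpy as np

# Re-create the four-level data with the same seed and parameters as the simulation cell
np.random.seed(1)
groups = [np.random.normal(m, 0.2, 5) for m in [0.8, 0.6, 0.4, 0.2]]  # F1..F4
y = np.concatenate(groups)
level = np.repeat(np.arange(4), 5)  # index (0..3) of each row's factor level

# Hypothesis matrix: each row states what one coefficient should estimate
hypotheses = np.array([[0.25, 0.25, 0.25, 0.25],   # intercept: grand mean
                       [-1.0,  1.0,  0.0,  0.0],   # F2 - F1
                       [ 0.0, -1.0,  1.0,  0.0],   # F3 - F2
                       [ 0.0,  0.0, -1.0,  1.0]])  # F4 - F3

# Inverting the hypothesis matrix gives the coding (first column is the intercept)
coding = np.linalg.inv(hypotheses)

X = coding[level]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

cell_means = np.array([g.mean() for g in groups])
print(np.round(coding[:, 1:], 2))
print("slopes                :", beta[1:])
print("successive differences:", np.diff(cell_means))
```

Each slope reproduces the difference between one level's mean and the mean of the level before it, and the intercept is the grand mean.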

# Citations

Schad, D. J., Vasishth, S., Hohenstein, S., & Kliegl, R. (2020). How to capitalize on a priori contrasts in linear (mixed) models: A tutorial. Journal of Memory and Language, 110, 104038.

In [65]:
#df['F_Recoded'] = df['F_Recoded'].astype('category')
#df['F_Recoded'].cat.reorder_categories(['1', '2'])
