CurrentModule = StatsModels
Modeling categorical data
To convert categorical data into a numerical representation suitable for modeling,
StatsModels implements a variety of contrast coding systems.
Each contrast coding system maps a categorical vector with $k$ levels onto
$k-1$ linearly independent model matrix columns.
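
As a minimal sketch of what such a mapping looks like, the following plain-Julia snippet builds a dummy (treatment-style) coding matrix by hand for a variable with $k = 3$ levels. The helper `dummy_matrix` is hypothetical and not part of StatsModels; it only illustrates that the $k-1$ indicator columns, together with an intercept column, are linearly independent.

```julia
using LinearAlgebra

# Hypothetical helper (not a StatsModels function): one indicator column per
# non-reference level, so k levels give k-1 columns.
function dummy_matrix(x::AbstractVector, lvls::AbstractVector; base = first(lvls))
    others = [l for l in lvls if l != base]
    return Float64[xi == l for xi in x, l in others]
end

x = ["a", "b", "c", "a", "c"]
X = dummy_matrix(x, ["a", "b", "c"])   # 5 rows, 2 columns

# Adding an intercept column keeps the matrix at full column rank.
M = hcat(ones(length(x)), X)
rank(M) == size(M, 2)
```

Rows belonging to the reference level `"a"` are all zeros in `X`, so that level is absorbed by the intercept.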
The following contrast coding systems are implemented:
- `DummyCoding`
- `EffectsCoding`
- `HelmertCoding`
- `ContrastsCoding`
How to specify contrast coding
The default contrast coding system is `DummyCoding`. To override this, use the
`contrasts` argument when constructing a `ModelFrame`:
mf = ModelFrame(@formula(y ~ 1 + x), df, contrasts = Dict(:x => EffectsCoding()))
To change the contrast coding for one or more variables in place, use
Contrast coding systems
DummyCoding EffectsCoding HelmertCoding ContrastsCoding
Special internal contrasts
Categorical variables in formulas
Generating model matrices from multiple variables, some of which are categorical, requires special care. The reason for this is that rank-$k-1$ contrasts are appropriate for a categorical variable with $k$ levels when it aliases other terms, making it partially redundant. Using rank-$k$ for such a redundant variable will generally result in a rank-deficient model matrix and a model that can't be identified.
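
The rank deficiency described above can be checked numerically. This is a small sketch in plain Julia, using hand-built indicator columns and no StatsModels calls; the variable names are illustrative only.

```julia
using LinearAlgebra

x = ["a", "b", "c", "a", "c"]
lvls = ["a", "b", "c"]

# Rank-k ("full") indicator coding: one column per level.
full = Float64[xi == l for xi in x, l in lvls]

# With an intercept, the k indicator columns sum to the intercept column,
# so this 4-column matrix only has rank 3.
with_intercept = hcat(ones(length(x)), full)

# Without an intercept, the same k columns are linearly independent (rank 3).
ranks = (rank(with_intercept), rank(full))
```

Here the variable aliases the intercept, so rank-$k$ coding is appropriate only when the intercept is absent.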
A categorical variable in a term aliases the term that remains when that
variable is dropped. For example, with categorical `a`:
- In `a`, the sole variable `a` aliases the intercept term `1`.
- In `a&b`, the variable `a` aliases the main effect term `b`, and vice versa.
- In `a&b&c`, the variable `a` aliases the interaction term `b&c` (regardless of whether `b` and `c` are categorical).
If a categorical variable aliases another term that is present elsewhere in the
formula, we call that variable redundant. A variable is non-redundant when
the term that it aliases is not present elsewhere in the formula. For example, with categorical `a` and `b`:
- In `y ~ 1 + a`, the `a` in the main effect of `a` aliases the intercept `1`, which is present, so it is redundant.
- In `y ~ 0 + a`, `a` does not alias any other terms and is non-redundant.
- In `y ~ 1 + a + a&b`:
  - The `b` in `a&b` is redundant because it aliases the main effect `a`.
  - The `a` in `a&b` is non-redundant because it aliases `b`, which is not present anywhere else in the formula.
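
The definitions of aliasing and redundancy can be sketched in a few lines of plain Julia. Terms are modeled here as sets of variable names, with the intercept as the empty set. The helpers `aliased_term` and `is_redundant` are hypothetical names used for illustration and are not StatsModels functions.

```julia
# The term that `var` aliases within `term`: whatever remains when `var` is dropped.
aliased_term(term::Set{Symbol}, var::Symbol) = setdiff(term, [var])

# A variable is redundant in `term` if the term it aliases is present in the formula.
is_redundant(formula, term, var) = aliased_term(term, var) in formula

intercept = Set{Symbol}()          # the intercept contains no variables
a  = Set([:a])
ab = Set([:a, :b])

f1 = [intercept, a]                # y ~ 1 + a
f2 = [a]                           # y ~ 0 + a
f3 = [intercept, a, ab]            # y ~ 1 + a + a&b

is_redundant(f1, a, :a)            # `a` aliases the intercept, which is present
is_redundant(f2, a, :a)            # no intercept in the formula
is_redundant(f3, ab, :b)           # `b` in a&b aliases the main effect `a`
is_redundant(f3, ab, :a)           # `a` in a&b aliases `b`, which is absent
```

This version only considers terms written explicitly in the formula.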
When constructing a `ModelFrame` from a `Formula`, each term is checked for
non-redundant categorical variables. Any such non-redundant variables are
"promoted" to full rank in that term by using `FullDummyCoding` instead
of the contrasts used elsewhere for that variable.
One additional complexity is introduced by promoting non-redundant variables to
full rank. For the purpose of determining redundancy, a full-rank dummy coded
categorical variable implicitly introduces the term that it aliases into the
formula. Thus, in
`y ~ 1 + a + a&b + b&c`:
- In `a&b`, the variable `a` aliases the main effect `b`, which is not explicitly present in the formula. This makes it non-redundant and so its contrast coding is promoted to `FullDummyCoding`, which implicitly introduces the main effect of `b`.
- Then, in `b&c`, the variable `c` is now redundant because it aliases the main effect of `b`, and so it keeps its original contrast coding system.
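
The promotion rule, including the implicit introduction of aliased terms, can be sketched as follows. This is an illustration under stated assumptions, not the StatsModels implementation: `promoted_variables` is a hypothetical helper, every variable is assumed to be categorical, and terms and variables are assumed to be checked in the order in which they are written.

```julia
# `terms` lists each term's variables in formula order; an empty vector is the intercept.
# Returns the (term, variable) pairs that would be promoted to full rank.
function promoted_variables(terms::Vector{Vector{Symbol}})
    seen = Set{Set{Symbol}}()      # terms present so far, explicitly or implicitly
    promoted = Tuple{Vector{Symbol},Symbol}[]
    for term in terms
        for var in term
            # The term that `var` aliases: `term` with `var` dropped.
            aliased = Set{Symbol}(filter(v -> v != var, term))
            if !(aliased in seen)
                push!(promoted, (term, var))   # non-redundant: promote
                push!(seen, aliased)           # promotion implicitly adds the aliased term
            end
        end
        push!(seen, Set{Symbol}(term))         # the term itself is now present
    end
    return promoted
end

# y ~ 1 + a + a&b + b&c
terms = [Symbol[], [:a], [:a, :b], [:b, :c]]
result = promoted_variables(terms)
```

Under these assumptions, `a` is promoted in `a&b`, and `c` in `b&c` keeps its original coding because the main effect of `b` was implicitly introduced. The sketch also promotes `b` in `b&c`, since the term it aliases, `c`, is not otherwise present.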