---
title: "Regression with a Categorical Independent Variable"
description: "Incorporating categorical predictors (i.e., factor variables) into a regression model using dummy variables."
categories: [Regression, GLM]
image: "Figures/4.png"
order: 5
---

## General Principles
To study the relationship between a categorical independent variable and a continuous dependent variable, we use a _categorical model_, which applies _stratification_.

_Stratification_ models how the *k* categories of the independent variable affect the continuous target variable by estimating a separate regression coefficient for each category. To implement stratification, categorical variables are often encoded using [<span style="color:#0D6EFD">one-hot encoding 🛈</span>]{#ohe} or by converting categories to [<span style="color:#0D6EFD">indices 🛈</span>]{#indices}.
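
As a minimal sketch of the two encodings, the snippet below converts a short list of category labels to zero-based indices and to one-hot rows using only the Python standard library. The labels and variable names are hypothetical illustration values and are not part of the BI package's API.

```python
# Hypothetical category labels, used only for illustration
labels = ["ape", "monkey", "ape", "lemur", "monkey"]

# Index encoding: map each distinct category to an integer 0..k-1
categories = sorted(set(labels))
to_index = {c: i for i, c in enumerate(categories)}
indices = [to_index[c] for c in labels]

# One-hot encoding: one column per category, exactly one 1 per row
k = len(categories)
one_hot = [[int(i == j) for j in range(k)] for i in indices]

for label, idx, row in zip(labels, indices, one_hot):
    print(label, idx, row)
```

With index encoding, a coefficient vector of length *k* can be looked up once per observation, which is what the `a[index_clade]` expression in the Example section does.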


## Considerations
::: callout-note
- We have the same considerations as for [Regression for a Continuous Variable](1.&#32;Linear&#32;Regression&#32;for&#32;continuous&#32;variable.qmd).
 
- Because we estimate a regression coefficient for each of the *k* categories, the prior must be given a shape equal to the number of categories *k* (see the comments in the code).
  
- To compare differences between categories, we need to compute the distribution of the differences between categories, known as the contrast distribution. **Never compare confidence intervals or p-values directly**.


:::
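
A minimal sketch of a contrast distribution, assuming posterior draws for two category coefficients are already available as equal-length lists. The draws here are simulated stand-ins rather than real model output, and the helper name `interval` is hypothetical.

```python
import random

rng = random.Random(0)

# Stand-in posterior draws for two category coefficients (simulated, not real results)
a_cat0 = [rng.gauss(0.5, 0.2) for _ in range(4000)]
a_cat1 = [rng.gauss(0.1, 0.2) for _ in range(4000)]

# Contrast distribution: the difference is computed draw by draw
contrast = [x - y for x, y in zip(a_cat0, a_cat1)]

def interval(draws, mass=0.89):
    """Central interval holding `mass` of the draws (simple percentile method)."""
    s = sorted(draws)
    lo = s[int((1 - mass) / 2 * (len(s) - 1))]
    hi = s[int((1 + mass) / 2 * (len(s) - 1))]
    return lo, hi

mean_diff = sum(contrast) / len(contrast)
prob_positive = sum(d > 0 for d in contrast) / len(contrast)
print(mean_diff, interval(contrast), prob_positive)
```

The difference is taken within each draw because posterior samples are joint: pairing the coefficients by sample index preserves any correlation between them, which comparing two separate intervals would ignore.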

## Example
Below is an example of code that demonstrates Bayesian regression with an independent categorical variable using the Bayesian Inference (BI) package. The data consist of one continuous dependent variable (*kcal_per_g*), representing the caloric value of milk per gram, a categorical independent variable (*index_clade*), representing species clade membership, and a continuous independent variable (*mass*), representing the mass of individuals in the clade. The goal is to estimate the differences in milk calories between clades. This example is based on @mcelreath2018statistical.

::: {.panel-tabset group="language"}
### Python

```python
from BI import bi

# Setup device------------------------------------------------
m = bi(platform='cpu')

# Import Data & Data Manipulation ------------------------------------------------
# Import
data_path = m.load.milk(only_path = True)
m.data(data_path, sep=';')
m.index(["clade"]) # Convert clade names into index
m.scale(['kcal_per_g']) # Scale

# Define model ------------------------------------------------
def model(kcal_per_g, index_clade, mass):
    a = m.dist.normal(0, 0.5, shape=(4,), name = 'a') # shape based on the number of clades
    b = m.dist.normal(0, 0.5, shape=(4,), name = 'b')
    s = m.dist.exponential( 1, name = 's')
    mu = a[index_clade]+b[index_clade]*mass
    m.dist.normal(mu, s, obs=kcal_per_g)


# Run mcmc ------------------------------------------------
m.fit(model) # Optimize model parameters through MCMC sampling

# Summary ------------------------------------------------
m.summary()
```

### R
```R
library(BayesianInference)
m=importBI(platform='cpu')

# Load csv file
m$data(m$load$milk(only_path = T), sep=';')
m$scale(list('kcal.per.g')) # Scale
m$index(list('clade')) # Convert clade names into index
m$data_to_model(list('kcal_per_g', 'index_clade')) # Send to model (convert to jax array)

# Define model ------------------------------------------------
model <- function(kcal_per_g, index_clade){
  # Parameter prior distributions
  beta = bi.dist.normal( 0, 0.5, name = 'beta', shape = c(4))  # shape based on the number of clades
  sigma = bi.dist.exponential(1, name = 's')
  # Likelihood
  bi.dist.normal(beta[index_clade], sigma, obs=kcal_per_g)
}

# Run mcmc ------------------------------------------------
m$fit(model) # Optimize model parameters through MCMC sampling

# Summary ------------------------------------------------
m$summary() # Get posterior distributions
```

### Julia
```julia
using BayesianInference

# Setup device------------------------------------------------
m = importBI(platform="cpu")

# Import Data & Data Manipulation ------------------------------------------------
# Import
data_path = m.load.milk(only_path = true)
m.data(data_path, sep=';')
m.index("clade") # Convert clade names into index
m.scale(["kcal_per_g"]) # Scale

# Define model ------------------------------------------------
@BI function model(kcal_per_g, index_clade, mass)
    a = m.dist.normal(0, 0.5, shape=(4,), name = "a") # shape based on the number of clades
    b = m.dist.normal(0, 0.5, shape=(4,), name = "b")
    s = m.dist.exponential( 1, name = "s")
    mu = a[index_clade]+b[index_clade]*mass
    m.dist.normal(mu, s, obs=kcal_per_g)
end

# Run mcmc ------------------------------------------------
m.fit(model)  # Optimize model parameters through MCMC sampling

# Summary ------------------------------------------------
m.summary() # Get posterior distributions
```
:::

::: callout-caution
For R users: when working with indices, ensure that 1) indices are integers (e.g., `as.integer(index_clade)`) and 2) indices start at 0 (e.g., `as.integer(index_clade) - 1`).
:::

## Mathematical Details
### *Frequentist formulation*
We model the relationship between the categorical input feature (X) and the target variable (Y) using the following equation:

$$
Y_i = \alpha + \beta_k X_i + \varepsilon_i
$$

Where:

- $Y_i$ is the dependent variable for observation *i*.
  
- $\alpha$ is the intercept term.
  
- $\beta_k$ are the regression coefficients for each of the _k_ categories.
  
- $X_i$ is the encoded categorical input variable for observation *i*.
  
- $\varepsilon_i$ is the error term for observation *i*.

We can interpret $\beta_k$ as the effect of category *k* on $Y$ relative to the baseline (usually one of the categories or the intercept).
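
As a worked sketch of this interpretation: with a single categorical predictor and dummy coding, the least-squares intercept equals the mean of the baseline category, and each coefficient equals that category's mean minus the baseline mean. The data values below are made up for illustration.

```python
from statistics import mean

# Made-up observations: (category, outcome)
data = [("A", 1.0), ("A", 3.0), ("B", 4.0), ("B", 6.0), ("C", 0.0), ("C", 2.0)]

# Group the outcomes by category
groups = {}
for cat, y in data:
    groups.setdefault(cat, []).append(y)

baseline = "A"                      # reference category
alpha = mean(groups[baseline])      # intercept = baseline mean
# Coefficient of each non-baseline category = its mean minus the baseline mean
beta = {k: mean(v) - alpha for k, v in groups.items() if k != baseline}

# Fitted value per observation: intercept plus the coefficient of its category
fitted = [alpha + beta.get(cat, 0.0) for cat, _ in data]
print(alpha, beta, fitted)
```

Changing `baseline` changes every coefficient but leaves the fitted values untouched, which is why coefficients must always be read relative to the chosen reference category.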

### *Bayesian formulation*
In the Bayesian formulation, we assign [<span style="color:#0D6EFD">priors 🛈</span>]{#prior} to each parameter. We can express the Bayesian regression model, including its prior distributions, as follows:

$$
Y_i \sim \text{Normal}(\alpha + \beta_K X_i, \sigma)
$$

$$
\alpha \sim \text{Normal}(0,1)
$$

$$
\beta_K \sim \text{Normal}(0,1)
$$

$$
\sigma \sim \text{Exponential}(1)
$$

Where:

- $Y_i$ is the dependent variable for observation *i*.
  
- $\alpha$ is the intercept term, which in this case has a unit-normal prior.
  
- $\beta_K$ are the slope coefficients for the _K_ distinct categories of the independent variable, which also have unit-normal priors.
  
- $X_i$ is the encoded categorical input variable for observation *i*. 
  
- $\sigma$ is a standard deviation parameter, which here has an Exponential prior that constrains it to be positive.
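
The model above can also be read generatively. The following is a minimal prior predictive sketch using only the Python standard library, assuming the index form of the linear predictor (the intercept plus the coefficient of each observation's category). The number of categories, the category indices, and the function name `prior_predictive_draw` are illustrative assumptions, not part of the BI package.

```python
import random

def prior_predictive_draw(index, k, rng):
    """Draw parameters from the priors, then simulate one outcome per observation."""
    alpha = rng.gauss(0.0, 1.0)                      # alpha ~ Normal(0, 1)
    beta = [rng.gauss(0.0, 1.0) for _ in range(k)]   # beta_k ~ Normal(0, 1)
    sigma = rng.expovariate(1.0)                     # sigma ~ Exponential(1), always positive
    # Each outcome is centered on alpha plus the coefficient of its category
    y = [rng.gauss(alpha + beta[i], sigma) for i in index]
    return alpha, beta, sigma, y

rng = random.Random(1)
index = [0, 1, 2, 3, 0, 1]   # hypothetical category index for six observations
alpha, beta, sigma, y = prior_predictive_draw(index, k=4, rng=rng)
print(sigma, y)
```

Repeating the draw many times and inspecting the simulated outcomes is one way to check whether the priors imply plausible values before fitting the model to data.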

## Notes
::: callout-note

- We can apply multiple variables similarly to [Chapter 2: Multiple Continuous Variables](2.&#32;Multiple&#32;continuous&#32;Variables.qmd).

- We can apply interaction terms similarly to [Chapter 3: Interaction between Continuous Variables](3.&#32;Interaction&#32;between&#32;continuous&#32;variables.qmd).
:::

## Reference(s)
::: {#refs}
:::