### What are Generalized Linear Models?

GLMs were one of those concepts that didn't click until I got lucky with a very specific view on the matter. When I first read the model specification, it looked arbitrary and unjustified. Where was it coming from? The subject finally made sense when I saw it from the angle of its original motivation: a generalization of linear regression. Someone broke linear regression into pieces, generalized each of them and then put them back together, yielding a machine with more knobs. That's really it.

So if it worked for me, it can work for you. I'll now share that view.

## What are GLMs?

As advertised, we'll start in the bubblegum land of linear regression. There, we assume our dependent variable (a scalar $y$) is normally distributed *given* a linear combination of our independent variables. This can be written as:

$$
\begin{align}
p(y) & = \mathcal{N}(y|\mu,\sigma^2) \tag{1}\\
\mu & = \mathbf{x}^T\boldsymbol{\beta} \tag{2}\\
\end{align}
$$

I'll call (1) the 'distribution assumption' and (2) the 'parameter relation assumption'. In a GLM, we generalize both.
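To make the two assumptions concrete, here is a minimal sketch that simulates data satisfying (1) and (2) and then recovers $\boldsymbol{\beta}$ with least squares. The specific numbers (`beta_true`, `sigma`, `n`) are made up for illustration, and the first column of `X` is a constant 1 that I've added to act as an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth, chosen only for illustration.
beta_true = np.array([1.0, -2.0])
sigma = 0.5

# Each row of X is one observation's x; the first column is a constant 1.
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# (2) parameter relation assumption: mu = x^T beta
mu = X @ beta_true

# (1) distribution assumption: y ~ N(mu, sigma^2)
y = rng.normal(loc=mu, scale=sigma)

# Under (1) and (2), least squares is the maximum-likelihood estimate of beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

With this many observations, `beta_hat` should land close to `beta_true`.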

## Generalize the distribution assumption

Here, we generalize from:

"$y$ is **normally distributed** given our independent variables"

to

"$y$ has **any distribution within the exponential family** given our independent variables"

This is a good idea, considering the exponential family is pretty awesome (link). To refresh, the *vector* $\mathbf{y}$ has a distribution from the exponential family if its probability [1] is

$$
\begin{align}
p(\mathbf{y})=&\frac{1}{Z(\boldsymbol{\theta})}h(\mathbf{y})\exp\Big\{{T(\mathbf{y})\cdot\boldsymbol{\theta}\Big\}}\\
\end{align}
$$

for some parameter vector $\boldsymbol{\theta}$. Our choices of the functions $h(\cdot)$ and $T(\cdot)$ determine which distribution within the exponential family we're in. So we *could* choose them such that we get the normal distribution. But a different choice might give us the binomial or Poisson or something else. Note we can't choose $Z(\boldsymbol{\theta})$ - it's there to ensure our probabilities sum (or integrate) to 1. Our choice of $h(\cdot)$ and $T(\cdot)$ will force $Z(\cdot)$ to be something.
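As a quick sanity check of that form, here is a small sketch for the scalar Poisson case. The choices $h(y)=1/y!$ and $T(y)=y$ are the standard ones for the Poisson, and they force $Z(\theta)=\exp(e^{\theta})$. The helper names (`exp_family_pmf`, `poisson_pmf`) and the rate of 3 are just mine for illustration.

```python
import math

def exp_family_pmf(y, theta, h, T, Z):
    """p(y) = h(y) * exp(T(y) * theta) / Z(theta), for scalar y and theta."""
    return h(y) * math.exp(T(y) * theta) / Z(theta)

# Our two free choices, picked to land on the Poisson distribution.
def h(y):
    return 1.0 / math.factorial(y)

def T(y):
    return y

# Z is not a free choice: it is whatever makes the probabilities sum to 1.
# Here sum_y exp(theta * y) / y! = exp(exp(theta)).
def Z(theta):
    return math.exp(math.exp(theta))

def poisson_pmf(y, rate):
    """The familiar textbook form of the Poisson pmf, for comparison."""
    return rate ** y * math.exp(-rate) / math.factorial(y)

rate = 3.0              # hypothetical Poisson rate
theta = math.log(rate)  # the parameter theta that corresponds to that rate

# Truncate the infinite support at 60; the remaining tail is negligible here.
probs = [exp_family_pmf(y, theta, h, T, Z) for y in range(60)]
print(sum(probs))
print(probs[2], poisson_pmf(2, rate))
```

The two printed probabilities on the last line should agree, and the sum should be very close to 1.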

One thing to call out is that our dependent variable just went from necessarily a scalar to a vector (which could also be a scalar). So we've sneakily generalized to multi-output predictions.


## Before we do our next generalization...

Let's pretend we've chosen $h(\cdot)$ and $T(\cdot)$ to recreate linear regression. Now, let's write what we have so far [2]:

$$
\begin{align}
p(y) & = \frac{1}{Z(\theta)}h(y)\exp\Big\{{T(y)\cdot\theta\Big\}} \tag{1}\\
\mu & = \mathbf{x}^T\boldsymbol{\beta} \tag{2}\\
\end{align}
$$

Notice an issue? How does $\mu$ connect to $\boldsymbol{\theta}$? We have to build that bridge. We can get that done with a function we'll call $\Psi$, which is defined to solve our problem: $\theta = \Psi(\mu)$. In other words, if I select $h(\cdot)$ and $T(\cdot)$ to recreate linear regression, $\Psi(\cdot)$ will fall out as required. It is purely a result of how we rewrite the normal distribution into the exponential family form.
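To see $\Psi$ fall out, here is one way to do the rewrite. It assumes $\sigma^2$ is a fixed, known number (so the only parameter is $\mu$) and takes $T(y)=y$; other ways of splitting up the terms are possible. Expanding the square in the normal density gives:

$$
\begin{align}
\mathcal{N}(y|\mu,\sigma^2) & = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{(y-\mu)^2}{2\sigma^2}\Big\}\\
& = \underbrace{\exp\Big\{-\frac{\mu^2}{2\sigma^2}\Big\}}_{1/Z}\;\underbrace{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{y^2}{2\sigma^2}\Big\}}_{h(y)}\;\exp\Big\{\underbrace{y}_{T(y)}\cdot\underbrace{\frac{\mu}{\sigma^2}}_{\theta}\Big\}\\
\end{align}
$$

So under these assumptions $\theta = \Psi(\mu) = \mu/\sigma^2$, and $Z(\theta)=\exp\{\sigma^2\theta^2/2\}$ is forced on us once $h(\cdot)$ and $T(\cdot)$ are picked.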

Now un-pretend and let's get back to generalizing.

## Generalize the parameter relation assumption

Now we generalize the parameter relation from:

"the scalar mean of $y$ is a linear combination of our independent variables"

to

"**some simple function of our *vector* mean** is a linear combination of our independent variables"

That 'simple' function is known as a link function (call it $g(\cdot)$), so this may be written as $g(\boldsymbol{\mu})=\mathbf{x}^T\boldsymbol{\beta}$. It's a generalization because $g(\cdot)$ *could* be the identity function and the length of $\boldsymbol{\mu}$ *could* be one, yielding what we're familiar with. But neither has to hold, and then we land elsewhere. Note that if $\boldsymbol{\mu}$ is a vector with length greater than 1, then $\boldsymbol{\beta}$ is a matrix.

Also, the 'simple' requirement on $g(\cdot)$ means it's invertible, so we may write $\boldsymbol{\mu}=g^{-1}(\mathbf{x}^T\boldsymbol{\beta})$.
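For a feel of what a link function and its inverse look like, here is a small sketch using the logit link, a common choice when the mean is a probability. The values of `x` and `beta` are made up for illustration.

```python
import numpy as np

def logit(mu):
    """Link g: maps a mean in (0, 1) onto the whole real line."""
    return np.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse link g^{-1}: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

x = np.array([1.0, 2.0])      # hypothetical independent variables
beta = np.array([0.5, -1.0])  # hypothetical coefficients

eta = x @ beta        # the linear combination x^T beta
mu = inv_logit(eta)   # the mean implied by mu = g^{-1}(x^T beta)

print(eta, mu, logit(mu))
```

Applying `logit` to `mu` returns `eta` again, which is the invertibility we asked for.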

So now we have both generalizations. Let's put it all together.

$$
\begin{align}
p(\mathbf{y}) & = \frac{1}{Z(\boldsymbol{\theta})}h(\mathbf{y})\exp\Big\{{T(\mathbf{y})\cdot\boldsymbol{\theta}\Big\}} \tag{1}\\
\boldsymbol{\theta} &  = \Psi(\boldsymbol{\mu}) \tag{'bridge'}\\
\boldsymbol{\mu} & = g^{-1}(\mathbf{x}^T\boldsymbol{\beta})\\
\end{align}
$$
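Here is a rough end-to-end sketch of those three lines for one specific choice: a scalar Poisson $y$ with a log link. In that case $g^{-1}$ is $\exp$, the bridge is $\Psi(\mu)=\log\mu$, and so $\theta=\mathbf{x}^T\boldsymbol{\beta}$. The fitting loop is plain Newton's method on the log-likelihood, which is one reasonable way to estimate $\boldsymbol{\beta}$, not the only one. All the numbers and names (`beta_true`, `inv_link`, `psi`) are mine for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth, chosen only for illustration.
beta_true = np.array([0.5, -0.3])
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])

def inv_link(eta):
    """g^{-1} for the log link: mu = exp(x^T beta)."""
    return np.exp(eta)

def psi(mu):
    """The 'bridge' for the Poisson: theta = log(mu)."""
    return np.log(mu)

# Simulate from the model, reading the three lines bottom to top.
mu = inv_link(X @ beta_true)
theta = psi(mu)
y = rng.poisson(lam=np.exp(theta))  # Poisson with rate exp(theta) = mu

# Fit beta by Newton's method on the Poisson log-likelihood.
beta_hat = np.zeros(2)
for _ in range(25):
    mu_hat = inv_link(X @ beta_hat)
    grad = X.T @ (y - mu_hat)            # gradient of the log-likelihood
    hess = X.T @ (X * mu_hat[:, None])   # negative Hessian, X^T diag(mu) X
    beta_hat = beta_hat + np.linalg.solve(hess, grad)

print(beta_hat)
```

With this much data, `beta_hat` should end up close to `beta_true`. Swapping in a different distribution or link means changing `inv_link`, `psi` and the gradient and Hessian lines to match.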

#### Generalize the parameter relation assumption



Now we can focus on




Here, things get a little convoluted, so let's do the following to make things extra clear. Let's express linear regression with our previous generalization incorporated:


In linear regression, our parameters were $\mu$ and $\sigma^2$. However, if we want to write the normal distribution in the exponential family form, we need a reparameterization - a function that will take us from the $\mu$ and $\sigma^2$-world to the $\boldsymbol{\theta}$-world. Call this function $\Psi(\cdot): \boldsymbol{\theta} = \Psi([\mu, \sigma^2])$.






So incorporate *just* our previous generalization,


However, to force linear regression to reside within the exponential family, we require a reparameterization. That is, we need another function $\Psi$.



so $\boldsymbol{\theta}=[\mu, \sigma^2]$ in that case. The relation to the independent variables was created by picking one parameter ($\mu$) and making it a linear combination of those variables ($\mu = \mathbf{x}^T\boldsymbol{\beta}$). Let's do something similar. Let's take the general parameter vector $\boldsymbol{\theta}$ and split it into two other vectors: $\boldsymbol{\theta}=[\boldsymbol{\theta}_D,\boldsymbol{\theta}_C]$. The former group is analogous to $\mu$ in the sense that it'll **d**epend on our independent variables. The latter is analogous to $\sigma^2$ in that it won't - we will **c**hoose it exogenously [2].

### Footnotes

[1] 'Probability' isn't correct here. I should say probability density or probability mass.



[2] But wait, doesn't $\sigma^2$ depend on our independent variables? It always ends up as the variance of our residuals, and those residuals depend on our independent variables! Yes, but that routine of estimating the variance after fitting is a convenience specific to linear regression that doesn't generalize. In effect, we can treat $\sigma^2$ as something we chose exogenously, and that idea does generalize.