### What are Generalized Linear Models?

GLMs were one of those concepts that didn't click until I was lucky enough to see them from a very particular angle. When I first encountered them, they looked arbitrary and unjustified. The subject finally made sense when I understood its original motivation: to generalize linear regression. Someone broke linear regression into pieces, generalized each of them and put them back together, yielding a machine with more knobs to turn.

I'll now share that view with you. Hopefully you'll have the same reaction I did.

## What are GLMs?

As advertised, we'll start in the bubblegum land of linear regression. Here, we assume our dependent variable (a scalar $y$) is normally distributed *given* a linear combination of our independent variables (a vector $\mathbf{x}$). This can be written as:

$$
\begin{align}
p(y) & = \mathcal{N}(y|\mu,\sigma^2) \tag{1}\\
\mu & = \mathbf{x}^T\boldsymbol{\beta} \tag{2}\\
\end{align}
$$

I'll call (1) the 'distribution assumption' and (2) the 'expected-$y$ and $\mathbf{x}$ relation assumption'. The latter has this name because $\mu$ is the expectation of $y$ given $\mathbf{x}$ (that is, $\mu=\mathbb{E}[y|\mathbf{x}]$, but I'll use $\mu$ going forward).

A GLM generalizes both of these pieces.

## Generalize the distribution assumption

Here, we generalize from:

"$y$ is **normally distributed** given $\mathbf{x}$"

to

"$y$ has **any distribution within the exponential family** given $\mathbf{x}$"

This is a good idea considering the exponential family is pretty damn awesome (link). To refresh, the *vector* $\mathbf{y}$ has a distribution from the exponential family if its probability [1] is

$$
\begin{align}
p(\mathbf{y})=&\frac{1}{Z(\boldsymbol{\theta})}h(\mathbf{y})\exp\Big\{{T(\mathbf{y})\cdot\boldsymbol{\theta}\Big\}}\\
\end{align}
$$

for some parameter vector $\boldsymbol{\theta}$. Our choices of the functions $h(\cdot)$ and $T(\cdot)$ determine which distribution within the exponential family we're using. So we *could* choose them such that we get the normal distribution. But a different choice might give us the binomial or Poisson or something else. We can't choose $Z(\boldsymbol{\theta})$, though - it's there to ensure $p(\mathbf{y})$ is a properly normalized distribution. So our choice of $h(\cdot)$ and $T(\cdot)$ will force $Z(\cdot)$ to be something.
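To see that this isn't hand-waving, here's a minimal sketch that checks one case numerically. It assumes the Poisson choices $h(y)=1/y!$ and $T(y)=y$ with $\theta=\log\lambda$, which force $Z(\theta)=\exp(\exp(\theta))$. The function names and the value of `lam` are mine, picked purely for illustration.

```python
import math

def poisson_pmf(y, lam):
    """Textbook Poisson probability mass function."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def exp_family_pmf(y, theta, h, T, Z):
    """Scalar exponential-family form: h(y) * exp(T(y) * theta) / Z(theta)."""
    return h(y) * math.exp(T(y) * theta) / Z(theta)

# Poisson choices: h(y) = 1/y! and T(y) = y, which force Z(theta) = exp(exp(theta)).
h = lambda y: 1.0 / math.factorial(y)
T = lambda y: y
Z = lambda theta: math.exp(math.exp(theta))

lam = 3.5              # made-up rate, for illustration only
theta = math.log(lam)  # the matching theta for the Poisson

# The two columns of probabilities should agree up to floating-point error.
for y in range(6):
    direct = poisson_pmf(y, lam)
    via_family = exp_family_pmf(y, theta, h, T, Z)
    print(y, round(direct, 6), round(via_family, 6))
```

Swapping in a different `h`, `T` and `Z` is all it takes to get a different member of the family out of the same `exp_family_pmf`.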

One thing to call out is that our dependent variable just went from necessarily a scalar to a vector (which could also be a scalar). So we've sneakily generalized to multi-output predictions.


## Before we do our next generalization...

Let's pretend we've chosen $h(\cdot)$ and $T(\cdot)$ to recreate linear regression and write what we have so far:

$$
\begin{align}
p(y) & = \frac{1}{Z(\theta)}h(y)\exp\Big\{{T(y)\cdot\theta\Big\}} \tag{1}\\
\mu & = \mathbf{x}^T\boldsymbol{\beta} \tag{2}\\
\end{align}
$$

Notice an issue? How does our $\mu$ connect to $\boldsymbol{\theta}$? We have to build that bridge. We can get that done with a function we'll call $\Psi(\cdot)$, which is defined to solve our problem: $\theta = \Psi(\mu)$. This means if we select $h(\cdot)$ and $T(\cdot)$ to recreate linear regression, $\Psi(\cdot)$ will fall out as required. It is purely a result of the distribution we name and that distribution's connection between its $\boldsymbol{\theta}$-values and the resulting expectation of $\mathbf{y}$[2].
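To make the bridge concrete, here's a sketch for the normal distribution. I'm assuming $\sigma^2$ is a fixed, known number (my simplification, so that $\theta$ stays a scalar). Expanding the square in the normal density and grouping terms gives:

$$
\begin{align}
\mathcal{N}(y|\mu,\sigma^2) & = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{(y-\mu)^2}{2\sigma^2}\Big\}\\
& = \underbrace{\exp\Big\{-\frac{\mu^2}{2\sigma^2}\Big\}}_{1/Z(\theta)}\;\underbrace{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{y^2}{2\sigma^2}\Big\}}_{h(y)}\;\exp\Big\{\underbrace{y}_{T(y)}\cdot\underbrace{\frac{\mu}{\sigma^2}}_{\theta}\Big\}\\
\end{align}
$$

Under that assumption, matching terms gives $T(y)=y$, $\theta=\mu/\sigma^2$ and $Z(\theta)=\exp\{\sigma^2\theta^2/2\}$, so the bridge is just $\Psi(\mu)=\mu/\sigma^2$.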

Now un-pretend and let's get back to generalizing.

## Generalize the expected-$y$ and $\mathbf{x}$ relation assumption

Now we generalize from:

"the scalar expectation of $y$ is a linear combination of $\mathbf{x}$"

to

"**some simple function of our *vector* expectation** of $\mathbf{y}$ is a linear combination of $\mathbf{x}$"

That 'simple' function is known as a link function (call it $g(\cdot)$), so this may be written as $g(\boldsymbol{\mu})=\mathbf{x}^T\boldsymbol{\beta}$. It's a generalization because $g(\cdot)$ *could* be the identity function and the length of $\boldsymbol{\mu}$ *could* be one, landing us back in bubblegum land. But it might not be, and then we land elsewhere. Note that if $\boldsymbol{\mu}$ is a vector with length greater than 1, then $\boldsymbol{\beta}$ is a matrix.

Also, the 'simple' requirement on $g(\cdot)$ means it's invertible, so we may write $\boldsymbol{\mu}=g^{-1}(\mathbf{x}^T\boldsymbol{\beta})$.
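Here's a small sketch of what that looks like for a scalar $\mu$. The three links (identity, log, logit) are common choices, but the `LINKS` table, the `predict_mean` helper and the numbers in `x` and `beta` are all made up for illustration.

```python
import math

# Each entry pairs a link g with its inverse g^{-1}.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

def predict_mean(x, beta, inverse_link):
    """mu = g^{-1}(x^T beta) for a scalar mu."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    return inverse_link(eta)

x = [1.0, 0.5, -2.0]    # made-up feature vector
beta = [0.2, 1.0, 0.3]  # made-up coefficients, so x^T beta is about 0.1

# Applying g to the predicted mean should give x^T beta back for every link.
for name, (g, g_inv) in LINKS.items():
    mu = predict_mean(x, beta, g_inv)
    print(name, round(mu, 4), round(g(mu), 4))
```

The same linear combination lands in a different range depending on the link: anywhere on the real line for the identity, positive numbers for the log, and between 0 and 1 for the logit.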

## Put it back together

Now that we have both generalizations, let's put it all together:

$$
\begin{align}
p(\mathbf{y}) & = \frac{1}{Z(\boldsymbol{\theta})}h(\mathbf{y})\exp\Big\{{T(\mathbf{y})\cdot\boldsymbol{\theta}\Big\}} \tag{1}\\
\boldsymbol{\theta} &  = \Psi(\boldsymbol{\mu}) \tag{'bridge'}\\
\boldsymbol{\mu} & = g^{-1}(\mathbf{x}^T\boldsymbol{\beta}) \tag{2}\\
\end{align}
$$

And that's it. So if you are applying a generalized linear model, you have two choices to make:

1. Given the independent variables, what distribution do you want $y$ to have? The normal, the Poisson, the binomial, something else? You make that choice and that tells you $h(\cdot)$ and $T(\cdot)$, which forces $\Psi(\cdot)$ and $Z(\cdot)$.
2. What link function, $g(\cdot)$, do you want? You have a few choices, but typically your answer to the previous question will guide what you should do here. Not all link functions can be combined with all exponential family distributions.

With that, you have your model specification and you can start learning the parameters $\boldsymbol{\beta}$. Not too bad, right?
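As a rough sketch of how those two choices plug into the three equations, here's a toy function for a scalar $y$. Everything named here (`glm_density`, the `poisson` dictionary, the numbers in `x` and `beta`) is my own illustration rather than any library's API. I'm assuming a Poisson distribution with a log link.

```python
import math

def glm_density(y, x, beta, h, T, Z, psi, g_inv):
    """Density/mass of a scalar y given x, built from equations (1), 'bridge' and (2)."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))    # x^T beta
    mu = g_inv(eta)                                  # (2)
    theta = psi(mu)                                  # 'bridge'
    return h(y) * math.exp(T(y) * theta) / Z(theta)  # (1)

# Choice 1: the Poisson distribution fixes h and T, which force Z and psi.
poisson = dict(
    h=lambda y: 1.0 / math.factorial(y),
    T=lambda y: y,
    Z=lambda theta: math.exp(math.exp(theta)),
    psi=math.log,
)

# Choice 2: the log link, so g^{-1} is exp.
g_inv = math.exp

x = [1.0, 2.0]     # made-up features
beta = [0.3, 0.4]  # made-up coefficients

# Sanity checks: the masses should sum to about 1 and the mean should be g^{-1}(x^T beta).
probs = [glm_density(y, x, beta, g_inv=g_inv, **poisson) for y in range(60)]
mean = sum(y * p for y, p in enumerate(probs))
print(round(sum(probs), 6), round(mean, 6), round(math.exp(1.1), 6))
```

Learning $\boldsymbol{\beta}$ then amounts to maximizing the product of these densities over your data, which is the part GLM software handles for you.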

## An example

To ground this concept, let's follow those steps in an easy example. Let's say we have a classification problem where $y$ is either 1 or 0. Let's answer those questions:

1. Given an $\mathbf{x}$, what should the distribution of $y$ be? A natural choice is the Bernoulli distribution. If we ask the internet, we discover this implies $h(y)=1$ and $T(y)=y$. These then tell us $\Psi(\mu)=\log(\frac{\mu}{1-\mu})=\theta$ and $Z(\theta) = \exp(\theta)+1$.
2. The expectation of $y$ needs to be between 0 and 1 and $\mathbf{x}^T\boldsymbol{\beta}$ can vary over the whole real line, so we should pick a function that maps the interval between 0 and 1 onto the real line. How about the logit function (which happens to be our $\Psi(\cdot)$ as well; a link that matches $\Psi(\cdot)$ like this is known as the canonical link)? That is, $g(\mu)=\log(\frac{\mu}{1-\mu})=\mathbf{x}^T\boldsymbol{\beta}$.

So if we sub in these settings and reduce things a bit, our model specification becomes:

$$
\begin{align}
p(y) & = \frac{1}{\exp(\mathbf{x}^T\boldsymbol{\beta})+1}\exp\Big\{{y\cdot\mathbf{x}^T\boldsymbol{\beta}\Big\}}\\
\end{align}
$$

And whadda-ya-know, it's logistic regression! Easy!
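If you'd like to watch $\boldsymbol{\beta}$ get learned, here's a minimal sketch that maximizes the log of that expression on simulated data. I'm using plain gradient ascent because it's short to write; it isn't necessarily what GLM software uses. The data, `true_beta`, the step size and the step count are all made up.

```python
import math
import random

def sigmoid(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_likelihood(X, ys, beta):
    """Sum over the data of log p(y) = y * x^T beta - log(exp(x^T beta) + 1)."""
    total = 0.0
    for x, y in zip(X, ys):
        eta = sum(xi * bi for xi, bi in zip(x, beta))
        total += y * eta - math.log(math.exp(eta) + 1.0)
    return total

def fit(X, ys, steps=500, lr=0.5):
    """Gradient ascent on the average log-likelihood; its gradient is mean((y - mu) * x)."""
    beta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(X, ys):
            mu = sigmoid(sum(xi * bi for xi, bi in zip(x, beta)))
            for j, xj in enumerate(x):
                grad[j] += (y - mu) * xj / n
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

# Simulate data from the model itself; the first feature is a constant 1 (intercept).
random.seed(0)
true_beta = [-0.5, 2.0]  # made-up values for the simulation
X = [[1.0, random.gauss(0.0, 1.0)] for _ in range(500)]
ys = [1 if random.random() < sigmoid(sum(xi * bi for xi, bi in zip(x, true_beta))) else 0
      for x in X]

beta_hat = fit(X, ys)
print(beta_hat)  # should land in the neighborhood of true_beta
```

The estimate won't match `true_beta` exactly, since it's fit to a finite noisy sample, but it should beat the all-zeros starting point on `log_likelihood`.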

## A few things worth pointing out...

If you want to survive in the wilderness of GLMs, you'll need to keep a few things in mind:

1. I've presented the general form of GLMs, but it's easy to make choices of $h(\cdot)$, $T(\cdot)$ and $g(\cdot)$ such that the optimization to determine $\boldsymbol{\beta}$ is nearly impossible. Because of this, most GLM software out there will restrict your choices, such that no matter what you pick, it'll be able to optimize that problem.
2. At the same time, there are additional generalizations that software will offer that I haven't covered. They may offer a parameter that allows you to smoothly vary $h(\cdot)$ or $T(\cdot)$ or $\Psi(\cdot)$. Sometimes a weighted version will be offered. I'm sure there are others I'm not aware of.
3. $\Psi(\cdot)$ can be pretty weird. Sometimes it maps from a single input to multiple outputs. Also, it often holds your exogenously decided parameters - parameters that impact the distribution of $y$ but aren't learned from the data.

These last two points can make the specific form you see look different from the above. In such a circumstance, we should think carefully about what the specific implementation offers.

### Footnotes

[1] 'Probability' isn't correct here. I should say probability density or probability mass, but I'm trying to keep things simple (and wrong I guess?).

[2] It's not necessary for my discussion, but the following is good to know in general. Picking a distribution within the exponential family fixes $\Psi(\cdot)$, and this function relates $\boldsymbol{\mu}$ (the expected-$\mathbf{y}$) to the parameters $\boldsymbol{\theta}$. So if you know $\boldsymbol{\mu}$, you know the parameters as well. Because of this, $\boldsymbol{\mu}$ is often called the mean *parameters* of the distribution. In the same vein, $\boldsymbol{\theta}$ are called the canonical parameters.

### Sources

[1] Kevin Murphy's book Machine Learning: A Probabilistic Perspective - I *really* love that book.

[2] I decided to write this after reading the documentation from the Statsmodels package (link), so I should give them some love.
