# Generalized Linear Models

For regression we had $y|x;\theta \sim \mathcal{N}(\mu, \sigma^2)$ and for classification we had $y|x;\theta \sim \text{Bernoulli}(\phi)$, where $\mu$ and $\phi$ are functions of $x$ and $\theta$. Both of these methods are special cases of Generalized Linear Models (GLMs).

# The Exponential Family

A class of distributions is in the exponential family if it can be written in the form

$$p(y;\eta) = b(y)\exp(\eta^{T}T(y)-a(\eta))$$

where $\eta$ is called the natural/canonical parameter of the distribution, $T(y)$ is the sufficient statistic (we will often consider distributions where $T(y)=y$), $b(y)$ is the base measure, and $a(\eta)$ is the log partition function. The quantity $e^{-a(\eta)}$ plays the role of a normalizing constant that makes sure $p(y;\eta)$ sums/integrates over $y$ to 1.

A fixed choice of $T, a$, and $b$ defines a family (or set) of distributions that is parameterized by $\eta$. 

We can show that the Bernoulli distributions belong to this family. Let $\text{Bernoulli}(\phi)$ be the Bernoulli distribution with mean $\phi$. Recall that this means $y \in \{0,1\}$ with $p(y;\phi) = \phi^{y}(1-\phi)^{1-y}$. We can rewrite this as

$$p(y;\phi) = \exp(y\ln\phi + (1-y)\ln(1-\phi)) = \exp\left(y\ln\left(\frac{\phi}{1-\phi}\right) + \ln(1-\phi)\right)$$
 
Matching this against the general form, we can see that $\eta=\ln(\phi/(1-\phi))$, and solving for $\phi$ gives $\phi=1/(1+e^{-\eta})$, which is just the sigmoid function. This will show up again later when we derive logistic regression as a GLM. The remaining quantities can be read off the same expression: $T(y)=y$, $a(\eta)=-\ln(1-\phi)=\ln(1+e^{\eta})$, and $b(y)=1$, which shows that the Bernoulli distribution is an exponential family distribution. The normal distribution can be shown to be a member in the same way.
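As a quick sanity check, the sketch below compares the usual Bernoulli mass function with its exponential family form numerically. It assumes $T(y)=y$, $b(y)=1$, and $a(\eta)=\ln(1+e^{\eta})$ (which equals $-\ln(1-\phi)$); the function names are illustrative and not part of the original notes.

```python
import math

def bernoulli_pmf(y, phi):
    """Standard form: phi^y * (1 - phi)^(1 - y) for y in {0, 1}."""
    return phi**y * (1 - phi)**(1 - y)

def natural_param(phi):
    """eta = ln(phi / (1 - phi)), the log-odds."""
    return math.log(phi / (1 - phi))

def sigmoid(eta):
    """Inverse of natural_param: phi = 1 / (1 + e^(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_expfam(y, eta):
    """Exponential family form b(y) * exp(eta * T(y) - a(eta)),
    assuming T(y) = y, b(y) = 1 and a(eta) = ln(1 + e^eta)."""
    b = 1.0
    a = math.log1p(math.exp(eta))
    return b * math.exp(eta * y - a)

# Compare the two forms for a few arbitrary values of phi.
for phi in (0.1, 0.5, 0.9):
    eta = natural_param(phi)
    for y in (0, 1):
        print(phi, y, bernoulli_pmf(y, phi), bernoulli_expfam(y, eta))
```

The two printed probabilities in each row should agree up to floating point error.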

There are many other distributions that are members of the exponential family, and the next section shows how to construct a model in which $y$ (given $x$ and $\theta$) comes from any of these distributions. Exponential family distributions have several notable properties:

1. The log likelihood is concave in $\eta$, so maximum likelihood estimation (MLE) with respect to $\eta$ is a concave problem. Equivalently, the negative log likelihood (NLL) is convex
2. $E[y;\eta]=\frac{\partial a(\eta)}{\partial \eta}$ is the mean of the distribution as parameterized by $\eta$
3. $\text{Var}[y;\eta]=\frac{\partial^2 a(\eta)}{\partial \eta^2}$
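Properties 2 and 3 can be checked numerically for the Bernoulli case. The sketch below assumes the Bernoulli log partition function $a(\eta)=\ln(1+e^{\eta})$ and approximates its derivatives with central finite differences (the step size `h` and the value of `eta` are arbitrary choices). The results should be close to $\phi$ and $\phi(1-\phi)$, the Bernoulli mean and variance.

```python
import math

def a(eta):
    """Log partition function of the Bernoulli family (assumed form)."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

eta = 0.7                                # arbitrary natural parameter
phi = 1.0 / (1.0 + math.exp(-eta))       # corresponding Bernoulli mean

mean_from_a = first_derivative(a, eta)   # should approximate phi
var_from_a = second_derivative(a, eta)   # should approximate phi * (1 - phi)

print(mean_from_a, phi)
print(var_from_a, phi * (1 - phi))
```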

Another useful fact about GLMs is that, with $T(y)=y$ and $\eta=\theta^{T}x$, they all share the same learning update rule. Using batch gradient descent on the NLL with learning rate $\alpha$ over $m$ training examples:

$$\vec{\theta} := \vec{\theta} - \frac{\alpha}{m}\sum_{i=1}^{m} \left(h_{\theta}(\vec{x}^{\,(i)})-y^{(i)}\right)\vec{x}^{\,(i)}$$
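Because the update depends on the model only through $h_{\theta}$, a single routine can serve several GLMs. The following is a minimal NumPy sketch, not a reference implementation: `glm_batch_gd`, its `response` argument, the synthetic data, and the step size and iteration count are all assumptions made for illustration.

```python
import numpy as np

def identity(eta):
    """Response function for linear regression: h(x) = theta^T x."""
    return eta

def sigmoid(eta):
    """Response function for logistic regression."""
    return 1.0 / (1.0 + np.exp(-eta))

def glm_batch_gd(X, y, response, alpha=0.1, n_iters=2000):
    """Batch gradient descent with the shared GLM update
    theta := theta - (alpha / m) * sum_i (h(x_i) - y_i) * x_i,
    where h(x) = response(theta^T x)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        residual = response(X @ theta) - y        # h(x_i) - y_i for all i
        theta -= (alpha / m) * (X.T @ residual)   # same rule for every GLM
    return theta

# Synthetic data (made up): an intercept column plus one feature.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_theta = np.array([1.0, -2.0])

# Linear regression on noise-free targets.
y_lin = X @ true_theta
theta_lin = glm_batch_gd(X, y_lin, identity)

# Logistic regression on labels sampled from Bernoulli(sigmoid(theta^T x)).
y_log = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)
theta_log = glm_batch_gd(X, y_log, sigmoid)

print(theta_lin, theta_log)
```

Only the `response` argument changes between the two fits; the update line itself is identical.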

## Constructing GLMs

Consider a classification or regression problem where we would like to predict the value of some random variable $y$ as a function of $x$. To derive a GLM for this problem we make the following assumptions:

1. $y | x; \theta \sim \text{ExponentialFamily}(\eta)$
2. Given $x$, the goal is to predict the expected value of $T(y)$. In most of our examples $T(y)=y$, so we would like the prediction $h(x)$ to satisfy $h(x)=E[y|x]$.
3. The natural parameter $\eta$ and the inputs $x$ are related linearly: $\eta=\theta^{T}x$, or $\eta_i =\theta_{i}^{T}x$ if $\eta$ is vector valued (this is more of a design choice).

The task determines the distribution. For regression over the real numbers you use a normal distribution, for binary classification you use a Bernoulli, and so on.

For the case of regression, one can imagine the line (technically a hyperplane) given by $\theta^T x$, with a normal distribution centered on the line at each input $x$. Each observed data point is then a sample from the distribution at its $x$, and by fitting a GLM we attempt to recover the $\theta$ that produced the line.
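To make this picture concrete, the sketch below generates synthetic data by sampling each $y$ from a normal distribution centered on a line and then recovers $\theta$. It relies on the standard fact that maximizing the Gaussian likelihood is equivalent to least squares, so `np.linalg.lstsq` is used as the fitting step; the true parameters, noise level, and sample size are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: intercept 0.5, slope 2.0, noise std 0.3.
theta_true = np.array([0.5, 2.0])
sigma = 0.3
m = 500

x = rng.uniform(-3.0, 3.0, size=m)
X = np.column_stack([np.ones(m), x])   # prepend an intercept column

# Each y is one draw from N(theta^T x, sigma^2).
y = rng.normal(loc=X @ theta_true, scale=sigma)

# Least squares is the maximum likelihood fit under the Gaussian assumption.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_hat)
```

With this much data the estimate should land close to `theta_true`, though not exactly on it because of the sampling noise.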

### Logistic Regression

Given that we are interested in binary classification, $y \in \{0,1\}$, it seems natural to choose the Bernoulli family of distributions to model $y|x$. Recall from earlier that $\phi=1/(1+e^{-\eta})$. Since $y|x;\theta \sim \text{Bernoulli}(\phi)$, we have $E[y|x;\theta]=\phi$, so

$$h_{\theta}(x)= E[y|x;\theta] = \phi = \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-\theta^{T}x}}$$

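The first equality follows from assumption 2, the second from the fact that a Bernoulli distribution has mean $\phi$, the third from assumption 1 (together with the earlier derivation of the Bernoulli as an exponential family distribution), and the last from assumption 3.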

The function $g$ giving the distribution’s mean as a function of the natural parameter, $g(\eta)=E[T(y);\eta]$ (for us $T(y)=y$, so $g(\eta) = \frac{\partial a(\eta)}{\partial \eta}$), is called the canonical response function. Its inverse $g^{-1}$ is called the canonical link function. Note that other sources swap the notation, writing $g$ for the link function and $g^{-1}$ for the response function.
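As a worked example of a response and link pair, here is a sketch for the normal distribution, assuming $\sigma^2 = 1$ for simplicity. The density can be put in exponential family form as

$$p(y;\mu) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(y-\mu)^2}{2}\right) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right)\exp\left(\mu y - \frac{\mu^2}{2}\right)$$

so $\eta=\mu$, $T(y)=y$, $a(\eta)=\eta^2/2$, and $b(y)=\frac{1}{\sqrt{2\pi}}e^{-y^2/2}$. The canonical response function is then

$$g(\eta) = \frac{\partial a(\eta)}{\partial \eta} = \eta$$

which is the identity, and applying assumption 3 gives $h_{\theta}(x)=\theta^{T}x$, that is, ordinary linear regression. For comparison, in the Bernoulli case the response function is the sigmoid $g(\eta)=1/(1+e^{-\eta})$ and the link function is the log-odds $g^{-1}(\phi)=\ln(\phi/(1-\phi))$.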

## Softmax Regression

Softmax regression is another member of the GLM family and can be derived using the method above, although the derivation is messy. Another way to approach it is through cross entropy. Here we are looking at the problem of multiclass classification. Let $\vec{x}^{\,(i)}\in \mathbb{R}^{n}$, let $k$ be the number of classes, and let $\vec{y} \in \{0,1\}^{k}$ be a one-hot vector: the entry for the class the example belongs to is 1 and every other entry is 0. We assume that each class has its own set of parameters $\vec{\theta}_{c} \in \mathbb{R}^{n}$, where $c \in \{1,2,\ldots,k\}$. These can also be stacked into a $k \times n$ matrix in which each row holds the parameters of one class.

Corresponding to each class is a linear function $\theta_{c}^{T}x$, which plays the role of that class's decision boundary. Given some $x$, we can plot $\theta_{c}^{T}x$ against $c$, which gives one real value per class. These values $\theta_{c}^{T}x \in \mathbb{R}$ are called logits and live in logit space. Our goal now is to turn them into a probability distribution over the classes, which we can do using the following steps.

1. Exponentiate the logits $\exp(\theta_{c}^{T}x)$ to make all values positive
2. Normalize by dividing everything by the sum of them $\sum_{i=1}^{k} \exp(\theta_{i}^{T}x)$

This creates a probability distribution $\hat{p}(y)$ over the classes, where the probability assigned to class $c$ is

$$\hat{p}(y=c) = \frac{\exp(\theta_{c}^{T}x)}{\sum_{i=1}^{k} \exp(\theta_{i}^{T}x)}$$
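The two steps, exponentiating and then normalizing, can be sketched in NumPy as follows. The parameter matrix `Theta` and input `x` are made-up values, and subtracting the maximum logit before exponentiating is a common numerical stability trick that is not part of the notes; it leaves the resulting probabilities unchanged.

```python
import numpy as np

def softmax(logits):
    """Map a vector of k logits to a probability distribution over k classes."""
    shifted = logits - np.max(logits)   # stability trick, result is unchanged
    exps = np.exp(shifted)              # step 1: make every value positive
    return exps / np.sum(exps)          # step 2: normalize so the sum is 1

# Hypothetical parameters: k = 3 classes, n = 2 features, one row per class.
Theta = np.array([[ 1.0,  0.0],
                  [ 0.0,  1.0],
                  [-1.0, -1.0]])
x = np.array([2.0, 1.0])

logits = Theta @ x        # theta_c^T x for each class c
p_hat = softmax(logits)   # estimated distribution over the classes

print(logits, p_hat, p_hat.sum())
```

The class with the largest logit receives the largest probability, and the entries of `p_hat` sum to 1.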

The distribution $\hat{p}$ plays the role of our hypothesis function, but it outputs a probability distribution over all classes instead of a single scalar or probability. The true output $y$ also defines a probability distribution $p(y)$, called the label, which is 1 for the correct class and 0 elsewhere. The learning approach we now take is to minimize the "distance" between these two distributions, in other words to make $\hat{p}(y)$ look more like $p(y)$. Formally, this means minimizing the cross entropy between them:

$$H(p,\hat{p})= -\sum_{y \in \{1,2,\ldots,k\}}p(y)\ln\hat{p}(y)$$ 

We can then treat this as the loss (referred to as cross entropy loss) and do gradient descent on it.
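A minimal sketch of this training loop is given below. It assumes one-hot labels, averages the cross entropy over the training set, and uses the standard result (not derived in these notes) that the gradient with respect to $\theta_c$ is the average of $(\hat{p}(y=c) - p(y=c))\,x$. The function names, toy clusters, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax; subtracting the row max is only for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(P_hat, Y):
    """Average of -sum_c p(c) ln p_hat(c) over examples, with one-hot rows in Y."""
    return -np.mean(np.sum(Y * np.log(P_hat + 1e-12), axis=1))

def fit_softmax(X, Y, alpha=0.1, n_iters=500):
    """Batch gradient descent on the cross entropy loss."""
    m, n = X.shape
    k = Y.shape[1]
    Theta = np.zeros((k, n))              # one row of parameters per class
    losses = []
    for _ in range(n_iters):
        P_hat = softmax_rows(X @ Theta.T)
        losses.append(cross_entropy(P_hat, Y))
        grad = (P_hat - Y).T @ X / m      # gradient of the average loss
        Theta -= alpha * grad
    return Theta, losses

# Toy data (made up): three well-separated 2D clusters, one per class.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 3.0]])
labels = rng.integers(0, 3, size=300)
points = centers[labels] + rng.normal(scale=0.5, size=(300, 2))
X = np.column_stack([np.ones(300), points])   # prepend an intercept column
Y = np.eye(3)[labels]                         # one-hot labels

Theta, losses = fit_softmax(X, Y)
accuracy = np.mean(np.argmax(X @ Theta.T, axis=1) == labels)
print(losses[0], losses[-1], accuracy)
```

The first recorded loss equals $\ln 3$ because all-zero parameters give a uniform distribution over the three classes; the loss should fall from there as training proceeds.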