<center><h1> Generalized linear models</h1></center>

# 1. The exponential family
We have now encountered a wide variety of probability distributions: the Gaussian, the Bernoulli, the Student t, the uniform, the gamma, etc. It turns out that most of these are members of a broader class of distributions known as the **exponential family**.

We will see how any member of the exponential family can be used as a class-conditional density to build a generative classifier. In addition, we will discuss how to build discriminative models in which the response variable has an exponential family distribution whose mean parameters are a linear function of the inputs (possibly passed through a nonlinear function). Such a model is known as a generalized linear model, and it generalizes the idea of logistic regression to other kinds of response variables.

## 1.1 Definition
A pdf or pmf $p(\vec{x}|\vec{\theta})$ for $\vec{x}=(x_1,x_2,\ldots,x_m) \in \mathcal{X}^m$ and $\vec{\theta} \in \Theta \subseteq \mathbb{R}^d$ is said to be in the exponential family if it is of the form
\begin{align}
p(\vec{x}|\vec{\theta})&=\frac{1}{Z(\vec{\theta})}h(\vec{x})exp \left[\vec{\theta}^T\phi(\vec{x}) \right] \\
                       &=h(\vec{x})exp \left[\vec{\theta}^T\phi(\vec{x})-A(\vec{\theta}) \right]
\end{align}
where
\begin{align}
Z(\vec{\theta}) &= \int h(\vec{x})exp \left[\vec{\theta}^T\phi(\vec{x}) \right] d\vec{x} \\
A(\vec{\theta}) &=log\,Z(\vec{\theta})
\end{align}
Here $\vec{\theta}$ are called the **natural parameters** or **canonical parameters**, $\phi(\vec{x})$ is the vector of **sufficient statistics**, $Z(\vec{\theta})$ is the **partition function**, $A(\vec{\theta})$ is the **log partition function** or **cumulant function**, and $h(\vec{x})$ is a **scaling constant**, which is often 1. For a pmf, the integral in $Z(\vec{\theta})$ is replaced by a sum.

## 1.2 Some examples
### 1.2.1 Bernoulli distribution
The Bernoulli for $x \in \lbrace 0,1 \rbrace$ can be written as
\begin{align}
p(x|\mu) &=\mu^x(1-\mu)^{1-x} \\
         &=exp[x\, log(\mu)+(1-x)\,log(1-\mu)] \\
         &=exp[x\,log\frac{\mu}{1-\mu}+log(1-\mu)]\\
         &=(1-\mu) exp[log \left(\frac{\mu}{1-\mu} \right) x] \\
\end{align}
Now we have 
\begin{align}
\phi(x) &=x  \\
\theta        &=log\,\left(\frac{\mu}{1-\mu} \right)
\end{align}
which are the **sufficient statistic** and **natural parameter** of the Bernoulli distribution. We can recover the mean parameter $\mu$ from the natural parameter using
$$
\mu=sigm(\theta)=\frac{1}{1+e^{-\theta}}
$$
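
As a quick numerical check of the derivation above, the following minimal sketch compares the standard Bernoulli pmf with its exponential family form. It assumes $h(x)=1$ and uses $A(\theta)=\log(1+e^{\theta})=-\log(1-\mu)$, which follows from the last line of the derivation; the function names are illustrative only.

```python
import math

def sigm(theta):
    """Logistic sigmoid: maps the natural parameter back to the mean."""
    return 1.0 / (1.0 + math.exp(-theta))

def bernoulli_pmf(x, mu):
    """Standard form: mu^x (1 - mu)^(1 - x)."""
    return mu ** x * (1.0 - mu) ** (1 - x)

def bernoulli_expfam(x, theta):
    """Exponential family form: exp(theta * x - A(theta)), with h(x) = 1."""
    A = math.log(1.0 + math.exp(theta))  # log partition function
    return math.exp(theta * x - A)

mu = 0.3
theta = math.log(mu / (1.0 - mu))  # natural parameter (log-odds)
for x in (0, 1):
    # The two columns should agree up to floating point rounding.
    print(x, bernoulli_pmf(x, mu), bernoulli_expfam(x, theta))
print(sigm(theta))  # recovers mu up to rounding
```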
### 1.2.2 Multinoulli
The Multinoulli for $x \in \lbrace 1,\ldots,C \rbrace$ can be written as
\begin{align}
Cat(x|\vec{\mu})&=\prod_{c=1}^C \mu_c^{\mathbb{1}(x=c)} \\
              &=exp[\sum_{c=1}^C \mathbb{1}(x=c)\,log\,\mu_c]   \\
              &=exp[\sum_{c=1}^{C-1} \mathbb{1}(x=c) \,log\,\mu_c +\left(1-\sum_{c=1}^{C-1} \mathbb{1}(x=c) \right) \,log\, \mu_C] \\
              &=exp[\sum_{c=1}^{C-1} \mathbb{1}(x=c) \,log \, \frac{\mu_c}{\mu_C} + log\,\mu_C ]  \\
              &=(1-\sum_{c=1}^{C-1} \mu_c)exp[\sum_{c=1}^{C-1} \mathbb{1}(x=c) \,log \, \frac{\mu_c}{\mu_C}]
\end{align}
Now we have 
\begin{align}
\phi(x) &=\lbrace \mathbb{1}(x=1),\ldots,\mathbb{1}(x=C-1) \rbrace \\
\vec{\theta}  &=\lbrace log\,\frac{\mu_1}{\mu_C},\ldots,log\,\frac{\mu_{C-1}}{\mu_C} \rbrace \\
\end{align}
which are the **sufficient statistics** and **natural parameters** of the Multinoulli distribution. The natural parameters have the same log-odds form as in the Bernoulli case, with class $C$ acting as the reference class.
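
The mean parameters can be recovered from the natural parameters by inverting $\theta_c=\log(\mu_c/\mu_C)$, which gives $\mu_c=e^{\theta_c}/(1+\sum_{j=1}^{C-1}e^{\theta_j})$ and $\mu_C=1/(1+\sum_{j=1}^{C-1}e^{\theta_j})$. The sketch below checks this round trip numerically; the helper names `natural_params` and `mean_params` are hypothetical.

```python
import math

def natural_params(mu):
    """theta_c = log(mu_c / mu_C) for c = 1..C-1 (the last class is the reference)."""
    return [math.log(m / mu[-1]) for m in mu[:-1]]

def mean_params(theta):
    """Invert the map: mu_c = exp(theta_c) / (1 + sum_j exp(theta_j))."""
    exps = [math.exp(t) for t in theta]
    norm = 1.0 + sum(exps)
    # The reference class C gets the remaining probability mass 1 / norm.
    return [e / norm for e in exps] + [1.0 / norm]

mu = [0.2, 0.5, 0.3]
theta = natural_params(mu)   # C - 1 = 2 natural parameters
mu_back = mean_params(theta) # should reproduce mu up to rounding
print(theta, mu_back)
```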

### 1.2.3 Univariate Gaussian 
The univariate Gaussian can be written in exponential family form as follows
\begin{align}
\mathcal{N}(x|\mu,\sigma^2) &= \frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}}
exp\left[-\frac{(x-\mu)^2}{2\sigma^2} \right] \\
                            &= \frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}}exp \left[-\frac{x^2}{2\sigma^2}+\frac{\mu x}{\sigma^2}-\frac{\mu^2}{2\sigma^2} \right]
\end{align}
Now we have 
\begin{align}
\phi(x)  &=\lbrace x^2,x \rbrace \\
\vec{\theta}  &=\lbrace \frac{-1}{2\sigma^2},\frac{\mu}{\sigma^2} \rbrace \\
\end{align}
which are the **sufficient statistics** and **natural parameters** of the univariate Gaussian distribution.
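
To make the remaining terms explicit, one can take $h(x)=1/\sqrt{2\pi}$ and collect everything that does not depend on $x$ into $A(\vec{\theta})=-\frac{\theta_2^2}{4\theta_1}-\frac{1}{2}\log(-2\theta_1)$, where $\theta_1=-\frac{1}{2\sigma^2}$ multiplies $x^2$ and $\theta_2=\frac{\mu}{\sigma^2}$ multiplies $x$. This split between $h$ and $A$ is one common convention rather than something stated above. A minimal sketch that checks it against the usual density:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Standard form of the univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def gaussian_expfam(x, theta1, theta2):
    """h(x) * exp(theta1 * x^2 + theta2 * x - A(theta)), with h(x) = 1 / sqrt(2 pi)."""
    A = -theta2 ** 2 / (4.0 * theta1) - 0.5 * math.log(-2.0 * theta1)
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(theta1 * x ** 2 + theta2 * x - A)

mu, sigma2 = 1.5, 0.8
theta1, theta2 = -1.0 / (2.0 * sigma2), mu / sigma2
for x in (-1.0, 0.0, 2.0):
    # The two values should agree up to floating point rounding.
    print(x, gaussian_pdf(x, mu, sigma2), gaussian_expfam(x, theta1, theta2))
```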

## 1.3 MLE for the exponential family
The likelihood of an exponential family model has the form
$$
p(D|\vec{\theta})=\left(\frac{1}{Z(\vec{\theta})}\right)^N \left[\prod_{i=1}^N h(\vec{x}_i) \right] exp\left[ \vec{\theta}^T \sum_{i=1}^N \phi(\vec{x}_i) \right]
$$
We see that the sufficient statistics are $N$ and
$$
\vec{\phi}(D)=[\sum_{i=1}^N \phi_1(x_i),\ldots,\sum_{i=1}^N \phi_K(x_i)]
$$
The log-likelihood is
$$
log\, p(D|\vec{\theta})=\vec{\theta}^T \vec{\phi}(D)-N\,log\,Z(\vec{\theta})+\sum_{i=1}^N log\,h(\vec{x}_i)
$$
The gradient of the log-likelihood with respect to $\vec{\theta}$ is
\begin{align}
\frac{\partial}{\partial \vec{\theta}} log\, p(D|\vec{\theta}) &=\vec{\phi}(D)-N\,\frac{\partial}{\partial \vec{\theta}} log\,Z(\vec{\theta}) \\
&=\vec{\phi}(D)-N\,\frac{1}{Z(\vec{\theta})}\frac{\partial}{\partial \vec{\theta}}Z(\vec{\theta}) \\
&=\vec{\phi}(D)-N\,\frac{1}{Z(\vec{\theta})}\frac{\partial}{\partial \vec{\theta}} \int h(\vec{x})exp\{\vec{\theta}^T\phi(\vec{x})\}d\vec{x} \\
&=\vec{\phi}(D)-N\,\frac{1}{Z(\vec{\theta})}\int \frac{\partial}{\partial \vec{\theta}} h(\vec{x})exp\{\vec{\theta}^T\phi(\vec{x})\}d\vec{x} \\
&=\vec{\phi}(D)-N\,\frac{1}{Z(\vec{\theta})}\int \phi(\vec{x})  h(\vec{x})exp\{\vec{\theta}^T\phi(\vec{x})\}d\vec{x} \\
&=\vec{\phi}(D)-N\,\int \phi(\vec{x}) \frac{1}{Z(\vec{\theta})}  h(\vec{x})exp\{\vec{\theta}^T\phi(\vec{x})\}d\vec{x} \\
&=\vec{\phi}(D)-N\,\mathbb{E}[\phi(\vec{x})]
\end{align}
Setting this gradient to zero shows that, at the MLE, the empirical average of the sufficient statistics must equal the model's theoretical expected sufficient statistics:
$$
\mathbb{E}[\phi(\vec{x})]=\frac{1}{N}\sum_{i=1}^N\phi(\vec{x}_i)
$$
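This is just a **moment matching** process. For example, for the Bernoulli distribution we have $\phi(x)=x$, so the MLE satisfies $\hat{\mu}=\frac{1}{N}\sum_{i=1}^N x_i$.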
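
The moment matching condition can be illustrated for the univariate Gaussian, whose sufficient statistics are $\phi(x)=\lbrace x^2,x \rbrace$. The sketch below uses synthetic data (the true mean 2.0, standard deviation 1.5, sample size and seed are arbitrary assumptions) and solves $\mathbb{E}[x]=\mu$ and $\mathbb{E}[x^2]=\mu^2+\sigma^2$ for the parameters.

```python
import random

random.seed(0)
# Synthetic data: mean 2.0, standard deviation 1.5 (variance 2.25).
data = [random.gauss(2.0, 1.5) for _ in range(10000)]
N = len(data)

# Empirical averages of the sufficient statistics phi(x) = (x^2, x).
mean_x2 = sum(x * x for x in data) / N
mean_x = sum(data) / N

# Match them to the model moments E[x] = mu and E[x^2] = mu^2 + sigma^2.
mu_hat = mean_x
sigma2_hat = mean_x2 - mean_x ** 2

print(mu_hat, sigma2_hat)  # should be close to 2.0 and 2.25
```

Note that the resulting variance estimate divides by $N$, so it is the (biased) maximum likelihood estimate rather than the unbiased sample variance.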
## 1.4 Bayes for the exponential family
The likelihood of the exponential family is given by
$$
p(D|\vec{\theta}) \propto g(\vec{\theta})^N exp(\vec{\theta}^T\vec{s}_N)
$$
where
\begin{align}
g(\vec{\theta}) &=\frac{1}{Z(\vec{\theta})}\\
\vec{s}_N  &=\sum_{i=1}^N \phi(\vec{x}_i)
\end{align}
The natural conjugate prior has the form
\begin{align}
p(\vec{\theta}|\nu_0,\vec{\tau}_0) &\propto g(\vec{\theta})^{\nu_0} exp \left(\vec{\theta}^T \vec{\tau}_0 \right)\\
                                   &=g(\vec{\theta})^{\nu_0} exp \left(\nu_0 \vec{\theta}^T \vec{\bar{\tau}}_0 \right) \\
                                   &=p(\vec{\theta}|\nu_0,\vec{\bar{\tau}}_0)
\end{align}
Here we have written $\vec{\tau}_0=\nu_0\vec{\bar{\tau}}_0$, where $\nu_0$ is the size of the prior pseudo-data and $\vec{\bar{\tau}}_0$ is the mean of the sufficient statistics on this pseudo-data. The posterior is given by
\begin{align}
p(\vec{\theta}|D) &\propto p(D|\vec{\theta})p(\vec{\theta}|\nu_0,\vec{\tau}_0) \\
                  &=g(\vec{\theta})^{\nu_0+N} exp \left(\vec{\theta}^T(\nu_0 \vec{\bar{\tau}}_0+\vec{s}_N) \right)
\end{align}
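
Reading off the exponents, the posterior has the same form as the prior with updated hyperparameters $\nu_N=\nu_0+N$ and $\vec{\tau}_N=\nu_0\vec{\bar{\tau}}_0+\vec{s}_N$. The following minimal sketch does this bookkeeping for a scalar sufficient statistic; the function name and the example values (Bernoulli observations with $\phi(x)=x$, $\nu_0=2$, $\bar{\tau}_0=0.5$) are assumptions chosen for illustration.

```python
def posterior_hyperparams(nu0, tau_bar0, suff_stats):
    """Update the conjugate prior hyperparameters given observed phi(x_i) values.

    Returns (nu_N, tau_N, tau_bar_N), where tau_bar_N = tau_N / nu_N is the
    posterior mean of the sufficient statistic over pseudo-data plus real data.
    """
    N = len(suff_stats)
    s_N = sum(suff_stats)         # s_N = sum_i phi(x_i)
    nu_N = nu0 + N                # pseudo-count plus real count
    tau_N = nu0 * tau_bar0 + s_N  # pseudo-statistics plus real statistics
    return nu_N, tau_N, tau_N / nu_N

# Bernoulli example: phi(x) = x, two pseudo-observations with mean 0.5.
data = [1, 1, 0, 1]
nu_N, tau_N, tau_bar_N = posterior_hyperparams(2.0, 0.5, data)
print(nu_N, tau_N, tau_bar_N)
```

The updated mean $\bar{\tau}_N$ is a convex combination of the prior mean and the empirical mean of the sufficient statistics, weighted by $\nu_0$ and $N$.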

## 1.5 Maximum entropy derivation of the exponential family
The exponential family is the family of distributions that makes the fewest assumptions about the data, subject to a specific set of user-specified constraints. Suppose all we know is the expected values of certain features or functions:
$$
\sum_x f_k(x)p(x)=F_k,\quad k=1,2,\ldots,K
$$
where $F_k$ are known constants and each $f_k(x)$ is an arbitrary function. The principle of **maximum entropy** or **maxent** says we should pick the distribution with maximum entropy (closest to uniform), subject to the constraints that the moments of the distribution match the empirical moments of the specified functions. To maximize the entropy subject to these constraints, together with the normalization constraint $\sum_x p(x)=1$, we use Lagrange multipliers as follows
$$
\mathcal{J}(p,\lambda)=-\sum_xp(x)log\,p(x)+\lambda_0\left(1-\sum_xp(x)\right)+\sum_k \lambda_k\left(F_k-\sum_x f_k(x)p(x)\right)
$$
We can use the calculus of variations to take derivatives with respect to the distribution $p(x)$, which gives
$$
\frac{\partial \mathcal{J}}{\partial p(x)}=-1-log\,p(x)-\lambda_0-\sum_k \lambda_k f_k(x)
$$
Setting $\frac{\partial \mathcal{J}}{\partial p(x)}=0$ and absorbing the constant terms into $Z=exp(1+\lambda_0)$ yields
$$
p(x)=\frac{1}{Z} exp\left(-\sum_k \lambda_k f_k(x)\right)
$$
which has exponential family form, with natural parameters $-\lambda_k$ and sufficient statistics $f_k(x)$. It is also known as the **Gibbs distribution**.
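
As a small worked example of this result, consider a six-sided die with a single constraint on its mean, $f(x)=x$ and $F=4.5$ (both values are assumptions chosen for illustration). The maxent solution is $p(x)\propto exp(-\lambda x)$, and the sketch below finds $\lambda$ by bisection, using the fact that the mean of $p$ decreases as $\lambda$ increases.

```python
import math

def gibbs(lmbda, xs):
    """p(x) = exp(-lambda * f(x)) / Z with f(x) = x."""
    weights = [math.exp(-lmbda * x) for x in xs]
    Z = sum(weights)  # normalization constant
    return [w / Z for w in weights]

def mean(p, xs):
    return sum(p_i * x for p_i, x in zip(p, xs))

def solve_lambda(target, xs, lo=-10.0, hi=10.0, iters=200):
    """Bisection on lambda; the mean under gibbs(lambda) is decreasing in lambda."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(gibbs(mid, xs), xs) > target:
            lo = mid  # mean too large, so lambda must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

xs = [1, 2, 3, 4, 5, 6]
lam = solve_lambda(4.5, xs)
p = gibbs(lam, xs)
print(lam, p, mean(p, xs))
```

Because the target mean exceeds the uniform mean of 3.5, the fitted $\lambda$ is negative and the probabilities increase with $x$.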

# 2. Generalized linear models (GLMs)
## 2.1 Basics
These are models in which the output density is in the exponential family, and in which the mean parameters are a linear combination of the inputs, passed through a possibly nonlinear function.

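Here we consider exponential family distributions whose sufficient statistic is the response itself, $\phi(\vec{y})=\vec{y}$, written as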
\begin{align}
p(\vec{y}|\vec{\theta}) &=h(\vec{y})exp \left[\vec{\theta}^T\vec{y}-A(\vec{\theta}) \right] \\
                        &=exp \left[\vec{\theta}^T\vec{y}-A(\vec{\theta})+C(\vec{y}) \right]
\end{align}
where $C(\vec{y})=log\,h(\vec{y})$

More generally, one sometimes introduces an extra parameter $\sigma^2$, called the dispersion parameter, to control the shape of $p(\vec{y}|\vec{\theta})$ as follows
$$
p(\vec{y}|\vec{\theta},\sigma^2)=exp \left[\frac{\vec{\theta}^T\vec{y}-A(\vec{\theta})}{\sigma^2}+C(\vec{y},\sigma^2) \right]
$$
where $\sigma^2$ is the **dispersion parameter**, $\vec{\theta}$ is the **natural parameter**, and $A$ is the log partition function.

We now define a linear function of the inputs $\vec{x}$
$$
\vec{\eta}=\mathbf{W}\vec{x}
$$
Then we make the mean of the distribution some invertible monotonic function of $\vec{\eta}$. This function, known as the **mean function**, is denoted by $g^{-1}$
$$
\vec{\mu}=\mathbb{E}[\vec{y}|\vec{\theta},\sigma^2]=g^{-1}(\vec{\eta})=g^{-1}(\mathbf{W}\vec{x})
$$
The inverse of the mean function, namely $g$, is called the **link function**.

One particularly simple choice of link function makes $\vec{\theta}=\vec{\eta}$; this is called the **canonical link function**.
### 2.1.1 Linear regression
In the linear regression case, we pay much more attention to the mean $\mu$ than to the variance $\sigma^2$, so we can view $\sigma^2$ as a dispersion parameter.
\begin{align}
p(y|\mu,\sigma^2) &=\mathcal{N}(y|\mu,\sigma^2) \\
                  &=exp \left[-\frac{(y-\mu)^2}{2\sigma^2}-\frac{1}{2}log\,\left(2\pi\sigma^2 \right) \right] \\
                  &=exp \left[\frac{y\mu-\frac{\mu^2}{2}}{\sigma^2}-\frac{1}{2}\left(\frac{y^2}{\sigma^2}+log\,\left(2\pi\sigma^2 \right) \right) \right]
\end{align}
Matching this to the dispersion form, the **natural parameter** is $\theta=\mu=\vec{w}^T\vec{x}$ with $A(\theta)=\theta^2/2$, which is the ordinary linear regression case.

### 2.1.2 Binomial regression
For a binomial response with $N$ trials and success probability $\mu$, we have
\begin{align}
p(y|\mu) &=Bin(y|N,\mu) \\
         &=\binom{N}{y}\mu^y(1-\mu)^{N-y} \\
         &=exp \left[log\,\binom{N}{y}+ylog\,\mu+(N-y)log\,(1-\mu) \right] \\
         &=exp \left[ylog\,\frac{\mu}{1-\mu}+Nlog\,(1-\mu)+log\,\binom{N}{y} \right] \\
\end{align}
The **natural parameter** is 
$$
\theta=log\,\frac{\mu}{1-\mu}=\vec{w}^T\vec{x}
$$
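
From the last line of the derivation, the log partition function is $A(\theta)=-N\,log(1-\mu)=N\,log(1+e^{\theta})$, and its derivative should equal the mean of the response, $N\mu$. The sketch below checks this with a central finite difference; the values of $N$, $\mu$ and the step size are arbitrary assumptions.

```python
import math

def log_partition(theta, N):
    """A(theta) = N * log(1 + exp(theta)) for the binomial with N trials."""
    return N * math.log(1.0 + math.exp(theta))

N, mu = 10, 0.3
theta = math.log(mu / (1.0 - mu))  # natural parameter (log-odds)
eps = 1e-6
# Central finite difference approximation of A'(theta).
dA = (log_partition(theta + eps, N) - log_partition(theta - eps, N)) / (2.0 * eps)
print(dA, N * mu)  # the two values should agree closely
```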
### 2.1.3 Binary classification
In (binary) logistic regression, we use a model of the form
$$
p(y=1|\vec{x},\vec{w})=sigm(\vec{w}^T\vec{x})
$$
In general, we can write
$$
p(y=1|\vec{x},\vec{w})=g^{-1}(\vec{w}^T\vec{x})
$$
for any function $g^{-1}$ that maps $(-\infty,\infty)$ to $[0,1]$. Several possible mean functions are listed below.

| Name     | Formula                                    |
|----------|:------------------------------------------:|
| Logistic | $g^{-1}(\eta)=sigm(\eta)$                  |
| Probit   | $g^{-1}(\eta)=\Phi(\eta)$                  |
| Log-log  | $g^{-1}(\eta)=exp\left(-exp(-\eta)\right)$ |
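
The three mean functions in the table can be sketched in a few lines. Here $\Phi$ is taken to be the standard normal CDF, computed through the error function; the Python function names are illustrative only.

```python
import math

def logistic(eta):
    """sigm(eta) = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def probit(eta):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def log_log(eta):
    """exp(-exp(-eta)); unlike the other two, it is not symmetric about eta = 0."""
    return math.exp(-math.exp(-eta))

for eta in (-2.0, 0.0, 2.0):
    print(eta, logistic(eta), probit(eta), log_log(eta))
```

All three are monotonically increasing and map the real line into $[0,1]$, so any of them can serve as $g^{-1}$ in the model above.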
## 2.2 ML and MAP estimation
Assuming a scalar response and the canonical link, the likelihood has the following form
\begin{align}
p(D|\vec{w}) &=\prod_{i=1}^N exp \left[\frac{\theta_i y_i-A(\theta_i)}{\sigma^2}+C(y_i,\sigma^2) \right] \\
\theta_i     &=\vec{w}^T\vec{x}_i
\end{align}
We can compute the gradient with respect to $\vec{w}$ as follows
\begin{align}
\frac{\partial log\, p(D|\vec{w})}{\partial \vec{w}} &=\sum_{i=1}^N\frac{\partial log\, p(y_i|\vec{w},\vec{x_i})}{\partial \theta_i}\frac{\partial \theta_i}{ \partial \vec{w}} \\
&=\sum_{i=1}^N \frac{y_i-A'(\theta_i)}{\sigma^2}\frac{\partial \theta_i}{ \partial \vec{w}} \\
&=\sum_{i=1}^N \frac{y_i-A'(\theta_i)}{\sigma^2}\vec{x}_i
\end{align}
where $y_i-A'(\theta_i)=y_i-\mu_i$ is the error between the observed response and the predicted mean, since $A'(\theta_i)=\mu_i$. The gradient has the same form as in logistic regression. It is straightforward to modify the above procedure to perform MAP estimation with a Gaussian prior.
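
The gradient formula can be turned into a simple gradient ascent loop. The sketch below is one possible illustration, not a tuned implementation: it instantiates the GLM as logistic regression, so that $A'(\theta)=sigm(\theta)$ and $\sigma^2=1$, and the toy data, step size and iteration count are arbitrary assumptions.

```python
import math

def sigm(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def glm_gradient(w, X, y, a_prime, sigma2=1.0):
    """Gradient of the log-likelihood: sum_i (y_i - A'(theta_i)) / sigma2 * x_i."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        theta_i = sum(w_j * x_j for w_j, x_j in zip(w, x_i))  # canonical link
        err = (y_i - a_prime(theta_i)) / sigma2               # error term
        for j, x_j in enumerate(x_i):
            grad[j] += err * x_j
    return grad

# Toy data: first feature is a constant bias term; classes are not separable.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1, 0, 1]

w = [0.0, 0.0]
for _ in range(5000):
    g = glm_gradient(w, X, y, sigm)
    w = [w_j + 0.1 * g_j for w_j, g_j in zip(w, g)]  # ascent step

print(w, glm_gradient(w, X, y, sigm))  # gradient should be close to zero
```

Swapping `a_prime` for another derivative of a log partition function (for example the identity for linear regression) reuses the same loop, which is the practical benefit of the shared gradient form.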