# The Exponential Family

## General form
$$p(\mathbf{x}|\mathbf{\eta})=h(\mathbf{x})g(\mathbf{\eta})exp\{\mathbf{\eta}^T\mathbf{u}(\mathbf{x})\} \tag{2.194}$$
where
- $\mathbf{x}$ may be scalar or vector, and may be discrete or continuous.
- $\mathbf{\eta}$ are called the natural parameters of the distribution.
- $\mathbf{u}(\mathbf{x})$ is some function of $\mathbf{x}$.
- $g(\mathbf{\eta})$ is the coefficient that ensures the distribution is normalized, and therefore satisfies
$$g(\mathbf{\eta})\int h(\mathbf{x}) exp\{\mathbf{\eta}^T\mathbf{u}(\mathbf{x})\}d\mathbf{x}=1 \tag{2.195}$$

Steps for rewriting a distribution in exponential-family form:
1. If the distribution already contains an exponential term, go to step 3.
2. Rewrite the distribution $p(x|\mu)$ as $exp\{\ln p(x|\mu)\}$.
3. Split the exponent into terms that involve $x$ and terms that involve only $\mu$. Comparing with the general form $(2.194)$, read off $\eta$ as the coefficient of the $x$-dependent terms.
4. The terms that depend only on $\mu$ move outside the exponential and combine with any existing coefficient to form the normalizer $g(\eta)$, so $\mu$ must be rewritten in terms of $\eta$.
5. Identify the $x$-dependent factors $u(x)$ and $h(x)$.
6. Identify the $\mu$-dependent factor $g(\eta)$.

-----------------------
## Bernoulli distribution
$$p(x|\mu)=Bern(x|\mu)=\mu^x(1-\mu)^{1-x} \tag{2.196}$$

Transform to the exponential family
$$\begin{align*}
p(x|\mu)&=exp\{x\ln \mu+(1-x)\ln(1-\mu)\}\\
&=exp\{x\ln \mu+\ln(1-\mu)-x\ln(1-\mu)\}\\
&=(1-\mu)exp\{x\ln \mu-x\ln(1-\mu)\}\\
&=(1-\mu)exp\left\{\ln\left(\frac{\mu}{1-\mu}\right)x\right\} \tag{2.197}
\end{align*}$$
Comparison with $(2.194)$ allows us to identify
$$\eta=\ln\left(\frac{\mu}{1-\mu}\right) \tag{2.198}$$
To express the factor $(1-\mu)$ outside the exponential in terms of $\eta$, we also need to solve for $\mu$.
$$\begin{align*}
\eta&=\ln\left(\frac{\mu}{1-\mu}\right)\\
\Rightarrow exp(\eta)&=\frac{\mu}{1-\mu}\\
\Rightarrow \frac{1}{exp(\eta)}&=\frac{1-\mu}{\mu}=\frac{1}{\mu}-1\\
\Rightarrow \frac{1}{\mu}&=\frac{1}{exp(\eta)}+1=\frac{1+exp(\eta)}{exp(\eta)}\\
\Rightarrow \mu&=\frac{exp(\eta)}{1+exp(\eta)}=1-\frac{1}{1+exp(\eta)}\\
\Rightarrow 1-\mu &= \frac{1}{1+exp(\eta)}\\
\Rightarrow p(x|\eta) &= \frac{1}{1+exp(\eta)}exp(\eta x)
\end{align*}$$
We can now use the *logistic sigmoid* function, defined by
$$\sigma(\eta)=\frac{1}{1+exp(-\eta)} \tag{2.199}$$
Since $\frac{1}{1+exp(\eta)}=\sigma(-\eta)$, substituting gives
$$p(x|\eta)=\sigma(-\eta)exp(\eta x) \tag{2.200}$$
where
- $\eta=\ln\left(\frac{\mu}{1-\mu}\right)$
- $u(x)=x$.
- $h(x)=1$.
- $g(\eta)=\sigma(-\eta)$
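
As a quick numerical sanity check, the following sketch compares the usual Bernoulli form $(2.196)$ with the exponential-family form $(2.200)$. The function names and the test value $\mu=0.3$ are illustrative choices, not part of the derivation.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, equation (2.199)."""
    return 1.0 / (1.0 + math.exp(-a))

def bernoulli_standard(x, mu):
    """Bernoulli probability in its usual form, equation (2.196)."""
    return mu ** x * (1.0 - mu) ** (1 - x)

def bernoulli_natural(x, eta):
    """Bernoulli probability in exponential-family form, equation (2.200)."""
    return sigmoid(-eta) * math.exp(eta * x)

mu = 0.3                          # arbitrary test value (assumption)
eta = math.log(mu / (1.0 - mu))   # natural parameter, equation (2.198)

# Both forms should print the same probability for x = 0 and x = 1.
for x in (0, 1):
    print(x, bernoulli_standard(x, mu), bernoulli_natural(x, eta))
```

Note that `sigmoid(eta)` recovers $\mu$, which is the inverse of $(2.198)$.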


-----------------
## ~~Multinomial distribution~~ (Multiple outcomes Bernoulli distribution)
### First solution
$$p(\mathbf{x}|\mathbf{\mu})=\prod_{k=1}^M\mu_k^{x_k}=exp\left\{\sum_{k=1}^Mx_k\ln \mu_k\right\}=exp(\mathbf{\eta}^T\mathbf{x})$$
where 
- $\mathbf{x}=(x_1,\cdots,x_M)^T$.
- $\mathbf{\eta}=(\eta_1,\cdots,\eta_M)^T=(\ln \mu_1,\cdots,\ln\mu_M)^T$.
- $\mathbf{u}(\mathbf{x})=\mathbf{x}$.
- $h(\mathbf{x})=1$.
- $g(\mathbf{\eta})=1$.


### Another solution
In some circumstances it is convenient to eliminate $\mu_M$, the probability of $x_M=1$, using the constraint $\sum_{k=1}^M\mu_k=1$. This leaves only $M-1$ parameters to express the distribution.
$$\begin{align*}
p(\mathbf{x}|\mathbf{\mu})&=\prod_{k=1}^M\mu_k^{x_k}=exp\left\{\sum_{k=1}^Mx_k\ln \mu_k\right\}\\
&=exp\left\{\sum_{k=1}^{M-1}x_k\ln\mu_k+\underbrace{\left(1-\sum_{k=1}^{M-1}x_k\right)}_{x_M}\ln\underbrace{\left(1-\sum_{k=1}^{M-1}\mu_k\right)}_{\mu_M}\right\}\\
&=exp\left\{\sum_{k=1}^{M-1}x_k\ln\left(\frac{\mu_k}{1-\sum_{j=1}^{M-1}\mu_j}\right)+\ln\left(1-\sum_{k=1}^{M-1}\mu_k\right)\right\} \tag{2.211}\\
&=\left(1-\sum_{k=1}^{M-1}\mu_k\right)exp\left\{\sum_{k=1}^{M-1}x_k\ln\left(\frac{\mu_k}{1-\sum_{j=1}^{M-1}\mu_j}\right)\right\}
\end{align*}$$
Then we identify $\mathbf{\eta}=(\eta_1,\cdots,\eta_{M-1})^T$, where $\eta_k=\ln\left(\frac{\mu_k}{1-\sum_{j=1}^{M-1}\mu_j}\right)$, and solve for $\mu_k$:
$$\begin{align*} exp(\eta_k)&=\frac{\mu_k}{1-\sum_{j=1}^{M-1}\mu_j}\\
\Rightarrow \sum_{k=1}^{M-1}exp(\eta_k)&=\frac{\sum_{k=1}^{M-1}\mu_k}{1-\sum_{j=1}^{M-1}\mu_j}\\
\Rightarrow 1+\sum_{k=1}^{M-1}exp(\eta_k)&=1+\frac{\sum_{k=1}^{M-1}\mu_k}{1-\sum_{j=1}^{M-1}\mu_j}=\frac{1}{1-\sum_{j=1}^{M-1}\mu_j}\\
\Rightarrow \frac{1}{1+\sum_{k=1}^{M-1}exp(\eta_k)}&=1-\sum_{j=1}^{M-1}\mu_j\\
\Rightarrow \frac{exp(\eta_k)}{1+\sum_{j=1}^{M-1}exp(\eta_j)}&=\left(1-\sum_{j=1}^{M-1}\mu_j\right)\cdot exp(\eta_k)\\
\Rightarrow \frac{exp(\eta_k)}{1+\sum_{j}exp(\eta_j)}&=\mu_k \tag{2.213}
\end{align*}$$
This is called the *softmax* function. Substituting it into $(2.211)$ gives
$$p(\mathbf{x}|\mathbf{\eta})=\left(1+\sum_{k=1}^{M-1}exp(\eta_k)\right)^{-1}exp(\mathbf{\eta}^T\mathbf{x})$$
where
- $\mathbf{x}=(x_1,\cdots,x_{M-1})^T$
- $\mathbf{\eta}=(\eta_1,\cdots,\eta_{M-1})^T$, where $\eta_k=\ln\left(\frac{\mu_k}{1-\sum_{j=1}^{M-1}\mu_j}\right)$.
- $\mathbf{u}(\mathbf{x})=\mathbf{x}$.
- $h(\mathbf{x})=1$.
- $\displaystyle{g(\mathbf{\eta})=\left(1+\sum_{k=1}^{M-1}exp(\eta_k)\right)^{-1}}$.
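
The mapping between $\mathbf{\mu}$ and the $M-1$ natural parameters can be sketched in a few lines. This is a minimal illustration with hypothetical helper names (`mu_to_eta`, `eta_to_mu`) and made-up probabilities; it simply round-trips through the definition of $\eta_k$ and the softmax $(2.213)$.

```python
import math

def mu_to_eta(mu):
    """Map probabilities (mu_1..mu_M) to the M-1 natural parameters eta_k."""
    mu_last = 1.0 - sum(mu[:-1])          # mu_M, recovered from the sum-to-one constraint
    return [math.log(m / mu_last) for m in mu[:-1]]

def eta_to_mu(eta):
    """Inverse map, equation (2.213), returning all M probabilities."""
    denom = 1.0 + sum(math.exp(e) for e in eta)
    head = [math.exp(e) / denom for e in eta]
    return head + [1.0 / denom]           # last entry is mu_M, which equals g(eta)

mu = [0.2, 0.5, 0.3]                      # made-up probabilities (assumption)
eta = mu_to_eta(mu)
recovered = eta_to_mu(eta)
print(eta, recovered)
```

The last recovered entry, $1/(1+\sum_k exp(\eta_k))$, is both $\mu_M$ and the normalizer $g(\mathbf{\eta})$.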

-----------------

## Gaussian
$$\begin{align*}
p(x|\mu,\sigma^2)&=\frac{1}{(2\pi\sigma^2)^{1/2}}exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}\\
&=\frac{1}{(2\pi\sigma^2)^{1/2}}exp\left\{-\frac{1}{2\sigma^2}x^2+\frac{\mu}{\sigma^2}x-\frac{1}{2\sigma^2}\mu^2\right\}
\end{align*}$$
where
- $\mathbf{\eta}=\begin{bmatrix}\mu/\sigma^2\\ -1/(2\sigma^2)\end{bmatrix}$.
- $\mathbf{u}(x)=\begin{bmatrix}x\\ x^2\end{bmatrix}$.
- $h(x)=(2\pi)^{-1/2}$.
- $g(\mathbf{\eta})=(-2\eta_2)^{1/2}exp\left(\frac{\eta_1^2}{4\eta_2}\right)$
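
A small check can confirm that the natural-parameter form reproduces the Gaussian density. The sketch below assumes $h(x)=(2\pi)^{-1/2}$ and the $g(\mathbf{\eta})$ given above; the function names and test values are illustrative only.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density in its usual mean/variance form."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def gaussian_natural(x, eta1, eta2):
    """Gaussian density as h(x) * g(eta) * exp(eta^T u(x)) with u(x) = (x, x^2)."""
    h = (2.0 * math.pi) ** -0.5
    g = math.sqrt(-2.0 * eta2) * math.exp(eta1 ** 2 / (4.0 * eta2))
    return h * g * math.exp(eta1 * x + eta2 * x * x)

mu, sigma2 = 1.5, 2.0                 # arbitrary test values (assumption)
eta1 = mu / sigma2                    # first natural parameter
eta2 = -1.0 / (2.0 * sigma2)          # second natural parameter (always negative)

for x in (-1.0, 0.7, 3.2):
    print(x, gaussian_pdf(x, mu, sigma2), gaussian_natural(x, eta1, eta2))
```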

----------------
---------------
# The Properties of Exponential Family
## Maximum likelihood and sufficient statistics

### Gradient

The first-order gradient of a function $f(\mathbf{x})$ is the vector whose elements are the partial derivatives along each dimension. The second-order gradient (the Hessian) of $f(\mathbf{x})$ is an $N\times N$ matrix.
$$\nabla f(\mathbf{x})=\begin{bmatrix}
\partial f/\partial x_1\\
\partial f/\partial x_2\\
\vdots\\
\partial f/\partial x_N\\
\end{bmatrix}\qquad
\nabla^2 f(\mathbf{x})=
\nabla\big(\nabla f(\mathbf{x})\big)^T
=\nabla(
\begin{bmatrix} \frac{\partial f}{\partial x_1} &\frac{\partial f}{\partial x_2} &\cdots &\frac{\partial f}{\partial x_N} \end{bmatrix}
)
=
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1\partial x_1} &\frac{\partial^2 f}{\partial x_1 \partial x_2} &\cdots &\frac{\partial^2 f}{\partial x_1\partial x_N} \\
\frac{\partial^2 f}{\partial x_2\partial x_1} &\frac{\partial^2 f}{\partial x_2 \partial x_2} &\cdots &\frac{\partial^2 f}{\partial x_2\partial x_N} \\
\vdots &\vdots &\ddots &\vdots\\
\frac{\partial^2 f}{\partial x_N\partial x_1} &\frac{\partial^2 f}{\partial x_N \partial x_2} &\cdots &\frac{\partial^2 f}{\partial x_N\partial x_N} \\
\end{bmatrix}
$$

Taking the gradient with respect to $\mathbf{\eta}$ of both sides of the normalization condition $\int p(\mathbf{x}|\mathbf{\eta})d\mathbf{x}=1$, we have
$$\begin{align*}
\nabla \left( \int p(\mathbf{x}|\mathbf{\eta})d\mathbf{x}\right ) &= \nabla \left(g(\mathbf{\eta})\int h(\mathbf{x})exp\{\mathbf{\eta}^T\mathbf{u}(\mathbf{x})\} d\mathbf{x}\right )\\
&=\nabla g(\mathbf{\eta}) \int h(\mathbf{x})exp\{\mathbf{\eta}^T\mathbf{u}(\mathbf{x})\} d\mathbf{x}+g(\mathbf{\eta}) \int h(\mathbf{x})exp\{\mathbf{\eta}^T\mathbf{u}(\mathbf{x})\}\mathbf{u}(\mathbf{x}) d\mathbf{x}\qquad (fg)'=f'g+fg' \tag{2.224}\\
&=\nabla g(\mathbf{\eta})\cdot\frac{1}{g(\mathbf{\eta})}+\mathbb{E}[\mathbf{u}(\mathbf{x})]\\
&=0\\
\Rightarrow \mathbb{E}[\mathbf{u}(\mathbf{x})]&=-\nabla g(\mathbf{\eta})\cdot\frac{1}{g(\mathbf{\eta})} \tag{2.225}\\
&=-\nabla \ln g(\mathbf{\eta}) \tag{2.226}
\end{align*}$$
We can see that the first-order moment of $\mathbf{u}(\mathbf{x})$ can be expressed in terms of the first derivatives of $-\ln g(\mathbf{\eta})$. The following derivation shows that the covariance of $\mathbf{u}(\mathbf{x})$ can likewise be expressed in terms of the second derivatives of $-\ln g(\mathbf{\eta})$.
$$\begin{align*}
-\nabla^2\ln g(\mathbf{\eta})&=\nabla((-\nabla\ln g(\mathbf{\eta}))^T)\\
&=\nabla(\mathbb{E}[\mathbf{u}(\mathbf{x})]^T)\\
&=\nabla(\mathbb{E}[\mathbf{u}^T(\mathbf{x})])\\
&=\nabla\left(g(\mathbf{\eta})\int h(\mathbf{x})exp\{\mathbf{\eta}^T\mathbf{u}(\mathbf{x})\}\mathbf{u}^T(\mathbf{x})d\mathbf{x} \right )\\
&=\nabla g(\mathbf{\eta})\int h(\mathbf{x})exp\{\mathbf{\eta}^T\mathbf{u}(\mathbf{x})\}\mathbf{u}^T(\mathbf{x})d\mathbf{x}
+g(\mathbf{\eta})\int h(\mathbf{x})exp\{\mathbf{\eta}^T\mathbf{u}(\mathbf{x})\}\mathbf{u}(\mathbf{x})\mathbf{u}^T(\mathbf{x})d\mathbf{x} \\
&=\frac{\nabla g(\mathbf{\eta})}{g(\mathbf{\eta})}\cdot g(\mathbf{\eta})\int h(\mathbf{x})exp\{\mathbf{\eta}^T\mathbf{u}(\mathbf{x})\}\mathbf{u}^T(\mathbf{x})d\mathbf{x}
+\mathbb{E}[\mathbf{u}(\mathbf{x})\mathbf{u}^T(\mathbf{x})] \\
&=-\mathbb{E}[\mathbf{u}(\mathbf{x})]\mathbb{E}[\mathbf{u}^T(\mathbf{x})]+\mathbb{E}[\mathbf{u}(\mathbf{x})\mathbf{u}^T(\mathbf{x})]\\
&=cov[\mathbf{u}(\mathbf{x})] \tag{Exercise 2.58}
\end{align*}$$
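
These two identities can be checked numerically for the Bernoulli case, where $g(\eta)=\sigma(-\eta)$ and therefore $-\ln g(\eta)=\ln(1+exp(\eta))$. The sketch below uses central finite differences; the step sizes, helper names, and the test value $\mu=0.3$ are assumptions made for illustration.

```python
import math

def neg_log_g(eta):
    """-ln g(eta) for the Bernoulli case, where g(eta) = sigma(-eta)."""
    return math.log(1.0 + math.exp(eta))

def first_derivative(f, a, step=1e-5):
    """Central finite-difference estimate of f'(a)."""
    return (f(a + step) - f(a - step)) / (2.0 * step)

def second_derivative(f, a, step=1e-4):
    """Central finite-difference estimate of f''(a)."""
    return (f(a + step) - 2.0 * f(a) + f(a - step)) / step ** 2

mu = 0.3                                   # arbitrary test value (assumption)
eta = math.log(mu / (1.0 - mu))

mean_estimate = first_derivative(neg_log_g, eta)   # should approximate E[x] = mu
var_estimate = second_derivative(neg_log_g, eta)   # should approximate var[x] = mu * (1 - mu)
print(mean_estimate, var_estimate)
```

Since $u(x)=x$ for the Bernoulli distribution, the first derivative approximates the mean $\mu$ and the second approximates the variance $\mu(1-\mu)$.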

### Maximum likelihood and sufficient statistics
Now consider a set of independent identically distributed data denoted by $\mathbf{X}=\{\mathbf{x}_1,\cdots,\mathbf{x}_N\}$, for which the likelihood function is given by
$$p(\mathbf{X}|\mathbf{\eta})=\left(\prod_{n=1}^Nh(\mathbf{x}_n)\right)g(\mathbf{\eta})^Nexp\left\{\mathbf{\eta}^T\sum_{n=1}^N\mathbf{u}(\mathbf{x}_n)\right\}\tag{2.227}$$
Setting the gradient of $\ln p(\mathbf{X}|\mathbf{\eta})$ with respect to $\mathbf{\eta}$ to zero, we get the following condition to be satisfied by the maximum likelihood estimator $\mathbf{\eta}_{ML}$
$$-\nabla \ln g(\mathbf{\eta}_{ML})=\frac{1}{N}\sum_{n=1}^N\mathbf{u}(\mathbf{x}_n)\tag{2.228}$$
We see that the solution for the maximum likelihood estimator depends on the data only through $\sum_n \mathbf{u}(\mathbf{x}_{n})$, which is therefore called the *sufficient statistic* of the distribution.
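
For the Bernoulli distribution, $-\nabla\ln g(\eta)=\sigma(\eta)$, so $(2.228)$ can be solved in closed form. The following sketch illustrates this; the function name `bernoulli_eta_ml` and the data are made up for the example, and it assumes the sample mean lies strictly between 0 and 1.

```python
import math

def bernoulli_eta_ml(samples):
    """Solve (2.228) for the Bernoulli case, where -d/d(eta) ln g(eta) = sigma(eta)."""
    n = len(samples)
    sufficient_stat = sum(samples)         # sum_n u(x_n) with u(x) = x
    mean_u = sufficient_stat / n
    # sigma(eta) = mean_u  =>  eta = ln(mean_u / (1 - mean_u)); requires 0 < mean_u < 1
    return math.log(mean_u / (1.0 - mean_u))

data = [1, 0, 0, 1, 1, 0, 1, 1]            # made-up observations (assumption)
eta_ml = bernoulli_eta_ml(data)
mu_ml = 1.0 / (1.0 + math.exp(-eta_ml))    # map back to the mean parameter
print(eta_ml, mu_ml)
```

Only `sum(samples)` and `len(samples)` enter the estimate, so any reordering of the data gives the same $\eta_{ML}$.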

------------------

## Conjugate priors
### Prior of exponential family
$$p(\mathbf{\eta}|\chi, v)=f(\chi, v)g(\mathbf{\eta})^v exp\{v\mathbf{\eta}^T\chi\}\tag{2.229}$$
where $f(\chi, v)$ is a normalization coefficient, and $g(\mathbf{\eta})$ is the same function that appears in the exponential family form $(2.194)$.

### Posterior of exponential family
$$\begin{align*}
p(\mathbf{\eta}|\mathbf{X}, \chi,v) &\propto p(\mathbf{X}|\mathbf{\eta})p(\mathbf{\eta}|\chi, v)\\
&=\left(\prod_{n=1}^Nh(\mathbf{x}_n)\right)g(\mathbf{\eta})^Nexp\left\{\mathbf{\eta}^T\sum_{n=1}^N\mathbf{u}(\mathbf{x}_n)\right\}\cdot
f(\chi, v)g(\mathbf{\eta})^v exp\{v\mathbf{\eta}^T\chi\}\\
&\propto g(\mathbf{\eta})^{v+N}exp\left\{ \mathbf{\eta}^T\left( \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n)+v\chi \right) \right\} \tag{2.230}
\end{align*}$$
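
Matching $(2.230)$ against the prior form $(2.229)$ shows that the posterior has the same form with $v$ replaced by $v+N$ and $\chi$ replaced by $\left(v\chi+\sum_n\mathbf{u}(\mathbf{x}_n)\right)/(v+N)$. A minimal sketch of this update for scalar $u(x)$ follows; the function name and the numbers are hypothetical.

```python
def posterior_hyperparameters(chi, nu, u_values):
    """Update (chi, nu) of the conjugate prior (2.229) given observed u(x_n) values.

    Scalar case only: chi and each u(x_n) are plain numbers.
    """
    n = len(u_values)
    nu_post = nu + n                                   # exponent of g(eta) in (2.230)
    chi_post = (nu * chi + sum(u_values)) / nu_post    # so that nu_post * chi_post matches (2.230)
    return chi_post, nu_post

# Made-up prior and Bernoulli observations (assumption), with u(x) = x.
chi_post, nu_post = posterior_hyperparameters(chi=0.5, nu=2, u_values=[1, 0, 0, 1, 1, 0, 1, 1])
print(chi_post, nu_post)
```

Under this reading, $v$ acts like an effective number of prior observations, each contributing the value $\chi$ to the sufficient statistic.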


--------------------
--------------------

# Noninformative priors

In some applications of probabilistic inference, we may have prior knowledge that can be conveniently expressed through the prior distribution. In many cases, however, we may have little idea of what form the distribution should take. We may then seek a form of prior distribution, called a *noninformative prior*, which is intended to have as little influence on the posterior distribution as possible.

Here, we shall introduce two kinds of noninformative priors.

### Translation invariance prior
Translation invariance means that the probability mass assigned to an interval depends only on the length of the interval, not on its position. This is expressed as
$$\int_A^B p(\mu)d\mu=\int_{A-c}^{B-c}p(\mu)d\mu=\int_A^Bp(\mu-c)d\mu \tag{2.234}$$
Because this must hold for all choices of $A$ and $B$, we have
$$p(\mu-c)=p(\mu) \tag{2.235}$$

#### Gaussian translation invariance prior
For a Gaussian distribution, suppose we want a translation-invariant prior with respect to $\mu$. The conjugate prior of a Gaussian $\mathcal{N}(x|\mu,\sigma^2)$ with respect to $\mu$ is again a Gaussian, $\mathcal{N}(\mu|\mu_0, \sigma_0^2)$. Requiring this prior to satisfy the translation-invariance condition $(2.235)$ for all $\mu$ and $c$ gives
$$\begin{align*}
\mathcal{N}(\mu|\mu_0, \sigma_0^2) &=\mathcal{N}(\mu-c|\mu_0, \sigma_0^2)=\mathcal{N}(\mu|(\mu_0+c), \sigma_0^2)\\
\Rightarrow \frac{1}{(2\pi\sigma_0^2)^{1/2}}exp\left\{-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right\}
&= \frac{1}{(2\pi\sigma_0^2)^{1/2}}exp\left\{-\frac{1}{2\sigma_0^2}(\mu-\mu_0-c)^2\right\}\\
\Rightarrow \frac{1}{(2\pi\sigma_0^2)^{1/2}}exp\left\{-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right\}
&= \frac{1}{(2\pi\sigma_0^2)^{1/2}}exp\left\{-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right\}\cdot exp\left\{-\frac{1}{2\sigma_0^2}\left(c^2-2c(\mu-\mu_0)\right)\right\}\\
\Rightarrow exp\left\{-\frac{1}{2\sigma_0^2}\left(c^2-2c(\mu-\mu_0)\right)\right\}&=1\\
\Rightarrow \sigma_0^2 & \to\infty\\
\Rightarrow p(\mu) &\propto \text{const}
\end{align*}$$
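
The limit $\sigma_0^2\to\infty$ can be illustrated numerically: for a Gaussian prior, the ratio $p(\mu-c)/p(\mu)$ approaches 1 as the prior variance grows, which is what condition $(2.235)$ asks for. The values of $\mu$, $c$, and $\mu_0$ below are arbitrary, and the helper names are illustrative.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Univariate Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def shift_ratio(mu, c, mu0, sigma0_sq):
    """Ratio p(mu - c) / p(mu) for a Gaussian prior; 1 means translation invariant."""
    return gaussian_pdf(mu - c, mu0, sigma0_sq) / gaussian_pdf(mu, mu0, sigma0_sq)

# The ratio moves toward 1 as the prior variance increases.
for sigma0_sq in (1.0, 1e2, 1e4, 1e6):
    print(sigma0_sq, shift_ratio(mu=2.0, c=1.5, mu0=0.0, sigma0_sq=sigma0_sq))
```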


### Scale invariance prior
Scale invariance means that the probability mass of an arbitrary interval $[A,B]$ is the same as that of the interval $[A/c, B/c]$ for any $c>0$. This is expressed as
$$\int_A^B p(\sigma)d\sigma=\int_{A/c}^{B/c}p(\sigma)d\sigma=\int_A^Bp(\frac{1}{c}\sigma)\frac{1}{c}d\sigma \tag{2.238}$$
Because this must hold for all choices of $A$ and $B$, we have
$$\begin{align*}
p(\sigma)&=p(\frac{1}{c}\sigma)\frac{1}{c}\tag{2.239}\\
\Rightarrow p(\sigma)&=p(\hat{\sigma})\frac{\hat{\sigma}}{\sigma}\qquad \text{let } \hat{\sigma}=\frac{\sigma}{c}\\
\Rightarrow p(\sigma)\sigma &= p(\hat{\sigma})\hat{\sigma}\\
\Rightarrow p(\sigma)&\propto \frac{1}{\sigma}
\end{align*}$$
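
As a numerical illustration, the unnormalized density $1/\sigma$ assigns the same mass to $[A,B]$ and to $[A/c,B/c]$, namely $\ln(B/A)$. The sketch below uses a simple midpoint rule; the function name, step count, and the values of $A$, $B$, $c$ are assumptions chosen for the example.

```python
import math

def mass_of_reciprocal(a, b, steps=100000):
    """Midpoint-rule estimate of the integral of 1/sigma over [a, b]."""
    width = (b - a) / steps
    return sum(width / (a + (i + 0.5) * width) for i in range(steps))

A, B, c = 2.0, 5.0, 3.0                        # arbitrary interval and scale factor (assumption)
original = mass_of_reciprocal(A, B)            # mass of [A, B]
rescaled = mass_of_reciprocal(A / c, B / c)    # mass of [A/c, B/c]
print(original, rescaled, math.log(B / A))
```

The agreement does not depend on the particular $c$, since $\ln\big((B/c)/(A/c)\big)=\ln(B/A)$.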

#### Gaussian scale invariance prior
For a Gaussian distribution, suppose we want a scale-invariant prior with respect to the precision $\lambda$. The conjugate prior of a Gaussian $\mathcal{N}(x|\mu,\lambda^{-1})$ with respect to $\lambda$ is a Gamma distribution $Gam(\lambda|a_0, b_0)$. Changing variables with $\lambda=1/\sigma^2$, the scale-invariant prior $p(\sigma)\propto 1/\sigma$ corresponds to $p(\lambda)\propto 1/\lambda$. If the Gamma prior is to satisfy scale invariance, it must therefore take the form
$$\begin{align*}
Gam(\lambda|a_0, b_0) &\propto \lambda^{-1} \\
\Rightarrow \frac{1}{\Gamma(a_0)}b_0^{a_0}\lambda^{a_0-1}exp(-b_0\lambda) &\propto \lambda^{-1}\\
\Rightarrow a_0=b_0&=0
\end{align*}$$