# Conjugate analysis

When the posterior distribution has the same parametric form as the prior distribution, we say that we have a conjugate model.

Formally, if $p(Y|\theta)\in\mathcal{F}$ and $\mathcal{P}$ is a family of prior distributions for $\theta$, then $\mathcal{P}$ is conjugate for $\mathcal{F}$ if $p(\theta|\mathbf{Y})\in\mathcal{P}$ for all $p(\cdot|\theta)\in\mathcal{F}$ and $p(\cdot)\in\mathcal{P}$.

## Conjugate models for the exponential family

When the likelihood of the data belongs to the exponential family of distributions, it is possible to obtain the form of the conjugate prior.

Let $p(Y|\theta)\in\mathcal{F}$, where $\mathcal{F}$ is the exponential family of distributions. Then, by definition of the family, $p(Y|\theta)$ can be expressed as:
$$p(Y|\theta)=f(y)g(\theta)\exp\left\lbrace\phi^T(\theta)u(y)\right\rbrace,$$

where $\phi(\theta)$ and $u(y)$ are, in general, of the same dimension as $\theta$, and $\phi(\theta)$ is called the natural parameter of the family $\mathcal{F}$.

If $Y_1,\ldots,Y_n$ are independent and identically distributed random variables, then

$$
\begin{align*}
p(\mathbf{Y}|\theta) &= \left(\prod_{i=1}^n f(y_i)\right)g^n(\theta)\exp\left\lbrace\phi^T(\theta)\sum_{i=1}^n u(y_i)\right\rbrace \\
&\propto g^n(\theta)e^{\phi^T(\theta)t(\mathbf{y})},
\end{align*}
$$

where $t(\mathbf{y})=\sum_{i=1}^n u(y_i)$ is called the minimal sufficient statistic; its sufficiency follows from the Fisher-Neyman factorization theorem.

If we consider a prior of the form

$$p(\theta)\propto g(\theta)^\eta e^{\phi^T(\theta)\xi},$$ 

then

$$p(\theta|\mathbf{Y})\propto g(\theta)^{n+\eta}e^{\phi^T(\theta)(\xi+t(\mathbf{y}))}.$$
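
This follows from Bayes' theorem: multiplying the likelihood kernel by the prior kernel and collecting the powers of $g(\theta)$ and the exponents gives

$$
\begin{align*}
p(\theta|\mathbf{Y}) &\propto p(\mathbf{Y}|\theta)\,p(\theta) \\
&\propto g^n(\theta)e^{\phi^T(\theta)t(\mathbf{y})}\,g(\theta)^\eta e^{\phi^T(\theta)\xi} \\
&= g(\theta)^{n+\eta}e^{\phi^T(\theta)(\xi+t(\mathbf{y}))}.
\end{align*}
$$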

Moreover, this structure lets us interpret $\eta$ as the *a priori* "sample size" and $\xi$ as the *a priori* "minimal sufficient statistic".
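
As a minimal sketch of this bookkeeping, the posterior hyperparameters are obtained by adding the sample size to $\eta$ and the observed statistic to $\xi$. The function name `conjugate_update`, the argument `u` (standing for $u(y)$, taken as the identity by default) and the numbers used are illustrative assumptions, not part of the derivation above.

```python
def conjugate_update(eta, xi, ys, u=lambda y: y):
    """Return the posterior hyperparameters (eta + n, xi + t(y)).

    eta, xi : prior hyperparameters (prior "sample size" and prior statistic)
    ys      : observed sample y_1, ..., y_n
    u       : the function u(y) of the exponential family (assumed scalar here)
    """
    n = len(ys)                  # sample size
    t = sum(u(y) for y in ys)    # t(y) = sum of u(y_i)
    return eta + n, xi + t


# Hypothetical prior: 2 pseudo-observations whose statistic sums to 1
eta_post, xi_post = conjugate_update(eta=2.0, xi=1.0, ys=[1, 0, 1, 1])
print(eta_post, xi_post)
```

Only the pair $(n, t(\mathbf{y}))$ enters the update, so the data can be discarded once these two summaries are stored.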

## Binomial distribution

Let $Y_1,\ldots,Y_n|\theta\overset{iid}{\sim}\textsf{Bernoulli}(\theta)$, then

$$
\begin{align*}
p(\mathbf{Y}|\theta) & = \theta^{\sum_{i=1}^n y_i}(1-\theta)^{n-\sum_{i=1}^n y_i} \underbrace{\prod_{i=1}^n 1_{\{0,1\}}(y_i)}_{f(\mathbf{y})} \\
& = f(\mathbf{y})\exp\left\lbrace\sum_{i=1}^n y_i\log\theta + \left(n-\sum_{i=1}^n y_i\right)\log (1-\theta)\right\rbrace \\
& = f(\mathbf{y})\exp\left\lbrace n \log (1-\theta) + \sum_{i=1}^n y_i\log\frac{\theta}{1-\theta}\right\rbrace \\
& = f(\mathbf{y})\underbrace{(1-\theta)^n}_{g(\theta)^n}\exp\underbrace{\left\lbrace\sum_{i=1}^n y_i\log\frac{\theta}{1-\theta}\right\rbrace}_{t(\mathbf{y})\phi(\theta)}.
\end{align*}
$$
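
The factorization can be checked numerically. The sketch below compares the direct product of Bernoulli probabilities with $g(\theta)^n\exp\{t(\mathbf{y})\phi(\theta)\}$, using $g(\theta)=1-\theta$ and $\phi(\theta)=\log\frac{\theta}{1-\theta}$. The sample `ys`, the value of `theta` and the function names are hypothetical choices for illustration.

```python
import math


def bernoulli_pmf(y, theta):
    """Probability of a single Bernoulli outcome y in {0, 1}."""
    return theta ** y * (1 - theta) ** (1 - y)


def exp_family_form(ys, theta):
    """Evaluate g(theta)^n * exp(t(y) * phi(theta)) for Bernoulli data."""
    n = len(ys)
    g = 1 - theta                           # g(theta)
    phi = math.log(theta / (1 - theta))     # natural parameter (log-odds)
    t = sum(ys)                             # t(y), number of successes
    return g ** n * math.exp(phi * t)


ys = [1, 0, 1, 1]    # hypothetical sample
theta = 0.3          # hypothetical parameter value, must lie in (0, 1)

direct = math.prod(bernoulli_pmf(y, theta) for y in ys)
factored = exp_family_form(ys, theta)
print(math.isclose(direct, factored))
```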

Therefore, the conjugate prior is given by

$$
\begin{align*}
p(\theta) & \propto (1-\theta)^\eta\exp\left\lbrace\xi\log\frac{\theta}{1-\theta}\right\rbrace \\
& = (1-\theta)^{\eta-\xi}\theta^{\xi},
\end{align*}
$$

where we can interpret $\eta$ and $\xi$ as:

- $\eta$: number of Bernoulli experiments *a priori*,
- $\xi$: number of successes *a priori*,

so $\eta-\xi$ would be the number of failures *a priori*.

Let $\alpha-1=\xi$ and $\beta-1=\eta-\xi$, then the conjugate prior can be expressed as

$$p(\theta)\propto \theta^{\alpha-1}(1-\theta)^{\beta-1},$$

which we recognize as the kernel of a $\textsf{Beta}(\alpha,\beta)$ distribution.
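
Combining this prior with the Bernoulli likelihood gives a posterior kernel $\theta^{\alpha+\sum y_i-1}(1-\theta)^{\beta+n-\sum y_i-1}$, that is, a $\textsf{Beta}\left(\alpha+\sum_{i=1}^n y_i,\ \beta+n-\sum_{i=1}^n y_i\right)$ distribution. The following sketch computes the updated parameters and checks the posterior mean against a simple grid approximation. The function names, the prior $\textsf{Beta}(2,2)$ and the data are assumptions made for illustration.

```python
def beta_bernoulli_posterior(alpha, beta, ys):
    """Posterior Beta parameters after observing Bernoulli data ys."""
    s = sum(ys)        # observed successes
    n = len(ys)        # observed trials
    return alpha + s, beta + n - s


def grid_posterior_mean(alpha, beta, ys, num_points=20000):
    """Posterior mean from prior kernel times likelihood (midpoint rule)."""
    s, n = sum(ys), len(ys)
    h = 1.0 / num_points
    num = den = 0.0
    for i in range(num_points):
        theta = (i + 0.5) * h
        kernel = theta ** (alpha - 1 + s) * (1 - theta) ** (beta - 1 + n - s)
        num += theta * kernel
        den += kernel
    return num / den


ys = [1, 0, 1, 1, 0, 1]    # hypothetical data: 4 successes in 6 trials
a_post, b_post = beta_bernoulli_posterior(2.0, 2.0, ys)
closed_form_mean = a_post / (a_post + b_post)    # mean of a Beta(a, b)
numeric_mean = grid_posterior_mean(2.0, 2.0, ys)
print(a_post, b_post, closed_form_mean, numeric_mean)
```

The two means agree up to the discretization error of the grid, which is what conjugacy predicts.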


## Poisson distribution

Let $Y_1,\ldots,Y_n|\theta\overset{iid}{\sim}\textsf{Poisson}(\theta)$, then

$$
\begin{align*}
p(\mathbf{Y}|\theta) & = \underbrace{\left\lbrack\prod_{i=1}^n \frac{1}{y_i!}1_{\{0,1,\ldots\}}(y_i)\right\rbrack}_{f(\mathbf{y})}\theta^{\sum_{i=1}^n y_i}\exp\{-n\theta\} \\
& = f(\mathbf{y})\underbrace{\left(e^{-\theta}\right)^n}_{g(\theta)^n}\exp\underbrace{\left\{\sum_{i=1}^n y_i\log\theta\right\}}_{t(\mathbf{y})\phi(\theta)}.
\end{align*}
$$

Therefore, the conjugate prior is given by

$$
\begin{align*}
p(\theta) & \propto e^{-\eta\theta}\exp\{\xi\log\theta\} \\
& = \theta^\xi e^{-\eta\theta}.
\end{align*}
$$

Let $\xi=\alpha-1$ and $\eta=\beta$, then the conjugate prior can be expressed as

$$p(\theta)\propto\theta^{\alpha-1}e^{-\beta\theta},$$

which we recognize as the kernel of a $\textsf{Gamma}(\alpha,\beta)$ distribution.

Note that $\alpha-1$ plays the role of the sum of prior counts and $\beta$ that of the number of prior experiments. Therefore, if no previous experiments are available, we can set $\alpha=1$ and $\beta=0$, in which case $p(\theta)\propto 1_{(0,\infty)}(\theta)$, a flat (improper) prior on $(0,\infty)$.
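
Multiplying the Gamma prior kernel by the Poisson likelihood gives $\theta^{\alpha+\sum y_i-1}e^{-(\beta+n)\theta}$, so the posterior is $\textsf{Gamma}\left(\alpha+\sum_{i=1}^n y_i,\ \beta+n\right)$, with $\beta$ acting as a rate parameter. A minimal sketch of this update follows. The function names and the counts in `ys` are hypothetical.

```python
def gamma_poisson_posterior(alpha, beta, ys):
    """Posterior Gamma (shape, rate) after observing Poisson counts ys."""
    return alpha + sum(ys), beta + len(ys)


def gamma_mean(shape, rate):
    """Mean of a Gamma distribution in the shape/rate parameterization."""
    return shape / rate


ys = [3, 1, 4, 2]    # hypothetical counts: sum is 10, n is 4

# Flat prior (alpha = 1, beta = 0): the posterior is proper once n > 0
shape_flat, rate_flat = gamma_poisson_posterior(1, 0, ys)

# Hypothetical informative prior: alpha = 2, beta = 1
shape_inf, rate_inf = gamma_poisson_posterior(2.0, 1.0, ys)

print(shape_flat, rate_flat, gamma_mean(shape_flat, rate_flat))
print(shape_inf, rate_inf, gamma_mean(shape_inf, rate_inf))
```

With the flat prior, the posterior mean is $\left(1+\sum y_i\right)/n$, slightly above the sample mean. With an informative prior it is pulled towards the prior mean $\alpha/\beta$.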