# 4 Exponential Family

## Parametric Family

A parametric family is a set of probability measures $\mathbb P_\theta$ on $(\Omega,\mathcal F)$, each indexed by a parameter $\theta\in \Theta\subset \mathbb R^d$. The set $\Theta$ is called the parameter space and $d$ is its dimension.

A particular probability measure in the parametric family is called a parametric model.

### Identifiability

A parametric family $\{\mathbb P_\theta: \ \theta\in\Theta\}$ is said to be identifiable if $\theta\mapsto \mathbb P_{\theta}$ is an injection, i.e. $\mathbb P_{\theta_1} = \mathbb P_{\theta_2}$ implies $\theta_1 = \theta_2$.

## Exponential Family

### Exponential Family 

A parametric family $\{\mathbb P_{\theta}:\ \theta\in\Theta\}$ dominated by a $\sigma$-finite measure $\nu$ on $(\Omega,\mathcal F)$ is called an exponential family if and only if it takes the form

$$\frac{d \mathbb P_\theta}{d\nu}(\omega) = h(\omega) \exp\left\{\eta(\theta)^TT(\omega) - \xi(\theta)\right\}\quad\quad  \omega\in\Omega\quad\quad\quad (\star)$$

where $T(\omega)$ is a random $p$-vector and $h(\omega)$ is a nonnegative random variable that does not depend on $\theta$. The function $\eta$ maps $\Theta$ to $\mathbb R^p$, while $\xi(\theta)$ is a scaling factor (the log-normalizer) that ensures $\int_{\Omega} \frac{d \mathbb P_\theta}{d\nu}d\nu = 1$, i.e. $\xi(\theta) = \log\int_{\Omega} h(\omega)\exp\{\eta(\theta)^TT(\omega)\}d\nu(\omega)$.


<br>

NOTE: The form in definition $(\star)$ is not unique (for instance we can multiply $h(\omega)$ by a constant while adding the logarithm of the constant to the scaling factor $\xi(\theta)$).

#### Joint Distribution

If $X_1,\dotsc,X_n$ are i.i.d. from $\mathbb P_\theta$, then the joint distribution is also an exponential family: 

$$f_\theta(x_1,\dotsc,x_n) = \left[\prod_{k=1}^n h(x_k)\right] \exp\left\{\eta(\theta)^T\sum_{k=1}^nT(x_k) - n\xi(\theta)\right\}.
$$

#### Canonical Form

We can reparametrize by $\eta = \eta(\theta)$, so that $\eta$ ranges over $\{\eta(\theta):\ \theta\in\Theta\}$. The exponential family then has parameter $\eta$ and takes the equivalent form

$$\frac{d \mathbb P_\eta}{d\nu}(\omega) = h(\omega) \exp\left\{\eta^TT(\omega) - \zeta(\eta)\right\} \quad\quad \omega\in\Omega.$$

This is called the canonical form of the exponential family, where $\zeta(\eta) = \log\int_{\Omega} h(\omega)\exp\{\eta^TT(\omega)\}d\nu(\omega)$ is the normalizing term. An exponential family that is already in canonical form is called a natural exponential family, and $\eta$ is called the natural parameter.

#### Natural Parameter Space

The natural parameter space $\Xi$ of the natural exponential family is defined by
$$\Xi = \left\{\eta: \int_{\Omega} h(\omega)\exp \{\eta^T T(\omega)\} d\nu(\omega)<\infty\right\}.$$

The natural exponential family is said to be of full rank if the natural parameter space $\Xi$ contains a nonempty open set.

**Theorem** When $\Xi$ is not empty, it is convex.

**Proof** Let $\eta_1,\eta_2\in \Xi$ and $\alpha \in (0,1)$. Recall Hölder's inequality for nonnegative $f,g$: $\int f^\alpha g^{1 -\alpha}d\nu\leqslant \left(\int fd\nu\right)^\alpha \left(\int gd\nu\right)^{1 - \alpha}$. Applying it with $f = he^{\eta_1^TT}$ and $g = he^{\eta_2^TT}$ gives
$$ \int_{\Omega} h(\omega)e^{[\alpha\eta_1 + (1 - \alpha)\eta_2]^T T(\omega)} d\nu
\leqslant \left(\int_{\Omega} h(\omega)e^{\eta_1^T T(\omega)} d\nu \right)^\alpha \left(\int_{\Omega} h(\omega)e^{\eta_2^T T(\omega)} d\nu\right)^{1-\alpha}<\infty,
$$
so $\alpha\eta_1 + (1 - \alpha)\eta_2\in \Xi$.

### Examples

Several common families of distributions are exponential families.

#### Poisson Distribution

Recall that the Poisson distribution with parameter $\lambda > 0$ has probability mass function

$$p(x) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!}\cdot \exp\{x\log \lambda - \lambda\}.$$

Taking $\eta(\lambda) = \log\lambda$, $T(x) = x$, $\xi(\lambda)=\lambda$ and $h(x) = \frac{1}{x!}$, we see that the Poisson distribution is an exponential family with parameter $\lambda$.
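The identification above can be checked numerically. The following is a minimal sketch, not part of the original notes: the function names `poisson_pmf` and `poisson_expfam` are hypothetical, and the second one simply plugs $\eta$, $T$, $\xi$ and $h$ from the text into $h(x)\exp\{\eta T(x) - \xi\}$.

```python
import math


def poisson_pmf(x: int, lam: float) -> float:
    """Textbook form: lam^x * exp(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def poisson_expfam(x: int, lam: float) -> float:
    """Exponential-family form h(x) * exp(eta * T(x) - xi)."""
    eta = math.log(lam)          # eta(lambda) = log(lambda)
    t_stat = x                   # T(x) = x
    xi = lam                     # xi(lambda) = lambda
    h = 1.0 / math.factorial(x)  # h(x) = 1 / x!
    return h * math.exp(eta * t_stat - xi)


# Print both forms side by side; the two columns should agree up to rounding.
for x in range(5):
    print(x, poisson_pmf(x, 2.5), poisson_expfam(x, 2.5))
```

The value $\lambda = 2.5$ is an arbitrary choice for illustration; any $\lambda > 0$ should behave the same way.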


#### Gaussian Distribution

Let $[\mu,\sigma]\in\mathbb R\times (0,+\infty)$ be the parameters. A Gaussian random variable $X$ has density

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}
= \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2\sigma^2}x^2+\frac{\mu}{\sigma^2}x -\frac{\mu^2}{2\sigma^2} -\log\sigma \right\}$$

Taking $\eta([\mu,\sigma]) = [-\frac{1}{2\sigma^2},\frac{\mu}{\sigma^2}]^T$, $T(x) = [x^2,x]^T$, $\xi([\mu,\sigma])=\frac{\mu^2}{2\sigma^2} +\log\sigma$ and $h(x) = \frac{1}{\sqrt{2\pi}}$, we see that the Gaussian distribution is an exponential family with parameter $[\mu,\sigma]$.
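As a worked illustration (derived here from the expressions above, not stated in the original notes), the Gaussian family can be put in canonical form by writing $\eta = [\eta_1,\eta_2]^T = [-\frac{1}{2\sigma^2},\frac{\mu}{\sigma^2}]^T$ and solving for $\mu$ and $\sigma$:

$$\sigma^2 = -\frac{1}{2\eta_1},\quad\quad \mu = -\frac{\eta_2}{2\eta_1},\quad\quad \zeta(\eta) = \frac{\mu^2}{2\sigma^2}+\log\sigma = -\frac{\eta_2^2}{4\eta_1} - \frac12\log(-2\eta_1).$$

The integral $\int_{\mathbb R} e^{\eta_1x^2+\eta_2x}dx$ is finite exactly when $\eta_1<0$, so the natural parameter space is $\Xi = (-\infty,0)\times\mathbb R$. This set is open, so the Gaussian family is of full rank.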

### Moment Generating Function

For a natural exponential family, let $\eta_0$ be an interior point of the natural parameter space $\Xi$. Then the moment generating function of $T(X)$ under $\mathbb P_{\eta_0}$, namely $\psi_{\eta_0}(t) = \mathbb E(e^{t^TT(X)})$, is given for $t$ in a neighborhood of $0$ by

$$\psi_{\eta_0}(t) = \exp\left\{\zeta (\eta_0 + t) - \zeta(\eta_0)\right\}.$$

**Proof** Since $\eta_0$ is an interior point of $\Xi$, we have $\eta_0+t\in\Xi$ for sufficiently small $t$. Hence, using the fact that $\int h(x)e^{(\eta_0+t)^TT(x) -\zeta(\eta_0+t)}d\nu(x) = 1$, we obtain

$$\psi_{\eta_0}(t)= \mathbb E(e^{t^TT(X)})=\int h(x)e^{(\eta_0+t)^TT(x) - \zeta(\eta_0)} d\nu(x)
=e^{\zeta (\eta_0 + t) - \zeta(\eta_0)}\int h(x)e^{(\eta_0+t)^TT(x) - \zeta(\eta_0+t)} d\nu(x)
=e^{\zeta (\eta_0 + t) - \zeta(\eta_0)}.$$
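The formula can be sanity-checked on the Poisson family from the examples, where $\eta = \log\lambda$ and $\xi(\lambda)=\lambda$ give $\zeta(\eta) = e^{\eta}$. The sketch below is an illustration under that assumption; the function names and the truncation at 200 terms are choices made here, not part of the notes.

```python
import math


def zeta(eta: float) -> float:
    """Log-normalizer of the Poisson family in canonical form: zeta(eta) = exp(eta)."""
    return math.exp(eta)


def mgf_from_zeta(eta0: float, t: float) -> float:
    """MGF of T(X) = X via the formula exp(zeta(eta0 + t) - zeta(eta0))."""
    return math.exp(zeta(eta0 + t) - zeta(eta0))


def mgf_by_summation(eta0: float, t: float, n_terms: int = 200) -> float:
    """MGF computed directly as a truncated sum of exp(t * x) * P(X = x)."""
    lam = math.exp(eta0)
    pmf = math.exp(-lam)      # P(X = 0)
    total = 0.0
    for x in range(n_terms):
        total += math.exp(t * x) * pmf
        pmf *= lam / (x + 1)  # P(X = x + 1) obtained from P(X = x)
    return total


eta0, t = math.log(2.0), 0.5  # lambda = 2, a small value of t
print(mgf_from_zeta(eta0, t), mgf_by_summation(eta0, t))
```

Both numbers should agree with the familiar Poisson moment generating function $\exp\{\lambda(e^t-1)\}$.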

## Sufficiency

Given a parametric family $\mathcal P=\{\mathbb P_\theta: \ \theta\in\Theta\}$, let $X$ denote a sample (for instance, several i.i.d. observations). A statistic $T(X)$ is said to be sufficient if and only if the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$.

<br>

**Example** Let $X = \{X_1,\dotsc,X_n\}$ be an i.i.d. sample from the Bernoulli distribution

$$f_p (x) = p^x(1-p)^{1-x}\mathbb I_{x\in\{0,1\}}\quad\quad p\in (0,1).$$

Take $T(X) = \bar X =\frac 1n\sum_{i=1}^nX_i$. The sum $\sum_{i=1}^n X_i = nT(X)$ follows the binomial distribution with parameters $n$ and $p$, so for fixed $x_1,\dotsc,x_n$ such that $x_i\in\{0,1\}$ and $t =\frac1n\sum_{i=1}^nx_i$,

$$f_p\left[x_1,\dotsc,x_n\left|\frac1n\sum_{i=1}^nx_i=t\right. \right] = 
\frac{\prod_{k=1}^n p^{x_k}(1-p)^{1-x_k} }
{\binom{n}{nt}p^{nt}(1-p)^{n-nt}}
=\frac{p^{nt}(1-p)^{n-nt}}{\binom{n}{nt}p^{nt}(1-p)^{n-nt}}
=\frac{1}{\binom{n}{nt}}.
$$

The conditional probability does not vary with the parameter $p$, so $T(X) =\frac1n\sum_{i=1}^n X_i$ is sufficient for the Bernoulli family.
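For small $n$ the conditional probability in this example can be verified by brute force. The sketch below is only an illustration: `conditional_prob` is a hypothetical helper that enumerates all 0/1 sequences of length $n$, and the sequence and values of $p$ are arbitrary choices.

```python
import itertools
import math


def conditional_prob(x, p):
    """P(X = x | sum(X) = sum(x)) for i.i.d. 0/1 observations with success probability p."""
    n, s = len(x), sum(x)
    joint = p ** s * (1 - p) ** (n - s)
    # P(sum(X) = s): add the joint probability of every sequence with the same sum.
    marginal = sum(
        p ** sum(y) * (1 - p) ** (n - sum(y))
        for y in itertools.product((0, 1), repeat=n)
        if sum(y) == s
    )
    return joint / marginal


x = (1, 0, 1, 0)
for p in (0.2, 0.5, 0.9):
    # The middle column should match 1 / C(n, s) for every p.
    print(p, conditional_prob(x, p), 1 / math.comb(len(x), sum(x)))
```

The conditional probability is the same for every $p$ tried, which is what sufficiency of the sample mean requires.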

<br>


### Factorization Theorem

**Theorem** [(Fisher-Neyman)](https://encyclopediaofmath.org/wiki/Factorization_theorem) A statistic $T(X)$ is sufficient for a distribution family $\mathcal P$ on $(\mathbb R^n,\mathcal B^n)$ dominated by a $\sigma$-finite measure $\nu$ if and only if there exist a nonnegative Borel function $h(x)$ that does not depend on $\mathbb P$ and a nonnegative Borel function $g_{\mathbb P}(t)$ on the range of $T$ (which depends on the particular $\mathbb P\in\mathcal P$) such that 
$$\frac{d\mathbb P}{d\nu}(x)= g_{\mathbb  P}(T(x))h(x).$$

**Corollary** The $T(x)$ in the definition of exponential family is sufficient. (It is replaced with $\sum_{i=1}^n T(x_i)$ for i.i.d. sample as joint distribution.)
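To see why, the i.i.d. joint density from the Joint Distribution subsection can be matched against the factorization. The grouping below is one possible choice, written out here as an illustration:

$$f_\theta(x_1,\dotsc,x_n) = \underbrace{\exp\left\{\eta(\theta)^T\sum_{k=1}^nT(x_k) - n\xi(\theta)\right\}}_{g_\theta\left(\sum_{k=1}^n T(x_k)\right)}\cdot\underbrace{\prod_{k=1}^n h(x_k)}_{\text{does not depend on }\theta}.$$

The first factor depends on the data only through $\sum_{k=1}^n T(x_k)$, and the second factor is free of $\theta$, so the factorization theorem applies.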

**Corollary** For an i.i.d. sample from a continuous distribution, the order statistics are sufficient.

 

If $T$ is already sufficient, adding more information to the statistic preserves sufficiency. This motivates looking for a sufficient statistic that carries as little information as possible, i.e. a minimal sufficient statistic.

### Minimal Sufficiency

If $T$ is sufficient for a distribution family $\mathcal P$ and for any other sufficient statistic $S$ there exists a measurable function $\psi$ such that $\psi(S) = T$ almost surely under every $\mathbb P\in\mathcal P$, then we say $T$ is a minimal sufficient statistic.

A minimal sufficient statistic is not unique (because we can apply a bijective transformation to it).

### Completeness

Given a statistic $T$ on a distribution family $\mathcal P$, if every Borel function $f$ with $\mathbb E(f(T))=0$ for all $\mathbb P\in\mathcal P$ satisfies $f(T)=0$ almost surely for all $\mathbb P\in\mathcal P$, then $T$ is said to be complete.

**Theorem** If a statistic is complete and sufficient, then it is minimal sufficient.

**Theorem** If $\mathcal P$ is an exponential family of full rank, then the $T(x)$ in the definition is complete and sufficient. (It is replaced with $\sum_{i=1}^n T(x_i)$ for i.i.d. sample as joint distribution.)

<br>

### Lehmann-Scheffé Theorem 

**Theorem** [[Ref](https://faculty.math.illinois.edu/~r-ash/Stat/StatLec16-20.pdf)]  Suppose $T$ is a complete and sufficient statistic for a distribution family $\mathcal P$. If a function $\varphi(T(X))$ of $T$ is an unbiased estimator of some parameter $\theta$, then it has minimum variance among all unbiased estimators $\psi(X)$ of $\theta$. Such an estimator $\varphi(T(X))$ is called a UMVUE (uniformly minimum variance unbiased estimator).


#### Rao-Blackwell Theorem

**Theorem** If $V(X)$ is an unbiased estimator of $\theta$ and $T(X)$ is a complete and sufficient statistic, then $\mathbb E(V|T)$ can be represented as a function of $T$ and is a UMVUE for $\theta$.

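**Proof** Since $T$ is sufficient, the conditional distribution of $X$ given $T$ does not depend on $\theta$, so $\mathbb E(V|T)$ is a statistic, and by the definition of conditional expectation it is a function of $T$. By the tower property, $\mathbb E[\mathbb E(V|T)] = \mathbb E(V) = \theta$, so it is unbiased. The Lehmann-Scheffé theorem then shows that it is a UMVUE.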


#### Example

Let $X_1,\dotsc,X_n$ be i.i.d. from normal distribution $N(\mu,\sigma^2)$ where $\mu$ and $\sigma^2$ are both parameters. Suppose we want to find UMVUE for $\mu^2$.

**Solution** Let $T(X)=a(\sum_{i=1}^n X_i)^2 + b\sum_{i=1}^n X_i^2$. Since $\mathbb E[(\sum_{i=1}^n X_i)^2] = n\sigma^2+n^2\mu^2$ and $\mathbb E[\sum_{i=1}^n X_i^2] = n(\sigma^2+\mu^2)$, we get $\mathbb E(T)=(an+b)n\mu^2+(a+b)n\sigma^2$. Take $a = \frac{1}{n(n-1)}$ and $b=-a$ (for $n\geqslant 2$) so that $\mathbb E(T)=\mu^2$. Recall that $[\sum_{i=1}^n X_i,\sum_{i=1}^n X_i^2]$ is a complete and sufficient statistic because the Gaussian family is a full-rank exponential family. Thus $T$, as a function of the complete and sufficient statistic, is the UMVUE by the Lehmann-Scheffé theorem.
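Unbiasedness of this estimator can be checked by simulation. The following is a rough Monte Carlo sketch using only the standard library; the helper names, the seed, and the values of $\mu$, $\sigma$, $n$ and the number of repetitions are all arbitrary choices made for illustration.

```python
import random


def umvue_mu_squared(xs):
    """a * (sum x)^2 + b * sum x^2 with a = 1 / (n (n - 1)) and b = -a."""
    n = len(xs)
    a = 1.0 / (n * (n - 1))
    s1 = sum(xs)
    s2 = sum(x * x for x in xs)
    return a * s1 * s1 - a * s2


def monte_carlo_mean(mu, sigma, n, reps, seed=0):
    """Average of the estimator over `reps` simulated samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        total += umvue_mu_squared(xs)
    return total / reps


# The simulated average should be close to mu^2 = 1.0, up to Monte Carlo error.
print(monte_carlo_mean(mu=1.0, sigma=1.0, n=5, reps=50000))
```

The simulation only illustrates unbiasedness; the minimum-variance property comes from the Lehmann-Scheffé argument, not from the code.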

### Ancillary

Given a statistic $T$ on a distribution family $\mathcal P$, if the distribution of $T(X)$ is the same under every $\mathbb P\in\mathcal P$, then $T$ is said to be ancillary.

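### Basu's Theorem

**Theorem** If $T$ is a complete and sufficient statistic for $\mathcal P$ and $V$ is an ancillary statistic, then $T$ and $V$ are independent under every $\mathbb P\in\mathcal P$.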



## References

[1] https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf 

[2] https://faculty.math.illinois.edu/~r-ash/Stat/StatLec16-20.pdf 