## I. Exponential Families

- Definition of Exponential Families:
    - A family of pdfs or pmfs indexed by parameter(s) $\theta$ is called an exponential family if it can be written as
    - $f(x\mid\theta)=h(x) c(\theta) \exp\left(\sum_{i=1}^kw_i(\theta)t_i(x)\right),\,\forall x\in\mathbb{R}$
        - $h(x),t_1(x),\ldots,t_k(x)$ are functions of $x$ only (not $\theta$)
        - $c(\theta),w_1(\theta),\ldots,w_k(\theta)$ are functions of $\theta$ only (not $x$)
        - $h(x)\geq0,\forall x$ and $c(\theta)\geq0,\forall\theta$

### $N(\mu,\sigma^2)$ is an exponential family.

$\begin{aligned}
f(x)&=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^{2}}(x^{2}-2x\mu+\mu^{2})}\\
    &=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\mu^2}{2\sigma^{2}}} \cdot e^{-\frac{1}{2\sigma^{2}}x^2 + x\frac{\mu}{\sigma^2}}\\
\end{aligned}$
- Indicator function: A handy tool to get more compact expressions of pdf/pmf:
    $I_A(x)=\left\{\begin{array}{cc}1&x\in A\\0&x\notin A\end{array}\right.$
    
- $k = 2$
- $h(x)=I_{\mathbb{R}}(x)\frac{1}{\sqrt{2\pi}}$
- $c(\mu,\sigma)=\frac{1}{\sigma}e^{-\frac{\mu^2}{2\sigma^{2}}}$
- $w_1(\mu,\sigma)=-\frac{1}{2\sigma^2}$
- $t_1(x)=x^2$
- $w_2(\mu,\sigma)=\frac{\mu}{\sigma^2}$
- $t_2(x)=x$
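The factorization above can be sanity-checked numerically. The following is a minimal sketch, not part of the original notes: the helper name `normal_expfam_pdf` and the test values of $\mu$ and $\sigma$ are illustrative assumptions, and the reference density comes from Python's standard-library `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def normal_expfam_pdf(x, mu, sigma):
    """N(mu, sigma^2) density assembled from h, c, w_i, t_i (hypothetical helper)."""
    h = 1.0 / math.sqrt(2.0 * math.pi)                 # h(x); the indicator on R is always 1
    c = math.exp(-mu**2 / (2.0 * sigma**2)) / sigma    # c(mu, sigma)
    w1, t1 = -1.0 / (2.0 * sigma**2), x**2             # w_1(mu, sigma), t_1(x)
    w2, t2 = mu / sigma**2, x                          # w_2(mu, sigma), t_2(x)
    return h * c * math.exp(w1 * t1 + w2 * t2)


# Arbitrary illustrative parameter values (assumptions, not from the notes).
mu, sigma = 1.5, 2.0
for x in (-3.0, 0.0, 1.5, 4.2):
    direct = NormalDist(mu, sigma).pdf(x)
    assert math.isclose(normal_expfam_pdf(x, mu, sigma), direct, rel_tol=1e-9)
```

If the assertions pass, the product $h(x)c(\mu,\sigma)\exp(w_1t_1+w_2t_2)$ agrees with the usual normal pdf at the sampled points.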

### Binomial($n,p$) is an exponential family, if $n$ is known (fixed)
- Define: $X\sim$ Binomial($n,p$), where $n$ is fixed, $0<p<1$, and $A = \{0,1,2,\ldots,n\}$
    $\begin{aligned}
    f(x)&=\binom{n}{x}p^{x}(1-p)^{n-x}I_{A}(x)\\
        &=\binom{n}{x}I_{A}(x)\exp\left(\ln\left(p^x(1-p)^{n-x}\right)\right)\\
        &=\binom{n}{x}I_{A}(x)\exp\left(\ln(p^x)+\ln\left((1-p)^{n-x}\right)\right)\\
        &=\binom{n}{x}I_{A}(x)\exp\left(x\ln(p)+(n-x)\ln(1-p)\right)\\
        &=\binom{n}{x}I_{A}(x)\exp\left(x(\ln(p)-\ln(1-p))+n\ln(1-p)\right)\\
        &=\binom{n}{x}I_{A}(x)\exp\left(x\ln\frac{p}{1-p}+n\ln(1-p)\right)\\
        &=\binom{n}{x}I_{A}(x)(1-p)^n\exp\left(x\ln\frac{p}{1-p}\right)
    \end{aligned}$
- $k = 1$
- $h(x)=\binom{n}{x}I_{A}(x)$
- $c(p)=(1-p)^n$
- $w_1(p)=\ln\frac{p}{1-p}$
- $t_1(x)=x$
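
As a quick check of the last line of the derivation, the sketch below rebuilds the pmf from $h(x)=\binom{n}{x}I_A(x)$, $c(p)=(1-p)^n$, $w_1(p)=\ln\frac{p}{1-p}$, and $t_1(x)=x$, then compares it with the usual binomial formula. The function names and the values of $n$ and $p$ are illustrative assumptions, and the code assumes $0<p<1$ and integer $x$.

```python
import math


def binomial_pmf(x, n, p):
    """Textbook Binomial(n, p) pmf for integer x in {0, ..., n}."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_expfam_pmf(x, n, p):
    """Same pmf assembled as h(x) c(p) exp(w_1(p) t_1(x)) (hypothetical helper)."""
    if x not in range(n + 1):            # indicator I_A(x)
        return 0.0
    h = math.comb(n, x)                  # h(x)
    c = (1 - p) ** n                     # c(p)
    w1 = math.log(p / (1 - p))           # w_1(p), requires 0 < p < 1
    t1 = x                               # t_1(x)
    return h * c * math.exp(w1 * t1)


n, p = 10, 0.3                           # arbitrary illustrative values
for x in range(n + 1):
    assert math.isclose(binomial_expfam_pmf(x, n, p), binomial_pmf(x, n, p), rel_tol=1e-9)
```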


### Expo($\beta$) is an exponential family
- pdf: $f(x)=\frac{1}{\beta}e^{-x/\beta}I_{(0,\infty)}(x)$
- $h(x) = I_{(0,\infty)}(x)$
- $c(\beta)=\frac{1}{\beta}$
- $w_1(\beta)=-\frac{1}{\beta}$
- $t_1(x)=x$
- $k = 1$
- $f(x)=h(x)c(\beta)\exp(w_1(\beta)t_1(x))$

### Uniform($a,b$) is not an exponential family
- $f(x)=\frac{1}{b-a}I_{[a,b]}(x)$
- The indicator $I_{[a,b]}(x)$ is a function of both $x$ and $(a,b)$, and it cannot be written as either $c(a,b)h(x)$ or $\exp(w(a,b)t(x))$.
- In general: If the support of the distribution depends on a parameter, it is not an exponential family. For example, if both $n$ and $p$ are unknown in Binomial($n,p$), it is no longer an exponential family.

### Mean and variance for exponential families
- Theorem: If $X$ is a random variable with a pdf or pmf from an exponential family then

    $\begin{aligned}
    \operatorname{E}\left(\sum_{i=1}^{k}\frac{\partial w_{i}(\theta)}{\partial\theta_{j}}t_{i}(X)\right)& =-\frac\partial{\partial\theta_j}\log\left(c(\theta)\right) \\
    \mathrm{Var}\left(\sum_{i=1}^k\frac{\partial w_i(\theta)}{\partial\theta_j}t_i(X)\right)& =-\frac{\partial^2}{\partial\theta_j^2}\log\left(c(\theta)\right)-\mathrm{E}\left(\sum_{i=1}^k\frac{\partial^2w_i(\theta)}{\partial\theta_j^2}t_i(X)\right) 
    \end{aligned}$
- Example: Expo($\beta$), where $\log$ denotes the natural logarithm ($\log=\ln$)

- For Expo($\beta$), $k = 1$, $c(\beta)=\frac{1}{\beta}$, $w_1(\beta)=-\frac{1}{\beta}$, and $t_1(x)=x$. The theorem gives:

    $\begin{aligned}
    &\operatorname{E}\left(\frac{\partial w_{1}(\beta)}{\partial\beta}t_{1}(X)\right)\\
          &=-\frac{\partial}{\partial\beta}\log(c(\beta)) \\
          &=-\frac{\partial}{\partial\beta}\log\left(\frac{1}{\beta}\right) \\
          &=\frac{\partial}{\partial\beta}\log(\beta)\\
          &=\frac{1}{\beta}
    \end{aligned}$
    
    Computing the same expectation directly, using $E(X)=\beta$, gives the same value:

    $\begin{aligned}
    &\operatorname{E}\left(\frac{\partial w_{1}(\beta)}{\partial\beta}t_{1}(X)\right)\\
    &=E\left(\frac{\partial}{\partial \beta}\left(-\frac{1}{\beta}\right)X\right)\\
    &=E\left(\frac{1}{\beta^2}X\right)\\
    &=\frac{1}{\beta^2}E(X)=\frac{1}{\beta^2}\beta\\
    &=\frac{1}{\beta}
    \end{aligned}$

$\mathrm{Var}\left(\frac{\partial w_1(\beta)}{\partial\beta}t_1(X)\right) =-\frac{\partial^2}{\partial\beta^2}\log\left(c(\beta)\right)-\mathrm{E}\left(\frac{\partial^2w_1(\beta)}{\partial\beta^2}t_1(X)\right)$
- $\frac{\partial w_1(\beta)}{\partial\beta} = \frac{\partial}{\partial\beta}\left(-\frac{1}{\beta}\right)=\frac{1}{\beta^2}$
- $-\frac{\partial^2}{\partial\beta^2}\log\left(c(\beta)\right) =\frac{\partial^2}{\partial\beta^2}\log\left(\beta\right)=-\frac{1}{\beta^2}$
- $\frac{\partial^2w_1(\beta)}{\partial\beta^2}=\frac{\partial^2}{\partial\beta^2}\left(-\frac{1}{\beta}\right)=-\frac{2}{\beta^3}$

$\mathrm{Var}\left(\frac{1}{\beta^2}X\right) =-\frac{1}{\beta^2}+\mathrm{E}\left(\frac{2}{\beta^3}X\right)$

$\frac{1}{\beta^4}\mathrm{Var}\left(X\right)= -\frac{1}{\beta^2}+\frac{2}{\beta^2}$

$\mathrm{Var}\left(X\right)=\beta^2$
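
The two identities can also be checked numerically for Expo($\beta$). The sketch below is an illustration under stated assumptions: the helper `expectation`, the midpoint-rule quadrature, the truncation of the integral at $50\beta$, and the choice $\beta=2$ are all assumptions made for this example, not part of the notes.

```python
import math


def expectation(func, beta, upper_mult=50.0, steps=100_000):
    """Midpoint-rule approximation of E[func(X)] for X ~ Expo(beta).

    The integral over (0, inf) is truncated at upper_mult * beta, where the
    remaining tail mass is negligible for the functions used here.
    """
    upper = upper_mult * beta
    dx = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        total += func(x) * math.exp(-x / beta) / beta
    return total * dx


beta = 2.0                       # arbitrary illustrative value
dw = 1.0 / beta**2               # d w_1 / d beta
d2w = -2.0 / beta**3             # d^2 w_1 / d beta^2
dlogc = -1.0 / beta              # d log c / d beta, with c(beta) = 1 / beta
d2logc = 1.0 / beta**2           # d^2 log c / d beta^2

# Mean identity: E(w_1' X) = -d/dbeta log c(beta)
mean_lhs = expectation(lambda x: dw * x, beta)
mean_rhs = -dlogc

# Variance identity: Var(w_1' X) = -d^2/dbeta^2 log c(beta) - E(w_1'' X)
var_lhs = expectation(lambda x: (dw * x) ** 2, beta) - mean_lhs**2
var_rhs = -d2logc - expectation(lambda x: d2w * x, beta)

assert math.isclose(mean_lhs, mean_rhs, abs_tol=1e-6)
assert math.isclose(var_lhs, var_rhs, abs_tol=1e-6)
```

With $\beta=2$ both sides of the mean identity are close to $1/\beta=0.5$ and both sides of the variance identity are close to $1/\beta^2=0.25$, consistent with $\mathrm{Var}(X)=\beta^2$.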

### Curved vs. full exponential families

- A pdf/pmf from an exponential family:
- $f(x\mid\theta)=h(x)c(\theta)\exp\left(\sum_{i=1}^kw_i(\theta)t_i(x)\right)$
- Often the dimension of $\theta$ is equal to $k$ - but not always
- Definition: Curved or Full Expo Families
    - If we can write $f(x)$ such that $k = d$, where $d$ is the dimension of the vector $\theta$, the family is called a **full exponential family**.
    - A **curved exponential family** is an exponential family for which $d < k$.
    - Example: N$(\theta,\theta^2)$ is a curved exponential family, with $d=1$ and $k=2$.
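
To see why N$(\theta,\theta^2)$ is curved, expand $(x-\theta)^2/(2\theta^2)=x^2/(2\theta^2)-x/\theta+\tfrac12$, which suggests $w_1(\theta)=-\frac{1}{2\theta^2}$, $w_2(\theta)=\frac{1}{\theta}$, and $c(\theta)=\frac{1}{|\theta|}e^{-1/2}$. The sketch below checks this decomposition and shows that the pair $(w_1,w_2)$ traces the curve $w_1=-w_2^2/2$ instead of filling a two-dimensional region. The helper names and the sampled values are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def natural_parameters(theta):
    """(w_1(theta), w_2(theta)) for N(theta, theta^2); hypothetical helper."""
    return -1.0 / (2.0 * theta**2), 1.0 / theta


def curved_normal_pdf(x, theta):
    """N(theta, theta^2) density written with k = 2 terms but a single parameter."""
    w1, w2 = natural_parameters(theta)
    h = 1.0 / math.sqrt(2.0 * math.pi)       # h(x)
    c = math.exp(-0.5) / abs(theta)          # c(theta)
    return h * c * math.exp(w1 * x**2 + w2 * x)


for theta in (-2.0, 0.5, 3.0):               # arbitrary nonzero values
    w1, w2 = natural_parameters(theta)
    # One free parameter: (w1, w2) always lies on the parabola w1 = -w2^2 / 2.
    assert math.isclose(w1, -(w2**2) / 2.0, rel_tol=1e-12)
    for x in (-1.0, 0.0, 2.5):
        direct = NormalDist(theta, abs(theta)).pdf(x)
        assert math.isclose(curved_normal_pdf(x, theta), direct, rel_tol=1e-9)
```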

### Location-scale families
The following theorem about shifting and rescaling pdfs is handy.
- Theorem: Let $f(x)$ be a pdf and let $\mu\in\mathbb{R},\sigma>0$ be constants. Then
    - $g(x\mid\mu,\sigma)=\frac1\sigma f\left(\frac{x-\mu}\sigma\right)$ is also a pdf.
- Proof: 

    Since $f(x)\geq 0,\forall x$ and $\sigma > 0$, we have $g(x)\geq 0,\forall x$.

    $\begin{aligned}
    \int_{-\infty}^{\infty}g(x)dx 
    &= \int_{-\infty}^{\infty}\frac{1}{\sigma}f(\frac{x-\mu}{\sigma})dx\\
    &= \int_{-\infty}^{\infty}\frac{1}{\sigma}f(u)\sigma du,\,(u=\frac{x-\mu}{\sigma},dx=\sigma du)\\
    &= \int_{-\infty}^{\infty}f(u)du = 1,\,(x\in(-\infty,\infty)\to u=\frac{x-\mu}{\sigma}\in(-\infty,\infty))
    \end{aligned}$

    Therefore, $g(x)$ is a pdf.
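
The theorem can be illustrated numerically. The sketch below assumes the standard logistic density as the standard pdf $f$ (an arbitrary choice for illustration, not from the notes), builds $g$ from it, and integrates $g$ with a simple midpoint rule over a wide finite interval. The helper names, the values of $\mu$ and $\sigma$, and the integration range are all assumptions.

```python
import math


def standard_logistic_pdf(u):
    """Standard logistic density, written as 1 / (4 cosh^2(u / 2)) for stability."""
    return 1.0 / (4.0 * math.cosh(u / 2.0) ** 2)


def location_scale_pdf(f, mu, sigma):
    """Return g(x | mu, sigma) = f((x - mu) / sigma) / sigma for a standard pdf f."""
    def g(x):
        return f((x - mu) / sigma) / sigma
    return g


def integrate(func, lower, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    dx = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * dx) for i in range(steps)) * dx


mu, sigma = 3.0, 2.5                                   # arbitrary illustrative values
g = location_scale_pdf(standard_logistic_pdf, mu, sigma)

# Integrate over mu +/- 40 sigma; the tail mass outside is negligible here.
total = integrate(g, mu - 40.0 * sigma, mu + 40.0 * sigma)
assert math.isclose(total, 1.0, abs_tol=1e-6)
```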

### Location-scale families

- Definition: Let $f(x)$ be a pdf (sometimes called the standard pdf)
    - (i) Set $g(x|\mu)=f(x-\mu)$. Then {$g(x|\mu):\mu\in\mathbb{R}$} is called a **location family**.
    - (ii) Set $g(x|\sigma)=\frac{1}{\sigma}f(\frac{x}{\sigma})$. Then {$g(x|\sigma):\sigma>0$} is called a **scale family**.
    - (iii) Set $g(x\mid\mu,\sigma)=\frac1\sigma f\left(\frac{x-\mu}\sigma\right)$. Then $\{g(x\mid\mu,\sigma):\mu\in\mathbb{R},\sigma>0\}$ is called a **location-scale family**.
    - $\mu$ is called a **location parameter** and $\sigma$ is called a **scale parameter**.
    - Example: N$(\mu,\sigma^2)$ is a location-scale family.
        The standard pdf N(0,1): $f(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2}I_{(-\infty,\infty)}(x)$
        
        $\begin{aligned}
        g(x)
        &=\frac{1}{\sigma}\frac{1}{\sqrt{2\pi}}e^{-(\frac{x-\mu}{\sigma})^{2}/2}I_{(-\infty,\infty)}(\frac{x-\mu}{\sigma}) \\
        &=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^{2}}{2\sigma^2}}I_{(-\infty,\infty)}(x) \\
        &=\text{ pdf of }N(\mu,\sigma^2)
        \end{aligned}$

- If support of $f(x)$ is not $\mathbb{R}$ then the support of $g(x|\mu,\sigma)$ will depend on $\mu$ and $\sigma$
- Example: Uniform($a,b$), pdf: $f(x)=\frac{1}{b-a}I_{[a,b]}(x)$, $a$ and $b$ fixed.
    $\begin{aligned}
    g(x)&=\frac{1}{\sigma}\frac{1}{b-a}I_{[a,b]}(\frac{x-\mu}{\sigma}),\,(a\leq\frac{x-\mu}{\sigma}\leq b) \\
        &=\frac{1}{\sigma}\frac{1}{b-a}I_{[\sigma a+\mu,\sigma b+\mu]}(x),\,(\sigma a + \mu \leq x \leq \sigma b+\mu) \\
        &=\text{ pdf of uniform}(\sigma a + \mu, \sigma b +\mu)
     \end{aligned}$
     
     Note: $\mu$ is not, in general, the mean of $g(x)$. The mean of $g(x)$ is $\frac{(\sigma a+\mu)+(\sigma b+\mu)}2 = \mu + \sigma\frac{a+b}{2}$, where $\frac{a+b}{2}$ is the mean of $f(x)$.
    

#### One use of location-scale families:
Probabilities for any location-scale pdf can be calculated by transforming to the **standard pdf**.
- Theorem: Let $g(\cdot|\mu,\sigma)$ be a pdf from a location-scale family with standard pdf $f(\cdot)$
    - (a): If $X\sim g(x\mid\mu,\sigma)$ then $Z=\frac{X-\mu}\sigma\sim f(z)$
    - (b): If $Z\sim f(z)$ then $X=\sigma Z+\mu\sim g(x\mid\mu,\sigma)$
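
Part (a) is what makes standard tables and standard cdfs sufficient: for a normal location-scale family, $P(a<X\leq b)=\Phi\left(\frac{b-\mu}{\sigma}\right)-\Phi\left(\frac{a-\mu}{\sigma}\right)$, where $\Phi$ is the N(0,1) cdf. The sketch below illustrates this with Python's standard-library `statistics.NormalDist`. The function name `prob_between` and the numeric values are assumptions chosen for the example.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # N(0, 1), the standard pdf of the family


def prob_between(a, b, mu, sigma):
    """P(a < X <= b) for X ~ N(mu, sigma^2), using only the standard normal cdf."""
    z_low = (a - mu) / sigma     # standardize the lower endpoint
    z_high = (b - mu) / sigma    # standardize the upper endpoint
    return STANDARD_NORMAL.cdf(z_high) - STANDARD_NORMAL.cdf(z_low)


mu, sigma = 10.0, 3.0            # arbitrary illustrative values
a, b = 7.0, 14.5

via_standard = prob_between(a, b, mu, sigma)
direct = NormalDist(mu, sigma).cdf(b) - NormalDist(mu, sigma).cdf(a)
assert math.isclose(via_standard, direct, rel_tol=1e-9)
```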

### Chebychev’s Inequality

Let $X$ be a random variable and let $g(x)$ be a non-negative function. 
- Then for any $k > 0$, $P(g(X)\geq k)\leq\frac{E(g(X))}k$.
- Proof:

    Let $A=\{x:g(x)\geq k\}$, so that $P(g(X)\geq k)=\int_{A}f(x)dx$.
    
    $\begin{aligned}
    E(g(X))&=\int_{-\infty}^{\infty}g(x)f(x)dx \\
           &\geq\int_{A}g(x)f(x)dx,\,(g(x)\geq 0,\forall x) \\
           &\geq k\int_{A}f(x)dx,\,(g(x)\geq k,\forall x\in A) \\
           &=kP(g(X)\geq k)
     \end{aligned}$
    
    Therefore, $P(g(X)\geq k)\leq\frac{E(g(X))}k$.

Let $X$ be a random variable with mean $\mu=E(X)$ and variance $\sigma^2=Var(X)$. 

Consider $g(x)=\frac{(x-\mu)^2}{\sigma^2}$, what does Chebychev’s inequality imply?

$\frac{1}{k}E\left(\frac{(X-\mu)^{2}}{\sigma^{2}}\right)=\frac{1}{k\sigma^{2}}E((X-\mu)^{2})=\frac{1}{k\sigma^{2}}\cdot\sigma^2=\frac{1}{k}\quad\cdots(1)$

$P\left(\frac{(X-\mu)^{2}}{\sigma^{2}}\geq k\right)=P((X-\mu)^{2}\geq k\sigma^{2})=P(|X-\mu|\geq\sqrt{k}\sigma)\quad\cdots(2)$

By Chebychev’s inequality, (2) $\leq$ (1):

$P(|X-\mu|\geq t\sigma)\leq\frac{1}{t^{2}},\,(t=\sqrt{k})$

$1-\frac{1}{t^{2}}\leq P(|X-\mu|<t\sigma)$

For $t = 2$: $P(|X-\mu|<2\sigma)\geq 0.75$

For $X\sim N(\mu,\sigma^2)$: $P(|X-\mu|<2\sigma)\approx0.95$
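
The gap between the general bound and the exact normal probability can be tabulated for a few values of $t$. This is a minimal sketch using the standard-library `statistics.NormalDist`. The function names and the chosen values of $t$ are illustrative assumptions.

```python
from statistics import NormalDist


def chebychev_lower_bound(t):
    """Lower bound 1 - 1/t^2 on P(|X - mu| < t * sigma), valid for any t > 0."""
    return 1.0 - 1.0 / t**2


def normal_within(t):
    """Exact P(|X - mu| < t * sigma) when X is normal, via the N(0, 1) cdf."""
    z = NormalDist()
    return z.cdf(t) - z.cdf(-t)


for t in (1.5, 2.0, 3.0):
    bound = chebychev_lower_bound(t)
    exact = normal_within(t)
    # The bound holds for every distribution with finite variance, so it is
    # necessarily looser than the exact value for any one specific family.
    assert bound <= exact
    print(f"t = {t}: Chebychev bound {bound:.4f}, normal probability {exact:.4f}")
```

The bound is distribution-free, which is why it is much weaker than the exact normal value at $t=2$.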