# What is the exponential family?

Structure:

[A lot of exponential family tutorials define the exponential family and then show how all the familiar distributions are special cases. That demonstrates how general the family is, but it doesn't show good exponential family reasoning: it doesn't teach you to reason *from* the generality of exponential families. Here, we will reason from the data and show how the exponential family can handle a wide range of cases.]

Talk about the familiar case of estimating the parameters of the normal distribution. Show the vector as a scatter plot and a histogram. "How would you figure out the normal distribution this came from? Well, you'd plug in the empirical mean/variance." The analytic solution is distracting from the general view. What you're actually doing is maximizing this guy (show formula). So you could think (naively) about varying mu/sigma over all values and picking the pair that maximizes the result. (Show a grid of graphs where we vary mu and sigma and show the PDF over the histogram. Highlight the max.)

Now I'm going to do the *exact same thing*, but I'm going to rewrite some of the algebra and name a few things.

(See the Gaussian section of https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf)

Create labels on the expanded exponential family-like equation:

    1. measurements of our data
    2. a function of our parameters that gets multiplied by our measurements
    3. a function of our parameters that gets subtracted out
    4. a number (the 1/sqrt(2pi))
    
Now let's relabel [mu/sigma^2, -1/2sigma^2] as [theta1, theta2]. You'd agree that if I know the values of [theta1, theta2], I'd also know the values of mu and sigma^2, yes? (2 equations, 2 variables). So that means I know A(\mu,\sigma).

(Describe how you can now search over theta1, theta2)

This right here is the heart of the exponential family. What's hidden here is the implicit choices we made that led us to the normal distribution, rather than something else. I'll get into that, but now I'll define the exponential family in all its glory.

(Define)

(Talk about \eta(x) and how it's a bit of a trick to get us to know what to sum over.)

So when I said 'we suspect it's normally distributed', that's a choice right there! We are basically saying the only thing that matters about the data is:

h(x) = 1
T(x) = [x,x^2] (or a sum? - actually no, start about reasoning about one data point at a time)

But we need, so that means:

A(theta) = (something of theta)

since this is needed to make sure the distribution sums to 1.

Then when we answer the question 'which normal distribution fits best?' we find the theta1 and theta2 that maximize that objective. Now in this specific instance, we *could* rewrite theta1 and theta2 to be our normal parameters? But what's the point? Theta1 and theta2 make more sense in the broader context.

Now say you came across data like this (describe 1-0 data). Now let's come at it from the direction of the exponential family. What do you think matters about one piece of data? The only thing I can think of is whether it's a 1 or not. So let's choose

h(x) = 1
T(x) = x

We need that integrate to 1 situation, so that makes:

A(theta) = (something of theta)

Great, now find the best theta1 across all the data.



We're done (but explain how we just fit a bernoulli).

Let's try something harder: what if each data point was the number of Bernoulli successes out of N trials? So we have M pieces of data like this:

[Show binomial data]

It's hard to think directly about the measurement of the data that matters, but it might be easier if we think about what's 'underneath' all this. That would look like:

[Show one binomial datapoint expanded out into N bernoulli trials]

So what's the probability of observing a certain sum? Well we need that many bernoulli trials to go off, but they can go off in any number of (n choose x) ways. So the probability is:

(Show binomial pmf in exponential family form)

ahh so now we see the point of h(x). It tells us how *big* x is, independent of parameters. So if I have a binomial with N = 6, then the event x = 3 is (n choose x) times larger than the atomic event the exponential is measuring. This is true *despite* our settings of the parameters.

Ok, but discrete might be too easy. Let's try something harder. What if I have data like this:

[Show exponential data]

What measure about this matters? I don't know, hmm. Let's take a tra

Hmm, I don't know how to measure this. But maybe I can build it out of stuff I do understand? Expand



2. "and THIS is what the exponential family is about" it's a general framework 

2. The procedure mentioned

In [2]:
import numpy as np
import pandas as pd

%matplotlib inline



I'll give a different perspective on the exponential family. If you visit the wiki page, you'll read its definition and its wonderfully general properties. That's all well and good, but it doesn't do a good job of showing how it's useful to a data scientist. I'll try to do that here. I'll avoid some typical extensions, because they distract from a good starting point. I'll start with a really simple problem, show how the exponential family helps in that circumstance, and then show how it can, in fact, help in broader circumstances. After all this, you should be able to put this knowledge in your pocket and actually use the exponential family in all its general glory. Now..

[Insert Vegeta picture with the let us begin caption]

Let's say my data is a vector of real-valued scalars $\{x_{i}\}_{i=1}^{N}$ and I *suspect* it's normally distributed. The goal is to determine *which* normal distribution it comes from. This might sound trivial - the empirical mean/variance tell us which normal distribution. But that answer isn't representative of the broader picture. What we are really doing is answering the question:

“What are the parameters of the normal distribution that maximize the likelihood of the data I'm seeing?”

In the simple case of the normal distribution, we *happen* to have a way to calculate that answer immediately from the data. But we don't always. So let's handicap

If you'd like to estimate the parameters of the normal distribution, you do the typical routine: the mean is the average and the variance is the empirical variance

If you know how to approach this, forget your approach!

Let's define it. We seek to define some probability distribution over a vector of observations. We do that with this magical device:

$$
p(\mathbf{x}|\boldsymbol{\theta})=\frac{1}{Z(\boldsymbol{\theta})}h(\mathbf{x})e^{T(\mathbf{x})\cdot\boldsymbol{\theta}}
$$

$$
\begin{align}
T(\mathbf{x}) = & \left[\begin{array}{c}
x\\
x^2
\end{array}\right] \\
\boldsymbol{\theta} = & \left[\begin{array}{c}
\frac{\mu}{\sigma^2}\\
-\frac{1}{2\sigma^2}
\end{array}\right] \\
h(\mathbf{x}) = & 1 \\
\log Z(\boldsymbol{\theta}) = & \frac{1}{2\sigma^2}\mu^2+\log\sigma + \frac{1}{2}\log (2\pi)
\end{align}
$$

### Need to determine something about the multinomial..

In [6]:
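import math  # np.math is not available in recent NumPy releases
x = np.array([1,2,3])
numerator = math.factorial(int(np.sum(x)))  # (1+2+3)! = 6!
numerator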

720

In [3]:
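import math  # np.math is not available in recent NumPy releases

M = 5        # total count in each observation
x_len = 3    # number of categories

def n_choose_subsets(x):
    # Multinomial coefficient: (sum x)! / (x_1! x_2! ... x_k!)
    numerator = math.factorial(int(np.sum(x)))
    denominator = np.prod(np.array([math.factorial(int(xi)) for xi in x]))

    return (numerator/denominator)

def prob_unnorm(x,theta):
    # Unnormalized probability: h(x) * exp(x . theta)
    return n_choose_subsets(x)*np.exp(np.sum(x*theta))

def partition(theta):
    # Sum the unnormalized probability over every nonnegative integer x that sums to M
    count = 0
    Z = 0
    for x1 in range(M+1):
        for x2 in range(M-x1+1):
            x3 = M-(x1+x2)
            x = np.array([x1,x2,x3])
            count += 1
            Z += prob_unnorm(x,theta)

    return Z, count

# Number of such x vectors, by stars and bars: (M+x_len-1)! / (M! (x_len-1)!)
count_analytic = math.factorial(M+x_len-1)/(math.factorial(M)*math.factorial((M+x_len-1)-M))

def my_prob(x, theta):
    # Normalized by the brute-force partition function
    Z, count = partition(theta)
    return prob_unnorm(x,theta)/Z

def their_prob(x,theta):
    # No division by Z here, so this value is not normalized
    return n_choose_subsets(x)*np.exp(np.sum(x*theta))

theta = np.array([.1,.1,.1])
x = [2,2,1]
print(my_prob(x, theta))
print(their_prob(x,theta))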

0.123456790123
49.461638121


## NEED TO GET THE IDEA OF THE MINIMAL REPRESENTATION IN THERE EARLY. SAY THIS IS WHAT IT'S ALL ABOUT. This will fix up the explanation of the multinomial.

'It's about sweeping over distributions in the real coordinate system'

Talk about how exp() helps us map everything to [0,1]

# Final Answer

I'd like to give a different perspective on the exponential family. If you visit the wiki page, you'll read its definition and its wonderfully general properties. That's all well and good, but it doesn't do a good job of showing how it's useful to a data scientist. I'll try to do that here. I'll start with a really simple problem, show how the exponential family helps in that familiar circumstance, and then show how it can, in fact, help in broader circumstances.

So let's say we've come across a list of continuous numbers that look like this:

In [3]:
# Make a plot showing the scatter plot (x-axis is the index) of a normally distributed vector and its histogram.

Our goal is to determine which distribution generated these numbers. That is, we speculate a distribution and determine which parameters make the most sense according to the data. In this case, it certainly looks normally distributed, so let's guess that distribution. Then we pick the maximum likelihood parameters (those that 'make the most sense').

Now, before you say, 'just use the empirical mean/variance', let's think about exactly what we are doing. We are concerned with finding the parameters $\mu$ and $\sigma^2$ that maximize the likelihood of our data. That is, find me $\mu^*$ and $\sigma^{2*}$:

$$
\mu^*,\sigma^{2*} = \textrm{argmax}_{\mu,\sigma^2} \prod_i^N \mathcal{N}(x_i | \mu,\sigma^2)
$$

The answer happens to be the empirical mean and variance, but that solution doesn't generalize, so forget it! Let's scan over values of $\mu$ and $\sigma^2$ until we find the combination that works best. That process looks like this:

In [None]:
# Show multiple-plot where we fit different normals to the histogram.
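
Here's a minimal sketch of what that scan could look like. It assumes a synthetic sample standing in for the data (the seed, sample size, true parameters, and grid ranges are all illustrative choices of mine). It also works with the log of the likelihood, since the raw product of densities underflows quickly and taking the log doesn't move the maximum.

```python
import numpy as np

# Synthetic stand-in for the data: 500 draws from N(2, 1.5^2). Illustrative only.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)

def normal_log_likelihood(x, mu, sigma2):
    """Sum of log N(x_i | mu, sigma2) over the whole data vector."""
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

# Candidate values to scan (ranges picked by eye for this synthetic sample).
mus = np.linspace(0.0, 4.0, 201)
sigma2s = np.linspace(0.5, 5.0, 226)

# Evaluate the log-likelihood at every (mu, sigma2) pair on the grid.
ll = np.array([[normal_log_likelihood(x, m, s2) for s2 in sigma2s] for m in mus])

# Keep the grid point with the highest log-likelihood.
i, j = np.unravel_index(np.argmax(ll), ll.shape)
mu_best, sigma2_best = mus[i], sigma2s[j]

print("grid search:", mu_best, sigma2_best)
print("empirical  :", x.mean(), x.var())
```

The grid winner should land within about one grid step of the empirical mean and variance, which is the closed-form answer showing up again through brute force.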

Now I'm going to do the *exact same thing*, but I'm going to rewrite some of the algebra.

$$
\begin{align}
\mathcal{N}(x_i | \mu,\sigma^2) = & \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{{-\frac{(x_i-\mu)^2}{2\sigma^2}}\Big\} \\
= & \exp\Big\{\frac{\mu}{\sigma^2}x_i - \frac{1}{2\sigma^2}x_i^2 - \frac{1}{2\sigma^2}\mu^2-\log\sigma -\frac{1}{2}\log (2\pi) \Big\} \\
= & \exp\Big\{\left[\begin{array}{cc}
x_i & x_i^2\end{array}\right]\left[\begin{array}{c}
\frac{\mu}{\sigma^2}\\
-\frac{1}{2\sigma^2}
\end{array}\right] - \big(\frac{1}{2\sigma^2}\mu^2+\log\sigma + \frac{1}{2}\log (2\pi) \big)\Big\} \\
\end{align} 
$$
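
If you'd rather not take the algebra on faith, here's a small numerical check, as a sketch. The function names, the test points, and the choice of $\mu = 1$, $\sigma^2 = 2$ are all mine, purely for illustration. It evaluates the textbook density and the rewritten "dot product minus a correction" form at the same points and compares them.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Textbook form of the normal density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def normal_pdf_expfam(x, mu, sigma2):
    """Same density, written as exp{ [x, x^2] . theta - (term that depends only on parameters) }."""
    T = np.stack([x, x ** 2], axis=-1)                  # measurements of the data
    theta = np.array([mu / sigma2, -1 / (2 * sigma2)])  # functions of the parameters
    # log(sigma) is written as 0.5 * log(sigma^2)
    correction = mu ** 2 / (2 * sigma2) + 0.5 * np.log(sigma2) + 0.5 * np.log(2 * np.pi)
    return np.exp(T @ theta - correction)

xs = np.linspace(-3.0, 5.0, 9)
same = np.allclose(normal_pdf(xs, 1.0, 2.0), normal_pdf_expfam(xs, 1.0, 2.0))
print(same)
```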

Now let's label: 

$$ \left[\begin{array}{c}
\frac{\mu}{\sigma^2}\\
-\frac{1}{2\sigma^2}
\end{array}\right] = \left[\begin{array}{c}
\theta_1\\
\theta_2
\end{array}\right]
$$

You'd agree that if I knew $\theta_1$ and $\theta_2$, I'd know $\mu$ and $\sigma^2$, right? (Two equations, two variables). So let's reason in terms of those variables. Now, just combine it across all data:

$$
\prod_i^N\mathcal{N}(x_i | \mu,\sigma^2) = \exp\Big\{\left[\begin{array}{cc}
\sum_i^N x_i & \sum_i^N x_i^2\end{array}\right]\left[\begin{array}{c}
\theta_1\\
\theta_2
\end{array}\right] - N\big( \frac{-\theta^2_1}{4\theta_2} - \frac{1}{2}\log(-2\theta_2) + \frac{1}{2}\log(2\pi)\big)\Big\}
$$

So instead of searching for $\mu$ and $\sigma^2$ that maximize the function, let's look for $\theta_1$ and $\theta_2$.

In [None]:
# Show multiple-plot where we fit different normals to the histogram.
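
A rough sketch of the same scan in $\theta$ space, again on a synthetic sample (the seed, true parameters, and grid ranges are my own illustrative assumptions). Notice that the data only enters through $\sum_i x_i$ and $\sum_i x_i^2$. The $\frac{1}{2}\log(2\pi)$ piece is a constant, so it has no effect on where the maximum sits.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)  # synthetic, illustrative data

# The data enters the objective only through these two sums.
n, sum_x, sum_x2 = x.size, x.sum(), (x ** 2).sum()

def log_likelihood(theta1, theta2):
    """Pooled log-likelihood in terms of theta1, theta2 (theta2 must be negative)."""
    log_Z = -theta1 ** 2 / (4 * theta2) - 0.5 * np.log(-2 * theta2) + 0.5 * np.log(2 * np.pi)
    return theta1 * sum_x + theta2 * sum_x2 - n * log_Z

# Grid over theta1 and (strictly negative) theta2; ranges are illustrative.
theta1s = np.linspace(0.0, 2.0, 801)
theta2s = np.linspace(-1.0, -0.05, 761)
T1, T2 = np.meshgrid(theta1s, theta2s, indexing="ij")
ll = log_likelihood(T1, T2)

i, j = np.unravel_index(np.argmax(ll), ll.shape)
theta1_best, theta2_best = theta1s[i], theta2s[j]

# Two equations, two unknowns: translate back to the familiar parameters.
sigma2_best = -1 / (2 * theta2_best)
mu_best = theta1_best * sigma2_best

print("from theta grid:", mu_best, sigma2_best)
print("empirical      :", x.mean(), x.var())
```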

It's worth stating it again: this is the *exact same thing* as finding the best mean and variance. We just did it in different terms. But the crazy thing is.. nearly every distribution you've heard of can be re-worked into this form. In other words, we can do the equivalent of finding $\mu^*$ and $\sigma^{2*}$, but for a huge range of distributions.

### So what is the fully general exponential family?

According to the exponential family, the probability of a vector $\mathbf{x}$ according to a parameter vector $\boldsymbol{\theta}$ is

$$
\begin{align}
p(\mathbf{x}|\boldsymbol{\theta})=&\frac{1}{Z(\boldsymbol{\theta})}h(\mathbf{x})\exp\Big\{{T(\mathbf{x})\cdot\boldsymbol{\theta}\Big\}}\\
= &h(\mathbf{x})\exp\Big\{{T(\mathbf{x})\cdot\boldsymbol{\theta}-\log Z(\boldsymbol{\theta})\Big\}}\\
\end{align}
$$

$Z(\boldsymbol{\theta})$ is called the partition function and it's there to ensure that $p(\mathbf{x}|\boldsymbol{\theta})$ sums to 1 over $\mathbf{x}$. That is

$$
Z(\boldsymbol{\theta}) = \int h(\mathbf{x})\exp\Big\{{T(\mathbf{x})\cdot\boldsymbol{\theta}\Big\}} \nu(d\mathbf{x})
$$

$\nu(d\mathbf{x})$ refers to the 'measure' of $\mathbf{x}$. It's there to generalize the idea of 'summing over all possible events' to both the continuous and discrete domains. It also may determine the 'volume' of $\mathbf{x}$, though you can get that work done with $h(\mathbf{x})$. So when we say we 'know' $\nu(d\mathbf{x})$, that means we know how to sum over all possible $\mathbf{x}$'s appropriately.

In fact, let's make that separation and say that's what $h(\mathbf{x})$ is - it's the volume of $\mathbf{x}$. Think of this as the component of $\mathbf{x}$'s likelihood that *isn't* due to its parameters. This will become more clear with an example.

$T(\mathbf{x})$ is called the 'vector of *sufficient* statistics'. This is a measurement of our data that is 100% of what we need to determine agreement with $\boldsymbol{\theta}$. In other words, if I handed you two vectors ($\mathbf{x}_1$ and $\mathbf{x}_2$) that had the same sufficient statistics (so $T(\mathbf{x}_1) = T(\mathbf{x}_2)$) then these data points would agree with all parameter vectors equally.[1]
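
To make that concrete, here's a small sketch for the normal case, treating a whole dataset as one observation whose sufficient statistics are $[\sum_i x_i, \sum_i x_i^2]$. The two datasets and the parameter settings below are made up for illustration; the second dataset is constructed to have the same sum and sum of squares as the first.

```python
import numpy as np

def normal_log_likelihood(x, mu, sigma2):
    """Sum of log N(x_i | mu, sigma2) over the data vector."""
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

# Two different datasets built to share the same sum and the same sum of squares.
data_a = np.array([1.0, 2.0, 3.0])
data_b = 2.0 + (2.0 / np.sqrt(3.0)) * np.array([1.0, -0.5, -0.5])

print(data_a, data_a.sum(), (data_a ** 2).sum())
print(data_b, data_b.sum(), (data_b ** 2).sum())

# Whatever parameters we try, the two datasets get scored identically.
for mu, sigma2 in [(0.0, 1.0), (2.0, 0.5), (-1.0, 4.0)]:
    la = normal_log_likelihood(data_a, mu, sigma2)
    lb = normal_log_likelihood(data_b, mu, sigma2)
    print(mu, sigma2, np.isclose(la, lb))
```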

### How were we using the exponential family when fitting a normal distribution?

So when we fitted our normal distribution, we were quietly making choices with respect to this form. That is, we were making assertions that implied specific settings to our exponential family. Those were:

1. $x$ could be any real valued number: that sets $\nu(dx)$ which determines how we'll do our integration.
2. It's normally distributed, which dictates two things:
    * $T(x) = [x,x^2]$, which means we care about parameters that relate to $x$ and $x^2$. ELABORATE
    * $h(x) = 1$, which means that the difference in likelihood between two $\mathbf{x}$ values is due entirely to the parameters.

From here, the definition of the exponential family will dictate the rest. That is:

$$
\begin{align}
Z(\boldsymbol{\theta}) = & \int h(\mathbf{x})\exp\Big\{{T(\mathbf{x})\cdot\boldsymbol{\theta}\Big\}} \nu(d\mathbf{x})\\
= & \int \exp\Big\{{ \left[\begin{array}{cc}
x & x^2\end{array}\right] \cdot \left[\begin{array}{c}
\theta_1\\
\theta_2
\end{array}\right] \Big\}} dx \\
= & \exp\Big\{ \frac{-\theta^2_1}{4\theta_2} - \frac{1}{2}\log(-2\theta_2) + \frac{1}{2}\log(2\pi)\Big\}
\end{align}
$$

And now we can proceed to find the best $\theta_1$ and $\theta_2$ like we did before.
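
As a sanity check on that integral, here's a sketch that compares a brute-force numerical integration against the standard Gaussian-integral result $\sqrt{\pi/(-\theta_2)}\,\exp\{-\theta_1^2/(4\theta_2)\}$, which is an equivalent way of writing the closed form. The values of $\theta$, the integration range, and the number of grid points are arbitrary choices of mine.

```python
import numpy as np

def Z_numeric(theta1, theta2, lo=-50.0, hi=50.0, n=200_001):
    """Riemann-sum approximation of the integral of exp(theta1*x + theta2*x^2) dx."""
    xs = np.linspace(lo, hi, n)
    dx = xs[1] - xs[0]
    return np.sum(np.exp(theta1 * xs + theta2 * xs ** 2)) * dx

def Z_closed_form(theta1, theta2):
    """Gaussian integral; only finite when theta2 < 0."""
    return np.sqrt(np.pi / -theta2) * np.exp(-theta1 ** 2 / (4 * theta2))

theta1, theta2 = 0.8, -0.25   # corresponds to mu = 1.6, sigma^2 = 2
print(Z_numeric(theta1, theta2), Z_closed_form(theta1, theta2))
```

Note that the integral only converges when $\theta_2 < 0$, which is the same thing as saying the variance has to be positive.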

### What's so useful about the generalization?

The reason for this rephrasing is that it reveals all the remarkable degrees of freedom the exponential family affords us. We didn't *need* to say $x$ was real valued - that was our choice. We didn't *need* to say the sufficient statistics were $[x,x^2]$, we could have picked anything! So let's use those degrees of freedom in another example. Let's say I come across data like this:

[Show a vector with A's and B's as values]

[Show a histogram with A/B counts]

Hmm, these aren't numbers. Well, let's make some choices:

1. $\nu(dx)$ should mean we sum over the two possible outcomes ($x=A$ or $x=B$)
2. The only measurement of this data I can think of is an indicator function:
$$
    T(x)= 
\begin{cases}
    1 & \text{if } x = A\\
    0              & \text{if } x = B
\end{cases}
$$
which can be represented as $\mathbb{1}[x=A]$
3. Outside of something that relates to the parameters, I have no reason to think $x=A$ has a greater volume than $x=B$ (or vice versa), so let's say $h(x)=1$.

We've made all our choices - What does this imply about $Z(\boldsymbol{\theta})$?

$$
\begin{align}
Z(\theta) = & \int h(x)\exp\Big\{{T(x)\cdot\theta\Big\}} \nu(dx)\\
= & \int \exp\Big\{{\mathbb{1}[x=A]\cdot\theta\Big\}} \nu(dx)\\
= & \sum_{x \in \{A,B\}} \exp\Big\{{\mathbb{1}[x=A]\cdot\theta\Big\}}\\
= & \exp(\theta) + 1
\end{align}
$$

And now we can write $p(x|\theta)$:

$$
\begin{align}
p(x|\theta)= &\frac{1}{Z(\theta)}h(x)\exp\Big\{{T(x)\cdot\theta\Big\}}\\
= &\frac{1}{\exp(\theta) + 1}\exp\Big\{{\mathbb{1}[x=A]\cdot\theta\Big\}}\\
\end{align}
$$

Or, written out more explicitly:

$$
\begin{align}
p(x=A|\theta) = &\frac{\exp(\theta)}{\exp(\theta) + 1}\\
p(x=B|\theta) = &\frac{1}{\exp(\theta) + 1} = 1 - p(x=A|\theta)
\end{align}
$$

Since we can choose $\theta$ to yield any value in (0,1) for $p(x=A|\theta)$, we see this distribution just assigns a constant probability to the events $x=A$ and $x=B$ - in other words, it's a Bernoulli random variable.
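
Here's a sketch of fitting that $\theta$ by the same brute-force scan we used for the normal. The A/B vector below is made up for illustration, and the grid range is an arbitrary choice.

```python
import numpy as np

# Illustrative data: a short vector of A/B labels (made up for this sketch).
data = np.array(list("AABABBBABAABAAAB"))
t = (data == "A").astype(float)   # T(x) = 1[x = A]

def log_likelihood(theta):
    """Sum over the data of T(x)*theta - log Z(theta), with Z(theta) = exp(theta) + 1."""
    return t.sum() * theta - t.size * np.log(np.exp(theta) + 1.0)

# Scan theta over a grid and keep the best value.
thetas = np.linspace(-5.0, 5.0, 10_001)
theta_best = thetas[np.argmax(log_likelihood(thetas))]

# Map the winning theta back to a probability of seeing A.
p_A = np.exp(theta_best) / (np.exp(theta_best) + 1.0)
print("theta*:", theta_best)
print("p(x=A | theta*):", p_A, " fraction of A's:", t.mean())
```

The fitted $p(x=A|\theta^*)$ should match the fraction of A's in the data up to the grid resolution, which is the usual Bernoulli estimate.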

So one set of choices can lead us directly to the normal distribution while another set led us to the Bernoulli distribution. What else can we generate?

### Try something harder!

What if we came across data like this:

[Show multinomial data]

Well this is odd. Now $\mathbf{x}$ is a length 3 vector and it's got this unusual constraint where the elements sum to 5. Ok, deep breath - let's start from our familiar spot:

1. How should we think about $\nu(d\mathbf{x})$? How should we think about all possible events that we should sum over? Said differently, what are the legal observations you could make for this data? Well it's any 3 nonnegative integers such that their sum is 5. Here, I'll write them out for you:

    So I just need to sum over all these. Let's call this set $\mathcal{X}$.

2. What is $T(\mathbf{x})$? The most natural thing I can think of is the data itself, so $T(\mathbf{x})=\mathbf{x}$. (This will change - it's actually the measurement of the first 2 elements).

3. What is $h(\mathbf{x})$? In other words, should one observation of $\mathbf{x}$ ever be considered more likely than another, regardless of what parameter vector $\boldsymbol{\theta}$ we have? From this angle, I have no clue, but it's not simple enough for me to say $h(\mathbf{x})=1$. To crystallize our understanding, let's think about how we would generate $\mathbf{x}$ ourselves. One way is to make 5 draws (from some distribution I don't yet know) and then aggregate the results, like this:

    [Show mapping of things like A,A,B,A,C --> A = 3, B = 1, C = 1]
    
    The useful thing here is that all possible sequences will map to all events of $\mathcal{X}$. Now from this angle, do some observations seem more likely than others (even though we can't reference the distribution that generates a single draw)? Well, some observations are mapped to by more sequences than others. For example, the only thing that maps to [5,0,0] is [A,A,A,A,A] while [4,1,0] is mapped to by 5 sequences ($[B,A,A,A,A],[A,B,A,A,A],\cdots,[A,A,A,A,B]$). So it seems the latter observation is 5 times bigger than the former. So let's make $h(\mathbf{x})$ the number of sequences that map to $\mathbf{x}$. If you remember your combinatorics, that is:
    
    $$
    h(\mathbf{x}) = \frac{5!}{(x_1!)(x_2!)(x_3!)}
    $$
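
The choices above can be checked by brute force. This sketch enumerates the set $\mathcal{X}$, walks through every possible sequence of 5 draws, counts how many sequences land on each observation, and compares that count with the formula for $h(\mathbf{x})$. The labels A, B, C follow the example mapping; the variable and function names are my own.

```python
from collections import Counter
from itertools import product
from math import factorial

M = 5                      # draws per observation
labels = ("A", "B", "C")   # outcome labels for a single draw, as in the example

# The set of legal observations: nonnegative integer triples that sum to M.
X = [(x1, x2, M - x1 - x2) for x1 in range(M + 1) for x2 in range(M - x1 + 1)]

# Brute force: walk all 3^5 sequences and tally which count vector each maps to.
n_sequences = Counter()
for seq in product(labels, repeat=M):
    counts = tuple(seq.count(label) for label in labels)
    n_sequences[counts] += 1

def h(x):
    """Multinomial coefficient: (x1+x2+x3)! / (x1! x2! x3!)."""
    denom = 1
    for xi in x:
        denom *= factorial(xi)
    return factorial(sum(x)) // denom

# Columns: observation, sequences counted by brute force, formula value.
for x in X:
    print(x, n_sequences[x], h(x))
print(len(X), "possible observations")
```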
    
Ok, we've made all our choices so we should be good to go. The partition function is this guy:

$$
\begin{align}
Z(\boldsymbol{\theta}) = & \int h(\mathbf{x})\exp\Big\{{T(\mathbf{x})\cdot\boldsymbol{\theta}\Big\}} \nu(d\mathbf{x})\\
= & \sum_{\mathbf{x}\in \mathcal{X}} \frac{5!}{(x_1!)(x_2!)(x_3!)}\exp\Big\{\mathbf{x}\cdot\boldsymbol{\theta}\Big\}\\
\end{align}
$$

So now we may write the expression for the probability:

$$
\begin{align}
p(\mathbf{x}|\boldsymbol{\theta}) = & \frac{1}{Z(\boldsymbol{\theta})}\frac{5!}{(x_1!)(x_2!)(x_3!)}\exp\Big\{\mathbf{x}\cdot\boldsymbol{\theta}\Big\}\\
\end{align}
$$
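
Nothing stops us from evaluating this directly, since $\mathcal{X}$ is small enough to sum over by hand. Below is a minimal sketch; the function names and the particular $\boldsymbol{\theta}$ are illustrative assumptions.

```python
import numpy as np
from math import factorial

M = 5
# Every nonnegative integer triple that sums to M.
X = [(x1, x2, M - x1 - x2) for x1 in range(M + 1) for x2 in range(M - x1 + 1)]

def h(x):
    """Number of length-M sequences that map to the count vector x."""
    return factorial(M) / np.prod([factorial(xi) for xi in x])

def unnormalized(x, theta):
    """h(x) * exp(x . theta), before dividing by Z(theta)."""
    return h(x) * np.exp(np.dot(x, theta))

def partition(theta):
    """Z(theta): sum of the unnormalized weights over every legal observation."""
    return sum(unnormalized(x, theta) for x in X)

def prob(x, theta):
    return unnormalized(x, theta) / partition(theta)

theta = np.array([0.1, 0.1, 0.1])   # arbitrary illustrative parameter vector
print(prob((2, 2, 1), theta))
print(sum(prob(x, theta) for x in X))
```

With all three entries of $\boldsymbol{\theta}$ equal, every sequence of draws is equally likely, so $p([2,2,1])$ is just the 30 sequences that map to it out of the $3^5 = 243$ total.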

Can we simplify this expression further? It would be nice if there was some cancellation between $h(\mathbf{x})$ and $Z(\boldsymbol{\theta})$. I suspect


# Extensions:

1. Mention GLMs
2. Mention how we want the exponential parameterization to be 'minimal'

### Footnotes

[1] I realize 'agreement with parameters' is a bit vague. The actual definition is
