# Bayes' Theorem

Goals:
* Grasp how the mathematics of probability can be used to do statistical inference.
* Start working through real inference problems, with pencil, paper, and PGMs.

In [None]:
# execfile() was removed in Python 3; exec(open(...).read()) is the equivalent
exec(open('../graphics/bayes.py').read()) # see code here for later demos
%matplotlib inline

## References

* Gelman ch. 1
* Ivezic 5.1-5.3
* MacKay 2.3

### Sampling distributions and likelihoods
You've just been introduced to PGMs - these are a visual representation of how data are generated.
* By filling in the blanks, we can write down a **likelihood function** (a function of model parameters), which says how probable the observed data set is for each choice of parameter values.
* Properly normalized, the same expression encodes the distribution from which data are generated for fixed parameters, called the **sampling distribution**, $P($data|params$)$.

### Sampling distributions and likelihoods
*Insert example PGM from the previous lesson to illustrate*

### Other ingredients for principled inference

$P($data|params$)$ clearly has a role to play in inferring which parameter values are consistent with the data. What else do we need?

* $P($params$)$, the **prior distribution**
* $P($params|data$)$, the **posterior distribution**

### The prior distribution

$P($params$)$
* The *marginal* probability of a set of parameter values (integrated over possible data sets).
* Consequently, *independent of the measured data*.
* Interpretation: what we know about the model parameters *before* incorporating new knowledge in the form of the measured data.

### The posterior distribution

$P($params|data$)$
* The probability of a set of parameter values *given* the measured data.
* Interpretation: what we know about the model parameters *after* incorporating new knowledge in the form of the measured data. In other words, the product of statistical inference.

## Bayes' Theorem
The ingredients above are all related through the definition of conditional probability:

$P(\mathrm{params}|\mathrm{data}) = \frac{P(\mathrm{data}|\mathrm{params})~P(\mathrm{params})}{P(\mathrm{data})}$

* $P(\mathrm{params})$: prior - what we know before doing the experiment
* $P(\mathrm{data}|\mathrm{params})$: sampling distribution - probability of obtaining our data set for given parameter values (the likelihood, when viewed as a function of the parameters)
* $P(\mathrm{params}|\mathrm{data})$: posterior - what we know after doing the experiment ("the answer")
* $P(\mathrm{data})$: **evidence** - marginal probability of obtaining our data, averaged over all parameter values weighted by the prior (more on this later)
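To make the roles of the four terms concrete, here is a minimal sketch of Bayes' Theorem for a parameter that can take only two values. The function name `bayes_update` and all the numbers are hypothetical, chosen purely for illustration.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """Return (posterior, evidence) for a discrete set of parameter values."""
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    unnormalized = likelihood * prior   # P(data|params) * P(params)
    evidence = unnormalized.sum()       # P(data): marginalize over params
    return unnormalized / evidence, evidence

# Hypothetical inputs: two candidate parameter values, equally likely a priori
prior = [0.5, 0.5]
# Hypothetical probability of the observed data under each parameter value
likelihood = [0.2, 0.6]

post, evidence = bayes_update(prior, likelihood)
print("posterior:", post)
print("evidence:", evidence)
```

The evidence only rescales the product of likelihood and prior so that the posterior sums to 1; it does not change the relative weights of the parameter values.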

## Example: measuring the flux of a source

Say we want to measure the flux of a galaxy. In a given integration time, $T$, the number of counts, $N$, that we collect in our fancy CCD will be Poisson distributed

$N|\mu \sim \mathrm{Poisson}(\mu)$

where $\mu=FAT$ is the average number of counts we would expect in time $T$: the product of the source flux ($F$, counts per unit time and area), the collecting area of our telescope ($A$), and the integration time ($T$).

Presumably we know $A$ and $T$ well, so for convenience we can make $\mu$ rather than $F$ the free parameter of our model.
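The same Poisson expression can be read in two ways, which the following sketch tries to illustrate using `scipy.stats.poisson`. The values `mu_fixed = 4.0` and `N_obs = 5` are assumptions made for the example, as are the grid limits.

```python
import numpy as np
from scipy.stats import poisson

# Reading 1: sampling distribution. Fix mu, vary the data N.
mu_fixed = 4.0                       # hypothetical expected counts
counts = np.arange(0, 60)            # possible outcomes (tail beyond 59 is negligible)
sampling = poisson.pmf(counts, mu_fixed)
sampling_total = sampling.sum()      # a proper distribution over N sums to 1

# Reading 2: likelihood. Fix the observed N, vary the parameter mu.
N_obs = 5                            # hypothetical observed counts
mu_grid = np.linspace(0.01, 30.0, 3000)
likelihood = poisson.pmf(N_obs, mu_grid)
mu_best = mu_grid[np.argmax(likelihood)]   # maximum-likelihood value on the grid

print("sum over N at fixed mu:", sampling_total)
print("mu maximizing the likelihood:", mu_best)
```

Note that the likelihood is not normalized as a function of $\mu$ by construction; only the sum over $N$ at fixed $\mu$ is guaranteed to equal 1.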

### Example: measuring the flux of a source

$N|\mu \sim \mathrm{Poisson}(\mu)$

<table>
    <tr>
        <td><img src="../graphics/bayes_poissoneg_likelihood.png" width=400></td>
    </tr>
</table>

### Example: measuring the flux of a source

*Insert PGM here*

### Example: measuring the flux of a source

We'll talk more about how to choose a prior in a few minutes. For now, we'll make a common choice, the uniform distribution (for $\mu\geq0$ in this case).
* This is an **improper** distribution, i.e. one that can't be normalized because its integral over $\mu$ diverges. This doesn't necessarily matter, as long as the posterior distribution turns out to be proper.

### Example: measuring the flux of a source

$\mu \sim \mathrm{Uniform}(0,\infty)$

<table>
    <tr>
        <td><img src="../graphics/bayes_poissoneg_prior.png" width=400></td>
    </tr>
</table>

### Example: measuring the flux of a source

What about the evidence, $P(N)$?
* This is constant with respect to model parameters by definition, so we don't actually need to calculate it (although we could, by integrating the product of the sampling distribution and the prior, $P(N|\mu)P(\mu)$, over $\mu$).
* That's because we know the posterior will be a probability distribution - as long as it's proper, the normalizing constant must be whatever makes it integrate to 1.
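As a rough check of this argument, the sketch below computes the evidence numerically for a flat prior of arbitrary constant height and shows that the normalized posterior does not depend on that height. The names `prior_height`, `mu_max`, `evidence` and `posterior_density`, and the value `N_obs = 5`, are assumptions made for illustration; truncating the integral at `mu_max` is an approximation.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import poisson

N_obs = 5  # hypothetical observed counts

def evidence(prior_height, mu_max=60.0):
    """Integrate P(N|mu) * P(mu) over mu for a flat prior of the given height."""
    integrand = lambda mu: poisson.pmf(N_obs, mu) * prior_height
    value, _ = quad(integrand, 0.0, mu_max)  # truncated at mu_max (negligible tail)
    return value

def posterior_density(mu, prior_height):
    """Normalized posterior P(mu|N) for a flat prior of the given height."""
    return poisson.pmf(N_obs, mu) * prior_height / evidence(prior_height)

# The evidence scales with the arbitrary prior height...
print(evidence(1.0), evidence(7.0))
# ...but the normalized posterior is the same either way.
print(posterior_density(4.0, 1.0), posterior_density(4.0, 7.0))
```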

### Example: measuring the flux of a source

Now we have everything we need to calculate $P(\mu|N)\propto P(N|\mu)P(\mu)$.
* In a sense we're done, and could simply evaluate this for all possible values of $\mu$, but it would be nice to have a simpler form for the result.
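The brute-force option could look something like the following sketch, which evaluates the unnormalized posterior on a grid of $\mu$ values and normalizes it numerically. The observed count `N_obs = 5` and the grid limits are assumptions made for the example.

```python
import numpy as np
from scipy.stats import poisson

N_obs = 5                                   # hypothetical observed counts
mu_grid = np.linspace(0.01, 40.0, 4000)     # grid of candidate mu values
d_mu = mu_grid[1] - mu_grid[0]

# Unnormalized posterior: likelihood times a flat prior (any constant works)
unnormalized = poisson.pmf(N_obs, mu_grid) * 1.0

# Normalize so that the posterior integrates to 1 over the grid
post = unnormalized / (unnormalized.sum() * d_mu)

# Simple summaries read off the grid
post_mode = mu_grid[np.argmax(post)]
post_mean = (mu_grid * post).sum() * d_mu

print("posterior mode:", post_mode)
print("posterior mean:", post_mean)
```

This works for any one-parameter model, but the cost grows quickly with the number of parameters, which is one reason a closed-form result is attractive.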

### Aside: [conjugate distributions](https://en.wikipedia.org/wiki/Conjugate_prior)

**Conjugate distributions** are like eigenfunctions of Bayes' Theorem. These are special cases in which, for a specific sampling distribution, the posterior has the same functional form as the prior.

### Example: measuring the flux of a source

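The Poisson distribution is conjugate to the Gamma distribution, i.e. a Gamma prior combined with a Poisson sampling distribution yields a Gamma posterior. The Gamma density is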

$P(x) = \frac{1}{\Gamma(\alpha)}\beta^\alpha x^{\alpha-1} e^{-\beta x}$ for $x\geq0$

Our Uniform prior is a limiting case of Gamma (with $\alpha=1$ and $\beta\rightarrow0$), so we can take advantage of this.

If we take the prior $\mu \sim \mathrm{Gamma}(\alpha_0,\beta_0)$, the posterior will be $\mu|N \sim \mathrm{Gamma}(\alpha_0+N,\beta_0+1)$.
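The update rule can be seen by multiplying the Poisson sampling distribution by the Gamma prior and dropping every factor that does not depend on $\mu$:

$P(\mu|N) \propto P(N|\mu)\,P(\mu) \propto \left(\mu^{N} e^{-\mu}\right)\left(\mu^{\alpha_0-1} e^{-\beta_0\mu}\right) = \mu^{(\alpha_0+N)-1}\, e^{-(\beta_0+1)\mu}$

This has the same $\mu$ dependence as the Gamma density above with $\alpha=\alpha_0+N$ and $\beta=\beta_0+1$, so the normalized posterior must be $\mathrm{Gamma}(\alpha_0+N,\beta_0+1)$. In the flat-prior limit ($\alpha_0=1$, $\beta_0\rightarrow0$) this reduces to $\mathrm{Gamma}(N+1,1)$.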

### Example: measuring the flux of a source

*Insert PGM here (with alpha and beta this time)*

### Example: measuring the flux of a source

Here we can demo how the posterior distribution depends on these prior **hyperparameters**, as well as the observed data.

In [None]:
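import matplotlib.pyplot as plt  # harmless if bayes.py has already imported it
plt.rcParams['figure.figsize'] = (10.0,7.0)
# bayesDemo is defined in ../graphics/bayes.py, loaded in the first code cell
bayesDemo(alpha0=1.0, beta0=0.001, N=5)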
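Independently of `bayesDemo` (whose internals are not shown here), the conjugate result can be cross-checked numerically. The sketch below assumes the same hyperparameters and data as the call above, builds the posterior by brute force on a grid, and compares it with the analytic $\mathrm{Gamma}(\alpha_0+N,\beta_0+1)$ density from `scipy.stats.gamma`. The grid limits are an arbitrary choice, and note that SciPy parameterizes the Gamma distribution by `scale` $=1/\beta$.

```python
import numpy as np
from scipy.stats import gamma, poisson

alpha0, beta0, N_obs = 1.0, 0.001, 5        # same values as the demo call
mu_grid = np.linspace(0.01, 60.0, 6000)
d_mu = mu_grid[1] - mu_grid[0]

# Brute force: likelihood times Gamma prior, normalized on the grid
prior = gamma.pdf(mu_grid, a=alpha0, scale=1.0 / beta0)
unnormalized = poisson.pmf(N_obs, mu_grid) * prior
grid_post = unnormalized / (unnormalized.sum() * d_mu)

# Analytic conjugate update: Gamma(alpha0 + N, beta0 + 1)
analytic_post = gamma.pdf(mu_grid, a=alpha0 + N_obs, scale=1.0 / (beta0 + 1.0))

max_diff = np.max(np.abs(grid_post - analytic_post))
print("largest difference between grid and analytic posterior:", max_diff)
```

Changing `alpha0`, `beta0` or `N_obs` here gives a quick way to see how strongly the prior hyperparameters influence the posterior relative to the data.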