# Bayesian inference: Basics

## Table of Contents:

- I [Frequentist vs Bayesian inference](#I.-Frequentist-vs-Bayesian-Approaches)
- II [The Bayesian Problem Setting](#II.-The-Bayesian-Problem-Setting)
- III [The Bayes' theorem / Bayes' Rule](#III-The-Bayes'-theorem-/-Bayes'-Rule)
- IV Simple Bayesian Modeling: See [Bayes_simple_modeling.ipynb](Bayes_simple_modeling.ipynb)
- V Bayesian modeling with Markov Chain Monte Carlo (MCMC): See [Bayes_MCMC.ipynb](Bayes_MCMC.ipynb)
- XX [References](#XX-References:)

## I. Frequentist vs Bayesian Approaches

Both the model fitting and model selection problems can be approached from either a *frequentist* or a *Bayesian* standpoint. The "philosophical" difference between the two approaches will not be discussed in detail. The interested reader (ideally already familiar with Bayesian statistics) may read [Bayes_philosophy.ipynb](Bayes_philosophy.ipynb) and [J. VanderPlas's blog posts](http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/) to get more insight into this question. We review the formalism of Bayesian statistics in the next section.


## II. The Bayesian Problem Setting

The end-goal of a Bayesian analysis is a probabilistic statement about the universe.
Roughly, we want to measure

$$
P(science)
$$

where "science" might be encapsulated in a cosmological model, the mass of a planet around a star, or whatever else we're interested in learning about.

We don't, of course, measure this without reference to data, so more specifically we want to measure

$$
P(science~|~data)
$$

which should be read "the probability of the science *given* the data."

Of course, we should be explicit that this measurement is not done in a vacuum: generally, before observing any data, we have *some* degree of background information that informs the science, so we should actually write

$$
P(science~|~data, background\ info)
$$

This should be read "the probability of the science given the data *and* the background information".

Finally, there are often things in the scientific model that we don't particularly care about: these are known as "nuisance parameters". For example, if you are looking for a planet in radial velocity data, the secular motion of the star is *extremely* important to model correctly, but in the end you don't really care about the best-fit value of this velocity.

With that in mind, we can write:

$$
P(science,nuisance\ parameters~|~data, background\ info)
$$

where, as before, the comma should be read as an "and".



Mathematically, we write this as:

$$
P(\boldsymbol{\theta}_S, \boldsymbol{\theta}_N~|~D, I)
$$

- $\boldsymbol{\theta}_S$ represents the "science": the set of parameters that we are interested in constraining
- $\boldsymbol{\theta}_N$ represents the "nuisance parameters": the set of parameters that are important in the model, but are not particularly interesting for the scientific result.
- $D$ represents the "observed data"
- $I$ represents the information or knowledge you had before observing the data, including whatever made you choose the model you're fitting.

Finally, we'll often just write $\boldsymbol{\theta} = (\boldsymbol{\theta}_S, \boldsymbol{\theta}_N)$ as a shorthand for all the model parameters.

This quantity, $P(\boldsymbol{\theta}~|~D,I)$, is called the "posterior probability", and determining it is the ultimate goal of a Bayesian analysis.

Now all we need to do is compute it!

The core problem is this: **We do not have a way to directly calculate** $P(\boldsymbol{\theta}~|~D,I)$. We often do have an expression for $P(D~|~\boldsymbol{\theta},I)$, but these two expressions are **not** equal.

$$
P(\boldsymbol{\theta}~|~D,I) \ne P(D~|~\boldsymbol{\theta},I)
$$


These two expressions are related through Bayes' theorem.

## III The Bayes' theorem / Bayes' Rule

The joint probability of two events is symmetric in its arguments, and the definition of conditional probability gives $P(A \cap B) = P(A\mid B)\,P(B)$, so we can write

$$
P(A \cap B) = P(B \cap A)
$$

$$
P(A\mid B)\,P(B) = P(B\mid A)\,P(A)
$$

which is more commonly rearranged in this form:

$$
P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}
$$

This is known as *Bayes' Theorem* or *Bayes' Rule*, and is important because it gives a formula for "flipping" conditional probabilities.
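As a quick numerical illustration of "flipping" a conditional probability, here is a minimal sketch. The scenario and all the numbers in it are hypothetical: $A$ stands for "the candidate is a real signal" and $B$ for "the pipeline raises a detection flag". The helper name `bayes_rule` is also just an illustrative choice.

```python
def bayes_rule(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A | B) from P(B | A), P(A) and P(B | not A)."""
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    # Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
    return p_b_given_a * p_a / p_b


# Hypothetical numbers, chosen only for illustration
p_a = 0.01              # P(A): fraction of candidates that are real signals
p_b_given_a = 0.95      # P(B|A): a real signal gets flagged
p_b_given_not_a = 0.05  # P(B|not A): a spurious candidate gets flagged

p_a_given_b = bayes_rule(p_b_given_a, p_a, p_b_given_not_a)

print(f"P(B | A) = {p_b_given_a:.3f}")
print(f"P(A | B) = {p_a_given_b:.3f}")
```

With these made-up inputs, $P(A\mid B)$ comes out far smaller than $P(B\mid A)$, because real signals are rare to begin with. The two conditional probabilities answer different questions and should not be confused.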



In the context of model fitting, Bayes' theorem is written:

$$
P(\boldsymbol{\theta} \mid D) = \frac{P(D\mid \boldsymbol{\theta})P(\boldsymbol{\theta})}{P(D)}
$$

Technically, all the probabilities should be conditioned on the information $I$:

$$
P(\boldsymbol{\theta} \mid D,I) = \frac{P(D \mid \boldsymbol{\theta},I)P(\boldsymbol{\theta} \mid I)}{P(D \mid I)}
$$

Recall that $\boldsymbol{\theta}$ denotes the parameters of the model we're interested in, $D$ is the observed data, and $I$ encodes all the prior information, including what led us to choose the particular model we're using.

*What is controversial in that expression?*

- We have a probability distribution over model parameters. A frequentist would say this is meaningless!

- The answer depends on the prior $P(\boldsymbol{\theta}\mid I)$. This is the probability of the model parameters without any data: how are we supposed to know that?

Nevertheless, applying Bayes' rule in this manner gives us a means of quantifying our knowledge of the parameters $\boldsymbol{\theta}$ given the observed data.

### III.1 Exploring the Terms in Bayesian Inference

We have four terms in the above expression, and we need to make sure we understand them:

#### $P(\boldsymbol{\theta}\mid D, I)$ is the *posterior*.
This is the quantity we want to compute: our knowledge of the model given the data and the background knowledge (including the choice of model).

#### $P(D\mid\boldsymbol{\theta},I)$ is the *likelihood*.
This measures the probability of seeing our data given the model. This is identical to the quantity maximized in frequentist *maximum-likelihood* approaches.

#### $P(\boldsymbol{\theta}\mid I)$ is the *prior*.
This encodes any knowledge we had about the answer before measuring the current data.

#### $P(D\mid I)$ is the *Fully Marginalized Likelihood* (or *Evidence*)
You might prefer the acronym *FML*. It is also called the *Evidence*, namely the evidence that the data $D$ were generated by the model. Its complete expression is:

$$
P(D\mid I) = \int P(D\mid\boldsymbol{\theta}, I) \, P(\boldsymbol{\theta}\mid I) \, \mathrm{d}\boldsymbol{\theta}
$$

In the context of **model fitting**, it acts as a normalization constant and in most cases can be ignored. In **model selection**, the FML can become important, but it is costly to calculate because you need to integrate the likelihood over the whole range of your parameters $\boldsymbol{\theta}$.
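For a one-parameter problem, the four terms can be evaluated by brute force on a grid, which makes the role of the evidence as a normalization constant easy to check. The sketch below is only illustrative: the synthetic data, the Gaussian likelihood with a known `sigma`, and the flat prior on $[0, 10]$ are all assumptions made for the example.

```python
import numpy as np

# Synthetic data (assumption): 20 draws around a "true" value of 5
rng = np.random.default_rng(42)
sigma = 1.0  # measurement error, assumed known
data = rng.normal(5.0, sigma, size=20)

# Grid over the single model parameter theta
theta = np.linspace(0.0, 10.0, 2001)
dtheta = theta[1] - theta[0]

# Likelihood P(D | theta, I): product of independent Gaussians,
# evaluated in log space for every grid value of theta
resid = data[:, None] - theta[None, :]
log_like = -0.5 * np.sum((resid / sigma) ** 2
                         + np.log(2.0 * np.pi * sigma**2), axis=0)
likelihood = np.exp(log_like)

# Prior P(theta | I): flat over the grid range (assumption)
prior = np.full_like(theta, 1.0 / (theta[-1] - theta[0]))

# Evidence P(D | I): integral of likelihood * prior (trapezoid rule)
unnorm = likelihood * prior
evidence = np.sum(0.5 * (unnorm[1:] + unnorm[:-1])) * dtheta

# Posterior P(theta | D, I)
posterior = unnorm / evidence

print("evidence:", evidence)
print("posterior maximum at theta =", theta[np.argmax(posterior)])
print("sample mean of the data    =", data.mean())
```

Dividing by `evidence` rescales the curve so that it integrates to one, but it does not move the maximum or change the shape. This is why the evidence can usually be ignored for model fitting. The cost of the integral grows quickly with the number of parameters, which is what makes the evidence expensive in realistic problems.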

### III.2 What is the Point?

At first blush, this might all seem needlessly complicated. Why not simply maximize the likelihood and be done with it? Why multiply by a prior at all?

- *Purity*: you quantify knowledge in terms of a probability, then follow the math to compute the answer. The fact that you need to specify a prior might be inconvenient, but we can't simply pretend it away.
- *Parameter Uncertainties*: Whether frequentist or Bayesian, the maximum likelihood "point estimate" is only a small part of the picture. What we're really interested in scientifically is the uncertainty of the estimates, so simply reporting a point estimate is not appropriate. In frequentist approaches, "error bars" are generally computed from *Confidence Intervals*, which effectively measure the probability that intervals computed from the data encompass the true (fixed) value of the parameter. They therefore derive from $P(\hat{\boldsymbol{\theta}}\mid\boldsymbol{\theta})$ rather than from $P(\boldsymbol{\theta}\mid D)$, which is what we are actually interested in (and what the Bayesian approach provides). Note the difference: the Bayesian solution is a statement of probability about the parameter value given fixed bounds. The frequentist solution is a probability about the bounds given a fixed parameter value. This follows directly from the philosophical definitions of probability that the two approaches are based on.
- *Marginalization and Nuisance Parameters*: Bayesian approaches offer a very natural way to systematically account for nuisance parameters.
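The last two points can be sketched on a small grid. In the example below, which is only an illustration under stated assumptions, the "science" parameter is the mean `mu` of some measurements and the nuisance parameter is their unknown scatter `sigma`. The synthetic data, the Gaussian likelihood, the flat priors and the grid ranges are all hypothetical choices. Marginalizing means integrating the joint posterior over `sigma`, and a credible interval for `mu` is then read directly off the marginal posterior.

```python
import numpy as np

# Synthetic data (assumption): 50 draws with mean 10 and scatter 2
rng = np.random.default_rng(0)
data = rng.normal(10.0, 2.0, size=50)
n = data.size
xbar = data.mean()
ss = np.sum((data - xbar) ** 2)

# Grids: mu is the science parameter, sigma the nuisance parameter
mu = np.linspace(8.0, 12.0, 401)
sigma = np.linspace(0.5, 5.0, 451)
dmu = mu[1] - mu[0]
dsigma = sigma[1] - sigma[0]
MU, SIGMA = np.meshgrid(mu, sigma, indexing="ij")  # shape (401, 451)

# Gaussian log-likelihood (additive constants dropped), using
# sum_i (x_i - mu)^2 = ss + n * (xbar - mu)^2
log_like = -n * np.log(SIGMA) - (ss + n * (xbar - MU) ** 2) / (2.0 * SIGMA**2)

# Flat priors on both grids (assumption): posterior is proportional
# to the likelihood, normalized here with a simple Riemann sum
post = np.exp(log_like - log_like.max())
post /= post.sum() * dmu * dsigma

# Marginalize over the nuisance parameter:
# P(mu | D, I) = integral of P(mu, sigma | D, I) d(sigma)
post_mu = post.sum(axis=1) * dsigma

# Central 68% credible interval from the cumulative marginal posterior
cdf = np.cumsum(post_mu) * dmu
lo, hi = np.interp([0.16, 0.84], cdf, mu)

print(f"68% credible interval for mu: [{lo:.2f}, {hi:.2f}]")
```

The interval is a direct probability statement about `mu` given this data set, and the uncertainty on `sigma` is automatically propagated into it. A brute-force grid like this one is only practical for a handful of parameters. For larger problems, sampling methods such as MCMC (see [Bayes_MCMC.ipynb](Bayes_MCMC.ipynb)) are the usual alternative.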


## Summary

- In Bayesian statistics, the problem setting can be framed as follows: we want to make a probabilistic statement about reality (e.g. the mass of a planet) given some data and some background information. Mathematically, this is done by computing the posterior probability distribution $P(\boldsymbol{\theta} \mid D, I)$, where $\boldsymbol{\theta}$ are the parameters of the model used to question nature, $D$ are the data that we obtained, and $I$ encapsulates our prior knowledge of the (parameters of the) problem.
- In the context of model fitting, Bayes' theorem is expressed as:
$$
P(\boldsymbol{\theta} \mid D) = \frac{P(D\mid \boldsymbol{\theta})P(\boldsymbol{\theta})}{P(D)}
$$
- $P(\boldsymbol{\theta} \mid D)$ is the *posterior* probability distribution, $P(D\mid \boldsymbol{\theta})$ is the *likelihood*, $P(\boldsymbol{\theta})$ is the *prior* on the model parameters and $P(D)$ is the *fully marginalized likelihood* or *evidence*.
- The Bayesian approach differs from the frequentist one in several conceptual aspects. In practice, one of the main differences is that in the frequentist framework we search for the *best* parameters of a model and derive confidence intervals, which are a statement about how often intervals built from repeated data sets would contain the true (fixed) parameter value. In a Bayesian framework, we derive the posterior probability distribution of a parameter (which is effectively meaningless for a frequentist).

## XX References:

**Chapter 5** (5.1, 5.2, 5.3, 5.8) of the book <a class="anchor" id="book"></a> *Statistics, Data Mining, and Machine Learning in Astronomy* by Ž. Ivezić et al., Princeton Series in Modern Astronomy.

- This notebook includes a large fraction of the material that J. VanderPlas gave during the "Bayesian Methods in Astronomy" workshop, presented at the 227th meeting of the American Astronomical Society. The full repository with that material can be found on GitHub: http://github.com/jakevdp/AAS227Workshop

- More insight into the differences between frequentist and Bayesian approaches: see [J. VanderPlas's blog posts](http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/)

- Jaynes: [*Probability Theory: The Logic of Science*](http://bayes.wustl.edu/etj/prob/book.pdf).

- For some approachable reading on frequentist vs. Bayesian uncertainties, I'd suggest [The Fallacy of Placing Confidence in Confidence Intervals](https://learnbayes.org/papers/confidenceIntervalsFallacy/), as well as Jake VanderPlas's blog post on the topic, [Confidence, Credibility, and why Frequentism and Science do not Mix](http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/).

- Andreon 2011 [Understanding better (some) astronomical data using Bayesian methods](https://arxiv.org/abs/1112.3652)