# Bayesian Models

Proprietary material - Under Creative Commons 4.0 licence CC-BY-NC-ND https://creativecommons.org/licenses/by-nc-nd/4.0/



A heads-up: this material gives only a very quick overview of Bayesian models. A reasonable introduction to the subject could take a whole semester and requires a proper understanding of probability and statistics, and truly grasping Bayesian thinking can take even longer.

Still, I think it is valuable to go over at least some basic concepts to explain the strengths and value of Bayesian models. At the end, I will leave several references that I highly recommend if you are interested in the subject.



# Statistical Thinking

Statistics is the discipline that concerns the organization, analysis, interpretation and presentation of data. Statistical thinking is a systematic way of using data to describe the world and to make predictions and decisions.


Just like statistical thinking, our human intuition tries to answer questions about the reality around us. The main difference is that our intuition often provides the wrong answers.

For example, in the United States it has been reported that more people think that violent crime has been on the rise in recent years. However, a statistical analysis of the data indicates that violent crime has been decreasing since the 1990s.

Our intuition is often wrong because we rely on guesses shaped by our biases and by experiences that don't apply in general.


There are three major things that we can do with statistics:

1. **Describe**: The world is complex and we need a simplification to understand it.
2. **Decide**: We often need to make decisions based on data, even under uncertainty.
3. **Predict**: We often need to make predictions about new situations based on previous data.



> Ref [Poldrack, 2019]

# Probability and Statistics

Probability is the language of uncertainty that is also the basis for statistical inference. It forms an important part of the foundation for statistics, because it provides us with the mathematical tools to describe uncertain events.


The problem studied in probability is: 

Given a data-generating process, what are the properties of the outputs?

We can compare this to the problem that we study in statistical inference, data mining and machine learning:

Given the outputs, what can we say about the process that generates the observed data?


We think about probability as a number from 0 to 1 that describes how likely an event is to occur, ranging from impossible (0) to guaranteed (1).

### Some Important Concepts

- Random Experiment: 
The act of measuring a process whose output is uncertain. E.g., flipping a coin, predicting travel arrival time, etc.

- Sample Space $\Omega$:
The set of all possible outputs of a random experiment. It can be either discrete or continuous. 

E.g., for a six-sided die, the sample space is: 

$\Omega = \{1, 2, 3, 4, 5, 6\}$

- Event $E$:
A subset of the sample space.

E.g., going back to the die, an event would be getting an even number:

$E = \{2, 4, 6\}$
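The die example maps directly onto Python sets. This is only a small sketch: the names `omega`, `even` and `greater_than_three` are mine, and the second event is an extra one added for illustration.

```python
# Sample space for one roll of a six-sided die.
omega = {1, 2, 3, 4, 5, 6}

# Event: the roll is an even number.
even = {x for x in omega if x % 2 == 0}

# An event is a subset of the sample space.
is_event = even <= omega

# Events can be combined with the usual set operations.
greater_than_three = {x for x in omega if x > 3}  # extra event, for illustration
even_and_big = even & greater_than_three  # both events happen
even_or_big = even | greater_than_three   # at least one of them happens
not_even = omega - even                   # complement of "even"

print(sorted(even), is_event)
print(sorted(even_and_big), sorted(even_or_big), sorted(not_even))
```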

## Probability

Formally, a probability $P$ is a real-valued function defined over the events (subsets) of $\Omega$ that satisfies the following properties:

1. For any event $E \subseteq \Omega$, $0 \leq P(E) \leq 1$.
2. The probability of the sample space is 1: $P(\Omega) = 1$.
3. If $E_1, E_2, ..., E_k \subseteq \Omega$ are pairwise disjoint events, then

$$
P\left(\bigcup^{k}_{i=1} E_i\right) = \sum^{k}_{i=1} P(E_i)
$$
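To make the three properties concrete, here is a minimal sketch that checks them on a loaded die. The weights are hypothetical numbers I picked for illustration, and `prob` is just a helper name.

```python
from fractions import Fraction

# Hypothetical loaded die: outcome -> probability (exact fractions).
weights = {
    1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
    4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2),
}
omega = set(weights)

def prob(event):
    """P(E): add up the probabilities of the outcomes contained in E."""
    return sum((weights[x] for x in event), Fraction(0))

even = {2, 4, 6}
low_odd = {1, 3}  # disjoint from `even`

in_unit_interval = 0 <= prob(even) <= 1                        # property 1
total = prob(omega)                                            # property 2
additive = prob(even | low_odd) == prob(even) + prob(low_odd)  # property 3

print(prob(even), total, additive)
```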

Now we have a definition, but how do we interpret it?

There are two main ways to interpret probabilities: frequentist and bayesian.

### Frequentist probability

Under the frequency interpretation, $P(E)$ is the long-run proportion of times that $E$ is true.

For example, when we say that the probability of tossing a coin and getting heads is 1/2, we mean that if we flip the coin many times, the proportion of tosses that result in heads tends to 1/2 as the number of tosses increases.
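We can watch this long-run behaviour in a quick simulation. This is a sketch that assumes a fair coin simulated with Python's pseudo-random generator; the seed and the numbers of flips are arbitrary choices.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible

def heads_proportion(n_flips):
    """Flip a simulated fair coin n_flips times; return the share of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The proportion wanders for small n and settles near 1/2 as n grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, heads_proportion(n))
```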

When the sample space $\Omega$ is finite and all outcomes are equally likely, we can say that 

$$
P(E) = \frac{\text{Favorable Cases}}{\text{Total Cases}} = \frac{|E|}{|\Omega|}
$$
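As a small sketch of this counting rule, consider rolling two fair dice and asking for the probability that they add up to 7. The two-dice setup is my own example, not something defined earlier in these notes.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs of faces, assumed equally likely.
omega = list(product(range(1, 7), repeat=2))

# Event: the two faces add up to 7.
event = [(a, b) for a, b in omega if a + b == 7]

# P(E) = |E| / |Omega|
p_seven = Fraction(len(event), len(omega))
print(len(event), len(omega), p_seven)
```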

### Bayesian probability

The Bayesian approach (or degree-of-belief interpretation) is that $P(E)$ measures an observer's strength of belief that $E$ is true. If I ask "How likely is it that a political candidate will get elected?", you can provide an answer based on your knowledge and beliefs even if there are no relevant frequencies to compute a frequentist probability.


The difference in interpretation doesn't usually matter much until we deal with inference. Then, this difference in approach leads to the two schools of inference: the frequentist and Bayesian schools. 

As I said before, there is too much to cover to explain Bayesian models properly. Still, I will quickly go over some of the most relevant concepts.

### Conditional Probabilities

The conditional probability of $Y$ given $X$ is defined as:

$$
P(Y|X) = \frac{P(X,Y)}{P(X)}
$$

$P(Y|X)$ can be interpreted as the fraction of times $Y$ occurs when $X$ is known to occur.

If $X$ and $Y$ are independent then $P(Y|X) = P(Y)$
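Here is a minimal sketch of both statements using one roll of a fair die. The events `X`, `Y` and `Z` are ones I chose for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # one roll of a fair die

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

X = {4, 5, 6}  # the roll is greater than 3
Y = {2, 4, 6}  # the roll is even
Z = {1, 2}     # the roll is at most 2

# P(Y|X) = P(X,Y) / P(X)
p_y_given_x = prob(X & Y) / prob(X)  # 2/3, so knowing X changes our view of Y
p_y_given_z = prob(Z & Y) / prob(Z)  # 1/2, the same as P(Y)

print(p_y_given_x, p_y_given_z, prob(Y))
```

Here $Y$ and $Z$ are independent because $P(Y|Z) = P(Y)$, while $Y$ and $X$ are not.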

If this isn't clear, please let me know and we can see an example on the whiteboard. 

### Bayes Theorem

The conditional probabilities $P(Y|X)$ and $P(X|Y)$ can be expressed in terms of each other using Bayes' theorem:

$$
P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}
$$
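A classic way to get a feel for the theorem is a diagnostic test. The sketch below uses made-up numbers (1% prevalence, 95% sensitivity, 5% false positive rate), so take it as an illustration and not as real medical data. Here $Y$ is "has the disease" and $X$ is "the test is positive".

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration.
p_disease = Fraction(1, 100)           # P(Y): prevalence
p_pos_given_disease = Fraction(95, 100)  # P(X|Y): sensitivity
p_pos_given_healthy = Fraction(5, 100)   # false positive rate

# P(X): total probability of a positive test.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(Y|X) = P(X|Y) P(Y) / P(X)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(p_pos, p_disease_given_pos, float(p_disease_given_pos))
```

Even with a fairly accurate test, the probability of disease given a positive result is only about 16% here, because the disease is rare to begin with.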



### Distributions

A probability distribution is a function that indicates how likely a random variable is to take each of its possible values. Distributions can be either discrete or continuous, but the sum (discrete case) or integral (continuous case) over all possible values always has to be 1. 

Some of the more important distribution functions are: 

<img src="pdfs.png">

In particular, the binomial distribution is a discrete distribution that indicates how likely it is to get a given number of successes in a given number of trials.

The Normal or Gaussian distribution is very important in statistics because it shows up all the time in nature. 
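Both distributions are easy to write down with the standard library. This is a sketch: the helper names, the choice of 10 trials with $p = 0.5$, and the crude Riemann sum used to approximate the integral are all my own assumptions.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the Normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Discrete case: the probabilities sum to 1.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
pmf_total = sum(pmf)

# Continuous case: the density integrates to 1 (Riemann sum over [-8, 8]).
step = 0.001
pdf_total = sum(normal_pdf(-8 + i * step) * step for i in range(16_000))

print(pmf[5], pmf_total, pdf_total)
```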

<img src="normalpdf.png">

# Statistical Inference 

The main goal of statistical inference is to investigate some properties about a target population.

We define a population as the entire group of individuals that we are interested in studying. This can be anything from all humans to a type of cell or some particles. 

Each individual in a population is often called a unit.

Often it is not feasible to gather data from the entire population in order to draw conclusions about it. When we can, this is called a census. If we extract a subgroup of the population, this is called a sample.

In statistical inference we try to make reasonable conclusions about a population based on the evidence provided by sample data.
From a more general point of view, the goal of inference is to infer the distribution that generated the observed data. For example, if we have a sample $X_1, ..., X_n \sim F$, how do we infer $F$?

But more commonly, we are only interested in a statistic. A sample statistic or statistic is a quantitative measure calculated from a sample. For example the mean, the standard deviation, the minimum, the maximum, etc.
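To see how statistics computed from a sample relate to the same quantities computed from a census, here is a small sketch. The population (heights in cm drawn from a Normal distribution) is simulated and entirely hypothetical.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is reproducible

# Hypothetical population: heights in cm of 20,000 units.
population = [rng.gauss(170, 10) for _ in range(20_000)]

# A census uses every unit; a sample uses only a subgroup.
sample = rng.sample(population, 200)

def summarize(values):
    """Compute a few common statistics for a list of numbers."""
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

census_stats = summarize(population)
sample_stats = summarize(sample)
print(census_stats)
print(sample_stats)
```

The sample statistics are close to the census values but not identical, which is exactly the uncertainty that inference has to deal with.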

Statistical models that assume that the distribution can be modeled with a finite set of parameters $\theta = (\theta_1, \theta_2, ..., \theta_k)$ are called parametric models. For example, if we assume that our data follows a normal distribution $N(\mu, \sigma^2)$ then $\mu$ and $\sigma$ are the parameters of the model. 
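As a sketch of a parametric model in practice, we can simulate data from a Normal distribution with known parameters and then estimate them back from the data. The "true" values below are hypothetical, and I am using `statistics.NormalDist.from_samples` from the Python standard library as one simple way to get the estimates.

```python
import random
from statistics import NormalDist

rng = random.Random(7)  # fixed seed so the run is reproducible

# Hypothetical "true" parameters of the data-generating process.
true_model = NormalDist(mu=10.0, sigma=3.0)
data = [rng.gauss(true_model.mean, true_model.stdev) for _ in range(5_000)]

# Estimate mu and sigma from the data alone.
fitted = NormalDist.from_samples(data)
print(fitted.mean, fitted.stdev)
```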

Frequentist models are based on the following postulates: 

1. Probability refers to limiting relative frequencies. Probabilities are objective properties of the real world.
2. Parameters are fixed, unknown constants. Because they are not fluctuating, no useful probability statements can be made about parameters. 
3. Statistical procedures should be designed to have well-defined long run frequency properties. For example, a 95 percent confidence interval should trap the true value of the parameter with limiting frequency at least 95 percent. 
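The third postulate can be checked by simulation. The sketch below repeats an experiment many times and counts how often a 95 percent confidence interval for the mean traps the true value. The true mean, the known standard deviation, the sample size and the number of trials are all assumptions made for this illustration.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible

# Hypothetical setup: Normal data with a known standard deviation.
TRUE_MU, SIGMA, N = 5.0, 2.0, 50
Z95 = 1.96  # approximate 97.5th percentile of the standard Normal

def interval_covers_truth():
    """Draw one sample and check whether its 95% interval contains TRUE_MU."""
    sample = [rng.gauss(TRUE_MU, SIGMA) for _ in range(N)]
    mean = statistics.fmean(sample)
    half_width = Z95 * SIGMA / math.sqrt(N)
    return mean - half_width <= TRUE_MU <= mean + half_width

trials = 2_000
coverage = sum(interval_covers_truth() for _ in range(trials)) / trials
print(coverage)  # should land close to 0.95
```

Note that the probability statement is about the procedure over repeated samples, not about the parameter, which stays fixed.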

> Ref [Wasserman, 2013]

# Bayesian Inference

The statistical methods that we have covered up to now require that all probabilities be defined by connection to the frequencies of events in very large samples. This leads to frequentist uncertainty being premised on imaginary resampling of data. 

If we were to repeat the measurements many, many times, we would end up collecting a list of values that would have some pattern to it. This means that parameters and models cannot have probability distributions; only measurements can. 

The distribution of these measurements is called a sampling distribution, but this resampling is usually never done, and sometimes it doesn't even make sense. 

But there is another approach to inference called Bayesian inference which is based on the following postulates:

1. Probability describes degree of belief, not limiting frequency. So we can make probability statements about lots of things, not just data that are subject to random variation.
2. We can make probability statements about parameters, even though they are fixed constants. 
3. We can make inferences about a parameter $\theta$ by producing a probability distribution for it. 
4. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution.

In modest terms, Bayesian data analysis is no more than counting the number of ways the data could happen, according to our assumptions. 

In Bayesian analysis all alternative sequences of events that could have generated our data are evaluated. As we learn more about what happened, we prune some of these alternatives. By the end, what remains is only what is logically consistent with our knowledge.
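To make the counting idea concrete, here is a sketch based on the marble example from [McElreath, 2020]. I am assuming a bag of four marbles, each blue or white, and three draws with replacement that came out blue, white, blue. For each conjecture about the bag we enumerate every possible sequence of draws and keep only the ones consistent with the data.

```python
from itertools import product

# Observed data: three draws with replacement.
observed = ("b", "w", "b")

# The five conjectures: bags with 0, 1, 2, 3 or 4 blue marbles out of four.
bags = [("w",) * (4 - k) + ("b",) * k for k in range(5)]

def count_ways(bag, data):
    """Count the draw sequences from `bag` that match `data` exactly."""
    return sum(draws == data for draws in product(bag, repeat=len(data)))

ways = [count_ways(bag, observed) for bag in bags]

# Relative plausibility of each conjecture, treating all of them as
# equally plausible before seeing the data.
total = sum(ways)
plausibility = [w / total for w in ways]

for bag, w, p in zip(bags, ways, plausibility):
    print("".join(bag), w, p)
```

The conjectures with no blue or no white marbles are pruned completely, since they cannot produce the observed sequence.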

<!-- 
Grasping bayesian inference can be quite hard to understand, so lets see an example:

- Let's suppose that we have a bag with four marbles. 
- These marbles come in two colors: blue and white.
- We know there are four marbles in the bag, but we don't know how many are of each color.
- But we can know that there are five possibilities:

<img src="marbles_1.png">

- These are the only possibilities consistent with what we know about the contents of the bag. We will call this possibilities the conjectures. 
- Our goal is to figure out which of this conjectures is more plausible, given some evidence from the contents of the bag. 
- For our evidence, we pull a sequence of three marbles from the bag, one at a time and with replacement. 
- The sequence that we pull from the bag is [b,w,b]
- Now, we can see how to use the data to infer what is in the bag.
- Let's begin by considering just the single conjecture (2) [b,w,w,w], that the bag contains one blue and three white marbles. 
- So, on our first draw of the bag, one of four things could happen, corresponding to one of our four marbles in the bag.

<img src="marbles_2.png">
-->

A Bayesian model has the following components:

- Parameter $\theta$: A way of indexing possible explanations of the data. For example, for a bag of blue and white marbles, $\theta$ could be a conjectured proportion of blue marbles.
- Likelihood $f(d|\theta)$: The relative number of ways that a value $\theta$ can produce the data. It is derived by enumerating all the possible data sequences that could have happened and then eliminating those sequences inconsistent with the data.
- Prior probability $f(\theta)$: The prior plausibility of any specific value of $\theta$.
- Posterior probability $f(\theta|d)$: The new, updated plausibility of any specific $\theta$.
- Evidence or Average Likelihood $f(d)$: The probability of the data averaged over the prior. Its job is just to standardize the posterior, to ensure it sums (integrates) to one.


The general equation that relates all Bayesian components (for both density and mass functions) is the following:
$$
f(\theta|d) = \frac{f(d|\theta) \times f(\theta)}{f(d)}
$$

This equation is essentially Bayes' theorem.
It says that the probability of any particular value of $\theta$, given the data $d$, is proportional to the product of the relative plausibility of the data, conditional on $\theta$, and the prior plausibility of $\theta$.
This product is then divided by the average probability of the data so that the posterior is a valid probability distribution (it sums or integrates to one).
We must bear in mind that Bayesian statistics is not only about using Bayes' theorem. Many non-Bayesian techniques use this theorem too.
Bayesian inference uses Bayes' theorem more generally, to quantify uncertainty about unobserved variables such as parameters.
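Here is a minimal sketch that computes every component of the equation for a small discrete case. I am assuming that $\theta$ is the proportion of blue marbles in a bag of four, that the data are three draws with replacement (blue, white, blue), and that the prior is flat. All of these are illustrative choices.

```python
from fractions import Fraction

# Candidate values of theta: proportion of blue marbles in a bag of four.
thetas = [Fraction(k, 4) for k in range(5)]

# Prior f(theta): flat, an assumption made only for this illustration.
prior = [Fraction(1, len(thetas)) for _ in thetas]

# Likelihood f(d|theta) of drawing blue, white, blue with replacement.
likelihood = [t * (1 - t) * t for t in thetas]

# Evidence f(d): the likelihood averaged over the prior.
evidence = sum(l * p for l, p in zip(likelihood, prior))

# Posterior f(theta|d) = f(d|theta) * f(theta) / f(d)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

for t, post in zip(thetas, posterior):
    print(t, post)
print("sum of posterior:", sum(posterior))
```

Changing `prior` to something that is not flat shifts the posterior accordingly, while the division by `evidence` keeps it summing to one.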



> Ref [McElreath, 2020]

OK, this sounds complicated, but what can state-of-the-art Bayesian models do? 

Let's see some predictions made by spectral mixture multi-output Gaussian processes: 

<img src="mogp_climate.png">

<img src="mogp_prediction.png">

All of this is not even a proper beginning of what Bayesian models are or can do. So once again, I ask you to check the references that I have mentioned so far, or some of the ones listed below. 




[Statistical Rethinking with brms, ggplot2, and the tidyverse](https://bookdown.org/ajkurz/Statistical_Rethinking_recoded/)

[Statistical Rethinking with Python and PyMC3](https://github.com/pymc-devs/resources/tree/master/Rethinking)

[Statistical Rethinking with PyTorch and Pyro](https://fehiepsi.github.io/rethinking-pyro/)

[Demystifying hypothesis testing with simple Python examples](https://towardsdatascience.com/demystifying-hypothesis-testing-with-simple-python-examples-4997ad3c5294)

[On Bayesian and frequentist, latent variables and parameters By Dustin Tran](http://dustintran.com/blog/on-bayesian-and-frequentist-latent-variables-and-parameters)

[What is statistics by Michael Jordan](https://www.youtube.com/watch?v=EYIKy_FM9x0&t=4742s)

[Foundations of Statistics – Frequentist and Bayesian by Mary Parker](https://www.austincc.edu/mparker/stat/nov04/talk_nov04.pdf)

[The Permutation Test by Jared Wilber](https://www.jwilber.me/permutationtest/)

[Spurious Correlations](https://tylervigen.com/old-version.html)

[Research Methods and Statistics Bristol University](http://www.bristol.ac.uk/medical-school/media/rms/red/index.html)

[Seeing Theory - Frequentist Inference](https://seeing-theory.brown.edu/frequentist-inference/)

[Stats with R by Manny Gimond](https://mgimond.github.io/Stats-in-R/index.html)

[FiveMinuteStats by Matthew Stephens](https://stephens999.github.io/fiveMinuteStats/index.html)

[Experimentation - Yale University](http://www.stat.yale.edu/Courses/1997-98/101/expdes.htm)

[Probabilistic graphical models notes by Stefano Ermon](https://ermongroup.github.io/cs228-notes/)

[D-separation without tears by Judea Pearl](http://bayes.cs.ucla.edu/BOOK-2K/d-sep.html)

[ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus by Ferenc Huszár](https://www.inference.vc/untitled/)

[AIC and BIC by Barum Park](https://barumpark.com/blog/2018/aic-and-bic/)
