<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Bayesian Statistics

_Instructor: Aymeric Flaisler_

---

### Learning Objectives
- Review the axioms and properties of probability
- Cover the formula for Bayes rule
- Learn the diachronic interpretation of Bayes rule
- Gain an intuition for the different components of the formula
- Tackle the Monty Hall problem with Bayesian statistics
- Complete some additional Bayesian statistics problems

### Lesson Guide
- [Review of probability](#review)
    - [Axioms of probability](#axioms)
    - [Properties of probability](#properties)
- [Bayes rule](#bayes-rule)
    - [The "diachronic" interpretation](#diachronic)
- [Frequentist vs. Bayesian probability](#freq-vs-bayes)
- [Bayes rule in parts](#parts)
- [The Monty Hall problem](#monty-hall)
- [Additional Bayesian statistics problems](#additional)

<a id='review'></a>
## Review of probability

---


---
<a id='axioms'></a>
### Axioms of probability

**Nonnegativity**

For any event $A$, the probability of the event must be greater than or equal to zero.

### $$ 0 \le P(A) $$

**Unit measure**

The probability of the entire sample space is 1.

### $$ P(S) = 1 $$

**Additivity**

For mutually exclusive (in other words, "disjoint") events $E_i$, the probability of any of the events occurring is equal to the sum of their probabilities.

### $$ P\left(\cup_{i=1}^{\infty}\; E_i \right) = \sum_{i=1}^{\infty} P(E_i) $$

---
<a id='properties'></a>
### Properties of probability

**The probability of no event**

The probability of the empty set, denoted $\emptyset$, is zero.

### $$ P\left(\emptyset \right) = 0 $$

**The probability of A or B occurring (union)**

The probability of event $A$ or event $B$ occurring is equal to the sum of their individual probabilities minus the probability of their intersection (the probability that they both occur).

### $$ P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
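As a quick check of the union formula, here is a minimal sketch using one roll of a fair six-sided die. The die and the two events are illustrative assumptions, not part of the lesson's data.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# Count the union directly, then use the formula.
p_union_direct = prob(A | B)
p_union_formula = prob(A) + prob(B) - prob(A & B)

print(p_union_direct, p_union_formula)  # 2/3 2/3
```

Subtracting $P(A \cap B)$ matters because the outcomes 4 and 6 belong to both events and would otherwise be counted twice.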

**Conditional probability**

It is defined as follows: the probability of an event $A$ given $B$ equals the probability of $A$ and $B$ happening together divided by the probability of $B$.

**For example:** Assume two partially intersecting sets A and B as shown below.

Set A represents one set of events and Set B represents another. We wish to calculate the probability of A given that B has already happened. Let's represent the occurrence of event B by shading it red.

![image.png](attachment:image.png)

Now since B has happened, the only part of A that still matters is the part shaded in blue, which is the intersection $A \cap B$. So, the probability of A given B turns out to be:
```
BlueArea 
__________________
(RedArea + BlueArea)
```


The probability of an event conditional on another event is written using a vertical bar between the two events. The probability of event $A$ occurring _given_ that event $B$ occurs is calculated as:

### $$ P(A | B) = \frac{P(A \cap B)}{P(B)} $$



This is the probability of both $A$ and $B$ occurring divided by the probability that $B$ occurs at all.
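To make the definition concrete, here is a small sketch with a fair six-sided die (an assumed example). It computes $P(A|B)$ from the formula and compares it with simply counting outcomes inside $B$.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# Formula: P(A|B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

# Same quantity by restricting attention to the outcomes in B.
p_a_given_b_by_counting = Fraction(len(A & B), len(B))

print(p_a_given_b, p_a_given_b_by_counting)  # 2/3 2/3
```

Conditioning on $B$ amounts to treating $B$ as the new sample space, which is why both calculations agree.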

**Joint probability**

The joint probability of two events $A$ and $B$ is a reformulation of the above equation.

### $$ P(A \cap B) = P(A|B) \; P(B) $$

Verbally, if we want to know the probability that both $A$ and $B$ happen, we can multiply the probability that $B$ happens by the probability that $A$ happens given $B$ happens.

In a similar fashion,

### $$ P(B \cap A) = P(B|A) \; P(A) $$



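Because $P(A \cap B) = P(B \cap A)$, we can substitute the second expression into the definition of conditional probability:

### $$ P(A | B) = \frac{P(B|A) \; P(A)}{P(B)} $$

This is still a statement about **conditional probability**: it expresses $P(A|B)$ in terms of the reverse conditional $P(B|A)$.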

### Let’s try to answer a betting problem with this technique.

Suppose $B$ is the event that James Hunt wins and $A$ is the event that it rains. Therefore,

$P(A) = 1/2$, since it rained on two out of four days.  
$P(B) = 1/4$, since James won only one race out of four.  
$P(A|B) = 1$, since it rained every time James won.  

We want $P(B|A)$, the probability that James wins given that it rains, so we use the formula with the roles of $A$ and $B$ swapped: $P(B|A) = P(A|B)\,P(B)/P(A)$. Substituting the values gives a probability of 50%, double the 25% we had when rain was not taken into account (solve it yourself).
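If you want to check your answer, the substitution can be sketched in a few lines. The variable names are our own; the numbers are the ones given in the problem.

```python
from fractions import Fraction

p_rain = Fraction(1, 2)            # P(A): it rained on two of four days
p_win = Fraction(1, 4)             # P(B): James won one race of four
p_rain_given_win = Fraction(1, 1)  # P(A|B): it rained every time James won

# P(B|A) = P(A|B) * P(B) / P(A)
p_win_given_rain = p_rain_given_win * p_win / p_rain

print(p_win_given_rain)  # 1/2
```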

This strengthens our belief that James will win **in the light of new evidence, i.e. rain**. You may have noticed that this formula closely resembles something you might have heard a lot about.

You probably guessed it right: it looks like Bayes theorem.

Bayes theorem is built on top of conditional probability.

**The law of total probability**

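Let's say we want to know the probability of the event $B$ occurring across _all_ different events $A_i$, where the events $A_1, \dots, A_n$ are mutually exclusive and together cover the whole sample space. Then: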

### $$ P(B) = \sum_{i=1}^n P(B \cap A_i) $$

![total probability](./assets/images/output_27_0.png)
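Here is a numerical sketch of the law of total probability. The scenario (three machines producing parts, some of them defective) and all of the numbers are hypothetical, chosen only to illustrate the sum. Each joint term is computed as $P(B \cap A_i) = P(B|A_i)\,P(A_i)$.

```python
from fractions import Fraction

# Hypothetical partition A_1, A_2, A_3: which machine produced a part.
p_machine = {
    "A1": Fraction(1, 2),
    "A2": Fraction(3, 10),
    "A3": Fraction(1, 5),
}

# Hypothetical P(B | A_i): probability that a part from each machine is defective.
p_defect_given_machine = {
    "A1": Fraction(1, 100),
    "A2": Fraction(2, 100),
    "A3": Fraction(5, 100),
}

# The A_i must cover the whole sample space without overlapping.
assert sum(p_machine.values()) == 1

# Joint probabilities P(B and A_i), then sum them to get P(B).
p_joint = {m: p_defect_given_machine[m] * p_machine[m] for m in p_machine}
p_defect = sum(p_joint.values())

print(p_defect)  # 21/1000
```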

**Check (in pairs):** translate this into an everyday event.

<a id='bayes-rule'></a>
## Bayes rule

---

Bayes Rule relates the probability of $A$ given $B$ to the probability of $B$ given $A$. This rule is critical for performing statistical inference, as we shall see shortly. It is formulated as:

![image.png](attachment:image.png)


<a id='diachronic'></a>
## Bayesian modeling intuition:

We can re-write the formula for Bayes Rule in the context of hypotheses and data. The diachronic interpretation describes how the probability of a hypothesis changes _over time_ as we collect new data.

In this case we have a model or a statistic, and we are asking the probability of our model given the data that we have observed.

### $$P\left(model\;|\;data\right) = \frac{P\left(data\;|\;model\right)}{P(data)}\; P\left(model\right)$$

## Bayesian inference of parameters

___

![image.png](attachment:image.png)

<a id='freq-vs-bayes'></a>
## Frequentist vs. Bayesian probability

---

### Frequentism (observations)

Frequentists believe the "true" value of a statistic about a population (for example, the mean) is fixed (and not known). We can infer more about this "true" distribution by engaging in sampling, testing for effects, and studying relevant parameters of the population.

Say we are flipping a coin and want to know the probability of heads. Frequentists formulate the probability of heads as a limit: the true probability of heads is the long-run proportion of heads over an infinite number of flips of that coin.

### $$P(\text{heads}) = \lim_{\text{# of coin flips} \to \infty} \frac{\text{# of heads}}{\text{# of flips}}$$

Alternatively, we can write this more generally as the proportion of times any event $A$ occurs over an infinite number of observations/experiments (random samples from the event space).

### $$P(A) = \lim_{\text{# of experiments} \to \infty} \frac{\text{# of occurrences of A}}{\text{# of experiments}} $$
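We cannot run infinitely many experiments, but a simulation gives a feel for the limit. The sketch below assumes a fair coin (`p_heads=0.5`) and uses Python's pseudo-random generator. The function name is our own.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def relative_frequency(n_flips, p_heads=0.5):
    """Simulate n_flips coin flips and return the proportion of heads."""
    heads = sum(random.random() < p_heads for _ in range(n_flips))
    return heads / n_flips

# The proportion of heads tends to settle near p_heads as n grows.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

With only a few flips the proportion can be far from 0.5; with many flips it is typically very close.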

### Bayesianism (calculations)

Bayesians believe that data informs us about the distribution of a statistic or event, and as we receive more data our view of the distribution can be updated, further confirming or denying our previous beliefs (but never with total certainty).

For the coin flip example above, we would write out the probability of heads as our belief in the probability of getting heads given the evidence we have from observing coin flips.

### $$ P(\text{heads} \;|\; \text{# of heads observed}) = \frac{P(\text{# of heads observed} \;|\; \text{heads})}{P(\text{# of heads observed})} P(\text{heads}) $$

Here we are representing the updated probability of flipping heads with:

Our **prior** belief, before observing flips, of the probability of flipping heads: $P(\text{heads})$

The **likelihood** of the data we observe given the chance to flip heads: $P(\text{# of heads observed} \;|\; \text{heads})$

The **total probability** of observing that many heads in coin flips regardless of weighting (or rather, across all coin weightings): $P(\text{# of heads observed})$
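The three pieces above can be combined in a short sketch. We assume hypothetical data (7 heads in 10 flips), a coarse grid of candidate coin weightings, and a uniform prior over that grid. None of these choices come from the lesson; they are only for illustration.

```python
from math import comb

# Hypothetical data: 7 heads observed in 10 flips.
n_flips, n_heads = 10, 7

# Candidate coin weightings (probability of heads) with a uniform prior.
weights = [i / 10 for i in range(11)]
prior = [1 / len(weights)] * len(weights)

# Likelihood of the observed data under each weighting (binomial).
likelihood = [
    comb(n_flips, n_heads) * w**n_heads * (1 - w) ** (n_flips - n_heads)
    for w in weights
]

# Total probability of the data across all coin weightings.
p_data = sum(l * p for l, p in zip(likelihood, prior))

# Updated belief in each weighting after seeing the data.
posterior = [l * p / p_data for l, p in zip(likelihood, prior)]

most_probable = weights[posterior.index(max(posterior))]
print(most_probable)  # 0.7
```

Because the prior is uniform here, the most probable weighting is simply the one that makes the observed data most likely.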

### Frequentism vs. Bayesianism: a Philosophical Debate
Fundamentally, the disagreement between frequentists and Bayesians concerns the definition of probability.

For frequentists, probability only has meaning in terms of a **limiting case of repeated measurements**. That is, if I measure the photon flux F from a given star (we'll assume for now that the star's flux does not vary with time), then measure it again, then again, and so on, each time I will get a slightly different answer due to the statistical error of my measuring device. In the limit of a large number of measurements, the frequency of any given value indicates the probability of measuring that value. For frequentists **probabilities are fundamentally related to frequencies of events**. This means, for example, that in a strict frequentist view, it is meaningless to talk about the probability of the true flux of the star: the true flux is (by definition) a single fixed value, and to talk about a frequency distribution for a fixed value is nonsense.

For Bayesians, the concept of probability is extended to cover **degrees of certainty about statements**. Say a Bayesian claims to measure the flux F of a star with some probability P(F): that probability can certainly be estimated from frequencies in the limit of a large number of repeated experiments, but this is not fundamental. The probability is a statement of my knowledge of what the measurement result will be. For Bayesians, **probabilities are fundamentally related to our own knowledge about an event**. This means, for example, that in a Bayesian view, we can meaningfully talk about the probability that the true flux of a star lies in a given range. That probability codifies our knowledge of the value based on prior information and/or available data.

The surprising thing is that this arguably subtle difference in philosophy leads, in practice, to vastly different approaches to the statistical analysis of data.

<a id='parts'></a>
## Bayes rule in parts
---

Using the diachronic interpretation of Bayes Rule, we can describe each part with its label like in our coin flip example above.

### $$P\left(model\;|\;data\right) = \frac{P\left(data\;|\;model\right)}{P(data)}\; P\left(model\right)$$

**The prior**

### $$ \text{prior} = P\left(model\right) $$

The prior is our belief in the model given no additional information. This "model" could be as simple as a statistic like the mean we are measuring, or a complex regression. 

**The likelihood**

### $$ \text{likelihood} = P\left(data\;|\;model\right) $$

The likelihood is the probability of observing our data given the model. So, for example, assuming that a coin is biased towards heads with a probability of heads of 0.9, what is the likelihood of observing 10 tails and 2 heads in 12 coin flips?

The likelihood is in fact what frequentist statistical methods are measuring. 
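The coin question above can be answered with the binomial formula. This is a minimal sketch that assumes the flips are independent and that the order of heads and tails does not matter.

```python
from math import comb

p_heads = 0.9  # the model: a coin biased towards heads
n_flips = 12
n_heads = 2    # observed: 2 heads and 10 tails

# Binomial likelihood: (number of orderings) * P(heads)^heads * P(tails)^tails
likelihood = (
    comb(n_flips, n_heads)
    * p_heads**n_heads
    * (1 - p_heads) ** (n_flips - n_heads)
)

print(f"{likelihood:.3e}")  # about 5.3e-09
```

The likelihood is tiny: data this tail-heavy would be very surprising if the coin really landed heads 90% of the time.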

**The marginal probability or total probability of the data**

### $$ \text{marginal probability of data} = P(data) $$

The marginal probability of the data is the probability that our data is observed regardless of what model we choose or believe in. We divide the likelihood by this value to ensure that we are only talking about our model within the context of the data occurring. More technically, we divide by this value to ensure that what we get out on the other side is a true probability distribution - more on this later.
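To see the normalizing role of $P(data)$, consider a sketch with only two candidate models, a fair coin and a coin with a 0.9 probability of heads, believed equally likely beforehand. The two-model setup and the equal priors are assumptions made for illustration.

```python
from math import comb

def binom_likelihood(p_heads, n_heads, n_flips):
    """Probability of n_heads in n_flips independent flips."""
    return comb(n_flips, n_heads) * p_heads**n_heads * (1 - p_heads) ** (n_flips - n_heads)

# Hypothetical setup: two models with equal prior belief.
p_heads_by_model = {"fair": 0.5, "biased": 0.9}
priors = {"fair": 0.5, "biased": 0.5}

# Observed data: 2 heads in 12 flips.
n_heads, n_flips = 2, 12

likelihoods = {
    m: binom_likelihood(p, n_heads, n_flips) for m, p in p_heads_by_model.items()
}

# Marginal probability of the data: sum over every model we consider.
p_data = sum(likelihoods[m] * priors[m] for m in priors)

# Dividing by p_data makes the posterior beliefs sum to 1.
posteriors = {m: likelihoods[m] * priors[m] / p_data for m in priors}

for model, p in posteriors.items():
    print(model, p)
```

Without the division, the products of likelihood and prior would be very small numbers that do not sum to 1, so they could not be read as probabilities.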

**The posterior**

### $$ \text{posterior} = P\left(model\;|\;data\right) $$

The posterior is our _updated_ belief in the model given the new data we have observed. Bayesian statistics is all about updating a prior belief we have about the world with the data we observe, and so we are transforming our _prior_ belief about the world into this new _posterior_ belief about the world.


**The likelihood ratio**

### $$ \frac{P\left(data\;|\;model\right)}{P(data)}\;$$

This is the factor by which the prior is multiplied to give the posterior.

<a id='monty-hall'></a>

## The Monty Hall problem
---

The Monty Hall problem is a famous probability problem with an unintuitive solution. Framing it in a Bayesian context makes it clear!

Open up the Monty Hall notebook and tackle the problem.

### To summarize the differences:

- **Frequentism** considers probabilities to be related to frequencies of real or hypothetical events.
- **Frequentist** analyses generally proceed through use of point estimates and maximum likelihood approaches.
- **Bayesianism** considers probabilities to measure degrees of knowledge.
- **Bayesian** analyses generally compute the posterior either directly or through some use of sampling methods.

# Learning Objectives Check for Intro to Bayesian Stats:

### At this point, it is important that you are able to:
- Explain what is meant by 'mutually exclusive' or 'disjoint' events. 
- Use & understand the formula for:
    - unions (the probability of A or B occurring);
    - conditional probabilities (the probability of A occurring given that B occurs);
    - joint probabilities (the probability that A and B both occur);
    - the law of total probability (the probability of event A across all different event Bs).
- Use & understand the formula for Bayes Rule, including how we can re-write Bayes rule in terms of 'model' and 'data', to show the probability of our model given the data we have observed.


<a id='additional'></a>
## Additional Bayesian statistics problems
---

As independent practice, you can tackle some more Bayesian statistics problems:
- Pregnancy screening problem
- Cookie Jar problem
- The German Tank problem
- Dungeons & Dragons dice problems
- M&M's problem

[The questions can be found in this notebook.](bayes-problems.ipynb)

### Resources:

- https://www.analyticsvidhya.com/blog/2016/06/bayesian-statistics-beginners-simple-english/
- Very good introduction about Frequentism and Bayesianism: http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/
- [Semi-technical comparison of the essential features of the frequentist and Bayesian approaches to statistical inference](1411.5018.pdf)