 # Lesson Bayes: The Philosophy and Equation
 3 things we care about, one we don't

In [11]:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
from scipy import stats

In [12]:
%matplotlib inline
az.style.use("arviz-darkgrid")
RANDOM_SEED = 8265
np.random.seed(RANDOM_SEED)

# Two websites: Which one is better?
If we deploy two websites, which one will generate *more conversions*?
![image.png](attachment:image.png)


# Two websites: Which one is better (using distributions)?

If we model 100 people visiting a website using a *binomial distribution*, what is the most plausible *distribution* of *p(conversion)*?
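
To make the binomial model concrete, the sketch below simulates a few batches of 100 visitors. The conversion rate of 0.3 is an arbitrary value chosen only for illustration; it does not come from the lesson.

```python
import numpy as np

# Assumption for illustration only: a "true" conversion rate of 30%
assumed_p_conversion = 0.3
n_visitors = 100

rng = np.random.default_rng(8265)  # same seed value as RANDOM_SEED above

# Each draw is the number of conversions among 100 independent visitors
simulated_conversions = rng.binomial(n=n_visitors, p=assumed_p_conversion, size=5)
print(simulated_conversions)
```

Each number printed is one possible outcome of the same experiment, which is why we reason about a *distribution* of *p(conversion)* rather than a single value.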


# Two websites: Which one is better (using distributions and expert knowledge)?

If we model 100 people visiting a website using a *binomial distribution*, what is the most plausible *distribution* of *p(conversion)*, given that we know from experiments that similar websites get between 10% and 50% conversions?


# Bayes Rule: The mathematical formula 

Below is the mathematical formula that represents Bayes Theorem, with each component labeled.

$\boldsymbol{Y}$ is the observed data and $\boldsymbol{\theta}$ represents our parameters.
$$ 
\underbrace{p(\boldsymbol{\theta} \mid \boldsymbol{Y})}_{\text{posterior}} = \frac{\overbrace{p(\boldsymbol{Y} \mid \boldsymbol{\theta})}^{\text{likelihood}}\; \overbrace{p(\boldsymbol{\theta})}^{\text{prior}}}{\underbrace{{p(\boldsymbol{Y})}}_{\text{marginal likelihood}}}
$$



# Bayes Rule: Building an intuition
Below is the mathematical formula that represents Bayes Theorem, with each component labeled

$$ 
\underbrace{p(\boldsymbol{\theta} \mid \boldsymbol{Y})}_{\text{posterior}} = \frac{\overbrace{p(\boldsymbol{Y} \mid \boldsymbol{\theta})}^{\text{likelihood}}\; \overbrace{p(\boldsymbol{\theta})}^{\text{prior}}}{\underbrace{{p(\boldsymbol{Y})}}_{\text{marginal likelihood}}}
$$

While we won't be using the formula directly, breaking each piece down can help with intuition


# Bayes Rule: Posterior Distribution

$$ 
\underbrace{p(\boldsymbol{\theta} \mid \boldsymbol{Y})}_{\text{posterior}} = \frac{\overbrace{p(\boldsymbol{Y} \mid \boldsymbol{\theta})}^{\text{likelihood}}\; \overbrace{p(\boldsymbol{\theta})}^{\text{prior}}}{\underbrace{{p(\boldsymbol{Y})}}_{\text{marginal likelihood}}}
$$

On the left is the *posterior distribution* $p(\boldsymbol{\theta} \mid \boldsymbol{Y})$.  
The posterior distribution represents what we have learned from the world, and it is what enables us to perform *inference*.

In our A/B test our posterior distribution represents our beliefs *after* seeing the observed samples from the A/B test

# Bayes Rule: Prior  Distribution
$$ 
\underbrace{p(\boldsymbol{\theta} \mid \boldsymbol{Y})}_{\text{posterior}} = \frac{\overbrace{p(\boldsymbol{Y} \mid \boldsymbol{\theta})}^{\text{likelihood}}\; \overbrace{p(\boldsymbol{\theta})}^{\text{prior}}}{\underbrace{{p(\boldsymbol{Y})}}_{\text{marginal likelihood}}}
$$

On the top right is the *prior distribution* $p(\boldsymbol{\theta})$ 

The prior distribution describes the plausibility of parameters before seeing any data. 
This is what lets us express expert knowledge if we choose to do so.

In our A/B test for instance we may know that anything above an 80% conversion rate is essentially impossible and can choose to express that expert knowledge in our prior
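
One way to encode this kind of knowledge is with a Beta distribution over the conversion rate. The sketch below uses Beta(2, 5), which is an assumed choice for illustration rather than a prior prescribed by this lesson, and checks how little prior mass sits above 80%.

```python
from scipy import stats

# Assumed prior for illustration: Beta(2, 5) over the conversion rate
prior = stats.beta(2, 5)

# Prior probability that the conversion rate exceeds 80%
mass_above_80 = prior.sf(0.8)  # survival function = 1 - CDF
print(f"Prior mass above 80% conversion: {mass_above_80:.4f}")
```

A prior like this does not forbid high conversion rates outright; it only makes them very implausible before any data is seen.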

# Bayes Rule: Likelihood
$$ 
\underbrace{p(\boldsymbol{\theta} \mid \boldsymbol{Y})}_{\text{posterior}} = \frac{\overbrace{p(\boldsymbol{Y} \mid \boldsymbol{\theta})}^{\text{likelihood}}\; \overbrace{p(\boldsymbol{\theta})}^{\text{prior}}}{\underbrace{{p(\boldsymbol{Y})}}_{\text{marginal likelihood}}}
$$

Also on the top right is the *likelihood* $p(\boldsymbol{Y} \mid \boldsymbol{\theta})$ 

The likelihood describes the plausibility of the data given a parameter.  
This is what incorporates the data into the posterior calculation.

In our A/B test, if 40 out of 100 people convert in our observed sample, the likelihood is what tells us that a true conversion rate of 35% is more plausible than a true conversion rate of 90%.
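
The comparison in this example can be checked directly with the binomial probability mass function. This is a minimal sketch using `scipy.stats.binom`; the variable names are chosen here for illustration.

```python
from scipy import stats

# Observed data from the example in the text: 40 conversions out of 100 visitors
n_visitors = 100
n_conversions = 40

# Likelihood of the observed data under two candidate conversion rates
likelihood_35 = stats.binom.pmf(n_conversions, n_visitors, 0.35)
likelihood_90 = stats.binom.pmf(n_conversions, n_visitors, 0.90)

print(f"p(Y | theta=0.35) = {likelihood_35:.4f}")
print(f"p(Y | theta=0.90) = {likelihood_90:.2e}")
```

The first value is many orders of magnitude larger than the second, which is the sense in which a 35% conversion rate is "more plausible" given the data.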

# 🤔 Marginal Likelihood of Data (and why we don't care)
The last term is $p(\boldsymbol{Y})$ at the bottom, which can be described as the *unconditional probability of the data*.

We ignore this term for three reasons:
1. The probability of the data itself is not important; what we care about is the underlying parameter *p(conversion)*
2. It's extremely difficult to calculate for nontrivial problems
  * The type of math required to actually figure it out, integrals, can quickly become challenging or even impossible
3. In sampling-based Bayes, we don't need it to calculate the posterior distribution

We will ignore this term for the rest of this course for these reasons
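
The third reason can be checked numerically on a small problem. The sketch below uses a grid of candidate conversion rates and a flat prior (both are assumptions made for illustration, not the sampling approach used in this course) and shows that dividing by $p(\boldsymbol{Y})$ rescales the curve without moving its peak.

```python
import numpy as np
from scipy import stats

# Grid of candidate conversion rates (illustrative approximation)
theta = np.linspace(0, 1, 1001)
d_theta = theta[1] - theta[0]

prior = np.ones_like(theta)  # assumed flat prior density on [0, 1]
likelihood = stats.binom.pmf(40, 100, theta)  # 40 conversions out of 100 visitors
unnormalized = likelihood * prior

# p(Y) approximated as the area under likelihood * prior
marginal_likelihood = unnormalized.sum() * d_theta
posterior = unnormalized / marginal_likelihood

# The most plausible conversion rate is the same with or without p(Y)
peak_unnormalized = theta[np.argmax(unnormalized)]
peak_posterior = theta[np.argmax(posterior)]
print(peak_unnormalized, peak_posterior)
```

Because $p(\boldsymbol{Y})$ is a single number that does not depend on $\boldsymbol{\theta}$, it only changes the vertical scale of the posterior.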

# Two websites: Which one is better (in Bayes language)
Reframed in Bayesian language
* The knowledge that conversion is between 10% and 50%, which is what we believe *prior* to seeing data, defines our *prior distribution*
* The assumptions that each visitor has an *equal probability* of converting, that each visit is *independent*, and that what we observe is a total count are represented in the *likelihood*
    * The observed data will be used in conjunction with the likelihood distribution to calculate, well, the likelihood of the observed data

Thus the *posterior* distribution will be a combination of
* our *prior knowledge* 
* our model of *how the world works*
* the *observed data*

all of which will inform which website is better
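
The three ingredients above can be combined in a small grid-based sketch. Everything specific here is an assumption made for illustration: the Beta(5, 12) prior (chosen because most of its mass lies between 10% and 50%), the count of 30 conversions for website B, and the grid approximation itself, which stands in for the sampling methods used in this course.

```python
import numpy as np
from scipy import stats

theta = np.linspace(0, 1, 501)  # candidate conversion rates

# Prior knowledge: assumed Beta(5, 12), mostly between 10% and 50%
prior_dist = stats.beta(5, 12)
prior = prior_dist.pdf(theta)
prior_mass_10_50 = prior_dist.cdf(0.5) - prior_dist.cdf(0.1)


def grid_posterior(conversions, visitors):
    """Posterior over theta on the grid, normalized to sum to 1."""
    # Model of how the world works: binomial likelihood of the observed count
    likelihood = stats.binom.pmf(conversions, visitors, theta)
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()


# Observed data: 40/100 for A (from the text), 30/100 for B (hypothetical)
post_a = grid_posterior(40, 100)
post_b = grid_posterior(30, 100)

# joint[i, j] = P(theta_a = theta[i]) * P(theta_b = theta[j])
joint = np.outer(post_a, post_b)
# Entries strictly below the diagonal have i > j, i.e. theta_a > theta_b
prob_a_better = np.tril(joint, k=-1).sum()

print(f"Prior mass between 10% and 50%: {prior_mass_10_50:.3f}")
print(f"P(website A converts better than B): {prob_a_better:.3f}")
```

The final number is a direct answer to "which website is better", expressed as a probability rather than a yes or no.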

# Bayes Formula in a picture
We showed this in our introductory Lesson. Now you know the mathematics behind it
![image.png](attachment:image.png)

# Section Recap
* Bayes Formula gives us a mathematical framework in which to model real world observations
* Bayes Rule is comprised of 3 major parts
    * Priors (A thing we specify)
    * Likelihoods (A thing we specify)
    * Posterior (A thing we calculate)
* Bayes rule also has a fourth term, the *marginal likelihood*, that is not needed with sampling-based methods and can be safely ignored in this course