In [6]:
import matplotlib.pyplot as plt
import numpy as np

# Bayes' rule

$ P( x | y ) = \frac{P( y | x ) \, P( x )}{ P(y) } $

$ P(x) $ = prior belief of the state, before seeing any data

$ P(y|x) $ = probability of seeing the data given the state

$ P(y) $ = probability of seeing the data

$ P(x|y) $ = posterior belief of the state, after seeing the data

# Our dilemma

We are at a pingo site in Adventdalen. We want to know if the pingo is emitting methane or not. 

We start our experiment by assuming there is a 50 percent chance that the pingo is a methane source. 

What do we do? We measure the methane concentration downwind from the pingo. Suppose that we measure a methane concentration of 2.2 ppm. 

What do we (assume to) know from our observation model? 
- The probability of measuring a methane concentration above 2 ppm downwind from a pingo emitting methane is 70 percent.
- The probability of measuring a methane concentration above 2 ppm downwind from a pingo NOT emitting methane is 20 percent.

(These statistics are entirely made-up for this exercise)

# Translate this to math

### Probability

Uncertainty in the outcomes of a random variable $A$ is represented by a probability distribution. The probability of any particular outcome of the random variable, $A = a$, lies between 0 and 1, such that: 
$$ 0 \le P(A=a) \le 1$$

In this exercise, we look at discrete probability distributions, which are distributions over a discrete set of values. The distribution associated with random variable $A$ assigns a probability to each possible value that $A$ can take. The probabilities sum to 1:

$$ \sum_{i=1}^n P(A = a_i) = 1 $$

For our dilemma, the statement "the pingo is a methane source" is modeled as a binary random variable that can take on two values, true or false:

$ P(source) = 0.5 $

$ P( \neg source) = 1 - P(source) = 0.5 $

Our prior belief $P(x)$ is a discrete probability distribution over these two values: $ P(source)$ and $ P( \neg source) $. As you can see, the discrete probability distribution of a binary random variable can be fully characterized by the probability of one state, $P(source)$, as the probability of the other state is simply $1-P(source)$.

In [7]:
P_X = 0.5
P_notX = 1 - P_X

In [None]:
fig, ax = plt.subplots(1,1, figsize=(4,3))

categories = ['source', 'not source']

ax.bar(categories, [P_X, P_notX])
ax.set_title('Prior')
ax.set_ylabel('P(x)')
ax.set_xlabel('x')
ax.set_ylim([0,1])

plt.tight_layout()

### Conditional probability

The likelihood is given by conditional probability $ P(y|x) $.

$ P(y|x) $ = probability of $y$, given $x$.

For our dilemma, we have:

$ P( > 2\,ppm | source ) = 0.7 $

$ P( > 2\,ppm | \neg source ) = 0.2 $

Note that $ P(y|x) $ is generally NOT the same as $ P(x|y)$. For example, $P ( rain | cloudy ) \neq P ( cloudy | rain )$.
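
To make the asymmetry concrete, here is a minimal sketch that computes both conditional probabilities from a table of day counts. The counts are invented purely for illustration (they are not weather statistics and have nothing to do with the pingo numbers).

```python
# Hypothetical day counts, made up for illustration only.
counts = {
    ("cloudy", "rain"): 30,
    ("cloudy", "dry"): 50,
    ("clear", "rain"): 2,
    ("clear", "dry"): 118,
}

# Number of days matching each condition.
n_cloudy = sum(n for (sky, _), n in counts.items() if sky == "cloudy")
n_rain = sum(n for (_, weather), n in counts.items() if weather == "rain")
n_both = counts[("cloudy", "rain")]

# Same numerator, different denominators.
p_rain_given_cloudy = n_both / n_cloudy
p_cloudy_given_rain = n_both / n_rain

print(f"P(rain | cloudy) = {p_rain_given_cloudy}")
print(f"P(cloudy | rain) = {p_cloudy_given_rain}")
```

Both quantities share the count of cloudy, rainy days in the numerator, but they are divided by different totals, which is why they generally differ.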

In [9]:
P_Y_given_X = 0.7
P_Y_given_notX = 0.2

### Joint probability

$ P( y \cap x ) $ = probability that both $y$ and $x$ are the case. This is also written as $ P( y, x ) $.

The product rule: $$ P( y \cap x) = P (y|x) \, P (x)$$

For our dilemma:

$ P( > 2\,ppm \cap source ) = P( > 2\,ppm | source ) \, P(source) = $ ? 

$ P( > 2\,ppm \cap \neg source ) = P( > 2\,ppm | \neg source ) \, P(\neg source) = $ ? 

Note that $ P(x \cap y) $ is the same as $ P(y \cap x) $. For example, $P ( rain \cap cloudy ) = P ( cloudy \cap rain )$.

In [None]:
# complete the expressions below
P_Y_and_X = ...
P_Y_and_notX = ...

print(f"P(>2 ppm and source) = {P_Y_and_X}")
print(f"P(>2 ppm and not source) = {P_Y_and_notX}")

# What did Thomas Bayes notice?

We have:

$ P (y \cap x) = P(y | x) \, P(x) $

$ P (x \cap y) = P(x | y) \, P(y) $

Because $ P (y \cap x) = P (x \cap y) $:

$ P(y | x) \, P(x) = P(x | y) \, P(y) $

We can rewrite this as Bayes' rule:

$$ P(x | y) = \frac{P(y | x) \, P(x)}{ P(y) } $$

# Apply this to our dilemma

For our dilemma:

$ P(source | > 2 \, ppm ) = \frac{ P(> 2\,ppm | source ) \, P( source) } {P( > 2\, ppm)} $.

We discussed the numerator, but not the denominator. What is $ P( > 2\, ppm) $? 

Here, we use the sum rule. For discrete random variables: 

$$P(y) = \sum_{i} P( y \cap x_i) $$

Such that (using joint probability):

$$P(y) = \sum_{i} P( y | x_i) \, P (x_i) $$

We can write Bayes' theorem as:

$$ P(x | y ) = \frac{ P( y | x ) \, P( x ) } { \sum_i P( y | x_i) \, P (x_i) } $$

In our dilemma, the state can only take two values. We can write:

$ P(source | > 2 \, ppm ) = \frac{ P(> 2\,ppm | source ) \, P( source) } {P( > 2\, ppm \, \cap \, source) + P( > 2\, ppm \, \cap \, \neg source)} $.
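
As a sketch of how the product rule and sum rule combine, the snippet below applies Bayes' theorem to a different, hypothetical two-state problem (an instrument that is either faulty or ok, with an alarm as the observation). The function name `discrete_posterior` and all numbers are assumptions made for this illustration.

```python
import numpy as np

def discrete_posterior(prior, likelihood):
    """Posterior over discrete states for one observation (hypothetical helper)."""
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    joint = likelihood * prior   # product rule: P(y and x_i) = P(y | x_i) P(x_i)
    evidence = joint.sum()       # sum rule: P(y) = sum_i P(y and x_i)
    return joint / evidence

# Made-up numbers: states are [faulty, ok], the observation is "alarm".
example_prior = [0.1, 0.9]
example_likelihood = [0.9, 0.05]   # P(alarm | faulty), P(alarm | ok)

example_posterior = discrete_posterior(example_prior, example_likelihood)
print(f"P(faulty | alarm) = {example_posterior[0]:.3f}")
print(f"P(ok | alarm)     = {example_posterior[1]:.3f}")
```

The same two steps (multiply, then divide by the sum) apply to the pingo numbers.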

In [None]:
# complete the expression below
P_X_given_Y = ...
print(f"P(source | >2 ppm) = {P_X_given_Y}")

In [None]:
fig, ax = plt.subplots(1,2, figsize=(6,3))

categories = ['source', 'not source']

ax[0].bar(categories, [P_X, P_notX])
ax[0].set_title('Prior')
ax[0].set_ylabel('P(x)')
ax[0].set_xlabel('x')
ax[0].set_ylim([0,1])

ax[1].bar(categories, [P_X_given_Y, 1-P_X_given_Y])
ax[1].set_title('Posterior')
ax[1].set_ylabel('P(x)')
ax[1].set_xlabel('x')
ax[1].set_ylim([0,1])

plt.tight_layout()

# Multiple observations

Suppose we obtain one more observation, such that we have $y_1$ and $y_2$. 

Bayes' rule for two observations:

$$ P(x | y_1 \cap y_2 ) = \frac{P(y_1 \cap y_2 | x) \, P(x)}{ P(y_1 \cap y_2) } $$

We assume that the observations are conditionally independent given state $x$, such that:

$$ P(y_1 \cap y_2 | x ) = P(y_1 | x ) \, P(y_2 | x ) $$

We can then write:

$$ P(x | y_1 \cap y_2 ) = \frac{P(y_1 | x ) \, P(y_2 | x ) \, P(x )}{ \sum_i P(y_1|x_i) \, P(y_2|x_i) \, P(x_i) } $$
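
The expression above can be sketched in code for a hypothetical problem: deciding whether a coin is fair or biased after seeing heads and then tails. The hypotheses, the likelihood values and the variable names are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical set-up, not the pingo numbers: states are [fair, biased].
example_prior = np.array([0.5, 0.5])

# Assumed likelihood of each observation under each state.
lik_y1 = np.array([0.5, 0.8])   # y1 = heads
lik_y2 = np.array([0.5, 0.2])   # y2 = tails

# Conditional independence: P(y1 and y2 | x) = P(y1 | x) P(y2 | x).
unnormalized = lik_y1 * lik_y2 * example_prior

# The denominator is the sum of the numerator over all states.
batch_example = unnormalized / unnormalized.sum()
print(f"P(fair | y1, y2)   = {batch_example[0]:.3f}")
print(f"P(biased | y1, y2) = {batch_example[1]:.3f}")
```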

Suppose that our second measurement gave 1.9 ppm. What is the batch posterior, that is, the posterior obtained by using both observations at once?

In [None]:
# for our first observation of 2.2 ppm
P_Y1_given_X = 0.7
P_Y1_given_notX = 0.2

# for our second observation of 1.9 ppm
# fill in the probabilities
P_Y2_given_X = ...
P_Y2_given_notX = ...

# complete the expression
batch_posterior = ... 

print(f"batch posterior = {batch_posterior}")

In [None]:
fig, ax = plt.subplots(1,2, figsize=(6,3))

categories = ['source', 'not source']

ax[0].bar(categories, [P_X, P_notX])
ax[0].set_title('Prior')
ax[0].set_ylabel('P(x)')
ax[0].set_xlabel('x')
ax[0].set_ylim([0,1])

ax[1].bar(categories, [batch_posterior, 1-batch_posterior])
ax[1].set_title('Batch posterior')
ax[1].set_ylabel('P(x)')
ax[1].set_xlabel('x')
ax[1].set_ylim([0,1])

plt.tight_layout()

### Sequential updating

Note that we can also update our belief sequentially. After applying Bayes' rule to the first observation, we use the resulting posterior as the prior when we apply Bayes' rule to the second observation. 

1) Apply Bayes' rule with $y_1$, as follows: $ P(x | y_1 ) = \frac{P(y_1 | x) \, P(x)}{ P(y_1) } $.

2) The posterior $ P(x | y_1 ) $ becomes the new prior $P (x)$.

3) Apply Bayes' rule with $y_2$.

Does this give the same final posterior?
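
The three steps can be sketched as a loop in which the belief is overwritten after every observation. This is a minimal illustration for a hypothetical fair-or-biased coin, not the pingo numbers; `bayes_update` and the likelihood values are assumptions.

```python
import numpy as np

def bayes_update(belief, likelihood):
    """One application of Bayes' rule over discrete states (hypothetical helper)."""
    unnormalized = likelihood * belief
    return unnormalized / unnormalized.sum()

# States are [fair, biased]; start from an assumed uniform prior.
belief = np.array([0.5, 0.5])

# Assumed likelihoods for three observations: heads, heads, tails.
likelihoods = [
    np.array([0.5, 0.8]),
    np.array([0.5, 0.8]),
    np.array([0.5, 0.2]),
]

for step, lik in enumerate(likelihoods, start=1):
    # The posterior of this step becomes the prior of the next step.
    belief = bayes_update(belief, lik)
    print(f"after {step} obs: P(fair) = {belief[0]:.3f}, P(biased) = {belief[1]:.3f}")
```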

In [None]:
# for the first observation
prior = P_X
# complete the expression
posterior_1obs = P_Y1_given_X * prior / ...

# for the second observation, using the posterior of the first iteration as prior
prior = posterior_1obs
# complete the expression
posterior_2obs = ...

print(f"final posterior = {posterior_2obs}")

In [None]:
fig, ax = plt.subplots(1,3, figsize=(9,3))

categories = ['source', 'not source']

ax[0].bar(categories, [P_X, P_notX])
ax[0].set_title('Prior')
ax[0].set_ylabel('P(x)')
ax[0].set_xlabel('x')
ax[0].set_ylim([0,1])

ax[1].bar(categories, [posterior_1obs, 1-posterior_1obs])
ax[1].set_title('Posterior after 1 obs')
ax[1].set_ylabel('P(x)')
ax[1].set_xlabel('x')
ax[1].set_ylim([0,1])

ax[2].bar(categories, [posterior_2obs, 1-posterior_2obs])
ax[2].set_title('Posterior after 2 obs')
ax[2].set_ylabel('P(x)')
ax[2].set_xlabel('x')
ax[2].set_ylim([0,1])

plt.tight_layout()

### Order of observations

Does it matter in which order we process the observations? Do we get the same answer if we first update with $y_2$ and then with $y_1$?

In [None]:
# complete the expressions

prior = P_X
posterior_1obs = P_Y2_given_X * prior / ...

prior = posterior_1obs
posterior_2obs = ...

print(f"final posterior = {posterior_2obs}")

fig, ax = plt.subplots(1,3, figsize=(9,3))

categories = ['source', 'not source']

ax[0].bar(categories, [P_X, P_notX])
ax[0].set_title('Prior')
ax[0].set_ylabel('P(x)')
ax[0].set_xlabel('x')
ax[0].set_ylim([0,1])

ax[1].bar(categories, [posterior_1obs, 1-posterior_1obs])
ax[1].set_title('Posterior after 1 obs')
ax[1].set_ylabel('P(x)')
ax[1].set_xlabel('x')
ax[1].set_ylim([0,1])

ax[2].bar(categories, [posterior_2obs, 1-posterior_2obs])
ax[2].set_title('Posterior after 2 obs')
ax[2].set_ylabel('P(x)')
ax[2].set_xlabel('x')
ax[2].set_ylim([0,1])

plt.tight_layout()

# Denominator as normalization constant

Note that the denominator in Bayes' rule acts as a normalizing constant. It is used to scale the posterior. 

Bayes' theorem is therefore often given by:

$$ P(x | y ) \propto P (y | x) \, P(x) $$

This can help us with coding.

In [None]:
# for example, to compute the entire distribution:

prior = np.array([P_X, P_notX])
likelihood = np.array([P_Y_given_X, P_Y_given_notX])
posterior = prior * likelihood
posterior /= np.sum(posterior)
print(f"posterior distribution = {posterior}")

# Another dilemma

In the previous exercise, the state could take two possible values, represented by a discrete probability distribution over these two outcomes. We will now look at a more complex dilemma where the probability mass is distributed across five possible values. 

Our dilemma:

A methane source emits particles, and our goal is to estimate the number of particles emitted by the source per unit time. We know that the emission rate can be 1, 2, 3, 4 or 5 particles per unit time. Our belief regarding the emission rate is represented by a probability distribution over these five values.

We observe the methane concentration downwind from the methane source. The observation is given as the detected number of particles within a fixed sampling time.

For simplicity, we use a straightforward, entirely made-up, observational model. The expected number of particles detected within the sampling time is equal to the number of particles emitted by the source per unit time. However, our observations are noisy. We model this using a Poisson distribution for the probability mass function of the detected number of particles within the sampling time:
$$ P(k|\lambda) = \frac{\lambda^k}{k!} \exp(- \lambda) $$
where:
- $k$ is the number of occurrences. In this context, it is the number of particles detected during the sampling time: $k = y$.
- $\lambda$ is the expected value of $k$. Here, it is the expected number of particles detected during the sampling time. Since we have assumed for simplicity that this equals the number of particles emitted by the source per unit time, $\lambda = x$. (In a more complex exercise, $\lambda$ could follow from an atmospheric dispersion model.) 

The likelihood $P(y|x)$ under this observational model is then given by the Poisson distribution with $k = y$ and $\lambda = x$.
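
As a minimal sketch of this observational model, the snippet below evaluates the Poisson likelihood of one hypothetical observation for each of the five possible emission rates. The helper name `poisson_pmf` and the example observation of 3 detected particles are assumptions made for illustration.

```python
import math
import numpy as np

def poisson_pmf(k, lam):
    """Poisson probability of k occurrences when the expected value is lam."""
    return lam**k * np.exp(-lam) / math.factorial(k)

# The five possible emission rates (states x).
states = np.arange(1, 6)

# Hypothetical observation: 3 particles detected in the sampling time.
y_example = 3

# Likelihood P(y | x) evaluated for every state x.
example_likelihood = np.array([poisson_pmf(y_example, x) for x in states])

for x, lik in zip(states, example_likelihood):
    print(f"P(y = {y_example} | x = {x}) = {lik:.4f}")
```

Note that the likelihood is read here as a function of $x$ for a fixed $y$, so these five values need not sum to 1. They only become a probability distribution over $x$ after multiplying by the prior and normalizing.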

Consider different potential measurements and a prior belief (this does not need to be a uniform distribution). What is the resulting posterior belief? Plot the prior and posterior beliefs.

In [16]:
def get_likelihood(y, x):
    """
    The probability of observing y given that x is true.
    y: observation [number of detected particles in sampling time]
    x: state [number of particles emitted per unit time]
    """

    # factorial of y
    fact = 1  
    for i in range(2, y+1):
        fact *= i

    # Poisson distribution
    return (x**y * np.exp(-x)) / fact

In [None]:
# define some observations, a prior, and compute the posterior. 
# you can use the set-up below for this (for sequential Bayesian inference), 
# feel free to adjust it, or do it in a completely different way!

# define the observations and prior as numpy arrays
observations = np.array([ ... ])
prior = ...

fig, ax = plt.subplots(layout='constrained')

x = np.arange(5)
width = 0.1  # the width of the bars
multiplier = 1

rects = ax.bar(x, prior, width, label='prior')

for i, obs in enumerate(observations):

    # compute the posterior 
    ...
    
    offset = width * multiplier
    rects = ax.bar(x+offset, posterior, width, label=f'after {i+1} obs')
    multiplier += 1

ax.set_ylabel('P(x)')
ax.set_xlabel('x')
ax.set_xticks(x + width, ['1', '2', '3', '4', '5'])
ax.set_ylim([0,1])
ax.legend()

# Cromwell's rule

Cromwell's rule (Lindley, 2013) states that:

If $P(x) = 0$ then $P(x|y)=0$. If $P(x)=1$ and $P(y)>0$, then $P(x|y) = 1$.

In words, Cromwell's rule says that if the prior probability assigned to a state is 0 or 1, then according to Bayes' theorem, the posterior probability is forced to be 0 or 1 as well. No evidence, no matter how strong, can have any influence. So, hard convictions are insensitive to counter-evidence. 

You can test this rule for our dilemma above.
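
As a minimal sketch with made-up numbers (three hypothetical states, not the emission-rate exercise), the snippet below gives one state a prior of exactly 0 and an observation that strongly favours that state.

```python
import numpy as np

# Hypothetical prior over three states; the first state is ruled out a priori.
example_prior = np.array([0.0, 0.4, 0.6])

# Assumed likelihoods: the observation strongly favours the first state.
example_likelihood = np.array([0.99, 0.01, 0.01])

# Bayes' rule: multiply, then normalize.
unnormalized = example_likelihood * example_prior
example_posterior = unnormalized / unnormalized.sum()

print(f"prior     = {example_prior}")
print(f"posterior = {example_posterior}")
```

Multiplying by a prior of 0 removes the state from the numerator, so the likelihood of 0.99 has no effect on it.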

#### Some links to YouTube videos about Bayes' rule:

https://www.youtube.com/watch?v=5NMxiOGL39M&t=22s

https://www.youtube.com/watch?v=HZGCoVF3YvM&t=261s


