# Bayesian statistics
### Example via Jack Bennetto (thanks Jack)

In [None]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

## Success Criteria
I will feel successful today if I can...

 * Describe the difference between Frequentist and Bayesian statistics
 * Identify the components of Bayes' theorem
 * Use Bayes' theorem to calculate posterior probabilities 


### Bayes' theorem

This theorem is usually written using the variables $A$ and $B$:

$$ P(A|B) = \frac{P(B|A) P(A)}{P(B)} $$

Each term has a name.

* $P(A)$ is the *prior probability*
* $P(B|A)$ is the *likelihood*.
* $P(B)$ is the *marginal likelihood*, sometimes called the *normalization constant*.
* $P(A|B)$ is the *posterior probability*.




Suppose we're considering some hypothesis $H$ and we've collected some data $\mathbf{X}$. Bayes' theorem then reads:
$$ P(H|\mathbf{X}) = \frac{P(\mathbf{X}|H) P(H)}{P(\mathbf{X})} $$



If there are a bunch of hypotheses $H_1, H_2, \ldots, H_n$, we could write this as

$$\begin{align}
P(H_i|\mathbf{X}) & = \frac{P(\mathbf{X}|H_i) P(H_i)}{P(\mathbf{X})}\\
         & = \frac{P(\mathbf{X}|H_i) P(H_i)}{\sum_{j=1}^{n} P(\mathbf{X}|H_j) P(H_j)}
\end{align}
$$

Here we see the normalizing constant is the likelihood times the prior summed over all possible hypotheses (using the law of total probability).

In other words, it's the constant (independent of hypothesis) that every numerator is divided by so that the posterior probabilities add up to one.
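
As a minimal sketch of that normalization, the snippet below applies Bayes' theorem to three hypotheses. The priors and likelihoods are made-up illustrative numbers, and `posterior_from` is a hypothetical helper name that is not used elsewhere in this notebook.

```python
def posterior_from(priors, likelihoods):
    """Return P(H_i | X) for each hypothesis, given P(H_i) and P(X | H_i)."""
    # Numerators of Bayes' theorem: likelihood times prior, one per hypothesis.
    numerators = [lik * pri for lik, pri in zip(likelihoods, priors)]
    # Law of total probability: P(X) is the sum of those numerators.
    marginal = sum(numerators)
    return [num / marginal for num in numerators]

# Illustrative (made-up) values for three hypotheses.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.10, 0.40, 0.25]

posteriors = posterior_from(priors, likelihoods)
print(posteriors)
print(sum(posteriors))
```

Note that the likelihoods themselves do not need to sum to one; only the posteriors do, and the division by the marginal is what guarantees it.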

Let's run through an example.

### Bayesian statistics to find a mean
Let's assume you have a bunch of points drawn from a normal distribution. To make things easy, let's say you happen to know that the standard deviation is 3, and the mean $\mu \in \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$. We're going to determine the probability that each of those is the correct mean, given the data.

Humans are pretty bad at choosing random numbers, so someone will need to run this.

In [None]:
np.random.seed(42)
mu = stats.randint(0,10).rvs()
sd = 3

# Now you need to choose a number from the distribution. What is it?
first_X = stats.norm(mu, sd).rvs()
first_X

Great! Using that number, I'm going to compute the likelihood of observing it under each of the possible hypotheses, by evaluating the **pdf** of each hypothesized distribution at that value.

In [None]:
datum = first_X
likelihoods = []
for i in range(0,10):
    likelihoods.append(stats.norm(i, sd).pdf(datum))
    print("The likelihood of N({0}, {1}) generating {2:5.2f} is {3:6.4f}"
           .format(i, sd, datum, likelihoods[i]))

In [None]:
fig, ax = plt.subplots()
ax.bar(range(10), likelihoods)
ax.bar(mu, likelihoods[mu], color = 'r', label = r'True $\mu$')
ax.set_xlabel('hypothesized mean')
ax.set_ylabel('likelihood')
ax.legend();

**Question:** Do these bars all add to one?

Which of these hypotheses has the maximum likelihood of producing the data?

If we were Frequentists, we'd go with that, and then construct a confidence interval: a range built by a procedure that, if we repeated the sampling many times, would include the actual value in a certain fraction (maybe 95%) of those repetitions.
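
For comparison, here is a rough sketch of that Frequentist route. It assumes the mean is treated as a continuous unknown (unlike the ten discrete hypotheses above), that the standard deviation of 3 is known, and it uses simulated data with an assumed true mean of 4; none of these values come from the cells above.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(0)
true_mu, sd, n = 4, 3, 25            # assumed values, for illustration only
sample = [random.gauss(true_mu, sd) for _ in range(n)]

# For a normal distribution with known sd, the sample mean is the
# maximum likelihood estimate of an unrestricted mean.
x_bar = mean(sample)

# Two-sided 95% interval: x_bar +/- z * sd / sqrt(n), with z about 1.96.
z = NormalDist().inv_cdf(0.975)
half_width = z * sd / math.sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)

print(f"estimate: {x_bar:.3f}")
print(f"95% confidence interval: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval is a statement about the procedure, not a probability assigned to any particular hypothesis, which is the contrast with the Bayesian approach that follows.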

But today we're all going to be Bayesians, which means we're going to assign probabilities of each hypothesis being true.

The tough part of being a Bayesian is that we need to start out with prior probabilities. For this, we'll assume that all the hypotheses are equally probable. You chose the mean that way using the computer, so that works out, but if you'd picked a number from your head, and you liked some numbers more than others, a uniform prior might not be the best choice.

The arbitrary choice of priors is probably **the largest criticism** of Bayesian statistics. But if you have enough data it doesn't matter that much.

Don't forget: here is our Bayesian formula.


$$\begin{align}
P(H_i|\mathbf{X}) & = \frac{P(\mathbf{X}|H_i) P(H_i)}{P(\mathbf{X})}\\
         & = \frac{P(\mathbf{X}|H_i) P(H_i)}{\sum_{j=1}^{n} P(\mathbf{X}|H_j) P(H_j)}
\end{align}
$$

In [None]:
probs = np.ones(10)/10
probs

Now we need to multiply each of these by the likelihood...

In [None]:
for i in range(10):
    probs[i] *= stats.norm(i, sd).pdf(datum)

...and then normalize them by dividing each by the sum:

In [None]:
probs /= probs.sum()

So again, what we've done is multiply each prior probability by the likelihood of that hypothesis generating the observed data, and divide all of these by the normalizing constant, to get the posterior probabilities.
$$\begin{align}
P(H_i|\mathbf{X}) & = \frac{P(\mathbf{X}|H_i) P(H_i)}{P(\mathbf{X})}\\
         & = \frac{P(\mathbf{X}|H_i) P(H_i)}{\sum_{j=1}^{n} P(\mathbf{X}|H_j) P(H_j)}
\end{align}
$$
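
The same prior-times-likelihood-then-normalize step can be written without a loop. This is only a sketch: the observation `5.2` is a made-up value standing in for whatever number was drawn, and the array names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

sd = 3
datum = 5.2                       # made-up observation, for illustration only
means = np.arange(10)             # hypothesized means 0..9

prior = np.ones(10) / 10          # uniform prior over the ten hypotheses
# Evaluate the pdf of N(mean, sd) at the datum for every hypothesis at once.
likelihood = stats.norm(means, sd).pdf(datum)

posterior = prior * likelihood    # numerators of Bayes' theorem
posterior /= posterior.sum()      # divide by the normalizing constant

print(posterior.round(4))
print("most probable mean:", means[posterior.argmax()])
```

With a uniform prior, the most probable hypothesis is simply the one with the highest likelihood, so the ranking matches the likelihood bar chart; only the scale changes.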



Let's see what we got for the **probabilities**.

In [None]:
for i in range(0,10):
    print("The probability of N({0}, {1}) being correct is {2:6.4f}"
           .format(i, sd, probs[i]))

fig, ax = plt.subplots()
ax.bar(range(10), probs)
ax.bar(mu, probs[mu], color = 'r', label = r'True $\mu$')
ax.set_xlabel('hypothesized mean')
ax.set_ylabel('posterior probability')
ax.legend();

**Question:** Do these add up to one?

Okay, that was great, but maybe we should get some more data. Generate another number!

In [None]:
another_X = stats.norm(mu, sd).rvs()
another_X

Now we
 * calculate the likelihoods,
 * multiply these **by our old posterior probabilities** (which are the new priors),
 * normalize (divide by the sum of the prior times likelihood, so they add to one), and
 * look at the output.

In [None]:
datum = another_X

# calculate the likelihoods and multiply them by our old posterior
# probabilities (which are the new priors)
for i in range(10):
    probs[i] *= stats.norm(i, sd).pdf(datum)
# normalize (divide by the sum of prior times likelihood, so they add to one)
probs /= probs.sum()

# look at the output
for i in range(0,10):
    print("The probability of N({0}, {1}) being correct is {2:10.8f}"
           .format(i, sd, probs[i]))
fig, ax = plt.subplots()
ax.bar(range(10), probs)
ax.bar(mu, probs[mu], color = 'r', label = r'True $\mu$')
ax.set_xlabel('hypothesized mean')
ax.set_ylabel('posterior probability')
ax.legend();

In [None]:
# and again... our "posteriors" replace our priors and we update our situation over and over and over...
for _ in range(10):
    datum = stats.norm(mu, sd).rvs()

    # calculate the likelihoods and multiply them by our old posterior
    # probabilities (which are the new priors)
    for i in range(10):
        probs[i] *= stats.norm(i, sd).pdf(datum)
    # normalize (divide by the sum of prior times likelihood, so they add to one)
    probs /= probs.sum()

fig, ax = plt.subplots()
ax.bar(range(10), probs)
ax.bar(mu, probs[mu], color = 'r', label = r'True $\mu$')
ax.set_xlabel('hypothesized mean')
ax.set_ylabel('posterior probability')
ax.legend();

We're doing this iteratively, repeatedly getting another data point and updating our prior, but we could have done this all at once, calculating the likelihood of seeing the whole dataset.
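
To illustrate that claim, the sketch below compares the one-at-a-time update with a single batch update on the same data. The data are simulated with an assumed true mean of 4, and the batch version works with log-likelihoods, a choice made here to avoid multiplying many small numbers together.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sd = 3
means = np.arange(10)                       # hypothesized means 0..9
data = rng.normal(4, sd, size=12)           # simulated data; true mean 4 is an assumption

# Sequential: each posterior becomes the prior for the next observation.
seq = np.ones(10) / 10
for x in data:
    seq *= stats.norm(means, sd).pdf(x)
    seq /= seq.sum()

# Batch: add the log-likelihoods of all observations, then normalize once.
log_prior = np.log(np.ones(10) / 10)
log_lik = stats.norm(means[:, None], sd).logpdf(data).sum(axis=1)
log_post = log_prior + log_lik
batch = np.exp(log_post - log_post.max())   # subtract the max for numerical stability
batch /= batch.sum()

print(np.allclose(seq, batch))
```

The two agree because the observations are independent, so the likelihood of the whole dataset is the product of the individual likelihoods, and the intermediate normalizations only rescale by constants.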

In a real problem we'd have *many* more possible hypotheses. In the case above, we might not know that the mean came from a discrete set of values, so we'd need to consider every possible value. And we probably wouldn't know the standard deviation, so we'd need to consider every combination of a mean and a standard deviation. We could follow the same approach, calculating the likelihood of seeing our data for each possible hypothesis and updating the posterior probabilities. Later we'll talk about how to solve this practically...
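
One brute-force way to picture "every combination of a mean and standard deviation" is a grid of candidate pairs. The following is a sketch under stated assumptions: the grid ranges and resolution are arbitrary choices, the prior is flat over the grid, and the data are simulated with an assumed mean of 4 and standard deviation of 3.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(4, 3, size=50)            # simulated data (assumed mean 4, sd 3)

# Hypothetical grid of candidate (mean, sd) pairs.
mu_grid = np.linspace(0, 9, 91)
sd_grid = np.linspace(0.5, 6, 56)
MU, SD = np.meshgrid(mu_grid, sd_grid, indexing="ij")

# Log-likelihood of the whole dataset under every (mean, sd) pair.
log_lik = stats.norm(MU[..., None], SD[..., None]).logpdf(data).sum(axis=-1)

# Flat prior over the grid, so the posterior is the normalized likelihood.
post = np.exp(log_lik - log_lik.max())
post /= post.sum()

i, j = np.unravel_index(post.argmax(), post.shape)
print("most probable grid point: mean", mu_grid[i], "sd", sd_grid[j])
```

The number of grid points grows multiplicatively with each extra parameter, which is why this direct approach stops being practical quickly.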


### A Tangible Example 

##### Example via RFT

#### Fair or Unfair Coin

Let's say we picked one of two coins.  One isn't fair (pHeads = 0.40) while the other one is (pHeads = 0.5).  After a certain number of flips, what is our degree of belief that the results came from each of the coins?

There are only two hypotheses available:

$H_i$ = **Fair** or **Unfair**

Bayes' rule for this example:


\begin{equation*}
p(H_i | Flips)   = \frac{p(Flips | H_i) \times p(H_i)}{p(Flips)}
\end{equation*}


At the start, there is an equal probability of picking either coin:

$$ p(Fair) = p(Unfair) = 0.5 $$


$p(Flips)$ is calculated using the Law of Total Probability: 

$$p(Flips) = p(Flips|Fair)\times p(Fair) + p(Flips|Unfair)\times p(Unfair) $$

If we have just one flip (a tails):

$$p(tails) = p(tails|Fair)\times p(Fair) + p(tails|Unfair)\times p(Unfair)$$

$$p(tails) = 0.5\times 0.5 + 0.6\times 0.5 = 0.55$$

If we have just one flip (a heads):

$$p(heads) = p(heads|Fair)\times p(Fair) + p(heads|Unfair)\times p(Unfair)$$

$$p(heads) = 0.5\times 0.5 + 0.4\times 0.5 = 0.45$$

With zero flips there is no data to condition on, so we take $p(Flips) = 1$ by default.
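
The single-flip arithmetic above can be checked with a few lines of plain Python. This is a small sketch; `P_HEADS` and `update` are hypothetical names introduced here, separate from the helper functions defined later in the notebook.

```python
# Probability of heads under each hypothesis, taken from the text above.
P_HEADS = {"fair": 0.5, "unfair": 0.4}

def update(prior, flip):
    """Apply Bayes' rule for one flip (1 = heads, 0 = tails).

    Returns the posterior dict and the marginal p(flip)."""
    likelihood = {h: (p if flip == 1 else 1 - p) for h, p in P_HEADS.items()}
    # Law of total probability over the two hypotheses.
    marginal = sum(likelihood[h] * prior[h] for h in prior)
    posterior = {h: likelihood[h] * prior[h] / marginal for h in prior}
    return posterior, marginal

prior = {"fair": 0.5, "unfair": 0.5}
post_tails, p_tails = update(prior, 0)
post_heads, p_heads = update(prior, 1)

print(f"p(tails) = {p_tails:.2f}, p(Fair | tails) = {post_tails['fair']:.4f}")
print(f"p(heads) = {p_heads:.2f}, p(Fair | heads) = {post_heads['fair']:.4f}")
```

A single tails moves the belief in the fair coin from 0.5 down to about 0.45, and a single heads moves it up to about 0.56, so one flip carries very little evidence.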


We'll keep track of the probabilities in a list:  
$$ [p(Fair), p(Unfair)] $$ 
<br>


In [None]:
np.random.seed(37) # try to make it so that we all get the same result

# make the coins and select one
p_fair = 0.5 # don't change

p_not_fair = 0.4 

# Pick one of these two values to be the p of the coin we chose
p = np.random.choice([p_fair, p_not_fair])
p

In [None]:
def indicate_coin_picked(p):
    if p == .5:
        print("It's the fair coin (p = 0.5).")
    else:
        print(f"It's the unfair coin (p = {p}).")

### Can we figure out, from the flips, whether it's a fair coin or not?  And if so, how soon can we know it?

In [None]:
# helper function 1

def flip_the_coin(p, flips_lst):
    '''Flips the coin with probability of success p
       and appends to the flips_lst'''
    # np.random.random() is uniform on [0, 1), so "< p" is true with probability p
    flips_lst.append(int(np.random.random() < p))

In [None]:
# test it out - initialize flips_lst
flips_lst = []

In [None]:
# do some flips (trials) - keep executing this cell
flip_the_coin(p, flips_lst)
print(flips_lst, round(np.mean(flips_lst),3))

Now we
 * calculate the likelihoods,


In [None]:
# Calculate our likelihoods

def calculate_likelihood(flips_lst):
    '''Likelihood of the most recent flip in flips_lst given the fair and
       the unfair coin (earlier flips are already reflected in the prior)'''
    likelihood_fair = stats.bernoulli.pmf(flips_lst[-1], p_fair)
    likelihood_not_fair = stats.bernoulli.pmf(flips_lst[-1], p_not_fair)
    return [likelihood_fair, likelihood_not_fair]

In [None]:
# Let's double check that function
likelihoods = calculate_likelihood(flips_lst)
print(np.around(likelihoods,3))
print("\nLikelihood fair: {0:0.3f}".format(likelihoods[0]))
print("Likelihood not fair: {0:0.3f}".format(likelihoods[1]))
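
The function above scores only the latest flip, which is all a sequential update needs. As an alternative sketch, the posterior can also be computed in one step from the head and tail counts. `posterior_fair` is a hypothetical helper, its default arguments mirror the coin probabilities used in this example, and it works in log space so that long runs of flips do not underflow.

```python
import math

def posterior_fair(n_heads, n_tails, p_fair=0.5, p_not_fair=0.4, prior_fair=0.5):
    """P(fair | flips) from head/tail counts, computed in log space."""
    log_fair = (math.log(prior_fair)
                + n_heads * math.log(p_fair)
                + n_tails * math.log(1 - p_fair))
    log_unfair = (math.log(1 - prior_fair)
                  + n_heads * math.log(p_not_fair)
                  + n_tails * math.log(1 - p_not_fair))
    # Factor out the larger term before exponentiating to avoid underflow.
    m = max(log_fair, log_unfair)
    w_fair = math.exp(log_fair - m)
    w_unfair = math.exp(log_unfair - m)
    return w_fair / (w_fair + w_unfair)

print(posterior_fair(0, 1))         # a single tails
print(posterior_fair(1000, 1000))   # hypothetical run: 2000 flips, half heads
```

Only the counts matter here, not the order of the flips, because each flip is an independent Bernoulli trial.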


\begin{equation*}
p(H_i | Flips)   = \frac{p(Flips | H_i) \times p(H_i)}{p(Flips)}
\end{equation*}


\begin{equation*}
posterior   = \frac{likelihood \times prior}{marginal}
\end{equation*}

*Remember: the marginal likelihood is also known as the normalization constant.*

 * multiply these **by our old posterior probabilities** (which are the new priors),
 * normalize (divide by the sum of the prior times likelihood, so they add to one), and


In [None]:
# helper function 3
marginal = 1  # placeholder: p(Flips) is the same for both hypotheses, so the normalization below takes its place

def calculate_posterior(likelihoods_lst, prior_lst):
    '''Calculates the posterior given the likelihoods and prior
    '''
    posterior_unnormalized = []
    for likelihood, prior in zip(likelihoods_lst, prior_lst):
        # multiply each likelihood by its prior, i.e. the old posterior (the marginal is a constant 1 for now)
        posterior_unnormalized.append(likelihood * prior/ marginal)
        
    # normalize so that the total probability in posterior is 1
    posterior_un_total = sum(posterior_unnormalized)
    posterior_lst = []
    for posterior in posterior_unnormalized:
        posterior_lst.append(posterior/posterior_un_total)
    return posterior_lst

 * look at the output.

In [None]:
# helper function - plot the probability of the fair coin with increased flips

def plot_pfair_prob(num_flips, p_fair_arr, data):
    fig, axs = plt.subplots(2,1,figsize=(10,10))
    flip_num = np.arange(1, num_flips + 1)
    # jitter the 0/1 outcomes slightly so overlapping points stay visible
    axs[0].scatter(flip_num, data + np.random.random(len(data))/10, marker='.')
    axs[0].plot(flip_num, np.cumsum(data)/flip_num)
    axs[0].axhline(.5, color = 'green', linestyle = '--')
    axs[0].axhline(p, color = 'green', linestyle = '--')  # p is the global chosen earlier
    axs[0].set_title('Actual Flips and Running Average')
    axs[1].plot(flip_num, p_fair_arr)
    axs[1].set_ylim([-0.1, 1.1])
    axs[1].set_title('Probability of fair coin as a function of flip number')
    axs[1].set_ylabel('Probability p_fair')
    axs[1].set_xlabel('Flip number')
    plt.tight_layout()
    plt.show()

In [None]:
# let's do a simulation
# np.random.seed(2) # try 2 and 3

indicate_coin_picked(p)

# initialize
priors = [0.5, 0.5]
flips_lst = []

# set the number of flips
num_flips = 2000
p_fair_arr = np.zeros(num_flips)

for i in range(num_flips):
    flip_the_coin(p, flips_lst)
    likelihoods = calculate_likelihood(flips_lst)
    posteriors = calculate_posterior(likelihoods, priors)
    p_fair_arr[i] = posteriors[0]
    priors = posteriors

print("\nPosteriors after {0} trials".format(num_flips))
print("Probability Fair {0:0.3f}, Not fair {1:0.3f}".format(posteriors[0], posteriors[1]))

plot_pfair_prob(num_flips, p_fair_arr, flips_lst)

### The posterior seems very sensitive to each update. Let's add a tuning dial (learning rate) that controls how much each update can move the posteriors.

In [None]:
marginal = 1  # placeholder, p(Flips) is same for both fair and unfair 

def normalize(lst):
    total = sum(lst)
    return [val/total for val in lst]

def calculate_posterior_with_learning_rate(likelihoods_lst, prior_lst, learning_rate):
    '''Calculates the posterior given the likelihoods and prior, then
       blends it with the prior according to learning_rate'''
    posterior_unnormalized = []
    for likelihood, prior in zip(likelihoods_lst, prior_lst):
        posterior_unnormalized.append(likelihood * prior / marginal)
    
    # now need to normalize so that the total probability in posterior is 1
    posterior_lst = normalize(posterior_unnormalized)
    
    # blend the new posterior with the old prior, weighted by the learning rate
    posterior_weighted_unnorm = []
    for posterior, prior in zip(posterior_lst, prior_lst):
        posterior_weighted_unnorm.append(learning_rate * posterior + 
                                         (1 - learning_rate) * prior)
    posterior_weighted = normalize(posterior_weighted_unnorm)
    return posterior_weighted
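
To see what the dial does, the sketch below rewrites the blend as `prior + learning_rate * (posterior - prior)`, which is algebraically the same weighting, and tries a few rates on a single tails. `bayes_update` and `blended_update` are hypothetical stand-ins written for this illustration. Note that the blended value is a heuristic rather than a strict application of Bayes' theorem, except when the rate is exactly 1.

```python
def bayes_update(likelihoods, priors):
    """Plain Bayes update: multiply by the likelihoods, then normalize."""
    unnorm = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def blended_update(likelihoods, priors, learning_rate):
    """Move from the prior toward the Bayes posterior by a fraction learning_rate."""
    post = bayes_update(likelihoods, priors)
    return [pri + learning_rate * (po - pri) for po, pri in zip(post, priors)]

priors = [0.5, 0.5]                 # [p(Fair), p(Unfair)]
tails_likelihoods = [0.5, 0.6]      # p(tails | Fair), p(tails | Unfair)

for lr in (0.0, 0.5, 1.0, 1.5):
    print(lr, blended_update(tails_likelihoods, priors, lr))
```

A rate of 0 ignores the data, a rate of 1 reproduces the ordinary update, rates between 0 and 1 dampen each step, and rates above 1 overshoot the Bayes posterior.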

In [None]:
indicate_coin_picked(p)

# initialize

priors = [0.5, 0.5]

# 1 reproduces the plain Bayes update; below 1 dampens each update, above 1 amplifies it
learning_rate = 1.5

flips_lst = []

# set the number of flips
num_flips = 2000
p_fair_arr = np.zeros(num_flips)


for i in range(num_flips):
    flip_the_coin(p, flips_lst)
    likelihoods = calculate_likelihood(flips_lst)
    posteriors = calculate_posterior_with_learning_rate(likelihoods, priors, learning_rate)
    p_fair_arr[i] = posteriors[0]
    priors = posteriors


print("\nPosteriors after {0} trials".format(num_flips))
print("Probability Fair {0:0.3f}, Not fair {1:0.3f}".format(posteriors[0], posteriors[1]))

plot_pfair_prob(num_flips, p_fair_arr, flips_lst)