# Bayesian Statistics and Machine Learning (part one)

This notebook introduces Bayesian methods in basic statistical analysis, paving the way for understanding their use in deep learning, where they are an increasingly important part of the current state of the art (SOTA).

We will start with Bayes' Theorem:
<br><br>
$ \huge P(A|B) = \frac{P(B|A) ~ P(A)}{P(B)} $
<br><br>
Let's break this down:
* $ P(A|B) $ (the posterior): This is the thing we are interested in measuring. The "|" means "conditional on", so in this case we are asking what the probability of A is, given that B is true.
* $ P(B|A) $ (the likelihood): Formally, this is the probability of B given that A is true. It is typically what we can estimate from the data we have.
* $ P(A) $ (the prior): This is the probability of A in any circumstances (i.e. without any conditional). In practical terms this is our belief about the probability of A before looking at our data.
* $ P(B) $ (the evidence): This is the tricky bit! Essentially this is just a normalising denominator that ensures $ P(A|B) $ is a valid probability (bounded between zero and one). It covers every way in which B can occur, and the numerator $ P(B|A) ~ P(A) $ is one of those ways.
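
The evidence can be made concrete with the law of total probability. B can occur either together with A or together with not-A (written $ A' $), so:
<br><br>
$ \huge P(B) = P(B|A) ~ P(A) + P(B|A') ~ P(A') $
<br><br>
Here $ P(A') = 1 - P(A) $. The first term is exactly the numerator of Bayes' Theorem, which is why the posterior can never exceed one.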

You may be none the wiser at this point, so it is perhaps better to play out a scenario. Let's imagine that we want to calculate the probability that someone passes the DSML module given that they spent 40 hours or more on the PMA. In other words:
<br><br>
$ \huge P(pass|40hrs) = \frac{P(40hrs|pass) ~ P(pass)}{P(40hrs)} $
<br><br>
We can add in some data:

In [1]:
import pandas as pd

ids = [1,2,3,4,5,6,7,8,9,10]
passQ = [True, True, False, False, True, True, False, False, False, True]
fortyHrs = [True, True, True, False, True, True, False, False, True, False]

# build the frame from a dict so each column gets its name (and dtype) directly
df = pd.DataFrame({"id": ids, "pass?": passQ, "40hrs": fortyHrs})
df

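| | id | pass? | 40hrs |
|---|---|---|---|
| 0 | 1 | True | True |
| 1 | 2 | True | True |
| 2 | 3 | False | True |
| 3 | 4 | False | False |
| 4 | 5 | True | True |
| 5 | 6 | True | True |
| 6 | 7 | False | False |
| 7 | 8 | False | False |
| 8 | 9 | False | True |
| 9 | 10 | True | False |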


From here we can start to calculate some of our values:

In [2]:
# P(A) = P(pass)
prior = 5 / 10 # five passes out of 10 students

# P(B) = P(40hrs)
evidence = 6 / 10 # six out of 10 students spent 40 hours or more

print(f'Prior is {round(prior, 1)}')
print(f'Evidence is {round(evidence, 1)}')

Prior is 0.5
Evidence is 0.6


In terms of calculating the likelihood - $ P(B|A) $ - we need to consider the number of students for whom _40hrs_ and _pass?_ are both True, divided by the number of students who pass:

In [3]:
ba = 0      # students who passed AND did 40 hours or more
passed = 0  # students who passed

# iterrows() lets us loop over the rows of a DataFrame
for index, row in df.iterrows():
    if row['pass?']:
        passed += 1
        if row['40hrs']:
            ba += 1

likelihood = ba / passed  # P(40hrs|pass)
print(f'Likelihood is {round(likelihood, 1)}')

Likelihood is 0.8


In other words, if we know that a student passed there is an 80% chance they did 40 hours or more work.
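
The same three quantities can also be read straight off the table without an explicit loop. The following is a minimal sketch of one alternative, assuming the ten rows shown earlier; it rebuilds the DataFrame so that it runs on its own, and it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Same ten students as above, rebuilt here so the example is self-contained.
df = pd.DataFrame({
    "id": range(1, 11),
    "pass?": [True, True, False, False, True, True, False, False, False, True],
    "40hrs": [True, True, True, False, True, True, False, False, True, False],
})

# The mean of a boolean column is the fraction of True values.
prior = df["pass?"].mean()      # P(pass)
evidence = df["40hrs"].mean()   # P(40hrs)

# Keep only the students who passed, then take the fraction who did 40 hours.
likelihood = df.loc[df["pass?"], "40hrs"].mean()  # P(40hrs|pass)

print(f"Prior is {prior:.1f}")
print(f"Evidence is {evidence:.1f}")
print(f"Likelihood is {likelihood:.1f}")
```

The boolean mask `df["pass?"]` plays the role of the outer `if` in the loop, and the `.mean()` call replaces the two counters.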

Now we can put it all together!

In [4]:
posterior = (likelihood * prior) / evidence 
print(f'Posterior is {round(posterior, 1)}')

Posterior is 0.7


Awesome! However, this is very much a toy example, and we could actually have counted the posterior directly, in the same fashion as the likelihood:

In [5]:
ab = 0           # students who did 40 hours or more AND passed
forty_count = 0  # students who did 40 hours or more

# iterrows() lets us loop over the rows of a DataFrame
for index, row in df.iterrows():
    if row['40hrs']:
        forty_count += 1
        if row['pass?']:
            ab += 1

posterior = ab / forty_count  # P(pass|40hrs)
print(f'Posterior is {round(posterior, 1)}')

Posterior is 0.7
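
Because both print-outs are rounded to one decimal place, it is worth confirming that the two routes agree exactly rather than just approximately. Below is a minimal sketch using Python's standard `fractions` module, with the counts read off the ten-student table; the variable names are only illustrative.

```python
from fractions import Fraction

# Counts read off the ten-student table.
n_students = 10
n_pass = 5             # students who passed
n_forty = 6            # students who did 40 hours or more
n_pass_and_forty = 4   # students who did both

prior = Fraction(n_pass, n_students)             # P(pass)
evidence = Fraction(n_forty, n_students)         # P(40hrs)
likelihood = Fraction(n_pass_and_forty, n_pass)  # P(40hrs|pass)

via_bayes = likelihood * prior / evidence           # Bayes' Theorem
via_counting = Fraction(n_pass_and_forty, n_forty)  # direct count

print(via_bayes, via_counting, via_bayes == via_counting)
```

Both routes give exactly 2/3 (about 0.67), which is the value behind the rounded 0.7.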


So what's the point? Well, once we have the concepts we can start to think about calculating probability in a different way. Gather round for story time and we'll see this in action:

## The Tale of the Student, the Supervisor, the Statistician and the Bayesian

Our tale occurred several years ago here at WMG. A young student, let's call her Liping, had just completed her DSML PMA. She felt she had done quite well and had really committed herself. However, in the four PMAs before this one her results had been mixed, and she currently averaged a merit. Unfortunately she had been spending a lot of time on her food blog when she should have been writing her PMAs, and had been a bit slapdash with her referencing. She was in a real quandary as to whether she would be able to achieve a distinction this time: on the one hand it felt like it had gone well, on the other that wasn't the current pattern. Young Liping sought some advice.

First she turned to her dissertation supervisor, let's call him Mark. Mark asked Liping a bunch of questions about the process she had followed and asked her to give a score out of 10 for each. She thought about these questions and gave the following answers:

In [6]:
understood_question = 9 / 10
worked_hard = 10 / 10
good_references = 9 / 10
good_code = 8 / 10

improve = (understood_question + worked_hard + good_references + good_code) / 4
print(f'Mark said: "You have a {round(improve,1)} chance of a distinction"')

Mark said: "You have a 0.9 chance of a distinction"


Liping left the meeting feeling really good ... a 90% chance of distinction! However, she had a nagging doubt. Her record so far wasn't distinction level, and as a good data person she didn't feel comfortable just ignoring the data. To try and assuage her doubts she went to see a friendly, local (traditional) statistician. Let's call him Vagelis.

After Liping explained her problem, Vagelis did some research into past performance on the DSML module. After a bit of searching, Vagelis found the following datapoints:

In [7]:
students = 100
distinctions = 18
increased = 25
increased2dist = 2

print(f'Vagelis said "There were {students} students and {distinctions} distinctions.\n{increased} students improved to a new boundary, {increased2dist} of which improved to a distinction"')

Vagelis said "There were 100 students and 18 distinctions.
25 students improved to a new boundary, 2 of which improved to a distinction"


Vagelis got out his pocket calculator and quickly asserted the following:

In [8]:
print(f'The probability of you getting a distinction is {round(increased2dist / students, 2)}')

The probability of you getting a distinction is 0.02


Now Liping felt sad. But as much as she trusted Vagelis and his trusty pocket calculator she also felt that she was more than just her history and this PMA definitely felt different. More confused than ever she sought the advice of one more person ... a Bayesian statistician (let's call him Michael).

Michael asked her what his colleagues had told her, and Liping shared their advice. Michael thanked her and quickly wrote up a Python program to answer her question. To help illustrate our story the original program he wrote is reproduced below:

In [9]:
# Liping's PMA predictor. Author Michael Mortenson, 10/02/2018
prior = improve  # P(distinction): Liping's own belief, from Mark's questions
likelihood = increased2dist / distinctions  # P(improvement|distinction)

# probability of improvement in all cases:
# P(improvement|distinction) P(distinction) + P(improvement|not_distinction) P(not_distinction)
evidence = likelihood * prior + (increased - increased2dist) / (students - distinctions) * (1 - prior)

posterior = (likelihood * prior) / evidence
print(f'The probability you will get a distinction is {round(posterior, 2)}')

The probability you will get a distinction is 0.78


Liping was happy. She recognised that the Bayesian way allowed her to combine her personal expectations with the data (previous performances on the module) and give her a balance between them. And, dear reader, we do get a happy ending because she did indeed pass her PMA and the rest is history!

## Statistical Note
_Ultimately the problem Michael wanted to solve was: 
<br><br>
$ \huge P(distinction|improvement) = \frac{P(improvement|distinction) ~ P(distinction)}{P(improvement)} $
<br><br>
This feels a little less intuitive than our earlier example, so let's discuss it a bit more. The numerator is essentially, as the story tells us, a balance between the data (of the 18 students who achieved a distinction, 2 had a lower average beforehand) and her feeling about how this PMA had gone (a 90% chance). Defining the likelihood as $ P(improvement|distinction) $ asks: of the students who achieve a distinction, what fraction are ones for whom it is an improvement?_

_The denominator is also a bit different. As we know,_ $ P(B) $ _represents the probability of an improvement under all circumstances. That includes our numerator_ ($ P(improvement|distinction) ~ P(distinction) $) _but also the probability that a student improves without reaching a distinction (e.g. from fail to pass or from pass to merit). That second probability counts all improvements less those that reached a distinction, out of all results less those that achieved a distinction. We represent the extra term as $ P(B|A') ~ P(A') $ or $ P(improvement|not\_distinction) ~ P(not\_distinction) $, where $ P(not\_distinction) $ is $ 1 - P(A) $._
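
The expansion of the denominator described in this note can be wrapped in a small helper. This is a minimal sketch rather than Michael's original program: the function name `bayes_posterior` and its argument names are hypothetical, the counts are the ones Vagelis found, and the three priors in the loop are illustrative values chosen only to show how the answer moves with Liping's own belief.

```python
def bayes_posterior(prior, likelihood, likelihood_if_not):
    """Return P(A|B), with P(B) = P(B|A) P(A) + P(B|A') P(A')."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence


# Counts from the story.
students, distinctions = 100, 18
increased, increased2dist = 25, 2

# P(improvement|distinction)
likelihood = increased2dist / distinctions
# P(improvement|not_distinction)
likelihood_if_not = (increased - increased2dist) / (students - distinctions)

# Illustrative priors (assumed values, not taken from the story).
for prior in (0.2, 0.5, 0.8):
    posterior = bayes_posterior(prior, likelihood, likelihood_if_not)
    print(f"prior={prior:.1f} -> posterior={posterior:.2f}")
```

Keeping the two conditional probabilities as separate arguments makes it explicit that the data fixes them, while the prior is the only input that reflects personal belief.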

## The Moral of the Story 
This gives us one of the main advantages of a Bayesian approach over the alternative, traditional approach ... typically called "frequentist" by Bayesians as it is based on assigning probabilities purely on the frequency of events. In the Bayesian calculation we can include some prior belief about the probability of an event, which is then combined with the frequencies in the data.

Given that all datasets are incomplete and all datasets contain error, we can see why not relying solely on our historical data can be an attractive thing. Moreover, as William Bruce Cameron said:
>Not everything that can be counted counts, and not everything that counts can be counted.

Very often there is good insight that is only available as "soft data" rather than "hard". That is, the data doesn't come to us from a database or Excel spreadsheet but from the "beliefs" of subject matter experts.

However, Bayesian analyses can offer much more than this, as we'll see in part two, where we look at the application of these methods to distributions of data rather than single values.