# Probability Basics
### Axioms of Probability
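For events $E_i$ drawn from the set of all possible outcomes $S$ (each $E_i \subseteq S$):

 $$0 \leq P(E_i) \leq 1$$

 $$P(S) = 1, \text{ so } \sum_{i} P(E_i) = 1 \text{ when the } E_i \text{ are mutually exclusive and together cover } S$$

 $$P(\cup^n_{i=1} E_i) = \sum_{i=1}^n P(E_i) \text{ where all } E_i \text{ are mutually exclusive }$$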
### Match the distribution
 Distribution | Notation | Parameters | Alternative Parameters | Description | PDF/PMF 
 --- | --- | --- | --- | --- | ---
Poisson | | | | | 
Gamma | | | | | 
Binomial | | | | | 
Normal  |  $\mathcal{N}(\mu,\sigma^2) $  | $\mu \in \mathbb{R}\\ \sigma > 0$  | <center> NA  | Symmetric function. Distribution for a continuous random variable <center> | <center> $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ <center>
    

<img src="graphs/discreteDistQuest.jpg" alt="Discrete Distribution" width="400" align="left"><img src="graphs/continuousDistQuest.jpg" alt="Continuous Distribution" width="400">




### Would you like a jelly baby?

One day, walking to KB, a rather conspicuous blue box appears inconspicuously beside you and a stranger steps out to offer you a coloured confectionery. Being completely unaware of popular culture and resentful of any advice your mother gave to you, you unquestioningly oblige and accept a sweet.

You ask if any of the sweets are vegan. Naturally, the stranger lists the exact numbers of jelly babies that are vegan and non-vegan by colour. Your vacant look as the stranger lists a string of numbers prompts them to hand over a table they had conveniently pre-made.

 Type| Red | Green | Blue | Yellow | Total
 --- | --- | --- | --- | --- | ---
Vegan | 4 | 2 | 1 | 1 | 8
Non-Vegan | 8 | 5 | 8 | 3 | 24
Total | 12 | 7 | 9 | 4 | 32

- If all the sweets are mixed in one bag, what is the probability that you choose a vegan sweet, $P(\text{vegan})$?

- Given that you have chosen a non-vegan sweet, what is the probability that the sweet is green or blue, $P(\text{green or blue}|\text{non-vegan})$?

- What is the probability of choosing three red sweets in a row (and yes, you eat the sweets as you go along), $P(\text{rrr})$?

- Challenge Question: If you pick 4 sweets, what's the probability of getting 1 of each colour $P(\text{rgby})$ (first assume you replace the sweet after choosing and then assume you eat it)?
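Spoiler warning: if you would like to check your answers, the questions above can be worked through with exact fractions. The sketch below is only one way of setting it up; the variable names are made up for illustration, and it assumes that "1 of each colour" means the four colours may appear in any order.

```python
from fractions import Fraction
from math import factorial, prod

# Counts copied from the stranger's table.
vegan = {"red": 4, "green": 2, "blue": 1, "yellow": 1}
non_vegan = {"red": 8, "green": 5, "blue": 8, "yellow": 3}
colour_totals = {c: vegan[c] + non_vegan[c] for c in vegan}
n_sweets = sum(colour_totals.values())  # 32 sweets in the bag

# P(vegan): vegan sweets over all sweets.
p_vegan = Fraction(sum(vegan.values()), n_sweets)

# P(green or blue | non-vegan): restrict attention to the non-vegan row.
p_gb_given_non_vegan = Fraction(
    non_vegan["green"] + non_vegan["blue"], sum(non_vegan.values())
)

# P(rrr): three reds in a row, eating each sweet (no replacement).
reds = colour_totals["red"]
p_rrr = (
    Fraction(reds, n_sweets)
    * Fraction(reds - 1, n_sweets - 1)
    * Fraction(reds - 2, n_sweets - 2)
)

# P(rgby): one of each colour in four picks, in any order (assumption).
orderings = factorial(len(colour_totals))  # 4! possible colour orders
ways = prod(colour_totals.values())        # 12 * 7 * 9 * 4
p_rgby_replace = Fraction(orderings * ways, n_sweets**4)
p_rgby_eat = Fraction(
    orderings * ways,
    n_sweets * (n_sweets - 1) * (n_sweets - 2) * (n_sweets - 3),
)

for name, p in [
    ("P(vegan)", p_vegan),
    ("P(green or blue | non-vegan)", p_gb_given_non_vegan),
    ("P(rrr)", p_rrr),
    ("P(rgby), with replacement", p_rgby_replace),
    ("P(rgby), eating as you go", p_rgby_eat),
]:
    print(f"{name} = {p} ~ {float(p):.4f}")
```

Using `Fraction` keeps every answer exact, which makes it easier to compare against a hand calculation than a rounded decimal.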


### Do you even math?

From now on we shall always assume replacement of every chosen sweet for easy maths!

You've become suspicious of the stranger and are unsure if they even understand Earth-based arithmetic. You want to be reasonably certain that the ratio of vegan to non-vegan sweets the stranger told you was correct. Obviously you cannot check all the sweets because that would be rude, and arouse suspicion. Instead you placate yourself by taking a sample of ten. You end up with 4 vegan and 6 non-vegan sweets. Do you believe the stranger?

## Frequentist Approach

A binomial test!

 Under the null hypothesis, i.e. there are 8 vegan and 24 non-vegan sweets, the chance of getting exactly 4 vegan sweets in a sample of 10 is 
 $$P(\text{Vegan} = 4| \theta = 8/32) = {10 \choose 4}\left(\frac{1}{4}\right)^4 \left(\frac{3}{4}\right)^6 = 0.145998.$$
 
 The probability that 4 or more sweets were vegan is 
 $$P(\text{Vegan} \geq 4| \theta = 8/32) = \sum^{10}_{k=4}{10 \choose k}\left(\frac{1}{4}\right)^k \left(\frac{3}{4}\right)^{10-k} = 0.22412.$$
 
 A Frequentist would then deduce that since there is a sizeable probability of getting 4 or more vegan sweets, given that 8 exist, there is not enough evidence to reject the null hypothesis.  'Sizeable probability' is commonly defined (entirely arbitrarily) as greater than 5%.
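The two probabilities above can be reproduced with a few lines of standard-library Python. This is a minimal sketch rather than a full hypothesis-testing routine; `binomial_pmf` is a helper written for this example, and the one-sided "4 or more" tail matches the sum used above.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """Probability of exactly k successes in n independent draws."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


n_draws = 10          # sweets sampled (with replacement)
n_vegan_seen = 4      # vegan sweets observed
theta_null = 8 / 32   # the stranger's claimed vegan proportion

# Probability of the outcome we actually saw.
p_exactly_4 = binomial_pmf(n_vegan_seen, n_draws, theta_null)

# One-sided p-value: the observed outcome or anything more extreme.
p_value = sum(
    binomial_pmf(k, n_draws, theta_null)
    for k in range(n_vegan_seen, n_draws + 1)
)

print(f"P(Vegan = 4)  = {p_exactly_4:.6f}")
print(f"P(Vegan >= 4) = {p_value:.5f}")
print("Reject the null at 5%?", p_value < 0.05)
```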
 
 Notice two things. First, in calculating our p-value we have used *possible* outcomes, **not** just the outcome we saw. We only observed the case where vegan = 4, but the p-value uses all cases with 4 or more vegan sweets.
 
 Second, we **only** care about the null hypothesis and have no information on alternative hypotheses. Poorly worded statistics problems give you a null hypothesis together with an alternative (e.g. $\theta = 12/32$), get you to conduct some statistical test like the one above, and then ask the question "Should you accept the alternative hypothesis?". You cannot answer any questions on alternative hypotheses given the test above, as it is conditioned on the null hypothesis being true.




## Bayesian Approach
The previous frequentist example emphasises the focus on creating point statistics from the get go to make inferences. As is the Bayesian way, we will now prioritise creating a posterior distribution of possible values for $\theta$ and their probabilities given the data, and only then make decisions, often using point estimates of the posterior distribution. So how do you go about creating the conditional probability $P(\theta|x)$, where $\theta$ is the parameter(s) of the distribution and $x$ is the data you have collected?

### Bayes Theorem 

$$p(\theta|x)=\frac{p(x|\theta)p(\theta)}{p(x)}$$

$$\text{posterior}=\frac{\text{likelihood}\times\text{prior}}{\text{evidence}}$$

### Likelihood : 
This is where your PhD lies, your deductions about the probability functions that best model the mechanisms creating your data.

$$p(x|\theta)$$ 

1) Sometimes denoted $f(\theta;x)$

2) Not strictly a probability function of $\theta$: for fixed data $x$ it does not, in general, integrate to one over the parameters

$$\int_\theta p(x|\theta) d\theta \neq 1$$

3) For true Bayesian modelling, should contain all the information attainable from the data 

### Prior :
Here you are taking an educated guess where the parameters for the model are! Ideally this will give the model a head start in settling on a suitable posterior, but generally with enough data the likelihood will dominate!

$$p(\theta)$$

1) Can be proper $$\int_\theta p(\theta) d\theta = 1$$
or improper, where the integral does not converge, $$\int_\theta p(\theta) d\theta = \infty$$

2) Primary source of controversy in Bayesian methodology: subjectivity and non-informative priors

3) Allows your model to be informed by prior information!

### Evidence :
Only through the development of complex Bayesian models do we worry about this, for now ignorance is bliss and magic algorithms in Bayesian MCMC samplers will handle it behind closed doors.

$$p(x)$$

1) Generally an afterthought, the scale factor for the posterior

$$\text{posterior}\propto\text{likelihood}\times\text{prior}$$

2) Defined as the marginal of the joint density (or the normalisation of the joint model, likelihood $\times$ prior)

$$p(x) = \int p(x|\theta) p(\theta) d\theta \neq 1$$

3) Often computationally expensive to compute but crucial for the posterior to be an actual probability distribution
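For a one-parameter problem the evidence is cheap enough to compute directly, which makes a useful sanity check. The sketch below assumes the jelly baby data (4 vegan sweets in 10 draws) with a binomial likelihood and, purely for illustration, a uniform prior on $\theta$; neither the uniform prior nor the function names come from the text above. For this particular combination the integral has a known closed form, $1/(n+1) = 1/11$, which the numerical answer should match.

```python
from math import comb


def likelihood(theta, k=4, n=10):
    """Binomial likelihood p(x | theta) for k vegan sweets in n draws."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def uniform_prior(theta):
    """Illustrative prior: flat density on [0, 1] (an assumption)."""
    return 1.0


# Midpoint-rule approximation of p(x) = integral of likelihood * prior.
n_grid = 10_000
width = 1.0 / n_grid
midpoints = [(i + 0.5) * width for i in range(n_grid)]
evidence = sum(likelihood(t) * uniform_prior(t) for t in midpoints) * width

print(f"numerical evidence p(x) = {evidence:.6f}")
print(f"closed form 1/(n+1)     = {1 / 11:.6f}")
```

With a flat prior the evidence is also the integral of the likelihood over $\theta$, so this doubles as a demonstration that the likelihood is not a probability distribution over the parameters.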

### Posterior :
The end game. Contains the probability that the model parameters take given values, according to your prior knowledge and the new data.

$$p(\theta|x)$$

1) Must be a probability function, fulfilling all of the axioms above

2) Sampling from the posterior gives you parameters that can be used to simulate data or calculate point estimates

3) Origins of Bayesian credible intervals rather than frequentist confidence intervals

### Return to the jelly baby

Likelihood : Binomial

Prior : Normal with mean 0.75 and variance 0.1

Posterior:

<img src="./jellyBabyPlot.jpg" width="550">
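A posterior for this likelihood and prior can be approximated on a grid without any sampler. The sketch below is not necessarily how the figure was produced; it assumes $\theta$ is the vegan proportion, that the data are the 4 vegan sweets in 10 draws from the binomial test, and that the Normal prior is simply cut off outside $[0, 1]$. All function and variable names are invented for the example.

```python
from math import comb, exp


def likelihood(theta, k=4, n=10):
    """Binomial likelihood for k vegan sweets in n draws."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def prior(theta, mean=0.75, variance=0.1):
    """Unnormalised Normal prior; the constant cancels on normalising."""
    return exp(-((theta - mean) ** 2) / (2 * variance))


# Evenly spaced grid over the allowed range of theta.
n_grid = 1000
step = 1.0 / n_grid
thetas = [i * step for i in range(n_grid + 1)]

# posterior is proportional to likelihood * prior
unnormalised = [likelihood(t) * prior(t) for t in thetas]

# Riemann-sum estimate of the normalising constant over the grid.
evidence = sum(unnormalised) * step
posterior = [u / evidence for u in unnormalised]

# Two common point summaries of the posterior.
posterior_mode = thetas[posterior.index(max(posterior))]
posterior_mean = sum(t * p for t, p in zip(thetas, posterior)) * step

print(f"posterior mode ~ {posterior_mode:.3f}")
print(f"posterior mean ~ {posterior_mean:.3f}")
```

The posterior peak sits between the sample proportion (0.4) and the prior mean (0.75), which is the usual compromise between data and prior when the sample is small.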

# Indiana Jones and the Quest for the Non-informative Prior

As hinted at previously, the choice of prior does cause some controversy. The figure below shows the effect of differing priors on the posterior for the jelly baby problem. 
<img src="graphs/priorPlotComparison.jpg" width="600">
It is an incontrovertible fact that priors are vital for Bayesian methods. Without a prior distribution over the parameters you cannot produce a posterior with which to decide what parameter best fits the data. 

One attempt to overcome the posterior's dependence on the prior is to develop a 'non-informative prior', a prior which does not prefer one parameter value over another (and is invariant over all parameterisations, which is the kicker). A uniform prior is a great example of a prior which does not prefer a parameter value. Unfortunately, any transformation or grouping of the parameter can alter the prior distribution and make it informative. 

<img src="graphs/uniformPriorPlot.jpg" alt="Uniform Prior" width="400" align="left"><img src="graphs/transformedPriorPlot.jpg" alt="Transformed Prior" width="400">

 log(x) | x 
 --- | --- 
0 - 1 | 1 - 2.718 
1 - 2 | 2.718 - 7.389 
2 - 3 | 7.389 - 20.086
3 - 4 | 20.086 - 54.598
4 - 5 | 54.598 - 148.413

As of yet, mathematicians have not found a suitable function that is both non-informative and invariant with respect to transformations.
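The table above can be turned into numbers to show why a flat prior does not stay flat. The sketch below assumes a uniform prior on $\log(x)$ over $[0, 5]$, so each row of the table carries the same probability, and then asks how much probability per unit of $x$ each row receives. The variable names are illustrative only.

```python
from math import exp

# Bin edges on the log(x) scale, matching the rows of the table.
log_edges = [0, 1, 2, 3, 4, 5]

# A uniform prior on log(x) gives every bin the same probability.
prob_per_bin = 1 / (len(log_edges) - 1)

densities = []
for lo, hi in zip(log_edges[:-1], log_edges[1:]):
    x_lo, x_hi = exp(lo), exp(hi)
    # Same probability spread over a wider x interval means lower density.
    density = prob_per_bin / (x_hi - x_lo)
    densities.append(density)
    print(
        f"log(x) in [{lo}, {hi}] -> x in [{x_lo:.3f}, {x_hi:.3f}], "
        f"average density in x = {density:.5f}"
    )
```

The density on the $x$ scale falls with every row, so a prior that expresses no preference for $\log(x)$ strongly prefers small values of $x$.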

There are various schools of thought when it comes to prior choice, much of which we will not cover, but these are the reasons why prior subjectivity is generally unimportant in Bayesian models:

1) With a suitable dataset the likelihood will always dominate the posterior

In cases where the prior seriously affects the posterior, take it as a sign that the dataset used simply doesn't have enough information to make informed decisions. You need to acquire more data or question how you have processed your data! For the graph above I used exceptionally poor alternative priors.

2) Non-informative priors are a nonsensical concept

Bayesian inference is built upon supplementing the knowledge you already have, wrapped up in a prior, with new data, via the likelihood. Developing a prior encoding no information suggests you're tackling a problem you know nothing about (getting scarily close to a neural network black box). The ultimate faux pas is to not give any thought to your priors, which is quite common in many use cases of Bayesian packages!

3) Conduct posterior checks!

Rerun your simulation with different priors and different hyperparameters, then see the effect on the posterior. Why worry about something that MAY have an effect when you can literally rule it out?
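A prior sensitivity check can be sketched very cheaply if you swap in conjugate Beta priors, because a Beta($a$, $b$) prior with a binomial likelihood gives a Beta($a + k$, $b + n - k$) posterior. Note the assumptions: the Beta priors below are chosen only for illustration (the jelly baby example above used a Normal prior), and the 1000-sweet dataset is hypothetical, invented to show what happens with more data at the same observed proportion.

```python
def posterior_mean(a, b, successes, trials):
    """Mean of the Beta(a + successes, b + trials - successes) posterior."""
    return (a + successes) / (a + b + trials)


# Illustrative priors (assumptions): flat, sceptical, and badly wrong.
priors = {
    "flat Beta(1, 1)": (1, 1),
    "sceptical Beta(2, 6)": (2, 6),
    "stubborn Beta(20, 5)": (20, 5),
}

# (vegan sweets, sweets sampled); the second dataset is hypothetical.
datasets = {
    "10 sweets": (4, 10),
    "1000 sweets (hypothetical)": (400, 1000),
}

spread = {}
for label, (k, n) in datasets.items():
    means = {name: posterior_mean(a, b, k, n) for name, (a, b) in priors.items()}
    spread[label] = max(means.values()) - min(means.values())
    print(label)
    for name, m in means.items():
        print(f"  {name:<22} posterior mean = {m:.3f}")
    print(f"  spread across priors   = {spread[label]:.3f}")
```

With ten sweets the posterior mean moves a long way depending on the prior; with the larger hypothetical sample the three priors give nearly the same answer, which is the likelihood dominating.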