# Introducing Probability



Please ensure you have watched the Chapter 2 video and read through the [Chapter 2 jupyter notebook](https://github.com/haleygomez/Data-Analysis-2021/blob/master/blended_exercises/Chapter%202/Chapter2.ipynb).

## You will learn the following things in this Chapter

- Basic probability rules 
- Bayes Theorem
- How to use Python to estimate probabilities
- After completing this notebook you will be able to attempt CA 1 questions 1 and 4.

***

## Probably useful: probability overview

We are going to be using the concept of probability a lot in this course, so it is best
to first review the basic ideas behind probability theory, and get used to the notation. Unfortunately, there are many notations; this course will adopt one, but we will also
mention the others, so that you can recognise them when reading further.

- An experiment results in a set of outcomes, which we will call $\Omega$. This can be a discrete set of outcomes, such as in the classic coin toss, $\Omega = \{H, T\}$, or the roll of a die, $\Omega = \{1, 2, 3, 4, 5, 6\}$. However, in many real-life experiments $\Omega$, which is referred to as the *outcome space* or *event space*, can have an infinite continuum of values. We will return to this idea later, but for the moment we will consider just discrete events to help outline the basic properties of probability.

- Returning to the coin toss, we can say that a fair coin will have a probability of heads of $P(H) = 0.5$, and a probability of tails of $P(T) = 0.5$. The outcomes of the coin-toss experiment, $\Omega = \{H, T\}$, are thus equally likely. Similarly, for a die roll, $\Omega = \{1, 2, 3, 4, 5, 6\}$, with $P(i) = 1/6$. When an experiment has $m$ equally likely outcomes, the probability of any event $x$ is then:

    $P(x) = \dfrac{\#x}{m}$

    where $\#x$ is the number of outcomes in which $x$ occurs. For example, consider the more complicated case where we toss 3 different coins together: a 50p, a 20p and a pound coin. The outcome space is then

    $\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}.$

    Assuming all outcomes were equally likely, then the probability of getting any one is 1/8.
    
- Not all outcomes are equally likely. For example, we could ask for the probability that our 3-coin toss comes up with $n$ heads, so that the outcome space is $\Omega = \{0, 1, 2, 3\}$. Each outcome $\omega \in \Omega$ in this new experiment is found by taking the outcomes of our previous 3-coin experiment and ignoring which exact coin lands on heads or tails, i.e.

    - $\omega = 0$ ($n = 0$) corresponds to TTT
    - $\omega = 1$ ($n = 1$) corresponds to HTT, THT, or TTH
    - $\omega = 2$ ($n = 2$) corresponds to HHT, HTH, or THH
    - $\omega = 3$ ($n = 3$) corresponds to HHH

    Thus $P(n =0) = P(n =3) = 1/8$ while $P(n=1) = P(n=2) = 3/8$.
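The counting argument for the 3-coin toss can be checked by brute force. The following is a minimal sketch using only the Python standard library; the variable names (`outcomes`, `p_heads`) are chosen here for illustration and do not come from the course notebook.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 = 8 equally likely outcomes of tossing three distinguishable coins
outcomes = ["".join(toss) for toss in product("HT", repeat=3)]
print(outcomes)

# Group the outcomes by the number of heads they contain
heads_counts = Counter(outcome.count("H") for outcome in outcomes)

# P(n) = (number of outcomes with n heads) / (total number of outcomes)
p_heads = {n: Fraction(count, len(outcomes))
           for n, count in sorted(heads_counts.items())}

for n, p in p_heads.items():
    print(f"P(n = {n}) = {p}")

# The probabilities over the whole outcome space sum to 1
print("Sum of probabilities:", sum(p_heads.values()))
```

Using `Fraction` keeps the results exact (1/8 and 3/8) rather than rounding them to floating-point numbers.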
    
The above uses of $P$ hide an important aspect of probabilities: $P(x_i)$ is *normalised*, such that the sum of the probabilities of all possible events adds up to 1,

$P(X) = \sum_{i=1}^{N} P(x_i) = 1$,

where $X = (x_1, \ldots, x_N)$. Technically this is only true for finite or countably infinite outcome spaces. If the outcome space is uncountably infinite, the definition of probability has to be generalised, with the sum replaced by an integral over a probability density.

### Axioms 

The axioms of probability are:

Axiom 1 : $0 \le P(A) \le 1$

Axiom 2 : $P(\Omega) = 1$

Axiom 3 : $P(A_1 \cup A_2 \cup A_3 \cup \ldots) = P(A_1)+P(A_2)+P(A_3)+\ldots$ for mutually exclusive events $A_1, A_2, A_3, \ldots$

Here the symbol $\cup$ denotes **or**.

The axioms permit us to work out the probability of event $A$ **not** occurring,

$P(A^c) = 1 - P(A)$.

This result can be generalised. For example, for any two events $C$, $D$,

$P(C \cup D) = P(C) + P(D) - P(C \cap D)$,

where now the symbol $\cap$ denotes **and** (also written as $P(CD)$ or $P(C, D)$ in the literature). A simple way to think of this is that the probability of getting either $C$ or $D$ is just the sum of the chances of getting each, $P(C) + P(D)$, minus the chance of getting both at the same time, $P(C \cap D)$.

The last term is important: outcomes in which both $C$ **and** $D$ occur are counted once in $P(C)$ and once in $P(D)$, so we subtract $P(C \cap D)$ to avoid counting them twice. The best way to see this is by considering the diagram below.

![Screenshot](https://github.com/haleygomez/Data-Analysis-2021/raw/master/blended_exercises/Chapter%201/Colab_screenshot_sign_in.png)
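The complement and addition rules can be verified on a small example. The sketch below assumes a single roll of a fair die and two illustrative events that are not part of the original text: $C$ = "the roll is even" and $D$ = "the roll is greater than 3". The helper `prob` is likewise hypothetical.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # outcome space for one roll of a fair die

def prob(event):
    """Probability of an event (a subset of omega) for equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

C = {2, 4, 6}  # illustrative event: the roll is even
D = {4, 5, 6}  # illustrative event: the roll is greater than 3

# Addition rule: P(C or D) = P(C) + P(D) - P(C and D)
p_union = prob(C | D)
p_sum_rule = prob(C) + prob(D) - prob(C & D)
print("P(C or D) directly:   ", p_union)
print("P(C) + P(D) - P(C, D):", p_sum_rule)

# Complement rule: P(not C) = 1 - P(C)
p_not_C = prob(omega - C)
print("P(not C):", p_not_C, " 1 - P(C):", 1 - prob(C))
```

Here the outcomes 4 and 6 belong to both events, which is why simply adding $P(C)$ and $P(D)$ would overcount.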


## Bayes Theorem

Bayes Theorem follows from the definition of conditional probability, and states that:

$P(A|B) = \dfrac{P(A \cap B)}{P(B)} = \dfrac{P(B|A)P(A)}{P(B|A)P(A)+P(B|A^c)P(A^c)}$
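The two forms of the theorem are connected by two short steps, sketched here for completeness. First, the definition of conditional probability can be written in either order,

$P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$,

which gives the numerator. Second, since $B$ must occur either together with $A$ or together with $A^c$, and these two cases are mutually exclusive,

$P(B) = P(B \cap A) + P(B \cap A^c) = P(B|A)P(A) + P(B|A^c)P(A^c)$,

which gives the denominator.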

Bayes Theorem is normally used to determine the probability of a specific model, $\theta$, given some data $D$, such that

$P(\theta|D) = \dfrac{P(D|\theta) P(\theta)}{P(D)}$

where $P(D|\theta)$ is the *likelihood*, $P(\theta)$ is the *prior*, $P(D)$ is the *evidence*, and $P(\theta|D)$ is the *posterior*.

People often feel uncomfortable about ‘priors’, since they are frequently a ‘best guess’.

Indeed, different analysts may have differing opinions about what the prior for a given experiment should be. Although ‘frequentists’ disagree with the use of priors, note that technically they do assume one: they assume that all models are equally likely (i.e. a ‘flat’ prior).

Clearly this assumption can also be wrong. So priors are useful, but they must be clearly stated. They provide a formal means for the analyst to include previous information that is relevant to the experiment. They also allow you to test whether the model is good.

Bayesian analysis has a somewhat formidable reputation for being extremely difficult… why is that?

- In general, the denominator (the evidence) can be difficult to evaluate
- It involves tricky integrals
- These often require numerical solutions
- The parameter space can be large (multivariate)

The development of Markov Chain Monte Carlo (MCMC) methods in the 20th century has made the evaluation of these integrals and probabilities much easier.

You will learn more about this later in the course!

### Bayesian vs Frequentist:

A Bayesian might argue “the prior probability is a logical necessity when assessing the probability of a model. It should be stated, and if it is unknown you can just use an uninformative (wide) prior.”

A frequentist might argue “setting the prior is subjective: two experimenters could use the same data to come to two different conclusions just by taking different priors.”

### Example

Imagine that a box contains five coins, one of which is a joke (J) coin, with heads on both sides. A coin is selected at random from the box, and flipped 3 times. The result is 3 heads (3H). What is the probability that the coin is the joke coin?

### Solution


First, we should define what we are trying to work out. 

We are interested in $P(J | 3H)$. We will let the normal coin be denoted by $C$. So using Bayes Theorem we can write,

$P(J | 3H) = \dfrac{ P(3H | J) P(J)}  {P(3H) }$

To get $P(3H)$ we need to add up all the ways of getting 3 heads: either with the joke coin or with a normal coin.

$P(J | 3H) = \dfrac{ P(3H | J) P(J)}  {P(3H | J) P(J) + P(3H | C) P(C)}$

The probability of randomly selecting the joke coin is $P(J) = 1/5$. 

The probability of not selecting it is $P(J^c) = 1 - 1/5 = 4/5 = P(C)$.

The probability of getting 3 heads with the joke coin is 1, so 

$P(3H | J) = 1$

The probability of getting 3 heads with a standard coin is $(1/2) \times (1/2) \times (1/2)$ (remember these are independent events), so

$P(3H | C) = 1/8$

$P(J | 3H) = \dfrac{ 1 \times 1/5}  {1 \times 1/5 ~ + ~1/8 \times 4/5} = 2 / 3$

So there's about a 67% chance that the coin we are seeing flipped is the joke coin!
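The same answer can be reproduced in Python, both exactly and by simulation. This is a minimal sketch rather than the course's own solution: the variable names, the random seed and the number of trials (`n_trials`) are arbitrary choices made here.

```python
import random
from fractions import Fraction

# --- Exact calculation with Bayes Theorem ---
p_J = Fraction(1, 5)                # prior: one joke coin among five
p_C = 1 - p_J                       # prior: a normal coin
p_3H_given_J = Fraction(1)          # the joke coin always lands heads
p_3H_given_C = Fraction(1, 2) ** 3  # three independent fair flips

p_3H = p_3H_given_J * p_J + p_3H_given_C * p_C  # evidence, P(3H)
p_J_given_3H = p_3H_given_J * p_J / p_3H        # posterior, P(J | 3H)
print("Exact P(J | 3H):", p_J_given_3H)

# --- Monte Carlo check: repeat the experiment many times ---
rng = random.Random(42)  # fixed seed so the run is repeatable
n_trials = 200_000
n_3H = 0                 # trials that gave three heads
n_3H_and_J = 0           # ...of which the coin was the joke coin

for _ in range(n_trials):
    is_joke = rng.randrange(5) == 0  # pick one of five coins; coin 0 is the joke
    p_heads = 1.0 if is_joke else 0.5
    three_heads = all(rng.random() < p_heads for _ in range(3))
    if three_heads:
        n_3H += 1
        n_3H_and_J += is_joke

estimate = n_3H_and_J / n_3H
print("Simulated P(J | 3H):", round(estimate, 3))
```

The simulation only keeps the trials in which 3 heads were seen, and then asks what fraction of those used the joke coin. That is exactly what conditioning on $3H$ means.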

***

### So in the end what is probability?

In our examples of coin flipping, die rolling and card selecting, we introduced the notion that the probability of a particular event is simply the number of outcomes in which that event occurs, divided by the number of all possible outcomes.

But say you have a coin, and you want to know $P(H)$ -- how do you proceed? You could guess that the coin is fair and assign 0.5 to outcome heads/tails. But is the coin fair? One way you could test this is to perform lots of experiments (coin flips) and keep track of the outcome. If you do enough of these, eventually you will get an empirical measure,

$P(H) = \dfrac{n_H}{n_{\rm flips}}$

where $n_H$ is the number of heads that appeared in the experiment and $n_{\rm flips}$ is the number of times you flipped the coin (and counted the result). But when do you stop? Well, that depends on how accurately you want to know $P(H)$. But for now, we will simply note that this type of determination of $P$ is *frequentist*, in that the probability is defined by counting the instances of occurrence.
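This empirical approach is easy to try out in Python. The sketch below simulates the flips with the standard `random` module; the function `estimate_p_heads` and its parameters (`p_true`, `seed`) are hypothetical names introduced for this illustration, and the coin is assumed to be fair.

```python
import random

def estimate_p_heads(n_flips, p_true=0.5, seed=0):
    """Flip a simulated coin n_flips times and return n_H / n_flips."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_heads = sum(rng.random() < p_true for _ in range(n_flips))
    return n_heads / n_flips

# The estimate settles towards the true value as the number of flips grows
for n_flips in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n_flips:>7} flips: P(H) is approximately {estimate_p_heads(n_flips):.4f}")
```

The scatter of the estimate around the true value shrinks roughly as $1/\sqrt{n_{\rm flips}}$, so each extra decimal place of accuracy costs about 100 times more flips.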

However, what about the probability that it will rain tomorrow? You can see straight away that such a probability is more difficult to define. In fact, the use of Bayes Theorem, and in particular the prior, introduces a much more vague idea that $P$ represents the belief that something will occur.

Now you are ready to tackle the **Chapter 2 quiz** on Learning Central and the [Chapter 2 yourturn notebook](https://github.com/haleygomez/Data-Analysis-2021/blob/master/blended_exercises/Chapter%202/Chapter2_yourturn.ipynb).