# Review of Probability

## CSCI E-83
## Stephen Elston

Probability theory is at the core of probabilistic programming. Therefore, it is important to have a good understanding of the principles of probability theory. This lesson introduces you to the basic concepts you will need to know to tackle probabilistic programming. Specifically, you will learn:

1. The axioms of probability.
2. How to work with conditional probability and Bayes' rule. 
3. Factoring probability distributions using the chain rule of probability. 
4. Computing marginal distributions for inference. 
5. Applying the concepts of dependence and independence to factor distributions. 

As a first step run the code below to import the required packages. 

In [2]:
import pandas as pd
import numpy as np

## Axioms of probability

All probability distributions must have certain properties, which we refer to as the **axioms of probability**. These are:

- Probability for any set, A, is bounded between 0 and 1:  

$$0 \le P(A) \le 1 $$
- Probability of the Sample Space = 1:  

$$P(S) = \sum_{All\ i}P(a_i) = 1 $$

- The probability of a finite union of mutually exclusive (disjoint) events is the sum of their probabilities:

$$P(A \cup B) = P(A) + P(B)\\ 
if\\ 
A \cap B = \emptyset $$

To make these ideas concrete, let's try an example. The code in the cell below creates a data frame with the probabilities of hair and eye color combinations. 

In [3]:
eyeHair = pd.DataFrame({'Black':[0.11, 0.03, 0.03, 0.01], 
                     'Brunette':[0.21, 0.14, 0.09, 0.05],
                     'Red':[0.04, 0.03, 0.02, 0.02],
                     'Blond':[0.01, 0.16, 0.02, 0.03]}, 
                      index = ['Brown', 'Blue', 'Hazel', 'Green'])
eyeHair

| | Black | Blond | Brunette | Red |
|---|---|---|---|---|
| Brown | 0.11 | 0.01 | 0.21 | 0.04 |
| Blue | 0.03 | 0.16 | 0.14 | 0.03 |
| Hazel | 0.03 | 0.02 | 0.09 | 0.02 |
| Green | 0.01 | 0.03 | 0.05 | 0.02 |


This table contains the bivariate distribution $p(hair,eye)$. For example, the probability of a subject in this sample having black hair and brown eyes is $p(black,brown) = 0.11$.

You can see that all of these probabilities are in the range $0 \le p(hair,eye) \le 1.0$, and therefore satisfy one of the axioms. 

We can test if these probabilities add up to 1.0. 

In [4]:
np.array(eyeHair).sum()

0.9999999999999999

To within rounding error, these probabilities add to 1.0 and satisfy another axiom. 
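The additivity axiom can be checked in the same spirit. The following is a minimal sketch, assuming the same table of joint probabilities; the variable names (`joint`, `p_union_overlap`, and so on) are illustrative only. It compares a pair of disjoint events (brown eyes, blue eyes) with a pair of overlapping events (brown eyes, black hair), for which the simple sum overcounts.

```python
import numpy as np
import pandas as pd

# Joint probabilities p(hair, eye), copied from the table in this lesson.
joint = pd.DataFrame({'Black': [0.11, 0.03, 0.03, 0.01],
                      'Brunette': [0.21, 0.14, 0.09, 0.05],
                      'Red': [0.04, 0.03, 0.02, 0.02],
                      'Blond': [0.01, 0.16, 0.02, 0.03]},
                     index=['Brown', 'Blue', 'Hazel', 'Green'])

# Disjoint events: nobody in the table has both brown and blue eyes.
p_brown = joint.loc['Brown'].sum()
p_blue = joint.loc['Blue'].sum()
p_brown_or_blue = joint.loc[['Brown', 'Blue']].to_numpy().sum()
additive = np.isclose(p_brown_or_blue, p_brown + p_blue)

# Overlapping events: brown eyes and black hair share one cell of the table.
p_black = joint['Black'].sum()
mask = np.zeros(joint.shape, dtype=bool)
mask[joint.index.get_loc('Brown'), :] = True      # every brown-eyed cell
mask[:, joint.columns.get_loc('Black')] = True    # every black-haired cell
p_union_overlap = joint.to_numpy()[mask].sum()
overcounted = p_brown + p_black > p_union_overlap

print('Disjoint events are additive:', additive)
print('Simple sum overcounts overlapping events:', overcounted)
```

For the overlapping pair, the shared cell $p(black, brown)$ has to be subtracted once to recover the probability of the union.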

The question of independence or dependence is a bit more complicated, and will be addressed later. 

## Conditional distributions and Bayes' Theorem

A probability distribution of one random variable can be conditionally dependent on another random variable. **Bayes' theorem**, also known as **Bayes' rule**, gives us a powerful tool to think about and analyze conditional probabilities. We can write the joint probability of $A$ and $B$ in terms of conditional probabilities in two ways:

$$P(A \cap B) = P(A|B)P(B)\\
P(B \cap A) = P(B|A)P(A)$$

Now:

$$P(A \cap B) = P(B \cap A)$$

This leads to Bayes' theorem as follows:

$$P(A|B)P(B) = P(B|A)P(A)\\
P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

The last line is Bayes' theorem. 
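As a quick numerical check of the theorem, here is a minimal sketch that takes $A$ to be "brown eyes" and $B$ to be "black hair" in the hair and eye color table. It assumes $P(A)$ and $P(B)$ can be obtained as row and column sums of the table (the marginal distributions discussed later in this lesson); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Joint probabilities p(hair, eye), copied from the table in this lesson.
joint = pd.DataFrame({'Black': [0.11, 0.03, 0.03, 0.01],
                      'Brunette': [0.21, 0.14, 0.09, 0.05],
                      'Red': [0.04, 0.03, 0.02, 0.02],
                      'Blond': [0.01, 0.16, 0.02, 0.03]},
                     index=['Brown', 'Blue', 'Hazel', 'Green'])

p_a = joint.loc['Brown'].sum()        # P(A): brown eyes
p_b = joint['Black'].sum()            # P(B): black hair
p_ab = joint.loc['Brown', 'Black']    # P(A and B)

p_b_given_a = p_ab / p_a              # P(B|A) from the definition

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_a_given_b_bayes = p_b_given_a * p_a / p_b

# Direct computation from the definition of conditional probability.
p_a_given_b_direct = p_ab / p_b

print('P(brown eyes | black hair) via Bayes: %4.2f' % p_a_given_b_bayes)
print('Both routes agree:', np.isclose(p_a_given_b_bayes, p_a_given_b_direct))
```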

We can compute $p(eye | hair = Black)$. First, we find the joint probabilities $p(eye, hair = Black)$:

In [5]:
eyeHair_black = eyeHair.loc[:,'Black']
eyeHair_black

Brown    0.11
Blue     0.03
Hazel    0.03
Green    0.01
Name: Black, dtype: float64

These numbers cannot be a probability distribution as they do not add up to 1.0. However, normalizing is easy in this case:

In [6]:
eyeHair_black = eyeHair_black/sum(eyeHair_black)
eyeHair_black

Brown    0.611111
Blue     0.166667
Hazel    0.166667
Green    0.055556
Name: Black, dtype: float64

In [8]:
print('For people with black hair: \nMost common eye color = ' + eyeHair_black.idxmax() + 
      '\nwith probability = %4.2f' % (eyeHair_black.max()))

For people with black hair: 
Most common eye color = Brown
with probability = 0.61


## Chain rule of probability

Another way to work with distributions is by applying the **chain rule of probability**. This rule allows us to factor a joint distribution as follows:

$$P(A_1, A_2, A_3, A_4 \ldots, A_n) = P(A_1 | A_2, A_3, A_4, \ldots, A_n)\ P(A_2, A_3, A_4 \ldots, A_n)$$

We can continue this factorization until we reach an end point:

$$P(A_1, A_2, A_3, A_4 \ldots, A_n) = P(A_1 | A_2, A_3, A_4, \ldots, A_n)\ P(A_2 | A_3, A_4 \ldots, A_n)\ P(A_3| A_4 \ldots, A_n) \ldots P(A_n)$$

Or in general terms, we can expand a joint distribution as the product of conditional distributions, where the last factor, $k = n$, is the unconditional $P(A_n)$:

$$P(\bigcap_{k=1}^n A_k) = \prod_{k=1}^n P(A_k \big| \bigcap_{j=k+1}^{n} A_j)$$

> **Note:** The factorization is not unique. We can factor the variables in any order. For example, we can write:

$$P(A_1, A_2, A_3, A_4 \ldots, A_n) = P(A_n | A_{n-1}, A_{n-2}, A_{n-3}, \ldots, A_1)\ P(A_{n-1}| A_{n-2}, A_{n-3}, \ldots, A_1)\ P(A_{n-2}| A_{n-3}, \ldots, A_1) \ldots p(A_1)$$
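To see that both orderings recover the same joint distribution, here is a minimal sketch using a hypothetical, randomly generated joint distribution over three binary variables. The array `joint` and the seed are assumptions made purely for illustration; they are not part of the hair and eye color data.

```python
import numpy as np

# Hypothetical joint distribution P(A1, A2, A3); axes are (A1, A2, A3).
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Order 1: P(A1 | A2, A3) P(A2 | A3) P(A3)
p_a2a3 = joint.sum(axis=0)                 # P(A2, A3)
p_a3 = p_a2a3.sum(axis=0)                  # P(A3)
p_a1_given_a2a3 = joint / p_a2a3           # broadcasts over the A1 axis
p_a2_given_a3 = p_a2a3 / p_a3              # broadcasts over the A2 axis
rebuilt_1 = p_a1_given_a2a3 * p_a2_given_a3 * p_a3

# Order 2: P(A3 | A2, A1) P(A2 | A1) P(A1)
p_a1a2 = joint.sum(axis=2)                 # P(A1, A2)
p_a1 = p_a1a2.sum(axis=1)                  # P(A1)
p_a3_given_a1a2 = joint / p_a1a2[:, :, None]
p_a2_given_a1 = p_a1a2 / p_a1[:, None]
rebuilt_2 = p_a3_given_a1a2 * p_a2_given_a1[:, :, None] * p_a1[:, None, None]

print('Order 1 recovers the joint:', np.allclose(rebuilt_1, joint))
print('Order 2 recovers the joint:', np.allclose(rebuilt_2, joint))
```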

As an example of a factorization using the chain rule of probability, we can find the conditional distribution of eye color given hair color (or the reverse). Our table of data only gives us the joint distribution, which we can factor as:

$$P(eye,hair) = P(eye|hair)\ P(hair) \\
or,\\
P(eye|hair) = \frac{P(eye,hair)}{P(hair)}$$
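The factorization above can be sketched for every hair color at once. This is a minimal example, assuming $P(hair)$ is obtained by summing each column of the joint table; the names `p_hair` and `p_eye_given_hair` are illustrative.

```python
import numpy as np
import pandas as pd

# Joint probabilities p(hair, eye), copied from the table in this lesson.
joint = pd.DataFrame({'Black': [0.11, 0.03, 0.03, 0.01],
                      'Brunette': [0.21, 0.14, 0.09, 0.05],
                      'Red': [0.04, 0.03, 0.02, 0.02],
                      'Blond': [0.01, 0.16, 0.02, 0.03]},
                     index=['Brown', 'Blue', 'Hazel', 'Green'])

# P(hair): sum each column over the eye colors.
p_hair = joint.sum(axis=0)

# P(eye | hair) = P(eye, hair) / P(hair), one column per hair color.
p_eye_given_hair = joint.div(p_hair, axis='columns')

# Each column is a proper distribution over eye color.
print(p_eye_given_hair.sum(axis=0))

# Multiplying back by P(hair) recovers the joint distribution.
rebuilt = p_eye_given_hair.mul(p_hair, axis='columns')
print('Joint recovered:', np.allclose(rebuilt.to_numpy(), joint.to_numpy()))
```

Each column of `p_eye_given_hair` is the conditional distribution of eye color for one hair color, so the columns sum to 1.0 while the rows generally do not.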

## Marginal distributions

Inference for probabilistic models is typically performed by computing **marginal distributions**. A marginal distribution is the probability distribution of one or more variables summed or integrated over the other variables of a multivariate distribution. This process is often an essential step in performing **inference**. By inference, we mean returning the results of a query. 

For example, if we start with a joint distribution, we can factor it using the chain rule of probability:

$$P(A,B) = P(A|B)P(B)$$

We can then compute the marginal distribution over $A$ by summing over $B$:

$$P(A) = \frac{1}{Z} \sum_{B} P(A|B)P(B) \\
where\\
Z = partition\ function$$

The concept is simple. The result of the summation is the distribution on the *margin* of the multivariate distribution. The **partition function** is a normalization constant, the sum of the unnormalized terms over all values of $A$ and $B$, used to create a proper marginal distribution.   

In this case, the joint distribution is already normalized, so $Z = 1$ and the marginal distribution is just:

$$P(A) = \sum_{B} P(A|B)P(B) = \sum_{B} P(A,B)$$
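As a minimal sketch of marginalization with a partition function, assume a hypothetical table of unnormalized counts (the numbers below are invented for illustration and are not from the hair and eye color data). Here $Z$ is taken to be the sum of all the counts, which is what turns the summed counts into a proper distribution.

```python
import numpy as np

# Hypothetical unnormalized counts; rows index A, columns index B.
counts = np.array([[30.0, 10.0, 20.0],
                   [5.0, 25.0, 10.0]])

# Partition function: the sum of all unnormalized terms.
Z = counts.sum()

# Marginal over A: sum over B, then normalize by Z.
p_a = counts.sum(axis=1) / Z

# The same marginal through the factorization sum_B P(A|B) P(B).
p_b = counts.sum(axis=0) / Z
p_a_given_b = counts / counts.sum(axis=0)
p_a_factored = (p_a_given_b * p_b).sum(axis=1)

print('P(A) =', p_a)
print('Factored form agrees:', np.allclose(p_a, p_a_factored))
```

Once the table holds probabilities rather than counts, the same code gives $Z = 1$ and the division has no effect.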

Using our dataset of hair and eye color, it is easy to compute the marginal probabilities of eye color as follows:

In [9]:
eyeHair['MarginalEye'] = eyeHair.sum(axis = 1)
eyeHair

| | Black | Blond | Brunette | Red | MarginalEye |
|---|---|---|---|---|---|
| Brown | 0.11 | 0.01 | 0.21 | 0.04 | 0.37 |
| Blue | 0.03 | 0.16 | 0.14 | 0.03 | 0.36 |
| Hazel | 0.03 | 0.02 | 0.09 | 0.02 | 0.16 |
| Green | 0.01 | 0.03 | 0.05 | 0.02 | 0.11 |


We can also compute the marginal probability of hair color:

In [10]:
eyeHair = pd.concat([eyeHair, pd.DataFrame({'MarginalHair':eyeHair.sum(axis = 0)}).T])
eyeHair

| | Black | Blond | Brunette | Red | MarginalEye |
|---|---|---|---|---|---|
| Brown | 0.11 | 0.01 | 0.21 | 0.04 | 0.37 |
| Blue | 0.03 | 0.16 | 0.14 | 0.03 | 0.36 |
| Hazel | 0.03 | 0.02 | 0.09 | 0.02 | 0.16 |
| Green | 0.01 | 0.03 | 0.05 | 0.02 | 0.11 |
| MarginalHair | 0.18 | 0.22 | 0.49 | 0.11 | 1.0 |


In many cases we really only want to know the maximum of the marginal probability. For example, we can find the most probable eye color:

In [11]:
eyeHair.iloc[:4,4].idxmax()  # rows 0-3 are the four eye colors; row 4 is MarginalHair

'Brown'

While brown eyes are the most probable, blue eyes are nearly as probable. 

Likewise, we can find the most probable hair color.  

In [12]:
eyeHair.iloc[4,:4].idxmax()  # columns 0-3 are the four hair colors; column 4 is MarginalEye

'Brunette'

## More Bayes' theorem

We have already computed the probability of eye color given hair color in a previous section. Let's check that we get the same result using Bayes' theorem. This is another form of inference. 

By the chain rule of probability, the joint distribution, $P(hair,eye)$, can be factored as:

$$P(hair,eye) = P(hair|eye)\ P(eye)\\
or,\\
P(hair|eye) = \frac{P(hair,eye)}{P(eye)}$$

Substituting this into Bayes' theorem, we can write:

$$P(eye | hair) = \frac{P(hair|eye)\ P(eye)}{P(hair)} = \frac{P(hair,eye)\ P(eye)}{P(hair) \ P(eye)} = \frac{P(hair,eye)}{P(hair)}$$

The quantities $P(eye)$ and $P(hair)$ are just the marginal distributions. So, we can easily compute the conditional distribution of eye color given black hair:

In [45]:
PEyeGivenHair_black = eyeHair.loc[:,'Black'][:4]/eyeHair.loc['MarginalHair','Black']
PEyeGivenHair_black

Brown    0.611111
Blue     0.166667
Hazel    0.166667
Green    0.055556
Name: Black, dtype: float64

Finally, we can check that the result is a proper distribution by verifying the sum adds to 1.0. 

In [48]:
print('Sum of probabilities of eye given black hair = %4.2f' % sum(PEyeGivenHair_black)) 

Sum of probabilities of eye given black hair = 1.00


## Independence
