# Bayes Rule & Probability Review
Bayes rule follows from the definition of conditional probability:
### $$p(A|B)=\frac{p(A,B)}{p(B)}$$
Where:
* p(A|B) is the **conditional probability** of A given B
* p(A,B) is the **joint probability** of A and B
* p(B) is the **marginal probability** of B

# Example 1 

![example1.png](attachment:example1.png)

So again, we want to find the probability that someone is going to buy from your site, given their country.

That means we want to use Bayes rule, which in this case looks like:

### $$p(Buys?|Country)=\frac{p(Buys?,Country)}{p(Country)}$$
## Marginal Probabilities
We can focus on marginal probability first. Here we are just trying to find the probability that a user is going to be from a specific country.

That will allow us to use the formula:
### $$p(Country=US) = \frac{users\;from\;US}{total\;number\;of\;visitors}$$

![marginal1.png](attachment:marginal1.png)
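
As a minimal sketch of the ratio above, here is the marginal for Canada, using the counts that appear in the joint probability calculation later in this notebook (320 visitors from Canada out of 1080 total). The helper name `marginal` is only illustrative.

```python
from fractions import Fraction

def marginal(count_in_group, total_count):
    """Marginal probability as a ratio of counts."""
    return Fraction(count_in_group, total_count)

# Counts from the worked example in this notebook:
# 320 of the 1080 visitors are from Canada.
p_canada = marginal(320, 1080)
print(p_canada, float(p_canada))
```

Using `Fraction` keeps the ratio exact, so it can be compared against the hand calculation without rounding issues.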

## Joint Probabilities
Now we are trying to find the probability that someone will buy **and** is from a certain country.

We should ask, how many probabilities are we looking for here? A joint probability must encode all possibilities. In this case there are:
* two outcomes for whether a user buys (yes or no)
* three outcomes for country (Mexico, US, Canada)
* So the total number of possibilities is 6

The number of joint probabilities grows exponentially as more variables are added. This is known as the curse of dimensionality. It is a problem because the larger the table gets, the more computation we have to perform.
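
The counting argument above can be sketched in a few lines. The first part enumerates the 6 (buy, country) pairs from this example; the second part assumes, purely for illustration, that every added variable is binary.

```python
from itertools import product

buys = ["yes", "no"]
countries = ["Mexico", "US", "Canada"]

# Every (buy, country) pair needs its own joint probability.
joint_outcomes = list(product(buys, countries))
print(len(joint_outcomes))  # 2 * 3 = 6

# Illustrative assumption: k binary variables give a table with 2**k entries.
for k in (1, 10, 20, 30):
    print(k, 2 ** k)
```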

![joint1.png](attachment:joint1.png)

In the figure above we can see all 6 joint probabilities. Joint probability is defined as:
### $$p(A,B) = p(A|B)*p(B)$$
which in the first example (buy = 1 and country = canada) looks like:
### $$p(buy=1,country=canada) = p(buy=1|country=canada)*p(country=canada)$$
### $$p(buy=1,country=canada) = \frac{20}{320}*\frac{320}{1080}=\frac{20}{1080}\approx 0.019$$
The other examples follow this same form. 
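
The Canada calculation above can be reproduced directly; this is just the arithmetic from the formula, with variable names chosen for readability.

```python
from fractions import Fraction

p_buy_given_canada = Fraction(20, 320)   # p(buy=1 | country=canada)
p_canada = Fraction(320, 1080)           # p(country=canada)

# p(A, B) = p(A | B) * p(B)
p_buy_and_canada = p_buy_given_canada * p_canada
print(p_buy_and_canada, round(float(p_buy_and_canada), 3))
```

The 320s cancel, so the joint probability is simply 20 buyers from Canada out of 1080 visitors.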

Notice that these numbers are much smaller than the marginal probabilities. The probabilities of all possible outcomes must sum to 1, so if the number of outcomes grows exponentially, the individual probability values shrink exponentially.

This is important because computers have finite precision: a 32-bit float holds only 32 bits of information, so it can represent only a finite set of values. This is another consequence of the curse of dimensionality.

### Underflow
* As a probability approaches 0, the computer will eventually round it down to 0
* This is known as the underflow problem
* It is common in probability to work with the log probability instead
* Log changes slowly with its argument, so even an extremely small probability maps to a negative number of moderate size
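
A small sketch of the underflow problem, assuming a made-up case of 1000 independent events that each have probability 0.01. Multiplying the probabilities directly underflows, while summing their logs stays well within floating-point range.

```python
import math

# Hypothetical data: 1000 independent events, each with probability 0.01.
probs = [0.01] * 1000

# Direct product: the true value is 1e-2000, far below what a float can hold.
direct = 1.0
for p in probs:
    direct *= p

# Sum of logs: the same quantity on a log scale.
log_total = sum(math.log(p) for p in probs)

print(direct)     # underflows to 0.0
print(log_total)  # roughly -4605.17
```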

## Conditional Probabilities 
Now we are trying to find the probability that a user buys or doesn't buy, given they are from a certain country. This is the conditional probability formula behind Bayes Rule, repeated here:
### $$p(A|B)=\frac{p(A,B)}{p(B)}$$

![conditional1.png](attachment:conditional1.png)

Let's walk through the first calculation from the figure above. We start with the formula:
### $$p(buy=1|country=canada)=\frac{p(buy=1,country=canada)}{p(country=canada)}$$
Plugging in the numbers gives:
### $$p(buy=1|country=canada)=\frac{0.019}{\frac{320}{1080}} \approx 0.06$$

Notice that the conditional probabilities in the figure above no longer sum to 1; they sum to 3! Why? Because we are **given** a country, the only random variable left is whether the user buys. Country is no longer random here, so the probabilities sum to 1 separately for each of the three countries.
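
A quick check of this point for one country, using the Canada counts from the worked calculation (20 buyers out of 320 visitors; the 300 non-buyers are derived as 320 - 20). The counts for the other countries are only in the figure, so they are not reproduced here.

```python
from fractions import Fraction

canada_visitors = 320
canada_buyers = 20

# Conditioning on country=canada: only buy / not buy is random.
p_buy = Fraction(canada_buyers, canada_visitors)
p_no_buy = Fraction(canada_visitors - canada_buyers, canada_visitors)

print(float(p_buy), float(p_no_buy), p_buy + p_no_buy)
```

The two conditional probabilities for Canada sum to 1; repeating this for each of the three countries is what gives a total of 3.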

---

# Example 2
Let's look at a slightly different problem. It has the same variables, but different counts.

![example2.png](attachment:example2.png)

In this case, the probability of buying seems to be **independent** of where you are from! 

## Independence 
When two variables are independent, the joint probability becomes the product of the marginal probabilities.

For example, if A & B are independent: 
### $$p(A,B) = p(A)p(B)$$
So if Buy and country are independent: 
### $$p(Buy | Country) = \frac{p(Buy,Country)}{p(Country)} = \frac{p(Buy)p(Country)}{p(Country)}= p(Buy)$$
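
One way to test this numerically is to compare every joint probability against the product of its marginals. The counts below are hypothetical (they are NOT the ones in the figure) and were chosen so that every country buys at the same 10% rate.

```python
from fractions import Fraction

# Hypothetical counts, not taken from the figure: each country buys at a 10% rate.
counts = {
    ("buy", "Mexico"): 10, ("no buy", "Mexico"): 90,
    ("buy", "US"): 50,     ("no buy", "US"): 450,
    ("buy", "Canada"): 20, ("no buy", "Canada"): 180,
}
total = sum(counts.values())

# Joint probabilities p(Buy, Country).
joint = {key: Fraction(n, total) for key, n in counts.items()}

# Marginals p(Buy) and p(Country), obtained by summing the joint.
p_buy, p_country = {}, {}
for (b, c), p in joint.items():
    p_buy[b] = p_buy.get(b, 0) + p
    p_country[c] = p_country.get(c, 0) + p

# Independent if p(Buy, Country) == p(Buy) * p(Country) for every cell.
independent = all(joint[(b, c)] == p_buy[b] * p_country[c] for (b, c) in joint)
print(independent)
```

With real data the equality will rarely hold exactly, so in practice one would compare within a tolerance or use a statistical test.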

## Manipulating Bayes Rule
Let's make Bayes Rule look more like the form that we will use in this course. We know:
### $$p(A|B)=\frac{p(A,B)}{p(B)}$$
The same holds with A and B swapped:
### $$p(B|A)=\frac{p(B,A)}{p(A)}$$
And since:
### $$p(A,B)=p(B,A)$$
We can write:
### $$p(A|B)=\frac{p(B|A)*p(A)}{p(B)}$$
Now, oftentimes we may not have $p(B)$ directly, but it is just the marginal of the joint probability $p(A,B)$, summed over all values of A. It looks like:
### $$p(B)=\sum_Ap(A,B) = \sum_Ap(B|A)*p(A)$$
If we are working with continuous distributions, the sum turns into an integral.
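
A minimal sketch of the discrete case, with made-up values for the prior p(A) and the likelihood p(B|A) at one fixed value of B. It computes p(B) by summing over A and then applies Bayes rule to each value of A.

```python
# Hypothetical numbers, for illustration only.
p_a = {"a1": 0.3, "a2": 0.7}            # prior p(A)
p_b_given_a = {"a1": 0.9, "a2": 0.2}    # likelihood p(B=b | A) for one fixed b

# p(B=b) = sum over A of p(B=b | A) * p(A)
p_b = sum(p_b_given_a[a] * p_a[a] for a in p_a)

# Bayes rule for each value of A: p(A | B=b) = p(B=b | A) * p(A) / p(B=b)
posterior = {a: p_b_given_a[a] * p_a[a] / p_b for a in p_a}

print(p_b)
print(posterior)
```

Because the denominator is the sum of the numerators, the posterior values add up to 1.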

Another way to think of this is that the denominator is just a normalization constant (Z) that ensures the distribution sums to one.
### $$p(A|B)=\frac{p(B|A)*p(A)}{Z}$$

Another way of saying this is that:
### $$p(A|B)\propto p(B|A)*p(A)$$
Sometimes this is used when we are trying to find the argmax of a distribution:
### $$\arg\max_A p(A|B)$$
So in this case, we don't need to know the actual value of the probability, just the particular A that gives us the maximum probability. Because Z is independent of A:
### $$\arg\max_A p(A|B) = \arg\max_A p(B|A)p(A)$$
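
To illustrate why Z can be dropped, the sketch below takes the argmax over both the unnormalized product p(B|A)p(A) and the normalized posterior. The prior and likelihood values are hypothetical.

```python
# Hypothetical numbers, for illustration only.
prior = {"a1": 0.3, "a2": 0.7}        # p(A)
likelihood = {"a1": 0.9, "a2": 0.2}   # p(B=b | A) for one fixed b

# Unnormalized posterior: p(B|A) * p(A)
unnormalized = {a: likelihood[a] * prior[a] for a in prior}

# Normalized posterior: divide by Z = sum over A of p(B|A) * p(A)
z = sum(unnormalized.values())
normalized = {a: value / z for a, value in unnormalized.items()}

# Dividing every entry by the same positive Z cannot change which one is largest.
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_normalized = max(normalized, key=normalized.get)
print(best_unnormalized, best_normalized)
```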

## Bayes for Classification
In the context of the Bayes Classifier, y represents the class, and x represents the data.
### $$p(y|x)=\frac{p(x|y)*p(y)}{p(x)}$$
We refer to p(x|y) as the **generative distribution**, because it tells us what the features look like for a specific class y, which we are already given.
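
As a rough sketch of how these pieces fit together, here is a toy classifier with two classes and a single discrete feature. The class names, feature values, and all probabilities are hypothetical, and the function `predict` is illustrative rather than the implementation used in this course. It scores each class with log p(x|y) + log p(y) and skips p(x), since p(x) does not depend on y.

```python
import math

# Hypothetical toy model: two classes, one discrete feature.
prior = {"y0": 0.4, "y1": 0.6}                 # p(y)
generative = {                                  # p(x | y)
    "y0": {"x0": 0.7, "x1": 0.3},
    "y1": {"x0": 0.1, "x1": 0.9},
}

def predict(x):
    """Return the class y that maximizes log p(x|y) + log p(y)."""
    scores = {y: math.log(generative[y][x]) + math.log(prior[y]) for y in prior}
    return max(scores, key=scores.get)

print(predict("x0"), predict("x1"))
```

Working in log space here follows the same reasoning as the underflow discussion: with many features, the product of probabilities would quickly become too small to represent.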

Note that while the Bayes classifier does make use of Bayes rule, it does NOT necessarily make use of Bayesian statistics.

--- 

# Probability Exercise

![exercise1.png](attachment:exercise1.png)

![exercise1a.png](attachment:exercise1a.png)

![exercise1b.png](attachment:exercise1b.png)

Because the first 15 tosses are given, they are no longer random. So we only need to consider the next 180 tosses, of which we expect 90 to be heads.
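
The figure of 90 can be checked exactly. The sketch below assumes a fair coin (the exercise statement is in the image, so this is an assumption) and computes the expected number of heads in the 180 remaining tosses from the binomial distribution. The first 15 tosses do not appear anywhere in the calculation.

```python
from fractions import Fraction
from math import comb

n = 180  # tosses that are still random

# Assuming a fair coin: P(k heads) = C(n, k) / 2**n,
# so E[heads] = sum over k of k * C(n, k) / 2**n.
expected_heads = Fraction(sum(k * comb(n, k) for k in range(n + 1)), 2 ** n)
print(expected_heads)
```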

If you guessed the incorrect answer, know that this mistake is so common that it has a name:

![exercise1c.png](attachment:exercise1c.png)