# 3. Introduction to Probability Theory
This is a post that I have been excited to get to for over a year now. Probability theory plays an incredibly interesting and unique role in the study of machine learning and artificial intelligence techniques. It gives us a wonderful way of dealing with **uncertainty**, and shows up in everything from **Hidden Markov Models** and **Bayesian Networks** to **Causal Path Analysis**, **Bayesian A/B testing**, and many other areas.

There are several things that make probability so interesting, and I am going to try and cover all of them in this post and several others. The main points are as follows:
* Probability is rather intertwined with statistics; we will dissect the differences and also how they fit together.
* Many paradoxes arise from probability, which makes it rather unintuitive to understand. **Simpson's Paradox** and **the Monty Hall problem** are two hallmark probability paradoxes. We will go through each in detail to discuss why they are paradoxical, and how to resolve them.
* There are many different ways to visualize and conceptualize probability.
* **Discrete** vs. **Continuous** probability distributions cause certain visualizations to break down, causing a gap in understanding.

This post will not cover all of the above, but it will certainly help us build a base from which we can climb to higher levels of understanding. To begin, I want to start from a historical perspective, digging into how probability first came to be.

## 3.1 Historical Background and Definitions
At its core, probability theory was defined incredibly well by **Pierre-Simon Laplace** in 1814:

> Probability theory is nothing but common sense reduced to calculation. ... [Probability] is thus simply a fraction whose numerator is the number of favorable cases and whose denominator is the number of all the cases possible ... when nothing leads us to expect that any one of these cases should occur more than any other.

This simple summary should always be kept in mind when working with probability. The following terms must also be defined:

* **Trial**: A single occurrence with an outcome that is uncertain until we observe it. 
    * For example, rolling a single die.
* **Outcome**: A possible result of a trial; one particular state of the world. What Laplace calls a case. 
    * For example: 4.
* **Sample Space**: The set of all possible outcomes for the trial. 
    * For example, {1, 2, 3, 4, 5, 6}.
* **Event**: A subset of outcomes that together have some property we are interested in. 
    * For example, the event "even die roll" is the set of outcomes {2, 4, 6}.
* **Probability**: As Laplace said, the probability of an event with respect to a sample space is the "number of favorable cases" (outcomes from the sample space that are in the event) divided by the "number of all the cases" in the sample space (assuming "nothing leads us to expect that any one of these cases should occur more than any other"). Since the number of favorable cases can never exceed the number of all cases, probability will always be a number between 0 (representing an impossible event) and 1 (representing a certain event).
    * For example, the probability of an even die roll is 3/6 = 1/2.
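
To make these definitions concrete, here is a minimal sketch of Laplace's counting rule in Python. The function name `laplace_probability` is my own choice for illustration, and the sketch assumes every outcome in the sample space is equally likely (Laplace's "nothing leads us to expect" condition).

```python
from fractions import Fraction

def laplace_probability(event, sample_space):
    """Favorable cases over all cases, assuming equally likely outcomes."""
    favorable = event & sample_space  # keep only outcomes that can actually occur
    return Fraction(len(favorable), len(sample_space))

sample_space = {1, 2, 3, 4, 5, 6}  # all outcomes of one die roll
even_roll = {2, 4, 6}              # the event "even die roll"

p_even = laplace_probability(even_roll, sample_space)
print(p_even)  # 1/2
```

Using `Fraction` keeps the result exact, which mirrors the "numerator over denominator" wording of the definition.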
   
There is one more term that I would like to discuss before moving on to the general rules of probability; that term is **probabilistic**. The term probabilistic is thrown around frequently, often without a sound definition of what it really entails. Probabilistic can be defined as:

> **Probabilistic:** Subject to or involving chance variation

Another way of looking at it is that it deals with **uncertainty**. Now, uncertainty can come about in the real world in a variety of ways (no, I am not going to talk about rounds of cards):
1. We have a partial knowledge of the state of the world.
2. Noisy observations.
3. *Phenomena* not covered by our model.
4. Inherent **Stochasticity** 

The entire goal of probability theory is to allow us to deal with uncertainty in ways that are principled and proven. Now, these definitions must be understood and internalized before moving on. One of the troubles with probability is the new vocabulary that it introduces, so be sure to look back on these definitions if anything is unclear as we move forward.

## 3.2 Probability Rules
Now, in general there are three very commonly used probabilities: **marginal**, **joint** and **conditional** probability. I like to introduce them through an example, as I feel it helps keep the concepts more concrete. Suppose we have the situation shown in the table below:

<img src="https://drive.google.com/uc?id=1aNGxEap1BWNCnBbQlo8VJn8kp1RKtHs8" width="600">

Here we are looking at historical data surrounding an ecommerce site and purchases from different countries. Each cell corresponds to the number of people from a specific country who either purchased or did not purchase something from the site. This is known as a **discrete distribution**, which we will cover in more depth later in this post. For now, let's try to answer some questions about this distribution, and in the process, get a feel for the main types of probability.

One final thing before we get going here. I want to quickly give an informal definition for the term **random variable**. It comes up very frequently in the discussion of probability, and while we don't need to get into the technicalities surrounding it yet, we should have a general idea of what it represents:

> A **random variable** is a quantitative variable whose value depends on chance in some way. 

This is a very broad and informal definition; however, it should at least convey the general idea. The value of a random variable is not known until it is observed. We can think of the roll of a die as a random variable, or if we were to flip a coin 3 times and set $X$ to be the number of heads, $X$ would then be a random variable that could take on the values: $\{0,1,2,3\}$
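
As a quick sanity check on the coin-flip example, here is a small sketch that enumerates every sequence of three flips and tallies the number of heads in each. It assumes a fair coin, so all eight sequences are equally likely; the variable names are just for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 = 8 equally likely sequences of three fair coin flips.
sequences = list(product("HT", repeat=3))

# X maps each sequence to the number of heads it contains.
heads_counts = Counter(seq.count("H") for seq in sequences)

# Probability of each value of X: sequences with that many heads / all sequences.
distribution = {
    x: Fraction(n, len(sequences)) for x, n in sorted(heads_counts.items())
}

for x, p in distribution.items():
    print(x, p)
# 0 1/8
# 1 3/8
# 2 3/8
# 3 1/8
```

The keys of `distribution` are the values $X$ can take, and the probabilities sum to 1.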

### 3.2.1 Marginal Probabilities
Say for a moment that we wanted to know:

> What is the probability that a user is from a specific country? 

Mathematically, that can be written as:

$$P(Country = c)$$

Where $c$ is the specific country you are interested in. Now, without knowing anything about marginal probabilities, I would guess that most people would end up with the following equation:

$$P(Country = c) = \frac{\text{users from } c}{\text{total number of users}}$$

This is entirely correct! Again, keeping in mind the original quote from Laplace, probability can very often be reduced to common sense and counting (at least in the discrete cases). So, we can do just that:

$$P(Country = Canada) = \frac{300 + 20}{300 + 20 + 50 + 500 + 10 + 200} = 0.30$$

$$P(Country = USA) = \frac{500 + 50}{300 + 20 + 50 + 500 + 10 + 200} = 0.51$$

$$P(Country = Mexico) = \frac{200 + 10}{300 + 20 + 50 + 500 + 10 + 200} = 0.19$$

Above we have just found the distribution $P(Country)$ over all countries. We can ensure that it is a correct and legal probability distribution by checking that it sums to 1:

$$\sum_{c} P(Country = c) = 0.3 + 0.51 + 0.19 = 1$$
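
The same counting can be sketched in a few lines of Python. The `counts` dictionary below is my own encoding of the table, with the numbers taken from the equations above and `1`/`0` standing for purchased/did not purchase.

```python
# Counts keyed by (country, buy); buy is 1 for a purchase and 0 otherwise.
counts = {
    ("Canada", 0): 300, ("Canada", 1): 20,
    ("USA", 0): 500, ("USA", 1): 50,
    ("Mexico", 0): 200, ("Mexico", 1): 10,
}

total = sum(counts.values())  # 1080 users overall

# Add up the users per country, ignoring whether they bought anything.
country_totals = {}
for (country, _buy), n in counts.items():
    country_totals[country] = country_totals.get(country, 0) + n

# Divide by the grand total to turn counts into probabilities.
p_country = {c: n / total for c, n in country_totals.items()}

for c, p in p_country.items():
    print(f"P(Country = {c}) = {p:.2f}")
# P(Country = Canada) = 0.30
# P(Country = USA) = 0.51
# P(Country = Mexico) = 0.19
```

The unrounded values sum to 1 (up to floating-point error), which is the same check we performed by hand.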

Now, as I said earlier, I would have expected that even with _no knowledge_ of marginal probabilities most people would come to that conclusion. However, I want to give a way of thinking about the problem as it relates to marginal probabilities and **marginalization** (we will go over marginalization in much greater detail soon). 

When dealing with a distribution that contains multiple variables of interest, such as _purchase_ and _country_ in this case, we often want to acquire information about a single variable (such as country in the above example). To do that, we need to **marginalize** out the variable that we are not interested in (purchase in the above example). We intuitively did it above, but there is a great way to think about this visually. Let's reconsider our table, but with a different shading of our cells:

<img src="https://drive.google.com/uc?id=11fqKDqAHHhFRFEaIcMncbZZETLjVMvR-" width="600">

Here, I have shaded the _purchased_ cells gray since we are not interested in them. What we are interested in is the country information in the red and blue rows. We want to get rid of the purchase information, which by definition is what it means to **marginalize out** a variable. The visual trick here is to think of _collapsing_ the rows, so that the _purchase_ is erased and we are only left with our variable of interest: _country_.

<img src="https://drive.google.com/uc?id=1xh5D-zTZUZkneKF8w2FoPC7omc1Eu0en" width="600">

Now this visual collapsing is really just a summation along each individual column, but the ability to think about a visual collapse helps in understanding how we get rid of the variables that aren't of interest. There is one issue, though, that I am sure you are wondering about at this point: this final table isn't a probability distribution, is it? No, it is not! The row does not sum to 1, which actually brings us to our last step: dividing each entry by the sum of the entire row to ensure it is a valid distribution:

<img src="https://drive.google.com/uc?id=1I_Y8LWwRx7cWUGwFIfXkDRRUW1Qw1KPx" width="600">

I hope that visual has helped in understanding what is going on here, but if it still hasn't quite set in, fear not! We will be going over a few more examples shortly. The main thing to take away is that on an intuitive level, a marginal distribution is used when trying to dissociate certain variables so you can gain insight into only the variable of interest.

For those interested in a more "textbook" definition, we can define a marginal distribution as:

> The marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset.

A marginal distribution gives the probabilities of various values of the variables in the subset without reference to the values of the other variables. **Marginal variables** are those variables in the subset of variables being _retained_. So, in our above case that would be _country_. These concepts are "marginal" because they can be found by summing values in a table along rows or columns, and writing the sum in the margins of the table. I don't like that visualization quite as much, but for reference that would look like:

<img src="https://drive.google.com/uc?id=1yxysUkkytZavgqZKv6xomXocRWJWOtAr" width="600">

The distribution of the marginal variables (the marginal distribution) is obtained by marginalizing – that is, focusing on the sums in the margin – over the distribution of the variables being discarded, and the discarded variables are said to have been marginalized out.
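
Here is a rough sketch of "writing the sums in the margins" in code. I am assuming the table is laid out with one row per purchase outcome and one column per country; the row labels and variable names are my own.

```python
countries = ["Canada", "USA", "Mexico"]

# One row per purchase outcome, one column per country (assumed layout).
table = {
    "no purchase": [300, 500, 200],
    "purchase": [20, 50, 10],
}

# Row margins: sum across countries, which marginalizes out country.
row_margins = {label: sum(row) for label, row in table.items()}

# Column margins: sum down each column, which marginalizes out purchase.
col_margins = [sum(col) for col in zip(*table.values())]

grand_total = sum(col_margins)

# Dividing a margin by the grand total gives the marginal distribution.
p_country = {c: n / grand_total for c, n in zip(countries, col_margins)}
p_buy = {label: n / grand_total for label, n in row_margins.items()}

print(col_margins)   # [320, 550, 210]
print(row_margins)   # {'no purchase': 1000, 'purchase': 80}
```

Both margins come from the same table: which one you read off depends only on which variable you decide to discard.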

### 3.2.2 Joint Distribution 
Now let's move on to another question altogether:

> What is the probability that someone will make a purchase and that they are from a certain country? 

Mathematically we can write that as:

$$P(Buy=b, Country=c)$$

Where $b$ is either 0 (they did not buy something) or 1 (they did buy something), and $c$ is again country. This visually looks like our original table:

<img src="https://drive.google.com/uc?id=1aNGxEap1BWNCnBbQlo8VJn8kp1RKtHs8" width="600">

Now, as before, I would guess that most people could probably reason about this problem based on general intuition. We know that there are two outcomes for whether a user buys (yes or no) and there are three countries. So, there are 6 total combinations of $b$ and $c$ that the above equation could be filled with. We can easily solve for each of the probabilities as follows:

$$P(Buy=1, Country=Canada)= \frac{20}{1080} = 0.019$$

$$P(Buy=0, Country=Canada)= \frac{300}{1080} = 0.28$$

$$P(Buy=1, Country=USA)= \frac{50}{1080} = 0.046$$

$$P(Buy=0, Country=USA)= \frac{500}{1080} = 0.46$$

$$P(Buy=1, Country=Mexico)= \frac{10}{1080} = 0.0093$$

$$P(Buy=0, Country=Mexico)= \frac{200}{1080} = 0.19$$
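
As a sketch, the six joint probabilities can be computed in one pass by dividing every cell by the grand total. The `counts` dictionary is my own encoding of the table, keyed by `(country, buy)`.

```python
# Counts keyed by (country, buy); buy is 1 for a purchase and 0 otherwise.
counts = {
    ("Canada", 0): 300, ("Canada", 1): 20,
    ("USA", 0): 500, ("USA", 1): 50,
    ("Mexico", 0): 200, ("Mexico", 1): 10,
}

total = sum(counts.values())

# Each joint probability is simply cell count / total count.
joint = {cell: n / total for cell, n in counts.items()}

for (country, buy), p in joint.items():
    print(f"P(Buy={buy}, Country={country}) = {p:.4f}")

# A valid joint distribution sums to 1 over all six cells.
print(round(sum(joint.values()), 10))  # 1.0
```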

Above, we simply take each cell corresponding to $b$ and $c$ and divide its value by the total number of outcomes. Now, we just calculated the joint probabilities by what amounts to gut instinct. However, let's try and put _what just felt right_ into a more defined framework. For instance, what steps did we naturally take in order to calculate $P(Buy=1, Country=Canada)$?

Well, first and foremost we homed in on the column specifically associated with Canada. Because the table was already filled out we may not have even thought about it, but we actively _focused_ on a certain column. That means that we needed to know how probable it was for a user to be from Canada in the first place (that is a consequence of our focusing). This can be visualized as:

<img src="https://drive.google.com/uc?id=1OUBHqzdXcfMTyLHXtydL6Wz51uu4f_Sk" width="600">

Now, as just mentioned, this focusing must be accounted for in our calculation! By selecting the Canada column, we must define how probable a user is to be from Canada in the first place. Mathematically this looks like:

$$P(Country=Canada)$$

And from earlier, we know that is equal to:

$$P(Country=Canada) = \frac{320}{1080}$$

So, in defining a more robust way to reason about joint probabilities, we must ensure that the above is included! Now, after this focusing has occurred, we are only looking at:

<img src="https://drive.google.com/uc?id=1Dzu8tM-qRcIEcAFk6NNFmsVh_MsbULTO" width="400">

Remember, the only way that we are able to ignore the rest of the table is if _we account for it in our equation_. That is why we need to make sure we include $P(Country=Canada)$ in our final calculation. 

So, with this focused view, what is the probability that someone from Canada buys something? Well, that can be written as:

$$P(Buy=1 \mid Country=Canada) = \frac{20}{320}$$

The way that we _encode_ the focusing that we had just done is via the $\mid$ symbol. In English, it means _given_. So, the entire equation above can be interpreted as: "The probability that a user does buy something, given that they are from Canada". This idea of focusing will be defined more thoroughly in the next section concerning **conditional probability** and **conditioning** on certain variables.

At this point, we have everything that we need to calculate $P(Buy=1, Country=Canada)$, via a more robust framework. Specifically, our equation is:

$$P(Buy=1, Country=Canada) = P(Buy=1 \mid Country=Canada) P(Country=Canada)$$

$$P(Buy=1, Country=Canada) = \frac{20}{320} \times \frac{320}{1080} = 0.019$$

The equation above is the standard way of expressing a joint probability in terms of a conditional and a marginal probability (often called the product rule). Written more generally, we have:

$$P(A,B) = P(A \mid B)P(B)$$
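
To convince ourselves that this holds for every cell and not just the Canada example, here is a small sketch that checks $P(A,B) = P(A \mid B)P(B)$ across the whole table. The helper names are hypothetical, and `Fraction` is used so the comparison is exact rather than subject to rounding.

```python
from fractions import Fraction

# Counts keyed by (country, buy); buy is 1 for a purchase and 0 otherwise.
counts = {
    ("Canada", 0): 300, ("Canada", 1): 20,
    ("USA", 0): 500, ("USA", 1): 50,
    ("Mexico", 0): 200, ("Mexico", 1): 10,
}
total = sum(counts.values())

def country_total(country):
    """Number of users from `country`, regardless of purchase."""
    return counts[(country, 0)] + counts[(country, 1)]

def p_country(country):
    return Fraction(country_total(country), total)

def p_buy_given_country(buy, country):
    # Focus on one country: the denominator is that country's total only.
    return Fraction(counts[(country, buy)], country_total(country))

def p_joint(buy, country):
    return Fraction(counts[(country, buy)], total)

product_rule_holds = all(
    p_joint(buy, country) == p_buy_given_country(buy, country) * p_country(country)
    for (country, buy) in counts
)
print(product_rule_holds)  # True
```

The country total cancels between the two factors, which is exactly why the "focus, then account for the focus" reasoning gives back the cell count over the grand total.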

There is an additional visualization that can be rather helpful to further our understanding here:

<img src="https://drive.google.com/uc?id=1-yonC3jkzjoJ6pmm4M7MkfvMKpyKzYma" width="400"> 

We can think of the grey area above as our total probability space, within which there is a space that represents the probability of being from Canada, and another that represents the probability of buying something. Where those two areas overlap represents the probability of being from Canada and buying something.

With that we have just gone through an intuitive derivation of the formula for joint probability (in the discrete case). A few things are worth noting at this point:
* The table that we worked with above was the joint distribution
* Adding more variables grows the space of probabilities exponentially. For instance, we had 2 variables: one with 2 potential values and the other with 3, leaving us with a space of 6. If we added another variable with 2 potential values, our space would increase to 12. In the simplest case, where every variable is binary (and only holds 2 values), our space grows as $2^n$, where $n$ is the number of variables in our distribution. This is known as the **curse of dimensionality**
* Notice that the joint probabilities we found above are much smaller than the marginal probabilities. This is because our distribution must sum to one, and since the total number of possibilities grows exponentially with each additional variable, the individual probabilities tend to shrink just as quickly
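
The growth of the joint space is easy to see numerically. This is only a sketch; `joint_space_size` is a name I made up, and it simply multiplies the number of values each variable can take.

```python
from math import prod

def joint_space_size(cardinalities):
    """Number of cells in a joint table: the product of each variable's value count."""
    return prod(cardinalities)

print(joint_space_size([2, 3]))     # 6  (buy x country)
print(joint_space_size([2, 3, 2]))  # 12 (one extra binary variable)

# With n binary variables the table has 2**n cells.
for n in (1, 10, 20, 30):
    print(n, joint_space_size([2] * n))
```

If the probability mass were spread evenly, each cell would hold only `1 / joint_space_size(...)` of it, which is why individual joint probabilities get so small so fast.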

Now, just as we did with marginal probabilities, we can close with some dry definitions for those who are interested:

> Given random variables $X$, $Y$, that are defined on a probability space, the joint probability distribution for $X$, $Y$, is a probability distribution that gives the probability that each of $X$, $Y$, falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.

### 3.2.3 Conditional Probabilities
As before, we can transition to another question of interest:

> What is the probability that a user buys or doesn't buy, given that they are from a certain country? 

Again, just going off of gut instinct we can see that we start with the base table: 

<img src="https://drive.google.com/uc?id=1aNGxEap1BWNCnBbQlo8VJn8kp1RKtHs8" width="600">

We again perform the focusing that we talked about earlier, only now we have a new word for it: **conditioning**. Conditioning simply means that we are going to focus in on a subset of the distribution by fixing the value of one or more variables. Intuitively, after we have focused, our calculations look like:

$$P(Buy=1 \mid Country=Canada)= \frac{20}{320} = 0.06$$

$$P(Buy=0 \mid Country=Canada)= \frac{300}{320} = 0.94$$

$$P(Buy=1 \mid Country=USA)= \frac{50}{550} = 0.09$$

$$P(Buy=0 \mid Country=USA)= \frac{500}{550} = 0.91$$

$$P(Buy=1 \mid Country=Mexico)= \frac{10}{210} = 0.05$$

$$P(Buy=0 \mid Country=Mexico)= \frac{200}{210} = 0.95$$

Notice that the sum of the above results is three, not one. This is because country is no longer random at this point: it is given! So, the results for each given country sum to one, and since there are three countries all of the results sum to three.
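
Here is a short sketch that computes every conditional probability and confirms the "sums to three" observation. The `counts` dictionary is my own encoding of the table, and `Fraction` keeps the sums exact.

```python
from fractions import Fraction

# Counts keyed by (country, buy); buy is 1 for a purchase and 0 otherwise.
counts = {
    ("Canada", 0): 300, ("Canada", 1): 20,
    ("USA", 0): 500, ("USA", 1): 50,
    ("Mexico", 0): 200, ("Mexico", 1): 10,
}
countries = ["Canada", "USA", "Mexico"]

# Conditioning on a country: the denominator is that country's total only.
country_totals = {c: counts[(c, 0)] + counts[(c, 1)] for c in countries}
conditional = {
    (buy, c): Fraction(counts[(c, buy)], country_totals[c])
    for c in countries
    for buy in (0, 1)
}

# Each country has its own distribution over buy, and each sums to 1.
per_country_sums = {
    c: conditional[(0, c)] + conditional[(1, c)] for c in countries
}
grand_sum = sum(conditional.values())

for (buy, c), p in conditional.items():
    print(f"P(Buy={buy} | Country={c}) = {float(p):.4f}")
print(per_country_sums)
print(grand_sum)
```

In other words, the table of conditionals is really three separate distributions sitting side by side, one per country.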

Now if we looked at a specific example, where we are trying to determine the probability that a user does or doesn't buy given they are from the United States, that would look like:

<img src="https://drive.google.com/uc?id=1mu2pIDnrEN-YZOWRyAQb5xadyNsH40Oj" width="600">

Now recall the equation for the joint distribution:

$$P(Buy=1, Country=USA) = P(Buy=1 \mid Country=USA) P(Country=USA)$$

In our current situation, we are not trying to solve for the joint, but rather the conditional probability above:

$$ P(Buy=1 \mid Country=USA) = \frac{P(Buy=1, Country=USA)}{P(Country=USA)}$$

The above was just some basic algebraic manipulation, but it allows us to express our conditional probability in a way that may not be quite as intuitive. However, the nice thing about it is that we already have values for the two probabilities on the right-hand side of the equation!

$$ P(Buy=1 \mid Country=USA) = \frac{0.046}{0.51} = 0.09$$

And that is exactly what we had found above via our slightly more intuitive, gut-instinct approach. 
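
As a sketch, we can check that the two routes agree exactly when no rounding is involved. The numbers come from the table (50 US buyers, 550 US users, 1080 users overall); the variable names are just for illustration.

```python
from fractions import Fraction

total_users = 1080
usa_users = 550   # 500 non-buyers + 50 buyers
usa_buyers = 50

# Route 1: joint probability divided by the marginal probability.
p_joint = Fraction(usa_buyers, total_users)   # P(Buy=1, Country=USA)
p_usa = Fraction(usa_users, total_users)      # P(Country=USA)
via_formula = p_joint / p_usa

# Route 2: count directly inside the focused (USA-only) part of the table.
direct = Fraction(usa_buyers, usa_users)

print(via_formula, direct)           # 1/11 1/11
print(round(float(via_formula), 2))  # 0.09
```

The total number of users cancels in the division, which is why conditioning can be done either on probabilities or directly on counts.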

Now, using our visualization from before, we can think of finding $P(Buy=1 \mid Country=USA)$ as follows: 

<img src="https://drive.google.com/uc?id=1XpW1AGgUDGC_GliqeZ8xgZx-FyleKYhd" width="600">

We are specifically being _given_ information, in this case that the country is the USA. What that essentially means is that we are _focusing_ in on only the space where country is the USA. That space has a probability of 0.51. We then look at, within that space, what fraction of users buy something. In other words, we know that our total (focused) space has a size of 0.51 while our subspace where a user buys something is 0.046. We simply divide the latter by the former in order to compute our desired probability! 

So, we are (as always) counting the outcomes in the event of interest and then dividing by the total number of outcomes. In this case, the total number of outcomes is constrained (conditioned) by focusing on the country being the USA.

These two visualizations are very helpful because they provide two unique viewpoints with the same information conveyed, but through a different medium (you could even say they are isomorphic, in the informal sense of the word). 

### 3.2.4 Bayes Rule 


### 3.2.5 Marginalization

## 3.3 Discrete vs. Continuous Distributions


### 3.3.1 Discrete Distributions


### 3.3.2 Continuous Distributions


### Probability Density Functions