# Probability

So we just finished talking about systems and about data types, and now we're going to talk about probability. Why? What does coin flipping have to do with it? It's not so much that we're interested in probability but we're interested in what we can do with probability.

Consider the coin flipping. If we think of this from a Systems Thinking perspective, what determines which face of the coin lies up after flipping it? Obviously, the coin has something to do with it: its shape, size, weight, balance, etc. Coins will behave differently on the Earth than on the Moon or elsewhere in the universe, so gravity, atmosphere, etc. all play a role. There's the size and shape of the hand and the force applied. Finally, there are this universe's general laws of physics. Whatever it is, it's not "chance" (it's not *arbitrary*).

In fact, it's very likely (increasingly certain) that nothing is "random" or truly probabilistic in the sense that word is generally understood. The one holdout has always been Quantum Physics (hence Einstein's famous quip "God does not play dice with the Universe" and Bohr's rejoinder "Stop telling God what to do."). However, even during Einstein and Bohr's time, Pilot Wave theory was put forward as an alternative explanation. Bohr disliked it because he liked his probabilistic equations better...but he was confusing the map for the territory. Even if Pilot Wave theory is true (it is gathering evidence), that doesn't mean that it isn't easier to treat quantum phenomena *as if* they were "truly" probabilistic.

One might think that complex (non-linear) systems would be another candidate for "truly" random, except that complex systems are most definitely deterministic (that is a defining characteristic). The problem here is that small variances in the estimated parameters of the equations can lead to hugely different results. Therefore their behaviors are difficult to predict even though they are deterministic. Again, this just strengthens the case for using probability as a tool for dealing with such systems. It doesn't mean they are inherently random themselves, whatever that might mean.

It is perhaps unfortunate that the interest in probability started with gambling... Interestingly enough, there are times when you actually want to introduce actual randomness into the process so that variables we don't know or care about cannot influence the outcomes. We'll talk more about that when we address experimental design.

Long story short, when we flip a coin, and we don't know or we simply ignore all of those other factors, we need a way of working with the resulting uncertainty over the outcome. Probability allows us to do that. Ultimately, however, our real interest in probability lies with its inverse. We don't generally want to know the probability of getting heads given a fair coin...when we see heads, we want to know if the coin was fair. We will discuss this problem of "inverse probability" later.

## Definition of Probability
For us *probability* means "our degree of certainty in an outcome out of all possible outcomes". This lets us talk about our certainty of alien life on other planets or that a fair coin will come up heads or that a customer will visit our site or that a new drug will alleviate symptoms. It also does not commit us to talking about specific numbers. We can say, "the probability is low" but we will generally use numbers to represent that degree of certainty. However, when we use numbers, they must adhere to certain properties (see "Axioms" below).

This is not the only possible definition of probability. Although from the outside it might appear that probability/statistics is a united field of study and that the major problems are "settled", this is not so. Anyone aware of the controversies in their own specialties can appreciate the problems this can cause for outsiders. The two main approaches to probability and statistics go by the names of "Frequentist" and "Bayesian". I will follow a predominantly Bayesian approach throughout this course although I will discuss Frequentist statistics because you are very likely to have encountered them or you will encounter them.

## Axioms of Probability
There are Bayesian-derived axioms but they are a bit more complicated, so we will use the Kolmogorov Axioms in our discussion here. These axioms are pretty straightforward and they define the basic properties of a probability measure. We will define different types of probability later but for now we take "P()" to mean "the probability of". The argument may be a set of events such as A = {"male", "female"} or B = {"AL", "AK", "AR", ..., "WY"}, in which case P(A) returns the probability of each element of A (which can be thought of as a table of probability values). The argument may also pick out a single element, as in P(A="male"), which returns the probability of that single element of A. Following the conventional notation, when confusion is unlikely, we can also write P(male) to mean the same as P(A="male").

If we let $W$ be the set of all possible outcomes and $w$ be some particular outcome or elementary event and let $E$ be some set of outcomes called an event then the axioms of probability are:

1. $P(w) \geq 0$
2. $P(W) = \sum_i P(w_i) = 1$
3. $P(w_i \cup w_j) = P(w_i) + P(w_j)$

The first axiom states that a probability, a degree of certainty, cannot be negative. Although one can certainly think of an interpretation of a negative probability (perhaps our certainty against an event happening), it is cleaner to think of all probabilities as being non-negative.

The second axiom says that we must be 100% certain that some outcome in $W$ will occur: the probabilities of all the possible outcomes sum to 1.

The third axiom says that the probability of outcome $w_i$ *or* $w_j$ is equal to the sum of their individual probabilities, $P(w_i)$ and $P(w_j)$. If our degree of belief in rain tomorrow is 0.23 and our degree of belief in snow tomorrow is 0.10 then our degree of belief in either rain or snow tomorrow must be 0.33. This axiom is known as the **Additive Law of Probability**. Note that this is only true if we're talking about mutually exclusive events (if it cannot both rain and snow at the same time).

If our events are not mutually exclusive, then we can use an alternate form of the axiom that removes "double counting".

3a. $P(E_i \cup E_j) = P(E_i) + P(E_j) - P(E_i \cap E_j)$

Why the difference? In the first case, we are working with outcomes. In the second case, we are working with events, which are technically sets of outcomes. An example of $W$ might be the sides of a six-sided die, $W = \{1, 2, 3, 4, 5, 6\}$. In contrast, $E_i$ might be "all even valued sides of the die" and $E_j$ might be "all sides whose value is < 4". While the *outcomes* in $W$ are definitely mutually exclusive, the *events* $E_i$ and $E_j$ are not (they have the value 2 in common).
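The die example can be checked by direct enumeration. The following is a minimal sketch, assuming a fair die (each outcome has probability 1/6, which the text does not state); the helper name `prob` is ours and purely illustrative.

```python
from fractions import Fraction

# Outcomes of a six-sided die; fairness is an assumption made for this sketch.
W = {1, 2, 3, 4, 5, 6}
p = {w: Fraction(1, 6) for w in W}

def prob(event):
    """Probability of an event (a set of outcomes): the sum over its outcomes."""
    return sum((p[w] for w in event), Fraction(0))

E_i = {w for w in W if w % 2 == 0}  # even-valued sides: {2, 4, 6}
E_j = {w for w in W if w < 4}       # sides with value < 4: {1, 2, 3}

# Direct calculation over the union of the two events.
direct = prob(E_i | E_j)

# Axiom 3a: add the two probabilities, then remove the double-counted overlap.
via_3a = prob(E_i) + prob(E_j) - prob(E_i & E_j)

print(direct, via_3a)  # 5/6 5/6
```

Simply adding $P(E_i)$ and $P(E_j)$ would give 1, which counts the shared outcome 2 twice.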

The **power set** of a set X is the set of all subsets of X, including X itself and the empty set. For example, {1, 2, 3} has the power set {{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1}, {2}, {3}, {}}. You can think of events as being some power set over the set of all individual outcomes, only some of which may be of interest. If W defines the set of outcomes, then the power set defines possible events. Note that all outcomes are events but not all events are outcomes. So the power set of {All US States} will include {AK, HI}, {AR}, {ME, NM}, {CA, AZ, OR, WA, NV}. Some of these may be of interest to us and some may not, and some of these sets have names such as "Continental US" or "Western States". The upshot is that all outcomes are by definition mutually exclusive but events are not, so this modifies axioms #2 and #3.
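As an illustration, a power set can be enumerated with the standard library. This is only a sketch; the function name `power_set` is our own, and subsets are returned as tuples rather than sets.

```python
from itertools import chain, combinations

def power_set(xs):
    """Return every subset of xs as a tuple, from the empty set up to xs itself."""
    items = list(xs)
    return list(
        chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    )

events = power_set([1, 2, 3])
print(events)
# [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
print(len(events))  # 8, that is 2**3
```

A set of $n$ outcomes has $2^n$ possible events, so the six-sided die already has 64 of them.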

We will usually refer to events and need only worry about whether or not they're elementary or mutually exclusive. This meshes nicely with the language of software engineering in general and logging specifically.
    

## Types of Probability

There are different *types* of probability defined with respect to the events the probability is defined over.

### Joint Probability

Suppose we have two properties that we're interested in: what kind of community someone lives in and their income level. We can think of these as two different sets of events C(ommunity) = {urban, suburban, rural} and I(ncome) = {low, high} and define an event space that is the cross product of the two sets {(urban, low), (urban, high),...}.

If we are looking at actual counts of events and $\#()$ is the counting function then $\#(C=urban)$ is the number of times the event $C=urban$ appears in our data set and $\#(C=urban, I=high)$ is the number of times that the event $C=urban$ and $I=high$ appears in our data set.

Suppose we have the following count data:

| area | low | high |
|:----:|:---:|:----:|
| rural | 873 | 473 |
|suburban | 3957 | 4593 |
|urban | 5937 | 4974 |

Before we proceed, how does this count data fit in with our systems thinking from the previous notebook? What are we not completely certain about?

Let $\#()$ be the size of the data set, then we can define:

$P(C=urban, I=high) = \frac{\#(C=urban, I=high)}{\#()}$

In this case, we're taking "relative frequency" as our certainty in a particular event such as (urban, high). We can calculate all of the probabilities using the previous table and thus create our own table for the joint probability distribution which is easier than specifying them as tuples.

| area | low | high |
|:----:|:---:|:----:|
| rural | 0.04 | 0.02 |
|suburban | 0.19 | 0.22 |
|urban | 0.29 | 0.24 |

Each entry in the table is a particular probability estimate such as:

$P(C=urban, I=high) = P(urban, high) =  0.24$

$P(C=rural, I=low) = P(rural, low) = 0.04$

If we want to know the probability of the event (urban, low) **or** (urban, high) then by axiom #3, we have: 

$P(urban, low \cup urban, high) = P(urban, low) + P(urban, high) = 0.29 + 0.24 = 0.53$

and similarly, if we want to know the probability of (urban, high) **or** (suburban, high) we can calculate it as:

$P(urban, high \cup suburban, high) = P(urban, high) + P(suburban, high) = 0.24 + 0.22 = 0.46$
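The joint table can be reproduced from the counts. The sketch below assumes the counts are stored in a dictionary keyed by (community, income); that layout and the variable names are our own choices.

```python
# Count data from the table above, keyed by (community, income).
counts = {
    ("rural", "low"): 873,     ("rural", "high"): 473,
    ("suburban", "low"): 3957, ("suburban", "high"): 4593,
    ("urban", "low"): 5937,    ("urban", "high"): 4974,
}

total = sum(counts.values())  # #(), the size of the data set

# Relative frequency of each (community, income) pair.
joint = {event: n / total for event, n in counts.items()}

for event, pr in joint.items():
    print(event, round(pr, 2))

# Axiom 3: the pairs are mutually exclusive, so "or" is a sum.
p_urban = joint[("urban", "low")] + joint[("urban", "high")]
print(round(p_urban, 2))
```

Because the sketch adds unrounded values, it reports 0.52 for the urban total, whereas adding the rounded table entries gives 0.29 + 0.24 = 0.53. The difference is only rounding.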

### Conditional Probability

If joint probability is our degree of certainty in a joint event P(urban, low), for example, then the conditional probability is our degree of certainty in an event when we already know the outcome of at least one of the events in a joint probability space. For example, our degree of certainty in someone living in an urban community when we already know that their income is high is expressed as P(urban | high) where the vertical bar "|" means "given".

Effectively, when we know the value of at least one event in a joint probability distribution then our attention is limited to a single row or column in the joint probability table (at least in the 2d case...it's more complicated in higher dimensions but the effect is the same) and we can ignore all of the other events because they are simply not possible any longer.

By the 2nd Axiom of Probability, all probability distributions must sum to 1 therefore we need to normalize the remaining probabilities (of that single row or column) by dividing through by the total.

More formally, the definition of conditional probability is:

$P(A|B) = \frac{P(A, B)}{P(B)}$

and if you play with the math a bit, you will see that this is the same thing. Therefore,

**Conditional Probability, $P(Income|Community)$**

| area | low | high |
|:----:|:---:|:----:|
| rural | 0.65 | 0.35 |
|suburban | 0.46 | 0.54 |
|urban | 0.54 | 0.46 |

**Conditional Probability, $P(Community|Income)$**

| area | low | high |
|:----:|:---:|:----:|
| rural | 0.08 | 0.05 |
|suburban | 0.37 | 0.46 |
|urban | 0.55 | 0.50 |

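$P(low|urban) = P(urban, low)/P(urban) = 0.29/0.53 \approx 0.55$

Note that this calculation uses the rounded values from the joint table; the entry of 0.54 for $P(low|urban)$ in the table above comes from the unrounded counts ($5937/10911 \approx 0.544$).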
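Both conditional tables come from the same normalization step applied along different axes. A minimal sketch, assuming the counts are keyed by (community, income); the helper `normalize_by` is hypothetical.

```python
# Count data keyed by (community, income).
counts = {
    ("rural", "low"): 873,     ("rural", "high"): 473,
    ("suburban", "low"): 3957, ("suburban", "high"): 4593,
    ("urban", "low"): 5937,    ("urban", "high"): 4974,
}

def normalize_by(counts, axis):
    """Condition on position `axis` of each key (0 = community, 1 = income)."""
    totals = {}
    for key, n in counts.items():
        totals[key[axis]] = totals.get(key[axis], 0) + n
    # Dividing by the row (or column) total makes each row (or column) sum to 1.
    return {key: n / totals[key[axis]] for key, n in counts.items()}

given_community = normalize_by(counts, axis=0)  # P(Income | Community)
given_income = normalize_by(counts, axis=1)     # P(Community | Income)

print(round(given_community[("rural", "low")], 2))  # P(low | rural) -> 0.65
print(round(given_income[("rural", "low")], 2))     # P(rural | low) -> 0.08
```

Dividing counts gives the same answer as dividing the joint probability by the marginal, because the data set size cancels. The rural row also shows how different $P(low|rural)$ and $P(rural|low)$ can be.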

In general, with conditional probabilities, knowing that some event occurs often changes our information about the certainty or uncertainty of another event. Therefore, in general, we **cannot** say in advance which of the following holds:

$$P(A) < P(A | B), \quad P(A) = P(A | B), \quad \text{or} \quad P(A) > P(A | B)$$

Just as we can have a joint probability distribution over many different sets such as P(A, B, C, D, E, F), we can have a conditional probability distribution where the outcomes for some (at least one) of the sets are known:

$P(A, B | C, D) = \frac{P(A, B, C, D)}{P(C, D)}$

### Marginal or Prior Probability

When dealing with a joint probability distribution over many sets, there may sometimes be sets that we don't care about, that is, we're only interested in the probabilities of some subsets of events. The simplest example is having a joint probability distribution P(A, B) and only being interested in the probability distribution P(A) or P(B). In this case, we can marginalize out the events we are not interested in.^[Marginalization comes from the practice of calculating marginal probabilities in the actual margins of books and ledgers]

For example, if we have our joint probability distribution P(C, I) and we only care about community, C, we can marginalize out income:

$P(urban, low \cup urban, high) = P(urban, low) + P(urban, high) = 0.29 + 0.24 = 0.53$

$P(suburban, low \cup suburban, high) = P(suburban, low) + P(suburban, high) = 0.19 + 0.22 = 0.41$

$P(rural, low \cup rural, high) = P(rural, low) + P(rural, high) = 0.04 + 0.02 = 0.06$

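which leads to the following tables. The table entries are computed from the raw counts, so they can differ in the last digit from sums of the rounded joint probabilities (for example, 0.52 rather than 0.53 for urban):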

**Marginal Probability, $P(Community)$**

| area | any |
|:----:|:---:|
| rural | 0.06 |
|suburban | 0.41 |
|urban | 0.52 |

**Marginal Probability, $P(Income)$**

| area | low | high |
|:----:|:---:|:----:|
| any | 0.52 | 0.48 |
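Marginalizing is just summing the joint probabilities over the variable we do not care about. A minimal sketch, again assuming counts keyed by (community, income); the variable names are illustrative.

```python
# Count data keyed by (community, income).
counts = {
    ("rural", "low"): 873,     ("rural", "high"): 473,
    ("suburban", "low"): 3957, ("suburban", "high"): 4593,
    ("urban", "low"): 5937,    ("urban", "high"): 4974,
}
total = sum(counts.values())
joint = {event: n / total for event, n in counts.items()}

p_community = {}  # income summed out
p_income = {}     # community summed out
for (community, income), pr in joint.items():
    p_community[community] = p_community.get(community, 0.0) + pr
    p_income[income] = p_income.get(income, 0.0) + pr

print({c: round(pr, 2) for c, pr in p_community.items()})
# {'rural': 0.06, 'suburban': 0.41, 'urban': 0.52}
print({i: round(pr, 2) for i, pr in p_income.items()})
# {'low': 0.52, 'high': 0.48}
```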

Jaynes argued that there are only conditional probabilities and that we, at least, marginalize over "everything else" or "unknowns", usually because there is only one known or knowable value. For example, the canonical P(F=heads) = 0.5 is really P(F=heads|fair coin, physics, earth conditions) = 0.5 but since we assume all of this to be true, we essentially marginalize it out. Otherwise, the probability is $P(F=heads|\text{fair coin, physics, earth conditions})$ $\times$ $P(\text{fair coin, physics, earth conditions})$, which is just another way of saying that although we know that we're always dealing with a specific variable in a complex system, we're ignoring everything else for now.

## Rules of Probability

There are some handy rules that follow from the axioms of probability. In a probability course, you might be required to prove them. We will present them without proof.

### Monotonicity

$A \subseteq B \Rightarrow P(A) \leq P(B)$

This says that if A is a subset of B then the probability of A must not exceed the probability of B. This is really an abuse of notation. What we really mean is:

$A \subseteq B \Rightarrow \sum_i P(a_i) \leq \sum_j P(b_j)$

### Negation

$P(\neg{a}) = 1 - P(a)$

This follows from axiom #2. If we write out the summation and isolate the single event $a$ and then re-arrange terms, we get the above rule.

### Total Probability

This is also called the Law of Total Probability. It has the form:

$P(A) = \sum_i P(A|B=b_i) P(B=b_i)$

Remember that P(A) is a set of probabilities (one for each element in A). This rule makes a bit more sense if you break it out:

$P(A=a_1) = P(A=a_1|B=b_1) P(B=b_1) + P(A=a_1|B=b_2) P(B=b_2) + \ldots + P(A=a_1|B=b_n) P(B=b_n)$

We can derive the Law by looking at the definition of conditional probability for a single event:

$P(A=a_1|B=b_1) = \frac{P(A=a_1, B=b_1)}{P(B=b_1)}$

and re-arranging terms:

$P(A=a_1, B=b_1) = P(A=a_1|B=b_1)P(B=b_1)$

From our discussion about marginalization and Axiom #2, we know that:

$P(A=a_1) = P(A=a_1, B=b_1) + P(A=a_1, B=b_2) + \ldots + P(A=a_1, B=b_n)$

By substitution, we have:

$P(A=a_1) = P(A=a_1|B=b_1) P(B=b_1) + P(A=a_1|B=b_2) P(B=b_2) + \ldots + P(A=a_1|B=b_n) P(B=b_n)$

and for the entire set A, we have:

$P(A) = \sum_i P(A|B=b_i) P(B=b_i)$

Total Probability is very useful (believe it or not) because there are many situations where you don't know P(A) but you know P(A|B) and P(B). We will see this later.
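The Law of Total Probability can be checked against the community and income counts used earlier. This sketch takes A = Income and B = Community and recovers $P(high)$ from $P(high|Community)$ and $P(Community)$; the variable names are ours.

```python
# Count data keyed by (community, income).
counts = {
    ("rural", "low"): 873,     ("rural", "high"): 473,
    ("suburban", "low"): 3957, ("suburban", "high"): 4593,
    ("urban", "low"): 5937,    ("urban", "high"): 4974,
}
total = sum(counts.values())
communities = ["rural", "suburban", "urban"]

# P(B = b_i): marginal probability of each community.
p_c = {c: (counts[(c, "low")] + counts[(c, "high")]) / total for c in communities}

# P(A = high | B = b_i): conditional probability of high income in each community.
p_high_given_c = {
    c: counts[(c, "high")] / (counts[(c, "low")] + counts[(c, "high")])
    for c in communities
}

# Total probability: weight each conditional by how likely its condition is.
p_high = sum(p_high_given_c[c] * p_c[c] for c in communities)

# Direct calculation from the counts, for comparison.
p_high_direct = sum(counts[(c, "high")] for c in communities) / total

print(round(p_high, 2), round(p_high_direct, 2))  # 0.48 0.48
```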

### Chain Rule

Again, starting with the definition of conditional probability we have:

$P(A|B) = \frac{P(A, B)}{P(B)}$

and by re-arranging we have:

$P(A, B) = P(A|B)P(B)$

We can expand this to more sets:

$P(A,B,C) = P(A|B,C) P(B|C) P(C)$

$P(A,B,C,D) = P(A|B,C,D) P(B|C,D) P(C|D) P(D)$

We can expand in any order:

$P(A,B,C,D) = P(D|A,B,C) P(B|A,C) P(A|C) P(C)$

But this is really just a clever manipulation of the definition of conditional probability:

$P(A,B,C,D) = P(D|A,B,C) P(B|A,C) P(A|C) P(C)$

$P(A,B,C,D) = \frac{P(D,A,B,C)}{P(B,A,C)} \frac{P(B,A,C)}{P(A, C)}\frac{P(A, C)}{P(C)} P(C)$

Still, it can be a handy thing to know and it presents the foundation for Bayesian Networks.
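The chain rule can be verified numerically on any joint distribution. The sketch below uses a hypothetical joint distribution over three binary variables (the weights are arbitrary and chosen only for illustration) and exact fractions so the equality can be tested without rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution P(A, B, C) over three binary variables.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
states = list(product([0, 1], repeat=3))  # (a, b, c) tuples
joint = {s: Fraction(w, sum(weights)) for s, w in zip(states, weights)}

def marginal(keep):
    """Sum the joint over every variable whose index is not in `keep`."""
    out = {}
    for s, pr in joint.items():
        key = tuple(s[k] for k in keep)
        out[key] = out.get(key, Fraction(0)) + pr
    return out

p_c = marginal([2])      # P(C)
p_bc = marginal([1, 2])  # P(B, C)

for (a, b, c), p_abc in joint.items():
    p_a_given_bc = p_abc / p_bc[(b, c)]     # P(A | B, C)
    p_b_given_c = p_bc[(b, c)] / p_c[(c,)]  # P(B | C)
    # Chain rule: P(A, B, C) = P(A | B, C) P(B | C) P(C)
    assert p_a_given_bc * p_b_given_c * p_c[(c,)] == p_abc

print("chain rule holds for all", len(joint), "states")
```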

### Bayes Rule

Bayes Rule is a similar manipulation of conditional probability. We start with the definition of conditional probability:

$P(A|B) = \frac{P(A,B)}{P(B)}$

and re-arrange:

$P(A,B) = P(A|B)P(B)$

We can start with the "opposite" conditional probability:

$P(B|A) = \frac{P(A,B)}{P(A)}$

and re-arrange:

$P(A,B) = P(B|A)P(A)$

which means I can set these two equal to each other:

$P(B|A)P(A) = P(A|B)P(B)$

and re-arrange:

$P(B|A) = \frac{P(A|B)P(B)}{P(A)}$

which does not look particularly interesting until I start giving my sets interesting names: B=Hypothesis and A=Data:

$P(H|D) = \frac{P(D|H)P(H)}{P(D)}$

which says...the probability of the hypothesis (or model or parameter) *given* the data is equal to the probability of the data *given* the hypothesis times the probability of the hypothesis. This is then normalized by the probability of the data.

These probabilities all have names:

* P(H|D) = posterior probability
* P(D|H) = likelihood
* P(H)   = prior probability
* P(D)   = normalizer

Note that we almost never know P(D) and we will often use total probability to calculate it:

$P(D) = \sum_h P(D|H=h)P(H=h)$

Bayes Rule is particularly important to data science because it says exactly how we should change our degree of certainty given new evidence. It also plays a foundational role in several machine learning techniques (Naive Bayes Classifier, Bayesian Belief Networks). It is also the main formula for Bayesian *inference*.
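To make the update concrete, here is a minimal sketch for a discrete set of hypotheses. The scenario is hypothetical and not from the text: a coin is either fair, with P(heads) = 1/2, or biased, with P(heads) = 4/5; both are equally likely a priori, and we observe a single heads. The function name `posterior` is ours.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Bayes Rule over discrete hypotheses.

    prior:      {h: P(H=h)}
    likelihood: {h: P(D | H=h)} for the data D actually observed
    """
    # The normalizer P(D), computed by total probability.
    p_data = sum(likelihood[h] * prior[h] for h in prior)
    return {h: likelihood[h] * prior[h] / p_data for h in prior}

# Hypothetical priors and likelihoods, chosen only for illustration.
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
likelihood_heads = {"fair": Fraction(1, 2), "biased": Fraction(4, 5)}

post = posterior(prior, likelihood_heads)
print(post["fair"], post["biased"])  # 5/13 8/13
```

Seeing heads shifts some certainty toward the biased coin, but a single flip does not settle the question.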

## Independence, Conditional Independence

From before, we manipulated the definition of conditional probability as follows:

$P(A|B) = \frac{P(A,B)}{P(B)}$

and re-arrange:

$P(A,B) = P(A|B)P(B)$

Remember our interpretation of conditional probability: knowing what event in B happened gives us additional information that influences our expectations about which event in A will happen. If it doesn't, then $P(A|B) = P(A)$ and the expression above reduces to:
	
$P(A, B) = P(A)P(B)$

in which case A and B are said to be independent. This is known as the Multiplication Rule of Probability.

Remember from the chain rule that we can factor a joint probability distribution as we please. For example, we might have:

$P(A, B, C) = P(A|B, C)P(B|C)P(C)$

we can also do the same with a conditional joint probability:

$P(A,B| C) = P(A|B, C)P(B|C)$

which says given some event in C happened, what is our uncertainty in the occurrence of events in A and B and that we can factor our uncertainty into two parts: the probability of some event in A happening given that some event in B and some event in C happened times the probability of some event in B happening given some event in C happened. However, if the following is true:
	
$P(A,B|C) = P(A|C)P(B|C)$

then we can say that A and B are conditionally independent given C. Note that the above can be true and that the following is also true:

$P(A, B) \neq P(A)P(B)$

The one does not imply the other.
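A small constructed example shows conditional independence without marginal independence. All numbers below are hypothetical: C is a fair binary switch, and A and B are each 1 with probability 1/10 when C = 0 and 9/10 when C = 1, drawn independently once C is fixed.

```python
from fractions import Fraction
from itertools import product

p_c = {0: Fraction(1, 2), 1: Fraction(1, 2)}               # P(C)
p_one_given_c = {0: Fraction(1, 10), 1: Fraction(9, 10)}   # P(A=1|C) = P(B=1|C)

def bern(p, x):
    """Probability that a binary variable with P(1) = p takes the value x."""
    return p if x == 1 else 1 - p

# Joint built so that P(A, B | C) = P(A | C) P(B | C) by construction.
joint = {
    (a, b, c): bern(p_one_given_c[c], a) * bern(p_one_given_c[c], b) * p_c[c]
    for a, b, c in product([0, 1], repeat=3)
}

def p(a=None, b=None, c=None):
    """Marginal probability; variables left as None are summed out."""
    return sum(
        pr for (x, y, z), pr in joint.items()
        if a in (None, x) and b in (None, y) and c in (None, z)
    )

# Conditionally independent given C = 0 ...
lhs = p(a=1, b=1, c=0) / p(c=0)
rhs = (p(a=1, c=0) / p(c=0)) * (p(b=1, c=0) / p(c=0))
print(lhs, rhs)  # 1/100 1/100

# ... but not independent once C is marginalized out.
print(p(a=1, b=1), p(a=1) * p(b=1))  # 41/100 1/4
```

Learning A tells us something about C, and therefore about B, unless C is already known.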

## Probabilistic Fallacies

As it turns out, we humans are not particularly good with probabilities and make quite a few mistakes. These mistakes are fallacies in probabilistic reasoning.

### Conjunction Fallacy

Monotonicity plays a very important role in judging the probabilities of events. Perhaps the most famous example arises in the form of the Conjunction Fallacy illustrated as follows:

"Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice and participated in anti-nuclear demonstrations.

"Which is more probable?

1. Linda is a bank teller.
2. Linda is a bank teller and active in the feminist movement."

The majority of people, when asked, picked #2 but this cannot be. If B is the set of all bank tellers then the set of feminist bank tellers, A, is a subset of B and certainly no larger. Being an element of A cannot be more probable than being an element of B.

### Gambler's Fallacy and Hot-Streak Fallacy

A common fallacy is that there is some sort of overarching or underlying power or force that keeps probabilities in balance. The most common way that this fallacy shows itself is as the [Gambler's Fallacy](https://en.wikipedia.org/wiki/Gambler%27s_fallacy). In the Gambler's Fallacy, the person incorrectly believes that because a rare event has not happened, its time must come or it must happen in order to balance the probabilities in some way. For example, if "red 13" has not come up in Roulette for a while, then this "must" happen any time now.

The Hot-Streak Fallacy is similar in that it assumes that some extraordinary event must continue to happen.
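Both fallacies can be checked by exact enumeration rather than simulation. The sketch below assumes independent flips of a fair coin (our assumption for illustration) and asks how likely heads is on the fourth flip after three tails in a row.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely sequences of four fair coin flips.
sequences = list(product("HT", repeat=4))

# Keep only the sequences that start with a streak of three tails.
after_streak = [s for s in sequences if s[:3] == ("T", "T", "T")]

# Among those, how often is the fourth flip heads?
p_heads_after_streak = Fraction(
    sum(s[3] == "H" for s in after_streak), len(after_streak)
)
print(p_heads_after_streak)  # 1/2
```

The streak changes nothing: heads is neither "due" (Gambler's Fallacy) nor less likely because tails is "hot" (Hot-Streak Fallacy).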

### Inverse Probability Fallacy

The Inverse Probability Fallacy relates to conditional probabilities, specifically by believing that the following are the same:

$P(A | B) = P(B | A)$

An example of this fallacy is falsely believing that the probability that it will rain (actual rain) given that rain was forecast (prediction of rain) is equal to the probability that rain was forecast (prediction of rain) given that it is now raining (actual rain). A moment's reflection shows that these are unlikely to be the same.

### Base Rate Fallacy

Suppose we're asked what religion we think Garth is and, knowing that he's from middle America, we can guess that he's notionally a Christian. Suppose we further learn that Garth is a goth and wears dark clothing with various mystical symbols on it, our estimation of Garth's religion would probably swing in the direction of being a Satanist.

However, the base rate (prior) of being a Satanist is really quite low. There are 2,000,000,000 Christians in the world and only 100,000s of Satanists. While the probability of Garth being a Christian might go down, knowing that Garth is a goth shouldn't really flip our sense of the probability from most likely a Christian to most likely a Satanist. Thinking this way is called the [Base Rate Fallacy](https://en.wikipedia.org/wiki/Base_rate_fallacy).

This is also related to Bayes Rule which tells us exactly how much we should change our prior when reviewing evidence.

### Prosecutor's Fallacy

The Prosecutor's Fallacy is related to the Inverse Probability Fallacy. It refers to a Prosecutor (and sometimes Defense Attorneys) arguing the wrong thing. This can result in a mistrial. As it turns out, this rarely happens among the attorneys but can happen in expert testimony. This is really a family of fallacies, the first of which relates to Bayes Rule and is related to the Inverse Probability Fallacy:

$P(innocence|evidence) = \frac{P(evidence|innocence)P(innocence)}{P(evidence)}$

The [Fallacy](https://en.wikipedia.org/wiki/Prosecutor%27s_fallacy) is committed when the prosecutor assumes that just because $P(evidence|innocence)$ is small ("if he were innocent, the evidence would be really unlikely"), $P(innocence|evidence)$ must be equally small. This happens a lot with forensic evidence, especially DNA evidence. But it simply isn't true that if $P(evidence|innocence) = 1:1,000,000$ then $P(innocence|evidence) = 1:1,000,000$.
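A worked example shows how far apart the two quantities can be. Every number below except the 1:1,000,000 match probability is a hypothetical assumption: a pool of 10,000,000 possible sources, exactly one of whom is guilty, no other evidence, and a test that always matches the guilty person.

```python
from fractions import Fraction

# Hypothetical scenario; only the 1-in-a-million figure comes from the text.
population = 10_000_000
p_guilt = Fraction(1, population)  # prior: one guilty person in the pool
p_innocence = 1 - p_guilt

p_evidence_given_innocence = Fraction(1, 1_000_000)
p_evidence_given_guilt = Fraction(1)  # assume the guilty person always matches

# P(evidence) by total probability over the two hypotheses.
p_evidence = (
    p_evidence_given_innocence * p_innocence
    + p_evidence_given_guilt * p_guilt
)

# Bayes Rule: P(innocence | evidence).
p_innocence_given_evidence = (
    p_evidence_given_innocence * p_innocence / p_evidence
)
print(float(p_innocence_given_evidence))
```

Under these assumptions $P(innocence|evidence)$ is roughly 0.91, not one in a million, because about ten innocent people in the pool are expected to match alongside the one guilty person.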

Another version of the Fallacy confuses the prior and conditional probabilities, that is, it assumes that $P(A)=P(A|B)$ and is thus related to the Base Rate Fallacy.

## Distributions

We've sort of danced around the idea of what P(A) is. We've called it our degree of certainty in the outcome of each event in A. In order to go further with probability we need to introduce the notion of a random variable and a probability distribution.

### Events and Random Variables

Our discussion so far has focused on sets and elements with P(A) representing the probability of observing some element of A, $a_i$. But A, B, C are just the features of our data: purchase {yes, no} and state {"AR", "AK", "AL",..., "WY"}.

Each column in a data set is the recorded outcome or event from some such possible set of outcomes. The technical name given for this column is **random variable**. Random variable is an awful name because it conjures up images of flipping coins and throwing dice and we've been trying hard to dispel that image of probability. A random variable is just a mathematical formalism that says whenever we observe the value of the variable, the value observed is uncertain because it changes. Each row is a realization of each of the random variables we are observing, and you can think of each observation as a realization of the joint probability distribution described by the random variables represented by the columns. The collection of random variables represents the various states/outcomes of the overall stochastic process we are interested in, or at least the ones we know about or can measure.

So far we've only talked about discrete events where the possible events are either categorical/labels or a small, finite set of numbers. However, a random variable can take on continuous values as well although technically the values are not infinite (they are limited either by the resolution of the measuring device or the IEEE floating point representations or ultimately memory; computers cannot actually represent all real numbers).

### Relative Frequency, Density

We haven't explicitly said it but P(A) is generally called a probability distribution (so-called because the probability mass of 1 is distributed over the possible events). Bear in mind, however, that when we talk about conditional probability, we actually have multiple probability distributions. For example, $P(A|B)$, is actually one probability distribution for each value of B.

More specifically, if A is a discrete random variable, P(A) is called a **probability mass function**, PMF. The heights of the PMF sum to 1 as per Axiom #2.

However, if A has continuous valued outcomes (such as a height measurement), we have a problem because the probability of any given value is zero (or at least infinitesimally small) and while the area under such a function would still be one, it would be indistinguishable from the x-axis.

So instead of representing continuous random variables as a probability mass function, they are represented as a **probability density function**. The probability density function (PDF) represents the probability of values in a range as the area under the curve in that range. However, the area under the entire PDF still sums to 1.

A related concept (which is only well-defined for random variables whose values can be ordered) is the cumulative mass function (CMF) or cumulative density function (CDF); the standard name for both is cumulative distribution function. The CDF shows the probability of observing (realizing) a particular value of A or lower. CDFs are important for exploratory data analysis (EDA) as we will see.
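For a discrete random variable, the cumulative function is just a running total of the PMF taken in sorted order. A minimal sketch, using the sum of two fair six-sided dice as an illustrative random variable (the example is ours, not from the text):

```python
from fractions import Fraction
from itertools import accumulate, product

# PMF of the sum of two fair dice: each of the 36 ordered pairs has probability 1/36.
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, Fraction(0)) + Fraction(1, 36)

# Cumulative function: P(A <= a) is the running total of the PMF over sorted values.
values = sorted(pmf)
cdf = dict(zip(values, accumulate(pmf[v] for v in values)))

print(pmf[7])   # 1/6
print(cdf[7])   # 7/12
print(cdf[12])  # 1
```

The PMF heights sum to 1, as Axiom #2 requires, so the cumulative function always ends at 1.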