# Probability

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Sample Spaces and Events

An **experiment** is any activity or process whose outcome is subject to uncertainty. It is a procedure that can be infinitely repeated and *has a well-defined set of possible outcomes*.

---
### The Sample Space of an Experiment

> We call the *set of all possible outcomes* of an experiment the **sample space** of the experiment, denoted by $S$.

For example:

- One of the simplest experiments is _tossing a coin_. The possible outcomes of this experiment are _"the coin comes up heads"_ and _"the coin comes up tails"_. We may write the sample space of tossing a coin as $$S = \{H,T\}$$ where $H$ represents _"heads"_ and $T$ represents _"tails"_.

- Consider the experiment of _tossing a six-faced die_. If we are interested in the number that shows on the top face, the sample space is $S=\{1,2,3,4,5,6\}$. If we are interested only in whether the number on the top face is odd or even, then the sample space is $S=\{\text{odd},\text{even}\}$.

- Consider an experiment that consists of _examining a single fuse to see whether it is defective_. The sample space for this experiment can be abbreviated as $S=\{N,D\}$ where $N$ represents _"not defective"_ and $D$ represents _"defective"_. If we instead _examine three fuses in sequence_ and note the result of each examination, then an outcome for the entire experiment is any sequence of $N$'s and $D$'s of length 3, so $$S=\{NNN, NND, NDN, DNN, NDD, DND, DDN, DDD\}.$$

- An experiment consists of flipping a coin and then flipping it a second time if a head occurs on the first flip. If a tail occurs on the first flip, then a die is tossed once. If we are interested in the side of the coin that comes up and the top face of the die, the sample space of this experiment is $$S=\{HH, HT, T1, T2, T3, T4, T5, T6\}.$$

We call each outcome in a sample space a **sample point**. For example, when tossing a coin, the sample space is $S=\{H,T\}$, so the outcomes $H$ and $T$ are sample points.

We will commonly be interested in the size of the sample space, that is, *the number of sample points in the sample space*, abbreviated $n(S)$. For the examples shown above, we can simply count the sample points.

- $S=\set{H,T} \longrightarrow n(S)=2.$
- $S=\set{1,2,3,4,5,6} \longrightarrow n(S)=6.$
- $S$ being a sample space of examining three fuses in sequence $\longrightarrow n(S)=8.$
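
Counting by hand becomes tedious as sample spaces grow. As a minimal sketch (the variable names are illustrative assumptions, not part of the original notes), the three-fuse sample space can be enumerated with `itertools.product` and its size read off with `len`:

```python
from itertools import product

# Each fuse is either not defective ("N") or defective ("D").
fuse_states = ["N", "D"]

# The sample space is every length-3 sequence of fuse states.
sample_space = ["".join(seq) for seq in product(fuse_states, repeat=3)]

print(sample_space)       # all sequences of N's and D's of length 3
print(len(sample_space))  # n(S)
```

Changing `repeat` gives the sample space for a different number of fuses.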

---
### Events

> An **event** is *any collection of outcomes contained in the sample space (any subset of sample space)*. An event is **simple** if it consists of exactly one outcome and **compound** if it consists of more than one outcome.
> $$\text{Event} \subseteq S.$$

When an experiment is performed, a particular event $E$ is said to occur if the resulting outcome is contained in $E$. In general, exactly one simple event will occur, but many compound events will occur simultaneously.

For example, consider the experiment of tossing three coins in sequence. The sample space of this experiment is $S=\set{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}$. Thus there are eight simple events, each consisting of a single sample point of the sample space, among which are $E_1=\set{HHH}$ and $E_2=\set{HTT}$. Some compound events include:

- $A = \set{HHT, HTH, THH} = $ the event where exactly one coin comes up tails.
- $B = \set{HHH, TTT} = $ the event that all three coins come up the same side.
- $C = \set{HHT, HTH, THH, HTT, THT, TTH, TTT} = $ the event that at least one coin comes up tails.
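
Since events are subsets of the sample space, they map naturally onto Python sets. The sketch below rebuilds events $A$ and $B$ from their verbal descriptions and checks whether an event occurs for a given outcome; the helper name `occurs` is an assumption made for illustration.

```python
from itertools import product

# Sample space for tossing three coins in sequence.
sample_space = {"".join(seq) for seq in product("HT", repeat=3)}

# A: exactly one coin comes up tails.
A = {outcome for outcome in sample_space if outcome.count("T") == 1}
# B: all three coins come up the same side.
B = {outcome for outcome in sample_space if len(set(outcome)) == 1}

def occurs(event, outcome):
    """An event occurs if the observed outcome is one of its members."""
    return outcome in event

observed = "HHT"
print(occurs(A, observed), occurs(B, observed))
```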

---
### Some Relations from Set Theory

**An event is just a set**, so any operations that can be done on sets can also be done on events to create another event:

1. The **complement** of an event $E$, denoted by $E'$ or $\bar{E}$ or $\neg E$, is the event that *$E$ does not occur*.
2. The **union** of two events $A$ and $B$, denoted by $A \cup B$ and read *"A or B"*, is the event where *either $A$ or $B$ or both occur*.
3. The **intersection** of two events $A$ and $B$, denoted by $A \cap B$ and read *"A and B"*, is the event where *both $A$ and $B$ occur*.
4. If $A \cap B = \emptyset$ where $\emptyset$ is the *null event* (the event consisting of no outcomes), then $A$ and $B$ are said to be **mutually exclusive** or **disjoint** events. That is, they cannot occur simultaneously.

<table>
    <tr>
        <td> 
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Venn10.svg/225px-Venn10.svg.png" alt="complement" style="height: 150px;"/>
        </td>
        <td> 
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/30/Venn0111.svg/330px-Venn0111.svg.png" alt="union" style="height: 150px;"/> 
        </td>
        <td> 
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Venn0001.svg/330px-Venn0001.svg.png" alt="intersection" style="height: 150px;"/>
        </td>
    </tr>
    <tr>
        <td style="text-align: center;">Complement</td>
        <td style="text-align: center;">Union</td>
        <td style="text-align: center;">Intersection</td>
    </tr>
</table>
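
These operations correspond directly to Python's built-in set operators. The following is a small sketch using a single die roll; the particular events are assumptions chosen only for illustration.

```python
# Sample space of one roll of a six-faced die.
S = {1, 2, 3, 4, 5, 6}

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3
C = {1, 3}     # the roll is 1 or 3

complement_A = S - A           # A does not occur
union_AB = A | B               # A or B (or both) occur
intersection_AB = A & B        # both A and B occur
a_c_disjoint = A.isdisjoint(C) # True when A and C share no outcomes

print(complement_A, union_AB, intersection_AB, a_c_disjoint)
```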

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## What is Probability?

> The term **probability** refers to the study of *randomness and uncertainty*. In any
situation in which one of a number of possible outcomes may occur, the discipline of probability provides methods for quantifying the chances, or likelihoods, associated with the various outcomes. — Jay L. Devore, 2012

Let's begin the discussion by looking at a data example. *"According to the National Center for Health Statistics, in 2015, there were about 4 million babies born in the U.S., and 48.8% of the newborns were girls."* So, based on the statistics, if we look at a baby born on some day in 2015 in the U.S., then the chance of that baby being a girl is 48.8% (and the chance of being a boy is accordingly 51.2%). In notation, we write:

$$
P(\text{event}) = \text{the probability that the "event" will happen}.
$$

In this case, we write: $P(\text{``newborn is a girl"}) = 48.8\%$.

Conventionally, probabilities are always between 0 and 1, and we can convert them into percentages if we want (e.g. 0.5 = 50%, 0.1 = 10%, 1.0 = 100%). We commonly use symbols to represent events. For example, we might define $A$ to be the event that a newborn in the U.S. in 2015 is a girl, so we write

$$
P(\text{``newborn is a girl"}) \;=\; P(A) \;=\; 0.488
$$

The example above shows that the **probability of an event** is defined as the ***proportion of time* the event occurs** in many repetitions. This is the standard definition of probability and it requires that we look at a chance experiment that can be repeated many times.
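
To see the long-run interpretation at work, one can simulate many repetitions and watch the relative frequency settle near the underlying probability. This is only a sketch: it assumes each birth is an independent trial with probability 0.488 of being a girl, and the seed and sample size are arbitrary choices.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible
p_girl = 0.488          # probability taken from the example above
n_births = 100_000      # number of simulated repetitions (an assumption)

# Count how many simulated newborns are girls.
girls = sum(rng.random() < p_girl for _ in range(n_births))
relative_frequency = girls / n_births

print(relative_frequency)  # should be close to 0.488
```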

However, this definition is difficult to apply to single-run events, so people sometimes use a different notion of probability. For example, "The chance that my dad will call me today is 60%." Clearly this statement cannot be interpreted as a long-run frequency because "today" happens only once. In this case it is a subjective feeling of the chance that the event will happen. This is called a **subjective probability**, which is not based on experiments, and different subjects may assign different subjective probabilities to the same event.

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## The Basic Rules of Probability

The rules of probability we will discuss in this section come from the following set of axioms:

1. For any event $A$, $P(A) \ge 0$.
2. $P(S) = 1$.
3. If $A_1, A_2, A_3, \ldots$ is an infinite collection of disjoint events, then $$P(A_1 \cup A_2 \cup \cdots) = \sum_{i=1}^{\infty} P(A_i).$$

<h3 style="color: green; font-size: 14pt;">Complement Rule</h3>

The complement rule states that the probability that an event occurs and the probability that it does not occur must sum to $1$. Namely

$$
P(A \text{ does not occur}) = 1 - P(A).
$$

We call the event that $A$ does not occur the **complement** of $A$, denoted $A'$ or $\bar{A}$. Using this notation, we can write:

$$
P(A') = 1 - P(A).
$$

<h3 style="color: green; font-size: 14pt;">Equally Likely Outcomes</h3>

If an experiment has $n$ different possible outcomes, written as events $E_1, E_2, \ldots, E_n$, and all outcomes are equally likely to happen, then

$$
P(E_i) = \frac{1}{n}, \;\; \text{ for all } i = 1, 2, \ldots, n.
$$

For example, consider rolling a six-sided die. Each of its six faces is equally likely to come up, so the probability that each face will come up is $\frac{1}{6}$.

<h3 style="color: green; font-size: 14pt;">Addition Rule</h3>

If $A$ and $B$ are mutually exclusive, then the probability of $A$ or $B$ occurring is equal to the sum of their probabilities:

$$
P(A \text{ or } B) \;=\; P(A \cup B) \;=\; P(A) + P(B).
$$

This rule follows directly from axiom 3 mentioned previously. For two events $A$ and $B$ that are not mutually exclusive, we can use the principle of inclusion-exclusion from set theory and calculate:

$$
P(A \cup B) \;=\; P(A) + P(B) - P(A \cap B).
$$
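
The inclusion-exclusion form can be checked on a small equally likely sample space. The sketch below uses one roll of a fair die with two overlapping events; the events and the helper `prob` are assumptions made for illustration, and `fractions.Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

# One roll of a fair six-sided die: all outcomes equally likely.
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(S))

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

direct = prob(A | B)                               # P(A or B) by counting
via_rule = prob(A) + prob(B) - prob(A & B)         # inclusion-exclusion

print(direct, via_rule)
```

Without subtracting $P(A \cap B)$, the outcomes 4 and 6 would be counted twice.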

<h3 style="color: green; font-size: 14pt;">Independent Events and Multiplication Rule</h3>

The events $A$ and $B$ are called **independent** if **knowing that one occurs doesn't change the probability that the other occurs**. For example, suppose we roll a die twice, and let $A$ be the event that the first roll is a six, $B$ the event that the first roll is a one, and $C$ the event that the second roll is a six. The events $A$ and $B$ are _not_ independent because if $A$ occurs, we know that $B$ cannot occur. This means that knowing one event occurs affects the probability of the other. On the other hand, $A$ and $C$ are independent since the result of the first roll does not affect the result of the next roll.

If $A$ and $B$ are independent, then the probability that both $A$ and $B$ occur is equal to the product of their probabilities:

$$
P(A \text{ and } B) \;=\; P(A \cap B) \;=\; P(A)P(B).
$$
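
The multiplication rule also gives a practical test for independence: compare $P(A \cap B)$ with $P(A)P(B)$. The sketch below does this by enumerating two rolls of a fair die; the three events and the helper names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling a fair die twice.
S = set(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(S))

def independent(a, b):
    """Two events are independent when P(a and b) = P(a) * P(b)."""
    return prob(a & b) == prob(a) * prob(b)

first_is_six = {o for o in S if o[0] == 6}
first_is_one = {o for o in S if o[0] == 1}
second_is_six = {o for o in S if o[1] == 6}

print(independent(first_is_six, second_is_six))  # different rolls
print(independent(first_is_six, first_is_one))   # mutually exclusive events
```

Note that mutually exclusive events with nonzero probability are never independent, since one occurring rules out the other.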

---
<h3 style="color: green; font-size: 14pt;">An Example of Applying Probability Rules</h3>

Suppose we roll a die three times and ask *"What is the probability of rolling at least one six?"*.

We can apply the method of **total enumeration** to solve this problem. The idea is to split the event we are interested in into multiple possibilities. In this case we could write *"at least one six"* as *"six on the first roll" **or** "six on the second roll" **or** "six on the third roll"*. For convenience, we define $s_1$, $s_2$, and $s_3$ to be the events *"six on the first roll"*, *"six on the second roll"*, and *"six on the third roll"* respectively. So we can write:

$$
P(\text{``at least one six"}) = P(s_1 \cup s_2 \cup s_3).
$$

Seeing the "or" (union) between the events, we might want to apply the addition rule, but the events $s_1$, $s_2$, and $s_3$ are not mutually exclusive, since more than one of them can happen in the same run of the experiment. Therefore we cannot apply the addition rule for mutually exclusive events.

Let's consider using the complement rule, so we have:

$$
P(\text{``at least one six"}) = 1 - P(\text{``no six in three rolls"}).
$$

Now, we could write *"no six in three rolls"* as *"no six on the first roll" **and** "no six on the second roll" **and** "no six on the third roll"*. That is:

$$
P(\text{``at least one six"}) = 1 - P(s_1' \cap s_2' \cap s_3').
$$

The events $s_1'$, $s_2'$, and $s_3'$ are independent since the occurrence of one does not affect the others. So we can apply the multiplication rule to this expression:

$$
P(\text{``at least one six"}) = 1 - P(s_1')P(s_2')P(s_3').
$$

We know that each face of the die is equally likely to come up. Thus, by the rule of equally likely outcomes, the probability of rolling a six is $1/6$, which is the value of $P(s_1)$, $P(s_2)$, and $P(s_3)$. By the complement rule, we get $P(s_1') = P(s_2') = P(s_3') = 1 - 1/6 = 5/6$. Finally, we can substitute these values into the equation:

$$
\begin{align}
P(\text{``at least one six"}) &= 1 - P(s_1')P(s_2')P(s_3') \\
&= 1 - (\frac{5}{6} \times \frac{5}{6} \times \frac{5}{6}) \\
&= 1 - \frac{125}{216} \\
&\approx 0.4213  \\
&= 42.13\%
\end{align}
$$

And that's the final answer!
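
As a sanity check, the same probability can be obtained by brute force: list all $6^3 = 216$ equally likely sequences of three rolls and count those containing a six. This is a sketch with illustrative variable names, using exact fractions.

```python
from fractions import Fraction
from itertools import product

# Every equally likely sequence of three die rolls.
rolls = list(product(range(1, 7), repeat=3))

# Sequences that contain at least one six.
with_six = [r for r in rolls if 6 in r]
p_by_counting = Fraction(len(with_six), len(rolls))

# The complement-rule calculation from the text.
p_by_complement = 1 - Fraction(5, 6) ** 3

print(p_by_counting, p_by_complement, float(p_by_counting))
```

Both approaches give $91/216$, in agreement with the derivation above.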

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Conditional Probability

Conditional probability is a measure of the probability of an event occurring, **given that another event is already known to have occurred**. We write "the conditional probability of A given B" as $P(A|B)$.

For example, suppose we analyze spam emails and find that a spam email has an 8% chance of containing the word *"money"*, which is higher than the 1% chance of the same word occurring in ham emails (emails that are not spam). We may write:

$$
P(\text{``money"} \;|\; \text{spam}) = 8\% \quad\text{ and }\quad P(\text{``money"} \;|\; \text{ham}) = 1\%.
$$

The conditional probability $P(A|B)$ can be understood as the fraction of probability of $B$ which intersects with $A$:

$$
P(A|B) = \frac{P(A \cap B)}{P(B)}.
$$

The formula is often rearranged as:

$$
P(A \cap B) = P(A)P(B|A) = P(B)P(A|B).
$$

This formula is called the **general multiplication rule**, which extends the multiplication rule to two events that are not independent. Notice that in the special case where $A$ and $B$ are independent, knowing that $B$ happened does not affect the probability of $A$, so $P(A|B) = P(A)$, which implies $P(A \cap B) = P(B)P(A)$. This is the multiplication rule mentioned earlier.

---
<h3 style="color: green; font-size: 14pt;">Calculation Example</h3>

Given the information:

- 20% of all emails are spam.
- There is an 8% chance that a spam email contains the word "money".
- There is a 1% chance that a ham email contains the word "money".

We ask _"What is the probability that the word 'money' appears in an email?"_.

From the data, we write:

- $P(\text{spam}) = 0.2$.
- $P(\text{``money"}\;|\;\text{spam}) = 0.08$.
- $P(\text{``money"}\;|\;\text{ham}) = 0.01$.

And we can write the event *"the word 'money' appears in an email"* as *"the word 'money' appears in spam"* or *"the word 'money' appears in ham"*. Since those two events are mutually exclusive, we can apply the addition rule:

$$
P(\text{``money"}) = P(\text{``money" and spam}) + P(\text{``money" and ham}).
$$

Now apply the general multiplication rule, so we have:

$$
\begin{align*}
P(\text{``money"}) &= P(\text{``money"}\;|\;\text{spam})P(\text{spam}) + P(\text{``money"}\;|\;\text{ham})P(\text{ham}) \\
&= (0.08)(0.2) + (0.01)(1 - 0.2) \\
&= 0.024 = 2.4\%.
\end{align*}
$$

Therefore, the probability that the word 'money' appears in an email is 2.4%.
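
The same arithmetic can be written out in a few lines of Python. This is a minimal sketch: the variable names are illustrative, and it uses the 20% spam rate together with the 8% and 1% word rates for spam and ham emails.

```python
# Given information, expressed as probabilities between 0 and 1.
p_spam = 0.20
p_money_given_spam = 0.08
p_money_given_ham = 0.01

# Every email is either spam or ham, so P(ham) follows from the complement rule.
p_ham = 1 - p_spam

# Addition rule over the two disjoint cases, each expanded
# with the general multiplication rule.
p_money = p_money_given_spam * p_spam + p_money_given_ham * p_ham

print(round(p_money, 4))
```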

This kind of calculation can be applied to build a *spam filter*. But when we do build one, the question we need to answer is *"what is the probability that an email is spam, given that it contains a particular word?"*. For instance, we need to find
$$
P(\text{email is spam} \;|\; \text{email contains the word ``money"})
$$

But by observation (e.g. scanning tons of emails), we can only get $P(\text{email contains the word ``money"} \;|\; \text{email is spam})$, which is the reverse of what we want. We will discuss the method used to reverse the conditional probability in the next section.

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Bayes' Theorem

Suppose that we know the value of $P(A|B)$ but we want to calculate $P(B|A)$, then what can we do? Let's start by rearranging the conditional probability formula:

$$
P(B|A) = \frac{P(B \cap A)}{P(A)}.
$$

Since the order of "and" (intersect) doesn't matter, we can write the general multiplication rule as:

$$
P(A \cap B) = P(B)P(A|B) = P(A)P(B|A) = P(B \cap A).
$$

We can then plug this into the conditional probability formula:

$$
P(B|A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \cap B)}{P(A)} = \frac{P(B)P(A|B)}{P(A)}
$$

And that is what we wanted: we derived a formula that expresses $P(B|A)$ in terms of $P(A|B)$. This formula is called **Bayes' theorem**.

In real applications, the value of $P(A)$ is usually not given directly. However, we can calculate it as shown in the previous section. Therefore, Bayes' theorem is often written explicitly with $P(A)$ expanded as

$$
P(B|A) \;=\; \frac{P(B)P(A|B)}{P(A)} \;=\; \frac{ P(B)P(A|B) }{ P(B)P(A|B) + P(B')P(A|B') }
$$
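
The expanded form translates into a short function. The sketch below is one possible way to write it (the function and parameter names are assumptions), applied to the spam numbers from the previous section to reverse $P(\text{``money"} \;|\; \text{spam})$ into $P(\text{spam} \;|\; \text{``money"})$.

```python
def bayes(prior_b, p_a_given_b, p_a_given_not_b):
    """Return P(B|A) from P(B), P(A|B), and P(A|B')."""
    # Denominator: P(A) expanded over the two cases B and B'.
    p_a = prior_b * p_a_given_b + (1 - prior_b) * p_a_given_not_b
    return prior_b * p_a_given_b / p_a

# B = "email is spam", A = "email contains the word 'money'".
p_spam_given_money = bayes(prior_b=0.20, p_a_given_b=0.08, p_a_given_not_b=0.01)

print(round(p_spam_given_money, 4))
```

With these inputs the numerator is $0.016$ and the denominator is $0.024$, so about two thirds of emails containing "money" would be spam.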

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Bayesian Analysis & Example Case Studies

As we saw in the last section, Bayes' theorem is built on conditional probability. Bayesian probability interprets a probability as a *degree of belief* in an event, and conditional probability describes how that belief depends on the probability (belief) of another event.

In Bayesian analysis, we begin with some **prior** knowledge (or prior belief) about the event. Then, as we examine new data, we update the prior probability using Bayes' theorem to arrive at the **posterior** knowledge (or posterior belief).

A great example would be a spam filter. We might have a dataset of emails in which 20% of the emails are spam, so the prior probability that an email is spam is 20%. After that, we examine the emails for certain keywords, calculate the probabilities that the keywords appear in spam or non-spam emails, and use these probabilities to update the prior probability to get a better estimate of whether a new email is spam or not.

---
<h3 style="color: green; font-size: 14pt;">Example : False Positives</h3>

Suppose that 1% of the population has a certain disease. If an infected person is tested, then there is a 95% chance that the test is positive. If the person is not infected, then there is a 2% chance that the test gives an erroneous positive result (*'false positive'*).

To determine whether the test is effective, we ask: "What is the probability that a person who tests positive actually has the disease?"

Let $D$ be the event that a person has the disease and $p$ be the event that a person tests positive. Here is the data we have:

- $P(D) = 0.01$,
- $P(p|D) = 0.95$,
- $P(p|D') = 0.02$.

What we want to know is $P(D|p)$. We can see that the probability we want is the reverse of what we know, so Bayes' theorem can be applied to solve the problem:

$$
P(D|p) = \frac{P(D)P(p|D)}{P(p)}.
$$

Since $P(p)$ is not directly known, we use the expanded form of Bayes' theorem:

$$
\begin{align*}
P(D|p) &= \frac{ P(D)P(p|D) }{ P(D)P(p|D) + P(D')P(p|D') } \\
&= \frac{ (0.01)(0.95) }{ (0.01)(0.95) + (1 - 0.01)(0.02) } \\
&= \frac{0.0095}{0.0095 + 0.0198} \\ \\
&\approx 0.3242 = 32.42\%
\end{align*}
$$

We can see that the performance of the test is relatively low, despite a very small false positive rate. This is because the proportion of the population with the disease is very small compared to the proportion without the disease. As a result, even a small error rate, when applied to the much larger group of non-infected individuals, can lead to a significant number of false positives.
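
Thinking in whole-number counts makes this effect easier to see. The sketch below applies the stated rates to a hypothetical population of 100,000 people (the population size is an assumption chosen so that every count is an integer).

```python
population = 100_000  # hypothetical population size

# 1% of the population has the disease.
diseased = population * 1 // 100
healthy = population - diseased

# 95% of diseased people test positive (true positives).
true_positives = diseased * 95 // 100
# 2% of healthy people test positive by mistake (false positives).
false_positives = healthy * 2 // 100

# Among everyone who tests positive, the fraction who are actually diseased.
p_disease_given_positive = true_positives / (true_positives + false_positives)

print(true_positives, false_positives, round(p_disease_given_positive, 4))
```

The false positives outnumber the true positives roughly two to one, simply because the healthy group is so much larger.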

The technical term **"false positive"** refers to an error in which a test result incorrectly indicates the *presence* of a condition (e.g. a disease when the disease is not present), while a **false negative** is the opposite error, where the test result incorrectly indicates the *absence* of a condition when it is actually present.

---
<h3 style="color: green; font-size: 14pt;">Case Study: Warner's Randomized Response Model</h3>

Suppose we want to know *"What percentage of students have cheated during an exam in college?"* This question can be tricky to answer, since students may not answer truthfully if we ask them directly in a survey.

A research method called **randomized response** proposed by S. L. Warner allows respondents to respond to sensitive issues while maintaining confidentiality using randomization. Chance decides, **unknown to the interviewer**, whether the question is to be answered truthfully, or "yes", regardless of the truth.

For example, we may run a survey that first requires each student to toss a coin twice, then instructs them to answer question 1 if they get tails on the first toss and question 2 if they get heads on the first toss. We may set the two questions as

- Question 1: Have you ever cheated on an exam in college?
- Question 2: Did you get tails on the second toss?

We do not know whether a "yes" answer is due to the student cheating or to getting tails on the second toss. This should put the student at ease to answer truthfully.

Although we do not know what an individual answer means, we can estimate the proportion of cheaters using all the answers collectively. First, we split the event that a student answers "yes" into two parts:

$$
\begin{align*}
P(\text{``yes"}) &= P(\text{``yes"} \cap \text{Q1}) + P(\text{``yes"} \cap \text{Q2}) \\
&= P(\text{Q1})P(\text{``yes"} \;|\; \text{Q1}) + P(\text{Q2})P(\text{``yes"} \;|\; \text{Q2}).
\end{align*}
$$

So the answer we want to know would be:

$$
P(\text{``yes"} \;|\; \text{Q1}) = \frac{ P(\text{``yes"}) - P(\text{Q2})P(\text{``yes"} \;|\; \text{Q2}) }{ P(\text{Q1}) }.
$$

All the terms can be determined:

- $P(\text{Q1})$ and $P(\text{Q2})$: The chance that the student answers question 1 or question 2, which is the chance that the student gets tails or heads on the first toss. Each is 50%.
- $P(\text{``yes"} \;|\; \text{Q2})$: The chance that the student answers "yes" to question 2, which is equal to the chance that the student gets tails on the second toss. This is also 50%.
- $P(\text{``yes"})$: The proportion of students who answered "yes", which can be determined from the survey results.

Suppose that 27 students answered "yes" and 30 students answered "no". So we estimate

$$
P(\text{``yes"} \;|\; \text{Q1}) = \frac{(27/57) - (0.5)(0.5)}{0.5} \approx 0.4474 = 44.74\%.
$$

And that is the estimated fraction of cheaters.
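
The estimate is easy to recompute for other survey results. The function below is a sketch of the calculation, not a full implementation of Warner's model; its name and parameters are assumptions, with defaults matching the fair-coin design described above.

```python
def estimate_sensitive_proportion(yes_count, no_count,
                                  p_q1=0.5, p_yes_given_q2=0.5):
    """Estimate P("yes" | Q1) from the pooled survey answers."""
    # Observed proportion of "yes" answers across all respondents.
    p_yes = yes_count / (yes_count + no_count)
    # Respondents answer either question 1 or question 2.
    p_q2 = 1 - p_q1
    # Remove the share of "yes" answers explained by question 2,
    # then rescale by the share of respondents who answered question 1.
    return (p_yes - p_q2 * p_yes_given_q2) / p_q1

estimate = estimate_sensitive_proportion(yes_count=27, no_count=30)
print(round(estimate, 3))
```

With small samples the estimate can be noisy, and it can even fall outside the range 0 to 1 when the observed "yes" proportion is far from what the design expects.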

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)