# Basic Probability

We often need to quantify uncertainty in the data, in the ML model, and in the predictions the model produces. Probability is the tool for **quantifying uncertainty**. Probability theory defines a **mathematical structure** to **describe the random outcomes of experiments**. This requires the idea of a **random variable**: a **function that maps the outcomes of a random experiment to numbers** describing the properties we are interested in.

A random process (flipping a coin, asking a girl out): outcome => numbers

Process: ask a girl out => 1 if she says yes, 0 if she says no

Associated with the random variable is a **function that measures the probability that a particular outcome or set of outcomes will occur** called the **probability distribution function**.

![](https://img-9gag-fun.9cache.com/photo/aAD4p79_700bwp.webp)

Using probability, we can consider a model of some process, where the underlying uncertainty is captured by random variables, and we use the rules of probability to derive what happens. In statistics, we observe that **something has happened** and try to figure out the **underlying process that explains the observations**.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from functools import partial
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

sns.set_style("whitegrid")

When all outcomes are equally likely, the probability of an event is:
$$
\frac{\text{number of ways it can happen}}{\text{total number of outcomes}}
$$


Example:
- Throwing dice: when a single die is thrown, there are 6 possible outcomes (1, 2, 3, 4, 5, 6)
- For a fair die, the probability of rolling any one of those values is 1/6
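
The counting rule above can be sketched in a few lines of Python. The helper name `probability` and the event "the roll is even" are illustrative choices, and the sketch assumes every outcome is equally likely.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Classical probability: favourable outcomes / total outcomes.

    Assumes every outcome in sample_space is equally likely.
    """
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))

die = range(1, 7)  # the six faces of a fair die

p_three = probability(lambda x: x == 3, die)     # rolling exactly a 3
p_even = probability(lambda x: x % 2 == 0, die)  # rolling an even number

print(p_three)  # 1/6
print(p_even)   # 1/2
```

Using `Fraction` keeps the results exact, which makes them easy to compare with hand calculations.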

![](https://i.imgur.com/7c1Zllz.png)

Question:

A die is thrown once. What is the probability that the score is a factor of 6?
- A. 1/6
- B. 1/2
- C. 2/3
- D. 1

## Basics

**The sample space** $\Omega$ is a fixed set of all possible outcomes. **The probability** $P(A)$ measures the probability that the event $A$ will occur.

* $P(\Omega) = 1$
* $0 \leq P(A) \leq 1$
* The **complement** of $A$ is $A^C$, and $P(A^C) = 1 - P(A)$
* If $A$ and $B$ are events, then $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
![](https://studywell.com/wp-content/uploads/2019/10/Venn-420x300.png)
* Two events $A$ and $B$ are **dependent** if knowing whether $A$ happens gives us information about whether $B$ happens (and vice versa). Otherwise, they are **independent**.
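
As a rough check, these rules can be verified on a fair die using Python sets. The events `A`, `B` and `C` are made up for illustration, and the independence test uses the standard product criterion $P(A \cap B) = P(A)P(B)$, which formalizes the informal definition above.

```python
from fractions import Fraction

omega = set(range(1, 7))   # sample space of a fair die
A = {2, 4, 6}              # event: the roll is even
B = {4, 5, 6}              # event: the roll is greater than 3
C = {1, 2}                 # event: the roll is at most 2

def P(event):
    """Probability of an event when all outcomes in omega are equally likely."""
    return Fraction(len(event), len(omega))

# Complement rule: P(A^C) = 1 - P(A)
assert P(omega - A) == 1 - P(A)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# Product criterion for independence
print(P(A & B), P(A) * P(B))  # 1/3 1/4, so A and B are dependent
print(P(A & C), P(A) * P(C))  # 1/6 1/6, so A and C are independent
```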

## Expected Value



The **expected value** of a random variable is **the sum (or integral) of the possible outcomes weighted by their probability**. It can be interpreted as the **long-run average of many independent samples from the given distribution.**

Expected value is defined as

$$
E[X] = \sum_{x \in X(\Omega)}{x}{p(x)}
$$

for discrete $X$ and as

$$
E[X] = \int_{-\infty}^{\infty}{x}{p(x)dx}
$$

for continuous $X$.

The expected value has a physical interpretation as the "center of mass" of the distribution. 

<img src="https://www.mathwords.com/e/e_assets/e41.gif" width="800px"/>

[Visualization](https://seeing-theory.brown.edu/basic-probability/index.html)

In [None]:
# X: random variable for the roll of a fair die, taking values 1 to 6
p = 1/6  # probability of each face
X = np.arange(1,7)
print(X)
E_X = np.sum([p*x for x in X])
print(E_X)

[1 2 3 4 5 6]
3.5
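
The "long-run average" interpretation can be illustrated with a small simulation. This is a sketch using NumPy's random generator; the seed and the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the run is repeatable

# Simulate many independent rolls of a fair six-sided die
rolls = rng.integers(low=1, high=7, size=100_000)  # high is exclusive

# The average of the first n rolls drifts toward E[X] = 3.5 as n grows
for n in (10, 1_000, 100_000):
    print(n, rolls[:n].mean())

sample_mean = rolls.mean()
```

With only 10 rolls the average can be far from 3.5. With 100,000 rolls it should typically agree to about two decimal places.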


The **variance** measures the dispersion of a random variable around its expected value:

$$
Var(X) = E[(X - E[X])^2]
$$

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/4ad35c4161b9cf52868e879d457d8d796094ff02)

In [None]:
X1 = np.array([33,34,35,37,39]) # values spread around their mean, 35.6

In [None]:
# Equivalent shortcut formula: Var(X) = E[X^2] - E[X]^2
np.mean(X1**2) - np.mean(X1)**2

4.639999999999873

In [None]:
np.var(X1)

4.64

Variance is a useful notion, but **it suffers from the fact that the units of variance are not the same as the units of the random variable** (because of the squaring). To overcome this problem we can use the **standard deviation**, which is defined as $\sqrt{Var(X)}$. The standard deviation of $X$ has the same units as $X$.
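
A minimal sketch of the relationship between the two, reusing the same five values. Note that `np.var` and `np.std` divide by $n$ by default (the population formulas).

```python
import numpy as np

X1 = np.array([33, 34, 35, 37, 39])

variance = np.var(X1)  # population variance, in squared units of X1
std_dev = np.std(X1)   # population standard deviation, same units as X1

print(variance)
print(std_dev)
print(np.isclose(std_dev, np.sqrt(variance)))  # True
```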


## Distributions

There are two major classes of probability distributions: **discrete** and **continuous**.

A **random variable** $X$ is a function that assigns a real number to each outcome in the sample space $\Omega$.
- When $X$ can take only a finite or countably infinite number of values, it is known as a **discrete random variable**.
- When $X$ can take any value in a continuous range (an uncountably infinite set of values), it is called a **continuous random variable**.

[Visualization](https://seeing-theory.brown.edu/probability-distributions/index.html#section1)

If $X$ is a continuous random variable, then there exist a nonnegative function $f(x)$ (the **probability density function, PDF**) and a function $F(x)$ (the **cumulative distribution function, CDF**) such that the following are true:

$$
P(a \leq X \leq b) = \int_{a}^{b}{f(x)dx} \\
P(X \leq x) = F(x)
$$

If $X$ is discrete, $f(x)$ is called the **probability mass function (PMF)**:

$$
P(X=x) = f(x) \\
P(X \leq x) = F(x)
$$
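
For a concrete discrete case, here is a small sketch of the PMF and CDF of a fair die. It takes the CDF as the running total of the PMF, that is $F(x) = P(X \leq x)$, which is the usual convention.

```python
import numpy as np

values = np.arange(1, 7)  # possible outcomes of a fair die
pmf = np.full(6, 1 / 6)   # f(x) = P(X = x) = 1/6 for each face
cdf = np.cumsum(pmf)      # F(x) = P(X <= x), running total of the PMF

for x, f, F in zip(values, pmf, cdf):
    print(x, round(f, 3), round(F, 3))

# P(2 <= X <= 4) = F(4) - F(1)
p_2_to_4 = cdf[3] - cdf[0]
print(round(p_2_to_4, 3))  # 0.5
```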





Some common discrete distributions:

* **Bernoulli(p)** (where $0 \leq p \leq 1$): takes the value one if a coin with heads probability $p$ comes up heads, and zero otherwise.
$$
f(x) = \begin{cases}
    p & \text{if } x = 1 \\
    1-p & \text{if } x = 0
\end{cases}
$$

* **Binomial(n, p)** (where $0 \leq p \leq 1$): the number of heads in $n$ independent flips of a coin with heads probability $p$. The probability of exactly $k$ heads, for $k = 0, 1, \dots, n$, is

$$
f(k) = \begin{pmatrix} n \\ k \end{pmatrix} p^k(1-p)^{n-k}
$$

with

$$
\begin{pmatrix} n \\ k \end{pmatrix} = \frac{n!}{k!(n-k)!}
$$

In [None]:
# Example: a car company has a defect rate of 0.3 (30%).
# What is the probability of getting exactly 1 defective car out of 3 cars?
import math

p = 0.3  # probability that a single car is defective
n = 3    # number of cars in total
k = 1    # number of defective cars

# number of ways to have 1 defective car out of 3 cars (combination):
# bad good good, good bad good, good good bad
comb = math.comb(n, k)  # 3

comb * (p**k) * (1-p)**(n-k)

0.4409999999999999
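
The same calculation can be wrapped in a small function to get the whole distribution. This is a sketch: the name `binomial_pmf` is illustrative, and the numbers are the ones from the car example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 3, 0.3  # three cars, 30% defect rate

# Probability of 0, 1, 2 and 3 defective cars
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k, prob in enumerate(pmf):
    print(k, round(prob, 3))

total = sum(pmf)  # a valid PMF sums to 1 (up to floating-point error)
```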

Some common continuous distributions:

* **Uniform(a,b)** (where a < b): equal probability density to every value between a and b on the real line.

$$
f(x) = \begin{cases}
\frac{1}{b-a} & \text{if } a \leq x \leq b \\
0 & \text{otherwise}
\end{cases}
$$

* **Normal($\mu$, $\sigma^2$)**, also known as the Gaussian distribution

$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}{(x-\mu)^2}}
$$
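
As a sanity check on the normal density, the sketch below evaluates it on a fine grid and approximates two integrals with a simple Riemann sum. The function name `normal_pdf`, the parameter values and the grid size are illustrative assumptions; `sigma` is the standard deviation, so the normalizing constant is $1/(\sigma\sqrt{2\pi})$.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of Normal(mu, sigma^2), with sigma the standard deviation."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 2.0, 1.5  # illustrative parameters
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
dx = x[1] - x[0]
density = normal_pdf(x, mu, sigma)

# Riemann-sum approximations of the integrals of the density
total_area = np.sum(density) * dx
within_one_sigma = np.sum(density[np.abs(x - mu) <= sigma]) * dx

print(total_area)        # should be very close to 1
print(within_one_sigma)  # should be close to 0.683
```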

[Visualization](https://seeing-theory.brown.edu/probability-distributions/index.html)

## Central limit theorem

The Central Limit Theorem (CLT) states that the sample mean of a sufficiently large number of i.i.d. random variables with finite variance is approximately normally distributed. The larger the sample, the better the approximation.


In practice: draw a sample with enough observations from a population (30 is a common rule of thumb) and compute its mean. Repeat this many times (say, more than 100). The resulting sample means will be approximately normally distributed, even when the population itself is not.
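
The procedure can be tried out with a short simulation. This sketch assumes an exponential population (chosen because it is clearly skewed), and the seed and number of repeats are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

n = 30            # observations per sample
repeats = 10_000  # number of samples drawn

# Exponential(scale=1) is strongly skewed; its mean and standard deviation are both 1
samples = rng.exponential(scale=1.0, size=(repeats, n))
sample_means = samples.mean(axis=1)  # one mean per sample

print(sample_means.mean())  # close to the population mean, 1
print(sample_means.std())   # close to 1 / sqrt(30), about 0.183
```

Plotting `sample_means` as a histogram (for example with `sns.histplot`) should show a roughly bell-shaped curve centred near 1, even though the individual observations are far from normal.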

[Visualization](https://seeing-theory.brown.edu/probability-distributions/index.html)

## Conditional probability (Bayes' Theorem)

The conditional probability of event $A$ given that event $B$ has occurred is written $P(A|B)$ and defined as

$$
P(A|B) = \frac{P(A \cap B)}{P(B)}
$$

assuming $P(B) > 0$.
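
A small sketch of this definition on a fair die. The events `A` and `B` and the helper names are made up for illustration.

```python
from fractions import Fraction

omega = set(range(1, 7))  # fair six-sided die
A = {4, 5, 6}             # event: the roll is greater than 3
B = {2, 4, 6}             # event: the roll is even

def P(event):
    """Probability of an event when all outcomes in omega are equally likely."""
    return Fraction(len(event), len(omega))

def conditional(A, B):
    """P(A | B) = P(A and B) / P(B), defined only when P(B) > 0."""
    if P(B) == 0:
        raise ValueError("P(B) must be positive")
    return P(A & B) / P(B)

print(P(A))               # 1/2
print(conditional(A, B))  # 2/3
```

Knowing that the roll is even raises the probability of "greater than 3" from 1/2 to 2/3, because two of the three even faces (4 and 6) are greater than 3.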

[Visualization](https://setosa.io/conditional/)


**The chain rule** follows from the definition of conditional probability:

$$
P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)
$$


Taking one step further, we arrive at the simple but crucial **Bayes' rule**:

$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$

$P(A)$ is often referred to as the **prior**, $P(A|B)$ as the **posterior**, and $P(B|A)$ as the **likelihood**. [Visualization](https://seeing-theory.brown.edu/bayesian-inference/index.html#section1)
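
To see how the prior, likelihood and posterior interact, here is a sketch with a diagnostic-test scenario. All numbers are hypothetical and chosen only for illustration. $P(B)$ is obtained from the law of total probability, $P(B) = P(B|A)P(A) + P(B|A^C)P(A^C)$.

```python
# Hypothetical numbers, chosen only for illustration
prior = 0.01           # P(A): a person has the condition
likelihood = 0.95      # P(B|A): the test is positive given the condition
false_positive = 0.05  # P(B|not A): the test is positive without the condition

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
evidence = likelihood * prior + false_positive * (1 - prior)

# Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)
posterior = likelihood * prior / evidence

print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays low here because the prior is small: most positive results come from the much larger group without the condition.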
