In [1]:
import numpy as np
import matplotlib.pyplot as plt

<br>

# Discrete random variables
---

A random variable $X$ is a variable whose value depends on some random phenomenon. A **discrete random variable**, or **categorical variable**, is a variable that takes its values in a countable set $\mathcal{X}$ (which might still be infinite, such as $\mathbb{N}$ or $\mathbb{Z}$).

<br>

### Probabilities

The **probability mass function** of $X$ describes how likely each value of $X$ is to occur. The probability of $X$ taking exactly the value $x$ is denoted $P(X=x)$, $P_X(x)$ or $P(x)$ when not ambiguous. The **mode** of a probability mass function is the value $x \in \mathcal{X}$ with the highest associated probability. A distribution can have several modes.

&emsp; $\displaystyle 0 \le P(X = x) \le 1$
&emsp; and
&emsp; $\displaystyle P(X \ne x) = 1 - P(X = x)$
&emsp; and
&emsp; $\displaystyle \sum_{x \in \mathcal{X}} P(X = x) = 1$

The **cumulative distribution function** of $X$ gives the probability that $X$ takes a value less than or equal to $x$. This quantity is only meaningful if the possible values of $X$ are ordered.

&emsp; $\displaystyle C(x) = P(X \le x) = \sum_{x' \le x} P(x')$
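The definitions above can be checked numerically. The following is a minimal sketch, assuming a hypothetical loaded six-sided die whose probabilities are made up for illustration:

```python
import numpy as np

# Hypothetical example: a loaded six-sided die (made-up probabilities).
values = np.array([1, 2, 3, 4, 5, 6])
pmf = np.array([0.1, 0.1, 0.1, 0.1, 0.2, 0.4])

# A valid probability mass function is non-negative and sums to 1.
assert np.all(pmf >= 0) and np.isclose(pmf.sum(), 1.0)

# The mode is the value with the highest probability.
mode = values[np.argmax(pmf)]

# C(x) = P(X <= x) is the running sum of the probability mass function.
cdf = np.cumsum(pmf)

# P(X != 6) = 1 - P(X = 6)
p_not_six = 1.0 - pmf[values == 6][0]

print("mode:", mode)
print("cdf:", cdf)
print("P(X != 6):", p_not_six)
```

Note that `np.argmax` returns only the first maximum, so this sketch would report a single mode even if the distribution had several.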

<br>

### Sum and product rules

The product rule allows us to express the joint distribution as a product of conditional distributions:

&emsp; $\displaystyle P(X,Y|\mathcal{H}) = P(X|Y,\mathcal{H})P(Y|\mathcal{H}) = P(Y|X,\mathcal{H})P(X|\mathcal{H})$

The sum rule, or marginalization rule, allows us to reduce the joint probabilities of $X$ and $Y$ to the probabilities of $X$ alone by summing over all possible values of $Y$. Read in the other direction, it allows us to introduce a new variable $Y$:

&emsp; $\displaystyle P(X|\mathcal{H}) = \sum_Y P(X,Y|\mathcal{H}) = \sum_Y P(X|Y,\mathcal{H})P(Y|\mathcal{H})$

The product rule gives us Bayes' rule, which can be further rewritten using the sum rule:

&emsp; $\displaystyle P(X|Y,\mathcal{H}) = \frac{P(Y|X,\mathcal{H})P(X|\mathcal{H})}{P(Y|\mathcal{H})}$
&emsp; where
&emsp; $\displaystyle P(Y|\mathcal{H}) = \sum_{X} P(Y|X,\mathcal{H})P(X|\mathcal{H})$
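As an illustration, the sum rule, product rule and Bayes' rule can all be applied to a small joint probability table. This is only a sketch: the table `joint` below is hypothetical, and the conditioning on $\mathcal{H}$ is left implicit.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): 2 values of X (rows),
# 3 values of Y (columns). The numbers are made up for illustration.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.05, 0.25]])
assert np.isclose(joint.sum(), 1.0)

# Sum rule: marginalize by summing over the other variable.
p_x = joint.sum(axis=1)   # P(X)
p_y = joint.sum(axis=0)   # P(Y)

# Product rule: P(X, Y) = P(Y | X) P(X), hence P(Y | X) = P(X, Y) / P(X).
p_y_given_x = joint / p_x[:, None]   # each row sums to 1

# Bayes' rule: P(X | Y) = P(Y | X) P(X) / P(Y),
# where P(Y) = sum_X P(Y | X) P(X) comes from the sum rule.
evidence = (p_y_given_x * p_x[:, None]).sum(axis=0)
p_x_given_y = p_y_given_x * p_x[:, None] / evidence   # each column sums to 1

print(p_x_given_y)
```

The denominator computed through the sum rule matches the marginal `p_y` read directly from the table, which is a quick consistency check.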

<br>

### Expected value

The **expected value** of a function $f : \mathcal{X} \rightarrow \mathbb{R}$ applied to the value of $X$ is the average value that this function takes under the probability mass function of $X$ (technically, the expectation is a *functional*), and the **mean** is the expected value of the identity function:

&emsp; $\displaystyle \mathbb{E}_X[f] = \sum_{x \in \mathcal{X}} f(x) P(x)$
&emsp;
&emsp; $\displaystyle \mathbb{E}_X[X] = \sum_{x \in \mathcal{X}} x P(x)$

The expected value is linear (it is a sum weighted by fixed probabilities), which implies:

&emsp; $\displaystyle \mathbb{E}_X[a f+ b g] = a \, \mathbb{E}_X[f] + b \, \mathbb{E}_X[g]$

For multivariate distributions, the expected value of a vector is the vector of the expected value of each separate dimension.
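A short sketch of these formulas for a fair six-sided die. The helper name `expectation` is an assumption introduced here for illustration, not something defined elsewhere in these notes.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])
pmf = np.full(6, 1 / 6)   # fair die

def expectation(f, values, pmf):
    """E_X[f] = sum over x of f(x) P(x) (hypothetical helper)."""
    return np.sum(f(values) * pmf)

mean = expectation(lambda x: x, values, pmf)               # 3.5
mean_square = expectation(lambda x: x ** 2, values, pmf)   # 91 / 6

# Linearity: E[a f + b g] = a E[f] + b E[g], with f(x) = x and g(x) = x^2.
a, b = 2.0, -3.0
lhs = expectation(lambda x: a * x + b * x ** 2, values, pmf)
rhs = a * mean + b * mean_square

print(mean, mean_square, lhs, rhs)
```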


<br>

### Variance

The **variance** of $X$ is the expected value of the square of the difference between the values taken by the random variable $X$ and the mean value of $X$. The **standard deviation** is the square root of the variance, and expresses the *spread* of a distribution around its mean:

&emsp; $\displaystyle \mathbb{V}[X] = E[(X-E[X])(X-E[X])^T] = E[X X^T] - E[X]E[X]^T \in \mathbb{R}^{D \times D}$

The **covariance** is a generalisation of the variance for two random variables $X$ and $Y$:

&emsp; $\displaystyle \text{Cov}[X,Y] = E[(X-E[X])(Y-E[Y])^T] = E[XY^T] - E[X]E[Y]^T  \in \mathbb{R}^{D \times P}$

The expected value is linear, and so we have the following relationships:

&emsp; $\displaystyle \mathbb{V}[A X + b] = A \mathbb{V}[X] A^T$
&emsp; and
&emsp; $\displaystyle \mathbb{V}[X + Y] = \mathbb{V}[X] + \mathbb{V}[Y] + \text{Cov}[X,Y] + \text{Cov}[Y,X]$
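These relationships also hold exactly for empirical covariances computed from samples, which makes them easy to verify. The sketch below assumes synthetic data and a hypothetical helper `covariance` that averages over samples (rows):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 1000 samples of a 3-dimensional variable X (one row per sample).
X = rng.normal(size=(1000, 3)) @ np.array([[1.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.3],
                                           [0.0, 0.0, 2.0]])

def covariance(X, Y):
    """Empirical Cov[X, Y] = E[X Y^T] - E[X] E[Y]^T (hypothetical helper)."""
    n = X.shape[0]
    return X.T @ Y / n - np.outer(X.mean(axis=0), Y.mean(axis=0))

V_X = covariance(X, X)   # D x D matrix, here 3 x 3

# Affine transformation: V[A X + b] = A V[X] A^T (the offset b has no effect).
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 1.0]])
b = np.array([5.0, -2.0])
Z = X @ A.T + b
V_Z = covariance(Z, Z)

# Sum of two variables: V[X + Y] = V[X] + V[Y] + Cov[X, Y] + Cov[Y, X].
Y = X ** 2 + rng.normal(size=X.shape)
V_sum = covariance(X + Y, X + Y)

print(np.allclose(V_Z, A @ V_X @ A.T))
print(np.allclose(V_sum, V_X + covariance(Y, Y) + covariance(X, Y) + covariance(Y, X)))
```

The helper normalizes by $N$ rather than $N-1$, so it corresponds to `np.cov(X.T, bias=True)`.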

The standard deviation is denoted $\sigma$, so that $\sigma^2$ is the variance of a scalar variable.

<br>

# Continuous random variables
---

A **continuous random variable** is a variable that takes its values in an uncountable set $\mathcal{X}$ (such as $\mathbb{R}$). Because the values are not countable, the notion of probability mass function is no longer meaningful: the probability of taking exactly one given value is zero.

<br>

### Probabilities

The **probability density function** of $X$ describes the relative likelihood of each value, and is denoted $p(X=x)$, $p_X(x)$ or $p(x)$ if not ambiguous. The probability of the value of $X$ being found in the interval $[a,b]$ is the integral of $p(x)$ between $a$ and $b$:

&emsp; $\displaystyle P(a \le X \le b) = \int_a^b p(x) dx$
&emsp; $\implies$
&emsp; $\displaystyle p(x) = \underset{h \rightarrow 0}{\text{lim}} \frac{P(x \le X \le x+h)}{h}$

The **mode** of a probability density function is the value $x \in \mathcal{X}$ with the highest density. The **cumulative distribution function** of $X$ gives the probability that $X$ takes a value less than or equal to $x$:

&emsp; $\displaystyle C(x) = P(X \le x) = \int_{-\infty}^{x} p(\hat{x}) d\hat{x}$
&emsp; $\implies$
&emsp; $\displaystyle p(x) = \frac{d}{dx} C(x)$
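The relationship between density, interval probability and cumulative function can be sketched numerically. The example below assumes a standard Gaussian density and a plain trapezoidal rule on a fine grid; both are illustrative choices, not part of the definitions above.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian, used here only as an example of p(x)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-8.0, 8.0, 16001)
p = gaussian_pdf(x)

# Area of each trapezoid between consecutive grid points.
areas = (p[1:] + p[:-1]) / 2 * np.diff(x)

# The density integrates to 1 over the whole domain (up to numerical error).
total = areas.sum()

# C(x) = integral of p from -infinity to x, approximated by a running sum.
cdf = np.concatenate([[0.0], np.cumsum(areas)])

# P(a <= X <= b) = C(b) - C(a), here with a = -1 and b = 1.
p_interval = np.interp(1.0, x, cdf) - np.interp(-1.0, x, cdf)

# Differentiating the cumulative function recovers the density.
p_from_cdf = np.gradient(cdf, x)

print(total, p_interval, np.max(np.abs(p_from_cdf - p)))
```

For a standard Gaussian, the interval $[-1, 1]$ holds roughly 68% of the probability mass, which is what `p_interval` approximates.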


<br>

### Sum and product rules

The product rule allows us to express the joint distribution as a product of conditional distributions, and remains unchanged for probability densities. The sum rule, or marginalization rule, allows us to reduce the joint density of $X$ and $Y$ to the density of $X$ alone by integrating over all possible values of $Y$, and so features an integral instead of a sum:

&emsp; $\displaystyle p(x) = \int_{y \in Y} p(x,y) dy = \int_{y \in Y} p(x|y)p(y) dy$

The product rule gives us Bayes' rule, which can be further rewritten using the sum rule:

&emsp; $\displaystyle p(x|y,\mathcal{H}) = \frac{p(y|x,\mathcal{H})p(x|\mathcal{H})}{p(y|\mathcal{H})}$
&emsp; where
&emsp; $\displaystyle p(y|\mathcal{H}) = \int_x p(y|x,\mathcal{H})p(x|\mathcal{H}) dx$
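Bayes' rule for densities can be approximated on a grid. The sketch below assumes a Gaussian prior $p(x) = \mathcal{N}(0, 1)$, a Gaussian observation model $p(y|x) = \mathcal{N}(x, 1)$ and a made-up observation $y = 2$; for this particular choice the posterior is known to be Gaussian with mean 1 and variance 1/2, which gives something to compare against.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Grid over x, wide enough that the densities vanish at both ends.
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]

prior = gaussian_pdf(x, 0.0, 1.0)               # p(x), assumed
y_observed = 2.0                                # made-up observation
likelihood = gaussian_pdf(y_observed, x, 1.0)   # p(y | x) as a function of x

# Sum rule with an integral: p(y) = integral of p(y | x) p(x) dx.
evidence = np.sum(likelihood * prior) * dx

# Bayes' rule: p(x | y) = p(y | x) p(x) / p(y).
posterior = likelihood * prior / evidence

posterior_mean = np.sum(x * posterior) * dx
posterior_var = np.sum((x - posterior_mean) ** 2 * posterior) * dx

print(evidence, posterior_mean, posterior_var)
```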

<br>

### Expected values

The **expected value** of a function $f : \mathcal{X} \rightarrow \mathbb{R}$ applied to the value of $X$ is the average value that this function takes under the probability density function of $X$, and the **mean** is the expected value of the identity function:

&emsp; $\displaystyle \mathbb{E}_X[f] = \int f(x) p(x) dx$
&emsp;
&emsp; $\displaystyle \mathbb{E}_X[X] = \int x p(x) dx$

As for categorical variables, the expected value of a continuous variable is linear.

<br>

### Variance

The **variance** of $X$ is the expected value of the square of the difference between the values taken by the random variable $X$ and the mean value of $X$. The **standard deviation** is the square root of the variance, and expresses the *spread* of a distribution around its mean:

&emsp; $\displaystyle \mathbb{V}[X] = \int (x - E[X]) (x - E[X])^T p(x) \, dx$
&emsp; and
&emsp; $\displaystyle \text{Cov}[X,Y] = \iint (x - E[X]) \; (y - E[Y])^T \; p(x,y) \; dx \; dy$

Some distributions do not have a variance because the integral does not converge (the Cauchy distribution is a standard example).

<br>

# Prior, Posteriors, Likelihoods
---

When we use Bayes' theorem, we are often interested in **inferring the cause $c$ that explains the effect** $e$. For instance, when classifying images, we are in fact interested in knowing whether a cat or a dog is responsible for generating the image.

&emsp; $\displaystyle p(c|e) = \frac{p(e|c)p(c)}{p(e)}$
&emsp; where
&emsp; $c$ is the cause
&emsp; and
&emsp; $e$ is the effect (the observation)

We often refer to the different parts of this formula as the *prior*, the *posterior* and the *likelihood*:

* The **prior** is the a priori probability of the cause, $p(c)$
* The **posterior** is the adjusted probability of the cause given the evidence, $p(c|e)$
* The **likelihood** is the probability of the effect given the cause, $p(e|c)$, which measures how well the cause explains the effect

We call $p(e|c)$ a likelihood and not a conditional probability because we view it not as a function of the effect, but as a function of the cause (which is what we are looking for). Therefore, **likelihoods do not define a probability distribution**, and in particular, do not sum to 1. So in short:

* $p(x|y)$ is a conditional probability distribution if viewed as a function of $x$ (with $y$ fixed)
* $p(x|y)$ is the likelihood of $y$ given the observation $x$ if viewed as a function of $y$ (with $x$ fixed)
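To make the distinction concrete, here is a minimal sketch of the cat / dog example. All the numbers are hypothetical, and the observation $e$ is assumed to be a single fixed feature (say, "pointy ears").

```python
import numpy as np

causes = ["cat", "dog"]

# Hypothetical prior p(c): how often each animal appears in the data.
prior = np.array([0.7, 0.3])

# Hypothetical likelihood p(e | c) of the fixed observation e under each cause.
likelihood = np.array([0.9, 0.4])

# Evidence p(e) = sum over causes of p(e | c) p(c).
evidence = np.sum(likelihood * prior)

# Posterior p(c | e) = p(e | c) p(c) / p(e).
posterior = likelihood * prior / evidence

for cause, post in zip(causes, posterior):
    print(cause, post)

# The posterior is a distribution over causes; the likelihood is not.
print("posterior sums to:", posterior.sum())
print("likelihood sums to:", likelihood.sum())
```

With these numbers the likelihood values add up to 1.3, while the posterior adds up to 1, which is exactly the difference described above.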

<br>

# Odds
---

* odds
* un-normalized probabilities

<br>

# Dependence and Independence
---

<br>

### Independence

Two random variables $X$ and $Y$ are independent if their joint distribution factorizes into the product of their marginal distributions:

&emsp; $X$ and $Y$ are independent
&emsp; $\iff$
&emsp; $p(x,y) = p(x)p(y)$
&emsp; $\iff$
&emsp; $p(x|y) = p(x) \;\;\; \text{or} \;\;\; p(y|x) = p(y)$

The covariance of two independent variables is zero:

&emsp; $\displaystyle \text{Cov}[X,Y] = \iint (x - E[X]) \; (y - E[Y]) \; p(x)p(y) \; dx \; dy = \int (x - E[X]) \; p(x) \; dx \int (y - E[Y]) \; p(y) \; dy = 0$

The converse is false: the covariance only measures the linear relationship between variables, so a zero covariance does not imply that the variables are independent.
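A classic counter-example, sketched below, is $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$: $Y$ is fully determined by $X$, yet their covariance is zero. The joint table is built by hand for this specific example.

```python
import numpy as np

x_values = np.array([-1, 0, 1])
y_values = np.array([0, 1])

# joint[i, j] = P(X = x_values[i], Y = y_values[j]),
# with X uniform on {-1, 0, 1} and Y = X^2.
joint = np.array([[0.0, 1 / 3],
                  [1 / 3, 0.0],
                  [0.0, 1 / 3]])

p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

mean_x = np.sum(x_values * p_x)
mean_y = np.sum(y_values * p_y)

# Cov[X, Y] = sum over (x, y) of (x - E[X]) (y - E[Y]) P(x, y)
cov = np.sum(np.outer(x_values - mean_x, y_values - mean_y) * joint)

# Independence would require P(x, y) = P(x) P(y) for every pair (x, y).
independent = np.allclose(joint, np.outer(p_x, p_y))

print("covariance:", cov)
print("independent:", independent)
```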

<br>

### Correlation

The **correlation factor** is a standard measure for the linear relationship between two variables:

&emsp; $\displaystyle \boxed{r=\frac{\text{Cov}[X,Y]}{\sqrt{\mathbb{V}[X] \mathbb{V}[Y]}}}=\frac{\text{Cov}[X,Y]}{\sigma[X] \sigma[Y]}$
&emsp; which strongly resembles:
&emsp; $\displaystyle \cos \theta = \frac{\langle x, y \rangle}{\Vert x \Vert \Vert y \Vert}$

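Indeed, we can show that the covariance behaves like an inner product on centered random variables: it is symmetric, bilinear and positive semi-definite ($\text{Cov}[X,X] = \mathbb{V}[X] \ge 0$). Its corresponding norm is the square root of the variance, that is, the standard deviation. By the Cauchy-Schwarz inequality, the correlation factor is therefore always between -1 and 1.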
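The analogy with the cosine can be checked on samples: the empirical correlation factor is the cosine of the angle between the two centered sample vectors. The data below is synthetic and only serves as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly related samples.
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)

# Center both sample vectors.
xc = x - x.mean()
yc = y - y.mean()

# Correlation factor from the empirical covariance and variances.
cov_xy = np.mean(xc * yc)
r = cov_xy / np.sqrt(np.mean(xc ** 2) * np.mean(yc ** 2))

# Cosine of the angle between the centered vectors.
cos_theta = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(r, cos_theta, np.corrcoef(x, y)[0, 1])
```

All three printed values agree, since `np.corrcoef` computes the same normalized quantity.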

<br>

# Law of large numbers
---

Let $(X_1 \dots X_N)$ be a set of independent random variables, each having the same distribution as $X$.

The **expected value is linear** (because it involves a sum or an integral, which is itself linear), so the expected value of the sum of the $X_n$ is equal to $N$ times the expected value of $X$.

&emsp; $\displaystyle E \Big[\sum_n X_n \Big] = N E[X]$
&emsp; $\implies$
&emsp; $\displaystyle E \Big[\frac{1}{N} \sum_n X_n \Big] = E[X]$
&emsp; (the expected value of the sample average is the mean)

The **weak law of large numbers** says that the sample average converges in probability to the mean:

&emsp; $\displaystyle \forall \epsilon > 0, \; \underset{N \rightarrow \infty}{\text{lim}} P \Big( \Big \vert \frac{1}{N} \sum_n X_n - E[X]\Big \vert > \epsilon \Big) = 0$

One remarkable fact is that the **variance** of the sum of the $X_n$ is equal to $N$ times the variance of $X$. And so the variance of the sample average is the variance of $X$ divided by $N$. This requires the $X_n$ to be uncorrelated (their covariance is zero), which holds in particular when they are independent:

&emsp; $\displaystyle \sigma^2\Big[\sum_n X_n \Big] = N \sigma^2 [X]$
&emsp; $\implies$
&emsp; $\displaystyle \sigma^2[\bar{X_n}] = \frac{\sigma^2[X]}{N}$
&emsp; $\implies$
&emsp; $\displaystyle \sigma[\bar{X_n}] = \frac{\sigma[X]}{\sqrt{N}}$

So we see that the **standard deviation** (the spread) of the sample average decreases as $1/\sqrt{N}$ as the number of samples increases.
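A small simulation illustrates both statements. This is a sketch under assumptions: the $X_n$ are rolls of a fair die (so $E[X] = 3.5$ and $\mathbb{V}[X] = 35/12$), the experiment is repeated 2000 times for each $N$, and the threshold $\epsilon = 0.1$ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

mean = 3.5                   # E[X] for a fair die
sigma = np.sqrt(35 / 12)     # standard deviation of a fair die
epsilon = 0.1                # arbitrary threshold for the weak law
n_repeats = 2000             # number of independent sample averages per N

results = {}
for N in (10, 100, 1000):
    # Each row holds N independent die rolls; average each row.
    samples = rng.integers(1, 7, size=(n_repeats, N))
    sample_means = samples.mean(axis=1)

    empirical_std = sample_means.std()
    predicted_std = sigma / np.sqrt(N)
    # Fraction of sample averages further than epsilon from the mean.
    frac_far = np.mean(np.abs(sample_means - mean) > epsilon)

    results[N] = (sample_means.mean(), empirical_std, predicted_std, frac_far)
    print(N, results[N])
```

The empirical spread should track $\sigma[X] / \sqrt{N}$, and the fraction of sample averages further than $\epsilon$ from the mean should shrink as $N$ grows.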

<br>

# Central Limit Theorem
---