In [1]:
import numpy as np
import matplotlib.pyplot as plt

<br>

# Probability mass functions
---

A random variable $X$ is a variable whose value depends on some random phenomenon. A **discrete random variable** is a random variable that takes its values in a countable set $\mathcal{X}$ (which may still be infinite, such as $\mathbb{N}$ or $\mathbb{Z}$).

The **probability mass function** of $X$ gives a numerical description of how likely each value of $X$ is to occur. The probability of $X$ taking exactly the value $x$ is noted $P(X=x)$, $P_X(x)$ or $P(x)$ when not ambiguous. The **mode** of a probability mass function is the value $x \in \mathcal{X}$ with the highest associated probability. A distribution can have several modes.

&emsp; $\displaystyle 0 \le P(X = x) \le 1$
&emsp; and
&emsp; $\displaystyle P(X \ne x) = 1 - P(X = x)$
&emsp; and
&emsp; $\displaystyle \sum_{x \in \mathcal{X}} P(X = x) = 1$
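
As a minimal sketch of these properties, the snippet below builds a probability mass function for a hypothetical loaded six-sided die (the probabilities are made up for illustration) and checks that they sum to one. It also shows a distribution with two modes.

```python
import numpy as np

# Hypothetical loaded die: the probabilities are an assumption for illustration.
values = np.arange(1, 7)                              # possible values of X
pmf = np.array([0.1, 0.2, 0.3, 0.3, 0.05, 0.05])      # P(X = x) for each value

total = pmf.sum()                                     # should be 1 up to rounding
p_not_3 = 1 - pmf[values == 3][0]                     # P(X != 3) = 1 - P(X = 3)
modes = values[np.isclose(pmf, pmf.max())]            # all values with maximal probability

print(total, p_not_3, modes)
```

Here the values 3 and 4 share the highest probability, so both are modes.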

The **cumulative mass function** of $X$ (more commonly called the cumulative distribution function) gives the probability that $X$ takes a value less than or equal to $x$. This quantity is only meaningful if the possible values of $X$ are ordered.

&emsp; $\displaystyle C(x) = P(X \le x) = \sum_{x' \le x} P(x')$
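
For values listed in increasing order, the cumulative sum of the probability mass function gives $C(x)$ directly. The sketch below assumes the same hypothetical loaded die as an illustration.

```python
import numpy as np

# Hypothetical loaded die (illustrative probabilities, values sorted increasingly).
values = np.arange(1, 7)
pmf = np.array([0.1, 0.2, 0.3, 0.3, 0.05, 0.05])

cdf = np.cumsum(pmf)              # cdf[i] = P(X <= values[i])
p_le_3 = cdf[values == 3][0]      # P(X <= 3) = 0.1 + 0.2 + 0.3

print(cdf, p_le_3)
```

The resulting array is non-decreasing and ends at 1, as expected of a cumulative function.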

The **expected value** of a function $f : \mathcal{X} \rightarrow \mathbb{R}$ applied to the value of $X$ is the average value that this function takes under the probability mass function of $X$ (technically, the expected value is a *functional*: a function that takes a function and returns a real number):

&emsp; $\displaystyle E_X[f] = \sum_{x \in \mathcal{X}} f(x) P(x)$
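
The functional view translates naturally into code: a helper that takes a function and returns a number. The name `expectation` and the loaded-die probabilities below are assumptions made for this sketch.

```python
import numpy as np

# Hypothetical loaded die (illustrative probabilities).
values = np.arange(1, 7)
pmf = np.array([0.1, 0.2, 0.3, 0.3, 0.05, 0.05])

def expectation(f, values, pmf):
    """Return E_X[f], the sum over x of f(x) * P(x)."""
    return np.sum(f(values) * pmf)

mean = expectation(lambda x: x, values, pmf)               # E[X]
second_moment = expectation(lambda x: x**2, values, pmf)   # E[X^2]

print(mean, second_moment)
```

Passing the identity function gives the mean of $X$; passing any other $f$ gives the average of $f(X)$.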

The **variance** of $X$ is the expected value of the square of the difference between the values taken by the random variable $X$ and the mean value of $X$. The **standard deviation** is the square root of the variance, and expresses the *spread* of a distribution around its mean:

&emsp; $\displaystyle \sigma^2[X] = E[(X-E[X])^2] = E[X^2] - E[X]^2$
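
The two expressions for the variance can be checked against each other numerically. This sketch again assumes the hypothetical loaded die.

```python
import numpy as np

# Hypothetical loaded die (illustrative probabilities).
values = np.arange(1, 7)
pmf = np.array([0.1, 0.2, 0.3, 0.3, 0.05, 0.05])

mean = np.sum(values * pmf)                            # E[X]
var_centered = np.sum((values - mean) ** 2 * pmf)      # E[(X - E[X])^2]
var_shortcut = np.sum(values ** 2 * pmf) - mean ** 2   # E[X^2] - E[X]^2
std = np.sqrt(var_centered)                            # standard deviation

print(var_centered, var_shortcut, std)
```

Both forms agree up to floating-point rounding.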

The **covariance** is a generalisation of the variance for two random variables $X$ and $Y$:

&emsp; $\displaystyle \text{Cov}[X,Y] = E[(X-E[X])(Y-E[Y])] = E[XY] - E[X]E[Y]$
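
For two discrete variables, the covariance can be computed from a joint probability table. The 2×2 table below is a made-up example in which $X$ and $Y$ tend to take the same value.

```python
import numpy as np

# Hypothetical joint distribution: joint[i, j] = P(X = x_vals[i], Y = y_vals[j]).
x_vals = np.array([0.0, 1.0])
y_vals = np.array([0.0, 1.0])
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)            # marginal distribution of X
p_y = joint.sum(axis=0)            # marginal distribution of Y
mean_x = np.sum(x_vals * p_x)
mean_y = np.sum(y_vals * p_y)

# E[(X - E[X])(Y - E[Y])]
cov_centered = np.sum(np.outer(x_vals - mean_x, y_vals - mean_y) * joint)
# E[XY] - E[X]E[Y]
cov_shortcut = np.sum(np.outer(x_vals, y_vals) * joint) - mean_x * mean_y

print(cov_centered, cov_shortcut)
```

The covariance is positive here because most of the probability mass lies on the diagonal of the table.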

<br>

# Probability density functions
---

A **continuous random variable** is a variable which takes its values in an uncountable set $\mathcal{X}$ (such as $\mathbb{R}$). Because the values are not countable, the notion of probability mass function would be meaningless (the probability of taking exactly one given value is zero).

The **probability density function** of $X$ gives a relative numerical description of how likely a value is to occur, and is noted $p(X=x)$, $p_X(x)$ or $p(x)$ when not ambiguous. The probability of $X$ taking a value in the interval $[a,b]$ is the integral of $p(x)$ between $a$ and $b$:

&emsp; $\displaystyle P(a \le X \le b) = \int_a^b p(x) dx$
&emsp; $\implies$
&emsp; $\displaystyle p(x) = \underset{h \rightarrow 0}{\text{lim}} \frac{P(x \le X \le x+h)}{h}$
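
Both relations can be illustrated numerically. The sketch below assumes a standard normal density (chosen only as an example) and a hand-written trapezoidal rule. It computes the probability of an interval, then checks that $P(x \le X \le x+h)/h$ is close to $p(x)$ for a small $h$.

```python
import numpy as np

def normal_pdf(x):
    """Density of the standard normal distribution (chosen for illustration)."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def integrate(f, a, b, n=10_001):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    x = np.linspace(a, b, n)
    y = f(x)
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))

p_interval = integrate(normal_pdf, -1.0, 1.0)     # P(-1 <= X <= 1)

x0, h = 0.5, 1e-3
ratio = integrate(normal_pdf, x0, x0 + h) / h     # P(x0 <= X <= x0 + h) / h

print(p_interval, ratio, normal_pdf(x0))
```

The ratio approaches the density at `x0` as `h` shrinks, which is why a density is a probability per unit length and not a probability.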

The **mode** of a probability density function is the value $x \in \mathcal{X}$ with the highest density. The **cumulative density function** of $X$ (more commonly called the cumulative distribution function) gives the probability that $X$ takes a value less than or equal to $x$:

&emsp; $\displaystyle C(x) = P(X \le x) = \int_{-\infty}^{x} p(\hat{x}) d\hat{x}$
&emsp; $\implies$
&emsp; $\displaystyle p(x) = \frac{d}{dx} C(x)$
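
As a quick numerical check that the density is the derivative of the cumulative function, the sketch below assumes an exponential distribution with rate 2 (an illustrative choice), for which $p(x) = \lambda e^{-\lambda x}$ and $C(x) = 1 - e^{-\lambda x}$ on $x \ge 0$.

```python
import numpy as np

rate = 2.0                                  # assumed rate of the exponential distribution
x = np.linspace(0.0, 5.0, 5001)

pdf = rate * np.exp(-rate * x)              # p(x)
cdf = 1.0 - np.exp(-rate * x)               # C(x)

pdf_from_cdf = np.gradient(cdf, x)          # finite-difference estimate of dC/dx
# Compare on interior points, where np.gradient uses central differences.
max_error = np.max(np.abs(pdf_from_cdf[1:-1] - pdf[1:-1]))

print(max_error)
```

The finite-difference derivative of `cdf` matches `pdf` up to a small discretisation error.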

The **expected value** of a function $f : \mathcal{X} \rightarrow \mathbb{R}$ applied to the value of $X$ is the average value that this function takes under the probability density function of $X$ (technically, the expected value is a *functional*: a function that takes a function and returns a real number):

&emsp; $\displaystyle E_X[f] = \int f(x) p(x) dx$
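
In the continuous case the sum becomes an integral, which can be approximated on a grid. The sketch below assumes an exponential density with rate 2 and a grid truncated at $x = 20$, where the remaining tail is negligible. The helper name `expectation` is hypothetical.

```python
import numpy as np

rate = 2.0                                   # assumed rate of the exponential distribution
x = np.linspace(0.0, 20.0, 200_001)          # grid; the tail beyond 20 is negligible
pdf = rate * np.exp(-rate * x)

def expectation(fx, pdf, x):
    """Approximate E_X[f] from f evaluated on the grid, using the trapezoidal rule."""
    y = fx * pdf
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))

mean = expectation(x, pdf, x)                # E[X], exactly 1 / rate
second_moment = expectation(x**2, pdf, x)    # E[X^2], exactly 2 / rate**2
variance = second_moment - mean**2           # exactly 1 / rate**2

print(mean, variance)
```

The numerical values agree with the closed forms $1/\lambda$ and $1/\lambda^2$ for this distribution.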

The **variance** of $X$ is the expected value of the square of the difference between the values taken by the random variable $X$ and the mean value of $X$. The **standard deviation** is the square root of the variance, and expresses the *spread* of a distribution around its mean:

&emsp; $\displaystyle \sigma^2[X] = \int (x - E[X])^2 p(x) dx$
&emsp; and
&emsp; $\displaystyle \text{Cov}[X,Y] = \iint (x - E[X]) \; (y - E[Y]) \; p(x,y) \; dx \; dy$

Some distributions do not have a variance because the defining integral diverges (the Cauchy distribution is a classic example).
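
One way to see such a divergence, sketched here with the standard Cauchy density $p(x) = \frac{1}{\pi (1 + x^2)}$ as an assumed example, is to truncate the integral of $x^2 p(x)$ to $[-T, T]$ and watch it grow with $T$. The truncated integral has the closed form $\frac{2}{\pi}(T - \arctan T)$, which grows without bound.

```python
import numpy as np

def cauchy_pdf(x):
    """Density of the standard Cauchy distribution (chosen for illustration)."""
    return 1.0 / (np.pi * (1.0 + x**2))

def truncated_second_moment(cutoff, n=200_001):
    """Trapezoidal estimate of the integral of x^2 p(x) over [-cutoff, cutoff]."""
    x = np.linspace(-cutoff, cutoff, n)
    y = x**2 * cauchy_pdf(x)
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))

cutoffs = [10.0, 100.0, 1000.0]
moments = [truncated_second_moment(t) for t in cutoffs]

print(moments)
```

The truncated second moment keeps growing roughly linearly with the cutoff instead of settling to a finite limit.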

<br>

# Independence
---

Two random variables $X$ and $Y$ are independent if their joint distribution factorises into the product of their marginal distributions:

&emsp; $X$ and $Y$ are independent
&emsp; $\iff$
&emsp; $p(x,y) = p(x)p(y)$
&emsp; $\iff$
&emsp; $p(x|y) = p(x) \;\;\; \text{or} \;\;\; p(y|x) = p(y)$

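The covariance of two independent variables is zero, because the double integral factorises and each factor equals $E[X] - E[X] = 0$ (respectively $E[Y] - E[Y] = 0$):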

&emsp; $\displaystyle \text{Cov}[X,Y] = \iint (x - E[X]) \; (y - E[Y]) \; p(x)p(y) \; dx \; dy = \int (x - E[X]) \; p(x) \; dx \int (y - E[Y]) \; p(y) \; dy = 0$

The converse does not hold: a zero covariance does not imply that the variables are independent.
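
A small counterexample makes this concrete. The sketch below assumes $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$ (an illustrative choice): $Y$ is fully determined by $X$, yet their covariance is zero.

```python
import numpy as np

# Assumed example: X uniform on {-1, 0, 1} and Y = X^2.
x_vals = np.array([-1.0, 0.0, 1.0])
p_x = np.full(3, 1 / 3)
y_of_x = x_vals ** 2                          # value of Y for each value of X

# Cov[X, Y] = E[XY] - E[X] E[Y]
cov = np.sum(x_vals * y_of_x * p_x) - np.sum(x_vals * p_x) * np.sum(y_of_x * p_x)

# Independence would require P(X=0, Y=0) = P(X=0) P(Y=0).
p_joint_00 = p_x[x_vals == 0].sum()           # Y = 0 exactly when X = 0
p_y0 = p_x[y_of_x == 0].sum()                 # P(Y = 0)
p_product = p_x[x_vals == 0].sum() * p_y0     # P(X = 0) P(Y = 0)

print(cov, p_joint_00, p_product)
```

The joint probability (1/3) differs from the product of the marginals (1/9), so the variables are dependent despite being uncorrelated.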

<br>

# Law of large numbers
---

Let $(X_1 \dots X_N)$ be a set of independent random variables, each having the same distribution as $X$.

The **expected value is linear** (because it involves a sum or an integral, which is itself linear) so the expected value of the sum of the $X_n$ is equal to $N$ times the expected value of $X$.

&emsp; $\displaystyle E \Big[\sum_n X_n \Big] = N E[X]$
&emsp; $\implies$
&emsp; $\displaystyle E \Big[\frac{1}{N} \sum_n X_n \Big] = E[X]$
&emsp; (the expected value of the sample average is the expected value of $X$)

The **weak law of large numbers** says that the sample average converges in probability to the expected value of $X$:

&emsp; $\displaystyle \forall \epsilon > 0, \underset{N \rightarrow \infty}{\text{lim}} P \Big( \Big \vert \frac{1}{N} \sum_n X_n - E[X]\Big \vert > \epsilon \Big) = 0$
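
Convergence in probability can be illustrated by simulation: for a fixed tolerance $\epsilon$, the fraction of experiments whose sample average misses $E[X]$ by more than $\epsilon$ shrinks as $N$ grows. The sketch below assumes $X$ uniform on $[0, 1)$, $\epsilon = 0.05$ and 2000 repetitions per sample size. These are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)     # seeded for reproducibility
true_mean = 0.5                    # E[X] for X uniform on [0, 1)
epsilon = 0.05                     # assumed tolerance
n_trials = 2000                    # number of repeated experiments per sample size

fractions = []
for n in (10, 100, 1000):
    samples = rng.random((n_trials, n))          # n_trials experiments of n draws each
    sample_means = samples.mean(axis=1)
    # Estimate P(|sample average - E[X]| > epsilon) by the observed frequency.
    fractions.append(np.mean(np.abs(sample_means - true_mean) > epsilon))

print(fractions)
```

The estimated probabilities decrease towards zero as the sample size increases.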

One remarkable fact is that the **variance** of the sum of the $X_n$ is equal to $N$ times the variance of $X$, and so the variance of the sample average $\bar{X_n}$ is the variance of $X$ divided by $N$. This holds when the $X_n$ are independent, or more generally whenever their pairwise covariances are zero:

&emsp; $\displaystyle \sigma^2\Big[\sum_n X_n \Big] = N \sigma^2 [X]$
&emsp; $\implies$
&emsp; $\displaystyle \sigma^2[\bar{X_n}] = \frac{\sigma^2[X]}{N}$
&emsp; $\implies$
&emsp; $\displaystyle \sigma[\bar{X_n}] = \frac{\sigma[X]}{\sqrt{N}}$
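
The $1/\sqrt{N}$ scaling can be checked empirically. The sketch below assumes independent normal draws with $\sigma = 2$ (an illustrative choice) and compares the observed standard deviation of the sample average with $\sigma / \sqrt{N}$.

```python
import numpy as np

rng = np.random.default_rng(0)     # seeded for reproducibility
sigma = 2.0                        # assumed standard deviation of X
n_trials = 5000                    # number of sample averages per sample size
sample_sizes = np.array([1, 4, 25, 100])

# Standard deviation of the sample average, estimated over many repetitions.
observed = np.array([
    rng.normal(0.0, sigma, size=(n_trials, n)).mean(axis=1).std()
    for n in sample_sizes
])
predicted = sigma / np.sqrt(sample_sizes)

print(observed, predicted)
```

Multiplying the sample size by 100 divides the spread of the sample average by about 10.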

So we see that the **standard deviation** (the spread) of the sample average is decreasing as the number of samples increases.

<br>

# Central Limit Theorem
---