# Notebook 1: **Probability** basics

Probabilistic Machine Learning -- Spring 2023, UniTS

<a target="_blank" href="https://colab.research.google.com/github/emaballarin/probml-units/blob/main/notebooks/01_probability_basics.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

### Random variables

**Measurable space** $(\Omega,\mathcal{F})$:

- $\Omega$ is a set;
- $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$, i.e. $\mathcal{F}$:
    - contains $\emptyset, \Omega$;
    - is closed under complementation;
    - is closed under countable unions.
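
The three axioms can be checked mechanically when $\Omega$ is finite. The sketch below is only an illustration: the helper names `power_set` and `is_sigma_algebra` and the example families are assumptions made here, and on a finite set closure under countable unions reduces to closure under pairwise unions.

```python
from itertools import combinations


def power_set(omega):
    """All subsets of a finite set, as frozensets."""
    items = list(omega)
    return {frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)}


def is_sigma_algebra(omega, family):
    """Check the sigma-algebra axioms for a family of subsets of a finite omega."""
    omega = frozenset(omega)
    family = {frozenset(s) for s in family}
    # Contains the empty set and omega.
    if frozenset() not in family or omega not in family:
        return False
    # Closed under complementation.
    if any(omega - s not in family for s in family):
        return False
    # Closed under (pairwise, hence finite) unions.
    return all(a | b in family for a in family for b in family)


omega = {1, 2, 3, 4}
generated_by_evens = [set(), {2, 4}, {1, 3}, omega]
not_closed = [set(), {1}, omega]  # the complement {2, 3, 4} is missing

print(is_sigma_algebra(omega, power_set(omega)))
print(is_sigma_algebra(omega, generated_by_evens))
print(is_sigma_algebra(omega, not_closed))
```

The power set and the family generated by the even numbers pass the check, while the last family fails because it is not closed under complementation.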


**Measurable function** $f$:

- $f:(\Omega_1,\mathcal{F}_1)\rightarrow (\Omega_2,\mathcal{F}_2)$;
- The pre-image $f^{-1}(E)$, $\forall$ measurable set $E\in\mathcal{F}_2$, is measurable (i.e. $f^{-1}(E)\in\mathcal{F}_1$).


**Probability measure** $P$ on $(\Omega,\mathcal{F})$:

- $P:\mathcal{F}\rightarrow [0,1]$;
- $P$ is countably additive on pairwise disjoint sets;
- $P(\emptyset)=0$ and $P(\Omega)=1$.


**Random variable** $X$:

- $(\Omega,\mathcal{F},P)$ probability space;
- $(\mathcal{X},\mathcal{A})$ measurable space;
- Measurable $X:(\Omega,\mathcal{F},P)\rightarrow (\mathcal{X},\mathcal{A})$.

$X$ induces the push-forward probability measure $\mu$ on $\mathcal{X}$: $\mu(A):=X_*P(A)=P(X\in A) := P(X^{-1}(A))$ for any $A\in\mathcal{A}$.
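
As a small illustration of the push-forward measure, consider a fair six-sided die and the random variable that maps each outcome to its parity. The die, the map `X`, and the helper names below are assumptions chosen for this sketch only.

```python
from fractions import Fraction

omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}  # uniform probability on outcomes


def X(w):
    """Illustrative random variable: 1 if the outcome is odd, 0 if even."""
    return w % 2


def preimage(A):
    """X^{-1}(A): the outcomes that X maps into the set A."""
    return {w for w in omega if X(w) in A}


def pushforward(A):
    """mu(A) = P(X^{-1}(A))."""
    return sum((P[w] for w in preimage(A)), Fraction(0))


print(sorted(preimage({1})))   # outcomes mapped to 1
print(pushforward({1}))        # probability that X = 1
print(pushforward({0, 1}))     # total mass of the induced measure
```

The induced measure $\mu$ lives on $\mathcal{X}=\{0,1\}$ and assigns mass $1/2$ to each point, even though the underlying sample space has six outcomes.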


**Probability mass function** $p_X$:

- Finite or countable $\mathcal{X}$;
- $p_X(x):=P(X=x)$ for $x\in\mathcal{X}$.


**Probability density function** $f_X$:

- Uncountable $\mathcal{X}$ (here $\mathcal{X}=\mathbb{R}$);
- Measurable function $f_X:\mathcal{X}\rightarrow[0,+\infty)$;
- $P(a \leq X \leq b) = \int_a^b f_X(x)dx$.

It follows that $\int_\mathbb{R} f_X(x)dx=1$.


### Notable probability distributions


| discrete distribution | *pmf* | mean | variance |
| :--------------------:|:-----:|:----:|:--------:|
| Binomial $$\text{Bin}(n,p)$$ | $$ {n \choose x} p^x (1-p)^{n-x}$$ | $$np$$ | $$np(1-p)$$ |
| Bernoulli $$\text{Bern}(p)$$| $$\begin{cases}p &k=1\\ 1-p&k=0\end{cases}$$ | $$p$$ |$$p(1-p)$$ |
| Discrete Uniform $$\mathcal{U}(a,b)$$ | $$\frac{1}{b-a+1}$$ | $$\frac{b+a}{2}$$ |$$\frac{(b-a+1)^2-1}{12}$$ |
| Geometric $$\text{Geom}(p)$$ | $$(1-p)^{k-1}p$$ |$$\frac{1}{p}$$|$$\frac{1-p}{p^2}$$ |
| Poisson $$\text{Pois}(\lambda)$$ |$$\frac{\lambda^k e^{-\lambda}}{k!}$$|$$\lambda$$ | $$\lambda$$ |

where:
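- $n\in\{0,1,2,...\}$
- $p \in [0,1]$ for the Binomial and Bernoulli distributions, $p \in (0,1]$ for the Geometric distribution
- $a,b\in\mathbb{Z}$ with $b\geq a$
- the support is $x\in\{0,1,\ldots,n\}$ for the Binomial, $k\in\{0,1\}$ for the Bernoulli, $k\in\{1,2,3,...\}$ for the Geometric and $k\in\{0,1,2,...\}$ for the Poisson distribution
- $\lambda \in \mathbb{R}^+$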
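
The mean and variance columns can be sanity-checked by summing directly over the support. The sketch below does this for the Binomial, Geometric and Poisson rows; the parameter values and the truncation points of the infinite sums are arbitrary choices made for illustration.

```python
import math


def moments(pmf, support):
    """Mean and variance of a discrete distribution by direct summation."""
    mean = sum(k * pmf(k) for k in support)
    var = sum((k - mean) ** 2 * pmf(k) for k in support)
    return mean, var


n, p, lam = 10, 0.3, 2.5  # illustrative parameter values


def binom_pmf(k):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def geom_pmf(k):
    return (1 - p) ** (k - 1) * p


def pois_pmf(k):
    return lam**k * math.exp(-lam) / math.factorial(k)


binom_mean, binom_var = moments(binom_pmf, range(0, n + 1))
geom_mean, geom_var = moments(geom_pmf, range(1, 500))  # truncated infinite sum
pois_mean, pois_var = moments(pois_pmf, range(0, 100))  # truncated infinite sum

print(binom_mean, n * p, binom_var, n * p * (1 - p))
print(geom_mean, 1 / p, geom_var, (1 - p) / p**2)
print(pois_mean, lam, pois_var, lam)
```

Each printed line pairs a numerically computed moment with the closed form from the table; the two agree up to floating-point and truncation error.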

| continuous distribution | *pdf* | mean | variance |
| :----------------------:|:-----:|:----:|:--------:|
| Continuous Uniform $$\mathcal{U}(a,b)$$|$$\begin{cases}\frac{1}{b-a} & x \in [a,b]\\0 & \text{otherwise}\end{cases}$$|$$\frac{a+b}{2}$$|$$\frac{(b-a)^2}{12}$$ |
| Exponential $$\text{Exp}(\lambda)$$|$$\lambda e^{-\lambda x}$$|$$1/\lambda$$|$$1/\lambda^2$$ |
| Gaussian $$\mathcal{N}(\mu,\sigma^2)$$|$$\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\big(\frac{x-\mu}{\sigma}\big)^2}$$|$$\mu$$|$$\sigma^2$$|
|Beta $$\text{Beta}(\alpha,\beta)$$|$$\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$$|$$\frac{\alpha}{\alpha+\beta}$$|$$\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$| 
|Gamma $$\text{Gamma}(\alpha, \beta)$$|$$\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$$|$$\frac{\alpha}{\beta}$$|$$\frac{\alpha}{\beta^2}$$|
|Dirichlet $$\text{Dir}(\alpha)$$|$$\frac{1}{B(\alpha)}\prod_{i=1}^{K}x_i^{\alpha_i-1}$$|$$\tilde{\alpha}_i$$|$$\frac{\tilde{\alpha}_i(1-\tilde{\alpha}_i)}{\alpha_0+1}$$ |
|Student's t $$\text{St}(\nu)$$| $$\frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\Gamma(\frac{\nu}{2})}{\Big(1+\frac{x^2}{\nu}\Big)^{-\frac{\nu+1}{2}}}$$ |$$0$$|$$\begin{cases}\frac{\nu}{\nu-2}&\nu>2\\\infty&1<\nu\leq2\end{cases}$$ |

where:
- $b > a$
- $\lambda \in \mathbb{R}^+$
- $\mu\in\mathbb{R}$, $\sigma>0$
- $\alpha,\beta>0$ for the Beta and Gamma distributions
- $\alpha_i>0$ for $i=1,\ldots,K$ for the Dirichlet distribution
- $K\in\mathbb{Z}_{\geq2}$
- $\tilde{\alpha}_i=\frac{\alpha_i}{\sum_{h=1}^K\alpha_h}$, $\alpha_0=\sum_{i=1}^K \alpha_i$
- $\nu>1$
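
The continuous table can be checked in the same spirit with numerical integration. The sketch below uses a plain midpoint rule on a truncated interval; the parameter values, the integration bounds and the helper names are assumptions made for illustration, not a recommended numerical method.

```python
import math


def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))


def moments(pdf, a, b):
    """Total mass, mean and variance of a density restricted to [a, b]."""
    mean = integrate(lambda x: x * pdf(x), a, b)
    var = integrate(lambda x: (x - mean) ** 2 * pdf(x), a, b)
    return integrate(pdf, a, b), mean, var


lam, mu, sigma, alpha, beta = 2.0, 1.0, 0.5, 2.0, 3.0  # illustrative values


def exponential_pdf(x):
    return lam * math.exp(-lam * x)


def gaussian_pdf(x):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


# B(alpha, beta) expressed through the Gamma function.
beta_norm = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)


def beta_pdf(x):
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / beta_norm


# Infinite supports are truncated where the remaining mass is negligible.
exp_total, exp_mean, exp_var = moments(exponential_pdf, 0.0, 40.0)
gauss_total, gauss_mean, gauss_var = moments(gaussian_pdf, mu - 10 * sigma, mu + 10 * sigma)
beta_total, beta_mean, beta_var = moments(beta_pdf, 0.0, 1.0)

print(exp_total, exp_mean, exp_var)        # compare with 1, 1/lam, 1/lam^2
print(gauss_total, gauss_mean, gauss_var)  # compare with 1, mu, sigma^2
print(beta_total, beta_mean, beta_var)     # compare with 1 and the Beta row
```

Each density integrates to approximately one, and the computed moments match the table entries up to the discretisation error of the midpoint rule.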


### Expected value

**Definition**

Let $X:(\Omega,\mathcal{F},P)\longrightarrow (\mathcal{X},\mathcal{A})$ be a random variable.

|values|expectation $\mathbb{E}[X]$|
|:----:|:----------------:|
|finite| $$\sum_{i=1}^k x_i p_X(x_i)$$|
|countable|$$\sum_{i=1}^\infty x_i p_X(x_i)$$|
|continuous|$$\int_{\mathbb{R}}x f_X(x)dx$$|

where $p_X$ is the probability mass function of $X$ in the discrete case and $f_X$ is the probability density function of $X$ in the continuous case. 

**Example: discrete case**

Let $Y$ be a discrete random variable with values in $\{0,1\}$ and let $P(Y=1)=p$. Suppose we want to compute the expectation $\mathbb{E}[|Y-p|]$.

From the definition of expectation we know that, in the discrete case, we just need to multiply each possible value that the random variable can assume by its probability of occurring:

$$
\mathbb{E}[|Y-p|] = |1-p|\cdot P(Y=1) + |0-p|\cdot P(Y=0) = (1-p)p + p(1-p) = 2p(1-p)
$$
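
The same computation can be reproduced exactly with rational arithmetic. In this sketch the value $p=3/10$ and the helper `expectation` are illustrative assumptions.

```python
from fractions import Fraction


def expectation(g, pmf):
    """E[g(Y)] for a discrete Y given as a {value: probability} dictionary."""
    return sum(g(y) * prob for y, prob in pmf.items())


p = Fraction(3, 10)          # illustrative value of P(Y = 1)
pmf_Y = {1: p, 0: 1 - p}

value = expectation(lambda y: abs(y - p), pmf_Y)
print(value, 2 * p * (1 - p))
```

Both printed values coincide, in agreement with the closed form $2p(1-p)$.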

**Example: continuous case**

Let the pdf of $X$ be 

$$f(x)=\begin{cases}cx^2(1-x) & 0\leq x \leq 1\\ 0 & \text{otherwise}\end{cases}$$

We want to determine $c\in\mathbb{R}$ such that $f(x)$ is a valid *pdf*:

$$1 =\int_0^1 cx^2(1-x)dx = c \int_0^1 (x^2-x^3)dx = c\Big[\frac{x^3}{3}-\frac{x^4}{4}\Big]_0^1 = \frac{c}{12} $$

$$\Longrightarrow c=12$$

Now we compute the expected value of $X$:

$$\mathbb{E}[X]=12\int_0^1x^3(1-x)dx=12\Big[\frac{x^4}{4}-\frac{x^5}{5}\Big]_0^1=\frac{3}{5}$$
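
Because the integrands are polynomials, both steps can be verified exactly. The sketch below represents a polynomial as a `{power: coefficient}` dictionary, a convention assumed here purely for illustration.

```python
from fractions import Fraction


def integrate_01(poly):
    """Exact integral over [0, 1] of a polynomial given as {power: coefficient}."""
    # The integral of x^k over [0, 1] is 1 / (k + 1).
    return sum(Fraction(coef) / (power + 1) for power, coef in poly.items())


unnormalised = {2: 1, 3: -1}            # x^2 (1 - x) = x^2 - x^3
c = 1 / integrate_01(unnormalised)      # normalising constant

x_times_pdf = {3: c, 4: -c}             # x * c (x^2 - x^3)
mean = integrate_01(x_times_pdf)

print(c, mean)
```

The result reproduces $c=12$ and an expected value of $3/5$.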

### Covariance and correlation

**Covariance** measures how $X$ and $Y$ vary together.
It is defined as $\text{cov}(X,Y)=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] = \mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y].$

The covariance of a random variable with itself is called **variance**: $\text{var}(X)=\text{cov}(X,X)=\mathbb{E}[(X-\mathbb{E}[X])^2]$.

The **correlation coefficient** between $X$ and $Y$ is the normalized covariance: $\displaystyle{\rho=\frac{\text{cov}(X,Y)}{\sqrt{\text{var}(X)\text{var}(Y)}}}$. 

The coefficient always satisfies $-1\leq\rho\leq 1$. The two variables are said to be *perfectly correlated* when $\rho=1$ and *perfectly anti-correlated* when $\rho=-1$.
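
A minimal sketch of these definitions, applied to the empirical distribution of a few equally weighted points. The data and the helper names are illustrative assumptions; the two extreme cases arise when one variable is an increasing or decreasing linear function of the other.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cov(xs, ys):
    """Covariance under the empirical distribution (equal weights)."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def corr(xs, ys):
    """Correlation coefficient: covariance normalised by the standard deviations."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
increasing = [2 * x + 1 for x in xs]   # increasing linear function of xs
decreasing = [-3 * x for x in xs]      # decreasing linear function of xs
other = [2.0, 1.0, 4.0, 3.0, 5.0]      # related to xs, but not linearly

# cov(X, Y) = E[XY] - E[X]E[Y]
shortcut = mean([x * y for x, y in zip(xs, other)]) - mean(xs) * mean(other)

print(corr(xs, increasing), corr(xs, decreasing), corr(xs, other))
print(cov(xs, other), shortcut)
```

The first two correlations sit at the extremes of the admissible range, the third lies strictly inside it, and the last line checks the two equivalent expressions for the covariance.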


### Marginal and conditional distributions

**Definitions**

Multiple random variables $X_1,\ldots, X_N$ on the same probability space define a **multivariate random variable**, whose **joint probability mass function** is -- in the discrete case:

$$p_{X_1,\ldots, X_N}(x_1,\ldots, x_N)=P(X_1=x_1,\ldots,X_N=x_N)$$


In the continuous case, the **joint probability density function** $f_{X_1,\ldots, X_N}$ is the function that satisfies:

$$P(X_1\in[a_1,b_1],\ldots, X_N\in[a_N,b_N])=\int_{a_1}^{b_1}\ldots\int_{a_N}^{b_N}f_{X_1,\ldots, X_N}(x_1,\ldots,x_N)dx_1\ldots dx_N$$


![](https://github.com/emaballarin/probml-units/blob/main/notebooks/img/multivariate_normal_sample.png?raw=1)
<br><sub><sup>From <a href="https://en.wikipedia.org/wiki/Joint_probability_distribution">Wikipedia: Joint probability distribution</a></sup></sub>


In the bivariate case, for example, we can derive marginal and conditional distributions from the joint distribution as follows: 

|$X$ values|marginal distribution| conditional distribution|
|:--------:|:-------------------------------------------------:|:-----------------------:|
| discrete | $$p_X(x)=\sum_{y\in{\mathcal{X}_Y}}p_{X,Y}(x,y)$$ | $$ p_{Y\mid X}(y\mid x) = \frac{p_{X,Y}(x,y)}{p_X(x)} $$ |
| continuous | $$f_X(x)=\int_{\mathcal{X}_Y}f_{X,Y}(x,y)dy$$ | $$ f_{Y\mid X}(y\mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)} $$ |

These definitions easily extend to the multivariate case.


Two *r.v.*s $X,Y$ are **independent** if and only if their joint density (or joint pmf, in the discrete case) equals the product of the marginals for all $x,y$:

$$f_{X,Y}(x,y)=f_X(x)f_Y(y).$$

**Example: marginal and conditional from the joint**

Let $X$ and $Y$ be two discrete random variables with joint probability distribution
$$
p(x,y) = \frac{1}{21}(x+y)
$$
for $x=1,2,3$ and $y=1,2$.

The marginal distribution of $X$ is:
$$
p_X(x) = \sum_{y=1}^2 p(x,y) = \sum_{y=1}^2 \frac{1}{21}(x+y) = \frac{1}{21}(2x+3)
$$
for $x=1,2,3$.

The conditional distribution of $Y$ given $X=1$ is:

$$
p_{Y|X}(y|1)=\frac{p(1,y)}{p_X(1)}= \frac{\frac{1}{21}(1+y)}{\frac{5}{21}} = \frac{1}{5}(1+y)
$$

for $y=1,2$.
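
The example can be reproduced exactly with rational arithmetic. The sketch below also computes the marginal of $Y$ and tests the factorisation criterion for independence; the variable names are illustrative.

```python
from fractions import Fraction

xs, ys = [1, 2, 3], [1, 2]
joint = {(x, y): Fraction(x + y, 21) for x in xs for y in ys}

# Marginals: sum the joint pmf over the other variable.
p_X = {x: sum(joint[x, y] for y in ys) for x in xs}
p_Y = {y: sum(joint[x, y] for x in xs) for y in ys}

# Conditional pmf of Y given X = 1.
p_Y_given_X1 = {y: joint[1, y] / p_X[1] for y in ys}

# Independence would require the joint to factorise for every pair (x, y).
independent = all(joint[x, y] == p_X[x] * p_Y[y] for x in xs for y in ys)

print(p_X)
print(p_Y_given_X1)
print(independent)
```

The marginal and conditional match the formulas $\frac{1}{21}(2x+3)$ and $\frac{1}{5}(1+y)$, and the factorisation test fails, so $X$ and $Y$ are not independent in this example.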


### Conditional independence

The **cumulative distribution function** of $X$ is defined as $F_X(x)=P(X\leq x)$.

Two random variables $X$ and $Y$ are **conditionally independent** given $Z$ if and only if 

$$
F_{X,Y|Z=z}(x,y)=F_{X|Z=z}(x)\cdot F_{Y|Z=z}(y)
$$

for all $x,y,z$, where $F_{X,Y|Z=z}(x,y)=P(X\leq x, Y\leq y|Z=z)$ is the conditional c.d.f. of $X,Y|Z$.
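
To make the definition concrete, the sketch below builds a joint pmf of three binary variables of the form $p(z)\,p(x|z)\,p(y|z)$ and checks the c.d.f. factorisation for every $x,y,z$. All numerical values and helper names are assumptions chosen for illustration. The same check on the unconditional c.d.f.s fails, showing that conditional independence does not imply independence.

```python
from fractions import Fraction as F
from itertools import product

p_Z = {0: F(1, 2), 1: F(1, 2)}
# Conditional pmfs, indexed as table[z][value]; illustrative numbers.
p_X_given_Z = {0: {0: F(9, 10), 1: F(1, 10)}, 1: {0: F(1, 10), 1: F(9, 10)}}
p_Y_given_Z = {0: {0: F(4, 5), 1: F(1, 5)}, 1: {0: F(1, 5), 1: F(4, 5)}}

joint = {(x, y, z): p_Z[z] * p_X_given_Z[z][x] * p_Y_given_Z[z][y]
         for x, y, z in product([0, 1], repeat=3)}


def prob(event):
    """Probability of the set of (x, y, z) triples satisfying `event`."""
    return sum(p for xyz, p in joint.items() if event(*xyz))


def cond_cdf_xy(a, b, z):
    """P(X <= a, Y <= b | Z = z)."""
    return prob(lambda x, y, w: x <= a and y <= b and w == z) / prob(lambda x, y, w: w == z)


def cond_cdf_x(a, z):
    return prob(lambda x, y, w: x <= a and w == z) / prob(lambda x, y, w: w == z)


def cond_cdf_y(b, z):
    return prob(lambda x, y, w: y <= b and w == z) / prob(lambda x, y, w: w == z)


conditionally_independent = all(
    cond_cdf_xy(a, b, z) == cond_cdf_x(a, z) * cond_cdf_y(b, z)
    for a, b, z in product([0, 1], repeat=3)
)
marginally_independent = all(
    prob(lambda x, y, w: x <= a and y <= b)
    == prob(lambda x, y, w: x <= a) * prob(lambda x, y, w: y <= b)
    for a, b in product([0, 1], repeat=2)
)

print(conditionally_independent, marginally_independent)
```

Here $Z$ acts as a common cause: once $Z$ is known, $X$ and $Y$ carry no further information about each other, yet marginally they remain dependent.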


### Law of total probability

Let $\{B_n\}_{n\in I}$ be a finite or countable partition of the sample space into events with $P(B_n)>0$. Then, for any event $A$ in the same probability space:

$$P(A) = \sum_{n\in I} P(A,B_n) = \sum_{n \in I}P(A|B_n)P(B_n).$$

In other words, one can compute the probability of an event $A$ by conditioning on all the possible cases belonging to a partition of the sample space.
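
A small numerical illustration, with made-up numbers: suppose three machines $B_1,B_2,B_3$ produce all the items of a factory, and $A$ is the event that an item is defective. The production shares and defect rates below are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical production shares P(B_n); they form a partition, so they sum to 1.
p_B = {"B1": F(1, 2), "B2": F(3, 10), "B3": F(1, 5)}
# Hypothetical defect rates P(A | B_n).
p_A_given_B = {"B1": F(1, 100), "B2": F(1, 50), "B3": F(1, 20)}

# Joint probabilities P(A, B_n) = P(A | B_n) P(B_n).
p_A_and_B = {b: p_A_given_B[b] * p_B[b] for b in p_B}

# Law of total probability: sum the joint probabilities over the partition.
p_A = sum(p_A_and_B.values())

print(p_A)
```

The overall defect probability is the average of the per-machine defect rates, weighted by how much each machine produces.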

## References
-  [J. Jacod, P. Protter, "Probability Essentials"](https://zero.sci-hub.ru/6098/787f72eac157546be3d98fcc129b8ba6/jacod2004.pdf)