In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## Basic setup

Create anaconda environment
<br>
```bash
conda create -n ml python=3.7.4 jupyter
```
Install fastai library
<br>
```bash
conda install -c pytorch -c fastai fastai
```

# Sigma-algebras

Let $X$ be a set and $\mathcal{F} \subset 2^X$ with properties:
- $X \in \mathcal{F}$
- if $A \in \mathcal{F}$ then $X - A \in \mathcal{F}$
- for every sequence $(A_n)_{n = 1}^{\infty}$ such that $A_n \in \mathcal{F}$ for all $n \ge 1$: $\bigcup_{n=1}^{\infty}A_n \in \mathcal{F}$
<br>

A family $\mathcal{F}$ with these properties is called a <b>$\sigma$-algebra</b> on $X$

From the above properties we can easily conclude that for every $\sigma$-algebra $\mathcal{F}$ on the set $X$:
- $\emptyset \in \mathcal{F}$
- for every sequence $(A_n)_{n = 1}^{\infty}$ such that $A_n \in \mathcal{F}$ for all $n \ge 1$: $\bigcap_{n=1}^{\infty}A_n \in \mathcal{F}$

For every set $X$ the smallest $\sigma$-algebra is $\{X, \emptyset\}$ and the largest is $2^X$
<br>

The elements of a $\sigma$-algebra are called measurable sets, and the pair $(X, \mathcal{F})$ is called a measurable space

A function $f:(X, \mathcal{F}_X) \to (Y, \mathcal{F}_Y)$ between two measurable spaces is called a measurable function if for every $F \in \mathcal{F}_Y$ we have $f^{-1}(F) \in \mathcal{F}_X$
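
The $\sigma$-algebra axioms above can be checked mechanically when $X$ is finite. The following is a minimal sketch, not a general tool: the helper name `is_sigma_algebra` and the example families are assumptions made for illustration. On a finite set, closure under pairwise unions already implies closure under countable unions, which is what the last check relies on.

```python
from itertools import combinations


def is_sigma_algebra(X, family):
    """Check the sigma-algebra axioms for a family of subsets of a finite set X."""
    X = frozenset(X)
    family = {frozenset(A) for A in family}
    # Axiom 1: the whole set belongs to the family.
    if X not in family:
        return False
    # Axiom 2: closed under complement.
    if any(X - A not in family for A in family):
        return False
    # Axiom 3: for a finite family, pairwise unions are enough.
    return all(A | B in family for A, B in combinations(family, 2))


X = {1, 2, 3, 4}
trivial = [set(), X]                    # the smallest sigma-algebra
generated = [set(), {1, 2}, {3, 4}, X]  # generated by the set {1, 2}
not_closed = [set(), {1}, X]            # the complement {2, 3, 4} is missing

print(is_sigma_algebra(X, trivial))     # True
print(is_sigma_algebra(X, generated))   # True
print(is_sigma_algebra(X, not_closed))  # False
```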

## Measure

Let $(X, \mathcal{F})$ be a measurable space. A function $\mu:\mathcal{F} \to [0, \infty]$ is called a <b>measure</b> if:
- $\mu(\emptyset) = 0$
- For pairwise disjoint sets $(E_k)_{k=1}^\infty$ in $\mathcal{F}$ the following holds: $\mu\left(\bigcup_{k=1}^\infty E_k\right)=\sum_{k=1}^\infty \mu(E_k)$
<br>

Properties:
- if $A \subset B$ then $\mu(A) \le \mu(B)$
<br>
Proof: since $A \subset B$ we have $A \cap B = A$, so $B = (B - A) \cup A$ and $(B - A) \cap A = \emptyset$. By additivity $\mu(B) = \mu(B - A) + \mu(A)$, and since $\mu(B - A) \ge 0$ we get $\mu(A) \le \mu(B)$

The triple $(X, \mathcal{F}, \mu)$ is called a measure space

## Probability

Let $\Omega$ be the sample space and let $\mathcal{F} \subset 2^{\Omega}$ be a $\sigma$-algebra on it; the elements of $\mathcal{F}$ are called events.

A measure $P:\mathcal{F} \to [0, \infty]$ is called a probability if $P(\Omega) = 1$. By monotonicity it then takes values in $[0, 1]$

Properties:
- $P(\emptyset) = 0$
- if $A \subset B$ then $P(A) \le P(B)$
- $0 \le P(A) \le 1$
- $P(A^c) = 1 - P(A)$ where $A^{c} = \Omega - A$
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
<br>

For events $A \subset \Omega$ and $B \subset \Omega$:
<br>
$A \cap B$ is interpreted as the event that $A$ and $B$ occur simultaneously, and is sometimes denoted $AB$
<br>
$A \cup B$ means that $A$ or $B$ occurs

#### Example (Uniform probability distribution):
Let $\Omega$ be a finite sample space in which every outcome is equally likely. Then for $A \subset \Omega$:
$$
P(A) = \frac{|A|}{|\Omega|}
$$
<br>
For instance, for a single toss of a fair coin we have $\Omega = \{H, T\}$ and $|\Omega| = 2$. For the event $A = \{H\}$ we have $|A| = 1$, so the probability of heads is $P(A) = \frac{|A|}{|\Omega|} = \frac{1}{2}$
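
The counting formula $P(A) = |A| / |\Omega|$ translates directly into code. This is a small sketch using exact fractions; the function name `uniform_probability` and the die example are assumptions added for illustration.

```python
from fractions import Fraction


def uniform_probability(event, sample_space):
    """P(A) = |A| / |Omega| when all outcomes are equally likely."""
    event, sample_space = set(event), set(sample_space)
    if not event <= sample_space:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))


coin = {"H", "T"}
die = {1, 2, 3, 4, 5, 6}  # hypothetical second example: a fair die

print(uniform_probability({"H"}, coin))     # 1/2
print(uniform_probability({2, 4, 6}, die))  # 1/2
print(uniform_probability({1, 2}, die))     # 1/3
```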

#### Independent events:

If we toss a fair coin twice, the tosses do not influence each other, so the probability of two heads is 
$$P(\{(H, H)\}) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$$
<br>
#### Definition:
Two events $A$ and $B$ are independent if
$$P(A \cap B) = P(A)P(B)$$
<br>
A family of events $(A_i)_{i \in I}$ is independent if for every finite subset $J \subset I$:
$$P\left(\bigcap_{j \in J}A_j\right) = \prod_{j\in J}P(A_j)$$
<br>
Sometimes we can assume that two events are independent. For instance, when tossing a coin, we know that the coin has no memory, so each toss is independent of the others.
In other cases we can verify that $P(AB) = P(A)P(B)$ and conclude that $A$ and $B$ are independent.
<br>
Let $\Omega = \{1, 2, \dots, 6\}$ be the outcomes of rolling a fair die and let $A = \{2, 4, 6\}$ and $B = \{1, 2, 3, 4\}$. Then $AB = A \cap B = \{2, 4\}$ and $P(AB) = \frac{2}{6} = \frac{1}{3}$. On the other hand $P(A) = \frac{3}{6}$ and $P(B) = \frac{4}{6}$, which implies $P(A)P(B) = \frac{3}{6} \cdot \frac{4}{6} = \frac{1}{3} = P(AB)$, so these events are independent
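
The die example can be verified by direct counting. In the sketch below the helper `prob` and the extra event `C` are assumptions introduced only to contrast an independent pair with a dependent one.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # outcomes of one roll of a fair die


def prob(event):
    """Uniform probability of an event on omega."""
    return Fraction(len(event & omega), len(omega))


A = {2, 4, 6}
B = {1, 2, 3, 4}
C = {1, 2, 3}  # hypothetical extra event, not from the text

print(prob(A & B), prob(A) * prob(B))    # 1/3 1/3
print(prob(A & B) == prob(A) * prob(B))  # True: A and B are independent
print(prob(A & C) == prob(A) * prob(C))  # False: 1/6 differs from 1/4
```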

#### Conditional probability:
If $P(B) \gt 0$ then the conditional probability of $A$ given $B$ is:
$$P(A|B) = \frac{P(AB)}{P(B)}$$
<br>
For a fixed event $B \subset \Omega$ such that $P(B) \gt 0$, define the function $P_B = P(\cdot|B):\mathcal{F} \to [0, 1]$. Then $P_B$ is a probability measure on $\mathcal{F}$:
<br>
- $P(\Omega|B) = 1$, and in particular $P(B|B) = 1$
- $P(\bigcup_{i=1}^{\infty}A_i|B) = \sum_{i=1}^{\infty}P(A_i|B)$ for pairwise disjoint sets $(A_i)_{i=1}^{\infty}$
<br>

#### Example: 
Suppose we have a medical test for a disease $D$ with outcomes $+$ and $-$, and the joint probabilities of test result and disease status are:

<table>
    <tr>
        <td>
        </td>
        <td>
        $D$
        </td>
        <td>
        $D^c$
        </td>
    </tr>
    <tr>
        <td>
        $+$
        </td>
        <td>
        $0.0081$
        </td>
        <td>
        $0.0900$
        </td>
    </tr>
    <tr>
        <td>
        $-$
        </td>
        <td>
        $0.0009$
        </td>
        <td>
        $0.9010$
        </td>
    </tr>
</table>

Then:
$$P(+|D) = \frac{P(+, D)}{P(D)} = \frac{0.0081}{0.0081 + 0.0009} = 0.9$$
<br>
$$P(-|D^c) = \frac{P(-, D^c)}{P(D^c)} = \frac{0.9010}{0.0900 + 0.9010} \approx 0.9$$

In [1]:
p1 = (0.0081) / (0.0081 + 0.0009)
p2 = (0.9010) / (0.0900 + 0.9010)
p1, p2

(0.9, 0.9091826437941474)

The test looks accurate: sick people test positive with probability $0.9$ ($90\%$), and healthy people test negative with probability of about $0.9$ ($90\%$).
<br>
The question is: if a person tests positive ($+$), what is the probability that they have the disease?
<br>
The intuitive answer is $0.9$, but it is wrong.

$$
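P(D|+) = \frac{P(D, +)}{P(+)} = \frac{0.0081}{0.0081 + 0.0900} = \frac{0.0081}{0.0981} \approx 0.083
$$

So fewer than one in ten positive results corresponds to an actual case, because the disease itself is rare: $P(D) = 0.0081 + 0.0009 = 0.009$.

In [2]:
p3 = (0.0081) / (0.0081 + 0.0900)
round(p3, 4)

0.0826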

If $A$ and $B$ are independent events, then 
$$P(A|B) = \frac{P(AB)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A)$$
<br>
Conversely, if
$$P(A|B) = P(A)$$
then $P(AB) = P(A|B)P(B) = P(A)P(B)$, so $A$ and $B$ are independent events

## Bayes' theorem:

A partition of a set $X$ is a family of subsets $\{A_i \subset X \mid i \in I\}$ such that $X = \bigcup_{i \in I}A_i$ and $A_i \cap A_j = \emptyset$ for every pair $i,j \in I$ with $i \ne j$

For each subset $A \subset X$ we have the partition $X = A \cup A^c$

#### Total probability theorem:
For every partition $A_1, A_2, \dots, A_k$ of $\Omega$ with $P(A_i) > 0$ for each $i$ and every event $B \subset \Omega$:
$$P(B) = \sum_{i=1}^{k}P(B|A_i)P(A_i)$$

#### Theorem (Bayes' theorem):
Let $A_1, A_2, \dots, A_k$ be a partition of $\Omega$ such that $P(A_i) > 0$ for each $i \in \{1, 2, \dots, k\}$. Then for every event $B \subset \Omega$ such that $P(B) > 0$ and for each $i \in \{1, 2, \dots, k\}$:
$$
P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{k}P(B|A_j)P(A_j)}
$$

#### Note:
We call $P(A_i)$ the prior probability and $P(A_i|B)$ the posterior probability

For the events $A$ and $B$ such that $P(B) \gt 0$ we have:
$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$
<br>
To see this, consider the partition of $\Omega$ into $A$ and $A^c$; then from Bayes' theorem we have:
$$
P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^c)P(A^c)} = \text{(by the total probability) }\frac{P(B|A)P(A)}{P(B)}
$$

#### Example:
Divide emails into three categories: $A_1 = \text{"spam"}$, $A_2 = \text{"low priority"}$ and $A_3 = \text{"high priority"}$, and let $P(A_1) = 0.7$, $P(A_2) = 0.2$ and $P(A_3) = 0.1$. ($P(A_1) + P(A_2) + P(A_3) = 0.7 + 0.2 + 0.1 = 1$)
<br>
Let $B$ be the event that an email contains the word "free", and suppose we know from previous experience that $P(B|A_1) = 0.9$, $P(B|A_2) = 0.01$ and $P(B|A_3) = 0.01$.
<br>
If we receive an email with the word "free" in it, what is the probability that this email is spam?
From Bayes' theorem:
$$
P(A_1|B) = \frac{P(B|A_1)P(A_1)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + P(B|A_3)P(A_3)} = \frac{0.9 \cdot 0.7}{0.9 \cdot 0.7 + 0.01 \cdot 0.2 + 0.01 \cdot 0.1} \approx 0.995
$$
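
The spam computation can be reproduced in a few lines. This is a sketch under the numbers given above; the function name `posterior` is an assumption, and the denominator is $P(B)$ computed with the total probability theorem.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for each i, given P(A_i) and P(B | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(B), by the total probability theorem
    if evidence == 0:
        raise ValueError("P(B) must be positive")
    return [j / evidence for j in joint]


priors = [0.7, 0.2, 0.1]         # spam, low priority, high priority
likelihoods = [0.9, 0.01, 0.01]  # P("free" appears | category)

post = posterior(priors, likelihoods)
print([round(p, 4) for p in post])  # [0.9953, 0.0032, 0.0016]
```

The posteriors sum to one because the categories form a partition of $\Omega$.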

## Random variable

A random variable is a measurable mapping $X:\Omega \to \mathbb{R}$ which assigns a number to each outcome.
<br>
For instance, a coin toss can be encoded by $X:\{H, T\} \to \{0, 1\}$.
<br>
For a random variable $X$ define:
$$
P(X \in A) = P(X^{-1}(A)) = P(\{\omega \in \Omega |X(\omega) \in A\})
$$
<br>
$$
P(X = x) = P(X^{-1}(x)) = P(\{\omega \in \Omega |X(\omega) = x\})
$$

#### The cumulative distribution function:
$$F_X:\mathbb{R} \to [0, 1]$$
of random variable $X$ is defined by
$$
F_X(x) = P(X \le x)
$$

#### Theorem:
A function $F:\mathbb{R} \to [0, 1]$ is the CDF of some random variable if and only if:
- $F$ is non-decreasing: for each $x_1 \lt x_2$ we have $F(x_1) \le F(x_2)$
- $F$ is normalized: $\lim_{x \to -\infty}F(x) = 0$ and $\lim_{x \to \infty}F(x) = 1$
- $F$ is right-continuous: $\lim_{y \to x^+}F(y) = F(x)$ for each $x \in \mathbb{R}$

#### Definition:
$X$ is discrete if it takes countably many values:
$$\{x_1, x_2, \dots\}$$
<br>
We define the probability mass function (PMF):
$f_X(x)=P(X=x)$

From the above we have:
- $f_X(x) \ge 0$ for each $x \in \mathbb{R}$
- $\sum_{x}f_X(x) = 1$
<br>
$$
F_X(x) = P(X \le x) = \sum_{x_i \le x}f_X(x_i)
$$
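
The relation between the PMF and the CDF of a discrete random variable can be illustrated with a fair die. This is a minimal sketch; the dictionary `pmf` and the helper `cdf` are names chosen for the example.

```python
from fractions import Fraction

# PMF of a fair six-sided die: f_X(x) = 1/6 for x = 1, ..., 6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def cdf(x, pmf):
    """F_X(x) = P(X <= x), the sum of f_X(x_i) over all x_i <= x."""
    return sum((p for xi, p in pmf.items() if xi <= x), Fraction(0))


print(sum(pmf.values()))  # 1
print(cdf(0, pmf), cdf(3, pmf), cdf(3.5, pmf), cdf(6, pmf))  # 0 1/2 1/2 1
```

The CDF is a step function: it is constant between the values $x_i$ and jumps by $f_X(x_i)$ at each of them.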

#### Definition:
$X$ is continuous if there exists a function $f_X:\mathbb{R} \to \mathbb{R}$ such that:
$$f_X(x) \ge 0 \text{ for all } x \in \mathbb{R}$$
$$\int_{-\infty}^{\infty}f_X(x)\,dx = 1$$
$$P(a \lt X \lt b) = \int_{a}^{b}f_X(x)\,dx \text{ for all } a \lt b$$
<br>
The function $f_X$ is called the probability density function (PDF), and we have that 
$$F_X(x)=\int_{-\infty}^{x}f_X(t)dt$$
and
$$f_X(x) = F_X'(x) \text{ for all points } x \in \mathbb{R} \text{ where } F_X \text{ is differentiable}$$

If $X$ is a continuous random variable, then $P(X = x) = 0$ for each $x \in \mathbb{R}$, so $f_X(x)$ does not mean $P(X = x)$; that interpretation only holds for discrete random variables

#### Some discrete distributions:

- Discrete uniform $X \sim \operatorname{Uniform}(k)$:
$$f(x) = 
\begin{cases}
\frac{1}{k}, & \text{ for } x = 1, 2, \dots, k \\
0 & \text{ otherwise}
\end{cases}$$
<br>
- Bernoulli $X \sim \operatorname{Bernoulli}(p)$: Suppose an experiment has two outcomes, $1$ and $0$, with probabilities $p$ and $1 - p$:
$$
f(x) = p^x(1- p)^{1 - x} \text{ for } x \in \{0, 1\}
$$
Example: a single flip of a coin
<br>
- Binomial distribution $X \sim \operatorname{Binomial}(n, p)$: Suppose we run $n$ independent experiments, each with two outcomes with probabilities $p$ and $1 - p$, and let $X$ count the outcomes of the first kind
$$
f(x) = 
\begin{cases}
{n\choose x} p^x(1-p)^{n-x}, & \text{ for } x = 0, 1, \dots, n \\
0 & \text{ otherwise}
\end{cases}
$$
or, writing $C_n^x$ for the binomial coefficient ${n\choose x}$,
$$
f(x) = 
\begin{cases}
C_n^x p^x(1-p)^{n-x}, & \text{ for } x = 0, 1, \dots, n \\
0 & \text{ otherwise}
\end{cases}
$$
<br>
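- Geometric distribution $X \sim \operatorname{Geom}(p)$: Suppose we repeat independent experiments with two outcomes with probabilities $p$ and $1 - p$ until the outcome of the first kind occurs, and let $X$ be the number of experiments needed
$$
P(X=k) = p(1-p)^{k-1} \text{ for } k = 1, 2, \dots
$$
Example: the number of flips needed until the first head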
<br>
- Hypergeometric distribution $X \sim \operatorname{Hypergeometric}(N,K,n)$:
$$
p_X(k) = \Pr(X = k) 
= \frac{\binom{K}{k} \binom{N - K}{n-k}}{\binom{N}{n}}
$$
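
The binomial and geometric mass functions can be sanity-checked numerically. The sketch below uses the standard binomial exponent $n - x$ on the $(1-p)$ factor and builds the binomial coefficient from `math.factorial`, so it also runs on the Python 3.7 environment created in the setup. The function names and the parameters `n = 10`, `p = 0.3` are assumptions for illustration.

```python
from math import factorial


def choose(n, k):
    """Binomial coefficient n! / (k! (n - k)!)."""
    return factorial(n) // (factorial(k) * factorial(n - k))


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    if not (isinstance(x, int) and 0 <= x <= n):
        return 0.0
    return choose(n, x) * p**x * (1 - p) ** (n - x)


def geometric_pmf(k, p):
    """P(X = k) for X ~ Geom(p): first success on trial k."""
    if not (isinstance(k, int) and k >= 1):
        return 0.0
    return p * (1 - p) ** (k - 1)


n, p = 10, 0.3
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
print(round(total, 10), round(mean, 10))  # 1.0 3.0

# The geometric PMF sums to 1 over k = 1, 2, ...; the tail beyond 200 is negligible.
geom_total = sum(geometric_pmf(k, p) for k in range(1, 200))
print(round(geom_total, 10))  # 1.0
```

The mean of the binomial comes out as $np$, which is a quick check that the exponent on $(1-p)$ is $n - x$.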

#### Some continuous distributions:

- Uniform $X \sim \operatorname{Uniform}(a, b)$:
$$
f(x) = 
\begin{cases}
\frac{1}{b-a}, & \text{ for } x \in [a, b] \\
0 & \text{ otherwise}
\end{cases}
$$
<br>
- Normal (Gaussian) $X \sim \operatorname{N}(\mu, \sigma^2)$ with parameters $\mu \in \mathbb{R}$ and $\sigma \gt 0$:
$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}}\exp{(-\frac{1}{2\sigma^2}(x - \mu)^2)}
$$
We say that $X$ has a standard normal distribution if $\mu=0$ and $\sigma=1$.
<br>
Some properties of the normal distribution:
<br>
if $X \sim \operatorname{N}(\mu, \sigma^2)$ then 
$$Z = \frac{X - \mu}{\sigma} \sim \operatorname{N}(0, 1) \text{ (standardization)}$$
<br>
if $X \sim \operatorname{N}(0, 1)$ then:
$$Z = \mu + \sigma X \sim \operatorname{N}(\mu, \sigma^2)$$
<br>
If $X_i \sim \operatorname{N}(\mu_i, \sigma_i^2)$ for $i = 1, 2, \dots, n$ are independent, then:
$$\sum_{i=1}^n X_i \sim \operatorname{N}(\sum_{i=1}^n\mu_i, \sum_{i=1}^n\sigma_i^2)$$
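
Standardization is what lets a single function handle every normal distribution. The sketch below expresses the standard normal CDF through `math.erf` and reduces $\operatorname{N}(\mu, \sigma^2)$ to it via $Z = (X - \mu)/\sigma$. The function names and the parameters $\mu = 100$, $\sigma = 15$ are assumptions for illustration.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Phi(z) = P(Z <= z) for Z ~ N(0, 1), written with the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), using Z = (X - mu) / sigma."""
    return standard_normal_cdf((x - mu) / sigma)


mu, sigma = 100.0, 15.0  # hypothetical parameters
within_one_sigma = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)
print(round(within_one_sigma, 4))  # 0.6827
```

The result does not depend on the chosen $\mu$ and $\sigma$, because after standardization the interval is always $[-1, 1]$.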

#### Bivariate Distributions:

For two discrete random variables $X$ and $Y$ define the joint mass function:
$$f(x, y) = P(X = x \text{ and } Y = y) = P(X = x, Y = y)$$
<br>
For two continuous random variables $X$ and $Y$ we call a function $f(x, y)$ their joint PDF if:
$$f(x, y) \ge 0 \text{ for all } (x, y)$$
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$$
<br>
and for any set $A \subset \mathbb{R} \times \mathbb{R}$:
$$P((X, Y) \in A) = \iint_A f(x,y) \,dx\,dy$$

#### Independent random variables:

Two random variables $X$ and $Y$ are independent if for all sets $A$ and $B$:
$$
P(X \in A, Y \in B) = P(X \in A)\cdot P(Y \in B)
$$

#### Conditional probability mass function:

For two discrete random variables $X$ and $Y$ define the conditional probability mass function:
$$f_{X|Y}(x|y) = P(X = x | Y = y) = \frac{P(X = x, Y =y)}{P(Y = y)} = \frac{f_{X, Y}(x, y)}{f_Y(y)}$$
<br>
For two continuous random variables $X$ and $Y$ the conditional PDF is:
$$f_{X|Y}(x|y) = \frac{f_{X, Y}(x, y)}{f_Y(y)}$$
assuming that $f_Y(y) \gt 0$, and then
$$P(X \in A | Y = y) = \int_A f_{X|Y}(x|y)\,dx$$
<br>

## Multivariate distributions:

Let $X = (X_1, X_2, \dots, X_n)$ where $X_1, X_2, \dots, X_n$ are random variables; we call $X$ a random vector. Let $f(x_1, x_2, \dots, x_n)$ denote its joint PDF.
<br>
We say that $X_1, X_2, \dots, X_n$ are independent if:
$$
P(X_1 \in A_1, X_2 \in A_2, \dots, X_n \in A_n) = \prod_{i=1}^nP(X_i \in A_i)
$$


- Multinomial (the generalization of the binomial to $k$ outcomes) $X \sim \operatorname{Multinomial}(n, p)$ with $p = (p_1, \ldots, p_k)$:
$$
\begin{align}
f(x_1,\ldots,x_k;n,p_1,\ldots,p_k) & {} = \Pr(X_1 = x_1 \text{ and } \dots \text{ and } X_k = x_k) \\
& {} = \begin{cases} { \displaystyle {n! \over x_1!\cdots x_k!}p_1^{x_1}\times\cdots\times p_k^{x_k}}, \quad &
\text{when } \sum_{i=1}^k x_i=n \\  \\
0 & \text{otherwise,} \end{cases}
\end{align}
$$
<br>
- Multivariate normal $X \sim \operatorname{N}(\mu, \Sigma)$:
$$
f_{\mathbf X}(x) = f_{\mathbf X}(x_1,\ldots,x_k) = \frac{\exp\left(-\frac 1 2 ({\mathbf x}-{\boldsymbol\mu})^\mathrm{T}{\boldsymbol\Sigma}^{-1}({\mathbf x}-{\boldsymbol\mu})\right)}{\sqrt{(2\pi)^k|\boldsymbol\Sigma|}}
$$

## Conditional distribution

Suppose $X$ and $Y$ are discrete random variables and we observe that $Y = y$. Then the conditional probability mass function is:
$$
f_{X|Y}(x|y)=P(X=x|Y=y)=\frac{P(X=x, Y=y)}{P(Y = y)}=\frac{f_{X,Y}(x, y)}{f_Y(y)} \text{ if } f_Y(y) \gt 0
$$
<br>
For continuous random variables $X$ and $Y$ the conditional PDF is:
$$
f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}
$$
<br>
when $f_Y(y) > 0$, and the conditional probability is:
$$
P(X \in A|Y=y) = \int_{A}f_{X|Y}(x|y)\, dx.
$$
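
For discrete variables the conditional mass function is just a renormalized slice of the joint table. The sketch below uses a small hypothetical joint PMF on $\{0, 1\} \times \{0, 1\}$; the table values and the helper names are assumptions made for the example.

```python
from fractions import Fraction as F

# Hypothetical joint PMF f_{X,Y}(x, y), keyed by (x, y).
joint = {(0, 0): F(1, 10), (0, 1): F(2, 10), (1, 0): F(3, 10), (1, 1): F(4, 10)}


def marginal_y(y):
    """f_Y(y): sum the joint PMF over all x."""
    return sum(p for (_, yj), p in joint.items() if yj == y)


def conditional_x_given_y(y):
    """f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y), defined only when f_Y(y) > 0."""
    fy = marginal_y(y)
    if fy == 0:
        raise ValueError("conditioning on an event of probability zero")
    return {x: p / fy for (x, yj), p in joint.items() if yj == y}


print({x: str(p) for x, p in conditional_x_given_y(0).items()})  # {0: '1/4', 1: '3/4'}
print({x: str(p) for x, p in conditional_x_given_y(1).items()})  # {0: '1/3', 1: '2/3'}
```

Each conditional PMF sums to one, as it must for a probability mass function.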

## Expected value

The expected value of a continuous random variable $X$ with PDF $f$ is:

$$\operatorname{E}[X] = \int_{\mathbb{R}} x f(x)\, dx.$$
<br>
For discrete and continuous random variables:
$$
\operatorname{E}[X] = 
\begin{cases}
\sum_x xf(x), & \text{ if } X \text { is discrete} \\
\int_{\mathbb{R}} x f(x)\, dx. & \text{ if } X \text { is continuous}
\end{cases}
$$
<br>
Or in general, for a random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$:
$$
\operatorname{E} [X]  = \int_\Omega X(\omega)\,d\operatorname{P}(\omega)
$$
<br>
For multidimensional case
$$
\operatorname{E}[(X_1,\ldots,X_n)]=(\operatorname{E}[X_1],\ldots,\operatorname{E}[X_n])
$$

#### Properties of expectation:
<br>
$$
\begin{align}
  \operatorname{E}[X + Y] &=   \operatorname{E}[X] + \operatorname{E}[Y], \\
  \operatorname{E}[aX]    &= a \operatorname{E}[X],
\end{align}
$$

Or in general for any sequence of random variables $(X_1, X_2, \dots, X_n)$ and numbers $(a_1, a_2, \dots, a_n)$:
$$\operatorname{E}[\sum_{i=1}^na_iX_i] = \sum_{i=1}^n a_i \operatorname{E}[X_i]$$

Let $X_1, X_2, \dots, X_n$ be independent random variables, then:
<br>
$$
\operatorname{E}[\prod_{i=1}^nX_i] = \prod_{i=1}^n \operatorname{E}[X_i]
$$

The expectation is often denoted by $\mu$
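
Linearity and the product rule for independent variables can be checked exactly on two independent fair dice. This is a sketch by direct enumeration; the variable names are assumptions for the example.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 36)  # probability of each ordered pair for two independent fair dice

E_X = sum(Fraction(x, 6) for x in faces)                    # expectation of one die
E_sum = sum((x + y) * p for x, y in product(faces, faces))  # E[X + Y]
E_prod = sum(x * y * p for x, y in product(faces, faces))   # E[XY]

print(E_X, E_sum, E_prod)   # 7/2 7 49/4
print(E_sum == E_X + E_X)   # True: linearity
print(E_prod == E_X * E_X)  # True: product rule under independence
```

Linearity holds for any pair of random variables, while the product rule relies on independence.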

## Variance and covariance

The variance measures the spread of a distribution and is denoted by $\sigma^2$:

$$
\sigma^2 = \operatorname{Var}(X)=  \operatorname{E}[(X - \operatorname{E}[X])^2]
$$
<br>
assuming that the expectation exists. Expanding the square and using the linearity of expectation gives
$$
\sigma^2 = \operatorname{E}[X^2] - (\operatorname{E}[X])^2
$$
<br>

Standard deviation:
$$\sigma = \sqrt{\operatorname{Var}(X)}$$
<br>
$$\sigma = \sqrt{\sigma^2}$$

#### Properties of variance:
- $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ if $X$ and $Y$ are independent
- $\operatorname{Var}(a \cdot X) = a^2\operatorname{Var}(X)$

#### Covariance and correlation:
Let $X$ and $Y$ be random variables with expectations $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$. Then the covariance between them is:
$$
Cov(X, Y) = \operatorname{E}[(X - \mu_X)(Y - \mu_Y)]
$$
<br>
and correlation is:
$$
\rho = \rho_{X,Y} = \rho(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}
$$

#### Properties:
$$
Cov(X, Y) = \operatorname{E}[XY] - \operatorname{E}[X]\operatorname{E}[Y]
$$
<br>
$$
-1 \le \rho(X, Y) \le 1
$$

#### Examples:
If $Y = aX + b$ then:
$$
\rho(X, Y) = 1 \text{ if } a \gt 0
$$
and
$$
\rho(X, Y) = -1 \text{ if } a \lt 0
$$
<br>
If $X$ and $Y$ are independent then
$$
\rho(X, Y) = 0
$$
<br>
But the converse does not hold in general: $\rho(X, Y) = 0$ does not imply that $X$ and $Y$ are independent

$$
\operatorname{Var}(X + Y)= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2Cov(X,Y)
$$
<br>
$$
\operatorname{Var}(X - Y)= \operatorname{Var}(X) + \operatorname{Var}(Y) - 2Cov(X,Y)
$$
<br>
In general:
$$
\operatorname{Var}(\sum_{i=1}^{n}a_iX_i) = \sum_{i=1}^{n}a_i^2\operatorname{Var}(X_i) + 2\sum_{i=1}^{n}\sum_{j\lt i}a_ia_jCov(X_i, X_j)
$$
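
The correlation examples can be reproduced on a small discrete distribution. In the sketch below $X$ is uniform on $\{-1, 0, 1\}$, which is an assumption made for the example, as are the helper names. It shows $\rho = \pm 1$ for linear functions of $X$, and zero covariance between $X$ and $X^2$ even though $X^2$ is completely determined by $X$.

```python
from fractions import Fraction
from math import sqrt

# X is uniform on {-1, 0, 1}.
pmf = {x: Fraction(1, 3) for x in (-1, 0, 1)}


def expect(g):
    """E[g(X)] for the discrete distribution above."""
    return sum(g(x) * p for x, p in pmf.items())


def cov(g, h):
    """Cov(g(X), h(X)) = E[g(X) h(X)] - E[g(X)] E[h(X)]."""
    return expect(lambda x: g(x) * h(x)) - expect(g) * expect(h)


def corr(g, h):
    """Correlation of g(X) and h(X); both must have positive variance."""
    return float(cov(g, h)) / sqrt(float(cov(g, g)) * float(cov(h, h)))


def identity(x):
    return x


print(round(corr(identity, lambda x: 2 * x + 5), 10))   # 1.0
print(round(corr(identity, lambda x: -3 * x + 1), 10))  # -1.0
print(cov(identity, lambda x: x * x))                   # 0
```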

## Conditional Expectation

For discrete and continuous random variables $X$ and $Y$ we can define the conditional expectation, which is the mean of $X$ given that $Y=y$:
<br>
$$
\operatorname{E}[X|Y=y] = 
\begin{cases}
\sum_x xf_{X|Y}(x|y), & \text{ if } X \text { is discrete} \\
\int_{\mathbb{R}} x f_{X|Y}(x|y)\, dx. & \text{ if } X \text { is continuous}
\end{cases}
$$
<br>

#### The rule of iterated expectation:
For random variables $X$ and $Y$, assuming the expectations exist, we have that:
$$
\operatorname{E}[\operatorname{E}[X|Y]] = \operatorname{E}[X]
$$
<br>
and
<br>
$$
\operatorname{E}[\operatorname{E}[Y|X]] = \operatorname{E}[Y]
$$
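
The rule of iterated expectation says that averaging the conditional means $\operatorname{E}[X|Y=y]$ over the distribution of $Y$ recovers $\operatorname{E}[X]$. The sketch below checks this exactly on a hypothetical joint PMF; the table and the helper names are assumptions for the example.

```python
from fractions import Fraction as F

# Hypothetical joint PMF f_{X,Y}(x, y), keyed by (x, y).
joint = {(0, 0): F(1, 10), (0, 1): F(2, 10), (1, 0): F(3, 10), (1, 1): F(4, 10)}
y_values = sorted({y for _, y in joint})


def f_Y(y):
    """Marginal PMF of Y."""
    return sum(p for (_, yj), p in joint.items() if yj == y)


def cond_exp_x(y):
    """E[X | Y = y] = sum over x of x * f_{X,Y}(x, y) / f_Y(y)."""
    return sum(x * p for (x, yj), p in joint.items() if yj == y) / f_Y(y)


E_X = sum(x * p for (x, _), p in joint.items())
iterated = sum(cond_exp_x(y) * f_Y(y) for y in y_values)  # E[E[X|Y]]

print([str(cond_exp_x(y)) for y in y_values])  # ['3/4', '2/3']
print(E_X, iterated)                           # 7/10 7/10
```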
<br>

## Convergence in probability:
Let $X_1, X_2, \dots$ be a sequence of random variables with CDFs $F_1, F_2, \dots$, and let $X$ be a random variable with CDF $F$
- Convergence in probability, written $X_n \xrightarrow{P}\ X \qquad\textrm{when}\ n \to \infty$: for every $\epsilon > 0$
$$
P(|X_n - X| > \epsilon) \rightarrow 0 \text{ when } n \rightarrow \infty
$$
<br>
- Convergence in distribution:
$$
\lim_{n \to \infty}F_n(t) = F(t) \text{ for all } t \text{ at which } F \text{ is continuous} 
$$

## Law of large numbers:

Let $X_1, X_2, \dots, X_n$ be random variables. Their sample mean is
$$
\overline{X}_n=\frac1n(X_1+\cdots+X_n)
$$
Random variables are called independent and identically distributed (IID) if they have the same distribution
$$
F_{X_i}(x) = F_{X_j}(x) \text{ for each } i, j 
$$
and are independent
$$
F_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = F_{X_1}(x_1) \cdot \ldots \cdot F_{X_n}(x_n) \text{ for all } x_1,\ldots,x_n \in \mathbb{R}
$$

#### Weak law of large numbers:
Let $X_1, X_2, \dots, X_n$ be IID random variables with finite expectation $\mu = \operatorname{E}[X_i]$, then:
$$
\begin{matrix}{}\\
    \overline{X}_n\ \xrightarrow{P}\ \mu \qquad\textrm{when}\ n \to \infty.
\\{}\end{matrix}
$$
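
The weak law can be observed in a simulation: the sample mean of IID rolls of a fair die drifts toward $\mu = 3.5$ as $n$ grows. This is an illustrative sketch, not a proof; the function name, the seed and the sample sizes are assumptions for the example.

```python
import random


def sample_mean_of_die(n, rng):
    """Sample mean of n IID rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n


rng = random.Random(0)  # fixed seed so that the run is reproducible
for n in (10, 1_000, 100_000):
    mean = sample_mean_of_die(n, rng)
    print(n, mean, abs(mean - 3.5))
```

The last column is $|\overline{X}_n - \mu|$; the weak law states that the probability of it exceeding any fixed $\epsilon > 0$ tends to zero as $n$ grows.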