<a href="https://colab.research.google.com/github/StefanHubner/Auto-GPT/blob/master/Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

# Probability

### Language and Notation

The set of all possible outcomes of an experiment is called the **sample space**, which we denote by $ \Omega $ (e.g. for two coin tosses, $ \Omega = \left\{ HH, HT, TH, TT \right\} $). We refer to its elements $ \omega \in \Omega $ as elementary events.



---


An **event** is a subset of the sample space $ A \subseteq \Omega $. It allows us to define statements such as "at least one head occurs" with corresponding event $ A = \{HH, HT, TH\} $.


---



We call the set of all possible events a **sigma-algebra**, denoted by $ \mathcal{F} = \sigma(\Omega) $. It includes the **null event** $ \emptyset $ and the **sure event** $ \Omega $.


---



A **probability measure** $P$ is a function that takes an event and measures how likely it is to occur (e.g. for two tosses of a fair coin we know that $ P(\{HH\}) = \frac{1}{4} $).


---



A **probability space** is a triple $ (\Omega, \mathcal{F}, P) $.
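
The triple can be made concrete for the two-coin example. The following is a minimal sketch, assuming a fair coin and taking the power set of $ \Omega $ as the sigma-algebra (a natural choice on a finite sample space); the helper names `power_set` and `prob` are illustrative only.

```python
from fractions import Fraction
from itertools import chain, combinations, product

# Sample space for two tosses of a coin: HH, HT, TH, TT.
omega = ["".join(t) for t in product("HT", repeat=2)]

def power_set(outcomes):
    """All subsets of a finite sample space (assumed choice of sigma-algebra)."""
    items = list(outcomes)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

sigma_algebra = power_set(omega)

def prob(event):
    """Fair-coin measure: every elementary event has probability 1/4."""
    return Fraction(len(event), len(omega))

# The event "at least one head occurs".
at_least_one_head = frozenset(w for w in omega if "H" in w)

print(omega)                    # ['HH', 'HT', 'TH', 'TT']
print(len(sigma_algebra))       # 16 events, including the null and sure event
print(prob(at_least_one_head))  # 3/4
```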

### Probability Measure

#### Definition

1. $P(A) \geq 0$ for any event $A \subseteq \Omega$,

2.  $P(\Omega) =1$,

3. If the events $A_n \subseteq \Omega, n=1,2,...$ are pairwise disjoint (no two of them intersect),
then $P( \bigcup_{n=1}^{\infty} A_n) =
\sum_{n=1}^{\infty} P(A_n)$. This is called countable additivity.


A few results that can be derived from the definition:

* $P(\emptyset) =0$,
* $P(\Omega \setminus A)  =1 - P(A)$, where $\Omega \setminus A = \{\omega \in \Omega : \omega \notin A \}$,
* If $A\subseteq B$ then $P(A) \leq P(B)$,
* $P(A\cup B) =P(A) + P(B) - P( A\cap B)$,
* For any events $A_{n}$,
$P( \bigcup_{n=1}^{\infty} A_n ) \leq \sum\nolimits_{n=1}^{\infty}P( A_n )$.
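
These results can be checked numerically on a small example. The sketch below assumes two fair dice with the uniform measure; the events `A` and `B` are hypothetical choices made only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two dice: 36 ordered pairs (first die, second die).
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """Uniform measure on omega (assumes fair dice)."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == 6}          # first die shows 6
B = {w for w in omega if w[0] + w[1] >= 10}  # sum is at least 10

null_ok = prob(set()) == 0
complement_ok = prob(omega - A) == 1 - prob(A)
monotone_ok = prob(A & B) <= prob(B)         # A ∩ B is a subset of B
inclusion_exclusion_ok = prob(A | B) == prob(A) + prob(B) - prob(A & B)
union_bound_ok = prob(A | B) <= prob(A) + prob(B)

print(null_ok, complement_ok, monotone_ok,
      inclusion_exclusion_ok, union_bound_ok)
```

A check on one example is of course not a proof; each property still has to be derived from the three axioms.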


We can define different probability measures (e.g., $P_1, P_2, P_3$) on the same sample space.

### Random Variables

A **random variable** $X$ is a function from $ (\Omega, \mathcal{F}) $ into $ (\mathbb{R}, \mathcal{B}) $ that maps each elementary event $ \omega $ to a real-valued outcome.


> For example, let's assume we have two coloured dice and define their respective face values as two random variables $X(⚄⚂) = 5, Y(⚄⚂) = 3$. Their value tells us that for outcome $\omega_{53} = ⚄⚂ $ the
random variable $X$ will have realised value $5$ and the random variable $Y$ will have realised value $3$.


When we write a formula with random variables, the formula defines a new function:
$$ Z : \Omega \to \mathbb{R} : \omega \mapsto X(\omega)+ Y(\omega) $$
We usually write the shorthand $ Z = X + Y $, but keep the full definition in mind!
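
Because random variables are just functions of $ \omega $, they can be written as ordinary functions in code. This is a small sketch for the two-dice example, encoding an outcome as a pair `(first, second)`; that encoding is an assumption of the sketch.

```python
from itertools import product

# Each outcome is an ordered pair of face values, e.g. (5, 3) for ⚄⚂.
omega = list(product(range(1, 7), repeat=2))

def X(w):
    """Face value of the first die."""
    return w[0]

def Y(w):
    """Face value of the second die."""
    return w[1]

def Z(w):
    """Z = X + Y, defined pointwise: Z(w) = X(w) + Y(w)."""
    return X(w) + Y(w)

w53 = (5, 3)
print(X(w53), Y(w53), Z(w53))  # 5 3 8
```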

---

Think of all random variables as simultaneously determined by the outcome $\omega$.
Any expression you construct from random variables is also uniquely determined
by the outcome $\omega$.

\begin{equation*}
\begin{array}{llcccccc}
&  & X_1 & X_2 & X_3 & Y & X_3 / Y & \dots \\
\omega _{1} & ~ & 0 & 0 & 0 & 1.5 & 0 & \dots \\
\omega _{2} & ~ & 0 & 3 & 3 & 2 & 1.5 & \dots \\
\omega _{3} & ~ & 1 & 0 & 1 & 1 & 1 & \dots \\
\omega _{4} & ~ & 2 & 1 & 4 & 0.5 & 8 & \dots \\
... & ~ & ... & ... & ... & ... & ... & \dots
\end{array}%
\end{equation*}

---

### Measuring Events

The probability measure $P$ determines how likely different $\omega$'s are to happen and, consequently, values for $ X_{1}, X_{2},X_{3},Y$ and their functions.

We often want to make statements not only involving an elementary event $ \omega \in \Omega$ but instead a collection $ A \in \mathcal{F} $ of them.

We can then define the corresponding event for the random variable as $\{\omega: Z(\omega) \in B \}$ where $ B \in \mathcal{B} $, the Borel sigma-algebra, defined as the smallest sigma-algebra containing all open intervals of $ \mathbb{R} $.

If we write $P(\dots)$ around the statement, we mean the probability assigned by measure $P$ to the event described in parentheses e.g.
$$
P(\{\omega: Z(\omega) \in (a,b) \}) = P(Z \in (a,b)) = P (a < Z < b).
$$



> In our two dice example we might be interested in the event that the sum of the eyes, $ Z = X + Y $, is less than or equal to 3.

> For this, we take the set $ B = (0, 3] \in \mathcal{B} $ and define the corresponding event
$$ A = \{\omega: Z(\omega) \in (0, 3] \} =
\{ ⚀⚀, ⚀⚁, ⚁⚀\}$$

We can then measure this event with different probability measures.

> Let us denote by $ P_0 $ the probability measure corresponding to fair dice, which gives equal likelihood $ P_0(\omega) = \frac{1}{36} $ to all elementary events $ \omega \in \Omega $. In this case we get $ P_0(A) = \frac{3}{36} = \frac{1}{12} $.
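
The same event can be measured under more than one probability measure. In the sketch below, `p0` is the fair-dice measure from the example, while `p1` is a hypothetical loaded-dice measure invented purely for illustration (each die shows 1 with probability 1/2 and every other face with probability 1/10, independently).

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))

def Z(w):
    return w[0] + w[1]

# The event A = {w : Z(w) in (0, 3]}.
A = [w for w in omega if 0 < Z(w) <= 3]

def p0(w):
    """Fair dice: every elementary event has probability 1/36."""
    return Fraction(1, 36)

# Hypothetical loaded die (an assumption, not part of the original example).
face = {1: Fraction(1, 2), **{k: Fraction(1, 10) for k in range(2, 7)}}

def p1(w):
    return face[w[0]] * face[w[1]]

def measure(p, event):
    """Probability of an event: sum over its elementary events."""
    return sum(p(w) for w in event)

print(A)               # [(1, 1), (1, 2), (2, 1)]
print(measure(p0, A))  # 1/12
print(measure(p1, A))  # 7/20
```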

#### Absolute continuity

An elementary element of $ \mathbb{R} $ is a point, which is (loosely speaking) an interval $ B \in \mathcal{B} $ of width zero. We say that such intervals have zero (Lebesgue) measure.

If the probability measure of a random variable $ X $ gives zero probability to **every zero-measure interval**, we call $ X $ a **continuous random variable**.

If **every zero-measure interval** in the image of a random variable $ Y $ is given non-zero probability by its probability measure, we call $ Y $ a **discrete random variable**.

There are also **mixed** random variables.

> In our example, the image of $ Z $ is $ \{ 2, \ldots, 12 \} $. To each of these points $ P_0 $ assigns a probability of at least $ \frac{1}{36} > 0 $ (for instance $ P_0(Z = 2) = \frac{1}{36} $ and $ P_0(Z = 7) = \frac{6}{36} $), so $ Z $ is discrete.
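
To see which probabilities $ P_0 $ assigns to the points in the image of $ Z $, one can simply count outcomes. This sketch assumes fair dice and tabulates $ P_0(Z = z) $ for every value $ z $.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))

# Count how many elementary events map to each value of Z = X + Y.
counts = Counter(x + y for x, y in omega)

# Under the fair-dice measure each elementary event has probability 1/36.
pmf_Z = {z: Fraction(c, len(omega)) for z, c in sorted(counts.items())}

for z, p in pmf_Z.items():
    print(z, p)

print(min(pmf_Z.values()))  # smallest point probability: 1/36
print(sum(pmf_Z.values()))  # total probability: 1
```

Every point in the image receives strictly positive probability, which is what makes $ Z $ discrete.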

#### Distribution Functions

Instead of working with general probability measures, we will often distinguish discrete and continuous random variables and work with distribution functions:

$$ F_{X}(x) = P(X \leq x) $$

For a **continuous** random variable there exists a function $f_{X}(x) \geq 0$, called the **probability density function**, satisfying
$$
F_{X}(x) = \int_{-\infty }^{x}f_{X}(t)dt \text{ for all } x \in \mathbb{R}.
$$

In this case we can measure:
$$ P(a < X < b) = F_X(b) - F_X(a) = \int_{a}^{b}f_{X}(t)dt $$
by the **fundamental theorem of calculus**.
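
As a numerical illustration, the two expressions for $ P(a < X < b) $ can be compared directly. The sketch assumes $ X $ is standard normal (an arbitrary choice, not taken from the text), uses `math.erf` for the distribution function, and approximates the integral with a simple midpoint rule.

```python
import math

def pdf(x):
    """Density of the standard normal distribution (assumed example)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Distribution function of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

a, b = -1.0, 2.0
via_cdf = cdf(b) - cdf(a)           # F_X(b) - F_X(a)
via_density = integrate(pdf, a, b)  # integral of f_X over (a, b)

print(via_cdf, via_density)
```

Both numbers agree up to the discretisation error of the midpoint rule.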



> *Exercise: Use the probability of the sure event to show that any density function must satisfy* $\int_{-\infty}^{\infty} f_X(x)dx = 1$.

If the random variable $X$ is **discrete** we have
$$
\sum_{k=1}^{\infty} P(X = x_k) = 1
$$
for some (at most countable) sequence of numbers $\{x_1, x_2, x_3, \dots \}$.


We call $p(x_k) = P(X=x_k)$ the **probability mass function** of $X$.

### Moments

The **expectation** for any (measurable) function $m$ is
$$
\mathbb{E} m(X) = \left\{
\begin{array}{ll}
\sum\limits_{k=1}^{\infty} m(x_k) P(X = x_k)
 & \text{ if $X$ is discrete}, \\
\int\limits_{-\infty}^{\infty} m(x) f_X(x) dx
 & \text{ if $X$ is continuous}.
\end{array}
\right.
$$

**Special cases**:
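* Expected Value: $ m_1(X) = X $, which gives $ \mathbb{E} X $
* Variance: $ m_2(X) = (X - \mathbb{E} X)^2 $
* Covariance: $ m_3(X, Y) = (X - \mathbb{E} X)(Y - \mathbb{E} Y) $, with the expectation taken over the joint distribution of $ (X, Y) $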
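
The special cases can be computed exactly for the two-dice example. The sketch below assumes fair dice and sums over outcomes $ \omega $ rather than over values $ x_k $; grouping the outcomes by the value of $ X $ gives back the discrete formula above.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))  # fair dice: 1/36 per elementary event

def expect(m):
    """Expectation of m, computed as a sum over elementary events."""
    return sum(m(w) * p for w in omega)

def X(w):
    return w[0]

def Z(w):
    return w[0] + w[1]

mean_X = expect(X)
var_X = expect(lambda w: (X(w) - mean_X) ** 2)
mean_Z = expect(Z)
cov_XZ = expect(lambda w: (X(w) - mean_X) * (Z(w) - mean_Z))

print(mean_X)  # 7/2
print(var_X)   # 35/12
print(mean_Z)  # 7
print(cov_XZ)  # 35/12
```

The covariance of $ X $ and $ Z = X + Y $ equals the variance of $ X $ here because the two dice do not influence each other.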

### Independence
Events $\{ A_{k},k = 1,2,\dots n\}$ are **independent** if
$$
    P\hspace{-2ex}\underbrace{\left( \bigcap_{k=1}^{n} A_k \right)}_{\text{intersection of sets}} \hspace{-1.5ex} = \prod_{k=1}^{n}  P\left( A_k \right).
$$

~

This leads to the definition of **independent random variables**: $\{X_1, X_2, \dots, X_n\}$ are independent if
$$
P\underbrace{( X_1 \in B_1, X_2 \in B_2, \dots X_n \in B_n )}_{\text{all true }}
= \prod\limits_{k=1}^{n} P( X_k \in B_k )
$$
for any sets of real numbers $B_1, \dots, B_n \in \mathcal{B} $.


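Consequently, if $X_1$ and $X_2$ are **independent**, conditioning on $X_2$ does not change the probabilities for $X_1$. Whenever $P(X_2 \in B_2) > 0$, the **conditional probability** satisfies
$$ P(X_1 \in B_1 | X_2 \in B_2) = \frac{P(X_1 \in B_1, X_2 \in B_2)}{P(X_2 \in B_2)} = P(X_1 \in B_1) . $$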
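
For discrete random variables, the product rule can be checked value by value, since it then extends to arbitrary sets by additivity. The following sketch assumes fair dice; the helper `independent` is a hypothetical name introduced for this check.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))

def prob(pred):
    """Probability (fair dice) of the event {w : pred(w) is true}."""
    return Fraction(sum(1 for w in omega if pred(w)), len(omega))

def X(w):
    return w[0]

def Y(w):
    return w[1]

def Z(w):
    return w[0] + w[1]

def independent(U, V, values_u, values_v):
    """Check P(U=u, V=v) == P(U=u) * P(V=v) for all listed values."""
    return all(
        prob(lambda w: U(w) == u and V(w) == v)
        == prob(lambda w: U(w) == u) * prob(lambda w: V(w) == v)
        for u in values_u
        for v in values_v
    )

faces = range(1, 7)
print(independent(X, Y, faces, faces))         # True
print(independent(X, Z, faces, range(2, 13)))  # False

# Conditional probability P(X = 5 | Y = 3) equals P(X = 5).
cond = prob(lambda w: X(w) == 5 and Y(w) == 3) / prob(lambda w: Y(w) == 3)
print(cond)  # 1/6
```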


# Maximum Likelihood Recap

We assume we have $ n $ independent, identically distributed random variables $ X_1, \ldots, X_n $ with $ P(X_i \leq x) = P_{\theta}(X_i \leq x) = F_{\theta}(x) $ and corresponding p.d.f. or p.m.f. $ f(x|\theta) $.

Thus, the likelihood of observing the realisations $ (x_1, \ldots, x_n) $ is
$$ f(x_1, \ldots, x_n | \theta) =\prod\limits_{i=1}^n f_i(x_i | \theta) =\prod\limits_{i=1}^n f(x_i | \theta),  $$

where $ f_i $ denotes the p.d.f. or p.m.f. of $ X_i $; the first equality follows from independence and the second from
the identical distribution.
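
To make the product formula concrete, here is a minimal sketch for a Bernoulli model with p.m.f. $ f(x|\theta) = \theta^x (1-\theta)^{1-x} $. The model, the data `xs` and the grid search are all assumptions chosen for illustration; in practice one would maximise the log-likelihood analytically or with a numerical optimiser.

```python
import math

def likelihood(theta, xs):
    """Product of Bernoulli p.m.f. values f(x_i | theta) over the sample."""
    return math.prod(theta ** x * (1 - theta) ** (1 - x) for x in xs)

def log_likelihood(theta, xs):
    """Sum of log f(x_i | theta); same maximiser as the likelihood."""
    return sum(
        x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs
    )

# Hypothetical observations (made up for this example).
xs = [1, 0, 1, 1, 0, 1, 1, 0]

# Crude grid search over theta in (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, xs))

print(theta_hat)          # maximiser on the grid
print(sum(xs) / len(xs))  # sample mean, for comparison
```

For the Bernoulli model the maximiser coincides with the sample mean, which the grid search recovers here.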