<a href="https://colab.research.google.com/github/Probabilistic-ML/colab-notes/blob/master/Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fundamentals of Probability Theory

In the present section, we define the basic terms of probability theory and statistics. Moreover, we state the most common examples of discrete and continuous probability distributions. 

The content follows the textbooks

["Statistik für Ingenieure - Wahrscheinlichkeitsrechnung und Datenauswertung endlich verständlich"](https://www.springer.com/de/book/9783642548567)

by Aeneas Rooch and

["Grundlagen der Wahrscheinlichkeitsrechnung und Statistik - Eine Einführung für Studierende der Informatik, der Ingenieur- und Wirtschaftswissenschaften"](https://www.springer.com/de/book/9783662541616)

by Erhard Cramer and Udo Kamps.

The goal is to avoid unnecessarily complex mathematical background while still providing the framework required to understand the subsequent machine learning methods. Nevertheless, for the sake of completeness, additional references are given from time to time. A more rigorous mathematical treatment can be found, for example, in ["Wahrscheinlichkeitstheorie"](https://www.springer.com/de/book/9783642360183) by Achim Klenke.

All three books are available free of charge via [DigiBib](https://hs-niederrhein.digibib.net/).

## Probability Spaces

**Definition 1.1**: In order to model the outcome of a random experiment, we denote by $\Omega$ the **sample space** of all possible outcomes, i.e.,

$$\Omega = \{ \omega ~|~ \omega \text{ is a possible outcome of the random experiment}\}.$$

Accordingly, each element $\omega \in \Omega$ is called an **outcome**. A subset $A$ of $\Omega$ of possible outcomes is called an **event**. If $A$ contains only a single outcome $\omega$, i.e., $A=\{\omega\}$ for some $\omega \in \Omega$, $A$ is also called an elementary event.

**Example 1.2**: If we model the roll of an ordinary six-sided die, the sample space $\Omega$ consists of the 6 possible outcomes, i.e.,

$$\Omega = \{1, 2, 3, 4, 5, 6\}.$$

The event $A$ of rolling an even number is given by $A = \{2, 4, 6\} \subset \Omega$, and the elementary event of rolling a six is given by $\{6\}$.
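Sample spaces and events map naturally onto sets in code. The following minimal sketch models Example 1.2 with Python's built-in `set` type; the variable names `omega`, `even`, and `six` are chosen here for illustration only.

```python
# Sample space of a six-sided die (Example 1.2), modelled as a Python set.
omega = set(range(1, 7))

# The event "rolling an even number" is a subset of the sample space.
even = {w for w in omega if w % 2 == 0}

# The elementary event "rolling a six" contains exactly one outcome.
six = {6}

print(sorted(even))              # [2, 4, 6]
print(even <= omega)             # True: every event is a subset of omega
print(len(six) == 1)             # True: an elementary event has one outcome
```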

### Discrete Probability Spaces

**Definition 1.3**: Let $\Omega$ be a finite or countably infinite sample space and denote by $\mathcal{P}(\Omega) = \{A~|~A \subset \Omega\}$ the set of all subsets of $\Omega$ (the so-called power set). Moreover, let $p: \Omega \rightarrow [0, 1]$ be a map such that $\sum_{\omega \in \Omega} p(\omega) = 1$. Then, the map $P: \mathcal{P}(\Omega) \rightarrow [0,1]$ given by

$$ P(A) := \sum_{\omega \in A} p(\omega) \quad \text{for } A \in \mathcal{P}(\Omega)$$

is called a **discrete probability measure** or a **discrete probability distribution**. The triple $(\Omega, \mathcal{P}(\Omega), P)$ is called a **discrete probability space**.
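Definition 1.3 can be sketched in a few lines of Python for the special case of a finite sample space. The helper `make_measure` and the biased coin used to exercise it are hypothetical illustrations, not part of the referenced textbooks. Exact fractions are used so that the check $\sum_{\omega \in \Omega} p(\omega) = 1$ is not affected by floating-point rounding.

```python
from fractions import Fraction


def make_measure(p):
    """Build P(A) = sum of p(w) over w in A from elementary probabilities.

    `p` is a dict mapping each outcome of a finite sample space to its
    elementary probability (hypothetical helper for illustration).
    """
    if any(v < 0 or v > 1 for v in p.values()):
        raise ValueError("each p(w) must lie in [0, 1]")
    if sum(p.values()) != 1:
        raise ValueError("the elementary probabilities must sum to 1")

    def P(A):
        A = set(A)
        if not A <= set(p):
            raise ValueError("A must be a subset of the sample space")
        # Start the sum at Fraction(0) so that P(empty set) is an exact 0.
        return sum((p[w] for w in A), Fraction(0))

    return P


# Hypothetical biased coin, used only to exercise the definition.
coin = {"heads": Fraction(1, 3), "tails": Fraction(2, 3)}
P = make_measure(coin)

print(P({"heads"}))           # 1/3
print(P({"heads", "tails"}))  # 1
```

A countably infinite sample space would require a different representation (for example a function for $p$ together with a convergent series), which this sketch does not cover.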

**Remark 1.4**: 
- A probability measure $P$ assigns to each possible event a probability between 0 ("impossible") and 1 ("certain").
- For discrete probability distributions, $P$ is completely characterized (by definition) by the elementary probabilities, i.e., the probabilities of the elementary events specified by $p$.
- The condition $\sum_{\omega \in \Omega} p(\omega) = 1$ guarantees that $P(\Omega) = 1$. In other words, the outcome of a random experiment is certain to lie in $\Omega$. Moreover, $P(\Omega) > 1$ would make no sense in terms of probabilities.

**Example 1.5**: Assuming that the die in Example 1.2 is fair, it is reasonable to define $p(\omega):=\frac{1}{6}$ for each $\omega=1, \dots,6$. Hence, all outcomes of a roll are equally likely. Consequently, the probability of rolling an even number is

$$P(\{2, 4, 6\}) = \sum_{\omega \in \{2, 4, 6\}} p(\omega) = 3 \cdot \frac{1}{6} = 0.5$$

as expected.
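The computation in Example 1.5 can be reproduced exactly with Python's `fractions` module. As an informal sanity check, the sketch below also estimates the same probability by simulating rolls with the standard `random` module; the seed and the number of rolls are arbitrary choices made here, and the simulated relative frequency is only an approximation of the exact value.

```python
import random
from fractions import Fraction

# Elementary probabilities of a fair die: p(w) = 1/6 for w = 1, ..., 6.
p = {w: Fraction(1, 6) for w in range(1, 7)}
even = {2, 4, 6}

# Exact probability P({2, 4, 6}) as the sum of elementary probabilities.
exact = sum(p[w] for w in even)
print(exact)  # 1/2

# Informal check: relative frequency of even results in simulated rolls.
rng = random.Random(0)   # fixed seed (arbitrary) for reproducibility
n_rolls = 100_000        # arbitrary number of simulated rolls
hits = sum(rng.randint(1, 6) in even for _ in range(n_rolls))
estimate = hits / n_rolls
print(estimate)          # close to 0.5, but not exactly equal
```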

**Corollary 1.6**: As a direct consequence of Definition 1.3, a discrete probability measure has the following properties:
- $0 \le P(A) \le 1$ for each event $A \in \mathcal{P}(\Omega)$,
- $P(\Omega) = 1$,
- $P$ is $\sigma$-additive, i.e., for pairwise disjoint events $A_i$, $i \in \mathbb{N}$, it holds that

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$
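For a small finite sample space, these properties can be checked by brute force over all events. The sketch below does so for the fair die. It assumes the uniform elementary probabilities $p(\omega) = \frac{1}{6}$ and only tests additivity for two disjoint events at a time, which is the simplest instance of $\sigma$-additivity; the helper `power_set` is a name introduced here for illustration.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset(range(1, 7))
p = {w: Fraction(1, 6) for w in omega}  # fair die (assumption)


def P(A):
    """Discrete probability measure: sum of elementary probabilities."""
    return sum((p[w] for w in A), Fraction(0))


def power_set(s):
    """Return all subsets of the finite set s as frozensets."""
    items = list(s)
    return [
        frozenset(c)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    ]


events = power_set(omega)  # 2**6 = 64 events

bounded = all(0 <= P(A) <= 1 for A in events)
normalized = P(omega) == 1
additive = all(
    P(A | B) == P(A) + P(B)
    for A in events
    for B in events
    if A.isdisjoint(B)
)

print(len(events), bounded, normalized, additive)  # 64 True True True
```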

**Remark 1.7**: 
- The term "pairwise disjoint" means that no two distinct events $A_i$ and $A_j$, $i \neq j$, have any outcomes in common. For example, the events $\{2, 4, 6\}$ and $\{1, 3\}$ are disjoint, but the events $\{2, 4, 6\}$ and $\{2, 3\}$ are not, since they share the outcome $2$.
- The last statement of Corollary 1.6 also holds true for a finite number of pairwise disjoint sets $A_i$, $i=1,\dots,n$, by simply choosing $A_i = \emptyset$ (the empty set) for $i > n$. In particular, for two disjoint sets $A_1$ and $A_2$, it follows that $P(A_1 \cup A_2) = P(A_1) + P(A_2)$. This means that the probability that the event $A_1$ or the event $A_2$ occurs equals the sum of the individual probabilities, which is intuitive.
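The role of disjointness can be illustrated with the two pairs of events from the remark. The sketch below assumes the fair die with $p(\omega) = \frac{1}{6}$ and shows that additivity holds for the disjoint pair, whereas simply adding the probabilities of the overlapping pair counts the shared outcome $2$ twice.

```python
from fractions import Fraction

p = {w: Fraction(1, 6) for w in range(1, 7)}  # fair die (assumption)


def P(A):
    """Discrete probability measure: sum of elementary probabilities."""
    return sum((p[w] for w in A), Fraction(0))


even = {2, 4, 6}
disjoint_partner = {1, 3}     # shares no outcome with `even`
overlapping_partner = {2, 3}  # shares the outcome 2 with `even`

# Disjoint events: the probability of the union is the sum of probabilities.
print(even.isdisjoint(disjoint_partner))                            # True
print(P(even | disjoint_partner) == P(even) + P(disjoint_partner))  # True

# Overlapping events: the plain sum overestimates the union's probability.
print(even.isdisjoint(overlapping_partner))  # False
print(P(even | overlapping_partner))         # 2/3
print(P(even) + P(overlapping_partner))      # 5/6
```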

### General Probability Spaces

It turns out that defining a probability space is less straightforward when the sample space contains uncountably many outcomes. In that case, it is necessary to restrict the set of events to which a probability can be assigned to a suitable subset of $\mathcal{P}(\Omega)$.

## Random Variables

## Important Examples of Probability Distributions


### Discrete Distributions

### Continuous Distributions

## Basic Terms

Mean, Variance...

# Bayesian vs. Frequentist View

# MLE, MAP & Bayesian Inference

# An Illustrative Example for MLE and MAP: Linear Regression

## Ordinary Least Squares (OLS) = MLE

## Ridge Regression = MAP with Gaussian Prior

## LASSO = MAP with Laplace Prior

# Optimization Methods

# Machine Learning Workflow