# Probability

## discrete
probability to measure x<br>
\begin{equation}
p(x)
\end{equation}

### Mean
\begin{equation}
\mu (x) = \sum_i x_i \cdot p(x_i)
\end{equation}

### Variance
\begin{equation}
var (x) = \mu \left( (x - \mu(x))^2 \right)\\
= \sum_i (x_i - \mu(x))^2 \cdot p(x_i)
\end{equation}

### Standard deviation
\begin{equation}
\sigma = \sqrt{ var(x) }
\end{equation}
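
The three formulas above can be checked with a few lines of Python. This is a minimal sketch; the fair six-sided die is an assumed example that does not appear in the text.

```python
import math

# Assumed example: a fair six-sided die, p(x_i) = 1/6 for every face.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# mu(x) = sum_i x_i * p(x_i)
mean = sum(x * p for x, p in zip(values, probs))

# var(x) = sum_i (x_i - mu)^2 * p(x_i)
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

# sigma = sqrt(var(x))
std = math.sqrt(variance)

print(mean, variance, std)
```

For the die the mean is 3.5 and the variance is 35/12.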

## continuous

### Probability density function
\begin{equation}
f(x)
\end{equation}

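### Cumulative distribution function
probability to measure a value of at most x<br>
\begin{equation}
F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) dt
\end{equation}
probability to measure x between a and b<br>
\begin{equation}
P(a \leq x \leq b) = \int_a^b f(x) dx = F(b) - F(a)
\end{equation}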

### Mean
\begin{equation}
\mu (x) = \int_{-\infty}^{\infty} x \cdot f(x) dx
\end{equation}

### Variance
\begin{equation}
var (x) = \int_{-\infty}^{\infty} (x-\mu)^2 \cdot f(x) dx
\end{equation}

### Standard deviation
\begin{equation}
\sigma = \sqrt{ var(x) }
\end{equation}
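
The integrals above can be approximated numerically. The sketch below uses a simple midpoint rule; the exponential density with rate 2 and the helper name `integrate` are assumptions made for illustration only.

```python
import math

def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

# Assumed example density: exponential with rate 2, f(x) = 2 exp(-2x) for x >= 0.
rate = 2.0

def f(x):
    return rate * math.exp(-rate * x)

upper = 40.0  # stand-in for infinity; the tail beyond this point is negligible

total = integrate(f, 0.0, upper)                                  # should be close to 1
mean = integrate(lambda x: x * f(x), 0.0, upper)                  # close to 1/rate
variance = integrate(lambda x: (x - mean) ** 2 * f(x), 0.0, upper)  # close to 1/rate^2
prob_interval = integrate(f, 0.5, 1.0)                            # P(0.5 <= x <= 1)

print(total, mean, variance, math.sqrt(variance), prob_interval)
```

For this density the exact values are known (mean 1/2, variance 1/4), which makes it a convenient check of the numerical integration.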

## moments

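- median: value that splits the distribution so that 50% lies to its left and 50% to its right (strictly a quantile, not a moment)
- variance: dispersion of the distribution (second central moment)
- symmetry: how mirror-symmetric the distribution is about its center
- skewness: how strongly the distribution leans to one side (third standardized moment)
- kurtosis: fatness of the tails of the distribution (fourth standardized moment)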
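
To make the quantities in the list concrete, here is a small sketch that computes them for a sample. The data are made up for illustration, and skewness and kurtosis are taken as the third and fourth standardized moments (population form, dividing by n).

```python
import statistics

# Made-up sample; the single large value produces a long right tail.
data = [2.0, 3.0, 3.0, 4.0, 4.0, 4.0, 5.0, 5.0, 12.0]

n = len(data)
mean = statistics.fmean(data)
median = statistics.median(data)

# Population variance: average squared distance from the mean.
variance = sum((x - mean) ** 2 for x in data) / n
std = variance ** 0.5

# Third and fourth standardized moments.
skewness = sum(((x - mean) / std) ** 3 for x in data) / n
kurtosis = sum(((x - mean) / std) ** 4 for x in data) / n

print(mean, median, variance, skewness, kurtosis)
```

The skewness comes out positive because the sample leans to the right, and the mean is pulled above the median for the same reason.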

## important distributions

### Bernoulli Distribution
probability for a random experiment with two possible outcomes.<br>
for example: x=0 failure, x=1 success<br>
\begin{equation}
f(x \in \lbrace 0,1 \rbrace) =
\begin{cases}
p& \text{if $x=1$}\\
1-p& \text{if $x=0$}
\end{cases}\\
= p^x(1-p)^{1-x}
\end{equation}

### Geometric Distribution
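if this experiment is performed n times, the probability for $x$ successful attempts in a particular order is<br>
\begin{equation}
f(x) = p^x(1-p)^{n-x}
\end{equation}
the geometric distribution is the special case of a single success ($x=1$) that occurs on the last of the n trials, i.e. the probability that the first success happens at trial n<br>
\begin{equation}
f(n) = p(1-p)^{n-1}
\end{equation}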

### Binomial Distribution
if the order in which the events occur is not important, the probability for $x$ successes becomes
\begin{equation}
f(x) = \begin{pmatrix} n\\x \end{pmatrix} p^x(1-p)^{n-x}
\end{equation}
with $\begin{pmatrix} n\\x \end{pmatrix} = \frac{n!}{x!(n-x)!}$, which is composed of $\frac{n!}{(n-x)!}$ possible ordered arrangements of x successful outcomes in n trials divided by the $x!$ possible permutations of these x successful outcomes (if any two successes switched places, the outcome would be the same!).<br><br>

Example: 5 coin tosses (2 heads (0) & 3 tails (1))<br><br>
n=5; x=3 (x counts the 1s);

\begin{align}
\text{possible arrangements of x:} &\frac{n!}{(n-x)!} = 5 \cdot 4 \cdot 3 = 60 \\
\text{possible permutations of x:} &\frac{1}{x!} = \frac{1}{3 \cdot 2 \cdot 1} = \frac{1}{6} \\
\text{possible effective arrangements of x:} &\frac{n!}{(n-x)! \cdot x!} =10
\end{align}

possible effective (different) coin toss arrangements: <br><br>
00111 <br>
01011 <br>
01101 <br>
01110 <br>
10011 <br>
10101 <br>
10110 <br>
11001 <br>
11010 <br>
11100 <br>
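
The ten arrangements listed above can be reproduced by brute force. This sketch enumerates all 0/1 strings of length 5, keeps those with three 1s, and compares the count with the binomial coefficient; the fair-coin probability p = 0.5 is an assumption.

```python
from itertools import product
from math import comb, factorial

n, x = 5, 3  # five tosses, three of them showing 1

# All 0/1 strings of length n that contain exactly x ones.
arrangements = ["".join(bits) for bits in product("01", repeat=n) if bits.count("1") == x]

# Same count from the formula n! / ((n-x)! * x!).
from_formula = factorial(n) // (factorial(n - x) * factorial(x))

# Binomial probability of exactly x ones for an assumed fair coin.
p = 0.5
prob_x = comb(n, x) * p**x * (1 - p) ** (n - x)

print(arrangements)
print(len(arrangements), from_formula, prob_x)
```

Each single arrangement has probability $p^x(1-p)^{n-x}$, so multiplying by the number of arrangements gives the binomial probability.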

### Poisson Distribution
for small probabilities p and big sample size n the binomial distribution can be approximated by a Poisson distribution with mean $\mu = n p$ (like the binomial distribution it is discrete, $x = 0, 1, 2, ...$)<br>
\begin{equation}
f(x) = \frac{\mu^x}{x!} e^{-\mu}
\end{equation}
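
How close the approximation is can be seen by evaluating both formulas side by side. This is a sketch with assumed values n = 1000 and p = 0.003, using the standard identification $\mu = n p$.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """Probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, mu):
    """Poisson probability of x events for mean mu."""
    return mu**x / factorial(x) * exp(-mu)

n, p = 1000, 0.003  # assumed values: many trials, small success probability
mu = n * p

for x in range(7):
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, mu), 5))

# Largest pointwise difference over the relevant range of x.
max_diff = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, mu)) for x in range(20))
print(max_diff)
```

The two columns agree to about three decimal places for these values; the agreement gets worse as p grows.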

### Gauss/Normal-Distribution
additionally, if n is large enough that both $np$ and $n(1-p)$ are sufficiently big, the distribution becomes symmetrical and can be approximated by a normal distribution with $\mu = np$ and $\sigma^2 = np(1-p)$
\begin{equation}
f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}
\end{equation}
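
A quick numerical comparison shows the approximation at work. The sketch assumes n = 100 and p = 0.5, and uses the mean $np$ and variance $np(1-p)$ of the binomial distribution as the parameters of the normal density.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(x, n, p):
    """Probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

n, p = 100, 0.5  # assumed values, chosen so that n*p and n*(1-p) are both large
mu = n * p
sigma = sqrt(n * p * (1 - p))

# Largest pointwise difference between the two curves.
max_diff = max(abs(binomial_pmf(x, n, p) - normal_pdf(x, mu, sigma)) for x in range(n + 1))

print(binomial_pmf(50, n, p), normal_pdf(50, mu, sigma), max_diff)
```

At the peak (x = 50) both values are close to 0.08, and the curves differ by less than 0.001 everywhere.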

### Negative Binomial Distribution
the probability to get x-1 successes in the first n-1 trials, followed by the x-th success at trial number n <br>
\begin{equation}
f(x) = \begin{pmatrix} n-1\\x-1 \end{pmatrix} p^{x-1}(1-p)^{n-1-(x-1)} \cdot p\\
= \begin{pmatrix} n-1\\x-1 \end{pmatrix} p^{x}(1-p)^{n-x}
\end{equation}

<br> another formulation counts the $b=n-x$ failures that occur before the x-th success at trial n<br>
\begin{equation}
p(x) = \begin{pmatrix} b+x-1\\x-1 \end{pmatrix} p^{x}(1-p)^{b}\\
= \begin{pmatrix} b+x-1\\b \end{pmatrix} p^{x}(1-p)^{b}
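\end{equation}
the second equation is equal to the first because the binomial coefficient is symmetrical: $\begin{pmatrix} m\\k \end{pmatrix} = \begin{pmatrix} m\\m-k \end{pmatrix}$, here with $m=b+x-1$ and $k=x-1$, so that $m-k=b$.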
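
Both formulations can be checked against each other numerically. In this sketch the function names and the values x = 3, p = 0.4 are assumptions made for illustration.

```python
from math import comb

def neg_binomial_trials(n, x, p):
    """P(the x-th success happens exactly at trial n)."""
    return comb(n - 1, x - 1) * p**x * (1 - p) ** (n - x)

def neg_binomial_failures(b, x, p):
    """The same probability, written in terms of the b = n - x failures."""
    return comb(b + x - 1, b) * p**x * (1 - p) ** b

x, p = 3, 0.4  # assumed values

for n in range(x, x + 5):
    print(n, neg_binomial_trials(n, x, p), neg_binomial_failures(n - x, x, p))

# Summed over all possible n the probabilities add up to 1
# (the sum is truncated where the tail is negligible).
total = sum(neg_binomial_trials(n, x, p) for n in range(x, 400))
print(total)
```

The two columns are identical, and the total is 1 up to rounding, as expected for a probability distribution over n.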


```python
import math

def nCr(n, k):
    # number of ways to choose k items out of n when the order does not matter
    return math.comb(n, k)

print('Binomial coefficient: probability in Lotto to pick 6 numbers out of 49 is 1 in', nCr(49, 6))
```

Binomial coefficient: probability in Lotto to pick 6 numbers out of 49 is 1 in 13983816


### Likelihood

The probability function tells us how likely an event x is to occur if the parameters are known. For example, n coin tosses produce a particular sequence with x successes with probability

\begin{equation}
f(x|p) = p^x(1-p)^{n-x}
\end{equation}

where the probability $p$ for a success, or $(1-p)$ for a failure, is known beforehand. If we perform the coin toss experiment a couple of times, we observe the number of successes x each time, but we know neither the probability p of a single coin toss nor the combined probability $f$.
The solution is to read the same expression as a function of p for the observed x, the likelihood function $f(p|x)$, and to determine the probability p of a single coin toss from it after many tosses.

\begin{equation}
f(p|x) = p^x(1-p)^{n-x}
\end{equation}

The important difference between a probability and a likelihood is that the probability $f(x|p)$ returns the probability of $x$ happening if the parameter $p$ is known. In contrast, the likelihood $f(p|x)$ tells us how plausible a particular value of the parameter $p$ is, given that the event $x$ occurred. As a function of $p$ it does not have to integrate to 1, so it is not a probability distribution over $p$.
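
The coin-toss likelihood can be explored on a grid of candidate values for p. This is a sketch assuming an observation of x = 7 successes in n = 10 tosses; the grid resolution is arbitrary.

```python
# Assumed observation: x = 7 successes in n = 10 tosses.
n, x = 10, 7

def likelihood(p):
    """f(p|x) = p^x (1-p)^(n-x), read as a function of p for fixed x."""
    return p**x * (1 - p) ** (n - x)

# Candidate values for p between 0 and 1.
grid = [i / 1000 for i in range(1001)]

# The most plausible p is the one with the largest likelihood.
best_p = max(grid, key=likelihood)

# Rough area under the likelihood curve; it is far from 1.
area = sum(likelihood(p) for p in grid) / 1000

print(best_p, area)
```

The maximum sits at the observed success fraction x/n, while the area under the curve is much smaller than 1, which is why the likelihood should not be read as a probability distribution over p.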


## maximum likelihood estimation

The likelihood of independent events $x_1,...,x_n$ occurring from a probability distribution P with the parameters $\Theta$ is:

\begin{equation}
P(\Theta|X) = P(\Theta|x_1) \cdot P(\Theta|x_2) \cdot ... \cdot P(\Theta|x_n)
\end{equation}

To find the maximum of the likelihood we set its derivative equal to zero. But before that we take the logarithm (the log likelihood) to make the calculation easier (products become sums); since the logarithm is monotonic, the maximum stays at the same parameters.

\begin{equation}
\log P(\Theta|X) = \log P(\Theta|x_1) + \log P(\Theta|x_2) + ... + \log P(\Theta|x_n)
\end{equation}

Now setting the derivative with respect to each parameter $\Theta_i$ equal to zero, we get the optimal parameters one at a time

\begin{equation}
\frac{\partial}{\partial \Theta_i} \log P(\Theta|X) = 0
\end{equation}
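
As a short worked case, the recipe can be applied to the coin toss likelihood $f(p|x) = p^x(1-p)^{n-x}$ with x successes in n tosses (an illustration that reuses the notation of the likelihood section):

\begin{align}
\log f(p|x) &= x \log p + (n-x) \log (1-p) \\
\frac{\partial}{\partial p} \log f(p|x) &= \frac{x}{p} - \frac{n-x}{1-p} = 0 \\
\Rightarrow \quad p &= \frac{x}{n}
\end{align}

So the maximum likelihood estimate of p is the observed fraction of successes.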


## example maximum likelihood

For example, we weigh 5 marbles and get the weights 5.0g, 4.8g, 5.1g, 4.0g, 4.9g. We assume a normal distribution of weights.

\begin{equation}
P(\mu,\sigma|x_i) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\bigg(\frac{x_i-\mu}{\sigma}\bigg)^2}
\end{equation}

\begin{equation}
P(\mu,\sigma|X) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\bigg(\frac{x_i-\mu}{\sigma}\bigg)^2}
\end{equation}

log likelihood

\begin{equation}
\log P(\mu,\sigma|X) = \sum_{i=1}^n -\frac{1}{2} \bigg( \log \big( 2\pi \big) + 2 \log \big( \sigma \big) + \Big(\frac{x_i-\mu}{\sigma}\Big)^2 \bigg)
\end{equation}

derivatives

\begin{align}
\frac{\partial}{\partial \mu} \log P(\mu,\sigma|X) &= \sum_{i=1}^n \frac{x_i-\mu}{\sigma^2} \\
&= \frac{1}{\sigma^2} \big( \Big( \sum_{i=1}^n x_i \Big) -n\mu \big)
\end{align}

\begin{align}
\frac{\partial}{\partial \sigma} \log P(\mu,\sigma|X) &= \Big( \sum_{i=1}^n -\frac{1}{\sigma} + \frac{(x_i-\mu)^2}{\sigma^3} \Big) \\
&= \frac{1}{\sigma^3} \Big( \sum_{i=1}^n \big( x_i -\mu \big)^2 \Big) - \frac{n}{\sigma}
\end{align}

setting equal to zero

\begin{align}
\mu = \frac{\sum_{i=1}^n x_i}{n}
\end{align}

\begin{align}
\sigma^2 = \frac{\sum_{i=1}^n \big( x_i -\mu \big)^2}{n}
\end{align}

we get the parameters for maximum likelihood, which we can insert into the normal distribution to get our final result.
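
Plugging the five marble weights into the two closed-form results gives concrete numbers. The sketch also evaluates the log likelihood from above to confirm that nearby parameter values score lower; the function name `log_likelihood` is chosen here for illustration.

```python
import math

weights = [5.0, 4.8, 5.1, 4.0, 4.9]  # marble weights in grams, from the text
n = len(weights)

# Maximum likelihood estimates from the closed-form results.
mu = sum(weights) / n
sigma2 = sum((w - mu) ** 2 for w in weights) / n
sigma = math.sqrt(sigma2)

def log_likelihood(m, s):
    """Log likelihood of the weights under a normal distribution N(m, s^2)."""
    return sum(
        -0.5 * (math.log(2 * math.pi) + 2 * math.log(s) + ((w - m) / s) ** 2)
        for w in weights
    )

print(mu, sigma2, sigma)
print(log_likelihood(mu, sigma), log_likelihood(mu + 0.1, sigma), log_likelihood(mu, sigma * 1.2))
```

This gives $\mu = 4.76$ g and $\sigma^2 = 0.1544$ g$^2$ (so $\sigma \approx 0.39$ g). Note that the maximum likelihood variance divides by n, not by n-1.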

## Bayes Theorem

\begin{equation}
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
\end{equation}

## Bayes Theorem example

52 card deck, 26 black / 26 red

probability that a red card is a 4

\begin{equation}
P(4|red) = \frac{P(red|4) \cdot P(4)}{P(red)} = \frac{1/2 \cdot 1/13}{1/2} = 1/13
\end{equation}
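
The same result can be obtained by counting cards directly. This sketch builds a standard 52 card deck and compares the direct conditional probability with the right-hand side of Bayes' theorem; the helper names are chosen for illustration.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}

# One (rank, colour) pair per card: 13 ranks x 4 suits = 52 cards.
deck = [(rank, colour) for rank in ranks for colour in suits.values()]

def prob(event, given=lambda card: True):
    """P(event | given), found by counting cards in the deck."""
    pool = [card for card in deck if given(card)]
    return Fraction(sum(1 for card in pool if event(card)), len(pool))

def is_four(card):
    return card[0] == "4"

def is_red(card):
    return card[1] == "red"

direct = prob(is_four, given=is_red)                                    # P(4|red) by counting
via_bayes = prob(is_red, given=is_four) * prob(is_four) / prob(is_red)  # Bayes' theorem

print(direct, via_bayes)  # 1/13 1/13
```

Using `Fraction` keeps the arithmetic exact, so both sides can be compared without rounding issues.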

## error for binary predictions


Data example:

<style>
table, th, td {
  padding: 5px;
  text-align: center;
  border: 1px solid black;
  border-collapse: collapse;
}
p{
  margin:0px;
  font-size:16px;
}
</style>

<table style="width:50%">
  <tr>
  	<td rowspan="2" colspan="2" ></td>
  	<td colspan="2" >actual</td>
  	<td rowspan="2" ></td>
  </tr>
  <tr>
  	<td>true</td>
  	<td>false</td>
  </tr>
  <tr>
  	<td rowspan="2" >predicted</td>
  	<td>true</td>
  	<td style='background:green'>60<br>true positives</td>
  	<td style='background:orange'>5<br>false positives</td>
  	<td>65</td>
  </tr>
  <tr>
  	<td>false</td>
  	<td style='background:yellow'>10<br>false negatives</td>
  	<td style='background:red'>25<br>true negatives</td>
  	<td>35</td>
  </tr>
  <tr>
  	<td colspan="2" ></td>
  	<td>70</td>
  	<td>30</td>
  	<td>100</td>
  </tr>
</table>





accuracy:<br>
percent of correct predictions out of the whole data set<br>
accuracy = (<b style='background:green;'>tp</b>+<b style='background:red;'>tn</b>)/(<b style='background:green;'>tp</b>+<b style='background:red;'>tn</b>+<b style='background:orange;'>fp</b>+<b style='background:yellow;'>fn</b>) = $\frac{60+25}{100} = 85 \%$<br><br>

precision:<br>
percent of true positives out of all positive predictions<br>
precision = <b style='background:green;'>tp</b>/(<b style='background:green;'>tp</b>+<b style='background:orange;'>fp</b>) = $ \frac{60}{60 + 5} = 92.3 \%$<br><br>


true positive rate (recall/sensitivity):<br>
percent of true positives out of all actually positive examples<br>
recall/sensitivity = <b style='background:green;'>tp</b>/(<b style='background:green;'>tp</b>+<b style='background:yellow;'>fn</b>) = $\frac{60}{60+10} = 85.7 \%$<br><br>

f1 score:<br>
harmonic mean of precision and recall; a good error measure if the data set is unbalanced<br>
$f_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} = 2 \cdot \frac{92.3 \cdot 85.7}{92.3 + 85.7} = 88.9 \%$<br><br>

true negative rate (specificity):<br>
percent of true negatives out of all actually negative examples<br>
specificity = <b style='background:red;'>tn</b>/(<b style='background:red;'>tn</b>+<b style='background:orange;'>fp</b>) = $\frac{25}{25+5} = 83.3 \%$<br><br>

false positive rate (fall-out):<br>
percent of false positives out of all actually negative examples<br>
fall-out = <b style='background:orange;'>fp</b>/(<b style='background:red;'>tn</b>+<b style='background:orange;'>fp</b>) = $\frac{5}{25+5} = 16.7 \%$<br><br>
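
The metrics above follow directly from the four cells of the table. A minimal sketch that recomputes them from the counts in the data example:

```python
# Counts taken from the table in the data example.
tp, fp, fn, tn = 60, 5, 10, 25

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all examples
precision = tp / (tp + fp)                  # true positives / positive predictions
recall = tp / (tp + fn)                     # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)                # true negatives / actual negatives
fall_out = fp / (tn + fp)                   # false positives / actual negatives

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("f1", f1),
    ("specificity", specificity),
    ("fall-out", fall_out),
]:
    print(f"{name}: {100 * value:.1f} %")
```

Specificity and fall-out always add up to 1, since both are fractions of the same set of actually negative examples.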

ROC curve:<br>
https://en.wikipedia.org/wiki/Receiver_operating_characteristic#ROC_space<br>
true positive rate (sensitivity) plotted against false positive rate (fall-out), with one point per decision threshold of the classifier. Lowering the threshold labels more examples as positive, so the true positive rate grows, but so does the false positive rate.
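
The points of a ROC curve can be computed by sweeping a decision threshold over the classifier scores. The scores, labels and thresholds in this sketch are made up for illustration.

```python
# Made-up classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

def roc_point(threshold):
    """Return (false positive rate, true positive rate) for one threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives

# Thresholds in descending order: each step labels more examples as positive.
thresholds = [1.0, 0.75, 0.55, 0.35, 0.0]
roc = [roc_point(t) for t in thresholds]

for t, (fpr, tpr) in zip(thresholds, roc):
    print(t, fpr, tpr)
```

The curve starts at (0, 0) for a threshold above all scores and ends at (1, 1) when everything is labeled positive; both rates can only grow as the threshold drops.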

for all possible metrics look at https://en.wikipedia.org/wiki/Confusion_matrix

## Bayesian inference

Using Bayes' theorem we can combine our estimated likelihood with a prior probability distribution. The likelihood describes how probable the observed data are for given parameters, and the prior distribution reflects what we already know or assume about the setting in which we apply this likelihood (like demographics, or ice cream sales in winter).

\begin{equation}
\text{posterior prob.} = \frac{\text{likelihood} \cdot \text{prior prob.}}{P(X)} \\
P(\Theta|X) = \frac{P(X|\Theta) \cdot P(\Theta)}{P(X)}
\end{equation}

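X = observed data<br>
$\Theta$ = model parameters<br>
prior prob. $P(\Theta)$ = assumed prob. distribution of the parameters before seeing the data <br>
likelihood $P(X|\Theta)$ = prob. of the observed data X for given parameters<br>
$P(X)$ = prob. of the data averaged over all parameter values (normalization)<br>
posterior prob. $P(\Theta|X)$ = resulting prob. distribution of the parameters given the data (note that the maximum likelihood sections above used the symbol $P(\Theta|X)$ for the likelihood itself)<br>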
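
For a single parameter the posterior can be computed on a grid. The sketch below assumes coin toss data with x = 7 successes in n = 10 tosses and a prior density $6p(1-p)$ that mildly favours a fair coin; both choices are illustrative. The binomial coefficient is left out of the likelihood because it cancels in the normalization.

```python
# Assumed data: x = 7 successes in n = 10 tosses.
n, x = 10, 7

# Grid of midpoints for the parameter p in (0, 1).
steps = 200
width = 1 / steps
grid = [(i + 0.5) * width for i in range(steps)]

def likelihood(p):
    """P(X|p) up to a constant factor that cancels in the normalization."""
    return p**x * (1 - p) ** (n - x)

def prior(p):
    """Assumed prior density 6 p (1 - p), peaked at p = 0.5."""
    return 6 * p * (1 - p)

unnormalised = [likelihood(p) * prior(p) for p in grid]
evidence = sum(unnormalised) * width             # numerical stand-in for P(X)
posterior = [u / evidence for u in unnormalised]  # posterior density on the grid

posterior_mean = sum(p * q for p, q in zip(grid, posterior)) * width
map_p = grid[max(range(steps), key=posterior.__getitem__)]

print(posterior_mean, map_p)
```

The posterior peaks near 0.67 and has a mean near 0.64, slightly below the maximum likelihood value 0.7, because the prior pulls the estimate towards 0.5.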

