<a href="https://colab.research.google.com/github/HuberAdrian/DataScience-Lectures/blob/main/Statistics_Part_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Statistics Part II

## Outline
* Basic Probability Theory
* Bayes' Theorem
* Hypotheses and Statistical Tests


## Basic Probability Theory

***Probability*** is the measure of the likelihood that an event will occur. It is quantified as a number between $0$ and $1$, where, loosely speaking, $0$ indicates impossibility and $1$ indicates certainty. The higher the probability of an event, the more likely it is that the event will occur. [wikipedia]

### Notation and Probability Axioms 

We write $P(E) \in [0,1]$ to denote the ***probability*** of event $E$. $P(E)$ satisfies the following axioms:

* $P(E) \in \mathbb{R}$ and $P(E)\ge 0$ for all possible $E$
* $P(\Omega) = 1$, where $\Omega$ is the set of all possible outcomes: it is certain that some outcome will occur
* $P\left(\cup_i E_i\right) = \sum_i P(E_i)$ for mutually exclusive (pairwise disjoint) events $E_i$


### Dependence of Events
In probability theory, two events are independent, statistically independent, or stochastically independent if the occurrence of one does not affect the probability of occurrence of the other (equivalently, does not affect the odds). 


* we write $E_1 \perp E_2$ for ***independent*** events
* $P(E_1, E_2) = P(E_1)P(E_2)$

$\rightarrow$ **multiplication** of probabilities if events are independent! 

#### Example 1: Coin flip 
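
A minimal sketch of the multiplication rule, assuming two fair and independent coin flips. The helper `prob` is a hypothetical name introduced only for this illustration; it counts favorable outcomes in the equally likely sample space.

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair coin flips: all four outcomes are equally likely.
outcomes = list(product("HT", repeat=2))


def prob(event):
    """Probability of an event, given as a predicate on a single outcome."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))


p_first_heads = prob(lambda o: o[0] == "H")
p_second_heads = prob(lambda o: o[1] == "H")
p_both_heads = prob(lambda o: o == ("H", "H"))

# For independent events the joint probability is the product of the marginals.
print(p_first_heads, p_second_heads, p_both_heads)
print(p_both_heads == p_first_heads * p_second_heads)
```

Exact fractions are used instead of floats so that the equality check is not affected by rounding.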

#### Example 2: Sex of children

#### How to compute $P(E_1, E_2)$ if events are dependent, e.g. $E_1$ depends on $E_2$ ?

#### Conditional Probability
 $P(E_1, E_2) = P(E_1|E_2)P(E_2)$
 
 where $P(E_1|E_2)$ is the ***conditional*** probability of event $E_1$, i.e. the probability that $E_1$ occurs **given** that $E_2$ is known to have occurred
 

This results in:
* $P(E_1|E_2) = P(E_1)$ if events are independent 
* $P(E_1|E_2) = P(E_1, E_2) \div P(E_2)$ for $P(E_2) > 0$


#### Probability can be quite counterintuitive:
Examples based on child sex (assuming a 50% chance for a boy or a girl and independence between children):

* **Example 1**: What is the probability that both children are girls (event $B$, $P(B)=0.25$), given that the first child is a girl (event $G$, $P(G)=0.5$)?

Since $B$ can only occur together with $G$, we have $P(B,G) = P(B)$:

$P(B|G) = P(B,G) \div P(G) = P(B) \div P(G) = 0.25 \div 0.5 = 0.5$

* **Example 2**: What is the probability that both children are girls (event $B$, $P(B)=0.25$), given that at least one child is a girl (event $L$, $P(L)=0.75$)?

Since $B$ implies $L$, we have $P(B,L) = P(B)$:

$P(B|L) = P(B,L) \div P(L) = P(B) \div P(L) = 0.25 \div 0.75 \approx 0.333$
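
Both child-sex examples can be checked by enumerating the four equally likely two-child families. This is only a sketch under the stated assumptions (50% chance per sex, independent children); the function names are hypothetical and chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# All equally likely families as (first child, second child), G = girl, B = boy.
families = list(product("GB", repeat=2))


def both_girls(family):
    return family == ("G", "G")


def first_is_girl(family):
    return family[0] == "G"


def at_least_one_girl(family):
    return "G" in family


def conditional_prob(event, given):
    """P(event | given), computed by counting within the conditioning event."""
    conditioned = [f for f in families if given(f)]
    favorable = sum(1 for f in conditioned if event(f))
    return Fraction(favorable, len(conditioned))


p_example_1 = conditional_prob(both_girls, given=first_is_girl)
p_example_2 = conditional_prob(both_girls, given=at_least_one_girl)

print("P(both girls | first is a girl)     =", p_example_1)
print("P(both girls | at least one girl)   =", p_example_2)
```

The two conditions differ in how many families they leave: "first child is a girl" keeps two families, "at least one girl" keeps three, which is why the second probability is smaller.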

## Bayes' Theorem

In probability theory and statistics, ***Bayes' theorem*** describes the ***probability*** of an event, based on ***prior knowledge*** of conditions that might be related to the event.

### Bayes' Theorem

$ P(E|F) = {{ P(F|E)P(E)} \over { P(F)}}$ for $P(F)>0$

#### Interpretation:
* reversing the deduction of a conditional probability
* a feature often needed in inference (machine learning) settings (more on this later)

#### Example:
Suppose that a test for using a particular drug is 99% sensitive and 99% specific. That is, the test will produce 99% true positive results for drug users and 99% true negative results for non-drug users. Suppose that 0.5% of people are users of the drug. What is the probability that a randomly selected individual with a positive test is a drug user? 

<img src="https://github.com/keuperj/DataScience22/blob/main/week_4/IMG/drug.svg?raw=1">
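
The drug test example can be worked through numerically with Bayes' theorem. The sketch below uses only the numbers stated in the example; the variable names are chosen for illustration, and $P(\text{positive})$ is obtained from the law of total probability.

```python
# Numbers taken from the example text.
sensitivity = 0.99   # P(positive | user)
specificity = 0.99   # P(negative | non-user)
prevalence = 0.005   # P(user)

# P(positive | non-user) is the false positive rate.
false_positive_rate = 1 - specificity

# Law of total probability: P(positive) over users and non-users.
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' theorem: P(user | positive) = P(positive | user) P(user) / P(positive)
p_user_given_positive = sensitivity * prevalence / p_positive

print(f"P(user | positive) = {p_user_given_positive:.3f}")
```

Although the test is 99% sensitive and specific, only about one in three positive results belongs to an actual user, because non-users vastly outnumber users.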

## Experiments: Hypotheses and Statistical Tests
Ultimately, we use **statistics** and analyze **probabilities** in order to **draw conclusions**. A common approach in statistics is to start an experiment with a ***hypothesis***, which is then checked with a statistical ***test***.

### Statistical Inference Pipeline
* Formulate Hypotheses 
* Design Experiment
* Collect Data
* Test / draw conclusions

### Null Hypothesis
* The statement being tested in a test of statistical significance is called the ***null hypothesis***. The test of significance is designed to assess the strength of the evidence against the null hypothesis. Usually, the null hypothesis is a statement of ***'no effect'*** or ***'no difference'***. <br>
It is often symbolized as $H_0$.

* The statement that is being tested against the null hypothesis is the alternative hypothesis $H_1$.

* ***Statistical significance test***: Very roughly, the procedure for deciding goes like this: Take a random sample from the population. If the sample data are consistent with the null hypothesis, then do not reject the null hypothesis; if the sample data are inconsistent with the null hypothesis, then reject the null hypothesis in favor of the alternative hypothesis.

### Significance Tests
* $p$-value (or significance): for a given statistical model, the probability that, when the null hypothesis is true, the **test statistic** would be at least as extreme as the value actually observed.

* at experiment design, a significance threshold $\alpha$ is chosen; $H_0$ is rejected if $p \le \alpha$
* typically, $\alpha = 0.05$ for scientific experiments 

#### Tests
* t-Test
* $\chi^2$ Test
* ...

### t-Test
The ***t-Test*** (also called Student’s t-Test) compares two means and assesses whether they differ significantly, in other words whether the observed difference could plausibly have arisen by chance. In the one-sample case, a sample mean is compared with a hypothesized mean $\mu_0$:

$t = \sqrt{n}{{\bar{X}-\mu_0} \over {\sigma }}$, where
<br>
* $\sigma$ is the sample standard deviation (estimated from the $n$ samples)
* $\bar{X}$ is the sample mean of the $n$ samples
* $\mu_0$ is the hypothesized mean under $H_0$

```python
import numpy as np
from scipy import stats  # statistics module

# Sample size
N = 10
# Gaussian distributed data with mean = 0.1 and var = 1
a = np.random.randn(N) + 0.1
# Gaussian distributed data with mean = 0 and var = 1
b = np.random.randn(N)

# Two-sample t-test with the built-in scipy function
t2, p2 = stats.ttest_ind(a, b)
print("means: ", a.mean(), b.mean())
print("t = " + str(t2))
print("p = " + str(p2))
```

Output of one run (the values vary between runs because no random seed is set):

```
means:  0.265499207605461 -0.3446791067063185
t = 1.1996256444955642
p = 0.24583762464743533
```
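
The one-sample formula $t = \sqrt{n}(\bar{X}-\mu_0)/\sigma$ can also be evaluated by hand and compared with `scipy.stats.ttest_1samp`. This is a sketch, not part of the original notebook: the fixed seed, the choice $\mu_0 = 0$, and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed (assumption) for reproducibility
n = 10
mu_0 = 0.0                      # hypothesized mean under H0 (assumption)

# Gaussian sample with true mean 0.1 and variance 1
x = rng.normal(loc=0.1, scale=1.0, size=n)

# Sample standard deviation with Bessel's correction (ddof=1)
sigma = x.std(ddof=1)

# t statistic from the formula, and the two-sided p-value from the
# t distribution with n - 1 degrees of freedom
t_manual = np.sqrt(n) * (x.mean() - mu_0) / sigma
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check with the built-in one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(x, popmean=mu_0)

print("manual: t =", t_manual, " p =", p_manual)
print("scipy:  t =", t_scipy, " p =", p_scipy)
```

Both lines should agree up to floating-point rounding, which confirms that $\sigma$ in the formula is the sample standard deviation with $n-1$ in the denominator.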


### $\chi^2$-Test
The chi-square ($\chi^2$) test is used to compare categorical distributions, e.g. to decide whether the category counts of several experiments differ significantly.

#### Example:
<table>
    <tr><td>category</td> <td>Exp1</td> <td>Exp2</td> <td>Exp3</td> </tr>
    <tr><td>A</td><td>14</td> <td>8</td> <td>12</td> </tr>
    <tr><td>B</td> <td>986</td> <td>992</td> <td>988</td> </tr>
    
</table>
Null hypothesis: the category proportions are the same in all experiments.

#### Expected Values 
* $E(A) = (14 + 8 + 12) \div 3 \approx 11.33$
* $E(B) = (986 + 992 + 988) \div 3 \approx 988.67$

#### ***Pearson Residual***
$R = {{\text{observed} - \text{expected}} \over {\sqrt{\text{expected}}}}$

<table>
    <tr><td>category</td> <td>Exp1</td> <td>Exp2</td> <td>Exp3</td> </tr>
    <tr><td>A</td><td>0.79</td> <td>-0.99</td> <td>0.20</td> </tr>
    <tr><td>B</td> <td>-0.08</td> <td>0.11</td> <td>-0.02</td> </tr>
    
</table>

* $\chi^2 := \sum_i\sum_j R_{ij}^2$, where $R_{ij}$ are all elements of the Table.
* $\chi^2 \approx 1.67$

#### In Python

```python
from scipy.stats import chisquare
chisquare([16, 18, 16, 14, 12, 12], f_exp=[16, 16, 16, 16, 16, 8])
```

```
Power_divergenceResult(statistic=3.5, pvalue=0.6233876277495822)
```
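
The table from the example above (category A/B counts for Exp1 to Exp3) can be analyzed the same way. The sketch below computes the expected counts, Pearson residuals, and $\chi^2$ with NumPy, then cross-checks with `scipy.stats.chi2_contingency`; using that function here is a choice made for this illustration, and the variable names are assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows are categories A and B, columns are Exp1..Exp3.
observed = np.array([[14, 8, 12],
                     [986, 992, 988]])

# Expected counts under H0: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson residuals and the chi-square statistic as the sum of their squares.
residuals = (observed - expected) / np.sqrt(expected)
chi2_manual = (residuals ** 2).sum()

# Cross-check: chi2_contingency also returns the p-value and degrees of freedom.
chi2_scipy, p_value, dof, expected_scipy = chi2_contingency(observed)

print("expected counts:\n", expected.round(2))
print("Pearson residuals:\n", residuals.round(2))
print("chi2 (manual) =", chi2_manual)
print("chi2 (scipy)  =", chi2_scipy, " p =", p_value, " dof =", dof)
```

With two degrees of freedom the resulting $p$-value is far above $\alpha = 0.05$, so the null hypothesis of equal category proportions is not rejected for this table.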

## Discussion