# Statistics -- some basic concepts
*Disclaimer: Statistics is a complex subject; we can only cover a few basic elements in this 1-day hands-on course.*

## Characterizing Data
Two basic kinds of data:
* qualitative
  * categories, colors, emotions, preferences, popularity
  * polls, medical studies, market research 
  * *difficult to treat with mathematical methods*
* quantitative
  * numerical values -- the standard case in physics/science, we will focus on that
    * can be discrete -- number of tracks in a collision, number of persons in a car, number of heads when tossing coins, ...
    * or continuous -- mass of a particle, height of a person, intensity of a source, velocity of an object, ...
    


### Visualisation example
Frequency-distribution:
* 1 or more quantities are repeatedly measured 
* fill quantity in **histogram**
  * divide range *(xmin, xmax)* into *nbins* intervals
  * each observation is filled into histogram
    * interval content incremented by 1
    
Simple example:
* fill measured wire position into histogram


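If normalized to the total number of entries, the bin content

$ \frac {N_{bin}} {N_{tot}}$

can be interpreted as the probability to find a value in that bin. Dividing in addition by the bin width gives the probability density (this is what the `density` flag of `plt.hist` does).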
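The normalization can be sketched with `np.histogram`. The sample below is synthetic (an assumption for illustration, not the course data), and the variable names are made up for this example.

```python
import numpy as np

# Synthetic stand-in for measured positions (assumption: not the course data)
rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=0.0, scale=20.0, size=1000)

nbins, xmin, xmax = 20, -100.0, 100.0
counts, edges = np.histogram(sample, bins=nbins, range=(xmin, xmax))

n_in_range = counts.sum()            # number of entries inside (xmin, xmax)
fractions = counts / n_in_range      # N_bin / N_tot, sums to 1
bin_width = edges[1] - edges[0]
density = fractions / bin_width      # integrates to 1 over the histogram range

# numpy's own density normalization should give the same numbers
density_np, _ = np.histogram(sample, bins=nbins, range=(xmin, xmax), density=True)
print(fractions.sum(), (density * bin_width).sum())
```

The fractions depend on the chosen bin width, while the density does not (up to statistical fluctuations), which is why the density is the quantity to compare with a pdf.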


In [None]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

In [None]:
data = np.loadtxt('http://www-static.etp.physik.uni-muenchen.de/kurs/Computing/python/nb/data/rohr1.dat')
fig = plt.figure(1, figsize=(7,6))  # get handle to matplotlib figure and specify size
plt.hist(data) # automatic determination of range and bins
#plt.hist(data, bins=100, range=(-100,100), density=True) # specify range and bins and density flag
plt.xlabel(r'x-coord [$\mu m$]') # x-axis (raw string avoids the invalid escape '\m')
plt.title('1D Histogram tube positions') # title header
plt.grid(True)

##### Can be extended to 2-D (and more...) 

In [None]:
data2 = np.loadtxt('http://www-static.etp.physik.uni-muenchen.de/kurs/Computing/python/nb/data/rohr2.dat')
fig = plt.figure(1, figsize=(9,6))  # get handle to matplotlib figure and specify size
plt.hist2d(data2[:,0],data2[:,1], bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
plt.xlabel(r'x-coord [$\mu m$]') # x-axis
plt.ylabel(r'y-coord [$\mu m$]') # y-axis
plt.title('2D Histogram tube positions'); # title header


### Characterizing data (continued)

#### Mean value:
The most common way to describe a dataset ${(x_1, x_2, ... , x_N)}$ with a single value is the **arithmetic mean**:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$

Useful for many cases, but there are more alternatives:

**Geometric mean**: $$ \sqrt[N]{x_1 \cdot x_2 \cdot ... \cdot x_N}$$

**Harmonic mean**: $$ \frac{N}{1/x_1 + 1/x_2 + ... + 1/x_N} $$

**Median**: 
* the value in the middle, i.e. half of the values are smaller, the other half larger
* very useful for datasets where the underlying distribution is not known or has large tails

A simple way to get it is to sort the dataset and take the value in the middle:
$$ \mbox{Median} = x_{(N+1)/2} \;\; (N \mbox{ odd}), \qquad \mbox{Median} = \frac{x_{N/2} + x_{N/2+1}}{2} \;\; (N \mbox{ even}) $$
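A minimal sketch of these four quantities with NumPy, using a small made-up dataset (the values are chosen only for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0])     # illustrative values, all positive

arithmetic = x.sum() / len(x)           # sum of the values divided by N
geometric = np.exp(np.log(x).mean())    # N-th root of the product, computed via logarithms
harmonic = len(x) / (1.0 / x).sum()     # N divided by the sum of the reciprocals
median = np.median(x)                   # for even N: average of the two middle values

print(arithmetic, geometric, harmonic, median)
```

For positive values the three means are ordered, harmonic ≤ geometric ≤ arithmetic, which is a quick sanity check for any implementation.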


#### Width
Besides the mean, the width is the second important quantity to characterize data.

Most common is the **variance**:
$$V(x) = \frac{\sum (x_i - \bar{x})^2}{N} = \frac{\sum x_i^2}{N} - \bar{x}^2$$
    
or the **standard-deviation**:
 $$\sigma(x) = \sqrt{V(x)} = \sqrt{ \frac{\sum (x_i - \bar{x})^2}{N}} = \sqrt{\frac{\sum x_i^2}{N} - \bar{x}^2 }$$
 
A further option is the mean absolute deviation:
$$ \frac{\sum |x_i - \bar{x}|}{N} $$

*not very common, awkward mathematical properties ...*


More practical are so-called **quantiles**:
* lower quartile: 25% of values are lower 
* upper quartile: 75% of values are lower 
* take the **width** as the difference between upper and lower quartile
* or, more generally, take an arbitrary **percentile**: the XX% percentile is the value below which XX% of the values lie.
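The width measures above can be sketched as follows; the dataset is made up for illustration. Note that `np.var` and `np.std` use the $1/N$ normalization of the formulas above by default.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # illustrative values

mean = x.mean()
variance = np.mean((x - mean) ** 2)         # first form of V(x)
variance_alt = np.mean(x ** 2) - mean ** 2  # second form: mean of squares minus squared mean
sigma = np.sqrt(variance)                   # standard deviation
mean_abs_dev = np.mean(np.abs(x - mean))    # mean absolute deviation

q25, q75 = np.percentile(x, [25, 75])       # lower and upper quartile
width_quartiles = q75 - q25                 # difference between the quartiles

print(variance, sigma, mean_abs_dev, width_quartiles)
```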

#### Higher powers

An obvious extension to arithmetic mean and variance are higher powers:

$$ \mbox{Skew} = \frac{\sum (x_i - \bar{x})^3}{N \sigma^3}$$

*(Factor $1/\sigma^3$ makes Skew dimensionless)*

Skew is useful to characterize asymmetries: it is positive in case of tails towards high values (and vice versa) and ~0 for symmetric distributions.
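As a rough illustration, the skew can be computed directly from the formula. The helper `skew` and the random samples are assumptions made for this sketch; an exponential distribution is used as an example of a tail towards high values.

```python
import numpy as np

def skew(values):
    """Skew as defined above: third central moment divided by sigma^3 (1/N normalization)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sigma = values.std()
    return np.mean((values - mean) ** 3) / sigma ** 3

rng = np.random.default_rng(seed=2)
symmetric = rng.normal(size=100_000)           # symmetric distribution
right_tailed = rng.exponential(size=100_000)   # tail towards high values

skew_sym = skew(symmetric)         # expected to be close to 0
skew_exp = skew(right_tailed)      # positive (the exact value for an exponential is 2)
skew_mirror = skew(-right_tailed)  # mirrored sample: same size, opposite sign

print(skew_sym, skew_exp, skew_mirror)
```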




### Multiple variables
Often several quantities are recorded simultaneously per "event":
* x,y,z-coordinates
* weight and height 
* exam results in different fields

Besides the mean and variance of the individual quantities, it is also important to quantify dependencies between variables. Similar to the **variance** one can define the **covariance**: 
$$cov(x,y) = \frac{1}{N} \sum (x_i - \overline{x}) \cdot (y_i - \overline{y}) = 
\overline{xy} - \overline{x} \cdot \overline{y}$$

* If high (low) values of x often occur together with high (low) values of y, the covariance is positive
* if high values of x tend to occur with low values of y (and vice versa), it is negative
* for independent quantities the covariance is close to zero.


A better measure to quantify the dependency is the dimensionless **correlation** coefficient  
$$ \rho \equiv \frac{cov(x,y)}{\sigma(x) \sigma(y)}$$
with $ -1 \le \rho \le 1$.

$\rho = \pm 1 $ means full correlation: the values of y are fully determined by x (or vice versa) and do not provide additional information.
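A minimal sketch of covariance and correlation on synthetic data. The construction of `y` from `x` is an assumption for illustration, chosen such that the expected correlation is 0.8.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
x = rng.normal(size=5000)
y = 0.8 * x + 0.6 * rng.normal(size=5000)   # y depends partly on x, expected rho = 0.8

# Both forms of the covariance formula
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
cov_alt = np.mean(x * y) - x.mean() * y.mean()

rho = cov_xy / (x.std() * y.std())          # correlation coefficient
rho_np = np.corrcoef(x, y)[0, 1]            # same quantity from numpy

# A fully determined linear relation gives rho = 1
z = 2.0 * x + 1.0
rho_full = np.corrcoef(x, z)[0, 1]

print(cov_xy, rho, rho_np, rho_full)
```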



#### Correlation examples

<table>
    <tr><td>
    <img src='../figures/cor_0.0.png' width="300" height="300"></td><td><img src='../figures/cor_m0.50.png' width="300" height="300"></td></tr>
    <tr><td>
    <img src='../figures/cor_0.90.png' width="300" height="300"></td><td><img src='../figures/cor_m0.98.png' width="300" height="300"></td></tr>

</table>

#### Covariance/correlation matrix
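In case of more than 2 variables $x_{(1)}, x_{(2)}, \ldots, x_{(M)}$ this approach leads to the **covariance matrix**:
$$V_{ij} = cov( x_{(i)}, x_{(j)} )$$
or the **correlation matrix**:
$$\rho_{ij} = \frac{cov( x_{(i)}, x_{(j)} )}{\sigma_i \sigma_j}$$

These are symmetric $M\times M$ matrices, where $M$ is the number of variables. The diagonal elements are the variances, $V_{ii} = \sigma_i^2$, and $\rho_{ii} = 1$.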
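With NumPy both matrices can be obtained in one call each. The three synthetic variables `a`, `b`, `c` below are assumptions for illustration; `bias=True` selects the $1/N$ normalization used in these notes.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
n_events = 2000
a = rng.normal(size=n_events)
b = a + 0.5 * rng.normal(size=n_events)   # correlated with a
c = rng.normal(size=n_events)             # independent of a and b

data = np.vstack([a, b, c])               # shape (3, n_events): one row per variable

V = np.cov(data, bias=True)               # covariance matrix with 1/N normalization
rho = np.corrcoef(data)                   # correlation matrix

sigmas = np.sqrt(np.diag(V))              # standard deviations from the diagonal
rho_from_V = V / np.outer(sigmas, sigmas) # rho_ij = V_ij / (sigma_i * sigma_j)

print(np.round(rho, 2))
```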

## Important Distributions
In case of continuous distributions, mean and variance are defined as integrals over the probability density function (*pdf*).
 $$\mbox{ Mean:  } \bar{x} = \int_{-\infty}^{+\infty} x \cdot p(x) \, dx $$
 $$\mbox{ Variance:  }V(x) = \int_{-\infty}^{+\infty} (x - \bar{x})^2 \cdot p(x) \, dx $$
 $$\mbox{ Standard-deviation:  } \sigma(x) = \sqrt{V(x)} $$

### Uniform distribution
The simplest distribution: all values in a specified interval are equally likely:
$$p(x) = \frac{1}{b-a} \,\, \forall \, x \in [a,b], \, 0 \, \mbox{otherwise}$$

It is important for all kinds of gambling, forms the basis for random number generators and simulations, and is also a nice test case:

Take a uniform distribution in $[0,1]$. It is rather straightforward to show that the mean is *0.5* and the variance is $\sigma^2 = 1/12$.
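For completeness, the calculation can be sketched with $p(x) = 1$ on $[0,1]$, using the integral definitions of mean and variance and the identity $V(x) = \overline{x^2} - \bar{x}^2$:

$$\bar{x} = \int_0^1 x \, dx = \frac{1}{2}, \qquad V(x) = \int_0^1 x^2 \, dx - \bar{x}^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}$$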

![Uniform distribution](../figures/gleichv1.png)

### Binomial distribution
Describes experiments with 2 possible outcomes. A simple example is tossing a coin.

What is the probability to get *k* times *head* in *n* tries (*p* is the probability for *head* in a single toss)?

One special case would be to obtain *head* in the first *k* tosses ($p^k$) and *tail* in the subsequent *n-k* tosses ($(1-p)^{n-k}$), i.e. the combined probability for this to happen is 
$$p^k (1-p)^{n-k}$$

Combinatorics shows that there are
$${ n \choose k } = \frac{n !} { (n-k)! k!}$$
such sequences with *k* heads, each with equal probability.

So in total the probability is 
$$P(k\,\times \, \mbox{head}) = p^k (1-p)^{n-k}{ n \choose k }$$



Mean and variance of the binomial-distribution are:
$$ \bar{x} = n p , \; \; \sigma^2 = n p (1-p) $$
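A small sketch that evaluates the binomial probabilities with the standard library and checks mean and variance numerically. The helper `binomial_pmf` and the choice $n = 10$, $p = 0.5$ are assumptions for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n tries with single-try probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5   # ten tosses of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                          # probabilities sum to 1
mean = sum(k * pk for k, pk in enumerate(pmf))            # should equal n*p
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # should equal n*p*(1-p)

print(pmf[5], total, mean, variance)
```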




![Binomial distribution](../figures/bin1.png)

### Poisson distribution
The binomial distribution approaches the Poisson distribution in the limit 
$$n \rightarrow \infty,\; p\rightarrow 0,\; np = const$$

The Poisson distribution is defined as 
$$P(k) = \frac{\mu^k e^{-\mu}}{k!}$$
with $\mu = n p$.

Mean and variance of the Poisson-distribution are:
$$ \bar{x} = \mu , \; \; \sigma^2 = \mu $$

In practical terms the binomial distribution is already reasonably well described by the corresponding Poisson distribution for $n$ as *small* as $n \approx 10-20$, provided $p$ is small. 
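The quality of the approximation can be checked numerically. The sketch below compares both distributions for $n = 20$ and two choices of $p$ (both picked here only for illustration); the helper functions are assumptions of this example.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    return mu ** k * exp(-mu) / factorial(k)

def max_difference(n, p):
    """Largest absolute difference between binomial and Poisson (mu = n*p) over k = 0..n."""
    mu = n * p
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu)) for k in range(n + 1))

diff_small_p = max_difference(20, 0.05)   # mu = 1
diff_large_p = max_difference(20, 0.5)    # mu = 10

print(diff_small_p, diff_large_p)
```

The agreement is clearly better for the smaller $p$, in line with the limit $p \rightarrow 0$ in which the Poisson distribution is obtained.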



![Poisson distribution](../figures/bin_poiss.png)

#### Poisson statistics showcase: the Prussian army

A classical example in statistics textbooks is the Prussian army's record of fatal accidents due to horse kicks, counted per army corps and year. 

Over 20 years and 10 corps ($ = 200 $ corps-years) a total of 122 fatal cases was counted. 

This yields $ \mu = 122/200 = 0.61$.
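With this $\mu$ the Poisson distribution predicts how many of the 200 corps-years should show 0, 1, 2, ... deaths. The sketch below computes these expectations. The `observed` list is not given in the text above; it contains the numbers commonly quoted for this dataset in textbooks and is included here as an assumption, with a consistency check against the totals stated above.

```python
from math import exp, factorial

n_corps_years = 200
n_deaths = 122
mu = n_deaths / n_corps_years              # 0.61 deaths per corps and year

def poisson_pmf(k, mu):
    return mu ** k * exp(-mu) / factorial(k)

# Expected number of corps-years with k = 0..4 deaths
expected = [n_corps_years * poisson_pmf(k, mu) for k in range(5)]

# Assumption: commonly quoted textbook counts for k = 0..4 (not taken from this text)
observed = [109, 65, 22, 3, 1]

for k, (obs, exp_k) in enumerate(zip(observed, expected)):
    print(f"k={k}: observed {obs:3d}, Poisson expectation {exp_k:6.2f}")
```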


![Prussian army statistics](../figures/preussenstat.png)

### Gaussian or Normal distribution
The most important distribution in statistics:
 $$G(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi} \, \sigma}
e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

with mean $\mu$  and variance $\sigma^2$.

For large $\mu$ the Poisson distribution converges to the Gaussian distribution; in practice the Gaussian is a reasonable approximation already for $\mu \approx 10$.

Similarly, the binomial distribution converges to the Gaussian for large *n* and *np*.
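A quick numerical comparison for $\mu = 10$: the Poisson probabilities are set against a Gaussian with the same mean and variance ($\sigma^2 = \mu$), evaluated at the integer values. The helper functions are assumptions of this sketch.

```python
from math import exp, factorial, pi, sqrt

def poisson_pmf(k, mu):
    return mu ** k * exp(-mu) / factorial(k)

def gauss_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

mu = 10.0
sigma = sqrt(mu)    # Poisson variance equals mu

for k in range(0, 21, 5):
    print(f"k={k:2d}: Poisson {poisson_pmf(k, mu):.4f}, Gaussian {gauss_pdf(k, mu, sigma):.4f}")

# Largest absolute difference over k = 0..30
max_diff = max(abs(poisson_pmf(k, mu) - gauss_pdf(k, mu, sigma)) for k in range(31))
print(max_diff)
```

The remaining differences mainly reflect the slight asymmetry of the Poisson distribution, which the symmetric Gaussian cannot reproduce.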


![Gaussian distribution](../figures/bin_poiss_ga.png)

#### Distribution summary

![BPG](../figures/BPG.png)

### Central limit theorem
The most fundamental theorem in statistics:

For a set of $n$ independent random variables $x_i$ with mean $\mu$ and variance $\sigma^2$ the quantity 
$$ y = \frac{\sum x_i} { n} $$
approximately follows a Gaussian with mean $\mu$ and variance $\sigma^2/n$ for large $n$.

The underlying distribution of the $x_i$ does not matter, e.g. if you draw random variables which are uniformly or exponentially distributed, the mean $y$ will follow approximately a normal distribution.

$\rightarrow$ we will demonstrate it in the exercises
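As a small preview, here is a minimal sketch with uniformly distributed random numbers. The values of `n` and `n_experiments` are arbitrary choices for illustration. The mean of $n$ values should scatter around 0.5 with variance $\sigma^2/n = 1/(12n)$, and about 68% of the means should lie within one standard deviation, as expected for a Gaussian.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
n = 50                   # number of random variables per mean
n_experiments = 20_000   # how often the mean is computed

samples = rng.uniform(0.0, 1.0, size=(n_experiments, n))
y = samples.mean(axis=1)                 # one mean per experiment

mean_y = y.mean()
var_y = y.var()
expected_var = (1.0 / 12.0) / n          # sigma^2 / n for the uniform distribution on [0,1]

sigma_y = np.sqrt(expected_var)
frac_1sigma = np.mean(np.abs(y - 0.5) < sigma_y)   # about 0.683 for a Gaussian

print(mean_y, var_y, expected_var, frac_1sigma)
```

Filling `y` into a histogram, e.g. with `plt.hist(y, bins=50)`, shows the bell shape directly.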


The central limit theorem is the reason for the exceptional importance of the Gaussian distribution.