# Statistics

### Mean aka average

$$
\frac{1}{n} \sum_{i=1}^{n} a_i
$$
- Sum of all values in a collection divided by number of items
- Preferable in large datasets with few outliers
- Inferential statistics are largely built on measurements of the mean
- $\mu$ ("mu") is the standard notation for a population mean.
- $\bar x$ ("x-bar") is the standard notation for a sample mean.
- $\bar X$ (capitalized "x-bar") is the common notation for a sample mean where $X$ is a random variable.

- __Population__ is the complete set of all possible data points.
- __Sample__ is a subset of the population; it does _not_ include every possible data point.

```python
def mean(lst):
    return sum(lst) / len(lst)
```
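
To make the population/sample distinction concrete, here is a small sketch. The "population" of the integers 1 to 100 and the sample size of 10 are made-up values for illustration only; `fmean` comes from the standard-library `statistics` module.

```python
import random
from statistics import fmean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "population": every value we could possibly observe.
population = list(range(1, 101))  # 1..100
mu = fmean(population)            # population mean, exactly 50.5

# A sample is only a subset of the population,
# so its mean (x-bar) merely estimates mu.
sample = random.sample(population, 10)
x_bar = fmean(sample)

print(f"mu = {mu}, x_bar = {x_bar}")
```

Rerunning with a different seed gives a different `x_bar`, while `mu` stays fixed.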

### Median aka the middle value

- Odd number of items: take the middle item of the sorted values
- Even number of items: take the mean of the two middle values
- $\tilde x$ (lower-case x with a tilde) often denotes the median
- The median is resistant to outliers

```python
def median(lst):
    length = len(lst)
    # sorted() returns a new sorted list; lst.sort() sorts in place and returns None
    sorted_list = sorted(lst)
    if length % 2 != 0:  # odd length: a single middle item
        return sorted_list[length // 2]
    else:  # even length
        i1 = length // 2 - 1
        i2 = i1 + 1
        # return the mean of the 2 middle items
        return (sorted_list[i1] + sorted_list[i2]) / 2
```
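
The claim that the median resists outliers while the mean does not can be checked with a quick sketch. The numbers below are invented for illustration, and the standard-library `statistics` module is used so the example stands alone.

```python
from statistics import mean, median

values = [30, 32, 35, 38, 40]    # hypothetical data, no outliers
with_outlier = values + [1000]   # same data plus one extreme value

print(mean(values), median(values))              # 35 35
print(mean(with_outlier), median(with_outlier))  # mean jumps to about 195.8, median is 36.5
```

One extreme value drags the mean far from the bulk of the data, while the median barely moves.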

### Mode aka most frequent

- Can describe collections that are non-numerical

```python
def mode(lst):
    count_dict = {}
    for item in lst:
        if item in count_dict:
            count_dict[item] += 1
        else:
            count_dict[item] = 1
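    max_freq = max(count_dict.values())  # highest count seen
    # every item that reaches the highest count is a mode
    modes = [item for item, freq in count_dict.items() if freq == max_freq]
    # if every item appears exactly once, report no mode
    return None if len(modes) == len(lst) else modes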
```
##### Using Python's `collections.Counter`
```python
import collections

def mode(lst):
    count_dict = dict(collections.Counter(lst))

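    max_freq = max(count_dict.values())  # highest count seen
    # every item that reaches the highest count is a mode
    modes = [item for item, freq in count_dict.items() if freq == max_freq]
    # if every item appears exactly once, report no mode
    return None if len(modes) == len(lst) else modes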
```
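
Since the mode also works on non-numerical collections, here is a short sketch with strings. The color list is made up for illustration; `Counter.most_common` and `statistics.multimode` (Python 3.8+) are standard-library calls.

```python
from collections import Counter
from statistics import multimode

colors = ["red", "blue", "red", "green", "blue", "red"]  # hypothetical data

print(Counter(colors).most_common(1))   # [('red', 3)]
print(multimode(colors))                # ['red']
print(multimode(["a", "b", "a", "b"]))  # ['a', 'b'] (a tie gives several modes)
```

Note that `multimode` returns every value when all items are unique, whereas the functions above return `None` in that case.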

### Five Number Summary
##### A more in-depth description of collections of values

- Min
- $ Q_1$ first quartile
- Median
- $ Q_3 $ third quartile
- Max   
------
- Often expressed as a tuple: (min, $Q_1$, median, $Q_3$, max)

```python
import numpy as np

def five_summary(lst):
    sorted_lst = sorted(lst)
    med = np.median(sorted_lst)
    
    if len(lst) % 2 != 0:  # odd length list
        med_idx = len(lst) // 2
        # leave the median out of both halves so they are the same size
        # (one common convention; others include it in both halves)
        low_subset = sorted_lst[:med_idx]
        high_subset = sorted_lst[med_idx + 1:]
    else:  # even length
        i1 = len(lst) // 2 - 1
        i2 = i1 + 1

        low_subset = sorted_lst[:i1 + 1]
        high_subset = sorted_lst[i2:]
    q1 = np.median(low_subset)
    q3 = np.median(high_subset)
    
    return min(sorted_lst), q1, med, q3, max(sorted_lst)
```
-----
##### Or use NumPy's `percentile` function
```python
from numpy import percentile

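def five_number_summary(lst):
    # percentile() interpolates linearly between data points by default,
    # so q1 and q3 can differ from the median-of-halves approach
    q1, median_, q3 = percentile(lst, [25, 50, 75])
    min_, max_ = min(lst), max(lst)

    return min_, q1, median_, q3, max_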
```
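
The two approaches to quartiles do not always agree. The sketch below compares them using only the standard library. `quartiles_by_halves` is a hypothetical helper written for this note, and `statistics.quantiles(..., method="inclusive")` is used as a stand-in that should correspond to NumPy's default linear interpolation.

```python
from statistics import median, quantiles

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves.

    For odd-length input the overall median is left out of both halves
    (one common convention among several).
    """
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return median(lower), median(upper)

data = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical data

print(quartiles_by_halves(data))  # (2.5, 6.5)

q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q1, q3)                     # 2.75 6.25
```

Both results are legitimate quartiles; the difference comes purely from the convention used, so it is worth stating which one a summary relies on.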