# Summary Statistics

In this section, we will revise some of the basic statistics used to summarise data.

Here is a sample of (discrete) data:

In [None]:
import numpy as np
data = [0. , 1.5, 2. , 1. , 2. , 0.5, 1.5, 1. , 1.5, 1. , 1. , 1. , 3. ,
       2. , 1.5, 1. , 1.5, 1. , 0. , 1.5, 2. , 4. , 1.5, 2. , 0. , 2.5,
       1. , 2. , 1.5, 0.5, 2. , 1.5, 2.5, 1.5, 2.5, 1.5, 1.5, 3. , 0.5,
       2. , 1.5, 1. , 1. , 1.5, 1. , 3.5, 1. , 0.5, 2. , 1. , 1. , 1. ,
       1.5, 1.5, 2. , 0. , 3. , 2. , 1. , 1. , 0.5, 1. , 1. , 1.5, 1.5,
       1.5, 2.5, 1.5, 1.5, 2. , 1. , 1. , 2. , 1. , 1. , 1. , 0.5, 1.5,
       1.5, 2. , 0. , 0.5, 1. , 2. , 2. , 1. , 2. , 1.5, 1. , 3.5, 0.5,
       1. , 1.5, 1.5, 2. , 0.5, 1. , 2. , 0. , 3. , 0.5, 1.5, 1.5, 2. ,
       2. , 0.5, 3. , 3.5, 2. , 3.5, 1.5, 1.5, 1. , 2. , 0. , 1. , 2. ,
       1. , 1. , 0.5, 0.5, 2. , 3. , 3. , 1.5, 1. , 2. , 1. , 4. , 1.5,
       1.5, 0. , 0.5, 0.5, 2.5, 3. , 2. , 1. , 2.5, 2. , 1. , 2. , 2. ,
       0.5, 3. , 3. , 0.5, 0.5, 0.5, 1. , 0.5, 1.5, 2. , 1.5, 0.5, 0.5,
       1. , 2.5, 1. , 4. , 2. , 1.5, 1. , 0.5, 1. , 1. , 1.5, 0.5, 1.5,
       2. , 1.5, 1. , 1.5, 1.5, 1. , 1.5, 2. , 3. , 3. , 1. , 0. , 2. ,
       0.5, 2. , 1.5, 2. , 1. , 1.5, 1. , 1.5, 1. , 1. , 0.5, 1. , 1.5,
       2. , 1. , 0.5, 1.5, 2.5, 0. , 1. , 0.5, 4. , 0. , 4.5, 1.5, 2.5,
       1. , 1.5, 1. , 3. , 1. , 0.5, 2.5, 2. , 1.5, 2.5, 1.5, 0.5, 2. ,
       1.5, 2.5, 1. , 0. , 2. , 3. , 1.5, 1. , 1. , 2. , 0.5, 0.5, 2. ,
       1. , 1.5, 3. , 2.5, 3.5, 0.5, 0.5, 0.5, 4. , 1. , 1.5, 1. , 3.5,
       1.5, 1.5, 2.5, 2.5, 0.5, 1.5, 1. , 1. , 0.5, 1.5, 4. , 0.5, 2.5,
       2.5, 1.5, 1. , 1. , 2. , 1.5, 2. , 2. , 1.5, 1.5, 0.5, 1.5, 2. ,
       2. , 2.5, 1. , 2. , 2.5, 1.5, 1. , 1.5, 1.5, 1.5, 1.5, 1.5, 2. ,
       0.5, 1.5, 2. , 1. , 3.5, 0.5, 2.5, 1. , 1. , 1.5, 2.5, 0.5, 1. ,
       1.5, 0.5, 1. , 0.5, 0.5, 2. , 0.5, 1. , 1. , 1.5, 2. , 1. , 0.5,
       2. , 0.5, 2.5, 0.5, 2.5, 0.5, 1.5, 2. , 0.5, 0.5, 2. , 0.5, 0.5,
       3. , 1.5, 1.5, 0.5, 0.5, 2.5, 2. , 1.5, 1.5, 1.5, 1. , 1.5, 2. ,
       1. , 0.5, 1. , 0.5, 1.5, 3. , 1.5, 0. , 1.5, 2. , 1.5, 0.5, 2. ,
       1. , 0.5, 1.5, 1. , 1.5, 1. , 0.5, 1.5, 3. , 0.5, 2. , 1.5, 2. ,
       2.5, 2.5, 1.5, 1. , 1.5, 1. , 1.5, 0.5, 3. , 1. , 0.5, 1.5, 2. ,
       3. , 1. , 1. , 2. , 1. , 1. , 1.5, 1. , 2. , 1. , 2. , 0.5, 0.5,
       1. , 1. , 0. , 1. , 3. , 2. , 2. , 1. , 1. , 1. , 0.5, 1. , 2. ,
       1.5, 2. , 1. , 1. , 1. , 2. , 1.5, 1.5, 2. , 1. , 0.5, 2. , 1. ,
       0.5, 1. , 0.5, 2.5, 1.5, 1.5, 1. , 0.5, 1. , 1. , 1.5, 1. , 2. ,
       0.5, 2. , 3.5, 2. , 1. , 0. , 1.5, 2.5, 2. , 1. , 4.5, 1. , 1.5,
       3.5, 0.5, 1. , 1.5, 2. , 3. , 0.5, 2.5, 1. , 2. , 1.5, 1.5, 2.5,
       2.5, 0.5, 1.5, 2. , 1. , 1. , 1. , 1. , 2.5, 2. , 0.5, 2. , 1. ,
       1.5, 1. , 2. , 0.5, 1.5, 0.5, 3. , 0.5, 0.5, 0.5, 1. , 1. , 2.5,
       3.5, 1. , 1. , 1. , 0.5, 1. , 1.5, 1. , 3. , 2.5, 2.5, 1. , 1. ,
       1. , 1. , 0.5, 2.5, 1. , 1.5, 2.5, 3.5, 1.5, 1.5, 0.5, 0.5, 1. ,
       1. , 1. , 1.5, 1.5, 0.5, 1. , 1. , 2. , 1.5, 1.5, 3. , 1.5, 1. ,
       1. , 2. , 0. , 3. , 2.5, 2. , 0.5, 2.5, 1. , 3. , 1.5, 3. , 1.5,
       1.5, 0.5, 0. , 0.5, 0.5, 1. , 1.5, 0.5, 2. , 1.5, 1.5, 0.5, 2.5,
       1. , 4. , 2. , 1. , 1. , 0.5, 1. , 2. , 0.5, 0.5, 1. , 1.5, 3. ,
       2.5, 1.5, 1. , 2. , 1. , 0.5, 0.5, 1. , 2. , 3.5, 2.5, 1. , 2.5,
       0.5, 1. , 1.5, 0.5, 1.5, 0.5, 1. , 3. , 2. , 1. , 3. , 1.5, 2.5,
       1. , 0.5, 2. , 2.5, 1.5, 2.5, 1.5, 2.5, 2.5, 0. , 2.5, 1.5, 2.5,
       0.5, 0.5, 2.5, 0.5, 2.5, 1.5, 1.5, 0.5, 0.5, 2.5, 1. , 1.5, 3.5,
       1.5, 1. , 0.5, 1. , 3.5, 0.5, 0.5, 1. , 0. , 1.5, 0.5, 0.5, 1. ,
       1. , 0.5, 1. , 3.5, 0. , 1. , 1. , 2. , 3. , 0.5, 0. , 1.5, 2. ,
       2.5, 2. , 0.5, 2.5, 2.5, 1. , 0.5, 0.5, 1.5, 3. , 2.5, 1.5, 1. ,
       1.5, 1. , 1.5, 1.5, 1. , 1. , 3. , 2. , 0.5, 1.5, 1. , 2. , 1. ,
       0. , 1.5, 3. , 1. , 1.5, 2. , 0.5, 2. , 1. , 1. , 0. , 3. , 0.5,
       1.5, 1. , 1. , 1. , 1.5, 1. , 1.5, 0.5, 0.5, 1.5, 1. , 1.5, 1.5,
       1.5, 1.5, 2. , 1. , 0.5, 0.5, 2. , 1.5, 2.5, 2.5, 1.5, 2. , 1. ,
       2. , 2.5, 0.5, 1.5, 0.5, 2. , 1. , 2. , 1.5, 1.5, 3.5, 1. , 1. ,
       0.5, 1. , 3. , 2.5, 0. , 2.5, 1. , 1. , 4. , 2. , 2. , 1. , 3.5,
       0.5, 1. , 1.5, 1.5, 0.5, 0.5, 0.5, 0. , 1. , 2.5, 0. , 1. , 3. ,
       1.5, 1.5, 0. , 0.5, 2. , 1. , 1.5, 2.5, 2. , 2.5, 0. , 3. , 0.5,
       1.5, 1. , 2.5, 1. , 2.5, 2.5, 2. , 0.5, 1. , 0. , 2.5, 1. , 1. ,
       0.5, 2. , 1. , 2. , 1.5, 0.5, 1. , 1. , 1. , 2.5, 2. , 1.5, 2. ,
       0.5, 1. , 3. , 1. , 2. , 2.5, 1. , 2.5, 1.5, 2.5, 3. , 2. , 0. ,
       2.5, 1. , 1. , 0.5, 1.5, 2.5, 1. , 3.5, 1.5, 1. , 3. , 2. , 0.5,
       0.5, 1.5, 2. , 2.5, 1.5, 1.5, 0.5, 1. , 2. , 2. , 0. , 3. , 1.5,
       1. , 1.5, 3. , 3. , 1. , 1. , 0.5, 0.5, 0.5, 2. , 1. , 1.5, 0. ,
       3. , 1.5, 1.5, 1.5, 0.5, 0.5, 0.5, 1.5, 1.5, 1. , 1. , 2.5, 1.5,
       0.5, 1.5, 2. , 1.5, 0. , 1. , 1. , 1. , 2.5, 2.5, 1.5, 1.5, 1.5,
       0.5, 1.5, 2.5, 0.5, 3. , 2.5, 1.5, 1.5, 0.5, 2. , 1. , 3. , 1. ,
       0.5, 1.5, 1.5, 2.5, 1.5, 2.5, 0.5, 3. , 1. , 2. , 1.5, 1.5, 1. ,
       1. , 1.5, 2. , 1. , 0.5, 2.5, 2. , 1.5, 1.5, 1. , 0.5, 1. , 0.5,
       0.5, 2.5, 1. , 2. , 2.5, 1.5, 2. , 0.5, 2. , 2. , 0. , 2. , 2.5,
       1.5, 1. , 1. , 1. , 2.5, 1. , 3. , 1.5, 1.5, 2. , 1.5, 0. , 1.5,
       0. , 1. , 2.5, 2. , 2. , 1. , 2.5, 2.5, 3.5, 3. , 0.5, 2. , 1.5,
       2. , 2.5, 2. , 1.5, 2.5, 0. , 1. , 0.5, 1. , 1.5, 0.5, 2. , 2. ,
       2.5, 0.5, 1. , 1.5, 2.5, 1.5, 1. , 2. , 0.5, 1. , 0.5, 3. , 3. ,
       1.5, 2.5, 3. , 0. , 2. , 0.5, 0.5, 0.5, 2.5, 1. , 1.5, 2. , 1.5,
       0.5, 2. , 1.5, 1. , 1.5, 0.5, 2. , 1.5, 1. , 1.5, 1.5, 1. , 3. ,
       3. , 0.5, 2.5, 2.5, 0. , 1.5, 1. , 2. , 1. , 0.5, 1. , 1. ]

In [None]:
import matplotlib.pyplot as plt
plt.hist(data)
plt.show()

---

## Central Tendency

The three measures below (mean, median and mode) all aim to describe the "average" value of the data distribution. Each one is relevant under different circumstances.

### Arithmetic mean
*(quantitative data)*

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i} ,
$$
where $x_1, \dots, x_n$ are the values in the sample and $n$ is the sample size.

NB The mean can be a real (decimal) number even if the data themselves are discrete.
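
As a rough illustration of the formula, here is a minimal sketch in plain Python. The function name `mean` and the toy `sample` list are assumptions made for this example (they are not part of the data set above); the sample contains only whole numbers, yet its mean is a decimal.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


# Hypothetical toy sample of whole numbers (not the data set above)
sample = [1, 2, 2, 4]
print(mean(sample))  # 2.25
```

The same function should work unchanged on the `data` list defined earlier.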

**Exercise**: using your knowledge of basic Python (i.e. *without* additional libraries), find the mean of the data supplied.

Of course, it will be helpful to have some pre-built statistics functions :)

* [numpy stats functions](https://numpy.org/doc/stable/reference/routines.statistics.html)
* [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html)

**Exercise**: repeat the above exercise using `numpy.mean()`

You could mark the mean on the histogram like so:

In [None]:
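xbar = np.mean(data)

# plt.hist returns the bin counts, so the marker height can follow the tallest bar
counts, _, _ = plt.hist(data)
plt.vlines(xbar, 0, counts.max(), colors='blue')  # vertical line at the mean
plt.text(xbar + 0.1, counts.max(), f'mean={xbar:.2f}', color='blue', fontsize='large')
plt.show()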

### Median
*(quantitative data)*

The median is the middle observation: half of the data lie below it and half above.

To find it, order the data and select the middle value (if the sample size *n* is even, take the midpoint of the middle two values).

**Exercise**: using only basic Python, write a function `median()` to find the median value of a list of numbers. *Hint*: you will need to use `sorted()`.

**Exercise**: try your function on the data provided above and compare to the result from `numpy.median()`
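
One possible sketch of the procedure described above, using only built-in Python. It is one way to approach the exercise rather than the definitive answer, and the toy inputs are assumptions made for illustration.

```python
def median(values):
    """Middle value of the sorted data; midpoint of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median() requires at least one value")
    mid = n // 2
    if n % 2 == 1:
        # Odd sample size: a single middle observation exists
        return ordered[mid]
    # Even sample size: average the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```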

### Mode
*(categorical OR discrete data)*

The mode is the value that occurs most frequently in the sample.

**Exercise**: using only basic Python, write a function `mode()` to find the most frequent value in a list of values.

**Exercise**: try your function on the data provided and compare to the result from `scipy.stats.mode()` 
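
A minimal sketch of one way to count occurrences with a plain dictionary. Returning the smallest value when several are tied is a choice made for this sketch; library functions may resolve ties differently.

```python
def mode(values):
    """Most frequent value; ties are broken by taking the smallest tied value."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    if not counts:
        raise ValueError("mode() requires at least one value")
    highest = max(counts.values())
    # Assumption for this sketch: among equally frequent values, return the smallest
    return min(v for v, c in counts.items() if c == highest)


print(mode([1, 2, 2, 3, 3]))   # 2
print(mode(["a", "b", "b"]))   # b
```

Because it only counts equal values, the same function applies to categorical data such as strings as well as to discrete numbers.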

---

## Dispersion
Dispersion measures aim to describe the *variability* of quantitative data, i.e. the degree to which the values are spread out around the "average".

### Variance
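$$
s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$
where $\bar{x}$ is the arithmetic mean defined above. This form divides by $n$, which matches the default behaviour of `np.var()` (`ddof=0`); the unbiased sample variance divides by $n - 1$ instead (`ddof=1`).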

**Exercise**: using only basic Python, write a function `var()` to calculate the variance of a list of numbers.

**Exercise**: try your function on the data provided and compare to the result from `np.var()` 
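
A minimal sketch of the formula above in plain Python, dividing by $n$ as in the equation (this is also the default in `np.var()`). The toy `sample` is an assumption made for illustration.

```python
def var(values):
    """Variance with a 1/n divisor: mean squared deviation from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("var() requires at least one value")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / n


# Hypothetical toy sample: mean is 2.5, squared deviations sum to 5
sample = [1, 2, 3, 4]
print(var(sample))  # 1.25
```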

### Standard Deviation
The standard deviation, $s$, is simply the square root of the variance, $s^2$. You can use `numpy.std()`, which by default divides by $n$ in the same way as `numpy.var()`:

In [None]:
np.std(data)
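
For comparison, here is a hedged sketch of the same calculation without NumPy: compute the variance with a $1/n$ divisor, then take its square root. The function name `std` and the toy `sample` are assumptions made for this example.

```python
import math


def std(values):
    """Standard deviation: square root of the 1/n variance."""
    n = len(values)
    if n == 0:
        raise ValueError("std() requires at least one value")
    xbar = sum(values) / n
    variance = sum((x - xbar) ** 2 for x in values) / n
    return math.sqrt(variance)


# Hypothetical toy sample: mean is 5 and variance is 4
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(std(sample))  # 2.0
```

Applied to `data`, the result should agree with `np.std(data)` up to floating-point rounding.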

---