<img src="images/Project_logos.png" width="500" height="300" align="center">

## Application: Statistics and Masked Arrays <a class="anchor" id="statistics"></a>

**Learning outcome:** By the end of this section, you will be able to apply statistical operations and masked arrays to real-world problems.

### Statistics

NumPy arrays support many common statistical calculations. For a list of common operations, see: https://docs.scipy.org/doc/numpy/reference/routines.statistics.html.

The simplest operations consist of calculating a single statistical value from an array of numbers -- such as a mean value, a variance or a minimum.

For example:

In [None]:
import numpy as np

a = np.arange(12).reshape((3, 4))
mean = np.mean(a)
print(a)
print(mean)
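Other single-value statistics work the same way. As a minimal sketch on the same array (the variable names are chosen for illustration only), the following computes a variance, a standard deviation, a minimum and a maximum:

```python
import numpy as np

a = np.arange(12).reshape((3, 4))

# Each call reduces the whole array to one number.
variance = np.var(a)   # population variance (ddof=0 by default)
std_dev = np.std(a)    # square root of the variance
minimum = np.min(a)
maximum = np.max(a)

print(variance, std_dev, minimum, maximum)
```

Note that `np.var` and `np.std` use the population formula by default; pass `ddof=1` if a sample estimate is wanted.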

Used without any further arguments, statistical functions simply reduce the whole array to a single value.  In practice, however, we very often want to calculate statistics over only *some* of the dimensions. The most common requirement is to calculate a statistic along a single array dimension, while leaving all the other dimensions intact.   This is referred to as "collapsing" or "reducing" the chosen dimension.

This is done by adding an "axis" keyword specifying the dimension to collapse:

In [None]:
print(np.mean(a, axis=1))
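To see what "collapsing" means for the result shape, it can help to compare the two possible axes of a 2D array. The sketch below reuses the `(3, 4)` array from above; the names `col_means` and `row_means` are illustrative:

```python
import numpy as np

a = np.arange(12).reshape((3, 4))

# axis=0 collapses the first dimension (length 3), leaving one value per column.
col_means = np.mean(a, axis=0)
# axis=1 collapses the second dimension (length 4), leaving one value per row.
row_means = np.mean(a, axis=1)

print(col_means.shape, col_means)  # (4,) [4. 5. 6. 7.]
print(row_means.shape, row_means)  # (3,) [1.5 5.5 9.5]
```

In each case the collapsed dimension disappears from the result's shape, while the other dimension is left intact.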

#### Exercise 

* What other similar statistical operations exist (see above link)?
* A mean value can also be calculated with `<array>.mean()`.  Is that the same thing?
* Create a 3D array (that could be considered to describe `[time, x, y]`) and find the mean over all `x` and `y` at each timestep.
* What shape does the result have?

### Masked Arrays

Real-world measurement processes often result in certain datapoint values being uncertain or simply "missing".  This is usually indicated by additional data quality information, stored alongside the data values.

In these cases we often need to make calculations that count only the valid datapoints.  NumPy provides a special "masked array" type for this type of calculation. Here's a link to the documentation for NumPy masked arrays: https://docs.scipy.org/doc/numpy-1.11.0/reference/maskedarray.generic.html#maskedarray-generic-constructing.

To construct a masked array we need some data and a mask. The data can be any kind of NumPy array, but the mask array must contain boolean-type values only (either `True` and `False` or `1` and `0`). Let's make each of these and combine them into a NumPy masked array:

In [None]:
data = np.arange(4)
mask = np.array([0, 0, 1, 0])
print('Data: {}'.format(data))
print('Mask: {}'.format(mask))
masked_data = np.ma.masked_array(data, mask)
print('Masked data: {}'.format(masked_data))

The mask is applied where the values in the mask array are **`True`**. Masked arrays are printed with a double-dash `--` denoting the locations in the array where the mask has been applied.

Statistics computed on a masked array ignore the masked points, so they can differ from the statistics of the underlying data:

In [None]:
print('Unmasked average: {}'.format(np.mean(data)))
print('Masked average: {}'.format(np.ma.mean(masked_data)))
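The difference arises because masked points are left out of both the sum and the count. A small sketch, rebuilding the same data and mask so that it runs on its own, makes this explicit:

```python
import numpy as np

data = np.arange(4)                      # [0 1 2 3]
mask = np.array([0, 0, 1, 0])            # the value 2 is masked
masked_data = np.ma.masked_array(data, mask)

# count() gives the number of unmasked points; sum() skips masked points.
n_valid = masked_data.count()
total = masked_data.sum()

print('Valid points: {}'.format(n_valid))
print('Sum of valid points: {}'.format(total))
print('Masked mean: {}'.format(total / n_valid))
print('Same as masked_data.mean(): {}'.format(masked_data.mean()))
```

Here the masked mean is the sum of the three valid values divided by three, rather than the sum of all four values divided by four.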

Note that most file formats represent missing data in a _different_ way, using a distinct "missing data" value appearing in the data. There is special support for converting between this type of representation and NumPy masked arrays. Every masked array has a `fill_value` property and a `filled()` method to fill the masked points with the fill value.
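As a rough illustration of this round trip, the sketch below assumes a hypothetical marker value of `-999.0` for missing points (real file formats define their own). It builds the mask by hand from a comparison, then uses `filled()` to get back to the plain representation:

```python
import numpy as np

MISSING = -999.0  # hypothetical "missing data" marker, chosen for illustration

# Data as it might arrive from a file, with the marker embedded in the values.
raw = np.array([1.5, MISSING, 3.0, MISSING, 4.5])

# Mask every point equal to the marker, and record the marker as the fill value.
masked = np.ma.masked_array(raw, mask=(raw == MISSING), fill_value=MISSING)
print(masked)             # masked points are printed as --
print(masked.fill_value)
print(masked.mean())      # computed from the three valid points only

# filled() returns a plain array with masked points replaced by the fill value.
restored = masked.filled()
print(restored)
```

Setting `fill_value` to the same marker means that `filled()` reproduces the original on-file representation.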

#### Exercise

  * Create a masked array from the numbers 0-11, where all the values less than 5 are masked.
  * Create a masked array of positive values, with a value of `-1.0` to represent missing points.
  * Look up the masked array creation documentation. What routines exist to produce masked arrays like the ones you've just created more efficiently?
  * Use `np.ma.filled()` to create a 'plain' (i.e. unmasked) array from a masked array.
  * How can you create a plain array from a masked array, but using a _different_ fill-value for masked points?
  * Try performing a mathematical operation between two masked arrays. What happens to the `fill_value` properties when you do so?

#### Statistics and Masked Arrays: Summary of key points
 * Most statistical functions are available in two different forms, as in `array.mean()` and also `np.mean(array)`,
   the choice being mostly a question of style.
 * Statistical operations operate over, and remove (or "collapse"), the array dimensions that they are applied to.
 * An `axis` keyword specifies which dimensions to operate over: this can be one, several, or all of them.
   * NOTE: not all operations permit operation over specifically selected dimensions.
 * Basic statistical operations such as `mean` are part of NumPy itself (the reference documentation linked above is simply hosted on the SciPy website). More specialised statistics are provided by the higher-level SciPy project.
 * Missing datapoints can be represented using "masked arrays".
   * These are useful for calculation, but usually require converting to another form for data storage.
