Percentile
==========

In [1]:
import xarray
import climtas
import dask.array
import pandas
import numpy

We have hourly input data for a year, and we want to calculate the 90th percentile along the time axis for each grid point.
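As a small in-memory illustration of the quantity being computed, the sketch below applies `numpy.percentile` to a toy array whose first axis plays the role of time. The array shape and values are made up for illustration and are unrelated to the dataset built in the next cell.

```python
import numpy

# Toy data: 5 time steps on a 2 x 3 grid (shape and values are illustrative only)
toy = numpy.arange(30, dtype=float).reshape(5, 2, 3)

# Reduce along axis 0 (the "time" axis), leaving one value per grid point
p90 = numpy.percentile(toy, 90, axis=0)

print(p90.shape)
```

This only works because the whole array fits in memory; the rest of this page deals with the case where the time axis is split across Dask chunks.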

In [2]:
# pandas 1.4 and later spell this option inclusive='left' (closed= was removed in pandas 2.0)
time = pandas.date_range('20010101', '20020101', freq='H', inclusive='left')

data = dask.array.random.random((len(time),50,100), chunks=(24*60,25,25))
lat = numpy.linspace(-90, 90, data.shape[1])
lon = numpy.linspace(-180, 180, data.shape[2], endpoint=False)

da = xarray.DataArray(data, coords=[('time', time), ('lat', lat), ('lon', lon)], name='temperature')
da

| | Array | Chunk |
| --- | --- | --- |
| Bytes | 334.17 MiB | 6.87 MiB |
| Shape | (8760, 50, 100) | (1440, 25, 25) |
| Count | 56 Tasks | 56 Chunks |
| Type | float64 | numpy.ndarray |


The Xarray way is to use [xarray.DataArray.quantile](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.quantile.html), but for a dataset chunked along the time axis this raises an error:

In [3]:
try:
    da.quantile(0.9, 'time')
except Exception as e:
    print('Error:', e)

Error: dimension time on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single dask array chunk along this dimension, i.e., ``.chunk(time: -1)``, or pass ``allow_rechunk=True`` in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.


To get this to work we must rechunk the data so that it is not chunked along the time axis. This is a very expensive operation for large datasets where variables are split into multiple files for different years.

In [4]:
# -1 asks for a single chunk spanning the whole time axis, as the error message suggests
da.chunk({'time': -1}).quantile(0.9, 'time')

| | Array | Chunk |
| --- | --- | --- |
| Bytes | 39.06 kiB | 4.88 kiB |
| Shape | (50, 100) | (25, 25) |
| Count | 112 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |
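The memory side of the rechunking cost can be estimated from the chunk shapes in this example. The sketch below is plain arithmetic, assuming 8-byte `float64` values; the helper name `chunk_mib` is hypothetical and not part of any library.

```python
BYTES_PER_VALUE = 8   # float64
MIB = 2 ** 20


def chunk_mib(shape):
    """Size in MiB of one chunk with the given (time, lat, lon) shape."""
    n_values = 1
    for size in shape:
        n_values *= size
    return n_values * BYTES_PER_VALUE / MIB


# Original chunking: 1440 time steps per chunk
original = chunk_mib((1440, 25, 25))
# After rechunking: all 8760 time steps in every chunk
rechunked = chunk_mib((8760, 25, 25))

print(f"{original:.2f} MiB per chunk -> {rechunked:.2f} MiB per chunk")
```

For this one-year toy dataset the per-chunk size grows by roughly a factor of six. With many years of data the single time chunk grows in proportion to the length of the record, in addition to the cost of reading every file to assemble each chunk.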


Dask has an approximate percentile operation that works on data chunked along time, but it only runs on one-dimensional data:

In [5]:
dask.array.percentile(da.data[:, 30, 60], 90)

| | Array | Chunk |
| --- | --- | --- |
| Bytes | 8 B | 8 B |
| Shape | (1,) | (1,) |
| Count | 71 Tasks | 1 Chunks |
| Type | float64 | numpy.ndarray |


[climtas.blocked.approx_percentile](api/blocked.rst#climtas.blocked.approx_percentile) extends the Dask parallel approximate percentile calculation to multi-dimensional datasets, so large datasets don't need to be rechunked.

Note that this is an approximation to the true percentile. Check that it behaves appropriately for your dataset by comparing `approx_percentile` and `numpy.percentile` on a subset of the data.
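A comparison of this kind might look like the sketch below. To keep it NumPy-only, `chunked_percentile` is a hypothetical stand-in for an approximate routine: it is not the algorithm used by Dask or climtas, only a simple chunk-then-merge scheme that shows why such results can differ from the exact value. In practice you would replace it with the `approx_percentile` result for the same subset.

```python
import numpy


def chunked_percentile(values, q, chunk_size):
    """Hypothetical stand-in for an approximate percentile.

    NOT the Dask or climtas algorithm. Each chunk is summarised by its own
    percentiles 0..100, the summaries are pooled, and the percentile of the
    pool is returned. Every chunk is weighted equally, which is a
    simplification when the last chunk is shorter than the others.
    """
    grid = numpy.arange(101)
    summaries = [
        numpy.percentile(values[start:start + chunk_size], grid)
        for start in range(0, len(values), chunk_size)
    ]
    return numpy.percentile(numpy.concatenate(summaries), q)


rng = numpy.random.default_rng(0)
subset = rng.random(8760)  # one grid point's worth of hourly values

exact = numpy.percentile(subset, 90)
approx = chunked_percentile(subset, 90, chunk_size=1440)

print(f"exact={exact:.4f} approx={approx:.4f} abs diff={abs(approx - exact):.4f}")
```

Whether the difference is acceptable depends on the application; data with heavy tails or strong trends along time may behave differently from the uniform random values used here.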

In [6]:
climtas.approx_percentile(da, 90, 'time')

| | Array | Chunk |
| --- | --- | --- |
| Bytes | 39.06 kiB | 4.88 kiB |
| Shape | (1, 50, 100) | (1, 25, 25) |
| Count | 176 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |
