Resample
========

In [1]:
import xarray
import climtas
import dask.array
import pandas
import numpy

Say we have a year of hourly input data that we want to convert to daily data:

In [2]:
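# Hourly timestamps for 2001; the end point 2002-01-01 is excluded.
# Note: pandas >= 1.4 spells this option inclusive='left' (closed= was removed in pandas 2.0).
time = pandas.date_range('20010101', '20020101', freq='H', closed='left')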

data = dask.array.random.random((len(time),50,100), chunks=(24*60,25,25))
lat = numpy.linspace(-90, 90, data.shape[1])
lon = numpy.linspace(-180, 180, data.shape[2], endpoint=False)

da = xarray.DataArray(data, coords=[('time', time), ('lat', lat), ('lon', lon)], name='temperature')
da

|       | Array           | Chunk          |
|-------|-----------------|----------------|
| Bytes | 334.17 MiB      | 6.87 MiB       |
| Shape | (8760, 50, 100) | (1440, 25, 25) |
| Count | 56 Tasks        | 56 Chunks      |
| Type  | float64         | numpy.ndarray  |


The standard Xarray approach is [xarray.DataArray.resample](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.resample.html). It is an expensive function to run, however: we started with 56 tasks and 56 chunks in the Dask graph, and after resampling this has exploded to 11,736 tasks and 2,920 chunks. For a large dataset an increase in chunk count of this size really bogs down Dask.

The reason is that with `resample` Xarray creates a new output chunk for each individual day. You can see that the chunk size of the output is now `(1, 25, 25)`.

In [3]:
da.resample(time='D').mean()

|       | Array          | Chunk       |
|-------|----------------|-------------|
| Bytes | 13.92 MiB      | 4.88 kiB    |
| Shape | (365, 50, 100) | (1, 25, 25) |
| Count | 11736 Tasks    | 2920 Chunks |
| Type  | float64        | numpy.ndarray |
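
The chunk counts reported for the input and for the `resample` output can be reproduced by hand from the shapes and chunk sizes. The sketch below uses only the numbers shown in the tables; `n_chunks` is a hypothetical helper written for this illustration, not part of Dask or Xarray.

```python
import math

# Shape and chunk sizes of the hourly input array
shape = (8760, 50, 100)
chunks = (1440, 25, 25)


def n_chunks(shape, chunks):
    """Number of chunks along each axis, counting a partial last chunk."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


# Input: 7 chunks in time, 2 in lat, 4 in lon
input_grid = n_chunks(shape, chunks)
input_total = math.prod(input_grid)

# After resample: one chunk per day in time, lat/lon chunking unchanged
days = shape[0] // 24
resampled_grid = (days,) + input_grid[1:]
resampled_total = math.prod(resampled_grid)

print(input_grid, input_total)
print(resampled_grid, resampled_total)
```

The time axis goes from 7 chunks to 365 chunks, one per day, while the latitude and longitude chunking stays the same.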


A better way to do this is to use [xarray.DataArray.coarsen](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.coarsen.html). This keeps the original number of chunks, but has the drawback that it is not aware of the time axis: you need to specify the window as a sample count, here 24 hourly samples per day. It also won't complain if the time axis is unevenly spaced, although for most well-behaved datasets this shouldn't be an issue.

In [4]:
da.coarsen(time=24).mean()

|       | Array          | Chunk         |
|-------|----------------|---------------|
| Bytes | 13.92 MiB      | 292.97 kiB    |
| Shape | (365, 50, 100) | (60, 25, 25)  |
| Count | 224 Tasks      | 56 Chunks     |
| Type  | float64        | numpy.ndarray |
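
Reducing a fixed number of consecutive samples is a local operation on each block of 24 time steps. Since the input chunks of 1440 hours hold exactly 60 whole days, each input chunk maps to one output chunk of 60 days. The idea can be illustrated with plain NumPy; this is a conceptual sketch on a small stand-in array, not how Xarray implements `coarsen` internally.

```python
import numpy

hours_per_day = 24
n_days = 3

# Small stand-in for the hourly array, with dimensions (time, lat, lon)
hourly = numpy.arange(n_days * hours_per_day * 2 * 2, dtype=float)
hourly = hourly.reshape(n_days * hours_per_day, 2, 2)

# Split the time axis into (day, hour of day), then average over hour of day
blocks = hourly.reshape(n_days, hours_per_day, *hourly.shape[1:])
daily = blocks.mean(axis=1)

# Reference result: average each day's slice explicitly
expected = numpy.stack(
    [hourly[d * hours_per_day:(d + 1) * hours_per_day].mean(axis=0)
     for d in range(n_days)]
)

print(daily.shape)
print(numpy.allclose(daily, expected))
```

Note that this only gives daily means if every block really covers one day, which is why an uneven time axis goes unnoticed.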


[climtas.blocked.blocked_resample](api/blocked.rst#climtas.blocked.blocked_resample) works the same way as `coarsen`, giving you the same number of chunks as you started with, but it is also aware of the time axis: it checks that the time axis is evenly spaced, and you can use Pandas time interval names instead of a sample count.

In [5]:
climtas.blocked_resample(da, time='D').mean()

|       | Array          | Chunk         |
|-------|----------------|---------------|
| Bytes | 13.92 MiB      | 292.97 kiB    |
| Shape | (365, 50, 100) | (60, 25, 25)  |
| Count | 112 Tasks      | 56 Chunks     |
| Type  | float64        | numpy.ndarray |
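
If you use `coarsen` directly, an evenness check similar in spirit to the one described for `blocked_resample` can be done by hand before resampling. The sketch below is an assumption about what such a check might look like; `is_evenly_spaced` is a hypothetical helper and is not taken from climtas.

```python
import numpy
import pandas


def is_evenly_spaced(time):
    """Return True if all consecutive timestamps are the same step apart."""
    steps = numpy.diff(numpy.asarray(time, dtype='datetime64[ns]'))
    if len(steps) == 0:
        # Zero or one timestamp: nothing to compare
        return True
    return bool((steps == steps[0]).all())


# Two days of hourly timestamps
time = pandas.date_range('2001-01-01', periods=48, freq=pandas.Timedelta(hours=1))

even = is_evenly_spaced(time)

# Drop one timestamp to create a two-hour gap
gappy = is_evenly_spaced(time.delete(10))

print(even, gappy)
```

With an Xarray object the same helper could be applied to `da.time.values` before calling `da.coarsen(time=24)`.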
