## MLDataset - Examples with Reshaping and Chaining Methods

`xarray_filters.MLDataset` is a subclass of `xarray.Dataset` with methods for reshaping the `Dataset`'s `DataArray`s from time series, rasters, or N-D arrays into a single 2-D `DataArray` for input to statistical models.  

The notebook works with the following methods of `MLDataset`s:

 * `MLDataset.to_features`: convert the `DataArray`s of an `MLDataset` to a single 2-D `DataArray` by calling `.ravel()` on each `DataArray` (i.e. each `DataArray` becomes a single column)
 * `MLDataset.from_features`: convert an `MLDataset` with a single 2-D `DataArray` back to the original separate time series, rasters, or N-D arrays
 * `MLDataset.chain`: call `MLDataset.pipe` (`xarray.Dataset.pipe`) repeatedly for a `Sequence` of items, each of which may be:
   * a callable
   * a `Sequence` composed of a callable followed by positional arguments and keyword arguments
   * a `Sequence` of a string (assumed to be a `DataArray` method, e.g. `DataArray.quantile`) followed by positional arguments and keyword arguments
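
The dispatch over these three item forms can be sketched in plain Python. This is a minimal sketch, not the `MLDataset.chain` implementation: `chain_sketch` is a hypothetical name, a trailing `dict` in a step is assumed to hold the keyword arguments (the form used by the `chain` call at the end of this notebook), and a string step is looked up as a method on the object being passed along rather than on each `DataArray`.

```python
from collections.abc import Sequence

import numpy as np


def chain_sketch(obj, steps):
    """Apply each step to obj in order, feeding each result to the next step."""
    for step in steps:
        if callable(step):
            # Form 1: a bare callable.
            obj = step(obj)
            continue
        if isinstance(step, str) or not isinstance(step, Sequence) or not step:
            raise TypeError('unsupported step: {!r}'.format(step))
        func, *rest = step
        # Assumption: a trailing dict carries the keyword arguments.
        kwargs = rest.pop() if rest and isinstance(rest[-1], dict) else {}
        if isinstance(func, str):
            # Form 3 (simplified): a method name looked up on the object itself.
            obj = getattr(obj, func)(*rest, **kwargs)
        else:
            # Form 2: a callable followed by its arguments.
            obj = func(obj, *rest, **kwargs)
    return obj


arr = np.array([[-1.0, 3.0], [2.0, -4.0]])
result = chain_sketch(arr, [np.abs,                    # bare callable
                            (np.clip, 0, 2),           # callable + positional args
                            ('sum', dict(axis=0))])    # method name + keyword args
```

Each step receives the output of the previous one, which is the same data flow as nesting `.pipe` calls.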

In [1]:
import numpy as np
import xarray as xr
from xarray_filters import *

The following cell imports a function to create example `xarray_filters.MLDataset` objects.

In [2]:
from xarray_filters.tests.test_data import new_test_dataset

In [3]:
X = new_test_dataset(layers=('temperature', 'pressure', 'wind_x', 'wind_y'))

In [4]:
X

<xarray.MLDataset>
Dimensions:      (t: 48, x: 20, y: 15, z: 8)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y            (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  * z            (z) int64 0 1 2 3 4 5 6 7
  * t            (t) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    temperature  (x, y, z, t) float64 0.7834 0.01429 0.09623 0.3987 0.796 ...
    pressure     (x, y, z, t) float64 0.1834 0.3439 0.4344 0.06264 0.8126 ...
    wind_x       (x, y, z, t) float64 0.7015 0.4172 0.9833 0.07133 0.5087 ...
    wind_y       (x, y, z, t) float64 0.658 0.7856 0.9179 0.3296 0.2921 ...

Methods of `MLDataset` that are not methods of `xarray.Dataset`:

In [5]:
set(dir(MLDataset)) - set(dir(xr.Dataset))

{'chain', 'concat_ml_features', 'from_features', 'has_features', 'to_features'}

`MLDataset` works like an `xarray.Dataset`.  The following cell reduces the 4-D arrays to 2-D arrays by taking the mean over the `z` and `t` dimensions.

In [6]:
X_means_raster = X.mean(dim=('z', 't'))
X_means_raster

<xarray.MLDataset>
Dimensions:      (x: 20, y: 15)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y            (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Data variables:
    temperature  (x, y) float64 0.4747 0.4868 0.5182 0.472 0.4955 0.515 ...
    pressure     (x, y) float64 0.4853 0.4908 0.4898 0.502 0.5125 0.4918 ...
    wind_x       (x, y) float64 0.524 0.4942 0.4913 0.4994 0.4807 0.5008 ...
    wind_y       (x, y) float64 0.5168 0.5054 0.4896 0.4873 0.4955 0.4959 ...

### `xarray_filters.MLDataset.to_features`
`to_features()` below converts each 4-D array of `X` to a column and concatenates the columns into a single `DataArray`.  See optional arguments to `to_features` for controlling the name of the 2-D `DataArray` (`features_layer='features'` by default) and the row and column dimension names (`(space, layer)` by default).  

In [7]:
f = X.to_features()
f

<xarray.MLDataset>
Dimensions:   (layer: 4, space: 115200)
Coordinates:
  * space     (space) MultiIndex
  - x         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - z         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - t         (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * layer     (layer) object 'temperature' 'pressure' 'wind_x' 'wind_y'
Data variables:
    features  (space, layer) float64 0.7834 0.1834 0.7015 0.658 0.01429 ...
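
The reshaping itself can be pictured with NumPy alone. The following is a minimal sketch of the idea (one raveled array per column), using random stand-in data with the same shape as `X`; it leaves out the coordinates and naming that `to_features` adds.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (20, 15, 8, 48)  # (x, y, z, t), the same shape as the layers of X
names = ['temperature', 'pressure', 'wind_x', 'wind_y']
layers = {name: rng.random(shape) for name in names}

# Flatten each 4-D array to 1-D, then place the results side by side as columns.
features = np.column_stack([layers[name].ravel() for name in names])

print(features.shape)  # (115200, 4)
```

Row `i` of `features` holds the values of all four layers at the same `(x, y, z, t)` position.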

Note that the `features` `DataArray` preserves the original coordinates of the 4-D arrays by creating a `pandas.MultiIndex`.  This allows calling `from_features` later to reshape into the 4-D array shapes, even if rows from the `features` `DataArray` are dropped, e.g. as in the case of dropping rows with `NaN` values.
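
Why keeping the coordinates makes the round trip possible after rows are dropped can be shown with a small NumPy sketch. It is not the `from_features` implementation; in particular, filling the dropped positions with `NaN` is an assumption made here for illustration.

```python
import numpy as np

shape = (2, 3, 4)  # small stand-in for (x, y, z)
arr = np.arange(np.prod(shape), dtype=float).reshape(shape)
arr[0, 1, 2] = np.nan  # one missing value

# One coordinate tuple per flattened row, in row-major order.
# This array plays the role of the MultiIndex.
index = np.column_stack([ix.ravel() for ix in np.indices(shape)])
column = arr.ravel()

# Drop the row holding NaN, keeping coordinates and values aligned.
keep = ~np.isnan(column)
index_kept, column_kept = index[keep], column[keep]

# Rebuild the full-shape array: start from NaN and place each kept value
# at the position recorded in its coordinate tuple.
restored = np.full(shape, np.nan)
restored[tuple(index_kept.T)] = column_kept

round_trip_ok = np.allclose(restored, arr, equal_nan=True)
```

Without the stored coordinates, the 23 remaining values could not be reshaped to `(2, 3, 4)` at all.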

In [8]:
f.space

<xarray.DataArray 'space' (space: 115200)>
array([(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 2), ..., (19, 14, 7, 45),
       (19, 14, 7, 46), (19, 14, 7, 47)], dtype=object)
Coordinates:
  * space    (space) MultiIndex
  - x        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - z        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - t        (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...

The columns of the `features` `DataArray` are named by the `layer` that was flattened from 4-D to a 1-D column.  Usage of `OrderedDict` throughout `MLDataset` internals ensures that the `layers` (`DataArray`s) always iterate into the same column order.

In [9]:
f.layer

<xarray.DataArray 'layer' (layer: 4)>
array(['temperature', 'pressure', 'wind_x', 'wind_y'], dtype=object)
Coordinates:
  * layer    (layer) object 'temperature' 'pressure' 'wind_x' 'wind_y'

Showing the first few `(x, y, z, t)` coordinates of the `pandas.MultiIndex` `space`:

In [10]:
f.space.indexes['space'].tolist()[:4]

[(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 2), (0, 0, 0, 3)]

In [11]:
f.space.indexes['space'].names

FrozenList(['x', 'y', 'z', 't'])

It is also possible to transpose the `layers` before calling `.ravel()` on each one, using the `trans_dims` keyword of `to_features()`:

In [12]:
example2 = X.mean(dim='x').to_features(trans_dims=('t', 'z', 'y'))
example2

<xarray.MLDataset>
Dimensions:   (layer: 4, space: 5760)
Coordinates:
  * space     (space) MultiIndex
  - t         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - z         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 ...
  - y         (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 1 2 3 4 5 6 ...
  * layer     (layer) object 'temperature' 'pressure' 'wind_x' 'wind_y'
Data variables:
    features  (space, layer) float64 0.5368 0.3778 0.5543 0.5802 0.4907 ...
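
The effect of transposing before raveling can be seen on a tiny NumPy array: the dimension listed last varies fastest along the flattened axis, as `y` does in the `space` index above. A minimal sketch, independent of `MLDataset`:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)  # dims (y, z): y has 2 entries, z has 3

# Ravel in the stored order: the last dimension (z) varies fastest.
default_order = arr.ravel()

# Transpose to (z, y) first: now y varies fastest.
transposed_order = arr.transpose(1, 0).ravel()

print(default_order)     # [0 1 2 3 4 5]
print(transposed_order)  # [0 3 1 4 2 5]
```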

### `data_vars_func` decorator
The `data_vars_func` decorator allows writing a function that takes named `layers` as keywords or positional arguments.  In the example below, it is assumed that the decorated `magnitude` function will be passed to `X.chain` in situations where `X` has `layers` named `wind_x`, `wind_y`.  All other `data_vars` keys/values are passed as `other_data_vars` keyword arguments.

In [13]:
@data_vars_func
def magnitude(wind_x, wind_y, **other_data_vars):
    a2 = wind_x ** 2
    b2 = wind_y ** 2
    mag = (a2 + b2) ** 0.5
    return dict(magnitude=mag)
X.chain(magnitude, layers=['wind_x', 'wind_y']).to_features(features_layer='magnitude')

<xarray.MLDataset>
Dimensions:    (layer: 1, space: 115200)
Coordinates:
  * space      (space) MultiIndex
  - x          (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y          (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - z          (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - t          (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * layer      (layer) object 'magnitude'
Data variables:
    magnitude  (space, layer) float64 0.9619 0.8895 1.345 0.3372 0.5866 ...
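
The argument passing described above can be imitated with a few lines of plain Python. This is a rough sketch of the idea only: `named_layers` and `magnitude_sketch` are hypothetical names, and the real `data_vars_func` also handles `MLDataset` inputs and outputs, which this sketch does not.

```python
import functools
from collections import OrderedDict


def named_layers(func):
    """Hypothetical stand-in: unpack a mapping of layers into keyword arguments."""
    @functools.wraps(func)
    def wrapper(layers, **kwargs):
        # Keys matching parameter names bind to those parameters;
        # the remaining keys are collected by the function's ** parameter.
        return OrderedDict(func(**layers, **kwargs))
    return wrapper


@named_layers
def magnitude_sketch(wind_x, wind_y, **other_data_vars):
    # 'temperature' is not a named parameter, so it arrives in other_data_vars.
    assert 'temperature' in other_data_vars
    return {'magnitude': (wind_x ** 2 + wind_y ** 2) ** 0.5}


result = magnitude_sketch({'wind_x': 3.0, 'wind_y': 4.0, 'temperature': 280.0})
print(result['magnitude'])  # 5.0
```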

### `for_each_array` decorator
`for_each_array` automates calling a function that takes a `DataArray` argument and returns a `DataArray`, once for each `DataArray` (`layer`) in an `MLDataset`:

In [14]:
@for_each_array
def plus_one(arr, **kw):
    return arr + 1

@for_each_array
def minus_one(arr, **kw):
    return arr - 1


plus = X.chain(plus_one)
minus = X.chain(minus_one)

assert np.all(plus.wind_x - minus.wind_x == 2.)
assert np.all(plus.temperature - minus.temperature == 2.)
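
The pattern behind `for_each_array` is a map over the named layers. A minimal sketch with an `OrderedDict` in place of an `MLDataset` follows; `for_each_value` and `plus_one_sketch` are hypothetical names, and this is not the library's implementation.

```python
import functools
from collections import OrderedDict


def for_each_value(func):
    """Hypothetical stand-in: apply func to every layer, keeping names and order."""
    @functools.wraps(func)
    def wrapper(layers, **kwargs):
        return OrderedDict((name, func(arr, **kwargs))
                           for name, arr in layers.items())
    return wrapper


@for_each_value
def plus_one_sketch(arr, **kw):
    return arr + 1


layers = OrderedDict([('wind_x', 1.0), ('wind_y', 2.0)])
shifted = plus_one_sketch(layers)
print(list(shifted.items()))  # [('wind_x', 2.0), ('wind_y', 3.0)]
```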

In [15]:
@for_each_array
def transform_example(arr, **kw):
    up = arr.quantile(0.75, dim='z')
    low = arr.quantile(0.25, dim='z')
    median = arr.quantile(0.5, dim='z')
    return (arr - median) / (up - low)

X.chain(transform_example)

<xarray.MLDataset>
Dimensions:      (t: 48, x: 20, y: 15, z: 8)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y            (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  * z            (z) int64 0 1 2 3 4 5 6 7
  * t            (t) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
    quantile     float64 0.5
Data variables:
    temperature  (x, y, z, t) float64 1.161 -0.9847 -1.321 -0.6079 0.4967 ...
    pressure     (x, y, z, t) float64 -0.6144 0.04393 -0.0398 -0.3017 0.9494 ...
    wind_x       (x, y, z, t) float64 0.5866 -0.552 0.8789 -2.302 -0.2759 ...
    wind_y       (x, y, z, t) float64 0.2814 0.4859 0.7806 -0.9098 -0.2132 ...

In [16]:
@for_each_array
def agg_example(arr, **kw):
    return arr.mean(dim='t').quantile(0.25, dim='z')

aggregated = X.chain((transform_example, agg_example))

In [17]:
aggregated

<xarray.MLDataset>
Dimensions:      (x: 20, y: 15)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y            (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
    quantile     float64 0.25
Data variables:
    temperature  (x, y) float64 -0.06935 -0.08409 -0.1278 -0.03797 -0.09637 ...
    pressure     (x, y) float64 -0.04355 -0.01633 -0.02807 -0.05753 0.02403 ...
    wind_x       (x, y) float64 -0.05455 -0.01381 0.009585 -0.0417 -0.0842 ...
    wind_y       (x, y) float64 -0.1077 -0.04966 -0.06329 -0.0489 -0.08314 ...

Functions decorated with `data_vars_func` may return anything `dict`-like, an `MLDataset`, or an `xarray.Dataset`; the return value is converted to an `MLDataset`:

In [18]:
from collections import OrderedDict
@data_vars_func
def f(wind_x, wind_y, temperature, pressure):
    mag = (wind_x ** 2 + wind_y ** 2) ** 0.5
    return OrderedDict([('mag', mag), ('temperature', temperature), ('pressure', pressure)])

f(X)

<xarray.MLDataset>
Dimensions:      (t: 48, x: 20, y: 15, z: 8)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y            (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  * z            (z) int64 0 1 2 3 4 5 6 7
  * t            (t) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    mag          (x, y, z, t) float64 0.9619 0.8895 1.345 0.3372 0.5866 ...
    temperature  (x, y, z, t) float64 0.7834 0.01429 0.09623 0.3987 0.796 ...
    pressure     (x, y, z, t) float64 0.1834 0.3439 0.4344 0.06264 0.8126 ...

In [19]:
feat = f(X).to_features()
feat

<xarray.MLDataset>
Dimensions:   (layer: 3, space: 115200)
Coordinates:
  * space     (space) MultiIndex
  - x         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - z         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - t         (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * layer     (layer) object 'mag' 'temperature' 'pressure'
Data variables:
    features  (space, layer) float64 0.9619 0.7834 0.1834 0.8895 0.01429 ...

In [20]:
feat.features

<xarray.DataArray 'features' (space: 115200, layer: 3)>
array([[ 0.961853,  0.783353,  0.183358],
       [ 0.889498,  0.01429 ,  0.343931],
       [ 1.345139,  0.09623 ,  0.434426],
       ..., 
       [ 0.571369,  0.103416,  0.233513],
       [ 1.040656,  0.258218,  0.518165],
       [ 0.962929,  0.610584,  0.887188]])
Coordinates:
  * space    (space) MultiIndex
  - x        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - z        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - t        (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * layer    (layer) object 'mag' 'temperature' 'pressure'

In [21]:
feat.features.values

array([[ 0.96185279,  0.7833532 ,  0.18335759],
       [ 0.88949826,  0.01428978,  0.34393149],
       [ 1.34513925,  0.09622997,  0.4344256 ],
       ..., 
       [ 0.57136909,  0.10341648,  0.23351262],
       [ 1.04065553,  0.25821825,  0.51816505],
       [ 0.96292909,  0.61058403,  0.887188  ]])

### `xarray_filters.MLDataset.chain`

`.chain` can be called on an `MLDataset` to run callables in sequence, passing an `MLDataset` between steps.

In [22]:
@for_each_array
def agg_x(arr, **kw):
    return arr.mean(dim='x')

@for_each_array
def agg_y(arr, **kw):
    return arr.mean(dim='y')

@for_each_array
def agg_z(arr, **kw):
    return arr.mean(dim='z')


time_series = X.chain((agg_x, agg_y, agg_z))
time_series

<xarray.MLDataset>
Dimensions:      (t: 48)
Coordinates:
  * t            (t) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    temperature  (t) float64 0.4894 0.5027 0.5047 0.4985 0.495 0.505 0.4899 ...
    pressure     (t) float64 0.5004 0.5105 0.4937 0.505 0.4969 0.4914 0.5059 ...
    wind_x       (t) float64 0.507 0.4932 0.5038 0.4979 0.5018 0.5064 0.5015 ...
    wind_y       (t) float64 0.4997 0.5038 0.4837 0.5098 0.5007 0.5045 ...

In [23]:
time_series.to_features().features

<xarray.DataArray 'features' (t: 48, layer: 4)>
array([[ 0.489435,  0.500388,  0.506969,  0.499719],
       [ 0.502653,  0.510548,  0.493246,  0.503789],
       [ 0.504696,  0.493682,  0.503837,  0.483678],
       [ 0.498458,  0.504962,  0.497936,  0.509782],
       [ 0.495021,  0.49691 ,  0.501802,  0.500716],
       [ 0.50496 ,  0.491416,  0.506407,  0.504466],
       [ 0.489902,  0.505911,  0.501499,  0.490181],
       [ 0.491742,  0.503711,  0.499072,  0.496724],
       [ 0.499646,  0.503898,  0.495835,  0.495929],
       [ 0.500161,  0.497577,  0.492006,  0.497488],
       [ 0.501137,  0.50201 ,  0.497537,  0.501166],
       [ 0.504312,  0.503931,  0.505468,  0.49453 ],
       [ 0.502633,  0.499254,  0.496469,  0.496817],
       [ 0.495249,  0.492976,  0.501345,  0.500524],
       [ 0.497719,  0.492521,  0.502393,  0.507621],
       [ 0.505877,  0.497034,  0.499893,  0.496671],
       [ 0.499644,  0.499203,  0.497065,  0.503747],
       [ 0.510594,  0.495226,  0.497582,  0.509192]

In [24]:
np.all(time_series.to_features().from_features().temperature == time_series.temperature)

True

The next cell creates synthetic rasters in an `MLDataset`, similar to LANDSAT imagery with 8 spectral bands:

In [25]:
layers = ['band_{}'.format(idx) for idx in range(1, 9)]
shape = (200, 200)
rand_np_arr = lambda: np.random.normal(0, 1, shape)
coords = [('x', np.arange(shape[0])), ('y', np.arange(shape[1]))]
rand_data_arr = lambda: xr.DataArray(rand_np_arr(), coords=coords, dims=('x', 'y'))
data_vars = OrderedDict([(layer, rand_data_arr()) for layer in layers])
dset = MLDataset(data_vars)
dset

<xarray.MLDataset>
Dimensions:  (x: 200, y: 200)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    band_1   (x, y) float64 0.6134 -1.646 -0.3666 -1.608 0.08411 -1.401 ...
    band_2   (x, y) float64 1.617 -1.592 0.613 -0.4155 0.002528 -0.9831 ...
    band_3   (x, y) float64 0.65 -0.06426 -0.926 0.9073 -0.3103 1.595 0.807 ...
    band_4   (x, y) float64 0.2165 -0.6776 1.759 0.3044 0.3617 1.512 0.02728 ...
    band_5   (x, y) float64 0.539 -0.6671 0.1346 -1.231 -1.643 0.3514 0.4706 ...
    band_6   (x, y) float64 -0.2563 0.6217 -0.1454 -0.1024 -0.6704 -1.26 ...
    band_7   (x, y) float64 -0.7994 -0.1967 0.2357 0.2029 0.1077 1.316 ...
    band_8   (x, y) float64 1.03 1.131 1.54 1.579 0.4304 1.403 -1.142 -2.003 ...

The following cells chain callables decorated with `for_each_array` and `data_vars_func`; the example functions also show the variety of return types allowed in functions decorated by `data_vars_func`.

Note the `keep_arrays=True` keyword argument in the function signatures: it means that the original `layers` passed into the decorated functions will be part of the `MLDataset` outputs, even if the decorated functions do not return them.

In [26]:
from functools import partial
@for_each_array
def standardize(arr, dim=None, **kw):
    mean = arr.mean(dim=dim)
    std = arr.std(dim=dim)
    return (arr - mean) / std

@data_vars_func
def ndvi(band_5, band_4, keep_arrays=True):
    return OrderedDict([('ndvi', (band_5 - band_4) / (band_5 + band_4))])


@data_vars_func
def ndwi(band_3, band_5, keep_arrays=True, **kw):
    return {'ndwi': (band_3 - band_5) / (band_3 + band_5)}


@data_vars_func
def mndwi_36(band_3, band_6, keep_arrays=True):
    return xr.Dataset({'mndwi_36': (band_3 - band_6) / (band_3 + band_6)})


@data_vars_func
def mndwi_37(band_3, band_7, keep_arrays=True):
    return MLDataset(OrderedDict([('mndwi_37', (band_3 - band_7) / (band_3 + band_7))]))

normed_diffs = dset.chain((ndvi, ndwi, mndwi_36, mndwi_37))
standardized = dset.chain(partial(standardize, dim='x'))

In [27]:
normed_diffs

<xarray.MLDataset>
Dimensions:   (x: 200, y: 200)
Coordinates:
  * x         (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y         (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    mndwi_37  (x, y) float64 -9.704 -0.5075 1.683 0.6344 2.064 0.09571 ...
    mndwi_36  (x, y) float64 2.302 -1.231 0.7286 1.254 -0.3673 8.514 ...
    ndwi      (x, y) float64 0.09338 -0.8243 1.34 -6.597 -0.6823 0.6389 ...
    ndvi      (x, y) float64 0.4268 -0.007786 -0.8579 1.657 1.565 -0.6228 ...
    band_1    (x, y) float64 0.6134 -1.646 -0.3666 -1.608 0.08411 -1.401 ...
    band_2    (x, y) float64 1.617 -1.592 0.613 -0.4155 0.002528 -0.9831 ...
    band_3    (x, y) float64 0.65 -0.06426 -0.926 0.9073 -0.3103 1.595 0.807 ...
    band_4    (x, y) float64 0.2165 -0.6776 1.759 0.3044 0.3617 1.512 ...
    band_5    (x, y) float64 0.539 -0.6671 0.1346 -1.231 -1.643 0.3514 ...
    band_6    (x, y) float64 -0.2563 0.6217 -0.1454 -0.1024 -0.6704 -1.26
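
The layer order in `normed_diffs` above (newest layer first, original bands last) is consistent with each step placing its returned layers ahead of the inputs it keeps. A minimal sketch of that merging rule with plain dictionaries, assuming this is roughly what `keep_arrays=True` does; `with_kept_inputs` is a hypothetical helper, not part of `xarray_filters`:

```python
from collections import OrderedDict


def with_kept_inputs(inputs, outputs):
    """Returned layers first, then every input layer that was not overwritten."""
    merged = OrderedDict(outputs)
    for name, arr in inputs.items():
        merged.setdefault(name, arr)
    return merged


layers = OrderedDict([('band_3', 3.0), ('band_5', 1.0), ('band_6', 2.0)])

# First step adds a normalized difference of band_3 and band_5.
step1 = with_kept_inputs(
    layers, {'ndwi': (layers['band_3'] - layers['band_5'])
                     / (layers['band_3'] + layers['band_5'])})

# Second step adds another index and still sees the original bands.
step2 = with_kept_inputs(
    step1, {'mndwi_36': (step1['band_3'] - step1['band_6'])
                        / (step1['band_3'] + step1['band_6'])})

print(list(step2))  # ['mndwi_36', 'ndwi', 'band_3', 'band_5', 'band_6']
```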

In [28]:
standardized

<xarray.MLDataset>
Dimensions:  (x: 200, y: 200)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    band_1   (x, y) float64 0.5198 -1.781 -0.2804 -1.501 0.2069 -1.536 -1.08 ...
    band_2   (x, y) float64 1.636 -1.515 0.647 -0.259 -0.07859 -1.042 1.723 ...
    band_3   (x, y) float64 0.6481 0.01808 -0.82 0.9569 -0.3674 1.732 0.8329 ...
    band_4   (x, y) float64 0.2751 -0.7244 1.688 0.3105 0.373 1.473 -0.03129 ...
    band_5   (x, y) float64 0.5338 -0.7055 0.1548 -1.268 -1.476 0.2677 ...
    band_6   (x, y) float64 -0.1832 0.5361 -0.03936 -0.145 -0.5835 -1.34 ...
    band_7   (x, y) float64 -0.7773 -0.2383 0.2057 0.2167 0.03876 1.402 ...
    band_8   (x, y) float64 0.9814 1.22 1.467 1.646 0.3285 1.242 -1.166 ...

Merging two `MLDataset`s and converting the merged output to a features 2-D `DataArray`:

In [29]:
catted = normed_diffs.merge(standardized, overwrite_vars=standardized.data_vars.keys())
catted = catted.to_features()

In [30]:
catted.features

<xarray.DataArray 'features' (space: 40000, layer: 12)>
array([[ -9.704037e+00,   2.302364e+00,   9.337710e-02, ...,  -1.832430e-01,
         -7.772986e-01,   9.814206e-01],
       [ -5.075029e-01,  -1.230547e+00,  -8.242737e-01, ...,   5.361044e-01,
         -2.382748e-01,   1.219512e+00],
       [  1.683114e+00,   7.286442e-01,   1.340088e+00, ...,  -3.936262e-02,
          2.057063e-01,   1.466604e+00],
       ..., 
       [ -6.645271e-02,   5.968537e+00,   4.999184e-01, ...,  -6.088668e-01,
          1.129051e+00,  -9.138719e-03],
       [ -8.539964e-01,  -1.349632e+00,  -1.904447e+00, ...,  -1.471173e+00,
          2.768239e+00,   5.431708e-02],
       [  1.026418e+00,   9.528835e-01,   1.729390e-01, ...,   1.819941e-01,
          5.555479e-02,  -2.911281e-01]])
Coordinates:
  * space    (space) MultiIndex
  - x        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y        (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * layer    (l

In [31]:
catted.layer

<xarray.DataArray 'layer' (layer: 12)>
array(['mndwi_37', 'mndwi_36', 'ndwi', 'ndvi', 'band_1', 'band_2', 'band_3',
       'band_4', 'band_5', 'band_6', 'band_7', 'band_8'], dtype=object)
Coordinates:
  * layer    (layer) object 'mndwi_37' 'mndwi_36' 'ndwi' 'ndvi' 'band_1' ...

In [32]:
catted.from_features()

<xarray.MLDataset>
Dimensions:   (x: 200, y: 200)
Coordinates:
  * x         (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y         (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    mndwi_37  (x, y) float64 -9.704 -0.5075 1.683 0.6344 2.064 0.09571 ...
    mndwi_36  (x, y) float64 2.302 -1.231 0.7286 1.254 -0.3673 8.514 ...
    ndwi      (x, y) float64 0.09338 -0.8243 1.34 -6.597 -0.6823 0.6389 ...
    ndvi      (x, y) float64 0.4268 -0.007786 -0.8579 1.657 1.565 -0.6228 ...
    band_1    (x, y) float64 0.5198 -1.781 -0.2804 -1.501 0.2069 -1.536 ...
    band_2    (x, y) float64 1.636 -1.515 0.647 -0.259 -0.07859 -1.042 1.723 ...
    band_3    (x, y) float64 0.6481 0.01808 -0.82 0.9569 -0.3674 1.732 ...
    band_4    (x, y) float64 0.2751 -0.7244 1.688 0.3105 0.373 1.473 ...
    band_5    (x, y) float64 0.5338 -0.7055 0.1548 -1.268 -1.476 0.2677 ...
    band_6    (x, y) float64 -0.1832 0.5361 -0.03936 -0.145 -0.5835 -1.34 ..

The following synthetic data example shows that the logic above works for any number of dimensions, e.g. the 6-D `DataArray`s below:

In [33]:
shp = (2, 3, 4, 5, 6, 7)
dims = ('a', 'b', 'c', 'd', 'e', 'f')
coords = OrderedDict([(dim, np.arange(s)) for s, dim in zip(shp, dims)])
dset = MLDataset(OrderedDict([('layer_{}'.format(idx), 
                               xr.DataArray(np.random.normal(0, 10, shp),
                                            coords=coords,
                                            dims=dims)) 
                              for idx in range(6)]))
dset

<xarray.MLDataset>
Dimensions:  (a: 2, b: 3, c: 4, d: 5, e: 6, f: 7)
Coordinates:
  * a        (a) int64 0 1
  * b        (b) int64 0 1 2
  * c        (c) int64 0 1 2 3
  * d        (d) int64 0 1 2 3 4
  * e        (e) int64 0 1 2 3 4 5
  * f        (f) int64 0 1 2 3 4 5 6
Data variables:
    layer_0  (a, b, c, d, e, f) float64 -15.05 7.549 6.477 12.35 -7.569 ...
    layer_1  (a, b, c, d, e, f) float64 2.809 -0.1019 -5.843 -12.14 12.3 ...
    layer_2  (a, b, c, d, e, f) float64 0.1602 -1.019 3.215 -21.31 5.844 ...
    layer_3  (a, b, c, d, e, f) float64 -6.403 -0.6729 5.94 -7.722 0.6715 ...
    layer_4  (a, b, c, d, e, f) float64 -2.999 3.359 0.1273 14.33 -3.326 ...
    layer_5  (a, b, c, d, e, f) float64 -23.13 13.51 -1.654 -12.14 -24.53 ...

In [34]:
dset.layer_0.shape

(2, 3, 4, 5, 6, 7)

With 6-D `DataArray`s, calling `to_features` creates a `pandas.MultiIndex` with 6 components:

In [35]:
dset.to_features()

<xarray.MLDataset>
Dimensions:   (layer: 6, space: 5040)
Coordinates:
  * space     (space) MultiIndex
  - a         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - b         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - c         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - d         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - e         (space) int64 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 ...
  - f         (space) int64 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 ...
  * layer     (layer) object 'layer_0' 'layer_1' 'layer_2' 'layer_3' ...
Data variables:
    features  (space, layer) float64 -15.05 2.809 0.1602 -6.403 -2.999 ...
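
The number of rows in `features` is the product of the dimension sizes, and the coordinate tuples follow row-major (C) order, where the last dimension varies fastest. Both can be checked with NumPy alone, as a small sketch independent of `MLDataset`:

```python
import math

import numpy as np

shp = (2, 3, 4, 5, 6, 7)  # sizes of dimensions (a, b, c, d, e, f)

# Each combination of coordinates becomes one row.
n_rows = math.prod(shp)

# Coordinates of flattened row 7: f has wrapped around once, so e is 1.
coords_of_row_7 = tuple(int(i) for i in np.unravel_index(7, shp))

print(n_rows)           # 5040
print(coords_of_row_7)  # (0, 0, 0, 0, 1, 0)
```

The same arithmetic gives 4 * 5 * 6 * 7 = 840 rows once `a` and `b` have been aggregated away.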

The following cells demonstrate that `MLDataset.chain` is equivalent to calling `.pipe` several times in sequence.

In [36]:
@for_each_array
def example_agg(arr, dim=None):
    return arr.std(dim=dim)

@data_vars_func
def layers_example_with_kw(**kw):
    new = OrderedDict([('new_layer_100', kw['layer_3'] + kw['layer_4'])])
    new.update(kw)
    return MLDataset(new)

@data_vars_func
def layers_example_named_args(layer_1, layer_2, new_layer_100):
    return MLDataset(OrderedDict([('final', new_layer_100 / (layer_1 + layer_2))]))


In [37]:
dset.pipe(example_agg, dim='a'
         ).pipe(example_agg, dim='b'
               ).pipe(layers_example_with_kw
                     ).pipe(layers_example_named_args).to_features()

<xarray.MLDataset>
Dimensions:   (layer: 1, space: 840)
Coordinates:
  * space     (space) MultiIndex
  - c         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - d         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - e         (space) int64 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 ...
  - f         (space) int64 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 ...
  * layer     (layer) object 'final'
Data variables:
    features  (space, layer) float64 2.567 3.789 0.2975 0.4185 1.367 1.571 ...

In [38]:
dset.chain([(example_agg, dict(dim='a')),
             (example_agg, dict(dim='b')),
             layers_example_with_kw,
             layers_example_named_args,
            ]).to_features()

<xarray.MLDataset>
Dimensions:   (layer: 1, space: 840)
Coordinates:
  * space     (space) MultiIndex
  - c         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - d         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - e         (space) int64 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 ...
  - f         (space) int64 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 ...
  * layer     (layer) object 'final'
Data variables:
    features  (space, layer) float64 2.567 3.789 0.2975 0.4185 1.367 1.571 ...