The `datasets` submodule of `xarray_filters` provides data simulation capabilities.

We wrap simulation functions from `scikit-learn` with our own code to return more flexible data structures.

The goal is to make it easier to generate data for testing `elm`.

In [None]:
import xarray_filters.datasets as ds

Note that at import time we are notified which functions from sklearn could not be converted. That is because we restrict ourselves to simulation functions from sklearn that:

- return a tuple `(X, y)` with a feature matrix `X` and a 1d vector of labels `y`;
- can be called with default values alone.

This restriction keeps the wrapping code simple. Generalizing the solution to lift these two requirements would be an unnecessary distraction at this stage.

Note:

- We can check that a function can be called with default values alone before we call the function. The warnings above are for the functions that fail that requirement.
- However, we cannot check that a function returns the features/labels pair `(X, y)` without calling it (at least not in Python). We will find those additional problematic functions in a [later section](#sec-okfuncs) (again, this can be fixed, but it is a distraction for now).
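
The first check in the note above can be done by inspecting a function's signature. The following is a minimal sketch of one way to do it, not the code used in `xarray_filters`; the helper `callable_with_defaults` and the two toy functions are hypothetical names introduced for illustration.

```python
import inspect


def callable_with_defaults(func):
    """Return True if every parameter of `func` can be omitted in a call."""
    for param in inspect.signature(func).parameters.values():
        # *args and **kwargs never need an explicit value.
        if param.kind in (inspect.Parameter.VAR_POSITIONAL,
                          inspect.Parameter.VAR_KEYWORD):
            continue
        # A parameter without a default must be supplied by the caller.
        if param.default is inspect.Parameter.empty:
            return False
    return True


# Hypothetical stand-ins for simulation functions.
def make_with_defaults(n_samples=100, n_features=2):
    return n_samples, n_features


def make_needs_argument(n_samples, n_features=2):
    return n_samples, n_features


print(callable_with_defaults(make_with_defaults))   # True
print(callable_with_defaults(make_needs_argument))  # False
```

Because only the signature is inspected, the function itself is never called, which is why this requirement can be verified at import time.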

## Showcasing the design and functionality

### A drop-in replacement of scikit-learn functionality

The `datasets` submodule was designed to provide drop-in replacements for the `sklearn.datasets.make_*` functions.

In [None]:
import sklearn.datasets as skd

In [None]:
skd.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0)  # sklearn function

In [None]:
ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                       astype='array')                                           # new args

In [None]:
help(ds.make_classification)

In [None]:
help(skd.make_classification)

### An extension of scikit-learn functionality

We also provide postprocessing functionality on top of the `scikit-learn` routines via additional keywords (such as `astype`, `dims`, and `shape` below).

In [None]:
ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                       astype='array')                                           # new args

We can also convert to `xarray.Dataset` (or other types, like `pandas.DataFrame`):

In [None]:
dst = ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                            astype='dataset')                                          # new args
dst

In [None]:
dst.y

In [None]:
dst = ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                             astype='dataset', dims=('horizontal','vertical'), shape=(4,5))  # new args
dst

In [None]:
dst.y

In [None]:
dst = ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                             astype='dataset', dims=('horizontal','vertical'), shape=(4,5),
                             coords=(list('abcd'), list('efghi')),
                             layers=['feat_{:d}'.format(n) for n in range(4)],
                             yname='LABEL', attrs={'metadata1': 'super important'})  
dst

In [None]:
dst.LABEL

## Which simulation functions can be used right now?
<a id='sec-okfuncs'></a>

In [None]:
ds_make_funcs = [f for f in dir(ds) if f.startswith('make_')]  # all make_* functions in xarray_filters/datasets.py

All of the functions above work with defaults only.

But some of them do not return a tuple `(X, y)` where X is a feature matrix and y is a 1d vector of labels.

We will find which ones now (see the `bad` list below).

In [None]:
good = []  # to store the make_* functions that return a features/labels pair (X, y)
bad = []   # to store the make_* functions that do _not_ return a features/labels pair (X, y)

for f in ds_make_funcs:
    try:
        simdata = getattr(ds, f)(astype='array')  # look up the function by name and call it with defaults
    except ValueError as e:
        print('ERROR: {}'.format(str(e)))
        bad.append(f)
    else:
        good.append(f)

We can see the problematic functions in the error messages above. They are also listed here:

In [None]:
bad

And here are the functions we can use without a problem with the current implementation.

Again, we can make it more general, but I'd recommend doing that after we pin down the whole API and tests for the functions that work with the simpler code.

In [None]:
good
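
The loop above relies on the wrapped functions raising `ValueError` when the sklearn result is not a features/labels pair. As a rough illustration of what such a check might look like, here is a sketch that assumes `X` should be 2d and `y` 1d with matching sample counts. The helper `check_xy_pair` is a hypothetical name and is not part of `xarray_filters`.

```python
import numpy as np


def check_xy_pair(result):
    """Raise ValueError unless `result` looks like a features/labels pair (X, y)."""
    if not (isinstance(result, tuple) and len(result) == 2):
        raise ValueError('expected a tuple (X, y), got {}'.format(type(result).__name__))
    X, y = (np.asarray(part) for part in result)
    if X.ndim != 2:
        raise ValueError('X must be 2d, got {}d'.format(X.ndim))
    if y.ndim != 1:
        raise ValueError('y must be 1d, got {}d'.format(y.ndim))
    if X.shape[0] != y.shape[0]:
        raise ValueError('X and y must have the same number of samples')
    return X, y


# A valid pair passes through unchanged.
X, y = check_xy_pair((np.zeros((5, 3)), np.arange(5)))
print(X.shape, y.shape)  # (5, 3) (5,)

# A bare array is rejected.
try:
    check_xy_pair(np.zeros((5, 3)))
except ValueError as e:
    print('ERROR: {}'.format(e))
```

This kind of check can only run on an actual return value, which is why the `bad` functions are found by calling each one rather than by inspecting signatures.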

## Implementation details

The central functionality here is implemented in the following two objects:

- The `NpXyTransformer` class that has multiple `to_*` methods (`to_dataset`, `to_dataframe`, `to_array`, etc.). Adding different postprocessing routines can be done by adding a new `NpXyTransformer.to_*` method with the appropriate code and documentation.
- A `_make_base` function that takes as input a `sklearn.datasets.make_*` function (like `make_classification`) and creates a new "version" of it under the `datasets` namespace, with a useful signature, docs, and extended functionality.

It's easier to see with an example. Let's construct the same data with the "direct" approach (using the keyword `astype` inside the `make_*` function) and the step-by-step approach (which is what the direct approach does under the hood).

In [None]:
X1, y1 = ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                                astype='array')                                           # new args    

In [None]:
Xyt = ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                             astype=None)
X2, y2 = Xyt.to_array()

In [None]:
import numpy as np
np.allclose(X1, X2)  # floating-point data

In [None]:
np.array_equal(y1, y2)  # integer data (np.alltrue was removed in NumPy 2.0)

In [None]:
help(ds.NpXyTransformer.astype)

In [None]:
ds.NpXyTransformer.astype??

In [None]:
help(ds.NpXyTransformer.to_array)

This design allows us to implement any data transformations we want by just creating new `to_*` methods under `NpXyTransformer`, while still enjoying:

- All the work (code and docs) done in sklearn.
- Argument checking and docs for each transformation in its own method, which is easier to inspect than `**kwargs` with lots of `if/else` checks.
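
To make the design described above concrete, here is a toy sketch of how an `astype` method could dispatch to `to_*` methods by name. It is an assumption about the general pattern, not the actual `NpXyTransformer` code; the class `XyTransformerSketch` and its `to_records` method are hypothetical.

```python
class XyTransformerSketch:
    """Toy stand-in for a transformer that holds a features/labels pair."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def to_array(self):
        """Return the pair unchanged."""
        return self.X, self.y

    def to_records(self, yname='y'):
        """Return one dict per sample (a made-up transformation for illustration)."""
        return [{'features': row, yname: label} for row, label in zip(self.X, self.y)]

    def astype(self, to_type, **kwargs):
        """Dispatch to the `to_<to_type>` method, forwarding its keyword arguments."""
        method = getattr(self, 'to_' + to_type, None)
        if method is None:
            valid = sorted(name[3:] for name in dir(self) if name.startswith('to_'))
            raise ValueError('astype must be one of {}'.format(valid))
        return method(**kwargs)


xyt = XyTransformerSketch(X=[[0.1, 0.2], [0.3, 0.4]], y=[0, 1])
print(xyt.astype('array'))
print(xyt.astype('records', yname='LABEL'))
```

In this sketch, supporting a new output type only requires adding one more `to_*` method; each method keeps its own signature and docstring, so `help()` on it stays informative.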

To recap, here is the full "low-level path" to creating a new `make_classification` function and using it.

In [None]:
my_classification = ds._make_base(skd.make_classification)
Xyt = my_classification(n_samples=20, n_features=4, n_classes=2, random_state=0, astype=None)
X, y = Xyt.to_array()
X, y

# same as
# ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0, astype='array')

In [None]:
help(my_classification)  # signature/docstring built automatically
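
One way a wrapper can expose an extended signature and docstring to `help()` is to set `__signature__` and `__doc__` on the wrapping function. The sketch below shows that general technique with the standard library only. It is not the `_make_base` implementation, and `add_astype_keyword` and `make_toy` are hypothetical names.

```python
import functools
import inspect


def add_astype_keyword(func):
    """Wrap `func` so it accepts an extra keyword-only `astype` argument."""

    @functools.wraps(func)
    def wrapper(*args, astype=None, **kwargs):
        result = func(*args, **kwargs)
        # Placeholder postprocessing: only tag the result with the requested type.
        return result if astype is None else (astype, result)

    # Advertise the original parameters plus the new keyword in help().
    # (Assumes `func` has no **kwargs parameter, which would have to come last.)
    sig = inspect.signature(func)
    extra = inspect.Parameter('astype', inspect.Parameter.KEYWORD_ONLY, default=None)
    wrapper.__signature__ = sig.replace(parameters=list(sig.parameters.values()) + [extra])
    wrapper.__doc__ = (func.__doc__ or '') + '\n\nExtra keyword: astype (postprocessing type).'
    return wrapper


def make_toy(n_samples=3, n_features=2):
    """Return a nested list of zeros (a hypothetical simulation function)."""
    return [[0.0] * n_features for _ in range(n_samples)]


make_toy_wrapped = add_astype_keyword(make_toy)
print(inspect.signature(make_toy_wrapped))  # (n_samples=3, n_features=2, *, astype=None)
```

Calling `help(make_toy_wrapped)` then shows the original parameters, the added `astype` keyword, and the extended docstring.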