The `datasets` submodule of `xarray_filters` provides data simulation capabilities.

We wrap simulation functions from `scikit-learn` with our own code to return more flexible data structures.

The goal is to make it easier to generate data for testing `elm`.

In [1]:
import datasets as ds



Note that at import time we are notified of the sklearn functions that could not be converted. This happens because we restrict ourselves to sklearn simulation functions that:

- return a tuple `(X, y)` with a feature matrix `X` and a 1d vector of labels `y`;
- can be called with default values alone.

These restrictions keep the code simple. Generalizing the solution to lift them would be an unnecessary distraction at this stage.

Note:

- We can check that a function can be called with default values alone before we call it. The warnings above are for the functions that fail that requirement.
- However, we cannot check that a function returns the features/labels pair `(X, y)` without calling it (Python gives us no way to know the return value in advance). We will find those additional problematic functions in a [later section](#sec-okfuncs) (again, this can be fixed, but it is a distraction for now).
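
The first check in the note can be done by inspecting a function's signature without calling it. Here is a minimal sketch using the standard library's `inspect.signature`. The helper name `callable_with_defaults` and the two toy functions are hypothetical, and the actual check inside `datasets` may be written differently.

```python
import inspect


def callable_with_defaults(func):
    """Return True if func can be called with no arguments at all."""
    for param in inspect.signature(func).parameters.values():
        # *args and **kwargs never require a value from the caller.
        if param.kind in (inspect.Parameter.VAR_POSITIONAL,
                          inspect.Parameter.VAR_KEYWORD):
            continue
        # Any other parameter without a default must be passed explicitly.
        if param.default is inspect.Parameter.empty:
            return False
    return True


def with_defaults(n_samples=100, random_state=None):
    """Toy function: every parameter has a default."""
    return n_samples


def needs_argument(n_samples, random_state=None):
    """Toy function: n_samples has no default."""
    return n_samples


print(callable_with_defaults(with_defaults))   # True
print(callable_with_defaults(needs_argument))  # False
```

The second check (the shape of the return value) has no equivalent of this kind, which is why it is deferred to the later section.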

## Showcasing the design and functionality

### A drop-in replacement of scikit-learn functionality

The `datasets` library was designed to provide drop-in replacements for the `sklearn.datasets.make_*` functions. 

In [2]:
import sklearn.datasets as skd

In [3]:
skd.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0)  # sklearn function

(array([[-1.53237475,  1.16966272,  0.26639974,  1.52058089],
        [ 0.52829584, -1.70605142,  0.97125855, -0.09974296],
        [ 0.52689704,  0.53680201, -0.85782025, -0.82878672],
        [-0.75384103,  1.01255241, -0.22566169,  0.60560597],
        [-0.13926741, -0.76073832,  0.73172656,  0.42070001],
        [ 0.4773532 ,  1.02662418, -1.21804851, -0.92689922],
        [ 0.63845131,  1.79302957, -1.9717922 , -1.37653776],
        [-1.01950582, -0.19691761,  0.97293689,  1.32937436],
        [-1.27077651,  1.28618119, -0.03709842,  1.157971  ],
        [-0.16531167, -0.32750302,  0.39895146,  0.31186184],
        [ 1.6606369 ,  1.37035939, -2.44127359, -2.50736   ],
        [ 1.65162356, -1.59500861, -0.01431944, -1.52998076],
        [ 1.35971085, -1.37196835,  0.03624686, -1.24038743],
        [ 1.5988656 , -1.82660002,  0.21669448, -1.38904931],
        [-2.77719001,  0.98620969,  1.40785555,  3.12517866],
        [-0.11152424,  0.57879094, -0.3834474 , -0.05018274],
        

In [4]:
ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                       astype='array')                                           # new args

(array([[-1.53237475,  1.16966272,  0.26639974,  1.52058089],
        [ 0.52829584, -1.70605142,  0.97125855, -0.09974296],
        [ 0.52689704,  0.53680201, -0.85782025, -0.82878672],
        [-0.75384103,  1.01255241, -0.22566169,  0.60560597],
        [-0.13926741, -0.76073832,  0.73172656,  0.42070001],
        [ 0.4773532 ,  1.02662418, -1.21804851, -0.92689922],
        [ 0.63845131,  1.79302957, -1.9717922 , -1.37653776],
        [-1.01950582, -0.19691761,  0.97293689,  1.32937436],
        [-1.27077651,  1.28618119, -0.03709842,  1.157971  ],
        [-0.16531167, -0.32750302,  0.39895146,  0.31186184],
        [ 1.6606369 ,  1.37035939, -2.44127359, -2.50736   ],
        [ 1.65162356, -1.59500861, -0.01431944, -1.52998076],
        [ 1.35971085, -1.37196835,  0.03624686, -1.24038743],
        [ 1.5988656 , -1.82660002,  0.21669448, -1.38904931],
        [-2.77719001,  0.98620969,  1.40785555,  3.12517866],
        [-0.11152424,  0.57879094, -0.3834474 , -0.05018274],
        

In [5]:
help(ds.make_classification)

Help on function make_classification in module datasets:

make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, *, astype='dataset', **kwargs)
    Like sklearn.datasets.samples_generator.make_classification, but with added functionality.
    
    Parameters
    ---------------------
    Same parameters/arguments as sklearn.datasets.samples_generator.make_classification, in addition to the following
    keyword-only arguments:
    
    astype: str
        One of ('array', 'dataframe', 'dataset') or None to return an NpXyTransformer. See documentation
        of NpXyTransformer.astype.
        
    **kwargs: dict
        Optional arguments that depend on astype. See documentation of
        NpXyTransformer.astype. 
    
    See Also
    --------
    sklearn.datasets.samples_generator.make_classifica

In [6]:
help(skd.make_classification)

Help on function make_classification in module sklearn.datasets.samples_generator:

make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
    Generate a random n-class classification problem.
    
    This initially creates clusters of points normally distributed (std=1)
    about vertices of a `2 * class_sep`-sided hypercube, and assigns an equal
    number of clusters to each class. It introduces interdependence between
    these features and adds various types of further noise to the data.
    
    Prior to shuffling, `X` stacks a number of these primary "informative"
    features, "redundant" linear combinations of these, "repeated" duplicates
    of sampled features, and arbitrary noise for and remaining features.
    
    Read more in the :ref:`User Guide <sample_generators>`.
    
    Param

### An extension of scikit-learn functionality

We also provide postprocessing functionality on top of the `scikit-learn` routines via additional keywords (`astype` and `feature_shape` below).

In [7]:
ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                       astype='array', feature_shape=(2,10,4))                   # new args

(array([[[-1.53237475,  1.16966272,  0.26639974,  1.52058089],
         [ 0.52829584, -1.70605142,  0.97125855, -0.09974296],
         [ 0.52689704,  0.53680201, -0.85782025, -0.82878672],
         [-0.75384103,  1.01255241, -0.22566169,  0.60560597],
         [-0.13926741, -0.76073832,  0.73172656,  0.42070001],
         [ 0.4773532 ,  1.02662418, -1.21804851, -0.92689922],
         [ 0.63845131,  1.79302957, -1.9717922 , -1.37653776],
         [-1.01950582, -0.19691761,  0.97293689,  1.32937436],
         [-1.27077651,  1.28618119, -0.03709842,  1.157971  ],
         [-0.16531167, -0.32750302,  0.39895146,  0.31186184]],
 
        [[ 1.6606369 ,  1.37035939, -2.44127359, -2.50736   ],
         [ 1.65162356, -1.59500861, -0.01431944, -1.52998076],
         [ 1.35971085, -1.37196835,  0.03624686, -1.24038743],
         [ 1.5988656 , -1.82660002,  0.21669448, -1.38904931],
         [-2.77719001,  0.98620969,  1.40785555,  3.12517866],
         [-0.11152424,  0.57879094, -0.3834474 , -0.

We can also convert to `xarray.Dataset` (or other types, like `pandas.DataFrame`):

In [8]:
dst = ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                            astype='dataset')                                          # new args
dst

<xarray.Dataset>
Dimensions:  (dim_0: 20)
Dimensions without coordinates: dim_0
Data variables:
    X0       (dim_0) float64 -1.532 0.5283 0.5269 -0.7538 -0.1393 0.4774 ...
    X1       (dim_0) float64 1.17 -1.706 0.5368 1.013 -0.7607 1.027 1.793 ...
    X2       (dim_0) float64 0.2664 0.9713 -0.8578 -0.2257 0.7317 -1.218 ...
    X3       (dim_0) float64 1.521 -0.09974 -0.8288 0.6056 0.4207 -0.9269 ...
    y        (dim_0) int64 1 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0

In [9]:
dst.y

<xarray.DataArray 'y' (dim_0: 20)>
array([1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0])
Dimensions without coordinates: dim_0

In [10]:
dst = ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                             astype='dataset', dims=('horizontal','vertical'), shape=(4,5))            # new args
dst

<xarray.Dataset>
Dimensions:  (horizontal: 4, vertical: 5)
Dimensions without coordinates: horizontal, vertical
Data variables:
    X0       (horizontal, vertical) float64 -1.532 0.5283 0.5269 -0.7538 ...
    X1       (horizontal, vertical) float64 1.17 -1.706 0.5368 1.013 -0.7607 ...
    X2       (horizontal, vertical) float64 0.2664 0.9713 -0.8578 -0.2257 ...
    X3       (horizontal, vertical) float64 1.521 -0.09974 -0.8288 0.6056 ...
    y        (horizontal, vertical) int64 1 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 0 ...

In [11]:
dst.y

<xarray.DataArray 'y' (horizontal: 4, vertical: 5)>
array([[1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1],
       [0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0]])
Dimensions without coordinates: horizontal, vertical
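
Comparing the flat `y` from cell 9 with the 4x5 `y` above suggests that `shape` fills the new dimensions in row-major order. The pure-Python sketch below reproduces that correspondence. `reshape_row_major` is a hypothetical helper for illustration only; the library presumably relies on NumPy for the actual reshaping.

```python
def reshape_row_major(values, shape):
    """Split a flat sequence into the rows of a 2D grid, filling each row first."""
    nrows, ncols = shape
    if nrows * ncols != len(values):
        raise ValueError(
            'shape {} does not match {} values'.format(shape, len(values)))
    return [list(values[r * ncols:(r + 1) * ncols]) for r in range(nrows)]


# Labels copied from the flat `dst.y` output shown earlier.
y_flat = [1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]

y_grid = reshape_row_major(y_flat, (4, 5))
for row in y_grid:
    print(row)
```

Each printed row matches the corresponding row of the `(horizontal, vertical)` array, so sample `i` ends up at position `(i // 5, i % 5)`.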

In [12]:
dst = ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                             astype='dataset', dims=('horizontal','vertical'), shape=(4,5),
                             coords=(list('abcd'), list('efghi')),
                             xnames=['feat_{:d}'.format(n) for n in range(4)],
                             yname='LABEL', attrs={'metadata1': 'super important'})  
dst

<xarray.Dataset>
Dimensions:     (horizontal: 4, vertical: 5)
Coordinates:
  * horizontal  (horizontal) <U1 'a' 'b' 'c' 'd'
  * vertical    (vertical) <U1 'e' 'f' 'g' 'h' 'i'
Data variables:
    feat_0      (horizontal, vertical) float64 -1.532 0.5283 0.5269 -0.7538 ...
    feat_1      (horizontal, vertical) float64 1.17 -1.706 0.5368 1.013 ...
    feat_2      (horizontal, vertical) float64 0.2664 0.9713 -0.8578 -0.2257 ...
    feat_3      (horizontal, vertical) float64 1.521 -0.09974 -0.8288 0.6056 ...
    LABEL       (horizontal, vertical) int64 1 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 ...
Attributes:
    metadata1:  super important

In [13]:
dst.LABEL

<xarray.DataArray 'LABEL' (horizontal: 4, vertical: 5)>
array([[1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1],
       [0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0]])
Coordinates:
  * horizontal  (horizontal) <U1 'a' 'b' 'c' 'd'
  * vertical    (vertical) <U1 'e' 'f' 'g' 'h' 'i'

## Which simulation functions can be used right now?
<a id='sec-okfuncs'></a>

In [14]:
ds_make_funcs = [f for f in dir(ds) if f.startswith('make_')]  # all make_* functions in xarray_filters/datasets.py

All of the functions above work with defaults only.

But some of them do not return a tuple `(X, y)` where X is a feature matrix and y is a 1d vector of labels.

We will find which ones now (see the `bad` list below).

In [15]:
good = []  # to store the make_* functions that return a features/labels pair (X, y)
bad = []   # to store the make_* functions that do _not_ return a features/labels pair (X, y)

for f in ds_make_funcs:
    try:
        simdata = getattr(ds, f)(astype='array')  # look up the function by name and call it with defaults
    except ValueError as e:
        print('ERROR: {}'.format(str(e)))
        bad.append(f)
    else:
        good.append(f)

ERROR: Function make_low_rank_matrix must return a tuple of 2 elements
ERROR: Y must have dimension 1.
ERROR: Function make_sparse_spd_matrix must return a tuple of 2 elements


We can see the problematic functions in the error messages above. They are also listed here:

In [16]:
bad

['make_low_rank_matrix',
 'make_multilabel_classification',
 'make_sparse_spd_matrix']

And here are the functions we can use without a problem with the current implementation.

Again, we could make the code more general, but I'd recommend doing that only after we pin down the whole API and the tests for the functions that work with the simpler code.

In [17]:
good

['make_blobs',
 'make_circles',
 'make_classification',
 'make_friedman1',
 'make_friedman2',
 'make_friedman3',
 'make_gaussian_quantiles',
 'make_hastie_10_2',
 'make_moons',
 'make_regression',
 'make_s_curve',
 'make_sparse_uncorrelated',
 'make_swiss_roll']

## Implementation details

The central functionality here is implemented in the following two objects:

- The `NpXyTransformer` class, which has multiple `to_*` methods (`to_dataset`, `to_dataframe`, `to_array`, etc.). A new postprocessing routine can be added by writing a new `NpXyTransformer.to_*` method with the appropriate code and documentation.
- A `_make_base` function that takes as input a `sklearn.datasets.make_*` function (like `make_classification`) and creates a new "version" of it under the `datasets` namespace, with a useful signature, docs and extended functionality.
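
To give a feel for the second item, here is a rough sketch of how a wrapper can advertise the original parameters plus a keyword-only `astype` in its signature. This is not the actual `_make_base` code: `extend_with_astype` and `make_toy` are hypothetical names, the postprocessing step is a placeholder, and the sketch assumes the wrapped function has no `*args` or `**kwargs` of its own.

```python
import functools
import inspect


def extend_with_astype(func, default='dataset'):
    """Wrap func so that it also accepts a keyword-only `astype` argument."""
    @functools.wraps(func)
    def wrapper(*args, astype=default, **kwargs):
        result = func(*args, **kwargs)
        # Placeholder postprocessing: just pair the requested type with the result.
        return astype, result

    # Advertise the original parameters followed by the new keyword-only one.
    sig = inspect.signature(func)
    extra = inspect.Parameter(
        'astype', inspect.Parameter.KEYWORD_ONLY, default=default)
    wrapper.__signature__ = sig.replace(
        parameters=list(sig.parameters.values()) + [extra])
    wrapper.__doc__ = 'Like {}, but with added functionality.'.format(
        func.__name__)
    return wrapper


def make_toy(n_samples=3, random_state=None):
    """Stand-in for a sklearn make_* function returning (X, y)."""
    X = [[float(i)] for i in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


make_toy_ext = extend_with_astype(make_toy)
print(inspect.signature(make_toy_ext))
print(make_toy_ext(n_samples=2, astype='array'))
```

Setting `__signature__` explicitly is what makes `help()` show the combined parameter list instead of a generic `(*args, **kwargs)`.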

It's easier to see with an example. Let's construct the same data with the "direct" approach (using the keyword `astype` inside the `make_*` function) and the step-by-step approach (which is what the direct approach does under the hood).

In [18]:
X1, y1 = ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                                astype='array', feature_shape=(2,10,4))                   # new args    

In [19]:
Xyt = ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0,  # sklearn args
                             astype=None)
X2, y2 = Xyt.to_array(feature_shape=(2,10,4))

In [20]:
import numpy as np
np.allclose(X1, X2)  # floating-point data

True

In [21]:
np.all(y1 == y2)  # integer data (np.alltrue is deprecated in recent NumPy releases)

True

In [22]:
help(ds.NpXyTransformer.astype)

Help on function astype in module datasets:

astype(self, to_type, **kwargs)
    Convert to given type.
    
    self.astype(f, **kwargs) calls self.to_f(**kwargs)
    
    Valid types are in NpXyTransformer.accepted_types.
    
    See Also
    --------
    
    NpXyTransformer.to_dataset
    NpXyTransformer.to_array
    NpXyTransformer.to_dataframe
    NpXyTransformer.to_*
    etc...



In [23]:
ds.NpXyTransformer.astype??

In [24]:
help(ds.NpXyTransformer.to_array)

Help on function to_array in module datasets:

to_array(self, feature_shape=None)
    Return X, y NumPy arrays with given shape



This design allows us to implement any data transformations we want by just creating new `to_*` methods under `NpXyTransformer`, while still enjoying:

- All the work (code and docs) done in sklearn
- Argument checking and docs for each transformation living in its own method, which is easier to inspect than a single `**kwargs` with lots of `if/else` checks.
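
The dispatch rule quoted in the `astype` docstring (`self.astype(f, **kwargs)` calls `self.to_f(**kwargs)`) can be sketched in a few lines. The class below is a toy stand-in, not the real `NpXyTransformer`: the class name, the `to_pair` and `to_records` methods, and the way `accepted_types` is computed are all assumptions made for illustration.

```python
class XyTransformerSketch:
    """Toy stand-in for NpXyTransformer holding a features/labels pair."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

    @property
    def accepted_types(self):
        # Every to_* method defines one accepted type name.
        return sorted(name[3:] for name in dir(type(self))
                      if name.startswith('to_'))

    def astype(self, to_type, **kwargs):
        """self.astype(f, **kwargs) calls self.to_f(**kwargs)."""
        if to_type not in self.accepted_types:
            raise ValueError('Unknown type {!r}; valid types: {}'.format(
                to_type, self.accepted_types))
        return getattr(self, 'to_' + to_type)(**kwargs)

    def to_pair(self):
        """Return the data unchanged as an (X, y) tuple."""
        return self.X, self.y

    def to_records(self, yname='y'):
        """Return one dict per sample, with the label stored under yname."""
        return [{'X': x, yname: label} for x, label in zip(self.X, self.y)]


t = XyTransformerSketch([[0.1], [0.2]], [0, 1])
print(t.accepted_types)
print(t.astype('pair'))
print(t.astype('records', yname='LABEL'))
```

Because each `to_*` method has its own signature, an unsupported keyword fails inside that method with a normal `TypeError`, rather than being silently swallowed by a catch-all `**kwargs`.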

To recap, here is the full "low-level path" to building a new `make_classification` function and using it.

In [25]:
my_classification = ds._make_base(skd.make_classification)
Xyt = my_classification(n_samples=20, n_features=4, n_classes=2, random_state=0, astype=None)
X, y = Xyt.to_array()
X, y

# same as
# ds.make_classification(n_samples=20, n_features=4, n_classes=2, random_state=0, astype='array')

(array([[-1.53237475,  1.16966272,  0.26639974,  1.52058089],
        [ 0.52829584, -1.70605142,  0.97125855, -0.09974296],
        [ 0.52689704,  0.53680201, -0.85782025, -0.82878672],
        [-0.75384103,  1.01255241, -0.22566169,  0.60560597],
        [-0.13926741, -0.76073832,  0.73172656,  0.42070001],
        [ 0.4773532 ,  1.02662418, -1.21804851, -0.92689922],
        [ 0.63845131,  1.79302957, -1.9717922 , -1.37653776],
        [-1.01950582, -0.19691761,  0.97293689,  1.32937436],
        [-1.27077651,  1.28618119, -0.03709842,  1.157971  ],
        [-0.16531167, -0.32750302,  0.39895146,  0.31186184],
        [ 1.6606369 ,  1.37035939, -2.44127359, -2.50736   ],
        [ 1.65162356, -1.59500861, -0.01431944, -1.52998076],
        [ 1.35971085, -1.37196835,  0.03624686, -1.24038743],
        [ 1.5988656 , -1.82660002,  0.21669448, -1.38904931],
        [-2.77719001,  0.98620969,  1.40785555,  3.12517866],
        [-0.11152424,  0.57879094, -0.3834474 , -0.05018274],
        

In [26]:
help(my_classification)  # signature/docstring built automatically

Help on function make_classification in module datasets:

make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, *, astype='dataset', **kwargs)
    Like sklearn.datasets.samples_generator.make_classification, but with added functionality.
    
    Parameters
    ---------------------
    Same parameters/arguments as sklearn.datasets.samples_generator.make_classification, in addition to the following
    keyword-only arguments:
    
    astype: str
        One of ('array', 'dataframe', 'dataset') or None to return an NpXyTransformer. See documentation
        of NpXyTransformer.astype.
        
    **kwargs: dict
        Optional arguments that depend on astype. See documentation of
        NpXyTransformer.astype. 
    
    See Also
    --------
    sklearn.datasets.samples_generator.make_classifica