EaSearchCV - CV splitting is failing with Xarray inputs currently #204

Open
PeterDSteinberg opened this issue Oct 12, 2017 · 1 comment
@PeterDSteinberg (Contributor)

In EaSearchCV (PR #192), cross validation fails when a Pipeline or estimator is given an MLDataset / Dataset, because the cross validation logic in dask-searchcv and/or scikit-learn subsets rows of a dask array/dataframe or numpy array. Things to consider:

  • @gbrener's work in xarray_filters allows parameterization of chained transformers on an MLDataset / Dataset, and that has utility in both ML and non-ML contexts, e.g. later using params to control transformers for visualization. These transformers may transform a dataset with one or more DataArrays.
  • In scikit-learn's Pipeline, and in its usage with dask or numpy, the cross validation tools generally expect a 2-D features array as input, so the cross validation classes currently fail when EaSearchCV is used on an MLDataset / Dataset.

Ideas: We want to allow both usage styles, but cross validation can only be supported on the portion of a pipeline that has a 2-D features matrix, e.g. a numpy or dask array, or an MLDataset / Dataset with a single features DataArray whose .values array is given to the cross validation tools. For example, in a Pipeline of:

  • Step 1: Spatial filters on a 4-D array - a custom function to set NaN where the 4-D arrays are out of domain, such as NaN for ocean cells in a 4-D terrestrial DataArray
  • Step 2: A parameterizable operation on the 4-D arrays that allows a Laplacian filter, a gradient filter, or no filter
  • Step 3: Call to_features on the MLDataset of 4-D arrays to convert to features matrix
  • Step 4: Drop the NaN rows of the .features 2D DataArray
  • Step 5: PCA
  • Step 6: KMeans

Hyperparameterization should be supported for any step, but cross validation only for steps 3 through 6, which use a typical features matrix.
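To make the feature-matrix boundary concrete, here is a minimal numpy-only sketch of what steps 3 and 4 amount to. The array shape, dimension order, and the reshape are illustrative assumptions, not the actual xarray_filters to_features API:

```python
import numpy as np

# Hypothetical 4-D array with dims (time, layer, y, x); names illustrative only.
arr = np.random.rand(3, 2, 4, 5)
arr[0, 0, 1, 2] = np.nan  # e.g. an out-of-domain (ocean) cell

# Step 3 analogue: flatten (time, y, x) into rows, layers into feature columns.
n_features = arr.shape[1]
X = arr.transpose(0, 2, 3, 1).reshape(-1, n_features)

# Step 4 analogue: drop rows containing NaN before PCA / KMeans.
mask = ~np.isnan(X).any(axis=1)
X_clean = X[mask]
```

Once a pipeline has produced `X_clean`, the existing row-splitting cross validation machinery applies without change.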

@PeterDSteinberg PeterDSteinberg changed the title CV splitting is failing with Xarray inputs currently but EaSearchCV EaSearchCV - CV splitting is failing with Xarray inputs currently Oct 12, 2017
@PeterDSteinberg PeterDSteinberg self-assigned this Oct 12, 2017
@PeterDSteinberg (Contributor, Author)

Here are some notes I took regarding cross validation, thinking about how to make cross validation work with xarray / xarray_filters data structures. Currently the Pipeline below is failing because xarray data structures are used in the steps up to scaler (it runs fine as a Pipeline, but fails in cross validation when used in GridSearchCV or EaSearchCV).

pipe = Pipeline([
    ('sampler', Sampler(max_time_steps=max_time_steps)),
    ('time', Differencing(layers=FEATURE_LAYERS)),
    ('flatten', Flatten()),
    ('soil_phys', AddSoilPhysicalChemical()),
    ('drop_null', DropNaRows()),
    ('get_y', GetY(SOIL_MOISTURE)),
    ('None', None),
    ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
    ('pca', ChooseWithPreproc()),
    ('estimator', linear_model.LinearRegression(n_jobs=-1))])

In pseudocode, cross validating at the Sampler level would look like:

for n in range(num_samples):
    # Run the Sampler once per sample, then the remaining
    # transformer steps, then fit the final estimator
    dset = pipe.steps[0][1].fit_transform(**sample_args)
    for name, step in pipe.steps[1:-1]:
        dset = step.fit_transform(dset)
    pipe.steps[-1][1].fit(dset)

Currently sklearn cross validation iterators would support:

for n in range(num_splits):
    # Test / train split input array X
    # Run the `scaler`, `pca`, and `estimator`
    # steps on each test/train batch
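For the feature-matrix portion, sklearn's existing iterators already do the right thing. A minimal sketch with KFold on a synthetic 2-D array (standing in for the output of the `flatten` step):

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic 2-D features array: 10 rows (samples), 2 columns (features).
X = np.arange(20).reshape(10, 2)

# sklearn CV iterators split on rows of a 2-D matrix - this is the part
# that works today, while the upstream MLDataset / Dataset steps do not.
kf = KFold(n_splits=5)
splits = list(kf.split(X))
for train_idx, test_idx in splits:
    # Each split is (train_indices, test_indices) over rows of X.
    assert len(train_idx) == 8 and len(test_idx) == 2
```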

Nested cross validation idea - cross validating at the input samples level (e.g. filenames that make up an xarray data structure in Sampler), as well as cross validation splitting of the input matrix:

def make_sample(n):
    return load_array('big_file_{}.nc'.format(n))

xarray_pipe = Pipeline([
    ('sampler', Sampler(func=make_sample)),
    ('time', Differencing(layers=FEATURE_LAYERS)),
    ('flatten', Flatten()),
    ('soil_phys', AddSoilPhysicalChemical()),
    ('drop_null', DropNaRows()),
    ('get_y', GetY(SOIL_MOISTURE)),
    ('None', None),])
numpy_or_dask_pipe = Pipeline([
    ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
    ('pca', ChooseWithPreproc()),
    ('estimator', linear_model.LinearRegression(n_jobs=-1))])

def nested_cross_val(outer_num_samples, inner_num_samples):
    for n in range(outer_num_samples):
        # Outer cross validation, e.g. over the file names
        # or dates that determine a sample read from file
        # (as in the NLDAS ML work)
        sample = xarray_pipe.steps[0][1].fit_transform(n)
        for name, step in xarray_pipe.steps[1:]:
            sample = step.fit_transform(sample)
        X, y = sample
        for k in range(inner_num_samples):  # e.g. KFold
            # Test / train split X, y, then run the steps in
            # numpy_or_dask_pipe - the "inner cross validation"
            pass
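A self-contained version of that nested loop, with synthetic numpy arrays standing in for the NetCDF samples (the `make_sample` body, the coefficients, and the use of LinearRegression are all stand-ins for the real Sampler / pipeline):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def make_sample(n):
    # Stand-in for loading 'big_file_{n}.nc' and running the xarray steps.
    X = rng.normal(size=(30, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)
    return X, y

outer_scores = []
for n in range(3):                       # outer CV: over input samples/files
    X, y = make_sample(n)
    inner = KFold(n_splits=5)            # inner CV: row splits of X
    inner_scores = [
        LinearRegression().fit(X[tr], y[tr]).score(X[te], y[te])
        for tr, te in inner.split(X)
    ]
    outer_scores.append(np.mean(inner_scores))
```

Each outer iteration yields one aggregated inner-CV score, which is the quantity an EaSearchCV-style search would compare across parameter sets.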

Nested cross validation inside evolutionary search (pseudocode):

def ea_search_cv(outer_cv, inner_cv):
    pop = initialize()
    for generation in range(ngen):
        # Each generation in the evolutionary algorithm
        scores = []
        for model in pop:
            # For each member of the population, do the
            # outer / inner cross validations and accumulate
            # the two-layer cross validation scores
            scores.append(nested_cross_val(outer_cv, inner_cv))
        # The EA search chooses the best parameters
        # based on the cross validation scores
        pop = select_new_population(scores)
    return pop
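A toy, runnable version of that EA loop. The parameter dicts, the scoring function (a stand-in for `nested_cross_val`), and the select-best-half-then-mutate scheme are all invented for illustration, not EaSearchCV's actual operators:

```python
import random

random.seed(0)

# Stand-in for nested_cross_val: score peaks at alpha == 0.3.
def score(params):
    return -(params['alpha'] - 0.3) ** 2

def select_new_population(pop):
    # Keep the best half, then mutate the survivors to refill the population.
    best = sorted(pop, key=score, reverse=True)[:len(pop) // 2]
    children = [{'alpha': p['alpha'] + random.gauss(0, 0.05)} for p in best]
    return best + children

pop = [{'alpha': random.random()} for _ in range(8)]
first_best = max(score(p) for p in pop)
for generation in range(10):
    pop = select_new_population(pop)
final_best = max(score(p) for p in pop)
```

Because the best individual always survives selection, the best score is non-decreasing across generations.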
