EaSearchCV - CV splitting is failing with Xarray inputs currently #204

Open
PeterDSteinberg opened this issue Oct 12, 2017 · 1 comment
@PeterDSteinberg (Contributor)

In EaSearchCV (PR #192), cross validation fails when a Pipeline or estimator is given an MLDataset / Dataset, because the cross validation logic in dask-searchcv and/or scikit-learn subsets rows of a dask array/dataframe or numpy array. Things to consider:

  • @gbrener's work in xarray_filters allows parameterization of chained transformers on an MLDataset / Dataset, and that has utility in both ML and non-ML contexts, e.g. later using params to control transformers for visualization. These transformers may transform a dataset with one or more DataArrays.
  • In scikit-learn's Pipeline, and in its usage with dask or numpy, the cross validation tools generally expect a 2-D features array as input, so the cross validation classes currently fail when EaSearchCV is used on an MLDataset / Dataset.

Ideas: We want to allow both usage styles, but cross validation can only be supported on the portion of a pipeline that has a 2-D features matrix, e.g. a numpy or dask array, or an MLDataset / Dataset with a single features DataArray whose .values array is given to the cross validation tools. For example, in a Pipeline of:

  • Step 1: Spatial filters on a 4-D array - a custom function to set NaN where the 4-D arrays are out of domain, such as NaN for ocean cells in a 4-D terrestrial DataArray
  • Step 2: A parameterizable operation on the 4-D arrays that allows a Laplacian filter, a gradient filter, or no filter
  • Step 3: Call to_features on the MLDataset of 4-D arrays to convert to features matrix
  • Step 4: Drop the NaN rows of the .features 2D DataArray
  • Step 5: PCA
  • Step 6: KMeans

Hyperparameterization should be supported for any step, but cross validation only for steps 3 through 6, which use a typical features matrix.
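To make the feature-matrix boundary concrete, here is a minimal numpy-only sketch of what steps 3 and 4 amount to. The array shape, dimension order, and the reshape are illustrative assumptions, not the actual xarray_filters to_features API:

```python
import numpy as np

# Hypothetical 4-D array with dims (time, layer, y, x); names illustrative only.
arr = np.random.rand(3, 2, 4, 5)
arr[0, 0, 1, 2] = np.nan  # e.g. an out-of-domain (ocean) cell

# Step 3 analogue: flatten (time, y, x) into rows, layers into feature columns.
n_features = arr.shape[1]
X = arr.transpose(0, 2, 3, 1).reshape(-1, n_features)

# Step 4 analogue: drop rows containing NaN before PCA / KMeans.
mask = ~np.isnan(X).any(axis=1)
X_clean = X[mask]
```

Once a pipeline has produced `X_clean`, the existing row-splitting cross validation machinery applies without change.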

@PeterDSteinberg PeterDSteinberg changed the title CV splitting is failing with Xarray inputs currently but EaSearchCV EaSearchCV - CV splitting is failing with Xarray inputs currently Oct 12, 2017
@PeterDSteinberg PeterDSteinberg self-assigned this Oct 12, 2017
@PeterDSteinberg (Contributor, Author)

Here are some notes I took regarding cross validation, thinking about how to make cross validation work with xarray / xarray_filters data structures. Currently the Pipeline below is failing because xarray data structures are used in the steps up to scaler (it runs fine as a Pipeline, but fails in cross validation when used in GridSearchCV or EaSearchCV).

pipe = Pipeline([
    ('sampler', Sampler(max_time_steps=max_time_steps)),
    ('time', Differencing(layers=FEATURE_LAYERS)),
    ('flatten', Flatten()),
    ('soil_phys', AddSoilPhysicalChemical()),
    ('drop_null', DropNaRows()),
    ('get_y', GetY(SOIL_MOISTURE)),
    ('None', None),
    ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
    ('pca', ChooseWithPreproc()),
    ('estimator', linear_model.LinearRegression(n_jobs=-1))])

In pseudocode, cross validating at the Sampler level would look like:

for n in range(num_samples):
    # Run the Sampler once per sample, then the remaining
    # transformer steps, then fit the final estimator
    dset = pipe.steps[0][1].fit_transform(**sample_args)
    for name, step in pipe.steps[1:-1]:
        dset = step.fit_transform(dset)
    pipe.steps[-1][1].fit(dset)

Currently sklearn cross validation iterators would support:

for n in range(num_splits):
    # Test / train split input array X
    # Run the `scaler`, `pca`, and `estimator`
    # steps on each test/train batch
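For the feature-matrix portion, sklearn's existing iterators already do the right thing. A minimal sketch with KFold on a synthetic 2-D array (standing in for the output of the `flatten` step):

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic 2-D features array: 10 rows (samples), 2 columns (features).
X = np.arange(20).reshape(10, 2)

# sklearn CV iterators split on rows of a 2-D matrix - this is the part
# that works today, while the upstream MLDataset / Dataset steps do not.
kf = KFold(n_splits=5)
splits = list(kf.split(X))
for train_idx, test_idx in splits:
    # Each split is (train_indices, test_indices) over rows of X.
    assert len(train_idx) == 8 and len(test_idx) == 2
```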

Nested cross validation idea - cross validating at the input samples level (e.g. filenames that make up an xarray data structure in Sampler), as well as cross validation splitting of the input matrix:

def make_sample(n):
    return load_array('big_file_{}.nc'.format(n))

xarray_pipe = Pipeline([
    ('sampler', Sampler(func=make_sample)),
    ('time', Differencing(layers=FEATURE_LAYERS)),
    ('flatten', Flatten()),
    ('soil_phys', AddSoilPhysicalChemical()),
    ('drop_null', DropNaRows()),
    ('get_y', GetY(SOIL_MOISTURE)),
    ('None', None),])
numpy_or_dask_pipe = Pipeline([
    ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
    ('pca', ChooseWithPreproc()),
    ('estimator', linear_model.LinearRegression(n_jobs=-1))])

def nested_cross_val(outer_num_samples, inner_num_samples):
    for n in range(outer_num_samples):
        # Outer cross validation, e.g. over the file names
        # or dates that determine a sample read from file
        # (as in the NLDAS ML work)
        sample = xarray_pipe.steps[0][1].fit_transform(n)
        for name, step in xarray_pipe.steps[1:]:
            sample = step.fit_transform(sample)
        X, y = sample
        for k in range(inner_num_samples):  # e.g. KFold
            # Test / train split X, y, then run the steps in
            # numpy_or_dask_pipe - the "inner cross validation"
            pass
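A self-contained version of that nested loop, with synthetic numpy arrays standing in for the NetCDF samples (the `make_sample` body, the coefficients, and the use of LinearRegression are all stand-ins for the real Sampler / pipeline):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def make_sample(n):
    # Stand-in for loading 'big_file_{n}.nc' and running the xarray steps.
    X = rng.normal(size=(30, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)
    return X, y

outer_scores = []
for n in range(3):                       # outer CV: over input samples/files
    X, y = make_sample(n)
    inner = KFold(n_splits=5)            # inner CV: row splits of X
    inner_scores = [
        LinearRegression().fit(X[tr], y[tr]).score(X[te], y[te])
        for tr, te in inner.split(X)
    ]
    outer_scores.append(np.mean(inner_scores))
```

Each outer iteration yields one aggregated inner-CV score, which is the quantity an EaSearchCV-style search would compare across parameter sets.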

Nested cross validation inside evolutionary search (pseudocode):

def ea_search_cv(outer_cv, inner_cv):
    pop = initialize()
    for generation in range(ngen):
        # Each generation in the evolutionary algorithm
        scores = []
        for model in pop:
            # For each member of the population, do the
            # outer / inner cross validations and accumulate
            # the two-layer cross validation scores
            scores.append(nested_cross_val(outer_cv, inner_cv))
        # The EA search chooses the best parameters
        # based on the cross validation scores
        pop = select_new_population(scores)
    return pop
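A toy, runnable version of that EA loop. The parameter dicts, the scoring function (a stand-in for `nested_cross_val`), and the select-best-half-then-mutate scheme are all invented for illustration, not EaSearchCV's actual operators:

```python
import random

random.seed(0)

# Stand-in for nested_cross_val: score peaks at alpha == 0.3.
def score(params):
    return -(params['alpha'] - 0.3) ** 2

def select_new_population(pop):
    # Keep the best half, then mutate the survivors to refill the population.
    best = sorted(pop, key=score, reverse=True)[:len(pop) // 2]
    children = [{'alpha': p['alpha'] + random.gauss(0, 0.05)} for p in best]
    return best + children

pop = [{'alpha': random.random()} for _ in range(8)]
first_best = max(score(p) for p in pop)
for generation in range(10):
    pop = select_new_population(pop)
final_best = max(score(p) for p in pop)
```

Because the best individual always survives selection, the best score is non-decreasing across generations.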
