
Cross validation of Pipeline/estimators using MLDataset / xarray.Dataset #221

Closed
wants to merge 27 commits

Conversation

@PeterDSteinberg

Work in progress to fix #204

@PeterDSteinberg

Current status of tests (for a simple Pipeline of only one unsupervised estimator step). These are mostly failing because the test harness does not assemble all the requisite arguments for the cross validators, such as a grouping variable (see the sketch after this list):

test_xarray_cross_validation.py::test_each_cv[GroupKFold] PASSED
test_xarray_cross_validation.py::test_each_cv[GroupShuffleSplit] PASSED
test_xarray_cross_validation.py::test_each_cv[KFold] PASSED
test_xarray_cross_validation.py::test_each_cv[LeaveOneGroupOut] PASSED
test_xarray_cross_validation.py::test_each_cv[LeavePGroupsOut] FAILED
test_xarray_cross_validation.py::test_each_cv[LeaveOneOut] PASSED
test_xarray_cross_validation.py::test_each_cv[LeavePOut] FAILED
test_xarray_cross_validation.py::test_each_cv[PredefinedSplit] FAILED
test_xarray_cross_validation.py::test_each_cv[RepeatedKFold] PASSED
test_xarray_cross_validation.py::test_each_cv[RepeatedStratifiedKFold] FAILED
test_xarray_cross_validation.py::test_each_cv[ShuffleSplit] PASSED
test_xarray_cross_validation.py::test_each_cv[StratifiedKFold] FAILED
test_xarray_cross_validation.py::test_each_cv[StratifiedShuffleSplit] FAILED
test_xarray_cross_validation.py::test_each_cv[TimeSeriesSplit] PASSED
test_xarray_cross_validation.py::test_each_cv[MLDatasetMixin] FAILED
test_xarray_cross_validation.py::test_each_cv[CVCacheSampleId] FAILED
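
For context, a minimal sketch (with made-up shapes and group labels) of why the group-based splitters need extra arguments: sklearn's group-aware CV iterators raise unless a groups array labels each row, while KFold and similar splitters need only X.

import numpy as np
from sklearn.model_selection import KFold, LeavePGroupsOut

X = np.random.rand(12, 3)
y = np.random.randint(0, 2, 12)

# KFold needs no extra arguments, so the harness can call it as-is.
for train_idx, test_idx in KFold(n_splits=3).split(X):
    pass

# LeavePGroupsOut raises ValueError unless `groups` labels each row.
groups = np.repeat([0, 1, 2, 3], 3)
for train_idx, test_idx in LeavePGroupsOut(n_groups=2).split(X, y, groups=groups):
    pass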

@PeterDSteinberg

I'm going to add more tests using pytest.mark.parametrize to better cover how Pipeline options, such as supervised vs. unsupervised estimator steps, interact with MLDataset and cross validation.
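
Something along these lines (names are illustrative, not the actual test code):

import pytest
from itertools import product

CV_CLASSES = ['KFold', 'GroupKFold']             # placeholder splitter names
PIPELINE_KINDS = ['supervised', 'unsupervised']  # Pipeline options to cover

@pytest.mark.parametrize('cv_name, kind', product(CV_CLASSES, PIPELINE_KINDS))
def test_cv_with_pipeline_kind(cv_name, kind):
    # Build a Pipeline of the given kind, cross validate with cv_name,
    # and assert that fitting completes.
    assert cv_name and kind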

@@ -55,12 +44,20 @@ class Wrapped(SklearnMixin, cls):
for cls in get_module_classes(m).values():
if cls.__name__ in _seen:
continue
if m not in cls.__module__:

This is just checking that we are getting StandardScaler or similar from the sklearn module where it is actually defined, not some other module where it is re-imported for internal usage.
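
To illustrate (the exact submodule path may vary by sklearn version):

from sklearn.preprocessing import StandardScaler

# __module__ names the module where the class is defined, even when the
# class has been re-imported into other modules for internal use.
print(StandardScaler.__module__)   # e.g. 'sklearn.preprocessing.data'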

@@ -14,7 +14,7 @@
import pytest


def new_pipeline(*args, flatten_first=True):

This was not Python 2.7 compatible.
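
Keyword-only arguments after *args are Python 3 syntax. A sketch of a 2.7-compatible equivalent that pops the option from **kwargs (body elided; only the signature change matters here):

def new_pipeline(*args, **kwargs):
    # Python 2.7 has no keyword-only arguments, so emulate the default.
    flatten_first = kwargs.pop('flatten_first', True)
    if kwargs:
        raise TypeError('Unexpected keyword arguments: %r' % sorted(kwargs))
    # ... build and return the Pipeline as before ...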

@@ -64,38 +67,46 @@ class SklearnMixin:
_as_numpy_arrs = _as_numpy_arrs
_from_numpy_arrs = _from_numpy_arrs

def _call_sk_method(self, sk_method, X=None, y=None, **kw):
def _call_sk_method(self, sk_method, X=None, y=None, do_split=True, **kw):

I am currently working on simplifying this function - checking what is actually needed.

@PeterDSteinberg

Added several label-encoding classes to test_config.yaml under (SKIP) - (LabelBinarizer, LabelEncoder, SelectFromModel). I think the test harness is not preparing the right input data - I haven't looked into it yet.

The new test module xarray_cross_validation.py tests cross validation in Pipelines that use xarray_filters.MLDataset (or xarray.Dataset objects that are converted to xarray_filters.MLDataset). Typically, when cross validation is used with GridSearchCV or other estimators, a large tabular feature matrix is given as input, and cross validation iterators from sklearn.model_selection, e.g. sklearn.model_selection.KFold, split the rows of that matrix into train / test batches.
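
For comparison, the conventional tabular splitting pattern looks like this (made-up shapes):

import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(100, 4)   # rows are samples, columns are features
for train_idx, test_idx in KFold(n_splits=5).split(X):
    X_train, X_test = X[train_idx], X[test_idx]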

When running a hyperparameter search over a Pipeline of operations on an MLDataset, cross validation instead requires a sampler callable passed to the EaSearchCV initializer and an iterable of sampler arguments passed to EaSearchCV.fit. Repeated calls to the sampler are used to form train / test batches. An outstanding issue to fix (before dask-searchcv PR 61 can be merged) is the use of refit=True as an argument to EaSearchCV when cross validating Pipelines that use MLDataset in steps. See the TODO note in test_xarray_cross_validation.py regarding refit=True:

refit_options = (False,) # TODO - refit is not working because
                         # it is passing sampler arguments, not
                         # sampler output, to the logic that
                         # refits the best model. We need to
                         # open a separate issue to figure out
                         # what "refit" means in a fitting
                         # operation over many samples - it is
                         # not as obvious what that should be
                         # when CV-splitting input file names
                         # or other sampler arguments rather
                         # than a large matrix.
test_args = product(CV_CLASSES, configs, refit_options)
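
For context, a schematic of the sampler pattern described above; the argument names and the load_mldataset helper are illustrative, not the actual elm API:

def sampler(filename, **kw):
    # Each call turns one sampler argument (e.g. a file name) into one
    # MLDataset sample; CV splits operate over the argument list, not
    # over rows of a single large matrix.
    return load_mldataset(filename)   # hypothetical loader

# ea = EaSearchCV(pipe, param_distributions, sampler=sampler, refit=False)
# ea.fit(list_of_filenames)   # iterable of sampler arguments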

The refit=True problem above prevents EaSearchCV.predict from running (the best estimator has not been refit for prediction). When that issue is fixed, hopefully this part of test_ea_search.py:

test_args = product(args, (None,))

can be changed to:

test_args = product(args, ('predict', None)) # Test "refit"=True and predict(...)

I'll open issues and link them here:

  • Test refit=True when running EaSearchCV Pipelines passing MLDataset between steps
  • Label-encoding preprocessors - LabelBinarizer and LabelEncoder from sklearn.preprocessing, plus SelectFromModel from sklearn.feature_selection (a guess at the cause is sketched below)
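
A hedged guess at the label-encoding failures: these classes fit on a 1-D label vector y rather than a 2-D feature matrix X, so a harness that hands every estimator a feature matrix would give them the wrong input:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['water', 'soil', 'water', 'vegetation'])  # 1-D labels, not a matrix
print(le.transform(['soil', 'water']))            # -> [0 2]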

To run the tests in this PR:

cd elm/tests && py.test -m "not slow" -vvv

Test summary

============================= 126 tests deselected =============================
==== 1850 passed, 23 skipped, 126 deselected, 12 warnings in 526.52 seconds ====

@PeterDSteinberg PeterDSteinberg changed the title Cross validation of Pipeline/estimators/transformers using MLDataset / xarray Cross validation of Pipeline/estimators using MLDataset / xarray Nov 6, 2017
@PeterDSteinberg PeterDSteinberg changed the title Cross validation of Pipeline/estimators using MLDataset / xarray Cross validation of Pipeline/estimators using MLDataset / xarray.Dataset Nov 6, 2017
@PeterDSteinberg PeterDSteinberg mentioned this pull request Nov 6, 2017
@PeterDSteinberg

Notes:

@PeterDSteinberg

Replaced by #228


Successfully merging this pull request may close these issues.

EaSearchCV - CV splitting is failing with Xarray inputs currently