
Cross validation of Pipeline/estimators using MLDataset / xarray.Dataset #221

Closed
wants to merge 27 commits

Conversation

@PeterDSteinberg

Work in progress to fix #204

@PeterDSteinberg

Current status of tests (for a simple Pipeline of only one unsupervised estimator step). These are mostly failing because the test harness does not assemble all the requisite arguments for the cross validators, such as a grouping variable (see the sketch after this list):

test_xarray_cross_validation.py::test_each_cv[GroupKFold] PASSED
test_xarray_cross_validation.py::test_each_cv[GroupShuffleSplit] PASSED
test_xarray_cross_validation.py::test_each_cv[KFold] PASSED
test_xarray_cross_validation.py::test_each_cv[LeaveOneGroupOut] PASSED
test_xarray_cross_validation.py::test_each_cv[LeavePGroupsOut] FAILED
test_xarray_cross_validation.py::test_each_cv[LeaveOneOut] PASSED
test_xarray_cross_validation.py::test_each_cv[LeavePOut] FAILED
test_xarray_cross_validation.py::test_each_cv[PredefinedSplit] FAILED
test_xarray_cross_validation.py::test_each_cv[RepeatedKFold] PASSED
test_xarray_cross_validation.py::test_each_cv[RepeatedStratifiedKFold] FAILED
test_xarray_cross_validation.py::test_each_cv[ShuffleSplit] PASSED
test_xarray_cross_validation.py::test_each_cv[StratifiedKFold] FAILED
test_xarray_cross_validation.py::test_each_cv[StratifiedShuffleSplit] FAILED
test_xarray_cross_validation.py::test_each_cv[TimeSeriesSplit] PASSED
test_xarray_cross_validation.py::test_each_cv[MLDatasetMixin] FAILED
test_xarray_cross_validation.py::test_each_cv[CVCacheSampleId] FAILED
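
For context, a minimal sketch (with made-up shapes and group labels) of why the group-based splitters need extra arguments: sklearn's group-aware CV iterators raise unless a groups array labels each row, while KFold and similar splitters need only X.

import numpy as np
from sklearn.model_selection import KFold, LeavePGroupsOut

X = np.random.rand(12, 3)
y = np.random.randint(0, 2, 12)

# KFold needs no extra arguments, so the harness can call it as-is.
for train_idx, test_idx in KFold(n_splits=3).split(X):
    pass

# LeavePGroupsOut raises ValueError unless `groups` labels each row.
groups = np.repeat([0, 1, 2, 3], 3)
for train_idx, test_idx in LeavePGroupsOut(n_groups=2).split(X, y, groups=groups):
    pass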

@PeterDSteinberg

I'm going to add more tests using pytest.mark.parametrize to better cover how Pipeline options, such as supervised vs. unsupervised estimator steps, interact with MLDataset and cross validation.
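
Something along these lines (names are illustrative, not the actual test code):

import pytest
from itertools import product

CV_CLASSES = ['KFold', 'GroupKFold']             # placeholder splitter names
PIPELINE_KINDS = ['supervised', 'unsupervised']  # Pipeline options to cover

@pytest.mark.parametrize('cv_name, kind', product(CV_CLASSES, PIPELINE_KINDS))
def test_cv_with_pipeline_kind(cv_name, kind):
    # Build a Pipeline of the given kind, cross validate with cv_name,
    # and assert that fitting completes.
    assert cv_name and kind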

@@ -55,12 +44,20 @@ class Wrapped(SklearnMixin, cls):
for cls in get_module_classes(m).values():
if cls.__name__ in _seen:
continue
if m not in cls.__module__:

This is just checking that we are getting StandardScaler or similar from the sklearn module where it is actually defined, not some other module where it is re-imported for internal usage.
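
To illustrate (the exact submodule path may vary by sklearn version):

from sklearn.preprocessing import StandardScaler

# __module__ names the module where the class is defined, even when the
# class has been re-imported into other modules for internal use.
print(StandardScaler.__module__)   # e.g. 'sklearn.preprocessing.data'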

@@ -14,7 +14,7 @@
import pytest


def new_pipeline(*args, flatten_first=True):

This was not Python 2.7 compatible.
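
Keyword-only arguments after *args are Python 3 syntax. A sketch of a 2.7-compatible equivalent that pops the option from **kwargs (body elided; only the signature change matters here):

def new_pipeline(*args, **kwargs):
    # Python 2.7 has no keyword-only arguments, so emulate the default.
    flatten_first = kwargs.pop('flatten_first', True)
    if kwargs:
        raise TypeError('Unexpected keyword arguments: %r' % sorted(kwargs))
    # ... build and return the Pipeline as before ...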

@@ -64,38 +67,46 @@ class SklearnMixin:
_as_numpy_arrs = _as_numpy_arrs
_from_numpy_arrs = _from_numpy_arrs

def _call_sk_method(self, sk_method, X=None, y=None, **kw):
def _call_sk_method(self, sk_method, X=None, y=None, do_split=True, **kw):

I am currently working on simplifying this function - checking what is actually needed.

@PeterDSteinberg

Added several label-encoding classes to test_config.yaml under (SKIP) - (LabelBinarizer, LabelEncoder, SelectFromModel). I think the test harness is not preparing the right input data - I haven't looked into it yet.

The new test module xarray_cross_validation.py tests cross validation in Pipelines that use xarray_filters.MLDataset (or xarray.Dataset objects that are converted to xarray_filters.MLDataset). Typically, when cross validation is used with GridSearchCV or other estimators, a large tabular feature matrix is given as input, and cross validation iterators from sklearn.model_selection, e.g. sklearn.model_selection.KFold, split the rows of that matrix into train / test batches.
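
For comparison, the conventional tabular splitting pattern looks like this (made-up shapes):

import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(100, 4)   # rows are samples, columns are features
for train_idx, test_idx in KFold(n_splits=5).split(X):
    X_train, X_test = X[train_idx], X[test_idx]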

When running a hyperparameter search over a Pipeline of operations on an MLDataset, cross validation instead requires a sampler callable passed to the EaSearchCV initializer and an iterable of sampler arguments passed to EaSearchCV.fit. Repeated calls to the sampler are used to form train / test batches. An outstanding issue to fix (before dask-searchcv PR 61 can be merged) is the use of refit=True as an argument to EaSearchCV when cross validating Pipelines that use MLDataset in steps. See the TODO note in test_xarray_cross_validation.py regarding refit=True:

refit_options = (False,) # TODO - refit is not working because
                         # it is passing sampler arguments, not
                         # sampler output, to the logic that
                         # refits the best model. We need to
                         # open a separate issue to figure out
                         # what "refit" means in a fitting
                         # operation over many samples - it is
                         # not as obvious what that should be
                         # when CV-splitting input file names
                         # or other sampler arguments rather
                         # than a large matrix.
test_args = product(CV_CLASSES, configs, refit_options)
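
For context, a schematic of the sampler pattern described above; the argument names and the load_mldataset helper are illustrative, not the actual elm API:

def sampler(filename, **kw):
    # Each call turns one sampler argument (e.g. a file name) into one
    # MLDataset sample; CV splits operate over the argument list, not
    # over rows of a single large matrix.
    return load_mldataset(filename)   # hypothetical loader

# ea = EaSearchCV(pipe, param_distributions, sampler=sampler, refit=False)
# ea.fit(list_of_filenames)   # iterable of sampler arguments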

The refit=True problem above prevents EaSearchCV.predict from running (the best estimator has not been refit for prediction). When that issue is fixed, hopefully this part of test_ea_search.py:

test_args = product(args, (None,))

can be changed to:

test_args = product(args, ('predict', None)) # Test "refit"=True and predict(...)

I'll open issues and link them here:

  • Test refit=True when running EaSearchCV Pipelines passing MLDataset between steps
  • Label-encoding preprocessors - LabelBinarizer and LabelEncoder from sklearn.preprocessing, plus SelectFromModel from sklearn.feature_selection (a guess at the cause is sketched below)
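
A hedged guess at the label-encoding failures: these classes fit on a 1-D label vector y rather than a 2-D feature matrix X, so a harness that hands every estimator a feature matrix would give them the wrong input:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['water', 'soil', 'water', 'vegetation'])  # 1-D labels, not a matrix
print(le.transform(['soil', 'water']))            # -> [0 2]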

To run the tests in this PR:

cd elm/tests && py.test -m "not slow" -vvv

Test summary

============================= 126 tests deselected =============================
==== 1850 passed, 23 skipped, 126 deselected, 12 warnings in 526.52 seconds ====

@PeterDSteinberg PeterDSteinberg changed the title Cross validation of Pipeline/estimators/transformers using MLDataset / xarray Cross validation of Pipeline/estimators using MLDataset / xarray Nov 6, 2017
@PeterDSteinberg PeterDSteinberg changed the title Cross validation of Pipeline/estimators using MLDataset / xarray Cross validation of Pipeline/estimators using MLDataset / xarray.Dataset Nov 6, 2017
@PeterDSteinberg PeterDSteinberg mentioned this pull request Nov 6, 2017
@PeterDSteinberg

Notes:

@PeterDSteinberg

Replaced by #228


Successfully merging this pull request may close these issues.

EaSearchCV - CV splitting is failing with Xarray inputs currently