use dask-searchcv for evolutionary search EaSearchCV by PeterDSteinberg · Pull Request #192 · ContinuumIO/elm

PeterDSteinberg · 2017-09-13T01:00:26Z

Refactor of evolutionary algorithms as EaSearchCV (subclass of dask_searchcv.DaskBaseSearchCV):

- Allows cross validation test/train splits within each individual of a genetic algorithm (i.e. hyperparameterization based on models' scores in test rather than training batches)
- Uses dask parallelism of dask_searchcv rather than the Phase I elm approach
- Better organization of model scores (see the cv_results_ attribute of EaSearchCV)
- Improves modularity
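To illustrate the first point, here is a minimal, standard-library-only sketch of scoring one individual (one hyperparameter setting) by its mean score on held-out folds rather than on the training data. It is not the EaSearchCV implementation: the toy model, the `shrink` hyperparameter, and all function names are assumptions made up for this illustration.

```python
from statistics import mean


def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_slope(x, y, shrink):
    """Least-squares slope through the origin with a ridge-style penalty."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + shrink)


def cv_fitness(x, y, shrink, n_splits=3):
    """Fitness of one individual: mean negative MSE over the held-out folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(x), n_splits):
        slope = fit_slope([x[i] for i in train_idx],
                          [y[i] for i in train_idx], shrink)
        mse = mean((y[i] - slope * x[i]) ** 2 for i in test_idx)
        scores.append(-mse)
    return mean(scores)


x = [float(i) for i in range(1, 13)]
y = [2.0 * v for v in x]           # noiseless toy data with true slope 2
candidates = [0.0, 10.0, 100.0]    # hypothetical hyperparameter values
fitness = {s: cv_fitness(x, y, s) for s in candidates}
best = max(fitness, key=fitness.get)
```

Because each candidate is ranked only by its test-fold score, a setting that merely fits the training batches well gains no advantage in selection.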

Example usage:

from collections import OrderedDict

from dask_glm.datasets import make_regression as dsk_make_regression
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from xarray_filters import MLDataset
from xarray_filters.datasets import _make_base
from elm.model_selection.ea_searchcv import EaSearchCV


dsk_make_regression = _make_base(dsk_make_regression)
shape = (100, 10, 5, 2)
dset = dsk_make_regression(shape=shape, n_samples=np.prod(shape))
dv = OrderedDict([(k, v) for k, v in dset.data_vars.items()
                  if k != 'y'])
X = MLDataset(dv)
y = dset.y.values.ravel()
X = X.to_features().features.values
data_source = dict(X=X, y=y)

pipe = Pipeline([('poly', PolynomialFeatures()),
                 ('pca', PCA()),
                 ('reg', LinearRegression())])

param_grid = dict(poly__degree=list(range(1, 3)),
                  poly__interaction_only=[True, False],
                  reg__fit_intercept=[True, False],
                  reg__normalize=[True, False],
                  pca__n_components=list(range(3, 12)))

k = 40
mu = 20
ngen = 10      # number of generations
mutpb = 0.4    # mutation probability
cxpb = 0.6     # crossover probability
param_grid_name = 'example_1'

ea = EaSearchCV(estimator=pipe,
                param_grid=param_grid,
                score_weights=[1],
                k=k,
                mu=mu,
                ngen=ngen,
                mutpb=mutpb,
                cxpb=cxpb,
                param_grid_name=param_grid_name,
                early_stop=None,
                toolbox=None,
                scoring=None,
                refit=False,
                cv=None,
                error_score='raise',
                return_train_score=True,
                scheduler=None,
                n_jobs=-1,
                cache_cv=True)
ea.fit(X, y=y)
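For readers unfamiliar with the ngen, mutpb and cxpb settings, the following standard-library sketch shows one way such parameters can drive a search over a parameter grid. It is an assumption-laden illustration, not the algorithm used by EaSearchCV or deap: the grid, the `score` stand-in, the either-crossover-or-mutation variation step, and the elitist survival rule are all hypothetical choices.

```python
import random

# Hypothetical grid and scoring function, for illustration only.
grid = {
    "degree": [1, 2, 3],
    "n_components": [3, 4, 5, 6],
    "fit_intercept": [True, False],
}


def score(params):
    """Stand-in for a cross-validated score; higher is better (max is 1)."""
    return (-abs(params["degree"] - 2)
            - abs(params["n_components"] - 5)
            + (1 if params["fit_intercept"] else 0))


def random_individual(rng):
    return {name: rng.choice(values) for name, values in grid.items()}


def crossover(a, b, rng):
    """Uniform crossover: each parameter comes from one of the two parents."""
    return {name: rng.choice([a[name], b[name]]) for name in grid}


def mutate(ind, rng):
    """Resample one randomly chosen parameter from the grid."""
    child = dict(ind)
    name = rng.choice(sorted(grid))
    child[name] = rng.choice(grid[name])
    return child


def evolve(mu=6, ngen=10, cxpb=0.6, mutpb=0.4, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(mu)]
    best_per_gen = []
    for _ in range(ngen):
        offspring = []
        for _ in range(mu):
            parent = rng.choice(population)
            draw = rng.random()
            if draw < cxpb:                  # crossover with probability cxpb
                child = crossover(parent, rng.choice(population), rng)
            elif draw < cxpb + mutpb:        # mutation with probability mutpb
                child = mutate(parent, rng)
            else:                            # otherwise copy the parent
                child = dict(parent)
            offspring.append(child)
        # Elitist survival: keep the mu best of parents plus offspring.
        population = sorted(population + offspring, key=score, reverse=True)[:mu]
        best_per_gen.append(score(population[0]))
    return population[0], best_per_gen


best, history = evolve()
```

Because parents compete with their offspring for survival, the best score per generation never decreases in this sketch.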

PeterDSteinberg · 2017-09-18T18:07:09Z

TODO:

- See issue "Improvements for evolutionary algorithms" #185 and make sure each of those evolutionary search improvements is either handled in this PR or tracked in a separate issue.
- docstrings
  - Be sure to describe how the score_weights parameter relates to deap (it is used to flip minimization to maximization), and make sure the examples use it correctly (IIRC, the Phase I elm docs are currently confusing on this point).
- doctests
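As a note for the score_weights docstring, here is a small sketch of the deap fitness convention as I understand it: each raw score is multiplied by its weight and the weighted value is maximized, so a weight of 1 maximizes a score and a weight of -1 minimizes it. The helper name and the candidate scores below are made up for illustration, and how EaSearchCV forwards score_weights to deap is an assumption here.

```python
def weighted_fitness(scores, score_weights):
    """Multiply each raw score by its weight; the search maximizes the result."""
    if len(scores) != len(score_weights):
        raise ValueError("need one weight per score")
    return tuple(s * w for s, w in zip(scores, score_weights))


# Raw scores for three hypothetical candidates.
r2_scores = {"a": (0.71,), "b": (0.85,), "c": (0.64,)}   # higher is better
mse_scores = {"a": (4.2,), "b": (1.3,), "c": (6.0,)}     # lower is better

# score_weights=[1] keeps the ordering, score_weights=[-1] flips it.
best_by_r2 = max(r2_scores, key=lambda k: weighted_fitness(r2_scores[k], [1]))
best_by_mse = max(mse_scores, key=lambda k: weighted_fitness(mse_scores[k], [-1]))
```

Both selections pick candidate "b": the highest R² under a weight of 1 and the lowest MSE under a weight of -1.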

PeterDSteinberg · 2017-10-11T04:47:21Z

Here is the output of py.test -m "not slow" -vvvv (skipping slow tests and running with the verbose flag).

The tests show 18 failed, 1866 passed, 1941 skipped, 194 deselected, 15 warnings in 383.18 seconds
pytest_vvv_not_slow_tuesday_october_11_results.txt

Over the next day I'll continue commenting on existing issues and opening new ones (about 4 or 6) that relate to the 18 test failures. Those test failures do not delay the merge of this PR, as some are "expected failures" (not marked as such in py.test, but expected to fail because we have not completed all of the data structure flexibility goals).

@gbrener Could you check out this branch, run the py.test command locally in Py 3.6 / 2.7, and pipe your output to a similar file so we can check that the number of failures is the same or explain why it differs? I constructed my env by installing elm from the anaconda elm 3.5 dev branch to get the environment, then installed from this branch + xarray_filters PR 19.
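One possible way to compare two runs is to parse the py.test summary lines and report only the counts that differ. This is just a sketch: the helper names are hypothetical, and the second summary line is invented for the example rather than taken from any real run.

```python
import re

# Match only known outcome names, so "383.18 seconds" is not picked up.
SUMMARY_RE = re.compile(
    r"(\d+) (failed|passed|skipped|deselected|xfailed|xpassed|warnings|error)"
)


def parse_summary(line):
    """Turn a py.test summary line into a {outcome: count} dict."""
    return {name: int(count) for count, name in SUMMARY_RE.findall(line)}


def diff_counts(a, b):
    """Return {outcome: (count_a, count_b)} for outcomes whose counts differ."""
    return {key: (a.get(key, 0), b.get(key, 0))
            for key in sorted(set(a) | set(b))
            if a.get(key, 0) != b.get(key, 0)}


mine = parse_summary(
    "18 failed, 1866 passed, 1941 skipped, 194 deselected, "
    "15 warnings in 383.18 seconds"
)
# Made-up summary line standing in for a second environment's output.
theirs = parse_summary(
    "20 failed, 1864 passed, 1941 skipped, 194 deselected, "
    "15 warnings in 401.02 seconds"
)
differences = diff_counts(mine, theirs)
```

An empty `differences` dict would mean both environments report the same counts.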

use dask-searchcv for evolutionary search EaSearchCV

eb1ca41

PeterDSteinberg mentioned this pull request Sep 18, 2017

Improvements for evolutionary algorithms #185

Closed

7 tasks

PeterDSteinberg added this to the Phase II Milestone 2 - Improved Tools for Ensemble Fitting and Prediction milestone Sep 18, 2017

This was referenced Sep 18, 2017

Remove the installs / imports of attrs in favor of params #174

Open

change the Pipeline.ensemble attribute to be a simple list not list of tuples #108

Closed

Smarter settings for ensembles fitting only one large sample #98

Closed

Peter Steinberg added 7 commits September 20, 2017 13:52

fixes to evolutionary search

6624fe7

add SklearnMixin and hierarchical models

6fa5ca1

wrap all scikit-learn transformers / estimators

da3a53a

remove extraneous printing

1deae55

remove extraneous printing

f033b30

remove extraneous printing

7ca7ad1

many changes on large refactor of Pipeline and elm.pipeline.steps

a658c56

This was referenced Sep 22, 2017

Deprecate ElmStore in favor of xarray_filters.MLDataset ContinuumIO/earthio#30

Merged

consistency with elm PR 192 and earthio PR 30 ContinuumIO/xarray_filters#19

Merged

Peter Steinberg added 3 commits September 25, 2017 19:54

refactor elm.pipeline.Pipeline and elm.pipeline.steps

0df3b1f

tests and related fixes for elm.pipeline's Pipeline, steps

627a7ce

changes to CI for combined elm.tests dir

094e94b

This was referenced Sep 28, 2017

Fixes to support sklearn.manifold #200

Closed

Refactor SklearnBase #197

Closed

Ensure the hyperparameterization of pipelines allows model structure optimization #198

Open

Use scikit-learn BaseEstimator as a base class for all pipeline steps #194

Closed

Peter Steinberg added 2 commits October 4, 2017 10:55

fixes to support evolutionary search with MLDataset and Dask-searchcv

6609777

improve test harness for different data structures - xarray, pandas, …

92a8070

…numpy, mldataset

PeterDSteinberg mentioned this pull request Oct 10, 2017

Documentation structure overhaul #188

Open

8 tasks

naming of arguments - add docstrings and todo messages - deprecate/mo…

a4c56b7

…ve two small modules

This was referenced Oct 11, 2017

Custom estimators / sklearn / dask and data structure flexibility checklist #201

Closed

Custom estimators / sklearn / dask and data structure flexibility checklist #202

Open

PeterDSteinberg mentioned this pull request Oct 12, 2017

EaSearchCV - CV splitting is failing with Xarray inputs currently #204

Open

Peter Steinberg added 2 commits October 12, 2017 14:07

commit just to test out Travis CI config

4fa77c3

ensure temporary method for dask-searchcv install works

e8089e0

add xarray_filters to elm dependencies

86fdad8

This was referenced Oct 16, 2017

Updates for Elm / Earthio / Xarray_filters evolutionary search improvements ContinuumIO/Elm-Earthio-NLDAS#4

Open

Cross validation for xarray.* data structures #215

Closed

Elm Quarter 3, 2017 Priorities #216

Open

gbrener added 2 commits October 18, 2017 14:45

Add missing test_config.yaml, modify elm/tests/util.py to find it, an…

7bec311

…d update environment.yml Environment.yml now has dask-glm and dask-searchcv.

Add __init__.py, to make elm/tests a subpackage

c2c55da

PeterDSteinberg merged commit 55bdbfe into master Oct 18, 2017

This was referenced Oct 20, 2017

Use the dill library instead of sklearn.externals.joblib to dump/load models #132

Closed

Change model scoring to be more similar to scikit-learn and dask-searchcv #99

Closed

Take advantage of new easy way to start dask-distributed #107

Closed

PeterDSteinberg mentioned this pull request Oct 27, 2017

Wrap GridSearchCV and RandomizedSearchCV for MLDataset (xarray data structures) #223

Open

PeterDSteinberg merged 19 commits into master from ea-search-refactor

2 participants
