use dask-searchcv for evolutionary search EaSearchCV#192

Merged
PeterDSteinberg merged 19 commits into master from ea-search-refactor
Oct 18, 2017

@PeterDSteinberg
Contributor

Refactor of evolutionary algorithms as EaSearchCV (subclass of dask_searchcv.DaskBaseSearchCV):

  • Allows cross-validation train/test splits for each individual of a genetic algorithm (i.e. hyperparameters are chosen based on the models' scores on test rather than training batches)
  • Uses the dask parallelism of dask_searchcv rather than the Phase I elm approach
  • Better organization of model scores (see the cv_results_ attribute of EaSearchCV)
  • Improves modularity
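
To make the first bullet concrete, here is a minimal standard-library sketch of scoring each individual by its mean held-out score. It is not the EaSearchCV implementation: the helpers k_fold_indices and cv_fitness, the toy fit/score functions, and the "slope" hyperparameter are all hypothetical names made up for illustration.

```python
from statistics import mean


def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous k-fold splits."""
    indices = list(range(n_samples))
    start = 0
    for i in range(n_splits):
        # Spread any remainder over the first folds.
        size = n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cv_fitness(params, X, y, fit, score, n_splits=3):
    """Mean test-fold score for one individual (a hyperparameter dict)."""
    scores = []
    for train, test in k_fold_indices(len(X), n_splits):
        model = fit(params, [X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in test], [y[i] for i in test]))
    return mean(scores)


# Toy "estimator" (hypothetical): the slope is a hyperparameter,
# the intercept is learned from the training fold only.
def fit(params, X_train, y_train):
    slope = params['slope']
    intercept = mean(yi - slope * xi for xi, yi in zip(X_train, y_train))
    return slope, intercept


def score(model, X_test, y_test):
    slope, intercept = model
    # Negative mean squared error, so larger is better.
    return -mean((yi - (slope * xi + intercept)) ** 2
                 for xi, yi in zip(X_test, y_test))


X = [float(i) for i in range(12)]
y = [2.0 * xi + 1.0 for xi in X]

# Each individual in the population is one candidate hyperparameter setting.
population = [{'slope': s} for s in (0.5, 1.0, 2.0, 3.0)]
fitnesses = [cv_fitness(ind, X, y, fit, score) for ind in population]
best = population[max(range(len(population)), key=fitnesses.__getitem__)]
print(best)
```

The point of the design is that an individual's fitness never sees the samples it was trained on, so selection pressure in the genetic algorithm favors settings that generalize.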

Example usage:

from collections import OrderedDict

from dask_glm.datasets import make_regression as dsk_make_regression
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from xarray_filters import MLDataset
from xarray_filters.datasets import _make_base
from elm.model_selection.ea_searchcv import EaSearchCV


dsk_make_regression = _make_base(dsk_make_regression)
shape = (100, 10, 5, 2)
dset = dsk_make_regression(shape=shape, n_samples=np.prod(shape))
dv = OrderedDict([(k, v) for k, v in dset.data_vars.items()
                  if k != 'y'])
X = MLDataset(dv)
y = dset.y.values.ravel()
X = X.to_features().features.values
data_source = dict(X=X, y=y)

pipe = Pipeline([('poly', PolynomialFeatures()),
                 ('pca', PCA()),
                 ('reg', LinearRegression())])

param_grid = dict(poly__degree=list(range(1, 3)),
                  poly__interaction_only=[True, False],
                  reg__fit_intercept=[True, False],
                  reg__normalize=[True, False],
                  pca__n_components=list(range(3, 12)))

k = 40
mu = 20
ngen = 10
mutpb = 0.4
cxpb = 0.6
param_grid_name = 'example_1'

ea = EaSearchCV(estimator=pipe,
                param_grid=param_grid,
                score_weights=[1],
                k=k,
                mu=mu,
                ngen=ngen,
                mutpb=mutpb,
                cxpb=cxpb,
                param_grid_name=param_grid_name,
                early_stop=None,
                toolbox=None,
                scoring=None,
                refit=False,
                cv=None,
                error_score='raise',
                return_train_score=True,
                scheduler=None,
                n_jobs=-1,
                cache_cv=True)
ea.fit(X, y=y)

@PeterDSteinberg
Contributor Author

TODO:

  • See issue #185 ("Improvements for evolutionary algorithms") and make sure all of those evolutionary search improvements are either handled in this PR or tracked in separate issues.
  • docstrings
    • Be sure to describe how the score_weights parameter relates to deap (it is used to flip minimization to maximization) and make sure the examples use it correctly (IIRC, there is currently some confusion about this in the elm docs from Phase I).
  • doctests
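
For the score_weights docstring, something like the following could illustrate the idea. deap compares fitness values after multiplying each objective by its weight, so a positive weight maximizes and a negative weight minimizes. The sketch below only mimics that rule in plain Python; it does not call deap or EaSearchCV, and weighted_fitness, select_best, and the score values are hypothetical.

```python
from operator import mul


def weighted_fitness(values, weights):
    """Multiply each objective value by its weight (the rule deap-style fitness uses)."""
    return tuple(map(mul, values, weights))


def select_best(values_by_name, weights):
    """Return the name whose weighted fitness is largest."""
    return max(values_by_name,
               key=lambda name: weighted_fitness(values_by_name[name], weights))


# Made-up scores where larger is better (e.g. R^2): use a positive weight.
r2_scores = {'model_a': (0.9,), 'model_b': (0.7,)}
best_by_r2 = select_best(r2_scores, weights=(1,))

# Made-up scores where smaller is better (e.g. an error): use a negative weight.
errors = {'model_a': (0.30,), 'model_b': (0.10,)}
best_by_error = select_best(errors, weights=(-1,))

print(best_by_r2, best_by_error)
```

Under this reading, score_weights=[1] in the example above is appropriate when the scorer returns a value to maximize.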

@PeterDSteinberg
Contributor Author

Here is the output of py.test -m "not slow" -vvvv (skipping slow tests and running with the verbose flag).

The tests show 18 failed, 1866 passed, 1941 skipped, 194 deselected, 15 warnings in 383.18 seconds
pytest_vvv_not_slow_tuesday_october_11_results.txt

Over the next day I'll continue commenting on existing issues and opening new ones (about 4 to 6) that relate to the 18 test failures. Those test failures do not delay the merge of this PR, as some are "expected failures" (not marked as such in py.test, but expected to fail because we have not yet completed all of the data structure flexibility goals).

@gbrener Could you check out this branch, run the py.test command locally in Py 3.6 / 2.7, and pipe your output to a similar file, so we can check that the number of failures is the same or explain why it differs? I constructed my env by installing elm from the anaconda elm 3.5 dev branch to get the environment, then installing from this branch + xarray_filters PR 19.
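
To compare the two runs, a small script along these lines could parse the final py.test summary line from each output file and report which counts differ. It is only a sketch: parse_summary and diff_summaries are hypothetical helpers, and the second summary line is made up to show the comparison.

```python
import re

# Matches items such as "18 failed" or "15 warnings" in a py.test summary line.
SUMMARY_ITEM = re.compile(
    r"(\d+) (failed|passed|skipped|deselected|warnings|xfailed|xpassed)")


def parse_summary(line):
    """Return a dict like {'failed': 18, 'passed': 1866, ...}."""
    return {name: int(count) for count, name in SUMMARY_ITEM.findall(line)}


def diff_summaries(mine, theirs):
    """Return {outcome: (mine, theirs)} for every outcome whose count differs."""
    outcomes = sorted(set(mine) | set(theirs))
    return {name: (mine.get(name, 0), theirs.get(name, 0))
            for name in outcomes
            if mine.get(name, 0) != theirs.get(name, 0)}


# Summary line reported in this comment.
mine = parse_summary("18 failed, 1866 passed, 1941 skipped, "
                     "194 deselected, 15 warnings in 383.18 seconds")

# Made-up summary line standing in for a second environment.
theirs = parse_summary("20 failed, 1864 passed, 1941 skipped, "
                       "194 deselected, 15 warnings in 401.02 seconds")

print(diff_summaries(mine, theirs))
```

An empty result would mean both environments agree on every count; it would not show whether the same tests failed, which still needs a look at the verbose output.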
