Closes #1017. Fix pickling of CV generators. #1018
Conversation
Thank you for making those changes. Please update it a little based on my comments, then I will merge it into the dev branch once I get a chance.
@weixuanfu should be ready to merge 👍
Thanks a lot! Merged!
Awesome! Glad I could help! I didn't want to include it in this PR, but I did want to bring this up: is there any reason the `cv` splits can't be passed to `fit()` directly?
The reason is that we don't want `fit()` to take params which don't follow the scikit-learn API.
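(For readers outside the thread, a minimal sketch of the convention being cited, using a hypothetical `ConventionalClassifier`: configuration such as `cv` belongs in `__init__`, while `fit` receives only data.)

```python
# Sketch of the scikit-learn estimator convention (not TPOT's actual code):
# hyperparameters go to __init__; fit() takes only data.
from sklearn.base import BaseEstimator, ClassifierMixin

class ConventionalClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, cv=5):
        self.cv = cv  # configuration lives on the instance

    def fit(self, X, y):
        # data only; cv is read from self, so get_params/set_params,
        # clone(), and GridSearchCV all keep working
        ...
        return self
```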
That makes sense. What about making a dedicated setter for the cv splits instead?
Hmm, I think that does not follow the scikit-learn API for an Estimator either. Also, those cv splits are for evaluating the pipelines within TPOT via average cv scores (as one of the fitness scores in GP), which has a different purpose than `cross_val_score`.
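(A conceptual sketch of that distinction, not TPOT's actual internals: the hypothetical `candidate_fitness` below shows a cv average serving as a GP fitness score during the search, rather than as a final model evaluation.)

```python
# Conceptual sketch only -- not TPOT's real implementation.
# During the GP search, each candidate pipeline is scored on the same
# cv splits, and the mean cv score feeds into its fitness.
import numpy as np
from sklearn.model_selection import cross_val_score

def candidate_fitness(pipeline, X, y, cv):
    return np.mean(cross_val_score(pipeline, X, y, cv=cv))
```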
You're right, thank you for entertaining the thought. Maybe a quick summary of my use case will better explain why I was asking. My workflow is as follows (pseudocode, but you get the idea):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from tpot import TPOTClassifier

# setup
raw_data = pd.DataFrame(...)
feature_extractors = [ext_ft_1, ext_ft_2, ...]  # callables: ext_ft(raw_data) -> X, y, groups
models = [
    TPOTClassifier(max_time_mins=10),
    KNeighborsClassifier(),
]

# extract features -> fit -> score
for ft_ext in feature_extractors:
    X, y, groups = ft_ext(raw_data)
    cv = list(LeaveOneGroupOut().split(X, y, groups))
    for model in models:
        model.fit(X, y)
        if isinstance(model, TPOTClassifier):
            model = model.fitted_pipeline_
        score = np.mean(cross_val_score(estimator=model, X=X, y=y, cv=cv))
        # save model, score
```

This works with the `KNeighborsClassifier`, but to get my group-aware `cv` splits into `TPOTClassifier` I have to restructure it like this:

```python
# setup
raw_data = pd.DataFrame(...)
feature_extractors = [ext_ft_1, ext_ft_2, ...]
models = [
    (TPOTClassifier, {"max_time_mins": 10}),
    (KNeighborsClassifier, dict()),
]

# extract features -> fit -> score
for ft_ext in feature_extractors:
    X, y, groups = ft_ext(raw_data)
    cv = list(LeaveOneGroupOut().split(X, y, groups))
    for model_class, opts in models:
        if model_class is TPOTClassifier:  # check the class; no instance exists yet
            model = model_class(**opts, cv=cv)
            model.fit(X, y)
            model = model.fitted_pipeline_
        else:
            model = model_class(**opts)
            model.fit(X, y)
        score = np.mean(cross_val_score(estimator=model, X=X, y=y, cv=cv))
        # save model, score
```

I know this is a contrived use case, but I was just wondering if there is a better way.
But for the original workflow, how about adding the `cv` to the `TPOTClassifier`'s parameter settings? Or, for the modified workflow, how about using `model.set_params()`? I think both follow the scikit-learn API.
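(A minimal sketch of that suggestion applied to the first workflow above, reusing `X`, `y`, `cv`, and `models` from it; this assumes the splits can be injected via the standard `set_params`, which works because TPOT estimators subclass scikit-learn's `BaseEstimator`.)

```python
# Sketch of the set_params suggestion, reusing names from the workflow above.
for model in models:
    if isinstance(model, TPOTClassifier):
        model.set_params(cv=cv)  # inject the group-aware splits, sklearn-style
    model.fit(X, y)
```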
Using `set_params()` works, thank you!
Fixes pickling of custom cv split generators to fix the bug described in #1017.

I ended up having to make changes to a couple of the existing tests because they interacted with `_wrapped_cross_val_score` directly and that API has now changed. The general API for `fit` and `TPOT` instantiation is unchanged. I also added a test for this bug. I do not think any of the docs need to be updated; please correct me if I am wrong.
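For context on the root cause (a minimal reproduction sketch, not code from this PR): Python generators cannot be pickled, so a cv given as a raw generator fails as soon as anything tries to serialize it, while a materialized list of splits pickles fine.

```python
# Minimal illustration of the root cause (not code from this PR):
# generator objects are not picklable, but a list of split index
# arrays is.
import pickle
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.zeros((6, 2))
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])

gen = LeaveOneGroupOut().split(X, y, groups)   # a generator
try:
    pickle.dumps(gen)
except TypeError as e:
    print("generator cv fails to pickle:", e)

splits = list(LeaveOneGroupOut().split(X, y, groups))  # materialized splits
pickle.dumps(splits)  # fine: a list of (train, test) index arrays
```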