
Closes #1017. Fix pickling of CV generators. #1018

Merged 1 commit into EpistasisLab:development from adriangb:fix-custom-cv-pickling on Feb 19, 2020

Conversation

adriangb (Contributor)

Fixes pickling of custom cv split generators to address the bug described in #1017.

I ended up having to make changes to a couple of the existing tests because they interacted with _wrapped_cross_val_score directly and that API has now changed. The general API for fit and TPOT instantiation is unchanged. I also added a test for this bug.

I do not think any of the docs need to be updated, please correct me if I am wrong.
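
For context, here is a minimal sketch of the Python limitation behind #1017 (an illustration only, not the actual TPOT patch): generator objects cannot be pickled, so a custom cv generator breaks when it has to be sent to worker processes, while a materialized list of splits pickles fine.

import pickle

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

# LeaveOneGroupOut().split(...) returns a generator, which pickle rejects
cv_gen = LeaveOneGroupOut().split(X, y, groups)
try:
    pickle.dumps(cv_gen)
except TypeError as exc:
    print("cannot pickle the generator:", exc)

# Materializing the splits gives a plain list of index arrays, which pickles fine
cv_list = list(LeaveOneGroupOut().split(X, y, groups))
pickle.dumps(cv_list)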

@coveralls commented Feb 18, 2020

Coverage increased (+0.002%) to 96.749% when pulling 917e64b on adriangb:fix-custom-cv-pickling into e45cf6f on EpistasisLab:development.

tests/tpot_tests.py (review comments, resolved)
@weixuanfu (Contributor) left a comment

Thank you for making those changes. Please make a few updates based on my comments, and then I will merge it into the dev branch once I get a chance.

tests/tpot_tests.py (review comments, resolved)
@weixuanfu changed the base branch from master to development on February 18, 2020 at 20:02
@adriangb (Contributor, Author)

@weixuanfu should be ready to merge 👍

@weixuanfu merged commit 1783f3c into EpistasisLab:development on Feb 19, 2020
@weixuanfu (Contributor)

Thank you a lot! Merged!

@adriangb (Contributor, Author)

Awesome!

Glad I could help!

I didn't want to include it in this PR, but I did want to bring this up: is there any reason for the cv parameter to be in the TPOT constructor? It seems like it's not really used until fit is called.

@weixuanfu (Contributor)

The reason is that we don't want fit() to take params which don't follow the scikit-learn API.
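
To illustrate the convention being referenced, here is a sketch using scikit-learn's own GridSearchCV rather than TPOT: configuration such as cv lives in the estimator's constructor, and fit() only receives the data.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# cv is a constructor argument, mirroring how TPOT accepts it;
# fit() takes only the data, matching the scikit-learn estimator API.
search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5),
)
# search.fit(X, y)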

@adriangb (Contributor, Author)

That makes sense. What about making a TPOT.cross_validate method that essentially achieves the same thing as the current implementation, but abstracts it out into a separate method to keep the API closer to sklearn?

@weixuanfu (Contributor)

Hmm, I think that does not follow the scikit-learn Estimator API either. Also, those cv splits are used to evaluate the pipelines in TPOT via their average cv score (one of the fitness scores in GP), which serves a different purpose than scikit-learn's cross_validate, which can compute and store all cv scores and even the fitted models for all splits.
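
To make the contrast concrete, here is a short sketch of what scikit-learn's cross_validate exposes (per-split scores and, optionally, the fitted estimators), versus the single averaged score that TPOT uses internally as a GP fitness value:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
results = cross_validate(
    KNeighborsClassifier(), X, y, cv=5, return_estimator=True
)

per_split_scores = results["test_score"]  # one score per split
fitted_models = results["estimator"]      # one fitted model per split
mean_score = np.mean(per_split_scores)    # the kind of summary TPOT uses as fitness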

@adriangb (Contributor, Author)

You're right, thank you for entertaining the thought.

Maybe a quick summary of my use case will better explain why I was asking. My workflow is as follows (pseudocode, but you get the idea):

# setup
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from tpot import TPOTClassifier

raw_data = pd.DataFrame(...)
feature_extractors = [ext_ft_1, ext_ft_2, ...]   # callables: ext_ft(raw_data) -> X, y, groups
models = [
    TPOTClassifier(max_time_mins=10),
    KNeighborsClassifier(),
]

# extract features -> fit -> score
for ft_ext in feature_extractors:
    X, y, groups = ft_ext(raw_data)
    cv = list(LeaveOneGroupOut().split(X, y, groups))
    for model in models:
        model.fit(X, y)
        if isinstance(model, TPOTClassifier):
            model = model.fitted_pipeline_  # score the exported pipeline, not the TPOT wrapper
        score = np.mean(cross_val_score(estimator=model, X=X, y=y, cv=cv))
        # save model, score

This works with the scikit-learn models, but not with TPOT. So I modified it, roughly as follows:

# setup (same imports as above)
raw_data = pd.DataFrame(...)
feature_extractors = [ext_ft_1, ext_ft_2, ...]
models = [
    (TPOTClassifier, {"max_time_mins": 10}),
    (KNeighborsClassifier, dict()),
]

# extract features -> fit -> score
for ft_ext in feature_extractors:
    X, y, groups = ft_ext(raw_data)
    cv = list(LeaveOneGroupOut().split(X, y, groups))
    for model_class, opts in models:
        if issubclass(model_class, TPOTClassifier):
            model = model_class(**opts, cv=cv)  # cv has to go through the constructor
            model.fit(X, y)
            model = model.fitted_pipeline_
        else:
            model = model_class(**opts)
            model.fit(X, y)
        score = np.mean(cross_val_score(estimator=model, X=X, y=y, cv=cv))
        # save model, score

I know this is a contrived use case, but I was just wondering if there is a better way.

@adriangb deleted the fix-custom-cv-pickling branch on February 19, 2020 at 19:50
@weixuanfu (Contributor) commented Feb 20, 2020

But for the original workflow, how about adding cv to the TPOTClassifier's parameter settings? Or, for the modified workflow, how about using model.set_params? I think both follow the scikit-learn API.

@adriangb (Contributor, Author)

Using model.set_params worked beautifully! I wasn't aware of that API. Thank you.
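
For readers with the same use case, here is a sketch of how the set_params suggestion slots into the original workflow, reusing the hypothetical raw_data, feature_extractors, and models from the pseudocode above:

# same imports and setup as in the pseudocode above
for ft_ext in feature_extractors:
    X, y, groups = ft_ext(raw_data)
    cv = list(LeaveOneGroupOut().split(X, y, groups))
    for model in models:
        if isinstance(model, TPOTClassifier):
            model.set_params(cv=cv)  # configure cv without changing fit()
        model.fit(X, y)
        if isinstance(model, TPOTClassifier):
            model = model.fitted_pipeline_
        score = np.mean(cross_val_score(estimator=model, X=X, y=y, cv=cv))
        # save model, score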
