Closes #1017. Fix pickling of CV generators. #1018
Conversation
Thank you for making those changes. Please update it a little based on my comments, then I will merge it into the dev branch once I get a chance.
@weixuanfu should be ready to merge 👍
Thanks a lot! Merged!
Awesome! Glad I could help! I didn't want to include it in this PR, but I did want to bring this up: is there any reason the `cv` splits can't be passed to `fit()` directly?
The reason is that we don't want `fit()` to take params which don't follow the scikit-learn API.
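(For readers outside the thread, a minimal sketch of the convention being cited, using a hypothetical `ConventionalClassifier`: configuration such as `cv` belongs in `__init__`, while `fit` receives only data.)

```python
# Sketch of the scikit-learn estimator convention (not TPOT's actual code):
# hyperparameters go to __init__; fit() takes only data.
from sklearn.base import BaseEstimator, ClassifierMixin

class ConventionalClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, cv=5):
        self.cv = cv  # configuration lives on the instance

    def fit(self, X, y):
        # data only; cv is read from self, so get_params/set_params,
        # clone(), and GridSearchCV all keep working
        ...
        return self
```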
That makes sense. What about making a dedicated setter for the cv splits instead?
Hmm, I think that does not follow the scikit-learn API for an Estimator either. Also, those cv splits are for evaluating the pipelines within TPOT via average cv scores (as one of the fitness scores in GP), which has a different purpose than `cross_val_score`.
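(A conceptual sketch of that distinction, not TPOT's actual internals: the hypothetical `candidate_fitness` below shows a cv average serving as a GP fitness score during the search, rather than as a final model evaluation.)

```python
# Conceptual sketch only -- not TPOT's real implementation.
# During the GP search, each candidate pipeline is scored on the same
# cv splits, and the mean cv score feeds into its fitness.
import numpy as np
from sklearn.model_selection import cross_val_score

def candidate_fitness(pipeline, X, y, cv):
    return np.mean(cross_val_score(pipeline, X, y, cv=cv))
```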
You're right, thank you for entertaining the thought. Maybe a quick summary of my use case will better explain why I was asking. My workflow is as follows (pseudocode, but you get the idea):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from tpot import TPOTClassifier

# setup
raw_data = pd.DataFrame(...)
feature_extractors = [ext_ft_1, ext_ft_2, ...]  # callables: ext_ft(raw_data) -> X, y, groups
models = [
    TPOTClassifier(max_time_mins=10),
    KNeighborsClassifier(),
]

# extract features -> fit -> score
for ft_ext in feature_extractors:
    X, y, groups = ft_ext(raw_data)
    cv = list(LeaveOneGroupOut().split(X, y, groups))
    for model in models:
        model.fit(X, y)
        if isinstance(model, TPOTClassifier):
            model = model.fitted_pipeline_
        score = np.mean(cross_val_score(estimator=model, X=X, y=y, cv=cv))
        # save model, score
```

This works with the `KNeighborsClassifier`, but to get my group-aware `cv` splits into `TPOTClassifier` I have to restructure it like this:

```python
# setup
raw_data = pd.DataFrame(...)
feature_extractors = [ext_ft_1, ext_ft_2, ...]
models = [
    (TPOTClassifier, {"max_time_mins": 10}),
    (KNeighborsClassifier, dict()),
]

# extract features -> fit -> score
for ft_ext in feature_extractors:
    X, y, groups = ft_ext(raw_data)
    cv = list(LeaveOneGroupOut().split(X, y, groups))
    for model_class, opts in models:
        if model_class is TPOTClassifier:  # check the class; no instance exists yet
            model = model_class(**opts, cv=cv)
            model.fit(X, y)
            model = model.fitted_pipeline_
        else:
            model = model_class(**opts)
            model.fit(X, y)
        score = np.mean(cross_val_score(estimator=model, X=X, y=y, cv=cv))
        # save model, score
```

I know this is a contrived use case, but I was just wondering if there is a better way.
But for the original workflow, how about adding the `cv` to the `TPOTClassifier`'s parameter settings? Or, for the modified workflow, how about using `model.set_params()`? I think both follow the scikit-learn API.
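(A minimal sketch of that suggestion applied to the first workflow above, reusing `X`, `y`, `cv`, and `models` from it; this assumes the splits can be injected via the standard `set_params`, which works because TPOT estimators subclass scikit-learn's `BaseEstimator`.)

```python
# Sketch of the set_params suggestion, reusing names from the workflow above.
for model in models:
    if isinstance(model, TPOTClassifier):
        model.set_params(cv=cv)  # inject the group-aware splits, sklearn-style
    model.fit(X, y)
```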
Using `set_params()` works, thank you!
Fixes pickling of custom cv split generators to fix the bug described in #1017.

I ended up having to make changes to a couple of the existing tests because they interacted with `_wrapped_cross_val_score` directly and that API has now changed. The general API for `fit` and `TPOT` instantiation is unchanged. I also added a test for this bug. I do not think any of the docs need to be updated; please correct me if I am wrong.
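For context on the root cause (a minimal reproduction sketch, not code from this PR): Python generators cannot be pickled, so a cv given as a raw generator fails as soon as anything tries to serialize it, while a materialized list of splits pickles fine.

```python
# Minimal illustration of the root cause (not code from this PR):
# generator objects are not picklable, but a list of split index
# arrays is.
import pickle
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.zeros((6, 2))
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])

gen = LeaveOneGroupOut().split(X, y, groups)   # a generator
try:
    pickle.dumps(gen)
except TypeError as e:
    print("generator cv fails to pickle:", e)

splits = list(LeaveOneGroupOut().split(X, y, groups))  # materialized splits
pickle.dumps(splits)  # fine: a list of (train, test) index arrays
```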