Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support string / categorical targets for binary and multiclass problems #932

Merged
merged 65 commits into from Jul 20, 2020

Conversation

angela97lin
Copy link
Contributor

Continuation of #917, closes #215

@angela97lin angela97lin added this to the July 2020 milestone Jul 15, 2020
Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! The impl looks ready to go. I left a comment about a line in ClassificationPipeline.predict_proba which I'm not sure I follow yet, and a comment about docstrings.

Its great that we have coverage for binary classification estimators’ class order, thanks for adding.

Here's a few test cases and potential changes I'd like us to address before merging:

  • What happens when the target passed to fit and score disagree, and there are new unique values passed to score, or we passed str to fit and now int to score? We should raise an error explaining the problem. I think what would currently happen is a stack trace out of LabelEncoder, right?
  • Try calling binary and multiclass pipeline fit, predict, predict_proba and score with various types of target (int, float, bool, category, string), and test that either the encode/decode still functions properly, or we get a clear error message.
  • If we’re going to modify predict_proba col names in this PR, as it appears we are, we should have coverage for that change too. My suggestion would be to push it to Include target column name and class labels in predict/predict_proba output #645 , unless I'm misunderstanding.

Happy to talk more about this if you'd like!

@angela97lin angela97lin requested a review from dsherry July 16, 2020 05:02
@angela97lin
Copy link
Contributor Author

@dsherry

  • What happens when the target passed to fit and score disagree, and there are new unique values passed to score, or we passed str to fit and now int to score? We should raise an error explaining the problem. I think what would currently happen is a stack trace out of LabelEncoder, right?

Yup, we get a pretty clear error out of LabelEncoder but I'll catch it and rethrow with our own exception. Added test_new_unique_targets_in_score as a test for this.

  • Try calling binary and multiclass pipeline fit, predict, predict_proba and score with various types of target (int, float, bool, category, string), and test that either the encode/decode still functions properly, or we get a clear error message.

I already had what's now test_targets_data_types for string and categorical testing automl searches. I added bool and float/int checks. How does that look?

On second thought, I'm going to update the tests and address this in #645. I know you added a potential code snippet, but I think rather than introduce even more code to fix/patch something, might as well punt it after all!

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin the tests you added look great! I left one suggested for an additional check. Well done on this, there were lots of details to chase down.

It occurs to me, would be nice to have a target type test for regression too, mirroring your classification test. That one would just expect that categorical and str types fail. Not required for merge.

@angela97lin
Copy link
Contributor Author

@dsherry Thanks for the feedback! Added in the extra assert but RE regression, I opened up #946 to ensure that we have a clear error thrown to the user and then test that; I can tackle that issue in this epic as well but wanted this merged first :)

@dsherry
Copy link
Contributor

dsherry commented Jul 17, 2020

@angela97lin sounds good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support string targets for binary and multiclass problems
2 participants