
Plot just one ROC curve in the binary case (#1040) #1041

Closed

Conversation

VladSkripniuk
Contributor

Remove per-class ROC curves and micro- and macro- averaging in the case
of binary classification for estimators with predict_proba.

This PR fixes #1040, which reported that ROCAUC draws per-class curves when the estimator has predict_proba. This is inconsistent both with ROCAUC's behavior when the estimator only has decision_function, and with PrecisionRecallCurve, which draws a single curve in the binary case.

I have made the following changes:

  1. Changed ROCAUC to treat binary classifiers with predict_proba as a binary problem, not a multiclass one.
  2. Changed tests accordingly.
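
The root of the inconsistency is the shape of the scores the two estimator methods return. A minimal sketch of the distinction, assuming scikit-learn and the breast cancer dataset used in the examples below:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# decision_function yields one score per sample in the binary case: shape (n,)
svc_scores = LinearSVC(dual=False).fit(X, y).decision_function(X)

# predict_proba yields one column per class: shape (n, 2)
lr_probs = LogisticRegression(max_iter=5000).fit(X, y).predict_proba(X)

# For a true binary problem, the positive-class column carries the ranking
# information needed to draw a single curve
pos_scores = lr_probs[:, 1]

print(svc_scores.shape, lr_probs.shape, pos_scores.shape)
```

Collapsing the (n, 2) probability matrix to its positive-class column is what lets both estimator types take the same single-curve code path.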

Sample Code and Plot

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from yellowbrick.classifier.rocauc import ROCAUC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
viz = ROCAUC(LogisticRegression())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

[plot: single ROC curve for the binary case]

CHECKLIST

  • Is the commit message formatted correctly?
  • Have you noted the new functionality/bugfix in the release notes of the next release?
  • Included a sample plot to visually illustrate your changes?
  • Do all of your functions and methods have docstrings?
  • Have you added/updated unit tests where appropriate?
  • Have you updated the baseline images if necessary?
  • Have you run the unit tests using pytest?
  • Is your code style correct (are you using PEP8, pyflakes)?
  • Have you documented your new feature/functionality in the docs?
  • Have you built the docs using make html?

@lwgray
Contributor

lwgray commented Apr 6, 2020

@VladSkripniuk The ROC Curve Visualizer was built with multi-class cases in mind. If you don't want the micro and macro curves, you can turn them off.

@VladSkripniuk
Contributor Author

@lwgray my concern is that ROCAUC works differently for binary classifiers depending on whether they have decision_function or predict_proba. For example, LinearSVC only has decision_function, and ROCAUC produces a single curve:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

from yellowbrick.classifier.rocauc import ROCAUC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
viz = ROCAUC(LinearSVC(), micro=False, macro=False, per_class=False)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

[plot: single ROC curve from LinearSVC's decision_function]
LogisticRegression, on the other hand, has predict_proba, and the code above raises an exception when LinearSVC is replaced with LogisticRegression:

YellowbrickValueError                     Traceback (most recent call last)
<ipython-input-25-e5a47d9cc874> in <module>()
     16 viz = ROCAUC(LogisticRegression(), micro=False, macro=False, per_class=False)
     17 viz.fit(X_train, y_train)
---> 18 viz.score(X_test, y_test)
     19 viz.show()

~/yb_devenv/yellowbrick/yellowbrick/classifier/rocauc.py in score(self, X, y)
    242             if not self.micro and not self.macro and not self.per_class:
    243                 raise YellowbrickValueError(
--> 244                     "no curves will be drawn; specify micro, macro, or per_class"
    245                 )
    246 

YellowbrickValueError: no curves will be drawn; specify micro, macro, or per_class

So I have to set per_class=True to produce a plot:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from yellowbrick.classifier.rocauc import ROCAUC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
viz = ROCAUC(LogisticRegression(), micro=False, macro=False, per_class=True)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

[plot: per-class ROC curves from LogisticRegression with per_class=True]
For comparison, PrecisionRecallCurve works identically for LinearSVC and LogisticRegression:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from yellowbrick.classifier.prcurve import PrecisionRecallCurve

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
viz = PrecisionRecallCurve(LogisticRegression(), micro=False, macro=False, per_class=False)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

[plot: PrecisionRecallCurve for LogisticRegression]

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

from yellowbrick.classifier.prcurve import PrecisionRecallCurve

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
viz = PrecisionRecallCurve(LinearSVC(), micro=False, macro=False, per_class=False)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

[plot: PrecisionRecallCurve for LinearSVC]

Comment on lines +398 to +401
y_scores = method(X)
if y_scores.ndim == 2 and y_scores.shape[1] == 2:
    y_scores = y_scores[:, 1]
return y_scores
Member


Would prefer to use sklearn.utils.multiclass.type_of_target here just as a side note, but we'd also like the user to be able to specify if this is a true binary case or if per_class is actually valid.
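
As a side note on that suggestion, sklearn.utils.multiclass.type_of_target already distinguishes these cases from the labels alone; a brief sketch (illustrative, not part of this PR):

```python
from sklearn.utils.multiclass import type_of_target

# Two distinct labels: a true binary problem, one curve suffices
print(type_of_target([0, 1, 1, 0]))  # 'binary'

# Three or more labels: multiclass, where per-class curves make sense
print(type_of_target([0, 1, 2, 1]))  # 'multiclass'
```

A user-facing flag would still be needed for the case the reviewer mentions, where two-class data should nonetheless be treated per-class.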

@bbengfort
Member

@VladSkripniuk I've got to say, I was using ROCAUC with the binary case this weekend, and I was also finding the params annoying. @lwgray is right that we built this primarily with the multiclass case in mind, and we also have to be aware of the true binary vs. (a v not a / b v not b) 2-class case. It would, however, be nice to make the logic a bit more robust.

I just took a look over the PR (not a full review, just a glance) and was wondering why so many tests were removed. It looks like there is an interesting start to making the visualizer more robust, though.

@lwgray are you reviewing this PR?

@lwgray
Contributor

lwgray commented Apr 7, 2020

@bbengfort I did start to review this PR, and as we discussed a couple weeks back, I too was concerned about the removed tests and the overall robustness of this solution. I could definitely use your assistance if we are going to work this into a YB-quality PR. We could table it for now, continue the conversation on the original issue page (#1040), and work on the other PRs @VladSkripniuk has posted.

@rebeccabilbro
Member

Hi @VladSkripniuk and thanks for your PR! @bbengfort and I took a look this morning and we're going to go ahead and close this PR in favor of your newer solution in #1056. Stay tuned for a review on that PR in the next few days and thank you for your patience!

Linked issue:

ROCAUC treats binary classification as multiclass for estimators with predict_proba available