
Plot just one ROC curve in the binary case (#1040) #1041

Closed

Conversation

VladSkripniuk
Contributor

Remove per-class ROC curves and micro- and macro- averaging in the case
of binary classification for estimators with predict_proba.

This PR fixes #1040, which reported that ROCAUC draws per-class curves when the estimator has predict_proba. This is inconsistent both with ROCAUC's behavior when the estimator only has decision_function, and with PrecisionRecallCurve, which draws a single curve in the binary case.

I have made the following changes:

  1. Changed ROCAUC to treat binary classifiers with predict_proba as a binary problem, not a multiclass one.
  2. Changed tests accordingly.
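
The root of the inconsistency is the shape of the scores the two estimator methods return. A minimal sketch of the distinction, assuming scikit-learn and the breast cancer dataset used in the examples below:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# decision_function yields one score per sample in the binary case: shape (n,)
svc_scores = LinearSVC(dual=False).fit(X, y).decision_function(X)

# predict_proba yields one column per class: shape (n, 2)
lr_probs = LogisticRegression(max_iter=5000).fit(X, y).predict_proba(X)

# For a true binary problem, the positive-class column carries the ranking
# information needed to draw a single curve
pos_scores = lr_probs[:, 1]

print(svc_scores.shape, lr_probs.shape, pos_scores.shape)
```

Collapsing the (n, 2) probability matrix to its positive-class column is what lets both estimator types take the same single-curve code path.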

Sample Code and Plot

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from yellowbrick.classifier.rocauc import ROCAUC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
viz = ROCAUC(LogisticRegression())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

[plot: single ROC curve for the binary case]

CHECKLIST

  • Is the commit message formatted correctly?
  • Have you noted the new functionality/bugfix in the release notes of the next release?
  • Included a sample plot to visually illustrate your changes?
  • Do all of your functions and methods have docstrings?
  • Have you added/updated unit tests where appropriate?
  • Have you updated the baseline images if necessary?
  • Have you run the unit tests using pytest?
  • Is your code style correct (are you using PEP8, pyflakes)?
  • Have you documented your new feature/functionality in the docs?
  • Have you built the docs using make html?

@lwgray
Contributor

lwgray commented Apr 6, 2020

@VladSkripniuk The ROC Curve Visualizer was built with multi-class cases in mind. If you don't want the micro and macro curves, you can turn them off.

@VladSkripniuk
Contributor Author

@lwgray my concern is that ROCAUC works differently for binary classifiers depending on whether they have decision_function or predict_proba. For example, LinearSVC only has decision_function, and ROCAUC produces a single curve:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

from yellowbrick.classifier.rocauc import ROCAUC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
viz = ROCAUC(LinearSVC(), micro=False, macro=False, per_class=False)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

[plot: single ROC curve from LinearSVC's decision_function]
LogisticRegression, on the other hand, has predict_proba, and the code above raises an exception when LinearSVC is replaced with LogisticRegression:

YellowbrickValueError                     Traceback (most recent call last)
<ipython-input-25-e5a47d9cc874> in <module>()
     16 viz = ROCAUC(LogisticRegression(), micro=False, macro=False, per_class=False)
     17 viz.fit(X_train, y_train)
---> 18 viz.score(X_test, y_test)
     19 viz.show()

~/yb_devenv/yellowbrick/yellowbrick/classifier/rocauc.py in score(self, X, y)
    242             if not self.micro and not self.macro and not self.per_class:
    243                 raise YellowbrickValueError(
--> 244                     "no curves will be drawn; specify micro, macro, or per_class"
    245                 )
    246 

YellowbrickValueError: no curves will be drawn; specify micro, macro, or per_class

So I have to set per_class=True to produce a plot:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from yellowbrick.classifier.rocauc import ROCAUC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
viz = ROCAUC(LogisticRegression(), micro=False, macro=False, per_class=True)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

[plot: per-class ROC curves from LogisticRegression with per_class=True]
For comparison, PrecisionRecallCurve works identically for LinearSVC and LogisticRegression:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from yellowbrick.classifier.prcurve import PrecisionRecallCurve

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
viz = PrecisionRecallCurve(LogisticRegression(), micro=False, macro=False, per_class=False)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

[plot: PrecisionRecallCurve for LogisticRegression]

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

from yellowbrick.classifier.prcurve import PrecisionRecallCurve

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
viz = PrecisionRecallCurve(LinearSVC(), micro=False, macro=False, per_class=False)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

[plot: PrecisionRecallCurve for LinearSVC]

Comment on lines +398 to +401
y_scores = method(X)
if y_scores.ndim == 2 and y_scores.shape[1] == 2:
    y_scores = y_scores[:, 1]
return y_scores
Member


Would prefer to use sklearn.utils.multiclass.type_of_target here just as a side note, but we'd also like the user to be able to specify if this is a true binary case or if per_class is actually valid.
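
As a side note on that suggestion, sklearn.utils.multiclass.type_of_target already distinguishes these cases from the labels alone; a brief sketch (illustrative, not part of this PR):

```python
from sklearn.utils.multiclass import type_of_target

# Two distinct labels: a true binary problem, one curve suffices
print(type_of_target([0, 1, 1, 0]))  # 'binary'

# Three or more labels: multiclass, where per-class curves make sense
print(type_of_target([0, 1, 2, 1]))  # 'multiclass'
```

A user-facing flag would still be needed for the case the reviewer mentions, where two-class data should nonetheless be treated per-class.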

@bbengfort
Member

@VladSkripniuk I've got to say, I was using ROCAUC with the binary case this weekend, and I was also finding the params annoying. @lwgray is right that we built this primarily with the multiclass case in mind, and we also have to be aware of the true binary vs. (a v not a / b v not b) 2-class case. It would, however, be nice to make the logic a bit more robust.

I just took a look over the PR (not a full review, just a glance) and was wondering why so many tests were removed. It looks like there is an interesting start to making the visualizer more robust, though.

@lwgray are you reviewing this PR?

@lwgray
Contributor

lwgray commented Apr 7, 2020

@bbengfort I did start to review this PR, and as we discussed a couple weeks back, I too was concerned about the removed tests and the overall robustness of this solution. I could definitely use your assistance if we are going to work this into a YB-quality PR. We could table it for now, continue the conversation on the original issue page (#1040), and work on the other PRs @VladSkripniuk has posted.

@rebeccabilbro
Member

Hi @VladSkripniuk and thanks for your PR! @bbengfort and I took a look this morning and we're going to go ahead and close this PR in favor of your newer solution in #1056. Stay tuned for a review on that PR in the next few days and thank you for your patience!

Linked issue:

ROCAUC treats binary classification as multiclass for estimators with predict_proba available