
Add "scoring" argument to score #28995

Open
amueller opened this issue May 10, 2024 · 22 comments

@amueller
Member

amueller commented May 10, 2024

Describe the workflow you want to enable

I want to enable non-accuracy metrics in estimator.score, and ultimately deprecate the default metrics of accuracy and r2. I would call the argument scoring; it's a bit redundant, but consistent with the rest of the API.
That would allow us to get rid of the default scoring methods, which are objectively bad and misleading, and it would require minimal code changes for users.

Describe your proposed solution

Replace

est.score(X, y)

with

est.score(X, y, scoring="accuracy")

or rather

est.score(X, y, scoring="recall_macro")

(or r2 for regression).

Describe alternatives you've considered, if relevant

  • Keep the current behavior, which is bad (both accuracy and r2 are bad defaults)
  • Remove the score method.

I can't think of any other alternatives tbh.

Additional context

I think in theory this requires a SLEP, as it's changing shared API, right?

@amueller amueller added New Feature Needs Triage Issue requires triage labels May 10, 2024
@PragyanTiwari

PragyanTiwari commented May 12, 2024

Well, there is already a separate module for importing metrics.
This score function uses a default metric based on your estimator.

Can't we just make a bridge between the scoring parameter you want and the sklearn.metrics module, such that it refers to the metric from that module?

Besides this, there are a lot of non-accuracy metrics... which ones should we focus on?

@glevv
Contributor

glevv commented May 12, 2024

I agree that the current score behavior is at best useless, at worst bad.

But IMO, the question is a bit broader: does scikit-learn want to have more syntactic sugar or not?

If yes, then adding the possibility of choosing the metric in the score method would be the way to go. If not, removing the score method, which would be a breaking change, would simplify the codebase and maintenance.

@thomasjpfan
Member

Can't we just make a bridge between the scoring parameter you want and the sklearn.metrics module, such that it refers to the metric from that module?

The scorers are the bridge to sklearn.metrics. The scorers add additional metadata on top of sklearn.metrics that tells the estimator what it needs to compute for a specific metric. For example:

from sklearn.metrics import make_scorer, top_k_accuracy_score

top_k_accuracy_scorer = make_scorer(
    top_k_accuracy_score,
    greater_is_better=True,
    response_method=("decision_function", "predict_proba"),
)
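
For context, a scorer built this way is called with the estimator and the raw data, so it can pick the appropriate response method itself (a usage sketch, assuming a fitted classifier clf and held-out X_test, y_test):

# The scorer decides internally whether to use decision_function or predict_proba.
score = top_k_accuracy_scorer(clf, X_test, y_test)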

But IMO, the question is a bit broader: does scikit-learn want to have more syntactic sugar or not?

I'm okay with adding a scoring argument to est.score because it makes it easier to switch between metrics. I think there is an API gap between using est.score and using a scorer or a metric from sklearn.metrics directly. For example, to use the ROC AUC scorer:

# User needs to know about `get_scorer`
from sklearn.metrics import get_scorer

roc_auc_scorer = get_scorer("roc_auc")
roc_auc_scorer(est, X_test, y_test)

Or using the metrics directly:

from sklearn.metrics import roc_auc_score

# User needs to know that `roc_auc_score` requires `y_score`:
y_decision = est.decision_function(X_test)
roc_auc_score(y_test, y_decision)

# Or, if the estimator only has `predict_proba`:
roc_auc_score(y_test, est.predict_proba(X_test)[:, 1])

With this proposal, we can write est.score(X_test, y_test, scoring="roc_auc").
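
As a rough sketch (not the actual implementation; the signature and default below are assumptions), the new argument could simply delegate to the existing scorer registry:

from sklearn.metrics import get_scorer

# Hypothetical wiring inside a classifier's `score` method; the `scoring="accuracy"`
# default is only illustrative.
def score(self, X, y, scoring="accuracy"):
    scorer = get_scorer(scoring)  # resolve the string name to a scorer callable
    return scorer(self, X, y)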

@adrinjalali
Member

I would be in favor of deprecating default scorers.

A few related points:

  • Do we let users do something like estimator.set_scorer("roc_auc") and use it in *SearchCV? I'd rather not; I prefer to always require users to specify a scorer when calling score.
  • We can start by deprecating scoring=None in all search methods/classes.
  • Can users pass anything that check_scoring accepts to estimator.score? I guess yes, while deprecating _Passthrough, I'd say.

This probably does need a SLEP.

@glemaitre glemaitre added API and removed Needs Triage Issue requires triage labels May 13, 2024
@glemaitre
Member

I think it needs a SLEP because it is going to deeply change the API.

Regarding the default scorers, considering the difficulty of selecting the right metric for each use case, not having a default might be better.

Do we let users do something like estimator.set_scorer("roc_auc") and use it in *SearchCV?

To me this is an extension of the API. We don't need to settle on this at first. Having the scoring parameter everywhere and behaving consistently would already be good.

Can users pass anything that check_scoring accepts to estimator.score

I would say yes, but it means that we need to properly define what the scorer API is. For instance, there is already this difference between a metric scorer and a curve scorer (which does not exist yet but probably should).

@betatim
Member

betatim commented May 13, 2024

Why call the argument scoring? For me something like scorer sounds better, maybe just because it isn't a verb :-/

est.score(X, y, scorer="accuracy")
est.score(X, y, scorer="recall_macro")

I am -1 on set_scorer.

I think providing a default metric as part of an estimator is useful for users, a bit like the default for the number of trees in a random forest. While ten (I think the current default) is rarely the right value, I think it is useful that step 1 of using RandomForestClassifier isn't "think about a (random) value for each of its constructor arguments so that it starts working".

Same with the metric you use to evaluate your setup. You almost always want to think about it and make a deliberate choice, but the fact that there is a default to get you started is a good thing imho.

@amueller
Member Author

I agree on not adding set_scorer. Also agree with the rest of what @adrinjalali suggested.

I would say yes, but it means that we need to properly define what the scorer API is. For instance, there is already this difference between a metric scorer and a curve scorer (which does not exist yet but probably should).

Can you elaborate on that? I think I'm not familiar with the proposal.

@betatim I suggested scoring because that's the argument in cross_validate and GridSearchCV, and they are probably named that just because we had scorer before with a different API and we had to do a deprecation cycle?

While ten (I think the current default) is rarely the right value, I

It's 100, we fixed that quite a while ago :)

I think the default parameters and the default metric have indeed similar properties, though I'd say the consequences of the choice of metric are even worse.
The problem on both accounts is that people don't do the second step of actually thinking about it.

There are some hyper-parameters that have a huge impact, like kernel in SVC, but I think not tuning hyper-parameters is rarely as bad as using accuracy on an imbalanced classification problem (which is all classification problems).
Also, knowing reasonable defaults for all the models is not really feasible; but if you don't know how to pick at least a reasonable metric for your task, you probably shouldn't build a model yet.

I proposed two changes: adding the keyword arg, and removing the default. Just adding the keyword arg would at least smooth the path from using accuracy to using something more relevant.

@PragyanTiwari

PragyanTiwari commented May 14, 2024

What would be the default value of scorer?

scorer = None?

If you compare it to GridSearchCV etc., they usually come with 'accuracy' as the default scoring technique.

Moreover, the scorer value would vary a lot when running multiple benchmark models.

My question is: would you settle on None as the default value?

@adrinjalali
Member

I think we're hurting people by not forcing them to think about their performance metrics. Those metrics have real-life consequences, and our defaults make no sense and have nothing to do with the real-life use case at hand for the user.

We should have a much better documentation page on how to choose the metric for that purpose, and then we should force people to choose themselves.

@amueller
Member Author

So....

  1. add the scoring argument
  2. add the documentation
  3. deprecate the default?

Or

  1. add the documentation
  2. add the scoring argument and deprecate the default?

I'm happy to write the SLEP after the NeurIPS deadline (and my vacation).

@adrinjalali
Member

I'd go more with the first version; it feels smoother. Happy to review the SLEP.

@PragyanTiwari

Will you open a PR? I'm very eager to contribute.

@adrinjalali
Member

@PragyanTiwari this first needs extensive discussion via a SLEP, which @amueller will work on, and then we'll get to the implementation. This is also not a good first contribution candidate.

@glemaitre
Member

Will you open a PR? I'm very eager to contribute.

@PragyanTiwari This is probably not the best issue to start with because we are going to need a scikit-learn enhancement proposal (SLEP) that will require much more thinking before actually doing a PR that can be merged.

@glemaitre
Member

I would say yes, but it means that we need to properly define what the scorer API is. For instance, there is already this difference between a metric scorer and a curve scorer (which does not exist yet but probably should).

Can you elaborate on that? I think I'm not familiar with the proposal.

When speaking about the curve scorer, I'm referring to roc_curve and precision_recall_curve. Basically, those metrics return 3 arrays. In some way, the PR linked to TunedThresholdClassifierCV showed that you could abuse make_scorer with make_scorer(roc_curve, response_method="predict_proba"), which will return the scores and thresholds. I assume it makes sense to have a "scorer" API in the sense of being able to call curve_scorer(estimator, X, y) and get that information back, but it should not be allowed with make_scorer, which should only accept metrics that return a single value.
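
To illustrate the distinction, here is a minimal sketch of what such a curve scorer could look like (curve_scorer is a plain function written for this example, not an existing scikit-learn API, and it assumes a fitted binary classifier with predict_proba):

from sklearn.metrics import roc_curve

def curve_scorer(estimator, X, y):
    # Follows the scorer call signature but returns (fpr, tpr, thresholds)
    # rather than the single float that make_scorer is meant to produce.
    y_score = estimator.predict_proba(X)[:, 1]
    return roc_curve(y, y_score)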

@PragyanTiwari

Ok....thanks for the info

@betatim
Member

betatim commented May 15, 2024

@betatim I suggested scoring because that's the argument in cross_validate and GridSearchCV, and they are probably named that just because we had scorer before with a different API and we had to do a deprecation cycle?

Ah ok, well, then let's stick to it :D

The problem on both accounts is that people don't do the second step of actually thinking about it.

Agreed. For me the question is whether forcing people to think about it is the solution. My guess as to why people keep doing "crazy" things like using accuracy for imbalanced problems is that they don't realise it and have no idea where to even start thinking about this. Which isn't a novel insight, so it has probably been discussed to death. The philosophical question (for me) is whether we actually do these kinds of people a service by forcing them to make a decision about something they don't even know where to start reading about. I am sure people use tools that are easy to use and get started with, no matter how methodologically flawed they are. Which makes me think we need to maintain an easy on-ramp and then find a way of educating people after the fact, somehow (the hard part). Otherwise they will leave and use simple-scikit-learn, the fork with all the wrong defaults :D

I proposed two changes: adding the keyword arg, and removing the default. Just adding the keyword arg would at least smooth the path from using accuracy to using something more relevant.

The keyword-arg change is an easy yes for me. Removing the default is a more difficult/philosophical question for me, on which I flip-flop regarding "the thing we should do" :-/

@glevv
Contributor

glevv commented May 15, 2024

The question on my mind is this: will scoring be able to take a callable as input (after the make_scorer transformation)?

Because, IMO, this is crucial functionality, especially if we are talking about choosing the right metric for the task. A lot of real-world problems require specific mathematical or business metrics that users will need to code themselves.

@glemaitre
Member

@glevv scoring will accept a scorer as produced by make_scorer.
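
For example (a sketch assuming the proposed scoring argument accepts the same objects as cross_validate; cost_weighted_error is a made-up business metric):

import numpy as np
from sklearn.metrics import make_scorer

def cost_weighted_error(y_true, y_pred):
    # Hypothetical asymmetric cost: false negatives are five times as costly.
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return fp + 5 * fn

cost_scorer = make_scorer(cost_weighted_error, greater_is_better=False)
# Proposed usage (not yet implemented): est.score(X_test, y_test, scoring=cost_scorer)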

@adrinjalali
Member

@betatim

The philosophical question (for me) is whether we actually do these kinds of people a service by forcing them to make a decision about something they don't even know where to start reading about.

That's why, to me, it's crucial to have a page where people can easily understand what they should use. We should write that first. Then, when they don't provide a scorer, we raise an error with a link to the page where they can read about it.

@adamjstewart
Contributor

Related discussion in #29065. I'm using the scoring parameter for GridSearchCV, but currently it only applies to fit, not to score. I would like to be able to compute the same metrics for both. Example code looks like:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

estimator = LogisticRegression(n_jobs=-1)
param_grid = {"C": [10, 1.0, 0.1]}
score = ["accuracy", "balanced_accuracy", "precision_macro", "recall_macro", "f1_macro"]
grid_search = GridSearchCV(estimator, param_grid, scoring=score, refit="accuracy")
grid_search.fit(X_train, y_train)
grid_search.score(X_test, y_test)

At the moment, score only computes accuracy. If I want to compute the same metrics as fit, I have to replace the last line with:

metrics = {}
for name, scorer in grid_search.scorer_.items():
    score = scorer(grid_search, X_test, y_test)
    metrics[name] = score

IMO, sklearn is all about syntactic sugar. The fact that I can perform model training, cross-validation, hyperparameter tuning, and refitting in a single line is awesome. The fact that evaluating on the test set takes 4 lines is a bit of a surprise.

@thomasjpfan
Member

In the case of #28995 (comment), if we introduce score(X, y, scoring=None) and scoring is already set in GridSearchCV, then there is a valid default for grid_search.score.

I'm liking the proposal more because it can also support multiple metrics, i.e.

est.score(
    X, y,
    scoring=["accuracy", "balanced_accuracy", "precision_macro", "recall_macro", "f1_macro"],
)
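
Today the same result can be emulated with the public check_scoring helper (a sketch assuming est is a fitted classifier and X, y are held-out data); the proposal would essentially fold this loop into score:

from sklearn.metrics import check_scoring

scoring = ["accuracy", "balanced_accuracy", "precision_macro", "recall_macro", "f1_macro"]
scores = {name: check_scoring(est, scoring=name)(est, X, y) for name in scoring}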
