
Add "scoring" argument to score #28995

Open
amueller opened this issue May 10, 2024 · 22 comments

@amueller
Member

amueller commented May 10, 2024

Describe the workflow you want to enable

I want to enable non-accuracy metrics in estimator.score, and ultimately deprecate the default metrics of accuracy and r2. I would call the argument scoring; it's a bit redundant, but consistent with the rest of the API.
That would allow us to get rid of the default scoring methods, which are objectively bad and misleading, and it would require minimal code changes for users.

Describe your proposed solution

Replace

est.score(X, y)

with

est.score(X, y, scoring="accuracy")

or rather

est.score(X, y, scoring="recall_macro")

(or r2 for regression).

Describe alternatives you've considered, if relevant

  • Keep the current behavior, which is bad (both accuracy and r2 are bad defaults)
  • Remove the score method.

I can't think of any other alternatives tbh.

Additional context

I think in theory this requires a SLEP, as it's changing shared API, right?

@amueller amueller added New Feature Needs Triage Issue requires triage labels May 10, 2024
@PragyanTiwari

PragyanTiwari commented May 12, 2024

Well, there is already a separate module for importing metrics.
This score function uses a default metric based on your estimator.

Can't we just make a bridge between the scoring parameter you want and the sklearn.metrics module, such that it refers to the metric from that module?

Besides this, there are a lot of non-accuracy metrics... which ones should we focus on?

@glevv
Contributor

glevv commented May 12, 2024

I agree that the current score behavior is at best useless, at worst bad.

But IMO, the question is a bit broader: does scikit-learn want to have more syntactic sugar or not?

If yes, then adding the possibility of choosing the metric in the score method would be the way to go. If not, removing the score method, which would be a breaking change, would simplify the codebase and maintenance.

@thomasjpfan
Member

Can't we just make a bridge between the scoring parameter you want and the sklearn.metrics module, such that it refers to the metric from that module?

The scorers are the bridge to sklearn.metrics. The scorers add additional metadata on top of sklearn.metrics that tells the estimator what it needs to compute for a specific metric. For example:

from sklearn.metrics import make_scorer, top_k_accuracy_score

top_k_accuracy_scorer = make_scorer(
    top_k_accuracy_score,
    greater_is_better=True,
    response_method=("decision_function", "predict_proba"),
)
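
For context, a scorer built this way is called with the estimator and the raw data, so it can pick the appropriate response method itself (a usage sketch, assuming a fitted classifier clf and held-out X_test, y_test):

# The scorer decides internally whether to use decision_function or predict_proba.
score = top_k_accuracy_scorer(clf, X_test, y_test)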

But IMO, the question is a bit broader: does scikit-learn want to have more syntactic sugar or not?

I'm okay with adding a scoring argument to est.score because it makes it easier to switch between metrics. I think there is an API gap between using est.score and using a scorer or a metric from sklearn.metrics directly. For example, to use the ROC AUC scorer:

# User needs to know about `get_scorer`
from sklearn.metrics import get_scorer

roc_auc_scorer = get_scorer("roc_auc")
roc_auc_scorer(est, X_test, y_test)

Or using the metrics directly:

from sklearn.metrics import roc_auc_score

# User needs to know that `roc_auc_score` requires `y_score`:
y_decision = est.decision_function(X_test)
roc_auc_score(y_test, y_decision)

# Or, if the estimator only has `predict_proba`:
roc_auc_score(y_test, est.predict_proba(X_test)[:, 1])

With this proposal, we can write est.score(X_test, y_test, scoring="roc_auc").
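
As a rough sketch (not the actual implementation; the signature and default below are assumptions), the new argument could simply delegate to the existing scorer registry:

from sklearn.metrics import get_scorer

# Hypothetical wiring inside a classifier's `score` method; the `scoring="accuracy"`
# default is only illustrative.
def score(self, X, y, scoring="accuracy"):
    scorer = get_scorer(scoring)  # resolve the string name to a scorer callable
    return scorer(self, X, y)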

@adrinjalali
Member

I would be in favor of deprecating default scorers.

A few related points:

  • Do we let users do something like estimator.set_scorer("roc_auc") and use it in *SearchCV? I'd rather not; I prefer to always require users to specify a scorer when calling score.
  • We can start by deprecating scoring=None in all search methods/classes.
  • Can users pass anything that check_scoring accepts to estimator.score? I guess yes, while deprecating _Passthrough, I'd say.

This probably does need a SLEP.

@glemaitre glemaitre added API and removed Needs Triage Issue requires triage labels May 13, 2024
@glemaitre
Member

I think it needs a SLEP because it is going to deeply change the API.

Regarding the default scorers, considering the difficulty of selecting the right metric for each use case, not having a default might be better.

Do we let users do something like estimator.set_scorer("roc_auc") and use it in *SearchCV?

To me this is an extension of the API. We don't need to settle on this at first. Having the scoring parameter everywhere and behaving consistently would already be good.

Can users pass anything that check_scoring accepts to estimator.score

I would say yes, but it means that we need to properly define what the scorer API is. For instance, there is already this difference between a metric scorer and a curve scorer (which does not exist yet but probably should).

@betatim
Member

betatim commented May 13, 2024

Why call the argument scoring? For me something like scorer sounds better, maybe just because it isn't a verb :-/

est.score(X, y, scorer="accuracy")
est.score(X, y, scorer="recall_macro")

I am -1 on set_scorer.

I think providing a default metric as part of an estimator is useful for users, a bit like the default for the number of trees in a random forest. While ten (I think the current default) is rarely the right value, I think it is useful that step 1 of using RandomForestClassifier isn't "think about a (random) value for each of its constructor arguments so that it starts working".

Same with the metric you use to evaluate your setup. You almost always want to think about it and make a deliberate choice, but the fact that there is a default to get you started is a good thing imho.

@amueller
Member Author

I agree on not adding set_scorer. Also agree with the rest of what @adrinjalali suggested.

I would say yes, but it means that we need to properly define what the scorer API is. For instance, there is already this difference between a metric scorer and a curve scorer (which does not exist yet but probably should).

Can you elaborate on that? I think I'm not familiar with the proposal.

@betatim I suggested scoring because that's the argument in cross_validate and GridSearchCV, and they are probably named that just because we had scorer before with a different API and we had to do a deprecation cycle?

While ten (I think the current default) is rarely the right value, I

It's 100, we fixed that quite a while ago :)

I think the default parameters and the default metric have indeed similar properties, though I'd say the consequences of the choice of metric are even worse.
The problem on both accounts is that people don't do the second step of actually thinking about it.

There are some hyper-parameters that have a huge impact, like kernel in SVC, but I think not tuning hyper-parameters is rarely as bad as using accuracy on an imbalanced classification problem (which is all classification problems).
Also, knowing reasonable defaults for all the models is not really feasible; but if you don't know how to pick at least a reasonable metric for your task, you probably shouldn't build a model yet.

I proposed two changes: adding the keyword arg, and removing the default. Just adding the keyword arg would at least smooth the path from using accuracy to using something more relevant.

@PragyanTiwari

PragyanTiwari commented May 14, 2024

What would be the default value of scorer?

scorer = None?

If you compare it to GridSearchCV etc., they usually come with 'accuracy' as the default scoring technique.

Moreover, the scorer value would vary a lot when running multiple benchmark models.

My question is: would you settle on None as the default value?

@adrinjalali
Member

I think we're hurting people by not forcing them to think about their performance metrics. Those metrics have real-life consequences, and our defaults make no sense and have nothing to do with the real-life use case at hand for the user.

We should have a much better documentation page on how to choose the metric for that purpose, and then we should force people to choose themselves.

@amueller
Member Author

So....

  1. add the scoring argument
  2. add the documentation
  3. deprecate the default?

Or

  1. add the documentation
  2. add the scoring argument and deprecate the default?

I'm happy to write the SLEP after the NeurIPS deadline (and my vacation).

@adrinjalali
Member

I'd go more with the first version; it feels smoother. Happy to review the SLEP.

@PragyanTiwari

Will you open a PR? I'm very eager to contribute.

@adrinjalali
Member

@PragyanTiwari this first needs extensive discussion via a SLEP, which @amueller will work on, and then we'll get to the implementation. This is also not a good first contribution candidate.

@glemaitre
Member

Will you open a PR? I'm very eager to contribute.

@PragyanTiwari This is probably not the best issue to start with because we are going to need a scikit-learn enhancement proposal (SLEP) that will require much more thinking before actually doing a PR that can be merged.

@glemaitre
Member

I would say yes, but it means that we need to properly define what the scorer API is. For instance, there is already this difference between a metric scorer and a curve scorer (which does not exist yet but probably should).

Can you elaborate on that? I think I'm not familiar with the proposal.

When speaking about the curve scorer, I'm referring to roc_curve and precision_recall_curve. Basically, those metrics return 3 arrays. In some way, the PR linked to TunedThresholdClassifierCV showed that you could abuse make_scorer with make_scorer(roc_curve, response_method="predict_proba"), which will return the scores and thresholds. I assume it makes sense to have a "scorer" API in the sense of being able to call curve_scorer(estimator, X, y) and get that information back, but it should not be allowed with make_scorer, which should only accept metrics that return a single value.
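
To illustrate the distinction, here is a minimal sketch of what such a curve scorer could look like (curve_scorer is a plain function written for this example, not an existing scikit-learn API, and it assumes a fitted binary classifier with predict_proba):

from sklearn.metrics import roc_curve

def curve_scorer(estimator, X, y):
    # Follows the scorer call signature but returns (fpr, tpr, thresholds)
    # rather than the single float that make_scorer is meant to produce.
    y_score = estimator.predict_proba(X)[:, 1]
    return roc_curve(y, y_score)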

@PragyanTiwari

Ok....thanks for the info

@betatim
Member

betatim commented May 15, 2024

@betatim I suggested scoring because that's the argument in cross_validate and GridSearchCV, and they are probably named that just because we had scorer before with a different API and we had to do a deprecation cycle?

Ah ok, well, then let's stick to it :D

The problem on both accounts is that people don't do the second step of actually thinking about it.

Agreed. For me the question is whether forcing people to think about it is the solution. My guess as to why people keep doing "crazy" things like using accuracy for imbalanced problems is that they don't realise it and have no idea where to even start thinking about this. Which isn't a novel insight, so it has probably been discussed to death. The philosophical question (for me) is whether we actually do these kinds of people a service by forcing them to make a decision about something they don't even know where to start reading about. I am sure people use tools that are easy to use and get started with, no matter how methodologically flawed they are. Which makes me think we need to maintain an easy on-ramp and then find a way of educating people after the fact, somehow (the hard part). Otherwise they will leave and use simple-scikit-learn, the fork with all the wrong defaults :D

I proposed two changes: adding the keyword arg, and removing the default. Just adding the keyword arg would at least smooth the path from using accuracy to using something more relevant.

The keyword-arg change is an easy yes for me. Removing the default is a more difficult/philosophical question for me, on which I flip-flop regarding "the thing we should do" :-/

@glevv
Contributor

glevv commented May 15, 2024

The question on my mind is this: will scoring be able to take a callable as input (after the make_scorer transformation)?

Because, IMO, this is crucial functionality, especially if we are talking about choosing the right metric for the task. A lot of real-world problems require specific mathematical or business metrics that users will need to code themselves.

@glemaitre
Member

@glevv scoring will accept a scorer as produced by make_scorer.
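
For example (a sketch assuming the proposed scoring argument accepts the same objects as cross_validate; cost_weighted_error is a made-up business metric):

import numpy as np
from sklearn.metrics import make_scorer

def cost_weighted_error(y_true, y_pred):
    # Hypothetical asymmetric cost: false negatives are five times as costly.
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return fp + 5 * fn

cost_scorer = make_scorer(cost_weighted_error, greater_is_better=False)
# Proposed usage (not yet implemented): est.score(X_test, y_test, scoring=cost_scorer)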

@adrinjalali
Member

@betatim

The philosophical question (for me) is whether we actually do these kinds of people a service by forcing them to make a decision about something they don't even know where to start reading about.

That's why, to me, it's crucial to have a page where people can easily understand what they should use. We should write that first. Then, when they don't provide a scorer, we raise an error with a link to the page where they can read about it.

@adamjstewart
Contributor

Related discussion in #29065. I'm using the scoring parameter for GridSearchCV, but currently it only applies to fit, not to score. I would like to be able to compute the same metrics for both. Example code looks like:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

estimator = LogisticRegression(n_jobs=-1)
param_grid = {"C": [10, 1.0, 0.1]}
score = ["accuracy", "balanced_accuracy", "precision_macro", "recall_macro", "f1_macro"]
grid_search = GridSearchCV(estimator, param_grid, scoring=score, refit="accuracy")
grid_search.fit(X_train, y_train)
grid_search.score(X_test, y_test)

At the moment, score only computes accuracy. If I want to compute the same metrics as fit, I have to replace the last line with:

metrics = {}
for name, scorer in grid_search.scorer_.items():
    score = scorer(grid_search, X_test, y_test)
    metrics[name] = score

IMO, sklearn is all about syntactic sugar. The fact that I can perform model training, cross-validation, hyperparameter tuning, and refitting in a single line is awesome. The fact that evaluating on the test set takes 4 lines is a bit of a surprise.

@thomasjpfan
Member

In the case of #28995 (comment), if we introduce score(X, y, scoring=None) and scoring is already set in GridSearchCV, then there is a valid default for grid_search.score.

I'm liking the proposal more because it can also support multiple metrics, i.e.

est.score(
    X, y,
    scoring=["accuracy", "balanced_accuracy", "precision_macro", "recall_macro", "f1_macro"],
)
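
Today the same result can be emulated with the public check_scoring helper (a sketch assuming est is a fitted classifier and X, y are held-out data); the proposal would essentially fold this loop into score:

from sklearn.metrics import check_scoring

scoring = ["accuracy", "balanced_accuracy", "precision_macro", "recall_macro", "f1_macro"]
scores = {name: check_scoring(est, scoring=name)(est, X, y) for name in scoring}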
