
Replace balanced_accuracy with macro-averaged recall from sklearn #108

Closed
rhiever opened this issue Mar 8, 2016 · 18 comments

Comments

@rhiever
Contributor

rhiever commented Mar 8, 2016

From conversations with @amueller, we discovered that "balanced accuracy" (as we've called it) is also known as "macro-averaged recall", which is already implemented in sklearn. As such, we don't need our own custom implementation of balanced_accuracy in TPOT. Let's refactor TPOT to replace balanced_accuracy with recall_score.

The correct call is:

recall_score(y_test, predictions, average='macro')

where y_test is class and predictions is guess in our case.

Here's some code that compares the two and confirms that they're the same:

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer sklearn versions
from sklearn.metrics import recall_score
import numpy as np
import pandas as pd

digits = load_digits(10)
features, labels = digits['data'], digits['target']

X_train, X_test, y_train, y_test = train_test_split(features, labels, train_size=0.75, test_size=0.25)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X_train, y_train)

# Per-class recall (TP / (TP + FN)), averaged with equal weight per class.
def balanced_accuracy(result):
    all_classes = list(set(result['class'].values))
    all_class_accuracies = []
    for this_class in all_classes:
        this_class_accuracy = len(result[(result['guess'] == this_class) & (result['class'] == this_class)])\
            / float(len(result[result['class'] == this_class]))
        all_class_accuracies.append(this_class_accuracy)

    balanced_accuracy = np.mean(all_class_accuracies)

    return balanced_accuracy

predictions = clf.predict(X_test)

print('Macro-averaged recall:\t', recall_score(y_test, predictions, average='macro'))

data = pd.DataFrame({'class': y_test,
                     'guess': predictions})

print('Balanced accuracy:\t', balanced_accuracy(data))
rhiever changed the title from "Replaced balanced_accuracy with macro-averaged recall from sklearn" to "Replace balanced_accuracy with macro-averaged recall from sklearn" Mar 8, 2016
@rhiever
Contributor Author

rhiever commented Mar 8, 2016

Possibly not true after further validation. Closing this issue until we figure it out.

@rhiever rhiever closed this as completed Mar 8, 2016
@amueller

amueller commented May 2, 2016

What's the definition of balanced accuracy? Is it 1 - balanced error rate? Then this should be true.
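Assuming "balanced error rate" means the mean of the per-class error rates, the identity is immediate; a quick numeric check with made-up per-class recalls (an illustration, not code from this issue):

import numpy as np

per_class_recall = np.array([0.8, 0.5, 0.9])   # made-up per-class recalls
per_class_error = 1.0 - per_class_recall       # per-class error rates
ber = per_class_error.mean()                   # balanced error rate = mean per-class error

# 1 - BER equals the mean per-class recall, i.e. macro-averaged recall.
print(1.0 - ber)                  # 0.7333...
print(per_class_recall.mean())    # 0.7333... -- same number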

@rhiever
Contributor Author

rhiever commented Jun 9, 2016

I'm reopening this issue now that I'm unsure again. The primary difference seems to be that our implementation of balanced accuracy also takes into account TNR, whereas other implementations only take into account TPR (recall).

I don't quite understand the intuition behind not including TNR in the multiclass case. I understand that in the binary classification case, TPR for class 0 = TNR for class 1. In the multiclass case that becomes muddled: TPR for class 0 = TNR for all other classes.
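To make that concrete, here is a small sketch (made-up confusion matrices, not TPOT code) of one-vs-rest TPR and TNR computed from a confusion matrix. In the binary case the TNR of class 1 is exactly the TPR of class 0, but with three classes the one-vs-rest TNR pools the rows of several other classes, so the per-class identity breaks down:

import numpy as np

def tpr(cm, k):
    # Recall for class k: true k's predicted as k, over all true k's.
    return cm[k, k] / cm[k, :].sum()

def tnr(cm, k):
    # One-vs-rest specificity for class k: among samples whose true class
    # is not k, the fraction that were not predicted as k.
    mask = np.arange(cm.shape[0]) != k
    negatives = cm[mask, :]
    return negatives[:, mask].sum() / negatives.sum()

cm_binary = np.array([[8, 2],
                      [3, 7]])
print(tpr(cm_binary, 0), tnr(cm_binary, 1))   # 0.8 and 0.8 -- identical

cm_multi = np.array([[5, 1, 0],
                     [2, 6, 1],
                     [0, 2, 7]])
print(tpr(cm_multi, 0), tnr(cm_multi, 1))     # 0.833... vs 0.8 -- no longer equal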

@amueller

amueller commented Jun 9, 2016

In your formula, len(result[result['class'] == this_class]) is just np.sum(result['class'] == this_class), right?

So you compute for each class

TP / (TP + FN)

which is recall.

And then average over classes, right? That's what your code says and that's what Wikipedia suggests, I think (though only for the two-class case: https://en.wikipedia.org/wiki/Accuracy_and_precision).

Computing recall and averaging over all classes is macro average recall.
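For concreteness, a quick check of that equivalence with made-up labels (not data from this issue):

import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 0, 1, 2, 2, 2, 0, 2])

# Per-class recall (TP / (TP + FN)), then an unweighted mean over classes.
per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]

print(np.mean(per_class))                             # 0.6388...
print(recall_score(y_true, y_pred, average='macro'))  # same value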

I'm not sure what you mean by not including TNR. Your definition, as in the code above, doesn't include it, right? Do you have a reference for that being the semantics of balanced accuracy in the multi-class case?

I don't think arguing about whether it is a good metric or whether it should be changed is a good idea if you use a name that already has particular semantics. In the multi-class case those semantics don't seem to be well established, but it would be good to know what other people mean by the term.

@rhiever rhiever closed this as completed Jun 17, 2016
@rhiever
Contributor Author

rhiever commented Jun 17, 2016

See scikit-learn/scikit-learn#6747 (comment) for a detailed discussion of balanced accuracy. I think there is consensus to add this metric to sklearn now.

@amueller

So the one you are using now is different from the one you posted above, right?
I wouldn't say there is consensus, but we can discuss it there.

@rhiever
Contributor Author

rhiever commented Jun 17, 2016

so the one you are using now is different from the one you posted above, right?

Yes, that's correct.

@kegl

kegl commented Jul 21, 2017

I went through this thread and the related sklearn thread, and it's not clear to me what the consensus is. Somebody asked me to use the balanced accuracy from here: scoring_program/libscores.py, line 187. Should I clean this up, or can I use recall_score(..., average='macro') from sklearn?

@amueller

@kegl I would have hoped you could tell us ;) There are multiple definitions of balanced accuracy: one of them is recall_score(..., average='macro') and another is something different; see scikit-learn/scikit-learn#8066.
It looks like https://github.com/ch-imad/AutoMl_Challenge/blob/2353ec0/Starting_kit/scoring_program/libscores.py#L187 implements recall_score(..., average='macro'); see @jnothman's comment. Whoever told you to use this metric should have given you a paper reference or used a more specific name ;)

@amueller

@kegl are you doing binary classification? If so, it's pretty clear and using the macro average should be fine. If it's multi-class, it's a bit less clear.

@kegl

kegl commented Jul 21, 2017

No, it's multiclass.

@weixuanfu
Contributor

weixuanfu commented Jul 21, 2017

@kegl you may try the balanced_accuracy in tpot.metrics

@kegl

kegl commented Jul 21, 2017

OK, thanks!

@amueller

@kegl the one in that toolkit is "adjusted for chance", though, and the one in TPOT is not. So that toolkit does macro-averaged recall, but adjusted for chance.
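For reference, "adjusted for chance" usually means rescaling so that a chance-level classifier scores 0 while a perfect one still scores 1 (this is also what sklearn's later balanced_accuracy_score(..., adjusted=True) does); whether the AutoML toolkit uses exactly this form is an assumption here:

# Hedged sketch of a chance adjustment for a macro-averaged recall score.
def adjust_for_chance(macro_recall, n_classes):
    chance = 1.0 / n_classes   # expected score of a random or constant classifier
    return (macro_recall - chance) / (1.0 - chance)

print(adjust_for_chance(0.64, n_classes=3))   # 0.64 raw -> 0.46 adjusted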

@amueller

Meanwhile, tpot.metrics does macro-averaged accuracy.

@kegl

kegl commented Aug 28, 2017

OK, so the TPOT version is exactly sklearn.metrics.recall_score(y_true, y_pred, average='macro'), and the AutoML score adjusts this by https://github.com/ch-imad/AutoMl_Challenge/blob/2353ec0/Starting_kit/scoring_program/libscores.py#L210, right?

@amueller

No, the TPOT version is something else entirely. It's macro-average accuracy, not macro-average recall.
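To spell out the distinction, here is a hedged sketch, assuming "macro-average accuracy" means averaging the per-class one-vs-rest accuracy, (TP + TN) / N, which is where the TNR discussed earlier enters. This is an illustration with made-up labels, not a copy of TPOT's actual implementation:

import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 0, 1, 2, 2, 2, 0, 2])
classes = np.unique(y_true)

# Macro-averaged recall: per-class TP / (TP + FN), averaged over classes.
macro_recall = recall_score(y_true, y_pred, average='macro')

# Assumed "macro-averaged accuracy": per-class one-vs-rest accuracy
# (TP + TN) / N, averaged over classes -- true negatives count here.
macro_accuracy = np.mean([np.mean((y_true == c) == (y_pred == c)) for c in classes])

print(macro_recall, macro_accuracy)   # 0.6388... vs 0.7777... -- different metrics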

@kegl

kegl commented Aug 28, 2017

OK, got it, thanks.

kegl added a commit to paris-saclay-cds/ramp-workflow that referenced this issue Aug 28, 2017
Also adding label_names in `scores.classifier_base` so __call__ can use it without falling back to `y_true` or `y_pred`, which may not contain all the labels.