metrics fmeasure and matthews_correlation don't work batchwise #4592
Comments
I think I stumbled upon the same problem while experimenting with an unbalanced dataset. Here is an example I wrote to isolate the problem:
Output:
In this simple example the classification is always correct, yet the reported precision and recall are almost 0. They go up when the batch size is increased, and roughly follow the probability that a random batch-sized subset of the data contains at least one sample labelled 1 (in this example, 1 - 0.99**2 ≈ 0.0199). That is why I think this is the same issue the OP described.
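The effect described above is easy to reproduce outside of Keras. Here is a minimal NumPy sketch (hypothetical data with a deliberately perfect classifier, not the original script) that compares recall averaged over batches of 2 against recall computed once over the whole dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unbalanced dataset: roughly 1% positives, and a "perfect" classifier
# whose predictions always match the labels.
y_true = (rng.random(10_000) < 0.01).astype(int)
y_pred = y_true.copy()

def recall(t, p):
    tp = np.sum((t == 1) & (p == 1))
    fn = np.sum((t == 1) & (p == 0))
    # Like the Keras metric, a batch with no positives contributes ~0.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Global recall: computed once over the entire dataset.
global_recall = recall(y_true, y_pred)

# Batchwise recall: computed per batch of 2, then averaged --
# effectively what Keras reports during training.
batch_size = 2
batch_recalls = [recall(y_true[i:i + batch_size], y_pred[i:i + batch_size])
                 for i in range(0, len(y_true), batch_size)]
batchwise_recall = float(np.mean(batch_recalls))

print(global_recall)     # 1.0 -- the classifier is always correct
print(batchwise_recall)  # roughly 0.02, despite perfect predictions
```

The batchwise average sits near the probability that a batch of 2 contains at least one positive (1 - 0.99**2 ≈ 0.0199), because batches without positives drag the mean toward 0.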
I'm also using a binary classifier and seeing some suspicious metrics reported. Namely, every epoch reports the exact same value for accuracy, precision, and recall over hundreds of epochs. This occurs in training, validation, and evaluation.
@laxatives I have this same problem with a binary classifier with a balanced dataset. I use a generator and have verified that it gives balanced data each call to next(). @DepthFirstSearch @sietschie Have you figured out the issue?
@laxatives @isaacgerg I don't think you have the same issue as @sietschie or me. As I stated in my initial post, I don't think it's possible to compute fmeasure, precision, etc. in a batchwise manner (see my example). That's the whole issue here.
@DepthFirstSearch During training, these metrics are meant to be computed over each batch, not over all batches combined. In any case, for a binary classifier, the metrics are still being reported incorrectly even on a per-batch basis. See #5400.
Hello,
In my opinion, the metrics `fmeasure`, `matthews_correlation`, `precision`, and `recall` all don't work batchwise. In general, this is the case for all metrics that incorporate true/false positives/negatives. Here is a small and easy counterexample:
Let's assume we have just 4 samples, two negatives and two positives, and that our batch size is 2. The first batch contains the two negatives, the second the two positives.
Now we want to calculate the recall (a.k.a. the true-positive rate) in a batchwise manner. For the first batch, the TP rate is 0, since there are no true positives. For the second batch, suppose one of the two positives is predicted correctly, giving a TP rate of 0.5. Finally, we take the mean over both batches and end up with recall = TP rate = mean(0, 0.5) = 0.25.
But as we can easily see, the correct recall over the entire dataset is 0.5: one of the two positives was found. The problem with the batchwise calculation is that the first batch, which contains no positives at all, is wrongly incorporated into the average.
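The arithmetic above can be checked directly. A small sketch in plain Python, with the predictions assumed as described (one of the two positives found):

```python
# Two batches of two samples each, matching the counterexample:
# batch 1 holds the two negatives, batch 2 the two positives.
y_true = [[0, 0], [1, 1]]
y_pred = [[0, 0], [1, 0]]  # assumed: one of the two positives is predicted

def tp_rate(t, p):
    tp = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
    pos = sum(t)
    return tp / pos if pos else 0.0  # a batch with no positives reports 0

# Batchwise: compute the rate per batch, then average.
batch_rates = [tp_rate(t, p) for t, p in zip(y_true, y_pred)]
batchwise = sum(batch_rates) / len(batch_rates)   # mean(0, 0.5) = 0.25

# Global: compute the rate once over all samples.
flat_true = [x for b in y_true for x in b]
flat_pred = [x for b in y_pred for x in b]
overall = tp_rate(flat_true, flat_pred)           # 1 TP out of 2 positives = 0.5

print(batchwise, overall)  # 0.25 0.5
```

Averaging per-batch rates and computing the rate over accumulated counts only agree when every batch contains the same number of positives, which is exactly what an unbalanced dataset violates.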