
![image.png](attachment:image.png)

The traditional __Accuracy__ is a good measure if you have quite balanced datasets and are interested in all types of outputs equally. I like to start with it in any case, as it is intuitive, and dig deeper from there as needed.

__Precision__ is the metric to focus on if you want to minimize false positives. For example, say you build a spam email classifier. You want to see as little spam as possible, but you do not want any important, non-spam email to be flagged as spam (a false positive). In such cases, you may wish to aim for maximizing precision.

__Recall__ is very important in domains such as medicine (e.g., identifying cancer), where you really want to minimize the chance of missing positive cases (that is, producing false negatives). These are typically cases where missing a positive case has a much bigger cost than wrongly classifying something as positive.

Neither __precision__ nor __recall__ is necessarily useful alone, since we are generally interested in the overall picture. __Accuracy__ is always worth checking as one option. __F1-score__ is another.

__F1-score__ combines __precision__ and __recall__ (it is their harmonic mean) and also works when the datasets are imbalanced, since it requires both __precision__ and __recall__ to have a reasonable value, as demonstrated by the experiments shown in this post. Even if you have only a small number of positive cases compared to negative cases, the formula pulls the metric value down if the __precision__ or __recall__ of the positive class is low.
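
As a minimal sketch of how the four metrics relate to the confusion-matrix counts, the snippet below computes them in plain Python. The function name `classification_metrics`, the example counts, and the convention of returning 0.0 for a 0/0 ratio are assumptions made for illustration; they are not taken from the tables in this post.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 (for the positive class)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Assumed convention: report 0.0 whenever a ratio would be 0/0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 13 actual positives, 87 actual negatives.
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that TN appears only in the accuracy line; precision, recall, and F1 never use it.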

__F1-score vs Accuracy__
- Accuracy is commonly described as the more intuitive metric, while F1-score copes better with imbalanced datasets.

- So how do F1-score (F1) and Accuracy (ACC) compare across different types of data distributions (ratios of positive/negative)?

__Imbalance: Few Positive Cases__
- In this example, the dataset is imbalanced, with 10 positive cases and 90 negative cases. The table shows F1 and ACC for a classifier producing different TN, TP, FN, and FP values:

![image.png](attachment:image.png)

- With this class imbalance, accuracy reaches its maximum with a result of TN=90 and TP=10, as shown on row 2.

- In each case where TP=0, Precision and Recall both become 0 (Precision is itself 0/0 if there are no positive predictions at all), and the F1-score cannot be calculated (division by 0). Such cases can be scored as F1-score = 0, or the classifier can simply be marked as useless.
- This is because the classifier cannot predict any correct positive result.
- These are rows 0, 4, and 8 in the above table.
- They also illustrate some cases of high Accuracy for a broken classifier (e.g., row 0 with 90% Accuracy while always predicting only negative).

- The remaining rows illustrate how the F1-score reacts much more strongly as the classifier makes more balanced predictions.
- For example, compare F1-score = 0.18 vs Accuracy = 0.91 on row 5 with F1-score = 0.46 vs Accuracy = 0.93 on row 7.
- This is only a change of 2 positive predictions, but as it is 2 out of only 10 possible, the change is actually quite large. The F1-score emphasizes this, while Accuracy treats it like any other 2 predictions and barely moves.
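
The figures quoted above can be checked with a few lines of Python. This is only a sketch: the helper name `accuracy_and_f1` is made up here, and the counts are assumed values chosen to be consistent with the numbers quoted for 10 positive and 90 negative cases (with no false positives), not a copy of the table.

```python
def accuracy_and_f1(tp, fp, tn, fn):
    """Accuracy and F1 (positive class) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 = 2PR / (P + R) simplifies to 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Assumed counts for 10 positives and 90 negatives: (tp, fp, tn, fn).
cases = {
    "always predicts negative": (0, 0, 90, 10),
    "finds 1 of 10 positives": (1, 0, 90, 9),
    "finds 3 of 10 positives": (3, 0, 90, 7),
}

results = {name: accuracy_and_f1(*counts) for name, counts in cases.items()}
for name, (acc, f1) in results.items():
    print(f"{name}: ACC={acc:.2f} F1={f1:.2f}")
# always predicts negative: ACC=0.90 F1=0.00
# finds 1 of 10 positives: ACC=0.91 F1=0.18
# finds 3 of 10 positives: ACC=0.93 F1=0.46
```

The simplified form of F1 used here evaluates to 0 when TP=0 (as long as there is at least one FP or FN), which matches the convention of scoring such classifiers as F1-score = 0.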

__Balance: 50/50 Positive and Negative Cases__
- How about when the datasets are more balanced? Here are similar values for a balanced dataset with 50 negative and 50 positive items:

![image.png](attachment:image-2.png)

- F1-score is still a slightly better metric here when there are only very few (or no) positive predictions.

- But the difference is not as huge as with imbalanced classes. 

- In general, it is still always useful to look a bit deeper into the results, although in balanced datasets a high accuracy is usually a good indicator of decent classifier performance.
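
To get a feel for the balanced case, here is a small sketch under stated assumptions: 50 positive and 50 negative items, every negative classified correctly (TN=50, FP=0), and a varying number of positives found. These counts and the helper name `accuracy_and_f1` are illustrative assumptions, not the rows of the table above.

```python
def accuracy_and_f1(tp, fp, tn, fn):
    """Accuracy and F1 (positive class) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 = 2PR / (P + R) simplifies to 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Assumed scenario: 50 positives, 50 negatives, all negatives correct.
balanced = {
    tp: accuracy_and_f1(tp=tp, fp=0, tn=50, fn=50 - tp)
    for tp in (0, 5, 25, 50)
}

for tp, (acc, f1) in balanced.items():
    print(f"TP={tp:2d}: ACC={acc:.2f} F1={f1:.2f}")
```

With few positives found, F1 is still lower than accuracy, but accuracy itself is already close to 0.5 and so flags the problem, unlike the 0.90 seen with 10/90 imbalance.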


__Imbalance: Few Negative Cases__
- Finally, what happens if the minority class is labeled as the negative class and not the positive? The F1-score no longer compensates for the imbalance but rather does the opposite. Here is an example with 10 negative cases and 90 positive cases:

![image.png](attachment:image.png)

- For example, row 5 has only 1 correct prediction out of 10 negative cases. But the F1-score is still at around 95%, so very good and even higher than accuracy.
- When the same ratio applied with the positive cases being the minority, the F1-score was 0.18, whereas now it is 0.95. The 0.18 was a much better indicator of quality than the 0.95 is in this case.

- This result with minority negative cases is because of how the formula to calculate F1-score is defined over precision and recall (emphasizing positive cases). 
- If you look back at the figure illustrating the metrics hierarchy at the beginning of this article, you will see how True Positives feed into both Precision and Recall, and from there to F1-score. 
- The same figure also shows how True Negatives do not contribute to F1-score at all. 
- This seems to be visible here when you reverse the ratios and have fewer true negatives.
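
The effect of which class is labeled positive can be sketched as follows. The counts are assumed values consistent with the row discussed above (90 positives all found, only 1 of 10 negatives classified correctly), and `f1_score` here is a small local helper, not a library function.

```python
def f1_score(tp, fp, fn):
    """F1 for the positive class; note that TN is not an input at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Assumed counts: 90 positives, 10 negatives, 1 negative classified correctly.
tp, fn, tn, fp = 90, 0, 1, 9

accuracy = (tp + tn) / (tp + fn + tn + fp)
f1_majority_positive = f1_score(tp=tp, fp=fp, fn=fn)

# Same predictions, but with the minority class treated as "positive":
# TP and TN swap roles, and so do FP and FN.
f1_minority_positive = f1_score(tp=tn, fp=fn, fn=fp)

print(f"ACC={accuracy:.2f}")                              # ACC=0.91
print(f"F1, majority as positive={f1_majority_positive:.2f}")  # 0.95
print(f"F1, minority as positive={f1_minority_positive:.2f}")  # 0.18
```

The predictions are identical in both calls; only the choice of positive class changes, and F1 moves from about 0.95 to about 0.18 while accuracy stays at 0.91.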

- So, as usual, I believe it is good to keep in mind how you represent your data (for example, which class you label as positive), and to do your own data exploration rather than blindly trusting any single metric.