Single Concept Classifier for handling label imbalance #538
Thanks for the suggestion. I think it makes sense to try to support this in Annif, although one would hope that individual algorithms could deal with this kind of imbalance better than they seem to do. Suppose you want a setup of the kind you describe, with Omikuji handling almost every concept but the two most frequent ones handled by SVC - what would the Annif configuration look like then? Would this involve some special kind of ensemble project delegating to specialized Omikuji and SVC projects? I think that thinking about the configuration aspect would clarify the questions about metrics and concept exclusion.
My original plan was to do something modular, so the individual classifiers could also be of the fastText type. The config would look like the following, excluding some stuff for brevity:
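A sketch only: the `exclude_concepts` and `concepts` options are hypothetical names for the proposed new setting, while `backend`, `vocab`, `sources` etc. are standard Annif project options.

```ini
# Hypothetical: Omikuji covering everything except the two top concepts.
# exclude_concepts is an assumed option name, not an existing setting.
[zbw-omikuji]
name=Omikuji without top concepts
language=en
backend=omikuji
vocab=stw
exclude_concepts=<URI of USA>,<URI of Theory>

# Hypothetical: one binary SVC per frequent concept.
# concepts is likewise an assumed option name.
[zbw-svc-usa]
name=SVC for USA
language=en
backend=svc
vocab=stw
concepts=<URI of USA>

[zbw-svc-theory]
name=SVC for Theory
language=en
backend=svc
vocab=stw
concepts=<URI of Theory>

# Standard Annif ensemble combining the specialized projects.
[zbw-ensemble]
name=Combined ensemble
language=en
backend=ensemble
vocab=stw
sources=zbw-omikuji,zbw-svc-usa,zbw-svc-theory
```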
Possibly also adding other sources to the ensemble.
That looks very reasonable! Instead of a config option that names the concepts a single project should handle, maybe a more generic whitelist/blacklist mechanism for restricting a project to a subset of concepts could work. Though I guess the single-concept case would still need special handling in the backends. Just thinking aloud here... maybe your suggestion is better anyway; I'm just trying to think this through and come up with a generic mechanism that might be useful perhaps even more broadly than the scenario that you describe.
Edit: a quick example of the logic required to handle the single-concept case: scikit-learn's SVC requires different input for a single class (a 1d array) than for multiple classes (a 2d array whose rows are the labels for one sample). Nothing bad, just something to keep in mind.
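A minimal sketch of the shape difference, assuming `LinearSVC` wrapped in `OneVsRestClassifier` for the multi-label case; the toy data is made up:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X = np.random.rand(6, 4)  # toy feature matrix: 6 samples, 4 features

# Multiple concepts: y is a 2d binary indicator matrix,
# one row per sample, one column per concept.
y_multi = np.array([[1, 0],
                    [0, 1],
                    [1, 1],
                    [0, 0],
                    [1, 0],
                    [0, 1]])
OneVsRestClassifier(LinearSVC()).fit(X, y_multi)

# Single concept: y collapses to a 1d array of 0/1 labels
# ("concept assigned" vs. "not assigned") for one binary classifier.
y_single = np.array([1, 0, 1, 0, 1, 0])
LinearSVC().fit(X, y_single)
```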
The problem I see with an option that takes just a single concept is that it's not obvious how it should generalize to several concepts. So in terms of configuration, we could certainly do this (your example above, just renamed the setting):
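A sketch, using `concepts` as a hypothetical name for the setting and a placeholder URI:

```ini
[svc-usa]
name=SVC for USA only
language=en
backend=svc
vocab=stw
# hypothetical option restricting the project to one concept
concepts=<URI of USA>
```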
but what happens if we do this:
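Again a sketch, same hypothetical `concepts` option but now with two concepts:

```ini
[svc-usa-theory]
name=SVC for USA and Theory
language=en
backend=svc
vocab=stw
# hypothetical option, now listing two concepts
concepts=<URI of USA>,<URI of Theory>
```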
What would the SVC try to distinguish then? USA vs. Theory vs. nothing? What if we change the backend to Omikuji (which supports multi-label classification, unlike SVC) - would it then be different? In short, the semantics of such a setting with more than one concept would need to be spelled out.
I think for more than one concept the algorithms should just behave as they normally would, only on that subset of concepts. I haven't evaluated it, but I could imagine running Omikuji on a subset of concepts. At one point we had the idea of using separate classifiers for a sub-thesaurus (e.g., geographic concepts).
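A minimal Python sketch of that subset idea, assuming suggestions come as (URI, score) pairs; the function name and URIs are made up for illustration:

```python
def filter_to_subset(suggestions, subset_uris):
    """Keep only suggestions whose concept URI belongs to the subset,
    e.g. a geographic sub-thesaurus. Scores are left untouched; any
    renormalization or thresholding would happen downstream."""
    return [(uri, score) for uri, score in suggestions if uri in subset_uris]

# usage sketch with placeholder URIs
geo = {"http://example.org/geo/usa", "http://example.org/geo/germany"}
raw = [("http://example.org/geo/usa", 0.8),
       ("http://example.org/econ/theory", 0.6)]
print(filter_to_subset(raw, geo))  # [('http://example.org/geo/usa', 0.8)]
```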
That sounds like a plan!
I can imagine that there could be whitelisting/blacklisting not just by individual concept URIs, but also by, e.g., membership in a group or collection defined in the vocabulary, or a whole hierarchical subtree below a given concept.
I'm not saying that these should be supported in the first iteration of blacklist/whitelist features, just that it would be possible to expand that functionality in these directions if desired. As for combining whitelists and blacklists, perhaps blacklist rules should take precedence when both are specified.
On second thought, maybe it would be better to give precedence to whitelist rules, when both whitelists and blacklists are specified. Also, it seems that those terms are going out of favor (maybe) because of possible connotations - the Linux kernel has switched to allowlist/denylist. I think we need a new issue for this discussion, which is a bit separate from the original idea of a Single Concept Classifier.
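A tiny sketch of the precedence rule I mean, with made-up names; whether a non-empty allowlist should be treated as exhaustive is a separate design choice:

```python
def concept_allowed(uri, allowlist=frozenset(), denylist=frozenset()):
    # allowlist entries win over denylist entries when both match
    if uri in allowlist:
        return True
    if uri in denylist:
        return False
    # a non-empty allowlist is treated as exhaustive here (an assumption)
    return not allowlist

assert concept_allowed("a", allowlist={"a"}, denylist={"a"})  # allowlist wins
assert not concept_allowed("b", denylist={"b"})
assert concept_allowed("c")  # no lists given: everything allowed
```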
Nearly two years later, we now have that issue: #735 🎉
In automated subject indexing (and multi-label classification in general), the distribution of assigned concepts often follows Zipf's law. In our experience this leads to algorithms having low precision on the most frequently assigned concepts. The largest complaint from the subject indexers at ZBW in our last review was the frequent prediction of our top two concepts. As a remedy we evaluated assigning the most frequent concepts individually. While this led to a minor decrease in F1 (`"samples"` avg.), it provided benefits in precision and F1 (both `"binary"` avg.) for the single concepts. Here are some results for our top concepts Theory and USA. Note that the last classifier still uses individual thresholds for the three classifiers. I think using a (neural network) ensemble to combine the results would probably allow the use of a single threshold.
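For illustration, a hedged sketch of how that could look with Annif's existing nn_ensemble backend, reusing the hypothetical project names from the config above:

```ini
[zbw-nn-ensemble]
name=NN ensemble over Omikuji and per-concept SVCs
language=en
backend=nn_ensemble
vocab=stw
sources=zbw-omikuji,zbw-svc-usa,zbw-svc-theory
nodes=100
epochs=10
```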
I'm opening this issue to discuss whether there is interest in bringing this functionality to Annif, and also to discuss some implementation details.
Adding or modifying existing classifiers (fastText or SVC) to support single classes is straightforward. But there are some details regarding the overall architecture that are not as straightforward to handle, e.g. how the configuration should look, how evaluation metrics are affected, and how the excluded concepts are handled.
Looking forward to hearing your thoughts on the topic.