New lexical backend MLLM #462
Codecov Report
@@ Coverage Diff @@
## master #462 +/- ##
==========================================
+ Coverage 99.44% 99.46% +0.02%
==========================================
Files 67 73 +6
Lines 4838 5280 +442
==========================================
+ Hits 4811 5252 +441
- Misses 27 28 +1
Force-pushed from be61408 to f2f154c.
Rebased and force-pushed.
I've now tackled some QA tool issues, though complaints still remain. Anyway I think this is ready for wider testing now.
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information
I wrote some wiki documentation: https://github.com/NatLibFi/Annif/wiki/Backend:-MLLM
There are still some QA tool warnings, but it's not easy to fix them. I would like to merge this pretty soon, unless there are objections.
I looked through the code. The only thing I could comment on is the n_jobs parameter, and probably that would not be very useful (at least before candidate generation is parallelized, which is listed as a TODO).
Overall it is great to have MLLM in Annif! :)
This (draft) PR adds a new Annif backend called MLLM, which stands for "Maui-like Lexical Matching". This is a Python reimplementation of the Maui algorithm (the parts related to matching using a controlled vocabulary), but with some important differences. For example, the initial matching between subject labels and input text is done using sets of tokens, one sentence at a time, instead of n-grams as Maui does. The intent is that this backend would eventually replace the use of Maui, since it's a lot easier to set up and use a built-in algorithm than an external Maui Server instance.
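To illustrate the token-set idea described above, here is a rough sketch in plain Python. It is not the code from this PR: the function names (tokenize, split_sentences, match_subjects), the toy vocabulary and the naive regex-based tokenizer and sentence splitter are all assumptions made for the example. It only shows how a label can match a sentence when all of its tokens occur in that sentence, regardless of word order.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (hypothetical stand-in for a real analyzer)."""
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (assumption for this sketch)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def match_subjects(text, vocabulary):
    """Count, per subject, the sentences that contain every token of its label."""
    label_tokens = {
        sid: frozenset(tokenize(label)) for sid, label in vocabulary.items()
    }
    hits = {}
    for sentence in split_sentences(text):
        sentence_tokens = set(tokenize(sentence))
        for sid, tokens in label_tokens.items():
            # Set inclusion: word order and intervening words do not matter
            if tokens and tokens <= sentence_tokens:
                hits[sid] = hits.get(sid, 0) + 1
    return hits


# Toy vocabulary with made-up subject identifiers
vocabulary = {
    "s1": "library science",
    "s2": "machine learning",
    "s3": "science",
}
text = (
    "Machine learning helps libraries. "
    "The science of library cataloguing is old. "
    "Learning takes time."
)
result = match_subjects(text, vocabulary)
```

In this toy input, "library science" matches the second sentence even though the words appear in a different order and are not adjacent, which a strict n-gram match would miss. "libraries" does not match "library" because the sketch does no stemming or other normalization.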
The code is in an early draft state and a number of things need to be improved before this can be merged, but I'm opening the PR now to get feedback from automated tools. Here are some TODO items:
- --cached option for training with previously preprocessed training data
- [...] (annif.lexical)
- [...] annif hyperopt command