New lexical backend MLLM #462

Merged
osma merged 49 commits into master from feature-mllm-backend on Apr 13, 2021

osma
Member

@osma osma commented Jan 15, 2021

This (draft) PR adds a new Annif backend called MLLM, which stands for "Maui-like Lexical Matching". It is a Python reimplementation of the Maui algorithm (the parts related to matching using a controlled vocabulary), but with some important differences. For example, the initial matching between subject labels and input text is done using sets of tokens, one sentence at a time, instead of n-grams as Maui does. The intent is that this backend will eventually replace the use of Maui, since it's a lot easier to set up and use a built-in algorithm than an external Maui Server instance.
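
To illustrate the idea of matching with sets of tokens one sentence at a time, here is a minimal sketch. It is not the code in this PR: the helper names (`tokenize`, `split_sentences`, `match_labels`), the regex-based tokenization and sentence splitting, and the sample vocabulary are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens as a set."""
    return set(re.findall(r"\w+", text.lower()))


def split_sentences(text):
    """Naive sentence splitter (assumption: sentences end with . ! or ?)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def match_labels(text, labels):
    """Return ids of subjects whose label tokens all occur within one sentence.

    `labels` is a hypothetical mapping from subject id to label string.
    """
    label_tokens = {sid: tokenize(label) for sid, label in labels.items()}
    matches = set()
    for sentence in split_sentences(text):
        sentence_tokens = tokenize(sentence)
        for sid, tokens in label_tokens.items():
            # A label matches when its token set is a subset of the sentence's
            if tokens and tokens <= sentence_tokens:
                matches.add(sid)
    return matches


# Hypothetical vocabulary and input text, for illustration only
labels = {
    "s1": "information retrieval",
    "s2": "machine learning",
    "s3": "libraries",
}
text = (
    "Retrieval of information is hard. "
    "Libraries use learning tools. "
    "A machine broke."
)
print(sorted(match_labels(text, labels)))
```

In this sketch, "information retrieval" matches the first sentence even though the words appear in a different order and are not adjacent, which a plain n-gram comparison would miss. "machine learning" does not match, because its two tokens only occur in different sentences.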

The code is in an early draft state and a number of things need to be improved before this can be merged, but I'm opening the PR now to get feedback from automated tools. Here are some TODO items:

  • support --cached option for training with previously preprocessed training data
  • refactor by moving some classes to a separate module (e.g. annif.lexical)
  • add support for semantic features based on vocabulary structure (broader, narrower, related)
  • add support for collection membership feature
    • verify that this works also without SKOS Collections in the vocabulary
  • support hyperparameter optimization using the annif hyperopt command
  • complete unit tests
  • fix QA tool warnings
  • wiki documentation

@osma osma added this to the 0.51 milestone Jan 15, 2021
@codecov

codecov bot commented Jan 15, 2021

Codecov Report

Merging #462 (57a2a38) into master (b8e556c) will increase coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head 57a2a38 differs from pull request most recent head aa2c287. Consider uploading reports for the commit aa2c287 to get more accurate results

@@            Coverage Diff             @@
##           master     #462      +/-   ##
==========================================
+ Coverage   99.44%   99.46%   +0.02%     
==========================================
  Files          67       73       +6     
  Lines        4838     5280     +442     
==========================================
+ Hits         4811     5252     +441     
- Misses         27       28       +1     
Impacted Files Coverage Δ
annif/backend/__init__.py 100.00% <100.00%> (ø)
annif/backend/mllm.py 100.00% <100.00%> (ø)
annif/corpus/subject.py 100.00% <100.00%> (ø)
annif/lexical/mllm.py 100.00% <100.00%> (ø)
annif/lexical/tokenset.py 100.00% <100.00%> (ø)
annif/suggestion.py 100.00% <100.00%> (ø)
tests/conftest.py 100.00% <100.00%> (ø)
tests/test_backend_mllm.py 100.00% <100.00%> (ø)
tests/test_backend_omikuji.py 100.00% <100.00%> (ø)
tests/test_lexical_mllm.py 100.00% <100.00%> (ø)
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7f42c96...aa2c287.

@osma osma mentioned this pull request Jan 15, 2021
@osma osma modified the milestones: 0.51, 0.52 Mar 24, 2021
osma added 24 commits March 29, 2021 14:50
@osma
Member Author

osma commented Mar 29, 2021

Rebased and force-pushed.

@osma osma marked this pull request as ready for review March 29, 2021 14:20
@osma
Member Author

osma commented Mar 29, 2021

I've now tackled some QA tool issues, though some complaints still remain. Anyway, I think this is ready for wider testing now.

@osma osma requested a review from juhoinkinen March 29, 2021 14:22
@sonarcloud

sonarcloud bot commented Mar 30, 2021

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
3 Code Smells (rating A)

No coverage information
0.0% duplication

@osma
Member Author

osma commented Mar 31, 2021

I wrote some wiki documentation: https://github.com/NatLibFi/Annif/wiki/Backend:-MLLM
This page has not yet been linked to from the Home page or the sidebar. The links should be added when a version of Annif with MLLM is released.

@osma
Member Author

osma commented Apr 12, 2021

There are still some QA tool warnings, but it's not easy to fix them. I would like to merge this pretty soon, unless there are objections.

Member

@juhoinkinen juhoinkinen left a comment

I looked through the code. The only thing I could comment on is the n_jobs parameter, and that comment would probably not be very useful (at least before candidate generation is parallelized, which is listed as a TODO).

Overall it is great to have MLLM in Annif! :)

annif/lexical/mllm.py (review thread resolved)
@osma osma merged commit ea11a02 into master Apr 13, 2021
@osma osma deleted the feature-mllm-backend branch April 13, 2021 11:27

2 participants