New lexical backend MLLM #462

osma · 2021-01-15T07:21:34Z

This (draft) PR adds a new Annif backend called MLLM, which stands for "Maui-like Lexical Matching". It is a Python reimplementation of the Maui algorithm (the parts related to matching against a controlled vocabulary), but with some important differences. For example, the initial matching between subject labels and input text is done using sets of tokens, one sentence at a time, instead of the n-grams that Maui uses. The intent is that this backend will eventually replace the use of Maui, since it's a lot easier to set up and use a built-in algorithm than an external Maui Server instance.
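To illustrate the idea of matching with token sets one sentence at a time, here is a minimal sketch. It is not the code in this PR: the helper names (`tokenize`, `split_sentences`, `match_labels`), the naive regex-based tokenizer and sentence splitter, and the example labels are all assumptions made for illustration. A label counts as matched in a sentence when all of its tokens occur in that sentence, regardless of order.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a simple stand-in for a real analyzer (assumption)."""
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (assumption)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def match_labels(text, labels):
    """Count, per subject, the sentences whose token set contains every label token."""
    label_tokens = {sid: frozenset(tokenize(label)) for sid, label in labels.items()}
    hits = {}
    for sentence in split_sentences(text):
        sent_tokens = set(tokenize(sentence))
        for sid, ltokens in label_tokens.items():
            # Subset test: token order and adjacency do not matter,
            # unlike n-gram matching.
            if ltokens and ltokens <= sent_tokens:
                hits[sid] = hits.get(sid, 0) + 1
    return hits


# Hypothetical vocabulary: subject id -> label
labels = {"s1": "iron age", "s2": "burial mounds", "s3": "stone age"}
text = ("Burial mounds of the Iron Age were excavated. "
        "The age of the iron tools is unknown.")
result = match_labels(text, labels)
print(result)
```

Note that the second sentence also matches "iron age" even though the words appear in a different order and are not adjacent. This looseness is what sets token-set matching apart from n-grams, and it is why ambiguous or spurious matches need to be weighed down in a later step.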

The code is in an early draft state and a number of things need to be improved before this can be merged, but I'm opening the PR now to get feedback from automated tools. Here are some TODO items:

- support --cached option for training with previously preprocessed training data
- refactor by moving some classes to a separate module (e.g. annif.lexical)
- add support for semantic features based on vocabulary structure (broader, narrower, related)
- add support for collection membership feature
  - verify that this works also without SKOS Collections in the vocabulary
- support hyperparameter optimization using the annif hyperopt command
- complete unit tests
- fix QA tool warnings
- wiki documentation

codecov · 2021-01-15T07:22:00Z

Codecov Report

Merging #462 (57a2a38) into master (b8e556c) will increase coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head 57a2a38 differs from pull request most recent head aa2c287. Consider uploading reports for the commit aa2c287 to get more accurate results

@@            Coverage Diff             @@
##           master     #462      +/-   ##
==========================================
+ Coverage   99.44%   99.46%   +0.02%     
==========================================
  Files          67       73       +6     
  Lines        4838     5280     +442     
==========================================
+ Hits         4811     5252     +441     
- Misses         27       28       +1

Impacted Files	Coverage Δ
annif/backend/__init__.py	`100.00% <100.00%> (ø)`
annif/backend/mllm.py	`100.00% <100.00%> (ø)`
annif/corpus/subject.py	`100.00% <100.00%> (ø)`
annif/lexical/mllm.py	`100.00% <100.00%> (ø)`
annif/lexical/tokenset.py	`100.00% <100.00%> (ø)`
annif/suggestion.py	`100.00% <100.00%> (ø)`
tests/conftest.py	`100.00% <100.00%> (ø)`
tests/test_backend_mllm.py	`100.00% <100.00%> (ø)`
tests/test_backend_omikuji.py	`100.00% <100.00%> (ø)`
tests/test_lexical_mllm.py	`100.00% <100.00%> (ø)`
... and 15 more


osma · 2021-03-29T12:02:36Z

Rebased and force-pushed.

osma · 2021-03-29T14:22:04Z

I've now tackled some of the QA tool issues, though some complaints still remain. Anyway, I think this is ready for wider testing now.

sonarcloud · 2021-03-30T07:34:17Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
3 Code Smells

No Coverage information
0.0% Duplication

osma · 2021-03-31T12:13:11Z

I wrote some wiki documentation: https://github.com/NatLibFi/Annif/wiki/Backend:-MLLM
This page has not yet been linked to from the Home page or the sidebar. The links should be added when a version of Annif with MLLM is released.

osma · 2021-04-12T11:43:47Z

There are still some QA tool warnings, but it's not easy to fix them. I would like to merge this pretty soon, unless there are objections.

juhoinkinen

I looked through the code. The only thing I could comment on is the n_jobs parameter, and that would probably not be very useful (at least before candidate generation is parallelized, which is listed as a TODO).

Overall it is great to have MLLM in Annif! :)

annif/lexical/mllm.py

osma added the enhancement label Jan 15, 2021

osma added this to the 0.51 milestone Jan 15, 2021

osma mentioned this pull request Jan 15, 2021

Lexical STWFSAPY Backend #438

Merged

osma modified the milestones: 0.51, 0.52 Mar 24, 2021

osma added 24 commits March 29, 2021 14:50

First steps of MLLM backend: token index and matching functionality

11517c9

MLLM backend: candidate generation functionality

2ea07cf

MLLM backend: fix pep8 errors and dead code

7d1b282

MLLM backend: implement feature transform and model training

af284ca

MLLM backend: separate Model class + working suggest functionality

6ec1f22

pep8 fixes

9ee8cf4

MLLM backend: first unit tests

d50db58

MLLM backend: unit tests for TokenSetIndex

1dccbad

MLLM backend: unit test for train method (requires a vocab fixture)

aa0e88c

MLLM backend: test for suggest method

92605fa

refactor test fixtures for accessing vocabulary

dad3fb5

MLLM backend: configurable hyperparameters with default values

d1ac5af

MLLM backend: force integer values for parameters

b809be6

MLLM backend: handle no matches case

965c12d

bugfix: max_samples should be a float, not int

21527d2

MLLM backend: implement --cached option for train command

969bb6c

MLLM backend: hyperparameter optimization (first draft, no tests yet)

198530c

MLLM backend: test + bugfix for hyperopt functionality

7852598

MLLM backend: rename variable train_X to train_x, to please SonarQube

c4302ed

MLLM backend: support for hidden labels (use_hidden_labels option)

cd55b55

MLLM backend: support skos:related feature, drop use of Pipeline

344d8a4

MLLM backend: avoid using graph.preferredLabel as it uses only one property at a time

94f1e73

MLLM backend: refactor + bugfix

17b254a

MLLM backend: add unit test for MLLMModel._prepare_terms

05eaa83

osma added 11 commits March 29, 2021 14:50

MLLM model: add save/load methods to MLLMModel and use them in mllm backend

e03c25c

MLLM backend: refactor TokenSetIndex and add docstrings

5258c22

Add SKOS collections to YSO archaeology slice, for testing MLLM

1f17e4e

MLLM backend: implement collection feature

ecfd2db

MLLM backend: reimplement collection feature using smaller matrix

3aec702

MLLM backend: refactor MLLMModel._prepare_relations

687da92

MLLM backend: use csc_matrix for semantic relations, saving memory

0d7a478

MLLM backend: use csc_matrix for collection matrix, which is faster

6a63265

MLLM backend: refactor prepare_train by splitting out idf calculation

7a40fce

MLLM model: add unit tests for missing train/model file cases

e68ba38

introduce SubjectIndex.active property and use it in MLLM backend

f2f154c

osma force-pushed the feature-mllm-backend branch from be61408 to f2f154c Compare March 29, 2021 11:57

osma added 5 commits March 29, 2021 16:00

MLLM backend: simplify _make_relation_matrix

2394039

MLLM backend: refactor _prepare_terms by splitting out _get_subject_labels

7fe9bfa

MLLM backend: make get_subject_labels into a separate function

d2478fd

MLLM backend: simplify get_subject_labels logic

6cb24c7

MLLM backend: put utility functions in separate annif.lexical.util module

1529780

osma marked this pull request as ready for review March 29, 2021 14:20

osma requested a review from juhoinkinen March 29, 2021 14:22

osma added 3 commits March 30, 2021 10:16

refactor get_subject_labels using a list comprehension

7357b98

refactor _find_subj_ambiguity using a list comprehension

fd37191

refactor: simplify _find_subj_tsets a little bit

aa2c287

juhoinkinen approved these changes Apr 13, 2021

osma merged commit ea11a02 into master Apr 13, 2021

osma deleted the feature-mllm-backend branch April 13, 2021 11:27
