Support for batch suggest operations for CLI commands #663

juhoinkinen · 2023-01-20T15:20:57Z

Adds support for passing multiple documents in a batch from the suggest, index, eval and optimize CLI commands to backends (as discussed in issue #579).

Makes annif suggest accept path(s) to file(s) to be indexed, in addition to stdin:

annif suggest yso-tfidf-en <document.txt              # just like before
annif suggest yso-tfidf-en document.txt               # the same, but from a named file
annif suggest yso-tfidf-en doc1.txt doc2.txt doc3.txt # many files
annif suggest yso-tfidf-en doc*.txt                   # similar to above, but using shell expansion
annif suggest yso-tfidf-en -                          # stdin with dash
annif suggest yso-tfidf-en doc1.txt -                 # mixing file and stdin with dash
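
The path handling shown in these commands could be sketched roughly as follows. This is only an illustration and not the code in this PR: `iter_texts` is a hypothetical helper, and treating `-` as "read everything from stdin" is an assumption about how the dash is interpreted.

```python
import sys
from typing import Iterable, Iterator, TextIO, Tuple


def iter_texts(
    paths: Iterable[str], stdin: TextIO = sys.stdin
) -> Iterator[Tuple[str, str]]:
    """Lazily yield (path, text) pairs; "-" means read from stdin.

    Hypothetical helper for illustration only.
    """
    for path in paths:
        if path == "-":
            # The dash stands for standard input (assumed behaviour).
            yield path, stdin.read()
        else:
            with open(path, encoding="utf-8") as handle:
                yield path, handle.read()
```

Because the function is a generator, files are opened one at a time as the caller consumes them, which keeps memory use flat even when the shell expands a glob into many paths.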

The output for each input file path begins with the line Suggestions for <file/path>:

Suggestions for tests/corpora/archaeology/fulltext/440866.txt
<http://www.yso.fi/onto/yso/p6218>	riimukirjoitus	0.3213897943496704
<http://www.yso.fi/onto/yso/p6479>	viikingit	0.18659920990467072
<http://www.yso.fi/onto/yso/p12738>	viikinkiaika	0.18625082075595856
<http://www.yso.fi/onto/yso/p22768>	Kiinan muuri	0.15950888395309448
<http://www.yso.fi/onto/yso/p3973>	antiikki	0.13840530812740326
<http://www.yso.fi/onto/yso/p14588>	riimukivet	0.1362432837486267
<http://www.yso.fi/onto/yso/p14173>	kaivaukset	0.1201547235250473
<http://www.yso.fi/onto/yso/p5713>	hautalöydöt	0.11249098181724548
<http://www.yso.fi/onto/yso/p15031>	viikinkiretket	0.11039584875106812
<http://www.yso.fi/onto/yso/p5714>	muinaishaudat	0.10336380451917648

The documents are passed to the backends as batches, i.e. lists of texts (for the suggest command with files, the lists are also produced by a generator). The backend.py module defines a default _suggest_batch method that falls back to the regular, single-text _suggest method. This allows the actual backends to define their own _suggest_batch methods that operate on the whole document batch.
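
The fallback arrangement described above might look something like this. It is a minimal sketch under stated assumptions: `BaseBackend`, `UppercaseBackend` and `LengthBackend` are made-up classes, and the signatures and return types are simplified; only the method names `_suggest` and `_suggest_batch` come from the description.

```python
from typing import Dict, List


class BaseBackend:
    """Illustrative base class; not Annif's real backend class."""

    def _suggest(self, text: str, params: Dict) -> List[str]:
        raise NotImplementedError

    def _suggest_batch(self, texts: List[str], params: Dict) -> List[List[str]]:
        # Default: fall back to the single-text method, one document at a time.
        return [self._suggest(text, params) for text in texts]


class UppercaseBackend(BaseBackend):
    """Implements only the single-text variant and inherits the batch default."""

    def _suggest(self, text: str, params: Dict) -> List[str]:
        return [text.upper()]


class LengthBackend(BaseBackend):
    """Overrides the batch variant to handle all texts in one call."""

    def _suggest_batch(self, texts: List[str], params: Dict) -> List[List[str]]:
        return [[str(len(text))] for text in texts]
```

Callers only ever use `_suggest_batch`, so a backend can switch from the inherited default to a real batch implementation without any change on the calling side.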

The implementation of document batching for the hyperopt CLI command is left out of this PR, as the hyperopt functionality is implemented in individual backends.

Note that there is no support for actual batch processing in any of the backends yet.

codecov · 2023-01-20T15:28:25Z

Codecov Report

Base: 99.55% // Head: 89.44% // Decreases project coverage by -10.11% ⚠️

Coverage data is based on head (ec16a21) compared to base (ca4d61c).
Patch coverage: 97.94% of modified lines in pull request are covered.

Additional details and impacted files

@@             Coverage Diff             @@
##           master     #663       +/-   ##
===========================================
- Coverage   99.55%   89.44%   -10.11%     
===========================================
  Files          87       87               
  Lines        6017     6142      +125     
===========================================
- Hits         5990     5494      -496     
- Misses         27      648      +621

Impacted Files	Coverage Δ
tests/test_backend_fasttext.py	`100.00% <ø> (ø)`
tests/test_backend_nn_ensemble.py	`6.77% <0.00%> (-93.23%)`	⬇️
tests/test_backend_omikuji.py	`5.15% <0.00%> (-94.85%)`	⬇️
tests/test_backend_pav.py	`100.00% <ø> (ø)`
tests/test_backend_yake.py	`6.94% <0.00%> (-93.06%)`	⬇️
annif/backend/backend.py	`100.00% <100.00%> (ø)`
annif/backend/dummy.py	`100.00% <100.00%> (ø)`
annif/backend/ensemble.py	`100.00% <100.00%> (ø)`
annif/backend/pav.py	`98.87% <100.00%> (ø)`
annif/cli.py	`99.70% <100.00%> (+0.01%)`	⬆️
... and 28 more

juhoinkinen · 2023-01-23T07:44:47Z

Some open thoughts & questions:

Using a DocumentList object made the implementation for the index command a bit easier than using a plain list of texts. Instead of DocumentList there could also be a dedicated DocumentBatch class just for this use, but would that give any benefits?
Should there be a limit on the number of documents that can be sent via REST batch-suggest? Or is it enough to be able to set a payload limit for the method in e.g. NGINX?
Maybe batch-suggest would be a better name for the method than suggest-batch.

osma

This is a very good starting point for batching. Also, it already provides some new functionality (possibility to pass many documents in suggest operations via CLI and REST) so I think it would be good to try to merge this first, even before any actual improved implementations in the backends have been developed.

What worries me here a little is the potentially large size of batches. On the REST side they are naturally a bit limited by request size (and maybe we need some other limit as well?), but on the CLI side, it's not uncommon to use index and eval on thousands of large documents at once. But I don't think it makes sense to process that many documents in a single operation on the backend level - more likely, processing something like 16 or 32 documents at once (let's call it a minibatch) would already give a performance boost, and any larger minibatch size would probably just increase memory overhead with diminishing returns.

It's good to use DocumentList here on the "outer" level, because it's generator based and thus scales naturally to even huge numbers of documents. But I don't think it should be passed directly to backends. Instead, some intermediate layer (probably project.py, or the layer above i.e. cli.py and rest.py) should chop this up into minibatches which are then passed to the backend methods, maybe just as simple lists of text strings. The results (hit_sets) from the backends would have the same size as the minibatch and these would then be assembled back to a single iterable - a generator would probably be a good idea here too, since the number of documents can be huge.
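
The intermediate layer suggested here could be sketched as below. The names `minibatches` and `suggest_all` are hypothetical, the default size of 32 is just one of the values floated above, and `suggest_batch` stands in for whatever backend method ends up receiving the lists.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def minibatches(texts: Iterable[str], size: int) -> Iterator[List[str]]:
    """Chop any iterable of texts into lists of at most `size` items."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def suggest_all(
    texts: Iterable[str],
    suggest_batch: Callable[[List[str]], List[object]],
    size: int = 32,
) -> Iterator[object]:
    """Feed minibatches to the backend and reassemble the results lazily."""
    for batch in minibatches(texts, size):
        # The backend returns one result per text in the minibatch.
        yield from suggest_batch(batch)
```

Both functions are generators, so at most one minibatch of texts and its results are held in memory at a time, regardless of how many documents the outer corpus contains.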

I gave a few minor detailed comments on the code.

I won't comment on the naming in this round, let's see first how the code evolves.

osma · 2023-01-27T09:34:56Z

annif/cli.py

@@ -345,35 +364,51 @@ def run_learn(project_id, paths, docs_limit, backend_param):

 @cli.command("suggest")
 @click.argument("project_id")
+@click.argument("paths", type=click.Path(dir_okay=False, exists=True), nargs=-1)


Thought: Should it be possible to pass a directory (containing .txt documents) as well?

juhoinkinen · 2023-01-30T16:05:58Z

I tried to utilize the new suggest_batch function of the project module for the eval CLI command, but it does not seem very straightforward.

First, the eval command uses imap_unordered, which expects an iterable as its second argument, but suggest_batch operates on a corpus object (which has the documents iterable only as a property). Also, I'm not sure if imap_unordered could in any way feed multiple documents to the suggest_batch function: it has the chunksize parameter, but afaik even when setting that to a non-default value (!=1) it does not send multiple elements from the iterable to the function in one pass.
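
The chunksize behaviour can be checked with a small experiment. The sketch below uses `multiprocessing.pool.ThreadPool`, which offers the same `imap_unordered` interface as `multiprocessing.Pool`, so that it runs without spawning processes; whether the eval command uses a process pool or something else is not stated here, and `describe` is a made-up worker function.

```python
from multiprocessing.pool import ThreadPool


def describe(arg):
    """Report what kind of argument the worker function received."""
    return type(arg).__name__


docs = ["kissa", "koira", "laiva", "viikinki"]

with ThreadPool(2) as pool:
    # chunksize only groups items for dispatch to workers;
    # the function is still called once per item.
    per_item = sorted(pool.imap_unordered(describe, docs, chunksize=2))

    # To hand the function several documents at once,
    # the iterable itself has to consist of batches.
    batches = [docs[i:i + 2] for i in range(0, len(docs), 2)]
    per_batch = sorted(pool.imap_unordered(describe, batches))

print(per_item)   # the worker saw individual strings
print(per_batch)  # the worker saw lists of strings
```

In other words, batching has to happen before the iterable is handed to the pool, for example by iterating over pre-built batches of documents.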

juhoinkinen · 2023-02-01T09:53:21Z

I added BatchingDocumentCorpus to help with getting batches of documents, but I wonder could the doc_batches(batch_size) method be added already to the DocumentCorpus base class, as batching of documents is used quite many times. (TODO: add testing of the batching method.)

(Black v23.1.0 was just released, and it introduced some changes to the style, which raise complaints for the current Annif code. :( )

osma · 2023-02-01T10:12:50Z

> I wonder could the doc_batches(batch_size) method be added already to the DocumentCorpus base class, as batching of documents is used quite many times.

Yes, this sounds like a great idea! I think it would simplify the code and enable batching in many places.
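
Putting the batching method on the base class might look roughly like this. `CorpusSketch` and `ListCorpus` are illustrative stand-ins, not Annif's actual classes; only the names `documents` and `doc_batches(batch_size)` are taken from the discussion.

```python
from itertools import islice
from typing import Iterator, List


class CorpusSketch:
    """Illustrative stand-in for a corpus base class."""

    @property
    def documents(self) -> Iterator[str]:
        raise NotImplementedError

    def doc_batches(self, batch_size: int) -> Iterator[List[str]]:
        """Yield the documents as lists of at most batch_size items."""
        iterator = iter(self.documents)
        while True:
            batch = list(islice(iterator, batch_size))
            if not batch:
                return
            yield batch


class ListCorpus(CorpusSketch):
    """Concrete corpus that only has to provide `documents`."""

    def __init__(self, texts: List[str]) -> None:
        self._texts = texts

    @property
    def documents(self) -> Iterator[str]:
        return iter(self._texts)
```

With this arrangement every corpus subclass gets batching for free, so no separate wrapper class is needed at the call sites.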

osma · 2023-02-03T07:35:01Z

annif/cli.py

@@ -83,6 +83,30 @@ def open_doc_path(path, subject_index):
    return docs


+def open_text_documents(paths, docs_limit):


This adds yet another utility function to the top of cli.py. Nothing wrong with that, but cli.py has grown very long so I think we could refactor this in a follow-up PR, moving the utility functions to a separate module such as cli_util.py. Then cli.py would just contain the Click-decorated functions that implement the CLI commands themselves.

annif/project.py

osma

I already gave a comment about moving DOC_BATCH_SIZE inside DocumentCorpus since it's only needed there.

Apart from that, I think there's an opportunity for further simplification. Basically we don't need the single-text versions of suggest methods anymore, as they can be handled as a special case of suggest_batch. This may seem a bit radical but I think it's not very hard (well, except maybe fixing up all the tests):

The old AnnifProject.suggest() method is now needed only in two places, in the cli.py run_suggest function (the stdin-only case) and in the rest.py suggest function. Convert these two to use the AnnifProject.suggest_batch() method instead - passing a list with one text.
Remove the now unused AnnifProject.suggest() method as well as the AnnifProject._suggest_with_backend() method that it relies on (nothing else needs it).
Remove the now unused AnnifBackend.suggest() method.

It may also make sense to rename the remaining suggest_batch() methods to just suggest(), now that the shorter name is available. Though we still need both AnnifBackend._suggest and AnnifBackend._suggest_batch, because it's up to backends which variant they will implement.
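
Treating a single text as a special case of a batch comes down to wrapping and unwrapping a one-element list. In this sketch `suggest_batch` is a toy stand-in that just splits each text into words, and `suggest_one` is a hypothetical name for the call site:

```python
from typing import List


def suggest_batch(texts: List[str]) -> List[List[str]]:
    """Stand-in for a batch suggest method: one result list per text."""
    return [text.split() for text in texts]


def suggest_one(text: str) -> List[str]:
    """Handle a single text by passing a one-element batch."""
    return suggest_batch([text])[0]
```

The stdin-only CLI case and the REST suggest function would then share the same code path as multi-document calls.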

osma · 2023-02-03T08:32:37Z

Oh by the way, I tested the annif eval command (yso-mllm-fi project, evaluated on kirjastonhoitaja test set) with the -j 4 option, before and after this PR. The evaluation results were the same. It took a few seconds longer with the code in this PR, probably because the parallel processing is done on larger batches and the final batches cannot be distributed between CPUs. But I think that's not too bad, and we should see performance gains in the future as we implement support for batching in individual backends.
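
A toy model can illustrate why larger batches may balance less evenly across CPUs. This is not a measurement of Annif: the function name, the batch size of 32 and the assumption that every document costs the same are all made up for the example.

```python
def heaviest_worker_load(n_docs: int, batch_size: int, n_workers: int) -> int:
    """Toy model: give each batch to the least loaded worker, return the largest load."""
    batch_sizes = [
        min(batch_size, n_docs - start) for start in range(0, n_docs, batch_size)
    ]
    loads = [0] * n_workers
    for size in batch_sizes:
        loads[loads.index(min(loads))] += size
    return max(loads)


print(heaviest_worker_load(100, 1, 4))   # one document per task: 25
print(heaviest_worker_load(100, 32, 4))  # batches of 32: 32
```

With 100 documents and 4 workers, per-document tasks split evenly at 25 each, while batches of 32 leave three workers with 32 documents and one with only 4, so the slowest worker finishes later.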

juhoinkinen · 2023-02-03T10:44:41Z

Now some debug-level logs from project module are removed:
Before:

annif suggest tfidf-fi <kissa.txt -v DEBUG
debug: creating app with configuration annif.default_config.Config
debug: Reading configuration file projects.cfg in CFG format
debug: loading subjects from data/vocabs/yso/subjects.csv
debug: Suggesting subjects for text "kissa
debug: ..." (len=6)
debug: Backend tfidf: loading vectorizer from data/projects/tfidf-fi/vectorizer
debug: Backend tfidf: loading similarity index from data/projects/tfidf-fi/tfidf-index
debug: Backend tfidf: Suggesting subjects for text "kissa
debug: ..." (len=6)
debug: Got 100 hits from backend tfidf
debug: 100 hits from backend
<http://www.yso.fi/onto/yso/p19378>	kissa	0.9530429244041443
<http://www.yso.fi/onto/yso/p864>	kissaeläimet	0.5541983842849731

Now:

annif suggest tfidf-fi <kissa.txt -v DEBUG
debug: creating app with configuration annif.default_config.Config
debug: Reading configuration file projects.cfg in CFG format
debug: loading subjects from data/vocabs/yso/subjects.csv
debug: Backend tfidf: loading vectorizer from data/projects/tfidf-fi/vectorizer
debug: Backend tfidf: loading similarity index from data/projects/tfidf-fi/tfidf-index
debug: Backend tfidf: Suggesting subjects for text "kissa
debug: ..." (len=6)
<http://www.yso.fi/onto/yso/p19378>	kissa	0.9530429244041443
<http://www.yso.fi/onto/yso/p864>	kissaeläimet	0.5541983842849731

And in the case giving multiple files to annif suggest, the debug log has some duplication:

annif suggest tfidf-fi *.txt -v DEBUG
debug: creating app with configuration annif.default_config.Config
debug: Reading configuration file projects.cfg in CFG format
debug: loading subjects from data/vocabs/yso/subjects.csv
debug: Backend tfidf: loading vectorizer from data/projects/tfidf-fi/vectorizer
debug: Backend tfidf: loading similarity index from data/projects/tfidf-fi/tfidf-index
debug: Backend tfidf: Suggesting subjects for text "kissa
debug:  cat
debug: dog
debug: viiki..." (len=24)
debug: Backend tfidf: Suggesting subjects for text "kissa
debug: ..." (len=6)
debug: Backend tfidf: Suggesting subjects for text "koira
debug: ..." (len=6)
debug: Backend tfidf: Suggesting subjects for text "laiva
debug: ..." (len=6)
debug: Backend tfidf: Suggesting subjects for text "viikinki
debug: ..." (len=9)

osma · 2023-02-03T11:30:14Z

> Now some debug-level logs from project module are removed

Right, because the debug information was printed only in the single text suggest methods. I don't think this is a big loss.

> And in the case giving multiple files to annif suggest, the debug log has some duplication:

I don't see any exact duplicates - the messages are related to different input files, right?

osma

LGTM!

Maybe the PR title could be amended, as it also covers the eval command now (but only the CLI side)

sonarcloud · 2023-02-03T12:50:31Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
6 Code Smells

No Coverage information
0.0% Duplication

juhoinkinen and others added 6 commits January 20, 2023 14:44

Add initial support for batched operations for suggest and index CLI …

1825801

…commands

Use common helper function for output of suggest and index commands

933990b

Ensure tests fail if 'text' with wrong type ends up to dummy backend

5064a48

Fix index command by using DocumentList in batch functions instead of…

47423e5

… plain list of documents

Add suggest-batch REST method

37ece6c

Use common SuggestParameters via reference in suggest method too

294baab

juhoinkinen added the enhancement label Jan 20, 2023

juhoinkinen added this to the Short term milestone Jan 20, 2023

juhoinkinen added 2 commits January 23, 2023 09:19

Fix & improve testing of suggest CLI cmd with file input; adapt the c…

49931f8

…ommand as necessary The test was not working as intended: the command got input as stdin

Fix import order

74a9b80

juhoinkinen and others added 9 commits January 23, 2023 11:20

Add tests for applying transform in suggest calls

4919e03

Apply transform to document batch on project level

741494a

Use common _suggest function in REST methods

e6cf0d3

Define error function for not supported language

dc7e868

Merge branch 'master' into issue579-batch-suggest-operation

ba1a951

Add optional id field to suggest-batch REST method

b6558f8

Remove superfluous implementation of _suggest_batch fn in dummy backend

47da0e1

Refactor to address complexity issue by CodeClimate

1ebedde

Refactor again

45d6ed1

juhoinkinen marked this pull request as ready for review January 27, 2023 07:41

juhoinkinen requested a review from osma January 27, 2023 07:41

osma requested changes Jan 27, 2023

juhoinkinen added 6 commits January 27, 2023 12:48

Fix typo

0767d10

Add swagger tests for 404 & 503 cases in suggest-batch request

e49e91f

Remove debug message showing number of documents in suggest batch

b8a411f

Support "-" as file path in suggest CLI command for stdin

77d5266

Also support mixing "-" and real file paths; in that case show hits in file-like output

Open text documents using generator instead of list in CLI suggest fn

53e39d2

Implement minibatching of documents in suggest fn in project module

1a5af59

juhoinkinen added 2 commits February 1, 2023 10:30

Add BatchingDocumentCorpus

017b9d1

Use BatchingDocumentCorpus in suggest, index, eval & optimize CLI cmds

49c06a6

juhoinkinen added 2 commits February 1, 2023 13:21

Add document batching method to DocumentCorpus class & basic test for it

365dbcd

Add evaluate_many() method to EvaluationBatch and use it in CLI cmds

ba31310

osma reviewed Feb 3, 2023

osma requested changes Feb 3, 2023

juhoinkinen added 2 commits February 3, 2023 10:44

Turn DocumentCorpus.doc_batches() method to property with a constant …

b4c059e

…batch size

Remove single-text versions of suggest methods

9e1c272

osma approved these changes Feb 3, 2023

This was referenced Feb 3, 2023

Support batch suggest in Omikuji backend #665

Closed

Support batch suggest in STWFSA backend #666

Open

Support batch suggest in SVC backend #667

Closed

juhoinkinen changed the title ~~Support for batch suggest operations in suggest and index methods~~ Support for batch suggest operations for CLI commands Feb 3, 2023

juhoinkinen modified the milestones: Short term, 0.61 Feb 3, 2023

Refine document batching test

ec16a21

juhoinkinen merged commit 4b03e79 into master Feb 3, 2023

juhoinkinen deleted the issue579-batch-suggest-operation branch February 3, 2023 12:54

osma mentioned this pull request Feb 3, 2023

Batch suggest in Omikuji backend #669

Merged

juhoinkinen mentioned this pull request Feb 21, 2023

Refactor and cleanup CLI module #675

Merged

juhoinkinen mentioned this pull request May 15, 2023

Fix crashing index command when targeted directory contains subject files #705

Merged

juhoinkinen mentioned this pull request Jun 26, 2023

Batch suggest operation #579

Closed
