The yake backend is a wrapper around the YAKE library, which performs unsupervised automatic keyword extraction.
In the backend, the keywords found by YAKE are looked up in an index that is built from the labels of the SKOS vocabulary. The index can include
hiddenLabels. Keywords and labels in the index are lemmatized and sorted alphabetically for matching.
The YAKE backend is lexical in principle, but it currently does not perform as well as the other lexical backends (MLLM, STWFSA and Maui), which were designed from the start to make use of SKOS vocabulary features. However, the free keyword extraction makes it possible to add new features to Annif, in particular suggesting new terms for a vocabulary (the keywords that are not found in the vocabulary). Currently, the keywords not found in the vocabulary are shown in the debug log.
The unsupervised approach can also be useful in some cases, because no training data is needed.
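The index lookup and the handling of unmatched keywords described above can be sketched roughly as follows. This is a minimal illustration, not the backend's actual code: `toy_lemmatize` is a hypothetical stand-in for a real analyzer, the example URIs and labels are invented, and "sorted alphabetically" is interpreted here as sorting the words inside each phrase.

```python
from collections import defaultdict


def toy_lemmatize(word):
    """Hypothetical stand-in for a real analyzer: lowercase, strip a plural 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def normalize(phrase):
    """Lemmatize each word, then sort the words alphabetically (assumed reading)."""
    return " ".join(sorted(toy_lemmatize(tok) for tok in phrase.split()))


def build_index(labels):
    """Map each normalized label to the set of concept URIs that carry it."""
    index = defaultdict(set)
    for uri, label in labels:
        index[normalize(label)].add(uri)
    return index


def match_keywords(keywords, index):
    """Split extracted keywords into matched concepts and unmatched keywords."""
    matched, unmatched = {}, []
    for keyword in keywords:
        uris = index.get(normalize(keyword))
        if uris:
            matched[keyword] = sorted(uris)
        else:
            # Candidates for new vocabulary terms
            unmatched.append(keyword)
    return matched, unmatched


# Invented example data
labels = [
    ("http://example.org/c1", "information systems"),
    ("http://example.org/c2", "maps"),
]
index = build_index(labels)
matched, unmatched = match_keywords(
    ["system information", "Maps", "keyword extraction"], index
)
```

In this sketch the keyword "system information" matches the label "information systems" because both normalize to the same string, while "keyword extraction" ends up in `unmatched`, which corresponds to the keywords the backend writes to the debug log.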
Please note that the YAKE library is licensed under GPLv3, while Annif is licensed under the Apache License 2.0. The licenses are compatible, but depending on legal interpretation, the terms of the GPLv3 (for example the requirement to publish corresponding source code when publishing an executable application) may be considered to apply to the whole of Annif+YAKE if you decide to install the optional YAKE dependency.
For installation see Optional features and dependencies.
A minimal configuration that relies on default values:
[yso-yake-en]
language=en
backend=yake
analyzer=snowball(english)
vocab=yso-en
For long texts it can be advantageous to use the limit transformation project setting to truncate the documents before passing them to YAKE. For Finnish theses and dissertations, good results can be achieved this way.

The label_types and remove_parentheses parameters are used for constructing the label index.
Note that the label index is created on the first suggest call for a project. If these parameters are changed after that, the update does not change the index; the project then needs to be reset with annif clear <project>. Resetting is also needed after a vocabulary update.
The other parameters are passed to YAKE when extracting keywords. For a detailed description of them and of the YAKE algorithm, see the article by R. Campos et al.
| Parameter | Description |
|---|---|
| label_types | SKOS label types to use in matching. Values are given in a comma separated list of |
| remove_parentheses | Whether to remove parts of SKOS labels inside parentheses (a specifier for a label, e.g. |
| window_size | Distance (in number of tokens) considered when computing co-occurrences of tokens. Defaults to 1. |
| max_ngram_size | Maximum number of consecutive words to use in forming candidate keywords. Defaults to 4. |
| deduplication_algo | Algorithm to measure the similarity of candidate keywords for deduplication: |
| deduplication_threshold | Threshold for the value of the similarity measure for deduplication. Defaults to 0.9. |
| num_keywords | Limit for the number of keywords that YAKE extracts. Defaults to 100. |
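To illustrate how the extraction parameters above might be handed over to YAKE, the following sketch converts string-valued settings into keyword arguments for `yake.KeywordExtractor`. The helper names `yake_kwargs` and `extract` are hypothetical, and the keyword argument names (`lan`, `n`, `dedupLim`, `dedupFunc`, `windowsSize`, `top`) are assumed to follow the constructor of the YAKE 0.4 series; other YAKE versions may name them differently, so check the installed version before relying on this.

```python
def yake_kwargs(params, language):
    """Convert string-valued settings into (assumed) KeywordExtractor arguments."""
    kwargs = {
        "lan": language,
        "n": int(params["max_ngram_size"]),
        "dedupLim": float(params["deduplication_threshold"]),
        "windowsSize": int(params["window_size"]),
        "top": int(params["num_keywords"]),
    }
    # No default is assumed for the deduplication algorithm in this sketch;
    # it is passed on only when explicitly configured.
    if "deduplication_algo" in params:
        kwargs["dedupFunc"] = params["deduplication_algo"]
    return kwargs


def extract(text, kwargs):
    """Run YAKE on a text; requires the optional yake package."""
    import yake  # imported lazily because the dependency is optional

    extractor = yake.KeywordExtractor(**kwargs)
    # Returns a list of (keyword, score) pairs
    return extractor.extract_keywords(text)


# Default values as listed in the parameter table
defaults = {
    "window_size": "1",
    "max_ngram_size": "4",
    "deduplication_threshold": "0.9",
    "num_keywords": "100",
}
kwargs = yake_kwargs(defaults, "en")
```

Keeping the conversion separate from the extraction call makes it possible to inspect the resulting arguments without having YAKE installed.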
Load a vocabulary:
annif loadvoc yso-yake-en /path/to/Annif-corpora/vocab/yso-skos.ttl
Training is neither necessary nor possible. Test the model with a single document:
cat document.txt | annif suggest yso-yake-en
Evaluate a directory full of files in fulltext document corpus format:
annif eval yso-yake-en /path/to/documents/