YAKE backend #461

Merged · 57 commits · May 14, 2021

Commits
7c4fad8
Initial YAKE integration commit
juhoinkinen Jan 5, 2021
45b18a6
Cleanup & pep8 fixes
juhoinkinen Jan 12, 2021
0995535
Increase keyphrase word number to 4
juhoinkinen Jan 12, 2021
1c6e0e8
Use sets of uris instead of lists of uris in index
juhoinkinen Jan 12, 2021
4bf08da
Also put alt and hidden labels in index instead of just prefs
juhoinkinen Jan 12, 2021
85eb8f7
More straightforward score transformation
juhoinkinen Jan 12, 2021
5a97da5
Shorten & simplify code
juhoinkinen Jan 12, 2021
583272a
Remove unused import
juhoinkinen Jan 12, 2021
db9d4b0
Load graph using as_graph method of the vocab module
juhoinkinen Jan 26, 2021
b697918
Configurable label types for index creation
juhoinkinen Jan 26, 2021
630af04
Don't unnecessarily pass a label to SubjectSuggestion
juhoinkinen Jan 26, 2021
44611e9
Replace negative Yake scores with zero
juhoinkinen Jan 28, 2021
4169945
Omit processing vocabulary label (it's fetched when creating ListSugg…
juhoinkinen Jan 28, 2021
ea8d005
Improve variable names; remove unnecessary comments
juhoinkinen Jan 28, 2021
3d0afb9
Get language for label picking from params (to ease testing)
juhoinkinen Jan 29, 2021
f9354a0
Config switch for removing a specifier in parenthesis from index labels
juhoinkinen Jan 29, 2021
7116aa0
Test for suggest method of Yake
juhoinkinen Jan 29, 2021
154ae55
Skip Yake tests when Yake is not installed (Python 3.7 in Travis)
juhoinkinen Jan 29, 2021
9ce122f
Try to simplify _create_index method; reorder methods
juhoinkinen Feb 1, 2021
3651488
Better name for option (remove_parentheses)
juhoinkinen Feb 3, 2021
4b5d730
Install Yake from PyPI, not from GitHub
juhoinkinen Feb 5, 2021
b3578ef
Get labels of a concept using a helper method in index creation
juhoinkinen Feb 5, 2021
64e015a
Return index from create_index() instead setting the index field in t…
juhoinkinen Feb 5, 2021
7741097
Combine scores using "additive conflation"
juhoinkinen Feb 5, 2021
5bde900
Create Yake object on suggest (allows setting Yake params on runtime)
juhoinkinen Feb 5, 2021
b2c08cf
Avoid crash for empty or non-alphanumeric input
juhoinkinen Feb 5, 2021
daee74b
Add unit tests
juhoinkinen Feb 5, 2021
c19e8b7
Make graph_project a common pytest fixture (move it to conftest.py)
juhoinkinen Feb 11, 2021
af1129e
Avoid need for clumsy mapping for labeltypes by using directly SKOS n…
juhoinkinen Feb 11, 2021
dca4474
Avoid need for "default_label_types" name for defaults
juhoinkinen Feb 11, 2021
2917628
Refactor attempting to resolve complexity complaints by CodeClimate
juhoinkinen Feb 12, 2021
6a3aebd
Add test for invalid label types
juhoinkinen Feb 12, 2021
b06ca9c
Remove pointless test
juhoinkinen Feb 12, 2021
79e225b
Test for removing parentheses from label when creating index
juhoinkinen Feb 12, 2021
f091119
Add methods for accessing SKOS concepts & labels via AnnifVocabulary
juhoinkinen Feb 17, 2021
fb0b7b5
Access SKOS concepts & labels via AnnifVocabulary in Yake
juhoinkinen Feb 17, 2021
d1c2af5
Access SKOS graph via skos_vocab in AnnifVocabulary
juhoinkinen Feb 17, 2021
f080d02
Reduce code duplication by using the skos_concepts property
juhoinkinen Feb 17, 2021
1fdf803
Update YAKE to 0.4.5 (eliminates warnings on input with no keywords)
juhoinkinen Mar 10, 2021
2fb2244
Install Yake in GH Actions jobs for unit tests for Python 3.6 & 3.8
juhoinkinen Mar 17, 2021
0b1cacd
Remove condition and debug message for neg. Yake score, use max() ins…
juhoinkinen Mar 17, 2021
975b1bc
Adapt to current master: remove unnecessary skos_project fixture
juhoinkinen Apr 30, 2021
f354015
Adapt to current master: altLabels in archaeology corpus have changed
juhoinkinen Apr 30, 2021
d83b9fb
Adapt to current master: use project fixture in test_stwfsa, remove g…
juhoinkinen Apr 30, 2021
315b474
Remove test for removing parentheses from labels
juhoinkinen Apr 30, 2021
c797b56
Implement get_skos_concept_labels using list comprehension
juhoinkinen Apr 30, 2021
fcacef8
Rename & refactor methods for SKOS vocabulary
juhoinkinen Apr 30, 2021
d6c2aa6
Rename method for accessing SKOS vocab as a file object
juhoinkinen Apr 30, 2021
6ddb557
Adjust license explanation
juhoinkinen Apr 30, 2021
27e0cdc
Better name and docstring for the property for accessing SKOS vocabulary
juhoinkinen May 5, 2021
338c9b6
Change log message for loading index to debug level
juhoinkinen May 5, 2021
d09e8e3
Readjust license explanation for Yake backend
juhoinkinen May 11, 2021
06fe6ef
Pass project's language to AnnifVocabulary and adapt fixtures as needed
juhoinkinen May 11, 2021
4d67d44
Rename lemmatize_phrase function to normalize_phrase
juhoinkinen May 12, 2021
9597526
Use atomic_save for saving YAKE index
juhoinkinen May 12, 2021
2dafa54
Adjust license explanation comment to point to license section in REA…
juhoinkinen May 12, 2021
9a2127a
Truncate long log messages for objects to be saved
juhoinkinen May 12, 2021
4 changes: 2 additions & 2 deletions .github/workflows/python-package.yml
@@ -32,9 +32,9 @@ jobs:
 # Install the optional neural network dependencies (TensorFlow and LMDB)
 # - except for one Python version (3.7) so that we can test also without them
 if [[ ${{ matrix.python-version }} != '3.7' ]]; then pip install .[nn]; fi
-# Install the optional Omikuji dependency
+# Install the optional Omikuji and YAKE dependencies
 # - except for one Python version (3.7) so that we can test also without them
-if [[ ${{ matrix.python-version }} != '3.7' ]]; then pip install .[omikuji]; fi
+if [[ ${{ matrix.python-version }} != '3.7' ]]; then pip install .[omikuji,yake]; fi
 # Install the optional fastText dependencies for Python 3.7 only
 if [[ ${{ matrix.python-version }} == '3.7' ]]; then pip install .[fasttext]; fi
 # For Python 3.6
9 changes: 8 additions & 1 deletion README.md
@@ -133,4 +133,11 @@ Zenodo DOI:

 The code in this repository is licensed under Apache License 2.0, except for the
 dependencies included under `annif/static/css` and `annif/static/js`,
-which have their own licenses. See the file headers for details.
+which have their own licenses; see the file headers for details.
+Please note that the [YAKE](https://github.com/LIAAD/yake) library is licensed
+under [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt), while Annif is
+licensed under the Apache License 2.0. The licenses are compatible, but
+depending on legal interpretation, the terms of the GPLv3 (for example the
+requirement to publish corresponding source code when publishing an executable
+application) may be considered to apply to the whole of Annif+Yake if you
+decide to install the optional Yake dependency.
6 changes: 6 additions & 0 deletions annif/backend/__init__.py
@@ -60,3 +60,9 @@ def get_backend(backend_id):
     register_backend(omikuji.OmikujiBackend)
 except ImportError:
     annif.logger.debug("Omikuji not available, not enabling omikuji backend")
+
+try:
+    from . import yake
+    register_backend(yake.YakeBackend)
+except ImportError:
+    annif.logger.debug("YAKE not available, not enabling yake backend")
2 changes: 1 addition & 1 deletion annif/backend/maui.py
@@ -101,7 +101,7 @@ def _upload_vocabulary(self, params):
         json = {}
         try:
             resp = requests.put(self.tagger_url(params) + '/vocab',
-                                data=self.project.vocab.as_skos())
+                                data=self.project.vocab.as_skos_file())
             try:
                 json = resp.json()
             except ValueError:
184 changes: 184 additions & 0 deletions annif/backend/yake.py
@@ -0,0 +1,184 @@
"""Annif backend using Yake keyword extraction"""
# For license remarks of this backend see README.md:
# https://github.com/NatLibFi/Annif#license.

import yake
import joblib
import os.path
import re
from collections import defaultdict
from rdflib.namespace import SKOS
import annif.util
from . import backend
from annif.suggestion import SubjectSuggestion, ListSuggestionResult
from annif.exception import ConfigurationException


class YakeBackend(backend.AnnifBackend):
    """Yake based backend for Annif"""
    name = "yake"
    needs_subject_index = False

    # defaults for uninitialized instances
    _index = None
    _graph = None
    INDEX_FILE = 'yake-index'

    DEFAULT_PARAMETERS = {
        'max_ngram_size': 4,
        'deduplication_threshold': 0.9,
        'deduplication_algo': 'levs',
        'window_size': 1,
        'num_keywords': 100,
        'features': None,
        'label_types': ['prefLabel', 'altLabel'],
        'remove_parentheses': False
    }

    def default_params(self):
        params = backend.AnnifBackend.DEFAULT_PARAMETERS.copy()
        params.update(self.DEFAULT_PARAMETERS)
        return params

    @property
    def is_trained(self):
        return True

    @property
    def label_types(self):
        if type(self.params['label_types']) == str:  # Label types set by user
            label_types = [lt.strip() for lt
                           in self.params['label_types'].split(',')]
            self._validate_label_types(label_types)
        else:
            label_types = self.params['label_types']  # The defaults
        return [getattr(SKOS, lt) for lt in label_types]

    def _validate_label_types(self, label_types):
        for lt in label_types:
            if lt not in ('prefLabel', 'altLabel', 'hiddenLabel'):
                raise ConfigurationException(
                    f'invalid label type {lt}', backend_id=self.backend_id)

    def initialize(self):
        self._initialize_index()

    def _initialize_index(self):
        if self._index is None:
            path = os.path.join(self.datadir, self.INDEX_FILE)
            if os.path.exists(path):
                self._index = joblib.load(path)
                self.debug(
                    f'Loaded index from {path} with {len(self._index)} labels')
            else:
                self.info('Creating index')
                self._index = self._create_index()
                self._save_index(path)
                self.info(f'Created index with {len(self._index)} labels')

    def _save_index(self, path):
        annif.util.atomic_save(
            self._index,
            self.datadir,
            self.INDEX_FILE,
            method=joblib.dump)

    def _create_index(self):
        index = defaultdict(set)
        skos_vocab = self.project.vocab.skos
        for concept in skos_vocab.concepts:
            uri = str(concept)
            labels = skos_vocab.get_concept_labels(
                concept, self.label_types, self.params['language'])
            for label in labels:
                label = self._normalize_label(label)
                index[label].add(uri)
        index.pop('', None)  # Remove possible empty string entry
        return dict(index)

    def _normalize_label(self, label):
        label = str(label)
        if annif.util.boolean(self.params['remove_parentheses']):
            label = re.sub(r' \(.*\)', '', label)
        normalized_label = self._normalize_phrase(label)
        return self._sort_phrase(normalized_label)

    def _normalize_phrase(self, phrase):
        normalized = []
        for word in phrase.split():
            normalized.append(
                self.project.analyzer.normalize_word(word).lower())
        return ' '.join(normalized)

    def _sort_phrase(self, phrase):
        words = phrase.split()
        return ' '.join(sorted(words))

    def _suggest(self, text, params):
        self.debug(
            f'Suggesting subjects for text "{text[:20]}..." (len={len(text)})')
        limit = int(params['limit'])

        self._kw_extractor = yake.KeywordExtractor(
            lan=params['language'],
            n=int(params['max_ngram_size']),
            dedupLim=float(params['deduplication_threshold']),
            dedupFunc=params['deduplication_algo'],
            windowsSize=int(params['window_size']),
            top=int(params['num_keywords']),
            features=self.params['features'])
        keyphrases = self._kw_extractor.extract_keywords(text)
        suggestions = self._keyphrases2suggestions(keyphrases)

        subject_suggestions = [SubjectSuggestion(
            uri=uri,
            label=None,
            notation=None,
            score=score)
            for uri, score in suggestions[:limit] if score > 0.0]
        return ListSuggestionResult.create_from_index(subject_suggestions,
                                                      self.project.subjects)

    def _keyphrases2suggestions(self, keyphrases):
        suggestions = []
        not_matched = []
        for kp, score in keyphrases:
            uris = self._keyphrase2uris(kp)
            for uri in uris:
                suggestions.append(
                    (uri, self._transform_score(score)))
            if not uris:
                not_matched.append((kp, self._transform_score(score)))
        # Remove duplicate uris, conflating the scores
        suggestions = self._combine_suggestions(suggestions)
        # [Reviewer comment from a project member on the debug message below:
        # "In a future version, I think these non-matched keyphrases should be
        # propagated back to the user as well, but it could be done in a
        # subsequent PR as it requires a lot more scaffolding."]
        self.debug('Keyphrases not matched:\n' + '\t'.join(
            [kp[0] + ' ' + str(kp[1]) for kp
             in sorted(not_matched, reverse=True, key=lambda kp: kp[1])]))
        return suggestions

    def _keyphrase2uris(self, keyphrase):
        keyphrase = self._normalize_phrase(keyphrase)
        keyphrase = self._sort_phrase(keyphrase)
        return self._index.get(keyphrase, [])

    def _transform_score(self, score):
        score = max(score, 0)
        return 1.0 / (score + 1)

    def _combine_suggestions(self, suggestions):
        combined_suggestions = {}
        for uri, score in suggestions:
            if uri not in combined_suggestions:
                combined_suggestions[uri] = score
            else:
                old_score = combined_suggestions[uri]
                combined_suggestions[uri] = self._combine_scores(
                    score, old_score)
        return list(combined_suggestions.items())

    def _combine_scores(self, score1, score2):
        # The result is never smaller than the greater input
        score1 = score1/2 + 0.5
        score2 = score2/2 + 0.5
        confl = score1 * score2 / (score1 * score2 + (1-score1) * (1-score2))
        return (confl-0.5) * 2
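
For reference, a standalone sketch of the scoring math above (not part of the diff): Yake scores are "lower is better", so _transform_score flips them into the (0, 1] range, and _combine_scores implements the "additive conflation" from commit 7741097, which never returns less than the greater of its two inputs.

def transform_score(score):
    # Clamp negative Yake scores to zero (commit 44611e9), then invert: a
    # Yake score of 0 (the best possible) maps to 1.0, larger scores to ~0
    score = max(score, 0)
    return 1.0 / (score + 1)

def combine_scores(score1, score2):
    # Map [0, 1] scores into [0.5, 1], conflate, and map back to [0, 1]
    score1 = score1 / 2 + 0.5
    score2 = score2 / 2 + 0.5
    confl = score1 * score2 / (score1 * score2 + (1 - score1) * (1 - score2))
    return (confl - 0.5) * 2

assert transform_score(-0.2) == 1.0    # negative scores clamped to the maximum
assert abs(transform_score(1.0) - 0.5) < 1e-12
assert combine_scores(0.5, 0.8) > 0.8  # never below the greater input
print(combine_scores(0.5, 0.8))        # ~0.93
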
17 changes: 14 additions & 3 deletions annif/corpus/skos.py
@@ -35,9 +35,7 @@ def __init__(self, path, language):

     @property
     def subjects(self):
-        for concept in self.graph.subjects(RDF.type, SKOS.Concept):
-            if (concept, OWL.deprecated, rdflib.Literal(True)) in self.graph:
-                continue
+        for concept in self.concepts:
             labels = self.graph.preferredLabel(concept, lang=self.language)
             notation = self.graph.value(concept, SKOS.notation, None, any=True)
             if not labels:
@@ -48,6 +46,19 @@ def subjects(self):
             yield Subject(uri=str(concept), label=label, notation=notation,
                           text=None)

+    @property
+    def concepts(self):
+        for concept in self.graph.subjects(RDF.type, SKOS.Concept):
+            if (concept, OWL.deprecated, rdflib.Literal(True)) in self.graph:
+                continue
+            yield concept
+
+    def get_concept_labels(self, concept, label_types, language):
+        return [str(label)
+                for label_type in label_types
+                for label in self.graph.objects(concept, label_type)
+                if label.language == language]
+
     @staticmethod
     def is_rdf_file(path):
         """return True if the path looks like an RDF file that can be loaded
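
A minimal usage sketch of the two new helpers (hypothetical file name and language, assuming a SKOS/Turtle vocabulary on disk; not part of the diff):

from rdflib.namespace import SKOS
from annif.corpus import SubjectFileSKOS

vocab = SubjectFileSKOS('subjects.ttl', 'en')
for concept in vocab.concepts:  # owl:deprecated concepts are skipped
    # prefLabels and altLabels of the concept in the requested language
    labels = vocab.get_concept_labels(
        concept, [SKOS.prefLabel, SKOS.altLabel], 'en')
    print(str(concept), labels)
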
3 changes: 2 additions & 1 deletion annif/project.py
@@ -143,7 +143,8 @@ def vocab(self):
             raise ConfigurationException("vocab setting is missing",
                                          project_id=self.project_id)
         self._vocab = annif.vocab.AnnifVocabulary(self.vocab_id,
-                                                  self._base_datadir)
+                                                  self._base_datadir,
+                                                  self.language)
         return self._vocab

     @property
2 changes: 1 addition & 1 deletion annif/util.py
@@ -19,7 +19,7 @@ def atomic_save(obj, dirname, filename, method=None):
     tempfd, tempfilename = tempfile.mkstemp(
         prefix=prefix, suffix=suffix, dir=dirname)
     os.close(tempfd)
-    logger.debug('saving %s to temporary file %s', str(obj), tempfilename)
+    logger.debug('saving %s to temporary file %s', str(obj)[:90], tempfilename)
     if method is not None:
         method(obj, tempfilename)
     else:
27 changes: 18 additions & 9 deletions annif/vocab.py
@@ -1,7 +1,6 @@
 """Vocabulary management functionality for Annif"""

 import os.path
-import rdflib.graph
 import annif
 import annif.corpus
 import annif.util
@@ -18,9 +17,11 @@ class AnnifVocabulary(DatadirMixin):
     # defaults for uninitialized instances
     _subjects = None

-    def __init__(self, vocab_id, datadir):
+    def __init__(self, vocab_id, datadir, language):
         DatadirMixin.__init__(self, datadir, 'vocabs', vocab_id)
         self.vocab_id = vocab_id
+        self.language = language
+        self._skos_vocab = None

     def _create_subject_index(self, subject_corpus):
         self._subjects = annif.corpus.SubjectIndex(subject_corpus)
@@ -55,6 +56,19 @@ def subjects(self):
                 "subject file {} not found".format(path))
         return self._subjects

+    @property
+    def skos(self):
+        """return the subject vocabulary from SKOS file"""
+        if self._skos_vocab is None:
+            path = os.path.join(self.datadir, 'subjects.ttl')
+            if os.path.exists(path):
+                logger.debug(f'loading graph from {path}')
+                self._skos_vocab = annif.corpus.SubjectFileSKOS(path,
+                                                                self.language)
+            else:
+                raise NotInitializedException(f'graph file {path} not found')
+        return self._skos_vocab
+
     def load_vocabulary(self, subject_corpus, language):
         """load subjects from a subject corpus and save them into a
         SKOS/Turtle file for later use"""
@@ -67,15 +81,10 @@ def load_vocabulary(self, subject_corpus, language):
         subject_corpus.save_skos(os.path.join(self.datadir, 'subjects.ttl'),
                                  language)

-    def as_skos(self):
+    def as_skos_file(self):
         """return the vocabulary as a file object, in SKOS/Turtle syntax"""
         return open(os.path.join(self.datadir, 'subjects.ttl'), 'rb')

     def as_graph(self):
         """return the vocabulary as an rdflib graph"""
-        g = rdflib.graph.Graph()
-        g.load(
-            os.path.join(self.datadir, 'subjects.ttl'),
-            format='ttl'
-        )
-        return g
+        return self.skos.graph
8 changes: 8 additions & 0 deletions projects.cfg.dist
@@ -111,6 +111,14 @@ backend=omikuji
 analyzer=snowball(english)
 vocab=yso-en

+[yake-fi]
+name=YAKE Finnish
+language=fi
+backend=yake
+vocab=yso-fi
+analyzer=voikko(fi)
+input_limit=20000
+
 [ensemble-fi]
 name=Ensemble Finnish
 language=fi
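
Since the backend is always considered trained (its is_trained property returns True), a project configured like the above should be usable as soon as the yso-fi vocabulary has been loaded; for example, with something like `annif suggest yake-fi < document.txt`.
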
1 change: 1 addition & 0 deletions setup.py
@@ -44,6 +44,7 @@ def read(fname):
         'vw': ['vowpalwabbit==8.8.1'],
         'nn': ['tensorflow-cpu==2.3.1', 'lmdb==1.0.0'],
         'omikuji': ['omikuji==0.3.*'],
+        'yake': ['yake==0.4.5'],
         'dev': [
             'codecov',
             'pytest-cov',
2 changes: 1 addition & 1 deletion tests/conftest.py
@@ -68,7 +68,7 @@ def subject_file():

 @pytest.fixture(scope='module')
 def vocabulary(datadir):
-    vocab = annif.vocab.AnnifVocabulary('my-vocab', datadir)
+    vocab = annif.vocab.AnnifVocabulary('my-vocab', datadir, 'fi')
     subjfile = os.path.join(
         os.path.dirname(__file__),
         'corpora',