This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

add ner k8s #643

Open
wants to merge 36 commits into
base: master

drsantos89
Contributor

@drsantos89 drsantos89 commented Oct 26, 2022

Description

Adds a function that runs NER and stores its output.
NER is run remotely using the deployment on Kubernetes. It supports both ML and RULE-based approaches.
The NER output and model version are stored in ES.
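
As a rough illustration of how the NER output and model version might sit side by side in ES, here is a minimal sketch. The field names (`ner_ml_json`, `ner_ml_version`, `ner_rules_json`, `ner_rules_version`) and the helper `build_update_body` are hypothetical and not taken from this PR; the `flattened` type follows the notes in this description.

```python
from typing import Any

# Hypothetical field names: the mapping actually used in the PR may differ.
NER_MAPPING: dict[str, Any] = {
    "properties": {
        "ner_ml_json": {"type": "flattened"},      # raw JSON output of the ML model
        "ner_ml_version": {"type": "keyword"},     # version of the ML model that produced it
        "ner_rules_json": {"type": "flattened"},   # raw JSON output of the RULE model
        "ner_rules_version": {"type": "keyword"},  # version of the RULE model
    }
}


def build_update_body(prefix: str, ner_output: Any, version: str) -> dict[str, Any]:
    """Build a partial-document body in the shape used by the ES update API."""
    return {
        "doc": {
            f"{prefix}_json": ner_output,
            f"{prefix}_version": version,
        }
    }
```

Storing the version next to the output is what makes it possible to detect paragraphs processed by an older model later on.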

Notes

  • One function, handle_conflits, currently exists in two repositories (this PR and a repo on GitLab). It may be worth keeping it only in BlueSearch and importing it from there in the other repository.
  • The remote models are relatively slow when running 1 sample at a time (~1-2 paragraphs/s for ML and ~15 paragraphs/s for RULE). Load testing the remote models with Locust showed a maximum throughput of ~15 and ~50 for the ML and RULE models, respectively. A multiprocessing option was therefore added.
  • pool.apply_async pickles its arguments to send them to the worker process. The client object is not serializable, so it must be instantiated inside the worker function if required.
  • By default, the function only updates paragraphs that have no NER output yet or whose output comes from an outdated model version. There is an option to force the update of every paragraph.
  • The JSON output of both models is saved in an ES field of type flattened. ("This data type can be useful for indexing objects with a large or unknown number of unique keys. Only one field mapping is created for the whole JSON object, which can help prevent a mappings explosion (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html#mapping-limit-settings) from having too many distinct field mappings.")
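
The multiprocessing and update-selection notes above can be sketched roughly as follows. This is not the code from the PR: `FakeNerClient`, `needs_update`, `process_paragraph`, `run_parallel`, and the field names `ner_ml_json` / `ner_ml_version` are all hypothetical, and the "remote call" is a local placeholder. The point is only that plain strings cross the process boundary while the client is built inside the worker.

```python
import multiprocessing as mp
import threading
from typing import Any


class FakeNerClient:
    """Hypothetical stand-in for a client that cannot be pickled."""

    def __init__(self, url: str) -> None:
        self.url = url
        self._lock = threading.Lock()  # lock objects cannot be pickled

    def run_ner(self, text: str) -> list[dict[str, str]]:
        # Placeholder for the remote call: treats title-cased words as entities.
        return [{"entity": word} for word in text.split() if word.istitle()]


def needs_update(doc: dict[str, Any], version: str, force: bool = False) -> bool:
    """True if the paragraph has no NER output or an outdated model version."""
    if force:
        return True
    return doc.get("ner_ml_json") is None or doc.get("ner_ml_version") != version


def process_paragraph(
    url: str, paragraph_id: str, text: str
) -> tuple[str, list[dict[str, str]]]:
    # The client is created here, inside the worker process,
    # so it never has to be serialized.
    client = FakeNerClient(url)
    return paragraph_id, client.run_ner(text)


def run_parallel(
    url: str,
    docs: dict[str, dict[str, Any]],
    version: str,
    processes: int = 2,
    force: bool = False,
) -> dict[str, list[dict[str, str]]]:
    # Select only the paragraphs that actually need (re)processing.
    todo = {
        pid: doc["text"]
        for pid, doc in docs.items()
        if needs_update(doc, version, force)
    }
    with mp.Pool(processes=processes) as pool:
        pending = [
            pool.apply_async(process_paragraph, (url, pid, text))
            for pid, text in todo.items()
        ]
        # Collect results before the pool is torn down.
        return dict(result.get() for result in pending)
```

When using this pattern in a script, `run_parallel` should be called under an `if __name__ == "__main__":` guard so that worker processes started with the spawn or forkserver methods do not re-execute it on import.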

How to test?

tests/unit/k8s/test_ner.py

Checklist

  • Unit tests added.
    (if needed)
  • Type annotations added.
    (if a function is added or modified)
  • All CI tests pass.

@drsantos89 drsantos89 marked this pull request as ready for review October 31, 2022 14:37
@drsantos89 drsantos89 marked this pull request as draft October 31, 2022 14:39
@drsantos89 drsantos89 marked this pull request as ready for review November 2, 2022 09:06

2 participants