Fix/optimize vectorisation #55

lpi-tn · 2025-07-08T13:41:56Z

This pull request improves the document vectorization workflow, makes it more configurable, and optimizes database operations. Key updates: a new ST_BACKEND parameter that selects the SentenceTransformer backend, a change to embedding model loading so that the backend is passed through, and a switch to bulk database operations for better performance.

Workflow Configuration Enhancements:

Added a new st_backend parameter with default values (onnx or torch) in cron-workflow.yaml and workflow-template-document-vectorizer.yaml to configure the backend for SentenceTransformer operations. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-195276bf211a41586b97e72cb50a98692eeab3c32b32390542d941e82b3e950dR57-R58), [[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-859552c85c1e5b3b6a4762acce2f720dc026699172dd49ab3736cdf889492530R46-R47))
Updated the corresponding parameter mappings (ST_BACKEND) in the workflow templates to ensure the backend configuration is passed correctly. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-195276bf211a41586b97e72cb50a98692eeab3c32b32390542d941e82b3e950dR117-R119), [[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-859552c85c1e5b3b6a4762acce2f720dc026699172dd49ab3736cdf889492530R83-R85))

Embedding Model Updates:

Introduced the ST_BACKEND environment variable in test_embedding_model_helpers.py and added backend validation in embedding_model_helpers.py to support torch, onnx, and openvino backends. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-b2a05290b6cc863622270477243d69325f82eec1557511d86ebbac8a5183da4eL21-R21), [[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-17bcf515f80dd27693f72294b08949c99c319d7dbc2fbe9013bb74cdd5c20cd6R87-R96))
Modified the load_embedding_model function to include the backend parameter when initializing the SentenceTransformer. ([welearn_datastack/modules/embedding_model_helpers.pyR106](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-17bcf515f80dd27693f72294b08949c99c319d7dbc2fbe9013bb74cdd5c20cd6R106))
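
The backend handling described above could look roughly like the sketch below. It is not the code from embedding_model_helpers.py: the helper names `resolve_st_backend` and `load_model_with_backend`, the `torch` fallback when ST_BACKEND is unset, and the `ValueError` are assumptions made for illustration. Only the three backend names and the fact that a backend is passed to SentenceTransformer come from this description.

```python
import os

# Backends named in the PR description.
SUPPORTED_BACKENDS = ("torch", "onnx", "openvino")


def resolve_st_backend(default: str = "torch") -> str:
    """Read ST_BACKEND from the environment and validate it.

    Hypothetical helper: the name, the fallback value and the error type
    are assumptions, not taken from the repository.
    """
    backend = os.environ.get("ST_BACKEND", default).strip().lower()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported ST_BACKEND {backend!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    return backend


def load_model_with_backend(model_name: str):
    """Load a SentenceTransformer using the validated backend (hypothetical wrapper)."""
    # Imported lazily so the validation above works without the library installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name, backend=resolve_st_backend())
```

Validating before the model is constructed means a typo in the workflow parameter fails fast with a readable message instead of surfacing later inside the model loader.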

Database Optimization:

Refactored database operations in document_vectorizer.py to use bulk saving for slices and process states, reducing the number of individual database commits and improving performance. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-901e13e14e140e484d205e2c0cb27a6cf0f1287560dabc68241fd91e113ac2d8R76-R77), [[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-901e13e14e140e484d205e2c0cb27a6cf0f1287560dabc68241fd91e113ac2d8R92-R146))
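
To illustrate the idea of collecting rows and committing once, here is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in. It does not reproduce document_vectorizer.py: the table names, column names, the `"vectorized"` state label and both function names are hypothetical, and the project's real database layer may expose different bulk-save calls.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE slice (doc_id TEXT, position INTEGER, body TEXT)")
    conn.execute("CREATE TABLE process_state (doc_id TEXT, title TEXT)")
    return conn


def save_in_bulk(conn: sqlite3.Connection, documents: dict[str, list[str]]) -> int:
    """Collect every slice and process state first, then write them in one transaction."""
    slices: list[tuple[str, int, str]] = []
    states: list[tuple[str, str]] = []

    for doc_id, chunks in documents.items():
        for position, body in enumerate(chunks):
            slices.append((doc_id, position, body))
        # One state row per document; the label is a placeholder.
        states.append((doc_id, "vectorized"))

    # The connection context manager commits once on success and rolls back on error.
    with conn:
        conn.executemany(
            "INSERT INTO slice (doc_id, position, body) VALUES (?, ?, ?)", slices
        )
        conn.executemany(
            "INSERT INTO process_state (doc_id, title) VALUES (?, ?)", states
        )
    return len(slices)
```

Compared with committing after each document, this issues a single commit for the whole batch, which is the effect the refactoring above is aiming for. The trade-off is that a failure rolls back the entire batch rather than one document.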

Dependency Addition:

Added the optimum library with the onnxruntime extra to the pyproject.toml file to support ONNX-based operations. ([pyproject.tomlR48](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711R48))

Copilot

Pull Request Overview

This PR adds configurable backend support for sentence embeddings, refactors embedding model loading, and optimizes database writes by batching operations.

Introduced st_backend parameter in workflow and cron templates, mapped to ST_BACKEND.
Updated load_embedding_model to validate and pass a backend to SentenceTransformer.
Refactored document_vectorizer.py to collect slices and process states into lists and perform bulk saves.
Added optimum[onnxruntime] as a dependency.

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| welearn_datastack/nodes_workflow/DocumentVectorizer/document_vectorizer.py | Switched to bulk collecting and saving slices and process states |
| welearn_datastack/modules/embedding_model_helpers.py | Added `ST_BACKEND` handling, validation, and passed backend to model |
| tests/document_vectorizer/test_embedding_model_helpers.py | Updated setup to include `ST_BACKEND` |
| pyproject.toml | Added `optimum` dependency with `onnxruntime` extra |
| k8s/.../workflow-template-document-vectorizer.yaml | Added `st_backend` input and mapped to `ST_BACKEND` |
| k8s/.../cron-workflow.yaml | Added `st_backend` input and mapped to `ST_BACKEND` |

Comments suppressed due to low confidence (1)

tests/document_vectorizer/test_embedding_model_helpers.py:21

[nitpick] The tests only set one backend and don’t cover invalid values or other supported backends (torch, openvino). Consider adding tests for default behavior, valid backends, and failure on unsupported backends.

        os.environ["ST_BACKEND"] = "onnx"
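
The three cases suggested in this comment (default behavior, each valid backend, failure on an unsupported value) could be covered along the lines of the sketch below. The function `backend_from_env` is a stand-in written for this example, with an assumed `torch` default and an assumed `ValueError`; it is not the project's validation code, and a real test would import the project's function instead.

```python
import os
import unittest
from unittest import mock

SUPPORTED_BACKENDS = {"torch", "onnx", "openvino"}


def backend_from_env() -> str:
    """Stand-in for the project's validation logic (hypothetical name and behavior)."""
    backend = os.environ.get("ST_BACKEND", "torch")
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported ST_BACKEND: {backend!r}")
    return backend


class BackendFromEnvTest(unittest.TestCase):
    def test_default_when_unset(self):
        # clear=True removes ST_BACKEND for the duration of the block.
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertEqual(backend_from_env(), "torch")

    def test_each_supported_backend(self):
        for name in sorted(SUPPORTED_BACKENDS):
            with self.subTest(backend=name):
                with mock.patch.dict(os.environ, {"ST_BACKEND": name}):
                    self.assertEqual(backend_from_env(), name)

    def test_unsupported_backend_raises(self):
        with mock.patch.dict(os.environ, {"ST_BACKEND": "tensorflow"}):
            with self.assertRaises(ValueError):
                backend_from_env()
```

Using `mock.patch.dict` restores the environment after each test, so the cases do not leak ST_BACKEND into one another. The tests can be run with `python -m unittest` on the containing module.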

welearn_datastack/modules/embedding_model_helpers.py

welearn_datastack/nodes_workflow/DocumentVectorizer/document_vectorizer.py

welearn_datastack/modules/embedding_model_helpers.py

welearn_datastack/nodes_workflow/DocumentVectorizer/document_vectorizer.py

lpi-tn added 9 commits July 8, 2025 13:39

backend support

ff49d26

backend support

237b8f3

backend support

cca15e3

backend support

72b880a

backend support

e0a666d

backend support

b1524a8

backend support + batch mode

7b3e2e9

backend support + batch mode

3daade6

kept for trace in bulk too

9d5cb84

lpi-tn requested review from Nastaliss, Copilot and sandragjacinto July 8, 2025 13:41

onnx

e520d2b

Copilot AI reviewed Jul 8, 2025

Nastaliss reviewed Jul 8, 2025

welearn_datastack/modules/embedding_model_helpers.py (outdated, resolved)

sandragjacinto reviewed Jul 8, 2025

welearn_datastack/nodes_workflow/DocumentVectorizer/document_vectorizer.py (outdated, resolved)

lpi-tn added 2 commits July 8, 2025 15:47

rm comments

7d9a637

rm useless test

ae47cb9

sandragjacinto approved these changes Jul 8, 2025

type hint

a67583c

lpi-tn merged commit 9a2f379 into main Jul 8, 2025
6 checks passed

lpi-tn deleted the Fix/optimize-vectorisation branch July 8, 2025 13:57
