@lpi-tn commented Jul 8, 2025

This pull request introduces several changes to enhance the document vectorization workflow, improve configurability, and optimize database operations. Key updates include the addition of a new ST_BACKEND parameter to support different backend options, modifications in the embedding model loading process, and a shift to bulk database operations for better performance.

Workflow Configuration Enhancements:

  • Added a new st_backend parameter with default values (onnx or torch) in cron-workflow.yaml and workflow-template-document-vectorizer.yaml to configure the backend for SentenceTransformer operations. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-195276bf211a41586b97e72cb50a98692eeab3c32b32390542d941e82b3e950dR57-R58), [[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-859552c85c1e5b3b6a4762acce2f720dc026699172dd49ab3736cdf889492530R46-R47))
  • Updated the corresponding parameter mappings (ST_BACKEND) in the workflow templates to ensure the backend configuration is passed correctly. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-195276bf211a41586b97e72cb50a98692eeab3c32b32390542d941e82b3e950dR117-R119), [[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-859552c85c1e5b3b6a4762acce2f720dc026699172dd49ab3736cdf889492530R83-R85))

Embedding Model Updates:

  • Introduced the ST_BACKEND environment variable in test_embedding_model_helpers.py and added backend validation in embedding_model_helpers.py to support torch, onnx, and openvino backends. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-b2a05290b6cc863622270477243d69325f82eec1557511d86ebbac8a5183da4eL21-R21), [[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-17bcf515f80dd27693f72294b08949c99c319d7dbc2fbe9013bb74cdd5c20cd6R87-R96))
  • Modified the load_embedding_model function to include the backend parameter when initializing the SentenceTransformer. ([welearn_datastack/modules/embedding_model_helpers.pyR106](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-17bcf515f80dd27693f72294b08949c99c319d7dbc2fbe9013bb74cdd5c20cd6R106))
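The validation and backend pass-through described above could look roughly like the sketch below. It is not the code from the PR: the helper names `resolve_st_backend` and `load_model_with_backend`, the `env` argument, and the fallback to "torch" when ST_BACKEND is unset are all assumptions made for illustration. The only library call used is the `backend` argument of `SentenceTransformer`.

```python
import os

# Backends named in the PR description.
SUPPORTED_BACKENDS = ("torch", "onnx", "openvino")


def resolve_st_backend(env=None, default="torch"):
    """Read ST_BACKEND and reject values outside the supported set.

    Hypothetical helper: the name and the "torch" fallback are assumptions.
    """
    env = os.environ if env is None else env
    backend = env.get("ST_BACKEND", default)
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported ST_BACKEND {backend!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    return backend


def load_model_with_backend(model_name, env=None):
    """Hypothetical loader: pass the validated backend to SentenceTransformer."""
    # Imported lazily so the validation can run without the library installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name, backend=resolve_st_backend(env))
```

Validating before the model is constructed means a misconfigured workflow parameter fails with a clear message instead of an error from deep inside the model loader.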

Database Optimization:

  • Refactored database operations in document_vectorizer.py to use bulk saving for slices and process states, reducing the number of individual database commits and improving performance. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-901e13e14e140e484d205e2c0cb27a6cf0f1287560dabc68241fd91e113ac2d8R76-R77), [[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-901e13e14e140e484d205e2c0cb27a6cf0f1287560dabc68241fd91e113ac2d8R92-R146))
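To illustrate the idea of collecting rows and committing once, here is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in for the project's database layer, which this description does not show. The table names, column names, and the `save_in_bulk` function are hypothetical.

```python
import sqlite3


def save_in_bulk(conn, slices, states):
    """Write all collected slices and process states in a single transaction."""
    with conn:  # commits once on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO document_slice (document_id, body) VALUES (?, ?)", slices
        )
        conn.executemany(
            "INSERT INTO process_state (document_id, title) VALUES (?, ?)", states
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document_slice (document_id TEXT, body TEXT)")
conn.execute("CREATE TABLE process_state (document_id TEXT, title TEXT)")

# Accumulate rows while processing instead of committing per document.
slices, states = [], []
for doc_id, text in [("doc-1", "alpha beta"), ("doc-2", "gamma")]:
    slices.extend((doc_id, chunk) for chunk in text.split())
    states.append((doc_id, "vectorized"))

save_in_bulk(conn, slices, states)
```

The trade-off is that a failure now rolls back the whole batch rather than a single document, so the batch size is worth choosing deliberately.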

Dependency Addition:

  • Added the optimum library with the onnxruntime extra to the pyproject.toml file to support ONNX-based operations. ([pyproject.tomlR48](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711R48))

Copilot AI left a comment

Pull Request Overview

This PR adds configurable backend support for sentence embeddings, refactors embedding model loading, and optimizes database writes by batching operations.

  • Introduced st_backend parameter in workflow and cron templates, mapped to ST_BACKEND.
  • Updated load_embedding_model to validate and pass a backend to SentenceTransformer.
  • Refactored document_vectorizer.py to collect slices and process states into lists and perform bulk saves.
  • Added optimum[onnxruntime] as a dependency.

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| welearn_datastack/nodes_workflow/DocumentVectorizer/document_vectorizer.py | Switched to bulk collecting and saving slices and process states |
| welearn_datastack/modules/embedding_model_helpers.py | Added ST_BACKEND handling, validation, and passed backend to model |
| tests/document_vectorizer/test_embedding_model_helpers.py | Updated setup to include ST_BACKEND |
| pyproject.toml | Added optimum dependency with onnxruntime extra |
| k8s/.../workflow-template-document-vectorizer.yaml | Added st_backend input and mapped to ST_BACKEND |
| k8s/.../cron-workflow.yaml | Added st_backend input and mapped to ST_BACKEND |

Comments suppressed due to low confidence (1)

tests/document_vectorizer/test_embedding_model_helpers.py:21

  • [nitpick] The tests only set one backend and don’t cover invalid values or other supported backends (torch, openvino). Consider adding tests for default behavior, valid backends, and failure on unsupported backends.
        os.environ["ST_BACKEND"] = "onnx"
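
One way the suggested coverage could be laid out is sketched below with `unittest`. The `pick_backend` function is a hypothetical stand-in for the helper's validation logic, and the "torch" default is an assumption; a real test would import the project's own function instead.

```python
import unittest

SUPPORTED_BACKENDS = ("torch", "onnx", "openvino")


def pick_backend(env):
    """Hypothetical stand-in for the helper's ST_BACKEND validation."""
    backend = env.get("ST_BACKEND", "torch")  # assumed default
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported ST_BACKEND: {backend!r}")
    return backend


class TestBackendSelection(unittest.TestCase):
    def test_default_backend(self):
        # No ST_BACKEND set: the assumed default applies.
        self.assertEqual(pick_backend({}), "torch")

    def test_each_supported_backend(self):
        for name in SUPPORTED_BACKENDS:
            with self.subTest(backend=name):
                self.assertEqual(pick_backend({"ST_BACKEND": name}), name)

    def test_unsupported_backend_fails(self):
        with self.assertRaises(ValueError):
            pick_backend({"ST_BACKEND": "tensorflow"})
```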

@lpi-tn merged commit 9a2f379 into main on Jul 8, 2025
6 checks passed
@lpi-tn deleted the Fix/optimize-vectorisation branch on Jul 8, 2025 at 13:57