-
Notifications
You must be signed in to change notification settings - Fork 0
Fix/optimize vectorisation #55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds configurable backend support for sentence embeddings, refactors embedding model loading, and optimizes database writes by batching operations.
- Introduced
st_backendparameter in workflow and cron templates, mapped toST_BACKEND. - Updated
load_embedding_modelto validate and pass a backend toSentenceTransformer. - Refactored
document_vectorizer.pyto collect slices and process states into lists and perform bulk saves. - Added
optimum[onnxruntime]as a dependency.
Reviewed Changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| welearn_datastack/nodes_workflow/DocumentVectorizer/document_vectorizer.py | Switched to bulk collecting and saving slices and process states |
| welearn_datastack/modules/embedding_model_helpers.py | Added ST_BACKEND handling, validation, and passed backend to model |
| tests/document_vectorizer/test_embedding_model_helpers.py | Updated setup to include ST_BACKEND |
| pyproject.toml | Added optimum dependency with onnxruntime extra |
| k8s/.../workflow-template-document-vectorizer.yaml | Added st_backend input and mapped to ST_BACKEND |
| k8s/.../cron-workflow.yaml | Added st_backend input and mapped to ST_BACKEND |
Comments suppressed due to low confidence (1)
tests/document_vectorizer/test_embedding_model_helpers.py:21
- [nitpick] The tests only set one backend and don’t cover invalid values or other supported backends (
torch,openvino). Consider adding tests for default behavior, valid backends, and failure on unsupported backends.
os.environ["ST_BACKEND"] = "onnx"
This pull request introduces several changes to enhance the document vectorization workflow, improve configurability, and optimize database operations. Key updates include the addition of a new
ST_BACKENDparameter to support different backend options, modifications in the embedding model loading process, and a shift to bulk database operations for better performance.Workflow Configuration Enhancements:
st_backendparameter with default values (onnxortorch) incron-workflow.yamlandworkflow-template-document-vectorizer.yamlto configure the backend for SentenceTransformer operations. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-195276bf211a41586b97e72cb50a98692eeab3c32b32390542d941e82b3e950dR57-R58),[[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-859552c85c1e5b3b6a4762acce2f720dc026699172dd49ab3736cdf889492530R46-R47))ST_BACKEND) in the workflow templates to ensure the backend configuration is passed correctly. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-195276bf211a41586b97e72cb50a98692eeab3c32b32390542d941e82b3e950dR117-R119),[[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-859552c85c1e5b3b6a4762acce2f720dc026699172dd49ab3736cdf889492530R83-R85))Embedding Model Updates:
ST_BACKENDenvironment variable intest_embedding_model_helpers.pyand added backend validation inembedding_model_helpers.pyto supporttorch,onnx, andopenvinobackends. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-b2a05290b6cc863622270477243d69325f82eec1557511d86ebbac8a5183da4eL21-R21),[[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-17bcf515f80dd27693f72294b08949c99c319d7dbc2fbe9013bb74cdd5c20cd6R87-R96))load_embedding_modelfunction to include the backend parameter when initializing theSentenceTransformer. ([welearn_datastack/modules/embedding_model_helpers.pyR106](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-17bcf515f80dd27693f72294b08949c99c319d7dbc2fbe9013bb74cdd5c20cd6R106))Database Optimization:
document_vectorizer.pyto use bulk saving for slices and process states, reducing the number of individual database commits and improving performance. ([[1]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-901e13e14e140e484d205e2c0cb27a6cf0f1287560dabc68241fd91e113ac2d8R76-R77),[[2]](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-901e13e14e140e484d205e2c0cb27a6cf0f1287560dabc68241fd91e113ac2d8R92-R146))Dependency Addition:
optimumlibrary with theonnxruntimeextra to thepyproject.tomlfile to support ONNX-based operations. ([pyproject.tomlR48](https://github.com/CyberCRI/welearn-datastack/pull/55/files#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711R48))