release/533-release-candidate #14227

Merged on Apr 5, 2024 (7 commits)

@maziyarpanahi (Member) commented Apr 5, 2024

  • example notebook for DocumentCharacterTextSplitter
  • example notebook for DeBertaForZeroShotClassification
  • example notebooks for BGEEmbeddings and MPNetEmbeddings
  • example notebook for MPNetForQuestionAnswering
  • example notebook + path for MPNetForSequenceClassification
  • Delete examples/python/annotation/text/english/language-translation/Multilingual_Translation_with_M2M100.ipynb
  • Add files via upload
  • Delete examples/python/annotation/text/english/language-translation/Multilingual_Translation_with_M2M100.ipynb
  • fixing colab link for M2M100 notebook

Sentence embeddings using Universal AnglE Embedding (UAE).
UAE is a novel angle-optimized text embedding model, designed to improve semantic textual
similarity tasks, which are crucial for Large Language Model (LLM) applications. By
introducing angle optimization in a complex space, AnglE effectively mitigates saturation of
the cosine similarity function.
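
To make the saturation claim concrete, here is a minimal numeric sketch. It only illustrates the well-known property that the derivative of cos(θ) is -sin(θ), which vanishes as θ approaches 0° or 180°. It is background on the saturation effect itself, not an implementation of AnglE's objective, and the helper name `cosine_gradient_magnitude` is made up for this example.

```python
import math


def cosine_gradient_magnitude(theta: float) -> float:
    """Magnitude of d/dθ cos(θ), which equals |sin(θ)|."""
    return abs(math.sin(theta))


# Near 0° and 180° the cosine is close to +1 / -1 and its gradient is
# close to zero, so a cosine-based loss gives very little learning signal there.
for degrees in (1, 5, 45, 90, 175, 179):
    theta = math.radians(degrees)
    print(
        f"{degrees:>3} deg  cos={math.cos(theta):+.4f}  "
        f"|grad|={cosine_gradient_magnitude(theta):.4f}"
    )
```

The gradient magnitude is largest at 90° and shrinks toward both ends of the range, which is the flat region the description refers to.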

Additionally, it fixes a bug with serializing ONNX models that do not have a .onnx_data file (b73dc0b). @prabod I think you worked on this part; could you review whether the fix looks good? I provided a description in the commit message. Thanks!

  • Cache mechanism implementation for metadata.json #14224
    1. Call gets3Object to get an object summary that includes getLastModified(), without downloading the whole metadata.json file.
    2. Check whether the cache already contains up-to-date metadata.
    3. If it does, return the cached copy; otherwise, download the file, store it in the cache, and return it.
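
The three steps above can be sketched roughly as follows. This is a Python illustration of the pattern only, not the code from #14224: the class `MetadataCache` and the two callables `fetch_last_modified` and `download` are hypothetical stand-ins for the S3 summary lookup and the full download.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional


@dataclass
class CachedMetadata:
    last_modified: datetime  # timestamp reported by the object summary
    content: str             # raw metadata.json text


class MetadataCache:
    """Keeps one copy of metadata.json and refreshes it only when stale."""

    def __init__(
        self,
        fetch_last_modified: Callable[[], datetime],
        download: Callable[[], str],
    ) -> None:
        # Both callables are hypothetical stand-ins for the real S3 calls.
        self._fetch_last_modified = fetch_last_modified
        self._download = download
        self._entry: Optional[CachedMetadata] = None

    def get(self) -> str:
        # Step 1: cheap summary lookup, no file download.
        remote_ts = self._fetch_last_modified()
        # Step 2: is the cached copy at least as new as the remote object?
        if self._entry is not None and self._entry.last_modified >= remote_ts:
            # Step 3a: cache is up to date, return it.
            return self._entry.content
        # Step 3b: cache is empty or stale: download, store, return.
        content = self._download()
        self._entry = CachedMetadata(remote_ts, content)
        return content
```

Comparing timestamps keeps the common path to a single lightweight request; the full file is fetched only when the remote object is newer than the cached one.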

  • [SPARKNLP-1031] Solves Dependency Parsers training issue #14225
    This PR introduces critical enhancements and optimizations to the processing of the CoNLL-U format, which is instrumental in the training of Dependency Parsers. The key improvements include:

Enhanced Multiword Token Handling: This update ensures proper processing of lines identified by id columns as multiword tokens (e.g., 2-3 no _ _ _ _ _ _ _ _). This adjustment guarantees that multiword tokens are accurately recognized and managed throughout the parsing process.

Improved Handling of Missing uPos Values: Before this change, lines with unavailable uPos values could disrupt the parsing flow. With the current enhancements, the system gracefully handles such scenarios, ensuring uninterrupted parsing operations even in the absence of uPos values.
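
As a rough illustration of the two cases above, the following Python sketch reads CoNLL-U lines, recognizes multiword token ranges by their id column, and tolerates a missing uPos. It is not the Scala code from #14225. The names `Token` and `parse_sentence` are hypothetical, and skipping range lines (rather than keeping them in some other form) is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    id: int
    form: str
    upos: Optional[str]    # None when the column is "_"
    head: Optional[int]
    deprel: Optional[str]


def _value(field: str) -> Optional[str]:
    # CoNLL-U marks an unspecified value with a single underscore.
    return None if field == "_" else field


def parse_sentence(lines: List[str]) -> List[Token]:
    tokens: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            # Multiword token range (e.g. "2-3") or empty node (e.g. "1.1").
            # Assumption: skip it, since the syntactic words that follow
            # carry the HEAD and DEPREL values.
            continue
        head = _value(cols[6])
        tokens.append(
            Token(
                id=int(token_id),
                form=cols[1],
                upos=_value(cols[3]),  # a missing uPos becomes None, no failure
                head=int(head) if head is not None else None,
                deprel=_value(cols[7]),
            )
        )
    return tokens
```

With this shape, a range line such as `2-3` never reaches the integer conversion, and a `_` in the uPos column yields `None` that downstream code can check for explicitly.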

Beyond these functional changes, this PR refactors the underlying code for readability and maintainability, which should make future modifications and debugging easier.

@maziyarpanahi self-assigned this Apr 5, 2024
@maziyarpanahi added the enhancement, documentation, bug-fix, new-feature, and DON'T MERGE labels Apr 5, 2024
* SPARKNLP-962: UAE Embeddings

- added Scala side

* SPARKNLP-962: UAE Embeddings

- added Python Side

* SPARKNLP-962: UAE Embeddings

- Added default values
- Serialization tests

* Bugfix: Can't serialize models without onnx_data file

- onnxModelPath is not set for models without an .onnx_data file, so it will be None
- None.get throws an error, so this checks for None first
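
The commit itself is Scala, where calling `.get` on `None` throws a `NoSuchElementException`. The same guard pattern looks roughly like this in Python. The helper `external_data_files` is hypothetical, and the file layout (external weights stored next to the model as `<model path>_data`) is an assumption for illustration, not taken from the commit.

```python
from typing import List, Optional


def external_data_files(onnx_model_path: Optional[str]) -> List[str]:
    """Return extra files to save alongside the model (hypothetical helper)."""
    if onnx_model_path is None:
        # Model has no .onnx_data file: nothing extra to serialize,
        # and no attempt is made to dereference the missing path.
        return []
    # Assumed naming: "model.onnx" -> "model.onnx_data".
    return [onnx_model_path + "_data"]
```

Checking for the missing value before using it lets models with and without external data go through the same serialization path.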

* SPARKNLP-962: UAE Embeddings

- Documentation

* SPARKNLP-962: UAE Embeddings

- make tests lazy
@maziyarpanahi merged commit 4a37687 into master Apr 5, 2024
5 checks passed