update the default embd. #310

emrgnt-cmplxty · 2024-04-19T19:48:00Z

🚀 This description was created by Ellipsis for commit `c9cba2e`

Summary:

This PR updates the default embedding model, updates library versions, and refactors the EmbeddingProvider classes to accept a configuration dictionary.

Key points:

Updated default embedding model from all-MiniLM-L6-v2 to mixedbread-ai/mxbai-embed-large-v1 in various files.
Updated sentence-transformers library version and added tokenizers library in pyproject.toml.
Updated EmbeddingProvider classes to accept a configuration dictionary.
Updated SentenceTransformerEmbeddingProvider to initialize SentenceTransformer with trust_remote_code=True.
Updated get_embeddings_provider function in E2EPipelineFactory to pass entire configuration dictionary to EmbeddingProvider classes.
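
As a rough illustration of the refactor summarized above, here is a minimal sketch of a provider base class that takes one configuration dictionary, plus a factory that passes that dictionary through unchanged. The key names (`provider`, `model`, `dimension`), the `DummyEmbeddingProvider` class, and the `PROVIDERS` registry are assumptions made for this example and are not taken from the PR diff.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class EmbeddingProvider(ABC):
    """Hypothetical base class that validates a config dict."""

    # Assumed key names; the real config schema may differ.
    REQUIRED_KEYS = ("provider", "model", "dimension")

    def __init__(self, config: dict):
        missing = [key for key in self.REQUIRED_KEYS if key not in config]
        if missing:
            raise ValueError(f"Missing embedding config keys: {missing}")
        self.config = config

    @abstractmethod
    def get_embedding(self, text: str) -> List[float]:
        ...


class DummyEmbeddingProvider(EmbeddingProvider):
    """Stand-in provider so the sketch runs without any model download."""

    def get_embedding(self, text: str) -> List[float]:
        # Returns a zero vector of the configured size.
        return [0.0] * self.config["dimension"]


# Hypothetical registry mapping provider names to classes.
PROVIDERS: Dict[str, Type[EmbeddingProvider]] = {"dummy": DummyEmbeddingProvider}


def get_embeddings_provider(embedding_config: dict) -> EmbeddingProvider:
    """Pick a provider class by name and hand it the whole config dict."""
    provider_cls = PROVIDERS.get(embedding_config.get("provider"))
    if provider_cls is None:
        raise ValueError(f"Unknown provider: {embedding_config.get('provider')}")
    return provider_cls(embedding_config)
```

Passing the whole dictionary means a new option can be added to one provider without changing the factory signature.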


vercel · 2024-04-19T19:48:05Z

The latest updates on your projects.

Name	Status	Preview	Comments	Updated (UTC)
r2r-docs	❌ Failed (Inspect)			Apr 19, 2024 7:49pm

ellipsis-dev

❌ Changes requested.

Reviewed the entire pull request up to c9cba2e
Looked at 365 lines of code in 14 files
Took 4 minutes and 10 seconds to review


Skipped 1 file when reviewing.
Skipped posting 4 additional comments because they did not meet the confidence threshold of 50%.

1. r2r/core/providers/embedding.py:8:

Assessed confidence : 33%
Comment:
The EmbeddingProvider classes now accept the new configuration format, in which the provider, model, and dimensions are specified in a dictionary. This is good practice because it makes the code more flexible and easier to extend. However, the PR lacks a description explaining these changes.
Reasoning:
The PR changes the default embedding model from all-MiniLM-L6-v2 to mixedbread-ai/mxbai-embed-large-v1. The changes are consistent across the documentation and configuration files. The PR also updates the sentence-transformers library version and adds the tokenizers library in the pyproject.toml file. The changes in the EmbeddingProvider classes are to accommodate the new configuration format where the provider, model, and dimensions are specified in a dictionary. The trust_remote_code=True argument is added when initializing the SentenceTransformer model, which allows the execution of remote code for custom models. This could potentially be a security risk if untrusted models are used. The PR lacks a description, which should ideally explain the reason for these changes and their impact.

2. r2r/embeddings/openai/base.py:24:

Assessed confidence : 33%
Same comment and reasoning as item 1.

3. r2r/embeddings/setence_transformer/base.py:10:

Assessed confidence : 33%
Same comment and reasoning as item 1.

4. r2r/main/factory.py:51:

Assessed confidence : 33%
Same comment and reasoning as item 1.

Workflow ID: wflow_IhRBb4LDbAa7v97B


ellipsis-dev · 2024-04-19T19:52:29Z

r2r/core/utils/splitter/text.py

@@ -1132,7 +1132,7 @@ def __init__(
            )

        self.model = model
-        self._model = SentenceTransformer(self.model)
+        self._model = SentenceTransformer(self.model, trust_remote_code=True)


The trust_remote_code=True argument is added when initializing the SentenceTransformer model, which allows the execution of remote code for custom models. This could potentially be a security risk if untrusted models are used. It would be better to make this an optional configuration that can be turned on if necessary, rather than having it on by default.

ellipsis-dev · 2024-04-19T19:52:29Z

r2r/embeddings/setence_transformer/base.py

+            raise ValueError(
+                "Must set dimensions in order to initialize SentenceTransformerEmbeddingProvider."
+            )
+        self.encoder = SentenceTransformer(model, truncate_dim=dimension, trust_remote_code=True)


The trust_remote_code=True argument is added when initializing the SentenceTransformer model, which allows the execution of remote code for custom models. This could potentially be a security risk if untrusted models are used. It would be better to make this an optional configuration that can be turned on if necessary, rather than having it on by default.
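
One way to act on this suggestion is to read the flag from the provider configuration and default it to off. The sketch below only builds the keyword arguments, so it runs without downloading a model. The helper name `build_sentence_transformer_kwargs` and the config keys `model`, `dimension`, and `trust_remote_code` are assumptions for illustration, not part of the PR.

```python
def build_sentence_transformer_kwargs(config: dict) -> dict:
    """Translate a (hypothetical) provider config into SentenceTransformer kwargs.

    trust_remote_code stays False unless the config explicitly enables it.
    """
    if "model" not in config:
        raise ValueError("Must set model in the embedding config.")
    if "dimension" not in config:
        raise ValueError("Must set dimension in the embedding config.")
    return {
        "model_name_or_path": config["model"],
        "truncate_dim": config["dimension"],
        # Opt-in: remote code runs only when the user asks for it.
        "trust_remote_code": bool(config.get("trust_remote_code", False)),
    }


# Intended use (left commented out so the sketch needs no model download):
# from sentence_transformers import SentenceTransformer
# encoder = SentenceTransformer(**build_sentence_transformer_kwargs(config))
```

With this shape, a model that needs custom code is enabled by adding `"trust_remote_code": True` to its config entry, while every other model loads with the safer default.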

* Update local_rag.mdx * Update llms.mdx (#322) * update the default embd. (#310) * Feature/add agent provider (#317) * update pipeline (#315) * Update CONTRIBUTING.md * Delete CONTRIBUTOR.md * Adding agent provider * Feature/modify get all uniq values (#325) * refine * update * format * fix * dev merge

* update the default embd. (#310) * Feature/add jina reranker rebased (#312) * Add jina reranker * fix oai * Revampt client & server approach. (#316) * Revampt client & server approach. * cleanups * tweak hyde prompt * Feature/add agent provider (#317) * update pipeline (#315) * Update CONTRIBUTING.md * Delete CONTRIBUTOR.md * Adding agent provider * Feature/add agent provider rebased v2 (#319) * modify prompt provider workflow, add agent * fix run qna client * add provider abstraction * tweaks and cleans * move text splitter config locale * Feature/modify get all uniq values (#325) * refine * update * format * fix * Feature/dev merge rebased (#329) * Update local_rag.mdx * Update llms.mdx (#322) * update the default embd. (#310) * Feature/add agent provider (#317) * update pipeline (#315) * Update CONTRIBUTING.md * Delete CONTRIBUTOR.md * Adding agent provider * Feature/modify get all uniq values (#325) * refine * update * format * fix * dev merge * Feature/fix dev merge mistakes rebased (#330) * Feature/add agent provider (#317) * update pipeline (#315) * Update CONTRIBUTING.md * Delete CONTRIBUTOR.md * Adding agent provider * Feature/modify get all uniq values (#325) * refine * update * format * fix * dev merge * cleanup * Feature/dev cleanup (#331) * final pub * final pub * json clean * fix sentence transformer issue * include rerank * fix llama cpp * rebase * fix rerank * small tweaks * rollbk config * cleanup

update the default embd.

c9cba2e

emrgnt-cmplxty marked this pull request as ready for review April 19, 2024 19:48

vercel bot had a problem deploying to Preview April 19, 2024 19:49 Failure

ellipsis-dev bot reviewed Apr 19, 2024


emrgnt-cmplxty changed the base branch from main to dev April 19, 2024 23:25

emrgnt-cmplxty merged commit 49e4a0c into dev Apr 19, 2024
1 of 2 checks passed

emrgnt-cmplxty deleted the feature/modify-default-embedding-model branch April 19, 2024 23:37

emrgnt-cmplxty added a commit that referenced this pull request Apr 22, 2024

update the default embd. (#310)

a707831

emrgnt-cmplxty added a commit that referenced this pull request Apr 23, 2024

update the default embd. (#310)

9b19044

emrgnt-cmplxty added a commit that referenced this pull request Apr 23, 2024

update the default embd. (#310)

a4ef13a

emrgnt-cmplxty added a commit that referenced this pull request Apr 23, 2024

update the default embd. (#310)

7959fad

emrgnt-cmplxty added a commit that referenced this pull request Apr 23, 2024

update the default embd. (#310)

c0c11fe
