update the default embd. #310
Conversation
❌ Changes requested.
- Reviewed the entire pull request up to c9cba2e
- Looked at 365 lines of code in 14 files
- Took 4 minutes and 10 seconds to review

More info
- Skipped 1 file when reviewing.
- Skipped posting 4 additional comments because they didn't meet the confidence threshold of 50%.
1. r2r/core/providers/embedding.py:8:
- Assessed confidence: 33%
- Comment: The changes in the EmbeddingProvider classes accommodate the new configuration format, where the provider, model, and dimensions are specified in a dictionary. This is good practice, as it makes the code more flexible and easier to extend. However, the PR lacks a description explaining these changes.
- Reasoning: The PR changes the default embedding model from all-MiniLM-L6-v2 to mixedbread-ai/mxbai-embed-large-v1. The changes are consistent across the documentation and configuration files. The PR also updates the sentence-transformers library version and adds the tokenizers library in the pyproject.toml file. The changes in the EmbeddingProvider classes accommodate the new configuration format where the provider, model, and dimensions are specified in a dictionary. The trust_remote_code=True argument is added when initializing the SentenceTransformer model, which allows the execution of remote code for custom models; this could be a security risk if untrusted models are used. The PR lacks a description, which should explain the reason for these changes and their impact.
2. r2r/embeddings/openai/base.py:24:
- Assessed confidence: 33%
- Comment and reasoning: identical to item 1 above.
3. r2r/embeddings/setence_transformer/base.py:10:
- Assessed confidence: 33%
- Comment and reasoning: identical to item 1 above.
4. r2r/main/factory.py:51:
- Assessed confidence: 33%
- Comment and reasoning: identical to item 1 above.
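The dictionary-based configuration the review describes can be sketched as follows. This is a hypothetical illustration: the key names (provider, model, dimension) follow the review's wording, not necessarily the exact schema used in the r2r codebase.

```python
from dataclasses import dataclass

# Hypothetical sketch of a dictionary-based embedding configuration.
# Key names are assumptions drawn from the review text, not the real r2r schema.
@dataclass
class EmbeddingConfig:
    provider: str
    model: str
    dimension: int

    @classmethod
    def from_dict(cls, config: dict) -> "EmbeddingConfig":
        # Validate up front so a malformed config fails loudly at startup.
        missing = {"provider", "model", "dimension"} - config.keys()
        if missing:
            raise ValueError(f"Missing embedding config keys: {sorted(missing)}")
        return cls(config["provider"], config["model"], int(config["dimension"]))

config = EmbeddingConfig.from_dict({
    "provider": "sentence-transformers",
    "model": "mixedbread-ai/mxbai-embed-large-v1",
    "dimension": 1024,
})
print(config.model)  # mixedbread-ai/mxbai-embed-large-v1
```

Centralizing the config in one parsed object is what makes the refactor "easier to extend": a new provider only needs a new key value, not a new constructor signature.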
Workflow ID: wflow_IhRBb4LDbAa7v97B
Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment.
@@ -1132,7 +1132,7 @@ def __init__(
        )

        self.model = model
-       self._model = SentenceTransformer(self.model)
+       self._model = SentenceTransformer(self.model, trust_remote_code=True)
The trust_remote_code=True
argument is added when initializing the SentenceTransformer
model, which allows the execution of remote code for custom models. This could potentially be a security risk if untrusted models are used. It would be better to make this an optional configuration that can be turned on if necessary, rather than having it on by default.
        raise ValueError(
            "Must set dimensions in order to initialize SentenceTransformerEmbeddingProvider."
        )
+       self.encoder = SentenceTransformer(model, truncate_dim=dimension, trust_remote_code=True)
Same concern as above: trust_remote_code=True allows the execution of remote code for custom models, which could be a security risk if untrusted models are used. It would be better to make this an optional configuration that can be turned on when necessary, rather than on by default.
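The reviewer's suggestion, making trust_remote_code an explicit default-off option, could look like the sketch below. build_encoder and the injectable loader parameter are hypothetical names chosen so the sketch runs without downloading a model; the real code would call sentence_transformers.SentenceTransformer directly.

```python
# Sketch of making trust_remote_code opt-in instead of hard-coded.
# `loader` stands in for sentence_transformers.SentenceTransformer so the
# sketch is runnable without the library or a model download.
def build_encoder(model, dimension=None, trust_remote_code=False, loader=None):
    if loader is None:
        from sentence_transformers import SentenceTransformer  # real dependency
        loader = SentenceTransformer
    kwargs = {"trust_remote_code": trust_remote_code}  # off unless requested
    if dimension is not None:
        kwargs["truncate_dim"] = dimension  # truncate embeddings as in the diff
    return loader(model, **kwargs)

# A fake loader shows exactly which kwargs would reach SentenceTransformer:
captured = {}
def fake_loader(model, **kwargs):
    captured.update(model=model, **kwargs)
    return "encoder"

build_encoder("mixedbread-ai/mxbai-embed-large-v1", dimension=512, loader=fake_loader)
print(captured["trust_remote_code"])  # False
```

With this shape, a user pulling a custom model would pass trust_remote_code=True explicitly, so the security decision is visible at the call site rather than buried in a default.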
* Update local_rag.mdx
* Update llms.mdx (#322)
* update the default embd. (#310)
* Feature/add agent provider (#317)
* update pipeline (#315)
* Update CONTRIBUTING.md
* Delete CONTRIBUTOR.md
* Adding agent provider
* Feature/modify get all uniq values (#325)
* refine
* update
* format
* fix
* dev merge
* update the default embd. (#310)
* Feature/add jina reranker rebased (#312)
* Add jina reranker
* fix oai
* Revampt client & server approach. (#316)
* Revampt client & server approach.
* cleanups
* tweak hyde prompt
* Feature/add agent provider (#317)
* update pipeline (#315)
* Update CONTRIBUTING.md
* Delete CONTRIBUTOR.md
* Adding agent provider
* Feature/add agent provider rebased v2 (#319)
* modify prompt provider workflow, add agent
* fix run qna client
* add provider abstraction
* tweaks and cleans
* move text splitter config locale
* Feature/modify get all uniq values (#325)
* refine
* update
* format
* fix
* Feature/dev merge rebased (#329)
* Update local_rag.mdx
* Update llms.mdx (#322)
* update the default embd. (#310)
* Feature/add agent provider (#317)
* update pipeline (#315)
* Update CONTRIBUTING.md
* Delete CONTRIBUTOR.md
* Adding agent provider
* Feature/modify get all uniq values (#325)
* refine
* update
* format
* fix
* dev merge
* Feature/fix dev merge mistakes rebased (#330)
* Feature/add agent provider (#317)
* update pipeline (#315)
* Update CONTRIBUTING.md
* Delete CONTRIBUTOR.md
* Adding agent provider
* Feature/modify get all uniq values (#325)
* refine
* update
* format
* fix
* dev merge
* cleanup
* Feature/dev cleanup (#331)
* final pub
* final pub
* json clean
* fix sentence transformer issue
* include rerank
* fix llama cpp
* rebase
* fix rerank
* small tweaks
* rollbk config
* cleanup
Summary:
This PR updates the default embedding model, updates library versions, and refactors the EmbeddingProvider classes to accept a configuration dictionary.

Key points:
* Changed the default embedding model from all-MiniLM-L6-v2 to mixedbread-ai/mxbai-embed-large-v1 in various files.
* Updated the sentence-transformers library version and added the tokenizers library in pyproject.toml.
* Refactored the EmbeddingProvider classes to accept a configuration dictionary.
* Updated SentenceTransformerEmbeddingProvider to initialize SentenceTransformer with trust_remote_code=True.
* Updated the get_embeddings_provider function in E2EPipelineFactory to pass the entire configuration dictionary to the EmbeddingProvider classes.

Generated with ❤️ by ellipsis.dev
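The factory change described in the summary, passing the whole configuration dictionary to the provider classes, can be sketched like this. The class and function names mirror those mentioned in the review, but their bodies here are stand-ins, not the actual r2r implementation.

```python
# Hypothetical sketch: the factory dispatches on the config's "provider" key
# and forwards the entire dictionary, per the refactor described above.
# Constructors are simplified stand-ins for the real r2r classes.
class OpenAIEmbeddingProvider:
    def __init__(self, config: dict):
        self.model = config["model"]

class SentenceTransformerEmbeddingProvider:
    def __init__(self, config: dict):
        self.model = config["model"]

def get_embeddings_provider(config: dict):
    providers = {
        "openai": OpenAIEmbeddingProvider,
        "sentence-transformers": SentenceTransformerEmbeddingProvider,
    }
    try:
        cls = providers[config["provider"]]
    except KeyError:
        raise ValueError(f"Unknown embedding provider: {config.get('provider')}") from None
    return cls(config)  # whole dict passed through; each provider reads what it needs

provider = get_embeddings_provider({
    "provider": "sentence-transformers",
    "model": "mixedbread-ai/mxbai-embed-large-v1",
    "dimension": 1024,
})
print(type(provider).__name__)  # SentenceTransformerEmbeddingProvider
```

Passing the full dictionary (rather than unpacked keyword arguments) is what lets provider-specific keys such as dimension be added later without touching the factory's signature.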