update the default embd. #310

Merged
merged 1 commit into from
Apr 19, 2024

Conversation

@emrgnt-cmplxty emrgnt-cmplxty commented Apr 19, 2024

🚀 This description was created by Ellipsis for commit c9cba2e

Summary:

This PR updates the default embedding model, updates library versions, and refactors the EmbeddingProvider classes to accept a configuration dictionary.

Key points:

  • Updated default embedding model from all-MiniLM-L6-v2 to mixedbread-ai/mxbai-embed-large-v1 in various files.
  • Updated sentence-transformers library version and added tokenizers library in pyproject.toml.
  • Updated EmbeddingProvider classes to accept a configuration dictionary.
  • Updated SentenceTransformerEmbeddingProvider to initialize SentenceTransformer with trust_remote_code=True.
  • Updated get_embeddings_provider function in E2EPipelineFactory to pass entire configuration dictionary to EmbeddingProvider classes.
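
The configuration-dictionary refactor described in the key points could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from this PR: the key names `provider`, `model`, and `dimension`, the `ZeroEmbeddingProvider` stand-in, and the registry inside `get_embeddings_provider` are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class EmbeddingProvider(ABC):
    """Illustrative base class that receives the whole config dictionary."""

    def __init__(self, config: Dict[str, Any]) -> None:
        if "provider" not in config:
            raise ValueError("Must set provider in the embedding config.")
        self.config = config

    @abstractmethod
    def get_embedding(self, text: str) -> List[float]:
        """Return one embedding vector for `text`."""


class ZeroEmbeddingProvider(EmbeddingProvider):
    """Hypothetical stand-in that returns zero vectors instead of loading a model."""

    def __init__(self, config: Dict[str, Any]) -> None:
        super().__init__(config)
        # Each provider reads the keys it needs from the shared dictionary.
        self.model = config.get("model")
        self.dimension = config.get("dimension")
        if not self.model or not self.dimension:
            raise ValueError("Must set model and dimension in the embedding config.")

    def get_embedding(self, text: str) -> List[float]:
        return [0.0] * self.dimension


def get_embeddings_provider(config: Dict[str, Any]) -> EmbeddingProvider:
    """Pass the entire config dictionary through to the selected provider class."""
    providers = {"zero": ZeroEmbeddingProvider}  # hypothetical registry
    name = config.get("provider")
    if name not in providers:
        raise ValueError(f"Embedding provider {name!r} is not supported.")
    return providers[name](config)


# Illustrative config; only the model name comes from the PR description.
embedding_config = {
    "provider": "zero",
    "model": "mixedbread-ai/mxbai-embed-large-v1",
    "dimension": 1024,
}
provider = get_embeddings_provider(embedding_config)
print(provider.model, len(provider.get_embedding("hello")))
```

Passing the dictionary through unchanged means a new provider-specific option only touches the config file and the provider that reads it, not the factory signature.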

Generated with ❤️ by ellipsis.dev

vercel bot commented Apr 19, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| r2r-docs | ❌ Failed (Inspect) | | | Apr 19, 2024 7:49pm |

@emrgnt-cmplxty emrgnt-cmplxty marked this pull request as ready for review April 19, 2024 19:48

@ellipsis-dev ellipsis-dev bot left a comment

❌ Changes requested.

  • Reviewed the entire pull request up to c9cba2e
  • Looked at 365 lines of code in 14 files
  • Took 4 minutes and 10 seconds to review
More info
  • Skipped 1 files when reviewing.
  • Skipped posting 4 additional comments because they didn't meet confidence threshold of 50%.
1. r2r/core/providers/embedding.py:8:
  • Assessed confidence : 33%
  • Comment:
    The changes in the EmbeddingProvider classes are to accommodate the new configuration format where the provider, model, and dimensions are specified in a dictionary. This is a good practice as it makes the code more flexible and easier to extend in the future. However, the PR lacks a description explaining these changes.
  • Reasoning:
    The PR changes the default embedding model from all-MiniLM-L6-v2 to mixedbread-ai/mxbai-embed-large-v1. The changes are consistent across the documentation and configuration files. The PR also updates the sentence-transformers library version and adds the tokenizers library in the pyproject.toml file. The changes in the EmbeddingProvider classes are to accommodate the new configuration format where the provider, model, and dimensions are specified in a dictionary. The trust_remote_code=True argument is added when initializing the SentenceTransformer model, which allows the execution of remote code for custom models. This could potentially be a security risk if untrusted models are used. The PR lacks a description, which should ideally explain the reason for these changes and their impact.
2. r2r/embeddings/openai/base.py:24:
  • Assessed confidence : 33%
  • Comment and reasoning: same text as comment 1.
3. r2r/embeddings/setence_transformer/base.py:10:
  • Assessed confidence : 33%
  • Comment and reasoning: same text as comment 1.
4. r2r/main/factory.py:51:
  • Assessed confidence : 33%
  • Comment and reasoning: same text as comment 1.

Workflow ID: wflow_IhRBb4LDbAa7v97B


@@ -1132,7 +1132,7 @@ def __init__(
 )

 self.model = model
-self._model = SentenceTransformer(self.model)
+self._model = SentenceTransformer(self.model, trust_remote_code=True)

The trust_remote_code=True argument is added when initializing the SentenceTransformer model, which allows the execution of remote code for custom models. This could potentially be a security risk if untrusted models are used. It would be better to make this an optional configuration that can be turned on if necessary, rather than having it on by default.
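
One way to act on this suggestion is to read the flag from the provider configuration and default it to off. The sketch below is a minimal illustration, not the fix that was merged: the config keys `trust_remote_code`, `dimension`, and `model` and both helper names are assumptions, while `trust_remote_code` and `truncate_dim` are the `SentenceTransformer` keyword arguments that appear in the diffs on this page.

```python
from typing import Any, Dict


def sentence_transformer_kwargs(config: Dict[str, Any]) -> Dict[str, Any]:
    """Build SentenceTransformer keyword arguments from a provider config (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        # Remote code stays disabled unless the config enables it explicitly.
        "trust_remote_code": bool(config.get("trust_remote_code", False)),
    }
    dimension = config.get("dimension")
    if dimension is not None:
        kwargs["truncate_dim"] = dimension
    return kwargs


def load_encoder(config: Dict[str, Any]):
    """Create the encoder; the import is lazy so the helper above works without the library."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(config["model"], **sentence_transformer_kwargs(config))
```

With this shape, a deployment that needs a model shipping custom code sets `"trust_remote_code": True` in its own config, and every other deployment keeps the safer default.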

raise ValueError(
"Must set dimensions in order to initialize SentenceTransformerEmbeddingProvider."
)
self.encoder = SentenceTransformer(model, truncate_dim=dimension, trust_remote_code=True)

The trust_remote_code=True argument is added when initializing the SentenceTransformer model, which allows the execution of remote code for custom models. This could potentially be a security risk if untrusted models are used. It would be better to make this an optional configuration that can be turned on if necessary, rather than having it on by default.

@emrgnt-cmplxty emrgnt-cmplxty changed the base branch from main to dev April 19, 2024 23:25
@emrgnt-cmplxty emrgnt-cmplxty merged commit 49e4a0c into dev Apr 19, 2024
1 of 2 checks passed
@emrgnt-cmplxty emrgnt-cmplxty deleted the feature/modify-default-embedding-model branch April 19, 2024 23:37
emrgnt-cmplxty added a commit that referenced this pull request Apr 22, 2024
emrgnt-cmplxty added a commit that referenced this pull request Apr 23, 2024
emrgnt-cmplxty added a commit that referenced this pull request Apr 23, 2024
emrgnt-cmplxty added a commit that referenced this pull request Apr 23, 2024
emrgnt-cmplxty added a commit that referenced this pull request Apr 23, 2024
* Update local_rag.mdx

* Update llms.mdx (#322)

* update the default embd. (#310)

* Feature/add agent provider (#317)

* update pipeline (#315)

* Update CONTRIBUTING.md

* Delete CONTRIBUTOR.md

* Adding agent provider

* Feature/modify get all uniq values (#325)

* refine

* update

* format

* fix

* dev merge
emrgnt-cmplxty added a commit that referenced this pull request Apr 23, 2024
emrgnt-cmplxty added a commit that referenced this pull request Apr 23, 2024
emrgnt-cmplxty added a commit that referenced this pull request Apr 24, 2024
* update the default embd. (#310)

* Feature/add jina reranker rebased (#312)

* Add jina reranker

* fix oai

* Revampt client & server approach. (#316)

* Revampt client & server approach.

* cleanups

* tweak hyde prompt

* Feature/add agent provider (#317)

* update pipeline (#315)

* Update CONTRIBUTING.md

* Delete CONTRIBUTOR.md

* Adding agent provider

* Feature/add agent provider rebased v2 (#319)

* modify prompt provider workflow, add agent

* fix run qna client

* add provider abstraction

* tweaks and cleans

* move text splitter config locale

* Feature/modify get all uniq values (#325)

* refine

* update

* format

* fix

* Feature/dev merge rebased (#329)

* Update local_rag.mdx

* Update llms.mdx (#322)

* update the default embd. (#310)

* Feature/add agent provider (#317)

* update pipeline (#315)

* Update CONTRIBUTING.md

* Delete CONTRIBUTOR.md

* Adding agent provider

* Feature/modify get all uniq values (#325)

* refine

* update

* format

* fix

* dev merge

* Feature/fix dev merge mistakes rebased (#330)

* Feature/add agent provider (#317)

* update pipeline (#315)

* Update CONTRIBUTING.md

* Delete CONTRIBUTOR.md

* Adding agent provider

* Feature/modify get all uniq values (#325)

* refine

* update

* format

* fix

* dev merge

* cleanup

* Feature/dev cleanup (#331)

* final pub

* final pub

* json clean

* fix sentence transformer issue

* include rerank

* fix llama cpp

* rebase

* fix rerank

* small tweaks

* rollbk config

* cleanup