update the default embd. (#310)
emrgnt-cmplxty committed Apr 22, 2024
1 parent 57eadf0 commit a707831
Showing 14 changed files with 867 additions and 852 deletions.
2 changes: 1 addition & 1 deletion docs/pages/providers/embeddings.mdx
@@ -35,4 +35,4 @@ Anything supported by OpenAI, such as:
- **Pricing**: Approximately 12,500 pages per dollar. Balances cost and performance effectively.
- **More**: [Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)

Lastly, the `sentence_transformer` package from HuggingFace is also supported as a provider. For example, one such popular model is [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
Lastly, the `sentence_transformer` package from HuggingFace is also supported as a provider. For example, one such popular model is [`mixedbread-ai/mxbai-embed-large-v1`](https://huggingface.co/sentence-transformers/mixedbread-ai/mxbai-embed-large-v1).
2 changes: 1 addition & 1 deletion docs/pages/tutorials/configuring_your_rag_pipeline.mdx
@@ -226,7 +226,7 @@ Set the `provider` field under `vector_database` in `config.json` to specify you
#### Embedding Provider
R2R supports OpenAI and local inference embedding providers:
- `openai`: OpenAI models like `text-embedding-3-small`
- `sentence-transformers`: HuggingFace models like `all-MiniLM-L6-v2`
- `sentence-transformers`: HuggingFace models like `mixedbread-ai/mxbai-embed-large-v1`
Configure the `embedding` section to set your desired embedding model, dimension, and batch size.
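For reference, a matching `embedding` section in `config.json` might look like this (a sketch mirroring the local defaults introduced in this commit):

```json
{
  "embedding": {
    "provider": "sentence-transformers",
    "model": "mixedbread-ai/mxbai-embed-large-v1",
    "dimension": 512,
    "batch_size": 32
  }
}
```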
10 changes: 5 additions & 5 deletions docs/pages/tutorials/local_rag.mdx
@@ -59,7 +59,7 @@ To streamline this process, we've provided pre-configured local settings in the
{
"embedding": {
"provider": "sentence-transformers",
"model": "all-MiniLM-L6-v2",
"model": "mixedbread-ai/mxbai-embed-large-v1",
"dimension": 384,
"batch_size": 32
},
@@ -78,7 +78,7 @@ To streamline this process, we've provided pre-configured local settings in the

You may also modify the configuration defaults for ingestion, logging, and your vector database provider in a similar manner. More information on this follows below.

This chosen config modification above instructs R2R to use the `sentence-transformers` library for embeddings with the `all-MiniLM-L6-v2` model, turns off evals, and sets the LLM provider to `ollama`. During ingestion, the default is to split documents into chunks of 512 characters with 20 characters of overlap between chunks.
This chosen config modification above instructs R2R to use the `sentence-transformers` library for embeddings with the `mixedbread-ai/mxbai-embed-large-v1` model, turns off evals, and sets the LLM provider to `ollama`. During ingestion, the default is to split documents into chunks of 512 characters with 20 characters of overlap between chunks.

A local vector database will be used to store the embeddings. The current default is a minimal sqlite implementation, with plans to migrate the tutorial to LanceDB shortly.

@@ -117,7 +117,7 @@ The output should look something like this:
Here's what's happening under the hood:
1. R2R loads the included PDF and converts it to text using PyPDF2.
2. It splits the text into chunks of 512 characters each, with 20 characters overlapping between chunks.
3. Each chunk is embedded using the `all-MiniLM-L6-v2` model from `sentence-transformers`.
3. Each chunk is embedded using the `mixedbread-ai/mxbai-embed-large-v1` model from `sentence-transformers`.
4. The chunks and embeddings are stored in the specified vector database, which defaults to a local SQLite database.
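Conceptually, the character splitting in step 2 can be sketched in a few lines of plain Python (an illustrative sketch only — R2R's actual splitter is the `recursive_character` splitter configured above, not this simplified version):

```python
def split_into_chunks(text: str, chunk_size: int = 512, overlap: int = 20) -> list[str]:
    """Split text into fixed-size chunks with a sliding character overlap."""
    chunks = []
    step = chunk_size - overlap  # advance 492 characters per chunk by default
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "".join(str(i % 10) for i in range(1100))
chunks = split_into_chunks(doc)
print(len(chunks))     # 3 chunks for an 1100-character document
print(len(chunks[0]))  # 512
```

The last 20 characters of each chunk repeat as the first 20 of the next, which helps preserve context that would otherwise be cut at a chunk boundary.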

With just one command, we've gone from a raw document to an embedded knowledge base we can query. In addition to the raw chunks, metadata such as user ID or document ID can be attached to enable easy filtering later.
@@ -151,7 +151,7 @@ python -m r2r.examples.clients.run_qna_client rag_completion \
```

This command tells R2R to use the specified model to generate a completion for the given query. R2R will:
1. Embed the query using `all-MiniLM-L6-v2`.
1. Embed the query using `mixedbread-ai/mxbai-embed-large-v1`.
2. Find the chunks most similar to the query embedding.
3. Pass the query and relevant chunks to the LLM to generate a response.
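Steps 1 and 2 boil down to a nearest-neighbor search over the stored embeddings. A toy sketch with hand-rolled cosine similarity (illustrative only — R2R delegates this to its configured vector-database provider, and real embeddings have hundreds of dimensions, not two):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunks most similar to the query."""
    scored = sorted(enumerate(chunk_vecs),
                    key=lambda iv: cosine_similarity(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

# Toy 2-D vectors standing in for real embeddings.
query = [1.0, 0.0]
chunks = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(top_k(query, chunks))  # [0, 2]
```

The retrieved chunks are then interpolated into the LLM prompt for step 3.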

@@ -183,7 +183,7 @@ Set the `provider` field under `vector_database` in `config.json` to specify you
#### Embedding Provider
R2R supports OpenAI and local inference embedding providers:
- `openai`: OpenAI models like `text-embedding-3-small`
- `sentence-transformers`: HuggingFace models like `all-MiniLM-L6-v2`
- `sentence-transformers`: HuggingFace models like `mixedbread-ai/mxbai-embed-large-v1`

Configure the `embedding` section to set your desired embedding model, dimension, and batch size.

1,576 changes: 782 additions & 794 deletions poetry.lock

Large diffs are not rendered by default.

7 changes: 4 additions & 3 deletions pyproject.toml
@@ -45,7 +45,8 @@ ionic-api-sdk = {version = "0.9.3", optional = true}
boto3 = {version = "^1.34.71", optional = true}
exa-py = {version = "^1.0.9", optional = true}
llama-cpp-python = {version = "^0.2.57", optional = true}
sentence-transformers = {version = "^2.6.1", optional = true}
sentence-transformers = {version = "^2.7.0", optional = true}
tokenizers = {version = "^0.15.2", optional = true}

[tool.poetry.extras]
embedding = ["tiktoken"]
@@ -58,8 +59,8 @@ eval = ["parea-ai"]
ionic = ["ionic-api-sdk"]
reducto = ["boto3"]
exa = ["exa-py"]
sentence_transformers = ["sentence-transformers"]
local_llm = ["llama-cpp-python", "sentence-transformers"]
sentence_transformers = ["sentence-transformers", "tokenizers"]
local_llm = ["llama-cpp-python", "sentence-transformers", "tokenizers"]
all = ["tiktoken", "datasets", "qdrant_client", "psycopg2-binary", "sentry-sdk", "parea-ai", "boto3", "exa-py", "llama-cpp-python", "ionic-api-sdk"]

[tool.poetry.group.dev.dependencies]
7 changes: 6 additions & 1 deletion r2r/core/providers/embedding.py
@@ -5,7 +5,12 @@
class EmbeddingProvider(ABC):
supported_providers = ["openai", "sentence-transformers"]

def __init__(self, provider: str):
def __init__(self, config: dict):
provider = config.get("provider", None)
if not provider:
raise ValueError(
"Must set provider in order to initialize EmbeddingProvider."
)
if provider not in EmbeddingProvider.supported_providers:
raise ValueError(
f"Error, `{provider}` is not in EmbeddingProvider's list of supported providers."
2 changes: 1 addition & 1 deletion r2r/core/utils/splitter/text.py
@@ -1132,7 +1132,7 @@ def __init__(
)

self.model = model
self._model = SentenceTransformer(self.model)
self._model = SentenceTransformer(self.model, trust_remote_code=True)
self.tokenizer = self._model.tokenizer
self._initialize_chunk_configuration(tokens_per_chunk=tokens_per_chunk)

9 changes: 7 additions & 2 deletions r2r/embeddings/openai/base.py
@@ -21,12 +21,17 @@ class OpenAIEmbeddingProvider(EmbeddingProvider):
"text-embedding-3-large": [256, 1024, 3072],
}

def __init__(self, provider: str = "openai"):
def __init__(self, config: dict):
logger.info(
"Initializing `OpenAIEmbeddingProvider` to provide embeddings."
)
super().__init__(config)
provider = config.get("provider", None)
if not provider:
raise ValueError(
"Must set provider in order to initialize SentenceTransformerEmbeddingProvider."
)

super().__init__(provider)
if provider != "openai":
raise ValueError(
"OpenAIEmbeddingProvider must be initialized with provider `openai`."
24 changes: 20 additions & 4 deletions r2r/embeddings/setence_transformer/base.py
@@ -8,13 +8,18 @@

class SentenceTransformerEmbeddingProvider(EmbeddingProvider):
def __init__(
self, embedding_model: str, provider: str = "sentence-transformers"
self, config: dict,
):
super().__init__(config)
logger.info(
"Initializing `SentenceTransformerEmbeddingProvider` to provide embeddings."
)

super().__init__(provider)
print("config = ", config)
provider = config.get("provider", None)
if not provider:
raise ValueError(
"Must set provider in order to initialize SentenceTransformerEmbeddingProvider."
)
if provider != "sentence-transformers":
raise ValueError(
"SentenceTransformerEmbeddingProvider must be initialized with provider `sentence-transformers`."
@@ -25,7 +30,18 @@ def __init__(
raise ValueError(
"Must download sentence-transformers library to run `SentenceTransformerEmbeddingProvider`."
)
self.encoder = SentenceTransformer(embedding_model)

model = config.get("model", None)
if not model:
raise ValueError(
"Must set model in order to initialize SentenceTransformerEmbeddingProvider."
)
dimension = config.get("dimension", None)
if not dimension:
raise ValueError(
"Must set dimensions in order to initialize SentenceTransformerEmbeddingProvider."
)
self.encoder = SentenceTransformer(model, truncate_dim=dimension, trust_remote_code=True)

def _check_inputs(self, model: str, dimensions: Optional[int]) -> None:
if (
4 changes: 2 additions & 2 deletions r2r/examples/configs/local_llama_cpp.json
@@ -1,8 +1,8 @@
{
"embedding": {
"provider": "sentence-transformers",
"model": "all-MiniLM-L6-v2",
"dimension": 384,
"model": "mixedbread-ai/mxbai-embed-large-v1",
"dimension": 512,
"batch_size": 32
},
"evals": {
4 changes: 2 additions & 2 deletions r2r/examples/configs/local_ollama.json
@@ -1,8 +1,8 @@
{
"embedding": {
"provider": "sentence-transformers",
"model": "all-MiniLM-L6-v2",
"dimension": 384,
"model": "mixedbread-ai/mxbai-embed-large-v1",
"dimension": 512,
"batch_size": 32
},
"evals": {
66 changes: 33 additions & 33 deletions r2r/examples/configs/local_ollama_qdrant.json
@@ -1,37 +1,37 @@
{
"embedding": {
"provider": "sentence-transformers",
"model": "all-MiniLM-L6-v2",
"dimension": 384,
"batch_size": 32
},
"evals": {
"provider": "none",
"frequency": 0.0
},
"language_model": {
"provider": "litellm"
},
"logging_database": {
"provider": "local",
"collection_name": "demo_logs",
"level": "INFO"
},
"ingestion":{
"provider": "local",
"text_splitter": {
"type": "recursive_character",
"chunk_size": 512,
"chunk_overlap": 20
}
},
"vector_database": {
"provider": "qdrant",
"collection_name": "demo_vecs"
},
"app": {
"max_logs": 100,
"max_file_size_in_mb": 100
"embedding": {
"provider": "sentence-transformers",
"model": "mixedbread-ai/mxbai-embed-large-v1",
"dimension": 512,
"batch_size": 32
},
"evals": {
"provider": "none",
"frequency": 0.0
},
"language_model": {
"provider": "litellm"
},
"logging_database": {
"provider": "local",
"collection_name": "demo_logs",
"level": "INFO"
},
"ingestion":{
"provider": "local",
"text_splitter": {
"type": "recursive_character",
"chunk_size": 512,
"chunk_overlap": 20
}
},
"vector_database": {
"provider": "qdrant",
"collection_name": "demo_vecs"
},
"app": {
"max_logs": 100,
"max_file_size_in_mb": 100
}
}

4 changes: 2 additions & 2 deletions r2r/main/factory.py
@@ -48,12 +48,12 @@ def get_embeddings_provider(embedding_config: dict[str, Any]):
if embedding_config["provider"] == "openai":
from r2r.embeddings import OpenAIEmbeddingProvider

return OpenAIEmbeddingProvider()
return OpenAIEmbeddingProvider(embedding_config)
elif embedding_config["provider"] == "sentence-transformers":
from r2r.embeddings import SentenceTransformerEmbeddingProvider

return SentenceTransformerEmbeddingProvider(
embedding_config["model"]
embedding_config
)
else:
raise ValueError(
2 changes: 1 addition & 1 deletion r2r/vecs/adapter/text.py
@@ -18,7 +18,7 @@
"all-distilroberta-v1",
"all-MiniLM-L12-v2",
"multi-qa-distilbert-cos-v1",
"all-MiniLM-L6-v2",
"mixedbread-ai/mxbai-embed-large-v1",
"multi-qa-MiniLM-L6-cos-v1",
"paraphrase-multilingual-mpnet-base-v2",
"paraphrase-albert-small-v2",
