Failure on data ingestion into "qdrant" using "text-embedding-3-small" embedding with 512 dimension size #355

Closed
avibathula opened this issue May 28, 2024 · 1 comment

avibathula commented May 28, 2024

Describe the bug

Data ingestion into "qdrant" fails when using the "text-embedding-3-small" embedding with a dimension size of 512, a configuration I took from the sample in the docs: https://r2r-docs.sciphi.ai/deep-dive/config

Raw response content:
b'{"status":{"error":"Wrong input: Vector dimension error: expected dim: 1536, got 512"},"time":0.00937252}' - 2024-05-27 20:37:35,296
r2r.pipes.vector_storage_pipe - ERROR - Failed to store vector entries in the database: Unexpected Response: 400 (Bad Request)
Raw response content:
b'{"status":{"error":"Wrong input: Vector dimension error: expected dim: 1536, got 512"},"time":0.008870695}' - 2024-05-27 20:37:35,678
r2r.pipes.vector_storage_pipe - ERROR - Failed to store vector entries in the database: Unexpected Response: 400 (Bad Request)

The same errors keep getting printed to the console repeatedly.
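
The raw response already names both sizes: the collection expects 1536 and the request carried 512. As a minimal sketch, the two numbers can be pulled out of the response body so a caller can stop instead of retrying. The function name `parse_dimension_error` is hypothetical and is not part of r2r or qdrant-client; the message format is assumed to match the log above.

```python
import json
import re

# Matches the wording seen in the log above; other Qdrant versions may differ.
_DIM_ERROR = re.compile(r"expected dim: (\d+), got (\d+)")


def parse_dimension_error(raw: bytes):
    """Return (expected, got) for a dimension-mismatch response, else None."""
    try:
        payload = json.loads(raw)
    except (ValueError, TypeError):
        return None  # body was not valid JSON
    status = payload.get("status") if isinstance(payload, dict) else None
    message = status.get("error", "") if isinstance(status, dict) else ""
    match = _DIM_ERROR.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


raw = (
    b'{"status":{"error":"Wrong input: Vector dimension error: '
    b'expected dim: 1536, got 512"},"time":0.00937252}'
)
print(parse_dimension_error(raw))  # (1536, 512)
```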

To Reproduce
Use the following config:

{
  "app": {
    "max_logs": 100,
    "max_file_size_in_mb": 50
  },
  "completions": {
    "provider": "openai"
  },
  "embedding": {
    "provider": "openai",
    "search_model": "text-embedding-3-small",
    "search_dimension": 512,
    "batch_size": 128,
    "text_splitter": {
      "type": "recursive_character",
      "chunk_size": 512,
      "chunk_overlap": 20
    },
    "rerank_model": "None"
  },
  "eval": {
    "provider": "local",
    "llm": {
      "model": "gpt-4o",
      "provider": "openai"
    },
    "sampling_fraction": 1.0
  },
  "ingestion": {
    "selected_parsers": {
      "csv": "default",
      "docx": "default",
      "html": "default",
      "json": "default",
      "md": "default",
      "pdf": "default",
      "pptx": "default",
      "txt": "default",
      "xlsx": "default",
      "gif": "default",
      "png": "default",
      "jpg": "default",
      "jpeg": "default",
      "svg": "default"
    }
  },
  "logging": {
    "provider": "local",
    "log_table": "logs",
    "log_info_table": "log_info"
  },
  "prompt": {
    "provider": "local"
  },
  "vector_database": {
    "provider": "qdrant",
    "collection_name": "blahblahblah"
  }
}

I even tried removing the whole "embedding" dictionary, but I got the same errors, because the values I was using above were the defaults.

Expected behavior
Data files vectorized and uploaded to qdrant

Additional context
I installed the r2r package, programmatically provided a list of files, and called r2r.aingest_files, which is where the issue occurs.

@avibathula
Author

It was a user error on my side.

I initially tried "text-embedding-ada-002", which requires 1536-dimensional embeddings, and it failed for a different reason: see #354

I then changed the embedding model to "text-embedding-3-small" with an embedding size of 512, but kept using the same collection name. Because the collection had been created with an expectation of 1536-dimensional embeddings, ingestion was failing.
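
A mismatch like this can be caught before ingestion by comparing the size the existing collection was created with against the configured `search_dimension`. The following is a sketch under stated assumptions, not r2r's own behavior: `collection_vector_size`, `check_dimension`, and `FakeClient` are hypothetical names, and the attribute path `config.params.vectors.size` is assumed to be what qdrant-client's `get_collection()` returns for a collection with a single unnamed vector. The fake client stands in for a real `QdrantClient` so the example runs without a server.

```python
from types import SimpleNamespace


def collection_vector_size(client, collection_name: str) -> int:
    """Read the vector size an existing collection was created with.

    Assumes one unnamed vector per collection, exposed by the client at
    config.params.vectors.size.
    """
    info = client.get_collection(collection_name)
    return info.config.params.vectors.size


def check_dimension(client, collection_name: str, configured_dimension: int) -> None:
    """Raise before ingestion if the collection and the config disagree."""
    actual = collection_vector_size(client, collection_name)
    if actual != configured_dimension:
        raise ValueError(
            f"Collection {collection_name!r} expects {actual}-dimensional vectors, "
            f"but the config asks for {configured_dimension}. "
            "Use a new collection_name or recreate the collection."
        )


class FakeClient:
    """Hypothetical stand-in for a Qdrant client so the sketch runs offline."""

    def __init__(self, size: int):
        self._size = size

    def get_collection(self, collection_name: str):
        vectors = SimpleNamespace(size=self._size)
        params = SimpleNamespace(vectors=vectors)
        return SimpleNamespace(config=SimpleNamespace(params=params))


# The collection already exists with 1536 dimensions; the new config asks for 512.
client = FakeClient(size=1536)
try:
    check_dimension(client, "blahblahblah", 512)
except ValueError as exc:
    print(exc)
```

With a real client in place of `FakeClient`, the same check would run once at startup. Changing "collection_name" in the config, or deleting the old collection, avoids reusing a collection whose size no longer matches the embedding model.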
