Description
System Info
`/info`:

```json
{
  "model_id": "Snowflake/snowflake-arctic-embed-m-v1.5",
  "model_sha": null,
  "model_dtype": "float16",
  "model_type": {
    "embedding": {
      "pooling": "cls"
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 512,
  "max_batch_tokens": 102400,
  "max_batch_requests": 32,
  "max_client_batch_size": 200,
  "auto_truncate": true,
  "tokenization_workers": 4,
  "version": "1.7.2",
  "sha": "a69cc2ee285ca87a8c7a6b8fc9abc1be360f8335",
  "docker_label": "sha-a69cc2e"
}
```
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
When TEI reaches its concurrency limit, it reacts differently depending on whether a single input or a batch is sent.
- Set `max_concurrent_requests` to a low value to make this easier to reproduce.
- Generate enough traffic to reach the concurrency limit.
- Sending `/embed` with the single-input payload `{"inputs": "apple"}` returns a `429 Model is overloaded` error.
- Sending `/embed` with the batch payload `{"inputs": ["apple"]}` always succeeds under the same load.
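The steps above can be sketched with curl. This is an illustrative repro, not taken verbatim from the report: the port, the `--max-concurrent-requests 1` value, and the amount of background traffic are assumptions.

```shell
# Start TEI with a deliberately low concurrency limit (values illustrative;
# adjust model and port to your setup).
text-embeddings-router \
  --model-id Snowflake/snowflake-arctic-embed-m-v1.5 \
  --max-concurrent-requests 1 \
  --port 8080 &

# Generate background load to saturate the concurrency limit.
for i in $(seq 1 20); do
  curl -s http://localhost:8080/embed \
    -H "Content-Type: application/json" \
    -d '{"inputs": "background load"}' > /dev/null &
done

# Single-input payload: observed to fail with "429 Model is overloaded".
curl -i http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": "apple"}'

# Batch payload (one-element list): observed to succeed under the same load.
curl -i http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["apple"]}'
```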
Expected behavior
Single and batch inputs would be expected to behave the same way. As it stands, it is unclear what exactly `max_concurrent_requests` controls and when it should be adjusted.