Difference between single input and batch concurrency #662

Open
@augustas1

Description

System Info

/info:

{
	"model_id": "Snowflake/snowflake-arctic-embed-m-v1.5",
	"model_sha": null,
	"model_dtype": "float16",
	"model_type": {
		"embedding": {
			"pooling": "cls"
		}
	},
	"max_concurrent_requests": 512,
	"max_input_length": 512,
	"max_batch_tokens": 102400,
	"max_batch_requests": 32,
	"max_client_batch_size": 200,
	"auto_truncate": true,
	"tokenization_workers": 4,
	"version": "1.7.2",
	"sha": "a69cc2ee285ca87a8c7a6b8fc9abc1be360f8335",
	"docker_label": "sha-a69cc2e"
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

When TEI reaches its concurrency limit, it responds differently depending on whether a single input or a batch payload is sent.

  1. Set a low max_concurrent_requests to make the limit easier to reach.
  2. Generate enough traffic to hit the concurrency limit.
  3. Sending /embed with a single-input payload {"inputs": "apple"} returns a 429 Model is overloaded error.
  4. Sending /embed with a batch payload {"inputs": ["apple"]} always succeeds under the same load.
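The steps above can be sketched as a small script. This is a minimal reproduction sketch, not an official tool: BASE_URL, the worker/request counts, and the helper names are assumptions, and you would need a running TEI instance with a low max_concurrent_requests to observe the difference.

```python
import json

def single_payload(text):
    # {"inputs": "apple"} -- the form reported to return 429 under load
    return {"inputs": text}

def batch_payload(texts):
    # {"inputs": ["apple"]} -- the form reported to succeed under the same load
    return {"inputs": list(texts)}

if __name__ == "__main__":
    import concurrent.futures
    import urllib.error
    import urllib.request

    BASE_URL = "http://localhost:8080"  # assumption: TEI running locally

    def post_embed(payload):
        req = urllib.request.Request(
            f"{BASE_URL}/embed",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.status
        except urllib.error.HTTPError as e:
            return e.code  # 429 expected for the single-input form when saturated

    # Saturate the server with background traffic, then compare the two
    # payload shapes under the same load.
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        background = [
            pool.submit(post_embed, batch_payload(["apple"])) for _ in range(256)
        ]
        print("single input:", post_embed(single_payload("apple")))
        print("batch input: ", post_embed(batch_payload(["apple"])))
```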

Expected behavior

Single and batch inputs would be expected to behave the same way.
As it stands, it is unclear what exactly max_concurrent_requests controls and when it should be adjusted.
