Difference between single input and batch concurrency #662

Open
@augustas1

Description

System Info

/info:

{
	"model_id": "Snowflake/snowflake-arctic-embed-m-v1.5",
	"model_sha": null,
	"model_dtype": "float16",
	"model_type": {
		"embedding": {
			"pooling": "cls"
		}
	},
	"max_concurrent_requests": 512,
	"max_input_length": 512,
	"max_batch_tokens": 102400,
	"max_batch_requests": 32,
	"max_client_batch_size": 200,
	"auto_truncate": true,
	"tokenization_workers": 4,
	"version": "1.7.2",
	"sha": "a69cc2ee285ca87a8c7a6b8fc9abc1be360f8335",
	"docker_label": "sha-a69cc2e"
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

When TEI reaches its concurrency limit, it responds differently depending on whether a single input or a batch payload is sent.

  1. Set a low max_concurrent_requests to make the limit easier to reach.
  2. Generate enough traffic to hit the concurrency limit.
  3. Sending /embed with a single-input payload {"inputs": "apple"} returns a 429 Model is overloaded error.
  4. Sending /embed with a batch payload {"inputs": ["apple"]} always succeeds under the same load.
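The steps above can be sketched as a small script. This is a minimal reproduction sketch, not an official tool: BASE_URL, the worker/request counts, and the helper names are assumptions, and you would need a running TEI instance with a low max_concurrent_requests to observe the difference.

```python
import json

def single_payload(text):
    # {"inputs": "apple"} -- the form reported to return 429 under load
    return {"inputs": text}

def batch_payload(texts):
    # {"inputs": ["apple"]} -- the form reported to succeed under the same load
    return {"inputs": list(texts)}

if __name__ == "__main__":
    import concurrent.futures
    import urllib.error
    import urllib.request

    BASE_URL = "http://localhost:8080"  # assumption: TEI running locally

    def post_embed(payload):
        req = urllib.request.Request(
            f"{BASE_URL}/embed",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.status
        except urllib.error.HTTPError as e:
            return e.code  # 429 expected for the single-input form when saturated

    # Saturate the server with background traffic, then compare the two
    # payload shapes under the same load.
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        background = [
            pool.submit(post_embed, batch_payload(["apple"])) for _ in range(256)
        ]
        print("single input:", post_embed(single_payload("apple")))
        print("batch input: ", post_embed(batch_payload(["apple"])))
```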

Expected behavior

Single and batch inputs would be expected to behave the same way.
As it stands, it is unclear what exactly max_concurrent_requests controls and when it should be adjusted.
