Embedding engine hardcodes last-token pooling, producing wrong vectors for CLS/mean models (e.g. bge-m3) #89

@unverbraucht

Description

Problem

src/engine/optimum/optimum_emb.py always applies last-token pooling in generate_embeddings:

embeddings = self.last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])

This is correct for decoder-style embedding models like Qwen3-Embedding-*, but wrong for the much larger family of encoder-style models that ship via sentence-transformers — most notably:

  • BAAI/bge-m3, BAAI/bge-large-en-v1.5, etc. → CLS pooling
  • sentence-transformers/all-MiniLM-L6-v2, intfloat/multilingual-e5-* → mean pooling

Loading any of these through OpenArc's optimum engine today returns numerically valid but semantically wrong vectors (last hidden state of the final non-pad token instead of [CLS] / masked mean). There's no error and no warning — retrieval quality just silently collapses.
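To make the failure mode concrete, here is an illustrative sketch (not OpenArc code) of the three strategies over a `[batch, seq, hidden]` last hidden state and a `[batch, seq]` attention mask; once any row contains padding, the three produce different vectors, which is why the wrong choice degrades retrieval without erroring:

```python
import numpy as np

def cls_pool(hidden, mask):
    # Encoder models like bge-m3: take the first ([CLS]) token.
    return hidden[:, 0]

def mean_pool(hidden, mask):
    # e5 / MiniLM style: average over non-pad tokens only.
    m = mask[:, :, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / np.clip(m.sum(axis=1), 1e-9, None)

def last_token_pool(hidden, mask):
    # Decoder models like Qwen3-Embedding: last non-pad token per row.
    idx = mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), idx]

hidden = np.random.default_rng(0).normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])
# On the padded first row, cls_pool returns token 0, mean_pool averages
# tokens 0-2, and last_token_pool returns token 2 — three different vectors.
```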

Reproducer

  1. Convert BAAI/bge-m3 to OpenVINO IR via optimum-intel (preserves the 1_Pooling/config.json shipped with the model).
  2. Register it in openarc_config.json with engine: "optimum", model_type: "emb".
  3. Hit /v1/embeddings and compare against the PyTorch reference — cosine similarity is ~0.3–0.6 instead of ~1.0.
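A minimal sketch of step 3's comparison, assuming an OpenAI-compatible response shape from /v1/embeddings (endpoint URL and model id here are placeholders):

```python
import numpy as np

def cosine(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# resp = requests.post("http://localhost:8000/v1/embeddings",
#                      json={"model": "bge-m3", "input": text})
# openarc_vec = resp.json()["data"][0]["embedding"]
# ref_vec = SentenceTransformer("BAAI/bge-m3").encode(text)
# With last-token pooling misapplied, cosine(openarc_vec, ref_vec) lands
# around 0.3-0.6; with correct CLS pooling it should exceed 0.999.
```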

Proposal

Make pooling mode metadata-driven, with the same precedence sentence-transformers itself uses:

  1. runtime_config.pool_mode (operator override; one of "cls" | "mean" | "last"). Unknown values raise at load time so typos don't silently fall back.
  2. <model_path>/1_Pooling/config.json (auto-detect from the file the model ships with — pooling_mode_cls_token → cls, pooling_mode_mean_tokens → mean).
  3. Default: "last" (preserves current Qwen3-Embedding behavior; no change for existing users).

This keeps the registry config minimal for the common case (correct pooling auto-detected from model files) while giving a clear escape hatch for models that ship without sentence-transformers metadata or with the wrong metadata.

Scope

  • src/engine/optimum/optimum_emb.py: add cls_pool / mean_pool, dispatch in pool(), resolve mode in load_model.
  • Unit tests for each pool, _detect_pool_mode, the override path, and unknown-value rejection.
  • Integration test loading bge-m3 and asserting cls auto-detect + 1024-dim unit-normed vector.
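One possible shape for the dispatch, as a hedged sketch (class and method names are assumptions, not OpenArc's actual internals, and the real engine would operate on torch/OpenVINO tensors rather than numpy):

```python
import numpy as np

class EmbeddingPooler:
    """Dispatches to the pooling strategy resolved once at load time."""

    def __init__(self, mode: str):
        self._fn = {"cls": self._cls, "mean": self._mean, "last": self._last}[mode]

    def pool(self, hidden, mask):
        return self._fn(np.asarray(hidden), np.asarray(mask))

    @staticmethod
    def _cls(hidden, mask):
        return hidden[:, 0]

    @staticmethod
    def _mean(hidden, mask):
        m = mask[:, :, None].astype(hidden.dtype)
        return (hidden * m).sum(axis=1) / np.clip(m.sum(axis=1), 1e-9, None)

    @staticmethod
    def _last(hidden, mask):
        rows = np.arange(hidden.shape[0])
        return hidden[rows, mask.sum(axis=1) - 1]
```

Resolving the mode once in load_model and binding the function keeps the per-batch pool() call free of branching.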

Happy to send a PR — branch is ready (feat/embedding-pool-dispatch on KIntegrated/OpenArc), verified end-to-end against PyTorch (cos > 0.999) and live via /v1/embeddings on GPU.
