Problem
`src/engine/optimum/optimum_emb.py` always applies last-token pooling in `generate_embeddings`:

```python
embeddings = self.last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
```
This is correct for decoder-style embedding models like `Qwen3-Embedding-*`, but wrong for the much larger family of encoder-style models that ship via sentence-transformers — most notably:

- `BAAI/bge-m3`, `BAAI/bge-large-en-v1.5`, etc. → CLS pooling
- `sentence-transformers/all-MiniLM-L6-v2`, `intfloat/multilingual-e5-*` → mean pooling

Loading any of these through OpenArc's optimum engine today returns numerically valid but semantically wrong vectors (last hidden state of the final non-pad token instead of `[CLS]` / masked mean). There's no error and no warning — retrieval quality just silently collapses.
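For reference, the two missing variants are small. A minimal sketch, assuming the same `(batch, seq, hidden)` hidden states and `(batch, seq)` attention mask that `last_token_pool` already consumes (the `cls_pool` / `mean_pool` names match the Scope section below):

```python
import torch

def cls_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # BERT-style encoders put [CLS] at position 0, so the mask is not needed.
    return last_hidden_state[:, 0]

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Masked mean: zero out pad positions, then divide by the per-row real-token count.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts
```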
Reproducer
- Convert `BAAI/bge-m3` to OpenVINO IR via `optimum-intel` (preserves the `1_Pooling/config.json` shipped with the model).
- Register it in `openarc_config.json` with `engine: "optimum"`, `model_type: "emb"`.
- Hit `/v1/embeddings` and compare against the PyTorch reference (a comparison sketch follows below) — cosine similarity is ~0.3–0.6 instead of ~1.0.
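Something like the following makes the comparison concrete; the host, port, and registry model name are assumptions to adjust to your setup:

```python
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

text = "Pooling mode silently changes retrieval quality."

# PyTorch reference via sentence-transformers (applies CLS pooling for bge-m3).
reference = SentenceTransformer("BAAI/bge-m3").encode([text], normalize_embeddings=True)[0]

# Endpoint and model name are hypothetical — match your openarc_config.json entry.
resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "bge-m3", "input": [text]},
)
served = np.asarray(resp.json()["data"][0]["embedding"])
served /= np.linalg.norm(served)

print("cosine:", float(reference @ served))  # ~0.3–0.6 today; ~1.0 once CLS pooling is applied
```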
Proposal
Make pooling mode metadata-driven, with the same precedence sentence-transformers itself uses:
1. `runtime_config.pool_mode` (operator override; one of `"cls" | "mean" | "last"`). Unknown values raise at load time so typos don't silently fall back.
2. `<model_path>/1_Pooling/config.json` (auto-detect from the file the model ships with — `pooling_mode_cls_token` → `cls`, `pooling_mode_mean_tokens` → `mean`).
3. Default: `"last"` (preserves current Qwen3-Embedding behavior; no change for existing users).
This keeps the registry config minimal for the common case (correct pooling auto-detected from model files) while giving a clear escape hatch for models that ship without sentence-transformers metadata or with the wrong metadata.
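In code, the resolution could look roughly like this; `_detect_pool_mode` is the helper named in the Scope section, and the exact signature is an assumption:

```python
import json
from pathlib import Path

_VALID_MODES = ("cls", "mean", "last")

def _detect_pool_mode(model_path: str, override: str | None = None) -> str:
    # 1. Operator override from runtime_config.pool_mode — unknown values raise at load time.
    if override is not None:
        if override not in _VALID_MODES:
            raise ValueError(f"Unknown pool_mode {override!r}; expected one of {_VALID_MODES}")
        return override
    # 2. Auto-detect from the sentence-transformers pooling config shipped with the model.
    cfg_path = Path(model_path) / "1_Pooling" / "config.json"
    if cfg_path.is_file():
        cfg = json.loads(cfg_path.read_text())
        if cfg.get("pooling_mode_cls_token"):
            return "cls"
        if cfg.get("pooling_mode_mean_tokens"):
            return "mean"
    # 3. Default: current behavior for decoder-style models like Qwen3-Embedding-*.
    return "last"
```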
Scope
- `src/engine/optimum/optimum_emb.py`: add `cls_pool` / `mean_pool`, dispatch in `pool()`, resolve mode in `load_model`.
- Unit tests for each pool, `_detect_pool_mode`, the override path, and unknown-value rejection (sketched below).
- Integration test loading bge-m3 and asserting cls auto-detect + a 1024-dim unit-normed vector.
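To illustrate the shape of those unit tests (building on the hypothetical `mean_pool` and `_detect_pool_mode` sketches above):

```python
import pytest
import torch

def test_mean_pool_ignores_padding():
    hidden = torch.tensor([[[2.0, 4.0], [6.0, 8.0], [99.0, 99.0]]])  # third token is pad
    mask = torch.tensor([[1, 1, 0]])
    # Masked mean over the two real tokens: ([2, 4] + [6, 8]) / 2 == [4, 6].
    assert torch.allclose(mean_pool(hidden, mask), torch.tensor([[4.0, 6.0]]))

def test_unknown_pool_mode_rejected():
    with pytest.raises(ValueError):
        _detect_pool_mode("/nonexistent", override="avg")  # typo must fail loudly, not fall back
```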
Happy to send a PR — branch is ready (`feat/embedding-pool-dispatch` on KIntegrated/OpenArc), verified end-to-end against PyTorch (cos > 0.999) and live via `/v1/embeddings` on GPU.