Embedding engine hardcodes last-token pooling, producing wrong vectors for CLS/mean models (e.g. bge-m3) #89

@unverbraucht

Description

Problem

src/engine/optimum/optimum_emb.py always applies last-token pooling in generate_embeddings:

embeddings = self.last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])

This is correct for decoder-style embedding models like Qwen3-Embedding-*, but wrong for the much larger family of encoder-style models that ship via sentence-transformers — most notably:

  • BAAI/bge-m3, BAAI/bge-large-en-v1.5, etc. → CLS pooling
  • sentence-transformers/all-MiniLM-L6-v2, intfloat/multilingual-e5-* → mean pooling

Loading any of these through OpenArc's optimum engine today returns numerically valid but semantically wrong vectors (last hidden state of the final non-pad token instead of [CLS] / masked mean). There's no error and no warning — retrieval quality just silently collapses.
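To make the failure mode concrete, here is an illustrative sketch (not OpenArc code) of the three strategies over a `[batch, seq, hidden]` last hidden state and a `[batch, seq]` attention mask; once any row contains padding, the three produce different vectors, which is why the wrong choice degrades retrieval without erroring:

```python
import numpy as np

def cls_pool(hidden, mask):
    # Encoder models like bge-m3: take the first ([CLS]) token.
    return hidden[:, 0]

def mean_pool(hidden, mask):
    # e5 / MiniLM style: average over non-pad tokens only.
    m = mask[:, :, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / np.clip(m.sum(axis=1), 1e-9, None)

def last_token_pool(hidden, mask):
    # Decoder models like Qwen3-Embedding: last non-pad token per row.
    idx = mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), idx]

hidden = np.random.default_rng(0).normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])
# On the padded first row, cls_pool returns token 0, mean_pool averages
# tokens 0-2, and last_token_pool returns token 2 — three different vectors.
```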

Reproducer

  1. Convert BAAI/bge-m3 to OpenVINO IR via optimum-intel (preserves the 1_Pooling/config.json shipped with the model).
  2. Register it in openarc_config.json with engine: "optimum", model_type: "emb".
  3. Hit /v1/embeddings and compare against the PyTorch reference — cosine similarity is ~0.3–0.6 instead of ~1.0.
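A minimal sketch of step 3's comparison, assuming an OpenAI-compatible response shape from /v1/embeddings (endpoint URL and model id here are placeholders):

```python
import numpy as np

def cosine(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# resp = requests.post("http://localhost:8000/v1/embeddings",
#                      json={"model": "bge-m3", "input": text})
# openarc_vec = resp.json()["data"][0]["embedding"]
# ref_vec = SentenceTransformer("BAAI/bge-m3").encode(text)
# With last-token pooling misapplied, cosine(openarc_vec, ref_vec) lands
# around 0.3-0.6; with correct CLS pooling it should exceed 0.999.
```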

Proposal

Make pooling mode metadata-driven, with the same precedence sentence-transformers itself uses:

  1. runtime_config.pool_mode (operator override; one of "cls" | "mean" | "last"). Unknown values raise at load time so typos don't silently fall back.
  2. <model_path>/1_Pooling/config.json (auto-detect from the file the model ships with — pooling_mode_cls_token → cls, pooling_mode_mean_tokens → mean).
  3. Default: "last" (preserves current Qwen3-Embedding behavior; no change for existing users).

This keeps the registry config minimal for the common case (correct pooling auto-detected from model files) while giving a clear escape hatch for models that ship without sentence-transformers metadata or with the wrong metadata.

Scope

  • src/engine/optimum/optimum_emb.py: add cls_pool / mean_pool, dispatch in pool(), resolve mode in load_model.
  • Unit tests for each pool, _detect_pool_mode, the override path, and unknown-value rejection.
  • Integration test loading bge-m3 and asserting cls auto-detect + 1024-dim unit-normed vector.
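One possible shape for the dispatch, as a hedged sketch (class and method names are assumptions, not OpenArc's actual internals, and the real engine would operate on torch/OpenVINO tensors rather than numpy):

```python
import numpy as np

class EmbeddingPooler:
    """Dispatches to the pooling strategy resolved once at load time."""

    def __init__(self, mode: str):
        self._fn = {"cls": self._cls, "mean": self._mean, "last": self._last}[mode]

    def pool(self, hidden, mask):
        return self._fn(np.asarray(hidden), np.asarray(mask))

    @staticmethod
    def _cls(hidden, mask):
        return hidden[:, 0]

    @staticmethod
    def _mean(hidden, mask):
        m = mask[:, :, None].astype(hidden.dtype)
        return (hidden * m).sum(axis=1) / np.clip(m.sum(axis=1), 1e-9, None)

    @staticmethod
    def _last(hidden, mask):
        rows = np.arange(hidden.shape[0])
        return hidden[rows, mask.sum(axis=1) - 1]
```

Resolving the mode once in load_model and binding the function keeps the per-batch pool() call free of branching.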

Happy to send a PR — branch is ready (feat/embedding-pool-dispatch on KIntegrated/OpenArc), verified end-to-end against PyTorch (cos > 0.999) and live via /v1/embeddings on GPU.
