scoring: switch canonical sentence model to bge-small-en-v1.5 @ 0.72 #209
Merged
Two weeks of shadow logs comparing canonical Qwen3-Embedding-0.6B against
the candidate BAAI/bge-small-en-v1.5 over ~6,200 production (product,
gt) title pairs:
* Pearson r 0.85 (race-phase, n=5735) and 0.75 (qualifying, n=501)
* Best agreement at shadow=0.76 for race-phase (84.7%) and 0.70 for
qualifying (82.4%) — picking 0.72 as the compromise threshold that
keeps both phases above 78% agreement at canon=0.7.
* 15-pair human-judgment review of model disagreements split close to
even (Qwen3 right 7, bge-small right 8), so neither is meaningfully
more correct.
* bge-small is 33M params vs Qwen3's 600M — roughly 16x cheaper to
encode on the validator's ARM CPU.
Same swap was attempted in v1.1.12 at threshold 0.95 and rolled back
because the 0.95 cliff was wildly off bge-small's actual distribution
(mean ~0.79). 0.72 lines up with the observed distribution.
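To make the threshold-versus-distribution point concrete, here is a minimal sketch of a pass-rate check. The scores are synthetic values chosen to average 0.79, not production data, and `pass_rate` is a hypothetical helper that is not part of this repository.

```python
from statistics import mean
from typing import Sequence


def pass_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of cosine scores that clear the given threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Synthetic cosine scores centred near 0.79 (illustrative only).
scores = [0.70, 0.74, 0.77, 0.79, 0.80, 0.82, 0.84, 0.86]

print(f"mean score      : {mean(scores):.2f}")
print(f"pass rate @ 0.95: {pass_rate(scores, 0.95):.3f}")
print(f"pass rate @ 0.72: {pass_rate(scores, 0.72):.3f}")
```

On a distribution shaped like this, a 0.95 cutoff rejects every pair, while 0.72 sits inside the bulk of the scores.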
* Drop the bfloat16 model_kwargs override — bge-small ships fp32 by
default and is small enough that bf16 buys little.
* Drop the Qwen3 pre-cache from the validator Dockerfile so the image
no longer carries the larger model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
shardi-b (Contributor) approved these changes on Jun 30, 2026 and left a comment:
Approved. No issues found during code review.
Description
Switch the canonical sentence-embedding model from Qwen/Qwen3-Embedding-0.6B @ 0.7 to BAAI/bge-small-en-v1.5 @ 0.72 for rule-based title scoring. The smaller model is ~16x cheaper to encode on the validator's ARM CPU and matches Qwen3's correctness on a 15-pair human-judgment audit (Qwen3 right 7, bge-small right 8 across the disagreement set).
The decision is backed by two weeks of opt-in shadow logging that re-ran every (product_title, gt_title) pair through bge-small alongside Qwen3 and pushed the dual cosine to CloudWatch. Aggregated signal across ~6,200 pairs:
* Race phase (n=5735): Pearson r 0.8554 vs Qwen3, best agreement at shadow=0.76 (84.7%)
* Qualifying (n=501, partial window): Pearson r 0.7547, best agreement at shadow=0.70 (82.4%)
The shadow-scoring scaffolding stays in place so the next candidate swap can run through the same protocol without code changes.
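The two statistics quoted above (Pearson r and threshold agreement) can be sketched as follows. This is an assumed reading of "agreement": the fraction of pairs where the canonical model's pass/fail decision at 0.7 equals the shadow model's decision at a candidate threshold. The function names and the toy scores are hypothetical. They do not come from the shadow logs or the repository.

```python
import math
from typing import Sequence

CANON_THRESHOLD = 0.7  # Qwen3 threshold quoted in this PR


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def agreement(canon: Sequence[float], shadow: Sequence[float],
              shadow_threshold: float,
              canon_threshold: float = CANON_THRESHOLD) -> float:
    """Share of pairs where both models make the same pass/fail call."""
    same = sum((c >= canon_threshold) == (s >= shadow_threshold)
               for c, s in zip(canon, shadow))
    return same / len(canon)


def best_shadow_threshold(canon: Sequence[float], shadow: Sequence[float],
                          candidates: Sequence[float]) -> float:
    """Candidate threshold with the highest agreement (first one wins ties)."""
    return max(candidates, key=lambda t: agreement(canon, shadow, t))


# Toy dual-cosine pairs (illustrative only, not production data).
canon = [0.82, 0.65, 0.71, 0.40, 0.90]
shadow = [0.85, 0.74, 0.78, 0.60, 0.93]

print("pearson r        :", round(pearson_r(canon, shadow), 4))
print("agreement @ 0.72 :", agreement(canon, shadow, 0.72))
print("best threshold   :", best_shadow_threshold(canon, shadow, [0.70, 0.72, 0.76]))
```

Running the sweep per phase and then choosing one threshold that keeps both phases acceptable is the kind of compromise the 0.72 choice reflects.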
Changes Made
* src/agent/rewards/orm.py: SENTENCE_MODEL_NAME = "BAAI/bge-small-en-v1.5", TITLE_SIM_THRESHOLD = 0.72. Drop the torch_dtype=bfloat16 override since bge-small is small enough that the precision shave buys little.
* docker/validator/Dockerfile: pre-cache only the new canonical model. Drop the Qwen3 pre-cache to keep the image lean.
* tests/test_scoring_perf.py: update the inline threshold comment.
Issue Link
Testing
Manual Testing
Ran the existing scoring-perf unit suite locally.
Test Results:
uv run pytest tests/test_scoring_perf.py -x -q → 6 passed in 0.12s.
Automated Testing
Existing tests cover the GT-skip / non-GT encode / shadow-logging / precomputed-embedding-dim paths. They mock the sentence model so they are model-agnostic; the threshold-comment update keeps the doc accurate.
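As an illustration of why mocked tests stay model-agnostic, here is a minimal sketch of a threshold gate exercised with a stub encoder. `StubModel`, `cosine`, and `title_matches` are hypothetical names for this example and are not the helpers in tests/test_scoring_perf.py or orm.py. Only the 0.72 threshold comes from this PR.

```python
import math
from typing import Callable, Dict, List, Sequence

TITLE_SIM_THRESHOLD = 0.72  # value set in this PR


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def title_matches(encode: Callable[[str], Sequence[float]],
                  product_title: str, gt_title: str,
                  threshold: float = TITLE_SIM_THRESHOLD) -> bool:
    """True when the two titles' embeddings clear the similarity threshold."""
    return cosine(encode(product_title), encode(gt_title)) >= threshold


class StubModel:
    """Stand-in for the sentence model: canned vectors plus a call counter."""

    def __init__(self, vectors: Dict[str, List[float]]) -> None:
        self.vectors = vectors
        self.calls = 0

    def encode(self, text: str) -> List[float]:
        self.calls += 1
        return self.vectors[text]


def test_threshold_gate_is_model_agnostic() -> None:
    stub = StubModel({
        "red shoe": [1.0, 0.0],
        "red shoes": [0.9, 0.1],
        "blue hat": [0.0, 1.0],
    })
    assert title_matches(stub.encode, "red shoe", "red shoes")
    assert not title_matches(stub.encode, "red shoe", "blue hat")
    assert stub.calls == 4  # two encodes per comparison
```

Because the stub fixes the embeddings, a test of this shape passes regardless of which real model is configured. That is also why it cannot validate the threshold itself.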
Test Command(s):
Documentation
Documentation Changes:
N/A
Checklist
Additional Notes
After merge:
latest; promote to stable only after a staging burn-in confirms the swap matches the shadow-derived expectation in a live race phase).