scoring: switch canonical sentence model to bge-small-en-v1.5 @ 0.72 by that-guy-wade · Pull Request #209 · ORO-AI/oro

that-guy-wade · 2026-06-30T04:40:15Z

Description

Switch the canonical sentence-embedding model from Qwen/Qwen3-Embedding-0.6B @ 0.7 to BAAI/bge-small-en-v1.5 @ 0.72 for rule-based title scoring. The smaller model is ~16x cheaper to encode on the validator's ARM CPU and matches Qwen3's correctness on a 15-pair human-judgment audit (Qwen3 right 7, bge-small right 8 across the disagreement set).

Decision is backed by two weeks of opt-in shadow logging that re-ran every (product_title, gt_title) pair through bge-small alongside Qwen3 and pushed the dual cosine to CloudWatch. Aggregated signal across ~6,200 pairs:

Race-phase (n=5735): Pearson r 0.8554 vs Qwen3, best agreement at shadow=0.76 (84.7%)
Qualifying-phase (n=501, partial window): Pearson r 0.7547, best agreement at shadow=0.70 (82.4%)
Distribution stats are nearly identical between phases (canon median 0.747 / 0.739; shadow 0.788 / 0.772), so the shift in the best-agreement threshold is mostly attributable to the model swap, not to phase-specific behaviour.
0.72 is the compromise threshold that keeps both phases above ~78% agreement with the historical canonical decision and lines up with bge-small's observed distribution. The previous v1.1.12 attempt at threshold 0.95 was wildly off bge-small's mean (0.79) and produced the regression we rolled back from.
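For readers who want to reproduce the aggregate numbers from the shadow logs, the sketch below shows one way the Pearson correlation and the per-threshold agreement rate could be computed. It is an illustration under stated assumptions, not the analysis script used for this PR: the function names, the input layout (two parallel lists of cosine scores), and the `>=` comparison are hypothetical. Only the canonical threshold of 0.7 comes from the description above.

```python
from math import sqrt

# Historical canonical (Qwen3) threshold, taken from the PR description.
CANON_THRESHOLD = 0.70


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def agreement_rate(canon, shadow, shadow_threshold,
                   canon_threshold=CANON_THRESHOLD):
    """Fraction of pairs where both models reach the same pass/fail decision.

    Assumption: a pair passes when its cosine is >= the threshold.
    """
    matches = sum(
        (c >= canon_threshold) == (s >= shadow_threshold)
        for c, s in zip(canon, shadow)
    )
    return matches / len(canon)


def best_shadow_threshold(canon, shadow, candidates):
    """Candidate shadow threshold with the highest agreement rate."""
    return max(candidates, key=lambda t: agreement_rate(canon, shadow, t))
```

Running `best_shadow_threshold` separately on the race-phase and qualifying-phase logs would yield one optimum per phase; a compromise value such as 0.72 can then be checked with `agreement_rate` against each phase in turn.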

The shadow-scoring scaffolding stays in place so the next candidate swap can run through the same protocol without code changes.

Changes Made

src/agent/rewards/orm.py: SENTENCE_MODEL_NAME = "BAAI/bge-small-en-v1.5", TITLE_SIM_THRESHOLD = 0.72. Drop the torch_dtype=bfloat16 override, since bge-small is small enough that the reduced precision buys little.
docker/validator/Dockerfile: pre-cache only the new canonical model. Drop the Qwen3 pre-cache to keep the image lean.
tests/test_scoring_perf.py: update the inline threshold comment.
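As a rough illustration of what the two constants control, the sketch below applies a cosine-similarity threshold to a pair of precomputed title embeddings. Only `SENTENCE_MODEL_NAME` and `TITLE_SIM_THRESHOLD` come from this PR; the helper names `cosine_similarity` and `titles_match`, the plain-list embedding type, and the `>=` comparison are assumptions made for the example and may not match the code in `orm.py`.

```python
import math

# Values set by this PR in src/agent/rewards/orm.py.
SENTENCE_MODEL_NAME = "BAAI/bge-small-en-v1.5"
TITLE_SIM_THRESHOLD = 0.72


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def titles_match(product_emb, gt_emb, threshold=TITLE_SIM_THRESHOLD):
    """True when the title embeddings clear the threshold.

    Assumption: the comparison is inclusive (>=); the real rule may differ.
    """
    return cosine_similarity(product_emb, gt_emb) >= threshold
```

Because the threshold sits close to the centre of bge-small's observed score distribution, small changes to it move a comparatively large share of pairs across the pass/fail line, which is why the value was derived from the shadow logs rather than carried over from Qwen3.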

Issue Link

Related to: N/A (follow-up to the rolled-back v1.1.12 embedding swap; tracked internally)
Closes: N/A

Testing

Manual Testing

Ran the existing scoring-perf unit suite locally.

Test Results: uv run pytest tests/test_scoring_perf.py -x -q → 6 passed in 0.12s.

Automated Testing

Existing tests cover the GT-skip / non-GT encode / shadow-logging / precomputed-embedding-dim paths. They mock the sentence model so they are model-agnostic; the threshold-comment update keeps the doc accurate.

Test Command(s):

uv run pytest tests/test_scoring_perf.py -x -q

Documentation

README updated
Code comments added/updated
API documentation updated
Configuration documentation updated
Other documentation updated (please specify):

Documentation Changes:
N/A

Checklist

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings or errors
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been published and merged

Additional Notes

After merge:

Tag a release (this lands on latest; promote to stable only after a staging burn-in confirms the swap matches the shadow-derived expectation in a live race phase).
Watch the canonical-vs-shadow disagreement rate via the same CloudWatch filter we used to decide on the swap — if the agreement drops below the ~80% floor under live scoring, roll back rather than tune the threshold further.
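The rollback rule above could be expressed as a small check over counts pulled from the CloudWatch filter. This is a sketch only: the function name, the `min_samples` guard, and its default are hypothetical, and the 0.80 floor is the approximate figure quoted in the note above rather than a value defined in the codebase.

```python
# Approximate agreement floor quoted in the post-merge notes.
AGREEMENT_FLOOR = 0.80


def should_roll_back(agree_count, total_count,
                     floor=AGREEMENT_FLOOR, min_samples=500):
    """Decide whether live agreement has fallen below the floor.

    min_samples is a hypothetical guard so that a handful of early
    pairs cannot trigger a rollback on their own.
    """
    if total_count < min_samples:
        return False  # not enough live data to judge yet
    return (agree_count / total_count) < floor
```

For example, 780 agreeing decisions out of 1,000 live pairs gives 78% agreement, which is under the floor and would call for a rollback instead of another threshold adjustment.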

Two weeks of shadow logs comparing canonical Qwen3-Embedding-0.6B against the candidate BAAI/bge-small-en-v1.5 over ~6,200 production (product, gt) title pairs:

* Pearson r 0.85 (race-phase, n=5735) and 0.75 (qualifying, n=501)
* Best agreement at shadow=0.76 for race-phase (84.7%) and 0.70 for qualifying (82.4%); picking 0.72 as the compromise threshold that keeps both phases above 78% agreement at canon=0.7.
* 15-pair human-judgment review of model disagreements split close to even (Qwen3 right 7, bge-small right 8), so neither is meaningfully more correct.
* bge-small is 33M params vs Qwen3's 600M, roughly 16x cheaper to encode on the validator's ARM CPU.

Same swap was attempted in v1.1.12 at threshold 0.95 and rolled back because the 0.95 cliff was wildly off bge-small's actual distribution (mean ~0.79). 0.72 lines up with the observed distribution.

* Drop the bfloat16 model_kwargs override: bge-small ships fp32 by default and is small enough that bf16 buys little.
* Drop the Qwen3 pre-cache from the validator Dockerfile so the image no longer carries the larger model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shardi-b

Approved. No issues found during code review.

that-guy-wade self-assigned this Jun 30, 2026

that-guy-wade requested a review from shardi-b June 30, 2026 04:40

shardi-b approved these changes Jun 30, 2026

that-guy-wade merged commit 23fb138 into main Jun 30, 2026
2 checks passed

that-guy-wade deleted the sethschilbe/swap-canonical-to-bge-small branch June 30, 2026 04:52
