scoring: switch canonical sentence model to bge-small-en-v1.5 @ 0.72 #209

Merged
that-guy-wade merged 1 commit into main from sethschilbe/swap-canonical-to-bge-small
Jun 30, 2026
@that-guy-wade

Contributor

Description

Switch the canonical sentence-embedding model from Qwen/Qwen3-Embedding-0.6B @ 0.7 to BAAI/bge-small-en-v1.5 @ 0.72 for rule-based title scoring. The smaller model is ~16x cheaper to encode on the validator's ARM CPU and is about as accurate as Qwen3 in a 15-pair human-judgment audit of cases where the two models disagreed (Qwen3 right 7, bge-small right 8).

The decision is backed by two weeks of opt-in shadow logging that re-ran every (product_title, gt_title) pair through bge-small alongside Qwen3 and logged both cosine similarities to CloudWatch. Aggregated signal across ~6,200 pairs:

  • Race-phase (n=5735): Pearson r 0.8554 vs Qwen3, best agreement at shadow=0.76 (84.7%)
  • Qualifying-phase (n=501, partial window): Pearson r 0.7547, best agreement at shadow=0.70 (82.4%)
  • Distribution stats are nearly identical between phases (canon median 0.747 / 0.739; shadow 0.788 / 0.772), so the shift in the best threshold comes mostly from the model swap rather than from phase-specific behaviour.
  • 0.72 is the compromise threshold that keeps both phases above ~78% agreement with the historical canonical decision and lines up with bge-small's observed distribution. The previous v1.1.12 attempt used threshold 0.95, far above bge-small's mean (0.79), and caused the regression that was rolled back.
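The agreement figures above come from comparing pass/fail decisions of the two models on the same pairs. A minimal sketch of that kind of analysis follows, assuming each logged pair carries a canonical and a shadow cosine and that a pair passes when its cosine is at or above the threshold. The function names and the `>=` comparison are illustrative assumptions, not the repository's actual analysis code.

```python
from math import sqrt

CANON_THRESHOLD = 0.7  # Qwen3 threshold stated in the description


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences of floats."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def agreement_rate(canon, shadow, shadow_threshold, canon_threshold=CANON_THRESHOLD):
    """Fraction of pairs where both models reach the same pass/fail decision."""
    same = sum(
        (c >= canon_threshold) == (s >= shadow_threshold)
        for c, s in zip(canon, shadow)
    )
    return same / len(canon)


def best_shadow_threshold(canon, shadow, candidates):
    """Return (threshold, agreement) with the highest agreement.

    On ties, the first candidate in iteration order wins.
    """
    return max(
        ((t, agreement_rate(canon, shadow, t)) for t in candidates),
        key=lambda pair: pair[1],
    )
```

Running `best_shadow_threshold` once per phase and then picking a single value that keeps every phase above an acceptable agreement level is one way to arrive at a compromise threshold like 0.72.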

The shadow-scoring scaffolding stays in place so the next candidate swap can run through the same protocol without code changes.

Changes Made

  • src/agent/rewards/orm.py: SENTENCE_MODEL_NAME = "BAAI/bge-small-en-v1.5", TITLE_SIM_THRESHOLD = 0.72. Drop the torch_dtype=bfloat16 override since bge-small is small enough that reduced precision buys little.
  • docker/validator/Dockerfile: pre-cache only the new canonical model. Drop the Qwen3 pre-cache to keep the image lean.
  • tests/test_scoring_perf.py: update the inline threshold comment.
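For readers unfamiliar with how the two constants interact, here is a rough sketch of threshold-based title matching. Only `SENTENCE_MODEL_NAME` and `TITLE_SIM_THRESHOLD` come from this PR. The `cosine` and `title_matches` helpers, the injected `encode` callable, and the `>=` comparison are assumptions made for illustration and may not match the code in orm.py.

```python
from math import sqrt

SENTENCE_MODEL_NAME = "BAAI/bge-small-en-v1.5"  # value set in this PR
TITLE_SIM_THRESHOLD = 0.72  # value set in this PR


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def title_matches(product_title, gt_title, encode, threshold=TITLE_SIM_THRESHOLD):
    """Hypothetical helper: True when the two titles embed close enough.

    `encode` is any callable mapping a list of strings to a sequence of vectors.
    """
    product_vec, gt_vec = encode([product_title, gt_title])
    return cosine(product_vec, gt_vec) >= threshold
```

With the sentence-transformers package installed, `SentenceTransformer(SENTENCE_MODEL_NAME).encode` could be passed as `encode`. Injecting the encoder keeps the thresholding logic testable without loading a model.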

Issue Link

  • Related to: N/A (follow-up to the rolled-back v1.1.12 embedding swap; tracked internally)
  • Closes: N/A

Testing

Manual Testing

Ran the existing scoring-perf unit suite locally.

Test Results: uv run pytest tests/test_scoring_perf.py -x -q → 6 passed in 0.12s.

Automated Testing

Existing tests cover the GT-skip / non-GT encode / shadow-logging / precomputed-embedding-dim paths. They mock the sentence model so they are model-agnostic; the threshold-comment update keeps the doc accurate.
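As an illustration of why mocking makes the suite model-agnostic, the sketch below replaces the sentence model with a `unittest.mock.MagicMock` whose `encode` returns fixed vectors. The `score_title` function is a hypothetical stand-in defined here for the example; it is not the function under test in tests/test_scoring_perf.py.

```python
from unittest.mock import MagicMock


def score_title(model, product_title, gt_title, threshold):
    """Hypothetical scorer: cosine of the two embeddings compared to threshold."""
    a, b = model.encode([product_title, gt_title])
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm >= threshold


def test_score_title_is_model_agnostic():
    # The mock stands in for any sentence model; no weights are loaded.
    model = MagicMock()
    model.encode.return_value = [[3.0, 4.0], [3.0, 4.0]]  # identical vectors
    assert score_title(model, "red mug", "red mug", threshold=0.72)
    model.encode.assert_called_once_with(["red mug", "red mug"])
```

Because the embeddings are fixed by the mock, a test like this exercises the scoring path and passes regardless of which model name is configured.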

Test Command(s):

uv run pytest tests/test_scoring_perf.py -x -q

Documentation

  • README updated
  • Code comments added/updated
  • API documentation updated
  • Configuration documentation updated
  • Other documentation updated (please specify):

Documentation Changes:
N/A

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been published and merged

Additional Notes

After merge:

  1. Tag a release (this lands on latest; promote to stable only after a staging burn-in confirms the swap matches the shadow-derived expectation in a live race phase).
  2. Watch the canonical-vs-shadow disagreement rate via the same CloudWatch filter we used to decide on the swap — if the agreement drops below the ~80% floor under live scoring, roll back rather than tune the threshold further.
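The rollback rule in step 2 can be expressed as a simple guard. This is a sketch only: the 0.80 floor is taken from the note above, while the `should_roll_back` name, the input shape, and the `min_samples` cutoff are assumptions introduced for illustration.

```python
AGREEMENT_FLOOR = 0.80  # the ~80% floor named in step 2


def should_roll_back(decisions, floor=AGREEMENT_FLOOR, min_samples=100):
    """Decide whether live agreement has fallen below the floor.

    decisions: iterable of (canonical_pass, shadow_pass) boolean pairs.
    min_samples is an assumed guard against judging on too little data.
    """
    decisions = list(decisions)
    if len(decisions) < min_samples:
        return False  # not enough pairs yet to make a call
    agree = sum(c == s for c, s in decisions)
    return agree / len(decisions) < floor
```

Keeping the check binary (roll back or not) mirrors the note's intent of reverting rather than tuning the threshold further under live scoring.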

Two weeks of shadow logs comparing canonical Qwen3-Embedding-0.6B against
the candidate BAAI/bge-small-en-v1.5 over ~6,200 production (product,
gt) title pairs:

  * Pearson r 0.85 (race-phase, n=5735) and 0.75 (qualifying, n=501)
  * Best agreement at shadow=0.76 for race-phase (84.7%) and 0.70 for
    qualifying (82.4%) — picking 0.72 as the compromise threshold that
    keeps both phases above 78% agreement at canon=0.7.
  * 15-pair human-judgment review of model disagreements split close to
    even (Qwen3 right 7, bge-small right 8), so neither is meaningfully
    more correct.
  * bge-small is 33M params vs Qwen3's 600M — roughly 16x cheaper to
    encode on the validator's ARM CPU.

Same swap was attempted in v1.1.12 at threshold 0.95 and rolled back
because the 0.95 cliff was wildly off bge-small's actual distribution
(mean ~0.79). 0.72 lines up with the observed distribution.

  * Drop the bfloat16 model_kwargs override — bge-small ships fp32 by
    default and is small enough that bf16 buys little.
  * Drop the Qwen3 pre-cache from the validator Dockerfile so the image
    no longer carries the larger model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
that-guy-wade self-assigned this Jun 30, 2026
that-guy-wade requested a review from shardi-b June 30, 2026 04:40

shardi-b left a comment

Contributor

Approved. No issues found during code review.

that-guy-wade merged commit 23fb138 into main Jun 30, 2026
2 checks passed
that-guy-wade deleted the sethschilbe/swap-canonical-to-bge-small branch June 30, 2026 04:52