chore(session): raise arbitration threshold 0.65 → 0.8 #1577

Merged
hijzy merged 1 commit into MemTensor:mem-agent-0424 from hijzy:chore/raise-arbitration-threshold
Apr 29, 2026

Conversation

@hijzy hijzy commented Apr 29, 2026

Summary

  • Tighten the trust gate for the LLM's new_task verdict in apps/memos-local-plugin/core/session/relation-classifier.ts — bump ARBITRATION_THRESHOLD from 0.65 to 0.8.
  • Real-world traces show the primary relation classifier often returns new_task at 0.65–0.75 confidence for messages that are actually sub-tasks of the same project (e.g. "So how do I configure the database?" right after "Configure Nginx"). The old 0.65 cut-off let those slip through without a second-pass review, wrongly splitting one logical task into two.
  • Routing more borderline new_task predictions through the bias-toward-follow_up arbitration prompt costs one extra LLM call on those turns, but reduces false topic splits (each false split prematurely closes the episode and leaves a stale lastEpisodeBySession entry that the L2/L3/skill chain has to recover from later).

Behaviour change

Only one knob changes; the public API, schema, and storage layout are untouched.

| Path | Before | After |
| --- | --- | --- |
| core/session/relation-classifier.ts::ARBITRATION_THRESHOLD | 0.65 | 0.8 |

Decision flow (unchanged structure, only the cut-off moves):

  1. LLM returns new_task with confidence < ARBITRATION_THRESHOLD → run the second-pass arbitration prompt (biased toward follow_up).
  2. Otherwise → take the LLM verdict at face value.
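
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in relation-classifier.ts: the `Verdict` and `Resolution` shapes, the `resolveRelation` function, and the `alwaysFollowUp` stub are all hypothetical names introduced here, and the real arbitration step is an LLM call rather than a local function. Only the constant name, its value, and the strict `<` comparison come from this PR.

```typescript
// Hypothetical shapes: the PR does not show the classifier's real types.
type Relation = "new_task" | "follow_up";

interface Verdict {
  relation: Relation;
  confidence: number; // assumed to lie in [0, 1]
}

interface Resolution {
  verdict: Verdict;
  arbitrated: boolean; // true when the second pass was consulted
}

// Value after this PR; it was 0.65 before.
const ARBITRATION_THRESHOLD = 0.8;

// Stand-in for the second-pass prompt (an extra LLM call in the real code).
type Arbitrator = (primary: Verdict) => Verdict;

function resolveRelation(primary: Verdict, arbitrate: Arbitrator): Resolution {
  // Step 1: only low-confidence new_task verdicts are sent to arbitration.
  const borderline =
    primary.relation === "new_task" &&
    primary.confidence < ARBITRATION_THRESHOLD;

  // Step 2: everything else is taken at face value.
  if (!borderline) {
    return { verdict: primary, arbitrated: false };
  }
  return { verdict: arbitrate(primary), arbitrated: true };
}

// Stub arbitrator that always sides with follow_up, for illustration only.
const alwaysFollowUp: Arbitrator = (primary) => ({
  relation: "follow_up",
  confidence: primary.confidence,
});

const borderlineCase = resolveRelation(
  { relation: "new_task", confidence: 0.7 },
  alwaysFollowUp,
);
const confidentCase = resolveRelation(
  { relation: "new_task", confidence: 0.85 },
  alwaysFollowUp,
);

console.log(borderlineCase.verdict.relation, borderlineCase.arbitrated);
console.log(confidentCase.verdict.relation, confidentCase.arbitrated);
```

With the old 0.65 cut-off, the 0.7 input would have skipped the second pass; with 0.8 it is arbitrated. Because the comparison is strict, a verdict at exactly 0.8 is still accepted as-is.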

Test plan

  • npx vitest run tests/unit/session/relation-classifier.test.ts — 16 tests passed (existing arbitration test uses a 0.5 confidence input which is below both old and new thresholds, so it still exercises the same path).
  • Smoke check on a real session where consecutive sub-task messages were previously being mis-split — confirm relation.classified log now shows signals: ["llm", "arbitration_override"] for those turns and the episode stays open.
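
To see why the existing 0.5-confidence test cannot tell the two thresholds apart, it may help to compare how a few sample confidences are routed before and after the change. The snippet below is an illustrative sketch only: `goesToArbitration` and the sample values are assumptions made here and are not part of the test suite, and it considers new_task verdicts alone.

```typescript
const OLD_THRESHOLD = 0.65;
const NEW_THRESHOLD = 0.8;

// Hypothetical helper: would a new_task verdict at this confidence
// be sent to the second-pass arbitration prompt?
function goesToArbitration(confidence: number, threshold: number): boolean {
  return confidence < threshold;
}

// 0.5 is the input used by the existing unit test; 0.7 and 0.85 are
// extra illustrative values on either side of the new cut-off.
const samples = [0.5, 0.7, 0.85];

const routing = samples.map((confidence) => ({
  confidence,
  before: goesToArbitration(confidence, OLD_THRESHOLD),
  after: goesToArbitration(confidence, NEW_THRESHOLD),
}));

for (const row of routing) {
  console.log(`${row.confidence}: before=${row.before} after=${row.after}`);
}
// prints:
// 0.5: before=true after=true
// 0.7: before=false after=true
// 0.85: before=false after=false
```

Only inputs in the 0.65–0.8 band change path, which is why the smoke check on a real session is what actually exercises the new behaviour.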

Tighten the trust gate for the LLM's `new_task` verdict in
`relation-classifier.ts`. Real-world traces show the primary classifier
often returns `new_task` at 0.65–0.75 confidence for messages that are
actually sub-tasks of the same project (e.g. "now configure the DB"
right after "set up nginx"). The old 0.65 cut-off let those slip through
without a second-pass review, splitting one logical task into two.

Pulling the threshold up to 0.8 routes more borderline `new_task`
predictions through the bias-toward-follow_up arbitration prompt. The
trade-off is one extra LLM call on borderline turns; the upside is
fewer false topic boundaries (each false split currently costs us a
premature episode close + a stale `lastEpisodeBySession` entry, which
the L2/L3/skill chain has to recover from later).

No schema/API change. Existing arbitration unit test still passes
(it was already using a 0.5 confidence input — well below either
threshold).
@hijzy hijzy merged commit ac276cb into MemTensor:mem-agent-0424 Apr 29, 2026
@hijzy hijzy deleted the chore/raise-arbitration-threshold branch May 8, 2026 03:44