feat: add skip_should_run_check config flag #14

Merged
yilu331 merged 2 commits into main from fix/type-checking-annotations
Apr 14, 2026

@yilu331 yilu331 commented Apr 14, 2026

Summary

  • Add skip_should_run_check: bool = False to the Config model
  • Add early-return guard in _should_run_before_extraction() that bypasses the LLM eligibility check when the flag is enabled
  • Add unit tests for the new flag in test_base_generation_service.py
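
The guard described above can be sketched as follows. This is a minimal illustration, not the merged code: only the field name skip_should_run_check, its default of False, and the method name _should_run_before_extraction() come from this PR. The dataclass-based Config, the GenerationServiceSketch class, the synchronous signature, and the _llm_eligibility_check() helper are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Field name and default are taken from the PR summary.
    # The real Config model has other fields and may not be a dataclass.
    skip_should_run_check: bool = False


class GenerationServiceSketch:
    """Hypothetical stand-in for the service that owns the guard."""

    def __init__(self, config: Config) -> None:
        self.config = config
        self.llm_calls = 0  # counts simulated eligibility calls

    def _llm_eligibility_check(self, batch: list[str]) -> bool:
        # Hypothetical placeholder for the real LLM eligibility call.
        self.llm_calls += 1
        return bool(batch)

    def _should_run_before_extraction(self, batch: list[str]) -> bool:
        # Early return: when the flag is set, extraction always proceeds
        # and the LLM is never consulted.
        if self.config.skip_should_run_check:
            return True
        return self._llm_eligibility_check(batch)
```

With the flag enabled the method returns True and llm_calls stays at 0. With the default config the eligibility check still runs once per call.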

Context

Users have no way to bypass the LLM pre-extraction eligibility check. This flag lets orgs skip it to save cost/latency when they want every batch extracted.

Test plan

  • TestShouldRunBeforeExtraction::test_skip_should_run_check_bypasses_llm_call — verifies flag returns True without LLM call
  • TestShouldRunBeforeExtraction::test_default_skip_should_run_check_does_not_bypass — verifies default is False
  • Full test suite (64 tests) passes

yilu331 added 2 commits April 14, 2026 15:47
Add a boolean config toggle that lets users bypass the LLM eligibility
check before extraction. When enabled, extraction always proceeds
without the pre-check LLM call.
@yilu331 yilu331 merged commit 4834903 into main Apr 14, 2026
yilu331 added a commit that referenced this pull request May 1, 2026
Adds an opt-in LLM relevance-judge rerank stage to search_user_profiles
(and the playbook variants), parallel to the existing cross-encoder
rerank. The new stage bridges synonym/brand→category gaps that pure
lexical/semantic models can't bridge — e.g. "Thrive Market" = grocery
service, "Suica card" = Tokyo transit, "TripIt app" = travel-organizer.
Cross-encoder upgrades (bge-reranker-v2-m3) were tested and rejected:
they don't have the retail-brand world knowledge needed.

Architecture:
- New helper score_pairs_llm() in reflexio/server/llm/rerank/llm_reranker.py
- New prompt rerank_relevance/v1.0.0 (relevance-judge with explicit
  brand→category and tool→use-case guidance, scoring rubric, and a rule
  that user-owned tools/cards/apps score 7-9 on help/tips questions)
- New tool arg llm_rerank: bool = False on SearchUserProfilesArgs and
  the playbook variants
- _maybe_rerank_hits tries LLM rerank first, then cross-encoder, then
  hybrid order as a fallback chain; any failure path returns None and
  the caller falls back gracefully
- Bundle wiring: search-tool handlers now receive llm_client +
  prompt_manager via _bundle_handler_with_llm
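
The fallback behaviour described for _maybe_rerank_hits can be sketched roughly as below. This is an assumption-laden illustration, not the code in this commit: the function name maybe_rerank_hits_sketch, the Scorer type, and the convention that a scorer returns None or raises on failure are all hypothetical. Only the ordering (LLM rerank first, then cross-encoder, with None signalling "keep the original hybrid order") follows the description above.

```python
from typing import Callable, Optional, Sequence

# Hypothetical scorer shape: one relevance score per hit, or None on failure.
Scorer = Callable[[str, Sequence[str]], Optional[list[float]]]


def maybe_rerank_hits_sketch(
    query: str, hits: Sequence[str], scorers: Sequence[Scorer]
) -> Optional[list[str]]:
    """Try each scorer in order; return reranked hits, or None if all fail."""
    for scorer in scorers:
        try:
            scores = scorer(query, hits)
        except Exception:
            scores = None  # treat any scorer error as a failed stage
        if scores is None or len(scores) != len(hits):
            continue  # fall through to the next stage in the chain
        order = sorted(range(len(hits)), key=lambda i: scores[i], reverse=True)
        return [hits[i] for i in order]
    return None  # caller keeps the original (hybrid) order


def search_sketch(query: str, hits: list[str], scorers: Sequence[Scorer]) -> list[str]:
    # Caller side: None means "no rerank happened", so keep the input order.
    reranked = maybe_rerank_hits_sketch(query, hits, scorers)
    return hits if reranked is None else reranked
```

Returning None rather than the unchanged list lets the caller tell "rerank ran and agreed with the original order" apart from "rerank did not run", which is one plausible reason for the design described above.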

Search prompt v1.10.0 documents llm_rerank in the tool palette and adds
targeted exceptions to Patterns A, C, D, F where brand/proper-noun
profiles are likely the answer but don't share the question's literal
keywords. Pattern B explicitly OPTS OUT (recency dominates; rerank
scrambles date order). All exceptions are tightly scoped to the
question shape.

Tested:
- 16 unit tests for score_pairs_llm fallback chain
- 10 unit tests for _maybe_rerank_hits dispatch + fallback semantics
- Trip-wire test updated; semver-sort bug in _get_latest_prompt_version
  fixed (would have locked v1.10.0 → v1.9.0 lexically)
- Smoke test on gpt4_2ba83207 (grocery superlative): Thrive Market
  ranked #14 baseline → #4 with llm_rerank=True
- Smoke test on 0a34ad58 (Tokyo Suica/TripIt): TripIt missing baseline
  → #3 with llm_rerank=True
- LongMemEval tune-100 r93 vs r91: 76/100 vs 74/100 (+2 acc); macro
  81.6% vs 80.5% (+1.1pt); M-S +14pt (the target gain), SS-P +10pt;
  K-U regression observed but traced to extraction-time non-determinism
  (knowledge updates not captured during re-ingest), not the rerank
  changes
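
The semver-sort issue mentioned above is easy to reproduce in isolation. The sketch below is not the fix applied to _get_latest_prompt_version; the helper name parse_version and the sample version list are illustrative assumptions. It only shows why a plain string comparison picks v1.9.0 over v1.10.0 and how comparing integer tuples avoids that.

```python
def parse_version(tag: str) -> tuple[int, ...]:
    """Turn 'v1.10.0' into (1, 10, 0) so components compare numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


versions = ["v1.2.0", "v1.9.0", "v1.10.0"]

# Lexical comparison looks at characters: '9' > '1', so v1.9.0 wins.
lexical_latest = max(versions)

# Numeric comparison looks at integers: 10 > 9, so v1.10.0 wins.
semver_latest = max(versions, key=parse_version)

print(lexical_latest)  # v1.9.0
print(semver_latest)   # v1.10.0
```

This simple parser assumes plain MAJOR.MINOR.PATCH tags with an optional leading "v"; pre-release or build suffixes would need a fuller parser.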

Bundled prompt-bank state catch-up:
- answer_synthesis v1.3.0/v1.4.0 (rules 13/14 from earlier rounds)
- extraction_user_profile v1.1.0/v1.1.1/v1.1.2 (relative-time
  resolution, started/finished pair preservation)
- compress_session_for_query v1.0.0–v1.3.0 (the in-tool denoiser
  introduced earlier; currently hard-disabled at the code level)
- Older prompt versions flipped to active: false

Misc:
- LiteLLMClient seeds default to "42" for benchmark reproducibility
- /api/search response now exposes rehydrated_text (set by the search
  agent when it called read_session_text)