feat: add skip_should_run_check config flag #14
Merged
Add a boolean config toggle that lets users bypass the LLM eligibility check before extraction. When enabled, extraction always proceeds without the pre-check LLM call.
yilu331 added a commit that referenced this pull request on May 1, 2026:
Adds an opt-in LLM relevance-judge rerank stage to search_user_profiles (and the playbook variants), parallel to the existing cross-encoder rerank. The new stage bridges synonym and brand→category gaps that pure lexical/semantic models can't bridge, e.g. "Thrive Market" = grocery service, "Suica card" = Tokyo transit, "TripIt app" = travel-organizer. Cross-encoder upgrades (bge-reranker-v2-m3) were tested and rejected: they don't have the retail-brand world knowledge needed.

Architecture:
- New helper score_pairs_llm() in reflexio/server/llm/rerank/llm_reranker.py
- New prompt rerank_relevance/v1.0.0 (relevance-judge with explicit brand→category and tool→use-case guidance, scoring rubric, and a rule that user-owned tools/cards/apps score 7-9 on help/tips questions)
- New tool arg llm_rerank: bool = False on SearchUserProfilesArgs and the playbook variants
- _maybe_rerank_hits dispatches LLM rerank → cross-encoder → hybrid order in fallback chain; any failure path returns None and the caller falls back gracefully
- Bundle wiring: search-tool handlers now receive llm_client + prompt_manager via _bundle_handler_with_llm

Search prompt v1.10.0 documents llm_rerank in the tool palette and adds targeted exceptions to Patterns A, C, D, F where brand/proper-noun profiles are likely the answer but don't share the question's literal keywords. Pattern B explicitly OPTS OUT (recency dominates; rerank scrambles date order). All exceptions are tightly scoped to the question shape.
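The dispatch order described for _maybe_rerank_hits can be illustrated with a minimal sketch. This is not the code from the commit: the function name maybe_rerank_hits, the hit shape, and the scorer callables are hypothetical stand-ins, and the sketch assumes that "fallback chain" means trying the LLM scorer first, then the cross-encoder, and returning None so that the caller keeps the hybrid order.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical types: a hit is a dict with a "text" field, and a scorer
# returns one score per text or None when it cannot score.
Hit = dict
Scorer = Callable[[str, Sequence[str]], Optional[List[float]]]


def maybe_rerank_hits(
    query: str,
    hits: List[Hit],
    llm_rerank: bool = False,
    llm_scorer: Optional[Scorer] = None,
    cross_encoder_scorer: Optional[Scorer] = None,
) -> Optional[List[Hit]]:
    """Return reranked hits, or None so the caller keeps the hybrid order."""
    # Build the chain: LLM rerank only when opted in, then the cross-encoder.
    scorers: List[Scorer] = []
    if llm_rerank and llm_scorer is not None:
        scorers.append(llm_scorer)
    if cross_encoder_scorer is not None:
        scorers.append(cross_encoder_scorer)

    texts = [hit["text"] for hit in hits]
    for scorer in scorers:
        try:
            scores = scorer(query, texts)
        except Exception:
            scores = None  # treat any scorer failure as "no result"
        if scores is None or len(scores) != len(hits):
            continue  # fall through to the next stage in the chain
        order = sorted(range(len(hits)), key=lambda i: scores[i], reverse=True)
        return [hits[i] for i in order]

    return None  # every stage failed or was disabled
```

A caller would then use something like `reranked = maybe_rerank_hits(...) or hits`, so a failed rerank leaves the original ordering untouched.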
Tested:
- 16 unit tests for score_pairs_llm fallback chain
- 10 unit tests for _maybe_rerank_hits dispatch + fallback semantics
- Trip-wire test updated; semver-sort bug in _get_latest_prompt_version fixed (would have locked v1.10.0 → v1.9.0 lexically)
- Smoke test on gpt4_2ba83207 (grocery superlative): Thrive Market ranked #14 baseline → #4 with llm_rerank=True
- Smoke test on 0a34ad58 (Tokyo Suica/TripIt): TripIt missing baseline → #3 with llm_rerank=True
- LongMemEval tune-100 r93 vs r91: 76/100 vs 74/100 (+2 acc); macro 81.6% vs 80.5% (+1.1pt); M-S +14pt (the target gain), SS-P +10pt; K-U regression observed but traced to extraction-time non-determinism (knowledge updates not captured during re-ingest), not the rerank changes

Bundled prompt-bank state catch-up:
- answer_synthesis v1.3.0/v1.4.0 (rules 13/14 from earlier rounds)
- extraction_user_profile v1.1.0/v1.1.1/v1.1.2 (relative-time resolution, started/finished pair preservation)
- compress_session_for_query v1.0.0–v1.3.0 (the in-tool denoiser introduced earlier; currently hard-disabled at the code level)
- Older prompt versions flipped to active: false

Misc:
- LiteLLMClient seeds default to "42" for benchmark reproducibility
- /api/search response now exposes rehydrated_text (set by the search agent when it called read_session_text)
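The semver-sort bug mentioned under "Tested" comes from comparing version strings character by character. A small sketch of the pitfall follows; the helper names parse_version and latest_version are hypothetical and are not taken from _get_latest_prompt_version.

```python
def parse_version(tag: str) -> tuple:
    """Turn 'v1.10.0' into (1, 10, 0) so components compare numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def latest_version(tags):
    """Pick the highest version by numeric components, not by string order."""
    return max(tags, key=parse_version)


versions = ["v1.2.0", "v1.9.0", "v1.10.0"]

# String comparison sees "9" > "1" at the third character and picks v1.9.0.
lexical_latest = max(versions)

# Numeric comparison sees 10 > 9 and picks v1.10.0.
numeric_latest = latest_version(versions)
```

With a lexical sort, a newly added v1.10.0 prompt would never be selected as the latest, which matches the "locked v1.10.0 → v1.9.0" symptom described in the commit message.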
Summary
- Add skip_should_run_check: bool = False to the Config model
- Add a branch in _should_run_before_extraction() that bypasses the LLM eligibility check when the flag is enabled
- Add tests in test_base_generation_service.py
Context
Users have no way to bypass the LLM pre-extraction eligibility check. This flag lets orgs skip it to save cost/latency when they want every batch extracted.
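A minimal sketch of how such a flag could short-circuit the check, for illustration only. The dataclass Config, the GenerationService class, and the llm_should_run callable are simplified stand-ins assumed for this example; only the names skip_should_run_check and _should_run_before_extraction come from this PR.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Config:
    # Opt-in flag; the default keeps the existing LLM pre-check behaviour.
    skip_should_run_check: bool = False


class GenerationService:
    """Hypothetical stand-in for the real base generation service."""

    def __init__(self, config: Config, llm_should_run: Callable[[str], bool]):
        self.config = config
        self.llm_should_run = llm_should_run  # stand-in for the LLM eligibility call

    def _should_run_before_extraction(self, batch: str) -> bool:
        # Flag enabled: always extract, and never issue the LLM call.
        if self.config.skip_should_run_check:
            return True
        # Flag disabled: defer to the LLM eligibility check as before.
        return self.llm_should_run(batch)
```

Placing the flag check before the LLM call is what saves the cost and latency: when the flag is on, the eligibility call is never made.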
Test plan
- TestShouldRunBeforeExtraction::test_skip_should_run_check_bypasses_llm_call: verifies that the flag returns True without an LLM call
- TestShouldRunBeforeExtraction::test_default_skip_should_run_check_does_not_bypass: verifies that the default is False