feat: add retrieval latency benchmark and embedding input truncation#3
Merged
Conversation
Introduce a reusable retrieval-latency benchmark suite and fix a long- standing issue where oversized embedding inputs could 400 against the provider. Also ship a small smoke test + baseline so CI catches regressions, and correct a stale docstring reference.
yilu331
added a commit
that referenced
this pull request
May 1, 2026
Adds an opt-in LLM relevance-judge rerank stage to search_user_profiles (and the playbook variants), parallel to the existing cross-encoder rerank. The new stage bridges synonym/brand→category gaps that pure lexical/semantic models can't bridge — e.g. "Thrive Market" = grocery service, "Suica card" = Tokyo transit, "TripIt app" = travel-organizer. Cross-encoder upgrades (bge-reranker-v2-m3) were tested and rejected: they don't have the retail-brand world knowledge needed. Architecture: - New helper score_pairs_llm() in reflexio/server/llm/rerank/llm_reranker.py - New prompt rerank_relevance/v1.0.0 (relevance-judge with explicit brand→category and tool→use-case guidance, scoring rubric, and a rule that user-owned tools/cards/apps score 7-9 on help/tips questions) - New tool arg llm_rerank: bool = False on SearchUserProfilesArgs and the playbook variants - _maybe_rerank_hits dispatches LLM rerank → cross-encoder → hybrid order in fallback chain; any failure path returns None and the caller falls back gracefully - Bundle wiring: search-tool handlers now receive llm_client + prompt_manager via _bundle_handler_with_llm Search prompt v1.10.0 documents llm_rerank in the tool palette and adds targeted exceptions to Patterns A, C, D, F where brand/proper-noun profiles are likely the answer but don't share the question's literal keywords. Pattern B explicitly OPTS OUT (recency dominates; rerank scrambles date order). All exceptions are tightly scoped to the question shape. Tested: - 16 unit tests for score_pairs_llm fallback chain - 10 unit tests for _maybe_rerank_hits dispatch + fallback semantics - Trip-wire test updated; semver-sort bug in _get_latest_prompt_version fixed (would have locked v1.10.0 → v1.9.0 lexically) - Smoke test on gpt4_2ba83207 (grocery superlative): Thrive Market ranked #14 baseline → #4 with llm_rerank=True - Smoke test on 0a34ad58 (Tokyo Suica/TripIt): TripIt missing baseline → #3 with llm_rerank=True - LongMemEval tune-100 r93 vs r91: 76/100 vs 74/100 (+2 acc); macro 81.6% vs 80.5% (+1.1pt); M-S +14pt (the target gain), SS-P +10pt; K-U regression observed but traced to extraction-time non-determinism (knowledge updates not captured during re-ingest), not the rerank changes Bundled prompt-bank state catch-up: - answer_synthesis v1.3.0/v1.4.0 (rules 13/14 from earlier rounds) - extraction_user_profile v1.1.0/v1.1.1/v1.1.2 (relative-time resolution, started/finished pair preservation) - compress_session_for_query v1.0.0–v1.3.0 (the in-tool denoiser introduced earlier; currently hard-disabled at the code level) - Older prompt versions flipped to active: false Misc: - LiteLLMClient seeds default to "42" for benchmark reproducibility - /api/search response now exposes rehydrated_text (set by the search agent when it called read_session_text)
4 tasks
yilu331
added a commit
that referenced
this pull request
May 9, 2026
…nit/QMD hardening (#51) ## Summary Five threads, surfaced across two `/test-e2e-cli --full` runs on a Linux VPS with only `MINIMAX_API_KEY` configured. The first three were the original PR scope; threads 4–5 were surfaced by the e2e run validating the first three: 1. **Cache invalidation** (R3 fix) — file-mtime auto-invalidation for self-hosted file-backed config + explicit admin endpoint/CLI/client method. Direct edits to `~/.reflexio/configs/config_<org-id>.json` are now picked up within 1 request instead of waiting up to an hour for the TTL. 2. **PATCH config endpoints** (R2 fix) — `POST /api/update_config` accepts a partial dict and shallow-merges into existing config. Removes the "must send the entire `Config` schema to flip one field" pain. 3. **MiniMax-only e2e gaps** (R5 fix + adjacent) — auto-fall-back to the bundled local embedder when no cloud embedder is configured, MiniMax in `extraction_agent`/`search_agent` defaults, `supports_tool_calling` override allowlist, and 5 regression tests locking in the auto-detect contract. 4. **`reflexio setup init` writes `.env` to `~/.reflexio/.env`** (e2e Issue #2) — was writing to CWD, polluting whatever directory the user was in. 5. **DiskStorage QMD client: stale `/tmp/e2e-disk` path + 60s probe timeouts** (e2e Issue #3) — hard-coded test path leaked into production logs; probe queries had no usable timeout. --- ## 1. Cache invalidation The per-org cache (`_reflexio_cache: TTLCache(maxsize=100, ttl=3600)`) keyed by `(org_id, storage_base_dir)` was only invalidated when writes went through `/api/set_config` or (newly) `/api/update_config`. Anything that mutated config out-of-band — hand-edited config file, sibling-replica write, direct DB UPDATE — was invisible until the entry's hour-long TTL expired. ### Phase 1 — file-mtime auto-invalidation `Reflexio.current_config_version()` returns a tuple — `('file', mtime)` for `LocalFileConfigStorage`, `None` when probing isn't supported. The cache stamps the version at load time; on every cache hit it re-probes (cheap `os.stat`, ~10 µs) and evicts on mismatch. The check is implemented outside the cache lock to keep contention low. ```python # reflexio/server/services/configurator/local_file_config_storage.py def get_version(self) -> tuple[str, float] | None: try: return ("file", os.stat(self._config_path).st_mtime) except OSError: return None ``` ### Phase 2 — explicit admin invalidate endpoint + CLI + client method For backends without cheap version probing, or any "I just SQL'd the DB and want my replica to re-read": ```bash reflexio admin cache invalidate # caller's own org reflexio admin cache invalidate --org-id ID # explicit (must match auth-resolved org) ``` ```python client.invalidate_cache() # caller's own org client.invalidate_cache(org_id="...") # explicit ``` ```http POST /api/admin/cache/invalidate {"org_id": "..."} # optional, verification-only ``` Auth uses the same `Depends(default_get_org_id)` as `/api/set_config`. Cross-org invalidation is **intentionally NOT exposed** — request body `org_id` must match the dep-resolved value. ### Companion: cloud DB-level versioning Phase 3 (DB-level `config_version` column for multi-replica cloud invalidation) ships in the parent enterprise PR ReflexioAI/reflexio-enterprise#47 — it depends on the abstract `ConfigStorage.get_version()` shipped here, but the concrete `SupabaseConfigStorage.get_version()` and the migration live in the enterprise repo. --- ## 2. PATCH config endpoints `POST /api/set_config` requires the full `Config` schema. Sending `{"extraction_backend": "classic"}` alone errors with `1 validation error: storage_config Field required`. PATCH semantics close the gap with shallow top-level merge: ```python @core_router.post("/api/update_config") def update_config(partial: dict[str, Any], org_id: str = Depends(default_get_org_id)) -> SetConfigResponse: existing = reflexio.request_context.configurator.get_config().model_dump(mode="python") response = reflexio.set_config(Config(**{**existing, **partial})) if response.success: invalidate_reflexio_cache(org_id=org_id) return response ``` ```python client.update_config({"extraction_backend": "classic"}) # no client-side validation; server is authority ``` ```bash reflexio config update --field extraction_backend=classic reflexio config update --field extraction_backend=classic --field search_backend=agentic reflexio config update --data '{"search_backend": "agentic"}' reflexio config update --file partial.json ``` Live-verified: `extraction_backend` and `search_backend` can now be set independently and atomically (e.g., one classic + one agentic). --- ## 3. MiniMax-only e2e gaps Three independent fixes that together let a user with **only** `MINIMAX_API_KEY` configured run `/test-e2e-cli` end-to-end and produce non-zero profiles + playbooks. ### 3a. Embedding auto-fallback (Layer A) + setup-init choice (Layer B) When no cloud embedder is configured, today blocks on stdin: > Your LLM provider doesn't support text embeddings. Which provider for embeddings? `[1] OpenAI [2] Gemini` `chromadb>=1.5.8` is already a hard dependency, so the bundled all-MiniLM-L6-v2 ONNX model is always physically present — it was just gated behind `CLAUDE_SMART_USE_LOCAL_EMBEDDING=1` (the [`claude-smart`](https://github.com/ReflexioAI/claude-smart) public-contract env var). Three explicit paths in `_auto_detect_model` for the embedding role: | Path | Trigger | Picks | |---|---|---| | **1. claude-smart explicit** (existing, unchanged) | `CLAUDE_SMART_USE_LOCAL_EMBEDDING=1` AND chromadb importable | `local/minilm-l6-v2` (priority 2) | | **2. Cloud embedder** (existing, unchanged) | `OPENAI_API_KEY` or `GEMINI_API_KEY` set | `text-embedding-3-small` / `text-embedding-004` | | **3. Local fallback** (NEW) | No cloud embedder configured AND chromadb importable | `local/minilm-l6-v2` | Path 1 wins over 2; 2 wins over 3. claude-smart contract preserved (`is_local_embedder_available()`, `register_if_enabled()` kept as alias). LiteLLM dispatch in `litellm_client.py` is also relaxed: `local/` model names are routed to `LocalEmbedder.get()` whenever chromadb is importable, no env-var requirement. `setup init` gets a new embedding-provider step (default = local when chromadb available) and a `--embedding {local|openai|gemini|auto}` flag for non-interactive use. The choice is written to `~/.reflexio/configs/config_<org-id>.json` under `llm_config.embedding_model_name`; the existing 3-tier `resolve_model_name` (`config_override > site_var > auto_detect`) ensures it wins over Layer A's auto-detection. `services.py` start gate: not-fully-configured + chromadb importable → log `INFO` "Using local embedder as fallback" and proceed (instead of prompting or erroring). ### 3b. MiniMax in `extraction_agent` + `search_agent` defaults Added MiniMax M2/M2.7 to `_PROVIDER_DEFAULTS[minimax].extraction_agent` and `.search_agent`. Without this, the agentic-v2 path silently failed with `Model … lacks tool-calling and no fallback_schema provided` and produced 0 profiles. ### 3c. `supports_tool_calling` override allowlist `litellm.supports_function_calling("minimax/MiniMax-M2.7")` returns `False` because litellm's registry only catalogues `minimax/MiniMax-M2`. M2.7 actually [supports function calling](https://platform.minimax.io/docs/guides/text-m2-function-call). Added an allowlist override in `tools.py` for known-good models the registry hasn't caught up with yet. ### 3d. `TestMinimaxOnlyEnvRegression` — auto-detect contract lock-in Added 5 regression tests in `tests/server/llm/test_model_defaults.py` that exercise the MiniMax-only env scenario end-to-end through `_auto_detect_model` and `resolve_model_name`. Specifically guards against the failure mode that surfaced as task #77 (extractor resolved to `gpt-5-mini` despite providers list being `['minimax']` — turned out to be benchmark contamination of the org config DB row, but the auto-detect contract was the canary). --- ## 4. `setup init` writes `.env` to the canonical location (e2e Issue #2) `reflexio setup init` was writing `.env` to `Path.cwd() / ".env"`. When the user runs setup from any directory other than `~/.reflexio/` (e.g. inside a worktree, or via tmux with cwd=worktree-root), the resulting `.env` lands somewhere unexpected. Subsequent `reflexio` invocations partially mask the issue via python-dotenv's parent-walk-up, but the worktree gets polluted. Now: setup commands always write to `Path.home() / ".reflexio" / ".env"` regardless of CWD. Plus a 2-test regression in `tests/cli/test_setup_cmd.py` that monkeypatches `Path.cwd` to a non-home dir and asserts the resulting file ends up at `~/.reflexio/.env`. Also touches `cli/env_loader.py` — small change to make the canonical path centrally defined so `setup_cmd.py` and `env_loader.py` agree. --- ## 5. DiskStorage QMD client: stale path + probe timeout (e2e Issue #3) Two bugs in `reflexio/server/services/storage/disk_storage/_qmd_client.py`: 1. **Hard-coded `/tmp/e2e-disk` collection path**. On any non-test deployment, the QMD client logged `Collection: /tmp/e2e-disk/disk_<org-id> (**/*.md)` even though the actual storage `dir_path` was `~/.reflexio/data/disk_<org-id>/`. Test scaffolding leaked into production code paths. 2. **No usable timeout on probe queries**. Slow QMD subprocesses caused 60s waits per probe, bloating logs and slowing the publish path. Fix: use the storage backend's `dir_path` directly (no env-var fallback to a test path), add a configurable probe timeout (default 5s, callsite-overridable). 8 new tests in `tests/server/services/storage/test_disk_storage_file_io.py` covering: stale-collection re-registration, configured-timeout-honored, default-timeout-applied, dir_path-mismatch-detected. --- ## Behavioral diff ### Embedding paths | User profile | Today | After | |---|---|---| | `CLAUDE_SMART_USE_LOCAL_EMBEDDING=1` + chromadb | local | unchanged | | `OPENAI_API_KEY` set | OpenAI | unchanged | | `GEMINI_API_KEY` set | Gemini | unchanged | | MiniMax / Anthropic / no key | **interactive prompt blocks `services start`** | **silent local fallback** | | `setup init`, picked local | (not an option) | local (org config override) | | `setup init --embedding local` (CI) | (no flag) | local; no prompt | | No chromadb importable | `RuntimeError` | `RuntimeError` with clearer install hint | ### Config update | Action | Today | After | |---|---|---| | Update one field via API | Send full `Config` schema; otherwise validation error | `POST /api/update_config {"extraction_backend": "classic"}` works | | Update one field via CLI | Hand-edit JSON, then `set` | `reflexio config update --field extraction_backend=classic` | | Update one field via Python client | Construct full `Config` object | `client.update_config({...})` | ### Cache freshness (file-backed self-host) | Trigger | Today | After | |---|---|---| | `POST /api/set_config` / `/api/update_config` | invalidates own cache (existing) | unchanged | | Hand-edit config file | stale until TTL expires (~1 h) | **picked up within 1 request** (mtime probe) | | `reflexio admin cache invalidate` | (didn't exist) | explicit eviction | --- ## Test plan - [x] Full submodule suite green - [x] **+40 new tests** across all five threads: - **Cache** (Phase 1+2): `test_cache_evicts_when_config_file_modified`, `test_cache_keeps_when_config_file_unchanged`, `test_cache_handles_missing_config_path`, `ConfigStorage.get_version()` contract test - **Admin invalidate**: `test_admin_invalidate_endpoint_clears_cache`, `test_admin_invalidate_returns_false_when_not_cached`, `test_cli_admin_cache_invalidate`, `test_client_invalidate_cache_method` - **PATCH config**: `test_update_config_shallow_merges`, `test_update_config_invalidates_cache`, all three CLI input modes (`--data`, `--file`, `--field`), client method behavior - **Embedding fallback** (Layer A + B): `test_is_chromadb_importable_when_present/absent`, `test_register_if_chromadb_available_no_env_var`, `test_embedding_fallback_to_local_when_no_cloud`, `test_embedding_fallback_skipped_when_cloud_available`, `test_embedding_explicit_opt_in_beats_cloud`, `test_embedding_no_chromadb_raises`, `test_prompt_embedding_provider_non_tty_picks_local`, `test_setup_init_includes_embedding_step`, `test_setup_init_local_is_default`, `test_setup_init_local_choice_writes_org_config`, `test_setup_init_embedding_flag_local`, `test_setup_init_embedding_flag_auto_no_override`, `test_services_start_proceeds_without_cloud_embedder_when_chromadb_present` - **Tool-calling override**: allowlist application + non-listed models still defer to litellm - **MiniMax-only env auto-detect contract** (`TestMinimaxOnlyEnvRegression`, 5 tests): generation/evaluation/should_run/pre_retrieval all resolve to `minimax/MiniMax-M2.7`; empty `OPENAI_API_KEY=` placeholder doesn't promote OpenAI; embedding falls through to `local/minilm-l6-v2` - **`setup init` `.env` location** (2 tests): writes to `~/.reflexio/.env` regardless of CWD - **DiskStorage QMD client** (8 tests): stale-collection re-registration, configurable probe timeout, default 5s timeout, dir_path mismatch detection - [x] `ruff check` + `ruff format --check` + `pyright` clean on all touched files - [x] **Live-verified end-to-end** on Linux VPS with `MINIMAX_API_KEY` only via `/test-e2e-cli --full`: all 4 storage modes (SQLite, Managed, Disk, Self-hosted Supabase) hit Hard PASS criterion (non-zero profile extraction for priya/marcus/elena/raj). Schema Gate green for all modes (Mode D required `sg docker -c` workaround until Issue #1 in companion PR ReflexioAI/reflexio-enterprise#47 fixed it at the source). --- ## Backwards compatibility - **claude-smart users (env var set):** zero change. `is_local_embedder_available()`, priority order, dispatch all preserved. `register_if_enabled()` is kept as an alias to `register_if_chromadb_available()`. - **Cloud-embedder users:** zero change unless they re-run `setup init`. Layer A's auto-detection still picks cloud when keys are configured. - **`POST /api/set_config`:** semantics unchanged (full-schema replace). `/api/update_config` is purely additive. - **Existing cache API** — `get_reflexio`, `invalidate_reflexio_cache`, `clear_reflexio_cache`, `get_cache_stats` — signatures unchanged. Phase 1 layered the version probe inside without breaking callers. - **Public docs at `public_docs/content/docs/claude-smart/configuration.mdx`** — still accurate. --- ## Companion follow-up After this lands and a release is cut, `reflexio-enterprise` bumps its submodule pointer + ships Phase 3 (DB-level `config_version` column → multi-replica cache invalidation in cloud) in ReflexioAI/reflexio-enterprise#47. ## Out of scope - **Reducing total install size of `reflexio-ai`** by moving heavy ML deps to extras. Bigger refactor; revisit after a couple of weeks of usage data. - **Removing `CLAUDE_SMART_USE_LOCAL_EMBEDDING`.** Public contract — any deprecation cycle would be coordinated with the claude-smart repo, not a side effect of this fix. - **Quality comparison: local MiniLM (384-dim, zero-padded to 512) vs OpenAI text-embedding-3-small (1536).** Local is measurably worse; acceptable for the fallback case, irrelevant for users with cloud keys. --- ## Notable implementation deviations from the original plan 1. The legacy `_prompt_embedding_provider` is no longer called from `setup init`. Plan said to add a step "after" the LLM step, but calling both back-to-back would have asked the user to pick the embedder twice. Replaced the legacy call with the new step in `init`. Legacy helper preserved for the `services start` first-run wizard. 2. The new `_choose_embedding_provider` step folds in the API-key prompt for openai/gemini when the env var isn't already set. Cleaner UX. 3. Drive-by: `tests/cli/test_service_builders.py::test_invalid_raises_bad_parameter` was hardcoded to `"postgres"` (which became valid in PR #47); switched to `"not-a-real-backend"`. 4. Cache invalidation: cross-storage `current_config_version()` abstraction unified file-mtime and DB-version checks behind one tuple-returning method, so the cache module doesn't have to know about storage backends. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * Added `--embedding` CLI option to setup commands for explicit embedding provider configuration * Added `admin cache invalidate` command to manually clear server-side cache * Added `config update` command for partial configuration updates * Enhanced automatic cache management with out-of-band change detection * Extended MiniMax model support for tool-calling capabilities * **Bug Fixes** * Fixed local embedding provider initialization without requiring environment variable * Improved tool-calling support detection for specific model variants <!-- end of auto-generated comment: release notes by coderabbit.ai -->
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
reflexio/benchmarks/retrieval_latency/that measures profile, user/agent playbook, and unified cross-entity search across storage backends, corpus sizes, and layers (servicevshttpvia FastAPITestClient).LiteLLMClientwhere embedding inputs longer than the provider's token cap would surface as provider 400s. Inputs are now token-truncated per model with a registry-aware limit and a once-per-model log warning.skip_in_precommitsmoke test + committed baseline so/test-allcatches retrieval regressions past 3× p95.config showdocstring reference inReflexioClient.get_my_config(the command is nowconfig storage).Changes
Retrieval latency benchmark (
reflexio/benchmarks/retrieval_latency/)bench.py— CLI driver. Sweeps(retrieval_type × backend × layer × N)cells with configurable trials/warmup; supports--baselinediffing and flags cells where p95 regressed ≥20%.backends.py— runs each trial against either the in-process service layer or the FastAPITestClient(ASGI transport, no TCP) to isolate framework overhead from storage work.scenarios.py— 20 fixed queries and corpus generators for each retrieval type.seed.py— populates storages with deterministic hash-derived fake document vectors (query-time latency depends on row count, not vector content, so this is a sound cost optimization).embed_cache.py— disk-cached real query embeddings at~/.cache/reflexio-benchmarks/embeddings-<model>.json, so only the first run hits the embedding API.report.py— renders per-retrieval-type markdown tables grouped by(backend, layer)withp50 / p95 (mean)cell format.README.md— full usage, layer semantics, interpretation sanity checks, and baseline regeneration instructions.results/as a baseline;results/is gitignored so future runs don't pollute the working tree.pyproject.toml— relaxesS311(non-crypto random) forreflexio/benchmarks/**so seed generators can userandomwithout noqa spam.Embedding input truncation (
reflexio/server/llm/litellm_client.py)_get_embedding_limit(model)— LRU-cached. Consultslitellm.get_model_infofirst (so Cohere 512, Voyage 32k, etc. are respected); falls back to the documented OpenAI 8191 cap only when the model name looks OpenAI-family (text-embedding-,openai/,azure/); returnsNonefor unknown non-OpenAI providers so we don't over-truncate their input._get_embedding_encoding(model)— LRU-cached tiktoken encoder; non-OpenAI providers getcl100k_baseas an approximate proxy (tends to over-truncate by a small fraction rather than under-truncate and cause 400s)._truncate_for_embedding(text, model, max_tokens=None)— called fromget_embeddingandget_embeddings(per-element for batches). Emits aWARNINGthe first time a given model's input is truncated, thenDEBUGon every subsequent hit — keeps large backfills from flooding logs.Tests
tests/server/llm/test_litellm_client_unit.py— new test cases covering: known OpenAI family, unknown OpenAI-prefix fallback, Cohere small-cap, non-embedding mode rejection, unknown-provider pass-through, short/empty/long text paths, prefix preservation (mirrors howsqlite_storage._get_embeddingbuildssearch_document: …), warn-once behavior, and end-to-end assertions that the string reachinglitellm.embeddingis always under the cap.tests/benchmarks/test_retrieval_latency_smoke.py—skip_in_precommitsmoke test atN=50, trials=10, sqlite + serviceasserting p95 within 3× of the committedbaseline.json.Test Plan
ruff check/ruff format— cleanuv run python -m reflexio.benchmarks.retrieval_latency.bench --sizes 100 --trials 10 --warmup 2 --backend sqlite --layer service --retrieval profileuv run pytest tests/benchmarks/test_retrieval_latency_smoke.py -v/test-allon the full suite