feat: add retrieval latency benchmark and embedding input truncation by yyiilluu · Pull Request #3 · ReflexioAI/reflexio

yyiilluu · 2026-04-13T02:24:38Z

Summary

Add a reusable retrieval latency benchmark suite at reflexio/benchmarks/retrieval_latency/ that measures profile, user/agent playbook, and unified cross-entity search across storage backends, corpus sizes, and layers (service vs http via FastAPI TestClient).
Fix a production-affecting bug in LiteLLMClient where embedding inputs longer than the provider's token cap would surface as provider 400s. Inputs are now token-truncated per model with a registry-aware limit and a once-per-model log warning.
Add a small skip_in_precommit smoke test + committed baseline so /test-all catches retrieval regressions past 3× p95.
Drive-by: correct a stale config show docstring reference in ReflexioClient.get_my_config (the command is now config storage).

Changes

Retrieval latency benchmark (reflexio/benchmarks/retrieval_latency/)

bench.py — CLI driver. Sweeps (retrieval_type × backend × layer × N) cells with configurable trials/warmup; supports --baseline diffing and flags cells where p95 regressed ≥20%.
backends.py — runs each trial against either the in-process service layer or the FastAPI TestClient (ASGI transport, no TCP) to isolate framework overhead from storage work.
scenarios.py — 20 fixed queries and corpus generators for each retrieval type.
seed.py — populates storages with deterministic hash-derived fake document vectors (query-time latency depends on row count, not vector content, so this is a sound cost optimization).
embed_cache.py — disk-cached real query embeddings at ~/.cache/reflexio-benchmarks/embeddings-<model>.json, so only the first run hits the embedding API.
report.py — renders per-retrieval-type markdown tables grouped by (backend, layer) with p50 / p95 (mean) cell format.
README.md — full usage, layer semantics, interpretation sanity checks, and baseline regeneration instructions.
One reference report committed under results/ as a baseline; results/ is gitignored so future runs don't pollute the working tree.
pyproject.toml — relaxes S311 (non-crypto random) for reflexio/benchmarks/** so seed generators can use random without noqa spam.
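
The hash-derived fake vectors that seed.py is described as using can be illustrated with a short sketch. This is not the code in the PR: `fake_vector`, its `dim` default, and the SHA-256 scheme are assumptions chosen only to show how a document id can map to a stable vector without calling an embedding API.

```python
import hashlib
import math
import struct


def fake_vector(doc_id: str, dim: int = 8) -> list[float]:
    """Derive a deterministic unit-length vector from a document id (hypothetical helper)."""
    values: list[float] = []
    counter = 0
    while len(values) < dim:
        # Re-hash with a counter so any dimension can be filled from 32-byte digests.
        digest = hashlib.sha256(f"{doc_id}:{counter}".encode("utf-8")).digest()
        # Each 4-byte chunk becomes an unsigned int mapped into [-1, 1].
        for (chunk,) in struct.iter_unpack(">I", digest):
            values.append(chunk / 0xFFFFFFFF * 2.0 - 1.0)
        counter += 1
    values = values[:dim]
    # Normalise so cosine-style distance functions behave as they would on real embeddings.
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]
```

Because the vector depends only on the id, re-seeding the same corpus yields identical rows, which keeps runs comparable across machines.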

Embedding input truncation (reflexio/server/llm/litellm_client.py)

_get_embedding_limit(model) — LRU-cached. Consults litellm.get_model_info first (so Cohere 512, Voyage 32k, etc. are respected); falls back to the documented OpenAI 8191 cap only when the model name looks OpenAI-family (text-embedding-, openai/, azure/); returns None for unknown non-OpenAI providers so we don't over-truncate their input.
_get_embedding_encoding(model) — LRU-cached tiktoken encoder; non-OpenAI providers get cl100k_base as an approximate proxy (tends to over-truncate by a small fraction rather than under-truncate and cause 400s).
_truncate_for_embedding(text, model, max_tokens=None) — called from get_embedding and get_embeddings (per-element for batches). Emits a WARNING the first time a given model's input is truncated, then DEBUG on every subsequent hit — keeps large backfills from flooding logs.

Tests

tests/server/llm/test_litellm_client_unit.py — new test cases covering: known OpenAI family, unknown OpenAI-prefix fallback, Cohere small-cap, non-embedding mode rejection, unknown-provider pass-through, short/empty/long text paths, prefix preservation (mirrors how sqlite_storage._get_embedding builds search_document: …), warn-once behavior, and end-to-end assertions that the string reaching litellm.embedding is always under the cap.
tests/benchmarks/test_retrieval_latency_smoke.py — skip_in_precommit smoke test at N=50, trials=10, sqlite + service asserting p95 within 3× of the committed baseline.json.
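
The smoke test's "p95 within 3× of baseline" check can be sketched as follows. The helper names and sample numbers are hypothetical; only the 3× factor comes from the description above, and the percentile method is an assumption rather than what bench.py is known to use.

```python
import statistics


def p95(samples_ms: list[float]) -> float:
    """95th percentile of latency samples (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def exceeds_budget(current_p95: float, baseline_p95: float, factor: float = 3.0) -> bool:
    """True when the current p95 is more than `factor` times the baseline p95."""
    return current_p95 > baseline_p95 * factor


# Hypothetical trial latencies in milliseconds.
trials = [4.1, 3.9, 4.4, 4.0, 12.5, 4.2, 4.3, 3.8, 4.0, 4.1]
baseline_p95_ms = 5.0
if exceeds_budget(p95(trials), baseline_p95_ms):
    print("p95 regression beyond 3x baseline")
```

A wide factor such as 3× tolerates noisy CI machines while still catching order-of-magnitude regressions.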

Test Plan

ruff check / ruff format — clean
Unit tests for truncation helpers exercise registry hit, miss + OpenAI-family fallback, miss + unknown provider, non-embedding mode, and Cohere-style small caps
Run the benchmark end-to-end against sqlite: uv run python -m reflexio.benchmarks.retrieval_latency.bench --sizes 100 --trials 10 --warmup 2 --backend sqlite --layer service --retrieval profile
Run the smoke test: uv run pytest tests/benchmarks/test_retrieval_latency_smoke.py -v
Confirm warn-once log behavior during a real backfill
/test-all on the full suite

Introduce a reusable retrieval-latency benchmark suite and fix a long-standing issue where oversized embedding inputs could 400 against the provider. Also ship a small smoke test + baseline so CI catches regressions, and correct a stale docstring reference.


Adds an opt-in LLM relevance-judge rerank stage to search_user_profiles (and the playbook variants), parallel to the existing cross-encoder rerank. The new stage bridges synonym/brand→category gaps that pure lexical/semantic models can't bridge — e.g. "Thrive Market" = grocery service, "Suica card" = Tokyo transit, "TripIt app" = travel-organizer. Cross-encoder upgrades (bge-reranker-v2-m3) were tested and rejected: they don't have the retail-brand world knowledge needed.

Architecture:
- New helper score_pairs_llm() in reflexio/server/llm/rerank/llm_reranker.py
- New prompt rerank_relevance/v1.0.0 (relevance-judge with explicit brand→category and tool→use-case guidance, scoring rubric, and a rule that user-owned tools/cards/apps score 7-9 on help/tips questions)
- New tool arg llm_rerank: bool = False on SearchUserProfilesArgs and the playbook variants
- _maybe_rerank_hits dispatches LLM rerank → cross-encoder → hybrid order in fallback chain; any failure path returns None and the caller falls back gracefully
- Bundle wiring: search-tool handlers now receive llm_client + prompt_manager via _bundle_handler_with_llm

Search prompt v1.10.0 documents llm_rerank in the tool palette and adds targeted exceptions to Patterns A, C, D, F where brand/proper-noun profiles are likely the answer but don't share the question's literal keywords. Pattern B explicitly OPTS OUT (recency dominates; rerank scrambles date order). All exceptions are tightly scoped to the question shape.
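
The fallback chain listed under Architecture (LLM rerank, then cross-encoder, then the incoming hybrid order) can be sketched generically. `rerank_with_fallback` and the scorer signature are assumptions for illustration; the real `_maybe_rerank_hits` may be structured differently.

```python
from typing import Callable, Optional, Sequence

# A scorer returns one score per hit, or None to signal "could not score".
Scorer = Callable[[str, Sequence[str]], Optional[list[float]]]


def rerank_with_fallback(query: str, hits: Sequence[str], scorers: Sequence[Scorer]) -> list[str]:
    """Try each scorer in order; keep the incoming order if every scorer fails."""
    for scorer in scorers:
        try:
            scores = scorer(query, hits)
        except Exception:
            scores = None  # any failure means "fall through to the next stage"
        if scores is None or len(scores) != len(hits):
            continue
        order = sorted(range(len(hits)), key=lambda i: scores[i], reverse=True)
        return [hits[i] for i in order]
    return list(hits)  # final fallback: the order the hits arrived in


def broken_llm_scorer(query: str, hits: Sequence[str]) -> Optional[list[float]]:
    raise RuntimeError("simulated provider error")


def toy_cross_encoder(query: str, hits: Sequence[str]) -> Optional[list[float]]:
    # Toy relevance: count words shared between the query and each hit.
    q_words = set(query.lower().split())
    return [float(len(q_words & set(h.lower().split()))) for h in hits]


ranked = rerank_with_fallback(
    "grocery delivery",
    ["likes hiking", "uses a grocery delivery service"],
    [broken_llm_scorer, toy_cross_encoder],
)
```

Treating every failure as "no scores" keeps the search path total: the caller always gets a list back, at worst in the original order.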
Tested:
- 16 unit tests for score_pairs_llm fallback chain
- 10 unit tests for _maybe_rerank_hits dispatch + fallback semantics
- Trip-wire test updated; semver-sort bug in _get_latest_prompt_version fixed (would have locked v1.10.0 → v1.9.0 lexically)
- Smoke test on gpt4_2ba83207 (grocery superlative): Thrive Market ranked #14 baseline → #4 with llm_rerank=True
- Smoke test on 0a34ad58 (Tokyo Suica/TripIt): TripIt missing baseline → #3 with llm_rerank=True
- LongMemEval tune-100 r93 vs r91: 76/100 vs 74/100 (+2 acc); macro 81.6% vs 80.5% (+1.1pt); M-S +14pt (the target gain), SS-P +10pt; K-U regression observed but traced to extraction-time non-determinism (knowledge updates not captured during re-ingest), not the rerank changes

Bundled prompt-bank state catch-up:
- answer_synthesis v1.3.0/v1.4.0 (rules 13/14 from earlier rounds)
- extraction_user_profile v1.1.0/v1.1.1/v1.1.2 (relative-time resolution, started/finished pair preservation)
- compress_session_for_query v1.0.0–v1.3.0 (the in-tool denoiser introduced earlier; currently hard-disabled at the code level)
- Older prompt versions flipped to active: false

Misc:
- LiteLLMClient seeds default to "42" for benchmark reproducibility
- /api/search response now exposes rehydrated_text (set by the search agent when it called read_session_text)
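
The semver-sort bug mentioned above is easy to reproduce in isolation. The sketch below is illustrative only: `version_key` is a hypothetical helper, not the fix applied to `_get_latest_prompt_version`.

```python
def version_key(version: str) -> tuple[int, ...]:
    """Parse 'v1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


versions = ["v1.9.0", "v1.10.0", "v1.2.1"]

# String comparison looks at characters, so '9' beats '1' in the minor position.
lexical_latest = max(versions)
# Numeric comparison treats 10 as greater than 9.
semver_latest = max(versions, key=version_key)
```

Here `lexical_latest` is `"v1.9.0"` while `semver_latest` is `"v1.10.0"`, which is the mismatch the commit message describes.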

…nit/QMD hardening (#51)

## Summary

Five threads, surfaced across two `/test-e2e-cli --full` runs on a Linux VPS with only `MINIMAX_API_KEY` configured. The first three were the original PR scope; threads 4–5 were surfaced by the e2e run validating the first three:

1. **Cache invalidation** (R3 fix) — file-mtime auto-invalidation for self-hosted file-backed config + explicit admin endpoint/CLI/client method. Direct edits to `~/.reflexio/configs/config_<org-id>.json` are now picked up within 1 request instead of waiting up to an hour for the TTL.
2. **PATCH config endpoints** (R2 fix) — `POST /api/update_config` accepts a partial dict and shallow-merges into existing config. Removes the "must send the entire `Config` schema to flip one field" pain.
3. **MiniMax-only e2e gaps** (R5 fix + adjacent) — auto-fall-back to the bundled local embedder when no cloud embedder is configured, MiniMax in `extraction_agent`/`search_agent` defaults, `supports_tool_calling` override allowlist, and 5 regression tests locking in the auto-detect contract.
4. **`reflexio setup init` writes `.env` to `~/.reflexio/.env`** (e2e Issue #2) — was writing to CWD, polluting whatever directory the user was in.
5. **DiskStorage QMD client: stale `/tmp/e2e-disk` path + 60s probe timeouts** (e2e Issue #3) — hard-coded test path leaked into production logs; probe queries had no usable timeout.

---

## 1. Cache invalidation

The per-org cache (`_reflexio_cache: TTLCache(maxsize=100, ttl=3600)`) keyed by `(org_id, storage_base_dir)` was only invalidated when writes went through `/api/set_config` or (newly) `/api/update_config`. Anything that mutated config out-of-band — hand-edited config file, sibling-replica write, direct DB UPDATE — was invisible until the entry's hour-long TTL expired.

### Phase 1 — file-mtime auto-invalidation

`Reflexio.current_config_version()` returns a tuple — `('file', mtime)` for `LocalFileConfigStorage`, `None` when probing isn't supported.
The cache stamps the version at load time; on every cache hit it re-probes (cheap `os.stat`, ~10 µs) and evicts on mismatch. The check is implemented outside the cache lock to keep contention low.

```python
# reflexio/server/services/configurator/local_file_config_storage.py
def get_version(self) -> tuple[str, float] | None:
    try:
        return ("file", os.stat(self._config_path).st_mtime)
    except OSError:
        return None
```

### Phase 2 — explicit admin invalidate endpoint + CLI + client method

For backends without cheap version probing, or any "I just SQL'd the DB and want my replica to re-read":

```bash
reflexio admin cache invalidate                # caller's own org
reflexio admin cache invalidate --org-id ID    # explicit (must match auth-resolved org)
```

```python
client.invalidate_cache()               # caller's own org
client.invalidate_cache(org_id="...")   # explicit
```

```http
POST /api/admin/cache/invalidate
{"org_id": "..."}   # optional, verification-only
```

Auth uses the same `Depends(default_get_org_id)` as `/api/set_config`. Cross-org invalidation is **intentionally NOT exposed** — request body `org_id` must match the dep-resolved value.

### Companion: cloud DB-level versioning

Phase 3 (DB-level `config_version` column for multi-replica cloud invalidation) ships in the parent enterprise PR ReflexioAI/reflexio-enterprise#47 — it depends on the abstract `ConfigStorage.get_version()` shipped here, but the concrete `SupabaseConfigStorage.get_version()` and the migration live in the enterprise repo.

---

## 2. PATCH config endpoints

`POST /api/set_config` requires the full `Config` schema. Sending `{"extraction_backend": "classic"}` alone errors with `1 validation error: storage_config Field required`.
PATCH semantics close the gap with shallow top-level merge:

```python
@core_router.post("/api/update_config")
def update_config(partial: dict[str, Any], org_id: str = Depends(default_get_org_id)) -> SetConfigResponse:
    existing = reflexio.request_context.configurator.get_config().model_dump(mode="python")
    response = reflexio.set_config(Config(**{**existing, **partial}))
    if response.success:
        invalidate_reflexio_cache(org_id=org_id)
    return response
```

```python
client.update_config({"extraction_backend": "classic"})   # no client-side validation; server is authority
```

```bash
reflexio config update --field extraction_backend=classic
reflexio config update --field extraction_backend=classic --field search_backend=agentic
reflexio config update --data '{"search_backend": "agentic"}'
reflexio config update --file partial.json
```

Live-verified: `extraction_backend` and `search_backend` can now be set independently and atomically (e.g., one classic + one agentic).

---

## 3. MiniMax-only e2e gaps

Three independent fixes that together let a user with **only** `MINIMAX_API_KEY` configured run `/test-e2e-cli` end-to-end and produce non-zero profiles + playbooks.

### 3a. Embedding auto-fallback (Layer A) + setup-init choice (Layer B)

When no cloud embedder is configured, `services start` today blocks on stdin with an interactive prompt:

> Your LLM provider doesn't support text embeddings. Which provider for embeddings? `[1] OpenAI [2] Gemini`

`chromadb>=1.5.8` is already a hard dependency, so the bundled all-MiniLM-L6-v2 ONNX model is always physically present — it was just gated behind `CLAUDE_SMART_USE_LOCAL_EMBEDDING=1` (the [`claude-smart`](https://github.com/ReflexioAI/claude-smart) public-contract env var).

Three explicit paths in `_auto_detect_model` for the embedding role:

| Path | Trigger | Picks |
|---|---|---|
| **1. claude-smart explicit** (existing, unchanged) | `CLAUDE_SMART_USE_LOCAL_EMBEDDING=1` AND chromadb importable | `local/minilm-l6-v2` (priority 2) |
| **2. Cloud embedder** (existing, unchanged) | `OPENAI_API_KEY` or `GEMINI_API_KEY` set | `text-embedding-3-small` / `text-embedding-004` |
| **3. Local fallback** (NEW) | No cloud embedder configured AND chromadb importable | `local/minilm-l6-v2` |

Path 1 wins over 2; 2 wins over 3. claude-smart contract preserved (`is_local_embedder_available()`, `register_if_enabled()` kept as alias).

LiteLLM dispatch in `litellm_client.py` is also relaxed: `local/` model names are routed to `LocalEmbedder.get()` whenever chromadb is importable, no env-var requirement.

`setup init` gets a new embedding-provider step (default = local when chromadb available) and a `--embedding {local|openai|gemini|auto}` flag for non-interactive use. The choice is written to `~/.reflexio/configs/config_<org-id>.json` under `llm_config.embedding_model_name`; the existing 3-tier `resolve_model_name` (`config_override > site_var > auto_detect`) ensures it wins over Layer A's auto-detection.

`services.py` start gate: not-fully-configured + chromadb importable → log `INFO` "Using local embedder as fallback" and proceed (instead of prompting or erroring).

### 3b. MiniMax in `extraction_agent` + `search_agent` defaults

Added MiniMax M2/M2.7 to `_PROVIDER_DEFAULTS[minimax].extraction_agent` and `.search_agent`. Without this, the agentic-v2 path silently failed with `Model … lacks tool-calling and no fallback_schema provided` and produced 0 profiles.

### 3c. `supports_tool_calling` override allowlist

`litellm.supports_function_calling("minimax/MiniMax-M2.7")` returns `False` because litellm's registry only catalogues `minimax/MiniMax-M2`. M2.7 actually [supports function calling](https://platform.minimax.io/docs/guides/text-m2-function-call). Added an allowlist override in `tools.py` for known-good models the registry hasn't caught up with yet.

### 3d. `TestMinimaxOnlyEnvRegression` — auto-detect contract lock-in

Added 5 regression tests in `tests/server/llm/test_model_defaults.py` that exercise the MiniMax-only env scenario end-to-end through `_auto_detect_model` and `resolve_model_name`. Specifically guards against the failure mode that surfaced as task #77 (extractor resolved to `gpt-5-mini` despite providers list being `['minimax']` — turned out to be benchmark contamination of the org config DB row, but the auto-detect contract was the canary).

---

## 4. `setup init` writes `.env` to the canonical location (e2e Issue #2)

`reflexio setup init` was writing `.env` to `Path.cwd() / ".env"`. When the user runs setup from any directory other than `~/.reflexio/` (e.g. inside a worktree, or via tmux with cwd=worktree-root), the resulting `.env` lands somewhere unexpected. Subsequent `reflexio` invocations partially mask the issue via python-dotenv's parent-walk-up, but the worktree gets polluted.

Now: setup commands always write to `Path.home() / ".reflexio" / ".env"` regardless of CWD. Plus a 2-test regression in `tests/cli/test_setup_cmd.py` that monkeypatches `Path.cwd` to a non-home dir and asserts the resulting file ends up at `~/.reflexio/.env`.

Also touches `cli/env_loader.py` — small change to make the canonical path centrally defined so `setup_cmd.py` and `env_loader.py` agree.

---

## 5. DiskStorage QMD client: stale path + probe timeout (e2e Issue #3)

Two bugs in `reflexio/server/services/storage/disk_storage/_qmd_client.py`:

1. **Hard-coded `/tmp/e2e-disk` collection path**. On any non-test deployment, the QMD client logged `Collection: /tmp/e2e-disk/disk_<org-id> (**/*.md)` even though the actual storage `dir_path` was `~/.reflexio/data/disk_<org-id>/`. Test scaffolding leaked into production code paths.
2. **No usable timeout on probe queries**. Slow QMD subprocesses caused 60s waits per probe, bloating logs and slowing the publish path.

Fix: use the storage backend's `dir_path` directly (no env-var fallback to a test path), add a configurable probe timeout (default 5s, callsite-overridable). 8 new tests in `tests/server/services/storage/test_disk_storage_file_io.py` covering: stale-collection re-registration, configured-timeout-honored, default-timeout-applied, dir_path-mismatch-detected.

---

## Behavioral diff

### Embedding paths

| User profile | Today | After |
|---|---|---|
| `CLAUDE_SMART_USE_LOCAL_EMBEDDING=1` + chromadb | local | unchanged |
| `OPENAI_API_KEY` set | OpenAI | unchanged |
| `GEMINI_API_KEY` set | Gemini | unchanged |
| MiniMax / Anthropic / no key | **interactive prompt blocks `services start`** | **silent local fallback** |
| `setup init`, picked local | (not an option) | local (org config override) |
| `setup init --embedding local` (CI) | (no flag) | local; no prompt |
| No chromadb importable | `RuntimeError` | `RuntimeError` with clearer install hint |

### Config update

| Action | Today | After |
|---|---|---|
| Update one field via API | Send full `Config` schema; otherwise validation error | `POST /api/update_config {"extraction_backend": "classic"}` works |
| Update one field via CLI | Hand-edit JSON, then `set` | `reflexio config update --field extraction_backend=classic` |
| Update one field via Python client | Construct full `Config` object | `client.update_config({...})` |

### Cache freshness (file-backed self-host)

| Trigger | Today | After |
|---|---|---|
| `POST /api/set_config` / `/api/update_config` | invalidates own cache (existing) | unchanged |
| Hand-edit config file | stale until TTL expires (~1 h) | **picked up within 1 request** (mtime probe) |
| `reflexio admin cache invalidate` | (didn't exist) | explicit eviction |

---

## Test plan

- [x] Full submodule suite green
- [x] **+40 new tests** across all five threads:
  - **Cache** (Phase 1+2): `test_cache_evicts_when_config_file_modified`, `test_cache_keeps_when_config_file_unchanged`, `test_cache_handles_missing_config_path`, `ConfigStorage.get_version()` contract test
  - **Admin invalidate**: `test_admin_invalidate_endpoint_clears_cache`, `test_admin_invalidate_returns_false_when_not_cached`, `test_cli_admin_cache_invalidate`, `test_client_invalidate_cache_method`
  - **PATCH config**: `test_update_config_shallow_merges`, `test_update_config_invalidates_cache`, all three CLI input modes (`--data`, `--file`, `--field`), client method behavior
  - **Embedding fallback** (Layer A + B): `test_is_chromadb_importable_when_present/absent`, `test_register_if_chromadb_available_no_env_var`, `test_embedding_fallback_to_local_when_no_cloud`, `test_embedding_fallback_skipped_when_cloud_available`, `test_embedding_explicit_opt_in_beats_cloud`, `test_embedding_no_chromadb_raises`, `test_prompt_embedding_provider_non_tty_picks_local`, `test_setup_init_includes_embedding_step`, `test_setup_init_local_is_default`, `test_setup_init_local_choice_writes_org_config`, `test_setup_init_embedding_flag_local`, `test_setup_init_embedding_flag_auto_no_override`, `test_services_start_proceeds_without_cloud_embedder_when_chromadb_present`
  - **Tool-calling override**: allowlist application + non-listed models still defer to litellm
  - **MiniMax-only env auto-detect contract** (`TestMinimaxOnlyEnvRegression`, 5 tests): generation/evaluation/should_run/pre_retrieval all resolve to `minimax/MiniMax-M2.7`; empty `OPENAI_API_KEY=` placeholder doesn't promote OpenAI; embedding falls through to `local/minilm-l6-v2`
  - **`setup init` `.env` location** (2 tests): writes to `~/.reflexio/.env` regardless of CWD
  - **DiskStorage QMD client** (8 tests): stale-collection re-registration, configurable probe timeout, default 5s timeout, dir_path mismatch detection
- [x] `ruff check` + `ruff format --check` + `pyright` clean on all touched files
- [x] **Live-verified end-to-end** on Linux VPS with `MINIMAX_API_KEY` only via `/test-e2e-cli --full`: all 4 storage modes (SQLite, Managed, Disk,
Self-hosted Supabase) hit Hard PASS criterion (non-zero profile extraction for priya/marcus/elena/raj). Schema Gate green for all modes (Mode D required `sg docker -c` workaround until Issue #1 in companion PR ReflexioAI/reflexio-enterprise#47 fixed it at the source).

---

## Backwards compatibility

- **claude-smart users (env var set):** zero change. `is_local_embedder_available()`, priority order, dispatch all preserved. `register_if_enabled()` is kept as an alias to `register_if_chromadb_available()`.
- **Cloud-embedder users:** zero change unless they re-run `setup init`. Layer A's auto-detection still picks cloud when keys are configured.
- **`POST /api/set_config`:** semantics unchanged (full-schema replace). `/api/update_config` is purely additive.
- **Existing cache API** — `get_reflexio`, `invalidate_reflexio_cache`, `clear_reflexio_cache`, `get_cache_stats` — signatures unchanged. Phase 1 layered the version probe inside without breaking callers.
- **Public docs at `public_docs/content/docs/claude-smart/configuration.mdx`** — still accurate.

---

## Companion follow-up

After this lands and a release is cut, `reflexio-enterprise` bumps its submodule pointer + ships Phase 3 (DB-level `config_version` column → multi-replica cache invalidation in cloud) in ReflexioAI/reflexio-enterprise#47.

## Out of scope

- **Reducing total install size of `reflexio-ai`** by moving heavy ML deps to extras. Bigger refactor; revisit after a couple of weeks of usage data.
- **Removing `CLAUDE_SMART_USE_LOCAL_EMBEDDING`.** Public contract — any deprecation cycle would be coordinated with the claude-smart repo, not a side effect of this fix.
- **Quality comparison: local MiniLM (384-dim, zero-padded to 512) vs OpenAI text-embedding-3-small (1536).** Local is measurably worse; acceptable for the fallback case, irrelevant for users with cloud keys.

---

## Notable implementation deviations from the original plan

1. The legacy `_prompt_embedding_provider` is no longer called from `setup init`. Plan said to add a step "after" the LLM step, but calling both back-to-back would have asked the user to pick the embedder twice. Replaced the legacy call with the new step in `init`. Legacy helper preserved for the `services start` first-run wizard.
2. The new `_choose_embedding_provider` step folds in the API-key prompt for openai/gemini when the env var isn't already set. Cleaner UX.
3. Drive-by: `tests/cli/test_service_builders.py::test_invalid_raises_bad_parameter` was hardcoded to `"postgres"` (which became valid in PR #47); switched to `"not-a-real-backend"`.
4. Cache invalidation: cross-storage `current_config_version()` abstraction unified file-mtime and DB-version checks behind one tuple-returning method, so the cache module doesn't have to know about storage backends.

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added `--embedding` CLI option to setup commands for explicit embedding provider configuration
  * Added `admin cache invalidate` command to manually clear server-side cache
  * Added `config update` command for partial configuration updates
  * Enhanced automatic cache management with out-of-band change detection
  * Extended MiniMax model support for tool-calling capabilities
* **Bug Fixes**
  * Fixed local embedding provider initialization without requiring environment variable
  * Improved tool-calling support detection for specific model variants
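
The version-stamped cache from section 1 of the PR #51 description (stamp at load time, re-probe on every hit, reload on mismatch) can be sketched in a few lines. This is an illustration under assumptions: `VersionedCache`, `file_version`, and the `probe`/`load` callables are hypothetical names, and the real cache additionally has a TTL, a size bound, and locking.

```python
import os
from typing import Any, Callable, Optional

Version = Optional[tuple[str, float]]


def file_version(path: str) -> Version:
    """Return ('file', mtime), or None when the file cannot be probed."""
    try:
        return ("file", os.stat(path).st_mtime)
    except OSError:
        return None


class VersionedCache:
    """Cache whose entries are reloaded when their source version changes."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[Version, Any]] = {}

    def get(self, key: str, probe: Callable[[], Version], load: Callable[[], Any]) -> Any:
        current = probe()  # cheap stat on every lookup
        cached = self._entries.get(key)
        if cached is not None and cached[0] == current:
            return cached[1]  # version unchanged: serve the cached value
        value = load()  # first load, or version mismatch: reload and re-stamp
        self._entries[key] = (current, value)
        return value
```

When `probe` returns `None` on both the stamp and the re-check, the entry is treated as unchanged, so backends without a cheap probe would still rely on TTL expiry or explicit invalidation.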

yyiilluu merged commit 7bfd449 into main Apr 13, 2026

yyiilluu deleted the feat/retrieval-latency-bench branch April 14, 2026 07:20

yilu331 mentioned this pull request May 9, 2026

feat: cache invalidation + PATCH config + MiniMax e2e fixes + setup-init/QMD hardening #51

Merged
