
[Improvement] Agent: faster turns and lower LLM cost #1422

Merged
MODSetter merged 36 commits into MODSetter:dev from CREDO23:improvement-agent-speed
May 20, 2026

Conversation

@CREDO23 CREDO23 commented May 20, 2026

What changed

  • MCP tool discovery cached — persistent per-connector cached_tools (DB-backed) + in-process LRU; eliminates 3-12s of network discovery per turn, refreshed only on connector lifecycle events
  • Azure prompt caching enabled: prompt_cache_key routing hint added; cache hit ratios now consistently 85-99% across all LLM calls, ~70-90% input-token cost reduction
  • Subagent outputs trimmed: 10 connectors (linear, slack, clickup, jira, gmail, calendar, airtable, discord, teams, luma) no longer echo raw API payloads into evidence.items; KB/research subagent excerpts capped at 500 chars
  • Preflight LLM probe removed — replaced ~2.5s pre-turn ping with reactive 429 recovery (existing fallback path); rate-limit safety unchanged
  • KB planner moved to dedicated small/fast LLM — internal query rewriting no longer uses the main chat model
  • Sync embedding calls offloaded to thread — kb_search, kb_persistence, document_converters, indexers, revert_service no longer block the event loop
  • Perf observability — per-LLM-call latency + cache read/write counts, MCP discover/call/oauth-refresh timing, subagent compile breakdown, middleware step logs
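
The output trimming described in the list above could look roughly like the sketch below. It is illustrative only: the 500-character cap comes from the description, while `cap_excerpt`, `trim_evidence_item`, and the field names `id`, `title`, `url`, `excerpt`, and `raw` are hypothetical and not taken from the PR's code.

```python
from typing import Any

EXCERPT_CAP = 500  # character cap stated in the PR description


def cap_excerpt(text: str, limit: int = EXCERPT_CAP) -> str:
    """Truncate text to at most `limit` characters, marking the cut with '...'."""
    if len(text) <= limit:
        return text
    return text[: limit - 3].rstrip() + "..."


def trim_evidence_item(item: dict[str, Any]) -> dict[str, Any]:
    """Keep a few summary fields and drop everything else (e.g. a raw API payload).

    The field names are hypothetical placeholders.
    """
    keep = ("id", "title", "url")
    trimmed = {key: item[key] for key in keep if key in item}
    if "excerpt" in item:
        trimmed["excerpt"] = cap_excerpt(str(item["excerpt"]))
    return trimmed


item = {"id": "ISSUE-1", "title": "Bug", "excerpt": "x" * 2000, "raw": {"huge": "payload"}}
trimmed_item = trim_evidence_item(item)
```

The point of trimming at the subagent boundary is that the main agent only pays input tokens for the summary fields, not for the full upstream API response.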

Impact

  • Latency: ~40% faster typical turn
  • Cost: ~70% cheaper per turn
  • Cold-start tax (process restart / autoscale): MCP discovery cost drops from ~3-12s to ~24ms via the persistent DB cache
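
As a rough sanity check on how a cache hit ratio relates to input-token savings, here is a small worked calculation. It assumes cached prompt tokens are billed at a fixed fraction of the normal input price and that cache writes carry no surcharge. The 0.1 ratio below is a placeholder assumption, not a figure from this PR or from any provider's price list.

```python
def input_cost_reduction(hit_ratio: float, cached_price_ratio: float) -> float:
    """Fraction of input-token spend saved.

    hit_ratio: share of prompt tokens served from the provider cache (0..1).
    cached_price_ratio: price of a cached token relative to an uncached one (0..1).
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    if not 0.0 <= cached_price_ratio <= 1.0:
        raise ValueError("cached_price_ratio must be between 0 and 1")
    # Uncached tokens cost 1 unit each, cached tokens cost cached_price_ratio units.
    return hit_ratio * (1.0 - cached_price_ratio)


ASSUMED_CACHED_PRICE_RATIO = 0.1  # hypothetical discount, for illustration only

low = input_cost_reduction(0.85, ASSUMED_CACHED_PRICE_RATIO)
high = input_cost_reduction(0.99, ASSUMED_CACHED_PRICE_RATIO)
```

Under that assumed ratio, hit ratios of 85% and 99% correspond to savings of roughly 77% and 89% of input-token spend. Output tokens are unaffected, so the per-turn saving depends on the input/output mix.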

High-level PR Summary

This PR reduces agent latency and cost through four main optimizations: MCP tool discovery is now cached in the database and in an in-process LRU, eliminating 3-12 seconds of network round-trips per turn; Azure prompt caching is enabled alongside OpenAI/DeepSeek/xAI, achieving 85-99% cache hit ratios and a ~70-90% input-token cost reduction; subagent outputs are trimmed across 10 connectors so large API payloads are no longer echoed into evidence fields; and the KB planner now uses a dedicated small/fast LLM (configurable via is_planner: true in the global config) instead of the main chat model for internal query rewriting. In addition, sync embedding calls are offloaded to worker threads so they no longer block the event loop, and the ~2.5s preflight LLM probe is removed in favor of reactive 429 recovery. Performance observability is added throughout (per-LLM-call latency, cache read/write counts, MCP timings, subagent compile breakdown). The combined impact is ~40% faster turns and ~70% lower cost per turn, with cold-start MCP discovery dropping from 3-12s to ~24ms.

⏱️ Estimated Review Time: 30-90 minutes

💡 Review Order Suggestion
Order File Path
1 surfsense_backend/app/config/global_llm_config.example.yaml
2 surfsense_backend/app/config/__init__.py
3 surfsense_backend/app/services/llm_service.py
4 surfsense_backend/app/agents/new_chat/prompt_caching.py
5 surfsense_backend/tests/unit/agents/new_chat/test_prompt_caching.py
6 surfsense_backend/app/agents/new_chat/tools/mcp_tools_cache.py
7 surfsense_backend/tests/unit/agents/new_chat/tools/test_mcp_tools_cache.py
8 surfsense_backend/app/agents/new_chat/tools/mcp_tool.py
9 surfsense_backend/app/routes/mcp_oauth_route.py
10 surfsense_backend/app/routes/search_source_connectors_routes.py
11 surfsense_backend/app/agents/new_chat/middleware/knowledge_search.py
12 surfsense_backend/app/agents/multi_agent_chat/middleware/main_agent/knowledge_priority.py
13 surfsense_backend/app/agents/new_chat/chat_deepagent.py
14 surfsense_backend/app/tasks/chat/stream_new_chat.py
15 surfsense_backend/tests/unit/test_stream_new_chat_contract.py
16 surfsense_backend/app/agents/new_chat/middleware/kb_persistence.py
17 surfsense_backend/app/services/gmail/kb_sync_service.py
18 surfsense_backend/app/services/google_calendar/kb_sync_service.py
19 surfsense_backend/app/services/jira/kb_sync_service.py
20 surfsense_backend/app/services/onedrive/kb_sync_service.py
21 surfsense_backend/app/services/revert_service.py
22 surfsense_backend/app/utils/document_converters.py
23 surfsense_backend/app/tasks/connector_indexers/discord_indexer.py
24 surfsense_backend/app/tasks/connector_indexers/luma_indexer.py
25 surfsense_backend/app/tasks/connector_indexers/teams_indexer.py
26 surfsense_backend/app/tasks/document_processors/_save.py
27 surfsense_backend/app/agents/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_cloud.md
28 surfsense_backend/app/agents/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_desktop.md
29 surfsense_backend/app/agents/multi_agent_chat/subagents/builtins/research/system_prompt.md
30 surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/airtable/system_prompt.md
31 surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/calendar/system_prompt.md
32 surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/clickup/system_prompt.md
33 surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/discord/system_prompt.md
34 surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/gmail/system_prompt.md
35 surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/jira/system_prompt.md
36 surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/linear/system_prompt.md
37 surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/luma/system_prompt.md
38 surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/slack/system_prompt.md
39 surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/teams/system_prompt.md
40 surfsense_backend/app/services/token_tracking_service.py
41 surfsense_backend/app/agents/multi_agent_chat/middleware/main_agent/checkpointed_subagent_middleware/middleware.py
42 surfsense_backend/app/agents/multi_agent_chat/middleware/main_agent/checkpointed_subagent_middleware/task_tool.py
43 surfsense_backend/app/agents/multi_agent_chat/middleware/shared/kb_context_projection.py
44 surfsense_backend/app/agents/new_chat/middleware/knowledge_tree.py
45 surfsense_backend/app/agents/new_chat/middleware/memory_injection.py


CREDO23 added 30 commits May 19, 2026 21:29
LiteLLM normalizes every provider's cache fields onto
usage.prompt_tokens_details (cached_tokens + cache_creation_tokens).
The earlier fallback to usage.cache_read_input_tokens /
usage.cache_creation_input_tokens was wrong: Anthropic-shaped fields
only live there via a trailing setattr loop, and the canonical field
name on the wrapper is cache_creation_tokens (not _input_tokens).
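
A minimal sketch of reading the cache counters from the normalized location described above. The attribute names `prompt_tokens_details`, `cached_tokens`, and `cache_creation_tokens` come from the commit message. The helper name `extract_cache_counts` and the `SimpleNamespace` stand-in for a usage object are assumptions made so the example runs without LiteLLM installed.

```python
from types import SimpleNamespace
from typing import Any


def extract_cache_counts(usage: Any) -> tuple[int, int]:
    """Return (cache_read_tokens, cache_write_tokens) from a usage-like object.

    Only usage.prompt_tokens_details is consulted; missing or None fields count as 0.
    """
    details = getattr(usage, "prompt_tokens_details", None)
    if details is None:
        return 0, 0
    read = getattr(details, "cached_tokens", None) or 0
    write = getattr(details, "cache_creation_tokens", None) or 0
    return int(read), int(write)


# Stand-in usage object; a real response object would come from the LLM client.
usage = SimpleNamespace(
    prompt_tokens=12_000,
    prompt_tokens_details=SimpleNamespace(cached_tokens=10_240, cache_creation_tokens=0),
)
cache_read, cache_write = extract_cache_counts(usage)
```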

embed_texts holds a threading.Lock and runs a sync embedding call inside
search_knowledge_base, an async coroutine on the KB priority middleware
critical path. Blocking the event loop here stalls every other coroutine
on the worker (SSE keepalives, concurrent chat requests, background
tasks). Wrap in asyncio.to_thread so the embed runs on the default
executor pool while the loop keeps serving.
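
The effect of the asyncio.to_thread wrapper can be demonstrated with a small self-contained sketch. Here `embed_texts` and `search_knowledge_base` are simplified stand-ins that only share names with the real functions; the sleep simulates the blocking embedding call, and the heartbeat coroutine plays the role of other work on the loop (for example SSE keepalives).

```python
import asyncio
import threading
import time

_embed_lock = threading.Lock()


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Stand-in for a blocking embedding call guarded by a lock."""
    with _embed_lock:
        time.sleep(0.2)  # simulate synchronous model inference
        return [[float(len(text))] for text in texts]


async def search_knowledge_base(query: str) -> list[float]:
    # Run the blocking call on the default executor so the event loop stays free.
    vectors = await asyncio.to_thread(embed_texts, [query])
    return vectors[0]


async def heartbeat(ticks: list[float], stop: asyncio.Event) -> None:
    """Record a timestamp every 20 ms until asked to stop."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def main() -> tuple[list[float], int]:
    ticks: list[float] = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    vector = await search_knowledge_base("hello")
    stop.set()
    await beat
    return vector, len(ticks)
```

If `embed_texts` were called directly inside the coroutine, the heartbeat would record at most one tick during the embed. With `asyncio.to_thread` it keeps ticking for the whole 200 ms.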

_create_document and _update_document run on the chat critical path
when the filesystem subagent writes via the user's chat turn. Both
called embed_texts synchronously inside an async coroutine, blocking
the event loop for the duration of the embed.

generate_document_summary and create_document_chunks are async helpers
called from the chat path and from many connector indexers. Both wrapped
embed_text/embed_texts directly inside the coroutine, blocking the event
loop for the full duration of the embedding call.

_restore_in_place_document and _reinsert_document_from_revision are
async helpers invoked by the synchronous-feeling POST /api/threads/.../revert
route; both ran embed_texts inline, blocking the event loop while the
HTTP client waited.

…orkers

Connector kb_sync_services (gmail, onedrive, google_calendar, jira),
streaming indexers (discord, luma, teams) and the file-processor save
path all called embed_text inside async coroutines, blocking the
background worker's event loop for the duration of the embed. Wrap each
call site in asyncio.to_thread so concurrent indexing tasks stop
serialising on the embed.
…tive 429 recovery

The preflight pattern probed the LLM with a 1-token ping before each
cold turn (when requested_llm_config_id==0, llm_config_id<0, and the
45s healthy TTL had expired) to detect 429s before fanning out into
planner/classifier/title-gen. To absorb its ~1-5s RTT cost we built the
agent speculatively in parallel; on 429 we discarded the build and
repinned.

Three problems with that design:

1. False security. Provider rate limits are token-bucket. A 1-token
   ping consumes ~5 tokens; the real request consumes 10-50K. The
   probe can return 200 while the real call still 429s.
2. Pure overhead in the common case. On warm-agent-cache turns the
   probe dominates wall time: ~2.5s of TTFT pure tax for ~99% of users
   who never see a 429.
3. The in-stream recovery loop (catch of _is_provider_rate_limited
   gated by not _first_event_logged) already does the right thing
   reactively: mark_runtime_cooldown -> resolve_or_get_pinned_llm_config_id
   with exclude_config_ids={previous} -> rebuild agent -> retry the
   stream. Preflight was never the only safety net; it was a redundant
   probe in front of one.
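
Point 1 can be illustrated with a toy token bucket. This is a deliberately simplified model (no refill, a single bucket) and the numbers are invented for the example; real provider limits are more involved.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Toy rate limiter: a request succeeds only if enough tokens remain."""

    available: int

    def try_consume(self, tokens: int) -> bool:
        if tokens > self.available:
            return False  # the provider would answer 429 here
        self.available -= tokens
        return True


bucket = TokenBucket(available=8_000)  # hypothetical remaining budget
probe_ok = bucket.try_consume(5)       # the tiny preflight ping fits easily
real_ok = bucket.try_consume(20_000)   # the real prompt does not
```

The probe succeeds while the real request is rejected, which is the false sense of security the commit message describes.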

Changes:
- Delete _preflight_llm, _settle_speculative_agent_build, and the
  _PREFLIGHT_TIMEOUT_SEC / _PREFLIGHT_MAX_TOKENS constants.
- Drop the parallel agent_build_task / preflight_task plumbing in
  both stream_new_chat and stream_resume_chat; build the agent inline
  with await _build_main_agent_for_thread(...).
- Drop the unused is_recently_healthy / mark_healthy imports here
  (still exported from auto_model_pin_service since OpenRouter
  catalogue refresh and a few tests reference clear_healthy).
- Remove the obsolete preflight + settle-speculative tests from
  test_stream_new_chat_contract.py.

Net: -447 LOC. ~2.5s removed from TTFT on every cold preflight-eligible
turn. 429 recovery path is unchanged - same repin/rebuild/retry, just
not paid in advance on the healthy path.
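
The reactive path kept by this change can be sketched as follows. This is not the code in stream_new_chat: `RateLimited`, `stream_with_recovery`, and `fake_stream` are hypothetical names, and the cooldown and repin steps are reduced to an exclusion set. It only illustrates the shape of the loop: retry on another config when the 429 arrives before the first event, otherwise surface the error.

```python
import asyncio
from collections.abc import AsyncIterator, Callable


class RateLimited(Exception):
    """Stand-in for a provider 429 error."""


async def stream_with_recovery(
    config_ids: list[int],
    open_stream: Callable[[int], AsyncIterator[str]],
    max_attempts: int = 3,
) -> AsyncIterator[str]:
    excluded: set[int] = set()
    for _ in range(max_attempts):
        candidates = [cid for cid in config_ids if cid not in excluded]
        if not candidates:
            break
        current = candidates[0]  # "repin" to the first config not on cooldown
        first_event_sent = False
        try:
            async for event in open_stream(current):
                first_event_sent = True
                yield event
            return
        except RateLimited:
            if first_event_sent:
                raise  # output already reached the client; cannot retry silently
            excluded.add(current)  # cooldown: skip this config on the next attempt
    raise RateLimited("all candidate configs are rate limited")


async def fake_stream(config_id: int) -> AsyncIterator[str]:
    """Config 1 is rate limited before emitting anything; other configs stream normally."""
    if config_id == 1:
        raise RateLimited("429 from config 1")
    for chunk in ("hello", "world"):
        yield f"{config_id}:{chunk}"


async def collect() -> list[str]:
    return [event async for event in stream_with_recovery([1, 2], fake_stream)]
```

On the healthy path this loop adds no extra round-trip, which is the trade the commit makes: the recovery cost is paid only when a 429 actually occurs.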
…t LLM

Adds an optional planner LLM role wired through KnowledgePriorityMiddleware
so KB query rewriting, date extraction, and recency classification run on a
cheap model (e.g. gpt-4o-mini, Haiku, Azure nano) instead of the user's
chat LLM. Operators opt in by setting is_planner: true on exactly one
global config; without it, behavior is unchanged.
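
One way the opt-in could resolve at runtime is sketched below. The `is_planner` flag is from the commit message. The function names, the `model_name` key, and the choice to raise when more than one config sets the flag are assumptions for illustration; the PR may handle that case differently.

```python
from typing import Any, Optional


def select_planner_config(configs: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return the single config flagged is_planner: true, or None if none is flagged."""
    planners = [cfg for cfg in configs if cfg.get("is_planner") is True]
    if len(planners) > 1:
        # Assumed behavior: treat multiple planners as a configuration error.
        raise ValueError("is_planner must be set on at most one global config")
    return planners[0] if planners else None


def resolve_planner_model(configs: list[dict[str, Any]], chat_model: str) -> str:
    """Use the planner model when configured, otherwise fall back to the chat model."""
    planner = select_planner_config(configs)
    return planner["model_name"] if planner else chat_model


global_configs = [
    {"id": -1, "model_name": "big-chat-model"},
    {"id": -2, "model_name": "small-fast-model", "is_planner": True},
]
planner_model = resolve_planner_model(global_configs, chat_model="big-chat-model")
fallback_model = resolve_planner_model(global_configs[:1], chat_model="big-chat-model")
```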
CREDO23 added 6 commits May 20, 2026 11:58
Splits the OpenAI-family gate into per-param predicates so AZURE and
AZURE_OPENAI configs now receive prompt_cache_key for backend routing
affinity (Microsoft auto-caches GPT-4o+ deployments at >=1024 tokens;
the key clusters same-prefix requests on the same GPU pool and raises
hit rate on turn 2+). prompt_cache_retention stays opted out for Azure
because litellm 1.83.14's Azure transformer would drop it silently;
revisit when Azure's supported params list is updated.
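
The per-parameter gating might be structured along these lines. The parameter names `prompt_cache_key` and `prompt_cache_retention` and the AZURE / AZURE_OPENAI provider names come from the commit message. The exact provider sets, the function names, and the "24h" value are assumptions for the sketch.

```python
from typing import Optional

# Providers that receive prompt_cache_key (Azure included after this change).
CACHE_KEY_PROVIDERS = frozenset({"OPENAI", "DEEPSEEK", "XAI", "AZURE", "AZURE_OPENAI"})
# Providers that also receive prompt_cache_retention (Azure opted out).
# Membership beyond "not Azure" is an assumption.
CACHE_RETENTION_PROVIDERS = frozenset({"OPENAI", "DEEPSEEK", "XAI"})


def supports_prompt_cache_key(provider: str) -> bool:
    return provider.upper() in CACHE_KEY_PROVIDERS


def supports_prompt_cache_retention(provider: str) -> bool:
    return provider.upper() in CACHE_RETENTION_PROVIDERS


def build_cache_kwargs(
    provider: str, cache_key: str, retention: Optional[str] = None
) -> dict[str, str]:
    """Build only the cache-related request kwargs the provider is known to accept."""
    kwargs: dict[str, str] = {}
    if supports_prompt_cache_key(provider):
        kwargs["prompt_cache_key"] = cache_key
    if retention is not None and supports_prompt_cache_retention(provider):
        kwargs["prompt_cache_retention"] = retention
    return kwargs


azure_kwargs = build_cache_kwargs("AZURE", "thread-123", retention="24h")
openai_kwargs = build_cache_kwargs("OPENAI", "thread-123", retention="24h")
```

Splitting the gate per parameter lets a provider gain one hint without silently receiving another that its transformer would drop.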
Skip the ~1-3s MCP initialize + list_tools handshake on every cache miss
by reading tool definitions from the connector row we already load. Lazy
populate on first miss, self-heal on corrupt cache, zero schema migration.
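
A minimal sketch of the read path described above: use the persisted tool definitions when they look valid, otherwise run live discovery once and write the result back. It assumes `cached_tools` lives inside an existing JSON-style config on the connector row (one reading of "zero schema migration"). `load_tool_definitions`, the validity check, and the `discover` callback are hypothetical.

```python
from typing import Any, Callable

ToolDefs = list[dict[str, Any]]


def _looks_valid(cached: Any) -> bool:
    """A cached entry is usable if it is a list of dicts that each carry a string name."""
    return isinstance(cached, list) and all(
        isinstance(tool, dict) and isinstance(tool.get("name"), str) for tool in cached
    )


def load_tool_definitions(
    connector_config: dict[str, Any], discover: Callable[[], ToolDefs]
) -> tuple[ToolDefs, str]:
    """Return (tools, source) where source is 'cache' or 'live'."""
    cached = connector_config.get("cached_tools")
    if _looks_valid(cached):
        return cached, "cache"
    # Missing (first use) or corrupt: discover live and overwrite the cached copy.
    tools = discover()
    connector_config["cached_tools"] = tools  # the real code would persist the row
    return tools, "live"


discover_calls = 0


def fake_discover() -> ToolDefs:
    """Stand-in for the MCP initialize + list_tools handshake."""
    global discover_calls
    discover_calls += 1
    return [{"name": "search_issues", "description": "example tool"}]


config: dict[str, Any] = {"cached_tools": "not-a-list"}  # simulate a corrupt cache
first = load_tool_definitions(config, fake_discover)
second = load_tool_definitions(config, fake_discover)
```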
Collapse the invalidate + warmup pair into a single
refresh_mcp_tools_cache_for_connector(connector_id, search_space_id)
helper and scope live discovery to the one connector that changed
instead of the whole search space.

- new mcp_tool.discover_single_mcp_connector: load one connector,
  refresh OAuth if needed, force live MCP discovery so its cached_tools
  row is rewritten; returned wrappers are discarded since the in-process
  LRU is rebuilt lazily on the next user query
- mcp_tools_cache.refresh_mcp_tools_cache_for_connector: synchronously
  evicts the per-space LRU (LRU keys cannot scope finer) and schedules
  the per-connector prefetch via loop.create_task
- routes (OAuth callback, MCP POST, MCP PUT) collapse their two
  back-to-back calls into a single refresh call; DELETE handlers keep
  using bare invalidate_mcp_tools_cache (nothing to prefetch)

No new automated tests: the new functions are I/O glue (DB + network)
where mocked unit tests would test implementation rather than behavior.
The existing 9 unit tests for the cached_tools data shape are unchanged.
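
The evict-then-prefetch shape of the refresh helper can be sketched as below. The function names mirror the ones named in the commit message, but the bodies are stand-ins: the module-level dictionaries replace the real LRU and database, and the signature of `discover_single_mcp_connector` is an assumption.

```python
import asyncio

_lru_by_space: dict[int, list[str]] = {}         # stand-in for the in-process LRU
_db_cached_tools: dict[int, list[dict]] = {}     # stand-in for connector rows
_background_tasks: set[asyncio.Task] = set()


def invalidate_mcp_tools_cache(search_space_id: int) -> None:
    """Drop the per-space LRU entry; it is rebuilt lazily on the next query."""
    _lru_by_space.pop(search_space_id, None)


async def discover_single_mcp_connector(connector_id: int) -> None:
    """Placeholder for OAuth refresh + live discovery that rewrites cached_tools."""
    await asyncio.sleep(0)
    _db_cached_tools[connector_id] = [{"name": "example_tool"}]


def refresh_mcp_tools_cache_for_connector(
    connector_id: int, search_space_id: int
) -> asyncio.Task:
    invalidate_mcp_tools_cache(search_space_id)  # synchronous eviction
    loop = asyncio.get_running_loop()
    task = loop.create_task(discover_single_mcp_connector(connector_id))
    _background_tasks.add(task)  # keep a strong reference until the task finishes
    task.add_done_callback(_background_tasks.discard)
    return task


async def demo() -> tuple[bool, list[dict]]:
    _lru_by_space[7] = ["stale_tool"]
    task = refresh_mcp_tools_cache_for_connector(connector_id=42, search_space_id=7)
    evicted_immediately = 7 not in _lru_by_space
    await task
    return evicted_immediately, _db_cached_tools[42]
```

Holding a reference to the created task is a general asyncio precaution, since the event loop itself only keeps weak references to tasks; whether the PR does this is not stated.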
The probe answered its question (informing the cached_tools persistence
design). Future MCP session-pooling work, if revived, can recreate it.

Resolves: surfsense_backend/app/agents/new_chat/middleware/memory_injection.py
- Took both imports: upstream moved MEMORY_HARD_LIMIT/SOFT_LIMIT to
  app.services.memory; kept our perf-logger import for timing.

Pulls in upstream changes:
- Memory document feature (services/memory refactor, removal of
  app.agents.new_chat.memory_extraction and background extraction in
  stream_new_chat — agent now drives memory via update_memory tool).
- BACKEND_URL env refactor across web tool-ui/editor/chat/dashboard/lib.
- GitHub Actions backend test workflow + pre-commit biome bump.
- Token-display polish in MessageInfoDropdown; save_memory no-update
  sentinel.

Verified: 1723 unit tests pass, ruff clean. No semantic regression in
stream_new_chat (their memory-extraction deletion and our preflight
removal touch different functions).
@MODSetter MODSetter merged commit 5c4da79 into MODSetter:dev May 20, 2026
5 of 12 checks passed