[Improvement] Agent: faster turns and lower LLM cost by CREDO23 · Pull Request #1422 · MODSetter/SurfSense

CREDO23 · 2026-05-20T19:26:45Z

What changed

MCP tool discovery cached — persistent per-connector cached_tools (DB-backed) + in-process LRU; eliminates 3-12s of network discovery per turn, refreshed only on connector lifecycle events
Azure prompt caching enabled — prompt_cache_key routing hint added; cache hit ratios now consistently 85-99% across all LLM calls, ~70-90% input-token cost reduction
Subagent outputs trimmed — 10 connectors (linear, slack, clickup, jira, gmail, calendar, airtable, discord, teams, luma) no longer echo raw API payloads into evidence.items; KB/research subagent excerpts capped at 500 chars
Preflight LLM probe removed — replaced ~2.5s pre-turn ping with reactive 429 recovery (existing fallback path); rate-limit safety unchanged
KB planner moved to dedicated small/fast LLM — internal query rewriting no longer uses the main chat model
Sync embedding calls offloaded to thread — kb_search, kb_persistence, document_converters, indexers, revert_service no longer block the event loop
Perf observability — per-LLM-call latency + cache read/write counts, MCP discover/call/oauth-refresh timing, subagent compile breakdown, middleware step logs

Impact

Latency: ~40% faster typical turn
Cost: ~70% cheaper per turn
Cold-start tax (process restart / autoscale): MCP discovery cost drops from ~3-12s to ~24ms via the persistent DB cache

High-level PR Summary

This PR delivers substantial latency and cost improvements to the agent system through four major optimizations: MCP tool discovery is now cached in the database and an in-process LRU, eliminating 3-12 seconds of network round-trips per turn; Azure prompt caching is enabled alongside OpenAI/DeepSeek/xAI, achieving 85-99% cache hit ratios and ~70-90% input token cost reduction; subagent outputs are trimmed across 10 connectors to avoid echoing large API payloads into evidence fields; and the KB planner now uses a dedicated small/fast LLM (configurable via is_planner: true in global config) instead of the main chat model for internal query rewriting. Additionally, sync embedding calls are offloaded to worker threads to avoid blocking the event loop, and the ~2.5s preflight LLM probe is removed in favor of reactive 429 recovery. Comprehensive performance observability is added throughout (per-LLM-call latency, cache read/write counts, MCP timings, subagent compile breakdown). The combined impact is ~40% faster turns and ~70% lower cost per turn, with cold-start MCP discovery dropping from 3-12s to ~24ms.

⏱️ Estimated Review Time: 30-90 minutes

💡 Review Order Suggestion

Order	File Path
1	`surfsense_backend/app/config/global_llm_config.example.yaml`
2	`surfsense_backend/app/config/__init__.py`
3	`surfsense_backend/app/services/llm_service.py`
4	`surfsense_backend/app/agents/new_chat/prompt_caching.py`
5	`surfsense_backend/tests/unit/agents/new_chat/test_prompt_caching.py`
6	`surfsense_backend/app/agents/new_chat/tools/mcp_tools_cache.py`
7	`surfsense_backend/tests/unit/agents/new_chat/tools/test_mcp_tools_cache.py`
8	`surfsense_backend/app/agents/new_chat/tools/mcp_tool.py`
9	`surfsense_backend/app/routes/mcp_oauth_route.py`
10	`surfsense_backend/app/routes/search_source_connectors_routes.py`
11	`surfsense_backend/app/agents/new_chat/middleware/knowledge_search.py`
12	`surfsense_backend/app/agents/multi_agent_chat/middleware/main_agent/knowledge_priority.py`
13	`surfsense_backend/app/agents/new_chat/chat_deepagent.py`
14	`surfsense_backend/app/tasks/chat/stream_new_chat.py`
15	`surfsense_backend/tests/unit/test_stream_new_chat_contract.py`
16	`surfsense_backend/app/agents/new_chat/middleware/kb_persistence.py`
17	`surfsense_backend/app/services/gmail/kb_sync_service.py`
18	`surfsense_backend/app/services/google_calendar/kb_sync_service.py`
19	`surfsense_backend/app/services/jira/kb_sync_service.py`
20	`surfsense_backend/app/services/onedrive/kb_sync_service.py`
21	`surfsense_backend/app/services/revert_service.py`
22	`surfsense_backend/app/utils/document_converters.py`
23	`surfsense_backend/app/tasks/connector_indexers/discord_indexer.py`
24	`surfsense_backend/app/tasks/connector_indexers/luma_indexer.py`
25	`surfsense_backend/app/tasks/connector_indexers/teams_indexer.py`
26	`surfsense_backend/app/tasks/document_processors/_save.py`
27	`surfsense_backend/app/agents/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_cloud.md`
28	`surfsense_backend/app/agents/multi_agent_chat/subagents/builtins/knowledge_base/system_prompt_desktop.md`
29	`surfsense_backend/app/agents/multi_agent_chat/subagents/builtins/research/system_prompt.md`
30	`surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/airtable/system_prompt.md`
31	`surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/calendar/system_prompt.md`
32	`surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/clickup/system_prompt.md`
33	`surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/discord/system_prompt.md`
34	`surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/gmail/system_prompt.md`
35	`surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/jira/system_prompt.md`
36	`surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/linear/system_prompt.md`
37	`surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/luma/system_prompt.md`
38	`surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/slack/system_prompt.md`
39	`surfsense_backend/app/agents/multi_agent_chat/subagents/connectors/teams/system_prompt.md`
40	`surfsense_backend/app/services/token_tracking_service.py`
41	`surfsense_backend/app/agents/multi_agent_chat/middleware/main_agent/checkpointed_subagent_middleware/middleware.py`
42	`surfsense_backend/app/agents/multi_agent_chat/middleware/main_agent/checkpointed_subagent_middleware/task_tool.py`
43	`surfsense_backend/app/agents/multi_agent_chat/middleware/shared/kb_context_projection.py`
44	`surfsense_backend/app/agents/new_chat/middleware/knowledge_tree.py`
45	`surfsense_backend/app/agents/new_chat/middleware/memory_injection.py`

LiteLLM normalizes every provider's cache fields onto usage.prompt_tokens_details (cached_tokens + cache_creation_tokens). The earlier fallback to usage.cache_read_input_tokens / usage.cache_creation_input_tokens was wrong: Anthropic-shaped fields only live there via a trailing setattr loop, and the canonical field name on the wrapper is cache_creation_tokens (not _input_tokens).
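As a rough illustration of reading those normalized fields, the sketch below computes read/write counts and a hit ratio from a usage-shaped object. The helper names (`cache_counts`, `cache_hit_ratio`) and the ratio definition (cached tokens divided by prompt tokens) are assumptions for illustration, not the PR's actual code; a `SimpleNamespace` stands in for the real usage object.

```python
from types import SimpleNamespace


def cache_counts(usage):
    """Return (cache_read, cache_write, prompt_tokens) from a usage-like object.

    Assumes the normalized shape described above:
    usage.prompt_tokens_details.cached_tokens / .cache_creation_tokens.
    Missing or None fields are treated as 0.
    """
    details = getattr(usage, "prompt_tokens_details", None)
    read = getattr(details, "cached_tokens", None) or 0
    write = getattr(details, "cache_creation_tokens", None) or 0
    prompt = getattr(usage, "prompt_tokens", None) or 0
    return read, write, prompt


def cache_hit_ratio(usage):
    """Fraction of prompt tokens served from cache (assumed definition)."""
    read, _write, prompt = cache_counts(usage)
    return read / prompt if prompt else 0.0


# Stand-in usage object; the numbers are made up for the example.
usage = SimpleNamespace(
    prompt_tokens=12000,
    prompt_tokens_details=SimpleNamespace(cached_tokens=10800, cache_creation_tokens=0),
)
ratio = cache_hit_ratio(usage)
```

Reading only from `prompt_tokens_details` keeps the logging path provider-agnostic, which is the point of the fix described above.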

embed_texts holds a threading.Lock and runs a sync embedding call inside search_knowledge_base, an async coroutine on the KB priority middleware critical path. Blocking the event loop here stalls every other coroutine on the worker (SSE keepalives, concurrent chat requests, background tasks). Wrap in asyncio.to_thread so the embed runs on the default executor pool while the loop keeps serving.
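A minimal sketch of the pattern, assuming a blocking, lock-guarded `embed_texts`. The function bodies here are stand-ins (a `time.sleep` instead of a real embedding model), and `heartbeat` is a hypothetical coroutine representing other work on the same loop.

```python
import asyncio
import threading
import time

_embed_lock = threading.Lock()


def embed_texts(texts):
    """Stand-in for the blocking, lock-guarded embedding call."""
    with _embed_lock:
        time.sleep(0.05)  # simulates model inference time
        return [[float(len(t))] for t in texts]


async def search_knowledge_base(query):
    # Run the sync call on the default executor so the loop stays responsive.
    vectors = await asyncio.to_thread(embed_texts, [query])
    return vectors[0]


async def heartbeat(ticks):
    # Simulates other coroutines (e.g. SSE keepalives) sharing the loop.
    for _ in range(5):
        await asyncio.sleep(0.005)
        ticks.append(time.monotonic())


async def main():
    ticks = []
    vector, _ = await asyncio.gather(search_knowledge_base("hello"), heartbeat(ticks))
    return vector, len(ticks)
```

Calling `asyncio.run(main())` runs the embed and the heartbeat concurrently; with an inline `embed_texts([query])` call instead, the heartbeat could not tick until the embed returned.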

_create_document and _update_document run on the chat critical path when the filesystem subagent writes via the user's chat turn. Both called embed_texts synchronously inside an async coroutine, blocking the event loop for the duration of the embed.

generate_document_summary and create_document_chunks are async helpers called from the chat path and from many connector indexers. Both wrapped embed_text/embed_texts directly inside the coroutine, blocking the event loop for the full duration of the embedding call.

_restore_in_place_document and _reinsert_document_from_revision are async helpers invoked by the synchronous-feeling POST /api/threads/.../revert route; both ran embed_texts inline, blocking the event loop while the HTTP client waited.

…orkers

Connector kb_sync_services (gmail, onedrive, google_calendar, jira), streaming indexers (discord, luma, teams), and the file-processor save path all called embed_text inside async coroutines, blocking the background worker's event loop for the duration of the embed. Wrap each call site in asyncio.to_thread so concurrent indexing tasks stop serialising on the embed.

…tive 429 recovery

The preflight pattern probed the LLM with a 1-token ping before each cold turn (when requested_llm_config_id==0, llm_config_id<0, and the 45s healthy TTL had expired) to detect 429s before fanning out into planner/classifier/title-gen. To absorb its ~1-5s RTT cost we built the agent speculatively in parallel; on 429 we discarded the build and repinned.

Three problems with that design:

1. False security. Provider rate limits are token-bucket. A 1-token ping consumes ~5 tokens; the real request consumes 10-50K. The probe can return 200 while the real call still 429s.
2. Pure overhead in the common case. On warm-agent-cache turns the probe dominates wall time: ~2.5s of TTFT pure tax for ~99% of users who never see a 429.
3. The in-stream recovery loop (catch of _is_provider_rate_limited gated by not _first_event_logged) already does the right thing reactively: mark_runtime_cooldown -> resolve_or_get_pinned_llm_config_id with exclude_config_ids={previous} -> rebuild agent -> retry the stream. Preflight was never the only safety net; it was a redundant probe in front of one.

Changes:

- Delete _preflight_llm, _settle_speculative_agent_build, and the _PREFLIGHT_TIMEOUT_SEC / _PREFLIGHT_MAX_TOKENS constants.
- Drop the parallel agent_build_task / preflight_task plumbing in both stream_new_chat and stream_resume_chat; build the agent inline with await _build_main_agent_for_thread(...).
- Drop the unused is_recently_healthy / mark_healthy imports here (still exported from auto_model_pin_service since OpenRouter catalogue refresh and a few tests reference clear_healthy).
- Remove the obsolete preflight + settle-speculative tests from test_stream_new_chat_contract.py.

Net: -447 LOC. ~2.5s removed from TTFT on every cold preflight-eligible turn. The 429 recovery path is unchanged: same repin/rebuild/retry, just not paid in advance on the healthy path.
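The reactive path (cooldown, repin excluding the failed config, retry) could be sketched roughly as follows. Every name here (`RateLimited`, `stream_with_recovery`, `open_stream`, `fake_stream`) is hypothetical and only mirrors the shape of the loop described in the commit message; it is not the code in stream_new_chat.

```python
import asyncio


class RateLimited(Exception):
    """Stand-in for a provider 429 error."""


async def stream_with_recovery(config_ids, open_stream, max_attempts=3):
    """Stream from the first usable config; on a 429 raised before any event
    was emitted, put that config on cooldown, pick another, and retry."""
    cooled_down = set()
    for _ in range(max_attempts):
        candidates = [c for c in config_ids if c not in cooled_down]
        if not candidates:
            break
        config_id = candidates[0]
        events = []
        try:
            async for event in open_stream(config_id):
                events.append(event)
            return config_id, events
        except RateLimited:
            if events:
                # Output already reached the client; a silent retry is unsafe.
                raise
            cooled_down.add(config_id)
    raise RuntimeError("all candidate LLM configs are rate limited")


async def fake_stream(config_id):
    # Hypothetical provider: config -1 is rate limited, others stream two tokens.
    if config_id == -1:
        raise RateLimited("429 from provider")
    for token in ("Hello", " world"):
        yield token
```

The healthy path pays nothing extra here: the retry logic only runs when the first real call actually fails.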

…t LLM

Adds an optional planner LLM role wired through KnowledgePriorityMiddleware so KB query rewriting, date extraction, and recency classification run on a cheap model (e.g. gpt-4o-mini, Haiku, Azure nano) instead of the user's chat LLM. Operators opt in by setting is_planner: true on exactly one global config; without it, behavior is unchanged.
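One way the opt-in could be resolved is sketched below. Only the `is_planner` flag comes from the PR; the dict shape, the `model_name` key, the function names, and the choice to raise when more than one config is flagged are all assumptions made for the example.

```python
def pick_planner_config(global_configs):
    """Return the single config flagged is_planner: true, or None if absent.

    Raising on multiple flagged configs is an assumption; the PR only says
    the flag should be set on exactly one global config.
    """
    planners = [c for c in global_configs if c.get("is_planner") is True]
    if len(planners) > 1:
        raise ValueError("is_planner must be set on at most one global config")
    return planners[0] if planners else None


def resolve_planner_model(global_configs, chat_model):
    """Fall back to the chat model when no planner is configured."""
    planner = pick_planner_config(global_configs)
    return planner["model_name"] if planner else chat_model


# Hypothetical parsed YAML; "model_name" is an illustrative key.
configs = [
    {"id": -1, "model_name": "gpt-4o"},
    {"id": -2, "model_name": "gpt-4o-mini", "is_planner": True},
]
```

With `configs` as above, `resolve_planner_model(configs, "gpt-4o")` selects the flagged small model; with no flagged config it returns the chat model, matching the "behavior is unchanged" default.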

Splits the OpenAI-family gate into per-param predicates so AZURE and AZURE_OPENAI configs now receive prompt_cache_key for backend routing affinity (Microsoft auto-caches GPT-4o+ deployments at >=1024 tokens; the key clusters same-prefix requests on the same GPU pool and raises hit rate on turn 2+). prompt_cache_retention stays opted out for Azure because litellm 1.83.14's Azure transformer would drop it silently; revisit when Azure's supported params list is updated.
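The per-param split might look something like the sketch below. The provider sets, predicate names, key format, and retention value are illustrative assumptions; in particular, which non-Azure providers actually receive prompt_cache_retention is not stated in the PR.

```python
# Assumed provider groupings for illustration only.
OPENAI_LIKE = {"OPENAI", "DEEPSEEK", "XAI"}
AZURE_LIKE = {"AZURE", "AZURE_OPENAI"}


def supports_prompt_cache_key(provider):
    """Routing hint: sent to OpenAI-like and Azure providers."""
    return provider.upper() in OPENAI_LIKE | AZURE_LIKE


def supports_prompt_cache_retention(provider):
    """Retention stays opted out for Azure, per the commit message."""
    return provider.upper() in OPENAI_LIKE


def cache_params(provider, thread_id):
    """Build the extra request kwargs for one LLM call."""
    params = {}
    if supports_prompt_cache_key(provider):
        # Hypothetical key format; any stable per-conversation string works.
        params["prompt_cache_key"] = f"thread-{thread_id}"
    if supports_prompt_cache_retention(provider):
        params["prompt_cache_retention"] = "24h"  # illustrative value
    return params
```

Separate predicates let one parameter be enabled for a provider without implying the other, which a single family-wide gate cannot express.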

Skip the ~1-3s MCP initialize + list_tools handshake on every cache miss by reading tool definitions from the connector row we already load. Lazy populate on first miss, self-heal on corrupt cache, zero schema migration.
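A minimal sketch of the read-through behavior (cache hit, lazy populate on miss, self-heal on corrupt data), assuming the connector row is a dict with a JSON-encoded `cached_tools` value. The storage shape, the validation rule, and the `load_tools` / `discover` names are assumptions; the real code reads from the database row.

```python
import json


def load_tools(connector, discover):
    """Return (tools, source) where source is "cache" or "live".

    connector: dict standing in for the connector row.
    discover:  callable performing live MCP discovery (the slow path).
    """
    raw = connector.get("cached_tools")
    if raw is not None:
        try:
            tools = json.loads(raw)
            if isinstance(tools, list) and all(
                isinstance(t, dict) and "name" in t for t in tools
            ):
                return tools, "cache"
        except (TypeError, ValueError):
            pass  # corrupt cache: fall through and rediscover
    tools = discover(connector)
    connector["cached_tools"] = json.dumps(tools)  # lazy populate / self-heal
    return tools, "live"


calls = []


def fake_discover(connector):
    # Hypothetical stand-in for the initialize + list_tools handshake.
    calls.append(connector["id"])
    return [{"name": "search_issues"}]
```

On the first call for a connector the slow path runs once and rewrites `cached_tools`; later calls return from the row without touching the network.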

Collapse the invalidate + warmup pair into a single refresh_mcp_tools_cache_for_connector(connector_id, search_space_id) helper and scope live discovery to the one connector that changed instead of the whole search space.

- new mcp_tool.discover_single_mcp_connector: load one connector, refresh OAuth if needed, force live MCP discovery so its cached_tools row is rewritten; returned wrappers are discarded since the in-process LRU is rebuilt lazily on the next user query
- mcp_tools_cache.refresh_mcp_tools_cache_for_connector: synchronously evicts the per-space LRU (LRU keys cannot scope finer) and schedules the per-connector prefetch via loop.create_task
- routes (OAuth callback, MCP POST, MCP PUT) collapse their two back-to-back calls into a single refresh call; DELETE handlers keep using bare invalidate_mcp_tools_cache (nothing to prefetch)

No new automated tests: the new functions are I/O glue (DB + network) where mocked unit tests would test implementation rather than behavior. The existing 9 unit tests for the cached_tools data shape are unchanged.

Resolves:

- surfsense_backend/app/agents/new_chat/middleware/memory_injection.py - Took both imports: upstream moved MEMORY_HARD_LIMIT/SOFT_LIMIT to app.services.memory; kept our perf-logger import for timing.

Pulls in upstream changes:

- Memory document feature (services/memory refactor, removal of app.agents.new_chat.memory_extraction and background extraction in stream_new_chat; the agent now drives memory via the update_memory tool).
- BACKEND_URL env refactor across web tool-ui/editor/chat/dashboard/lib.
- GitHub Actions backend test workflow + pre-commit biome bump.
- Token-display polish in MessageInfoDropdown; save_memory no-update sentinel.

Verified: 1723 unit tests pass, ruff clean. No semantic regression in stream_new_chat (their memory-extraction deletion and our preflight removal touch different functions).

vercel · 2026-05-20T19:26:49Z

@CREDO23 is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-05-20T19:28:24Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

CREDO23 added 30 commits May 19, 2026 21:29

perf(mcp): add per-call, discovery, and oauth-refresh timing logs

9bfba34

perf(subagent): add subagent compile timing log

9e81f2a

perf(subagent): add atask EXIT breakdown timing log

33bfce4

perf(multi-agent): add kb_context_projection timing log

bd153d3

perf(new-chat): add knowledge_tree middleware timing log

1df40fb

perf(new-chat): add memory_injection middleware timing log

b3b66e4

perf(tokens): add per-call latency to capture log

581bbfb

perf(calendar): stop echoing raw events into evidence.items

3a5e16e

chore(scripts): add MCP session lifetime probe

1481394

perf(gmail subagent): stop echoing raw emails array into evidence.items

553bece

perf(linear subagent): stop echoing raw issues list into evidence.items

d3d396a

perf(slack subagent): stop echoing raw messages list into evidence.items

6e5dd54

perf(jira subagent): stop echoing raw issues list into evidence.items

6be1b22

perf(clickup subagent): stop echoing raw tasks list into evidence.items

1b2f13e

perf(airtable subagent): stop echoing raw records list into evidence.items

56d8ff8

perf(discord subagent): stop echoing raw channels/messages payload into evidence.items

f4e6671

perf(luma subagent): stop echoing raw events list into evidence.items

20f7896

perf(teams subagent): stop echoing raw teams/channels/messages payload into evidence.items

6c173dc

perf(research subagent): cap evidence.findings and evidence.sources to bound output

b554c60

perf(kb subagent, cloud): cap evidence.content_excerpt to 500 chars

5edf052

perf(kb subagent, desktop): cap evidence.content_excerpt to 500 chars

0cdda14

obs(tokens): log prompt-cache read/write counts and hit ratio per LLM call

6090980

CREDO23 added 6 commits May 20, 2026 11:58

chore(scripts): drop one-off MCP session lifetime probe

2be3f04

The probe answered its question (informing the cached_tools persistence design). Future MCP session-pooling work, if revived, can recreate it.

Merge remote-tracking branch 'upstream/dev' into improvement-agent-speed

d5ee8cc

MODSetter merged commit 5c4da79 into MODSetter:dev May 20, 2026
5 of 12 checks passed

MODSetter merged 36 commits into MODSetter:dev from CREDO23:improvement-agent-speed
