Skip to content

V2.0.2 Version Release

Choose a tag to compare

@th3sanjai th3sanjai released this 03 Jun 05:42
· 5 commits to main since this release
e618510

What's New

6–8× Cost Reduction via Prompt Caching Parity with Claude Code

RAI now sends requests using the Anthropic wire format (POST /v1/messages) instead of the OpenAI format (POST /chat/completions). This single change unlocks full prompt caching — the same strategy used by Claude Code — saving 60–90% on input token costs for long sessions.

Before v2.0.2: $40–60 for a full 6-step VAPT session
After v2.0.2: $5–7 for the same session

What changed under the hood

Change Impact
ChatAnthropic replaces ChatLiteLLM for Claude models cache_control preserved through proxy
System prompt (70k chars) cached with ephemeral ~28k tokens saved every turn after first
Tool definitions (90 tools) cached ~35k tokens saved every turn after first
Last human message cached Full history served from cache on next turn
Cache TTL removed (was 5 min, now 1h default) 12× longer cache lifetime

Automatic upgrade

All existing Claude model configs (litellm:openai/bedrock-claude-*, anthropic:claude-*) are automatically upgraded to ChatAnthropic routing at runtime. No config changes required.

To explicitly use the new routing:

rai agents config-set rai \
  --model "chatanthropic:bedrock-claude-sonnet-4.6-(US)" \
  --api-key "sk-..." \
  --base-url "https://your-litellm-proxy.example.com"

Extended Thinking Enabled by Default

RAI now sends thinking: {type: enabled, budget_tokens: 31999} on every call — matching Claude Code's behavior. This improves reasoning quality for complex security assessments, reducing mistakes and re-runs.

Temperature override: Anthropic requires temperature=1.0 when extended thinking is enabled. RAI enforces this automatically for all Claude models. Your config.toml temperature setting is ignored while thinking is active. Non-Claude models (OpenAI, Gemini, Ollama) are unaffected.

To disable thinking and restore your configured temperature:

RAI_THINKING=0 rai chat          # per-run
export RAI_THINKING=0            # permanent
Mode Temperature used Notes
RAI_THINKING=1 (default) 1.0 (forced by Anthropic) Best reasoning quality
RAI_THINKING=0 Your config.toml value (default 0.7) Standard mode, lower cost

MITM Proxy Support for Debugging

Capture every LLM request in Burp Suite or mitmproxy:

RAI_INSPECT=1 RAI_INSPECT_PROXY=http://127.0.0.1:8080 rai chat

Works correctly with macOS system proxies (WARP, VPN) — those are bypassed automatically.


Bug Fixes

  • Fixed StaticSystemPromptCacheBreakpointMiddleware not tagging system[0] due to _should_apply_caching returning False for ChatAnthropic in deepagents
  • Fixed AnthropicPromptCachingMiddleware stamping ttl: "5m" on all cache blocks (now defaults to Anthropic's 1h)
  • Fixed RequestInspectorMiddleware failing on macOS when WARP/VPN SOCKS proxy is active