V2.0.2 Version Release
What's New
6–8× Cost Reduction via Prompt Caching Parity with Claude Code
RAI now sends requests using the Anthropic wire format (POST /v1/messages) instead of the OpenAI format (POST /chat/completions). This single change unlocks full prompt caching — the same strategy used by Claude Code — saving 60–90% on input token costs for long sessions.
Before v2.0.2: $40–60 for a full 6-step VAPT session
After v2.0.2: $5–7 for the same session
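As a rough illustration of where the savings come from, the sketch below estimates the input cost of the static prefix alone (system prompt plus tool definitions) with and without caching. The token counts (~28k and ~35k) are taken from the table in this release. The base price of $3 per million input tokens, the 40-turn session length, and the function name `prefix_input_cost` are assumptions chosen for illustration. The multipliers follow Anthropic's published pricing: cache reads are billed at 0.1× the base input price, and cache writes at 1.25× (5-minute TTL) or 2× (1-hour TTL). The sketch also assumes the cache never expires mid-session.

```python
def prefix_input_cost(prefix_tokens, turns, price_per_mtok, cached,
                      write_multiplier=2.0, read_multiplier=0.1):
    """Estimate input cost (USD) of re-sending a static prefix on every turn.

    Assumes one cache write on the first turn and cache reads afterwards.
    """
    base = prefix_tokens / 1_000_000 * price_per_mtok  # cost of one uncached send
    if not cached:
        return base * turns
    return base * write_multiplier + base * read_multiplier * (turns - 1)


# ~28k system prompt tokens + ~35k tool definition tokens (from the table).
PREFIX_TOKENS = 28_000 + 35_000

# Hypothetical session: 40 turns at an assumed $3 per million input tokens.
uncached = prefix_input_cost(PREFIX_TOKENS, turns=40, price_per_mtok=3.0, cached=False)
cached = prefix_input_cost(PREFIX_TOKENS, turns=40, price_per_mtok=3.0, cached=True)
print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
```

This covers only the static prefix. Conversation history is also served from cache, so real session totals depend on how quickly the history grows and will differ from this estimate.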
What changed under the hood
| Change | Impact |
|---|---|
| ChatAnthropic replaces ChatLiteLLM for Claude models | cache_control preserved through proxy |
| System prompt (70k chars) cached with ephemeral | ~28k tokens saved every turn after first |
| Tool definitions (90 tools) cached | ~35k tokens saved every turn after first |
| Last human message cached | Full history served from cache on next turn |
| Cache TTL removed (was 5 min, now 1h default) | 12× longer cache lifetime |
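The three cache breakpoints in the table can be sketched as a request body for POST /v1/messages. This is a minimal illustration, not RAI's actual code: the helper name `build_cached_request` and the model, tool, and message values passed to it are hypothetical. The `cache_control: {"type": "ephemeral"}` marker and its placement on system blocks, tool definitions, and message content blocks follow Anthropic's prompt caching documentation.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}


def build_cached_request(model, system_prompt, tools, messages, max_tokens=4096):
    """Return a /v1/messages body with three cache breakpoints (hypothetical helper)."""
    tools = copy.deepcopy(tools)        # avoid mutating the caller's data
    messages = copy.deepcopy(messages)

    # Breakpoint 1: the system prompt, sent as a single text block.
    system = [{"type": "text", "text": system_prompt,
               "cache_control": dict(EPHEMERAL)}]

    # Breakpoint 2: marking the last tool caches the tool list up to that point.
    if tools:
        tools[-1]["cache_control"] = dict(EPHEMERAL)

    # Breakpoint 3: the last user message, so the history up to it can be
    # read from cache on the next turn.
    for msg in reversed(messages):
        if msg["role"] == "user":
            content = msg["content"]
            if isinstance(content, str):
                content = [{"type": "text", "text": content}]
            content[-1]["cache_control"] = dict(EPHEMERAL)
            msg["content"] = content
            break

    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system,
        "tools": tools,
        "messages": messages,
    }
```

The OpenAI-style chat format has no equivalent field for these markers, which is why the switch in wire format matters for caching.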
Automatic upgrade
All existing Claude model configs (litellm:openai/bedrock-claude-*, anthropic:claude-*) are automatically upgraded to ChatAnthropic routing at runtime. No config changes required.
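One way to picture the automatic upgrade is as a prefix rewrite on the configured model string. The sketch below is an assumption about the shape of that mapping, inferred from the two patterns named above and from the explicit `chatanthropic:` form used in this release. The function name `upgrade_model_id` and the rule table are hypothetical, and RAI's real runtime logic may differ.

```python
from fnmatch import fnmatchcase

# (pattern to match, prefix to strip) -- assumed from the release notes.
_UPGRADE_RULES = [
    ("litellm:openai/bedrock-claude-*", "litellm:openai/"),
    ("anthropic:claude-*", "anthropic:"),
]


def upgrade_model_id(model_id):
    """Rewrite legacy Claude model strings to the chatanthropic: form.

    Strings that match no rule (for example non-Claude models) pass through.
    """
    for pattern, prefix in _UPGRADE_RULES:
        if fnmatchcase(model_id, pattern):
            return "chatanthropic:" + model_id[len(prefix):]
    return model_id
```

Under these assumed rules, `upgrade_model_id("litellm:openai/bedrock-claude-sonnet-4.6-(US)")` yields `chatanthropic:bedrock-claude-sonnet-4.6-(US)`, matching the explicit configuration form.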
To explicitly use the new routing:
rai agents config-set rai \
--model "chatanthropic:bedrock-claude-sonnet-4.6-(US)" \
--api-key "sk-..." \
--base-url "https://your-litellm-proxy.example.com"
Extended Thinking Enabled by Default
RAI now sends thinking: {type: enabled, budget_tokens: 31999} on every call — matching Claude Code's behavior. This improves reasoning quality for complex security assessments, reducing mistakes and re-runs.
⚠ Temperature override: Anthropic requires temperature=1.0 when extended thinking is enabled. RAI enforces this automatically for all Claude models. Your config.toml temperature setting is ignored while thinking is active. Non-Claude models (OpenAI, Gemini, Ollama) are unaffected.
To disable thinking and restore your configured temperature:
RAI_THINKING=0 rai chat # per-run
export RAI_THINKING=0 # current shell session; add to your shell profile to persist
| Mode | Temperature used | Notes |
|---|---|---|
| RAI_THINKING=1 (default) | 1.0 (forced by Anthropic) | Best reasoning quality |
| RAI_THINKING=0 | Your config.toml value (default 0.7) | Standard mode, lower cost |
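The table's rules can be sketched as a small resolver. This is an illustration under stated assumptions, not RAI's implementation: the function name `resolve_sampling` is hypothetical, and treating any `RAI_THINKING` value other than `"0"` as enabled is an assumption. The `thinking` payload and the 31999-token budget are the values quoted in this release.

```python
import os

THINKING_BUDGET_TOKENS = 31_999  # budget quoted in these release notes


def resolve_sampling(configured_temperature=0.7, is_claude=True, env=None):
    """Return the sampling parameters to send, per the mode table (hypothetical)."""
    env = os.environ if env is None else env
    # Assumption: thinking is on unless RAI_THINKING is exactly "0".
    thinking_on = is_claude and env.get("RAI_THINKING", "1") != "0"
    if thinking_on:
        return {
            "temperature": 1.0,  # required while extended thinking is enabled
            "thinking": {"type": "enabled",
                         "budget_tokens": THINKING_BUDGET_TOKENS},
        }
    # Thinking off, or a non-Claude model: the configured temperature applies.
    return {"temperature": configured_temperature}
```

Passing `env` explicitly keeps the function easy to test. For example, `resolve_sampling(0.2, env={"RAI_THINKING": "0"})` returns only the configured temperature.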
MITM Proxy Support for Debugging
Capture every LLM request in Burp Suite or mitmproxy:
RAI_INSPECT=1 RAI_INSPECT_PROXY=http://127.0.0.1:8080 rai chat
Works correctly with macOS system proxies (WARP, VPN); those are bypassed automatically.
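The idea of routing through an explicit inspection proxy rather than a system-discovered one can be sketched with the standard library. This is only an illustration: RAI's HTTP client is not described in these notes and is probably not `urllib`, and the helper names are hypothetical. The relevant behaviour shown is that `urllib.request.ProxyHandler`, when given an explicit mapping, does not call `getproxies()` to discover system proxy settings.

```python
import urllib.request


def inspection_proxies(env):
    """Return an explicit proxy mapping when inspection is on, else None."""
    if env.get("RAI_INSPECT") != "1":
        return None
    proxy = env.get("RAI_INSPECT_PROXY")
    if not proxy:
        return None
    # Send both plain and TLS traffic through the intercepting proxy.
    return {"http": proxy, "https": proxy}


def build_inspect_opener(env):
    """Build an opener that uses only the inspection proxy (hypothetical helper)."""
    proxies = inspection_proxies(env)
    if proxies is None:
        return None  # inspection off: the caller keeps its normal HTTP client
    # An explicit mapping replaces system proxy discovery for this opener.
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
```

For example, `inspection_proxies({"RAI_INSPECT": "1", "RAI_INSPECT_PROXY": "http://127.0.0.1:8080"})` returns a mapping that points both schemes at the local proxy.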
Bug Fixes
- Fixed StaticSystemPromptCacheBreakpointMiddleware not tagging system[0] due to _should_apply_caching returning False for ChatAnthropic in deepagents
- Fixed AnthropicPromptCachingMiddleware stamping ttl: "5m" on all cache blocks (now defaults to Anthropic's 1h)
- Fixed RequestInspectorMiddleware failing on macOS when WARP/VPN SOCKS proxy is active