Release V2.0.2 Version Release · RevoltSecurities/RAI

What's New

6–8× Cost Reduction via Prompt Caching Parity with Claude Code

RAI now sends requests using the Anthropic wire format (POST /v1/messages) instead of the OpenAI format (POST /chat/completions). This single change unlocks full prompt caching — the same strategy used by Claude Code — saving 60–90% on input token costs for long sessions.

Before v2.0.2: $40–60 for a full 6-step VAPT session
After v2.0.2: $5–7 for the same session

What changed under the hood

Change	Impact
`ChatAnthropic` replaces `ChatLiteLLM` for Claude models	`cache_control` preserved through proxy
System prompt (70k chars) cached with `ephemeral`	~28k tokens saved every turn after first
Tool definitions (90 tools) cached	~35k tokens saved every turn after first
Last human message cached	Full history served from cache on next turn
Cache TTL removed (was 5 min, now 1h default)	12× longer cache lifetime

Automatic upgrade

All existing Claude model configs (litellm:openai/bedrock-claude-*, anthropic:claude-*) are automatically upgraded to ChatAnthropic routing at runtime. No config changes required.

To explicitly use the new routing:

rai agents config-set rai \
  --model "chatanthropic:bedrock-claude-sonnet-4.6-(US)" \
  --api-key "sk-..." \
  --base-url "https://your-litellm-proxy.example.com"

Extended Thinking Enabled by Default

RAI now sends thinking: {type: enabled, budget_tokens: 31999} on every call — matching Claude Code's behavior. This improves reasoning quality for complex security assessments, reducing mistakes and re-runs.

⚠ Temperature override: Anthropic requires temperature=1.0 when extended thinking is enabled. RAI enforces this automatically for all Claude models. Your config.toml temperature setting is ignored while thinking is active. Non-Claude models (OpenAI, Gemini, Ollama) are unaffected.

To disable thinking and restore your configured temperature:

RAI_THINKING=0 rai chat          # per-run
export RAI_THINKING=0            # permanent

Mode	Temperature used	Notes
`RAI_THINKING=1` (default)	`1.0` (forced by Anthropic)	Best reasoning quality
`RAI_THINKING=0`	Your `config.toml` value (default `0.7`)	Standard mode, lower cost

MITM Proxy Support for Debugging

Capture every LLM request in Burp Suite or mitmproxy:

RAI_INSPECT=1 RAI_INSPECT_PROXY=http://127.0.0.1:8080 rai chat

Works correctly with macOS system proxies (WARP, VPN) — those are bypassed automatically.

Bug Fixes

Fixed StaticSystemPromptCacheBreakpointMiddleware not tagging system[0] due to _should_apply_caching returning False for ChatAnthropic in deepagents
Fixed AnthropicPromptCachingMiddleware stamping ttl: "5m" on all cache blocks (now defaults to Anthropic's 1h)
Fixed RequestInspectorMiddleware failing on macOS when WARP/VPN SOCKS proxy is active

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

V2.0.2 Version Release

Choose a tag to compare

Sorry, something went wrong.