Catch prompt-cache regressions before production.
CacheSentry is an open-source CI and runtime validation tool for LLM apps. It detects when prompt changes break reusable prefixes, fails regressions in CI, and compares predictions against real cache signals.
LLM apps often use long prompts:
- system instructions
- tool schemas
- retrieved documents
- memory
- policies
- metadata
- user context
Prompt caching can reuse stable prompt prefixes. But one small dynamic field near the front can silently break reuse:
- timestamp
- UUID
- request_id
- session_id
- dynamic metadata
- randomized tool/schema order
This can hurt cache reuse, latency, and cost, often without being caught during PR review.
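To make the failure mode concrete, here is a minimal sketch (not CacheSentry's implementation) that measures how much of two rendered prompts is shared from the start. The prompt text, the `request_id` values, and the helper names are all hypothetical, and the comparison is character-based for simplicity, whereas providers match on tokens.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters that a and b have in common."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


# Hypothetical long, stable system prompt.
SYSTEM = "You are a support assistant. Follow the policy document carefully. " * 20


def early_id_prompt(request_id: str) -> str:
    # Dynamic field placed BEFORE the stable content.
    return f"request_id={request_id}\n{SYSTEM}"


def late_id_prompt(request_id: str) -> str:
    # Same dynamic field placed AFTER the stable content.
    return f"{SYSTEM}\nrequest_id={request_id}"


early_shared = shared_prefix_len(early_id_prompt("a1f3"), early_id_prompt("9c2e"))
late_shared = shared_prefix_len(late_id_prompt("a1f3"), late_id_prompt("9c2e"))

print(f"early id: {early_shared} shared chars")
print(f"late id:  {late_shared} shared chars")
```

With the identifier at the front, the two requests share only the literal `request_id=` label. With it at the end, the whole system prompt stays in the shared prefix.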
CacheSentry:
- analyzes OpenAI-style, LiteLLM, and OpenTelemetry GenAI traces
- computes stable-prefix ratio
- estimates lost reusable tokens
- identifies culprit fields/kinds
- creates known-good baselines
- diffs current traces against baselines
- fails CI when cacheability regresses
- emits Markdown, JSON, GitHub annotations, and SARIF
- supports provider-aware offline projections
- validates predictions against observed runtime cache signals
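As an illustration of what a stable-prefix ratio and a culprit kind could look like, the sketch below works on whitespace-split tokens. The ratio definition (shared prefix length divided by the shortest prompt length), the regexes, and every function name are assumptions made for this example. CacheSentry's actual metrics and tokenization may differ.

```python
import re
from typing import Sequence

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)
ISO_TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def common_prefix_tokens(prompts: Sequence[Sequence[str]]) -> int:
    """Count leading token positions where every prompt agrees."""
    if not prompts:
        return 0
    n = 0
    for column in zip(*prompts):
        if any(tok != column[0] for tok in column):
            break
        n += 1
    return n


def stable_prefix_ratio(prompts: Sequence[Sequence[str]]) -> float:
    """Shared prefix length divided by the shortest prompt length (assumed definition)."""
    if not prompts:
        return 0.0
    shortest = min(len(p) for p in prompts)
    if shortest == 0:
        return 0.0
    return common_prefix_tokens(prompts) / shortest


def classify_token(token: str) -> str:
    """Guess the kind of the first diverging token."""
    if UUID_RE.match(token):
        return "uuid"
    if ISO_TS_RE.match(token):
        return "timestamp"
    return "unknown"


# Two hypothetical requests that differ only in an early UUID.
prompts = [
    "policy v1 trace 123e4567-e89b-12d3-a456-426614174000 answer briefly".split(),
    "policy v1 trace 9b2d7f10-0c3a-4e5f-8a6b-1c2d3e4f5a6b answer briefly".split(),
]

prefix = common_prefix_tokens(prompts)
ratio = stable_prefix_ratio(prompts)
culprit_kind = classify_token(prompts[0][prefix])

print(prefix, ratio, culprit_kind)
```

Here three of six tokens are shared before the UUID diverges, so the ratio is 0.5 and the first diverging token is classified as a UUID.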
In one controlled live OpenAI validation run, cached_tokens moved in the expected direction:
| Case | Cached tokens | Interpretation |
|---|---|---|
| Stable prompt variant | 2816 | cache reuse preserved |
| Broken early-UUID variant | 0 | early dynamic field broke reusable prefix |
| Fixed late-UUID variant | 2816 | cache reuse restored |
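The following sketch shows one way to compare an offline projection with an observed signal. It assumes the OpenAI-documented behavior that caching applies to prompts of at least 1024 tokens and that cached prefixes grow in 128-token increments, and it reads `usage.prompt_tokens_details.cached_tokens` from an OpenAI-style response dict. The function names and the 2900-token example are hypothetical and are not taken from the validation run above.

```python
def projected_cached_tokens(
    shared_prefix_tokens: int, min_tokens: int = 1024, increment: int = 128
) -> int:
    """Round a shared prefix down to the assumed provider caching granularity."""
    if shared_prefix_tokens < min_tokens:
        return 0
    return (shared_prefix_tokens // increment) * increment


def observed_cached_tokens(response: dict) -> int:
    """Read cached_tokens from an OpenAI-style response, defaulting to 0."""
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


# Hypothetical response payload with only the fields this sketch needs.
response = {
    "usage": {
        "prompt_tokens": 2900,
        "prompt_tokens_details": {"cached_tokens": 2816},
    }
}

projected = projected_cached_tokens(2900)
observed = observed_cached_tokens(response)
print(projected, observed, projected == observed)
```

A projection of this kind is only an upper bound on what a provider may reuse. Whether a cache hit actually happens still depends on routing, TTL, and cache state.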
Trace/logs
→ normalize safely
→ render/tokenize prompt structure
→ compare stable prefixes
→ detect culprit
→ baseline/diff
→ CI/SARIF report
→ runtime validation against observed cache signals
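The baseline/diff step can be pictured as a threshold check between two metric snapshots. The sketch below is an assumption about the shape of such a gate, not CacheSentry's code. The `CacheabilitySnapshot` type and the function name are hypothetical, and the default thresholds simply mirror the example values used with the diff command in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheabilitySnapshot:
    stable_prefix_ratio: float   # 0.0 to 1.0
    lost_reusable_tokens: int    # estimated tokens no longer reusable


def diff_against_baseline(
    baseline: CacheabilitySnapshot,
    current: CacheabilitySnapshot,
    max_stable_prefix_drop: float = 0.15,
    max_lost_token_increase: int = 100,
) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    drop = baseline.stable_prefix_ratio - current.stable_prefix_ratio
    if drop > max_stable_prefix_drop:
        failures.append(
            f"stable-prefix ratio dropped by {drop:.2f} (limit {max_stable_prefix_drop:.2f})"
        )
    increase = current.lost_reusable_tokens - baseline.lost_reusable_tokens
    if increase > max_lost_token_increase:
        failures.append(
            f"lost reusable tokens rose by {increase} (limit {max_lost_token_increase})"
        )
    return failures


# Hypothetical snapshots for a known-good baseline and two candidate traces.
baseline = CacheabilitySnapshot(stable_prefix_ratio=0.92, lost_reusable_tokens=40)
regressed = CacheabilitySnapshot(stable_prefix_ratio=0.55, lost_reusable_tokens=610)
healthy = CacheabilitySnapshot(stable_prefix_ratio=0.90, lost_reusable_tokens=60)

regressed_failures = diff_against_baseline(baseline, regressed)
healthy_failures = diff_against_baseline(baseline, healthy)

for line in regressed_failures:
    print("FAIL:", line)
```

In CI, a non-empty failure list would translate into a non-zero exit code.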
python -m cachesentry.cli audit examples/traces/mixed_cache_breakers.jsonl --trace-wide --show-fix-recommendations

python -m cachesentry.cli baseline create examples/case_studies/cacheability_regression/baseline_trace.jsonl --provider-profile openai --output examples/case_studies/cacheability_regression/expected_baseline.json

python -m cachesentry.cli diff examples/case_studies/cacheability_regression/regressed_trace.jsonl --baseline examples/case_studies/cacheability_regression/expected_baseline.json --provider-profile openai --max-stable-prefix-drop 0.15 --max-lost-token-increase 100

CacheSentry can run fully offline in CI to detect prompt-cache regressions. Add this to your .github/workflows/cachesentry.yml:
name: CacheSentry Audit
on: [push, pull_request]
jobs:
  audit-cacheability:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Offline structural audit of a trace file
      - name: Run CacheSentry
        uses: PS4Emp/cachesentry@v0.3.0
        with:
          trace-path: 'examples/traces/mixed_cache_breakers.jsonl'
          model: 'Qwen/Qwen2.5-Coder-32B-Instruct'
          fail-on-severity: 'high'
          sarif-output: 'reports/cachesentry.sarif'
      # Optional: Upload SARIF report to GitHub Code Scanning
      - name: Upload SARIF report
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'reports/cachesentry.sarif'
          category: cachesentry

Note: The GitHub Action runs purely offline using structural analysis. It requires NO live API calls and NO OPENAI_API_KEY.
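For readers unfamiliar with SARIF, the sketch below builds a minimal SARIF 2.1.0 document from a list of findings. The finding fields, the `early-dynamic-field` rule id, and the `to_sarif` helper are hypothetical. The report that CacheSentry actually emits may contain different rules and properties.

```python
import json


def to_sarif(findings: list[dict], tool_name: str = "CacheSentry") -> dict:
    """Wrap findings in a minimal SARIF 2.1.0 envelope."""
    rule_ids = sorted({f["rule_id"] for f in findings})
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [
            {
                "tool": {
                    "driver": {
                        "name": tool_name,
                        "rules": [{"id": rule_id} for rule_id in rule_ids],
                    }
                },
                "results": [
                    {
                        "ruleId": f["rule_id"],
                        "level": f["level"],  # "note", "warning", or "error"
                        "message": {"text": f["message"]},
                        "locations": [
                            {
                                "physicalLocation": {
                                    "artifactLocation": {"uri": f["trace_path"]},
                                    "region": {"startLine": f["line"]},
                                }
                            }
                        ],
                    }
                    for f in findings
                ],
            }
        ],
    }


# Hypothetical finding: only the culprit field name is reported, never its value.
findings = [
    {
        "rule_id": "early-dynamic-field",
        "level": "error",
        "message": "Field 'request_id' appears before the stable prefix.",
        "trace_path": "examples/traces/mixed_cache_breakers.jsonl",
        "line": 3,
    }
]

report = to_sarif(findings)
sarif_text = json.dumps(report, indent=2)
print(sarif_text)
```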
Supported inputs:
- OpenAI-style chat/request traces
- LiteLLM logs
- OpenTelemetry GenAI traces
- sanitized observed runtime logs containing cached_tokens/cache_hit-style fields
CacheSentry today is a CI guardrail and runtime validation layer for prompt-cache regressions.
The long-term goal is to become a cacheability control plane for LLM applications:
- Layer 1: CI / PR guardrail — built
- Layer 2: Runtime validation / observed-signal correlation — built
- Layer 3: Org-wide cacheability control plane — future
Future direction:
- OTel processor/exporter
- prompt layout contracts
- cacheability budgets
- route-level cacheability trends
- team/service ownership
- dashboard after real users exist
CacheSentry now provides a privacy-safe LiteLLM callback plugin that observes request/response metadata, computes best-effort rolling cacheability signals, captures observed cache signals, and emits CacheSentry runtime events.
It operates entirely offline with no live API calls, drops raw prompts and sensitive fields, and enforces strict bounded in-memory state. See the Runtime Agent documentation for details.
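One way to keep runtime state bounded is a fixed-size window that stores only numeric token counts. The sketch below illustrates that idea with `collections.deque`. The `RollingCacheSignals` class, its methods, and the window size are hypothetical and do not describe the plugin's real interface.

```python
from collections import deque


class RollingCacheSignals:
    """Keep the last `window` (prompt_tokens, cached_tokens) pairs, nothing else."""

    def __init__(self, window: int = 1000) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # deque(maxlen=...) evicts the oldest entry automatically, so memory is bounded.
        self._events = deque(maxlen=window)

    def record(self, prompt_tokens: int, cached_tokens: int) -> None:
        prompt = max(prompt_tokens, 0)
        cached = max(min(cached_tokens, prompt), 0)  # cached can never exceed prompt
        self._events.append((prompt, cached))

    def cached_ratio(self) -> float:
        """Fraction of prompt tokens in the window that were served from cache."""
        total = sum(prompt for prompt, _ in self._events)
        if total == 0:
            return 0.0
        return sum(cached for _, cached in self._events) / total

    def __len__(self) -> int:
        return len(self._events)


signals = RollingCacheSignals(window=2)
signals.record(prompt_tokens=1000, cached_tokens=0)     # evicted once the window is full
signals.record(prompt_tokens=3000, cached_tokens=2816)
signals.record(prompt_tokens=3000, cached_tokens=2816)

print(len(signals), round(signals.cached_ratio(), 3))
```

Because only integers are stored, no prompt text or identifiers can leak out of this state.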
CacheSentry is aimed at:
- teams building RAG systems
- LLM agents
- long-context apps
- OpenAI/LiteLLM-based products
- teams using OpenTelemetry GenAI traces
- teams worried about LLM latency/cost regressions
- people maintaining prompt templates in CI
We are looking for beta testers! Please provide 10–50 sanitized request traces.
Preferred:
- OpenAI-style messages
- LiteLLM logs
- OpenTelemetry GenAI spans
- cached_tokens/cache_hit/response_cost fields if available
Do not send:
- API keys
- Authorization headers
- raw customer data
- private docs
- secrets
- unredacted prompts
CacheSentry operates entirely offline by default. In CI or GitHub Actions, it performs static structural analysis of your prompt prefixes and does NOT make live provider API calls.
CacheSentry is built with a strict privacy boundary:
- It never stores raw prompts, raw responses, headers, API keys, Authorization values, or raw cache keys in reports.
- It aggressively drops fields matching api_key, secret, bearer, and token during telemetry sanitization.
- SARIF and Markdown reports only contain the names of culprit fields, structural token counts, and file-level metrics. No sensitive request payloads are included.
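Key-based sanitization of this kind can be sketched as a recursive filter over nested dictionaries. The code below is an assumption about how such a filter might look, not CacheSentry's sanitizer. In particular, the allowlist that keeps numeric metric fields such as cached_tokens is a design choice made for this example, since a plain substring match on "token" would otherwise drop them.

```python
from typing import Any

# Substrings that mark a key as sensitive (case-insensitive).
SENSITIVE_MARKERS = ("api_key", "secret", "bearer", "token", "authorization")

# Hypothetical allowlist: numeric usage metrics that contain "token" but are safe.
SAFE_METRIC_KEYS = frozenset(
    {"cached_tokens", "prompt_tokens", "completion_tokens", "total_tokens"}
)


def is_sensitive(key: str) -> bool:
    lowered = key.lower()
    if lowered in SAFE_METRIC_KEYS:
        return False
    return any(marker in lowered for marker in SENSITIVE_MARKERS)


def sanitize(value: Any) -> Any:
    """Return a copy of value with sensitive keys removed at every nesting level."""
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items() if not is_sensitive(str(k))}
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value


# Hypothetical runtime event with made-up values.
event = {
    "model": "example-model",
    "api_key": "sk-example",
    "headers": {"Authorization": "Bearer abc", "x-trace": "1"},
    "usage": {"cached_tokens": 2816, "access_token": "x"},
}

clean = sanitize(event)
print(clean)
```

This filter inspects key names only. A secret stored under an innocuous key would pass through, so it complements, rather than replaces, sanitizing traces before sharing them.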
Please see our SECURITY.md and Security & Privacy Overview for more details.
CacheSentry is designed to avoid storing raw prompts, raw responses, headers, API keys, Authorization values, and raw cache keys in reports.
Caveat: CacheSentry detects structural cacheability regressions. It does not guarantee exact cache hits, cost savings, or latency reduction. Runtime behavior depends on provider/runtime policy, routing, TTL, eviction, isolation, prompt_cache_key, model, and cache state.