CacheSentry

Catch prompt-cache regressions before production.

CacheSentry is an open-source CI and runtime validation tool for LLM apps. It detects when prompt changes break reusable prefixes, fails CI when cacheability regresses, and compares its predictions against cache signals observed at runtime.

The Problem

LLM apps often use long prompts:

system instructions
tool schemas
retrieved documents
memory
policies
metadata
user context

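Prompt caching reuses stable prompt prefixes: a request can hit the cache only for the leading portion of the prompt that exactly matches an earlier request. One small dynamic field near the front can therefore silently break reuse for everything after it: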

timestamp
UUID
request_id
session_id
dynamic metadata
randomized tool/schema order

This can reduce cache reuse and raise latency and cost, often without being caught during PR review.
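
To make the failure mode concrete, here is a minimal sketch using only the Python standard library. The prompt layout and the `render` helper are hypothetical and not part of CacheSentry. It only illustrates that, under exact-prefix matching, nothing after the first differing character can be shared between two requests.

```python
import os
import uuid

# Hypothetical stable content standing in for system instructions and tool schemas.
SYSTEM = "You are a support assistant. Follow the policies below.\n" * 20


def render(request_id: str, early: bool) -> str:
    """Render a prompt with the dynamic field first (early) or last (late)."""
    tag = f"request_id={request_id}\n"
    return tag + SYSTEM if early else SYSTEM + tag


def shared_prefix_len(a: str, b: str) -> int:
    """Length in characters of the longest common prefix of two prompts."""
    return len(os.path.commonprefix([a, b]))


id_a, id_b = uuid.uuid4().hex, uuid.uuid4().hex

early = shared_prefix_len(render(id_a, early=True), render(id_b, early=True))
late = shared_prefix_len(render(id_a, early=False), render(id_b, early=False))

print(f"dynamic field first: {early} shared characters")
print(f"dynamic field last:  {late} shared characters")
```

With the field first, the two prompts share little more than the `request_id=` label. With the field last, the whole stable block is shared.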

What CacheSentry Does Today

analyzes OpenAI-style, LiteLLM, and OpenTelemetry GenAI traces
computes stable-prefix ratio
estimates lost reusable tokens
identifies culprit fields/kinds
creates known-good baselines
diffs current traces against baselines
fails CI when cacheability regresses
emits Markdown, JSON, GitHub annotations, and SARIF
supports provider-aware offline projections
validates predictions against observed runtime cache signals
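
The exact metric definitions are not spelled out here, so the following is only a sketch under stated assumptions: whitespace splitting stands in for a real tokenizer, the stable-prefix ratio is taken as shared-prefix length over the shortest prompt, and "lost reusable tokens" are tokens that match across requests but sit after the first mismatch. All function names are hypothetical.

```python
from itertools import takewhile
from typing import List


def tokenize(prompt: str) -> List[str]:
    """Whitespace split, a stand-in for the model's real tokenizer."""
    return prompt.split()


def stable_prefix_len(token_lists: List[List[str]]) -> int:
    """Count leading positions where every request has the same token."""
    columns = zip(*token_lists)
    return sum(1 for _ in takewhile(lambda col: len(set(col)) == 1, columns))


def stable_prefix_ratio(token_lists: List[List[str]]) -> float:
    """Shared-prefix length as a fraction of the shortest prompt."""
    shortest = min(len(tokens) for tokens in token_lists)
    return stable_prefix_len(token_lists) / shortest if shortest else 0.0


def lost_reusable_tokens(token_lists: List[List[str]]) -> int:
    """Tokens identical across requests but located after the first mismatch."""
    aligned = sum(1 for col in zip(*token_lists) if len(set(col)) == 1)
    return aligned - stable_prefix_len(token_lists)


broken = [tokenize(f"ts={t} You are a helpful assistant .") for t in (1, 2, 3)]
fixed = [tokenize(f"You are a helpful assistant . ts={t}") for t in (1, 2, 3)]

for name, trace in (("broken", broken), ("fixed", fixed)):
    print(name, round(stable_prefix_ratio(trace), 2), lost_reusable_tokens(trace))
```

In the broken layout the ratio is 0 and six tokens count as lost. Moving the timestamp to the end restores a six-token shared prefix with nothing lost.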

Live Validation Result

In one controlled live OpenAI validation run, cached_tokens moved in the expected direction:

| Case | Cached tokens | Interpretation |
| --- | --- | --- |
| Stable prompt variant | 2816 | cache reuse preserved |
| Broken early-UUID variant | 0 | early dynamic field broke reusable prefix |
| Fixed late-UUID variant | 2816 | cache reuse restored |
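
For readers who want to check the same signal in their own logs, here is a small sketch that reads `usage.prompt_tokens_details.cached_tokens` from a saved OpenAI-style response. It makes no API call. The `cached_tokens` helper is hypothetical, and the sample payloads contain only the cached-token values from the table above, with all other response fields omitted. OpenAI documents cache hits in 128-token increments for prompts of 1024 tokens or more, which is consistent with 2816 = 22 × 128.

```python
def cached_tokens(response: dict) -> int:
    """Return usage.prompt_tokens_details.cached_tokens, or 0 when absent."""
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


# Cached-token values copied from the table above; other usage fields are left out.
observed = {
    "stable": {"usage": {"prompt_tokens_details": {"cached_tokens": 2816}}},
    "early_uuid": {"usage": {"prompt_tokens_details": {"cached_tokens": 0}}},
    "late_uuid": {"usage": {"prompt_tokens_details": {"cached_tokens": 2816}}},
}

for case, response in observed.items():
    n = cached_tokens(response)
    note = "multiple of 128" if n % 128 == 0 else "not a multiple of 128"
    print(case, n, note)
```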

How it works

Trace/logs
→ normalize safely
→ render/tokenize prompt structure
→ compare stable prefixes
→ detect culprit
→ baseline/diff
→ CI/SARIF report
→ runtime validation against observed cache signals
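
The "detect culprit" step can be pictured as finding the first prompt segment that differs between two requests. This is only a sketch: representing a request as ordered (field name, rendered text) pairs and the `first_culprit` helper are assumptions made for illustration, not CacheSentry's internal model.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a request as ordered (field name, rendered text) pairs.
Segment = Tuple[str, str]


def first_culprit(a: List[Segment], b: List[Segment]) -> Optional[str]:
    """Name the first segment where two requests diverge, or None if they match."""
    for (name_a, text_a), (name_b, text_b) in zip(a, b):
        if name_a != name_b:
            return f"{name_a}/{name_b} (segment order changed)"
        if text_a != text_b:
            return name_a
    if len(a) != len(b):
        return "(segment count changed)"
    return None


request_1 = [("system", "You are a helpful assistant."),
             ("request_id", "7f3a"),
             ("tools", '[{"name": "search"}]')]
request_2 = [("system", "You are a helpful assistant."),
             ("request_id", "c91e"),
             ("tools", '[{"name": "search"}]')]

print(first_culprit(request_1, request_2))  # prints: request_id
```

Only the field name is reported, which matches the goal of keeping raw prompt text out of reports.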

Quickstart

A. Demo audit:

python -m cachesentry.cli audit examples/traces/mixed_cache_breakers.jsonl --trace-wide --show-fix-recommendations

B. Baseline create:

python -m cachesentry.cli baseline create examples/case_studies/cacheability_regression/baseline_trace.jsonl --provider-profile openai --output examples/case_studies/cacheability_regression/expected_baseline.json

C. Diff regression:

python -m cachesentry.cli diff examples/case_studies/cacheability_regression/regressed_trace.jsonl --baseline examples/case_studies/cacheability_regression/expected_baseline.json --provider-profile openai --max-stable-prefix-drop 0.15 --max-lost-token-increase 100
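
As a rough mental model of the two thresholds in the command above, the sketch below assumes `--max-stable-prefix-drop` bounds the absolute drop in stable-prefix ratio and `--max-lost-token-increase` bounds the absolute increase in lost tokens. The real semantics may differ. The `Metrics` class, the `check_regression` function, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    stable_prefix_ratio: float  # between 0.0 and 1.0
    lost_tokens: int


def check_regression(baseline: Metrics, current: Metrics,
                     max_stable_prefix_drop: float = 0.15,
                     max_lost_token_increase: int = 100) -> List[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    drop = baseline.stable_prefix_ratio - current.stable_prefix_ratio
    if drop > max_stable_prefix_drop:
        failures.append(f"stable-prefix ratio dropped by {drop:.2f}")
    increase = current.lost_tokens - baseline.lost_tokens
    if increase > max_lost_token_increase:
        failures.append(f"lost tokens increased by {increase}")
    return failures


# Illustrative numbers only, not taken from the bundled case study.
baseline = Metrics(stable_prefix_ratio=0.92, lost_tokens=40)
regressed = Metrics(stable_prefix_ratio=0.31, lost_tokens=900)

failures = check_regression(baseline, regressed)
print("FAIL" if failures else "PASS", failures)
```

A CI wrapper would typically turn a non-empty failure list into a non-zero exit code.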

D. GitHub Action (CI):

CacheSentry can run fully offline in CI to detect prompt-cache regressions. Add this to your .github/workflows/cachesentry.yml:

name: CacheSentry Audit

on: [push, pull_request]

jobs:
  audit-cacheability:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      # Offline structural audit of a trace file
      - name: Run CacheSentry
        uses: PS4Emp/cachesentry@v0.3.0
        with:
          trace-path: 'examples/traces/mixed_cache_breakers.jsonl'
          model: 'Qwen/Qwen2.5-Coder-32B-Instruct'
          fail-on-severity: 'high'
          sarif-output: 'reports/cachesentry.sarif'
          
      # Optional: Upload SARIF report to GitHub Code Scanning
      - name: Upload SARIF report
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'reports/cachesentry.sarif'
          category: cachesentry

Note: The GitHub Action runs purely offline using structural analysis. It requires NO live API calls and NO OPENAI_API_KEY.
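
For orientation, here is a sketch of the kind of SARIF 2.1.0 document that the upload step in the workflow above accepts. The top-level structure follows the SARIF specification. The rule id, the message wording, and the token count are hypothetical and are not CacheSentry's actual output.

```python
import json

RULE_ID = "cachesentry/early-dynamic-field"  # hypothetical rule id


def make_sarif(trace_path: str, culprit_field: str, lost_tokens: int) -> dict:
    """Build a minimal SARIF 2.1.0 log with a single result."""
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {
                "name": "CacheSentry",
                "rules": [{
                    "id": RULE_ID,
                    "shortDescription": {"text": "Dynamic field breaks the reusable prompt prefix"},
                }],
            }},
            "results": [{
                "ruleId": RULE_ID,
                "level": "error",
                # Only the field name and a token count appear, never prompt text.
                "message": {"text": f"Field '{culprit_field}' breaks the stable prefix; "
                                    f"about {lost_tokens} reusable tokens lost."},
                "locations": [{"physicalLocation": {
                    "artifactLocation": {"uri": trace_path},
                    "region": {"startLine": 1},
                }}],
            }],
        }],
    }


# The token count below is an illustrative value.
sarif = make_sarif("examples/traces/mixed_cache_breakers.jsonl", "request_id", 1200)
print(json.dumps(sarif, indent=2))
```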

Supported Inputs

OpenAI-style chat/request traces
LiteLLM logs
OpenTelemetry GenAI traces
sanitized observed runtime logs containing cached_tokens/cache_hit-style fields

Where this is going

CacheSentry today is a CI guardrail and runtime validation layer for prompt-cache regressions.

The long-term goal is to become a cacheability control plane for LLM applications:

Layer 1: CI / PR guardrail — built
Layer 2: Runtime validation / observed-signal correlation — built
Layer 3: Org-wide cacheability control plane — future

Future direction:

OTel processor/exporter
prompt layout contracts
cacheability budgets
route-level cacheability trends
team/service ownership
dashboard after real users exist

Runtime agent / LiteLLM callback preview

CacheSentry now provides a privacy-safe LiteLLM callback plugin that observes request/response metadata, computes best-effort rolling cacheability signals, captures observed cache signals, and emits CacheSentry runtime events.

It operates entirely offline with no live API calls, drops raw prompts and sensitive fields, and keeps its in-memory state strictly bounded. See the Runtime Agent documentation for details.
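
The two properties described above, no raw prompt retention and bounded memory, can be sketched with a fixed-size window of prefix digests. This is not the CacheSentry plugin and does not use the LiteLLM callback API. The `RollingPrefixMonitor` class, its parameters, and the repeat-rate signal are assumptions made for illustration.

```python
import hashlib
from collections import Counter, deque


class RollingPrefixMonitor:
    """Track how often recent requests share the same prompt prefix."""

    def __init__(self, window: int = 100, prefix_chars: int = 2000) -> None:
        # deque(maxlen=...) discards the oldest entry once the window is full.
        self._digests = deque(maxlen=window)
        self._prefix_chars = prefix_chars

    def observe(self, prompt: str) -> None:
        """Record a SHA-256 digest of the prefix; the prompt itself is not kept."""
        prefix = prompt[: self._prefix_chars].encode("utf-8")
        self._digests.append(hashlib.sha256(prefix).hexdigest())

    def repeat_rate(self) -> float:
        """Share of windowed requests that carry the most common prefix digest."""
        if not self._digests:
            return 0.0
        top_count = Counter(self._digests).most_common(1)[0][1]
        return top_count / len(self._digests)


stable = RollingPrefixMonitor(window=4, prefix_chars=16)
for user in ("alice", "bob", "carol"):
    stable.observe(f"SYSTEM PROMPT v1 | user={user}")

broken = RollingPrefixMonitor(window=4, prefix_chars=16)
for i in range(10):
    broken.observe(f"id={i} SYSTEM PROMPT v1")

print(stable.repeat_rate(), broken.repeat_rate())
```

A repeat rate near 1.0 suggests a stable prefix. A rate near 1/window suggests that something dynamic sits at the front of the prompt.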

Who should try this?

teams building RAG systems
teams building LLM agents
teams building long-context apps
teams shipping OpenAI/LiteLLM-based products
teams using OpenTelemetry GenAI traces
teams worried about LLM latency/cost regressions
people maintaining prompt templates in CI

Beta Users Wanted

We are looking for beta testers! Please provide 10–50 sanitized request traces.

Preferred:

OpenAI-style messages
LiteLLM logs
OpenTelemetry GenAI spans
cached_tokens/cache_hit/response_cost fields if available

Do not send:

API keys
Authorization headers
raw customer data
private docs
secrets
unredacted prompts

Privacy and Security

CacheSentry operates entirely offline by default. In CI or GitHub Actions, it performs static structural analysis of your prompt prefixes and does NOT make live provider API calls.

CacheSentry is built with a strict privacy boundary:

It never stores raw prompts, raw responses, headers, API keys, Authorization values, or raw cache keys in reports.
It aggressively drops fields matching api_key, secret, bearer, and token during telemetry sanitization.
SARIF and Markdown reports only contain the names of culprit fields, structural token counts, and file-level metrics. No sensitive request payloads are included.
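
A minimal sketch of key-based sanitization along these lines is shown below. It is an illustration, not CacheSentry's sanitizer: the `sanitize` function is hypothetical, matching is done on key names only, and the allowlist that keeps numeric counters such as `cached_tokens` (whose names contain "token") is an assumption.

```python
SENSITIVE = ("api_key", "secret", "bearer", "token")

# Assumption: numeric usage counters are kept even though their names contain "token".
ALLOWED = {"cached_tokens", "prompt_tokens", "completion_tokens"}


def sanitize(value):
    """Recursively drop dict entries whose key matches a sensitive pattern."""
    if isinstance(value, dict):
        return {
            key: sanitize(item)
            for key, item in value.items()
            if key in ALLOWED or not any(p in str(key).lower() for p in SENSITIVE)
        }
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value


event = {
    "route": "/chat",
    "api_key": "EXAMPLE-NOT-A-REAL-KEY",
    "metadata": {"session_token": "abc123", "team": "search"},
    "usage": {"cached_tokens": 2816},
}

print(sanitize(event))
```

Here `api_key` and `session_token` are dropped, while `route`, `team`, and the `cached_tokens` counter survive.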

Please see our SECURITY.md and Security & Privacy Overview for more details.

Privacy and Caveats

CacheSentry is designed to avoid storing raw prompts, raw responses, headers, API keys, Authorization values, and raw cache keys in reports.

Caveat: CacheSentry detects structural cacheability regressions. It does not guarantee exact cache hits, cost savings, or latency reduction. Runtime behavior depends on provider/runtime policy, routing, TTL, eviction, isolation, prompt_cache_key, model, and cache state.
