PS4Emp/cachesentry

CacheSentry

Catch prompt-cache regressions before production.

CacheSentry is an open-source CI and runtime validation tool for LLM apps. It detects when prompt changes break reusable prefixes, fails CI on regressions, and compares its predictions against real cache signals.


The Problem

LLM apps often use long prompts:

  • system instructions
  • tool schemas
  • retrieved documents
  • memory
  • policies
  • metadata
  • user context

Prompt caching reuses work only for a prompt prefix that is identical across requests, so a single dynamic field near the front can silently break reuse:

  • timestamp
  • UUID
  • request_id
  • session_id
  • dynamic metadata
  • randomized tool/schema order

This lowers cache reuse and can raise latency and cost, often without anyone catching it during PR review.
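To make the effect concrete, here is a minimal sketch that compares two requests and measures how much of the prompt they share as a leading prefix. It is illustrative only: whitespace splitting stands in for a real tokenizer, and `common_prefix_len` and `stable_prefix_ratio` are hypothetical helpers, not CacheSentry's API.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by both token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def stable_prefix_ratio(prompt_a, prompt_b):
    """Shared leading tokens divided by the shorter prompt's token count."""
    # Assumption: whitespace tokens approximate real tokenizer output.
    a, b = prompt_a.split(), prompt_b.split()
    shortest = min(len(a), len(b))
    return common_prefix_len(a, b) / shortest if shortest else 0.0


SYSTEM = "You are a support assistant. Follow the refund policy exactly."

# Dynamic field first: the two requests diverge at the very first token.
early_a = "request_id=7f3a " + SYSTEM + " User: where is my order?"
early_b = "request_id=c91d " + SYSTEM + " User: where is my order?"

# Dynamic field last: the shared instructions stay inside the common prefix.
late_a = SYSTEM + " User: where is my order? request_id=7f3a"
late_b = SYSTEM + " User: where is my order? request_id=c91d"

print(stable_prefix_ratio(early_a, early_b))  # 0.0
print(stable_prefix_ratio(late_a, late_b))    # 0.9375
```

The only difference between the two layouts is where `request_id` sits. Moving it to the end keeps 15 of 16 tokens in the shared prefix.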


What CacheSentry Does Today

  • analyzes OpenAI-style, LiteLLM, and OpenTelemetry GenAI traces
  • computes stable-prefix ratio
  • estimates lost reusable tokens
  • identifies culprit fields/kinds
  • creates known-good baselines
  • diffs current traces against baselines
  • fails CI when cacheability regresses
  • emits Markdown, JSON, GitHub annotations, and SARIF
  • supports provider-aware offline projections
  • validates predictions against observed runtime cache signals
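As a rough illustration of culprit identification and lost-token estimation, the sketch below treats each request as an ordered list of `(field, text)` segments. The segment representation, the helper names, and the definition of "lost reusable tokens" (identical segments stranded behind the first divergent field) are all assumptions made for this example, not CacheSentry's actual data model.

```python
def find_culprit(req_a, req_b):
    """Return (index, field_name) of the first segment that differs, or None."""
    for index, (seg_a, seg_b) in enumerate(zip(req_a, req_b)):
        if seg_a != seg_b:
            return index, seg_b[0]
    return None


def lost_reusable_tokens(req_a, req_b):
    """Count tokens in segments that are identical in both requests but sit
    after the culprit, so they can no longer be part of a shared prefix."""
    culprit = find_culprit(req_a, req_b)
    if culprit is None:
        return 0
    start = culprit[0] + 1
    return sum(
        len(seg_a[1].split())  # whitespace split as a tokenizer stand-in
        for seg_a, seg_b in zip(req_a[start:], req_b[start:])
        if seg_a == seg_b
    )


req_a = [
    ("system", "You are a helpful assistant."),
    ("request_id", "7f3a"),
    ("tools", "search lookup refund"),
    ("user", "where is my order"),
]
req_b = [
    ("system", "You are a helpful assistant."),
    ("request_id", "c91d"),
    ("tools", "search lookup refund"),
    ("user", "cancel my plan"),
]

print(find_culprit(req_a, req_b))          # (1, 'request_id')
print(lost_reusable_tokens(req_a, req_b))  # 3
```

Here the `tools` segment is identical in both requests, but it follows the changing `request_id`, so its three tokens are counted as lost.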

Live Validation Result

In one controlled live OpenAI validation run, cached_tokens moved in the expected direction:

| Case | Cached tokens | Interpretation |
| --- | --- | --- |
| Stable prompt variant | 2816 | cache reuse preserved |
| Broken early-UUID variant | 0 | early dynamic field broke reusable prefix |
| Fixed late-UUID variant | 2816 | cache reuse restored |
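For readers who want to run a similar check on their own logs, the following sketch reads the observed signal from an OpenAI-style `usage` payload and compares its direction with a prediction. OpenAI reports `cached_tokens` under `prompt_tokens_details` (Chat Completions) or `input_tokens_details` (Responses API). The helper names are hypothetical, and the comparison is only directional, as in the run above.

```python
def observed_cached_tokens(usage):
    """Return cached_tokens from an OpenAI-style usage dict, or None if absent."""
    for details_key in ("prompt_tokens_details", "input_tokens_details"):
        details = usage.get(details_key) or {}
        if "cached_tokens" in details:
            return int(details["cached_tokens"])
    return None


def direction_matches(predicted_reusable, usage):
    """True if 'cache reuse expected' agrees with cached_tokens > 0.

    Returns None when the payload carries no cache signal at all.
    """
    cached = observed_cached_tokens(usage)
    if cached is None:
        return None
    return (cached > 0) == predicted_reusable


# Values taken from the table above; the payload shape is trimmed to the one field used.
stable = {"prompt_tokens_details": {"cached_tokens": 2816}}
broken = {"prompt_tokens_details": {"cached_tokens": 0}}

print(direction_matches(True, stable))   # True
print(direction_matches(False, broken))  # True
print(direction_matches(True, {}))       # None
```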

How it works

Trace/logs
→ normalize safely
→ render/tokenize prompt structure
→ compare stable prefixes
→ detect culprit
→ baseline/diff
→ CI/SARIF report
→ runtime validation against observed cache signals


Quickstart

A. Demo audit:

python -m cachesentry.cli audit examples/traces/mixed_cache_breakers.jsonl --trace-wide --show-fix-recommendations

B. Baseline create:

python -m cachesentry.cli baseline create examples/case_studies/cacheability_regression/baseline_trace.jsonl --provider-profile openai --output examples/case_studies/cacheability_regression/expected_baseline.json

C. Diff regression:

python -m cachesentry.cli diff examples/case_studies/cacheability_regression/regressed_trace.jsonl --baseline examples/case_studies/cacheability_regression/expected_baseline.json --provider-profile openai --max-stable-prefix-drop 0.15 --max-lost-token-increase 100
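The two thresholds in the diff command can be read as a simple gate. The sketch below shows one plausible interpretation, assuming `--max-stable-prefix-drop` is an absolute drop in the stable-prefix ratio and `--max-lost-token-increase` is an absolute token count. The README does not spell out these semantics, and `Metrics` and `regression_failures` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    stable_prefix_ratio: float  # share of the prompt that forms a stable prefix
    lost_tokens: int            # estimated reusable tokens lost


def regression_failures(baseline, current,
                        max_stable_prefix_drop=0.15,
                        max_lost_token_increase=100):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    drop = baseline.stable_prefix_ratio - current.stable_prefix_ratio
    if drop > max_stable_prefix_drop:
        failures.append(
            f"stable-prefix ratio dropped by {drop:.2f} "
            f"(limit {max_stable_prefix_drop:.2f})"
        )
    increase = current.lost_tokens - baseline.lost_tokens
    if increase > max_lost_token_increase:
        failures.append(
            f"lost tokens increased by {increase} "
            f"(limit {max_lost_token_increase})"
        )
    return failures


# Illustrative numbers only, not measurements from the case study.
baseline = Metrics(stable_prefix_ratio=0.90, lost_tokens=20)
regressed = Metrics(stable_prefix_ratio=0.50, lost_tokens=400)

for message in regression_failures(baseline, regressed):
    print("FAIL:", message)
```

In a CI job, a non-empty failure list would typically translate into a non-zero exit code.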

D. GitHub Action (CI)

CacheSentry can run fully offline in CI to detect prompt-cache regressions. Add this to your .github/workflows/cachesentry.yml:

name: CacheSentry Audit

on: [push, pull_request]

jobs:
  audit-cacheability:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      # Offline structural audit of a trace file
      - name: Run CacheSentry
        uses: PS4Emp/cachesentry@v0.3.0
        with:
          trace-path: 'examples/traces/mixed_cache_breakers.jsonl'
          model: 'Qwen/Qwen2.5-Coder-32B-Instruct'
          fail-on-severity: 'high'
          sarif-output: 'reports/cachesentry.sarif'
          
      # Optional: Upload SARIF report to GitHub Code Scanning
      - name: Upload SARIF report
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'reports/cachesentry.sarif'
          category: cachesentry

Note: The GitHub Action runs purely offline using structural analysis. It requires NO live API calls and NO OPENAI_API_KEY.


Supported Inputs

  • OpenAI-style chat/request traces
  • LiteLLM logs
  • OpenTelemetry GenAI traces
  • sanitized observed runtime logs containing cached_tokens/cache_hit-style fields

Where this is going

CacheSentry today is a CI guardrail and runtime validation layer for prompt-cache regressions.

The long-term goal is to become a cacheability control plane for LLM applications:

  • Layer 1: CI / PR guardrail — built
  • Layer 2: Runtime validation / observed-signal correlation — built
  • Layer 3: Org-wide cacheability control plane — future

Future direction:

  • OTel processor/exporter
  • prompt layout contracts
  • cacheability budgets
  • route-level cacheability trends
  • team/service ownership
  • dashboard after real users exist

Runtime agent / LiteLLM callback preview

CacheSentry now provides a privacy-safe LiteLLM callback plugin that observes request/response metadata, computes best-effort rolling cacheability signals, captures observed cache signals, and emits CacheSentry runtime events.

It operates entirely offline with no live API calls, drops raw prompts and sensitive fields, and enforces strict bounded in-memory state. See the Runtime Agent documentation for details.
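One common way to keep rolling signals within bounded memory is a fixed-size window that stores only counters, never prompt text. The sketch below illustrates that idea with `collections.deque`. It is not the plugin's implementation, and `RollingCacheSignal` is a hypothetical name.

```python
from collections import deque


class RollingCacheSignal:
    """Tracks the cached-token fraction over the most recent requests."""

    def __init__(self, window=100):
        # deque(maxlen=...) discards the oldest sample automatically,
        # so memory use stays bounded regardless of traffic volume.
        self._samples = deque(maxlen=window)

    def record(self, prompt_tokens, cached_tokens):
        # Only integer counters are kept; no prompt or response content.
        self._samples.append((int(prompt_tokens), int(cached_tokens)))

    def __len__(self):
        return len(self._samples)

    def cached_fraction(self):
        """Cached tokens / prompt tokens over the window, or None if empty."""
        total = sum(prompt for prompt, _ in self._samples)
        if total == 0:
            return None
        return sum(cached for _, cached in self._samples) / total


signal = RollingCacheSignal(window=2)
signal.record(prompt_tokens=100, cached_tokens=0)
signal.record(prompt_tokens=100, cached_tokens=50)
signal.record(prompt_tokens=100, cached_tokens=100)  # evicts the first sample

print(len(signal))               # 2
print(signal.cached_fraction())  # 0.75
```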


Who should try this?

  • teams building RAG systems, LLM agents, or long-context apps
  • teams shipping OpenAI/LiteLLM-based products
  • teams using OpenTelemetry GenAI traces
  • teams worried about LLM latency/cost regressions
  • people maintaining prompt templates in CI

Beta Users Wanted

We are looking for beta testers! Please provide 10–50 sanitized request traces.

Preferred:

  • OpenAI-style messages
  • LiteLLM logs
  • OpenTelemetry GenAI spans
  • cached_tokens/cache_hit/response_cost fields if available

Do not send:

  • API keys
  • Authorization headers
  • raw customer data
  • private docs
  • secrets
  • unredacted prompts

Privacy and Security

CacheSentry operates entirely offline by default. In CI or GitHub Actions, it performs static structural analysis of your prompt prefixes and does NOT make live provider API calls.

CacheSentry is built with a strict privacy boundary:

  • It never stores raw prompts, raw responses, headers, API keys, Authorization values, or raw cache keys in reports.
  • It aggressively drops fields matching api_key, secret, bearer, and token during telemetry sanitization.
  • SARIF and Markdown reports only contain the names of culprit fields, structural token counts, and file-level metrics. No sensitive request payloads are included.
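A minimal sketch of name-based field dropping, in the spirit of the sanitization rule above, might look like the following. The marker list comes from the bullet above. The allowlist for numeric counters such as `cached_tokens` is an assumption added here so that a substring match on `token` does not also remove the cache signal. CacheSentry's actual rules may differ.

```python
SENSITIVE_MARKERS = ("api_key", "secret", "bearer", "token")

# Assumption: counters that contain "token" in their name but carry no secret.
ALLOWED_COUNTERS = {"cached_tokens", "prompt_tokens", "completion_tokens"}


def is_sensitive(field_name):
    """True if the field name matches a sensitive marker (case-insensitive)."""
    lowered = str(field_name).lower()
    if lowered in ALLOWED_COUNTERS:
        return False
    return any(marker in lowered for marker in SENSITIVE_MARKERS)


def sanitize(record):
    """Recursively drop sensitive fields from nested dicts and lists."""
    if isinstance(record, dict):
        return {
            key: sanitize(value)
            for key, value in record.items()
            if not is_sensitive(key)
        }
    if isinstance(record, list):
        return [sanitize(item) for item in record]
    return record


raw = {
    "model": "example-model",
    "api_key": "placeholder-value",
    "headers": {"X-Bearer-Token": "placeholder-value"},
    "usage": {"cached_tokens": 2816},
}

print(sanitize(raw))
# {'model': 'example-model', 'headers': {}, 'usage': {'cached_tokens': 2816}}
```

Dropping by field name is deliberately blunt: it can remove harmless fields, but it does not depend on inspecting values.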

Please see our SECURITY.md and Security & Privacy Overview for more details.


Privacy and Caveats

CacheSentry is designed to avoid storing raw prompts, raw responses, headers, API keys, Authorization values, and raw cache keys in reports.

Caveat: CacheSentry detects structural cacheability regressions. It does not guarantee exact cache hits, cost savings, or latency reduction. Runtime behavior depends on provider/runtime policy, routing, TTL, eviction, isolation, prompt_cache_key, model, and cache state.


Important Docs Links
