🥬 LettuceCache

Context-aware semantic cache for LLMs. Sits in front of OpenAI (or any LLM) and returns cached responses in 1–60 ms — without false hits, even when identical queries mean different things in different conversations.

The Problem

Traditional semantic caching matches on query text alone:

"What is the cancellation policy?" — from a hotel booking conversation
"What is the cancellation policy?" — from a gym membership conversation

Both are identical in text and nearly identical in embedding space. A naive cache serves the wrong answer. LettuceCache solves this by baking conversation context into every cache key.
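To make the idea concrete, here is a minimal sketch of a context-aware key, loosely following the `SHA-256(intent : domain : anon_user_scope : sorted_context)` signature shown in the Request Lifecycle. The function name, the `|` and `:` separators, and the lowercase/strip normalization are assumptions for illustration, not the project's actual `ContextBuilder` code.

```python
import hashlib


def context_signature(intent, domain, user_scope, context_turns):
    """Hypothetical helper: hash intent, domain, user scope, and sorted context."""
    # Sorting the turns makes the key independent of turn order.
    # The strip/lower normalization is an assumption, not taken from the source.
    canonical_context = "|".join(sorted(t.strip().lower() for t in context_turns))
    raw = ":".join([intent, domain, user_scope, canonical_context])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Same question, two different conversations.
hotel_key = context_signature(
    "cancellation_policy", "travel", "u_anon_1",
    ["I booked a room for Friday", "Can I change dates?"],
)
gym_key = context_signature(
    "cancellation_policy", "fitness", "u_anon_1",
    ["I signed up for a monthly plan"],
)
# Same hotel conversation with the turns supplied in a different order.
reordered_key = context_signature(
    "cancellation_policy", "travel", "u_anon_1",
    ["Can I change dates?", "I booked a room for Friday"],
)

print(hotel_key != gym_key)        # different conversations, different keys
print(hotel_key == reordered_key)  # same turns, different order, same key
```

Because the conversation context is part of the hashed input, two textually identical queries only share a key when their surrounding context also matches.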

Request Lifecycle

POST /query
  ↓
ContextBuilder  →  SHA-256(intent : domain : anon_user_scope : sorted_context)
  ↓
[L1] Redis exact-match lookup                    1–3 ms   → HIT: return
  ↓ miss
EmbeddingClient  →  Python sidecar (all-MiniLM-L6-v2, 384-dim)   20–50 ms (CPU)
  ↓
[L2] FAISS IVF+PQ ANN search (top-5)            1–3 ms
  ↓
ValidationService  →  0.60·cosine + 0.25·ctx + 0.15·domain ≥ 0.85
  (with TurboQuant: unbiased inner-product estimation — ENABLE_TURBO_QUANT=1)
  → HIT: Templatizer::render() fills {{SLOT_N}}, backfill L1, return  25–60 ms
  ↓ miss
LLM call (OpenAI gpt-4o-mini)                    500–2000 ms
  ↓ async (never blocks response)
CacheBuilderWorker
  ├─ ResponseQualityFilter  (skip conversational / session-bound / dynamic)
  └─ IntelligentAdmissionPolicy  (CVF = 0.30·freq + 0.25·cost + 0.25·quality + 0.20·novelty)
       ↓ admitted
  Templatizer  →  FAISS add  +  Redis SET  +  Redis slots SET
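The validation step in the lifecycle above can be sketched as a weighted sum compared against the 0.85 threshold. The weights and threshold come from this README; the function names and the assumption that all three inputs are already normalized to [0, 1] are illustrative only.

```python
def validation_score(cosine, ctx_match, domain_match):
    """Weighted score: 0.60*cosine + 0.25*ctx + 0.15*domain (inputs assumed in [0, 1])."""
    return 0.60 * cosine + 0.25 * ctx_match + 0.15 * domain_match


def is_valid_hit(cosine, ctx_match, domain_match, threshold=0.85):
    """Accept an L2 candidate only when the combined score reaches the threshold."""
    return validation_score(cosine, ctx_match, domain_match) >= threshold


# Identical text and domain, but unrelated context: 0.60 + 0.00 + 0.15 = 0.75
same_text_wrong_context = is_valid_hit(cosine=1.0, ctx_match=0.0, domain_match=1.0)

# Close paraphrase in a matching conversation: 0.552 + 0.225 + 0.15 = 0.927
paraphrase_right_context = is_valid_hit(cosine=0.92, ctx_match=0.9, domain_match=1.0)

print(same_text_wrong_context, paraphrase_right_context)
```

With these weights, a perfect cosine match in the right domain still tops out at 0.75 when the context score is zero, so text similarity alone cannot clear the 0.85 bar.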

Key Features

Feature	Description
Context isolation	Same query from different conversations never cross-hits
Two-level cache	L1 Redis exact match (1–3 ms) + L2 embedding + FAISS semantic search (25–60 ms)
TurboQuant compression	6.3× embedding compression (244 bytes vs 1536 bytes FP32 per vector) with provably unbiased inner products (arXiv:2504.19874)
Intelligent admission	Multi-signal Cache Value Function: weighted sum of frequency, cost, quality, and MMR novelty
Response quality filter	Rejects conversational, session-bound, refusal, and time-sensitive responses
Templatizer	Strips UUIDs/dates/numbers into `{{SLOT_N}}` for reuse; fills slots on serve
Metadata persistence	The FAISS index and its `.meta.json` metadata sidecar are written to disk, so cached entries survive restarts
Circuit breaker	EmbeddingClient fails fast when Python sidecar is down
Context canonicalization	Sorted context turns → order-independent cache keys
Adaptive threshold	Per-domain admission threshold adjusts based on observed hit rates
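The Templatizer row describes a strip-then-fill round trip. A rough sketch of that idea follows, using regular expressions; the patterns, the zero-based slot numbering, and the function names are assumptions for illustration and may differ from what `Templatizer::render()` does in the C++ code.

```python
import re

# Order matters: try the most specific pattern (UUID) first, plain numbers last.
_SLOT_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    r"|\d{4}-\d{2}-\d{2}"   # ISO-style date
    r"|\d+(?:\.\d+)?"       # integer or decimal number
)


def templatize(text):
    """Replace volatile values with {{SLOT_N}} and return (template, slot values)."""
    slots = []

    def _swap(match):
        slots.append(match.group(0))
        return "{{SLOT_%d}}" % (len(slots) - 1)

    return _SLOT_PATTERN.sub(_swap, text), slots


def render(template, slots):
    """Fill each {{SLOT_N}} placeholder with the value stored at index N."""
    return re.sub(r"\{\{SLOT_(\d+)\}\}", lambda m: slots[int(m.group(1))], template)


template, slots = templatize("Order 1234 ships on 2024-05-01.")
print(template)
print(render(template, slots))
```

Storing the template and the slot values separately mirrors the lifecycle's separate `Redis SET` and `Redis slots SET` writes.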

Quick Start

# Docker Compose (Redis + Python sidecar + C++ orchestrator)
export OPENAI_API_KEY=sk-...
docker compose up

# Verify
curl http://localhost:8080/health

# First query (cache miss, LLM called). Run it twice: admission requires 2 appearances.
curl -X POST http://localhost:8080/query \
  -H 'Content-Type: application/json' \
  -d '{"query":"What is machine learning?","domain":"tech"}'
# → {"cache_hit": false, "latency_ms": 712, ...}

# Third query (L2 semantic hit; subsequent identical calls hit L1 at ~2ms)
curl -X POST http://localhost:8080/query \
  -H 'Content-Type: application/json' \
  -d '{"query":"What is machine learning?","domain":"tech"}'
# → {"cache_hit": true, "confidence": 0.94, "latency_ms": 45, ...}
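The same request can be issued from Python using only the standard library. This is a sketch that assumes the stack from `docker compose up` is listening on `localhost:8080`; the helper names are hypothetical, and only the `query` and `domain` request fields shown in the curl examples are used.

```python
import json
import urllib.request


def build_query_request(query, domain, base_url="http://localhost:8080"):
    """Build the POST /query request without sending it."""
    body = json.dumps({"query": query, "domain": domain}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(query, domain, base_url="http://localhost:8080", timeout=30.0):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_query_request(query, domain, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a running stack, `send_query("What is machine learning?", "tech")` should return a dict carrying the `cache_hit` and `latency_ms` fields seen in the curl output above.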

Building from Source

# macOS
brew install cmake faiss hiredis openssl curl pkg-config

# Ubuntu 22.04+
sudo apt-get install -y cmake build-essential pkg-config \
  libfaiss-dev libhiredis-dev libcurl4-openssl-dev libssl-dev

cmake -B build -DCMAKE_BUILD_TYPE=Release
# nproc is Linux-only; fall back to sysctl on macOS
cmake --build build -j"$(nproc 2>/dev/null || sysctl -n hw.ncpu)"
./build/lettucecache

Testing

# Unit tests (52/52, no external deps)
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build --target unit_tests
cd build && ctest --output-on-failure

# Happy path integration test (starts stack automatically)
bash tests/happy_path.sh

# Live demo with real LLM
export OPENAI_API_KEY=sk-...
bash tests/live_demo.sh

Configuration

All runtime config via environment variables:

Variable	Default	Description
`REDIS_HOST`	`localhost`	Redis server hostname
`REDIS_PORT`	`6379`	Redis server port
`EMBED_URL`	`http://localhost:8001`	Python sidecar base URL
`EMBED_DIM`	`384`	Must match embedding model
`OPENAI_API_KEY`	(empty)	Empty → stub mode
`LLM_MODEL`	`gpt-4o-mini`	Any OpenAI chat model
`HTTP_PORT`	`8080`	Port the HTTP API listens on
`FAISS_INDEX_PATH`	`./faiss.index`	Also writes `.meta.json` sidecar
`ENABLE_TURBO_QUANT`	(unset)	`1` to enable TurboQuant compression
`CACHE_QUALITY_THRESHOLD`	`0.40`	Minimum quality score for admission
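For tooling that needs to mirror these settings (for example a test harness), the table can be read with its defaults as follows. This is an illustrative sketch: the variable names and defaults come from the table above, while `load_config` and the returned key names are hypothetical and not part of LettuceCache.

```python
import os

# Defaults copied from the configuration table.
DEFAULTS = {
    "REDIS_HOST": "localhost",
    "REDIS_PORT": "6379",
    "EMBED_URL": "http://localhost:8001",
    "EMBED_DIM": "384",
    "OPENAI_API_KEY": "",
    "LLM_MODEL": "gpt-4o-mini",
    "HTTP_PORT": "8080",
    "FAISS_INDEX_PATH": "./faiss.index",
    "CACHE_QUALITY_THRESHOLD": "0.40",
}


def load_config(env=None):
    """Resolve settings from a mapping (os.environ by default) with table defaults."""
    env = os.environ if env is None else env
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "redis_host": raw["REDIS_HOST"],
        "redis_port": int(raw["REDIS_PORT"]),
        "embed_url": raw["EMBED_URL"],
        "embed_dim": int(raw["EMBED_DIM"]),
        "llm_model": raw["LLM_MODEL"],
        "http_port": int(raw["HTTP_PORT"]),
        "faiss_index_path": raw["FAISS_INDEX_PATH"],
        "quality_threshold": float(raw["CACHE_QUALITY_THRESHOLD"]),
        # An empty API key means stub mode, per the table.
        "stub_mode": raw["OPENAI_API_KEY"] == "",
        # TurboQuant is off unless the variable is exactly "1".
        "turbo_quant": env.get("ENABLE_TURBO_QUANT") == "1",
    }


print(load_config({})["stub_mode"])
```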

Tuning (requires source change)

Parameter	Location	Default
Validation threshold	`src/api/HttpServer.cpp`	`0.85`
Scoring weights (cosine/ctx/domain)	`src/validation/ValidationService.h`	`0.60/0.25/0.15`
Admission CVF weights	`src/builder/IntelligentAdmissionPolicy.h`	`0.30/0.25/0.25/0.20`
FAISS NLIST / NPROBE / M_PQ	`src/cache/FaissVectorStore.h`	`100/10/8`
Admission half-life (seconds)	`IntelligentAdmissionConfig`	`120`

Intelligent Admission: Cache Value Function

The IntelligentAdmissionPolicy decides what gets cached using a multi-signal CVF:

CVF = 0.30 × frequency_score      (exponential-decay recency weighting)
    + 0.25 × generation_cost       (token_count × model_tier multiplier)
    + 0.25 × quality_score         (from ResponseQualityFilter)
    + 0.20 × novelty_score         (MMR: 1 − max_cosine_to_existing_cache)

Key behaviours:

An expensive GPT-4 response (500 tokens) can be admitted after just 1 request
Near-duplicate responses (cosine > 0.94) are hard-rejected to prevent cache pollution
Per-domain adaptive threshold: domains with >30% hit rate get a relaxed threshold
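The CVF and the near-duplicate rule can be sketched in a few lines. The weights, the 120-second half-life, and the 0.94 cutoff are taken from this README; the `saturation` constant used to squash the frequency score into [0, 1], and the assumption that cost and quality are already normalized, are hypothetical. The admission threshold itself is not stated here, so the sketch stops at computing the value.

```python
HALF_LIFE_S = 120.0          # admission half-life from the tuning table
NEAR_DUPLICATE_COSINE = 0.94  # hard-reject cutoff from the text above


def frequency_score(request_ages_s, half_life_s=HALF_LIFE_S, saturation=2.0):
    """Exponential-decay weighted request count, squashed into [0, 1].

    Each past request contributes 0.5 ** (age / half_life). The saturation
    constant is an assumption made for this sketch.
    """
    weight = sum(0.5 ** (age / half_life_s) for age in request_ages_s)
    return min(1.0, weight / saturation)


def cache_value(freq, cost, quality, max_cosine_to_cache):
    """Return the CVF, or None when the response is a near-duplicate."""
    if max_cosine_to_cache > NEAR_DUPLICATE_COSINE:
        return None  # hard reject: would pollute the cache
    novelty = 1.0 - max_cosine_to_cache  # MMR-style novelty
    return 0.30 * freq + 0.25 * cost + 0.25 * quality + 0.20 * novelty


# Two requests: one just now, one a half-life ago -> (1 + 0.5) / 2 = 0.75
freq = frequency_score([0.0, 120.0])
cvf = cache_value(freq, cost=0.8, quality=0.9, max_cosine_to_cache=0.5)
duplicate = cache_value(freq, cost=0.8, quality=0.9, max_cosine_to_cache=0.97)
print(freq, cvf, duplicate)
```

Because cost carries a quarter of the weight, a single request for an expensive, high-quality, novel response can score well even with a low frequency term, which is the behaviour the first bullet describes.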

TurboQuant (ENABLE_TURBO_QUANT=1)

Implements arXiv:2504.19874. Two-stage encoding:

MSE stage (3 bits): Randomized Hadamard Transform → Lloyd-Max scalar quantization
QJL stage (1 bit): sign(S·residual) corrects inner-product bias

Result: 244 bytes/vector vs 1536 bytes FP32 (6.3×), with zero bias on cosine estimates. The validation threshold of 0.85 remains statistically valid at 4-bit compression (σ ≈ 0.027).
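To illustrate why a single sign bit per projection can still give an unbiased inner product, here is a pure-Python sketch of the generic QJL-style estimator. It is not the project's `TurboQuantizer`: it skips the Hadamard rotation and Lloyd-Max stage, uses a dense Gaussian matrix for `S`, and uses far more projections than a real encoder would, purely so the estimate visibly converges. All names are hypothetical.

```python
import math
import random


def qjl_encode(residual, projections):
    """Keep one sign bit per projection row, plus the residual's norm."""
    signs = [
        1.0 if sum(g * r for g, r in zip(row, residual)) >= 0.0 else -1.0
        for row in projections
    ]
    norm = math.sqrt(sum(r * r for r in residual))
    return signs, norm


def qjl_inner_product(query, signs, norm, projections):
    """Estimate <query, residual> from the sign bits.

    For Gaussian g, E[sign(g.r) * (g.q)] = sqrt(2/pi) * <q, r> / ||r||,
    so scaling the average by sqrt(pi/2) * ||r|| removes the bias.
    """
    acc = sum(
        s * sum(g * q for g, q in zip(row, query))
        for s, row in zip(signs, projections)
    )
    return math.sqrt(math.pi / 2.0) * norm * acc / len(projections)


rng = random.Random(0)
dim, num_projections = 8, 20000
projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_projections)]

residual = [0.3, -0.1, 0.2, 0.0, 0.1, -0.2, 0.05, 0.15]
query = [0.5, 0.5, -0.5, 0.1, 0.2, 0.0, -0.3, 0.4]

exact = sum(q * r for q, r in zip(query, residual))
signs, norm = qjl_encode(residual, projections)
estimate = qjl_inner_product(query, signs, norm, projections)
print(exact, estimate)
```

The estimate fluctuates around the exact value rather than drifting to one side of it, which is the property that lets a fixed validation threshold stay meaningful under compression.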

Architecture

src/
├── api/            HttpServer (composition root, all wiring)
├── orchestrator/   QueryOrchestrator, ContextBuilder, ContextSignature
├── cache/          FaissVectorStore (IVectorStore), RedisCacheAdapter
├── embedding/      EmbeddingClient (persistent CURL, circuit breaker)
├── validation/     ValidationService (TurboQuant-aware cosine)
├── builder/        CacheBuilderWorker, IntelligentAdmissionPolicy,
│                   ResponseQualityFilter, AdmissionController, Templatizer
├── quantization/   TurboQuantizer (MSE + QJL)
└── llm/            LLMAdapter, OpenAIAdapter

Composition root: HttpServer owns all component lifetimes as unique_ptr. Everything else receives references.

Async write path: CacheBuilderWorker runs on a background thread with a std::condition_variable queue. The orchestrator calls enqueue() and returns the HTTP response without waiting.
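The enqueue-and-return pattern can be sketched in Python with `threading.Condition`, which plays the role of `std::condition_variable` here. This is an analogue for illustration, not a port of `CacheBuilderWorker`; the class name, the `stop()` method, and its drain-on-shutdown behaviour are assumptions.

```python
import threading
from collections import deque


class BackgroundWorker:
    """Runs handler(item) on a background thread; enqueue() never blocks on the handler."""

    def __init__(self, handler):
        self._handler = handler
        self._queue = deque()
        self._cond = threading.Condition()
        self._stopping = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def enqueue(self, item):
        # Only holds the lock long enough to append and wake the worker.
        with self._cond:
            self._queue.append(item)
            self._cond.notify()

    def stop(self):
        # Ask the worker to finish the remaining items, then wait for it.
        with self._cond:
            self._stopping = True
            self._cond.notify_all()
        self._thread.join()

    def _run(self):
        while True:
            with self._cond:
                while not self._queue and not self._stopping:
                    self._cond.wait()
                if not self._queue:
                    return  # stopping and fully drained
                item = self._queue.popleft()
            self._handler(item)  # slow work happens outside the lock


processed = []
worker = BackgroundWorker(processed.append)
for entry in ["response-a", "response-b", "response-c"]:
    worker.enqueue(entry)  # returns immediately, like the orchestrator's enqueue()
worker.stop()
print(processed)
```

Running the handler outside the lock keeps `enqueue()` cheap for the request thread even when building a cache entry is slow.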

Thread safety: FaissVectorStore uses std::shared_mutex (concurrent reads / exclusive writes). RedisCacheAdapter uses a single shared connection, which still needs a mutex (known open issue).

API

Endpoint	Description
`POST /query`	Main cache lookup + LLM fallback
`GET /health`	Dependency health (Redis, sidecar, FAISS count)
`GET /stats`	FAISS entry count + queue depth
`DELETE /cache/:key`	Evict a specific entry

See full API docs for request/response schemas.

Research References

Paper	Relevance
arXiv:2504.19874 — TurboQuant	Vector quantization, unbiased inner products
MeanCache (IPDPS 2025)	Context-aware caching, federated learning
vCache (arXiv:2502.03771)	Per-prompt adaptive thresholds
SCALM (IWQoS 2024)	Semantic pattern ranking, selective admission
LeCaR / CACHEUS	RL-based cache replacement, decay-weighted frequency
CacheSack (ATC 2022)	Cost-aware admission (Google Colossus)
MMR (SIGIR 1998)	Diversity/novelty scoring for cache eviction
LHD (NSDI 2018)	Hit-density multi-signal cache value function
