A production-grade control layer that sits between your application logic and any LLM — input validation, schema enforcement, circuit breaking, targeted retry, and audit logging in one composable pipeline.
Most LLM integrations stop at: write a prompt, call the model, use the response. This library handles what prompt engineering cannot — enforcing what the model actually returns, blocking what should never reach it, and recovering cleanly when things break.
Read the full write-up on Towards Data Science → Prompt Engineering Failed in Production — I Built the Control Layer That Actually Works
User Input
|
[1] InputGuard -- injection detection (20 patterns), length check, sanitization
|
[2] CircuitBreaker -- stops hammering a failing LLM backend
|
[3] TokenBudget -- tiktoken-accurate slot allocation, priority order
[4] PromptBuilder -- assembles prompt within budget, injects constraints
|
[5] LLMCaller -- enforces hard timeout on every call
|
[6] ResponseValidator -- JSON schema, length bounds, forbidden phrases, quality score
| [failed?]
[7] RetryEngine -- targeted prompt mutation per failure mode, jittered backoff
| [exhausted?]
[8] FallbackRouter -- cached response, template, or escalation chain
|
AuditLogger -- every attempt written to JSONL, thread-safe, persistent
|
ControlPacket -- response, attempts, latency, score, audit_id
| Component | Job |
|---|---|
| InputGuard | Blocks injection attempts and oversized input before any LLM call |
| CircuitBreaker | Opens after N consecutive failures; rejects calls instantly during recovery |
| TokenBudget | tiktoken-accurate slot-based allocator; prevents silent overflow |
| PromptBuilder | Assembles prompt in priority order with hard constraints injected structurally |
| LLMCaller | Wraps any callable LLM with thread-based timeout enforcement |
| ResponseValidator | Validates JSON structure, required keys, length, forbidden phrases |
| RetryEngine | Maps each failure mode to a targeted mutation hint; jittered exponential backoff |
| FallbackRouter | Registered fallback chain; first non-empty response wins |
| AuditLogger | Thread-safe JSONL audit log; P50/P90/P99 latency stats; failure distribution |
git clone https://github.com/Emmimal/control-layer.git
cd control-layer
pip install tiktoken tenacity pydantic structlog # required
pip install pytest # optional — for running testsNo ML dependencies. No GPU required. All functionality runs on the Python standard library plus the four packages above.
from control_layer import ControlLayer, ControlLayerConfig, ResponseSchema
# Define your output contract
schema = ResponseSchema(
must_be_json=True,
required_keys=["summary", "confidence"],
max_length=400,
forbidden_phrases=["I cannot", "As an AI"],
)
# Configure the layer
config = ControlLayerConfig(
total_tokens=800,
max_attempts=3,
timeout_seconds=30.0,
cb_failure_threshold=5,
cb_recovery_seconds=30.0,
)
# Swap in any LLM callable — OpenAI, Anthropic, local model, mock
def your_llm_call(prompt: str) -> str:
...
layer = ControlLayer(
llm_fn=your_llm_call,
system_prompt="You are a structured research assistant.",
schema=schema,
config=config,
)
# Register fallbacks — called in order when retries exhaust
layer.register_fallback(
"cache",
lambda q: '{"summary": "Cached response.", "confidence": 0.5}',
)
# Run
packet = layer.run(
user_input="How does token budget allocation work?",
constraints=[
"Return only valid JSON.",
"Include 'summary' and 'confidence' keys.",
"No markdown fencing.",
],
context=retrieved_documents, # optional RAG context
)
print(packet.response) # final response
print(packet.validation.passed) # True / False
print(packet.attempts) # 1, 2, or 3
print(packet.total_latency_ms) # end-to-end latency
print(packet.audit_id) # ties all log lines to this requestFive runnable demos covering every failure mode and recovery path. No API key required.
The MockLLM simulates realistic failure behavior at a configurable rate.
python demo.py| Demo | What It Shows |
|---|---|
| 1 | Input guard blocking 7 of 8 inputs — injection, empty, oversized |
| 2 | Schema enforcement with retry — 75% first-attempt failure rate, mutation hints |
| 3 | Constraint violation recovery — length and forbidden phrase, 3 attempts |
| 4 | Fallback router — exhausted retries route to cached response |
| 5 | Benchmark — naive 0% pass rate vs control layer 100%, latency breakdown |
Running Demo 5 also generates control_layer_benchmark.png — a 6-panel benchmark figure
showing pass rate, failure mode distribution, retry distribution, latency percentiles,
token budget allocation, and quality score histogram.
pytest tests/ -vTestInputGuard 14 tests PASSED
TestTokenBudget 5 tests PASSED
TestPromptBuilder 6 tests PASSED
TestResponseValidator 10 tests PASSED
TestCircuitBreaker 5 tests PASSED
TestRetryEngine 6 tests PASSED
TestFallbackRouter 4 tests PASSED
TestLLMCaller 2 tests PASSED
TestAuditLogger 5 tests PASSED
TestControlLayerIntegration 8 tests PASSED
TestPydanticConfig 4 tests PASSED
69 passed in 1.19s
Every component is tested in isolation. Integration tests cover the full orchestration path: first-attempt success, retry on schema violation, fallback after exhausted retries, circuit breaker rejection after consecutive timeouts, and Pydantic config validation errors.
ControlLayerConfig(
# Token budget
total_tokens=800, # Total token budget for prompt assembly
model_name="cl100k_base", # tiktoken encoding name
# Input validation
max_input_chars=2000, # Hard limit on user input length
# LLM call
timeout_seconds=30.0, # Hard timeout per LLM call
# Retry
max_attempts=3, # Maximum retry attempts per request
base_delay_ms=50.0, # Base exponential backoff delay
max_delay_ms=2000.0, # Maximum backoff delay
jitter_ms=25.0, # Random jitter added to each delay
# Circuit breaker
cb_failure_threshold=5, # Consecutive failures before opening
cb_recovery_seconds=30.0, # Seconds before attempting recovery
# Audit
audit_log_path="audit.jsonl", # JSONL audit log path
)ResponseSchema(
must_be_json=False, # Require valid JSON response
required_keys=[], # Keys that must appear in JSON output
max_length=None, # Maximum response length in characters
min_length=None, # Minimum response length in characters
forbidden_phrases=[], # Phrases that must not appear in response
must_contain=[], # Phrases that must appear (used for quality score)
)The llm_fn parameter accepts any callable that takes a str and returns a str.
# OpenAI
import openai
client = openai.OpenAI()
def openai_call(prompt: str) -> str:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
)
return response.choices[0].message.content
layer = ControlLayer(llm_fn=openai_call, ...)
# Anthropic
import anthropic
client = anthropic.Anthropic()
def claude_call(prompt: str) -> str:
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}],
)
return response.content[0].text
layer = ControlLayer(llm_fn=claude_call, ...)
# Any local model
layer = ControlLayer(llm_fn=lambda prompt: your_local_model.generate(prompt), ...)control-layer/
├── control_layer.py # All eight components + ControlLayer orchestrator
├── demo.py # Five runnable demos + benchmark charts
├── tests/
│ └── test_control_layer.py # 69 tests across all components
├── audit.jsonl # Generated on first run (append-only audit log)
├── control_layer_benchmark.png # Generated by demo.py
└── README.md
Measured on Python 3.12.6, Windows 11, CPU only, no GPU. Ten structured output queries, 55% first-attempt failure rate.
| Metric | Naive | Control Layer |
|---|---|---|
| Pass rate | 0% | 100% |
| Min latency (ms) | 37.3 | 46.2 |
| Median latency (ms) | 43.3 | 143.5 |
| Mean latency (ms) | 42.9 | 139.8 |
| P90 latency (ms) | 45.6 | 168.0 |
| Max latency (ms) | 48.4 | 281.9 |
| Resolved on attempt 1 | N/A | 2 |
| Resolved on attempt 2 | N/A | 7 |
| Resolved on attempt 3+ | N/A | 1 |
Component overhead (excluding LLM call):
| Operation | Latency | Notes |
|---|---|---|
| InputGuard validation | ~0.2ms | 20 regex patterns |
| tiktoken count (100 tokens) | ~0.8ms | Encoding lookup |
| PromptBuilder.build() | ~1.1ms | Budget allocation + assembly |
| ResponseValidator.validate() | ~0.3ms | JSON parse + rule checks |
| CircuitBreaker.is_open() | ~0.05ms | Lock acquire + state check |
| AuditLogger.log() | ~0.4ms | Lock + file append |
| Total non-LLM overhead | ~2.9ms | Per request |
The LLM call dominates every other number. The control layer adds under 3ms of overhead per request, which is within the variance of a single network round-trip.
Worth it when you have:
- LLM responses that drive downstream code — JSON parsed programmatically, data written to a database, outputs shown to users without human review
- User input passed to an LLM without a validation layer in between
- Structured output requirements the model violates intermittently
- Production systems where a LLM outage would block threads or hang requests
Skip it when you have:
- Single-turn, low-stakes use cases where a bad response is displayed and discarded
- Hard latency requirements under 50ms — retry delays alone can exceed this
- A chatbot where the user sees the raw model output and can judge it themselves
Injection patterns are not exhaustive. Twenty patterns cover the OWASP LLM Top 10 attack taxonomy. Adversarial prompts crafted to avoid known patterns will pass. Combine with embedding-based anomaly detection for high-risk deployments.
Circuit breaker state is in-process only. A restart resets the circuit to CLOSED regardless of backend status. For multi-instance deployments, share circuit state via Redis or a similar low-latency store.
No streaming support. The LLMCaller collects the full response before validation.
Streaming APIs require partial validation heuristics or full response buffering — neither
is implemented.
Quality score uses phrase matching, not semantic similarity. must_contain checks
exact string presence. A response that paraphrases a required concept without using the
exact phrase scores zero. Swap in an embedding-based scorer for higher precision.
AuditLogger grows unbounded. The JSONL file appends on every call. In production, ship it to object storage on a rolling basis and rotate locally.
Same series — production layers for LLM systems:
-
RAG Is Blind to Time — I Built a Temporal Layer to Fix It in Production — temporal awareness layer for RAG systems that treats time as a first-class retrieval signal.
-
LLM Evals Are Based on Vibes — I Built the Missing Layer That Decides What Ships — evaluation layer that replaces gut-feel shipping decisions with measurable output quality gates.
-
PyTorch NaNs Are Silent Killers — I Built a 3ms Hook to Catch Them at the Exact Layer — lightweight hook that catches NaN propagation at the exact layer it originates, in under 3ms overhead.
-
context-engine — retrieval, re-ranking, memory decay, and token budget control for RAG systems. The control layer handles what the model returns. The context engine handles what it receives. They compose.
MIT