Enterprise-grade offline LLM API simulator for testing and development.
LLM-Simulator provides a drop-in replacement for production LLM APIs, enabling cost-effective, deterministic, and comprehensive testing of LLM-powered applications. It simulates OpenAI, Anthropic, and Google Gemini APIs with realistic latency, streaming support, and chaos engineering capabilities.
- OpenAI - Chat completions, embeddings, and models endpoints (`/v1/chat/completions`, `/v1/embeddings`, `/v1/models`)
- Anthropic - Messages API (`/v1/messages`)
- Google Gemini - Generate content API (`/v1/models/{model}:generateContent`)
- Latency Modeling - Statistical distributions (log-normal, exponential, Pareto) for time-to-first-token (TTFT) and inter-token latency (ITL); see the sketch after this list
- Token-by-Token Streaming - Server-Sent Events (SSE) with realistic inter-token delays
- Deterministic Mode - Seed-based RNG for reproducible tests
- Error Injection - Configurable error rates and types (rate limits, timeouts, server errors)
- Circuit Breaker - Simulate service degradation and recovery
- Model-Specific Rules - Target chaos to specific models or endpoints
- API Key Authentication - Role-based access control (admin, user, readonly)
- Rate Limiting - Token bucket algorithm with configurable tiers
- CORS Support - Configurable origins and headers
- Security Headers - Production-ready security header configuration
- OpenTelemetry Integration - Distributed tracing with OTLP export
- Prometheus Metrics - Request counts, latencies, error rates
- Structured Logging - JSON log format with trace correlation
- Health Endpoints - Liveness (`/health`) and readiness (`/ready`) probes
- 10,000+ RPS - Optimized async architecture
- <5ms Overhead - Minimal latency impact
- Graceful Shutdown - Connection draining support
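
To make the latency and determinism features concrete, here is a small standalone sketch (not the simulator's internal code) that samples a log-normal TTFT and exponential inter-token delays from a seeded RNG, so a fixed seed reproduces the exact same delays. The `rand`/`rand_distr` crates and the specific parameters are illustrative assumptions.

```rust
use rand::{rngs::StdRng, SeedableRng};
use rand_distr::{Distribution, Exp, LogNormal};

fn main() {
    // Fixed seed: the same seed always reproduces the same delays, as in --seed mode.
    let mut rng = StdRng::seed_from_u64(42);

    // Log-normal TTFT; the parameters are the mean/std-dev of the underlying normal
    // (in ln-milliseconds), chosen here so the median lands near 200 ms.
    let ttft = LogNormal::new(5.3, 0.25).expect("valid parameters");
    // Exponential inter-token latency with a ~30 ms mean (rate = 1/mean).
    let itl = Exp::new(1.0 / 30.0).expect("valid rate");

    let ttft_ms: f64 = ttft.sample(&mut rng);
    let itl_ms: Vec<f64> = (0..5).map(|_| itl.sample(&mut rng)).collect();
    println!("TTFT: {ttft_ms:.1} ms, first inter-token delays: {itl_ms:?}");
}
```

Swapping the distribution or its parameters corresponds to editing a latency profile in the configuration shown later.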

```bash
# Clone the repository
git clone https://github.com/llm-devops/llm-simulator.git
cd llm-simulator
# Build release binary
cargo build --release
# Binary will be at ./target/release/llm-simulator
```

Requirements:

- Rust 1.75 or later
- Linux, macOS, or Windows

```bash
# Start with default settings
llm-simulator serve
# Start with custom port and chaos enabled
llm-simulator serve --port 9090 --chaos --chaos-probability 0.1
# Start with authentication
llm-simulator serve --require-auth --api-key "sk-test-key"
# Start with deterministic responses
llm-simulator serve --seed 42
```

```bash
# OpenAI-compatible chat completion
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Anthropic-compatible messages
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "claude-3-5-sonnet-20241022",
"max_tokens": 256,
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Streaming response
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Tell me a story"}],
"stream": true
}'
```

| Command | Alias | Description |
|---|---|---|
| `serve` | `s` | Start the simulator server |
| `generate` | `gen` | Generate test data or responses |
| `config` | `cfg` | Configuration management |
| `health` | - | Health check a running instance |
| `models` | - | Show available models |
| `benchmark` | `bench` | Benchmark the simulator |
| `client` | - | Send requests to a running instance |
| `version` | - | Show version and build information |

```text
llm-simulator serve [OPTIONS]
Options:
-p, --port <PORT> Port to listen on [default: 8080]
--host <HOST> Host to bind to [default: 0.0.0.0]
--chaos Enable chaos engineering
--chaos-probability <P> Chaos probability (0.0-1.0)
--no-latency Disable latency simulation
--latency-multiplier <M> Latency multiplier (1.0 = normal)
--seed <SEED> Fixed seed for deterministic behavior
--require-auth Enable API key authentication
--api-key <KEY> API key for authentication
--max-concurrent <N> Maximum concurrent requests
--timeout <SECONDS> Request timeout
--otlp-endpoint <URL> OpenTelemetry endpoint
    --workers <N>             Worker threads (default: CPU count)
```

```bash
# Generate a chat completion
llm-simulator generate chat --model gpt-4 --message "Hello" --format json
# Generate embeddings
llm-simulator generate embedding --text "Hello world" --dimensions 1536
# Generate sample configuration
llm-simulator generate config --format yaml --full
# Generate sample requests for testing
llm-simulator generate requests --count 100 --provider openai
```

```bash
# Send a chat request
llm-simulator client chat --url http://localhost:8080 --model gpt-4 "Hello!"
# Interactive chat session
llm-simulator client interactive --model gpt-4 --system "You are helpful"
# Generate embeddings
llm-simulator client embed --text "Hello world"
```

```bash
# Single health check
llm-simulator health --url http://localhost:8080
# Watch mode with 5-second interval
llm-simulator health --url http://localhost:8080 --watch --interval 5
# Check readiness
llm-simulator health --url http://localhost:8080 --ready
```

```bash
# Basic benchmark
llm-simulator benchmark --url http://localhost:8080 --requests 1000
# High concurrency benchmark
llm-simulator benchmark --requests 10000 --concurrency 100 --model gpt-4
# Duration-based benchmark
llm-simulator benchmark --duration 60 --concurrency 50
```

The project includes a Rust SDK for programmatic access:

```rust
use llm_simulator::sdk::{Client, Provider};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
// Create a client
let client = Client::builder()
.base_url("http://localhost:8080")
.api_key("sk-test-key")
.default_model("gpt-4")
.timeout(std::time::Duration::from_secs(30))
.max_retries(3)
.build()?;
// Send a chat completion request
let response = client
.chat()
.model("gpt-4")
.system("You are a helpful assistant.")
.message("What is the capital of France?")
.temperature(0.7)
.max_tokens(100)
.send()
.await?;
println!("Response: {}", response.content());
println!("Tokens used: {}", response.total_tokens());
Ok(())
}
```

Streaming:

```rust
use futures::StreamExt;
use llm_simulator::sdk::Client;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let client = Client::new("http://localhost:8080")?;
let mut stream = client
.stream()
.model("gpt-4")
.message("Tell me a story")
.start()
.await?;
while let Some(chunk) = stream.next().await {
if let Ok(c) = chunk {
print!("{}", c.content);
}
}
Ok(())
}
```

Embeddings:

```rust
use llm_simulator::sdk::Client;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let client = Client::new("http://localhost:8080")?;
let result = client
.embeddings()
.model("text-embedding-3-small")
.input("Hello, world!")
.dimensions(1536)
.send()
.await?;
println!("Embedding dimensions: {}", result.dimensions());
println!("Tokens used: {}", result.total_tokens());
Ok(())
}
```

Create `llm-simulator.yaml`:

```yaml
server:
host: "0.0.0.0"
port: 8080
max_concurrent_requests: 10000
request_timeout: 300s
cors_enabled: true
cors_origins: ["*"]
latency:
enabled: true
multiplier: 1.0
profiles:
default:
ttft:
distribution: log_normal
mean_ms: 200
std_dev_ms: 50
itl:
distribution: exponential
mean_ms: 30
chaos:
enabled: false
default_probability: 0.0
rules: []
security:
api_keys:
enabled: false
keys: []
rate_limiting:
enabled: true
default_tier: standard
cors:
enabled: true
allowed_origins: ["*"]
telemetry:
enabled: true
log_level: info
json_logs: false
trace_requests: true
metrics_path: /metrics
default_provider: openai
seed: null # Set for deterministic behavior
```

| Variable | Description | Default |
|---|---|---|
| `LLM_SIMULATOR_PORT` | Server port | 8080 |
| `LLM_SIMULATOR_HOST` | Server host | 0.0.0.0 |
| `LLM_SIMULATOR_CONFIG` | Config file path | - |
| `LLM_SIMULATOR_SEED` | Random seed | - |
| `LLM_SIMULATOR_CHAOS` | Enable chaos | false |
| `LLM_SIMULATOR_NO_LATENCY` | Disable latency | false |
| `LLM_SIMULATOR_LOG_LEVEL` | Log level | info |
| `LLM_SIMULATOR_JSON_LOGS` | JSON log format | false |
| `LLM_SIMULATOR_REQUIRE_AUTH` | Require auth | false |
| `LLM_SIMULATOR_API_KEY` | API key | - |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP endpoint | - |
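
The `rate_limiting` tiers in the YAML configuration above are described as token buckets. As a rough mental model only (a standalone sketch, not the simulator's implementation; the burst size and refill rate are made-up numbers), a bucket admits a request while tokens remain and refills at a fixed rate:

```rust
use std::time::Instant;

/// Minimal token bucket: up to `capacity` tokens, refilled at `rate` tokens/second.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, rate: f64) -> Self {
        Self { capacity, tokens: capacity, rate, last: Instant::now() }
    }

    /// Returns true if the request is admitted, false if it should be rejected.
    fn try_acquire(&mut self, cost: f64) -> bool {
        let now = Instant::now();
        // Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = (self.tokens + self.rate * now.duration_since(self.last).as_secs_f64())
            .min(self.capacity);
        self.last = now;
        if self.tokens >= cost {
            self.tokens -= cost;
            true
        } else {
            false
        }
    }
}

fn main() {
    // A hypothetical "standard" tier: burst of 10 requests, refill at 5 requests/second.
    let mut bucket = TokenBucket::new(10.0, 5.0);
    let admitted = (0..15).filter(|_| bucket.try_acquire(1.0)).count();
    println!("admitted {admitted} of 15 back-to-back requests");
}
```

A "429 Too Many Requests"-style rejection corresponds to `try_acquire` returning false.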
- OpenAI: `gpt-4`, `gpt-4-turbo`, `gpt-4o`, `gpt-4o-mini`, `gpt-3.5-turbo`, `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large`
- Anthropic: `claude-3-5-sonnet-20241022`, `claude-3-opus-20240229`, `claude-3-sonnet-20240229`, `claude-3-haiku-20240307`
- Google Gemini: `gemini-1.5-pro`, `gemini-1.5-flash`
OpenAI-compatible:

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completions |
| `/v1/embeddings` | POST | Generate embeddings |
| `/v1/models` | GET | List models |
| `/v1/models/{id}` | GET | Get model details |
Anthropic-compatible:

| Endpoint | Method | Description |
|---|---|---|
| `/v1/messages` | POST | Messages API |
Gemini-compatible:

| Endpoint | Method | Description |
|---|---|---|
| `/v1/models/{model}:generateContent` | POST | Generate content |
| `/v1beta/models/{model}:generateContent` | POST | Beta endpoint |
Monitoring:

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Liveness check |
| `/ready` | GET | Readiness check |
| `/metrics` | GET | Prometheus metrics |
| `/version` | GET | Version info |
Admin:

| Endpoint | Method | Description |
|---|---|---|
| `/admin/config` | GET | Current config |
| `/admin/stats` | GET | Runtime statistics |
| `/admin/chaos` | GET/POST | Chaos status |
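
For quick scripting against the monitoring and admin endpoints, any HTTP client will do. Below is a minimal sketch using the `reqwest` and `tokio` crates (a crate choice assumed for illustration, not required by the project) that reads the liveness probe and the current chaos status:

```rust
// Cargo.toml (assumed): reqwest = "0.12", tokio = { version = "1", features = ["full"] }
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = Client::new();
    let base = "http://localhost:8080";

    // Liveness probe: any 2xx status means the simulator process is up.
    let health = client.get(format!("{base}/health")).send().await?;
    println!("/health -> {}", health.status());

    // Chaos status is readable over GET; print the raw JSON body.
    let chaos = client.get(format!("{base}/admin/chaos")).send().await?;
    println!("/admin/chaos -> {}", chaos.text().await?);

    Ok(())
}
```

If the server was started with `--require-auth`, an admin-role API key would also need to be attached to these requests.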

```bash
# Run all tests
cargo test
# Run with output
cargo test -- --nocapture
# Run specific test suite
cargo test --test integration_tests
cargo test --test property_tests
# Run benchmarks
cargo bench
```

Benchmark results on a typical development machine:
| Metric | Value |
|---|---|
| Throughput | 15,000+ RPS |
| P50 Latency | 0.8ms |
| P99 Latency | 3.2ms |
| Memory Usage | ~50MB base |
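
The numbers above come from the `cargo bench` target shown in the testing commands. If you want to add micro-benchmarks of your own, a minimal Criterion harness could look like the following; the benchmarked function, file name, and bench name are placeholders, and Criterion itself is an assumed dev-dependency rather than something confirmed by this repository:

```rust
// benches/tokenize.rs (hypothetical file; requires criterion as a dev-dependency
// and a `[[bench]]` entry with `harness = false` in Cargo.toml).
use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder workload standing in for whatever you want to measure.
fn count_whitespace_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

fn bench_tokenize(c: &mut Criterion) {
    let prompt = "The quick brown fox jumps over the lazy dog. ".repeat(64);
    c.bench_function("count_whitespace_tokens", |b| {
        b.iter(|| count_whitespace_tokens(std::hint::black_box(prompt.as_str())))
    });
}

criterion_group!(benches, bench_tokenize);
criterion_main!(benches);
```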
This project is licensed under the LLM DevOps Permanent Source-Available License. See LICENSE for details.
Contributions are welcome! Please read our contributing guidelines before submitting pull requests.
- GitHub Issues: Report bugs or request features
- Documentation: See the `/docs` directory for detailed guides