Run AI inference at peak performance.
PeakInfer scans your code. Finds every LLM call. Shows you exactly what's holding back your latency, throughput, and reliability.
30 seconds. Zero config. Real numbers.
```bash
npm install -g @kalmantic/peakinfer
peakinfer analyze .
```

Your code says `streaming: true`. Your runtime shows 0% streams.
That's drift—and it's killing your latency.
| What You Think | What's Actually Happening |
|---|---|
| Streaming enabled | Blocking calls |
| Fast responses | p95 latency 5x slower than benchmarks |
| Retry logic works | Never triggered |
| Fallbacks ready | Never tested |
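The first row is the classic case. Below is a minimal TypeScript sketch of what that drift looks like with the OpenAI Node SDK; the `onToken` callback is a stand-in for however your app delivers partial output, not a PeakInfer API.

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Anti-pattern: a stream is requested, but every chunk is buffered before
// anything is used -- the user still waits for the full response.
async function blockingDespiteStreaming(prompt: string): Promise<string> {
  const stream = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    stream: true, // the code says streaming...
  });
  let full = "";
  for await (const chunk of stream) {
    full += chunk.choices[0]?.delta?.content ?? "";
  }
  return full; // ...but nothing is delivered until the whole response arrives
}

// Fix: forward tokens as they arrive, so time-to-first-token drops.
async function actuallyStreaming(prompt: string, onToken: (t: string) => void) {
  const stream = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) onToken(token); // e.g. write to the HTTP response
  }
}
```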
Static analysis sees code. Monitoring sees requests. Neither sees the gap.
PeakInfer sees both.
Peak Inference Performance means improving latency, throughput, reliability, and cost without changing evaluated behavior.
No other tool correlates all four signals that PeakInfer sees:
```
CODE               RUNTIME            BENCHMARKS                    EVALS
────               ───────            ──────────                    ─────
What you           What actually      The upper bound               Your quality
declared           happened           of possible                   gate

streaming: true    0% streaming       InferenceMAX: "extraction"    94%
model: gpt-4o      p95: 2400ms        gpt-4o p95: 1200ms            accuracy
        └──────────────────┴──────────────────┴───────────────────┘
                                  │
                                  ▼
                              PEAKINFER
                             (correlation)
```
PeakInfer analyzes every inference point across four dimensions:
| Dimension | What We Find | Typical Improvement |
|---|---|---|
| Latency | Missing streaming, blocking calls, p95 gaps | 50-80% faster |
| Throughput | Sequential loops, no batching | 10-50x improvement |
| Reliability | No retry, no fallback, no timeout | 99%+ uptime |
| Cost | Wrong model for the job | 60-90% reduction |
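To make the Throughput row concrete, here is a sketch of the sequential-loop fix in TypeScript. The `summarize` helper is a hypothetical stand-in for one LLM call, and the concurrency limit is illustrative, not a PeakInfer default.

```typescript
// Hypothetical: wraps a single provider call per document.
async function summarize(doc: string): Promise<string> {
  return doc.slice(0, 100); // placeholder for the real LLM call
}

// Before: one request at a time; total latency scales with docs.length.
async function sequential(docs: string[]): Promise<string[]> {
  const out: string[] = [];
  for (const doc of docs) out.push(await summarize(doc));
  return out;
}

// After: bounded parallelism; total latency scales with docs.length / limit.
async function parallel(docs: string[], limit = 8): Promise<string[]> {
  const out: string[] = new Array(docs.length);
  let next = 0;
  async function worker() {
    // Safe without locks: JS is single-threaded, and there is no await
    // between reading and incrementing `next`.
    while (next < docs.length) {
      const i = next++;
      out[i] = await summarize(docs[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, docs.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return out;
}
```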
```bash
peakinfer analyze ./src
```

Finds every inference point. OpenAI, Anthropic, Azure, Bedrock, self-hosted. All of them.
```
7 inference points found
39 issues detected

LATENCY:
- Streaming configured but not consumed (p95: 2400ms, should be 400ms)
- Blocking calls in hot path (6x latency penalty)

THROUGHPUT:
- Sequential batch processing (50x throughput opportunity)

RELIABILITY:
- Zero error handling across all LLM calls
- No fallback on critical inference path

QUICK WINS:
- Enable streaming consumption: -80% latency
- Add retry logic: +99% reliability
- Parallelize batch: 50x throughput
```
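The retry and fallback quick wins usually reduce to a pattern like the sketch below. `callModel` is a hypothetical wrapper around your provider SDK, and the backoff constants are illustrative.

```typescript
// Hypothetical: dispatches one request to the named model via your SDK.
async function callModel(model: string, prompt: string): Promise<string> {
  throw new Error("wire up your provider SDK here");
}

async function withRetryAndFallback(
  prompt: string,
  models = ["gpt-4o", "gpt-4o-mini"], // primary first, fallback second
  maxAttempts = 3,
): Promise<string> {
  for (const model of models) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return await callModel(model, prompt);
      } catch {
        if (attempt < maxAttempts) {
          // Exponential backoff with jitter before the next attempt.
          const delay = 2 ** attempt * 250 + Math.random() * 100;
          await new Promise((r) => setTimeout(r, delay));
        }
      }
    }
    // All attempts on this model failed: fall through to the next one.
  }
  throw new Error("all models exhausted");
}
```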
Add to every PR:

```yaml
- uses: kalmantic/peakinfer-action@v1
  with:
    path: ./src
    token: ${{ secrets.PEAKINFER_TOKEN }}
```

```bash
npm install -g @kalmantic/peakinfer
```

Requires Node.js 18+.
PeakInfer uses Claude for semantic analysis. You provide your own Anthropic API key (BYOK mode).
- Go to console.anthropic.com
- Create an account or sign in
- Navigate to API Keys and create a new key
- Copy the key (starts with `sk-ant-`)
Option A: Environment File (Recommended)

```bash
# .env
ANTHROPIC_API_KEY=sk-ant-your-key-here
```

Option B: Shell Export

```bash
export ANTHROPIC_API_KEY=sk-ant-your-key-here
```

Then run:

```bash
peakinfer analyze . --verbose
```

BYOK Mode: Your API key, your costs, full transparency. Analysis runs locally. No data sent to PeakInfer servers.
```bash
# Basic scan
peakinfer analyze .

# With code fix suggestions
peakinfer analyze . --fixes

# With HTML report
peakinfer analyze . --html --open

# Compare to InferenceMAX benchmarks
peakinfer analyze . --benchmark

# With runtime correlation (drift detection)
peakinfer analyze . --events production.jsonl

# Fetch runtime from observability platforms
peakinfer analyze . --runtime helicone --runtime-key $HELICONE_KEY

# Full analysis
peakinfer analyze . --fixes --benchmark --html --open
```

| Flag | Description |
|---|---|
| **Output** | |
| `--fixes` | Show code fix suggestions for each issue |
| `--html` | Generate HTML report |
| `--pdf` | Generate PDF report |
| `--open` | Auto-open report in browser/viewer |
| `--output <format>` | Output format: `text`, `json`, or `inference-map` |
| `--verbose` | Show detailed analysis logs |
| **Runtime Data** | |
| `--events <file>` | Path to runtime events file (JSONL) |
| `--events-url <url>` | URL to fetch runtime events |
| `--runtime <source>` | Fetch from: `helicone`, `langsmith` |
| `--runtime-key <key>` | API key for runtime source |
| `--runtime-days <n>` | Days of runtime data (default: 7) |
| **Comparison** | |
| `--compare [runId]` | Compare with previous analysis run |
| `--benchmark` | Compare to InferenceMAX benchmarks |
| `--predict` | Generate deploy-time latency predictions |
| `--target-p95 <ms>` | Target p95 latency for budget calculation |
| **Cost Control** | |
| `--estimate` | Show cost estimate before analysis |
| `--yes` | Auto-proceed without confirmation |
| `--max-cost <dollars>` | Skip if estimated cost exceeds threshold |
| `--cached` | View previous analysis (offline) |
Every PR. Every merge. Automatic.
```yaml
name: PeakInfer
on: [pull_request]

permissions:
  contents: read
  pull-requests: write

jobs:
  analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: kalmantic/peakinfer-action@v1
        with:
          token: ${{ secrets.PEAKINFER_TOKEN }}
          github-token: ${{ github.token }}
```

See peakinfer-action for full documentation.
PeakInfer's real power: correlating code with runtime behavior.
```bash
# From file
peakinfer analyze ./src --events events.jsonl

# From Helicone
peakinfer analyze ./src --runtime helicone --runtime-key $HELICONE_KEY

# From LangSmith
peakinfer analyze ./src --runtime langsmith --runtime-key $LANGSMITH_KEY
```

Supported formats: JSONL, JSON, CSV, OpenTelemetry, Jaeger, Zipkin, LangSmith, LiteLLM, Helicone.
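If your stack is not already instrumented, one low-tech way to produce an events file is to append one JSON object per completed call. The field names in this sketch are assumptions for illustration only; they are not a documented PeakInfer schema.

```typescript
import { appendFileSync } from "node:fs";

// Hypothetical event shape -- field names are illustrative assumptions.
interface InferenceEvent {
  ts: number;
  model: string;
  latencyMs: number;
  streamed: boolean;
  inputTokens: number;
  outputTokens: number;
  error?: string;
}

// JSONL: one JSON object per line, appended as each call completes.
function logInferenceEvent(event: Omit<InferenceEvent, "ts">): void {
  const line = JSON.stringify({ ts: Date.now(), ...event });
  appendFileSync("production.jsonl", line + "\n");
}

// Usage: wrap each provider call and record what actually happened.
async function timedCall(prompt: string): Promise<void> {
  const start = Date.now();
  // const response = await client.chat.completions.create({ ... });
  logInferenceEvent({
    model: "gpt-4o",
    latencyMs: Date.now() - start,
    streamed: false, // what the runtime did, not what the code declared
    inputTokens: 812, // illustrative values
    outputTokens: 164,
  });
}
```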
| Provider | Status |
|---|---|
| OpenAI | Full support |
| Anthropic | Full support |
| Azure OpenAI | Full support |
| AWS Bedrock | Full support |
| Google Vertex | Full support |
| vLLM / TensorRT-LLM | HTTP detection |
| LangChain / LlamaIndex | Framework support |
43 templates across two categories:

- **Detect issues:** streaming drift, overpowered model, context accumulation, token underutilization, retry explosion, untested fallback, dead code, and more.
- **Actionable fixes:** model routing, batch utilization, prompt caching, vLLM high-throughput, GPTQ quantization, TensorRT-LLM, multi-provider fallback, auto-scaling, and more.
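As one concrete example, the prompt caching fix typically maps to the Anthropic SDK's `cache_control` blocks, sketched below under the assumption of a large system prompt reused across requests; the model name and prompt text are placeholders.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY

// Placeholder: a long, rarely-changing prefix worth caching.
const STABLE_SYSTEM_PROMPT = "...several thousand tokens of instructions...";

async function cachedCall(userMessage: string) {
  return anthropic.messages.create({
    model: "claude-3-5-sonnet-latest", // any caching-capable model
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: STABLE_SYSTEM_PROMPT,
        // Mark the prefix as cacheable so repeat calls skip reprocessing it.
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userMessage }],
  });
}
```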
CLI: Free forever. BYOK — you provide your Anthropic API key.
GitHub Action:
- Free: 50 credits one-time (6-month expiry)
- Starter: $19 for 200 credits
- Growth: $49 for 600 credits
- Scale: $149 for 2,000 credits
- Mega: $499 for 10,000 credits
No subscriptions. No per-seat pricing. Team pooling.
| Feature | Status |
|---|---|
| Unified Prompt-Based Analysis | ✅ |
| GitHub Action with PR Comments | ✅ |
| Code Fix Suggestions | ✅ |
| Runtime Drift Detection | ✅ |
| InferenceMAX Benchmark Comparison | ✅ |
| 43 Optimization Templates | ✅ |
| Run History & Comparison | ✅ |
| BYOK Mode (CLI) | ✅ |
Built by Kalmantic. Apache-2.0 license.