Automated LLM serving parameter optimization using LLM agents.
Inspired by Andrej Karpathy's autoresearch, which lets AI agents autonomously iterate on LLM training code to minimize validation loss. Autooptimizer applies the same autonomous experiment loop idea to a different domain: instead of optimizing training code, it optimizes serving parameters to maximize inference throughput and minimize latency.
| Framework | Status |
|---|---|
| vLLM | Supported |
| SGLang | Supported |
| TensorRT-LLM | Planned |
An LLM agent iteratively:
- Proposes hypotheses for improving serving performance
- Modifies the framework's serve configuration
- Starts the server, runs benchmarks, records results
- Keeps improvements, reverts failures
- Loops indefinitely until manually stopped
Goal: Maximize score, a composite metric that combines throughput and latency.
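The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: `LoopState`, `run_loop`, `toy_propose`, and `toy_benchmark` are hypothetical names, the toy scoring function is invented for the example, and the loop is bounded by `max_iters`, whereas the real agent runs until it is stopped manually.

```python
import random
from dataclasses import dataclass, field


@dataclass
class LoopState:
    """Best configuration seen so far, plus a log of every trial."""
    best_config: dict
    best_score: float
    history: list = field(default_factory=list)


def run_loop(state, propose, benchmark, max_iters):
    """Keep a candidate only if it scores strictly higher than the current best."""
    for i in range(max_iters):
        candidate = propose(state.best_config)
        try:
            score = benchmark(candidate)
        except RuntimeError:
            # Server failed to start or died mid-benchmark: log it and move on.
            state.history.append((i, None, "crash"))
            continue
        if score > state.best_score:
            state.best_config, state.best_score = candidate, score
            state.history.append((i, score, "keep"))
        else:
            # No improvement: best_config stays untouched, i.e. a revert.
            state.history.append((i, score, "discard"))
    return state


def toy_propose(config):
    """Hypothetical stand-in for the agent: nudge one parameter up or down."""
    new = dict(config)
    new["max_num_seqs"] = max(1, config["max_num_seqs"] + random.choice([-64, 64]))
    return new


def toy_benchmark(config):
    """Hypothetical stand-in for the benchmark: peaks at 256, crashes above 512."""
    if config["max_num_seqs"] > 512:
        raise RuntimeError("simulated out-of-memory crash")
    return -abs(config["max_num_seqs"] - 256)
```

Because a candidate replaces the best configuration only on a strict improvement, `best_score` never decreases over a run, whichever proposals the agent makes.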
autooptimizer/
├── memory/ # Agent instructions and context
│ ├── overview.md # Target model, framework, goal, best known config
│ ├── experiment_loop.md # Experiment workflow
│ ├── setup.md # How to start a new run
│ ├── rules.md # Allowed/prohibited changes
│ ├── search_strategy_vllm.md # Hypothesis strategy for vLLM params
│ └── search_strategy_sglang.md # Hypothesis strategy for SGLang params
├── project/
│ ├── edit/
│ │ ├── vllm_serve_config.sh # Editable vLLM serve command
│ │ └── sglang_serve_config.sh # Editable SGLang serve command
│ └── no_edit/
│ ├── benchmark.py # Fixed benchmark runner & scoring
│ └── adapters/ # Framework adapters (routing logic)
│ ├── __init__.py
│ ├── vllm.py
│ └── sglang.py
├── artifacts/
│ ├── hypothesis_backlog.tsv # Experiment queue
│ └── results.tsv # Experiment results log
├── pyproject.toml
└── README.md
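As a rough sketch of how the adapters directory could route a framework name to its editable serve config, the example below uses a small registry. The class and function names (`FrameworkAdapter`, `VLLMAdapter`, `SGLangAdapter`, `get_adapter`) are assumptions made for illustration; only the two file paths come from the layout above, and the real adapters may expose a different interface.

```python
from typing import Protocol


class FrameworkAdapter(Protocol):
    """Hypothetical interface: what the benchmark needs to know per framework."""
    name: str

    def serve_config_path(self) -> str: ...


class VLLMAdapter:
    name = "vllm"

    def serve_config_path(self) -> str:
        return "project/edit/vllm_serve_config.sh"


class SGLangAdapter:
    name = "sglang"

    def serve_config_path(self) -> str:
        return "project/edit/sglang_serve_config.sh"


# Registry keyed by framework name; supporting a new framework means adding one entry.
ADAPTERS = {adapter.name: adapter for adapter in (VLLMAdapter(), SGLangAdapter())}


def get_adapter(framework: str) -> FrameworkAdapter:
    """Look up the adapter for a framework name, ignoring case and whitespace."""
    try:
        return ADAPTERS[framework.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported framework: {framework!r}") from None
```

Keeping the routing in a lookup table means the benchmark code never branches on the framework name itself.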
# For vLLM optimization
uv sync --extra vllm
# For SGLang optimization
uv sync --extra sglang
# Both
uv sync --extra vllm --extra sglang
In memory/overview.md, set the FRAMEWORK line:
FRAMEWORK=vllm
or
FRAMEWORK=sglang
cursor agent --yolo "Hi, refresh your memory with .md files in @memory folder, and let's kick off a new experiment! Let's do the setup first."
The agent will:
- Create a new experiment branch (e.g., `autooptimizer/apr11`)
- Run baseline benchmark
- Begin the experiment loop
# Start server (example for vLLM)
bash project/edit/vllm_serve_config.sh > server.log 2>&1 &
# Wait + benchmark + kill
uv run project/no_edit/benchmark.py run
Set the target model and framework in memory/overview.md:
MODEL=Qwen/Qwen2.5-1.5B-Instruct
FRAMEWORK=vllm
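A script that needs these two settings could read them along the following lines. This sketch assumes each setting sits on its own `KEY=VALUE` line, as in the snippet above; the function name `read_settings` and the validation against the two supported frameworks are illustrative and not part of the project.

```python
import re

# Matches lines such as "MODEL=Qwen/Qwen2.5-1.5B-Instruct" or "FRAMEWORK = vllm".
_SETTING = re.compile(r"^\s*(MODEL|FRAMEWORK)\s*=\s*(\S+)\s*$")
SUPPORTED_FRAMEWORKS = {"vllm", "sglang"}


def read_settings(text: str) -> dict:
    """Extract MODEL and FRAMEWORK from the text of overview.md.

    If a key appears more than once, the last occurrence wins.
    """
    settings = {}
    for line in text.splitlines():
        match = _SETTING.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    framework = settings.get("FRAMEWORK")
    if framework not in SUPPORTED_FRAMEWORKS:
        raise ValueError(f"FRAMEWORK must be one of {sorted(SUPPORTED_FRAMEWORKS)}, got {framework!r}")
    if "MODEL" not in settings:
        raise ValueError("MODEL is not set")
    return settings
```

Rejecting an unknown or missing framework up front avoids starting a run that would only fail once the server is launched.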
If you have a known-good configuration from previous experiments, update the "Best Known Configuration" section in memory/overview.md.
- Editable: Only the active framework's serve config in `project/edit/`
- Metric: Optimize `score` only (higher is better)
- Model: Fixed per experiment run (set in `memory/overview.md`)
- Framework: Fixed per experiment run (set in `memory/overview.md`)
Tab-separated file (artifacts/results.tsv):
experiment score memory_gb status description
Status values: keep, discard, crash
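For illustration, the log format above can be written and queried with Python's standard `csv` module. The helper names `append_result` and `best_kept` are hypothetical, and the sketch works on any text stream, so it makes no assumption about how the real benchmark writes the file.

```python
import csv
import io

FIELDS = ["experiment", "score", "memory_gb", "status", "description"]
STATUSES = {"keep", "discard", "crash"}


def append_result(stream, row, write_header=False):
    """Write one result row as tab-separated values, validating the status."""
    if row["status"] not in STATUSES:
        raise ValueError(f"unknown status: {row['status']!r}")
    writer = csv.DictWriter(stream, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    if write_header:
        writer.writeheader()
    writer.writerow(row)


def best_kept(stream):
    """Return the highest-scoring row with status 'keep', or None if there is none."""
    reader = csv.DictReader(stream, delimiter="\t")
    kept = [row for row in reader if row["status"] == "keep"]
    return max(kept, key=lambda row: float(row["score"]), default=None)
```

Only `keep` rows are considered when looking for the best result, since `discard` rows were reverted and `crash` rows may have no usable score.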
- Autonomous experiment loop with LLM agent
- vLLM serve parameter tuning
- SGLang serve parameter tuning
- Multi-framework support with adapter pattern
- Deterministic benchmarking with fixed seeds
- Hypothesis backlog with priority-based ordering
- Automatic keep/revert based on score improvement
- TensorRT-LLM support
- Unified cross-framework benchmark comparison (`--unified` flag)
- Smarter search: Bayesian optimization, parameter interaction detection, Pareto frontier visualization
- Hardware-aware profiles: GPU auto-detection, per-family defaults (A100, H100, L40S, ...)
- Production tooling: web dashboard, exportable configs (Docker/K8s), CI/CD integration
MIT