KernelAgent turns PyTorch programs into verified Triton kernels. It was designed around KernelBench workloads and combines:
- Static problem analysis to decide whether to run a lightweight path or a full pipeline
- LLM‑assisted refactoring that isolates fusable subgraphs
- Parallel Triton kernel generation with strict runtime verification
- End‑to‑end composition that rebuilds the original forward pass using only the synthesized kernels
Blog post: [TBD] • Additional docs: coming soon
Every stage writes artifacts to a run directory under `.fuse/<run_id>/`, including the fused PyTorch code, `subgraphs.json`, individual KernelAgent sessions, and the final `compose_out/composed_kernel.py`.
- Linux or macOS; CUDA‑capable GPU for Triton execution
- Python 3.8–3.12
- Triton (install separately: `pip install triton` or nightly from source)
- At least one LLM provider:
  - OpenAI (`OPENAI_API_KEY`, models like `o4-mini`, `gpt-5`)
  - Anthropic (`ANTHROPIC_API_KEY`; default fallback model is `claude-sonnet-4-20250514` when `OPENAI_MODEL` is unset)
  - Any OpenAI-compatible relay endpoint (`LLM_RELAY_URL`, optional `LLM_RELAY_API_KEY`; see `triton_kernel_agent/providers/relay_provider.py`)
- Gradio (UI dependencies; installed as part of the core package)
- PyTorch (https://pytorch.org/get-started/locally/)
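Before running the agents, a quick sanity check like the one below (not part of KernelAgent, just a convenience sketch) confirms that PyTorch sees a CUDA device and that Triton imports cleanly:

```python
# Quick environment sanity check (illustrative; not shipped with the repo).
import torch

assert torch.cuda.is_available(), "Triton kernel execution needs a CUDA-capable GPU"
print("CUDA device:", torch.cuda.get_device_name(0))

import triton
print("Triton version:", triton.__version__)
```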
Install from source:

```bash
git clone https://github.com/pytorch-labs/KernelAgent.git
cd KernelAgent
python -m venv .venv && source .venv/bin/activate  # choose your own env manager
pip install -e .[dev]   # project + tooling deps
pip install triton      # not part of extras; install the version you need

# (optional) Install KernelBench for problem examples
git clone https://github.com/ScalingIntelligence/KernelBench.git
```

You can export keys directly or use an `.env` file that the CLIs load automatically:
```bash
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-5            # override default fallback (claude-sonnet-4-20250514)
NUM_KERNEL_SEEDS=4            # parallel workers per kernel
MAX_REFINEMENT_ROUNDS=10      # retry budget per worker
LOG_LEVEL=INFO

# Optional relay configuration for self-hosted gateways
# LLM_RELAY_URL=http://127.0.0.1:11434
# LLM_RELAY_API_KEY=your-relay-token
# LLM_RELAY_TIMEOUT_S=120
```

More knobs live in `triton_kernel_agent/agent.py` and `Fuser/config.py`.
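Because these are ordinary environment variables, they can also be set from Python before constructing the agent. A minimal sketch, assuming only the variable names documented above (everything else is illustrative):

```python
# Illustrative: set the documented environment variables programmatically
# before creating the agent (a .env file or shell exports work just as well).
import os

os.environ.setdefault("OPENAI_API_KEY", "sk-...")     # placeholder key
os.environ.setdefault("OPENAI_MODEL", "gpt-5")
os.environ.setdefault("NUM_KERNEL_SEEDS", "4")        # parallel workers per kernel
os.environ.setdefault("MAX_REFINEMENT_ROUNDS", "10")  # retry budget per worker
os.environ.setdefault("LOG_LEVEL", "INFO")

from triton_kernel_agent import TritonKernelAgent

agent = TritonKernelAgent()  # see the Quick Start example for explicit constructor arguments
```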
- Auto-route a KernelBench problem — static analysis picks between the direct KernelAgent path and the full Fuser pipeline, with automatic fallback if the first attempt fails:

  ```bash
  python -m Fuser.auto_agent \
      --problem /abs/path/to/KernelBench/level1/19_ReLU.py \
      --verify  # ensure final composition test runs
  ```
- Manually run the pipeline (extract → dispatch → compose) when you want explicit control over models or concurrency:

  ```bash
  python -m Fuser.pipeline \
      --problem /abs/path/to/problem.py \
      --extract-model gpt-5 \
      --dispatch-model o4-mini \
      --dispatch-jobs auto \
      --compose-model o4-mini \
      --workers 4 \
      --max-iters 5 \
      --verify
  ```

  `--dispatch-jobs auto` matches the number of discovered subgraphs; artifacts are placed under `.fuse/<run_id>/`.
- Direct KernelAgent run — bypass Fuser and provide a plain-language problem description or a KernelBench snippet:

  ```python
  from triton_kernel_agent import TritonKernelAgent

  agent = TritonKernelAgent(num_workers=4, max_rounds=8, model_name="gpt-5")
  result = agent.generate_kernel(
      problem_description="Implement ReLU over a contiguous 1D tensor of length 1024"
  )
  if result["success"]:
      print("Kernel path:", result["kernel_path"])
      print("Session directory:", result["session_dir"])
  else:
      print("Failure:", result["message"])
  ```
- UIs — interactive runs with Gradio frontends:
  - Triton KernelAgent UI: `kernel-agent` or `python scripts/triton_ui.py`
  - Fuser orchestration UI: `fuser-ui` or `python scripts/fuser_ui`
  - Full pipeline UI: `pipeline-ui` or `python scripts/pipeline_ui`
- AutoRouter (`Fuser/auto_agent.py`): parses the problem’s AST, looks for attention blocks, transposed convolutions, control flow, and long op chains. It caches decisions under `.fuse/router_cache.json` and can fall back to the other path if the first attempt fails.
- Fuser Orchestrator (`Fuser/orchestrator.py`): rewrites the PyTorch module into fusable modules, executes them for validation, and packages a tarball of the fused code. Run IDs and directories are managed via `Fuser/paths.py`.
- Subgraph Extractor (`Fuser/subgraph_extractor.py`): prompts the LLM to emit a JSON array describing each unique subgraph, including ops, shapes, dtypes, and parameter tensors. Entries are deduplicated by shape signature so the dispatcher can reuse kernels.
- Dispatcher (`Fuser/dispatch_kernel_agent.py`): converts each JSON item into a precise Triton generation spec, then spins up `TritonKernelAgent` processes in parallel. Each worker writes its own session directory with the candidate kernel, test harness, and verification logs.
- TritonKernelAgent (`triton_kernel_agent/`): manages a pool of verification workers (`worker.py`, `manager.py`). Each worker iteratively asks an LLM for improvements, executes unit tests under sandboxed subprocesses (`Fuser/runner.py`), and enforces strict bans on PyTorch fallbacks. A run succeeds only when the test prints `PASS` (or the sentinel string) and exits with status 0.
- Composer (`Fuser/compose_end_to_end.py`): stitches the verified kernels back into a single Triton program. The composed file contains one or more `@triton.jit` kernels plus a `kernel_function(...)` wrapper and a self-test that replays the original PyTorch problem; a minimal sketch of that shape follows this list. With `--verify`, the test is executed immediately and must succeed.
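To make the composed output concrete, here is a minimal hand-written sketch of the shape such a file takes for a trivial elementwise problem. It is illustrative only; the real `compose_out/composed_kernel.py` is generated per problem, and everything here beyond the `kernel_function(...)` wrapper name and the `PASS` sentinel is an assumption.

```python
# Illustrative sketch of a composed_kernel.py-style file (not generated output).
import torch
import triton
import triton.language as tl


@triton.jit
def _relu_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x, 0.0), mask=mask)


def kernel_function(x: torch.Tensor) -> torch.Tensor:
    """Wrapper that replays the original forward pass using only Triton kernels."""
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    _relu_kernel[grid](x, out, n_elements, BLOCK_SIZE=1024)
    return out


if __name__ == "__main__":
    # Self-test: compare against the original PyTorch problem and print the sentinel.
    x = torch.randn(1024, device="cuda")
    torch.testing.assert_close(kernel_function(x), torch.relu(x))
    print("PASS")
```

With `--verify`, the composer immediately runs this kind of self-test and requires the `PASS` line plus a zero exit status.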
A successful pipeline run yields a structure similar to:
```
.fuse/<run_id>/
  orchestrator/code.py.tgz   # fused PyTorch refactor
  subgraphs.json             # shape-specialized subgraph descriptions
  kernels_out/
    <subgraph_id>/*          # per-subgraph KernelAgent sessions
    summary.json             # success/failure per subgraph
  compose_out/
    composed_kernel.py       # final Triton program + self-test
    summary.json             # composition metadata
```
These artifacts are designed for reproducibility: you can re-run a single kernel session, inspect prompts/responses, or feed `composed_kernel.py` directly into downstream tooling.
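For example, a small helper along these lines (illustrative, not part of the repo) locates the most recent run and prints where the key artifacts live, assuming the default `.fuse/<run_id>/` layout shown above:

```python
# Illustrative helper: find the newest run directory and its key artifacts.
import json
from pathlib import Path

run_dirs = [p for p in Path(".fuse").iterdir() if p.is_dir()]
latest = max(run_dirs, key=lambda p: p.stat().st_mtime)

print("Run directory:   ", latest)
print("Subgraph specs:  ", latest / "subgraphs.json")
print("Composed kernel: ", latest / "compose_out" / "composed_kernel.py")

summary = latest / "compose_out" / "summary.json"
if summary.exists():
    print(json.dumps(json.loads(summary.read_text()), indent=2))
```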
- `triton_kernel_agent/` — KernelAgent core (agent, worker manager, provider adapters, prompt templates)
- `Fuser/` — auto-router, orchestration pipeline, CLIs, Gradio UIs
- `triton_kernel_agent/templates/` — Jinja templates used when prompting TritonKernelAgent
- `examples/` — sample problems and prompt snippets
- `tests/` — unit tests for agents and utilities
- `e2e_test.py` — example end-to-end kernel generation harness
- `scripts/` — coverage/benchmark tooling, profiling helpers, CLI entry points (e.g., autoroute coverage runners, Triton UI)
- Install in editable mode with `pip install -e .[dev]`
- Run the test suite with `pytest -v`
- Follow the contribution guidelines in `CONTRIBUTING.md`
- KernelAgent intentionally leaves Triton installation to the user so you can pin the version that matches your GPU driver/toolchain
- Architecture and deep-dive docs: `docs/kernelfalcon_overview.html`, `docs/kernelfalcon_agents2_overview.html`, `docs/FuserAgent_sketch.html`, `docs/fuser_agent_compare.html`
- Issues: https://github.com/pytorch-labs/KernelAgent/issues
- Discussions & blog posts: [TBD]
KernelAgent is released under the Apache License 2.0; see LICENSE.