Evidence-first autonomous GPU-kernel optimization campaigns for SGLang.
KDA-Pilot turns real serving-framework kernels into reproducible optimization tasks: frozen production shapes, copied upstream baselines, symmetric benchmarks, correctness gates, Nsight Compute evidence, KernelWiki references, and RLCR-style agent iteration in one place.
Most AI kernel demos optimize a snippet. KDA-Pilot optimizes the parts that actually show up in SGLang diffusion and LLM serving workflows, then keeps the evidence needed to tell whether the agent really improved the production path.
If you care about autonomous CUDA/Triton/CuTe-DSL optimization that can be replayed, reviewed, and compared against real framework baselines, this is the repo to watch.
- Real workloads, not toy shapes. Diffusion tasks were built from 20 real SGLang diffusion models and collapsed into per-kernel multi-shape workloads.
- Wall-time metrics. The headline numbers include Python, dispatch, wrappers, kernel launch, and `cuda.synchronize()` overhead, not just isolated device time.
- No reward-hacking path. Baseline and candidate use matching local ABIs; the task does not monkey-patch or import SGLang at runtime.
- Knowledge-guided iteration. Tasks can pull from `KernelWiki` and `ncu-report-skill`, so prior Blackwell/Hopper kernel work and NCU bottleneck evidence become part of the optimization loop.
- Agent loop with review. Candidate promotion is tied to correctness gates, run logs, and code review rather than "one fast row wins".
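
The "correctness gates rather than one fast row wins" idea can be sketched in a few lines of Python. This is a minimal illustration, not the repo's promotion logic: `RowResult`, `promotion_decision`, and the `min_geomean` threshold are hypothetical names, and the run-log and code-review steps mentioned above are outside what this sketch covers.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RowResult:
    shape: str      # hypothetical label for one workload row
    correct: bool   # did the candidate pass the correctness gate on this row?
    speedup: float  # baseline wall time / candidate wall time


def promotion_decision(rows, min_geomean=1.0):
    """Return (ok, reason). Any correctness failure blocks promotion."""
    if not rows:
        return False, "no rows measured"
    failed = [r.shape for r in rows if not r.correct]
    if failed:
        # A single very fast row cannot outvote a failed correctness check.
        return False, "correctness failed on: " + ", ".join(failed)
    gm = statistics.geometric_mean([r.speedup for r in rows])
    if gm <= min_geomean:
        return False, f"geomean {gm:.4f} does not beat {min_geomean:.4f}"
    return True, f"geomean {gm:.4f}"


# One spectacular row does not rescue a candidate that is wrong elsewhere.
rows = [RowResult("small", True, 7.5), RowResult("large", False, 1.2)]
ok, reason = promotion_decision(rows)
```

Checking correctness before looking at any speedup keeps a fast but wrong candidate from ever reaching the performance comparison.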
These are wall-time geometric-mean (geomean) speedups against the corresponding SGLang/Triton/CuTe-DSL baselines on B200. The measurements include dispatch and synchronization overheads, so they are closer to what a user sees from the public kernel path.
| Kernel task | B200 wall geomean | Representative wins |
|---|---|---|
| `qknorm_rope` | 1.1341x | large rows 1.145-1.279x |
| `norm_infer` | 1.3523x | RMS small 1.634-1.641x |
| `rotary_embedding` | 1.4912x | HunyuanVideo 2.087x; LTX2 1.133-1.622x |
| `cutedsl_norm_tanh_mul_add` | 1.4953x | v1 1.602-1.625x |
| `cutedsl_norm_scale_shift` | 1.3201x | Hunyuan 1.388-1.516x; JoyAI 1.477-1.495x |
| `fuse_scale_shift` | 2.7499x | small broadcast rows 7.365-7.891x |
| `group_norm_silu` | 2.3118x | small/mid C rows 1.369-4.982x; NC rows up to 3.648x |
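
For readers unfamiliar with the geomean column, the sketch below shows one way such a number could be computed from per-row wall times. It assumes the aggregate is taken over per-row speedups (baseline time divided by candidate time); the timings are made-up illustration values, not measurements from this repo.

```python
import math


def geomean(speedups):
    """Geometric mean of per-row speedups (baseline_time / candidate_time)."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical wall times in microseconds for three workload rows.
baseline_us = [100.0, 200.0, 50.0]
candidate_us = [50.0, 200.0, 100.0]

row_speedups = [b / c for b, c in zip(baseline_us, candidate_us)]  # 2x, 1x, 0.5x
overall = geomean(row_speedups)
arithmetic = sum(row_speedups) / len(row_speedups)
```

In this toy case a 2x win and a 2x regression cancel out in the geomean, while the arithmetic mean would report a net gain. That is why a geomean is harder to inflate with a single fast row.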
| Kernel | KernelWiki / reference | Key techniques |
|---|---|---|
| `qknorm_rope` | TensorRT-LLM PR-13052/11869 DiT QKNorm+RoPE; SGLang PR-15141/19059/21440/21654 fused QKNorm/RoPE; memory-bound pattern | Shared RoPE staging, Q/K reuse, staged path only for large rows |
| `norm_infer` | KernelWiki memory-bound/vectorized-loads/register-budgeting; vLLM PR-31828 SM100 RMSNorm opt-in path | Warp-row RMS, tiled persistent RMS, 8B/16B vector paths |
| `rotary_embedding` | SGLang PR-24411 LTX2 split RoPE; vLLM PR-21126/30729 FlashInfer RoPE routing; vectorized-loads | 128-bit vector I/O, cos/sin hoisting, LTX2 block matching |
| `cutedsl_norm_tanh_mul_add` | KernelWiki memory-bound/vectorized-loads/register-budgeting; NCU long-scoreboard and launch-bounds evidence | Hoisted row-invariant math, launch-bounds tuning, exact tanhf |
| `cutedsl_norm_scale_shift` | SGLang PR-14717 CuTe-DSL norm/scale/shift fusion; vectorized-loads; register-budgeting | Operand-class dispatch, 16B/32B vectors, two-pass variance |
| `fuse_scale_shift` | SGLang PR-14717 fused norm/scale/shift family; vectorized-loads; cache-policy; memory-bound pattern | Rowgrid/flatvec/exact-C paths, cache hints, one-pass reduction |
| `group_norm_silu` | SGLang PR-22814/23148/23938 GroupNorm+SiLU; memory-bound pattern; vectorized-loads | Split-group stats, generation counters, channels-last transpose |
The companion write-up records the benchmark interpretation, kernel-specific optimization paths, KernelWiki/reference links, and AKO4X comparison: "KDA-Pilot optimizing SGLang Diffusion Kernel".

    diffusion/              SGLang diffusion-operator kernel tasks.
                            Each task owns a copied baseline, optimized solution, benchmark,
                            correctness contract, run logs, and result ledger.
    llm/                    SGLang autoregressive-model kernel-workflow campaign.
                            Serve priority models on B200/H200, benchmark low/mid/high
                            concurrency, profile forward passes, and turn >=1% non-attention
                            kernels into optimization task cards.
    external/               Optional shared knowledge submodules.
      KernelWiki/           Blackwell/Hopper kernel design references
      ncu-report-skill/     Nsight Compute profiling/report helper

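
The `llm/` rule of turning >=1% non-attention kernels into task cards could look roughly like the filter below. This is a sketch under stated assumptions: it treats the 1% threshold as a share of total profiled kernel time, it identifies attention kernels by name keywords, and the kernel names and timings are invented for illustration. The function `select_task_candidates` is hypothetical and not part of the repo.

```python
def select_task_candidates(kernel_times_us, min_share=0.01,
                           skip_keywords=("attention", "attn", "fmha")):
    """Return (name, share) pairs for non-attention kernels at or above min_share."""
    total = sum(kernel_times_us.values())
    if total <= 0:
        return []
    picked = []
    for name, elapsed in kernel_times_us.items():
        share = elapsed / total
        if share < min_share:
            continue  # too small to justify a task card
        if any(key in name.lower() for key in skip_keywords):
            continue  # attention kernels are excluded in this sketch
        picked.append((name, share))
    # Largest share first, so the most expensive kernels are handled first.
    return sorted(picked, key=lambda item: item[1], reverse=True)


# Hypothetical per-kernel totals from one profiled forward pass (microseconds).
inventory = {
    "flash_attn_fwd": 400.0,
    "rmsnorm_kernel": 300.0,
    "gemm_bf16": 250.0,
    "rotary_embedding_kernel": 45.0,
    "tiny_copy": 5.0,
}
candidates = select_task_candidates(inventory)
```

Here `flash_attn_fwd` is dropped for being an attention kernel and `tiny_copy` for falling under the threshold, leaving three candidates ordered by time share.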
Start with:
- `diffusion/README.md` for standalone diffusion kernel tasks and benchmark rules.
- `llm/README.md` for the LLM kernel-workflow campaign.
- `diffusion/docs/standalone_diffusion_benchmark.md` for the baseline/candidate benchmark contract.
- `diffusion/docs/diffusion_kernel_rules.md` for correctness, fallback, and promotion guardrails.
Every diffusion kernel task follows the same shape:

    prompt.md      task card for the agent
    config.toml    benchmark/build defaults
    baseline/      copied upstream SGLang baseline source
    solution/      optimized candidate source
    bench/         standalone benchmark and correctness harness
    docs/          run logs, profile notes, source notes, decision ledger

The important rule is symmetry: the agent must compare the copied baseline and candidate through matching local interfaces, fixed workload rows, preallocated outputs, CUDA-event timing, interleaved A/B sampling, strict correctness checks, and full provenance.
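
The interleaved A/B part of that rule can be sketched as below. This is a CPU-only stand-in so it runs anywhere: the `timer` argument defaults to `time.perf_counter`, whereas a GPU harness would use CUDA-event timing as the rule states. The function name `interleaved_ab` and the toy workloads are hypothetical and are not the repo's benchmark harness.

```python
import statistics
import time


def interleaved_ab(baseline_fn, candidate_fn, rounds=20, warmup=3,
                   timer=time.perf_counter):
    """Return (median_baseline, median_candidate) from interleaved samples."""
    for _ in range(warmup):
        baseline_fn()
        candidate_fn()
    samples = {"baseline": [], "candidate": []}
    for i in range(rounds):
        # Alternate which side runs first so clock or thermal drift
        # is spread across both sides instead of biasing one of them.
        order = [("baseline", baseline_fn), ("candidate", candidate_fn)]
        if i % 2:
            order.reverse()
        for label, fn in order:
            start = timer()
            fn()
            samples[label].append(timer() - start)
    return (statistics.median(samples["baseline"]),
            statistics.median(samples["candidate"]))


# Toy workload: both sides share one interface and write into preallocated outputs.
data = [float(i) for i in range(1024)]
out_base = [0.0] * len(data)
out_cand = [0.0] * len(data)


def baseline():
    for i, x in enumerate(data):
        out_base[i] = x * 2.0 + 1.0


def candidate():
    out_cand[:] = [x * 2.0 + 1.0 for x in data]


base_t, cand_t = interleaved_ab(baseline, candidate)
assert out_base == out_cand  # correctness gate: timings only count if outputs match
```

Sampling both sides inside the same loop means any slow drift in machine state lands on baseline and candidate alike, which is the point of the symmetry rule.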
Clone submodules when you want the optional knowledge references:

    git submodule update --init --recursive

Launch a task from the repo root:

    diffusion/scripts/launch_kernels/k03_b200_diffusion_qknorm_rope__multi_shape.sh

Useful environment switches:

    KDA_NO_CLAUDE=1          # prepare the worktree without launching an agent
    KDA_BASE_BRANCH=<ref>    # launch from a specific committed ref
    KDA_BASH_BIN=/opt/homebrew/bin/bash

macOS /bin/bash 3.2 is rejected by the launcher because nested Humanize/Codex hooks rely on modern Bash behavior.

- Diffusion kernels: qk norm + RoPE, norm inference, rotary embedding, fused scale/shift, group norm + SiLU, CuTe-DSL norm/tanh/mul/add, and CuTe-DSL norm/scale/shift across B200 and H200 task folders.
- LLM kernel workflow: model-level serving commands, benchmark sweeps, torch profiler traces, and kernel inventories for future optimization tasks.
- Open frontier: compute-bound kernels such as FA4/MHA and GEMM-like paths remain harder; this repo keeps the failed and partial attempts visible so the next loop can start from evidence instead of folklore.