Auto GPU Kernel 🏆

Autonomous GPU kernel discovery and optimization.

Ranked #1 in the MLSys 2026 FlashInfer AI Kernel Generation Contest, DeepSeek Sparse Attention (DSA) track, with an average speedup of 34.93x. The submitted kernels:

Kernel	Runtime (ms)
dsa_sparse_attention_h16_ckv512_kpe64_topk2048_ps64 — DSA Sparse Attention	0.010
dsa_topk_indexer_fp8_h64_d128_topk2048_ps64 — DSA TopK Indexer	0.016

Setup

Copy the template directory into a separate folder / git repository to make sure your agents work in an isolated environment.

The kernel agent is compatible with the FlashInfer format and can run in the cloud on Modal, so no local GPU is needed. It requires the Claude Code CLI.

# Python env
conda create -n fi-bench python=3.12
conda activate fi-bench
pip install flashinfer-bench modal

# One-time environment setup
modal setup
modal volume create flashinfer-trace
modal volume put flashinfer-trace /path/to/flashinfer-trace/

To get started, clone the MLSys-2026 Contest Dataset. To change the kernel you are implementing, refer to the FlashInfer-Trace - Bring Your Own Kernel guide.

Important

Make sure you update CLAUDE.md to describe the kernel you are optimizing. The example in the template is customized for sparse attention. optimize.md and benchmark.md also have some parameters tuned for sparse attention, such as the number of test cases to run as a sanity check. You can ask an agent to help you adjust them.

Launch the loop

To run one iteration:

claude --dangerously-skip-permissions -p "/optimize"

Alternatively, launch interactive mode by running claude --dangerously-skip-permissions, select the model and thinking mode, and enter /loop Run /optimize every 15 minutes.

That's it. The loop runs indefinitely: each iteration picks one optimization, benchmarks it, logs an experiment folder, and continues. Stop with Ctrl+C when you want to step in. As the agent struggles to find new optimizations, it will start running less frequently.
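The back-off behaviour described above can be approximated outside the agent. The following is a minimal sketch, not the loop the agent actually runs: `next_interval`, `optimize_loop`, and `run_iteration` are hypothetical names, and the doubling policy and 4-hour cap are assumptions. Only the CLI invocation and the 15-minute (900 s) base interval come from this README.

```python
import subprocess
import time


def run_iteration():
    """Run one /optimize iteration via the Claude Code CLI (command from this README)."""
    cmd = ["claude", "--dangerously-skip-permissions", "-p", "/optimize"]
    return subprocess.run(cmd).returncode


def next_interval(current_s, improved, base_s=900, max_s=4 * 3600):
    """Assumed policy: reset to the base interval after an improvement,
    otherwise double the wait up to a cap."""
    if improved:
        return base_s
    return min(current_s * 2, max_s)


def optimize_loop(run_once, max_iters, base_s=900, max_s=4 * 3600, sleep=time.sleep):
    """Call run_once() repeatedly; run_once must return True if the iteration
    found an improvement. Returns the list of waits used, for inspection."""
    interval = base_s
    waits = []
    for _ in range(max_iters):
        improved = run_once()
        interval = next_interval(interval, improved, base_s, max_s)
        waits.append(interval)
        sleep(interval)
    return waits
```

How `run_once` decides whether an iteration improved the kernel is left open here; one option is to compare the latest row of experiments/summary.md against the previous best.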

Architecture

For more details on the agentic loop, please refer to the technical report.

Agents:

Profiler
Research
Workload inspector

Command	Purpose
`/optimize`	Main loop
`/benchmark <quick|stride N|full>`	One-shot Modal run
`/log-experiment`	Snapshot + write `result.md` + update index
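The three `/benchmark` modes are specified in .claude/commands/. As an illustration only, one plausible reading is that `full` runs every workload, `stride N` runs every N-th workload, and `quick` runs a small fixed prefix as a sanity check. The sketch below encodes that assumed interpretation; `select_workloads` and `quick_n` are hypothetical and not part of the repository.

```python
def select_workloads(workloads, mode, quick_n=3):
    """Pick a subset of workloads for a benchmark run.

    mode is 'quick', 'stride N', or 'full'. The semantics here are an
    assumption for illustration, not the repository's definition.
    """
    parts = mode.split()
    if parts == ["full"]:
        return list(workloads)
    if parts == ["quick"]:
        # Assumed: a short prefix is enough for a sanity check.
        return list(workloads[:quick_n])
    if len(parts) == 2 and parts[0] == "stride":
        step = int(parts[1])
        if step < 1:
            raise ValueError("stride must be >= 1")
        # Assumed: sample every step-th workload across the full set.
        return list(workloads[::step])
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, with ten workloads, `select_workloads(list(range(10)), "stride 4")` keeps indices 0, 4, and 8.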

See CLAUDE.md for rules and .claude/commands/ for full command specs.

solution/triton/sparse_fused.py — the kernel being optimized (overwritten each iteration)
experiments/exp_N/ — snapshot + results for iteration N
experiments/summary.md — master index, one row per iteration
experiments/LESSONS.md — durable cross-experiment findings
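The layout above can be reproduced by hand if you want to log a run outside the agent. This is a minimal sketch of what a snapshot step might do, assuming experiments are numbered from 1 and that summary.md takes one pipe-delimited row per iteration; `log_experiment` is a hypothetical helper, and the real behaviour of `/log-experiment` is defined in .claude/commands/.

```python
import shutil
from pathlib import Path


def log_experiment(root, kernel_path, result_text, summary_row):
    """Snapshot the kernel into experiments/exp_N/, write result.md,
    and append one row to experiments/summary.md. Returns the new folder."""
    exp_root = Path(root) / "experiments"
    exp_root.mkdir(parents=True, exist_ok=True)

    # Find the highest existing exp_N so the next folder gets N + 1.
    existing = [
        int(p.name.split("_", 1)[1])
        for p in exp_root.glob("exp_*")
        if p.is_dir() and p.name.split("_", 1)[1].isdigit()
    ]
    n = max(existing, default=0) + 1

    exp_dir = exp_root / f"exp_{n}"
    exp_dir.mkdir()
    shutil.copy2(kernel_path, exp_dir / Path(kernel_path).name)
    (exp_dir / "result.md").write_text(result_text)

    # Assumed row format; adapt to whatever summary.md already uses.
    with (exp_root / "summary.md").open("a") as f:
        f.write(f"| exp_{n} | {summary_row} |\n")
    return exp_dir
```

Because the kernel file is overwritten each iteration, copying it into the experiment folder before the next run is what makes an earlier result recoverable.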
