KernelBench is an environment for evaluating language model agents on GPU kernel optimization. It is based on the KernelBench benchmark from the Stanford Scaling Intelligence Lab: agents are given a reference PyTorch implementation and must write optimized CUDA kernels that are both functionally correct and faster than the original. Tasks span single operators, fused kernel patterns, full model architectures, and HuggingFace models across 4 difficulty levels, and exercise skills such as:
- Writing custom CUDA kernels to replace PyTorch operators
- Fusing multiple operations into single optimized kernels
- Optimizing full neural network architectures end-to-end
- Iterating on kernel implementations based on compilation and correctness feedback
KernelBench uses a 2-stage sandbox pattern per run_kernel tool call:
- CPU sandbox (2 CPUs, 8GB RAM): Compiles CUDA kernels without requiring a GPU
- GPU sandbox (NVIDIA L4): Benchmarks compiled kernels against the reference implementation
Both sandboxes are ephemeral and created/destroyed per tool call. The sandbox execution timeout is 300 seconds, matching the original KernelBench evaluation.
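The per-call flow can be pictured roughly as follows. This is a conceptual sketch only: the helper names (create_sandbox, compile, benchmark, destroy) and result fields are hypothetical, not the platform's actual API.

```python
# Conceptual sketch of one run_kernel call. All helper names here
# (create_sandbox, compile, benchmark, destroy) are hypothetical.
def run_kernel(python_code: str, cuda_code: str, cpp_code: str) -> dict:
    # Stage 1: CPU sandbox. nvcc can compile CUDA sources without a GPU,
    # so compilation errors are caught before any GPU time is spent.
    cpu_box = create_sandbox(cpus=2, memory_gb=8, gpu=None, timeout_s=300)
    build = cpu_box.compile(cuda_code, cpp_code)
    if not build.ok:
        cpu_box.destroy()
        return {"outcome": "compilation_failure", "reward": -5.0, "log": build.log}

    # Stage 2: GPU sandbox. Load the compiled extension, check ModelNew against
    # the reference Model for correctness, and time both implementations.
    gpu_box = create_sandbox(gpu="L4", timeout_s=300)
    result = gpu_box.benchmark(python_code, build.artifact)

    # Both sandboxes are ephemeral and torn down once the call returns.
    cpu_box.destroy()
    gpu_box.destroy()
    return result
```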
MIT.
There are 7 splits with tasks sourced from the ScalingIntelligence/KernelBench HuggingFace dataset:
| Split | Tasks | Type | Description |
|---|---|---|---|
| dev | 1 | validation | Single task for development testing |
| level1 | 100 | test | Single-kernel operators (matrix multiply, activations, norms, pooling, convolutions, reductions, loss functions) |
| level1_verified | 79 | test | Verified subset of level1 |
| level2 | 100 | test | Fused kernel patterns (e.g., Conv2D + ReLU + BiasAdd) |
| level2_verified | 60 | test | Verified subset of level2 |
| level3 | 50 | test | Full model architectures (MLP, ResNet, VGG, DenseNet, EfficientNet, ViT, LSTM, GRU, MiniGPT, UNet, Mamba) |
| level4 | 20 | test | HuggingFace model architectures (GPT-Neo, OPT, BART, BigBird, Reformer, ELECTRA, GPT-2) |
Each task presents a reference PyTorch Model class. The agent must produce a ModelNew class that implements the same functionality using custom CUDA kernels, along with the corresponding CUDA and C++ code.
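To make the required structure concrete, here is a minimal sketch of the Model → ModelNew pattern. The task itself (an elementwise add) is illustrative rather than drawn from the dataset; the kernel is compiled with torch.utils.cpp_extension.load_inline.

```python
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

# Reference implementation the agent is shown (illustrative, not a real task).
class Model(nn.Module):
    def forward(self, a, b):
        return a + b

# Hand-written CUDA kernel plus a C++ launcher declaration.
cuda_src = r"""
#include <torch/extension.h>

__global__ void add_kernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

torch::Tensor elementwise_add(torch::Tensor a, torch::Tensor b) {
    auto out = torch::empty_like(a);
    int n = a.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(a.data_ptr<float>(), b.data_ptr<float>(),
                                    out.data_ptr<float>(), n);
    return out;
}
"""
cpp_src = "torch::Tensor elementwise_add(torch::Tensor a, torch::Tensor b);"

ops = load_inline(name="elementwise_add", cpp_sources=cpp_src,
                  cuda_sources=cuda_src, functions=["elementwise_add"])

# Drop-in replacement the agent must produce.
class ModelNew(nn.Module):
    def forward(self, a, b):
        return ops.elementwise_add(a, b)
```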
This is a dense reward environment with continuous scoring. Each run_kernel call compiles and benchmarks the agent's kernel, returning a reward based on the outcome:
| Outcome | Reward | Description |
|---|---|---|
| Correct, faster | median_speedup - 1 | Positive reward proportional to speedup over reference |
| Correct, same speed | 0.0 | Matches reference performance |
| Incorrect output | -2.0 | Kernel runs but produces wrong results |
| Execution failure | -3.0 | Kernel initialized but failed during execution |
| Initialization failure | -4.0 | Kernel compiled but ModelNew could not be initialized |
| Compilation failure | -5.0 | CUDA kernel failed to compile |
| Unknown error | -6.0 | Unclassified error during evaluation |
| Exception/timeout | -7.0 | Sandbox error or evaluation exceeded 300s timeout |
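The mapping from outcome to reward can be summarized as below; the outcome labels are illustrative shorthand, not the environment's internal identifiers.

```python
# Illustrative mapping from evaluation outcome to reward (labels assumed).
FAILURE_REWARDS = {
    "incorrect_output": -2.0,
    "execution_failure": -3.0,
    "initialization_failure": -4.0,
    "compilation_failure": -5.0,
    "unknown_error": -6.0,
    "exception_or_timeout": -7.0,
}

def reward_for(outcome: str, median_speedup: float | None = None) -> float:
    if outcome == "correct":
        # Positive when faster than the reference, 0.0 when it matches it.
        return median_speedup - 1.0
    return FAILURE_REWARDS[outcome]
```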
The agent can call run_kernel multiple times to iterate on its solution. The finish tool ends the episode with reward 0.
We do not use LLM graders for this task. Correctness is verified by comparing the kernel output against the reference implementation, and speedup is measured over 1,000 benchmark trials.
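A sketch of what that check might look like in PyTorch, assuming torch.allclose for the comparison and CUDA-event timing for the speedup measurement; the tolerances and trial procedure here are assumptions, not the environment's exact evaluation code.

```python
import torch

def evaluate(ref_model, new_model, inputs, trials=1000, atol=1e-2, rtol=1e-2):
    # Correctness: compare outputs on the same inputs (tolerances assumed).
    with torch.no_grad():
        ref_out = ref_model(*inputs)
        new_out = new_model(*inputs)
    if not torch.allclose(ref_out, new_out, atol=atol, rtol=rtol):
        return {"correct": False}

    def time_once(model):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        with torch.no_grad():
            model(*inputs)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end)  # milliseconds

    ref_times = torch.tensor([time_once(ref_model) for _ in range(trials)])
    new_times = torch.tensor([time_once(new_model) for _ in range(trials)])
    return {"correct": True,
            "median_speedup": (ref_times.median() / new_times.median()).item()}
```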
Task data is loaded at runtime from the ScalingIntelligence/KernelBench HuggingFace dataset. Each task contains a complete PyTorch file with a Model class, get_inputs(), and get_init_inputs() functions. The agent sees only the Model class and imports (test harness functions are stripped).
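For reference, loading a task with the Hugging Face datasets library looks roughly like the following; the split name and column names are assumptions about the dataset layout rather than a confirmed schema.

```python
from datasets import load_dataset

# Split and column names below are assumed, not verified against the dataset card.
ds = load_dataset("ScalingIntelligence/KernelBench", split="level_1")
task = ds[0]
print(task["name"])  # problem identifier
print(task["code"])  # full PyTorch file with Model, get_inputs(), get_init_inputs()
```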
Agents are given two tools:
- run_kernel: Submit Python code (defining ModelNew), CUDA code, and C++ code. The environment compiles the kernel on CPU, then benchmarks it on GPU against the reference implementation. Returns correctness, speedup, and detailed error messages on failure.
- finish: End the task. Should be called once the agent has produced its best solution.
KernelBench is a multi-turn environment. The agent receives a reference implementation and iteratively writes and tests CUDA kernels. There is no hard limit on the number of run_kernel calls; the agent decides when to call finish.
One-shot performance from the original KernelBench paper (fast_1: % of kernels that match or exceed the PyTorch eager baseline):
| Model | Level 1 | Level 2 | Level 3 |
|---|---|---|---|
| DeepSeek R1 | 12% | 36% | 2% |
| OpenAI o1 | 10% | 24% | 12% |
| Claude 3.5 Sonnet | 10% | 7% | 2% |
| DeepSeek V3 | 6% | 4% | 8% |
| GPT-4o | 4% | 5% | 0% |
| Llama 3.1-405B | 3% | 0% | 2% |
| Llama 3.1-70B | 3% | 0% | 0% |
With 10 turns of iterative refinement (execution + profiler feedback), DeepSeek R1 improves to 43% on Level 1, 72% on Level 2, and 18% on Level 3. Writing functionally correct kernels remains the primary challenge, as models struggle with CUDA correctness even when compilation succeeds.
There are no further environment requirements beyond the OpenReward platform.
Agents in KernelBench write and execute CUDA code inside isolated sandboxes. The sandboxes are ephemeral and destroyed after each tool call, limiting the blast radius of any generated code. The environment does not present direct safety risks beyond those inherent in arbitrary code execution, which is contained by the sandbox.
@article{ouyang2025kernelbench,
title={KernelBench: Can LLMs Write Efficient GPU Kernels?},
author={Ouyang, Anne and Guo, Simon and Arora, Simran and Zhang, Alex L. and Hu, William and R{\'e}, Christopher and Mirhoseini, Azalia},
journal={arXiv preprint arXiv:2502.10517},
year={2025}
}