KernelBench is an environment for evaluating language model agents on GPU kernel optimization. It is based on the KernelBench benchmark from the Stanford Scaling Intelligence Lab: agents are given a reference PyTorch implementation and must write optimized CUDA kernels that are both functionally correct and faster than the original. Tasks span single operators, fused kernel patterns, full model architectures, and HuggingFace models across 4 difficulty levels, and exercise skills such as:
- Writing custom CUDA kernels to replace PyTorch operators
- Fusing multiple operations into single optimized kernels
- Optimizing full neural network architectures end-to-end
- Iterating on kernel implementations based on compilation and correctness feedback
KernelBench uses a 2-stage sandbox pattern per run_kernel tool call:
- CPU sandbox (2 CPUs, 8GB RAM): Compiles CUDA kernels without requiring a GPU
- GPU sandbox (NVIDIA L4): Benchmarks compiled kernels against the reference implementation
Both sandboxes are ephemeral and created/destroyed per tool call. The sandbox execution timeout is 300 seconds, matching the original KernelBench evaluation.
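The per-call flow can be pictured roughly as follows. This is a conceptual sketch only: the helper names (create_sandbox, compile, benchmark, destroy) and result fields are hypothetical, not the platform's actual API.

```python
# Conceptual sketch of one run_kernel call. All helper names here
# (create_sandbox, compile, benchmark, destroy) are hypothetical.
def run_kernel(python_code: str, cuda_code: str, cpp_code: str) -> dict:
    # Stage 1: CPU sandbox. nvcc can compile CUDA sources without a GPU,
    # so compilation errors are caught before any GPU time is spent.
    cpu_box = create_sandbox(cpus=2, memory_gb=8, gpu=None, timeout_s=300)
    build = cpu_box.compile(cuda_code, cpp_code)
    if not build.ok:
        cpu_box.destroy()
        return {"outcome": "compilation_failure", "reward": -5.0, "log": build.log}

    # Stage 2: GPU sandbox. Load the compiled extension, check ModelNew against
    # the reference Model for correctness, and time both implementations.
    gpu_box = create_sandbox(gpu="L4", timeout_s=300)
    result = gpu_box.benchmark(python_code, build.artifact)

    # Both sandboxes are ephemeral and torn down once the call returns.
    cpu_box.destroy()
    gpu_box.destroy()
    return result
```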
MIT.
There are 7 splits with tasks sourced from the ScalingIntelligence/KernelBench HuggingFace dataset:
| Split | Tasks | Type | Description |
|---|---|---|---|
| dev | 1 | validation | Single task for development testing |
| level1 | 100 | test | Single-kernel operators (matrix multiply, activations, norms, pooling, convolutions, reductions, loss functions) |
| level1_verified | 79 | test | Verified subset of level1 |
| level2 | 100 | test | Fused kernel patterns (e.g., Conv2D + ReLU + BiasAdd) |
| level2_verified | 60 | test | Verified subset of level2 |
| level3 | 50 | test | Full model architectures (MLP, ResNet, VGG, DenseNet, EfficientNet, ViT, LSTM, GRU, MiniGPT, UNet, Mamba) |
| level4 | 20 | test | HuggingFace model architectures (GPT-Neo, OPT, BART, BigBird, Reformer, ELECTRA, GPT-2) |
Each task presents a reference PyTorch Model class. The agent must produce a ModelNew class that implements the same functionality using custom CUDA kernels, along with the corresponding CUDA and C++ code.
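To make the required structure concrete, here is a minimal sketch of the Model → ModelNew pattern. The task itself (an elementwise add) is illustrative rather than drawn from the dataset; the kernel is compiled with torch.utils.cpp_extension.load_inline.

```python
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

# Reference implementation the agent is shown (illustrative, not a real task).
class Model(nn.Module):
    def forward(self, a, b):
        return a + b

# Hand-written CUDA kernel plus a C++ launcher declaration.
cuda_src = r"""
#include <torch/extension.h>

__global__ void add_kernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

torch::Tensor elementwise_add(torch::Tensor a, torch::Tensor b) {
    auto out = torch::empty_like(a);
    int n = a.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(a.data_ptr<float>(), b.data_ptr<float>(),
                                    out.data_ptr<float>(), n);
    return out;
}
"""
cpp_src = "torch::Tensor elementwise_add(torch::Tensor a, torch::Tensor b);"

ops = load_inline(name="elementwise_add", cpp_sources=cpp_src,
                  cuda_sources=cuda_src, functions=["elementwise_add"])

# Drop-in replacement the agent must produce.
class ModelNew(nn.Module):
    def forward(self, a, b):
        return ops.elementwise_add(a, b)
```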
This is a dense reward environment with continuous scoring. Each run_kernel call compiles and benchmarks the agent's kernel, returning a reward based on the outcome:
| Outcome | Reward | Description |
|---|---|---|
| Correct, faster | median_speedup - 1 | Positive reward proportional to speedup over reference |
| Correct, same speed | 0.0 | Matches reference performance |
| Incorrect output | -2.0 | Kernel runs but produces wrong results |
| Execution failure | -3.0 | Kernel initialized but failed during execution |
| Initialization failure | -4.0 | Kernel compiled but ModelNew could not be initialized |
| Compilation failure | -5.0 | CUDA kernel failed to compile |
| Unknown error | -6.0 | Unclassified error during evaluation |
| Exception/timeout | -7.0 | Sandbox error or evaluation exceeded 300s timeout |
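The mapping from outcome to reward can be summarized as below; the outcome labels are illustrative shorthand, not the environment's internal identifiers.

```python
# Illustrative mapping from evaluation outcome to reward (labels assumed).
FAILURE_REWARDS = {
    "incorrect_output": -2.0,
    "execution_failure": -3.0,
    "initialization_failure": -4.0,
    "compilation_failure": -5.0,
    "unknown_error": -6.0,
    "exception_or_timeout": -7.0,
}

def reward_for(outcome: str, median_speedup: float | None = None) -> float:
    if outcome == "correct":
        # Positive when faster than the reference, 0.0 when it matches it.
        return median_speedup - 1.0
    return FAILURE_REWARDS[outcome]
```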
The agent can call run_kernel multiple times to iterate on its solution. The finish tool ends the episode with reward 0.
We do not use LLM graders for this task. Correctness is verified by comparing the kernel output against the reference implementation, and speedup is measured over 1,000 benchmark trials.
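A sketch of what that check might look like in PyTorch, assuming torch.allclose for the comparison and CUDA-event timing for the speedup measurement; the tolerances and trial procedure here are assumptions, not the environment's exact evaluation code.

```python
import torch

def evaluate(ref_model, new_model, inputs, trials=1000, atol=1e-2, rtol=1e-2):
    # Correctness: compare outputs on the same inputs (tolerances assumed).
    with torch.no_grad():
        ref_out = ref_model(*inputs)
        new_out = new_model(*inputs)
    if not torch.allclose(ref_out, new_out, atol=atol, rtol=rtol):
        return {"correct": False}

    def time_once(model):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        with torch.no_grad():
            model(*inputs)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end)  # milliseconds

    ref_times = torch.tensor([time_once(ref_model) for _ in range(trials)])
    new_times = torch.tensor([time_once(new_model) for _ in range(trials)])
    return {"correct": True,
            "median_speedup": (ref_times.median() / new_times.median()).item()}
```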
Task data is loaded at runtime from the ScalingIntelligence/KernelBench HuggingFace dataset. Each task contains a complete PyTorch file with a Model class, get_inputs(), and get_init_inputs() functions. The agent sees only the Model class and imports (test harness functions are stripped).
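For reference, loading a task with the Hugging Face datasets library looks roughly like the following; the split name and column names are assumptions about the dataset layout rather than a confirmed schema.

```python
from datasets import load_dataset

# Split and column names below are assumed, not verified against the dataset card.
ds = load_dataset("ScalingIntelligence/KernelBench", split="level_1")
task = ds[0]
print(task["name"])  # problem identifier
print(task["code"])  # full PyTorch file with Model, get_inputs(), get_init_inputs()
```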
Agents are given two tools:
- run_kernel: Submit Python code (defining ModelNew), CUDA code, and C++ code. The environment compiles the kernel on CPU, then benchmarks it on GPU against the reference implementation. Returns correctness, speedup, and detailed error messages on failure.
- finish: End the task. Should be called once the agent has produced its best solution.
KernelBench is a multi-turn environment. The agent receives a reference implementation and iteratively writes and tests CUDA kernels. There is no hard limit on the number of run_kernel calls; the agent decides when to call finish.
One-shot performance from the original KernelBench paper (fast_1: % of kernels that match or exceed the PyTorch eager baseline):
| Model | Level 1 | Level 2 | Level 3 |
|---|---|---|---|
| DeepSeek R1 | 12% | 36% | 2% |
| OpenAI o1 | 10% | 24% | 12% |
| Claude 3.5 Sonnet | 10% | 7% | 2% |
| DeepSeek V3 | 6% | 4% | 8% |
| GPT-4o | 4% | 5% | 0% |
| Llama 3.1-405B | 3% | 0% | 2% |
| Llama 3.1-70B | 3% | 0% | 0% |
With 10 turns of iterative refinement (execution + profiler feedback), DeepSeek R1 improves to 43% on Level 1, 72% on Level 2, and 18% on Level 3. Writing functionally correct kernels remains the primary challenge, as models struggle with CUDA correctness even when compilation succeeds.
There are no further environment requirements beyond the OpenReward platform.
Agents in KernelBench write and execute CUDA code inside isolated sandboxes. The sandboxes are ephemeral and destroyed after each tool call, limiting the blast radius of any generated code. The environment does not present direct safety risks beyond those inherent in arbitrary code execution, which is contained by the sandbox.
@article{ouyang2025kernelbench,
title={KernelBench: Can LLMs Write Efficient GPU Kernels?},
author={Ouyang, Anne and Guo, Simon and Arora, Simran and Zhang, Alex L. and Hu, William and R{\'e}, Christopher and Mirhoseini, Azalia},
journal={arXiv preprint arXiv:2502.10517},
year={2025}
}