Evolutionary generation of efficient GPU kernels using GigaEvo.
Define a task, run evolution with an LLM backend, extract and compare optimized programs.
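At its core, the evolve loop repeatedly mutates the current best kernel via an LLM, evaluates the candidates, and keeps the fittest. The following is a toy sketch of that loop, not GigaEvo's actual implementation: the LLM mutation and GPU evaluation are replaced with numeric stand-ins, and none of these function names come from GigaEvo's API.

```python
# Conceptual sketch of the evolve -> evaluate -> select loop.
import random

def evaluate(program: float) -> float:
    # Stand-in fitness: distance to an optimum. Real runs measure
    # kernel correctness and speedup on a GPU.
    return -abs(program - 42.0)

def mutate(program: float) -> float:
    # Stand-in mutation: real mutations come from an LLM rewriting
    # the kernel's source code.
    return program + random.uniform(-5.0, 5.0)

def evolve(seed: float, generations: int = 400, mutations_per_gen: int = 4) -> float:
    random.seed(0)
    best = seed
    for _ in range(generations):
        candidates = [mutate(best) for _ in range(mutations_per_gen)]
        # Keep the fittest of the current best plus its mutants,
        # so fitness never regresses between generations.
        best = max([best, *candidates], key=evaluate)
    return best
```

The `--max-generations` and `--max-mutations-per-generation` flags used later in this README correspond to the two loop bounds above.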
- Custom tasks — Define your own kernel tasks in KernelBench format and evolve them.
- KernelBench integration — Use existing KernelBench problems.
- Triton and inline CUDA backends — the two most popular ways to write kernels, suited to different scenarios.
- Remote or local execution — Run validation locally or via a remote eval server.
- Cost efficient — works with fast models such as Gemini Flash 3 and gpt-oss-120b; current experiments cost $0.5–1. Frontier models with high reasoning effort would likely improve results, but at an order of magnitude higher cost.
- Python >= 3.12
- LLM API — OpenAI-compatible (e.g. OpenRouter, or a local server like SGLang).
- Redis — Used by GigaEvo for experiment state.
- GPU — Used by the evaluation stage to measure kernel correctness and efficiency.
```shell
git clone https://github.com/AXXX-Institute/kernel-evo.git
cd kernel-evo
pip install -e . --ignore-requires-python
```

Note: `--ignore-requires-python` relaxes the Python version check (KernelBench may declare 3.10 but works on 3.12).

For custom branches of `gigaevo` or `kernelbench`, edit the Git URLs in `pyproject.toml`.
Pull and run (when a pre-built image is published):

```shell
docker pull sivtsovdt/kernel-evo:latest
docker run --rm sivtsovdt/kernel-evo:latest kernel-evo --help
```

To build the image yourself (e.g. for private dependencies or development), see `build/README.md`.
To evolve your own kernel, create a task in KernelBench format. Example layout:
```
tasks/
└── armt_associate/
    └── task.py
```

See `tasks/armt_associate` in this repo for a reference. You can also use any existing task from KernelBench.
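A task file follows the KernelBench convention: a reference `Model` the evolved kernel must match, plus `get_inputs()`/`get_init_inputs()` that supply the tensors used during correctness and performance trials. The sketch below is a hypothetical minimal task (the softmax workload and shapes are illustrative, not taken from this repo):

```python
# Hypothetical tasks/my_softmax/task.py in KernelBench format.
import torch
import torch.nn as nn

class Model(nn.Module):
    """Reference implementation the evolved kernel is checked against."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.softmax(x, dim=-1)

def get_inputs():
    # Tensors passed to Model.forward during trials.
    return [torch.randn(128, 4096)]

def get_init_inputs():
    # Arguments for Model.__init__ (none for this stateless example).
    return []
```

Evolution then searches for a faster kernel whose outputs stay numerically close to `Model.forward` on these inputs.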
Evolution can use a local or remote LLM (e.g. SGLang, OpenRouter). Examples below use OpenRouter and a remote eval server.
In a separate terminal:
```shell
kernel-evo eval-server --port 15000
```

Then run evolution on the custom task:

```shell
OPENAI_API_KEY="sk-or-v1-..." kernel-evo evolve \
  --problem-path tasks/armt_associate/task.py \
  --experiment-name custom_associate \
  --backend triton \
  --precision fp16 \
  --model-name <MODEL> \
  --llm-base-url https://openrouter.ai/api/v1 \
  --redis-db 0 \
  --max-generations 400 \
  --max-mutations-per-generation 4 \
  --validator-debug \
  --log-dir <dir_for_logs> \
  --execution-mode remote_execution
```

To evolve an existing KernelBench problem instead:

```shell
OPENAI_API_KEY="<KEY>" kernel-evo evolve \
  --level 1 \
  --problem-id 36 \
  --experiment-name kb_level1_36 \
  --dataset-src huggingface \
  --dataset-name ScalingIntelligence/KernelBench \
  --backend triton \
  --precision fp16 \
  --model-name <MODEL> \
  --llm-base-url https://openrouter.ai/api/v1 \
  --redis-db 0 \
  --max-generations 400 \
  --max-mutations-per-generation 4 \
  --validator-debug \
  --log-dir <dir_for_logs> \
  --execution-mode remote_execution
```

To monitor progress:

```shell
cd gigaevo/outputs/<DATE>/<EXPERIMENT_START>
tensorboard --logdir .
```

Use TensorBoard to find iterations with good performance before extracting programs.
Export the program from a specific iteration (e.g. after inspecting TensorBoard):
```shell
kernel-evo extract \
  --redis-db 0 \
  --iteration 8 \
  --redis-prefix "kernel_evo" \
  --output-file best_program.py
```

To compare two programs on a custom task:

```shell
kernel-evo compare \
  --program-a prog_a.py \
  --program-b prog_b.py \
  --problem-path tasks/armt_associate/task.py \
  --backend triton \
  --precision fp16 \
  --num-perf-trials 200 \
  --num-correct-trials 20
```

Or on a KernelBench problem:

```shell
kernel-evo compare \
  --program-a prog_a.py \
  --program-b prog_b.py \
  --dataset-src huggingface \
  --dataset-name ScalingIntelligence/KernelBench \
  --level 1 \
  --problem-id 36 \
  --backend triton \
  --precision fp16 \
  --num-perf-trials 200 \
  --num-correct-trials 20
```

| Command | Description |
|---|---|
| `evolve` | Run evolution (custom or KernelBench) |
| `eval-server` | Start remote validation server |
| `extract` | Export program by iteration from Redis |
| `compare` | Compare two programs (correctness + perf) |
Evolution results depend heavily on the underlying model. For the best results, use frontier models such as GPT, Claude, or Gemini.

Best-value vendor model:
- Gemini Flash 3 — capable yet inexpensive. It sometimes produces faulty kernels, but is able to recover buggy code.

Recommendations for open-source models:
- gpt-oss-120b — the best baseline for kernel evolution, with reasoning good enough to recover faulty kernels.
- GLM-5 — of all very large open LLMs, the only one that seems to know Triton and generates decent kernels. Downsides: slower generation and very large for local inference.
Result quality depends on the starting seeds and can vary between runs, so it makes sense to restart and try again if the solution is still poor after the first 200k tokens.

We also noticed that Triton does better on small efficient kernels, such as softmax and matmuls, because it demands less knowledge from the model. For complex tasks like KernelBench level 2, the difference is smaller.
It is better to run validation via the validator server in a separate terminal; this way you can watch results as they arrive.

Use the `--disable-insights-lineage` flag with `kernel-evo evolve` to disable additional calls. This is beneficial for short debug runs or with expensive models.


