English | 简体中文
Frontier-Eng is a benchmark designed to evaluate the ability of AI Agents to solve open-ended optimization problems in real-world engineering domains.
Unlike existing benchmarks that focus on Computer Science (CS) or purely abstract mathematical problems, Frontier-Eng focuses on engineering challenges with actual economic benefits and physical constraints. It is expected to cover multiple fields such as aerospace, civil engineering, EDA, bioengineering, and more.
`frontier_eval/requirements.txt` only sets up the evaluation framework itself. It does not mean every benchmark can run inside the same environment.
Before running any specific benchmark, always read the corresponding environment instructions in:
- `benchmarks/<Domain>/README*.md`
- `benchmarks/<Domain>/<Task>/README*.md` (if the task has its own README)
Many benchmark families require their own runtime environment, extra requirements.txt, extra third_party/ checkouts, or Docker-based execution. When a benchmark README documents runtime overrides such as task.runtime.conda_env=..., task.runtime.python_path=..., or task.runtime.use_conda_run=false, treat the benchmark README as the source of truth and copy those overrides into your run command.
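For instance, a run command carrying such overrides might look like the following template (all `<...>` values are placeholders to be replaced with the exact values documented in the benchmark README):

```shell
python -m frontier_eval task=unified task.benchmark=<Domain_Name>/<Task_Name> \
    task.runtime.conda_env=<env-name> algorithm.iterations=0
```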
Examples already in this repository include ReactionOptimisation (summit), MolecularMechanics (openff-dev), SustainableDataCenterControl (sustaindc), PyPortfolioOpt (pyportfolioopt), QuantumComputing (mqt), InventoryOptimization (stock), JobShop (custom python_path), and EngDesign (Docker / local mode).
Project-local agent skills are bundled under `skill/`. Run `python -m frontier_eval skill` to choose interactively, or use `python -m frontier_eval skill evaluator codex` for a direct install.
To reduce the number of runtime environments used by the effective v1 task pool without breaking existing setups, the repository now uses the following convention:
- `frontier-eval-2` remains the evaluation-framework / driver environment and is left unchanged.
- Existing task environments such as `bio`, `mqt`, `optics`, `stock`, `pyportfolioopt`, `motion`, `jobshop`, `summit`, `sustaindc`, and `kernel` are preserved and not overwritten.
- New merged task environments are created under whichever environment prefix the current `conda` installation manages, with default names `frontier-v1-main`, `frontier-v1-summit`, `frontier-v1-sustaindc`, and `frontier-v1-kernel`.
- For `v1` tasks that need a direct interpreter instead of `conda run` (currently `ReactionOptimisation/*` and `JobShop/*`), the batch matrices use the portable marker `conda-env:<env-name>`. The unified evaluator resolves that marker to the target env's Python executable at runtime, so repository files stay machine-independent.
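For illustration only, resolving such a marker could be sketched as below. The function name `resolve_python`, the fallback behaviour, and the POSIX `bin/python` layout are all assumptions; the unified evaluator's actual implementation may differ.

```python
import os
import subprocess

def resolve_python(marker: str, fallback: str = "python") -> str:
    """Resolve a portable 'conda-env:<env-name>' marker to a concrete
    interpreter path; plain interpreter paths are returned unchanged."""
    if not marker.startswith("conda-env:"):
        return marker  # already a concrete interpreter path
    env_name = marker[len("conda-env:"):]
    try:
        # Ask conda where its environments live.
        out = subprocess.run(
            ["conda", "env", "list"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return fallback  # conda unavailable on this machine
    for line in out.splitlines():
        parts = line.split()
        if parts and parts[0] == env_name:
            # POSIX layout assumed: <env_prefix>/bin/python
            return os.path.join(parts[-1], "bin", "python")
    return fallback  # env not found
```

Keeping the marker (rather than a hard-coded path) in the batch matrices is what lets the same repository files work across machines with different conda prefixes.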
Current v1 runtime consolidation:
- `frontier-v1-main`: `SingleCellAnalysis/predict_modality`, `QuantumComputing/*`, `Optics/*`, `InventoryOptimization/*`, `PyPortfolioOpt/*`, `JobShop/*`, `Robotics/DynamicObstacleAvoidanceNavigation`, `Robotics/PIDTuning`, `Robotics/UAVInspectionCoverageWithWind`, `Robotics/QuadrupedGaitOptimization`, `Robotics/RobotArmCycleTimeOptimization`, `Aerodynamics/CarAerodynamicsSensing`, `KernelEngineering/FlashAttention`
- `frontier-v1-summit`: `ReactionOptimisation/*`
- `frontier-v1-sustaindc`: `SustainableDataCenterControl/*`
- `frontier-v1-kernel`: `KernelEngineering/MLA`, `KernelEngineering/TriMul`
If an older benchmark README still mentions legacy env names such as `mqt`, `stock`, `pyportfolioopt`, or `jobshop`, prefer the batch matrix files under `frontier_eval/conf/batch/` as the source of truth for current v1 runs.
Setup and validation scripts:
- Initialize merged envs: `bash scripts/setup_v1_merged_task_envs.sh`
- Validate merged envs with `iter=0`: `DRIVER_ENV=frontier-eval-2 GPU_DEVICES=<gpu_id> bash scripts/validate_v1_merged_task_envs.sh`
Notes:
- The validation script uses `conda run -n frontier-eval-2 python` as the default driver, and can also be overridden with `DRIVER_PY=/path/to/python`. It checks CPU `v1`, GPU `v1`, `FlashAttention`, `MLA`, and `TriMul`. `MuonTomography` remains excluded from the current effective `v1` pool as described later in this README.
- Known caveat: the official `KernelEngineering/TriMul` full benchmark (`verification/tri_bench.txt`) may still be VRAM-limited on 24 GB-class GPUs; this is typically a task-level memory-bound issue rather than a missing dependency in `frontier-v1-kernel`.
Current AI4Research evaluation systems have the following limitations:
- Limited Evaluation Methods: Most adopt 0/1 binary evaluation or closed-interval rubrics, failing to effectively measure an Agent's ability to perform iterative optimization through interaction in an open world.
- Domain Limitations: Existing benchmarks are mostly confined to the CS domain (e.g., code generation), or they abstract real problems into pure math problems, stripping away real-world complexity and preventing Agents from utilizing rich external knowledge and tools.
- Metric Bias: Traditional computational metrics focus on model average performance, whereas for engineering optimization problems, we should focus more on the Peak Performance a model can achieve on a single problem through exploration mechanisms.
Frontier-Eng aims to evaluate the ability of Agents to solve problems with practical value across a wide range of engineering disciplines by providing rich context and tool support.
We need the power of the community to expand the coverage of the Benchmark. We welcome the submission of new engineering problems via Pull Requests (PR). If you wish to contribute, please follow the standards and processes below:
AI-Assisted Contributions: We welcome contributions created with the assistance of AI tools. If you're using an agent to help with your contribution, run `python -m frontier_eval skill` and install `Contributor`, or use `skill/source/frontier-contributor/SKILL.md` directly. However, please do not over-rely on AI tools or leave the process entirely to AI. Human review and supervision are essential to ensure quality and correctness.
- Reality Gap: Must be close to reality, considering real-world influencing factors, not purely abstract mathematics.
- Economic Value: The problem should have clear engineering or economic value upon solution.
- Verifiability: Must provide an executable verification program (Docker preferred) capable of completing the evaluation within an acceptable time.
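To make the verifiability requirement concrete, a minimal Python evaluator entry point might look like the sketch below. The objective function, input format (a JSON list of design variables), and output contract are all invented for illustration; real tasks define their own.

```python
import json
import sys

def score(values):
    """Toy objective: negative mean squared distance from a target of 1.0
    (higher is better, 0.0 is optimal)."""
    return -sum((v - 1.0) ** 2 for v in values) / len(values)

def main(path):
    # Load the candidate solution produced by the agent.
    with open(path) as f:
        candidate = json.load(f)
    result = {"score": score(candidate)}
    # Emit a machine-readable result for the evaluation harness.
    print(json.dumps(result))
    return result

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

A real `verification/evaluator.py` would typically also enforce constraints and wall-clock limits before reporting a score.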
Each Task should contain the following file structure:
<Domain_Name>/ # Level 1 Directory: Domain Name (e.g., Astrodynamics)
├── README.md # [Required] Domain Overview (Default entry, EN or CN): Background & sub-task index
├── README_zh-CN.md # [Optional] Domain Overview (Chinese version. Used only if README.md is in English)
├── <Task_Name_A>/ # Level 2 Directory: Specific Task Name (e.g., MannedLunarLanding)
│ ├── README.md # [Required] Navigation Doc: File structure, how to run & quick start
│ ├── README_zh-CN.md # [Optional] Navigation Doc (Chinese version)
│ ├── Task.md # [Required] Task Detail Doc: Core doc including background, physical model, I/O definitions
│ ├── Task_zh-CN.md # [Optional] Task Detail Doc (Chinese version)
│ ├── references/ # References Directory
│ │ ├── constants.json # Physical constants, simulation parameters, etc.
│ │ └── manuals.pdf # Domain knowledge manual, physical equations, or constraints docs
│ ├── frontier_eval/ # [Required] Unified-task metadata for Frontier Eval onboarding
│ │ ├── initial_program.txt # Initial editable program path (relative to task root)
│ │ ├── eval_command.txt # Evaluation command template used by `task=unified`
│ │ ├── agent_files.txt # Context files exposed to the agent
│ │ ├── artifact_files.txt # Output files/logs to collect after evaluation
│ │ └── constraints.txt # Optional task-specific constraints/instructions
│ ├── verification/ # Verification & Scoring System
│ │ ├── evaluator.py # [Core] Scoring script entry point
│ │ ├── requirements.txt # Dependencies required for the scoring environment
│ │ └── docker/ # Environment containerization configuration
│ │ └── Dockerfile # Ensures consistency of the evaluation environment
│ └── baseline/ # [Optional] Baseline Solution / Example Code
│ ├── solution.py # Reference code implementation
│ └── result_log.txt # Execution log or scoring result of the reference code
└── <Task_Name_B>/ # Another task under this domain
└── ...
The above directory structure serves only as a reference template. Contributors may adjust the file organization based on specific circumstances, provided that all core elements (e.g., background, input/output, evaluation metrics) are included. Additionally, there are no restrictions on the programming language and format of the verification code.
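As a purely illustrative sketch, the `frontier_eval/` metadata files might contain entries like the following. The file contents below are invented placeholders; the actual schema is defined in `frontier_eval/README.md` and takes precedence.

```text
# frontier_eval/initial_program.txt  (editable program path, relative to task root)
scripts/init.py

# frontier_eval/eval_command.txt     (command template run by task=unified)
python verification/evaluator.py scripts/init.py

# frontier_eval/agent_files.txt      (context files exposed to the agent, one per line)
Task.md
references/constants.json
```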
New benchmark contributions must be onboarded through the unified task format. In practice, this means adding benchmark-local metadata under `<Task_Name>/frontier_eval/` and validating the task with `task=unified`. Adding a new custom task under `frontier_eval/tasks/<task>/...` is an exception path that should only be used when the unified format is demonstrably insufficient and the maintainer team has agreed on the exception first. See `frontier_eval/README.md` for the full unified metadata schema.
- Keep test commands as short as possible (ideally single-line commands). Testing is mandatory before submission!
- `python verification/evaluator.py scripts/init.py` — run under the benchmark directory, using `verification/evaluator.py` as the evaluation entry point. The target of the test, i.e., the target of agent evolution, is `scripts/init.py`.
- `python -m frontier_eval task=unified task.benchmark=<Domain_Name>/<Task_Name> algorithm.iterations=0` — framework compatibility verification for new benchmark contributions. Please document the exact unified benchmark id and any required runtime overrides (for example `task.runtime.conda_env=...`) in the README, and explicitly call out any benchmark-specific environment setup (extra envs, Docker, `third_party/`, custom `python_path`, etc.).
- Please avoid files containing private information, such as: `.env`, API keys, IDE configurations (`.vscode/`), and temporary files (`*.log`, `temp/`, `__pycache__`, and personal test scripts). Also, please check that the submitted content does not contain absolute paths, to avoid reproducibility issues and privacy leaks.
- EVOLVE-BLOCK Markers (Required for ShinkaEvolve / ABMCTS): The file evolved by the agent (e.g., `scripts/init.py`, or language-specific baselines like `malloclab-handout/mm.c`) must include `EVOLVE-BLOCK-START` and `EVOLVE-BLOCK-END` markers to define the only editable region.
  - Keep the marker lines intact, and keep all code outside the markers read-only (CLI/I/O contracts, constraint checks, evaluator glue, etc.).
  - Use the correct comment style for your language:
    - Python: `# EVOLVE-BLOCK-START` / `# EVOLVE-BLOCK-END`
    - C/C++/CUDA/Rust/Swift: `// EVOLVE-BLOCK-START` / `// EVOLVE-BLOCK-END`
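As an illustrative sketch of the marker layout (the function name and allocation logic below are placeholders, not from any real task), an evolvable Python file might look like:

```python
# Illustrative layout for an evolvable file (e.g. scripts/init.py).
# Only the region between the markers is editable by the agent.

# EVOLVE-BLOCK-START
def propose_solution(budget: float) -> list:
    """Candidate strategy; the agent may freely rewrite this block."""
    # Placeholder: split the budget evenly between two components.
    return [budget / 2, budget / 2]
# EVOLVE-BLOCK-END

# Read-only evaluator glue outside the markers (do not edit).
if __name__ == "__main__":
    allocation = propose_solution(10.0)
    assert abs(sum(allocation) - 10.0) < 1e-9
    print("feasible allocation:", allocation)
```

Keeping the I/O contract outside the markers means every evolved variant can be scored by the same unchanged evaluator.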
We adopt the standard GitHub collaboration flow:
- Fork this Repository: Click the "Fork" button in the top right corner to copy the project to your GitHub account.
- Create Branch:
- Clone your Fork locally.
- Create a new branch for development, recommended naming format: `feat/<Domain>/<TaskName>` (e.g., `feat/Astrodynamics/MarsLanding`).
- Add/Modify Content:
- Add your engineering problem files following the submission format above.
- Ensure all necessary explanatory documentation and verification code are included.
- Local Test: Run `evaluator.py` or build the Docker image to ensure the evaluation logic is correct and runs normally.
- Submit Pull Request (PR):
- Push changes to your remote Fork.
- Initiate a Pull Request to the `main` branch of this repository.
- PR Description: Please briefly explain the background, source, and how to run the verification code for the Task.
- Code Review:
- Agent Review: After submitting the PR, an AI Agent will first conduct an automated preliminary review (including code standards, basic logic verification, etc.) and may propose modifications directly in the PR.
- Maintainer Review: After the Agent review passes, maintainers will conduct a final re-check. Once confirmed correct, your contribution will be merged.
💡 If this is your first contribution or you have questions about the directory structure, feel free to submit an Issue for discussion first.
The table below lists the current coverage of domain tasks in the Benchmark. We welcome not only code contributions but also ideas for challenging new engineering problems from the community.
Note: the current effective v1 benchmark pool contains 47 tasks. MuonTomography remains listed below for completeness, but is temporarily excluded from the effective v1 pool pending objective / evaluator redesign.
| Domain | Task Name | Status | Contributor | Reviewer | Remarks | Version |
|---|---|---|---|---|---|---|
| Astrodynamics | MannedLunarLanding | Completed | @jdp22 | @jdp22 | Lunar soft landing trajectory optimization | v1 |
| ParticlePhysics | MuonTomography | Completed | @SeanDF333 | @ahydchh | Muon detector placement optimization under flux, budget, and excavation constraints; temporarily excluded from the effective v1 pool pending redesign | |
| | ProtonTherapyPlanning | Completed | @SeanDF333 | @ahydchh | IMPT dose weight optimization under tumor coverage, OAR safety, and beam cost constraints | |
| Kernel Engineering | MLA | Completed | @ahydchh | @ahydchh | GPUMode | v1 |
| | TriMul | Completed | @ahydchh | @ahydchh | GPUMode | v1 |
| | FlashAttention | Completed | @Geniusyingmanji | @ahydchh | Optimize a causal scaled dot-product attention forward kernel for GPU execution | v1 |
| Single Cell Analysis | denoising | Completed | @ahydchh | @ahydchh | Open Problems in Single-Cell Analysis | |
| | perturbation_prediction | Completed | @llltttwww | @llltttwww | NeurIPS 2023 scPerturb | |
| | predict_modality | Completed | @llltttwww | @llltttwww | NeurIPS 2021, RNA→ADT | v1 |
| QuantumComputing | routing qftentangled | Completed | @ahydchh | @ahydchh | Routing-Oriented Optimization | v1 |
| | clifford t synthesis | Completed | @ahydchh | @ahydchh | Clifford+T Synthesis Optimization | v1 |
| | cross target qaoa | Completed | @ahydchh | @ahydchh | Cross-Target Robust Optimization | v1 |
| Cryptographic | AES-128 CTR | Completed | @ahydchh | @ahydchh | Advanced Encryption Standard, 128-bit key, Counter mode | v1 |
| | SHA-256 | Completed | @ahydchh | @ahydchh | Secure Hash Algorithm 256-bit | v1 |
| | SHA3-256 | Completed | @ahydchh | @ahydchh | Secure Hash Algorithm 3 256-bit | v1 |
| CommunicationEngineering | LDPCErrorFloor | Completed | @WayneJin0918 | @ahydchh | LDPC code error floor estimation using importance sampling for trapping sets | |
| | PMDSimulation | Completed | @WayneJin0918 | @ahydchh | Polarization Mode Dispersion simulation with importance sampling for rare outage events | |
| | RayleighFadingBER | Completed | @WayneJin0918 | @ahydchh | BER analysis under Rayleigh fading with importance sampling for deep fade events | |
| EnergyStorage | BatteryFastChargingProfile | Completed | @kunkun04 | @ahydchh | Fast-charge current-profile optimization for a lithium-ion cell under voltage, thermal, and degradation constraints | v1 |
| | BatteryFastChargingSPMe | Completed | @kunkun04 | @ahydchh | Staged fast-charge optimization under a reduced SPMe-T-Aging style electrochemical, thermal, plating, and aging model | v1 |
| SustainableDataCenterControl | hand_written_control | Completed | @ahydchh | @ahydchh | SustainDC joint control benchmark for load shifting, cooling, and battery dispatch through the unified evaluation pipeline | v1 |
| ReactionOptimisation | snar_multiobjective | Completed | @ahydchh | @ahydchh | Continuous-flow SnAr reaction optimization with a Pareto trade-off between productivity and waste | v1 |
| | mit_case1_mixed | Completed | @ahydchh | @ahydchh | Mixed-variable reaction yield maximization with continuous process settings and a categorical catalyst | v1 |
| | reizman_suzuki_pareto | Completed | @ahydchh | @ahydchh | Reizman Suzuki emulator Pareto optimization over catalyst choice and operating conditions | v1 |
| | dtlz2_pareto | Completed | @ahydchh | @ahydchh | DTLZ2 Pareto-front approximation task integrated through the unified evaluation pipeline | |
| MolecularMechanics | weighted_parameter_coverage | Completed | @ahydchh | @ahydchh | Rare force-field parameter coverage under a molecule budget | |
| | diverse_conformer_portfolio | Completed | @ahydchh | @ahydchh | Low-energy, high-diversity conformer portfolio selection | |
| | torsion_profile_fitting | Completed | @ahydchh | @ahydchh | Force-field torsion-scale fitting against target energy profiles | |
| Optics | adaptive_constrained_dm_control | Completed | @ahydchh | @ahydchh | Constrained deformable mirror control | |
| | adaptive_temporal_smooth_control | Completed | @ahydchh | @ahydchh | Temporal smoothness versus correction quality | v1 |
| | adaptive_energy_aware_control | Completed | @ahydchh | @ahydchh | Energy-aware adaptive optics control | |
| | adaptive_fault_tolerant_fusion | Completed | @ahydchh | @ahydchh | Fault-tolerant multi-WFS fusion | v1 |
| | phase_weighted_multispot_single_plane | Completed | @ahydchh | @ahydchh | Single-plane weighted multispot phase DOE | |
| | phase_fourier_pattern_holography | Completed | @ahydchh | @ahydchh | Fourier pattern holography | v1 |
| | phase_dammann_uniform_orders | Completed | @ahydchh | @ahydchh | Dammann grating uniform diffraction orders | v1 |
| | phase_large_scale_weighted_spot_array | Completed | @ahydchh | @ahydchh | Large-scale weighted spot array synthesis | |
| | fiber_wdm_channel_power_allocation | Completed | @ahydchh | @ahydchh | WDM channel and launch power allocation | v1 |
| | fiber_mcs_power_scheduling | Completed | @ahydchh | @ahydchh | Joint MCS and power scheduling | v1 |
| | fiber_dsp_mode_scheduling | Completed | @ahydchh | @ahydchh | Receiver DSP mode scheduling | |
| | fiber_guardband_spectrum_packing | Completed | @ahydchh | @ahydchh | Spectrum packing with guard-band constraints | v1 |
| | holographic_multifocus_power_ratio | Completed | @ahydchh | @ahydchh | Multi-focus power ratio control | v1 |
| | holographic_multiplane_focusing | Completed | @ahydchh | @ahydchh | Multi-plane holographic focusing | v1 |
| | holographic_multispectral_focusing | Completed | @ahydchh | @ahydchh | Multispectral holographic focusing | |
| | holographic_polarization_multiplexing | Completed | @ahydchh | @ahydchh | Polarization-multiplexed holography | |
| Computer Systems | Malloc Lab | Completed | @ahydchh | @ahydchh | Dynamic memory allocation | v1 |
| | DuckDBWorkloadOptimization | Completed | @DocZbs | @DocZbs | Index/materialized-view selection and query rewriting optimization on official DuckDB workloads | |
| EngDesign | CY_03, WJ_01, XY_05, AM_02, AM_03, YJ_02, YJ_03 | Completed | @ahydchh | @ahydchh | EngDesign | v1 |
| InventoryOptimization | tree_gsm_safety_stock | Completed | @ahydchh | @ahydchh | Tree-structured multi-echelon safety-stock placement (GSM) | v1 |
| | general_meio | Completed | @ahydchh | @ahydchh | General-topology MEIO with simulation-based objective | v1 |
| | joint_replenishment | Completed | @ahydchh | @ahydchh | Multi-SKU joint replenishment with shared setup cost | v1 |
| | finite_horizon_dp | Completed | @ahydchh | @ahydchh | Finite-horizon stochastic inventory control via time-varying policy | v1 |
| | disruption_eoqd | Completed | @ahydchh | @ahydchh | EOQ lot-sizing optimization under supply disruptions | v1 |
| PyPortfolioOpt | robust_mvo_rebalance | Completed | @ahydchh | @ahydchh | Robust mean-variance rebalancing with sector/factor/turnover constraints | v1 |
| | cvar_stress_control | Completed | @ahydchh | @ahydchh | CVaR stress-controlled portfolio allocation under return and exposure constraints | |
| | discrete_rebalance_mip | Completed | @ahydchh | @ahydchh | Discrete lot-constrained rebalancing with mixed-integer optimization | |
| JobShop | abz | Completed | @ahydchh | @ahydchh | Classical JSSP ABZ family (Adams, Balas, Zawack 1988) | v1 |
| | ft | Completed | @ahydchh | @ahydchh | Classical JSSP FT family (Fisher and Thompson 1963) | |
| | la | Completed | @ahydchh | @ahydchh | Classical JSSP LA family (Lawrence 1984) | |
| | orb | Completed | @ahydchh | @ahydchh | Classical JSSP ORB family (Applegate and Cook 1991) | |
| | swv | Completed | @ahydchh | @ahydchh | Classical JSSP SWV family (Storer, Wu, Vaccari 1992) | v1 |
| | ta | Completed | @ahydchh | @ahydchh | Classical JSSP TA family (Taillard 1993) | v1 |
| | yn | Completed | @ahydchh | @ahydchh | Classical JSSP YN family (Yamada and Nakano 1992) | |
| StructuralOptimization | ISCSO2015 | Completed | @yks23 | @yks23 | 45-bar 2D truss size + shape | v1 |
| | ISCSO2023 | Completed | @yks23 | @yks23 | 284-member 3D truss sizing | v1 |
| | TopologyOptimization | Completed | @Geniusyingmanji | @ahydchh | MBB beam 2D topology optimization (SIMP), continuous, volume-constrained, compliance minimization | v1 |
| | PyMOTOSIMPCompliance | Completed | @DocZbs | @DocZbs | pyMOTO-based 2D beam topology optimization (SIMP + OC/MMA) under a volume-fraction constraint | |
| Robotics | DynamicObstacleAvoidanceNavigation | Completed | @MichaelCaoo | @yks23 | Navigate a differential-drive robot from start to goal | v1 |
| | QuadrupedGaitOptimization | Completed | @MichaelCaoo | @yks23 | Maximize the forward locomotion speed of a quadruped robot by optimizing 8 gait parameters | v1 |
| | RobotArmCycleTimeOptimization | Completed | @MichaelCaoo | @yks23 | Minimize the motion time of a 7-DOF KUKA LBR iiwa arm moving from a start to a goal configuration, collision-free | v1 |
| | PIDTuning | Completed | @Geniusyingmanji | @ahydchh | Tune a cascaded PID controller for a 2D quadrotor across multiple flight scenarios | v1 |
| | UAVInspectionCoverageWithWind | Completed | @MichaelCaoo | @ahydchh | UAV inspection under wind field disturbance | v1 |
| | CoFlyersVasarhelyiTuning | In Progress | @DocZbs | @DocZbs | Tune the original CoFlyers Vasarhelyi flocking parameters | |
| Aerodynamics | CarAerodynamicsSensing | Completed | @LeiDQ, @llltttwww | @llltttwww | Sensor placement on 3D car surface for pressure field reconstruction | v1 |
| | DawnAircraftDesignOptimization | Completed | @DocZbs | @DocZbs | Jointly optimize wing, fuselage, and propulsion variables under cruise/endurance/payload constraints to minimize total aircraft mass | |
| WirelessChannelSimulation | HighReliableSimulation | Completed | @tonyhaohan | @yks23, @ahydchh | BER estimation with importance sampling for Hamming(127,120) | v1 |
| PowerSystems | EV2GymSmartCharging | Completed | @DocZbs | @DocZbs | Upstream-aligned EV smart charging | |
| AdditiveManufacturing | DiffSimThermalControl | Completed | @DocZbs | @DocZbs | Study process optimization in additive manufacturing using differentiable simulation | |
💡 Have an idea for a new engineering problem? Even if you cannot provide complete verification code for now, we highly welcome you to share good Task concepts! Please create an Issue detailing the real-world background and engineering value of the problem. After discussion and confirmation, we will add it to the table above to rally community power to solve it together.
An initial integration between some evaluation algorithms and benchmarks has been implemented. The core implementation is located in ./frontier_eval. For usage instructions, see the Evaluation README. Note: some optional algorithms/benchmarks require extra repos under third_party/ (local clones); the Evaluation README documents how to set them up.
Welcome to our developer community! Whether you want to discuss new engineering problem concepts, find task collaborators, or encounter technical issues during your contribution, you can always communicate with us in the group.
- 🟢 Feishu (Lark): Click here to join our Feishu discussion group
- 🔜 Discord: Click here to join our Discord community