MLS-Bench is a benchmark for machine learning science. Where most agent benchmarks reward engineering one fixed instance — clean the data, tune the pipeline, climb a leaderboard — MLS-Bench asks the harder question: can an AI agent propose a new component, loss, optimizer, or training procedure whose gain transfers across settings, seeds, datasets, and scales?
The benchmark contains 140 tasks across 12 ML research domains. Each task fixes a research scaffold, gives the agent the relevant source code and strong baseline implementations, then asks for one algorithmic change inside a constrained edit surface.
- 2026.6: More efficient on larger GPUs. A new `compute_scale` option lets the LLM pretraining and reinforcement-learning tasks (and, optionally, the other tasks) run more efficiently on H200-class GPUs without changing results. See issue #4 and PR #9 for the design.
- 2026.5: Harbor support. An official Harbor-compatible runtime and pre-rendered task images are available on Docker Hub under `bohanlyu2022/mlsbench-harbor-*`. See `harbor/README.md`.
- 2026.5: Stronger Sparse L0 Adversarial Attack task. Upgraded to the canonical Sparse-RS L0 threat model (k=24, untargeted) against three adversarially robust RobustBench L2 CIFAR-10 targets (Rebuffi-R18 / Augustin / Engstrom). Strong attacks no longer trivially saturate, leaving real headroom to measure genuine attack improvements.
- 2026.5: Scoring. The main results table in the arXiv paper previously aggregated tasks within each area by geometric mean; it now uses the arithmetic mean for easier comparison with the per-task numbers. Rankings are unchanged and no conclusions are affected.

    pip install -e ".[agent]"

Python 3.10+ is required. MLS-Bench separates the choice of runtime backend from the choice of job scheduler. The two can be combined freely, with one exception: the local Conda backend should not be paired with SLURM.

- Runtime backends: Docker, Apptainer, or local Conda, selected in your config file via `container_runtime`.
- Job schedulers: SLURM (when a `slurm:` section is present in the config) or the built-in single-node GPU scheduler.

    container_runtime: docker  # docker, apptainer, or local

Recommended setup: Docker or Apptainer for the runtime, with SLURM as the job scheduler. If SLURM is unavailable, the built-in scheduler can be combined with any of the three runtimes. If neither a container runtime nor SLURM is available, the local Conda backend together with the built-in scheduler provides a complete fallback (see the section below).

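The selection rules above fit in a few lines of Python. This is a sketch under stated assumptions, not MLS-Bench's actual config loader: the helper name `resolve_backend`, the scheduler labels it returns, and the choice to raise on a missing `container_runtime` are all hypothetical. Only the `container_runtime` key, its three values, and the `slurm:` section come from the text.

```python
VALID_RUNTIMES = {"docker", "apptainer", "local"}


def resolve_backend(config: dict) -> tuple[str, str]:
    """Return (runtime, scheduler) for a parsed config mapping.

    Hypothetical helper for illustration; not part of the mlsbench package.
    """
    runtime = config.get("container_runtime")
    if runtime not in VALID_RUNTIMES:
        # Assumption: a missing or unknown runtime is treated as an error.
        raise ValueError(f"container_runtime must be one of {sorted(VALID_RUNTIMES)}")
    # A `slurm:` section selects SLURM; otherwise the built-in scheduler is used.
    scheduler = "slurm" if "slurm" in config else "builtin"
    # The local Conda backend is documented as not combinable with SLURM.
    if runtime == "local" and scheduler == "slurm":
        raise ValueError("the local Conda backend should not be combined with SLURM")
    return runtime, scheduler
```

For example, `resolve_backend({"container_runtime": "local"})` yields `("local", "builtin")`, which is the fallback setup described above.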
Running with local Conda environments and the built-in scheduler
When neither Docker nor Apptainer is available, MLS-Bench can build a
dedicated Conda environment per package and dispatch jobs through a
single-node GPU queue (src/mlsbench/scheduler.py). This backend is
intended for development and small-scale experimentation; for full-scale
benchmarking on a cluster we recommend SLURM with one of the container
runtimes instead. The Conda backend should not be combined with SLURM,
since both attempt to schedule GPU jobs.
- Use a config with `container_runtime: local` and no `slurm:` section. Throughout this section we refer to it as `configs/local.yaml`.
- Build the environment for each package:

      mlsbench build <package> --config configs/local.yaml

- Start the GPU scheduler:

      nohup python -m mlsbench.scheduler start \
        --gpus 0,1,2,3 \
        --config configs/local.yaml \
        > .scheduler/scheduler.log 2>&1 &

- Launch agents or baselines. They enqueue jobs to the scheduler and return immediately:

      PYTHONPATH=src nohup python3 -m mlsbench agent <task> --model <model> \
        --config configs/local.yaml \
        > .scheduler/logs/agent_<task>.log 2>&1 &

- Inspect or manage the queue:

      python -m mlsbench.scheduler status
      python -m mlsbench.scheduler list
      python -m mlsbench.scheduler cancel <job_id>
      python -m mlsbench.scheduler clear

To rebuild a package's environment from scratch, remove it with
conda env remove -n mlsbench-<package> and re-run mlsbench build.
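
As a mental model for the single-node GPU queue, the toy class below hands each waiting job one free GPU in arrival order. It is an illustration only and makes no claim about the logic in `src/mlsbench/scheduler.py`. The class name `GpuQueue`, its methods, and the one-GPU-per-job policy are all assumptions.

```python
from collections import deque


class GpuQueue:
    """Toy FIFO queue that gives each waiting job one free GPU id."""

    def __init__(self, gpu_ids):
        self.free = list(gpu_ids)   # GPUs that are currently idle
        self.pending = deque()      # job ids waiting for a GPU
        self.running = {}           # job id -> GPU id

    def enqueue(self, job_id):
        # Enqueueing never blocks, so the caller can return immediately.
        self.pending.append(job_id)

    def dispatch(self):
        # Start as many pending jobs as there are free GPUs, oldest first.
        started = []
        while self.pending and self.free:
            job_id = self.pending.popleft()
            self.running[job_id] = self.free.pop(0)
            started.append(job_id)
        return started

    def finish(self, job_id):
        # Return the finished job's GPU to the idle pool.
        self.free.append(self.running.pop(job_id))

    def cancel(self, job_id):
        # Drop a job that has not started yet; do nothing otherwise.
        if job_id in self.pending:
            self.pending.remove(job_id)
```

With two GPUs and three queued jobs, the first `dispatch()` starts two jobs and the third waits until `finish()` frees a GPU.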
Running an agent requires an API key for the model provider you choose. If you enable the optional web-search tool, a Tavily key is also required. Configure keys in either of two equivalent ways:
1. Inline in your config file under the `providers:` block. This is useful when you want to keep separate configs per environment or per project:

       providers:
         openai:
           api_key: "sk-..."
         anthropic:
           api_key: "sk-ant-..."
         openrouter:
           api_key: "sk-or-..."
           base_url: "https://openrouter.ai/api/v1"
         deepseek:
           api_key: "sk-..."
           base_url: "https://api.deepseek.com/v1"
         tavily:
           api_key: "tvly-..."  # only needed if the web_search tool is enabled

2. Environment variables. Leave the `api_key` field empty (or omit the provider entirely) and the CLI falls back to the standard env var for that provider:

| Provider | Env var |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| OpenRouter | OPENROUTER_API_KEY_NEW |
| DeepSeek | DEEPSEEK_API_KEY |
| Qwen / DashScope | QWEN_API_KEY / DASHSCOPE_API_KEY |
| Gemini / Google | GEMINI_API_KEY / GOOGLE_API_KEY |
| Kimi / Moonshot | KIMI_API_KEY / MOONSHOT_API_KEY |
| GLM | GLM_API_KEY |
| MiniMax | MINIMAX_API_KEY |
| Tavily (web search) | TAVILY_API_KEY |
You can also use ${ENV_VAR} interpolation inside the YAML
(api_key: "${OPENAI_API_KEY}") when you want a tracked config file that
still resolves the secret from the environment at runtime.
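
The lookup order described above (inline value, then `${ENV_VAR}` interpolation, then the provider's standard env var) can be sketched as follows. This is an illustration, not the CLI's real resolver: the function name `resolve_api_key`, the partial `ENV_FALLBACKS` map, and the left-to-right preference between alternative env vars are assumptions.

```python
import os
import re

# Env-var fallbacks for a few providers, copied from the table above.
ENV_FALLBACKS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "deepseek": ["DEEPSEEK_API_KEY"],
    "gemini": ["GEMINI_API_KEY", "GOOGLE_API_KEY"],
    "tavily": ["TAVILY_API_KEY"],
}

_INTERP = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_api_key(provider, providers_cfg, env=None):
    """Return the key for `provider`, or None if nothing is configured."""
    env = os.environ if env is None else env
    raw = (providers_cfg.get(provider) or {}).get("api_key") or ""
    # Expand ${ENV_VAR} references; unset variables expand to "".
    key = _INTERP.sub(lambda m: env.get(m.group(1), ""), raw)
    if key:
        return key
    # Empty or missing api_key: try the provider's env vars in listed order.
    for name in ENV_FALLBACKS.get(provider, []):
        if env.get(name):
            return env[name]
    return None
```

Passing `env` explicitly keeps the function easy to test; in normal use it reads `os.environ`.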
The model string passed to mlsbench agent --model <name> selects the
provider automatically:
- Bare names are dispatched by their well-known prefix: `claude-*` → `providers.anthropic`, `gpt-* / o1 / o3 / o4` → `providers.openai`, `deepseek-*` → `providers.deepseek`, `qwen-*` → `providers.qwen`, `gemini-*` → `providers.gemini`, `kimi-* / moonshot-*` → `providers.kimi`, `glm-*` → `providers.glm`, `minimax-*` → `providers.minimax`.
- Prefixed names (`<provider>/<model>`, e.g. `openai/gpt-5.4`, `vertex_ai/...`, `openrouter/anthropic/claude-opus-4.6`) dispatch generically to the matching `providers.<provider>` entry. Point that entry's `base_url` at whichever upstream you want (direct API, OpenRouter, a LiteLLM proxy, etc.) and the same key is reused.

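
The two dispatch rules above can be written as a small lookup. This is a sketch, not the CLI's actual code: the function name `provider_for`, the treatment of `o1` / `o3` / `o4` as plain string prefixes, and the error raised for unknown names are assumptions.

```python
# Prefix -> providers.<name> entry, transcribed from the list above.
BARE_PREFIXES = [
    ("claude-", "anthropic"),
    ("gpt-", "openai"),
    ("deepseek-", "deepseek"),
    ("qwen-", "qwen"),
    ("gemini-", "gemini"),
    ("kimi-", "kimi"),
    ("moonshot-", "kimi"),
    ("glm-", "glm"),
    ("minimax-", "minimax"),
]
# o1 / o3 / o4 are listed without a trailing dash, so match them as-is.
OPENAI_O_SERIES = ("o1", "o3", "o4")


def provider_for(model: str) -> str:
    """Return the providers.<name> entry a model string would select."""
    if "/" in model:
        # Prefixed form: the text before the first slash names the provider.
        return model.split("/", 1)[0]
    for prefix, provider in BARE_PREFIXES:
        if model.startswith(prefix):
            return provider
    if model.startswith(OPENAI_O_SERIES):
        return "openai"
    # Assumption: an unrecognized bare name is an error.
    raise ValueError(f"no provider known for model {model!r}")
```

Under these rules `openrouter/anthropic/claude-opus-4.6` selects `providers.openrouter`, because only the text before the first slash is consulted.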
Fetch external packages and build the runtime (data dependencies are prepared automatically as part of the build):

    mlsbench fetch --name <package>
    mlsbench build <package> --config configs/react.yaml

Run an agent and compute its task score:

    mlsbench agent <task> --model <model> --config configs/react.yaml
    mlsbench score task <task>

Baseline scores are already populated in each task's `leaderboard.csv`, so running an agent alone is sufficient to obtain its normalized score under the MLS-Bench evaluation framework. Before launching the agent, however, we recommend running one baseline first to confirm that your environment is set up correctly:

    mlsbench baseline <task> --name <baseline> --config configs/react.yaml

Baselines and agents share the same task scripts, parsers, seeds, resource limits, and leaderboard code; only the source of the edits differs.

To avoid building each package from source, prebuilt images are published for every supported package:
- Docker Hub: `bohanlyu2022/mlsbench-<pkg>:latest` (https://hub.docker.com/u/bohanlyu2022)
- Hugging Face (Apptainer SIFs): `sif/<Pkg>.sif` inside the Bohan22/MLS-Bench-Tasks dataset

mlsbench agent, mlsbench baseline, and mlsbench build automatically
pull the prebuilt image when the local image is missing, and fall back to
building from source on failure. mlsbench run performs the same lookup
but does not build from source; run mlsbench build <pkg> first if a
local build is required.
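
The pull-then-build fallback described above has roughly the following shape. This is a sketch under assumptions: `ensure_image` and its callable parameters (`has_local`, `pull`, `build`) are hypothetical stand-ins for the real runtime operations, which the text does not spell out.

```python
def ensure_image(pkg, has_local, pull, build, allow_build=True):
    """Return how the image for `pkg` was obtained.

    has_local, pull, and build are caller-supplied callables; they are
    placeholders for illustration, not mlsbench functions.
    """
    if has_local(pkg):
        return "local"
    try:
        pull(pkg)       # try the prebuilt image first
        return "pulled"
    except Exception:
        if not allow_build:
            # Mirrors `mlsbench run`: same lookup, but no source build.
            raise
        build(pkg)      # fall back to building from source
        return "built"
```

Setting `allow_build=False` models the `mlsbench run` behavior: the pull failure propagates instead of triggering a build.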
Two mutually-exclusive flags force a specific source for mlsbench build:

    mlsbench build <package> --pull         # use only the prebuilt image
    mlsbench build <package> --local-build  # build locally from the Dockerfile / .def

For Apptainer, the SIF can be obtained either via `apptainer pull docker://...` (default) or from the Hugging Face mirror, a direct HTTPS download of `sif/<Pkg>.sif`, which can be faster on networks where Docker registries are slow. Select the source with `--sif-source {docker,hf,auto}` on `mlsbench build`.

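
For readers wiring similar flags into their own tooling, the snippet below shows one way to express them with Python's `argparse`. It is not the `mlsbench` parser: the function name `make_build_parser` is hypothetical, and the `docker` default for `--sif-source` is inferred from the "(default)" remark above.

```python
import argparse


def make_build_parser():
    """Build a parser that mimics the documented `mlsbench build` flags."""
    parser = argparse.ArgumentParser(prog="mlsbench build")
    parser.add_argument("package")
    # --pull and --local-build cannot be given together.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--pull", action="store_true",
                        help="use only the prebuilt image")
    source.add_argument("--local-build", action="store_true",
                        help="build locally from the Dockerfile / .def")
    # Default inferred from the text; treat it as an assumption.
    parser.add_argument("--sif-source", choices=["docker", "hf", "auto"],
                        default="docker")
    return parser
```

Passing both `--pull` and `--local-build` makes `argparse` exit with a usage error, which is the standard behavior of a mutually exclusive group.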
MLS-Bench's 140 tasks are also available as a Harbor
dataset so any Harbor-supported agent (claude-code, codex, openhands,
terminus-2, …) can be evaluated on the suite without going through this
repository's own runner:

    PYTHONPATH=. harbor run -c run.yaml -a claude-code -m anthropic/claude-opus-4-7

The pre-rendered dataset, GPU-capable environment plugin, and reference
Harbor config live under harbor/. See
harbor/README.md for usage details and the
self-contained per-task layout.
src/mlsbench/ CLI, agent loop, execution backends, scoring
tasks/<task>/ 140 task definitions, parsers, scores, baselines
vendor/packages.yaml External package registry
vendor/pkg_configs/<package>/ Package runtime configs and pre-edit patches
vendor/data_scripts/ Dataset and model-cache preparation scripts
configs/react.yaml Runtime and provider configuration
configs/openevolve.yaml OpenEvolve defaults
configs/discover.yaml Discover defaults
harbor/ Pre-rendered Harbor dataset (140 tasks) + run config
Fetched upstream repositories, built images, downloaded datasets, run workspaces, logs, and scheduler state are intentionally not versioned.
Show the 140-task appendix table
| Area | Directory shorthand | Task | Research question | External package(s) | Baselines | Evaluation settings |
|---|---|---|---|---|---|---|
| LM | agent-tool-reasoning | LLM Agent Tool-Use Reasoning Strategy | Studies how tool-use search, backtracking, and stopping policies affect answer validity and query efficiency. | zhichengg/StableToolBench | Greedy Chain (CoT) DFS with LLM Ranking DFSDT | StableToolBench I1-instruction 50q / deepseek-chat StableToolBench I1-instruction 50q / qwen2.5-72b-instruct StableToolBench I1-instruction 50q / qwen2.5-7b-instruct |
| LM | llm-dllm-demask-strategy | Masked Diffusion LM: Demasking Strategy | Studies how demasking schedules, position selection, and token assignment affect diffusion language-model quality and decoding efficiency. | ML-GSAI/LLaDA | Top-K Margin Confidence Greedy KLASS | LLaDA / MATH-500 LLaDA / HumanEval Dream / C4 prefix continuation |
| LM | llm-pretrain-attention | Autoregressive Attention Mechanism | Studies how self-attention computation and positional handling affect autoregressive pretraining loss and downstream accuracy. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness | QK-Norm RoPE RoPE + QK-Norm | ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| LM | llm-pretrain-bitlinear | Low-Bit Linear Pretraining Layer | Studies how low-bit linear layers and quantization functions affect pretraining loss under discrete weight constraints. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness | Binary Sign (BitNet) Ternary 1.58-bit (BitNet b1.58) INT2 Uniform | ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| LM | llm-pretrain-embedding | Autoregressive Embedding Strategy | Studies how token embeddings, position embeddings, value embeddings, and weight tying affect autoregressive pretraining loss and downstream accuracy. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness | Untied Embeddings Value Embeddings Bigram Hash Embeddings | ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| LM | llm-pretrain-linear-attention | Subquadratic Attention Mechanism | Studies whether linear or subquadratic attention can reduce autoregressive validation loss while preserving downstream performance. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness | RetNet DeltaNet GLA | ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| LM | llm-pretrain-loss | Autoregressive Pretraining Loss | Studies how alternative next-token training losses affect autoregressive validation cross-entropy. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness | Label Smoothing Softcap Cross-Entropy Z-Loss | ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| LM | llm-pretrain-lr-schedule | Pretraining Learning-Rate Schedule | Studies how warmup, decay shape, and schedule horizon affect autoregressive pretraining validation loss. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness | WSD (Warmup-Stable-Decay) Trapezoidal WSD with Inverse-Sqrt Decay | ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| LM | llm-pretrain-mlp | Transformer Feed-Forward Block | Studies how activation, gating, and expansion choices in the feed-forward sublayer affect language-model validation loss. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness | ReLU-Squared SwiGLU GeGLU | ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| LM | llm-pretrain-normalization | Normalization and Block Layout | Studies how normalization placement, affine behavior, and transformer block layout affect pretraining stability and validation loss. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness | RMSNorm RMSNorm + Sandwich-Norm RMSNorm (Parallel Block) | ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| LM | llm-pretrain-optimizer | Pretraining Optimizer Design | Studies how optimizer choice, parameter grouping, and schedule coupling affect autoregressive pretraining validation loss. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness | AdamW + Nesterov Lion Muon | ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| LM | llm-pretrain-residual | Transformer Residual Stream Strategy | Studies how residual connections and information flow across transformer layers affect validation loss, perplexity, and accuracy metrics. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness | Vanilla (Pre-LN) ProRes Learned Scaling Block Attention Residuals | ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| LM | llm-rl-advantage | Reasoning RL Advantage Estimation | Studies how advantage estimates for online language-model reinforcement learning affect mathematical reasoning accuracy. | volcengine/verl | GRPO Dr. GRPO Reinforce++ Baseline | GSM8K MATH-500 AMC |
| LM | llm-rl-importance-sampling | Reasoning RL Importance-Sampling Granularity | Studies how importance-sampling ratio granularity and clipping affect online language-model reinforcement learning for reasoning. | volcengine/verl | Token-Level (Vanilla PPO) Sequence-Level (GSPO) First-K Tokens | GSM8K MATH-500 AMC |
| LM | llm-rl-kl-estimator | Actor Divergence Estimator for Reasoning RL | Studies how per-token actor KL estimation controls reference-policy drift while preserving reasoning accuracy during online RL. | volcengine/verl | K1 (Unbiased Log-Ratio) K2 (Squared Log-Ratio) K3 (Low-Variance KL) Absolute Log-Ratio | GSM8K MATH-500 AMC |
| LM | llm-rl-reward-normalization | Pre-Advantage Reward Normalization | Studies how reward normalization before advantage estimation affects reasoning accuracy in online language-model RL. | volcengine/verl | Outcome-Only (Raw) Group-Std Normalization Batch-Std Whitening Length-Aware Normalization | GSM8K MATH-500 AMC |
| LM | llm-scaling-law-discovery | Symbolic Scaling-Law Discovery | Studies how symbolic functional forms and group-specific coefficients capture held-out scaling behavior. | trevorstephens/gplearn | Human Exact Form SLDAgent-Style Kernel Ridge Regression XGBoost | SLDBench Vocabulary Scaling SLDBench LR x Batch-Size Scaling SLDBench Data-Constrained Scaling |
| LM | mas-topology | Language-Agent Collaboration Topology | Studies how deterministic collaboration topology affects multi-agent code-generation quality and execution success. | OpenBMB/ChatDev | Chain Star Layered | HumanEval-33 (deepseek-chat, 4 agents) HumanEval-33 (qwen2.5-72b-instruct, 4 agents) SRDD-20 (deepseek-chat, 4 agents) |
| Rob | jepa-planning | Latent World-Model Planner | Studies how goal-conditioned planning should exploit a fixed latent world model to improve navigation success. | facebookresearch/eb_jepa | Random CEM MPPI iCEM | Two Rooms (Horizon 30) Two Rooms (Horizon 60) Two Rooms (Horizon 90) |
| Rob | jepa-prediction-loss | Temporal Latent Prediction Loss | Studies how latent prediction objectives affect multi-step video representation quality. | facebookresearch/eb_jepa | MSE Smooth L1 Cosine | Moving MNIST AP (small: henc=16, dstc=8, hpre=16) Moving MNIST AP (base: henc=32, dstc=16, hpre=32) Moving MNIST AP (large: henc=64, dstc=32, hpre=64) |
| Rob | jepa-regularizer | Anti-Collapse Representation Regularizer | Studies how self-supervised regularization prevents representation collapse and improves linear-probe accuracy. | facebookresearch/eb_jepa | Naive VICReg SigReg Barlow Twins | ResNet-18 Probe ResNet-34 Probe ResNet-50 Probe |
| Rob | robo-diffusion-guidance | Diffusion Guidance for Robot Trajectory Planning | Studies guidance mechanisms for a fixed trajectory-level diffusion planner on D4RL MuJoCo, optimizing normalized score across hopper-medium-v2, walker2d-medium-v2, and halfcheetah-medium-v2. | CleanDiffuserTeam/CleanDiffuser | Diffuser (Classifier Guidance) Classifier-Free Guidance No Guidance Decision Diffuser | D4RL Hopper-Medium-v2 D4RL Walker2d-Medium-v2 D4RL HalfCheetah-Medium-v2 |
| Rob | robo-diffusion-policy | Diffusion Policy Learning for Robot Control | Studies how diffusion policy training, value guidance, and action generation affect robot-control episode reward. | CleanDiffuserTeam/CleanDiffuser | DQL (Diffusion Q-Learning) IDQL Diffusion Policy | D4RL Hopper-Medium-v2 D4RL Walker2d-Medium-v2 D4RL HalfCheetah-Medium-v2 |
| Rob | robo-diffusion-sampling-method | Efficient Diffusion Sampling for Robot Actions | Studies how solver choice and sampling_steps affect DQL-style diffusion-policy normalized score at low NFE on D4RL MuJoCo. | CleanDiffuserTeam/CleanDiffuser | DDPM (100-Step Ancestral Sampling) DDIM (20-Step Deterministic Sampling) DPM-Solver++ 2M (10-Step) | D4RL Hopper-Medium-v2 D4RL Walker2d-Medium-v2 D4RL HalfCheetah-Medium-v2 |
| Rob | robo-humanoid-sim2real-algo | Humanoid Transfer Policy Learning | Studies how actor-critic architecture, policy optimization, and rollout processing affect humanoid command-following transfer. | roboterax/humanoid-gym | Default PPO PPO with Adaptive KL PPO with LayerNorm | RobotEra XBot-L Training RobotEra XBot-L / Diverse Commands RobotEra XBot-L / Forward-Only RobotEra XBot-L / High Speed |
| Rob | robomimic-bc-loss | Behavioral Cloning Loss for Manipulation | Studies how imitation-learning loss design affects rollout success for low-dimensional robot manipulation tasks. | ARISE-Initiative/robomimic | NLL with Entropy Weighted NLL Default (NLL) | Tool Hang (PH) Can (PH) Square (PH) |
| Rob | robomimic-iql-vf | Offline Value Loss for Manipulation | Studies how asymmetric value regression loss design affects offline robot manipulation policy success. | ARISE-Initiative/robomimic | Quantile Regression Huber Pinball Default (Expectile) | Tool Hang (PH) Can (PH) Square (PH) |
| Rob | robomimic-obs-encoder | Observation Fusion Encoder for Imitation Learning | Designs a multimodal robot state encoder for behavioral cloning to improve rollout success rate on manipulation tasks. | ARISE-Initiative/robomimic | Attention Fusion Gated Fusion Default (Concatenation) | Tool Hang (PH) Can (PH) Square (PH) |
| Rob | tdmpc2-planning | Trajectory Optimization for Model-Based Planning | An online planning algorithm selects actions through learned-world-model trajectory optimization to improve episode reward. | nicklashansen/tdmpc2 | CEM iCEM MPPI | Walker Walk Cheetah Run Cartpole Swingup |
| Rob | tdmpc2-simnorm | Latent Representation Normalization for Model-Based RL | Designs latent-state normalization for the TD-MPC2 encoder and dynamics world-model networks, evaluated by DMControl episode reward. | nicklashansen/tdmpc2 | SimNorm L2 normalization RMSNorm Identity (no normalization) | DMControl walker-walk DMControl cheetah-run DMControl cartpole-swingup |
| V&G | cv-3dgs-densification | 3D Gaussian Splatting Densification Strategy Design | Designs a 3D Gaussian Splatting densification strategy controlling clone, split, prune, reset, relocation, and sample-add behavior to improve held-out novel-view quality on Mip-NeRF 360 scenes. | nerfstudio-project/gsplat | Original 3DGS densification AbsGS + Taming-3DGS + New Split EDC-TamingGS-Abs | Mip-NeRF 360 garden (8x, best PSNR) Mip-NeRF 360 bicycle (8x, best PSNR) Mip-NeRF 360 bonsai (8x, best PSNR) Mip-NeRF 360 stump (8x, best PSNR) |
| V&G | cv-3dgs-regularizer | 3D Gaussian Splatting Regularizer Design | Designs a scalar regularizer added to the 3DGS photometric loss during 30k-step Mip-NeRF 360 reconstruction, evaluated on held-out novel views and scored by best PSNR. | nerfstudio-project/gsplat | No regularization Scale + opacity L1 Effective-rank + scale/opacity L1 | Mip-NeRF 360 garden (8x, best PSNR) Mip-NeRF 360 bicycle (8x, best PSNR) Mip-NeRF 360 bonsai (8x, best PSNR) Mip-NeRF 360 stump (8x, best PSNR) |
| V&G | cv-dbm-sampler | Custom Sampler for Diffusion Bridge Models | Designs a low-NFE sampler for Diffusion Bridge Models on image-to-image translation, ImageNet center-inpainting, and DIODE depth, evaluated by FID at NFE=5. | thu-ml/DiffusionBridge | DBIM DBIM-HO (high-order) DDBM (50 NFE reference) ECSI | Edges2Handbags / e2h (FID, NFE=5) ImageNet center-inpaint (FID, NFE=5) DIODE depth (FID, NFE=5) |
| V&G | cv-dbm-scheduler | Time Scheduler for Diffusion Bridge Models (NFE=5) | Designs a monotone low-step time schedule for Diffusion Bridge Models, evaluated by FID on Edges2Handbags, ImageNet center-inpainting, and DIODE depth at NFE=5. | thu-ml/DiffusionBridge | Karras EDM (rho=7) Uniform (linear) Cosine (Nichol-Dhariwal) Log-linear (geometric) | Edges2Handbags / e2h (FID, NFE=5) ImageNet center-inpaint (FID, NFE=5) DIODE depth (FID, NFE=5) |
| V&G | cv-diffusion-architecture | Diffusion Model Architecture Design | Design a denoising UNet backbone for unconditional CIFAR-10 DDPM training, optimizing best FID with fixed epsilon prediction and 50-step DDIM sampling. | huggingface/diffusers | Standard DDPM U-Net Full-Attention U-Net No-Attention U-Net | CIFAR-10 DDPM Small CIFAR-10 DDPM Medium CIFAR-10 DDPM Large |
| V&G | cv-diffusion-cfg | Diffusion Model: Classifier-Free Guidance Optimization | Design a classifier-free guidance method for Stable Diffusion text-to-image generation across SD v1.5, Stable Diffusion 2 Base, and Stable Diffusion XL; evaluation generates COCO-caption images and official scoring uses per-model FID. | CFGpp-diffusion/CFGpp | Standard CFG CFG++ Zero-Init CFG++ | Stable Diffusion v1.5 / COCO captions / NFE=10 Stable Diffusion 2 Base / COCO captions / NFE=10 Stable Diffusion XL Base 1.0 / COCO captions / NFE=10 |
| V&G | cv-diffusion-conditioning | Class-Conditional Diffusion: Conditioning Injection Methods | Design class-conditioning injection for a CIFAR-10 class-conditional UNet2DModel/DDPM, optimizing best FID with 50-step DDIM sampling. | huggingface/diffusers | Concat-FiLM Cross-Attention AdaLN-Zero | CIFAR-10 Class-Conditional Small UNet2DModel CIFAR-10 Class-Conditional Medium UNet2DModel CIFAR-10 Class-Conditional Large UNet2DModel |
| V&G | cv-diffusion-efficiency | Diffusion Model: Sampler Efficiency Optimization | Design a Stable Diffusion sampler update rule for COCO-caption text-to-image generation at a fixed NFE=20 budget; official scoring uses per-model FID. | CFGpp-diffusion/CFGpp | DDIM DPM++ 3M DPM++ 2S | Stable Diffusion v1.5 / COCO captions / NFE=20 Stable Diffusion 2 Base / COCO captions / NFE=20 Stable Diffusion XL Base 1.0 / COCO captions / NFE=20 |
| V&G | cv-diffusion-prediction | Diffusion Prediction Parameterization | Design a prediction target and consistent x0 inversion for unconditional CIFAR-10 UNet2DModel diffusion, optimizing best FID with 50-step DDIM sampling. | huggingface/diffusers | Epsilon Prediction V-Prediction X0 Prediction | CIFAR-10 Unconditional Small UNet2DModel CIFAR-10 Unconditional Medium UNet2DModel CIFAR-10 Unconditional Large UNet2DModel |
| V&G | cv-meanflow-perceptual-loss | Flow Map with Perceptual Loss | Studies whether auxiliary perceptual losses on denoised images improve CIFAR-10 FID for MeanFlow flow-map training with DiT backbones. | snap-research/alphaflow | Pure MSE Velocity MSE + Charbonnier + LPIPS + Gradient + Multiscale MSE + LPIPS + Gradient + Multiscale + FFT | CIFAR-10 Small DiT CIFAR-10 Medium DiT CIFAR-10 Large DiT |
| V&G | cv-vae-loss | VAE Loss Function Design for Image Reconstruction | Studies how VAE loss components affect CIFAR-10 AutoencoderKL reconstruction quality, scored primarily by rFID on the full test set. | huggingface/diffusers | L1 + KL L1 + LPIPS + KL L1 + LPIPS + KL + PatchGAN | CIFAR-10 AutoencoderKL Small CIFAR-10 AutoencoderKL Medium CIFAR-10 AutoencoderKL Large |
| RL | marl-centralized-critic | Cooperative MARL Centralized Critic Architecture for MAPPO | Studies centralized critic architectures for MAPPO on SMACLite cooperative MARL maps, scored by greedy-policy test win rate and return. | uoe-agents/epymarl | IPPO Decentralized Critic MAPPO Centralized Critic MAT-Style Attention Critic | SMACLite MMM (10-agent heterogeneous) SMACLite 2s3z (5-agent heterogeneous) SMACLite 3s5z (8-agent heterogeneous) |
| RL | meta-rl | Meta-RL: Context Encoder for PEARL Task Inference | Studies PEARL context encoders that map transition tuples to latent task representations for fast adaptation, evaluated by meta_test_return after 20 meta-training iterations. | katerakelly/oyster | PEARL MLP Context Encoder PEARL Recurrent Context Encoder PEARL Attention Context Encoder | Half-Cheetah Velocity (30 train/10 test tasks) Sparse Point Robot (40 train/10 test tasks) Point Robot (40 train/10 test tasks) |
| RL | meta-rl-algorithm | Meta-RL Algorithm Design | Studies complete meta-RL algorithm design across task inference, policy conditioning, and meta-training, scored by meta_test_return on held-out tasks after the fixed short-budget protocol. | katerakelly/oyster | PEARL FOCAL VariBAD | Half-Cheetah Velocity (30 train/10 test tasks) Sparse Point Robot (40 train/10 test tasks) Point Robot (40 train/10 test tasks) |
| RL | rl-intrinsic-exploration | Intrinsic Exploration for Sparse Rewards | Studies how intrinsic rewards and advantage mixing affect exploration and return in sparse-reward Atari environments. | vwxyzjn/cleanrl | PPO RND ICM | Tutankham-v5 Frostbite-v5 PrivateEye-v5 |
| RL | rl-offline-adroit | Offline Dexterous Manipulation from Narrow Demonstrations | Studies how offline RL algorithms learn dexterous manipulation from narrow human demonstration datasets. | corl-team/CORL | IQL AWAC ReBRAC | Pen-Human-v1 Hammer-Human-v1 Door-Cloned-v1 |
| RL | rl-offline-continuous | Q-Overestimation Suppression for Offline Continuous Control | Studies how offline continuous-control algorithms suppress out-of-distribution Q-value overestimation. | corl-team/CORL | ReBRAC TD3-BC IQL | HalfCheetah-Medium-v2 Maze2D-Medium-v1 Walker2d-Medium-v2 |
| RL | rl-offline-off2on | Offline-to-Online Fine-Tuning Without Forgetting | Studies how offline-to-online reinforcement learning prevents forgetting and value collapse during continued interaction. | corl-team/CORL | IQL AWAC SPOT | Pen-Cloned-v1 Hammer-Cloned-v1 Hammer-Expert-v1 |
| RL | rl-offpolicy-continuous | Off-Policy Actor-Critic for Continuous Control | Changes off-policy actor-critic update rules, losses, or exploration strategies to improve mean episodic return on continuous-control tasks. | vwxyzjn/cleanrl | DDPG TD3 SAC | HalfCheetah-v4 Reacher-v4 Ant-v4 |
| RL | rl-onpolicy-continuous | On-Policy Actor-Critic for Continuous Control | Changes on-policy actor-critic objectives, update rules, or exploration mechanisms to improve mean episodic return on continuous-control tasks. | vwxyzjn/cleanrl | PPO AWR PPO (KL Penalty) | HalfCheetah-v4 Swimmer-v4 InvertedDoublePendulum-v4 |
| RL | rl-reward-learning | Inverse RL Reward Learning from Demonstrations | Studies how reward models learned from expert demonstrations affect downstream policy return in continuous-control locomotion. | HumanCompatibleAI/imitation | GAIL AIRL BC | HalfCheetah-v4 Hopper-v4 Walker2d-v4 |
| RL | rl-value-atari | Value-Based Visual Control | Studies how value-based RL losses, update rules, and exploration strategies affect visual-control episodic return. | vwxyzjn/cleanrl | QR-DQN C51 Double-DQN | BreakoutNoFrameskip-v4 SeaquestNoFrameskip-v4 PongNoFrameskip-v4 |
| RL | rl-value-discrete | Value-Based Discrete Control | Changes value estimation, uncertainty handling, or replay-based update rules to improve episodic return on discrete-action control tasks. | vwxyzjn/cleanrl | QR-DQN Dueling-DQN C51 | CartPole-v1 LunarLander-v2 Acrobot-v1 |
| RL | safe-rl | Constraint Handling for Safe RL | Changes Lagrangian or controller-style multiplier updates and cost-reward advantage mixing to improve reward while keeping episode cost below target. | PKU-Alignment/omnisafe | Naive PPO Lagrangian PPO PID Lagrangian | SafetyPointGoal1-v0 SafetyCarGoal1-v0 SafetyPointButton1-v0 |
| Sys | dlm-dkv-policy | Diffusion LM KV Cache Policy | Studies how token-state refresh intervals, masks, transfer ratios, and fallbacks affect denoising quality and cache reuse. | maomaocun/dLLM-Cache | Vanilla (Uncached) dLLM-Cache d2Cache Elastic-Cache |
MATH-500 HumanEval ARC-Challenge |
| Sys | llm-kv-adaptive-quantization | LLM KV Cache: Adaptive Quantization Policy | Studies adaptive 4-bit KV-cache quantization for instruction-tuned long-context inference, trading benchmark final-score quality against effective KV bits and compression. | huggingface/transformers | KIVI Overlap (4-bit) KVTuner-4 Per-Token KVTuner-4 KIVI SQuat Subspace (4-bit) |
LongBench-E hotpotqa_e QA F1 LongBench-E passage_retrieval_en_e retrieval score LongBench-E repobench-p_e code-similarity score NeedleBench NIAH exact phrase retrieval GSM8K exact final-answer accuracy |
| Sys | llm-kv-selection-budgeting | LLM KV Cache Selection Budgeting | Studies how selection and eviction controllers allocate layer budgets and recent windows for quality, latency, and memory tradeoffs. | huggingface/transformers | Full Attention StreamingLLM Expected Attention LagKV |
LongBench-E hotpotqa_e QA F1 LongBench-E passage_retrieval_en_e retrieval score LongBench-E repobench-p_e code-similarity score LongBench v2 train split multiple-choice accuracy GSM8K exact final-answer accuracy |
| Sys | llm-kv-structural-reduction | LLM Pretraining: KV-Structural Reduction | Studies GPT-style KV-state structural reduction through MHA, MQA, GQA, and MLA-style latent KV compression under fixed nanoGPT pretraining. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness |
MHA MQA GQA MLA |
ClimbMix val loss + KV bytes/token + WikiText-2/WikiText-103/LAMBADA heldout loss HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| Sys | llm-pretrain-kernel | LLM Pretraining: Custom GPU Kernel Optimization | Studies custom/fused MLP kernels for nanoGPT pretraining while preserving ClimbMix validation, held-out perplexity, and downstream lm-eval quality. | karpathy/nanoGPT EleutherAI/lm-evaluation-harness |
ReLU-Squared (Torch) Triton GELU Triton ReLU-Squared (Fused) |
ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy |
| Sys | llm-ptq-algorithm | LLM Post-Training Quantization (PTQ) Algorithm | Design a post-training quantization algorithm for a pretrained LLM that minimizes WikiText-2 perplexity degradation under INT4/INT3 group quantization without retraining. | IST-DASLab/gptq | Round-to-Nearest (RTN) GPTQ AWQ |
PTQ INT4 PTQ INT3 PTQ INT4 (g64) |
| Sys | llm-qat-algorithm | LLM Quantization-Aware Training (QAT) Algorithm | Design a quantization-aware training algorithm for a pretrained LLM that minimizes WikiText-2 perplexity after INT4/INT3/INT2 quantization at inference time. | custom | No QAT STE LSQ Finetune + PTQ |
QAT INT4 QAT INT3 QAT INT2 |
| Sys | mlsys-fused-attention | Fused Attention Kernel Design for H100 GPUs | Design an OpenAI Triton fused self-attention forward kernel for H100 GPUs that maximizes throughput (TFLOP/s) while preserving numerical correctness. | Dao-AILab/flash-attention | FlashAttention FlashAttention-2 FlashAttention-3 |
Head Dim 64 / Seq 4K Head Dim 128 / Seq 8K Head Dim 256 / Seq 16K |
| Sys | mlsys-moe-load-balance | MoE Expert Parallelism Load Balancing | Design an efficient MoE expert-replica placement algorithm that minimizes GPU/node load imbalance while preserving inter-node locality and low runtime. | deepseek-ai/eplb | Greedy Zigzag Flat Zigzag |
DeepSeek-V3 Qwen3-MoE DeepSeek-V2 Stress-Skew |
| Sys | mlsys-sparse-attention-inference | Long-Context Inference-Time Sparse Attention | Design an inference-time sparse attention module for a pretrained instruction-tuned causal LLM that preserves NIAH and LongBench quality under a 25% density budget without retraining. | custom | Dense StreamingLLM BigBird Block Top-K |
NIAH (8K) LongBench Qasper LongBench MultiFieldQA-EN |
| Sci | ai4bio-mutation-effect-prediction | Mutation Fitness Predictor | Studies how mutant and wild-type protein representations can predict functional effects of sequence mutations. | OATML-Markslab/ProteinGym | Ridge Regression MLP Reshape CNN |
BLAT_ECOLX ESTA_BACSU RASH_HUMAN |
| Sci | ai4bio-protein-inverse-folding | Backbone-to-Sequence Inverse Folding | Studies how geometric structure encoding and sequence decoding recover amino-acid sequences from protein backbones. | A4Bio/ProteinInvBench | ProteinMPNN PiFold GVP |
CATH 4.2 CATH 4.3 TS50 |
| Sci | ai4bio-protein-structure-repr | Geometric Protein Structure Encoder | Studies how local and global geometric protein representations transfer to structure-aware function prediction. | a-r-j/ProteinWorkshop | SchNet EGNN GearNet |
EC GO-BP Fold |
| Sci | ai4sci-climate-emulation | Atmospheric Column Emulator Architecture | Studies how neural emulator architecture maps vertical atmospheric states to sub-grid physics tendencies across training budgets. | leap-stc/ClimSim | CNN Encoder-Decoder U-Net HSR |
Short Budget Medium Budget Long Budget |
| Sci | ai4sci-inverse-diffusion-algo | Diffusion-Prior Inverse Solver | Studies how diffusion priors and measurement guidance can be combined for inverse-problem reconstruction. | devzhk/InverseBench | DPS REDDiff LGD |
Inverse Scattering Black Hole Imaging Inpainting |
| Sci | ai4sci-mol-property-prediction | Molecular Representation Predictor | Studies how molecular graph and geometric representations improve property prediction under scaffold-based generalization. | deepmodeling/Uni-Mol | D-MPNN Uni-Mol GIN |
BBBP BACE Tox21 |
| Sci | ai4sci-pla-binding-affinity | Protein-Ligand Interaction Model | Studies how intra- and inter-molecular geometric interactions should be represented to predict binding affinity. | guaguabujianle/EHIGN_PLA | EHIGN GIGN SchNet EGNN |
PDBbind 2013 PDBbind 2016 PDBbind 2019 |
| Sci | ai4sci-vs-contrastive-scoring | Contrastive Virtual-Screening Objective | Studies how projection geometry and contrastive losses affect zero-shot protein-ligand screening quality. | jianhuiwemi/HypSeek | Vanilla CLIP HCC HCC + Hyperbolic Cone |
HypSeek Training DUD-E LIT-PCBA DEKOIS 2.0 |
| Sci | ai4sci-weather-forecast-aggregation | Weather Forecast Variable Aggregation | Studies how weather forecasting models aggregate information across heterogeneous meteorological variables to improve forecast accuracy. | microsoft/ClimaX | Cross-Attention Mean Pooling Learned Weighted Sum |
Z500 3-Day T850 5-Day 10m-Wind 7-Day |
| Sci | pde-design-solver | Industrial CFD Design: Custom Neural Operator Design | Designs and implements a custom neural operator for industrial aerodynamic design prediction on 3D unstructured point clouds. | thuml/Neural-Solver-Library | PointNet GraphSAGE Graph U-Net Transolver |
Car Design AirfRANS Aircraft Design |
| Opt | optimization-bilevel | Optimization Bilevel | Studies a fixed bilevel-optimization benchmark based on Shen and Chen's penalty-based bilevel gradient descent experiments, selecting supported methods and tuning paper-style strategy hyperparameters. | hanshen95/penalized-bilevel-gradient-descent | V-PBGD G-PBGD RHG T-RHG |
Toy Convergence HyperClean (Linear) HyperClean (MLP) |
| Opt | optimization-convex-concave | RAIN Convex-Concave | Studies gradient-norm convergence on the exact convex-concave benchmark instances used by the official RAIN bilinear and delta-function scripts. | TrueNobility303/RAIN | SEG R-SEG SEAG RAIN |
Default Noise Low Noise High Noise |
| Opt | optimization-diagonal-net | Optimizer Design for Diagonal-Net Sparse Recovery | Designs an optimizer that recovers a sparse linear predictor from fewer training samples under a diagonal-net parameterization with noisy labels. | TrueNobility303/RAIN | SGD AdaGrad Adam Adam (Alt.) |
d=200, k=5, s=0.1 d=500, k=10, s=0.1 d=500, k=10, s=0.2 d=10000, k=50 |
| Opt | optimization-dp-sgd | Differentially Private SGD: Privacy-Utility Optimization | Design an improved DP-SGD variant that achieves higher test accuracy under the same (epsilon, delta)-differential privacy budget. | custom | Standard DP-SGD Automatic Clipping (AUTO-S) Adaptive Quantile Clipping Step-Decay Noise Schedule |
MNIST Fashion-MNIST CIFAR-10 |
| Opt | optimization-evolution-strategy | Evolutionary Optimization Strategy Design | Design a novel combination of selection, crossover, and mutation operators and/or the evolutionary loop for continuous black-box optimization across multiple benchmark functions. | DEAP/deap | GA (SBX) CMA-ES Differential Evolution L-SHADE |
Rastrigin (30D) Rosenbrock (30D) Ackley (30D) Rastrigin (100D) |
| Opt | optimization-gradient-compression | Gradient Compression for Communication-Efficient Distributed Training | Design a gradient compression operator that reduces communication cost in distributed training while maintaining convergence quality. | custom | TopK Sparsification with Error Feedback QSGD (Quantized SGD) SignSGD |
ResNet-20 / CIFAR-10 VGG-11-BN / CIFAR-100 ResNet-56 / CIFAR-10 |
| Opt | optimization-hyperparameter-search | Hyperparameter Optimization: Custom Search Strategy Design | Design a custom HPO strategy that improves final validation score and convergence under limited multi-fidelity evaluation budgets. | custom | Random Search TPE Hyperband DEHB BOHB Optuna CMA-ES |
XGBoost SVM Neural Net |
| Opt | optimization-multi-objective | Multi-Objective Optimization: Custom Evolutionary Strategy Design | Design a custom multi-objective evolutionary strategy that improves convergence, diversity, and spread on standard benchmark problems. | DEAP/deap | NSGA-II MOEA/D SPEA2 NSGA-III RVEA AGE-MOEA |
ZDT1 ZDT3 DTLZ2 DTLZ1 |
| Opt | optimization-nas | Sample-Efficient Neural Architecture Search | Design and implement a sample-efficient NAS optimizer that discovers high-performing architectures in the NAS-Bench-201 search space under a strict query budget. | automl/naslib | Random Search REA BANANAS |
CIFAR-10 CIFAR-100 ImageNet16-120 |
| Opt | optimization-online-bandit | Online Bandits: Exploration-Exploitation Strategy Design | Design and implement a bandit policy that minimizes cumulative regret across diverse multi-armed bandit settings. | SMPyBandits/SMPyBandits | UCB1 Thompson Sampling KL-UCB |
Stochastic MAB Contextual Bandit Non-Stationary Bandit |
| Opt | optimization-pac-bayes-bound | PAC-Bayes Generalization Bound Optimization | Design a tighter PAC-Bayes generalization bound by optimizing the bound formulation, prior/posterior parameterization, and KL divergence estimation for stochastic neural networks. | mperezortiz/PBB | McAllester Catoni Quadratic |
MNIST (FCN) MNIST (CNN) Fashion-MNIST (CNN) |
| Opt | optimization-parity | Optimization Parity | Improve a fixed two-layer MLP's ability to learn sparse parity by designing only its initialization, training dataset, and AdamW hyperparameters. | pytorch/examples | Default Multi-Epoch No Weight Decay |
n=32, k=8 n=50, k=8 n=64, k=8 |
| Opt | optimization-variance-reduction | Variance Reduction for Stochastic Optimization | Design an improved variance reduction strategy for stochastic gradient descent on finite-sum optimization problems. | custom | SVRG STORM STORM+ |
Logistic Regression MLP Ill-Conditioned |
| CAL | meta-fewshot-classification | Few-Shot Image Classification Method | Studies how support encoding, query comparison, and loss design affect episodic few-shot image-classification accuracy. | sicara/easy-few-shot-learning | ProtoNet MatchingNet RelationNet |
Mini-ImageNet 5w-5s CIFAR-FS CUB |
| CAL | meta-inner-loop-optimizer | Meta-Learning Inner-Loop Optimizer | Studies how differentiable inner-loop adaptation rules affect few-shot classification accuracy in gradient-based meta-learning. | learnables/learn2learn | MAML Meta-SGD ANIL |
Mini-ImageNet 5w-1s Mini-ImageNet 5w-5s CIFAR-FS 5w-5s |
| CAL | ml-active-learning | Pool-Based Active Learning Query Strategy | Studies how unlabeled-sample query rules affect accuracy under a fixed labeling budget. | JordanAsh/badge | BADGE BAIT BALD Least Confidence Random |
Letter Spambase Splice |
| CAL | ml-anomaly-detection | Unsupervised Tabular Anomaly Detector | Studies how unsupervised anomaly-scoring algorithms identify outliers across tabular data distributions. | custom | IF (Isolation Forest) LOF OCSVM ECOD COPOD |
Cardio Thyroid Satellite Shuttle |
| CAL | ml-calibration | Post-Hoc Probability Calibration Mapping | Studies how post-hoc probability transforms improve classifier confidence calibration. | custom | Platt Temperature Scaling Isotonic Regression |
RF / MNIST MLP / Fashion-MNIST GBM / Madelon SVM / Breast Cancer |
| CAL | ml-clustering-algorithm | Geometry-Robust Clustering Algorithm | Studies how clustering objectives and distance metrics handle convex blobs, non-convex moons, and high-dimensional digit data. | custom | K-Means DBSCAN HDBSCAN |
Blobs Moons Digits |
| CAL | ml-continual-regularization | Continual Learning Importance Regularizer | Changes parameter-importance estimation and regularization loss to reduce catastrophic forgetting and improve final average accuracy across contexts. | GMvandeVen/continual-learning | EWC SI Online EWC |
Split-MNIST Permuted-MNIST Split-CIFAR100 |
| CAL | ml-dimensionality-reduction | Nonlinear 2D Structure-Preserving Embedding | Studies how nonlinear dimensionality reduction preserves neighborhood structure in low-dimensional embeddings. | custom | PCA t-SNE UMAP TriMap PaCMAP |
MNIST Fashion-MNIST 20 Newsgroups |
| CAL | ml-ensemble-boosting | Adaptive Boosting Weight and Target Strategy | Studies how pseudo-targets, learner weights, and sample reweighting affect boosted ensemble performance. | custom | AdaBoost Gradient Boosting XGBoost-style |
Breast Cancer Diabetes California Housing |
| CAL | ml-federated-aggregation | Heterogeneous Federated Server Aggregation | Changes server-side client selection and model aggregation to improve federated test accuracy under heterogeneous client data. | adap/flower | FedAvg FedProx SCAFFOLD |
CIFAR-10 (Non-IID alpha=0.1) FEMNIST Shakespeare |
| CAL | ml-missing-data-imputation | Correlation-Aware Tabular Imputation | Studies how feature correlations and predictive structure guide missing-value imputation in tabular data. | custom | Mean Imputation KNN Imputation MICE MissForest GAIN |
Breast Cancer Wisconsin Wine California Housing |
| CAL | ml-selective-deferral | Selective Deferral Under Subgroup Shift | Studies how acceptance and deferral rules trade off selective risk, subgroup robustness, and coverage on AIF360 tabular datasets. | custom | Confidence Thresholding Conformal Abstention Learned Deferral Group-wise Thresholding |
Adult COMPAS Law School GPA |
| CAL | ml-subgroup-calibration-shift | Shift-Robust Subgroup Calibration | Studies how post-hoc calibration behaves under subgroup distribution shift and worst-group reliability constraints on AIF360 tabular datasets. | custom | Temperature Scaling Isotonic Regression Beta Calibration Group-wise Temperature Scaling |
Adult COMPAS Law School GPA |
| CAL | ml-symbolic-regression | Genetic Programming Search for Symbolic Regression | Studies how symbolic-regression search strategies recover generalizable analytical expressions. | trevorstephens/gplearn | Standard GP Parsimony GP Lexicase GP |
Nguyen-7 Nguyen-10 Koza-3 |
| DL | cv-classification-loss | Adaptive Classification Loss | Modify the training loss over logits and labels to improve classification accuracy across image-model families. | custom | Label Smoothing Focal Loss PolyLoss |
ResNet-56 / CIFAR-100 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST |
| DL | cv-data-augmentation | Image Augmentation Policy | Design the training transform pipeline combining geometric, photometric, and erasing operations to improve image-classification generalization. | custom | Cutout RandAugment TrivialAugmentWide |
ResNet-20 / CIFAR-10 ResNet-56 / CIFAR-100 MobileNet-V2 / Fashion-MNIST |
| DL | cv-multitask-loss | Hierarchical Classification Loss Weighting | Studies how fine-label and coarse-label objectives should be combined to improve hierarchical image classification. | custom | Uncertainty Weighting DWA PCGrad |
ResNet-20 / CIFAR-100-MT ResNet-56 / CIFAR-100-MT VGG-16-BN / CIFAR-100-MT |
| DL | cv-pooling-aggregation | Spatial Feature Aggregation | Studies how global spatial features should be aggregated to improve image-classification accuracy across convolutional architectures. | custom | Global Max GeM Avg + Max |
ResNet-56 / CIFAR-100 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST |
| DL | cv-sample-weighting | Long-Tail Class Reweighting | Studies how class-count statistics should be mapped to loss weights to improve test accuracy on balanced test sets for long-tailed image classification. | custom | Inverse Frequency Class-Balanced (Effective Number) Balanced Softmax |
ResNet-32 / CIFAR-10-LT ResNet-32 / CIFAR-100-LT VGG-16-BN / CIFAR-100-LT |
| DL | dl-activation-function | Convolutional Activation Nonlinearity | Studies how drop-in activation functions affect accuracy across convolutional image classifiers. | custom | GELU SiLU Mish |
ResNet-20 / CIFAR-10 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST |
| DL | dl-lr-schedule | Architecture-Aware Learning-Rate Scheduling | Designs an epoch-level learning-rate curve conditioned on architecture and dataset to improve convergence and final classification accuracy. | custom | Cosine WarmupCosine OneCycle |
ResNet-20 / CIFAR-10 ResNet-56 / CIFAR-100 MobileNet-V2 / Fashion-MNIST |
| DL | dl-normalization | Normalization Statistics and Affine Design | Studies how normalization statistics and affine behavior affect convolutional training stability and test accuracy. | custom | GroupNorm Batch-Instance Norm Switchable Norm |
ResNet-56 / CIFAR-100 ResNet-110 / CIFAR-100 MobileNet-V2 / Fashion-MNIST |
| DL | dl-regularization | Adaptive Regularization Loss | Adds a model-, output-, input-, or epoch-dependent regularization term to improve classification generalization beyond standard weight decay. | custom | DropBlock Confidence Penalty Orthogonal Regularization |
ResNet-56 / CIFAR-100 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST |
| DL | dl-residual-connection | Residual Block Skip Design | Studies how shortcut transformations and residual branch computation affect optimization and generalization across network depths. | custom | Pre-Activation Gated Residual Stochastic Depth |
ResNet-20 / CIFAR-10 ResNet-56 / CIFAR-100 ResNet-110 / CIFAR-100 |
| DL | dl-weight-initialization | DL Weight Initialization Strategy Design | Designs data-independent initialization for convolutional, normalization, and classifier layers to improve convergence and final accuracy. | custom | Kaiming Normal Fixup Orthogonal |
ResNet-56 / CIFAR-100 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST |
| TS | quant-concept-drift | Concept-Drift-Aware Quantitative Forecasting | Redesigns the stock prediction model and data pipeline to handle temporal distribution shift and improve signal quality and portfolio metrics. | microsoft/qlib | TRA AdaRNN LightGBM |
CSI 300 CSI 300 (Shifted) CSI 300 (Recent) |
| TS | quant-graph-stock | Graph-Based Quantitative Forecasting | Studies how inter-asset graph relationships affect return signal quality and portfolio performance. | microsoft/qlib | HIST GATs LightGBM |
CSI 300 CSI 100 CSI 300 (Recent) |
| TS | quant-stock-prediction | Quantitative Return Forecasting | Studies how predictive models and input processing affect next-period return signals and portfolio performance. | microsoft/qlib | LightGBM LSTM Transformer |
CSI 300 CSI 100 CSI 300 (Recent) |
| TS | stf-traffic-forecast | Spatial-Temporal Traffic Forecasting Model | Studies how spatial-temporal models capture sensor-network dependencies for traffic forecasting. | GestaltCogTeam/BasicTS | STID DLinear StemGNN iTransformer TimesNet SOFTS TimeMixer |
METR-LA PEMS-BAY PEMS04 |
| TS | ts-anomaly-detection | Reconstruction Model for Time-Series Anomaly Detection | Designs an unsupervised reconstruction model that detects anomalous multivariate time-series segments with a higher F-score. | thuml/Time-Series-Library | DLinear TimesNet PatchTST |
PSM MSL SMAP |
| TS | ts-classification | Multivariate Time-Series Classification Model | Studies how representation learning improves classification of multivariate time-series signals. | thuml/Time-Series-Library | DLinear TimesNet PatchTST |
EthanolConcentration FaceDetection Handwriting |
| TS | ts-exogenous-forecast | Exogenous-Variable Target Forecasting Model | Studies how exogenous variables improve target-channel forecasting. | thuml/Time-Series-Library | DLinear PatchTST iTransformer TimeXer |
ETTh1 Weather ECL |
| TS | ts-imputation | Masked Multivariate Time-Series Imputation | Studies how imputation models reconstruct missing regions in multivariate time series. | thuml/Time-Series-Library | DLinear TimesNet PatchTST |
ETTh1 (25% missing) Weather (25% missing) ECL (25% missing) |
| TS | ts-long-term-forecast | Multivariate Long-Horizon Forecasting Model | Studies how long-horizon forecasting models predict future multivariate sequences. | thuml/Time-Series-Library | DLinear PatchTST iTransformer TimeMixer TimeXer |
ETTh1 Weather ECL |
| TS | ts-short-term-forecast | Univariate Short-Horizon Forecasting Model | Studies how short-horizon forecasting models predict seasonal univariate series. | thuml/Time-Series-Library | DLinear TimesNet PatchTST TimeMixer |
M4 Monthly M4 Quarterly M4 Yearly |
| SCR | causal-discovery-discrete | Discrete Causal Graph Discovery | Studies how causal discovery algorithms recover equivalence-class graph structure from discrete observational data. | py-why/causal-learn | PC GES GRaSP BOSS Hill Climbing |
Cancer Child ALARM HAILFINDER Win95pts |
| SCR | causal-observational-linear-gaussian | Linear Gaussian Causal Discovery | Studies how observational algorithms recover causal graph structure under linear Gaussian assumptions. | py-why/causal-learn | PC GRaSP BOSS |
ER (n=10) ER (n=20) SF (n=50) SF (n=50, Hard) ER (n=20, Noisy) |
| SCR | causal-observational-linear-non-gaussian | Non-Gaussian Causal Discovery | Studies how non-Gaussian structure can identify directed causal relationships from observational data. | py-why/causal-learn | ICA-LiNGAM DirectLiNGAM NOTEARS |
ER (n=30) ER (n=50) SF (n=100) |
| SCR | causal-observational-nonlinear | Nonlinear Causal Discovery | Studies how nonlinear additive-noise assumptions support directed causal graph recovery from observations. | py-why/causal-learn | CAM NOTEARS-MLP DirectLiNGAM GraN-DAG |
SF (n=20, GP) ER (n=20, Gauss) ER (n=12, Low-Sample) |
| SCR | causal-treatment-effect | Heterogeneous Treatment Effect Estimation | Studies how observational estimators recover individual and average treatment effects on synthetic CATE benchmark families. | custom | S-Learner T-Learner IPW Causal Forest DR-Learner R-Learner |
IHDP-inspired Synth Jobs/LaLonde-inspired Synth ACIC-inspired Synth |
| SCR | graph-generation | Unconditional Graph Generator Architecture | Studies how graph generator architecture affects distributional match to target graph statistics. | pyg-team/pytorch_geometric | GraphVAE GRAN DiGress |
Community-Small Ego-Small ENZYMES |
| SCR | graph-graph-classification | Structure-Aware Graph Readout Pooling | Studies how graph-level readout mechanisms affect graph classification accuracy and macro F1 under a fixed message-passing backbone. | pyg-team/pytorch_geometric | GIN + Sum SAGPool DiffPool |
MUTAG PROTEINS NCI1 |
| SCR | graph-link-prediction | Graph Link Encoder-Decoder | Studies how node encoders and edge decoders affect missing-link prediction quality. | custom | GCN + MLP Decoder VGAE SEAL |
Cora CiteSeer ogbl-collab |
| SCR | graph-node-classification | Graph Node Message Passing | Studies how message-passing layers affect node classification across citation network benchmarks. | pyg-team/pytorch_geometric | GCN GAT GraphSAGE |
Cora CiteSeer PubMed |
| SCR | graph-signal-propagation | Homophily-Heterophily Graph Filter | Changes the graph signal propagation filter to improve node classification accuracy across homophilic and heterophilic graphs. | ivam-he/ChebNetII | GPR-GNN BernNet ChebNetII |
Cora CiteSeer Texas Cornell |
| TL | security-adversarial-attack-black-box-score | Score-Based Black-Box Linf Attack | Designs a query-efficient black-box Linf evasion attack to improve attack success rate under a fixed per-sample query budget. | Harry24k/adversarial-attacks-pytorch | Square Attack SPSA Random Search |
ResNet-20 / CIFAR-10 VGG-11-BN / CIFAR-10 MobileNet-V2 / CIFAR-10 ResNet-20 / CIFAR-100 MobileNet-V2 / CIFAR-100 |
| TL | security-adversarial-attack-sparse-l0 | Sparse L0 Adversarial Attack | Studies how sparse perturbation strategies improve attack success while respecting a strict pixel budget. | Harry24k/adversarial-attacks-pytorch | OnePixel SparseFool JSMA Pixle Sparse-RS |
Rebuffi-R18 (l2-AT) / CIFAR-10 Augustin (l2-robust) / CIFAR-10 Engstrom (l2-robust) / CIFAR-10 |
| TL | security-adversarial-attack-white-box-linf | White-Box Linf Evasion Attack | Designs a gradient-based white-box Linf attack to improve attack success rate while respecting the perturbation budget. | Harry24k/adversarial-attacks-pytorch | FGSM PGD MI-FGSM AutoAttack |
ResNet-20 / CIFAR-10 VGG-11-BN / CIFAR-10 ResNet-20 / CIFAR-100 VGG-11-BN / CIFAR-100 MobileNet-V2 / CIFAR-100 |
| TL | security-adversarial-training | Linf Adversarial Training for Robust Accuracy | Studies how adversarial training procedures improve robust accuracy while maintaining clean accuracy. | Harry24k/adversarial-attacks-pytorch | Standard Training PGD-AT TRADES MART AWP + TRADES |
SmallCNN / MNIST PreAct ResNet-18 / CIFAR-10 VGG-11-BN / CIFAR-10 PreAct ResNet-18 / CIFAR-100 |
| TL | security-backdoor-defense | Poisoned-Sample Scoring for Backdoor Filtering | Designs a suspicion scoring rule that identifies and filters backdoored training examples to reduce attack success rate while preserving clean accuracy. | custom | Confidence Filter Spectral Signatures Activation Clustering Z-Score Outlier |
ResNet-20 / CIFAR-10 (BadNets) VGG-16-BN / CIFAR-100 (Blend) MobileNet-V2 / Fashion-MNIST (BadNets) |
| TL | security-machine-unlearning | Targeted Update Rules for Class Unlearning | Designs an unlearning update rule that removes forget-class information while improving retained accuracy and reducing forget-set membership leakage. | custom | Retain Fine-Tune Negative Gradient Bad Teacher SCRUB |
ResNet-20 / CIFAR-10 (Class 0) VGG-16-BN / CIFAR-100 (Class 0) MobileNet-V2 / Fashion-MNIST (Class 0) |
| TL | security-membership-inference-defense | Training Regularization for Membership Privacy | Studies how privacy-preserving training losses reduce membership leakage while maintaining accuracy. | custom | ERM Label Smoothing Confidence Penalty RelaxLoss |
ResNet-20 / CIFAR-10 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST |
| TL | security-poison-robust-learning | Robust Losses for Label-Flip Poisoning | Designs a robust loss or sample-weighting rule that improves clean accuracy under label-flip poisoning and reduces poisoned-label memorization. | custom | Cross-Entropy Generalized Cross-Entropy Symmetric Cross-Entropy Bootstrap |
ResNet-20 / CIFAR-10 (Label-Flip) VGG-16-BN / CIFAR-100 (Label-Flip) MobileNet-V2 / Fashion-MNIST (Label-Flip) |

@misc{lyu2026mlsbenchholisticrigorousassessment,
  title={MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI},
  author={Bohan Lyu and Yucheng Yang and Siqiao Huang and Jiaru Zhang and Qixin Xu and Xinghan Li and Xinyang Han and Yicheng Zhang and Huaqing Zhang and Runhan Huang and Kaicheng Yang and Zitao Chen and Wentao Guo and Junlin Yang and Xinyue Ai and Wenhao Chai and Yadi Cao and Ziran Yang and Kun Wang and Dapeng Jiang and Huan-ang Gao and Shange Tang and Chengshuai Shi and Simon S. Du and Max Simchowitz and Jiantao Jiao and Dawn Song and Chi Jin},
  year={2026},
  eprint={2605.08678},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.08678},
}
