Imbernoulli/MLS-Bench


MLS-Bench


MLS-Bench is a benchmark for machine learning science. Where most agent benchmarks reward engineering one fixed instance — clean the data, tune the pipeline, climb a leaderboard — MLS-Bench asks the harder question: can an AI agent propose a new component, loss, optimizer, or training procedure whose gain transfers across settings, seeds, datasets, and scales?

[Figure: MLS-Bench overview]

The benchmark contains 140 tasks across 12 ML research domains. Each task fixes a research scaffold, gives the agent the relevant source code and strong baseline implementations, then asks for one algorithmic change inside a constrained edit surface.

News

  • 2026.6: More efficient on larger GPUs. A new compute_scale option lets the LLM pretraining and reinforcement-learning tasks (and, optionally, the other tasks) run more efficiently on H200-class GPUs without changing results. See issue #4 and PR #9 for the design.
  • 2026.5: Harbor support. Official Harbor-compatible runtime and pre-rendered task images are available on Docker Hub under bohanlyu2022/mlsbench-harbor-*. See harbor/README.md.
  • 2026.5: Stronger Sparse L0 Adversarial Attack task. Upgraded to the canonical Sparse-RS L0 threat model (k=24, untargeted) against three adversarially-robust RobustBench L2 CIFAR-10 targets (Rebuffi-R18 / Augustin / Engstrom). Strong attacks no longer trivially saturate, leaving real headroom to measure genuine attack improvements.
  • 2026.5: Scoring. The main results table in the arXiv paper previously aggregated tasks within each area by geometric mean; it now uses the arithmetic mean for easier comparison with the per-task numbers. Rankings are unchanged and no conclusions are affected.
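To make the scoring change concrete, the sketch below compares the two aggregation rules on a made-up list of per-task scores. The values and function names are illustrative assumptions, not numbers or code from the paper or this repository.

```python
import math


def arithmetic_mean(scores):
    """Plain average of the per-task scores."""
    return sum(scores) / len(scores)


def geometric_mean(scores):
    """Geometric mean; only defined for strictly positive scores."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical normalized scores for three tasks in one area.
area_scores = [0.5, 1.0, 2.0]

print("arithmetic:", arithmetic_mean(area_scores))
print("geometric: ", geometric_mean(area_scores))
```

The arithmetic mean is on the same scale as the individual task scores, which is why it is easier to compare against per-task numbers; the geometric mean is pulled down more strongly by low scores.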

Installation

pip install -e ".[agent]"

Python 3.10+ is required. MLS-Bench separates the choice of runtime backend from the choice of job scheduler. The two are chosen independently, with one exception: the local Conda backend should not be combined with SLURM.

  • Runtime backends: Docker, Apptainer, or local Conda — selected in your config file via container_runtime.
  • Job schedulers: SLURM (when a slurm: section is present in the config) or the built-in single-node GPU scheduler.
container_runtime: docker      # docker, apptainer, or local

Recommended setup: Docker or Apptainer for the runtime, with SLURM as the job scheduler. If SLURM is unavailable, the built-in scheduler can be combined with any of the three runtimes. If neither a container runtime nor SLURM is available, the local Conda backend together with the built-in scheduler provides a complete fallback (see the section below).
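The selection rules above can be sketched as a small function over an already-parsed config. This is a minimal illustration, assuming the config is available as a plain dict; `resolve_backends` and the `"builtin"` label are hypothetical names, not part of the MLS-Bench API.

```python
VALID_RUNTIMES = {"docker", "apptainer", "local"}


def resolve_backends(config: dict) -> tuple[str, str]:
    """Return (runtime, scheduler) for a parsed config dict.

    Hypothetical helper: mirrors the documented rules, not the real loader.
    """
    runtime = config.get("container_runtime")
    if runtime not in VALID_RUNTIMES:
        raise ValueError(
            f"container_runtime must be one of {sorted(VALID_RUNTIMES)}, got {runtime!r}"
        )
    # SLURM is used whenever a slurm: section is present; otherwise the
    # built-in single-node GPU scheduler is used.
    scheduler = "slurm" if "slurm" in config else "builtin"
    if runtime == "local" and scheduler == "slurm":
        raise ValueError("the local Conda backend should not be combined with SLURM")
    return runtime, scheduler


print(resolve_backends({"container_runtime": "docker"}))
print(resolve_backends({"container_runtime": "apptainer", "slurm": {}}))
```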

Running with local Conda environments and the built-in scheduler

When neither Docker nor Apptainer is available, MLS-Bench can build a dedicated Conda environment per package and dispatch jobs through a single-node GPU queue (src/mlsbench/scheduler.py). This backend is intended for development and small-scale experimentation; for full-scale benchmarking on a cluster we recommend SLURM with one of the container runtimes instead. The Conda backend should not be combined with SLURM, since both attempt to schedule GPU jobs.

  1. Use a config with container_runtime: local and no slurm: section. Throughout this section we refer to it as configs/local.yaml.

  2. Build the environment for each package:

    mlsbench build <package> --config configs/local.yaml
  3. Start the GPU scheduler:

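    mkdir -p .scheduler/logs   # the shell opens the log files before the scheduler starts, so the directories must exist
    nohup python -m mlsbench.scheduler start \
      --gpus 0,1,2,3 \
      --config configs/local.yaml \
      > .scheduler/scheduler.log 2>&1 &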
  4. Launch agents or baselines. They enqueue jobs to the scheduler and return immediately:

    PYTHONPATH=src nohup python3 -m mlsbench agent <task> --model <model> \
      --config configs/local.yaml \
      > .scheduler/logs/agent_<task>.log 2>&1 &
  5. Inspect or manage the queue:

    python -m mlsbench.scheduler status
    python -m mlsbench.scheduler list
    python -m mlsbench.scheduler cancel <job_id>
    python -m mlsbench.scheduler clear

To rebuild a package's environment from scratch, remove it with conda env remove -n mlsbench-<package> and re-run mlsbench build.
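For intuition about what a single-node GPU queue does, here is a toy first-in-first-out model. It is not the code in src/mlsbench/scheduler.py; the class name, its methods, and the one-GPU-per-job policy are all assumptions made for illustration.

```python
from collections import deque


class GpuQueue:
    """Toy FIFO queue that hands one free GPU to each pending job."""

    def __init__(self, gpu_ids):
        self.free = deque(gpu_ids)   # GPUs not currently assigned
        self.pending = deque()       # job ids waiting for a GPU
        self.running = {}            # job id -> assigned GPU

    def enqueue(self, job_id):
        """Add a job to the back of the queue; returns immediately."""
        self.pending.append(job_id)

    def dispatch(self):
        """Start as many pending jobs as there are free GPUs."""
        started = []
        while self.pending and self.free:
            job_id = self.pending.popleft()
            gpu = self.free.popleft()
            self.running[job_id] = gpu
            started.append((job_id, gpu))
        return started

    def finish(self, job_id):
        """Release the GPU held by a completed job."""
        self.free.append(self.running.pop(job_id))

    def cancel(self, job_id):
        """Drop a pending job or stop a running one; False if unknown."""
        if job_id in self.pending:
            self.pending.remove(job_id)
            return True
        if job_id in self.running:
            self.finish(job_id)
            return True
        return False


queue = GpuQueue([0, 1])
for job in ("job-a", "job-b", "job-c"):
    queue.enqueue(job)
print(queue.dispatch())   # two jobs start, one keeps waiting
queue.finish("job-a")
print(queue.dispatch())   # the waiting job takes the freed GPU
```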

API Keys

Running an agent requires an API key for the model provider you choose. If you enable the optional web-search tool, a Tavily key is also required. Configure keys in either of two equivalent ways:

1. Inline in your config file under the providers: block — useful when you want to keep separate configs per environment or per project:

providers:
  openai:
    api_key: "sk-..."
  anthropic:
    api_key: "sk-ant-..."
  openrouter:
    api_key: "sk-or-..."
    base_url: "https://openrouter.ai/api/v1"
  deepseek:
    api_key: "sk-..."
    base_url: "https://api.deepseek.com/v1"
  tavily:
    api_key: "tvly-..."     # only needed if the web_search tool is enabled

2. Environment variables — leave the api_key field empty (or omit the provider entirely) and the CLI falls back to the standard env var for that provider:

Provider               Env var
OpenAI                 OPENAI_API_KEY
Anthropic              ANTHROPIC_API_KEY
OpenRouter             OPENROUTER_API_KEY_NEW
DeepSeek               DEEPSEEK_API_KEY
Qwen / DashScope       QWEN_API_KEY / DASHSCOPE_API_KEY
Gemini / Google        GEMINI_API_KEY / GOOGLE_API_KEY
Kimi / Moonshot        KIMI_API_KEY / MOONSHOT_API_KEY
GLM                    GLM_API_KEY
MiniMax                MINIMAX_API_KEY
Tavily (web search)    TAVILY_API_KEY

You can also use ${ENV_VAR} interpolation inside the YAML (api_key: "${OPENAI_API_KEY}") when you want a tracked config file that still resolves the secret from the environment at runtime.
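As a rough picture of how ${ENV_VAR} interpolation behaves, the sketch below substitutes environment values into a string. The `interpolate` helper is hypothetical, and raising an error on an unset variable is an assumption; the actual MLS-Bench loader may treat that case differently.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment-variable name.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(value: str, env=None) -> str:
    """Replace every ${NAME} in value with env[NAME] (hypothetical helper)."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            # Assumption: fail loudly rather than substitute an empty string.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_PATTERN.sub(replace, value)


# Example with an explicit mapping instead of the real environment.
fake_env = {"OPENAI_API_KEY": "sk-example"}
print(interpolate("${OPENAI_API_KEY}", fake_env))
```

Passing the environment as a parameter keeps the helper easy to test without touching real secrets.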

The model string passed to mlsbench agent --model <name> selects the provider automatically:

  • Bare names are dispatched by their well-known prefix: claude-* → providers.anthropic, gpt-* / o1 / o3 / o4 → providers.openai, deepseek-* → providers.deepseek, qwen-* → providers.qwen, gemini-* → providers.gemini, kimi-* / moonshot-* → providers.kimi, glm-* → providers.glm, minimax-* → providers.minimax.
  • Prefixed names (<provider>/<model>, e.g. openai/gpt-5.4, vertex_ai/..., openrouter/anthropic/claude-opus-4.6) dispatch generically to the matching providers.<provider> entry. Point that entry's base_url at whichever upstream you want — direct API, OpenRouter, a LiteLLM proxy, etc. — and the same key is reused.
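The two dispatch rules above can be sketched as a lookup function. This is an illustration only: `provider_for` is a hypothetical name, and treating o1 / o3 / o4 as prefixes (so that suffixed variants also match) is an assumption about how the real dispatcher handles them.

```python
# (prefix, provider) pairs taken from the bare-name rule above.
PREFIX_TO_PROVIDER = [
    ("claude-", "anthropic"),
    ("gpt-", "openai"),
    ("o1", "openai"),
    ("o3", "openai"),
    ("o4", "openai"),
    ("deepseek-", "deepseek"),
    ("qwen-", "qwen"),
    ("gemini-", "gemini"),
    ("kimi-", "kimi"),
    ("moonshot-", "kimi"),
    ("glm-", "glm"),
    ("minimax-", "minimax"),
]


def provider_for(model: str) -> str:
    """Return the providers.<name> key a model string would select."""
    # Prefixed names: everything before the first "/" names the provider.
    if "/" in model:
        return model.split("/", 1)[0]
    # Bare names: match on the well-known prefix.
    for prefix, provider in PREFIX_TO_PROVIDER:
        if model.startswith(prefix):
            return provider
    raise ValueError(f"cannot infer a provider for model {model!r}")


for name in ("gpt-5.4", "openai/gpt-5.4", "openrouter/anthropic/claude-opus-4.6"):
    print(name, "->", "providers." + provider_for(name))
```

Note that a prefixed name always wins over the bare-name table, so openrouter/anthropic/claude-opus-4.6 resolves to the openrouter entry rather than anthropic.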

Quick Start

Fetch external packages and build the runtime (data dependencies are prepared automatically as part of the build):

mlsbench fetch --name <package>
mlsbench build <package> --config configs/react.yaml

Run an agent and compute its task score:

mlsbench agent <task> --model <model> --config configs/react.yaml
mlsbench score task <task>

Baseline scores are already populated in each task's leaderboard.csv, so running an agent alone is sufficient to obtain its normalized score under the MLS-Bench evaluation framework. Before launching the agent, however, we recommend running one baseline first to confirm that your environment is set up correctly:

mlsbench baseline <task> --name <baseline> --config configs/react.yaml

Baselines and agents share the same task scripts, parsers, seeds, resource limits, and leaderboard code; only the source of the edits differs.

Prebuilt Container Images

To avoid building each package from source, prebuilt images are published for every supported package.

mlsbench agent, mlsbench baseline, and mlsbench build automatically pull the prebuilt image when the local image is missing, and fall back to building from source on failure. mlsbench run performs the same lookup but does not build from source; run mlsbench build <pkg> first if a local build is required.

Two mutually-exclusive flags force a specific source for mlsbench build:

mlsbench build <package> --pull          # use only the prebuilt image
mlsbench build <package> --local-build   # build locally from the Dockerfile / .def
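The lookup order described in this section can be summarized in a short sketch. Everything here is hypothetical scaffolding: `obtain_image`, its parameters, and the `pull` / `build` callables stand in for the real operations, and the behavior of a forced pull when a local image already exists is an assumption.

```python
from typing import Callable, Optional


def obtain_image(
    has_local: bool,
    pull: Callable[[], bool],
    build: Callable[[], bool],
    *,
    allow_build: bool = True,
    force: Optional[str] = None,
) -> str:
    """Return which source supplied the image: 'local', 'pulled', or 'built'.

    pull and build are stand-ins that return True on success.
    allow_build=False models mlsbench run, which never builds from source.
    force models the --pull and --local-build flags of mlsbench build.
    """
    if force == "pull":
        if pull():
            return "pulled"
        raise RuntimeError("prebuilt image could not be pulled")
    if force == "local-build":
        if build():
            return "built"
        raise RuntimeError("local build failed")
    if force is not None:
        raise ValueError(f"unknown force option {force!r}")

    # Default path: local image, then prebuilt image, then source build.
    if has_local:
        return "local"
    if pull():
        return "pulled"
    if allow_build and build():
        return "built"
    raise RuntimeError("no image available")


# Pull fails, so the default path falls back to a source build.
print(obtain_image(False, pull=lambda: False, build=lambda: True))
```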

For Apptainer, the SIF can be obtained either via apptainer pull docker://... (default) or from the Hugging Face mirror — a direct HTTPS download of sif/<Pkg>.sif, which can be faster in networks where Docker registries are slow. Select the source with --sif-source {docker,hf,auto} on mlsbench build.

Running under Harbor

MLS-Bench's 140 tasks are also available as a Harbor dataset so any Harbor-supported agent (claude-code, codex, openhands, terminus-2, …) can be evaluated on the suite without going through this repository's own runner:

PYTHONPATH=. harbor run -c run.yaml -a claude-code -m anthropic/claude-opus-4-7

The pre-rendered dataset, GPU-capable environment plugin, and reference Harbor config live under harbor/. See harbor/README.md for usage details and the self-contained per-task layout.

Repository Map

src/mlsbench/                  CLI, agent loop, execution backends, scoring
tasks/<task>/                  140 task definitions, parsers, scores, baselines
vendor/packages.yaml           External package registry
vendor/pkg_configs/<package>/  Package runtime configs and pre-edit patches
vendor/data_scripts/           Dataset and model-cache preparation scripts
configs/react.yaml             Runtime and provider configuration
configs/openevolve.yaml        OpenEvolve defaults
configs/discover.yaml          Discover defaults
harbor/                        Pre-rendered Harbor dataset (140 tasks) + run config

Fetched upstream repositories, built images, downloaded datasets, run workspaces, logs, and scheduler state are intentionally not versioned.

Full Task Catalog

Columns of the 140-task appendix table, in order: Area | Directory shorthand | Task | Research question | External package(s) | Baselines | Evaluation settings
LM agent-tool-reasoning LLM Agent Tool-Use Reasoning Strategy Studies how tool-use search, backtracking, and stopping policies affect answer validity and query efficiency. zhichengg/StableToolBench Greedy Chain (CoT)
DFS with LLM Ranking
DFSDT
StableToolBench I1-instruction 50q / deepseek-chat
StableToolBench I1-instruction 50q / qwen2.5-72b-instruct
StableToolBench I1-instruction 50q / qwen2.5-7b-instruct
LM llm-dllm-demask-strategy Masked Diffusion LM: Demasking Strategy Studies how demasking schedules, position selection, and token assignment affect diffusion language-model quality and decoding efficiency. ML-GSAI/LLaDA Top-K Margin
Confidence Greedy
KLASS
LLaDA / MATH-500
LLaDA / HumanEval
Dream / C4 prefix continuation
LM llm-pretrain-attention Autoregressive Attention Mechanism Studies how self-attention computation and positional handling affect autoregressive pretraining loss and downstream accuracy. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
QK-Norm
RoPE
RoPE + QK-Norm
ClimbMix val loss + WikiText-2/LAMBADA PPL
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM llm-pretrain-bitlinear Low-Bit Linear Pretraining Layer Studies how low-bit linear layers and quantization functions affect pretraining loss under discrete weight constraints. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
Binary Sign (BitNet)
Ternary 1.58-bit (BitNet b1.58)
INT2 Uniform
ClimbMix val loss + WikiText-2/LAMBADA PPL
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM llm-pretrain-embedding Autoregressive Embedding Strategy Studies how token embeddings, position embeddings, value embeddings, and weight tying affect autoregressive pretraining loss and downstream accuracy. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
Untied Embeddings
Value Embeddings
Bigram Hash Embeddings
ClimbMix val loss + WikiText-2/LAMBADA PPL
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM llm-pretrain-linear-attention Subquadratic Attention Mechanism Studies whether linear or subquadratic attention can reduce autoregressive validation loss while preserving downstream performance. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
RetNet
DeltaNet
GLA
ClimbMix val loss + WikiText-2/LAMBADA PPL
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM llm-pretrain-loss Autoregressive Pretraining Loss Studies how alternative next-token training losses affect autoregressive validation cross-entropy. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
Label Smoothing
Softcap Cross-Entropy
Z-Loss
ClimbMix val loss + WikiText-2/LAMBADA PPL
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM llm-pretrain-lr-schedule Pretraining Learning-Rate Schedule Studies how warmup, decay shape, and schedule horizon affect autoregressive pretraining validation loss. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
WSD (Warmup-Stable-Decay)
Trapezoidal
WSD with Inverse-Sqrt Decay
ClimbMix val loss + WikiText-2/LAMBADA PPL
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM llm-pretrain-mlp Transformer Feed-Forward Block Studies how activation, gating, and expansion choices in the feed-forward sublayer affect language-model validation loss. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
ReLU-Squared
SwiGLU
GeGLU
ClimbMix val loss + WikiText-2/LAMBADA PPL
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM llm-pretrain-normalization Normalization and Block Layout Studies how normalization placement, affine behavior, and transformer block layout affect pretraining stability and validation loss. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
RMSNorm
RMSNorm + Sandwich-Norm
RMSNorm (Parallel Block)
ClimbMix val loss + WikiText-2/LAMBADA PPL
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM llm-pretrain-optimizer Pretraining Optimizer Design Studies how optimizer choice, parameter grouping, and schedule coupling affect autoregressive pretraining validation loss. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
AdamW + Nesterov
Lion
Muon
ClimbMix val loss + WikiText-2/LAMBADA PPL
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM llm-pretrain-residual Transformer Residual Stream Strategy Studies how residual connections and information flow across transformer layers affect validation loss, perplexity, and accuracy metrics. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
Vanilla (Pre-LN)
ProRes
Learned Scaling
Block Attention Residuals
ClimbMix val loss + WikiText-2/LAMBADA PPL
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM llm-rl-advantage Reasoning RL Advantage Estimation Studies how advantage estimates for online language-model reinforcement learning affect mathematical reasoning accuracy. volcengine/verl GRPO
Dr. GRPO
Reinforce++ Baseline
GSM8K
MATH-500
AMC
LM llm-rl-importance-sampling Reasoning RL Importance-Sampling Granularity Studies how importance-sampling ratio granularity and clipping affect online language-model reinforcement learning for reasoning. volcengine/verl Token-Level (Vanilla PPO)
Sequence-Level (GSPO)
First-K Tokens
GSM8K
MATH-500
AMC
LM llm-rl-kl-estimator Actor Divergence Estimator for Reasoning RL Studies how per-token actor KL estimation controls reference-policy drift while preserving reasoning accuracy during online RL. volcengine/verl K1 (Unbiased Log-Ratio)
K2 (Squared Log-Ratio)
K3 (Low-Variance KL)
Absolute Log-Ratio
GSM8K
MATH-500
AMC
LM llm-rl-reward-normalization Pre-Advantage Reward Normalization Studies how reward normalization before advantage estimation affects reasoning accuracy in online language-model RL. volcengine/verl Outcome-Only (Raw)
Group-Std Normalization
Batch-Std Whitening
Length-Aware Normalization
GSM8K
MATH-500
AMC
LM llm-scaling-law-discovery Symbolic Scaling-Law Discovery Studies how symbolic functional forms and group-specific coefficients capture held-out scaling behavior. trevorstephens/gplearn Human Exact Form
SLDAgent-Style
Kernel Ridge Regression
XGBoost
SLDBench Vocabulary Scaling
SLDBench LR x Batch-Size Scaling
SLDBench Data-Constrained Scaling
LM mas-topology Language-Agent Collaboration Topology Studies how deterministic collaboration topology affects multi-agent code-generation quality and execution success. OpenBMB/ChatDev Chain
Star
Layered
HumanEval-33 (deepseek-chat, 4 agents)
HumanEval-33 (qwen2.5-72b-instruct, 4 agents)
SRDD-20 (deepseek-chat, 4 agents)
Rob jepa-planning Latent World-Model Planner Studies how goal-conditioned planning should exploit a fixed latent world model to improve navigation success. facebookresearch/eb_jepa Random
CEM
MPPI
iCEM
Two Rooms (Horizon 30)
Two Rooms (Horizon 60)
Two Rooms (Horizon 90)
Rob jepa-prediction-loss Temporal Latent Prediction Loss Studies how latent prediction objectives affect multi-step video representation quality. facebookresearch/eb_jepa MSE
Smooth L1
Cosine
Moving MNIST AP (small: henc=16, dstc=8, hpre=16)
Moving MNIST AP (base: henc=32, dstc=16, hpre=32)
Moving MNIST AP (large: henc=64, dstc=32, hpre=64)
Rob jepa-regularizer Anti-Collapse Representation Regularizer Studies how self-supervised regularization prevents representation collapse and improves linear-probe accuracy. facebookresearch/eb_jepa Naive
VICReg
SigReg
Barlow Twins
ResNet-18 Probe
ResNet-34 Probe
ResNet-50 Probe
Rob robo-diffusion-guidance Diffusion Guidance for Robot Trajectory Planning Studies guidance mechanisms for a fixed trajectory-level diffusion planner on D4RL MuJoCo, optimizing normalized score across hopper-medium-v2, walker2d-medium-v2, and halfcheetah-medium-v2. CleanDiffuserTeam/CleanDiffuser Diffuser (Classifier Guidance)
Classifier-Free Guidance
No Guidance
Decision Diffuser
D4RL Hopper-Medium-v2
D4RL Walker2d-Medium-v2
D4RL HalfCheetah-Medium-v2
Rob robo-diffusion-policy Diffusion Policy Learning for Robot Control Studies how diffusion policy training, value guidance, and action generation affect robot-control episode reward. CleanDiffuserTeam/CleanDiffuser DQL (Diffusion Q-Learning)
IDQL
Diffusion Policy
D4RL Hopper-Medium-v2
D4RL Walker2d-Medium-v2
D4RL HalfCheetah-Medium-v2
Rob robo-diffusion-sampling-method Efficient Diffusion Sampling for Robot Actions Studies how solver choice and sampling_steps affect DQL-style diffusion-policy normalized score at low NFE on D4RL MuJoCo. CleanDiffuserTeam/CleanDiffuser DDPM (100-Step Ancestral Sampling)
DDIM (20-Step Deterministic Sampling)
DPM-Solver++ 2M (10-Step)
D4RL Hopper-Medium-v2
D4RL Walker2d-Medium-v2
D4RL HalfCheetah-Medium-v2
Rob robo-humanoid-sim2real-algo Humanoid Transfer Policy Learning Studies how actor-critic architecture, policy optimization, and rollout processing affect humanoid command-following transfer. roboterax/humanoid-gym Default PPO
PPO with Adaptive KL
PPO with LayerNorm
RobotEra XBot-L Training
RobotEra XBot-L / Diverse Commands
RobotEra XBot-L / Forward-Only
RobotEra XBot-L / High Speed
Rob robomimic-bc-loss Behavioral Cloning Loss for Manipulation Studies how imitation-learning loss design affects rollout success for low-dimensional robot manipulation tasks. ARISE-Initiative/robomimic NLL with Entropy
Weighted NLL
Default (NLL)
Tool Hang (PH)
Can (PH)
Square (PH)
Rob robomimic-iql-vf Offline Value Loss for Manipulation Studies how asymmetric value regression loss design affects offline robot manipulation policy success. ARISE-Initiative/robomimic Quantile Regression
Huber Pinball
Default (Expectile)
Tool Hang (PH)
Can (PH)
Square (PH)
Rob robomimic-obs-encoder Observation Fusion Encoder for Imitation Learning Designs a multimodal robot state encoder for behavioral cloning to improve rollout success rate on manipulation tasks. ARISE-Initiative/robomimic Attention Fusion
Gated Fusion
Default (Concatenation)
Tool Hang (PH)
Can (PH)
Square (PH)
Rob tdmpc2-planning Trajectory Optimization for Model-Based Planning An online planning algorithm selects actions through learned-world-model trajectory optimization to improve episode reward. nicklashansen/tdmpc2 CEM
iCEM
MPPI
Walker Walk
Cheetah Run
Cartpole Swingup
Rob tdmpc2-simnorm Latent Representation Normalization for Model-Based RL Designs latent-state normalization for the TD-MPC2 encoder and dynamics world-model networks, evaluated by DMControl episode reward. nicklashansen/tdmpc2 SimNorm
L2 normalization
RMSNorm
Identity (no normalization)
DMControl walker-walk
DMControl cheetah-run
DMControl cartpole-swingup
V&G cv-3dgs-densification 3D Gaussian Splatting Densification Strategy Design Designs a 3D Gaussian Splatting densification strategy controlling clone, split, prune, reset, relocation, and sample-add behavior to improve held-out novel-view quality on Mip-NeRF 360 scenes. nerfstudio-project/gsplat Original 3DGS densification
AbsGS + Taming-3DGS + New Split
EDC-TamingGS-Abs
Mip-NeRF 360 garden (8x, best PSNR)
Mip-NeRF 360 bicycle (8x, best PSNR)
Mip-NeRF 360 bonsai (8x, best PSNR)
Mip-NeRF 360 stump (8x, best PSNR)
V&G cv-3dgs-regularizer 3D Gaussian Splatting Regularizer Design Designs a scalar regularizer added to the 3DGS photometric loss during 30k-step Mip-NeRF 360 reconstruction, evaluated on held-out novel views and scored by best PSNR. nerfstudio-project/gsplat No regularization
Scale + opacity L1
Effective-rank + scale/opacity L1
Mip-NeRF 360 garden (8x, best PSNR)
Mip-NeRF 360 bicycle (8x, best PSNR)
Mip-NeRF 360 bonsai (8x, best PSNR)
Mip-NeRF 360 stump (8x, best PSNR)
V&G cv-dbm-sampler Custom Sampler for Diffusion Bridge Models Designs a low-NFE sampler for Diffusion Bridge Models on image-to-image translation, ImageNet center-inpainting, and DIODE depth, evaluated by FID at NFE=5. thu-ml/DiffusionBridge DBIM
DBIM-HO (high-order)
DDBM (50 NFE reference)
ECSI
Edges2Handbags / e2h (FID, NFE=5)
ImageNet center-inpaint (FID, NFE=5)
DIODE depth (FID, NFE=5)
V&G cv-dbm-scheduler Time Scheduler for Diffusion Bridge Models (NFE=5) Designs a monotone low-step time schedule for Diffusion Bridge Models, evaluated by FID on Edges2Handbags, ImageNet center-inpainting, and DIODE depth at NFE=5. thu-ml/DiffusionBridge Karras EDM (rho=7)
Uniform (linear)
Cosine (Nichol-Dhariwal)
Log-linear (geometric)
Edges2Handbags / e2h (FID, NFE=5)
ImageNet center-inpaint (FID, NFE=5)
DIODE depth (FID, NFE=5)
V&G cv-diffusion-architecture Diffusion Model Architecture Design Design a denoising UNet backbone for unconditional CIFAR-10 DDPM training, optimizing best FID with fixed epsilon prediction and 50-step DDIM sampling. huggingface/diffusers Standard DDPM U-Net
Full-Attention U-Net
No-Attention U-Net
CIFAR-10 DDPM Small
CIFAR-10 DDPM Medium
CIFAR-10 DDPM Large
V&G cv-diffusion-cfg Diffusion Model: Classifier-Free Guidance Optimization Design a classifier-free guidance method for Stable Diffusion text-to-image generation across SD v1.5, Stable Diffusion 2 Base, and Stable Diffusion XL; evaluation generates COCO-caption images and official scoring uses per-model FID. CFGpp-diffusion/CFGpp Standard CFG
CFG++
Zero-Init CFG++
Stable Diffusion v1.5 / COCO captions / NFE=10
Stable Diffusion 2 Base / COCO captions / NFE=10
Stable Diffusion XL Base 1.0 / COCO captions / NFE=10
V&G cv-diffusion-conditioning Class-Conditional Diffusion: Conditioning Injection Methods Design class-conditioning injection for a CIFAR-10 class-conditional UNet2DModel/DDPM, optimizing best FID with 50-step DDIM sampling. huggingface/diffusers Concat-FiLM
Cross-Attention
AdaLN-Zero
CIFAR-10 Class-Conditional Small UNet2DModel
CIFAR-10 Class-Conditional Medium UNet2DModel
CIFAR-10 Class-Conditional Large UNet2DModel
V&G cv-diffusion-efficiency Diffusion Model: Sampler Efficiency Optimization Design a Stable Diffusion sampler update rule for COCO-caption text-to-image generation at a fixed NFE=20 budget; official scoring uses per-model FID. CFGpp-diffusion/CFGpp DDIM
DPM++ 3M
DPM++ 2S
Stable Diffusion v1.5 / COCO captions / NFE=20
Stable Diffusion 2 Base / COCO captions / NFE=20
Stable Diffusion XL Base 1.0 / COCO captions / NFE=20
V&G cv-diffusion-prediction Diffusion Prediction Parameterization Design a prediction target and consistent x0 inversion for unconditional CIFAR-10 UNet2DModel diffusion, optimizing best FID with 50-step DDIM sampling. huggingface/diffusers Epsilon Prediction
V-Prediction
X0 Prediction
CIFAR-10 Unconditional Small UNet2DModel
CIFAR-10 Unconditional Medium UNet2DModel
CIFAR-10 Unconditional Large UNet2DModel
V&G cv-meanflow-perceptual-loss Flow Map with Perceptual Loss Studies whether auxiliary perceptual losses on denoised images improve CIFAR-10 FID for MeanFlow flow-map training with DiT backbones. snap-research/alphaflow Pure MSE Velocity
MSE + Charbonnier + LPIPS + Gradient + Multiscale
MSE + LPIPS + Gradient + Multiscale + FFT
CIFAR-10 Small DiT
CIFAR-10 Medium DiT
CIFAR-10 Large DiT
V&G cv-vae-loss VAE Loss Function Design for Image Reconstruction Studies how VAE loss components affect CIFAR-10 AutoencoderKL reconstruction quality, scored primarily by rFID on the full test set. huggingface/diffusers L1 + KL
L1 + LPIPS + KL
L1 + LPIPS + KL + PatchGAN
CIFAR-10 AutoencoderKL Small
CIFAR-10 AutoencoderKL Medium
CIFAR-10 AutoencoderKL Large
RL marl-centralized-critic Cooperative MARL Centralized Critic Architecture for MAPPO Studies centralized critic architectures for MAPPO on SMACLite cooperative MARL maps, scored by greedy-policy test win rate and return. uoe-agents/epymarl IPPO Decentralized Critic
MAPPO Centralized Critic
MAT-Style Attention Critic
SMACLite MMM (10-agent heterogeneous)
SMACLite 2s3z (5-agent heterogeneous)
SMACLite 3s5z (8-agent heterogeneous)
RL meta-rl Meta-RL: Context Encoder for PEARL Task Inference Studies PEARL context encoders that map transition tuples to latent task representations for fast adaptation, evaluated by meta_test_return after 20 meta-training iterations. katerakelly/oyster PEARL MLP Context Encoder
PEARL Recurrent Context Encoder
PEARL Attention Context Encoder
Half-Cheetah Velocity (30 train/10 test tasks)
Sparse Point Robot (40 train/10 test tasks)
Point Robot (40 train/10 test tasks)
RL meta-rl-algorithm Meta-RL Algorithm Design Studies complete meta-RL algorithm design across task inference, policy conditioning, and meta-training, scored by meta_test_return on held-out tasks after the fixed short-budget protocol. katerakelly/oyster PEARL
FOCAL
VariBAD
Half-Cheetah Velocity (30 train/10 test tasks)
Sparse Point Robot (40 train/10 test tasks)
Point Robot (40 train/10 test tasks)
RL rl-intrinsic-exploration Intrinsic Exploration for Sparse Rewards Studies how intrinsic rewards and advantage mixing affect exploration and return in sparse-reward Atari environments. vwxyzjn/cleanrl PPO
RND
ICM
Tutankham-v5
Frostbite-v5
PrivateEye-v5
RL rl-offline-adroit Offline Dexterous Manipulation from Narrow Demonstrations Studies how offline RL algorithms learn dexterous manipulation from narrow human demonstration datasets. corl-team/CORL IQL
AWAC
ReBRAC
Pen-Human-v1
Hammer-Human-v1
Door-Cloned-v1
RL rl-offline-continuous Q-Overestimation Suppression for Offline Continuous Control Studies how offline continuous-control algorithms suppress out-of-distribution Q-value overestimation. corl-team/CORL ReBRAC
TD3-BC
IQL
HalfCheetah-Medium-v2
Maze2D-Medium-v1
Walker2d-Medium-v2
RL rl-offline-off2on Offline-to-Online Fine-Tuning Without Forgetting Studies how offline-to-online reinforcement learning prevents forgetting and value collapse during continued interaction. corl-team/CORL IQL
AWAC
SPOT
Pen-Cloned-v1
Hammer-Cloned-v1
Hammer-Expert-v1
RL rl-offpolicy-continuous Off-Policy Actor-Critic for Continuous Control Changes off-policy actor-critic update rules, losses, or exploration strategies to improve mean episodic return on continuous-control tasks. vwxyzjn/cleanrl DDPG
TD3
SAC
HalfCheetah-v4
Reacher-v4
Ant-v4
RL rl-onpolicy-continuous On-Policy Actor-Critic for Continuous Control Changes on-policy actor-critic objectives, update rules, or exploration mechanisms to improve mean episodic return on continuous-control tasks. vwxyzjn/cleanrl PPO
AWR
PPO (KL Penalty)
HalfCheetah-v4
Swimmer-v4
InvertedDoublePendulum-v4
RL rl-reward-learning Inverse RL Reward Learning from Demonstrations Studies how reward models learned from expert demonstrations affect downstream policy return in continuous-control locomotion. HumanCompatibleAI/imitation GAIL
AIRL
BC
HalfCheetah-v4
Hopper-v4
Walker2d-v4
RL rl-value-atari Value-Based Visual Control Studies how value-based RL losses, update rules, and exploration strategies affect visual-control episodic return. vwxyzjn/cleanrl QR-DQN
C51
Double-DQN
BreakoutNoFrameskip-v4
SeaquestNoFrameskip-v4
PongNoFrameskip-v4
RL rl-value-discrete Value-Based Discrete Control Changes value estimation, uncertainty handling, or replay-based update rules to improve episodic return on discrete-action control tasks. vwxyzjn/cleanrl QR-DQN
Dueling-DQN
C51
CartPole-v1
LunarLander-v2
Acrobot-v1
RL safe-rl Constraint Handling for Safe RL Changes Lagrangian or controller-style multiplier updates and cost-reward advantage mixing to improve reward while keeping episode cost below target. PKU-Alignment/omnisafe Naive PPO
Lagrangian PPO
PID Lagrangian
SafetyPointGoal1-v0
SafetyCarGoal1-v0
SafetyPointButton1-v0
Sys dlm-dkv-policy Diffusion LM KV Cache Policy Studies how token-state refresh intervals, masks, transfer ratios, and fallbacks affect denoising quality and cache reuse. maomaocun/dLLM-Cache Vanilla (Uncached)
dLLM-Cache
d2Cache
Elastic-Cache
MATH-500
HumanEval
ARC-Challenge
Sys llm-kv-adaptive-quantization LLM KV Cache: Adaptive Quantization Policy Studies adaptive 4-bit KV-cache quantization for instruction-tuned long-context inference, trading benchmark final-score quality against effective KV bits and compression. huggingface/transformers KIVI Overlap (4-bit)
KVTuner-4 Per-Token
KVTuner-4 KIVI
SQuat Subspace (4-bit)
LongBench-E hotpotqa_e QA F1
LongBench-E passage_retrieval_en_e retrieval score
LongBench-E repobench-p_e code-similarity score
NeedleBench NIAH exact phrase retrieval
GSM8K exact final-answer accuracy
Sys llm-kv-selection-budgeting LLM KV Cache Selection Budgeting Studies how selection and eviction controllers allocate layer budgets and recent windows for quality, latency, and memory tradeoffs. huggingface/transformers Full Attention
StreamingLLM
Expected Attention
LagKV
LongBench-E hotpotqa_e QA F1
LongBench-E passage_retrieval_en_e retrieval score
LongBench-E repobench-p_e code-similarity score
LongBench v2 train split multiple-choice accuracy
GSM8K exact final-answer accuracy
Sys llm-kv-structural-reduction LLM Pretraining: KV-Structural Reduction Studies GPT-style KV-state structural reduction through MHA, MQA, GQA, and MLA-style latent KV compression under fixed nanoGPT pretraining. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
MHA
MQA
GQA
MLA
ClimbMix val loss + KV bytes/token + WikiText-2/WikiText-103/LAMBADA heldout loss
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
Sys llm-pretrain-kernel LLM Pretraining: Custom GPU Kernel Optimization Studies custom/fused MLP kernels for nanoGPT pretraining while preserving ClimbMix validation, held-out perplexity, and downstream lm-eval quality. karpathy/nanoGPT
EleutherAI/lm-evaluation-harness
ReLU-Squared (Torch)
Triton GELU
Triton ReLU-Squared (Fused)
ClimbMix val loss + WikiText-2/LAMBADA PPL
HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
Sys llm-ptq-algorithm LLM Post-Training Quantization (PTQ) Algorithm Design a post-training quantization algorithm for a pretrained LLM that minimizes WikiText-2 perplexity degradation under INT4/INT3 group quantization without retraining. IST-DASLab/gptq Round-to-Nearest (RTN)
GPTQ
AWQ
PTQ INT4
PTQ INT3
PTQ INT4 (g64)
Sys llm-qat-algorithm LLM Quantization-Aware Training (QAT) Algorithm Design a quantization-aware training algorithm for a pretrained LLM that minimizes WikiText-2 perplexity after INT4/INT3/INT2 quantization at inference time. custom No QAT
STE
LSQ
Finetune + PTQ
QAT INT4
QAT INT3
QAT INT2
Sys mlsys-fused-attention Fused Attention Kernel Design for H100 GPUs Design an OpenAI Triton fused self-attention forward kernel for H100 GPUs that maximizes throughput (TFLOPs/s) while preserving numerical correctness. Dao-AILab/flash-attention FlashAttention
FlashAttention-2
FlashAttention-3
Head Dim 64 / Seq 4K
Head Dim 128 / Seq 8K
Head Dim 256 / Seq 16K
Sys mlsys-moe-load-balance MoE Expert Parallelism Load Balancing Design an efficient MoE expert-replica placement algorithm that minimizes GPU/node load imbalance while preserving inter-node locality and low runtime. deepseek-ai/eplb Greedy
Zigzag
Flat Zigzag
DeepSeek-V3
Qwen3-MoE
DeepSeek-V2
Stress-Skew
Sys mlsys-sparse-attention-inference Long-Context Inference-Time Sparse Attention Design an inference-time sparse attention module for a pretrained instruction-tuned causal LLM that preserves NIAH and LongBench quality under a 25% density budget without retraining. custom Dense
StreamingLLM
BigBird
Block Top-K
NIAH (8K)
LongBench Qasper
LongBench MultiFieldQA-EN
Sci ai4bio-mutation-effect-prediction Mutation Fitness Predictor Studies how mutant and wild-type protein representations can predict functional effects of sequence mutations. OATML-Markslab/ProteinGym Ridge Regression
MLP
Reshape CNN
BLAT_ECOLX
ESTA_BACSU
RASH_HUMAN
Sci ai4bio-protein-inverse-folding Backbone-to-Sequence Inverse Folding Studies how geometric structure encoding and sequence decoding recover amino-acid sequences from protein backbones. A4Bio/ProteinInvBench ProteinMPNN
PiFold
GVP
CATH 4.2
CATH 4.3
TS50
Sci ai4bio-protein-structure-repr Geometric Protein Structure Encoder Studies how local and global geometric protein representations transfer to structure-aware function prediction. a-r-j/ProteinWorkshop SchNet
EGNN
GearNet
EC
GO-BP
Fold
Sci ai4sci-climate-emulation Atmospheric Column Emulator Architecture Studies how neural emulator architecture maps vertical atmospheric states to sub-grid physics tendencies across training budgets. leap-stc/ClimSim CNN
Encoder-Decoder
U-Net
HSR
Short Budget
Medium Budget
Long Budget
Sci ai4sci-inverse-diffusion-algo Diffusion-Prior Inverse Solver Studies how diffusion priors and measurement guidance can be combined for inverse-problem reconstruction. devzhk/InverseBench DPS
REDDiff
LGD
Inverse Scattering
Black Hole Imaging
Inpainting
Sci ai4sci-mol-property-prediction Molecular Representation Predictor Studies how molecular graph and geometric representations improve property prediction under scaffold-based generalization. deepmodeling/Uni-Mol D-MPNN
Uni-Mol
GIN
BBBP
BACE
Tox21
Sci ai4sci-pla-binding-affinity Protein-Ligand Interaction Model Studies how intra- and inter-molecular geometric interactions should be represented to predict binding affinity. guaguabujianle/EHIGN_PLA EHIGN
GIGN
SchNet
EGNN
PDBbind 2013
PDBbind 2016
PDBbind 2019
Sci ai4sci-vs-contrastive-scoring Contrastive Virtual-Screening Objective Studies how projection geometry and contrastive losses affect zero-shot protein-ligand screening quality. jianhuiwemi/HypSeek Vanilla CLIP
HCC
HCC + Hyperbolic Cone
HypSeek Training
DUD-E
LIT-PCBA
DEKOIS 2.0
Sci ai4sci-weather-forecast-aggregation Weather Forecast Variable Aggregation Studies how weather forecasting models aggregate information across heterogeneous meteorological variables for optimal prediction. microsoft/ClimaX Cross-Attention
Mean Pooling
Learned Weighted Sum
Z500 3-Day
T850 5-Day
10m-Wind 7-Day
Sci pde-design-solver Industrial CFD Design: Custom Neural Operator Design Designs and implements a custom neural operator for industrial aerodynamic design prediction on 3D unstructured point clouds. thuml/Neural-Solver-Library PointNet
GraphSAGE
Graph U-Net
Transolver
Car Design
AirfRANS
Aircraft Design
Opt optimization-bilevel Optimization Bilevel Studies a fixed bilevel-optimization benchmark based on Shen and Chen's penalty-based bilevel gradient descent experiments, selecting supported methods and tuning paper-style strategy hyperparameters. hanshen95/penalized-bilevel-gradient-descent V-PBGD
G-PBGD
RHG
T-RHG
Toy Convergence
HyperClean (Linear)
HyperClean (MLP)
Opt optimization-convex-concave RAIN Convex-Concave Studies gradient-norm convergence on the exact convex-concave benchmark instances used by the official RAIN bilinear and delta-function scripts. TrueNobility303/RAIN SEG
R-SEG
SEAG
RAIN
Default Noise
Low Noise
High Noise
Opt optimization-diagonal-net Optimizer Design for Diagonal-Net Sparse Recovery Designs an optimizer that recovers a sparse linear predictor from fewer training samples under a diagonal-net parameterization with noisy labels. TrueNobility303/RAIN SGD
AdaGrad
Adam
Adam (Alt.)
d=200, k=5, s=0.1
d=500, k=10, s=0.1
d=500, k=10, s=0.2
d=10000, k=50
Opt optimization-dp-sgd Differentially Private SGD: Privacy-Utility Optimization Design an improved DP-SGD variant that achieves higher test accuracy under the same (epsilon, delta)-differential privacy budget. custom Standard DP-SGD
Automatic Clipping (AUTO-S)
Adaptive Quantile Clipping
Step-Decay Noise Schedule
MNIST
Fashion-MNIST
CIFAR-10
Opt optimization-evolution-strategy Evolutionary Optimization Strategy Design Design a novel combination of selection, crossover, mutation operators and/or evolutionary loop for continuous black-box optimization across multiple benchmark functions. DEAP/deap GA (SBX)
CMA-ES
Differential Evolution
L-SHADE
Rastrigin (30D)
Rosenbrock (30D)
Ackley (30D)
Rastrigin (100D)
Opt optimization-gradient-compression Gradient Compression for Communication-Efficient Distributed Training Design a gradient compression operator that reduces communication cost in distributed training while maintaining convergence quality. custom TopK Sparsification with Error Feedback
QSGD (Quantized SGD)
SignSGD
ResNet-20 / CIFAR-10
VGG-11-BN / CIFAR-100
ResNet-56 / CIFAR-10
Opt optimization-hyperparameter-search Hyperparameter Optimization: Custom Search Strategy Design Design a custom HPO strategy that improves final validation score and convergence under limited multi-fidelity evaluation budgets. custom Random Search
TPE
Hyperband
DEHB
BOHB
Optuna CMA-ES
XGBoost
SVM
Neural Net
Opt optimization-multi-objective Multi-Objective Optimization: Custom Evolutionary Strategy Design Design a custom multi-objective evolutionary strategy that improves convergence, diversity, and spread on standard benchmark problems. DEAP/deap NSGA-II
MOEA/D
SPEA2
NSGA-III
RVEA
AGE-MOEA
ZDT1
ZDT3
DTLZ2
DTLZ1
Opt optimization-nas Sample-Efficient Neural Architecture Search Design and implement a sample-efficient NAS optimizer that discovers high-performing architectures in the NAS-Bench-201 search space under a strict query budget. automl/naslib Random Search
REA
BANANAS
CIFAR-10
CIFAR-100
ImageNet16-120
Opt optimization-online-bandit Online Bandits: Exploration-Exploitation Strategy Design Design and implement a bandit policy that minimizes cumulative regret across diverse multi-armed bandit settings. SMPyBandits/SMPyBandits UCB1
Thompson Sampling
KL-UCB
Stochastic MAB
Contextual Bandit
Non-Stationary Bandit
Opt optimization-pac-bayes-bound PAC-Bayes Generalization Bound Optimization Design a tighter PAC-Bayes generalization bound by optimizing the bound formulation, prior/posterior parameterization, and KL divergence estimation for stochastic neural networks. mperezortiz/PBB McAllester
Catoni
Quadratic
MNIST (FCN)
MNIST (CNN)
FashionMNIST (CNN)
Opt optimization-parity Optimization Parity Improve a fixed two-layer MLP's ability to learn sparse parity by designing only its initialization, training dataset, and AdamW hyperparameters. pytorch/examples Default
Multi-Epoch
No Weight Decay
n=32, k=8
n=50, k=8
n=64, k=8
Opt optimization-variance-reduction Variance Reduction for Stochastic Optimization Design an improved variance reduction strategy for stochastic gradient descent on finite-sum optimization problems. custom SVRG
STORM
STORM+
Logistic Regression
MLP
Ill-Conditioned
CAL meta-fewshot-classification Few-Shot Image Classification Method Studies how support encoding, query comparison, and loss design affect episodic few-shot image-classification accuracy. sicara/easy-few-shot-learning ProtoNet
MatchingNet
RelationNet
Mini-ImageNet 5w-5s
CIFAR-FS
CUB
CAL meta-inner-loop-optimizer Meta-Learning Inner-Loop Optimizer Studies how differentiable inner-loop adaptation rules affect few-shot classification accuracy in gradient-based meta-learning. learnables/learn2learn MAML
Meta-SGD
ANIL
Mini-ImageNet 5w-1s
Mini-ImageNet 5w-5s
CIFAR-FS 5w-5s
CAL ml-active-learning Pool-Based Active Learning Query Strategy Studies how unlabeled-sample query rules affect accuracy under a fixed labeling budget. JordanAsh/badge BADGE
BAIT
BALD
Least Confidence
Random
Letter
Spambase
Splice
CAL ml-anomaly-detection Unsupervised Tabular Anomaly Detector Studies how unsupervised anomaly scoring algorithms identify outliers across tabular data distributions. custom IF (Isolation Forest)
LOF
OCSVM
ECOD
COPOD
Cardio
Thyroid
Satellite
Shuttle
CAL ml-calibration Post-Hoc Probability Calibration Mapping Studies how post-hoc probability transforms improve classifier confidence calibration. custom Platt
Temperature Scaling
Isotonic Regression
RF / MNIST
MLP / Fashion-MNIST
GBM / Madelon
SVM / Breast Cancer
CAL ml-clustering-algorithm Geometry-Robust Clustering Algorithm Studies how clustering objectives and distance metrics handle convex blobs, non-convex moons, and high-dimensional digit data. custom K-Means
DBSCAN
HDBSCAN
Blobs
Moons
Digits
CAL ml-continual-regularization Continual Learning Importance Regularizer Changes parameter-importance estimation and regularization loss to reduce catastrophic forgetting and improve final average accuracy across contexts. GMvandeVen/continual-learning EWC
SI
Online EWC
Split-MNIST
Permuted-MNIST
Split-CIFAR100
CAL ml-dimensionality-reduction Nonlinear 2D Structure-Preserving Embedding Studies how nonlinear dimensionality reduction preserves neighborhood structure in low-dimensional embeddings. custom PCA
t-SNE
UMAP
TriMap
PaCMAP
MNIST
Fashion-MNIST
20 Newsgroups
CAL ml-ensemble-boosting Adaptive Boosting Weight and Target Strategy Studies how pseudo-targets, learner weights, and sample reweighting affect boosted ensemble performance. custom AdaBoost
Gradient Boosting
XGBoost-style
Breast Cancer
Diabetes
California Housing
CAL ml-federated-aggregation Heterogeneous Federated Server Aggregation Changes server-side client selection and model aggregation to improve federated test accuracy under heterogeneous client data. adap/flower FedAvg
FedProx
SCAFFOLD
CIFAR-10 (Non-IID alpha=0.1)
FEMNIST
Shakespeare
CAL ml-missing-data-imputation Correlation-Aware Tabular Imputation Studies how feature correlations and predictive structure guide missing-value imputation in tabular data. custom Mean Imputation
KNN Imputation
MICE
MissForest
GAIN
Breast Cancer Wisconsin
Wine
California Housing
CAL ml-selective-deferral Selective Deferral Under Subgroup Shift Studies how acceptance and deferral rules trade off selective risk, subgroup robustness, and coverage on AIF360 tabular datasets. custom Confidence Thresholding
Conformal Abstention
Learned Deferral
Group-wise Thresholding
Adult
COMPAS
Law School GPA
CAL ml-subgroup-calibration-shift Shift-Robust Subgroup Calibration Studies how post-hoc calibration behaves under subgroup distribution shift and worst-group reliability constraints on AIF360 tabular datasets. custom Temperature Scaling
Isotonic Regression
Beta Calibration
Group-wise Temperature Scaling
Adult
COMPAS
Law School GPA
CAL ml-symbolic-regression Genetic Programming Search for Symbolic Regression Studies how symbolic-regression search strategies recover generalizable analytical expressions. trevorstephens/gplearn Standard GP
Parsimony GP
Lexicase GP
Nguyen-7
Nguyen-10
Koza-3
DL cv-classification-loss Adaptive Classification Loss Modify the training loss over logits and labels to improve classification accuracy across image-model families. custom Label Smoothing
Focal Loss
PolyLoss
ResNet-56 / CIFAR-100
VGG-16-BN / CIFAR-100
MobileNet-V2 / Fashion-MNIST
DL cv-data-augmentation Image Augmentation Policy Design the training transform pipeline combining geometric, photometric, and erasing operations to improve image-classification generalization. custom Cutout
RandAugment
TrivialAugmentWide
ResNet-20 / CIFAR-10
ResNet-56 / CIFAR-100
MobileNet-V2 / Fashion-MNIST
DL cv-multitask-loss Hierarchical Classification Loss Weighting Studies how fine-label and coarse-label objectives should be combined to improve hierarchical image classification. custom Uncertainty Weighting
DWA
PCGrad
ResNet-20 / CIFAR-100-MT
ResNet-56 / CIFAR-100-MT
VGG-16-BN / CIFAR-100-MT
DL cv-pooling-aggregation Spatial Feature Aggregation Studies how global spatial features should be aggregated to improve image-classification accuracy across convolutional architectures. custom Global Max
GeM
Avg + Max
ResNet-56 / CIFAR-100
VGG-16-BN / CIFAR-100
MobileNet-V2 / Fashion-MNIST
DL cv-sample-weighting Long-Tail Class Reweighting Studies how class-count statistics should be mapped to loss weights to improve test accuracy on balanced test sets for long-tailed image classification. custom Inverse Frequency
Class-Balanced (Effective Number)
Balanced Softmax
ResNet-32 / CIFAR-10-LT
ResNet-32 / CIFAR-100-LT
VGG-16-BN / CIFAR-100-LT
DL dl-activation-function Convolutional Activation Nonlinearity Studies how drop-in activation functions affect accuracy across convolutional image classifiers. custom GELU
SiLU
Mish
ResNet-20 / CIFAR-10
VGG-16-BN / CIFAR-100
MobileNet-V2 / Fashion-MNIST
DL dl-lr-schedule Architecture-Aware Learning-Rate Scheduling Designs an epoch-level learning-rate curve conditioned on architecture and dataset to improve convergence and final classification accuracy. custom Cosine
WarmupCosine
OneCycle
ResNet-20 / CIFAR-10
ResNet-56 / CIFAR-100
MobileNet-V2 / Fashion-MNIST
DL dl-normalization Normalization Statistics and Affine Design Studies how normalization statistics and affine behavior affect convolutional training stability and test accuracy. custom GroupNorm
Batch-Instance Norm
Switchable Norm
ResNet-56 / CIFAR-100
ResNet-110 / CIFAR-100
MobileNet-V2 / Fashion-MNIST
DL dl-regularization Adaptive Regularization Loss Adds a model-, output-, input-, or epoch-dependent regularization term to improve classification generalization beyond standard weight decay. custom DropBlock
Confidence Penalty
Orthogonal Regularization
ResNet-56 / CIFAR-100
VGG-16-BN / CIFAR-100
MobileNet-V2 / Fashion-MNIST
DL dl-residual-connection Residual Block Skip Design Studies how shortcut transformations and residual branch computation affect optimization and generalization across network depths. custom Pre-Activation
Gated Residual
Stochastic Depth
ResNet-20 / CIFAR-10
ResNet-56 / CIFAR-100
ResNet-110 / CIFAR-100
DL dl-weight-initialization DL Weight Initialization Strategy Design Designs data-independent initialization for convolutional, normalization, and classifier layers to improve convergence and final accuracy. custom Kaiming Normal
Fixup
Orthogonal
ResNet-56 / CIFAR-100
VGG-16-BN / CIFAR-100
MobileNet-V2 / Fashion-MNIST
TS quant-concept-drift Concept-Drift-Aware Quantitative Forecasting Redesigns the stock prediction model and data pipeline to handle temporal distribution shift and improve signal quality and portfolio metrics. microsoft/qlib TRA
AdaRNN
LightGBM
CSI 300
CSI 300 (Shifted)
CSI 300 (Recent)
TS quant-graph-stock Graph-Based Quantitative Forecasting Studies how inter-asset graph relationships affect return signal quality and portfolio performance. microsoft/qlib HIST
GATs
LightGBM
CSI 300
CSI 100
CSI 300 (Recent)
TS quant-stock-prediction Quantitative Return Forecasting Studies how predictive models and input processing affect next-period return signals and portfolio performance. microsoft/qlib LightGBM
LSTM
Transformer
CSI 300
CSI 100
CSI 300 (Recent)
TS stf-traffic-forecast Spatial-Temporal Traffic Forecasting Model Studies how spatial-temporal models capture sensor-network dependencies for traffic forecasting. GestaltCogTeam/BasicTS STID
DLinear
StemGNN
iTransformer
TimesNet
SOFTS
TimeMixer
METR-LA
PEMS-BAY
PEMS04
TS ts-anomaly-detection Reconstruction Model for Time-Series Anomaly Detection Designs an unsupervised reconstruction model that detects anomalous multivariate time-series segments, with the goal of improving F-score. thuml/Time-Series-Library DLinear
TimesNet
PatchTST
PSM
MSL
SMAP
TS ts-classification Multivariate Time-Series Classification Model Studies how representation learning improves classification of multivariate time-series signals. thuml/Time-Series-Library DLinear
TimesNet
PatchTST
EthanolConcentration
FaceDetection
Handwriting
TS ts-exogenous-forecast Exogenous-Variable Target Forecasting Model Studies how exogenous variables improve target-channel forecasting. thuml/Time-Series-Library DLinear
PatchTST
iTransformer
TimeXer
ETTh1
Weather
ECL
TS ts-imputation Masked Multivariate Time-Series Imputation Studies how imputation models reconstruct missing regions in multivariate time series. thuml/Time-Series-Library DLinear
TimesNet
PatchTST
ETTh1 (25% missing)
Weather (25% missing)
ECL (25% missing)
TS ts-long-term-forecast Multivariate Long-Horizon Forecasting Model Studies how long-horizon forecasting models predict future multivariate sequences. thuml/Time-Series-Library DLinear
PatchTST
iTransformer
TimeMixer
TimeXer
ETTh1
Weather
ECL
TS ts-short-term-forecast Univariate Short-Horizon Forecasting Model Studies how short-horizon forecasting models predict seasonal univariate series. thuml/Time-Series-Library DLinear
TimesNet
PatchTST
TimeMixer
M4 Monthly
M4 Quarterly
M4 Yearly
SCR causal-discovery-discrete Discrete Causal Graph Discovery Studies how causal discovery algorithms recover equivalence-class graph structure from discrete observational data. py-why/causal-learn PC
GES
GRaSP
BOSS
Hill Climbing
Cancer
Child
ALARM
HAILFINDER
Win95pts
SCR causal-observational-linear-gaussian Linear Gaussian Causal Discovery Studies how observational algorithms recover causal graph structure under linear Gaussian assumptions. py-why/causal-learn PC
GRaSP
BOSS
ER (n=10)
ER (n=20)
SF (n=50)
SF (n=50, Hard)
ER (n=20, Noisy)
SCR causal-observational-linear-non-gaussian Non-Gaussian Causal Discovery Studies how non-Gaussian structure can identify directed causal relationships from observational data. py-why/causal-learn ICA-LiNGAM
DirectLiNGAM
NOTEARS
ER (n=30)
ER (n=50)
SF (n=100)
SCR causal-observational-nonlinear Nonlinear Causal Discovery Studies how nonlinear additive-noise assumptions support directed causal graph recovery from observations. py-why/causal-learn CAM
NOTEARS-MLP
DirectLiNGAM
GraN-DAG
SF (n=20, GP)
ER (n=20, Gauss)
ER (n=12, Low-Sample)
SCR causal-treatment-effect Heterogeneous Treatment Effect Estimation Studies how observational estimators recover individual and average treatment effects on synthetic CATE benchmark families. custom S-Learner
T-Learner
IPW
Causal Forest
DR-Learner
R-Learner
IHDP-inspired Synth
Jobs/LaLonde-inspired Synth
ACIC-inspired Synth
SCR graph-generation Unconditional Graph Generator Architecture Studies how graph generator architecture affects distributional match to target graph statistics. pyg-team/pytorch_geometric GraphVAE
GRAN
DiGress
Community-Small
Ego-Small
ENZYMES
SCR graph-graph-classification Structure-Aware Graph Readout Pooling Studies how graph-level readout mechanisms affect graph classification accuracy and macro F1 under a fixed message-passing backbone. pyg-team/pytorch_geometric GIN + Sum
SAGPool
DiffPool
MUTAG
PROTEINS
NCI1
SCR graph-link-prediction Graph Link Encoder-Decoder Studies how node encoders and edge decoders affect missing-link prediction quality. custom GCN + MLP Decoder
VGAE
SEAL
Cora
CiteSeer
ogbl-collab
SCR graph-node-classification Graph Node Message Passing Studies how message-passing layers affect node classification across citation network benchmarks. pyg-team/pytorch_geometric GCN
GAT
GraphSAGE
Cora
CiteSeer
PubMed
SCR graph-signal-propagation Homophily-Heterophily Graph Filter Changes the graph signal propagation filter to improve node classification accuracy across homophilic and heterophilic graphs. ivam-he/ChebNetII GPR-GNN
BernNet
ChebNetII
Cora
CiteSeer
Texas
Cornell
TL security-adversarial-attack-black-box-score Score-Based Black-Box Linf Attack Designs a query-efficient black-box Linf evasion attack to improve attack success rate under a fixed per-sample query budget. Harry24k/adversarial-attacks-pytorch Square Attack
SPSA
Random Search
ResNet-20 / CIFAR-10
VGG-11-BN / CIFAR-10
MobileNet-V2 / CIFAR-10
ResNet-20 / CIFAR-100
MobileNet-V2 / CIFAR-100
TL security-adversarial-attack-sparse-l0 Sparse L0 Adversarial Attack Studies how sparse perturbation strategies improve attack success while respecting a strict pixel budget. Harry24k/adversarial-attacks-pytorch OnePixel
SparseFool
JSMA
Pixle
Sparse-RS
Rebuffi-R18 (l2-AT) / CIFAR-10
Augustin (l2-robust) / CIFAR-10
Engstrom (l2-robust) / CIFAR-10
TL security-adversarial-attack-white-box-linf White-Box Linf Evasion Attack Designs a gradient-based white-box Linf attack to improve attack success rate while respecting the perturbation budget. Harry24k/adversarial-attacks-pytorch FGSM
PGD
MI-FGSM
AutoAttack
ResNet-20 / CIFAR-10
VGG-11-BN / CIFAR-10
ResNet-20 / CIFAR-100
VGG-11-BN / CIFAR-100
MobileNet-V2 / CIFAR-100
TL security-adversarial-training Linf Adversarial Training for Robust Accuracy Studies how adversarial training procedures improve robust accuracy while maintaining clean accuracy. Harry24k/adversarial-attacks-pytorch Standard Training
PGD-AT
TRADES
MART
AWP + TRADES
SmallCNN / MNIST
PreAct ResNet-18 / CIFAR-10
VGG-11-BN / CIFAR-10
PreAct ResNet-18 / CIFAR-100
TL security-backdoor-defense Poisoned-Sample Scoring for Backdoor Filtering Designs a suspicion scoring rule that identifies and filters backdoored training examples to reduce attack success rate while preserving clean accuracy. custom Confidence Filter
Spectral Signatures
Activation Clustering
Z-Score Outlier
ResNet-20 / CIFAR-10 (BadNets)
VGG-16-BN / CIFAR-100 (Blend)
MobileNet-V2 / Fashion-MNIST (BadNets)
TL security-machine-unlearning Targeted Update Rules for Class Unlearning Designs an unlearning update rule that removes forget-class information while improving retained accuracy and reducing forget-set membership leakage. custom Retain Fine-Tune
Negative Gradient
Bad Teacher
SCRUB
ResNet-20 / CIFAR-10 (Class 0)
VGG-16-BN / CIFAR-100 (Class 0)
MobileNet-V2 / Fashion-MNIST (Class 0)
TL security-membership-inference-defense Training Regularization for Membership Privacy Studies how privacy-preserving training losses reduce membership leakage while maintaining accuracy. custom ERM
Label Smoothing
Confidence Penalty
RelaxLoss
ResNet-20 / CIFAR-10
VGG-16-BN / CIFAR-100
MobileNet-V2 / Fashion-MNIST
TL security-poison-robust-learning Robust Losses for Label-Flip Poisoning Designs a robust loss or sample-weighting rule that improves clean accuracy under label-flip poisoning and reduces poisoned-label memorization. custom Cross-Entropy
Generalized Cross-Entropy
Symmetric Cross-Entropy
Bootstrap
ResNet-20 / CIFAR-10 (Label-Flip)
VGG-16-BN / CIFAR-100 (Label-Flip)
MobileNet-V2 / Fashion-MNIST (Label-Flip)
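
The llm-kv-structural-reduction row in the table above lists KV bytes/token among its metrics. As a rough illustration of why MHA, GQA, and MQA differ on that axis, the sketch below counts the key and value elements cached per token. The `KVLayout` class, the `kv_bytes_per_token` helper, and the 12-layer, 12-head, 64-dim shape are hypothetical and chosen only for illustration; how MLS-Bench itself computes the metric is not stated here, and MLA-style latent compression is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVLayout:
    """Hypothetical description of an attention layout (not an MLS-Bench type)."""
    name: str
    n_kv_heads: int  # distinct key/value heads cached per layer
    head_dim: int    # channel width of each head


def kv_bytes_per_token(layout: KVLayout, n_layers: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache stored per token, summed over all layers.

    The factor 2 covers one key vector and one value vector per KV head.
    bytes_per_elem=2 assumes a 16-bit cache dtype (an assumption).
    """
    return 2 * layout.n_kv_heads * layout.head_dim * n_layers * bytes_per_elem


# Illustrative shape only: 12 layers, 12 query heads, head_dim 64.
N_LAYERS, N_HEADS, HEAD_DIM = 12, 12, 64
layouts = [
    KVLayout("MHA", n_kv_heads=N_HEADS, head_dim=HEAD_DIM),  # one KV head per query head
    KVLayout("GQA", n_kv_heads=4, head_dim=HEAD_DIM),        # 4 KV heads, 3 query heads each
    KVLayout("MQA", n_kv_heads=1, head_dim=HEAD_DIM),        # all query heads share one KV head
]

kv_bytes = {layout.name: kv_bytes_per_token(layout, N_LAYERS) for layout in layouts}
for name, n_bytes in kv_bytes.items():
    print(f"{name}: {n_bytes} bytes/token")
```

Under these assumptions the cache shrinks in proportion to the number of KV heads, which is the cost that the task weighs against validation and held-out loss.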

Citation

@misc{lyu2026mlsbenchholisticrigorousassessment,
      title={MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI},
      author={Bohan Lyu and Yucheng Yang and Siqiao Huang and Jiaru Zhang and Qixin Xu and Xinghan Li and Xinyang Han and Yicheng Zhang and Huaqing Zhang and Runhan Huang and Kaicheng Yang and Zitao Chen and Wentao Guo and Junlin Yang and Xinyue Ai and Wenhao Chai and Yadi Cao and Ziran Yang and Kun Wang and Dapeng Jiang and Huan-ang Gao and Shange Tang and Chengshuai Shi and Simon S. Du and Max Simchowitz and Jiantao Jiao and Dawn Song and Chi Jin},
      year={2026},
      eprint={2605.08678},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.08678},
}
