GitHub - Imbernoulli/MLS-Bench

MLS-Bench is a benchmark for machine learning science. Where most agent benchmarks reward engineering one fixed instance — clean the data, tune the pipeline, climb a leaderboard — MLS-Bench asks the harder question: can an AI agent propose a new component, loss, optimizer, or training procedure whose gain transfers across settings, seeds, datasets, and scales?

The benchmark contains 140 tasks across 12 ML research domains. Each task fixes a research scaffold, gives the agent the relevant source code and strong baseline implementations, then asks for one algorithmic change inside a constrained edit surface.

News

2026.6 — More efficient on larger GPUs: a new compute_scale option lets the LLM pretraining and reinforcement-learning tasks — and, optionally, the other tasks — run more efficiently on H200-class GPUs without changing results. See issue #4 and PR #9 for the design.
2026.5 — Harbor support: official Harbor-compatible runtime and pre-rendered task images on Docker Hub under bohanlyu2022/mlsbench-harbor-*. See harbor/README.md.
2026.5 — Stronger Sparse L0 Adversarial Attack task: upgraded to the canonical Sparse-RS L0 threat model (k=24, untargeted) against three adversarially-robust RobustBench L2 CIFAR-10 targets (Rebuffi-R18 / Augustin / Engstrom). Strong attacks no longer trivially saturate, leaving real headroom to measure genuine attack improvements.
2026.5 — Scoring: the main results table in the arXiv paper previously aggregated tasks within each area by geometric mean; switched to arithmetic mean for easier comparison with the per-task numbers. Rankings are unchanged and no conclusions are affected.
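
To make the aggregation change concrete, the sketch below compares the two means on a small set of made-up per-task scores. The scores and function names are illustrative assumptions; they are not values from the paper or functions from the mlsbench package.

```python
import math
from statistics import fmean

def arithmetic_mean(scores):
    """Plain average of per-task scores."""
    return fmean(scores)

def geometric_mean(scores):
    """Geometric mean, computed in log space; needs strictly positive scores."""
    if any(s <= 0 for s in scores):
        raise ValueError("geometric mean needs strictly positive scores")
    return math.exp(fmean(math.log(s) for s in scores))

# Hypothetical normalized scores for three tasks in one area.
area_scores = [0.5, 1.0, 2.0]
print(f"arithmetic: {arithmetic_mean(area_scores):.4f}")
print(f"geometric:  {geometric_mean(area_scores):.4f}")
```

For positive scores the arithmetic mean is never smaller than the geometric mean, so area-level numbers can shift upward under the new aggregation even when the per-task numbers are identical.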

Installation

pip install -e ".[agent]"

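Python 3.10+ is required. MLS-Bench separates the choice of runtime backend from the choice of job scheduler. The two can be combined freely, except that the local Conda backend should not be paired with SLURM (see the Conda section below). The options are: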

Runtime backends: Docker, Apptainer, or local Conda — selected in your config file via container_runtime.
Job schedulers: SLURM (when a slurm: section is present in the config) or the built-in single-node GPU scheduler.

container_runtime: docker      # docker, apptainer, or local

Recommended setup: Docker or Apptainer for the runtime, with SLURM as the job scheduler. If SLURM is unavailable, the built-in scheduler can be combined with any of the three runtimes. If neither a container runtime nor SLURM is available, the local Conda backend together with the built-in scheduler provides a complete fallback (see the section below).
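
The selection rules above can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions: resolve_backends is a hypothetical helper rather than an mlsbench function, the config is assumed to be already parsed into a dict, and rejecting a missing container_runtime is a choice made here because the README does not state a default.

```python
VALID_RUNTIMES = {"docker", "apptainer", "local"}

def resolve_backends(config: dict) -> tuple[str, str]:
    """Return (runtime, scheduler) for a parsed config mapping.

    Hypothetical helper, not part of mlsbench. It mirrors the README rules:
    `container_runtime` picks the runtime, and the presence of a `slurm`
    section picks SLURM over the built-in single-node scheduler.
    """
    runtime = config.get("container_runtime")
    if runtime not in VALID_RUNTIMES:
        raise ValueError(f"container_runtime must be one of {sorted(VALID_RUNTIMES)}")
    scheduler = "slurm" if "slurm" in config else "builtin"
    if runtime == "local" and scheduler == "slurm":
        # The README advises against this pairing: both would schedule GPU jobs.
        raise ValueError("local Conda runtime should not be combined with SLURM")
    return runtime, scheduler

print(resolve_backends({"container_runtime": "docker", "slurm": {}}))
print(resolve_backends({"container_runtime": "local"}))
```

Note that the check is on the presence of the slurm key, not on its contents, which matches the wording "when a slurm: section is present".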

Running with local Conda environments and the built-in scheduler

When neither Docker nor Apptainer is available, MLS-Bench can build a dedicated Conda environment per package and dispatch jobs through a single-node GPU queue (src/mlsbench/scheduler.py). This backend is intended for development and small-scale experimentation; for full-scale benchmarking on a cluster we recommend SLURM with one of the container runtimes instead. The Conda backend should not be combined with SLURM, since both attempt to schedule GPU jobs.

Use a config with container_runtime: local and no slurm: section. Throughout this section we refer to it as configs/local.yaml.

Build the environment for each package:

mlsbench build <package> --config configs/local.yaml

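Start the GPU scheduler. The shell opens the log files before Python starts, so the .scheduler/ directories must already exist; create them first if they do not:

mkdir -p .scheduler/logs
nohup python -m mlsbench.scheduler start \
  --gpus 0,1,2,3 \
  --config configs/local.yaml \
  > .scheduler/scheduler.log 2>&1 &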

Launch agents or baselines. They enqueue jobs to the scheduler and return immediately:

PYTHONPATH=src nohup python3 -m mlsbench agent <task> --model <model> \
  --config configs/local.yaml \
  > .scheduler/logs/agent_<task>.log 2>&1 &

Inspect or manage the queue:

python -m mlsbench.scheduler status
python -m mlsbench.scheduler list
python -m mlsbench.scheduler cancel <job_id>
python -m mlsbench.scheduler clear

To rebuild a package's environment from scratch, remove it with conda env remove -n mlsbench-<package> and re-run mlsbench build.
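
The enqueue-and-return behavior described in this section can be pictured with a toy in-memory queue. This is an illustration only and does not reproduce src/mlsbench/scheduler.py: the class name, the FIFO policy, and the one-GPU-per-job rule are all assumptions made for the sketch.

```python
from collections import deque
from itertools import count

class ToyGpuQueue:
    """In-memory FIFO queue that hands one free GPU to each job (illustration only)."""

    def __init__(self, gpus):
        self.free = list(gpus)   # GPU ids not currently in use
        self.pending = deque()   # job ids waiting for a GPU, oldest first
        self.running = {}        # job id -> GPU id
        self._ids = count(1)

    def submit(self):
        """Enqueue a job and return its id immediately, without waiting for a GPU."""
        job_id = next(self._ids)
        self.pending.append(job_id)
        return job_id

    def dispatch(self):
        """Start pending jobs in FIFO order while free GPUs remain."""
        while self.pending and self.free:
            self.running[self.pending.popleft()] = self.free.pop(0)

    def finish(self, job_id):
        """Mark a running job as done and release its GPU."""
        self.free.append(self.running.pop(job_id))

    def cancel(self, job_id):
        """Drop a job that is still waiting; return True if it was removed."""
        if job_id in self.pending:
            self.pending.remove(job_id)
            return True
        return False

queue = ToyGpuQueue(gpus=[0, 1])
jobs = [queue.submit() for _ in range(3)]  # three jobs, two GPUs
queue.dispatch()
print("running:", queue.running, "pending:", list(queue.pending))
```

With two GPUs and three submitted jobs, the third job stays pending until one of the first two finishes and a later dispatch pass picks it up.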

API Keys

Running an agent requires an API key for the model provider you choose. If you enable the optional web-search tool, a Tavily key is also required. Configure keys in either of two equivalent ways:

1. Inline in your config file under the providers: block — useful when you want to keep separate configs per environment or per project:

providers:
  openai:
    api_key: "sk-..."
  anthropic:
    api_key: "sk-ant-..."
  openrouter:
    api_key: "sk-or-..."
    base_url: "https://openrouter.ai/api/v1"
  deepseek:
    api_key: "sk-..."
    base_url: "https://api.deepseek.com/v1"
  tavily:
    api_key: "tvly-..."     # only needed if the web_search tool is enabled

2. Environment variables — leave the api_key field empty (or omit the provider entirely) and the CLI falls back to the standard env var for that provider:

Provider	Env var
OpenAI	`OPENAI_API_KEY`
Anthropic	`ANTHROPIC_API_KEY`
OpenRouter	`OPENROUTER_API_KEY_NEW`
DeepSeek	`DEEPSEEK_API_KEY`
Qwen / DashScope	`QWEN_API_KEY` / `DASHSCOPE_API_KEY`
Gemini / Google	`GEMINI_API_KEY` / `GOOGLE_API_KEY`
Kimi / Moonshot	`KIMI_API_KEY` / `MOONSHOT_API_KEY`
GLM	`GLM_API_KEY`
MiniMax	`MINIMAX_API_KEY`
Tavily (web search)	`TAVILY_API_KEY`

You can also use ${ENV_VAR} interpolation inside the YAML (api_key: "${OPENAI_API_KEY}") when you want a tracked config file that still resolves the secret from the environment at runtime.
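
The lookup order described above (inline value, then ${ENV_VAR} interpolation, then the provider's standard environment variable) can be sketched as follows. resolve_api_key is a hypothetical helper, not the CLI's actual code. Two details are assumptions: when a provider lists two variable names, they are tried in the order shown in the table, and an unset variable inside ${...} expands to an empty string.

```python
import os
import re

# Subset of the provider -> env var table above; extend as needed.
ENV_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "deepseek": ["DEEPSEEK_API_KEY"],
    "qwen": ["QWEN_API_KEY", "DASHSCOPE_API_KEY"],
    "tavily": ["TAVILY_API_KEY"],
}

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def resolve_api_key(provider, providers_cfg, environ=os.environ):
    """Return the key for `provider`, or None if nothing is configured.

    Hypothetical helper: the inline value wins (with ${VAR} expanded from
    `environ`); otherwise the provider's standard env vars are tried in order.
    """
    inline = (providers_cfg.get(provider) or {}).get("api_key") or ""
    expanded = _PLACEHOLDER.sub(lambda m: environ.get(m.group(1), ""), inline)
    if expanded:
        return expanded
    for name in ENV_VARS.get(provider, []):
        if environ.get(name):
            return environ[name]
    return None

# Example with an explicit environment mapping instead of the real one.
fake_env = {"OPENAI_API_KEY": "sk-from-env"}
cfg = {"openai": {"api_key": "${OPENAI_API_KEY}"}}
print(resolve_api_key("openai", cfg, fake_env))
```

Passing the environment as an argument keeps the function easy to test without touching real secrets.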

The model string passed to mlsbench agent --model <name> selects the provider automatically:

Bare names are dispatched by their well-known prefix: claude-* → providers.anthropic, gpt-* / o1 / o3 / o4 → providers.openai, deepseek-* → providers.deepseek, qwen-* → providers.qwen, gemini-* → providers.gemini, kimi-* / moonshot-* → providers.kimi, glm-* → providers.glm, minimax-* → providers.minimax.
Prefixed names (<provider>/<model>, e.g. openai/gpt-5.4, vertex_ai/..., openrouter/anthropic/claude-opus-4.6) dispatch generically to the matching providers.<provider> entry. Point that entry's base_url at whichever upstream you want — direct API, OpenRouter, a LiteLLM proxy, etc. — and the same key is reused.
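
The two dispatch rules above can be sketched in a few lines. resolve_provider is a hypothetical function written for illustration, not the one in mlsbench. The prefix table is transcribed from the list above; treating o1 / o3 / o4 as prefixes (so that a name such as o3-mini would also match) is an assumption, since the README does not say whether those are exact names.

```python
# Prefix table transcribed from the README text above.
BARE_PREFIXES = [
    ("claude-", "anthropic"),
    ("gpt-", "openai"),
    ("deepseek-", "deepseek"),
    ("qwen-", "qwen"),
    ("gemini-", "gemini"),
    ("kimi-", "kimi"),
    ("moonshot-", "kimi"),
    ("glm-", "glm"),
    ("minimax-", "minimax"),
]
# Assumption: the o-series names are matched as prefixes.
OPENAI_O_SERIES = ("o1", "o3", "o4")

def resolve_provider(model: str) -> str:
    """Return the name of the `providers.<name>` entry for a model string."""
    if "/" in model:
        # Prefixed form: everything before the first slash names the provider.
        return model.split("/", 1)[0]
    for prefix, provider in BARE_PREFIXES:
        if model.startswith(prefix):
            return provider
    if model.startswith(OPENAI_O_SERIES):
        return "openai"
    raise ValueError(f"cannot infer a provider for model {model!r}")

for name in ["claude-opus-4.6", "openai/gpt-5.4", "openrouter/anthropic/claude-opus-4.6"]:
    print(name, "->", resolve_provider(name))
```

Checking for a slash first means that openrouter/anthropic/claude-opus-4.6 resolves to the openrouter entry rather than to anthropic, which matches the generic dispatch described for prefixed names.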

Quick Start

Fetch external packages and build the runtime (data dependencies are prepared automatically as part of the build):

mlsbench fetch --name <package>
mlsbench build <package> --config configs/react.yaml

Run an agent and compute its task score:

mlsbench agent <task> --model <model> --config configs/react.yaml
mlsbench score task <task>

Baseline scores are already populated in each task's leaderboard.csv, so running an agent alone is sufficient to obtain its normalized score under the MLS-Bench evaluation framework. Before launching the agent, however, we recommend running one baseline to confirm that your environment is set up correctly:

mlsbench baseline <task> --name <baseline> --config configs/react.yaml

Baselines and agents share the same task scripts, parsers, seeds, resource limits, and leaderboard code; only the source of the edits differs.

Prebuilt Container Images

To avoid building each package from source, prebuilt images are published for every supported package:

Docker Hub: bohanlyu2022/mlsbench-<pkg>:latest — https://hub.docker.com/u/bohanlyu2022
Hugging Face (Apptainer SIFs): sif/<Pkg>.sif inside the Bohan22/MLS-Bench-Tasks dataset

mlsbench agent, mlsbench baseline, and mlsbench build automatically pull the prebuilt image when the local image is missing, and fall back to building from source on failure. mlsbench run performs the same lookup but does not build from source; run mlsbench build <pkg> first if a local build is required.
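
As a compact restatement of the paragraph above, the sketch below lists what each command would try when the local image is missing. Both functions are hypothetical summaries, not mlsbench code, and the package name is assumed to be substituted into the Docker Hub pattern as given, without any normalization.

```python
def steps_when_image_missing(command: str) -> list[str]:
    """Ordered attempts to obtain a package image when no local image exists.

    Hypothetical summary of the README text, not the mlsbench implementation.
    """
    if command in {"agent", "baseline", "build"}:
        # Pull the prebuilt image first; build from source only if that fails.
        return ["pull-prebuilt", "build-from-source"]
    if command == "run":
        # `mlsbench run` looks up the prebuilt image but never builds.
        return ["pull-prebuilt"]
    raise ValueError(f"unknown command: {command!r}")

def docker_image_name(pkg: str) -> str:
    """Docker Hub reference following the pattern listed above."""
    return f"bohanlyu2022/mlsbench-{pkg}:latest"

for cmd in ["agent", "baseline", "build", "run"]:
    print(cmd, steps_when_image_missing(cmd))
print(docker_image_name("example-pkg"))  # placeholder package name
```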

Two mutually-exclusive flags force a specific source for mlsbench build:

mlsbench build <package> --pull          # use only the prebuilt image
mlsbench build <package> --local-build   # build locally from the Dockerfile / .def

For Apptainer, the SIF can be obtained either via apptainer pull docker://... (default) or from the Hugging Face mirror — a direct HTTPS download of sif/<Pkg>.sif, which can be faster in networks where Docker registries are slow. Select the source with --sif-source {docker,hf,auto} on mlsbench build.
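
For readers who want to see how flags with these constraints are typically declared, here is a minimal argparse sketch. It is not the real mlsbench parser: the help strings are paraphrased, other options such as --config are omitted, and docker as the --sif-source default is inferred from the "(default)" note above.

```python
import argparse

def make_build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for the build flags described above (not mlsbench's own)."""
    parser = argparse.ArgumentParser(prog="mlsbench build")
    parser.add_argument("package", help="package to build")

    # argparse rejects the command line if both flags in this group are given.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--pull", action="store_true",
                        help="use only the prebuilt image")
    source.add_argument("--local-build", action="store_true",
                        help="build locally from the Dockerfile / .def")

    # Assumption: docker is the default source, per the README's "(default)" note.
    parser.add_argument("--sif-source", choices=["docker", "hf", "auto"],
                        default="docker",
                        help="where to obtain the Apptainer SIF")
    return parser

# "example-pkg" is a placeholder package name.
args = make_build_parser().parse_args(["example-pkg", "--pull", "--sif-source", "hf"])
print(args.package, args.pull, args.local_build, args.sif_source)
```

Passing both --pull and --local-build to this parser makes argparse print a usage error and exit, which is the usual way to enforce mutually exclusive flags.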

Running under Harbor

MLS-Bench's 140 tasks are also available as a Harbor dataset so any Harbor-supported agent (claude-code, codex, openhands, terminus-2, …) can be evaluated on the suite without going through this repository's own runner:

PYTHONPATH=. harbor run -c run.yaml -a claude-code -m anthropic/claude-opus-4-7

The pre-rendered dataset, GPU-capable environment plugin, and reference Harbor config live under harbor/. See harbor/README.md for usage details and the self-contained per-task layout.

Repository Map

src/mlsbench/                  CLI, agent loop, execution backends, scoring
tasks/<task>/                  140 task definitions, parsers, scores, baselines
vendor/packages.yaml           External package registry
vendor/pkg_configs/<package>/  Package runtime configs and pre-edit patches
vendor/data_scripts/           Dataset and model-cache preparation scripts
configs/react.yaml             Runtime and provider configuration
configs/openevolve.yaml        OpenEvolve defaults
configs/discover.yaml          Discover defaults
harbor/                        Pre-rendered Harbor dataset (140 tasks) + run config

Fetched upstream repositories, built images, downloaded datasets, run workspaces, logs, and scheduler state are intentionally not versioned.

Full Task Catalog


Area	Directory shorthand	Task	Research question	External package(s)	Baselines	Evaluation settings
LM	agent-tool-reasoning	LLM Agent Tool-Use Reasoning Strategy	Studies how tool-use search, backtracking, and stopping policies affect answer validity and query efficiency.	zhichengg/StableToolBench	Greedy Chain (CoT) DFS with LLM Ranking DFSDT	StableToolBench I1-instruction 50q / deepseek-chat StableToolBench I1-instruction 50q / qwen2.5-72b-instruct StableToolBench I1-instruction 50q / qwen2.5-7b-instruct
LM	llm-dllm-demask-strategy	Masked Diffusion LM: Demasking Strategy	Studies how demasking schedules, position selection, and token assignment affect diffusion language-model quality and decoding efficiency.	ML-GSAI/LLaDA	Top-K Margin Confidence Greedy KLASS	LLaDA / MATH-500 LLaDA / HumanEval Dream / C4 prefix continuation
LM	llm-pretrain-attention	Autoregressive Attention Mechanism	Studies how self-attention computation and positional handling affect autoregressive pretraining loss and downstream accuracy.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	QK-Norm RoPE RoPE + QK-Norm	ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM	llm-pretrain-bitlinear	Low-Bit Linear Pretraining Layer	Studies how low-bit linear layers and quantization functions affect pretraining loss under discrete weight constraints.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	Binary Sign (BitNet) Ternary 1.58-bit (BitNet b1.58) INT2 Uniform	ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM	llm-pretrain-embedding	Autoregressive Embedding Strategy	Studies how token embeddings, position embeddings, value embeddings, and weight tying affect autoregressive pretraining loss and downstream accuracy.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	Untied Embeddings Value Embeddings Bigram Hash Embeddings	ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM	llm-pretrain-linear-attention	Subquadratic Attention Mechanism	Studies whether linear or subquadratic attention can reduce autoregressive validation loss while preserving downstream performance.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	RetNet DeltaNet GLA	ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM	llm-pretrain-loss	Autoregressive Pretraining Loss	Studies how alternative next-token training losses affect autoregressive validation cross-entropy.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	Label Smoothing Softcap Cross-Entropy Z-Loss	ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM	llm-pretrain-lr-schedule	Pretraining Learning-Rate Schedule	Studies how warmup, decay shape, and schedule horizon affect autoregressive pretraining validation loss.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	WSD (Warmup-Stable-Decay) Trapezoidal WSD with Inverse-Sqrt Decay	ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM	llm-pretrain-mlp	Transformer Feed-Forward Block	Studies how activation, gating, and expansion choices in the feed-forward sublayer affect language-model validation loss.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	ReLU-Squared SwiGLU GeGLU	ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM	llm-pretrain-normalization	Normalization and Block Layout	Studies how normalization placement, affine behavior, and transformer block layout affect pretraining stability and validation loss.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	RMSNorm RMSNorm + Sandwich-Norm RMSNorm (Parallel Block)	ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM	llm-pretrain-optimizer	Pretraining Optimizer Design	Studies how optimizer choice, parameter grouping, and schedule coupling affect autoregressive pretraining validation loss.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	AdamW + Nesterov Lion Muon	ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM	llm-pretrain-residual	Transformer Residual Stream Strategy	Studies how residual connections and information flow across transformer layers affect validation loss, perplexity, and accuracy metrics.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	Vanilla (Pre-LN) ProRes Learned Scaling Block Attention Residuals	ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
LM	llm-rl-advantage	Reasoning RL Advantage Estimation	Studies how advantage estimates for online language-model reinforcement learning affect mathematical reasoning accuracy.	volcengine/verl	GRPO Dr. GRPO Reinforce++ Baseline	GSM8K MATH-500 AMC
LM	llm-rl-importance-sampling	Reasoning RL Importance-Sampling Granularity	Studies how importance-sampling ratio granularity and clipping affect online language-model reinforcement learning for reasoning.	volcengine/verl	Token-Level (Vanilla PPO) Sequence-Level (GSPO) First-K Tokens	GSM8K MATH-500 AMC
LM	llm-rl-kl-estimator	Actor Divergence Estimator for Reasoning RL	Studies how per-token actor KL estimation controls reference-policy drift while preserving reasoning accuracy during online RL.	volcengine/verl	K1 (Unbiased Log-Ratio) K2 (Squared Log-Ratio) K3 (Low-Variance KL) Absolute Log-Ratio	GSM8K MATH-500 AMC
LM	llm-rl-reward-normalization	Pre-Advantage Reward Normalization	Studies how reward normalization before advantage estimation affects reasoning accuracy in online language-model RL.	volcengine/verl	Outcome-Only (Raw) Group-Std Normalization Batch-Std Whitening Length-Aware Normalization	GSM8K MATH-500 AMC
LM	llm-scaling-law-discovery	Symbolic Scaling-Law Discovery	Studies how symbolic functional forms and group-specific coefficients capture held-out scaling behavior.	trevorstephens/gplearn	Human Exact Form SLDAgent-Style Kernel Ridge Regression XGBoost	SLDBench Vocabulary Scaling SLDBench LR x Batch-Size Scaling SLDBench Data-Constrained Scaling
LM	mas-topology	Language-Agent Collaboration Topology	Studies how deterministic collaboration topology affects multi-agent code-generation quality and execution success.	OpenBMB/ChatDev	Chain Star Layered	HumanEval-33 (deepseek-chat, 4 agents) HumanEval-33 (qwen2.5-72b-instruct, 4 agents) SRDD-20 (deepseek-chat, 4 agents)
Rob	jepa-planning	Latent World-Model Planner	Studies how goal-conditioned planning should exploit a fixed latent world model to improve navigation success.	facebookresearch/eb_jepa	Random CEM MPPI iCEM	Two Rooms (Horizon 30) Two Rooms (Horizon 60) Two Rooms (Horizon 90)
Rob	jepa-prediction-loss	Temporal Latent Prediction Loss	Studies how latent prediction objectives affect multi-step video representation quality.	facebookresearch/eb_jepa	MSE Smooth L1 Cosine	Moving MNIST AP (small: henc=16, dstc=8, hpre=16) Moving MNIST AP (base: henc=32, dstc=16, hpre=32) Moving MNIST AP (large: henc=64, dstc=32, hpre=64)
Rob	jepa-regularizer	Anti-Collapse Representation Regularizer	Studies how self-supervised regularization prevents representation collapse and improves linear-probe accuracy.	facebookresearch/eb_jepa	Naive VICReg SigReg Barlow Twins	ResNet-18 Probe ResNet-34 Probe ResNet-50 Probe
Rob	robo-diffusion-guidance	Diffusion Guidance for Robot Trajectory Planning	Studies guidance mechanisms for a fixed trajectory-level diffusion planner on D4RL MuJoCo, optimizing normalized score across hopper-medium-v2, walker2d-medium-v2, and halfcheetah-medium-v2.	CleanDiffuserTeam/CleanDiffuser	Diffuser (Classifier Guidance) Classifier-Free Guidance No Guidance Decision Diffuser	D4RL Hopper-Medium-v2 D4RL Walker2d-Medium-v2 D4RL HalfCheetah-Medium-v2
Rob	robo-diffusion-policy	Diffusion Policy Learning for Robot Control	Studies how diffusion policy training, value guidance, and action generation affect robot-control episode reward.	CleanDiffuserTeam/CleanDiffuser	DQL (Diffusion Q-Learning) IDQL Diffusion Policy	D4RL Hopper-Medium-v2 D4RL Walker2d-Medium-v2 D4RL HalfCheetah-Medium-v2
Rob	robo-diffusion-sampling-method	Efficient Diffusion Sampling for Robot Actions	Studies how solver choice and sampling_steps affect DQL-style diffusion-policy normalized score at low NFE on D4RL MuJoCo.	CleanDiffuserTeam/CleanDiffuser	DDPM (100-Step Ancestral Sampling) DDIM (20-Step Deterministic Sampling) DPM-Solver++ 2M (10-Step)	D4RL Hopper-Medium-v2 D4RL Walker2d-Medium-v2 D4RL HalfCheetah-Medium-v2
Rob	robo-humanoid-sim2real-algo	Humanoid Transfer Policy Learning	Studies how actor-critic architecture, policy optimization, and rollout processing affect humanoid command-following transfer.	roboterax/humanoid-gym	Default PPO PPO with Adaptive KL PPO with LayerNorm	RobotEra XBot-L Training RobotEra XBot-L / Diverse Commands RobotEra XBot-L / Forward-Only RobotEra XBot-L / High Speed
Rob	robomimic-bc-loss	Behavioral Cloning Loss for Manipulation	Studies how imitation-learning loss design affects rollout success for low-dimensional robot manipulation tasks.	ARISE-Initiative/robomimic	NLL with Entropy Weighted NLL Default (NLL)	Tool Hang (PH) Can (PH) Square (PH)
Rob	robomimic-iql-vf	Offline Value Loss for Manipulation	Studies how asymmetric value regression loss design affects offline robot manipulation policy success.	ARISE-Initiative/robomimic	Quantile Regression Huber Pinball Default (Expectile)	Tool Hang (PH) Can (PH) Square (PH)
Rob	robomimic-obs-encoder	Observation Fusion Encoder for Imitation Learning	Designs a multimodal robot state encoder for behavioral cloning to improve rollout success rate on manipulation tasks.	ARISE-Initiative/robomimic	Attention Fusion Gated Fusion Default (Concatenation)	Tool Hang (PH) Can (PH) Square (PH)
Rob	tdmpc2-planning	Trajectory Optimization for Model-Based Planning	An online planning algorithm selects actions through learned-world-model trajectory optimization to improve episode reward.	nicklashansen/tdmpc2	CEM iCEM MPPI	Walker Walk Cheetah Run Cartpole Swingup
Rob	tdmpc2-simnorm	Latent Representation Normalization for Model-Based RL	Designs latent-state normalization for the TD-MPC2 encoder and dynamics world-model networks, evaluated by DMControl episode reward.	nicklashansen/tdmpc2	SimNorm L2 normalization RMSNorm Identity (no normalization)	DMControl walker-walk DMControl cheetah-run DMControl cartpole-swingup
V&G	cv-3dgs-densification	3D Gaussian Splatting Densification Strategy Design	Designs a 3D Gaussian Splatting densification strategy controlling clone, split, prune, reset, relocation, and sample-add behavior to improve held-out novel-view quality on Mip-NeRF 360 scenes.	nerfstudio-project/gsplat	Original 3DGS densification AbsGS + Taming-3DGS + New Split EDC-TamingGS-Abs	Mip-NeRF 360 garden (8x, best PSNR) Mip-NeRF 360 bicycle (8x, best PSNR) Mip-NeRF 360 bonsai (8x, best PSNR) Mip-NeRF 360 stump (8x, best PSNR)
V&G	cv-3dgs-regularizer	3D Gaussian Splatting Regularizer Design	Designs a scalar regularizer added to the 3DGS photometric loss during 30k-step Mip-NeRF 360 reconstruction, evaluated on held-out novel views and scored by best PSNR.	nerfstudio-project/gsplat	No regularization Scale + opacity L1 Effective-rank + scale/opacity L1	Mip-NeRF 360 garden (8x, best PSNR) Mip-NeRF 360 bicycle (8x, best PSNR) Mip-NeRF 360 bonsai (8x, best PSNR) Mip-NeRF 360 stump (8x, best PSNR)
V&G	cv-dbm-sampler	Custom Sampler for Diffusion Bridge Models	Designs a low-NFE sampler for Diffusion Bridge Models on image-to-image translation, ImageNet center-inpainting, and DIODE depth, evaluated by FID at NFE=5.	thu-ml/DiffusionBridge	DBIM DBIM-HO (high-order) DDBM (50 NFE reference) ECSI	Edges2Handbags / e2h (FID, NFE=5) ImageNet center-inpaint (FID, NFE=5) DIODE depth (FID, NFE=5)
V&G	cv-dbm-scheduler	Time Scheduler for Diffusion Bridge Models (NFE=5)	Designs a monotone low-step time schedule for Diffusion Bridge Models, evaluated by FID on Edges2Handbags, ImageNet center-inpainting, and DIODE depth at NFE=5.	thu-ml/DiffusionBridge	Karras EDM (rho=7) Uniform (linear) Cosine (Nichol-Dhariwal) Log-linear (geometric)	Edges2Handbags / e2h (FID, NFE=5) ImageNet center-inpaint (FID, NFE=5) DIODE depth (FID, NFE=5)
V&G	cv-diffusion-architecture	Diffusion Model Architecture Design	Design a denoising UNet backbone for unconditional CIFAR-10 DDPM training, optimizing best FID with fixed epsilon prediction and 50-step DDIM sampling.	huggingface/diffusers	Standard DDPM U-Net Full-Attention U-Net No-Attention U-Net	CIFAR-10 DDPM Small CIFAR-10 DDPM Medium CIFAR-10 DDPM Large
V&G	cv-diffusion-cfg	Diffusion Model: Classifier-Free Guidance Optimization	Design a classifier-free guidance method for Stable Diffusion text-to-image generation across SD v1.5, Stable Diffusion 2 Base, and Stable Diffusion XL; evaluation generates COCO-caption images and official scoring uses per-model FID.	CFGpp-diffusion/CFGpp	Standard CFG CFG++ Zero-Init CFG++	Stable Diffusion v1.5 / COCO captions / NFE=10 Stable Diffusion 2 Base / COCO captions / NFE=10 Stable Diffusion XL Base 1.0 / COCO captions / NFE=10
V&G	cv-diffusion-conditioning	Class-Conditional Diffusion: Conditioning Injection Methods	Design class-conditioning injection for a CIFAR-10 class-conditional UNet2DModel/DDPM, optimizing best FID with 50-step DDIM sampling.	huggingface/diffusers	Concat-FiLM Cross-Attention AdaLN-Zero	CIFAR-10 Class-Conditional Small UNet2DModel CIFAR-10 Class-Conditional Medium UNet2DModel CIFAR-10 Class-Conditional Large UNet2DModel
V&G	cv-diffusion-efficiency	Diffusion Model: Sampler Efficiency Optimization	Design a Stable Diffusion sampler update rule for COCO-caption text-to-image generation at a fixed NFE=20 budget; official scoring uses per-model FID.	CFGpp-diffusion/CFGpp	DDIM DPM++ 3M DPM++ 2S	Stable Diffusion v1.5 / COCO captions / NFE=20 Stable Diffusion 2 Base / COCO captions / NFE=20 Stable Diffusion XL Base 1.0 / COCO captions / NFE=20
V&G	cv-diffusion-prediction	Diffusion Prediction Parameterization	Design a prediction target and consistent x0 inversion for unconditional CIFAR-10 UNet2DModel diffusion, optimizing best FID with 50-step DDIM sampling.	huggingface/diffusers	Epsilon Prediction V-Prediction X0 Prediction	CIFAR-10 Unconditional Small UNet2DModel CIFAR-10 Unconditional Medium UNet2DModel CIFAR-10 Unconditional Large UNet2DModel
V&G	cv-meanflow-perceptual-loss	Flow Map with Perceptual Loss	Studies whether auxiliary perceptual losses on denoised images improve CIFAR-10 FID for MeanFlow flow-map training with DiT backbones.	snap-research/alphaflow	Pure MSE Velocity MSE + Charbonnier + LPIPS + Gradient + Multiscale MSE + LPIPS + Gradient + Multiscale + FFT	CIFAR-10 Small DiT CIFAR-10 Medium DiT CIFAR-10 Large DiT
V&G	cv-vae-loss	VAE Loss Function Design for Image Reconstruction	Studies how VAE loss components affect CIFAR-10 AutoencoderKL reconstruction quality, scored primarily by rFID on the full test set.	huggingface/diffusers	L1 + KL L1 + LPIPS + KL L1 + LPIPS + KL + PatchGAN	CIFAR-10 AutoencoderKL Small CIFAR-10 AutoencoderKL Medium CIFAR-10 AutoencoderKL Large
RL	marl-centralized-critic	Cooperative MARL Centralized Critic Architecture for MAPPO	Studies centralized critic architectures for MAPPO on SMACLite cooperative MARL maps, scored by greedy-policy test win rate and return.	uoe-agents/epymarl	IPPO Decentralized Critic MAPPO Centralized Critic MAT-Style Attention Critic	SMACLite MMM (10-agent heterogeneous) SMACLite 2s3z (5-agent heterogeneous) SMACLite 3s5z (8-agent heterogeneous)
RL	meta-rl	Meta-RL: Context Encoder for PEARL Task Inference	Studies PEARL context encoders that map transition tuples to latent task representations for fast adaptation, evaluated by meta_test_return after 20 meta-training iterations.	katerakelly/oyster	PEARL MLP Context Encoder PEARL Recurrent Context Encoder PEARL Attention Context Encoder	Half-Cheetah Velocity (30 train/10 test tasks) Sparse Point Robot (40 train/10 test tasks) Point Robot (40 train/10 test tasks)
RL	meta-rl-algorithm	Meta-RL Algorithm Design	Studies complete meta-RL algorithm design across task inference, policy conditioning, and meta-training, scored by meta_test_return on held-out tasks after the fixed short-budget protocol.	katerakelly/oyster	PEARL FOCAL VariBAD	Half-Cheetah Velocity (30 train/10 test tasks) Sparse Point Robot (40 train/10 test tasks) Point Robot (40 train/10 test tasks)
RL	rl-intrinsic-exploration	Intrinsic Exploration for Sparse Rewards	Studies how intrinsic rewards and advantage mixing affect exploration and return in sparse-reward Atari environments.	vwxyzjn/cleanrl	PPO RND ICM	Tutankham-v5 Frostbite-v5 PrivateEye-v5
RL	rl-offline-adroit	Offline Dexterous Manipulation from Narrow Demonstrations	Studies how offline RL algorithms learn dexterous manipulation from narrow human demonstration datasets.	corl-team/CORL	IQL AWAC ReBRAC	Pen-Human-v1 Hammer-Human-v1 Door-Cloned-v1
RL	rl-offline-continuous	Q-Overestimation Suppression for Offline Continuous Control	Studies how offline continuous-control algorithms suppress out-of-distribution Q-value overestimation.	corl-team/CORL	ReBRAC TD3-BC IQL	HalfCheetah-Medium-v2 Maze2D-Medium-v1 Walker2d-Medium-v2
RL	rl-offline-off2on	Offline-to-Online Fine-Tuning Without Forgetting	Studies how offline-to-online reinforcement learning prevents forgetting and value collapse during continued interaction.	corl-team/CORL	IQL AWAC SPOT	Pen-Cloned-v1 Hammer-Cloned-v1 Hammer-Expert-v1
RL	rl-offpolicy-continuous	Off-Policy Actor-Critic for Continuous Control	Changes off-policy actor-critic update rules, losses, or exploration strategies to improve mean episodic return on continuous-control tasks.	vwxyzjn/cleanrl	DDPG TD3 SAC	HalfCheetah-v4 Reacher-v4 Ant-v4
RL	rl-onpolicy-continuous	On-Policy Actor-Critic for Continuous Control	Changes on-policy actor-critic objectives, update rules, or exploration mechanisms to improve mean episodic return on continuous-control tasks.	vwxyzjn/cleanrl	PPO AWR PPO (KL Penalty)	HalfCheetah-v4 Swimmer-v4 InvertedDoublePendulum-v4
RL	rl-reward-learning	Inverse RL Reward Learning from Demonstrations	Studies how reward models learned from expert demonstrations affect downstream policy return in continuous-control locomotion.	HumanCompatibleAI/imitation	GAIL AIRL BC	HalfCheetah-v4 Hopper-v4 Walker2d-v4
RL	rl-value-atari	Value-Based Visual Control	Studies how value-based RL losses, update rules, and exploration strategies affect visual-control episodic return.	vwxyzjn/cleanrl	QR-DQN C51 Double-DQN	BreakoutNoFrameskip-v4 SeaquestNoFrameskip-v4 PongNoFrameskip-v4
RL	rl-value-discrete	Value-Based Discrete Control	Changes value estimation, uncertainty handling, or replay-based update rules to improve episodic return on discrete-action control tasks.	vwxyzjn/cleanrl	QR-DQN Dueling-DQN C51	CartPole-v1 LunarLander-v2 Acrobot-v1
RL	safe-rl	Constraint Handling for Safe RL	Changes Lagrangian or controller-style multiplier updates and cost-reward advantage mixing to improve reward while keeping episode cost below target.	PKU-Alignment/omnisafe	Naive PPO Lagrangian PPO PID Lagrangian	SafetyPointGoal1-v0 SafetyCarGoal1-v0 SafetyPointButton1-v0
Sys	dlm-dkv-policy	Diffusion LM KV Cache Policy	Studies how token-state refresh intervals, masks, transfer ratios, and fallbacks affect denoising quality and cache reuse.	maomaocun/dLLM-Cache	Vanilla (Uncached) dLLM-Cache d2Cache Elastic-Cache	MATH-500 HumanEval ARC-Challenge
Sys	llm-kv-adaptive-quantization	LLM KV Cache: Adaptive Quantization Policy	Studies adaptive 4-bit KV-cache quantization for instruction-tuned long-context inference, trading benchmark final-score quality against effective KV bits and compression.	huggingface/transformers	KIVI Overlap (4-bit) KVTuner-4 Per-Token KVTuner-4 KIVI SQuat Subspace (4-bit)	LongBench-E hotpotqa_e QA F1 LongBench-E passage_retrieval_en_e retrieval score LongBench-E repobench-p_e code-similarity score NeedleBench NIAH exact phrase retrieval GSM8K exact final-answer accuracy
Sys	llm-kv-selection-budgeting	LLM KV Cache Selection Budgeting	Studies how selection and eviction controllers allocate layer budgets and recent windows for quality, latency, and memory tradeoffs.	huggingface/transformers	Full Attention StreamingLLM Expected Attention LagKV	LongBench-E hotpotqa_e QA F1 LongBench-E passage_retrieval_en_e retrieval score LongBench-E repobench-p_e code-similarity score LongBench v2 train split multiple-choice accuracy GSM8K exact final-answer accuracy
Sys	llm-kv-structural-reduction	LLM Pretraining: KV-Structural Reduction	Studies GPT-style KV-state structural reduction through MHA, MQA, GQA, and MLA-style latent KV compression under fixed nanoGPT pretraining.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	MHA MQA GQA MLA	ClimbMix val loss + KV bytes/token + WikiText-2/WikiText-103/LAMBADA heldout loss HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
Sys	llm-pretrain-kernel	LLM Pretraining: Custom GPU Kernel Optimization	Studies custom/fused MLP kernels for nanoGPT pretraining while preserving ClimbMix validation, held-out perplexity, and downstream lm-eval quality.	karpathy/nanoGPT EleutherAI/lm-evaluation-harness	ReLU-Squared (Torch) Triton GELU Triton ReLU-Squared (Fused)	ClimbMix val loss + WikiText-2/LAMBADA PPL HellaSwag, ARC-Easy, PIQA, WinoGrande 0-shot accuracy
Sys	llm-ptq-algorithm	LLM Post-Training Quantization (PTQ) Algorithm	Design a post-training quantization algorithm for a pretrained LLM that minimizes WikiText-2 perplexity degradation under INT4/INT3 group quantization without retraining.	IST-DASLab/gptq	Round-to-Nearest (RTN) GPTQ AWQ	PTQ INT4 PTQ INT3 PTQ INT4 (g64)
Sys	llm-qat-algorithm	LLM Quantization-Aware Training (QAT) Algorithm	Design a quantization-aware training algorithm for a pretrained LLM that minimizes WikiText-2 perplexity after INT4/INT3/INT2 quantization at inference time.	custom	No QAT STE LSQ Finetune + PTQ	QAT INT4 QAT INT3 QAT INT2
Sys	mlsys-fused-attention	Fused Attention Kernel Design for H100 GPUs	Design an OpenAI Triton fused self-attention forward kernel for H100 GPUs that maximizes throughput (TFLOPs/s) while preserving numerical correctness.	Dao-AILab/flash-attention	FlashAttention FlashAttention-2 FlashAttention-3	Head Dim 64 / Seq 4K Head Dim 128 / Seq 8K Head Dim 256 / Seq 16K
Sys	mlsys-moe-load-balance	MoE Expert Parallelism Load Balancing	Design an efficient MoE expert-replica placement algorithm that minimizes GPU/node load imbalance while preserving inter-node locality and low runtime.	deepseek-ai/eplb	Greedy Zigzag Flat Zigzag	DeepSeek-V3 Qwen3-MoE DeepSeek-V2 Stress-Skew
Sys	mlsys-sparse-attention-inference	Long-Context Inference-Time Sparse Attention	Design an inference-time sparse attention module for a pretrained instruction-tuned causal LLM that preserves NIAH and LongBench quality under a 25% density budget without retraining.	custom	Dense StreamingLLM BigBird Block Top-K	NIAH (8K) LongBench Qasper LongBench MultiFieldQA-EN
Sci	ai4bio-mutation-effect-prediction	Mutation Fitness Predictor	Studies how mutant and wild-type protein representations can predict functional effects of sequence mutations.	OATML-Markslab/ProteinGym	Ridge Regression MLP Reshape CNN	BLAT_ECOLX ESTA_BACSU RASH_HUMAN
Sci	ai4bio-protein-inverse-folding	Backbone-to-Sequence Inverse Folding	Studies how geometric structure encoding and sequence decoding recover amino-acid sequences from protein backbones.	A4Bio/ProteinInvBench	ProteinMPNN PiFold GVP	CATH 4.2 CATH 4.3 TS50
Sci	ai4bio-protein-structure-repr	Geometric Protein Structure Encoder	Studies how local and global geometric protein representations transfer to structure-aware function prediction.	a-r-j/ProteinWorkshop	SchNet EGNN GearNet	EC GO-BP Fold
Sci	ai4sci-climate-emulation	Atmospheric Column Emulator Architecture	Studies how neural emulator architecture maps vertical atmospheric states to sub-grid physics tendencies across training budgets.	leap-stc/ClimSim	CNN Encoder-Decoder U-Net HSR	Short Budget Medium Budget Long Budget
Sci	ai4sci-inverse-diffusion-algo	Diffusion-Prior Inverse Solver	Studies how diffusion priors and measurement guidance can be combined for inverse-problem reconstruction.	devzhk/InverseBench	DPS REDDiff LGD	Inverse Scattering Black Hole Imaging Inpainting
Sci	ai4sci-mol-property-prediction	Molecular Representation Predictor	Studies how molecular graph and geometric representations improve property prediction under scaffold-based generalization.	deepmodeling/Uni-Mol	D-MPNN Uni-Mol GIN	BBBP BACE Tox21
Sci	ai4sci-pla-binding-affinity	Protein-Ligand Interaction Model	Studies how intra- and inter-molecular geometric interactions should be represented to predict binding affinity.	guaguabujianle/EHIGN_PLA	EHIGN GIGN SchNet EGNN	PDBbind 2013 PDBbind 2016 PDBbind 2019
Sci	ai4sci-vs-contrastive-scoring	Contrastive Virtual-Screening Objective	Studies how projection geometry and contrastive losses affect zero-shot protein-ligand screening quality.	jianhuiwemi/HypSeek	Vanilla CLIP HCC HCC + Hyperbolic Cone	HypSeek Training DUD-E LIT-PCBA DEKOIS 2.0
Sci	ai4sci-weather-forecast-aggregation	Weather Forecast Variable Aggregation	Studies how weather forecasting models aggregate information across heterogeneous meteorological variables for optimal prediction.	microsoft/ClimaX	Cross-Attention Mean Pooling Learned Weighted Sum	Z500 3-Day T850 5-Day 10m-Wind 7-Day
Sci	pde-design-solver	Industrial CFD Design: Custom Neural Operator Design	Designs and implements a custom neural operator for industrial aerodynamic design prediction on 3D unstructured point clouds.	thuml/Neural-Solver-Library	PointNet GraphSAGE Graph U-Net Transolver	Car Design AirfRANS Aircraft Design
Opt	optimization-bilevel	Optimization Bilevel	Studies a fixed bilevel-optimization benchmark based on Shen and Chen's penalty-based bilevel gradient descent experiments, selecting supported methods and tuning paper-style strategy hyperparameters.	hanshen95/penalized-bilevel-gradient-descent	V-PBGD G-PBGD RHG T-RHG	Toy Convergence HyperClean (Linear) HyperClean (MLP)
Opt	optimization-convex-concave	RAIN Convex-Concave	Studies gradient-norm convergence on the exact convex-concave benchmark instances used by the official RAIN bilinear and delta-function scripts.	TrueNobility303/RAIN	SEG R-SEG SEAG RAIN	Default Noise Low Noise High Noise
Opt	optimization-diagonal-net	Optimizer Design for Diagonal-Net Sparse Recovery	Designs an optimizer that recovers a sparse linear predictor from fewer training samples under a diagonal-net parameterization with noisy labels.	TrueNobility303/RAIN	SGD AdaGrad Adam Adam (Alt.)	d=200, k=5, s=0.1 d=500, k=10, s=0.1 d=500, k=10, s=0.2 d=10000, k=50
Opt	optimization-dp-sgd	Differentially Private SGD: Privacy-Utility Optimization	Design an improved DP-SGD variant that achieves higher test accuracy under the same (epsilon, delta)-differential privacy budget.	custom	Standard DP-SGD Automatic Clipping (AUTO-S) Adaptive Quantile Clipping Step-Decay Noise Schedule	MNIST Fashion-MNIST CIFAR-10
Opt	optimization-evolution-strategy	Evolutionary Optimization Strategy Design	Design a novel combination of selection, crossover, mutation operators and/or evolutionary loop for continuous black-box optimization across multiple benchmark functions.	DEAP/deap	GA (SBX) CMA-ES Differential Evolution L-SHADE	Rastrigin (30D) Rosenbrock (30D) Ackley (30D) Rastrigin (100D)
Opt	optimization-gradient-compression	Gradient Compression for Communication-Efficient Distributed Training	Design a gradient compression operator that reduces communication cost in distributed training while maintaining convergence quality.	custom	TopK Sparsification with Error Feedback QSGD (Quantized SGD) SignSGD	ResNet-20 / CIFAR-10 VGG-11-BN / CIFAR-100 ResNet-56 / CIFAR-10
Opt	optimization-hyperparameter-search	Hyperparameter Optimization: Custom Search Strategy Design	Design a custom HPO strategy that improves final validation score and convergence under limited multi-fidelity evaluation budgets.	custom	Random Search TPE Hyperband DEHB BOHB Optuna CMA-ES	XGBoost SVM Neural Net
Opt	optimization-multi-objective	Multi-Objective Optimization: Custom Evolutionary Strategy Design	Design a custom multi-objective evolutionary strategy that improves convergence, diversity, and spread on standard benchmark problems.	DEAP/deap	NSGA-II MOEA/D SPEA2 NSGA-III RVEA AGE-MOEA	ZDT1 ZDT3 DTLZ2 DTLZ1
Opt	optimization-nas	Sample-Efficient Neural Architecture Search	Design and implement a sample-efficient NAS optimizer that discovers high-performing architectures in the NAS-Bench-201 search space under a strict query budget.	automl/naslib	Random Search REA BANANAS	CIFAR-10 CIFAR-100 ImageNet16-120
Opt	optimization-online-bandit	Online Bandits: Exploration-Exploitation Strategy Design	Design and implement a bandit policy that minimizes cumulative regret across diverse multi-armed bandit settings.	SMPyBandits/SMPyBandits	UCB1 Thompson Sampling KL-UCB	Stochastic MAB Contextual Bandit Non-Stationary Bandit
Opt	optimization-pac-bayes-bound	PAC-Bayes Generalization Bound Optimization	Design a tighter PAC-Bayes generalization bound by optimizing the bound formulation, prior/posterior parameterization, and KL divergence estimation for stochastic neural networks.	mperezortiz/PBB	McAllester Catoni Quadratic	MNIST (FCN) MNIST (CNN) FashionMNIST (CNN)
Opt	optimization-parity	Optimization Parity	Improve a fixed two-layer MLP's ability to learn sparse parity by designing only its initialization, training dataset, and AdamW hyperparameters.	pytorch/examples	Default Multi-Epoch No Weight Decay	n=32, k=8 n=50, k=8 n=64, k=8
Opt	optimization-variance-reduction	Variance Reduction for Stochastic Optimization	Design an improved variance reduction strategy for stochastic gradient descent on finite-sum optimization problems.	custom	SVRG STORM STORM+	Logistic Regression MLP Ill-Conditioned
CAL	meta-fewshot-classification	Few-Shot Image Classification Method	Studies how support encoding, query comparison, and loss design affect episodic few-shot image-classification accuracy.	sicara/easy-few-shot-learning	ProtoNet MatchingNet RelationNet	Mini-ImageNet 5w-5s CIFAR-FS CUB
CAL	meta-inner-loop-optimizer	Meta-Learning Inner-Loop Optimizer	Studies how differentiable inner-loop adaptation rules affect few-shot classification accuracy in gradient-based meta-learning.	learnables/learn2learn	MAML Meta-SGD ANIL	Mini-ImageNet 5w-1s Mini-ImageNet 5w-5s CIFAR-FS 5w-5s
CAL	ml-active-learning	Pool-Based Active Learning Query Strategy	Studies how unlabeled-sample query rules affect accuracy under a fixed labeling budget.	JordanAsh/badge	BADGE BAIT BALD Least Confidence Random	Letter Spambase Splice
CAL	ml-anomaly-detection	Unsupervised Tabular Anomaly Detector	Studies how unlabeled anomaly scoring algorithms identify outliers across tabular data distributions.	custom	IF (Isolation Forest) LOF OCSVM ECOD COPOD	Cardio Thyroid Satellite Shuttle
CAL	ml-calibration	Post-Hoc Probability Calibration Mapping	Studies how post-hoc probability transforms improve classifier confidence calibration.	custom	Platt Temperature Scaling Isotonic Regression	RF / MNIST MLP / Fashion-MNIST GBM / Madelon SVM / Breast Cancer
CAL	ml-clustering-algorithm	Geometry-Robust Clustering Algorithm	Studies how clustering objectives and distance metrics handle convex blobs, non-convex moons, and high-dimensional digit data.	custom	K-Means DBSCAN HDBSCAN	Blobs Moons Digits
CAL	ml-continual-regularization	Continual Learning Importance Regularizer	Changes parameter-importance estimation and regularization loss to reduce catastrophic forgetting and improve final average accuracy across contexts.	GMvandeVen/continual-learning	EWC SI Online EWC	Split-MNIST Permuted-MNIST Split-CIFAR100
CAL	ml-dimensionality-reduction	Nonlinear 2D Structure-Preserving Embedding	Studies how nonlinear dimensionality reduction preserves neighborhood structure in low-dimensional embeddings.	custom	PCA t-SNE UMAP TriMap PaCMAP	MNIST Fashion-MNIST 20 Newsgroups
CAL	ml-ensemble-boosting	Adaptive Boosting Weight and Target Strategy	Studies how pseudo-targets, learner weights, and sample reweighting affect boosted ensemble performance.	custom	AdaBoost Gradient Boosting XGBoost-style	Breast Cancer Diabetes California Housing
CAL	ml-federated-aggregation	Heterogeneous Federated Server Aggregation	Changes server-side client selection and model aggregation to improve federated test accuracy under heterogeneous client data.	adap/flower	FedAvg FedProx SCAFFOLD	CIFAR-10 (Non-IID alpha=0.1) FEMNIST Shakespeare
CAL	ml-missing-data-imputation	Correlation-Aware Tabular Imputation	Studies how feature correlations and predictive structure guide missing-value imputation in tabular data.	custom	Mean Imputation KNN Imputation MICE MissForest GAIN	Breast Cancer Wisconsin Wine California Housing
CAL	ml-selective-deferral	Selective Deferral Under Subgroup Shift	Studies how acceptance and deferral rules trade off selective risk, subgroup robustness, and coverage on AIF360 tabular datasets.	custom	Confidence Thresholding Conformal Abstention Learned Deferral Group-wise Thresholding	Adult COMPAS Law School GPA
CAL	ml-subgroup-calibration-shift	Shift-Robust Subgroup Calibration	Studies how post-hoc calibration behaves under subgroup distribution shift and worst-group reliability constraints on AIF360 tabular datasets.	custom	Temperature Scaling Isotonic Regression Beta Calibration Group-wise Temperature Scaling	Adult COMPAS Law School GPA
CAL	ml-symbolic-regression	Genetic Programming Search for Symbolic Regression	Studies how symbolic-regression search strategies recover generalizable analytical expressions.	trevorstephens/gplearn	Standard GP Parsimony GP Lexicase GP	Nguyen-7 Nguyen-10 Koza-3
DL	cv-classification-loss	Adaptive Classification Loss	Modify the training loss over logits and labels to improve classification accuracy across image-model families.	custom	Label Smoothing Focal Loss PolyLoss	ResNet-56 / CIFAR-100 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST
DL	cv-data-augmentation	Image Augmentation Policy	Design the training transform pipeline combining geometric, photometric, and erasing operations to improve image-classification generalization.	custom	Cutout RandAugment TrivialAugmentWide	ResNet-20 / CIFAR-10 ResNet-56 / CIFAR-100 MobileNet-V2 / Fashion-MNIST
DL	cv-multitask-loss	Hierarchical Classification Loss Weighting	Studies how fine-label and coarse-label objectives should be combined to improve hierarchical image classification.	custom	Uncertainty Weighting DWA PCGrad	ResNet-20 / CIFAR-100-MT ResNet-56 / CIFAR-100-MT VGG-16-BN / CIFAR-100-MT
DL	cv-pooling-aggregation	Spatial Feature Aggregation	Studies how global spatial features should be aggregated to improve image-classification accuracy across convolutional architectures.	custom	Global Max GeM Avg + Max	ResNet-56 / CIFAR-100 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST
DL	cv-sample-weighting	Long-Tail Class Reweighting	Studies how class-count statistics should be mapped to loss weights to improve test accuracy on balanced test sets for long-tailed image classification.	custom	Inverse Frequency Class-Balanced (Effective Number) Balanced Softmax	ResNet-32 / CIFAR-10-LT ResNet-32 / CIFAR-100-LT VGG-16-BN / CIFAR-100-LT
DL	dl-activation-function	Convolutional Activation Nonlinearity	Studies how drop-in activation functions affect accuracy across convolutional image classifiers.	custom	GELU SiLU Mish	ResNet-20 / CIFAR-10 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST
DL	dl-lr-schedule	Architecture-Aware Learning-Rate Scheduling	Designs an epoch-level learning-rate curve conditioned on architecture and dataset to improve convergence and final classification accuracy.	custom	Cosine WarmupCosine OneCycle	ResNet-20 / CIFAR-10 ResNet-56 / CIFAR-100 MobileNet-V2 / Fashion-MNIST
DL	dl-normalization	Normalization Statistics and Affine Design	Studies how normalization statistics and affine behavior affect convolutional training stability and test accuracy.	custom	GroupNorm Batch-Instance Norm Switchable Norm	ResNet-56 / CIFAR-100 ResNet-110 / CIFAR-100 MobileNet-V2 / Fashion-MNIST
DL	dl-regularization	Adaptive Regularization Loss	Adds a model-, output-, input-, or epoch-dependent regularization term to improve classification generalization beyond standard weight decay.	custom	DropBlock Confidence Penalty Orthogonal Regularization	ResNet-56 / CIFAR-100 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST
DL	dl-residual-connection	Residual Block Skip Design	Studies how shortcut transformations and residual branch computation affect optimization and generalization across network depths.	custom	Pre-Activation Gated Residual Stochastic Depth	ResNet-20 / CIFAR-10 ResNet-56 / CIFAR-100 ResNet-110 / CIFAR-100
DL	dl-weight-initialization	DL Weight Initialization Strategy Design	Designs data-independent initialization for convolutional, normalization, and classifier layers to improve convergence and final accuracy.	custom	Kaiming Normal Fixup Orthogonal	ResNet-56 / CIFAR-100 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST
TS	quant-concept-drift	Concept-Drift-Aware Quantitative Forecasting	Redesigns the stock prediction model and data pipeline to handle temporal distribution shift, improving signal quality and portfolio metrics.	microsoft/qlib	TRA AdaRNN LightGBM	CSI 300 CSI 300 (Shifted) CSI 300 (Recent)
TS	quant-graph-stock	Graph-Based Quantitative Forecasting	Studies how inter-asset graph relationships affect return signal quality and portfolio performance.	microsoft/qlib	HIST GATs LightGBM	CSI 300 CSI 100 CSI 300 (Recent)
TS	quant-stock-prediction	Quantitative Return Forecasting	Studies how predictive models and input processing affect next-period return signals and portfolio performance.	microsoft/qlib	LightGBM LSTM Transformer	CSI 300 CSI 100 CSI 300 (Recent)
TS	stf-traffic-forecast	Spatial-Temporal Traffic Forecasting Model	Studies how spatial-temporal models capture sensor-network dependencies for traffic forecasting.	GestaltCogTeam/BasicTS	STID DLinear StemGNN iTransformer TimesNet SOFTS TimeMixer	METR-LA PEMS-BAY PEMS04
TS	ts-anomaly-detection	Reconstruction Model for Time-Series Anomaly Detection	Designs an unsupervised reconstruction model for detecting anomalous multivariate time-series segments, with the goal of improving F-score.	thuml/Time-Series-Library	DLinear TimesNet PatchTST	PSM MSL SMAP
TS	ts-classification	Multivariate Time-Series Classification Model	Studies how representation learning improves classification of multivariate time-series signals.	thuml/Time-Series-Library	DLinear TimesNet PatchTST	EthanolConcentration FaceDetection Handwriting
TS	ts-exogenous-forecast	Exogenous-Variable Target Forecasting Model	Studies how exogenous variables improve target-channel forecasting.	thuml/Time-Series-Library	DLinear PatchTST iTransformer TimeXer	ETTh1 Weather ECL
TS	ts-imputation	Masked Multivariate Time-Series Imputation	Studies how imputation models reconstruct missing regions in multivariate time series.	thuml/Time-Series-Library	DLinear TimesNet PatchTST	ETTh1 (25% missing) Weather (25% missing) ECL (25% missing)
TS	ts-long-term-forecast	Multivariate Long-Horizon Forecasting Model	Studies how long-horizon forecasting models predict future multivariate sequences.	thuml/Time-Series-Library	DLinear PatchTST iTransformer TimeMixer TimeXer	ETTh1 Weather ECL
TS	ts-short-term-forecast	Univariate Short-Horizon Forecasting Model	Studies how short-horizon forecasting models predict seasonal univariate series.	thuml/Time-Series-Library	DLinear TimesNet PatchTST TimeMixer	M4 Monthly M4 Quarterly M4 Yearly
SCR	causal-discovery-discrete	Discrete Causal Graph Discovery	Studies how causal discovery algorithms recover equivalence-class graph structure from discrete observational data.	py-why/causal-learn	PC GES GRaSP BOSS Hill Climbing	Cancer Child ALARM HAILFINDER Win95pts
SCR	causal-observational-linear-gaussian	Linear Gaussian Causal Discovery	Studies how observational algorithms recover causal graph structure under linear Gaussian assumptions.	py-why/causal-learn	PC GRaSP BOSS	ER (n=10) ER (n=20) SF (n=50) SF (n=50, Hard) ER (n=20, Noisy)
SCR	causal-observational-linear-non-gaussian	Non-Gaussian Causal Discovery	Studies how non-Gaussian structure can identify directed causal relationships from observational data.	py-why/causal-learn	ICA-LiNGAM DirectLiNGAM NOTEARS	ER (n=30) ER (n=50) SF (n=100)
SCR	causal-observational-nonlinear	Nonlinear Causal Discovery	Studies how nonlinear additive-noise assumptions support directed causal graph recovery from observations.	py-why/causal-learn	CAM NOTEARS-MLP DirectLiNGAM GraN-DAG	SF (n=20, GP) ER (n=20, Gauss) ER (n=12, Low-Sample)
SCR	causal-treatment-effect	Heterogeneous Treatment Effect Estimation	Studies how observational estimators recover individual and average treatment effects on synthetic CATE benchmark families.	custom	S-Learner T-Learner IPW Causal Forest DR-Learner R-Learner	IHDP-inspired Synth Jobs/LaLonde-inspired Synth ACIC-inspired Synth
SCR	graph-generation	Unconditional Graph Generator Architecture	Studies how graph generator architecture affects distributional match to target graph statistics.	pyg-team/pytorch_geometric	GraphVAE GRAN DiGress	Community-Small Ego-Small ENZYMES
SCR	graph-graph-classification	Structure-Aware Graph Readout Pooling	Studies how graph-level readout mechanisms affect graph classification accuracy and macro F1 under a fixed message-passing backbone.	pyg-team/pytorch_geometric	GIN + Sum SAGPool DiffPool	MUTAG PROTEINS NCI1
SCR	graph-link-prediction	Graph Link Encoder-Decoder	Studies how node encoders and edge decoders affect missing-link prediction quality.	custom	GCN + MLP Decoder VGAE SEAL	Cora CiteSeer ogbl-collab
SCR	graph-node-classification	Graph Node Message Passing	Studies how message-passing layers affect node classification across citation network benchmarks.	pyg-team/pytorch_geometric	GCN GAT GraphSAGE	Cora CiteSeer PubMed
SCR	graph-signal-propagation	Homophily-Heterophily Graph Filter	Changes the graph signal propagation filter to improve node classification accuracy across homophilic and heterophilic graphs.	ivam-he/ChebNetII	GPR-GNN BernNet ChebNetII	Cora CiteSeer Texas Cornell
TL	security-adversarial-attack-black-box-score	Score-Based Black-Box Linf Attack	Designs a query-efficient black-box Linf evasion attack to improve attack success rate under a fixed per-sample query budget.	Harry24k/adversarial-attacks-pytorch	Square Attack SPSA Random Search	ResNet-20 / CIFAR-10 VGG-11-BN / CIFAR-10 MobileNet-V2 / CIFAR-10 ResNet-20 / CIFAR-100 MobileNet-V2 / CIFAR-100
TL	security-adversarial-attack-sparse-l0	Sparse L0 Adversarial Attack	Studies how sparse perturbation strategies improve attack success while respecting a strict pixel budget.	Harry24k/adversarial-attacks-pytorch	OnePixel SparseFool JSMA Pixle Sparse-RS	Rebuffi-R18 (l2-AT) / CIFAR-10 Augustin (l2-robust) / CIFAR-10 Engstrom (l2-robust) / CIFAR-10
TL	security-adversarial-attack-white-box-linf	White-Box Linf Evasion Attack	Designs a gradient-based white-box Linf attack to improve attack success rate while respecting the perturbation budget.	Harry24k/adversarial-attacks-pytorch	FGSM PGD MI-FGSM AutoAttack	ResNet-20 / CIFAR-10 VGG-11-BN / CIFAR-10 ResNet-20 / CIFAR-100 VGG-11-BN / CIFAR-100 MobileNet-V2 / CIFAR-100
TL	security-adversarial-training	Linf Adversarial Training for Robust Accuracy	Studies how adversarial training procedures improve robust accuracy while maintaining clean accuracy.	Harry24k/adversarial-attacks-pytorch	Standard Training PGD-AT TRADES MART AWP + TRADES	SmallCNN / MNIST PreAct ResNet-18 / CIFAR-10 VGG-11-BN / CIFAR-10 PreAct ResNet-18 / CIFAR-100
TL	security-backdoor-defense	Poisoned-Sample Scoring for Backdoor Filtering	Designs a suspicion scoring rule that identifies and filters backdoored training examples, reducing attack success rate while preserving clean accuracy.	custom	Confidence Filter Spectral Signatures Activation Clustering Z-Score Outlier	ResNet-20 / CIFAR-10 (BadNets) VGG-16-BN / CIFAR-100 (Blend) MobileNet-V2 / Fashion-MNIST (BadNets)
TL	security-machine-unlearning	Targeted Update Rules for Class Unlearning	Designs an unlearning update rule that removes forget-class information while improving retained accuracy and reducing forget-set membership leakage.	custom	Retain Fine-Tune Negative Gradient Bad Teacher SCRUB	ResNet-20 / CIFAR-10 (Class 0) VGG-16-BN / CIFAR-100 (Class 0) MobileNet-V2 / Fashion-MNIST (Class 0)
TL	security-membership-inference-defense	Training Regularization for Membership Privacy	Studies how privacy-preserving training losses reduce membership leakage while maintaining accuracy.	custom	ERM Label Smoothing Confidence Penalty RelaxLoss	ResNet-20 / CIFAR-10 VGG-16-BN / CIFAR-100 MobileNet-V2 / Fashion-MNIST
TL	security-poison-robust-learning	Robust Losses for Label-Flip Poisoning	Designs a robust loss or sample-weighting rule that improves clean accuracy under label-flip poisoning and reduces poisoned-label memorization.	custom	Cross-Entropy Generalized Cross-Entropy Symmetric Cross-Entropy Bootstrap	ResNet-20 / CIFAR-10 (Label-Flip) VGG-16-BN / CIFAR-100 (Label-Flip) MobileNet-V2 / Fashion-MNIST (Label-Flip)
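
To slice the catalog programmatically, the rows can be read as tab-separated text. The snippet below is a minimal sketch, not part of MLS-Bench: it assumes each row has exactly seven tab-separated fields, and the field names in `COLUMNS`, along with the helpers `parse_catalog` and `tasks_per_area`, are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical field names for the seven tab-separated columns of a catalog row.
COLUMNS = ("area", "task_id", "title", "description", "source", "baselines", "settings")


def parse_catalog(text: str) -> list[dict[str, str]]:
    """Turn tab-separated catalog rows into dictionaries keyed by COLUMNS."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip blank lines and rows that do not have seven fields
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def tasks_per_area(rows: list[dict[str, str]]) -> Counter:
    """Count how many tasks fall under each area tag (for example SCR or TS)."""
    return Counter(row["area"] for row in rows)


# Three rows copied from the catalog above.
SAMPLE = "\n".join([
    "SCR\tgraph-link-prediction\tGraph Link Encoder-Decoder\t"
    "Studies how node encoders and edge decoders affect missing-link prediction quality.\t"
    "custom\tGCN + MLP Decoder VGAE SEAL\tCora CiteSeer ogbl-collab",
    "SCR\tgraph-node-classification\tGraph Node Message Passing\t"
    "Studies how message-passing layers affect node classification across citation network benchmarks.\t"
    "pyg-team/pytorch_geometric\tGCN GAT GraphSAGE\tCora CiteSeer PubMed",
    "TS\tts-classification\tMultivariate Time-Series Classification Model\t"
    "Studies how representation learning improves classification of multivariate time-series signals.\t"
    "thuml/Time-Series-Library\tDLinear TimesNet PatchTST\tEthanolConcentration FaceDetection Handwriting",
])

rows = parse_catalog(SAMPLE)
print(tasks_per_area(rows))
print([row["task_id"] for row in rows])
```

Baselines and evaluation settings are kept as raw strings here, because the items inside those cells are separated only by spaces and some names contain spaces themselves.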

Citation

@misc{lyu2026mlsbenchholisticrigorousassessment,
      title={MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI},
      author={Bohan Lyu and Yucheng Yang and Siqiao Huang and Jiaru Zhang and Qixin Xu and Xinghan Li and Xinyang Han and Yicheng Zhang and Huaqing Zhang and Runhan Huang and Kaicheng Yang and Zitao Chen and Wentao Guo and Junlin Yang and Xinyue Ai and Wenhao Chai and Yadi Cao and Ziran Yang and Kun Wang and Dapeng Jiang and Huan-ang Gao and Shange Tang and Chengshuai Shi and Simon S. Du and Max Simchowitz and Jiantao Jiao and Dawn Song and Chi Jin},
      year={2026},
      eprint={2605.08678},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.08678},
}
