FactoryBench: Evaluating Industrial Machine Understanding

Team: Yanis Merzouki, Coral Izquierdo, Matei Ignuta-Ciuncanu, Marcos Gomez-Bracamonte, Riccardo Maggioni, Alessandro Lombardi, Camilla Mazzoleni, Federico Martelli, Balazs Gunther, Jonas Petersen, Philipp Petersen

We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over industrial robotic telemetry. Q&A pairs are organized along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl's ladder of causation, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol. We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm), and construct FactoryBench as a large-scale benchmark of over 70k Q&A items grounded in roughly 15k normalized episodes from FactoryWave, AURSAD, and voraus-AD. Zero-shot evaluation of six frontier LLMs shows that no model exceeds 50% on structured levels or 18% on decision-making, revealing a wide gap between current models and operational machine understanding.
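The voting step for free-form answers can be pictured with a small sketch. The text above says only that an LLM-as-judge voting protocol is used. The verdict labels, the number of judges, and the tie handling below are assumptions for illustration, not the benchmark's actual protocol.

```python
from collections import Counter


def majority_verdict(votes):
    """Return the most common verdict, or None for an empty list or a tie.

    `votes` is a list of hashable verdict labels, one per judge call.
    The labels themselves (e.g. "correct") are hypothetical.
    """
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    # A tie for first place means there is no majority winner.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

Returning None on a tie leaves the tie-breaking policy to the caller, for example by requesting one more judge vote.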

4-Level Q&A Framework

Level	Task	Example
L1 State	Interpret what the machine is doing now	"What is the current of joint 3 right now?"
L2 Intervention	Reason about what an action now would change	"If force in joint 3 increases to X now, what happens?"
L3 Counterfactual	Reason about what a different past would have produced	"If force had increased to X at t=20ms, what would have happened?"
L4 Decision	Generate a remediation plan from the trace	"Robot stopped with error C203A. What to do?"

Each level builds on the previous; failure at level N implies failure at level N+1.
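One way to read the cumulative structure is as a ladder that a model climbs until its first failure. The sketch below is an illustration under that reading. The `Level` enum and `highest_level_reached` helper are hypothetical names and are not part of the FactoryBench package.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four causal levels, ordered from L1 to L4."""
    STATE = 1
    INTERVENTION = 2
    COUNTERFACTUAL = 3
    DECISION = 4


def highest_level_reached(passed):
    """Return the highest level passed without an earlier failure.

    `passed` maps Level -> bool. Levels missing from the mapping count
    as failures. Returns None if L1 itself was not passed.
    """
    reached = None
    for level in Level:  # iterates in definition order, L1 to L4
        if not passed.get(level, False):
            break
        reached = level
    return reached
```

Under this reading, a pass at L4 after a failure at L3 does not count, which mirrors the statement that failure at one level carries over to the next.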

FactoryWave Dataset


Q&A items	70,000+
Episodes	~15,000 normalized
Question templates	21 structured templates
Answer formats	Multi-select, scalar, tensor, ranking, free-form
Robots	UR3 cobot + KUKA KR10
Faults	27 types across 3 tasks
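The structured answer formats listed above are scored deterministically. This README does not state the exact metrics, so the sketch below shows plausible stand-ins only: Jaccard overlap for multi-select, a relative tolerance for scalars, and exact order match for rankings. The function names, the 5% tolerance, and the choice of metrics are all assumptions. Tensor scoring is omitted.

```python
import math


def score_multi_select(predicted, expected):
    """Jaccard overlap between the predicted and expected option sets."""
    p, e = set(predicted), set(expected)
    if not p and not e:
        return 1.0  # both empty: treated as a perfect match
    return len(p & e) / len(p | e)


def score_scalar(predicted, expected, rel_tol=0.05):
    """1.0 if the prediction is within a relative tolerance, else 0.0.

    The 5% default is an assumed value. With a purely relative
    tolerance, an expected value of 0 is matched only by exactly 0.
    """
    return 1.0 if math.isclose(predicted, expected, rel_tol=rel_tol) else 0.0


def score_ranking(predicted, expected):
    """1.0 if the predicted order matches the expected order exactly."""
    return 1.0 if list(predicted) == list(expected) else 0.0
```

Because each function depends only on its inputs, repeated runs over the same predictions yield the same scores, which is the property that separates these formats from judge-scored free-form answers.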

Quick Start

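# Install the Hugging Face datasets library (shell)
pip install datasets

# Load the benchmark from the Hugging Face Hub (Python)
from datasets import load_dataset

ds = load_dataset("FactoryBench/FactoryBench")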
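Split names and column names are not listed in this README, so it may be safer to inspect the loaded object than to assume a schema. The helper below is a minimal sketch. `describe_splits` and `load_and_describe` are hypothetical names, and the sketch assumes the repository loads with its default configuration.

```python
def describe_splits(dataset_dict):
    """Map each split name to its row count and column names.

    Works on any mapping whose values expose `num_rows` and
    `column_names`, as the splits of a `datasets.DatasetDict` do.
    """
    return {
        name: {"rows": split.num_rows, "columns": list(split.column_names)}
        for name, split in dataset_dict.items()
    }


def load_and_describe(repo_id="FactoryBench/FactoryBench"):
    """Download the dataset from the Hub and summarize its splits."""
    # Imported here so describe_splits works without the dependency.
    from datasets import load_dataset

    return describe_splits(load_dataset(repo_id))
```

Calling `load_and_describe()` requires network access and the `datasets` package. The returned summary shows which splits and fields exist before any evaluation code is written against them.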

Citation

@article{merzouki2026factorybench,
  title   = {FactoryBench: Evaluating Industrial Machine Understanding},
  author  = {Merzouki, Yanis and Izquierdo, Coral and Ignuta-Ciuncanu, Matei
             and Gomez-Bracamonte, Marcos and Maggioni, Riccardo and Lombardi,
             Alessandro and Mazzoleni, Camilla and Martelli, Federico and
             Gunther, Balazs and Petersen, Jonas and Petersen, Philipp},
  journal = {arXiv preprint arXiv:2605.07675},
  year    = {2026}
}

