FactoryBench: Evaluating Industrial Machine Understanding

We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over academic and industrial time-series data. Q&A pairs are organised along four causal levels (state, intervention, counterfactual, decision) that instantiate Pearl's ladder of causation on robotic telemetry. They span five answer formats: four structured formats are scored deterministically, and free-form answers are scored by an LLM-as-judge voting protocol.

We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm at 125 and 83 Hz), and construct FactoryBench as a large-scale benchmark grounded in FactoryWave alongside the AURSAD and voraus-AD open-source datasets. Together, these provide a rigorous testbed for evaluating reasoning, causal understanding, and decision support over industrial signals.

📦 HF dataset: https://huggingface.co/datasets/FactoryBench/FactoryBench

Levels

Q&A tasks are organised along a four-tier hierarchy. Our framework extends Pearl’s ladder of causation (association, intervention, and counterfactual reasoning) with a dedicated decision-making layer, mirroring the real-world diagnose-then-act loop essential for factory operations.

Level	Type	Example question	What it tests	Answer format
1	State	"We want to isolate the lifting phase in the robot's time series. Assuming a fixed window length of 15 timesteps, at which timestep should the window begin?"	Interpret the current state of the machine under normal operation	Tensor
2	Intervention	"A collision with a foam cube occurs at T=850 ms. Rank signal segments (A–D) in the order you would expect them to appear after the event."	Reason about how the machine reacts to a present-time event	Ranking
3	Counterfactual	"Had a payload misconfiguration occurred at T=200 ms, what would the target torque on joint 2 at T+50 ms have been in this counterfactual case?"	Reason accurately about hypothetical scenarios that did not occur	Scalar
4	Decision	"Given the sensor stream below, does the machine show signs of anomalous behavior? If yes, identify the root cause and the steps to fix it."	Make informed decisions about the machine	Free-form
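
As a rough illustration of what deterministic scoring of structured answers could look like, the sketch below scores three of the formats in the table. The function names, the 5% relative tolerance, the exact-match rule for rankings, and the window-overlap rule are all assumptions made for illustration; they are not taken from the FactoryBench scorer, whose actual rules live in the repository. The fifth answer format is not shown in the table above and is not guessed here.

```python
from math import isclose


def score_scalar(pred: float, target: float, rel_tol: float = 0.05) -> float:
    """Return 1.0 if pred is within rel_tol of target, else 0.0 (assumed rule)."""
    return 1.0 if isclose(pred, target, rel_tol=rel_tol) else 0.0


def score_ranking(pred: list[str], target: list[str]) -> float:
    """Exact-match score for an ordering such as ["B", "A", "D", "C"] (assumed rule)."""
    return 1.0 if pred == target else 0.0


def score_window_start(pred_start: int, target_start: int, window: int = 15) -> float:
    """Intersection-over-union of two fixed-length windows (assumed rule).

    Both windows have the same length, so the overlap depends only on the
    distance between the two start indices.
    """
    overlap = max(0, window - abs(pred_start - target_start))
    union = 2 * window - overlap
    return overlap / union


if __name__ == "__main__":
    print(score_scalar(10.2, 10.0))
    print(score_ranking(["B", "A", "D", "C"], ["B", "A", "D", "C"]))
    print(score_window_start(100, 105))
```

Because each function depends only on the prediction and the reference answer, repeated evaluations of the same model output give the same score, which is the property that separates these formats from the judged free-form answers.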

Datasets

Dataset	Robot	Episodes	Hz	Tasks	Anomalies
FactoryWave (ours)	UR3 + KUKA KR10	8,983	125 / 83	Pick-and-place, screwing, peg-in-hole	27
AURSAD	UR3e	4,094	100	Screwing	4
voraus-AD	Yu-Cobot	2,122	100 / 500	Pick-and-place	12

Signals follow a Setpoint–Context–Effort/Feedback (SCE) causal schema. The L4 knowledge graph (manufacturer error codes → recovery procedures) lives at knowledge_graph/ on HF.

Quick Start

1. Clone the Repo

git clone {URL_REPO}
cd factorybench

2. Prepare the Environment

python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -e .
pip install boto3   # only for AWS-routed models

3. Set up API Keys

Create a .env file. The variables below are the minimum for the smallest run (gpt-5.1-1 + judge):

HF_TOKEN="<hf>"
AZURE_OPENAI_ENDPOINT="<azure>"
AZURE_OPENAI_API_KEY="<azure>"
CHAT_ENDPOINT="<azure>"
OPIK_API_KEY="<opik>"          # tracing (recommended)
OPIK_PROJECT_NAME="FactoryBench"
OPIK_WORKSPACE="<workspace>"

💡 Opik (comet.com/opik) is the optional tracing layer. When configured, every evaluated question is logged as a trace tagged with model, level, usage, cost, and accuracy score. Leave the OPIK_* vars unset to skip it (inference still runs, just without the dashboard).
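
Before launching a run, it can help to confirm that the minimum variables are actually set. The sketch below is a hypothetical helper, not part of the factorybench package. It assumes the variables are already present in the process environment (for example, exported in the shell or loaded from .env by whatever loader you use), and it assumes tracing needs all three OPIK_* variables, which the text above does not state explicitly.

```python
import os

# Minimum variables for the smallest run, as listed in the .env example above.
REQUIRED = [
    "HF_TOKEN",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "CHAT_ENDPOINT",
]

# Optional tracing variables; treating all three as necessary is an assumption.
OPTIONAL_OPIK = ["OPIK_API_KEY", "OPIK_PROJECT_NAME", "OPIK_WORKSPACE"]


def check_env(env=None):
    """Return (missing_required_names, tracing_enabled) for the given mapping."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    tracing = all(env.get(name) for name in OPTIONAL_OPIK)
    return missing, tracing


if __name__ == "__main__":
    missing, tracing = check_env()
    if missing:
        print("Missing required variables:", ", ".join(missing))
    print("Opik tracing configured:", tracing)
```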

Extra providers (AWS Bedrock / SageMaker)

# AWS Bedrock: claude-sonnet-4.6, mistral-large-3, deepseek-v3.2
AWS_PROFILE="<profile>"
AWS_REGION="eu-central-1"
FB_S3_BUCKET="<bucket>"
BEDROCK_BATCH_ROLE_ARN="arn:aws:iam::<acct>:role/factorybench-bedrock-batch"
CLAUDE_SONNET_46_MODEL_ID="eu.anthropic.claude-sonnet-4-6"
CLAUDE_SONNET_46_REGION="eu-central-1"
MISTRAL_LARGE_3_MODEL_ID="mistral.mistral-large-3-675b-instruct"
MISTRAL_LARGE_3_REGION="us-west-2"
DEEPSEEK_V32_MODEL_ID="deepseek.v3-2"
DEEPSEEK_V32_REGION="eu-west-2"

# AWS SageMaker (only if re-enabling a self-hosted async endpoint)
SAGEMAKER_ROLE_ARN="arn:aws:iam::<acct>:role/factorybench-sagemaker"
QWEN_SAGEMAKER_ENDPOINT="<endpoint>"
QWEN_SAGEMAKER_REGION="eu-central-1"

Full IAM / region / first-run notes: src/evaluation/aws-setup.md. Model → provider mapping: src/config.py.

Run

After pip install -e . you get a factorybench command (alias: fb). All evaluation runs use the published Q&A pairs on Hugging Face (FactoryBench/FactoryBench, folder factorybench_qa), fetched automatically by the fetch stage. The default split is test; switch with --split train|validation if needed.

# End-to-end: fetch published Q&A from HF, build prompts, evaluate
factorybench \
    --stages fetch,prompts,eval \
    --hf-dataset-folder factorybench_qa \
    --levels 1,2,3,4 \
    --models gpt-5.1-1,claude-sonnet-4.6,mistral-large-3,deepseek-v3.2 \
    --cost-limit 5

# Smoke test: single level, single model, $1 cap
factorybench --stages fetch,prompts,eval --hf-dataset-folder factorybench_qa \
    --levels 1 --models gpt-5.1-1 --cost-limit 1

Useful flags: --split, --no-batch, --no-judge, --strict-batch, --judge-model, --concurrency, --model-concurrency, --overwrite. See factorybench --help for the full list.
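
If you want to launch several runs from a script, the commands above can be assembled programmatically. The sketch below is a hypothetical wrapper that uses only flags shown in this README; build_command and run are illustrative names, not part of the factorybench package. It passes --split only when a non-default split is requested, since test is the documented default.

```python
import shlex
import subprocess


def build_command(levels, models, cost_limit, split="test"):
    """Assemble the argv list for one factorybench run (flags as documented above)."""
    cmd = [
        "factorybench",
        "--stages", "fetch,prompts,eval",
        "--hf-dataset-folder", "factorybench_qa",
        "--levels", ",".join(str(level) for level in levels),
        "--models", ",".join(models),
        "--cost-limit", str(cost_limit),
    ]
    if split != "test":  # test is the default, so only pass --split otherwise
        cmd += ["--split", split]
    return cmd


def run(cmd):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

For example, run(build_command([1], ["gpt-5.1-1"], 1)) would issue the same command as the smoke test above.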

For contributors: regenerating Q&A from scratch

End users should not need this: the HF Q&A pairs are the canonical benchmark. Regeneration requires full FactoryNet dataset access and is only used when extending or modifying the question templates.

# Regenerate Q&A from FactoryNet, then prompts + eval
factorybench --levels 1,2,3,4 --models gpt-5.1-1 --cost-limit 5

# Train/val/test split (episode-level, shared across levels)
python -m scripts.split_qa_train_val_test --dataset-folder factorynet_qa_150k \
    --levels 1 2 3 4 --train-frac 0.8 --val-frac 0.1
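
To show why an episode-level split keeps the four levels consistent, here is a minimal sketch. It is not the implementation of scripts.split_qa_train_val_test: it swaps in a hash-based assignment, which gives only approximately 80/10/10 proportions, and the field names episode_id and level are assumed for illustration. The point is that every question derived from the same episode lands in the same split regardless of its level, so no episode leaks between train and test.

```python
import hashlib


def assign_split(episode_id: str, train_frac: float = 0.8, val_frac: float = 0.1) -> str:
    """Map an episode ID to a split using a stable hash (illustrative technique)."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).digest()
    # Turn the first 8 bytes into a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < train_frac:
        return "train"
    if u < train_frac + val_frac:
        return "validation"
    return "test"


def split_questions(questions):
    """Group question dicts by the split of their episode, ignoring the level."""
    parts = {"train": [], "validation": [], "test": []}
    for question in questions:
        parts[assign_split(question["episode_id"])].append(question)
    return parts


if __name__ == "__main__":
    # Hypothetical records: 200 episodes, one question per level each.
    qs = [{"episode_id": f"ep{i}", "level": lvl} for i in range(200) for lvl in (1, 2, 3, 4)]
    print({name: len(items) for name, items in split_questions(qs).items()})
```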

License

FactoryBench and FactoryWave are released under the MIT License and are freely available for academic and commercial use. Both the episode data and the benchmark artefacts (question templates, paraphrase banks, LLM-as-judge prompts, generator source code, and the full Q&A dataset) are distributed via the public Hugging Face repository huggingface.co/datasets/FactoryBench/FactoryBench. The data are fully public; there is no private holdout. The train/validation/test split (80/10/10 at the episode level, 30 shared across all four levels) is described in Section 5.2 and the mapping is included in the release so that evaluees can reproduce the exact partition used in this paper. We provide a versioned release track (e.g., v1.0, v1.1) so that reported numbers always refer to a fixed benchmark state; corrections to labels or templates are published as minor-version updates with a public changelog, and the exact version used in any evaluation is stamped into every result file. Bug reports and label-correction requests are tracked on the public GitHub repository.
