Forgis-Labs/FactoryBench_Generation

FactoryBench: Evaluating Industrial Machine Understanding

We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over academic and industrial time-series data. Q&A pairs are organised along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl's ladder of causation on robotic telemetry, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol.

We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm at 125 and 83 Hz), and construct FactoryBench as a large-scale benchmark grounded in FactoryWave alongside the AURSAD and voraus-AD open-source datasets. Together, these provide a rigorous testbed for evaluating reasoning, causal understanding, and decision support over industrial signals.

[Figure: FactoryBench end-to-end pipeline]


Levels

Q&A tasks are organised along a four-tier hierarchy. Our framework extends Pearl’s ladder of causation (association, intervention, and counterfactual reasoning) with a dedicated decision-making layer, mirroring the real-world diagnostic-then-act loop essential for factory operations.

| Level | Type | Example question | What it tests | Answer format |
|---|---|---|---|---|
| 1 | State | "We want to isolate the lifting phase in the robot's time series. Assuming a fixed window length of 15 timesteps, at which timestep should the window begin?" | Interpret the current state of the machine under normal operation | Tensor |
| 2 | Intervention | "A collision with a foam cube occurs at T=850 ms. Rank signal segments (A–D) in the order you would expect them to appear after the event." | Reason about how the machine reacts to a present-time event | Ranking |
| 3 | Counterfactual | "Had a payload misconfiguration occurred at T=200 ms, what would the target torque on joint 2 at T+50 ms have been in this counterfactual case?" | Reason accurately about hypothetical scenarios | Scalar |
| 4 | Decision | "Given the sensor stream below, does the machine show signs of anomalous behavior? If yes, identify the root cause and the steps to fix it." | Make informed decisions about the machine | Free-form |
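The split between deterministic scoring and LLM-as-judge scoring can be sketched as a simple routing step. This is an illustrative sketch only: the `LEVELS` mapping copies the example rows from the table above (each level may use other answer formats in the actual benchmark), and `scoring_route` is a hypothetical helper, not a function from the FactoryBench code base.

```python
# Level -> (type, answer format of the example question in the table above).
# Assumption: the table shows one example per level, not an exhaustive mapping.
LEVELS = {
    1: ("State", "Tensor"),
    2: ("Intervention", "Ranking"),
    3: ("Counterfactual", "Scalar"),
    4: ("Decision", "Free-form"),
}


def scoring_route(answer_format: str) -> str:
    """Hypothetical router: free-form answers go to the LLM judge,
    every structured format is scored deterministically."""
    if answer_format.strip().lower() == "free-form":
        return "llm_judge"
    return "deterministic"


if __name__ == "__main__":
    for level, (kind, fmt) in LEVELS.items():
        print(level, kind, fmt, scoring_route(fmt))
```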

Datasets

| Dataset | Robot | Episodes | Hz | Tasks | Anomalies |
|---|---|---|---|---|---|
| FactoryWave (ours) | UR3 + KUKA KR10 | 8,983 | 125 / 83 | PnP, screwing, peg-in-hole | 27 |
| AURSAD | UR3e | 4,094 | 100 | Screwing | 4 |
| voraus-AD | Yu-Cobot | 2,122 | 100 / 500 | Pick-and-place | 12 |

Signals follow a Setpoint–Context–Effort/Feedback (SCE) causal schema. The Level 4 knowledge graph, which maps manufacturer error codes to recovery procedures, is stored under knowledge_graph/ on Hugging Face.


Quick Start

1. Clone the Repo

git clone {URL_REPO}
cd factorybench

2. Prepare the Environment

python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -e .
pip install boto3   # only for AWS-routed models

3. Set up API Keys

Create a .env file. The following is the minimum needed for the smallest run (gpt-5.1-1 plus the judge):

HF_TOKEN="<hf>"
AZURE_OPENAI_ENDPOINT="<azure>"
AZURE_OPENAI_API_KEY="<azure>"
CHAT_ENDPOINT="<azure>"
OPIK_API_KEY="<opik>"          # tracing (recommended)
OPIK_PROJECT_NAME="FactoryBench"
OPIK_WORKSPACE="<workspace>"

💡 Opik (comet.com/opik) is the optional tracing layer. When configured, every evaluated question is logged as a trace tagged with model, level, usage, cost, and accuracy score. Leave the OPIK_* vars unset to skip it (inference still runs, just without the dashboard).
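Before a first run it can help to check which of these variables are actually set. The following is a minimal sketch, not part of the factorybench package: the variable names are taken from the .env example above, while the helper names and the assumption that tracing needs all three OPIK_* variables are illustrative.

```python
import os
from typing import Mapping

# Names copied from the minimal .env example above.
REQUIRED = ("HF_TOKEN", "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "CHAT_ENDPOINT")
# Assumption: tracing is treated as configured only if all three are set.
OPIK = ("OPIK_API_KEY", "OPIK_PROJECT_NAME", "OPIK_WORKSPACE")


def missing_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


def tracing_configured(env: Mapping[str, str] = os.environ) -> bool:
    """True when every OPIK_* variable is set and non-empty."""
    return all(env.get(name) for name in OPIK)


if __name__ == "__main__":
    print("missing:", missing_vars())
    print("Opik tracing configured:", tracing_configured())
```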

Extra providers (AWS Bedrock / SageMaker)
# AWS Bedrock: claude-sonnet-4.6, mistral-large-3, deepseek-v3.2
AWS_PROFILE="<profile>"
AWS_REGION="eu-central-1"
FB_S3_BUCKET="<bucket>"
BEDROCK_BATCH_ROLE_ARN="arn:aws:iam::<acct>:role/factorybench-bedrock-batch"
CLAUDE_SONNET_46_MODEL_ID="eu.anthropic.claude-sonnet-4-6"
CLAUDE_SONNET_46_REGION="eu-central-1"
MISTRAL_LARGE_3_MODEL_ID="mistral.mistral-large-3-675b-instruct"
MISTRAL_LARGE_3_REGION="us-west-2"
DEEPSEEK_V32_MODEL_ID="deepseek.v3-2"
DEEPSEEK_V32_REGION="eu-west-2"

# AWS SageMaker (only if re-enabling a self-hosted async endpoint)
SAGEMAKER_ROLE_ARN="arn:aws:iam::<acct>:role/factorybench-sagemaker"
QWEN_SAGEMAKER_ENDPOINT="<endpoint>"
QWEN_SAGEMAKER_REGION="eu-central-1"

Full IAM / region / first-run notes: src/evaluation/aws-setup.md. Model → provider mapping: src/config.py.


Run

After pip install -e . you get a factorybench command (alias: fb). All evaluation runs use the published Q&A pairs on Hugging Face (FactoryBench/FactoryBench, folder factorybench_qa), fetched automatically by the fetch stage. The default split is test; switch with --split train|validation if needed.

# End-to-end: fetch published Q&A from HF, build prompts, evaluate
factorybench \
    --stages fetch,prompts,eval \
    --hf-dataset-folder factorybench_qa \
    --levels 1,2,3,4 \
    --models gpt-5.1-1,claude-sonnet-4.6,mistral-large-3,deepseek-v3.2 \
    --cost-limit 5

# Smoke test: single level, single model, $1 cap
factorybench --stages fetch,prompts,eval --hf-dataset-folder factorybench_qa \
    --levels 1 --models gpt-5.1-1 --cost-limit 1

Useful flags: --split, --no-batch, --no-judge, --strict-batch, --judge-model, --concurrency, --model-concurrency, --overwrite. See factorybench --help for the full list.
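When launching several runs from a script, the command line can be assembled programmatically. The sketch below uses only flags that appear in the examples above; `build_command` is a hypothetical helper, and the stage list and dataset folder are hard-coded to the values shown in this README.

```python
import shlex


def build_command(levels, models, cost_limit, split=None):
    """Assemble a factorybench invocation as an argument list (hypothetical helper)."""
    cmd = [
        "factorybench",
        "--stages", "fetch,prompts,eval",
        "--hf-dataset-folder", "factorybench_qa",
        "--levels", ",".join(str(level) for level in levels),
        "--models", ",".join(models),
        "--cost-limit", str(cost_limit),
    ]
    if split is not None:
        # --split accepts train or validation; test is the documented default.
        cmd += ["--split", split]
    return cmd


if __name__ == "__main__":
    # Reproduces the smoke-test arguments; pass the list to subprocess.run to execute it.
    print(shlex.join(build_command([1], ["gpt-5.1-1"], 1)))
```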

For contributors: regenerating Q&A from scratch

End users should not need this: the HF Q&A pairs are the canonical benchmark. Regeneration requires full FactoryNet dataset access and is only used when extending or modifying the question templates.

# Regenerate Q&A from FactoryNet, then prompts + eval
factorybench --levels 1,2,3,4 --models gpt-5.1-1 --cost-limit 5

# Train/val/test split (episode-level, shared across levels)
python -m scripts.split_qa_train_val_test --dataset-folder factorynet_qa_150k \
    --levels 1 2 3 4 --train-frac 0.8 --val-frac 0.1
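The idea behind an episode-level split shared across levels can be sketched as follows. This is not the code of scripts.split_qa_train_val_test: the function name, the seed, and the use of a seeded shuffle are assumptions made for illustration. The point is that partitions are assigned per episode, so every question derived from one episode lands in the same partition regardless of its level.

```python
import random


def split_episodes(episode_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Assign each unique episode to train, validation, or test (illustrative sketch)."""
    ids = sorted(set(episode_ids))          # deterministic starting order
    random.Random(seed).shuffle(ids)        # seeded shuffle for reproducibility
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    assignment = {}
    for i, episode in enumerate(ids):
        if i < n_train:
            assignment[episode] = "train"
        elif i < n_train + n_val:
            assignment[episode] = "validation"
        else:
            assignment[episode] = "test"
    return assignment


if __name__ == "__main__":
    # Hypothetical Q&A records: (episode id, level).
    records = [(f"ep{i}", level) for i in range(10) for level in (1, 2, 3, 4)]
    assignment = split_episodes(ep for ep, _ in records)
    for episode, level in records[:4]:
        print(episode, level, assignment[episode])
```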

License

FactoryBench and FactoryWave are released under the MIT License and are freely available for academic and commercial use. Both the episode data and the benchmark artefacts (question templates, paraphrase banks, LLM-as-judge prompts, generator source code, and the full Q&A dataset) are distributed via the public Hugging Face repository huggingface.co/datasets/FactoryBench/FactoryBench. The data are fully public; there is no private holdout.

The train/validation/test split (80/10/10 at the episode level, 30 shared across all four levels) is described in Section 5.2 and the mapping is included in the release so that evaluees can reproduce the exact partition used in this paper.

We provide a versioned release track (e.g., v1.0, v1.1) so that reported numbers always refer to a fixed benchmark state; corrections to labels or templates are published as minor-version updates with a public changelog, and the exact version used in any evaluation is stamped into every result file. Bug reports and label-correction requests are tracked on the public GitHub repository.

About

Benchmark to evaluate models w.r.t. their ability to understand, optimise and fix industrial machinery.
