docs(contrib): add external industrial dataset mappings #347

Merged: DhavalRepo18 merged 3 commits into IBM:main from ulises-jeremias:docs/external-datasets-guide on Jun 18, 2026

ulises-jeremias commented Jun 5, 2026

Contributor

Description

This PR adds a contributor-oriented external industrial dataset guide for issue #342. It helps external contributors discover public sources and map them to existing AssetOpsBench domains without changing benchmark logic.

Following maintainer feedback, the guide now includes concrete starter mappings for:

  • Vibration diagnostics using public bearing datasets such as CWRU and Paderborn.
  • SWaT / water-treatment telemetry mapped to iot, tsfm, and possible multi-step anomaly scenarios.

Type of Change

  • Documentation / Tutorial update

Related Issues

Changes

  • Added docs/external-industrial-datasets.md with:
    • starter dataset references,
    • Vibration and SWaT starter mappings,
    • candidate future scenario shapes,
    • mapping checklist to AssetOpsBench domains,
    • privacy/safety and provenance guardrails,
    • suggested ingestion workflow for future scenario PRs.
  • Added a small pointer in README.md under "Call for Scenario Contribution".

Why this scope

  • No raw external datasets are committed.
  • No benchmark logic, scoring, runners, MCP servers, or baseline definitions are modified.
  • Keeps this PR small and reviewable while documenting a concrete path for future executable scenario PRs.

Validation

  • Verified internal documentation references exist.
  • Verified the Paderborn dataset link resolves.
  • git diff --check passed.
  • Docs-only change; Python tests were not run.
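
The internal-reference check listed above could be scripted along the following lines. This is a minimal sketch, not the check actually run for this PR: the function name `missing_internal_links` and the choice to treat a leading `/` as repo-root relative are assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches inline Markdown links of the form [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def missing_internal_links(doc_path: Path, repo_root: Path) -> list:
    """Return relative link targets in doc_path that do not exist on disk."""
    missing = []
    text = doc_path.read_text(encoding="utf-8")
    for target in LINK_RE.findall(text):
        # External URLs, mail links, and same-page anchors are out of scope.
        if target.startswith(("http://", "https://", "mailto:", "#")):
            continue
        path_part = target.split("#", 1)[0]
        # Assumption: a leading "/" means "relative to the repository root";
        # anything else is resolved relative to the document itself.
        if path_part.startswith("/"):
            resolved = repo_root / path_part.lstrip("/")
        else:
            resolved = doc_path.parent / path_part
        if not resolved.exists():
            missing.append(target)
    return missing
```

Calling `missing_internal_links(Path("docs/external-industrial-datasets.md"), Path("."))` from the repository root would return an empty list when every relative link resolves. External links, such as the Paderborn dataset URL, still need a separate manual or network check.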

Checklist

  • I have signed off my commits (DCO).

Signed-off-by: ulises-jeremias <ulisescf.24@gmail.com>
Signed-off-by: ulises-jeremias <ulisescf.24@gmail.com>
@DhavalRepo18

Collaborator

@ulises-jeremias Give it a week, and you can revise your PR to add any asset class in a new way.

ulises-jeremias commented Jun 16, 2026

Contributor Author

@DhavalRepo18 Thanks, this is very helpful.

I can revise this PR with one concrete asset-class extension next and keep it small/reviewable. I can start with either:

  1. Water treatment (SWaT) anomaly-focused scenarios (iot + tsfm), or
  2. Bearing diagnostics scenarios (vibration) from a public bearing dataset.

Which option would you prefer for the first revision?

@DhavalRepo18

Collaborator

Start with Vibration and SWaT; we can seek help.

ulises-jeremias commented Jun 18, 2026

Contributor Author

@DhavalRepo18 thanks, that helps. Before I revise the PR, I want to confirm the expected scope.

I see two possible paths:

  1. Docs-only revision for this PR

    • Keep this PR small and reviewable.
    • Extend docs/external-industrial-datasets.md with concrete starter mappings for both Vibration and SWaT.
    • For Vibration, map CWRU/Paderborn-style bearing datasets to the existing vibration domain and current vibration server capabilities.
    • For SWaT, map water-treatment telemetry to iot + tsfm candidate scenarios.
    • Add candidate scenario shapes and a checklist for a future executable scenario PR.
    • No raw datasets, no runtime changes, no benchmark/scoring changes.
  2. Implementation PR / feature work

    • Add actual scenario artifacts and/or ingestion guidance for one or both asset classes.
    • For Vibration, this could build on the existing src/servers/vibration tools and src/scenarios/local/vibration_utterance.json patterns.
    • For SWaT, this would require deciding a sensor/asset schema, timestamp normalization, dataset transform path, and how it should connect to iot and/or tsfm.
    • This would likely be larger and may be better as a follow-up PR after the docs mapping is agreed.
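
To make the SWaT decisions in option 2 more concrete, here is a minimal sketch of what a sensor/asset schema and timestamp normalization step could look like. Everything in it is an assumption for illustration: the `SensorReading` record, the `normalize_row` helper, the column name `Timestamp`, the timestamp format, the UTC+8 source offset, and the sensor tag `FLOW_1` are hypothetical and would need to be checked against the actual dataset files and the existing iot / tsfm interfaces.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SensorReading:
    """One normalized (asset, sensor, time, value) record. Hypothetical schema."""
    asset_id: str
    sensor_id: str
    timestamp_utc: datetime
    value: float


def normalize_row(row, asset_id, ts_column="Timestamp",
                  ts_format="%d/%m/%Y %I:%M:%S %p", utc_offset_hours=8.0):
    """Turn one wide CSV row (a dict of strings) into long-form UTC readings.

    ts_format and utc_offset_hours are assumptions about the source files.
    """
    source_tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(row[ts_column].strip(), ts_format)
    ts_utc = local.replace(tzinfo=source_tz).astimezone(timezone.utc)

    readings = []
    for column, raw_value in row.items():
        if column == ts_column:
            continue
        try:
            value = float(raw_value)
        except (TypeError, ValueError):
            # Non-numeric columns (for example a label column) are skipped.
            continue
        readings.append(SensorReading(asset_id, column.strip(), ts_utc, value))
    return readings
```

A row such as `{"Timestamp": " 22/12/2015 4:00:00 PM", "FLOW_1": "2.47", "label": "Normal"}` would yield a single `SensorReading` for `FLOW_1` stamped 08:00 UTC. Storing UTC in a long-form record keeps the transform independent of whichever domain consumes it later.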

My recommendation is option 1 for this PR, then a separate follow-up implementation PR once the mapping is approved. Do you think that is enough for this PR, or would you prefer that I start working on implementation now?

@DhavalRepo18

Collaborator

@ulises-jeremias Let us start with option 1. By then we will have better clarity on the additional changes we are currently making.

Signed-off-by: ulises-jeremias <ulisescf.24@gmail.com>
ulises-jeremias changed the title from "docs(contrib): add external industrial dataset guide (proposal)" to "docs(contrib): add external industrial dataset mappings" on Jun 18, 2026
@ulises-jeremias

Contributor Author

@DhavalRepo18 I updated this PR following option 1.

What changed:

  • kept the PR docs-only and small
  • added concrete starter mappings for Vibration diagnostics and SWaT / water-treatment telemetry
  • mapped Vibration to the existing vibration server capabilities and local vibration utterance patterns
  • mapped SWaT to candidate iot, tsfm, and multi-step anomaly scenario shapes
  • added candidate future prompts for both asset classes
  • added a checklist for turning these mappings into future executable scenario PRs
  • updated the README pointer to mention starter asset-class mappings

What I intentionally did not change:

  • no raw datasets added
  • no runtime/MCP server changes
  • no benchmark/scoring changes
  • no executable scenario artifacts yet

Validation:

  • verified internal documentation references exist
  • verified the Paderborn dataset link resolves
  • git diff --check passed
  • docs-only change, so Python tests were not run

DhavalRepo18 self-requested a review on Jun 18, 2026, 11:31
@DhavalRepo18

Collaborator

One suggestion: if we write a Kaggle connection (a Kaggle MCP tool) that lets contributors bring their own assets and map them to AssetOpsBench, the benchmark can become a system for testing things out. Kaggle has its own dataset API. Think about it and give us an easy feature: Bring Your Own Kaggle Asset.
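
As a starting point for discussion, the mapping half of such a feature could be a small manifest that is validated before any data is fetched. The sketch below is purely illustrative and calls no Kaggle API: `AssetMapping`, `validate_mapping`, and the `KNOWN_DOMAINS` set (limited to the domains named in this PR) are hypothetical names, and the rule that a tsfm mapping needs a timestamp column is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Only the domains mentioned in this PR; the real list may differ.
KNOWN_DOMAINS = {"iot", "tsfm", "vibration"}


@dataclass
class AssetMapping:
    """Hypothetical manifest describing how an external dataset maps to domains."""
    dataset_ref: str                      # assumed "owner/dataset" style reference
    asset_class: str
    domains: List[str]
    timestamp_column: Optional[str] = None
    signal_columns: List[str] = field(default_factory=list)


def validate_mapping(mapping: AssetMapping) -> List[str]:
    """Return a list of human-readable problems; an empty list means usable."""
    problems = []
    parts = mapping.dataset_ref.split("/")
    if len(parts) != 2 or not all(parts):
        problems.append("dataset_ref should look like 'owner/dataset'")
    unknown = sorted(set(mapping.domains) - KNOWN_DOMAINS)
    if unknown:
        problems.append(f"unknown domains: {unknown}")
    if not mapping.signal_columns:
        problems.append("no signal columns listed")
    # Assumption: time-series scenarios cannot be built without a time axis.
    if "tsfm" in mapping.domains and mapping.timestamp_column is None:
        problems.append("tsfm mapping needs a timestamp column")
    return problems
```

Validating the manifest first keeps the download step, whichever client ends up performing it, separate from the question of whether a dataset fits an existing domain at all.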

DhavalRepo18 merged commit 4189d6e into IBM:main on Jun 18, 2026
3 checks passed