feat: add 40 graph-native scenarios + 8-dim evaluation #203

Open
sandeepkunkunuru wants to merge 1 commit into IBM:main from samyama-ai:graph-scenarios

Summary

  • 40 new scenarios (IDs 601-640) across 7 categories testing graph-native capabilities that complement the existing 139 scenarios
  • Extended evaluation framework: 2 new dimensions (Graph Utilization, Semantic Precision) added to the existing 6-dimensional scoring

New Scenario Categories

| Category | Count | Graph Capabilities Tested |
|---|---|---|
| Multi-hop dependency | 8 | BFS/DFS over DEPENDS_ON, SHARES_SYSTEM_WITH edges |
| Cross-asset correlation | 6 | Anomaly correlation across connected equipment |
| Failure pattern similarity | 6 | Vector similarity search on FailureMode embeddings |
| Criticality analysis | 5 | PageRank, WCC, articulation point detection |
| Maintenance optimization | 5 | Constrained scheduling, Pareto optimization |
| Root cause analysis | 5 | Reverse edge traversal (TRIGGERED, DETECTED_ANOMALY) |
| Temporal pattern | 5 | Temporal aggregation over work order sequences |
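
To make the multi-hop dependency category concrete, here is a minimal sketch of a bounded BFS over DEPENDS_ON edges. It is not taken from this PR: the edge list, the asset names, and the `multi_hop` helper are all hypothetical, and a real agent would query whatever graph store it has available.

```python
from collections import deque

# Hypothetical (source, relation, target) triples; names are placeholders,
# not assets from AssetOpsBench.
EDGES = [
    ("asset_a", "DEPENDS_ON", "asset_b"),
    ("asset_b", "DEPENDS_ON", "asset_c"),
    ("asset_c", "DEPENDS_ON", "asset_d"),
    ("asset_a", "SHARES_SYSTEM_WITH", "asset_e"),
]


def multi_hop(start, relation, max_hops, edges=EDGES):
    """Return {node: hop_count} for nodes reachable from `start`
    by following `relation` edges, up to `max_hops` hops (BFS)."""
    adjacency = {}
    for src, rel, dst in edges:
        if rel == relation:
            adjacency.setdefault(src, []).append(dst)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adjacency.get(node, []):
            if nxt not in hops:  # first visit is the shortest path in BFS
                hops[nxt] = hops[node] + 1
                queue.append(nxt)

    del hops[start]  # report only the downstream nodes
    return hops


two_hop = multi_hop("asset_a", "DEPENDS_ON", max_hops=2)
```

Here `two_hop` holds `asset_b` at one hop and `asset_c` at two hops. `asset_d` is excluded by the hop limit, and `asset_e` is excluded because it is linked by a different edge type.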

Design Principles

  • Tool-agnostic: Any graph-capable agent can attempt these — no vendor-specific tools assumed
  • Data-grounded: All scenarios reference equipment, sensors, and failure modes already in AssetOpsBench
  • Follows existing format: Same JSON schema (id, type, text, category, characteristic_form, deterministic, note)
  • Non-deterministic: Acceptance criteria are stated in characteristic_form and evaluated by the existing grading agent
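
As an illustration of the schema point above, the following sketch checks that a scenario record carries the seven listed fields. Only the field names and the 601-640 ID range come from this PR. The field values, the `missing_fields` helper, and the `id_in_range` helper are hypothetical placeholders, not content from graph_utterance.json.

```python
# Field names as listed in the PR description.
REQUIRED_FIELDS = (
    "id", "type", "text", "category",
    "characteristic_form", "deterministic", "note",
)


def missing_fields(scenario):
    """Return the required field names absent from a scenario dict."""
    return [name for name in REQUIRED_FIELDS if name not in scenario]


def id_in_range(scenario, low=601, high=640):
    """Check that the scenario ID falls in the range this PR adds."""
    return low <= scenario.get("id", -1) <= high


# Hypothetical record; every value below is a placeholder for illustration.
example = {
    "id": 601,
    "type": "placeholder-type",
    "text": "Placeholder question about downstream dependencies.",
    "category": "Multi-hop dependency",
    "characteristic_form": "Placeholder acceptance criteria.",
    "deterministic": False,
    "note": "",
}

problems = missing_fields(example)
```

For the record above, `problems` is an empty list and `id_in_range(example)` is true. A record missing `characteristic_form` would be reported by name.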

Files

  • src/tmp/assetopsbench/scenarios/single_agent/graph_utterance.json — 40 scenarios
  • docs/extended_evaluation_8dim.md — evaluation framework documentation

Motivation

Graph-based approaches to industrial maintenance enable query types that flat document stores cannot support: multi-hop dependency analysis, vector similarity on failure mode embeddings, PageRank criticality ranking, and Pareto-optimal scheduling. These scenarios provide a standardized way to evaluate such capabilities.
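
For readers unfamiliar with PageRank-style criticality ranking, a minimal power-iteration sketch follows. It is illustrative only and not part of this PR. The edge list and asset names are hypothetical, and it assumes edges point from a dependent asset to the asset it depends on, so rank accumulates on heavily depended-on equipment.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over directed (src, dst) edges.
    Rank held by nodes without outgoing edges is spread uniformly."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Mass sitting on dangling nodes is redistributed to every node.
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out[src]:
                new_rank[dst] += damping * rank[src] / len(out[src])
        rank = new_rank
    return rank


# Hypothetical dependency edges: (dependent, dependency).
DEPENDS_ON = [
    ("chiller_x", "pump_a"),
    ("pump_a", "power_feed"),
    ("pump_b", "power_feed"),
]

scores = pagerank(DEPENDS_ON)
most_critical = max(scores, key=scores.get)
```

In this toy graph `most_critical` is `power_feed`, since both pumps depend on it. Whether PageRank is the right criticality measure for a given plant topology is a modeling choice this sketch does not settle.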

…ework

Add 40 new scenarios (IDs 601-640) across 7 categories that test graph
traversal, vector similarity, graph algorithms, and optimization:

- Multi-hop dependency (8): BFS/DFS over DEPENDS_ON edges
- Cross-asset correlation (6): Anomaly correlation across connected equipment
- Failure pattern similarity (6): Vector search on FailureMode embeddings
- Criticality analysis (5): PageRank, WCC, articulation points
- Maintenance optimization (5): Constrained scheduling, Pareto optimization
- Root cause analysis (5): Reverse TRIGGERED/DETECTED_ANOMALY traversal
- Temporal pattern (5): MTBF, seasonal patterns, degradation trends

Scenarios are tool-agnostic — any graph-capable agent can attempt them.
All reference equipment, sensors, and failure modes in AssetOpsBench data.

Also add 2 new evaluation dimensions (Graph Utilization, Semantic Precision)
extending the original 6-dimensional framework to 8 dimensions.

Signed-off-by: Sandeep Kunkunuru <sandeep.kunkunuru@gmail.com>