@speriaswamy-amd commented Nov 17, 2025

Motivation

This PR introduces multi-run aggregation and statistical validation to CVS’s RCCL test framework. Previously, CVS only supported single-run validation using basic thresholds. The enhancements include:

  1. Statistical benchmarking: Compute mean and standard deviation for busBw, algBw, and time across multiple runs.
  2. Data integrity: Validate RCCL JSON outputs with Pydantic schemas to catch malformed data early.
  3. Future-ready design: Add a models/ module for type-safe validation, paving the way for database integration.
  4. Consistent multi-node testing: Ensure aggregated runs share identical cluster configurations.

This enables reproducible, statistically sound performance analysis for single-node and multi-node RCCL tests.


Technical Highlights

1. New models/ Module

  • RcclTests: Validates single-node results (checks NaN/Inf, enforces types).
  • RcclTestsMultinodeRaw: Adds multi-node metadata validation (e.g., ranks = nodes × ranksPerNode).
  • RcclTestsAggregated: Stores aggregated mean/std metrics and optional multi-node metadata.
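
The checks described for these models can be illustrated with a small sketch. This is not the PR's code: the PR uses Pydantic, while the stand-in below uses only standard-library dataclasses so it runs without extra dependencies. The class names `RcclResultSketch` and `MultinodeMetaSketch` are hypothetical, and the field types (for example, `inPlace` as an integer) are assumptions; only the field names come from the description above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RcclResultSketch:
    """Hypothetical stand-in for one RCCL result row (not the PR's RcclTests)."""
    name: str
    size: int
    type: str
    inPlace: int      # assumed to be an integer flag
    time: float
    algBw: float
    busBw: float

    def __post_init__(self):
        # Enforce numeric types and reject NaN/Inf in every metric.
        for field_name in ("time", "algBw", "busBw"):
            value = getattr(self, field_name)
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError(f"{field_name} must be a number, got {type(value).__name__}")
            if not math.isfinite(value):
                raise ValueError(f"{field_name} must be finite, got {value}")


@dataclass(frozen=True)
class MultinodeMetaSketch:
    """Hypothetical stand-in for the multi-node metadata check."""
    nodes: int
    ranksPerNode: int
    ranks: int

    def __post_init__(self):
        # The rule quoted above: ranks = nodes x ranksPerNode.
        if self.ranks != self.nodes * self.ranksPerNode:
            raise ValueError(
                f"ranks={self.ranks} does not equal "
                f"nodes*ranksPerNode={self.nodes * self.ranksPerNode}"
            )
```

Constructing either class with inconsistent data raises immediately, which is the fail-fast behaviour the description attributes to the real models.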

2. Aggregation Logic

  • Function: aggregate_rccl_test_results() in lib/rccl_lib.py
  • Uses Pandas groupby().agg() to compute the mean and standard deviation of each metric.
  • Validates cluster config consistency before aggregation.
  • Groups by (name, size, type, inPlace).
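
A rough sketch of the grouping and statistics step follows. It is a dependency-free stand-in, not the body of `aggregate_rccl_test_results()`: the PR uses Pandas `groupby().agg()`, whereas this version uses the standard-library `statistics` module. The function name `aggregate_runs_sketch`, the output key names (`busBw_mean`, `num_runs`, and so on), and the choice to report a standard deviation of 0.0 for a single run are all assumptions made for illustration.

```python
import statistics
from collections import defaultdict

GROUP_KEYS = ("name", "size", "type", "inPlace")   # grouping keys named in the PR
METRICS = ("busBw", "algBw", "time")               # metrics named in the PR


def aggregate_runs_sketch(runs):
    """Aggregate several runs; each run is a list of result dicts."""
    grouped = defaultdict(list)
    for run in runs:
        for row in run:
            key = tuple(row[k] for k in GROUP_KEYS)
            grouped[key].append(row)

    aggregated = []
    for key, rows in grouped.items():
        entry = dict(zip(GROUP_KEYS, key))
        entry["num_runs"] = len(rows)
        for metric in METRICS:
            values = [row[metric] for row in rows]
            entry[f"{metric}_mean"] = statistics.fmean(values)
            # Sample standard deviation needs at least two values;
            # returning 0.0 for a single run is an assumption of this sketch.
            entry[f"{metric}_std"] = (
                statistics.stdev(values) if len(values) > 1 else 0.0
            )
        aggregated.append(entry)
    return aggregated
```

Note that the Pandas `std` aggregation also computes the sample standard deviation by default (`ddof=1`) and yields NaN for a group with a single value, so the single-run case needs an explicit decision in either implementation.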

3. Integration

  • Single-node: Aggregates multiple dtype runs, saves to *_aggregated.json.
  • Multi-node: Validates topology, aggregates results, preserves metadata.
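
The config-consistency check that precedes multi-node aggregation might look roughly like the sketch below. The helper name `check_cluster_config_consistent` and the metadata layout (one dict per run with `nodes`, `ranksPerNode`, and `ranks`) are hypothetical; the PR does not spell out which fields it compares.

```python
CONFIG_KEYS = ("nodes", "ranksPerNode", "ranks")  # assumed metadata fields


def check_cluster_config_consistent(run_metadata):
    """Return the shared config, or raise if any run differs from the first."""
    if not run_metadata:
        raise ValueError("no runs to aggregate")

    reference = tuple(run_metadata[0][k] for k in CONFIG_KEYS)
    for index, meta in enumerate(run_metadata[1:], start=1):
        current = tuple(meta[k] for k in CONFIG_KEYS)
        if current != reference:
            # Fail fast: mixing runs from different topologies would
            # make the aggregated mean/std meaningless.
            raise ValueError(
                f"run {index} has config {current}, expected {reference}"
            )
    return dict(zip(CONFIG_KEYS, reference))
```

Returning the shared config makes it easy to attach the metadata to the aggregated output once the check has passed.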

4. Dependencies

  • Pydantic for schema validation.
  • Pandas for aggregation.

5. Design Principles

  • JSON-native, backward-compatible, fail-fast validation, polymorphic aggregation, and room for future extension.

Test Plan

  • Single-node: Validate JSON, aggregate multiple runs, handle edge cases (NaN, single run).
  • Multi-node: Validate topology, detect config mismatches, preserve metadata.
  • Backward compatibility: Existing tests remain unchanged.

Results

✅ Aggregation works for single-node and multi-node
✅ Schema validation catches malformed data
✅ Config consistency enforced
✅ Outputs saved in correct format
✅ Detailed logging for debugging


Files Changed:

  • models/rccl.py (new)
  • models/__init__.py (new)
  • lib/rccl_lib.py (modified)

Lines Added: ~250 | Lines Modified: ~70

@speriaswamy-amd (Contributor, Author) commented:

@venksrin09,
I’ve tested this implementation on a single gfx942 cluster with CX7 (without MPI) and on two gfx942 nodes with CX7 (with MPI). The implementation is working as expected, and the benchmark results are in line with the expected values. I’m happy to share the log files and benchmark results with you offline.

@speriaswamy-amd self-assigned this Nov 17, 2025
@speriaswamy-amd changed the title from "Surya/rccl regression detection" to "Rccl-Test Aggregation" Nov 17, 2025