Rccl-Test Aggregation #19

speriaswamy-amd · 2025-11-17T20:32:39Z

Motivation

This PR introduces multi-run aggregation and statistical validation to CVS’s RCCL test framework. Previously, CVS only supported single-run validation using basic thresholds. The enhancements include:

Statistical benchmarking: Compute mean and standard deviation for busBw, algBw, and time across multiple runs.
Data integrity: Validate RCCL JSON outputs with Pydantic schemas to catch malformed data early.
Future-ready design: Add a models/ module for type-safe validation, paving the way for database integration.
Consistent multi-node testing: Ensure aggregated runs share identical cluster configurations.

This enables reproducible, statistically sound performance analysis for single-node and multi-node RCCL tests.

Technical Highlights

1. New `models/` Module

RcclTests: Validates single-node results (checks NaN/Inf, enforces types).
RcclTestsMultinodeRaw: Adds multi-node metadata validation (e.g., ranks = nodes × ranksPerNode).
RcclTestsAggregated: Stores aggregated mean/std metrics and optional multi-node metadata.
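The validation rules above could be sketched roughly as follows. This is a minimal illustration assuming Pydantic v2; the field names (`name`, `size`, `type`, `inPlace`, `time`, `algBw`, `busBw`, `nodes`, `ranks`, `ranksPerNode`) are taken from this description, while their types and the validator names are assumptions and may differ from what models/rccl.py actually defines.

```python
import math
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator, model_validator


class RcclTests(BaseModel):
    """One result row from a single-node run (field types are assumed)."""

    name: str
    size: int
    type: str
    inPlace: int
    time: float
    algBw: float
    busBw: float

    @field_validator("time", "algBw", "busBw")
    @classmethod
    def must_be_finite(cls, value: float) -> float:
        # Reject NaN and +/-Inf so bad samples never reach aggregation.
        if not math.isfinite(value):
            raise ValueError("metric must be a finite number")
        return value


class RcclTestsMultinodeRaw(RcclTests):
    """Adds optional cluster metadata on top of the single-node fields."""

    nodes: Optional[int] = None
    ranks: Optional[int] = None
    ranksPerNode: Optional[int] = None

    @model_validator(mode="after")
    def ranks_match_topology(self):
        # Only check the topology when all three values are present.
        meta = (self.nodes, self.ranks, self.ranksPerNode)
        if all(v is not None for v in meta):
            if self.ranks != self.nodes * self.ranksPerNode:
                raise ValueError("ranks must equal nodes * ranksPerNode")
        return self
```

Inheriting the multi-node model from the single-node one keeps the NaN/Inf checks in one place, which matches the "reduce model duplication" intent of the commits in this PR.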

2. Aggregation Logic

Function: aggregate_rccl_test_results() in lib/rccl_lib.py
Uses Pandas groupby().agg() to compute the per-group mean and standard deviation.
Validates cluster config consistency before aggregation.
Groups by (name, size, type, inPlace).
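As a rough sketch of the grouping step described above: the helper below is hypothetical (it is not `aggregate_rccl_test_results()` itself), and the `<metric>_mean` / `<metric>_std` output column names are assumptions chosen for illustration. It also assumes that a group with a single run should report a standard deviation of 0.0 rather than NaN, which the PR does not state.

```python
import pandas as pd

GROUP_KEYS = ["name", "size", "type", "inPlace"]
METRICS = ["busBw", "algBw", "time"]


def aggregate_rows(rows):
    """Aggregate validated result rows (dicts) into per-group mean/std records."""
    df = pd.DataFrame(rows)

    # One output row per (name, size, type, inPlace) combination.
    agg = df.groupby(GROUP_KEYS)[METRICS].agg(["mean", "std"])

    # Flatten the (metric, stat) MultiIndex into names like "busBw_mean".
    agg.columns = [f"{metric}_{stat}" for metric, stat in agg.columns]

    # pandas returns NaN for the sample std of a single value; this sketch
    # reports 0.0 instead (an assumption, see the note above).
    std_cols = [c for c in agg.columns if c.endswith("_std")]
    agg[std_cols] = agg[std_cols].fillna(0.0)

    return agg.reset_index().to_dict(orient="records")
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, so two runs are the minimum for a non-trivial `std` value.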

3. Integration

Single-node: Aggregates multiple dtype runs, saves to *_aggregated.json.
Multi-node: Validates topology, aggregates results, preserves metadata.
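The multi-node flow (check that every run shares the same cluster configuration, then write the aggregate next to the raw results) might look something like the sketch below. All function names here are hypothetical, and the `{"metadata": ..., "results": ...}` layout of the output file is an assumption; only the `*_aggregated.json` suffix and the metadata keys come from this description.

```python
import json
from pathlib import Path

METADATA_KEYS = ("nodes", "ranks", "ranksPerNode")


def common_metadata(runs):
    """Return the cluster metadata shared by all runs, or fail fast."""
    if not runs:
        raise ValueError("no runs to aggregate")
    seen = {tuple(run.get(key) for key in METADATA_KEYS) for run in runs}
    if len(seen) != 1:
        raise ValueError(f"cluster config mismatch across runs: {list(seen)}")
    return dict(zip(METADATA_KEYS, seen.pop()))


def aggregated_path(raw_path):
    """Map e.g. results/all_reduce.json to results/all_reduce_aggregated.json."""
    path = Path(raw_path)
    return path.with_name(f"{path.stem}_aggregated.json")


def save_aggregated(raw_path, rows, metadata):
    """Write aggregated rows plus preserved metadata; return the output path."""
    out = aggregated_path(raw_path)
    out.write_text(json.dumps({"metadata": metadata, "results": rows}, indent=2))
    return out
```

Running the consistency check before any statistics are computed keeps the fail-fast behaviour listed under the design principles: a mismatched run aborts the aggregation instead of silently skewing the mean.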

4. Dependencies

Pydantic for schema validation.
Pandas for aggregation.

5. Design Principles

JSON-native, backward compatible, fail-fast validation, polymorphic aggregation, future extensibility.

Test Plan

Single-node: Validate JSON, aggregate multiple runs, handle edge cases (NaN, single run).
Multi-node: Validate topology, detect config mismatches, preserve metadata.
Backward compatibility: Existing tests remain unchanged.

Results

✅ Aggregation works for single-node and multi-node
✅ Schema validation catches malformed data
✅ Config consistency enforced
✅ Outputs saved in correct format
✅ Detailed logging for debugging

Files Changed:

models/rccl.py (new)
models/__init__.py (new)
lib/rccl_lib.py (modified)

Lines Added: ~250 | Lines Modified: ~70

speriaswamy-amd · 2025-11-17T20:37:54Z

@venksrin09,
I’ve tested this implementation on a single gfx942 cluster with CX7 (without MPI) and on two gfx942 nodes with CX7 (with MPI). The implementation is working as expected, and the benchmark results are in line with the expected values. I’m happy to share the log files and benchmark results with you offline.

speriaswamy-amd added 13 commits November 16, 2025 14:34

NCCL_SOCKET_IFNAME is required for multinode rccl-test runs

164bd72

Pydantic models for validating rccl-tests results

b132cf0

make models a module

3579920

Pass dtype and number of cycles as arguments from config

da9a681

updated requirements for pydantic & pandas

0a30ec3

Refactor to reduce model duplication using inheritance

80149a2

Single node rccl-tests aggregation for each data type

e5073d7

Pass no of cycles and dtypes through config dict

e241bd4

iterate through dtypes and run multinode rccl-tests with validation

9bc6483

For multinode Rccl-tests model take nodes, ranks, etc. as optional metadata

25006bd

Aggregate multinode rccl-test results

8d2fb68

For multinode aggregation retain metadata

25b31af

Multinode rccl-tests aggregation

0c3eef0

speriaswamy-amd requested a review from venksrin09 November 17, 2025 20:38

speriaswamy-amd self-assigned this Nov 17, 2025

speriaswamy-amd changed the title ~~Surya/rccl regression detection~~ Rccl-Test Aggregation Nov 17, 2025
