Rccl-Test Aggregation #19
Open
Motivation
This PR introduces multi-run aggregation and statistical validation to CVS’s RCCL test framework. Previously, CVS only supported single-run validation using basic thresholds. The enhancements include:
- `busBw`, `algBw`, and `time` across multiple runs.
- `models/` module for type-safe validation, paving the way for database integration.

This enables reproducible, statistically sound performance analysis for single-node and multi-node RCCL tests.
Technical Highlights
1. New `models/` Module
- `RcclTests`: Validates single-node results (checks NaN/Inf, enforces types).
- `RcclTestsMultinodeRaw`: Adds multi-node metadata validation (e.g., `ranks = nodes × ranksPerNode`).
- `RcclTestsAggregated`: Stores aggregated mean/std metrics and optional multi-node metadata.

2. Aggregation Logic
- `aggregate_rccl_test_results()` in `lib/rccl_lib.py`
- `groupby().agg()` for robust stats.
- (name, size, type, inPlace).

3. Integration
- `*_aggregated.json`.

4. Dependencies

5. Design Principles
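The validation and aggregation steps listed under Technical Highlights can be illustrated with a minimal, standard-library sketch. This is not the code from `models/rccl.py` or `lib/rccl_lib.py`: `RunRecord`, `check_rank_layout`, `aggregate_runs`, and the `*_mean` / `*_std` / `num_runs` output keys are hypothetical names chosen for illustration, and the grouping loop stands in for the `groupby().agg()` call the PR mentions. It assumes each run produces one row per (name, size, type, inPlace) combination and that the standard deviation is the sample standard deviation.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, stdev

METRICS = ("busBw", "algBw", "time")
GROUP_KEYS = ("name", "size", "type", "inPlace")


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical stand-in for one validated single-run result row."""
    name: str
    size: int
    type: str
    inPlace: int
    busBw: float
    algBw: float
    time: float

    def __post_init__(self):
        # Reject NaN/Inf metrics, mirroring the check described for RcclTests.
        for metric in METRICS:
            value = getattr(self, metric)
            if not math.isfinite(value):
                raise ValueError(f"{metric} must be finite, got {value!r}")


def check_rank_layout(ranks, nodes, ranks_per_node):
    """Multi-node metadata check: ranks must equal nodes x ranksPerNode."""
    if ranks != nodes * ranks_per_node:
        raise ValueError(
            f"ranks={ranks} does not equal nodes*ranksPerNode={nodes * ranks_per_node}"
        )


def aggregate_runs(records):
    """Group rows by (name, size, type, inPlace) and compute mean/std per metric."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(getattr(rec, k) for k in GROUP_KEYS)
        groups[key].append(rec)

    aggregated = []
    for key, recs in groups.items():
        row = dict(zip(GROUP_KEYS, key))
        row["num_runs"] = len(recs)
        for metric in METRICS:
            values = [getattr(r, metric) for r in recs]
            row[f"{metric}_mean"] = mean(values)
            # Sample standard deviation; defined as 0.0 for a single run.
            row[f"{metric}_std"] = stdev(values) if len(values) > 1 else 0.0
        aggregated.append(row)
    return aggregated
```

Validating each row at construction time means a malformed value fails before it can skew the mean or standard deviation of its group.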
Test Plan
Results
✅ Aggregation works for single-node and multi-node
✅ Schema validation catches malformed data
✅ Config consistency enforced
✅ Outputs saved in correct format
✅ Detailed logging for debugging
Files Changed:
- `models/rccl.py` (new)
- `models/__init__.py` (new)
- `lib/rccl_lib.py` (modified)

Lines Added: ~250 | Lines Modified: ~70