
Add validation for separate GEMM and all-scatter operations in example 20 #234

Merged
mawad-amd merged 5 commits into muhosama/all-scatter-gemm-separatev2 from copilot/add-validation-for-example-20
Oct 13, 2025


Copilot AI commented Oct 13, 2025

Problem

Example 20 (20_gemm_all_scatter_independent) performs GEMM and all-scatter as separate, independent operations, but the validation only checked the GEMM result and didn't properly validate the all-scatter communication. Additionally, there was no support for using different tensor dimensions for the communication operation versus the GEMM operation.

Solution

This PR adds comprehensive validation for both operations and support for separate tensor dimensions.

New validation function

Added validate_all_scatter() in examples/common/validation.py to validate all-scatter communication patterns:

  • Verifies the global tensor has the correct shape (M, N × world_size)
  • Checks that each rank's contribution is at the correct position (columns rank × N to (rank + 1) × N)
  • Provides detailed error messages when validation fails
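
The checks listed above could look roughly like the following sketch. It uses NumPy on a single process instead of the example's real tensors, and the function name, the `expected_blocks` argument, the tolerance, and the `(ok, message)` return value are assumptions made for illustration; none of them are taken from examples/common/validation.py.

```python
import numpy as np


def validate_all_scatter_sketch(global_tensor, expected_blocks, atol=1e-6):
    """Hypothetical stand-in for validate_all_scatter(); not the repository's code.

    global_tensor   : array expected to have shape (M, N * world_size)
    expected_blocks : list of (M, N) arrays, one per rank, in rank order
    Returns (ok, message).
    """
    world_size = len(expected_blocks)
    m, n = expected_blocks[0].shape

    # Check 1: the gathered tensor must have shape (M, N * world_size).
    expected_shape = (m, n * world_size)
    if global_tensor.shape != expected_shape:
        return False, f"shape mismatch: expected {expected_shape}, got {global_tensor.shape}"

    # Check 2: rank r's block must occupy columns [r * N, (r + 1) * N).
    for rank, block in enumerate(expected_blocks):
        start, end = rank * n, (rank + 1) * n
        actual = global_tensor[:, start:end]
        if not np.allclose(actual, block, atol=atol):
            max_diff = float(np.max(np.abs(actual - block)))
            return False, (
                f"rank {rank} block (columns {start}:{end}) differs, "
                f"max abs diff {max_diff:.3e}"
            )

    return True, "ok"


# Usage: 4 ranks, each contributing a 2 x 3 block filled with its rank id.
blocks = [np.full((2, 3), float(r)) for r in range(4)]
good = np.concatenate(blocks, axis=1)
ok_good, msg_good = validate_all_scatter_sketch(good, blocks)

bad = good.copy()
bad[0, 7] = -1.0  # corrupt one element inside rank 2's columns (6 to 8)
ok_bad, msg_bad = validate_all_scatter_sketch(bad, blocks)
```

Returning the failing rank and column range in the message is one way to get the "detailed error messages" mentioned above; the real function may report failures differently.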

Updated example 20

New command-line arguments:

  • --m_comm: Number of rows for communication tensor (defaults to m)
  • --n_comm: Total number of columns for communication tensor (defaults to n)
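
One way to get the "defaults to m" and "defaults to n" behaviour is to leave the communication arguments unset and fill them in after parsing. The sketch below is an assumption about how this could be wired with `argparse`; the `-m`/`-n` flag spellings and the placeholder default sizes are hypothetical and may not match benchmark.py.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of example-20-style arguments")
    # GEMM dimensions. The default values are placeholders, not the benchmark's.
    parser.add_argument("-m", type=int, default=1024, help="Rows of the GEMM output")
    parser.add_argument("-n", type=int, default=2048, help="Total columns of the GEMM output")
    # Communication dimensions. None means "fall back to the GEMM dimension".
    parser.add_argument("--m_comm", type=int, default=None,
                        help="Rows of the communication tensor (defaults to m)")
    parser.add_argument("--n_comm", type=int, default=None,
                        help="Total columns of the communication tensor (defaults to n)")
    args = parser.parse_args(argv)

    # Resolve the fallbacks once so later code can use m_comm / n_comm directly.
    if args.m_comm is None:
        args.m_comm = args.m
    if args.n_comm is None:
        args.n_comm = args.n
    return args


defaults = parse_args([])
custom = parse_args(["--m_comm", "4096", "--n_comm", "9216"])
```

Resolving the fallback after parsing, rather than hard-coding a second default, keeps the communication tensor in step with the GEMM tensor whenever only `m` or `n` is overridden.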

Separate validation:

  • GEMM validation: Checks A @ B == C
  • All-scatter validation: Verifies each rank's data is correctly scattered to all ranks
  • JSON output now includes success_gemm and success_comm fields for detailed reporting
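
To illustrate why two separate flags are useful, here is a small single-process sketch in which the GEMM check passes while the communication check fails. Only the field names `success_gemm` and `success_comm` come from this PR; the NumPy stand-ins for the kernel outputs and the deliberately corrupted block are assumptions for demonstration.

```python
import json

import numpy as np

rng = np.random.default_rng(0)

# GEMM check: compare a stand-in "kernel output" C against the reference A @ B.
A = rng.standard_normal((8, 4))
B = rng.standard_normal((4, 6))
C = np.einsum("ik,kj->ij", A, B)  # stand-in for the GEMM kernel result
success_gemm = bool(np.allclose(A @ B, C))

# All-scatter check: rank r is expected to contribute a block filled with r.
world_size, m_comm, n_local = 2, 2, 3
blocks = [np.full((m_comm, n_local), float(r)) for r in range(world_size)]
gathered = np.concatenate(blocks, axis=1)
gathered[:, n_local:] = 0.0  # simulate rank 1's block never arriving
success_comm = all(
    np.array_equal(gathered[:, r * n_local:(r + 1) * n_local], blocks[r])
    for r in range(world_size)
)

# Report the two results independently, as the JSON output does.
report = {"success_gemm": success_gemm, "success_comm": bool(success_comm)}
report_json = json.dumps(report)
```

With a single combined flag, this run would only say "failed"; the split fields show that the GEMM was correct and the communication was not.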

Example usage:

# Use default dimensions
python examples/20_gemm_all_scatter_independent/benchmark.py -v

# Use custom communication dimensions
python examples/20_gemm_all_scatter_independent/benchmark.py -v --m_comm 4096 --n_comm 9216

Implementation details

The validation correctly handles the all-scatter pattern where:

  1. Each rank initializes its portion of a global tensor
  2. The all-scatter kernel scatters each rank's portion to all other ranks
  3. After completion, all ranks have the full tensor with contributions from all ranks
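
The three steps above can be mimicked on a single process, which is roughly what a simulation test of the logic might do. This sketch models each rank's memory as a NumPy array in a list; it is an illustration of the data movement only, not the actual kernel, and the sizes are arbitrary.

```python
import numpy as np

world_size, m, n_local = 4, 2, 3

# Step 1: each rank owns a full-size buffer but fills only its own column block.
buffers = [np.zeros((m, n_local * world_size)) for _ in range(world_size)]
for rank in range(world_size):
    buffers[rank][:, rank * n_local:(rank + 1) * n_local] = rank + 1

# Step 2: each rank writes its block into every other rank's buffer.
for src in range(world_size):
    cols = slice(src * n_local, (src + 1) * n_local)
    for dst in range(world_size):
        if dst != src:
            buffers[dst][:, cols] = buffers[src][:, cols]

# Step 3: every rank now holds the same full tensor.
replicated = all(np.array_equal(buffers[0], b) for b in buffers)
```

Filling each block with `rank + 1` rather than `rank` keeps rank 0's contribution distinguishable from the zero-initialised buffer, so a missing write is not mistaken for valid data.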

The n_comm argument represents the total number of columns (consistent with n semantics for GEMM), and internally n_comm_local = n_comm // world_size is computed for per-rank columns. The all-scatter kernel takes n_comm_local and produces an m_comm × n_comm tensor that is replicated across all ranks.
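
As a worked example of this arithmetic, the sketch below computes `n_comm_local` and each rank's column range. The helper name and the divisibility check are assumptions added for illustration (the PR does not say how a non-divisible `n_comm` is handled), and the world size of 8 is only an example value.

```python
def per_rank_columns(n_comm, world_size):
    """Return the [start, end) column range each rank contributes.

    Hypothetical helper; rejecting non-divisible sizes is an assumption.
    """
    if n_comm % world_size != 0:
        raise ValueError(f"n_comm={n_comm} is not divisible by world_size={world_size}")
    n_comm_local = n_comm // world_size
    return [(rank * n_comm_local, (rank + 1) * n_comm_local) for rank in range(world_size)]


# With --n_comm 9216 on 8 ranks, each rank contributes 9216 // 8 = 1152 columns.
ranges = per_rank_columns(9216, 8)
first_rank, last_rank = ranges[0], ranges[-1]
```

Here rank 0 covers columns 0 to 1152 and rank 7 covers columns 8064 to 9216, so the ranges tile the full `n_comm` width with no gaps or overlap.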

The validate_all_scatter() function is reusable and can be adopted by other examples that perform all-scatter operations.

Testing

  • All changes pass linting (ruff check and ruff format)
  • Logic validated with simulation tests
  • Import and syntax checks successful

Fixes #233

Original prompt

This section details the original issue you should resolve

<issue_title>Add validation for example 20</issue_title>
<issue_description>Right now, the validation assumes gemm -> all-scatter dependency. For example 20 we separate those out as separate operations, and that needs to be correctly validate.

Also add support for arguments where separate m/n/k can be supplied for comm tensor vs. gemm tensor.</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #233



Copilot AI and others added 3 commits October 13, 2025 19:40
…e 20

Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add validation for example 20 operations" to "Add validation for separate GEMM and all-scatter operations in example 20" Oct 13, 2025
Copilot AI requested a review from neoblizz October 13, 2025 19:48
…per-rank columns

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI requested a review from mawad-amd October 13, 2025 20:03
@mawad-amd mawad-amd marked this pull request as ready for review October 13, 2025 20:41
@mawad-amd mawad-amd requested a review from BKP as a code owner October 13, 2025 20:41
@mawad-amd mawad-amd merged commit c18aded into muhosama/all-scatter-gemm-separatev2 Oct 13, 2025
3 checks passed
@mawad-amd mawad-amd deleted the copilot/add-validation-for-example-20 branch October 13, 2025 20:41
@neoblizz neoblizz restored the copilot/add-validation-for-example-20 branch October 14, 2025 00:01
@neoblizz neoblizz deleted the copilot/add-validation-for-example-20 branch October 14, 2025 21:11
3 participants