
Ensure Consistency among Data Accesses in the same Thread Block #551


Open: wants to merge 6 commits into base branch feature/dsl

Conversation

caiomcbr (Contributor)
This PR ensures that if two or more operations touch the same memory location, a mechanism is inserted to synchronize the thread block between these operations, preventing data races. For instance:

copy chunk 0 -> chunk 1
put chunk 1 -> chunk 2

Here the copy operation and the put operation have chunk 1 in common, for writing and reading respectively. In this case the copy operation must finish before the put operation starts, so a synchronization mechanism is inserted between them to avoid the conflict.

In this PR, we insert synchronization mechanisms between operations under the following conditions:

An operation with a write access followed by an operation with a write access to the same memory location (write-after-write)
An operation with a read access followed by an operation with a write access to the same memory location (write-after-read)
An operation with a write access followed by an operation with a read access to the same memory location (read-after-write)
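The core of these three rules is an interval-overlap test where at least one side writes. A minimal sketch (the names `Access`, `AccessType`, and `needs_sync` are illustrative, not the PR's actual API):

```python
from dataclasses import dataclass
from enum import Enum

class AccessType(Enum):  # hypothetical stand-in for the PR's DataAccessType
    read = "read"
    write = "write"

@dataclass
class Access:  # hypothetical stand-in for the PR's DataAccess
    op_id: int
    start: int  # first chunk index touched
    end: int    # last chunk index touched (inclusive)
    kind: AccessType

def needs_sync(a: Access, b: Access) -> bool:
    """A barrier is required when the two chunk intervals overlap
    and at least one of the accesses is a write (WAW, WAR, or RAW)."""
    overlaps = a.start <= b.end and b.start <= a.end
    has_write = AccessType.write in (a.kind, b.kind)
    return overlaps and has_write
```

Two overlapping reads never require a barrier, which is why read-after-read is absent from the list above.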

@caiomcbr caiomcbr requested a review from Binyang2014 June 17, 2025 23:23
@Binyang2014 Binyang2014 requested a review from Copilot June 20, 2025 17:11

@Copilot Copilot AI left a comment


Pull Request Overview

This PR implements a conflict-detection pass to insert synchronization operations between thread-block operations that read/write the same buffer intervals. It adds data-access tracking (DataAccess), a buffer-access analyzer (BuffersAccess), and hooks into the existing pipeline to resolve dependencies after instruction fusion.

  • Introduced DataAccessType and DataAccess in dsl_types and built buffer_access.py to detect interval conflicts
  • Extended every operation with unique IDs and local_data_access methods to annotate reads/writes
  • Added resolve_data_dependency through ThreadBlock, Gpu, and MSCCLPPProgram.post_process_operations
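A dependency-resolution pass of this shape can be sketched as a single walk over the operation list, inserting a barrier whenever the next operation conflicts with anything recorded since the last barrier (`insert_syncs`, `accesses_of`, and `make_barrier` are hypothetical names, not the PR's code):

```python
def has_conflict(a, b):
    # Accesses are (op_id, start, end, kind) tuples; a barrier is needed
    # when the chunk intervals overlap and at least one access is a write.
    _, a_start, a_end, a_kind = a
    _, b_start, b_end, b_kind = b
    overlaps = a_start <= b_end and b_start <= a_end
    return overlaps and ("write" in (a_kind, b_kind))

def insert_syncs(ops, accesses_of, make_barrier):
    """Walk operations in program order; emit a barrier whenever the next
    operation conflicts with any access recorded since the last barrier."""
    result, pending = [], []
    for op in ops:
        acc = accesses_of(op)
        if any(has_conflict(p, a) for p in pending for a in acc):
            result.append(make_barrier())
            pending = []  # the barrier covers all earlier accesses
        result.append(op)
        pending.extend(acc)
    return result
```

Clearing `pending` after each barrier keeps the scan linear in the common case, since a barrier orders everything before it against everything after it.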

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

python/mscclpp/language/internal/dsl_types.py: Added DataAccessType and DataAccess, reordered SyncType
python/mscclpp/language/internal/buffer_access.py: New BuffersAccess class to scan and sync operations
python/mscclpp/language/internal/operations.py: Gave operations UUIDs and implemented local_data_access
python/mscclpp/language/internal/threadblock.py: Imported buffer_access and added resolve_data_dependency
python/mscclpp/language/internal/gpu.py: Propagated resolve_data_dependency to all threadblocks
python/mscclpp/language/program.py: Called resolve_data_dependency after adding_data_sync
python/mscclpp/language/internal/optmizer.py: Included relaxed_signal/relaxed_wait in sync filter
python/mscclpp/language/tests/*.py: Updated tests for new sync/DSL types and revised logic
python/mscclpp/language/collectives.py & channel.py: Migrated imports to dsl_types
Comments suppressed due to low confidence (1)

python/mscclpp/language/tests/allreduce.py:13

  • [nitpick] The variable num_tb is ambiguous—does it represent the chunk factor, number of thread blocks, or something else? Consider renaming to a more descriptive identifier like chunk_factor or num_thread_blocks.
    num_tb = 8

Comment on lines +373 to +377
for chunk in self.src_buff:
    data_access.append(
        DataAccess(self.id, chunk.index, chunk.index + chunk.size - 1, chunk.type, DataAccessType.read)
    )


Copilot AI Jun 20, 2025


In PutOperation.local_data_access, only read accesses (src_buff) are recorded. You should also append write accesses for each dst_buff chunk to correctly detect write conflicts.

Suggested change:

    # Record read accesses for src_buff
    for chunk in self.src_buff:
        data_access.append(
            DataAccess(self.id, chunk.index, chunk.index + chunk.size - 1, chunk.type, DataAccessType.read)
        )
    # Record write accesses for dst_buff
    for chunk in self.dst_buff:
        data_access.append(
            DataAccess(self.id, chunk.index, chunk.index + chunk.size - 1, chunk.type, DataAccessType.write)
        )
    return data_access


Comment on lines +164 to +165
return list(map_num.keys())[map_num[self] | map_num[other]]


Copilot AI Jun 20, 2025


The __or__ implementation may index past the end of map_num.keys() and is not a reliable reverse mapping. Consider using an explicit mapping from the bitwise OR result back to the correct DataAccessType enum member.

Suggested change:

    reverse_map = {1: DataAccessType.read, 2: DataAccessType.write, 3: DataAccessType.both}
    result = map_num[self] | map_num[other]
    return reverse_map.get(result, NotImplemented)
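A self-contained version of the suggested fix might look like the following sketch (the real enum lives in dsl_types; the bit values here are assumptions mirroring the reviewer's suggested mapping):

```python
from enum import Enum

class DataAccessType(Enum):  # sketch mirroring the PR's enum, not its actual code
    read = "read"
    write = "write"
    both = "both"

    def __or__(self, other):
        if not isinstance(other, DataAccessType):
            return NotImplemented
        # Explicit forward and reverse mappings avoid indexing past the
        # end of a keys() list when the OR result exceeds the enum size.
        bits = {DataAccessType.read: 1, DataAccessType.write: 2, DataAccessType.both: 3}
        reverse = {1: DataAccessType.read, 2: DataAccessType.write, 3: DataAccessType.both}
        return reverse[bits[self] | bits[other]]
```

With read = 1 and write = 2, every OR of two members lands on a defined key, so the reverse lookup is total.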



allreduce_example(args.name, args.num_gpus, args.num_threads_per_block, args.min_message_size, args.max_message_size)

""" allreduce_example("allreduce", 4, 1024, 0, 2**64 - 1) """

Copilot AI Jun 20, 2025


[nitpick] This triple-quoted invocation at the end isn’t executed and can be confusing; consider removing it or converting it to a proper comment or unit test.

Suggested change:

    # Example usage: allreduce_example("allreduce", 4, 1024, 0, 2**64 - 1)


Contributor


So we cannot use types as the module name?

Contributor Author


Yes, I think so. I actually tried different approaches, but I couldn't keep the same name.

@@ -0,0 +1,82 @@
from sortedcontainers import SortedDict
Contributor


Do we need to add this to our requirements.txt?

Contributor


Another concern: the latest sortedcontainers release is from 2021, and there seem to have been no updates since then.

Contributor Author


Interesting. If you want, I can try to find another library, or implement the data structure myself.
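If the dependency becomes a problem, the interval bookkeeping needed here is simple enough to sketch with the standard library's bisect module (illustrative only; `SortedIntervals` is a hypothetical name, not the PR's BuffersAccess implementation):

```python
import bisect

class SortedIntervals:
    """Minimal stdlib sketch of what a SortedDict is used for here:
    intervals kept sorted by start index, with overlap queries."""

    def __init__(self):
        self._starts = []  # start indices, kept sorted
        self._items = []   # (start, end, payload) tuples in the same order

    def add(self, start, end, payload=None):
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._items.insert(i, (start, end, payload))

    def overlapping(self, start, end):
        """Yield stored intervals that overlap [start, end] (inclusive)."""
        for s, e, p in self._items:
            if s > end:
                break  # sorted by start: nothing later can overlap
            if e >= start:
                yield (s, e, p)
```

Insertion is O(n) because of the list insert, which is usually acceptable for the per-threadblock operation counts involved; SortedDict would bring that down to O(log n) amortized.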
