-
Notifications
You must be signed in to change notification settings - Fork 55
Ensure Consistency among Data Accesses in the same Thread Block #551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feature/dsl
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a conflict-detection pass to insert synchronization operations between thread‐block operations that read/write the same buffer intervals. It adds data‐access tracking (DataAccess
), a buffering‐access analyzer (BuffersAccess
), and hooks into the existing pipeline to resolve dependencies after instruction fusion.
- Introduced
DataAccessType
andDataAccess
indsl_types
and builtbuffer_access.py
to detect interval conflicts - Extended every operation with unique IDs and
local_data_access
methods to annotate reads/writes - Added
resolve_data_dependency
throughThreadBlock
,Gpu
, andMSCCLPPProgram.post_process_operations
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
python/mscclpp/language/internal/dsl_types.py | Added DataAccessType , DataAccess , reordered SyncType |
python/mscclpp/language/internal/buffer_access.py | New BuffersAccess class to scan and sync operations |
python/mscclpp/language/internal/operations.py | Gave operations UUIDs and implemented local_data_access |
python/mscclpp/language/internal/threadblock.py | Imported buffer_access and added resolve_data_dependency |
python/mscclpp/language/internal/gpu.py | Propagated resolve_data_dependency to all threadblocks |
python/mscclpp/language/program.py | Called resolve_data_dependency after adding_data_sync |
python/mscclpp/language/internal/optmizer.py | Included relaxed_signal /relaxed_wait in sync filter |
python/mscclpp/language/tests/*.py | Updated tests for new sync/DSL types and revised logic |
python/mscclpp/language/collectives.py & channel.py | Migrated imports to dsl_types |
Comments suppressed due to low confidence (1)
python/mscclpp/language/tests/allreduce.py:13
- [nitpick] The variable
num_tb
is ambiguous—does it represent the chunk factor, number of thread blocks, or something else? Consider renaming to a more descriptive identifier likechunk_factor
ornum_thread_blocks
.
num_tb = 8
for chunk in self.src_buff: | ||
data_access.append( | ||
DataAccess(self.id, chunk.index, chunk.index + chunk.size - 1, chunk.type, DataAccessType.read) | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PutOperation.local_data_access
, only read accesses (src_buff) are recorded. You should also append write accesses for each dst_buff
chunk to correctly detect write conflicts.
for chunk in self.src_buff: | |
data_access.append( | |
DataAccess(self.id, chunk.index, chunk.index + chunk.size - 1, chunk.type, DataAccessType.read) | |
) | |
# Record read accesses for src_buff | |
for chunk in self.src_buff: | |
data_access.append( | |
DataAccess(self.id, chunk.index, chunk.index + chunk.size - 1, chunk.type, DataAccessType.read) | |
) | |
# Record write accesses for dst_buff | |
for chunk in self.dst_buff: | |
data_access.append( | |
DataAccess(self.id, chunk.index, chunk.index + chunk.size - 1, chunk.type, DataAccessType.write) | |
) | |
return data_access |
Copilot uses AI. Check for mistakes.
return list(map_num.keys())[map_num[self] | map_num[other]] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The __or__
implementation may index past the end of map_num.keys()
and is not a reliable reverse mapping. Consider using an explicit mapping from the bitwise OR result back to the correct DataAccessType
enum member.
return list(map_num.keys())[map_num[self] | map_num[other]] | |
reverse_map = {1: DataAccessType.read, 2: DataAccessType.write, 3: DataAccessType.both} | |
result = map_num[self] | map_num[other] | |
return reverse_map.get(result, NotImplemented) |
Copilot uses AI. Check for mistakes.
|
||
allreduce_example(args.name, args.num_gpus, args.num_threads_per_block, args.min_message_size, args.max_message_size) | ||
|
||
""" allreduce_example("allreduce", 4, 1024, 0, 2**64 - 1) """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] This triple-quoted invocation at the end isn’t executed and can be confusing; consider removing it or converting it to a proper comment or unit test.
""" allreduce_example("allreduce", 4, 1024, 0, 2**64 - 1) """ | |
# Example usage: allreduce_example("allreduce", 4, 1024, 0, 2**64 - 1) |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we can not use types as the module name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so, actually I try different approaches, but I couldn't keep the same name.
@@ -0,0 +1,82 @@ | |||
from sortedcontainers import SortedDict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to put this in our reuqirements.txt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another concern is the latest sortedcontainers release is 2021, seems no updates since then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. If you want I can try to find another library or code the struct by myself as well.
The PR ensures that if two or more operations impact the same memory location, it will insert a mechanism to synchronize the thread block between theses operations, preventing any errors. For instance:
copy chunk 0 -> chunk 1
put chunk 1 -> chunk 2
Here we can see the copy operation and put operation have chunk 1 in common for writing and reading, respectively. In this case, I need to make sure the copy operation ends before starting the put operation. Therefore, we should have a synchronization mechanism between them to avoid any conflicts.
In this PR, we will insert synchronization mechanisms between the operations under the following conditions:
Operation using Write Data Access followed by Operation using Write Data Access to the same memory location
Operation using Read Data Access followed by Operation using Write Data Access to the same memory location
Operation using Write Data Access followed by Operation using Read Data Access to the same memory location