Which component has the problem?
CUTLASS C++
Bug Report
Describe the bug
First, the first element of the array containing problem shapes are accessed in the constructor of group tile scheduler.
|
ProblemShape problem_shape = params_.problem_shapes_.get_problem_shape(0); |
|
return problem_shapes[group_idx]; |
TileScheduler is constructed before the wait_on_dependent_grids(), which could risk a chance reading the pointer arrays before dependent data gets flushed into global memory by a preceding kernel.
|
auto scheduler = [&] () { |
|
// Group scheduler requires a different constructor that takes a response ptr |
|
if constexpr (cute::is_same_v<SchedulerTag, GroupScheduler>) { |
|
return TileScheduler{params.scheduler, shared_storage.scheduler_response}; |
|
} |
|
else { |
|
return TileScheduler{params.scheduler}; |
|
} |
|
} (); |
|
|
|
TileScheduler scheduler( |
|
(!IsTensorMapUpdateAsync || is_participant.sched || is_participant.tensor_map_updater) |
|
? &shared_storage.clc_response[0][0] |
|
: &shared_storage.clc_response[1][0], |
|
params.scheduler, |
|
block_id_in_cluster |
|
); |
This would cause a race condition when PDL is enabled.
Steps/Code to reproduce bug
Hard to reproduce as it's a race condition.
I wish there was some compiler support to spot PDL related bugs, e.g. ,print a warning log when codes in a kernel access to global address before wait_on_dependent_grids(). It was frustrating to figure out this kind of issues.
Which component has the problem?
CUTLASS C++
Bug Report
Describe the bug
First, the first element of the array containing problem shapes are accessed in the constructor of group tile scheduler.
cutlass/include/cutlass/gemm/kernel/sm90_tile_scheduler_group.hpp
Line 267 in 8debf77
cutlass/include/cutlass/gemm/group_array_problem_shape.hpp
Line 68 in 8debf77
TileScheduler is constructed before the
wait_on_dependent_grids(), which could risk a chance reading the pointer arrays before dependent data gets flushed into global memory by a preceding kernel.cutlass/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp
Lines 466 to 475 in 8debf77
cutlass/include/cutlass/gemm/kernel/sm100_gemm_array_tma_warpspecialized.hpp
Lines 836 to 842 in 8debf77
This would cause a race condition when PDL is enabled.
Steps/Code to reproduce bug
Hard to reproduce as it's a race condition.
I wish there was some compiler support to spot PDL related bugs, e.g. ,print a warning log when codes in a kernel access to global address before wait_on_dependent_grids(). It was frustrating to figure out this kind of issues.