Skip to content

[BUG] Sm90 & Sm100 Array gemm kernels read ahead of wait_on_dependent_grids() #2962

@Algy

Description

@Algy

Which component has the problem?

CUTLASS C++

Bug Report

Describe the bug

First, the first element of the array containing problem shapes are accessed in the constructor of group tile scheduler.

ProblemShape problem_shape = params_.problem_shapes_.get_problem_shape(0);

return problem_shapes[group_idx];

TileScheduler is constructed before the wait_on_dependent_grids(), which could risk a chance reading the pointer arrays before dependent data gets flushed into global memory by a preceding kernel.

auto scheduler = [&] () {
// Group scheduler requires a different constructor that takes a response ptr
if constexpr (cute::is_same_v<SchedulerTag, GroupScheduler>) {
return TileScheduler{params.scheduler, shared_storage.scheduler_response};
}
else {
return TileScheduler{params.scheduler};
}
} ();

TileScheduler scheduler(
(!IsTensorMapUpdateAsync || is_participant.sched || is_participant.tensor_map_updater)
? &shared_storage.clc_response[0][0]
: &shared_storage.clc_response[1][0],
params.scheduler,
block_id_in_cluster
);

This would cause a race condition when PDL is enabled.

Steps/Code to reproduce bug

Hard to reproduce as it's a race condition.

I wish there was some compiler support to spot PDL related bugs, e.g. ,print a warning log when codes in a kernel access to global address before wait_on_dependent_grids(). It was frustrating to figure out this kind of issues.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions