[FEA]: Require ct.barrier for multi stage kernels #37

@ZhangZhiPku

Description

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request?

High

Please provide a clear description of problem this feature solves

In CUDA programming, atomic operations or cooperative groups are used to synchronize execution across thread blocks.
cuTile could provide a similar mechanism so that developers can write complex multi-stage kernels more simply.

Feature Description

Example:

import torch
import cuda.tile as ct

@ct.kernel
def device_norm(
    x: ct.Array, y: ct.Array, workspace: ct.Array, 
    tile_size: ct.Constant, p: ct.Constant):
    # Create a barrier in global memory that expects p blocks to reach it.
    barrier = ct.barrier(p=p)
    block_id = ct.bid(0)
    
    tile = ct.load(x, index=(block_id, 0), shape=(1, tile_size))
    mean = ct.sum(tile) / tile_size
    
    ct.atomic_add(workspace, (0, ), mean)
    # Wait until all p blocks have reached this point.
    barrier.wait()

    global_mean = ct.load(workspace, (0, ), (1, ))
    global_mean = global_mean / p
    tile = tile - global_mean
    
    # Write the normalized tile back; y is assumed to have the same 2D layout as x.
    ct.store(y, index=(block_id, 0), tile=tile)
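
For checking results, the two stages of the example above can be mirrored on the host. This is a minimal pure-Python sketch, assuming every block handles a row of the same length (so the mean of the per-block means equals the global mean). `device_norm_reference` is a hypothetical helper name and is not part of cuTile.

```python
def device_norm_reference(x):
    """Host-side reference for the two-stage example: subtract the global mean.

    x is a list of p rows (one row per block), all of the same length.
    """
    p = len(x)
    # Stage 1: each "block" computes its own mean and adds it to the workspace.
    workspace = 0.0
    for row in x:
        workspace += sum(row) / len(row)
    # On the device, the proposed barrier would sit between the two stages.
    # Stage 2: every block reads the accumulated value and normalizes its tile.
    global_mean = workspace / p
    return [[v - global_mean for v in row] for row in x]
```

Without the barrier, a block could read `workspace` before all other blocks have added their contribution, which is the race the feature request is meant to prevent.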

Describe your ideal solution

Provide ct.barrier, or a similar feature, to make it easier for developers to write applications that require block-level synchronization.

There are multiple ways to implement ct.barrier:

  1. Allocate a region in global memory for synchronization, let each block atomically increment a counter when it reaches the barrier, and have it wait until the counter equals the expected number of blocks.
  2. Use cooperative groups.
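
The first option can be illustrated with a CPU analogy. The sketch below is not cuTile code and not a proposed implementation: it uses Python threads to stand in for blocks, a lock-protected integer to stand in for an atomic counter in global memory, and the hypothetical names `CounterBarrier` and `run_blocks`.

```python
import threading
import time


class CounterBarrier:
    """Single-use barrier built from an atomically incremented counter."""

    def __init__(self, p):
        self.p = p                      # number of participants expected
        self.count = 0                  # stands in for a counter in global memory
        self._lock = threading.Lock()   # stands in for a hardware atomic

    def wait(self):
        with self._lock:                # atomic increment on arrival
            self.count += 1
        while True:                     # spin until all p participants have arrived
            with self._lock:
                if self.count >= self.p:
                    return
            time.sleep(0)               # yield so other threads can make progress


def run_blocks(values):
    """Each thread plays one block: accumulate, wait, then read the total."""
    p = len(values)
    barrier = CounterBarrier(p)
    workspace = [0.0]
    ws_lock = threading.Lock()
    seen = [None] * p

    def block(block_id):
        with ws_lock:                   # analogue of an atomic add to the workspace
            workspace[0] += values[block_id]
        barrier.wait()                  # no block passes until all have added
        seen[block_id] = workspace[0] / p

    threads = [threading.Thread(target=block, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Two caveats apply on a real GPU. A spinning barrier like this only terminates if all p blocks are resident at the same time; if some blocks cannot be scheduled until others finish, the waiting blocks deadlock. CUDA cooperative launches exist to guarantee that co-residency, which is why the second option is attractive. The counter also has to be reset (or paired with a generation number) before the barrier can be reused.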

Describe any alternatives you have considered

No response

Additional context

No response

Contributing Guidelines

  • I agree to follow cuTile Python's contributing guidelines
  • I have searched the open feature requests and have found no duplicates for this feature request
