[PoC] SoL Tensor Copy #7277

@fbusato

Description

Area

General CCCL

Is your feature request related to a problem? Please describe.

Introduction

Tensors, or multidimensional arrays, are fundamental entities in several scientific domains, including physics, simulation, data science and signal/image processing. They are also the basic data structure in major deep learning (DL) frameworks such as PyTorch.

One of the primary aspects of tensor manipulation is copying. This operation can occur within the same memory space, between CPUs and GPUs, or even remotely.

Motivations

  • Performance: Copy performance can dominate overall workload efficiency, even when the compute itself is fast. This matters even more on paths with limited bandwidth, e.g., CPU <-> GPU transfers.

  • Implementation proliferation: As tensor copies become more widespread, software packages provide their own implementations, which leads to duplicated functionality, inefficiency and inconsistent behaviour.

Use Cases

Concrete use cases in the CUDA ecosystem include CPU/GPU and CPU/CPU copies, as well as software and frameworks such as cuda.python, cuda.core, CUB and MathDX.

As a consequence, tensor copy affects libraries and frameworks that rely on CUDA, such as Eigen, Kokkos, [cuda]::std::mdspan, NumPy, CuPy, JAX and PyTorch ATen.

Main Goal

Provide a single source of truth for tensor copying in CCCL that can be reused across the CUDA software stack. This would make it possible to gather use cases in one place and to optimize and tune a single implementation.

Beyond contiguous layouts, the first focus will be on strided layouts.
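For readers less familiar with strided layouts, the following is a minimal host-side sketch of the semantics such a copy has to implement. The name `strided_copy` and its signature are hypothetical and are not a CCCL or CUB API; strides are assumed to be expressed in elements, and no attempt is made at performance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not a CCCL API): copies a tensor element by
// element. Each dimension d has an extent and one stride per side, in elements.
template <typename T>
void strided_copy(const T* src, const std::vector<std::ptrdiff_t>& src_strides,
                  T* dst, const std::vector<std::ptrdiff_t>& dst_strides,
                  const std::vector<std::size_t>& extents)
{
  const std::size_t rank = extents.size();
  assert(src_strides.size() == rank && dst_strides.size() == rank);

  std::size_t total = 1;
  for (std::size_t e : extents) total *= e;

  std::vector<std::size_t> index(rank, 0); // current multi-index
  for (std::size_t n = 0; n < total; ++n) {
    // Map the multi-index to a linear offset on each side.
    std::ptrdiff_t src_off = 0, dst_off = 0;
    for (std::size_t d = 0; d < rank; ++d) {
      src_off += static_cast<std::ptrdiff_t>(index[d]) * src_strides[d];
      dst_off += static_cast<std::ptrdiff_t>(index[d]) * dst_strides[d];
    }
    dst[dst_off] = src[src_off];

    // Advance the multi-index, last dimension fastest.
    for (std::size_t d = rank; d-- > 0;) {
      if (++index[d] < extents[d]) break;
      index[d] = 0;
    }
  }
}
```

With extents `{2, 3}`, source strides `{3, 1}` and destination strides `{1, 2}`, this copies a row-major 2x3 matrix into its transpose, which is the kind of non-contiguous access pattern a tuned implementation would need to handle efficiently.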

Challenges

  • Interface unification: Several frameworks provide their own tensor interface, with different properties and guarantees. The challenge is to find abstractions that fit as many use cases as possible without compromising usability or performance.

  • Performance: Tensor copy is amenable to several optimizations, including vectorization, TMA (Tensor Memory Accelerator), normalization, slicing, loop permutation, etc. Finding and combining optimization opportunities is a hard problem.
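To make two of these optimizations concrete, here is a small sketch of one possible reading of "normalization" combined with loop permutation: dimensions are reordered by destination stride, and neighbouring dimensions that are contiguous on both sides are merged. The `Dim` struct and `normalize` function are hypothetical names introduced for illustration; the sketch assumes positive strides and extents of at least 1, and it is not taken from any existing implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one copy dimension (strides in elements).
struct Dim {
  std::size_t extent;
  std::ptrdiff_t src_stride;
  std::ptrdiff_t dst_stride;
};

// Sketch of a layout normalization pass, assuming positive strides:
//  1. drop dimensions of extent 1 (they never advance the offset),
//  2. order dimensions by destination stride, innermost first
//     (a loop permutation),
//  3. merge neighbours that are contiguous in both source and destination.
std::vector<Dim> normalize(std::vector<Dim> dims)
{
  dims.erase(std::remove_if(dims.begin(), dims.end(),
                            [](const Dim& d) { return d.extent == 1; }),
             dims.end());

  std::sort(dims.begin(), dims.end(), [](const Dim& a, const Dim& b) {
    return a.dst_stride < b.dst_stride;
  });

  std::vector<Dim> out;
  for (const Dim& d : dims) {
    if (!out.empty()) {
      Dim& inner = out.back();
      const auto n = static_cast<std::ptrdiff_t>(inner.extent);
      if (d.src_stride == inner.src_stride * n &&
          d.dst_stride == inner.dst_stride * n) {
        inner.extent *= d.extent; // fold the outer dimension into the inner one
        continue;
      }
    }
    out.push_back(d);
  }
  return out;
}
```

A fully contiguous 2x3x4 copy collapses to a single dimension of 24 elements with unit stride, which is the case where wide vectorized loads and stores are easiest to apply. A transpose keeps two dimensions, so it needs a different strategy.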

Directions

We recognize C++ as a common underlying layer that is widely accepted in many ecosystems. For this reason, we prefer mdspan-based interfaces. However, we are considering non-standard features to overcome potential limitations of the current state of mdspan, e.g., unsupported zero or negative strides.
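As an illustration of why zero and negative strides are useful, the sketch below uses a hypothetical one-dimensional view type with a signed stride. `View1D` and `materialize` are names invented for this example and do not correspond to any mdspan or CCCL facility.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D view: base pointer plus a signed stride in elements.
template <typename T>
struct View1D {
  const T* base;
  std::size_t extent;
  std::ptrdiff_t stride; // may be zero or negative in this sketch

  const T& operator[](std::size_t i) const {
    assert(i < extent);
    return base[static_cast<std::ptrdiff_t>(i) * stride];
  }
};

// Copies the elements of a view into contiguous storage.
template <typename T>
std::vector<T> materialize(const View1D<T>& v)
{
  std::vector<T> out;
  out.reserve(v.extent);
  for (std::size_t i = 0; i < v.extent; ++i) out.push_back(v[i]);
  return out;
}
```

A stride of zero broadcasts a single element across the whole extent, while a stride of -1 with the base pointing at the last element walks the data in reverse. Both patterns are common in array libraries with NumPy-style views.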

CUB already provides a DeviceCopy routine compatible with mdspan, but it only reaches speed-of-light (SOL) performance for contiguous layouts.

Ideally, we would use techniques that are already state of the art. Natural choices in this direction include:

  • CuTe: The framework provides all the features necessary for a SOL implementation. The main issues are that it is an external package outside CCCL, and that it provides many features and use cases beyond tensor copy, which has implications for compile time and binary size. This could lead to a ~CuTe 2.0.

  • CUDA Tile binding: Interoperability between CUDA Tile and the SIMT model still needs to be investigated.

Metadata

Assignees

Labels

No labels

Type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
