Is this a duplicate?
Area
General CCCL
Is your feature request related to a problem? Please describe.
Introduction
Tensors, or multidimensional arrays, are fundamental entities in several scientific domains, including physics, simulation, data science and signal/image processing. They are also the basic data structure in major deep learning (DL) frameworks such as PyTorch.
One of the primary aspects of tensor manipulation is copying. This operation can occur within the same memory space, between CPUs and GPUs, or even remotely.
Motivations
- Performance: The cost of these operations can dominate overall workload efficiency, even when the compute itself is fast. This matters even more when bandwidth is limited, e.g., for CPU <-> GPU transfers.
- Implementation proliferation: As tensor copies become more widespread, software packages provide their own implementations, which leads to duplicated functionality, inefficiency and inconsistent behaviour.
Use Cases
Concrete use cases in the CUDA ecosystem include CPU/GPU and CPU/CPU copies, as well as software and frameworks such as cuda.python, cuda.core, CUB and MathDX.
As a consequence, tensor copy affects libraries and frameworks that rely on CUDA, such as Eigen, Kokkos, [cuda]::std::mdspan, NumPy, CuPy, JAX and Python ATen.
Main Goal
Provide a single source of truth for tensor copying in CCCL that can be reused across the CUDA software stack. This would let us gather use cases in one place and optimize and tune a single implementation.
Beyond contiguous layouts, the first focus will be strided layouts.
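To make the strided case concrete, here is a minimal host-side sketch of what a reference strided copy could look like. The name `copy_strided` and its signature are hypothetical and are not an existing CCCL or CUB API. Strides are assumed to be expressed in elements, and no attempt is made at performance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not a CCCL API): copies a rank-N tensor
// element by element. Extents and strides are given per dimension, and
// strides are expressed in elements, not bytes.
template <typename T>
void copy_strided(const T* src, T* dst,
                  const std::vector<std::size_t>& extents,
                  const std::vector<std::ptrdiff_t>& src_strides,
                  const std::vector<std::ptrdiff_t>& dst_strides)
{
  const std::size_t rank = extents.size();
  assert(src_strides.size() == rank && dst_strides.size() == rank);

  // Total number of elements to copy (1 for a rank-0 tensor).
  std::size_t total = 1;
  for (std::size_t e : extents) total *= e;

  std::vector<std::size_t> idx(rank, 0); // current multi-index
  for (std::size_t n = 0; n < total; ++n) {
    // Map the multi-index to a linear offset on each side.
    std::ptrdiff_t src_off = 0, dst_off = 0;
    for (std::size_t d = 0; d < rank; ++d) {
      src_off += static_cast<std::ptrdiff_t>(idx[d]) * src_strides[d];
      dst_off += static_cast<std::ptrdiff_t>(idx[d]) * dst_strides[d];
    }
    dst[dst_off] = src[src_off];

    // Advance the multi-index like an odometer, last dimension fastest.
    for (std::size_t d = rank; d-- > 0;) {
      if (++idx[d] < extents[d]) break;
      idx[d] = 0;
    }
  }
}
```

For example, copying a row-major 2x3 source with strides `{3, 1}` into a destination with strides `{1, 2}` writes the transpose. An optimized implementation would need to do far better than this loop, which is the subject of the challenges below.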
Challenges
- Interface unification: Several frameworks provide their own tensor interface, with different properties and guarantees. The challenge is to find abstractions that fit as many use cases as possible without compromising usability or performance.
- Performance: Tensor copy is amenable to several optimizations, including vectorization, TMA, normalization, slicing, loop permutation, etc. Finding and combining these optimization opportunities is a hard problem.
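As an illustration of the kind of preprocessing involved, the sketch below shows one possible reading of "normalization": dropping extent-1 dimensions and merging adjacent dimensions that are contiguous with respect to each other on both the source and the destination side. The `CopyDim` type and `coalesce_dims` function are hypothetical names introduced for this example only, and dimensions are assumed to be listed from outermost to innermost.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one dimension of a copy: the shared extent and
// the element strides on the source and destination side.
struct CopyDim {
  std::size_t extent;
  std::ptrdiff_t src_stride;
  std::ptrdiff_t dst_stride;
};

// Merge adjacent dimensions (outer first, inner last) whenever stepping the
// outer index by one equals walking the whole inner dimension on both sides.
// An empty result means a single-element (scalar) copy.
std::vector<CopyDim> coalesce_dims(const std::vector<CopyDim>& dims)
{
  std::vector<CopyDim> out;
  for (const CopyDim& d : dims) {
    if (d.extent == 1) continue; // extent-1 dimensions never affect offsets
    if (!out.empty()) {
      CopyDim& outer = out.back();
      const auto inner_span = static_cast<std::ptrdiff_t>(d.extent);
      if (outer.src_stride == inner_span * d.src_stride &&
          outer.dst_stride == inner_span * d.dst_stride) {
        // Fold the inner dimension into the outer one.
        outer.extent *= d.extent;
        outer.src_stride = d.src_stride;
        outer.dst_stride = d.dst_stride;
        continue;
      }
    }
    out.push_back(d);
  }

  // Debug check: coalescing must preserve the number of elements.
  auto count = [](const std::vector<CopyDim>& v) {
    std::size_t n = 1;
    for (const CopyDim& d : v) n *= d.extent;
    return n;
  };
  assert(count(out) == count(dims));
  return out;
}
```

With this sketch, a fully contiguous 2x3x4 copy collapses to one dimension of 24 elements, which is the shape a vectorized path would want. If the destination rows are padded (outer destination stride 16 instead of 12), only the two inner dimensions merge and the copy stays rank 2.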
Directions
We recognize C++ as a common underlying layer that is widely accepted in many ecosystems. For this reason, we prefer mdspan-based interfaces. However, we are considering non-standard features to work around potential limitations of the current standard, e.g., unsupported zero or negative strides.
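To show why zero and negative strides matter, the sketch below uses a hypothetical `SignedView2D` type with signed element strides. It is an illustration only, neither `std::mdspan` nor a CCCL type. Our understanding is that the standard `layout_stride` mapping requires strictly positive strides, so views like these cannot be expressed with it directly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 2-D read-only view with signed element strides.
struct SignedView2D {
  const int* origin;         // address of element (0, 0)
  std::size_t rows, cols;
  std::ptrdiff_t row_stride; // may be zero (broadcast) or negative (reversed)
  std::ptrdiff_t col_stride;

  int operator()(std::size_t i, std::size_t j) const {
    assert(i < rows && j < cols);
    return origin[static_cast<std::ptrdiff_t>(i) * row_stride +
                  static_cast<std::ptrdiff_t>(j) * col_stride];
  }
};

// Present a vector of `cols` elements as `rows` identical rows (row stride 0).
inline SignedView2D broadcast_rows(const int* data, std::size_t rows, std::size_t cols) {
  return {data, rows, cols, 0, 1};
}

// Present a vector of `cols` elements back to front (column stride -1).
// The origin points at the last element so that offsets stay in bounds.
inline SignedView2D reversed(const int* data, std::size_t cols) {
  assert(cols > 0);
  return {data + (cols - 1), 1, cols, 0, -1};
}
```

A zero stride corresponds to the broadcasting that array libraries commonly expose, and a negative stride corresponds to a reversed slice. A copy routine that accepts such views as sources has to handle both without materializing an intermediate buffer.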
CUB already provides a DeviceCopy routine compatible with mdspan, but it reaches speed-of-light (SOL) performance only for contiguous layouts.
Ideally, we would use techniques that are already state of the art. Natural choices in this direction include:
- CuTE. The framework provides all the features necessary for a SOL implementation. The main issues are that it is an external package outside CCCL and that it provides many additional features and use cases beyond tensor copy, which has implications for compile time and binary size. This could lead to a ~CuTE 2.0.
- CUDA Tile binding. Need to investigate CUDA Tile interoperability with the SIMT model.