[PoC] SoL Tensor Copy

### Is this a duplicate?

- [x] I confirmed there appear to be no [duplicate issues](https://github.com/NVIDIA/cccl/issues) for this request and that I agree to the [Code of Conduct](CODE_OF_CONDUCT.md)

### Area

General CCCL

### Is your feature request related to a problem? Please describe.

## Introduction

**Tensors**, or **multidimensional arrays**, are fundamental entities in several scientific domains, including physics, simulation, data science and signal/image processing. They are also the basic data structure in major deep learning (DL) frameworks such as PyTorch.

One of the primary aspects of tensor manipulation is **copying**. This operation can occur within the same memory space, between CPUs and GPUs, or even remotely.

## Motivations

- **Performance**: Copy performance can be the dominant factor in overall workload efficiency, even when the compute itself is fast. This matters even more when memory bandwidth is limited, e.g., for CPU <-> GPU transfers.

- **Implementation proliferation**: As tensor copies become more widespread, software packages provide their own implementations, which can lead to duplicated functionality, inefficiency and inconsistent behaviour.

## Use Cases

Concrete use cases in the CUDA ecosystem include CPU/GPU and CPU/CPU copies, as well as software and frameworks such as `cuda.python`, `cuda.core`, `CUB` and [`MathDX`](https://docs.nvidia.com/cuda/cublasdx/api/other_tensors.html#copying-tensors).

As a consequence, tensor copy affects libraries and frameworks that rely on CUDA, such as Eigen, Kokkos, `[cuda]::std::mdspan`, NumPy, CuPy, JAX and PyTorch's ATen.

## Main Goal

Provide a **single source of truth** for tensor copying in CCCL that can be reused across the CUDA software stack. This would make it possible to gather use cases in one place and to optimize and tune a single implementation.

Beyond contiguous layouts, the initial focus will be on strided layouts.
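To make the strided case concrete, here is a minimal host-side reference sketch of a rank-N strided copy. It is only meant to pin down the semantics that an optimized implementation would need to preserve. `StridedView` and `copy_strided` are hypothetical names introduced for illustration, not CCCL or CUB APIs, and strides are assumed to be expressed in elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not a CCCL API): a rank-N strided view over T.
// Strides are expressed in elements, not bytes.
template <class T>
struct StridedView {
  T* data;
  std::vector<std::size_t> extents;
  std::vector<std::ptrdiff_t> strides;
};

// Reference copy: visits every multi-index (last dimension fastest) and
// copies src(idx) into dst(idx). Both views must have the same extents.
template <class T>
void copy_strided(StridedView<const T> src, StridedView<T> dst) {
  assert(src.extents == dst.extents);
  assert(src.strides.size() == src.extents.size());
  assert(dst.strides.size() == dst.extents.size());

  const std::size_t rank = src.extents.size();
  std::size_t total = 1;
  for (std::size_t e : src.extents) total *= e;

  std::vector<std::size_t> idx(rank, 0);
  for (std::size_t n = 0; n < total; ++n) {
    // Linearize the current multi-index with each view's own strides.
    std::ptrdiff_t s = 0;
    std::ptrdiff_t d = 0;
    for (std::size_t r = 0; r < rank; ++r) {
      s += static_cast<std::ptrdiff_t>(idx[r]) * src.strides[r];
      d += static_cast<std::ptrdiff_t>(idx[r]) * dst.strides[r];
    }
    dst.data[d] = src.data[s];

    // Advance the multi-index like an odometer.
    for (std::size_t r = rank; r-- > 0;) {
      if (++idx[r] < src.extents[r]) break;
      idx[r] = 0;
    }
  }
}
```

For example, copying a 2x3 row-major source (strides `{3, 1}`) into a destination viewed with strides `{1, 2}` writes the transpose into the destination buffer.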

## Challenges

- **Interface unification**: Several frameworks provide their own tensor interface, with different properties and guarantees. The challenge is to find abstractions that fit as many use cases as possible without compromising usability or performance.

- **Performance**: Tensor copy is amenable to several optimizations, including vectorization, TMA, normalization, slicing, loop permutation, etc. Finding and combining these optimization opportunities is a hard problem.
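As an illustration of the normalization and loop-permutation ideas, the sketch below simplifies the iteration space of a copy before any kernel is chosen: it drops extent-1 dimensions, orders the remaining ones by destination stride, and merges neighbouring dimensions that are contiguous in both source and destination. `CopyDim` and `normalize_copy_dims` are hypothetical names, strides are assumed to be positive and in elements, and ordering by destination stride is one possible heuristic rather than a recommendation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one dimension of a copy (not a CCCL type).
struct CopyDim {
  std::size_t extent;
  std::ptrdiff_t src_stride;  // in elements
  std::ptrdiff_t dst_stride;  // in elements
};

// Simplifies the iteration space of a strided copy without changing
// which (source, destination) element pairs are visited.
inline std::vector<CopyDim> normalize_copy_dims(std::vector<CopyDim> dims) {
  // Assumption of this sketch: strictly positive strides.
  for (const CopyDim& d : dims) {
    assert(d.src_stride > 0 && d.dst_stride > 0);
  }

  // 1. Extent-1 dimensions never contribute to an offset: drop them.
  dims.erase(std::remove_if(dims.begin(), dims.end(),
                            [](const CopyDim& d) { return d.extent == 1; }),
             dims.end());

  // 2. Loop permutation: largest destination stride outermost.
  std::stable_sort(dims.begin(), dims.end(),
                   [](const CopyDim& a, const CopyDim& b) {
                     return a.dst_stride > b.dst_stride;
                   });

  // 3. Merge an outer dimension with the next inner one when the pair is
  //    contiguous in both source and destination.
  std::vector<CopyDim> out;
  for (const CopyDim& d : dims) {
    if (!out.empty()) {
      CopyDim& outer = out.back();
      const auto n = static_cast<std::ptrdiff_t>(d.extent);
      if (outer.src_stride == d.src_stride * n &&
          outer.dst_stride == d.dst_stride * n) {
        outer.extent *= d.extent;
        outer.src_stride = d.src_stride;
        outer.dst_stride = d.dst_stride;
        continue;
      }
    }
    out.push_back(d);
  }
  return out;
}
```

With this sketch, a fully contiguous 2x3x4 copy collapses to a single dimension of extent 24 and stride 1, which is the shape a vectorized or bulk-copy path would want to see, while a 2x3 transpose keeps two dimensions because no pair is contiguous on both sides.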

## Directions

We recognize C++ as a common underlying layer that is widely accepted in many ecosystems. For this reason, we prefer `mdspan`-based interfaces. However, we are considering non-standard features to overcome potential limitations of the current standard, e.g., unsupported zero or negative strides.
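To illustrate why zero and negative strides matter, the sketch below uses a deliberately simple 1-D view with a signed element stride. A stride of zero expresses a broadcast (every index maps to the same element) and a stride of -1 expresses a reversed view. `View1D` and `materialize` are hypothetical names used only for this example; they are not `mdspan` layouts or CCCL APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D view with a signed element stride (not an mdspan layout).
template <class T>
struct View1D {
  T* origin;              // address of logical element 0
  std::size_t extent;     // number of logical elements
  std::ptrdiff_t stride;  // in elements; may be zero or negative

  T& operator()(std::size_t i) const {
    assert(i < extent);
    return origin[static_cast<std::ptrdiff_t>(i) * stride];
  }
};

// Copies the logical contents of a view into a contiguous vector.
template <class T>
std::vector<T> materialize(View1D<const T> v) {
  std::vector<T> out;
  out.reserve(v.extent);
  for (std::size_t i = 0; i < v.extent; ++i) {
    out.push_back(v(i));
  }
  return out;
}
```

Given `data = {10, 20, 30, 40}`, a view with origin `&data[3]`, extent 4 and stride -1 materializes as `{40, 30, 20, 10}`, and a view with origin `&data[1]`, extent 3 and stride 0 materializes as `{20, 20, 20}`. Both are common in array libraries, which is why a tensor-copy interface may need to go beyond strictly positive strides.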

CUB already provides a `DeviceCopy` routine compatible with `mdspan`, but it only achieves speed-of-light (SOL) performance for contiguous layouts.

Ideally, we would use techniques that are already state of the art. Natural choices in this direction include:

- **CuTe**: The framework provides all the features necessary for a SOL implementation. The main issues are that it is an external package outside CCCL and that it provides many additional features and use cases beyond tensor copy, which has implications for compile time and binary size. This could lead to something like a CuTe 2.0.

- **CUDA Tile binding**: We need to investigate CUDA Tile interoperability with the SIMT model.
