
Added multi_tensor_copier package #13

Merged
xupinjie merged 1 commit into NVIDIA:main from RmSchaffert:multi_tensor_copier
Mar 24, 2026

Conversation

@RmSchaffert (Collaborator) commented Mar 23, 2026

Description

Added the Multi-Tensor Copier functionality, along with the corresponding documentation, an example, and a simple evaluation.

Type of Change

Please select (at least one):

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation / examples / tutorials / demos
  • Supporting functionality change (fix or feature in documentation generation, helper scripts, ...)
  • Refactoring / internal change
  • Other (please describe):

Testing

Checklist for testing:

  • Tests added or updated if/as needed
  • Repository test runner executed: scripts/run_tests.sh

Documentation, Examples, Tutorials, Demos

Checklist for documentation:

  • User-facing documentation updated if/as needed (including API docs)
  • Examples / tutorials / demos updated or added (if relevant)
  • Limitations and constraints documented (if relevant)
  • Performance documented (if relevant)
  • Documentation builds successfully & the checks outlined in the Documentation Checks section of the Contribution Guide have been performed

Code Quality

Checklist for dependencies:

  • Dependencies updated in the relevant pyproject.toml if/as needed
  • Code formatted according to the Code Formatting Guide

Related Issues / Context

If applicable, link related issues, discussions etc.


DCO / Sign-Off

Please refer to the section on Signing Your Work & Developer Certificate of Origin (DCO)
in the Contribution Guide before submitting your contribution.

References

For additional details, please refer to the Contribution Guide.
The following guides are available (referenced in the Contribution Guide for further details):

Please also refer to the summary checklist in the Contribution Guide,
which is a guideline for what to consider when submitting your contribution and covers the same topics as the checklists above.

@RmSchaffert RmSchaffert requested a review from xupinjie March 23, 2026 07:08
Comment on lines +521 to +522
// Heuristic thresholds: only pack "small" tensors.
constexpr int64_t kPackMaxBytesPerTensor = 256 * 1024; // 256KB
@xupinjie (Collaborator) commented Mar 23, 2026


Do we need to limit the total size of the pack as well as the per-tensor size?
For example, a 32 MB limit per pack.

@RmSchaffert (Collaborator, Author) replied

Good point.

For the per-tensor size, the threshold focuses packing on small tensors, which benefit the most from packing while not adding too much copying overhead on the CPU.

A very large total size may lead to problems such as allocation failures or higher allocation overhead. I adjusted the implementation so that multiple chunks are allocated if needed; each chunk is limited to 32 MB by default (configurable).
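The chunked allocation described above can be sketched as a greedy planning pass. This is a hypothetical illustration, not the actual implementation: the function and constant names are invented, and only the 256 KB per-tensor threshold and the 32 MB default chunk size come from the discussion.

```python
# Hypothetical sketch of the chunked packing plan discussed above.
# Tensors above the per-tensor threshold are copied individually;
# smaller ones are packed greedily into chunks of at most max_chunk bytes.

PACK_MAX_BYTES_PER_TENSOR = 256 * 1024  # 256 KB per-tensor threshold
MAX_CHUNK_BYTES = 32 * 1024 * 1024      # 32 MB default chunk size (configurable)

def plan_chunks(tensor_sizes, max_tensor=PACK_MAX_BYTES_PER_TENSOR,
                max_chunk=MAX_CHUNK_BYTES):
    """Split tensor byte sizes into (packed chunks, individually copied sizes)."""
    individual = [s for s in tensor_sizes if s > max_tensor]
    chunks, current, used = [], [], 0
    for s in (s for s in tensor_sizes if s <= max_tensor):
        if used + s > max_chunk and current:
            # Current chunk is full: close it and start a new one.
            chunks.append(current)
            current, used = [], 0
        current.append(s)
        used += s
    if current:
        chunks.append(current)
    return chunks, individual
```

With this shape, a workload needing more than one chunk's worth of small tensors simply yields several chunks, each within the 32 MB cap, while oversized tensors bypass packing entirely.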

Results
-------

.. list-table:: Runtime and Speedup (mean +/- std over 10 runs)
@xupinjie (Collaborator) commented

Can we add a performance comparison with PyTorch nested tensors?

@RmSchaffert (Collaborator, Author) commented Mar 24, 2026

The use-cases are different: With the copier, we can have more general input structures (e.g. dicts/lists/tuples containing tensors (typical for meta-data); individual inputs can have different dtypes and be on different devices).

However, the underlying implementation has similarities: a nested tensor also uses a single memory buffer. I ran a small evaluation comparing the copy runtime against both an already-created nested tensor and against creating a nested tensor from a list and copying that tensor (without splitting it back into a list). The results (copying 500 tensors with 32-1024 entries each, using pinned memory when creating the nested tensor) are:

  • multi_tensor_copier: 0.388 ms
  • nested tensor (from list): 1.071 ms
  • nested tensor (pre-built): 0.158 ms

So, if lists are used, the multi_tensor_copier copy is faster, but using a nested tensor directly is even faster.
I would say this is expected, as a nested tensor is already in a format similar to what we use internally (and have to convert to and from) in the copier.
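The nested-tensor side of the comparison above could be set up along these lines. This is only a sketch of the measurement approach, not the evaluation script from the PR; `multi_tensor_copier` itself is not shown, the helper names are invented, and timings depend entirely on hardware.

```python
# Hypothetical sketch of the nested-tensor copy comparison discussed above.
import time
import torch

def make_inputs(n=500, lo=32, hi=1024, seed=0):
    """Create n 1-D float tensors with 32-1024 entries each."""
    g = torch.Generator().manual_seed(seed)
    return [torch.randn(int(torch.randint(lo, hi + 1, (1,), generator=g)))
            for _ in range(n)]

def copy_via_new_nested(tensors, device="cuda"):
    # "nested tensor (from list)": build the nested tensor each time, then copy.
    nt = torch.nested.nested_tensor(tensors,
                                    pin_memory=torch.cuda.is_available())
    return nt.to(device, non_blocking=True)

def copy_prebuilt_nested(nt, device="cuda"):
    # "nested tensor (pre-built)": the pinned nested tensor already exists.
    return nt.to(device, non_blocking=True)

if __name__ == "__main__" and torch.cuda.is_available():
    tensors = make_inputs()
    nt = torch.nested.nested_tensor(tensors, pin_memory=True)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    copy_prebuilt_nested(nt)
    torch.cuda.synchronize()
    print(f"pre-built nested copy: {(time.perf_counter() - t0) * 1e3:.3f} ms")
```

The gap between the two nested-tensor variants comes from the list-to-buffer conversion inside `torch.nested.nested_tensor`, which is exactly the conversion cost the copier pays as well.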

@xupinjie (Collaborator) left a comment

Thank you, I added some comments in the code.

@RmSchaffert RmSchaffert force-pushed the multi_tensor_copier branch from 4f5ed1f to 3aab06f on March 24, 2026 03:05
Signed-off-by: Roman Schaffert <rschaffert@nvidia.com>
@RmSchaffert RmSchaffert force-pushed the multi_tensor_copier branch from 3aab06f to 16b8213 on March 24, 2026 05:25
@RmSchaffert (Collaborator, Author) commented

Thank you for the insightful comments @xupinjie! I prepared a new version. Apart from the changes related to your comments, I also reworked how streams are handled: previously, some copy directions were not synchronized properly, and the way multiple streams were used was not meaningful (as only copy operations are involved).

@xupinjie (Collaborator) commented

That is great! Merged.

@xupinjie xupinjie merged commit 218a821 into NVIDIA:main Mar 24, 2026