@cpcloud cpcloud commented Jan 5, 2026

TL;DR

Removes artificial power-of-two restriction on itemsize, enabling support for arbitrary-sized dtypes (e.g., np.dtype([("a", "i4"), ("b", "i1")]) → 5 bytes).

Changes

  • Removed validation: Deleted itemsize & (itemsize - 1) checks from _layout.pxd and _layout.pyx
  • Updated error messages: Changed from "must be a power of two" to "must be greater than zero"
  • Documentation: Removed power-of-two constraint from repacked() method docstring
  • Test coverage: Added test_from_buffer_with_non_power_of_two_itemsize() validating 5-byte structured dtype

Files modified: 3 files (+31/-25 lines)

  • cuda/core/_layout.pxd - Removed the power-of-two check from _init() and _init_dense()
  • cuda/core/_layout.pyx - Removed the power-of-two check from pack_extents(), unpack_extents(), max_compatible_itemsize()
  • tests/test_utils.py - Added non-power-of-two test case

Impact

Enables:

  • Structured dtypes with non-power-of-two sizes (common in real-world data)
  • np.dtype([("field1", "i4"), ("field2", "i1")]) (5 bytes)
  • np.dtype([("x", "i2"), ("y", "i1")]) (3 bytes)
  • np.dtype([("a", "i4"), ("b", "i4"), ("c", "i1")]) (9 bytes)
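The sizes quoted above follow from NumPy's default structured dtypes being packed (no padding between fields), so the itemsize is the sum of the field sizes. A minimal sketch that reproduces the arithmetic with the standard-library struct module; the "<" format strings are illustrative stand-ins for the NumPy dtypes, not code from this PR:

```python
import struct

# "<" selects standard sizes with no padding, mirroring packed NumPy
# structured dtypes. Keys name the fields; values are stand-in formats.
PACKED_FORMATS = {
    "i4 + i1": "<ib",
    "i2 + i1": "<hb",
    "i4 + i4 + i1": "<iib",
}


def is_power_of_two(n: int) -> bool:
    """Return True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


sizes = {name: struct.calcsize(fmt) for name, fmt in PACKED_FORMATS.items()}

for name, size in sizes.items():
    print(f"{name}: {size} bytes, power of two: {is_power_of_two(size)}")
```

None of these sizes is a power of two, which is why the old check rejected them.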

Backward compatible: All existing power-of-two itemsizes continue to work identically.

Testing

  • ✅ New test validates 5-byte structured dtype ([("a", "int32"), ("b", "int8")])
  • ✅ All existing tests pass (continue using power-of-two itemsizes by default)
  • ✅ Comprehensive verification confirms no code dependencies on power-of-two invariant
🔍 Deep Dive: Comprehensive Verification Report

Verification Methodology

Analyzed the entire cuda_core subpackage to verify that no code relies on the power-of-two invariant, either explicitly or implicitly.


✅ Verification Results: NO DEPENDENCIES FOUND

1. Explicit Power-of-Two Checks

Status: All removed ✅

| Location | Before | After |
| --- | --- | --- |
| _layout.pxd:114-115 | if itemsize & (itemsize - 1) | if itemsize <= 0 |
| _layout.pxd:127-128 | if itemsize & (itemsize - 1) | if itemsize <= 0 |
| _layout.pyx:1218-1219 | if itemsize & (itemsize - 1) | if itemsize <= 0 |
| _layout.pyx:1274-1276 | if itemsize & (itemsize - 1) | if itemsize <= 0 |
| _layout.pyx:1305-1307 | if itemsize & (itemsize - 1) | if itemsize <= 0 |

Search results:

  • ✅ No itemsize & (itemsize - 1) patterns remain
  • ✅ No "power of two" or "power of 2" in comments/docs
  • ✅ No TODO/FIXME related to itemsize constraints
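To make the before/after behavior concrete, here is a small pure-Python model of the two checks. The function names are hypothetical and the old error message is paraphrased; the positivity guard in the old model is an assumption, since only the bit test is quoted in this report:

```python
def check_itemsize_old(itemsize: int) -> None:
    """Model of the removed validation (hypothetical name)."""
    # Assumption: the old code also rejected non-positive values.
    if itemsize <= 0 or itemsize & (itemsize - 1):
        raise ValueError("itemsize must be a power of two")


def check_itemsize_new(itemsize: int) -> None:
    """Model of the relaxed validation (hypothetical name)."""
    if itemsize <= 0:
        raise ValueError("itemsize must be greater than zero")


def accepted(check, itemsize: int) -> bool:
    """Return True when the given check does not raise."""
    try:
        check(itemsize)
    except ValueError:
        return False
    return True


candidates = [0, 1, 3, 4, 5, 6, 8, 9]
old_ok = [n for n in candidates if accepted(check_itemsize_old, n)]
new_ok = [n for n in candidates if accepted(check_itemsize_new, n)]
print("old:", old_ok)
print("new:", new_ok)
```

Every value the old model accepts is also accepted by the new one, which is the backward-compatibility claim in miniature.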

2. Bit Operations Analysis

3 bit-shift locations found, NONE depend on power-of-two itemsize:

_memoryview.pyx:656-657

cdef int itemsize = nbits >> 3              # Divide bits by 8
if (itemsize << 3) != nbits:                # Verify bits is multiple of 8
    raise ValueError("dtype.bits must be a multiple of 8")

Purpose: DLPack bits→bytes conversion
Analysis: Validates bits are byte-aligned (multiple of 8), independent of itemsize power-of-two requirement
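The same shift-based conversion can be exercised in plain Python to confirm it handles byte counts that are not powers of two. This is a sketch with a hypothetical function name, not the Cython source:

```python
def bits_to_itemsize(nbits: int) -> int:
    """Convert a bit width to a byte count, rejecting partial bytes."""
    itemsize = nbits >> 3  # divide by 8
    if (itemsize << 3) != nbits:  # round trip fails unless nbits % 8 == 0
        raise ValueError("dtype.bits must be a multiple of 8")
    return itemsize


for nbits in (24, 40, 72):
    print(nbits, "bits ->", bits_to_itemsize(nbits), "bytes")
```

A 40-bit element maps to 5 bytes and passes, while a 12-bit element is rejected because it is not a whole number of bytes.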

_layout.pyx:953,955

axes_mask &= (AXES_MASK_ALL << start_axis)
axes_mask &= (AXES_MASK_ALL >> (STRIDED_LAYOUT_MAX_NDIM - end_axis - 1))

Purpose: Axis mask bit manipulation
Analysis: Unrelated to itemsize

_layout.pxd:49-60

PROP_IS_UNIQUE = 1 << 0
PROP_IS_CONTIGUOUS_C = 1 << 1
# ... property flags

Purpose: Bit flags for layout properties
Analysis: Standard enumeration pattern, unrelated to itemsize


3. Arithmetic Operations on itemsize

All operations work correctly with non-power-of-two values:

Division Operations ✅

pack_extents (_layout.pyx:1234)

vec_size = new_itemsize // itemsize
if packed_extent * vec_size != shape[axis]:
    raise ValueError(f"extent must be divisible by {vec_size}")

unpack_extents (_layout.pyx:1288)

vec_size = itemsize // new_itemsize

Validation: Divisibility checks ensure correctness
Verdict: General-purpose integer division, safe for any positive integers
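A pure-Python model of the packing arithmetic, assuming the extent is divided by vec_size and then multiplied back to detect a remainder, as the quoted snippet suggests. The function name and the explicit multiple-of check are illustrative assumptions:

```python
def pack_extent(extent: int, itemsize: int, new_itemsize: int) -> int:
    """Return the extent after packing elements into a larger itemsize."""
    if itemsize <= 0 or new_itemsize <= 0:
        raise ValueError("itemsize must be greater than zero")
    # Assumption: packing requires new_itemsize to be a multiple of itemsize.
    if new_itemsize < itemsize or new_itemsize % itemsize != 0:
        raise ValueError("new itemsize must be a multiple of itemsize")
    vec_size = new_itemsize // itemsize
    packed_extent = extent // vec_size
    if packed_extent * vec_size != extent:
        raise ValueError(f"extent must be divisible by {vec_size}")
    return packed_extent


# Four 3-byte elements pack into two 6-byte elements.
print(pack_extent(4, itemsize=3, new_itemsize=6))
```

Nothing here depends on either itemsize being a power of two; only the divisibility relations matter.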

Modulo Operations ✅

# Alignment check (_layout.pyx:1226)
if data_ptr % new_itemsize != 0:
    raise ValueError("data pointer must be aligned...")

Stride divisibility (_layout.pxd:407)

if stride * itemsize != base.strides[i]:
    raise ValueError("strides must be divisible by itemsize")

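Verdict: The alignment check is plain modulo arithmetic, and the stride check compares the multiplied-back quotient against the original stride; both work for any positive divisor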
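The stride check can be modeled as integer division followed by a multiply-back comparison. This sketch uses a hypothetical helper name and assumes byte strides are converted to element-count strides:

```python
def divide_strides(byte_strides, itemsize: int):
    """Convert byte strides to element strides, rejecting inexact division."""
    if itemsize <= 0:
        raise ValueError("itemsize must be greater than zero")
    strides = []
    for byte_stride in byte_strides:
        stride = byte_stride // itemsize
        # Multiplying back detects any remainder lost by the division.
        if stride * itemsize != byte_stride:
            raise ValueError("strides must be divisible by itemsize")
        strides.append(stride)
    return tuple(strides)


# A C-contiguous (2, 3) array of 5-byte elements has byte strides (15, 5).
print(divide_strides((15, 5), itemsize=5))
```

Byte strides of (16, 5) with a 5-byte itemsize would be rejected, because 16 is not a multiple of 5.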

GCD Algorithm ✅

# max_compatible_itemsize (_layout.pyx:1208-1211)
def gcd(a, b):
    while b != 0:
        a, b = b, a % b
    return a

Purpose: Find maximum compatible itemsize
Analysis: Euclidean algorithm, general-purpose for any integers
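As a quick sanity check, the quoted Euclidean loop can be compared against the standard library's math.gcd on inputs that are not powers of two. The sample pairs are arbitrary:

```python
import math


def gcd(a: int, b: int) -> int:
    """Euclidean algorithm, as quoted above."""
    while b != 0:
        a, b = b, a % b
    return a


pairs = [(15, 9), (5, 8), (6, 9), (12, 18)]
results = [gcd(a, b) for a, b in pairs]
print(results)

# Cross-check against the standard library over a small range.
agrees = all(
    gcd(a, b) == math.gcd(a, b) for a in range(1, 50) for b in range(1, 50)
)
print("matches math.gcd:", agrees)
```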


4. Memory Alignment Verification

Alignment check (_layout.pyx:1226):

if data_ptr % new_itemsize != 0:
    raise ValueError(f"data pointer must be aligned to {new_itemsize}")

✅ Uses standard modulo - works for any itemsize value
✅ No assumption that itemsize is power-of-two
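One practical consequence is worth illustrating: a pointer aligned to a power of two is not automatically a multiple of a non-power-of-two itemsize, so the modulo check can still reject it. A sketch with a hypothetical helper name and made-up addresses:

```python
def is_aligned(data_ptr: int, new_itemsize: int) -> bool:
    """Return True when data_ptr is a multiple of new_itemsize."""
    if new_itemsize <= 0:
        raise ValueError("itemsize must be greater than zero")
    return data_ptr % new_itemsize == 0


base = 0x1000  # 4096, a typical power-of-two aligned address (illustrative)

for size in (3, 5, 8):
    print(f"ptr={base:#x} itemsize={size} aligned={is_aligned(base, size)}")
```

Here 0x1000 is aligned for an 8-byte itemsize but not for 3 or 5 bytes, so repacking to such a size depends on the actual address.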


5. Documentation Consistency

repacked() Method Docstring ✅

Before (upstream/main):
The conversion is subject to the following constraints:
* The old and new itemsizes must be powers of two.
* The extent at axis must be a positive integer.
* ...

After (HEAD):
The conversion is subject to the following constraints:
* The extent at axis must be a positive integer.
* The stride at axis must be 1.
* ...
✅ Power-of-two constraint removed from documentation

Error Messages ✅

All 14 itemsize-related ValueError sites now perform only this check:

if itemsize <= 0:
    raise ValueError("itemsize must be greater than zero")

✅ No power-of-two requirements in error messages


6. Test Coverage

Existing Tests ✅

test_strided_layout.py: Uses _ITEMSIZES = [1, 2, 4, 8, 16]

  • These are power-of-two for convenience, not a requirement
  • All existing tests continue to pass

New Test ✅

test_utils.py:435-445:

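# np, math, dev, _StridedLayout, and StridedMemoryView come from the test
# module's imports and module-level setup, which this excerpt does not show.
def test_from_buffer_with_non_power_of_two_itemsize():
    dtype = np.dtype([("a", "int32"), ("b", "int8")])  # 5 bytes
    shape = (1,)
    # strides=None requests a dense layout for the 5-byte itemsize
    layout = _StridedLayout(shape=shape, strides=None, itemsize=dtype.itemsize)
    required_size = layout.required_size_in_bytes()
    assert required_size == math.prod(shape) * dtype.itemsize
    buffer = dev.memory_resource.allocate(required_size)
    # Constructing the view must not raise for a non-power-of-two itemsize
    view = StridedMemoryView.from_buffer(buffer, shape=shape, strides=layout.strides,
                                         dtype=dtype, is_readonly=True)
    assert view.dtype == dtype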

✅ Validates 5-byte structured dtype works correctly
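The size assertion in the test reduces to simple arithmetic for a dense layout. The following pure-Python model uses hypothetical helpers (they are not the _StridedLayout API) and assumes strides are counted in elements:

```python
import math


def dense_strides(shape):
    """Element-count strides of a C-contiguous layout (assumed convention)."""
    strides = []
    running = 1
    for extent in reversed(shape):
        strides.append(running)
        running *= extent
    return tuple(reversed(strides))


def dense_size_in_bytes(shape, itemsize: int) -> int:
    """Bytes needed by a dense layout: number of elements times itemsize."""
    if itemsize <= 0:
        raise ValueError("itemsize must be greater than zero")
    return math.prod(shape) * itemsize


print(dense_strides((2, 3)))            # strides in elements
print(dense_size_in_bytes((2, 3), 5))   # six 5-byte elements
print(dense_size_in_bytes((1,), 5))     # the shape used in the test
```

For the test's shape of (1,) and 5-byte dtype, the model gives 5 bytes, matching math.prod(shape) * dtype.itemsize.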


7. Critical Functions Verified

| Function | Location | Verification |
| --- | --- | --- |
| _init() | _layout.pxd:113 | ✅ Only checks itemsize > 0 |
| _init_dense() | _layout.pxd:126 | ✅ Only checks itemsize > 0 |
| pack_extents() | _layout.pyx:1214 | ✅ Integer division + divisibility checks |
| unpack_extents() | _layout.pyx:1268 | ✅ Integer multiplication (safe) |
| max_compatible_itemsize() | _layout.pyx:1301 | ✅ GCD algorithm (general-purpose) |
| _divide_strides() | _layout.pxd:401 | ✅ Integer division + validation |
| layout_from_dlpack() | _memoryview.pyx:653 | ✅ Bit shifts for bits↔bytes only |

8. Files Analyzed

Core implementation (3 files):

  • ✅ _layout.pyx (1,334 lines)
  • ✅ _layout.pxd (409 lines)
  • ✅ _memoryview.pyx (688 lines)

Tests (4 files):

  • ✅ tests/test_strided_layout.py
  • ✅ tests/test_utils.py
  • ✅ tests/helpers/layout.py
  • ✅ All test helpers verified

Examples (3 files):

  • ✅ examples/memory_ops.py
  • ✅ examples/saxpy.py
  • ✅ examples/thread_block_cluster.py

C/C++ headers (3 files):

  • ✅ _include/layout.hpp
  • ✅ _include/utility.hpp
  • ✅ _include/dlpack.h

Final Verdict

✅ SAFE TO MERGE

  1. ✅ No explicit dependencies on power-of-two invariant
  2. ✅ No implicit dependencies found in arithmetic operations
  3. ✅ All bit operations are unrelated to itemsize constraints
  4. ✅ Documentation updated to reflect new behavior
  5. ✅ Test coverage added for non-power-of-two itemsizes
  6. ✅ Backward compatible - all existing code continues to work

Remaining Constraints (Documented)

After this change, itemsize must only satisfy:

  • ✅ itemsize > 0 (positive integer)
  • ✅ For packing: new_itemsize >= itemsize, and new_itemsize is a multiple of itemsize
  • ✅ For unpacking: new_itemsize <= itemsize, and itemsize is a multiple of new_itemsize
  • ✅ Data pointer is aligned to the target itemsize (data_ptr % new_itemsize == 0)

Enabled Use Cases

Users can now work with:

  • ✅ 3-byte dtypes: np.dtype([("x", "i2"), ("y", "i1")])
  • ✅ 5-byte dtypes: np.dtype([("a", "i4"), ("b", "i1")])
  • ✅ 6-byte dtypes: np.dtype([("a", "i2"), ("b", "i2"), ("c", "i2")])
  • ✅ 9-byte dtypes: np.dtype([("a", "i4"), ("b", "i4"), ("c", "i1")])
  • ✅ Any arbitrary struct layout from real-world data formats

copy-pr-bot bot commented Jan 5, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

cpcloud commented Jan 5, 2026

/ok to test

@mdboom mdboom left a comment

LGTM.

@kkraus14 kkraus14 merged commit 2b7e4b9 into NVIDIA:main Jan 6, 2026
80 checks passed

github-actions bot commented Jan 6, 2026

Doc Preview CI
Preview removed because the pull request was closed or merged.

@cpcloud cpcloud deleted the relax-power-of-two-itemsize-check branch January 6, 2026 12:46
github-actions bot pushed a commit that referenced this pull request Jan 13, 2026
* test: add test for removing power of two check from SMV init

* feat: remove power of two check for itemsize in SMV construction

(cherry picked from commit 2b7e4b9)

Successfully created backport PR for release/cuda-core-v0.5.1:

cpcloud added a commit that referenced this pull request Jan 13, 2026
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
@leofang leofang added this to the cuda.core beta 12 milestone Jan 14, 2026