Enhance tensor management and CUDA utilities with benchmarks #51

Merged
codeaddict-119 merged 11 commits into exp from master on May 23, 2026

@codeaddict-119
Collaborator

No description provided.

Eamon2009 added 11 commits May 21, 2026 18:41
- Implement `TensorShape` struct aligned to 16-byte boundaries.
- Add host/device `numel()` method to compute total element count.
- Add host-side `compute_strides()` and `is_contiguous()` logic for row-major layouts.
- Provide initialization shortcuts for 1D, 2D, 3D, and 4D tensor shapes.
- Implement warp-level parallel reductions using `__shfl_xor_sync` (Sum, Max, Min).
- Implement a `warpBroadcast` helper utility targeting lane 0.
- Introduce a skeleton for a shared-memory backed `blockReduceSum`.
- Guard all device-specific primitives behind `__CUDACC__` checks.
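
The shape and warp-reduction bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the field layout, `kMaxDims`, the `SKETCH_HD` macro, the `shape_3d` helper, `warp_reduce_sum`, and the host-side `butterfly_sum_host` model are all hypothetical names chosen for the example. Only `TensorShape`, `numel()`, `compute_strides()`, `is_contiguous()`, `__shfl_xor_sync`, and `__CUDACC__` come from the commit messages.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical qualifier macro so the file also builds with a plain host compiler.
#ifdef __CUDACC__
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

constexpr int kMaxDims = 4;  // assumption: shapes of up to 4 dimensions

struct alignas(16) TensorShape {
    std::int64_t dims[kMaxDims];
    std::int32_t ndim;

    // Total element count: the product of the first ndim extents.
    SKETCH_HD std::int64_t numel() const {
        std::int64_t n = 1;
        for (int i = 0; i < ndim; ++i) n *= dims[i];
        return n;
    }

    // Row-major strides (in elements): the last dimension has stride 1.
    void compute_strides(std::int64_t* strides) const {
        std::int64_t running = 1;
        for (int i = ndim - 1; i >= 0; --i) {
            strides[i] = running;
            running *= dims[i];
        }
    }

    // True when the given strides match the dense row-major layout.
    bool is_contiguous(const std::int64_t* strides) const {
        std::int64_t expected = 1;
        for (int i = ndim - 1; i >= 0; --i) {
            if (strides[i] != expected) return false;
            expected *= dims[i];
        }
        return true;
    }
};

// Hypothetical 3D shortcut; unused trailing extents are set to 1.
inline TensorShape shape_3d(std::int64_t d0, std::int64_t d1, std::int64_t d2) {
    return TensorShape{{d0, d1, d2, 1}, 3};
}

#ifdef __CUDACC__
// Butterfly (XOR) sum over a full 32-lane warp: after 5 steps every lane
// holds the total. Assumes all 32 lanes are active.
__device__ inline float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffffu, v, offset);
    return v;
}
#endif

// Host-side model of the same exchange pattern, useful for checking the logic
// without a GPU: lane i adds the value held by lane (i ^ offset).
inline void butterfly_sum_host(float (&lanes)[32]) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        float next[32];
        for (int i = 0; i < 32; ++i) next[i] = lanes[i] + lanes[i ^ offset];
        for (int i = 0; i < 32; ++i) lanes[i] = next[i];
    }
}
```

With this layout, a `2 x 3 x 4` shape has 24 elements and row-major strides `{12, 4, 1}`. The XOR pattern leaves the result in every lane, so a sum reduced this way needs no separate broadcast step.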
…45)

## Summary
- Project Versioning: Sets the starting project version to 0.1.0.

- Code Shortcuts (Macros): Defines shorthand macros for CUDA qualifiers
(for example, `QX_DEVICE` wraps `__device__`) so that GPU kernel code is
less cluttered.

- Math & Memory Utilities: Adds helpers for aligning memory sizes,
rounding numbers, and computing power-of-two boundaries.

- Memory Optimization: Enforces 128-byte memory alignment so that buffers
start on boundaries that favor coalesced GPU memory access.

- Automatic Error Checking: Introduces checking wrappers (`CUDA_CHECK`,
`CUBLAS_CHECK`, `NCCL_CHECK`) that test the status returned by CUDA,
cuBLAS, and NCCL calls and report failures as soon as they occur, which
makes debugging much easier.
- Implement memory wrappers for host (`malloc`), pinned host (`cudaMallocHost`), and aligned device allocations (`cudaMalloc`).
- Enforce strict memory layout by rounding up device bytes to `QX_MEM_ALIGN`.
- Add `tensor_alloc_device` and `tensor_alloc_host` factory allocators with automatic initialization.
- Implement a unified `tensor_free` that safely deallocates across all memory spaces.
- Add an asynchronous Host-to-Device (`tensor_h2d`) copy routine.
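
The alignment and error-checking items above can be illustrated with a short sketch. It is an assumption about how such helpers might look, not the PR's implementation: `align_up`, `is_pow2`, `next_pow2`, `SKETCH_CUDA_CHECK`, and `device_alloc_aligned` are hypothetical names, and the value 128 for `QX_MEM_ALIGN` is taken from the summary's "128-byte memory alignment". The CUDA part is guarded by `__CUDACC__`, so the arithmetic can be compiled and checked on the host alone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Assumed value, based on the 128-byte alignment stated in the summary.
constexpr std::size_t QX_MEM_ALIGN = 128;

// True for 1, 2, 4, 8, ...; false for 0 and for non-powers of two.
constexpr bool is_pow2(std::size_t x) { return x != 0 && (x & (x - 1)) == 0; }

// Round n up to the next multiple of a. Requires a to be a power of two.
constexpr std::size_t align_up(std::size_t n, std::size_t a) {
    return (n + a - 1) & ~(a - 1);
}

// Smallest power of two that is >= x (returns 1 for x == 0).
inline std::size_t next_pow2(std::size_t x) {
    std::size_t p = 1;
    while (p < x) p <<= 1;
    return p;
}

static_assert(is_pow2(QX_MEM_ALIGN), "alignment must be a power of two");

#ifdef __CUDACC__
#include <cuda_runtime.h>

// Hypothetical checker in the spirit of CUDA_CHECK: print the error and abort.
#define SKETCH_CUDA_CHECK(call)                                        \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                         cudaGetErrorString(err_));                    \
            std::abort();                                              \
        }                                                              \
    } while (0)

// Device allocation whose byte count is rounded up to QX_MEM_ALIGN.
inline void* device_alloc_aligned(std::size_t bytes) {
    void* ptr = nullptr;
    SKETCH_CUDA_CHECK(cudaMalloc(&ptr, align_up(bytes, QX_MEM_ALIGN)));
    return ptr;
}
#endif
```

Under these assumptions, a request for 129 bytes is rounded up to 256, while a request for exactly 128 bytes is left unchanged.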
@codeaddict-119 codeaddict-119 requested a review from Eamon2009 May 23, 2026 13:22
@codeaddict-119 codeaddict-119 merged commit 8451d4a into exp May 23, 2026
9 checks passed
2 participants