Enhance tensor management and CUDA utilities with benchmarks by codeaddict-119 · Pull Request #51 · Eamon2009/Quadtrix.cpp

codeaddict-119 · 2026-05-23T13:22:20Z

No description provided.

- Implement `TensorShape` struct aligned to 16-byte boundaries.
- Add host/device `numel()` method to compute total element count.
- Add host-side `compute_strides()` and `is_contiguous()` logic for row-major layouts.
- Provide initialization shortcuts for 1D, 2D, 3D, and 4D tensor shapes.
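The shape utilities listed above can be illustrated with a minimal host-only sketch. The PR does not show the actual definition, so the field names (`dims`, `ndim`), the `kMaxDims` constant, and the `make_shape` helper are assumptions made for illustration; only `TensorShape`, `numel()`, `compute_strides()`, and `is_contiguous()` are named in the commit message.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>

// Assumption: at most 4 dimensions, matching the 1D-4D shortcuts in the PR.
constexpr int kMaxDims = 4;

// Hypothetical layout; the PR only states that the struct is 16-byte aligned.
struct alignas(16) TensorShape {
    std::int64_t dims[kMaxDims] = {1, 1, 1, 1};
    int ndim = 0;

    // Total element count: product of the active dimensions.
    std::int64_t numel() const {
        std::int64_t n = 1;
        for (int i = 0; i < ndim; ++i) n *= dims[i];
        return n;
    }
};

// Hypothetical stand-in for the 1D/2D/3D/4D initialization shortcuts.
inline TensorShape make_shape(std::initializer_list<std::int64_t> dims) {
    assert(dims.size() <= static_cast<std::size_t>(kMaxDims));
    TensorShape s;
    for (std::int64_t d : dims) s.dims[s.ndim++] = d;
    return s;
}

// Row-major strides in elements: the last dimension has stride 1 and each
// earlier stride is the product of all later dimension sizes.
inline std::array<std::int64_t, kMaxDims> compute_strides(const TensorShape& s) {
    std::array<std::int64_t, kMaxDims> strides{};  // unused entries stay 0
    std::int64_t running = 1;
    for (int i = s.ndim - 1; i >= 0; --i) {
        strides[i] = running;
        running *= s.dims[i];
    }
    return strides;
}

// A tensor is treated as contiguous when its strides equal the row-major
// strides implied by its shape.
inline bool is_contiguous(const TensorShape& s,
                          const std::array<std::int64_t, kMaxDims>& strides) {
    const auto expected = compute_strides(s);
    for (int i = 0; i < s.ndim; ++i) {
        if (strides[i] != expected[i]) return false;
    }
    return true;
}
```

For a shape of `{2, 3, 4}` this sketch yields 24 elements and strides `{12, 4, 1}`. In the real header, `numel()` is described as host/device, so it would presumably carry the project's device qualifier macro.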

- Implement warp-level parallel reductions using `__shfl_xor_sync` (Sum, Max, Min).
- Implement a `warpBroadcast` helper utility targeting lane 0.
- Introduce skeleton for a shared-memory backed `blockReduceSum`.
- Guard all device-specific primitives behind `__CUDACC__` flags.
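The XOR-shuffle reduction pattern named above can be traced on the CPU. The sketch below is not the PR's device code: it is a host-side simulation in which a 32-element array stands in for the 32 lanes of a warp, and `shfl_xor` is a hypothetical stand-in for `__shfl_xor_sync` (each lane reads the value held by lane `lane ^ mask`). All function names here are assumptions.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;
using Warp = std::array<float, kWarpSize>;  // one value per simulated lane

// Host stand-in for __shfl_xor_sync: lane i receives the value of lane i ^ mask.
inline Warp shfl_xor(const Warp& v, int mask) {
    Warp out{};
    for (int lane = 0; lane < kWarpSize; ++lane) out[lane] = v[lane ^ mask];
    return out;
}

// Butterfly reduction: masks 16, 8, 4, 2, 1. After the five steps every lane
// holds the reduction over all 32 lanes.
template <typename Op>
Warp warp_reduce(Warp v, Op op) {
    for (int mask = kWarpSize / 2; mask > 0; mask >>= 1) {
        const Warp other = shfl_xor(v, mask);
        for (int lane = 0; lane < kWarpSize; ++lane) {
            v[lane] = op(v[lane], other[lane]);
        }
    }
    return v;
}

inline Warp warp_reduce_sum(const Warp& v) {
    return warp_reduce(v, [](float a, float b) { return a + b; });
}
inline Warp warp_reduce_max(const Warp& v) {
    return warp_reduce(v, [](float a, float b) { return std::max(a, b); });
}
inline Warp warp_reduce_min(const Warp& v) {
    return warp_reduce(v, [](float a, float b) { return std::min(a, b); });
}

// Host stand-in for a broadcast: every lane receives the value of src_lane.
inline Warp warp_broadcast(const Warp& v, int src_lane = 0) {
    assert(src_lane >= 0 && src_lane < kWarpSize);
    Warp out{};
    out.fill(v[src_lane]);
    return out;
}
```

Because the XOR pattern exchanges values symmetrically, no separate broadcast is needed after the reduction: the result is already present in every lane, unlike a `__shfl_down_sync` tree, where only lane 0 ends up with the full result.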

…45)

## Summary

- Project Versioning: Sets the starting project version to 0.1.0.
- Code Shortcuts (Macros): Creates clean shorthand terms for CUDA keywords (like wrapping `__device__` into `QX_DEVICE`) to make writing GPU kernels cleaner.
- Math & Memory Utilities: Adds fast math helpers for aligning memory, rounding numbers, and calculating power-of-two boundaries quickly.
- Memory Optimization: Forces a 128-byte memory alignment to ensure the GPU can read data as fast as possible (coalesced memory access).
- Automatic Error Checking: Introduces safety wrappers (`CUDA_CHECK`, `CUBLAS_CHECK`, `NCCL_CHECK`) that instantly watch for crashes or failures in Nvidia's core hardware and math libraries, making debugging much easier.
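The alignment and power-of-two helpers described in the summary could look roughly like the sketch below. The PR does not list the helper names, so `ceil_div`, `is_pow2`, `align_up`, and `next_pow2` are hypothetical; `QX_MEM_ALIGN` is named elsewhere in the PR, and its value of 128 is taken from the summary's stated 128-byte alignment.

```cpp
#include <cassert>
#include <cstddef>

// 128-byte alignment, per the PR summary.
constexpr std::size_t QX_MEM_ALIGN = 128;

// Hypothetical helper: integer division rounded up (b must be non-zero).
inline std::size_t ceil_div(std::size_t a, std::size_t b) {
    assert(b != 0);
    return (a + b - 1) / b;
}

// Hypothetical helper: true when x is a non-zero power of two.
inline bool is_pow2(std::size_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// Hypothetical helper: round n up to the next multiple of align.
// The bit-mask form only works when align is a power of two.
inline std::size_t align_up(std::size_t n, std::size_t align = QX_MEM_ALIGN) {
    assert(is_pow2(align));
    return (n + align - 1) & ~(align - 1);
}

// Hypothetical helper: smallest power of two >= x (returns 1 for x <= 1).
// Overflow for very large x is not handled in this sketch.
inline std::size_t next_pow2(std::size_t x) {
    std::size_t p = 1;
    while (p < x) p <<= 1;
    return p;
}
```

With these definitions, a 129-byte request rounds up to 256 bytes, and a request that is already a multiple of 128 is left unchanged.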

- Implement memory wrappers for host (`malloc`), pinned host (`cudaMallocHost`), and aligned device allocations (`cudaMalloc`).
- Enforce strict memory layout by rounding up device bytes to `QX_MEM_ALIGN`.
- Add `tensor_alloc_device` and `tensor_alloc_host` factory allocators with automatic initialization.
- Implement unified `tensor_free` handling safe deallocations across all memory spaces.
- Add async Host-to-Device (`tensor_h2d`) copy routine.
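A host-only sketch can show the shape of the allocation API described above without requiring a GPU. Everything beyond the names `tensor_alloc_host`, `tensor_free`, and `QX_MEM_ALIGN` is an assumption: the `Tensor` struct, the `MemSpace` enum, the function signatures, and zero-filling as the "automatic initialization". The PR rounds up device bytes; this sketch applies the same rounding to the host buffer only so that the behavior is observable on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

constexpr std::size_t QX_MEM_ALIGN = 128;  // value taken from the PR summary

// Hypothetical tag recording which allocator owns the buffer.
enum class MemSpace { Host, PinnedHost, Device };

// Hypothetical tensor handle; the real struct is not shown in the PR.
struct Tensor {
    void* data = nullptr;
    std::size_t bytes = 0;
    MemSpace space = MemSpace::Host;
};

// Round a byte count up to the next multiple of QX_MEM_ALIGN.
inline std::size_t round_up_bytes(std::size_t n) {
    return (n + QX_MEM_ALIGN - 1) / QX_MEM_ALIGN * QX_MEM_ALIGN;
}

// Allocate a zero-initialized host buffer (assumed form of the factory).
inline Tensor tensor_alloc_host(std::size_t numel, std::size_t elem_size) {
    Tensor t;
    t.space = MemSpace::Host;
    t.bytes = round_up_bytes(numel * elem_size);
    t.data = std::malloc(t.bytes);
    if (t.data != nullptr) {
        std::memset(t.data, 0, t.bytes);
    } else {
        t.bytes = 0;  // allocation failed or zero bytes were requested
    }
    return t;
}

// Release the buffer according to its memory space. Clearing the handle
// makes a second call a harmless no-op.
inline void tensor_free(Tensor& t) {
    if (t.data == nullptr) return;
    switch (t.space) {
        case MemSpace::Host:
            std::free(t.data);
            break;
        case MemSpace::PinnedHost:  // a CUDA build would call cudaFreeHost here
        case MemSpace::Device:      // a CUDA build would call cudaFree here
            assert(false && "CUDA paths are omitted from this host-only sketch");
            break;
    }
    t.data = nullptr;
    t.bytes = 0;
}
```

Storing the memory space in the handle is one way to let a single `tensor_free` dispatch to the matching deallocator; whether the PR uses a tag like this or a different mechanism is not stated.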


Eamon2009 added 11 commits May 21, 2026 18:41

Delete run_20260430_192930.png

1e54e9f

Merge branch 'master' of https://github.com/Eamon2009/Quadtrix.cpp

2cc2d74

Delete run_20260508_110726.png

c7761f6

docs : training report v1.0 loss curves val loss vs wall-clock time

6d85944

docs : training report CUDA /bf16 version loss curves val loss vs wal…

9ba7a91

…l-clock time

refactor(main.cpp): clean up code and add progress tracking removed r…

cd675d2

…edundant code and improved readability.

codeaddict-119 requested a review from Eamon2009 May 23, 2026 13:22

codeaddict-119 assigned codeaddict-119 and Eamon2009 May 23, 2026

codeaddict-119 added the cuda label May 23, 2026

codeaddict-119 merged commit 8451d4a into exp May 23, 2026
9 checks passed

Eamon2009 added a commit that referenced this pull request May 25, 2026

feat :tensor management with benchmarks (#51) (#52)

c7a1e01


codeaddict-119 merged 11 commits into exp from master
