A lightweight CUDA wrapper for memory management, kernel timing, and error handling—focus on kernels, not boilerplate.
tinyCUDA strips away the tedium of CUDA development: no more manual cudaMalloc/cudaMemcpy juggling, unchecked kernel launches, or rough single-run timings. It's designed for quick prototyping, perfect for new CUDA users or anyone tired of repetitive setup. Not a full framework (no autograd or N-D tensors); just sane defaults for 1D buffers and kernel launches.
Built with: C++17, CUDA 11+. Compatible with all CUDA-capable GPUs.
Also visit the blog post for a step-by-step guide on getting started.
Vanilla CUDA (with manual error checks and rough single-run timing):

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
// Manual error-check macro (boilerplate everywhere);
// do/while(0) makes it safe inside if/else without braces
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while (0)
// Assume h_a, h_out are host arrays; kernel updates output
float *h_a = /* host data */, *h_out = /* host output */, *d_a, *d_out; int N = /* size */;
// Alloc device mem
CUDA_CHECK(cudaMalloc(&d_a, N * sizeof(float)));
CUDA_CHECK(cudaMalloc(&d_out, N * sizeof(float)));
// H→D copy
CUDA_CHECK(cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice));
// Rough timing: single run (no warmup, includes JIT/host overhead)
auto start = std::chrono::high_resolution_clock::now();
kernel<<<blocks, threads>>>(d_a, d_out, N);
CUDA_CHECK(cudaGetLastError()); // Check launch errors
CUDA_CHECK(cudaDeviceSynchronize()); // Wait + measure total wall time
auto end = std::chrono::high_resolution_clock::now();
float ms = std::chrono::duration<float, std::milli>(end - start).count();
// D→H copy
CUDA_CHECK(cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost));
// Free (forget one? Memory leak!)
CUDA_CHECK(cudaFree(d_a));
CUDA_CHECK(cudaFree(d_out));
```

With tinyCUDA (focus on the kernel):

```cpp
#include "tinycuda/tinycuda.hpp" // Includes error checks, chrono, etc.
float *h_a = /* host data */; int N = /* size */;
tinycuda::Buffer<float> buf(h_a, N);
buf.to_gpu(); // Alloc + H→D (auto-checked)
// Single run:
kernel<<<blocks, threads>>>(buf.gpu_data(), /* out via another buf or ptr */, N);
// (or) Accurate timing: warmup + averaged batches (no JIT bias)
float ms = tinycuda::KernelProfiler(1, 10)([&] {
kernel<<<blocks, threads>>>(buf.gpu_data(), /* out via another buf or ptr */, N);
});
buf.to_cpu(); // D→H (device mem auto-freed on destruct)
```

- Buffer: Non-owning GPU mirror of host data. Allocates device mem on first `to_gpu()`, copies H↔D, handles moves. Zero-size support.
- KernelProfiler: Warmup + batched timing for accurate kernel execution averages (ms). Untimed warmup avoids JIT overhead.
- CUDA_CHECK: Macro for immediate error checking/abort with file:line context.
- tinycuda.hpp: One-include aggregator for the full API.
- Header-only: No installation/linking required. Include `tinycuda.hpp` and you are good to go!
Requirements:

- CUDA Toolkit 11+ (nvcc required).
- C++17 compiler (nvcc handles it).
Quick start:

```bash
git clone https://github.com/OnixHoque/tinycuda.git
cd tinycuda
./scripts/build_and_run.sh                      # Defaults to vector_add example
./scripts/build_and_run.sh ./examples/matmul.cu # Matmul example
```

- Vector Add: Simple element-wise addition with verification.
- Matmul: Basic matrix multiply (512x512) with timing & GFLOPS.
Output example (vector_add):

```text
[INFO] Building examples/vector_add.cu → build/vector_add ...
[INFO] Running build/vector_add ...
Average kernel time: 0.0032 ms
[PASS] Vector addition verified: all elements == 3.0f
[SUCCESS] Example completed.
```
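For the matmul example, the reported GFLOPS follows the standard operation count for a dense N×N multiply: C = A·B costs N³ multiplies plus N³ adds, i.e. 2·N³ flops. A quick sketch of the arithmetic, where `ms` is the averaged kernel time from the profiler (variable names here are illustrative, not from the library):

```cpp
// Throughput for a 512x512 matmul timed at `ms` milliseconds.
const int N = 512;
double flops  = 2.0 * N * N * N;            // ~2.68e8 floating-point ops
double gflops = flops / (ms * 1e-3) / 1e9;  // ms -> seconds, ops/s -> GFLOPS
printf("%.2f GFLOPS\n", gflops);
```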
Run the tests:

```bash
./scripts/run_tests.sh
```

Runs `test_memory` and `test_profiler`, verifying Buffer move semantics, zero-size buffers, and profiler accuracy.
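As a rough illustration of what `test_memory` exercises, a check along these lines (hypothetical code, not the actual test file) would cover the zero-size and move guarantees:

```cpp
#include "tinycuda/tinycuda.hpp"
#include <cassert>
#include <utility>
#include <vector>

int main() {
    // Zero-size buffers are supported: safe to create and round-trip.
    std::vector<float> empty_host;
    tinycuda::Buffer<float> empty(empty_host.data(), 0);
    empty.to_gpu();
    assert(empty.size() == 0);

    // Buffers are move-only: ownership transfers without a double free.
    tinycuda::Buffer<float> moved = std::move(empty);
    assert(moved.size() == 0);
}
```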
Buffer usage:

```cpp
std::vector<float> host(N, 0.0f);
tinycuda::Buffer<float> buf(host.data(), N);
buf.to_gpu(); // Alloc + H→D copy (safe to repeat)
kernel<<<grid, block>>>(buf.gpu_data(), N);
buf.to_cpu(); // D→H copy (device mem retained)
```

- Move-enabled; no copy ctor.
- `on_gpu()` flag; `size()` query.
- Docs: memory.hpp.
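Since Buffer has no copy constructor, ownership is transferred with `std::move`. A minimal sketch using the accessors listed above (assuming `size()` returns an integral count):

```cpp
#include "tinycuda/tinycuda.hpp"
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    std::vector<float> host(1024, 1.0f);
    tinycuda::Buffer<float> a(host.data(), host.size());
    a.to_gpu();

    // Copying is disabled; hand the device allocation to another Buffer.
    tinycuda::Buffer<float> b = std::move(a); // `a` must not be used afterwards
    std::printf("on_gpu=%d size=%zu\n", (int)b.on_gpu(), (size_t)b.size());

    b.to_cpu(); // results land back in `host`
}
```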
KernelProfiler usage:

```cpp
tinycuda::KernelProfiler prof(5, 50); // 5 warmup runs, 50 timed repeats
float ms = prof([&] { kernel<<<grid, block>>>(...); });
printf("Avg: %.4f ms\n", ms);
```

- Batched launches (no per-run sync); averages total time.
- Docs: profiler.hpp.
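Conceptually, the warmup-plus-batched scheme works like the sketch below (illustrative only, using CUDA events; the library's actual implementation may differ):

```cpp
#include <cuda_runtime.h>

// Run `warmup` untimed launches, then time `repeats` launches in one
// batch (no per-run sync) and return the average per launch in ms.
template <typename F>
float batched_time_ms(int warmup, int repeats, F launch) {
    for (int i = 0; i < warmup; ++i) launch(); // absorb JIT/cache effects
    cudaDeviceSynchronize();                   // keep warmup out of the timing

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < repeats; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / repeats;
}
```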
CUDA_CHECK: Auto-checks a CUDA expression for errors and aborts on failure. Defined as `CUDA_CHECK(expr)` in error.hpp.

```cpp
CUDA_CHECK(cudaMalloc(&ptr, N * sizeof(float)));
```

- Prints file:line + the `cudaGetErrorString` message.
- Docs: error.hpp.
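For example, forcing an allocation failure aborts with a message in the format above (the file name and line below are illustrative):

```cpp
float* ptr = nullptr;
CUDA_CHECK(cudaMalloc(&ptr, (size_t)1 << 60)); // absurdly large request
// Prints e.g. "CUDA error at my_kernel.cu:12: out of memory", then exits.
```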
Include everything: `#include "tinycuda/tinycuda.hpp"`.
- vector_add.cu: Element-wise add with Buffer + Profiler + verify.
- matmul.cu: Non-tiled matrix multiply (512x512) with GFLOPS calc.
Build any example: `./scripts/build_and_run.sh examples/matmul.cu matmul`.
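For reference, a complete program in the same style as vector_add.cu might look like this (a sketch based on the API shown above; the actual example file may differ):

```cpp
#include "tinycuda/tinycuda.hpp"
#include <cstdio>
#include <vector>

__global__ void add(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    std::vector<float> ha(N, 1.0f), hb(N, 2.0f), hout(N, 0.0f);
    tinycuda::Buffer<float> a(ha.data(), N), b(hb.data(), N), out(hout.data(), N);
    a.to_gpu(); b.to_gpu(); out.to_gpu();

    const int threads = 256, blocks = (N + threads - 1) / threads;
    float ms = tinycuda::KernelProfiler(5, 50)([&] {
        add<<<blocks, threads>>>(a.gpu_data(), b.gpu_data(), out.gpu_data(), N);
    });

    out.to_cpu(); // results land in hout
    std::printf("Avg: %.4f ms, out[0] = %.1f (expect 3.0)\n", ms, (double)hout[0]);
}
```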
tinycuda is a header-only library — no compilation or linking required. Just include the headers and compile your .cu files with nvcc.
- Download (or clone) the repo and copy the `include/tinycuda/` folder into your project directory (or a `third_party/` subdir).
- Include it in your `.cu` files:

  ```cpp
  #include "tinycuda/tinycuda.hpp"  // bundles everything
  // Or granular: #include "tinycuda/memory.hpp" etc.
  ```
Compile with nvcc (C++17 required):

```bash
nvcc -std=c++17 -I/path/to/tinycuda/include your_kernel.cu -o your_output
./your_output
```

Example with a simple kernel:

```bash
nvcc -std=c++17 -I./include my_add.cu -o my_add
./my_add
```

For global access:
- Copy `include/tinycuda/` to `/usr/local/include/tinycuda/` (or similar).
- Compile from anywhere:

  ```bash
  nvcc -std=c++17 -I/usr/local/include your_kernel.cu -o your_output
  ```
No dependencies beyond CUDA Toolkit. That's it — write your kernel and run! 🚀
```text
.
├── include/tinycuda/   # Headers
│   ├── error.hpp
│   ├── memory.hpp
│   ├── profiler.hpp
│   └── tinycuda.hpp    # Aggregator
├── examples/           # Demos
├── tests/              # Unit tests
├── scripts/            # Build/run helpers
└── README.md
```
MIT License—see LICENSE. Free to fork, extend, or teach with.
- Tests: Add to `tests/` and run `./scripts/run_tests.sh`.
- Issues/PRs: Welcome! Focus on simplicity.
Questions? Open an issue or ping on LinkedIn.