Releases: Industrial-Algebra/Borsalino
Borsalino v0.2.1
Hotfix release — removes accidentally published early-research content.
No other changes from v0.2.0.
Borsalino v0.2.0
New Features
- Async dispatch — `dispatch_async()` returns `Pulse` handle for non-blocking GPU execution. VkFence (Vulkan), MTLCommandBuffer (Metal). Drop performs implicit join.
- Persistent buffers — `create_device_buffer()` keeps data on GPU across dispatches. VRAM on discrete GPUs, zero-copy on unified memory.
- GPU timestamps — `gpu.timestamp()` for profiling. Vulkan: vkCmdWriteTimestamp query pool.
- 2D/3D tiled dispatch — WGSL shared memory + barriers for tiled matmul.
- Candle integration — custom element-wise GPU kernel pattern for complementing ML frameworks.
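The "Drop performs implicit join" behaviour of the async handle can be illustrated without a GPU. The following is a minimal sketch in plain Rust, assuming a CPU thread as a stand-in for in-flight GPU work. `MockPulse`, `spawn`, and `wait` are hypothetical names invented for this illustration and are not Borsalino's API; only the ownership pattern is the point.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a `Pulse`-style handle (NOT Borsalino's type).
/// A CPU thread plays the role of the in-flight GPU work.
pub struct MockPulse {
    worker: Option<JoinHandle<Vec<f32>>>,
}

impl MockPulse {
    /// Start a SAXPY (a * x + y) in the background and return immediately.
    pub fn spawn(a: f32, x: Vec<f32>, y: Vec<f32>) -> Self {
        let worker = thread::spawn(move || {
            x.iter()
                .zip(y.iter())
                .map(|(xi, yi)| a * xi + yi)
                .collect::<Vec<f32>>()
        });
        MockPulse { worker: Some(worker) }
    }

    /// Explicit join: block until the work finishes and take the result.
    pub fn wait(mut self) -> Vec<f32> {
        self.worker
            .take()
            .expect("worker is present until wait or drop")
            .join()
            .expect("worker panicked")
    }
}

impl Drop for MockPulse {
    /// Implicit join: if `wait` was never called, block here so the
    /// background work cannot outlive the handle.
    fn drop(&mut self) {
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

Calling `MockPulse::spawn(...)` returns at once, `wait()` retrieves the result, and letting the handle go out of scope without `wait()` still blocks until the work is done. Holding the `JoinHandle` in an `Option` is what lets both paths take it exactly once.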
Benchmarks
| Platform | Tiled Matmul 8192 | Batched SAXPY 1M | Dispatch |
|---|---|---|---|
| GB10 (RTX Spark) | 1,403 GFLOPS | 372 GFLOPS | 0.4 µs |
| RTX 5080 | 523 GFLOPS | 477 GFLOPS | 0.5 µs |
| M3 Pro | 186 GFLOPS | 42 GFLOPS | 142 µs |
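The release notes do not state how floating-point operations were counted for these figures. A common convention, assumed here, is 2·N³ operations for an N×N matrix multiply and 2 operations per element for SAXPY. The helper names below are hypothetical and serve only to make the arithmetic explicit.

```rust
/// FLOPs for C = A * B with square N x N matrices, assuming one multiply
/// and one add per inner-product term (2 * N^3). This convention is an
/// assumption, not something the release notes state.
pub fn matmul_flops(n: u64) -> u64 {
    2 * n * n * n
}

/// FLOPs for SAXPY (y = a * x + y) over `len` elements, assuming one
/// multiply and one add per element.
pub fn saxpy_flops(len: u64) -> u64 {
    2 * len
}

/// Convert an operation count and a measured wall time into GFLOPS.
pub fn gflops(flops: u64, seconds: f64) -> f64 {
    flops as f64 / seconds / 1e9
}
```

Under this convention an 8192×8192 matmul is 2⁴⁰ (about 1.1 × 10¹²) operations, so a GFLOPS figure can be derived by passing that count and a measured kernel time to `gflops`.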
Breaking Changes
None. All additions are backward-compatible trait methods with default implementations.
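Why default implementations keep a trait addition backward-compatible can be shown with a small sketch. The trait and types below (`Backend`, `run`, `run_batch`, `Doubler`, `Negator`) are hypothetical and do not mirror Borsalino's real signatures; they only demonstrate the language mechanism.

```rust
/// Hypothetical backend trait; names and signatures are illustrative only.
pub trait Backend {
    /// Method that existed before the new release.
    fn run(&self, input: &[f32]) -> Vec<f32>;

    /// Method added later. Because it has a default body built on `run`,
    /// implementors written before it existed still compile unchanged.
    fn run_batch(&self, inputs: &[Vec<f32>]) -> Vec<Vec<f32>> {
        inputs.iter().map(|input| self.run(input)).collect()
    }
}

/// An "old" implementor that only knows about `run`.
pub struct Doubler;

impl Backend for Doubler {
    fn run(&self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|v| v * 2.0).collect()
    }
}

/// A "new" implementor that opts in by overriding the default.
pub struct Negator;

impl Backend for Negator {
    fn run(&self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|v| -v).collect()
    }

    fn run_batch(&self, inputs: &[Vec<f32>]) -> Vec<Vec<f32>> {
        // Stand-in for a specialised batched path.
        inputs
            .iter()
            .map(|input| input.iter().map(|v| -v).collect())
            .collect()
    }
}
```

`Doubler` gains `run_batch` without any source change, while `Negator` can supply a specialised version. Both expose the same interface to callers.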
Full details: CHANGELOG.md
Borsalino v0.1.0
Thin GPU compute abstraction for the Industrial Algebra ecosystem.
One trait, two backends (Metal + Vulkan), zero ceremony.
Highlights
- Vulkan backend — full `GpuBackend` trait via ash 0.38 FFI. WGSL -> SPIR-V via naga
- WGSL shader language — write once, run on Metal + Vulkan
- Device-local memory — auto-detects VRAM vs unified, RTX 5080: 15x improvement
- Batched dispatch — `dispatch_many()` amortises overhead, 0.5 µs/dispatch, 577 GFLOPS
- Metal Apple Silicon M3 — seven root causes resolved
- Dual license — AGPL-3.0 + commercial
Benchmarks
| Platform | SAXPY 16M | Batched SAXPY 1M | Per-dispatch |
|---|---|---|---|
| RTX 5080 | 92 GFLOPS | 577 GFLOPS | 0.5 µs |
| GB10 | 49 GFLOPS | 208 GFLOPS | 1.0 µs |
| M3 Pro | 30 GFLOPS | — | 136 µs |
| AMD iGPU | 17 GFLOPS | — | 45 µs |
Full details: CHANGELOG.md