Release Borsalino v0.2.1 · Industrial-Algebra/Borsalino

Borsalino v0.2.1

Hotfix release — removes accidentally published early-research content.
No other changes from v0.2.0.

Async dispatch: dispatch_async() returns a Pulse handle for non-blocking GPU execution, backed by a VkFence on Vulkan and an MTLCommandBuffer on Metal. Dropping the handle performs an implicit join.
Persistent buffers: create_device_buffer() keeps data on the GPU across dispatches, in VRAM on discrete GPUs and as zero-copy memory on unified-memory systems.
GPU timestamps: gpu.timestamp() for profiling. On Vulkan it uses a vkCmdWriteTimestamp query pool.
2D/3D tiled dispatch: WGSL shared memory and barriers for tiled matmul.
Candle integration: a custom element-wise GPU kernel pattern that complements ML frameworks.
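
The "Drop performs implicit join" behaviour of the async handle can be illustrated with a CPU-only sketch. This is not Borsalino's implementation: the `Pulse` name is borrowed from the notes above, while the generic parameter, the `wait` method, the `saxpy_async` helper, and the use of a `std::thread` worker in place of a GPU fence or command buffer are all assumptions made for illustration.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a handle to in-flight work. The real `Pulse`
/// is described as wrapping a GPU fence or command buffer; this sketch
/// wraps a CPU thread instead.
pub struct Pulse<T> {
    worker: Option<JoinHandle<T>>,
}

impl<T: Send + 'static> Pulse<T> {
    /// Starts `work` in the background and returns immediately.
    pub fn dispatch_async<F>(work: F) -> Self
    where
        F: FnOnce() -> T + Send + 'static,
    {
        Pulse { worker: Some(thread::spawn(work)) }
    }
}

impl<T> Pulse<T> {
    /// Explicit join (assumed method name): blocks until the work finishes
    /// and returns its result.
    pub fn wait(mut self) -> T {
        self.worker
            .take()
            .expect("worker is present until wait or drop")
            .join()
            .expect("worker panicked")
    }
}

impl<T> Drop for Pulse<T> {
    /// Implicit join: a handle dropped without `wait` still blocks until
    /// the work completes, so the work never outlives its handle.
    fn drop(&mut self) {
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

/// SAXPY (y = a*x + y) submitted through the handle, as a usage example.
pub fn saxpy_async(a: f32, x: Vec<f32>, mut y: Vec<f32>) -> Pulse<Vec<f32>> {
    Pulse::dispatch_async(move || {
        for (yi, xi) in y.iter_mut().zip(&x) {
            *yi += a * xi;
        }
        y
    })
}
```

Calling `saxpy_async(...)` returns at once; the caller can do other work and then call `wait()` for the result, or simply let the handle go out of scope and rely on the join in `Drop`. Keeping the worker in an `Option` is what lets `wait` and `drop` share the same join logic without joining twice.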

Platform	Tiled Matmul 8192	Batched SAXPY 1M	Dispatch
GB10 (RTX Spark)	1,403 GFLOPS	372 GFLOPS	0.4 µs
RTX 5080	523 GFLOPS	477 GFLOPS	0.5 µs
M3 Pro	186 GFLOPS	42 GFLOPS	142 µs
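
To put the matmul column in terms of wall-clock time, the sketch below converts a GFLOPS figure back into seconds per multiply. It assumes that "8192" means square 8192 x 8192 matrices and that throughput was computed with the conventional 2N^3 operation count; the release notes state neither, and the function names are hypothetical.

```rust
/// Floating-point operations in a dense N x N by N x N matrix multiply,
/// assuming one multiply and one add per inner-loop step (2 * N^3).
pub fn matmul_flops(n: u64) -> f64 {
    2.0 * (n as f64).powi(3)
}

/// Seconds needed to execute `flops` operations at `gflops` GFLOPS.
pub fn seconds_at(gflops: f64, flops: f64) -> f64 {
    flops / (gflops * 1e9)
}
```

Under those assumptions, `seconds_at(1403.0, matmul_flops(8192))` comes to roughly 0.78 s per multiply for the GB10 row.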

Breaking changes: none. All additions are backward-compatible trait methods with default implementations.
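
The compatibility claim rests on a standard Rust property: adding a trait method that has a default body does not force existing implementors to change. A minimal sketch follows, with a hypothetical `Backend` trait and implementors whose names and signatures are illustrative only, not taken from Borsalino.

```rust
/// Hypothetical backend trait; names are illustrative, not Borsalino's.
pub trait Backend {
    /// Required method that existed before the new release.
    fn dispatch(&self, input: &[f32]) -> Vec<f32>;

    /// Method added later. Because it has a default body, existing
    /// implementors compile unchanged and report "unsupported".
    fn timestamp(&self) -> Option<u64> {
        None
    }
}

/// An implementor written against the old trait; it never mentions `timestamp`.
pub struct LegacyBackend;

impl Backend for LegacyBackend {
    fn dispatch(&self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|v| v * 2.0).collect()
    }
}

/// A newer implementor that opts in by overriding the default.
pub struct ProfilingBackend {
    pub ticks: u64,
}

impl Backend for ProfilingBackend {
    fn dispatch(&self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|v| v * 2.0).collect()
    }

    fn timestamp(&self) -> Option<u64> {
        Some(self.ticks)
    }
}
```

Returning `Option` from the default lets callers distinguish "this backend does not support the feature" from a real value, which is one common way to add optional capabilities without a breaking change.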

Full details: CHANGELOG.md