v1.13.0 #1420
shi-eric
announced in
Announcements
Warp v1.13.0
Warp v1.13 introduces experimental graph capture serialization with CPU replay, letting captured simulations round-trip through a portable .wrp file and load from standalone C++ on either GPU or CPU. It also adds an experimental cuBQL BVH backend for wp.Mesh that accelerates ray-heavy mesh queries, the wp.bfloat16 scalar type, a pluggable CUDA allocator interface with built-in RAPIDS Memory Manager (RMM) integration, scoped memory tracking with C++-layer call-site attribution, and a batch of new tile primitives (tile_dot, tile_axpy, tile_stack, scatter helpers).
New features
Graph capture serialization and CPU replay
Important
This is an experimental feature. The API may change without a formal deprecation cycle.
Warp v1.13 introduces a portable serialized-graph format. Operations recorded during wp.capture_begin(apic=True) / wp.capture_end() can be saved to a .wrp file with wp.capture_save() and replayed from either Python or standalone C++ via wp.capture_load(), enabling cross-process and cross-language graph reuse (#1349). CPU graph capture is also new in this release: the same wp.Graph object now replays on CPU through wp.capture_launch(), and the underlying APIC operation log is what gets serialized. A new wp.handle (a uint64 alias) carries wp.Mesh handles across save and load so kernels can keep referencing meshes after deserialization.
Loading and replaying from standalone C++ (CPU device shown). The full example also walks the _modules/ directory, loads each .o via wp_load_obj, resolves kernel symbols, and registers them with wp_apic_register_loaded_cpu_kernel before the first replay. The snippet below elides that boilerplate:
See warp/examples/cpp/02_apic_visualization (CUDA replay) and warp/examples/cpp/03_apic_visualization_cpu (CPU replay) for end-to-end demos with OpenGL visualization.
What gets written:
Key capabilities:
wp.capture_save(graph, path, inputs=..., outputs=...) registers named bindings so the consumer side can swap in fresh inputs and read outputs by name without touching the graph topology.
wp.capture_load() + wp.capture_launch() support replay on both CPU and CUDA. Loaded graphs expose set_param, get_param, and get_param_ptr for each registered binding, plus params and is_loaded properties on wp.Graph.
wp.handle scalar type and wp.Mesh remap let kernels accept mesh handles whose underlying objects are reconstructed on load. APIC walks @wp.struct fields recursively to find handle pointers and remap them.
Stability and known gaps:
API Capture is experimental, and we plan to keep adding capabilities and closing gaps over future releases (tracker: #1388). For now, regenerate .wrp artifacts when upgrading Warp. The current operation set, handle types, and platform constraints are documented in the Graphs section of the user guide.
cuBQL BVH backend for wp.Mesh
Important
This is an experimental feature. The API may change without a formal deprecation cycle.
wp.Mesh now accepts bvh_constructor="cubql" to build its acceleration structure with cuBQL, an Apache 2.0-licensed header-only CUDA library for fast BVH construction and traversal (#1286). For ray-heavy workloads on dense static meshes, where the existing SAH builder's exhaustive construction dominates setup time and where ray traversal sits on the simulation hot path, cuBQL typically delivers faster ray queries alongside consistently lower build times than the SAH, median, and LBVH builders. As one specific data point, a Warp-based renderer benchmark on an RTX 4090 (Franka Emika Panda visual mesh, 8192 parallel worlds) saw simulation time drop from 1.41 s to 0.98 s after switching the constructor with no other changes. Speedups depend heavily on mesh size, query mix, and how much of the frame the mesh queries occupy, so benchmark on your own scene before relying on a particular win.
The "cubql" backend currently only routes wp.mesh_query_ray() through cuBQL's traversal kernels. Extending it to point queries, AABB queries, grouped queries, and winding-number support is future work. Today, passing groups=... or support_winding_number=True to a cuBQL wp.Mesh raises a RuntimeError at construction. Calling wp.mesh_query_point_* or wp.mesh_query_aabb_* against a cuBQL mesh silently returns no results. Stick with the default SAH/median/LBVH builders for kernels that mix query types or aren't ray-bound.
Pluggable CUDA allocator and RMM integration
CUDA device-memory allocations can now be routed through any object implementing the wp.Allocator protocol via wp.set_cuda_allocator(), wp.set_device_allocator(), or scoped with wp.ScopedAllocator (#781). The built-in wp.utils.AllocatorRmm delegates to RAPIDS Memory Manager so Warp can share a memory pool with PyTorch, CuPy, or any other RMM-aware framework, eliminating duplicate caching on GPUs that train and simulate in the same process.
wp.utils.AllocatorRmm requires the rmm package on Linux (pip install rmm-cu12).
Memory tracking
wp.ScopedMemoryTracker and the wp.config.track_memory flag enable allocation tracking with call-site attribution and per-category reports across GPU, host, and pinned-host memory (#1269). Tracking is implemented in the C++ native layer by intercepting all wp_alloc_* / wp_free_* calls, so internal allocations from BVH, hash-grid, mesh, volume, and sparse subsystems show up alongside Python-originated arrays, labeled with their subsystem (e.g. (native:bvh), (native:hashgrid), (native:mesh)).
Nest trackers to form hierarchical scopes (e.g. "simulation/collision"), and pass report_func=... to redirect the report to a logger or test-assert callback instead of stdout. For unscoped tracking across an entire process, set wp.config.track_memory = True before wp.init() and call wp.print_memory_report() at any time.
wp.bfloat16 scalar data type
wp.bfloat16 joins wp.float16, wp.float32, and wp.float64 as a first-class Warp scalar type (#1332). It supports array allocation, kernel execution, autodiff, DLPack, PyTorch (wp.from_torch(t) for torch.bfloat16 tensors), JAX, and optional NumPy interop via the ml_dtypes package. wp.tile_matmul() and atomic operations also accept bfloat16 tiles, so transformer-style mixed-precision kernels can stay in Warp end to end.
Note
Long chains of bfloat16 math accumulate rounding error at bfloat16 precision because each intermediate result is quantized back to 16 bits before feeding into the next op. This is true whether the underlying op runs on native bf16 hardware (Warp dispatches +, -, * to PTX bf16 instructions on sm_80 and newer GPUs) or goes through a float32 round-trip (/, comparisons, and math built-ins like wp.sqrt / wp.exp, matching how PyTorch, JAX, and CuPy handle those ops). For precision-sensitive code, cast to wp.float32 at the start of the chain and back to wp.bfloat16 only at the boundary so intermediates stay in float32.
Zero-copy texture interop
wp.Texture1D, wp.Texture2D, and wp.Texture3D accept an externally allocated CUDA array via the new cuda_array= parameter, sharing memory without an extra copy (#1238). For OpenGL interop, the new wp.GLTextureResource class registers an OpenGL texture for use as a Warp texture, so rendering pipelines can sample from textures Warp writes (and vice versa) without going through host memory. Texture objects also gain copy_from() and copy_to() methods for transfers between textures, host arrays, and device arrays. The previous copy_from_array() / copy_to_array() methods are now deprecated.
Tile programming enhancements
New tile primitives
This release adds several tile primitives that shorten common kernel patterns:
wp.tile_dot(a, b) (#1364) computes the dot product of two same-shape tiles, returning a single-element tile of the underlying scalar type. For tiles of vectors or matrices, each element pair is fully contracted (e.g. wp.dot(a[i], b[i]) for tiles of vec3f).
wp.tile_axpy(alpha, src, dest) (#1363) performs the fused in-place update dest += alpha * src without allocating an intermediate scaled tile.
wp.tile_scatter_add(a, i, value, has_value, atomic=True) (#1342) is a cooperative scatter-add into a shared-memory tile. Set atomic=False for faster writes when indices are guaranteed unique across threads.
wp.tile_scatter_masked() (#1298) writes per-thread values into a shared-memory tile with cooperative synchronization, so each lane can publish a value at its own index without manual barriers.
wp.tile_query_valid() (#1335) exposes a cleaner loop condition for tile BVH and mesh AABB queries that avoids the wp.tile_max() reduction overhead the previous pattern required.
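To make the semantics of three of these primitives concrete, here is a plain-Python reference model. It is only a sketch under stated assumptions: the names tile_dot_ref, tile_axpy_ref, and tile_scatter_add_ref are hypothetical, tiles are modeled as Python lists, and the per-thread arguments of the scatter are modeled as parallel lists with one entry per thread. It does not call Warp and says nothing about how the real builtins are implemented.

```python
# Plain-Python reference model of three tile primitives described above.
# All function names here are hypothetical; none of this is Warp API.

def tile_dot_ref(a, b):
    """Contract every element pair of two same-shape tiles and sum the results."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (tuple, list)):
            # Vector-valued elements: fully contract the pair, like a dot product.
            total += sum(xi * yi for xi, yi in zip(x, y))
        else:
            total += x * y
    return [total]  # modeled as a single-element tile


def tile_axpy_ref(alpha, src, dest):
    """In-place update dest += alpha * src, with no intermediate scaled tile."""
    for i, s in enumerate(src):
        dest[i] += alpha * s


def tile_scatter_add_ref(tile, indices, values, has_value):
    """One cooperative call: each thread adds its value at its own index.

    Threads with has_value=False take part in the call but write nothing.
    """
    for i, v, ok in zip(indices, values, has_value):
        if ok:
            tile[i] += v


dest = [1.0, 1.0, 1.0]
tile_axpy_ref(2.0, [1.0, 2.0, 3.0], dest)

shared = [0.0, 0.0, 0.0, 0.0]
# Threads 0 and 1 both target index 1; thread 2 opts out.
tile_scatter_add_ref(shared, [1, 1, 3], [1.0, 2.0, 5.0], [True, True, False])

print(tile_dot_ref([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # [32.0]
print(dest)    # [3.0, 5.0, 7.0]
print(shared)  # [0.0, 3.0, 0.0, 0.0]
```

In the scatter call two threads target the same index and both contributions must survive. That is the case in which, per the description above, atomic=False would not be safe on real hardware, because it is meant for indices that are guaranteed unique across threads.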
Cooperative thread-block stack
wp.tile_stack() allocates a cooperative thread-block stack in shared memory, with wp.tile_stack_push(), wp.tile_stack_pop(), wp.tile_stack_clear(), and wp.tile_stack_count() operations (#1287). The stack lets all threads in a block contribute or consume entries without manual atomic-counter management. wp.tile_stack_push() accepts a has_value flag so threads can opt out of pushing while staying in the cooperative call, and wp.tile_stack_pop() returns a (value, slot) pair with slot == -1 for threads that did not get an element (for example, when the stack runs dry).
Cholesky improvements
wp.tile_cholesky(), wp.tile_cholesky_inplace(), wp.tile_cholesky_solve(), and wp.tile_cholesky_solve_inplace() accept a new fill_mode parameter ("lower" or "upper") for upper Cholesky factorization and solve (#1318). The default lower path reads its input by columns, which hits a known slowdown at power-of-2 tile sizes. The new fill_mode="upper" path reads by rows and avoids that cliff, running roughly 1.2x to 1.7x faster on power-of-2 tiles in microbenchmarks on an RTX PRO 6000 Blackwell. Out-of-place tile_cholesky factorization also gains a backward pass, so it can be used as part of a differentiable solve (#1316).
Multi-dimensional tile FFTs
wp.tile_fft() and wp.tile_ifft() now accept tiles of any rank >= 2 instead of being limited to 2-D, computing the FFT along the last dimension and treating any leading dimensions as independent batches (#1317). A tile of shape (B1, B2, N) or (B1, B2, B3, N) no longer needs to be reshaped to a 2-D (B, N) tile before transforming. The other constraints carried over from the 2-D path are unchanged: the input must be a register-storage tile of wp.vec2f or wp.vec2d (interpreted as complex pairs), and the FFT length (the last dimension) must be at least 2 * block_dim.
Faster tile load and store
wp.tile_load() and wp.tile_store() get two new fast paths without any kernel changes (#1236):
Tiles of wp.mat33, wp.mat44, or wp.mat66 elements get a new shared-memory copy path that replaces the previous per-element loop, restoring the bandwidth that the old path lost on multi-byte elements. The bigger the element, the bigger the speedup.
A new aligned parameter on both functions skips runtime alignment checks when the caller can guarantee 16-byte alignment, contiguity, and in-bounds access:
Tile parameters by reference in @wp.func_native
@wp.func_native snippets now receive tile parameters by reference instead of by value, matching the behavior of @wp.func (#1362). Native snippets can therefore modify shared tiles in place rather than receiving a local copy that gets discarded on return.
warp.fem double-precision support
warp.fem gains end-to-end wp.float64 support (#418). Precision is selected on the geometry (e.g. scalar_type=wp.float64 on grid constructors) and propagates automatically to function spaces, quadrature, fields, and integration kernels. Existing wp.float32 setups are unchanged.
The default output_dtype of warp.fem.integrate() now follows the geometry's scalar type (wp.float32 or wp.float64) instead of the accumulate_dtype. See Breaking changes for the migration.
Compilation and tooling
Faster CPU compile times
CPU kernel compile times in multi-module workloads drop substantially via precompiled-header support, controlled by the existing warp.config.use_precompiled_headers setting (#595). Repeated module compiles in a session no longer re-parse the bundled headers on every call.
CPU kernels target host CPU features by default
CPU kernels now compile with -march=native by default, so generated code automatically picks up the wider SIMD instruction sets (AVX2, AVX-512, NEON variants) available on the build host instead of targeting a generic x86-64 / aarch64 baseline (#1308). Existing kernels do not need any change to benefit.
The follow-on implication is that AOT-compiled CPU modules and shared CPU kernel caches are now host-specific. wp.compile_aot_module() emits a warning when it produces a CPU module under the new default, because loading that module on a host with a narrower CPU feature set would fail with an illegal-instruction crash. Set wp.config.cpu_compiler_flags = "" before compiling to opt back into a portable baseline build. Cached CPU .o filenames also include a short host-ISA hash (e.g. module.cpu1a2b3c4d.o), so heterogeneous CI runners and shared kernel caches no longer cross-load incompatible binaries.
Inline module options on @wp.kernel
@wp.kernel(module="unique", module_options={...}) accepts a dict of module-level compilation options inline (#1250), avoiding the previous pattern of toggling warp.config.* globally before defining the kernel. One useful application is disabling cuBLASDx for wp.tile_matmul during development to skip its slow LTO compile step:
On a clean kernel cache the kernel above compiles in roughly a second, instead of the tens of seconds the default cuBLASDx-backed path can take.
Other improvements
[...] -O2, CUDA defaults remain -O3) ([REQ] Raise Clang optimization level to -O3 for CPU kernels #1310).
[...] None) and its explicit default, avoiding spurious recompilation ([BUG] Centralize module option resolution to fix hash/build inconsistencies #1307).
wp.get_suggested_block_size(kernel) queries the CUDA driver's occupancy API for a launch configuration that maximizes per-SM occupancy (#1270). Returns (block_size, min_grid_size). Pass block_size to wp.launch(..., block_dim=...).
Language enhancements
[...] and/or (#1329): chained boolean operators in kernels now use Python's short-circuit semantics, so guards like if arr and arr[i] == 0 no longer crash from eagerly evaluating the right-hand side.
wp.vec3d(), wp.mat22h(), wp.quatd(), etc. now accept Python scalar literals directly without an explicit cast, preserving precision.
wp.indexedarray fields in @wp.struct (#1327): structs may now hold wp.indexedarray fields, with assignment, device transfer, and NumPy structured-value support working the same way as wp.array fields.
wp.Volume anisotropic voxels (#1193): wp.Volume.load_from_numpy() and wp.Volume.allocate() accept a 3-element sequence for voxel_size, enabling anisotropic voxel spacing.
Breaking changes
tape.backward() zeroes intermediate gradient buffers (#1062)
tape.backward() now zeroes the gradient buffer of any array written to in the forward pass, matching PyTorch's behavior for intermediate gradients. This is a correctness fix: previously, calling tape.backward() multiple times accumulated stale .grad from the prior call into the new pass.
Backward passes of kernels with many per-element array writes (e.g. matrix component assignments) may be slower because of the additional zeroing. See the differentiability guide for the full rationale.
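The effect of stale intermediate gradients can be seen in a toy scalar model. The sketch below is plain Python, not Warp: it hand-writes the adjoints of the chain y = 2 * x, loss = 3 * y and runs the backward pass twice. The function name run_backward_twice and the choice to zero y's gradient at the start of each pass are assumptions made for illustration. The text above states only that buffers of arrays written in the forward pass are zeroed, not when.

```python
# Toy model of repeated backward passes over y = 2 * x, loss = 3 * y.
# Hypothetical helper, not Warp API; gradients are plain floats in a dict.

def run_backward_twice(zero_written_arrays: bool) -> float:
    """Run the backward pass twice and return the accumulated gradient of x."""
    grads = {"x": 0.0, "y": 0.0, "loss": 0.0}
    for _ in range(2):
        if zero_written_arrays:
            # y is written during the forward pass, so its buffer is cleared
            # (assumed here to happen at the start of each backward pass).
            grads["y"] = 0.0
        grads["loss"] = 1.0                  # seed d(loss)/d(loss)
        grads["y"] += 3.0 * grads["loss"]    # adjoint of loss = 3 * y
        grads["x"] += 2.0 * grads["y"]       # adjoint of y = 2 * x
    return grads["x"]


stale = run_backward_twice(zero_written_arrays=False)
fresh = run_backward_twice(zero_written_arrays=True)
print(stale, fresh)  # 18.0 12.0
```

The true derivative of loss with respect to x is 6, so two accumulating passes should give 12. Without zeroing, the second pass reuses y's gradient from the first pass and the input gradient is inflated to 18.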
warp.fem.integrate() output_dtype default change (#418)
The default output_dtype of warp.fem.integrate() now follows the geometry's scalar type instead of accumulate_dtype (which itself defaults to wp.float64). For wp.float32 geometries this changes the default output from wp.float64 to wp.float32. To keep the previous behavior, pass output_dtype=wp.float64 explicitly.
Announcements
Release cadence
Warp now aims to publish a feature release every month. This replaces the previous schedule that alternated monthly between feature and bugfix releases. Bugfix releases are no longer regularly scheduled. They are issued ad hoc, only when an important issue cannot wait for the next feature release, and only against the most recent feature release line. See the Compatibility & Support page for the full versioning, deprecation, and support policy.
Removals in this release
[...] warp.torch, warp.context, etc.), used warp.mat / warp.vec, referenced warp.context.Devicelike, or relied on the Module.foo -> Module._foo proxy now raises rather than warning (Remove Direct Access to Private API #1352). Use the curated public API in the warp namespace. The API documentation is the source of truth for what is public.
wp.isfinite(), wp.isnan(), and wp.isinf() no longer accept integer types. These functions now accept floating-point arguments only, finalizing the deprecation announced in v1.11 (#847). Drop integer call sites or wrap the operand in a float cast.
Upcoming removals
copy_from_array() / copy_to_array() deprecated. Use the new copy_from() / copy_to() methods instead (#1238). Calling the deprecated names emits a DeprecationWarning. They will be removed in a future feature release per the standard deprecation timeline.
Acknowledgments
We also thank the following contributors from outside the core Warp development team:
[...] wp.indexedarray field support in @wp.struct, with assignment, device transfer, and NumPy structured-value handling (#1327).
[...] wp.Volume.load_from_numpy() and wp.Volume.allocate(), including the shared _normalize_voxel_size() validation helper and tests (#1193).
For a complete list of changes, see the full changelog.
This discussion was created from the release v1.13.0.