v1.14.0 #1500
shi-eric announced in Announcements
Warp v1.14.0
Warp v1.14 expands serialized CPU capture support: captured graphs can now include backward launches, tiled kernels, richer launch arguments, and .wrp files with arrays nested inside structs or wp.indexedarray arguments. This release also adds multi-environment FEM support for batched simulations, reusable and batched linear solvers, pluggable logging, portable tile FFT and solver fallbacks, stable JAX integration APIs, and relaxed CPU/GPU array access for Heterogeneous Memory Management (HMM) and Address Translation Services (ATS) systems.
New features
API Capture expands to cover more workflows
Building on the initial API Capture serialization support in Warp v1.13, Warp v1.14 primarily broadens the set of CPU graph patterns that can be saved and replayed. CPU captures can now include forward execution and reverse-mode passes from wp.Tape().backward(), wp.launch_tiled() kernels, and scalar parameters of any size (#1431). The shared .wrp serialization format also now supports @wp.struct arguments that contain arrays and wp.indexedarray arguments that carry data, gradient, and index buffers.
Important
Upgrade impact for APIC users:
- Recapture .wrp files saved by Warp v1.13. Warp v1.14 writes APIC format version 10 and rejects the previous format.
- Update native C/C++ code that includes apic.h to spell APIC handles as the explicit pointers APICState* and APICGraph*. Ownership and destroy calls are unchanged. See the APIC migration diff.
Saved APIC graphs can still be consumed from standalone C++ through the C API declared in warp/native/apic.h. Native replay behavior is unchanged apart from the explicit pointer spelling for APIC handles.
Key capabilities:
- Reverse-mode passes from wp.Tape().backward() are recorded into the CPU APIC stream and replayed from live captures or loaded graphs.
- The .wrp format serializes arrays nested inside structs, wp.indexedarray arguments, and handles inside serialized launch value blobs.
Known limitations:
- wp.utils.array_scan() is still not recorded into CPU APIC and raises NotImplementedError in CPU capture.
- array.fill_() operations on CPU are not recorded.
- Serialized graphs support wp.Mesh handles, but wp.Volume and wp.Bvh handles are not yet supported.
- Replaying saved .wrp graphs requires the warp-clang backend and the companion _modules/ directory with the recorded CPU kernel objects.
Multi-environment warp.fem
warp.fem can now represent many independent simulation environments inside one geometry and one solve setup (#1407). Colocated Grid2D and Grid3D geometries expose an env_count, sparse Nanogrid and AdaptiveNanogrid geometries can pack per-environment voxels into one NanoVDB volume, and unstructured meshes can carry per-cell environment metadata through cell_env and env_count.
This feature changes two positional call signatures. See the FEM migration diff if your code passes requires_grad, device, or temporary_store positionally.
Environment-aware lookup keeps colocated environments from interacting accidentally. When a geometry has more than one environment, pass an environment index to fem.lookup() and pass env_indices to fem.PicQuadrature when particles are binned from world-space positions. The new warp/examples/fem/example_apic_fluid_multi_env.py example uses these APIs to run colocated APIC fluid environments with environment-aware particle quadrature and batched pressure solves.
Key capabilities:
- Nanogrid.from_environment_voxels() and AdaptiveNanogrid.from_environment_voxels() build packed sparse grids with per-cell cell_env metadata and hidden offsets for packed grids.
- Unstructured meshes carry cell_env and env_count so grouped BVH lookup only traverses the requested environment.
- make_space_partition(..., environment_first=True) exposes env_offsets that line up with the new linear-solver batch_offsets support for scalar spaces.
- environment_first=True does not support halo nodes for partitions that do not cover a whole geometry. Mesh environment indices are lookup and partition metadata, so callers must still provide disconnected mesh topology for independent mesh environments.
Reusable and batched linear solvers
The iterative solvers in warp.optim.linear can now preallocate their temporary buffers and reuse them across compatible solves (#1391). Passing run=False to cg(), cr(), bicgstab(), or gmres() returns a solver object that can be called repeatedly with replacement operands that have the same shape, dtype, device, and batch layout.
batch_offsets is independent of solver-state reuse. It partitions a LinearOperator into scalar degree-of-freedom intervals that are solved as independent subproblems in one solver launch sequence, with convergence checked per batch and reported through the worst residual. For one-shot solves, call the solver directly as before.
Pluggable logging
Warp's Python-side diagnostics now flow through a configurable logger (#1315, #1434). Install a custom logger with wp.set_logger(), scope it with wp.ScopedLogger, and control verbosity through wp.config.log_level or wp.ScopedLogLevel.
wp.config.log_level accepts the standard Warp log-level constants:
- wp.LOG_DEBUG: most verbose, including code-generation details and module loads.
- wp.LOG_INFO: the default, including the init banner and compile timings.
- wp.LOG_WARNING: warnings and errors only.
- wp.LOG_ERROR: errors only.
The default Warp logger now routes warnings emitted by Warp's Python code through Python's warnings.warn(), so application warning filters can suppress Warp deprecation warnings again. wp.config.verbose and wp.config.quiet still work during the deprecation window, but they now emit one-time DeprecationWarnings and map to wp.config.log_level.
Relaxed CPU/GPU array access
Warp no longer enforces the old rule that every Warp array passed to a kernel must be allocated on the launch device. wp.config.launch_array_access_mode now defaults to wp.config.LaunchArrayAccessMode.RELAXED, so CUDA kernels can receive CPU arrays on hardware where CUDA reports pageable CPU memory as GPU-accessible (#1461). The main targets are Linux Heterogeneous Memory Management (HMM) systems and NVIDIA Address Translation Services (ATS) platforms such as GH200, GB200, DGX Spark / GB10, and Jetson Thor, where ordinary CPU allocations can be GPU-accessible without an explicit .to(device) copy. Use wp.can_access() for concrete arrays and Device.can_access() for coarse device checks when code needs to choose a direct launch or an explicit copy at runtime.
Choose the validation mode based on how much pre-launch checking you want:
- wp.config.LaunchArrayAccessMode.RELAXED is the default. It checks type, dtype, and dimensions, but does not reject cross-device array arguments before launch. If a GPU kernel dereferences CPU memory that the device cannot access, the failure surfaces as a CUDA runtime error instead of Warp's previous Python same-device error.
- wp.config.LaunchArrayAccessMode.CHECKED raises a Python error before launch when Warp can prove that an array is not accessible from the launch device. It warns and proceeds for custom or externally wrapped allocations whose provenance Warp cannot verify.
- wp.config.LaunchArrayAccessMode.STRICT restores the old same-device rule and requires every Warp array argument to be allocated on the launch device.
Most users can keep the default. Use wp.config.LaunchArrayAccessMode.CHECKED when diagnosing mixed-device launches, and use wp.config.LaunchArrayAccessMode.STRICT in tests or libraries that depend on pre-launch same-device validation.
CUDA graph capture modes
CUDA graph capture now exposes CUDA's stream-capture mode through the capture_mode keyword on wp.ScopedCapture and wp.capture_begin() (#1410). Choose the mode based on how strictly CUDA should reject capture-unsafe runtime calls while capture is active:
- wp.CaptureMode.THREAD_LOCAL is the default and preserves Warp's historical behavior. Capture-unsafe runtime calls from the capturing thread invalidate the capture, while other threads are unaffected.
- wp.CaptureMode.GLOBAL is the strictest mode. Capture-unsafe runtime calls from any thread invalidate the capture.
- wp.CaptureMode.RELAXED tolerates capture-unsafe runtime calls. Use it when composing with libraries that may lazily initialize CUDA contexts or allocators during capture.
The function form uses the same capture_mode keyword, for example wp.capture_begin(device="cuda:0", capture_mode=wp.CaptureMode.RELAXED).
Tile programming enhancements
Portable tile FFT and solver fallbacks
wp.tile_fft() and wp.tile_ifft() now run on CPU and on GPU builds that do not include libmathdx (#1396). CPU supports any power-of-two FFT length and non-power-of-two lengths up to 4096 elements. The GPU fallback, selected automatically when libmathdx is unavailable or explicitly with enable_mathdx_fft=False, supports power-of-two FFT lengths divisible by block_dim.
GPU scalar fallbacks are also available for wp.tile_cholesky(), wp.tile_cholesky_solve(), wp.tile_lower_solve(), and wp.tile_upper_solve(), including in-place variants and the wp.tile_cholesky() adjoint (#1402). Select them with wp.config.enable_mathdx_solver=False or module_options={"enable_mathdx_solver": False}. The fallback avoids a libmathdx dependency and reduces compile cost, at the expense of runtime performance. One fallback limitation is that differentiated wp.tile_cholesky() allocates extra per-block scratch storage, so large GPU tiles can exceed the device's shared-memory budget. If you hit that limit, reduce the Cholesky tile size, switch to a narrower dtype, or use a libmathdx-enabled build with wp.config.enable_mathdx_solver=True.
wp.tile_empty()
wp.tile_empty() allocates an uninitialized register or shared-memory tile for kernels that overwrite every element before the first read (#1312). Use it instead of wp.tile_zeros() for full overwrites, because skipping zero-fill work can improve performance. Keep wp.tile_zeros() for accumulators or partial writes where the initial zeros are part of correctness.
Tile reliability fixes
Several tile fixes improve correctness in edge cases:
- wp.tile_load(), wp.tile_store(), and indexed tile operations now use 64-bit byte offsets, fixing overflows on arrays larger than 2 GiB (#1422). This correctness fix widens address arithmetic in the tile load/store paths, so tile-heavy kernels may see slightly higher address-calculation overhead.
- Register-to-shared and shared-to-register tile reassignment no longer fails in CUDA wp.tile_matmul() kernels and their adjoints (#1439, #1440).
- wp.tile_matmul() now rejects wp.bfloat16 output tiles. It also rejects wp.bfloat16 input tiles when backward compilation is enabled, because the backend cannot use bfloat16 accumulators for those paths (#1427).
JAX integration graduates to stable API
The JAX integration has been promoted from warp.jax_experimental into Warp's stable public API (#1370). New code should import warp.jax_kernel, warp.jax_callable, warp.clear_jax_callable_graph_cache, warp.JaxCallableGraphMode, and warp.JaxModulePreloadMode directly from the top-level warp namespace.
As part of the promotion, warp.jax_experimental is now a deprecated compatibility namespace and will be removed in Warp 1.16. The warp.jax_experimental.get_jax_callable_default_graph_cache_max() and warp.jax_experimental.set_jax_callable_default_graph_cache_max() helpers are also deprecated. Pass graph_cache_max to warp.jax_callable() or update the returned callable's graph_cache_max attribute instead. Top-level warp.jax_callable() defaults to graph_cache_max=32. Pass graph_cache_max=None for an unlimited graph cache.
Differentiable warp.jax_kernel() wrappers now accept launch_dims together with enable_backward=True (#1380). The dimensions are fixed at wrapper construction and reused for both the forward and adjoint launches, which is useful when the input array includes batch or channel dimensions that are not part of the kernel's wp.tid() iteration space.
When enable_backward=True, launch_dims cannot be overridden per call, and output_dims remains unsupported.
Compilation and source-build tooling
Source builds gain two new build options. build_lib.py --use-dynamic-cuda links Warp's native library against shared CUDA libraries instead of embedding them statically, for deployments that already provide the matching CUDA shared libraries at runtime (#1334). build_lib.py --sanitize=address builds the native libraries with AddressSanitizer instrumentation, a compiler/runtime memory-error detector for native out-of-bounds accesses, use-after-free, double-free, and similar bugs (#1387). Use it for debugging source builds when you want a failing test or repro to report the invalid memory access closer to where it happens.
On Linux, install Python development headers if a source build fails with a missing Python.h, for example python3-dev or libpython3-dev on Debian and Ubuntu (#1339). Warp now compiles a small Python C API extension for faster wp.float16 conversion to and from Python float. The extension uses CPython's vectorcall protocol, the public fast-call convention introduced by PEP 590. Linux and macOS debug builds now use -Og -g, which keeps debug information while preserving useful compiler diagnostics such as uninitialized-value analysis (#1414).
Math, autodiff, and correctness fixes
Floating-point wp.min(), wp.max(), wp.clamp(), wp.atomic_min(), and wp.atomic_max() now use NaN-as-missing semantics matching C fmin() and fmax() (#1376). When exactly one operand is NaN, the non-NaN operand wins. Vector reductions and wp.argmin()/wp.argmax() skip NaN slots. Adjoint and atomic variants route gradients to the operand chosen by the forward pass.
Other math and autodiff fixes are grouped by affected path:
- wp.copysign() is now available in kernels (#1444).
- Adjoints are now available for wp.curlnoise() in 2D, 3D, and 4D, so differentiable curl-noise force fields no longer produce zero gradients (#1012).
- In-place component writes into a wp.array, such as arr[i].y = rhs, m[i][r, c] = rhs, transform .p/.q writes, and scalar or composite struct-field writes, now propagate gradients (#583, #248, #1174).
- wp.closest_point_edge_edge() is more reliable for near-parallel float32 segments (#1437). Well-conditioned inputs keep the same output, while near-parallel cases now use a more stable closest-point computation and a bounded analytic adjoint.
- wp.array.fill_() now passes fill values up to 3968 bytes inline through kernel arguments on both non-contiguous and contiguous array fill paths (#1412). This fixes forked-stream capture failures for scalar, vector/matrix, and many struct fills. Larger fill values still use the previous temporary-storage fallback and are not guaranteed to compose in the same forked-stream capture scenario.
- Repeated module="unique" kernel declarations that depend on other Warp functions no longer retain stale Python module references (#1462), and autodiff metadata is now cleaned up for non-differentiable builtins (#988, #1466).
The component-write fix means the example below now gives x.grad == [2, 2, 2] because the adjoint crosses the stored out[i].y component instead of stopping at the component assignment.
Breaking changes
APIC file format and native graph handles (#1431)
Warp v1.14 writes APIC format version 10. .wrp files captured by Warp v1.13 used the older format and must be recaptured. Native C/C++ code that includes apic.h must update APIC handles from the old typedef form, which hid the pointer, to explicit pointers. Ownership and destroy calls are unchanged. The migration is a source-level spelling change.
FEM positional signatures (#1407)
warp.fem.PicQuadrature() now accepts env_indices before requires_grad, and warp.fem.make_space_partition() now accepts environment_first before the keyword-only device and temporary_store. Pass affected arguments by keyword to preserve behavior.
HashGrid query type annotations (#1452)
Generated docs and public stubs now expose wp.HashGridQuery as the single query type. Runtime aliases wp.HashGridQueryH and wp.HashGridQueryD still warn and forward during the deprecation window, but function annotations should migrate to avoid IDE or type-checker failures. Use unparameterized wp.HashGridQuery for the default wp.float32 query type, and parameterize it when the query uses another coordinate precision.
This is a source-typing change only. Runtime query objects and aliases remain compatible during the deprecation window.
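As an illustration of what "warn and forward" means for a deprecated alias, here is a minimal standard-library sketch. It does not use Warp and is not Warp's implementation: the fake_wp module, the stand-in HashGridQuery class, the string precision names, and the PEP 562 module-level __getattr__ hook are all assumptions chosen for the demonstration.

```python
import types
import warnings


class HashGridQuery:
    """Hypothetical stand-in for a query type parameterized by precision."""

    scalar = "float32"  # assumed default for the unparameterized type
    _specializations = {}

    def __class_getitem__(cls, scalar):
        # Cache one subclass per precision so repeated lookups return the same type.
        if scalar not in cls._specializations:
            name = f"HashGridQuery[{scalar}]"
            cls._specializations[scalar] = type(name, (HashGridQuery,), {"scalar": scalar})
        return cls._specializations[scalar]


# Old alias name -> precision it forwards to (illustrative mapping).
_DEPRECATED_ALIASES = {"HashGridQueryH": "float16", "HashGridQueryD": "float64"}


def _module_getattr(name):
    """PEP 562 hook: runs only when normal module attribute lookup fails."""
    if name in _DEPRECATED_ALIASES:
        scalar = _DEPRECATED_ALIASES[name]
        warnings.warn(
            f"{name} is deprecated; use HashGridQuery[{scalar!r}] instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return HashGridQuery[scalar]
    raise AttributeError(f"module 'fake_wp' has no attribute {name!r}")


# Build a throwaway module object so the sketch stays in one file.
fake_wp = types.ModuleType("fake_wp")
fake_wp.HashGridQuery = HashGridQuery
fake_wp.__getattr__ = _module_getattr


def run_demo():
    """Resolve the old alias and report whether it forwards to the new spelling."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        old = fake_wp.HashGridQueryD
    new = fake_wp.HashGridQuery["float64"]
    return old is new, [str(w.message) for w in caught]
```

In this sketch the old name resolves to the same object as the parameterized spelling, so only the annotation text changes when code migrates. Whether Warp's aliases are built this way is not stated in these notes.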
Linux source builds require Python.h (#1339)
Linux source builds now compile a small Python C API extension for faster wp.float16 conversions, so the build needs Python.h. If a source build fails with a missing Python.h, install your distribution's Python development package before rebuilding Warp. The extension uses CPython's vectorcall protocol, the public fast-call convention introduced by PEP 590.
Announcements
Upcoming removals
- warp.jax_experimental is deprecated. Import jax_kernel, jax_callable, clear_jax_callable_graph_cache, JaxCallableGraphMode, and JaxModulePreloadMode from top-level warp instead. The graph-cache default helpers are also deprecated. See "JAX integration graduates to stable API" for the import diff and graph-cache migration. The deprecated namespace will be removed in Warp 1.16 (#1370).
- warp.config.verbose and warp.config.quiet are deprecated. Use warp.config.log_level = wp.LOG_DEBUG for verbose diagnostics and warp.config.log_level = wp.LOG_WARNING to suppress the init banner. See "Pluggable logging" for accepted log-level values. The legacy flags will be removed in a future feature release per the standard deprecation timeline (#1315).
- wp.HashGridQueryH and wp.HashGridQueryD are deprecated. Use wp.HashGridQuery[wp.float16] or wp.HashGridQuery[wp.float64] in function annotations that need explicit query precision. Use plain wp.HashGridQuery for the default wp.float32 query type. See the HashGrid query annotation migration. Runtime aliases remain available during the deprecation window, but public stubs now expose only wp.HashGridQuery (#1452).
- The quadrature and domain arguments of warp.fem.interpolate(), plus the space argument of warp.fem.make_space_restriction() and warp.fem.make_space_partition(), remain available for this release and are now scheduled for removal in Warp 1.15.
Acknowledgments
We also thank the following contributors from outside the core Warp development team:
- […] launch_dims with differentiable warp.jax_kernel() wrappers (#1380).
For a complete list of changes, see the full changelog.
This discussion was created from the release v1.14.0.