Bug Description
wp.tile_dot() fails to compile for scalar wp.float64 tiles.
Repro
import numpy as np
import warp as wp

_TILE = 64


@wp.kernel(enable_backward=False)
def dot_tiled_f64(
    a: wp.array(dtype=wp.float64),
    b: wp.array(dtype=wp.float64),
    partials: wp.array(dtype=wp.float64),
):
    i = wp.tid()
    offset = i * _TILE
    at = wp.tile_load(a, (_TILE,), offset=(offset,))
    bt = wp.tile_load(b, (_TILE,), offset=(offset,))
    d = wp.tile_dot(at, bt)
    partials[i] = d[0]


wp.init()

device = "cuda:0"
n_tiles = 2
n = n_tiles * _TILE

a = wp.array(np.arange(n, dtype=np.float64), dtype=wp.float64, device=device)
b = wp.array(np.arange(n, dtype=np.float64), dtype=wp.float64, device=device)
partials = wp.zeros(n_tiles, dtype=wp.float64, device=device)

wp.launch_tiled(
    dot_tiled_f64,
    dim=[n_tiles],
    inputs=[a, b, partials],
    block_dim=_TILE,
    device=device,
)
Actual Result
CUDA compilation fails:
error: no operator "=" matches these operands
operand types are:
wp::tile_shared_t<wp::float64, ...> = wp::tile_register_t<float, ...>
var_9 = wp::tile_dot(var_5, var_8);
CPU compilation fails similarly with:
error: no viable overloaded '='
Expected Result
wp.tile_dot() should support scalar wp.float64 tiles and return a single-element wp.float64 tile.
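
For reference, the values the repro should write to partials once the kernel compiles can be worked out without Warp. The following is a minimal pure-Python sketch, assuming the same inputs as the repro (a = b = 0, 1, ..., n - 1, split into tiles of 64); the names TILE, N_TILES, and expected_partials are hypothetical helpers introduced here and are not part of Warp.

```python
TILE = 64     # mirrors _TILE in the repro
N_TILES = 2   # mirrors n_tiles in the repro


def expected_partials(n_tiles=N_TILES, tile=TILE):
    """Per-tile dot products for a = b = [0, 1, ..., n_tiles * tile - 1]."""
    values = [float(k) for k in range(n_tiles * tile)]
    return [
        # Each tile covers a contiguous slice of `tile` elements.
        sum(x * x for x in values[i * tile:(i + 1) * tile])
        for i in range(n_tiles)
    ]


print(expected_partials())  # [85344.0, 605536.0]
```

These sums are small enough to be exact in both float32 and float64, so they only check that the float64 kernel compiles and reduces over the right elements, not that it preserves double precision.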
Notes
The wp.float32 version of the same kernel compiles and runs correctly on both CPU and CUDA.
The likely issue is a mismatch between Python type inference and the native implementation:
- Python infers the result type as type_scalar_type(a.dtype), so float64 inputs expect a float64 result.
- Native tile_dot() computes its result type with decltype(tensordot(T{}, T{})).
- Scalar tensordot() appears to only have a float overload, so scalar wp::float64 operands resolve to a float result tile.
System Information
No response