The PyTorch implementation of point serialization in PointTransformerV3, especially hilbert_encode, adds noticeable overhead. Replacing it with a custom CUDA kernel yields a substantial speedup: inference time of Utonia (pretrained PTv3) on 130,000 points dropped from ~139.5 ms to ~108.6 ms, a ~22% reduction (NVIDIA A40, averaged over 10 runs after warmup).
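PTv3 orders points along a space-filling curve before attention, and a pure-PyTorch encoder has to build the curve key bit by bit with elementwise tensor ops, each a separate small kernel launch. As a rough illustration (not the repository's code), here is the same coordinates-to-key pattern using the simpler Morton (z-order) curve; hilbert_encode does strictly more work per bit:

```python
import torch

def morton_encode(grid_coord: torch.Tensor, depth: int = 16) -> torch.Tensor:
    """Interleave the bits of integer (x, y, z) grid coordinates into one
    z-order key. Illustrative sketch only: PTv3's hilbert_encode follows the
    same coordinates-to-1D-key pattern, but along the Hilbert curve."""
    x = grid_coord[:, 0].long()
    y = grid_coord[:, 1].long()
    z = grid_coord[:, 2].long()
    key = torch.zeros_like(x)
    for i in range(depth):  # one pass per bit: many tiny kernel launches
        key |= ((x >> i) & 1) << (3 * i + 2)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i)
    return key
```

A fused CUDA kernel computes the whole key per point in one launch, which is where the saving comes from.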
PyTorch Implementation
CUDA Implementation
Set your target CUDA architecture in libs/serialize_cuda/setup.py.
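The architecture is typically set through the nvcc compile flags in the extension build script. A minimal sketch of what such a setup.py can look like, assuming a standard torch.utils.cpp_extension build (file and module names here are assumptions, not the repo's exact contents; use the compute capability of your GPU, e.g. sm_86 for an A40):

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="serialize_cuda",
    ext_modules=[
        CUDAExtension(
            name="serialize_cuda",
            # source file names are placeholders
            sources=["serialize_cuda.cpp", "serialize_cuda_kernel.cu"],
            extra_compile_args={
                "cxx": ["-O3"],
                # set the arch for your GPU (sm_86 = Ampere, e.g. A40)
                "nvcc": ["-O3", "-gencode=arch=compute_86,code=sm_86"],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```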
cd libs/serialization
python setup.py install

Replace serialization/default.py with the provided version, which calls the CUDA kernels when available.
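The "when available" dispatch can be done with an import guard, so the code still runs on machines without the compiled extension. A minimal sketch, assuming the extension exposes a hilbert_encode function (all names here are illustrative, not the repo's exact API):

```python
import torch

# Prefer the compiled CUDA extension; fall back to the pure-PyTorch
# encoder when it is not installed or the input is on CPU.
try:
    import serialize_cuda  # built from libs/serialize_cuda
    _HAS_CUDA_EXT = True
except ImportError:
    _HAS_CUDA_EXT = False

def hilbert_encode_pytorch(grid_coord, depth=16):
    """Stand-in for the repository's original PyTorch encoder."""
    return torch.zeros(grid_coord.shape[0], dtype=torch.int64)

def hilbert_encode(grid_coord, depth=16):
    if _HAS_CUDA_EXT and grid_coord.is_cuda:
        return serialize_cuda.hilbert_encode(grid_coord, depth)
    return hilbert_encode_pytorch(grid_coord, depth)
```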
import time

import torch

warmup = 2
runs = 10
times = []
for i in range(runs + warmup):
    point = dataset[i]
    point = transform(point)
    with torch.no_grad():
        # move every tensor in the sample to the GPU
        for key in point.keys():
            if isinstance(point[key], torch.Tensor):
                point[key] = point[key].cuda(non_blocking=True)
        # synchronize so the timer brackets only the forward pass
        torch.cuda.synchronize()
        start = time.perf_counter()
        point = model(point)
        torch.cuda.synchronize()
        end = time.perf_counter()
    if i >= warmup:
        times.append(end - start)
        print(f"Iter time: {end - start:.6f} seconds")
avg = sum(times) / len(times)
print(f"Average time: {avg:.6f} seconds")
