ChristianSchott/point_serialization_cuda


CUDA Implementation of PointTransformerV3 serialization

The PyTorch implementation of point serialization in PointTransformerV3, in particular `hilbert_encode`, adds noticeable overhead to inference. Replacing it with a custom CUDA kernel yields a substantial end-to-end speedup.

Inference time of Utonia (pretrained PTv3) for 130,000 points was reduced from ~139.5 ms to ~108.6 ms (~22% speedup, NVIDIA A40, averaged over 10 runs after warmup).

Profile Trace Comparison

PyTorch Implementation

Profile Trace Utonia - PyTorch serialization

CUDA Implementation

Profile Trace Utonia - CUDA serialization

Installation

Set your target CUDA architecture in libs/serialize_cuda/setup.py.

```sh
cd libs/serialization
python setup.py install
```
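The architecture setting mentioned above might look roughly like the following. This is a hedged sketch, not the repository's actual `setup.py`: the source file names and the `sm_86` target (the compute capability of the A40 used in the benchmark) are illustrative assumptions; replace the arch flags with your own GPU's compute capability.

```python
# Sketch of a CUDA extension build with an explicit target architecture.
# File names and the sm_86 target are assumptions, not the repo's actual values.
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

setup(
    name="serialize_cuda",
    ext_modules=[
        CUDAExtension(
            "serialize_cuda",
            sources=["serialize_cuda.cpp", "serialize_cuda_kernel.cu"],
            extra_compile_args={
                "cxx": ["-O3"],
                # Target compute capability 8.6 (e.g. NVIDIA A40);
                # adjust for your GPU.
                "nvcc": ["-O3", "-gencode=arch=compute_86,code=sm_86"],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```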

Replace serialization/default.py with the provided version, which calls the CUDA kernels when they are available and falls back to the PyTorch implementation otherwise.
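The "when available" dispatch can be sketched as below. This is a self-contained illustration, not the repo's actual `default.py`: the extension module name `serialize_cuda` and its `hilbert_encode` signature are assumptions, and a simple Morton (z-order) bit-interleave stands in for the real PyTorch Hilbert encode so the example runs on its own.

```python
# Sketch: use the compiled CUDA kernels if the extension imports,
# otherwise fall back to a pure-Python path.
try:
    import serialize_cuda  # compiled extension (name is an assumption)
    _HAS_CUDA_EXT = True
except ImportError:
    _HAS_CUDA_EXT = False


def _morton_encode_py(x: int, y: int, z: int, depth: int = 16) -> int:
    """Interleave the low `depth` bits of x, y, z.

    A z-order stand-in for the real Hilbert encode, used only so this
    sketch is self-contained and testable without the extension.
    """
    code = 0
    for i in range(depth):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def encode(x: int, y: int, z: int, depth: int = 16) -> int:
    """Serialize one grid coordinate, preferring the CUDA kernel."""
    if _HAS_CUDA_EXT:
        return serialize_cuda.hilbert_encode(x, y, z, depth)  # hypothetical API
    return _morton_encode_py(x, y, z, depth)
```

Keeping the fallback in place means the model still runs on machines where the extension was never built.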

Benchmark

```python
import time
import torch

warmup = 2
runs = 10
times = []

for i in range(runs + warmup):
    point = dataset[i]
    point = transform(point)

    with torch.no_grad():
        # Move all tensors to the GPU before starting the timer
        for key in point.keys():
            if isinstance(point[key], torch.Tensor):
                point[key] = point[key].cuda(non_blocking=True)

        torch.cuda.synchronize()
        start = time.perf_counter()

        point = model(point)

        torch.cuda.synchronize()
        end = time.perf_counter()
        if i >= warmup:
            times.append(end - start)
        print(f"Iter time: {end - start:.6f} seconds")

avg = sum(times) / len(times)
print(f"Average time: {avg:.6f} seconds")
```
