TLAS and MultitypeSet#14

Open
SimonDanisch wants to merge 40 commits into master from sd/multitype-vec
@SimonDanisch
Member

This adds:

  • GPU two-level acceleration (TLAS/BLAS): Instanced BVH with per-instance transforms, TLAS/StaticTLAS split (mutable for construction, immutable isbits for kernel traversal), Adapt.jl for CPU→GPU transfer
  • MultiTypeSet: GPU-safe heterogeneous collection with compile-time type-stable dispatch via with_index, enabling multiple material/texture types without dynamic dispatch on the GPU
  • GPU utilities: @get/@set SoA macros, for_unrolled/map_unrolled/reduce_unrolled for compile-time loop unrolling, FastClosure for GPU-safe closures

SimonDanisch and others added 30 commits December 22, 2025 19:23
…12)

SetKey.type_idx was changed from UInt8 to UInt32 for LLVM/SPIR-V
compatibility, but the @generated with_index function still compared
against UInt8 literals. Since Julia's === checks both value and type,
UInt32(1) === UInt8(1) is always false, causing all branches to fall
through to the default (first material). This made every object render
with the same material.
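
The type mismatch described above can be reproduced in a few lines. This is a minimal sketch, not the package's `with_index` code: `DemoKey`, `pick_buggy`, and `pick_fixed` are hypothetical names introduced only to show how `===` behaves when the field is `UInt32` and the literal is `UInt8`.

```julia
# Hypothetical stand-in for SetKey; only the field type matters here.
struct DemoKey
    type_idx::UInt32
end

# Buggy dispatch: compares a UInt32 field against a UInt8 literal with ===.
# === requires identical type and value, so this branch is never taken.
function pick_buggy(key::DemoKey)
    key.type_idx === UInt8(2) && return :second
    return :first  # default branch
end

# Fixed dispatch: the literal has the same type as the field.
function pick_fixed(key::DemoKey)
    key.type_idx === UInt32(2) && return :second
    return :first
end

key = DemoKey(2)  # the default constructor converts 2 to UInt32
```

With this key, `pick_buggy(key)` falls through to `:first` while `pick_fixed(key)` returns `:second`. Note that `UInt32(1) == UInt8(1)` is `true`; only the identity comparison `===` is sensitive to the integer width.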

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

On Metal, device pointers (Core.LLVMPtr) stored inside GPU buffers
cannot be reliably dereferenced by kernels. The inline data (root_aabb)
reads correctly, but following embedded pointers to per-BLAS node/primitive
arrays returns zeros.

Replace the pointer-based BLAS architecture in StaticTLAS with:
- BLASDescriptor: lightweight struct with nodes_offset, primitives_offset, root_aabb
- Flat concatenated arrays (all_blas_nodes, all_blas_prims) built from per-BLAS GPU arrays
- Offset-based indexing in closest_hit/any_hit traversal
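
The offset-based layout listed above can be sketched on the CPU as follows. This is an illustration under assumptions, not the StaticTLAS implementation: `DemoDescriptor`, its `n_nodes` field, `flatten_blas`, and `node_at` are hypothetical, and plain `Float32` values stand in for BVH nodes.

```julia
# Hypothetical descriptor: an offset into one shared array replaces a
# per-BLAS device pointer.
struct DemoDescriptor
    nodes_offset::UInt32  # zero-based start of this BLAS in the flat array
    n_nodes::UInt32       # number of nodes owned by this BLAS (assumed field)
end

# Concatenate per-BLAS node arrays into one flat array and record offsets.
function flatten_blas(per_blas::Vector{Vector{Float32}})
    flat = Float32[]
    descs = DemoDescriptor[]
    for nodes in per_blas
        push!(descs, DemoDescriptor(length(flat), length(nodes)))
        append!(flat, nodes)
    end
    return flat, descs
end

# Local one-based index i within a BLAS maps to flat index offset + i.
node_at(flat, desc::DemoDescriptor, i::Integer) = flat[desc.nodes_offset + i]

flat, descs = flatten_blas([Float32[1, 2, 3], Float32[10, 20]])
first_node_of_second_blas = node_at(flat, descs[2], 1)
```

Because the descriptors hold only integers, they remain plain isbits data and nothing inside the buffer has to be dereferenced as a pointer by the kernel.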

Management kernels (update_tlas_leaf_aabbs_kernel!, etc.) still use
blas_array but only read root_aabb (inline, unaffected).

Verified: CPU and Metal produce identical results (mean pixel ~0.327
on 3-sphere test scene).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

SimonDanisch mentioned this pull request Mar 2, 2026
SimonDanisch and others added 5 commits March 2, 2026 16:31
Pkg.test() defaults to --check-bounds=yes which injects error paths
that can't compile to SPIR-V. GPU tests now auto-skip with
@test_broken when bounds checking is forced.
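
One way such an auto-skip could look is sketched below. The helper names `bounds_checks_forced` and `run_gpu_case` are hypothetical and the actual test suite may detect the flag differently; the sketch relies only on `Base.JLOptions().check_bounds`, which is 1 when Julia was started with `--check-bounds=yes`.

```julia
using Test

# check_bounds is 0 by default, 1 for --check-bounds=yes, 2 for --check-bounds=no.
bounds_checks_forced() = Base.JLOptions().check_bounds == 1

# Hypothetical stand-in for a real GPU test body.
run_gpu_case() = true

@testset "GPU smoke test (sketch)" begin
    if bounds_checks_forced()
        # Record the case as broken instead of attempting to compile the kernel.
        @test_broken false
    else
        @test run_gpu_case()
    end
end
```

Recording the case with `@test_broken` keeps the skipped test visible in the summary rather than silently dropping it.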

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- viewfactors_content.md: Build visualization mesh from TLAS primitives
  instead of raw merged mesh. The TLAS removes degenerate triangles, so
  face counts differed (202 vs 250), causing FaceView verification error.
- gpu_raytracing_tutorial.md: bvh.primitives → bvh.all_blas_prims
  (StaticTLAS field was renamed in TLAS/BLAS refactor)
- bvh_hit_tests_content.md: Fix swapped benchmark labels (closest_hit
  was showing any_hit timing and vice versa), remove empty section header,
  fix test numbering (3 not 4)
- instanced-bvh-architecture.md: Replace broken example using
  TriangleMesh/inv_translate with working high-level TLAS API
- raytracing_tutorial_content.md: Fix "Analougus" → "Analogous"
- README.md: Add MultiTypeSet and GPU TLAS to features list

Results (400×720, 4spp, 6014 triangles):
  Wavefront GPU:  2.7 ms  (winner, 223x vs CPU baseline)
  Tiled (32×16):  7.5 ms
  Tiled (32×32):  8.3 ms
  Unrolled:      14.2 ms
  Baseline GPU:  16.2 ms
  Tiled (8×8):   16.7 ms
  Wavefront CPU: 97.0 ms
  Baseline CPU: 602.7 ms

The example scene uses Y-up geometry (floor at y=-1.5), but the
wavefront renderer defaulted to camera_up=Vec3f(0,0,1) (Z-up),
producing an upside-down/rotated view. Also fix camera_lookat
to look along +Z, matching the simple camera used by other kernels.

Shows how to enable hw_accel=true with the Lava backend, explains the
architecture (extract-trace-shade pipeline), includes benchmark
comparison between SW BVH and HW RT on materials scene (20 spheres,
AMD RX 7900 XTX). Honest results: parity on simple scenes, HW RT
advantage on complex geometry (3.5M+ triangles).

AMD RX 7900 XTX via AMDGPU.jl, dragon mesh (249K tris) + procedural.
Key findings: Raycore 3.5-20x faster for ray tracing (single-pass
closest-hit with early termination vs two-pass BV candidate list).
ImplicitBVH 2-5x faster for BVH build (simpler construction).

2 participants