CUDA Path Tracer

Jason Xie, Fall 2023

Overview

A path tracer written in CUDA.

In this project, I take advantage of the massively parallel processing power of GPU to render photorealisitc images. The end result is a interactive path tracer featuring specular and diffuse shading, along with several optimizations to improve performance.

Features

✨ Diffuse / Specular shading

Simple BSDF shading for diffuse and specular surfaces.

100% Diffuse

100% Specular

🙅‍♂️ Path termination via stream compaction

When rays hit a light source or exceed the maximum number of bounces, they should no longer be traced. Stream compaction is implemented in this project using thrust::partition, where valid rays are moved to the front of the array and invalid rays are moved to the back. The number of valid rays is then returned.

Let's compare the number of active paths for an open Cornell box vs a closed Cornell box.

Open Cornell Box	Closed Cornell Box

As expected, the number of active paths for the open Cornell box is significantly higher than the closed Cornell box. This is because rays can escape the open Cornell box and terminate early, while rays in the closed Cornell box are forced to bounce around until they exceed the maximum number of bounces.

What about actual performance?

Path termination on `cornell.txt`	ms / frame
No	13
Yes	27

The current form of path termination is slower than the naive approach, which suggests that the overhead of stream compaction is greater than the performance gains from terminating paths early for this simple scene.

⚡️ Continuous memory rays via sorting

To prevent inefficient memory access patterns, we perform memory coalescing by sorting rays by their material type. This ensures that rays with the same material type are stored contiguously in memory.

Performance Comparison:

Sorting on `cornell.txt`	ms / frame
No	27
Yes	54

Sorting rays by material type is slower than the naive approach, which suggests that the overhead of sorting rays is greater than the performance gains from memory coalescing.

🏦 First bounce caching

In addition to sorting rays by material type, we also cache the first bounce of each ray. This allows us to avoid having to recalculate the first bounce for each ray in the scene. It is possible to perform this optimization because the first intersection is always deterministic.

Performance Comparison:

Max Bounces	With Caching (ms/ frame)	Without Caching (ms / frame)
2	13.5	13.5
3	13.7	13.7
4	14.0	14.0
5	25.0	22.0
6	27.0	26.6
7	27.5	26.8
8	26.6	27.0

It appears that storing the first bounce provides minor improvements to the frame time, and performance gains remains constant over the number of bounces. This can be attributed to the fact that the savings from caching is independent of the number of subsequent bounces.

🐇 GLTF mesh loading

In order to push the limits of the path tracer, I implemented a GLTF mesh loader to load complex geometries. First, the GLTF file is parsed using tinygltf. Then, mesh vertices and indices are loaded from raw bytes into CUDA device memory.

🐙 Hierarchical spatial data structures

In the naive path tracing approach, a ray must check with every object and triangle in the scene to find the closest intersection. This is inefficient, especially when the scene contains thousands of objects. To improve performance, we can partition the scene using octrees🐙. octrees🐙 significantly reduce the number of intersection tests required by eliminating large swatch of the scene that do not intersect with the ray.

To implement octrees🐙, an axis-aligned bounding box is first created for the entire mesh object. Then, the bounding box is recursively subdivided into 8 octants until each octant contains at most N triangles or the maximum depth is reached. Triangles spanning across octants are accounted for by duplicating them across octants. To check for an intersection, the ray first checks for a collision with the bounding box. If a collision is found, the ray recursively checks for a collision with child nodes until a leaf node is reached. Then, the ray checks for a collision with each triangle in the leaf node.

Performance Comparison

🐇.gltf: Triangles: 30.3k, Vertices: 15.3k

Octree Depth	ms / frame	FPS
0	383	2.6
1	378	2.6
3	429	2.3
16	429	2.3
64	429	2.3

For this scene, performance improves slightly at low depths, but quickly saturates. More complex scenes and better memory access patterns may help push performance further.

Acknowledgements

Bloopers

Missing triangles due to not accounting for triangles that span across octree🐙 nodes.

Attempt to optimize using CUDA streams, but kernels were accessing the same memory chunk.

Name		Name	Last commit message	Last commit date
Latest commit History 145 Commits
.vscode		.vscode
cmake		cmake
external		external
img		img
scenes		scenes
src		src
.cproject		.cproject
.gitignore		.gitignore
.project		.project
CMakeLists.txt		CMakeLists.txt
GNUmakefile		GNUmakefile
INSTRUCTION.md		INSTRUCTION.md
Project3-CUDA-Path-Tracer.launch		Project3-CUDA-Path-Tracer.launch
README.md		README.md
build.sh		build.sh
imgui.ini		imgui.ini

JChunX/CUDA-Path-Tracer

Folders and files

Latest commit

History

Repository files navigation

CUDA Path Tracer

Overview

Features

✨ Diffuse / Specular shading

🙅‍♂️ Path termination via stream compaction

⚡️ Continuous memory rays via sorting

🏦 First bounce caching

🐇 GLTF mesh loading

🐙 Hierarchical spatial data structures

Acknowledgements

Bloopers

About

Resources

Stars

Watchers

Forks

Languages