This project implements a custom SIMT GPU Streaming Multiprocessor (SM) integrated with a dedicated hardware Ray-Tracing Compute Unit (RTCU). It demonstrates hardware/software co-design for graphics acceleration and works around the lack of dedicated ray-tracing hardware blocks in earlier generations of Apple Silicon.
graph LR
SM[GPU Streaming Multiprocessor] -->|1024-bit Packed Ray| RTCU[Ray-Tracing Compute Unit]
RTCU -->|Intersection Test| BVH[Unified Memory / BVH Nodes]
RTCU -->|Hit/Miss Results| SM
Implements a 32-lane vector streaming multiprocessor.
- SIMT Execution Mask: Manages thread active masks across the 32-lane warp.
- Instruction Decoder: Decodes custom graphics instructions. Employs a custom opcode (0x7B) to dispatch ray-tracing operations directly to the co-processor.
- Packed Array Interface: Flattens 32-thread vector paths (e.g., 32 threads * 32-bit float coordinates) into 1024-bit packed vectors for synthesizable, high-bandwidth communication with the RTCU.
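The packed array interface can be illustrated with a small Python model. This is a minimal sketch, not the project's RTL: the lane ordering (lane 0 in the least-significant bits), the helper names `pack_lanes` and `unpack_lanes`, and the way the execution mask gates the unpacked lanes are all assumptions made for illustration.

```python
import struct

LANES = 32       # warp width stated in the text
LANE_BITS = 32   # one IEEE-754 single-precision value per lane

def pack_lanes(values):
    """Pack 32 floats into one 1024-bit integer (assumed: lane 0 in the low bits)."""
    if len(values) != LANES:
        raise ValueError("expected exactly 32 lane values")
    packed = 0
    for lane, value in enumerate(values):
        # Reinterpret the float as its raw 32-bit pattern.
        bits = struct.unpack("<I", struct.pack("<f", value))[0]
        packed |= bits << (lane * LANE_BITS)
    return packed

def unpack_lanes(packed, active_mask=0xFFFFFFFF):
    """Recover per-lane floats; lanes disabled in the mask come back as None."""
    out = []
    for lane in range(LANES):
        if (active_mask >> lane) & 1:
            bits = (packed >> (lane * LANE_BITS)) & 0xFFFFFFFF
            out.append(struct.unpack("<f", struct.pack("<I", bits))[0])
        else:
            out.append(None)
    return out

# Hypothetical per-lane ray origin X coordinates.
origins_x = [float(i) for i in range(LANES)]
bus = pack_lanes(origins_x)
lanes = unpack_lanes(bus, active_mask=0x0000FFFF)  # only the lower 16 threads active
```

With all lanes active, `unpack_lanes(pack_lanes(v))` round-trips any list of values that are exactly representable in single precision, which is the property a testbench reference model would rely on.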
A dedicated hardware accelerator designed to perform parallel Bounding Volume Hierarchy (BVH) node traversal and ray-triangle intersection tests.
- Memory Interface: Fetches scene geometry and BVH tree structures directly from Unified Memory using a 256-bit wide bus.
- Pipelined Traversal: Employs an internal state machine (IDLE, FETCH_BVH, INT_BOX, FETCH_TRI, INT_TRI) to walk the spatial index and check for ray intersections.
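The traversal state machine can be sketched as a pure transition function in Python. Only the five state names come from the description above; the transition conditions (`dispatch`, `box_hit`, `is_leaf`, `stack_empty`) are hypothetical inputs chosen to show one plausible walk through a BVH, and the real RTL may sequence the states differently.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    FETCH_BVH = auto()
    INT_BOX = auto()
    FETCH_TRI = auto()
    INT_TRI = auto()

def next_state(state, *, dispatch=False, box_hit=False, is_leaf=False, stack_empty=True):
    """Assumed transition function for the RTCU traversal FSM."""
    if state is State.IDLE:
        return State.FETCH_BVH if dispatch else State.IDLE
    if state is State.FETCH_BVH:
        return State.INT_BOX                 # node fetched, test its bounding box
    if state is State.INT_BOX:
        if box_hit and is_leaf:
            return State.FETCH_TRI           # leaf node: fetch its triangle data
        if box_hit:
            return State.FETCH_BVH           # inner node: descend into a child
        return State.IDLE if stack_empty else State.FETCH_BVH
    if state is State.FETCH_TRI:
        return State.INT_TRI
    # INT_TRI: resume traversal if nodes remain, otherwise return to IDLE.
    return State.IDLE if stack_empty else State.FETCH_BVH

# Walk one ray through a single leaf node.
trace = [State.IDLE]
for inputs in (dict(dispatch=True), {}, dict(box_hit=True, is_leaf=True), {}, dict(stack_empty=True)):
    trace.append(next_state(trace[-1], **inputs))
```

Keeping the transition logic free of side effects makes it straightforward to compare against a waveform from the RTL simulation, state by state.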
A cycle-accurate architectural simulator written in Python.
- Functional Math Model: Implements vector math, camera ray generation, and bounding box/triangle intersection tests.
- Output: Path-traces a scene with spherical geometry and shadows, generating a native render.bmp file to verify the visual correctness of the rendering pipeline.
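A bounding-box intersection test of the kind listed above is commonly written with the slab method. The sketch below assumes that approach and an axis-aligned box given by two corner points; it is not taken from sim/gpu_sim.py, and the function name `ray_aabb_hit` is hypothetical.

```python
import math

def ray_aabb_hit(origin, direction, box_min, box_max):
    """Slab test: True if the ray origin + t*direction hits the box for some t >= 0."""
    t_near, t_far = 0.0, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < lo or o > hi:
                return False
            continue
        t0 = (lo - o) / d
        t1 = (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        # Shrink the interval of t values that lie inside every slab so far.
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True

unit_box = ((-1.0, -1.0, -1.0), (1.0, 1.0, 1.0))
hit = ray_aabb_hit((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), *unit_box)    # aimed at the box
miss = ray_aabb_hit((0.0, 0.0, -5.0), (0.0, 1.0, 0.0), *unit_box)   # passes alongside it
```

Starting `t_near` at zero rejects boxes that lie entirely behind the ray origin, which matters for shadow rays cast from a surface point.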
Verifies the SM-to-RTCU interface:
- Simulates an instruction fetch containing the custom RT opcode (0x7B).
- Checks that the SM decodes the opcode and asserts the rtcu_dispatch_valid signal.
- Verifies that the 1024-bit packed ray coordinates (origin and direction vectors) are properly driven onto the bus.
- RTL Simulation: Run the design directly on EDA Playground using this pre-configured link: 👉 Live EDA Playground Simulator
  Alternatively, copy tb_gpu_core.sv and the source design files into the playground manually, select Aldec Riviera Pro, and click Run.
- Visual Simulator: Execute python sim/gpu_sim.py in your local terminal to run the architectural path-tracer and generate the visual render.bmp output.
The included docs/Schematic_gpu_top.pdf, produced by the Yosys synthesis suite, shows the top-level hardware routing between the Streaming Multiprocessor (SM) and the Ray-Tracing Compute Unit (RTCU). For engineers reviewing this schematic, note the following symbolic representations:
- Octagonal Nodes: Represent the physical input and output ports of the gpu_top module.
- Comparator Box ($eq): You can trace the fetch_instr bus directly into a comparator checking against 7'b1111011 (the binary representation of the 0x7B custom Ray-Tracing opcode).
- AND Gate Box ($logic_and): The output of the comparator is logically ANDed with fetch_valid to generate the rtcu_dispatch_valid handshake signal.
- 1024-bit Bus Routing: The 1024-bit data buses carrying the 32-lane packed ray origins and directions (e.g., rtcu_ray_dir_x) route directly from the SM to the RTCU with no intervening combinational logic.
- Hardware Handshake: The 1'1 (True) constant node is driven continuously into the rtcu_dispatch_ready port, so the coprocessor is always ready and the handshake completes in a single cycle.
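The comparator, AND gate, and tied-high ready signal described in the list above can be modelled in a few lines of Python. The signal names follow the schematic; the position of the opcode field (the low 7 bits of fetch_instr) is an assumption, since the text only gives the 7-bit comparison constant, and `dispatch_fires` is a hypothetical helper.

```python
RT_OPCODE = 0x7B     # custom ray-tracing opcode, 7'b1111011
OPCODE_MASK = 0x7F   # assumption: the opcode occupies the low 7 bits of fetch_instr

def rtcu_dispatch_valid(fetch_instr, fetch_valid):
    """Model of the $eq comparator followed by the $logic_and gate."""
    opcode_match = (fetch_instr & OPCODE_MASK) == RT_OPCODE
    return opcode_match and bool(fetch_valid)

def dispatch_fires(fetch_instr, fetch_valid, rtcu_dispatch_ready=True):
    """A transfer happens when valid and ready are both high in the same cycle.

    rtcu_dispatch_ready defaults to True to mirror the constant 1'1 node.
    """
    return rtcu_dispatch_valid(fetch_instr, fetch_valid) and rtcu_dispatch_ready

# Example fetches: an RT instruction, the same word with fetch_valid low,
# and an unrelated opcode.
results = [
    dispatch_fires(0x0000007B, fetch_valid=True),
    dispatch_fires(0x0000007B, fetch_valid=False),
    dispatch_fires(0x00000033, fetch_valid=True),
]
```

Because ready is constant, `dispatch_fires` reduces to `rtcu_dispatch_valid` here; the extra parameter only shows where back-pressure would enter if the RTCU could stall.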