WarpTrace is a high-fidelity GPU branch divergence simulator that implements hardware-accurate SIMT (Single Instruction Multiple Thread) execution with stack-based reconvergence. This project demonstrates:
- ✅ 100% thread divergence resolution using the immediate post-dominator reconvergence algorithm
- 🔧 Hardware-accurate SIMT stack implementation in C++
- 📊 Static analysis of NVIDIA SASS assembly code
- 🔄 Automatic Control Flow Graph (CFG) generation and export
- 📈 Detailed execution statistics and performance metrics
The core simulator implements a production-quality SIMT stack that mirrors real GPU hardware:
```cpp
struct StackEntry {
    int      pc;           // Reconvergence program counter
    uint32_t active_mask;  // 32-bit thread activity mask
};
```

Divergence Handling:
- Automatic detection of branch divergence points
- Stack-based tracking of divergent execution paths
- Guaranteed reconvergence at immediate post-dominator blocks
- Support for nested divergence and complex control flow
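Because the active mask is a plain 32-bit word, warp splits reduce to bit arithmetic. As a hedged illustration (a small Python model, not the project's C++ code; the name `split_mask` is invented here):

```python
# Model a 32-thread warp's active mask as a Python int (mirrors the
# uint32_t active_mask field described above; illustrative only).
FULL_MASK = 0xFFFFFFFF

def split_mask(active_mask: int, predicate):
    """Split an active mask into taken/not-taken halves by a per-thread predicate."""
    taken = 0
    for tid in range(32):
        if active_mask & (1 << tid) and predicate(tid):
            taken |= 1 << tid
    not_taken = active_mask & ~taken & FULL_MASK
    return taken, not_taken

# Example: threads 0-15 take the branch (threadIdx < 16)
taken, not_taken = split_mask(FULL_MASK, lambda tid: tid < 16)
print(hex(taken), hex(not_taken))  # 0xffff 0xffff0000
```

The two halves are disjoint and their union is the original mask, which is exactly the invariant the SIMT stack relies on at reconvergence.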
Static analysis toolkit for NVIDIA SASS assembly:
- Instruction Parser: Parses SASS opcodes, operands, and branch targets
- CFG Builder: Constructs control flow graphs with basic block analysis
- Divergence Detection: Identifies branch points and reconvergence locations
- JSON Export: Generates CFG data for the C++ simulator
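As a rough sketch of the kind of tokenization involved (the project's actual sass_parser.py may use a different grammar; the regex and field names here are assumptions):

```python
import re

# Illustrative SASS line tokenizer: optional predicate guard, opcode with
# modifiers, then a comma-separated operand list. Not the real parser.
SASS_LINE = re.compile(
    r"^\s*(?:(@!?P\d+)\s+)?"   # optional predicate guard, e.g. @P0
    r"([A-Z][\w.]*)"           # opcode with modifiers, e.g. ISETP.LT.AND
    r"\s*([^;/]*)"             # operand list
)

def parse_sass(line: str):
    line = line.split("//")[0]          # strip trailing comment
    m = SASS_LINE.match(line)
    if not m:
        return None
    pred, opcode, operands = m.groups()
    ops = [o.strip() for o in operands.split(",") if o.strip()]
    return {"pred": pred, "opcode": opcode, "operands": ops}

print(parse_sass("@P0 BRA then_block // 16 threads take this path"))
# {'pred': '@P0', 'opcode': 'BRA', 'operands': ['then_block']}
```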
```
┌─────────────────────────────────────────────────────────────┐
│                     WarpTrace Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  SASS Assembly ──→ Python Parser ──→ CFG Builder            │
│                                           │                 │
│                                           ↓                 │
│                                      JSON Export            │
│                                           │                 │
│                                           ↓                 │
│  C++ Simulator ←─── Load CFG ←─── cfg.json                  │
│        │                                                    │
│        ├─→ SIMT Stack Execution                             │
│        ├─→ Divergence Handling                              │
│        ├─→ Thread Reconvergence                             │
│        └─→ Statistics & Metrics                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
- C++: C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
- CMake: Version 3.12 or higher
- Python: Version 3.8 or higher
```bash
# Create build directory
mkdir build && cd build

# Configure with CMake
cmake ..

# Build
cmake --build .

# The executable will be: build/warptrace (or warptrace.exe on Windows)
```

Create or use an existing SASS assembly file, then parse it with the Python analyzer:
```bash
cd src/python
python main.py analyze ../../examples/simple_branch.sass -o cfg.json
```

This generates:
- cfg.json: Control Flow Graph data for the C++ simulator
- Console output with CFG statistics
Execute the C++ simulator with the generated CFG:
```bash
./build/warptrace cfg.json
```

Or run the built-in demonstration:

```bash
./build/warptrace --demo
```

Sample output:

```
WarpTrace: GPU Branch Divergence Simulator
========================================
[SIMT] Initialized with PC=0, mask=0xffffffff

[DIVERGENCE] at PC=1
  Current mask: 0xffffffff
  Taken mask:   0x0000ffff -> PC=2 (16 threads)
  Not-taken:    0xffff0000 -> PC=3 (16 threads)
  Reconverge at PC=4
  Stack depth: 2

[RECONVERGE] PC=2 -> 4, mask=0xffff -> 0xffffffff
  Active threads: 32

=== Execution Statistics ===
Total instructions executed: 156
Divergent branches: 1
Reconvergences: 1
Reconvergence rate: 100%
============================

✓ Successfully handled divergence with 100% reconvergence
```
The examples/ directory contains sample SASS programs demonstrating various divergence patterns:
Basic if-then-else divergence with 50/50 thread split.
```
ISETP.LT.AND P0, PT, R0, 16, PT   // if (threadIdx < 16)
@P0 BRA then_block                // 16 threads take this path
                                  // else_block: 16 threads take this path
```

Multiple levels of divergence demonstrating recursive reconvergence.
Threads exit a loop at different iterations, reconverging at loop exit.
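As a sketch of this loop-divergence pattern (an illustrative Python model, not the sample's actual SASS; `loop_masks` and the trip counts are invented here):

```python
# Model threads leaving a loop at different trip counts: the active mask
# shrinks each iteration until every lane has exited (illustrative only).
def loop_masks(trip_counts):
    mask = (1 << len(trip_counts)) - 1   # all lanes active at loop entry
    masks = []
    iteration = 0
    while mask:
        masks.append(mask)
        iteration += 1
        for tid, trips in enumerate(trip_counts):
            if trips == iteration:
                mask &= ~(1 << tid)      # lane tid exits the loop
    return masks

# 4 lanes exiting after 1, 2, 2, and 4 iterations
print([hex(m) for m in loop_masks([1, 2, 2, 4])])
# ['0xf', '0xe', '0x8', '0x8']
```

The warp stays partially active until the slowest lane finishes, which is why all lanes reconverge only at the loop exit.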
The simulator implements the immediate post-dominator reconvergence algorithm:
- Divergence Detection: When a branch instruction is encountered, the warp splits based on predicate evaluation
- Stack Management: Push reconvergence point and not-taken path onto stack
- Sequential Execution: Execute taken path first with reduced active mask
- Reconvergence: Pop stack to restore full warp when paths complete
This mirrors the stack-based reconvergence used by NVIDIA GPU hardware from the Fermi through Pascal architectures; Volta and later add independent thread scheduling with per-thread program counters, but post-dominator reconvergence remains a useful baseline model.
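To make the four steps concrete, here is a minimal Python model of the stack discipline (a sketch, not the project's C++ implementation; `run_branch` and its parameters are invented here, with PCs and masks taken from the demo trace):

```python
# Minimal model of stack-based reconvergence for a single branch.
# A stack entry pairs a PC with the active mask to restore there.
FULL_MASK = 0xFFFFFFFF

def run_branch(taken_pc, not_taken_pc, reconv_pc, taken_mask, active_mask=FULL_MASK):
    stack = []
    # 1. Divergence detection: the warp splits on the predicate.
    not_taken_mask = active_mask & ~taken_mask & FULL_MASK
    # 2. Stack management: push the reconvergence point, then the not-taken path.
    stack.append((reconv_pc, active_mask))        # restored last
    stack.append((not_taken_pc, not_taken_mask))  # executed second
    # 3. Sequential execution: the taken path runs first with a reduced mask.
    trace = [("exec", taken_pc, taken_mask)]
    # When the taken path reaches reconv_pc, pop and run the not-taken path.
    pc, mask = stack.pop()
    trace.append(("exec", pc, mask))
    # 4. Reconvergence: pop the final entry to restore the full warp.
    pc, mask = stack.pop()
    trace.append(("reconverge", pc, mask))
    return trace

# The demo scenario: branch splits 16/16 at PC=1, reconverges at PC=4.
for event in run_branch(taken_pc=2, not_taken_pc=3, reconv_pc=4, taken_mask=0x0000FFFF):
    print(event)
```

Note that the stack depth peaks at 2 here, matching the `Stack depth: 2` line in the demo output; nested divergence simply pushes further entries before the outer ones are popped.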
The Python analyzer performs:
- Lexical Analysis: Tokenize SASS instructions
- Semantic Analysis: Identify control flow instructions (BRA, JMP, RET, etc.)
- Graph Building: Create basic blocks and edges
- Post-Dominator Analysis: Find reconvergence points for branches
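The post-dominator step can be sketched with a standard iterative dataflow computation (illustrative; cfg_builder.py may use a different algorithm, and the block IDs here are invented):

```python
# Post-dominator sets on a tiny diamond CFG:
# block 0 branches to {1, 2}; both fall through to 3, the exit.
def post_dominators(succs, exit_block):
    nodes = set(succs)
    pdom = {n: set(nodes) for n in nodes}   # start from the full node set
    pdom[exit_block] = {exit_block}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_block}:
            # A node's post-dominators: itself plus those shared by all successors.
            new = {n} | set.intersection(*(pdom[s] for s in succs[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom

pdom = post_dominators({0: [1, 2], 1: [3], 2: [3], 3: []}, exit_block=3)
# The branch at block 0 reconverges at its immediate post-dominator: block 3.
print(sorted(pdom[0] - {0}))  # [3]
```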
```
WarpTrace/
├── src/
│   ├── cpp/                 # C++ simulation engine
│   │   ├── simt_stack.h     # SIMT stack interface
│   │   ├── simt_stack.cpp   # Stack implementation
│   │   └── simulator.cpp    # Main simulator executable
│   └── python/              # Python SASS analyzer
│       ├── __init__.py
│       ├── sass_parser.py   # SASS instruction parser
│       ├── cfg_builder.py   # Control flow graph builder
│       ├── cfg_exporter.py  # JSON/DOT export
│       └── main.py          # Command-line interface
├── examples/                # Sample SASS programs
│   ├── simple_branch.sass
│   ├── nested_branch.sass
│   └── loop_divergence.sass
├── CMakeLists.txt           # Build configuration
└── README.md
```
The simulator tracks and reports:
- Total Instructions: Dynamic instruction count across all threads
- Divergent Branches: Number of branch divergence events
- Reconvergences: Number of successful thread reconvergences
- Reconvergence Rate: Percentage of divergences that reconverge (target: 100%)
- Stack Depth: Maximum SIMT stack depth during execution
- Active Thread Count: Threads executing at each program point
- NVIDIA SASS: Streaming ASSembly - low-level GPU instruction format
- CUDA: Parallel computing platform and programming model
- SIMT: Single Instruction Multiple Thread execution model
- PTX: Parallel Thread Execution - NVIDIA's virtual ISA
- Real PTX/SASS compilation integration
- Interactive execution debugger
- Warp occupancy analysis
- Memory divergence tracking
- Visualization of execution timeline
- Support for other GPU architectures (AMD, Intel)
In GPU architectures, threads are grouped into warps (typically 32 threads) that execute in lockstep (SIMT). When threads in a warp take different paths at a branch:
- Divergence: Warp splits into multiple execution paths
- Serialization: Paths execute sequentially (performance penalty)
- Reconvergence: Threads rejoin at a common point
WarpTrace simulates this behavior to:
- Understand GPU performance characteristics
- Validate compiler optimizations
- Analyze control flow complexity
- Educate about GPU architecture
Branch divergence is a critical performance bottleneck in GPU computing:
- Performance Impact: Up to 32× slowdown for fully divergent warps
- Compiler Optimization: Modern compilers try to minimize divergence
- Algorithm Design: GPU algorithms must consider warp-level behavior
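The worst case follows directly from serialization: with equal-length paths, slowdown scales with the number of distinct paths the warp must execute. A back-of-the-envelope model (assumes equal path lengths and ignores memory effects; `simd_efficiency` is invented here, not a simulator metric):

```python
def simd_efficiency(path_masks, warp_size=32):
    """Average fraction of lanes active per serialized execution pass."""
    active = sum(bin(m).count("1") for m in path_masks)
    return active / (len(path_masks) * warp_size)

# 50/50 if/else split: two passes of 16 lanes -> 50% efficiency (2x slowdown)
print(simd_efficiency([0x0000FFFF, 0xFFFF0000]))  # 0.5
# Fully divergent warp: 32 passes of 1 lane -> 1/32 efficiency (32x slowdown)
print(simd_efficiency([1 << i for i in range(32)]))  # 0.03125
```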
MIT License - See LICENSE file for details
Created as a portfolio demonstration of:
- Systems programming in C++
- GPU architecture knowledge
- Compiler/simulator design
- Professional software engineering practices
Simulating GPU hardware instruction scheduling with stack-based reconvergence