A hardware-accurate GPU branch divergence simulator. Implements SIMT stack execution and IPDOM reconvergence logic in C++ to resolve 100% of thread divergence cases in NVIDIA SASS workloads.

MuadhGeorge/GPU-Branch-Simulator

WarpTrace: GPU Branch Divergence Simulator

Hardware-accurate SIMT stack simulation with immediate post-dominator reconvergence

Overview

WarpTrace is a high-fidelity GPU branch divergence simulator that implements hardware-accurate SIMT (Single Instruction Multiple Thread) execution with stack-based reconvergence. This project demonstrates:

  • Reconvergence for 100% of divergent branches, using the immediate post-dominator (IPDOM) reconvergence algorithm
  • 🔧 Hardware-accurate SIMT stack implementation in C++
  • 📊 Static analysis of NVIDIA SASS assembly code
  • 🔄 Automatic Control Flow Graph (CFG) generation and export
  • 📈 Detailed execution statistics and performance metrics

Key Features

C++ Simulation Engine

The core simulator implements a production-quality SIMT stack that mirrors real GPU hardware:

#include <cstdint>  // std::uint32_t

struct StackEntry {
    int pc;                       // Reconvergence program counter
    std::uint32_t active_mask;    // 32-bit thread activity mask, one bit per thread in the warp
};

Divergence Handling:

  • Automatic detection of branch divergence points
  • Stack-based tracking of divergent execution paths
  • Guaranteed reconvergence at immediate post-dominator blocks
  • Support for nested divergence and complex control flow

Python SASS Analyzer

Static analysis toolkit for NVIDIA SASS assembly:

  • Instruction Parser: Parses SASS opcodes, operands, and branch targets
  • CFG Builder: Constructs control flow graphs with basic block analysis
  • Divergence Detection: Identifies branch points and reconvergence locations
  • JSON Export: Generates CFG data for the C++ simulator
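As a rough illustration of the Instruction Parser step, the sketch below splits one SASS line into an optional predicate guard, an opcode, and operands. The contents of sass_parser.py are not shown in this README, so the regular expression, the `Instruction` tuple, and the `parse_line` name are all assumptions made for this example.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern: optional "@P0" / "@!P0" guard, an opcode with dotted
# modifiers, then a comma-separated operand list and an optional ";".
LINE_RE = re.compile(
    r"^\s*(?:@(?P<pred>!?U?P\d+|!?PT)\s+)?"
    r"(?P<opcode>[A-Z][A-Z0-9_.]*)"
    r"(?:\s+(?P<operands>[^;]*?))?\s*;?\s*$"
)

class Instruction(NamedTuple):
    predicate: Optional[str]  # e.g. "P0", or None when unguarded
    opcode: str               # e.g. "ISETP.LT.AND"
    operands: list            # e.g. ["P0", "PT", "R0", "16", "PT"]

def parse_line(line: str) -> Optional[Instruction]:
    """Parse one line; return None for blanks, comments, and labels."""
    code = line.split("//", 1)[0].strip()  # drop trailing comment
    if not code or code.endswith(":"):
        return None
    m = LINE_RE.match(code)
    if m is None:
        return None
    raw = m.group("operands") or ""
    operands = [o.strip() for o in raw.split(",") if o.strip()]
    return Instruction(m.group("pred"), m.group("opcode"), operands)

branch = parse_line("@P0 BRA then_block               // 16 threads take this path")
compare = parse_line("ISETP.LT.AND P0, PT, R0, 16, PT  // if (threadIdx < 16)")
```

Here `branch.operands[0]` would be the branch target label that a CFG builder needs in order to create an edge.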

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      WarpTrace Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  SASS Assembly  ──→  Python Parser  ──→  CFG Builder        │
│                                            │                 │
│                                            ↓                 │
│                                        JSON Export           │
│                                            │                 │
│                                            ↓                 │
│  C++ Simulator  ←───  Load CFG  ←───  cfg.json             │
│      │                                                       │
│      ├─→  SIMT Stack Execution                             │
│      ├─→  Divergence Handling                              │
│      ├─→  Thread Reconvergence                             │
│      └─→  Statistics & Metrics                             │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Building

Requirements

  • C++: C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
  • CMake: Version 3.12 or higher
  • Python: Version 3.8 or higher

Compilation

# Create build directory
mkdir build && cd build

# Configure with CMake
cmake ..

# Build
cmake --build .

# The executable will be: build/warptrace (or warptrace.exe on Windows)

Usage

1. Analyze SASS Code

Create or use an existing SASS assembly file, then parse it with the Python analyzer:

cd src/python
python main.py analyze ../../examples/simple_branch.sass -o cfg.json

This generates:

  • cfg.json: Control Flow Graph for the C++ simulator
  • Console output with CFG statistics

2. Run Simulation

Execute the C++ simulator with the generated CFG. The analyzer step above wrote cfg.json into src/python, so from the repository root:

./build/warptrace src/python/cfg.json

Or run the built-in demonstration:

./build/warptrace --demo

Example Output

WarpTrace: GPU Branch Divergence Simulator
========================================

[SIMT] Initialized with PC=0, mask=0xffffffff

[DIVERGENCE] at PC=1
  Current mask: 0xffffffff
  Taken mask:   0x0000ffff -> PC=2 (16 threads)
  Not-taken:    0xffff0000 -> PC=3 (16 threads)
  Reconverge at PC=4
  Stack depth: 2

[RECONVERGE] PC=2 -> 4, mask=0xffff -> 0xffffffff
  Active threads: 32
  
=== Execution Statistics ===
Total instructions executed: 156
Divergent branches: 1
Reconvergences: 1
Reconvergence rate: 100%
============================

✓ Successfully handled divergence with 100% reconvergence

Example Programs

The examples/ directory contains sample SASS programs demonstrating various divergence patterns:

Simple Branch (simple_branch.sass)

Basic if-then-else divergence with 50/50 thread split.

ISETP.LT.AND P0, PT, R0, 16, PT  // if (threadIdx < 16)
@P0 BRA then_block               // 16 threads take this path
// else_block: 16 threads take this path
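To make the 50/50 split concrete, the following sketch evaluates the predicate per lane and builds the taken and not-taken masks. It assumes that R0 holds the thread index and that lane i of the warp corresponds to thread i; `split_mask` is a hypothetical helper, not part of the simulator.

```python
WARP_SIZE = 32

def split_mask(active_mask, predicate):
    """Return (taken, not_taken) masks for a predicated branch.

    predicate(lane) stands in for the per-thread value of P0.
    """
    taken = 0
    for lane in range(WARP_SIZE):
        if (active_mask >> lane) & 1 and predicate(lane):
            taken |= 1 << lane
    return taken, active_mask & ~taken

# if (threadIdx < 16) with all 32 threads active
taken, not_taken = split_mask(0xFFFFFFFF, lambda lane: lane < 16)
print(f"taken=0x{taken:08x} not_taken=0x{not_taken:08x}")
# prints: taken=0x0000ffff not_taken=0xffff0000
```

These are the same two masks reported in the [DIVERGENCE] lines of the example output.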

Nested Branches (nested_branch.sass)

Multiple levels of divergence demonstrating recursive reconvergence.

Loop Divergence (loop_divergence.sass)

Threads exit a loop at different iterations, reconverging at loop exit.
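The effect of staggered loop exits can be sketched by tracking the active mask per iteration. The per-lane trip counts below are invented for illustration, and a 4-lane warp is used only to keep the masks readable.

```python
def loop_masks(trip_counts):
    """Return the active mask for each loop iteration.

    trip_counts[lane] is the number of iterations that lane executes
    (a hypothetical input; the real simulator derives this from predicates).
    """
    masks = []
    iteration = 0
    while True:
        mask = 0
        for lane, count in enumerate(trip_counts):
            if iteration < count:
                mask |= 1 << lane
        if mask == 0:  # every lane has left the loop: reconverge at the exit
            break
        masks.append(mask)
        iteration += 1
    return masks

masks = loop_masks([1, 2, 2, 4])
busy = sum(bin(m).count("1") for m in masks)  # lane-iterations doing work
slots = len(masks) * 4                        # lane-iterations issued
print([bin(m) for m in masks], busy, slots)
# prints: ['0b1111', '0b1110', '0b1000', '0b1000'] 9 16
```

The warp keeps issuing the loop body until its slowest lane finishes, so only 9 of the 16 issued lane slots do useful work in this example.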

Technical Implementation

SIMT Stack Algorithm

The simulator implements the immediate post-dominator reconvergence algorithm:

  1. Divergence Detection: When a branch instruction is encountered, the warp splits based on predicate evaluation
  2. Stack Management: Push reconvergence point and not-taken path onto stack
  3. Sequential Execution: Execute taken path first with reduced active mask
  4. Reconvergence: Pop stack to restore full warp when paths complete

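This follows the stack-based reconvergence model commonly used to describe NVIDIA GPUs before Volta (for example, Fermi through Pascal). Volta and later architectures use Independent Thread Scheduling, which keeps per-thread program counters rather than relying solely on a per-warp stack.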
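The four steps above can be sketched in a few lines of Python. This is a textbook-style model under stated assumptions, not a transcription of simt_stack.cpp: each entry here carries three fields (next PC, reconvergence PC, mask) rather than the two-field StackEntry shown earlier, and the names `Entry`, `SimtStack`, `branch`, and `advance` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    pc: int    # next PC to execute for this path
    rpc: int   # reconvergence PC (-1 when there is none)
    mask: int  # active threads on this path

class SimtStack:
    def __init__(self, start_pc=0, mask=0xFFFFFFFF):
        self.stack = [Entry(start_pc, -1, mask)]

    @property
    def top(self):
        return self.stack[-1]

    def branch(self, taken_mask, taken_pc, fallthrough_pc, reconv_pc):
        """Handle a branch; return True if the warp diverged."""
        cur = self.top
        taken = cur.mask & taken_mask
        not_taken = cur.mask & ~taken_mask
        if taken == 0 or not_taken == 0:  # uniform branch, no split
            cur.pc = taken_pc if taken else fallthrough_pc
            return False
        cur.pc = reconv_pc  # current entry resumes at the reconvergence point
        self.stack.append(Entry(fallthrough_pc, reconv_pc, not_taken))
        self.stack.append(Entry(taken_pc, reconv_pc, taken))  # runs first
        return True

    def advance(self, next_pc):
        """Move the top path to next_pc, popping paths that have reconverged."""
        self.top.pc = next_pc
        while len(self.stack) > 1 and self.top.pc == self.top.rpc:
            self.stack.pop()

s = SimtStack()
s.advance(1)
diverged = s.branch(0x0000FFFF, taken_pc=2, fallthrough_pc=3, reconv_pc=4)
first_mask = s.top.mask   # taken path executes first
s.advance(4)              # taken path reaches the reconvergence point
second_mask = s.top.mask  # not-taken path executes next
s.advance(4)              # not-taken path reaches it too
final_pc, final_mask = s.top.pc, s.top.mask
```

The `while` loop in `advance` also covers an if-without-else, where the not-taken entry already sits at the reconvergence PC and can be popped immediately. Stack depth is counted differently in this sketch than in the example output, since the reconvergence entry is the original bottom entry rather than a separate push.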

CFG Construction

The Python analyzer performs:

  • Lexical Analysis: Tokenize SASS instructions
  • Semantic Analysis: Identify control flow instructions (BRA, JMP, RET, etc.)
  • Graph Building: Create basic blocks and edges
  • Post-Dominator Analysis: Find reconvergence points for branches
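For the Post-Dominator Analysis step, one common approach is an iterative set-based dataflow pass over the CFG. The sketch below uses that approach on a hand-written graph shaped like the simple branch example; cfg_builder.py may well use a different algorithm, and the block names and `immediate_post_dominators` function are assumptions. It also assumes every block can reach the exit block.

```python
def immediate_post_dominators(succ, exit_node):
    """succ maps each block to its successor list; returns {block: ipdom}."""
    nodes = set(succ)
    # Start from "everything post-dominates everything" and shrink to a fixpoint.
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = set.intersection(*(pdom[s] for s in succ[n])) | {n}
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    ipdom = {}
    for n in nodes - {exit_node}:
        strict = pdom[n] - {n}
        # The immediate post-dominator is the strict post-dominator whose own
        # post-dominator set equals all strict post-dominators of n.
        ipdom[n] = next(c for c in strict if pdom[c] == strict)
    return ipdom

cfg = {
    "entry": ["then", "else"],  # ends with the divergent branch
    "then": ["merge"],
    "else": ["merge"],
    "merge": ["exit"],
    "exit": [],
}
ipdom = immediate_post_dominators(cfg, "exit")
print(ipdom["entry"])  # prints: merge
```

The immediate post-dominator of the block that ends in a branch ("merge" here) is the reconvergence point the simulator would use for that branch.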

Project Structure

WarpTrace/
├── src/
│   ├── cpp/                    # C++ simulation engine
│   │   ├── simt_stack.h        # SIMT stack interface
│   │   ├── simt_stack.cpp      # Stack implementation
│   │   └── simulator.cpp       # Main simulator executable
│   └── python/                 # Python SASS analyzer
│       ├── __init__.py
│       ├── sass_parser.py      # SASS instruction parser
│       ├── cfg_builder.py      # Control flow graph builder
│       ├── cfg_exporter.py     # JSON/DOT export
│       └── main.py             # Command-line interface
├── examples/                   # Sample SASS programs
│   ├── simple_branch.sass
│   ├── nested_branch.sass
│   └── loop_divergence.sass
├── CMakeLists.txt             # Build configuration
└── README.md

Performance Metrics

The simulator tracks and reports:

  • Total Instructions: Dynamic instruction count across all threads
  • Divergent Branches: Number of branch divergence events
  • Reconvergences: Number of successful thread reconvergences
  • Reconvergence Rate: Percentage of divergences that reconverge (target: 100%)
  • Stack Depth: Maximum SIMT stack depth during execution
  • Active Thread Count: Threads executing at each program point
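A few of these metrics reduce to simple bookkeeping, sketched below. The `Stats` class and its method names are hypothetical, and treating a run with no divergent branches as a 100% rate is an assumption of this example rather than documented simulator behavior.

```python
from dataclasses import dataclass

def active_threads(mask: int) -> int:
    """Active Thread Count: number of set bits in a 32-bit mask."""
    return bin(mask & 0xFFFFFFFF).count("1")

@dataclass
class Stats:
    divergent_branches: int = 0
    reconvergences: int = 0
    max_stack_depth: int = 0

    def on_diverge(self, stack_depth: int) -> None:
        self.divergent_branches += 1
        self.max_stack_depth = max(self.max_stack_depth, stack_depth)

    def on_reconverge(self) -> None:
        self.reconvergences += 1

    def reconvergence_rate(self) -> float:
        if self.divergent_branches == 0:
            return 100.0  # assumption: nothing to reconverge counts as 100%
        return 100.0 * self.reconvergences / self.divergent_branches

stats = Stats()
stats.on_diverge(stack_depth=2)
stats.on_reconverge()
print(f"Reconvergence rate: {stats.reconvergence_rate():.0f}%")
# prints: Reconvergence rate: 100%
```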

Related Technologies

  • NVIDIA SASS: Streaming ASSembly - low-level GPU instruction format
  • CUDA: Parallel computing platform and programming model
  • SIMT: Single Instruction Multiple Thread execution model
  • PTX: Parallel Thread Execution - NVIDIA's virtual ISA

Future Enhancements

  • Real PTX/SASS compilation integration
  • Interactive execution debugger
  • Warp occupancy analysis
  • Memory divergence tracking
  • Visualization of execution timeline
  • Support for other GPU architectures (AMD, Intel)

Technical Background

Branch Divergence Problem

In GPU architectures, threads are grouped into warps (typically 32 threads) that execute in lockstep (SIMT). When threads in a warp take different paths at a branch:

  1. Divergence: Warp splits into multiple execution paths
  2. Serialization: Paths execute sequentially (performance penalty)
  3. Reconvergence: Threads rejoin at a common point

WarpTrace simulates this behavior to:

  • Understand GPU performance characteristics
  • Validate compiler optimizations
  • Analyze control flow complexity
  • Educate about GPU architecture

Why This Matters

Branch divergence is a critical performance bottleneck in GPU computing:

  • Performance Impact: Up to 32× slowdown when all 32 threads of a warp take different paths, because each path executes serially
  • Compiler Optimization: Modern compilers try to minimize divergence
  • Algorithm Design: GPU algorithms must consider warp-level behavior
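The 32× figure follows from simple arithmetic, shown here as a sketch. It assumes every serialized path executes the same number of instructions, which is a simplification; `warp_efficiency` is a hypothetical helper.

```python
def warp_efficiency(path_masks, warp_size=32):
    """Return (passes, lane utilization) for a set of serialized paths.

    path_masks holds one active mask per path; each path is assumed to
    execute the same number of instructions.
    """
    passes = len(path_masks)
    busy_lanes = sum(bin(m).count("1") for m in path_masks)
    return passes, busy_lanes / (passes * warp_size)

fully_divergent = [1 << lane for lane in range(32)]  # one thread per path
half_split = [0x0000FFFF, 0xFFFF0000]                # the simple branch example

print(warp_efficiency(fully_divergent))  # prints: (32, 0.03125)
print(warp_efficiency(half_split))       # prints: (2, 0.5)
```

With 32 distinct paths the warp makes 32 passes over code that a uniform warp would execute once, and only 1 lane in 32 is busy on each pass.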

License

MIT License - See LICENSE file for details

Author

Created as a portfolio demonstration of:

  • Systems programming in C++
  • GPU architecture knowledge
  • Compiler/simulator design
  • Professional software engineering practices

Simulating GPU hardware instruction scheduling with stack-based reconvergence
