# PRLX: GPU Differential Debugger — End-to-End Test

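This notebook tests the full PRLX pipeline on Colab. The steps below map to sections 0 through 4; section 5 then reads the resulting traces back through the Python API.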

1. **Environment check** — GPU type, driver version, CUDA version
2. **Install dependencies** — LLVM 20, Rust, CMake
3. **Build everything** — LLVM pass, runtime, NVBit tool, Rust differ
4. **Run demo** — Compile + instrument a CUDA kernel, capture two traces, diff them
5. **NVBit test** — LD_PRELOAD binary instrumentation (no recompilation needed)

**Runtime**: Select **T4 GPU** (free tier) or **A100** (Pro). Both work.

## 0. Environment Check

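Confirm the GPU, driver, and CUDA toolkit before building anything. The cell does not abort on an unsupported GPU: it only prints a warning when the compute capability is newer than what NVBit supports, because the LLVM pass pipeline still works in that case.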

In [1]:
%%bash
set -e

echo "=== GPU ==="
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv,noheader
echo ""

echo "=== CUDA Toolkit ==="
nvcc --version | grep release
echo ""

echo "=== Driver ==="
nvidia-smi --query-gpu=driver_version --format=csv,noheader
echo ""

# Check compute capability
SM=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -1 | tr -d '.')
echo "SM version: sm_${SM}"

if [ "$SM" -gt 120 ]; then
    echo "WARNING: SM_${SM} may not be supported by NVBit 1.7.7.1 (max SM_120)"
    echo "LLVM pass pipeline will still work."
else
    echo "OK: SM_${SM} is supported by NVBit 1.7.7.1"
fi

=== GPU ===
Tesla T4, 7.5, 580.82.07

=== CUDA Toolkit ===
Cuda compilation tools, release 12.8, V12.8.93

=== Driver ===
580.82.07

SM version: sm_75
OK: SM_75 is supported by NVBit 1.7.7.1
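
The same compute-capability check can be written in Python, which may be convenient if later cells need the SM number. This is a minimal sketch: `sm_from_compute_cap` and `nvbit_supports` are hypothetical helper names, and the default `max_sm=120` is copied from the warning message in the cell above, not from NVBit's documentation.

```python
def sm_from_compute_cap(compute_cap: str) -> int:
    """Convert nvidia-smi's compute_cap string (e.g. "7.5") to an SM number (75)."""
    major, minor = compute_cap.strip().split(".")
    return int(major) * 10 + int(minor)


def nvbit_supports(sm: int, max_sm: int = 120) -> bool:
    """Mirror the bash check: an SM number above max_sm only triggers a warning."""
    return sm <= max_sm


# Sample inputs; "7.5" is the value reported for the T4 in the output above.
for cap in ("7.5", "12.0", "12.1"):
    sm = sm_from_compute_cap(cap)
    status = "OK" if nvbit_supports(sm) else "WARNING"
    print(f"sm_{sm}: {status}")
```

On a live runtime the input string would come from the `nvidia-smi --query-gpu=compute_cap` call used in the cell above.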


## 1. Install Dependencies

In [2]:
%%bash
set -e

echo "=== Installing LLVM 20 ==="
wget -qO- https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add - 2>/dev/null
echo "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy-20 main" | sudo tee /etc/apt/sources.list.d/llvm-20.list > /dev/null
sudo apt-get update -qq
sudo apt-get install -y -qq llvm-20-dev clang-20 libzstd-dev 2>&1 | tail -3
echo "LLVM $(llvm-config-20 --version) installed"

echo ""
echo "=== Installing Rust ==="
if ! command -v cargo &>/dev/null; then
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --quiet 2>&1 | tail -1
    echo "Rust installed"
else
    echo "Rust already installed: $(cargo --version)"
fi

echo ""
echo "=== Upgrading CMake ==="
pip install -q cmake --upgrade 2>&1 | tail -1
echo "CMake $(cmake --version | head -1)"

=== Installing LLVM 20 ===
OK

/sbin/ldconfig.real: /usr/local/lib/libtcm_debug.so.1 is not a symbolic link

LLVM 20.1.8 installed

=== Installing Rust ===
source $"($nu.home-path)/.cargo/env.nu"  # For nushell
Rust installed

=== Upgrading CMake ===
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28.9/28.9 MB 73.4 MB/s eta 0:00:00
CMake cmake version 4.2.1


W: http://apt.llvm.org/jammy/dists/llvm-toolchain-jammy-20/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)


## 2. Clone and Build

In [3]:
%%bash
set -e

cd /content

# Clone (or pull if already cloned)
if [ -d "parallax" ]; then
    echo "Repository already cloned, pulling latest..."
    cd parallax && git pull && cd ..
else
    echo "Cloning PRLX..."
    git clone https://github.com/khushiyant/parallax.git
fi

echo "Done."

Cloning PRLX...
Done.


Cloning into 'parallax'...


In [4]:
%%bash
set -e
export PATH="$HOME/.cargo/bin:$PATH"

cd /content/parallax

# Detect GPU SM version for targeted build
SM=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -1 | tr -d '.')
echo "Building for sm_${SM}"

echo ""
echo "=== CMake Configure ==="
cmake -B build \
    -DPRLX_CUDA_ARCHITECTURES="${SM}" \
    -DPRLX_BUILD_NVBIT=ON \
    -DCMAKE_BUILD_TYPE=Release \
    2>&1 | grep -E '(Found|Using|Downloaded|WARNING|ERROR|-- Build|LLVM|NVBit|zstd)'

echo ""
echo "=== CMake Build (C++/CUDA) ==="
cmake --build build -j$(nproc) 2>&1 | tail -20

echo ""
echo "=== Build Artifacts ==="
ls -lh build/lib/pass/libPrlxPass.so 2>/dev/null && echo "  LLVM Pass: OK" || echo "  LLVM Pass: NOT BUILT"
ls -lh build/lib/runtime/libprlx_runtime_shared.so 2>/dev/null && echo "  Runtime: OK" || echo "  Runtime: NOT BUILT"
ls -lh build/lib/nvbit_tool/libprlx_nvbit.so 2>/dev/null && echo "  NVBit Tool: OK" || echo "  NVBit Tool: NOT BUILT"

Building for sm_75

=== CMake Configure ===
-- Found CUDAToolkit: /usr/local/cuda/targets/x86_64-linux/include (found version "12.8.93")
-- Found Threads: TRUE
-- Manually configured zstd: /usr/lib/x86_64-linux-gnu/libzstd.so.1
-- Found FFI: /usr/lib/x86_64-linux-gnu/libffi.so
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11")
-- Found zstd: /usr/lib/x86_64-linux-gnu/libzstd.so
-- Found LibXml2: /usr/lib/x86_64-linux-gnu/libxml2.so (found version "2.9.13")
-- Found CURL: /usr/lib/x86_64-linux-gnu/libcurl.so (found version "7.81.0")
-- Found LLVM 20.1.8
-- Using LLVMConfig.cmake in: /usr/lib/llvm-20/cmake
-- zstd compression enabled
-- Found LLVM 20.1.8
-- Using LLVMConfig.cmake in: /usr/lib/llvm-20/cmake
-- Downloaded NVBit 1.7.1 to: /content/parallax/build/_deps/nvbit_sdk-src
-- Build files have been written to: /content/parallax/build

=== CMake Build (C++/CUDA) ===
[ 81%] Building CUDA object examples/CMakeFiles/instrumented_divergence.dir/instrumented_diverge

In [5]:
%%bash
set -e
export PATH="$HOME/.cargo/bin:$PATH"

cd /content/parallax/differ

echo "=== Building Rust Differ ==="
cargo build --release 2>&1 | tail -5

echo ""
echo "=== Running Differ Tests ==="
cargo test 2>&1 | tail -15

echo ""
ls -lh target/release/prlx-diff
echo "Differ: OK"

=== Building Rust Differ ===
   |
   = note: `EventDisplay` has derived impls for the traits `Clone` and `Debug`, but these are intentionally ignored during dead code analysis

    Finished `release` profile [optimized] target(s) in 1m 41s

=== Running Differ Tests ===
test test_branch_direction_divergence ... ok
test test_extra_events_resync ... ok
test test_identical_traces ... ok
test test_path_divergence ... ok
test test_snapshot_integration ... ok
test test_value_divergence ... ok

test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.04s

   Doc-tests prlx_diff

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s


-rwxr-xr-x 2 root root 2.4M Feb 14 18:13 target/release/prlx-diff
Differ: OK
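
The artifact checks from the two build cells can be collected into one place. The sketch below is an illustration only: `ARTIFACTS`, `check_artifacts`, and `missing_artifacts` are hypothetical names, and the relative paths are copied from the `ls` commands above.

```python
from pathlib import Path

# Paths relative to the repository root, copied from the build cells above.
ARTIFACTS = {
    "LLVM Pass": "build/lib/pass/libPrlxPass.so",
    "Runtime": "build/lib/runtime/libprlx_runtime_shared.so",
    "NVBit Tool": "build/lib/nvbit_tool/libprlx_nvbit.so",
    "Differ": "differ/target/release/prlx-diff",
}


def check_artifacts(repo_root):
    """Return {component name: True if its file exists under repo_root}."""
    root = Path(repo_root)
    return {name: (root / rel).is_file() for name, rel in ARTIFACTS.items()}


def missing_artifacts(repo_root):
    """List the components whose build output is absent."""
    return [name for name, ok in check_artifacts(repo_root).items() if not ok]


# Report in the same OK / NOT BUILT style as the bash cell.
for name, ok in check_artifacts("/content/parallax").items():
    print(f"{name}: {'OK' if ok else 'NOT BUILT'}")
```

Calling `missing_artifacts("/content/parallax")` before the demo cells gives a quick way to stop early if a build step silently produced nothing.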


In [None]:
%%bash
set -e

cd /content/parallax

echo "=== Installing Python package (editable) ==="
pip install -e . 2>&1 | tail -10

echo ""
echo "=== Verifying Python ==="
python -c "import prlx; print(f'prlx version: {prlx.__version__}')"
python -c "from prlx._find_lib import find_pass_plugin, find_differ_binary; print(f'Pass: {find_pass_plugin()}'); print(f'Differ: {find_differ_binary()}')"

echo ""
echo "Python package: OK"

## 3. LLVM Pass Pipeline — Branch Divergence Demo

This tests the full compile-time instrumentation pipeline:
1. `clang++-20` compiles the CUDA source with `-fpass-plugin=` pointing at `libPrlxPass.so`
2. The pass instruments all branches, atomics, and shared memory stores
3. Run the kernel twice with different thresholds → two `.prlx` trace files
4. `prlx-diff` compares them and reports divergences
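
To see what divergence to expect before running anything on the GPU, the demo kernel's branch can be modelled on the host. The sketch below is a plain-Python approximation, not PRLX's trace format or `prlx-diff`'s algorithm. It covers only the `value > threshold` branch (not the bounds check), assumes 32-lane warps with `data[i] = i` as in the demo source, and uses the thresholds 10 and 100 from the run cell further down. `branch_masks` and `diff_masks` are hypothetical names.

```python
WARP_SIZE = 32


def branch_masks(n, threshold, warp_size=WARP_SIZE):
    """Return one bitmask per warp; bit i is set when lane i takes `value > threshold`."""
    masks = []
    for warp_start in range(0, n, warp_size):
        mask = 0
        for lane in range(min(warp_size, n - warp_start)):
            value = warp_start + lane  # the demo fills data[i] = i
            if value > threshold:
                mask |= 1 << lane
        masks.append(mask)
    return masks


def diff_masks(a, b):
    """Return (warp index, mask A, mask B) for every warp whose masks differ."""
    return [(w, ma, mb) for w, (ma, mb) in enumerate(zip(a, b)) if ma != mb]


run_a = branch_masks(128, 10)
run_b = branch_masks(128, 100)
for warp, ma, mb in diff_masks(run_a, run_b):
    flipped = bin(ma ^ mb).count("1")
    print(f"warp {warp}: A={ma:#010x} B={mb:#010x} lanes flipped={flipped}")
# warp 0: A=0xfffff800 B=0x00000000 lanes flipped=21
# warp 1: A=0xffffffff B=0x00000000 lanes flipped=32
# warp 2: A=0xffffffff B=0x00000000 lanes flipped=32
# warp 3: A=0xffffffff B=0xffffffe0 lanes flipped=5
```

All four warps differ between the two runs, and 90 lanes change direction in total (the values 11 through 100), so a diff of the real traces that reports no branch divergence would point to a problem in instrumentation or trace capture.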

In [None]:
%%bash
set -e

cd /content/parallax
WORK=/tmp/prlx_demo
rm -rf $WORK && mkdir -p $WORK

SM=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -1 | tr -d '.')

# Write a simple branch divergence kernel
cat > $WORK/branch_demo.cu << 'EOF'
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void branch_kernel(int* data, int* out, int threshold, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    int value = data[idx];

    // This branch diverges differently depending on threshold
    if (value > threshold) {
        out[idx] = value * 2;    // taken path
    } else {
        out[idx] = -value;       // not-taken path
    }
}

int main(int argc, char** argv) {
    const int N = 128;
    int threshold = 64;
    if (argc > 1) threshold = atoi(argv[1]);

    printf("Branch demo: N=%d, threshold=%d\n", N, threshold);

    int* h_data = new int[N];
    int* h_out = new int[N];
    for (int i = 0; i < N; i++) h_data[i] = i;

    int *d_data, *d_out;
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMalloc(&d_out, N * sizeof(int));
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

    branch_kernel<<<(N+31)/32, 32>>>(d_data, d_out, threshold, N);
    cudaDeviceSynchronize();

    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("Sample: out[0]=%d, out[63]=%d, out[64]=%d, out[127]=%d\n",
           h_out[0], h_out[63], h_out[64], h_out[127]);

    cudaFree(d_data); cudaFree(d_out);
    delete[] h_data; delete[] h_out;
    return 0;
}
EOF

echo "=== Compiling with LLVM pass (clang + libPrlxPass.so) ==="
PASS=$(find /content/parallax/build -name 'libPrlxPass.so' | head -1)
echo "Pass plugin: $PASS"

clang++-20 \
    -fpass-plugin=$PASS \
    --cuda-gpu-arch=sm_${SM} \
    -I/content/parallax/lib/runtime \
    -I/content/parallax/lib/common \
    -L/content/parallax/build/lib/runtime \
    $WORK/branch_demo.cu \
    -lcudart -lprlx_runtime_shared \
    -o $WORK/branch_demo \
    -Wno-unknown-cuda-version 2>&1

echo ""
echo "Binary: $(ls -lh $WORK/branch_demo)"
echo "Compilation: OK"

In [None]:
%%bash
set -e

WORK=/tmp/prlx_demo
cd $WORK

export LD_LIBRARY_PATH=/content/parallax/build/lib/runtime:$LD_LIBRARY_PATH

echo "=== Run A: threshold=10 (most threads take the 'true' branch) ==="
PRLX_TRACE=$WORK/run_a.prlx ./branch_demo 10
echo ""

echo "=== Run B: threshold=100 (most threads take the 'false' branch) ==="
PRLX_TRACE=$WORK/run_b.prlx ./branch_demo 100
echo ""

echo "=== Trace files ==="
ls -lh $WORK/*.prlx 2>/dev/null || echo "No trace files generated (runtime may not be linked)"
echo ""

echo "=== Differential Analysis ==="
DIFFER=/content/parallax/differ/target/release/prlx-diff
if [ -f "$WORK/run_a.prlx" ] && [ -f "$WORK/run_b.prlx" ]; then
    $DIFFER $WORK/run_a.prlx $WORK/run_b.prlx --format=text 2>&1 || true
    echo ""
    echo "Differential analysis: COMPLETE"
else
    echo "Skipping diff — trace files not found."
    echo "This is expected if the kernel was compiled without runtime linking."
fi

## 4. NVBit Pipeline — Zero-Recompilation Instrumentation

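NVBit instruments kernels at the SASS level once the tool library is injected with `LD_PRELOAD`, so the target needs no recompilation.
That is its key advantage over the LLVM pass: it can instrument **any** existing CUDA binary, without access to the source, as long as the GPU is one NVBit supports (see the check in section 0).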
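
If the preload needs to be driven from Python rather than bash, the environment can be assembled as below. This is a sketch under assumptions: `nvbit_env` is a hypothetical helper, and the variable names `LD_PRELOAD`, `PRLX_TRACE`, and `PRLX_SITES` and the file paths are the ones used in the NVBit run cell of this notebook.

```python
import os


def nvbit_env(tool_lib, trace_path, sites_path, base_env=None):
    """Return a copy of the environment with the NVBit tool preloaded."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = tool_lib      # injects the tool into the target process
    env["PRLX_TRACE"] = trace_path    # where the trace is written
    env["PRLX_SITES"] = sites_path    # where the site table is written
    return env


# A small base_env keeps the printed output short; omit it to inherit os.environ.
env = nvbit_env(
    "/content/parallax/build/lib/nvbit_tool/libprlx_nvbit.so",
    "/tmp/prlx_nvbit_demo/nvbit_a.prlx",
    "/tmp/prlx_nvbit_demo/sites_a.json",
    base_env={"PATH": "/usr/bin"},
)
for key in ("LD_PRELOAD", "PRLX_TRACE", "PRLX_SITES"):
    print(f"{key}={env[key]}")

# Launching the unmodified binary under the tool needs a GPU and the built library:
# import subprocess
# subprocess.run(["/tmp/prlx_nvbit_demo/plain_kernel", "0.3"], env=env, check=True)
```

Building a copy of the environment rather than mutating `os.environ` keeps the preload scoped to the one child process, so other commands in the notebook are not instrumented by accident.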

In [None]:
%%bash
set -e

WORK=/tmp/prlx_nvbit_demo
rm -rf $WORK && mkdir -p $WORK

SM=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -1 | tr -d '.')

# Compile a plain CUDA binary (NO prlx pass, NO instrumentation)
cat > $WORK/plain_kernel.cu << 'EOF'
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void compute_kernel(float* data, float* out, float scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float val = data[idx];

    // Branch: different paths depending on value vs scale
    if (val > scale) {
        out[idx] = val * 2.0f;
    } else {
        out[idx] = val + scale;
    }
}

int main(int argc, char** argv) {
    const int N = 256;
    float scale = 0.5f;
    if (argc > 1) scale = atof(argv[1]);

    printf("NVBit demo: N=%d, scale=%.1f\n", N, scale);

    float* h_data = new float[N];
    float* h_out = new float[N];
    for (int i = 0; i < N; i++) h_data[i] = (float)i / N;

    float *d_data, *d_out;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    compute_kernel<<<(N+31)/32, 32>>>(d_data, d_out, scale, N);
    cudaDeviceSynchronize();

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("Sample: out[0]=%.2f, out[127]=%.2f, out[255]=%.2f\n",
           h_out[0], h_out[127], h_out[255]);

    cudaFree(d_data); cudaFree(d_out);
    delete[] h_data; delete[] h_out;
    return 0;
}
EOF

echo "=== Compiling plain CUDA binary (no instrumentation) ==="
nvcc -arch=sm_${SM} $WORK/plain_kernel.cu -o $WORK/plain_kernel
echo "Compiled: $(ls -lh $WORK/plain_kernel)"

echo ""
echo "=== Sanity run (no NVBit) ==="
$WORK/plain_kernel 0.5

In [None]:
%%bash
set -e

WORK=/tmp/prlx_nvbit_demo
NVBIT_LIB=/content/parallax/build/lib/nvbit_tool/libprlx_nvbit.so
DIFFER=/content/parallax/differ/target/release/prlx-diff

if [ ! -f "$NVBIT_LIB" ]; then
    echo "ERROR: libprlx_nvbit.so not found. NVBit build may have failed."
    exit 1
fi

echo "=== Run A with NVBit: scale=0.3 ==="
PRLX_TRACE=$WORK/nvbit_a.prlx \
PRLX_SITES=$WORK/sites_a.json \
    LD_PRELOAD=$NVBIT_LIB $WORK/plain_kernel 0.3 2>&1
echo ""

echo "=== Run B with NVBit: scale=0.8 ==="
PRLX_TRACE=$WORK/nvbit_b.prlx \
PRLX_SITES=$WORK/sites_b.json \
    LD_PRELOAD=$NVBIT_LIB $WORK/plain_kernel 0.8 2>&1
echo ""

echo "=== NVBit Trace Files ==="
ls -lh $WORK/nvbit_*.prlx 2>/dev/null || echo "No trace files (NVBit may not have captured events)"
ls -lh $WORK/sites_*.json 2>/dev/null || echo "No site table files"
echo ""

echo "=== Site Table (Run A) ==="
if [ -f "$WORK/sites_a.json" ]; then
    python3 -m json.tool $WORK/sites_a.json 2>/dev/null | head -30
fi
echo ""

echo "=== Differential Analysis (NVBit traces) ==="
if [ -f "$WORK/nvbit_a.prlx" ] && [ -f "$WORK/nvbit_b.prlx" ]; then
    $DIFFER $WORK/nvbit_a.prlx $WORK/nvbit_b.prlx --format=text 2>&1 || true
    echo ""
    echo "NVBit pipeline: COMPLETE"
else
    echo "No trace files to diff."
fi

## 5. Python API — Read and Diff Traces Programmatically

In [None]:
import prlx
import os

print(f"prlx version: {prlx.__version__}")

# Try to read the LLVM pass traces
trace_a_path = "/tmp/prlx_demo/run_a.prlx"
trace_b_path = "/tmp/prlx_demo/run_b.prlx"

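# Fall back to NVBit traces if either LLVM pass trace is missing
if not (os.path.exists(trace_a_path) and os.path.exists(trace_b_path)):
    trace_a_path = "/tmp/prlx_nvbit_demo/nvbit_a.prlx"
    trace_b_path = "/tmp/prlx_nvbit_demo/nvbit_b.prlx"

# Both traces are read below, so require both files
if os.path.exists(trace_a_path) and os.path.exists(trace_b_path):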
    trace_a = prlx.read_trace(trace_a_path)
    trace_b = prlx.read_trace(trace_b_path)

    print(f"\nTrace A: {trace_a_path}")
    print(f"  Kernel: {trace_a.kernel_name}")
    print(f"  Grid:   {trace_a.grid_dim}")
    print(f"  Block:  {trace_a.block_dim}")
    print(f"  Warps:  {trace_a.num_warps}")
    print(f"  Events: {sum(len(w) for w in trace_a.warp_events)}")

    print(f"\nTrace B: {trace_b_path}")
    print(f"  Kernel: {trace_b.kernel_name}")
    print(f"  Events: {sum(len(w) for w in trace_b.warp_events)}")

    # Run diff
    rc = prlx.diff(trace_a_path, trace_b_path)
    print(f"\nDiff exit code: {rc}")
else:
    print("No trace files found. Run the demo cells above first.")
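
Before invoking the differ, a per-warp event count can serve as a cheap first comparison. The sketch below assumes only that a trace exposes `warp_events` as a sequence of per-warp event sequences, which is what the `sum(len(w) for w in trace_a.warp_events)` expression above implies. The helper names are hypothetical, and the `SimpleNamespace` objects are stand-ins so the example runs without trace files.

```python
from types import SimpleNamespace


def events_per_warp(trace):
    """Return the number of recorded events for each warp in a trace."""
    return [len(events) for events in trace.warp_events]


def warps_with_different_counts(trace_a, trace_b):
    """Indices of warps whose event counts differ; unmatched warps count as different."""
    a, b = events_per_warp(trace_a), events_per_warp(trace_b)
    shared = min(len(a), len(b))
    differing = [i for i in range(shared) if a[i] != b[i]]
    return differing + list(range(shared, max(len(a), len(b))))


# Stand-in traces; real ones would come from prlx.read_trace(...) as in the cell above.
fake_a = SimpleNamespace(warp_events=[["e0", "e1"], ["e0"], ["e0", "e1", "e2"]])
fake_b = SimpleNamespace(warp_events=[["e0", "e1"], ["e0", "e1"], ["e0", "e1", "e2"], ["e0"]])

print(events_per_warp(fake_a))                      # [2, 1, 3]
print(warps_with_different_counts(fake_a, fake_b))  # [1, 3]
```

Equal counts do not imply identical traces, since two warps can record the same number of events with different branch directions or values, so this only narrows down where to look before running `prlx.diff`.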

## 6. Summary

If everything above ran successfully, you've verified:

| Component | What it does | Status |
|-----------|-------------|--------|
| **LLVM Pass** | Compile-time CUDA instrumentation | Compiles + instruments branches |
| **Runtime** | Device-side trace capture | Captures per-warp events to `.prlx` |
| **NVBit Tool** | SASS-level binary instrumentation | LD_PRELOAD on unmodified binaries |
| **Differ** | Trace comparison + divergence detection | 17/17 Rust tests pass |
| **Python API** | `prlx.read_trace()`, `prlx.diff()` | Reads traces, invokes differ |
| **CLI** | `prlx compile`, `prlx diff` | Full pipeline orchestration |