# EdgeLLM CUDA Inference Test

This notebook tests the CUDA T-MAC kernels for GPU-accelerated BitNet inference.

**Requirements:**
- NVIDIA GPU (Jetson, RTX, etc.)
- CUDA Toolkit 11.0+
- nvcc compiler

## 1. Check GPU Environment

In [None]:
# Check NVIDIA GPU
!nvidia-smi

In [None]:
# Check CUDA version
!nvcc --version

In [None]:
# Get GPU details
!nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
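
The shell checks above can also be done from Python, so a missing tool shows up as a clear message before the build step. This is a minimal sketch: the helper name `check_tools` is hypothetical, and it only confirms that the executables are on `PATH`, not that their versions are sufficient.

```python
import shutil


def check_tools(tools=("nvidia-smi", "nvcc", "make", "git")):
    """Return a dict mapping each tool name to True if it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


# Report the status of every tool this notebook shells out to.
for tool, found in check_tools().items():
    print(f"{tool}: {'found' if found else 'NOT found on PATH'}")
```

If `nvcc` is reported missing while `nvidia-smi` is found, the driver is installed but the CUDA Toolkit is probably not on `PATH`.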

## 2. Clone Repository and Build CUDA Kernels

In [None]:
# Clone the repository (if not already present)
import os
if not os.path.exists('ollama-api-gateway'):
    !git clone https://github.com/umerkhan95/ollama-api-gateway.git
else:
    print('Repository already exists, pulling latest changes...')
    !cd ollama-api-gateway && git pull

In [None]:
# Navigate to kernels directory
%cd ollama-api-gateway/mojo-gateway/src/kernels

In [None]:
# Build CUDA kernels
!make cuda

In [None]:
# Verify build output
!ls -la ../../lib/

## 3. Run CUDA Kernel Tests

In [None]:
# Run CUDA unit tests
!make cuda-test

## 4. Python CUDA Kernel Test

Test the CUDA kernels directly from Python using ctypes.
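
Passing NumPy arrays through ctypes only works as intended when the array is `float32` and C-contiguous, because the kernel receives a raw pointer with no shape or dtype information. The helper below is a small sketch of a guard for that; `as_float_ptr` is a hypothetical name and is not part of the library.

```python
import ctypes

import numpy as np


def as_float_ptr(arr):
    """Return a float* pointing at arr's data, or raise if the layout is unsafe."""
    if arr.dtype != np.float32:
        raise TypeError(f"expected float32, got {arr.dtype}")
    if not arr.flags["C_CONTIGUOUS"]:
        raise ValueError("array must be C-contiguous")
    return arr.ctypes.data_as(ctypes.POINTER(ctypes.c_float))


# The pointer indexes the flattened row-major data.
x = np.arange(6, dtype=np.float32).reshape(2, 3)
ptr = as_float_ptr(x)
print(ptr[4])  # element at row 1, column 1
```

The cells below call `arr.ctypes.data_as(...)` directly, which is equivalent for the arrays created in this notebook since they are all freshly allocated `float32` arrays.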

In [None]:
import ctypes
import numpy as np
import os

# Find the CUDA library
lib_path = '../../lib/libtmac_kernel_cuda.so'
if not os.path.exists(lib_path):
    raise FileNotFoundError(f'CUDA library not found at {lib_path}. Run make cuda first.')

# Load the library
cuda_lib = ctypes.CDLL(lib_path)
print(f'Loaded CUDA library: {lib_path}')

In [None]:
# Define function signatures
cuda_lib.cuda_available.restype = ctypes.c_int
cuda_lib.cuda_device_name.restype = ctypes.c_char_p
cuda_lib.cuda_init.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int]
cuda_lib.cuda_init.restype = ctypes.c_int
cuda_lib.cuda_cleanup.restype = None

# Check CUDA availability
if cuda_lib.cuda_available():
    device_name = cuda_lib.cuda_device_name().decode('utf-8')
    print(f'CUDA Available: Yes')
    print(f'Device: {device_name}')
else:
    print('CUDA Not Available')

In [None]:
# Initialize CUDA
max_weights = 10_000_000  # 10MB
max_activations = 1_000_000
max_output = 1_000_000

ret = cuda_lib.cuda_init(max_weights, max_activations, max_output)
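if ret == 0:
    print('CUDA initialized successfully')
else:
    # Stop here: the remaining cells depend on a successful cuda_init.
    raise RuntimeError(f'CUDA initialization failed (return code {ret})')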

In [None]:
# Test RMSNorm kernel
cuda_lib.rmsnorm_cuda.argtypes = [
    ctypes.POINTER(ctypes.c_float),  # output
    ctypes.POINTER(ctypes.c_float),  # input
    ctypes.POINTER(ctypes.c_float),  # weight
    ctypes.c_int,                     # batch_size
    ctypes.c_int,                     # size
    ctypes.c_float                    # eps
]
cuda_lib.rmsnorm_cuda.restype = ctypes.c_int

# Create test data
batch_size = 4
size = 256

input_data = np.random.randn(batch_size, size).astype(np.float32)
weight_data = np.ones(size, dtype=np.float32)
output_data = np.zeros((batch_size, size), dtype=np.float32)

# Run RMSNorm on GPU
ret = cuda_lib.rmsnorm_cuda(
    output_data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    input_data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    weight_data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    batch_size,
    size,
    ctypes.c_float(1e-6)
)

if ret == 0:
    print('RMSNorm CUDA: SUCCESS')
    print(f'Input mean: {input_data.mean():.4f}')
    print(f'Output mean: {output_data.mean():.4f}')
    print(f'Output std: {output_data.std():.4f}')
else:
    print('RMSNorm CUDA: FAILED')
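
A zero return code only shows that the kernel ran, not that its output is correct. A CPU reference makes the check stronger. The sketch below assumes the standard RMSNorm definition, `y = x / sqrt(mean(x^2) + eps) * weight`; the kernel may place `eps` differently, so treat small mismatches with that in mind. `rmsnorm_reference` is a hypothetical helper name.

```python
import numpy as np


def rmsnorm_reference(x, weight, eps=1e-6):
    """CPU RMSNorm over the last axis, computed in float64 for accuracy."""
    x64 = x.astype(np.float64)
    rms = np.sqrt(np.mean(x64 * x64, axis=-1, keepdims=True) + eps)
    return (x64 / rms * weight).astype(np.float32)


# Self-check on random data: with unit weights each row has mean square close to 1.
rng = np.random.default_rng(0)
sample = rng.standard_normal((4, 256)).astype(np.float32)
reference = rmsnorm_reference(sample, np.ones(256, dtype=np.float32))
print(np.mean(reference.astype(np.float64) ** 2, axis=1))
```

To compare against the GPU result from the cell above, something like `np.allclose(output_data, rmsnorm_reference(input_data, weight_data), atol=1e-4)` should work; the tolerance is a guess and may need adjusting.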

In [None]:
# Test Softmax kernel
cuda_lib.softmax_cuda.argtypes = [
    ctypes.POINTER(ctypes.c_float),  # output
    ctypes.POINTER(ctypes.c_float),  # input
    ctypes.c_int,                     # batch_size
    ctypes.c_int                      # size
]
cuda_lib.softmax_cuda.restype = ctypes.c_int

# Create test data
logits = np.random.randn(batch_size, size).astype(np.float32) * 2
probs = np.zeros((batch_size, size), dtype=np.float32)

# Run Softmax on GPU
ret = cuda_lib.softmax_cuda(
    probs.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    logits.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    batch_size,
    size
)

if ret == 0:
    print('Softmax CUDA: SUCCESS')
    # Verify softmax sums to 1
    for b in range(batch_size):
        row_sum = probs[b].sum()
        print(f'  Batch {b} sum: {row_sum:.6f} (should be ~1.0)')
else:
    print('Softmax CUDA: FAILED')
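
Row sums close to 1.0 are necessary but not sufficient: a kernel that returned a uniform distribution would also pass. A numerically stable CPU softmax gives an element-wise reference. This is a sketch; `softmax_reference` is a hypothetical helper name.

```python
import numpy as np


def softmax_reference(logits):
    """Numerically stable softmax over the last axis, computed in float64."""
    z = logits.astype(np.float64)
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max to avoid overflow
    e = np.exp(z)
    return (e / e.sum(axis=-1, keepdims=True)).astype(np.float32)


# Self-check: rows sum to 1 and large logits do not overflow.
demo = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]], dtype=np.float32)
print(softmax_reference(demo).sum(axis=1))
```

To compare against the GPU result from the cell above, something like `np.allclose(probs, softmax_reference(logits), atol=1e-5)` should work; the tolerance is a guess.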

## 5. Performance Benchmark

In [None]:
import time

# Benchmark RMSNorm
batch_size = 32
size = 4096  # Typical hidden size
iterations = 1000

input_data = np.random.randn(batch_size, size).astype(np.float32)
weight_data = np.ones(size, dtype=np.float32)
output_data = np.zeros((batch_size, size), dtype=np.float32)

# Warmup
for _ in range(10):
    cuda_lib.rmsnorm_cuda(
        output_data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        input_data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        weight_data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        batch_size, size, ctypes.c_float(1e-6)
    )

# Benchmark
cuda_lib.cuda_sync()  # Ensure warmup is done
start = time.perf_counter()
for _ in range(iterations):
    cuda_lib.rmsnorm_cuda(
        output_data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        input_data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        weight_data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        batch_size, size, ctypes.c_float(1e-6)
    )
cuda_lib.cuda_sync()
end = time.perf_counter()

total_time = end - start
per_call = total_time / iterations * 1000  # ms
throughput = iterations / total_time

print(f'RMSNorm Benchmark ({batch_size}x{size}):')
print(f'  Total time: {total_time:.3f}s for {iterations} iterations')
print(f'  Per call: {per_call:.3f}ms')
print(f'  Throughput: {throughput:.1f} calls/sec')

In [None]:
# Benchmark Softmax
logits = np.random.randn(batch_size, size).astype(np.float32)
probs = np.zeros((batch_size, size), dtype=np.float32)

# Warmup
for _ in range(10):
    cuda_lib.softmax_cuda(
        probs.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        logits.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        batch_size, size
    )

# Benchmark
cuda_lib.cuda_sync()
start = time.perf_counter()
for _ in range(iterations):
    cuda_lib.softmax_cuda(
        probs.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        logits.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        batch_size, size
    )
cuda_lib.cuda_sync()
end = time.perf_counter()

total_time = end - start
per_call = total_time / iterations * 1000
throughput = iterations / total_time

print(f'Softmax Benchmark ({batch_size}x{size}):')
print(f'  Total time: {total_time:.3f}s for {iterations} iterations')
print(f'  Per call: {per_call:.3f}ms')
print(f'  Throughput: {throughput:.1f} calls/sec')
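
The warmup, sync, time, sync pattern is repeated in each benchmark cell. It can be factored into one helper, sketched below. The name `benchmark` and its parameters are hypothetical. Note that these timings are taken on calls that receive host (NumPy) pointers, so if the library copies data to and from the device on every call, the numbers include transfer time as well as kernel time.

```python
import time


def benchmark(fn, iterations=1000, warmup=10, sync=None):
    """Return the mean time per call of fn in milliseconds.

    sync, if given, is called before starting and before stopping the timer
    so that asynchronous GPU work is included in the measurement.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1000.0


# Stand-alone demo with a trivial CPU function.
print(f"{benchmark(lambda: sum(range(100)), iterations=100):.4f} ms per call")
```

In this notebook it could be used as `benchmark(lambda: cuda_lib.softmax_cuda(p_ptr, l_ptr, batch_size, size), sync=cuda_lib.cuda_sync)`, where `p_ptr` and `l_ptr` are the ctypes pointers built once outside the loop.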

In [None]:
# Cleanup
cuda_lib.cuda_cleanup()
print('CUDA resources cleaned up')

## 6. Full T-MAC MatMul Benchmark

Test the T-MAC matmul kernel, which builds its lookup table (LUT) inside the kernel and so avoids the separate per-token LUT build step of the CPU path.
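
To make the lookup-table idea concrete, here is a minimal NumPy sketch of a LUT-based ternary matmul. It is an illustration under stated assumptions, not the CUDA kernel's actual algorithm: the group size of 2, the base-3 index encoding, and the helper names `build_lut`, `encode_weights`, and `lut_matmul` are all hypothetical.

```python
import itertools

import numpy as np

GROUP = 2  # weights per table index (assumed for illustration)
# All 3**GROUP ternary patterns, in itertools.product order (first element most significant).
PATTERNS = np.array(list(itertools.product((-1, 0, 1), repeat=GROUP)), dtype=np.float32)


def build_lut(activations):
    """For each activation group, precompute its dot product with every pattern."""
    groups = activations.reshape(-1, GROUP)  # (K/GROUP, GROUP)
    return groups @ PATTERNS.T               # (K/GROUP, 3**GROUP)


def encode_weights(w):
    """Map ternary weights (M, K) to base-3 pattern indices (M, K/GROUP)."""
    digits = (w + 1).reshape(w.shape[0], -1, GROUP)  # values 0, 1, 2
    return digits[..., 0] * 3 + digits[..., 1]


def lut_matmul(idx, lut, scales):
    """Sum the looked-up partial results per output row, then apply row scales."""
    group_ids = np.arange(lut.shape[0])
    return lut[group_ids, idx].sum(axis=1) * scales


# Compare against a plain matmul on random ternary weights.
rng = np.random.default_rng(0)
M, K = 8, 16
w = rng.integers(-1, 2, size=(M, K)).astype(np.int8)
a = rng.standard_normal(K).astype(np.float32)
s = rng.uniform(0.1, 1.0, M).astype(np.float32)
out = lut_matmul(encode_weights(w), build_lut(a), s)
print(np.abs(out - (w.astype(np.float32) @ a) * s).max())
```

The table depends only on the activations, so it replaces multiplications with lookups for every output row that shares those activations.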

In [None]:
# Re-initialize CUDA
ret = cuda_lib.cuda_init(60_000_000, 4096, 50000)  # 60MB for weights
print(f'CUDA re-initialized: {ret == 0}')

# Define T-MAC matmul signature
cuda_lib.tmac_matmul_cuda.argtypes = [
    ctypes.POINTER(ctypes.c_int8),    # weights (packed ternary)
    ctypes.POINTER(ctypes.c_float),   # activations
    ctypes.POINTER(ctypes.c_float),   # output
    ctypes.POINTER(ctypes.c_float),   # scales
    ctypes.c_int,                      # M (output rows)
    ctypes.c_int,                      # N (batch size)
    ctypes.c_int                       # K (input dimension)
]
cuda_lib.tmac_matmul_cuda.restype = ctypes.c_int

# SmolLM-135M dimensions
dim = 576        # hidden dimension
hidden_dim = 1536  # FFN intermediate dimension
vocab_size = 49152  # vocabulary size

print(f'\nSmolLM-135M dimensions:')
print(f'  dim (hidden): {dim}')
print(f'  hidden_dim (FFN): {hidden_dim}')
print(f'  vocab_size: {vocab_size}')

In [None]:
# Create test data for T-MAC matmul
M = dim  # Output rows (e.g., QKV projection)
N = 1    # Batch size (single token)
K = dim  # Input dimension

# Packed ternary weights: each byte holds 4 ternary values
bytes_per_row = (K + 3) // 4
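# Random bytes stand in for packed weights. randint's upper bound is exclusive,
# so 128 is needed to cover the full int8 range.
weights = np.random.randint(-128, 128, (M, bytes_per_row), dtype=np.int8)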
activations = np.random.randn(K).astype(np.float32)
output = np.zeros(M, dtype=np.float32)
scales = np.random.uniform(0.1, 1.0, M).astype(np.float32)

print(f'T-MAC MatMul dimensions:')
print(f'  M={M}, N={N}, K={K}')
print(f'  Weights: {weights.shape} ({weights.nbytes / 1024:.1f} KB)')
print(f'  Activations: {activations.shape}')
print(f'  Output: {output.shape}')

# Run T-MAC matmul
ret = cuda_lib.tmac_matmul_cuda(
    weights.ctypes.data_as(ctypes.POINTER(ctypes.c_int8)),
    activations.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    output.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    scales.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    M, N, K
)

print(f'\nT-MAC MatMul result: {"SUCCESS" if ret == 0 else "FAILED"}')
print(f'Output mean: {output.mean():.4f}, std: {output.std():.4f}')
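
The cell above feeds random bytes to the kernel, which exercises the code path but not a meaningful weight matrix. Packing real ternary values needs a 2-bit code per weight, four per byte. The sketch below shows one way to do that. The encoding (`0 -> 0b00`, `+1 -> 0b01`, `-1 -> 0b10`, lowest bits first) is an assumption for illustration; the layout the kernel actually expects is not stated in this notebook and may differ. `pack_ternary` and `unpack_ternary` are hypothetical helpers.

```python
import numpy as np


def pack_ternary(w):
    """Pack ternary weights (M, K) into int8 bytes (M, ceil(K / 4)), 2 bits each."""
    rows, cols = w.shape
    padded = np.pad(w, ((0, 0), (0, (-cols) % 4)))  # zero-pad K to a multiple of 4
    codes = np.where(padded == 1, 1, np.where(padded == -1, 2, 0)).astype(np.uint8)
    codes = codes.reshape(rows, -1, 4)
    packed = (codes[..., 0]
              | (codes[..., 1] << 2)
              | (codes[..., 2] << 4)
              | (codes[..., 3] << 6))
    return packed.astype(np.uint8).view(np.int8)


def unpack_ternary(packed, cols):
    """Inverse of pack_ternary: recover the (M, cols) ternary matrix."""
    raw = packed.view(np.uint8)
    codes = np.stack([(raw >> shift) & 0b11 for shift in (0, 2, 4, 6)], axis=-1)
    lookup = np.array([0, 1, -1, 0], dtype=np.int8)  # code 0b11 is unused
    return lookup[codes.reshape(raw.shape[0], -1)][:, :cols]


# Round-trip check on a small random matrix.
rng = np.random.default_rng(0)
w = rng.integers(-1, 2, size=(3, 10)).astype(np.int8)
packed = pack_ternary(w)
print(packed.shape, np.array_equal(unpack_ternary(packed, 10), w))
```

The packed shape matches `bytes_per_row = (K + 3) // 4` from the cell above.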

In [None]:
# Benchmark per-layer operations (simulating full forward pass)
iterations = 100

# Layer operations for SmolLM-135M:
# - Q, K, V projections: (dim, dim) x 3
# - Output projection: (dim, dim)
# - FFN gate/up: (hidden_dim, dim) x 2
# - FFN down: (dim, hidden_dim)
# Total per layer: 7 matmuls

# One entry per matmul, so the per-layer total covers all 7 operations.
layers = [
    ('Q Projection', dim, dim),
    ('K Projection', dim, dim),
    ('V Projection', dim, dim),
    ('Output Projection', dim, dim),
    ('FFN Gate', hidden_dim, dim),
    ('FFN Up', hidden_dim, dim),
    ('FFN Down', dim, hidden_dim),
]

print('Per-Layer Benchmark (SmolLM-135M, single token):')
print('=' * 60)

total_time_per_layer = 0

for name, M, K in layers:
    bytes_per_row = (K + 3) // 4
    # Upper bound is exclusive, so 128 covers the full int8 range.
    weights = np.random.randint(-128, 128, (M, bytes_per_row), dtype=np.int8)
    activations = np.random.randn(K).astype(np.float32)
    output = np.zeros(M, dtype=np.float32)
    scales = np.random.uniform(0.1, 1.0, M).astype(np.float32)
    
    # Warmup
    for _ in range(10):
        cuda_lib.tmac_matmul_cuda(
            weights.ctypes.data_as(ctypes.POINTER(ctypes.c_int8)),
            activations.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            output.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            scales.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            M, 1, K
        )
    
    cuda_lib.cuda_sync()
    start = time.perf_counter()
    for _ in range(iterations):
        cuda_lib.tmac_matmul_cuda(
            weights.ctypes.data_as(ctypes.POINTER(ctypes.c_int8)),
            activations.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            output.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            scales.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            M, 1, K
        )
    cuda_lib.cuda_sync()
    end = time.perf_counter()
    
    per_call_ms = (end - start) / iterations * 1000
    total_time_per_layer += per_call_ms
    
    print(f'{name:20s} ({M:4d} x {K:4d}): {per_call_ms:.3f} ms')

print('=' * 60)
print(f'Total per layer: {total_time_per_layer:.3f} ms')

In [None]:
# Estimate full model throughput
n_layers = 30  # SmolLM-135M has 30 transformer layers

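# Per-token time breakdown
# Note: the final vocabulary projection (vocab_size x dim) is not benchmarked above.
matmul_time_per_token = total_time_per_layer * n_layers  # benchmarked matmuls x 30 layers
attention_time_estimate = 0.5  # ms, assumed placeholder (not measured; depends on sequence length)
other_overhead = 0.5  # ms, assumed placeholder (RoPE, residuals, sampling)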

total_per_token = matmul_time_per_token + attention_time_estimate + other_overhead
estimated_throughput = 1000 / total_per_token  # tokens per second

print('\n' + '=' * 60)
print('ESTIMATED FULL MODEL THROUGHPUT')
print('=' * 60)
print(f'Model: SmolLM-135M ({n_layers} layers)')
print(f'')
print(f'Per-token breakdown:')
print(f'  MatMul ({n_layers} layers x {len(layers)} ops): {matmul_time_per_token:.2f} ms')
print(f'  Attention estimate:              {attention_time_estimate:.2f} ms')
print(f'  Other overhead:                  {other_overhead:.2f} ms')
print(f'  -----------------------------------')
print(f'  Total per token:                 {total_per_token:.2f} ms')
print(f'')
print(f'ESTIMATED THROUGHPUT: {estimated_throughput:.1f} tok/s')
print('')
print('Comparison:')
print(f'  Ollama (baseline):    423 tok/s')
print(f'  EdgeLLM CUDA:         {estimated_throughput:.1f} tok/s')
print(f'  Speedup:              {estimated_throughput/423:.2f}x')
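
The estimate above is simple arithmetic: per-layer matmul time multiplied by the layer count, plus two fixed placeholder terms, then inverted to tokens per second. The sketch below isolates that formula so the sensitivity to each term is easy to explore. `estimate_tokens_per_second` is a hypothetical helper, and the 0.05 ms input in the demo is a made-up value, not a measurement.

```python
def estimate_tokens_per_second(per_layer_ms, n_layers, attention_ms=0.5, other_ms=0.5):
    """Convert a per-layer matmul time (ms) into an estimated tokens/sec figure."""
    total_ms = per_layer_ms * n_layers + attention_ms + other_ms
    if total_ms <= 0:
        raise ValueError("total time per token must be positive")
    return 1000.0 / total_ms


# Made-up input: 0.05 ms per layer over 30 layers gives 1.5 ms of matmul time,
# plus 1.0 ms of placeholders, so 2.5 ms per token (about 400 tok/s).
print(f"{estimate_tokens_per_second(0.05, 30):.1f} tok/s")
```

With the default placeholders the estimate can never exceed 1000 tok/s, so once the matmuls are fast, the two assumed 0.5 ms terms dominate the result and should be replaced with measurements.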

In [None]:
# Cleanup
cuda_lib.cuda_cleanup()
print('CUDA resources cleaned up')

## 7. Summary

This notebook tested the CUDA T-MAC kernel integration:

### What We Tested
1. **CUDA Environment** - GPU detection and initialization
2. **RMSNorm Kernel** - GPU-accelerated root-mean-square normalization
3. **Softmax Kernel** - GPU-accelerated attention softmax
4. **T-MAC MatMul** - Ternary weight matrix multiplication (no separate LUT build step)

### Key Improvement: Eliminated LUT Rebuilding Overhead

**Before (CPU path):**
- Built 150 LUTs per token (5 per layer × 30 layers)
- Each LUT: ~36K-98K float operations
- Total: ~7 million float ops just for LUT building per token!

**After (CUDA path):**
- CUDA kernel builds LUT internally once per matmul
- GPU parallelism handles LUT + matmul in single operation
- No separate LUT build step
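
The "~7 million" figure for the CPU path can be reproduced with a short calculation. This rests on two assumptions that are not stated in the notebook: each LUT costs about 64 float operations per input element (which gives 576 × 64 = 36,864 and 1536 × 64 = 98,304, matching the quoted 36K-98K range), and four of the five LUTs per layer have input size `dim` while one has input size `hidden_dim`.

```python
dim = 576          # hidden dimension (from the notebook)
hidden_dim = 1536  # FFN intermediate dimension (from the notebook)
n_layers = 30

OPS_PER_INPUT = 64  # assumed cost per input element, inferred from the 36K/98K figures

# Assumed input sizes of the five LUTs built per layer.
lut_input_sizes = [dim, dim, dim, dim, hidden_dim]

ops_per_layer = sum(k * OPS_PER_INPUT for k in lut_input_sizes)
ops_per_token = ops_per_layer * n_layers

print(f"LUT ops per layer: {ops_per_layer:,}")
print(f"LUT ops per token: {ops_per_token:,}")
```

Under these assumptions the total comes to 7,372,800 operations per token, consistent with the rounded figure above.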

### Next Steps
1. Test with real SmolLM-135M model weights
2. Run full Mojo inference with CUDA backend
3. Compare long-running (100+ tokens) throughput vs Ollama