# Fused Softmax CUDA Kernel for GPT-2 Attention

This notebook implements a fused softmax CUDA kernel that combines scale + causal mask + softmax into a single pass, reducing memory traffic from ~4 passes to 1.
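
To make the "single pass" idea concrete, here is a minimal pure-Python sketch for one attention row. It folds the scale and the causal mask into the loop and uses the online-softmax recurrence (a running maximum plus a rescaled running sum), so each input element is read once. This is an illustration under assumptions, not the code in `softmax_kernel.cu`: the name `fused_row_softmax` is hypothetical, and the real kernel may instead hold the row in registers or shared memory and reduce across threads.

```python
import math

def fused_row_softmax(row, row_idx, scale, causal=True):
    """Scale + causal mask + softmax for one row, reading each input once.

    Hypothetical helper for illustration only; not the notebook's kernel.
    """
    # With a causal mask, query position row_idx may only attend to keys 0..row_idx.
    n_valid = row_idx + 1 if causal else len(row)

    running_max = -math.inf
    running_sum = 0.0
    scaled = []  # stands in for values a GPU thread would keep in registers
    for x in row[:n_valid]:
        s = x * scale
        scaled.append(s)
        new_max = max(running_max, s)
        # Rescale the partial sum whenever the running maximum grows.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(s - new_max)
        running_max = new_max

    out = [math.exp(s - running_max) / running_sum for s in scaled]
    # Masked (future) positions get probability exactly 0.
    out.extend([0.0] * (len(row) - n_valid))
    return out

probs = fused_row_softmax([1.0, 2.0, 3.0, 4.0], row_idx=1, scale=0.5)
print(probs)
```

Row 1 only sees the first two keys, so the last two probabilities are zero and the first two sum to 1.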

## Setup: Load Files from Google Drive


In [1]:
# Mount Google Drive and set up project files
from google.colab import drive
import os
import shutil

print("Mounting Google Drive...")
drive.mount('/content/drive')
print("✓ Google Drive mounted!")


DRIVE_PROJECT_PATH = "/content/drive/MyDrive/GPT2"

if os.path.exists(DRIVE_PROJECT_PATH):
    print(f"✓ Found project at: {DRIVE_PROJECT_PATH}")

    content_source = os.path.join(DRIVE_PROJECT_PATH, "content")
    content_dest = "content"

    if os.path.exists(content_source):
        # Remove existing content folder if it exists
        if os.path.exists(content_dest):
            shutil.rmtree(content_dest)

        print(f"Copying content folder from Drive to workspace...")
        shutil.copytree(content_source, content_dest)
        print(f"✓ Content folder copied to: {content_dest}")

        cuda_dir = os.path.join(content_dest, "cuda")
        python_dir = os.path.join(content_dest, "python")

        if os.path.exists(cuda_dir) and os.path.exists(python_dir):
            print(f"✓ Project structure verified!")
            print(f"  - CUDA files: {len(os.listdir(cuda_dir))} files")
            print(f"  - Python files: {len(os.listdir(python_dir))} files")
        else:
            print("⚠ Warning: content/cuda or content/python not found")
    else:
        print(f"⚠ Warning: content folder not found at {content_source}")
        print("Please make sure you uploaded the 'content' folder to Google Drive (GPT2/content)")
else:
    print(f"⚠ Warning: Project path not found: {DRIVE_PROJECT_PATH}")


Mounting Google Drive...
Mounted at /content/drive
✓ Google Drive mounted!
✓ Found project at: /content/drive/MyDrive/GPT2
Copying content folder from Drive to workspace...
✓ Content folder copied to: content
✓ Project structure verified!
  - CUDA files: 3 files
  - Python files: 6 files


## Setup



In [2]:
# Check GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")


PyTorch version: 2.9.0+cu126
CUDA available: True
CUDA version: 12.6
GPU: NVIDIA L4
GPU Memory: 23.80 GB


In [3]:
# Install dependencies
!pip install transformers accelerate




## Quick Test: GPT-2 Inference

Let's first test basic GPT-2 inference to make sure everything works before running benchmarks.


In [4]:
# Simple GPT-2 Inference Test
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load model and tokenizer
print("Loading GPT-2 model (distilgpt2)...")
model_name = "distilgpt2"  # Small and fast for testing
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Move to GPU if available
if torch.cuda.is_available():
    model = model.cuda()
    print(f"✓ Model loaded on GPU: {torch.cuda.get_device_name(0)}")
else:
    print("⚠ Warning: CUDA not available, using CPU (will be slow)")

# Put the model in inference mode (disables dropout) on either device
model = model.eval()

# Set pad token to avoid warnings
# For distilgpt2, use eos_token_id as pad_token_id in generation
tokenizer.pad_token = tokenizer.eos_token

print("✓ Model and tokenizer ready!")


Loading GPT-2 model (distilgpt2)...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✓ Model loaded on GPU: NVIDIA L4
✓ Model and tokenizer ready!


In [5]:
# Test GPT-2 Text Generation
# INPUT: Change this prompt to whatever you want
prompt = "The future of artificial intelligence"

print(f"Input prompt: '{prompt}'")
print("\nGenerating text...")

# Tokenize with attention mask (this avoids the warning)
encoded = tokenizer(prompt, return_tensors="pt", padding=False, truncation=False)
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]

# Move to GPU if available
if torch.cuda.is_available():
    input_ids = input_ids.cuda()
    attention_mask = attention_mask.cuda()

print(f"Input tokens: {input_ids.shape[1]}")

# Generate text with proper attention mask
with torch.no_grad():
    outputs = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=50,  # Number of new tokens to generate (not total length)
        num_return_sequences=1,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,  # Use eos as pad (distilgpt2 doesn't have pad)
        eos_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=2,
    )

# Decode and print
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"\n{'='*60}")
print("OUTPUT:")
print(f"{'='*60}")
print(generated_text)
print(f"{'='*60}")

# Show token counts
input_len = input_ids.shape[1]
generated_len = outputs[0].shape[0] - input_len
print(f"\nInput tokens: {input_len}")
print(f"Generated tokens: {generated_len}")
print(f"Total tokens: {outputs[0].shape[0]}")


Input prompt: 'The future of artificial intelligence'

Generating text...
Input tokens: 5

OUTPUT:
The future of artificial intelligence is not yet clear.

The next step is to develop a new approach to artificial Intelligence. The first step will be to create a system that can be used to predict the future. This is a very exciting step. It is also a step

Input tokens: 5
Generated tokens: 50
Total tokens: 55


### Expected Output

When you run the cells above, you should see:

1. **Model Loading:**
   - "Loading GPT-2 model (distilgpt2)..."
   - "✓ Model loaded on GPU: [GPU Name]"
   - "✓ Model and tokenizer ready!"

2. **Text Generation:**
   - Your input prompt
   - Generated text continuation
   - Token counts

**Example** (greedy decoding with `do_sample=False`, so repeated runs should give the same text):
- Input: "The future of artificial intelligence"
- Output: "The future of artificial intelligence is not yet clear. The next step is to develop a new approach to artificial Intelligence. ..."

### Try Different Prompts

Change the `prompt` variable in the cell above to test different inputs:
- `"In a world where technology"`
- `"The quick brown fox"`
- `"Once upon a time"`
- Or any text you want!


## Phase 1: Test Baseline Softmax

Now let's test the unfused PyTorch baseline to make sure it works correctly. This is what we'll compare our optimized kernel against.
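
For reference, the three unfused steps can be written out in plain Python for a single small matrix. This is a sketch of the math only: `baseline_softmax.py` is not shown in this notebook, so the function below (`unfused_softmax`, a hypothetical name) is an assumption about what the baseline computes. Each step materializes a full intermediate matrix, the way separate PyTorch ops would.

```python
import math

def unfused_softmax(scores, scale, causal_mask=True):
    """Scale -> causal mask -> softmax as three separate passes (illustrative)."""
    # Pass 1: scale every score.
    scaled = [[x * scale for x in row] for row in scores]

    # Pass 2: mask future positions (column j > row i) with -inf.
    if causal_mask:
        masked = [[v if j <= i else -math.inf for j, v in enumerate(row)]
                  for i, row in enumerate(scaled)]
    else:
        masked = scaled

    # Pass 3: numerically stable softmax over each row.
    out = []
    for row in masked:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
probs = unfused_softmax(scores, scale=1.0 / math.sqrt(3))
for row in probs:
    print([round(p, 4) for p in row])
```

The first row attends only to itself, so it comes out as `[1.0, 0.0, 0.0]`; every row sums to 1, which is the same property the test cell checks on the GPU output.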


---


In [6]:
# Test baseline softmax (unfused PyTorch implementation)
import sys
sys.path.append('content/python')

from baseline_softmax import baseline_softmax_forward
import torch

print("Testing baseline softmax (unfused PyTorch)...")
print("This implements: scale → causal mask → softmax (3 separate operations)")

# Test correctness
batch_size, seq_len = 1, 64
input_tensor = torch.randn(batch_size, seq_len, seq_len, device='cuda')
scale = 1.0 / (seq_len ** 0.5)

print(f"\nInput shape: {input_tensor.shape}")
print(f"Scale factor: {scale:.6f} (1/sqrt({seq_len}))")

output = baseline_softmax_forward(input_tensor, scale, causal_mask=True)

print(f"\n✓ Baseline softmax completed!")
print(f"Output shape: {output.shape}")
print(f"Output sum per row (should be ~1.0): {output.sum(dim=-1).mean():.6f}")

# Verify softmax properties
row_sums = output.sum(dim=-1)
print(f"Row sum range: [{row_sums.min():.6f}, {row_sums.max():.6f}]")
if torch.allclose(row_sums, torch.ones_like(row_sums), atol=1e-5):
    print("✓ Softmax property verified: all rows sum to ~1.0")
else:
    print("⚠ Warning: Some rows don't sum to 1.0")


Testing baseline softmax (unfused PyTorch)...
This implements: scale → causal mask → softmax (3 separate operations)

Input shape: torch.Size([1, 64, 64])
Scale factor: 0.125000 (1/sqrt(64))

✓ Baseline softmax completed!
Output shape: torch.Size([1, 64, 64])
Output sum per row (should be ~1.0): 1.000000
Row sum range: [1.000000, 1.000000]
✓ Softmax property verified: all rows sum to ~1.0


## Phase 2: CUDA Extension Setup


In [26]:
# Quick test: Verify CUDA compilation environment works
# Run this BEFORE the main compilation cell to check if CUDA compilation is possible

import torch
import subprocess
import os

print("Checking CUDA compilation environment...")
print("=" * 60)

# Check PyTorch CUDA
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"cuDNN version: {torch.backends.cudnn.version() if torch.backends.cudnn.is_available() else 'N/A'}")

# Check nvcc
try:
    result = subprocess.run(['nvcc', '--version'],
                          capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        print(f"\n✓ nvcc found:")
        print(result.stdout.split('\n')[0])
    else:
        print("\n⚠ nvcc returned non-zero exit code")
except FileNotFoundError:
    print("\n✗ nvcc not found in PATH")
except Exception as e:
    print(f"\n⚠ Error checking nvcc: {e}")

# Check if we can compile a simple test
print("\n" + "=" * 60)
print("Testing simple CUDA compilation...")
print("=" * 60)

try:
    from torch.utils.cpp_extension import load_inline

    # Try compiling a minimal CUDA kernel
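    cuda_source = '''
    #include <cuda_runtime.h>
    __global__ void test_kernel(float* data, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            data[idx] = data[idx] * 2.0f;
        }
    }

    // Host-side launcher. cpp_source declares this function, so it must be
    // defined here; otherwise the module fails to load with an undefined symbol.
    void launch_test_kernel(float* data, int n) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        test_kernel<<<blocks, threads>>>(data, n);
    }
    '''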

    cpp_source = '''
    #include <torch/extension.h>
    #include <cuda_runtime.h>

    void launch_test_kernel(float* data, int n);

    torch::Tensor test_func(torch::Tensor input) {
        auto output = input.clone();
        launch_test_kernel(output.data_ptr<float>(), output.numel());
        return output;
    }

    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
        m.def("test", &test_func, "Test function");
    }
    '''

    # This will fail if CUDA compilation doesn't work at all
    test_ext = load_inline(
        name='test_cuda_compile',
        cpp_sources=[cpp_source],
        cuda_sources=[cuda_source],
        verbose=False
    )
    print("✓ Simple CUDA compilation test PASSED")
    print("  Your CUDA environment is working correctly!")

except Exception as e:
    print(f"✗ Simple CUDA compilation test FAILED")
    print(f"  Error: {e}")
    print(f"\n  This suggests there's a problem with your CUDA setup.")
    print(f"  The main compilation will likely fail for the same reason.")
    print(f"  Check CUDA installation and PyTorch CUDA compatibility.")

print("=" * 60)


Checking CUDA compilation environment...
PyTorch version: 2.9.0+cu126
CUDA available: True
CUDA version: 12.6
cuDNN version: 91002

✓ nvcc found:
nvcc: NVIDIA (R) Cuda compiler driver

Testing simple CUDA compilation...
✗ Simple CUDA compilation test FAILED
  Error: /root/.cache/torch_extensions/py312_cu126/test_cuda_compile/test_cuda_compile.so: undefined symbol: _Z18launch_test_kernelPfi

  This suggests there's a problem with your CUDA setup.
  The main compilation will likely fail for the same reason.
  Check CUDA installation and PyTorch CUDA compatibility.

Note: the failure recorded above is a link error rather than an environment problem. The symbol `_Z18launch_test_kernelPfi` demangles to `launch_test_kernel(float*, int)`. The run that produced this output declared that launcher in `cpp_source`, but its `cuda_source` defined only the `__global__` kernel, so the shared library was built with the symbol unresolved. `nvcc` itself is present, and the `fused_softmax` extension compiles successfully further down.


In [34]:

import shutil, os
torch_ext_dir = "/root/.cache/torch_extensions"
if os.path.exists(torch_ext_dir):
    shutil.rmtree(torch_ext_dir)
    print("Cleared torch_extensions cache")

!pip install --quiet ninja

import importlib.util

# Quick sanity check that ninja is available
spec = importlib.util.find_spec("ninja")
if spec is not None:
    print("✓ ninja is installed and available")
else:
    print("⚠ ninja still not found; try re-running this cell or restarting runtime")


Cleared torch_extensions cache
✓ ninja is installed and available


In [37]:
# Compile CUDA extension - SIMPLIFIED with error extraction
import os
import shutil
import torch
import subprocess
import sys
import glob
from torch.utils.cpp_extension import load

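# Debug aid: if a previous build exists, re-run ninja verbosely to surface its output.
# Guarded because the cache may be empty (for example after the cache-clearing cell).
build_dirs = glob.glob("/root/.cache/torch_extensions/py*/fused_softmax*")
if build_dirs:
    build_dir = build_dirs[0]
    print("Build dir:", build_dir)

    result = subprocess.run(
        ["ninja", "-v", "-C", build_dir],
        capture_output=True,
        text=True,
    )
    print("Return code:", result.returncode)
    print("\nSTDOUT:\n", result.stdout)
    print("\nSTDERR:\n", result.stderr)
else:
    print("No previous fused_softmax build directory found; skipping ninja check")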

torch_ext_dir = "/root/.cache/torch_extensions"
if os.path.exists(torch_ext_dir):
    shutil.rmtree(torch_ext_dir)
    print("✓ Cleared previous build cache")

cuda_dir = 'content/cuda'
cpp_file = os.path.join(cuda_dir, 'softmax_binding.cpp')
cu_file = os.path.join(cuda_dir, 'softmax_kernel.cu')

print("Starting compilation...")
print("=" * 60)

try:
    fused_softmax = load(
        name='fused_softmax',
        sources=[cpp_file, cu_file],
        extra_cuda_cflags=['-O3', '--use_fast_math', f'-I{cuda_dir}'],
        verbose=True,
    )
    print("=" * 60)
    print("✓ CUDA extension compiled successfully!")

except Exception as e:
    print("=" * 60)
    print(f"\n❌ Compilation failed!")
    print(f"   Error type: {type(e).__name__}")
    print(f"   Error message: {e}")

    # Extract actual build errors from ninja
    print("\n" + "=" * 60)
    print("Extracting actual compilation errors from build directory...")
    print("=" * 60)

    build_dirs = glob.glob("/root/.cache/torch_extensions/py*/fused_softmax*")
    if build_dirs:
        build_dir = build_dirs[0]
        print(f"Build directory: {build_dir}")

        # Run ninja from the build directory itself (not a subdirectory)
        print(f"\n--- Running 'ninja -v -C {build_dir}' to see errors ---")
        try:
            result = subprocess.run(
                ['ninja', '-v', '-C', build_dir],
                capture_output=True,
                text=True,
                timeout=30
            )
            print(f"Return code: {result.returncode}")
            if result.stdout:
                print("\n" + "="*70)
                print("STDOUT (THIS CONTAINS THE ACTUAL ERRORS):")
                print("="*70)
                print(result.stdout)
            if result.stderr:
                print("\n" + "="*70)
                print("STDERR:")
                print("="*70)
                print(result.stderr)
        except Exception as ninja_err:
            print(f"Could not run ninja: {ninja_err}")
            import traceback
            traceback.print_exc()

        # Also check the .ninja_log file
        ninja_log = os.path.join(build_dir, ".ninja_log")
        if os.path.exists(ninja_log):
            print(f"\n--- Checking .ninja_log file ---")
            try:
                with open(ninja_log, 'r') as f:
                    lines = f.readlines()
                    # Show last 50 lines
                    print("Last 50 lines of ninja log:")
                    for line in lines[-50:]:
                        print(line.rstrip())
            except Exception as log_err:
                print(f"Could not read log: {log_err}")

    print("\n" + "=" * 60)
    print("IMPORTANT: Look at the STDOUT above for the actual compilation errors.")
    print("Common issues:")
    print("  - 'undefined reference' = linking problem")
    print("  - 'error:' = compilation error")
    print("  - Missing symbols = function not found")
    print("=" * 60)
    raise


Build dir: /root/.cache/torch_extensions/py312_cu126/fused_softmax
Return code: 0

STDOUT:
 ninja: Entering directory `/root/.cache/torch_extensions/py312_cu126/fused_softmax'
[1/2] c++ -MMD -MF softmax_binding.o.d -DTORCH_EXTENSION_NAME=fused_softmax_v6 -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /usr/include/python3.12 -fPIC -std=c++17 -c /content/content/cuda/softmax_binding.cpp -o softmax_binding.o 
[2/2] c++ softmax_binding.o softmax_kernel.cuda.o -shared -L/usr/local/lib/python3.12/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_softmax_v6.so


STDERR:
 
✓ Cleared previous build cache
Starting compilation...
✓ CUDA extension compiled successfully!


## Phase 3: CUDA Kernel Development & Testing


In [40]:
# Patch binding to allow keyword arg: block_dim=...
import os, re, shutil, torch
from torch.utils.cpp_extension import load

# --------- Locate your cuda folder ----------
candidate_dirs = [
    "content/cuda",
    "/content/content/cuda",
    "/content/cuda",
    "cuda",
    "/mnt/data",
]

cuda_dir = None
for d in candidate_dirs:
    if os.path.exists(os.path.join(d, "softmax_binding.cpp")):
        cuda_dir = d
        break

if cuda_dir is None:
    raise FileNotFoundError(
        "Could not find softmax_binding.cpp in any of these dirs:\n" + "\n".join(candidate_dirs)
    )

cpp_file = os.path.join(cuda_dir, "softmax_binding.cpp")
cu_file  = os.path.join(cuda_dir, "softmax_kernel.cu")

print("Using cuda_dir:", cuda_dir)
print("cpp_file:", cpp_file)
print("cu_file :", cu_file)

# --------- Patch softmax_binding.cpp ----------
with open(cpp_file, "r", encoding="utf-8") as f:
    src = f.read()

# Ensure pybind11 header + namespace alias exist
if "pybind11/pybind11.h" not in src:
    # insert after first include line
    src = re.sub(r'(#include\s+<[^>]+>\s*\n)', r'\1#include <pybind11/pybind11.h>\n', src, count=1)

if "namespace py = pybind11;" not in src:
    # insert after includes block (simple heuristic)
    lines = src.splitlines(True)
    insert_at = 0
    for i, line in enumerate(lines):
        if line.startswith("#include"):
            insert_at = i + 1
    lines.insert(insert_at, "\nnamespace py = pybind11;\n\n")
    src = "".join(lines)

# Replace the module binding to add named args + default for block_dim
pattern = r'PYBIND11_MODULE\s*\(\s*TORCH_EXTENSION_NAME\s*,\s*m\s*\)\s*\{[^}]*m\.def\s*\(\s*"forward"\s*,\s*&fused_softmax_forward\s*,\s*"[^"]*"\s*\)\s*;[^}]*\}'
replacement = r'''PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def(
        "forward",
        &fused_softmax_forward,
        "Fused softmax forward (CUDA)",
        py::arg("input"),
        py::arg("scale"),
        py::arg("causal_mask"),
        py::arg("block_dim") = 256
    );
}'''

new_src, n = re.subn(pattern, replacement, src, flags=re.DOTALL)
if n == 0:
    # fallback: replace just the single m.def line if the full-block regex didn't match
    new_src, n2 = re.subn(
        r'm\.def\s*\(\s*"forward"\s*,\s*&fused_softmax_forward\s*,\s*"[^"]*"\s*\)\s*;',
        'm.def("forward", &fused_softmax_forward, "Fused softmax forward (CUDA)",\n'
        '      py::arg("input"), py::arg("scale"), py::arg("causal_mask"), py::arg("block_dim") = 256);\n',
        src
    )
    if n2 == 0:
        raise RuntimeError("Could not find/patch the PYBIND11_MODULE binding in softmax_binding.cpp")
    src = new_src
else:
    src = new_src

with open(cpp_file, "w", encoding="utf-8") as f:
    f.write(src)
print("✓ Patched softmax_binding.cpp to support block_dim keyword arg")

# --------- Clear extension cache (recommended) ----------
torch_ext_dir = os.path.expanduser("~/.cache/torch_extensions")
if os.path.exists(torch_ext_dir):
    shutil.rmtree(torch_ext_dir)
    print("✓ Cleared torch extension cache")

# --------- Recompile ----------
fused_softmax = load(
    name="fused_softmax_kw",  # new name to avoid any stale artifacts
    sources=[cpp_file, cu_file],
    extra_cuda_cflags=["-O3", "--use_fast_math", f"-I{cuda_dir}"],
    verbose=True,
)

print("✓ Compiled! You can now call:")
print('   fused_softmax.forward(x, scale, True, block_dim=256)')


Using cuda_dir: content/cuda
cpp_file: content/cuda/softmax_binding.cpp
cu_file : content/cuda/softmax_kernel.cu
✓ Patched softmax_binding.cpp to support block_dim keyword arg
✓ Cleared torch extension cache
✓ Compiled! You can now call:
   fused_softmax.forward(x, scale, True, block_dim=256)


In [41]:
# Test correctness: compare fused vs baseline
from baseline_softmax import baseline_softmax_forward

batch_size, seq_len = 2, 128
input_tensor = torch.randn(batch_size, seq_len, seq_len, device='cuda')
scale = 1.0 / (seq_len ** 0.5)

# Baseline
baseline_output = baseline_softmax_forward(input_tensor, scale, causal_mask=True)

# Fused
fused_output = fused_softmax.forward(input_tensor, scale, True, block_dim=256)

# Compare
diff = torch.abs(baseline_output - fused_output)
max_diff = diff.max().item()
mean_diff = diff.mean().item()

print(f"Max difference: {max_diff:.6f}")
print(f"Mean difference: {mean_diff:.6f}")
print(f"Relative error: {max_diff / baseline_output.abs().max().item():.6f}")

if max_diff < 1e-4:
    print("✓ Correctness test passed!")
else:
    print("✗ Correctness test failed - differences too large")


Max difference: 0.000000
Mean difference: 0.000000
Relative error: 0.000000
✓ Correctness test passed!


## Phase 4: Microbenchmark


In [51]:
import importlib, pathlib

import generate_results_report  # this must import successfully
path = pathlib.Path(generate_results_report.__file__)
txt = path.read_text(encoding="utf-8")

start = txt.find("def get_gpu_properties():")
if start == -1:
    raise RuntimeError("Couldn't find def get_gpu_properties() in generate_results_report.py")

end = txt.find("\n\ndef ", start)
if end == -1:
    end = len(txt)

new_func = '''def get_gpu_properties():
    """Get GPU properties for occupancy calculations (robust across torch builds)."""
    if not torch.cuda.is_available():
        return None

    props = torch.cuda.get_device_properties(0)

    def g(name, default=None):
        return getattr(props, name, default)

    # Safe fallbacks (these are only for reporting; kernel correctness/speed is unaffected)
    return {
        "name": g("name", "Unknown GPU"),
        "multiprocessor_count": g("multi_processor_count", g("multiprocessor_count", 0)),
        "max_threads_per_multiprocessor": g("max_threads_per_multiprocessor",
                                           g("max_threads_per_multi_processor", 2048)),
        "warp_size": g("warp_size", 32),
        "max_threads_per_block": g("max_threads_per_block", 1024),
        "shared_memory_per_block": g("shared_memory_per_block", 0),
        "shared_memory_per_multiprocessor": g("shared_memory_per_multiprocessor", 0),
    }
'''

txt2 = txt[:start] + new_func + txt[end:]
path.write_text(txt2, encoding="utf-8")
print("✓ Patched:", path)


✓ Patched: /content/content/python/generate_results_report.py


In [52]:
import importlib, generate_results_report
importlib.reload(generate_results_report)
print("✓ Reloaded generate_results_report")


✓ Reloaded generate_results_report


In [53]:
# Run microbenchmark
from bench_micro import run_microbenchmark

results = run_microbenchmark()


Microbenchmark: Fused Softmax Kernel
Configuration:
  SEQ_LENGTHS: [512, 1024]
  BATCH_SIZES: [1]
  MASKS: ['none', 'causal']
  WARMUP: 5, RUNS: 100
  BLOCK_DIM: 256

Testing: batch=1, seq_len=512, mask=none
  Running baseline...
  Running fused kernel...
  Speedup: 1.27x

Testing: batch=1, seq_len=512, mask=causal
  Running baseline...
  Running fused kernel...
  Speedup: 4.78x

Testing: batch=1, seq_len=1024, mask=none
  Running baseline...
  Running fused kernel...
  Speedup: 1.24x

Testing: batch=1, seq_len=1024, mask=causal
  Running baseline...
  Running fused kernel...
  Speedup: 4.65x

Results Summary
Batch  SeqLen   Mask     Type       p50(ms)    GB/s       Bytes           Speedup   
--------------------------------------------------------------------------------
1      512      False    Baseline   0.037      167.30     7342080         -         
                         Fused      0.030      62.80      2097152         1.27      x

1      512      True     Baseline   0.141    

## Phase 5: GPT-2 Integration & Benchmark


In [54]:
# Run GPT-2 benchmark
from bench_gpt2 import run_gpt2_benchmark

gpt2_results = run_gpt2_benchmark()


GPT-2 Text Generation Benchmark
Loading distilgpt2...

Testing prompt: 'The future of artificial intelligence'
  Running baseline...
  Running fused kernel...
  Speedup: 1.01x
  Baseline: 179.92 tokens/sec
  Fused: 178.70 tokens/sec

Testing prompt: 'In a world where technology'
  Running baseline...
  Running fused kernel...
  Speedup: 1.00x
  Baseline: 182.31 tokens/sec
  Fused: 181.56 tokens/sec

Testing prompt: 'The quick brown fox'
  Running baseline...
  Running fused kernel...
  Speedup: 0.98x
  Baseline: 179.10 tokens/sec
  Fused: 182.10 tokens/sec

Results Summary
Type       Prompt                         Tokens/sec      Latency/Token (ms)  
--------------------------------------------------------------------------------
Baseline   The future of artificial int   179.92          5.558               
Fused      The future of artificial int   178.70          5.596               

Baseline   In a world where technology    182.31          5.485               
Fused      In a world 

## Phase 6: Generate Comprehensive Performance Report

This section generates a comprehensive report with three key sections:
1. **Memory Hierarchy**: Explicit reduction of global memory traffic
2. **Parallel Processors**: Warp-level reductions, occupancy and latency hiding
3. **Performance Evaluations**: Complete benchmark results and analysis
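
As a rough illustration of the occupancy part of the report, the sketch below estimates thread-limited occupancy for a given block size. It is not the calculation in `generate_results_report.py`, which is not shown here; `theoretical_occupancy` is a hypothetical helper. The default of 2048 threads per multiprocessor is the fallback value used in the patched `get_gpu_properties()` above, and 1536 is included only as an example of a lower per-SM limit. The estimate ignores register, shared-memory, and blocks-per-SM limits, so real occupancy can be lower.

```python
def theoretical_occupancy(block_dim, max_threads_per_sm=2048, warp_size=32):
    """Thread-limited occupancy estimate (ignores register/shared-memory limits)."""
    warps_per_block = -(-block_dim // warp_size)        # ceiling division
    max_warps_per_sm = max_threads_per_sm // warp_size
    blocks_per_sm = max_warps_per_sm // warps_per_block  # whole blocks that fit
    active_warps = blocks_per_sm * warps_per_block
    return blocks_per_sm, active_warps / max_warps_per_sm

for threads_per_sm in (2048, 1536):
    for block_dim in (128, 256, 1024):
        blocks, occ = theoretical_occupancy(block_dim, threads_per_sm)
        print(f"max_threads/SM={threads_per_sm} block_dim={block_dim}: "
              f"{blocks} blocks/SM, occupancy {occ:.0%}")
```

With `block_dim=256`, as used throughout this notebook, whole blocks tile both example limits exactly. A 1024-thread block would leave a third of a 1536-thread multiprocessor idle, which is one reason moderate block sizes are a common default.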


In [55]:
# Generate comprehensive performance report
import sys
sys.path.append('content/python')

from generate_results_report import generate_comprehensive_report, save_report_to_file

# Note: 'results' should be from microbenchmark cell above
#       'gpt2_results' should be from GPT-2 benchmark cell above

# Generate and display the comprehensive report
print("\n" + "=" * 100)
print("GENERATING COMPREHENSIVE PERFORMANCE REPORT")
print("=" * 100)
print()

# Generate report (will use results from previous cells if available)
try:
    # Try to use results from previous cells
    # If variables don't exist, the report will still generate with available data
    micro_results = results if 'results' in globals() else []
    gpt2_results_data = gpt2_results if 'gpt2_results' in globals() else None

    report = generate_comprehensive_report(
        micro_results=micro_results,
        gpt2_results=gpt2_results_data,
        block_dim=256
    )

    # Print the report
    print(report)

    # Save to file
    save_report_to_file(report, "comprehensive_performance_report.txt")

    print("\n✓ Comprehensive report generated and saved!")
    print("  The report includes:")
    print("    • Memory Hierarchy: Global memory traffic reduction analysis")
    print("    • Parallel Processors: Warp-level reductions, occupancy, latency hiding")
    print("    • Performance Evaluations: Complete benchmark results")

except NameError as e:
    print(f"⚠ Warning: Some benchmark results not available: {e}")
    print("Please run the microbenchmark and GPT-2 benchmark cells first.")
except Exception as e:
    print(f"Error generating report: {e}")
    import traceback
    traceback.print_exc()



GENERATING COMPREHENSIVE PERFORMANCE REPORT


COMPREHENSIVE PERFORMANCE ANALYSIS REPORT
Fused Softmax CUDA Kernel for GPT-2 Attention

Memory Hierarchy: Explicit Reduction of Global Memory Traffic

This table shows how the fused kernel reduces global memory traffic by combining
multiple operations (scale, mask, softmax) into a single pass.

Batch    SeqLen     Operation                 Memory Passes      Bytes Moved          Reduction      
----------------------------------------------------------------------------------------------------
1        512        Unfused (4 passes)        4 (read/write)     7,342,080            -              
                      Pass 1: Scale           1                  2,097,152                           
                      Pass 2: Mask            1                  2,097,152                           
                      Pass 3: Max reduce      1                  1,050,624                           
                      Pass 4: Softmax        

## Results Summary

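The benchmarks above show:
- **Microbenchmark**: Raw kernel performance (p50 latency, bandwidth, bytes moved). On the NVIDIA L4 at batch size 1 and sequence lengths 512 and 1024, the fused kernel is 1.24x to 1.27x faster than the PyTorch baseline without a mask and 4.65x to 4.78x faster with the causal mask.
- **GPT-2 Demo**: End-to-end text generation performance (tokens/sec). Throughput is effectively unchanged: both paths run at roughly 179 to 182 tokens/sec on distilgpt2, and the 1 to 2% differences are small enough to be run-to-run noise, plausibly because the attention softmax is only a small part of per-token generation time. The printed `Speedup` values equal baseline tokens/sec divided by fused tokens/sec (for example, 179.10 / 182.10 ≈ 0.98 for the prompt where the fused path is faster), so in that output a value below 1.0 means the fused path was faster.

Key metrics:
- p50 latency improvement
- Memory bandwidth (GB/s)
- Bytes moved reduction (4 passes → 1 pass): 7,342,080 → 2,097,152 bytes at sequence length 512, about 3.5x less global memory traffic
- Tokens/sec (no measurable end-to-end change in these runs)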
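
The bytes-moved figures can be reproduced with a small traffic model. This is a sketch under assumptions rather than the report generator's actual code: the function names are hypothetical, the tensor is taken to be float32, and the softmax pass is assumed to read and write the full matrix (that row is cut off in the report output above). With those assumptions the total matches the reported unfused figure for sequence length 512.

```python
def matrix_bytes(batch, seq_len, dtype_bytes=4):
    """Bytes in one (batch, seq_len, seq_len) attention matrix (float32 by default)."""
    return batch * seq_len * seq_len * dtype_bytes

def traffic_model(batch, seq_len, dtype_bytes=4):
    """Global-memory bytes moved by the unfused passes versus one fused pass."""
    full = matrix_bytes(batch, seq_len, dtype_bytes)
    row_vector = batch * seq_len * dtype_bytes  # one value per row
    unfused = {
        "scale": 2 * full,                # read + write
        "mask": 2 * full,                 # read + write
        "max_reduce": full + row_vector,  # read matrix, write per-row maxima
        "softmax": 2 * full,              # ASSUMPTION: read + write
    }
    fused = 2 * full                      # read input once, write output once
    return unfused, fused

unfused, fused = traffic_model(batch=1, seq_len=512)
total = sum(unfused.values())
print(f"unfused: {total:,} bytes")
print(f"fused:   {fused:,} bytes")
print(f"reduction: {total / fused:.2f}x")
```

Under this model the unfused path moves 7,342,080 bytes and the fused path 2,097,152 bytes, a reduction of about 3.5x. That is lower than the nominal "4 passes → 1" because the max-reduction pass writes only one value per row.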
