# CS336 Assignment 2 - nsys Profiling on Google Colab

This notebook sets up a Colab GPU runtime so you can profile your Transformer model with NVIDIA Nsight Systems (`nsys`).

## 1. Verify GPU Availability

In [None]:
# Check that we have a GPU
!nvidia-smi

## 2. Clone Repositories

Clone both `assignment1-basics` (a dependency) and `assignment2-systems` from your GitHub account.

In [None]:
# Clone your repositories
!git clone https://github.com/0Chris5R/StanfordCS336-Own-Transformer.git assignment1-basics
!git clone https://github.com/0Chris5R/StanfordCS336-assignment2-systems.git assignment2-systems

## 3. Install Dependencies

In [None]:
# Install required packages
!pip install einops einx jaxtyping numpy regex tqdm wandb pandas humanfriendly

In [None]:
# Install both packages in editable mode
!pip install -e assignment1-basics
!pip install -e assignment2-systems

In [None]:
# Verify installation
import cs336_basics
import cs336_systems
print("Imports successful!")

## 4. Install NVIDIA Nsight Systems

In [None]:
# Add NVIDIA devtools repository and install nsight-systems
!apt-get update -qq
!apt-get install -y -qq gnupg

# Add NVIDIA repo key and source list
!wget -qO- https://developer.download.nvidia.com/devtools/repos/ubuntu2204/amd64/nvidia.pub | apt-key add -
!echo "deb https://developer.download.nvidia.com/devtools/repos/ubuntu2204/amd64/ /" > /etc/apt/sources.list.d/nsight.list

# Update and install nsight-systems
!apt-get update -qq
!apt-get install -y -qq nsight-systems-cli

In [None]:
# Verify nsys installation
!nsys --version
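
The two shell checks so far (`nvidia-smi` and `nsys --version`) can also be done from Python, which makes it easy to stop early when a tool is missing. This is a minimal sketch using only the standard library; the helper name `missing_tools` is made up for illustration.

```python
import shutil


def missing_tools(tools=("nvidia-smi", "nsys")):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print(f"Missing from PATH: {', '.join(missing)}")
else:
    print("nvidia-smi and nsys are both available")
```

If `nvidia-smi` is reported missing, the runtime probably has no GPU attached; if only `nsys` is missing, rerun the installation cell above.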

## 5. Test Basic Benchmarking (without profiling)

In [None]:
# Quick test run to make sure everything works.
# Note: %cd changes the working directory for all later cells, so run this cell only once.
%cd assignment2-systems
!python -m cs336_systems.benchmarking --context-length 128 --d-model 512 --num-layers 6 --num-heads 8 --d-ff 2048 --warmup-steps 2 --n-steps 3

## 6. Profile with nsys

Now run the actual profiling. Each run writes a `.nsys-rep` file that you can download and open in the NVIDIA Nsight Systems desktop app.

In [None]:
# Model configurations from Table 1 (adjust as needed)
# Small: d_model=512, num_layers=6, num_heads=8, d_ff=2048
# Medium: d_model=1024, num_layers=24, num_heads=16, d_ff=4096
# Large: d_model=1280, num_layers=36, num_heads=20, d_ff=5120
# XL: d_model=1600, num_layers=48, num_heads=25, d_ff=6400

MODEL_SIZE = "small"  # Change this to run different configurations

configs = {
    "small": {"d_model": 512, "num_layers": 6, "num_heads": 8, "d_ff": 2048},
    "medium": {"d_model": 1024, "num_layers": 24, "num_heads": 16, "d_ff": 4096},
    "large": {"d_model": 1280, "num_layers": 36, "num_heads": 20, "d_ff": 5120},
    "xl": {"d_model": 1600, "num_layers": 48, "num_heads": 25, "d_ff": 6400},
}

cfg = configs[MODEL_SIZE]
print(f"Using {MODEL_SIZE} model config: {cfg}")
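
Before profiling, it can help to sanity-check the table above. The sketch below assumes the model uses standard multi-head attention, where `d_model` is split evenly across heads; the helper `head_dim` is hypothetical and not part of the assignment code.

```python
configs = {
    "small": {"d_model": 512, "num_layers": 6, "num_heads": 8, "d_ff": 2048},
    "medium": {"d_model": 1024, "num_layers": 24, "num_heads": 16, "d_ff": 4096},
    "large": {"d_model": 1280, "num_layers": 36, "num_heads": 20, "d_ff": 5120},
    "xl": {"d_model": 1600, "num_layers": 48, "num_heads": 25, "d_ff": 6400},
}


def head_dim(config):
    """Per-head dimension, assuming d_model is split evenly across heads."""
    d_model, num_heads = config["d_model"], config["num_heads"]
    if d_model % num_heads != 0:
        raise ValueError(f"d_model={d_model} is not divisible by num_heads={num_heads}")
    return d_model // num_heads


# Loop variables are named to avoid clobbering the notebook's `cfg`.
for size_name, size_cfg in configs.items():
    print(f"{size_name}: head_dim={head_dim(size_cfg)}")
```

A `ValueError` here points to a typo in the table rather than a problem in the model code.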

In [None]:
# Profile with context length 128
CONTEXT_LENGTH = 128

!nsys profile \
    -o profile_{MODEL_SIZE}_ctx{CONTEXT_LENGTH} \
    --stats=true \
    --force-overwrite=true \
    python -m cs336_systems.benchmarking \
        --context-length {CONTEXT_LENGTH} \
        --d-model {cfg["d_model"]} \
        --num-layers {cfg["num_layers"]} \
        --num-heads {cfg["num_heads"]} \
        --d-ff {cfg["d_ff"]} \
        --warmup-steps 3 \
        --n-steps 5

In [None]:
# Profile with context length 256
CONTEXT_LENGTH = 256

!nsys profile \
    -o profile_{MODEL_SIZE}_ctx{CONTEXT_LENGTH} \
    --stats=true \
    --force-overwrite=true \
    python -m cs336_systems.benchmarking \
        --context-length {CONTEXT_LENGTH} \
        --d-model {cfg["d_model"]} \
        --num-layers {cfg["num_layers"]} \
        --num-heads {cfg["num_heads"]} \
        --d-ff {cfg["d_ff"]} \
        --warmup-steps 3 \
        --n-steps 5

In [None]:
# Profile with context length 512
CONTEXT_LENGTH = 512

!nsys profile \
    -o profile_{MODEL_SIZE}_ctx{CONTEXT_LENGTH} \
    --stats=true \
    --force-overwrite=true \
    python -m cs336_systems.benchmarking \
        --context-length {CONTEXT_LENGTH} \
        --d-model {cfg["d_model"]} \
        --num-layers {cfg["num_layers"]} \
        --num-heads {cfg["num_heads"]} \
        --d-ff {cfg["d_ff"]} \
        --warmup-steps 3 \
        --n-steps 5

In [None]:
# Profile with context length 1024
CONTEXT_LENGTH = 1024

!nsys profile \
    -o profile_{MODEL_SIZE}_ctx{CONTEXT_LENGTH} \
    --stats=true \
    --force-overwrite=true \
    python -m cs336_systems.benchmarking \
        --context-length {CONTEXT_LENGTH} \
        --d-model {cfg["d_model"]} \
        --num-layers {cfg["num_layers"]} \
        --num-heads {cfg["num_heads"]} \
        --d-ff {cfg["d_ff"]} \
        --warmup-steps 3 \
        --n-steps 5
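
The four cells above differ only in `CONTEXT_LENGTH`, so the sweep can also be written as a loop. The sketch below builds the same `nsys profile` command as an argument list and, by default, only prints it (`dry_run=True`). The helper names `nsys_command` and `profile_sweep` are hypothetical; the flags are copied from the cells above.

```python
import shlex
import subprocess


def nsys_command(model_size, cfg, context_length, warmup_steps=3, n_steps=5):
    """Build the nsys invocation used in the cells above as an argument list."""
    return [
        "nsys", "profile",
        "-o", f"profile_{model_size}_ctx{context_length}",
        "--stats=true",
        "--force-overwrite=true",
        "python", "-m", "cs336_systems.benchmarking",
        "--context-length", str(context_length),
        "--d-model", str(cfg["d_model"]),
        "--num-layers", str(cfg["num_layers"]),
        "--num-heads", str(cfg["num_heads"]),
        "--d-ff", str(cfg["d_ff"]),
        "--warmup-steps", str(warmup_steps),
        "--n-steps", str(n_steps),
    ]


def profile_sweep(model_size, cfg, context_lengths=(128, 256, 512, 1024), dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for context_length in context_lengths:
        cmd = nsys_command(model_size, cfg, context_length)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


small_cfg = {"d_model": 512, "num_layers": 6, "num_heads": 8, "d_ff": 2048}
profile_sweep("small", small_cfg)  # dry run: prints four commands
```

Pass `dry_run=False` to actually run the profiles. With `check=True`, the sweep stops at the first failing run instead of silently continuing.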

## 7. Profile with NVTX Annotations and Python Backtraces

For more detailed analysis with NVTX ranges and Python backtraces:

In [None]:
# Profile with Python backtraces (more overhead but more detail)
CONTEXT_LENGTH = 128

!nsys profile \
    -o profile_{MODEL_SIZE}_ctx{CONTEXT_LENGTH}_detailed \
    --stats=true \
    --force-overwrite=true \
    --python-backtrace=cuda \
    python -m cs336_systems.benchmarking \
        --context-length {CONTEXT_LENGTH} \
        --d-model {cfg["d_model"]} \
        --num-layers {cfg["num_layers"]} \
        --num-heads {cfg["num_heads"]} \
        --d-ff {cfg["d_ff"]} \
        --warmup-steps 3 \
        --n-steps 5
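
NVTX ranges only show up in the timeline if the benchmarking code emits them. The sketch below shows one way to wrap a training step in named ranges using `torch.cuda.nvtx.range_push` and `range_pop`. It is not taken from `cs336_systems.benchmarking`: `nvtx_range` and `run_step` are hypothetical helpers, and the wrapper becomes a no-op when PyTorch or CUDA is unavailable.

```python
from contextlib import contextmanager

try:
    import torch
    _NVTX_ENABLED = torch.cuda.is_available()
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None
    _NVTX_ENABLED = False


@contextmanager
def nvtx_range(name):
    """Mark a named NVTX range around a block; does nothing without CUDA."""
    if _NVTX_ENABLED:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _NVTX_ENABLED:
            torch.cuda.nvtx.range_pop()


def run_step(forward, backward):
    """Run one step with the forward and backward passes in separate ranges."""
    with nvtx_range("forward"):
        out = forward()
    with nvtx_range("backward"):
        backward(out)
    return out


# Stand-in callables; in practice these would call the model and loss.backward().
result = run_step(lambda: 2.0, lambda out: None)
```

In the Nsight Systems timeline, the `forward` and `backward` ranges then appear as labeled spans that can be used to filter kernel statistics.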

## 8. View Stats Summary (in Colab)

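The `--stats=true` flag prints a summary at the end of each profiling run. You can also export an individual report, such as the per-kernel GPU summary, to CSV:

In [None]:
# Export the CUDA kernel summary to CSV (writes kernel_summary_cuda_gpu_kern_sum.csv)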
!nsys stats profile_small_ctx128.nsys-rep --report cuda_gpu_kern_sum --format csv --output kernel_summary

In [None]:
# View the kernel summary
import pandas as pd
try:
    df = pd.read_csv("kernel_summary_cuda_gpu_kern_sum.csv")
    print(df.head(20))
except FileNotFoundError:
    print("Run a profile first, then this cell")
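
To see which kernels dominate, the summary can be ranked by total GPU time. The sketch below uses only the standard `csv` module and assumes the exported file has `Total Time (ns)` and `Name` columns; column names can differ between `nsys` versions, so adjust `time_col` and `name_col` if you get a `KeyError`. The helper `top_kernels` is hypothetical.

```python
import csv


def top_kernels(csv_file, n=5, time_col="Total Time (ns)", name_col="Name"):
    """Return (kernel name, share of total time) for the n slowest kernels."""
    rows = list(csv.DictReader(csv_file))
    total = sum(float(row[time_col]) for row in rows)
    if total == 0:
        return []
    rows.sort(key=lambda row: float(row[time_col]), reverse=True)
    return [(row[name_col], float(row[time_col]) / total) for row in rows[:n]]


try:
    with open("kernel_summary_cuda_gpu_kern_sum.csv", newline="") as f:
        for name, share in top_kernels(f):
            print(f"{share:6.1%}  {name[:80]}")
except FileNotFoundError:
    print("Run a profile and the export cell first")
```

Kernel names from PyTorch can be very long, which is why the printout truncates them to 80 characters.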

## 9. Download Profile Files

Download the `.nsys-rep` files to view them in the NVIDIA Nsight Systems desktop application on your local machine.

In [None]:
# List all profile files
!ls -la *.nsys-rep 2>/dev/null || echo "No profile files found yet"

In [None]:
# Download profile files (run this cell and click the download link)
from google.colab import files

# Download a specific profile - change filename as needed
files.download('profile_small_ctx128.nsys-rep')

In [None]:
# Or zip all profiles and download
from google.colab import files  # imported again so this cell runs on its own

!zip -r all_profiles.zip *.nsys-rep
files.download('all_profiles.zip')

## 10. Benchmarking With Varying Warmup Steps (for question 1.1.3c)

In [None]:
# No warmup steps
!python -m cs336_systems.benchmarking \
    --context-length 128 \
    --d-model 512 \
    --num-layers 6 \
    --num-heads 8 \
    --d-ff 2048 \
    --warmup-steps 0 \
    --n-steps 10

In [None]:
# 1 warmup step
!python -m cs336_systems.benchmarking \
    --context-length 128 \
    --d-model 512 \
    --num-layers 6 \
    --num-heads 8 \
    --d-ff 2048 \
    --warmup-steps 1 \
    --n-steps 10

In [None]:
# 2 warmup steps
!python -m cs336_systems.benchmarking \
    --context-length 128 \
    --d-model 512 \
    --num-layers 6 \
    --num-heads 8 \
    --d-ff 2048 \
    --warmup-steps 2 \
    --n-steps 10

In [None]:
# Standard warmup (5 steps)
!python -m cs336_systems.benchmarking \
    --context-length 128 \
    --d-model 512 \
    --num-layers 6 \
    --num-heads 8 \
    --d-ff 2048 \
    --warmup-steps 5 \
    --n-steps 10
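
The warmup comparison above can also be expressed as a loop over warmup counts. This sketch rebuilds the same benchmarking command as an argument list and only prints it by default; `benchmark_command` and `warmup_sweep` are hypothetical helper names, and the flags mirror the cells above.

```python
import shlex
import subprocess


def benchmark_command(warmup_steps, n_steps=10):
    """Build the small-model benchmarking command for a given warmup count."""
    return [
        "python", "-m", "cs336_systems.benchmarking",
        "--context-length", "128",
        "--d-model", "512",
        "--num-layers", "6",
        "--num-heads", "8",
        "--d-ff", "2048",
        "--warmup-steps", str(warmup_steps),
        "--n-steps", str(n_steps),
    ]


def warmup_sweep(warmups=(0, 1, 2, 5), dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for warmup_steps in warmups:
        cmd = benchmark_command(warmup_steps)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


warmup_sweep()  # dry run: prints four commands
```

Running the sweep in one cell keeps the outputs next to each other, which makes the effect of warmup on the reported timings easier to compare.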