A comprehensive framework for evaluating Large Language Model (LLM) runtime performance and investigating optimization strategies for specific hardware platforms.
This project provides tools and methodologies for:
- Benchmarking LLM inference performance across different hardware configurations
- Identifying performance bottlenecks in LLM execution through multi-stage profiling
- Developing and validating hardware-specific optimization techniques
- Providing actionable insights for deploying LLMs efficiently
Key Features:
- Multi-stage profiling: PyTorch Profiler (Stage 1) for operator analysis, Nsight Systems/Compute (Stage 2) for kernel-level profiling
- Hydra-based configuration: Flexible, composable configs for models, datasets, hardware, and profiling workflows
- Hardware-specific environments: Separate Pixi environments for different GPU architectures (sm_90, sm_120)
- Comprehensive reporting: Automated generation of profiling reports, roofline analysis, and performance visualizations
- Reproducibility: All profiling runs capture configuration, environment, and input data for full reproducibility
System Requirements:
- NVIDIA GPU (CUDA 12.0+)
- Linux (tested on Ubuntu 20.04, 22.04, 24.04)
- Python 3.11 or 3.12
- Pixi package manager
Install Pixi (if not already installed):

```bash
curl -fsSL https://pixi.sh/install.sh | bash
```

Clone the repository:

```bash
git clone https://github.com/CodeGandee/llm-perf-opt.git
cd llm-perf-opt
```

Install dependencies with Pixi:

```bash
pixi install  # Installs the default environment (CUDA 12.4, sm_90)
```

Bootstrap assets (models and datasets):

```bash
./bootstrap.sh --yes
```

This will:
- Create symlinks to models from `$HF_SNAPSHOTS_ROOT` (default: `~/.cache/huggingface/hub`)
- Create symlinks to datasets from `$DATASETS_ROOT` (default: `~/datasets`)
- Extract dataset archives if needed
You can also bootstrap individually:
- Models only: `models/bootstrap.sh --yes`
- Datasets only: `datasets/omnidocbench/bootstrap.sh --yes`
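Under the hood, bootstrapping links external snapshot directories into the repository rather than copying them. A minimal Python sketch of the model-symlink step, assuming the `$HF_SNAPSHOTS_ROOT` default above (the snapshot directory name is a hypothetical example, not a path guaranteed by the scripts):

```python
import os
from pathlib import Path

# Root containing Hugging Face snapshot checkouts (mirrors the bootstrap default)
snapshots_root = Path(os.environ.get("HF_SNAPSHOTS_ROOT",
                                     Path.home() / ".cache/huggingface/hub"))

def link_model(snapshot_dirname: str, link_name: str) -> None:
    """Symlink a cached HF snapshot into the repo's models/ directory."""
    target = snapshots_root / snapshot_dirname
    link = Path("models") / link_name
    link.parent.mkdir(parents=True, exist_ok=True)
    if not link.exists():
        link.symlink_to(target, target_is_directory=True)

# Hypothetical snapshot directory name, for illustration only
link_model("models--deepseek-ai--DeepSeek-OCR", "deepseek-ocr")
```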
Stage 1 Profiling (PyTorch Profiler - operator-level analysis):
```bash
pixi run stage1-run
```

This runs a quick profile with defaults:
- 3 samples from the dev-20 subset
- 64 max new tokens
- Outputs written to `tmp/profile-output/<timestamp>/`
View the results:
- `report.md`: Comprehensive profiling report with prefill/decode timings
- `operators.md`: Operator-level summaries
- `metrics.json`: Machine-readable metrics
- `stakeholder_summary.md`: Executive summary
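Since `metrics.json` is machine-readable, results are easy to post-process. A minimal sketch (the key names are assumptions; inspect the file for the schema your version actually emits):

```python
import json
from pathlib import Path

# Load Stage 1 metrics from a run directory; substitute a real timestamp.
run_dir = Path("tmp/profile-output") / "<timestamp>"
metrics = json.loads((run_dir / "metrics.json").read_text())

# Key names here are illustrative assumptions, not a guaranteed schema.
print("prefill:", metrics.get("prefill_ms"))
print("decode:", metrics.get("decode_ms"))
print("mfu:", metrics.get("mfu"))
```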
Pixi is the primary package manager. Pixi environments are defined in `pyproject.toml`.
- `default` (the default) - PyTorch 2.5.1 + CUDA 12.4
  - Supports up to sm_90 (Ada/Hopper architectures)
  - Includes Flash Attention 2.7.4.post1
  - Use for RTX 3090, RTX 4090, A100, H100
- `rtx5090` - PyTorch nightly + CUDA 12.8
  - Supports sm_120 (Blackwell architecture)
  - Builds Flash Attention and Triton from source
  - Required for RTX 5090 and newer Blackwell GPUs

For RTX 5090 (Blackwell sm_120) development:
For RTX 5090 (Blackwell sm_120) development:
```bash
# Install the rtx5090 environment
pixi install -e rtx5090

# Run the full setup (installs PyTorch nightly, builds Triton and Flash Attention)
pixi run -e rtx5090 setup-rtx5090

# Verify the setup
pixi run -e rtx5090 verify-rtx5090
```

Note: RTX 5090 setup requires:
- CUDA Toolkit 12.8+ installed system-wide (see System Prerequisites)
- Significant build time for Flash Attention (~10-30 minutes depending on CPU)
```bash
# Use the default environment
pixi run stage1-run

# Use the rtx5090 environment
pixi run -e rtx5090 stage1-run
```

Stage 1 collects prefill/decode timings, operator summaries, and Model FLOPs Utilization (MFU), and writes artifacts under `tmp/profile-output/<run_id>/torch_profiler/` and `tmp/profile-output/<run_id>/static_analysis/`.
Quick run with defaults:
```bash
pixi run stage1-run  # 3 samples, 3 repeats per sample
```

Skip static analysis for faster runs:

```bash
pixi run stage1-run-no-static
```

Custom run with Hydra overrides:

```bash
pixi run python -m llm_perf_opt.runners.llm_profile_runner \
  'hydra.run.dir=tmp/profile-output/${now:%Y%m%d-%H%M%S}' \
  device=cuda:0 infer.max_new_tokens=64 \
  dataset.sampling.num_samples_per_epoch=10 \
  'pipeline.torch_profiler.activities=[cpu,cuda]' \
  pipeline.nsys.enable=false pipeline.ncu.enable=false
```

Artifacts generated:
- `report.md` - Comprehensive profiling report
- `operators.md` - Top operators by time
- `metrics.json` - Machine-readable metrics
- `stakeholder_summary.md` - Executive summary
- `env.json`, `inputs.yaml`, `assumptions.md` - Reproducibility data
- `static_compute.json`, `static_compute.md` - Static model analysis (if enabled)
Deep kernel profiling with Nsight Systems (timeline) and Nsight Compute (per-kernel metrics).
Run Stage 2 profiling:
```bash
pixi run stage2-profile
```

Important: Stage 2 automatically disables `torch_profiler` and `static_analysis` for the workload to avoid overhead. Artifacts are written under `tmp/profile-output/<run_id>/nsys/` and `tmp/profile-output/<run_id>/ncu/`.
Nsight Systems output:
- `*.nsys-rep` - Binary report (open with the Nsight Systems GUI)
- `summary_cuda_gpu_kern_sum.csv` - Kernel summary CSV
- `cmd.txt` - Command used for profiling
Nsight Compute output:
- Per-kernel CSV files with metrics (occupancy, throughput, roofline, etc.)
- `command.yaml` - NCU profiling configuration
For detailed per-kernel profiling:
1. Run Stage 1 or Stage 2 with `pipeline.nsys.enable=true` to generate a Nsight Systems report.
2. Extract top kernels from the Nsys CSV summary:

   ```bash
   python scripts/ncu/release/extract-top-kernels.py \
     tmp/profile-output/<run_id>/nsys/summary_cuda_gpu_kern_sum.csv \
     -o top-kernels.yaml --topk 30
   ```

3. Profile specific kernels using the generated YAML:

   ```bash
   # See scripts/ncu/release/README.md for detailed instructions
   python scripts/ncu/release/profile-from-yaml.py \
     --config top-kernels.yaml \
     --output-dir tmp/ncu-kernels/
   ```

4. Analyze kernel metrics:

   ```bash
   python scripts/ncu/analysis/analyze_ncu_dir.py \
     tmp/ncu-kernels/ \
     --output-dir tmp/ncu-analysis/
   ```

This generates:
- Roofline plots (normalized and physical)
- Metric histograms (occupancy, throughput, bandwidth, etc.)
- Classification summaries (memory-bound, compute-bound, balanced)
- Per-kernel CSV exports
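To make the memory-bound/compute-bound split concrete, here is a minimal roofline-classification sketch. The peak numbers are hypothetical placeholders; the analysis scripts derive them from the actual GPU and the NCU-reported metrics:

```python
# Minimal roofline-style classification sketch. Peak values below are
# hypothetical placeholders, not measured numbers.
PEAK_TFLOPS = 165.0   # compute roof (example value)
PEAK_BW_GBS = 1008.0  # memory roof (example value)
RIDGE = PEAK_TFLOPS * 1e12 / (PEAK_BW_GBS * 1e9)  # FLOPs/byte at the ridge point

def classify(flops: float, bytes_moved: float, tol: float = 0.2) -> str:
    """Classify a kernel by arithmetic intensity relative to the ridge point."""
    intensity = flops / max(bytes_moved, 1.0)
    if intensity < RIDGE * (1 - tol):
        return "memory-bound"
    if intensity > RIDGE * (1 + tol):
        return "compute-bound"
    return "balanced"

print(classify(flops=2.0e9, bytes_moved=4.0e8))  # low intensity -> memory-bound
```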
Run inference without profiling, outputting predictions and visualizations:
```bash
pixi run direct-infer-dev20
```

Custom direct inference:

```bash
pixi run python -m llm_perf_opt.runners.direct_inference_runner \
  dataset.subset_filelist=datasets/omnidocbench/subsets/dev-20.txt \
  device=cuda:0 \
  infer.max_new_tokens=8192 \
  pipeline.direct_inference.enable=true \
  pipeline.direct_inference.output.prediction.enable=true \
  pipeline.direct_inference.output.visualization.enable=true
```

Outputs:
- `predictions/` - JSON predictions for each sample
- `visualizations/` - Images with OCR bounding boxes overlaid
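A hedged sketch of consuming these outputs (the file name and JSON field names such as `regions` and `bbox` are assumptions; inspect a real prediction file for the actual schema):

```python
import json
from pathlib import Path
from PIL import Image, ImageDraw

# Load one prediction from a direct-inference run; substitute a real run id.
run_dir = Path("tmp/profile-output") / "<run_id>"
pred = json.loads((run_dir / "predictions" / "sample_000.json").read_text())

# Draw predicted boxes; field names here are illustrative assumptions.
img = Image.open(pred["image_path"]).convert("RGB")
draw = ImageDraw.Draw(img)
for region in pred.get("regions", []):
    draw.rectangle(region["bbox"], outline="red", width=2)  # [x0, y0, x1, y1]
img.save("annotated.png")
```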
Run tests:

```bash
# Unit tests
pytest tests/unit/

# Integration tests
pytest tests/integration/

# Manual tests
python tests/manual/<test>.py
```

Lint, format, and type-check:

```bash
# Lint with Ruff
ruff check src/

# Format with Ruff
ruff format src/

# Type check with mypy
mypy src/
```

Build and serve documentation:

```bash
# Serve docs locally at 127.0.0.1:8000
pixi run docs-serve

# Build docs
pixi run docs-build
```

Configuration is managed via Hydra with a hierarchical structure under `conf/`:
```text
conf/
├── config.yaml   # Top-level defaults that compose config groups
├── model/        # Model identity, paths, dtypes (e.g., deepseek_ocr/)
├── dataset/      # Dataset roots, variants, sampling presets
├── runtime/      # Runtime parameters (PyTorch, vLLM, TensorRT-LLM)
├── hardware/     # Device selection and hardware-specific options
├── profiling/    # Profiler presets (nsys, ncu, torch)
└── output/       # Output artifact configurations per pipeline stage
```
Hydra overrides are used to swap configs and control pipeline stages:
```bash
# Example: Change model, enable full profiling with Nsight Systems
pixi run python -m llm_perf_opt.runners.llm_profile_runner \
  model=qwen2_5_7b \
  profiling=full \
  pipeline.nsys.enable=true
```

Pipeline structure:

The config uses a `pipeline.*` structure to control different profiling stages (see the sketch after this list):

- `pipeline.torch_profiler.*` - PyTorch profiler settings (Stage 1)
- `pipeline.static_analysis.*` - Static model analysis
- `pipeline.nsys.*` - Nsight Systems profiler (Stage 2)
- `pipeline.ncu.*` - Nsight Compute profiler (Stage 2)
- `pipeline.direct_inference.*` - Direct inference without profiling
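A minimal sketch of how a runner can consume these toggles with Hydra (field names beyond the `pipeline.*.enable` pattern are illustrative assumptions; the real entry points live in `src/llm_perf_opt/runners/`):

```python
import hydra
from omegaconf import DictConfig

# Sketch of a Hydra entry point reading pipeline.* toggles; names are
# illustrative, not the project's exact API.
@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    if cfg.pipeline.torch_profiler.enable:
        print("Stage 1: torch profiler enabled on", cfg.device)
    if cfg.pipeline.nsys.enable:
        print("Stage 2: workload will run under nsys")

if __name__ == "__main__":
    main()
```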
The codebase uses a unified package structure under `src/llm_perf_opt/`:

- `runners/` - Execution orchestrators that compose profiling harnesses with workloads
  - `llm_profile_runner.py` - Stage 1 entry point (Hydra main)
  - `deep_profile_runner.py` - Stage 2 entry point for Nsight profiling
  - `direct_inference_runner.py` - Inference without profiling
  - `dsocr_session.py` - DeepSeek-OCR model session wrapper
  - `inference_engine.py` - Generic inference loop with NVTX annotations
- `profiling/` - Profiling harnesses, parsers, and analysis
  - `harness.py` - NVTX range context managers
  - `mfu.py` - Model FLOPs Utilization (MFU) computation (see the sketch after this list)
  - `hw.py` - Hardware detection (GPU names, peak TFLOPS, NVML)
  - `aggregate.py` - Timing aggregation (mean/std across repeats)
  - `export.py` - Report generation (markdown, JSON artifacts)
  - `vendor/` - Nsight Systems/Compute wrappers and subprocess launchers
  - `parsers/` - CSV/JSON parsers for profiler outputs
- `data/` - Dataset utilities and preprocessors
  - `models.py` - Dataset schema definitions
- `contracts/` - Shared data models and type conversions
  - `models.py` - Pydantic models for internal APIs
  - `convert.py` - Converters between formats
- `visualize/` - Visualization utilities
  - `annotations.py` - Render vendor-style annotations on images
- `dnn_models/` - Model architectures for testing
  - `shallow_resnet.py` - Simple ResNet for profiling validation
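As a rough illustration of what `profiling/mfu.py` computes, here is the standard MFU definition in a hedged sketch: achieved FLOPs per second divided by the hardware's peak. The per-token FLOPs approximation and all numbers are assumptions, not the project's exact accounting:

```python
# Illustrative MFU computation; mfu.py may count FLOPs differently.
def mfu(model_flops_per_token: float,
        tokens_generated: int,
        elapsed_s: float,
        peak_tflops: float) -> float:
    achieved = model_flops_per_token * tokens_generated / elapsed_s
    return achieved / (peak_tflops * 1e12)

# Example: ~2*N FLOPs/token for an N-parameter decoder (common approximation)
print(f"MFU: {mfu(2 * 3e9, tokens_generated=64, elapsed_s=1.5, peak_tflops=165):.2%}")
```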
The project uses a unified pipeline architecture where different profiling stages can be enabled/disabled via Hydra config:
- Stage 1 (`llm_profile_runner.py`): PyTorch Profiler for operator-level analysis and MFU estimation
  - Lightweight, quick turnaround
  - Produces operator summaries and timing statistics
  - Optional static model analysis for refined FLOPs counting
- Stage 2 (`deep_profile_runner.py`): Nsight Systems/Compute for kernel-level analysis and GPU timeline
  - Automatically disables `torch_profiler` and `static_analysis` for the workload subprocess to avoid measurement overhead
  - Provides detailed kernel metrics, roofline analysis, and timeline visualization
The runners share a common workload implementation but wrap it with different profiling harnesses. All runners use the same Hydra config (`conf/config.yaml`) with pipeline-specific overrides.
When `gating_nvtx=true`, Nsight profilers use NVTX ranges to selectively capture specific workload phases (e.g., "prefill", "decode"), which reduces overhead and file sizes. The workload code emits NVTX ranges using `nvtx_range()` context managers from `src/llm_perf_opt/profiling/harness.py`.
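A minimal sketch of such a context manager, assuming the common `torch.cuda.nvtx` push/pop pattern (the project's actual `harness.py` may differ in details):

```python
from contextlib import contextmanager
import torch

# Sketch of an NVTX range context manager in the spirit of profiling/harness.py.
@contextmanager
def nvtx_range(name: str):
    torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        torch.cuda.nvtx.range_pop()

# Usage inside the workload: phases show up as named ranges in Nsight.
with nvtx_range("prefill"):
    pass  # run the prefill forward pass here
```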
```text
llm-perf-opt/
├── conf/                 # Hydra config groups (model, dataset, runtime, hardware, profiling)
│   ├── model/            # Model configs (deepseek_ocr, qwen2_5_7b, etc.)
│   ├── dataset/          # Dataset configs (omnidocbench, etc.)
│   ├── runtime/          # Runtime configs (PyTorch, vLLM, TensorRT-LLM)
│   ├── hardware/         # Hardware configs (CUDA device selection)
│   ├── profiling/        # Profiler presets (torch, nsys, ncu)
│   └── output/           # Output artifact configurations
├── models/               # Model weights/tokenizers (symlinks from $HF_SNAPSHOTS_ROOT)
├── datasets/             # Datasets (symlinks from $DATASETS_ROOT)
│   └── omnidocbench/     # OmniDocBench dataset with bootstrap scripts
├── src/
│   └── llm_perf_opt/     # Unified project package
│       ├── profiling/    # Profiling harnesses, parsers, MFU computation
│       ├── runners/      # Execution orchestrators (Stage 1, Stage 2, direct inference)
│       ├── data/         # Dataset utilities
│       ├── contracts/    # Shared data models (Pydantic)
│       ├── visualize/    # Visualization utilities
│       └── dnn_models/   # Model architectures for testing
├── scripts/              # Utility scripts
│   ├── ncu/              # NCU profiling and analysis scripts
│   │   ├── release/      # Top-kernel extraction and profiling
│   │   └── analysis/     # Roofline plots, histograms, classification
│   └── install-*.sh      # Installation scripts (CUDA toolkit, vLLM, etc.)
├── reports/              # Profiling reports and artifacts
│   └── 20251107-dsocr/   # DeepSeek-OCR profiling report (example)
│       ├── final-report-v2.md          # Technical report
│       ├── final-report-v2-chinese.md  # Chinese translation
│       ├── ncu/          # NCU raw data (per-kernel CSVs)
│       ├── ncu-v2/       # NCU analysis (roofline, histograms)
│       └── nsys/         # Nsys reports (per-stage, all-stage)
├── tests/                # Tests (manual/unit/integration)
├── docs/                 # Documentation (MkDocs)
├── context/              # Knowledge base and development hints
├── third_party/          # Read-only upstream references (submodules)
├── .specify/             # Speckit constitution and templates
└── pyproject.toml        # Pixi environments and project metadata
```
NVIDIA GPU profiling requires system tools:
Check with `nvidia-smi` for CUDA compatibility:

```bash
nvidia-smi
```

Note the "CUDA Version" shown - this is the maximum CUDA version your driver supports.
Install via the NVIDIA APT repository (Ubuntu):

```bash
# Replace ubuntu2204 with your release (e.g., ubuntu2004, ubuntu2404)
sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install Nsight Systems CLI + target components
sudo apt-get install -y nsight-systems nsight-systems-target

# Verify
nsys --version
```

Alternative: Download from https://developer.nvidia.com/nsight-systems
Required to build CUDA-based dependencies (e.g., flash-attn). Use the official NVIDIA repository for system-wide installation.
Automated installation:
```bash
./scripts/install-cuda-toolkit-12-8.sh
```

Manual installation:
```bash
# 1. Download and install the CUDA keyring (Ubuntu 24.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

# 2. Update the package list
sudo apt-get update

# 3. Install CUDA Toolkit 12.8 (toolkit only, no driver update)
sudo apt-get install -y cuda-toolkit-12-8

# 4. Add to PATH and environment
echo '' >> ~/.bashrc
echo '# CUDA 12.8 Toolkit' >> ~/.bashrc
echo 'export CUDA_HOME=/usr/local/cuda-12.8' >> ~/.bashrc
echo 'export PATH=$CUDA_HOME/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# 5. Verify the installation
nvcc --version
```

Important Notes:
- Replace `ubuntu2404` with your Ubuntu version: `ubuntu2004`, `ubuntu2204`, or `ubuntu2404`
- Pick a CUDA version that matches your driver (`nvidia-smi` shows the maximum supported "CUDA Version")
- System-wide installation ensures compilers can find CUDA headers at `/usr/local/cuda-12.8/include`
- For other distributions (RHEL, Fedora, SLES), see the official installation guide
Per-kernel profiler for detailed metrics.
Preferred (newer versions): Pixi Global
```bash
pixi global install nsight-compute --channel nvidia --channel conda-forge
~/.pixi/bin/ncu --version  # or add ~/.pixi/bin to PATH
```

Alternative (root): APT via the NVIDIA repository

```bash
sudo apt-get install -y nsight-compute
```

For readable CUDA timeline ranges, install the `nvtx` Python package and trace NVTX with Nsight Systems:

```bash
uv pip install -U nvtx || pip install -U nvtx

nsys profile --trace=cuda,nvtx,osrt -o tmp/nsys/deepseek \
  pixi run python tests/manual/deepseek_ocr_hf_manual.py
```
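The `nvtx` package can also annotate Python code directly, which pairs well with `--trace=nvtx`. A small usage sketch (range names and colors are arbitrary examples):

```python
import nvtx

# Ranges appear on the Nsight Systems timeline when traced with --trace=nvtx.
@nvtx.annotate("generate", color="green")
def generate():
    with nvtx.annotate("prefill", color="blue"):
        pass  # prefill forward pass would run here
    with nvtx.annotate("decode", color="red"):
        pass  # token-by-token decode loop would run here

generate()
```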
Dataset sampling is controlled via `dataset.sampling.*` overrides:

```bash
pixi run python -m llm_perf_opt.runners.llm_profile_runner \
  dataset.sampling.num_epochs=1 \
  dataset.sampling.num_samples_per_epoch=10 \
  dataset.sampling.randomize=true
```

Parameters:
- `dataset.sampling.num_epochs` - Number of passes through the dataset
- `dataset.sampling.num_samples_per_epoch` - Samples per epoch (`null` = all)
- `dataset.sampling.randomize` - Whether to shuffle samples
- `dataset.subset_filelist` - Path to a file list for subset selection (e.g., `datasets/omnidocbench/subsets/dev-20.txt`)
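The semantics of these knobs can be sketched in a few lines (illustrative only; the project's dataset loader may differ in details such as seeding):

```python
import random

# Sketch of the sampling semantics described above.
def sample_ids(all_ids: list[str],
               num_epochs: int = 1,
               num_samples_per_epoch: int | None = None,
               randomize: bool = False) -> list[str]:
    out: list[str] = []
    for _ in range(num_epochs):
        ids = list(all_ids)
        if randomize:
            random.shuffle(ids)
        out.extend(ids if num_samples_per_epoch is None
                   else ids[:num_samples_per_epoch])
    return out

print(sample_ids([f"img_{i}" for i in range(20)],
                 num_epochs=1, num_samples_per_epoch=10, randomize=True))
```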
When modifying configs:
- Check `conf/config.yaml` for defaults-list composition
- New config groups go under the appropriate subdirectory (e.g., `conf/model/new_model/`)
- Use `@` notation for mounting configs to nested keys (e.g., `profiling/torch@pipeline.torch_profiler`)
- Test overrides with `--cfg job` to print the resolved config: `pixi run python -m llm_perf_opt.runners.llm_profile_runner --cfg job`
Models are symlinked from external storage (`$HF_SNAPSHOTS_ROOT`). To add a new model:

1. Update `models/bootstrap.yaml` with the model path
2. Run `models/bootstrap.sh --yes` to create the symlink
3. Add a model config group under `conf/model/<model_name>/`
4. Update the defaults list in `conf/config.yaml` if making it the default
Recent profiling reports and artifacts are stored in `reports/`:

Located in `reports/20251107-dsocr/`:

- `final-report-v2.md` - Comprehensive technical report with roofline analysis, kernel classification, and optimization recommendations
- `final-report-v2-chinese.md` - Chinese translation
- `ncu-v2/analysis/` - Roofline plots, histograms, and per-kernel metrics
- `nsys/` - Nsight Systems reports for each pipeline stage
- `kernel-info.yaml` - Kernel metadata for the top 20 kernels
This report demonstrates the full profiling workflow from Stage 1 through NCU kernel analysis, providing insights into memory-bound vs compute-bound kernels, occupancy, throughput, and optimization opportunities.
Models are symlinked from external storage to avoid duplicating large files:
```bash
# Set a custom model root (default: ~/.cache/huggingface/hub)
export HF_SNAPSHOTS_ROOT=/path/to/models

# Bootstrap models
models/bootstrap.sh --yes
```

To add a new profiling stage:
1. Create a config preset under `conf/profiling/` or `conf/output/`
2. Mount the preset into `pipeline.<stage_name>` in `conf/config.yaml`
3. Add an enable/disable toggle at `pipeline.<stage_name>.enable`
4. Implement a harness wrapper in `src/llm_perf_opt/profiling/` (see the sketch below)
5. Create or extend a runner in `src/llm_perf_opt/runners/`
6. Add a Pixi task to `pyproject.toml` for easy invocation
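For step 4, a hypothetical harness wrapper might look like this (names are illustrative; real harnesses live in `src/llm_perf_opt/profiling/`):

```python
from contextlib import contextmanager

# Hypothetical harness for a new stage, following the enable-toggle pattern
# described above; names are illustrative, not the project's API.
@contextmanager
def my_stage_harness(enable: bool):
    if not enable:
        yield
        return
    print("my_stage: start")     # e.g., start a profiler session here
    try:
        yield
    finally:
        print("my_stage: stop")  # e.g., stop and export artifacts here

# The runner would wrap the shared workload with the harness:
with my_stage_harness(enable=True):
    pass  # run the common workload here
```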
Contributions are welcome! This project is actively developed and we're building out the foundation.
Areas for contribution:
- Additional model support (encoder-decoder, multimodal)
- Hardware platform support (AMD GPUs, Intel GPUs, ARM processors)
- Optimization techniques (quantization, kernel fusion, memory optimization)
- Benchmark datasets and scenarios
- Documentation and tutorials
To be determined
- Define benchmark methodology and metrics
- Implement core benchmarking framework
- Multi-stage profiling architecture (PyTorch Profiler, Nsight Systems/Compute)
- Roofline analysis and kernel classification
- Comprehensive reporting and visualization
- Add support for major hardware platforms (AMD, Intel, ARM)
- Develop optimization toolkit (quantization, kernel fusion)
- Expand model coverage (encoder-decoder, multimodal)
- Build interactive performance dashboards
- Publish benchmark results and optimization case studies
Note: This project focuses on NVIDIA GPU profiling and optimization. Support for additional hardware platforms is planned for future releases.
For detailed development guidelines and advanced usage, see CLAUDE.md.