Add Paged Attention Op for CUDA SM80 support #24595

aciddelgado · 2025-04-29T23:11:45Z

Description

Adds a Paged Attention op, which enables the use of a paged KV cache. Inputs to this op are unpadded (packed / varlen), so cumulative sequence lengths are a required input.
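For readers unfamiliar with the packed layout, here is a minimal sketch of what cumulative sequence lengths look like and how a logical token position could map into a paged cache. The function names, `block_table`, and `block_size` are hypothetical, chosen for illustration; they are not taken from this PR's op schema or kernels.

```python
from itertools import accumulate


def cumulative_seqlens(seqlens):
    """Return offsets [0, l0, l0 + l1, ...] into a packed (unpadded) token buffer."""
    return [0, *accumulate(seqlens)]


def locate_in_paged_cache(block_table, block_size, position):
    """Map a logical token position to (physical block id, offset within block).

    block_table is a hypothetical per-sequence list of physical block ids.
    """
    return block_table[position // block_size], position % block_size


# Three sequences of different lengths packed back to back, with no padding.
seqlens = [3, 1, 4]
cu_seqlens = cumulative_seqlens(seqlens)

# Tokens of sequence i occupy packed rows cu_seqlens[i] : cu_seqlens[i + 1].
spans = [(cu_seqlens[i], cu_seqlens[i + 1]) for i in range(len(seqlens))]

# Assumed layout: one sequence whose cache lives in physical blocks 7, 2, 5,
# each holding 4 tokens. Logical position 6 falls in the second block.
location = locate_in_paged_cache([7, 2, 5], block_size=4, position=6)
```

Because the batch is packed, the offsets are the only way to tell where one sequence ends and the next begins, which is why they have to be supplied as an input.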

Motivation and Context

Adding this op to ONNXRuntime is necessary to allow the GenAI team to enable a continuous batching server API.

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/contrib_ops/cpu/bert/attention_parameters.h

onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_api.h

onnxruntime/contrib_ops/cuda/bert/paged_attention.cc

onnxruntime/test/python/transformers/test_paged_attention_cuda.py

onnxruntime/contrib_ops/cuda/bert/paged_attention.cc

onnxruntime/contrib_ops/cuda/bert/paged_attention_helper.h

onnxruntime/test/python/transformers/test_paged_attention_cuda.py

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/test/python/transformers/test_paged_attention_cuda.py

onnxruntime/contrib_ops/cpu/bert/attention_parameters.h

onnxruntime/contrib_ops/cuda/bert/paged_attention.cc

onnxruntime/contrib_ops/cuda/bert/paged_attention_helper.h

onnxruntime/contrib_ops/cuda/bert/paged_attention_impl.cu

onnxruntime/core/graph/contrib_ops/bert_defs.cc

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/contrib_ops/cuda/cuda_contrib_kernels.cc

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/contrib_ops/cuda/cuda_contrib_kernels.cc

onnxruntime/contrib_ops/cpu/bert/attention_parameters.h

onnxruntime/contrib_ops/cuda/bert/attention_data.h

onnxruntime/contrib_ops/cuda/bert/paged_attention_helper.h

onnxruntime/core/graph/contrib_ops/bert_defs.cc

onnxruntime/contrib_ops/cuda/bert/paged_attention_impl.cu

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/contrib_ops/cuda/cuda_contrib_kernels.cc

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/contrib_ops/cuda/cuda_contrib_kernels.cc

onnxruntime/contrib_ops/cpu/bert/attention_parameters.h

onnxruntime/core/graph/contrib_ops/bert_defs.cc

tianleiwu

What design change would be needed if we want to support FP8 or FP4 paged attention in the future?

onnxruntime/contrib_ops/cuda/bert/paged_attention_helper.h

aciddelgado · 2025-06-11T20:56:55Z

> What design change would be needed if we want to support FP8 or FP4 paged attention in the future?

A new kernel would be necessary.

aciddelgado added 4 commits April 11, 2025 15:53

paged attention op

6e75e22

test file amid-debug

aa6fd44

paged attention works

5583aaa

everything works and is implemented

c628e66

aciddelgado requested review from baijumeswani, tianleiwu and kunal-vaishnavi April 29, 2025 23:11

github-actions bot reviewed Apr 29, 2025

github-advanced-security bot found potential problems Apr 29, 2025

small stuff

554d0c7

github-actions bot reviewed Apr 29, 2025

onnxruntime/test/python/transformers/test_paged_attention_cuda.py (outdated)

onnxruntime/test/python/transformers/test_paged_attention_cuda.py (outdated)

kunal-vaishnavi reviewed Apr 30, 2025

onnxruntime/contrib_ops/cpu/bert/attention_parameters.h (outdated)

kunal-vaishnavi reviewed Apr 30, 2025

onnxruntime/contrib_ops/cpu/bert/attention_parameters.h (outdated)

kunal-vaishnavi reviewed Apr 30, 2025

onnxruntime/contrib_ops/cuda/bert/paged_attention.cc

kunal-vaishnavi reviewed Apr 30, 2025

onnxruntime/contrib_ops/cuda/bert/paged_attention_helper.h (outdated)

kunal-vaishnavi reviewed Apr 30, 2025

onnxruntime/contrib_ops/cuda/bert/paged_attention_helper.h (outdated)

kunal-vaishnavi reviewed Apr 30, 2025

onnxruntime/contrib_ops/cuda/bert/paged_attention_impl.cu (outdated)

kunal-vaishnavi reviewed Apr 30, 2025

onnxruntime/core/graph/contrib_ops/bert_defs.cc

lint

497a22e

github-actions bot reviewed Apr 30, 2025

onnxruntime/contrib_ops/cuda/cuda_contrib_kernels.cc (outdated)

onnxruntime/contrib_ops/cuda/cuda_contrib_kernels.cc (outdated)

onnxruntime/contrib_ops/cuda/cuda_contrib_kernels.cc (outdated)

address comments

c3276d1

github-actions bot reviewed Apr 30, 2025

onnxruntime/contrib_ops/cuda/cuda_contrib_kernels.cc (outdated)

onnxruntime/contrib_ops/cuda/cuda_contrib_kernels.cc (outdated)

onnxruntime/contrib_ops/cuda/cuda_contrib_kernels.cc (outdated)