Adding Support for Attention Sinks to vLLM Code Path. by NicoGrande · Pull Request #2923 · AI-Hypercomputer/maxtext

NicoGrande · 2026-01-08T21:52:05Z

Description

This PR introduces support for attention sinks in the MaxText on vLLM codepath. This introduces support for the GPT-OSS family of models.

Tests

Tested locally on v6e-4 with the following command:

  python3 -m MaxText.vllm_decode \
    --model_name gpt-oss-20b \
    --hf_model_name openai/gpt-oss-20b \
    --hf_config_path src/MaxText/integration/vllm/maxtext_vllm_adapter \
    --load_parameters_path $CHECKPOINT_PATH \
    --ici_tensor_parallelism 4 \
    --gpu_memory_utilization 0.5 \
    --prompt "Suggest some famous landmarks in London."

Output:

Prompt: 'Suggest some famous landmarks in London.', Generated text: "\n\nLondon is home to a wealth of iconic landmarks that reflect its rich history and vibrant culture. Here are some of the most famous:\n\n1. **The Tower of London** - A historic castle on the north bank of the River Thames, known for its role as a royal palace, prison, and treasury.\n2. **Buckingham Palace** - The London residence and administrative headquarters of the monarch of the United Kingdom.\n3. **The British Museum** - One of the world's best museums, famous for its vast collection of art and antiquities from around the world.\n4. **The Houses of Parliament and Big Ben** - The iconic clock tower and the seat of the UK Parliament.\n5. **The London Eye** - A giant Ferris wheel on the South Bank of the River Thames, offering panoramic views of the city.\n6. **St. Paul’s Cathedral** - Known for its magnificent dome and historic significance.\n7. **The Shard** - The tallest building in the UK, offering spectacular views from its viewing platform.\n8. **The Tate Modern** - A leading contemporary art museum.\n9. **The National Gallery** - Home to a collection of paintings, sculpture, and prints.\n10. **The National Gallery** - The National Gallery** - Home to a collection of paintings, sculpture, and prints.\n\nHere are some of the\n\nHere are some of the\n\nHere are some famous landmarks in London\n\nHere\n\nHere\n\nHere\n\nHere\n\nLondon\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nLondon\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\n1\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nLondon\n\nHere\n\nLondon\n\nHere\n\nHere\n\nLondon\n\nHere\n\nHere\n\nLondon\n\nLondon\n\nHere\n\nLondon\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nand\n\nHere\n\n\n\nHere\n\nand\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\nHere\n\n"

Note: GPT-OSS uses the harmony tokenizer with a special end token which is not used in vLLM by default. This is why we see repeated characters at the end of the response.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-01-08T22:04:07Z

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/MaxText/layers/attentions.py	0.00%	3 Missing ⚠️

📢 Thoughts on this report? Let us know!

gagika

thanks

NicoGrande marked this pull request as ready for review January 8, 2026 21:54

NicoGrande requested review from RissyRan, bvandermoon, gagika, gobbleturk, jiangjy1982, parambole, richjames0, shralex, shuningjin and suexu1025 as code owners January 8, 2026 21:54

gagika reviewed Jan 9, 2026

View reviewed changes

Comment thread src/MaxText/layers/attentions.py Outdated

NicoGrande force-pushed the nicogrande/enable-gpt-oss-attention-vllm branch from 6873337 to 27ce213 Compare January 9, 2026 19:38

adding support for attention sinks vllm.

e6976ba

NicoGrande force-pushed the nicogrande/enable-gpt-oss-attention-vllm branch from 27ce213 to e6976ba Compare January 9, 2026 22:26

gobbleturk approved these changes Jan 12, 2026

View reviewed changes

NuojCheng approved these changes Jan 12, 2026

View reviewed changes

gagika approved these changes Jan 12, 2026

View reviewed changes

NicoGrande added the pull ready label Jan 12, 2026

copybara-service Bot merged commit 05a4a53 into main Jan 12, 2026
38 of 47 checks passed

copybara-service Bot deleted the nicogrande/enable-gpt-oss-attention-vllm branch January 12, 2026 22:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding Support for Attention Sinks to vLLM Code Path.#2923

Adding Support for Attention Sinks to vLLM Code Path.#2923
copybara-service[bot] merged 1 commit intomainfrom
nicogrande/enable-gpt-oss-attention-vllm

NicoGrande commented Jan 8, 2026

Uh oh!

codecov Bot commented Jan 8, 2026 •

edited

Loading

Uh oh!

Uh oh!

gagika left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

NicoGrande commented Jan 8, 2026

Description

Tests

Checklist

Uh oh!

codecov Bot commented Jan 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

gagika left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov Bot commented Jan 8, 2026 •

edited

Loading