Enable Gemma 4 E2B / E4B inference via vLLM RPA by gagika · Pull Request #4053 · AI-Hypercomputer/maxtext

gagika · 2026-06-03T19:09:21Z

Description

Enable Gemma 4 E2B / E4B inference via vLLM adapter by adding KV-shared layers wiring and a system_prompt flag to vllm_decode (required by the E-family -it checkpoints), and documents a verified inference recipe in Run_Gemma4.md.

Key changes

gemma4_small.py — decoder layer returns the kernel-updated kv_cache (was dropped).
decoders.py — KV-shared layers redirect to the donor's kv_caches slot via a layer→slot map; cache is written back per layer.
attentions.py — sliding-window only on LOCAL_SLIDING layers; KV-shared layers no longer overwrite the donor's cache (update_kv_cache=False).
vllm_decode.py / types.py — new system_prompt config knob, prepended as a system message when use_chat_template=True. Adapter registration + env flags moved from import time into main() (import side effects leaked into unrelated tests).
Run_Gemma4.md — E2B / E4B recipe: system prompt, model-card sampling, full eos_token_id [1, 106, 50] stop-token set.

Tests

On v5p-8, e2b + e4b: cross-checked top-1 logits at greedy vs the native checkpoint — bit-identical. With the documented recipe, models generate coherent output and stop cleanly. CLI and Python API verified.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

github-actions · 2026-06-03T19:09:52Z

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

This Pull Request successfully enables inference for Gemma 4 E2B/E4B models by implementing cross-layer KV sharing within the vLLM RPA (Ragged Paged Attention) path. The implementation is technically sound, handling the complex layer-to-slot mapping required for shared KV caches while maintaining compatibility with existing inference workflows.

🔍 General Feedback

Robust KV-Sharing Implementation: The mapping logic in decoders.py correctly handles the redirection of shared layers to donor slots, ensuring efficient memory usage during TPU inference.
Improved Attention Logic: The fix in attentions.py to restrict sliding window attention to LOCAL_SLIDING layers is a necessary correction for hybrid attention models like Gemma 4.
Clear Documentation: The added recipes in Run_Gemma4.md provide essential guidance on system prompts and sampling parameters required for coherent output from these smaller checkpoints.

codecov · 2026-06-03T19:15:04Z

Codecov Report

❌ Patch coverage is 47.61905% with 11 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/layers/decoders.py	0.00%	8 Missing ⚠️
src/maxtext/layers/attentions.py	0.00%	2 Missing ⚠️
src/maxtext/models/gemma4_small.py	90.90%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

github-actions · 2026-06-04T05:37:48Z

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions · 2026-06-04T05:39:51Z

🤖 I'm sorry @gagika, but I was unable to process your request. Please see the logs for more details.

github-actions · 2026-06-04T17:00:57Z

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions · 2026-06-04T17:04:31Z

🤖 I'm sorry @gagika, but I was unable to process your request. Please see the logs for more details.

khatwanimohit

LGTM

gagika added the gemini-review label Jun 3, 2026

github-actions Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/maxtext/layers/decoders.py Outdated

Comment thread src/maxtext/inference/vllm_decode.py

gagika force-pushed the agagik-gemma4e-vllm branch 3 times, most recently from 08a296b to b0c87a2 Compare June 4, 2026 05:37

gagika added gemini-review and removed gemini-review labels Jun 4, 2026

gagika force-pushed the agagik-gemma4e-vllm branch 2 times, most recently from 630d51f to 6b2739a Compare June 4, 2026 17:00

gagika added gemini-review and removed gemini-review labels Jun 4, 2026

gagika force-pushed the agagik-gemma4e-vllm branch 2 times, most recently from 88a2e20 to 9364c50 Compare June 5, 2026 04:47

gagika assigned khatwanimohit Jun 5, 2026

gagika marked this pull request as ready for review June 5, 2026 05:05

gagika requested review from Lumosis, gpolovets1, jrplatin, mailvijayasingh, mitalisi, parambole, patemotter, richjames0 and vipannalla as code owners June 5, 2026 05:05

gagika requested review from A9isha, NicoGrande, NuojCheng, RissyRan, SurbhiJainUSC, abhinavclemson, aireenmei, bvandermoon, darisoy, dipannita08, gobbleturk, hengtaoguo, igorts-git, jesselu-google, jiangjy1982, khatwanimohit, shralex, shuningjin and suexu1025 as code owners June 5, 2026 05:05

gemma4 e2b/e4b: enable vLLM RPA inference

97c6f9a

gagika force-pushed the agagik-gemma4e-vllm branch from 9364c50 to 97c6f9a Compare June 5, 2026 18:03

khatwanimohit approved these changes Jun 5, 2026

View reviewed changes

NuojCheng approved these changes Jun 5, 2026

View reviewed changes

gagika added the pull ready label Jun 6, 2026

copybara-service Bot merged commit 2e6cd11 into main Jun 6, 2026
53 checks passed

copybara-service Bot deleted the agagik-gemma4e-vllm branch June 6, 2026 01:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable Gemma 4 E2B / E4B inference via vLLM RPA#4053

Enable Gemma 4 E2B / E4B inference via vLLM RPA#4053
copybara-service[bot] merged 1 commit into
mainfrom
agagik-gemma4e-vllm

gagika commented Jun 3, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

khatwanimohit left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

gagika commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Key changes

Tests

Checklist

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

🔍 General Feedback

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

khatwanimohit left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

gagika commented Jun 3, 2026 •

edited

Loading

codecov Bot commented Jun 3, 2026 •

edited

Loading