Enable Gemma 4 E2B / E4B inference via vLLM RPA #4053
This Pull Request successfully enables inference for Gemma 4 E2B/E4B models by implementing cross-layer KV sharing within the vLLM RPA (Ragged Paged Attention) path. The implementation is technically sound, handling the complex layer-to-slot mapping required for shared KV caches while maintaining compatibility with existing inference workflows.
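The layer-to-slot mapping mentioned in this summary can be pictured with a minimal sketch. It is not the code from `decoders.py`: the function name, the `"local"`/`"global"` labels, and the rule that a shared layer reuses the slot of the latest earlier donor of the same attention type are all assumptions made for illustration.

```python
from typing import Dict, List


def build_layer_to_slot(layer_types: List[str], num_kv_shared: int) -> Dict[int, int]:
    """Map each decoder layer index to a kv_caches slot (hypothetical helper).

    Assumption: the last `num_kv_shared` layers own no cache of their own and
    reuse the slot of the latest earlier non-shared layer of the same type.
    """
    first_shared = len(layer_types) - num_kv_shared
    layer_to_slot: Dict[int, int] = {}
    donor_slot_for_type: Dict[str, int] = {}
    for idx, kind in enumerate(layer_types):
        if idx < first_shared:
            # Donor layer: owns its slot; here the slot index equals the layer index.
            layer_to_slot[idx] = idx
            donor_slot_for_type[kind] = idx
        else:
            # KV-shared layer: redirect to the donor's slot.
            if kind not in donor_slot_for_type:
                raise ValueError(f"no donor layer of type {kind!r} for layer {idx}")
            layer_to_slot[idx] = donor_slot_for_type[kind]
    return layer_to_slot


layer_types = ["local", "local", "global", "local", "local", "global"]
mapping = build_layer_to_slot(layer_types, num_kv_shared=3)
print(mapping)
```

In this toy configuration six layers need only three cache slots, which is where the memory saving comes from.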
🔍 General Feedback
- Robust KV-Sharing Implementation: The mapping logic in `decoders.py` correctly handles the redirection of shared layers to donor slots, ensuring efficient memory usage during TPU inference.
- Improved Attention Logic: The fix in `attentions.py` to restrict sliding window attention to `LOCAL_SLIDING` layers is a necessary correction for hybrid attention models like Gemma 4.
- Clear Documentation: The added recipes in `Run_Gemma4.md` provide essential guidance on system prompts and sampling parameters required for coherent output from these smaller checkpoints.
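To illustrate why the sliding window must apply only to `LOCAL_SLIDING` layers, here is a small sketch of the two mask shapes. The function name, the `"GLOBAL"` label, and the window convention (a query sees the `sliding_window` most recent positions, itself included) are assumptions, not the logic in `attentions.py`.

```python
from typing import List


def attention_mask(seq_len: int, attention_type: str, sliding_window: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal constraint applies to every layer type
            if attention_type == "LOCAL_SLIDING":
                # Only local layers additionally restrict attention to a window.
                visible = visible and (q - k) < sliding_window
            row.append(visible)
        mask.append(row)
    return mask


local_mask = attention_mask(4, "LOCAL_SLIDING", sliding_window=2)
global_mask = attention_mask(4, "GLOBAL", sliding_window=2)
print(local_mask[3])
print(global_mask[3])
```

If the window were applied to every layer, the global layers in a hybrid model would lose access to early context, which is the failure mode the fix avoids.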
Description
Enable Gemma 4 E2B / E4B inference via the vLLM adapter by adding KV-shared layer wiring and a `system_prompt` flag to `vllm_decode` (required by the E-family `-it` checkpoints), and document a verified inference recipe in `Run_Gemma4.md`.

Key changes
- `gemma4_small.py`: the decoder layer returns the kernel-updated `kv_cache` (it was previously dropped).
- `decoders.py`: KV-shared layers redirect to the donor's `kv_caches` slot via a layer→slot map; the cache is written back per layer.
- `attentions.py`: sliding-window attention applies only on `LOCAL_SLIDING` layers; KV-shared layers no longer overwrite the donor's cache (`update_kv_cache=False`).
- `vllm_decode.py` / `types.py`: new `system_prompt` config knob, prepended as a system message when `use_chat_template=True`. Adapter registration and env flags moved from import time into `main()` (import side effects leaked into unrelated tests).
- `Run_Gemma4.md`: E2B / E4B recipe covering the system prompt, model-card sampling, and the full `eos_token_id` `[1, 106, 50]` stop-token set.

Tests
On v5p-8, for E2B and E4B: top-1 logits under greedy decoding were cross-checked against the native checkpoint and are bit-identical. With the documented recipe, the models generate coherent output and stop cleanly. Both the CLI and the Python API were verified.
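A cross-check of this kind can be sketched as follows. This is not the script used for the PR: the helper names and the toy logits are hypothetical, and real logits would come from the two decode paths being compared.

```python
from typing import List, Sequence, Tuple


def top1(row: Sequence[float]) -> Tuple[int, float]:
    """Return (token id, logit value) of the highest logit in one decode step."""
    idx = max(range(len(row)), key=row.__getitem__)
    return idx, row[idx]


def first_top1_mismatch(ref: List[Sequence[float]], test: List[Sequence[float]]) -> int:
    """Return the first step whose top-1 id or value differs exactly, or -1 if none."""
    if len(ref) != len(test):
        raise ValueError("both runs must cover the same number of decode steps")
    for step, (r, t) in enumerate(zip(ref, test)):
        # Exact equality on purpose: the claim being checked is bit-identical top-1.
        if top1(r) != top1(t):
            return step
    return -1


# Toy per-step logits over a 3-token vocabulary.
reference = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]]
identical = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]]
diverged = [[0.1, 2.0, 0.3], [0.1, 0.2, 1.5]]

ok_step = first_top1_mismatch(reference, identical)
bad_step = first_top1_mismatch(reference, diverged)
print(ok_step, bad_step)
```

Reporting the first mismatching step, rather than a single pass/fail flag, makes it easier to tell a prompt-formatting difference (divergence at step 0) from cache corruption that only shows up later in the sequence.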