Add Llama4VisionModel for multimodal decoding #1809

hengtaoguo · 2025-06-05T22:17:50Z

Description

This PR implements the complete Llama4VisionModel class by integrating all of the basic Llama4 vision components. It enables Llama4 multimodal decoding, for example describing an input image. Joint work by @hengtaoguo and @aireenmei.

Core change: Llama4VisionModel converts image tiles of shape (batch_size, num_tiles, C, H, W) to feature activations of shape (batch_size, num_tiles, num_patches, vision_output_dim_for_vit). Llama4MultiModalProjector then projects them to (batch_size, num_tiles, num_patches, base_emb_dim). Example: (8, 5, 3, 336, 336) -> (8, 5, 144, 4096) -> (8, 5, 144, 5120). The image tokens are then merged into the text tokens by reusing merge_mm_embeddings.
Refactor: add get_dummy_image_shape_for_init(), which returns the dummy image shape each model expects, for jit initialization.
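
The shape flow in the example above can be traced with a small standalone sketch. This is not the MaxText implementation: `trace_vision_shapes` is a hypothetical helper, and the patch size (14) and pixel-shuffle ratio (0.5) are assumptions not stated in this PR, chosen because they reproduce the 144 patches per 336x336 tile quoted above.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def trace_vision_shapes(
    batch_size: int,
    num_tiles: int,
    channels: int = 3,
    image_size: int = 336,             # tile height/width from the PR example
    patch_size: int = 14,              # assumption, not stated in this PR
    pixel_shuffle_ratio: float = 0.5,  # assumption, not stated in this PR
    vision_output_dim: int = 4096,     # vision_output_dim_for_vit in the example
    base_emb_dim: int = 5120,          # text embedding width in the example
) -> Tuple[Shape, Shape, Shape]:
  """Returns (tile shape, vision model output shape, projector output shape)."""
  patches_per_side = image_size // patch_size
  raw_patches = patches_per_side * patches_per_side
  # Pixel shuffle trades spatial positions for channel depth, so the patch
  # count shrinks by ratio**2 (0.25 under the assumed ratio of 0.5).
  num_patches = round(raw_patches * pixel_shuffle_ratio**2)

  tiles = (batch_size, num_tiles, channels, image_size, image_size)
  vit_out = (batch_size, num_tiles, num_patches, vision_output_dim)
  projected = (batch_size, num_tiles, num_patches, base_emb_dim)
  return tiles, vit_out, projected


if __name__ == "__main__":
  for shape in trace_vision_shapes(batch_size=8, num_tiles=5):
    print(shape)
```

With the defaults, each tile yields 24 x 24 = 576 raw patches, which the assumed pixel shuffle reduces to 144, matching (8, 5, 144, 4096) and (8, 5, 144, 5120) in the description.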

Tests

Tested full multimodal decode on a v5p-16 cluster with this command; see the workload and screenshot.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed.

aireenmei

Thanks Hengtao, great work!

MaxText/layers/models.py

gagika

thanks

hengtaoguo force-pushed the hengtaoguo-vision branch 2 times, most recently from ed26908 to 02adb51 Compare June 20, 2025 05:34

hengtaoguo marked this pull request as ready for review June 20, 2025 05:35

hengtaoguo requested review from gobbleturk, khatwanimohit, bvandermoon, vipannalla, RissyRan, richjames0, gagika, shralex, yangyuwei, SurbhiJainUSC, A9isha and aireenmei as code owners June 20, 2025 05:35

hengtaoguo self-assigned this Jun 20, 2025

hengtaoguo changed the title ~~[WIP] Full Llama4 VisionModel~~ Implement Llama4VisionModel for multimodal decoding Jun 20, 2025

hengtaoguo changed the title ~~Implement Llama4VisionModel for multimodal decoding~~ Add Llama4VisionModel for multimodal decoding Jun 20, 2025

aireenmei approved these changes Jun 20, 2025


MaxText/layers/models.py (outdated, resolved)

MaxText/layers/models.py (outdated, resolved)

aireenmei mentioned this pull request Jun 20, 2025

llama4 ckpt conversion #1816

Open

4 tasks

hengtaoguo assigned gagika Jun 20, 2025

gagika approved these changes Jun 23, 2025


hengtaoguo force-pushed the hengtaoguo-vision branch 2 times, most recently from 5c6e38d to e0946fa Compare June 23, 2025 21:47

github-actions bot added the pull ready label Jun 23, 2025

hengtaoguo force-pushed the hengtaoguo-vision branch 3 times, most recently from b10022f to c70bdc3 Compare June 23, 2025 23:03

VisionModel

9ad61e6

hengtaoguo force-pushed the hengtaoguo-vision branch from f1c7ae3 to 9ad61e6 Compare June 23, 2025 23:06

copybara-service bot merged commit d5ecf6d into main Jun 23, 2025
18 checks passed

copybara-service bot deleted the hengtaoguo-vision branch June 23, 2025 23:39
