Add multimodal_inputs.md covering all modalities (text, image, video, audio, time series, mixed) with OpenAI-style examples, local file / base64 usage via lmdeploy.vl.utils helpers, and mm_processor_kwargs / media_io_kwargs guidance. Link from vl_pipeline.md and index.rst. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
This PR extends LMDeploy’s VLM preprocessing and PyTorch serving flow to support mixed-modality inputs (notably image+video) with per-modality processor kwargs, targeting specific new-style models (Qwen3-VL / Qwen3.5 / InternS1Pro / GLM4.1v) while keeping legacy behavior for others. It also adds comprehensive documentation for OpenAI-style multimodal message formats.
Changes:
- Introduce a new "engine-aligned" preprocess path that uses `apply_chat_template` → `preprocess(input_prompt, mm_processor_kwargs)` and returns `{input_ids, multimodal}`.
- Add mixed-modality handling in PyTorch models via a unified `multimodal_mask` and per-item offsets.
- Add new multimodal input documentation (EN/ZH) and update related tests.
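As a rough illustration of the flow those bullets describe, here is a toy sketch; all names, token ids, and return values are hypothetical stand-ins, not the actual LMDeploy API:

```python
# hypothetical stand-ins for the engine-aligned path; not the LMDeploy API
def apply_chat_template(messages):
    # flatten the conversation into one prompt string with modality placeholders
    return ''.join('<image>' if item.get('type') == 'image_url' else item.get('text', '')
                   for msg in messages for item in msg['content'])

def preprocess(input_prompt, mm_processor_kwargs=None):
    # the new path returns {input_ids, multimodal}: token ids plus one
    # entry per multimodal item with its offset into input_ids
    return {
        'input_ids': [101, 32000, 32000, 102],  # fake ids; 32000 marks image tokens
        'multimodal': [{'modality': 'image', 'offset': 1, 'length': 2}],
    }

messages = [{'role': 'user', 'content': [
    {'type': 'image_url', 'image_url': {'url': 'a.jpg'}},
    {'type': 'text', 'text': 'Describe this.'},
]}]
prompt = apply_chat_template(messages)
out = preprocess(prompt, mm_processor_kwargs={'image': {'max_pixels': 1280 * 28 * 28}})
print(prompt)  # <image>Describe this.
```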
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/test_vl/test_qwen3vl_processor.py | Updates preprocessing test flow to match engine pipeline; adds mixed image+video tests and per-modality kwargs usage. |
| tests/test_lmdeploy/test_vl/test_hf_chat_template.py | Adds/adjusts chat template tests, switching Qwen3-VL to apply_chat_template API. |
| lmdeploy/vl/model/qwen3_5.py | Simplifies preprocessor build by deferring to superclass logic. |
| lmdeploy/vl/model/qwen3.py | Removes legacy per-modality preprocess/packing logic; keeps apply_chat_template and special-token setup for new flow. |
| lmdeploy/vl/model/interns1_pro.py | Adds time-series special tokens and a time-series processor; updates to apply_chat_template. |
| lmdeploy/vl/model/glm4_1v.py | Introduces mm_tokens and switches to apply_chat_template. |
| lmdeploy/vl/model/base.py | Adds new shared preprocess implementation returning {input_ids, multimodal} with per-item expansion/offset computation. |
| lmdeploy/vl/engine.py | Adds apply_chat_template wrapper and signature-based preprocess kwargs passing (input_prompt/mm kwargs). |
| lmdeploy/vl/constants.py | Changes Modality enum behavior to support string comparisons/hashing without inheriting from str. |
| lmdeploy/serve/processors/multimodal.py | Detects new preprocess API by signature; for PyTorch backend uses apply_chat_template + new preprocess path. |
| lmdeploy/serve/core/async_engine.py | Makes prompt logging tolerant of missing prompt in prompt_input. |
| lmdeploy/pytorch/models/utils/model.py | Adds get_multimodal_mask helper to combine image/video/time-series token masks. |
| lmdeploy/pytorch/models/qwen3_vl_moe.py | Renames image_mask to multimodal_mask and uses it for embedding scatter. |
| lmdeploy/pytorch/models/qwen3_vl.py | Updates generation prep and multimodal packing to use offsets + multimodal_mask; adjusts grid_thw stacking and mRoPE helper. |
| lmdeploy/pytorch/models/qwen3_5.py | Same multimodal_mask refactor as Qwen3-VL. |
| lmdeploy/pytorch/models/interns1_pro.py | Updates image/video/time-series handling to use offsets and multimodal_mask; switches grid_thw to stack. |
| lmdeploy/pytorch/models/glm4_1v.py | Switches grid_thw concatenation to stacking; adds a dedicated Glm4vInputProcessor using offsets. |
| lmdeploy/pytorch/messages.py | Changes multimodal range filtering logic in HistoryMultiModals.get_datas. |
| lmdeploy/pytorch/configurations/glm4_1v.py | Adds GLM4v config builder to align bos_token_id handling. |
| docs/zh_cn/multi_modal/vl_pipeline.md | Links to the new multimodal input reference doc. |
| docs/zh_cn/multi_modal/multimodal_inputs.md | Adds detailed ZH multimodal message format reference and examples. |
| docs/zh_cn/multi_modal/index.rst | Adds the new guide to the ZH multi-modal docs toctree. |
| docs/en/multi_modal/vl_pipeline.md | Links to the new multimodal input reference doc. |
| docs/en/multi_modal/multimodal_inputs.md | Adds detailed EN multimodal message format reference and examples. |
| docs/en/multi_modal/index.rst | Adds the new guide to the EN multi-modal docs toctree. |
```python
start_positions = (mask & ~torch.roll(mask, 1)).nonzero(as_tuple=True)[0]
end_positions = (mask & ~torch.roll(mask, -1)).nonzero(as_tuple=True)[0]
end_positions += 1  # convert to exclusive end index, compatible with legacy pytorch implementation
```
`get_mm_items_offset` uses `torch.roll` to detect segment starts/ends. This breaks for edge cases (e.g., sequence length 1 where the only token is `mm_token_id`, or when the first/last positions are multimodal tokens) because roll wraps around. Consider using a non-wrapping boundary approach (e.g., diff with prepended/appended `False`) to compute start/end indices robustly.
Suggested change:

```diff
- start_positions = (mask & ~torch.roll(mask, 1)).nonzero(as_tuple=True)[0]
- end_positions = (mask & ~torch.roll(mask, -1)).nonzero(as_tuple=True)[0]
- end_positions += 1  # convert to exclusive end index, compatible with legacy pytorch implementation
+ prev_mask = torch.cat((mask.new_zeros(1), mask[:-1]))
+ next_mask = torch.cat((mask[1:], mask.new_zeros(1)))
+ start_positions = (mask & ~prev_mask).nonzero(as_tuple=True)[0]
+ end_positions = (mask & ~next_mask).nonzero(as_tuple=True)[0]
+ end_positions += 1  # convert to exclusive end index, compatible with legacy pytorch implementation
```
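The wrap-around failure is easy to reproduce in isolation. A NumPy sketch of both boundary strategies (NumPy stands in for the torch tensor ops; the logic is the same):

```python
import numpy as np

def offsets_roll(mask):
    # roll-based boundary detection, as in the original code: wraps at the ends
    starts = np.flatnonzero(mask & ~np.roll(mask, 1))
    ends = np.flatnonzero(mask & ~np.roll(mask, -1)) + 1
    return list(zip(starts.tolist(), ends.tolist()))

def offsets_padded(mask):
    # non-wrapping variant: pad with False so leading/trailing segments survive
    prev_mask = np.concatenate(([False], mask[:-1]))
    next_mask = np.concatenate((mask[1:], [False]))
    starts = np.flatnonzero(mask & ~prev_mask)
    ends = np.flatnonzero(mask & ~next_mask) + 1
    return list(zip(starts.tolist(), ends.tolist()))

# a sequence whose first and last tokens are multimodal tokens
mask = np.array([True, True, False, True])
print(offsets_roll(mask))    # boundaries pair up wrong because roll wraps
print(offsets_padded(mask))  # [(0, 2), (3, 4)]
```

With a length-1 mask of a single multimodal token, the roll version finds no segments at all, while the padded version correctly reports `[(0, 1)]`.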
```diff
  for modal_data in modal_datas:
-     if (modal_data.start not in test_range and modal_data.end - 1 not in test_range):
+     if (modal_data.start not in test_range or modal_data.end - 1 not in test_range):
          continue
```
The range check in `get_datas` is incorrect for interval overlap. As written, it only includes multimodal data when both `start` and `end - 1` are inside `[start, end)`, which drops partial overlaps and also fails the case where a multimodal span fully covers the query range. Use a proper overlap test (e.g., `modal_data.start < end and modal_data.end > start`).
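In isolation, the difference between the membership check and a half-open overlap test looks like this (standalone sketch, not the LMDeploy code):

```python
def overlaps_buggy(modal_start, modal_end, start, end):
    # membership check: only true when BOTH endpoints fall inside [start, end)
    r = range(start, end)
    return modal_start in r and modal_end - 1 in r

def overlaps(modal_start, modal_end, start, end):
    # standard half-open interval overlap test
    return modal_start < end and modal_end > start

# a multimodal span [0, 10) that fully covers the query range [5, 8)
print(overlaps_buggy(0, 10, 5, 8))  # False: the span is wrongly dropped
print(overlaps(0, 10, 5, 8))        # True

# a partial overlap: span [6, 12) vs range [5, 8)
print(overlaps_buggy(6, 12, 5, 8))  # False: also dropped
print(overlaps(6, 12, 5, 8))        # True
```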
- glm4_1v: guard `chat_template_kwargs` against `None` before `**` expansion
- base: use a local `time_series_processor` to avoid mutating `self.processor`
- base: fix `preprocess` return type annotation `list[dict]` -> `dict[str, Any]`
- base: lower the valid size-override log from WARNING to INFO

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
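The `None` guard in the first item addresses a standard Python gotcha: `**`-expanding `None` raises `TypeError`, so callers must default to an empty dict first. A minimal illustration with a hypothetical `render` helper:

```python
def render(template, chat_template_kwargs=None):
    # expanding **None raises TypeError, so default to an empty dict first
    return template.format(**(chat_template_kwargs or {}))

print(render('Hello {name}', {'name': 'world'}))  # Hello world
print(render('Hello'))                            # Hello
```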
Refactor the VLM preprocessing pipeline to support mixed modality (image + video in one request), and extend `forward` to handle per-modality multimodal masks.

For backward compatibility, the new pipeline is opt-in: models that override `preprocess(self, messages)` continue to use the old path unchanged. New-style models inherit the base implementation and are detected automatically via `inspect.signature` at init.

New-style models: Qwen3-VL, Qwen3.5-VL, InternS1-Pro, GLM4.1V
**`lmdeploy/vl/model/base.py`**

- `preprocess(messages, input_prompt, mm_processor_kwargs)` — collects all modalities from messages, calls the HF processor once with `images_kwargs`/`videos_kwargs` to route per-modality size overrides independently (no cross-modality bleed), and returns one item per multimodal token
- `get_override_size(processor, mm_processor_kwargs, modality)` — resolves `min_pixels`/`max_pixels` from `mm_processor_kwargs['image']` or `mm_processor_kwargs['video']` independently
- `get_expanded_input_ids`/`get_expanded_mm_items` — expand placeholder tokens into per-token multimodal items for the PT engine
- `MultimodalSpecialTokens` dataclass — centralises image/video/audio/time-series token IDs per model

**`lmdeploy/serve/processors/multimodal.py`**

- Dispatches `preprocess` based on the `_uses_new_preprocess` flag cached at init
- `apply_chat_template` moved to the engine layer so the new-style path tokenises after preprocessing

**`lmdeploy/pytorch/models/utils/model.py`**

- `get_multimodal_mask` — builds a unified position mask across image, video, and time-series tokens for use in `forward`

**`lmdeploy/pytorch/models/`** (Qwen3-VL, Qwen3.5, InternS1-Pro, GLM4.1V)

- `preprocess_input` updated to unpack new-style per-modality items and build correct `MultiModalData`
- `forward` uses `get_multimodal_mask` to scatter visual embeddings at the right positions

**`lmdeploy/vl/constants.py`**

- `Modality` enum supports `==` comparison with plain strings for legacy compatibility
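A toy sketch of the unified mask that `get_multimodal_mask` produces; the token ids below are hypothetical, and the real helper reads them from the model config and operates on torch tensors:

```python
import numpy as np

# hypothetical placeholder token ids; real values come from the model config
IMAGE_TOKEN_ID, VIDEO_TOKEN_ID, TS_TOKEN_ID = 32001, 32002, 32003

def get_multimodal_mask(input_ids, token_ids=(IMAGE_TOKEN_ID, VIDEO_TOKEN_ID, TS_TOKEN_ID)):
    # union of the per-modality placeholder masks: one boolean position mask
    # that forward can use to scatter visual embeddings
    mask = np.zeros(input_ids.shape, dtype=bool)
    for tid in token_ids:
        mask |= (input_ids == tid)
    return mask

ids = np.array([1, 32001, 32001, 7, 32002, 2])
print(get_multimodal_mask(ids).tolist())  # [False, True, True, False, True, False]
```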
**Docs**

- `multimodal_inputs.md` (EN + ZH) covering all modalities, local/base64 inputs, `mm_processor_kwargs` and `media_io_kwargs`
**Tests**

- `test_qwen3vl_processor.py`: per-modality `min_pixels`/`max_pixels` override tests and a mixed image+video independence test
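A standalone sketch of the per-modality override resolution those tests exercise (a hypothetical simplification, not the actual `get_override_size` signature):

```python
def get_override_size(mm_processor_kwargs, modality):
    # pull min_pixels/max_pixels for exactly one modality; other modalities
    # never leak in, which is the "no cross-modality bleed" property
    kwargs = (mm_processor_kwargs or {}).get(modality, {})
    return {k: kwargs[k] for k in ('min_pixels', 'max_pixels') if k in kwargs}

overrides = {'image': {'min_pixels': 1024}, 'video': {'max_pixels': 2048}}
print(get_override_size(overrides, 'image'))  # {'min_pixels': 1024}
print(get_override_size(overrides, 'video'))  # {'max_pixels': 2048}
```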