Add multimodal_inputs.md covering all modalities (text, image, video, audio, time series, mixed) with OpenAI-style examples, local file / base64 usage via lmdeploy.vl.utils helpers, and mm_processor_kwargs / media_io_kwargs guidance. Link from vl_pipeline.md and index.rst. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
This PR extends LMDeploy’s VLM preprocessing and PyTorch serving flow to support mixed-modality inputs (notably image+video) with per-modality processor kwargs, targeting specific new-style models (Qwen3-VL / Qwen3.5 / InternS1Pro / GLM4.1v) while keeping legacy behavior for others. It also adds comprehensive documentation for OpenAI-style multimodal message formats.
Changes:
- Introduce a new "engine-aligned" preprocess path that uses `apply_chat_template` → `preprocess(input_prompt, mm_processor_kwargs)` and returns `{input_ids, multimodal}`.
- Add mixed-modality handling in PyTorch models via a unified `multimodal_mask` and per-item offsets.
- Add new multimodal input documentation (EN/ZH) and update related tests.
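As a rough illustration of the flow those bullets describe, here is a toy sketch; all names, token ids, and return values are hypothetical stand-ins, not the actual LMDeploy API:

```python
# hypothetical stand-ins for the engine-aligned path; not the LMDeploy API
def apply_chat_template(messages):
    # flatten the conversation into one prompt string with modality placeholders
    return ''.join('<image>' if item.get('type') == 'image_url' else item.get('text', '')
                   for msg in messages for item in msg['content'])

def preprocess(input_prompt, mm_processor_kwargs=None):
    # the new path returns {input_ids, multimodal}: token ids plus one
    # entry per multimodal item with its offset into input_ids
    return {
        'input_ids': [101, 32000, 32000, 102],  # fake ids; 32000 marks image tokens
        'multimodal': [{'modality': 'image', 'offset': 1, 'length': 2}],
    }

messages = [{'role': 'user', 'content': [
    {'type': 'image_url', 'image_url': {'url': 'a.jpg'}},
    {'type': 'text', 'text': 'Describe this.'},
]}]
prompt = apply_chat_template(messages)
out = preprocess(prompt, mm_processor_kwargs={'image': {'max_pixels': 1280 * 28 * 28}})
print(prompt)  # <image>Describe this.
```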
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/test_vl/test_qwen3vl_processor.py | Updates preprocessing test flow to match engine pipeline; adds mixed image+video tests and per-modality kwargs usage. |
| tests/test_lmdeploy/test_vl/test_hf_chat_template.py | Adds/adjusts chat template tests, switching Qwen3-VL to apply_chat_template API. |
| lmdeploy/vl/model/qwen3_5.py | Simplifies preprocessor build by deferring to superclass logic. |
| lmdeploy/vl/model/qwen3.py | Removes legacy per-modality preprocess/packing logic; keeps apply_chat_template and special-token setup for new flow. |
| lmdeploy/vl/model/interns1_pro.py | Adds time-series special tokens and a time-series processor; updates to apply_chat_template. |
| lmdeploy/vl/model/glm4_1v.py | Introduces mm_tokens and switches to apply_chat_template. |
| lmdeploy/vl/model/base.py | Adds new shared preprocess implementation returning {input_ids, multimodal} with per-item expansion/offset computation. |
| lmdeploy/vl/engine.py | Adds apply_chat_template wrapper and signature-based preprocess kwargs passing (input_prompt/mm kwargs). |
| lmdeploy/vl/constants.py | Changes Modality enum behavior to support string comparisons/hashing without inheriting from str. |
| lmdeploy/serve/processors/multimodal.py | Detects new preprocess API by signature; for PyTorch backend uses apply_chat_template + new preprocess path. |
| lmdeploy/serve/core/async_engine.py | Makes prompt logging tolerant of missing prompt in prompt_input. |
| lmdeploy/pytorch/models/utils/model.py | Adds get_multimodal_mask helper to combine image/video/time-series token masks. |
| lmdeploy/pytorch/models/qwen3_vl_moe.py | Renames image_mask to multimodal_mask and uses it for embedding scatter. |
| lmdeploy/pytorch/models/qwen3_vl.py | Updates generation prep and multimodal packing to use offsets + multimodal_mask; adjusts grid_thw stacking and mRoPE helper. |
| lmdeploy/pytorch/models/qwen3_5.py | Same multimodal_mask refactor as Qwen3-VL. |
| lmdeploy/pytorch/models/interns1_pro.py | Updates image/video/time-series handling to use offsets and multimodal_mask; switches grid_thw to stack. |
| lmdeploy/pytorch/models/glm4_1v.py | Switches grid_thw concatenation to stacking; adds a dedicated Glm4vInputProcessor using offsets. |
| lmdeploy/pytorch/messages.py | Changes multimodal range filtering logic in HistoryMultiModals.get_datas. |
| lmdeploy/pytorch/configurations/glm4_1v.py | Adds GLM4v config builder to align bos_token_id handling. |
| docs/zh_cn/multi_modal/vl_pipeline.md | Links to the new multimodal input reference doc. |
| docs/zh_cn/multi_modal/multimodal_inputs.md | Adds detailed ZH multimodal message format reference and examples. |
| docs/zh_cn/multi_modal/index.rst | Adds the new guide to the ZH multi-modal docs toctree. |
| docs/en/multi_modal/vl_pipeline.md | Links to the new multimodal input reference doc. |
| docs/en/multi_modal/multimodal_inputs.md | Adds detailed EN multimodal message format reference and examples. |
| docs/en/multi_modal/index.rst | Adds the new guide to the EN multi-modal docs toctree. |
```python
start_positions = (mask & ~torch.roll(mask, 1)).nonzero(as_tuple=True)[0]
end_positions = (mask & ~torch.roll(mask, -1)).nonzero(as_tuple=True)[0]
end_positions += 1  # convert to exclusive end index, compatible with legacy pytorch implementation
```
`get_mm_items_offset` uses `torch.roll` to detect segment starts/ends. This breaks for edge cases (e.g., sequence length 1 where the only token is `mm_token_id`, or when the first/last positions are multimodal tokens) because roll wraps around. Consider using a non-wrapping boundary approach (e.g., diff with prepended/appended `False`) to compute start/end indices robustly.
Suggested change:

```diff
- start_positions = (mask & ~torch.roll(mask, 1)).nonzero(as_tuple=True)[0]
- end_positions = (mask & ~torch.roll(mask, -1)).nonzero(as_tuple=True)[0]
- end_positions += 1  # convert to exclusive end index, compatible with legacy pytorch implementation
+ prev_mask = torch.cat((mask.new_zeros(1), mask[:-1]))
+ next_mask = torch.cat((mask[1:], mask.new_zeros(1)))
+ start_positions = (mask & ~prev_mask).nonzero(as_tuple=True)[0]
+ end_positions = (mask & ~next_mask).nonzero(as_tuple=True)[0]
+ end_positions += 1  # convert to exclusive end index, compatible with legacy pytorch implementation
```
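The wrap-around failure is easy to reproduce in isolation. A NumPy sketch of both boundary strategies (NumPy stands in for the torch tensor ops; the logic is the same):

```python
import numpy as np

def offsets_roll(mask):
    # roll-based boundary detection, as in the original code: wraps at the ends
    starts = np.flatnonzero(mask & ~np.roll(mask, 1))
    ends = np.flatnonzero(mask & ~np.roll(mask, -1)) + 1
    return list(zip(starts.tolist(), ends.tolist()))

def offsets_padded(mask):
    # non-wrapping variant: pad with False so leading/trailing segments survive
    prev_mask = np.concatenate(([False], mask[:-1]))
    next_mask = np.concatenate((mask[1:], [False]))
    starts = np.flatnonzero(mask & ~prev_mask)
    ends = np.flatnonzero(mask & ~next_mask) + 1
    return list(zip(starts.tolist(), ends.tolist()))

# a sequence whose first and last tokens are multimodal tokens
mask = np.array([True, True, False, True])
print(offsets_roll(mask))    # boundaries pair up wrong because roll wraps
print(offsets_padded(mask))  # [(0, 2), (3, 4)]
```

With a length-1 mask of a single multimodal token, the roll version finds no segments at all, while the padded version correctly reports `[(0, 1)]`.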
```diff
  for modal_data in modal_datas:
-     if (modal_data.start not in test_range and modal_data.end - 1 not in test_range):
+     if (modal_data.start not in test_range or modal_data.end - 1 not in test_range):
          continue
```
The range check in `get_datas` is incorrect for interval overlap. As written, it only includes multimodal data when both `start` and `end - 1` are inside `[start, end)`, which drops partial overlaps and also fails the case where a multimodal span fully covers the query range. Use a proper overlap test (e.g., `modal_data.start < end and modal_data.end > start`).
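In isolation, the difference between the membership check and a half-open overlap test looks like this (standalone sketch, not the LMDeploy code):

```python
def overlaps_buggy(modal_start, modal_end, start, end):
    # membership check: only true when BOTH endpoints fall inside [start, end)
    r = range(start, end)
    return modal_start in r and modal_end - 1 in r

def overlaps(modal_start, modal_end, start, end):
    # standard half-open interval overlap test
    return modal_start < end and modal_end > start

# a multimodal span [0, 10) that fully covers the query range [5, 8)
print(overlaps_buggy(0, 10, 5, 8))  # False: the span is wrongly dropped
print(overlaps(0, 10, 5, 8))        # True

# a partial overlap: span [6, 12) vs range [5, 8)
print(overlaps_buggy(6, 12, 5, 8))  # False: also dropped
print(overlaps(6, 12, 5, 8))        # True
```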
- glm4_1v: guard `chat_template_kwargs` against `None` before `**` expansion
- base: use a local `time_series_processor` to avoid mutating `self.processor`
- base: fix `preprocess` return type annotation `list[dict]` -> `dict[str, Any]`
- base: lower the valid size-override log from WARNING to INFO

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
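The `None` guard in the first item addresses a standard Python gotcha: `**`-expanding `None` raises `TypeError`, so callers must default to an empty dict first. A minimal illustration with a hypothetical `render` helper:

```python
def render(template, chat_template_kwargs=None):
    # expanding **None raises TypeError, so default to an empty dict first
    return template.format(**(chat_template_kwargs or {}))

print(render('Hello {name}', {'name': 'world'}))  # Hello world
print(render('Hello'))                            # Hello
```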
Refactor the VLM preprocessing pipeline to support mixed modality (image + video in one request), and extend `forward` to handle per-modality multimodal masks.

For backward compatibility, the new pipeline is opt-in: models that override `preprocess(self, messages)` continue to use the old path unchanged. New-style models inherit the base implementation and are detected automatically via `inspect.signature` at init.

New-style models: Qwen3-VL, Qwen3.5-VL, InternS1-Pro, GLM4.1V
**`lmdeploy/vl/model/base.py`**

- `preprocess(messages, input_prompt, mm_processor_kwargs)` — collects all modalities from messages, calls the HF processor once with `images_kwargs`/`videos_kwargs` to route per-modality size overrides independently (no cross-modality bleed), and returns one item per multimodal token
- `get_override_size(processor, mm_processor_kwargs, modality)` — resolves `min_pixels`/`max_pixels` from `mm_processor_kwargs['image']` or `mm_processor_kwargs['video']` independently
- `get_expanded_input_ids`/`get_expanded_mm_items` — expand placeholder tokens into per-token multimodal items for the PT engine
- `MultimodalSpecialTokens` dataclass — centralises image/video/audio/time-series token IDs per model

**`lmdeploy/serve/processors/multimodal.py`**

- Dispatches `preprocess` based on the `_uses_new_preprocess` flag cached at init
- `apply_chat_template` moved to the engine layer so the new-style path tokenises after preprocessing

**`lmdeploy/pytorch/models/utils/model.py`**

- `get_multimodal_mask` — builds a unified position mask across image, video, and time-series tokens for use in `forward`

**`lmdeploy/pytorch/models/`** (Qwen3-VL, Qwen3.5, InternS1-Pro, GLM4.1V)

- `preprocess_input` updated to unpack new-style per-modality items and build correct `MultiModalData`
- `forward` uses `get_multimodal_mask` to scatter visual embeddings at the right positions

**`lmdeploy/vl/constants.py`**

- `Modality` enum supports `==` comparison with plain strings for legacy compatibility
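A toy sketch of the unified mask that `get_multimodal_mask` produces; the token ids below are hypothetical, and the real helper reads them from the model config and operates on torch tensors:

```python
import numpy as np

# hypothetical placeholder token ids; real values come from the model config
IMAGE_TOKEN_ID, VIDEO_TOKEN_ID, TS_TOKEN_ID = 32001, 32002, 32003

def get_multimodal_mask(input_ids, token_ids=(IMAGE_TOKEN_ID, VIDEO_TOKEN_ID, TS_TOKEN_ID)):
    # union of the per-modality placeholder masks: one boolean position mask
    # that forward can use to scatter visual embeddings
    mask = np.zeros(input_ids.shape, dtype=bool)
    for tid in token_ids:
        mask |= (input_ids == tid)
    return mask

ids = np.array([1, 32001, 32001, 7, 32002, 2])
print(get_multimodal_mask(ids).tolist())  # [False, True, True, False, True, False]
```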
**Docs**

- `multimodal_inputs.md` (EN + ZH) covering all modalities, local/base64 inputs, `mm_processor_kwargs` and `media_io_kwargs`
**Tests**

- `test_qwen3vl_processor.py`: per-modality `min_pixels`/`max_pixels` override tests and a mixed image+video independence test
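A standalone sketch of the per-modality override resolution those tests exercise (a hypothetical simplification, not the actual `get_override_size` signature):

```python
def get_override_size(mm_processor_kwargs, modality):
    # pull min_pixels/max_pixels for exactly one modality; other modalities
    # never leak in, which is the "no cross-modality bleed" property
    kwargs = (mm_processor_kwargs or {}).get(modality, {})
    return {k: kwargs[k] for k in ('min_pixels', 'max_pixels') if k in kwargs}

overrides = {'image': {'min_pixels': 1024}, 'video': {'max_pixels': 2048}}
print(get_override_size(overrides, 'image'))  # {'min_pixels': 1024}
print(get_override_size(overrides, 'video'))  # {'max_pixels': 2048}
```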