Support Interleaved Multimodal Sequences and Long-Context Training Infrastructure #4

Hi, thanks for the great work on Cheers! I would like to suggest adding support for fully interleaved multimodal sequences on both the input and output side, where text and images alternate (e.g., image → text → image → text) in the prompt as well as in the generated response.

This matters for realistic multimodal scenarios such as multi-image reasoning, document-style understanding, and unified multimodal modeling, whereas current pipelines are mostly limited to single-image input and text-only generation. Supporting it would require a unified format and tokenization for interleaved data, modality-aware masking during training (for both the loss and attention), and consistent handling between training and inference.

This feature would significantly improve Cheers’ ability to model richer multimodal interactions and long-context multimodal sequences.
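To make the request more concrete, here is a minimal sketch of what the unified formatting and modality-aware masks could look like. It is not based on Cheers' actual code: the `Segment` class, the `BOI`/`EOI`/`IMG` placeholder ids, and the choice of bidirectional attention inside an image block are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder ids, not Cheers' real vocabulary.
BOI, EOI, IMG = -1, -2, -3  # begin-of-image, end-of-image, image patch slot


@dataclass
class Segment:
    modality: str       # "text" or "image"
    tokens: List[int]   # text token ids, or one IMG placeholder per patch
    is_response: bool   # True if the model is expected to generate this segment


def flatten(segments):
    """Flatten interleaved segments into three parallel per-token lists."""
    ids, loss_mask, block = [], [], []
    image_index = 0
    for seg in segments:
        if seg.modality == "image":
            image_index += 1
            toks = [BOI] + list(seg.tokens) + [EOI]
            blk = [image_index] * len(toks)  # same id for the whole image block
        else:
            toks = list(seg.tokens)
            blk = [0] * len(toks)            # 0 marks text positions
        ids += toks
        # Marks which tokens are training targets. Whether image tokens get a
        # token-level loss depends on how images are generated (assumption here).
        loss_mask += [seg.is_response] * len(toks)
        block += blk
    return ids, loss_mask, block


def attention_mask(block):
    """mask[i][j] is True when position i may attend to position j.

    Causal everywhere, plus full attention within the same image block.
    """
    n = len(block)
    return [
        [j <= i or (block[i] != 0 and block[i] == block[j]) for j in range(n)]
        for i in range(n)
    ]
```

For an image → text → image → text sample, the prompt segments would be passed with `is_response=False` and the generated segments with `is_response=True`, so the same flattening could be reused at inference time by simply stopping after the prompt.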
