Support Interleaved Multimodal Sequences and Long-Context Training Infrastructure #4

@Sishxo

Hi, thanks for the great work on Cheers! I would like to suggest adding support for fully interleaved multimodal sequences covering both input and output, where text and images can appear alternately (e.g., image → text → image → text) in both prompt and generated responses. This is important for realistic multimodal scenarios such as multi-image reasoning, document-style understanding, and unified multimodal modeling, but current pipelines are mostly limited to single-image input and text-only generation. Supporting this would require unified formatting/tokenization, modality-aware masking for training (loss and attention), and consistent handling in both training and inference. This feature would significantly improve Cheers’ ability to model richer multimodal interactions and long-context multimodal sequences.
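
To make the masking requirement more concrete, here is a minimal sketch of what modality-aware loss and attention masks could look like for an interleaved sequence. Everything in it is an assumption for illustration: `Segment`, `flatten`, and `attention_mask` are hypothetical names and not part of Cheers, the token ids are placeholders, and the rule "causal for text, bidirectional inside one image segment" is just one common choice, not necessarily the right one for this model. Which objective is applied at image positions is a separate modeling decision that the sketch does not address.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One contiguous run of tokens from a single modality (hypothetical type)."""
    modality: str      # "text" or "image"
    role: str          # "prompt" or "response"
    tokens: List[int]  # placeholder token ids


def flatten(segments: List[Segment]) -> Tuple[List[int], List[str], List[bool], List[int]]:
    """Concatenate segments and return per-position metadata.

    loss_mask is True only on response positions, so prompt text and
    prompt images contribute no training loss.
    """
    tokens, modality, loss_mask, seg_ids = [], [], [], []
    for i, seg in enumerate(segments):
        n = len(seg.tokens)
        tokens += seg.tokens
        modality += [seg.modality] * n
        loss_mask += [seg.role == "response"] * n
        seg_ids += [i] * n
    return tokens, modality, loss_mask, seg_ids


def attention_mask(modality: List[str], seg_ids: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Assumed rule: causal everywhere, plus full (bidirectional) attention
    among the tokens of the same image segment.
    """
    n = len(seg_ids)
    mask = [[k <= q for k in range(n)] for q in range(n)]
    for q in range(n):
        if modality[q] != "image":
            continue
        for k in range(q + 1, n):
            if seg_ids[k] == seg_ids[q]:
                mask[q][k] = True
    return mask


# Example: image -> text in the prompt, image -> text in the response.
segments = [
    Segment("image", "prompt", [101, 102]),
    Segment("text", "prompt", [1, 2]),
    Segment("image", "response", [201, 202]),
    Segment("text", "response", [3]),
]
tokens, modality, loss_mask, seg_ids = flatten(segments)
mask = attention_mask(modality, seg_ids)
```

In this example only the last three positions (the generated image and the generated text) carry loss, and the two tokens of each image can see each other while text positions stay strictly causal. The same `flatten` output could be reused at inference time so that training and generation share one formatting path.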
