Hi, thanks for the great work on Cheers! I would like to suggest adding support for fully interleaved multimodal sequences on both the input and output side, where text and images can alternate (e.g., image → text → image → text) in the prompt as well as in the generated response. This matters for realistic multimodal scenarios such as multi-image reasoning, document-style understanding, and unified multimodal modeling, whereas current pipelines are mostly limited to single-image input and text-only generation. Supporting this would require three things: a unified formatting/tokenization scheme for interleaved sequences, modality-aware masking during training (for both the loss and attention), and consistent handling across training and inference. This feature would significantly improve Cheers’ ability to model richer multimodal interactions and long-context multimodal sequences.
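
To make the modality-aware masking part more concrete, here is a rough sketch of the kind of thing I have in mind. It is only an illustration and not a description of how Cheers works today: `Segment`, `build_masks`, and the token counts are hypothetical names and values, and letting tokens inside one image block attend to each other bidirectionally (while everything else stays causal) is just one possible design choice.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One contiguous span of an interleaved sequence (hypothetical structure)."""
    modality: str    # "text" or "image"
    length: int      # number of token positions the span occupies
    is_target: bool  # True if the model should be trained to generate this span


def build_masks(
    segments: List[Segment],
) -> Tuple[List[str], List[int], List[List[bool]]]:
    """Return (modality_ids, loss_mask, attn_mask) for one interleaved sequence.

    Assumptions made for this sketch:
      * loss_mask is 1 on target positions and 0 on prompt positions;
        a real trainer would still apply the usual next-token shift and
        could route text and image positions to different loss terms.
      * attn_mask[q][k] is True when query position q may attend to key k.
        Attention is causal, except that positions inside the same image
        block can see each other in both directions.
    """
    modality_ids: List[str] = []
    loss_mask: List[int] = []
    block_ids: List[int] = []  # index of the segment each position belongs to

    for idx, seg in enumerate(segments):
        modality_ids += [seg.modality] * seg.length
        loss_mask += [1 if seg.is_target else 0] * seg.length
        block_ids += [idx] * seg.length

    n = len(modality_ids)
    attn_mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            same_image_block = (
                modality_ids[q] == "image" and block_ids[q] == block_ids[k]
            )
            attn_mask[q][k] = causal or same_image_block

    return modality_ids, loss_mask, attn_mask


# Example: prompt = image -> text, response = image -> text.
segments = [
    Segment("image", 2, is_target=False),
    Segment("text", 2, is_target=False),
    Segment("image", 2, is_target=True),
    Segment("text", 1, is_target=True),
]
modality_ids, loss_mask, attn_mask = build_masks(segments)
```

In this example only the response positions contribute to the loss, and `modality_ids` tells the trainer which positions would go to a text loss and which to an image loss. The same segment list could be reused at inference time to decide when decoding switches between modalities, which is what I mean by consistent handling across training and inference.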