One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder.

We here decouple the multimodal modeling, dividing it into separate, focused autoregressive models that process the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we propose to further partition the video and audio sequences into consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features, producing compact but expressive representations per snippet.

Our approach achieves the state of the art on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
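The snippet-partitioning idea in the abstract can be sketched in a few lines. The code below is a minimal stand-in, not the paper's implementation: the real Combiner is a learned Transformer module, whereas here it is approximated by concatenation and mean-pooling; all shapes and constants (`SNIPPET_LEN`, `NUM_LATENTS`, `DIM`) are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical constants for illustration; the paper's actual values differ.
SNIPPET_LEN = 4   # time-aligned frames per snippet
NUM_LATENTS = 2   # compact latent vectors the Combiner emits per snippet
DIM = 8           # feature dimension shared by audio and video

def combiner(video_feats, audio_feats, num_latents=NUM_LATENTS):
    """Fuse one snippet's audio + video features into a few compact latents.

    Stand-in for the learned Combiner: concatenate the two modalities along
    the time axis and mean-pool into `num_latents` groups.
    """
    joint = np.concatenate([video_feats, audio_feats], axis=0)   # (T_v + T_a, DIM)
    groups = np.array_split(joint, num_latents, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])            # (num_latents, DIM)

def encode_media(video, audio, snippet_len=SNIPPET_LEN):
    """Partition time-aligned video/audio into snippets and combine each.

    The output sequence length is set by the number of snippets and latents,
    not by the raw frame count, which keeps the autoregressive model's input
    short even for long videos.
    """
    latents = []
    for t in range(0, len(video), snippet_len):
        latents.append(combiner(video[t:t + snippet_len],
                                audio[t:t + snippet_len]))
    return np.concatenate(latents)   # (num_snippets * NUM_LATENTS, DIM)

video = np.random.randn(16, DIM)   # 16 aligned video frames
audio = np.random.randn(16, DIM)   # 16 aligned audio frames
latents = encode_media(video, audio)
print(latents.shape)               # (8, 8): 4 snippets x 2 latents each
```

An autoregressive model over these latents would then predict each snippet's representation from the previous ones, which is how the approach controls sequence length while still modeling dependencies in time.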
Mirasol3B: A Multimodal Autoregressive model for time-aligned and
contextual modalities, AJ Piergiovanni+, N/A, arXiv'23
Nov 27, 2023
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
In this work, we tackle this challenge by decoupling the multimodal modeling into separate, focused autoregressive models that process the inputs according to the characteristics of each modality. We propose a multimodal model called Mirasol3B, which consists of an autoregressive component for the time-synchronized modalities (audio and video) and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential. To handle the long sequences of video and audio, we propose further partitioning the video and audio sequences into consecutive snippets and processing their representations autoregressively. To that end, we propose a Combiner mechanism that jointly models the audio and video information within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals and to fuse these features into compact yet expressive representations per snippet.
Our method achieves state-of-the-art performance on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their temporal dependencies.
Summary (by gpt-3.5-turbo)