One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder.

We here decouple the multimodal modeling, dividing it into separate, focused autoregressive models that process the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we propose to further partition the video and audio sequences into consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features, producing compact but expressive representations per snippet.

Our approach achieves the state of the art on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
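The snippet-partitioning idea in the abstract can be sketched in a few lines. The code below is a minimal stand-in, not the paper's implementation: the real Combiner is a learned Transformer module, whereas here it is approximated by concatenation and mean-pooling; all shapes and constants (`SNIPPET_LEN`, `NUM_LATENTS`, `DIM`) are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical constants for illustration; the paper's actual values differ.
SNIPPET_LEN = 4   # time-aligned frames per snippet
NUM_LATENTS = 2   # compact latent vectors the Combiner emits per snippet
DIM = 8           # feature dimension shared by audio and video

def combiner(video_feats, audio_feats, num_latents=NUM_LATENTS):
    """Fuse one snippet's audio + video features into a few compact latents.

    Stand-in for the learned Combiner: concatenate the two modalities along
    the time axis and mean-pool into `num_latents` groups.
    """
    joint = np.concatenate([video_feats, audio_feats], axis=0)   # (T_v + T_a, DIM)
    groups = np.array_split(joint, num_latents, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])            # (num_latents, DIM)

def encode_media(video, audio, snippet_len=SNIPPET_LEN):
    """Partition time-aligned video/audio into snippets and combine each.

    The output sequence length is set by the number of snippets and latents,
    not by the raw frame count, which keeps the autoregressive model's input
    short even for long videos.
    """
    latents = []
    for t in range(0, len(video), snippet_len):
        latents.append(combiner(video[t:t + snippet_len],
                                audio[t:t + snippet_len]))
    return np.concatenate(latents)   # (num_snippets * NUM_LATENTS, DIM)

video = np.random.randn(16, DIM)   # 16 aligned video frames
audio = np.random.randn(16, DIM)   # 16 aligned audio frames
latents = encode_media(video, audio)
print(latents.shape)               # (8, 8): 4 snippets x 2 latents each
```

An autoregressive model over these latents would then predict each snippet's representation from the previous ones, which is how the approach controls sequence length while still modeling dependencies in time.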
Mirasol3B: A Multimodal Autoregressive model for time-aligned and
contextual modalities, AJ Piergiovanni+, N/A, arXiv'23
Nov 27, 2023
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
In this work, we tackle this challenge by decoupling the multimodal modeling into separate, focused autoregressive models that process the inputs according to the characteristics of each modality. We propose a multimodal model called Mirasol3B, which consists of an autoregressive component for the time-synchronized modalities (audio and video) and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential. To handle the long sequences of video and audio, we propose further partitioning the video and audio sequences into consecutive snippets and processing their representations autoregressively. To that end, we propose a Combiner mechanism that jointly models the audio and video information within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals and to fuse these features into compact yet expressive representations per snippet.
Our method achieves state-of-the-art performance on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their temporal dependencies.
Summary (by gpt-3.5-turbo)