Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it remains challenging to design a unified network for processing various modalities ($\textit{e.g.}$, natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a $\textbf{frozen}$ encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components (a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks), Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks, including fundamental perception (text, image, point cloud, audio, video), practical applications (X-ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series data). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer
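
The pipeline the abstract describes (modality-specific tokenizer → frozen modality-shared encoder → task-specific head) can be illustrated with a minimal PyTorch sketch. This is not the actual API of https://github.com/invictus717/MetaTransformer: the class names and dimensions are illustrative, and a plain `nn.TransformerEncoder` stands in for the paper's pretrained ViT-style backbone. What the sketch does reflect from the abstract is the training split: only the tokenizer and head carry gradients, while the shared encoder stays frozen.

```python
# Minimal sketch of the Meta-Transformer pipeline, under the assumptions above.
# All names here are hypothetical, not the repository's actual API.
import torch
import torch.nn as nn

class ImageTokenizer(nn.Module):
    """Example data-to-sequence tokenizer: patch-embeds a 2D image.
    Other modalities would use their own tokenizers mapping into the same token space."""
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                           # x: (B, C, H, W)
        tokens = self.proj(x)                       # (B, dim, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)    # (B, num_tokens, dim)

class MetaTransformerSketch(nn.Module):
    def __init__(self, tokenizer, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        self.tokenizer = tokenizer                  # modality-specific, trainable
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.encoder.parameters():         # modality-shared encoder stays frozen
            p.requires_grad = False
        self.head = nn.Linear(dim, num_classes)     # task-specific, trainable

    def forward(self, x):
        tokens = self.tokenizer(x)                  # map raw input into the shared token space
        feats = self.encoder(tokens)                # frozen parameters extract semantic features
        return self.head(feats.mean(dim=1))         # pool tokens, apply task head

model = MetaTransformerSketch(ImageTokenizer())
logits = model(torch.randn(2, 3, 224, 224))         # -> (2, 1000)
```

Swapping `ImageTokenizer` for, say, a point-cloud or audio tokenizer changes nothing downstream, which is the point of the shared token space: per the abstract, one frozen encoder serves all 12 modalities.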