Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it remains challenging to design a unified network for processing various modalities ($\textit{e.g.}$, natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a $\textbf{frozen}$ encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components (a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks), Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks, including fundamental perception (text, image, point cloud, audio, video), practical applications (X-ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series data). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer
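
The pipeline the abstract describes (modality-specific tokenizer → frozen modality-shared encoder → task-specific head) can be illustrated with a minimal PyTorch sketch. This is not the actual API of https://github.com/invictus717/MetaTransformer: the class names and dimensions are illustrative, and a plain `nn.TransformerEncoder` stands in for the paper's pretrained ViT-style backbone. What the sketch does reflect from the abstract is the training split: only the tokenizer and head carry gradients, while the shared encoder stays frozen.

```python
# Minimal sketch of the Meta-Transformer pipeline, under the assumptions above.
# All names here are hypothetical, not the repository's actual API.
import torch
import torch.nn as nn

class ImageTokenizer(nn.Module):
    """Example data-to-sequence tokenizer: patch-embeds a 2D image.
    Other modalities would use their own tokenizers mapping into the same token space."""
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                           # x: (B, C, H, W)
        tokens = self.proj(x)                       # (B, dim, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)    # (B, num_tokens, dim)

class MetaTransformerSketch(nn.Module):
    def __init__(self, tokenizer, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        self.tokenizer = tokenizer                  # modality-specific, trainable
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.encoder.parameters():         # modality-shared encoder stays frozen
            p.requires_grad = False
        self.head = nn.Linear(dim, num_classes)     # task-specific, trainable

    def forward(self, x):
        tokens = self.tokenizer(x)                  # map raw input into the shared token space
        feats = self.encoder(tokens)                # frozen parameters extract semantic features
        return self.head(feats.mean(dim=1))         # pool tokens, apply task head

model = MetaTransformerSketch(ImageTokenizer())
logits = model(torch.randn(2, 3, 224, 224))         # -> (2, 1000)
```

Swapping `ImageTokenizer` for, say, a point-cloud or audio tokenizer changes nothing downstream, which is the point of the shared token space: per the abstract, one frozen encoder serves all 12 modalities.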