On decoder-only architecture for speech-to-text and large language model integration, Jian Wu+, N/A, arXiv'23 #789

AkihikoWatanabe · 2023-07-11T10:58:40Z

URL

https://arxiv.org/abs/2307.03917

Affiliations

Jian Wu, N/A
Yashesh Gaur, N/A
Zhuo Chen, N/A
Long Zhou, N/A
Yimeng Zhu, N/A
Tianrui Wang, N/A
Jinyu Li, N/A
Shujie Liu, N/A
Bo Ren, N/A
Linquan Liu, N/A
Yu Wu, N/A

Abstract

Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller scale randomly initialized speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.

Translation (by gpt-3.5-turbo)

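Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs has not yet been sufficiently studied. The "decoder-only" architecture has likewise not been sufficiently studied for speech processing tasks. This study introduces Speech-LLaMA, a new approach that effectively incorporates acoustic information into text-based large language models. The method uses Connectionist Temporal Classification and a simple audio encoder to map compressed acoustic features into the continuous semantic space of the LLM. In addition, the decoder-only architecture is explored further for speech-to-text tasks by training a smaller, randomly initialized Speech-LLaMA model on speech-text paired data alone. Experiments on multilingual speech-to-text translation tasks show a significant improvement over strong baselines, indicating the potential advantages of decoder-only models for speech-to-text conversion.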

Summary (by gpt-3.5-turbo)

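This study proposes Speech-LLaMA, a new approach for incorporating speech information into large language models. The method uses CTC and an audio encoder to map acoustic features into the semantic space of the LLM. To examine the decoder-only architecture on speech-to-text tasks, a smaller-scale, randomly initialized model is also trained on speech-text paired data alone. Experiments on multilingual speech-to-text translation tasks show a significant improvement over strong baselines, indicating the potential advantages of decoder-only models.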
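
The abstract says only that CTC is used to obtain "compressed acoustic features" and does not spell out the compression rule. As a rough illustration, the sketch below shows one plausible scheme: consecutive frames that share the same frame-level CTC label are averaged into a single vector, and runs labeled as blank are dropped. The function name `ctc_compress`, the blank index `0`, and the averaging rule are all assumptions made for this example, not details confirmed by the abstract. The audio encoder that would then map the compressed frames into the LLM embedding space is not shown.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_compress(frames, labels, blank=BLANK):
    """Average each run of frames sharing one CTC label; drop blank runs.

    frames: list of equal-length feature vectors (lists of floats)
    labels: per-frame CTC argmax labels, same length as frames
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    compressed = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segment = frames[start:start + length]
        start += length
        if label == blank:
            continue  # blank runs carry no token, so they are discarded
        dim = len(segment[0])
        # Element-wise mean over the frames in this run
        compressed.append(
            [sum(frame[d] for frame in segment) / length for d in range(dim)]
        )
    return compressed


frames = [[1.0, 1.0], [3.0, 5.0], [0.0, 0.0], [2.0, 2.0], [9.0, 9.0]]
labels = [7, 7, 0, 4, 0]
print(ctc_compress(frames, labels))  # [[2.0, 3.0], [2.0, 2.0]]
```

In this toy input, five frames shrink to two vectors, which hints at why such compression could shorten the acoustic sequence to something closer to text length before it reaches the LLM.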

