Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been well explored. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
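The abstract describes using CTC to compress frame-level acoustic features before projecting them into the LLM's embedding space. As a rough illustration only (not the paper's implementation), the sketch below collapses consecutive frames sharing a non-blank CTC label into a single averaged vector and then applies a hypothetical linear projection into a toy "LLM" embedding space; the function names and dimensions are assumptions for demonstration.

```python
import numpy as np

def ctc_compress(frame_features, ctc_labels, blank=0):
    """Keep one feature vector per CTC segment: drop blank frames and
    average runs of consecutive frames that share a non-blank label.
    (Hypothetical simplification of CTC-based feature compression.)"""
    kept, current, current_label = [], [], None
    for feat, lab in zip(frame_features, ctc_labels):
        if lab == blank:
            if current:                      # close the open segment
                kept.append(np.mean(current, axis=0))
                current, current_label = [], None
            continue
        if current and lab != current_label:  # new label starts a new segment
            kept.append(np.mean(current, axis=0))
            current = []
        current.append(feat)
        current_label = lab
    if current:
        kept.append(np.mean(current, axis=0))
    if not kept:
        return np.zeros((0, frame_features.shape[1]))
    return np.stack(kept)

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 4))          # 8 frames of 4-dim acoustic features
labels = [0, 1, 1, 0, 2, 2, 2, 0]         # per-frame CTC argmax labels
compressed = ctc_compress(frames, labels) # two segments survive ("1" and "2")
W = rng.normal(size=(4, 16))              # toy projection into a 16-dim LLM space
llm_inputs = compressed @ W               # sequence fed to the decoder-only LLM
```

The key design point the abstract hints at is sequence-length reduction: the LLM consumes a handful of segment-level vectors rather than every acoustic frame.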
On decoder-only architecture for speech-to-text and large language model integration, Jian Wu+, N/A, arXiv'23
Jul 11, 2023