
# Large Language Models-driven Audio Codec are Few-shot Audio Task Learners

This repository provides the open-source code, checkpoints, and demos for LLM-Codec.
More details will be added soon. This is an anonymous page intended to give reviewers more details about our work; a more detailed page will be released after the blind review.

## Introduction

Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation, but they cannot be applied to cross-modal tasks directly without fine-tuning. This paper proposes a cross-modal in-context learning approach that empowers frozen LLMs to perform multiple audio tasks in a few-shot style without any parameter updates. Specifically, we propose a novel LLM-driven audio codec model, LLM-Codec, to transfer the audio modality into the textual space, *i.e.*, representing audio tokens with words or sub-words from the LLM vocabulary, while keeping high audio reconstruction quality. The key idea is to reduce the modality heterogeneity between text and audio by compressing the audio modality into the token space of a well-trained LLM. The audio representation can then be viewed as a new *foreign language*, and LLMs can learn this new *foreign language* from a few demonstrations. In experiments, we investigate the performance of the proposed approach across multiple audio understanding and generation tasks, *e.g.*, speech emotion classification, audio classification, text-to-speech generation, and speech enhancement. The experimental results demonstrate that LLMs equipped with the proposed LLM-Codec, prompted by only a few examples, are capable of achieving the expected functions in simple scenarios. This validates the feasibility and effectiveness of the proposed cross-modal in-context learning approach.
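
To make the core idea concrete, the sketch below illustrates how a VQ codec's discrete codes could be tied to entries of the LLM vocabulary, so that an audio clip becomes a sequence of sub-word ids a frozen LLAMA 2 model can read. This is a minimal illustration with hypothetical names (`codec_encoder`, `code_to_vocab_id`), not the actual LLM-Codec implementation.

```python
import torch

def audio_to_llm_tokens(audio, codec_encoder, code_to_vocab_id):
    """Map an audio clip to LLM vocabulary token ids (illustrative sketch only).

    codec_encoder    : hypothetical VQ encoder returning codebook indices of shape (T,)
    code_to_vocab_id : LongTensor mapping every codebook index to a LLAMA 2 vocabulary id
    """
    with torch.no_grad():
        codes = codec_encoder(audio)   # (T,) discrete codec indices
    return code_to_vocab_id[codes]     # (T,) sub-word ids the frozen LLM already knows
```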

## How to use LLM-Codec?

Step 1: Download the checkpoint (it cannot be uploaded to GitHub because it is larger than 100 MB).

Step 2: Download LLAMA 2 7B based on https://github.com/meta-llama/llama-recipes/tree/main

Step 3: Refer to infer.py:

```bash
python infer.py
```
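
As a rough sketch of what codec-only inference might look like, the snippet below loads the checkpoint, encodes a waveform into discrete tokens, and decodes it back. The authoritative entry point is infer.py; the class name `LLMCodec` and its `encode`/`decode` methods here are assumptions, not the repository's actual API.

```python
import torch
import torchaudio

# from llm_codec import LLMCodec   # assumed import; see infer.py for the real one

def reconstruct(codec, wav_path, out_path, device="cuda"):
    """Round-trip a wav file through the codec (hypothetical API)."""
    wav, sr = torchaudio.load(wav_path)                 # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True).to(device)      # mix down to mono
    with torch.no_grad():
        codes = codec.encode(wav)                       # discrete tokens shared with the LLM vocab
        recon = codec.decode(codes)                     # waveform reconstructed from tokens
    torchaudio.save(out_path, recon.cpu(), sr)

# codec = LLMCodec.from_config("config.yaml")
# codec.load_state_dict(torch.load("llm-codec.pth", map_location="cpu"))
# reconstruct(codec.eval().to("cuda"), "input.wav", "recon.wav")
```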

## How to use LLM-Codec with LLAMA 2

In the following, we give a simple demonstration of how to use them together.

```bash
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node 1 --master_port=10645 infer_code/eval_accent_understanding_v2.py \
    --batch_size 1 \
    --max_seq_len 2048 \
    --num_workers 0 \
    --output_type "next_token_prediction" \
    --audio_path "the path of audio folder" \
    --file_path tsv/acc_9way_1_shot.scp \
    --vq_config_path config.yaml \
    --output_dir log_eval_few_shot/7B_output \
    --llama_model_path llama_inference/llama-2-7b \
    --induction 1 \
    --codec_ckpt "llm-codec.pth"
```
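
Conceptually, the evaluation script builds a few-shot prompt by interleaving demonstration pairs (audio tokens produced by LLM-Codec, text label) with the query audio, and then lets the frozen LLAMA 2 model predict the next tokens. The sketch below only illustrates that prompt layout; the function name, the `" -> "` separator, and the generic tokenizer interface are assumptions, not the script's actual implementation.

```python
import torch

def build_few_shot_prompt(tokenizer, demonstrations, query_audio_ids, instruction=""):
    """Assemble an in-context prompt from (audio token ids, label) demonstrations.

    tokenizer       : any tokenizer exposing encode(str) -> list[int] (assumption)
    demonstrations  : list of (LongTensor of audio token ids, str label) pairs
    query_audio_ids : LongTensor of audio token ids for the query clip
    """
    prompt = tokenizer.encode(instruction) if instruction else []
    for audio_ids, label in demonstrations:
        prompt += audio_ids.tolist()                  # audio written in the LLM's vocabulary
        prompt += tokenizer.encode(f" -> {label}\n")  # the answer for this demonstration
    prompt += query_audio_ids.tolist()                # query audio
    prompt += tokenizer.encode(" ->")                 # the model continues with its prediction
    return torch.tensor(prompt, dtype=torch.long)
```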

## Demos

Please refer to the demos folder to listen to the generated audio.

## Acknowledgements
