NVIDIA/audio-flamingo

Audio Flamingo: Series of Advanced Audio Understanding Language Models

Overview

In this repo, we present the Audio Flamingo series of advanced audio understanding language models:

Audio Flamingo


Audio Flamingo is our first audio language model based on the Flamingo architecture. It is built on a 1.3B language model and has in-context few-shot learning and multi-turn dialogue abilities (see Audio Dialogues for details of the dialogue data). We curated about 5.9M audio-text pairs to train the model. It achieves SOTA results on several zero-shot, few-shot, and in-distribution benchmarks of captioning, classification, and question answering.




Audio Flamingo 2


Audio Flamingo 2 significantly improves on Audio Flamingo in several aspects. First, we re-trained a better CLAP with stronger text understanding abilities. Second, we scaled up the training set to about 10M audio-text pairs, with a focus on several understanding skills (AudioSkills) and understanding of longer audio (LongAudio). Third, we carefully ablated the training recipes and curricula and found that a 3-stage training strategy yields the best results. Audio Flamingo 2 is based on a 3B language model. It achieves SOTA results on several individual and mixed audio understanding benchmarks of captioning, classification, and question answering. It can also understand longer audio of up to 5 minutes.



Audio Flamingo 3


Audio Flamingo 3 is our latest model, based on a 7B language model and the LLaVA architecture. We trained a unified audio encoder, AF-Whisper, based on Whisper, to handle understanding beyond speech recognition. We added speech-related tasks to Audio Flamingo 3 and scaled up the training dataset to about 50M audio-text pairs. As a result, Audio Flamingo 3 can handle all three audio modalities: sound, music, and speech. It outperforms prior SOTA models including GAMA, Audio Flamingo, Audio Flamingo 2, Qwen-Audio, Qwen2-Audio, Qwen2.5-Omni, LTU, LTU-AS, SALMONN, AudioGPT, Gemini Flash v2 and Gemini Pro v1.5 on a number of understanding and reasoning benchmarks.

Audio Flamingo 3 accepts up to 10 minutes of audio input and has a streaming TTS module (AF3-Chat) for voice output.
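The figures quoted in the three descriptions above can be collected in one place. The following is a minimal sketch, not part of the released code: the `ModelSummary` class, the `af1`/`af2`/`af3` keys, and the `fits_stated_limit` helper are hypothetical names introduced only for illustration. The overview states no audio-length limit for the first Audio Flamingo, so that field is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelSummary:
    """Figures copied from the overview text (hypothetical container)."""
    name: str
    lm_size: str                        # language model size as stated
    train_pairs_millions: float         # approximate audio-text pairs, in millions
    max_audio_minutes: Optional[float]  # None where the overview gives no limit


# Keys are illustrative shorthand, not official model identifiers.
MODELS = {
    "af1": ModelSummary("Audio Flamingo", "1.3B", 5.9, None),
    "af2": ModelSummary("Audio Flamingo 2", "3B", 10.0, 5.0),
    "af3": ModelSummary("Audio Flamingo 3", "7B", 50.0, 10.0),
}


def fits_stated_limit(model_key: str, duration_seconds: float) -> Optional[bool]:
    """Check a clip duration against the stated maximum audio length.

    Returns True or False when the overview states a limit, and None
    when no limit is stated for that model.
    """
    limit = MODELS[model_key].max_audio_minutes
    if limit is None:
        return None
    return duration_seconds <= limit * 60


if __name__ == "__main__":
    for key, summary in MODELS.items():
        print(key, summary.lm_size, fits_stated_limit(key, 420.0))
```

A 7-minute clip (420 seconds) exceeds the 5-minute figure quoted for Audio Flamingo 2 but stays within the 10-minute figure quoted for Audio Flamingo 3.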



Code Structure

Each branch includes the code to train and run inference with the corresponding Audio Flamingo model.
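Because the code is split across branches, a single branch can be fetched on its own. The sketch below only builds the `git` command lines and leaves running them to the caller. The repository URL is an assumption inferred from the repository name, and the branch names are not listed in this README, so the one used in the example is a placeholder to be replaced with a name returned by the listing command.

```python
import subprocess
from typing import List

# Assumption: URL inferred from the repository name NVIDIA/audio-flamingo.
REPO_URL = "https://github.com/NVIDIA/audio-flamingo.git"


def list_branches_cmd(repo_url: str = REPO_URL) -> List[str]:
    """Command that prints the remote branch heads without cloning."""
    return ["git", "ls-remote", "--heads", repo_url]


def clone_branch_cmd(branch: str, dest: str, repo_url: str = REPO_URL) -> List[str]:
    """Command that clones only the given branch into dest."""
    return ["git", "clone", "--branch", branch, "--single-branch", repo_url, dest]


def run(cmd: List[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands only; call run(...) to execute them (needs network access).
    print(" ".join(list_branches_cmd()))
    # "BRANCH_NAME" is a placeholder, not a real branch of this repo.
    print(" ".join(clone_branch_cmd("BRANCH_NAME", "audio-flamingo-branch")))
```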

License

  • The code in this repo is under the MIT license.
  • The checkpoints are for non-commercial use only (see NVIDIA OneWay Noncommercial License). They are also subject to other restrictions (see README and incl_licenses within each branch).
  • Notice: Audio Flamingo is built with OPT-IML and is subject to the OPT-IML license.
  • Notice: Audio Flamingo 2 and Audio Flamingo 3 are built with Qwen-2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.

Citation

  • Audio Flamingo
@inproceedings{kong2024audio,
  title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
  author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
  booktitle={International Conference on Machine Learning},
  pages={25125--25148},
  year={2024},
  organization={PMLR}
}
  • Audio Flamingo 2
@inproceedings{ghosh2025audio,
  title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
  author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=xWu5qpDK6U}
}
  • Audio Flamingo 3
@article{goel2025audio,
  title={Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models},
  author={Goel, Arushi and Ghosh, Sreyan and Kim, Jaehyeon and Kumar, Sonal and Kong, Zhifeng and Lee, Sang-gil and Yang, Chao-Han Huck and Duraiswami, Ramani and Manocha, Dinesh and Valle, Rafael and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2507.08128},
  year={2025}
}

About

PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models
