In this repo, we present the Audio Flamingo series of advanced audio understanding language models:
- Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities (ICML 2024)
- Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities (ICML 2025)
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models (arXiv)
Audio Flamingo is our first audio language model, built on the Flamingo architecture with a 1.3B language model backbone. It has in-context few-shot learning and multi-turn dialogue abilities (see Audio Dialogues for details of the dialogue data). We curated about 5.9M audio-text pairs to train the model. It achieves SOTA results on several zero-shot, few-shot, and in-distribution benchmarks for captioning, classification, and question answering.
Audio Flamingo 2 significantly improves on Audio Flamingo in several aspects. First, we re-trained a better CLAP with stronger text understanding abilities. Second, we scaled up the training set to about 10M audio-text pairs, with a focus on several understanding skills (AudioSkills) and understanding of longer audio (LongAudio). Third, we carefully ablated the training recipes and curricula and found that a 3-stage training strategy yields the best results. Audio Flamingo 2 is based on a 3B language model. It achieves SOTA results on several individual and mixed audio understanding benchmarks of captioning, classification, and question answering, and it can understand audio up to 5 minutes long.
Audio Flamingo 3 is our latest model, based on a 7B language model and the LLaVA architecture. We trained our unified AF-Whisper audio encoder, based on Whisper, to handle understanding beyond speech recognition. We included speech-related tasks in Audio Flamingo 3 and scaled up the training dataset to about 50M audio-text pairs. As a result, Audio Flamingo 3 can handle all three audio modalities: sound, music, and speech. It outperforms prior SOTA models, including GAMA, Audio Flamingo, Audio Flamingo 2, Qwen-Audio, Qwen2-Audio, Qwen2.5-Omni, LTU, LTU-AS, SALMONN, AudioGPT, Gemini Flash v2, and Gemini Pro v1.5, on a number of understanding and reasoning benchmarks.
Audio Flamingo 3 can take audio inputs of up to 10 minutes and has a streaming TTS module (AF3-Chat) for voice output.
Each branch includes the code to train and run inference for the corresponding Audio Flamingo model (a minimal, hypothetical inference sketch is shown below).
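As a rough illustration only, the sketch below shows what inference could look like if a checkpoint were loaded through a generic Hugging Face audio-text interface with `trust_remote_code`. The model id, processor arguments, and prompt format are assumptions, not the repo's actual API; the real training and inference entry points are documented in each branch's README.

```python
# Hypothetical sketch only -- NOT the repo's actual API. The model id,
# processor signature, and prompt format below are assumptions; see the
# branch READMEs for the real training and inference scripts.
import torch
import librosa
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "nvidia/audio-flamingo-3"  # assumption: check the official release for the real id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load a local clip (sound, music, or speech) and resample to 16 kHz.
audio, sr = librosa.load("example.wav", sr=16000)

inputs = processor(
    text="Describe the audio in detail.",
    audios=audio,
    sampling_rate=sr,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```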
- The code in this repo is released under the MIT license.
- The checkpoints are for non-commercial use only (see NVIDIA OneWay Noncommercial License). They are also subject to other restrictions (see README and incl_licenses within each branch).
- Notice: Audio Flamingo is built with OPT-IML and is subject to the OPT-IML license.
- Notice: Audio Flamingo 2 and Audio Flamingo 3 are built with Qwen-2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
- Audio Flamingo
@inproceedings{kong2024audio,
title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
booktitle={International Conference on Machine Learning},
pages={25125--25148},
year={2024},
organization={PMLR}
}
- Audio Flamingo 2
@inproceedings{ghosh2025audio,
title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=xWu5qpDK6U}
}
- Audio Flamingo 3
@article{goel2025audio,
title={Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models},
author={Goel, Arushi and Ghosh, Sreyan and Kim, Jaehyeon and Kumar, Sonal and Kong, Zhifeng and Lee, Sang-gil and Yang, Chao-Han Huck and Duraiswami, Ramani and Manocha, Dinesh and Valle, Rafael and Catanzaro, Bryan},
journal={arXiv preprint arXiv:2507.08128},
year={2025}
}