# How do *Language* Models Understand Images, Audio, and Video?

<img src="./media/mmllm.png" width=600>

Language models began, as the name suggests, by modeling language with deep learning techniques. The most popular generative models, Large Language Models (LLMs), build on this ability to understand language in order to create new content from some form of input. Primarily through scaling up model size, the ability to complete text from an input has proven remarkably useful, leading to emergent reasoning, back-and-forth chat, and contextual understanding of text that approaches human ability.

<img src="./media/scaling_laws.png" width=600>

[*Scaling Laws for Neural Language Models*](https://arxiv.org/pdf/2001.08361)

While text understanding is incredibly useful, we live in a multi-dimensional world where text is just a portion of how we as humans process information. Within the digital realm, we still have visual and auditory understanding, taking the form of images (photographs or even just your screen), videos, and sound (speech and non-speech). As large language models work towards emulating human cognitive behavior in a digital world, researchers have been actively adding these modalities.

<img src="./media/llama3_diagram.png" width=600>

[*The Llama 3 Herd of Models*](https://arxiv.org/pdf/2407.21783)

This notebook gives a high-level overview of some of the techniques used to teach language models image, video, and audio understanding, with corresponding open-source model examples.

---
# Introducing Vision with Image Understanding

<img src="./media/gpt4_vision.png" width=500>

[*GPT-4 Technical Report*](https://arxiv.org/pdf/2303.08774)

Images were the first new modality integrated into LLMs largely due to the maturity of computer vision techniques and the abundance of visual data. By the time large language models emerged, the vision community had already achieved impressive results in image classification, object detection, and even generative imaging (e.g. photorealistic image generation from text). These advancements made visual understanding a natural starting point for multimodal AI, since models could leverage pre-existing image recognition capabilities. Vision is also a primary sense for humans, so enabling AI to see complements its language understanding in a human-like way. Given this potential and the solid foundation of computer vision, it made sense that **image understanding was the first modality** adopted in LLMs for building multimodal systems.

### Vision Transformers (ViT)

<img src="./media/clip_training.png" width=500>

[*Learning Transferable Visual Models From Natural Language Supervision*](https://arxiv.org/pdf/2103.00020)

The first major attempt to apply LLM-style architectures to images was the Vision Transformer (ViT). ViTs apply the transformer to image data by breaking an image into fixed-size patch “tokens.” In the original experiments, researchers split images into 16×16 pixel patches, each of which is flattened and linearly projected into an embedding vector. These patch embeddings are then fed to a Transformer encoder just as word embeddings would be for a sentence, along with positional encodings to retain information about patch locations. Through self-attention, the model learns relationships between patches across the entire image, enabling it to capture global context and detailed spatial relationships that prior convolutional network approaches might miss.

The vision transformer can attend to any region of the image from any other, learning long-range dependencies (for instance, relating an object in one corner to a detail in another), which showed that a “pure” transformer could excel at vision tasks. The success of ViT showed that transformers can serve as powerful image encoders, making them ideal for pairing with language models in a multimodal system.
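To make the patch-to-token step concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions: the toy 32×32 image, the embedding size of 8, and the random weights standing in for the learned projection and position embeddings are all made up for this example, and `patchify` and `linear_project` are hypothetical helper names rather than functions from any library.

```python
import random

def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)  # append this pixel's channel values
            patches.append(flat)
    return patches

def linear_project(vector, weight):
    """Multiply one flattened patch by an (embed_dim x patch_dim) weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

rng = random.Random(0)
EMBED_DIM = 8  # toy size chosen for this sketch (assumption)

# Toy 32x32 RGB image, so a 16x16 patch size yields a 2x2 grid of patches.
image = [[[((r + c) % 256) / 255.0] * 3 for c in range(32)] for r in range(32)]
patches = patchify(image, patch_size=16)
patch_dim = len(patches[0])  # 16 * 16 * 3 values per patch

# Random stand-ins for the learned projection and positional embeddings.
weight = [[rng.gauss(0, 0.02) for _ in range(patch_dim)] for _ in range(EMBED_DIM)]
pos_emb = [[rng.gauss(0, 0.02) for _ in range(EMBED_DIM)] for _ in patches]

# Each token = projected patch + positional embedding for that patch location.
tokens = [
    [p + e for p, e in zip(linear_project(patch, weight), pos)]
    for patch, pos in zip(patches, pos_emb)
]
print(len(tokens), "tokens of dimension", len(tokens[0]))
print("224x224 image ->", (224 // 16) ** 2, "patch tokens")
```

The resulting `tokens` list is what a Transformer encoder would consume, exactly like a sequence of word embeddings.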

### CLIP: Aligning Visual and Language Representations

<img src="./media/clip_training.png" width=500>

[*Learning Transferable Visual Models From Natural Language Supervision*](https://arxiv.org/pdf/2103.00020)

While ViT deals with image encoding, **CLIP (Contrastive Language-Image Pretraining)** aligns images with text. In short, CLIP is a model that jointly trains an image encoder and a text encoder to produce a shared embedding space for both modalities. It was trained at scale using over **400 million image-text pairs** collected from the internet with an objective to predict which caption from a batch matches which image. Through deep learning, the CLIP model learns to pull together the embeddings of a corresponding image and caption, and push apart embeddings of mismatched pairs. This **contrastive learning** setup teaches CLIP to generate image and text *representations* that are directly comparable.
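The contrastive objective can be sketched in a few lines. This is a simplified illustration, not CLIP's training code: the 3-dimensional embeddings are hand-written stand-ins for encoder outputs, the fixed temperature of 0.07 is an assumption for this sketch (in practice the temperature is typically a learned parameter), and `clip_loss` is a hypothetical name.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index (numerically stable)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image i should match text i, and vice versa."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j] = scaled cosine similarity between image i and text j
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy([sim[r][j] for r in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy batch: each image embedding points roughly the same way as its caption.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_text_embs = text_embs[1:] + text_embs[:1]  # captions paired with wrong images

aligned_loss = clip_loss(image_embs, text_embs)
shuffled_loss = clip_loss(image_embs, shuffled_text_embs)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

The loss is low when matching pairs sit on the diagonal of the similarity matrix and high when captions are paired with the wrong images, which is the signal that pulls matching pairs together and pushes mismatches apart.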

CLIP’s outcome is a system that understands visual concepts in natural language terms, most importantly resulting in zero-shot classification capabilities. After training, you can give CLIP a new image and a set of text labels (for example, names of ImageNet categories like “cat” or “dog”), and it will identify the label whose text embedding is closest to the image embedding without any task-specific fine-tuning. This demonstrated that **natural language supervision can train extremely flexible visual models**, since CLIP can recognize a wide array of image contents purely through scaled training on real-world captions. In the context of multimodal AI, CLIP became foundational, as its embeddings can be used to bridge vision and language with **zero-shot image understanding**.
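Zero-shot classification then reduces to a nearest-neighbor lookup in the shared embedding space. The sketch below assumes the embeddings already exist: the vectors are invented for illustration, the "a photo of a ..." label phrasing is one commonly used prompt style rather than a requirement, and `zero_shot_classify` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, label_embs):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(label_embs, key=lambda label: cosine(image_emb, label_embs[label]))

# Made-up embeddings standing in for the text encoder and image encoder outputs.
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.1],
}
image_emb = [0.8, 0.2, 0.1]

prediction = zero_shot_classify(image_emb, label_embs)
print(prediction)
```

Adding a new class only requires embedding one more text label; no image classifier head is retrained.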

### Flamingo: From Understanding to Generation

<img src="./media/flamingo.png" width=500>

[*Flamingo: a Visual Language Model for Few-Shot Learning*](https://arxiv.org/pdf/2204.14198)

While representation learning and encoding were a good first step, the next step was to not just understand images but also to **generate language** about images fluidly. **Flamingo** (by DeepMind) is a visual language model that can accept images and generate text in a combined, flexible way. Flamingo achieves this by **integrating a pre-trained vision encoder and a pre-trained language model** and adding special **cross-attention layers** between them. These learnable cross-attention modules allow the frozen language model to "understand" and attend to the image embeddings at various points in its layers, connecting visual information into the text generation architecture. Thus, Flamingo can generate text by **conditioning the LM on the content of images** without needing to retrain the entire language model from scratch.

<img src="./media/flamingo_arch.png" width=500>

The big concept here was that both the core vision and language models in Flamingo remain frozen; only the new cross-modal attention layers (and a small Perceiver Resampler module that compresses visual features) are trained. This makes training more efficient and avoids overfitting or forgetting the language model’s knowledge and prior text generation capability. It also allows for few-shot learning: the model can **generate relevant captions or answers about an image given just a few examples**. Without explicit fine-tuning, researchers showed that Flamingo could look at an image and answer complex questions about it or produce a detailed caption, simply by being prompted with a couple of examples of image-question-answer pairs. This showed that we can **teach an LLM to “see” and talk about what it sees** in a flexible, general way. Flamingo demonstrated an improvement from passive image *understanding* to active *generation* of textual descriptions and answers based on images.
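One way to picture how new cross-attention layers can be added without disturbing a frozen language model is a gated residual connection. The sketch below is an assumption-laden toy: it uses a single attention head with no learned projections, tiny hand-written vectors, and a `tanh` gate that starts at zero so the layer initially passes the text states through unchanged. `gated_cross_attention` is a hypothetical name, not an API from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, visual_tokens):
    """Each text state (query) attends over the visual tokens (keys and values)."""
    dim = len(text_states[0])
    attended = []
    for query in text_states:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visual_tokens]
        weights = softmax(scores)
        attended.append([sum(w * value[d] for w, value in zip(weights, visual_tokens))
                         for d in range(dim)])
    return attended

def gated_cross_attention(text_states, visual_tokens, gate):
    """Residual cross-attention scaled by tanh(gate); gate=0 leaves the text untouched."""
    attended = cross_attention(text_states, visual_tokens)
    scale = math.tanh(gate)
    return [[t + scale * a for t, a in zip(text_vec, att_vec)]
            for text_vec, att_vec in zip(text_states, attended)]

text_states = [[1.0, 0.0], [0.0, 1.0]]                  # frozen LM hidden states (toy)
visual_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]   # image features (toy)

at_init = gated_cross_attention(text_states, visual_tokens, gate=0.0)
after_training = gated_cross_attention(text_states, visual_tokens, gate=1.0)
print(at_init == text_states, after_training != text_states)
```

With the gate at zero the frozen model behaves exactly as before, and as training opens the gate, visual information is gradually mixed into the text stream.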

### BLIP-2: Efficiency Through Modularity

<img src="./media/blip_overview.png" width=400>

[*BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models*](https://arxiv.org/pdf/2301.12597)

Researchers also sought to make training multimodal models more **efficient and modular**. **BLIP-2** is an approach that achieves efficiency by **bootstrapping from off-the-shelf components**: it combines a frozen image encoder (like a pre-trained ViT or CLIP model) and a frozen large language model with a lightweight **Querying Transformer (Q-Former)** in between. The Q-Former is a small transformer that plays the role of an intermediary or translator between the vision and language parts.

Training BLIP-2 is done in two stages:
1. **Vision-language representation learning:** In the first stage, the Q-Former is trained to interpret and extract useful information from the frozen image encoder. In theory, the Q-Former “queries” the image features – it has learnable query tokens that attend to the image encoder’s output and is trained (using paired images and text) to produce embeddings that align with textual meaning. During this stage, the image encoder stays frozen, and the Q-Former learns to produce a compact representation of the image that captures concepts relevant to captions or labels.
2. **Vision-to-language generation learning:** In the second stage, BLIP-2 connects the Q-Former to the frozen language model. Now the goal is to enable the language model to generate text based on the Q-Former’s image-informed embeddings. The Q-Former’s output is fed into the language model (for example, as a prefix or set of tokens the LM can attend to), and training teaches the combined system to generate appropriate text (captions, answers, etc.) for a given image. During this stage, the large language model remains frozen (not updated), and we fine-tune only the Q-Former.

<img src="./media/blip_p2.png" width=800>

This parameter-efficient approach sharply reduces the number of parameters that need to be learned from scratch. The only newly trained part is the Q-Former, which results in a much more **scalable and efficient** strategy for multimodal learning. The BLIP-2 approach outperformed DeepMind’s 80B parameter Flamingo on a zero-shot image QA benchmark, all while using **54× fewer trainable parameters**. Since it’s modular, one can swap in different image encoders or language models as needed, and just train a new Q-Former to connect them.
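The core trick of the learnable query tokens is that a fixed number of queries attends over however many image features the encoder produces, so the language model always receives the same compact number of vectors. The sketch below illustrates only that shape behavior: the counts (32 queries, 50 or 257 image features), the dimension of 8, and the random values are assumptions for illustration, and the single attention step omits the learned projections and stacked layers a real Q-Former would have.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, features):
    """Every query produces one output: a weighted average of the features."""
    dim = len(queries[0])
    outputs = []
    for query in queries:
        scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
                  for feat in features]
        weights = softmax(scores)
        outputs.append([sum(w * feat[d] for w, feat in zip(weights, features))
                        for d in range(dim)])
    return outputs

rng = random.Random(0)
NUM_QUERIES, DIM = 32, 8  # toy sizes chosen for this sketch (assumption)

# Stand-in for the learnable query tokens (random here, trained in practice).
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

def fake_image_features(count):
    """Hypothetical helper: random vectors standing in for frozen encoder outputs."""
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(count)]

small = cross_attend(queries, fake_image_features(50))
large = cross_attend(queries, fake_image_features(257))
print(len(small), len(large))  # same number of outputs regardless of input size
```

Because the output size depends only on the number of queries, the image encoder can be swapped without changing what the language model sees.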

### Modern Day Vision LLMs

<img src="./media/vlm-structure.png" width=500>

[*Vision Language Models Explained*](https://huggingface.co/blog/vlms)

The evolution from CLIP to Flamingo to BLIP-2 and beyond has paved the way for a new generation of **vision-language LLMs** that are more capable and efficient than ever. Modern vision-enabled LLMs, especially in the open-source community, have largely standardized around a **modular architecture**: a pre-trained image encoder, a multimodal projector module, and a pre-trained text decoder (language model). The image encoder (often a ViT or CLIP model) produces a representation of the image, the projector (which can be as simple as a few linear layers or a small transformer) maps this representation into the language model’s embedding space, and the language model generates the final response. This design means that most of the heavy-lifting is done by components that were pre-trained on vast amounts of unimodal data (images or text), and only a comparatively small part (the projector and some connecting layers) needs training for the multimodal task.
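As a rough sketch of the projector step, the snippet below maps image features into the language model's embedding size with a single linear layer and then places the resulting "image tokens" in front of the text embeddings. All dimensions and values are toy assumptions, `project` is a hypothetical helper, and real systems differ in where image tokens are placed and how the projector is built.

```python
import random

def project(features, weight, bias):
    """Apply a linear layer (weight: out_dim x in_dim) to each feature vector."""
    return [[sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
            for feat in features]

rng = random.Random(0)
VISION_DIM, LM_DIM = 6, 4  # toy sizes (assumption)

# Stand-ins for the image encoder output and the LM's text token embeddings.
image_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(3)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LM_DIM)] for _ in range(2)]

# The projector is the small trainable piece between two pre-trained components.
weight = [[rng.gauss(0, 0.1) for _ in range(VISION_DIM)] for _ in range(LM_DIM)]
bias = [0.0] * LM_DIM

image_tokens = project(image_features, weight, bias)
sequence = image_tokens + text_embeddings  # what the language model attends over
print(len(sequence), "positions, each of size", len(sequence[0]))
```

Once the image tokens live in the same space as text embeddings, the language model can treat them like any other part of its input sequence.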

Recent vision-LLMs focus on improving **alignment** with human intent and maintaining **computational efficiency**. A good example is [**LLaVA (Large Language and Vision Assistant)**](https://huggingface.co/llava-hf/llava-v1.6-34b-hf), which refines the BLIP-2 approach. LLaVA uses a CLIP ViT-L/14 as its image encoder and Vicuna (an LLM based on LLaMA) as the text decoder, bridged by a learned projection layer. To train it, the authors generated a synthetic instruction-following dataset of image captions via GPT-4, along with question-answer pairs about each image, simulating an interactive user asking about the image. They then trained LLaVA in two phases: first freezing the image encoder and language model and training the projector on this data (to align the modalities), and then fine-tuning the language model with the projector for better performance.

<img src="./media/llava_ex.png" width=500>

Advancements like those in LLaVA have improved how well the visual and textual parts cooperate (better alignment) and have made training these models more accessible (better efficiency). As researchers refine these techniques, we’re moving closer to AI systems that can see as well as they can read and write, enabling more natural and powerful human-AI interactions and use cases. The integration of vision into LLMs was only the first step, but it has proven pivotal.

---
### Image LLM Example

Let's use [Moondream](https://huggingface.co/vikhyatk/moondream2), a small but powerful vision language model based on the Phi language model trained by [Vik Korrapati](https://x.com/vikhyatk) and team.

We'll pass in the image:

<img src="./example_media/moondream_ex.jpeg" width=250>

and ask for a description!

In [None]:
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "cuda"}
)

In [None]:
image = Image.open("./example_media/moondream_ex.jpeg")

print(model.query(image, "Can you describe what's going on in this image?")["answer"])

 The image features a ginger cat with a white chest and paws, standing upright and holding a piece of paper in its paws. The cat's eyes are wide open, and it appears to be looking directly at the camera. The paper the cat is holding has the name "Adam Lucek" written on it in black letters. The background of the image is a solid, light blue-green color.


---
# From Images to Videos With Temporal Understanding

Moving from image understanding to video understanding introduces **time** as a new dimension. An image is a static snapshot, whereas a video is a sequence of frames capturing dynamic content over time. This added temporal dimension increases complexity: motion, timing, and temporal relationships can change the interpretation of a scene. The good news, however, is that techniques from image vision can be extended to videos, as long as they account for sequential structure. Modern video understanding models build on image models by integrating this temporal modeling, ensuring that motion and frame-to-frame dependencies are learned alongside spatial content.

### Video Transformer Architectures

<img src="./media/timesformer.png" width=600>

[*Is Space-Time Attention All You Need for Video Understanding?*](https://arxiv.org/pdf/2102.05095)

Just as **Vision Transformers** (ViTs) enabled image recognition by modeling an image as a sequence of patches with self-attention, **video transformers** extend this idea to sequences of image frames. The difference is that a video contains many more patches (spatial patches across multiple frames), so naively applying self-attention over all space-time patches is computationally expensive. To handle this, video transformers introduce structures that capture temporal information efficiently: they treat a video as a sequence of frame embeddings (or patch embeddings per frame) and add temporal position encodings alongside the spatial ones, so the model *knows* the order of frames.

Early models like TimeSformer adapt the standard ViT for videos by enabling attention both **within each frame** and **across frames**. TimeSformer uses a novel *divided space-time attention* mechanism: within each transformer block, it applies self-attention over the temporal dimension separately from the spatial dimension. The model first attends to tokens along the time axis (capturing how a particular patch location evolves over frames), and then attends to tokens within each frame (capturing spatial relationships). This was shown to effectively model full spatiotemporal relationships. In other words, separating “when” and “where” attention preserves the ability to learn complex motion patterns without the full cost of joint 3D attention. TimeSformer’s attention-only design achieved state-of-the-art results on action recognition benchmarks, showing again that pure transformers can effectively learn video representations and that image-transformer techniques can be extended to handle the temporal dynamics of video.
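A quick back-of-the-envelope calculation shows why dividing attention helps. The counts below are per query token and include one classification token in each attention step, following the accounting we recall from the TimeSformer paper; the clip size of 8 frames at 14×14 patches per frame is an assumption chosen for illustration.

```python
def joint_comparisons(patches_per_frame, num_frames):
    """Joint space-time attention: each token is compared with every patch in every
    frame, plus the classification token."""
    return patches_per_frame * num_frames + 1

def divided_comparisons(patches_per_frame, num_frames):
    """Divided attention: one temporal pass (same location across frames) and one
    spatial pass (all patches in the same frame), each including the classification token."""
    temporal = num_frames + 1
    spatial = patches_per_frame + 1
    return temporal + spatial

patches_per_frame = 14 * 14  # a 224x224 frame cut into 16x16 patches
num_frames = 8               # assumed clip length for this example

joint = joint_comparisons(patches_per_frame, num_frames)
divided = divided_comparisons(patches_per_frame, num_frames)
print(f"joint: {joint} comparisons per token, divided: {divided}")
```

The joint cost grows with the product of frames and patches, while the divided cost grows with their sum, so the gap widens quickly for longer clips or higher resolutions.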

### VideoMAE: Masked Autoencoders for Video Pre-Training

<img src="./media/videomae.png" width=600>

[*VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training*](https://arxiv.org/pdf/2203.12602)

Self-supervised learning has become key in vision, with two major families of approaches: **contrastive learning** (e.g. matching different views of the same scene) and **masked prediction** (e.g. masking parts of input and learning to reconstruct them). **VideoMAE** applies the masked autoencoder strategy to video to advance self-supervised video learning. Inspired by the Masked Autoencoder (MAE) for images, VideoMAE processes a video as a sequence of “tokens” (patches across frames) and masks a large portion of them during training. The model (a ViT backbone) must reconstruct the missing patches, forcing it to learn meaningful spatiotemporal features from the remaining context. A key innovation is the use of an **extremely high masking ratio** – typically 90% of the video patches are masked out. By making the task very challenging, VideoMAE encourages the encoder to extract rich video representations to fill in the blanks. The high masking ratio works especially well for video because temporal continuity provides additional context that static images lack.
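To get a feel for how little the encoder sees at a 90% masking ratio, here is a small sketch of one masking scheme. It assumes "tube" masking, where the same spatial positions are hidden in every temporal slice so the model cannot simply copy a patch from a neighboring frame; the grid of 8 temporal slices by 14×14 positions and the helper name `tube_mask` are assumptions for this example.

```python
import random

def tube_mask(num_frames, positions_per_frame, mask_ratio=0.9, seed=0):
    """Return a [frame][position] grid of booleans, True where a token is masked.

    The same spatial positions are masked in every frame (a 'tube' through time).
    """
    rng = random.Random(seed)
    num_masked = int(mask_ratio * positions_per_frame)
    masked = set(rng.sample(range(positions_per_frame), num_masked))
    frame_mask = [pos in masked for pos in range(positions_per_frame)]
    return [list(frame_mask) for _ in range(num_frames)]

NUM_FRAMES = 8            # temporal slices (assumed)
POSITIONS = 14 * 14       # spatial positions per slice (assumed)

mask = tube_mask(NUM_FRAMES, POSITIONS)
total_tokens = NUM_FRAMES * POSITIONS
visible_tokens = sum(not m for frame in mask for m in frame)
print(f"encoder sees {visible_tokens} of {total_tokens} tokens")
```

Only the visible tokens are passed through the encoder, which is also why such a high masking ratio makes pre-training considerably cheaper.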

### InternVideo2: Unified Video Foundation Models

<img src="./media/internvideo.png" width=600>

[*InternVideo2: Scaling Foundation Models for Multimodal Video Understanding*](https://arxiv.org/pdf/2403.15377)

**InternVideo2** is a recent *video foundation model* that begins to unify multiple training objectives to learn versatile video representations and connect to text generation. Its approach is organized into three progressive stages, each adding a new capability on top of the previous:
- **Masked Video Modeling (Stage 1)** – First, the video encoder is trained with a masked video token reconstruction task (analogous to VideoMAE). By learning to predict or reconstruct masked portions of video clips, the model learns spatiotemporal structures. This provides the base of low- and mid-level video understanding (motion, objects, scene dynamics) through self-supervision.
- **Crossmodal Contrastive Learning (Stage 2)** – Next, InternVideo2 aligns video representations with other modalities, notably text (and also audio when available). In this stage, the video encoder (initialized from Stage 1’s weights) is trained jointly with a text encoder to produce matching embeddings for corresponding video-text pairs. This **video-text contrastive learning** is similar to how CLIP aligned images with captions, now applied to videos and captions for high-level semantic awareness of video content. After this stage, InternVideo2 can align its visual understanding with human language.
- **LLM Integration via Q-Former (Stage 3)** – In the final stage, InternVideo2 connects with a Large Language Model to enable **reasoning and generation** based on video input. A **Q-Former** module (inspired by BLIP-2 style architectures) connects the video encoder and the language model. The video encoder (from Stage 2) produces a set of token embeddings for a given video, and the Q-Former attends to these video tokens and outputs a condensed set of “query” features that encapsulate the video’s information. These query features are fed into an LLM as prompts or special tokens, so it can generate captions, answers, or dialog responses about the video without needing to retrain the large language model from scratch.

This progressive training yields a 6-billion-parameter video encoder that achieved impressive results across a wide range of video tasks (classification, retrieval, and video dialogue).

### Apollo: Unified Video Understanding in Large Multimodal Models
<img src="./media/apollo_exp.png" width=600>

[*Apollo: An Exploration of Video Understanding in Large Multimodal Models*](https://arxiv.org/pdf/2412.10360)

More recently from Meta, **Apollo** demonstrates several modern techniques to advance video understanding in large multimodal models (LMMs), integrating spatial and temporal innovations. Apollo combines two visual encoders, a **SigLIP encoder** specialized for spatial detail and the **InternVideo2 encoder** optimized for temporal dynamics. By combining these complementary encoders, Apollo achieves spatiotemporal understanding superior to either encoder individually.

To handle extensive video content efficiently, Apollo uses a learned **Perceiver Resampler**. This module condenses detailed frame-level features into a compact set of tokens by selectively attending to key visual elements. This downsampling approach enables the model to manage long videos without losing critical information, significantly outperforming prior pooling methods. Apollo also adopts an **fps-based frame sampling** strategy, selecting continuous frame snippets rather than uniformly spaced frames, which maintains natural temporal order and captures authentic motion patterns, resulting in better action recognition and temporal understanding.
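The difference between the two sampling strategies is easiest to see with frame indices. The sketch below is one simplified reading of the idea, not Apollo's implementation: uniform sampling always returns the same number of frames no matter how long the video is, while fps-based sampling keeps a constant time gap between sampled frames. The native rate of 30 fps, the target of 2 fps, and both helper names are assumptions.

```python
def uniform_sample(total_frames, num_samples):
    """Pick a fixed number of frames spread evenly across the whole video."""
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

def fps_sample(total_frames, native_fps, target_fps):
    """Pick frames at a fixed time interval, so motion speed looks the same everywhere."""
    stride = max(1, round(native_fps / target_fps))
    return list(range(0, total_frames, stride))

NATIVE_FPS = 30
short_video = 10 * NATIVE_FPS  # 10 seconds of frames
long_video = 60 * NATIVE_FPS   # 60 seconds of frames

short_uniform = uniform_sample(short_video, 8)
long_uniform = uniform_sample(long_video, 8)
short_fps = fps_sample(short_video, NATIVE_FPS, target_fps=2)
long_fps = fps_sample(long_video, NATIVE_FPS, target_fps=2)

# Gap between consecutive sampled frames, in seconds
print("uniform gaps:", (short_uniform[1] - short_uniform[0]) / NATIVE_FPS,
      (long_uniform[1] - long_uniform[0]) / NATIVE_FPS)
print("fps gaps:    ", (short_fps[1] - short_fps[0]) / NATIVE_FPS,
      (long_fps[1] - long_fps[0]) / NATIVE_FPS)
```

With uniform sampling the same action appears faster or slower depending on video length, whereas fps-based sampling trades a variable token count for consistent timing, which is where a token-compressing resampler becomes useful.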

And finally, Apollo emphasizes balanced multimodal training by integrating pure text and image-text data alongside video content. This combined balance was shown to preserve the model’s language abilities and prevent overfitting to visual tasks, so the existing linguistic skills remain intact.

<img src="./media/apollo_training.png" width=600>

---
### Video LLM Example

For video processing, we'll try out [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) and ask it to describe a short video clip:

<video src="./example_media/bakery.mp4" width="500" controls></video>

In [1]:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")




In [2]:
# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "./example_media/bakery.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    fps=1.0,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text)

qwen-vl-utils using decord to read video.


['The video starts with a view of a street scene featuring a bakery named "Bakken met Passie." The bakery has a blue storefront with large windows displaying various baked goods and pastries. People are walking by on the sidewalk, and a cyclist rides past. The camera then pans to the right, showing more of the street and the adjacent buildings. The video ends with a clear view of the bakery\'s exterior.']


---
# Past Vision to Hearing - Audio Understanding in LLMs

<img src="./media/spectrogram.png" width=600>

While text and vision have been a focus, a key frontier is the **auditory modality**. Audio understanding (of either speech or general acoustics) gives AI models the ability to grasp information conveyed in spoken language, environmental sounds, and music. Incorporating hearing lets an AI interpret context like tone of voice, background noises, or non-verbal cues that pure text or vision might miss. Equipping LLMs with audio comprehension, while often overlooked, is important for real-world applications, complementing visual understanding and moving toward truly multimodal intelligence.

### Speech and General Acoustics

<img src="./media/wave2vec.png" width=600>

[*wav2vec: Unsupervised Pre-training for Speech Recognition*](https://arxiv.org/pdf/1904.05862)

Early **speech models** set the foundation for audio-enabled LLMs by converting raw waveforms into meaningful representations. For example, **Wav2Vec 2.0** introduced a *self-supervised* framework that learns latent speech features directly from raw audio, masking portions of the waveform’s latent encoding and predicting them via a contrastive objective (think back to CLIP!). This approach yields embeddings that preserve audio-specific nuances (e.g. **tone** or speaker identity) without requiring transcripts.

<img src="./media/whisper.png" width=400>

[*Robust Speech Recognition via Large-Scale Weak Supervision*](https://cdn.openai.com/papers/whisper.pdf)

Jumping forward on the supervised side, OpenAI’s **Whisper** shows what scaling up speech recognition can achieve: it’s an encoder-decoder Transformer ASR model trained on *680,000 hours* of multilingual data, achieving remarkable robustness to accents, background noise, and technical language. Whisper processes audio by converting 30-second clips into a log-Mel spectrogram, feeding it into a Transformer encoder, and then decoding text along with special tokens for tasks like language identification, timestamps, and translation. Together, these speech models (and others in general acoustics) demonstrated how raw waveforms can be transformed into linguistic or semantic content – the first step for LLMs to ingest and reason about audio.
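The 30-second window translates into a fixed-size input for the encoder. The arithmetic below uses values we recall from the Whisper paper that are not stated in the text above, so treat them as assumptions: 16 kHz audio, a 10 ms hop between spectrogram frames, 80 Mel bins, and a stride-2 convolution before the Transformer layers. `pad_or_trim` here is a small stand-in written for this sketch.

```python
SAMPLE_RATE = 16_000   # samples per second (assumed)
CHUNK_SECONDS = 30     # clip length from the text above
HOP_LENGTH = 160       # samples between spectrogram frames, i.e. 10 ms (assumed)
N_MELS = 80            # Mel frequency bins (assumed)

def pad_or_trim(waveform, length=SAMPLE_RATE * CHUNK_SECONDS):
    """Zero-pad short clips or cut long ones so every input covers exactly 30 seconds."""
    if len(waveform) >= length:
        return waveform[:length]
    return waveform + [0.0] * (length - len(waveform))

num_samples = SAMPLE_RATE * CHUNK_SECONDS
num_frames = num_samples // HOP_LENGTH   # spectrogram columns
encoder_positions = num_frames // 2      # after the stride-2 convolution

clip = pad_or_trim([0.1] * 1000)         # a very short toy "recording"
print(f"{num_samples} samples -> {N_MELS} x {num_frames} log-Mel spectrogram "
      f"-> {encoder_positions} encoder positions")
```

Every clip therefore reaches the encoder with the same shape, regardless of how much of the 30 seconds actually contains sound.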

### Audio Spectrogram Transformer (AST)

<img src="./media/AST.png" width=500>

[*AST: Audio Spectrogram Transformer*](https://arxiv.org/pdf/2104.01778)

Similar to what we've seen with images and video, the **Audio Spectrogram Transformer (AST)** applies the success of transformers to the audio domain as the first purely attention-based audio classification model. It works by treating an audio **spectrogram** (a time–frequency heatmap of sound) like an image, splitting it into patches (akin to 16×16 pixel patches in Vision Transformers) and feeding these into a Transformer encoder. Surprisingly, AST can leverage pretrained Vision Transformer weights: the authors adapted a ViT to accept single-channel spectrogram input by averaging the ViT’s RGB patch embedding filters into one, transferring visual knowledge to audio. This approach allows AST to capture long-range context in audio, once again showing that Transformers can learn powerful acoustic representations from spectrograms. AST's success opened the door to using **ViT-like architectures** for general audio understanding tasks.
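The channel-averaging trick is simple enough to show directly. In this sketch the patch-embedding weights are a tiny nested list with made-up values (2 output filters, 3 input channels, 2×2 kernels) rather than real ViT weights, and `average_rgb_filters` is a hypothetical helper name.

```python
def average_rgb_filters(weights):
    """Collapse weights[out][channel][row][col] to a single input channel by
    averaging over the channel axis, giving weights[out][1][row][col]."""
    averaged = []
    for filt in weights:
        channels = len(filt)
        rows, cols = len(filt[0]), len(filt[0][0])
        mean_kernel = [
            [sum(filt[c][r][k] for c in range(channels)) / channels for k in range(cols)]
            for r in range(rows)
        ]
        averaged.append([mean_kernel])  # keep an explicit channel axis of size 1
    return averaged

# Toy RGB patch-embedding weights: filter f has constant kernels f+1, f+2, f+3.
rgb_weights = [
    [[[float(c + 1 + f)] * 2 for _ in range(2)] for c in range(3)]
    for f in range(2)
]

mono_weights = average_rgb_filters(rgb_weights)
print(len(mono_weights), len(mono_weights[0]), mono_weights[0][0])
```

The averaged filters accept a one-channel spectrogram while starting from what the vision model already learned, instead of from random initialization.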

### CLAP (Contrastive Language-Audio Pretraining)

<img src="./media/CLAP.png" width=600>

[*CLAP: Learning Audio Concepts From Natural Language Supervision*](https://arxiv.org/pdf/2206.04769)

Analogous to how CLIP aligned images with text, **CLAP** aligns audio and text representations through contrastive learning. CLAP consists of two encoders (one for audio, one for text) trained jointly on audio–caption pairs so that corresponding audio and text map to similar embeddings in a shared space. Given an audio clip and a textual description, a contrastive objective is used to **pull together** matching audio-text pairs and push apart mismatched ones. Despite training on only ~128k audio/text pairs (far fewer than image-text datasets), this approach proved effective and enabled the same kind of *zero-shot* classification on diverse audio tasks without any explicit class labels. By learning audio concepts from natural language supervision, CLAP shows that language is just as useful for supervising audio models.

### Audio Flamingo

<img src="./media/audioflamingo.png" width=600>

[*Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities*](https://arxiv.org/pdf/2402.01831v1)

**Audio Flamingo** extends the idea of DeepMind’s Flamingo to audio. It is an *audio-language model* that aims to combine pretrained audio encoders with LLMs to enable open-ended reasoning about audio inputs. Instead of simply transcribing audio to text, Audio Flamingo feeds raw audio features directly into a language model via cross-attention, similar to how the original Flamingo handled image features. An audio feature extractor (e.g. a waveform model or a CLAP-like spectrogram encoder) produces a sequence of audio tokens, which are then fused into the LLM’s layers using learned cross-attention *gating*. This design lets the model attend to sounds (including **non-speech audio** and paralinguistic cues) during text generation. As a result, Audio Flamingo can adapt to new audio tasks with *few-shot prompts* and supports multi-turn conversations about what it “hears”. Using many of the same techniques as the vision training in the first Flamingo model family, Audio Flamingo shows that an LLM can be taught to interpret and discuss audio content by plugging in an audio modality interface.

### GAMA (Generalist Audio Model for All)

<img src="./media/GAMA.png" width=600>

[*GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities*](https://arxiv.org/pdf/2406.11768)

One step above is **GAMA**, a general-purpose large audio-language model with advanced audio understanding and reasoning abilities. Architecturally, GAMA integrates a pretrained LLM with a dedicated audio front-end, introducing an *Audio Q-Former* module that serves as a connection between the audio encoder and the language model. The audio encoder (which could be a CNN or Transformer) produces intermediate features, and the Audio Q-Former then aggregates these multi-layer audio features into a concise representation that the LLM can attend to. By fine-tuning this combined model on a large-scale audio-language dataset, GAMA learns to ground textual reasoning in audio inputs (much like how vision-language models ground text in images). After further instruction tuning, GAMA was able to answer open-ended audio-related questions requiring multi-step reasoning, outperforming other audio-language models by a substantial margin. GAMA showcases an **audio-first foundation model** that draws on BLIP-2's Q-Former approach and training curricula to equip an LLM with a broad and deep understanding of sound.

---
### Audio LLM Example

This time we'll use a specific audio model [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) with the sound effect of a car starting:

<audio src="./example_media/car_start.wav" controls></audio>



In [1]:
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.



In [2]:
audio_file_path = "./example_media/car_start.wav"

conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": audio_file_path},
        {"type": "text", "text": "Describe this sound in detail."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Load the audio file
audio, _ = librosa.load(audio_file_path, sr=processor.feature_extractor.sampling_rate)

# Process the prompt and audio together, then move every tensor to the model's device
inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

# generate() returns the prompt followed by the new tokens, so slice the prompt off
generate_ids = model.generate(**inputs, max_length=512)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


In [3]:
print(response)

The audio contains the sound of an engine starting up and revving, followed by the sound of it idling loudly.
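
One step in the generation cell is easy to overlook: for decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, so the prompt has to be sliced off before decoding. The toy example below shows the same slicing on plain Python lists; the token ids and the helper name `strip_prompt` are made up for illustration.

```python
def strip_prompt(generated, prompt_length):
    """Drop the first prompt_length ids from every sequence in the batch."""
    return [row[prompt_length:] for row in generated]

prompt_ids = [101, 7, 42, 9]               # hypothetical ids for the chat prompt
generated = [prompt_ids + [55, 56, 57]]    # batch of one: prompt plus three new ids

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)
```

Note also that `max_length` counts the prompt tokens too, so a long audio clip leaves fewer tokens for the answer.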


---
# Multi Modality and Non Text Outputs - The Future

<img src="./media/llama3_diagram.png" width=600>

[*The Llama 3 Herd of Models*](https://arxiv.org/pdf/2407.21783)

Language models initially focused solely on text understanding but have gradually expanded their capabilities to embrace and combine multiple modalities, mirroring aspects of human cognition more closely. This evolution has unfolded consistently following a three-step pattern:

1. **Specialized Encoders**: Each modality (such as images, video, or audio) has a dedicated encoder, such as Vision Transformers for images, TimeSformer for video, and the Audio Spectrogram Transformer (AST) for audio.
2. **Representation Alignment**: Modality-specific representations are aligned with textual embeddings using contrastive learning techniques, including CLIP for images and CLAP for audio, establishing a shared embedding space.
3. **Integration with LLMs**: Aligned representations integrate into large language models through specialized adapter layers, enabling multimodal generation and interaction.
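
The alignment step in the pattern above can be sketched as a symmetric contrastive (InfoNCE-style) loss, the general objective behind CLIP-style training. This is a minimal NumPy illustration, not the training code of any particular model: the function name `contrastive_loss`, the temperature of 0.07, and the random embeddings are assumptions for the example.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(modality_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch where row i of each input is a matched pair."""
    a = modality_emb / np.linalg.norm(modality_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature           # (B, B) scaled cosine similarities
    idx = np.arange(len(logits))
    # Each image/audio embedding should pick out its own caption, and vice versa
    loss_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_to_modality = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_to_text + loss_to_modality) / 2

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 32))
aligned = text_emb + 0.01 * rng.normal(size=(8, 32))   # embeddings close to their captions
shuffled = np.roll(aligned, shift=1, axis=0)           # same embeddings, wrong pairing

loss_aligned = contrastive_loss(aligned, text_emb)
loss_shuffled = contrastive_loss(shuffled, text_emb)
```

The loss is low when matched pairs sit close together in the shared space and high when the pairing is wrong, which is what pushes the encoder toward text-compatible representations.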

From this pattern, three primary architectural strategies have emerged to bridge modality-specific encoders with language models. Cross-attention layers, exemplified by systems like Flamingo and Llama 3's vision module, let language models dynamically attend to modality-specific information during generation. Alternatively, adapter modules such as BLIP-2’s Q-Former serve as bottlenecks, filtering and forwarding only the most relevant multimodal information into the language model. Lastly, in some cases, particularly with speech as in Llama 3’s speech system, encoder outputs (after a small adapter) are fed directly into the language model as token representations, enabling multimodal comprehension without adding cross-attention layers.
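
The third strategy, placing encoder outputs directly in the language model's token space, can be sketched as follows. This is a simplified NumPy illustration and not Llama 3's actual pipeline: the embedding table, the single linear `projection` used to match dimensions, the token ids, and the helper `build_input_sequence` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_audio, d_model, vocab_size = 12, 16, 100

embedding_table = rng.normal(size=(vocab_size, d_model))  # stand-in for the LLM's token embeddings
projection = rng.normal(size=(d_audio, d_model))          # hypothetical learned projection into d_model

def build_input_sequence(prefix_ids, encoder_frames, suffix_ids):
    """Interleave text embeddings with projected encoder frames as one sequence."""
    prefix = embedding_table[prefix_ids]     # (P, d_model) ordinary text tokens
    audio = encoder_frames @ projection      # (F, d_model) frames treated like tokens
    suffix = embedding_table[suffix_ids]     # (S, d_model) ordinary text tokens
    return np.concatenate([prefix, audio, suffix], axis=0)

prefix_ids = np.array([1, 5, 9])             # e.g. tokens before the audio placeholder
suffix_ids = np.array([7, 2])                # e.g. the user's question
encoder_frames = rng.normal(size=(10, d_audio))

sequence = build_input_sequence(prefix_ids, encoder_frames, suffix_ids)
```

The resulting `(15, d_model)` array is consumed by the transformer like any other embedded sequence, so no layers inside the language model need to change.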

<img src="./media/gemini_card.png" width=600>

The latest models can now support all three major modalities simultaneously. A model like Gemini-1.5-pro can process up to 7,200 images, 2 hours of video, and 19 hours of audio while maintaining strong reasoning capabilities across data types.

The use cases this unlocks for advanced reasoning beyond text are plentiful: image-guided question answering, automated video summarization, precise semantic search in multimedia databases, real-time audio transcription and sentiment analysis, context-aware robotics interactions, advanced OCR and document understanding, multimodal content generation, and more, with seamless interaction across text, voice, and visual interfaces.

I personally use these capabilities in a couple of programs:

<img src="./media/ppt2desc.png" width=400>

[ppt2desc](https://github.com/ALucek/ppt2desc) uses vision-language models to convert PowerPoint presentations into detailed textual descriptions, capturing both textual content and the visual relationships within slides.

<img src="./media/niavs.png" width=400>

[NeedleInAVidStack](https://github.com/ALucek/NeedleInAVidStack) automates video content extraction by processing audio tracks using Google's Gemini AI models, enabling precise semantic searches across extensive video collections.

<img src="./media/any2any.png" width=600>

[*NExT-GPT: Any-to-Any Multimodal LLM*](https://arxiv.org/pdf/2309.05519)

The future of AI research extends multimodal input toward omnimodal capabilities: systems that can not only process multiple input modalities but also generate across various output channels (images to audio, text to video, speech to 3D models, and beyond). As these technologies mature, we'll move toward AI systems that also begin to mirror human cognitive flexibility by translating between different forms of perception and expression, removing barriers between modalities and creating more intuitive and natural human-machine interfaces.