## Open notebook in:
| Colab |
|:------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/Transformers-in-Action/blob/main/CH06/06_audio_classification.ipynb) |


# About the Notebook

This notebook demonstrates how to perform **multimodal inference** with the **Qwen2.5-Omni-7B** model from Alibaba. Qwen2.5-Omni is designed to process **audio, video, image, and text inputs** in a single unified framework, making it suitable for tasks such as music analysis, video captioning, and speech-based interaction.

## What this notebook covers

1. **Model Setup**

   * Loads the `Qwen2_5OmniForConditionalGeneration` model and `Qwen2_5OmniProcessor`.
   * Initializes utilities for handling multimodal inputs (audio, video, images).

2. **Video Inference**

   * Downloads a sample MP4 video with audio.
   * Uses `ffmpeg` to extract and resample the audio into a 16 kHz mono WAV file.
   * Defines an `inference` function that takes the video path and returns both **text output and a synthesized audio response**.

3. **Audio Preprocessing**

   * Reads the extracted waveform with `soundfile`.
   * Ensures sampling rate compatibility (16 kHz).
   * Plays the processed audio directly in the notebook.

4. **Audio-only Inference**

   * Provides a dedicated function `inference_audio` that analyzes pure audio waveforms.
   * Accepts both system and user prompts, allowing fine-grained control over how the model interprets sound.
   * Example: identifying musical instruments, tempo, and genre from an audio clip.

5. **Custom Prompting**

   * Demonstrates structured prompts for guiding the model’s behavior.
   * Example system prompt: *“You analyze only the audio. Ignore visuals. Be concise.”*

## Key Features Shown

* Integration of **text + audio + video** inputs in one pipeline.
* Practical use of **ffmpeg** for preprocessing multimedia data.
* Generating structured text outputs from audio signals.
* Flexibility to run multimodal or unimodal (audio-only) analysis.


# Dependencies

In [1]:
!pip install -q qwen-omni-utils==0.0.8 openai==1.106.1 gdown==5.2.0

# Imports

In [2]:
import base64
import os
import subprocess
import sys
import time
import urllib.request

import audioread
import librosa
import numpy as np
import soundfile as sf
import torch
from IPython.display import HTML, Audio, Video, display
from qwen_omni_utils import process_mm_info
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

# Load model and processor

In [4]:
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")


Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}



# Inference Function

In [5]:
def inference(video_path):
    """Run the model on a video and its audio track; return (decoded text, speech waveform)."""
    messages = [
        # The model card asks for this exact system prompt when speech output is wanted.
        {"role": "system", "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
        {"role": "user", "content": [
                {"type": "video", "video": video_path},
            ]
        },
    ]
    # Render the chat template to a string; tokenization happens in the processor call below.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Load the video frames and, because use_audio_in_video=True, the audio track as well.
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
    inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=True)
    inputs = inputs.to(model.device).to(model.dtype)

    # With return_audio=True, generate returns a pair: (token ids, speech waveform).
    output = model.generate(**inputs, use_audio_in_video=True, return_audio=True)

    text = processor.batch_decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
    audio = output[1]
    return text, audio
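
The `inference` function returns a speech waveform as its second value, but the notebook does not show how to store it. The following is a minimal sketch using only the standard library. The helper name `write_pcm16_wav` is hypothetical, and the 24 kHz default is an assumption based on the Qwen2.5-Omni model card, so check it against the model version you run.

```python
import struct
import wave


def write_pcm16_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file.

    sample_rate=24000 is an assumed default for the model's speech output.
    Returns the duration of the written audio in seconds.
    """
    # Clip first so out-of-range values cannot overflow the int16 range.
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    pcm = struct.pack(
        f"<{len(clipped)}h", *(int(round(s * 32767)) for s in clipped)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return len(clipped) / sample_rate
```

Assuming `audio` is a torch tensor of float samples, a call such as `write_pcm16_wav(audio.reshape(-1).detach().cpu().float().tolist(), "reply.wav")` should produce a playable file. For long clips, `soundfile.write` on a NumPy array is likely faster than this pure-Python loop.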

# Load the audio source

In [None]:

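video_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/music.mp4"
# Colab paths; adjust them when running outside Colab
mp4_path = "/content/audio_source.mp4"
wav_path = "/content/audio_16k.wav"

# Download with retry
def download_with_retry(url, out, tries=3, delay=2.0):
    last_error = None
    for attempt in range(tries):
        try:
            urllib.request.urlretrieve(url, out)
            if os.path.getsize(out) > 0:
                return
            # An empty file counts as a failed attempt
            last_error = RuntimeError(f"Downloaded file is empty: {out}")
        except Exception as exc:
            last_error = exc
        # Wait before the next attempt, but not after the last one
        if attempt < tries - 1:
            time.sleep(delay)
    raise RuntimeError(f"Download failed after {tries} attempts: {url}") from last_error

download_with_retry(video_url, mp4_path)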

# Extract audio to 16 kHz mono WAV with ffmpeg

In [None]:
# -vn drops the video stream, -ac 1 downmixes to mono, -ar 16000 resamples to 16 kHz,
# -c:a pcm_s16le writes 16-bit PCM samples into the WAV container
cmd = [
    "ffmpeg", "-y", "-i", mp4_path, "-vn", "-ac", "1", "-ar", "16000",
    "-c:a", "pcm_s16le", "-f", "wav", wav_path
]
subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
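
The cell above captures ffmpeg's stderr, so when extraction fails the traceback does not show ffmpeg's own message. The sketch below is one possible way to make the step reusable and easier to debug. The names `build_extract_cmd` and `extract_audio` are hypothetical helpers, not part of the notebook or of any library.

```python
import shutil
import subprocess


def build_extract_cmd(src, dst, sample_rate=16000, channels=1):
    """Return the ffmpeg argument list that extracts PCM audio from a video."""
    return [
        "ffmpeg", "-y",            # overwrite dst if it already exists
        "-i", str(src),
        "-vn",                     # drop the video stream
        "-ac", str(channels),      # number of output channels
        "-ar", str(sample_rate),   # output sample rate in Hz
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM
        "-f", "wav", str(dst),
    ]


def extract_audio(src, dst, **kwargs):
    """Run ffmpeg and include the tail of its stderr in the error on failure."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    result = subprocess.run(
        build_extract_cmd(src, dst, **kwargs), capture_output=True, text=True
    )
    if result.returncode != 0:
        # ffmpeg prints a long banner first; the last lines usually hold the cause.
        raise RuntimeError(f"ffmpeg failed:\n{result.stderr[-2000:]}")
    return dst
```

With the notebook's variables, `extract_audio(mp4_path, wav_path)` would replace the inline command. Keeping the command builder separate from the runner means the arguments can be inspected or tested without ffmpeg installed.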


# Load waveform

In [None]:
audio_16k, sr = sf.read(wav_path, dtype="float32")
if sr != 16000:
    raise RuntimeError(f"Expected 16000 Hz, got {sr}")

display(Audio(audio_16k, rate=sr))
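
The sample-rate check above can be extended to cover channel count and duration before the full waveform is loaded. Below is a minimal sketch that reads only the WAV header with Python's built-in `wave` module. `describe_wav` and `find_format_problems` are hypothetical names, and the mono requirement is an assumption about what the audio-only path expects, matching the `-ac 1` setting used during extraction.

```python
import wave


def describe_wav(path):
    """Read basic properties from a PCM WAV header without loading the samples."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": frames / rate,
        }


def find_format_problems(info, expected_rate=16000):
    """Return a list of human-readable problems; an empty list means the file looks fine."""
    problems = []
    if info["sample_rate"] != expected_rate:
        problems.append(
            f"expected {expected_rate} Hz, got {info['sample_rate']} Hz"
        )
    if info["channels"] != 1:
        problems.append(f"expected mono, got {info['channels']} channels")
    return problems
```

Calling `find_format_problems(describe_wav(wav_path))` right after extraction reports every mismatch at once instead of stopping at the first one.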


# Audio-only inference

In [None]:
def inference_audio(audio_waveform, sampling_rate, prompt, sys_prompt="You are a helpful assistant."):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
        {"role": "user", "content": [
            {"type": "audio", "audio": audio_waveform, "sampling_rate": sampling_rate},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    audios, images, videos = process_mm_info(messages, use_audio_in_video=False)

    inputs = processor(
        text=text, audio=audios, images=images, videos=videos,
        return_tensors="pt", padding=True, use_audio_in_video=False
    ).to(model.device)

    # keep float tensors on model dtype
    for k, v in inputs.items():
        if hasattr(v, "dtype") and v.dtype.is_floating_point:
            inputs[k] = v.to(model.dtype)

    with torch.inference_mode():
        output = model.generate(
            **inputs, use_audio_in_video=False, return_audio=False, max_new_tokens=256
        )

    out_text = processor.batch_decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    return out_text



# A focused audio prompt

In [7]:

sys_prompt = "You analyze only the audio. Ignore visuals. Be concise."
prompt = "Identify the main instruments, tempo feel, time signature if clear, and likely genre in bullet points."

response = inference_audio(audio_16k, sr, prompt, sys_prompt=sys_prompt)
print(response[0])



system
You analyze only the audio. Ignore visuals. Be concise.
user
Identify the main instruments, tempo feel, time signature if clear, and likely genre in bullet points.
assistant
- Main instruments: marimba, xylophone, acoustic guitar
- Tempo feel: moderate
- Time signature: 4/4
- Likely genre: folk, acoustic
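
As the output above shows, the decoded string repeats the system and user turns before the model's answer. The sketch below keeps only the final assistant turn. It assumes the decoded text marks that turn with a line containing just `assistant`, as in the printed output; `extract_assistant_reply` is a hypothetical helper.

```python
def extract_assistant_reply(decoded):
    """Return the text after the last 'assistant' role line, or the whole text if none is found."""
    marker = "\nassistant\n"
    # rpartition splits on the LAST marker, so earlier turns are discarded.
    _, found, reply = decoded.rpartition(marker)
    return reply.strip() if found else decoded.strip()


example = (
    "system\nYou analyze only the audio. Ignore visuals. Be concise.\n"
    "user\nIdentify the main instruments.\n"
    "assistant\n- Main instruments: marimba, xylophone, acoustic guitar\n"
    "- Tempo feel: moderate"
)
print(extract_assistant_reply(example))
```

In the notebook this would be applied as `extract_assistant_reply(response[0])`. A more robust alternative, if string markers prove brittle, is to slice the generated token ids past the prompt length before decoding.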
