## Open notebook in:
| Colab |
|:------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/transformers-the-definitive-guide/blob/master/CH06/ch06_Qwen2_Audio_different_audio_tasks.ipynb) |

# About this Notebook

This notebook shows how to interact with **Qwen2-Audio-7B-Instruct**, a large audio-language model that takes **audio** and **text** inputs and generates text responses. It supports chat-style interactions about audio clips, which makes it well suited to tasks such as **speaker analysis**, **sound event detection**, and **speech transcription**.

### Steps Included:

1. **Model and Processor Setup**:
   The notebook loads the `Qwen2-Audio-7B-Instruct` model and its processor from the Hugging Face Hub. Passing `device_map="auto"` lets the library place the weights automatically, using the GPU if one is available.

2. **Prompt Formatting**:
   Chat-style prompts are built with the processor's `apply_chat_template` method. Each prompt can combine audio clips (remote URLs or local files) with a text query.

3. **Audio Preprocessing**:
   Audio clips are downloaded or loaded from local storage using `librosa`, then resampled to match the model’s expected sampling rate.

4. **Batch Input Construction**:
   The processor tokenizes the text prompts and converts the corresponding audio waveforms into model input features, then pads everything into a single batch. The resulting tensors are moved to the model's device.

5. **Inference and Response Generation**:
   The model generates a textual response to each multimodal prompt using the `generate` method. The response is then decoded into human-readable text.

6. **Use Cases Demonstrated**:

   * **Gender and age prediction** from voice
   * **Sound recognition** (e.g., detecting glass breaking or gunshots)
   * **Speech transcription** from audio files

This notebook illustrates how modern multimodal language models can move beyond text and process **real-world audio inputs** with flexible, instruction-following behavior, bridging the gap between natural language understanding and audio signal processing.
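
Steps 3 and 4 have a practical consequence worth checking before a clip is sent to the model. The sketch below assumes the processor wraps a Whisper-style feature extractor configured for 16 kHz audio and a 30-second window. Both values are assumptions here; the actual settings can be read from `processor.feature_extractor.sampling_rate` and `processor.feature_extractor.chunk_length`. Under that assumption, audio beyond the window would be cut off rather than seen by the model. `clip_summary` is a hypothetical helper, not part of any library.

```python
# Assumed defaults for a Whisper-style feature extractor; verify them against
# processor.feature_extractor before relying on these numbers.
ASSUMED_SAMPLING_RATE = 16_000  # samples per second
ASSUMED_WINDOW_SECONDS = 30     # longest clip the extractor is assumed to keep


def clip_summary(num_samples, sampling_rate=ASSUMED_SAMPLING_RATE,
                 window_seconds=ASSUMED_WINDOW_SECONDS):
    """Return the clip duration and how many seconds fall outside the window."""
    duration = num_samples / sampling_rate
    overflow = max(0.0, duration - window_seconds)
    return {"duration_s": duration, "truncated_s": overflow}


# A 45-second clip at 16 kHz would lose its last 15 seconds under these assumptions.
print(clip_summary(45 * 16_000))
```

In the notebook, `len(audio)` for a waveform returned by `librosa.load(..., sr=sr)` could be passed as `num_samples`.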


# Installs

In [7]:
!pip install transformers==4.53.1 gdown -qqq

# Imports

In [3]:
from io import BytesIO
from urllib.request import urlopen
import os
import librosa
import torch
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

# Load model and processor

In [4]:
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto"
)

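
As a rough sanity check on whether the weights fit on a single GPU, the sketch below sums the five shard sizes reported by the Hub for this checkpoint (about 3.91, 3.98, 3.98, 3.64, and 1.28 GB) and scales the total by bytes per parameter. It assumes the checkpoint is stored in `bfloat16` and that, without an explicit `torch_dtype`, the weights are loaded as `float32`. Both points are assumptions to confirm against the model's `config.json` and the installed `transformers` version. `estimate_weights_gb` is a hypothetical helper.

```python
# Shard sizes in GB as reported during the checkpoint download.
SHARD_SIZES_GB = [3.91, 3.98, 3.98, 3.64, 1.28]

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def estimate_weights_gb(checkpoint_gb, stored_dtype, load_dtype):
    """Scale the on-disk size to the in-memory size for a given load dtype."""
    billions_of_params = checkpoint_gb / BYTES_PER_PARAM[stored_dtype]
    return billions_of_params * BYTES_PER_PARAM[load_dtype]


checkpoint_gb = sum(SHARD_SIZES_GB)

# Assumption: the checkpoint on disk is stored in bfloat16.
for dtype in ("bfloat16", "float32"):
    size = estimate_weights_gb(checkpoint_gb, "bfloat16", dtype)
    print(f"{dtype}: ~{size:.1f} GB of weights")
```

If the `float32` estimate exceeds the available GPU memory, `device_map="auto"` may place some layers on the CPU, which slows generation. Passing `torch_dtype=torch.bfloat16` to `from_pretrained` is one common way to reduce the footprint.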

# Load and prepare audio data

In [8]:

# Download the file from Google Drive using its file ID
file_id = "1IymJaN-zOk7YgZCWPsTmzK2lp45mgfjL"
destination = "gunshots.wav"
!gdown --id {file_id} -O {destination}

# Check if the file was downloaded
import os
if os.path.exists(destination):
    print(f"File downloaded successfully: {destination}")
else:
    print("Download failed.")


Downloading...
From: https://drive.google.com/uc?id=1IymJaN-zOk7YgZCWPsTmzK2lp45mgfjL
To: /content/gunshots.wav
100% 320k/320k [00:00<00:00, 123MB/s]
File downloaded successfully: gunshots.wav
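
Checking that the file exists does not confirm that it contains valid audio. As an optional extra check, the sketch below uses Python's standard `wave` module to read the header of a WAV file and report its sample rate, channel count, and duration. It assumes the file is uncompressed PCM WAV, which the `wave` module requires; `describe_wav` is a hypothetical helper name.

```python
import wave


def describe_wav(path):
    """Read basic header information from an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
        return {
            "sample_rate": sample_rate,
            "channels": wav_file.getnchannels(),
            "duration_s": n_frames / sample_rate,
        }
```

Calling `describe_wav("gunshots.wav")` would raise an error (typically `wave.Error`) if the download produced something other than a PCM WAV file, such as an HTML error page.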


In [9]:
conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
]

conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]

conversation3 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "/content/gunshots.wav"},
        {"type": "text", "text": "What is that sound, and what does the person say?"},
    ]},
]

conversation4 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
        {"type": "text", "text": "Can you guess the speaker's gender and age?"}
    ]},
]

conversations = [conversation1, conversation2, conversation3, conversation4]
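
Because the prompts and the waveforms are passed to the processor as two parallel lists, the audio entries have to be collected in the same order as the conversations. The sketch below walks a list of conversations in that order and records each audio source along with whether it looks like a remote URL. `collect_audio_sources` is a hypothetical helper, and the example paths are placeholders rather than real files.

```python
def collect_audio_sources(conversations):
    """Return (source, is_remote) pairs in the order the audio entries appear."""
    sources = []
    for conversation in conversations:
        for message in conversation:
            content = message["content"]
            if not isinstance(content, list):
                continue  # plain-text messages carry no audio entries
            for element in content:
                if element.get("type") == "audio":
                    source = element["audio_url"]
                    is_remote = source.startswith(("http://", "https://"))
                    sources.append((source, is_remote))
    return sources


# Placeholder conversations in the same structure as the ones above.
example = [
    [{"role": "user", "content": [
        {"type": "audio", "audio_url": "https://example.com/clip.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]}],
    [{"role": "user", "content": [
        {"type": "audio", "audio_url": "/content/local_clip.wav"},
        {"type": "text", "text": "What does the person say?"},
    ]}],
]

for source, is_remote in collect_audio_sources(example):
    print("remote" if is_remote else "local", source)
```

Listing the sources first makes it easy to confirm that the number of audio entries matches what the prompts expect before any file is downloaded or decoded.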


# Process and predict

In [10]:

# Prepare inputs
text = [processor.apply_chat_template(conv, add_generation_prompt=True, tokenize=False) for conv in conversations]

sr = processor.feature_extractor.sampling_rate
audios = []

for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_path = ele['audio_url']
                    if audio_path.startswith("http"):
                        # Remote URL
                        audio, _ = librosa.load(BytesIO(urlopen(audio_path).read()), sr=sr)
                    else:
                        # Load local file
                        if not os.path.exists(audio_path):
                            raise FileNotFoundError(f"Local audio file not found: {audio_path}")
                        audio, _ = librosa.load(audio_path, sr=sr)
                    audios.append(audio)

# Batch processing
inputs = processor(text=text, audio=audios, return_tensors="pt", padding=True)

# Move all tensor inputs to the correct device
device = model.device
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

# Generate response
generate_ids = model.generate(**inputs, max_new_tokens=512)
generate_ids = generate_ids[:, inputs["input_ids"].size(1):]

# Decode
responses = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

# Print outputs
for i, r in enumerate(responses):
    print(f"Response {i+1}: {r}")


It is strongly recommended to pass the `sampling_rate` argument to `WhisperFeatureExtractor()`. Failing to do so can result in silent errors that might be hard to debug.


Response 1: It is the sound of glass breaking.
Response 2: The original content of this audio is: 'Mister Quiller is the apostle of the middle classes and we are glad to welcome his gospel.'
Response 3: In the audio, there is the sound of gunfire and artillery fire happening in the distance, and a male voice speaking English saying 'Can you guess where I am right now?' with a neutral mood.
Response 4: The speaker is female and in her twenties.
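
The slicing step before decoding deserves a brief illustration. For a decoder-only model, `generate` returns each prompt's token IDs followed by the newly generated ones, and every row of a padded batch shares one prompt length, so dropping the first `prompt_len` columns leaves only the response. The sketch below mimics this with plain Python lists; the token IDs are made up and `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(generated_rows, prompt_len):
    """Drop the first prompt_len token IDs from every row of a batch."""
    return [row[prompt_len:] for row in generated_rows]


prompt_len = 3
generated = [
    [0, 11, 12, 101, 102],    # 0 stands in for a padding token
    [21, 22, 23, 201, 202],
]

print(strip_prompt(generated, prompt_len))  # [[101, 102], [201, 202]]
```

Without this step, the decoded text would repeat the prompt before each answer.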
