# Lab 4.1.5: Audio Transcription with Whisper

**Module:** 4.1 - Multimodal AI  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how Whisper transcription works
- [ ] Transcribe audio files with high accuracy
- [ ] Handle multiple languages and accents
- [ ] Build an Audio Q&A pipeline with LLMs
- [ ] Process audio in real-time scenarios

---

## Prerequisites

- Completed: Labs 4.1.1-4.1.4
- Knowledge of: Basic audio concepts, LLMs
- Running in: NGC PyTorch container

---

## Real-World Context

Audio transcription powers many applications we use daily:

**Industry Applications:**
- **Meeting Notes**: Automatic transcription of video calls
- **Podcasts**: Generate searchable transcripts and summaries
- **Customer Service**: Transcribe and analyze support calls
- **Medical**: Dictation to medical records
- **Legal**: Court and deposition transcription
- **Accessibility**: Real-time captions for people who are deaf or hard of hearing

**Why DGX Spark?**
- Whisper large-v3: ~1.55B parameters (~3GB of weights in 16-bit precision, ~10GB VRAM by OpenAI's estimate) - runs with room to spare
- Process hours of audio without cloud costs
- Keep sensitive recordings on-premise
- Combine with LLM for audio Q&A

---

## ELI5: How Does Whisper Work?

> **Imagine you're a super-powered listener at a party:**
>
> 1. **Hear the Sound** (Audio Input): Someone speaks, and you hear the sound waves
>
> 2. **Break It Down** (Mel Spectrogram): You mentally "see" the sound as a picture - high notes look different from low notes, loud from quiet
>
> 3. **Recognize Patterns** (Encoder): Your trained brain recognizes these sound patterns - "That's an 'S' sound, that's an 'AH' sound..."
>
> 4. **Form Words** (Decoder): You put the sounds together into words, words into sentences
>
> 5. **Write It Down** (Text Output): You transcribe what you heard!
>
> **In AI terms:**
> - **Audio → Mel Spectrogram**: Convert waveform to time-frequency representation
> - **Encoder**: Transformer that processes audio features
> - **Decoder**: Autoregressive transformer that generates text tokens
> - **Multilingual**: The original models were trained on 680,000 hours of audio covering nearly 100 languages!
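
To make the "Audio → Mel Spectrogram" step concrete, here is a minimal sketch of the arithmetic behind Whisper's input. The constants are the values I understand the `openai-whisper` package to use (16 kHz audio, 25 ms windows, 10 ms hop, 30-second chunks); treat them as assumptions and check `whisper/audio.py` in your installed version. Note that large-v3 uses 128 Mel bins rather than 80.

```python
from typing import Tuple

# Front-end constants (assumed to match the openai-whisper package).
SAMPLE_RATE = 16_000   # samples per second
N_FFT = 400            # analysis window length in samples
HOP_LENGTH = 160       # step between consecutive frames in samples
CHUNK_SECONDS = 30     # fixed-length window the encoder processes


def spectrogram_shape(n_mels: int = 80) -> Tuple[int, int]:
    """Return (mel bins, frames) for one 30-second window."""
    n_samples = CHUNK_SECONDS * SAMPLE_RATE
    n_frames = n_samples // HOP_LENGTH
    return n_mels, n_frames


window_ms = 1000 * N_FFT / SAMPLE_RATE
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE
mels, frames = spectrogram_shape()
# The encoder's second convolution uses stride 2, halving the time axis.
encoder_positions = frames // 2

print(f"window = {window_ms:.0f} ms, hop = {hop_ms:.0f} ms")
print(f"log-Mel input: {mels} x {frames}, encoder positions: {encoder_positions}")
```

Every 30 seconds of audio therefore becomes a fixed-size "image" that the encoder reads, which is why shorter clips are padded and longer ones are processed window by window.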

---

## Part 1: Environment Setup

In [None]:
# Install required packages (run once)
# !pip install openai-whisper soundfile librosa scipy -q

In [None]:
import torch
import gc
import numpy as np
import librosa
import matplotlib.pyplot as plt
import time
from typing import Dict, List, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# Check GPU
print("=" * 50)
print("GPU Configuration")
print("=" * 50)

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"Total Memory: {total_memory:.1f} GB")
else:
    print("No GPU available - Whisper will run on CPU (slower)")

print(f"\nPyTorch: {torch.__version__}")

In [None]:
def clear_gpu_memory():
    """Clear GPU memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    print("GPU memory cleared!")

def get_memory_usage():
    """Get GPU memory usage."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated(0) / 1e9
        return f"Allocated: {allocated:.2f}GB"
    return "No GPU"

print("Utility functions loaded!")

---

## Part 2: Creating Sample Audio

Let's create some sample audio to work with. We'll synthesize simple tones with NumPy; they are not speech, but they let us exercise the full pipeline without any audio files.

In [None]:
import numpy as np

def generate_sine_wave(frequency: float, duration: float, sample_rate: int = 16000) -> np.ndarray:
    """
    Generate a sine wave tone.
    
    Args:
        frequency: Frequency in Hz
        duration: Duration in seconds
        sample_rate: Sample rate in Hz
        
    Returns:
        Audio array
    """
    t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
    return np.sin(2 * np.pi * frequency * t).astype(np.float32)

def generate_dtmf_tone(digit: str, duration: float = 0.5, sample_rate: int = 16000) -> np.ndarray:
    """
    Generate a DTMF (phone dial) tone.
    
    Args:
        digit: Phone keypad digit (0-9, *, #)
        duration: Duration in seconds
        sample_rate: Sample rate
        
    Returns:
        Audio array
    """
    # DTMF frequency pairs
    dtmf_freqs = {
        '1': (697, 1209), '2': (697, 1336), '3': (697, 1477),
        '4': (770, 1209), '5': (770, 1336), '6': (770, 1477),
        '7': (852, 1209), '8': (852, 1336), '9': (852, 1477),
        '*': (941, 1209), '0': (941, 1336), '#': (941, 1477)
    }
    
    if digit not in dtmf_freqs:
        return np.zeros(int(sample_rate * duration), dtype=np.float32)
    
    low_freq, high_freq = dtmf_freqs[digit]
    t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
    
    signal = np.sin(2 * np.pi * low_freq * t) + np.sin(2 * np.pi * high_freq * t)
    return (signal * 0.5).astype(np.float32)

print("Audio generation functions ready!")

In [None]:
# Generate some test audio
sample_rate = 16000  # Whisper expects 16kHz

# Create a simple tone sequence
audio_data = np.concatenate([
    generate_sine_wave(440, 0.5, sample_rate),  # A4 note
    np.zeros(int(sample_rate * 0.1), dtype=np.float32),  # Silence
    generate_sine_wave(523, 0.5, sample_rate),  # C5 note
    np.zeros(int(sample_rate * 0.1), dtype=np.float32),
    generate_sine_wave(659, 0.5, sample_rate),  # E5 note
])

print(f"Generated {len(audio_data) / sample_rate:.2f} seconds of audio")
print(f"Sample rate: {sample_rate} Hz")
print(f"Audio shape: {audio_data.shape}")

In [None]:
# Visualize the audio
fig, axes = plt.subplots(2, 1, figsize=(12, 6))

# Waveform
time_axis = np.arange(len(audio_data)) / sample_rate
axes[0].plot(time_axis, audio_data, linewidth=0.5)
axes[0].set_xlabel('Time (s)')
axes[0].set_ylabel('Amplitude')
axes[0].set_title('Audio Waveform')
axes[0].grid(True, alpha=0.3)

# Spectrogram (close to what Whisper "sees"; Whisper itself uses a log-Mel version)
# nperseg=512 gives 32ms windows at 16kHz, good balance of time/frequency resolution
from scipy import signal

frequencies, times, Sxx = signal.spectrogram(audio_data, sample_rate, nperseg=512)
axes[1].pcolormesh(times, frequencies, 10 * np.log10(Sxx + 1e-10), shading='gouraud', cmap='viridis')
axes[1].set_xlabel('Time (s)')
axes[1].set_ylabel('Frequency (Hz)')
axes[1].set_title('Spectrogram (What Whisper Sees)')
axes[1].set_ylim(0, 2000)  # Zoom in: our test tones are all below 700 Hz

plt.tight_layout()
plt.show()

---

## Part 3: Loading Whisper

We'll use OpenAI's Whisper model for transcription. Let's load the large-v3 model for best accuracy.

In [None]:
import whisper

print("Loading Whisper large-v3...")
print(f"Memory before: {get_memory_usage()}")
start_time = time.time()

# Load the model (downloads on first run)
# Options: tiny, base, small, medium, large, large-v2, large-v3
whisper_model = whisper.load_model("large-v3")

load_time = time.time() - start_time
print(f"\nWhisper loaded in {load_time:.1f} seconds!")
print(f"Memory after: {get_memory_usage()}")

# Model info
print(f"\nModel: large-v3")
print(f"Parameters: ~1.5B")
print(f"Languages: 98+")

In [None]:
# Let's also try with the Hugging Face version for comparison
# This gives us more flexibility for integration

from transformers import WhisperProcessor, WhisperForConditionalGeneration

print("Loading Whisper from Hugging Face...")
start_time = time.time()

hf_whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
hf_whisper_model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.bfloat16  # Optimized for Blackwell
).to("cuda")

load_time = time.time() - start_time
print(f"\nHF Whisper loaded in {load_time:.1f} seconds!")
print(f"Memory after: {get_memory_usage()}")

---

## Part 4: Basic Transcription

Let's transcribe some audio! Since we don't have real speech recordings, we'll work with synthetic examples and discuss the API.

In [None]:
def transcribe_audio(
    audio: np.ndarray,
    sample_rate: int = 16000,
    language: Optional[str] = None,
    task: str = "transcribe"
) -> Dict:
    """
    Transcribe audio using Whisper.
    
    Args:
        audio: Audio data as numpy array
        sample_rate: Sample rate of audio
        language: Language code (e.g., 'en', 'es', 'zh') or None for auto-detect
        task: 'transcribe' or 'translate' (translate to English)
        
    Returns:
        Dictionary with transcription results
    """
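    # Convert to float32 first (librosa.resample requires floating-point input)
    audio = audio.astype(np.float32)

    # Resample if needed (librosa is imported at the top of the notebook)
    if sample_rate != 16000:
        audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)

    # Peak-normalize to [-1, 1]; check the absolute peak so large negative samples count too
    peak = np.abs(audio).max() if audio.size else 0.0
    if peak > 1.0:
        audio = audio / peak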
    
    start_time = time.time()
    
    # Transcribe using OpenAI Whisper
    result = whisper_model.transcribe(
        audio,
        language=language,
        task=task,
        fp16=torch.cuda.is_available()
    )
    
    transcribe_time = time.time() - start_time
    
    return {
        'text': result['text'],
        'language': result['language'],
        'segments': result['segments'],
        'duration': len(audio) / 16000,
        'transcribe_time': transcribe_time,
        'rtf': transcribe_time / (len(audio) / 16000)  # Real-time factor
    }

print("transcribe_audio() function ready!")
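
The `rtf` field returned above is the real-time factor: processing time divided by audio duration. A value below 1.0 means transcription runs faster than playback. The sketch below works through the arithmetic with made-up numbers (6 seconds to transcribe a 2-minute clip); they are illustrative, not measurements from any hardware.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many seconds of audio are processed per second of compute."""
    return 1.0 / rtf


# Hypothetical timing: 6 s of compute for 120 s of audio.
rtf = real_time_factor(processing_seconds=6.0, audio_seconds=120.0)
print(f"RTF: {rtf:.3f} (about {speedup(rtf):.0f}x faster than real time)")
```

For live captioning the RTF must stay below 1.0 with some margin, otherwise the transcript falls further behind the speaker with every chunk.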

In [None]:
# Since we don't have real speech, let's demonstrate the API on the tone sequence.
# Expect empty or nonsensical text: Whisper can hallucinate words on non-speech audio.

print("Attempting to transcribe tone sequence...")
result = transcribe_audio(audio_data)

print(f"\nTranscription: '{result['text']}'")
print(f"Detected language: {result['language']}")
print(f"Audio duration: {result['duration']:.2f}s")
print(f"Transcription time: {result['transcribe_time']:.2f}s")
print(f"Real-time factor: {result['rtf']:.2f}x")
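
The `segments` list in the result carries per-segment `start`, `end` (in seconds), and `text` fields, which is enough to produce subtitles. Below is a minimal sketch that formats such segments as SRT blocks. The helper names and the example segments are hypothetical; real segments would come from `result['segments']`.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (the SRT subtitle convention)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Turn Whisper-style segments into numbered SRT blocks."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"


# Hypothetical segments standing in for result['segments'].
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the call."},
    {"start": 2.5, "end": 6.0, "text": " Revenue grew this quarter."},
]
print(segments_to_srt(example_segments))
```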

### Working with Real Audio Files

In practice, you would load audio from files. Here's how:

In [None]:
import soundfile as sf
import librosa

def load_audio(file_path: str, target_sr: int = 16000) -> Tuple[np.ndarray, int]:
    """
    Load audio from a file.
    
    Args:
        file_path: Path to audio file (wav, mp3, flac, etc.)
        target_sr: Target sample rate
        
    Returns:
        Tuple of (audio_data, sample_rate)
    """
    # Load with librosa (handles many formats)
    audio, sr = librosa.load(file_path, sr=target_sr, mono=True)
    return audio, sr

def save_audio(file_path: str, audio: np.ndarray, sample_rate: int = 16000):
    """
    Save audio to a file.
    
    Args:
        file_path: Output file path
        audio: Audio data
        sample_rate: Sample rate
    """
    sf.write(file_path, audio, sample_rate)

# Example usage (commented since we don't have a file)
# audio, sr = load_audio("path/to/audio.wav")
# result = transcribe_audio(audio, sr)
# print(result['text'])

print("Audio I/O functions ready!")

---

## Part 5: Whisper with Hugging Face Transformers

The Hugging Face version provides more flexibility for integration with other models.

In [None]:
def transcribe_hf(
    audio: np.ndarray,
    sample_rate: int = 16000,
    language: Optional[str] = None,
    return_timestamps: bool = False
) -> Dict:
    """
    Transcribe audio using Hugging Face Whisper.
    
    Args:
        audio: Audio data as numpy array
        sample_rate: Sample rate of audio
        language: Language code (e.g., 'en', 'es')
        return_timestamps: Whether to predict segment-level timestamp tokens (they are stripped from the returned text)
        
    Returns:
        Dictionary with transcription results
    """
    # Resample if needed
    if sample_rate != 16000:
        import librosa
        audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)
    
    # Process audio
    input_features = hf_whisper_processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features.to(hf_whisper_model.device, dtype=torch.bfloat16)
    
    # Whisper auto-detects the language when `language` is None
    start_time = time.time()

    # Generate transcription
    with torch.inference_mode():
        predicted_ids = hf_whisper_model.generate(
            input_features,
            language=language,
            task="transcribe",
            return_timestamps=return_timestamps,
            max_new_tokens=440  # Leave room for the decoder prompt within Whisper's 448-token limit
        )
    
    transcribe_time = time.time() - start_time
    
    # Decode
    transcription = hf_whisper_processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]
    
    return {
        'text': transcription,
        'duration': len(audio) / 16000,
        'transcribe_time': transcribe_time,
        'rtf': transcribe_time / (len(audio) / 16000)
    }

print("transcribe_hf() function ready!")

In [None]:
# Test HF version with our tone sequence
print("Testing Hugging Face Whisper...")
result_hf = transcribe_hf(audio_data)

print(f"\nTranscription: '{result_hf['text']}'")
print(f"Audio duration: {result_hf['duration']:.2f}s")
print(f"Transcription time: {result_hf['transcribe_time']:.2f}s")

---

## Part 6: Audio Q&A Pipeline

Now let's build a complete pipeline that:
1. Transcribes audio
2. Uses an LLM to answer questions about the content

In [None]:
class AudioQA:
    """
    Audio Question-Answering Pipeline.
    
    Transcribes audio and answers questions using an LLM.
    """
    
    def __init__(self, whisper_model, whisper_processor):
        """
        Initialize the Audio QA system.
        
        Args:
            whisper_model: Loaded Whisper model
            whisper_processor: Whisper processor
        """
        self.whisper_model = whisper_model
        self.whisper_processor = whisper_processor
        self.llm = None
        self.llm_tokenizer = None
        self.transcripts = {}  # Store transcripts by ID
        
    def transcribe(self, audio: np.ndarray, audio_id: str, sample_rate: int = 16000) -> str:
        """
        Transcribe audio and store the result.
        
        Args:
            audio: Audio data
            audio_id: Unique identifier for this audio
            sample_rate: Sample rate
            
        Returns:
            Transcription text
        """
        # Resample if needed
        if sample_rate != 16000:
            import librosa
            audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)
        
        # Process audio
        input_features = self.whisper_processor(
            audio,
            sampling_rate=16000,
            return_tensors="pt"
        ).input_features.to(self.whisper_model.device, dtype=torch.bfloat16)
        
        # Generate transcription
        with torch.inference_mode():
            predicted_ids = self.whisper_model.generate(
                input_features,
                max_new_tokens=440  # Leave room for the decoder prompt within Whisper's 448-token limit
            )
        
        # Decode
        transcription = self.whisper_processor.batch_decode(
            predicted_ids,
            skip_special_tokens=True
        )[0]
        
        # Store
        self.transcripts[audio_id] = {
            'text': transcription,
            'duration': len(audio) / 16000,
            'timestamp': time.time()
        }
        
        return transcription
    
    def load_llm(self):
        """Load the LLM for answering questions."""
        if self.llm is not None:
            return
            
        from transformers import AutoTokenizer, AutoModelForCausalLM
        
        print("Loading LLM for Q&A...")
        model_id = "Qwen/Qwen2.5-7B-Instruct"
        
        self.llm_tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.llm = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        print("LLM loaded!")
    
    def ask(self, question: str, audio_id: Optional[str] = None) -> Dict:
        """
        Ask a question about transcribed audio.
        
        Args:
            question: Question to ask
            audio_id: Specific audio to query (None = all)
            
        Returns:
            Dictionary with answer and sources
        """
        if not self.transcripts:
            return {'answer': "No audio has been transcribed yet.", 'sources': []}
        
        if self.llm is None:
            self.load_llm()
        
        # Get relevant transcripts
        if audio_id:
            if audio_id not in self.transcripts:
                return {'answer': f"Audio '{audio_id}' not found.", 'sources': []}
            context = f"Transcript from {audio_id}:\n{self.transcripts[audio_id]['text']}"
            sources = [audio_id]
        else:
            context_parts = []
            for aid, data in self.transcripts.items():
                context_parts.append(f"Transcript from {aid}:\n{data['text']}")
            context = "\n\n".join(context_parts)
            sources = list(self.transcripts.keys())
        
        # Create prompt
        messages = [
            {"role": "system", "content": "You are a helpful assistant that answers questions based on audio transcripts. Be concise and accurate."},
            {"role": "user", "content": f"Based on the following audio transcript(s), answer the question.\n\n{context}\n\nQuestion: {question}"}
        ]
        
        text = self.llm_tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        
        inputs = self.llm_tokenizer(text, return_tensors="pt").to(self.llm.device)
        
        with torch.inference_mode():
            outputs = self.llm.generate(
                **inputs,
                max_new_tokens=300,
                do_sample=True,
                temperature=0.7
            )
        
        response = self.llm_tokenizer.decode(
            outputs[0][inputs.input_ids.shape[1]:],
            skip_special_tokens=True
        )
        
        return {
            'question': question,
            'answer': response,
            'sources': sources
        }
    
    def summarize(self, audio_id: str) -> str:
        """
        Summarize a transcript.
        
        Args:
            audio_id: Audio to summarize
            
        Returns:
            Summary text
        """
        result = self.ask("Summarize the main points from the transcript.", audio_id)
        return result['answer']
    
    def list_transcripts(self) -> List[Dict]:
        """List all transcribed audio."""
        return [
            {'id': aid, 'duration': data['duration'], 'preview': data['text'][:100]}
            for aid, data in self.transcripts.items()
        ]

print("AudioQA class defined!")

In [None]:
# Create a sample transcript manually (since we don't have speech audio)
# In practice, this would come from actual transcription

sample_transcript = """
Welcome to the quarterly earnings call for TechCorp. I'm the CEO, and today I'll discuss our Q4 results.

Revenue reached 3.4 million dollars, up 23% from the previous quarter. Our AI services division 
saw particularly strong growth, contributing 60% of total revenue. We're excited about the 
upcoming product launch in March.

Operating expenses were well controlled at 1.8 million, giving us a healthy profit margin of 47%.
We plan to invest heavily in R&D next year, focusing on multimodal AI capabilities.

Questions from analysts:
- Analyst: What's driving the AI services growth?
- CEO: Enterprise adoption of our document processing solution has accelerated.
- Analyst: Any concerns about competition?
- CEO: We maintain our competitive edge through our unified platform approach.
"""

# Initialize the Audio QA system
audio_qa = AudioQA(hf_whisper_model, hf_whisper_processor)

# Manually add the sample transcript (simulating transcription)
audio_qa.transcripts["earnings_call_q4"] = {
    'text': sample_transcript,
    'duration': 120.0,  # Simulated 2 minutes
    'timestamp': time.time()
}

print("Sample transcript added!")
print(f"Transcripts: {audio_qa.list_transcripts()}")

In [None]:
# Ask questions about the transcript
print("\n" + "=" * 50)
print("AUDIO Q&A DEMO")
print("=" * 50)

questions = [
    "What was the revenue in Q4?",
    "What percentage of revenue came from AI services?",
    "What is the company planning for next year?"
]

for q in questions:
    print(f"\nQ: {q}")
    result = audio_qa.ask(q, "earnings_call_q4")
    print(f"A: {result['answer']}")

In [None]:
# Get a summary
print("\n" + "=" * 50)
print("TRANSCRIPT SUMMARY")
print("=" * 50)

summary = audio_qa.summarize("earnings_call_q4")
print(f"\n{summary}")
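
When `ask()` is called without an `audio_id`, it concatenates every stored transcript, which can outgrow the LLM's context window once many recordings are stored. The sketch below shows one way to cap the context: take transcripts newest-first until a character budget is used up. `build_context` and `max_chars` are hypothetical names, and a character count is only a rough proxy for tokens.

```python
from typing import Dict, List, Tuple


def build_context(transcripts: Dict[str, Dict], max_chars: int = 8000) -> Tuple[str, List[str]]:
    """Concatenate transcripts newest-first until the character budget is spent.

    The last transcript that fits only partially is truncated. The blank-line
    separators between transcripts are not counted against the budget.
    """
    ordered = sorted(
        transcripts.items(),
        key=lambda item: item[1]["timestamp"],
        reverse=True,
    )
    parts: List[str] = []
    sources: List[str] = []
    used = 0
    for audio_id, data in ordered:
        remaining = max_chars - used
        if remaining <= 0:
            break
        block = f"Transcript from {audio_id}:\n{data['text']}"[:remaining]
        parts.append(block)
        sources.append(audio_id)
        used += len(block)
    return "\n\n".join(parts), sources


# Hypothetical store with the same shape as AudioQA.transcripts.
demo_store = {
    "standup_monday": {"text": "x" * 50, "duration": 60.0, "timestamp": 1.0},
    "standup_tuesday": {"text": "y" * 50, "duration": 60.0, "timestamp": 2.0},
}
context, sources = build_context(demo_store, max_chars=100)
print(sources)
print(len(context))
```

A retrieval step (for example, embedding search over transcript chunks) would scale better than truncation, but this keeps the pipeline simple.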

---

## Part 7: Whisper Model Comparison

Let's understand the different Whisper model sizes:

In [None]:
# Whisper model comparison
model_info = [
    ("tiny", "39M", "~1GB", "Fastest, basic accuracy"),
    ("base", "74M", "~1GB", "Good for quick transcription"),
    ("small", "244M", "~2GB", "Balanced speed/accuracy"),
    ("medium", "769M", "~5GB", "High accuracy"),
    ("large", "1.55B", "~10GB", "Original large model (v1)"),
    ("large-v2", "1.55B", "~10GB", "Improved multilingual"),
    ("large-v3", "1.55B", "~10GB", "Most accurate; uses 128 Mel bins")
]

print("Whisper Model Comparison")
print("=" * 70)
print(f"{'Model':<12} {'Params':<10} {'VRAM':<10} {'Notes'}")
print("-" * 70)
for model, params, vram, notes in model_info:
    print(f"{model:<12} {params:<10} {vram:<10} {notes}")
print("\n* DGX Spark can easily run large-v3 with room to spare!")
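
On smaller GPUs the table can drive model selection. Here is a minimal sketch that picks the largest model whose VRAM estimate fits a given budget. The figures are copied from the comparison table above; `pick_model` and the 1 GB headroom default are assumptions for illustration.

```python
from typing import List, Optional, Tuple

# (model name, approximate VRAM in GB), smallest first, from the comparison table.
MODEL_VRAM_GB: List[Tuple[str, float]] = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def pick_model(free_vram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the largest model that fits in the budget, or None if none fits."""
    budget = free_vram_gb - headroom_gb
    best = None
    for name, vram in MODEL_VRAM_GB:
        if vram <= budget:
            best = name  # later entries are larger, so keep overwriting
    return best


for free in (0.5, 2, 8, 24):
    print(f"{free:>4} GB free -> {pick_model(free)}")
```

In the notebook, `free_vram_gb` could be derived from `torch.cuda.mem_get_info()`, remembering that other loaded models (such as the Q&A LLM) share the same memory.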

---

## Try It Yourself: Build a Meeting Transcription System

Create a system that:
1. Accepts audio files (simulated or real)
2. Transcribes them
3. Extracts action items and key decisions
4. Generates meeting minutes

<details>
<summary>Hint: Action Item Extraction</summary>

```python
def extract_action_items(transcript: str) -> List[str]:
    prompt = f"""
    Extract all action items from this meeting transcript.
    Format as a numbered list.
    
    Transcript:
    {transcript}
    """
    # Use the LLM to extract...
```
</details>
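
Once the LLM returns its answer, the numbered list still has to be turned into Python data. The following is a rough sketch of that parsing step, assuming the model replies with one item per line prefixed by `1.`, `2)`, or a bullet character; `parse_list_items` and the sample reply are hypothetical.

```python
import re
from typing import List

# Matches "1. text", "2) text", "- text", "* text" or "• text".
_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+(.*\S)\s*$")


def parse_list_items(llm_output: str) -> List[str]:
    """Extract list items from an LLM reply, ignoring surrounding prose."""
    items = []
    for line in llm_output.splitlines():
        match = _ITEM.match(line)
        if match:
            items.append(match.group(1))
    return items


# Hypothetical LLM reply.
sample_reply = """Here are the action items:
1. Send the Q4 deck to the board
2) Schedule the March launch review
- Follow up with the analyst
"""
for item in parse_list_items(sample_reply):
    print(item)
```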

In [None]:
# YOUR CODE HERE
# Build your meeting transcription system!



---

## Common Mistakes

### Mistake 1: Wrong Sample Rate

```python
# Wrong - passing audio at its native sample rate
audio, sr = librosa.load("file.wav", sr=None)  # sr=None keeps the file's rate, e.g. 44100
result = whisper_model.transcribe(audio)  # Expects 16kHz!

# Right - resample to 16kHz
audio, sr = librosa.load("file.wav", sr=None)
audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)
result = whisper_model.transcribe(audio_16k)  # Correct!
```

### Mistake 2: Audio Too Long

```python
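# Wrong - feeding an entire 2-hour recording to the Hugging Face model in one call
result = transcribe_hf(long_audio)  # The processor keeps only the first 30 seconds

# Right - chunk the audio (whisper_model.transcribe() already windows long audio internally)
chunk_length = 30 * 16000  # 30 seconds at 16kHz
texts = []
for i in range(0, len(long_audio), chunk_length):
    chunk = long_audio[i:i+chunk_length]
    texts.append(transcribe_hf(chunk)['text'].strip())
full_text = " ".join(texts)
# Note: fixed boundaries can cut words in half; overlapping chunks reduce this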
```
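
Fixed 30-second boundaries can split a word across two chunks. One common mitigation is to overlap neighbouring chunks slightly. The sketch below only computes the sample offsets; `chunk_spans` and the 2-second overlap are assumptions for illustration, and merging the duplicated words in the overlap region is left out.

```python
from typing import Iterator, Tuple


def chunk_spans(
    n_samples: int,
    sample_rate: int = 16000,
    chunk_seconds: float = 30.0,
    overlap_seconds: float = 2.0,
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample offsets of overlapping chunks covering the audio."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        yield start, end
        if end == n_samples:
            break  # the final chunk reached the end of the audio
        start += step


# 70 seconds of audio at 16 kHz.
for start, end in chunk_spans(70 * 16000):
    print(f"{start / 16000:6.1f}s -> {end / 16000:6.1f}s")
```

Each span can then be sliced with `long_audio[start:end]` and transcribed as in the loop above.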

### Mistake 3: Ignoring Language Setting

```python
# Wrong - letting Whisper guess language for known content
result = whisper_model.transcribe(spanish_audio)  # Might misdetect

# Right - specify language when known
result = whisper_model.transcribe(spanish_audio, language="es")  # Skips language detection
```

### Mistake 4: Not Handling Stereo Audio

```python
# Wrong - stereo audio (2 channels)
audio = load_stereo_audio()  # Shape: (2, samples)
result = whisper_model.transcribe(audio)  # May fail

# Right - convert to mono
audio = audio.mean(axis=0)  # Average channels; use axis=1 for (samples, 2) arrays such as soundfile returns
result = whisper_model.transcribe(audio)  # Works!
```

---

## Checkpoint

You've learned:
- How Whisper transcription works
- How to transcribe audio with different model sizes
- How to use both OpenAI Whisper and Hugging Face versions
- How to build an Audio Q&A pipeline
- How to combine transcription with LLM analysis

### Key Takeaways

1. **Whisper is multilingual**: Works on 98+ languages out of the box
2. **Sample rate matters**: Always use 16kHz for Whisper
3. **Model size trade-offs**: Larger = more accurate but slower
4. **Combine with LLMs**: Transcription + LLM = powerful Q&A system

---

## Challenge (Optional)

### Build a Real-Time Transcription System

Create a system that:
1. Processes audio in chunks (simulating streaming)
2. Transcribes each chunk
3. Maintains context across chunks
4. Provides real-time text output
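
For step 3, one possible approach is to feed the tail of the previous output back to Whisper as a prompt. The sketch below keeps a rolling window of recent words; `RollingContext` and the 50-word cap are hypothetical choices, not part of any library.

```python
from collections import deque


class RollingContext:
    """Keep the most recent words of a transcript for use as a decoding prompt."""

    def __init__(self, max_words: int = 50):
        # deque with maxlen silently drops the oldest words when full
        self._words = deque(maxlen=max_words)

    def update(self, chunk_text: str) -> None:
        """Append the words of a newly transcribed chunk."""
        self._words.extend(chunk_text.split())

    def prompt(self) -> str:
        """Return the retained words as a single string."""
        return " ".join(self._words)


context = RollingContext(max_words=5)
context.update("Welcome to the quarterly earnings call")
context.update("for TechCorp")
print(context.prompt())
```

With the OpenAI package, the string could be passed as `whisper_model.transcribe(chunk, initial_prompt=context.prompt())`. Keep the window short, since the prompt shares the decoder's limited token budget.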

In [None]:
# YOUR CHALLENGE CODE HERE

class StreamingTranscriber:
    """
    Real-time streaming transcription.
    
    Processes audio in chunks and provides incremental transcription.
    """
    
    def __init__(self, chunk_size_seconds: float = 5.0):
        """
        Initialize the streaming transcriber.
        
        Args:
            chunk_size_seconds: Size of each audio chunk
        """
        # TODO: Implement
        pass
    
    def process_chunk(self, audio_chunk: np.ndarray) -> str:
        """
        Process a single audio chunk.
        
        Args:
            audio_chunk: Audio data for this chunk
            
        Returns:
            Transcription for this chunk
        """
        # TODO: Implement
        pass

---

## Further Reading

- [Whisper Paper](https://arxiv.org/abs/2212.04356)
- [OpenAI Whisper GitHub](https://github.com/openai/whisper)
- [Hugging Face Whisper](https://huggingface.co/openai/whisper-large-v3)
- [Faster Whisper (CTranslate2)](https://github.com/guillaumekln/faster-whisper)

---

## Cleanup

In [None]:
# Clean up
if 'whisper_model' in dir():
    del whisper_model
if 'hf_whisper_model' in dir():
    del hf_whisper_model
    del hf_whisper_processor
if 'audio_qa' in dir():
    if audio_qa.llm is not None:
        del audio_qa.llm
        del audio_qa.llm_tokenizer
    del audio_qa

clear_gpu_memory()
print(f"Final memory state: {get_memory_usage()}")
print("\nModule 4.1 complete! Congratulations!")
print("\nYou've learned:")
print("  - Vision-Language Models (LLaVA, Qwen-VL)")
print("  - Image Generation (SDXL, ControlNet, Flux)")
print("  - Multimodal RAG (CLIP + ChromaDB)")
print("  - Document AI (VLM-based extraction)")
print("  - Audio Transcription (Whisper + LLM Q&A)")