# Lab 4.1.5: Audio Transcription with Whisper

**Module:** 4.1 - Multimodal AI  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how Whisper converts speech to text
- [ ] Transcribe audio files with accurate timestamps
- [ ] Detect languages automatically
- [ ] Translate audio from any supported language to English
- [ ] Build an audio Q&A pipeline with LLMs

---

## 📚 Prerequisites

- Completed: Module 3 (LLM Systems)
- Knowledge of: Basic audio concepts, Python
- Running in: NGC PyTorch container

---

## 🌍 Real-World Context

Audio transcription is everywhere:

- **Meetings**: Automatic meeting notes and summaries
- **Podcasts**: Generate searchable transcripts
- **Accessibility**: Subtitles for videos
- **Customer Service**: Transcribe and analyze support calls
- **Healthcare**: Dictation for medical records

---

## 🧒 ELI5: How Does Whisper Work?

> **Imagine you're learning to understand a foreign language by watching thousands of movies with subtitles.** You start to recognize that certain sounds match certain words.
>
> Whisper learned the same way! It was trained on 680,000 hours of audio with transcripts from the internet. Now it can:
> 1. **Listen** to any audio in almost any language
> 2. **Recognize** the words being spoken
> 3. **Write out** exactly what was said, with punctuation!
>
> **In AI terms:** Whisper is a transformer model that converts mel spectrograms (visual representations of audio) into text tokens, one at a time, similar to how GPT generates text.
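
To make the "mel spectrogram in, text tokens out" description concrete, the sketch below works out how much data the model sees per window. The constants (16 kHz sample rate, 30-second windows, a hop of 160 samples, 80 mel bins for most model sizes) mirror the values defined in `whisper.audio`, but treat them here as stated assumptions rather than a substitute for the library itself.

```python
# Assumed Whisper front-end constants (mirroring whisper.audio)
SAMPLE_RATE = 16_000   # samples per second
CHUNK_SECONDS = 30     # audio is processed in 30-second windows
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)
N_MELS = 80            # mel bins per frame (large-v3 uses 128)

samples_per_chunk = SAMPLE_RATE * CHUNK_SECONDS
frames_per_chunk = samples_per_chunk // HOP_LENGTH

print(f"Samples per window: {samples_per_chunk}")
print(f"Spectrogram frames per window: {frames_per_chunk}")
print(f"Spectrogram shape: ({N_MELS}, {frames_per_chunk})")
```

Each 30-second window therefore becomes a fixed-size "image" that the encoder processes before the decoder starts emitting text tokens.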

---

## Part 1: Environment Setup

Let's set up Whisper for audio transcription.

In [None]:
# Check GPU
import torch

print("=" * 50)
print("DGX Spark Environment Check")
print("=" * 50)

if torch.cuda.is_available():
    device = torch.cuda.get_device_properties(0)
    print(f"GPU: {device.name}")
    print(f"Memory: {device.total_memory / 1024**3:.1f} GB")
    print("\n💡 Whisper large-v3 uses ~10GB - easily fits!")
else:
    print("WARNING: No GPU detected! Whisper will run slowly on CPU.")

In [None]:
# Install dependencies (run once)
# !pip install openai-whisper soundfile librosa scipy numpy matplotlib

In [None]:
# Import libraries
import gc
import time
import json
from pathlib import Path
from typing import Optional, Union, List, Dict, Any
from dataclasses import dataclass, field

import torch
import numpy as np
import matplotlib.pyplot as plt

print("✅ Libraries imported!")

In [None]:
# Check for audio dependencies
try:
    import whisper
    print("✅ OpenAI Whisper installed")
except ImportError:
    print("❌ OpenAI Whisper not installed. Run: pip install openai-whisper")

try:
    import librosa
    print("✅ librosa installed")
except ImportError:
    print("❌ librosa not installed. Run: pip install librosa")

try:
    import soundfile as sf
    print("✅ soundfile installed")
except ImportError:
    print("❌ soundfile not installed. Run: pip install soundfile")

---

## Part 2: Understanding Whisper Models

Whisper comes in different sizes - larger models are more accurate but slower.

In [None]:
# Whisper model comparison
print("📊 Whisper Model Comparison")
print("=" * 70)
print(f"{'Model':<15} {'Parameters':<12} {'VRAM':<10} {'Speed':<12} {'Best For'}")
print("-" * 70)

models = [
    ("tiny", "39M", "~1GB", "~32x", "Testing, quick previews"),
    ("base", "74M", "~1GB", "~16x", "Simple recordings"),
    ("small", "244M", "~2GB", "~6x", "Good accuracy/speed balance"),
    ("medium", "769M", "~5GB", "~2x", "Better accuracy"),
    ("large-v3", "1.55B", "~10GB", "~1x", "Best accuracy (recommended)"),
]

for name, params, vram, speed, use_case in models:
    vram_gb = float(vram.replace("~", "").replace("GB", ""))
    fits = "✅" if vram_gb < 20 else "⚠️"  # Flag models that need 20GB or more
    print(f"{fits} {name:<13} {params:<12} {vram:<10} {speed:<12} {use_case}")

print("\n💡 With 128GB on DGX Spark, use large-v3 for best quality!")

In [None]:
import whisper

# Load Whisper model
# Use large-v3 for best quality, or smaller models for speed
MODEL_SIZE = "base"  # Start with base for quick testing, change to "large-v3" for production

print(f"Loading Whisper {MODEL_SIZE}...")
start_time = time.time()

model = whisper.load_model(MODEL_SIZE)

print(f"\n✅ Loaded in {time.time() - start_time:.1f}s")
print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

---

## Part 3: Creating Sample Audio

Let's create a sample audio file for testing. We'll generate a synthetic tone with a slowly varying amplitude envelope. It is not real speech, so expect an empty or meaningless transcription.

In [None]:
import soundfile as sf
from scipy.io import wavfile

def create_sample_audio(output_path: str = "sample_audio.wav", duration: float = 5.0):
    """
    Create a simple sample audio file.
    
    Note: For real testing, use actual speech recordings!
    """
    # Sample rate for speech
    sample_rate = 16000
    
    # Generate duration's worth of samples
    # endpoint=False keeps the spacing between samples at exactly 1 / sample_rate
    t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False, dtype=np.float32)
    
    # Create a more complex waveform (not real speech, just for testing)
    # Real speech would come from recordings
    audio = 0.3 * np.sin(2 * np.pi * 440 * t)  # A4 note
    audio += 0.2 * np.sin(2 * np.pi * 880 * t)  # A5 note
    audio += 0.1 * np.sin(2 * np.pi * 220 * t)  # A3 note
    
    # Add some variation
    envelope = np.exp(-t / 2) * (1 + 0.5 * np.sin(2 * np.pi * 2 * t))
    audio = audio * envelope
    
    # Normalize
    audio = audio / np.max(np.abs(audio)) * 0.8
    
    # Save as WAV
    sf.write(output_path, audio, sample_rate)
    
    return output_path, sample_rate, len(audio) / sample_rate

# Create sample (note: this won't produce actual speech!)
sample_path, sr, duration = create_sample_audio()
print(f"✅ Created sample audio: {sample_path}")
print(f"   Sample rate: {sr} Hz")
print(f"   Duration: {duration:.1f}s")

print("\n⚠️  Note: This is a synthetic tone, not real speech.")
print("   For proper testing, use an actual audio recording!")

In [None]:
# Visualize the audio
import librosa
import librosa.display

def visualize_audio(audio_path: str):
    """
    Visualize an audio file's waveform and spectrogram.
    """
    # Load audio
    audio, sr = librosa.load(audio_path, sr=16000)
    
    fig, axes = plt.subplots(2, 1, figsize=(12, 6))
    
    # Waveform
    axes[0].set_title("Waveform")
    librosa.display.waveshow(audio, sr=sr, ax=axes[0])
    axes[0].set_xlabel("Time (s)")
    
    # Mel spectrogram (what Whisper "sees")
    mel_spec = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
    
    axes[1].set_title("Mel Spectrogram (what Whisper sees)")
    img = librosa.display.specshow(mel_spec_db, x_axis='time', y_axis='mel', sr=sr, ax=axes[1])
    fig.colorbar(img, ax=axes[1], format='%+2.0f dB')
    
    plt.tight_layout()
    plt.show()
    
    return audio, sr

audio, sr = visualize_audio(sample_path)

### 🔍 Understanding the Mel Spectrogram

The mel spectrogram is a visual representation of audio:
- **X-axis**: Time
- **Y-axis**: Frequency (mel scale, which matches human perception)
- **Color**: Intensity (louder = brighter)

Whisper converts this "image" of sound into text, similar to how a VLM converts images to descriptions!
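
The "matches human perception" point can be made concrete with the mel formula itself. The sketch below uses the common HTK-style definition, mel = 2595 · log10(1 + f/700). This is an illustrative assumption: librosa defaults to the slightly different Slaney variant. It shows that equal steps in hertz cover fewer mels as frequency rises.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """Convert frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# The same 1000 Hz step covers fewer mels at higher frequencies
low_step = hz_to_mel(1000) - hz_to_mel(0)
high_step = hz_to_mel(8000) - hz_to_mel(7000)

print(f"0 -> 1000 Hz spans {low_step:.0f} mels")
print(f"7000 -> 8000 Hz spans {high_step:.0f} mels")
```

This compression is why the lower part of the spectrogram, where most speech energy sits, gets proportionally more vertical resolution.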

---

## Part 4: Basic Transcription

Let's transcribe our audio file. For real-world testing, use an actual speech recording.

In [None]:
@dataclass
class TranscriptionSegment:
    """A segment of transcribed audio."""
    id: int
    start: float  # Start time in seconds
    end: float    # End time in seconds
    text: str
    
    @property
    def duration(self) -> float:
        return self.end - self.start
    
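    def format_timestamp(self, t: float) -> str:
        """Format time as HH:MM:SS.mmm"""
        # Work in whole milliseconds so values like 59.9996 cannot render as "60.000"
        total_ms = int(round(t * 1000))
        hours, rem = divmod(total_ms, 3_600_000)
        minutes, rem = divmod(rem, 60_000)
        seconds, ms = divmod(rem, 1000)
        return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"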
    
    def to_srt(self) -> str:
        """Convert to SRT subtitle format."""
        start_ts = self.format_timestamp(self.start).replace(".", ",")
        end_ts = self.format_timestamp(self.end).replace(".", ",")
        return f"{self.id}\n{start_ts} --> {end_ts}\n{self.text.strip()}\n"


@dataclass
class TranscriptionResult:
    """Complete transcription result."""
    text: str
    segments: List[TranscriptionSegment]
    language: str
    duration: float
    processing_time: float
    
    def to_srt(self) -> str:
        """Export as SRT subtitle format."""
        return "\n".join(seg.to_srt() for seg in self.segments)
    
    def to_vtt(self) -> str:
        """Export as WebVTT format."""
        lines = ["WEBVTT\n"]
        for seg in self.segments:
            start_ts = seg.format_timestamp(seg.start)
            end_ts = seg.format_timestamp(seg.end)
            lines.append(f"{start_ts} --> {end_ts}")
            lines.append(seg.text.strip())
            lines.append("")
        return "\n".join(lines)

print("✅ Data classes defined!")

In [None]:
def transcribe(
    audio_path: str,
    language: Optional[str] = None,
    task: str = "transcribe",  # or "translate" (to English)
    verbose: bool = True,
) -> TranscriptionResult:
    """
    Transcribe an audio file using Whisper.
    
    Args:
        audio_path: Path to audio file (mp3, wav, m4a, etc.)
        language: Source language (auto-detected if None)
        task: "transcribe" or "translate" (to English)
        verbose: Print progress
        
    Returns:
        TranscriptionResult with text and segments
    """
    if verbose:
        print(f"🎤 Transcribing: {audio_path}")
    
    start_time = time.time()
    
    # Load audio
    audio = whisper.load_audio(audio_path)
    audio_duration = len(audio) / whisper.audio.SAMPLE_RATE
    
    if verbose:
        print(f"   Duration: {audio_duration:.1f}s")
    
    # Transcribe
    result = model.transcribe(
        audio,
        language=language,
        task=task,
        verbose=verbose,
    )
    
    processing_time = time.time() - start_time
    
    # Convert segments
    segments = [
        TranscriptionSegment(
            id=i + 1,
            start=seg["start"],
            end=seg["end"],
            text=seg["text"],
        )
        for i, seg in enumerate(result["segments"])
    ]
    
    transcription = TranscriptionResult(
        text=result["text"],
        segments=segments,
        language=result.get("language", "unknown"),
        duration=audio_duration,
        processing_time=processing_time,
    )
    
    if verbose:
        ratio = audio_duration / processing_time
        print(f"\n✅ Completed in {processing_time:.1f}s ({ratio:.1f}x realtime)")
        print(f"   Detected language: {transcription.language}")
    
    return transcription

print("✅ Transcription function ready!")

In [None]:
# Test transcription with our sample
# Note: Our synthetic audio won't produce meaningful text!

print("\n📝 Transcription Test")
print("=" * 60)

result = transcribe(sample_path)

print(f"\n📄 Transcription:")
print(f"   '{result.text}'")

print(f"\n📊 Statistics:")
print(f"   Segments: {len(result.segments)}")
print(f"   Audio duration: {result.duration:.1f}s")
print(f"   Processing time: {result.processing_time:.1f}s")

if result.segments:
    print(f"\n🔖 Segments:")
    for seg in result.segments[:5]:  # Show first 5
        print(f"   [{seg.format_timestamp(seg.start)} --> {seg.format_timestamp(seg.end)}] {seg.text}")

---

## Part 5: Language Detection and Translation

Whisper can automatically detect languages and translate to English!

In [None]:
def detect_language(audio_path: str, top_k: int = 5) -> Dict[str, float]:
    """
    Detect the language of an audio file.
    
    Args:
        audio_path: Path to audio file
        top_k: Number of top languages to return
        
    Returns:
        Dictionary mapping language codes to probabilities
    """
    # Load and pad/trim audio
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)
    
    # Create mel spectrogram (n_mels is 80 for most models, 128 for large-v3)
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    
    # Detect language
    _, probs = model.detect_language(mel)
    
    # Sort and get top-k
    sorted_probs = sorted(probs.items(), key=lambda x: x[1], reverse=True)
    return dict(sorted_probs[:top_k])

print("Testing language detection...")
languages = detect_language(sample_path)

print("\n🌍 Detected Languages:")
for lang, prob in languages.items():
    bar = "█" * int(prob * 40)
    print(f"   {lang}: {prob:.1%} {bar}")

In [None]:
# Whisper language support
print("\n🌐 Whisper Language Support")
print("=" * 60)

# Get available languages
from whisper.tokenizer import LANGUAGES

print(f"Supports {len(LANGUAGES)} languages including:")

# Show some common languages
common = ["en", "es", "fr", "de", "it", "pt", "ru", "ja", "ko", "zh", "ar", "hi"]
for code in common:
    if code in LANGUAGES:
        print(f"   {code}: {LANGUAGES[code]}")

---

## Part 6: Exporting Transcriptions

Let's export transcriptions in different formats.
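
SRT and WebVTT cues look nearly identical. The main difference in the timestamp line is the millisecond separator (comma in SRT, period in WebVTT), and WebVTT files begin with a `WEBVTT` header. A minimal standalone sketch, using a hypothetical helper name `format_cue_time` that is not part of Whisper:

```python
def format_cue_time(seconds: float, separator: str = ",") -> str:
    """Format seconds as HH:MM:SS<sep>mmm (hypothetical helper)."""
    # Integer milliseconds avoid rounding artifacts such as "60.000" seconds
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"

start, end = 3.5, 7.25
srt_line = f"{format_cue_time(start)} --> {format_cue_time(end)}"
vtt_line = f"{format_cue_time(start, '.')} --> {format_cue_time(end, '.')}"
print(srt_line)  # SRT style, comma before milliseconds
print(vtt_line)  # WebVTT style, period before milliseconds
```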

In [None]:
def export_transcription(
    result: TranscriptionResult,
    output_path: str,
    format: str = "txt",
) -> str:
    """
    Export transcription to file.
    
    Args:
        result: TranscriptionResult
        output_path: Output file path
        format: "txt", "srt", "vtt", or "json"
        
    Returns:
        Path to saved file
    """
    output_path = Path(output_path)
    
    if format == "txt":
        content = result.text
    elif format == "srt":
        content = result.to_srt()
    elif format == "vtt":
        content = result.to_vtt()
    elif format == "json":
        content = json.dumps({
            "text": result.text,
            "language": result.language,
            "duration": result.duration,
            "segments": [
                {
                    "id": s.id,
                    "start": s.start,
                    "end": s.end,
                    "text": s.text,
                }
                for s in result.segments
            ],
        }, indent=2)
    else:
        raise ValueError(f"Unknown format: {format}")
    
    output_path.write_text(content, encoding="utf-8")
    print(f"✅ Saved {format.upper()}: {output_path}")
    
    return str(output_path)

# Export in different formats
if result.text.strip():
    export_transcription(result, "transcript.txt", "txt")
    export_transcription(result, "transcript.srt", "srt")
    export_transcription(result, "transcript.json", "json")
else:
    print("(Skipping export - no transcription text)")

---

## Part 7: Audio Q&A Pipeline

Let's combine Whisper with an LLM for audio question-answering!
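
One detail worth planning for when a transcript is placed into an LLM prompt is the context budget: long recordings may not fit. Cutting the text at an arbitrary character position can leave a half sentence at the end, so one possible alternative is to keep only whole timestamped lines. The sketch below is self-contained and uses made-up `(start, text)` tuples in place of real Whisper segments; `fit_segments` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple

# (start_seconds, text) pairs standing in for Whisper segments (made-up data)
Segment = Tuple[float, str]

def fit_segments(segments: List[Segment], max_chars: int) -> List[str]:
    """Keep whole timestamped lines until the character budget is used up."""
    lines: List[str] = []
    used = 0
    for start, text in segments:
        line = f"[{start:.1f}s] {text.strip()}"
        cost = len(line) + 1  # +1 for the newline that joins lines
        if used + cost > max_chars:
            break
        lines.append(line)
        used += cost
    return lines

demo = [
    (0.0, "Welcome to the meeting."),
    (4.2, "First item: the budget."),
    (9.8, "Second item: hiring."),
]
kept = fit_segments(demo, max_chars=70)
print("\n".join(kept))
```

A character budget is only a rough proxy for tokens. A production pipeline would likely count tokens with the target LLM's tokenizer instead.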

In [None]:
def create_audio_qa_prompt(
    transcription: TranscriptionResult,
    question: str,
    include_timestamps: bool = True,
    max_context_length: int = 3000,
) -> str:
    """
    Create a prompt for audio Q&A based on transcription.
    
    Args:
        transcription: Transcription result
        question: User's question
        include_timestamps: Include timestamps in context
        max_context_length: Maximum context length
        
    Returns:
        Formatted prompt for an LLM
    """
    if include_timestamps and transcription.segments:
        # Build timestamped transcript
        lines = []
        for seg in transcription.segments:
            timestamp = f"[{seg.format_timestamp(seg.start)}]"
            lines.append(f"{timestamp} {seg.text.strip()}")
        transcript = "\n".join(lines)
    else:
        transcript = transcription.text
    
    # Truncate if needed
    if len(transcript) > max_context_length:
        transcript = transcript[:max_context_length] + "...\n[Truncated]"
    
    prompt = f"""Audio Transcription (Duration: {transcription.duration:.1f}s, Language: {transcription.language}):

{transcript}

Based on the audio transcription above, please answer the following question:
{question}

Answer:"""
    
    return prompt

# Example prompt creation
if result.text.strip():
    example_prompt = create_audio_qa_prompt(result, "What is the main topic discussed?")
    print("\n📝 Example Q&A Prompt:")
    print("=" * 60)
    print(example_prompt)
else:
    print("(No transcription available for Q&A example)")

In [None]:
class AudioQAPipeline:
    """
    Complete audio Q&A pipeline using Whisper and an LLM.
    """
    
    def __init__(self, whisper_model=None, llm_model=None, llm_processor=None):
        """
        Initialize the pipeline.
        
        Args:
            whisper_model: Loaded Whisper model (uses global if None)
            llm_model: LLM for Q&A (optional)
            llm_processor: LLM processor
        """
        self.whisper_model = whisper_model or model
        self.llm_model = llm_model
        self.llm_processor = llm_processor
        
        # Cache for transcriptions
        self._transcription_cache = {}
    
    def transcribe(self, audio_path: str, **kwargs) -> TranscriptionResult:
        """
        Transcribe audio with caching.
        """
        if audio_path in self._transcription_cache:
            print("📎 Using cached transcription")
            return self._transcription_cache[audio_path]
        
        result = transcribe(audio_path, **kwargs)
        self._transcription_cache[audio_path] = result
        
        return result
    
    def ask(
        self,
        audio_path: str,
        question: str,
        include_timestamps: bool = False,
    ) -> Dict[str, Any]:
        """
        Ask a question about an audio file.
        
        Args:
            audio_path: Path to audio file
            question: Question about the audio
            include_timestamps: Include timestamps in context
            
        Returns:
            Dictionary with answer and metadata
        """
        # Transcribe
        transcription = self.transcribe(audio_path)
        
        # Create prompt
        prompt = create_audio_qa_prompt(
            transcription,
            question,
            include_timestamps=include_timestamps,
        )
        
        # If no LLM, return the prompt for manual use
        if self.llm_model is None:
            return {
                "transcription": transcription.text,
                "question": question,
                "prompt": prompt,
                "answer": "(LLM not configured - use prompt with your preferred model)",
                "language": transcription.language,
                "duration": transcription.duration,
            }
        
        # Generate answer with LLM
        # (Implementation depends on LLM type)
        answer = self._generate_answer(prompt)
        
        return {
            "transcription": transcription.text,
            "question": question,
            "answer": answer,
            "language": transcription.language,
            "duration": transcription.duration,
        }
    
    def _generate_answer(self, prompt: str) -> str:
        """Generate answer using LLM."""
        # Placeholder - implement based on your LLM
        return "(LLM response would go here)"
    
    def summarize(self, audio_path: str) -> Dict[str, Any]:
        """
        Generate a summary of the audio content.
        """
        return self.ask(
            audio_path,
            "Provide a brief summary of the main points discussed in this audio."
        )

print("✅ AudioQAPipeline class ready!")

In [None]:
# Test the pipeline
pipeline = AudioQAPipeline()

print("\n🎯 Audio Q&A Pipeline Test")
print("=" * 60)

result = pipeline.ask(
    sample_path,
    "What is being discussed in this audio?"
)

print(f"\n❓ Question: {result['question']}")
print(f"📝 Transcription: {result['transcription'][:200]}..." if len(result['transcription']) > 200 else f"📝 Transcription: {result['transcription']}")
print(f"🌍 Language: {result['language']}")
print(f"⏱️ Duration: {result['duration']:.1f}s")

---

## ⚠️ Common Mistakes

### Mistake 1: Wrong Audio Format
```python
# ❌ Wrong: Whisper expects specific sample rate
audio, sr = librosa.load("audio.wav", sr=44100)  # CD quality
result = model.transcribe(audio)  # May have issues!

# ✅ Right: Use Whisper's audio loader (16kHz)
audio = whisper.load_audio("audio.wav")  # Auto-resamples to 16kHz
result = model.transcribe(audio)
```
**Why:** Whisper was trained on 16kHz audio. Always use `whisper.load_audio()`.
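
The reason a mismatched sample rate hurts: a raw NumPy array carries no sample-rate metadata, so the samples are interpreted as 16 kHz regardless of how they were loaded. The arithmetic below (plain Python, no audio libraries, with an arbitrary 10-second example clip) shows how 44.1 kHz audio would appear stretched in time and lowered in pitch.

```python
TRUE_RATE = 44_100      # rate the audio was actually loaded at
ASSUMED_RATE = 16_000   # rate assumed for raw arrays

true_duration = 10.0                         # seconds of real audio (example value)
n_samples = int(true_duration * TRUE_RATE)   # samples in the array

apparent_duration = n_samples / ASSUMED_RATE
slowdown = TRUE_RATE / ASSUMED_RATE

# A 440 Hz tone would land at this frequency after misinterpretation
apparent_pitch = 440.0 / slowdown

print(f"Apparent duration: {apparent_duration:.2f}s (real: {true_duration:.2f}s)")
print(f"Slowdown: {slowdown:.2f}x, a 440 Hz tone appears near {apparent_pitch:.0f} Hz")
```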

---

### Mistake 2: Using Tiny Model for Production
```python
# ❌ Wrong: Tiny model has high error rate
model = whisper.load_model("tiny")
result = model.transcribe(important_meeting)  # Many errors!

# ✅ Right: Use large-v3 for important transcriptions
model = whisper.load_model("large-v3")
result = model.transcribe(important_meeting)  # Much more accurate
```
**Why:** Larger models are significantly more accurate. With 128GB on DGX Spark, always use large-v3!
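
On hardware with less memory than DGX Spark, the choice becomes a trade-off. Below is a minimal sketch of picking the largest model that fits a VRAM budget, using the approximate figures from the comparison table in Part 2. The `pick_model` helper and the 20% headroom factor are illustrative assumptions, not part of Whisper.

```python
from typing import Optional

# Approximate VRAM needs in GB, taken from the model comparison table
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large-v3": 10}

def pick_model(available_gb: float, headroom: float = 1.2) -> Optional[str]:
    """Return the largest model whose VRAM need (plus headroom) fits."""
    best = None
    # Dicts preserve insertion order, which here runs smallest to largest
    for name, need_gb in MODEL_VRAM_GB.items():
        if need_gb * headroom <= available_gb:
            best = name
    return best

for gb in (0.5, 4, 8, 128):
    print(f"{gb:>5} GB -> {pick_model(gb)}")
```

The returned name can be passed straight to `whisper.load_model()`; a `None` result means even the smallest model does not fit the budget.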

---

### Mistake 3: Not Setting Language for Known Audio
```python
# ❌ Wrong: Let Whisper guess (may be wrong)
result = model.transcribe(french_audio)  # Might detect wrong language

# ✅ Right: Specify language when known
result = model.transcribe(french_audio, language="fr")
```
**Why:** Specifying language improves accuracy and speed.

---

## 🎉 Checkpoint

You've learned:
- ✅ How Whisper converts audio to text using mel spectrograms
- ✅ Choosing the right Whisper model size for your needs
- ✅ Transcribing audio with accurate timestamps
- ✅ Detecting languages automatically
- ✅ Exporting transcriptions in SRT/VTT/JSON formats
- ✅ Building an audio Q&A pipeline

---

## 🚀 Challenge (Optional)

Build a **Meeting Notes Generator** that:
1. Transcribes a meeting recording
2. Identifies speakers (speaker diarization)
3. Extracts action items and decisions
4. Generates a summary with key points
5. Exports as formatted meeting notes
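
As a hint for step 2: Whisper itself does not identify speakers, so a separate diarization tool is needed. Assuming such a tool yields speaker turns as `(start, end, label)` tuples (the data below is made up, and `assign_speakers` is a hypothetical helper), one possible way to combine the two outputs is to give each transcript segment the speaker whose turn overlaps it most.

```python
from typing import List, Tuple

Turn = Tuple[float, float, str]      # (start, end, speaker label)
Segment = Tuple[float, float, str]   # (start, end, text)

def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[Tuple[str, str]]:
    """Label each segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best_label, best_overlap = "unknown", 0.0
        for turn_start, turn_end, label in turns:
            # Length of the time interval shared by the segment and the turn
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        labeled.append((best_label, text.strip()))
    return labeled

turns = [(0.0, 5.0, "SPEAKER_1"), (5.0, 12.0, "SPEAKER_2")]
segments = [(0.5, 4.0, "Let's start."), (4.5, 9.0, "I'll send the report by Friday.")]
for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

Segments that overlap no turn are labeled `unknown`, which makes gaps in the diarization output easy to spot.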

In [None]:
# Challenge: Your code here!

def generate_meeting_notes(audio_path: str) -> Dict[str, Any]:
    """
    Generate structured meeting notes from audio.
    
    Args:
        audio_path: Path to meeting recording
        
    Returns:
        Dictionary with:
        - transcript: Full transcript
        - summary: Brief summary
        - action_items: List of action items
        - decisions: List of decisions made
        - participants: Identified speakers
    """
    # Your implementation here!
    pass

---

## 📖 Further Reading

- [Whisper Paper](https://arxiv.org/abs/2212.04356)
- [OpenAI Whisper GitHub](https://github.com/openai/whisper)
- [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) - Efficient C++ implementation
- [Faster Whisper](https://github.com/SYSTRAN/faster-whisper) - CTranslate2 acceleration

---

## 🧹 Cleanup

In [None]:
# Clean up
import os

# Remove sample files
for f in ["sample_audio.wav", "transcript.txt", "transcript.srt", "transcript.json"]:
    if os.path.exists(f):
        os.remove(f)

# Free memory
if 'model' in dir():
    del model

gc.collect()  # Drop unreachable Python objects first so their GPU blocks can be released
torch.cuda.empty_cache()

print("✅ Cleanup complete!")
print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

---

## Module Complete! 🎉

Congratulations! You've completed Module 4.1: Multimodal AI!

You've learned:
1. **Vision-Language Models** - Analyzing and understanding images with LLaVA and CLIP
2. **Image Generation** - Creating images with SDXL and ControlNet
3. **Multimodal RAG** - Searching across images and text
4. **Document AI** - Processing PDFs with OCR and VLMs
5. **Audio Transcription** - Converting speech to text with Whisper

---

## Next Steps

Continue to **Module 4.2: AI Safety & Alignment** to learn about building safe and aligned AI systems!

➡️ [Module 4.2: AI Safety & Alignment](../../module-4.2-ai-safety/)