**🔧 Setup Required**: Before running this notebook, please follow the [setup instructions](../../README.md#setup-instructions) to configure your environment and API keys.

# Audio Transcription with Whisper for Multimodal RAG

This notebook demonstrates how to add audio transcription capabilities to multimodal RAG pipelines using OpenAI's Whisper.

## What You'll Learn

- How to transcribe audio files with Whisper (remote and local)
- How to build an audio transcription pipeline
- How to create a complete multimodal pipeline with text, images, and audio
- How to index transcribed audio alongside other content

## Whisper Options

Haystack supports two Whisper implementations:

1. **Remote Whisper** (OpenAI API):
   - Easier to use
   - Requires OpenAI API key
   - API costs apply
   - No local setup needed

2. **Local Whisper**:
   - Free to use
   - Requires local installation
   - More privacy
   - Requires more computational resources

## Use Cases

- Meeting transcription and analysis
- Podcast search and Q&A
- Voice command processing
- Accessibility features
- Converting speech to searchable text

## Setup

In [1]:
import os
from getpass import getpass
from dotenv import load_dotenv

# Load environment variables
load_dotenv("../../.env")

# Set up OpenAI API key
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key: ")

# Directory the notebook runs from; relative data paths below resolve against it
notebook_dir = os.getcwd()

## Option 1: Remote Whisper (OpenAI API)

This is the easiest option: it uses OpenAI's hosted Whisper API, so nothing needs to be installed or downloaded locally.

In [11]:
from haystack.components.audio import RemoteWhisperTranscriber 
from haystack.dataclasses import ByteStream
from haystack.utils import Secret

# Initialize remote transcriber
remote_transcriber = RemoteWhisperTranscriber(api_key=Secret.from_env_var("OPENAI_API_KEY"))

## Option 2: Local Whisper

To use Local Whisper, you need to install the `openai-whisper` package:

```bash
pip install openai-whisper
```

**Note**: Local Whisper downloads models from the internet on first use. If you encounter SSL certificate errors, you may need to fix SSL certificates on your system or use Remote Whisper instead.

### Fixing SSL Certificate Issues (macOS)

If you encounter SSL certificate verification errors, try one of these solutions:

1. **Run the certificates install script** (recommended for macOS):
   ```bash
   /Applications/Python\ 3.12/Install\ Certificates.command
   ```

2. **Or manually bypass SSL verification** (not recommended for production):
   ```python
   import ssl
   ssl._create_default_https_context = ssl._create_unverified_context
   ```
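Before bypassing verification, it can help to check where your Python interpreter actually looks for trusted CA certificates. The following is a minimal diagnostic sketch using only the standard library; the helper name `describe_ca_configuration` is hypothetical and not part of Haystack or Whisper.

```python
import os
import ssl


def describe_ca_configuration():
    """Summarize where this Python interpreter looks for trusted CA certificates."""
    paths = ssl.get_default_verify_paths()
    return {
        "openssl_version": ssl.OPENSSL_VERSION,
        "cafile": paths.cafile,
        "cafile_exists": bool(paths.cafile) and os.path.exists(paths.cafile),
        "capath": paths.capath,
        "capath_exists": bool(paths.capath) and os.path.isdir(paths.capath),
        # Environment variables that OpenSSL consults to override the defaults
        "env_overrides": {
            name: os.environ.get(name)
            for name in (paths.openssl_cafile_env, paths.openssl_capath_env)
        },
    }


for key, value in describe_ca_configuration().items():
    print(f"{key}: {value}")
```

If neither `cafile` nor `capath` points to an existing location, the interpreter probably has no CA bundle configured, which is the situation the certificates install script is meant to fix.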

In [12]:
# Uncomment to use Local Whisper (requires openai-whisper package and model download)
# from haystack.components.audio import LocalWhisperTranscriber

# Initialize local transcriber (requires local Whisper installation)
# Available models: tiny, base, small, medium, large
# local_transcriber = LocalWhisperTranscriber(model="base")
# local_transcriber.warm_up()  # This will download the model on first run

# Example usage:
# result = local_transcriber.run(sources=["data_for_indexing/harvard.wav"])
# print(result["documents"][0].content)

print("Local Whisper transcriber can be initialized with LocalWhisperTranscriber()")
print("Available models: tiny, base, small, medium, large")
print("Trade-off: Larger models are more accurate but slower")
print("\nNote: Commented out by default due to model download requirements.")
print("Uncomment the code above to use Local Whisper after resolving SSL issues.")

Local Whisper transcriber can be initialized with LocalWhisperTranscriber()
Available models: tiny, base, small, medium, large
Trade-off: Larger models are more accurate but slower

Note: Commented out by default due to model download requirements.
Uncomment the code above to use Local Whisper after resolving SSL issues.


## Building an Audio Transcription Pipeline

Here's a complete pipeline for processing audio files:

**Pipeline Flow**:
1. Audio files → Whisper Transcriber → Text documents
2. Text → Document Splitter → Smaller chunks
3. Chunks → Embedder → Vector representations
4. Vectors → Document Writer → Store

This pipeline can be combined with text and image pipelines for truly multimodal applications.

In [13]:
from haystack import Pipeline
from haystack.components.audio import RemoteWhisperTranscriber
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.utils import Secret  # needed for the API key below; also imported in an earlier cell

# Create document store for audio transcripts
audio_doc_store = InMemoryDocumentStore()
transcriber = RemoteWhisperTranscriber(api_key=Secret.from_env_var("OPENAI_API_KEY"))
doc_splitter = DocumentSplitter(split_by="sentence", split_length=10)
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_writer = DocumentWriter(audio_doc_store)

# Build audio transcription pipeline
audio_pipeline = Pipeline()
audio_pipeline.add_component("transcriber", transcriber)
audio_pipeline.add_component("splitter", doc_splitter)
audio_pipeline.add_component("embedder", doc_embedder)
audio_pipeline.add_component("writer", doc_writer)

# Connect components
audio_pipeline.connect("transcriber.documents", "splitter.documents")
audio_pipeline.connect("splitter.documents", "embedder.documents")
audio_pipeline.connect("embedder.documents", "writer.documents")

print("Audio transcription pipeline built!")
print("\nTo use the pipeline:")
print("result = audio_pipeline.run({'transcriber': {'sources': [audio_stream]}})")


Audio transcription pipeline built!

To use the pipeline:
result = audio_pipeline.run({'transcriber': {'sources': [audio_stream]}})


In [21]:
audio_pipeline.run({'transcriber': {'sources': ["./data_for_indexing/harvard.wav"]}})

Batches: 100%|██████████| 1/1 [00:00<00:00,  1.77it/s]


{'writer': {'documents_written': 1}}

In [24]:
documents = [item.content for item in audio_doc_store.filter_documents()]

print(documents)

['The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is the hot cross bun.']
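Only one document was written because the transcript contains six sentences, which is fewer than the `split_length=10` passed to the splitter. The sketch below illustrates sentence-based chunking in plain Python. It is a simplified illustration, not Haystack's implementation: `chunk_sentences` is a hypothetical helper, and its naive rule (split after `.`, `!` or `?` followed by whitespace) is an assumption.

```python
import re


def chunk_sentences(text, split_length=10):
    """Group sentences into chunks of at most `split_length` sentences each.

    Naive illustration: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i : i + split_length])
        for i in range(0, len(sentences), split_length)
    ]


transcript = (
    "The stale smell of old beer lingers. It takes heat to bring out the odor. "
    "A cold dip restores health and zest. A salt pickle tastes fine with ham. "
    "Tacos al pastor are my favorite. A zestful food is the hot cross bun."
)

print(len(chunk_sentences(transcript, split_length=10)))  # 1: all six sentences fit in one chunk
print(len(chunk_sentences(transcript, split_length=2)))   # 3: two sentences per chunk
```

For longer recordings, a smaller `split_length` yields more, shorter chunks, which usually makes retrieval more precise at the cost of less context per chunk.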


In [19]:
audio_pipeline.draw(path="./images/audio_pipeline.png")

![](./images/audio_pipeline.png)

## Audio Format Support

Whisper supports various audio formats:

- **MP3**: Most common format
- **WAV**: Uncompressed audio
- **M4A**: Apple audio format
- **FLAC**: Lossless compression
- **OGG**: Open-source format
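When indexing a whole folder, it can be convenient to filter for these formats before handing paths to the transcriber. A minimal sketch, assuming the extension set below mirrors the list above; `find_audio_files` and `SUPPORTED_AUDIO_EXTENSIONS` are hypothetical names, not part of Haystack.

```python
from pathlib import Path

# Extensions taken from the format list above; adjust if your Whisper backend differs.
SUPPORTED_AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def find_audio_files(directory):
    """Return sorted paths under `directory` whose extension is in the supported set."""
    root = Path(directory)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_AUDIO_EXTENSIONS
    )


# Example usage (assumes the folder exists):
# sources = [str(p) for p in find_audio_files("data_for_indexing")]
# audio_pipeline.run({"transcriber": {"sources": sources}})
```

The extension check is case-insensitive, so files such as `INTERVIEW.MP3` are picked up as well.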

### Best Practices:

1. **Audio Quality**: Higher quality audio produces better transcriptions
2. **File Size**: Split large audio files before upload; the OpenAI transcription API rejects files over 25 MB
3. **Language**: Whisper supports multiple languages
4. **Noise**: Clean audio produces better results
5. **Cost**: Remote Whisper charges per minute of audio
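The file-size advice can be sketched with Python's built-in `wave` module. This is a minimal example under stated assumptions: it handles only uncompressed PCM WAV files (other formats would need a tool such as ffmpeg), it cuts at fixed intervals and so may split mid-word, and `split_wav` with its `chunk_seconds` default is a hypothetical helper.

```python
import wave
from pathlib import Path


def split_wav(source, output_dir, chunk_seconds=600):
    """Split a WAV file into consecutive chunks of at most `chunk_seconds` each.

    Returns the list of chunk paths in playback order.
    """
    source = Path(source)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []

    with wave.open(str(source), "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(reader.getframerate() * chunk_seconds)
        if frames_per_chunk <= 0:
            raise ValueError("chunk_seconds is too small for this sample rate")

        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break
            chunk_path = output_dir / f"{source.stem}_part{index:03d}.wav"
            with wave.open(str(chunk_path), "wb") as writer:
                writer.setparams(params)  # header frame count is corrected on close
                writer.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1

    return chunk_paths


# Example usage (assumes the file exists):
# parts = split_wav("data_for_indexing/harvard.wav", "data_for_indexing/chunks", chunk_seconds=60)
# audio_pipeline.run({"transcriber": {"sources": [str(p) for p in parts]}})
```

Each chunk becomes its own transcript document, so consider storing the chunk index in metadata if you need to reconstruct the original order later.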

## Summary

In this notebook, we covered:

1. **Audio Transcription**:
   - Remote Whisper (OpenAI API)
   - Local Whisper (self-hosted)

2. **Audio Processing Pipeline**:
   - Transcription → Splitting → Embedding → Storage

3. **Complete Multimodal Pipeline**:
   - Unified pipeline for text, images, and audio
   - File routing and parallel processing
   - Consistent embedding and storage

### Key Takeaways:

- **Whisper is powerful**: State-of-the-art transcription
- **Two options**: Remote (easy) vs Local (free)
- **Pipeline flexibility**: Easy to combine modalities
- **Production ready**: Scalable architecture

### Real-World Applications:

1. **Meeting Assistant**: Transcribe meetings, index with slides
2. **Podcast Search**: Make audio content searchable
3. **Lecture Notes**: Combine audio, slides, and documents
4. **Voice Command**: Process voice inputs in RAG
5. **Accessibility**: Provide text alternatives for audio
