# Multimodal Image-to-Speech Pipeline — Approach 2

**Using Google GenAI (Gemini) for Image Captioning + HuggingFace Transformers Pipeline for TTS**

## System Architecture

```
Input Image
    ↓
Image Captioning / Object Detection (Google GenAI — gemini-3-flash-preview)
    ↓
Generated Text
    ↓
Text Processing (Optional Enhancement)
    ↓
Text-to-Speech Model (suno/bark-small)
    ↓
Audio Output
```

## Models Used

| Task | Model | Provider |
|------|-------|----------|
| Image Captioning | `gemini-3-flash-preview` | Google GenAI |
| Text-to-Speech | `suno/bark-small` | HuggingFace Transformers |

In [None]:
# !pip install google-genai transformers torch Pillow scipy

## Step 1: Image Captioning using Google GenAI SDK

**Objective:** Generate a detailed, accessibility-focused caption describing the image.

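Instead of detecting isolated objects (like DETR in Approach 1), the Gemini vision model can:
- Understand the entire scene
- Identify relationships between objects
- Describe actions and context
- Produce natural, human-like language

Note: the system prompt used below asks specifically for object labels and their counts, so the response reads as a spoken inventory of the scene rather than a free-form caption.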

**Model:** `gemini-3-flash-preview`

In [None]:
import gc
import re

import torch
import numpy as np
from PIL import Image
from transformers import pipeline
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio, display

from google import genai
from google.genai import types

In [None]:
# Load Gemini API key from local file (the context manager closes the file even on error)
with open("keys/.gemini.txt") as f:
    key = f.read().strip()

client = genai.Client(api_key=key)
print("Gemini client initialized successfully.")
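
Reading the key from a file works for a local notebook, but an environment variable is often easier to manage on shared machines. The following is a minimal sketch, not part of the original notebook: `load_api_key` is a hypothetical helper, and the variable name `GEMINI_API_KEY` is an assumption chosen for illustration. It checks the environment first and falls back to the same key file used above.

```python
import os
from pathlib import Path


def load_api_key(env_var="GEMINI_API_KEY", fallback_path="keys/.gemini.txt"):
    """Return the API key from the environment, else from a local file.

    Both the helper and the default env_var name are illustrative assumptions.
    """
    # Prefer the environment variable when it is set and non-empty
    key = os.environ.get(env_var, "").strip()
    if key:
        return key

    # Otherwise fall back to the key file, if it exists
    path = Path(fallback_path)
    if path.is_file():
        key = path.read_text(encoding="utf-8").strip()

    if not key:
        raise RuntimeError(
            f"No API key found in ${env_var} or in {fallback_path}"
        )
    return key
```

With this helper, the client could be created as `genai.Client(api_key=load_api_key())`.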

In [None]:
# Load and display the input image
image_path = "images/image_1.jpg"
image = Image.open(image_path)
display(image)

In [None]:
# Read image bytes and send to Gemini for captioning
with open(image_path, "rb") as f:
    image_bytes = f.read()

SYSTEM_PROMPT = (
    "You are a helpful AI assistant. Given an image, perform object detection "
    "and provide a text output that contains the labels detected and their "
    "counts."
)

contents = [
    types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
]

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=contents,
    config=types.GenerateContentConfig(
        system_instruction=SYSTEM_PROMPT,
    ),
)

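raw_caption = response.text
if not raw_caption:
    # Guard against an empty response so the text-processing step does not fail
    raise RuntimeError("Gemini returned no text for this image.")
print("Gemini Response:")
print(raw_caption)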
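
The cell above hardcodes `mime_type="image/jpeg"`, which matches `image_1.jpg` but would be wrong for a PNG or WebP input. As a hedged sketch, the MIME type can be derived from the file extension with the standard-library `mimetypes` module. `guess_image_mime_type` is a hypothetical helper, not part of the original notebook, and falling back to JPEG for unknown extensions is an assumption.

```python
import mimetypes


def guess_image_mime_type(path, default="image/jpeg"):
    """Guess an image MIME type from the file extension.

    Unknown extensions fall back to `default`; extensions that map to a
    non-image type raise ValueError so the mistake is caught early.
    """
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        return default
    if not mime_type.startswith("image/"):
        raise ValueError(f"{path} does not look like an image ({mime_type})")
    return mime_type
```

The result could then be passed as `types.Part.from_bytes(data=image_bytes, mime_type=guess_image_mime_type(image_path))`.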

## Step 2: Text Processing (Optional Enhancement)

**Objective:** Prepare the generated caption for speech synthesis.

Possible enhancements:
- Remove unnecessary symbols (markdown formatting, etc.)
- Control length (brief/detailed mode)
- Adjust tone (formal/informal)
- Add introductory phrase (e.g., *"Here is what I see in the image..."*)

In [None]:
# Clean up the raw caption for TTS

# Remove markdown-style formatting symbols
text = re.sub(r"[*_#`>]", "", raw_caption)

# Collapse multiple whitespace / newlines into a single space
text = re.sub(r"\s+", " ", text).strip()

# Add an introductory phrase
if not text.lower().startswith("here is"):
    text = "Here is what I see in the image: " + text

# Truncate to ~500 chars to keep TTS output manageable
if len(text) > 500:
    text = text[:497] + "..."

print("Processed text for TTS:")
print(text)
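
The fixed 500-character cut above can stop in the middle of a word or sentence, which sounds abrupt when spoken. One possible alternative, sketched here under the assumption that sentences end with `.`, `!`, or `?` followed by whitespace, is to cut at the last complete sentence that fits. `truncate_at_sentence` is a hypothetical helper, not part of the original notebook.

```python
import re


def truncate_at_sentence(text, max_chars=500):
    """Shorten text to at most max_chars, preferring a sentence boundary.

    Assumes max_chars is larger than 3 so the "..." fallback fits.
    """
    if len(text) <= max_chars:
        return text

    # Look one character past the limit so the lookahead can see the
    # whitespace that follows a sentence ending exactly at the limit.
    window = text[: max_chars + 1]
    ends = [m.end() for m in re.finditer(r"[.!?](?=\s)", window)]
    if ends:
        return text[: ends[-1]]

    # No sentence boundary fits: cut at the last whole word and add "..."
    head = text[: max_chars - 3]
    cut = head.rfind(" ")
    if cut > 0:
        head = head[:cut]
    return head.rstrip() + "..."


sample = "A dog sits. A cat sleeps on the sofa. Two birds fly."
print(truncate_at_sentence(sample, max_chars=30))  # → A dog sits.
```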

## Step 3: Text-to-Speech (TTS)

**Objective:** Convert the generated descriptive text into natural speech audio.

**Model:** `suno/bark-small`

In [None]:
# Run TTS pipeline
synthesizer = pipeline(
    task="text-to-speech",
    model="suno/bark-small",
)

result = synthesizer(text)

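# Extract audio data and sampling rate.
# squeeze() drops a leading batch axis if present, so this works whether the
# pipeline returns shape (1, n_samples) or (n_samples,).
audio = np.asarray(result["audio"], dtype=np.float32).squeeze()
sampling_rate = result["sampling_rate"]

# Normalize audio to 16-bit PCM range, skipping the division for silent output
peak = np.max(np.abs(audio))
if peak > 0:
    audio = audio / peak
audio_16bit = (audio * 32767).astype(np.int16)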

# Save as WAV
import os

output_path = "output/output_v2.wav"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
write_wav(output_path, sampling_rate, audio_16bit)
print(f"Audio saved to: {output_path}")
print(f"Sampling rate: {sampling_rate} Hz")

# Clean up
del synthesizer
gc.collect()
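
Step 2 truncates the caption so that a single TTS call stays manageable. If the full description should be spoken instead, one option is to split the text into short, sentence-aligned chunks and synthesize them one at a time. The sketch below covers only the splitting step. `split_into_chunks` is a hypothetical helper, and the 200-character default is an illustrative assumption, not a documented limit of `suno/bark-small`.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("A dog sits. A cat sleeps. Two birds fly.", max_chars=25))
# → ['A dog sits. A cat sleeps.', 'Two birds fly.']
```

Each chunk could then be passed to `synthesizer(...)` in turn and the resulting arrays joined with `np.concatenate` before normalization. The synthesizer would need to stay loaded until the last chunk is processed.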

## Play Audio Output

In [None]:
Audio(audio_16bit, rate=sampling_rate)