# Multimodal Image-to-Speech Pipeline

**Workflow:** Input Image → Object Detection → Label Extraction & Counting → Text Generation → Text-to-Speech → Audio Output

**Models Used:**
- **Object Detection:** `facebook/detr-resnet-50` (DETR - Transformer-based detector)
- **Text-to-Speech:** `suno/bark-small` (Neural TTS)

**Text Generation:** Pure Python logic (simple label counting and sentence construction; no LLM needed)

In [None]:
# Install dependencies (uncomment if needed)
# !pip install transformers torch Pillow scipy timm

## Step 1: Object Detection

Detect objects in the image using `facebook/detr-resnet-50` (DETR - DEtection TRansformer).

In [None]:
import gc
from collections import Counter

import torch
import numpy as np
from PIL import Image
from transformers import pipeline
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio as IPythonAudio

In [None]:
# Load the image
image_path = "images/image_1.jpg"
image = Image.open(image_path)
image

In [None]:
# Run object detection
detector = pipeline(
    task="object-detection",
    model="facebook/detr-resnet-50",
    torch_dtype=torch.bfloat16,
)

detections = detector(image)

# Display results
for det in detections:
    print(f"  {det['label']}: {det['score']:.2f}")

print(f"\nTotal: {len(detections)} object(s) detected")

# Free memory
del detector
gc.collect()
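
Each detection printed above carries a `label` and a `score`, and every detection is counted in the next step regardless of how confident the model was. As a minimal sketch, assuming a stricter cutoff is wanted before counting, low-confidence entries could be dropped first. The helper name `filter_detections`, the `min_score` value of 0.9, and the sample data are illustrative assumptions, not part of the original pipeline.

```python
def filter_detections(detections, min_score=0.9):
    """Keep only detections whose confidence score is at least min_score."""
    return [det for det in detections if det["score"] >= min_score]


# Hand-written sample data using the same "label"/"score" keys as the
# detector output (illustrative only, not real model results)
sample_detections = [
    {"label": "cat", "score": 0.99},
    {"label": "cat", "score": 0.62},
    {"label": "remote", "score": 0.95},
]

confident = filter_detections(sample_detections, min_score=0.9)
for det in confident:
    print(f"  {det['label']}: {det['score']:.2f}")
```

Filtering after the fact keeps the raw detections available for inspection, so the cutoff can be tuned without rerunning the model.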

## Step 2: Label Extraction & Text Generation

Count how many times each label was detected and build a natural-language sentence from the counts. This step is pure Python logic; no LLM is needed.

In [None]:
# Count occurrences of each label
label_counts = Counter(det["label"] for det in detections)
print("Label counts:", dict(label_counts))

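# Build a descriptive sentence
parts = []
for label, count in label_counts.items():
    if count == 1:
        parts.append(f"1 {label}")
    else:
        # Naive pluralization: "es" after s/x/ch/sh (e.g. "bus", "bench"), otherwise "s".
        # Irregular plurals such as "person" -> "people" are not handled here.
        suffix = "es" if label.endswith(("s", "x", "ch", "sh")) else "s"
        parts.append(f"{count} {label}{suffix}")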

if len(parts) == 0:
    text = "No objects were detected in the image."
elif len(parts) == 1:
    text = f"The image contains {parts[0]}."
elif len(parts) == 2:
    text = f"The image contains {parts[0]} and {parts[1]}."
else:
    text = f"The image contains {', '.join(parts[:-1])}, and {parts[-1]}."

print(f"\nGenerated text: {text}")
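
The counting and sentence-building logic can also be packaged as a reusable function. The sketch below is one possible arrangement, assuming a small hand-maintained table for labels whose plural is irregular or already plural. The names `describe`, `pluralize`, `IRREGULAR_PLURALS`, and `ALREADY_PLURAL`, and the table contents, are hypothetical and would need extending for the label set actually in use.

```python
from collections import Counter

# Hypothetical override table for labels with irregular plurals
IRREGULAR_PLURALS = {
    "person": "people",
    "knife": "knives",
    "mouse": "mice",
    "sheep": "sheep",
}
# Hypothetical set of labels that are already plural
ALREADY_PLURAL = {"skis", "scissors"}


def pluralize(label):
    """Return a plural form of label using the tables above plus simple suffix rules."""
    if label in ALREADY_PLURAL:
        return label
    if label in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[label]
    if label.endswith(("s", "x", "ch", "sh")):
        return label + "es"
    return label + "s"


def describe(labels):
    """Build a one-sentence description from a list of detected labels."""
    counts = Counter(labels)  # keeps first-seen order of labels
    parts = [
        f"{n} {label if n == 1 else pluralize(label)}"
        for label, n in counts.items()
    ]
    if not parts:
        return "No objects were detected in the image."
    if len(parts) == 1:
        return f"The image contains {parts[0]}."
    if len(parts) == 2:
        return f"The image contains {parts[0]} and {parts[1]}."
    return f"The image contains {', '.join(parts[:-1])}, and {parts[-1]}."


print(describe(["person", "person", "bus", "cat"]))
```

Keeping the exceptions in plain dictionaries and sets means a new label can be covered by adding one entry, without touching the sentence logic.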

## Step 3: Text-to-Speech (TTS)

Convert the generated text into speech audio using `suno/bark-small` and save as a WAV file.

In [None]:
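import os

# Run text-to-speech
synthesizer = pipeline(
    task="text-to-speech",
    model="suno/bark-small",
)

result = synthesizer(text)

# Extract audio data; squeeze handles both (1, N) and (N,) output shapes
audio = np.squeeze(np.asarray(result["audio"]))
sampling_rate = result["sampling_rate"]

# Peak-normalize (skipped for all-zero audio to avoid dividing by zero)
# and convert to 16-bit PCM
peak = np.max(np.abs(audio))
if peak > 0:
    audio = audio / peak
audio_16bit = (audio * 32767).astype(np.int16)

# Create the output directory if it does not exist, then save as WAV
output_path = "output/output.wav"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
write_wav(output_path, sampling_rate, audio_16bit)
print(f"Audio saved to: {output_path}")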
print(f"Sampling rate: {sampling_rate} Hz")

# Free memory
del synthesizer
gc.collect()
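
The normalize-and-convert step can be checked in isolation, without loading the TTS model. The following is a minimal sketch, assuming the waveform arrives as a float array that may carry a leading batch dimension. The helper name `to_pcm16` and the synthetic test tone are illustrative assumptions, and the `demo_` names are chosen so that running it does not overwrite the notebook's own `audio_16bit` or `sampling_rate`.

```python
import numpy as np


def to_pcm16(audio):
    """Peak-normalize a float waveform and convert it to 16-bit PCM samples."""
    audio = np.squeeze(np.asarray(audio, dtype=np.float32))
    # Guard against empty or all-zero input to avoid dividing by zero
    peak = np.max(np.abs(audio)) if audio.size else 0.0
    if peak > 0:
        audio = audio / peak
    return (audio * 32767).astype(np.int16)


# Synthetic one-second 440 Hz tone standing in for TTS output (illustrative only)
demo_rate = 24000
demo_t = np.arange(demo_rate) / demo_rate
demo_tone = 0.25 * np.sin(2 * np.pi * 440 * demo_t)

# Add a leading batch dimension to mimic a (1, N) shaped result
demo_pcm = to_pcm16(demo_tone[np.newaxis, :])
print(demo_pcm.dtype, demo_pcm.shape)
```

Scaling by 32767 rather than 32768 keeps a full-scale sample of exactly 1.0 inside the `int16` range.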

## Play Audio Output

Listen to the generated speech directly in the notebook.

In [None]:
# Play the audio inline
IPythonAudio(audio_16bit, rate=sampling_rate)