# Approach 1: LLM + Tools (Pipeline Multimodal AI)

This notebook demonstrates a **pipeline** approach to building a multimodal AI system. Instead of a single massive model that does everything, we chain together specialized models:

1.  **Speech-to-Text (Whisper):** Converts input audio into text (transcription).
2.  **Reasoning Engine (LLM - Phi-3):** Takes the transcription, reasons about it, and generates a text response AND a prompt for a new image.
3.  **Image Generator (Stable Diffusion):** Takes the prompt from the LLM and generates a new image.

**Architecture:**
`Audio` → `[Speech-to-Text]` → `Text Transcription`
`Text Transcription` → `[LLM]` → `Text Response` + `Image Prompt`
`Image Prompt` → `[Image Generator]` → `Output Image`
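
The three arrows above amount to plain function composition: each stage only sees the text produced by the previous one. A minimal sketch of that wiring, assuming hypothetical stand-in callables (`fake_transcribe`, `fake_reason`, `fake_draw`) in place of the real models loaded later in this notebook; `PipelineResult` and `run_pipeline` are illustrative names, not part of any library:

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class PipelineResult:
    transcription: str
    response: str
    image_prompt: str
    image: Any


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    reason: Callable[[str], Tuple[str, str]],
    draw: Callable[[str], Any],
) -> PipelineResult:
    """Chain the three stages; each stage consumes the previous stage's text."""
    transcription = transcribe(audio_path)
    response, image_prompt = reason(transcription)
    image = draw(image_prompt)
    return PipelineResult(transcription, response, image_prompt, image)


# Hypothetical stand-ins so the sketch runs without any model weights.
def fake_transcribe(path: str) -> str:
    return f"transcript of {path}"


def fake_reason(text: str) -> Tuple[str, str]:
    return (f"reply to: {text}", f"painting of {text}")


def fake_draw(prompt: str) -> str:
    return f"<image for '{prompt}'>"


demo = run_pipeline("clip.wav", fake_transcribe, fake_reason, fake_draw)
print(demo.image_prompt)
```

Because the stages are passed in as arguments, any one of them can be swapped (for example a different speech model) without touching the others.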

In [None]:
# Install necessary libraries
# - transformers: For Phi-3
# - diffusers: For Stable Diffusion
# - accelerate: For model offloading and optimization
# - bitsandbytes: For 4-bit quantization of the LLM
# - openai-whisper: For Speech-to-Text
!pip install -q transformers diffusers accelerate bitsandbytes torch torchvision openai-whisper

In [None]:
import torch
import matplotlib.pyplot as plt
import whisper

# Setup device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Helper to display images
def show_image(img, title=None):
    plt.figure(figsize=(8, 8))
    plt.imshow(img)
    if title:
        plt.title(title)
    plt.axis('off')
    plt.show()

## 1. Load Speech-to-Text Model (Whisper)
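We use the `base` checkpoint from the `openai-whisper` package. It is small enough to transcribe quickly and is usually accurate enough for clear English speech. Whisper decodes audio files through `ffmpeg`, which must be installed on the system.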

In [None]:
print("Loading Whisper model...")
# Load the 'base' model. Options: tiny, base, small, medium, large
whisper_model = whisper.load_model("base", device=device)
print("Whisper loaded.")

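def audio_to_text(audio_path):
    # transcribe() returns a dict; "text" holds the full transcript
    result = whisper_model.transcribe(audio_path)
    # Strip the leading/trailing whitespace Whisper often emits
    return result["text"].strip()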
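
Besides `"text"`, the dictionary returned by `transcribe` also carries a `"segments"` list whose entries include `start`, `end`, and `text`. A small sketch of a hypothetical helper (`format_segments` is not part of Whisper) that renders those entries as timestamped lines. It is shown on a hand-written sample dictionary rather than real model output:

```python
def format_segments(result: dict) -> str:
    """Render Whisper-style segments as '[start - end] text' lines."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(
            f"[{seg['start']:6.2f}s - {seg['end']:6.2f}s] {seg['text'].strip()}"
        )
    return "\n".join(lines)


# Hand-written sample that mimics the shape of a transcribe() result.
sample_result = {
    "text": " A city at night. Neon lights everywhere.",
    "segments": [
        {"start": 0.0, "end": 2.4, "text": " A city at night."},
        {"start": 2.4, "end": 5.1, "text": " Neon lights everywhere."},
    ],
}

timestamped = format_segments(sample_result)
print(timestamped)
```

Timestamps like these can help when checking which part of a recording produced an unexpected transcription.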

## 2. Load Reasoning LLM (Phi-3-mini)
We use `microsoft/Phi-3-mini-4k-instruct`. To help it fit in 8GB of VRAM alongside the other models, we load it with **4-bit** quantization via `bitsandbytes`.
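
To see why 4-bit loading matters here, a back-of-the-envelope estimate of the weight memory is enough. This sketch assumes the roughly 3.8 billion parameter count reported for Phi-3-mini and counts only the weights; it ignores activations, the KV cache, and the small overhead of quantization metadata. `weight_memory_gb` is a hypothetical helper for illustration:

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PHI3_MINI_PARAMS = 3.8e9  # assumption: approximate reported parameter count

fp16_gb = weight_memory_gb(PHI3_MINI_PARAMS, 16)
int4_gb = weight_memory_gb(PHI3_MINI_PARAMS, 4)

print(f"float16 weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:   {int4_gb:.1f} GB")
```

Under these assumptions the float16 weights alone would take most of an 8GB card, while the 4-bit weights leave room for Whisper and Stable Diffusion.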

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

print("Loading Phi-3 LLM...")
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

llm_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
llm_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=quantization_config,
    device_map="auto", 
    trust_remote_code=False # Use internal Transformers implementation to avoid DynamicCache errors
)
print("Phi-3 loaded.")

def run_llm(prompt):
    messages = [{"role": "user", "content": prompt}]
    # return_dict=True yields input_ids together with a matching attention_mask,
    # so the mask does not have to be inferred from pad_token_id
    inputs = llm_tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(llm_model.device)  # follow device_map="auto" rather than the global `device`

    outputs = llm_model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        pad_token_id=llm_tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the echoed prompt
    prompt_len = inputs["input_ids"].shape[1]
    text = llm_tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)[0]
    return text

## 3. Load Image Generation Model (Stable Diffusion 1.5)
We use `runwayml/stable-diffusion-v1-5`. We load it in `float16` to save memory.

In [None]:
from diffusers import StableDiffusionPipeline

print("Loading Stable Diffusion...")
# float16 halves weight memory on GPU; fall back to float32 on CPU,
# where half precision is poorly supported
sd_dtype = torch.float16 if device == "cuda" else torch.float32
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", 
    torch_dtype=sd_dtype
)
pipe.to(device)
# Optional: if you run out of VRAM, replace pipe.to(device) above with
# pipe.enable_model_cpu_offload() (requires 'accelerate'); do not call both.
print("Stable Diffusion loaded.")

def text_to_image(prompt):
    image = pipe(prompt).images[0]
    return image

## 4. End-to-End Pipeline Demo
Now we combine everything:
1. **Input:** An audio file.
2. **Hear:** Whisper transcribes the audio.
3. **Think:** The LLM reads the transcription and generates a text reply and a new image prompt.
4. **Create:** Stable Diffusion generates an image from that prompt.

In [None]:
# --- 1. INPUT ---
# Load a sample audio file
# Place a speech recording named 'example_audio.mp3' (or similar) in the working directory
audio_file = "example_audio.mp3" 

# If you don't have a file, uncomment the following line to download a sample (if available).
# Note: this sample is a WAV file of instrumental music, so Whisper will not produce a
# meaningful transcription from it; a speech recording is a better test.
# !wget -O example_audio.mp3 https://www2.cs.uic.edu/~i101/SoundFiles/BabyElephantWalk60.wav

print(f"Processing Audio File: {audio_file}")

# --- 2. HEAR (Audio to Text) ---
try:
    transcription = audio_to_text(audio_file)
    print(f"\n[Whisper] Transcription: {transcription}")
except Exception as e:
    print(f"Error processing audio: {e}")
    transcription = "A futuristic city with flying cars and neon lights." # Fallback for demo purposes

# --- 3. THINK (LLM Reasoning) ---
# We construct a structured prompt for the LLM
llm_prompt = f"""
You are a creative AI assistant.
I will provide a transcription of an audio recording.
Your job is to:
1. Improve and refine the description from the audio.
2. Create a concise, high-quality image generation prompt based on the refined description.

Audio Transcription: {transcription}

Format your response EXACTLY like this:
RESPONSE: [Your refined description here]
IMAGE_PROMPT: [Your image generation prompt here]
"""

print("\n[LLM] Thinking...")
llm_output = run_llm(llm_prompt)
print(f"[LLM] Raw Output:\n{llm_output}\n")

# Parse the output (simple string parsing)
try:
    response_text = llm_output.split("RESPONSE:")[1].split("IMAGE_PROMPT:")[0].strip()
    image_prompt = llm_output.split("IMAGE_PROMPT:")[1].strip()
except IndexError:
    # Either marker is missing: keep the raw text as the reply and reuse the transcription
    print("Error parsing LLM output. Falling back to the transcription as the image prompt.")
    response_text = llm_output
    image_prompt = transcription # Fallback

print(f"Parsed Response: {response_text}")
print(f"Parsed Image Prompt: {image_prompt}")

# --- 4. CREATE (Text to Image) ---
print(f"\n[Stable Diffusion] Generating image for: '{image_prompt}'...")
generated_image = text_to_image(image_prompt)

show_image(generated_image, "Generated Output")
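
The `split`-based parsing in the demo is sensitive to small deviations in the LLM's formatting, such as different capitalization, a prompt wrapped in brackets, or extra commentary after the prompt. One possible way to make it more tolerant is a regular expression with a fallback. This is a sketch, not the notebook's method; `parse_llm_output` and its `fallback_prompt` parameter are hypothetical names:

```python
import re
from typing import Tuple

# Capture everything between the two markers, then everything after the second.
_FIELDS = re.compile(
    r"RESPONSE:\s*(?P<response>.*?)\s*IMAGE_PROMPT:\s*(?P<prompt>.*)",
    re.DOTALL | re.IGNORECASE,
)


def parse_llm_output(raw: str, fallback_prompt: str) -> Tuple[str, str]:
    """Return (response_text, image_prompt), falling back when markers are missing."""
    match = _FIELDS.search(raw)
    if match is None:
        # No recognizable structure: keep the raw text as the reply.
        return raw.strip(), fallback_prompt

    response = match.group("response").strip()
    # Keep only the first non-empty line of the prompt and drop wrapping
    # brackets or quotes that the model may copy from the format template.
    prompt_lines = [ln.strip() for ln in match.group("prompt").splitlines() if ln.strip()]
    prompt = prompt_lines[0].strip("[]\"' ") if prompt_lines else ""
    return response, (prompt or fallback_prompt)


raw_output = (
    "RESPONSE: A neon city.\n"
    "IMAGE_PROMPT: [neon city at night, flying cars]\n"
    "Extra note"
)
parsed = parse_llm_output(raw_output, fallback_prompt="a city")
print(parsed)
```

Keeping only the first line of the prompt is a design choice that trades completeness for a shorter prompt, which may suit Stable Diffusion's limited text-encoder context.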