<div align="center">
<img src="https://poorit.in/image.png" alt="Poorit" width="40" style="vertical-align: middle;"> <b>AI SYSTEMS ENGINEERING 1</b>

## Unit 2: Beyond Text - Images, Audio, and Multi-Model Access

**CV Raman Global University, Bhubaneswar**  
*AI Center of Excellence*

---

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Poorit-Technologies/cvraman-coe/blob/main/courses-contents/ai-systems-engineering-1/unit-2/03-ai-systems-engineering-1-unit2-beyond-text.ipynb)

</div>

---

### What You'll Learn

In this notebook, you will:

1. **Use LiteLLM** to access multiple LLM providers with one interface
2. **Generate images** from text prompts using `gpt-image-1-mini`
3. **Create audio with text-to-speech** using OpenAI's `gpt-4o-mini-tts`
4. **Practice** with hands-on exercises

**Duration:** ~45 minutes

---

## 1. Environment Setup

In [None]:
# Install required packages
!pip install -q openai litellm pillow

In [None]:
import os
import base64
from io import BytesIO
from getpass import getpass
from openai import OpenAI
from litellm import completion
from PIL import Image
from IPython.display import Audio, display

In [None]:
# Configure API key
api_key = getpass("Enter your OpenAI API Key: ")
os.environ['OPENAI_API_KEY'] = api_key
client = OpenAI(api_key=api_key)
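
If you rerun the notebook often, typing the key every time gets tedious. The following is a minimal sketch, assuming a hypothetical helper named `load_key` (it is not part of any library): it reuses the key when the environment variable is already set and only prompts otherwise. The `env` and `ask` parameters are assumptions added so the logic can be tried without a real key.

```python
import os
from getpass import getpass


def load_key(name, env=None, ask=getpass):
    """Return the API key stored under `name`, prompting only if it is missing.

    `load_key`, `env`, and `ask` are illustrative names, not a library API.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        value = ask(f"Enter your {name}: ")
        env[name] = value  # remember it for later cells
    return value


# Offline demo: a plain dict and a fake prompt stand in for real input.
demo_env = {"OPENAI_API_KEY": "sk-demo"}
print(load_key("OPENAI_API_KEY", env=demo_env, ask=lambda prompt: "typed-key"))

empty_env = {}
print(load_key("OPENAI_API_KEY", env=empty_env, ask=lambda prompt: "typed-key"))
print(empty_env)
```

In the notebook you would call `load_key("OPENAI_API_KEY")` with the defaults, which prompts through `getpass` only when the variable is missing.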

---

## 2. One Interface for Many Models -- LiteLLM

So far we've used the OpenAI Python library directly. But what if you want to switch to Google Gemini, Anthropic Claude, or a local model?

**LiteLLM** gives you a single `completion()` function that works with 100+ models across many providers. You just change the model name -- the rest of your code stays the same.

| Provider | Model Name in LiteLLM | API Key Env Variable |
|----------|----------------------|---------------------|
| OpenAI | `openai/gpt-4o-mini` | `OPENAI_API_KEY` |
| Google | `gemini/gemini-2.0-flash` | `GOOGLE_API_KEY` |
| Anthropic | `anthropic/claude-sonnet-4-5-20250929` | `ANTHROPIC_API_KEY` |
| Local (Ollama) | `ollama/llama3.2` | -- |
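
The table can also be read as a lookup: the part of the model name before the `/` identifies the provider, and the provider determines which key must be set. A small sketch of that idea, assuming hypothetical helpers `required_key` and `is_ready` (these are not LiteLLM functions) and the environment variable names exactly as listed in the table:

```python
import os

# Provider prefix -> environment variable, copied from the table above.
# Ollama runs locally, so it needs no key.
KEY_FOR_PROVIDER = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "ollama": None,
}


def required_key(model):
    """Return the env variable a model name needs (None if no key is needed)."""
    provider = model.split("/", 1)[0]
    return KEY_FOR_PROVIDER[provider]  # KeyError for providers not in the table


def is_ready(model, env=None):
    """True when the key for this model is set, or no key is required."""
    env = os.environ if env is None else env
    key = required_key(model)
    return key is None or bool(env.get(key))


# Offline demo with a made-up environment that only has an OpenAI key.
demo_env = {"OPENAI_API_KEY": "sk-demo"}
for model in ["openai/gpt-4o-mini", "gemini/gemini-2.0-flash", "ollama/llama3.2"]:
    print(model, "->", required_key(model), "| ready:", is_ready(model, demo_env))
```

A check like this before calling `completion()` can give a clearer message than an authentication error from the provider.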

Let's try it with OpenAI:

In [None]:
# LiteLLM uses the same messages format you already know
response = completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Tell me a fun fact about India in one sentence."}]
)

print(response.choices[0].message.content)
print(f"\nTokens used: {response.usage.total_tokens}")

> **Key idea:** To switch models, you only change the `model` string.  
> For example, replacing `"openai/gpt-4o-mini"` with `"gemini/gemini-2.0-flash"` (after setting `GOOGLE_API_KEY`) would route the same request to Google's API -- no other code changes needed.
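
To make the key idea concrete, here is a hedged sketch that wraps the call in one function and loops over model names. The names `ask` and `fake_completion` are hypothetical; `fake_completion` only imitates the `response.choices[0].message.content` shape used above so the example runs offline without any API key. Leaving out the `complete` argument would use LiteLLM's real `completion()`.

```python
from types import SimpleNamespace


def ask(model, prompt, complete=None):
    """Send one user prompt to `model` and return the reply text."""
    if complete is None:
        # Real network call; needs litellm installed and the matching API key.
        from litellm import completion as complete
    response = complete(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def fake_completion(model, messages):
    """Offline stand-in that mimics the shape of a completion response."""
    text = f"[{model}] {messages[0]['content']}"
    message = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


# Same function, same prompt; only the model string changes.
for name in ["openai/gpt-4o-mini", "gemini/gemini-2.0-flash"]:
    print(ask(name, "Say hello.", complete=fake_completion))
```

Because the model name is just a parameter, comparing two providers on the same prompt becomes a loop rather than two separate code paths.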

---

## 3. Image Generation with GPT Image

`gpt-image-1-mini` generates images from text descriptions. We send a prompt and get back the image as base64-encoded data, which the function below decodes into a PIL image.

**Cost:** ~$0.005 per image (1024x1024, low quality)

In [None]:
def generate_image(prompt):
    """Generate an image using gpt-image-1-mini."""
    response = client.images.generate(
        model="gpt-image-1-mini",
        prompt=prompt,
        size="1024x1024",
        quality="low",
        n=1
    )

    image_data = base64.b64decode(response.data[0].b64_json)
    return Image.open(BytesIO(image_data))

In [None]:
# Generate an image
image = generate_image("The Taj Mahal at sunset with birds flying, vibrant colors")
display(image)
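
Generated images cost money, so it is worth saving them instead of regenerating. The sketch below shows the decode-and-save step on its own, using only the standard library. The helper names `save_b64_image` and `looks_like_png` are hypothetical, the payload is a stand-in (a PNG header plus filler bytes, not a viewable picture), and the PNG check assumes the API's default output format.

```python
import base64
import tempfile
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def save_b64_image(b64_json, path):
    """Decode a base64 string and write the raw bytes to `path`."""
    data = base64.b64decode(b64_json)
    Path(path).write_bytes(data)
    return len(data)


def looks_like_png(path):
    """Check the file header instead of trusting the file extension."""
    return Path(path).read_bytes()[:8] == PNG_SIGNATURE


# Stand-in payload so the example runs without calling the API.
fake_payload = base64.b64encode(PNG_SIGNATURE + b"demo-bytes").decode("ascii")

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "taj_mahal.png"
    size = save_b64_image(fake_payload, target)
    print(size, looks_like_png(target))
```

With the `Image` object returned by `generate_image()`, calling `image.save("taj_mahal.png")` achieves the same result in one line.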

---

## 4. Text-to-Speech

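OpenAI's `gpt-4o-mini-tts` converts text into natural-sounding speech. You choose a voice and send the text, and the API returns the audio as bytes (MP3 by default). The function below pins a dated snapshot of the model.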

**Available voices:** alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse  
**Cost:** ~$0.016 per 1,000 characters

In [None]:
def text_to_speech(text, voice="alloy"):
    """Convert text to speech audio."""
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts-2025-03-20",
        voice=voice,
        input=text
    )
    return response.content

In [None]:
# Generate and play speech
audio_content = text_to_speech(
    "Welcome to the AI Systems Engineering course at CV Raman Global University!"
)

# Save and play
with open("welcome.mp3", "wb") as f:
    f.write(audio_content)

Audio("welcome.mp3")
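
A natural next step is to compare voices on the same sentence. The sketch below is one way to do that, assuming a hypothetical helper `save_voice_samples`; it validates voice names against the list in this section and writes one file per voice. The `fake_tts` function is an offline stand-in with the same signature as `text_to_speech()`, so no API call is made here.

```python
import tempfile
from pathlib import Path

# Voice names copied from the list in this section.
AVAILABLE_VOICES = {
    "alloy", "ash", "ballad", "coral", "echo", "fable",
    "onyx", "nova", "sage", "shimmer", "verse",
}


def save_voice_samples(text, voices, synthesize, out_dir):
    """Write one audio file per voice and return the saved paths."""
    unknown = [v for v in voices if v not in AVAILABLE_VOICES]
    if unknown:
        raise ValueError(f"Unknown voice(s): {unknown}")
    paths = []
    for voice in voices:
        path = Path(out_dir) / f"sample_{voice}.mp3"
        path.write_bytes(synthesize(text, voice=voice))
        paths.append(path)
    return paths


def fake_tts(text, voice="alloy"):
    """Offline stand-in that returns bytes like text_to_speech() does."""
    return f"{voice}:{text}".encode("utf-8")


with tempfile.TemporaryDirectory() as folder:
    saved = save_voice_samples("Hello!", ["alloy", "coral"], fake_tts, folder)
    print([p.name for p in saved])
```

In the notebook, passing `text_to_speech` instead of `fake_tts` would produce real audio; each voice is a separate billed request, so keep the sample text short.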

---

## 5. Cost Awareness

AI APIs charge per use. Here's a quick reference so you can estimate costs before running code:

| Feature | Approximate Cost | Example |
|---------|------------------|--------|
| GPT-4o-mini (text) | ~$0.001 per request (varies with token count) | Chat response |
| Gemini Flash | Free tier available | Chat response |
| gpt-image-1-mini (low) | ~$0.005 per image | One generated image |
| gpt-4o-mini-tts | ~$0.016 per 1K chars | ~1 paragraph of audio |

> **Tip:** During development and learning, use the cheapest models (GPT-4o-mini, Gemini Flash, gpt-image-1-mini with low quality). Save expensive calls (GPT-4o, high-quality images) for when you really need them.
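
The table's numbers can be turned into a quick estimate before a session. This is a rough sketch using only the approximate prices listed above; the function name `estimate_cost` and its parameters are hypothetical, and real prices change over time and depend on token counts.

```python
# Approximate prices copied from the table above (USD); real prices change.
PRICE_PER_TEXT_REQUEST = 0.001
PRICE_PER_LOW_IMAGE = 0.005
PRICE_PER_1K_TTS_CHARS = 0.016


def estimate_cost(text_requests=0, images=0, tts_chars=0):
    """Return a rough session cost in USD from the table's numbers."""
    return (
        text_requests * PRICE_PER_TEXT_REQUEST
        + images * PRICE_PER_LOW_IMAGE
        + tts_chars / 1000 * PRICE_PER_1K_TTS_CHARS
    )


# Example session: 20 chat requests, 5 images, 2,000 characters of speech.
total = estimate_cost(text_requests=20, images=5, tts_chars=2000)
print(f"Estimated cost: ${total:.3f}")
```

For this example the three parts are $0.020, $0.025, and $0.032, so the session comes to about $0.077.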

---

## 6. Exercises

### Exercise 1: Use Gemini via LiteLLM

Use LiteLLM to call Google's Gemini model. You'll need a Google API key (free tier available).

**Steps:**
1. Set your `GOOGLE_API_KEY`
2. Use `completion()` with model `"gemini/gemini-2.0-flash"`
3. Ask it to explain any topic in 2 sentences

In [None]:
# Exercise 1: Call Gemini using LiteLLM
# Hint: It works like the OpenAI call above -- set GOOGLE_API_KEY, then change the model name

# os.environ['GOOGLE_API_KEY'] = getpass("Enter your Google API Key: ")

# response = completion(
#     model="gemini/gemini-2.0-flash",
#     messages=[{"role": "user", "content": "Explain machine learning in 2 sentences."}]
# )
# print(response.choices[0].message.content)

### Exercise 2: Generate an Image

Use the `generate_image()` function from Section 3 to create an image of your choice.

**Steps:**
1. Write a descriptive prompt (be specific -- colors, style, scene)
2. Call `generate_image()` with your prompt
3. Display the result

In [None]:
# Exercise 2: Generate an image with your own prompt
# Try describing a scene, place, or object in detail

# image = generate_image("your prompt here")
# display(image)

---

## Key Takeaways

1. **LiteLLM provides a unified interface** -- one `completion()` function for 100+ models; switch providers by changing the model string

2. **gpt-image-1-mini generates images from text** -- use `client.images.generate()` with a descriptive prompt

3. **gpt-4o-mini-tts creates natural audio** -- use `client.audio.speech.create()` to convert any text to speech

4. **Always be cost-aware** -- know the price of each API call before running it

### What's Next?

You've completed Unit 2! In Unit 3, we'll explore:
- Open-source models on Hugging Face
- Model fine-tuning basics
- Evaluation and benchmarking

---

## Additional Resources

- [LiteLLM Documentation](https://docs.litellm.ai/)
- [Image Generation Guide](https://platform.openai.com/docs/guides/image-generation)
- [Text-to-Speech Guide](https://platform.openai.com/docs/guides/text-to-speech)

---

**Course Information:**
- **Institution:** CV Raman Global University, Bhubaneswar
- **Program:** AI Center of Excellence
- **Course:** AI Systems Engineering 1
- **Developed by:** [Poorit Technologies](https://poorit.in) - *Transform Graduates into Industry-Ready Professionals*

---