# Let's go multi-modal!!

We can use DALL-E 3, OpenAI's image generation model, to make us some images.

Let's put this in a function called artist.

### Price alert: each time I generate an image it costs about 4c - don't go crazy with images!

In [None]:
import os
import json
from dotenv import load_dotenv
from openai import OpenAI
import gradio as gr

In [None]:
# Initialization

load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
MODEL = "gpt-4o-mini"
openai = OpenAI()

In [None]:
system_message = "You are a helpful assistant for an Airline called FlightAI. "
system_message += "Give short, courteous answers, no more than 1 sentence. "
system_message += "Always be accurate. If you don't know the answer, say so."

In [None]:
# Let's start by making a useful function

ticket_prices = {"london": "$799", "paris": "$899", "tokyo": "$1400", "berlin": "$499"}

def get_ticket_price(destination_city):
    print(f"Tool get_ticket_price called for {destination_city}")
    city = destination_city.lower()
    return ticket_prices.get(city, "Unknown")

In [None]:
# There's a particular dictionary structure that's required to describe our function:

price_function = {
    # "name": Identifies the function so the model can ask for it by name.
    "name": "get_ticket_price",
    # "description": Explains when and why the function should be used, guiding the model to request it when ticket price information is needed.
    "description": "Get the price of a return ticket to the destination city. Call this whenever you need to know the ticket price, for example when a customer asks 'How much is a ticket to this city'",
    # "parameters": Defines the structure of inputs required by the function.
    "parameters": {
        "type": "object",  # The parameters are passed as a JSON object
        # "properties": Lists the fields the function accepts.
        "properties": {
            # The destination for which the customer wants a ticket price.
            "destination_city": {
                "type": "string",
                "description": "The city that the customer wants to travel to",
            },
        },
        # "required": Ensures destination_city is provided whenever this function is called, as it’s essential for retrieving the price.
        "required": ["destination_city"],
        # "additionalProperties": Restricts inputs to just the specified parameters, improving reliability.
        "additionalProperties": False,
    },
}

In [None]:
# This list of tools provides a standardized way for an AI model to access various functions it might use during interactions.
# "type": "function": Specifies that each item in the list is a function type, indicating the role of the item.
# "function": price_function: Associates the actual function dictionary (price_function) with the tool.
tools = [{"type" : "function", "function" : price_function}]

In [None]:
def chat(message, history):
    messages = [{"role": "system", "content": system_message}] + history + [{"role": "user", "content": message}]
    response = openai.chat.completions.create(model=MODEL, messages=messages, tools=tools)

    if response.choices[0].finish_reason=="tool_calls":
        message = response.choices[0].message
        response, city = handle_tool_call(message)
        messages.append(message)
        messages.append(response)
        response = openai.chat.completions.create(model=MODEL, messages=messages)
    
    return response.choices[0].message.content

In [None]:
# We have to write that function handle_tool_call:

def handle_tool_call(message):
    tool_call = message.tool_calls[0]
    arguments = json.loads(tool_call.function.arguments)
    city = arguments.get('destination_city')
    price = get_ticket_price(city)
    response = {
        "role": "tool",
        "content": json.dumps({"destination_city": city,"price": price}),
        "tool_call_id": message.tool_calls[0].id
    }
    return response, city
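
The `handle_tool_call` function above looks only at the first tool call in the message. A model can ask for several tool calls in one message, and each one needs its own `tool` response with a matching `tool_call_id`. Here is a minimal sketch of one way to loop over them. The names `handle_tool_calls`, `TOOL_FUNCTIONS` and `fake_call` are made up for this illustration, and the message is a stand-in built with `SimpleNamespace` so the cell runs without calling the API.

```python
import json
from types import SimpleNamespace

ticket_prices = {"london": "$799", "paris": "$899", "tokyo": "$1400", "berlin": "$499"}

def get_ticket_price(destination_city):
    return ticket_prices.get(destination_city.lower(), "Unknown")

# Hypothetical lookup table: tool name in the schema -> Python function to run
TOOL_FUNCTIONS = {"get_ticket_price": get_ticket_price}

def handle_tool_calls(message):
    """Build one 'tool' message per tool call and collect the cities asked about."""
    responses, cities = [], []
    for tool_call in message.tool_calls:
        arguments = json.loads(tool_call.function.arguments)
        function = TOOL_FUNCTIONS.get(tool_call.function.name)
        if function is None:
            result = {"error": f"Unknown tool {tool_call.function.name}"}
        else:
            city = arguments.get("destination_city")
            cities.append(city)
            result = {"destination_city": city, "price": function(**arguments)}
        responses.append({
            "role": "tool",
            "content": json.dumps(result),
            "tool_call_id": tool_call.id,
        })
    return responses, cities

# Stand-in for the message object the API returns (an assumption for this demo)
def fake_call(call_id, city):
    return SimpleNamespace(
        id=call_id,
        function=SimpleNamespace(
            name="get_ticket_price",
            arguments=json.dumps({"destination_city": city}),
        ),
    )

fake_message = SimpleNamespace(tool_calls=[fake_call("call_1", "London"), fake_call("call_2", "Paris")])
responses, cities = handle_tool_calls(fake_message)
print(responses)
print(cities)
```

If you used something like this in `chat`, you would append the assistant message once and then extend `messages` with every item in `responses` before calling the model again.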

In [None]:
# Some imports for handling images

import base64
from io import BytesIO
from PIL import Image

In [None]:
def artist(city):
    image_response = openai.images.generate(
            model="dall-e-3",
            prompt=f"An image representing a vacation in {city}, showing tourist spots and everything unique about {city}, in a vibrant pop-art style",
            size="1024x1024",
            n=1,
            response_format="b64_json",
        )
    image_base64 = image_response.data[0].b64_json
    image_data = base64.b64decode(image_base64)
    return Image.open(BytesIO(image_data))
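
Since every image costs money, one option is to keep generated images on disk and reuse them when the same city comes up again. This is only a sketch under some assumptions: `cached_image_bytes` and `fake_generate_b64` are hypothetical names, the `.png` extension assumes the API returns PNG data, and the generator here is a fake so that running the cell costs nothing.

```python
import base64
import tempfile
from pathlib import Path

def cached_image_bytes(city, generate_b64, cache_dir):
    """Return image bytes for a city, calling generate_b64 only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Build a simple, filesystem-safe file name from the city
    safe_name = "".join(ch if ch.isalnum() else "_" for ch in city.lower())
    image_path = cache_dir / f"{safe_name}.png"  # assumes PNG data
    if image_path.exists():
        return image_path.read_bytes()
    image_data = base64.b64decode(generate_b64(city))
    image_path.write_bytes(image_data)
    return image_data

# Fake generator that records how often it is called (stands in for the paid API call)
calls = []
def fake_generate_b64(city):
    calls.append(city)
    return base64.b64encode(f"image of {city}".encode()).decode()

with tempfile.TemporaryDirectory() as tmp:
    first = cached_image_bytes("Paris", fake_generate_b64, tmp)
    second = cached_image_bytes("paris", fake_generate_b64, tmp)  # served from the cache

print(calls)
```

In the notebook, `generate_b64` could be a small function that makes the `openai.images.generate` call from `artist` and returns `image_response.data[0].b64_json`; the returned bytes can then go through `Image.open(BytesIO(...))` as before.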

## Audio

And let's make a function, talker, that uses OpenAI's speech model to generate audio.

### Troubleshooting Audio issues

If you have any problems running this code below (like a FileNotFound error, or a warning of a missing package), you may need to install FFmpeg, a very popular audio utility.

**For PC Users**

1. Download FFmpeg from the official website: https://ffmpeg.org/download.html

2. Extract the downloaded files to a location on your computer (e.g., `C:\ffmpeg`)

3. Add the FFmpeg bin folder to your system PATH:
- Right-click on 'This PC' or 'My Computer' and select 'Properties'
- Click on 'Advanced system settings'
- Click on 'Environment Variables'
- Under 'System variables', find and edit 'Path'
- Add a new entry with the path to your FFmpeg bin folder (e.g., `C:\ffmpeg\bin`)
- Restart your command prompt, and within Jupyter Lab do Kernel -> Restart kernel, to pick up the changes

4. Open a new command prompt and run this to make sure it's installed OK
`ffmpeg -version`

**For Mac Users**

1. Install homebrew if you don't have it already by running this in a Terminal window and following any instructions:  
`/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"`

2. Then install FFmpeg with `brew install ffmpeg`

3. Verify your installation with `ffmpeg -version` and if everything is good, within Jupyter Lab do Kernel -> Restart kernel to pick up the changes

Message me or email me at ed@edwarddonner.com with any problems!

# For Mac users

This version should work fine for you. It might work for Windows users too, but you might get a Permissions error writing to a temp file. If so, see the next section!

As always, if you have problems, please contact me! (You could also comment out the audio talker() in the later code if you're less interested in audio generation)

When you call **talker("Well, hi there")**, here’s what happens step-by-step:
### **1. Calling the Function:**
* "Well, hi there" is passed to the **talker** function as the **message** parameter.
### **2. Generating Speech:**
* The function sends "Well, hi there" to OpenAI’s Text-to-Speech (TTS) API using the specified **tts-1** model with the **"onyx"** voice.
* OpenAI’s API converts this text to speech, and the **response** object contains the audio data in binary form.
### **3. Converting Audio Data for Playback:**
* **BytesIO(response.content)** wraps the binary audio data in an in-memory file-like object, **audio_stream**.
* **AudioSegment.from_file(audio_stream, format="mp3")** then converts this object into an **AudioSegment**, allowing easy handling and playback.
### **4. Playing the Audio:**
* The **play** function outputs the audio through the default audio device on your system, so you should hear the phrase **"Well, hi there"** spoken in the selected **"onyx"** voice.

In [None]:
# AudioSegment: This class from the pydub library provides tools for loading, manipulating, and saving audio files. 
# Here, it's used to handle audio data returned from OpenAI’s API.
from pydub import AudioSegment

# play: This function from pydub.playback plays audio directly through the system’s default audio output.
# It's used here to immediately play the generated audio without needing to save it to disk.
from pydub.playback import play

def talker(message):
    # openai.audio.speech.create(): This function call sends a request to OpenAI’s API to convert text into speech audio. 
    response = openai.audio.speech.create(
        model = "tts-1", # tts-1 is OpenAI's standard text-to-speech model, tuned for low latency.
        # voice="onyx": Selects a specific voice style for the generated speech.
        voice = "onyx",  # "alloy" is another available voice
        input = message # Send the message text as input to be converted into speech.
    )
    # response.content: This contains the audio data returned by the TTS API in binary format, ready for playback or further manipulation.
    # Wrapping it in BytesIO enables loading the data without needing to save it as an intermediate file.
    audio_stream = BytesIO(response.content)

    # Load the audio data from the BytesIO stream into an AudioSegment object, specifying the format as "mp3".
    # This format specification aligns with the format returned by the OpenAI TTS API.
    audio = AudioSegment.from_file(audio_stream, format = "mp3")
    
    play(audio) # Play the AudioSegment object directly through the system’s speakers.

In [None]:
talker("Well, hi there.")

# For Windows users

## If you get a permissions error writing to a temp file, this code should work instead.

A collaboration between student Mark M. and Claude got this resolved!

In [None]:
# AudioSegment: This class from the pydub library provides tools for loading, manipulating, and saving audio files. 
# Here, it's used to handle audio data returned from OpenAI’s API.
# from pydub import AudioSegment

# play: This function from pydub.playback plays audio directly through the system’s default audio output.
# It's used here to immediately play the generated audio without needing to save it to disk.
# from pydub.playback import play

# tempfile: This module provides tools for creating temporary files and directories.
# Here, it's used to find a temporary location for saving audio data as a .wav file.
import tempfile

# subprocess: This module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.
# Here, it’s used to call an external audio player (ffplay) to play the audio.
import subprocess

def play_audio(audio_segment):
    # Retrieve the system’s temporary directory, ensuring a safe location for the audio file.
    temp_dir = tempfile.gettempdir() 
    # Create the full path for the temporary audio file within the temp directory.
    temp_path = os.path.join(temp_dir, "temp_audio.wav")
    try:
        # Export the AudioSegment as a .wav file to temp_path. This format works well with ffplay.
        audio_segment.export(temp_path, format="wav")
        # Invokes ffplay, an audio player included with ffmpeg, to play the .wav file without displaying a window (-nodisp)
        # and automatically closing after playback (-autoexit). Redirecting stdout and stderr to DEVNULL silences any output.
        subprocess.call([
            "ffplay",
            "-nodisp",
            "-autoexit",
            "-hide_banner",
            temp_path
        ], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # finally block: After playback, the finally block deletes the temporary file (temp_path). 
    # If an error occurs during deletion, it’s safely ignored
    finally:
        try:
            os.remove(temp_path)
        except Exception:
            pass
 

def talker(message):
    # openai.audio.speech.create(): This function call sends a request to OpenAI’s API to convert text into speech audio. 
    response = openai.audio.speech.create(
        model = "tts-1", # tts-1 is OpenAI's standard text-to-speech model, tuned for low latency.
        # voice="onyx": Selects a specific voice style for the generated speech.
        voice = "onyx",  # "alloy" is another available voice
        input = message # Send the message text as input to be converted into speech.
    )
    # response.content: This contains the audio data returned by the TTS API in binary format, ready for playback or further manipulation.
    # Wrapping it in BytesIO enables loading the data without needing to save it as an intermediate file.
    audio_stream = BytesIO(response.content)

    # Load the audio data from the BytesIO stream into an AudioSegment object, specifying the format as "mp3".
    # This format specification aligns with the format returned by the OpenAI TTS API.
    audio = AudioSegment.from_file(audio_stream, format = "mp3")
    
    play_audio(audio) # Play the AudioSegment through ffplay, using the play_audio helper defined above.

In [None]:
talker("Well hi there")
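
The `play_audio` helper above always writes to the same file name, `temp_audio.wav`, so two playbacks running at the same moment could overwrite each other's file. One possible variation, sketched here with only the standard library, is to let `tempfile` pick a unique name each time. The names `with_temp_audio_file` and `fake_player` are made up for this example, and the player is a fake so the cell runs without FFmpeg installed.

```python
import os
import tempfile

def with_temp_audio_file(audio_bytes, player, suffix=".wav"):
    """Write audio_bytes to a uniquely named temp file, hand the path to player, then delete it."""
    # delete=False lets us close the file before another program opens it;
    # on Windows a NamedTemporaryFile generally cannot be reopened while still open.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as temp_file:
        temp_file.write(audio_bytes)
        temp_path = temp_file.name
    try:
        return player(temp_path)
    finally:
        # Clean up, ignoring errors if the file is already gone
        try:
            os.remove(temp_path)
        except OSError:
            pass

# Fake player for the demo: records the path and reads the file back
seen = {}
def fake_player(path):
    seen["path"] = path
    with open(path, "rb") as f:
        return f.read()

played = with_temp_audio_file(b"fake audio", fake_player)
print(played)
```

In the notebook, `player` could be a function that runs the same `ffplay` command through `subprocess.call` as in `play_audio`.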

# Our Agent Framework

'Agentic AI' and 'Agentization' are umbrella terms that refer to a number of techniques, such as:

1. Breaking a complex problem into smaller steps, with multiple LLMs carrying out specialized tasks
2. The ability for LLMs to use Tools to give them additional capabilities
3. The 'Agent Environment' which allows Agents to collaborate
4. An LLM can act as the Planner, dividing bigger tasks into smaller ones for the specialists
5. The concept of an Agent having autonomy / agency, beyond just responding to a prompt - such as Memory

We're showing 1 and 2 here, and to a lesser extent 3 and 5. In week 8 we will do the lot!

### **Purpose of Each Step**
* **Context Management:** Adds both system and previous user-assistant messages to maintain conversation flow.
* **Response Generation:** Uses OpenAI’s chat API to produce responses, with optional tool support.
* **Dynamic Tool Calls:** Runs the ticket price tool when the model asks for it and, when that happens, also generates an image of the destination city, providing richer responses.
* **Speech Integration:** Converts text to speech, making the interaction more engaging and accessible.

In [None]:
def chat(history):
    # Combines a system message (defined by system_message) with the history of previous messages. 
    # The system message provides instructions or context for the AI, such as conversational tone or focus areas.
    # history here includes all prior interactions (both user and assistant messages), creating a conversational context for generating coherent responses.
    messages = [{"role": "system", "content": system_message}] + history

    # Call OpenAI’s API to generate a response
    # tools=tools: Lists the tools the model may ask us to run (here, just the get_ticket_price function).
    response = openai.chat.completions.create(model=MODEL, messages=messages, tools=tools)
    
    # Initialize image as None. It will be set to an image if the response triggers a tool call for image generation.
    image = None

    # Checks whether the model has asked for a tool call (here, a ticket price lookup).
    if response.choices[0].finish_reason=="tool_calls": 
        message = response.choices[0].message # Capture the tool call message.
        # The handle_tool_call function runs the requested tool and also returns the city that was asked about.
        response, city = handle_tool_call(message)
        # Adds the tool call message and its response to messages to maintain conversation continuity.
        messages.append(message)
        messages.append(response)
        # Calls the artist function with city, generating an image for this context. image is updated with this result.
        image = artist(city)
        # Calls the API again, this time with the tool call and its result in the conversation, to generate a follow-up response.
        # The image itself is not sent to the model; it is only returned for display.
        response = openai.chat.completions.create(model=MODEL, messages=messages)
   
    # Retrieve the assistant’s final reply.    
    reply = response.choices[0].message.content 
    # Update history with the assistant's response.
    history += [{"role":"assistant", "content":reply}]
    # Call talker to convert reply to speech, enhancing the interactive experience with spoken feedback.
    talker(reply)
    # Return the updated conversation history and any generated image. If no image was created, image remains None.
    return history, image

###  **Purpose of Each Component**

This setup provides a tailored, interactive AI assistant experience with multi-modal capabilities. 

* **Custom Chat Interface:** Creates a flexible, multi-modal chat UI that includes both text and image outputs, bypassing Gradio’s preset interface.
* **Function Chaining:** Uses Gradio’s **.submit()** and **.then()** methods to process user input, update history, call the assistant for a response, and display output in a seamless, interactive flow.
* **Clear Button:** Allows users to reset the conversation quickly, improving usability.


In [None]:
# More involved Gradio code as we're not using the preset Chat interface!
# Passing in inbrowser=True in the last line will cause a Gradio window to pop up immediately.

# This groups UI elements, creating a cohesive layout for the custom chat interface.
with gr.Blocks() as ui:
    # Arrange elements horizontally within each row.
    with gr.Row():
        # Create a chat display with a fixed height of 500 pixels. type="messages" means the history is a list of dictionaries with "role" and "content" keys.
        chatbot = gr.Chatbot(height=500, type="messages")
        # Provide an image display beside the chat, which can be used to show images generated by the assistant.
        image_output = gr.Image(height=500)
    with gr.Row():
        # Add a textbox where users can type messages, labeled for clarity.
        entry = gr.Textbox(label="Chat with our AI Assistant:")
    with gr.Row():
        # Add a button labeled "Clear" to reset the chat history
        clear = gr.Button("Clear")
        
    # The function appends the user’s message to history, then returns an empty string to clear the textbox, and updates the chatbot display.
    # message: Represents the current input from the user.
    # history: Holds the conversation history.
    def do_entry(message, history):
        history += [{"role":"user", "content":message}]
        return "", history

    # Configures what happens when the user submits a message
    # do_entry: First, the do_entry function updates history with the new user message.
    # After do_entry completes, the chat function is called, using chatbot as input. 
    # chat processes the conversation history and generates an assistant response, 
    # which appears in the chatbot and updates the image_output if an image is generated.
    entry.submit(do_entry, inputs=[entry, chatbot], outputs=[entry, chatbot]).then(
        chat, inputs=chatbot, outputs=[chatbot, image_output]
    )
    # This line links the "Clear" button to a function that resets the chatbot display, effectively clearing the chat history. 
    # The queue=False setting ensures the reset action occurs immediately.
    clear.click(lambda: None, inputs=None, outputs=chatbot, queue=False)

# Launch the Gradio UI and open it in a new browser tab, enabling immediate access to the chat interface.
ui.launch(inbrowser=True)

# Business Applications

Add in more tools - perhaps to simulate actually booking a flight. A student has done this and provided their example in the community contributions folder.
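
As a starting point for a second tool, here is a minimal sketch of a simulated booking function and its matching schema. Everything in it is hypothetical: `book_flight`, `bookings`, `booking_function`, `extra_tools`, the `passenger_name` parameter and the confirmation id format are invented for illustration, and nothing is really booked.

```python
import json

bookings = []  # in-memory stand-in for a real booking system

def book_flight(destination_city, passenger_name):
    """Pretend to book a flight and return a confirmation record."""
    confirmation = {
        "confirmation_id": f"FL{len(bookings) + 1:04d}",
        "destination_city": destination_city,
        "passenger_name": passenger_name,
    }
    bookings.append(confirmation)
    return confirmation

# Schema in the same shape as price_function
booking_function = {
    "name": "book_flight",
    "description": "Book a return ticket to the destination city. Call this only after the customer has confirmed that they want to book.",
    "parameters": {
        "type": "object",
        "properties": {
            "destination_city": {
                "type": "string",
                "description": "The city that the customer wants to travel to",
            },
            "passenger_name": {
                "type": "string",
                "description": "The name of the passenger the ticket is for",
            },
        },
        "required": ["destination_city", "passenger_name"],
        "additionalProperties": False,
    },
}

# In the notebook this entry would sit in the tools list next to the price tool
extra_tools = [{"type": "function", "function": booking_function}]

# The model sends arguments as a JSON string; simulate that here
arguments = json.loads('{"destination_city": "Tokyo", "passenger_name": "Ada"}')
confirmation = book_flight(**arguments)
print(confirmation)
```

With more than one tool in play, `handle_tool_call` would also need to check `tool_call.function.name` to decide which Python function to run.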

Next: take this and apply it to your business. Make a multi-modal AI assistant with tools that could carry out an activity for your work. A customer support assistant? New employee onboarding assistant? So many possibilities!

If you feel bold, see if you can add audio input to our assistant so you can talk to it. 
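
To get you started on audio input, here is a rough sketch of the transcription step, assuming you use OpenAI's `whisper-1` transcription model through `client.audio.transcriptions.create`. The function name `transcribe` and the fake client are made up for this example; the fake stands in for the real `openai` client so the cell runs without an API key or a microphone.

```python
import os
import tempfile
from types import SimpleNamespace

def transcribe(audio_path, client, model="whisper-1"):
    """Send a recorded audio file to the transcription endpoint and return the text."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text

# Stand-in client (an assumption for this demo): mimics the shape of the call only
def fake_create(model, file):
    return SimpleNamespace(text=f"{len(file.read())} bytes heard by {model}")

fake_client = SimpleNamespace(
    audio=SimpleNamespace(transcriptions=SimpleNamespace(create=fake_create))
)

# Write a tiny placeholder "recording" to disk
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as recording:
    recording.write(b"RIFF")
    recording_path = recording.name

try:
    heard = transcribe(recording_path, fake_client)
finally:
    os.remove(recording_path)

print(heard)
```

In the real UI you would pass the `openai` client instead of the fake one. One way to capture the recording is a `gr.Audio(sources=["microphone"], type="filepath")` component, whose file path can be handed to `transcribe` and the resulting text added to the history the same way `do_entry` does.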