# Multimodal Prompt Engineering with OpenAI GPT-4o

GPT-4o ("o" for "omni") is designed to handle a combination of text, audio, and video inputs, and can generate outputs in text, audio, and image formats.


### Background

Before GPT-4o, users could interact with ChatGPT using Voice Mode, which operated with three separate models. GPT-4o integrates these capabilities into a single model trained across text, vision, and audio. With this unified approach, all inputs, whether text, visual, or auditory, are processed by the same neural network.


### Current API Capabilities

Currently, the API supports `{text, image}` inputs only, with `{text}` outputs, the same modalities as `gpt-4-turbo`. Additional modalities, including audio, will be introduced soon. This guide will help you get started with using GPT-4o for text, image, and video understanding.


## Getting Started

### Install OpenAI SDK for Python



In [0]:
!pip install openai==1.55.3

## Enter API Tokens

In [0]:
from getpass import getpass

OPENAI_KEY = getpass('Enter OpenAI API Key:')

In [0]:
import os

os.environ['OPENAI_API_KEY'] = OPENAI_KEY
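When the notebook is rerun in an environment where the key is already exported, the prompt can be skipped. The following is a minimal sketch of that idea; `resolve_api_key` is a hypothetical helper written for this notebook, not part of the OpenAI SDK.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the key from the environment, prompting only when it is missing.

    `resolve_api_key` is a hypothetical helper, not an SDK function.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        # Ask interactively, then store the key so later cells can read it
        key = prompt("Enter OpenAI API Key:")
        env["OPENAI_API_KEY"] = key
    return key
```

Calling `resolve_api_key()` with no arguments reads and, if needed, updates `os.environ`; the two parameters exist only so the behavior can be exercised without a real prompt.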

In [0]:
from openai import OpenAI
import os

## Set the API key and model name
MODEL="gpt-4o"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

In [0]:
completion = client.chat.completions.create(
  model=MODEL,
  messages=[
    {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"}, # <-- This is the system message that provides context to the model
    {"role": "user", "content": "Hello! Could you solve 2+2?"}  # <-- This is the user message for which the model will generate a response
  ]
)

print("Assistant: " + completion.choices[0].message.content)

In [0]:
from IPython.display import Markdown

Markdown(completion.choices[0].message.content)

## Image Processing
GPT-4o can directly process images and take intelligent actions based on the image. We can provide images in two formats:
1. Base64 Encoded
2. URL

Let's first view the image we'll use, then try sending it to the API both as a Base64 string and as a URL.

In [0]:
!curl -o triangle.png https://upload.wikimedia.org/wikipedia/commons/e/e2/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png

In [0]:
from IPython.display import Image, display, Audio, Markdown
import base64

IMAGE_PATH = "./triangle.png"

# Preview image for context
display(Image(IMAGE_PATH))

#### Base64 Image Processing

In [0]:
# Open the image file and encode it as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image(IMAGE_PATH)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

display(Markdown(response.choices[0].message.content))

To find the area of the triangle, we can use Heron's formula.

First, we need to find the semi-perimeter of the triangle.

The sides of the triangle are 6, 5, and 9.

Calculate the semi-perimeter $s$:

$$ s = \frac{a + b + c}{2} = \frac{6 + 5 + 9}{2} = 10 $$

Use Heron's formula to find the area $A$:

$$ A = \sqrt{s(s-a)(s-b)(s-c)} $$

$$ A = \sqrt{10(10-6)(10-5)(10-9)} $$

$$ A = \sqrt{10 \cdot 4 \cdot 5 \cdot 1} $$

$$ A = \sqrt{200} $$

$$ A = 10\sqrt{2} $$

So, the area of the triangle is $10\sqrt{2}$ square units.
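The arithmetic in the response above is easy to double-check locally. This short sketch recomputes Heron's formula, assuming the side lengths the model reported (6, 5, and 9) were read correctly from the image.

```python
import math

# Side lengths as reported by the model in the response above (an assumption
# about the image, not something this snippet verifies)
a, b, c = 6, 5, 9

s = (a + b + c) / 2  # semi-perimeter
area = math.sqrt(s * (s - a) * (s - b) * (s - c))  # Heron's formula

print(f"s = {s}, area = {area:.3f}")
```

The printed area should agree with $10\sqrt{2} \approx 14.142$.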

#### URL Image Processing

In [0]:
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": "https://upload.wikimedia.org/wikipedia/commons/e/e2/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png"}
            }
        ]}
    ],
    temperature=0.0,
)

display(Markdown(response.choices[0].message.content))
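The two requests above differ only in the `url` string: a `data:` URI for the Base64 case and a plain link for the URL case. The sketch below wraps that choice in one function. `image_part` is a hypothetical helper name, and guessing the MIME type from the file extension is an assumption that holds for common image formats.

```python
import base64
import mimetypes


def image_part(source):
    """Build an `image_url` content part from an http(s) URL or a local file path.

    `image_part` is a hypothetical helper; the returned dict has the same shape
    as the image parts used in the requests in this notebook.
    """
    if source.startswith(("http://", "https://")):
        url = source
    else:
        # Guess the MIME type from the extension; fall back to PNG if unknown
        mime, _ = mimetypes.guess_type(source)
        with open(source, "rb") as image_file:
            encoded = base64.b64encode(image_file.read()).decode("utf-8")
        url = f"data:{mime or 'image/png'};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url}}
```

With this in place, the user message content could be written as `[{"type": "text", "text": "What's the area of the triangle?"}, image_part(IMAGE_PATH)]`.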

In [0]:
! curl -o clinical_note.png https://i.imgur.com/AJwKUEb.png

In [0]:
IMAGE_PATH = "./clinical_note.png"

# Preview image for context
display(Image(IMAGE_PATH))

In [0]:
base64_image = encode_image(IMAGE_PATH)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": """Act as an expert in analyzing and understanding handwritten clinical notes.
                                         Detect the handwriting in the clinical note and perform tasks as per the user
                                      """},
        {"role": "user", "content": [
            {"type": "text",
             "text": """Extract all symptoms from the given clinical note image.
                        Differentiate between symptoms that are present vs. absent.
                        Give me the probability (high/ medium/ low) of how sure you are about the result.
                        Add a note on the probabilities and why you think so.

                        Output as a markdown table with the following columns,
                        all symptoms should be expanded and no acronyms unless you don't know:

                        Symptoms | Present/Denies | Probability.

                        Also expand all acronyms.
                        Output that also as a separate appendix table in Markdown.
                        Do not make up terms, if something is not detectable leave it out.
                     """},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

display(Markdown(response.choices[0].message.content))
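Because the prompt asks for a Markdown table, the response can be turned into structured rows for downstream checks. This is a rough sketch, assuming the model emits a standard pipe-delimited table with leading `|` characters; `parse_markdown_table` is a hypothetical helper and the sample text is made up for illustration, not real model output.

```python
def parse_markdown_table(markdown):
    """Parse the first pipe-delimited Markdown table in `markdown` into a list of dicts."""
    rows = []
    header = None
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if header is not None:
                break  # the first table has ended
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        elif all(set(cell) <= set("-: ") for cell in cells):
            continue  # separator row such as |---|:--:|
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Made-up sample in the format the prompt requests
sample = """Intro text

| Symptoms | Present/Denies | Probability |
|---|---|---|
| Headache | Present | High |
| Fever | Denies | Medium |

Notes follow."""

rows = parse_markdown_table(sample)
print(rows)
```

In the notebook, `parse_markdown_table(response.choices[0].message.content)` would be the equivalent call, though the output format is not guaranteed and should be validated.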

In [0]:
! curl -o hwrite.png https://i.imgur.com/XWeRd8a.png

In [0]:
IMAGE_PATH = "./hwrite.png"

# Preview image for context
display(Image(IMAGE_PATH))

In [0]:
MODEL

In [0]:
base64_image = encode_image(IMAGE_PATH)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Act as handwriting expert, detect the handwriting in the documents and perform tasks as per the user"},
        {"role": "user", "content": [
            {"type": "text",
             "text": """Convert the handwritten document into text exactly as in the image,
                        do not make up words, if something is not detectable just put [NOT_EXTRACTED]
                     """},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

Markdown(response.choices[0].message.content)
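Since the prompt asks for a fixed `[NOT_EXTRACTED]` placeholder, the transcription can be checked programmatically for gaps. The sketch below assumes the model follows that instruction literally; `extraction_report` is a hypothetical helper and the sample string is invented for illustration.

```python
PLACEHOLDER = "[NOT_EXTRACTED]"


def extraction_report(text, placeholder=PLACEHOLDER):
    """Count placeholder occurrences and list the (line number, line) pairs containing them."""
    flagged = [
        (line_no, line)
        for line_no, line in enumerate(text.splitlines(), start=1)
        if placeholder in line
    ]
    return {"missing": text.count(placeholder), "flagged_lines": flagged}


# Invented sample text, standing in for response.choices[0].message.content
sample = "Dear Sam,\nMeet me at [NOT_EXTRACTED] on Friday.\nRegards"
report = extraction_report(sample)
print(report)
```

A nonzero `missing` count is a cue to review those lines against the original image by hand.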

In [0]:
!curl -o sales.png https://i.imgur.com/jH3MNNP.png

In [0]:
IMAGE_PATH = "./sales.png"

# Preview image for context
display(Image(IMAGE_PATH))

In [0]:
base64_image = encode_image(IMAGE_PATH)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Act as a data analyst, your job is to analyze visuals and give insights"},
        {"role": "user", "content": [
            {"type": "text",
             "text": """Given this sales report visualization, summarize it briefly,
                        give detailed statistics about the top 3 best performing salesmen
                     """},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

Markdown(response.choices[0].message.content)

In [0]:
# download images using curl
!curl https://i.imgur.com/6b9jwkk.png -o image1.png
!curl https://i.imgur.com/9CWuU2q.png -o image2.png

In [0]:
display(Image('image1.png'))

In [0]:
display(Image('image2.png'))

In [0]:
base64_image1 = encode_image('./image1.png')
base64_image2 = encode_image('./image2.png')

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Act as an analyst, your job is to analyze document scans and give insights"},
        {"role": "user", "content": [
            {"type": "text",
             "text": """Given the following images which can contain graphs, tables and text,
                        analyze all of them to answer the following questions:

                        - Tell me about the top 5 years with largest Wildfires
                        - Tell me about trend of wildfires in terms of acreage burned by region and ownership
                     """},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image1}"}
            },
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image2}"}
            }
        ]}
    ],
    temperature=0.0,
)

Markdown(response.choices[0].message.content)

## Video Processing
While it's not possible to directly send a video to the API, GPT-4o can understand videos if you sample frames and then provide them as images. It performs better at this task than GPT-4 Turbo.

Since GPT-4o in the API does not yet support video directly and audio is in beta (as of November-December 2024), we'll manually convert the video into a list of image frames for GPT-4o and use Whisper to transcribe the audio. We'll showcase two use cases:
1. Summarization
2. Question and Answering



### Setup for Video Processing
We'll use two Python packages for video processing: opencv-python and moviepy.

These require [ffmpeg](https://ffmpeg.org/about.html), so make sure to install it beforehand. Depending on your OS, you may need to run `brew install ffmpeg` or `sudo apt install ffmpeg`.

In [0]:
!pip install opencv-python --quiet
!pip install moviepy --quiet

### Process the video into two components: frames and audio

In [0]:
!gdown -O 'keynote_recap.mp4' '1s6WOK3w1hJxcxE7T_WWZFioFGE81uKLb'

In [0]:
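import cv2
import time
import base64

# moviepy 2.x removed the `moviepy.editor` module; fall back to it only on 1.x
try:
    from moviepy import VideoFileClip
except ImportError:
    from moviepy.editor import VideoFileClip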

# We'll be using the OpenAI DevDay Keynote Recap video. You can review the video here: https://www.youtube.com/watch?v=h02ti0Bl6zk
VIDEO_PATH = "./keynote_recap.mp4"

In [0]:
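import os
import cv2
import base64

# moviepy 2.x removed the `moviepy.editor` module; fall back to it only on 1.x
try:
    from moviepy import VideoFileClip
except ImportError:
    from moviepy.editor import VideoFileClip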

def process_video(video_path, seconds_per_frame=2):
    # Initialize a list to store base64 encoded frames
    base64Frames = []
    # Extract the base name of the video file without extension
    base_video_path, _ = os.path.splitext(video_path)
    # Open the video file
    video = cv2.VideoCapture(video_path)

    # Get the total number of frames and the frames per second (fps) of the video
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    # Calculate the number of frames to skip between samples
    frames_to_skip = int(fps * seconds_per_frame)

    # Start from the first frame
    curr_frame = 0
    # Loop through the video to extract frames at the specified interval
    while curr_frame < total_frames - 1:
        # Set the current frame position in the video
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        # Read the frame
        success, frame = video.read()
        if not success:
            break
        # Encode the frame as a JPEG image and convert it to base64
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        # Move to the next frame based on the sampling interval
        curr_frame += frames_to_skip
    # Release the video object
    video.release()

    # Extract audio from the video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")  # Save audio with reduced bitrate
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    # Return the frames as base64 strings and the path to the audio file
    return base64Frames, audio_path

# Example usage
# Extract 1 frame for every 3 seconds from the video
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=3)
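To estimate how many frames a given interval will produce before decoding anything, the index arithmetic from `process_video` can be isolated. This is a small sketch: `sample_frame_indices` is a hypothetical helper, and the `max(1, ...)` guard against a zero step is an addition that the original loop does not have.

```python
def sample_frame_indices(total_frames, fps, seconds_per_frame=2):
    """Return the frame indices that the sampling loop in `process_video` would visit."""
    # Guard (an addition, not in the original loop): never step by zero frames,
    # which could happen if the container reports fps as 0
    step = max(1, int(fps * seconds_per_frame))
    # The original loop runs while curr_frame < total_frames - 1
    return list(range(0, max(total_frames - 1, 0), step))


indices = sample_frame_indices(total_frames=300, fps=30, seconds_per_frame=3)
print(len(indices), indices)
```

The actual count can be lower if `video.read()` fails partway through, since the loop stops at the first unreadable frame.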

In [0]:
## Display the frames and audio for context
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.5)

Audio(audio_path)

In [0]:
len(base64Frames)

In [0]:
base64Frames[55]

In [0]:
Image(data=base64.b64decode(base64Frames[55].encode("utf-8")), width=600)

### Example 1: Summarization
Now that we have both the video frames and the audio, let's run a few different tests to generate a video summary to compare the results of using the models with different modalities. We should expect to see that the summary generated with context from both visual and audio inputs will be the most accurate, as the model is able to use the entire context from the video.

1. Visual Summary
2. Audio Summary
3. Visual + Audio Summary

#### Visual Summary
The visual summary is generated by sending the model only the frames from the video. With just the frames, the model is likely to capture the visual aspects, but will miss any details discussed by the speaker.

In [0]:
response = client.chat.completions.create(
    model=MODEL,
    messages=[
    {"role": "system",
     "content": """You are generating a video summary.
                   Create a detailed summary of the provided video with key bullet points.
                   Respond in Markdown.
                """},
    {"role": "user", "content": [
        # Text parts are sent as typed objects, matching the image parts below
        {"type": "text", "text": "These are the frames from the video."},
        # Frames were encoded as JPEG, so the data URI uses the image/jpeg MIME type
        *map(lambda x: {"type": "image_url",
                        "image_url": {"url": f'data:image/jpeg;base64,{x}', "detail": "low"}}, base64Frames)
        ],
    }
    ],
    temperature=0,
)
Markdown(response.choices[0].message.content)

The results are as expected - the model is able to capture the high level aspects of the video visuals, but misses the details provided in the speech.
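The frame payload in the request above is built inline with `map` and a `lambda`. As an alternative, it can be factored into a small function, which also makes it easy to inspect before sending. `frames_to_content` is a hypothetical helper; it assumes the frames are Base64-encoded JPEGs, as produced by `process_video`.

```python
def frames_to_content(frames, intro="These are the frames from the video.", detail="low"):
    """Build a user-message content list: one text part followed by one image part per frame."""
    content = [{"type": "text", "text": intro}]
    for frame in frames:
        content.append({
            "type": "image_url",
            # Assumes each frame is a Base64-encoded JPEG (see cv2.imencode(".jpg", ...))
            "image_url": {"url": f"data:image/jpeg;base64,{frame}", "detail": detail},
        })
    return content
```

The user message then becomes `{"role": "user", "content": frames_to_content(base64Frames)}`.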

#### Audio Summary
The audio summary is generated by sending the model the audio transcript. With just the audio, the model is likely to bias towards the audio content, and will miss the context provided by the presentations and visuals.

`{audio}` input for GPT-4o is available in beta through the Realtime API, and hopefully it reaches a stable release in 2025. For now, we use the existing `whisper-1` model to process the audio.

In [0]:
# Transcribe the audio; the context manager closes the file handle afterwards
with open(audio_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
# Print the first 1,000 characters of the transcription
print("Transcript: ", transcription.text[:1000] + "\n\n")
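`process_video` saves the audio at a reduced bitrate, which keeps the upload small. A quick size check before calling the transcription endpoint can catch oversized files early. In this sketch, `fits_upload_limit` is a hypothetical helper, and the 25 MB figure is an assumption based on the documented upload limit at the time of writing; confirm it against the current API reference.

```python
import os

# Assumption: 25 MB upload limit for the transcription endpoint at the time of
# writing; verify against the current API reference before relying on it
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def fits_upload_limit(path, limit=MAX_UPLOAD_BYTES):
    """Return True when the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit
```

If `fits_upload_limit(audio_path)` returns `False`, options include lowering the bitrate further or splitting the audio into shorter segments.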

In [0]:
response = client.chat.completions.create(
    model=MODEL,
    messages=[
    {"role": "system",
     "content":"""You are generating a transcript summary.
                  Create a detailed summary of the provided transcription with key bullet points.
                  Respond in Markdown.
               """},
    {"role": "user", "content": [
        {"type": "text", "text": f"The audio transcription is: {transcription.text}"}
        ],
    }
    ],
    temperature=0,
)
Markdown(response.choices[0].message.content)

The audio summary is biased towards the content discussed during the speech, but comes out with much less structure than the video summary.

#### Audio + Visual Summary
The Audio + Visual summary is generated by sending the model both the visual and the audio from the video at once. When sending both of these, the model is expected to better summarize since it can perceive the entire video at once.

In [0]:
## Generate a summary with visual and audio
response = client.chat.completions.create(
    model=MODEL,
    messages=[
    {"role": "system",
     "content":"""You are generating a video summary.
                  Create a detailed summary of the provided video and its transcript with key bullet points.
                  Respond in Markdown
               """},
    {"role": "user", "content": [
        # Text parts are sent as typed objects, matching the image parts below
        {"type": "text", "text": "These are the frames from the video."},
        # Frames were encoded as JPEG, so the data URI uses the image/jpeg MIME type
        *map(lambda x: {"type": "image_url",
                        "image_url": {"url": f'data:image/jpeg;base64,{x}', "detail": "low"}}, base64Frames),
        {"type": "text", "text": f"The audio transcription is: {transcription.text}"}
        ],
    }
],
    temperature=0,
)
Markdown(response.choices[0].message.content)

After combining both the video and audio, we're able to get a much more detailed and comprehensive summary for the event which uses information from both the visual and audio elements from the video.

Of the three answers, the most accurate is the one generated from both the audio and the visuals. Sam Altman did not discuss raising the windows or turning the radio on during the keynote, but he referenced an improved capability for the model to execute multiple functions in a single request while the examples were shown behind him.

## Conclusion
Integrating multiple input modalities, such as audio, visual, and textual, significantly enhances the model's performance across a diverse range of tasks. This multimodal approach allows for more comprehensive understanding and interaction, more closely mirroring how humans perceive and process information.

Currently, GPT-4o in the API supports text and image inputs, with audio capabilities in beta (late 2024).