[*Using Multimodal inputs with GPT4o for Image Recognition on SAP AI Core*](https://developers.sap.com/tutorials/ai-core-gpt4o-consumption.html)

## Using Multimodal inputs with GPT4o for Image Recognition on SAP AI Core
Multimodality refers to the ability of a model to process and interpret different types of input, such as text, images, audio, or video. In the context of GPT-4o on SAP AI Core, multimodal input allows the model to understand and respond to prompts that combine text and visual data. This improves the model's ability to perform tasks such as scene detection, object recognition, and image analysis by combining language processing with image recognition. In this tutorial, we demonstrate these capabilities with GPT-4o using sample inputs and outputs that you can adapt to your own use cases.

In [32]:
from config import init_env
from config import variables
import importlib
variables = importlib.reload(variables)

# TODO: Specify which model you want to use. In this case we are directing our prompt
# to the OpenAI API directly, so you need to pick one of the GPT models. Make sure the model is actually deployed
# in genAI Hub. Choose a model that can also process images, since the steps below send image inputs,
# e.g. 'gpt-4.1-mini'.
MODEL_NAME = 'gpt-4o'

# Do not modify the `assert` line below
assert MODEL_NAME!='', """You should change the variable `MODEL_NAME` with the name of your deployed model (like 'gpt-4o-mini') first!"""

init_env.set_environment_variables()
# Do not modify the `assert` line below 
assert variables.RESOURCE_GROUP!='', """You should change the value assigned to the `RESOURCE_GROUP` in the `variables.py` file to your own resource group first!"""

print(f"Resource group is set to: {variables.RESOURCE_GROUP}")

Resource group is set to: default


In [33]:
from gen_ai_hub.proxy.native.openai import chat


### Scene Detection
In this step, we demonstrate how to use GPT-4o to describe a scene depicted in an image. Given both text and an image URL as input, the model generates a descriptive response that captures the key elements of the scene. This capability is particularly useful for applications such as automated content tagging, visual storytelling, and multimedia platforms.

Follow the steps below to replicate scene detection using GPT-4o.

To utilize the GPT-4o model, which supports both text and image inputs, use the code below. This example demonstrates how to create a prompt with an image URL and a text query, enabling the model to process and provide a response based on both visual and textual information.

Note: You can replace the image URL with any image of your choice and modify the text prompt to ask the model any question about that image based on your specific needs.

In [34]:
import requests
import base64
def encode_image_from_url(image_url):
    """Download and encode image to base64 format from URL."""
    response = requests.get(image_url)
    if response.status_code == 200:
        return base64.b64encode(response.content).decode("utf-8")
    else:
        raise Exception(f"Failed to download image. Status code: {response.status_code}")

In [35]:
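def create_image_prompt(image_url, text_prompt):
    """Create a prompt message that pairs the text prompt with the image URL."""
    # Download the image to confirm the URL is reachable. The base64 result is
    # not used below because the message passes the direct image URL.
    image_base64 = encode_image_from_url(image_url)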
    
    # Create messages including both text and image input
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": text_prompt
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url  # Use the direct image URL
                    }
                }
            ]
        }
    ]
    return messages
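
The helper above encodes the image but then sends the direct URL. As a minimal sketch of the alternative, the base64 string can be embedded in the message as a data URL, which the OpenAI chat format accepts in the `image_url` field. The function name `build_data_url_message` and the `mime_type` parameter are hypothetical, and the bytes used in the example are a placeholder rather than a real image.

```python
import base64


def build_data_url_message(image_bytes, text_prompt, mime_type="image/jpeg"):
    """Build a chat message that embeds the image as a base64 data URL.

    Hypothetical helper: it takes raw image bytes instead of a URL, so it
    also works for images that are not publicly reachable.
    """
    image_base64 = base64.b64encode(image_bytes).decode("utf-8")
    data_url = f"data:{mime_type};base64,{image_base64}"
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": text_prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ]


# Placeholder bytes stand in for real image content in this sketch.
example_messages = build_data_url_message(
    b"fake-image-bytes", "Describe the image in one line."
)
example_url = example_messages[0]["content"][1]["image_url"]["url"]
print(example_url.split(",", 1)[0])  # prints only the data URL prefix
```

This variant may be useful when the image sits behind authentication or on a local disk, at the cost of a larger request payload.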

In [36]:
def get_response_from_model(model_name, messages):
    """Send messages to the model and return the response."""
    kwargs = dict(model_name=model_name, messages=messages)
    response = chat.completions.create(**kwargs)
    return response.to_dict()["choices"][0]["message"]["content"]
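
The function above indexes straight into `choices[0]["message"]["content"]`, which raises a bare `IndexError` or `KeyError` if the response has an unexpected shape. The following is a minimal sketch of a more defensive extraction step. The helper name `extract_content` is hypothetical, and the dictionaries in the example are hand-written stand-ins that assume the same shape the function above relies on.

```python
def extract_content(response_dict):
    """Return the text of the first choice, or raise a descriptive error.

    Hypothetical helper: assumes the dictionary has the same
    choices -> message -> content layout used in get_response_from_model.
    """
    choices = response_dict.get("choices") or []
    if not choices:
        raise ValueError("Response contains no choices")
    content = choices[0].get("message", {}).get("content")
    if content is None:
        raise ValueError("First choice has no text content")
    return content


# Hand-written stand-in for a response dictionary, not real model output.
sample_response = {"choices": [{"message": {"role": "assistant", "content": "A desk."}}]}
print(extract_content(sample_response))
```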

With these helpers, you can send image-based inputs to the GPT-4o model and get responses based on both the visual and the text content. The image below is the input used in this example.

<img src="https://raw.githubusercontent.com/SAP-samples/ai-core-samples/main/09_BusinessAIWeek/images/sceneDetection.jpg" width="30%"> 

In [37]:
# Example usage
image_url = "https://raw.githubusercontent.com/SAP-samples/ai-core-samples/main/09_BusinessAIWeek/images/sceneDetection.jpg"
text_prompt = "Describe the image in one line."  # Prompt asking for the description
model_name = "gpt-4o"  # Replace with the model that supports image input

# Create prompt with image and text
messages = create_image_prompt(image_url, text_prompt)

# Get response from model
response = get_response_from_model(model_name, messages)
print(response)

A blue bottle, black headphones, and a power outlet with an on switch are placed on a white table against a maroon and beige partition.


### Object Detection
This step focuses on identifying and labeling objects within an image. The multimodal input allows GPT-4o to analyze the visual data and report the objects it detects in the scene. Object detection is important for tasks such as inventory management, autonomous driving, and augmented reality applications.

Follow the steps below to replicate object detection using GPT-4o.

To utilize the GPT-4o model, which supports both text and image inputs, use the code below. This example demonstrates how to create a prompt with an image URL and a text query, enabling the model to process and provide a response based on both visual and textual information.

Note: You can replace the image URL with any image of your choice and modify the text prompt to ask the model any question about that image based on your specific needs.

In [38]:
import requests
import base64
def encode_image_from_url(image_url):
    """Download and encode image to base64 format from URL."""
    response = requests.get(image_url)
    if response.status_code == 200:
        return base64.b64encode(response.content).decode("utf-8")
    else:
        raise Exception(f"Failed to download image. Status code: {response.status_code}")

In [39]:
def create_image_prompt(image_url, text_prompt):
    """Create a prompt message that pairs the text prompt with the image URL."""
    # Download the image to confirm the URL is reachable. The base64 result is
    # not used below because the message passes the direct image URL.
    image_base64 = encode_image_from_url(image_url)

    # Create messages including both text and image input
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": text_prompt
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url  # Use the direct image URL
                    }
                }
            ]
        }
    ]
    return messages

In [40]:
def get_response_from_model(model_name, messages):
    """Send messages to the model and return the response."""
    kwargs = dict(model_name=model_name, messages=messages)
    response = chat.completions.create(**kwargs)
    return response.to_dict()["choices"][0]["message"]["content"]

With these helpers, you can send image-based inputs to the GPT-4o model and get responses based on both the visual and the text content. The image below is the input used in this example.

<img src="https://raw.githubusercontent.com/SAP-samples/ai-core-samples/main/09_BusinessAIWeek/images/objectDetection.jpg" width="30%"> 

In [46]:
# Example usage
image_url = "https://raw.githubusercontent.com/SAP-samples/ai-core-samples/main/09_BusinessAIWeek/images/objectDetection.jpg"
text_prompt = "give me the bottle color and its count."  # Prompt asking for an object's color and count
model_name = "gpt-4o"  # Replace with the model that supports image input

# Create prompt with image and text
messages = create_image_prompt(image_url, text_prompt)

# Get response from model
response = get_response_from_model(model_name, messages)
print(response)

The bottle color is blue, and there is one bottle in the image.
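
For downstream use such as inventory counts, a free-text answer like the one above is awkward to process. One possible approach, sketched here under assumptions, is to ask the model for JSON and parse the reply. The prompt wording, the helper name `parse_object_list`, and the sample reply are all hypothetical. The sample reply is hand-written for illustration and is not actual model output, and a real model may not always return valid JSON.

```python
import json

# Hypothetical prompt that asks for a machine-readable answer.
JSON_PROMPT = (
    "List the objects in the image as a JSON array of objects "
    'with the keys "name", "color", and "count". Return only the JSON.'
)

FENCE = "`" * 3  # Markdown code fence that models sometimes wrap JSON in


def parse_object_list(reply):
    """Parse a JSON array from a model reply, tolerating a Markdown code fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag)
        # and everything from the closing fence onward.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


# Hand-written sample reply for illustration only.
sample_reply = FENCE + 'json\n[{"name": "bottle", "color": "blue", "count": 1}]\n' + FENCE
objects = parse_object_list(sample_reply)
print(objects[0]["name"], objects[0]["count"])
```

To try this against a deployed model, `JSON_PROMPT` can be passed as the `text_prompt` argument in the example usage above.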


### Graph Analysis
Here, the tutorial demonstrates how GPT-4o can be used to interpret and analyze data presented in graphical form. By combining text and image input, the model can extract meaningful insights from charts, graphs, and other visual data representations. This step is valuable for data analysis, reporting, and decision-making processes.

Follow the steps below to replicate graph analysis using GPT-4o.

To utilize the GPT-4o model, which supports both text and image inputs, use the code below. This example demonstrates how to create a prompt with an image URL and a text query, enabling the model to process and provide a response based on both visual and textual information.

Note: You can replace the image URL with any image of your choice and modify the text prompt to ask the model any question about that image based on your specific needs.

In [60]:
import requests
import base64

def encode_image_from_url(image_url):
    """Download and encode image to base64 format from URL."""
    response = requests.get(image_url)
    if response.status_code == 200:
        return base64.b64encode(response.content).decode("utf-8")
    else:
        raise Exception(f"Failed to download image. Status code: {response.status_code}")

In [61]:
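def create_image_prompt(image_url, text_prompt):
    """Create a prompt message that pairs the text prompt with the image URL."""
    # Download the image to confirm the URL is reachable. The base64 result is
    # not used below because the message passes the direct image URL.
    image_base64 = encode_image_from_url(image_url)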
    
    # Create messages including both text and image input
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": text_prompt
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url  # Use the direct image URL
                    }
                }
            ]
        }
    ]
    return messages

In [62]:
def get_response_from_model(model_name, messages):
    """Send messages to the model and return the response."""
    kwargs = dict(model_name=model_name, messages=messages)
    response = chat.completions.create(**kwargs)
    return response.to_dict()["choices"][0]["message"]["content"]

With these helpers, you can send image-based inputs to the GPT-4o model and get responses based on both the visual and the text content. The image below is the input used in this example.

<img src="https://raw.githubusercontent.com/SAP-samples/ai-core-samples/main/09_BusinessAIWeek/images/graph.jpg" width="30%"> 

In [63]:
# Example usage
image_url = "https://raw.githubusercontent.com/SAP-samples/ai-core-samples/main/09_BusinessAIWeek/images/graph.jpg"
text_prompt = "what is this graph about"  # Prompt asking what the graph shows
model_name = "gpt-4o"  # Replace with the model that supports image input

# Create prompt with image and text
messages = create_image_prompt(image_url, text_prompt)
# Get response from model
response = get_response_from_model(model_name, messages)
print(response)

The graph depicts the performance of the Dow Jones Industrial Average (DJIA) from around 2011 to 2023. The DJIA is a stock market index that represents 30 large, publicly-owned companies based in the United States. The graph shows the ups and downs of the index over this period, indicating trends, growth, and fluctuations in the market. The sharp drop around 2020 corresponds to the market impact of the COVID-19 pandemic, followed by a recovery and continued growth.
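
Charts often contain small axis labels and thin lines. The OpenAI chat format documents an optional `detail` field on `image_url` (`"low"`, `"high"`, or `"auto"`) that requests a lower- or higher-resolution view of the image. The sketch below adds that field to the message. The function name `create_image_prompt_with_detail` is hypothetical, and whether the deployed model behind SAP AI Core honors the setting is an assumption that should be checked against your deployment.

```python
ALLOWED_DETAIL = {"low", "high", "auto"}


def create_image_prompt_with_detail(image_url, text_prompt, detail="high"):
    """Build a text-plus-image message that also sets the image detail level.

    Hypothetical variant of create_image_prompt. It does not download the
    image; it only assembles the message structure.
    """
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"detail must be one of {sorted(ALLOWED_DETAIL)}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": text_prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": image_url, "detail": detail},
                },
            ],
        }
    ]


graph_messages = create_image_prompt_with_detail(
    "https://raw.githubusercontent.com/SAP-samples/ai-core-samples/main/09_BusinessAIWeek/images/graph.jpg",
    "what is this graph about",
)
print(graph_messages[0]["content"][1]["image_url"]["detail"])
```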


### Math
In this step, we explore how GPT-4o handles mathematical problems that involve both textual descriptions and visual data. The model can solve equations, interpret mathematical expressions in images, and provide detailed explanations of its reasoning. This capability is useful in educational tools, scientific research, and engineering applications.

Follow the steps below to replicate mathematical operations using GPT-4o.

To utilize the GPT-4o model, which supports both text and image inputs, use the code below. This example demonstrates how to create a prompt with an image URL and a text query, enabling the model to process and provide a response based on both visual and textual information.

Note: You can replace the image URL with any image of your choice and modify the text prompt to ask the model any question about that image based on your specific needs.

In [52]:
import requests
import base64

def encode_image_from_url(image_url):
    """Download and encode image to base64 format from URL."""
    response = requests.get(image_url)
    if response.status_code == 200:
        return base64.b64encode(response.content).decode("utf-8")
    else:
        raise Exception(f"Failed to download image. Status code: {response.status_code}")

In [53]:
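def create_image_prompt(image_url, text_prompt):
    """Create a prompt message that pairs the text prompt with the image URL."""
    # Download the image to confirm the URL is reachable. The base64 result is
    # not used below because the message passes the direct image URL.
    image_base64 = encode_image_from_url(image_url)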
    
    # Create messages including both text and image input
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": text_prompt
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url  # Use the direct image URL
                    }
                }
            ]
        }
    ]
    return messages

In [54]:
def get_response_from_model(model_name, messages):
    """Send messages to the model and return the response."""
    kwargs = dict(model_name=model_name, messages=messages)
    response = chat.completions.create(**kwargs)
    return response.to_dict()["choices"][0]["message"]["content"]

With these helpers, you can send image-based inputs to the GPT-4o model and get responses based on both the visual and the text content. The image below is the input used in this example.

<img src="https://raw.githubusercontent.com/SAP-samples/ai-core-samples/main/09_BusinessAIWeek/images/math.jpg" width="20%"> 

In [55]:
# Example usage
image_url = "https://raw.githubusercontent.com/SAP-samples/ai-core-samples/main/09_BusinessAIWeek/images/math.jpg"
text_prompt = "find x"  # Prompt asking the model to solve the equation in the image
model_name = "gpt-4o"  # Replace with the model that supports image input

# Create prompt with image and text
messages = create_image_prompt(image_url, text_prompt)

# Get response from model
response = get_response_from_model(model_name, messages)
print(response)

To solve the equation \((2x - 10)/2 = 3(x - 1)\), follow these steps:

1. Multiply both sides by 2 to eliminate the fraction:
   \[
   2x - 10 = 6(x - 1)
   \]

2. Distribute the 6 on the right-hand side:
   \[
   2x - 10 = 6x - 6
   \]

3. Rearrange the equation to bring all terms involving \(x\) on one side and constant terms on the other side. Subtract \(2x\) from both sides:
   \[
   -10 = 4x - 6
   \]

4. Add 6 to both sides to isolate the term with \(x\):
   \[
   -4 = 4x
   \]

5. Divide both sides by 4 to solve for \(x\):
   \[
   x = -1
   \]

Thus, the solution is \(x = -1\).
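
Because model-generated math can contain mistakes, it is worth checking the result independently. The sketch below substitutes a candidate value into the equation that the model read from the image, (2x - 10)/2 = 3(x - 1), using exact rational arithmetic. The helper name `check_solution` is hypothetical, and the equation is taken from the model's reply above rather than from the image itself.

```python
from fractions import Fraction


def check_solution(x):
    """Return True if x satisfies (2x - 10) / 2 = 3(x - 1)."""
    x = Fraction(x)  # exact arithmetic avoids floating-point rounding
    left_side = (2 * x - 10) / 2
    right_side = 3 * (x - 1)
    return left_side == right_side


# Check the value reported by the model.
print(check_solution(-1))
```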


### Image to Text
The final step focuses on converting visual information into text. By providing an image as input, GPT-4o generates a textual description or transcription of the content. This step is particularly beneficial for accessibility tools, content creation, and archiving visual data.

Follow the steps below to replicate Optical Character Recognition (OCR) using GPT-4o.

To utilize the GPT-4o model, which supports both text and image inputs, use the code below. This example demonstrates how to create a prompt with an image URL and a text query, enabling the model to process and provide a response based on both visual and textual information.

Note: You can replace the image URL with any image of your choice and modify the text prompt to ask the model any question about that image based on your specific needs.

In [None]:
import requests
import base64
def encode_image_from_url(image_url):
    """Download and encode image to base64 format from URL."""
    response = requests.get(image_url)
    if response.status_code == 200:
        return base64.b64encode(response.content).decode("utf-8")
    else:
        raise Exception(f"Failed to download image. Status code: {response.status_code}")

In [57]:
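def create_image_prompt(image_url, text_prompt):
    """Create a prompt message that pairs the text prompt with the image URL."""
    # Download the image to confirm the URL is reachable. The base64 result is
    # not used below because the message passes the direct image URL.
    image_base64 = encode_image_from_url(image_url)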
    
    # Create messages including both text and image input
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": text_prompt
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url  # Use the direct image URL
                    }
                }
            ]
        }
    ]
    return messages

In [58]:
def get_response_from_model(model_name, messages):
    """Send messages to the model and return the response."""
    kwargs = dict(model_name=model_name, messages=messages)
    response = chat.completions.create(**kwargs)
    return response.to_dict()["choices"][0]["message"]["content"]

With these helpers, you can send image-based inputs to the GPT-4o model and get responses based on both the visual and the text content. The image below is the input used in this example.

<img src="https://raw.githubusercontent.com/SAP-samples/ai-core-samples/main/09_BusinessAIWeek/images/handwrittenText.png" width="45%"> 

In [59]:
# Example usage
image_url = "https://raw.githubusercontent.com/SAP-samples/ai-core-samples/main/09_BusinessAIWeek/images/handwrittenText.png"
text_prompt = "extract text"  # Prompt asking the model to transcribe the text in the image
model_name = "gpt-4o"  # Replace with the model that supports image input

# Create prompt with image and text
messages = create_image_prompt(image_url, text_prompt)

# Get response from model
response = get_response_from_model(model_name, messages)
print(response)

Dear User,

Handwrytten uses robotic handwriting
machines that use an actual pen to
write your message. The results are
virtually indistinguishable from actual
handwriting.
Try it today!

The Robot
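
The extracted text keeps the line breaks of the handwritten note. For archiving or search, it may be more convenient to join the hard-wrapped lines into paragraphs. The following is a minimal sketch of such a post-processing step. The helper name `unwrap_lines` is hypothetical, and it assumes that blank lines separate paragraphs, as they do in the output above.

```python
def unwrap_lines(text):
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs = []
    for block in text.split("\n\n"):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return "\n\n".join(paragraphs)


# The text extracted in the example above.
extracted = (
    "Dear User,\n\n"
    "Handwrytten uses robotic handwriting\n"
    "machines that use an actual pen to\n"
    "write your message. The results are\n"
    "virtually indistinguishable from actual\n"
    "handwriting.\n"
    "Try it today!\n\n"
    "The Robot"
)
print(unwrap_lines(extracted))
```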
