# Multimodal Inputs with GPT-4o

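This notebook shows how to attach images, audio, and documents to requests made with the OpenAI Python SDK. The examples use `gpt-4.1` and `gpt-4.1-mini` through the Responses API for images and PDFs, and `whisper-1` for audio.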

In [3]:
!pip install openai --upgrade

Collecting openai
  Downloading openai-1.95.1-py3-none-any.whl.metadata (29 kB)
Downloading openai-1.95.1-py3-none-any.whl (755 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.94.0
    Uninstalling openai-1.94.0:
      Successfully uninstalled openai-1.94.0
Successfully installed openai-1.95.1


In [7]:
import base64

from google.colab import userdata
from openai import OpenAI

# Read the API key from Colab's secret store instead of hard-coding it.
openai_api_key = userdata.get('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)

## 1. Sending an Image
Encode a local image as a Base64 data URL, or pass a public image URL, and send it to the model as an `input_image` content part.

In [8]:
# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "/content/cat.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)


response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                { "type": "input_text", "text": "what's in this image?" },
                {
                    "type": "input_image",
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
                },
            ],
        }
    ],
)

print(response.output_text)

This image shows a cute, fluffy kitten with tabby markings. The kitten has blue eyes and is lying on a soft, white surface, looking directly at the camera. The background is blurred, drawing focus to the kitten.
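
The cell above hard-codes `image/jpeg` in the data URL, which only matches JPEG files. A minimal sketch of a helper that infers the MIME type from the file extension instead is shown here; the name `to_data_url` is hypothetical and not part of the OpenAI SDK, and the check relies on the extension, not the file contents.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_path: str) -> str:
    """Return a Base64 data URL whose MIME type is guessed from the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {image_path!r}")
    # Read the raw bytes and encode them the same way encode_image does.
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the `image_url` value in the request, in place of the f-string with the fixed `image/jpeg` prefix.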


In [10]:
response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {
                "type": "input_image",
                "image_url": "https://www.hartz.com/wp-content/uploads/2022/04/small-dog-owners-1.jpg",
            },
        ],
    }],
)

print(response.output_text)

This image shows a close-up of a small dog, likely a Yorkshire Terrier. The dog has light brown and tan fur with a slightly darker brown beard around its mouth. The background is blurred greenery, suggesting the photo was taken outdoors. The dog is wearing a collar.
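
The two image cells differ only in how `image_url` is filled: a Base64 data URL for a local file, or a plain HTTP(S) URL. As a rough sketch, one helper can cover both cases. The names `image_part` and `image_question` are hypothetical, and the default `mime_type` is an assumption that the local file is a JPEG.

```python
import base64
from pathlib import Path
from urllib.parse import urlparse


def image_part(source: str, mime_type: str = "image/jpeg") -> dict:
    """Build an input_image content part from a URL or a local file path."""
    if urlparse(source).scheme in ("http", "https"):
        # Remote images are passed through by URL.
        return {"type": "input_image", "image_url": source}
    # Local files are embedded as a Base64 data URL.
    encoded = base64.b64encode(Path(source).read_bytes()).decode("utf-8")
    return {"type": "input_image", "image_url": f"data:{mime_type};base64,{encoded}"}


def image_question(text: str, source: str) -> list:
    """Build the `input` list for a single user turn with text and one image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": text},
                image_part(source),
            ],
        }
    ]
```

With this in place, a call would look like `client.responses.create(model="gpt-4.1-mini", input=image_question("what's in this image?", "/content/cat.jpg"))`.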


## 2. Sending an Audio Clip
Upload an audio file and translate its speech into English text with the `whisper-1` model through the audio translations endpoint.

In [12]:
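# Open the file in a context manager so it is closed after the upload.
with open("/content/beach-german.mp3", "rb") as audio_file:
    # The translations endpoint transcribes the speech and returns English text.
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

print(translation.text)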

The beach, the swimsuit, the swimming trunks, the sandals, the air mattress, the towel, the ice cream, the ball, the sun, the sea, the waves,
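
Audio uploads can fail if the file is too large or in an unsupported format, so it can help to check the file locally first. The sketch below assumes a 25 MB upload limit and the format list from OpenAI's speech-to-text guide at the time of writing; both constants and the name `check_audio_file` are assumptions, so verify them against the current documentation.

```python
from pathlib import Path

# Assumed limits; confirm against the current OpenAI speech-to-text documentation.
MAX_AUDIO_BYTES = 25 * 1024 * 1024
SUPPORTED_AUDIO_SUFFIXES = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def check_audio_file(path: str) -> Path:
    """Return the path if the file looks uploadable, otherwise raise ValueError."""
    audio_path = Path(path)
    if audio_path.suffix.lower() not in SUPPORTED_AUDIO_SUFFIXES:
        raise ValueError(f"Unsupported audio format: {audio_path.suffix!r}")
    size = audio_path.stat().st_size
    if size > MAX_AUDIO_BYTES:
        raise ValueError(f"File is {size} bytes, above the assumed {MAX_AUDIO_BYTES}-byte limit")
    return audio_path
```

Calling `check_audio_file("/content/beach-german.mp3")` before opening the file surfaces these problems without spending a request.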


## 3. Sending a Document (PDF)
Pass a PDF by URL as an `input_file` content part and ask the model to summarize it.

In [14]:
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_file",
                    "file_url": "https://arxiv.org/pdf/1706.03762",
                },
                {
                    "type": "input_text",
                    "text": "Analyze the paper and provide a summary of the key points.",
                },
            ],
        },
    ]
)

print(response.output_text)

Certainly! The text you provided is an excerpt from the influential paper **"Attention Is All You Need"** by Vaswani et al., which introduced the **Transformer** architecture.

Below is a summary of the **key points**:

---

### 1. **Permission and Attribution**
- Google grants permission to reproduce tables and figures from the paper in journalistic or scholarly works, given proper attribution.

---

### 2. **Abstract and Introduction**
- Traditional sequence transduction models use complex recurrent neural networks (RNNs) or convolutional neural networks (CNNs) with attention mechanisms.
- The authors propose the **Transformer**, a new architecture that relies solely on *attention mechanisms*, eliminating recurrence and convolutions entirely.
- On machine translation tasks, Transformers outperform previous models in both quality and training efficiency.
- The Transformer achieved state-of-the-art results in multiple benchmarks: WMT 2014 English-German (28.4 BLEU) and English-French (
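
The request above references the PDF by URL. For a PDF stored locally, a sketch of the equivalent content part is shown below. It assumes the Responses API accepts an `input_file` part with `filename` and a Base64 `file_data` data URL, as described in OpenAI's file input guide; the helper name `pdf_part` is hypothetical.

```python
import base64
from pathlib import Path


def pdf_part(pdf_path: str) -> dict:
    """Build an input_file content part that embeds a local PDF as Base64."""
    path = Path(pdf_path)
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return {
        "type": "input_file",
        "filename": path.name,
        # Assumed format: a data URL with the application/pdf MIME type.
        "file_data": f"data:application/pdf;base64,{encoded}",
    }
```

The returned dictionary would replace the `file_url` entry in the `content` list, with the `input_text` part left unchanged.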