# OpenAI Quickstart Guide

From the official documentation available at: https://platform.openai.com/docs/overview

You can use different models depending on your needs. See the full list at: https://platform.openai.com/docs/models

Note: check the pricing before using a model! --> https://platform.openai.com/docs/pricing

In [None]:
# Import required libraries
import os
from pprint import pprint

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables from .env file
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

# Check if the API key is set
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable is not set.")

# Initialize the OpenAI client with the key loaded above
client = OpenAI(api_key=api_key)

In [None]:
# Make a request to the model
response = client.responses.create(
    model="gpt-4.1",
    input="Write a one-sentence bedtime story about a unicorn."
)

# Print the model's response
print("Response = ")
pprint(dict(response))
print(f"\nresponse.output_text = \n{response.output_text}")

You can use the `instructions` parameter to give the model high-level guidance that serves as overall context for your prompts.

In [None]:
# Example of using the 'instructions' parameter in the model request
response_with_instructions = client.responses.create(
    model="gpt-4.1",
    input="Write a poem about a butterfly.",
    instructions="Answer in rhyme and with a cheerful style.",
)

# With some models you can also specify the reasoning effort parameter, e.g.
# reasoning={"effort": "low"}
  
print("Response with instructions = ")
pprint(dict(response_with_instructions))
print(f"\nresponse_with_instructions.output_text = \n{response_with_instructions.output_text}")
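
The optional parameters above can be assembled conditionally before the call. The following is a minimal sketch: `build_request` is a hypothetical helper (not part of the OpenAI SDK) that returns keyword arguments for `client.responses.create`, and it assumes the chosen model accepts `reasoning` whenever you pass an effort value.

```python
def build_request(model, prompt, instructions=None, effort=None):
    """Return keyword arguments for client.responses.create (hypothetical helper)."""
    kwargs = {"model": model, "input": prompt}
    if instructions is not None:
        kwargs["instructions"] = instructions
    if effort is not None:
        # Only some models accept the reasoning parameter
        kwargs["reasoning"] = {"effort": effort}
    return kwargs


request = build_request(
    "gpt-4.1",
    "Write a poem about a butterfly.",
    instructions="Answer in rhyme and with a cheerful style.",
)
# With a configured client: response = client.responses.create(**request)
```

Keeping the arguments in a plain dictionary makes it easy to log or inspect a request before sending it.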

## Structured Output

Structured Output allows you to receive responses from the model in a predefined format, such as JSON or other structured data types. This is useful when you need the model's output to be machine-readable for further processing, integration, or automation. By specifying the desired structure, you can ensure consistency and make it easier to extract specific information from the model's response.

In [None]:
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Step(BaseModel):
    explanation: str
    output: str

class MathReasoning(BaseModel):
    steps: list[Step]
    final_answer: str

response = client.responses.parse(
    model="gpt-5-mini",
    input=[
        {
            "role": "system",
            "content": "You are a helpful math tutor. Guide the user through the solution step by step.",
        },
        {"role": "user", "content": "how can I solve 8x + 7 = -23"},
    ],
    text_format=MathReasoning,
)

math_reasoning = dict(response.output_parsed)
print("math_reasoning = \n")
pprint(math_reasoning)

You can also use Structured Outputs to extract information into a specific format, e.g.:

In [None]:
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

response = client.responses.parse(
    model="gpt-4.1-mini",
    input=[
        {"role": "system", "content": "Extract the event information."},
        {
            "role": "user",
            "content": "Alice and Bob are going to a science fair on Friday.",
        },
    ],
    text_format=CalendarEvent,
)

print("response = \n")
pprint(response.output_parsed)

When using Structured Outputs, also check for refusals and decide how to handle them (a refusal occurs when the model declines to answer).

Check the official documentation here --> https://platform.openai.com/docs/guides/structured-outputs#refusals
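
A refusal check can be sketched as follows. `find_refusal` is a hypothetical helper. It assumes that a refusal appears in `response.output` as a message content part with `type == "refusal"` and a `refusal` text field, as described in the linked guide. The `SimpleNamespace` object only stands in for a real response so that the sketch runs without an API call.

```python
from types import SimpleNamespace


def find_refusal(response):
    """Return the refusal text if the model refused, otherwise None."""
    for item in response.output:
        if getattr(item, "type", None) != "message":
            continue
        for part in item.content:
            if getattr(part, "type", None) == "refusal":
                return part.refusal
    return None


# Stand-in for a real response object (assumed shape, for illustration only)
fake_response = SimpleNamespace(
    output=[
        SimpleNamespace(
            type="message",
            content=[SimpleNamespace(type="refusal", refusal="I can't help with that.")],
        )
    ]
)

refusal = find_refusal(fake_response)
if refusal:
    print(f"Model refused: {refusal}")
```

With a real call, you would check `find_refusal(response)` before reading `response.output_parsed`.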

### Difference between `system`, `user`, and other roles in prompt content

**System role:** The `system` message sets the behavior, context, or instructions for the model. It defines how the model should respond and can guide its tone, style, or constraints. For example, you can instruct the model to act as a math tutor or to answer in a specific format.

**User role:** The `user` message represents the actual input or question from the end user. This is the prompt or query you want the model to answer.

**Other roles (e.g., `assistant`):** The `assistant` role marks previous model responses when you pass the conversation history back to the model. This helps maintain context in multi-turn conversations.

In summary, `system` sets instructions/context, `user` provides the query, and other roles help structure multi-turn or complex interactions.
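
To illustrate how the roles combine, here is a minimal sketch of a multi-turn conversation history. `add_turn` is a hypothetical helper, the set of allowed roles is an assumption limited to the roles discussed above, and the example assumes the resulting list is passed as `input` in the same way as the message lists in the earlier cells.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # roles discussed above


def add_turn(history, role, content):
    """Append one message to the conversation history and return the history."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"Unsupported role: {role!r}")
    history.append({"role": role, "content": content})
    return history


history = []
add_turn(history, "system", "You are a helpful math tutor.")
add_turn(history, "user", "How can I solve 8x + 7 = -23?")
# After a real call, store the model's reply under the assistant role
add_turn(history, "assistant", "Subtract 7 from both sides, then divide by 8: x = -3.75.")
add_turn(history, "user", "Can you check the result by substitution?")

# With a configured client:
# response = client.responses.create(model="gpt-4.1", input=history)
```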

# Image generation

To generate images with the OpenAI API, use the `image_generation` tool in your request. Specify your prompt in the `input` field and set a model that supports the tool (e.g., `"gpt-4o"` or `"gpt-5"`, both used in the examples below). The response contains a base64-encoded image, which you can decode and save as a file.

In [None]:
from openai import OpenAI
import base64

client = OpenAI() 

response = client.responses.create(
    model="gpt-4o",
    input="Generate an image of gray tabby cat hugging an otter with an orange scarf",
    tools=[{"type": "image_generation"}],
)

# Save the image to a file
image_data = [
    output.result
    for output in response.output
    if output.type == "image_generation_call"
]
    
if image_data:
    image_base64 = image_data[0]
    with open("otter.png", "wb") as f:
        f.write(base64.b64decode(image_base64))
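
The decode-and-save step recurs in every image example, so it can be factored into a helper. This is a sketch: `save_generated_images` is a hypothetical function, and the `SimpleNamespace` response is a stand-in (with dummy bytes rather than a real image) so that the code runs without an API call.

```python
import base64
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_generated_images(response, directory, stem="image"):
    """Decode every image_generation_call result and write it to a .png file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for output in response.output:
        # Skip non-image outputs and image calls that carry no data
        if output.type != "image_generation_call" or not output.result:
            continue
        path = directory / f"{stem}_{len(paths)}.png"
        path.write_bytes(base64.b64decode(output.result))
        paths.append(path)
    return paths


# Stand-in response for illustration; the payload is dummy bytes, not a real PNG
fake_response = SimpleNamespace(
    output=[
        SimpleNamespace(type="message"),
        SimpleNamespace(
            type="image_generation_call",
            result=base64.b64encode(b"dummy image bytes").decode("utf-8"),
        ),
    ]
)

saved = save_generated_images(fake_response, tempfile.mkdtemp(), stem="otter")
print(saved)
```

With a real response, `save_generated_images(response, ".", stem="otter")` would write one file per generated image into the current directory.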

## Multi-turn image generation

With the Responses API, you can build multi-turn conversations involving image generation either by providing image generation call outputs within the context (you can also just use the image ID), or by using the `previous_response_id` parameter. This makes it easy to iterate on images across multiple turns: refining prompts, applying new instructions, and evolving the visual output as the conversation progresses.

In [None]:
from openai import OpenAI
import base64

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    input="Generate an image of gray tabby cat hugging an otter with an orange scarf",
    tools=[{"type": "image_generation"}],
)

image_data = [
    output.result
    for output in response.output
    if output.type == "image_generation_call"
]

if image_data:
    image_base64 = image_data[0]

    with open("cat_and_otter.png", "wb") as f:
        f.write(base64.b64decode(image_base64))


# Follow up

response_fwup = client.responses.create(
    model="gpt-5",
    previous_response_id=response.id,
    input="Now make it look realistic",
    tools=[{"type": "image_generation"}],
)

image_data_fwup = [
    output.result
    for output in response_fwup.output
    if output.type == "image_generation_call"
]

if image_data_fwup:
    image_base64 = image_data_fwup[0]
    with open("cat_and_otter_realistic.png", "wb") as f:
        f.write(base64.b64decode(image_base64))

You can also stream partial images as they are generated. For details, check here --> https://platform.openai.com/docs/guides/image-generation#streaming

Moreover, note that with some models (such as `gpt-4.1`) the model revises your prompt to improve the result. For details, check here --> https://platform.openai.com/docs/guides/image-generation#revised-prompt

## Create a new image using image references

You can create an image by using one or more other images as reference.

With the Responses API, you can provide input images in 2 different ways:

- By providing an image as a Base64-encoded data URL
- By providing a file ID (created with the Files API)

### Encode an image as Base64 (for a data URL)

In [None]:
import base64

def encode_image(file_path):
    # Read the image bytes and return them as a Base64 string
    with open(file_path, "rb") as f:
        base64_image = base64.b64encode(f.read()).decode("utf-8")
    return base64_image

### Create a file ID from an image

In [None]:
from openai import OpenAI
client = OpenAI()

def create_file(file_path):
    # Upload the image with the Files API and return its file ID
    with open(file_path, "rb") as file_content:
        result = client.files.create(
            file=file_content,
            purpose="vision",
        )
    return result.id
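
A data URL also needs a MIME type that matches the image format. The sketch below shows one way to build it: `to_data_url` is a hypothetical helper that guesses the MIME type from the file extension with the standard-library `mimetypes` module. The temporary file with dummy bytes is only there so that the example runs on its own.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def to_data_url(file_path):
    """Return a Base64 data URL for an image, guessing the MIME type from its name."""
    mime_type, _ = mimetypes.guess_type(str(file_path))
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {file_path}")
    encoded = base64.b64encode(Path(file_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Demo with a temporary file containing dummy bytes (not a real image)
demo_path = Path(tempfile.mkdtemp()) / "soap.png"
demo_path.write_bytes(b"dummy")
data_url = to_data_url(demo_path)
print(data_url)
```

The returned string can be used as the `image_url` value of an `input_image` content part.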

Now, let us generate an image from these reference images.

In [None]:
from openai import OpenAI
import base64

client = OpenAI()

prompt = """Generate a photorealistic image of a gift basket on a white background 
labeled 'Relax & Unwind' with a ribbon and handwriting-like font, 
containing all the items in the reference pictures."""

base64_image1 = encode_image("soap.png")
base64_image2 = encode_image("bath-bomb.png")
file_id1 = create_file("body-lotion.png")
file_id2 = create_file("incense-kit.png")

response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": prompt},
                {
                    "type": "input_image",
                    # The reference files are PNGs, so the data URL uses image/png
                    "image_url": f"data:image/png;base64,{base64_image1}",
                },
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{base64_image2}",
                },
                {
                    "type": "input_image",
                    "file_id": file_id1,
                },
                {
                    "type": "input_image",
                    "file_id": file_id2,
                }
            ],
        }
    ],
    tools=[{"type": "image_generation"}],
)

image_generation_calls = [
    output
    for output in response.output
    if output.type == "image_generation_call"
]

image_data = [output.result for output in image_generation_calls]

if image_data:
    image_base64 = image_data[0]
    with open("gift-basket.png", "wb") as f:
        f.write(base64.b64decode(image_base64))
else:
    # No image was generated; print any text the model returned instead
    print(response.output_text)

## Edit an image using a mask

When editing an image you can also provide a mask to indicate where the image should be edited.

In [None]:
import base64
from openai import OpenAI
client = OpenAI()

fileId = create_file("sunlit_lounge.png")
maskId = create_file("mask.png")

response = client.responses.create(
    model="gpt-4o",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": "generate an image of the same sunlit indoor lounge area with a pool but the pool should contain a flamingo",
                },
                {
                    "type": "input_image",
                    "file_id": fileId,
                }
            ],
        },
    ],
    tools=[
        {
            "type": "image_generation",
            "quality": "high",
            "input_image_mask": {
                "file_id": maskId,
            },
        },
    ],
)

image_data = [
    output.result
    for output in response.output
    if output.type == "image_generation_call"
]

if image_data:
    image_base64 = image_data[0]
    with open("lounge.png", "wb") as f:
        f.write(base64.b64decode(image_base64))

## Increasing input fidelity in image generation

When dealing with images that require accurate preservation of elements (such as faces or logos), you can increase input fidelity by setting the `input_fidelity` parameter to `high`.

In [None]:
from openai import OpenAI
import base64

client = OpenAI()

womanId = create_file("woman_futuristic.jpg")
logoId = create_file("brain_logo.png")

response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Add the logo to the woman's top, as if stamped into the fabric."},
                {
                    "type": "input_image",
                    "file_id": womanId,
                },
                {
                    "type": "input_image",
                    "file_id": logoId,
                },
            ],
        }
    ],
    tools=[{"type": "image_generation", "input_fidelity": "high"}],
)

# Extract the edited image
image_data = [
    output.result
    for output in response.output
    if output.type == "image_generation_call"
]

if image_data:
    image_base64 = image_data[0]
    with open("woman_with_logo.png", "wb") as f:
        f.write(base64.b64decode(image_base64))

## Additional custom options and features

You can check additional options and features, such as size, quality, and transparency, here --> https://platform.openai.com/docs/guides/image-generation#size-and-quality-options

# Analyze images

You can use the vision capabilities of the model to analyze the content of an image, such as text or many other visual elements like shapes, colors, objects and textures.

Input images must meet the following requirements to be used in the API.

| Requirement         | Details                                                                                   |
|---------------------|-------------------------------------------------------------------------------------------|
| Supported file types| PNG (.png), JPEG (.jpeg, .jpg), WEBP (.webp), Non-animated GIF (.gif)                     |
| Size limits         | Up to 50 MB total payload size per request<br>Up to 500 individual image inputs per request|
| Other requirements  | No watermarks or logos<br>No NSFW content<br>Clear enough for a human to understand        |

For more info check here --> https://platform.openai.com/docs/guides/images-vision#analyze-images
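
Some of these requirements can be checked locally before sending a request. The following is a rough sketch: `check_image_file` is a hypothetical helper that checks only the file extension and the size of a single file. It cannot detect animated GIFs or inspect image content, and because the 50 MB limit applies to the total request payload, passing this check is a necessary condition rather than a guarantee.

```python
import tempfile
from pathlib import Path

SUPPORTED_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".gif"}
MAX_PAYLOAD_BYTES = 50 * 1024 * 1024  # assumes 1 MB = 1024 * 1024 bytes


def check_image_file(file_path):
    """Return a list of problems found for one image file (empty if none)."""
    path = Path(file_path)
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {path.suffix or '(none)'}")
    if path.stat().st_size > MAX_PAYLOAD_BYTES:
        problems.append("file alone exceeds the 50 MB payload limit")
    return problems


# Demo with temporary files containing dummy bytes
tmp_dir = Path(tempfile.mkdtemp())
good = tmp_dir / "photo.JPG"
bad = tmp_dir / "scan.tiff"
good.write_bytes(b"dummy")
bad.write_bytes(b"dummy")

print(check_image_file(good))
print(check_image_file(bad))
```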

In [None]:
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {
                "type": "input_image",
                "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
        ],
    }],
)

print(response.output_text)

# Audio and speech

You can work with audio in three ways:

- text-to-speech: the model answers a text prompt with speech
- speech-to-text: the model answers an audio prompt with text
- speech-to-speech: the model answers an audio prompt with speech

## Text-to-speech

Let us see how to have the model answer a text prompt with speech (text-to-speech).

For more info check here --> https://platform.openai.com/docs/guides/text-to-speech

In [None]:
import base64
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Is a golden retriever a good family dog?"
        }
    ]
)

wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)

Another text-to-speech use case is generating spoken audio directly from input text, so let us do it!

In [None]:
from pathlib import Path
from openai import OpenAI

client = OpenAI()
speech_file_path = Path("speech.mp3")

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Today is a wonderful day to build something people love!",
    instructions="Speak in a cheerful and positive tone.",
) as response:
    response.stream_to_file(speech_file_path)

## Speech-to-text

Let us see how to have the model answer an audio prompt with text (speech-to-text).

For more info check here --> https://platform.openai.com/docs/guides/speech-to-text

In [None]:
import base64
import requests
from openai import OpenAI

client = OpenAI()

# Fetch the audio file and convert it to a base64 encoded string
url = "https://cdn.openai.com/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status()
wav_data = response.content
encoded_string = base64.b64encode(wav_data).decode('utf-8')

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                { 
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)

# The audio output carries a transcript of the model's spoken answer
print(completion.choices[0].message.audio.transcript)

Another speech-to-text use case is transcribing an audio file, so let us do it!

In [None]:
from openai import OpenAI

client = OpenAI()
# Open the audio file in binary mode; the context manager closes it afterwards
with open("/path/to/file/audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcription.text)

You can even translate audio spoken in another language into English text!

In [None]:
from openai import OpenAI

client = OpenAI()
# Open the audio file in binary mode; the context manager closes it afterwards
with open("/path/to/file/german.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

print(translation.text)

## Speech-to-speech

Speech-to-speech can be achieved either by using a native speech-to-speech model or by chaining a speech-to-text model and a text-to-speech model together.

Due to its complexity, we omit this use case here. If you are interested you can check here --> https://platform.openai.com/docs/guides/voice-agents?voice-agent-architecture=speech-to-speech 