## Text Generation
OpenAI provides simple APIs to use a large language model to generate text from a prompt, as you might using ChatGPT. These models have been trained on vast quantities of data to understand multimedia inputs and natural language instructions. From these prompts, models can generate almost any kind of text response, like code, mathematical equations, structured JSON data, or human-like prose.

In [None]:
from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
    model = "gpt-4o-mini",
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Write a haiku about recursion in programming."
        }
    ]
)
print(completion.choices[0].message)

A function calls self,  
layers deep like falling leaves,  
endless yet finite.


In [None]:
from openai import OpenAI
client = OpenAI()
completion = client.chat.completions.create(
    model = "gpt-4o",
    messages=[
        {"role": "developer", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about recursion in programming."}
    ]
)

print(completion.choices[0].message)

In [None]:
#Generate JSON data based on a JSON Schema
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model = "gpt-4o-2024-08-06",
    messages = [
        {
            "role": "developer",
            "content": "You extract email addresses in JSON data."
        },
        {
            "role": "user",
            "content": "Feeling stuck? Send a message to help@mycompany.com."
        }
    ],
    response_format = {
        "type": "json_schema",
        "json_schema": {
            "name": "email_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "email": {
                        "description": "The email address that appears in the input",
                        "type": "string"
                    }
                },
                "additionalProperties": False  # belongs to the schema object, not inside "properties"
            }
        }
    }
)
print(response.choices[0].message.content)
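
The structured output arrives as a JSON string in message.content, so it still has to be parsed before use. A minimal sketch, assuming a hard-coded sample string in place of a live response; the helper name extract_email and the sample value are illustrative and not part of the API:

```python
import json


def extract_email(content: str) -> str:
    """Parse the JSON string from message.content and return its "email" field."""
    data = json.loads(content)
    if not isinstance(data, dict) or "email" not in data:
        raise ValueError("response does not match the expected email schema")
    return data["email"]


# Hypothetical sample of what message.content might hold for the request above
sample_content = '{"email": "help@mycompany.com"}'
print(extract_email(sample_content))
```

With a live call, the argument would be response.choices[0].message.content.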
    

In [None]:
#This is a JSON response from OpenAI's API
{
  "id": "chatcmpl-Af6LFgbOPpqu2fhGsVktc9xFaYUVh", #A unique identifier for this chat completion request.
  "object": "chat.completion", #Specifies that this is a chat completion.
  "created": 1734359189, #A timestamp representing when the response was generated.

  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Code within a loop,  \nFunction calls itself again,  \nInfinite echoes.",
        "refusal": null #The AI did not refuse to generate content.
      },
      "logprobs": null, #means that log probabilities (logprobs) were not included in the response. Logprobs show how confident the AI is in each word it generates.
      "finish_reason": "stop" #The response ended naturally (not cut off).
    }
  ],
  "usage": {}
}

A loop within loops,  
calls itself infinitely,  
stack overflow comes.  
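
The fields annotated in the JSON response above can be checked before the text is used. The following is a sketch only: it works on a plain dict shaped like that response, and the helper name message_text is hypothetical.

```python
def message_text(response: dict) -> str:
    """Return the assistant text, or raise if the model refused or was cut off."""
    choice = response["choices"][0]
    message = choice["message"]
    if message.get("refusal"):
        raise RuntimeError(f"model refused: {message['refusal']}")
    if choice["finish_reason"] != "stop":
        # Any value other than "stop" means the response did not end naturally
        raise RuntimeError(f"incomplete response: {choice['finish_reason']}")
    return message["content"]


# Trimmed-down copy of the response shown above
sample = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Code within a loop,  \nFunction calls itself again,  \nInfinite echoes.",
                "refusal": None,
            },
            "finish_reason": "stop",
        }
    ]
}
print(message_text(sample))
```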



## Vision
Several OpenAI models have vision capabilities, meaning the models can take images as input and answer questions about them. 

In [None]:
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0])

### Uploading Base64 encoded images
If you have an image or set of images locally, pass them to the model in Base64 encoded format:

In [None]:
import base64
from openai import OpenAI

client = OpenAI()

# Function to encode the image
def encode_image(image_path): 
    with open(image_path, "rb") as image_file:  # "rb" reads the file in binary mode (raw bytes, not text)
        # Base64 converts the binary image data into text that can be embedded in the request
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "path_to_your_image.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
)

print(response.choices[0])
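
The example above hard-codes image/jpeg in the data URL. For other image types, the MIME type can be guessed from the file extension. This is a sketch under that assumption; the helper name image_data_url is hypothetical, and it relies only on the standard library:

```python
import base64
import mimetypes
from pathlib import Path


def image_data_url(image_path: str) -> str:
    """Build a data URL whose MIME type is guessed from the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {image_path!r}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Usage (path is a placeholder):
# {"type": "image_url", "image_url": {"url": image_data_url("path_to_your_image.png")}}
```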

### Multiple image inputs
The Chat Completions API is capable of taking in and processing multiple image inputs, in Base64 encoded format or as an image URL. 
The model processes each image and uses information from all images to answer the question.

In [None]:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model = "gpt-4o-mini",
    messages = [
        {
            "role" : "user",
            "content":[
                {
                    "type" : "text",
                    "text" : "What is in these images? Is there any difference between them?",
                },
                {
                    "type" : "image_url",
                    "image_url":{
                        "url": "https://upload.wikimedia.org/wikipedia",
                    },
                },
            ],
        },
    ],
    max_tokens=300,
)
print(response.choices[0])

### Low or high fidelity image understanding
The detail parameter—which has three options, low, high, and auto—gives you control over how the model processes the image and generates
its textual understanding. By default, the model will use the auto setting, which looks at the image input size and decides if it should use 
the low or high setting.

- low enables the "low res" mode. The model receives a low-resolution 512px x 512px version of the image and represents it with a budget of 85 tokens. This allows the API to return faster responses and consume fewer input tokens for use cases that do not require high detail.
- high enables "high res" mode, which first lets the model see the low-resolution image (using 85 tokens) and then creates detailed crops using 170 tokens for each 512px x 512px tile.
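
The token budget described above can be turned into a rough estimate. The 85 and 170 figures come from the text; the resizing steps (fit within 2048px x 2048px, then shrink the shortest side to 768px) are an assumption about how the image is scaled before tiling, and the function name is hypothetical. Treat the result as an approximation:

```python
import math

BASE_TOKENS = 85        # low-resolution overview of the image
TOKENS_PER_TILE = 170   # each 512px x 512px tile in high-detail mode


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image at the given detail setting."""
    if detail == "low":
        return BASE_TOKENS
    if detail != "high":
        raise ValueError("detail must be 'low' or 'high'")
    # Assumption: the image is first scaled down to fit within 2048 x 2048
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Assumption: it is then scaled down so the shortest side is at most 768px
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


print(estimate_image_tokens(1024, 1024, "high"))  # 765 (4 tiles)
print(estimate_image_tokens(4096, 8192, "low"))   # 85
```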

In [None]:
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)

### Image generation
DALL·E 2 and DALL·E 3 have different options for generating images.
The following code example uses DALL·E 3 to generate a square, standard quality image of a cat.
#### Size and quality options
Square, standard quality images are the fastest to generate. The default size of generated images is 1024x1024 pixels,
but each model has different options:

In [None]:
from openai import OpenAI
client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="a white siamese cat",
    size="1024x1024",
    quality="standard",
    n=1,
)

print(response.data[0].url)

### Edits (DALL·E 2 only)
The image edits endpoint lets you edit or extend an image by uploading an image and mask indicating which areas should be replaced.
This process is also known as inpainting.

The transparent areas of the mask indicate where the image should be edited, and the prompt should describe the full new image, not just the erased area. This endpoint enables experiences like DALL·E image editing in ChatGPT Plus.

In [None]:
#Edit an image
from openai import OpenAI
client = OpenAI()

response = client.images.edit(
    model="dall-e-2",
    image=open("sunlit_lounge.png", "rb"),
    mask=open("mask.png", "rb"),
    prompt="A sunlit indoor lounge area with a pool containing a flamingo",
    n=1,  # generate one edited image from the provided image, mask, and prompt
    size="1024x1024",
)

print(response.data[0].url)

![Alt Text](D:/Projects/DeepLearning/FineTuning/images/sunlit.png)


### Variations (DALL·E 2 only)
The image variations endpoint allows you to generate a variation of a given image.

In [None]:
#Generate an image variation
from openai import OpenAI
client = OpenAI()

response = client.images.create_variation(
    model="dall-e-2",
    image=open("corgi_and_cat_paw.png", "rb"),
    n=1,  # generate one variation of the input image (this endpoint takes no mask or prompt)
    size="1024x1024"
)

print(response.data[0].url)

![Alt Text](D:/Projects/DeepLearning/FineTuning/images/dogCat.png)

### Error handling
API requests can return errors due to invalid inputs, rate limits, or other issues. Handle them with a try...except statement. With the v1 Python SDK used throughout this notebook, HTTP error responses raise openai.APIStatusError, which exposes the status code as e.status_code and the error text as e.message:

In [None]:
import openai
from openai import OpenAI
client = OpenAI()

try:
  response = client.images.create_variation(
    image=open("image_edit_mask.png", "rb"),
    n=1,
    model="dall-e-2",
    size="1024x1024"
  )
  print(response.data[0].url)
except openai.APIStatusError as e:
  print(e.status_code) # 400 → Bad Request (wrong parameters); 401 → Unauthorized (invalid API key); 429 → Too Many Requests (rate limit exceeded)
  print(e.message) # for example, a message describing an invalid image format
except openai.APIConnectionError as e:
  print(e) # the request never reached the API (network problem or timeout)
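
Rate-limit errors (HTTP 429) are usually transient, so one common response is to retry with exponential backoff. The helper below is a generic sketch, not part of the OpenAI SDK; its name and parameters are illustrative, and the caller decides which exceptions count as retryable:

```python
import time


def call_with_retries(request, is_retryable, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request(); on a retryable error wait 1x, 2x, 4x... base_delay and try again."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** attempt)


# Usage sketch (in the v1 SDK, rate-limit failures raise openai.RateLimitError):
# result = call_with_retries(
#     lambda: client.images.generate(model="dall-e-3", prompt="a white siamese cat"),
#     lambda exc: isinstance(exc, openai.RateLimitError),
# )
```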

## Audio generation

You can use audio capabilities to:

- Generate a spoken audio summary of a body of text (text in, audio out)
- Perform sentiment analysis on a recording (audio in, text out)
- Hold asynchronous speech-to-speech interactions with a model (audio in, audio out)

To generate audio or use audio as an input, use the chat completions endpoint. You can either use the REST API from the HTTP client of your choice or one of OpenAI's official SDKs.


In [None]:
#Audio Output from model
#Create a human-like audio response to a prompt
import base64
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"], # Request both text and audio response
    audio={"voice": "alloy", "format": "wav"},  # Use the "alloy" voice and WAV output; other voice options include echo, fable, onyx, nova, and shimmer
    messages=[
        {
            "role": "user",
            "content": "Is a golden retriever a good family dog?"
        }
    ]
)

print(completion.choices[0])

wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)

In [None]:
#audio input to model
#Use audio inputs for prompting a model
import base64
import requests
from openai import OpenAI

client = OpenAI()

# Fetch the audio file and convert it to a base64 encoded string
url = "https://cdn.openai.com/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status() #If the request fails (e.g., 404 Not Found or 500 Server Error), it raises an error instead of continuing execution.
wav_data = response.content 
encoded_string = base64.b64encode(wav_data).decode('utf-8') #UTF-8 → A text encoding format that supports all characters.

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                { 
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)

print(completion.choices[0].message)
                

### Multi-turn conversations
Using audio outputs from the model as inputs to multi-turn conversations requires a generated ID. Find this ID in the response data for an audio generation. Here's an example of a message you might receive from /chat/completions in a JSON data structure:

In [None]:
{
  "index": 0,
  "message": {
    "role": "assistant",
    "content": null, #No text response because the output is audio
    "refusal": null, #The AI didn’t refuse the request.
    "audio": {
      "id": "audio_abc123",
      "expires_at": 1729018505,
      "data": "<bytes omitted>", #This is where the actual audio file is stored.
      "transcript": "Yes, golden retrievers are known to be ..." #This is the text version of what the AI said in the audio.
    }
  },
  "finish_reason": "stop" #The response was completed successfully.
}

The value of message.audio.id above provides an identifier you can use in an assistant message for a new /chat/completions request, as in the example below.

In [None]:
# -H adds an HTTP header: Content-Type tells the API that the request body is JSON, and Authorization carries the API key.
# -d sends the JSON request body, which specifies the model, the messages, and the response type.
curl "https://api.openai.com/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -d '{
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],
        "audio": { "voice": "alloy", "format": "wav" },
        "messages": [
            {
                "role": "user",
                "content": "Is a golden retriever a good family dog?"
            },
            {
                "role": "assistant",
                "audio": {
                    "id": "audio_abc123"
                }
            },
            {
                "role": "user",
                "content": "Why do you say they are loyal?"
            }
        ]
    }'
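
The same message list can be assembled in Python before passing it as the messages argument. This is a sketch with a hypothetical helper name; the audio ID is the placeholder value from the example response, and a real one would come from message.audio.id:

```python
def followup_messages(first_question: str, audio_id: str, followup: str) -> list[dict]:
    """Build a conversation that refers to an earlier audio reply by its ID."""
    return [
        {"role": "user", "content": first_question},
        # The assistant turn carries only the audio ID, not the audio bytes
        {"role": "assistant", "audio": {"id": audio_id}},
        {"role": "user", "content": followup},
    ]


messages = followup_messages(
    "Is a golden retriever a good family dog?",
    "audio_abc123",
    "Why do you say they are loyal?",
)
print(messages[1])
```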

### How do I think about audio input to the model in terms of tokens?
We're working on better tooling to expose this, but roughly one hour of audio input equals 128k tokens, the max context window currently supported by this model.
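
Taking the rough figure above at face value (one hour of audio is about 128k tokens), a back-of-the-envelope estimate for shorter clips follows from simple proportion. This is only an approximation derived from that single number, and the function name is hypothetical:

```python
import math

TOKENS_PER_HOUR = 128_000  # rough figure quoted above
SECONDS_PER_HOUR = 3600


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Scale the one-hour figure linearly to the clip length, rounding up."""
    return math.ceil(duration_seconds * TOKENS_PER_HOUR / SECONDS_PER_HOUR)


print(estimate_audio_tokens(90))    # 3200
print(estimate_audio_tokens(3600))  # 128000
```

That works out to roughly 36 tokens per second of audio.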

## Text to speech
The Audio API provides a speech endpoint based on our TTS (text-to-speech) model. It comes with several built-in voices and can be used to:
- Narrate a written blog post
- Produce spoken audio in multiple languages
- Give realtime audio output using streaming

The speech endpoint takes three key inputs: 1) **the model,** 2) **the text to be turned into audio,** and 3) **the voice you want to use in the output.** Here's a simple request example:

In [1]:
from pathlib import Path
from openai import OpenAI

client = OpenAI()
speech_file_path = Path.cwd() / "speech.mp3"  # __file__ is not defined inside a notebook, so write to the current working directory
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Today is a wonderful day to build something people love!",
)
response.stream_to_file(speech_file_path)

By default, the endpoint outputs an MP3 of the spoken audio, but you can configure it to output any supported format.
### Audio quality
For realtime applications, the standard tts-1 model provides the lowest latency, but at a lower quality than the tts-1-hd model.

### Voice options
Experiment with different voices (alloy, ash, coral, echo, fable, onyx, nova, sage, shimmer) to find a match for your desired tone and audience. Current voices are optimized for English.

### Streaming realtime audio
The Speech API provides support for realtime audio streaming using chunked transfer encoding. This means the audio can be played before the full file is generated and made accessible.

In [None]:
from openai import OpenAI

client = OpenAI()

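# with_streaming_response writes audio chunks to disk as they arrive instead of buffering the whole file first
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello world! This is a streaming test.",
) as response:
    response.stream_to_file("output.mp3")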

### Supported output formats
The default response format is mp3, but other formats like opus and wav are available.

- MP3: The default response format for general use cases.
- Opus: For internet streaming and communication, low latency.
- AAC: For digital audio compression, preferred by YouTube, Android, iOS.
- FLAC: For lossless audio compression, favored by audio enthusiasts for archiving.
- WAV: Uncompressed WAV audio, suitable for low-latency applications to avoid decoding overhead.
- PCM: Similar to WAV but contains the raw samples in 24kHz (16-bit signed, little-endian), without the header.
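
Because PCM output has no header, most players cannot open it directly. One way to make it playable is to wrap the raw samples in a WAV container with the standard library wave module. The sketch below uses the 24kHz, 16-bit parameters listed above; the single (mono) channel is an assumption, and the function name is hypothetical:

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24_000, channels: int = 1) -> bytes:
    """Prepend a WAV header to raw 16-bit signed little-endian PCM samples."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)   # assumption: mono audio
        wav_file.setsampwidth(2)          # 16-bit samples are 2 bytes each
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# One second of silence stands in for real PCM data from the API
silence = b"\x00\x00" * 24_000
wav_bytes = pcm_to_wav_bytes(silence)
```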

## Speech to text

The Audio API provides two speech to text endpoints, transcriptions and translations, based on our state-of-the-art open source large-v2 Whisper model. They can be used to:

- Transcribe audio into whatever language the audio is in.
- Translate and transcribe the audio into English.

File uploads are currently limited to 25 MB, and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.

### Transcriptions
The transcriptions API takes as input the audio file you want to transcribe and the desired output file format for the transcription of the audio. We currently support multiple input and output file formats.

In [None]:
from openai import OpenAI
client = OpenAI()

audio_file = open("/path/to/file/audio.mp3", "rb")
transcription = client.audio.transcriptions.create(
    model="whisper-1", 
    file=audio_file
)

print(transcription.text)

By default, the response type will be json with the raw text included.

In [None]:
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger.
....
}

The Audio API also allows you to set additional parameters in a request. For example, if you want to set the response_format as text, your request would look like the following:

In [None]:
from openai import OpenAI
client = OpenAI()

audio_file = open("/path/to/file/speech.mp3", "rb")
transcription = client.audio.transcriptions.create(
    model="whisper-1", 
    file=audio_file, 
    response_format="text"
)

print(transcription)  # with response_format="text" the SDK returns a plain string, so there is no .text attribute

### Translations
The translations API takes as input the audio file in any of the supported languages and transcribes, if necessary, the audio into English. This differs from our /Transcriptions endpoint since the output is not in the original input language and is instead translated to English text.

In [None]:
from openai import OpenAI
client = OpenAI()

audio_file = open("/path/to/file/german.mp3", "rb")
transcription = client.audio.translations.create(
    model="whisper-1", 
    file=audio_file,
)

print(transcription.text)

### Timestamps
By default, the Whisper API will output a transcript of the provided audio in text. The timestamp_granularities[] parameter enables a more structured and timestamped json output format, with timestamps at the segment, word level, or both. This enables word-level precision for transcripts and video edits, which allows for the removal of specific frames tied to individual words.

In [None]:
#without timestamp
{
  "text": "The quick brown fox jumps over the lazy dog."
}

#with segment
{
  "segments": [
    {
      "start": 0.0,
      "end": 5.0,
      "text": "The quick brown fox jumps over the lazy dog."
    }
  ]
}

#with words
{
  "words": [
    { "start": 0.0, "end": 0.5, "word": "The" },
    { "start": 0.5, "end": 0.9, "word": "quick" },
    { "start": 0.9, "end": 1.2, "word": "brown" },
    ...
  ]
}


In [None]:
#timestamp options
from openai import OpenAI
client = OpenAI()

audio_file = open("/path/to/file/speech.mp3", "rb")
transcription = client.audio.transcriptions.create(
    file=audio_file,
    model="whisper-1",
    response_format="verbose_json",
    timestamp_granularities=["word"]
)

print(transcription.words)
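
Word-level timestamps make it possible to work out which time ranges to cut when removing specific words. A minimal sketch, assuming the words have already been flattened into (start, end, text) tuples; the function name, the sample data, and the list of unwanted words are all illustrative:

```python
def cut_ranges(words, unwanted):
    """Return merged (start, end) time ranges covering every unwanted word."""
    ranges = []
    for start, end, text in words:
        if text.lower().strip(".,!?") not in unwanted:
            continue
        if ranges and abs(ranges[-1][1] - start) < 1e-9:
            # This word starts where the previous cut ends: extend that cut
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges


# Hypothetical word timings
sample_words = [
    (0.0, 0.5, "The"),
    (0.5, 0.9, "um"),
    (0.9, 1.2, "uh"),
    (1.2, 1.6, "fox"),
]
print(cut_ranges(sample_words, {"um", "uh"}))  # [(0.5, 1.2)]
```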


### Longer inputs
By default, the Whisper API only supports files that are less than 25 MB. If you have an audio file that is larger than that, you will need to break it up into chunks of 25 MB or less or use a compressed audio format. To get the best performance, we suggest that you avoid breaking the audio up mid-sentence, as this may cause some context to be lost.

One way to handle this is to use the **PyDub** open source Python package to split the audio:

In [None]:
from pydub import AudioSegment
song = AudioSegment.from_mp3("good_morning.mp3")

# PyDub handles time in milliseconds
ten_minutes = 10 * 60 * 1000
first_10_minutes = song[:ten_minutes]
first_10_minutes.export("good_morning_10.mp3", format="mp3")
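
The snippet above exports only the first ten minutes. To cover a whole recording, the slice boundaries for every chunk can be computed up front. This sketch is plain Python with a hypothetical function name; it uses fixed-length chunks, so it does not avoid mid-sentence cuts:

```python
def chunk_ranges(total_ms: int, chunk_ms: int) -> list[tuple[int, int]]:
    """Split [0, total_ms) into consecutive (start, end) ranges of at most chunk_ms."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


ten_minutes = 10 * 60 * 1000
# A 25-minute recording yields two full chunks and one 5-minute remainder
print(chunk_ranges(25 * 60 * 1000, ten_minutes))
# [(0, 600000), (600000, 1200000), (1200000, 1500000)]
```

With PyDub, each pair can be used as song[start:end], and len(song) gives the total length in milliseconds.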
