<a href="https://colab.research.google.com/github/RCarteri/openAi_api/blob/main/Computer_Vision_with_OpenAI_GPT_4_Vision_model_and_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 7: Working with images using GPT-4 Vision model


## What You Will Learn

- **GPT-4 Vision model**: Discover the vision capabilities of the GPT-4 model and how to build computer vision applications with it.


## Getting Started

Before we jump in, ensure you have:

- A Google Colab account.
- Basic knowledge of Python and REST APIs.
- An OpenAI API key with access to the vision-capable GPT-4 models and the TTS API ([OpenAI](https://platform.openai.com/account/api-keys)).

## Embarking on a Visual Journey

Are you ready to create a new AI application using GPT Vision? Let's begin our journey into computer vision with GPT.



# 2. Importing libraries

In [None]:
!pip install openai


In [2]:
import os
import openai
import base64

from openai import OpenAI

# 3. Sending a first request to OpenAI API


### 3.1 Setting up API Key

In [3]:
from dotenv import load_dotenv

# Load variables from a local .env file into the environment
load_dotenv()

# OpenAI() reads OPENAI_API_KEY from the environment by default
client = OpenAI()
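
If the key is missing, the first API call fails with an authentication error that can be hard to trace back to the setup step. A minimal sketch of an early check, assuming the key is stored under `OPENAI_API_KEY`; the helper name `require_env` is hypothetical and not part of the `openai` or `dotenv` packages:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would surface a missing key before any request is sent.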

# 4. Classifying and describing images



In [4]:
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

In [6]:
base64_image = encode_image('files/test_img.jpg')

In [16]:
res = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        # Data URL: no space between "base64," and the payload
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ],
    max_tokens=200
)

In [17]:
res.choices[0].message.content

'This image depicts a stylized, futuristic cityscape. The scene is bathed in vibrant, neon colors with a large, bright moon in the sky. The buildings are geometric and angular, with some incorporating transparent design elements. The road is shown in perspective, leading towards the city skyline in the background. The color palette predominantly features shades of blue, pink, and purple, giving the image a retro-futuristic or cyberpunk aesthetic.'
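
The request above hardcodes `image/jpeg` in the data URL, which only matches JPEG files. As a rough sketch, the MIME type could instead be guessed from the file name with the standard-library `mimetypes` module. The helper name `to_data_url` and the JPEG fallback are assumptions made for illustration:

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Build a base64 data URL, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        # Fall back to JPEG, matching the type hardcoded in the request above
        mime_type = "image/jpeg"
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the `url` value of the `image_url` entry in place of the f-string.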

## Text To Speech using TTS API

In [22]:
speech_file_path = "files/speech.mp3"
response = client.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input="Today is a wonderful day to build something people love!"
)

with open(speech_file_path, 'wb') as f:
    f.write(response.content)
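
Longer texts may exceed the input length the speech endpoint accepts. The sketch below splits text into chunks at sentence boundaries so that each chunk can be sent as a separate request. The 4096-character default is an assumption that should be checked against the current API reference, and `split_for_tts` is a hypothetical helper name:

```python
import re


def split_for_tts(text: str, max_chars: int = 4096) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be passed as `input` to its own `client.audio.speech.create` call, and the resulting audio segments written out in order.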

# PROJECT 7: Generating a voiceover for a video

In [25]:
!pip install opencv-python


In [27]:
from IPython.display import display, Image, Audio
import os
import cv2
import base64

In [31]:
# Code taken from OpenAI blog
video = cv2.VideoCapture("files/experiment_video_desc.mp4")

base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")

4755 frames read.
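
With 4755 frames, attaching every frame to a single request is impractical. A minimal sketch of sampling every Nth frame and packing the samples into the same `image_url` content format used earlier for the single image. The helper name `build_frame_content` and the default values for `step` and `max_frames` are assumptions chosen for illustration, not values from the original notebook:

```python
def build_frame_content(prompt, frames_b64, step=50, max_frames=20):
    """Return a chat `content` list: the text prompt followed by sampled frames."""
    # Keep every `step`-th frame, then cap the total number of frames sent
    sampled = frames_b64[::step][:max_frames]
    content = [{"type": "text", "text": prompt}]
    for frame in sampled:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{frame}"},
        })
    return content
```

The returned list can serve as the `content` of a user message, so the model receives the sampled frames together with the prompt.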


In [35]:
# Note: this message carries only the text prompt; base64Frames is not attached to it.
PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": "These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration."
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=PROMPT_MESSAGES,
    max_tokens=500,
)

print(response.choices[0])

Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='[Gentle, deliberate tone]\n\n"In the heart of the African savannah, life unfolds in a delicate balance, where every creature plays a crucial role."\n\n[Camera pans to a herd of elephants]\n\n"The matriarch leads with wisdom, guiding her family to waterholes that have sustained her kind for generations."\n\n[A lioness prowls stealthily in the grass]\n\n"Silent and stealthy, the lioness, nature’s perfect predator, hunts with a precision that is honed by aeons of evolution."\n\n[A close-up of a vibrant bird in the treetops]\n\n"In contrast, high in the treetops, a splash of dazzling color. The lilac-breasted roller, with its extraordinary plumage, brightens the canopy with its presence."\n\n[A wide shot of the savannah at dusk]\n\n"As the sun sets, painting the sky with hues of orange and purple, the savannah prepares for the night."\n\n[Camera focuses on a pair of zebras at twilight]\n\n"A world a

In [36]:
speech_file_path = "speech.mp3"
audio_response = client.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input=response.choices[0].message.content
)

audio_response.stream_to_file(speech_file_path)
Audio(speech_file_path, autoplay=True)
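
The generated script above contains bracketed stage directions such as `[Gentle, deliberate tone]`, which a text-to-speech model would read aloud. A small sketch of stripping them, along with the wrapping quotation marks, before synthesis. The helper name `clean_script` is hypothetical:

```python
import re


def clean_script(script: str) -> str:
    """Remove [bracketed stage directions] and wrapping quotes from each line."""
    without_directions = re.sub(r"\[[^\]]*\]", "", script)
    lines = []
    for line in without_directions.splitlines():
        # Drop surrounding whitespace and straight double quotes
        line = line.strip().strip('"')
        if line:
            lines.append(line)
    return "\n".join(lines)
```

Passing `clean_script(response.choices[0].message.content)` as `input` would keep only the narration.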

