
# 1. OpenAI VLM (GPT-4*) - Basics
This section demonstrates the basic usage of OpenAI's Vision Language Model (VLM) capabilities with a GPT-4-class model (`gpt-4o-mini` in the code below).
We use the OpenAI API to send an image together with a text prompt and get back a textual description.

**Support Material**:
- https://platform.openai.com/docs/guides/text-generation 
- https://platform.openai.com/docs/guides/vision?lang=node
- https://platform.openai.com/docs/guides/text-generation?text-generation-quickstart-example=image 
- https://platform.openai.com/docs/api-reference/chat


In [2]:
import openai
from dotenv import load_dotenv  
import os
import base64
import json
import textwrap

# Read an image file and return its contents as a base64-encoded string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


load_dotenv()
#openAIclient = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
openAIclient = openai.OpenAI(api_key= os.getenv("OPENAI_API_KEY"))




TEXTMODEL = "gpt-4o-mini" 
IMGMODEL= "gpt-4o-mini" 

# Path to your image
img = "images/street_scene.jpg"
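The calls in this notebook wrap the base64 string in a data URL with a hardcoded `image/jpeg` type. As a minimal sketch, assuming you also want to send PNG or other image files, the MIME type can be guessed from the file name. The helper name `to_data_url` and the demo file are hypothetical and not part of the original notebook.

```python
import base64
import mimetypes
import os
import tempfile


def to_data_url(image_path: str) -> str:
    """Return a base64 data URL whose MIME type is guessed from the file name."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {image_path!r}")
    with open(image_path, "rb") as image_file:
        payload = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"


# Demo on a throwaway file so the snippet runs without the notebook's image.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.png")
    with open(demo_path, "wb") as f:
        f.write(b"not a real image")
    demo_url = to_data_url(demo_path)

# Show only the prefix; the rest is the base64 payload.
print(demo_url.split(",", 1)[0])
```

The guess is based on the file extension only, so a mislabeled file would still be sent with the wrong type.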

In [3]:
# Basic call to gpt-4o-mini with a text prompt and an image

completion = openAIclient.chat.completions.create(
    model=IMGMODEL,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ],
        }
    ],
)


# Print the whole message object (not only its text), wrapped to a width of 120 characters

response = str(completion.choices[0].message)
print(textwrap.fill(response, width=120))


ChatCompletionMessage(content='The image depicts a busy urban street scene. It features various pedestrians and
vehicles, with people engaged in different activities. \n\n- A young person is sitting on the ground, engrossed in a
device.\n- Nearby, a person appears to be lying on the pavement.\n- Several birds are seen on the ground, likely
pigeons.\n- Other individuals are walking, cycling, or playing guitar in the background.\n- Buildings and street
furniture, such as a bench and a flower pot, complete the scene.\n\nThe atmosphere suggests a lively city environment
with a mix of everyday activities.', role='assistant', function_call=None, tool_calls=None, refusal=None)
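The output above is the string form of the whole message object, so the line breaks appear as literal `\n`. A minimal sketch of an alternative is to take only the reply text (`completion.choices[0].message.content`) and wrap each line separately so that bullet lists survive. The helper name `wrap_reply` and the sample string are assumptions made for this illustration.

```python
import textwrap


def wrap_reply(text: str, width: int = 120) -> str:
    """Wrap each line of a reply separately so blank lines and bullets are kept."""
    wrapped = [
        textwrap.fill(line, width=width) if line.strip() else ""
        for line in text.split("\n")
    ]
    return "\n".join(wrapped)


# Stand-in for completion.choices[0].message.content (hypothetical sample text).
sample_reply = (
    "The image depicts a busy street.\n\n"
    "- A person sits on the ground.\n"
    "- Several birds are nearby."
)
print(wrap_reply(sample_reply, width=40))
```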



# 1.1 Structured Output
Here, we expand upon the VLM example to request structured outputs. This approach allows for extracting 
well-organized information from images in a machine-readable format, such as JSON.

**Support Material**:
- https://platform.openai.com/docs/guides/text-generation?text-generation-quickstart-example=json


In [4]:
def promptLLM(prompt : str = None, sysprompt : str = None,  image : str = None, wantJson : bool = False, returnDict : bool = False):
    returnValue = ""
    messages = []
    # Add the system message only if one was provided
    if sysprompt is not None:
        messages.append({"role": "system", "content": sysprompt})
    modelToUse = TEXTMODEL
    userContent = [{"type": "text", "text": prompt}]
    if image is not None:
        # image is expected to be a base64-encoded JPEG string (see encode_image)
        userContent.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image}"}})
        modelToUse = IMGMODEL
    messages.append({"role": "user", "content": userContent})

    if wantJson:
        returnValue = openAIclient.chat.completions.create(
            model=modelToUse,
            #max_tokens= 400,
            response_format={ "type": "json_object" },
            messages=messages,
            temperature=0,
            #n=1,
        )
    else :
        returnValue = openAIclient.chat.completions.create(
            model=modelToUse,
            messages=messages,
            temperature=0,
            #n=1,
        )
    returnValue = returnValue.choices[0].message.content
    if returnDict:
        return json.loads(returnValue)
    return returnValue

In [5]:
output = promptLLM(prompt = "describe the image in detail",sysprompt = "you are a careful observer. the response should be in json format", image = encode_image(img), wantJson=True, returnDict=True)

In [6]:
output

{'description': {'scene': 'A bustling city street with a mix of pedestrians and vehicles.',
  'foreground': {'elements': [{'type': 'person',
     'action': 'sitting',
     'details': {'gender': 'male',
      'age': 'teen',
      'clothing': 'green jacket, shorts',
      'activity': 'using a smartphone',
      'position': 'on the ground'}},
    {'type': 'person',
     'action': 'lying down',
     'details': {'gender': 'male',
      'age': 'teen',
      'clothing': 'red hoodie',
      'position': 'on the ground'}},
    {'type': 'person',
     'action': 'sitting',
     'details': {'gender': 'female',
      'age': 'young adult',
      'clothing': 'red top, blue jeans',
      'activity': 'reading a book',
      'position': 'on a bench'}},
    {'type': 'person',
     'action': 'sitting',
     'details': {'gender': 'male',
      'age': 'older',
      'clothing': 'suit',
      'activity': 'reading a newspaper',
      'position': 'next to the young woman on the bench'}},
    {'type': 'person',


In [7]:
output["description"]["foreground"]

{'elements': [{'type': 'person',
   'action': 'sitting',
   'details': {'gender': 'male',
    'age': 'teen',
    'clothing': 'green jacket, shorts',
    'activity': 'using a smartphone',
    'position': 'on the ground'}},
  {'type': 'person',
   'action': 'lying down',
   'details': {'gender': 'male',
    'age': 'teen',
    'clothing': 'red hoodie',
    'position': 'on the ground'}},
  {'type': 'person',
   'action': 'sitting',
   'details': {'gender': 'female',
    'age': 'young adult',
    'clothing': 'red top, blue jeans',
    'activity': 'reading a book',
    'position': 'on a bench'}},
  {'type': 'person',
   'action': 'sitting',
   'details': {'gender': 'male',
    'age': 'older',
    'clothing': 'suit',
    'activity': 'reading a newspaper',
    'position': 'next to the young woman on the bench'}},
  {'type': 'person',
   'action': 'walking',
   'details': {'gender': 'female',
    'age': 'young adult',
    'clothing': 'pink top, shorts',
    'activity': 'looking at a phone',
   


# 1.2 JSON Schema for Controlled Structured Outputs
In this section, we define a JSON schema to get a more controlled and specific output from the model.
With this schema, we constrain the model to predefined field names, data types, and structures when it describes an image. Here we pass the schema directly as a JSON object, but one could also define it with a library such as pydantic.

Example: 
```
from typing import List, Literal
from pydantic import BaseModel, Field


class Person(BaseModel):
    position: str = Field(..., description="Position of the person in the environment, e.g., standing, sitting, etc.")
    age: int = Field(..., ge=0, description="Age of the person, must be a non-negative integer.")
    activity: str = Field(..., description="Activity the person is engaged in, e.g., reading, talking, etc.")
    gender: Literal["male", "female", "non-binary", "other", "prefer not to say"] = Field(
        ..., description="Gender of the person"
    )


class ImageExtraction(BaseModel):
    number_of_people: int = Field(..., ge=0, description="The total number of people in the environment.")
    atmosphere: str = Field(..., description="Description of the atmosphere, e.g., calm, lively, etc.")
    hour_of_the_day: int = Field(..., ge=0, le=23, description="The hour of the day in 24-hour format.")
    people: List[Person] = Field(..., description="List of people and their details.")

```

In [8]:
def promptLLM(prompt : str = None, sysprompt : str = None,  image : str = None, wantJson : bool = False, returnDict : bool = False):
    returnValue = ""
    messages = []
    # Add the system message only if one was provided
    if sysprompt is not None:
        messages.append({"role": "system", "content": sysprompt})
    modelToUse = TEXTMODEL
    userContent = [{"type": "text", "text": prompt}]
    if image is not None:
        # image is expected to be a base64-encoded JPEG string (see encode_image)
        userContent.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image}"}})
        modelToUse = IMGMODEL
    messages.append({"role": "user", "content": userContent})

    if wantJson:
        returnValue = openAIclient.chat.completions.create(
            model=modelToUse,
            #max_tokens= 400,
            response_format={
                "type": "json_schema",
                "json_schema": {
                    "name": "img_extract",
                    "schema": {
                    "type": "object",
                    "properties": {
                        "numberOfPeople": {
                        "type": "integer",
                        "description": "The total number of people in the environment",
                        "minimum": 0
                        },
                        "atmosphere": {
                        "type": "string",
                        "description": "Description of the atmosphere, e.g., calm, lively, etc."
                        },
                        "hourOfTheDay": {
                        "type": "integer",
                        "description": "The hour of the day in 24-hour format",
                        "minimum": 0,
                        "maximum": 23
                        },
                        "people": {
                        "type": "array",
                        "description": "List of people and their details",
                        "items": {
                            "type": "object",
                            "properties": {
                            "position": {
                                "type": "string",
                                "description": "Position of the person in the environment, e.g., standing, sitting, etc."
                            },
                            "age": {
                                "type": "integer",
                                "description": "Age of the person",
                                "minimum": 0
                            },
                            "activity": {
                                "type": "string",
                                "description": "Activity the person is engaged in, e.g., reading, talking, etc."
                            },
                            "gender": {
                                "type": "string",
                                "description": "Gender of the person",
                                "enum": ["male", "female", "non-binary", "other", "prefer not to say"]
                            }
                            },
                            "required": ["position", "age", "activity", "gender"]
                        }
                        }
                    },
                    "required": ["numberOfPeople", "atmosphere", "hourOfTheDay", "people"]
                    }}},
            messages=messages,
            temperature=0,
            #n=1,
        )
    else :
        returnValue = openAIclient.chat.completions.create(
            model=modelToUse,
            messages=messages,
            temperature=0,
            #n=1,
        )
    returnValue = returnValue.choices[0].message.content
    if returnDict:
        return json.loads(returnValue)
    return returnValue
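The schema in the function above is passed without the `strict` flag. As a hedged sketch, the helper below wraps a schema into a `response_format` dictionary with `"strict": True` and adds `"additionalProperties": False` to every object, which is what the Structured Outputs guide requires for strict mode as far as I am aware. The helper name `make_strict_response_format` and the demo schema are hypothetical, and strict mode may reject some schema keywords, so check the current guide before relying on it.

```python
import copy


def make_strict_response_format(name: str, schema: dict) -> dict:
    """Wrap a JSON schema in a response_format dict with strict mode enabled."""

    def close_objects(node):
        # Recursively add additionalProperties: False to every object schema.
        if isinstance(node, dict):
            if node.get("type") == "object":
                node.setdefault("additionalProperties", False)
            for value in node.values():
                close_objects(value)
        elif isinstance(node, list):
            for item in node:
                close_objects(item)

    strict_schema = copy.deepcopy(schema)  # leave the caller's schema untouched
    close_objects(strict_schema)
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": strict_schema},
    }


# Small demo schema (an assumption; shaped like the one used in this notebook).
demo_schema = {
    "type": "object",
    "properties": {
        "people": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"age": {"type": "integer"}},
                "required": ["age"],
            },
        }
    },
    "required": ["people"],
}
demo_format = make_strict_response_format("img_extract", demo_schema)
print(demo_format["json_schema"]["strict"])
```

The resulting dictionary could be passed as `response_format=` in place of the inline one.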

In [9]:
output_image_analysis = promptLLM(prompt = "describe the image in detail",sysprompt = "you are a careful observer. the response should be in json format", image = encode_image(img), wantJson=True, returnDict=True)

In [10]:
output_image_analysis

{'numberOfPeople': 8,
 'atmosphere': 'busy urban environment',
 'hourOfTheDay': 17,
 'people': [{'position': 'sitting',
   'age': 12,
   'activity': 'using a smartphone',
   'gender': 'male'},
  {'position': 'lying down',
   'age': 15,
   'activity': 'unconscious',
   'gender': 'male'},
  {'position': 'sitting',
   'age': 30,
   'activity': 'reading a newspaper',
   'gender': 'male'},
  {'position': 'sitting',
   'age': 25,
   'activity': 'reading a book',
   'gender': 'female'},
  {'position': 'walking',
   'age': 20,
   'activity': 'walking with a phone',
   'gender': 'female'},
  {'position': 'riding a bicycle',
   'age': 30,
   'activity': 'cycling',
   'gender': 'male'},
  {'position': 'walking',
   'age': 40,
   'activity': 'walking',
   'gender': 'female'},
  {'position': 'riding a scooter',
   'age': 35,
   'activity': 'scootering',
   'gender': 'male'}]}
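A schema constrains the shape of the reply but not its internal consistency, for instance whether `numberOfPeople` matches the length of `people`. The following is a minimal sketch of such a check, using the field names from the schema above; the function name `check_analysis` and the sample data are made up for illustration.

```python
REQUIRED_PERSON_KEYS = {"position", "age", "activity", "gender"}


def check_analysis(analysis: dict) -> list:
    """Return a list of human-readable problems found in an image analysis."""
    problems = []
    people = analysis.get("people", [])
    if analysis.get("numberOfPeople") != len(people):
        problems.append(
            f"numberOfPeople={analysis.get('numberOfPeople')} but {len(people)} people listed"
        )
    hour = analysis.get("hourOfTheDay")
    if not isinstance(hour, int) or not 0 <= hour <= 23:
        problems.append(f"hourOfTheDay out of range: {hour!r}")
    for index, person in enumerate(people):
        missing = REQUIRED_PERSON_KEYS - person.keys()
        if missing:
            problems.append(f"person {index} is missing {sorted(missing)}")
    return problems


# Hypothetical sample with a deliberate mismatch in the people count.
sample = {
    "numberOfPeople": 2,
    "atmosphere": "busy",
    "hourOfTheDay": 17,
    "people": [
        {"position": "sitting", "age": 12, "activity": "reading", "gender": "male"}
    ],
}
for problem in check_analysis(sample):
    print(problem)
```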

In [11]:
# Alert service prompt

alert_sys_prompt = "You are an experienced first-aid paramedic."
alert_prompt= """From the following scene analysis, given to you in JSON format, 
determine whether anyone might be in danger and whether the Child Hospital or the normal Hospital should be alerted. 
Give a concise answer.
The situation is given to you from this object: """ + str(output_image_analysis)


In [12]:
promptLLM(prompt = alert_prompt, sysprompt= alert_sys_prompt) 

'In the given scene, there is one individual in danger: a 15-year-old male who is lying down and unconscious. This situation requires immediate attention. \n\nThe Child Hospital should be alerted due to the age of the individual in danger.'
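The alert prompt embeds the analysis with `str(...)`, which produces a Python-style representation with single quotes rather than JSON. The model coped with it here, but as a small sketch, `json.dumps` gives text that really is JSON, matching what the prompt announces. The function name `build_alert_prompt` and the sample dictionary are hypothetical.

```python
import json

# Hypothetical, shortened stand-in for output_image_analysis.
analysis = {
    "numberOfPeople": 1,
    "people": [{"position": "lying down", "age": 15, "activity": "unconscious"}],
}

as_repr = str(analysis)                    # Python repr: single quotes, not JSON
as_json = json.dumps(analysis, indent=2)   # valid JSON text


def build_alert_prompt(scene: dict) -> str:
    """Build the alert prompt with the scene serialized as real JSON."""
    return (
        "Decide from the following JSON scene analysis whether anyone might be "
        "in danger and which hospital should be alerted.\n"
        + json.dumps(scene, indent=2)
    )


print(build_alert_prompt(analysis))
```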

In [14]:
promptLLM(prompt = "Considering the image analysis given " + str(output_image_analysis) + " give me back the coordinates of the 15-year-old. If these are not available, infer them from the picture.", sysprompt= alert_sys_prompt) 

'Based on the provided information, there are no specific coordinates given for the individuals in the image. However, I can help you infer the likely coordinates for the 15-year-old male who is lying down and unconscious.\n\nIn a busy urban environment at 17:00, the 15-year-old might be located in a public area such as a park, sidewalk, or near a building. If we assume a typical layout, we can estimate the coordinates based on common spatial arrangements in such environments.\n\nFor example, if we consider a hypothetical coordinate system where:\n- The x-axis represents the width of the area (0 to 100 meters)\n- The y-axis represents the length of the area (0 to 100 meters)\n\nWe might infer the coordinates of the unconscious 15-year-old male to be around (50, 50), assuming he is in the center of a busy area where people are likely to gather.\n\nPlease note that this is purely an estimation and should not be used for any real-life application. In an actual emergency, it is crucial to 

In [15]:
promptLLM(prompt =  "Detect if there is a person who is under 18 years old on the floor and return its coordinates as a list in the format '[ymin, xmin, ymax, xmax]'. Just output the list.", sysprompt= alert_sys_prompt, image = encode_image(img)) 

'[400, 600, 500, 700]'


# 2. Google VLM (Gemini)
This section demonstrates the use of Google's Vision Language Model, Gemini. 
We explore basic text generation as well as its ability to analyze images and provide relevant outputs.

**Support Material**:
- https://colab.research.google.com/drive/1eDvf_Ky9jLOZFShgHrm4GI-wkAaQnue6?usp=sharing


In [16]:
%matplotlib inline
import os
from dotenv import load_dotenv  
import google.generativeai as genai
from PIL import Image

load_dotenv()
#genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))



In [17]:
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Explain how AI works")
print(response.text)


AI works by mimicking human intelligence processes through algorithms and statistical models.  There's no single "how it works" because different AI approaches employ different techniques. However, some common underlying principles include:

**1. Data Collection and Preparation:**  AI systems learn from data. This data needs to be collected, cleaned (removing errors and inconsistencies), and prepared (e.g., transformed into a format the algorithm understands).  The quality and quantity of data are crucial to the AI's performance.

**2. Algorithm Selection:**  Different algorithms are suited to different tasks.  Key categories include:

* **Machine Learning (ML):**  This is a broad category where systems learn from data without explicit programming.  ML algorithms identify patterns and relationships in data to make predictions or decisions.  Subcategories include:
    * **Supervised Learning:** The algorithm learns from labeled data (data where the desired output is already known). Exam

In [18]:
im = Image.open(img)

genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
model = genai.GenerativeModel("gemini-1.5-pro")

response = model.generate_content([
    im,
    (
        "Detect if there is a person who is under 18 years old on the floor and return its coordinates as a list in the format '[ymin, xmin, ymax, xmax]'. Just output the list.\n "
    ),
])
response.resolve()
print(response.text)

[693,327,964,625]


Gemini can be used to predict bounding boxes from free-form text queries.
The model can be prompted to return the boxes in a variety of formats (dictionary, list, etc.), so the reply usually needs to be parsed before use.
See: https://colab.research.google.com/drive/1eDvf_Ky9jLOZFShgHrm4GI-wkAaQnue6?usp=sharing#scrollTo=WFLDgSztv77H
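A list reply such as the one printed above can be parsed with `json.loads` and converted to pixel coordinates. The sketch below assumes the values are normalized to a 0 to 1000 grid, which is how the Gemini documentation describes its bounding boxes as far as I know; verify this for the model version you use. The function name `box_to_pixels` and the image size in the demo are made up.

```python
import json


def box_to_pixels(box_text: str, width: int, height: int, scale: int = 1000) -> tuple:
    """Convert '[ymin, xmin, ymax, xmax]' on a 0..scale grid to pixel coordinates.

    Returns (left, top, right, bottom), the order PIL drawing functions expect.
    """
    ymin, xmin, ymax, xmax = json.loads(box_text)
    return (
        round(xmin * width / scale),
        round(ymin * height / scale),
        round(xmax * width / scale),
        round(ymax * height / scale),
    )


# Hypothetical image size; in the notebook it would come from im.size.
pixel_box = box_to_pixels("[693,327,964,625]", width=2000, height=1000)
print(pixel_box)
```

With a PIL image, the resulting tuple could then be passed to `ImageDraw.Draw(im).rectangle(...)` to visualize the detection.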
