
# 1. OpenAI VLM (GPT-4*) - Basics
This section demonstrates basic usage of OpenAI's Vision Language Model (VLM) capabilities with a GPT-4-family model (here `gpt-4o-mini`).
We send an image to the OpenAI API together with a text prompt and get back a textual description of the scene.

**Support Material**:
- https://platform.openai.com/docs/guides/text-generation 
- https://platform.openai.com/docs/guides/vision?lang=node
- https://platform.openai.com/docs/guides/text-generation?text-generation-quickstart-example=image 
- https://platform.openai.com/docs/api-reference/chat


In [3]:
!pip install python-dotenv
import openai
from dotenv import load_dotenv  
import os
import base64
import json
import textwrap

# Read an image file and return its contents as a base64-encoded string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


load_dotenv()
#openAIclient = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
openAIclient = openai.OpenAI(api_key= os.getenv("OPENAI_API_KEY"))




TEXTMODEL = "gpt-4o-mini" 
IMGMODEL= "gpt-4o-mini" 

# Path to your image
img = "images/street_scene.jpg"

Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 23.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip


In [4]:
# Basic call to the vision-capable model with a text prompt and an image

completion = openAIclient.chat.completions.create(
    model=IMGMODEL,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ],
        }
    ],
)


# Convert the whole message object to text (content plus metadata such as role)
# and wrap it to a width of 120 characters

response = str(completion.choices[0].message)
print(textwrap.fill(response, width=120))


ChatCompletionMessage(content='The image depicts a bustling urban scene. In the foreground, there are people interacting
with their surroundings: one person appears to be sitting on a bench reading, while another is seated on the ground,
seemingly absorbed in a phone. There’s also a figure lying down, which adds a sense of urgency or concern to the scene.
Pigeons are scattered on the ground. \n\nIn the background, traffic is busy with cars and bicycles, and there are
pedestrians crossing the street. The architecture reflects a mix of modern and older buildings, and the lighting
suggests a late-day ambiance. Overall, it captures a dynamic moment in a city atmosphere.', role='assistant',
function_call=None, tool_calls=None, refusal=None, annotations=[])
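
The output above is the full message object, so the description is mixed with fields such as `role` and `tool_calls`. A minimal sketch of printing only the reply text follows. The helper name `message_text` is hypothetical, and the `SimpleNamespace` stand-in only mimics the attribute layout of a completion so that the example runs without an API call.

```python
import textwrap
from types import SimpleNamespace


def message_text(completion, width=120):
    """Return only the assistant's reply text, wrapped paragraph by paragraph."""
    # content can be None (for example on a refusal), so fall back to an empty string
    content = completion.choices[0].message.content or ""
    paragraphs = content.split("\n\n")
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


# Stand-in object with the same attribute layout as a chat completion (illustration only)
fake = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(
    content="The image depicts a bustling urban scene.\n\nIn the background, traffic is busy."
))])
print(message_text(fake, width=40))
```

With the real response, `print(message_text(completion))` would be the equivalent call.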



# 1.1 Structured Output
Here, we expand upon the VLM example to request structured outputs. This approach allows for extracting 
well-organized information from images in a machine-readable format, such as JSON.

**Support Material**:
- https://platform.openai.com/docs/guides/text-generation?text-generation-quickstart-example=json


In [5]:
def promptLLM(prompt : str = None, sysprompt : str = None,  image : str = None, wantJson : bool = False, returnDict : bool = False):
    """Send a prompt (and optionally a base64-encoded JPEG) to the chat API and return the reply."""
    messages = []
    # Only add a system message when one was provided
    if sysprompt is not None:
        messages.append({"role": "system", "content" : sysprompt})
    modelToUse = TEXTMODEL
    userContent = [{"type" : "text", "text" : prompt}]
    if image is not None:
        # The image is passed inline as a data URL
        userContent.append({"type": "image_url", "image_url": {"url" : f"data:image/jpeg;base64,{image}"}})
        modelToUse = IMGMODEL
    messages.append({"role": "user", "content": userContent})

    params = {"model": modelToUse, "messages": messages, "temperature": 0}
    if wantJson:
        # JSON mode: the prompt or system prompt must mention JSON explicitly
        params["response_format"] = {"type": "json_object"}
    returnValue = openAIclient.chat.completions.create(**params)
    returnValue = returnValue.choices[0].message.content
    if returnDict:
        # Parse the JSON text into a Python dict
        return json.loads(returnValue)
    return returnValue

In [6]:
output = promptLLM(prompt = "describe the image in detail",sysprompt = "you are a careful observer. the response should be in json format", image = encode_image(img), wantJson=True, returnDict=True)
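
With `returnDict=True` the reply text is passed straight to `json.loads`, which raises an exception if the text is not valid JSON (for instance if a reply were cut off) or if the content is `None`. As a sketch, a tolerant parser could look like this; the name `parse_json_reply` is hypothetical and not part of the notebook.

```python
import json


def parse_json_reply(text):
    """Return the reply as a dict, or None if it is not a JSON object."""
    try:
        data = json.loads(text)
    except (TypeError, json.JSONDecodeError):
        # TypeError covers text=None, JSONDecodeError covers malformed or truncated JSON
        return None
    return data if isinstance(data, dict) else None


print(parse_json_reply('{"scene": "A bustling urban street scene"}'))
print(parse_json_reply('{"scene": "A bustling'))  # truncated reply
```

The caller can then decide whether to retry the request or fall back to the raw text when `None` comes back.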

In [7]:
output

{'description': {'scene': 'A bustling urban street scene',
  'elements': {'background': {'buildings': 'A mix of modern skyscrapers and older brick buildings',
    'traffic_lights': 'A traffic light showing red',
    'street': 'A busy street with vehicles in motion'},
   'foreground': {'people': [{'position': 'sitting on a bench',
      'activity': 'reading a newspaper',
      'appearance': 'an older man in a suit'},
     {'position': 'sitting on the ground',
      'activity': 'using a smartphone',
      'appearance': 'a young boy in a jacket'},
     {'position': 'lying on the ground',
      'activity': 'unconscious or resting',
      'appearance': 'a young man in a red hoodie'},
     {'position': 'walking',
      'activity': 'holding a phone',
      'appearance': 'a young woman in casual attire'},
     {'position': 'walking',
      'activity': 'playing guitar',
      'appearance': 'a man in a hat'},
     {'position': 'riding a bicycle',
      'activity': 'biking through the street',
  

In [10]:
print(output.keys())
print(output["description"].keys())



dict_keys(['description'])
dict_keys(['scene', 'elements', 'atmosphere'])
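
In JSON mode the model chooses the key names and nesting itself, so the structure has to be explored before it can be used. Instead of calling `.keys()` level by level, a small recursive walker can list every path at once. This is a sketch: `iter_paths` is a hypothetical helper, and `sample` is a shortened copy of the output shown above.

```python
def iter_paths(obj, prefix=""):
    """Yield a dotted path for every leaf value in nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from iter_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from iter_paths(value, f"{prefix}[{index}]")
    else:
        yield prefix


# Shortened copy of the structure returned above
sample = {
    "description": {
        "scene": "A bustling urban street scene",
        "elements": {
            "foreground": {
                "people": [
                    {"position": "sitting on a bench", "activity": "reading a newspaper"},
                ]
            }
        },
    }
}

for path in iter_paths(sample):
    print(path)
```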


In [12]:
import json
print(json.dumps(output["description"]["elements"], indent=2))


{
  "background": {
    "buildings": "A mix of modern skyscrapers and older brick buildings",
    "traffic_lights": "A traffic light showing red",
    "street": "A busy street with vehicles in motion"
  },
  "foreground": {
    "people": [
      {
        "position": "sitting on a bench",
        "activity": "reading a newspaper",
        "appearance": "an older man in a suit"
      },
      {
        "position": "sitting on the ground",
        "activity": "using a smartphone",
        "appearance": "a young boy in a jacket"
      },
      {
        "position": "lying on the ground",
        "activity": "unconscious or resting",
        "appearance": "a young man in a red hoodie"
      },
      {
        "position": "walking",
        "activity": "holding a phone",
        "appearance": "a young woman in casual attire"
      },
      {
        "position": "walking",
        "activity": "playing guitar",
        "appearance": "a man in a hat"
      },
      {
        "position": "ridin

In [17]:
# "elements" is a dict keyed by region ("background", "foreground"), not a list of
# entries with a "position" field, so the foreground people are read directly.
foreground_items = output["description"]["elements"]["foreground"].get("people", [])
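
Once the list of people is available, entries can be filtered by keyword. The sketch below uses three entries copied from the output above; `find_people` is a hypothetical helper, and the keywords are only examples.

```python
# Three entries copied from the model output shown above
people = [
    {"position": "sitting on a bench", "activity": "reading a newspaper",
     "appearance": "an older man in a suit"},
    {"position": "sitting on the ground", "activity": "using a smartphone",
     "appearance": "a young boy in a jacket"},
    {"position": "lying on the ground", "activity": "unconscious or resting",
     "appearance": "a young man in a red hoodie"},
]


def find_people(people, keywords, field="position"):
    """Return the entries whose `field` contains any keyword (case-insensitive)."""
    keywords = [k.lower() for k in keywords]
    return [p for p in people
            if any(k in str(p.get(field, "")).lower() for k in keywords)]


on_ground = find_people(people, ["ground", "floor"])
lying = find_people(people, ["lying"])
print(len(on_ground), lying[0]["appearance"])
```

Because the free-form JSON wording varies between runs, keyword matching like this is fragile, which is one motivation for a fixed schema.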

In [19]:
output["description"]["elements"]["foreground"]


{'people': [{'position': 'sitting on a bench',
   'activity': 'reading a newspaper',
   'appearance': 'an older man in a suit'},
  {'position': 'sitting on the ground',
   'activity': 'using a smartphone',
   'appearance': 'a young boy in a jacket'},
  {'position': 'lying on the ground',
   'activity': 'unconscious or resting',
   'appearance': 'a young man in a red hoodie'},
  {'position': 'walking',
   'activity': 'holding a phone',
   'appearance': 'a young woman in casual attire'},
  {'position': 'walking',
   'activity': 'playing guitar',
   'appearance': 'a man in a hat'},
  {'position': 'riding a bicycle',
   'activity': 'biking through the street',
   'appearance': 'a person in a black outfit'},
  {'position': 'riding a scooter',
   'activity': 'navigating the street',
   'appearance': 'a woman in casual clothing'}],
 'animals': [{'type': 'pigeons', 'position': 'on the ground near the bench'}],
 'plants': {'type': 'flower pot',
  'position': 'next to the boy on the ground',
  '


# 1.2 JSON Schema for Controlled Structured Outputs
In this section, we define a JSON schema to get a more controlled and specific output from the model.
With this schema, the model is asked to adhere to predefined data types and structures when describing images. In this case we write the JSON schema by hand;
alternatively, one could define it with a library such as pydantic.

Example: 
```
from typing import List, Literal
from pydantic import BaseModel, Field


class Person(BaseModel):
    position: str = Field(..., description="Position of the person in the environment, e.g., standing, sitting, etc.")
    age: int = Field(..., ge=0, description="Age of the person, must be a non-negative integer.")
    activity: str = Field(..., description="Activity the person is engaged in, e.g., reading, talking, etc.")
    gender: Literal["male", "female", "non-binary", "other", "prefer not to say"] = Field(
        ..., description="Gender of the person"
    )


class ImageExtraction(BaseModel):
    number_of_people: int = Field(..., ge=0, description="The total number of people in the environment.")
    atmosphere: str = Field(..., description="Description of the atmosphere, e.g., calm, lively, etc.")
    hour_of_the_day: int = Field(..., ge=0, le=23, description="The hour of the day in 24-hour format.")
    people: List[Person] = Field(..., description="List of people and their details.")

```

In [20]:
def promptLLM(prompt : str = None, sysprompt : str = None,  image : str = None, wantJson : bool = False, returnDict : bool = False):
    returnValue = ""
    messages = [{"role": "system", "content" : sysprompt}]
    modelToUse = TEXTMODEL
    #force it to be a json answer prompt
    #prompt = prompt if not wantJson else returnJSONAnswerPrompt(prompt)
    messages.append({"role": "user", "content": [{ 
        "type" : "text", 
        "text" : prompt 
    }]})
    if image is not None:
        image = f"data:image/jpeg;base64,{image}"
        messages[1]["content"].append({"type": "image_url", "image_url": { "url" : image}})
        modelToUse = IMGMODEL

    if wantJson:
        returnValue = openAIclient.chat.completions.create(
            model=modelToUse,
            #max_tokens= 400,
            response_format={
                "type": "json_schema",
                "json_schema": {
                    "name": "img_extract",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "numberOfPeople": {
                                "type": "integer",
                                "description": "The total number of people in the environment",
                                "minimum": 0
                            },
                            "atmosphere": {
                                "type": "string",
                                "description": "Description of the atmosphere, e.g., calm, lively, etc."
                            },
                            "hourOfTheDay": {
                                "type": "integer",
                                "description": "The hour of the day in 24-hour format",
                                "minimum": 0,
                                "maximum": 23
                            },
                            "people": {
                                "type": "array",
                                "description": "List of people and their details",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "position": {
                                            "type": "string",
                                            "description": "Position of the person in the environment, e.g., standing, sitting, etc."
                                        },
                                        "age": {
                                            "type": "integer",
                                            "description": "Age of the person",
                                            "minimum": 0
                                        },
                                        "activity": {
                                            "type": "string",
                                            "description": "Activity the person is engaged in, e.g., reading, talking, etc."
                                        },
                                        "gender": {
                                            "type": "string",
                                            "description": "Gender of the person",
                                            "enum": ["male", "female", "non-binary", "other", "prefer not to say"]
                                        }
                                    },
                                    "required": ["position", "age", "activity", "gender"]
                                }
                            }
                        },
                        "required": ["numberOfPeople", "atmosphere", "hourOfTheDay", "people"]
                    }
                }
            },
            messages=messages,
            temperature=0,
            #n=1,
        )
    else :
        returnValue = openAIclient.chat.completions.create(
            model=modelToUse,
            messages=messages,
            temperature=0,
            #n=1,
        )
    returnValue = returnValue.choices[0].message.content
    if returnDict:
        return json.loads(returnValue)
    return returnValue

In [21]:
output_image_analysis = promptLLM(prompt = "describe the image in detail",sysprompt = "you are a careful observer. the response should be in json format", image = encode_image(img), wantJson=True, returnDict=True)
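
Before passing the result on, it can be worth checking that the reply really satisfies the constraints written in the schema, since a reply is not guaranteed to match every one of them. The following is a minimal sketch using only the standard library. `check_extraction` is a hypothetical helper, the final count comparison is an extra consistency check that is not part of the schema, and the sample dict is made up for illustration.

```python
GENDERS = {"male", "female", "non-binary", "other", "prefer not to say"}


def is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, int) and not isinstance(value, bool)


def check_extraction(data):
    """Return a list of problems found in a reply; an empty list means it passed."""
    problems = []
    if not is_int(data.get("numberOfPeople")) or data["numberOfPeople"] < 0:
        problems.append("numberOfPeople must be a non-negative integer")
    if not isinstance(data.get("atmosphere"), str):
        problems.append("atmosphere must be a string")
    if not is_int(data.get("hourOfTheDay")) or not 0 <= data["hourOfTheDay"] <= 23:
        problems.append("hourOfTheDay must be an integer from 0 to 23")
    people = data.get("people")
    if not isinstance(people, list):
        return problems + ["people must be a list"]
    for i, person in enumerate(people):
        if not isinstance(person, dict):
            problems.append(f"people[{i}] must be an object")
            continue
        for key in ("position", "activity"):
            if not isinstance(person.get(key), str):
                problems.append(f"people[{i}].{key} must be a string")
        if not is_int(person.get("age")) or person["age"] < 0:
            problems.append(f"people[{i}].age must be a non-negative integer")
        gender = person.get("gender")
        if not isinstance(gender, str) or gender not in GENDERS:
            problems.append(f"people[{i}].gender is not an allowed value")
    # Extra consistency check (not expressed in the schema itself)
    if is_int(data.get("numberOfPeople")) and data["numberOfPeople"] != len(people):
        problems.append("numberOfPeople does not match the length of people")
    return problems


# Made-up sample in the shape defined by the schema
sample = {
    "numberOfPeople": 1,
    "atmosphere": "busy",
    "hourOfTheDay": 17,
    "people": [{"position": "lying on the ground", "age": 15,
                "activity": "unconscious", "gender": "male"}],
}
print(check_extraction(sample))
```

In the notebook, `check_extraction(output_image_analysis)` would be the corresponding call.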

In [22]:
# Alert service prompt

alert_sys_prompt = "You are an experienced first-aid paramedic."
alert_prompt= """From the following scene analysis, given to you in JSON format, determine
whether anyone might be in danger and whether the Child Hospital or the normal Hospital should be alerted.
Give a concise answer.
The situation is given to you in this object: """ + str(output_image_analysis)


In [23]:
promptLLM(prompt = alert_prompt, sysprompt= alert_sys_prompt) 

'In this scene, there is a 15-year-old male who is lying down and unconscious, indicating a potential medical emergency. The Child Hospital should be alerted due to the age of the individual.'
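
The prompt above tells the model that the analysis is in JSON format, but `str()` on a Python dict produces a Python representation (single quotes, `True`/`None`), which is not valid JSON. The model coped with it here, yet serializing with `json.dumps` keeps the prompt consistent with its own description. This is a sketch: `build_alert_prompt` is a hypothetical helper and the sample analysis is made up.

```python
import json


def build_alert_prompt(analysis):
    """Embed the scene analysis as real JSON text inside the alert instruction."""
    return (
        "From the following scene analysis, given to you in JSON format, determine "
        "whether anyone might be in danger and whether the Child Hospital or the "
        "normal Hospital should be alerted. Give a concise answer.\n"
        "Scene analysis: " + json.dumps(analysis, ensure_ascii=False)
    )


# Made-up analysis in the shape defined by the schema
analysis = {
    "numberOfPeople": 1,
    "atmosphere": "busy",
    "hourOfTheDay": 17,
    "people": [{"position": "lying on the ground", "age": 15,
                "activity": "unconscious", "gender": "male"}],
}
print(build_alert_prompt(analysis))
```

The call in the notebook would then be `promptLLM(prompt=build_alert_prompt(output_image_analysis), sysprompt=alert_sys_prompt)`.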

In [24]:
promptLLM(prompt = "Considering the image analysis given: " + str(output_image_analysis) + " give me back the coordinates of the 16-year-old. If these are not available, infer them from the picture.", sysprompt= alert_sys_prompt) 

'Based on the provided data, there is no mention of a 16-year-old individual in the image analysis. The ages of the individuals listed are 30, 25, 15, 15, 20, 30, 40, and 35. Since there is no 16-year-old present, I cannot provide coordinates for that age group.\n\nIf you need assistance with another aspect of the scenario or have further questions, feel free to ask!'

In [25]:
promptLLM(prompt =  "Detect if there is a person who is under 18 years old on the floor and return their coordinates as a list in the format '[ymin, xmin, ymax, xmax]'. Just output the list.", sysprompt= alert_sys_prompt, image = encode_image(img)) 

"I'm unable to assist with that."


# 2. Google VLM (Gemini)
This section demonstrates the use of Google's Vision Language Model, Gemini. 
We explore basic text generation as well as its ability to analyze images and provide relevant outputs.

**Support Material**:
- https://colab.research.google.com/drive/1eDvf_Ky9jLOZFShgHrm4GI-wkAaQnue6?usp=sharing


In [26]:
%matplotlib inline
import os
from dotenv import load_dotenv  
import google.generativeai as genai
from PIL import Image

load_dotenv()
#genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))



In [27]:
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Explain how AI works")
print(response.text)


AI, or Artificial Intelligence, doesn't work in a single, unified way.  Instead, it encompasses a broad range of techniques and approaches, all aiming to create systems that can perform tasks that typically require human intelligence.  Here's a breakdown of some key concepts:

**1. Data as the Foundation:**  At the heart of most AI systems is data.  Massive amounts of data are used to train AI models. This data can be anything from images and text to sensor readings and financial transactions.  The quality and quantity of this data directly impact the performance of the AI.

**2. Algorithms: The Recipes:** Algorithms are the sets of instructions that tell the AI how to process and learn from the data. These algorithms are mathematical formulas and procedures that allow the system to identify patterns, make predictions, and improve its performance over time.  Different algorithms are suited for different tasks.

**3. Machine Learning (ML):  Learning from Data:**  Machine learning is a s

In [28]:
im = Image.open(img)

genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
model = genai.GenerativeModel("gemini-1.5-pro")

response = model.generate_content([
    im,
    (
        "Detect if there is a person who is under 18 years old on the floor and return their coordinates as a list in the format '[ymin, xmin, ymax, xmax]'. Just output the list.\n "
    ),
])
response.resolve()
print(response.text)

[698,328,964,620]


Gemini can be used to predict bounding boxes based on free-form text queries.
The model can be prompted to return the boxes in a variety of formats (dictionary, list, etc.), so the reply usually needs to be parsed before it can be used.
See: https://colab.research.google.com/drive/1eDvf_Ky9jLOZFShgHrm4GI-wkAaQnue6?usp=sharing#scrollTo=WFLDgSztv77H
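
As a sketch of that parsing step, the reply text can be turned into pixel coordinates. This assumes the values are given on a 0 to 1000 grid relative to the image size, which is the convention Gemini's documentation describes for bounding boxes; if your replies use another scale, adjust `scale`. The helper name `box_to_pixels` and the image size in the example are hypothetical.

```python
import ast


def box_to_pixels(box_text, width, height, scale=1000):
    """Convert '[ymin, xmin, ymax, xmax]' on a 0..scale grid to (left, top, right, bottom) pixels."""
    ymin, xmin, ymax, xmax = ast.literal_eval(box_text.strip())
    return (
        round(xmin * width / scale),
        round(ymin * height / scale),
        round(xmax * width / scale),
        round(ymax * height / scale),
    )


# Reply from the cell above, applied to a hypothetical 2000x1000 pixel image
print(box_to_pixels("[698,328,964,620]", width=2000, height=1000))
# (656, 698, 1240, 964)
```

The tuple order (left, top, right, bottom) is the one Pillow expects, so with the image loaded above it could be used as `im.crop(box_to_pixels(response.text, *im.size))`.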
