
# 1. OpenAI VLM (GPT) - Basics
This section demonstrates the basic Vision Language Model (VLM) capabilities of OpenAI's GPT models, using `gpt-4.1-mini`.
We use the OpenAI API to analyze an image and return a detailed textual description.

**Support Material**

- https://platform.openai.com/docs/quickstart
- https://platform.openai.com/docs/guides/text
- https://platform.openai.com/docs/guides/images-vision?api-mode=chat
- https://platform.openai.com/docs/guides/structured-outputs


In [3]:
import openai
from dotenv import load_dotenv  
import base64
import json
import textwrap

# Encode the image file as a base64 string so it can be embedded in a data URL
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


load_dotenv()
openAIclient = openai.OpenAI()


# Path to your image
img = "images/street_scene.jpg"
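The calls in this notebook hardcode `image/jpeg` in the data URL. As a minimal sketch, assuming you also want to send PNG or other image types, the MIME type could be guessed from the file extension instead. The helper names `to_data_url` and `image_file_to_data_url` are hypothetical and not part of the OpenAI SDK.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Build a base64 data URL, guessing the MIME type from the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_file_to_data_url(image_path: str) -> str:
    """Read an image file from disk and return it as a data URL."""
    path = Path(image_path)
    return to_data_url(path.read_bytes(), path.name)
```

With this helper, `image_file_to_data_url(img)` could replace the f-string that is passed as `url` in the requests.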




In [4]:
# Basic call to GPT with a text prompt and an image

completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ],
        }
    ],
)


# Wrap the text to a specified width

response = str(completion.choices[0].message.content)
print(textwrap.fill(response, width=120))


The image depicts a bustling city street scene during what appears to be late afternoon or early evening. Various
elements and activities are present:  - Several people are engaged in different activities on and around a crosswalk.
There is a person lying on the ground wearing a red jacket and a beanie, a young person sitting nearby using a
smartphone or tablet, and pigeons scattered on the pavement around them. - On a wooden bench near the crosswalk, there
is a woman reading a newspaper and an elderly man who seems to be resting or thinking. - Another young woman is walking
by the bench, seemingly looking at her phone or a small device. - In the background, two motorcyclists (one riding a
bike and the other on a scooter) are on the street, and a man is walking across the street playing a guitar. - Several
cars are driving past, and a taxi is noticeable in the foreground on the left. - The street is lined with tall
buildings, some modern skyscrapers and others more classic brick struct
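The commented-out `"detail": "low"` field in the request controls how much image resolution the model works with. The OpenAI vision guide lists `low`, `high`, and `auto` as values. The sketch below, with the hypothetical helper name `build_vision_messages`, shows one way to build the `messages` payload with that field set explicitly.

```python
def build_vision_messages(prompt: str, data_url: str, detail: str = "auto") -> list:
    """Build a chat `messages` list with one text part and one image part."""
    if detail not in {"low", "high", "auto"}:
        raise ValueError("detail must be 'low', 'high', or 'auto'")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url, "detail": detail}},
            ],
        }
    ]


# Placeholder data URL; in the notebook it would come from encode_image(img)
messages = build_vision_messages(
    "What's in this image?", "data:image/jpeg;base64,AAAA", detail="low"
)
print(messages[0]["content"][1]["image_url"]["detail"])
```

The resulting list can be passed as `messages=` to `chat.completions.create`. A lower detail level generally trades accuracy on small objects for fewer input tokens.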


# 1.1 Structured Output
Here, we expand upon the VLM example to request structured outputs. This approach allows for extracting 
well-organized information from images in a machine-readable format, such as JSON.

**Support Material**:
- https://platform.openai.com/docs/guides/structured-outputs


In [21]:
completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        {"role": "user", "content": [
                {"type": "text", "text": "Describe the image in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ]}
    ],
    response_format={ "type": "json_object" },# NEW!!
    temperature = 0
)

returnValue = completion.choices[0].message.content


We parse the JSON string into a Python dict:

In [27]:
output = json.loads(returnValue)
# json.loads() converts a JSON string into Python objects
print(output.keys())

dict_keys(['scene', 'time_of_day', 'background', 'street', 'people', 'additional_elements'])


Now we can access specific fields:

In [43]:
print(output["people"])



{'foreground': [{'person': 'Young individual sitting on the sidewalk near a flower pot', 'activity': 'Using a smartphone or tablet', 'clothing': 'Green jacket and shorts'}, {'person': 'Young individual lying on the sidewalk', 'clothing': 'Red jacket, blue jeans, black shoes', 'position': 'Lying flat with one arm bent'}, {'person': 'Older man sitting on a wooden bench', 'clothing': 'Gray suit, white shirt', 'activity': 'Resting head on hand, looking thoughtful'}, {'person': 'Young woman sitting on the same bench', 'clothing': 'Red striped blouse, blue jeans', 'activity': 'Reading a newspaper'}, {'person': 'Young woman walking on the sidewalk', 'clothing': 'Pink top, denim shorts, white sneakers', 'activity': 'Looking at a smartphone'}], 'midground': [{'person': 'Man walking across the street playing an acoustic guitar', 'clothing': 'Black jacket, black pants, black cap'}], 'pigeons': {'count': 7, 'location': 'Scattered on the sidewalk near the people'}}
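With `json_object` mode the model chooses the structure itself, so the keys can change from run to run. As a sketch, assuming an output shaped like the one above (groups such as `foreground` and `midground` holding lists of person entries), the nested groups could be flattened defensively. The helper name `collect_people` is hypothetical.

```python
def collect_people(people_section: dict) -> list:
    """Flatten groups such as 'foreground' and 'midground' into one list of person dicts."""
    people = []
    for group_name, group in people_section.items():
        if not isinstance(group, list):
            continue  # skip non-list entries such as the 'pigeons' summary
        for entry in group:
            if isinstance(entry, dict) and "person" in entry:
                people.append({"group": group_name, **entry})
    return people


# Shortened excerpt of the output shown above
sample = {
    "foreground": [
        {"person": "Older man sitting on a wooden bench",
         "activity": "Resting head on hand, looking thoughtful"},
        {"person": "Young woman sitting on the same bench",
         "activity": "Reading a newspaper"},
    ],
    "midground": [
        {"person": "Man walking across the street playing an acoustic guitar"},
    ],
    "pigeons": {"count": 7, "location": "Scattered on the sidewalk near the people"},
}

for person in collect_people(sample):
    print(person["group"], "-", person["person"], "-", person.get("activity", "n/a"))
```

Using `.get()` with a default avoids a `KeyError` when the model omits a field for some entries, as it does for `activity` in the midground entry.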



# JSON Schema for Controlled Structured Outputs
In this section, we define a JSON schema to get a more controlled and specific output from the model.
With this schema, we can ask the model to adhere to predefined data types and structures while describing images. In this case, we provide the JSON schema directly in the request.



In [42]:
completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        {"role": "user", "content": [
                {"type": "text", "text": "Describe the image in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ]}
    ],
    response_format={
                "type": "json_schema",    
                "json_schema": {
                    "name": "img_extract",
                    "schema": {
                    "type": "object",
                    "properties": {
                        "numberOfPeople": {
                        "type":"integer",
                        "description": "The total number of people in the environment",
                        "minimum": 0
                        },
                        "atmosphere": {
                        "type": "string",
                        "description": "Description of the atmosphere, e.g., calm, lively, etc."
                        },
                        "hourOfTheDay": {
                        "type": "integer",
                        "description": "The hour of the day in 24-hour format",
                        "minimum": 0,
                        "maximum": 23
                        },
                        "people": {
                        "type": "array",
                        "description": "List of people and their details",
                        "items": {
                            "type": "object",
                            "properties": {
                            "position": {
                                "type": "string",
                                "description": "Position of the person in the environment, e.g., standing, sitting, etc."
                            },
                            "age": {
                                "type": "integer",
                                "description": "Age of the person",
                                "minimum": 0
                            },
                            "activity": {
                                "type": "string",
                                "description": "Activity the person is engaged in, e.g., reading, talking, etc."
                            },
                            "gender": {
                                "type": "string",
                                "description": "Gender of the person",
                                "enum": ["male", "female", "non-binary", "other", "prefer not to say"]
                            }
                            },
                            "required": ["position", "age", "activity", "gender"]
                        }
                        }
                    },
                    "required": ["numberOfPeople", "atmosphere", "hourOfTheDay", "people"]
                    }}},
    temperature = 0
)

returnValue = completion.choices[0].message.content


In [44]:
output_image_extraction = json.loads(returnValue)


In [45]:
output_image_extraction["people"]

[{'position': 'sitting on the sidewalk',
  'age': 16,
  'activity': 'using a smartphone',
  'gender': 'male'},
 {'position': 'lying on the sidewalk',
  'age': 18,
  'activity': 'resting or sleeping',
  'gender': 'male'},
 {'position': 'sitting on a bench',
  'age': 65,
  'activity': 'thinking or resting',
  'gender': 'male'},
 {'position': 'sitting on a bench',
  'age': 25,
  'activity': 'reading a newspaper',
  'gender': 'female'},
 {'position': 'walking on the sidewalk',
  'age': 20,
  'activity': 'looking at a phone',
  'gender': 'female'},
 {'position': 'walking on the street',
  'age': 30,
  'activity': 'playing guitar',
  'gender': 'male'},
 {'position': 'riding a motorcycle',
  'age': 35,
  'activity': 'driving',
  'gender': 'male'},
 {'position': 'riding a scooter',
  'age': 28,
  'activity': 'driving',
  'gender': 'female'},
 {'position': 'inside a taxi',
  'age': 40,
  'activity': 'driving',
  'gender': 'male'},
 {'position': 'inside a taxi',
  'age': 38,
  'activity': 'passe
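As far as the OpenAI documentation describes it, exact schema adherence is only guaranteed when `strict` is enabled in the `json_schema` object, which the request above does not set. A lightweight sanity check on the parsed result can therefore be useful. The sketch below uses plain Python rather than a schema validation library; `check_extraction` and the example values are hypothetical.

```python
ALLOWED_GENDERS = {"male", "female", "non-binary", "other", "prefer not to say"}
REQUIRED_KEYS = ("numberOfPeople", "atmosphere", "hourOfTheDay", "people")


def check_extraction(data: dict) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]
    if problems:
        return problems
    if not 0 <= data["hourOfTheDay"] <= 23:
        problems.append("hourOfTheDay is outside 0-23")
    if data["numberOfPeople"] != len(data["people"]):
        problems.append("numberOfPeople does not match the length of people")
    for index, person in enumerate(data["people"]):
        if person.get("gender") not in ALLOWED_GENDERS:
            problems.append(f"people[{index}]: unexpected gender value")
        if person.get("age", -1) < 0:
            problems.append(f"people[{index}]: missing or negative age")
    return problems


# Hypothetical example in which the count and the list length disagree
example = {
    "numberOfPeople": 3,
    "atmosphere": "lively",
    "hourOfTheDay": 18,
    "people": [
        {"position": "sitting on a bench", "age": 65,
         "activity": "thinking or resting", "gender": "male"},
        {"position": "sitting on a bench", "age": 25,
         "activity": "reading a newspaper", "gender": "female"},
    ],
}
print(check_extraction(example))
```

In the notebook, the same function could be called as `check_extraction(output_image_extraction)` right after `json.loads`.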

As an alternative to writing the JSON schema by hand:

The OpenAI SDKs for Python and JavaScript also make it easy to define object schemas using Pydantic and Zod, respectively. Below, the same kind of schema is defined in code as Pydantic classes and passed as the response format.


```python
from pydantic import BaseModel


class Person(BaseModel):
    position: str
    age: int
    activity: str
    gender: str


class ImageExtraction(BaseModel):
    number_of_people: int
    atmosphere: str
    hour_of_the_day: int
    people: list[Person]


completion = openAIclient.beta.chat.completions.parse(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        {"role": "user", "content": [
            {"type": "text", "text": "Describe the image in detail"},
            # Attach the image, as in the previous calls
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(img)}"}},
        ]},
    ],
    response_format=ImageExtraction,
)

# .parsed is an ImageExtraction instance, not a dict: use attribute access (e.g. .people)
output_image_extraction = completion.choices[0].message.parsed
```


We can then pass the extracted information, in full or in part, into a new prompt for a follow-up analysis:

In [46]:
# Alert service prompt

alert_sys_prompt = "You are an experienced first-aid paramedic."
alert_prompt = """From the following scene analysis, given to you in JSON format, determine
whether anyone might be in danger and whether the Child Hospital or the normal Hospital should be alerted.
Give a concise answer.
The situation is given to you in this object: """ + str(output_image_extraction)


In [47]:

completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": alert_sys_prompt},
        {"role": "user", "content": alert_prompt}
    ],
)


# Wrap the text to a specified width

response = str(completion.choices[0].message.content)
print(textwrap.fill(response, width=120))

No one appears to be in immediate danger. There are no signs of injury or distress. Therefore, no hospital alert—neither
Child Hospital nor normal Hospital—is necessary.
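A minimal sketch of how the alert messages could be assembled, with the persona in a system message and the extraction serialized with `json.dumps` (which yields valid JSON) instead of `str()` (which yields a Python repr with single quotes). The function name `build_alert_messages` and the exact prompt wording are assumptions.

```python
import json

ALERT_SYSTEM_PROMPT = "You are an experienced first-aid paramedic."


def build_alert_messages(extraction: dict) -> list:
    """Return system and user messages for the alert request."""
    user_prompt = (
        "From the following scene analysis, given to you in JSON format, determine "
        "whether anyone might be in danger and whether the Child Hospital or the "
        "normal Hospital should be alerted. Give a concise answer.\n"
        "Scene analysis: " + json.dumps(extraction, ensure_ascii=False)
    )
    return [
        {"role": "system", "content": ALERT_SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]


# Tiny hypothetical extraction, only to show the resulting message structure
for message in build_alert_messages({"people": [{"age": 16, "position": "lying on the sidewalk"}]}):
    print(message["role"], "->", message["content"][:60])
```

The returned list can be passed as `messages=` to `chat.completions.create`.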


In [48]:
completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Considering this list of people: " + str(output_image_extraction["people"]) + ". Identify the youngest person in the picture I provide and give me back their coordinates. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ],
        }
    ],
)


# Wrap the text to a specified width

response = str(completion.choices[0].message.content)
print(textwrap.fill(response, width=120))

The youngest person in the list you provided is the 16-year-old male sitting on the sidewalk using a smartphone.  In the
picture, that corresponds to the young person sitting on the bottom left side, playing on a phone or device.  The
approximate coordinates of this person’s box_2d in [ymin, xmin, ymax, xmax] normalized to 0-1000 are: [700, 90, 990,
300]
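Because this reply is free text rather than JSON, the box has to be pulled out of the sentence before it can be used. A small sketch, assuming the coordinates appear as a bracketed list of four integers (the helper name `extract_box_2d` is hypothetical):

```python
import re


def extract_box_2d(text: str):
    """Return the first bracketed list of four integers found in text, or None."""
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    match = re.search(pattern, text)
    if match is None:
        return None
    return [int(value) for value in match.groups()]


reply = (
    "The approximate coordinates of this person's box_2d in [ymin, xmin, ymax, xmax] "
    "normalized to 0-1000 are: [700, 90, 990,\n300]"
)
print(extract_box_2d(reply))  # [700, 90, 990, 300]
```

The pattern tolerates the line break that `textwrap.fill` may insert inside the list, and it skips `[ymin, xmin, ymax, xmax]` because that bracket contains no digits.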



# 2. Google VLM (Gemini)
This section demonstrates the use of Google's Vision Language Model, Gemini. 
We explore basic text generation as well as its ability to analyze images and provide relevant outputs.

**Support Material**:
- https://ai.google.dev/gemini-api/docs/quickstart
- https://ai.google.dev/gemini-api/docs/text-generation
- https://ai.google.dev/gemini-api/docs/image-understanding
- https://ai.google.dev/gemini-api/docs/structured-output?example=recipe

In [49]:
%matplotlib inline
from dotenv import load_dotenv  
from google import genai
from PIL import Image
import textwrap

import json


load_dotenv()
client = genai.Client()

# Path to your image
img = "images/street_scene.jpg"

Basic call:

In [50]:
response = client.models.generate_content(
    model="gemini-2.5-flash", contents="Explain how AI works to a 90 years old. in few words"
)

print(textwrap.fill(response.text, width=120))

Imagine a very smart computer student. You show it **lots and lots of examples**, like thousands of cat pictures. It
**learns** from all those examples what makes a cat a cat. Then, it can **recognize new cats** or **help you make smart
decisions**, almost like thinking!


And with images:

In [51]:
im = Image.open(img)

response = client.models.generate_content(model="gemini-2.5-flash",
                                          contents=[im, "Describe the scene in details\n"],
                                          )

print(textwrap.fill(response.text, width=120))


This detailed urban scene captures a vibrant late afternoon or early evening in a bustling city, bathed in the warm,
golden light of the sun low on the horizon. The image is rich with activity, showcasing a mix of architecture, people,
and transportation.  **Overall Setting:** The scene is dominated by a wide city street flanked by a mix of historic
brick buildings and modern skyscrapers. The warm sunlight casts long shadows and highlights the textures of the
buildings and pavement. The atmosphere feels lively and dynamic, characteristic of a busy downtown area.
**Background:** In the distance, a dense urban skyline rises, featuring numerous tall buildings with glass and concrete
facades that reflect the golden light. A distinctive older building with a prominent steeple or dome stands out amidst
the modern high-rises, suggesting a historical core within the contemporary cityscape. The distant buildings fade
slightly into a soft, light haze, emphasizing depth.  **Midground (Street and 

Here too we can request structured output. The Gemini documentation favors Pydantic syntax for this; let's see what happens when we pass the same schema as before. Check the limitations in https://ai.google.dev/gemini-api/docs/structured-output?example=recipe

In [52]:
json_schema = {
                    "name": "img_extract",
                    "schema": {
                    "type": "object",
                    "properties": {
                        "numberOfPeople": {
                        "type":"integer",
                        "description": "The total number of people in the environment",
                        "minimum": 0
                        },
                        "atmosphere": {
                        "type": "string",
                        "description": "Description of the atmosphere, e.g., calm, lively, etc."
                        },
                        "hourOfTheDay": {
                        "type": "integer",
                        "description": "The hour of the day in 24-hour format",
                        "minimum": 0,
                        "maximum": 23
                        },
                        "people": {
                        "type": "array",
                        "description": "List of people and their details",
                        "items": {
                            "type": "object",
                            "properties": {
                            "position": {
                                "type": "string",
                                "description": "Position of the person in the environment, e.g., standing, sitting, etc."
                            },
                            "age": {
                                "type": "integer",
                                "description": "Age of the person",
                                "minimum": 0
                            },
                            "activity": {
                                "type": "string",
                                "description": "Activity the person is engaged in, e.g., reading, talking, etc."
                            },
                            "gender": {
                                "type": "string",
                                "description": "Gender of the person",
                                "enum": ["male", "female", "non-binary", "other", "prefer not to say"]
                            }
                            },
                            "required": ["position", "age", "activity", "gender"]
                        }
                        }
                    },
                    "required": ["numberOfPeople", "atmosphere", "hourOfTheDay", "people"]}}



config={
        "response_mime_type": "application/json",
        "response_json_schema": json_schema,
    }


response = client.models.generate_content(model="gemini-2.5-flash",
                                          contents=[im, "Describe the scene in detail, following exactly the given JSON schema\n"],
                                          config=config
                                          )



print(response.text)

{
  "scene": "A bustling city street bathed in the golden light of late afternoon or early morning. The scene captures a dynamic urban environment where various people engage in different activities, with a mix of modern and traditional architecture forming the backdrop.",
  "elements": [
    "A man in a black leather jacket and white helmet rides a dark-colored motorcycle across a zebra crossing.",
    "Another man walks on the street, holding an acoustic guitar, positioned between the motorcyclist and a scooter rider.",
    "A woman on a light-colored scooter follows behind the man with the guitar.",
    "On the sidewalk, an elderly man in a dark suit and spectacles sits on a wooden bench, appearing contemplative.",
    "Next to him, a woman with blonde hair, dressed in a red and white striped top and blue jeans, reads a newspaper.",
    "A young woman with long dark hair, wearing a pink t-shirt and light blue shorts, walks on the sidewalk holding a white tray with food.",
    "Furth

Does it match your schema?
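One plausible explanation for a mismatch, stated here as an assumption rather than a confirmed diagnosis: the `{"name": ..., "schema": ...}` wrapper is the shape used by OpenAI's `response_format`, whereas the Gemini documentation passes a plain JSON Schema object to `response_json_schema`. The sketch below, with hypothetical helper names, unwraps such a wrapper and compares the top-level keys of an output against the schema.

```python
def unwrap_schema(schema_or_wrapper: dict) -> dict:
    """Return the inner JSON Schema if the dict uses a {'name', 'schema'} wrapper."""
    if "schema" in schema_or_wrapper and "properties" not in schema_or_wrapper:
        return schema_or_wrapper["schema"]
    return schema_or_wrapper


def compare_top_level_keys(output: dict, schema: dict) -> dict:
    """Report required keys missing from output and keys the schema does not define."""
    plain = unwrap_schema(schema)
    expected = set(plain.get("properties", {}))
    required = set(plain.get("required", []))
    present = set(output)
    return {
        "missing_required": sorted(required - present),
        "unexpected": sorted(present - expected),
    }


# Abbreviated version of the schema above and of the keys Gemini returned
wrapped_schema = {
    "name": "img_extract",
    "schema": {
        "type": "object",
        "properties": {
            "numberOfPeople": {"type": "integer"},
            "atmosphere": {"type": "string"},
            "hourOfTheDay": {"type": "integer"},
            "people": {"type": "array"},
        },
        "required": ["numberOfPeople", "atmosphere", "hourOfTheDay", "people"],
    },
}
gemini_keys_only = {"scene": "...", "elements": []}
print(compare_top_level_keys(gemini_keys_only, wrapped_schema))
```

If the assumption holds, passing `unwrap_schema(json_schema)` as `response_json_schema` would be the first thing to try.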

Let's try to use Gemini to detect an object in the image and get its coordinates:


In [53]:
prompt = "Identify the youngest in the picture and give me back their coordinates. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."


config={"response_mime_type": "application/json"}

response = client.models.generate_content(model="gemini-2.5-flash",
                                          contents=[im, prompt],  # pass the PIL image, not the path string
                                          config=config
                                          )

bounding_boxes = json.loads(response.text)
print(bounding_boxes)


{'box_2d': [360, 477, 659, 584]}


Gemini 2.0 and later models were trained specifically for object detection and segmentation tasks. More details: https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Spatial_understanding.ipynb
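To draw or crop the detected region, the normalized `box_2d` has to be scaled to pixel coordinates. A minimal sketch, where the function name and the image size are assumptions used only for illustration:

```python
def box_2d_to_pixels(box_2d, image_width: int, image_height: int) -> tuple:
    """Convert [ymin, xmin, ymax, xmax] normalized to 0-1000 into (left, top, right, bottom) pixels."""
    ymin, xmin, ymax, xmax = box_2d
    left = round(xmin * image_width / 1000)
    top = round(ymin * image_height / 1000)
    right = round(xmax * image_width / 1000)
    bottom = round(ymax * image_height / 1000)
    return left, top, right, bottom


# Box from the output above, scaled to a hypothetical 2000 x 1000 pixel image
print(box_2d_to_pixels([360, 477, 659, 584], image_width=2000, image_height=1000))
# (954, 360, 1168, 659)
```

Note the axis order: the model returns y before x, while the returned tuple uses the (left, top, right, bottom) order that Pillow's `Image.crop` expects. In the notebook, the real size is available as `im.size`.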


## 3. Extract Structured Information from a Handwritten Note - GPT & Gemini

Let’s now try to extract structured information from a handwritten note (e.g., `prescription1.jpg`) using **both models**.

Consider the file `images/prescription1.jpg`.  
Have a look at it.

### JSON Schema
Let’s define a JSON schema for the extraction task:


In [54]:
json_schema_prescription = {
    "name": "prescription_extract",
    "schema": {
        "type": "object",
        "properties": {
            "doctor_name": {"type": "string"},
            "patient_name": {"type": "string"},
            "patient_dob": {"type": "string"},
            "meds": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "dose": {"type": "string"},
                        "frequency": {"type": "string"},
                        "instructions": {"type": "string"}
                    },
                    "required": ["name"]
                }
            },
            "signature": {"type": "boolean"}
        },
        "required": ["doctor_name", "patient_name", "meds"]
    }
}

Extract structured information using Gemini:

In [55]:
im = Image.open("images/prescription1.jpg")

config={
        "response_mime_type": "application/json",
        "response_json_schema": json_schema_prescription,
    }


response = client.models.generate_content(model="gemini-2.5-flash",
                                          contents=[im, "Extract information from the image, following the given JSON schema.\n"],
                                          config=config
                                          )



print(response.text)

{
"doctor_name": "Dr. Markus Hütter",
"patient_name": "Claudie Fischer",
"patient_dob": "1.4.1978",
"prescription": [
  "medication",
  "Ibuprofen",
  "dosage",
  "400mg",
  "frequency",
  "3x",
  "instructions",
  "nach dem Essen"
],
"signature": "Keptelle"
}


If the output is **not valid JSON** because it contains extra text around the JSON, the JSON part must be **extracted** before it can be loaded into a Python dict.  
Below is an example helper function that does this.

> **Note:** When the schema is given as a Pydantic model, the Gemini SDK can also return parsed objects (via `response.parsed`), so you *could* use Pydantic to handle parsing.  
> We avoid that here to keep the workflow generally compatible across models.


In [56]:
import re
import json


def parse_json_in_output(output):
    """
    Extracts JSON-like data from the given text output and converts it to a Python dictionary.

    Args:
        output (str): The text output containing the JSON data.

    Returns:
        dict | str: The parsed JSON data, or an error message if no valid JSON is found.
    """
    # Greedy match from the first "{" to the last "}" so nested objects stay intact
    json_match = re.search(r"\{.*\}", output, re.DOTALL)
    if not json_match:
        return "No JSON data found in the given output."
    json_str = json_match.group(0)
    try:
        # First try the text as-is: it may already be valid JSON
        return json.loads(json_str)
    except json.JSONDecodeError:
        pass
    try:
        # Fallback: swap single quotes for double quotes (this can corrupt apostrophes inside values)
        return json.loads(json_str.replace("'", '"'))
    except json.JSONDecodeError:
        return "The extracted JSON is still not valid after formatting."

In [57]:
#print(parse_json_in_output(response.text))
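As an alternative sketch that avoids regular expressions, the standard library's `json.JSONDecoder.raw_decode` can decode a JSON document that starts at a given index and ignore whatever text follows it. The function name `first_json_object` and the sample reply are hypothetical.

```python
import json


def first_json_object(text: str):
    """Return the first decodable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and ignores trailing text
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # this "{" did not start valid JSON; try the next one
        start = text.find("{", start + 1)
    return None


noisy = (
    'Sure! Here is the JSON: {"meds": [{"name": "Ibuprofen", "dose": "400 mg"}], '
    '"signature": true} Let me know if you need more.'
)
print(first_json_object(noisy))
```

Unlike a quote-replacing approach, this leaves string values untouched, so apostrophes inside the text survive. It returns `None` instead of an error string when nothing decodes.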


In [58]:
json.loads(response.text)

{'doctor_name': 'Dr. Markus Hütter',
 'patient_name': 'Claudie Fischer',
 'patient_dob': '1.4.1978',
 'prescription': ['medication',
  'Ibuprofen',
  'dosage',
  '400mg',
  'frequency',
  '3x',
  'instructions',
  'nach dem Essen'],
 'signature': 'Keptelle'}

Now let's do the same with GPT

In [59]:
im = "images/prescription1.jpg"

completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        {"role": "user", "content": [
                {"type": "text", "text": "Describe the image in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(im)}",
                        #"detail": "low"
                    }
                },
            ]}
    ],
    response_format={
                "type": "json_schema",   "json_schema": json_schema_prescription},
    temperature = 0
)

returnValue = completion.choices[0].message.content

In [60]:
returnValue

'{"doctor_name":"Dr. Markus Müller","patient_name":"Claudia Fischer","patient_dob":"1.4.1978","meds":[{"name":"Ibuprofen","dose":"400 mg","frequency":"3x","instructions":"nach dem Essen"}],"signature":true}'

How does this output differ from Gemini's, and how does each compare with your schema?

No extra parsing is needed now. We load the JSON into a Python dict with `json.loads`:

In [61]:
print(json.loads(returnValue))

{'doctor_name': 'Dr. Markus Müller', 'patient_name': 'Claudia Fischer', 'patient_dob': '1.4.1978', 'meds': [{'name': 'Ibuprofen', 'dose': '400 mg', 'frequency': '3x', 'instructions': 'nach dem Essen'}], 'signature': True}
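To make the comparison between the two models concrete, the two dicts can be compared key by key. The sketch below copies the outputs shown above; the helper name `diff_top_level` is hypothetical.

```python
def diff_top_level(a: dict, b: dict) -> dict:
    """Map each top-level key whose values differ to the pair (value in a, value in b)."""
    differences = {}
    for key in sorted(set(a) | set(b)):
        left = a.get(key, "<missing>")
        right = b.get(key, "<missing>")
        if left != right:
            differences[key] = (left, right)
    return differences


gemini_output = {
    "doctor_name": "Dr. Markus Hütter",
    "patient_name": "Claudie Fischer",
    "patient_dob": "1.4.1978",
    "prescription": ["medication", "Ibuprofen", "dosage", "400mg",
                     "frequency", "3x", "instructions", "nach dem Essen"],
    "signature": "Keptelle",
}
gpt_output = {
    "doctor_name": "Dr. Markus Müller",
    "patient_name": "Claudia Fischer",
    "patient_dob": "1.4.1978",
    "meds": [{"name": "Ibuprofen", "dose": "400 mg", "frequency": "3x",
              "instructions": "nach dem Essen"}],
    "signature": True,
}

for key, (gemini_value, gpt_value) in diff_top_level(gemini_output, gpt_output).items():
    print(f"{key}: Gemini={gemini_value!r} | GPT={gpt_value!r}")
```

Only `patient_dob` agrees. The names are read differently by the two models, and Gemini's `prescription` list and string `signature` do not follow the schema's `meds` array and boolean `signature`.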
