# <a id='toc1_'></a>[Build an Image Captioning System with Vision Language Models](#toc0_)


Estimated time needed: **30** minutes


In this lab, you’ll explore how to use a vision language model to perform multimodal tasks like image captioning and visual question answering using Python.


----


## [Introduction](#toc0_)

Visual content—like photos, screenshots, or charts—often contains important information that can be hard to interpret at a glance. Wouldn’t it be useful if an AI model could instantly describe what’s in an image, or answer questions about it?

In this guided project, we’ll explore how to use a large multimodal language model to do exactly that. You'll use a vision language model, integrated with LangChain, to generate text responses based on visual inputs. From describing scenes to answering specific questions, this model can help turn images into insights.

## [What does this guided project do?](#toc0_)

This project demonstrates how to:

- Load image data from URLs.
- Encode those images so they can be processed by a language model.
- Use a vision language model to generate text responses based on each image.

## [Objectives](#toc0_)

By the end of this lab, you will be able to:

- Understand how to encode images for LLM-based visual processing.
- Use a vision language model to describe or analyze images.

## [Background](#toc0_)

### [What is a large language model (LLM)?](#toc0_)

[Large language models](https://www.ibm.com/think/topics/large-language-models?utm_source=skills_network&utm_content=in_lab_content_link&utm_id=Lab-Build+an+Image+Captioning+System+with+watsonx+and+Llama-v1_1745515774) are a category of foundation models trained on immense amounts of data, which makes them capable of understanding and generating natural language and other types of content to perform a wide range of tasks.



In [13]:
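from dotenv import load_dotenv
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace

# Load environment variables (such as your Hugging Face API token) from a local .env file
load_dotenv()

# Vision language model used throughout this lab
model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

# Sampling settings: low values keep the answers focused and more consistent between runs
gen_params = {
    "temperature": 0.2,
    "top_p": 0.5,
}

# Initialize the HuggingFaceEndpoint with your model details
hf_endpoint = HuggingFaceEndpoint(
    model=model_name,
    **gen_params,
)

# Wrap the endpoint in a chat interface so it accepts role-based messages
llm = ChatHuggingFace(llm=hf_endpoint)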

## <a href="#Image-preparation">Image preparation</a>

- Download the image
- Display the image


In [4]:
url_image_1 = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/5uo16pKhdB1f2Vz7H8Utkg/image-1.png'
url_image_2 = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/fsuegY1q_OxKIxNhf6zeYg/image-2.png'
url_image_3 = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/KCh_pM9BVWq_ZdzIBIA9Fw/image-3.png'
url_image_4 = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VaaYLw52RaykwrE3jpFv7g/image-4.png'

image_urls = [url_image_1, url_image_2, url_image_3, url_image_4] 

To gain a better understanding of our data input, let's display the images.


![Image 1](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/5uo16pKhdB1f2Vz7H8Utkg/image-1.png)<figcaption>Image 1</figcaption>

![Image 2](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/fsuegY1q_OxKIxNhf6zeYg/image-2.png)<figcaption>Image 2</figcaption>

![Image 3](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/KCh_pM9BVWq_ZdzIBIA9Fw/image-3.png)<figcaption>Image 3</figcaption>

![Image 4](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VaaYLw52RaykwrE3jpFv7g/image-4.png)<figcaption>Image 4</figcaption>


## <a id='toc1_10_'></a>[Encode the image](#toc0_)

Next, encode each image as a Base64 string using `base64.b64encode`. Why is this step needed? JSON is a text-based format and does not support binary data. By encoding the image as a Base64 string, you can embed the image data directly within a JSON structure.
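To make the point about JSON concrete, here is a minimal sketch that uses a few stand-in bytes (the 8-byte PNG file signature) instead of a real image. It shows that raw bytes cannot be serialized to JSON directly, that their Base64 text form can, and that decoding recovers the original bytes. The variable names are illustrative and not part of the lab code.

```python
import base64
import json

# Stand-in for real image content: the 8-byte PNG signature (illustrative only).
raw_bytes = b"\x89PNG\r\n\x1a\n"

# json.dumps rejects bytes because JSON has no binary type.
try:
    json.dumps({"image": raw_bytes})
    bytes_are_serializable = True
except TypeError:
    bytes_are_serializable = False

# Base64 maps the bytes to ASCII text, which JSON can carry as a string.
encoded = base64.b64encode(raw_bytes).decode("utf-8")
payload = json.dumps({"image": encoded})

# Decoding the string taken from the payload recovers the original bytes exactly.
decoded = base64.b64decode(json.loads(payload)["image"])
print(bytes_are_serializable, encoded, decoded == raw_bytes)
```

The trade-off is size: Base64 represents every 3 bytes with 4 characters, so the encoded string is roughly a third larger than the original image.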


In [5]:
import base64
import requests

def encode_images_to_base64(image_urls):
    """
    Downloads and encodes a list of image URLs to base64 strings.

    Parameters:
    - image_urls (list): A list of image URLs.

    Returns:
    - list: A list of base64-encoded image strings.
    """
    encoded_images = []
    for url in image_urls:
        response = requests.get(url)
        if response.status_code == 200:
            encoded_image = base64.b64encode(response.content).decode("utf-8")
            encoded_images.append(encoded_image)
            print(type(encoded_image))
        else:
            print(f"Warning: Failed to fetch image from {url} (Status code: {response.status_code})")
            encoded_images.append(None)
    return encoded_images

In [6]:
encoded_images = encode_images_to_base64(image_urls)

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
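Because `encode_images_to_base64` appends `None` for any image that fails to download, code that consumes its result may want to skip those entries. The following is a small sketch of one way to do that; `pair_successful_encodings` is a hypothetical helper introduced here for illustration, and the example URLs and strings are placeholders rather than lab data.

```python
def pair_successful_encodings(image_urls, encoded_images):
    """Return (url, encoded_image) pairs, skipping entries whose download failed (None)."""
    return [
        (url, encoded)
        for url, encoded in zip(image_urls, encoded_images)
        if encoded is not None
    ]

# Placeholder values; in the lab these would be image_urls and encoded_images.
example_urls = ["https://example.com/a.png", "https://example.com/b.png"]
example_encoded = ["aGVsbG8=", None]

usable = pair_successful_encodings(example_urls, example_encoded)
print(usable)
```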


## <a id='#Multimodal-inference-function'></a>[Multimodal inference function](#toc0_)

Next, define a function to generate responses from the model.

The `generate_model_response` function is designed to interact with a multimodal AI model that accepts both text and image inputs. This function takes an image, along with a user’s query, and generates a response from the model.

#### Function purpose

The function sends an image and a query to the AI model and retrieves a description or answer. It combines a text-based prompt and an image to guide the model in generating a concise response.

#### Parameters

- **`image_url`** (`str`): The URL of the image, which tells the model which image to process.
- **`user_query`** (`str`): The user's question about the image, providing context for the model to interpret the image and answer appropriately.
- **`assistant_prompt`** (`str`): An optional text prompt to guide the model in responding in a specific way. By default, the prompt is set to: `"You are a helpful assistant. Answer the following user query in 1 or 2 sentences:"`.


In [21]:
def generate_model_response(
        image_url, 
        user_query, 
        assistant_prompt="You are a helpful assistant. Answer the following user query in 1 or 2 sentences: "
    ):
    """
    Sends an image and a query to the model and retrieves the description or answer.

    Parameters:
    - image_url (str): URL of the image.
    - user_query (str): The user's question about the image.
    - assistant_prompt (str): Optional prompt to guide the model's response.

    Returns:
    - str: The model's response for the given image and query.
    """
    
    # Create the messages object
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": assistant_prompt + user_query
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url
                    }
                }
            ]
        }
    ]

    # Send the request to the model
    response = llm.invoke(messages)
    
    # Return the model's response
    return response.content

### Steps explained

1. **Create the messages object:**  
   The function constructs a list of messages in a JSON-like format. This object includes:
   - A "user" role with a "content" array. The content array contains:
     - A text field, combining the `assistant_prompt` and the `user_query`.
     - An image URL field, which holds the `image_url` passed to the function and tells the model which image to read. Note that the function as written sends the image's URL, not the base64 string produced in the encoding step.
2. **Send the request to the model:** `response = llm.invoke(messages)`  
   The function sends the constructed messages to the chat model by calling `llm.invoke` with the `messages` list.
3. **Return the model’s response:** `return response.content`  
   The function returns the `content` attribute of the message object that `invoke` returns, which holds the model's reply as a string.
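The message structure described in these steps can be sketched as a standalone function, which makes it easy to inspect the payload without calling the model. `build_messages` and `to_data_url` are hypothetical helpers introduced here, not part of the lab code. Passing a base64 string as a `data:` URL is a common convention in OpenAI-style chat APIs; whether the endpoint used in this lab accepts it is an assumption you would need to verify.

```python
def to_data_url(encoded_image, mime_type="image/png"):
    """Wrap a base64 string in a data URL (hypothetical helper; endpoint support is assumed)."""
    return f"data:{mime_type};base64,{encoded_image}"


def build_messages(
    image_ref,
    user_query,
    assistant_prompt="You are a helpful assistant. Answer the following user query in 1 or 2 sentences: ",
):
    """Build the single-turn user message: one text part followed by one image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": assistant_prompt + user_query},
                {"type": "image_url", "image_url": {"url": image_ref}},
            ],
        }
    ]


# "aGVsbG8=" is a placeholder base64 string, not a real image.
messages = build_messages(to_data_url("aGVsbG8="), "Describe the photo")
print(messages[0]["content"][1]["image_url"]["url"])
```

Keeping the payload construction separate from the model call also lets you pass either a plain image URL or a data URL as `image_ref` without changing the rest of the code.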


In [22]:
user_query = "Describe the photo"

for i, image in enumerate(image_urls, start=1):
    response = generate_model_response(
        image, 
        user_query
    )

    # Print the response with a formatted description
    print(f"Description for image {i}: {response}\n\n")
    # Stop after the first image to keep the run short; remove this line to caption all four
    break

Description for image 1: The photo depicts a bustling urban street lined with tall buildings, featuring a mix of modern and historic architecture. Tall skyscrapers dominate the scene, with one prominently displaying the "thenoma.com" advertisement, while the street below is filled with cars, pedestrians, and traffic lights, capturing the dynamic energy of city life.


## <a id='#Object-detection'></a>[Object detection](#toc0_)


Now that you have seen the model perform image captioning, let's ask it some questions that require object detection. The system prompt stays the same as in the previous section; only the user query changes. You will ask "How many cars are in this image?" twice: first about the first image, the city street, and then about the second image, which shows a woman running outdoors. You can comment out the image captioning code if you don't want to wait for that response again.


In [24]:
image = image_urls[0]

user_query = "How many cars are in this image?"

print("User Query: ", user_query)
print("Model Response: ", generate_model_response(image, user_query))

User Query:  How many cars are in this image?
Model Response:  There are several cars visible in the image, including taxis and other vehicles, but the exact number cannot be precisely determined due to the perspective and depth of the street.


In [23]:
image = image_urls[1]

user_query = "How many cars are in this image?"

print("User Query: ", user_query)
print("Model Response: ", generate_model_response(image, user_query))

User Query:  How many cars are in this image?
Model Response:  There is one car visible in the image.


The model correctly identified the single vehicle in the image. Now, let's ask about the damage shown in the image of flooding.


In [25]:
image = image_urls[2]

user_query = "How severe is the damage in this image?"

print("User Query: ", user_query)
print("Model Response: ", generate_model_response(image, user_query))

User Query:  How severe is the damage in this image?
Model Response:  The image shows significant flooding that has submerged a property, including a house and some outbuildings, indicating severe damage.


This response highlights the value that multimodal AI can have for domains like insurance. The model described the flooding and judged the damage to the home to be severe. An assessment like this could help speed up insurance claim processing.

Next, let's ask the model how much sodium is listed on the nutrition label in the fourth image.


In [26]:
image = image_urls[3]

user_query = "How much sodium is in this product?"

print("User Query: ", user_query)
print("Model Response: ", generate_model_response(image, user_query))

User Query:  How much sodium is in this product?
Model Response:  This product contains 640mg of sodium per serving.


Great! The model answered each user query from the image content: counting cars, assessing flood damage, and reading a nutrition label. We encourage you to try more queries to explore the model's performance further.
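If you want to try many queries at once, one option is to loop over (image index, query) pairs. The sketch below assumes a hypothetical helper, `run_queries`, that accepts any callable with the same `(image_url, user_query)` signature as `generate_model_response`. A stand-in function and placeholder URLs are used here so the example runs without calling the model.

```python
def run_queries(image_urls, queries, ask):
    """Ask each (image index, query) pair and collect (index, query, answer) tuples.

    ask is any callable taking (image_url, user_query) and returning a string,
    for example generate_model_response from this lab.
    """
    results = []
    for index, query in queries:
        answer = ask(image_urls[index], query)
        results.append((index, query, answer))
    return results


# Stand-in for the model so the sketch runs offline (not a real model call).
def fake_ask(image_url, user_query):
    return f"[answer about {image_url}: {user_query}]"


# Placeholder URLs and queries for illustration.
demo_urls = ["https://example.com/street.png", "https://example.com/label.png"]
demo_queries = [
    (0, "How many cars are in this image?"),
    (1, "How much sodium is in this product?"),
]

results = run_queries(demo_urls, demo_queries, fake_ask)
for index, query, answer in results:
    print(f"Image {index + 1} | {query} -> {answer}")
```

In the notebook, you would pass `image_urls` and `generate_model_response` in place of the placeholders.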


## <a id='#Conclusion'></a>[Conclusion](#toc0_)


In this lab, you explored the capabilities of a vision language model:

- Generating detailed image captions

- Answering object detection questions (e.g., number of cars in an image)

- Assessing visual damage in real-world disaster scenarios

- Extracting specific information from product labels

This lab not only introduced you to multimodal AI development but also demonstrated how cutting-edge models can turn visual content into actionable insight. Whether you're building apps for enterprise, education, or everyday use, the tools and techniques you’ve learned here are a solid foundation for what's possible with AI today.

We encourage you to extend this notebook by asking new questions, uploading your own images, or combining image and text prompts for more advanced reasoning tasks. The future of AI is multimodal—this is your starting point.


## <a id='#Exercises'></a>[Exercises](#toc0_)

Now, let's practice by exploring some other capabilities of this model. Try asking "How much cholesterol is in this product?" about the fourth image.


In [27]:
image = image_urls[3]

user_query = "How much cholesterol is in this product?"

print("User Query: ", user_query)
print("Model Response: ", generate_model_response(image, user_query))

User Query:  How much cholesterol is in this product?
Model Response:  The product contains 20mg of cholesterol per serving.


Try asking "What is the color of the woman's jacket?" in the 2nd image.


In [28]:
image = image_urls[1]

user_query = "What is the color of the woman's jacket?"

print("User Query: ", user_query)
print("Model Response: ", generate_model_response(image, user_query))

User Query:  What is the color of the woman's jacket?
Model Response:  The woman is wearing a bright yellow jacket.
