## Exploring Multimodal Capabilities of LLaVA

Multimodal models are AI systems that process and integrate multiple types of data, such as text, images, audio, or video, to generate richer and more context-aware outputs. These models enhance understanding by leveraging complementary information across different modalities, improving tasks like image captioning, language translation, and interactive AI applications.

Here, we demonstrate how to make the model generate image descriptions based on an input image. 

We use LLaVA, which is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding.

## Lab Description

This lab explores the multimodal capabilities of **LLaVA (Large Language and Vision Assistant)** by leveraging its ability to process and generate textual descriptions from images. Participants will load a pretrained LLaVA model and use it to generate image captions, analyze visual content, and interpret images. Through hands-on exercises, learners will gain insights into how multimodal models integrate visual and textual information to enhance AI understanding. 

## Lab Objectives

- Understand Multimodal AI and LLaVA’s Capabilities.
- Generate and Analyze Image Descriptions.
- Explore the Role of Multimodal Models in AI-Assisted Image Interpretation.

### Libraries

In [1]:
from langchain_ollama import ChatOllama
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from IPython.display import Markdown, display
from langchain_core.messages import HumanMessage, SystemMessage

### Passing Multimodal data into the model

The most common way to pass multimodal data such as images to a model is as a Base64-encoded string. We take a file path, read the binary content of the image file, and encode it with Base64.

In [4]:
import base64

# Define the path to the local image file
file_path = "./winter.jpg"

# Open the file in binary mode and read the content
with open(file_path, "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")
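
The read-and-encode step above can be wrapped in a small helper. The following is a minimal sketch, not part of the original lab: the function name `image_to_data_url` and its `default_mime` parameter are hypothetical, and the MIME type is guessed from the file extension only. It returns the full `data:` URL, so it also works for images that are not JPEGs.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str, default_mime: str = "image/jpeg") -> str:
    """Read an image file and return it as a Base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension; fall back to the default.
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        mime_type = default_mime
    # Read the raw bytes, Base64-encode them, and decode to a plain string.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

For a file such as `./winter.jpg`, the returned string starts with `data:image/jpeg;base64,` and can be used directly as the `url` value of an `image_url` entry.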


We then load `llava:7b` or `llava:13b` from Ollama using LangChain's `ChatOllama`.

In [19]:
model = ChatOllama(model="llava:13b", base_url="http://10.79.253.112:11434")  #load the multimodal model from ollama
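
The server address above is specific to the lab environment. As a sketch of one way to avoid hardcoding it, the address could be read from an environment variable. The variable name `OLLAMA_BASE_URL` and the helper `resolve_ollama_base_url` are assumptions made for this example; the fallback is the port Ollama listens on by default on a local machine.

```python
import os

# Ollama's default local address; used when no override is set.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def resolve_ollama_base_url(env_var: str = "OLLAMA_BASE_URL") -> str:
    """Return the server URL from the environment, or the local default.

    The environment variable name is a hypothetical choice for this sketch.
    """
    # Treat an unset or blank variable as "not configured".
    value = os.environ.get(env_var, "").strip()
    # Drop any trailing slash so the URL has a consistent form.
    return (value or DEFAULT_OLLAMA_URL).rstrip("/")
```

The result can then be passed as the `base_url` argument, for example `ChatOllama(model="llava:13b", base_url=resolve_ollama_base_url())`.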

We use `messages` to converse with the model. The `HumanMessage` object has a `content` field, which holds our text prompt (the instruction the model follows when analysing the image) followed by the Base64 string we generated for the image.

<img src="./winter.jpg" alt="Winter" width="600" />


In [6]:
message = HumanMessage(                              
    content=[
        {"type": "text", "text": "Analyse the weather and atmospheric condition from the given image"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)
response = model.invoke([message])
display(Markdown(response.content))

 The image depicts a snowy, winter scene. A log cabin is nestled amidst tall coniferous trees that are heavily laden with snow, suggesting the recent fall and current cold conditions. The sky appears to be clear and dark blue, which might indicate either early morning or late evening, contributing to the serene atmosphere. The ground is covered in a blanket of freshly fallen snow, indicating the photo was likely taken after a heavy snowfall, and it's still actively snowing.

The cabin has a warm glow emanating from its windows, which adds to the cozy and inviting ambiance despite the cold weather outside. The presence of a chimney suggests that the cabin is equipped with a fireplace for warmth, further emphasizing the need for shelter in such conditions.

Overall, the image conveys a tranquil winter scene with a strong emphasis on the contrast between the warm interior of the cabin and the cold, snowy environment outside. 

### Passing Multiple Images

We can also pass multiple images at the same time. For this, we first load two images and generate a Base64 string for each. Once we have both strings, we can provide input to the model using the same `messages` from LangChain.

<img src="./winter.jpg" alt="Winter" width="600" />
<img src="./sunny.jpg" alt="Sunny" width="600" />


In [7]:
import base64

# Define the paths to the local image files
file_path1 = "./winter.jpg"
file_path2 = "./sunny.jpg"

# Open the file in binary mode and read the content
with open(file_path1, "rb") as image_file:
    image_data1 = base64.b64encode(image_file.read()).decode("utf-8")

with open(file_path2, "rb") as image_file:
    image_data2 = base64.b64encode(image_file.read()).decode("utf-8")


We add an additional entry to pass the second image to the model, and change the prompt so that the model analyses and compares both images.

In [8]:
message = HumanMessage(
    content=[
        {"type": "text", "text": "Analyse and compare weather and atmospheric conditions of the 2 images."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data1}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data2}"}},
    ],
)
response = model.invoke([message])
display(Markdown(response.content))

 Both images depict a serene coastal setting, specifically focusing on the area around a cabin by the beach during winter. Here's an analysis of the weather and atmospheric conditions:

1. Sunlight and Skies:
   - The top image shows a clear sky with no visible clouds, indicating good visibility and likely low humidity. This suggests it might be a cold but sunny day.
   - The bottom image also has a clear blue sky without any clouds, which continues to indicate sunshine and fair weather conditions.

2. Temperature:
   - The absence of visible snow or ice in the lower image indicates that while it's winter, the temperature might be above freezing, at least during the day, given that the beach is accessible without any signs of heavy frost or accumulated snow.

3. Snow and Frost:
   - In both images, there are no signs of heavy snowfall. The only snow visible in the lower image appears to be on the ground around the cabin, suggesting recent light snowfall that has not entirely covered the landscape.

4. Wind Conditions:
   - Both scenes show calm seas and minimal wind, which gives a tranquil atmosphere to both images.

Overall, the weather in both images appears to be fairly similar with clear skies, sunshine, and minimal snow or ice. The main difference seems to be the time of day due to the positioning of the sun. The top image shows the sun at a lower angle, which is consistent with either morning or evening light, whereas the bottom image has the sun at a higher angle, indicating midday. 

The model generated a comparative description of both images.
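
Both the single-image and the two-image cells build the same structure: one text entry followed by one `image_url` entry per image. As a minimal sketch, that pattern could be generated for any number of images. The helper name `build_multimodal_content` and its `mime_type` parameter are hypothetical and assume all images share one MIME type.

```python
from typing import Dict, List


def build_multimodal_content(prompt: str, images_b64: List[str],
                             mime_type: str = "image/jpeg") -> List[Dict]:
    """Build a content list: one text entry, then one entry per image."""
    # The text prompt always comes first.
    content: List[Dict] = [{"type": "text", "text": prompt}]
    # Each Base64 string becomes its own image_url entry, in the given order.
    for image_b64 in images_b64:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{mime_type};base64,{image_b64}"},
        })
    return content
```

The returned list can be passed as the `content` of a `HumanMessage`, for example `HumanMessage(content=build_multimodal_content(prompt, [image_data1, image_data2]))`.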

## Analyzing Tables

Let us now try to analyze data from an image of a table containing information about the Intel processors used in the HPE ProLiant DL380a Gen12.

<img src="./table.jpg" alt="table" width="600" />

In [None]:
import base64

# Define the path to the local image file
table_path = "./table.jpg"

# Open the file in binary mode and read the content
with open(table_path, "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")
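
Before sending an encoded string to the model, it can be useful to confirm that it really decodes to image bytes in the expected format. The sketch below checks the leading signature bytes of a few common formats. The helper name `detect_image_format` is hypothetical, and the check only identifies the file format; it cannot tell which image the string contains.

```python
import base64
from typing import Optional

# Leading signature ("magic") bytes of a few common image formats.
_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF8": "gif",
}


def detect_image_format(image_b64: str) -> Optional[str]:
    """Return the detected image format, or None if no signature matches."""
    # Decode the Base64 string back to raw bytes.
    raw = base64.b64decode(image_b64)
    # Compare the start of the data against each known signature.
    for signature, name in _SIGNATURES.items():
        if raw.startswith(signature):
            return name
    return None
```

Calling `detect_image_format(image_data)` on a string encoded from a JPEG file should return `"jpeg"`, which matches the `data:image/jpeg` prefix used in the message.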

In [None]:
message = HumanMessage(                              
    content=[
        {"type": "text", "text": "Analyze the given Image of the table"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)
response = model.invoke([message])
display(Markdown(response.content))
