# Use multimodal for image summarization 

In this notebook, we show why it is difficult to summarize an image using only its OCR-extracted text or only the cropped image.

We then show how to use a multimodal model to summarize an image using both its text and its visual content.

We use the table image from the previous notebooks and a new chart:


!["Chart"](../docs/images/figure9.png)

In [1]:
import sys
import os
import base64
from typing import List
from retrying import retry
from pdf2image import convert_from_path
from unstructured.partition.pdf import partition_pdf
from langchain_community.chat_models import AzureChatOpenAI
from langchain.schema.messages import HumanMessage, SystemMessage

sys.path.append('../')

In [2]:
graph_pdf_path = "../data/pdf/graph.pdf"
saved_image_directory_path = "../data/pdf/extracted_images/figures/graph/"

# Extract information from these images using Unstructured and Tesseract OCR

Let's first try to understand why we need a multimodal approach.

In [3]:
graph_elements = partition_pdf(
    filename=graph_pdf_path,
    infer_table_structure=True,
    strategy="hi_res",
    include_page_breaks=True,
    chunking_strategy='auto',
    extract_images_in_pdf=True,
    extract_image_block_output_dir=saved_image_directory_path
)

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The images detected during partitioning were saved in the `saved_image_directory_path` directory.

Have a look at the extracted image in this directory.

You can see that the graph extraction is good, but the informative title was not included in the image.

Let's look at the text extracted from the image, then try to summarize the image using a multimodal model.

In [4]:
from utils import pretty_print_element
for element in graph_elements:
    pretty_print_element(element)

-------- Element --------
Element type: <class 'unstructured.documents.elements.Title'>
Element text: Large Language Model Context Size
-------- Element --------
Element type: <class 'unstructured.documents.elements.Image'>
Element text: & | eos S Ss So So SF s oe gs z . é % we eg s ra ge * F s r - s fe & & Sg & & Pu Poa s Fs eg od 7 4 . - A a & & f . tha - a # ia - x o xy = 2 G OpenAI ANTHROP\C —~——s F FB (P) Hugging Face o (7) <_ g = °§ 9 & o


### Conclusion
The text extraction quality is poor. The extracted text cannot be considered useful for a question-answering RAG pipeline.
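
One rough way to quantify this is to measure how many OCR tokens look like real words. The sketch below is only an illustration: `word_like_ratio` and its `min_len` threshold are hypothetical names introduced here, not part of Unstructured or of this project's `utils` module, and the heuristic is not a validated quality metric.

```python
def word_like_ratio(text: str, min_len: int = 3) -> float:
    """Return the share of whitespace-separated tokens that look like words.

    A token counts as word-like when it is purely alphabetic and at least
    `min_len` characters long. Hypothetical heuristic, for illustration only.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    word_like = [t for t in tokens if len(t) >= min_len and t.isalpha()]
    return len(word_like) / len(tokens)


# Title text and the start of the image text printed by the cell above
title_text = "Large Language Model Context Size"
ocr_text = "& | eos S Ss So So SF s oe gs z . é % we eg s ra ge * F s r"

title_ratio = word_like_ratio(title_text)
ocr_ratio = word_like_ratio(ocr_text)
print(f"title: {title_ratio:.2f}, image text: {ocr_ratio:.2f}")
```

A low ratio on an `Image` element could be used as a signal that the OCR text should not be indexed as is.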

# Multimodal summarization of graph images

In [5]:
# gpt4 vision preview only available in SWITZERLAND
DEPLOYMENT_NAME_GPT4_VISION = 'gpt4-vision-switzerland'

GPT_4_V = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT_GPT4_VISION"),
    openai_api_version="2023-07-01-preview",
    deployment_name=DEPLOYMENT_NAME_GPT4_VISION,
    openai_api_key=os.getenv("OPENAI_API_KEY_GPT4_VISION"),
    openai_api_type="azure",
    temperature=0.3,
    max_tokens=4000,
    model_kwargs={"top_p": 0.95},
)

def encode_image(image_path: str) -> str:
    """Encode image to base64"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


@retry(stop_max_attempt_number=3, wait_fixed=60000)
def summarize_image(encoded_image: str, prompt: str) -> str:
    """Summarize a base64-encoded image with GPT-4 Vision, using prompt as the system message"""
    return GPT_4_V.invoke(
        input=[
            SystemMessage(
                content=[
                    {"type": "text", "text": prompt},
                ]
            ),
            HumanMessage(
                content=[
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{encoded_image}",
                            "detail": "high"
                        },
                    },
                ]
            )
        ]
    ).content
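
The cell above hard-codes `image/jpeg` in the data URL, which matches the `.jpg` files used in this notebook. If other image formats were involved, the MIME type could be guessed from the file extension instead. The following is a minimal sketch under that assumption; `to_data_url` is a hypothetical helper that does not exist in this notebook, and the sample file is a placeholder rather than a real image.

```python
import base64
import mimetypes
import os
import tempfile


def to_data_url(image_path: str, default_mime: str = "image/jpeg") -> str:
    """Build a base64 data URL, guessing the MIME type from the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    with open(image_path, "rb") as image_file:
        payload = base64.b64encode(image_file.read()).decode("utf-8")
    # Fall back to default_mime when the extension is unknown
    return f"data:{mime_type or default_mime};base64,{payload}"


# Demonstration on a throwaway file (placeholder bytes, not a real image)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.png")
    with open(sample_path, "wb") as sample_file:
        sample_file.write(b"placeholder bytes")
    sample_url = to_data_url(sample_path)

print(sample_url.split(",")[0])
```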



In [6]:
figure_filepaths = [
    os.path.join(saved_image_directory_path, filename)
    for filename in os.listdir(saved_image_directory_path)
]

PROMPT_TABLE_AND_IMAGE_SUMMARIZATION = """
    You are an AI assistant that summarize images containing charts
    For each chart in the image:
    - Summarize the chart as a table
    Also, provide a title of each extracted table or image and a two paragraphs summary.
    Provide range values interval or approximations if needed.
    Add two '\n\n' between each table or chart description
    """

In [7]:
for filepath in figure_filepaths:
    encoded_image = encode_image(filepath)
    print(summarize_image(encoded_image, prompt=PROMPT_TABLE_AND_IMAGE_SUMMARIZATION))

Title: AI Model Parameter Count Comparison

| AI Model                 | Parameter Count (Approx.) |
|--------------------------|---------------------------|
| GPT-3 175B              | 175,000                   |
| GPT-3 13B               | 13,000                    |
| GPT-3 6.7B              | 6,700                     |
| GPT-3 2.7B              | 2,700                     |
| GPT-3 1.3B              | 1,300                     |
| GPT-3 355M              | 355                       |
| GPT-3 125M              | 125                       |
| GPT-2 1.5B              | 1,500                     |
| GPT-2 774M              | 774                       |
| GPT-2 355M              | 355                       |
| GPT-2 117M              | 117                       |
| GPT-NeoX 20B            | 20,000                    |
| GPT-Neo 2.7B            | 2,700                     |
| GPT-Neo 1.3B            | 1,300                     |
| GPT-J 6B                | 6,000                     |
| 

### Conclusion

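The context around an image is essential to understanding it.

Without its title, this chart is not informative enough on its own: GPT-4 Vision was not able to summarize it correctly. It described the chart as a parameter count comparison, whereas the title extracted earlier shows that the chart is about context size.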
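
One possible way to give the model that context is to send the surrounding text together with the image in the same message. The sketch below only builds the message content; `build_content_with_context` is a hypothetical helper, the payload is a placeholder rather than a real image, and whether this improves the summary was not tested in this notebook.

```python
import base64
from typing import Any, Dict, List


def build_content_with_context(encoded_image: str, context_text: str) -> List[Dict[str, Any]]:
    """Build a multimodal message content: a text part followed by an image part."""
    return [
        {"type": "text", "text": f"Text found next to the image: {context_text}"},
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{encoded_image}",
                "detail": "high",
            },
        },
    ]


# Placeholder payload standing in for the output of encode_image
fake_encoded_image = base64.b64encode(b"placeholder bytes").decode("utf-8")
content = build_content_with_context(
    fake_encoded_image,
    context_text="Large Language Model Context Size",  # title extracted by Unstructured
)
print([part["type"] for part in content])
```

The resulting list has the same shape as the `content` argument passed to `HumanMessage` in `summarize_image`, so it could replace the image-only content there.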

# Summarize at the page level using multimodal

Unstructured saved the extracted images using the following file name format:

`figure-[page_number]-[image_number].jpg`

We can therefore identify which pages of our PDF contain images.
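
As an illustration, the page number can be read from the file name with a regular expression, which also skips files that do not follow the format. This is a sketch only: `page_number_from_figure_path` is a hypothetical helper, and the `.DS_Store` entry is an invented example of a stray file.

```python
import os
import re
from typing import Optional

# Matches figure-[page_number]-[image_number].jpg
FIGURE_NAME_PATTERN = re.compile(r"^figure-(\d+)-(\d+)\.jpg$")


def page_number_from_figure_path(filepath: str) -> Optional[int]:
    """Return the page number encoded in the file name, or None if it does not match."""
    match = FIGURE_NAME_PATTERN.match(os.path.basename(filepath))
    return int(match.group(1)) if match else None


example_paths = [
    "../data/pdf/extracted_images/figures/graph/figure-1-1.jpg",
    "../data/pdf/extracted_images/figures/graph/.DS_Store",  # hypothetical stray file
]
pages = sorted(
    {
        page
        for page in map(page_number_from_figure_path, example_paths)
        if page is not None
    }
)
print(pages)
```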

Let's save each page as an image and ask GPT-4 Vision to summarize the pages that contain an image.

In [8]:
print(figure_filepaths)

['../data/pdf/extracted_images/figures/graph/figure-1-1.jpg']


In [9]:
# Directory where the full-page renderings are saved
image_filepath: str = '../data/pdf/extracted_images/pages/graph/'
os.makedirs(image_filepath, exist_ok=True)

# File names follow figure-[page_number]-[image_number].jpg
page_numbers_with_images: List[int] = sorted(
    {int(os.path.basename(filepath).split('-')[1]) for filepath in figure_filepaths}
)
images = convert_from_path(graph_pdf_path)

for i, image in enumerate(images):
    image.save(os.path.join(image_filepath, f"page_{i+1}.jpg")) # Unstructured page count starts from 1

for page_number in page_numbers_with_images:
    # Use page_number here, not the leftover loop variable i from the loop above
    image_path = os.path.join(image_filepath, f"page_{page_number}.jpg")
    print(image_path)
    encoded_image = encode_image(image_path)
    print(summarize_image(encoded_image, prompt=PROMPT_TABLE_AND_IMAGE_SUMMARIZATION))

../data/pdf/extracted_images/pages/graph/page_1.jpg
Title: Large Language Model Context Size

| Model Name                | Context Size (Tokens) |
|---------------------------|-----------------------|
| bloom                     | 0 - 5,000             |
| bloom-560m                | 0 - 5,000             |
| bloom-1.1b-7-pytorch      | 0 - 5,000             |
| gpt-neo-1.3b              | 0 - 5,000             |
| gpt-neox-20b              | 0 - 5,000             |
| gpt-j                    | 0 - 5,000             |
| gpt3-6.7b                 | 0 - 5,000             |
| gpt3-13b                  | 0 - 5,000             |
| gpt3-175b                 | 0 - 5,000             |
| jurassic-1-jumbo          | 0 - 5,000             |
| gpt4-32k                  | 0 - 5,000             |
| gpt4-64k                  | 0 - 5,000             |
| gpt4-80k                  | 0 - 5,000             |
| gpt3.5-turbo              | 0 - 5,000             |
| gpt3-2.7b                 | 0 - 5,000    