# Multimodal RAG

In my other post/notebook, [GPT Vision in RAG](./content.ipynb), we used vision to enrich our content prior to chat. By using vision, we're able to create robust descriptions of complex slides that contain charts and graphs, and bring some amazing value to our end users.

In this post, we're going to explore using GPT-4o vision in our generative call; that is, sending images in with the user's question instead of relying on the vision description of each slide.

## The R in multi-modal RAG

Retrieving the images is the most important piece: send GPT the right image for the question. This gets tricky when we're searching for images. If we were to embed the images, we could retrieve them using vectors. In my experience, vector search alone doesn't suffice for text search; a hybrid approach improves search relevancy. If I apply the same assumption to image embeddings, I think we'd still need some level of textual description to help find the most relevant images. Maybe I'm wrong. I won't be exploring multi-modal embeddings and retrieval here. I'll cheat the same way I did in the other post: a simple keyword match against each slide's text and description.

Let's create our "database" of slides and search. I reran the previous notebook and took those values.

In [44]:
all_data = [
    {
        "slide_number": 1,
        "text": [
            "My Comic Book Collection",
            "As of Dec 14, 2024"
        ],
        "description": "The image is a collage of comic book covers, bordered by a dark gray rectangle in the center with the text \"My Comic Book Collection\" in bold white letters, followed by \"As of Dec 14, 2024\" in smaller white text beneath. The comic book covers displayed around the central rectangle feature vibrant and contrasting colors, showcasing various styles and characters.\n\n1. **Top Left**: A classic comic cover from \"The Amazing Spider-Man\" in bright colors, featuring Spider-Man swinging on his web.\n\n2. **Top Center Left**: A \"Batman '89\" cover with a dark and moody aesthetic, showing Batman with an ominous backdrop.\n\n3. **Top Center Right**: The cover of a comic titled \"Nebula,\" with a futuristic theme, featuring a character in a bold, abstract design with metallic hues.\n\n4. **Top Right**: A cover from \"Eve\" showing a character behind a glass helmet, surrounded by a deep green, watery environment.\n\n5. **Bottom Left**: \"Scarlet Witch\" in bold reds and magentas, the character prominently featured with a mystical aura.\n\n6. **Bottom Center Left**: An old Western-style comic with a character holding a gun, illustrated in a vintage, earthy color palette.\n\n7. **Bottom Center Right**: A vibrant comic cover from a superhero series, featuring dynamic action and characters in colorful costumes.\n\n8. **Bottom Right**: A cover from \"The Incredible Hulk,\" portraying the Hulk in action, emphasizing shades of green contrasted with a chaotic background.\n\nOverall, the composition reflects a diverse range of comic book genres, from vintage to modern, superhero to sci-fi."
    },
    {
        "slide_number": 2,
        "text": [
            "Comic Books",
            "Comic books started in 1938 with the introduction of Action Comics #1 marking the debut of Superman. Over the years, comic books have sold for over $3.5M, making comic books investment-grade collectibles."
        ],
        "description": "The image is titled \"Comic Books\" and contains a brief informative text alongside several comic book covers. The text explains that comic books began in 1938 with the introduction of Action Comics #1, debuting Superman. It mentions that over the years, comic books have sold for over $3.5 million, classifying them as investment-grade collectibles.\n\nOn the right side, there is a cover of Action Comics #1 from June 1938. It features a vibrant image of a strong figure lifting a green car above their head with people reacting dramatically around them.\n\nBelow the text are additional comic book covers:\n\n1. The first cover titled \"The Warlord\" is colorful, featuring a muscular character in a fantasy setting with a dragon-like creature in the background.\n\n2. The second cover, \"The Thing,\" shows a superhero with rocky, orange skin, clad in a blue outfit, standing boldly.\n\n3. The third cover is from a \"Star Wars\" comic book, showing a collage of characters including Darth Vader and others in a dynamic space-themed composition.\n\nEach cover features vivid colors and striking images typical of classic comic book art."
    },
    {
        "slide_number": 3,
        "text": [
            "My Comic Books",
            "I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book\u2019s value ten-fold in the first day."
        ],
        "description": "The image features a digital presentation slide titled \"My Comic Books.\" Below the title is a brief paragraph that reads: \"I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book\u2019s value ten-fold in the first day.\"\n\nBelow the text, there are four comic book covers:\n\n1. **BRZRKR #12**: This cover has a monochromatic color scheme, primarily in shades of gray. It depicts a strong, imposing figure, possibly a warrior, wearing armor and holding weapons.\n\n2. **Stray Dogs: Dog Days #1**: The cover features a stylized illustration with elements of horror. It shows a dog against a red backdrop with what appears to be comic panels and blood spatters as part of the design.\n\n3. **Star Wars: Darth Vader #20**: This cover displays the character of Darth Vader standing prominently in the center, with a golden and dark color palette. The art style is highly detailed, capturing the sci-fi essence of Star Wars.\n\n4. **We Don\u2019t Kill Spiders**: The cover shows a mysterious, cloaked figure with a skeletal face, adorned with red spider-like appendages, set against a dark, starry background. The title is in bright pink, creating a striking contrast with the dark blues of the rest of the cover.\n\nThe overall design of the slide is clean and structured, with a focus on showcasing the comic book artwork."
    },
    {
        "slide_number": 4,
        "text": [
            "My Comic Books",
            "As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen."
        ],
        "description": "The image is a collage showcasing several Star Wars comic book covers alongside a brief written paragraph. \n\nAt the top, the title \"My Comic Books\" is in bold, black text. Below it, there\u2019s a paragraph that reads: \"As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen.\"\n\nBelow this text, four different comic book covers are displayed:\n\n1. **Far Left Cover**: Features a woman with a stern expression holding a red lightsaber. The background is dark, and she is wearing a white outfit. The words \u201cDarth Vader\u201d appear at the bottom.\n\n2. **Second from the Left**: Shows the character Darth Vader with his iconic black armor and a red lightsaber. He is front and center, with a dramatic, shadowy background.\n\n3. **Third from the Left**: Depicts a more classic comic style with vibrant colors. Multiple characters, including Vader, are illustrated, and the title \u201cStar Wars\u201d is prominently at the top, in bold yellow letters against a black background.\n\n4. **Far Right Cover**: Displays a large image of the character Chewbacca, a furry, tall figure with a gentle expression. He is holding a weapon, and the title reads \u201cHan Solo & Chewbacca.\u201d\n\nEach cover highlights different aspects and characters from the Star Wars universe, illustrated in a visually engaging manner to appeal to fans of the series."
    },
    {
        "slide_number": 5,
        "text": [
            "My Collection",
            "The following are some stats from my collection"
        ],
        "description": "The image is a slide with a simple design. It has a white background with minimal text. The main text, in larger black font, reads \"My Collection.\" Below it, in smaller gray font, is a subheading that says, \"The following are some stats from my collection.\" The overall layout is clean and centered toward the left side of the slide, providing a straightforward and uncluttered look."
    },
    {
        "slide_number": 6,
        "text": [
            "Chart Title: ",
            "Series Name: Total",
            "Values: [835.0, 203.0, 116.0, 113.0, 80.0, 50.0, 21.0, 18.0, 17.0, 17.0, 13.0, 10.0, 8.0, 7.0, 7.0, 6.0, 5.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]",
            "by publisher",
            "Contains more original stories (better in my opinion)",
            "This is so high due to Star Wars"
        ],
        "description": "The image is a bar chart titled \"by publisher,\" which compares the total number of publications for various publishers. The chart has a horizontal axis representing different publishers and a vertical axis indicating the number of publications ranging from 0 to 900.\n\n- **Marvel Comics** has the highest bar, reaching above 800 publications. A note on top of this bar says, \"This is so high due to Star Wars.\"\n\n- **DC Comics** and **Dark Horse Comics** have significantly lower bars, both under 300 publications.\n\n- **BOOM! Studios**, **IDW Publishing**, and several other publishers have progressively smaller bars, all below 200.\n\n- Another note near the smaller bars states, \"Contains more original stories (better in my opinion).\"\n\n- The majority of the publishers have very small bars, showing comparatively few publications, almost missing in contrast to the larger ones.\n\nThe bars are filled with a teal color, and the chart presents data in a straightforward manner, highlighting the dominance of Marvel Comics in terms of the number of publications."
    },
    {
        "slide_number": 7,
        "text": [
            "Series Name: Count of Series Name",
            "Values: [1.0, 10.0, 11.0, 9.0, 25.0, 15.0, 8.0, 2.0, 3.0, 2.0, 1.0, 6.0, 13.0, 14.0, 8.0, 14.0, 8.0, 2.0, 7.0, 6.0, 4.0, 3.0, 5.0, 1.0, 6.0, 2.0, 2.0, 2.0, 5.0, 12.0, 1.0, 10.0, 8.0, 27.0, 27.0, 11.0, 35.0, 21.0, 71.0, 369.0, 497.0, 298.0, 198.0]",
            "release years",
            "Started Collecting",
            "That was an expensive year"
        ],
        "description": "The image is a line graph titled \"release years.\" It features a horizontal axis labeled with years ranging from 1969 to 2024, and a vertical axis with values from 0 to 600. The graph depicts a line that remains relatively flat, fluctuating slightly between 0 and 100 from 1969 through 2015. \n\nIn 2015, there is a noticeable increase, and by 2018, there is a significant peak reaching close to 500. After 2019, the line descends rapidly back towards 0 by 2020. \n\nTwo text boxes are included in the graph: one near the peak, labeled \"That was an expensive year,\" and another at the base of the rising slope in 2018 labeled \"Started Collecting.\" The colors are predominantly blue and black on a white background, creating a clean and straightforward visual representation."
    },
    {
        "slide_number": 8,
        "text": [
            "Chart Title: Top Publishers",
            "Series Name: Read",
            "Values: [550.0, 100.0, 57.0, 43.0, 41.0]",
            "Series Name: Unread",
            "Values: [247.0, 87.0, 51.0, 64.0, 34.0]",
            "reading status",
            "Chart Title: All Comic Books",
            "Series Name: Total",
            "Values: [964.0, 816.0]"
        ],
        "description": "The image contains two main sections, a pie chart on the left and a bar chart on the right. It\u2019s titled \"reading status.\"\n\n**Pie Chart:**\n\n- Positioned on the left side, representing \"All Comic Books.\"\n- The chart is divided into two colored sections:\n  - A blue segment, constituting 54% of the circle, labeled \"Read.\"\n  - An orange segment, making up 46% of the circle, labeled \"Unread.\"\n\n**Bar Chart:**\n\n- Located on the right side, titled \"Top Publishers.\"\n- It features five bars, each representing different comic book publishers. The bars have blue and orange segments indicating read and unread statuses:\n  1. **Marvel Comics**: The tallest bar, with a majority in blue and a smaller orange section.\n  2. **DC Comics**: Slightly shorter, with a significant blue and some orange.\n  3. **Boom! Studios**: A shorter bar, evenly split between blue and orange.\n  4. **Dark Horse Comics**: Slightly shorter than Boom! Studios, with a balanced distribution.\n  5. **IDW Publishing**: The shortest bar, also balanced between the two colors.\n\nBelow the bar chart are the recognizable logos for each publisher, arranged in the same order as the bars above them.\n\nOverall, the chart visually represents the reading status of comic books by percentage and by publisher, with distinct use of blue and orange to differentiate between read and unread."
    },
    {
        "slide_number": 9,
        "text": [
            "Series Name: Star Wars Books",
            "Values: [554.0, 34.0, 67.0, 2.0]",
            "Series Name: Other Books",
            "Values: [243.0, 41.0, 40.0, 0.0]",
            "Series Name: ",
            "Values: [657.0, 324.0]",
            "star wars comics",
            "Star Wars\n67%",
            "All other books\n33%",
            "2"
        ],
        "description": "The image features two charts and some logos related to \"Star Wars Comics.\"\nOn the left is a pie chart. It is divided into two segments. The larger segment, which is colored blue, is labeled \"Star Wars 67%.\" The smaller segment, in orange, is labeled \"All other books 33%.\"\n\nTo the right is a bar graph. It has five vertical bars, each associated with different comic publishers. From left to right:\n\nThe tallest bar represents \"Marvel Comics\" and is mostly blue with an orange top.\nA shorter bar for \"IDW Publishing\" is entirely blue.\nAnother small bar for \"Dark Horse Comics\" is mainly blue with an orange top.\nThe final bar, for \"VIZ,\" is barely noticeable, small, and completely blue.\nBelow the bars, corresponding logos of each publisher are shown in a row: Marvel Comics, IDW Publishing, Dark Horse Comics, and VIZ.\n\nOverall, the image illustrates the distribution of Star Wars comics compared to other books and their presence across different publishers."
    }
]
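
The keyword "cheat" can be sketched as a small helper. This is a minimal sketch of the same case-insensitive match the loops further down do inline; `find_slides` and `sample_slides` are names made up for this example (the two sample entries are trimmed stand-ins for `all_data`, so the block runs on its own).

```python
def find_slides(keywords, slides):
    """Return slides whose description or extracted text contains any keyword.

    Matching is a case-insensitive substring test, so "Marvel" also matches "marvel".
    """
    matches = []
    for slide in slides:
        description = slide["description"].lower()
        text = " ".join(slide["text"]).lower()
        if any(k.lower() in description or k.lower() in text for k in keywords):
            matches.append(slide)
    return matches


# Trimmed stand-ins for all_data (hypothetical, for illustration only)
sample_slides = [
    {"slide_number": 5, "text": ["My Collection"], "description": "A slide with minimal text."},
    {"slide_number": 6, "text": ["by publisher"], "description": "A bar chart. Marvel Comics has the highest bar."},
]

hits = find_slides(["Marvel"], sample_slides)
print([slide["slide_number"] for slide in hits])  # [6]
```

Substring matching is crude: it has no ranking, and a broad keyword like "start" will pull in any slide that happens to contain it. It's good enough here because the goal is to test generation, not retrieval.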

## Set up

Before we go further, let's set up our environment. Copy the `.env.example` to `.env` and add your OpenAI API key. Then run the following:

In [None]:
%pip install openai python-dotenv

from dotenv import load_dotenv
load_dotenv()

Here's a basic function for calling GPT, sending in images and the user's question:

In [109]:
from openai import OpenAI
import base64
client = OpenAI()

total_completion = 0
total_prompt = 0

sys_prompt = """You are a helpful GPT. You will answer the user's question by using the provided image of slides from a PowerPoint.

You should:
1. Identify the most relevant images that answer the user's query.
2. Answer the user's question from the relevant images.

Question:
"""

def call_gpt(text, images):
    global total_completion, total_prompt
    content = [
        {"type": "text", "text": f"{sys_prompt}{text}"},
    ]
    
    for image in images:
        with open(image, "rb") as image_file:
            image_base64 = base64.b64encode(image_file.read()).decode('utf-8')
            content.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_base64}"
                }
            })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": content
            }
        ],
        max_tokens=4000,
    )

    total_completion += response.usage.completion_tokens
    total_prompt += response.usage.prompt_tokens
    return response.choices[0].message.content

Admittedly, I spent extra time on this prompt because it was getting one question wrong. More on that below.
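
One detail in `call_gpt`: the data URL always says `image/jpeg`. That's correct for these slides, but if your slides were exported as PNG you'd want the label to match. Here's a minimal sketch that guesses the type from the file name using the standard library; `to_data_url` is a hypothetical helper, not something used elsewhere in this notebook.

```python
import base64
import mimetypes


def to_data_url(image_bytes, filename):
    """Build a base64 data URL, guessing the MIME type from the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {filename}")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes stand in for a real image file's contents
print(to_data_url(b"abc", "Slide1.png"))  # data:image/png;base64,YWJj
```

With a real file, you'd read the bytes with `open(path, "rb")` and pass the same `path` as the file name.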

We'll use the same questions as before.

In [110]:
questions = [
    {
        "question": "How many Marvel books are there?",
        "keywords": ["Marvel"]
    },
    {
        "question": "How many Marvel books are Star Wars related?",
        "keywords": ["Marvel", "Star Wars"]
    },
    {
        "question": "What year did the collection start?",
        "keywords": ["start", "collect", "journey", "Collecting"]
    },
    {
        "question": "What are the top brands?",
        "keywords": ["brands", "publishers"]
    }
]

## The G in multi-modal RAG

Let's generate some answers using images, not text!

In [112]:
from IPython.display import Markdown, display


for question in questions:
    slides = [
        f"./comics/Slide{slide['slide_number']}.jpeg"
        for slide in all_data
        if any(k.lower() in slide["description"].lower() or k.lower() in ' '.join(slide['text']).lower() for k in question["keywords"])
    ]
    answer = call_gpt(question["question"], slides)

    display(Markdown(f"### {question['question']}"))
    for slide in slides:
        display(Markdown(f"<img src='{slide}' style='width:30%;' />"))
    display(Markdown(f"{answer}"))

print("DONE!")
print(f"Total {total_prompt} input tokens, {total_completion} output tokens")

### How many Marvel books are there?

<img src='./comics/Slide6.jpeg' style='width:30%;' />

<img src='./comics/Slide8.jpeg' style='width:30%;' />

<img src='./comics/Slide9.jpeg' style='width:30%;' />

From the provided slides, Marvel Comics has approximately 800 books.

### How many Marvel books are Star Wars related?

<img src='./comics/Slide2.jpeg' style='width:30%;' />

<img src='./comics/Slide3.jpeg' style='width:30%;' />

<img src='./comics/Slide4.jpeg' style='width:30%;' />

<img src='./comics/Slide6.jpeg' style='width:30%;' />

<img src='./comics/Slide8.jpeg' style='width:30%;' />

<img src='./comics/Slide9.jpeg' style='width:30%;' />

The relevant image to answer your question is the pie chart labeled "star wars comics" which shows that 67% of the books are Star Wars related.

### What year did the collection start?

<img src='./comics/Slide1.jpeg' style='width:30%;' />

<img src='./comics/Slide2.jpeg' style='width:30%;' />

<img src='./comics/Slide3.jpeg' style='width:30%;' />

<img src='./comics/Slide5.jpeg' style='width:30%;' />

<img src='./comics/Slide7.jpeg' style='width:30%;' />

The collection started on July 4th, 2021.

### What are the top brands?

<img src='./comics/Slide6.jpeg' style='width:30%;' />

<img src='./comics/Slide8.jpeg' style='width:30%;' />

<img src='./comics/Slide9.jpeg' style='width:30%;' />

The top brands based on the images provided are:

1. Marvel Comics
2. DC Comics
3. BOOM! Studios
4. Dark Horse Comics
5. IDW Publishing

These brands appear prominently in the charts as top publishers.

DONE!
Total 23798 input tokens, 200 output tokens


Not bad! 

- It got the question **What year did the collection start?** right this time! [In my other post](./content.ipynb), it got it wrong.
- However, it got **How many Marvel books are Star Wars related?** only kinda right. I'd rather have a count than a percentage that doesn't answer the question. I ran this a few times and it got it right once or twice. I could spend a lot more time on my prompt to force this, but possibly at the cost of other questions and answers.
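
For reference, the count is sitting in the chart data that was extracted for slide 9. Here's a minimal sketch that parses those lines and reproduces the 67% from the pie chart. `parse_series` is a helper made up for this example, and reading the first value of each series as Marvel is my assumption, based on the publisher order in the slide description (Marvel, IDW, Dark Horse, VIZ).

```python
import ast

# Extracted chart text for slide 9, copied from all_data
slide_9_text = [
    "Series Name: Star Wars Books",
    "Values: [554.0, 34.0, 67.0, 2.0]",
    "Series Name: Other Books",
    "Values: [243.0, 41.0, 40.0, 0.0]",
]


def parse_series(lines):
    """Pair each 'Series Name: X' line with the 'Values: [...]' line that follows it."""
    series = {}
    name = None
    for line in lines:
        if line.startswith("Series Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("Values:") and name is not None:
            series[name] = ast.literal_eval(line.split(":", 1)[1].strip())
            name = None
    return series


series = parse_series(slide_9_text)
star_wars_total = sum(series["Star Wars Books"])
other_total = sum(series["Other Books"])
share = star_wars_total / (star_wars_total + other_total)

print(int(star_wars_total), int(other_total), round(share * 100))  # 657 324 67

# Assumption: values are ordered Marvel, IDW, Dark Horse, VIZ
print(int(series["Star Wars Books"][0]))  # 554
```

The totals line up with the `[657.0, 324.0]` series and the 67% label on the same slide, so under that ordering assumption the answer I was hoping for is roughly 554 Marvel Star Wars books.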

_Side note, I spent several minutes troubleshooting really bad answers from GPT, then realized I was running on `gpt-4o-mini`, which isn't great with relationships on images, like charts. Silly me._

## A picture is worth a thousand words

I have a gut feeling that when we use multi-modal RAG, send images and ask questions, the machine can better identify the key portions of those images and answer. If we convert to text first, all the machine has to work off of is the text. That text, though generated by GPT, may not include all of the context that pertains to the user's query. Let's see if I can prove this.

Let's create some new questions that may not be as easy to answer from a textual summary.

In [113]:
hard_questions = [
    {
        "question": "What are the top brands that have better stories?",
        "keywords": ["publisher", "original"]
    },
    {
        "question": "How many books are from 1975?",
        "keywords": ["1975", "release"]
    },
    {
        "question": "How many Marvel books are read?",
        "keywords": ["Marvel", "reading"]
    }
]

Here's another GPT function, this one for chatting with text instead of images.

In [114]:
def call_gpt_text(text, context):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "developer", "content": "Help the user by answering their question from the provided context."},
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion:\n{text}"
            }
        ],
        max_tokens=4000
    )
    return completion.choices[0].message.content

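Now let's loop through the questions, get the images and the vision descriptions, and make two calls to GPT to get our answers for comparison. Note that only `call_gpt` updates the token counters, so the totals printed at the end cover the multi-modal calls only.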

In [115]:
total_prompt = 0
total_completion = 0

for question in hard_questions:
    slides = [
        slide
        for slide in all_data
        if any(k.lower() in slide["description"].lower() or k.lower() in ' '.join(slide['text']).lower() for k in question["keywords"])
    ]
    images = [f"./comics/Slide{slide['slide_number']}.jpeg" for slide in slides]
    image_answer = call_gpt(question["question"], images)
    # Separate slide descriptions with a plain divider line (no stray quote characters)
    text_answer = call_gpt_text(question["question"], "\n*****\n".join([slide["description"] for slide in slides]))

    display(Markdown(f"### {question['question']}"))
    for slide in slides:
        src = f"./comics/Slide{slide['slide_number']}.jpeg"
        display(Markdown(f"<img src='{src}' style='width:30%;' />"))
    display(Markdown(f"""### Multi-modal Answer
{image_answer}

### Vision Description Answer
{text_answer}

"""))

print("DONE!")
print(f"Total {total_prompt} input tokens, {total_completion} output tokens")

### What are the top brands that have better stories?

<img src='./comics/Slide6.jpeg' style='width:30%;' />

<img src='./comics/Slide8.jpeg' style='width:30%;' />

<img src='./comics/Slide9.jpeg' style='width:30%;' />

### Multi-modal Answer
Based on the provided slides, the top brands with better stories are:

1. **Image Comics** - Highlighted for having more original stories.
2. **Marvel Comics** - Noted for its high numbers, largely due to "Star Wars" content.
3. **DC Comics**

These publishers are indicated as the top ones in terms of story content.

### Vision Description Answer
Based on the provided context, the publishers that are suggested to have "better" stories, which are noted as containing more original stories, are likely among the smaller publishers with bars below 200 publications, such as BOOM! Studios and others. These publishers are indicated in the context to potentially offer more original content compared to the larger ones like Marvel Comics.



### How many books are from 1975?

<img src='./comics/Slide7.jpeg' style='width:30%;' />

### Multi-modal Answer
The number of books from 1975 is approximately 50, based on the graph.

### Vision Description Answer
Based on the description of the line graph, the line fluctuates between 0 and 100 from 1969 through 2015, indicating the number of books for most of those years is relatively low. Since there is no specific mention of a peak or notable increase in 1975, it can be inferred that the number of books from 1975 is likely within that consistent range, but the exact number cannot be determined from the information provided. You would need access to the specific data point for 1975 on the graph to know the exact count.



### How many Marvel books are read?

<img src='./comics/Slide6.jpeg' style='width:30%;' />

<img src='./comics/Slide8.jpeg' style='width:30%;' />

<img src='./comics/Slide9.jpeg' style='width:30%;' />

### Multi-modal Answer
Based on the second slide, around 300 Marvel books have been read.

### Vision Description Answer
Based on the provided context, in the bar chart titled "Top Publishers," Marvel Comics has the tallest bar with a majority in blue, representing "read" books. However, the exact number of read Marvel books is not specified in the description. If you have access to the specific numerical data or the percentage of the bar that is blue, you could calculate or derive the number of read Marvel books. Otherwise, the exact number cannot be determined solely from the given context.



DONE!
Total 5306 input tokens, 108 output tokens


Is it better? 

**What are the top brands that have better stories?**

- I love the multi-modal answer. It can read the graph and spit out Image Comics. However, it threw in Marvel because of the high number of Star Wars books. It seems to have reasoned that since there are so many Star Wars books, I must find them better stories (it's true, I do). It also threw in DC, which isn't accurate.
- The vision answer is adequate. Reading the vision description for the slide, that's all it could tell us.

**How many books are from 1975?**

- The multi-modal answer nailed it. It's maybe closer to 40, but 50 is good.
- The vision answer did what it could. As a user, I'd read that and dive into the slide to learn more. Given the answer above, I'd just go with about 90.

**How many Marvel books are read?**

- The multi-modal answer thinks it's 300? It seems to have swapped the colors, since roughly 250 are unread.
- The vision description answer couldn't answer it.

There wasn't a clear winner. I need to run this on many more slides, various topics and content, but I think it's promising!
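
Running this on many more slides means I can't eyeball every answer. Here's a rough sketch of how numeric answers could be scored automatically; `first_number` and `is_close` are hypothetical helpers, and the 15% tolerance is an arbitrary choice. The expected value of 550 comes from the extracted "Read" series for slide 8, assuming its first value is Marvel.

```python
import re


def first_number(answer):
    """Return the first number found in a model answer, or None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", answer)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def is_close(answer, expected, tolerance=0.15):
    """True if the first number in the answer is within `tolerance` (a fraction) of expected."""
    value = first_number(answer)
    return value is not None and abs(value - expected) <= tolerance * expected


# The multi-modal answer from the run above
multi_modal = "Based on the second slide, around 300 Marvel books have been read."
print(first_number(multi_modal))   # 300.0
print(is_close(multi_modal, 550))  # False
```

Grabbing the first number is naive: for the 1975 question it would pick up the year instead of the count, so a real harness would need something smarter, or a prompt that asks for the number alone.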

I wonder...

If we send both the vision description and the image, would these answers improve?

Let's create a new function to handle this new setup:

In [120]:
total_completion = 0
total_prompt = 0

sys_prompt = """You are a helpful GPT. You will answer the user's question by using the provided image of slides from a PowerPoint.

You should:
1. Identify the most relevant images that answer the user's query. You will be provided the image and a description of that slide.
2. Answer the user's question from the relevant images.

Question:
"""

def call_gpt_with_desc(text, images):
    global total_completion, total_prompt
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"{sys_prompt}{text}"},
            ]
        }
    ]
    
    for image in images:
        with open(f"./comics/Slide{image['slide_number']}.jpeg", "rb") as image_file:
            image_base64 = base64.b64encode(image_file.read()).decode('utf-8')

        # One user message per slide: its vision description followed by the image itself
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": image["description"]},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_base64}"
                    },
                },
            ],
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=4000,
    )

    total_completion += response.usage.completion_tokens
    total_prompt += response.usage.prompt_tokens
    return response.choices[0].message.content

In [122]:
total_prompt = 0
total_completion = 0

for question in hard_questions:
    slides = [
        slide
        for slide in all_data
        if any(k.lower() in slide["description"].lower() or k.lower() in ' '.join(slide['text']).lower() for k in question["keywords"])
    ]
    image_answer = call_gpt_with_desc(question['question'], slides)

    display(Markdown(f"### {question['question']}"))
    for slide in slides:
        src = f"./comics/Slide{slide['slide_number']}.jpeg"
        display(Markdown(f"<img src='{src}' style='width:30%;' />"))
    display(Markdown(f"""### Multi-modal Answer
{image_answer}

"""))

print("DONE!")
print(f"Total {total_prompt} input tokens, {total_completion} output tokens")

### What are the top brands that have better stories?

<img src='./comics/Slide6.jpeg' style='width:30%;' />

<img src='./comics/Slide8.jpeg' style='width:30%;' />

<img src='./comics/Slide9.jpeg' style='width:30%;' />

### Multi-modal Answer
Based on the provided slides, the top brands that have better stories, according to the notes, are those with more original stories. The slide indicates that smaller publishers such as those under "Contains more original stories (better in my opinion)" have this attribute. While these publishers have fewer total publications compared to Marvel Comics and DC Comics, they are noted for originality in storytelling.



### How many books are from 1975?

<img src='./comics/Slide7.jpeg' style='width:30%;' />

### Multi-modal Answer
The line graph indicates book release years, with data fluctuating between 0 to 100 during 1969-2015. For the year 1975, the graph shows that the number of books released is quite low, likely close to 10.



### How many Marvel books are read?

<img src='./comics/Slide6.jpeg' style='width:30%;' />

<img src='./comics/Slide8.jpeg' style='width:30%;' />

<img src='./comics/Slide9.jpeg' style='width:30%;' />

### Multi-modal Answer
To determine how many Marvel books are read, we can refer to the second image with the bar chart on the right, titled "Top Publishers."

The Marvel Comics bar shows a majority of the bar in blue, indicating that a substantial portion of Marvel comics has been read. Specifically, it looks like around 700 out of the total Marvel publications (over 800) are read.



DONE!
Total 6995 input tokens, 205 output tokens


Yeah, thought so. Adding the description didn't seem to help at all. What might be interesting is sending the document summary along with the slide, to give the slide additional context. In my examples here, it's all one document, but with a large knowledge corpus, where a slide could come from any deck, sending that deck's summary could provide some much-needed context to the machine.
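
As a rough sketch of that idea, here's how the message content could be assembled, assuming we already have a summary string for the deck. `build_content_with_summary` and its `deck_summary` argument are hypothetical and not part of this notebook; the summary text and the data URL below are placeholders. I haven't run this against GPT, so treat it as a starting point only.

```python
def build_content_with_summary(prompt, question, deck_summary, image_urls):
    """Build user-message content: the question, the deck summary, then one part per image."""
    content = [
        {"type": "text", "text": f"{prompt}{question}"},
        {"type": "text", "text": f"Deck summary:\n{deck_summary}"},
    ]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content


example = build_content_with_summary(
    "Question:\n",
    "How many Marvel books are read?",
    "A deck about a personal comic book collection.",  # made-up summary
    ["data:image/jpeg;base64,..."],  # placeholder, not a real image
)
print([part["type"] for part in example])  # ['text', 'text', 'image_url']
```

The returned list would go in as the `content` of a single user message, the same shape `call_gpt` already sends.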