# Using GPT Vision in RAG

We're going to explore using GPT Vision with the 4o model to gather more insight from complex slides, charts, and images, and improve the chat experience in our RAG solutions.

## The PowerPoint PPTX file

I'll be using a PowerPoint file I made of my comic book collection. I didn't want to use real content from work, and I had a little fun building this ;). The value of adding vision to your RAG pipeline will still be realized using this data, I promise.

You can review the [PowerPoint file, in all its glory, here](./comics.pptx).

## Getting text from the PPTX

But first, you'll need to create a .env file (see .env.example), update it with your OpenAI API key and install some libraries.

_let's get through some of the boring stuff, installing libraries and stuff_

In [None]:
%pip install numpy pandas python-pptx openai python-dotenv

from dotenv import load_dotenv
load_dotenv()

Easy enough. Now our code to pull text from the slides:

In [2]:
import os
import json
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

ppt_file = "comics.pptx"
json_file = "slides_output.json"

# Load the presentation
prs = Presentation(ppt_file)
all_slides_data = []

for idx, slide in enumerate(prs.slides, start=1):
    slide_data = {
        "slide_number": idx,
        "text": []
    }

    for shape in slide.shapes:
        # Extract text from text frames
        if shape.has_text_frame and shape.text.strip():
            slide_data["text"].append(shape.text.strip())

        # Extract text from charts
        if shape.shape_type == MSO_SHAPE_TYPE.CHART:
            chart = shape.chart
            if chart.has_title:
                slide_data["text"].append(f"Chart Title: {chart.chart_title.text_frame.text}")
            for s in chart.series:
                slide_data["text"].append(f"Series Name: {s.name}")
                if s.values:
                    slide_data["text"].append(f"Values: {list(s.values)}")

    all_slides_data.append(slide_data)

print(f"Pulled text from {len(all_slides_data)} slides")


Pulled text from 9 slides


The above has pulled text from all of the slides. We'll explore that data below.
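The `json_file` name is defined in the cell above but never used there. If you want to keep the extracted text around between runs, here's a minimal sketch of saving and reloading it. The helper names `save_slides` and `load_slides` are made up for illustration, and `example_slides` just mimics the shape of `all_slides_data`.

```python
import json
from pathlib import Path


def save_slides(slides, path):
    """Write a list of slide dicts to a JSON file and return the path."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters like curly apostrophes readable
        json.dump(slides, f, indent=2, ensure_ascii=False)
    return path


def load_slides(path):
    """Read the list of slide dicts back from a JSON file."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


# Same shape as all_slides_data in the cell above (hypothetical sample)
example_slides = [
    {"slide_number": 1, "text": ["My Comic Book Collection", "As of Dec 14, 2024"]},
    {"slide_number": 5, "text": ["My Collection"]},
]
```

In the notebook this would be called as `save_slides(all_slides_data, json_file)`.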

I manually exported the deck to JPEG into the `./comics/` folder. Finding a library to do that programmatically was getting annoying. In production, we use Aspose on Java. There are some Python libraries available, but I decided not to implement one, as that's outside the goal of this exercise. So for today, I've saved my deck as JPEGs sized 1024x578.

Let's quickly review the deck and confirm the text I pulled off it. And note how useless some of the content is!

In [32]:
from IPython.display import display, Markdown

for slide in all_slides_data:
    all_text = "\n".join(slide["text"])
    display(Markdown(f"""![{slide['slide_number']}](./comics/Slide{slide['slide_number']}.jpeg)
                     
**Slide {slide['slide_number']}**

{all_text}"""))

![1](./comics/Slide1.jpeg)
                     
**Slide 1**

My Comic Book Collection
As of Dec 14, 2024

![2](./comics/Slide2.jpeg)
                     
**Slide 2**

Comic Books
Comic books started in 1938 with the introduction of Action Comics #1 marking the debut of Superman. Over the years, comic books have sold for over $3.5M, making comic books investment-grade collectibles.

![3](./comics/Slide3.jpeg)
                     
**Slide 3**

My Comic Books
I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book’s value ten-fold in the first day.

![4](./comics/Slide4.jpeg)
                     
**Slide 4**

My Comic Books
As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen.

![5](./comics/Slide5.jpeg)
                     
**Slide 5**

My Collection
The following are some stats from my collection

![6](./comics/Slide6.jpeg)
                     
**Slide 6**

Chart Title: 
Series Name: Total
Values: [835.0, 203.0, 116.0, 113.0, 80.0, 50.0, 21.0, 18.0, 17.0, 17.0, 13.0, 10.0, 8.0, 7.0, 7.0, 6.0, 5.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
by publisher
Contains more original stories (better in my opinion)
This is so high due to Star Wars

![7](./comics/Slide7.jpeg)
                     
**Slide 7**

Series Name: Count of Series Name
Values: [1.0, 10.0, 11.0, 9.0, 25.0, 15.0, 8.0, 2.0, 3.0, 2.0, 1.0, 6.0, 13.0, 14.0, 8.0, 14.0, 8.0, 2.0, 7.0, 6.0, 4.0, 3.0, 5.0, 1.0, 6.0, 2.0, 2.0, 2.0, 5.0, 12.0, 1.0, 10.0, 8.0, 27.0, 27.0, 11.0, 35.0, 21.0, 71.0, 369.0, 497.0, 298.0, 198.0]
release years
Started Collecting
That was an expensive year

![8](./comics/Slide8.jpeg)
                     
**Slide 8**

Chart Title: Top Publishers
Series Name: Read
Values: [550.0, 100.0, 57.0, 43.0, 41.0]
Series Name: Unread
Values: [247.0, 87.0, 51.0, 64.0, 34.0]
reading status
Chart Title: All Comic Books
Series Name: Total
Values: [964.0, 816.0]

![9](./comics/Slide9.jpeg)
                     
**Slide 9**

Series Name: Star Wars Books
Values: [554.0, 34.0, 67.0, 2.0]
Series Name: Other Books
Values: [243.0, 41.0, 40.0, 0.0]
Series Name: 
Values: [657.0, 324.0]
star wars comics
Star Wars
67%
All other books
33%
2

Thoughts? The extracted text captures little of what the slide actually says, right? Some slides are mostly numbers! Useless. Enter GPT 4o Vision!! Let's see what GPT can help us out with.

## Getting value from GPT 4o Vision

Let's use [OpenAI's documentation on using vision](https://platform.openai.com/docs/guides/vision), which I modified to handle my images.

In [4]:
from openai import OpenAI
import base64
client = OpenAI()

total_completion = 0
total_prompt = 0
sys_prompt = """Review the image and describe it in detail as describing it to someone who has impaired vision."""

def get_image_desc(image):
    global total_completion, total_prompt
    with open(image, "rb") as image_file:
        image_base64 = base64.b64encode(image_file.read()).decode('utf-8')
    
    response = client.chat.completions.create(
        model="gpt-4o", # changed to 4o from 4o-mini, as mini has limitations in understanding relationships
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": sys_prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_base64}",
                        },
                    },
                ],
            }
        ],
        max_tokens=4000, # using a lot more to give the machine room to give me a good answer 
    )
    total_completion += response.usage.completion_tokens
    total_prompt += response.usage.prompt_tokens
    return response.choices[0].message.content


What did I do:

- Created a single function so I can call it for each slide.
- I have a basic "system prompt" (not really, it's user text, but I'm stubborn). Note the wording: I have found that asking it to describe the image to someone with impaired vision really brings out a rich description.
- I collect all usage to share total costs of my tests below.
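The message payload built inside `get_image_desc` can also be pulled out into a small helper, which makes it easy to test without calling the API. This is only a sketch: `build_vision_messages` is a made-up name, and it adds the optional `detail` field (`low`, `high`, or `auto`) that the notebook cell doesn't set.

```python
import base64


def build_vision_messages(prompt, image_bytes, detail="auto", mime="image/jpeg"):
    """Return a Chat Completions `messages` list with one text part and one image part."""
    if detail not in {"low", "high", "auto"}:
        raise ValueError(f"detail must be 'low', 'high' or 'auto', got {detail!r}")

    # The image is sent inline as a base64 data URL, same as in get_image_desc
    image_base64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{mime};base64,{image_base64}",
                        "detail": detail,  # not set in the original cell
                    },
                },
            ],
        }
    ]
```

The result can be passed as `messages=` to `client.chat.completions.create(...)` in place of the inline list.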

Now let's run all the images!

In [5]:
for slide in all_slides_data:
    image = f"./comics/Slide{slide['slide_number']}.jpeg"
    slide["description"] = get_image_desc(image)
    print(f"Visioned slide {slide['slide_number']}")

print(f"DONE!\nTotal completion tokens used: {total_completion}\nTotal prompt tokens used: {total_prompt}")

Visioned slide 1
Visioned slide 2
Visioned slide 3
Visioned slide 4
Visioned slide 5
Visioned slide 6
Visioned slide 7
Visioned slide 8
Visioned slide 9
DONE!
Total completion tokens used: 1977
Total prompt tokens used: 7974


Drum roll please, let's see what the plain text extraction looks like against the vision description!

In [8]:
for slide in all_slides_data:
    image = f"./comics/Slide{slide['slide_number']}.jpeg"
    display(Markdown(f"![Slide {slide['slide_number']}]({image})"))
    all_text = "\n".join(slide["text"])
    display(Markdown(f"""<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide {slide['slide_number']}</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                {all_text}
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                {slide['description']}
            </td>
        </tr>
    </table>
    """))
    

![Slide 1](./comics/Slide1.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 1</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Comic Book Collection
As of Dec 14, 2024
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a collage of comic book covers with a central dark grey rectangle containing text in white. The text reads "My Comic Book Collection" with a subheading that says "As of Dec 14, 2024."

1. **Top Left:** A cover of "The Amazing Spider-Man," featuring Spider-Man in his iconic red and blue suit with a web background.

2. **Top Center Left:** "Batman '89" showcasing Batman in a dark cityscape with a shadowy and mysterious atmosphere.

3. **Top Center:** "Nebula #1" displaying a futuristic design with a character wearing a helmet and vibrant colors.

4. **Top Center Right:** "Eve" shows a young girl in a circular window, giving a sense of isolation, surrounded by green tones.

5. **Top Right:** Part of a "Hulk" cover with the Hulk character depicted with large muscles and a fierce expression, engaged in action.

6. **Bottom Left:** "Scarlet Witch" featuring a character in a red outfit, with mystical and powerful energy around her.

7. **Bottom Center Left:** A comic with a woman holding a weapon, giving off a vintage vibe, possibly set in an earlier era.

8. **Bottom Center:** A bright and vivid comic cover with a group of characters in colorful costumes, appearing energetic and ready for action.

9. **Bottom Right:** Another "Hulk" comic showing the Hulk in the midst of a dramatic moment, surrounded by chaos.

The artwork across the collage varies in style, from classic to modern, each indicating a different genre or theme within the comic book world.
            </td>
        </tr>
    </table>
    

![Slide 2](./comics/Slide2.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 2</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Comic Books
Comic books started in 1938 with the introduction of Action Comics #1 marking the debut of Superman. Over the years, comic books have sold for over $3.5M, making comic books investment-grade collectibles.
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a collage focusing on comic books. At the top left, the title "Comic Books" is written in bold black letters. Below it, there's a block of text explaining that comic books started in 1938 with the introduction of "Action Comics #1," which marked Superman's debut. It mentions that some comics have sold for over $3.5 million, highlighting their value as collectibles.

Beneath this text, there are four comic book covers displayed:

1. **The Warlord**: The cover features mystical artwork, with a warrior in armor fighting off menacing creatures, set against an adventurous, fantasy-themed background.

2. **The Thing (Marvel Comics)**: It displays a muscular, rocky-skinned superhero in blue attire, known as The Thing from the Fantastic Four, against a vivid, action-packed backdrop.

3. **Star Wars (Marvel Comics)**: This cover shows characters from the Star Wars universe, including Darth Vader, with a dramatic, space-themed design.

4. **Action Comics #1**: This iconic cover depicts Superman lifting a green car over his head with people watching in surprise, set against a yellow background. This comic is historical for introducing Superman in 1938.

The overall layout effectively illustrates the historical and cultural significance of comic books.
            </td>
        </tr>
    </table>
    

![Slide 3](./comics/Slide3.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 3</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Comic Books
I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book’s value ten-fold in the first day.
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a presentation slide titled "My Comic Books." It features a brief narrative about the person’s comic book collection journey, which began on July 4th, 2021, with an emphasis on the influence of artists and cover rarity on a comic's value.

Below the text are four comic book covers displayed in two rows. 

1. **First Cover (Left)**: "BRZRKR" issue 12, with a grayscale illustration of a character in an action pose, possibly wearing tactical gear.

2. **Second Cover (Middle Left)**: "Stray Dogs: Dog Days" issue 1, with a white and red color scheme, featuring an illustration that includes a drawing within a splash of blood.

3. **Third Cover (Middle Right)**: A "Star Wars: Darth Vader" issue, in a vintage style with a figure resembling Darth Vader standing at the center.

4. **Fourth Cover (Right)**: "We Don't Kill Spiders" by Joseph Schmalke, which appears atmospheric with a dark figure surrounded by a misty or mystical background, incorporating pink and teal colors.

The overall slide design is clean and emphasizes the passionate start of a comic book collection, highlighting various genres and artistic styles.
            </td>
        </tr>
    </table>
    

![Slide 4](./comics/Slide4.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 4</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Comic Books
As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen.
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image features a collage of comic book covers and text. At the top, there is a heading in bold text that reads "My Comic Books." Below this, a paragraph explains the enjoyment of being a Star Wars fan and how the comics fill gaps between the movies and TV shows by showcasing beloved heroes and villains in their natural settings.

Beneath the text are four comic book covers:

1. The first cover shows a character resembling a female Jedi holding a red lightsaber, standing in front of a large figure with a helmet, likely Darth Vader.

2. The second cover depicts Darth Vader prominently, with a red lightsaber glowing across the center.

3. The third cover is a vintage-style Star Wars comic, featuring multiple characters, including Darth Vader, with the title "Star Wars" in bold letters.

4. The fourth cover features Chewbacca, a large, furry character with bandolier straps across his chest. The title reads "Han Solo & Chewbacca."

Each cover includes the Marvel Comics logo, indicating the publisher of these comics.
            </td>
        </tr>
    </table>
    

![Slide 5](./comics/Slide5.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 5</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Collection
The following are some stats from my collection
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a simple, minimalistic presentation slide with a white background. Near the top left corner, there is bold, black text that reads "My Collection." Below that, there is smaller, gray text that states, "The following are some stats from my collection." The overall design is clean and uncluttered, with a focus on the text information.
            </td>
        </tr>
    </table>
    

![Slide 6](./comics/Slide6.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 6</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Chart Title: 
Series Name: Total
Values: [835.0, 203.0, 116.0, 113.0, 80.0, 50.0, 21.0, 18.0, 17.0, 17.0, 13.0, 10.0, 8.0, 7.0, 7.0, 6.0, 5.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
by publisher
Contains more original stories (better in my opinion)
This is so high due to Star Wars
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a bar chart titled "by publisher." It displays the number of publications by several different publishers, represented by vertical bars. 

- The most prominent bar corresponds to "Marvel Comics," reaching a height of about 850, with a note beside it stating, "This is so high due to Star Wars."
- The second tallest bar is "DC Comics," at around 350.
- A few other noticeable bars include "Dark Horse Comics" and "BOOM! Studios," both just below 150, and "IDW Publishing" slightly below 100.
- The rest of the publishers, including names like "Dynamite," "Oni Press," "Image Comics," and "Archie Comics," have bars much shorter, generally around 50 or less.
- Another annotation appears around the middle of the chart, stating, "Contains more original stories (better in my opinion)," above a small cluster of publishers with slightly higher values around 50 to 100.

The chart uses a light blue color for the bars, and it is organized along the horizontal axis with publisher names and a vertical axis labeled "Total," ranging from 0 to 900.
            </td>
        </tr>
    </table>
    

![Slide 7](./comics/Slide7.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 7</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Series Name: Count of Series Name
Values: [1.0, 10.0, 11.0, 9.0, 25.0, 15.0, 8.0, 2.0, 3.0, 2.0, 1.0, 6.0, 13.0, 14.0, 8.0, 14.0, 8.0, 2.0, 7.0, 6.0, 4.0, 3.0, 5.0, 1.0, 6.0, 2.0, 2.0, 2.0, 5.0, 12.0, 1.0, 10.0, 8.0, 27.0, 27.0, 11.0, 35.0, 21.0, 71.0, 369.0, 497.0, 298.0, 198.0]
release years
Started Collecting
That was an expensive year
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a line graph titled "release years," displaying data over time from 1969 to 2024 on the horizontal axis and numbers from 0 to 600 on the vertical axis. 

The line representing the data is mostly flat and low until around 2015, where there's a sharp increase peaking between 2017 and 2018. This peak is labeled with a text box saying "That was an expensive year." After the peak, there's an immediate drop, with the line declining sharply back toward low levels.

Additionally, at the beginning of this upward trend, around 2015, another text box is labeled "Started Collecting."

The overall impression is that there was a significant spike in activity or value starting in 2015, peaking around 2017-2018, and then decreasing again.
            </td>
        </tr>
    </table>
    

![Slide 8](./comics/Slide8.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 8</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Chart Title: Top Publishers
Series Name: Read
Values: [550.0, 100.0, 57.0, 43.0, 41.0]
Series Name: Unread
Values: [247.0, 87.0, 51.0, 64.0, 34.0]
reading status
Chart Title: All Comic Books
Series Name: Total
Values: [964.0, 816.0]
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image contains a graphical representation of comic book reading status. In the top left corner, there is a pie chart labeled "All Comic Books." This chart is divided into two sections: one in blue, representing 54% of comic books that have been read, and one in orange, representing 46% of books that are unread.

To the right of the pie chart is a bar graph titled "Top Publishers." The bar graph includes five different publishers. Each publisher has a bar divided into two colors, similar to the pie chart: blue for the number of read comics and orange for unread comics. The publishers are represented with their logos from left to right: Marvel Comics, DC Comics, Boom! Studios, Dark Horse Comics, and IDW Publishing.

Marvel Comics has the highest total bar, indicating the largest number of comics, both read and unread. The other publishers have progressively smaller bars. The graph visually contrasts how much of each publisher's comics have been read versus unread.
            </td>
        </tr>
    </table>
    

![Slide 9](./comics/Slide9.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 9</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Series Name: Star Wars Books
Values: [554.0, 34.0, 67.0, 2.0]
Series Name: Other Books
Values: [243.0, 41.0, 40.0, 0.0]
Series Name: 
Values: [657.0, 324.0]
star wars comics
Star Wars
67%
All other books
33%
2
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image consists of two main parts: a pie chart and a bar graph.

**Pie Chart:**
- Title: "star wars comics"
- The pie chart is divided into two sections, representing different categories of books.
- A larger blue section labeled "Star Wars" makes up 67% of the chart.
- A smaller orange section labeled "All other books" accounts for 33% of the chart.

**Bar Graph:**
- The bar graph compares different comic publishers by the number of comics produced.
- Four publisher logos are displayed at the bottom: Marvel Comics, IDW Publishing, Dark Horse Comics, and VIZ.
- The y-axis represents the number of comics, ranging from 0 to 900.
- Marvel Comics has the tallest bar with a combination of blue and orange, indicating a significant number of Star Wars comics and others.
- IDW Publishing has a small orange bar.
- Dark Horse Comics also has a small orange bar, slightly taller than IDW's.
- VIZ has a very small orange bar, almost negligible in height.

The chart visually highlights the dominance of Star Wars comics within this dataset and emphasizes the role of Marvel Comics in publishing them.
            </td>
        </tr>
    </table>
    

YYAAA HOOOOO!!! Those are some really helpful descriptions! With these rich descriptions we can give GPT much richer context to help answer your users' questions!

## Does vision actually help?

Obviously, the above will work better, but since I have some time on my hands, let's quickly find out 😉.

Again, I'm using [OpenAI's basic scripts for generating text](https://platform.openai.com/docs/guides/text-generation), polished up a little for my use case:

In [11]:
from openai import OpenAI
client = OpenAI()

def call_gpt(text, context):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "developer", "content": "Help the user by answering their question from the provided context."},
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion:\n{text}"
            }
        ]
    )
    return completion.choices[0].message.content


The above is a basic function that'll call GPT. I concatenate the context, the text from relevant slides, with my question in one message. 

In [33]:
questions = [
    {
        "question": "How many Marvel books are there?",
        "keywords": ["Marvel"]
    },
    {
        "question": "How many Marvel books are Star Wars related?",
        "keywords": ["Marvel", "Star Wars"]
    },
    {
        "question": "What year did the collection start?",
        "keywords": ["start", "collect", "journey"]
    },
    {
        "question": "What are the top brands?",
        "keywords": ["brands", "publishers"]
    }
]

Above are my questions, along with search keywords I added to find the relevant content. Any decent RAG system should transform user questions into semantic search phrases and combine them with vector search to return really robust results. I'm not creating a search tool for this exercise, so I'm cheating a little bit here to find related content.
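To make the "cheat" concrete, here's a rough sketch of the keyword lookup as a standalone helper. `select_slides` and `sample_slides` are made up for illustration, and unlike the notebook's inline check this version matches case-insensitively, so treat it as a variation rather than the exact logic used for the results in this post.

```python
def select_slides(slides, keywords, field="description"):
    """Return the slides whose `field` contains any of the keywords (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    selected = []
    for slide in slides:
        value = slide.get(field, "")
        # "text" is a list of strings, "description" is a single string
        haystack = (" ".join(value) if isinstance(value, list) else value).lower()
        if any(k in haystack for k in lowered):
            selected.append(slide)
    return selected


# Hypothetical sample with the same keys as all_slides_data
sample_slides = [
    {"slide_number": 6, "text": ["by publisher"],
     "description": "Marvel Comics has the tallest bar."},
    {"slide_number": 7, "text": ["release years", "Started Collecting"],
     "description": "A line graph of release years."},
]
```

For example, `select_slides(sample_slides, ["marvel"])` returns only slide 6, while `select_slides(sample_slides, ["started"], field="text")` returns only slide 7.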

The following loops through my questions, and calls GPT with the plain text extraction and the vision descriptions from earlier.

In [34]:
for question in questions:
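    # Slides are picked by keyword match on the vision description for BOTH
    # contexts, so the plain and vision answers are built from the same slides.
    plain_context = "\n*****\n".join([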
        " ".join(slide["text"])
        for slide in all_slides_data
        if any(k in slide["description"] for k in question["keywords"])
    ])
    vision_context = "\n*****\n".join([
        slide["description"]
        for slide in all_slides_data
        if any(k in slide["description"] for k in question["keywords"])
    ])
    plain_answer = call_gpt(question["question"], plain_context)
    vision_answer = call_gpt(question["question"], vision_context)
    display(Markdown(f"""**{question['question']}**
                     
_Plain Answer_
                     
- {plain_answer}

_Vision Answer_

- {vision_answer}

"""))

print("DONE!")

**How many Marvel books are there?**
                     
_Plain Answer_
                     
- The context provided does not explicitly mention the number of Marvel books, so it is not possible to determine the exact number of Marvel books from the information given.

_Vision Answer_

- According to the information provided, Marvel Comics has the highest total bar in two different representations: one is approximately reaching a height of 850 in the bar chart titled "by publisher," and the other confirms Marvel as having the tallest bar in the "Top Publishers" graph, indicating it produces a large number of comics. Therefore, there are about 850 Marvel books.



**How many Marvel books are Star Wars related?**
                     
_Plain Answer_
                     
- Based on the information provided, there are 554.0 Star Wars books. However, the context does not specify how many of these are published by Marvel or how many are Star Wars related within the Marvel category. To determine how many Marvel books specifically are Star Wars related, additional information about the specific publications or publishers would be needed.

_Vision Answer_

- According to the context provided, the pie chart labeled "star wars comics" indicates that 67% are Star Wars comics, specifically highlighting Marvel Comics' contribution. Additionally, the bar chart under "by publisher" shows that the high number of publications by Marvel Comics (approximately 850) is largely due to Star Wars. Therefore, if we assume Marvel's total comic publications are about 850, then approximately 67% of these, which would be around 570, are Star Wars related.



**What year did the collection start?**
                     
_Plain Answer_
                     
- The collection started on July 4th, 2021.

_Vision Answer_

- The collection started in 2015.



**What are the top brands?**
                     
_Plain Answer_
                     
- Based on the context provided, the top brands appear to be related to comic books and are influenced by the series of "Star Wars" given its prominence in the data. In the "Total Values" list by publisher, a significant value is associated with Star Wars, suggesting it is a leading brand. Other top brands or series in the context are not explicitly mentioned by name but might be implied by the high values in the other series and publishers mentioned.

_Vision Answer_

- Based on the provided context, the top brands in terms of comic book publications are:

1. Marvel Comics - It has the highest number of publications, significantly due to its Star Wars comics.
2. DC Comics - It has the second highest number of publications.
3. Boom! Studios, Dark Horse Comics, and IDW Publishing - These publishers follow, with Boom! Studios and Dark Horse Comics having just below 150 publications and IDW Publishing slightly below 100. 

Marvel Comics stands out as the top brand, especially due to the high number of Star Wars-related comics.



DONE!


Oh my word... you can't beat that, can you? Using vision with these slides significantly improved my answers (except for that one wrong answer, did you see it? More below). Now apply this to your content: how many slides and pages have charts, images, and other non-textual context that is lost by simply scraping text?

Check out the answers to "What year did the collection start?" The plain answer is good because a slide stated it explicitly. Vision read that slide fine, but it also read a chart on another slide and misplaced the annotation on the time axis. GPT had to choose between two conflicting pieces of content: one description said "person’s comic book collection journey, which began on July 4th, 2021" and the other said "around 2015, another text box is labeled 'Started Collecting.'" GPT isn't perfect, yet.

## Just use GPT 4o Vision!

It seems quite obvious to just say "Hey, let's use vision on our content!", and if you had a blank check, I'd say do it today!

In reality, most of us live within budgets, end-user performance expectations, speed-to-market requirements, blah blah blah. It'll come down to the business stakeholders to weigh the value of using vision against the value you're currently delivering, and decide whether the cost is worth it.



In [35]:
one_mil = 1000000
# costs as of Dec 2024 from https://openai.com/api/pricing/
cost_input = 2.5 
cost_output = 10
display(Markdown(f"""
For my limited example above, vision cost me {total_prompt} input tokens and {total_completion} output tokens. \
This cost me a total of ${round(float((total_completion/one_mil*cost_output)+(total_prompt/one_mil*cost_input)), 2)}."""))


For my limited example above, vision cost me 7974 input tokens and 1977 output tokens. This cost me a total of $0.04.
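The same arithmetic as a reusable function, in case you want to plug in your own token counts. This is a sketch: `estimate_cost` is a made-up name, and the default prices are the Dec 2024 `gpt-4o` rates used in the cell above, so check current pricing before relying on them.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_per_million=2.5, output_per_million=10.0):
    """Return the USD cost for a given number of input and output tokens."""
    input_cost = prompt_tokens / 1_000_000 * input_per_million
    output_cost = completion_tokens / 1_000_000 * output_per_million
    return input_cost + output_cost


# Totals from the 9-slide run above
run_cost = estimate_cost(7974, 1977)
cost_per_slide = run_cost / 9
```

If that per-slide average held (a big if, since descriptions vary in length), 100,000 slides would come to roughly $441.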

Cheap enough for a dozen slides, but when you're in the hundreds of thousands to millions of slides, this can add up quickly! There are a few things to check out to help curb the cost:

- Only send slides that need vision. Check for specific element types on the slides, and only send if there are images and charts. An all text slide doesn't need vision.
- Explore setting the `detail` parameter to `low`, which should incur lower costs but might impact the quality of the output. Learn more on [OpenAI's site about Low or High fidelity image understanding](https://platform.openai.com/docs/guides/vision#low-or-high-fidelity-image-understanding). I'd guess most PowerPoint decks can get away with low.
- The size of the slide also impacts the cost. All of my examples here are 1024x578, which cost $0.001913 per high-res slide, $0.000213 per low-res ([from OpenAI's pricing page](https://openai.com/api/pricing/)). A larger image, like 2048 wide, will cost more. _OpenAI does some resizing and calculations to determine the total tokens for an image. See their Vision pricing calculator for details._
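For the image-size point, here's a rough sketch of how the token count for one image can be estimated. It follows my reading of the tiling rule OpenAI published for `gpt-4o` at the time: a flat 85 tokens for low detail, otherwise 85 plus 170 per 512px tile after resizing. The function name `estimate_image_tokens` is made up, and the "never upscale" behavior is an assumption, so treat the output as an approximation and check OpenAI's calculator for real numbers.

```python
import math


def estimate_image_tokens(width, height, detail="high"):
    """Approximate the input tokens for one image (assumed gpt-4o tiling rule)."""
    if detail == "low":
        return 85  # flat cost, independent of image size

    # Step 1: shrink to fit inside 2048 x 2048, keeping the aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink so the shortest side is at most 768 (assumption: no upscaling)
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count the 512px tiles needed to cover the image
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


slide_tokens_high = estimate_image_tokens(1024, 578, detail="high")
slide_tokens_low = estimate_image_tokens(1024, 578, detail="low")
```

For a 1024x578 slide this gives 765 tokens at high detail and 85 at low, which at $2.50 per million input tokens lines up with the per-slide prices quoted in the list above.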

Given the results we see above, and the vastly improved answers, adding vision to your RAG should be seriously considered! Figure out your scale of cost, perform some tests, and showcase the value to your team!