# Using GPT Vision in RAG

We're going to explore using GPT Vision with the 4o model to pull more insight out of complex slides, charts, and images, and use that to improve the chat experience in our RAG solutions.

## The PowerPoint PPTX file

I'll be using a PowerPoint file I made of my comic book collection. I didn't want to use real content from work, and I had a little fun building this ;). The value of adding vision to your RAG pipeline will still be realized using this data, I promise.

You can review the [PowerPoint file, in all its glory, here](./comics.pptx).

## Getting text from the PPTX

But first, you'll need to create a `.env` file (see `.env.example`), update it with your OpenAI API key, and install some libraries.

_let's get through some of the boring stuff, installing libraries and stuff_

In [1]:
%pip install numpy pandas python-pptx openai python-dotenv

from dotenv import load_dotenv
load_dotenv()

True

Easy enough. Now our function to pull text from slides:

In [2]:
import os
import json
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

ppt_file = "comics.pptx"
json_file = "slides_output.json"

# Load the presentation
prs = Presentation(ppt_file)
all_slides_data = []

for idx, slide in enumerate(prs.slides, start=1):
    slide_data = {
        "slide_number": idx,
        "text": []
    }

    for shape in slide.shapes:
        # Extract text from text frames
        if shape.has_text_frame and shape.text.strip():
            slide_data["text"].append(shape.text.strip())

        # Extract text from charts
        if shape.shape_type == MSO_SHAPE_TYPE.CHART:
            chart = shape.chart
            if chart.has_title:
                slide_data["text"].append(f"Chart Title: {chart.chart_title.text_frame.text}")
            for s in chart.series:
                slide_data["text"].append(f"Series Name: {s.name}")
                if s.values:
                    slide_data["text"].append(f"Values: {list(s.values)}")

    all_slides_data.append(slide_data)

print(f"Pulled text from {len(all_slides_data)} slides")


Pulled text from 9 slides


The above has pulled text from all of the slides. We'll explore that data below.

I manually exported the deck to JPEG into the `./comics/` folder. Finding a library to do that programmatically was getting annoying. In production, we use Aspose on Java. There are some Python libraries available, but I decided not to implement one since that's outside the goal of this exercise. So for today, I've saved my deck as JPEGs sized 1024x578.

Let's quickly review the deck and confirm the text I pulled off it. And note how useless some of the content is!

In [3]:
from IPython.display import display, Markdown

for slide in all_slides_data:
    all_text = "\n".join(slide["text"])
    display(Markdown(f"""![{slide['slide_number']}](./comics/Slide{slide['slide_number']}.jpeg)
                     
**Slide {slide['slide_number']}**

{all_text}"""))

![1](./comics/Slide1.jpeg)
                     
**Slide 1**

My Comic Book Collection
As of Dec 14, 2024

![2](./comics/Slide2.jpeg)
                     
**Slide 2**

Comic Books
Comic books started in 1938 with the introduction of Action Comics #1 marking the debut of Superman. Over the years, comic books have sold for over $3.5M, making comic books investment-grade collectibles.

![3](./comics/Slide3.jpeg)
                     
**Slide 3**

My Comic Books
I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book’s value ten-fold in the first day.

![4](./comics/Slide4.jpeg)
                     
**Slide 4**

My Comic Books
As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen.

![5](./comics/Slide5.jpeg)
                     
**Slide 5**

My Collection
The following are some stats from my collection

![6](./comics/Slide6.jpeg)
                     
**Slide 6**

Chart Title: 
Series Name: Total
Values: [835.0, 203.0, 116.0, 113.0, 80.0, 50.0, 21.0, 18.0, 17.0, 17.0, 13.0, 10.0, 8.0, 7.0, 7.0, 6.0, 5.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
by publisher
Contains more original stories (better in my opinion)
This is so high due to Star Wars

![7](./comics/Slide7.jpeg)
                     
**Slide 7**

Series Name: Count of Series Name
Values: [1.0, 10.0, 11.0, 9.0, 25.0, 15.0, 8.0, 2.0, 3.0, 2.0, 1.0, 6.0, 13.0, 14.0, 8.0, 14.0, 8.0, 2.0, 7.0, 6.0, 4.0, 3.0, 5.0, 1.0, 6.0, 2.0, 2.0, 2.0, 5.0, 12.0, 1.0, 10.0, 8.0, 27.0, 27.0, 11.0, 35.0, 21.0, 71.0, 369.0, 497.0, 298.0, 198.0]
release years
Started Collecting
That was an expensive year

![8](./comics/Slide8.jpeg)
                     
**Slide 8**

Chart Title: Top Publishers
Series Name: Read
Values: [550.0, 100.0, 57.0, 43.0, 41.0]
Series Name: Unread
Values: [247.0, 87.0, 51.0, 64.0, 34.0]
reading status
Chart Title: All Comic Books
Series Name: Total
Values: [964.0, 816.0]

![9](./comics/Slide9.jpeg)
                     
**Slide 9**

Series Name: Star Wars Books
Values: [554.0, 34.0, 67.0, 2.0]
Series Name: Other Books
Values: [243.0, 41.0, 40.0, 0.0]
Series Name: 
Values: [657.0, 324.0]
star wars comics
Star Wars
67%
All other books
33%
2

Thoughts? The extracted text captures little of what the slides actually say, right? Some slides are mostly numbers! Useless. Enter GPT 4o Vision!! Let's see what GPT can help us out with.

## Getting value from GPT 4o Vision

Let's use [OpenAI's documentation on using vision](https://platform.openai.com/docs/guides/vision), which I modified to handle my images.

In [4]:
from openai import OpenAI
import base64
client = OpenAI()

total_completion = 0
total_prompt = 0
sys_prompt = """Review the image and describe it in detail as describing it to someone who has impaired vision."""

def get_image_desc(image):
    global total_completion, total_prompt
    with open(image, "rb") as image_file:
        image_base64 = base64.b64encode(image_file.read()).decode('utf-8')
    
    response = client.chat.completions.create(
        model="gpt-4o", # changed to 4o from 4o-mini, as mini has limitations in understanding relationships
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": sys_prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_base64}",
                        },
                    },
                ],
            }
        ],
        max_tokens=4000, # using a lot more to give the machine room to give me a good answer 
    )
    total_completion += response.usage.completion_tokens
    total_prompt += response.usage.prompt_tokens
    return response.choices[0].message.content


What did I do:

- Created a single function so I can call it for each slide.
- I have a basic "system prompt" (not really, it's user text, but I'm stubborn). Note the wording: I have found that asking it to describe the image to someone with impaired vision really brings out a rich description.
- I collect all usage to share total costs of my tests below.
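
If you'd rather put the instruction in a real system message instead of user text, the payload could be assembled roughly like this. This is a minimal sketch and not what the notebook above ran: `build_vision_messages` is a hypothetical helper name, and it only builds the message list in the same chat completions format that `get_image_desc` uses; it does not call the API.

```python
import base64


def build_vision_messages(image_bytes: bytes, instruction: str) -> list:
    """Assemble chat messages with the instruction as a system message.

    Hypothetical helper for illustration: it builds the payload only.
    """
    # Encode the raw JPEG bytes as a base64 data URL, as in get_image_desc
    image_base64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {"role": "system", "content": instruction},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_base64}",
                    },
                },
            ],
        },
    ]


# Placeholder bytes stand in for a real slide image here
messages = build_vision_messages(b"fake-jpeg-bytes", "Describe the image in detail.")
print(messages[0]["role"], messages[1]["content"][0]["type"])
```

The resulting list would be passed as `messages=` to `client.chat.completions.create`. Whether moving the instruction changes the quality of the descriptions is something I haven't measured.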

Now let's run all the images!

In [5]:
for slide in all_slides_data:
    image = f"./comics/Slide{slide['slide_number']}.jpeg"
    slide["description"] = get_image_desc(image)
    print(f"Visioned slide {slide['slide_number']}")

print(f"DONE!\nTotal completion tokens used: {total_completion}\nTotal prompt tokens used: {total_prompt}")

Visioned slide 1
Visioned slide 2
Visioned slide 3
Visioned slide 4
Visioned slide 5
Visioned slide 6
Visioned slide 7
Visioned slide 8
Visioned slide 9
DONE!
Total completion tokens used: 2089
Total prompt tokens used: 7974


Drum roll please, let's see what the plain text extraction looks like against the vision description!

In [6]:
for slide in all_slides_data:
    image = f"./comics/Slide{slide['slide_number']}.jpeg"
    display(Markdown(f"![Slide {slide['slide_number']}]({image})"))
    all_text = "\n".join(slide["text"])
    display(Markdown(f"""<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide {slide['slide_number']}</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                {all_text}
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                {slide['description']}
            </td>
        </tr>
    </table>
    """))
    

![Slide 1](./comics/Slide1.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 1</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Comic Book Collection
As of Dec 14, 2024
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image displays a collage of comic book covers arranged in two rows with a central overlay. The overlay is a gray rectangle with white text that reads, "My Comic Book Collection" and underneath it, "As of Dec 14, 2024."

1. **Top Row Covers (from left to right):**
   - The first cover features "The Amazing Spider-Man" with Spider-Man swinging from webbing.
   - The second cover is "Batman '89" with Batman standing heroically against a dark cityscape background.
   - The third cover shows "Nebula," with a futuristic theme showcasing a character in a space-like helmet against a colorful background.
   - The fourth cover is from the comic series "Eve" with a character in a submerged, porthole-like environment, suggesting an underwater or sci-fi theme.

2. **Bottom Row Covers (from left to right):**
   - The first cover features "Scarlet Witch," with a close-up of the character in a vibrant red costume.
   - The second cover has a vintage style, depicting a character in a pirate-like costume holding a rifle.
   - The third cover is a classic comic styled with small figures, including a superhero flying alongside people in action poses.
   - The last cover includes "Hulk," with an image of the Hulk in a fierce pose, smashing the ground.

The image is colorful and captures a diverse range of comic book themes, from superheroes to sci-fi and fantasy genres.
            </td>
        </tr>
    </table>
    

![Slide 2](./comics/Slide2.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 2</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Comic Books
Comic books started in 1938 with the introduction of Action Comics #1 marking the debut of Superman. Over the years, comic books have sold for over $3.5M, making comic books investment-grade collectibles.
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a collage discussing comic books. It features the following elements:

1. **Text Header:**
   - Title: "Comic Books"
   - Description: Comic books began in 1938 with Action Comics #1, featuring Superman's debut. They have sold for over $3.5 million, making them valuable collectibles.

2. **Images of Comic Book Covers:**
   - **Top Right:** The cover of "Action Comics #1" from June 1938. It shows a hero lifting a green car overhead while people in the background appear startled or in distress, against a bright yellow backdrop.
   - **Bottom Left (Three Covers):**
     - **The Warlord:** This cover features dynamic, illustrated scenes with characters in action, including what appears to be a warrior figure. The colors are vibrant, dominated by greens and reds.
     - **The Thing:** Displays a superhero character in a powerful pose, clad in a blue and rocky-textured suit. The cover uses bold colors with shades of blue and orange.
     - **Star Wars:** This cover features characters from the Star Wars universe with bold lettering and illustrations, primarily in reds and blues.

The collage showcases iconic and diverse comic book art, emphasizing their historical and monetary significance.
            </td>
        </tr>
    </table>
    

![Slide 3](./comics/Slide3.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 3</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Comic Books
I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book’s value ten-fold in the first day.
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a collage with a heading and several comic book covers. At the top, there is text that says "My Comic Books," followed by a personal note about starting to collect comics on July 4th, 2021, and learning about the value of artists and cover rarity.

Below the text, there are four comic book covers:

1. **Top Left Cover**: Titled “BRZRKR,” this cover features a monochrome image of a muscular figure with long hair, in a combat-ready stance, equipped with weapons. The background has a bold, blocky text with the title.

2. **Middle Cover**: Titled “Stray Dogs: Dog Days,” it shows a colorful illustration of a comic book with a droplet of blood splashing on it. The cover blends elements of mystery or horror.

3. **Bottom Middle Cover**: A “Darth Vader” comic cover featuring a detailed portrait of Darth Vader, standing rigidly in a desert-like environment. The background shows additional sci-fi elements like space and stars.

4. **Right Cover**: Titled “We Don’t Kill Spiders,” this cover is vivid with fantasy elements, showing a woman in a striking pose wrapped in spider webs, holding a staff, against a starry or mystical background.

Overall, the image highlights the visual appeal of comic book cover art and the fascination with collecting them.
            </td>
        </tr>
    </table>
    

![Slide 4](./comics/Slide4.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 4</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Comic Books
As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen.
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image features a slide titled "My Comic Books." Below the title, there is text explaining that as a "Star Wars" fan, the stories in these comics fill gaps between the movies and TV shows, depicting heroes and villains in their natural settings.

The slide showcases four comic book covers:

1. The first cover on the left features a character with a lightsaber, suggesting a futuristic or space theme associated with "Star Wars." The title at the bottom reads "Star Wars."

2. The second cover, labeled "Darth Vader," depicts the iconic character with his red lightsaber, hinting at a dark and intense atmosphere.

3. The third cover is vibrant with illustrations of various "Star Wars" characters, with the classic logo "Star Wars" in bold letters.

4. The fourth cover on the right features a close-up of a furry, creature-like character holding a weapon. The title reads "Han Solo & Chewbacca," indicating the characters involved.

Each comic book cover uses distinct artwork associated with the "Star Wars" universe, with prominent use of the color schemes and styles typical of the franchise. The layout of the image is clean and organized, emphasizing the visual appeal of the comic covers.
            </td>
        </tr>
    </table>
    

![Slide 5](./comics/Slide5.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 5</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Collection
The following are some stats from my collection
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is primarily blank with a simple, minimalistic design. Against a plain white background, there are two lines of text. 

The first line reads "My Collection" in large, bold black letters, centered and creating a focal point in the image. 

The second line is just below the first, in smaller, grey letters. It reads "The following are some stats from my collection."

There are no other visual elements or colors in the image, emphasizing a clean and straightforward layout.
            </td>
        </tr>
    </table>
    

![Slide 6](./comics/Slide6.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 6</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Chart Title: 
Series Name: Total
Values: [835.0, 203.0, 116.0, 113.0, 80.0, 50.0, 21.0, 18.0, 17.0, 17.0, 13.0, 10.0, 8.0, 7.0, 7.0, 6.0, 5.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
by publisher
Contains more original stories (better in my opinion)
This is so high due to Star Wars
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a bar chart titled "by publisher" with a focus on comic book publishers. The X-axis lists different publishers, and the Y-axis represents the number of publications, ranging from 0 to 900.

- The first publisher on the left, "Marvel Comics," has the tallest bar, reaching close to 800. A text annotation reads, "This is so high due to Star Wars," indicating the impact of Star Wars comics on their numbers.
  
- The second publisher, "DC Comics," has a significantly shorter bar, just below 300.
  
- Other publishers, such as "Dark Horse Comics," "BOOM! Studios," and "IDW Publishing," have progressively smaller bars.

- Towards the middle of the chart, there's another annotation above a small bump, reading, "Contains more original stories (better in my opinion)," referring to one of the mid-sized publishers.

- The remaining publishers have very low bars, almost indistinguishable from each other, representing a minimal number of publications compared to the leading companies.

- The chart is mostly dominated by Marvel Comics, with few other publishers standing out prominently.
            </td>
        </tr>
    </table>
    

![Slide 7](./comics/Slide7.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 7</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Series Name: Count of Series Name
Values: [1.0, 10.0, 11.0, 9.0, 25.0, 15.0, 8.0, 2.0, 3.0, 2.0, 1.0, 6.0, 13.0, 14.0, 8.0, 14.0, 8.0, 2.0, 7.0, 6.0, 4.0, 3.0, 5.0, 1.0, 6.0, 2.0, 2.0, 2.0, 5.0, 12.0, 1.0, 10.0, 8.0, 27.0, 27.0, 11.0, 35.0, 21.0, 71.0, 369.0, 497.0, 298.0, 198.0]
release years
Started Collecting
That was an expensive year
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                This image is a line graph titled "release years." The graph has a horizontal X-axis representing years, ranging from 1969 to 2024, marked at intervals of a few years. The vertical Y-axis represents quantity or possibly cost, ranging from 0 to 600, marked at intervals of 100.

The line on the graph is mostly flat, with small variations, until it starts to rise sharply around the year 2015, reaching a peak between 2018 and 2019 at a value close to 500. After this peak, the line drops quickly and then stabilizes again.

There are two labeled points highlighted on the graph. Near the peak at 2018-2019, there's a label saying "That was an expensive year," indicating a significant increase during this period. Closer to the beginning of the rise, around the year 2015, there's another label saying "Started Collecting."

Overall, the graph shows a significant spike in activity or costs around 2018-2019, preceded by a steady period and followed by another steady period.
            </td>
        </tr>
    </table>
    

![Slide 8](./comics/Slide8.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 8</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Chart Title: Top Publishers
Series Name: Read
Values: [550.0, 100.0, 57.0, 43.0, 41.0]
Series Name: Unread
Values: [247.0, 87.0, 51.0, 64.0, 34.0]
reading status
Chart Title: All Comic Books
Series Name: Total
Values: [964.0, 816.0]
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is divided into two main sections, focusing on the status of reading comic books.

**Left Section: Pie Chart**
- This section features a pie chart titled "All Comic Books."
- The chart is split into two segments. 
- The larger segment is colored blue, representing 54% of the comic books labeled as "Read."
- The smaller segment is colored orange, representing 46% as "Unread."

**Right Section: Bar Chart**
- This section displays a horizontal bar chart titled "Top Publishers."
- Five publishers are represented, each with a vertical bar.
- The first bar, representing Marvel Comics, is the tallest, with the majority blue and a smaller orange top, indicating more read than unread.
- The second bar, representing DC Comics, is shorter and has a similar color distribution, mostly blue.
- The bars for BOOM! Studios, Dark Horse Comics, and IDW Publishing are significantly smaller, each with a similar blue-orange pattern as the others, indicating fewer comics overall, with a larger portion read.

At the bottom of the bar chart are the logos for each publisher, enhancing the visual presentation of the data.
            </td>
        </tr>
    </table>
    

![Slide 9](./comics/Slide9.jpeg)

<table style='width:100%' border='0'>
        <tr><td colspan='2'>Slide 9</tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Series Name: Star Wars Books
Values: [554.0, 34.0, 67.0, 2.0]
Series Name: Other Books
Values: [243.0, 41.0, 40.0, 0.0]
Series Name: 
Values: [657.0, 324.0]
star wars comics
Star Wars
67%
All other books
33%
2
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image features two charts and some logos related to "Star Wars Comics."

On the left is a pie chart. It is divided into two segments. The larger segment, which is colored blue, is labeled "Star Wars 67%." The smaller segment, in orange, is labeled "All other books 33%."

To the right is a bar graph. It has five vertical bars, each associated with different comic publishers. From left to right:

1. The tallest bar represents "Marvel Comics" and is mostly blue with an orange top.
2. A shorter bar for "IDW Publishing" is entirely blue.
3. Another small bar for "Dark Horse Comics" is mainly blue with an orange top.
4. The final bar, for "VIZ," is barely noticeable, small, and completely blue.

Below the bars, corresponding logos of each publisher are shown in a row: Marvel Comics, IDW Publishing, Dark Horse Comics, and VIZ.

Overall, the image illustrates the distribution of Star Wars comics compared to other books and their presence across different publishers.
            </td>
        </tr>
    </table>
    

YYAAA HOOOOO!!! Those are some really helpful descriptions! With these rich descriptions we can give GPT much richer context to help answer your users' questions!

## Does vision actually help?

Obviously, the above will work better, but since I have some time on my hands, let's quickly find out 😉.

Again, I'm using [OpenAI's basic scripts for generating text](https://platform.openai.com/docs/guides/text-generation), polished up a little for my use case:

In [7]:
from openai import OpenAI
client = OpenAI()

def call_gpt(text, context):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "developer", "content": "Help the user by answering their question from the provided context."},
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion:\n{text}"
            }
        ]
    )
    return completion.choices[0].message.content


The above is a basic function that'll call GPT. I concatenate the context (the text from relevant slides) with my question in one message.

In [8]:
questions = [
    {
        "question": "How many Marvel books are there?",
        "keywords": ["Marvel"]
    },
    {
        "question": "How many Marvel books are Star Wars related?",
        "keywords": ["Marvel", "Star Wars"]
    },
    {
        "question": "What year did the collection start?",
        "keywords": ["start", "collect", "journey"]
    },
    {
        "question": "What are the top brands?",
        "keywords": ["brands", "publishers"]
    }
]

Above are my questions, and I added search keywords to find the content. Any decent RAG system should transform user questions into semantic phrases and, along with vectors, use them to return some really robust results. I'm not creating a search tool for this exercise, so I'm cheating a little bit here to find related content.
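
To make the "cheat" concrete, the keyword matching can be sketched as a small standalone helper. `find_relevant_slides` is a hypothetical name used only for illustration; it assumes each slide is a dict with a `text` list and a `description` string, like the entries in `all_slides_data`, and it is a stand-in for real retrieval, not a recommendation.

```python
def find_relevant_slides(slides, keywords):
    """Return slides whose extracted text or vision description mentions any keyword."""
    lowered = [k.lower() for k in keywords]
    matches = []
    for slide in slides:
        # Search both the scraped text and the vision description, case-insensitively
        haystack = " ".join(slide.get("text", [])) + " " + slide.get("description", "")
        haystack = haystack.lower()
        if any(k in haystack for k in lowered):
            matches.append(slide)
    return matches


# Tiny made-up sample shaped like all_slides_data
sample_slides = [
    {"slide_number": 1, "text": ["My Comic Book Collection"], "description": "A collage of covers."},
    {"slide_number": 6, "text": ["by publisher"], "description": "A bar chart; Marvel Comics has the tallest bar."},
]
hits = find_relevant_slides(sample_slides, ["Marvel"])
print([s["slide_number"] for s in hits])  # [6]
```

Because a slide matches if either its text or its description contains a keyword, the plain and vision contexts end up covering the same set of slides, which keeps the comparison fair.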

The following loops through my questions and calls GPT with the plain text extraction and the vision descriptions from earlier.

In [11]:
for question in questions:
    plain_context = "\n*****\n".join([
        " ".join(slide["text"])
        for slide in all_slides_data
        if any(k.lower() in slide["description"].lower() or k.lower() in ' '.join(slide['text']).lower() for k in question["keywords"])
    ])
    vision_context = "\n*****\n".join([
        slide["description"]
        for slide in all_slides_data
        if any(k.lower() in slide["description"].lower() or k.lower() in ' '.join(slide['text']).lower() for k in question["keywords"])
    ])
    plain_answer = call_gpt(question["question"], plain_context)
    vision_answer = call_gpt(question["question"], vision_context)
    display(Markdown(f"""**{question['question']}**
                     
_Plain Answer_
                     
- {plain_answer}

_Vision Answer_

- {vision_answer}

"""))

print("DONE!")

**How many Marvel books are there?**
                     
_Plain Answer_
                     
- The context does not directly provide information about the exact number of Marvel books. To determine this, you would need additional data or information specifically indicating the breakdown of books by publisher that includes Marvel.

_Vision Answer_

- The bar chart indicates that Marvel Comics has close to 800 publications.



**How many Marvel books are Star Wars related?**
                     
_Plain Answer_
                     
- The context does not provide specific information on how many Marvel books are specifically Star Wars related. However, it mentions values and comparisons of Star Wars books versus other books and comic books, but it's not explicitly detailed about the Star Wars books published by Marvel. You might need to refer to a detailed catalog or list specific to Star Wars comics published by Marvel to find this information.

_Vision Answer_

- The chart does not provide an exact number of Star Wars related books published by Marvel Comics. However, it indicates that the tallest bar, representing Marvel Comics, is significantly impacted by Star Wars. Additionally, another pie chart shows "Star Wars" as 67% of the total, while "All other books" make up 33%. To determine an exact number, you would need additional data on the total number of Marvel publications.



**What year did the collection start?**
                     
_Plain Answer_
                     
- The collection started on July 4th, 2021.

_Vision Answer_

- The collection started in the year 2015.



**What are the top brands?**
                     
_Plain Answer_
                     
- Based on the context provided, the top brands appear to be related to publishers and Star Wars. The "Top Publishers" chart shows high value indications that the top publishers include those related to Star Wars content, as referenced by the high values in the "Unread" and "Read" categories, as well as the notable mention of Star Wars. Thus, the context suggests that Star Wars is a significant brand among the top publishers mentioned.

_Vision Answer_

- The top brands, based on the context provided, are Marvel Comics, DC Comics, IDW Publishing, Dark Horse Comics, and BOOM! Studios. Marvel Comics is the leading publisher with the highest number of publications, driven significantly by their Star Wars comics.



DONE!


Oh my word... you can't beat that, can you? Using vision with these slides significantly improved my answers (except for that one wrong answer, did you see it? More below). Now apply this to your content: how many slides and pages have charts, images, and other non-textual context that is lost by simply scraping text?

Check out the answers to "What year did the collection start?" The plain answer is good because a slide had it explicitly stated. Vision read that slide fine, but it also read another slide with a chart, where it misplaced the "Started Collecting" annotation on the timeline. GPT had to choose between the two pieces of content: one slide said "person’s comic book collection journey, which began on July 4th, 2021" and the other said "around 2015, another text box is labeled 'Started Collecting.'" GPT isn't perfect, yet.
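
One idea for softening conflicts like this, which I have not tested here, is to hand GPT both the extracted text and the vision description for each slide, labeled by source, so it can weigh verbatim text over a chart reading. A minimal sketch, assuming slides shaped like `all_slides_data`; `build_combined_context` is a hypothetical helper name:

```python
def build_combined_context(slides):
    """Join each slide's extracted text and vision description, labeled by source."""
    blocks = []
    for slide in slides:
        extracted = "\n".join(slide.get("text", [])) or "(none)"
        described = slide.get("description", "") or "(none)"
        blocks.append(
            f"Slide {slide['slide_number']}\n"
            f"Extracted text (verbatim from the file):\n{extracted}\n"
            f"Vision description (model-generated, may misread charts):\n{described}"
        )
    # Same slide separator the notebook uses when building contexts
    return "\n*****\n".join(blocks)


sample = [
    {
        "slide_number": 3,
        "text": ["My Comic Books", "I started collecting July 4th, 2021."],
        "description": "A slide with four comic book covers.",
    }
]
context = build_combined_context(sample)
print(context)
```

The combined string could be passed to `call_gpt` in place of `plain_context` or `vision_context`. It roughly doubles the context size per slide, so it trades tokens for a chance at fewer contradictions.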

## Just use GPT 4o Vision!

It seems quite obvious to just say "Hey, let's use vision on our content!", and if you had a blank check, I'd say do it today!

In reality, most of us live within budgets, end-user performance expectations, speed-to-market requirements, blah blah blah. It'll come down to the business stakeholders to see the value in using vision, compared to the current value you're delivering, and decide the cost is worth it.



In [10]:
one_mil = 1000000
# costs as of Dec 2024 from https://openai.com/api/pricing/
cost_input = 2.5 
cost_output = 10
display(Markdown(f"""
For my limited example above, vision cost me {total_prompt} input tokens and {total_completion} output tokens. \
This cost me a total of ${round(float((total_completion/one_mil*cost_output)+(total_prompt/one_mil*cost_input)), 2)}."""))


For my limited example above, vision cost me 7974 input tokens and 2089 output tokens. This cost me a total of $0.04.

Cheap enough for a dozen slides, but when you're in the hundreds of thousands to millions of slides, this can add up quickly! There are a few things to check out to help curb the cost:

- Only send slides that need vision. Check for specific element types on the slides, and only send if there are images and charts. An all-text slide doesn't need vision.
- Explore setting the `detail` parameter to low, which should incur lower costs, but might impact quality of the output. Learn more on [OpenAI's site about Low or High fidelity image understanding](https://platform.openai.com/docs/guides/vision#low-or-high-fidelity-image-understanding). I'd guess most PowerPoint decks can get away with low.
- The size of the slide also impacts the cost. All of my examples here are 1024x578, which cost $0.001913 per high-res slide, $0.000213 per low-res ([from OpenAI's pricing page](https://openai.com/api/pricing/)). A larger image, like 2048 wide, will cost more. _OpenAI does some resizing and calculations to determine the total tokens for an image. See their Vision pricing calculator for details._
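
To get a feel for how image size and `detail` drive the token count, here is a rough estimator. It is a sketch under assumptions, not OpenAI's code: it follows the tile-based calculation as I understand it from the vision guide around Dec 2024 (fit within 2048x2048, shrink so the shortest side is at most 768, then 170 tokens per 512px tile plus a base of 85, with low detail costing a flat 85). The rounding to whole pixels is my own guess, `estimate_image_tokens` is a hypothetical name, and the constants may well change.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image input tokens (assumed formula, see the note above)."""
    if detail == "low":
        return 85  # assumed flat cost for low detail

    # Step 1: scale down (never up) to fit within a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: scale down (never up) so the shortest side is at most 768
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)  # whole-pixel rounding is an assumption

    # Step 3: count 512px tiles; each tile costs 170 tokens, plus a base of 85
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 578))         # 765
print(estimate_image_tokens(1024, 578, "low"))  # 85
print(estimate_image_tokens(2048, 1156))        # 1105
```

At $2.50 per million input tokens, 765 tokens lines up with the $0.001913 per high-res slide quoted above, and 85 tokens with the $0.000213 low-res figure. To try low detail in `get_image_desc`, the vision guide describes adding `"detail": "low"` next to `"url"` inside the `image_url` object.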

Given the results we see above, and the vastly improved answers, adding vision to your RAG should be seriously considered! Figure out your scale of cost, perform some tests, and showcase the value to your team!