# Using GPT Vision in RAG

We're going to explore using the vision capabilities of OpenAI's GPT-4o model to pull more insight out of complex slides, charts, and images, and so improve the chat experience in our RAG solutions.

See my blog for a cleaner version of this content [Using GPT vision in RAG](https://davidlozzi.com/2025/02/01/using-gpt-vision-in-rag/).

## The PowerPoint PPTX file

I'll be using a PowerPoint file I made of my comic book collection. I didn't want to use real content from work, and I had a little fun building this ;). The value of adding vision to your RAG pipeline will still be realized using this data, I promise.

You can review the [PowerPoint file, in all its glory, here](./comics.pptx).

## Getting text from the PPTX

But first, you'll need to create a `.env` file (see `.env.example`), add your OpenAI API key to it, and install some libraries.

_let's get through some of the boring stuff, installing libraries and stuff_

In [None]:
%pip install numpy pandas python-pptx openai python-dotenv

from dotenv import load_dotenv
load_dotenv()
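
If the key isn't picked up from `.env`, the failure only shows up later, when the first API call is made. A minimal sketch of an early check, assuming the key lives in `OPENAI_API_KEY` (the variable the `openai` client reads by default); `missing_env_vars` is a hypothetical helper name, not part of any library:

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or blank in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]

# Hypothetical usage right after load_dotenv(): warn early instead of failing mid-run.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing `env` explicitly makes the helper easy to try out without touching your real environment.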

Easy enough. Now our function to pull text from slides:

In [2]:
import os
import json
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

ppt_file = "comics.pptx"
json_file = "slides_output.json"

# Load the presentation
prs = Presentation(ppt_file)
all_slides_data = []

for idx, slide in enumerate(prs.slides, start=1):
    slide_data = {
        "slide_number": idx,
        "text": []
    }

    for shape in slide.shapes:
        # Extract text from text frames
        if shape.has_text_frame and shape.text.strip():
            slide_data["text"].append(shape.text.strip())

        # Extract text from charts
        if shape.shape_type == MSO_SHAPE_TYPE.CHART:
            chart = shape.chart
            if chart.has_title:
                slide_data["text"].append(f"Chart Title: {chart.chart_title.text_frame.text}")
            for s in chart.series:
                slide_data["text"].append(f"Series Name: {s.name}")
                if s.values:
                    slide_data["text"].append(f"Values: {list(s.values)}")

    all_slides_data.append(slide_data)

print(f"Pulled text from {len(all_slides_data)} slides")


Pulled text from 9 slides


The above has pulled text from all of the slides. We'll explore that data below.
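
The cell above defines `json_file` but never writes to it. If you want to persist the extraction between runs, a minimal sketch could look like the following. `save_slides` and `load_slides` are hypothetical helper names, and the sample record only mirrors the `slide_number`/`text` shape used above; it writes to a temporary folder so it doesn't clobber anything.

```python
import json
import os
import tempfile

def save_slides(slides, path):
    """Write the list of slide dicts to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(slides, f, ensure_ascii=False, indent=2)

def load_slides(path):
    """Read the list of slide dicts back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Round-trip a sample record shaped like the entries in all_slides_data.
sample = [{"slide_number": 1, "text": ["My Comic Book Collection", "As of Dec 14, 2024"]}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "slides_output.json")
    save_slides(sample, path)
    restored = load_slides(path)
print(restored == sample)
```

In the notebook you would call `save_slides(all_slides_data, json_file)` instead of using the sample.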

I manually exported the deck to JPEG into the `./comics/` folder. Finding a library to do that programmatically was getting annoying. In production, we use Aspose, on Java. There are some Python libraries available, but I decided not to use one, as that's outside the goal of this exercise. So for today, I've saved each slide of my deck as a `.jpeg` sized 1024x578.
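
Because the export is manual, it's easy to end up with a slide that has no matching image. A small sketch that checks for this before any images are sent off, assuming the `Slide<N>.jpeg` naming used in this walkthrough; both function names are hypothetical:

```python
from pathlib import Path

def slide_image_path(slide_number, folder="./comics"):
    """Build the expected path of an exported slide image."""
    return Path(folder) / f"Slide{slide_number}.jpeg"

def missing_slide_images(slide_numbers, folder="./comics"):
    """Return the slide numbers that have no exported image on disk."""
    return [n for n in slide_numbers if not slide_image_path(n, folder).is_file()]

# The deck in this walkthrough has 9 slides.
missing = missing_slide_images(range(1, 10))
if missing:
    print(f"No exported image for slides: {missing}")
```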

Let's quickly review the deck and confirm the text I pulled off it. And note how "useless" (from a plain text point of view) some of the content is!

In [21]:
from IPython.display import display, Markdown

for slide in all_slides_data:
    all_text = "\n".join(slide["text"])
    display(Markdown(f"""<img src='./comics/Slide{slide['slide_number']}.jpeg' style='width: 50%' />
                     
**Slide {slide['slide_number']}**

{all_text}"""))

<img src='./comics/Slide1.jpeg' style='width: 50%' />
                     
**Slide 1**

My Comic Book Collection
As of Dec 14, 2024

<img src='./comics/Slide2.jpeg' style='width: 50%' />
                     
**Slide 2**

Comic Books
Comic books started in 1938 with the introduction of Action Comics #1 marking the debut of Superman. Over the years, comic books have sold for over $3.5M, making comic books investment-grade collectibles.

<img src='./comics/Slide3.jpeg' style='width: 50%' />
                     
**Slide 3**

My Comic Books
I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book’s value ten-fold in the first day.

<img src='./comics/Slide4.jpeg' style='width: 50%' />
                     
**Slide 4**

My Comic Books
As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen.

<img src='./comics/Slide5.jpeg' style='width: 50%' />
                     
**Slide 5**

My Collection
The following are some stats from my collection

<img src='./comics/Slide6.jpeg' style='width: 50%' />
                     
**Slide 6**

Chart Title: 
Series Name: Total
Values: [835.0, 203.0, 116.0, 113.0, 80.0, 50.0, 21.0, 18.0, 17.0, 17.0, 13.0, 10.0, 8.0, 7.0, 7.0, 6.0, 5.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
by publisher
Contains more original stories (better in my opinion)
This is so high due to Star Wars

<img src='./comics/Slide7.jpeg' style='width: 50%' />
                     
**Slide 7**

Series Name: Count of Series Name
Values: [1.0, 10.0, 11.0, 9.0, 25.0, 15.0, 8.0, 2.0, 3.0, 2.0, 1.0, 6.0, 13.0, 14.0, 8.0, 14.0, 8.0, 2.0, 7.0, 6.0, 4.0, 3.0, 5.0, 1.0, 6.0, 2.0, 2.0, 2.0, 5.0, 12.0, 1.0, 10.0, 8.0, 27.0, 27.0, 11.0, 35.0, 21.0, 71.0, 369.0, 497.0, 298.0, 198.0]
release years
Started Collecting
That was an expensive year

<img src='./comics/Slide8.jpeg' style='width: 50%' />
                     
**Slide 8**

Chart Title: Top Publishers
Series Name: Read
Values: [550.0, 100.0, 57.0, 43.0, 41.0]
Series Name: Unread
Values: [247.0, 87.0, 51.0, 64.0, 34.0]
reading status
Chart Title: All Comic Books
Series Name: Total
Values: [964.0, 816.0]

<img src='./comics/Slide9.jpeg' style='width: 50%' />
                     
**Slide 9**

Series Name: Star Wars Books
Values: [554.0, 34.0, 67.0, 2.0]
Series Name: Other Books
Values: [243.0, 41.0, 40.0, 0.0]
Series Name: 
Values: [657.0, 324.0]
star wars comics
Star Wars
67%
All other books
33%
2

Thoughts? The extracted text captures little of what each slide actually says, right? For some slides it's mostly numbers! Useless. Enter GPT-4o vision! Let's see what GPT can do to help us out.

## Getting value from GPT 4o Vision

Let's use [OpenAI's documentation on using vision](https://platform.openai.com/docs/guides/vision).

In [11]:
from openai import OpenAI
import base64
client = OpenAI()

total_completion = 0
total_prompt = 0
sys_prompt = """Review the image and describe it in detail as describing it to someone who has impaired vision.
Pay particular attention to the relationships between the objects in the image, especially charts and graphs."""

def get_image_desc(image):
    global total_completion, total_prompt
    with open(image, "rb") as image_file:
        image_base64 = base64.b64encode(image_file.read()).decode('utf-8')
    
    response = client.chat.completions.create(
        model="gpt-4o", # changed to 4o from 4o-mini, as mini has limitations in understanding relationships
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": sys_prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_base64}",
                        },
                    },
                ],
            }
        ],
        max_tokens=4000, # using a lot more to give the machine room to give me a good answer 
    )
    total_completion += response.usage.completion_tokens
    total_prompt += response.usage.prompt_tokens
    return response.choices[0].message.content


What did I do:

- Created a single function so I can call it for each slide.
- I have a basic "system prompt" (not really a system prompt, since it's sent as user text, but I'm stubborn). Note the wording: I have found that asking it to describe the image to someone with impaired vision really brings out a rich description.
- I collect all usage to share total costs of my tests below.
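
If you'd rather send the instructions as a real system message, the message list could be built roughly as below. This is only a sketch of the message shape for the Chat Completions API; `build_vision_messages` is a hypothetical helper, and I haven't compared whether it changes the quality of the descriptions.

```python
import base64

def build_vision_messages(instructions, image_bytes, mime_type="image/jpeg"):
    """Return chat messages with instructions as a system message and the image as user content."""
    image_base64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {"role": "system", "content": instructions},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{mime_type};base64,{image_base64}"},
                },
            ],
        },
    ]

# Placeholder bytes stand in for a real JPEG read from disk.
messages = build_vision_messages("Describe the image in detail.", b"\xff\xd8\xff")
print(messages[0]["role"], messages[1]["content"][0]["type"])
```

The returned list would be passed as `messages=` in place of the single user message in `get_image_desc`.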

Now let's run all the images!

In [12]:
for slide in all_slides_data:
    image = f"./comics/Slide{slide['slide_number']}.jpeg"
    slide["description"] = get_image_desc(image)
    print(f"Visioned slide {slide['slide_number']}")

print(f"DONE!\nTotal completion tokens used: {total_completion}\nTotal prompt tokens used: {total_prompt}")

Visioned slide 1
Visioned slide 2
Visioned slide 3
Visioned slide 4
Visioned slide 5
Visioned slide 6
Visioned slide 7
Visioned slide 8
Visioned slide 9
DONE!
Total completion tokens used: 2692
Total prompt tokens used: 8127


_not bad token usage, we'll talk costs a little later_
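
As a rough preview of that cost conversation, the token counts above can be turned into dollars with simple arithmetic. The rates below are assumptions for illustration only (per million tokens); check OpenAI's current pricing page before relying on them.

```python
def estimate_cost(prompt_tokens, completion_tokens, input_per_million, output_per_million):
    """Estimate cost in dollars from token counts and per-million-token rates."""
    return (prompt_tokens * input_per_million + completion_tokens * output_per_million) / 1_000_000

# ASSUMED placeholder rates in USD per 1M tokens, not taken from this notebook.
ASSUMED_INPUT_RATE = 2.50
ASSUMED_OUTPUT_RATE = 10.00

# Token totals reported by the run above.
cost = estimate_cost(8127, 2692, ASSUMED_INPUT_RATE, ASSUMED_OUTPUT_RATE)
print(f"Estimated cost: ${cost:.4f}")
```

With these assumed rates, describing all nine slides comes to roughly five cents.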

Drum roll please, let's see what the plain text extraction looks like against the vision description!

<img src="../images/you%20need%20to%20see%20this.jpg" alt="Comic panel: You need to see this" style="width: 50%" />

In [13]:
for slide in all_slides_data:
    image = f"./comics/Slide{slide['slide_number']}.jpeg"
    all_text = "\n".join(slide["text"])
    display(Markdown(f"""<table style='width:100%' border='0'>
        <tr><td colspan="2" style="text-align:center"><img src="{image}" style="width:50%" /><br/>Slide {slide['slide_number']}</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                {all_text}
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                {slide['description']}
            </td>
        </tr>
    </table>
    """))
    

<table style='width:100%' border='0'>
        <tr><td colspan="2" style="text-align:center"><img src="./comics/Slide1.jpeg" style="width:50%" /><br/>Slide 1</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Comic Book Collection
As of Dec 14, 2024
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a collage of comic book covers with a central text box overlay. Each comic book cover is distinct, displaying various characters and art styles. Here's a detailed description:

1. **Central Text Box**: A large, dark gray rectangle is centered in the image, with the text "My Comic Book Collection" in bold white lettering. Below this, it says "As of Dec 14, 2024" in smaller white text.

2. **Comic Covers**:
   - **Top Row**:
     - *Left*: A cover of "The Amazing Spider-Man" featuring Spider-Man in his iconic red and blue suit, swinging on a web.
     - *Middle Left*: "Batman 89" shows Batman standing in front of a dark, urban backdrop with a full moon.
     - *Middle Right*: "Nebula" depicts a futuristic character with mechanical elements and vibrant colors.
     - *Right*: "Eve" shows a girl with eyes closed, inside a circular window, hinting at a science fiction theme.

   - **Bottom Row**:
     - *Left*: "Scarlet Witch" has a striking image of a woman in red, with intense green eyes.
     - *Middle Left*: A comic with a woman in a pirate-like outfit, holding a sword.
     - *Middle Right*: A vibrant and colorful comic cover with a group of characters in action poses.
     - *Right*: A retro-style comic featuring "The Hulk" with a dynamic and bold illustration.

3. **Arrangement**: The comic covers are arranged in two rows of four, framing the central text box. Each cover has vivid, engaging artwork, highlighting the variety and diversity in the comic collection.

Overall, the image is a visually engaging showcase of the individual's diverse comic book collection as of December 14, 2024.
            </td>
        </tr>
    </table>
    

<table style='width:100%' border='0'>
        <tr><td colspan="2" style="text-align:center"><img src="./comics/Slide2.jpeg" style="width:50%" /><br/>Slide 2</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Comic Books
Comic books started in 1938 with the introduction of Action Comics #1 marking the debut of Superman. Over the years, comic books have sold for over $3.5M, making comic books investment-grade collectibles.
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a slide titled "Comic Books." The title appears at the top in large, bold black text. Below the title is a paragraph in smaller black text. It reads: "Comic books started in 1938 with the introduction of Action Comics #1 marking the debut of Superman. Over the years, comic books have sold for over $3.5M, making comic books investment-grade collectibles."

To the right of the text is an image of the cover of "Action Comics #1" from June 1938. It shows a colorful illustration of Superman lifting a green car. There are people in the foreground reacting with surprise and fear. The cover has a bold red and yellow background.

Below this, there are additional single comic book covers displayed in a row from left to right:

1. **The Warlord**: The cover features bold, vivid colors with an image of a muscular warrior wielding a sword against a dinosaur. The title is set against a dark background in an elaborate stylized font.

2. **The Thing**: This cover features a group of heroes against a space-themed backdrop, with "THE THING" in a large red font at the top. The characters are dressed in futuristic outfits.

3. **Star Wars**: The cover prominently displays the characters from the Star Wars universe, including Darth Vader, in front of starry space. The title "STAR WARS" is large, yellow, and bold.

The overall layout presents a historical context alongside visual examples of comic book covers, highlighting their cultural and collectible importance.
            </td>
        </tr>
    </table>
    

<table style='width:100%' border='0'>
        <tr><td colspan="2" style="text-align:center"><img src="./comics/Slide3.jpeg" style="width:50%" /><br/>Slide 3</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Comic Books
I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book’s value ten-fold in the first day.
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a collage with a text portion on the left side and four comic book covers. Here's a detailed description:

**Text:**
- At the top left, there's a title that reads "My Comic Books."
- Below the title, a paragraph states: "I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book’s value ten-fold in the first day."

**Comic Book Covers:**
1. **Top Right Cover:**
   - Titled "We Don't Kill Spiders."
   - Features an illustrated person holding a glowing red object with several dark shadows around. The background is blue with bright pink text.

2. **Bottom Row - Left Cover:**
   - Titled "BRZRKR."
   - Shows a monochrome figure in dark armor, holding a weapon, standing against a backdrop of large block letters in blue.

3. **Bottom Row - Middle Cover:**
   - Titled "Stray Dogs: Dog Days."
   - Has an image of a bloody envelope and a paw print, implying a mysterious or suspenseful theme, with mostly dark red and brown tones.

4. **Bottom Row - Right Cover:**
   - Titled "Darth Vader."
   - Depicts the iconic Star Wars character, Darth Vader, in full armor holding a lightsaber against a space-themed background with warm brown and gold colors.
  
The comic book covers are arranged in two rows: one cover on the top right by itself, and three covers along the bottom. The text is adjacent to the bottom row. Each cover has its unique style and theme.
            </td>
        </tr>
    </table>
    

<table style='width:100%' border='0'>
        <tr><td colspan="2" style="text-align:center"><img src="./comics/Slide4.jpeg" style="width:50%" /><br/>Slide 4</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Comic Books
As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen.
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a slide titled "My Comic Books," featuring a collection of Star Wars-themed comic book covers. 

At the top, there's a heading in bold: "My Comic Books." Below this, there's a paragraph of text: "As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen."

Beneath the text, four comic book covers are displayed. From left to right:

1. **First Cover (Leftmost):** Features a character in a white outfit holding a red lightsaber. The background is dark, with the words "Darth Vader" prominently displayed at the bottom.

2. **Second Cover:** Shows Darth Vader holding a red lightsaber, standing against a smoky background. "Star Wars Darth Vader" is written at the top in bold.

3. **Third Cover:** This is a classic-style comic cover with multiple characters including Darth Vader in the center. The title "Star Wars" is written in bright yellow at the top.

4. **Fourth Cover (Rightmost):** Features the character Chewbacca, a large, furry being with a serious expression. It is titled "Han Solo & Chewbacca." This cover is positioned separately to the right of the other three covers.

The covers are aligned neatly, with the first three creating a row, and the fourth slightly offset to the right.
            </td>
        </tr>
    </table>
    

<table style='width:100%' border='0'>
        <tr><td colspan="2" style="text-align:center"><img src="./comics/Slide5.jpeg" style="width:50%" /><br/>Slide 5</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                My Collection
The following are some stats from my collection
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image has a simple, minimalistic design. It consists of a plain white background with text aligned to the left side. The main heading, "My Collection," is placed prominently in bold, black font. Below this heading, there is a subheading in smaller, gray font that reads, "The following are some stats from my collection." There are no charts, graphs, or additional objects present in the image—just these two lines of text centered towards the top left portion.
            </td>
        </tr>
    </table>
    

<table style='width:100%' border='0'>
        <tr><td colspan="2" style="text-align:center"><img src="./comics/Slide6.jpeg" style="width:50%" /><br/>Slide 6</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Chart Title: 
Series Name: Total
Values: [835.0, 203.0, 116.0, 113.0, 80.0, 50.0, 21.0, 18.0, 17.0, 17.0, 13.0, 10.0, 8.0, 7.0, 7.0, 6.0, 5.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
by publisher
Contains more original stories (better in my opinion)
This is so high due to Star Wars
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a bar chart with the title "by publisher" at the top left. The chart represents the total number of items attributed to various publishers along the x-axis, while the y-axis shows a numerical scale from 0 to 900.

1. **X-Axis**: Represents different publishers. The list includes Marvel Comics at the far left, followed by DC Comics, Dark Horse Comics, and several others. Farther right, smaller publishers are also listed.

2. **Y-Axis**: Ranges from 0 to 900, indicating the total count of items published.

3. **Bars**: Each publisher has a vertical bar corresponding to the number of items. Notably:
   - **Marvel Comics**: Has the tallest bar, reaching above 800, with a note stating, "This is so high due to Star Wars."
   - **DC Comics**: The second-highest bar, slightly above 200.
   - **Dark Horse Comics**: Slightly above 100.
   - The other publishers have significantly shorter bars, many below 100.

4. **Annotations**:
   - Next to the Marvel Comics bar, there is a text pointing at the height of the bar, explaining its height due to Star Wars.
   - Above the middle publishers’ shorter bars, there is a note stating, "Contains more original stories (better in my opinion)."

5. **Overall Trend**: There is a steep drop-off in height after Marvel Comics and DC Comics, with minor fluctuations among the remaining publishers.

The visual presentation emphasizes the dominance of Marvel Comics in terms of volume, with DC Comics being a distant second, highlighting the contribution of Star Wars to Marvel’s count.
            </td>
        </tr>
    </table>
    

<table style='width:100%' border='0'>
        <tr><td colspan="2" style="text-align:center"><img src="./comics/Slide7.jpeg" style="width:50%" /><br/>Slide 7</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Series Name: Count of Series Name
Values: [1.0, 10.0, 11.0, 9.0, 25.0, 15.0, 8.0, 2.0, 3.0, 2.0, 1.0, 6.0, 13.0, 14.0, 8.0, 14.0, 8.0, 2.0, 7.0, 6.0, 4.0, 3.0, 5.0, 1.0, 6.0, 2.0, 2.0, 2.0, 5.0, 12.0, 1.0, 10.0, 8.0, 27.0, 27.0, 11.0, 35.0, 21.0, 71.0, 369.0, 497.0, 298.0, 198.0]
release years
Started Collecting
That was an expensive year
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image is a line graph titled "release years." The x-axis represents years ranging from 1969 to 2024, marked in increments of a few years, while the y-axis represents an unspecified numerical value ranging from 0 to 600.

The data line starts low in 1969, with values remaining under 100 through to about 2014. There are minor fluctuations throughout these years. From around 2014, the line shows a steep upward trend, peaking sharply at around the year 2020, where it hits just below 500. After this peak, the line quickly descends, showing a sharp decline.

Two labels are present on the graph. One is at the peak, indicating "That was an expensive year," suggesting that 2020 was notable. Another is near the start of the upward trend, marked "Started Collecting," which points around the year 2014.

These labels provide context for the significant increase in numbers during this period, indicating critical points of interest in the timeline of data collection or events.
            </td>
        </tr>
    </table>
    

<table style='width:100%' border='0'>
        <tr><td colspan="2" style="text-align:center"><img src="./comics/Slide8.jpeg" style="width:50%" /><br/>Slide 8</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Chart Title: Top Publishers
Series Name: Read
Values: [550.0, 100.0, 57.0, 43.0, 41.0]
Series Name: Unread
Values: [247.0, 87.0, 51.0, 64.0, 34.0]
reading status
Chart Title: All Comic Books
Series Name: Total
Values: [964.0, 816.0]
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image consists of two main parts: a pie chart on the left and a bar chart on the right. The overall topic is "reading status" of comic books.

**Pie Chart (Left Side):**

- Title: "All Comic Books."
- The pie chart is divided into two sections.
- The larger section is labeled "Read" and is represented in dark blue, making up 54% of the chart.
- The smaller section, labeled "Unread," is in orange and accounts for 46% of the chart.
- The chart visually shows that more comic books have been read than are unread.

**Bar Chart (Right Side):**

- Title: "Top Publishers."
- There are five publishers along the horizontal axis, each with its own bar rising upward:
  1. Marvel Comics
  2. DC Comics
  3. BOOM! Studios
  4. Dark Horse Comics
  5. IDW Publishing
- The height of the bars corresponds to the number of comic books, with specific amounts divided into read and unread.
- The bars have two sections, similar in color to the pie chart: dark blue for read and orange for unread.
- Marvel Comics' bar is the tallest, showing the highest number of comics, with a significant portion unread.
- The other publishers have smaller bars, with varying ratios of read to unread comics.

This combination of charts provides an overview of comic book reading status, both generally and by specific publishers.
            </td>
        </tr>
    </table>
    

<table style='width:100%' border='0'>
        <tr><td colspan="2" style="text-align:center"><img src="./comics/Slide9.jpeg" style="width:50%" /><br/>Slide 9</td></tr>
        <tr>
            <td style='width:48%; vertical-align:top'>
                <strong>Plain Text</strong><br/><br/>
                Series Name: Star Wars Books
Values: [554.0, 34.0, 67.0, 2.0]
Series Name: Other Books
Values: [243.0, 41.0, 40.0, 0.0]
Series Name: 
Values: [657.0, 324.0]
star wars comics
Star Wars
67%
All other books
33%
2
            </td>
            <td style='width:48%; vertical-align:top'>
                <strong>Vision Description:</strong><br/><br/>
                The image contains two main elements: a pie chart on the left and a bar graph on the right, with logos of comic publishers below the bar graph.

### Pie Chart (Left Side)
- **Title:** The chart is titled "Star Wars Comics."
- **Composition:** The pie chart is divided into two segments.
  - **Larger Segment:** This takes up 67% of the circle, colored in blue, and labeled as "Star Wars 67%."
  - **Smaller Segment:** This occupies 33% of the circle, colored in orange, and labeled as "All other books 33%."

### Bar Graph (Right Side)
- **Axes:** The vertical axis represents quantity, with numerical labels from 0 to 900 at intervals of 100. The horizontal axis lacks numerical labels but is associated with publisher logos.
- **Bars:**
  - The largest bar corresponds to the "Marvel Comics" logo. It is composed of a large blue lower section and a smaller orange upper section.
  - The next two bars are significantly smaller and correspond to "IDW Publishing" and "Dark Horse Comics" logos. Each has a small orange segment on top of a blue base.
  - There is also the "Viz" logo on the far right with a barely noticeable or absent bar.

### Publisher Logos (Below the Graph)
- The logos of four different comic publishers are displayed underneath the bar graph, aligned with the bars they represent:
  1. **Marvel Comics:** Identified with the largest bar.
  2. **IDW Publishing:** With a smaller bar.
  3. **Dark Horse Comics:** Also with a smaller bar.
  4. **Viz:** Associated with a negligible bar presence.

The image presents a visual comparison and distribution of "Star Wars" comics versus other comic books, showcasing dominance by Star Wars comics in both charts and identifying the publishers involved in this context.
            </td>
        </tr>
    </table>
    

<img src="../images/yeee%20hoooo.PNG" style="width: 50%" alt="Comic panel of Han Solo yelling YEEE-HOOO!"/>

Those are some really helpful descriptions! With them, we can give our LLMs much richer context to help answer our users' questions!
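
In a real pipeline you would likely index the extracted text and the vision description together, so the exact numbers and the visual relationships land in the same chunk. A minimal sketch of that idea; `build_chunk` is a hypothetical helper, and combining the two is my assumption here, since the comparison in this walkthrough deliberately keeps them apart:

```python
def build_chunk(slide):
    """Combine a slide's extracted text and vision description into one indexable string."""
    parts = [f"Slide {slide['slide_number']}"]
    text = "\n".join(slide.get("text", []))
    if text:
        parts.append(f"Extracted text:\n{text}")
    description = slide.get("description", "")
    if description:
        parts.append(f"Vision description:\n{description}")
    return "\n\n".join(parts)

# Sample record shaped like the entries in all_slides_data.
example = {
    "slide_number": 5,
    "text": ["My Collection", "The following are some stats from my collection"],
    "description": "A plain white slide with a heading and a subheading.",
}
print(build_chunk(example))
```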

## Does vision actually help?

Obviously, the vision descriptions should work better than the plain text, but since I have some time on my hands, let's quickly find out 😉.

Again, I'm using [OpenAI's basic scripts for generating text](https://platform.openai.com/docs/guides/text-generation), polished up a little for my use case:

In [8]:
from openai import OpenAI
client = OpenAI()

def call_gpt(text, context):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "developer", "content": "Help the user by answering their question from the provided context."},
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion:\n{text}"
            }
        ]
    )
    return completion.choices[0].message.content


The above is a basic function that calls GPT. I concatenate the context (the text from the relevant slides) with my question in a single message.

In [9]:
questions = [
    {
        "question": "How many Marvel books are there?",
        "keywords": ["Marvel"]
    },
    {
        "question": "How many Marvel books are Star Wars related?",
        "keywords": ["Marvel", "Star Wars"]
    },
    {
        "question": "What year did the collection start?",
        "keywords": ["start", "collect", "journey"]
    },
    {
        "question": "What are the top brands?",
        "keywords": ["brands", "publishers"]
    }
]

Above are my questions, and I added search keywords to find the content. Any decent RAG system should transform user questions into semantic phrases, and possibly vectors, to return some really robust results. I'm not creating a search tool for this exercise, so I'm cheating a little bit here to find related content.
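
For comparison, the vector-based approach would rank slides by similarity between a question embedding and each slide's embedding, rather than by keyword match. Here is a toy sketch of just the ranking step, using hand-made vectors in place of real embeddings; the function names and the vectors are illustrative assumptions, and in practice the vectors would come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings", one per slide.
slide_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
question_vec = [1.0, 0.1, 0.0]
print([index for index, _ in top_k(question_vec, slide_vecs)])
```

The indices returned would then select which slide texts or descriptions go into the context.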

The following loops through my questions and calls GPT twice for each: once with the plain text extraction and once with the vision descriptions.

In [14]:
for question in questions:
    plain_context = "\n*****\n".join([
        " ".join(slide["text"])
        for slide in all_slides_data
        if any(k.lower() in slide["description"].lower() or k.lower() in ' '.join(slide['text']).lower() for k in question["keywords"])
    ])
    vision_context = "\n*****\n".join([
        slide["description"]
        for slide in all_slides_data
        if any(k.lower() in slide["description"].lower() or k.lower() in ' '.join(slide['text']).lower() for k in question["keywords"])
    ])
    plain_answer = call_gpt(question["question"], plain_context)
    vision_answer = call_gpt(question["question"], vision_context)
    display(Markdown(f"""**{question['question']}**
                     
_Plain Answer_
                     
- {plain_answer}

_Vision Answer_

- {vision_answer}

"""))

print("DONE!")

**How many Marvel books are there?**
                     
_Plain Answer_
                     
- The context provided does not contain specific information about the total number of Marvel books. The data mentions Star Wars books and other books, but does not explicitly reference Marvel books or provide their count. Therefore, based on the data available, it is not possible to determine the number of Marvel books.

_Vision Answer_

- In the context provided, the number of items attributed to Marvel Comics is represented in the bar chart titled "by publisher," where the height of Marvel Comics’ bar reaches above 800. Therefore, there are more than 800 Marvel books.



**How many Marvel books are Star Wars related?**
                     
_Plain Answer_
                     
- Based on the context provided, there isn't explicit information about how many Marvel books are specifically Star Wars related. The values given in the "Star Wars Books" category are [554.0, 34.0, 67.0, 2.0], but there is no direct reference associating these numbers with Marvel. To determine how many Marvel books are Star Wars related, you would need additional information linking the publishers to these values.

_Vision Answer_

- The pie chart indicates that 67% of the total "Star Wars Comics" belong to Marvel Comics. Depending on the precise context and total volume of material available, this percentage suggests a substantial portion of Marvel's catalog is Star Wars related, especially considering Marvel's large bar in the related bar graph. However, specific numeric values are not provided within the context, so we cannot determine the exact number without additional data.



**What year did the collection start?**
                     
_Plain Answer_
                     
- The collection started on July 4th, 2021.

_Vision Answer_

- The collection started around the year 2014.



**What are the top brands?**
                     
_Plain Answer_
                     
- Based on the context provided, it seems like the top brands or series are likely related to the highest recorded values. "Star Wars" is mentioned several times and associated with high values in the data, suggesting it might be one of the top brands. However, specific "top brands" are not directly listed in the charts, so it's inferred from the context that "Star Wars" is a prominent brand. If this does not fully answer your question, please provide more context or clarify.

_Vision Answer_

- Based on the bar charts provided in the contexts, the top brands in terms of comic book volume are:

1. **Marvel Comics:** Consistently has the tallest bar, indicating it is the leading publisher across different charts.
2. **DC Comics:** Typically appears as the second highest after Marvel Comics.
3. **Dark Horse Comics:** Also features prominently, though with smaller volumes compared to Marvel and DC.
4. **IDW Publishing:** Present with a notable quantity of comics, though less than Marvel and DC.
5. **BOOM! Studios:** Listed as a top publisher in one of the charts.

These publishers are repeatedly highlighted as the major contributors in the comic book industry within the given contexts.



DONE!


<img src="../images/gaaaarooooo.PNG" style="width: 20%;" alt="Comic panel of Chewie yelling" />

You can't beat that, can you? Using vision with these slides significantly improved my answers (except for that one wrong answer, did you see it? More below). Now apply this to your content: how many slides and pages have charts, images, and other non-text context that is lost by simply scraping text?

<img src="../images/you tell me.jpg" style="width: 20%" alt="Comic panel of Mando saying you tell me" />

Check out the answers to "How many Marvel books are Star Wars related?" Not great, right? The challenge here is that the original vision description didn't have this detail in it. So this approach is not perfect, but on average it certainly beats plain text! To really improve things we should explore multimodal RAG ([see other notebook](./multimodal.ipynb)).

## Just use GPT 4o Vision!

It seems quite obvious to just say "Hey, let's use vision on our content!", and if you had a blank check, I'd say do it today!

In reality, most of us live within budgets, have end-user performance expectations, speed-to-market requirements, blah blah blah. It'll come down to the business stakeholders to see the value in using vision, compare it to the value you're delivering today, and decide whether the cost is worth it.

How much did this cost me?

In [15]:
one_mil = 1000000
# prices in USD per 1M tokens, as of Dec 2024, from https://openai.com/api/pricing/
cost_input = 2.5
cost_output = 10
# total = input tokens at the input rate + output tokens at the output rate
total_cost = (total_prompt / one_mil * cost_input) + (total_completion / one_mil * cost_output)
display(Markdown(f"""
For my limited example above, vision cost me {total_prompt} input tokens and {total_completion} output tokens. \
This cost me a total of ${round(total_cost, 2)}."""))


For my limited example above, vision cost me 8127 input tokens and 2692 output tokens. This cost me a total of $0.05.

Cheap enough for a dozen slides, but when you're in the hundreds of thousands to millions of slides, this can add up quickly! There are a few things to check out to help curb the cost:

- Only send slides that need vision. Check for specific element types on the slides, and only send those that have images or charts. An all-text slide doesn't need vision.
- Explore setting the `detail` parameter to `low`, which should incur lower costs but might reduce the quality of the output. Learn more on [OpenAI's site about Low or High fidelity image understanding](https://platform.openai.com/docs/guides/vision#low-or-high-fidelity-image-understanding). I'd guess most PowerPoint slides can get away with low.
- The size of the slide also impacts the cost. All of my examples here are 1024x578, which cost $0.001913 per high-res slide, $0.000213 per low-res ([from OpenAI's pricing page](https://openai.com/api/pricing/)). A larger image, like 2048 wide, will cost more. _OpenAI does some resizing and calculations to determine the total tokens for an image. See their Vision pricing calculator for details._
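
To make the last two bullets a bit more concrete, here's a minimal sketch. `estimate_image_tokens` and `build_vision_content` are names I made up for illustration. The tile math (85 base tokens plus 170 per 512px tile, after scaling the image down to fit 2048x2048 and then to a 768px shortest side) is my reading of OpenAI's vision guide for GPT-4o, so treat it as an assumption and double-check against their calculator before budgeting on it.

```python
import base64
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough input-token estimate for one image (assumed GPT-4o tile formula)."""
    if detail == "low":
        return 85  # low detail is a flat cost, whatever the image size

    # Step 1: scale down (never up) so the image fits inside 2048x2048
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 2: scale down (never up) so the shortest side is at most 768px
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 3: count the 512px tiles needed to cover the image
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


def build_vision_content(png_bytes: bytes, prompt: str, detail: str = "low") -> list:
    """Build the `content` list for a user message, with the image inlined as base64."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return [
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{encoded}",
                "detail": detail,  # "low" or "high"
            },
        },
    ]


if __name__ == "__main__":
    for detail in ("high", "low"):
        tokens = estimate_image_tokens(1024, 578, detail)
        print(f"1024x578 at {detail} detail: about {tokens} tokens")
```

For a 1024x578 slide this comes out to 765 tokens at high detail and 85 at low, which at $2.50 per million input tokens lines up with the per-slide prices quoted above. The list returned by `build_vision_content` would go in as the `content` of a user message in a chat completions call.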

Given the results we see above, and the vastly improved answers, adding vision to your RAG should be seriously considered! Figure out your scale of cost, perform some tests, and showcase the value to your team!
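
To get a feel for your own scale, here's a rough sketch that projects image-input cost from the per-slide prices quoted above. `project_vision_cost` and the 40% share of slides with visuals are hypothetical, picked just for illustration. It also ignores the prompt text and output tokens, so a real bill will be higher.

```python
# Per-slide image prices quoted above for 1024x578 slides (USD)
HIGH_RES_PER_SLIDE = 0.001913
LOW_RES_PER_SLIDE = 0.000213


def project_vision_cost(n_slides: int, share_with_visuals: float, cost_per_slide: float) -> float:
    """Projected image-input cost in USD when only slides with visuals go to vision."""
    if not 0.0 <= share_with_visuals <= 1.0:
        raise ValueError("share_with_visuals must be between 0 and 1")
    return n_slides * share_with_visuals * cost_per_slide


if __name__ == "__main__":
    n_slides = 1_000_000  # hypothetical corpus size
    share = 0.4           # hypothetical: 40% of slides have charts or images
    for label, per_slide in (("high", HIGH_RES_PER_SLIDE), ("low", LOW_RES_PER_SLIDE)):
        cost = project_vision_cost(n_slides, share, per_slide)
        print(f"{label}-res: ${cost:,.2f}")
```

Swap in your own slide count and a measured share of slides with visuals, and you have a starting number to bring to your stakeholders.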

In [18]:
# exporting the slide data to a json file for the next page
with open(json_file, "w") as f:
    json.dump(all_slides_data, f)