# MultiModal Document RAG with ColQwen2 and Llama 3.2 90B Vision
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/MultiModal_RAG_with_Nvidia_Investor_Slide_Deck.ipynb)

## Hardware Requirements
*To ensure the notebook runs faster, please change the runtime type to T4 GPU:
`Runtime` -> `Change runtime type` -> `T4 GPU`*

*You can also run this notebook on a 16GB M1 MacBook Pro.*

## Introduction

In this notebook we will see how to use multimodal RAG to chat with Nvidia's investor slide deck from October 2023. The [slide deck](https://s201.q4cdn.com/141608511/files/doc_presentations/2023/Oct/01/ndr_presentation_oct_2023_final.pdf) is 39 pages long and combines text, visuals, tables, charts and annotations. The document structure and templates vary from page to page, which makes it quite difficult to RAG over using traditional methods.

We will be using a new multimodal approach!

<img src="https://github.com/togethercomputer/together-cookbook/blob/main/images/Nvidia_collage.png?raw=1" width="500">

## MultiModal RAG Workflow

[ColPali](https://arxiv.org/abs/2407.01449) is a new multimodal retrieval system that seamlessly enables image retrieval.

By directly encoding image patches, it eliminates the need for optical character recognition (OCR) or image captioning to extract text from PDFs.

We will use `byaldi`, a library from [AnswerAI](https://www.answer.ai/), that makes it easier to work with an upgraded version of ColPali, called ColQwen2, to embed and retrieve images of our PDF documents.

Retrieved pages will then be passed into the Llama-3.2 90B Vision model served via a [Together AI](https://www.together.ai/) inference endpoint for it to answer questions.

To get a better explanation of how ColPali and the new Llama 3.2 Vision models work, check out the [blog post](https://www.together.ai/blog/multimodal-document-rag-with-llama-3-2-vision-and-colqwen2) connected to this notebook.

<img src="https://github.com/togethercomputer/together-cookbook/blob/main/images/mmrag_only.png?raw=1" width="600">

### Install relevant libraries

In [4]:
!pip install byaldi together pdf2image

Collecting byaldi
  Downloading Byaldi-0.0.7-py3-none-any.whl.metadata (20 kB)
Collecting together
  Downloading together-1.5.4-py3-none-any.whl.metadata (14 kB)
Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Collecting colpali-engine<0.4.0,>=0.3.4 (from byaldi)
  Downloading colpali_engine-0.3.8-py3-none-any.whl.metadata (27 kB)
Collecting mteb==1.6.35 (from byaldi)
  Downloading mteb-1.6.35-py3-none-any.whl.metadata (23 kB)
Collecting ninja (from byaldi)
  Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB)
Collecting datasets>=2.2.0 (from mteb==1.6.35->byaldi)
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting jsonlines (from mteb==1.6.35->byaldi)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting pytrec-eval-terrier>=0.5.6 (from mteb==1.6.35->byaldi)
  Downloading pytrec_eval_terrier-0.5.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

In [7]:
!sudo apt-get install -y poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.6).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.


In [None]:
import os

# Paste in your Together AI API Key or load it from the environment
api_key = os.environ.get("TOGETHER_API_KEY")
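`os.environ.get` returns `None` when the variable is unset, and the failure would only surface much later when the Together client is called. A minimal sketch of a fail-fast loader is shown below; the helper name `load_api_key` is hypothetical and not part of the `together` library.

```python
import os


def load_api_key(env_var: str = "TOGETHER_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    `load_api_key` is a hypothetical helper for this notebook, not a
    function provided by the `together` package.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set. "
            "Set it before running the cells that call the Together API."
        )
    return key
```

With this in place, `api_key = load_api_key()` could stand in for the plain `os.environ.get` call above.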

### Initialize the ColPali Model

In [9]:
import os
from pathlib import Path
from byaldi import RAGMultiModalModel

# Initialize RAGMultiModalModel
model = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v0.1")

Verbosity is set to 1 (active). Pass verbose=0 to make quieter.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### The document we will be retrieving from in this run is ONGC's 2023-24 annual report, downloaded in the cell below. The original walkthrough used the 39 page Nvidia investor presentation from 2023: [Investor Presentation October 2023](https://s201.q4cdn.com/141608511/files/doc_presentations/2023/Oct/01/ndr_presentation_oct_2023_final.pdf)

In [2]:
# 1. Download with SSL verification disabled
!wget --no-check-certificate https://ongcindia.com/documents/77751/2660534/ar2023-24.pdf

# 2. Rename the ACTUAL downloaded file
!mv ar2023-24.pdf ongc_annual_report.pdf

# 3. Verify
!ls -lh ongc_annual_report.pdf

--2025-04-01 05:46:18--  https://ongcindia.com/documents/77751/2660534/ar2023-24.pdf
Resolving ongcindia.com (ongcindia.com)... 210.212.78.205
Connecting to ongcindia.com (ongcindia.com)|210.212.78.205|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 
Length: 20060120 (19M) [application/pdf]
Saving to: ‘ar2023-24.pdf’


2025-04-01 05:46:30 (1.81 MB/s) - ‘ar2023-24.pdf’ saved [20060120/20060120]

-rw-r--r-- 1 root root 20M Aug  7  2024 ongc_annual_report.pdf


### Let's create our index that will store the embeddings for the page images.

Caution: The cell below takes ~5 mins to index the whole PDF!

In [10]:
# Use ColQwen2 to index and store the PDF (the index name is kept from the Nvidia walkthrough)
index_name = "nvidia_index"
model.index(input_path=Path("/content/ongc_annual_report.pdf"),
    index_name=index_name,
    store_collection_with_index=True, # Stores base64 images along with the vectors
    overwrite=True
)

Added page 1 of document 0 to index.
Added page 2 of document 0 to index.
Added page 3 of document 0 to index.
Added page 4 of document 0 to index.
Added page 5 of document 0 to index.
Added page 6 of document 0 to index.
Added page 7 of document 0 to index.
Added page 8 of document 0 to index.
Added page 9 of document 0 to index.
Added page 10 of document 0 to index.
Added page 11 of document 0 to index.
Added page 12 of document 0 to index.
Added page 13 of document 0 to index.
Added page 14 of document 0 to index.
Added page 15 of document 0 to index.
Added page 16 of document 0 to index.
Added page 17 of document 0 to index.
Added page 18 of document 0 to index.
Added page 19 of document 0 to index.
Added page 20 of document 0 to index.
Added page 21 of document 0 to index.
Added page 22 of document 0 to index.
Added page 23 of document 0 to index.
Added page 24 of document 0 to index.
Added page 25 of document 0 to index.
Added page 26 of document 0 to index.
Added page 27 of docu

{0: '/content/ongc_annual_report.pdf'}

### This concludes the indexing of the PDF phase - everything below happens at query time.

<img src="https://github.com/togethercomputer/together-cookbook/blob/main/images/colpali_arch.png?raw=1" width="700">

### Let's query our indexed document.

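The important thing to note here is which page holds the answer. In the original Nvidia walkthrough the queried details were on page 25 of the deck; for the R&D query in this run, page 48 of the indexed report is ranked highest.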

In [32]:
# Let's query our index and retrieve the pages whose content has the highest similarity to the query

# This query asks about R&D investment in the indexed annual report
query = "How has the company’s R&D investment contributed to innovation and competitive advantage?"
results = model.search(query, k=5)

print(f"Search results for '{query}':")
for result in results:
    print(f"Doc ID: {result.doc_id}, Page: {result.page_num}, Score: {result.score}")

print("Test completed successfully!")

Search results for 'How has the company’s R&D investment contributed to innovation and competitive advantage?':
Doc ID: 0, Page: 48, Score: 18.625
Doc ID: 0, Page: 46, Score: 17.875
Doc ID: 0, Page: 47, Score: 17.375
Doc ID: 0, Page: 117, Score: 16.75
Doc ID: 0, Page: 34, Score: 16.75
Test completed successfully!


### Notice that ColQwen2 is able to retrieve the correct page with the highest similarity!

<img src="https://github.com/togethercomputer/together-cookbook/blob/main/images/page_25.png?raw=1" width="700">

### How does this work? What happens under the hood between the different pages and the query tokens?

The interaction between the page image patch representations and the query text token representations, which is used to score each page of the document, is what enables this great retrieval performance.

Typically each image is resized and cut into patches of 16x16 pixels. These patches are then embedded into 128-dimensional vectors, which are stored and used to perform the MaxSim late interaction operation between the image patches and the text tokens. ColPali is a multi-vector approach because it produces multiple vectors for each image or query: one vector per token, instead of a single vector for all tokens.

<img src="https://github.com/togethercomputer/together-cookbook/blob/main/images/ColPaliMaxSim-1.png?raw=1" width="700">
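The MaxSim scoring described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not byaldi's or ColPali's actual implementation: the function names (`maxsim_score`, `rank_pages`) are hypothetical, and the vectors are 2-dimensional stand-ins for the 128-dimensional embeddings.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score for one page.

    For every query token vector, take its best (maximum) similarity
    against all patch vectors of the page, then sum over query tokens.
    """
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: Sequence[Vector], pages: Dict[int, Sequence[Vector]]
) -> List[Tuple[int, float]]:
    """Return (page_number, score) pairs sorted from best to worst."""
    scored = [(num, maxsim_score(query_vecs, vecs)) for num, vecs in pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: two query tokens and two pages with two patches each (hypothetical values).
toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_pages = {
    1: [[0.9, 0.1], [0.2, 0.1]],
    2: [[0.5, 0.5], [0.1, 0.9]],
}

ranking = rank_pages(toy_query, toy_pages)
```

In this toy case page 2 ranks first, because each query token finds a reasonably good matching patch on it, while page 1 matches only the first token well. Real systems compute the same quantity with batched matrix operations rather than Python loops.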

The retrieval step takes about 240 ms in the run below.

In [16]:
%%timeit
model.search(query, k=5)

241 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Let's now pass the retrieved page to the Llama-3.2 90B Vision model.

This model will read the question stored in `query`: `"How has the company’s R&D investment contributed to innovation and competitive advantage?"`

It will then take in the retrieved page and produce an answer!

You can pass in a URL to the image of the retrieved page or a base64 encoded version of the image.
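When passing a base64 image, the data URL should name the right MIME type. The stored page printed further down starts with `iVBORw0KGgo`, which is the base64 form of the PNG file signature. The sketch below infers the type from the first bytes instead of hard-coding it; `sniff_image_mime` and `to_data_url` are hypothetical helper names, and only PNG and JPEG are handled.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def sniff_image_mime(b64_image: str) -> str:
    """Guess the MIME type of a base64-encoded image from its magic bytes."""
    # 24 base64 characters decode to 18 bytes, enough for both signatures.
    header = base64.b64decode(b64_image[:24])
    if header.startswith(PNG_SIGNATURE):
        return "image/png"
    if header.startswith(JPEG_SIGNATURE):
        return "image/jpeg"
    raise ValueError("Unrecognised image format (only PNG and JPEG are handled)")


def to_data_url(b64_image: str) -> str:
    """Wrap a base64-encoded image in a data URL with the matching MIME type."""
    return f"data:{sniff_image_mime(b64_image)};base64,{b64_image}"
```

The resulting string could be used as the `url` value of an `image_url` content part in the chat request.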

In [17]:
# Since we stored the collection along with the index, we have the base64 images of all PDF pages as well!
model.search(query, k=1)

[{'doc_id': 0, 'page_num': 68, 'score': 13.625, 'metadata': {}, 'base64': 'iVBORw0KGgoAAAANSUhEUgAABqQAAAh+CAIAAAD+KKl7AAEAAElEQVR4nOzdd2AURcPH8dlLb5CE0Am999577yIdld4FAQEBQeyACCIKgohIl470DqEX6TV0EgiBkEB6z2WfPxbX9e5yuRQI3PP9vD68d3Ozu3O7myu/m52RZFkWAAAAAAAAAN5+uqxuAAAAAAAAAIDMQdgHAAAAAAAAWAnCPgAAAAAAAMBKEPYBAAAAAAAAVoKwDwAAAAAAALAShH0AAAAAAACAlSDsAwAAAAAAAKwEYR8AAAAAAABgJQj7AAAAAAAAACtB2AcAAAAAAABYCcI+AAAAAAAAwEoQ9gEAAAAAAABWgrAPAAAAAAAAsBKEfQAAAAAAAICVIOwDAAAAAAAArARhHwAAAAAAAGAlCPsAAAAAAAAAK0HYBwAAAAAAAFgJwj4AAAAAAADAShD2AQAAAAAAAFaCsA8AAAAAAACwEoR9AAAAAAAAgJUg7AMAAAAAAACsBGEfAAAAAAAAYCUI+wAAAAAAAAArQdgHAAAAAAAAWAnCPgAAAAAAAMBKEPYBAAAAAAAAVoKwDwAAAAAAALAShH0AAAAAAACAlSDsAwAAAAAAAKwEYR8AAAAAAABgJQj7AAAAAAAAACtB2AcAAAAAAABYCcI+AAAAAAAAwEoQ9gEAAAAAAABWgrAPAAAAAAAAsBKEfQAAAAAAAICVIOwDAAAAAAAArARhHwAAAAAAAGAlCPsAAAAAAAAAK0HYBwAAAAAAAFgJwj4AAAAAAADAShD2AQAAAAAAAFaCsA8AAAAAAACwEoR9AAAAAAAAgJUg7AMAAAAAAACsBGEfAAAAAAAAYCUI+wAAAAAAAAArQdgHAAAAAAAAWAnCPgAAAAAAAMBKEPYBAAAAAAAAVoKwDwAAAAAAALAShH0AA

In [18]:
returned_page = model.search(query, k=1)[0].base64

## We'll use a [Together AI](https://www.together.ai/) inference endpoint to access the Llama-3.2 90B Vision model

In [33]:
import os
from together import Together

client = Together(api_key = api_key)

response = client.chat.completions.create(
  model="meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": query}, #query
        {
          "type": "image_url",
          "image_url": {
            # The stored base64 starts with the PNG signature, so label it as PNG
            "url": f"data:image/png;base64,{returned_page}", #retrieved page image
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0].message.content)

The report highlights the company's commitment to research and development (R&D) with an expenditure of ₹228.11 Crore during FY'24, building on its continuous exploration programme. This investment is crucial in driving innovation and achieving a competitive advantage in the energy market. ONGC's focus on R&D has led to notable discoveries and expansions in its reserves, contributing significantly to its operational performance and financial results. As the energy sector evolves under the influence of international economic factors, ongoing technological advancements are essential for sustainability and growth. By consolidating its efforts in R&D, ONGC positions itself to meet future challenges and opportunistically capitalise on new prospects within the industry.

**Key Outcomes of R&D Efforts:**

*   Discovery of nine new hydrocarbon blocks.
*   Enhancement of reserve replacement ratio.
*   Operation of 5 discoveries in OALP blocks and 6 discoveries in nomination blocks.

These outco

Here we can see that the combination of ColQwen2 as an image retriever and Llama-3.2 90B Vision is a powerful duo for multimodal RAG applications, especially with PDFs.

Not only was ColQwen2 able to retrieve the correct page that had the right answer on it, but Llama-3.2 90B Vision was also able to find exactly where on the page this answer was, ignoring all the irrelevant details!

Voila!🎉🎉

Learn more about Llama 3.2 Vision in the [docs](https://docs.together.ai/docs/vision-overview) here!