# MultiModal Document RAG with ColQwen2 and Llama 3.2 90B Vision
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/MultiModal_RAG_with_Nvidia_Investor_Slide_Deck.ipynb)

## MultiModal RAG Workflow

[ColPali](https://arxiv.org/abs/2407.01449) is a multimodal retrieval model that retrieves documents directly from images of their pages.

By directly encoding image patches, it eliminates the need for optical character recognition (OCR) or image captioning to extract text from PDFs.

We will use `byaldi`, a library from [AnswerAI](https://www.answer.ai/), that makes it easier to work with an upgraded version of ColPali, called ColQwen2, to embed and retrieve images of our PDF documents.

Retrieved pages will then be passed to the Llama-3.2 90B Vision model, served via a [Together AI](https://www.together.ai/) inference endpoint, which answers questions about them.

For a fuller explanation of how ColPali and the new Llama 3.2 Vision models work, check out the [blog post](https://www.together.ai/blog/multimodal-document-rag-with-llama-3-2-vision-and-colqwen2) connected to this notebook.

### Install relevant libraries

In [None]:
!pip install byaldi together pdf2image

Collecting byaldi
  Downloading Byaldi-0.0.7-py3-none-any.whl.metadata (20 kB)
Collecting together
  Downloading together-1.3.5-py3-none-any.whl.metadata (11 kB)
Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Collecting colpali-engine<0.4.0,>=0.3.4 (from byaldi)
  Downloading colpali_engine-0.3.4-py3-none-any.whl.metadata (21 kB)
Collecting mteb==1.6.35 (from byaldi)
  Downloading mteb-1.6.35-py3-none-any.whl.metadata (23 kB)
Collecting ninja (from byaldi)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Collecting datasets>=2.2.0 (from mteb==1.6.35->byaldi)
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting jsonlines (from mteb==1.6.35->byaldi)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting pytrec-eval-terrier>=0.5.6 (from mteb==1.6.35->byaldi)
  Downloading pytrec_eval_terrier-0.5.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

In [None]:
import os

In [None]:
!sudo apt-get install -y poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 49 not upgraded.
Need to get 186 kB of archives.
After this operation, 696 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.5 [186 kB]
Fetched 186 kB in 1s (186 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package poppler-utils.
(Reading database ... 123629 

In [None]:
# Paste in your Together AI API key, or load it from an environment variable
api_key = os.environ.get("TOGETHER_API_KEY")
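
If the variable is unset, `os.environ.get` quietly returns `None` and the problem only shows up later, when the first request is made. The following is a minimal sketch of a stricter loader, assuming the key lives in an environment variable named `TOGETHER_API_KEY`. The helper name `load_api_key` is hypothetical and not part of the `together` SDK.

```python
import os


def load_api_key(env_var: str = "TOGETHER_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise a clear error if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "export it before running the notebook."
        )
    return key
```

Failing early like this keeps a missing key from being confused with a model or network error further down the notebook.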

### Initialize the ColQwen2 Model

In [None]:
import os
from pathlib import Path
from byaldi import RAGMultiModalModel

# Initialize RAGMultiModalModel
model = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v0.1")

Verbosity is set to 1 (active). Pass verbose=0 to make quieter.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


adapter_config.json:   0%|          | 0.00/728 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/56.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.85G [00:00<?, ?B/s]

`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/220 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/74.0M [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/568 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/4.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/392 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

### The document we will be retrieving from is a 15-page Microsoft Pitch Deck, which contains infographics, charts, and tables.

### Let's create our index that will store the embeddings for the page images.



In [None]:
# Use ColQwen2 to index and store the presentation
index_name = "msft_index"
model.index(input_path=Path("/content/Microsoft-Pitch-Deck_watermark.pdf"),
    index_name=index_name,
    store_collection_with_index=True, # Stores base64 images along with the vectors
    overwrite=True
)

Added page 1 of document 0 to index.
Added page 2 of document 0 to index.
Added page 3 of document 0 to index.
Added page 4 of document 0 to index.
Added page 5 of document 0 to index.
Added page 6 of document 0 to index.
Added page 7 of document 0 to index.
Added page 8 of document 0 to index.
Added page 9 of document 0 to index.
Added page 10 of document 0 to index.
Added page 11 of document 0 to index.
Added page 12 of document 0 to index.
Added page 13 of document 0 to index.
Added page 14 of document 0 to index.
Added page 15 of document 0 to index.
Index exported to .byaldi/nvidia_index
Index exported to .byaldi/nvidia_index


{0: '/content/Microsoft-Pitch-Deck_watermark.pdf'}

### This concludes the PDF indexing phase - everything below happens at query time.


### Let's query our indexed document.

The important thing to note here is that the query asks for details that are found on page 11 of the PDF!

In [None]:
# Let's query our index and retrieve the pages whose content has the highest similarity to the query

# The CPC figures this query asks about are on page 11 - for context!
query = "What is the average cpc of contoso fully complete and repurposed wheat-free?"
results = model.search(query, k=5)

print(f"Search results for '{query}':")
for result in results:
    print(f"Doc ID: {result.doc_id}, Page: {result.page_num}, Score: {result.score}")

print("Test completed successfully!")

Search results for 'What is the average cpc of contoso fully complete and repurposed wheat-free?':
Doc ID: 0, Page: 11, Score: 21.0
Doc ID: 0, Page: 12, Score: 20.625
Doc ID: 0, Page: 10, Score: 12.1875
Doc ID: 0, Page: 7, Score: 11.375
Doc ID: 0, Page: 14, Score: 10.8125
Test completed successfully!


### Notice that ColQwen2 is able to retrieve the correct page with the highest similarity!

### How does this work? What happens under the hood between the different pages and the query tokens?

The interaction between page image patch representations and query text token representations, which is used to score each page of the document, is what enables this strong retrieval performance.

Typically, each page image is resized and cut into patches of 16x16 pixels. Each patch is embedded into a 128-dimensional vector, and these vectors are stored and used in the late interaction (MaxSim) operation between image patches and query text tokens. ColPali is a multi-vector approach because it produces multiple vectors for each image or query: one vector per patch or token, instead of a single vector for the whole input.
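
The MaxSim scoring described above can be sketched in a few lines. This is a toy illustration in pure Python, not byaldi's or ColPali's actual implementation: the 2-dimensional vectors and the names `maxsim_score`, `page_a`, and `page_b` are made up for the example, whereas real embeddings are 128-dimensional and scored in batches on the GPU.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, patch_vecs) -> float:
    """Late-interaction score: each query token keeps its best-matching patch, then the maxima are summed."""
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)


# Toy embeddings: two query tokens, and two "pages" with a few patch vectors each.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.3, 0.3], [0.4, 0.1]]

scores = {
    "page_a": maxsim_score(query_tokens, page_a),
    "page_b": maxsim_score(query_tokens, page_b),
}
best_page = max(scores, key=scores.get)  # the page with the highest MaxSim score
```

Because every query token is matched against every patch independently, a page can score highly when different regions of it answer different parts of the query, which is harder to capture with one pooled vector per page.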



The retrieval step takes about 171 ms.

In [None]:
%%timeit
model.search(query, k=5)

171 ms ± 645 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
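
`%%timeit` is only available inside IPython and Jupyter. Outside a notebook, a rough equivalent can be written with `time.perf_counter`. This is a minimal sketch: the helper name `time_call` is hypothetical, and the workload is a stand-in so the block runs on its own.

```python
import statistics
import time
from typing import Callable, List


def time_call(fn: Callable[[], object], runs: int = 7) -> List[float]:
    """Call `fn` `runs` times and return each wall-clock duration in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


# Stand-in workload; in the notebook this would be `lambda: model.search(query, k=5)`.
timings = time_call(lambda: sum(range(10_000)))
summary = (
    f"{statistics.mean(timings):.3f} ms ± {statistics.stdev(timings):.3f} ms "
    f"over {len(timings)} runs"
)
print(summary)
```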


## Let's now pass the retrieved page to the Llama-3.2 90B Vision Model.

This model will read the question: `"What is the average cpc of contoso fully complete and repurposed wheat-free?"`

It will then take in the retrieved page and produce an answer!

You can pass in a URL to the image of the retrieved page or a base64 encoded version of the image.
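
Both options end up in the same `image_url` field of the message. The sketch below builds that field from either form and guesses the MIME type of a base64 image from its leading bytes (PNG and JPEG only). The helper names `to_data_url` and `image_part` are hypothetical, and the dictionary shape simply mirrors the request used in this notebook rather than a documented schema.

```python
import base64


def to_data_url(b64_image: str) -> str:
    """Wrap a base64-encoded image in a data URL, guessing the MIME type from its magic bytes."""
    header = base64.b64decode(b64_image[:16])  # 16 base64 characters decode to 12 bytes
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        mime = "image/png"
    elif header.startswith(b"\xff\xd8\xff"):
        mime = "image/jpeg"
    else:
        raise ValueError("Unrecognized image format; expected PNG or JPEG")
    return f"data:{mime};base64,{b64_image}"


def image_part(url_or_b64: str) -> dict:
    """Build the image content part from either an http(s) URL or a base64 string."""
    if url_or_b64.startswith(("http://", "https://")):
        url = url_or_b64
    else:
        url = to_data_url(url_or_b64)
    return {"type": "image_url", "image_url": {"url": url}}
```

With this in place, the user message content could be written as `[{"type": "text", "text": query}, image_part(returned_page)]`.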

In [None]:
# Since we stored the collection along with the index, we have the base64 images of all PDF pages as well!
result = model.search(query, k=1)

In [None]:
result

[{'doc_id': 0, 'page_num': 11, 'score': 21.0, 'metadata': {}, 'base64': 'iVBORw0KGgoAAAANSUhEUgAACSMAAAZ1CAIAAABT8cRbAAEAAElEQVR4nOzdWXMj... (base64 output truncated)

In [None]:
returned_page = result[0].base64

## We'll use a [Together AI](https://www.together.ai/) inference endpoint to access the Llama-3.2 90B Vision model

In [None]:
import os
from together import Together

client = Together(api_key=api_key)  # reuse the key loaded earlier instead of hard-coding it

response = client.chat.completions.create(
  model="meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": query}, #query
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/png;base64,{returned_page}", # retrieved page image (the stored base64 begins with the PNG signature "iVBORw0KGgo")
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0].message.content)


The average CPC of Contoso Fully Complete is 2.12 and the average CPC of Repurposed Wheat-Free is 2.05. Therefore, the average CPC of both products is (2.12 + 2.05) / 2 = 2.085.


Here we can see that the combination of ColQwen2 as an image retriever and Llama-3.2 90B Vision is a powerful duo for multimodal RAG applications, especially with PDFs.

Not only was ColQwen2 able to retrieve the page containing the answer, but Llama-3.2 90B Vision was also able to find exactly where on the page the answer was, ignoring all the irrelevant details!
